96% Correct Next Token Prediction, with No DNN, no Training, auto-distilled model

Over the last 12 months, I’ve built a model to predict the next token and to suggest synonyms or related queries to a user prompt, with 100% correct predictions on the training set in one shot, without training or deep neural networks (DNNs). The same model is now integrated in some of the most recent LLM architectures, albeit with costly training via DNNs. My version does not need DNNs or training.

The purpose of this article is to validate my alternative to deep neural networks in the context of LLMs. The new model is a substitute for standard DNNs, with increased explainability and higher accuracy. It is designed for corporate corpuses. The end goal is to provide better accuracy at a much lower cost, while giving you full control over all the components.

An interesting feature is auto-distillation, whereby the model self-identifies weights that, over time, do not contribute to 99.9% of your user-generated or synthetic prompts, and drops them. The decision is based on prompts from a large, specialized user base. The gain is most spectacular in open-weight LLMs applied to specialized contexts, whether based on DNNs or not. In this case, the original, full model is called the “teacher”, while the distilled version is called the “student”. The student performs just as well despite its much smaller size, but only on your corpus.
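The auto-distillation idea above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the paper: the function name `auto_distill`, the `min_fraction` parameter, and the representation of the model as a dictionary of weights are assumptions made for this example. It also assumes one possible reading of the 99.9% rule, namely that a weight is dropped when it is used by fewer than 0.1% of the prompts.

```python
from collections import Counter

def auto_distill(weights, prompt_usage, min_fraction=0.001):
    """Return (student, growth) from a teacher model and a prompt log.

    weights:       dict mapping a weight id to its value (the "teacher").
    prompt_usage:  iterable of sets, one per prompt, each holding the ids
                   of the weights that prompt actually used.
    min_fraction:  hypothetical threshold; a weight is kept only if at
                   least this fraction of prompts used it.
    growth:        number of unique weights seen after each prompt, the
                   kind of curve one could monitor over time.
    """
    counts = Counter()   # number of prompts that used each weight
    seen = set()         # all weights used so far
    growth = []
    n_prompts = 0
    for used in prompt_usage:
        n_prompts += 1
        counts.update(set(used))
        seen.update(used)
        growth.append(len(seen))
    if n_prompts == 0:
        return dict(weights), growth
    student = {
        wid: value
        for wid, value in weights.items()
        if counts[wid] / n_prompts >= min_fraction
    }
    return student, growth

# Toy example with a 4-weight teacher and 4 prompts.
teacher = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
usage = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
student, growth = auto_distill(teacher, usage, min_fraction=0.5)
```

In this toy run the threshold is set to 50% only to make the effect visible on four prompts. When the `growth` curve flattens, few new weights are being touched by incoming prompts, which is a natural moment to consider distilling.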

Overview

Twelve months ago, my alternative to DNNs for LLM architecture may have been perceived as an isolated, one-off model untested by others. With Chinese researchers now actively working on the exact same model, it is becoming a topic of significant interest. They call it “RBF networks”, while I used the term “kernel method” in the past. Both terms are correct and widely known in contexts other than LLMs. The difference reflects the research field you come from, but both point to the exact same equations. However, my approach is unique in that it does not use DNNs to compute the weights. Instead, I obtain them in one shot without training, with 100% correct predictions on the training set, without bad overfitting, in high dimensions.
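To make the "one shot, exact on the training set" property concrete, here is a textbook exact RBF interpolation in Python with NumPy. It is only a sketch under stated assumptions: the Gaussian kernel, the `gamma` bandwidth, the random vectors standing in for embeddings, and the helper names `rbf_kernel`, `fit_rbf` and `predict_rbf` are illustrative choices, not the formulation used in the paper. The weights come from a single linear solve, with no gradient descent.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def fit_rbf(X, Y, gamma=1.0):
    """Solve K W = Y once; K is positive definite for distinct rows of X."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K, Y)

def predict_rbf(X_train, W, X_new, gamma=1.0):
    """Score each class (here: each candidate next token) for new inputs."""
    return rbf_kernel(X_new, X_train, gamma) @ W

# Hypothetical data: 20 context vectors of dimension 8, 5 possible tokens.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
labels = rng.integers(0, 5, size=20)
Y = np.eye(5)[labels]            # one-hot encoding of the next token

W = fit_rbf(X, Y)                # one shot, no iterative training
pred = predict_rbf(X, W, X).argmax(axis=1)
```

Because the kernel matrix is inverted exactly, the predictions on the training points reproduce the training labels. Accuracy on unseen data depends on the bandwidth and on how the contexts are embedded, which this sketch does not address.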

I introduce auto-distillation and pre-tabulated values (similar to KV cache) as mechanisms to speed up computations. I also discuss why it works with 10,000 fewer embeddings. In the original book where my method was first published, I also discuss distillation-resistant invisible watermarking techniques to protect your model against unauthorized uses. Last but not least, I feature a case study (NVIDIA corpus) with 96% correct prediction rate for next token, and discuss replicability, explainability and deterministic AI attached to the model, with the ability to allow for controlled randomness in the response if desired. Due to perfect predictions on the training set, I explain how to perform three-way training to fine-tune the hyperparameters. The 96% correct prediction rate outside the training set is far above the 30 to 55% achieved by standard transformer-based models, while avoiding costly training and without increased compute time post-training. This high performance is due to specialization to the specific corpus, by contrast to generic predictors.
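Since a model that fits the training set exactly always scores 100% there, training accuracy cannot guide hyperparameter choices. One common workaround, sketched below, is a fit / validation / test split. Reading "three-way" this way is an assumption, and the paper may define it differently. The synthetic data, the candidate `gamma` values, and the helper names are also illustrative only.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def accuracy(X_fit, y_fit, X_eval, y_eval, gamma, n_classes):
    """Fit an exact RBF interpolant on (X_fit, y_fit), score it on X_eval."""
    Y = np.eye(n_classes)[y_fit]
    W = np.linalg.solve(rbf_kernel(X_fit, X_fit, gamma), Y)
    pred = (rbf_kernel(X_eval, X_fit, gamma) @ W).argmax(axis=1)
    return float((pred == y_eval).mean())

# Synthetic stand-in for a corpus: 3 classes, 300 points in 5 dimensions.
rng = np.random.default_rng(1)
centers = rng.normal(scale=4.0, size=(3, 5))
y = rng.integers(0, 3, size=300)
X = centers[y] + rng.normal(size=(300, 5))

# Three disjoint subsets: fit the weights, pick gamma, report the result.
idx = rng.permutation(300)
fit, val, test = idx[:180], idx[180:240], idx[240:]

gammas = [0.1, 0.5, 2.0]
val_scores = {g: accuracy(X[fit], y[fit], X[val], y[val], g, 3) for g in gammas}
best_gamma = max(val_scores, key=val_scores.get)
test_score = accuracy(X[fit], y[fit], X[test], y[test], best_gamma, 3)
```

The validation subset selects the bandwidth, and the test subset, never used for fitting or selection, gives the out-of-sample rate to report.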

The next steps include working with a larger corpus, and performing tasks beyond predicting the next token, suggesting related queries, or finding synonyms. The methodology is also well suited for image classification and problems with numerical data (time series and so on).

Total number of unique weights used over time (Y-axis) vs cumulative number of prompts (X-axis), monitored for auto-distillation

Download the free paper

The 9-page technical paper explains the models, with a link to the full description in my previous book. It also describes the benefits, computational aspects, and the NVIDIA case study with illustrations. Below is the table of contents:

Building an LLM with alternatives to deep neural networks
- Connection between RBF networks and standard LLMs
- Combining RBF networks with standard LLMs
Fast, high-accuracy RBF network without training
- Model description and formulation
- Benign overfitting, other features and benefits
- From billions to fewer than a million parameters
Case study: 96% correct prediction rate
- NVIDIA case study
- Next token prediction: computational complexity
- Earlier DNN-free model with exact predictions on training set

Get the full technical paper, here. For a PowerPoint summary, see here.

To not miss future announcements, sign up to my newsletter, here.

About the Author

Vincent Granville is a pioneering GenAI scientist and co-founder at BondingAI.io, the LLM 2.0 platform for hallucination-free, secure, in-house, lightning-fast Enterprise AI at scale with zero weight and no GPU. He is also an author (Elsevier, Wiley), publisher, and successful entrepreneur with a multi-million-dollar exit. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. He completed a post-doc in computational statistics at the University of Cambridge.
