
While most AI companies keep building LLMs with more weights and tokens (one trillion is now a standard number), I went in the opposite direction. Of course, zero weight means that there is no neural network behind the scenes. More specifically, it means that there is no lengthy black-box process to find the “best” weights optimizing a loss function. In reality, weights are still present, very much like in a neural network, but they are explicitly specified. Indeed, I use parametric weights, governed by a few explainable parameters. The optimization focuses on these few parameters, which reduces overfitting. The approach has similarities to regularization methods, where weights are highly constrained to better control the outcome and the interpretation of the results.
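To make the idea concrete, here is a minimal sketch of what parametric weights could look like. The weighting formula below (token weights driven by corpus frequency through two exponents a and b) is a hypothetical example, not the exact formula from my article; the point is that only a and b are optimized, never the individual per-token weights.

```python
import numpy as np

def token_weight(freq, doc_count, a=0.5, b=1.0):
    # Hypothetical parametric weight: an explicit function of corpus
    # frequency, governed by two explainable parameters (a, b).
    # Only a and b are tuned; per-token weights are never free variables.
    return (freq ** a) / (1.0 + doc_count) ** b

# Toy corpus statistics: token -> (frequency, document count)
corpus_stats = {"machine": (120, 80), "learning": (150, 95), "pageviews": (10, 7)}

def score_title(tokens, a, b):
    # Score a title as the sum of its token weights.
    return sum(token_weight(*corpus_stats.get(t, (1, 1)), a, b) for t in tokens)

print(score_title(["machine", "learning"], a=0.5, b=1.0))
```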
I implemented similar techniques in the past with xLLM, see here. In this new application, however, the core formula is straightforward and stated prominently, to help you make the connection with deep neural networks, see the analogy, and pinpoint exactly where and how the two approaches diverge.
Speeding up training and increasing output quality
Breaking down your data into homogeneous chunks, for instance top categories, leads to better results and increased speed. If the computational complexity is O(n²), breaking the input data into 20 blocks reduces it to 20 × O(n²/400), that is, 20 times faster. And it produces more relevant output!
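Here is a back-of-the-envelope sketch of that speedup, assuming some quadratic pairwise step (the category names and counts below are made up for illustration):

```python
from collections import defaultdict

def pairwise_cost(items):
    # Stand-in for any O(n^2) step, e.g. pairwise distances within a block.
    return len(items) * len(items)

articles = [("data-science", f"title {i}") for i in range(1000)] + \
           [("machine-learning", f"title {i}") for i in range(1000)]

# Global pass: one block of n articles -> O(n^2)
full_cost = pairwise_cost(articles)

# Chunked pass: k blocks of n/k articles -> k * O((n/k)^2) = O(n^2 / k)
blocks = defaultdict(list)
for category, title in articles:
    blocks[category].append(title)
chunked_cost = sum(pairwise_cost(b) for b in blocks.values())

print(full_cost, chunked_cost, full_cost / chunked_cost)  # ~2x faster with 2 blocks
```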
No neural network and no gradient descent mean that training is done in much less than one second, as opposed to hours or days. The cost reduction is dramatic, and it also lets you run extensive tests to improve the method in no time, further facilitated by the fact that all parameters and components are explainable, and many are decoupled.
One of my new clients, a Fortune 100 financial institution, asked me how long it takes to train the model. I did not know what to answer: telling the truth (less than one second) would sound either unbelievable or suggestive of poor results. I decided to share the app with him instead, so that he could judge for himself.
Finally, a rule of thumb to improve quality is to use a loss function identical to the model evaluation metric. This is rarely done in neural networks because any good evaluation metric is very hard to update efficiently each time a weight is modified, which happens trillions of times in standard LLMs. Here the loss and evaluation functions are identical. In my next article, I will show how to implement an evaluation metric suitable as a loss function for neural networks.
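For intuition, here is a hedged sketch of what using the evaluation metric directly as the loss could look like in the zero-weight setting: with only a couple of explainable parameters, a plain grid search on the metric itself replaces gradient descent. The metric (mean absolute error on log pageviews) and the two-parameter predictor are illustrative assumptions, not the ones used in this project.

```python
import numpy as np
from itertools import product

def evaluation_metric(y_true, y_pred):
    # Illustrative metric: mean absolute error on log-counts.
    # The same function serves as the loss during "training".
    return np.mean(np.abs(np.log1p(y_true) - np.log1p(y_pred)))

def predict(X, a, b):
    # Hypothetical predictor where a, b are the only free parameters.
    return np.expm1(a * X + b)

X = np.array([1.2, 0.7, 2.1, 0.3])   # e.g. title scores
y = np.array([500, 120, 2200, 60])   # observed pageviews

# Grid search: loss == evaluation metric, no gradients involved.
best = min(
    ((evaluation_metric(y, predict(X, a, b)), a, b)
     for a, b in product(np.linspace(0, 5, 51), np.linspace(0, 5, 51))),
    key=lambda t: t[0],
)
print("best (loss, a, b):", best)
```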
Case study
The dataset used here consists of all 4,000 articles published on Data Science Central between 2016 and 2020, prior to the acquisition. The goal is to predict pageview counts for new articles based on the title, before publication. The app is used to recommend good keywords to potential authors, as well as keywords to avoid. Finally, the clustering algorithm groups high-performing articles, to further understand what works well and what to avoid.
The internal back-end tables (tokens and so on) are stored mostly as nested hashes, a format similar to JSON. It is very efficient for dealing with highly sparse graphs. Indeed, the Python clustering libraries showed their limitations, due to relying on gigantic distance matrices. I had to use my own implementation, which runs faster, needs much less memory, and handles sparsity very well.
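A minimal sketch of the nested-hash idea follows; the key names and the distance function are assumptions for illustration. Only nonzero entries are ever stored, so comparing two articles touches only the tokens they actually share, and no full n × n distance matrix is ever materialized.

```python
from collections import defaultdict

# Nested hash: article_id -> {token: weight}, storing only nonzero entries.
embeddings = defaultdict(dict)
embeddings["article_1"] = {"machine": 0.8, "learning": 0.6}
embeddings["article_2"] = {"machine": 0.5, "vision": 0.9}

def sparse_distance(e1, e2):
    # Toy dissimilarity based only on shared keys; the full distance matrix
    # is never built, which is what keeps memory usage low.
    shared = set(e1) & set(e2)
    overlap = sum(min(e1[t], e2[t]) for t in shared)
    total = sum(e1.values()) + sum(e2.values())
    return 1.0 - 2.0 * overlap / total if total else 1.0

print(sparse_distance(embeddings["article_1"], embeddings["article_2"]))
```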
As in all LLMs, building small, specialized yet large enough embeddings and token lists works a lot better and faster than working with gigantic, generic lists. These huge lists consist mostly of noise: unused tokens that are never fetched to answer a prompt (and when they are, you end up with poor quality).
I use not just simple tokens, but multi-tokens, including contextual tokens consisting of multiple single tokens that are not adjacent in the text. There are mechanisms in place to keep the number of such tokens under control.
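Here is a hedged sketch of how contextual multi-tokens could be generated from a title: pairs of single tokens that need not be adjacent, with a simple cap (here, a maximum word gap) keeping their number under control. The gap rule is an illustrative assumption, not the exact mechanism used in the project.

```python
from itertools import combinations

def multi_tokens(title, max_gap=3):
    # Single tokens plus contextual multi-tokens: pairs of tokens that are
    # not necessarily adjacent, kept under control by a maximum word gap.
    words = title.lower().split()
    tokens = set(words)
    for (i, w1), (j, w2) in combinations(enumerate(words), 2):
        if j - i <= max_gap:
            tokens.add(f"{w1}^{w2}")   # contextual multi-token
    return tokens

print(multi_tokens("Machine Learning vs Deep Learning"))
```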
The dendrogram in the featured picture is based on the multi-token classification. Articles are then grouped based on these clusters: see the example above, showcasing one of the smallest groups, linked to a cluster consisting of 3 multi-tokens, including a contextual one (‘Machine^vs’).
Full documentation, source code, and results
The full documentation, with links to the code and related material, is in the same project textbook on GitHub, here. Check out project 8.3, added to the textbook on May 3.
Note that the project textbook contains a lot more than the material discussed here. The reason for sharing the whole book rather than just the relevant chapters is the cross-references to other projects. Also, clickable links and other navigation features in the PDF version work well only in the full document, in Chrome and other viewers, after download.
To not miss future updates on this topic and GenAI in general, sign up for my newsletter, here. Upon signing up, you will get a code to access member-only content. There is no cost. The same code gives you a 20% discount on all my eBooks in my eStore, here.
Author
Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier) and patent owner — one related to LLM. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.