New Trends in LLM: Overview with Focus on xLLM

If you have ever wondered how xLLM differs from other LLM and RAG architectures, which foundational changes make it appealing to Fortune 100 companies, and which of its innovations are being copied by competitors, read on. In this article, I share the latest trends and provide a high-level summary of xLLM, describing the ground-breaking technologies that make it unique, faster, and better for professional users and experts. In particular, I share my PowerPoint presentation on the topic.

Search is becoming hot again, this time powered by RAG and LLMs rather than PageRank. New LLMs may not use transformers, and energy-efficient implementations are gaining popularity, in an attempt to lower GPU usage and thus costs. Yet all but xLLM still rely on black-box neural networks.

Great evaluation metrics remain elusive, and probably always will: in the end, LLMs, just like clustering, belong to unsupervised learning. Two users looking at a non-trivial dataset will never agree on the “true” underlying cluster structure, because “true” is meaningless in this context. The same applies to LLMs, with some exceptions: when an LLM is used for predictive analytics, that is, supervised learning, it is possible to tell which one is best in absolute terms (to some extent; it also depends on the dataset).

1. From big to small LLMs, back to big ones

The first LLMs were very big, monolithic systems. Now, simpler LLMs deal with specialized content or applications, such as a corporate corpus. The benefits are faster training, easier fine-tuning, and reduced risk of hallucinations. But the trend could reverse, moving back to big LLMs. For instance, the xLLM architecture consists of small, specialized sub-LLMs, each one focusing on a top category. If you bundle 2,000 of them together, you cover all of human knowledge. The whole system, sometimes called a mixture of experts, is managed with an LLM router.

2. LLM routers

The term multi-agent system is sometimes used instead, although not with exactly the same meaning. An LLM router is a top layer above the sub-LLMs that guides the user to the sub-LLMs relevant to the prompt. It can be explicit to the user (asking which sub-LLM to choose), transparent (performed automatically), or semi-transparent. For instance, a user looking for “gradient descent” in the “statistical science” sub-LLM may find very little: the relevant information is in the “calculus” sub-LLM. The LLM router takes care of this problem.
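To make the idea concrete, here is a minimal sketch of a keyword-based router. The sub-LLM names, keyword lists, and scoring rule are hypothetical illustrations, not xLLM’s actual routing logic.

```python
# Minimal sketch of an LLM router: send the prompt to the sub-LLMs whose
# keyword sets overlap it most. Names and keywords are invented for illustration.

sub_llm_keywords = {
    "statistical_science": {"regression", "sampling", "p-value", "variance"},
    "calculus": {"gradient", "descent", "derivative", "integral"},
    "machine_learning": {"embedding", "transformer", "clustering"},
}

def route(prompt: str, top_n: int = 2) -> list[str]:
    """Return the sub-LLMs with the largest keyword overlap with the prompt."""
    words = set(prompt.lower().split())
    scores = {name: len(words & kw) for name, kw in sub_llm_keywords.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked if scores[name] > 0][:top_n]

print(route("how does gradient descent work"))
# ['calculus'] -- the prompt is routed to the relevant sub-LLM
```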

3. Fast self-tuning, auto-tuning, and evaluation

Fine-tuning an LLM on part of the system, rather than the whole, can speed up the process tremendously. With xLLM, you can fine-tune hyperparameters locally on a single sub-LLM (fast), or across all sub-LLMs at once (slow). Hyperparameters can be local or global. In xLLM, they are intuitive, as the system is based on explainable AI. In standard LLMs, LoRA (Low-Rank Adaptation) achieves a similar goal.
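The sketch below illustrates the distinction between global and local hyperparameters, and why tuning a single sub-LLM is fast. The parameter names and the scoring function are hypothetical placeholders, not xLLM’s real hyperparameters.

```python
# Hedged sketch: local vs. global hyperparameters, and local tuning of one sub-LLM.

global_params = {"max_results": 20, "min_relevancy": 0.3}   # shared by all sub-LLMs

local_params = {   # one set per sub-LLM
    "calculus": {"token_weight": 1.0, "context_window": 2},
    "statistical_science": {"token_weight": 0.8, "context_window": 3},
}

def evaluate(sub_llm: str, params: dict) -> float:
    """Placeholder for a fast, explainable relevancy score computed on one sub-LLM."""
    return params["token_weight"] - 0.1 * abs(params["context_window"] - 2)

def tune_locally(sub_llm: str, grid: list[dict]) -> dict:
    """Re-evaluate only the chosen sub-LLM: fast compared to a global re-tuning."""
    return max(grid, key=lambda p: evaluate(sub_llm, p))

local_params["calculus"] = tune_locally("calculus", [
    {"token_weight": 1.0, "context_window": 2},
    {"token_weight": 1.1, "context_window": 4},
])
print(local_params["calculus"])   # {'token_weight': 1.0, 'context_window': 2}
```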

Self-tuning, also called auto-tuning, works as follows: collect the favorite hyperparameters chosen by the users and build a default hyperparameter set based on these choices. The system also allows users to work with customized hyperparameters, so two users may get different answers to the same prompt. Make this process even easier by returning a relevancy score for each item listed in the answer (URLs, related concepts, definitions, references, examples, and so on).
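Here is a minimal sketch of the aggregation step, assuming user-selected hyperparameter sets are logged; the field names and the median rule are illustrative choices, not the actual xLLM mechanism.

```python
# Sketch of self-tuning: turn hyperparameters favored by users into new defaults.
from statistics import median

user_choices = [                     # hypothetical logged user selections
    {"min_relevancy": 0.2, "max_results": 30},
    {"min_relevancy": 0.4, "max_results": 20},
    {"min_relevancy": 0.3, "max_results": 25},
]

def build_defaults(choices: list[dict]) -> dict:
    """New default = median of each hyperparameter across user-selected values."""
    return {k: median(c[k] for c in choices) for k in choices[0]}

print(build_defaults(user_choices))
# {'min_relevancy': 0.3, 'max_results': 25}
```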

Regarding evaluation, I proceed as follows: reconstruct the taxonomy attached to the corpus; for each web page, assign a category and compare it to the real category embedded in the corpus. I worked with Wolfram, Wikipedia, and corporate corpora: all have a very similar structure with a taxonomy and related items, and this structure can be retrieved while crawling.
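The comparison boils down to a category match rate, as in the sketch below. The page IDs and categories are made up; in practice the ground truth comes from the taxonomy embedded in the corpus.

```python
# Sketch of taxonomy-based evaluation: compare assigned vs. embedded categories.
predicted = {"page_1": "calculus", "page_2": "statistics", "page_3": "algebra"}
ground_truth = {"page_1": "calculus", "page_2": "probability", "page_3": "algebra"}

matches = sum(predicted[p] == ground_truth[p] for p in predicted)
print(f"category match rate: {matches / len(predicted):.2f}")   # 0.67
```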

Finally, whenever possible, use the evaluation metric as the loss function in the underlying gradient descent algorithm (typically a deep neural network). Loss functions currently in use are poor proxies for model quality, so why not use the evaluation metric instead? This is hard to do because you need a loss function that can be updated with atomic changes such as a weight update or a neuron activation, billions of times during training. My workaround is to start with a rough approximation of the evaluation metric and refine it over time until it converges to the desired metric. The result is an adaptive loss function. It also helps avoid getting stuck in a local minimum.
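One way to picture the idea is a loss that starts as a cheap proxy and is progressively blended toward the evaluation metric. The two functions and the linear blending schedule below are stand-ins chosen for illustration, not the actual xLLM loss.

```python
# Sketch of an adaptive loss: blend a cheap proxy with the target evaluation metric.

def proxy_loss(params) -> float:
    """Cheap surrogate, updatable after each atomic change (e.g. a weight update)."""
    return sum(p * p for p in params)            # illustrative only

def evaluation_metric(params) -> float:
    """Expensive metric we ultimately care about (lower is better here)."""
    return sum(abs(p - 1.0) for p in params)     # illustrative only

def adaptive_loss(params, step: int, total_steps: int) -> float:
    w = min(1.0, step / total_steps)             # weight shifts toward the metric
    return (1 - w) * proxy_loss(params) + w * evaluation_metric(params)

params = [0.5, 1.5]
for step in (0, 50, 100):
    print(step, round(adaptive_loss(params, step, total_steps=100), 3))
# 0 2.5   50 1.75   100 1.0
```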

4. Search, clustering and predictions

In the beginning, using LLMs for search was looked down upon. Now that this is what most corporate clients are looking for, and since it can do a far better job than Google search or the search boxes found on company websites, it is starting to get a lot of attention. Great search on your website leads to more sales. Besides search, there are plenty of other applications: code generation, clustering, and predictive analytics based on text alone.

5. Knowledge graphs and other improvements

There is a lot of talk about long-range context and knowledge graphs, built as a top layer to add more context to LLMs. In my xLLM, the knowledge graph is actually the bottom layer, retrieved from the corpus while crawling. If none is found or the quality is poor, I import one from an external source and call it an augmented knowledge graph. I also built some from scratch using synonyms, indexes, glossaries, and books. It may consist of a taxonomy and related concepts. In any case, it brings the long-range context missing in the first LLM implementations.
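A minimal sketch of such a bottom-layer knowledge graph, stored as a taxonomy plus related concepts, is shown below. The entries are invented examples, not taken from any actual xLLM corpus.

```python
# Sketch: knowledge graph as taxonomy paths plus related concepts.
knowledge_graph = {
    "gradient descent": {
        "category": "calculus > optimization",
        "related": ["learning rate", "stochastic gradient descent", "convexity"],
    },
    "clustering": {
        "category": "statistical science > unsupervised learning",
        "related": ["k-means", "hierarchical clustering", "mixture models"],
    },
}

def long_range_context(concept: str) -> list[str]:
    """Return the taxonomy path and related concepts, to enrich a prompt's context."""
    entry = knowledge_graph.get(concept, {})
    return [entry.get("category", "")] + entry.get("related", [])

print(long_range_context("gradient descent"))
```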

I also introduced longer tokens made of multiple words, such as “data~science”. I call them multi-tokens; Meta also uses them. Finally, I use contextual tokens, denoted as (say) “data^science”, meaning that the two words “data” and “science” are found in the same paragraph but are not adjacent to each other. Special care is needed to avoid an explosion in the number of tokens. In addition to the corpus itself, I leverage user prompts as augmented data to enrich the input data. The most frequent embeddings are stored in a cache for faster retrieval from the backend tables. Variable-length embeddings further increase the speed. While vector and graph databases are popular for storing embeddings, I use nested hashes instead, that is, a hash (or key-value table) whose values are themselves hashes. This is very efficient for handling sparsity.
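The sketch below shows one way to generate multi-tokens and contextual tokens, and a nested hash holding sparse associations. The “~” and “^” separators follow the article; the filtering, pairing rule, and weights are illustrative assumptions (a real system would prune pairs aggressively to avoid token explosion).

```python
# Sketch: multi-tokens, contextual tokens, and a nested hash for sparse embeddings.
from collections import defaultdict
from itertools import combinations

def tokenize_paragraph(paragraph: str) -> list[str]:
    """Single tokens, multi-tokens (adjacent words, '~'), and contextual tokens
    (same paragraph but not adjacent, '^')."""
    words = paragraph.lower().split()
    adjacent = set(zip(words, words[1:]))
    multi = [f"{a}~{b}" for a, b in adjacent]
    contextual = [f"{a}^{b}" for a, b in combinations(sorted(set(words)), 2)
                  if (a, b) not in adjacent and (b, a) not in adjacent]
    return words + multi + contextual

# Nested hash: token -> {related token -> weight}; efficient for sparse data.
embeddings: dict[str, dict[str, float]] = defaultdict(dict)
embeddings["data~science"]["machine~learning"] = 0.7   # invented weights
embeddings["data~science"]["statistics"] = 0.4

print(tokenize_paragraph("data science relies on statistics"))
```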

Cosine distance and the dot product, used to compare embeddings, are receiving increased criticism. Alternative metrics exist, such as pointwise mutual information (PMI).
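For reference, PMI between two tokens is the log-ratio of their joint frequency to the product of their individual frequencies. The counts in the example below are made up.

```python
# Sketch of pointwise mutual information (PMI) between two tokens.
import math

def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """PMI = log( p(x, y) / (p(x) * p(y)) )."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log(p_xy / (p_x * p_y))

# Tokens that co-occur more often than chance get a positive PMI.
print(round(pmi(count_xy=50, count_x=200, count_y=300, total=10_000), 3))   # 2.12
```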

6. Local, secure, enterprise versions

There is more and more interest in local, secure implementations to serve corporate clients. After all, that’s where the money is. For these clients, hallucinations are a liability. Low latency, easy fine-tuning, and explainable parameters are other important criteria for them. Hence their interest in my open-source xLLM, which solves all these problems.

Reference

I illustrate all the concepts discussed here in my new book “State of the Art in GenAI & LLMs — Creative Projects, with Solutions”, available here. For a high-level presentation, see my PowerPoint presentation here on Google Drive (easy to view), or on GitHub, here. Both the book and the presentation focus on xLLM.

Author


Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier) and patent owner — one related to LLM. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.
