Hallucination-Free, Self-Tuned, Fast Hierarchical LLMs with Multi-Token Embeddings

The new generation of RAG / LLM architectures is moving away from the original monolithic, generic OpenAI-style model, toward collections of decentralized, specialized LLMs jointly organized and governed via multi-agent systems.

The benefits are obvious: low latency, smaller tables (one per LLM), faster training and fine-tuning, energy efficiency, better results, and much lower GPU consumption. The number of tokens or weights is dramatically reduced. If you charge customers by the token, as many vendors do, this is another competitive advantage. It also enables local implementations and secure enterprise solutions augmented with external sources.

My own product, xLLM, is the pioneering solution that ignited this new trend. It offers additional benefits: it is self-tuned and user-customized, and it uses no neural networks, making it even faster and more frugal in terms of GPU usage. The embeddings table is just one of many backend tables (one set per LLM), and not even the most important one. In particular, xLLM relies heavily on the structure reconstructed from the crawled repository, especially the taxonomy and related items. The user can select a specific LLM in addition to entering the standard prompt. A future version will also integrate user prompts as input data for some of the backend tables. In contrast to deep neural networks, explainable AI is a core feature of xLLM.
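To make the architecture more concrete, here is a minimal sketch, with hypothetical sub-LLM names and toy table contents, of how a keyword from a user prompt could be routed to one specialized sub-LLM and answered from its own small backend tables. It illustrates the idea of one set of tables per LLM, not the actual xLLM code.

```python
# Minimal sketch (not the actual xLLM code): one set of small backend tables
# per specialized sub-LLM; the user chooses which sub-LLM handles the prompt.

backend = {
    "statistics": {                                           # hypothetical sub-LLM
        "embeddings": {"hypothesis": [0.12, 0.07, 0.31]},     # toy vectors
        "taxonomy":   {"hypothesis": ["inference", "testing"]},
        "related":    {"hypothesis": ["p-value", "null hypothesis"]},
    },
    "nlp": {
        "embeddings": {}, "taxonomy": {}, "related": {},
    },
}

def query(prompt_keyword: str, sub_llm: str) -> dict:
    """Fetch results for a keyword from the tables of the selected sub-LLM."""
    tables = backend[sub_llm]
    return {
        "taxonomy":      tables["taxonomy"].get(prompt_keyword, []),
        "related_items": tables["related"].get(prompt_keyword, []),
    }

print(query("hypothesis", "statistics"))
```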

Figure 1: xLLM (left) compared to standard LLMs (right)

So far, nothing new. It has been available as open source, with full Python code written from scratch and well documented, for quite some time: see here. An enterprise version for a Fortune 100 company is currently being tested, and some advertisers are interested in blending sponsored results with the organic output delivered to user queries. The parent company is funded and operated by the author of this article.

Multi-token embeddings

The new feature is the introduction, for the first time to my knowledge, of embeddings consisting of multi-token words rather than single tokens. As one would expect, this leads to better results in the output section based on embeddings. However, the initial goal was to further improve, create, or update the taxonomy tables. It is especially useful when augmenting the corpus with external sources that lack an obvious, easy-to-detect structure.

Dealing with words rather than tokens leads to a combinatorial explosion in the size and number of multi-token embeddings, called x-embeddings. To keep these new tables as small as possible while still adding value, special mechanisms are needed.
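As an illustration of the kind of mechanism involved, the sketch below builds candidate x-embedding keys as word n-grams, capping the n-gram length and dropping rare candidates to limit the combinatorial growth. The function names and thresholds are my own assumptions, not the actual xLLM6 parameters.

```python
from collections import Counter
from itertools import islice

MAX_NGRAM = 3      # cap on n-gram length to limit combinatorial growth (illustrative)
MIN_COUNT = 5      # drop rare n-grams that would mostly add noise (illustrative)

def word_ngrams(words, n):
    """Yield all contiguous n-grams of a word list."""
    return zip(*(islice(words, i, None) for i in range(n)))

def build_x_embedding_keys(corpus_sentences):
    """Count candidate multi-token keys; keep only the frequent ones."""
    counts = Counter()
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        for n in range(1, MAX_NGRAM + 1):
            counts.update(" ".join(g) for g in word_ngrams(words, n))
    return {key: c for key, c in counts.items() if c >= MIN_COUNT}
```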

Interestingly, the very first attempt produced massive backend tables, reminiscent of standard LLMs. They contained a lot of noise, indeed mostly noise: useless text elements that are never fetched when creating the output to a user prompt. This noise can potentially result in hallucinations. I mention it because I believe the same issue is still present today in standard LLMs based on trillions of weights. This problem is now solved: xLLM tables are short again, even those that store the x-embeddings.
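One plausible way to keep the x-embedding tables short is to drop multi-token entries whose constituent words barely co-occur, for instance with a pointwise mutual information (PMI) filter as sketched below. The criterion and its threshold are my illustration of noise removal, not necessarily the mechanism implemented in xLLM6.

```python
import math

def pmi(ngram_count, word_counts, total_words, ngram):
    """Pointwise mutual information between the words of a multi-token entry."""
    words = ngram.split()
    p_joint = ngram_count / total_words
    p_indep = math.prod(word_counts[w] / total_words for w in words)
    return math.log(p_joint / p_indep) if p_indep > 0 else float("-inf")

def prune_x_embeddings(ngram_counts, word_counts, total_words, min_pmi=3.0):
    """Keep single words, plus multi-token keys with strong internal association."""
    return {
        ng: c for ng, c in ngram_counts.items()
        if len(ng.split()) == 1
        or pmi(c, word_counts, total_words, ng) >= min_pmi
    }
```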

Finally, in standard LLMs, hallucinations are a byproduct of design flaws in the backend architecture, and they are hard to fix because that architecture is based on black-box neural networks; frontend prompt engineering is needed as a workaround. By contrast, xLLM has better foundations, easy to understand and fine-tune; it returns meaningful results in a single prompt.

Figure 2: xLLM6, top results for search keyword “hypothesis”

Full documentation, source code, and backend tables

I created a new folder, xLLM6, on GitHub for the new version with the x-embeddings. It contains the Python code and all the required backend tables, as well as the code that produces these new tables. The previous version is stored in the xLLM5 folder. The full documentation, with links to the code, is in the same project textbook on GitHub, here. Check out appendix C.4 and the new project 7.2.3 dealing with the upgraded architecture: it is just a few pages long.

Note that the project textbook (still under development) contains a lot more than xLLM. I share the whole book rather than just the relevant chapters because of cross-references to other projects. Also, clickable links and other navigation features in the PDF version work well only in the full document, in Chrome and other viewers, after download.

To not miss future updates on this topic and GenAI in general, sign up for my newsletter, here. Upon signing up, you will get a code to access member-only content. There is no cost. The same code gives you a 20% discount on all my eBooks in my eStore, here.

Author

Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier), and patent owner, including one patent related to LLMs. Vincent's past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.
