Build and Evaluate High-Performance Taxonomy-Based LLMs From Scratch

One obvious way to dramatically improve the quality of LLM and RAG systems is to use high-quality input sources, as opposed to raw text from crawled or parsed content. Combine this with specialization: one LLM per top domain, allowing the user to customize parameters and specify the domain in addition to standard concise prompts. The result is a set of very fast, lightweight, self-tuned, hallucination-free implementations, suitable for enterprise needs and inexpensive to run (far fewer tokens, no GPU, no neural networks, no training). You can also deploy these multi-LLMs locally, even on a modest laptop, which boosts security.

That was the goal when I developed the xLLM architecture. While it creates its own embeddings, namely x-embeddings with tokens replaced by multi-token words (see here), its strength comes from highly structured information detected in the corpus or brought in externally. This extra information feeds several backend tables in addition to the x-embeddings; these tables, more than the embeddings, are responsible for the quality of the output.
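
To make this concrete, below is a minimal sketch of what an x-embedding entry might look like: keys are multi-token words rather than single tokens, with co-occurrence counts as values. The structure and names are illustrative assumptions, not the actual xLLM backend tables.

```python
from collections import defaultdict

def build_x_embeddings(sentences, max_len=3):
    """Index multi-token words (n-grams up to max_len) and count their
    co-occurrences within each sentence. Illustrative sketch only."""
    embeddings = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        # all n-grams up to max_len are candidate multi-token words
        ngrams = [" ".join(tokens[i:i + n])
                  for n in range(1, max_len + 1)
                  for i in range(len(tokens) - n + 1)]
        for a in ngrams:
            for b in ngrams:
                if a != b:
                    embeddings[a][b] += 1  # co-occurrence count
    return embeddings

corpus = ["gaussian mixture models fit multimodal data",
          "mixture models underlie many clustering methods"]
emb = build_x_embeddings(corpus)
print(dict(emb["mixture models"]))  # neighbors of a multi-token word
```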

What’s more, earlier tests while designing xLLM showed that systems with billions of tokens are extremely sparse: most of the tokens are noise. These useless tokens are rarely activated or fetched when producing an answer to a prompt, and when they are, the result can be hallucinations or poor quality. But customers pay by the token, so there is little incentive to clean up this mess. The problem is compounded by the black-box neural network architecture of standard LLMs, which makes testing and implementing changes slow and expensive, an art more than a science, in sharp contrast to xLLM.
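
The sparsity claim is easy to verify on any corpus: a small fraction of distinct tokens accounts for nearly all occurrences. Here is a hedged sketch of the kind of frequency-based pruning that removes noise tokens; the threshold is arbitrary and would be tuned in practice.

```python
from collections import Counter

def prune_noise_tokens(tokens, min_count=2):
    """Keep only tokens that occur at least min_count times."""
    counts = Counter(tokens)
    kept = {t: c for t, c in counts.items() if c >= min_count}
    print(f"kept {len(kept)} of {len(counts)} distinct tokens "
          f"({100 * len(kept) / len(counts):.0f}%)")
    return kept

tokens = "the cat sat on the mat while the dog barked xyzzy".split()
prune_noise_tokens(tokens)  # most distinct tokens are pruned as noise
```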

Integrating taxonomies into LLMs

Besides taxonomies, integrating indexes, titles and subtitles, glossaries, synonym dictionaries, and other structured data further contributes to quality, whether gathered from the corpus or brought in externally as augmented data. However, here I focus on taxonomies only.

The first version of xLLM relied heavily on a high-quality taxonomy found in the crawled data (Wolfram, in this case) and other structures such as a graph of related concepts. All of this was very easy to detect and retrieve from the website, thanks to smart crawling. But what if this type of structure is missing from your corpus? Wikipedia, for instance, also has a decent structure, very similar to Wolfram’s and easy to detect, but it is hit or miss: some topics, such as “machine learning”, are well organized, while for “statistical science” the quality of the embedded structure is low. The goal of this article is to discuss your options when facing this situation.

Fig. 1: Wolfram top categories for “Stats & Proba” (left) vs home-made LLM (right)

The two main options are:

  • Create a taxonomy from scratch based on the crawled corpus, in a semi-automated way. See Figure 1 for an illustration, and the sketch after this list.
  • Use an external taxonomy that covers your specific domain: one for each specialized sub-LLM. This process is fully automated.
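
For the first option, one plausible starting point is to aggregate breadcrumb-style category paths detected while crawling into a parent-to-children map, then curate the result by hand (the semi-automated part). The paths below are made up for illustration; they are not Wolfram data.

```python
from collections import defaultdict

def build_taxonomy(category_paths):
    """Aggregate 'Parent > Child > ...' paths into a parent -> children map."""
    taxonomy = defaultdict(set)
    for path in category_paths:
        levels = [level.strip() for level in path.split(">")]
        for parent, child in zip(levels, levels[1:]):
            taxonomy[parent].add(child)
    return taxonomy

paths = ["Probability & Statistics > Distributions > Continuous",
         "Probability & Statistics > Distributions > Discrete",
         "Probability & Statistics > Estimators"]
for parent, children in sorted(build_taxonomy(paths).items()):
    print(parent, "->", sorted(children))
```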

These two options are discussed in the technical document accompanying this article, with open-source code. More about xLLM can be found here, especially articles 36-38 listed there.

Evaluating LLMs

Evaluation is a tricky problem, as two users – a layman versus a professional expert – are bound to rate the same answer very differently. In the context of xLLM, two users with the same prompt may get different answers if they choose different sets of hyperparameters.

That said, I came up with an evaluation method specific to xLLM. The Wolfram xLLM is based on the Wolfram taxonomy. However, you can use that taxonomy as if it were external, that is, not part of the crawled data. You then categorize all the crawled webpages using the Wolfram taxonomy as augmented data, and compare the results with the native categories assigned by Wolfram. The amount of mismatch between the two, across all webpages, is an indicator of quality.
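
Here is a minimal sketch of that evaluation loop. The categorizer below is a hypothetical stand-in for the real relevancy-scoring logic in the xLLM code; it returns a ranked list of reconstructed categories per URL.

```python
def mismatch_rate(pages, categorize):
    """pages: list of (url, native_category) pairs;
    categorize: url -> list of categories, best first."""
    mismatches = 0
    for url, native in pages:
        predicted = categorize(url)
        if not predicted or predicted[0] != native:
            mismatches += 1
    return mismatches / len(pages)

# toy run with a hard-coded categorizer (hypothetical data)
pages = [("url1", "Distributions"), ("url2", "Estimators")]
ranked = {"url1": ["Distributions", "Regression"], "url2": ["Regression"]}
print(mismatch_rate(pages, lambda url: ranked.get(url, [])))  # 0.5
```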

But the problem is more complicated than that. First, my algorithm assigns multiple categories to each webpage, each with its own relevancy score, whereas Wolfram assigns only one category per page (though other structural elements achieve the same goal).

Fig. 2: Wolfram categories assigned to URLs, vs. reconstructed categories

This means that “exact match” is not a good metric. Out of 600 pages and 600 categories, between 100 and 150 pages end up categorized exactly as Wolfram categorizes them, depending on the parameters used to produce my relevancy scores. This sounds very bad, but most of the mismatches are actually pretty good, just not 100% identical, as you can see in Figure 2. This is due to the very high granularity of the Wolfram taxonomy.
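
A softer metric captures this: count a prediction as correct if the native category appears anywhere in the ranked list, or if the top prediction is a sibling of the native category in the taxonomy. The parent map below is an illustrative assumption, not the Wolfram hierarchy.

```python
def soft_match(native, predicted, parent):
    """True if native appears in the ranked predictions, or if the top
    prediction shares a parent category with the native one."""
    if native in predicted:
        return True
    if not predicted:
        return False
    p_top, p_native = parent.get(predicted[0]), parent.get(native)
    return p_top is not None and p_top == p_native

parent = {"Continuous Distributions": "Distributions",
          "Discrete Distributions": "Distributions"}
print(soft_match("Continuous Distributions",
                 ["Discrete Distributions"], parent))  # True: siblings
```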

Full documentation, source code, and backend tables

I created a new folder, “build-taxonomy”, under LLM/xLLM6/ on GitHub for this project. It contains the Python code and all the required backend tables, as well as the code to produce the new tables. The full documentation, with links to the code and related material, is in the same project textbook on GitHub, here. Check out project 8.2, added to the textbook on April 20.

Note that the project textbook (still under development) contains a lot more than xLLM. I share the whole book rather than just the relevant chapters because of cross-references to other projects. Also, clickable links and other navigation features in the PDF version work well only in the full document, on Chrome and other viewers, after download.

To avoid missing future updates on this topic and GenAI in general, sign up for my newsletter, here. Upon signing up, you will receive a code to access member-only content. There is no cost. The same code gives you a 20% discount on all my eBooks in my eStore, here.

Author


Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier) and patent owner — one related to LLM. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.
