
In this document, you will learn how to build a system that decides, among dozens of candidate paragraphs selected from the corpus to answer a prompt, which ones to show in the results, and in what order. The goal is to maximize relevancy without overwhelming the user with a long, cluttered answer. Think of it as a new PageRank for RAG/LLM, although the algorithm is radically different and much simpler. The approach is generic and works for any RAG/LLM system, whether based on neural networks or not. The main steps are:
Backend processing (linked to the corpus)
- Split your corpus into text entities such as webpages, sections, paragraphs, and so on. This step is similar to chunking. Attach an ID (called the index) to each text entity.
- Text entities have two types of fields: regular text as in all LLMs, and knowledge graph elements such as categories, related items, URL, tags, parent categories, title, and so on. These knowledge graph elements are found while crawling and are part of the original corpus, or they can be added after the full crawl. For instance, in xLLM, agents are assigned to text entities post-crawling, using a clustering algorithm.
- You need two types of tokens: regular ones, and those found in the knowledge graph elements. The latter are called graph tokens. You then create a key-value table Hash_ID, where the key is a token and the value is the list of text entity IDs attached to the token in question, with a token count for each one. Graph tokens start with “__” to differentiate them from regular tokens; see the sketch after this list.
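Here is a minimal Python sketch of this backend step. The tokenizer, field names, and sample corpus are hypothetical placeholders; a real implementation would plug in xLLM's own crawler output and tokenization.

```python
from collections import defaultdict

def tokenize(text):
    # Naive whitespace tokenizer, for illustration only
    return text.lower().split()

def build_hash_id(text_entities):
    """text_entities: dict mapping entity ID -> {"text": str, "graph": list of str}.
    Returns Hash_ID: token -> {entity ID: token count}."""
    hash_id = defaultdict(lambda: defaultdict(int))
    for entity_id, entity in text_entities.items():
        for token in tokenize(entity["text"]):
            hash_id[token][entity_id] += 1
        for element in entity["graph"]:                 # categories, tags, titles, ...
            for token in tokenize(element):
                hash_id["__" + token][entity_id] += 1   # graph tokens start with "__"
    return hash_id

corpus = {
    1: {"text": "gradient descent converges under convexity", "graph": ["optimization"]},
    2: {"text": "stochastic gradient descent with momentum", "graph": ["optimization", "deep learning"]},
}
Hash_ID = build_hash_id(corpus)
# Hash_ID["gradient"]       -> {1: 1, 2: 1}
# Hash_ID["__optimization"] -> {1: 1, 2: 1}
```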
Frontend processing (linked to the prompt)
- You create a local, small key-value table ID_Hash, a transposed version of Hash_ID, where the key is a text entity ID, and the value is the list of tokens t found in the prompt such that ID is in Hash_ID[t].
- For each ID, you compute (say) 4 relevancy scores: SA and SB for regular tokens, and SC and SD, identical to SA and SB but for graph tokens. For instance, SB(ID) is the number of regular tokens found both in the prompt and in the text entity indexed as ID. More about SA in the next section.
- You sort the collected IDs according to each score and assign 4 ranks (one per score) to each ID. The global rank is a weighted combination of the 4 ranks, subject to fine-tuning. In the prompt results, you display the text entities (their full content) whose global rank is above some threshold, or the top 10 according to global rank. A sketch of this frontend step follows this list.
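Below is a minimal sketch of the frontend step, reusing Hash_ID and tokenize from the backend sketch. For brevity it computes only the count-based scores SB and SD (SA and SC, the smart scores, are described in the next section); the weights and top-10 cutoff are illustrative, not xLLM's actual defaults.

```python
from collections import defaultdict

def build_id_hash(prompt_tokens, hash_id):
    """ID_Hash: entity ID -> list of prompt tokens t such that ID is in Hash_ID[t]."""
    id_hash = defaultdict(list)
    for token in prompt_tokens:
        for entity_id in hash_id.get(token, {}):
            id_hash[entity_id].append(token)
    return id_hash

def rank_entities(prompt_tokens, hash_id, weights=(0.5, 0.5), top_n=10):
    regular = [t for t in prompt_tokens if not t.startswith("__")]
    graph = [t for t in prompt_tokens if t.startswith("__")]
    sb = {i: len(toks) for i, toks in build_id_hash(regular, hash_id).items()}  # SB per ID
    sd = {i: len(toks) for i, toks in build_id_hash(graph, hash_id).items()}    # SD per ID
    ids = set(sb) | set(sd)
    # One rank per score (0 = best), then a weighted combination of the ranks
    rank_b = {i: r for r, i in enumerate(sorted(ids, key=lambda i: -sb.get(i, 0)))}
    rank_d = {i: r for r, i in enumerate(sorted(ids, key=lambda i: -sd.get(i, 0)))}
    global_rank = {i: weights[0] * rank_b[i] + weights[1] * rank_d[i] for i in ids}
    return sorted(ids, key=lambda i: global_rank[i])[:top_n]

prompt_tokens = tokenize("does gradient descent converge") + ["__optimization"]
print(rank_entities(prompt_tokens, Hash_ID))
```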
Smart scoring
Rare tokens present in the prompt (rare based on their occurrence in the corpus) may be boosted, as they usually carry more specialized information. To achieve this, for SA(ID) I use a sum of inverse powers of n(t), where n(t) is the number of occurrences of token t in the corpus. The sum is over all tokens t found both in the prompt and in the text entity indexed as ID. There is an extra parameter β, representing the exponent, subject to fine-tuning.
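In other words, SA(ID) is the sum of 1 / n(t)^β over the matching tokens t. A minimal sketch follows, reusing Hash_ID and prompt_tokens from the earlier examples and deriving n(t) from Hash_ID (an assumption; xLLM may store corpus-level counts separately).

```python
def smart_score(entity_id, prompt_tokens, hash_id, beta=1.0):
    """SA(ID): sum of 1 / n(t)^beta over tokens t found in both the prompt and
    the text entity, where n(t) is the token's occurrence count in the corpus.
    Rare tokens contribute more; beta is subject to fine-tuning."""
    score = 0.0
    for token in prompt_tokens:
        postings = hash_id.get(token, {})
        if entity_id in postings:
            n_t = sum(postings.values())      # occurrences of t across the corpus
            score += 1.0 / (n_t ** beta)
    return score

# Example: smart_score(1, prompt_tokens, Hash_ID, beta=1.5)
```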
That’s it, in a nutshell!
View article, get the code and data
The technical document is available on GitHub: see paper 46, here. It includes detailed documentation with illustrations, along with the code. The case study uses a portion of the anonymized, augmented corpus of a Fortune 100 company.
Coming soon
Besides search and retrieval, another possible application of xLLM is automatically updating the corpus: adding relevant material to text entities via augmentation, or detecting and deduping redundant entries. Another is building a taxonomy on top of the corpus, possibly seeded with an external taxonomy, via taxonomy augmentation. You can also use xLLM as an auto-indexer and glossary generator for book collections, large websites, or repositories: it will automatically detect index entries and sub-entries, create the full index or glossary, and flag the corresponding terms in the corpus using a mechanism similar to text entity IDs. At the time of writing, the only product offering this capability is NotebookLM.
One of the challenges is automated model evaluation: xLLM returns concise yet exhaustive bullet lists with a score attached to each item. How do you assess exhaustiveness? How do you take the relevancy scores into account in your evaluation metric? I discussed an approach based on the ability to correctly reconstruct the underlying taxonomy of the corpus. Another idea is to use xLLM to generate an index and compare it with the existing one in the corpus.
To not miss future versions with more features, subscribe to my newsletter, here.
About the Author
Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier) and patent owner — one related to LLM. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.