
In this document, you will learn how to build a system that decides, among dozens of candidate paragraphs selected from the corpus to answer a prompt, which ones to show in the results, and in what order. The goal is to maximize relevancy without overwhelming the user with a long, cluttered answer. Think of it as a new PageRank for RAG/LLM, although the algorithm is radically different and much simpler. The approach is generic and works for any RAG/LLM system, whether based on neural networks or not. The main steps are:
Backend processing (linked to the corpus)
- Split your corpus into text entities such as webpages, sections, paragraphs, and so on. This step is similar to chunking. Attach an ID (called the index) to each text entity.
- Text entities have two types of fields: regular text as in all LLMs, and knowledge graph elements such as categories, related items, URL, tags, parent categories, title, and so on. These knowledge graph elements are found while crawling and are part of the original corpus, or they can be added after the full crawl. For instance, in xLLM, agents are assigned to text entities post-crawling, using a clustering algorithm.
- You need two types of tokens: regular ones, and those found in the knowledge graph elements. The latter are called graph tokens. You then create a key-value table Hash_ID, where the key is a token and the value is the list of text entity IDs attached to the token in question, with a token count for each one. Graph tokens start with “__” to differentiate them from regular tokens; see the sketch after this list.
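Here is a minimal Python sketch of this backend step. The tokenizer, field names, and sample corpus are hypothetical placeholders; a real implementation would plug in xLLM's own crawler output and tokenization.

```python
from collections import defaultdict

def tokenize(text):
    # Naive whitespace tokenizer, for illustration only
    return text.lower().split()

def build_hash_id(text_entities):
    """text_entities: dict mapping entity ID -> {"text": str, "graph": list of str}.
    Returns Hash_ID: token -> {entity ID: token count}."""
    hash_id = defaultdict(lambda: defaultdict(int))
    for entity_id, entity in text_entities.items():
        for token in tokenize(entity["text"]):
            hash_id[token][entity_id] += 1
        for element in entity["graph"]:                 # categories, tags, titles, ...
            for token in tokenize(element):
                hash_id["__" + token][entity_id] += 1   # graph tokens start with "__"
    return hash_id

corpus = {
    1: {"text": "gradient descent converges under convexity", "graph": ["optimization"]},
    2: {"text": "stochastic gradient descent with momentum", "graph": ["optimization", "deep learning"]},
}
Hash_ID = build_hash_id(corpus)
# Hash_ID["gradient"]       -> {1: 1, 2: 1}
# Hash_ID["__optimization"] -> {1: 1, 2: 1}
```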
Frontend processing (linked to the prompt)
- You create a local, small key-value table ID_Hash, a transposed version of Hash_ID, where the key is a text entity ID, and the value is the list of tokens t found in the prompt such that ID is in Hash_ID[t].
- For each ID, you compute (say) 4 relevancy scores: SA and SB for regular tokens, and SC and SD, identical to SA and SB but for graph tokens. For instance, SB(ID) is the number of regular tokens found both in the prompt and in the text entity indexed as ID. More about SA in the next section.
- You sort the collected IDs according to each score and assign 4 ranks (one per score) to each ID. The global rank is a weighted combination of the 4 ranks, subject to fine-tuning. In the prompt results, you display the text entities (their full content) whose global rank is above some threshold, or the top 10 according to global rank. A sketch of this frontend step follows this list.
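Below is a minimal sketch of the frontend step, reusing Hash_ID and tokenize from the backend sketch. For brevity it computes only the count-based scores SB and SD (SA and SC, the smart scores, are described in the next section); the weights and top-10 cutoff are illustrative, not xLLM's actual defaults.

```python
from collections import defaultdict

def build_id_hash(prompt_tokens, hash_id):
    """ID_Hash: entity ID -> list of prompt tokens t such that ID is in Hash_ID[t]."""
    id_hash = defaultdict(list)
    for token in prompt_tokens:
        for entity_id in hash_id.get(token, {}):
            id_hash[entity_id].append(token)
    return id_hash

def rank_entities(prompt_tokens, hash_id, weights=(0.5, 0.5), top_n=10):
    regular = [t for t in prompt_tokens if not t.startswith("__")]
    graph = [t for t in prompt_tokens if t.startswith("__")]
    sb = {i: len(toks) for i, toks in build_id_hash(regular, hash_id).items()}  # SB per ID
    sd = {i: len(toks) for i, toks in build_id_hash(graph, hash_id).items()}    # SD per ID
    ids = set(sb) | set(sd)
    # One rank per score (0 = best), then a weighted combination of the ranks
    rank_b = {i: r for r, i in enumerate(sorted(ids, key=lambda i: -sb.get(i, 0)))}
    rank_d = {i: r for r, i in enumerate(sorted(ids, key=lambda i: -sd.get(i, 0)))}
    global_rank = {i: weights[0] * rank_b[i] + weights[1] * rank_d[i] for i in ids}
    return sorted(ids, key=lambda i: global_rank[i])[:top_n]

prompt_tokens = tokenize("does gradient descent converge") + ["__optimization"]
print(rank_entities(prompt_tokens, Hash_ID))
```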
Smart scoring
Rare tokens present in the prompt (rare based on their occurrence in the corpus) may be boosted, as they usually carry more specialized information. To achieve this, for SA(ID) I use a sum of inverse powers of n(t), where n(t) is the number of occurrences of token t in the corpus. The sum is over all tokens t found both in the prompt and in the text entity indexed as ID. There is an extra parameter β, representing the exponent, subject to fine-tuning.
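In other words, SA(ID) is the sum of 1 / n(t)^β over the matching tokens t. A minimal sketch follows, reusing Hash_ID and prompt_tokens from the earlier examples and deriving n(t) from Hash_ID (an assumption; xLLM may store corpus-level counts separately).

```python
def smart_score(entity_id, prompt_tokens, hash_id, beta=1.0):
    """SA(ID): sum of 1 / n(t)^beta over tokens t found in both the prompt and
    the text entity, where n(t) is the token's occurrence count in the corpus.
    Rare tokens contribute more; beta is subject to fine-tuning."""
    score = 0.0
    for token in prompt_tokens:
        postings = hash_id.get(token, {})
        if entity_id in postings:
            n_t = sum(postings.values())      # occurrences of t across the corpus
            score += 1.0 / (n_t ** beta)
    return score

# Example: smart_score(1, prompt_tokens, Hash_ID, beta=1.5)
```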
That’s it, in a nutshell!
View article, get the code and data
The technical document is available on GitHub: see paper 46, here. It includes detailed documentation with illustrations, along with the code. The case study uses a portion of the anonymized, augmented corpus of a Fortune 100 company.
Coming soon
Besides search and retrieval, another possible application of xLLM is automatically updating the corpus: adding relevant material to text entities via augmentation, or detecting and deduping redundant entries. Another is building a taxonomy on top of the corpus, possibly seeded with an external taxonomy, via taxonomy augmentation. You can also use xLLM as an auto-indexer and glossary generator for book collections, large websites, or repositories: it will automatically detect index entries and sub-entries, create the full index or glossary, and flag the corresponding terms in the corpus using a mechanism similar to text entity IDs. At the time of writing, the only product offering this capability is NotebookLM.
One of the challenges is automated model evaluation: xLLM returns concise yet exhaustive bullet lists with a score attached to each item. How do you assess exhaustiveness? How do you take the relevancy scores into account in your evaluation metric? I discussed an approach based on the ability to correctly reconstruct the underlying taxonomy of the corpus. Another idea is to use xLLM to generate an index and compare it with the existing one in the corpus.
To not miss future versions with more features, subscribe to my newsletter, here.
About the Author
Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier) and patent owner — one related to LLM. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.