LLM Deep Contextual Retrieval and Multi-Index Chunking: Nvidia PDFs, Case Study

The technique described here makes LLM prompt results more exhaustive and better structured, by efficiently exploiting the knowledge graph and contextual structure present in any professional or enterprise corpus. The case study deals with public financial reports from Nvidia, available as PDF documents.

In this article, I discuss the preprocessing steps used to turn a PDF repository into input suitable for LLMs. These steps include contextual chunking, indexing text entities with a hierarchical multi-index system, and retrieving contextual elements such as lists, sub-lists, fonts (type, color, and size), images, and tables, some of which standard Python libraries fail to detect. I also discuss how to build additional contextual information, such as agents, categories, or tags, attached to text entities to further improve any LLM architecture and its prompt results.

Methodology

I use the PyMuPDF Python library, supplemented with home-made algorithms to retrieve tables that it fails to detect. Another useful library is LlamaParse. A different approach consists of saving each PDF page (a slide, in this case) as an image, then using computer vision to retrieve the various elements from the images and turn them into text where appropriate. However, I do not follow this approach. Instead, I convert the PDFs to JSON and then parse the JSON elements, including tables, diagrams, and images.
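As a minimal sketch of this PDF-to-JSON step, the snippet below uses PyMuPDF's standard "dict" extraction to collect every text span together with its page number, font type, size, and color. The file name nvidia_report.pdf is a placeholder; the full source code (see below) does much more, such as recovering undetected tables and handling special characters.

```python
import json
import fitz  # PyMuPDF

def pdf_to_spans(path: str) -> list[dict]:
    """Collect all text spans with font metadata, one record per span."""
    records = []
    for page_num, page in enumerate(fitz.open(path), start=1):
        # "dict" mode returns blocks -> lines -> spans, with font info per span
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):  # image blocks have no "lines"
                for span in line["spans"]:
                    records.append({
                        "page": page_num,
                        "text": span["text"],
                        "font": span["font"],            # e.g. "Arial-BoldMT"
                        "size": round(span["size"], 1),  # font size in points
                        "color": span["color"],          # sRGB integer
                    })
    return records

if __name__ == "__main__":
    # Placeholder file name; point this at your own PDF repository
    with open("nvidia_report.json", "w") as f:
        json.dump(pdf_to_spans("nvidia_report.pdf"), f, indent=2)
```

Bold and italic can then be inferred from the font name or from the span "flags" field, and feed the multi-index described below.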

Figure 1: Raw PDF slide

Figure 1 shows one input slide. Figure 2 shows the retrieved elements, including an undetected table embedded in the histogram (ID2 = TD0, TL0) and a bullet list with a sub-list. Note the multi-index consisting of ID1, ID2, ID3, Size (the font size), and other components not shown here: document ID, page number, font type (bold, italic), and color. Other examples show slides with multiple lists, images, or tables.

Figure 2: Slide turned into a format suitable for xLLM
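To make the multi-index concrete, here is one possible representation of an indexed text entity, written as a Python dataclass. The field names are my own illustration inferred from Figure 2 (ID1, ID2, ID3, size, plus document ID, page number, font type, and color), not the exact schema used in the article; the tags field holds the extra contextual information (agents, categories, tags) mentioned earlier.

```python
from dataclasses import dataclass, field

@dataclass
class IndexedEntity:
    """One text entity keyed by the hierarchical multi-index (illustrative)."""
    doc_id: str    # source document ID
    page: int      # page (slide) number
    id1: str       # top-level index within the page
    id2: str       # sub-index, e.g. "TD0" for a table or "TL0" for a list
    id3: str       # item-level index within id2
    size: float    # font size
    font: str      # font type, e.g. bold or italic
    color: str     # font color
    text: str      # raw text of the entity
    tags: list[str] = field(default_factory=list)  # agents, categories, tags

# Hypothetical entity: one bullet item recovered from the slide in Figure 2
item = IndexedEntity(doc_id="nvidia_slides", page=1, id1="ID1-0", id2="TL0",
                     id3="0", size=10.0, font="bold", color="black",
                     text="(bullet item text)", tags=["finance"])
```

Retrieval can then filter on any component of the index, for instance returning all table cells (id2 starting with "TD") from a given document and page.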

Get the full article

Includes high-resolution images, the full input data, source code with the output file, links to GitHub, and code documentation. The code does much more than the highlights discussed here; for instance, it handles special characters.

Download the free article, here. It contains the entire new chapter 10, added today to my book “Building Disruptive AI & LLM Technology from Scratch”. The relevant material is in section 10.1. There is also a subsection explaining how I build agents on the backend rather than the frontend (the standard approach), leading to a better design. Section 10.3 explains the differences between LLM 1.0 and 2.0.

In the full book, available here, all links are clickable. For GitHub, look for filenames starting with “PDF”, here. To avoid missing future articles on this topic and on GenAI in general, subscribe to my free newsletter, here. Subscribers get a 20% discount on all my books.

About the Author

Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier), and patent owner, with one patent related to LLMs. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.
