Benchmarking xLLM and Specialized Language Models: New Approach & Results

Standard benchmarking techniques that use an LLM as a judge have strong limitations. First, they create a circular loop and reflect the flaws present in the AI judges. Then, perceived quality depends on the end user: an enterprise LLM appeals to professionals and business people, while a generic one appeals to laymen; the two groups use almost opposite criteria to assess value. Finally, the benchmarking metrics currently in use fail to capture many of the unique features of specialized LLMs, such as exhaustivity, or the quality of the relevancy and trustworthiness scores attached to each element in the response. In fact, besides xLLM, very few LLMs, if any, display such scores to the user.

I now discuss these points, along with the choice of test prompts and preliminary results comparing xLLM to other models.

Structured output vs standard response

A peculiarity of xLLM is that it offers two types of responses. The top layer is the classic response, though much less reworded than in other systems, to keep it close to the original corpus, and well organized. The layer below, which we call the structured output, is accessible to authorized end users via the UI; it displays clickable summary boxes with raw extracts and contextual elements (title, category, tags, timestamp, contact person, and so on). It also shows relevancy and trustworthiness scores:

  • Trustworthiness score: it tells you how trustworthy the input source is, for each summary box. In particular, if the same information is found in two different input sources but with a mismatch, the trustworthiness score tells you which one is more reliable.
  • Relevancy score: it tells you how relevant a summary box is to your prompt.

The structured output provides very precise links to where the information comes from. Also, models based mostly on transformers are not able to generate meaningful trustworthiness scores, as each sentence in the response blends elements from multiple sources (some good, some not so good), in addition to transition words and other material coming from a black box.

Figure: one layer below the structured output (prompt: “Restricted Stock Unit Grant”)
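To make the idea concrete, here is a minimal sketch of what one summary box in such a structured output could look like. The field names (SummaryBox, source_url, trustworthiness, relevancy, and so on) and the example values are illustrative assumptions of mine, not xLLM's actual schema or data.

```python
from dataclasses import dataclass, field

@dataclass
class SummaryBox:
    """One clickable box in the structured output (illustrative schema, not xLLM's actual one)."""
    raw_extract: str              # raw text pulled from the corpus, not reworded
    source_url: str               # precise link to where the information comes from
    title: str = ""               # contextual elements attached to the extract
    category: str = ""
    tags: list = field(default_factory=list)
    timestamp: str = ""
    contact_person: str = ""
    trustworthiness: float = 0.0  # how reliable the input source is (assumed 0-to-1 scale)
    relevancy: float = 0.0        # how relevant this box is to the prompt (assumed 0-to-1 scale)

# Hypothetical example: two boxes carrying the same fact from different sources;
# on a mismatch, the trustworthiness score tells the user which source to prefer.
boxes = [
    SummaryBox(raw_extract="RSU grants vest over 4 years.", source_url="hr/policy/2024.pdf",
               trustworthiness=0.92, relevancy=0.88),
    SummaryBox(raw_extract="RSU grants vest over 3 years.", source_url="old/wiki/page",
               trustworthiness=0.41, relevancy=0.85),
]
best = max(boxes, key=lambda b: b.trustworthiness)
print(best.source_url)  # the more trustworthy source wins the conflict
```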

There are currently no benchmarking metrics to assess the quality of these scores, or to adapt existing metrics to take the scores into account. For instance, a low-score response element should not be penalized for being poor: xLLM knows it is poor and signals it to the user. This typically happens when the prompt is irrelevant to the enterprise corpus; xLLM does not make up answers.
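As a hedged illustration of what a score-aware metric could look like (my own sketch, not an existing benchmark and not xLLM's internal method), one option is to weight each response element's judged quality by the relevancy score the model itself attached to it, so that elements the model already flagged as weak are not double-penalized.

```python
def score_aware_quality(elements):
    """
    Sketch of a score-aware benchmark metric (illustrative, not a standard).
    Each element is a pair (judged_quality, self_reported_relevancy), both in [0, 1].
    Elements the model itself marked as low-relevancy contribute less weight,
    so the model is not penalized for content it already flagged as weak.
    """
    total_weight = sum(rel for _, rel in elements)
    if total_weight == 0:
        return None  # the model declined to answer rather than making something up
    return sum(q * rel for q, rel in elements) / total_weight

# Hypothetical example: one strong, on-topic element and one weak element
# that the model itself scored as barely relevant.
print(score_aware_quality([(0.9, 0.95), (0.3, 0.10)]))  # ~0.84, dominated by the strong element
```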

Exhaustivity

In our internal document (available upon request), I describe how to automatically generate a large number of synthetic prompts matching elements scattered in the corpus, but with transformed wording (synonyms, typos, and so on). Then I check whether the relevant chunks are retrieved. On multiple occasions, chunks even better than those used in the test are found.
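The sketch below shows one way such a test could be wired. The helpers (inject_typos, the synonym_map, the retrieve function) are placeholders I introduce for illustration; they stand in for the actual components described in the internal document.

```python
import random

def inject_typos(text, rate=0.05, seed=0):
    """Randomly swap adjacent letters to simulate typos (illustrative transform)."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def make_synthetic_prompts(corpus_chunks, synonym_map, n_per_chunk=3):
    """Turn corpus chunks into reworded test prompts, remembering which chunk each one targets."""
    prompts = []
    for chunk_id, text in corpus_chunks.items():
        reworded = " ".join(synonym_map.get(w.lower(), w) for w in text.split())
        for k in range(n_per_chunk):
            prompts.append((chunk_id, inject_typos(reworded, seed=k)))
    return prompts

def exhaustivity_rate(prompts, retrieve, top_k=5):
    """Fraction of prompts for which the targeted chunk shows up among the top-k retrieved chunks."""
    hits = sum(1 for chunk_id, p in prompts if chunk_id in retrieve(p, top_k))
    return hits / len(prompts)
```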

Note that xLLM uses high-quality, home-made lists of acronyms and synonyms, automatically generated with a proprietary algorithm, to handle prompts written in a lingo different from the corpus. It also has its own stemmer and un-stemmer. The goal of all this is to achieve exhaustivity. Yet standard benchmarking metrics ignore exhaustivity, as it is not obvious how to evaluate it. But we do.
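Here is a minimal sketch of how acronym and synonym expansion could be applied to a prompt before matching it against the corpus. The lookup tables and the crude suffix-stripping stemmer are placeholders for xLLM's proprietary, auto-generated versions, and the un-stemming step is not shown.

```python
# Placeholder lookup tables; xLLM builds its own versions automatically
# with a proprietary algorithm, so these entries are illustrative only.
ACRONYMS = {"rsu": "restricted stock unit"}
SYNONYMS = {"grant": ["award", "allocation"]}

def simple_stem(word):
    """Crude suffix-stripping stemmer, standing in for xLLM's own stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def expand_prompt(prompt):
    """Expand a prompt into the set of tokens used to match the corpus index."""
    tokens = set()
    for word in prompt.lower().split():
        word = ACRONYMS.get(word, word)            # expand acronyms to corpus wording
        for token in word.split():
            stem = simple_stem(token)              # normalize inflections
            tokens.add(stem)
            for alt in SYNONYMS.get(stem, []):     # add synonyms to bridge lingo mismatches
                tokens.add(simple_stem(alt))
    return tokens

print(expand_prompt("RSU grants"))  # {'restrict', 'stock', 'unit', 'grant', 'award', 'allocation'}
```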

Preliminary results – standard metrics

The results below were obtained with “LLM as a judge”, a method with the drawbacks mentioned earlier. It also fails to capture the important qualities just discussed. Finally, it applies to the standard response, not the structured output (classic LLMs do not share that layer with the user). Despite these caveats, on the corporate corpora tested, xLLM outperforms the other models.

The global score is shown in the rightmost column. I used xLLM 1.0 for the test; the new version, xLLM 2.0, has many new components that further enhance the response. The global score does not include the ingestion time, labeled as “time” in the table. This is a metric where xLLM outshines all others by a long shot.
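For clarity, here is a hedged sketch of how such a global score could be aggregated from per-metric judge scores. It assumes a simple unweighted average that excludes the ingestion-time column; the metric names and numbers are hypothetical, and the actual weighting behind the table may differ.

```python
def global_score(metrics):
    """
    Aggregate per-metric judge scores into a single global score.
    Assumes an unweighted average and excludes ingestion time ("time"),
    which is reported separately; the real weighting may differ.
    """
    kept = {name: value for name, value in metrics.items() if name != "time"}
    return sum(kept.values()) / len(kept)

# Hypothetical per-model scores (not the actual benchmark numbers).
print(global_score({"relevance": 0.91, "faithfulness": 0.88, "clarity": 0.85, "time": 42.0}))
```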

We are continuously working on improvements, including distillation techniques that further improve quality and reduce time. The original xLLM for developers has its own proprietary database architecture (nested hashes) and allows for lightning-fast fine-tuning, even in real time. Thanks to intuitive parameters attached to the various components, it is easier and faster to post-train than other models, even for a non-expert.
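To illustrate the nested-hash idea (again a sketch of mine; the actual xLLM backend tables, keys, and weighting scheme are proprietary), a minimal version could be a dictionary keyed by token whose values map chunk identifiers to weights, so that ingesting or fine-tuning a single entry is a constant-time dictionary write rather than a gradient update.

```python
from collections import defaultdict

# Nested hash: token -> {chunk_id -> weight}. Illustrative only.
index = defaultdict(dict)

def ingest(chunk_id, tokens):
    """Add a corpus chunk to the nested-hash index."""
    for token in tokens:
        index[token][chunk_id] = index[token].get(chunk_id, 0) + 1

def tune(token, chunk_id, boost):
    """Real-time fine-tuning: adjust one weight with a constant-time dictionary write."""
    index[token][chunk_id] = index[token].get(chunk_id, 0) + boost

def retrieve(tokens, top_k=5):
    """Score chunks by summing weights over the prompt tokens."""
    scores = defaultdict(float)
    for token in tokens:
        for chunk_id, w in index.get(token, {}).items():
            scores[chunk_id] += w
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```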

To learn more about our proprietary AI technology and agents, visit our website BondingAI.io, or contact the author.

About the Author

Vincent Granville is a pioneering GenAI scientist, co-founder at BondingAI.io, the LLM 2.0 platform for hallucination-free, secure, in-house, lightning-fast Enterprise AI at scale with zero weight and no GPU. He is also an author (Elsevier, Wiley), publisher, and successful entrepreneur with a multi-million-dollar exit. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. He completed a post-doc in computational statistics at the University of Cambridge.
