Probabilistic ANN: The Swiss Army Knife of GenAI

ANN (Approximate Nearest Neighbors) is at the core of fast vector search, itself central to GenAI, especially GPT and other LLMs. My new methodology, abbreviated as PANN, has many other applications: clustering, classification, measuring the similarity between two datasets (images, soundtracks, time series, and so on), tabular data synthetization (improving poor synthetic data), model evaluation, and even detecting extreme observations.

As an example, you could use it to categorize time series without relying on statistical theory. Statistical models are often redundant, math-heavy, and less explainable, leading to definitions that are less useful to developers. PANN avoids these drawbacks.

Fast and simple, PANN (for Probabilistic ANN) does not involve training or neural networks, and it is essentially math-free. Its versatility comes from four features:

  • Most algorithms aim at minimizing a loss function. Here I also explore what you can achieve by maximizing the loss.
  • Rather than focusing on a single dataset, I use two sets S and T. For instance, K-NN looks for nearest neighbors within a set S. What about looking, for each observation in S, for its nearest neighbors in T? This leads to far more applications than the one-set approach (see the sketch after this list).
  • Some algorithms are very slow and may never converge, so no one pays attention to them. But what if the loss function drops very fast at the beginning, fast enough that by stopping early you get better results in a fraction of the time compared to the “best” method?
  • In many contexts, a good approximate solution obtained in little time from an otherwise non-converging algorithm may be as good for practical purposes as a more accurate solution obtained after far more steps with a more sophisticated algorithm.

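To make the two-set idea concrete, here is a minimal Python sketch, not the exact PANN algorithm from the article: it searches, for each point of S, for an approximate nearest neighbor among the points of T, testing one random candidate pair per iteration and tracking the loss, defined as the average distance to the best neighbor found so far. The function name and parameters are illustrative.

    import numpy as np

    def approx_nn_two_sets(S, T, n_iter=10000, seed=42):
        # Toy probabilistic search (illustrative, not the article's exact PANN):
        # for each point in S, keep the closest point of T found so far,
        # testing one random (s, t) candidate pair per iteration.
        rng = np.random.default_rng(seed)
        n, m = len(S), len(T)
        best_idx = rng.integers(0, m, size=n)             # random initial neighbor in T
        best_dist = np.linalg.norm(S - T[best_idx], axis=1)
        loss_history = []                                 # average distance so far
        for _ in range(n_iter):
            i = rng.integers(0, n)                        # random point in S
            j = rng.integers(0, m)                        # random candidate in T
            d = np.linalg.norm(S[i] - T[j])
            if d < best_dist[i]:                          # keep only improvements
                best_dist[i], best_idx[i] = d, j
            loss_history.append(best_dist.mean())
        return best_idx, loss_history

    # Example usage with two small synthetic sets
    S = np.random.rand(200, 5)
    T = np.random.rand(500, 5)
    nn_index, loss = approx_nn_two_sets(S, T)

In a run like this, the loss typically drops sharply in the first few hundred iterations, which is exactly the behavior exploited by the early-stopping argument above.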
The figure below shows how quickly the loss function drops at the beginning. In this case, the loss represents the average distance to the approximate nearest neighbors found so far by the iterative algorithm. The X-axis represents the iteration number. Note the excellent curve fit (in orange) to the loss function, allowing you to predict its baseline (minimum loss, or optimum) even after a small number of iterations. To see what happens if you maximize the loss instead, read the full technical document.
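As a companion sketch, here is one way to fit a simple decay curve to the early part of the loss history and read off its asymptote, giving a prediction of the baseline (minimum loss). The parametric form a + b/(n + c) is an assumption on my part; the article may use a different model.

    import numpy as np
    from scipy.optimize import curve_fit

    def decay(n, a, b, c):
        # Assumed decay model: loss ~ a + b / (n + c); a is the predicted baseline
        return a + b / (n + c)

    # 'loss' is the loss history from the sketch above (one value per iteration)
    iters = np.arange(1, len(loss) + 1, dtype=float)
    early = slice(0, 2000)                                # fit on early iterations only
    params, _ = curve_fit(decay, iters[early], np.asarray(loss)[early],
                          p0=[min(loss), 1.0, 1.0], maxfev=10000)
    print("Predicted baseline (minimum loss):", params[0])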

For another example of a non-converging algorithm doing a lot better than any kind of gradient descent, see here.

Fast convergence of PANN at the beginning (Y-axis is the loss function)

Download the full article and Python code

Download the full article on GitHub, here. No sign-up required. It includes a detailed section on variable-length LLM embeddings, and the code, which is also available in the same GitHub folder. This article is part of my upcoming book “Practical AI & Machine Learning Projects”, to be published here. You may request a free copy of the book (126 pages so far, not yet finished) if you purchased any other book in my e-Store.

To avoid missing future articles and to access members-only content, sign up for my free newsletter, here.

Author


Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief Scientist at MLTechniques and GenAItechLab, former VC-funded executive, author (GenAI book, Elsevier, 2024) and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.
