GenAI Evaluation Metrics: Your Best Loss Functions to Boost Quality

Whether you are dealing with LLMs, computer vision, clustering, predictive analytics, synthetization, or any other AI problem, the goal is to deliver high-quality results in as little time as possible. Typically, you assess the output quality after producing the results, using model evaluation metrics. These metrics are also used to compare various models, or to measure improvement over a baseline.

In unsupervised learning, such as LLMs or clustering, evaluation is not trivial. In many other cases the task is straightforward, yet you still need to choose the best possible metric for quality assessment; otherwise, bad output may end up rated as good. The best evaluation metrics can also be hard to implement and compute.

At the same time, pretty much all modern techniques rely on minimizing a loss function to achieve good performance. In particular, all neural networks are massive gradient descent algorithms that aim at minimizing a loss function. The loss function is usually basic (for instance, a sum of squared differences) because it must be updated extremely fast each time a neuron gets activated and a weight is modified. There may be trillions of changes before reaching a stable solution.

In practice, the loss function is a proxy to the model evaluation metric: the lower the loss, the better the evaluation. At least, that’s the expectation.

Using Model Evaluation as the Loss Function

In this paper, I discuss a case study in the context of tabular data synthetization. The full multivariate Kolmogorov-Smirnov (KS) distance between the real and the generated data is the best evaluation metric (see here). It takes into account all potential interactions among the features, but it requires a lot of computing time. The multivariate Hellinger distance is an alternative that is easier to implement.
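To make this concrete, below is a minimal sketch of a binned multivariate Hellinger distance between two numeric tabular datasets, assuming equal-width bins defined from the real data. It illustrates the general idea only; it is not the implementation from the paper.

```python
import numpy as np

def hellinger_distance(real, synth, n_bins=10):
    """Binned multivariate Hellinger distance between two numeric datasets.

    real, synth: 2D NumPy arrays with matching columns (rows = observations).
    n_bins: number of equal-width bins per feature, based on the real data.
    """
    edges = [np.linspace(real[:, j].min(), real[:, j].max(), n_bins + 1)
             for j in range(real.shape[1])]

    def bin_probs(data):
        # Map each row to a tuple of per-feature bin indices, then count bins
        idx = np.column_stack([
            np.clip(np.digitize(data[:, j], edges[j][1:-1]), 0, n_bins - 1)
            for j in range(data.shape[1])
        ])
        keys, counts = np.unique(idx, axis=0, return_counts=True)
        return {tuple(k): c / len(data) for k, c in zip(keys, counts)}

    p, q = bin_probs(real), bin_probs(synth)
    # Hellinger distance: sqrt(1 - Bhattacharyya coefficient); 0 means identical
    bc = sum(np.sqrt(p.get(k, 0.0) * q.get(k, 0.0)) for k in set(p) | set(q))
    return float(np.sqrt(max(0.0, 1.0 - bc)))
```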

However, the Hellinger distance depends on the chosen granularity in the highly sparse feature space. There is an easy way to choose the bins, requiring no more bins than the number of observations in the training set, regardless of the dimension. It also leads to very fast atomic updates, making it suitable as a loss function. You start with a low granularity, that is, a rough approximation, then increase the granularity at regular intervals until the Hellinger and KS distances are equivalent. Thus, the loss function changes over time, as pictured in Figure 1: you are working with an adaptive loss function.

Figure 1: Adaptive loss function, modified 10 times from beginning to end
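What makes this distance usable as a loss function is that each accepted change moves at most a couple of observations across bins, so the Bhattacharyya sum behind the Hellinger distance can be refreshed in constant time, one call per moved observation. Below is a hedged sketch of such an atomic update; the dictionary-based bin counts and function names are my own assumptions, not taken from the paper.

```python
import numpy as np

def atomic_update(bc_sum, counts_real, counts_synth, old_bin, new_bin, n):
    """Refresh the Bhattacharyya sum after moving one synthetic observation
    from old_bin to new_bin (bins are tuples of per-feature indices).

    bc_sum: current sum of sqrt(p_real * p_synth) over occupied bins.
    counts_real, counts_synth: dicts mapping bin -> raw count.
    n: number of observations, used to turn counts into probabilities.
    """
    for b, delta in ((old_bin, -1), (new_bin, +1)):
        p_real = counts_real.get(b, 0) / n
        old_q = counts_synth.get(b, 0) / n
        new_q = (counts_synth.get(b, 0) + delta) / n
        # Swap out the old contribution of this bin, swap in the new one
        bc_sum += np.sqrt(p_real * new_q) - np.sqrt(p_real * old_q)
        counts_synth[b] = counts_synth.get(b, 0) + delta
    hellinger = float(np.sqrt(max(0.0, 1.0 - bc_sum)))
    return bc_sum, hellinger
```

Increasing the granularity simply means rebuilding the bin counts with more bins, which only happens a handful of times over the whole run.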

Results and Challenges

Using the ideal evaluation metric as the loss function leads to spectacular improvements. I did a test where the initial synthetic data is a scrambled version of the real data. The algorithm then reallocates observed values via a large number of swaps, for each feature separately. I chose this example because it is well known that in this case, the best possible synthetization — the global optimum — is the real data itself, up to a permutation of the observations.
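For intuition, here is a hedged sketch of that setup with the simplest acceptance rule I can state: scramble each feature, then keep a proposed swap only if the Hellinger loss decreases. For readability it recomputes the distance in full at every step, whereas the actual method relies on atomic updates; it reuses the hellinger_distance sketch from above.

```python
import numpy as np

rng = np.random.default_rng(42)

def synthesize(real, n_swaps=10_000, n_bins=10):
    """Start from a column-wise scrambled copy of the real data, then
    reallocate observed values via swaps that lower the Hellinger loss."""
    synth = real.copy()
    for j in range(synth.shape[1]):                  # scramble each feature separately
        synth[:, j] = rng.permutation(synth[:, j])
    loss = hellinger_distance(real, synth, n_bins)
    for _ in range(n_swaps):
        j = rng.integers(synth.shape[1])             # pick one feature
        a, b = rng.integers(synth.shape[0], size=2)  # pick two observations
        synth[[a, b], j] = synth[[b, a], j]          # propose the swap
        new_loss = hellinger_distance(real, synth, n_bins)
        if new_loss < loss:
            loss = new_loss                          # accept: the metric improved
        else:
            synth[[a, b], j] = synth[[b, a], j]      # reject: undo the swap
    return synth, loss
```

Because only observed values are swapped, each marginal distribution is preserved exactly; driving the multivariate loss all the way to zero would reproduce the real data up to a permutation of the observations.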

Interestingly, most vendors have a hard time getting a decent solution, let alone retrieving the global optimum. My method is the only one that found it exactly, and in little time. Combinatorial algorithms can also retrieve it, but they require far more iterations; neural networks need a lot more time and do not retrieve the global optimum.

So, while most vendors don’t face the risk of producing a synthetization that is too good, my approach does. To avoid this problem, I put constraints on the desired synthetization, for instance requiring the Hellinger distance to stay above some threshold at all times. The result is a constrained synthetization, illustrated in Figure 2. Without the constraint, the real and synthetic data would be identical.

Figure 2: Real data (blue), constrained synthetization (red dots)
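Under the same assumptions as the previous sketch, the constraint boils down to one extra condition in the acceptance test: a swap is kept only if it lowers the loss without pushing it below a chosen floor. The threshold value below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def constrained_synthesize(real, min_dist=0.15, n_swaps=10_000, n_bins=10):
    """Same swap scheme as above, except that the Hellinger loss is never
    allowed to drop below min_dist (an illustrative floor, not from the paper)."""
    synth = real.copy()
    for j in range(synth.shape[1]):
        synth[:, j] = rng.permutation(synth[:, j])
    loss = hellinger_distance(real, synth, n_bins)   # sketch defined earlier
    for _ in range(n_swaps):
        j = rng.integers(synth.shape[1])
        a, b = rng.integers(synth.shape[0], size=2)
        synth[[a, b], j] = synth[[b, a], j]
        new_loss = hellinger_distance(real, synth, n_bins)
        if min_dist <= new_loss < loss:
            loss = new_loss                          # accept the swap
        else:
            synth[[a, b], j] = synth[[b, a], j]      # undo it
    return synth, loss
```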

I also worked on different datasets: the featured image, coming from the technical paper, illustrates a real dataset with a simulated Gaussian mixture distribution. I asked OpenAI to generate the Python code that produces the mixture, that is, the real data; then I moved on to the synthetization. All this is discussed in detail in the paper. As a final note, the method works with categorical and numerical features (or a blend of both), without distinguishing between the two. I handle categorical features such as text with smart encoding; in the end, they are easier to deal with than numerical features.
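For reference, a Gaussian mixture playing the role of the real data can be simulated in a few lines; the means, covariances, and weights below are made up for illustration and are not the ones used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mixture(n=1000):
    """Simulate a 2D Gaussian mixture to serve as the real dataset."""
    means = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
    covs = [np.eye(2), np.array([[1.0, 0.6], [0.6, 1.0]]), 0.5 * np.eye(2)]
    weights = [0.5, 0.3, 0.2]
    comp = rng.choice(len(weights), size=n, p=weights)   # pick a component per row
    return np.vstack([rng.multivariate_normal(means[k], covs[k]) for k in comp])

real = gaussian_mixture(1000)   # the "real" data to feed into the synthetizer
```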

Takeaway

Using the evaluation metric as the loss function is the right move, assuming you find a way to update it very efficiently, millions or billions of times, with atomic changes. It remains to be seen how this approach could be adapted to deep neural networks (DNNs). You would think it can work only with a continuous loss, since DNNs use gradient descent, which relies on the derivatives of the loss. Here, the loss is a multivariate stepwise function, and thus has a large number of discontinuities. Further work is needed to make it DNN-friendly.

Nevertheless, the application discussed here is a great sandbox to test various features before implementing them in DNNs. My synthetization uses a probabilistic algorithm (no DNN) and runs very fast, at least to get a great first approximation; thus, my algorithm is easy to fine-tune. But it becomes a lot slower than a DNN over time. Could a DNN get the best of both worlds: a great adaptive loss function, with faster convergence later on even if slower at the beginning? Starting with a great initial configuration may help; my algorithm does that. And changing the loss function when it stops decreasing may prevent you from getting stuck in a local optimum.

Finally, in cases where there is no obvious evaluation metric, such as in unsupervised learning, it makes sense to build a weighted mixture of normalized metrics (each with a value between 0 and 1) and try to use it as your loss function.
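A minimal sketch of such a composite loss is shown below; correlation_mismatch is a hypothetical second metric, standing in for whatever normalized metrics you decide to blend.

```python
def composite_loss(real, synth, metrics, weights):
    """Weighted mixture of normalized evaluation metrics, each valued in [0, 1].

    metrics: list of functions f(real, synth) returning a value in [0, 1].
    weights: non-negative weights, typically summing to 1.
    """
    return sum(w * f(real, synth) for f, w in zip(metrics, weights))

# Example usage, blending the binned Hellinger distance defined earlier with a
# hypothetical correlation_mismatch metric (both scaled to [0, 1]):
# loss = composite_loss(real, synth,
#                       metrics=[hellinger_distance, correlation_mismatch],
#                       weights=[0.7, 0.3])
```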

Full documentation, source code, and results

The full documentation, with links to the code and all related material, is in the same project textbook on GitHub, here. Check out project 2.4, added to the textbook on May 16.

Note that the project textbook contains a lot more than the material discussed here. I share the whole book rather than just the relevant chapters because of the cross-references with other projects. Also, clickable links and other navigation features in the PDF version work well only in the full document, opened in Chrome or another viewer after download.

To not miss future updates on this topic and GenAI in general, sign up to my newsletter, here. Upon signing up, you will get a code to access member-only content. There is no cost. The same code gives you a 20% discount on all my eBooks in my eStore, here.

Author

Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier) and patent owner — one related to LLM. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.

 
