Sampling Outside the Observation Range with Quantile Convolution

All of the GenAI apps that I tested, including my own, share the same problem: they cannot easily generate data outside the observation range. As an example, let’s focus on the insurance dataset discussed in my new book. I use it to generate synthetic data with GANs (generative adversarial networks) and the NoGAN models discussed in chapters 6 and 7. In the training set, one of the features is “charges”, that is, the medical expenses incurred by the policyholder in a given year. The range is from $1,121 to $63,770. In the synthesized data, the amount always stays within these two bounds. Worse, most models are unable to produce a synthetic maximum above $60,000 (see here). The issue goes undetected due to poor evaluation metrics, and is compounded by the small size of the training set. The same is true for all the other features. The problem shows up in all the tested datasets, no matter how many observations you generate.

The consequences are persistent algorithmic bias and the inability to generate enriched or unusual data. The solution currently adopted is to work with gigantic training sets, further increasing the costs linked to training, cloud, and GPU time usage. What I propose here goes in the opposite direction: cost reduction, smaller training sets, high-quality output assessed with the best evaluation metrics, and the ability to generate more diversified data, including meaningful outliers. All this with a fast, simple algorithm based on a clever idea.

New Technique: Quantile Stretching

To generate data outside the observation range while preserving the distribution of the original training set, I use a clever idea to generate “unobserved” quantiles beyond the minimum and maximum. It easily generalizes to multivariate quantiles. You can call it quantile stretching, although this makes it sound like an image spectrum enhancement problem. The statistical term used in the literature is extrapolated quantiles. However, the method is very different from anything discussed in statistical or mathematical articles. It is a pure, typical black-box machine learning technique relying, like many others, on a convolution product. Thus, I call it quantile convolution. The originality is in the model-free, fast implementation, not so much in the convolution itself. No neural network is needed.
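
To make the convolution explicit: replacing each observation by Gaussian deviates, as described in the next paragraph, amounts to sampling from a Gaussian kernel density estimate, whose CDF is the convolution of the empirical CDF with a Gaussian kernel. A sketch of the formula under that reading, with Φ the standard normal CDF, σ the standard deviation of the feature, x₁, …, xₙ the observations, and v the smoothing factor defined below:

$$F_v(x) = \frac{1}{n}\sum_{i=1}^{n} \Phi\!\left(\frac{x - x_i}{v\,\sigma}\right)$$

Extrapolated quantiles are obtained by inverting F_v. Since the Gaussian tails extend beyond the observed minimum and maximum, so do the stretched quantiles, while v → 0 recovers the empirical distribution.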

The idea consists of replacing each observation x in the training set by a number of deviates from a Gaussian distribution centered at x, with standard deviation proportional to that observed in the real data. The proportionality factor is denoted as v and may depend on the number n of observations. I also use truncated Gaussians when the range is constrained due to business rules. The larger v, the smoother the resulting quantiles, with v = 0 corresponding to the original data. The method has nice convergence properties that are easy to prove. The image below illustrates the methodology, with v ranging from 0.0 to 0.4. In this example, the total number of generated points is 1000. The histogram has 100 bins of equal widths.
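
Here is a minimal Python sketch of this sampling step, assuming NumPy. It is a simplified illustration rather than the full implementation on GitHub: the function name, the parameter k (number of deviates per observation), and the rejection-based truncation are illustrative choices.

```python
import numpy as np

def quantile_convolution(data, v=0.2, k=20, lower=None, upper=None, seed=42):
    # Replace each observation x by k Gaussian deviates centered at x,
    # with standard deviation v * sigma (sigma: std observed in the real data).
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    centers = np.repeat(data, k)              # k copies of each observation
    sample = rng.normal(loc=centers, scale=v * data.std())
    # Truncated-Gaussian variant: redraw deviates violating business rules
    lo = -np.inf if lower is None else lower
    hi = np.inf if upper is None else upper
    bad = (sample < lo) | (sample > hi)
    while bad.any():
        sample[bad] = rng.normal(loc=centers[bad], scale=v * data.std())
        bad = (sample < lo) | (sample > hi)
    return sample

# Example: stretched values extend beyond the observed min and max
rng = np.random.default_rng(0)
charges = rng.gamma(shape=2.0, scale=7000.0, size=1000) + 1121  # toy stand-in
stretched = quantile_convolution(charges, v=0.2)
print(charges.min(), charges.max())      # observed range
print(stretched.min(), stretched.max())  # typically wider
```

Computing quantiles of the stretched sample (for instance with np.quantile) then yields the extrapolated quantiles used for synthesis; v = 0 returns the original observations.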

Conclusions

The quantile convolution technique helps you generate data outside the observation range, thus creating truly enriched datasets, contrary to all the tools that I tried in the context of synthetic data, whether based on deep neural networks or not, and whether open-source or vendor platforms. Generalizing quantiles to higher dimensions may not seem trivial, but it has been done with NoGAN and the sister methods discussed in chapters 6 and 7 of my new book. The new method, akin to quantile extrapolation, blends easily with NoGAN to enhance its performance.

Current techniques to evaluate the quality of synthetic data fail to capture complex feature dependencies, resulting in false negatives: generated data scored as excellent when it is actually very poor. Deep neural networks can be very slow and volatile, requiring ad hoc tuning for each new dataset. The technique discussed here fits into a new breed of algorithms: fast and easy to train, leading to explainable AI and auto-tuning, and requiring less rather than more data to address the traditional challenges. Another example is data thinning: I illustrate how you can get better results, in addition to saving time, by randomly deleting 50% of the data in the training set, as sketched below. All of this using sound evaluation metrics and cross-validation.
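
As a quick illustration of data thinning, here is a hypothetical helper: the 50% rate matches the experiment mentioned above, but the function itself is illustrative, not production code.

```python
import numpy as np

def thin_training_set(data, keep_fraction=0.5, seed=0):
    # Data thinning: randomly keep a fraction of the training rows.
    rng = np.random.default_rng(seed)
    mask = rng.random(len(data)) < keep_fraction
    return data[mask]
```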

The main goal of this new framework is cost savings while delivering better results: using less training, GPU, and cloud time. It goes against the modern trend of using bigger and bigger datasets. The popularity of oversized training sets stems from the fact that they seem to be the easy solution; yet my algorithms are simpler. In addition, large companies offering cloud and GPU services have strong incentives to favor big data: the bigger the data, the more revenue for them, and the higher the costs for the client. Since I offer free solutions, thus bearing the cost of computations, I have strong incentives to optimize for speed while maintaining high-quality output. In the end, my goals are aligned with those of the client, not with those of cloud companies or vendors charging a premium for cloud usage based on the volume of data.

Python Code and Documentation

The Python code is on GitHub, here. The version producing the video is available here. The corresponding article with technical documentation (7 pages including the code) is also on GitHub, here. Note that the tech document is an extract from my new book “Statistical Optimization for GenAI and Machine Learning” (200 pages). The relevant material starts at page 181. Links are not clickable in this extract, but they are in the full version of the book, available here. The tech document features real-life use cases, in addition to the artificial one shown in the video.

To avoid missing future articles and to access members-only content, sign up for my free newsletter, here.

Author

Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com, former VC-funded executive, author, and owner of a patent related to LLMs. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET.

Vincent is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences (NISS). He published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is the author of multiple books, including “Synthetic Data and Generative AI” (Elsevier, 2024). Vincent lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math, and probabilistic number theory. He recently launched a GenAI certification program, offering state-of-the-art, enterprise-grade projects to participants.
