GenAI: Fast Data Synthetization with Distribution-free Hierarchical Bayesian Models

Deep learning models such as generative adversarial networks (GANs) require a lot of computing power, and are thus expensive. They may also fail to converge. What if you could produce better data synthetizations, in a fraction of the time, with explainable AI and substantial cost savings? This is what Hierarchical Deep Resampling was designed for. It is abbreviated here as NoGAN2.

Very different from my first, tree-based NoGAN, this new technology relies on resampling, a hierarchical sequence of runs, simulated annealing, and batch processing to boost performance, both in output quality and in time requirements. No neural network is involved. It is in fact a distribution-free Hierarchical Bayesian Model in disguise, with a loss function consisting of a large number of correlation distances measured on transformed features.
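To make the idea concrete, here is a minimal sketch of that kind of loop: a column-wise bootstrap of the real data (which preserves marginals but destroys cross-feature structure), improved by within-column swaps accepted under a simulated annealing rule. The single correlation-matrix loss and flat swap schedule below are illustrative simplifications; the actual method uses many correlation distances on transformed features, a hierarchical sequence of runs, and batch processing, as described in the paper.

```python
import numpy as np

def correlation_distance(real, synth):
    # Toy loss: mean absolute gap between the two feature correlation
    # matrices (one of many possible correlation distances).
    return np.mean(np.abs(np.corrcoef(real, rowvar=False)
                          - np.corrcoef(synth, rowvar=False)))

def anneal_resample(real, n_iter=20000, temp=1.0, cooling=0.9995, seed=0):
    # Start from a column-wise bootstrap of the real data: marginals are
    # preserved, cross-feature structure is not. Then repeatedly swap two
    # values within a random column, keeping a swap if it lowers the loss,
    # or occasionally even when it does not (simulated annealing).
    rng = np.random.default_rng(seed)
    n, d = real.shape
    synth = np.column_stack([rng.choice(real[:, j], size=n) for j in range(d)])
    loss = correlation_distance(real, synth)
    for _ in range(n_iter):
        j = rng.integers(d)
        i1, i2 = rng.integers(n, size=2)
        synth[[i1, i2], j] = synth[[i2, i1], j]          # candidate swap
        new_loss = correlation_distance(real, synth)
        if new_loss < loss or rng.random() < np.exp((loss - new_loss) / temp):
            loss = new_loss                              # accept
        else:
            synth[[i1, i2], j] = synth[[i2, i1], j]      # revert
        temp *= cooling
    return synth, loss
```

Note the key property this design exploits: because every move is a within-column swap of real values, the synthetic marginals stay exact at all times, and the optimization only has to fix the dependency structure.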

One of the method's strengths is the use of sophisticated output evaluation metrics for the loss function, combined with the ability to update the loss very efficiently at each iteration, using a very small number of computations. In addition, default hyperparameter values already provide good performance, making the method more stable than neural networks in the context of tabular data generation. An auto-tuning algorithm automatically optimizes the hyperparameters via reinforcement learning. This capability saves a lot of time and money.
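One way to see why the per-iteration update can be so cheap (a hedged sketch, assuming the pairwise-correlation loss of the previous snippet): a within-column swap leaves every column's mean and variance intact, so only the cross-products between the modified column and the other d − 1 columns change. The `cross` cache and `apply_swap` helper below are illustrative names, not the paper's actual code.

```python
import numpy as np

def apply_swap(X, cross, i1, i2, j):
    # Swap X[i1, j] and X[i2, j] while updating the cached cross-product
    # matrix cross[a, b] = sum_i X[i, a] * X[i, b] in O(d) time.
    # A within-column swap changes no marginal statistic, so only row j
    # and column j of `cross` move, by
    #     delta_k = (X[i2, j] - X[i1, j]) * (X[i1, k] - X[i2, k]),  k != j.
    # Correlations involving column j can then be refreshed from `cross`
    # without recomputing the other (d - 1)(d - 2) / 2 pairs.
    delta = (X[i2, j] - X[i1, j]) * (X[i1, :] - X[i2, :])
    delta[j] = 0.0                        # the diagonal entry is unaffected
    cross[j, :] += delta
    cross[:, j] += delta
    X[[i1, i2], j] = X[[i2, i1], j]       # perform the actual swap

# Initialize the cache once with cross = X.T @ X, then call apply_swap
# for each accepted move instead of recomputing the full loss.
```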

The purpose of this article is to show the spectacular performance of NoGAN2, using the base model. One case study involves a dataset with 21 features, used to predict student success based on college admission metrics. It includes categorical, ordinal, and continuous features, as well as missing values. Another case study is a telecom dataset used to predict customer attrition. The method has been tested on other datasets as well: healthcare, insurance, and cybersecurity. Applications are not limited to data synthetization; they also include complex statistical inference problems. Finally, in contrast to most neural network methods, NoGAN2 leads to fully replicable results.

Downloading the Paper

The 22-page technical paper, with full implementation and description, is available as article #31, here. It also illustrates hyperparameter tuning, and showcases the first use of GenAI-evaluation: a new Python library based on the multivariate empirical distribution function, for both categorical and numerical features in any dimension. All the code and datasets are also on GitHub, accessible in one click from the document.

To download the PDF document and not miss future articles, sign up (for free) for my newsletter, here.

Acknowledgements

I would like to thank Shakti Chaturvedi for the numerous tests and research that he performed to compare the new technique proposed here with various generative adversarial networks. He brought the telecom dataset to my attention, and tested improved versions of GAN and WGAN, as well as vendor solutions and related methods. Earlier versions of the NoGAN2 code, along with WCGAN implementations, are available as Jupyter notebooks on his GitHub repository, here.

I am also very grateful to Rajiv Iyer for turning the multivariate empirical distribution function (ECDF) and related Kolmogorov-Smirnov (KS) distance computations into a production-grade Python library, available here. You can install it with pip install genAI-evaluation. I use this library to evaluate the quality of the results. Rajiv also compared NoGAN2 with CTGAN on the student dataset; all comparisons are favorable to NoGAN2.
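For readers curious about what the underlying metric computes, here is a from-scratch NumPy illustration of a multivariate KS distance based on the joint ECDF, evaluated at the observed rows of both samples. This is my own sketch of the concept, not the genAI-evaluation API; the library also handles categorical features (via the treatment described in the paper), which this toy version assumes are already numeric.

```python
import numpy as np

def multivariate_ks(real, synth):
    # Joint ECDF: F(x) = fraction of rows dominated component-wise by x.
    def ecdf(sample, points):
        return np.array([(sample <= p).all(axis=1).mean() for p in points])

    # Evaluate both ECDFs at every observed row, and take the largest gap.
    points = np.vstack([real, synth])
    return np.max(np.abs(ecdf(real, points) - ecdf(synth, points)))
```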

About the Author

Vincent Granville is a pioneering AI and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author, and patent owner. Vincent's past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences (NISS).

Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, including "Synthetic Data and Generative AI" (Elsevier), available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math, and probabilistic number theory.
