GenAI: Fast Data Synthetization with Distribution-free Hierarchical Bayesian Models

Deep learning models such as generative adversarial networks (GANs) require a lot of computing power, and are thus expensive. They may also fail to converge. What if you could produce better data synthetizations, in a fraction of the time, with explainable AI and substantial cost savings? This is what Hierarchical Deep Resampling was designed for. It is abbreviated here as NoGAN2.

Very different from my first, tree-based NoGAN, this new technology relies on resampling, a hierarchical sequence of runs, simulated annealing, and batch processing to boost performance, both in output quality and in time requirements. No neural network is involved. It is in fact a distribution-free Hierarchical Bayesian Model in disguise, with a loss function consisting of a large number of correlation distances measured on transformed features.
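To make the idea concrete, here is a minimal sketch of that kind of loop: a column-wise bootstrap of the real data (which preserves marginals but destroys cross-feature structure), improved by within-column swaps accepted under a simulated annealing rule. The single correlation-matrix loss and flat swap schedule below are illustrative simplifications; the actual method uses many correlation distances on transformed features, a hierarchical sequence of runs, and batch processing, as described in the paper.

```python
import numpy as np

def correlation_distance(real, synth):
    # Toy loss: mean absolute gap between the two feature correlation
    # matrices (one of many possible correlation distances).
    return np.mean(np.abs(np.corrcoef(real, rowvar=False)
                          - np.corrcoef(synth, rowvar=False)))

def anneal_resample(real, n_iter=20000, temp=1.0, cooling=0.9995, seed=0):
    # Start from a column-wise bootstrap of the real data: marginals are
    # preserved, cross-feature structure is not. Then repeatedly swap two
    # values within a random column, keeping a swap if it lowers the loss,
    # or occasionally even when it does not (simulated annealing).
    rng = np.random.default_rng(seed)
    n, d = real.shape
    synth = np.column_stack([rng.choice(real[:, j], size=n) for j in range(d)])
    loss = correlation_distance(real, synth)
    for _ in range(n_iter):
        j = rng.integers(d)
        i1, i2 = rng.integers(n, size=2)
        synth[[i1, i2], j] = synth[[i2, i1], j]          # candidate swap
        new_loss = correlation_distance(real, synth)
        if new_loss < loss or rng.random() < np.exp((loss - new_loss) / temp):
            loss = new_loss                              # accept
        else:
            synth[[i1, i2], j] = synth[[i2, i1], j]      # revert
        temp *= cooling
    return synth, loss
```

Note the key property this design exploits: because every move is a within-column swap of real values, the synthetic marginals stay exact at all times, and the optimization only has to fix the dependency structure.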

One of the method's strengths is the use of sophisticated output evaluation metrics for the loss function, combined with the ability to update the loss very efficiently at each iteration, using a very small number of computations. In addition, default hyperparameter values already provide good performance, making the method more stable than neural networks in the context of tabular data generation. An auto-tuning algorithm automatically optimizes the hyperparameters via reinforcement learning. This capability saves a lot of time and money.
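One way to see why the per-iteration update can be so cheap (a hedged sketch, assuming the pairwise-correlation loss of the previous snippet): a within-column swap leaves every column's mean and variance intact, so only the cross-products between the modified column and the other d − 1 columns change. The `cross` cache and `apply_swap` helper below are illustrative names, not the paper's actual code.

```python
import numpy as np

def apply_swap(X, cross, i1, i2, j):
    # Swap X[i1, j] and X[i2, j] while updating the cached cross-product
    # matrix cross[a, b] = sum_i X[i, a] * X[i, b] in O(d) time.
    # A within-column swap changes no marginal statistic, so only row j
    # and column j of `cross` move, by
    #     delta_k = (X[i2, j] - X[i1, j]) * (X[i1, k] - X[i2, k]),  k != j.
    # Correlations involving column j can then be refreshed from `cross`
    # without recomputing the other (d - 1)(d - 2) / 2 pairs.
    delta = (X[i2, j] - X[i1, j]) * (X[i1, :] - X[i2, :])
    delta[j] = 0.0                        # the diagonal entry is unaffected
    cross[j, :] += delta
    cross[:, j] += delta
    X[[i1, i2], j] = X[[i2, i1], j]       # perform the actual swap

# Initialize the cache once with cross = X.T @ X, then call apply_swap
# for each accepted move instead of recomputing the full loss.
```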

The purpose of this article is to show the spectacular performance of NoGAN2, using the base model. One case study involves a dataset with 21 features, used to predict student success based on college admission metrics. It includes categorical, ordinal, and continuous features, as well as missing values. Another case study is a telecom dataset used to predict customer attrition. The method has been tested on other datasets as well: healthcare, insurance, and cybersecurity. Applications are not limited to data synthetization; they also include complex statistical inference problems. Finally, in contrast to most neural network methods, NoGAN2 leads to fully replicable results.

Downloading the Paper

The 22-page technical paper, with full implementation and description, is available as article #31, here. It also illustrates hyperparameter tuning, and showcases the first use of GenAI-evaluation: a new Python library based on the multivariate empirical distribution function, for both categorical and numerical features in any dimension. All the code and datasets are also on GitHub, accessible in one click from the document.

To download the PDF document and not miss future articles, sign up (for free) for my newsletter, here.

Acknowledgements

I would like to thank Shakti Chaturvedi for the numerous tests and research that he performed to compare the new technique proposed here with various generative adversarial networks. He brought the telecom dataset to my attention, and tested improved versions of GAN and WGAN, as well as vendor solutions and related methods. Earlier versions of the NoGAN2 code, along with WCGAN implementations, are available as Jupyter notebooks on his GitHub repository, here.

I am also very grateful to Rajiv Iyer for turning the multivariate empirical distribution function (ECDF) and related Kolmogorov-Smirnov (KS) distance computations into a production-grade Python library, available here. You can install it with pip install genAI-evaluation. I use this library to evaluate the quality of the results. Rajiv also compared NoGAN2 with CTGAN on the student dataset; all comparisons are favorable to NoGAN2.
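For readers curious about what the underlying metric computes, here is a from-scratch NumPy illustration of a multivariate KS distance based on the joint ECDF, evaluated at the observed rows of both samples. This is my own sketch of the concept, not the genAI-evaluation API; the library also handles categorical features (via the treatment described in the paper), which this toy version assumes are already numeric.

```python
import numpy as np

def multivariate_ks(real, synth):
    # Joint ECDF: F(x) = fraction of rows dominated component-wise by x.
    def ecdf(sample, points):
        return np.array([(sample <= p).all(axis=1).mean() for p in points])

    # Evaluate both ECDFs at every observed row, and take the largest gap.
    points = np.vstack([real, synth])
    return np.max(np.abs(ecdf(real, points) - ecdf(synth, points)))
```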

About the Author

Vincent Granville is a pioneering AI and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author, and patent owner. Vincent's past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences (NISS).

Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, including "Synthetic Data and Generative AI" (Elsevier), available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math, and probabilistic number theory.
