Data Synthetization: Enhanced GANs vs Copulas

Using case studies, I compare generative adversarial networks (GANs) with copulas for synthesizing tabular data. I discuss back-end and front-end improvements that help GANs better replicate the correlation structure of the real data. Likewise, I discuss ways to further improve copulas: transforms, a separate copula for each population segment, and parametric model-driven copulas as opposed to a data-driven, parameter-free approach. I apply the techniques to real-life datasets, with a full Python implementation. In the end, blending both methods leads to better results. Both methods eventually need an iterative gradient-descent technique to find an optimum in the parameter space. For GANs, I provide a detailed discussion of hyperparameters and fine-tuning options.
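To make the data-driven, parameter-free copula idea concrete, here is a minimal sketch of a Gaussian copula driven by empirical quantiles. It is an illustration under my own assumptions, not the implementation from the paper: the function name `gaussian_copula_synthesize` and its arguments are hypothetical, and it handles numeric features only.

```python
import numpy as np
from statistics import NormalDist

# Vectorized standard normal CDF and inverse CDF from the standard library.
_ND = NormalDist()
_norm_cdf = np.vectorize(_ND.cdf)
_norm_ppf = np.vectorize(_ND.inv_cdf)

def gaussian_copula_synthesize(real, n_synth, seed=0):
    """Hypothetical helper: real is an (n, d) numeric array; returns (n_synth, d)."""
    rng = np.random.default_rng(seed)
    n, d = real.shape
    # Step 1: turn each column into normal scores using its ranks (empirical CDF).
    ranks = real.argsort(axis=0).argsort(axis=0) + 1   # ranks 1..n per column
    u = ranks / (n + 1)                                # strictly inside (0, 1)
    z = _norm_ppf(u)
    # Step 2: estimate the correlation matrix of the normal scores.
    corr = np.corrcoef(z, rowvar=False)
    # Step 3: sample correlated Gaussian vectors and map them to uniforms.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_synth)
    u_new = _norm_cdf(z_new)
    # Step 4: map the uniforms back through each feature's empirical quantiles.
    return np.column_stack(
        [np.quantile(real[:, j], u_new[:, j]) for j in range(d)]
    )
```

Because the marginals come from empirical quantiles, the synthetic values never leave the range observed in the real data. That is a limitation of this particular sketch, and one reason transforms or parametric marginals may be worth considering.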

I show examples where GANs are superior to copulas, and the other way around. My GAN implementation also leads to fully replicable results, a feature usually absent in other GAN systems. This is particularly important because the output depends heavily on the initial configuration, which is determined by a seed parameter; replicability also lets you find the best synthetic data across multiple GAN runs. In the process, I introduce a new matrix correlation distance to evaluate the quality of the synthetic data, taking values between 0 and 1 where 0 is best, and I leverage the TableEvaluator library. I also discuss feature clustering to improve the technique: it detects groups of features that are independent of each other, so that a different model can be applied to each group. In a medical data example on predicting the risk of cancer, I use random forests to classify the real data, and compare the performance with results obtained on the synthetic data.
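The sketch below shows how a correlation-based distance and seeded runs can work together. The exact definition of the distance is in the paper; the version here is only one plausible choice that I assume for illustration (mean absolute difference between off-diagonal correlations, halved so that it stays between 0 and 1). The names `correlation_distance`, `placeholder_generator`, and `best_of_runs` are hypothetical, and the generator is a simple stand-in for a seeded GAN run, not a GAN.

```python
import numpy as np

def correlation_distance(real, synth):
    """Assumed definition: a value in [0, 1], where 0 means identical correlations."""
    r = np.corrcoef(real, rowvar=False)
    s = np.corrcoef(synth, rowvar=False)
    off_diag = ~np.eye(r.shape[0], dtype=bool)
    # Each |r_ij - s_ij| is at most 2, so halving keeps the result in [0, 1].
    return float(np.mean(np.abs(r - s)[off_diag]) / 2.0)

def placeholder_generator(real, seed):
    """Stand-in for one seeded GAN run: resampled rows plus noise (NOT a GAN)."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, len(real), size=len(real))
    noise = rng.normal(scale=0.5 * real.std(axis=0), size=real.shape)
    return real[rows] + noise

def best_of_runs(real, generator, seeds):
    """Run the generator once per seed and keep the seed with the lowest distance."""
    scores = {
        seed: correlation_distance(real, generator(real, seed)) for seed in seeds
    }
    best_seed = min(scores, key=scores.get)
    return best_seed, scores
```

Since every run is fully determined by its seed, the winning synthetic dataset can be regenerated later by calling the generator again with `best_seed`.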

Ability of copulas to replicate the correlation structure in the real data

In one artificial example with very strong patterns, the copula method fails to detect the non-linear feature interactions, while GANs do a pretty good job. In another example, with mostly linear feature interactions, the opposite is true. The document also has many references to technical papers available online for free, as well as a discussion of open-source implementations. In particular, I feature an illustration of the CopulaGAN module, which blends copulas and GANs, from the Synthetic Data Vault (SDV) library, applied to tabular data.
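One way to see why a correlation-driven copula can miss a strong non-linear pattern is sketched below. The parabola is a hypothetical pattern chosen for illustration, not the artificial example from the paper. A Gaussian copula summarizes dependence through a correlation matrix computed on rank-based scores, so a symmetric, non-monotonic relationship can look almost like independence to it.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Hypothetical pattern: y depends strongly on x, but not monotonically.
x = rng.uniform(-1.0, 1.0, size=n)
y = x**2 + rng.normal(scale=0.05, size=n)

def ranks(v):
    """Rank of each value within its array (0 = smallest)."""
    return v.argsort().argsort()

# Rank correlation between x and y: the kind of summary a Gaussian copula uses.
rank_corr = float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Correlation between x**2 and y: shows how strong the pattern really is.
nonlinear_corr = float(np.corrcoef(x**2, y)[0, 1])

print(f"rank correlation of (x, y): {rank_corr:.3f}")
print(f"correlation of (x^2, y):    {nonlinear_corr:.3f}")
```

With a rank correlation near zero, a Gaussian copula fitted to this data would generate `x` and `y` almost independently, losing the parabola. Applying a transform before fitting, or fitting separate copulas on the segments `x < 0` and `x >= 0`, are the kinds of remedies the text refers to.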

This applied paper is part of the newly added chapter in my book on synthetic data and explainable AI. The 289-page book, now in version 4.0 and accepted by Elsevier, is currently available here. In the PDF version (the only one currently available, viewable in any browser), all back-links, external links, and internal links work, in particular those to or from other chapters, the glossary, the references, and the index. The datasets, Python code, and illustrations are also on my GitHub repository. See the table of contents and access sample chapters here. You can download the free article “Data Synthetization: Enhanced GANs vs Copulas” as paper #20 here.

To avoid missing future articles, sign up for our newsletter here.

About the Author

Vincent Granville is a pioneering GenAI scientist and co-founder at BondingAI.io, the LLM 2.0 platform for hallucination-free, secure, in-house, lightning-fast Enterprise AI at scale with zero weight and no GPU. He is also an author (Elsevier, Wiley), publisher, and successful entrepreneur with a multi-million-dollar exit. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. He completed a post-doc in computational statistics at the University of Cambridge.
