Using case studies, I compare generative adversarial networks (GANs) with copulas to synthesize tabular data. I discuss back-end and front-end improvements to help GANs better replicate the correlation structure present in the real data. Likewise, I discuss methods to further improve copulas, including transforms, the use of separate copulas for each population segment, and parametric model-driven copulas compared to a data-driven parameter-free approach. I apply the techniques to real-life datasets, with full Python implementation. In the end, blending both methods leads to better results. Both methods eventually need an iterative gradient-descent technique to find an optimum in the parameter space. For GANs, I provide a detailed discussion of hyperparameters and fine-tuning options.
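To make the data-driven, parameter-free copula idea concrete, here is a minimal sketch of a Gaussian copula synthesizer for numeric features. It is an illustration under stated assumptions, not the implementation from the paper: the function name `gaussian_copula_synthesize` and its arguments are hypothetical, the dependence structure is taken from the plain Pearson correlation matrix of the real data, and the marginals are reproduced with empirical quantiles rather than a fitted parametric model.

```python
import math
import numpy as np

def gaussian_copula_synthesize(real, n_synth, seed=0):
    """Hypothetical helper: synthesize rows from a 2-D numeric array.

    real: array of shape (n_obs, n_features), with at least 2 features.
    """
    rng = np.random.default_rng(seed)  # fixed seed, so runs are replicable
    real = np.asarray(real, dtype=float)
    n_features = real.shape[1]

    # Step 1: correlation matrix estimated on the real data.
    corr = np.corrcoef(real, rowvar=False)

    # Step 2: correlated standard Gaussian deviates with that correlation.
    z = rng.multivariate_normal(np.zeros(n_features), corr, size=n_synth)

    # Step 3: standard normal CDF maps each deviate to a uniform on [0, 1].
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

    # Step 4: empirical quantiles of each real feature restore the marginals.
    columns = [np.quantile(real[:, j], u[:, j]) for j in range(n_features)]
    return np.column_stack(columns)
```

Because the marginals come from empirical quantiles, the synthetic values never fall outside the range observed in the real data. That limitation is one reason to consider transforms or parametric marginals.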
I show examples where GANs are superior to copulas, and the other way around. My GAN implementation also produces fully replicable results, a feature usually absent in other GAN systems. This is particularly important given the high dependency on the initial configuration determined by a seed parameter: it also allows you to find the best synthetic data using multiple GAN runs in a replicable setting. In the process, I introduce a new matrix correlation distance to evaluate the quality of the synthetic data, taking values between 0 and 1 where 0 is best, and leverage the TableEvaluator library. I also discuss feature clustering to improve the technique: it detects groups of features that are independent from each other, so that a different model can be applied to each group. In a medical data example to predict the risk of cancer, I use random forests to classify the real data, and compare the performance with results obtained on the synthetic data.
In one artificial example with very strong patterns, the copula method fails to detect the non-linear feature interactions, while GANs do a pretty good job. In another example with mostly linear feature interactions, the opposite is true. The document also has many references to technical papers available online for free, as well as a discussion of open source implementations. In particular, I feature an illustration of the CopulaGAN module, which blends copulas and GANs, from the Synthetic Data Vault (SDV) library, applied to tabular data.
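The sketch below shows one plausible form of a correlation distance taking values between 0 and 1, together with a replicable search over seeds. Both are assumptions made for illustration: the exact definition used in the paper may differ, and the names `correlation_distance`, `best_seed` and `generate` are hypothetical. Here the distance is the mean absolute difference between the real and synthetic correlation matrices, halved so that it cannot exceed 1.

```python
import numpy as np

def correlation_distance(real, synth):
    """Assumed metric: mean |R - S| / 2, where R and S are the correlation
    matrices of the real and synthetic data. 0 means identical matrices."""
    r = np.corrcoef(real, rowvar=False)
    s = np.corrcoef(synth, rowvar=False)
    return float(np.mean(np.abs(r - s)) / 2.0)

def best_seed(real, generate, seeds):
    """Run a seeded generator once per seed and keep the best run.

    generate: hypothetical callable, seed -> synthetic array with the same
    number of columns as `real`. It must be deterministic given the seed.
    """
    scored = [(correlation_distance(real, generate(seed)), seed) for seed in seeds]
    dist, seed = min(scored)  # smallest distance wins
    return seed, dist
```

With a deterministic generator, calling `best_seed` twice on the same seeds returns the same winner, which is what makes a multi-run search meaningful.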
This applied paper is part of a newly added chapter in my book on synthetic data and explainable AI. The 289-page book, now in version 4.0 and accepted by Elsevier, is currently available here. In the PDF version (the only one currently available, viewable in any browser), all back-links, external links and internal links work, in particular those to or from other chapters, the glossary, the references and the index. The datasets, Python code and illustrations are also on my GitHub repository. See the table of contents and access sample chapters here. You can download the free article “Data Synthetization: Enhanced GANs vs Copulas” from here.
About the Author
Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences (NISS).
Vincent has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, including “Intuitive Machine Learning and Explainable AI”, available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.