Neural network methods have overshadowed all other techniques in the last decade, to the point that alternatives are simply ignored. And for good reason: techniques such as generative adversarial networks (GANs) have proved very successful in some contexts, especially computer vision. Indeed, there have been several attempts to turn every problem and traditional method — regression, supervised classification, reinforcement learning — into deep neural networks (DNNs).
Yet recently, some authors showed the equivalence between DNN and other techniques such as decision trees, initiating a trend in the opposite direction. After testing various GANs for tabular data synthetization, I realized that there has to be a better way to solve the problem. My method, referred to as NoGAN, is the result of several years of research on the topic. It is inspired by the exact multivariate interpolation technique discussed in chapter 9 in my GenAI book (see here), as well as the hidden decision tree framework discussed in chapter 2 in the same book. The latter is an ensemble method based on a moderately large number of moderately small decision trees, a feature now present in NoGAN. The former is related in the following sense: NoGAN interpolates the multivariate empirical distribution function (ECDF). Another way to describe the new method is as a generalization of the copula technique discussed in chapter 10 in my book, replicating not only the marginal distributions and the correlation structure, but also the full joint distribution.
The joint or multivariate ECDF has remained elusive to this day. It is a rather non-intuitive object, hard to visualize and handle even in two dimensions, let alone in higher dimensions with categorical features. For that reason, the probability density function (PDF) is more popular: it leads to techniques such as Gaussian mixture models (GMM), frequently embedded into neural networks, or the Hellinger distance to evaluate the quality of synthesized data. However, the ECDF is more robust and avoids a number of issues, such as non-differentiable PDFs. Moreover, while the Hellinger distance generalizes to multivariate PDFs in theory, in practice all implementations are one-dimensional, with the distance computed for each feature separately. The Hellinger equivalent for the cumulative distribution function (CDF) is the Kolmogorov-Smirnov (KS) distance.
It is said that KS does not generalize to the multivariate case; hence its near-total absence in applications when the dimension is higher than one, despite the fact that it is the best metric to fully capture all the dependencies among features, especially the non-linear ones. Interestingly, NoGAN is the first working implementation of the multivariate ECDF and KS distance in the context of synthetization, breaking what was previously considered an insurmountable barrier.
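To make the two objects concrete, here is a minimal sketch of a multivariate ECDF and of a multivariate KS distance, approximating the supremum over the pooled sample points. This is an illustration of the general concept only; the function names are mine, and the paper's actual implementation may differ.

```python
import numpy as np

def ecdf(data, points):
    """Multivariate ECDF of `data`, evaluated at each row of `points`:
    F(x) = proportion of observations <= x in every coordinate."""
    return np.mean(np.all(data[None, :, :] <= points[:, None, :], axis=2), axis=1)

def ks_distance(real, synth):
    """Multivariate KS distance: max |F_real - F_synth|, with the sup
    approximated over the pooled sample points."""
    points = np.vstack([real, synth])
    return np.max(np.abs(ecdf(real, points) - ecdf(synth, points)))

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 3))
b = rng.normal(size=(200, 3))
print(ks_distance(a, b))  # small value when both samples share a distribution
```

Note that no density estimation or differentiability is required: the ECDF is a simple counting operation, which is what makes it robust for mixed feature types.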
I introduce a new alternative, called NoGAN, to standard tabular data synthetization. It is designed to run several orders of magnitude faster than training generative adversarial networks (GANs). In addition, the quality of the generated data is superior to that of almost all other products available on the market. The hyperparameters are intuitive, leading to explainable AI.
Many evaluation metrics used to measure faithfulness have critical flaws: because they rely on low-dimensional indicators, they sometimes rate generated data as excellent when it is actually a failure. I fix this problem with the full multivariate empirical distribution (ECDF). As an additional benefit, both for synthetization and evaluation, all types of features — categorical, ordinal, or continuous — are processed with a single formula, regardless of type, even in the presence of missing values.
In real-life case studies, the synthetization was generated in less than 5 seconds, versus 10 minutes with GAN. It produced higher quality results, verified via cross-validation. Thanks to the very fast implementation, it is possible to automatically and efficiently fine-tune the hyperparameters. I also discuss next steps: further improving speed and the faithfulness of the generated data, auto-tuning, Gaussian NoGAN, and applications other than synthetization.
The superiority of NoGAN is substantial and unquestionable. After all, it allows for exact replication of the real data if bins are granular enough. This is easily achieved with barely any penalty in running time or memory requirements. No matter what, the final number of bins is no larger than the number of observations. GAN is not capable of such performance, making NoGAN a game changer. The loss function is the KS distance between the multivariate ECDFs computed on the real and synthetic data. However, there is no gradient descent algorithm involved, contributing to the speed and stability of the method, regardless of the type of features. In particular, the method is not subject to mode collapse or divergence. Likewise, there is no discriminator model involved, unlike GAN. The method also leads to fully replicable results and simple parallel implementation.
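The binning mechanism described above can be sketched as follows. This is a toy illustration under my own assumptions, not Granville's exact NoGAN algorithm: each feature is quantile-binned, observations are counted per multivariate bin, and synthetic rows are drawn by sampling bins proportionally to their real counts, then drawing uniform values inside each selected bin.

```python
import numpy as np

def nogan_like_synthesize(real, n_bins=8, n_synth=None, seed=0):
    """Toy binning-based synthesizer (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n, d = real.shape
    n_synth = n_synth or n
    # Per-feature quantile bin edges (n_bins + 1 edges per feature)
    edges = [np.quantile(real[:, j], np.linspace(0, 1, n_bins + 1)) for j in range(d)]
    # Assign each row to a multivariate bin: a tuple of per-feature bin indices
    idx = np.column_stack([
        np.clip(np.searchsorted(edges[j][1:-1], real[:, j], side="right"), 0, n_bins - 1)
        for j in range(d)
    ])
    keys, counts = np.unique(idx, axis=0, return_counts=True)
    # Sample bins with probability proportional to their real-data counts
    choice = rng.choice(len(keys), size=n_synth, p=counts / counts.sum())
    synth = np.empty((n_synth, d))
    for j in range(d):
        lo = edges[j][keys[choice, j]]
        hi = edges[j][keys[choice, j] + 1]
        synth[:, j] = rng.uniform(lo, hi)
    return synth
```

With granular enough bins, each bin contains at most one observation and the synthetic data converges to a replication of the real data, matching the exact-replication property claimed above; no gradient descent is involved, and the output is fully replicable given the seed.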
NoGAN was designed for tabular data and is so far untested in other contexts. It would be interesting to see how it performs on computer vision, text data, and other problems. Indeed, the binning algorithm (hidden decision trees) was first built years ago in the context of text processing (NLP). Without any adaptation, NoGAN can be used for clustering, competing with methods based on density estimation, or to compute model-free confidence intervals, as a better alternative to resampling techniques such as bootstrapping. The latter consists of reshuffling observations, while NoGAN creates new ones.
Finally, you can use it for supervised classification, with no risk of overfitting, by assigning a label to each bin. Rather than overfit, the technique will simply fail to produce a prediction for future observations located outside the bin configuration. Last but not least, NoGAN can be used as a data imputation method, by averaging across parent bins that are identical to the incomplete child bin, except for the missing values.
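The imputation idea can be sketched as follows. This is my own illustrative interpretation, not the paper's exact algorithm: missing entries (NaN) in a row are filled with the average of real observations whose bins agree with the row's bins on all observed features.

```python
import numpy as np

def impute_by_bins(real, row, n_bins=4):
    """Toy bin-based imputation (illustrative sketch only)."""
    d = real.shape[1]
    edges = [np.quantile(real[:, j], np.linspace(0, 1, n_bins + 1)) for j in range(d)]
    def bin_of(x, j):
        # Map value(s) x to a bin index in 0..n_bins-1 for feature j
        return np.clip(np.searchsorted(edges[j][1:-1], x, side="right"), 0, n_bins - 1)
    real_bins = np.column_stack([bin_of(real[:, j], j) for j in range(d)])
    observed = ~np.isnan(row)
    row_bins = np.array([bin_of(row[j], j) if observed[j] else -1 for j in range(d)])
    # Keep real rows whose bins match the incomplete row on observed features
    match = np.all(real_bins[:, observed] == row_bins[observed], axis=1)
    out = row.copy()
    out[~observed] = real[match][:, ~observed].mean(axis=0) if match.any() else np.nan
    return out
```

When no bin matches (the incomplete observation falls outside the bin configuration), the sketch leaves the value missing rather than extrapolating, mirroring the no-overfitting behavior of the classifier described above.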
To download the free technical paper (16 pages, including case studies and full Python implementation with a link to GitHub), visit our resource pages here and look for the document with the same title. Available to subscribers only. Subscription is free.
About the Author
Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).
Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, including “Synthetic Data and Generative AI”, available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.