
I introduce a new, NoGAN alternative to standard tabular data synthetization. It is designed to run faster by several orders of magnitude, compared to training generative adversarial networks (GAN). In addition, the quality of the generated data is superior to almost all other products available on the market. The hyperparameters are intuitive, leading to explainable AI.
Many evaluation metrics to measure faithfulness have critical flaws, sometimes rating generated data as excellent, when it is actually a failure, due to relying on low-dimensional indicators. I fix this problem with the full multivariate empirical distribution (ECDF). As an additional benefit, both for synthetization and evaluation, all types of features — categorical, ordinal, or continuous — are processed with a single formula, regardless of type, even in the presence of missing values.
In real-life case studies, the synthetization was generated in less than 5 seconds, versus 10 minutes with GAN. It produced higher quality results, verified via cross-validation. Thanks to the very fast implementation, it is possible to automatically and efficiently fine-tune the hyperparameters. I also discuss next steps to further improve the speed, the faithfulness of the generated data, auto-tuning, Gaussian NoGAN, and applications other than synthetization.
Speaker
🎙Vincent Granville — Founder, MLTechniques.com. Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS). Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.
You can find more articles on synthetic data, here. To not miss future articles, sign-up to our newsletter, here.