In this article, I explore different front-end strategies to improve a generative adversarial network (GAN) that produces poor synthetizations, in the context of tabular data generation. It is well known that tabular data is a lot more challenging than images when using deep neural networks for synthetization purposes. An algorithm may work very well on some datasets and unexpectedly fail on others. Here, adding one feature to the same dataset caused an otherwise decent GAN to behave poorly. The new feature was highly correlated with an existing one, causing the problem. I fixed it by first transforming the data via PCA (principal component analysis) before performing the synthetization, then applying the inverse transform post-GAN. Scaling and other transforms, such as standardization, also help.
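The PCA step can be sketched as follows. This is a minimal illustration with NumPy only, not the implementation from the paper. The toy data, the function names (`fit_pca`, `to_pca`, `from_pca`) and the stand-in generator are all assumptions: instead of a GAN, the sketch simply resamples each principal component independently, which is enough to show where the forward and inverse transforms fit.

```python
import numpy as np

def fit_pca(X):
    """Standardize X, then get the principal axes from an SVD."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    Z = (X - mean) / std
    # Rows of Vt are the principal axes of the standardized data.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return {"mean": mean, "std": std, "axes": Vt}

def to_pca(X, p):
    """Forward transform: decorrelated principal-component coordinates."""
    return ((X - p["mean"]) / p["std"]) @ p["axes"].T

def from_pca(T, p):
    """Inverse transform: back to the original feature space."""
    return (T @ p["axes"]) * p["std"] + p["mean"]

# Toy data (assumption): the third feature is nearly collinear with the first.
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(50, 10, n)
x2 = rng.normal(0, 1, n)
x3 = 2.0 * x1 + rng.normal(0, 1, n)
real = np.column_stack([x1, x2, x3])

p = fit_pca(real)
T = to_pca(real, p)  # columns of T are uncorrelated

# Stand-in for the GAN (assumption): resample each component independently.
# A real pipeline would train the generator on T and sample from it here.
synth_T = np.column_stack(
    [rng.choice(T[:, j], size=n) for j in range(T.shape[1])]
)
synth = from_pca(synth_T, p)  # inverse transform, post-synthetization

corr_real = np.corrcoef(real, rowvar=False)
corr_synth = np.corrcoef(synth, rowvar=False)
print("corr(x1, x3) real :", round(corr_real[0, 2], 4))
print("corr(x1, x3) synth:", round(corr_synth[0, 2], 4))
```

Because the generator only sees uncorrelated columns, it no longer has to learn the strong dependence between the two correlated features. The inverse transform puts that dependence back.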
In addition, when dealing with tabular data, a GAN may be very sensitive to the seed, and the synthetizations obtained at each epoch vary significantly in quality. Thus, you could stop halfway and get much better results than by training for a fixed number of epochs. Small datasets present additional challenges. Better loss functions, such as Wasserstein, may not solve the issue.
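One way to act on this is to score the synthetic data after every epoch, for several seeds, and keep the best snapshot rather than the last one. The sketch below is a hedged illustration, not the procedure from the paper. The names `select_best_checkpoint` and `quality_gap` are hypothetical, the scoring function is a simple placeholder, and the "training" is a noisy toy update standing in for a real GAN so that the example runs on its own.

```python
import numpy as np

def quality_gap(real, synth):
    """Placeholder score (lower is better): largest gap between the
    correlation matrices plus largest standardized gap between means."""
    corr_gap = np.abs(
        np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
    ).max()
    mean_gap = (
        np.abs(real.mean(axis=0) - synth.mean(axis=0)) / real.std(axis=0)
    ).max()
    return corr_gap + mean_gap

def select_best_checkpoint(real, seeds, n_epochs, init_fn, step_fn, sample_fn):
    """Score every (seed, epoch) snapshot and keep the best one."""
    history, best = [], None
    for seed in seeds:
        rng = np.random.default_rng(seed)
        state = init_fn(real.shape[1], rng)
        for epoch in range(1, n_epochs + 1):
            state = step_fn(state, real, rng)
            synth = sample_fn(state, len(real), rng)
            gap = quality_gap(real, synth)
            history.append((seed, epoch, gap))
            if best is None or gap < best["gap"]:
                best = {"seed": seed, "epoch": epoch, "gap": gap, "synth": synth}
    return best, history

# Toy stand-in for GAN training (assumption): a linear generator whose
# parameters drift toward the right values with noisy, unstable updates.
def init_fn(d, rng):
    return {"mu": rng.normal(size=d), "A": np.eye(d)}

def step_fn(state, real, rng):
    d = real.shape[1]
    target_A = np.linalg.cholesky(np.cov(real, rowvar=False))
    mu = state["mu"] + 0.3 * (real.mean(axis=0) - state["mu"])
    A = state["A"] + 0.3 * (target_A - state["A"])
    return {"mu": mu + rng.normal(scale=0.2, size=d),
            "A": A + rng.normal(scale=0.2, size=(d, d))}

def sample_fn(state, n, rng):
    return state["mu"] + rng.normal(size=(n, len(state["mu"]))) @ state["A"].T

data_rng = np.random.default_rng(42)
x = data_rng.normal(size=500)
real = np.column_stack([x, 0.8 * x + 0.6 * data_rng.normal(size=500)])

n_epochs = 30
best, history = select_best_checkpoint(
    real, seeds=[0, 1, 2], n_epochs=n_epochs,
    init_fn=init_fn, step_fn=step_fn, sample_fn=sample_fn,
)
last_epoch_gaps = [g for (_, e, g) in history if e == n_epochs]
print("best seed/epoch:", best["seed"], best["epoch"], "gap:", round(best["gap"], 3))
print("gaps at final epoch:", [round(g, 3) for g in last_epoch_gaps])
```

With a real GAN, `step_fn` would run one training epoch and `sample_fn` would draw from the generator. The selection loop itself stays the same.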
Evaluating the quality, that is, comparing the synthetic data with the real data, is a problem of its own. In computer vision, this is straightforward, as you visually compare two images. But with tabular data, many metrics fail to capture intricate patterns spanning multiple features, some numerical and some categorical. This can result in a synthetization being scored as excellent while in reality being a total miss. In short, a false negative: the evaluation fails to flag a poor synthetization.
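A small experiment makes the risk concrete. In the sketch below, the "synthetic" data is built by shuffling each column of the real data independently, which destroys the dependence between features. A metric that looks at one feature at a time (here, a two-sample Kolmogorov-Smirnov distance per column) rates it as perfect, while a simple comparison of correlation matrices exposes the failure. The toy data and the function names are assumptions for illustration, and these are not the evaluation metrics from the paper.

```python
import numpy as np

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov distance between two 1D samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # the ECDFs only jump at sample points
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def max_marginal_ks(real, synth):
    """Worst per-feature KS distance: ignores links between features."""
    return max(
        ks_distance(real[:, j], synth[:, j]) for j in range(real.shape[1])
    )

def max_corr_gap(real, synth):
    """Largest absolute gap between the two correlation matrices."""
    return np.abs(
        np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
    ).max()

# Toy real data (assumption): two strongly correlated features.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
real = np.column_stack([x, x + 0.3 * rng.normal(size=n)])

# Deliberately bad synthetization: same values per column, links destroyed.
synth = np.column_stack([rng.permutation(real[:, j]) for j in range(2)])

print("max marginal KS :", max_marginal_ks(real, synth))
print("max corr gap    :", round(max_corr_gap(real, synth), 3))
```

The marginal metric is exactly zero because each shuffled column holds the same values as the original. Only the multivariate comparison reveals that the joint structure is gone.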
In this article, I cover all these issues and show how to address them. In the end, the best solution, one that worked consistently on all datasets with barely any hyperparameter fine-tuning, was not a neural network but an algorithm referred to as NoGAN, which also runs a lot faster. However, I also show several strategies in action that significantly improve your GAN on challenging datasets. In this case, the data comes from a well-known telecom use case. In the above table,
- 2D GAN corresponds to using only two features and works well (not included here).
- Failed GAN corresponds to working with three features, with the new one causing the problem.
- Fixed GAN is based on the three features, after applying the PCA transformation and selecting the best seed and the best epoch.
- NoGAN does not use any neural network. This technique is described here, along with better evaluation metrics.
The free technical paper (10 pages, including a case study and a full Python implementation with a link to GitHub) is available as article #30, here. To not miss future articles, sign up for our newsletter (same link).
About the Author
Vincent Granville is a pioneering AI and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).
Vincent has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, including “Synthetic Data and Generative AI”, available here. He lives in Washington state and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.