The goal of data synthetization is to produce artificial data that mimics the patterns and features present in existing, real data. Many generation methods and evaluation techniques are available, depending on the purpose, the type of data, and the application field. Everyone is familiar with synthetic images in the context of computer vision, or synthetic text in applications such as GPT. Sound, graphs, shapes, mathematical functions, artwork, videos, time series, spatial phenomena: you name it, it can be synthesized. In this article, I focus on tabular data, with applications in fintech, the insurance industry, supply chain, and health care, to name a few.
The word “synthetization” has its origins in drug synthesis, or possibly music. Interestingly, the creation of new molecules also benefits from data synthetization, by producing virtual compounds whose properties, should the compounds be produced in the real world, are known in advance to some degree. It also involves tabular data generation, where the replicated features are various measurements related to the molecules in question. Historically, data synthetization was invented to address the issue of missing data, that is, as a data imputation technique. It did not work as expected, because missing data is usually very different from observed values. But the technique has since evolved to cover many applications.
One original contribution in this article is the discussion of alternative methods to assess the faithfulness of synthesized data. In particular, I show the limitations of correlation distances and one-dimensional statistical summaries. I also describe holdout methods and utility assessment, using cross-validation techniques or post-classification, illustrated with a real case study. In addition, I mention a few performance metrics that are overlooked by other authors: time-to-train (and how to improve it), replicability, ease of use, parameter optimization, and data transformation prior to synthetization.
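To make the limitation of correlation distances and one-dimensional summaries concrete, here is a minimal sketch in plain Python. It is not taken from the article: the toy data, the helper names `pearson` and `correlation_distance`, and the definition of the distance as the largest absolute gap between matching correlation coefficients are all assumptions made for illustration. The "synthetic" table is simply the real one with its second column shuffled, so every per-column summary is identical and the correlation distance is small, yet the dependence between the two columns is gone.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def correlation_distance(real, synth):
    """Largest absolute gap between matching pairwise correlations.

    This definition is an assumption for the sketch; other norms are possible.
    Both arguments are lists of columns.
    """
    d = len(real)
    return max(
        abs(pearson(real[i], real[j]) - pearson(synth[i], synth[j]))
        for i in range(d)
        for j in range(i + 1, d)
    )


rng = random.Random(42)

# Toy "real" data: y is a deterministic function of x, yet uncorrelated with it.
x = [rng.uniform(-1, 1) for _ in range(2000)]
y = [v * v for v in x]
real = [x, y]

# Toy "synthetic" data: same columns, but y is shuffled, destroying the dependence.
y_shuffled = y[:]
rng.shuffle(y_shuffled)
synth = [x[:], y_shuffled]

# Small value: the correlation distance sees almost no difference.
print("correlation distance:", correlation_distance(real, synth))

# Identical marginals: any one-dimensional summary of y matches exactly.
print("same marginal for y:", sorted(real[1]) == sorted(synth[1]))

# A nonlinear check exposes the gap: x^2 predicts y in the real data only.
x_squared = [v * v for v in x]
print("corr(x^2, y) real:     ", pearson(x_squared, real[1]))
print("corr(x^2, y) synthetic:", pearson(x_squared, synth[1]))
```

In this toy case, both the correlation distance and the per-column summaries would rate the shuffled table as a faithful copy, which is why evaluation on a holdout set or through a downstream task is worth adding.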
You can synthesize data using interpolation, agent-based modeling, the addition of correlated zero-mean noise to the real data, copulas, or generative adversarial networks (GANs). All these techniques are discussed in detail in my book on Generative AI, available here. However, in this article, I focus on copulas and GANs: these are the techniques used by the vendors compared here. I provide a summary table with the vendor rankings on 3 challenging datasets, using various evaluation metrics.
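As a rough illustration of the copula approach, the sketch below implements a basic Gaussian copula for numeric columns using only the Python standard library. It is not the code used by any of the vendors or by the article: the function names (`normal_scores`, `cholesky`, `empirical_quantile`, `synthesize`), the toy data, and the choice of rank-based normal scores with interpolated empirical quantiles are assumptions made for the example. The idea is to capture the dependence structure with a correlation matrix in normal-score space, sample correlated Gaussian vectors, and map each coordinate back through the empirical quantile function of the corresponding real column.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)


def normal_scores(column):
    """Replace each value by the standard normal quantile of its rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def empirical_quantile(sorted_column, u):
    """Inverse empirical CDF, interpolating linearly between order statistics."""
    pos = u * (len(sorted_column) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_column) - 1)
    frac = pos - lo
    return sorted_column[lo] * (1 - frac) + sorted_column[hi] * frac


def synthesize(columns, n_rows, seed=0):
    """Generate n_rows synthetic rows from a list of real numeric columns."""
    rng = random.Random(seed)
    d = len(columns)
    scores = [normal_scores(col) for col in columns]
    corr = [[1.0 if i == j else pearson(scores[i], scores[j]) for j in range(d)]
            for i in range(d)]
    lower = cholesky(corr)
    sorted_cols = [sorted(col) for col in columns]
    rows = []
    for _ in range(n_rows):
        g = [rng.gauss(0, 1) for _ in range(d)]
        # Correlated standard normals: z = L g.
        z = [sum(lower[i][k] * g[k] for k in range(i + 1)) for i in range(d)]
        rows.append([empirical_quantile(sorted_cols[i], STD_NORMAL.cdf(z[i]))
                     for i in range(d)])
    return rows


# Toy "real" data: a skewed feature and a noisy copy of it.
rng = random.Random(1)
x = [rng.expovariate(1.0) for _ in range(500)]
y = [v + rng.gauss(0, 0.5) for v in x]

synthetic = synthesize([x, y], n_rows=500, seed=2)
sx, sy = zip(*synthetic)
print("real correlation:     ", pearson(x, y))
print("synthetic correlation:", pearson(sx, sy))
```

Because each synthetic value is interpolated between observed values, this sketch never produces values outside the range seen in the real data, and it handles neither categorical columns nor ties with any care. A production tool would need to address both.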
Table of Contents
Download the Article
The technical article, entitled Generative AI: Synthetic Data Vendor Comparison and Benchmarking Best Practices, is accessible in the “Free Books and Articles” section, here. It contains links to my GitHub files, including datasets, so that you can easily copy and paste the code. In this PDF document, the text highlighted in orange marks keywords that will be incorporated in the index when I aggregate all my related articles into books about machine learning, visualization and Python. The text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links pointing to a section, bibliography entry, equation, and so on. Additionally, a Jupyter notebook that produces the visualizations is available on GitHub.
To avoid missing future articles, sign up for our newsletter, here.
About the Author
Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences (NISS). He has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, including “Intuitive Machine Learning and Explainable AI”, available here. Vincent lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.