Generative AI: Synthetic Data Vendor Comparison and Benchmarking Best Practices

The goal of data synthetization is to produce artificial data that mimics the patterns and features present in existing, real data. Many generation methods and evaluation techniques are available, depending on purposes, the type of data, and the application field. Everyone is familiar with synthetic images in the context of computer vision, or synthetic text in applications such as GPT. Sound, graphs, shapes, mathematical functions, artwork, videos, time series, spatial phenomena — you name it — can be synthesized. In this article, I focus on tabular data, with applications in fintech, the insurance industry, supply chain, and health care, to name a few.

The word “synthetization” has its origins in drug synthesis, or possibly music. Interestingly, the creation of new molecules also benefits from data synthetization: it produces virtual compounds whose properties, were the compounds made in the real world, are known in advance to some degree. This also involves tabular data generation, where the replicated features are various measurements related to the molecules in question. Historically, data synthetization was invented to address the issue of missing data, that is, as a data imputation technique. It did not work as expected, because missing data is usually very different from observed values. But the technique has evolved to cover many applications.

One original contribution of this article is the discussion of alternative methods to assess the faithfulness of synthesized data. In particular, I show the limitations of correlation distances and one-dimensional statistical summaries. I also describe holdout methods and utility assessment, using cross-validation techniques or post-classification, illustrated with a real case study. In addition, I mention a few performance metrics that are overlooked by other authors: time-to-train (and how to improve it), replicability, ease of use, parameter optimization, and data transformation prior to synthetization.
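To make the limitation concrete, here is a minimal sketch, not taken from the article itself. It assumes two common definitions: the correlation distance as the largest absolute gap between the real and synthetic correlation matrices, and the one-dimensional summary as the largest per-column Kolmogorov-Smirnov statistic. The function names and the toy circle dataset are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy import stats

def correlation_distance(real, synth):
    """Largest absolute gap between the two correlation matrices (assumed definition)."""
    gap = np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
    return float(np.max(np.abs(gap)))

def max_ks_statistic(real, synth):
    """Largest two-sample Kolmogorov-Smirnov statistic over all columns."""
    return float(max(stats.ks_2samp(real[:, j], synth[:, j]).statistic
                     for j in range(real.shape[1])))

# Hypothetical toy data: "real" points lie exactly on the unit circle.
t = np.linspace(0.0, 2.0 * np.pi, 5000, endpoint=False)
real = np.column_stack([np.cos(t), np.sin(t)])

# A poor "synthetic" copy: shuffle the second column independently.
# Each marginal distribution is preserved, but the circular shape is destroyed.
rng = np.random.default_rng(0)
synth = real.copy()
synth[:, 1] = rng.permutation(synth[:, 1])

print("correlation distance:", correlation_distance(real, synth))
print("max KS statistic:    ", max_ks_statistic(real, synth))
print("spread of radius (real, synth):",
      np.std(np.hypot(real[:, 0], real[:, 1])),
      np.std(np.hypot(synth[:, 0], synth[:, 1])))
```

In this toy case both metrics stay close to zero even though the synthetic points no longer lie on the circle, which is the kind of failure that holdout and utility-based evaluations are meant to catch.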

You can synthesize data using interpolation, agent-based modeling, the addition of correlated zero-mean noise to the real data, copulas, or generative adversarial networks (GANs). All these techniques are discussed in detail in my book on Generative AI, available here. However, in this article, I focus on copulas and GANs: these are the techniques used by the vendors compared here. I provide a summary table with the vendor rankings on 3 challenging datasets, using various evaluation metrics.
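For readers unfamiliar with copulas, the following is a minimal sketch of a Gaussian copula synthesizer for numeric tabular data. It is an illustration under stated assumptions, not the method used by any vendor in the comparison: the function name `gaussian_copula_synthesize` is hypothetical, marginals are modeled with empirical quantiles, and categorical features are not handled.

```python
import numpy as np
from scipy import stats

def gaussian_copula_synthesize(real, n_samples, seed=0):
    """Sketch of Gaussian-copula synthesis; `real` is an (n, d) numeric array."""
    rng = np.random.default_rng(seed)
    n, d = real.shape

    # Step 1: turn each column into normal scores via its ranks (empirical CDF).
    ranks = np.apply_along_axis(stats.rankdata, 0, real)   # values in 1..n
    z = stats.norm.ppf(ranks / (n + 1))                    # avoids 0 and 1

    # Step 2: estimate the dependence structure in the Gaussian space.
    corr = np.corrcoef(z, rowvar=False)

    # Step 3: sample correlated Gaussians and map them to uniforms.
    g = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u = stats.norm.cdf(g)

    # Step 4: map the uniforms back through each column's empirical quantiles.
    return np.column_stack(
        [np.quantile(real[:, j], u[:, j]) for j in range(d)]
    )

# Hypothetical toy data: two positively correlated numeric features.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
real = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=500)])
synth = gaussian_copula_synthesize(real, n_samples=1000, seed=2)
print(synth.shape)
```

Because the last step interpolates between observed quantiles, this sketch never generates values outside the range seen in the real data. That is a design limitation to keep in mind rather than a feature.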

Real versus synthesized: insurance dataset, age versus charges

Download the Article

The technical article, entitled Generative AI: Synthetic Data Vendor Comparison and Benchmarking Best Practices, is accessible in the “Free Books and Articles” section as paper #26, here. It contains links to my GitHub files, including the datasets, so that you can easily copy and paste the code. The text highlighted in orange in this PDF document marks keywords that will be incorporated in the index when I aggregate all my related articles into books about machine learning, visualization and Python. The text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links pointing to a section, bibliography entry, equation, and so on. Additionally, a Jupyter notebook to produce the visualizations is available on GitHub.

To avoid missing future articles, sign up for our newsletter, here.

About the Author

Vincent Granville is a pioneering GenAI scientist, co-founder at BondingAI.io, the LLM 2.0 platform for hallucination-free, secure, in-house, lightning-fast Enterprise AI at scale with zero weight and no GPU. He is also an author (Elsevier, Wiley), publisher, and successful entrepreneur with a multi-million-dollar exit. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. He completed a post-doc in computational statistics at the University of Cambridge.
