Data Synthetization Explained in One Picture

The diagram is organized as follows. Dashed blue lines are associated with GANs (generative adversarial networks), where the goal is to produce a sequence of synthetic datasets that get progressively better at mimicking the structure present in the real data over successive iterations. The diagram features 5 such iterations, with the synthetized datasets denoted as S1, S2, …, S5. Typically, GANs follow the gradient of the quality metric h to reach an optimum configuration q for which the synthetic data can no longer be classified as non-real. Synthetic data that gets closer to the real data is rewarded, in a way reminiscent of reinforcement learning. As with any simulation-intensive method, training the neural network can be time-consuming, and this black-box approach may lack explainability.

Dashed pink lines are associated with modeling techniques (generative AI, GMM) where synthetic data is obtained by simulating the underlying model using the parameter values estimated on the real data, that is, qk = p for all k. In the case of GMMs (Gaussian mixture models), the parameters are the cluster centers, the covariance matrix attached to each cluster, and the proportions of the mixture. For stationary time series, the parameter is typically the autocorrelation function (ACF). In some applications, including when using copulas, the EPDF (empirical probability density function) is used instead.
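
To make the "estimate the parameters, then simulate the model" idea concrete, here is a minimal sketch for the stationary time series case. It is not the code from the book. It assumes a Gaussian AR(1) model, so that the ACF is summarized by the lag-1 autocorrelation alone, and the "real" series is itself simulated as a stand-in. The names lag1_autocorr, simulate_ar1 and estimate are hypothetical helpers introduced for illustration.

```python
import random
import statistics


def lag1_autocorr(series):
    """Sample lag-1 autocorrelation of a sequence of floats."""
    mean = statistics.fmean(series)
    num = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    den = sum((a - mean) ** 2 for a in series)
    return num / den


def simulate_ar1(n, mean, std, rho, rng):
    """Simulate n points of a stationary Gaussian AR(1) process with the
    given marginal mean, marginal std and lag-1 autocorrelation rho."""
    noise_std = std * (1 - rho ** 2) ** 0.5
    x = rng.gauss(0.0, std)  # start in the stationary distribution
    out = []
    for _ in range(n):
        out.append(mean + x)
        x = rho * x + rng.gauss(0.0, noise_std)
    return out


def estimate(series):
    """Parametric configuration of a series: mean, std, lag-1 autocorrelation."""
    return {
        "mean": statistics.fmean(series),
        "std": statistics.pstdev(series),
        "rho": lag1_autocorr(series),
    }


rng = random.Random(42)

# Stand-in for the real data (assumption: AR(1) with rho = 0.7).
real = simulate_ar1(5000, mean=10.0, std=2.0, rho=0.7, rng=rng)

# Step 1: estimate the configuration p on the real data.
p = estimate(real)

# Step 2: simulate the model with q = p to obtain a synthetic series.
synthetic = simulate_ar1(5000, p["mean"], p["std"], p["rho"], rng)

# Step 3: the configuration found in the synthetic data should be close to p.
p_synth = estimate(synthetic)
print("real     :", p)
print("synthetic:", p_synth)
```

The synthetic series shares no observations with the real one, yet its estimated mean, standard deviation and lag-1 autocorrelation should land close to those of the real data, which is the sense in which the structure is reproduced.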

The goal is to mimic the structure in the real data, not the real data itself. The structure is represented by a parametric configuration denoted as p in the real data. I use the notation p1, …, p5 for the structures found in the 5 synthetic data sets. The quality hk of the synthetic data set k is the distance between pk and p, based on the Hellinger distance or some discriminating function in the case of GAN. It is assumed that the real data has been normalized (transformed) before synthesizing. “Estim. param.” stands for estimated parameters in the diagram, though sometimes the parameters can be a function or matrix rather than a set of elements.
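
As a rough illustration of the quality measure hk, the sketch below computes the Hellinger distance between the binned empirical density of a real sample and that of five synthetic samples. The setup is entirely assumed: the real data is standard normal, and the five synthetic sets are normal samples whose mean moves toward 0, standing in for S1, …, S5. The helpers histogram and hellinger are hypothetical names, and the bin range and bin count are arbitrary choices.

```python
import math
import random


def histogram(sample, lo, hi, bins):
    """Bin proportions of sample over [lo, hi]; values outside the range
    are clipped into the first or last bin."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in sample:
        i = int((x - lo) / width)
        counts[min(max(i, 0), bins - 1)] += 1
    n = len(sample)
    return [c / n for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions:
    0 for identical distributions, 1 for disjoint supports."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))


rng = random.Random(0)
n = 20000

# Stand-in for the (normalized) real data.
real = [rng.gauss(0.0, 1.0) for _ in range(n)]
p = histogram(real, -6.0, 6.0, 30)

# Five synthetic sets whose structure gets closer to the real one.
shifts = [2.0, 1.0, 0.5, 0.2, 0.0]
h = []
for k, shift in enumerate(shifts, start=1):
    synth = [rng.gauss(shift, 1.0) for _ in range(n)]
    pk = histogram(synth, -6.0, 6.0, 30)
    h.append(hellinger(pk, p))
    print(f"S{k}: h{k} = {h[-1]:.3f}")
```

Here the configuration p is a function (the binned density) rather than a handful of numbers, and a smaller hk means the synthetic set is closer in structure to the real data.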

Source: “Synthetic Data and Generative AI”, by Vincent Granville (273 pages, published in 2023), available here. The picture is from the preface. A full-resolution version, along with the table of contents, sample chapters and Python code, can be found on GitHub, here.
