I sometimes get asked this question: could you use simulations instead of synthetizations? Below is my answer, focusing on particular aspects of data synthetization that differentiate it from other techniques.
Simulations do not simulate joint distributions
Sure, if all your features behave like a mixture of multivariate normal distributions, you can use GMMs (Gaussian mixture models) for synthetization. This is akin to Monte-Carlo simulation. The parameters of the mixture — number of clusters, covariance matrix attached to each Gaussian distribution (one per cluster), and the mixture proportions — can be estimated using the EM algorithm. It is subject to model identifiability issues, but it will work.
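As an illustration, here is a minimal sketch of GMM-based synthetization in Python, using scikit-learn's GaussianMixture, which estimates the mixture parameters (weights, means, covariances) via EM. The toy two-cluster dataset is an assumption for demonstration only:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy real data: two Gaussian clusters in 2D (illustrative assumption)
real = np.vstack([
    rng.normal([0, 0], 1.0, size=(500, 2)),
    rng.normal([5, 5], 0.5, size=(500, 2)),
])

# Fit the mixture via EM: estimates mixture proportions, means,
# and one covariance matrix per cluster
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(real)

# Monte-Carlo sampling from the fitted mixture = the synthetization
synthetic, labels = gmm.sample(1000)
```

Identifiability issues mentioned above show up here as label switching: the fitted clusters may come out in any order, which does not affect the quality of the samples.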
If the interdependence structure among the features is essentially linear, in other words, well captured by the correlation matrix, you can decorrelate the features using a linear transform such as PCA to remove cross-correlations, then sample each feature separately using standard simulation techniques, and finally apply the inverse transform to add the correlations back. This is similar to what the copula method accomplishes. Each decorrelated feature can be modeled using a parametric metalog distribution to accommodate various shapes, akin to Monte-Carlo simulation.
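The decorrelate-sample-recorrelate pipeline can be sketched as follows. Here each decorrelated feature is resampled empirically, a stand-in assumption for the parametric metalog fits mentioned above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 2000, 3
# Toy correlated data (illustrative assumption)
A = rng.normal(size=(m, m))
real = rng.normal(size=(n, m)) @ A.T

# PCA: eigen-decomposition of the covariance matrix
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
scores = (real - mu) @ eigvecs          # decorrelated features

# Sample each decorrelated feature independently
# (bootstrap resampling as a stand-in for a metalog fit)
synth_scores = np.column_stack([
    rng.choice(scores[:, j], size=n, replace=True) for j in range(m)
])

# Inverse transform: add the cross-correlations back
synthetic = synth_scores @ eigvecs.T + mu
```

Because the components are sampled independently in the rotated space, the inverse transform approximately restores the original correlation matrix.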
Dealing with a mix of categorical, ordinal, and continuous features
This is when synthetization becomes most useful. The copula method can handle it easily. For categorical variables, you can create buckets, also called flag vectors. For instance, [smoker=yes, region=South, gender=F] is a bucket. Frequency counts are computed for each bucket in the real dataset. When synthetizing data, generate buckets according to these estimated frequencies. You may aggregate all small buckets into one catch-all bucket. This method, similar to decision trees and XGBoost, is a good alternative to turning your categorical features into a large number of binary, numerical dummy variables.
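A minimal sketch of the bucket method, using toy categorical data (an assumption) and the standard library only:

```python
from collections import Counter
import random

random.seed(0)
# Toy real data: each row is a bucket (smoker, region, gender)
real = [
    ("yes", "South", "F"), ("no", "North", "M"), ("no", "South", "F"),
    ("no", "North", "F"), ("yes", "South", "F"), ("no", "South", "M"),
]

# Frequency counts per bucket in the real dataset
counts = Counter(real)
buckets = list(counts)
weights = [counts[b] for b in buckets]

# Synthetic rows drawn according to the estimated bucket frequencies
synthetic = random.choices(buckets, weights=weights, k=1000)
```

In practice you would also merge rare buckets into a catch-all bucket before sampling, as suggested above, to avoid overfitting small counts.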
To deal with non-linear interdependencies among your features, GAN synthetizations (generative adversarial networks) are usually superior to copula-based methods. Of course, the two methods can be blended: remove the cross-correlations first, then synthetize decorrelated features using GAN, then add back the cross-correlations, with the same linear transform as discussed earlier.
One issue is how to measure the correlation between categorical features, or between categorical and numerical features. Metrics such as Cramér's V accomplish this, returning a value between 0 and 1, instead of between -1 and 1 for standard correlations.
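For reference, Cramér's V can be computed from the chi-squared statistic of the contingency table; a plain NumPy sketch:

```python
import numpy as np

def cramers_v(x, y):
    """Association between two categorical arrays, in [0, 1]."""
    cats_x, xi = np.unique(x, return_inverse=True)
    cats_y, yi = np.unique(y, return_inverse=True)
    # Contingency table of observed counts
    table = np.zeros((len(cats_x), len(cats_y)))
    np.add.at(table, (xi, yi), 1)
    n = table.sum()
    # Expected counts under independence
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k)) if k > 0 else 0.0

x = np.array(["a", "a", "b", "b", "a", "b"])
print(cramers_v(x, x))  # identical features -> 1.0
```

Note this uncorrected version is biased upward on small samples; bias-corrected variants exist.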
Do Gaussian copulas work on non-Gaussian observations?
Copulas and GANs aim at replicating the whole joint distribution, not just each component separately: the correlations in the case of copulas, and the non-linear feature interactions in the case of GANs. They work with discrete and multimodal distributions combined together, regardless of the underlying distribution (copulas are based on empirical quantiles, although parametric versions are available).
Whether you use a Gaussian, Frank, or Vine copula does not really matter, except when dealing with distributions with very long tails. The same is true with GANs: you use Gaussian distributions for the latent variables regardless of the actual distributions in the real data.
My simulations do as well as synthetizations, how so?
You need to compare the feature dependency structures as well, not just run feature-to-feature (1D) comparisons. Perfect replication of univariate distributions is easy; replicating the cross-interdependencies is the challenging part. See how I measure the quality of the results using the Δavg metric in this article.
Basically, I compute the correlation matrix M1 on the real data and M2 on the synthetic data, then take the entry-wise absolute value of the difference M1 – M2. I call the resulting "absolute difference" matrix Δ, and Δavg is the value averaged over all elements of Δ. So if there are m features, the matrix Δ is m × m and symmetric, the main diagonal is zero, and each element has a value between 0 (great fit) and 1 (bad fit).
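The Δavg computation described above is only a few lines of NumPy; the toy datasets are assumptions for demonstration:

```python
import numpy as np

def delta_avg(real, synthetic):
    """Average entry of the absolute difference between the
    real and synthetic correlation matrices."""
    M1 = np.corrcoef(real, rowvar=False)
    M2 = np.corrcoef(synthetic, rowvar=False)
    delta = np.abs(M1 - M2)   # m x m, symmetric, zero main diagonal
    return delta.mean()

rng = np.random.default_rng(2)
real = rng.normal(size=(1000, 4))
print(delta_avg(real, real))  # identical data -> 0.0
```

Averaging over all entries includes the zero diagonal, which slightly deflates the metric; averaging over off-diagonal entries only is a simple variant.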
Note that this method focuses on bivariate comparisons only, and linear interactions only. There are methods to compare more complex multidimensional interactions. More on this in my book on synthetic data and generative AI (Elsevier, 2024), used as teaching material in the GenAI certification program offered by my AI/ML research lab, and available here.
Sensitivity to changes in the real data
To avoid overfitting, assess the quality of the resulting synthetization using the holdout method. Say 50% of your real data is used to train the GAN or set up the copula, and the remaining 50% (called the validation set) is used for comparison with the synthetic data. This cross-validation technique makes your comparison more meaningful. Sensitivity to the original distribution should not be too large, unless you introduce so much noise when assessing sensitivity that the difference between D1 and D2 becomes larger than the difference between S1 and D1, or between S2 and D2. Here D1 and D2 are the real data before and after adding noise, while S1 and S2 are the corresponding synthetizations.
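The sensitivity check above can be sketched as follows, using a correlation-matrix distance. The bootstrap resamples standing in for S1 and S2 are an assumption; a real test would plug in your GAN or copula output:

```python
import numpy as np

def corr_dist(a, b):
    """Average absolute difference between correlation matrices."""
    return np.abs(np.corrcoef(a, rowvar=False)
                  - np.corrcoef(b, rowvar=False)).mean()

rng = np.random.default_rng(3)
D1 = rng.normal(size=(1000, 3))                   # real data
D2 = D1 + rng.normal(scale=0.1, size=D1.shape)    # real data plus noise

# Stand-in "synthetizations": bootstrap resamples (an assumption;
# replace with actual GAN or copula output)
S1 = D1[rng.integers(0, len(D1), len(D1))]
S2 = D2[rng.integers(0, len(D2), len(D2))]

# Sensitivity is reasonable if corr_dist(D1, D2) stays below
# corr_dist(S1, D1) and corr_dist(S2, D2) for moderate noise
print(corr_dist(D1, D2), corr_dist(S1, D1), corr_dist(S2, D2))
```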
My own GAN is quite sensitive (compared to vendors), but there are ways to reduce this problem, for instance through the choice of loss function. My copula is more robust than the open-source SDV library: SDV is a GAN/copula combo, and SDV Lite (the version tested) uses a fast but poor gradient descent algorithm for the GAN part. Some parameter fine-tuning might be needed to reduce sensitivity. On the circle data, my GAN does better than the copula, with some vendors (YData.ai in particular) doing even much better, and some vendors, including SDV Lite, doing much worse. Wasserstein GANs (WGAN) are an alternative, also designed to avoid mode collapse, which happens when the underlying gradient descent method gets stuck in some local optimum.
To not miss future articles and discover the benefits offered to subscribers only, visit our newsletter sign-up page, here. Subscription is free.
About the Author
Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).
Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, including “Synthetic Data and Generative AI”, available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.