Synthetizing the Insurance Dataset Using Copulas: Towards Better Synthetization

This article is an extract from my book “Synthetic Data and Generative AI”, available here.

In the context of synthetic data generation, I’ve been asked a few times to provide a case study focusing on real-life tabular data used in the finance or health industry. This article fills that gap. The purpose is to generate a synthetic copy of the real data set, preserving the correlation structure and all the statistical distributions attached to it. I went one step further and compared my results with those obtained by one of the most well-known vendors in this market.

I was able to reverse-engineer the technique that they use, and I share all the details in this article. It is actually a lot easier than most people think. Indeed, the core of the method relies on a few lines of Python code, calling four classic functions from the Numpy and Scipy libraries.

Insurance Dataset

The dataset is the popular insurance file shared on Kaggle, consisting of 1,338 observations: small, which makes synthetization harder, not easier. It contains the following features, attached to each individual:

  • Gender
  • Smoking status (yes / no)
  • Region (Northeast and so on)
  • Number of children covered by the insurance policy
  • Age
  • BMI (body mass index)
  • Charges incurred by the insurance company

Synthetized Data: Results

My simulated data is markedly similar to that produced by the vendor, preserving all the statistical distributions and the interactions among the 7 features. Clearly, the output file has all the hallmarks (qualities and defects) of copula-generated data. I also used copulas in my replications, for comparison purposes. My version provides better results, but only because I grouped observations by gender, region and smoking status. So I use a different copula for each group, while the vendor uses a single copula covering all the observations. You would expect the former to work better, as long as the groups are not too small.

The summary statistics computed on the real versus synthetic version of the data are surprisingly similar. So this technique works well. However, in both cases (the vendor’s tests and mine) it is impossible to generate new observations with values outside the range observed in the real data — no matter how many new observations you synthesize. I explain why in my PDF document available from this article. A workaround is to add uncorrelated white noise to each feature: this trick is actually a data synthetization technique in its own right (also preserving the correlation structure) and possibly the simplest one. Another workaround is to extrapolate the quantile functions. You really need to be able to produce observations outside the observed range in order to create rich, useful synthetic data. Otherwise, you won’t be able to really test your machine learning algorithms on truly “new” observations, or add atypical observations in small groups such as minorities or fraudulent transactions.
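The white-noise workaround can be sketched in a few lines. This is a minimal illustration on toy numbers, not the vendor’s code; the 5% noise scale is an assumption you would tune per feature:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for replicability

# toy "real" feature: five observed BMI values (illustrative numbers)
bmi = np.array([18.5, 22.0, 26.4, 30.1, 41.3])

# noise scale: 5% of the feature's standard deviation
# (an assumption; tune it per feature)
sigma = 0.05 * bmi.std()

# uncorrelated Gaussian white noise, added independently to each value;
# because the noise is independent across features, cross-feature
# correlations are essentially preserved, yet values can now fall
# outside the observed [min, max] range
synthetic_bmi = bmi + rng.normal(0.0, sigma, size=bmi.shape)
```

Applied to every feature of the real (or synthetic) file, this alone constitutes a rudimentary synthetizer.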

Comparing real data with two synthetic copies

The real and synthetized data (both the vendor’s version and mine) are in the spreadsheet insurance.xlsx, available here on GitHub. The original insurance data (csv file) is in the same directory. Synthetic 1 corresponds to the vendor’s method, and Synthetic 2 to mine. The fact that it is not possible to generate values outside the range of the real data set (unless you enhance the technique) is visible in the Min and Max rows of the above table.

Method Based on Copulas: Description

Note that the approach is model-free. No assumption is made on the statistical distribution attached to the features. Instead, it is based on empirical quantile functions. The 4-step synthetization procedure is summarized as follows:

  • Step 1: Compute the correlation matrix W associated with your real data set.
  • Step 2: Generate random vectors from a multidimensional Gaussian distribution with zero mean and covariance matrix W.
  • Step 3: Transform these vectors by applying the standard normal CDF transform to each value. Here CDF stands for “cumulative distribution function” (a univariate function).
  • Step 4: Transform the values obtained in step 3 by applying the empirical quantile function Qj to the j-th component of the vector. Do it for each component.

A component corresponds to a feature in the dataset. The empirical quantile function Qj — here a univariate function — is the inverse of the empirical distribution computed on the j-th feature in the real data. The 4-step procedure outlined in this section corresponds to using a Gaussian copula. Other choices are possible. Also, if you want the results to be replicable when running the Python code multiple times or on different platforms, you need to use a static seed for the Numpy random generator. See the code below for the actual implementation. I provide additional explanations in my PDF document, available here on my GitHub repository. The Python code is also on GitHub, in the same folder. The PDF document is actually an extract of my book “Synthetic Data and Generative AI” available here. It is also part of my upcoming course on synthetic data, described here.
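The four steps above can be sketched compactly on toy data. This is a minimal illustration of the Gaussian copula procedure; the two-feature toy data and its parameters are made up for the example, not taken from the insurance file:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)  # static seed for replicable results

# toy "real" data: 500 observations of two correlated features
cov = [[100.0, 4000.0], [4000.0, 4.0e6]]
real = rng.multivariate_normal([40.0, 10000.0], cov, size=500)

# Step 1: correlation matrix W of the real data
W = np.corrcoef(real, rowvar=False)

# Step 2: Gaussian vectors with zero mean and covariance matrix W
g = rng.multivariate_normal(np.zeros(2), W, size=500)

# Step 3: the standard normal CDF maps each value to [0, 1]
u = norm.cdf(g)

# Step 4: apply each feature's empirical quantile function, columnwise
synthetic = np.column_stack(
    [np.quantile(real[:, j], u[:, j]) for j in range(2)]
)
```

Note that step 4 uses np.quantile, which interpolates within the observed values: this is precisely why the synthetic observations cannot fall outside the observed range.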

I plan on developing a Web API where you can upload your dataset and get it automatically synthetized for free, using an enhanced version of the method described here. I already have one in beta mode for synthetic terrain generation (a computer vision problem), creating animated terrains (videos) based on my article published here. You can play with it here. If you need help with synthetic data, contact me at

Towards Better Synthetic Data

Automatically detecting large homogeneous groups — called nodes in decision trees — and using a separate copula for each node is an ensemble technique not unlike boosted trees. In the insurance dataset, I picked these groups manually.

Testing how close your synthetic data is to the real dataset using Hellinger or similar distances is not a good idea: by that criterion, the best synthetic dataset would be an exact replica of your real data, leading to overfitting. Instead, you might want to favor synthetized observations with summary statistics (including the shape of the distribution in high dimensions) closely matching those in the real dataset, but with the worst (rather than best) Hellinger score. This allows you to create richer synthetic data, including atypical observations not found in your training set. Extrapolating empirical quantile functions (as opposed to interpolating only) or adding uncorrelated white noise to each feature (in the real or synthetic data) are two ways to generate observations outside the observed range when using copula-based methods, while keeping the structure present in the real data.
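One way to extrapolate the empirical quantile function is sketched below. It places the sorted observations at plotting positions i/(n+1), so that quantile levels close to 0 or 1 land beyond the observed minimum and maximum, with linear tail extension. This is one illustrative choice among several, not the vendor’s method:

```python
import numpy as np

def extrapolated_quantile(x, u):
    """Empirical quantile function of x at levels u, linearly
    extrapolated beyond the observed min and max. Sorted observations
    sit at plotting positions i/(n+1), so u near 0 or 1 can map
    outside the observed range."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    pos = np.arange(1, n + 1) / (n + 1.0)
    u = np.asarray(u, dtype=float)
    out = np.interp(u, pos, x)  # ordinary interpolation inside the range
    # np.interp clamps outside [pos[0], pos[-1]]: extend the tails
    # linearly, using the slope of the first and last segments
    lo_slope = (x[1] - x[0]) / (pos[1] - pos[0])
    hi_slope = (x[-1] - x[-2]) / (pos[-1] - pos[-2])
    out = np.where(u < pos[0], x[0] - (pos[0] - u) * lo_slope, out)
    out = np.where(u > pos[-1], x[-1] + (u - pos[-1]) * hi_slope, out)
    return out

ages = [19, 23, 31, 40, 52, 60, 64]
# level 0.99 now maps above the observed maximum of 64
print(extrapolated_quantile(ages, [0.5, 0.99]))
```

Substituting this function for np.quantile in step 4 of the copula procedure lets the synthetic data reach beyond the observed range while keeping its correlation structure.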

Python Code

Below is the code, described in more details in the PDF document and in my book. It is also on GitHub, here.

import csv 
from scipy.stats import norm
import numpy as np

filename = 'insurance.csv' # make sure fields don't contain commas
# source:

# Fields: age, sex, bmi, children, smoker, region, charges

with open(filename, 'r') as csvfile:
    reader = csv.reader(csvfile)
    fields = next(reader) # Reads header row as a list
    rows = list(reader)   # Reads all subsequent rows as a list of lists

#-- group by (sex, smoker, region)

groupCount = {}
groupList = {}
for obs in rows:
    group = obs[1] +"\t"+obs[4]+"\t"+obs[5]
    features = (obs[0], obs[2], obs[3], obs[6])  # age, bmi, children, charges
    if group in groupCount:
        cnt = groupCount[group]
        groupList[(group,cnt)] = features
        groupCount[group] += 1
    else:
        groupList[(group,0)] = features
        groupCount[group] = 1

#-- generate synthetic data customized to each group (Gaussian copula)

seed = 453
np.random.seed(seed)  # static seed, for replicable results across runs
for group in groupCount:
    nobs = groupCount[group]
    age = []
    bmi = []
    children = []
    charges = []
    for cnt in range(nobs):
        features = groupList[(group,cnt)]
        age.append(float(features[0]))       # uniform outside very young or very old
        bmi.append(float(features[1]))       # Gaussian distribution?
        children.append(float(features[2]))  # geometric distribution?
        charges.append(float(features[3]))   # bimodal, not gaussian 

    mu  = [np.mean(age), np.mean(bmi), np.mean(children), np.mean(charges)]
    zero = [0, 0, 0, 0] 
    z = np.stack((age, bmi, children, charges), axis = 0)
    # cov = np.cov(z)
    corr = np.corrcoef(z) # correlation matrix for Gaussian copula for this group

    print("\n\nGroup: ",group,"[",nobs,"obs ]\n") 
    print("mean age: %2d\nmean bmi: %2d\nmean children: %1.2f\nmean charges: %2d\n" 
           % (mu[0],mu[1],mu[2],mu[3]))  
    print("correlation matrix:\n")
    print(corr)
    nobs_synth = nobs  # number of synthetic obs to create for this group
    gfg = np.random.multivariate_normal(zero, corr, nobs_synth) 
    g_age = gfg[:,0]
    g_bmi = gfg[:,1]
    g_children = gfg[:,2]
    g_charges = gfg[:,3]

    # generate nobs_synth observations for this group
    print("synthetic observations:\n")
    for k in range(nobs_synth):   
        u_age = norm.cdf(g_age[k])
        u_bmi = norm.cdf(g_bmi[k])
        u_children = norm.cdf(g_children[k])
        u_charges = norm.cdf(g_charges[k])
        s_age = np.quantile(age, u_age)                # synthesized age 
        s_bmi = np.quantile(bmi, u_bmi)                # synthesized bmi
        s_children = np.quantile(children, u_children) # synthesized children
        s_charges = np.quantile(charges, u_charges)    # synthesized charges

        # tab-separated record, ready to be written to an output file
        line = group+"\t"+str(s_age)+"\t"+str(s_bmi)+"\t"+str(s_children)+"\t"+str(s_charges)+"\n"
        print("%3d. %d %d %d %d" %(k, s_age, s_bmi, s_children, s_charges))

About the Author

Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of, former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences (NISS).

Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, including “Intuitive Machine Learning and Explainable AI”, available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.
