If you regularly read my articles, you know that I developed several different techniques for data synthetization. Many are explained in detail in my upcoming book Synthetic Data and Generative AI (Elsevier), available here. The book covers generative adversarial networks (GANs), copulas, agent-based modeling, methods based on interpolation, correlated noise mixtures, and more.
The technique presented here was first tested on time series and then extended to geospatial data, a 2D generalization. It relies on exact multivariate interpolation, and is designed to avoid overfitting. Most other methods, such as kriging, produce smoothed data and are thus not suitable for data synthetization. By contrast, this method preserves the local irregularities and spikes. For instance, it can be used to model chaotic processes or reconstruct full elevation maps when the altitude is known only for a small number of locations.
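To make the term "exact interpolation" concrete, here is a minimal Python sketch. It does not reproduce the method from the book: it uses inverse distance weighting, a classic interpolator chosen here only because it is short and also exact, meaning that it returns the observed value at every training location. The function name `idw_interpolate`, the `power` parameter, and the simulated training data are all assumptions made for illustration.

```python
import numpy as np

def idw_interpolate(xy_train, z_train, xy_query, power=2.0):
    """Inverse distance weighting, used as a stand-in exact interpolator.

    This is NOT the algorithm from the article; it only illustrates that an
    exact interpolator reproduces the training values at training locations.
    """
    xy_train = np.asarray(xy_train, dtype=float)
    z_train = np.asarray(z_train, dtype=float)
    xy_query = np.asarray(xy_query, dtype=float)

    # Pairwise distances: one row per query point, one column per training point
    d = np.linalg.norm(xy_query[:, None, :] - xy_train[None, :, :], axis=2)

    # Query points that coincide with a training point get the observed value
    nearest = d.argmin(axis=1)
    hit = d[np.arange(len(xy_query)), nearest] == 0.0
    z = np.empty(len(xy_query))
    z[hit] = z_train[nearest[hit]]

    # Elsewhere: weighted average, with weights decaying with distance
    w = 1.0 / d[~hit] ** power
    z[~hit] = (w @ z_train) / w.sum(axis=1)
    return z

# Hypothetical training set: 30 scattered locations with noisy values
rng = np.random.default_rng(0)
xy_train = rng.random((30, 2))
z_train = np.sin(6 * xy_train[:, 0]) + rng.normal(0, 0.3, 30)

# Interpolate on a full 50 x 50 grid covering the unit square
gx, gy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid_xy = np.column_stack([gx.ravel(), gy.ravel()])
z_grid = idw_interpolate(xy_train, z_train, grid_xy).reshape(gx.shape)
```

Evaluating `idw_interpolate(xy_train, z_train, xy_train)` returns `z_train`, which is the defining property of an exact interpolator. Between training points, inverse distance weighting is a weighted average and therefore fairly smooth, so it illustrates exactness only, not the preservation of spikes discussed above.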
The algorithm is explained in chapter 9 of my book, and applied to the Chicago temperature dataset consisting of 31 locations. Here I illustrate how it works on much larger datasets. The main novelty is measuring smoothness in higher dimensions. The concept may sound trivial, but in two or three dimensions, no one agrees on the definition, and it is not obvious how to compare the smoothness of two different datasets produced by different algorithms, or covering different geographic areas. I address this issue.
The full technical article with Python code and case study is presented as a project for participants enrolled in my GenAI certification program. It consists of multiple steps to complete. My own solutions are included. To access the document, follow this link and look for the PDF file with the corresponding title. The project textbook (offered exclusively to participants) has clickable links and features all the projects currently available. To learn more about this mentoring and certification program under my guidance, follow this link.
Figure 2 features a function of the second-order gradient, used to estimate the smoothness of the data shown in Figure 1. The training set consisted of a few dozen data points (shown as small circles), and the full grid was interpolated from it to produce the image in question. Because the true value is known everywhere in this example, I was able to assess the accuracy of my interpolation method.
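As a rough illustration of a smoothness measure built on second-order gradients, here is a short Python sketch. The article does not specify which function of the second-order gradient is used, so the choice below (the Frobenius norm of the discrete Hessian at each grid cell) is an assumption, as are the function name `hessian_roughness` and the test surfaces.

```python
import numpy as np

def hessian_roughness(grid, spacing=1.0):
    """Per-cell roughness: Frobenius norm of the finite-difference Hessian.

    One possible function of the second-order gradient (an assumption, not
    necessarily the one used in the article). Larger values mean the surface
    bends more sharply at that location.
    """
    # First-order partial derivatives along rows and columns
    d_row, d_col = np.gradient(grid, spacing)
    # Second-order partial derivatives (including both mixed terms)
    d_rr, d_rc = np.gradient(d_row, spacing)
    d_cr, d_cc = np.gradient(d_col, spacing)
    return np.sqrt(d_rr**2 + d_rc**2 + d_cr**2 + d_cc**2)

# Hypothetical surfaces on a 64 x 64 grid over the unit square
n = 64
x, y = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
h = 1.0 / (n - 1)  # grid spacing

plane = 2 * x + 3 * y                                  # no curvature at all
smooth = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)  # gently curved
rough = smooth + 0.05 * np.random.default_rng(1).normal(size=smooth.shape)

for name, surface in [("plane", plane), ("smooth", smooth), ("rough", rough)]:
    print(name, hessian_roughness(surface, h).mean())
```

The per-cell values can be displayed as an image, and their mean gives a single number per dataset. The mean depends on the units and the grid spacing, so comparing two datasets covering different areas would require some normalization, which this sketch does not attempt.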
NoGAN is a class of synthetization algorithms that do not rely on neural networks (as GANs do) for training. They run much faster, lead to explainable AI, and some produce even better results. My upcoming article “Generative AI Technology Break-through: Spectacular Performance of New, NoGAN Synthesizer” will be the first seminal paper on this topic, following a new trend started with NoSQL, NoCode, and NoMath in other contexts.
To not miss future articles and to discover the benefits offered to subscribers only, visit our newsletter sign-up page, here. Subscription is free.
About the Author
Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).
Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, including “Synthetic Data and Generative AI”, available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.