The post Math-free, Parameter-free Gradient Descent in Python first appeared on Machine Learning Techniques.

I discuss techniques related to the gradient descent method in 2D. The goal is to find the minima of a target function, called the cost function. The values of the function are computed at evenly spaced locations on a grid and stored in memory. Because of this, the approach is not directly based on derivatives, and no calculus is involved: it implicitly uses discrete derivatives, but it is foremost a simple geometric algorithm. The learning parameter typically attached to gradient descent is explicitly specified here: it is equal to the granularity of the mesh and does not need fine-tuning. In addition to gradient descent and ascent, I also show how to build contour lines and orthogonal trajectories, with the exact same algorithm.
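As a minimal sketch of this grid-based idea (my own toy illustration, not the article's exact implementation; the cost function and grid resolution below are arbitrary choices), one can precompute the cost on a mesh and then repeatedly step to the lowest-valued neighboring cell, so the step size is simply the mesh granularity and no learning rate is needed:

```python
import numpy as np

def grid_descent(cost, start, steps=10_000):
    """Math-free gradient descent on a precomputed cost grid:
    repeatedly move to the lowest-cost neighboring cell."""
    i, j = start
    n, m = cost.shape
    for _ in range(steps):
        best = (i, j)
        # inspect the 8 neighbors (clipped to the grid) plus the current cell
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < m and cost[a, b] < cost[best]:
                    best = (a, b)
        if best == (i, j):   # no lower neighbor: local minimum reached
            return (i, j)
        i, j = best
    return (i, j)

# toy cost function sampled on an evenly spaced mesh (arbitrary example)
x = np.linspace(-2, 2, 201)
y = np.linspace(-2, 2, 201)
X, Y = np.meshgrid(x, y, indexing="ij")
Z = (X - 0.5) ** 2 + (Y + 0.25) ** 2   # true minimum near (0.5, -0.25)

i, j = grid_descent(Z, start=(0, 0))
print(x[i], y[j])  # close to 0.5 and -0.25, up to the mesh granularity
```

The precision of the result is limited by the mesh spacing (here 0.02), which is exactly the trade-off described above: granularity plays the role of the learning parameter.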

I apply the method to investigate one of the most famous unsolved problems in mathematics: the Riemann Hypothesis. The functions studied here are defined on the complex plane, but no advanced knowledge of complex calculus is required, as I use the standard 2D plane in my illustrations. I show how the distribution of the minima of |*ζ*(*σ* + *it*)| can be studied by looking at (say) *σ* = 2 rather than *σ* = 1/2. These minima are the non-trivial roots of the Riemann zeta function, and all of them are conjectured to have *σ* = 1/2. It is a lot easier to work with *σ* > 1 due to accelerated convergence. In the process, I introduce synthetic functions with arbitrary infinite Hadamard products (the best-known example is the sine function) to assess non-Dirichlet functions that may behave like *ζ*, gaining more insight into the problem and generalizing it. My presentation is mostly in plain English, accessible to first-year college students.
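To see why *σ* = 2 is pleasant to work with, here is a small sketch (my own, not the article's code) using the mpmath library: the series for *ζ* converges fast at *σ* = 2, and a simple scan of |*ζ*(2 + *it*)| reveals its local minima. At *σ* = 2 the Euler product also gives hard bounds, π²/15 ≤ |*ζ*(2 + *it*)| ≤ π²/6, which the scan should respect:

```python
import numpy as np
from mpmath import mp, zeta

mp.dps = 15  # modest precision is enough at sigma = 2

# scan |zeta(2 + it)| on a fine grid of t values
t = np.arange(10.0, 20.0, 0.05)
f = np.array([float(abs(zeta(complex(2, tk)))) for tk in t])

# record interior local minima: dips of the modulus along the vertical line
minima = [float(t[k]) for k in range(1, len(t) - 1)
          if f[k] < f[k - 1] and f[k] < f[k + 1]]
print(minima)
```

The same scan at *σ* = 1/2 would require far more terms (or analytic continuation tricks) per evaluation, which is the accelerated-convergence advantage mentioned above.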


- Introduction
- Gradient descent and related optimization techniques
  - Implementation details
  - General comments about the methodology and parameters
  - Mathematical version of gradient descent and orthogonal trajectories
- Distribution of minima and the Riemann Hypothesis
  - Root taxonomy
  - Studying root propagation with synthetic math functions
- Python code
  - Contours and orthogonal trajectories
  - Animated gradient descent starting with 100 random points

The technical article, entitled *Math-free, Parameter-free Gradient Descent in Python*, is accessible in the “Free Books and Articles” section, here. It contains links to my GitHub files, making it easy to copy and paste the code. The text highlighted in orange in this PDF document consists of keywords that will be incorporated in the index when I aggregate all my related articles into books about machine learning, visualization and Python, similar to these ones. The text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links, pointing to a section, bibliography entry, equation, and so on.

*To not miss future articles, sign up for our newsletter, here.*

Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).

Vincent published in *Journal of Number Theory*, *Journal of the Royal Statistical Society* (Series B), and *IEEE Transactions on Pattern Analysis and Machine Intelligence*. He is also the author of multiple books, including “Intuitive Machine Learning and Explainable AI”, available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.


The post New Interpolation Methods for Data Synthetization and Prediction first appeared on Machine Learning Techniques.

I describe little-known, original interpolation methods with applications to real-life datasets. These simple techniques are easy to implement and can be used for regression or prediction, offering an alternative to model-based statistical methods. Applications include interpolating ocean tides at Dublin, predicting temperatures in the Chicago area with geospatial data, and a problem in astronomy: planet alignments and the frequency of these events. In one example, the 5-minute data can be replaced by 80-minute measurements, with the 5-minute increments reconstructed via interpolation without noticeable loss. Thus, my algorithm can be used for data compression.
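The article's own interpolation method is not reproduced in this summary, but the compression claim is easy to sanity-check with a standard baseline. Below, a cubic spline (an assumption on my part; the author's technique is Fourier-flavored) reconstructs a 5-minute tide-like series from 80-minute samples:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# synthetic 5-minute tide-like series over one day (two periodic components)
t5 = np.arange(0, 24 * 60 + 1, 5)   # minutes, 289 samples
y5 = np.sin(2 * np.pi * t5 / 745) + 0.3 * np.sin(2 * np.pi * t5 / 360)

# keep only every 16th sample, i.e. 80-minute measurements ...
t80, y80 = t5[::16], y5[::16]

# ... then reconstruct the full 5-minute grid from the coarse series
y_rec = CubicSpline(t80, y80)(t5)
print(np.max(np.abs(y_rec - y5)))   # maximum reconstruction error
```

For smooth tide-like signals the reconstruction error is tiny compared to the signal amplitude, which is the "no noticeable loss" compression effect described above.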

The first technique has strong ties to Fourier methods. In addition to the above applications, I show how it can be used to efficiently interpolate complex mathematical functions such as Bessel and Riemann zeta. For those familiar with MATLAB or Mathematica, this is an opportunity to play with the mpmath library in Python and see how it compares with the traditional tools in this context. In the process, I also show how the methodology can be used to generate synthetic data, be it time series or geospatial data.

Depending on the parameters, in the geospatial context, the interpolation is either close to nearest-neighbor methods, to kriging (also known as Gaussian process regression), or a truly original hybrid mix of additive and multiplicative techniques. There is an option not to interpolate at locations far away from the training set, where regression or interpolation results may be meaningless regardless of the technique used. Another application is detecting the full extent of an oil field after drilling only a dozen wells. Likewise, the temperature dataset has only a few stations with an actual measurement, and the goal is to obtain interpolated values fully covering a specific area.
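For intuition only, here is a bare-bones spatial interpolator in the nearest-neighbor family (inverse-distance weighting, not the article's hybrid method), including the option of refusing to interpolate far from the training set:

```python
import numpy as np

def idw(train_xy, train_z, query_xy, power=2.0, max_dist=None):
    """Inverse-distance weighting, with an optional refusal radius."""
    out = []
    for q in query_xy:
        d = np.linalg.norm(train_xy - q, axis=1)
        if max_dist is not None and d.min() > max_dist:
            out.append(np.nan)   # too far from any training point: refuse
        elif d.min() == 0.0:
            out.append(train_z[np.argmin(d)])   # exact hit on a station
        else:
            w = 1.0 / d ** power
            out.append(np.sum(w * train_z) / np.sum(w))
    return np.array(out)

# toy "stations": the measured field is a smooth function of location
rng = np.random.default_rng(3)
xy = rng.uniform(0, 1, size=(200, 2))
z = np.sin(3 * xy[:, 0]) + np.cos(3 * xy[:, 1])

# one query inside the covered area, one far outside it
queries = np.array([[0.5, 0.5], [5.0, 5.0]])
pred = idw(xy, z, queries, max_dist=0.2)
print(pred)   # a numeric estimate, then nan for the faraway location
```

The `max_dist` cutoff implements the "do not interpolate far from the training set" option: better no answer than a meaningless one.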

The second technique is based on ordinary least squares — the same method used to solve polynomial or multivariate regression — but instead of highly unstable polynomials that lead to overfitting, I focus on generic functions that avoid these pitfalls, using an iterative greedy algorithm to find the optimum. In particular, a solution based on orthogonal functions leads to an especially simple implementation, with a direct and elegant solution.
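As a generic illustration of the idea (my own toy example; the article's choice of basis functions may differ), ordinary least squares on a cosine basis stays well conditioned where a high-degree polynomial fit would not:

```python
import numpy as np

# noisy samples of an unknown smooth function on [0, 1]
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.exp(-3 * x) * np.sin(6 * x) + 0.01 * rng.normal(size=x.size)

# least squares on a cosine basis: nearly orthogonal columns on an even
# grid, unlike the powers 1, x, x**2, ... used in polynomial regression
K = 8
A = np.stack([np.cos(np.pi * k * x) for k in range(K)], axis=1)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fit = A @ coef
print(np.sqrt(np.mean((fit - y) ** 2)))   # small RMS residual
```

With an orthogonal basis, each coefficient can even be obtained directly from an inner product, which is the "direct and elegant solution" alluded to above.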

- Introduction
- First method
  - Example with infinite summation
  - Applications: ocean tides, planet alignment
  - Problem in two dimensions
  - Spatial interpolation of the temperature dataset
- Second method
  - From unstable polynomials to robust orthogonal regression
  - Using orthogonal functions
  - Application to regression
- Python code
  - Time series interpolation
  - Geospatial temperature dataset
  - Regression with Fourier series

The technical article, entitled *New Interpolation Methods for Synthetization and Prediction*, is accessible in the “Free Books and Articles” section, here. It contains links to my GitHub files, making it easy to copy and paste the code. The same color-coding conventions (orange for index keywords, blue for external links, red for internal links) apply as in my other articles.




The post Synthetizing the Insurance Dataset Using Copulas: Towards Better Synthetization first appeared on Machine Learning Techniques.

In the context of synthetic data generation, I have been asked a few times to provide a case study focusing on real-life tabular data used in the finance or health industry. This article fills that gap. The purpose is to generate a synthetic copy of the real dataset, preserving the correlation structure and all the statistical distributions attached to it. I went one step further and compared my results with those obtained with one of the most well-known vendors in this market: Mostly.ai.

I was able to reverse-engineer the technique that they use, and I share all the details in this article. It is actually a lot easier than most people think. Indeed, the core of the method relies on a few lines of Python code, calling four classic functions from the NumPy and SciPy libraries.

The dataset is the popular insurance file shared on Kaggle, consisting of 1338 observations: small, but that makes synthetization more difficult, not easier. It consists of the following features, attached to each individual:

- Gender
- Smoking status (yes / no)
- Region (Northeast and so on)
- Number of children covered by the insurance policy
- Age
- BMI (body mass index)
- Charges incurred by the insurance company

My simulated data is remarkably similar to that produced by Mostly.ai, preserving all the statistical distributions and the interactions among the seven features. Clearly, the Mostly.ai output file has all the hallmarks (qualities and defects) of copula-generated data. I also used copulas in my replications, for comparison purposes. My version provides better results, but only because I grouped observations by gender, region and smoking status, using a different copula for each group, while Mostly.ai uses a single copula covering all the observations. You would expect the former to work better as long as the groups are not too small.

The summary statistics computed on the real versus synthetic version of the data are surprisingly similar, so this technique works well. However, in both cases (Mostly.ai and my tests) it is impossible to generate new observations with values outside the range observed in the real data, no matter how many new observations you synthesize. I explain why in my PDF document available from this article. A workaround is to add uncorrelated white noise to each feature: this trick is actually a data synthetization technique in its own right (also preserving the correlation structure), and possibly the simplest one. Another workaround is to extrapolate the quantile functions. You really need to be able to produce observations outside the observed range in order to create rich, useful synthetic data. Otherwise, you won’t be able to test your machine learning algorithms on truly “new” observations, or add atypical observations to small groups such as minorities or fraudulent transactions.
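The white-noise workaround can be sketched in a few lines (a toy example with simulated “real” data, not the insurance dataset): perturbing each feature with independent noise roughly preserves the correlation structure while letting values escape the observed range.

```python
import numpy as np

rng = np.random.default_rng(7)

# toy "real" data: two correlated features
n = 10_000
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)   # corr(x, y) = 0.8 in theory
real = np.stack([x, y], axis=1)

# noise injection: independent noise, scaled to 20% of each feature's std
synth = real + rng.normal(size=real.shape) * (0.2 * real.std(axis=0))

# the correlation structure is approximately preserved ...
print(np.corrcoef(real.T)[0, 1], np.corrcoef(synth.T)[0, 1])
# ... and synthetic values can now fall outside the observed range
print(synth.min(axis=0), real.min(axis=0))
```

The small shrinkage in correlation (a factor of roughly 1/(1 + 0.2²) per feature here) is the price paid for richer, out-of-range observations; the noise scale is a tuning knob.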

The real and synthetized data (both Mostly.ai and my method) are in the spreadsheet `insurance.xlsx`, available here on GitHub. The original insurance data (csv file) is in the same directory. Synthetic 1 corresponds to Mostly.ai, and Synthetic 2 to my method. The fact that it is not possible to generate values outside the range of the real dataset (unless you enhance the technique) is visible in the Min and Max rows of the above table.

Note that the approach is model-free. No assumption is made on the statistical distribution attached to the features. Instead, it is based on empirical quantile functions. The 4-step synthetization procedure is summarized as follows:

- Step 1: Compute the correlation matrix *W* associated with your real data set.
- Step 2: Generate random vectors from a multidimensional Gaussian distribution with zero mean and covariance matrix *W*.
- Step 3: Transform these vectors by applying the standard normal CDF transform to each value. Here CDF stands for “cumulative distribution function” (a univariate function).
- Step 4: Transform the values obtained in step 3 by applying the empirical quantile function *Q _{j}* to the *j*-th component of the vector. Do it for each component.

A component corresponds to a feature in the dataset. The empirical quantile function *Q _{j}* — here a univariate function — is the inverse of the empirical distribution computed on the *j*-th feature of the real data.

I plan on developing a Web API where you can upload your dataset and get it automatically synthetized for free, using an enhanced version of the method described here. I already have one in beta mode for synthetic terrain generation (a computer vision problem), creating animated terrains (videos) based on my article published here. You can play with it here. If you need help with synthetic data, contact me at vincentg@MLTechniques.com.

Automatically detecting large homogeneous groups — called nodes in decision trees — and using a separate copula for each node is an ensemble technique not unlike boosted trees. In the insurance dataset, I picked these groups manually.

Testing how close your synthetic data is to the real dataset using Hellinger or similar distances is not a good idea: the best synthetic dataset is the exact replica of your real data, leading to overfitting. Instead, you might want to favor synthetized observations with summary statistics (including the shape of the distribution in high dimensions) closely matching those in the real dataset, but with the worst (rather than best) Hellinger score. This allows you to create richer synthetic data, including atypical observations not found in your training set. Extrapolating empirical quantile functions (as opposed to interpolating only) or adding uncorrelated white noise to each feature (in the real or synthetic data) are two ways to generate observations outside the observed range when using copula-based methods, while keeping the structure present in the real data.

Below is the code, described in more detail in the PDF document and in my book. It is also on GitHub, here.

```
import csv
import numpy as np
from scipy.stats import norm

filename = 'insurance.csv'  # make sure fields don't contain commas
# source: https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset
# Fields: age, sex, bmi, children, smoker, region, charges

with open(filename, 'r') as csvfile:
    reader = csv.reader(csvfile)
    fields = next(reader)  # reads header row as a list
    rows = list(reader)    # reads all subsequent rows as a list of lists

#-- group by (sex, smoker, region)

groupCount = {}
groupList = {}
for obs in rows:
    group = obs[1] + "\t" + obs[4] + "\t" + obs[5]
    if group in groupCount:
        cnt = groupCount[group]
        groupList[(group, cnt)] = (obs[0], obs[2], obs[3], obs[6])
        groupCount[group] += 1
    else:
        groupList[(group, 0)] = (obs[0], obs[2], obs[3], obs[6])
        groupCount[group] = 1

#-- generate synthetic data customized to each group (Gaussian copula)

seed = 453
np.random.seed(seed)
OUT = open("insurance_synth.txt", "w")

for group in groupCount:
    nobs = groupCount[group]
    age = []
    bmi = []
    children = []
    charges = []
    for cnt in range(nobs):
        features = groupList[(group, cnt)]
        age.append(float(features[0]))       # uniform outside very young or very old
        bmi.append(float(features[1]))       # Gaussian distribution?
        children.append(float(features[2]))  # geometric distribution?
        charges.append(float(features[3]))   # bimodal, not Gaussian

    mu = [np.mean(age), np.mean(bmi), np.mean(children), np.mean(charges)]
    zero = [0, 0, 0, 0]
    z = np.stack((age, bmi, children, charges), axis=0)
    # cov = np.cov(z)
    corr = np.corrcoef(z)  # correlation matrix for the Gaussian copula, for this group

    print("------------------")
    print("\n\nGroup: ", group, "[", nobs, "obs ]\n")
    print("mean age: %2d\nmean bmi: %2d\nmean children: %1.2f\nmean charges: %2d\n"
          % (mu[0], mu[1], mu[2], mu[3]))
    print("correlation matrix:\n")
    print(corr, "\n")

    nobs_synth = nobs  # number of synthetic obs to create for this group
    gfg = np.random.multivariate_normal(zero, corr, nobs_synth)
    g_age = gfg[:, 0]
    g_bmi = gfg[:, 1]
    g_children = gfg[:, 2]
    g_charges = gfg[:, 3]

    # generate nobs_synth observations for this group
    print("synthetic observations:\n")
    for k in range(nobs_synth):
        u_age = norm.cdf(g_age[k])
        u_bmi = norm.cdf(g_bmi[k])
        u_children = norm.cdf(g_children[k])
        u_charges = norm.cdf(g_charges[k])
        s_age = np.quantile(age, u_age)                 # synthesized age
        s_bmi = np.quantile(bmi, u_bmi)                 # synthesized bmi
        s_children = np.quantile(children, u_children)  # synthesized children
        s_charges = np.quantile(charges, u_charges)     # synthesized charges
        line = group + "\t" + str(s_age) + "\t" + str(s_bmi) + "\t" \
            + str(s_children) + "\t" + str(s_charges) + "\n"
        OUT.write(line)
        print("%3d. %d %d %d %d" % (k, s_age, s_bmi, s_children, s_charges))

OUT.close()
```



The post Military-grade Fast Random Number Generator Based on Quadratic Irrationals first appeared on Machine Learning Techniques.

There are very few serious articles in the literature that use the digits of irrational numbers to build a pseudo-random number generator (PRNG). It seems that this idea was abandoned long ago due to computational complexity, and to the misconception that such PRNGs are deterministic while others are not. Actually, my new algorithm is less deterministic than the congruential PRNGs currently used in most applications. New developments have made this concept of using irrational numbers worth revisiting. I believe that my quadratic irrational PRNG debunks all the myths previously associated with such methods.

Thanks to new developments in number theory, quadratic irrational PRNGs — the name attached to the technique presented here — are not only just as fast as standard generators, but also offer a higher level of randomness. Thus, they represent a serious alternative for data encryption, heavy simulation, or synthetic data generation, when you need billions or trillions of truly random-like numbers. In particular, a version of my algorithm computes hundreds (or millions) of digits for billions of irrational numbers at once. It combines these digits to produce large datasets of strong random numbers with well-known properties. The fast algorithm can easily be implemented in a distributed architecture, making it even faster. It is also highly portable and great to use when exact replicability is critical: standard generators may not lead to the same results depending on which programming language or which version of Python you use, even if your seed is static.
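The fast algorithm itself is not disclosed in this summary, but the raw ingredient, exact digits of a quadratic irrational, is easy to compute with integer arithmetic alone. The sketch below (my own, using Python's `math.isqrt`, not the article's method) extracts binary digits of √n:

```python
from math import isqrt

def sqrt_digits(n, bits):
    """Binary digits of the fractional part of sqrt(n), n a non-square
    integer, computed with exact (arbitrary-precision) integer arithmetic."""
    digits = []
    for k in range(1, bits + 1):
        # floor(2**k * sqrt(n)) = isqrt(n * 4**k); its parity is the
        # k-th binary digit after the binary point
        digits.append(isqrt(n << (2 * k)) & 1)
    return digits

print(sqrt_digits(2, 20))   # binary expansion of sqrt(2) - 1
```

Because Python integers are arbitrary-precision, the same function produces thousands of digits per number; the cost per digit grows with the precision, which is where the article's fast algorithm and distributed implementation come in.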

With quadratic irrationals, the “seed” (initial conditions) is actually a sequence of bivariate parameters, similar to regular re-seeding in standard generators. When seeds are hardware-generated, this leads to random numbers fit for strong encryption. This new PRNG was built after designing advanced tests of randomness that identified flaws in popular PRNGs used in many applications, including the Mersenne twister — once thought unbreakable, and the default random number generator in Python. Of course, unlike most other generators, quadratic irrational PRNGs all have an infinite period. However, this is not their main selling point. The randomness tests in question include the new prime test discussed in this article.

I produced two sets of 1.5 million digits: one with the quadratic irrational PRNG, and one with the Mersenne twister. You can perform various tests on this dataset, and compare the results from both methods. Of course, using my Python code, you can produce as many digits as you want, play with the parameters and try any seed sequence that you want. I plan on sharing a data set with more than one trillion digits in the near future.

*Get the full 5-page article (PDF) with Python code and all the details on my GitHub repository, here. For free, no sign-up required.*


The post Empirical Optimization with Divergent Fixed Point Algorithm – When All Else Fails first appeared on Machine Learning Techniques.

While the technique discussed here is a last-resort solution when all else fails, it is more powerful than it seems at first glance. It also works in standard cases with “nice” functions, although better methods exist when the function behaves nicely, taking advantage of its differentiability, such as the Newton algorithm (itself a fixed-point iteration). The technique can be generalized to higher dimensions, though I focus on univariate functions here.
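To make the fixed-point connection concrete (a textbook example, not the chapter's divergent variant): Newton's method for √2 is the fixed-point iteration below.

```python
# Newton's method is itself a fixed-point iteration x <- g(x), with
# g(x) = x - f(x)/f'(x); here f(x) = x**2 - 2, so g(x) = x/2 + 1/x
def fixed_point(g, x0, iters=50):
    x = x0
    for _ in range(iters):
        x = g(x)
    return x

root = fixed_point(lambda x: x / 2 + 1 / x, x0=1.0)
print(root)  # converges to sqrt(2)
```

The divergent version studied in the chapter uses the same skeleton, but with a map g whose iterates never settle; the signal of interest is the near miss, not the limit.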

Its most attractive features are that it is simple and intuitive, and that it quickly leads to a solution despite the absence of convergence. However, it is an empirical method and may require trying different parameter sets to actually find a solution. Still, it can be turned into a black-box solution by automatically testing different parameter configurations. In that respect, I compare it to the empirical elbow rule used to detect the number of clusters in unsupervised clustering problems. I also turned the elbow rule into a fully automated black-box procedure, with full details offered in the same book.

Why would anyone be interested in an algorithm that never converges to the solution you are looking for? This version of the fixed-point iteration, when approaching a zero or an optimum, emits a strong signal and allows you to detect a small interval likely to contain the solution: the zero or global optimum in question. It may approach the optimum quite well, but subsequent iterations do not lead to convergence: the algorithm eventually moves away from the optimum, or oscillates around the optimum without ever reaching it.

The fixed-point iteration is the mother of all optimization and root-finding algorithms. In particular, all gradient-based optimization techniques are a particular version of this generic method. In this chapter, I use it in a very challenging setting. The target function may not be differentiable or may have a very large number of local minima and maxima. All the standard techniques fail to detect the global optima. In this case, even the fixed-point method diverges. However, somehow, it can tell you the location of a global optimum with a rather decent precision. Once an approximation is obtained, the method can be applied again, this time focusing around a narrow interval containing the solution to achieve higher precision. Also, this method is a lot faster than brute force such as grid search.

I first illustrate the method on a specific problem. Then, generating synthetic data that emulates and generalizes the setting of the initial problem, I illustrate how the method performs on different functions or data sets. The purpose is to show how synthetic data can be used to test and benchmark algorithms, or to understand when they work, and when they don’t. This, combined with the intuitive aspects of my fixed-point iteration, illustrates a particular facet of explainable AI. Finally, I use a smoothing technique to visualize the highly chaotic functions involved here. It highlights the features of the functions that we are interested in, while removing the massive noise that makes these functions almost impossible to visualize in any meaningful way.
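The smoothing step can be emulated with a plain centered moving average (an assumption on my part; the article's smoothing technique may be more sophisticated):

```python
import numpy as np

def smooth(y, window):
    """Centered moving average: removes high-frequency noise while
    keeping the low-frequency features of a chaotic curve."""
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="same")

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 2000)
noisy = np.sin(t) + rng.normal(scale=1.0, size=t.size)  # signal buried in noise
clean = smooth(noisy, window=101)

# away from the edges, the smoothed curve tracks the hidden signal
core = slice(200, -200)
print(np.sqrt(np.mean((clean[core] - np.sin(t[core])) ** 2)))
```

Averaging over a window of w points divides the noise standard deviation by roughly √w, at the cost of slightly attenuating features narrower than the window; `mode="same"` distorts the edges, hence the trimmed comparison.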

- Introduction
- The problem, with illustration
  - Non-converging fixed-point algorithm
  - Trick leading to intuitive solution
  - Root detection: method and parameters
  - Case study with conclusions
- Generalization with synthetic data
  - Example
  - Connection to the Poisson-binomial distribution
  - Location of next root: guesstimate
  - Integer sequences with high density of primes
  - Python code: finding the optimum
- Smoothing highly chaotic curves
  - Python code: smoothing

The technical article, entitled *Empirical Optimization with Divergent Fixed Point Algorithm – When All Else Fails*, is accessible in the “Free Books and Articles” section, here. It contains links to my GitHub files, making it easy to copy and paste the code. The same color-coding conventions (orange for index keywords, blue for external links, red for internal links) apply as in my other articles.




The post Podcast: Synthetic Data and Generative AI – Importance, Misconceptions and Applications first appeared on Machine Learning Techniques.

In this video, Vincent talks about how synthetic data can be leveraged across various industries to enhance predictions and test black-box systems, leading to more fairness and transparency in AI. Hosted by Victor Chima, co-founder at Learncrunch.com. Topics discussed include:

- How synthetic data differs from simulated data
- How to create high quality synthetic data
- How to measure quality, and why popular metrics (Hellinger score) should be avoided
- How synthetic data contributes to bias reduction and explainable AI
- Best practices to generate synthetic data, avoiding pitfalls
- Overview of current techniques: GAN, GMM, copulas, agent-based modeling, noise injection
- Applications: computer vision, time series, financial data, NLP, tabular data
- Synthetic data for benchmarking, data augmentation, imputation, and building confidence regions
- Case study: the Kaggle insurance dataset

To learn more about synthetic data and generative AI, follow this link. Vincent also offers classes on synthetic data, here. The course is based on the book with the same title, available here.

Vincent Granville created Data Science Central (acquired by TechTarget), one of the most popular online communities for Data Science and Machine Learning. He spent over 20 years in the corporate world at Microsoft, eBay, Visa, Wells Fargo, and others, holds a Ph.D. in Mathematics and Statistics, and is a former post-doc at the University of Cambridge. He is now CEO at MLTechniques.com, a private research lab focusing on machine learning technologies, especially synthetic data and explainable AI.

Vincent published in *Journal of Number Theory*, *Journal of the Royal Statistical Society* (Series B), and *IEEE Transactions on Pattern Analysis and Machine Intelligence*. He authored multiple books, including “Synthetic Data and Explainable AI”, available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.


The post Course: Synthetic Data and Interpretable Machine Learning first appeared on Machine Learning Techniques.

The performance of machine learning algorithms such as classification, clustering, regression, decision trees or neural networks can be significantly improved with synthetic data. It enriches training sets, allowing you to make predictions or assign labels to new observations that are significantly different from those in your dataset.

Synthetic data is especially useful when your training set is small or unbalanced. It also allows you to test the limits of your algorithms and find examples where they fail (for instance, failing to identify spam), to deal with missing data, or to create confidence regions for parameters. I will show how to design rich, good-quality synthetic data to meet all these goals. In particular, I illustrate how to rebalance datasets with synthetic data when some categories have very few observations (as in fraud detection or clinical trials), how to remove biases by including good-quality synthesized minority individuals in your data, and how to anonymize your data to boost security and comply with privacy laws.

In this course taught by Dr. Vincent Granville, you will learn how to create your own synthetic data in Python. One example involves a real-life insurance dataset: using copulas, you will be able to create an alternate (synthetic) dataset that closely matches the distribution of the observations in your training set – including all the correlations – and you will see why someone would want to do that. Other examples include computer vision, time series, and animated datasets, for instance agent-based modeling and evolutionary processes (such as virus spreading), where you will also learn how to create insightful data videos in Python.

Dr. Granville is consistently ranked by various media outlets as one of the top machine learning scientists in the world.

Participants should be familiar with Python or other scripting languages. Foundations in matrix algebra, time series, calculus and optimization are especially useful. However, the course is unusually light in mathematics and especially statistics, as the instructor has spent years simplifying many methods and explaining advanced concepts in simple English.

This course is ideally suited to professionals with an analytic background. This includes data scientists, machine learning practitioners, engineers, software developers, analysts / business analysts, economists, quants, statisticians, scientists, and anyone dealing with data on a regular basis, whether as an individual contributor or in a senior role. Emphasis is on quick acquisition of key concepts, learning how to learn, and solving problems up to professional implementation and testing in Python.

Master a number of techniques to generate and test rich synthetic data, and be able to quickly grasp future developments on this topic. Be able to complete enterprise-grade projects from beginning to end, ranging from regression to computer vision. Learn how to learn, and become independent in solving future problems. Tasks performed during the training include writing Python code and using Python libraries, modeling and testing using cross-validation methods, implementing model-free techniques, feature and model selection, testing black-box systems using synthetic data, and state-of-the-art data animations (including data videos and sound) to present your results. Successful completion of the four modules comes with a personal recommendation (endorsement) on LinkedIn.

The course is split into four modules.

**Module 1: Introduction**

What are synthetic data, generative models, explainable AI, and augmented data? What are the benefits and limitations? Outlined applications:

- Terrain generation, morphing and evolution, see Web API here
- Curve fitting: estimating the shape of a meteorite with model-free confidence regions, for meteorite classification
- Time series with double periodicity mimicking ocean tides
- Synthetic tabular data with prespecified correlation matrix
- Synthetic data to test or benchmark algorithms
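The double-periodicity time series in the list above can be sketched generically as the sum of two sinusoids, with periods loosely chosen to mimic the semidiurnal (~12.42 h) and diurnal (~24 h) tidal constituents. The amplitudes, phases, sampling rate, and noise level below are illustrative assumptions, not the course's actual model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hourly observations over 30 days
t = np.arange(0, 30 * 24, 1.0)   # time in hours

# Two periodic components, loosely mimicking tidal constituents:
# a semidiurnal cycle (~12.42 h) and a diurnal cycle (~24 h)
semidiurnal = 1.0 * np.sin(2 * np.pi * t / 12.42)
diurnal = 0.4 * np.sin(2 * np.pi * t / 24.0 + 0.8)
noise = 0.1 * rng.standard_normal(t.size)

series = semidiurnal + diurnal + noise
```

A spectral analysis of `series` recovers both periods: the dominant peak of its Fourier spectrum sits at the semidiurnal frequency, with a secondary peak at the diurnal one.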

The next modules offer a deep dive into many of the topics summarized here. In Module 2, I discuss explainable ML techniques that will be used in Modules 3 and 4, which deal with the generation of synthetic data.

**Module 2: Interpretable Machine Learning**

Some of the techniques presented here are used in the next two modules focusing on synthetic data. Before diving into these techniques, I discuss data cleaning automation, data animation (data videos), and simplicity (illustrated by a case study: marketing attribution without math). The new machine learning techniques introduced include:

- Generic unsupervised regression: covers all regression techniques and more, including an alternative to K-means
- Time series with double periodicity (mimicking ocean tides)
- Interpretable regression
- Simplified ensemble method, an alternative to XGBoost
- Superimposed spatial point processes, an alternative to GMM and GAN

**Module 3: Synthetic Data in Computer Vision**

In this module I cover terrain generation, including 3D contour plots, and emulation of GPU clustering with techniques similar to deep neural networks. Depending on the interest of participants, I may cover shape generation or other evolutionary processes, such as synthetic star clusters to understand the possible evolution of our universe. I also discuss nearest neighbor and collision graphs (such as this one), all synthetically generated. Some of the synthetic data videos that you will be able to produce can be seen here and here.

**Module 4: Tabular Data Generation**

This type of data is traditionally used in the banking, insurance, and finance industries. Synthetic data has become very popular in this sector, as it helps reduce discrimination and algorithm bias, and contributes to the protection of personal data, explainable AI, and compliance with various regulations. In this module we will build a synthetic data set with a prespecified correlation matrix, such as those estimated on real-life data sets.
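One standard way to impose a prespecified correlation matrix is to multiply independent standard normals by a Cholesky factor of the target matrix. This is a generic sketch of the technique, not necessarily the exact method used in the module; the target matrix below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# Target (prespecified) correlation matrix -- an illustrative example
target = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])

# Cholesky factor L satisfies L @ L.T == target
L = np.linalg.cholesky(target)

# Independent standard normals, then impose the correlation structure
z = rng.standard_normal((100_000, 3))
data = z @ L.T

# Empirical correlations of the synthetic data match the target
observed = np.corrcoef(data, rowvar=False)
```

With 100,000 rows, the sampling error on each correlation is on the order of 0.003, so `observed` agrees with `target` to about two decimal places. The same trick extends to non-Gaussian marginals via the copula approach discussed earlier in the course description.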

These testimonials pertain to the training material published by the author.

- Jackson Andreas Pola — Hello Vincent, I find all the materials you shared on your website extremely useful. I will share this with my colleagues who started their journey in machine learning. Again thank you for being connected on LinkedIn. Kind regards, Jackson
- Mohammed Alshahrani — Thanks Vincent always your materials are supportive. Most of my students used to review your online materials. You might not know but frankly your impact is very noticeable specially for low-income University students.
- Isabel Marín — Very interesting your last article “The sound that the data make”. Would you be interested, once I have introduced my students to the basics, in participating in one of the classes online? Showing them your work.
- Milan McGraw — Thank you Vincent, I appreciate your operational excellence and resources. You are an invaluable resource to the community!

Vincent Granville is a pioneering data scientist and machine learning expert, founder of MLTechniques.com, co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).

Vincent published in *Journal of Number Theory*, *Journal of the Royal Statistical Society* (Series B), and *IEEE Transactions on Pattern Analysis and Machine Intelligence*. He is also the author of multiple books, available here. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.

The post Course: Synthetic Data and Interpretable Machine Learning first appeared on Machine Learning Techniques.

The post New Book: Synthetic Data and Generative AI first appeared on Machine Learning Techniques.

This book is the culmination of years of research on the topic by the author. Emphasis is on methodological aspects and original contributions, favoring simplicity. This document integrates all the material from the previous book, *Intuitive Machine Learning and Explainable AI*, and it also contains all but the most advanced math from the book on stochastic simulations. The author has also added more recent advances, with applications to insurance data synthesized with copulas, terrain generation (with animated data), synthetic universes, and experimental math. The latter is an infinite source of synthetic data to build and benchmark new machine learning techniques. Conversely, mathematics benefits from these techniques to uncover new insights related to the most famous unsolved math problems. Chapter 14, on the Riemann Hypothesis, illustrates this point with new state-of-the-art research results on the subject.

Topics cover computer vision, natural language processing, tabular data, time series, geospatial and sound data, supervised classification, clustering, agent-based modeling, generative models, nearest neighbors and collision graphs, data-driven inference, prediction (all regression techniques are unified under a single, easy-to-understand method), deep neural networks, modeling without response (unsupervised regression such as circle or curve fitting), constrained optimization, and more. The author introduces a simple alternative to XGBoost, one of the most efficient ensemble methods; it is applied to an NLP problem — categorizing and ranking articles and blog posts to predict future performance. When needed, modern or new statistical learning techniques are introduced: dual confidence regions, a new test of independence, parametric bootstrap, the Rayleigh test, distribution-free logistic regression, proxy estimation and minimum contrast estimators, as well as a new prime test for strong pseudo-random number generators.

About 15% of the content is well-documented Python code. The code is also on GitHub, spread across multiple top-level folders, and unified for the first time in this book. It constitutes a solid introduction to scientific computing.

Version 3.0 published in January 2023. Author and publisher: Vincent Granville, Ph.D., founder of the private, self-funded machine learning research lab MLTechniques.com.

The book is available in PDF format (272 pages), with numerous high-quality color illustrations and clickable links to fundamental concepts described on Wikipedia, should you ever need a refresher on the basics. You can view it, for instance, in the Chrome browser: press Ctrl-O and select the book to access all the navigation features and follow the links in the book with one click. To view the table of contents, list of figures and tables, bibliography, glossary, and index, follow this link.

**To purchase the book**, follow this link.

The post Spectacular Videos: Synthetic Universes, with Star Collision Graph first appeared on Machine Learning Techniques.

This project started as an attempt to generate simulations for the three-body problem in astronomy: studying the orbits of three celestial bodies subject to their gravitational interactions. There are many illustrations available online, and after some research, I was intrigued by Philip Mocz’s version of the N-body problem: the generalization involving an arbitrary number of celestial bodies. These bodies are referred to as stars in this article. Philip is a computational physicist at Lawrence Livermore National Laboratory, with a Ph.D. in astrophysics from Harvard University.

My simulations are based on his code, which I have significantly upgraded. The end result is the three-galaxy problem: small star clusters, each with hundreds of stars, coalescing due to gravitational forces of the individual stars. It simulates the merging of galaxies. In addition, I added a birth process, with new stars constantly generated. I also allow for star collisions, resulting in fewer but bigger stars over time. Finally, my simulations allow for stars with negative masses, as well as unusual gravitation laws, different from the classic inverse square law.

These bizarre universes lead to spectacular data animations (MP4 videos); perhaps most importantly, they may help explain what could cause our universe to expand, including the different stages of compression and expansion over time. Depending on the initial configuration, very different outcomes are possible. Negative masses, with cluster centroids based on the absolute value of the mass while gravitational forces are based on the signed mass, could lead to a different model of the universe. Many well-known phenomena, such as rogue stars escaping their cluster at great velocity, black hole and twin star formation, star filaments, and star clusters becoming less energetic over time (decreasing expansion, smaller velocities), are striking features visible in my videos. Star collisions lead to an interesting graph problem.

The N-body problem consists of predicting the evolution of celestial bodies bound by gravity. Here I go one step further: up to 1,000 stars and star clusters are simulated using various initial conditions, to produce videos that show how these synthetic universes evolve. This tells us a lot about the past and future of our own universe, corroborating the theory that it is expanding, albeit more and more slowly. In addition, stars with negative masses and gravity laws other than the standard inverse square, when allowed, lead to the most bizarre systems and spectacular videos. Star collisions are studied in detail and lead to interesting graph theory applications. I provide the Python code for these simulations, including the production of animated data visualizations (videos) and graph representations.
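The core of such a simulation is the pairwise force computation and a symplectic time-stepping scheme. The sketch below is a generic kick-drift-kick leapfrog integrator under the classic inverse-square law, not the author's upgraded code; the softening parameter, time step, and initial conditions are illustrative assumptions. Negative masses work unchanged, since the force law only multiplies by the signed mass.

```python
import numpy as np

def accelerations(pos, mass, G=1.0, softening=0.1):
    """Pairwise gravitational accelerations (inverse-square law).
    pos: (N, 2) positions; mass: (N,) masses (may be negative)."""
    diff = pos[None, :, :] - pos[:, None, :]         # (N, N, 2): r_j - r_i
    dist2 = np.sum(diff**2, axis=-1) + softening**2  # softened squared distance
    inv_r3 = dist2 ** -1.5
    np.fill_diagonal(inv_r3, 0.0)                    # no self-interaction
    # a_i = G * sum_j m_j (r_j - r_i) / |r_j - r_i|^3
    return G * np.sum(diff * (mass[None, :, None] * inv_r3[:, :, None]), axis=1)

def leapfrog_step(pos, vel, mass, dt=0.01):
    """One kick-drift-kick leapfrog update (symplectic, time-reversible)."""
    vel = vel + 0.5 * dt * accelerations(pos, mass)
    pos = pos + dt * vel
    vel = vel + 0.5 * dt * accelerations(pos, mass)
    return pos, vel

rng = np.random.default_rng(1)
N = 100
pos = rng.standard_normal((N, 2))   # a single small cluster
vel = np.zeros((N, 2))
mass = np.ones(N)
for _ in range(10):
    pos, vel = leapfrog_step(pos, vel, mass)
```

Because the pairwise forces are antisymmetric, total momentum is conserved to floating-point precision, which is a useful sanity check when experimenting with alternative gravity laws (replace the `-1.5` exponent to change the force law).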

- Introduction
- Model parameters and simulation results

. . . Explanation of color codes

. . . Detailed description of top parameters

. . . Interesting parameter sets

- Analysis of star collisions and collision graph

. . . Weighted directed graph: visualization with NetworkX

. . . Interesting findings: how the universe got started

- Animated data visualizations
- Python code and computational issues

. . . Simulating the real and synthetic universes

. . . Visualizing collision graphs
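The weighted directed collision graph mentioned above can be built with NetworkX. The sketch below uses a hypothetical collision log; the encoding — a directed edge from the absorbed star to the survivor, weighted by the transferred mass — is one plausible choice, not necessarily the one used in the article.

```python
import networkx as nx

# Hypothetical collision log: (absorbed_star, surviving_star, absorbed_mass)
collisions = [
    ("s3", "s1", 0.8),
    ("s5", "s1", 1.2),
    ("s7", "s2", 0.5),
    ("s2", "s1", 2.1),   # s2 (already grown) is later absorbed by s1
]

G = nx.DiGraph()
for absorbed, survivor, mass in collisions:
    G.add_edge(absorbed, survivor, weight=mass)

# Stars that absorbed others, with the total mass each one gained
gained = {n: sum(d["weight"] for _, _, d in G.in_edges(n, data=True))
          for n in G.nodes if G.in_degree(n) > 0}
```

Chains of edges (here s7 → s2 → s1) trace the genealogy of the big stars that dominate the late universe; `nx.draw` with edge widths proportional to the weights gives a quick visualization.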

The technical article, entitled *Spectacular Videos: Synthetic Universes, with Star Collision Graph*, is accessible in the “Free Books and Articles” section, here. It contains links to my GitHub files, to easily copy and paste the code. The text highlighted in orange in this PDF document corresponds to keywords that will be incorporated in the index when I aggregate all my related articles into books about machine learning, visualization, and Python, similar to these ones. The text highlighted in blue corresponds to external clickable links, mostly references. And red is used for internal links, pointing to a section, bibliography entry, equation, and so on.

*To not miss future articles, sign up for our newsletter, here.*

Vincent published in *Journal of Number Theory*, *Journal of the Royal Statistical Society* (Series B), and *IEEE Transactions on Pattern Analysis and Machine Intelligence*. He is also the author of multiple books, including “Intuitive Machine Learning and Explainable AI”, available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.

The post Dynamic Clouds and Landscape Generation: Morphing and Evolutionary Processes first appeared on Machine Learning Techniques.

My previous article focused on map generation in 3D and also features a fascinating video; see here. In this article, while focusing on 2D, I provide a simple introduction to evolutionary processes in the context of synthetic data and terrain generation. And not just terrains: depending on the color palette, the same algorithm can simulate other processes, such as storm formation.

The focus is on stationary processes. The analogy with random walks and Brownian motions is striking. Despite their simplicity, the systems modeled here are a lot more complex than your typical Brownian motion. You can compare them to time-continuous time series where each observation (synthetically generated here) is an image. This article will appeal to practitioners looking for more sophisticated modeling tools that mimic natural phenomena. It will also appeal to machine learning professionals looking for serious Python code, the kind typically not taught in classes or textbooks, and not found on the Internet. It offers a fun application to learn scientific computing.

I also explain how to produce animated data visualizations in Python (MP4 videos) featuring four related sub-videos in parallel, progressing at various speeds. In particular, the video shows the probabilistic evolution of a system from A to B, compared with morphing the starting configuration A into the final state B. In the end, this article can serve as an introduction to chaotic dynamical systems.
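One way to lay out four sub-videos progressing at different speeds is a 2×2 matplotlib grid driven by `FuncAnimation`. This is a generic sketch, not the article's code: the frame data is random placeholder imagery, and the speed factors are illustrative.

```python
import matplotlib
matplotlib.use("Agg")                       # render off-screen, no display needed
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
import numpy as np

rng = np.random.default_rng(0)
frames_data = rng.random((200, 32, 32))     # stand-in for generated images

# Four panels progressing in parallel, each at its own speed
fig, axes = plt.subplots(2, 2, figsize=(8, 8))
speeds = [1, 2, 4, 8]
images = [ax.imshow(frames_data[0], animated=True) for ax in axes.flat]

def update(frame):
    # Each panel advances through the frame stack at its own rate
    for img, s in zip(images, speeds):
        img.set_array(frames_data[(frame * s) % len(frames_data)])
    return images

anim = FuncAnimation(fig, update, frames=100, interval=50, blit=True)
# anim.save("four_panels.mp4", fps=20)      # writing MP4 requires ffmpeg
```

In practice the placeholder `frames_data` would be replaced by the successive states of the evolutionary process, with one panel per scenario (for instance, evolution versus morphing).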

This article is an introduction to computer vision techniques in Python, diving into the technical details of a specific class of problems. I show how to use generative models based on synthetic data to simulate terrain evolution and other natural phenomena such as cloud formation or climate change. The presentation is accessible and targeted to software engineers interested in understanding and applying the machine learning and probabilistic background behind the scenes, as well as to machine learning professionals interested in the programming aspects and scientific computing. The end goal is to help readers design and implement their own models and generate their own data sets, by showcasing an interesting application in full detail. My Python code can also be used as an end in itself.

From a machine learning perspective, the stochastic processes involved can be compared to spatial time series or time-continuous chaotic dynamical systems. There is a similarity with constrained Brownian motions, where at each time, rather than observing a typical observation (say, a vector of stock prices), the observation consists of a particular configuration of the entire space (for instance, a moving storm system at a given time). In this article, the focus is on stationary-like processes. I briefly discuss the probabilistic models behind my algorithms, to explain when they work and when they don’t. However, I keep theoretical discussions to the essential, so that software engineers and other professionals lacking a strong mathematical background can easily read and benefit from my presentation.

A possible use of my methodology is to automatically generate and label a large number of different landscapes (mountains, sea, land, combinations, and so on) to create a large training set. The training set can serve as augmented data for landscape classification, or to generate more landscapes within a specific category to further enrich the classifier. The methodology can also be used to simulate transitions and reconstruct the hidden statistical behavior over short periods of time, when granular observations are not available. Finally, in addition to modeling and simulating uncontrolled evolutionary processes, the animated data visualizations also feature image morphing, both in the state space (coalescing physical shapes) and in the spectral space (palette and color morphing).
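Terrain generation of this kind rests on midpoint displacement; the article covers the diamond-square algorithm. Below is a minimal generic textbook version, not the article's code — the grid size, roughness parameter, and noise distribution are illustrative assumptions:

```python
import numpy as np

def diamond_square(n, roughness=0.6, seed=0):
    """Generate a (2**n + 1) x (2**n + 1) height map with the
    diamond-square algorithm (2D midpoint displacement)."""
    rng = np.random.default_rng(seed)
    size = 2**n + 1
    h = np.zeros((size, size))
    h[0, 0], h[0, -1], h[-1, 0], h[-1, -1] = rng.uniform(-1, 1, 4)
    step, scale = size - 1, 1.0
    while step > 1:
        half = step // 2
        # Diamond step: center of each square = mean of its 4 corners + noise
        for i in range(half, size, step):
            for j in range(half, size, step):
                avg = (h[i - half, j - half] + h[i - half, j + half] +
                       h[i + half, j - half] + h[i + half, j + half]) / 4
                h[i, j] = avg + scale * rng.uniform(-1, 1)
        # Square step: midpoint of each edge = mean of its neighbors + noise
        for i in range(0, size, half):
            for j in range((i + half) % step, size, step):
                neighbors = []
                if i >= half:
                    neighbors.append(h[i - half, j])
                if i + half < size:
                    neighbors.append(h[i + half, j])
                if j >= half:
                    neighbors.append(h[i, j - half])
                if j + half < size:
                    neighbors.append(h[i, j + half])
                h[i, j] = np.mean(neighbors) + scale * rng.uniform(-1, 1)
        step //= 2
        scale *= roughness      # noise shrinks at finer resolutions
    return h

terrain = diamond_square(6)     # 65 x 65 height map
```

Mapping the heights through a color palette (blues below a threshold, greens and browns above) turns the array into a landscape image; re-running with perturbed noise at each time step, or interpolating between two height maps, yields the evolutionary and morphing effects described above.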

- Introduction
- Terrain generation and the evolutionary process

. . . Morphing and nonlinear palette operations

. . . The diamond-square algorithm

. . . The evolutionary process

- Python code

. . . Producing data videos with four sub-videos in parallel

. . . Main program

The technical article, entitled *Dynamic Clouds and Landscape Generation: Morphing and Evolutionary Processes*, is accessible in the “Free Books and Articles” section, here. It contains links to my GitHub files, to easily copy and paste the code. The text highlighted in orange in this PDF document corresponds to keywords that will be incorporated in the index when I aggregate all my related articles into books about machine learning, visualization, and Python, similar to these ones. The text highlighted in blue corresponds to external clickable links, mostly references. And red is used for internal links, pointing to a section, bibliography entry, equation, and so on.

*To not miss future articles, sign up for our newsletter, here.*

Vincent published in *Journal of Number Theory*, *Journal of the Royal Statistical Society* (Series B), and *IEEE Transactions on Pattern Analysis and Machine Intelligence*. He is also the author of multiple books, including “Intuitive Machine Learning and Explainable AI”, available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.
