The post High-value AI and Machine Learning Certifications Under $50 first appeared on Machine Learning Techniques.

- Automatic qualification for busy professionals with 2+ years of relevant industry experience and fluency in data analysis. Read more here to see if you qualify.
- If you don’t meet the above requirement but have programming experience: online training and guidance to complete and post two projects on GitHub, at your own pace, within 30 days.

All the material is open source and mostly free, including Jupyter notebooks, professional Python code, and technical documents. You get to interact with the founder of the company – a world leader in AI and ML with substantial industry experience – who can guide you and help you complete your projects if needed. There are no meaningless quizzes, and success is not based on memorizing and applying old-fashioned concepts. The only cost is downloading the textbook relevant to the certification, priced below $50 with one exception.

The following certifications are currently offered:

- Certified Machine Learning Professional
- Certified Generative AI Professional
- Certified Data Visualization Professional
- Scientific Computing with Python
- Time Series and Geospatial Modeling
- Statistical Optimization for AI and Machine Learning
- Certified NLP Professional

For details and how to obtain your certification, follow this link.

Besides the aforementioned features, the program offers the following benefits.

- In less than 30 days, you will learn to apply efficient, advanced techniques to real-world data, pick up trade secrets from a top expert, and understand the limits of each method, how to overcome them, and what works best in specific contexts.
- Obtain a certificate to add to the credentials section of your LinkedIn profile, with an authentication feature.
- Request a customized certification reflecting your own experience, if desired.
- If you qualify for automatic certification and your LinkedIn profile displays the #opentowork badge, there is no cost.

To see how the certification would look on your LinkedIn profile page, see the example here, under the LinkedIn section entitled “Licenses and certifications”.

MLtechniques.com is a private, self-funded ML/AI research lab developing state-of-the-art open source technologies related to synthetic data, generative AI, cybersecurity, geospatial modeling, stochastic processes, chaos modeling, and AI-related statistical optimization. It was founded in 2020 by Dr. Vincent Granville, one of the top leaders in the field.

Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS). He published in *Journal of Number Theory*, *Journal of the Royal Statistical Society* (Series B), and *IEEE Transactions on Pattern Analysis and Machine Intelligence*. He is also the author of multiple books, including “Intuitive Machine Learning and Explainable AI”, available here. Vincent lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.

The post A Synthetic Stock Exchange Played with Real Money first appeared on Machine Learning Techniques.

If instead the player uses the public data and algorithm to place his bets, he will quickly become a billionaire. Not exactly, actually: the operator would go bankrupt long before that happens. In the end, though, it is the operator that wins. But many players win too, some big time. In many implementations, more than 50% of the players win on any single bet. How so?

At first glance, this sounds like fintech science fiction, or a system that must have a bug somewhere. But once you read the article, you will see why players could be interested in this new, one-of-a-kind money game. Most importantly, this technical article covers the mathematics behind the scenes, the business model, and all the details (including legal ones) that make this game a viable option for both the player and the operator.

Some of the features are based on new advances in number theory. Anyone interested in cryptography, risk management, fintech, synthetic data, operations research, gaming, gambling, or securities laws should read this material. It describes original, state-of-the-art technology with potential applications in the fields in question. The author may work on a real implementation.

This project started several years ago with extensive, privately funded research on the topic. An earlier version was presented at the INFORMS conference in 2019. Python code to process truly gigantic numbers is included in the article. The author holds the world record for the number of computed digits of most quadratic irrationals, using fast algorithms. This may be the first time that massive amounts of such large sequences are both used and necessary to solve a real-world problem.

*The 20-page article is available on GitHub, here. The Python code is in the same folder. It is now part of my book “Gentle Introduction To Chaotic Dynamical Systems”, available here.*

The post Autonomous Driving: Boosting Optical Flow with Synthetic Data first appeared on Machine Learning Techniques.

Optical flow is defined as the task of estimating per-pixel motion between video frames. Optical flow models take two sequential frames as input and return as output a flow vector that predicts where each pixel in the first frame will be in the second frame. Optical flow is an important task for autonomous driving, but real-world flow data is very hard to label. For humans, it can actually be impossible to label. Labeling can only be done using LiDAR information to estimate object motion, whether dynamic or static, from the ego trajectory. Because LiDAR scans are inherently sparse, the very few public optical flow datasets are also sparse. A way around this problem is to use synthetic data, where dense flow labels are readily available. This post goes over how synthetic data can improve optical flow tasks and how tuning Parallel Domain’s synthetic data to mitigate important domain gaps can lead to major performance improvements.

A major strength of synthetic data is that you can generate as much perfectly labeled data as you like. Synthetic optical flow data is accurate and can have dense optical flow labels, so that every pixel has a flow label. One major advantage of using synthetic data is that machine learning practitioners can iterate not just on loss or architecture, but also on the datasets themselves. This means that scenes, maps, scenarios, sensors, and flow magnitudes can be quickly tailored.

Because there is so little real optical flow data, researchers typically pretrain their models on synthetic flow datasets with non-commercial license agreements like Flying Chairs (FC) and Flying Things (FT). Note that neither of these datasets emulates autonomous vehicle scenes, and both contain surreal elements (e.g., random backgrounds from Flickr, CGI foreground objects) placed and rotated randomly to offer a wide variety of optical flow motions. To fine-tune their models for autonomous vehicle applications, researchers use a real-world dataset like KITTI-flow (2015 version), which contains only 200 training samples, with another 200 test samples hidden behind an inaccessible test server. This dataset is one of the few publicly available optical flow datasets for autonomous driving and is used extensively to publish results and benchmarks.

Despite their surreal elements, the synthetic datasets FC and FT are highly useful and improve on models trained with only the little real-world data available. Thus, they are widely used as part of the standard optical flow training pipeline:

- Pretrain with 1.2M iterations on Flying Chairs
- Pretrain with 600k iterations on Flying Things
- Fine-tune 300k iterations on KITTI

For the real dataset KITTI, it is common to train on all 200 and publish results on those same 200 samples, but that is not a good measure of how well the model performs on new data. Our results in this post are based on training on 100 samples and evaluating model performance on the other 100.

Optical flow tasks are often evaluated with End Point Error (EPE). This metric is the magnitude of difference (Euclidean distance) between the ground truth and predicted flow vectors for each pixel.
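As an illustration, EPE can be computed in a few lines of NumPy. This is a generic sketch, not the code used in these experiments:

```python
import numpy as np

def end_point_error(flow_gt, flow_pred):
    # Per-pixel Euclidean distance between ground-truth and predicted
    # flow vectors, averaged over the image. Shapes: (H, W, 2).
    return float(np.linalg.norm(flow_gt - flow_pred, axis=-1).mean())

gt = np.zeros((4, 4, 2))               # ground truth: no motion anywhere
pred = np.zeros((4, 4, 2))
pred[..., 0], pred[..., 1] = 3.0, 4.0  # every prediction off by (3, 4)
print(end_point_error(gt, pred))       # → 5.0 (a 3-4-5 triangle at every pixel)
```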

At Parallel Domain, we sought to improve the standard optical flow pipeline by adding in domain specific synthetic data for autonomous driving. By using and iterating on our synthetic data, we were able to improve EPE on optical flow tasks by 18.5% largely through enhancing flow magnitude. This section describes how we integrated PD data into optical flow pipelines, identified flow magnitude as an important domain gap, and subsequently addressed that domain gap with PD data.

In response to the problems with the standard optical flow pipeline (e.g., no domain-specific synthetic datasets, a very small real-world dataset), we generated a range of synthetic data for autonomous driving in different locations, scenarios (e.g., highway vs. urban, creeping datasets), and frame rates, to augment the diversity of flow data available while remaining domain specific.

We added this data as a third pretraining step to form a new optical flow training pipeline:

- Pretrain with 1.2M iterations on Flying Chairs
- Pretrain with 600k iterations on Flying Things
- Pretrain with 600k iterations on Parallel Domain data
- Fine-tune 300k iterations on KITTI

For these experiments, we used an off-the-shelf PWC-Net architecture.

The initial addition of the Parallel Domain dataset to the standard optical flow pipeline improved EPE by 6%; however, we wanted to push that number further. A major advantage of Parallel Domain’s synthetic data is that it is easy to improve the data through iteration. We previously mentioned in our synthetic data best practices blog that good synthetic data should visually resemble real-world sensor data and labels. This is important to ensure generalization. The data should also reflect a distribution of locations, textures, lighting, backgrounds, objects, and agents (e.g., vehicles or pedestrians) similar to what a model will encounter in real-world situations.

One of our team members, located in Karlsruhe, Germany, pointed out that some of the KITTI highway scenes are on the German Autobahn, **where vehicles drive very fast**. This contrasts with our own generated highway scenes, located in the US, where vehicles drive at lower speeds. The difference in speed between the KITTI dataset and Parallel Domain’s data is reflected in the flow magnitude (the length of a flow vector) histograms below. We can observe that for the KITTI dataset (KITTI Train), the mean and standard deviation of the flow are much higher than for the PD data (PD Original). For reference, we also included the synthetic datasets FT and FC (2nd column); they encompass a wide distribution of flow, indicating a potential gap in our initial PD data.

To reduce the flow magnitude domain gap, we **reduced the frame rate sampling** for some of our Parallel Domain data. This increased the flow magnitude, as the motion between two consecutive frames became larger. We can see that our flow distribution improved (PD High Flow), covering a wider range of flow magnitudes.
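Conceptually, the adjustment amounts to sampling frame pairs further apart in time. Here is a deliberately simple toy sketch (a plain list stands in for a video clip; the actual data generation pipeline is more involved):

```python
def subsample(frames, step=3):
    # Keeping every `step`-th frame lowers the effective frame rate,
    # so the motion between consecutive kept frames is roughly
    # `step` times larger, producing higher flow magnitudes.
    return frames[::step]

frames = list(range(12))   # stand-in for a 12-frame video clip
kept = subsample(frames)
print(kept)                # → [0, 3, 6, 9]
```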

After we tested the pure synthetic data pipeline, we added the KITTI dataset back into the optical flow training pipeline:

- Pretrain with 1.2M iterations on Flying Chairs
- Pretrain with 600k iterations on Flying Things
- Pretrain with the updated Parallel Domain data — PD High Flow
- Fine-tune 300k iterations on KITTI

When we trained using PD data with a wider distribution of flow magnitude, the **EPE dropped by 18.5%** compared to the baseline. This result held when we doubled the FT training time so that both models were trained with the same number of iterations. This confirms that performance improves when using Parallel Domain data to pretrain optical flow models, and that the flow distribution matters. We were able to push our performance significantly further by reducing the frame rate in our generated data. This makes sense, as the highest-magnitude pixels in optical flow are typically on the surrounding rails at the sides of the frame.

The largest motions are not usually in relation to the vehicles in front of the ego vehicle, but rather the vehicles on the side. By skipping frames we were able to accentuate the motion between pixels, thus improving our overall performance. This dataset tuning would not have been possible if it were not for synthetic data that allowed us to easily iterate and tune labels accordingly.

After reducing the frame rate on some of our Parallel Domain data, we tried a purely synthetic pipeline that **didn’t involve fine-tuning on KITTI**:

- Pretrain with 1.2M iterations on Flying Chairs
- Pretrain with 600k iterations on Flying Things
- Train with Parallel Domain data

Surprisingly, the model trained only on synthetic data outperformed the baseline trained on real data without PD data by 2.5%! If you would like to learn more about training optical flow models with synthetic data, you can check out our documentation.

Optical flow is an important task for autonomous driving, but a major problem is that there is so little real-world optical flow data. This shortage creates a need for synthetic data. In this post, we showed how Parallel Domain’s synthetic data improved optical flow performance by 18.5% by enabling a large distribution of flow magnitude (per pixel motion). An important reason why this was possible was that we were able to iterate on our synthetic data to create the best possible dataset with little to no cost, as labels can be easily generated. If you are a machine learning practitioner interested in training machine learning models using synthetic and real data, consider registering for the Woodscape Motion Segmentation Challenge.

*This post was originally published at Parallel Domain. It was authored by Camille Ballas, Michael Stanley, Phillip Thomas, Lars Pandikow, and Michael Galarnyk. You can see our W&B report for additional optical flow results and experiments.*

The post Generating and Videolizing Agglomerative Processes first appeared on Machine Learning Techniques.

The focus is on the distribution of atom sizes over time and on the number of collisions, which become rarer and rarer as atoms get bigger and the density of atoms per unit volume decreases. You can use the method and Python implementation in various contexts by updating the algorithm accordingly and changing the terminology: replacing atoms by particles, small celestial bodies in the birth of a star system with planets, molecules, or even soap bubbles merging together. A simple potential improvement is to allow the atoms not only to merge and grow in size, but also to split or lose electrons.

Perhaps the most interesting feature is that you can simulate the evolution and interactions of the 10^{80} atoms in our universe without working with individual atoms. Indeed, I use very fast and efficient simulations based on arrays with fewer than 100 cells to study the macro behavior. The implementation allows you to simulate either one or hundreds of evolution paths in parallel. The second option leads to the theoretical evolving distribution of atom sizes over time. It can be customized to a variety of agglomerative processes. The context may be very different from statistical physics and could even include mergers and acquisitions in the business world, celestial mechanics, or applications in chemistry.
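As a minimal toy sketch of this idea (my own illustration, not the article's implementation), the macro state can be tracked as a small array of cluster-size counts, with each collision merging two clusters; no individual atom is ever stored:

```python
import random

def agglomerate(n_steps=1000, max_size=100, n_atoms=10**6, seed=0):
    # counts[k] = number of clusters of size k + 1. Two clusters of
    # sizes a and b merge into one of size a + b; sizes are capped
    # at max_size so the array never grows.
    random.seed(seed)
    counts = [0] * max_size
    counts[0] = n_atoms                  # start with n_atoms singletons
    for _ in range(n_steps):
        if sum(counts) < 2:
            break                        # nothing left to collide
        # pick two cluster sizes with probability proportional to counts
        a = random.choices(range(max_size), weights=counts)[0]
        counts[a] -= 1
        b = random.choices(range(max_size), weights=counts)[0]
        counts[b] -= 1
        merged = min(a + b + 1, max_size - 1)  # size (a+1)+(b+1) → index a+b+1
        counts[merged] += 1
    return counts

counts = agglomerate(n_steps=10, n_atoms=100)
print(sum(counts))  # → 90: each of the 10 collisions merges two clusters into one
```

Total mass is conserved (until the size cap kicks in), so the array of counts fully describes the macro state at each step.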

The technical article, entitled *Generating and Videolizing Agglomerative Processes*, is accessible in the “Free Books and Articles” section, here. It contains links to my GitHub files, making it easy to copy and paste the code. In this PDF document, the text highlighted in orange consists of keywords that will be incorporated in the index when I aggregate all my related articles into books about machine learning, visualization, and Python. The text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links, pointing to a section, bibliography entry, equation, and so on. *To not miss future articles, sign up for our newsletter, here.*

The post Massively Speed-Up your Learning Algorithm, with Stochastic Thinning first appeared on Machine Learning Techniques.

I also show the potential limitations of the new technique, and introduce the concepts of *leading* or *influential* observations (those kept for learning purposes) and *followers* (observations dropped from the training set). The term “influential observations” should not be confused with its usage in statistics, although in both cases it leads to explainable AI. The neural network used in this article offers replicable results by controlling all the sources of randomness, a property rarely satisfied in other implementations.
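To fix ideas, here is a deliberately naive sketch of the thinning step (uniform random selection with a controlled seed for replicability; the article's method for identifying leading observations is more refined):

```python
import random

def thin_training_set(X, y, keep_fraction=0.25, seed=42):
    # Keep a random subset of observations (the "leaders") and drop
    # the rest (the "followers"), cutting the cost of each training
    # epoch roughly by a factor 1 / keep_fraction.
    rng = random.Random(seed)  # fixed seed → replicable selection
    kept = [i for i in range(len(X)) if rng.random() < keep_fraction]
    return [X[i] for i in kept], [y[i] for i in kept]

X = list(range(1000))
y = [x % 2 for x in X]
X_small, y_small = thin_training_set(X, y)
print(len(X_small))  # roughly 250 of the 1000 observations are kept
```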

If you are new to neural networks and deep learning or manage a group of engineers developing or using such tools, the full technical article (13 pages including 6 pages of Python code) will give you a quick overview of the issues and benefits surrounding these methods, and a solid high-level introduction to the subject including how to discover and overcome — or leverage — the problems faced.

The technical article, entitled *Massively Speed-Up your Learning Algorithm, with Stochastic Thinning*, is accessible in the “Free Books and Articles” section, here. It contains links to my GitHub files, making it easy to copy and paste the code. In this PDF document, the text highlighted in orange consists of keywords that will be incorporated in the index when I aggregate all my related articles into books about machine learning, visualization, and Python. The text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links, pointing to a section, bibliography entry, equation, and so on.

*To not miss future articles, sign up for our newsletter, here.*

The post Smart Grid Search for Faster Hyperparameter Tuning first appeared on Machine Learning Techniques.

Then, I show how to significantly improve grid search and make it a viable alternative to gradient methods for estimating the two parameters *p* and *α*. The cost function — that is, the error to minimize — is the combined distance between the mean and variance computed on the real data and the mean and variance of the target zeta-geometric distribution. Thus the mean and variance are used as proxy estimators for *p* and *α*. This technique is known as minimum contrast estimation, or moment-based estimation, in statistical circles. The “smart” grid search consists of narrowing down on smaller and smaller regions of the parameter space over successive iterations.
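The narrowing-down idea can be sketched as follows. This is my own generic version, with a toy quadratic cost standing in for the moment-matching error (the article works with the actual zeta-geometric moments):

```python
import numpy as np

def smart_grid_search(cost, bounds, n_grid=11, n_rounds=5, shrink=0.4):
    # Evaluate the cost on a coarse 2-D grid, then re-center a smaller
    # grid on the best point found, shrinking the search window by a
    # factor `shrink` at each round. `cost` maps (p, alpha) to a float.
    (p_lo, p_hi), (a_lo, a_hi) = bounds
    best = None
    for _ in range(n_rounds):
        ps = np.linspace(p_lo, p_hi, n_grid)
        alphas = np.linspace(a_lo, a_hi, n_grid)
        best = min((cost(p, a), p, a) for p in ps for a in alphas)
        _, p_best, a_best = best
        # shrink the window around the current best point
        p_half = (p_hi - p_lo) * shrink / 2
        a_half = (a_hi - a_lo) * shrink / 2
        p_lo, p_hi = p_best - p_half, p_best + p_half
        a_lo, a_hi = a_best - a_half, a_best + a_half
    return best[1], best[2]

# Toy cost with known minimum at p = 0.3, alpha = 1.7, standing in for
# the distance between data moments and model moments:
p, a = smart_grid_search(lambda p, a: (p - 0.3) ** 2 + (a - 1.7) ** 2,
                         bounds=((0, 1), (1, 3)))
print(round(p, 3), round(a, 3))  # values close to 0.3 and 1.7
```

With 5 rounds of an 11×11 grid, this evaluates the cost only 605 times, versus the tens of thousands of evaluations a single fine grid of equivalent resolution would need.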

The zeta-geometric distribution is just one example of a hybrid distribution. I explain how to design such hybrid models in general, using a very simple technique. They are useful for combining multiple distributions into a single one, leading to model generalizations with an increased number of parameters. The goal is to design distributions that are a good fit when some in-between solution is needed to better represent reality.

The technical article, entitled *Smart Grid Search for Faster Hyperparameter Tuning*, is accessible in the “Free Books and Articles” section, here. It contains links to my GitHub files, making it easy to copy and paste the code. In this PDF document, the text highlighted in orange consists of keywords that will be incorporated in the index when I aggregate all my related articles into books about machine learning, visualization, and Python. The text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links, pointing to a section, bibliography entry, equation, and so on.

*To not miss future articles, sign up for our newsletter, here.*

The post New Book: Gentle Introduction To Chaotic Dynamical Systems first appeared on Machine Learning Techniques.

Without using measure theory, the invariant distributions of many systems are discussed in detail, with numerous closed-form expressions for classic and new maps, including the logistic, square root logistic, nested radicals, generalized continued fractions (the Gauss map), the ten-fold and dyadic maps, and more. The concept of bad seed, rarely discussed in the literature, is explored in detail. It leads to singular fractal distributions with no probability density function, and to sets similar to the Cantor set. Rather than avoiding these monsters, you will be able to leverage them as competitive tools for modeling purposes, since many evolutionary processes in economics, fintech, physics, population growth, and so on, do not always behave nicely.

A summary table of numeration systems serves as a useful, quick reference on the subject. Equivalence between different maps is also discussed. In a nutshell, this book is dedicated to the study of two numbers: zero and one, with a wealth of applications and results attached to them, as well as some of the toughest mathematical conjectures. It will appeal in particular to busy practitioners in fintech, security, defense, operations research, engineering, computer science, machine learning, and AI, as well as consultants and professional mathematicians. Students deterred by how hard this topic is, and by the amount of advanced mathematics, will find that this book gets them jump-started. While the mathematical level remains high in some sections, everything is explained as simply as possible, focusing on what is needed for the applications.

Numerous illustrations, including beautiful representations of these systems (generative art), a lot of well documented Python code, and nearly 20 off-the-beaten-path exercises complementing the theory, will help you navigate this beautiful field. You will see how even the most basic systems offer an incredible variety of configurations depending on a few parameters, allowing you to model a very large array of phenomena. Finally, the first chapter also covers time-continuous processes, including unusual clustered, reflective, constrained, and integrated Brownian-like processes, random walks, and time series, with little math and no obscure jargon. In the end, my goal is to get you to use these systems fluently, and to see them as gentle, controllable chaos. In short, what real life should be! Quantifying the amount of chaos is also one of the topics discussed in the book.
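For a taste of how simple these systems are to play with (a standard illustration, not taken from the book): the logistic map takes one line to iterate, yet at parameter r = 4 it is fully chaotic, with the arcsine law as its invariant distribution.

```python
def logistic_orbit(x0, n, r=4.0):
    # Iterate the logistic map x -> r * x * (1 - x), returning the orbit.
    xs, x = [], x0
    for _ in range(n):
        x = r * x * (1 - x)
        xs.append(x)
    return xs

orbit = logistic_orbit(0.2, 4)
print(orbit[0])  # ≈ 0.64, i.e. 4 * 0.2 * 0.8; the orbit stays in [0, 1]
```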

*Authored by Dr. Vincent Granville, 82 pages, published in March 2023. Available exclusively on our e-Store, here. See the table of contents or a sample chapter on GitHub here. The Python code is also in the same repository.*

The post Feature Clustering: A Simple Solution to Many Machine Learning Problems first appeared on Machine Learning Techniques.

The technique can also be used for traditional clustering performed on the observations. In that case, it is useful in the presence of wide data: a large number of features but a small number of observations, sometimes smaller than the number of features, as in clinical trials. When applied to features, it allows you to break down a high-dimensional problem (the dimension being the number of features) into a number of low-dimensional problems. It can accelerate many algorithms — those with computing time growing exponentially with the dimension — while avoiding issues related to the “curse of dimensionality”. In fact, it can be used as a data reduction technique, where feature clusters with a low average correlation (in absolute value) are removed from the data set.

Applications are numerous. In my case, I used it in the context of synthetic data generation, especially with generative adversarial networks (GAN). The idea is to identify clusters of related features, apply a separate GAN to each of them, then merge the synthetizations back into one dataset. The benefits are faster processing with little to no loss in capturing the full correlation structure present in the data set. It also increases the robustness and explainability of the method, making it less volatile across the successive epochs of the GAN model.

I summarize the feature clustering results in section 2. I used the technique on a Kaggle dataset with 9 features, consisting of medical measurements. I offer two Python implementations: one based on hierarchical clustering in section 3.1, and one based on connected components (a fundamental graph theory algorithm) in section 3.2. In addition, the technique leads to a simple visualization of the 9-dimensional dataset, with one scatterplot and two colors: orange for diabetes and blue for non-diabetes. Here diabetes is the binary response feature. This is because the largest feature cluster contains only 3 features, and one of them is the response. In any well-designed experiment, you would expect the response to always be in a large feature cluster.

No linear algebra or calculus is required: the method is essentially math-free. This is in contrast to principal component analysis (PCA), which relies on eigenvalues and turns your features into meaningless, arbitrary linear combinations that are hard to interpret. This article is an extract from my book “Synthetic Data and Generative AI”, available here.
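A minimal version of the connected-components variant can be sketched as follows (my own sketch, assuming a simple absolute-correlation threshold; the article's section 3.2 implementation has more detail). Two features are linked when their absolute correlation exceeds the threshold, and feature clusters are the connected components of the resulting graph:

```python
import numpy as np

def feature_clusters(X, threshold=0.3):
    # X has shape (n_obs, n_features). Returns one cluster label per feature.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n = corr.shape[0]
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]  # depth-first search over the feature graph
        while stack:
            i = stack.pop()
            if labels[i] != -1:
                continue
            labels[i] = cluster
            stack.extend(j for j in range(n)
                         if labels[j] == -1 and corr[i, j] > threshold)
        cluster += 1
    return labels

# Two independent blocks of two correlated features each:
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([a, a + 0.1 * rng.normal(size=500),
                     b, b + 0.1 * rng.normal(size=500)])
print(feature_clusters(X))  # → [0, 0, 1, 1]
```

Each cluster can then be modeled (or synthesized) separately, turning one 4-dimensional problem into two 2-dimensional ones.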

The technical article, entitled *Feature Clustering: A Simple Solution to Many Machine Learning Problems*, is accessible in the “Free Books and Articles” section, here. It contains links to my GitHub files, making it easy to copy and paste the code. In this PDF document, the text highlighted in orange consists of keywords that will be incorporated in the index when I aggregate all my related articles into books about machine learning, visualization, and Python. The text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links, pointing to a section, bibliography entry, equation, and so on.

*To not miss future articles, sign up for our newsletter, here.*

The post Data Synthetization: enhanced GANs vs Copulas first appeared on Machine Learning Techniques.

I show examples where GANs are superior to copulas, and the other way around. My GAN implementation also leads to fully replicable results — a feature usually absent in other GAN systems. This is particularly important given the high dependency on the initial configuration determined by a seed parameter: it allows you to find the best synthetic data using multiple runs of GAN in a replicable setting. In the process, I introduce a new matrix correlation distance to evaluate the quality of the synthetic data, taking values between 0 and 1, where 0 is best, and leverage the TableEvaluator library. I also discuss feature clustering to improve the technique: detecting groups of features independent of each other and applying a separate model to each of them. In a medical data example predicting the risk of cancer, I use random forests to classify the real data and compare the performance with results obtained on the synthetic data.

In one artificial example with very strong patterns, the copula method fails to detect the non-linear feature interactions, while GANs do a pretty good job. In another example, with mostly linear feature interactions, the opposite is true. The document also has many references to technical papers available online for free, as well as a discussion of open source implementations. In particular, I feature an illustration of the CopulaGAN module — blending copulas and GANs — from the synthetic data vault library (SDV), applied to tabular data.
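For intuition, one plausible form of such a correlation-based quality metric (my own sketch; not necessarily the exact formula used in the article) compares the correlation matrices of the real and synthetic data:

```python
import numpy as np

def correlation_distance(real, synth):
    # Mean absolute difference between the two correlation matrices,
    # scaled to [0, 1]: entries of a correlation matrix differ by at
    # most 2, and 0 means identical correlation structure.
    c_real = np.corrcoef(real, rowvar=False)
    c_synth = np.corrcoef(synth, rowvar=False)
    return float(np.abs(c_real - c_synth).mean() / 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
print(correlation_distance(X, X))  # → 0.0: identical data, identical structure
```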

This applied paper is part of the newly added chapter in my book on synthetic data and explainable AI. The 289-page book, now in version 4.0 and accepted by Elsevier, is currently available here. In the PDF version (the only one currently available, viewable in any browser), all links — external and internal, in particular those to or from other chapters, the glossary, the references, and the index — are working. The datasets, Python code, and illustrations are also on my GitHub repository. See the table of contents and access sample chapters from here. You can download the free article “Data Synthetization: Enhanced GANs vs Copulas” from here.

*To not miss future articles, sign up for our newsletter, here.*

The post Data Synthetization Explained in One Picture first appeared on Machine Learning Techniques.

Dashed pink lines are associated with modeling techniques (generative AI, GMM) where synthetic data is obtained by simulating the underlying model using the parameter values estimated on the real data, that is, *q _{k}* =

The goal is to mimic the structure in the real data, not the real data itself. The structure is represented by a parametric configuration denoted as *p* in the real data. I use the notation *p*_{1}, …, *p*_{5} for the structures found in the 5 synthetic data sets. The quality *h _{k}* of the synthetic data set

**Source**: “Synthetic Data and Generative AI”, by Vincent Granville (273 pages, published in 2023), available here. The picture is from the preface. A full resolution, along with the table of contents, sample chapters and Python code, can be found on GitHub, here.
