Course: Synthetic Data and Interpretable Machine Learning

This live course is based on my book “Synthetic Data”, available here. Participants will receive a free copy of this book. The information below provides a brief overview of the course. To be notified about our next live classes, sign up for our newsletter. Currently, the course is offered via our certification program, described here.

Course Description

The performance of machine learning algorithms such as classification, clustering, regression, decision trees or neural networks can be significantly improved with synthetic data. It enriches training sets, allowing you to make predictions or assign a label to new observations that are significantly different from those in your dataset.

Synthetic data is especially useful if your training set is small or unbalanced. It also allows you to test the limits of your algorithms and find examples where they fail (for instance, failing to identify spam), to deal with missing data, and to create confidence regions for parameters. I will show how to design rich, good-quality synthetic data to meet all these goals. In particular, I illustrate how to rebalance data sets with synthetic data when some categories have very few observations (in fraud detection or clinical trials), how to reduce bias by adding good-quality synthetic observations for underrepresented minority groups, and how to anonymize your data to boost security and comply with privacy laws.

In this course taught by Dr. Vincent Granville, you will learn how to create your own synthetic data in Python. One example uses a real-life insurance data set: using copulas, you will create an alternate (synthetic) data set that closely matches the distribution of the observations in your training set, including all the correlations, and you will learn why someone would want to do that. Other examples include computer vision, data from the healthcare industry, time series, and animated data sets such as agent-based modeling and evolutionary processes (for instance, virus spreading), where you will also learn how to create insightful data videos in Python.
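To make the copula idea above more concrete, here is a minimal sketch of a Gaussian copula in Python. It is not the course's implementation: the function name `gaussian_copula_synthesize`, the toy two-column "insurance-like" data (age and charges), and the choice of a Gaussian copula with empirical marginals are all assumptions made for this illustration.

```python
import numpy as np
from statistics import NormalDist

# Standard normal CDF and inverse CDF, vectorized for NumPy arrays.
_STD_NORMAL = NormalDist()
_norm_cdf = np.vectorize(_STD_NORMAL.cdf)
_norm_ppf = np.vectorize(_STD_NORMAL.inv_cdf)


def gaussian_copula_synthesize(real, n_samples, seed=0):
    """Hypothetical helper: Gaussian copula with empirical marginals."""
    rng = np.random.default_rng(seed)
    real = np.asarray(real, dtype=float)
    n, d = real.shape

    # 1. Replace each value by its rank, rescaled to lie strictly in (0, 1).
    ranks = real.argsort(axis=0).argsort(axis=0) + 1
    u = ranks / (n + 1)

    # 2. Map the uniforms to normal scores and estimate their correlation.
    z = _norm_ppf(u)
    corr = np.corrcoef(z, rowvar=False)

    # 3. Sample new correlated normal vectors with that correlation.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)

    # 4. Map back to uniforms, then to the empirical quantiles of each column.
    u_new = _norm_cdf(z_new)
    columns = [np.quantile(real[:, j], u_new[:, j]) for j in range(d)]
    return np.column_stack(columns)


# Toy stand-in for a real data set (assumption: two correlated features).
rng = np.random.default_rng(42)
latent = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=5000)
real = np.column_stack([
    40 + 10 * latent[:, 0],          # "age"
    np.exp(7 + 0.5 * latent[:, 1]),  # "charges", right-skewed
])

synth = gaussian_copula_synthesize(real, n_samples=5000)
print("real correlation: ", np.corrcoef(real, rowvar=False)[0, 1])
print("synth correlation:", np.corrcoef(synth, rowvar=False)[0, 1])
```

In this sketch the synthetic values are interpolated between observed quantiles, so they never fall outside the range of the real data. Parametric marginals would be one way to lift that restriction.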

Emphasis is on evaluating and validating synthetic data to ensure its quality and minimize bias, and on best practices. By the end of this course, you will have the knowledge and tools to generate high-quality, realistic synthetic data for your projects. You will be able to quickly identify the strengths and weaknesses of each method (GANs, parametric copulas, noise injection), choose one based on your data or goal, and fine-tune or blend different methods to get the best results or minimize computing time.

Prerequisites

Participants should be familiar with Python or another scripting language. Foundations in matrix algebra, time series, calculus and optimization are especially useful. However, the course is unusually light in mathematics and especially statistics, as the instructor spent years simplifying many methods and explaining advanced concepts in simple English.

Who is this course for

This course is ideally suited to professionals with an analytic background. This includes data scientists, machine learning practitioners, engineers, software developers, analysts / business analysts, economists, quants, statisticians, scientists, and anyone dealing with data on a regular basis, whether as an individual contributor or in a senior role. Emphasis is on quick acquisition of key concepts, learning how to learn, and solving problems up to professional implementation and testing in Python.

Terrain generation and evolution (frame from synthetic data video)

What you will get out of the course

Master a number of techniques to generate and test rich synthetic data, and be able to quickly grasp future developments on this topic. Be able to complete enterprise-grade projects from beginning to end, ranging from regression to computer vision. Learn how to learn and become independent in solving future problems. Tasks performed during the training include writing Python code and using Python libraries, modeling and testing using cross-validation methods, implementing model-free techniques, feature and model selection, testing black-box systems using synthetic data, and state-of-the-art data animations (including data videos and sound) to present your results. Successful completion of the four modules comes with a personal recommendation (endorsement) on LinkedIn.

New time series models

Modules

The course is split into four modules.

Module 1: Introduction

What are synthetic data, generative models, explainable AI, and augmented data? What are the benefits and limitations? Outlined applications:

  1. Terrain generation, morphing and evolution.
  2. Curve fitting: estimating the shape of a meteorite with model-free confidence regions, for meteorite classification.
  3. Time series with double periodicity mimicking ocean tides.
  4. Synthetic tabular data with prespecified correlation matrix.
  5. Synthetic data to test or benchmark algorithms.
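
As a rough illustration of item 3 above, a time series with two periods can be sketched as the sum of two cosines with close periods. The periods used here are those of the two main semidiurnal tidal constituents (about 12.42 hours for the lunar one, 12.00 hours for the solar one). The function name `synthetic_tide`, the amplitudes and the noise level are assumptions made for this example, and this is not necessarily the model used in the course.

```python
import numpy as np


def synthetic_tide(hours, amp_lunar=1.0, amp_solar=0.4, noise_sd=0.05, seed=0):
    """Hypothetical helper: two-period tide-like signal plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = np.asarray(hours, dtype=float)
    lunar = amp_lunar * np.cos(2 * np.pi * t / 12.42)  # lunar semidiurnal period
    solar = amp_solar * np.cos(2 * np.pi * t / 12.00)  # solar semidiurnal period
    return lunar + solar + rng.normal(0.0, noise_sd, size=t.shape)


# 30 days sampled every half hour. The two close periods beat against each
# other, so the envelope of the signal swells and shrinks about every 14.8 days.
hours = np.arange(0, 24 * 30, 0.5)
tide = synthetic_tide(hours)
print(tide[:5])
```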

The next modules offer a deep dive into many of the topics quickly summarized here. In Module 2, I discuss explainable ML techniques that will be used in Modules 3 and 4 on synthetic data. Modules 3 and 4 deal with the generation of synthetic data.

Module 2: Interpretable Machine Learning

Some of the techniques presented here are used in the next two modules focusing on synthetic data. Before diving into these techniques, I discuss data cleaning automation, data animation (data videos), and simplicity (illustrated by a case study: marketing attribution without math). The new machine learning techniques introduced include:

  1. Generic unsupervised regression: covers all regression techniques and more, including an alternative to K-means.
  2. Time series with double period (mimicking ocean tides).
  3. Interpretable regression.
  4. Simplified ensemble method, alternative to XGBoost.
  5. Superimposed spatial point processes and alternative to GMM and GAN.

Module 3: Synthetic Data in Computer Vision

In this module I cover terrain generation, including 3D contour plots, and emulation of GPU clustering with techniques similar to deep neural networks. Depending on the interest of participants, I may cover shape generation or other evolutionary processes such as synthetic star clusters to understand the possible evolution of our universe. I also discuss nearest neighbor and collision graphs (such as this one), all synthetically generated. Some of the synthetic data videos that you will be able to produce can be seen here and here.

Module 4: Tabular Data Generation

This type of data is traditionally used in the banking, insurance and finance industries. Synthetic data has become very popular in these sectors, as it helps reduce discrimination and algorithm bias, and contributes to the protection of personal data, explainable AI, and compliance with various regulations. In this module we will build a synthetic data set with a prespecified correlation matrix, such as those estimated on real-life data sets.
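
One common way to impose a prespecified correlation matrix between features is a Cholesky factorization. The sketch below is only a minimal illustration under stated assumptions: the function name `sample_with_correlation`, the 3 x 3 target matrix, and the use of Gaussian features are hypothetical choices, not the method taught in the module.

```python
import numpy as np


def sample_with_correlation(target_corr, n_rows, seed=0):
    """Hypothetical helper: Gaussian features with a chosen correlation matrix."""
    rng = np.random.default_rng(seed)
    target_corr = np.asarray(target_corr, dtype=float)
    # Factor the target as L @ L.T (requires a positive definite matrix).
    lower = np.linalg.cholesky(target_corr)
    # Independent standard normal features, one column per variable.
    z = rng.standard_normal((n_rows, target_corr.shape[0]))
    # Each row z @ L.T has covariance L @ L.T, that is, the target matrix.
    return z @ lower.T


# Illustrative target, for instance a matrix estimated on a real data set.
target = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
])
table = sample_with_correlation(target, n_rows=20000)
print(np.round(np.corrcoef(table, rowvar=False), 2))
```

The empirical correlations match the target only up to sampling error, which shrinks as the number of rows grows.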

Testimonials

These testimonials pertain to the training material published by the author.

  • I find all the materials you shared on your website extremely useful. I will share this with my colleagues who started their journey in machine learning. Again thank you for being connected on LinkedIn.  — Jackson Andreas Pola
  • Always your materials are supportive. Most of my students used to review your online materials. You might not know but frankly your impact is very noticeable specially for low-income University students. — Mohammed Alshahrani
  • Very interesting your last article “The sound that the data make”. Would you be interested, once I have introduced my students to the basics, in participating in one of the classes online? Showing them your work. — Isabel Marín
  • Thank you Vincent, I appreciate your operational excellence and resources. You are an invaluable resource to the community! — Milan McGraw

About the Instructor

Vincent Granville is a pioneering data scientist and machine learning expert, founder of MLTechniques.com, co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).  

Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, available here. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.
