New Book: Synthetic Data and Generative AI

Synthetic data is increasingly used to augment real-life datasets, enriching them so that black-box systems can correctly classify observations or predict values well outside the training and validation sets. It also helps explain decisions made by opaque systems such as deep neural networks, contributing to the development of explainable AI, and it helps with unbalanced data, for instance in fraud detection. Finally, since synthetic data is not directly linked to real people or transactions, it offers protection against data leakage. More generally, it helps eliminate algorithmic bias and privacy issues, and contributes to increased security.

This book is the culmination of the author's years of research on the topic. Emphasis is on methodological aspects and original contributions, favoring simplicity. It integrates all the material from the previous book “Intuitive Machine Learning and explainable AI”, and contains all but the most advanced math from the book on stochastic simulations. The author also added more recent advances, with applications to insurance data synthesized with copulas, terrain generation (with animated data), synthetic universes and experimental math. The latter is an infinite source of synthetic data to build and benchmark new machine learning techniques. Conversely, mathematics benefits from these techniques to uncover new insights related to the most famous unsolved math problems. Chapter 14 on the Riemann Hypothesis illustrates this point, with new state-of-the-art research results on the subject.

Terrain generation, evolution, and morphing (video frame, see chapter 11)

Topics cover computer vision, natural language processing, tabular data, time series, geospatial and sound data, supervised classification, clustering, agent-based modeling, generative models, nearest neighbors and collision graphs, data-driven inference, prediction (all regression techniques are unified under a single, easy-to-understand method), deep neural networks, modeling without response (unsupervised regression such as circle or curve fitting), constrained optimization, and more. The author introduces a simple alternative to XGBoost, one of the most efficient ensemble methods; it is applied to an NLP problem: categorizing and ranking articles and blog posts to predict future performance. When needed, modern or new statistical learning techniques are introduced: dual confidence regions, a new test of independence, parametric bootstrap, the Rayleigh test, distribution-free logistic regression, proxy estimation and minimum contrast estimators, as well as a new prime test for strong pseudo-random number generators.

About 15% of the content is well-documented Python code. The code is also on GitHub, spread across multiple top-level folders, and unified for the first time in this book. It constitutes a solid introduction to scientific computing.

Author, Publisher, Table of Contents

Version 3.0, published in January 2023. Author and publisher: Vincent Granville, Ph.D., founder of a private, self-funded machine learning research lab,

The book is available in PDF format (272 pages) with numerous high-quality color illustrations and clickable links to fundamental concepts described on Wikipedia, if you ever need a refresher on the basics. You can view it, for instance, in the Chrome browser: press Ctrl-O and select the book. All navigation features are then available, and links in the book open with one click. To view the table of contents, list of figures and tables, bibliography, glossary and index, follow this link.

To purchase the book, follow this link.
