Synthetic data is increasingly used to augment real-life datasets, enriching them and allowing black-box systems to correctly classify observations or predict values that lie well outside the training and validation sets. It also helps explain decisions made by opaque systems such as deep neural networks, contributing to the development of explainable AI, and it helps with unbalanced data, for instance in fraud detection. Because synthetic data is not directly linked to real people or transactions, it offers protection against data leakage. More generally, it helps reduce algorithmic bias and privacy issues, and contributes to increased security.
This book is the culmination of the author's years of research on the topic. The emphasis is on methodological aspects and original contributions, favoring simplicity. The book integrates all the material from the previous book “Intuitive Machine Learning and explainable AI”, and it also contains all but the most advanced math from the book on stochastic simulations. The author has also added more recent advances, with applications to terrain generation (with animated data), synthetic universes and experimental math. The latter is an infinite source of synthetic data for building and benchmarking new machine learning techniques. Conversely, mathematics benefits from these techniques to uncover new insights related to the most famous unsolved math problems. The chapter on the Riemann Hypothesis illustrates this point, with new state-of-the-art research results on the subject.
Topics cover generative adversarial networks (GANs), computer vision, natural language processing, tabular data, time series, geospatial and sound data, supervised classification, clustering, agent-based modeling, generative models, nearest neighbors and collision graphs, data-driven inference, prediction (all regression techniques are unified under a single, easy-to-understand method), deep neural networks, modeling without response (unsupervised regression such as circle or curve fitting), constrained optimization, copulas, and more.
The author introduces a simple alternative to XGBoost, one of the most efficient ensemble methods, and applies it to an NLP problem: categorizing and ranking articles and blog posts to predict future performance. When needed, modern or new statistical learning techniques are introduced: dual confidence regions, a new test of independence, parametric bootstrap, the Rayleigh test, distribution-free logistic regression, proxy estimation and minimum contrast estimators, as well as a new prime test for strong pseudo-random number generators. Several real-life datasets are discussed in detail.
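To give a flavor of one technique named above, here is a minimal sketch of a parametric bootstrap. It is not taken from the book: the function name `parametric_bootstrap_ci`, the choice of a normal model, and the simulated input data are all assumptions made for illustration. The idea is to fit a parametric model to the observed sample, draw many synthetic samples from the fitted model, and use the spread of the re-computed estimate to build a confidence interval.

```python
import random
import statistics


def parametric_bootstrap_ci(sample, n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the mean.

    Assumption (illustrative only): the data follow a normal distribution,
    so the fitted model is N(mean, stdev) estimated from the sample.
    """
    rng = random.Random(seed)
    n = len(sample)

    # Step 1: fit the parametric model to the observed data.
    mu_hat = statistics.fmean(sample)
    sigma_hat = statistics.stdev(sample)

    # Step 2: draw synthetic samples from the fitted model and
    # re-compute the estimator on each of them.
    boot_means = []
    for _ in range(n_boot):
        synthetic = [rng.gauss(mu_hat, sigma_hat) for _ in range(n)]
        boot_means.append(statistics.fmean(synthetic))

    # Step 3: read the interval off the empirical percentiles.
    boot_means.sort()
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mu_hat, (lower, upper)


# Hypothetical "observed" data, simulated here so the example is self-contained.
data_rng = random.Random(42)
observed = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

estimate, (lower, upper) = parametric_bootstrap_ci(observed)
print(f"mean estimate: {estimate:.3f}, 95% CI: ({lower:.3f}, {upper:.3f})")
```

The sketch uses only the Python standard library. A different model (for instance exponential or Poisson) would change only the fitting step and the sampling call; the overall structure stays the same.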
About 15% of the content is well-documented Python code. The code is also on GitHub, spread across multiple top-level folders, and unified for the first time in this book. It constitutes a solid introduction to scientific computing.
Author, Publisher, Table of Contents
The book is available in PDF format (292 pages), with numerous high-quality color illustrations and clickable links to fundamental concepts described on Wikipedia, in case you need a refresher on the basics. You can view it, for instance, in the Chrome browser: press Ctrl-O and select the book. All navigation features and links in the book are then accessible with one click. To view the table of contents, list of figures and tables, bibliography, glossary and index, follow this link.