Machine Learning Dictionary

Top entries are in bold, and sub-entries are in italics. This dictionary is from my new book “Intuitive Machine Learning and Explainable AI”, available here and used as reference material for the course with the same name (see here). These entries are cross-referenced in the book to facilitate navigation, with backlinks to the pages where they appear. The index, also with clickable backlinks, is a more comprehensive listing with 300+ terms. Both the glossary and index are available in PDF format here on my GitHub repository, and of course with clickable links within the book.

Clustering on synthetic data created with generative mixture model

Autoregressive processes. Auto-correlated time series. Time-continuous versions include Gaussian processes and Brownian motions, while random walks are a discrete example; two-dimensional versions exist. These processes are essentially integrated white noise.
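
As a minimal sketch of the idea, assuming a first-order autoregressive process with Gaussian white noise, the code below simulates an auto-correlated series. The function names and the parameter phi are illustrative choices, not notation from the book. Setting phi to 1 accumulates the noise and yields a random walk.

```python
import random

def simulate_ar1(n, phi, sigma=1.0, seed=0):
    """Simulate x[t] = phi * x[t-1] + e[t], where e[t] is Gaussian white noise."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, sigma))
    return x

def lag1_autocorrelation(x):
    """Sample correlation between the series and itself shifted by one step."""
    mean = sum(x) / len(x)
    num = sum((a - mean) * (b - mean) for a, b in zip(x[:-1], x[1:]))
    den = sum((a - mean) ** 2 for a in x)
    return num / den

series = simulate_ar1(5000, phi=0.8)       # stationary, auto-correlated series
random_walk = simulate_ar1(5000, phi=1.0)  # phi = 1: cumulated white noise
```

For a long series, the lag-1 autocorrelation of the first example should be close to the chosen phi.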

Binning. Feature binning consists of aggregating the values of a feature into a small number of bins, to avoid overfitting and reduce the number of nodes in methods such as naive Bayes, neural networks or decision trees. Binning can be applied to two or more features simultaneously. I discuss optimum binning in my book.
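
To make this concrete, here is a small sketch of equal-width binning for a single feature. It is only one of several possible schemes and is not the optimum binning discussed in the book; the function name and the sample ages are made up for illustration.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls in the first bin.
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 22, 25, 31, 40, 47, 52, 63, 70, 78]  # hypothetical feature values
bins = equal_width_bins(ages, 3)
print(bins)  # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```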

Boosted model. Blending of several models to get the best of each one, also referred to as ensemble methods. The concept is illustrated with hidden decision trees in my book. Other popular examples are gradient boosting and AdaBoost.

Bootstrapping. A data-driven, model-free technique to estimate parameter values, to optimize goodness-of-fit metrics. Related to resampling in the context of cross-validation. In my book, I discuss parametric bootstrap on synthetic data that mimics the actual observations.
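
The sketch below illustrates the plain nonparametric bootstrap (resampling the observations with replacement), not the parametric variant covered in the book. The data values, function names, and the choice of a percentile interval for the mean are assumptions made for the example.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Recompute the mean on resamples drawn with replacement from the sample."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

def percentile_interval(estimates, level=0.95):
    """Interval between the lower and upper tail percentiles of the estimates."""
    ordered = sorted(estimates)
    alpha = (1 - level) / 2
    lo = ordered[int(alpha * len(ordered))]
    hi = ordered[int((1 - alpha) * len(ordered)) - 1]
    return lo, hi

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 2.5]  # made-up observations
means = bootstrap_means(data)
low, high = percentile_interval(means)
```

The spread of the resampled means gives a data-driven sense of how uncertain the estimated mean is, without assuming a particular model.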

Confidence Region. A confidence region of level γ is a 2D set of minimum area covering a proportion γ of the mass of a bivariate probability distribution. It is a 2D generalization of confidence intervals. In my book, I also discuss dual confidence regions, the analog of credible regions in Bayesian inference.

Cross-validation. Standard procedure used in bootstrapping, and to test and validate a model, by splitting your data into training and validation sets. Parameters are estimated based on training set data. An alternative to cross-validation is testing your model on synthetic data with known response.
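
One common way to organize the splits is k-fold cross-validation, sketched below under the assumption that observations are identified by their index and are already in random order. The function name is hypothetical, and in practice you would shuffle the indices first.

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k folds of nearly equal size."""
    # The first n % k folds receive one extra observation.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(len(train), validation)
```

Each observation appears in exactly one validation fold, so every data point is used once for validation and k - 1 times for training.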

Decision trees. A simple, intuitive non-linear modeling technique used in classification problems. It can handle missing and categorical data, as well as a large number of features, but requires appropriate feature binning. Typically, one blends multiple binary trees, each with a few nodes, to boost performance.

Dimension reduction. A technique to reduce the number of features in your dataset while minimizing the loss in predictive power. The best-known approaches are principal component analysis and feature selection to maximize goodness-of-fit metrics.

Empirical distribution. Cumulative frequency histogram attached to a statistic (for instance, nearest neighbor distances), and based on observations. When the number of observations tends to infinity and the bin sizes tend to zero, this step function tends to the theoretical cumulative distribution function of the statistic in question.
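
A short sketch of the step function described above: the empirical cumulative distribution evaluated at x is the fraction of observations less than or equal to x. The helper name make_ecdf and the sample values are illustrative.

```python
import bisect

def make_ecdf(observations):
    """Return a function giving the proportion of observations <= x."""
    ordered = sorted(observations)
    n = len(ordered)

    def ecdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect.bisect_right(ordered, x) / n

    return ecdf

ecdf = make_ecdf([3, 1, 4, 1, 5])
print(ecdf(0), ecdf(1), ecdf(4), ecdf(5))  # 0.0 0.4 0.8 1.0
```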

Ensemble methods. A technique consisting of blending multiple models together, such as many decision trees with logistic regression, to get the best of each method and outperform each method taken separately. Examples include boosting, bagging, and AdaBoost. In my book, I discuss hidden decision trees.

Explainable AI. Automated machine learning techniques that are easy to interpret are referred to as interpretable machine learning or explainable artificial intelligence. As much as possible, the methods discussed in my book belong to that category. The goal is to design black-box systems less likely to generate unexpected results with unintended consequences.

Feature selection. Features, as opposed to the model response, are also called independent variables or predictors. Feature selection, akin to dimension reduction, aims at finding the minimum subset of variables with enough predictive power. It is also used to eliminate redundant features and find causality (typically using hierarchical Bayesian models), as opposed to mere correlations. Sometimes, two features have poor predictive power when taken separately, but provide improved predictions when combined.

Goodness-of-fit. A model fitting criterion or metric to assess how well a model or sub-model fits a dataset, or to measure its predictive power on a validation set. Examples include R-squared, Chi-squared, Kolmogorov-Smirnov, error rates such as the false positive rate, and other metrics discussed in my book.

Gradient methods. Iterative optimization techniques to find the minimum or maximum of a function, such as the maximum likelihood. When there are numerous local minima or maxima, use swarm optimization. Gradient methods (for instance, stochastic gradient descent or Newton’s method) assume that the function is differentiable. If not, other techniques such as Monte Carlo simulations or the fixed-point algorithm can be used. Constrained optimization involves using Lagrange multipliers.
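
As a minimal illustration, assuming a differentiable function of one variable with a known derivative, plain gradient descent repeatedly steps against the gradient. The function name, learning rate, and test function are choices made for this sketch.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=200):
    """Minimize a 1D function by stepping against its derivative grad(x)."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

# Example: f(x) = (x - 3)^2 has derivative 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

On this convex example the iterates converge to the unique minimum. With several local minima, the result depends on the starting point x0.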

Graph structures. Graphs are found in decision trees, in neural networks (connections between neurons), in nearest neighbor methods (NN graphs), in hierarchical Bayesian models, and more.

Hyperparameter. A hyperparameter is used to control the learning process: for instance, the dimension, the number of features, parameters, layers (neural networks) or clusters (clustering problem), or the width of a filtering window in image processing. By contrast, the values of other parameters (typically node weights in neural networks or regression coefficients) are derived via training.

Link function. A link function maps a nonlinear relationship to a linear one so that a linear model can be fit, and then mapped back to the original form using the inverse function. For instance, the logit link function is used in logistic regression. Generalizations include quantile functions and inverse sigmoids in neural networks to work with additive (linear) parameters.
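
For the logit example, a brief sketch of the link and its inverse follows. The function names are illustrative; the formulas are the standard logit and logistic (sigmoid) functions.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the real line (log-odds)."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map a real-valued linear predictor back to a probability."""
    return 1 / (1 + math.exp(-z))

z = logit(0.2)           # log-odds, where a linear model can be fit
p = inverse_logit(z)     # back on the probability scale, approximately 0.2
print(z, p)
```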

Logistic regression. A generalized linear regression method where the binary response (fraud/non-fraud or cancer/non-cancer) is modeled as a probability via the logistic link function. Alternatives to the iterative maximum likelihood solution are discussed in my book.

Neural network. A blackbox system used for predictions, optimization, or pattern recognition, especially in computer vision. It consists of layers, neurons in each layer, link functions to model non-linear interactions, parameters (weights associated with the connections between neurons) and hyperparameters. Networks with several layers are called deep neural networks. Also, neurons are sometimes called nodes.

NLP. Natural language processing is a set of techniques to deal with unstructured text data, such as emails, automated customer support, or webpages downloaded with a crawler. An example discussed in my book deals with creating a keyword taxonomy based on parsing Google search results pages.

Numerical stability. Numerical instability occurs in optimization problems, typically those with multiple minima or maxima. It is frequently overlooked and leads to poor predictions or high volatility. Such problems are sometimes referred to as ill-conditioned. I explain how to fix it in several examples in my book. Not to be confused with numerical precision.

Overfitting. Using too many unstable parameters resulting in excellent performance on the training set, but poor performance on future data or on the validation set. It typically occurs with numerically unstable procedures such as regression (especially polynomial regression) when the training set is not large enough, or in the presence of wide data (more features than observations) when using a method not suited to this situation. The opposite is underfitting.

Predictive power. A metric to assess the goodness-of-fit or performance of a model or subset of features, for instance in the context of dimensionality reduction or feature selection. Typical metrics include R-squared, or confusion matrices in classification.

R-squared. A goodness-of-fit metric to assess the predictive power of a model, measured on a validation set. Alternatives include adjusted R-squared, mean absolute error and other metrics discussed in my book.
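
For reference, a small sketch of the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. The observed and predicted values below are invented for the example.

```python
def r_squared(observed, predicted):
    """R-squared = 1 - SS_residual / SS_total."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1 - ss_res / ss_tot

observed = [1, 2, 3, 4, 5]               # hypothetical validation responses
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]    # hypothetical model predictions
score = r_squared(observed, predicted)   # approximately 0.99
```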

Random number. Pseudo-random numbers are sequences of binary digits, usually grouped into blocks, satisfying properties of independent Bernoulli trials. In my book, the concept is formally defined, and strong pseudo-random number generators are built and used in computer-intensive simulations.

Regression methods. I discuss a unified approach to all regression problems in chapter 1 in my book. Traditional techniques include linear, logistic, Bayesian, polynomial and Lasso regression (to deal with numerical instability and overfitting), solved using optimization techniques, maximum likelihood methods, linear algebra (eigenvalues and singular value decomposition) or stepwise procedures.

Supervised learning. Techniques dealing with labeled data (classification) or when the response is known (regression). The opposite is unsupervised learning, for instance clustering problems. In-between, you have semi-supervised learning and reinforcement learning (favoring good decisions). The technique described in chapter 1 in my book fits into unsupervised regression. Adversarial learning is testing your model against extreme cases intended to make it fail, to build better models.

Synthetic data. Artificial data simulated using a generative model, typically a mixture model, to enrich existing datasets and improve the quality of training sets. Called augmented data when blended with real data.
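
A minimal sketch of a generative mixture model, assuming one-dimensional Gaussian components: each point is produced by first picking a component according to its weight, then sampling from that component. The function name, component parameters, and seed are hypothetical.

```python
import random

def sample_mixture(n, components, seed=0):
    """Draw n points from a Gaussian mixture.

    components is a list of (weight, mean, std) tuples.
    Returns the points and the index of the component that generated each one.
    """
    rng = random.Random(seed)
    weights = [w for w, _, _ in components]
    points, labels = [], []
    for _ in range(n):
        # Pick a component with probability proportional to its weight.
        k = rng.choices(range(len(components)), weights=weights)[0]
        _, mu, sd = components[k]
        points.append(rng.gauss(mu, sd))
        labels.append(k)
    return points, labels

points, labels = sample_mixture(1000, [(0.3, 0.0, 1.0), (0.7, 5.0, 1.0)])
```

Because the generating component is known for every point, such data comes with ground-truth labels, which is useful for testing clustering methods.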

Tensor. Matrix generalization with three or more dimensions. A matrix is a two-dimensional tensor. A triple summation with three indices is represented by a three-dimensional tensor, while a double summation involves a standard matrix.
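
The link between summations and tensors can be sketched with nested lists. The arrays T, M, a, b, and c below are arbitrary illustrative values.

```python
# A 2 x 2 x 2 tensor stored as nested lists: T[i][j][k].
T = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]
# A 2 x 2 matrix (two-dimensional tensor): M[i][j].
M = [[1, 2],
     [3, 4]]
a, b, c = [1, 0], [0, 1], [1, 1]

# Triple summation over three indices uses the three-dimensional tensor.
triple = sum(T[i][j][k] * a[i] * b[j] * c[k]
             for i in range(2) for j in range(2) for k in range(2))

# Double summation over two indices uses the matrix.
double = sum(M[i][j] * a[i] * b[j] for i in range(2) for j in range(2))

print(triple, double)  # 7 2
```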

Training set. Dataset used to train your model in supervised learning. Typically, a portion of the training set is used to train the model, and the other part is used as the validation set.

Validation set. A portion of your training set, typically 20%, used to measure the actual performance of your predictive algorithm outside the training set. In cross-validation and bootstrapping, the training and validation sets are split into multiple subsets to get a better sense of variations in the predictions.
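
A simple holdout split along these lines can be sketched as follows, assuming the observations are stored in a list and that a random 20% is set aside. The function name and the seed are illustrative.

```python
import random

def train_validation_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and set aside a fraction for validation."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_val = int(round(len(rows) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

rows = list(range(50))          # stand-in for 50 observations
train, validation = train_validation_split(rows)
print(len(train), len(validation))  # 40 10
```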
