Little Known Secrets about Interpretable Machine Learning on Synthetic Data

This first article in a new series on synthetic data and explainable AI focuses on making linear regression more meaningful and controllable. Topics include synthetic data, advanced machine learning with Excel, combinatorial feature selection, the parametric bootstrap, cross-validation, and alternatives to R-squared for measuring model performance. The full technical article (PDF, 13 pages, with detailed explanations and access to the spreadsheets) is available here in the “Free Books and Articles” section, under the title “Interpretable Machine Learning on Synthetic Data”.

Summary

The technique discussed here handles a large class of problems. In this article, I focus on a simple one: linear regression. I solve it with an iterative, fixed-point algorithm that bears some resemblance to gradient boosting, using machine learning methods and explainable AI rather than traditional statistics. In particular, the algorithm does not use matrix inversion. It is easy to implement in Excel (I provide my spreadsheet) or to automate as a black-box system. It is also numerically stable, and it generalizes to non-linear problems. Unlike the traditional statistical solution, which can produce meaningless regression coefficients, the coefficients output here are easier to understand, leading to better interpretation.
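The article's exact update rule is in the PDF; purely as an illustration of what a matrix-inversion-free, fixed-point approach can look like, here is a minimal Python sketch. The gradient-style update, learning rate, and iteration count are my own choices, not the article's:

```python
import numpy as np

def fixed_point_regression(X, y, learning_rate=0.01, n_iter=5000):
    """Fit linear regression coefficients iteratively, with no matrix
    inversion: each step nudges the coefficients along the residual
    gradient, a fixed-point iteration reminiscent of gradient boosting."""
    n, m = X.shape
    beta = np.zeros(m)
    for _ in range(n_iter):
        residuals = y - X @ beta                       # current prediction error
        beta += learning_rate * (X.T @ residuals) / n  # fixed-point update
    return beta

# Toy check on noiseless synthetic data: the iteration converges
# to the exact coefficients without ever inverting a matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta
beta = fixed_point_regression(X, y)
```

Because each step is a simple matrix-vector product, the same loop translates directly to spreadsheet formulas, which is what makes an Excel implementation feasible.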

I tested it on a rich collection of synthetic data sets: it performs just as well as the standard technique, even after adding noise to the data. I then show how to measure the impact of individual features, or groups of features (and feature interaction), on the solution. A model with m features has 2^m sub-models. I show how to draw more insights by analyzing the performance of each sub-model. Finally, I introduce a new metric called score to measure model performance. Based on comparison with the base model, it is more meaningful than R-squared or mean squared error.
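To make the sub-model analysis concrete, here is a hedged sketch that enumerates every non-empty feature subset and measures validation error for each. The article's actual fitter and performance metric may differ; I use ordinary least squares and MSE purely for illustration:

```python
import numpy as np
from itertools import combinations

def submodel_errors(X, y, X_val, y_val):
    """Fit a model on every non-empty feature subset (2^m - 1 of them)
    and return the validation MSE of each, keyed by the subset."""
    m = X.shape[1]
    errors = {}
    for k in range(1, m + 1):
        for subset in combinations(range(m), k):
            cols = list(subset)
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            pred = X_val[:, cols] @ beta
            errors[subset] = float(np.mean((y_val - pred) ** 2))
    return errors

# Illustration: 3 features, so 2^3 - 1 = 7 non-empty sub-models.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, 0.0, -2.0])   # feature 1 is irrelevant by design
errors = submodel_errors(X[:100], y[:100], X[100:], y[100:])
```

Comparing the error of each subset against the full model reveals which features (and which combinations) actually carry predictive power, which is the kind of insight a cross-correlation table alone cannot give.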

Scatterplots: observed data Y versus predictions (blue) and versus exact, unobserved data (orange)

Content

The article covers the following topics:

  • Mathematical setting, explained using elementary matrix algebra. It also delves into convergence analysis, numerical stability and eigenvalues, though this material can be skipped by non-technical readers. There is no matrix inversion involved in the methodology.
  • Data-driven, model-free confidence and prediction intervals, as well as a new, better alternative to R-squared, called “score”, to measure performance. These general techniques apply to a wide range of prediction algorithms well beyond linear regression.
  • A spreadsheet with all the data, computations and results. In particular, you will be able to easily create your own synthetic data, the right way. The synthetic data consists of a training set, and a validation set. Performance metrics are measured on the validation set.
  • A deep dive on all feature combinations, computing the performance and regression coefficients for each subset of features. This analysis yields interesting insights regarding feature importance and feature interaction, much deeper than just looking at the cross-correlation table.
  • Synthetic data generation allows you to simulate the “exact” (unobserved, unknown) data and thus the exact regression coefficients, as well as the observed data. The observed data is a mixture of exact data and simulated noise. I introduced more noise in the validation set than in the training set, to make the simulations more realistic.
  • You can test the methodology on millions of very rich synthetic data sets. This is a big contrast with analyses based on real data. In particular, the following elements are simulated with randomization governed by hyper-parameters: the exact regression coefficients, the exact and observed data, the noise, the training and validation data, and the cross-correlation matrix for the features and observed response.
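As an illustration of the kind of generator described above, the following sketch simulates exact coefficients, exact (unobserved) data, and noisier observed data for the validation set. All function names, distributions, and noise levels here are my own assumptions, not the article's:

```python
import numpy as np

def make_synthetic(n_train=100, n_val=100, m=3, noise_train=0.5,
                   noise_val=1.0, seed=42):
    """Simulate exact (unobserved) data and noisy observed data.
    Observed response = exact response + Gaussian noise; the
    validation set gets more noise than the training set."""
    rng = np.random.default_rng(seed)
    beta_exact = rng.uniform(-2, 2, size=m)    # exact regression coefficients
    X = rng.normal(size=(n_train + n_val, m))  # features
    y_exact = X @ beta_exact                   # exact, unobserved response
    y_obs = y_exact.copy()
    y_obs[:n_train] += rng.normal(0, noise_train, n_train)
    y_obs[n_train:] += rng.normal(0, noise_val, n_val)
    train = (X[:n_train], y_obs[:n_train])
    val = (X[n_train:], y_obs[n_train:])
    return train, val, beta_exact

(train_X, train_y), (val_X, val_y), beta_exact = make_synthetic()
```

Because the exact coefficients are known by construction, any fitted model can be benchmarked against the ground truth, something impossible with real data.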

Right plot: residual error vs. observed data (used to design model-free prediction intervals)
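A residual plot like this one is the main ingredient for model-free prediction intervals: attach empirical quantiles of the observed residuals to new point predictions. A minimal sketch of that idea follows; the confidence level and quantile method are my choices, not necessarily the article's:

```python
import numpy as np

def prediction_interval(residuals, y_pred, level=0.90):
    """Model-free prediction interval: shift the point prediction by
    empirical quantiles of observed residuals, with no distributional
    assumption on the errors."""
    alpha = (1 - level) / 2
    lo, hi = np.quantile(residuals, [alpha, 1 - alpha])
    return y_pred + lo, y_pred + hi

# Illustration with synthetic residuals around a point prediction of 5.0.
rng = np.random.default_rng(2)
residuals = rng.normal(0, 1, 1000)
lower, upper = prediction_interval(residuals, 5.0)
```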

Below are the conclusions from my research on this topic.

Conclusion

Using linear regression as an example, I illustrate how to turn the obscure output of a machine learning technique into an interpretable solution. The method described here also shows the power of synthetic data, when properly generated. The use of synthetic data offers a big benefit: you can test and benchmark algorithms on millions of very different data sets, all at once. I also introduce a new model performance metric, superior to R-squared in many respects, and based on cross-validation.

The methodology leads to a very good approximation, almost as good as the exact solution on noisy data, obtained in a few iterations, with natural, easy-to-interpret regression coefficients, while avoiding over-fitting. In fact, given a specific data set, many very different sets of regression coefficients lead to almost identical predictions. It makes sense to choose the ones that offer the best compromise between exactness and interpretability.

My solution, which does not require matrix inversion, is also simple compared to traditional methods. Indeed, it can easily be implemented in Excel, without any coding. Despite the absence of a statistical model, I also show how to compute confidence intervals, using parametric and non-parametric bootstrap techniques.
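As an illustration of the bootstrap idea (the article covers both parametric and non-parametric versions; this sketch shows only the non-parametric one, with the number of resamples and confidence level being my own choices):

```python
import numpy as np

def bootstrap_ci(X, y, n_boot=500, level=0.95, seed=0):
    """Non-parametric bootstrap confidence intervals for regression
    coefficients: refit on rows resampled with replacement, then take
    empirical quantiles of the refitted coefficients."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    betas = np.empty((n_boot, m))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)   # resample rows with replacement
        betas[b], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    alpha = (1 - level) / 2
    lower, upper = np.quantile(betas, [alpha, 1 - alpha], axis=0)
    return lower, upper

# Illustration on noisy synthetic data with known coefficients.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.5, -0.5, 2.0])
y = X @ beta_true + rng.normal(0, 0.2, 200)
lower, upper = bootstrap_ci(X, y)
```

No distributional assumption on the errors is needed: the resampling itself supplies the sampling variability, which is what makes the approach work in the absence of a statistical model.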

The full technical document (13 pages, with spreadsheet, synthetic data, detailed computations and explanations) is accessible from our resource repository, here. Check out the “Free Books and Articles” section.

About the Author

Vincent Granville is a machine learning scientist, author and publisher. He was the co-founder of Data Science Central (acquired by TechTarget) and, most recently, is the founder of MLtechniques.com. The following links point to some of his recent articles: synthetic data, interpretable ML, and ML with Excel. To not miss future articles, subscribe to his newsletter, here.
