Advanced Machine Learning with Basic Excel: Simple Alternative to XGBoost

Entitled “Advanced Machine Learning with Basic Excel”, the full version in PDF format is accessible in the “Free Books and Articles” section, here. It is also discussed in detail, with Python code, in chapter 2 of my book “Intuitive Machine Learning and Explainable AI”, available here.

I discuss ensemble methods combining many mini decision trees, blended with regression, explained in simple English with both Excel and Python implementations. Case study: a natural language processing (NLP) problem. Ideal reading for professionals who want to start light with machine learning (say with Excel) and move quickly to much more advanced material and Python. The Python code is not just a call to some black-box functions, but a full-fledged, detailed procedure on its own. This algorithm is in the same category as boosting, bagging, stacking and AdaBoost.

Abstract

The method described here illustrates the concept of ensemble methods, applied to a real-life NLP problem: ranking articles published on a website to predict the performance of future blog posts yet to be written, and to help decide on the title and other features to maximize traffic volume and quality, and thus revenue. The method, called hidden decision trees (HDT), implicitly builds a large number of small usable (possibly overlapping) decision trees. Observations that don’t fit in any usable node are classified with an alternate method, typically simplified logistic regression.
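To make the hybrid idea concrete, here is a minimal Python sketch. It is not the code from the article: the function names (`fit_nodes`, `predict`), the `min_size` threshold, the toy data, and the definition of a node as one exact combination of binary title features are all assumptions made for illustration. The fallback here is a plain global average standing in for the simplified logistic regression mentioned above.

```python
from collections import defaultdict
from statistics import mean

def fit_nodes(rows, scores, min_size=2):
    """Group observations by their exact feature combination (a "node").

    A node is kept as usable only if it holds at least `min_size`
    observations (a hypothetical usability rule). Returns a dict
    mapping each usable node to its average score.
    """
    groups = defaultdict(list)
    for features, score in zip(rows, scores):
        groups[tuple(features)].append(score)
    return {key: mean(vals) for key, vals in groups.items()
            if len(vals) >= min_size}

def predict(features, nodes, fallback):
    """Use the node average when the node is usable, else the fallback model."""
    key = tuple(features)
    return nodes[key] if key in nodes else fallback(features)

# Toy data (made up): flags = (title contains "python", title contains "excel")
rows = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 1), (1, 1)]
scores = [10.0, 14.0, 4.0, 6.0, 5.0, 20.0]

nodes = fit_nodes(rows, scores, min_size=2)

# Stand-in fallback: the global average score, NOT logistic regression.
global_mean = mean(scores)

def fallback(features):
    return global_mean

print(predict((1, 0), nodes, fallback))  # served by a usable node
print(predict((1, 1), nodes, fallback))  # node too small, served by the fallback
```

In this toy example, the combination `(1, 1)` appears only once, so it is not a usable node and its prediction comes from the fallback. Raising `min_size` shifts more observations to the fallback model; lowering it relies more on the small trees.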

This hybrid procedure offers the best of both worlds: decision tree combos and regression models. It is intuitive and simple to implement. The code is written in Python, and I also offer a light version in basic Excel. The interactive Excel version is targeted to analysts interested in learning Python or machine learning. HDT fits in the same category as bagging, boosting, stacking and AdaBoost. This article encourages you to understand all the details, upgrade the technique if needed, and play with the full code or spreadsheet as if you wrote it yourself. This is in contrast with using black-box Python functions without understanding their inner workings and limitations. Finally, I discuss how to build model-free confidence intervals for the predicted values.
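As a rough illustration of what a model-free interval can look like, the sketch below builds a percentile-based interval directly from the scores observed in a node, with no distributional assumption. The article's own construction is not reproduced here: the function name `empirical_interval`, the `level` parameter, the rounding rule for the cut-off index, and the sample scores are assumptions.

```python
def empirical_interval(values, level=0.90):
    """Percentile-style interval taken from the sorted observed values.

    Trims the same number of observations from each tail so that
    roughly `level` of the data lies inside. No model is assumed.
    """
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    last = len(ordered) - 1
    # Number of observations trimmed from each tail (assumed rounding rule).
    cut = round((1.0 - level) / 2.0 * last)
    return ordered[cut], ordered[last - cut]

# Made-up scores for one node: the integers 1 to 21.
node_scores = list(range(1, 22))
low, high = empirical_interval(node_scores, level=0.90)
print(low, high)
```

With few observations in a node, the interval is wide and unstable, which is one reason to require a minimum node size before trusting it.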

Table of Contents

  1. Methodology
    . . . How hidden decision trees (HDT) work
    . . . NLP Case study: summary and findings
    . . . Parameters
    . . . Improving the methodology
  2. Implementation details
    . . . Correcting for bias
    . . . . . . Time-adjusted scores
    . . . Excel spreadsheet
    . . . Python code and dataset
  3. Model-free confidence intervals and perfect nodes
    . . . Interesting asymptotic properties of confidence intervals

Download the Article

The technical article, entitled Advanced Machine Learning with Basic Excel, is accessible in the “Free Books and Articles” section, here. The text highlighted in orange in this PDF document marks keywords that will be incorporated in the index when I aggregate all my related articles into a single book about innovative machine learning techniques. The text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links pointing to a section, bibliography entry, equation, and so on.

To not miss future articles, sign up for our newsletter, here.

About the Author

Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace. Vincent is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences (NISS).

Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, available here. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.
