New Book: Approaching (Almost) Any Machine Learning Problem

This self-published book is dated July 2020 according to Amazon, but it appears to be an ongoing project. Like many new books, the material is on GitHub, here. The most recent version, dated June 2021, is available in PDF format, here.

This is not a traditional book. It feels like a repository of Python code, printed on paper if you buy the print version. The associated GitHub repository is much more useful if you want to re-use the code with simple copy and paste. The book covers many topics and performance metrics, with emphasis on computer vision problems. The code is documented in detail. It represents 80% of the content, and the comments in the code should be considered an integral part of that content.

A Non-traditional Book

That said, the book is not an introduction to machine learning algorithms. It assumes some knowledge of the algorithms discussed, and there are no mathematical explanations. I find it to be an excellent 300-page Python tutorial covering many ML topics (maybe too many). The author focuses on real problems and real data. The style is far from academic and, in my opinion, even anti-academic.

Due to the large number of topics, an index would help a lot. Also, the book lacks references. There are few figures (more would be helpful) and the print version is black and white, which does not help when illustrating computer vision problems. But the PDF on the GitHub repository is in color, both for the illustrations and for the code. The only reason to buy the inexpensive print version (under $15) is to help the author make some money. This is a very valid reason.

The Style

Overall, despite its high value, the book has a strong amateur feel, at least compared to the other self-published books that I reviewed recently (see here). It is not well organized, and it is easy to get lost when reading it. I am tempted to contact the author to help him make a professional version. But then, you may love its authenticity and unique style. I liken it to the great YouTube song by a Romanian sysadmin, so amateurish, yet viewed 700 million times and loved by everyone (see here). Indeed, each time the book is mentioned on LinkedIn, it gathers hundreds if not thousands of likes. That’s how I found it!

Character recognition: t-SNE clustering of the MNIST data set based on 3000 images (page 10 in the book)

Make sure you don’t buy a pirated version. The author is wary about this, and as an author whose work is also routinely pirated, I encourage the reader to obtain a legitimate version. It is usually more up-to-date and not stripped of useful links. On the plus side, being pirated on a large scale is an indicator that you are very successful!

Table of Contents

The book starts with detailed instructions about how to install Python and set up the right environment. It then covers the following topics:

  • Supervised vs unsupervised learning
  • Cross-validation
  • Evaluation metrics
  • Arranging machine learning projects
  • Approaching categorical variables
  • Feature engineering
  • Feature selection
  • Hyperparameter optimization
  • Image classification and segmentation
  • Text classification and regression
  • Ensembling and stacking
  • Reproducible code and model serving

The Author

Abhishek Thakur is a data scientist and the world’s first 4x grandmaster on Kaggle. His passion lies in solving difficult world problems through data science. Abhishek earned his bachelor’s degree in Electronics Engineering in India and moved to Germany to pursue an MSc at the University of Bonn, with a focus on image processing and computer vision. He dropped out of his PhD in 2015 and has been working in industry since then.

