Feature Clustering: A Simple Solution to Many Machine Learning Problems

Feature clustering is an unsupervised machine learning technique to separate the features of a dataset into homogeneous groups. In short, it is a clustering procedure, but performed on the features rather than on the observations. Such techniques often rely on a similarity metric, measuring how close two features are to each other. In this article, I use the absolute value of the correlation between two features. An immediate consequence is that the technique is scale-invariant: it does not depend on the units of measurement in your dataset. Of course, in some instances, it makes sense to transform the data using a logit or log transform prior to using the technique, to turn a multiplicative setting into an additive one.
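The scale-invariance claim can be checked on a small example. The sketch below is illustrative only: it assumes Pearson correlation, uses simulated data, and the helper name `abs_corr` is hypothetical rather than taken from the article's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)  # strongly related to x1
x3 = rng.normal(size=n)                   # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])         # rows: observations, columns: features

def abs_corr(X):
    """Absolute Pearson correlation between the columns (features) of X."""
    return np.abs(np.corrcoef(X, rowvar=False))

S = abs_corr(X)

# Change of units: rescale and shift each feature (one scale is negative).
X_rescaled = X * np.array([1000.0, 0.01, -5.0]) + np.array([3.0, -7.0, 100.0])
S_rescaled = abs_corr(X_rescaled)

# The similarity matrix is unchanged by the change of units.
print(np.allclose(S, S_rescaled))
```

Taking the absolute value matters here: a negative rescaling flips the sign of the correlation, but not the similarity.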

The technique can also be used for traditional clustering performed on the observations. In that case, it is useful in the presence of wide data: when you have a large number of features but a small number of observations, sometimes smaller than the number of features, as in clinical trials. When applied to features, it allows you to break down a high-dimensional problem (the dimension is the number of features) into a number of low-dimensional problems. It can accelerate many algorithms, namely those with computing time growing exponentially fast with the dimension, and at the same time avoid issues related to the “curse of dimensionality”. In fact, it can be used as a data reduction technique, where feature clusters with a low average correlation (in absolute value) are removed from the dataset.
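As a rough illustration of the data reduction idea, the sketch below drops feature clusters whose average within-cluster absolute correlation falls below a cutoff. Everything in it is an assumption made for the example: the toy similarity matrix, the cluster labels, the cutoff of 0.3, the choice to score singleton clusters as 0, and the helper names `mean_abs_corr` and `reduce_features`.

```python
import numpy as np

# Toy absolute-correlation matrix for 5 features (made-up values).
S = np.array([
    [1.00, 0.80, 0.70, 0.10, 0.00],
    [0.80, 1.00, 0.60, 0.00, 0.10],
    [0.70, 0.60, 1.00, 0.10, 0.10],
    [0.10, 0.00, 0.10, 1.00, 0.20],
    [0.00, 0.10, 0.10, 0.20, 1.00],
])

# Hypothetical cluster labels, one per feature, from any feature clustering.
labels = np.array([0, 0, 0, 1, 1])

def mean_abs_corr(S, members):
    """Average off-diagonal similarity within a cluster (0 for singletons)."""
    k = len(members)
    if k < 2:
        return 0.0
    sub = S[np.ix_(members, members)]
    return float((sub.sum() - np.trace(sub)) / (k * (k - 1)))

def reduce_features(S, labels, min_avg):
    """Return sorted indices of features in clusters scoring at least min_avg."""
    kept = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        if mean_abs_corr(S, members) >= min_avg:
            kept.extend(members.tolist())
    return sorted(kept)

kept = reduce_features(S, labels, min_avg=0.3)
print(kept)  # features 0, 1, 2 survive; the weakly correlated pair is dropped
```

Each surviving cluster can then be handled as its own low-dimensional problem.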

Applications are numerous. In my case, I used it in the context of synthetic data generation, especially with generative adversarial networks (GAN). The idea is to identify clusters of related features, apply a separate GAN to each of them, and then put the synthesized features back together into one dataset. The benefits are faster processing with little to no loss in terms of capturing the full correlation structure present in the dataset. It also increases the robustness and explainability of the method, making it less volatile during the successive epochs in the GAN model.
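The split, synthesize, and recombine workflow can be sketched without a GAN. In the example below, a multivariate Gaussian fitted to each feature cluster stands in for the per-cluster generator; this is a deliberate simplification, not the method used in the article. The function name `synthesize_by_cluster`, the cluster assignments, and the simulated data are all hypothetical.

```python
import numpy as np

def synthesize_by_cluster(X, clusters, n_synth, rng):
    """Fit one generator per feature cluster, then reassemble the columns.

    A multivariate Gaussian is used as a stand-in for a GAN.
    """
    X_synth = np.empty((n_synth, X.shape[1]))
    for cols in clusters:
        block = X[:, cols]                                # features of one cluster
        mean = block.mean(axis=0)
        cov = np.atleast_2d(np.cov(block, rowvar=False))  # handles 1-feature clusters
        X_synth[:, cols] = rng.multivariate_normal(mean, cov, size=n_synth)
    return X_synth

rng = np.random.default_rng(42)
n = 2000
x0 = rng.normal(size=n)
x1 = x0 + 0.5 * rng.normal(size=n)    # correlated with x0
x2 = rng.normal(size=n)
x3 = -x2 + 0.5 * rng.normal(size=n)   # negatively correlated with x2
X = np.column_stack([x0, x1, x2, x3])

clusters = [[0, 1], [2, 3]]           # assumed output of a feature clustering step
X_synth = synthesize_by_cluster(X, clusters, n_synth=5000, rng=rng)

real_corr = np.corrcoef(X, rowvar=False)
synth_corr = np.corrcoef(X_synth, rowvar=False)
print(np.round(real_corr[0, 1] - synth_corr[0, 1], 3))
```

Because each cluster is synthesized independently, correlations across clusters are not reproduced. That is acceptable only when the clustering has placed weakly related features in different clusters.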

Feature clustering on correlation matrix using SciPy

I summarize the feature clustering results in section 2. I used the technique on a Kaggle dataset with 9 features, consisting of medical measurements. I offer two Python implementations: one based on hierarchical clustering in section 3.1, and one based on connected components (a fundamental graph theory algorithm) in section 3.2. In addition, the technique leads to a simple visualization of the 9-dimensional dataset, with one scatterplot and two colors: orange for diabetes and blue for non-diabetes. Here diabetes is the binary response feature. This is because the largest feature cluster contains only 3 features, and one of them is the response. In any well-designed experiment, you would expect the response to always be in a large feature cluster.
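For readers who want a feel for the two approaches before opening the full article, here is a minimal sketch using SciPy. It is not the code from sections 3.1 and 3.2: the toy similarity matrix, the average-linkage choice, the distance 1 − |correlation|, and the 0.5 threshold are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy absolute-correlation matrix for 5 features (made-up values).
S = np.array([
    [1.00, 0.80, 0.70, 0.10, 0.00],
    [0.80, 1.00, 0.60, 0.00, 0.10],
    [0.70, 0.60, 1.00, 0.10, 0.10],
    [0.10, 0.00, 0.10, 1.00, 0.20],
    [0.00, 0.10, 0.10, 0.20, 1.00],
])

# Method 1: hierarchical clustering on the distance 1 - |correlation|.
D = 1.0 - S
np.fill_diagonal(D, 0.0)                       # guard against rounding on the diagonal
Z = linkage(squareform(D), method="average")   # squareform gives the condensed form
hier_labels = fcluster(Z, t=0.5, criterion="distance")

# Method 2: connected components of the graph that links two features
# whenever their absolute correlation reaches the threshold.
adjacency = csr_matrix((S >= 0.5).astype(int))
n_clusters, cc_labels = connected_components(adjacency, directed=False)

def as_partition(labels):
    """Turn a label vector into a set of feature-index groups."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(int(lab), set()).add(idx)
    return {frozenset(g) for g in groups.values()}

print(as_partition(hier_labels))
print(as_partition(cc_labels))
```

On this toy matrix both methods group features 0, 1, and 2 together and leave features 3 and 4 on their own. In general the two methods need not agree: connected components can chain features together through intermediate links, while average linkage looks at the mean distance between groups.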

No linear algebra or calculus is required: the method is essentially math-free. This is in contrast to principal component analysis (PCA) which relies on eigenvalues, and turns your features into meaningless, arbitrary linear combinations that are hard to interpret. This article is an extract from my book “Synthetic Data and Generative AI”, available here.

Download the Article

The technical article, entitled Feature Clustering: A Simple Solution to Many Machine Learning Problems, is accessible in the “Free Books and Articles” section, here. It contains links to my GitHub files, to easily copy and paste the code. Text highlighted in orange in the PDF document marks keywords that will be incorporated in the index when I aggregate all my related articles into books about machine learning, visualization and Python. Text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links, pointing to a section, bibliography entry, equation, and so on.

To avoid missing future articles, sign up for our newsletter, here.

About the Author

Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences (NISS).

Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, including “Intuitive Machine Learning and Explainable AI”, available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.
