Feature Clustering: A Simple Solution to Many Machine Learning Problems

Feature clustering is an unsupervised machine learning technique to separate the features of a dataset into homogeneous groups. In short, it is a clustering procedure, but performed on the features rather than on the observations. Such techniques often rely on a similarity metric, measuring how close two features are to each other. In this article, I use the absolute value of the correlation between two features. An immediate consequence is that the technique is scale-invariant: it does not depend on the units of measurement in your dataset. Of course, in some instances, it makes sense to transform the data using a logit or log transform prior to using the technique, to turn a multiplicative setting into an additive one.
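As a minimal sketch of the similarity metric described above, the snippet below computes the matrix of absolute correlations with NumPy and checks that it is unchanged after a change of units. The toy data, the function name `abs_correlation`, and the rescaling factors are illustrative assumptions, not taken from the article.

```python
import numpy as np

def abs_correlation(data):
    """Absolute Pearson correlation between the columns (features) of `data`."""
    return np.abs(np.corrcoef(data, rowvar=False))

# Toy data (assumption): features 0 and 1 are strongly related, feature 2 is noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([
    x,
    -2.0 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])
sim = abs_correlation(data)

# Change the units of each feature (scale and shift): the similarity is unchanged.
rescaled = data * np.array([1000.0, 0.01, 3.0]) + np.array([5.0, -2.0, 0.0])
sim_rescaled = abs_correlation(rescaled)
print(np.allclose(sim, sim_rescaled))
```

Taking the absolute value means that a strong negative relationship (features 0 and 1 here) counts as high similarity, just like a strong positive one.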

The technique can also be used for traditional clustering performed on the observations. In that case, it is useful in the presence of wide data: when you have a large number of features but a small number of observations, sometimes smaller than the number of features, as in clinical trials. When applied to features, it allows you to break down a high-dimensional problem (the dimension is the number of features) into a number of low-dimensional problems. It can accelerate many algorithms, namely those with computing time growing exponentially fast with the dimension, and at the same time avoid issues related to the “curse of dimensionality”. In fact, it can be used as a data reduction technique, where feature clusters with a low average correlation (in absolute value) are removed from the dataset.
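The data reduction idea can be sketched as follows, assuming you already have a matrix of absolute correlations and one cluster label per feature. The function names, the 0.3 threshold, and the choice to score single-feature clusters as 0 are assumptions made for illustration; the article does not specify them.

```python
import numpy as np

def mean_abs_corr_per_cluster(abs_corr, labels):
    """Average off-diagonal absolute correlation within each feature cluster."""
    labels = np.asarray(labels)
    scores = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if len(idx) < 2:
            # Assumption: a single-feature cluster has no pairs, so score it 0.
            scores[int(c)] = 0.0
            continue
        block = abs_corr[np.ix_(idx, idx)]
        off_diagonal = block[~np.eye(len(idx), dtype=bool)]
        scores[int(c)] = float(off_diagonal.mean())
    return scores

def features_to_keep(abs_corr, labels, threshold=0.3):
    """Indices of features whose cluster has average |correlation| >= threshold."""
    scores = mean_abs_corr_per_cluster(abs_corr, labels)
    return [j for j, c in enumerate(labels) if scores[int(c)] >= threshold]

# Toy example (assumption): features 0 and 1 are related, features 2 and 3 are not.
abs_corr = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.1],
    [0.0, 0.1, 0.1, 1.0],
])
labels = [0, 0, 1, 1]
scores = mean_abs_corr_per_cluster(abs_corr, labels)
kept = features_to_keep(abs_corr, labels, threshold=0.3)
print(scores, kept)
```

In practice, the threshold would be chosen by inspecting the per-cluster scores rather than fixed in advance.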

Applications are numerous. In my case, I used it in the context of synthetic data generation, especially with generative adversarial networks (GAN). The idea is to identify clusters of related features, apply a separate GAN to each of them, then put the synthesized blocks back together into one dataset. The benefits are faster processing with little to no loss in terms of capturing the full correlation structure present in the dataset. It also increases the robustness and explainability of the method, making it less volatile during the successive epochs in the GAN model.
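The split-and-merge step can be sketched independently of the generative model. In the snippet below, `synthesize` is a placeholder for any per-cluster generator; the stand-in `bootstrap_rows` simply resamples rows and is not a GAN. Both names, and the toy data, are assumptions used only to show how the columns are separated by cluster and reassembled in their original order.

```python
import numpy as np

def synthesize_by_cluster(data, labels, synthesize, n_rows):
    """Apply `synthesize` to each feature cluster, then reassemble the columns."""
    labels = np.asarray(labels)
    out = np.empty((n_rows, data.shape[1]), dtype=float)
    for c in np.unique(labels):
        cols = np.flatnonzero(labels == c)
        # Each cluster is synthesized on its own, as a low-dimensional problem.
        out[:, cols] = synthesize(data[:, cols], n_rows)
    return out

rng = np.random.default_rng(0)

def bootstrap_rows(block, n_rows):
    """Stand-in generator (assumption): resample rows of the block with replacement."""
    idx = rng.integers(0, block.shape[0], size=n_rows)
    return block[idx]

# Toy data (assumption): columns 0 and 2 form one cluster, column 1 another.
data = np.arange(12, dtype=float).reshape(4, 3)
labels = [0, 1, 0]
synthetic = synthesize_by_cluster(data, labels, bootstrap_rows, n_rows=6)
print(synthetic.shape)
```

Because each cluster is generated separately, relationships within a cluster are preserved by its generator, while relationships across clusters are ignored. This is acceptable when cross-cluster correlations are weak, which is what the clustering is designed to ensure.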

Feature clustering on correlation matrix using Scipy

I summarize the feature clustering results in section 2. I used the technique on a Kaggle dataset with 9 features, consisting of medical measurements. I offer two Python implementations: one based on hierarchical clustering in section 3.1, and one based on connected components (a fundamental graph theory algorithm) in section 3.2. In addition, the technique leads to a simple visualization of the 9-dimensional dataset, with one scatterplot and two colors: orange for diabetes and blue for non-diabetes. Here diabetes is the binary response feature. This is because the largest feature cluster contains only 3 features, and one of them is the response. In any well-designed experiment, you would expect the response to always be in a large feature cluster.
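The two approaches mentioned above can be sketched with SciPy as follows. This is a minimal illustration on toy data, not the code from sections 3.1 and 3.2: the function names, the average linkage, the distance `1 - |correlation|`, and the 0.5 thresholds are all assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import squareform

def hierarchical_feature_clusters(abs_corr, max_distance=0.5):
    """Hierarchical clustering of features, using 1 - |correlation| as distance."""
    dist = 1.0 - abs_corr            # new array; abs_corr is left untouched
    np.fill_diagonal(dist, 0.0)      # remove rounding noise on the diagonal
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=max_distance, criterion="distance")

def connected_component_feature_clusters(abs_corr, min_corr=0.5):
    """Link two features when |correlation| >= min_corr; clusters are the components."""
    adjacency = (abs_corr >= min_corr).astype(int)
    np.fill_diagonal(adjacency, 0)
    _, labels = connected_components(csr_matrix(adjacency), directed=False)
    return labels

# Toy data (assumption): features 0-2 depend on x, features 3-4 depend on y.
rng = np.random.default_rng(0)
n = 500
x, y = rng.normal(size=n), rng.normal(size=n)
noise = lambda: rng.normal(scale=0.1, size=n)
data = np.column_stack([x, x + noise(), -x + noise(), y, y + noise()])

abs_corr = np.abs(np.corrcoef(data, rowvar=False))
hc_labels = hierarchical_feature_clusters(abs_corr)
cc_labels = connected_component_feature_clusters(abs_corr)
print(hc_labels, cc_labels)
```

The connected components version needs only a threshold on the absolute correlation, while the hierarchical version also depends on the linkage method. On cleanly separated data like this toy example, both recover the same two groups, though the label values themselves differ (`fcluster` numbers clusters from 1, `connected_components` from 0).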

No linear algebra or calculus is required: the method is essentially math-free. This is in contrast to principal component analysis (PCA) which relies on eigenvalues, and turns your features into meaningless, arbitrary linear combinations that are hard to interpret. This article is an extract from my book “Synthetic Data and Generative AI”, available here.

Download the Article

The technical article, entitled Feature Clustering: A Simple Solution to Many Machine Learning Problems, is accessible in the “Free Books and Articles” section as paper #21, here. It contains links to my GitHub files, so that you can easily copy and paste the code. In this PDF document, text highlighted in orange marks keywords that will be incorporated in the index when I aggregate all my related articles into books about machine learning, visualization and Python. Text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links, pointing to a section, bibliography entry, equation, and so on.

To avoid missing future articles, sign up for our newsletter, here.

About the Author


Vincent Granville is a pioneering GenAI scientist and co-founder at BondingAI.io, the LLM 2.0 platform for hallucination-free, secure, in-house, lightning-fast Enterprise AI at scale with zero weight and no GPU. He is also an author (Elsevier, Wiley), publisher, and successful entrepreneur with a multi-million-dollar exit. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. He completed a post-doc in computational statistics at the University of Cambridge.
