The Art of Visualizing High Dimensional Data

The full version of this article, entitled “The Art of Visualizing High Dimensional Data”, is available in PDF format in the “Free Books and Articles” section, here.

This article discusses enriched visualizations, with a focus on animated GIFs and videos built in Python. For instance, the comet video displayed below can feature several dimensions that are difficult to show in a static picture: the comet locations at any given time, the relative velocity of each comet, the change in velocity (acceleration), the change in comet size when approaching the sun, the comet interactions (the apparent collisions), any change in the orbit (orientation or eccentricity), or any change in composition (the color assigned to a comet). Such a video can easily display 17 dimensions, as discussed in the paper.

The PDF document (6 pages + code + illustrations, 11MB) focuses on four applications: prediction intervals in any dimension, supervised classification, convergence of algorithms such as gradient descent when dealing with chaotic functions, and spatial time series (the comet illustration). All visualizations use the RGB color model, and one uses RGBA for special and particularly useful effects, obtained by playing with the transparency level. In essence, the transparency channel allows you to perform supervised classification using image techniques only, after mapping your dataset onto an image.

Image compression and anti-aliasing techniques are included in the Python code. They require only a simple call to a library function. The code is also on GitHub, and the videos are on YouTube. The document also presents surprising data in number theory and experimental math. It leads to interesting machine learning problems: boundary and hole detection (see figure below), and convergence acceleration for chaotic iterations.

Bayesian-like supervised classification: 3 clusters, infinite dataset

Abstract

I discuss different techniques to produce professional data videos, animated GIFs, and other visualizations in Python, using the Pillow and Moviepy libraries. Applications include visualizing prediction intervals regardless of the number of features (also called independent variables), supervised classification applied to an infinite dataset, convergence of machine learning algorithms, and animations featuring objects of various sizes moving at various speeds along various paths. For instance, I show a video simulation of 300 comets circling the sun, to assess the risk of a collision.
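To make the idea concrete, here is a minimal sketch of an animated GIF built frame by frame with Pillow. It is not the code from the paper: the function names, the single body, the orbit parameters, and the colors are all assumptions chosen for illustration. It only shows the general pattern of drawing one image per time step and saving the sequence as one animation.

```python
import math
from io import BytesIO

from PIL import Image, ImageDraw


def make_orbit_frames(n_frames=24, size=200, a=80, b=50):
    """Return RGB frames of one body moving on an ellipse (illustrative values)."""
    cx = cy = size // 2
    c = math.sqrt(a * a - b * b)  # distance from the ellipse center to a focus
    frames = []
    for k in range(n_frames):
        t = 2 * math.pi * k / n_frames
        x = cx + a * math.cos(t)
        y = cy + b * math.sin(t)
        img = Image.new("RGB", (size, size), (0, 0, 0))
        draw = ImageDraw.Draw(img)
        # Sun placed at one focus of the ellipse
        draw.ellipse((cx + c - 6, cy - 6, cx + c + 6, cy + 6), fill=(255, 220, 0))
        # Moving body at its current position
        draw.ellipse((x - 3, y - 3, x + 3, y + 3), fill=(120, 200, 255))
        frames.append(img)
    return frames


def frames_to_gif_bytes(frames, ms_per_frame=50):
    """Encode the frames as a looping animated GIF held in memory."""
    buf = BytesIO()
    frames[0].save(
        buf,
        format="GIF",
        save_all=True,
        append_images=frames[1:],
        duration=ms_per_frame,
        loop=0,
    )
    return buf.getvalue()


frames = make_orbit_frames()
gif_bytes = frames_to_gif_bytes(frames)
```

Writing `gif_bytes` to a file with a `.gif` extension gives an animation that plays in a browser. Extra dimensions such as size, color, or speed could be encoded by varying the radius, the fill, or the time step per body. Producing a video from the same frames would go through Moviepy, which is not shown here.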

The Python libraries in question allow for low-level image processing at the pixel level. This is particularly useful to build ad-hoc, original visualization algorithms. I also discuss optimization: amount of memory required, performance of compression techniques, numpy versus math library, anti-aliasing to depixelate an image, and so on. Some of the videos use the RGBA palette format.
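As an illustration of anti-aliasing, the sketch below uses supersampling: the shape is drawn at four times the target resolution, then reduced with Pillow's Lanczos filter. This is one common approach and an assumption on my part here, not necessarily the one used in the paper; the helper names and sizes are hypothetical.

```python
from PIL import Image, ImageDraw


def draw_disc(size, radius, scale=1):
    """Draw a white disc on a black grayscale image, at `scale` times the target size."""
    big = size * scale
    img = Image.new("L", (big, big), 0)
    c, r = big / 2, radius * scale
    ImageDraw.Draw(img).ellipse((c - r, c - r, c + r, c + r), fill=255)
    return img


def draw_disc_antialiased(size, radius, scale=4):
    """Draw at higher resolution, then downsample with a Lanczos filter."""
    big = draw_disc(size, radius, scale)
    return big.resize((size, size), Image.LANCZOS)


jagged = draw_disc(64, 20)
smooth = draw_disc_antialiased(64, 20)

# Distinct gray levels present in each image
jagged_levels = set(jagged.tobytes())
smooth_levels = set(smooth.tobytes())
```

The direct drawing contains only pure black and pure white pixels, which is what produces the staircase effect on the edge. The downsampled version contains intermediate gray levels along the boundary, which is what makes the edge look smooth.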

This 4-dimensional color encoding (red, green, blue, alpha) allows you to set the transparency level (also called “opacity”) when objects overlap. It is particularly useful in models involving mixtures or overlapping groups in supervised classification. In that context, it not only helps with visualizations, but actually solves the classification problem on its own.
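The following sketch shows how alpha blending alone can act as a classifier. It is a simplified illustration under my own assumptions, not the method from the paper: each training point is painted as a translucent disc in the color of its class, and a pixel is then assigned to the class whose color channel dominates at that location. The names `paint_training_set` and `classify_pixel`, the radius, and the alpha value are hypothetical.

```python
from PIL import Image, ImageDraw

# One pure RGB channel per class, so the dominant channel identifies the class
CLASS_COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]


def paint_training_set(points, size=100, radius=12, alpha=90):
    """Blend one translucent disc per training point onto an opaque black canvas."""
    canvas = Image.new("RGBA", (size, size), (0, 0, 0, 255))
    for x, y, label in points:
        layer = Image.new("RGBA", (size, size), (0, 0, 0, 0))
        ImageDraw.Draw(layer).ellipse(
            (x - radius, y - radius, x + radius, y + radius),
            fill=CLASS_COLORS[label] + (alpha,),
        )
        # "Over" compositing: overlapping discs mix their colors
        canvas = Image.alpha_composite(canvas, layer)
    return canvas


def classify_pixel(canvas, x, y):
    """Return the index of the dominant channel, or None where nothing was painted."""
    r, g, b, _ = canvas.getpixel((x, y))
    channels = (r, g, b)
    if max(channels) == 0:
        return None
    return channels.index(max(channels))


# Toy training set: (x, y, class label)
points = [
    (25, 25, 0), (30, 28, 0),
    (70, 30, 1), (72, 35, 1),
    (50, 75, 2), (48, 70, 2),
]
canvas = paint_training_set(points)
```

Calling `classify_pixel(canvas, 27, 26)` returns class 0 here, since only red discs cover that pixel. Note that "over" compositing depends on the order in which discs are painted, so in regions where classes overlap a production version would need an order-independent blend, for instance by accumulating per-class densities.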

Comets circling the sun (simulation)

Table of Contents

Introduction

Applications

  • Spatial time series
  • Prediction intervals in any dimension
  • Supervised classification of an infinite dataset
  • Algorithms with chaotic convergence

Python code

  • Path simulation
  • Visual convergence analysis in 2D
  • Supervised classification

Visualizations

Download the Article

The technical article, entitled The Art of Visualizing High Dimensional Data, is accessible in the “Free Books and Articles” section, here. Text highlighted in orange in this PDF document marks keywords that will be incorporated in the index when I aggregate all my related articles into a single book about innovative machine learning techniques. Text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links, pointing to a section, bibliography entry, equation, and so on.

To avoid missing future articles, sign up for our newsletter, here.
