The post New Book: Interpretable Machine Learning first appeared on Machine Learning Techniques.

The book reads like a small encyclopedia of methods and model performance metrics. Too many, in my opinion: you can easily get lost in this ocean of material, as the author favors exhaustiveness over selectivity. Many readers, however, will see this as a benefit. In my view, the book is better suited to machine learning developers than to decision makers or stakeholders.

Each method or metric is compared to others, with pluses and minuses, and comes with very recent references and Python or R libraries. Applications based on real or “prototype” data, with source code, are available on the author’s GitHub repository. A glossary would help a lot; it may appear in a future version of this book, which is constantly evolving.

I wish the quality of the print version were higher. Many color illustrations would benefit from better paper that does not absorb ink so much. I would have preferred to buy a PDF version with clickable links (text highlighted in red in the print version), had one been available. That said, the book has been thoroughly copy-edited and reviewed with the help of numerous readers, so the quality of the content and proofreading is high. The level and amount of mathematics is just right: not too much, not too advanced, but enough to give a real feel for what the techniques do.

Many people still wonder how you make black box systems interpretable. There are a few themes in the book to address this issue. Below is a short list that caught my attention:

- **Proxy models**: a simplified version of your system, simple enough to be interpretable. The proxy model acts as an interpretable approximation of the full version.
- **Adversarial data**: data specifically chosen or designed to make your system fail (a weird human face that your system detects as non-human, or a weird rock that your system erroneously classifies as human). This helps you understand where your black-box system shines, and its limitations.
- **Feature importance** and **feature interaction**: to further understand the mechanics that make your system work. This is more powerful than looking at cross-correlation tables.
- **Pixel and feature attribution**: understanding which pixels in a given image have the biggest impact on classification or pattern recognition (that is, on the output of your system).
- **Prototype data**: for instance, a large set of hand-written digits added to your training set, if your problem is to recognize digits. This again helps you understand where your system shines or fails, offering insights into its inner workings.
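To make the proxy-model idea concrete, here is a minimal sketch (my own illustration, not code from the book): an opaque scoring function stands in for the black box, and a one-split regression stump, a rule a human can read, is fitted to the black box's outputs rather than to ground-truth labels.

```python
# Minimal proxy (surrogate) model sketch. The black_box function below is a
# hypothetical stand-in for an uninterpretable system.
import numpy as np

def black_box(x):
    return np.sin(3 * x) + 0.5 * x  # opaque system we want to explain

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = black_box(x)                    # train the proxy on the black box's outputs

# Depth-1 "stump": scan candidate splits, keep the one that best mimics
# the black box (smallest sum of squared errors).
best = None
for s in np.linspace(-0.9, 0.9, 37):
    left, right = y[x <= s], y[x > s]
    sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
    if best is None or sse < best[0]:
        best = (sse, s, left.mean(), right.mean())

_, split, mu_left, mu_right = best
# Interpretable rule: "if x <= split, predict mu_left; else predict mu_right".
```

The resulting rule is a crude approximation, but it can be stated in one sentence, which is the whole point of a proxy model.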

You can find Christoph’s book on his GitHub repository, here, or on his website. It is written in R Markdown, and published with BookDown.org. I like the GitHub version better than the print edition (it is also more up-to-date).

For related articles on interpretable machine learning, visit this page. My most recent article describes how to generate synthetic data, to use as augmented training sets in black box systems. It is available here. A lot more is described and used in my new book. To not miss future articles, sign-up to our newsletter, here.


The post New Book: Efficient Deep Learning first appeared on Machine Learning Techniques.

The minimally qualified reader has a basic understanding of machine learning and at least some experience training deep learning models. They can fine-tune models by changing common parameters, make minor changes to model architectures, and get the modified models to train to good accuracy. However, they run into problems productionizing these models, or want to optimize them further. The book does not teach deep learning basics; for an introduction to the subject, see Deep Learning with Python. Any reader with this prerequisite knowledge will be able to enjoy the book.

**Gaurav Menghani** is a Staff Software Engineer / Tech Lead at Google Research, working on efficient deep learning, on-device machine learning and AI. He was previously a senior software engineer at Facebook, working on search quality and ranking.

**Naresh Singh** graduated from Stony Brook University in data analysis and machine learning. He has worked as a software engineer at Microsoft and Amazon.

The material covers the following topics: quantization, learning techniques and efficiency, data augmentation, smaller and faster models, efficient architectures, long-term dependencies, automation and AutoML, hyperparameter tuning, clustering and classification, contrastive learning, microcontrollers, NLP, computer vision, TensorFlow, PyTorch, and more.

Many projects and exercises are discussed throughout the book, including:

- Compressing images from the Mars Rover
- Quantizing a deep learning model
- Increasing the accuracy of an image or text classification model with data augmentation
- Increasing the accuracy of a speech identification model with distillation
- Using pre-trained embeddings to improve accuracy of an NLP task
- News classification using RNN and Attention Models
- Snapchat-like filters for pets
- Searching over model architectures for boosting model accuracy
- Comparing compression techniques for optimizing a speech detection model
- Learning to classify with 10% labels
- Benchmarking a tiny on-device model with TFLite
- Speech detection on a microcontroller with TFMicro
- Face recognition on the web with TensorFlow.JS
- Google Tensor Processing Unit: training BERT efficiently with TPU

You can download the first four chapters (PDF format) on the official website, EfficientDLbook.com. Projects, codelabs and tutorials are available on GitHub, here.

*For related books, visit our books section, here.*


The post Little Known Secrets about Interpretable Machine Learning on Synthetic Data first appeared on Machine Learning Techniques.

The technique discussed here handles a large class of problems. In this article, I focus on a simple one: linear regression. I solve it with an iterative (fixed-point) algorithm that bears some resemblance to gradient boosting, using machine learning methods and explainable AI as opposed to traditional statistics. In particular, the algorithm does not use matrix inversion. It is easy to implement in Excel (I provide my spreadsheet) or to automate as a black-box system. It is also numerically stable, and generalizes to non-linear problems. Unlike the traditional statistical solution, which can lead to meaningless regression coefficients, here the output coefficients are easier to understand, leading to better interpretation.

I tested it on a rich collection of synthetic data sets: it performs just as well as the standard technique, even after adding noise to the data. I then show how to measure the impact of individual features, or groups of features (and feature interaction), on the solution. A model with *m* features has 2^{*m*} sub-models. I show how to draw more insights by analyzing the performance of each sub-model. Finally, I introduce a new performance metric called the “score”, a better alternative to R-squared.

The article covers the following topics:

- Mathematical setting, explained using elementary matrix algebra. It also delves into convergence analysis, numerical stability and eigenvalues, though this material can be skipped by non-technical readers. There is no matrix inversion involved in the methodology.
- Data-driven, model-free confidence and prediction intervals, as well as a new, better alternative to R-squared, called “score”, to measure performance. These general techniques apply to a wide range of prediction algorithms well beyond linear regression.
- A spreadsheet with all the data, computations and results. In particular, you will be able to easily create your own synthetic data, the right way. The synthetic data consists of a training set, and a validation set. Performance metrics are measured on the validation set.
- A deep dive on all feature combinations, computing the performance and regression coefficients for each subset of features. This analysis yields interesting insights regarding feature importance and feature interaction, much deeper than just looking at the cross-correlation table.
- Synthetic data generation allows you to simulate the “exact” (unobserved, unknown) data and thus the exact regression coefficients, as well as the observed data. The observed data is a mixture of exact data and simulated noise. I introduced more noise in the validation set than in the training set, to make the simulations more realistic.
- You can test the methodology on millions of very rich, synthetic data. This is a big contrast with analyses based on real data. In particular, the following elements are simulated with randomization governed by hyper-parameters: the exact regression coefficients, the exact and observed data, the noise, the training and validation data, and the cross-correlation matrix for the features and observed response.
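The sub-model sweep described above can be sketched as follows. This is an illustration only: for brevity it fits each feature subset with ordinary least squares, whereas the article's fixed-point algorithm avoids matrix-based solvers; the data and coefficients are made up for the example.

```python
# Sweep all non-empty feature subsets: with m features there are 2^m sub-models
# (2^m - 1 non-empty ones); fit each and score it on the validation set.
import itertools
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 200
X = rng.normal(size=(n, m))
beta_true = np.array([2.0, -1.0, 0.0])          # third feature is pure noise
y = X @ beta_true + 0.3 * rng.normal(size=n)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

results = {}
for k in range(1, m + 1):
    for subset in itertools.combinations(range(m), k):
        A = X_tr[:, subset]
        coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)   # illustrative fit
        resid = y_va - X_va[:, subset] @ coef
        results[subset] = 1 - resid.var() / y_va.var()    # validation R^2
# Comparing scores across subsets reveals feature importance and interaction.
```

Ranking `results` immediately shows that any subset containing features 0 and 1 performs well, while the noise-only subset does not, which is the kind of insight the full sweep provides.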

Below are the conclusions from my research on this topic.

Using linear regression as an example, I illustrate how to turn the obscure output of a machine learning technique into an interpretable solution. The method described here also shows the power of synthetic data, when properly generated. Synthetic data offers a big benefit: you can test and benchmark algorithms on millions of very different data sets, all at once. I also introduce a new model performance metric, superior to R-squared in many respects and based on cross-validation.

Within a few iterations, the methodology leads to a very good approximation, almost as good as the exact solution on noisy data, with natural regression coefficients that are easy to interpret, while avoiding over-fitting. In fact, given a specific data set, many very different sets of regression coefficients lead to almost identical predictions. It makes sense to choose the ones that offer the best compromise between exactness and interpretability.

My solution, which does not require matrix inversion, is also simple compared to traditional methods. Indeed, it can easily be implemented in Excel, without any coding. Despite the absence of a statistical model, I also show how to compute confidence intervals, using parametric and non-parametric bootstrap techniques.
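One possible fixed-point iteration in this spirit, sketched below, repeatedly nudges the coefficients toward the residual correlation, much like gradient boosting with a linear learner. This is an illustrative scheme with made-up data, not necessarily the exact update used in the article, but it shows how the coefficients can be recovered without ever inverting a matrix.

```python
# Fixed-point / gradient-style iteration for linear regression,
# with no matrix inversion anywhere.
import numpy as np

rng = np.random.default_rng(2)
n, m = 300, 4
X = rng.normal(size=(n, m))
beta_true = np.array([1.5, -2.0, 0.5, 0.0])
y = X @ beta_true + 0.2 * rng.normal(size=n)

beta = np.zeros(m)
eta = 0.1                                # step size (hyper-parameter)
for _ in range(500):
    residual = y - X @ beta
    beta = beta + eta * (X.T @ residual) / n   # fixed-point update
# beta now approximates beta_true, using only matrix-vector products.
```

The iteration converges as long as the step size is small relative to the largest eigenvalue of the feature covariance, which connects to the convergence and eigenvalue analysis mentioned in the topic list above.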

*The full technical document (13 pages, with spreadsheet, synthetic data, detailed computations and explanations) is accessible from our resource repository, here. Check out the “Free Books and Articles” section.*

Vincent Granville is a machine learning scientist, author and publisher. He was the co-founder of Data Science Central (acquired by TechTarget) and, most recently, founder of MLtechniques.com. The following links point to some of his recent articles: synthetic data, interpretable ML, and ML with Excel. To not miss future articles, subscribe to the newsletter, here.


The post Upcoming Books and Articles on MLTechniques.com first appeared on Machine Learning Techniques.

I am working full time on this project. Unlike someone working for an organization, or even a consultant, I face no restrictions on the intellectual property that I can share. This initiative is entirely self-funded, which guarantees neutrality. The data sets at my disposal to test my methods are freely available and huge. Most of my articles come with portions of my master data set, allowing you to fully replicate the results.

Various themes will be covered, as discussed in the next section.

**Synthetic data**. I have over 20 years of experience generating and working with simulated data, emulating a large class of real data, spanning from spatial processes, clustering, shapes, to multivariate financial processes and auto-correlated time series. Synthetic data offers a lot of possibilities to test or benchmark algorithms, and train machine learning systems.

**Shape catalog**. I have been working on image analysis since the late eighties. My plan is to offer a large catalog of categorized synthetic shapes to help you create your own training set, to use in computer vision, image and sound recognition problems.

**Regression techniques, decision trees**. The purpose is to offer simple, robust alternatives to traditional models, easy to implement and control, even in Excel. Topics will include a generic regression technique based on the fixed-point algorithm (no knowledge of matrix algebra needed), fuzzy regression, and a blend of regression with a large number of small decision trees, with predictions based on a majority vote among competing techniques. Fuzzy regression outputs multiple regression lines rather than just one: depending on the observation, a probability is assigned to each regression line, making the prediction “fuzzy” but more flexible. This is not restricted to linear regression; I will also discuss a simplified logistic regression.

**Clustering and classification**. I have been working on these problems for decades. My upcoming articles will feature the most recent developments: clustering on GPU (graphics processing unit) using image filtering techniques and equalizers, as well as fuzzy classification. Some of this content is already featured in my recent book, available here. GPU classification is illustrated here. Note that this is applied to standard, tabular data, not images: the data is mapped onto an image to allow easy processing, but the data itself does not consist of images. That is what makes the technique original.

**Data animations, sound, “no code” machine learning**. This section encompasses visualization and goes one step further, with the production of videos and animated GIFs. A lot can be done in a few clicks, using Excel or simple calls to video libraries in R or Python, with minimalist code. The plan is also to add data-induced sound (matching the summarized data) as extra dimensions (frequency, amplitude, duration, texture) to the video. Finally, an article will discuss the generation of optimum palettes, either for classification purposes or for images with a large number of colors.

**Explainable AI and very deep neural networks**. The goal is to design automated black-box systems that are interpretable. One example is my shape classifier, which does not rely on neural networks; it is available here. By contrast, my new classifier uses 250 layers in a very deep neural network. Yet it is a very sparse network, with one connection per neuron, producing an unusually granular classification. Because it is based on image filtering techniques (even though the data has nothing to do with image processing), it is easy to fine-tune and interpret. See it in action, here. The goal is to publish more articles on this topic, and eventually a book.

**Probability distributions**. Over the last 20 years, I have worked with and created hundreds of probability distributions serving many purposes, such as generalized logistic, Poisson-exponential and Riemann zeta distributions, as well as distributions that are nowhere differentiable or defined on unusual domains (sphere, simplex). The goal is to create a catalog of the most useful ones, illustrated with applications.

**Excel for machine learning**. I have used Excel in many machine learning problems, sometimes in combination with Perl or Python programming, and sometimes as a stand-alone tool to solve a problem. I want to put all these spreadsheets in a unified document. Some are currently available on my GitHub repository (see here and here). The plan is to add many more, and bundle them in an easy-to-read document.

**Experimental math**. Topics include discrete dynamical systems (including stochastic systems), unusually clustered Brownian motions, machine learning techniques applied to difficult math problems and pattern discovery, Bignum libraries, benchmarking machine learning techniques on predictable math data, designing synthetic data sets, and more, including original contributions on the Riemann hypothesis and the twin prime conjecture.

**Innovative machine learning**. This will be the title of an upcoming book, focusing on simpler and more intuitive ways to analyze data. It will cover the following topics: cross-validation, model-free confidence regions, resampling, assessing the impact of individual or pairs of features on predictions, minimum contrast estimation (a generic estimation technique), optimization with a divergent fixed-point algorithm, covering problem based on population density rather than area, true test of independence to detect subtle departures from full independence, time series with long-range autocorrelations, NLP and taxonomy creation, data science with the naked eye, modern regression, and more. Some of this material will first appear as articles posted on MLTechniques.com. Some can be found in my previous book, here.

**Off the beaten path exercises**. My numerous articles and books, including future ones, are peppered with original exercises that require out-of-the-box thinking and solve interesting problems. If you are a university professor scrambling to find fresh material, you will be interested in my upcoming book featuring the most interesting part of this collection. Of course, the book is also targeted at students.

Vincent Granville, Ph.D.

Author and Publisher,

MLTechniques.com | MLTblog.com


The post Computer Vision: Shape Classification via Explainable AI first appeared on Machine Learning Techniques.

A central problem in computer vision is to compare shapes and assess how similar they are, used for instance in text recognition. Modern techniques involve neural networks. In this article, I revisit a methodology designed in 1914, before computers even existed. It leads to an efficient, automated AI algorithm. The benefit is that the decisions made by this black-box system can be explained (almost) in layman’s terms, and thus easily controlled.

By contrast, neural networks use millions of weights that are impossible to interpret, potentially leading to over-fitting. Why they work very well on some data and not so well on other data is a mystery. My “old-fashioned” classifier, adapted to modern data and computer architectures, gives full control over the parameters. In other words, you know beforehand how fine-tuning the parameters will impact the output. Hence the term *explainable AI*.

In an ideal world, one would want to blend both methods, to benefit from their respective strengths, and minimize their respective drawbacks. Such blending is referred to as *ensemble* methods. Also, since we are dealing with sampled points located on a curve (the “shape”), the same methodology also applies to sound recognition.

There is a little bit of mathematics involved here. It mostly boils down to using polar rather than Cartesian coordinates, combined with rudiments of differential calculus. Familiarity with multivariate sorting also helps. Here I keep the presentation at a high level, leaving equations and technical details to a paper to be published in the next two weeks. However, the Excel spreadsheet offered in this article has most of the formulas implemented.

It is convenient to consider a shape as a set of points, representing the pixels of an image. The center of the image is considered as the origin of the coordinate system. The points in question – all located on the curve – are assumed to be ordered in some way (to be discussed), and indexed by *t*. Physicists may view the index *t* as the time variable (discrete or continuous), and the curve as an orbit. Complex shapes may involve multiple curves, even disconnected ones.

For illustration purposes, it is easier to start with mathematical shapes characterized by a parametric polar equation. In this case, the equation is

r_t =g(t), \quad\theta_t=h(t), \quad \text{with}\quad t\in T, \quad r_t\geq 0, \quad 0\leq \theta_t\leq 2\pi.

Here *g*, *h* are real-valued functions, and *T* is the index domain. An example with *n* = 20 points is as follows:

\theta_t=(t+\eta)\bmod{2\pi},\quad r_t=c+d \sin(at) \sin[b(2\pi-t)], \quad t = 2\pi k/n \text{ with } k=0,\dots, n-1.

This example is pictured in Figure 1. The parameter *η* controls the rotation angle or orientation of the shape. Detailed computations are in the “Data Shapes” tab, in this spreadsheet. You can play with the parameters in the “Dashboard” tab (illustrated in Figure 1) to see how the shapes get transformed.
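The example shape can be generated directly from the parametric equation above. The parameter values below (`a`, `b`, `c`, `d`, `eta`) are illustrative placeholders; the spreadsheet's “Dashboard” tab uses its own.

```python
# Sample n = 20 points on the parametric polar shape:
#   theta_t = (t + eta) mod 2*pi,  r_t = c + d*sin(a*t)*sin(b*(2*pi - t)),
#   t = 2*pi*k/n, k = 0, ..., n-1.
import numpy as np

n = 20
a, b, c, d, eta = 2.0, 3.0, 1.0, 0.4, 0.0   # illustrative parameter values

k = np.arange(n)
t = 2 * np.pi * k / n
theta = (t + eta) % (2 * np.pi)
r = c + d * np.sin(a * t) * np.sin(b * (2 * np.pi - t))

# Cartesian coordinates of the sampled points on the curve:
x, y = r * np.cos(theta), r * np.sin(theta)
```

Note that with `c > |d|` the radius stays non-negative, satisfying the constraint r_t ≥ 0; changing `eta` simply rotates the shape, as stated above.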

Each shape (or set of points) is uniquely described by a normalized contour on the unit square, called its *signature*. The signature does not depend on the location or center of gravity of the shape. It does depend on the orientation, though it is easy to generalize the definition to make it rotation-invariant, or to handle 3D shapes. The first step is to use the center of gravity (centroid) as the origin, and then rescale by standardizing the variance of the radius *r_t*.

The centroid is a weighted average of the points located on the shape. Typically, the weight is constant. However, if the points are not uniformly distributed on the shape, you may use appropriate weights to correct this artefact. This is illustrated in Figure 2, with detailed computations in the “Shape Signature” tab in my spreadsheet. My technical document, once published, explains how to do it. Note that in Figure 2, the corrected centroid, after reweighting, makes much more sense.

Finally, replace *r_t* by its standardized version.
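The signature steps above (centroid as origin, then radius standardization) can be sketched as follows. The example points and the uniform weights are my own illustration; with non-uniformly sampled points, corrective weights would be used as described above.

```python
# Location- and scale-free signature sketch: center at the (weighted)
# centroid, then rescale so the radius has unit variance.
import numpy as np

t = np.linspace(0, 2 * np.pi, 50, endpoint=False)
pts = np.column_stack([(1 + 0.3 * np.sin(3 * t)) * np.cos(t),
                       (1 + 0.3 * np.sin(3 * t)) * np.sin(t)])

w = np.full(len(pts), 1 / len(pts))         # uniform weights here
centroid = (w[:, None] * pts).sum(axis=0)   # weighted centroid
centered = pts - centroid
r = np.hypot(centered[:, 0], centered[:, 1])
r_std = r / r.std()                         # standardized radius
theta = np.arctan2(centered[:, 1], centered[:, 0])
# (theta, r_std) no longer depends on the shape's location or scale.
```

Translating or uniformly rescaling `pts` leaves `(theta, r_std)` unchanged, which is exactly the invariance the signature is meant to provide.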

Let us assume that the shapes are available as lists of points or, more precisely, pixels. The points need to be properly ordered for comparison purposes. In the mathematical example, the index *t* provided a natural order (not the best one, but one that leads to a reasonable solution). Both sets of points were ordered according to the same *t*, and the number of points was the same for both shapes.

With real data sets, proceed as follows. First, compute a “distance” *D*_{12} from shape 1 to shape 2, then a “distance” *D*_{21} from shape 2 to shape 1. The “distance” *D* is the minimum of *D*_{12} and *D*_{21}. To compute *D*_{12}, the ordering of the points does not matter: for each point *P* on shape 1, compute the distance *D*_{12}(*P*, *Q_P*) to its nearest neighbor *Q_P* on shape 2.

The metric *D* is closely related to the Hausdorff distance (first introduced in 1914), but less sensitive to outliers.
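The two directed distances can be sketched as below. The text does not specify how the per-point nearest-neighbor distances are aggregated, so averaging them is my assumption here (averaging, rather than taking the maximum as the Hausdorff distance does, is one way to reduce sensitivity to outliers); per the text, *D* is the minimum of the two directed values.

```python
# Hausdorff-like shape distance sketch: directed nearest-neighbor distances
# in both directions, aggregated by averaging (assumed), then D = min.
import numpy as np

def directed(A, B):
    # mean distance from each point of A to its nearest neighbor in B
    diffs = A[:, None, :] - B[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1).mean()

rng = np.random.default_rng(4)
t = np.linspace(0, 2 * np.pi, 40, endpoint=False)
shape1 = np.column_stack([np.cos(t), np.sin(t)])            # unit circle
shape2 = 1.05 * shape1 + 0.01 * rng.normal(size=shape1.shape)  # near copy

D12, D21 = directed(shape1, shape2), directed(shape2, shape1)
D = min(D12, D21)   # small D means the two shapes are similar
```

For the near-identical shapes above, `D` is small; moving the points of `shape2` further from `shape1` increases it.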

Let *ρ*_{1}(*P*) denote the distance between a point *P* on shape 1 and the centroid of shape 1, and define *ρ*_{2} similarly for shape 2. If *Q_P* is the nearest point to *P* on shape 2, define

\lambda_{12}=\frac{1}{n_1\sigma_1\sigma_2}\sum_{P\in S_1}\rho_1(P)\rho_2(Q_P).

where *n*_{1} is the number of (sampled) points on shape 1, and *S*_{1} denotes shape 1. Clearly, 0 ≤ *λ*_{12} ≤ 1 and *λ*_{12} = 1 if and only if shape 1 is an exact subset of shape 2 (after scale standardization). Define *λ*_{21} in a similar way (swapping the roles of shape 1 and shape 2) and *λ* = min(*λ*_{12}, *λ*_{21}). Now *λ* measures the correlation between shape 1 and shape 2. A more useful metric is –*λ* log(1 – *λ*). Based on test data in the spreadsheet, a value above 8 means that the two shapes are quite similar, while a value below 4 means dissimilarity.

Note that *λ* depends on the orientation of the shapes, denoted as *η*_{1} and *η*_{2} in the spreadsheet. If orientation is irrelevant, define *λ* as the minimum value of *λ*(*η*_{1}, *η*_{2}). The *λ* computed in the spreadsheet is a simplified version not based on nearest neighbors, nor on *λ*_{12} or *λ*_{21}. It still performs pretty well. The mathematical explanations are in my technical paper, to be published in two weeks. To get access to this paper once published, subscribe to our newsletter, here.
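A direct implementation of the λ formula can be sketched as follows. Two details are my assumptions: σ₁ and σ₂ are taken as root-mean-square radii (one reading of the normalization that keeps λ₁₂ near or below 1), and the nearest-neighbor version of the formula is used, unlike the simplified spreadsheet version mentioned above.

```python
# Shape correlation lambda, per the formula above:
#   lambda_12 = (1 / (n1 * sigma1 * sigma2)) * sum over P in S1 of
#               rho_1(P) * rho_2(Q_P),  lambda = min(lambda_12, lambda_21).
import numpy as np

def radii(S):
    return np.linalg.norm(S - S.mean(axis=0), axis=1)  # rho: point-to-centroid

def lam_directed(S1, S2):
    r1, r2 = radii(S1), radii(S2)
    # index of the nearest neighbor in S2 for each point of S1
    nn = np.linalg.norm(S1[:, None, :] - S2[None, :, :], axis=2).argmin(axis=1)
    s1 = np.sqrt((r1 ** 2).mean())   # sigma as RMS radius (assumed)
    s2 = np.sqrt((r2 ** 2).mean())
    return (r1 * r2[nn]).mean() / (s1 * s2)

t = np.linspace(0, 2 * np.pi, 60, endpoint=False)
S1 = np.column_stack([(1 + 0.3 * np.sin(3 * t)) * np.cos(t),
                      (1 + 0.3 * np.sin(3 * t)) * np.sin(t)])  # bumpy shape
S2 = np.column_stack([np.cos(t), np.sin(t)])                    # circle

lam = min(lam_directed(S1, S2), lam_directed(S2, S1))
similarity = -lam * np.log(1 - lam) if lam < 1 else np.inf
```

The bumpy shape and the circle yield a λ below 1 and a transformed score in the "dissimilar" range, consistent with the thresholds quoted above.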

My research relies heavily on synthetic data. To test on real shapes (say, letters of the alphabet), I recommend adding synthetic data to your training set. A blend of synthetic and real data is called *augmented data*.

One of the benefits of synthetic data is that you can test various shape classifiers to find the best performers for a specific type of shape. You can include shapes that are almost identical, based on the mathematical parameters, to find the smallest differences your algorithm can detect. Likewise, you can use trillions of shapes in your training set; indeed, infinitely many, by playing with the shape equations in the spreadsheet, to assess the discriminating power of your classifier.

Note that in the “Dashboard” tab of the spreadsheet, you can add noise to the data set. The amount of noise is determined by the `Precision` parameter: the lower the precision, the more noise. Download the spreadsheet here.

You can find references by googling “shape comparison machine learning”, “shape matching”, or “shape signature”. Below is a small selection.

- *Shape Descriptor / Feature Extraction Techniques*. Fred Park, UCI iCAMP, 2011. Available here.
- *Shape Signature Matching for Object Identification Invariant to Image Transformations and Occlusion*. Stamatia Giannarou and Tania Stathaki, 2007. Available here.
- *Shape Matching*. Kristen Grauman, University of Texas at Austin, 2008. Available here.
- *Geometrical Correlation and Matching of 2D Image Shapes*. Yu Vizilter and Sergey Zheltov, 2012. Available here.


The post Amazing Neural Network Video Demonstration first appeared on Machine Learning Techniques.

The example discussed here, though it also involves a data animation and a supervised classification problem, illustrates a different aspect of neural networks. This time, there are 5 layers. The purpose, given the picture of a shape, is to classify it (based on a training set) into one of four categories: circle, square, triangle, or unknown. Note that my classification problem also involved four classes.

Interestingly, in my case, the data was standard numerical, tabular observations (synthetic data) turned into images for easy GPU processing. Here, the non-synthetic data consists of actual images, but the video does not feature real images: the roles are reversed. Instead, it features the neural network architecture in action, showing how the signal propagates across the layers until a specific observed shape is assigned to one of the four categories. This offers a very different perspective on how a neural network classifier works: a back-end view of the operations, while my video offers a front-end view.

Another difference is the use of non-linear functions in my neural network, while the example featured here relies on standard (linear) weights between connected neurons. Also, Ryan’s neural network is not sparse, quite the contrary, which explains why it needs fewer layers.

The description below is from Ryan Chesler, the author of the video.

*This animation exhibits a multi-layer perceptron with dropout, trained on a dataset of hand-drawn squares, circles, and triangles. It was made with the Python Matplotlib animation function. The code will plot a neural network of any dimensions when given the input sums and weight matrices between each layer, and colors the nodes based on their saturation. It takes one training example every 25 epochs and shows its forward pass, as well as the cross-entropy loss and accuracy.*

The source code is on GitHub, here. As for my video, the source code is on my GitHub repository, here. My code is described in detail in my new book, available here. You can also watch my video here. Below is Ryan’s video.

Ryan’s videos are on YouTube, here. Mine are also on YouTube, here. To not miss future articles and receive monthly updates, sign up to our newsletter.


The post New Neural Network with 500 Billion Parameters first appeared on Machine Learning Techniques.

This system performs translations, answers questions like Alexa does, summarizes documents, performs arithmetic, and more. I was especially interested in its code translation capability (translating Perl to Python) and its arithmetic engine. I use Mathematica’s AI system to solve complex mathematical problems, in particular symbolic math, and I am curious to see how it compares to PaLM. How long would it take for PaLM to compute the first trillion digits of π? The GIF below shows a few tasks that PaLM can perform; it switches to the next task after 5 seconds or so.

I haven’t read the documentation in detail yet; it is posted here. But at first glance, it sounds really impressive. Google released an 83-page technical paper about it, available here.

Very large and deep neural networks are not something new. The VDNN (very deep neural network) supervised classifier that I created can have 1,000 layers (thus the term *very deep*), and each layer can easily accommodate 4,000 x 4,000 neurons, amounting to 16 billion neurons. In practice, unless your data set is high-dimensional or you need several digits of accuracy, 16 million neurons perform just as well. In my case, a neuron is a pixel in an image. A layer is an image in a series of iterated images (video frames) obtained when recursively applying a filter to some initial image. The initial image is a mapping of the original training set into a bitmap, used for fast supervised classification performed on a GPU (graphics processing unit). The final image is the classification of the entire space.
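A toy sketch in the spirit of this pixel-as-neuron design is shown below. It is my own illustration, not the actual VDNN code: training points are mapped to labeled pixels on a bitmap, and each "layer" is one pass of a local filter in which an unlabeled pixel connects to a single neighboring pixel and copies its label, until the whole space is classified.

```python
# Toy label-propagation filter on a bitmap: 0 = unlabeled, 1 and 2 = classes.
# Each filter pass plays the role of one layer; each unlabeled pixel uses a
# single connection to one labeled neighbor in its 3x3 window.
import numpy as np

rng = np.random.default_rng(5)
grid = np.zeros((50, 50), dtype=int)
for _ in range(20):                           # class-1 seeds, left half
    grid[rng.integers(0, 50), rng.integers(0, 25)] = 1
for _ in range(20):                           # class-2 seeds, right half
    grid[rng.integers(0, 50), rng.integers(25, 50)] = 2

layers = 0
while (grid == 0).any():                      # one pass = one "layer"
    new = grid.copy()
    for i, j in zip(*np.where(grid == 0)):
        window = grid[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
        labels = window[window > 0]
        if labels.size:
            new[i, j] = labels[0]             # copy one neighbor's label
    grid = new
    layers += 1
# After `layers` passes, every pixel of the space carries a class label.
```

The number of passes needed grows with the distance from the seeds, which is one intuition for why a sparse, one-connection-per-neuron design calls for many layers.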

One peculiarity of my VDNN is that each neuron has only one connection, to the nearest neuron (pixel) in the image. Thus the number of parameters equals the number of neurons. It produces a very fine, granular classification. I could use a much larger filtering window, say 40 x 40 pixels. This would multiply the number of parameters by (40 + 1) x (40 + 1) = 1,681, resulting in 27 trillion parameters. But this approach fails: it does not converge to anything meaningful^{1}. It works, though, if the number of layers is drastically reduced, say from 1,000 down to 4. Then it produces a different classification, with smoother borders between adjacent clusters. In this case it is no longer a very deep neural network, because it only has 4 layers, yet it still has 108 billion parameters.

I am curious to know which approach Google has chosen: many layers with low connectivity between neurons, or the other way around. Maybe you can fine-tune PaLM to choose between these two extremes. In any case, my VDNN performs only one task, while PaLM does many different things.

My VDNN is described here. The training set, augmented after 14 iterations, is shown on the left in the picture below. The output classification of my VDNN is on the right.

The above picture shows a VDNN classification that a human brain could do too. But there are examples where no apparent cluster structure is visible to the naked eye. Yet the VDNN manages to detect the structure, outperforming the human brain.

The human brain can perform translation, drive a car, or perform text summarization. Here is an example where the human brain is of no use.

The left plot in the above picture shows a superposition of four shifted, perturbed lattice point processes. Clustering occurs around each vertex of each lattice, but it is not perceptible to the naked eye due to the mixing. Each color is associated with a different lattice. The neural network knows the location of the points, but not the color. The plot on the right shows the result of (unsupervised) clustering, with the darkest areas centered around invisible vertices. Not all vertices were detected, due to overlapping and mixing, but those that were identified correspond to real lattice vertices.

For another example of deep clustering, see the article “DeepDPM: Deep Clustering With an Unknown Number of Clusters”, published in March 2022 and available here.

A model with 500 billion parameters, and an equally big training set, is impressive. But it is all relative. The Poisson-binomial and Exponential-binomial probability distributions discussed in my new book have an infinite number of parameters.

In experimental math, we work with infinite data sets. Not only infinite, but uncountable: the set of all complex numbers, for instance. Samples are obviously of finite size, but they can be gigantic. There is no limit to the sample size, other than the number of elementary particles in the universe (estimated to be around 10^{97}) and how much time and energy you have at your disposal to run the computations (the universe may last only another 100 billion years).

The goal is to unveil very deep patterns about numbers, to better understand conjectures or find new ones. In practice, my samples rarely exceed a few billion units (numbers, or other mathematical objects). Yet there are times when an assumption is true for all integers up to (say) 10^{300}, and then no longer true for larger numbers: see an example here. So even a sample of size 10^{300} is not enough in a number of cases. Of course, other techniques and mathematical theory are used to detect these peculiarities.

Also, if your “unknown” is a function *f*, each value *f*(*x*), for any real number *x*, is a parameter. If the function is smooth enough, estimating *f*(*x*) for each *x* from sampled values is easy: it is called interpolation. But if your function is highly chaotic or nowhere differentiable, the problem is no longer simple. You are truly dealing with an infinite number of “unconnected” parameters.
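
The contrast between the two situations is easy to demonstrate. In the sketch below, a smooth function is recovered accurately from 21 sampled values by linear interpolation, while a rapidly oscillating stand-in for a "chaotic" function is not; both target functions are illustrative choices of mine:

```python
import numpy as np

# Contrast between the two situations described above: interpolating a
# smooth function vs. a highly irregular one from the same sampled values.

def smooth(x):
    return np.sin(x)

def rough(x):                      # crude stand-in for a chaotic function
    return np.sin(50 * x) * np.sign(np.sin(700 * x))

x_coarse = np.linspace(0, 1, 21)           # sampled values
x_fine = np.linspace(0.01, 0.99, 500)      # points to estimate

err_smooth = np.max(np.abs(np.interp(x_fine, x_coarse, smooth(x_coarse)) - smooth(x_fine)))
err_rough = np.max(np.abs(np.interp(x_fine, x_coarse, rough(x_coarse)) - rough(x_fine)))

print(f"max error, smooth function: {err_smooth:.5f}")   # tiny
print(f"max error, rough function:  {err_rough:.3f}")    # order 1
```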

To receive notices about new articles, subscribe to our newsletter. See also the related article “Very Deep Neural Networks Explained in 40 Seconds”, including a video showing the VDNN in action, here.

- Unless the parameters (also called weights) are all very close to zero, except for the connection between the central neuron and its closest neighbor (pixel) in the local filter window. But that would be equivalent to working with a VDNN with low connectivity between nearest neighbors in the same layer. Such a VDNN makes no sense, except for marketing purposes: being able to claim that you have 27 trillion parameters! It is still useful for testing purposes, though: to check how much time it takes to process 1,000 layers with a total of 27 trillion parameters.

The post New Neural Network with 500 Billion Parameters first appeared on Machine Learning Techniques.

The post Why are Confidence Regions Elliptic? Simple Explanation first appeared on Machine Learning Techniques.

One may argue that ellipses (level curves of quadratic functions) are the simplest generalization of linear boundaries, thus their widespread use. But here, there is a much deeper reason, and it is easier to understand than you might think. Many statisticians take it for granted that a confidence region should be an ellipse, but I never found a real justification. This article fills that gap. I discuss the elliptic case first, and then provide a non-elliptic example.

While this is nowhere mentioned in the statistical literature, it makes sense to require the confidence region to have minimum area. Determining its shape is then a variational problem. Such problems are solved with the mathematical machinery of functional analysis and the calculus of variations, involving functional, differential, and integral equations. These topics are rather advanced.

The most famous example is the brachistochrone problem: determining the curve of fastest descent between a point A and a lower point B, for a bead rolling smoothly downhill under gravity alone. The problem was posed by Johann Bernoulli in 1696. The solution is illustrated below. Contrary to popular belief, the straight line is not the fastest path; it is actually the slowest one.

Interestingly, finding the shape of a confidence region of minimum area is perhaps the most elementary problem in this class. Think of a bivariate bell curve. A confidence region of minimum confidence level (0%) is reduced to a point. As you increase the confidence level, the region expands. It must expand as fast as possible (as a function of the confidence level) in order to be of minimum area at all times. Thus it starts as a point at the maximum of the density, and expands downwards, following contour lines at all times. In other words, the boundary of a confidence region is a contour line of the underlying density.

In mathematical terms, if H(*x*, *y*) is the probability density and *γ* the confidence level, the boundary of a confidence region of level *γ* is defined by the contour line H(*x*, *y*) = *G _{γ}*. Here *G _{γ}* is the density threshold, uniquely determined by the requirement that the region enclosed by the contour line has probability *γ*.

In many bivariate statistical estimation problems, due to the central limit theorem, the parameter estimators asymptotically have a Gaussian distribution. That is, the limiting probability density (when the sample size is large) is the exponential of a negative bivariate quadratic function. Since the exponential function is monotonic, one can take the logarithm instead and still preserve the shape of the confidence region, as well as the one-to-one mapping between *γ* and *G _{γ}*. The boundary of the confidence region is then determined by the quadratic function in question. Thus, it is an ellipse! See the figure below.
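
The elliptic case can be checked numerically. The sketch below is my own illustration (not from the book): for the standard bivariate Gaussian, the *γ*-level region bounded by a contour line is the disk x² + y² ≤ −2 ln(1 − *γ*), since x² + y² follows a chi-square distribution with 2 degrees of freedom; a Monte Carlo experiment confirms the coverage:

```python
import numpy as np

# Numerical check for the standard bivariate Gaussian: the gamma-level
# confidence region is x^2 + y^2 <= -2 ln(1 - gamma). Its boundary is a
# contour line of the density -- a circle here, and an ellipse after any
# linear change of coordinates.

rng = np.random.default_rng(42)
gamma = 0.95
r2 = -2.0 * np.log(1.0 - gamma)   # squared radius of the contour line

sample = rng.standard_normal((1_000_000, 2))
coverage = np.mean(np.sum(sample**2, axis=1) <= r2)

print(f"target confidence level: {gamma}")
print(f"Monte Carlo coverage:    {coverage:.3f}")   # close to 0.95
```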

To be more precise, the boundary of a confidence region has the general form H(*x*, *y*, *p*, *q*) = *G _{γ}*. Note that I added *p* and *q* to the notation: they represent the two parameters being estimated.

In my new book (see here), I introduced the concept of *dual confidence region*, in section 3.1. It is also briefly explained in this article, and will be the topic of an upcoming article, along with minimum contrast estimators. Sign up for our newsletter so you do not miss these articles. In a nutshell, dual confidence regions are obtained by swapping (*x*, *y*) and (*p*, *q*) in H(*x*, *y*, *p*, *q*). They are more intuitive. The resulting confidence region is no longer an ellipse, but in practice it is still very close to one.

Now, if your bivariate probability density has multiple modes (for instance, if you are dealing with a mixture of distributions), then of course the confidence regions are not ellipses at all. See the illustration below, featuring various contour lines (that is, confidence region boundaries) attached to a bimodal density.

The above plot was produced using Mathematica, with the following code:

```
Plot3D[Exp[-(Abs[x]^3.5 + Abs[y]^3.5 )] +
0.8*Exp[-4*(Abs[x - 1.5]^4.2 + Abs[y - 1.4]^4.2 )], {x, -2, 3},
{y, -2, 3}, MeshFunctions -> {#3 &}, Mesh -> 25,
Exclusions -> None, PlotRange -> {Automatic, Automatic, {0, 1}},
ImageSize -> 600]
```

In this example, depending on the confidence level, the confidence region consists of two disconnected sets.
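
The same phenomenon can be reproduced in Python. The sketch below is in the same spirit as the Mathematica example above, but uses a hypothetical, well-separated mixture of two Gaussian bumps so that the split is unambiguous; pixels above a density threshold form the confidence region, and a flood fill counts its connected components:

```python
import numpy as np

# Bimodal density in the spirit of the example above: at a low threshold
# (low density contour, high confidence level) the region is one connected
# set; at a high threshold it splits into two disconnected sets.

def density(x, y):
    return np.exp(-(x**2 + y**2) / 2) + 0.8 * np.exp(-((x - 3)**2 + (y - 3)**2) / 2)

x, y = np.meshgrid(np.linspace(-3, 6, 200), np.linspace(-3, 6, 200))

def n_components(mask):
    """Count 4-connected components of True pixels with an iterative flood fill."""
    mask = mask.copy()
    count = 0
    rows, cols = mask.shape
    for i in range(rows):
        for j in range(cols):
            if mask[i, j]:
                count += 1
                stack = [(i, j)]
                while stack:
                    a, b = stack.pop()
                    if 0 <= a < rows and 0 <= b < cols and mask[a, b]:
                        mask[a, b] = False
                        stack += [(a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)]
    return count

print(n_components(density(x, y) > 0.05))   # 1: one connected set
print(n_components(density(x, y) > 0.5))    # 2: two disconnected sets
```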


The post Very Deep Neural Networks Explained in 40 Seconds first appeared on Machine Learning Techniques.

It is said that a picture is worth a thousand words. Here instead, I use a video to illustrate the concept of very deep neural networks (VDNN).

I use a supervised classification problem to explain how a VDNN works; classification is one of the main problems in supervised learning. The training set has four groups, each assigned a different color. The type of DNN described here is a convolutional neural network (CNN): it relies on filtering techniques. In the literature, the filter is referred to as a convolution operator, hence the name CNN.

The purpose is to classify any new or future data point outside the training set. In practice, the whole training set is not used to build the classifier; instead, a subset called the test set is used, to check performance against the control set and fine-tune parameters. The control set consists of the training set points not in the test set. This type of design is called cross-validation.

The classifier, illustrated in the video, eventually classifies any new point outside the training set, instantly. In addition, this article illustrates the concepts of fractal (or fuzzy) classification, and of machine learning performed on a GPU (graphics processing unit).

The methodology consists of three steps.

**Step 1**: Transform the test set into a format suitable as input for the DNN. This may involve rescaling, or some mapping (frequently a logistic mapping) applied to the original data. In our case, the bivariate data was binned and transformed into pixel locations to fit into the video frames. The first frame of the video represents the test set after this initial mapping.
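
Step 1 might look like the sketch below. The frame size, toy data, and group labels are all illustrative, not the actual settings used for the video:

```python
import numpy as np

# Sketch of Step 1: bin bivariate data into pixel locations on a frame.
# Frame size, data, and labels are illustrative choices.

rng = np.random.default_rng(3)
points = rng.normal(size=(500, 2))        # toy bivariate data
labels = rng.integers(1, 5, size=500)     # 4 groups, colors 1..4 (0 = black)

height = width = 100
frame = np.zeros((height, width), dtype=int)

# Rescale each coordinate to [0, 1], then bin into pixel indices
lo, hi = points.min(axis=0), points.max(axis=0)
scaled = (points - lo) / (hi - lo)
rows = (scaled[:, 0] * (height - 1)).astype(int)
cols = (scaled[:, 1] * (width - 1)).astype(int)
frame[rows, cols] = labels

print(frame.shape)             # (100, 100)
print(int((frame > 0).sum()))  # number of colored (classified) pixels
```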

**Step 2**: The transition from one frame to the next, repeated until no unclassified (black) pixels are left, works as follows. A local filter is applied to each pixel to assign its color (the group it belongs to), using a majority vote among neighboring pixels. In this example, the filter is non-linear; it is similar to the high-pass or image-enhancing filters typically used in signal processing. Linear filters, known as averaging or blurring filters, are of no use here. Each frame in the video represents a layer of the DNN. It is called a very deep neural network because it involves a large number of layers (hundreds, in this example).
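
One layer transition in Step 2 can be sketched as follows. The window size and tie-breaking rule are my illustrative choices; the actual filter used for the video may differ:

```python
import numpy as np
from collections import Counter

# Sketch of one layer transition in Step 2: each black (0) pixel takes the
# majority color among already-colored pixels in a small window around it.

def next_layer(frame, radius=1):
    rows, cols = frame.shape
    out = frame.copy()
    for i in range(rows):
        for j in range(cols):
            if frame[i, j] != 0:                  # already classified: keep
                continue
            window = frame[max(0, i - radius):i + radius + 1,
                           max(0, j - radius):j + radius + 1]
            votes = Counter(window[window != 0].tolist())
            if votes:                             # at least one colored neighbor
                out[i, j] = votes.most_common(1)[0][0]
    return out

frame = np.array([[1, 0, 0, 2],
                  [0, 0, 0, 0],
                  [0, 0, 0, 2],
                  [3, 0, 0, 0]])
print(next_layer(frame))   # every black pixel gets a color in one transition here
```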

**Step 3**: The frame obtained once no black pixels are left (in the middle of the video) is the output of the DNN. To classify any future point, compute its pixel location on the image using the mapping in step 1, and read off the color assigned to that pixel.

The illustration below is a GIF image, obtained by converting my MP4 video to GIF format with the online EZGif converter. The original video can be viewed on YouTube, here. Each pixel is called a neuron in DNN terminology, and (just like in the human brain) it interacts only with neighboring neurons in a given layer. Thus the name neural network.

Since all the machine learning apparatus operates on images using standard filtering techniques (once the original data set is converted to an image), it is easy to run the algorithm in video memory, that is, on the GPU (graphics processing unit). I mention this to explain and illustrate what GPU machine learning means, for people unfamiliar with this technology.

Once no black (unclassified) pixels are left, the classifier has accomplished its task. However, in my video, I added extra frames to illustrate the concept of fractal classification. The border between clusters is somewhat porous, or fuzzy: a point close to the border may be assigned to any of the two or three groups adjacent to it. The extra frames (called layers in DNN terminology) show the border shifting over time. This allows you to compute the probability that a point next to the border belongs to one group or another, by looking at its shifting class assignments over time. I will describe this in more detail in an upcoming article.
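
That probability computation amounts to tallying class assignments over the extra frames. A minimal sketch, with a made-up assignment history for one border pixel:

```python
import numpy as np

# Fuzzy classification sketch: estimate group membership probabilities for
# one border pixel from its class assignments across the extra frames.
# The assignment history below is made up for illustration.

history = np.array([1, 1, 2, 1, 2, 2, 1, 2, 1, 1])  # group label in each extra frame

groups, counts = np.unique(history, return_counts=True)
probs = dict(zip(groups.tolist(), (counts / len(history)).tolist()))
print(probs)   # {1: 0.6, 2: 0.4}
```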

In this article, I explained in layman’s terms the concepts of deep neural network (DNN), convolutional neural network (CNN), convolution filter, layers and neurons of a neural network, GPU machine learning, and fuzzy classification.

The video illustration uses an unusually large number of layers (video frames), with each neuron (pixel) connected to very few nearby neurons, the neighboring pixels: thus the term *very deep neural network*, or VDNN. In my example, I use only one connection per neuron. This leads to a rather granular classifier, and offers a few benefits. In practice though, traditional DNNs use far fewer layers, but each neuron is connected to dozens or hundreds of other neurons; in other words, the local filter uses a much larger window.

The methodology is described in detail in my new book, available here. To not miss future updates, sign up for our newsletter below. In an upcoming article, I will show an application to unsupervised learning, with a post-processing filter playing the role of the sigmoid mapping in a DNN. This material is already available in my new book.


The post Internship at MLtechniques: Code Translation first appeared on Machine Learning Techniques.

Short-term project under the guidance of Vincent Granville, with potential for more diversified work over a longer period, up to full-time employment. Ideal for professionals interested in earning extra cash and experience, initially translating simple Perl scripts into professional Python code. Show us your code portfolio to be considered. No prior job experience is required. Experience with Python graphics libraries and hash tables is a big plus. If you are familiar with R or other programming languages, or with sound, image, or video processing, please mention it. Students are welcome to apply. You must be available at least 5 hours a week.

The Perl code in question is on my GitHub repository, here (more to come). It is also included and described in detail in my books, for instance in this book. Once the code is translated to Python, the goal is to offer new editions of these books, this time in Python.

**Benefits include**:

- Free copy of our books where the code is located
- Credit as author of the code in our books

Email vincentg@MLtechniques.com for consideration.

