New Neural Network with 540 Billion Parameters

Google just published a research article about its Pathways Language Model (PaLM), a neural network with 540 billion parameters. It is unclear to me how many layers and how many neurons (also called nodes) it contains. A parameter in this context is a weight attached to a link between two connected neurons. So the number of neurons is at most 540 billion, but it is most likely much smaller. By contrast, the average human brain has 86 billion neurons.
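
To make the distinction between neurons and parameters concrete, here is a back-of-the-envelope count for a generic fully connected network. This is an illustration only: the layer widths below are hypothetical and say nothing about PaLM's actual architecture.

```python
# Rough parameter count for a fully connected network: each weight links two
# neurons in adjacent layers, so parameters grow roughly as the square of the
# layer width. The widths below are hypothetical, chosen only for illustration.
layer_widths = [50_000, 50_000, 50_000, 50_000]

neurons = sum(layer_widths)
weights = sum(a * b for a, b in zip(layer_widths[:-1], layer_widths[1:]))

print(f"neurons: {neurons:,}")   # 200,000
print(f"weights: {weights:,}")   # 7,500,000,000 -- vastly more than the neurons
```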

This system performs translations, answers questions like Alexa does, summarizes documents, performs arithmetic, and more. I was especially interested in its code translation capability (translating Perl to Python) and its arithmetic engine. I use Mathematica’s AI system to solve complex mathematical problems, in particular symbolic math, and I am curious to see how it compares to PaLM. How long would it take PaLM to compute the first trillion digits of π? The animated GIF below shows a few tasks that PaLM can perform; it switches to the next task after about 5 seconds.
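
As a point of comparison, conventional numerical libraries compute digits of π directly rather than through a language model. Here is a minimal sketch using Python's mpmath (my own choice of tool, not anything mentioned by Google), at a scale far below a trillion digits:

```python
# Compute the first n decimal digits of pi with mpmath (arbitrary precision).
# A trillion digits calls for specialized software (e.g., y-cruncher) and weeks
# of compute; this only illustrates the principle at a very small scale.
from mpmath import mp, nstr

n_digits = 1_000           # modest demo value, nowhere near a trillion
mp.dps = n_digits + 10     # working precision, with a few guard digits
pi_str = nstr(mp.pi, n_digits)
print(pi_str[:60] + "...")
```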

PaLM performing a few tasks

I haven’t read the documentation in detail yet; it is posted here. But at first glance, it sounds really impressive. Google released an 83-page technical paper about it, available here.

Networks with a Huge Number of Layers and Neurons

Very large and deep neural networks are not something new. The VDNN (very deep neural network) supervised classifier that I created can have 1,000 layers (thus the term very deep), and each layer can easily accommodate 4,000 x 4,000 neurons. Across all layers, this amounts to 16 billion neurons. In practice, unless your data set is high dimensional or you need several digits of accuracy, 16 million neurons perform just as well. In my case, a neuron is a pixel in an image. A layer is an image in a series of iterated images (video frames) obtained by recursively applying a filter to some initial image. The initial image is a mapping of the original training set onto a bitmap, used for fast supervised classification performed on a GPU (graphics processing unit). The final image is the classification of the entire space.
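
To hint at the mechanics, here is a toy sketch in the spirit of that description. The filter choice, grid size, and label encoding are my own guesses for illustration, not the actual VDNN code:

```python
import numpy as np

def train_bitmap(points, labels, size=200):
    """Map labeled training points in [0,1]^2 onto a size x size bitmap.
    Empty pixels are 0; class labels are stored as 1, 2, 3, ..."""
    img = np.zeros((size, size), dtype=np.int32)
    idx = np.clip((points * size).astype(int), 0, size - 1)
    img[idx[:, 1], idx[:, 0]] = labels + 1
    return img

def next_layer(img):
    """One layer: each empty pixel copies the label of an adjacent labeled pixel."""
    out = img.copy()
    for dy, dx in [(0, 1), (0, -1), (1, 0), (-1, 0)]:
        rolled = np.roll(img, (dy, dx), axis=(0, 1))
        out = np.where(out == 0, rolled, out)
    return out

rng = np.random.default_rng(0)
pts = rng.random((500, 2))
lab = (pts[:, 0] + pts[:, 1] > 1).astype(int)   # two toy classes

layer = train_bitmap(pts, lab)   # the initial image: training set as a bitmap
for _ in range(1000):            # 1,000 layers, i.e., 1,000 iterated images
    layer = next_layer(layer)
# 'layer' is now the final image: every pixel of the space carries a class label.
```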

One peculiarity of my VDNN is that each neuron has only one connection, to the nearest neuron (pixel) in the image. Thus the number of parameters is equal to the number of neurons. It produces a very fine, granular classification. I could use a much larger filtering window, say 40 x 40 pixels. This would multiply the number of parameters by (40 + 1) x (40 + 1) = 1,681, resulting in 27 trillion parameters. But this approach fails to work: it does not converge to anything meaningful [1]. It does work, though, if the number of layers is drastically reduced, say from 1,000 down to 4. Then it produces a different classification with smoother borders between adjacent clusters. In this case, it is no longer a very deep neural network because it only has 4 layers, yet it still has 108 billion parameters.
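
The parameter arithmetic above is easy to reproduce:

```python
# Reproducing the parameter counts quoted above.
neurons_per_layer = 4_000 * 4_000            # 16 million pixels per image
layers = 1_000

print(neurons_per_layer * layers)             # 16 billion neurons, and as many
                                              # parameters (one connection each)
window = (40 + 1) * (40 + 1)                  # 1,681 weights per neuron
print(neurons_per_layer * layers * window)    # ~27 trillion parameters
print(neurons_per_layer * 4 * window)         # ~108 billion with only 4 layers
```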

I am curious to know which approach Google has chosen: many layers and low connectivity between neurons, or the other way around. Maybe you can fine-tune PaLM to choose between these two extremes. In any case, my VDNN performs only one task, while PaLM does a lot of different things.

Illustration of Very Deep Neural Networks

My VDNN is described here. The training set, augmented after 14 iterations, is shown on the left in the picture below. The output classification of my VDNN is on the right.

Layer #14 (left) and #250 (right) in the VDNN

The above picture shows a VDNN classification that a human brain could do too. But there are examples where no apparent cluster structure is visible to the naked eye. Yet the VDNN manages to detect the structure, outperforming the human brain.

A Task that the Human Brain Cannot Do

The human brain can translate text, drive a car, or summarize a document. Here is an example where the human brain is of no use.

Unsupervised classification

The left plot in the above picture shows a superposition of four shifted, perturbed lattice point processes. Clustering occurs around each vertex of each lattice, but it is not perceptible to the naked eye due to the mixing. Each color is associated with a different lattice. The neural network knows the location of the points, but not the colors. The plot on the right shows the result of (unsupervised) clustering, with the darkest areas centered around the invisible vertices. Not all vertices were detected due to overlapping and mixing, but those that were identified correspond to real lattice vertices.
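
Here is a sketch of how such data can be generated and clustered without labels. The lattice shifts, noise level, and the kernel density step are my own illustrative choices, not the exact setup behind the picture:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Four shifted, perturbed square lattices superimposed in the unit square.
# Shifts and noise level are illustrative guesses, not the original settings.
shifts = [(0.0, 0.0), (0.37, 0.11), (0.52, 0.63), (0.18, 0.81)]
step, noise = 0.2, 0.03
points = []
for sx, sy in shifts:
    lx, ly = np.meshgrid(np.arange(0, 1, step), np.arange(0, 1, step))
    lattice = np.column_stack([lx.ravel() + sx, ly.ravel() + sy]) % 1.0
    points.append(lattice + noise * rng.standard_normal(lattice.shape))
points = np.vstack(points)          # the network sees locations, not colors

# Unsupervised step: a kernel density estimate; its local maxima sit near the
# invisible lattice vertices (the darkest areas in the right-hand plot).
kde = gaussian_kde(points.T, bw_method=0.05)
xx, yy = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))
density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(200, 200)
```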

For another example of deep clustering, see the article “DeepDPM: Deep Clustering With an Unknown Number of Clusters”, published in March 2022 and available here.

Dealing with an Infinite Number of Parameters

A model with 540 billion parameters and an equally big training set is impressive. But it is all relative. The Poisson-binomial and Exponential-binomial probability distributions discussed in my new book have an infinite number of parameters.
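
In its standard textbook form, the Poisson-binomial distribution (the number of successes among independent Bernoulli trials with distinct probabilities) already makes the point: there is one parameter per trial, so the parameter count grows without bound. Below is a minimal sketch of its probability mass function via iterated convolution; the distributions covered in the book are more general, and this finite-parameter case is only meant as an illustration.

```python
import numpy as np

def poisson_binomial_pmf(p):
    """PMF of the number of successes among independent Bernoulli trials with
    success probabilities p[0], ..., p[n-1]: one parameter per trial."""
    pmf = np.array([1.0])
    for pk in p:
        pmf = np.convolve(pmf, [1.0 - pk, pk])
    return pmf

# Each probability is a parameter; letting n grow gives arbitrarily many.
p = 1.0 / (2.0 + np.arange(1000))     # 1,000 parameters, chosen for illustration
pmf = poisson_binomial_pmf(p)
print(pmf.argmax(), pmf.sum())        # mode of the count; total mass ~ 1
```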

In experimental math, we work with infinite data sets. Not only infinite, but uncountable: the set of all complex numbers, for instance. Samples are obviously of finite size but can be gigantic. There is no limit to the sample size other than the number of elementary particles in the universe (estimated to be around 10^97) and how much time and energy you have at your disposal to run the computations (the universe may last only another 100 billion years).

The goal is to unveil very deep patterns about numbers, to better understand existing conjectures or find new ones. In practice, my samples rarely exceed a few billion units (numbers, or other mathematical objects). Yet there are times when an assumption is true for all integers up to (say) 10^300, and then it is no longer true for larger numbers: see an example here. So even a sample size of 10^300 is not enough in a number of cases. Of course, other techniques and mathematical theory are used to detect these peculiarities.
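
A classic toy version of this phenomenon, on a much smaller scale than the 10^300 example above: Euler's polynomial n^2 + n + 41 produces primes for n = 0, ..., 39, then fails at n = 40. This is my own illustration, not the example linked above.

```python
from sympy import isprime

# A pattern that holds for a while, then breaks: n^2 + n + 41 is prime for
# n = 0..39, but composite at n = 40, where it equals 41^2 = 1681.
first_failure = next(n for n in range(100) if not isprime(n * n + n + 41))
print(first_failure)   # 40
```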

Also, if your “unknown” is a function f, each value f(x), for any real number x, is a parameter. If the function is smooth enough, estimating f(x) for each x from sampled values is easy: it’s called interpolation. But if your function is highly chaotic or nowhere differentiable, it is not a simple problem. You are truly dealing with an infinite number of “unconnected” parameters.
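
Here is a minimal sketch of that contrast, using a truncated Weierstrass series as a stand-in for a rough, “nowhere differentiable” target (my own choice of example):

```python
import numpy as np

def weierstrass(x, a=0.5, b=13.0, terms=30):
    """Partial sum of the Weierstrass series; the infinite series (with a*b
    large enough) is continuous but nowhere differentiable."""
    k = np.arange(terms)
    return np.sum(a**k * np.cos(b**k * np.pi * x[:, None]), axis=1)

# Sample both targets on a coarse grid, interpolate linearly, then measure the
# error on a fine grid: easy for the smooth target, hopeless for the rough one.
coarse = np.linspace(0, 1, 101)
fine = np.linspace(0, 1, 10_001)

for name, f in [("smooth (sine)", lambda x: np.sin(2 * np.pi * x)),
                ("Weierstrass-like", weierstrass)]:
    approx = np.interp(fine, coarse, f(coarse))
    err = np.max(np.abs(approx - f(fine)))
    print(f"{name:17s} max interpolation error: {err:.4f}")
```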

To receive notices about new articles, subscribe to our newsletter. See also the related article “Very Deep Neural Networks Explained in 40 Seconds”, including a video showing the VDNN in action, here.


Notes
  1. Unless the parameters (also called weights) are all very close to zero, except for the connection between the central neuron and its closest neighbor (pixel) in the local filter window. But that would be equivalent to working with a VDNN with low connectivity, for nearest neighbors in the same layer. Such a VDNN makes no sense, except for marketing purposes: to be able to claim that you have 27 trillion parameters! Of course, such a VDNN is still useful for testing purposes: to check how much time it takes to process 1,000 layers with a total of 27 trillion parameters.
