Fast Random Generators with Infinite Period for Large-Scale Reproducible AI and Cryptography

Modern GenAI applications rely on billions, if not trillions, of pseudo-random numbers. You find them in the construction of latent variables in nearly all deep neural networks and across almost all applications: computer vision, synthetic data generation, and LLMs. Yet few AI systems offer reproducibility, though those described in my recent book do.

When producing random numbers at that scale, or for strong encryption, you need top-grade generators. The most popular one, adopted by NumPy and other libraries, is the Mersenne Twister. It is known for its flaws; I discovered new ones during my research and share them with you here.
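For context, Python's built-in `random` module is itself based on the Mersenne Twister (MT19937), and NumPy's legacy interface uses the same algorithm, while recent NumPy versions default to a different generator (PCG64). A minimal sketch of seeded, reproducible generation with the standard library:

```python
import random

# Python's random module uses the Mersenne Twister (MT19937) internally.
# Seeding an independent Random instance makes the stream reproducible --
# but only for as long as the library keeps the same underlying algorithm.
rng = random.Random(42)
bits = [rng.getrandbits(1) for _ in range(20)]
print(bits)

# Re-seeding reproduces the exact same stream.
rng2 = random.Random(42)
assert bits == [rng2.getrandbits(1) for _ in range(20)]
```

This is the reproducibility guarantee the article argues is fragile: it holds only within a library version, not across future changes to the generator's internals.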

This paper has its origins in the development of a new foundational framework to prove the conjectured randomness and other statistical properties of the digits of infinitely many simple math constants, such as e or π. Here, I focus on three main areas. First, how to efficiently compute the digits of the mathematical constants in question, so that you can use them at scale. Then, new tests to compare two types of random numbers, those generated by Python versus those derived from the math constants investigated here, and to help decide which system is best. Finally, I propose a new type of strongly random digits based on an incredibly simple formula (one short line of code) that leads to fast computations.
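The specific constants and the one-line formula are detailed in the linked paper. As a hedged illustration of the general idea, computing digits of a simple irrational constant exactly, with integer arithmetic rather than floating point, here is a sketch for the binary digits of √2 (my choice of constant for the example; not necessarily one the paper uses):

```python
from math import isqrt

def sqrt_bits(x: int, n: int) -> list[int]:
    """First n binary digits of sqrt(x) after the binary point,
    computed exactly with integer arithmetic (no floating point)."""
    # floor(2**n * sqrt(x)) equals isqrt(x * 4**n); subtract the integer part.
    scaled = isqrt(x << (2 * n))
    frac = scaled - (isqrt(x) << n)
    return [(frac >> (n - 1 - i)) & 1 for i in range(n)]

print(sqrt_bits(2, 32))   # binary expansion of sqrt(2), fractional part
```

Because everything is exact integer arithmetic, the output is identical on any machine and any Python version, which is the reproducibility property motivating the article.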

One of the benefits of my proposed random bit sequences, besides stronger randomness and fast implementation at scale, is that they do not rely on external libraries that may change over time. Such libraries get updated, and your results become non-replicable in the long term if (say) NumPy decides to modify the internal parameters of its random generator. And when security matters, combining billions of constants, each with its own seed, with billions of digits from each constant makes it impossible to guess which formula you used to generate your digits.
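To make the idea concrete, here is a hypothetical mixing scheme, not the author's actual construction: draw bit streams from several constants (square roots of primes, in this sketch), give each its own starting offset acting as a seed, and XOR the streams together. Recovering the formula would require guessing both the constants and the offsets.

```python
from math import isqrt

def bitstream(x: int, offset: int, n: int) -> int:
    """n binary digits of sqrt(x), starting 'offset' bits after the
    binary point, packed into an integer (exact integer arithmetic)."""
    total = offset + n
    frac = isqrt(x << (2 * total)) - (isqrt(x) << total)
    return frac & ((1 << n) - 1)

# Hypothetical scheme: constants sqrt(p) for primes p, each paired with
# its own offset (its "seed"); XOR the digit streams together.
streams = [(2, 101), (3, 57), (5, 1000)]   # (constant, offset) pairs
mixed = 0
for p, offset in streams:
    mixed ^= bitstream(p, offset, 64)
print(format(mixed, "064b"))
```

The design choice worth noting: each stream is independently reproducible from its (constant, offset) pair, so the combined sequence is fully deterministic yet opaque to anyone who lacks those pairs.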

Some of my randomness tests involve predicting the value of a string given the values of previous strings in a sequence, a topic at the core of many large language models (LLMs). Methods based on neural networks, mine being an exception, are notorious for hiding the seeds used in the various random generators involved, which leads to non-replicable results. It is my hope that this article will raise awareness of this issue, while offering better generators whose output does not depend on which library version you use.
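The tests themselves are described in the paper. As a minimal sketch of the prediction principle, not the author's actual tests, here is a next-bit predictor: it predicts each bit from the preceding k bits by majority vote over contexts seen so far. On a truly random stream, its accuracy should stay near 0.5; anything consistently above that signals detectable structure.

```python
import random

def next_bit_accuracy(bits: list[int], k: int = 8) -> float:
    """Predict each bit from the preceding k bits (majority vote over
    contexts seen so far); strong random bits should score near 0.5."""
    counts = {}                       # context -> (count of 0s, count of 1s)
    hits = total = 0
    for i in range(k, len(bits)):
        ctx = tuple(bits[i - k:i])
        c0, c1 = counts.get(ctx, (0, 0))
        if c0 + c1 > 0:               # only score contexts seen before
            hits += bits[i] == (1 if c1 > c0 else 0)
            total += 1
        counts[ctx] = (c0 + (bits[i] == 0), c1 + (bits[i] == 1))
    return hits / total if total else 0.5

rng = random.Random(123)
bits = [rng.getrandbits(1) for _ in range(100_000)]
print(round(next_bit_accuracy(bits), 3))   # hovers around 0.5
```

As a sanity check, feeding the same function a periodic stream such as `[0, 1] * 50_000` drives the accuracy close to 1.0, since every context fully determines the next bit.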

Last but not least, the datasets used here are infinite, giving you the opportunity to work on truly big data with infinite numerical precision, while getting a glimpse of deep number theory results and concepts, explained in plain English.

View article, get the code and data

The technical document is available on GitHub: see paper 44, here. It features detailed documentation with illustrations, along with the code. To make the clickable links work, download the document and open it in a browser or PDF viewer, instead of viewing it directly on GitHub.

To not miss future versions with more features, subscribe to my newsletter, here.

About the Author


Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier) and patent owner — one related to LLM. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.


Discover more from NextGen AI Technology
