The post Course: Intuitive Machine Learning first appeared on Machine Learning Techniques.

The information below provides a brief overview of the course.

Solid machine learning foundations presented by a world-leading expert. Full life cycle of machine learning development applied to enterprise-grade projects. Includes Python coding, scientific computing, optimization algorithms, explainable AI, and state-of-the-art methods favoring simplicity, scalability, reusability, replicability, fast implementation, and easy maintenance. From data cleaning to model design, testing, and feature selection, to great visualizations that are easy to “sell” to stakeholders and decision makers. Depending on the student’s background and interests, topics may cover augmented data, generative and mixture models, big data, deep neural networks, image processing, machine learning on GPU, graph models, curve and shape fitting, taxonomy creation (NLP), and more. Numerous regression methods, including logistic and Lasso, are unified and presented under the same umbrella.

Familiarity with basic linear algebra concepts such as elementary matrix operations. Familiarity with basic calculus principles such as the maximum and minimum of a function. Ability to install Python and relevant libraries on your laptop (though I will explain how to do it). Familiarity with basic file processing on a laptop or in an online folder.

Anyone with some analytic background (engineer, analyst, data scientist, quant, statistician, software developer, teacher), preferably with at least one year of college education including a first course in calculus, and some exposure to programming languages (C, C++, Java, Python, PHP, Perl, R, SQL). Experience manipulating datasets, even if only in Excel, will help.

The course is suited to busy professionals and students who want to learn quickly and get to the important points without wasting time on long, boring videos. It is also ideal for self-learners who need a solid “jump-start” for career acceleration and who want to start working on real-life problems quickly.

Be able to complete machine learning projects from beginning to end, just like a professional working in the industry, for projects ranging from NLP, clustering, and regression to computer vision. Learn how to learn and become independent enough to solve any future problem. Tasks performed during the training include writing Python code and using Python libraries, data cleaning and exploratory analysis, modeling and testing using cross-validation methods, implementing model-free techniques, feature and model selection, testing black-box systems using synthetic data, and state-of-the-art data animations (including data videos and sound) to present your results. Successful completion of four modules comes with a personal recommendation (endorsement) on LinkedIn.

Participants are also encouraged to seek advice regarding various career options. The instructor — born into a modest family — has literally done it all and is happy to help you. This includes raising VC money, working for various startups and large companies across multiple industries, self-funding a business from creation to its sale to a publicly traded company, starting your own blog and turning it into a multi-million-dollar revenue stream, or creating a strong online presence (including GitHub and LinkedIn) with so many connections that you will never have to look for a job again (jobs will come to you).

More modules will be added. Currently, the following are offered.

**Python** — Installing Python, running Python scripts, and using libraries while understanding what they do. Writing your own code to solve new problems, using the most appropriate data structures. This course is more than an introduction to Python: it aims at making you capable of quickly obtaining the right information to solve any problem you may face, and introduces you to scientific computing. I also discuss automated data cleaning and exploratory data analysis, as well as using GitHub. For code samples, see here.

**Supervised and Unsupervised Learning** — Covers the core of machine learning, including classification, clustering, regression, structuring unstructured data, cross-validation, model fitting, feature selection, and a simple ensemble method related to boosted trees. Nearest neighbor graphs and deep neural networks are discussed in the context of GPU machine learning: classifying data using image processing techniques, after turning tabular data into images. A new, simple clustering and mode-finding algorithm with an exact solution (compared to *K*-means).

**Generative Models, Explainable AI and Synthetic Data** — Testing black-box systems, designing better ones, and generating and leveraging rich synthetic data to improve the robustness of predictions, minimize overfitting, and assess when an algorithm performs well, and when it does not. Useful for wide data and fraud analysis. This module also covers bootstrapping, alternatives to R-squared, minimum contrast estimation, and dual confidence regions.

**Visualization and Data Animation Techniques** — Producing high-quality visualizations in Python, including animated GIFs, data videos, and even soundtracks, to present insights that are easy to grasp by non-experts. Topics include optimum palettes, leveraging color transparency, video processing in R and Python, visualizing high-dimensional data, scatterplots for high-dimensional data, and sound processing in Python.

**Time Series** — Including random walks, 2D Brownian motions with strong clustering structure, integrated Brownian motions, smooth and chaotic processes, parameter estimation for non-periodic time series, pseudo-random numbers and prime test of randomness, auto-regressive processes, special time series, and an introduction to discrete dynamical systems. Special topics: long-range autocorrelations, and optimization techniques in the presence of numerical instability using hybrid Monte-Carlo simulations and fixed-point algorithms.

**Natural Language Processing** — Enterprise-grade web crawling and text parsing techniques are used to create keyword taxonomies, with numerous practical applications. Besides solving original real-life problems, the goal is to structure unstructured data, and to develop distributed algorithms that can be resumed without data loss after computer crashes. Computational complexity and fast clustering of text data are discussed.

**Statistical Foundations** — New tests of independence, a large selection of probability distributions, a simple alternative to *K*-means clustering, an alternative to logistic regression, and model-free hypothesis tests. Fundamental theorems with applications: central limit, Berry-Esseen, Kolmogorov-Smirnov, the law of the iterated logarithm, and Le Cam’s theorem. Model-free confidence intervals with very fast convergence based on sample size, with both core theoretical results and applications. This module features a unified and original approach to all regression problems and curve fitting, using constrained optimization and gradient methods.

**LaTeX** — You will learn how to produce modern, top-quality, well-structured documents in LaTeX, including books with a glossary, index, bibliography, tables, figures, smart use of colors, cross-references, and external clickable links. See examples here. Indeed, we use the LaTeX sources of textbooks and articles presented as teaching material in the other modules as templates to start building great PDF documents. The course starts with installing MiKTeX on your laptop, or using the Overleaf online platform.

These testimonials pertain to the training material published by the author.

- Jackson Andreas Pola — Hello Vincent, I find all the materials you shared on your website extremely useful. I will share this with my colleagues who started their journey in machine learning. Again thank you for being connected on LinkedIn. Kind regards, Jackson
- Mohammed Alshahrani — Thanks Vincent always your materials are supportive. Most of my students used to review your online materials. You might not know but frankly your impact is very noticeable specially for low-income University students.
- Isabel Marín — Very interesting your last article “The sound that the data make”. Would you be interested, once I have introduced my students to the basics, in participating in one of the classes online? Showing them your work.
- Milan McGraw — Thank you Vincent, I appreciate your operational excellence and resources. You are an invaluable resource to the community!

Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).

Vincent published in *Journal of Number Theory*, *Journal of the Royal Statistical Society* (Series B), and *IEEE Transactions on Pattern Analysis and Machine Intelligence*. He is also the author of multiple books, available here. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.


The post Machine Learning Dictionary first appeared on Machine Learning Techniques.

**Autoregressive processes**. *Auto-correlated time series*. Time-continuous versions include *Gaussian processes* and *Brownian motions*, while *random walks* are a discrete example; two-dimensional versions exist. These processes are essentially integrated *white noise*.
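
As a minimal sketch (my own illustration, not taken from the book), white noise, its integrated version (a random walk), and an AR(1) process can be simulated in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)
white_noise = rng.normal(0, 1, 1000)   # uncorrelated Gaussian noise
random_walk = np.cumsum(white_noise)   # integrating the noise yields a random walk

# An AR(1) process generalizes this: x[t] = rho * x[t-1] + noise[t]
rho = 0.8
x = np.zeros(1000)
for t in range(1, 1000):
    x[t] = rho * x[t - 1] + white_noise[t]
```

With |rho| < 1 the AR(1) process is stationary; as rho approaches 1 it behaves like the random walk.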

**Binning**. Feature binning consists of aggregating the values of a feature into a small number of bins, to avoid *overfitting* and reduce the number of *nodes* in methods such as *naive Bayes*, *neural networks* or *decision trees*. Binning can be applied to two or more features simultaneously. I discuss *optimum binning* in my book.
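
As a quick illustration (a plain equal-width scheme, not the optimum binning discussed in the book), NumPy's `digitize` assigns each value of a feature to a bin:

```python
import numpy as np

values = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.55, 0.2])
bin_edges = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # four equal-width bins

# Each value gets the index of the bin it falls into (edges[i-1] <= x < edges[i])
bin_ids = np.digitize(values, bin_edges)
```

The model then works with the small set of bin indices instead of the raw values.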

**Boosted model**. Blending of several models to get the best of each one, also referred to as *ensemble methods*. The concept is illustrated with *hidden decision trees* in my book. Other popular examples are *gradient boosting* and *AdaBoost*.

**Bootstrapping**. A data-driven, model-free technique to estimate parameter values by optimizing *goodness-of-fit* metrics. Related to resampling in the context of *cross-validation*. In my book, I discuss *parametric bootstrap* on *synthetic data* that mimics the actual observations.
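
A minimal percentile-bootstrap sketch (my own example; the book's parametric bootstrap works on synthetic data instead):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)  # observed data

# Resample with replacement many times and recompute the statistic each time
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

# Percentile bootstrap confidence interval for the mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
```

The spread of `boot_means` estimates the sampling variability of the mean without assuming any probability distribution.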

**Confidence Region**. A confidence region of level *γ* is a 2D set of minimum area covering a proportion *γ* of the mass of a bivariate probability distribution. It is a 2D generalization of *confidence intervals*. In my book, I also discuss *dual confidence regions* — the analog of *credible regions* in Bayesian inference.

**Cross-validation**. Standard procedure used in *bootstrapping*, and to test and validate a model, by splitting your data into a *training set* and a *validation set*. Parameters are estimated on the training set. An alternative to cross-validation is testing your model on *synthetic data* with a known response.
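
A bare-bones K-fold split can be sketched in NumPy alone (a hypothetical helper of my own; libraries such as scikit-learn provide production versions):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k folds; each fold serves once as validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

splits = list(kfold_indices(10, 5))
```

Each of the 5 iterations trains on 8 observations and validates on the 2 held out.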

**Decision trees**. A simple, intuitive non-linear modeling technique used in classification problems. It can handle missing and categorical data, as well as a large number of features, but requires appropriate feature binning. Typically, one blends multiple binary trees, each with a few *nodes*, to boost performance.

**Dimension reduction**. A technique to reduce the number of features in your dataset while minimizing the loss in predictive power. The best-known methods are *principal component analysis* and *feature selection* to maximize *goodness-of-fit* metrics.
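
As an illustration of principal component analysis (my own sketch, via the singular value decomposition):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)            # center the features

# Principal components come from the singular value decomposition
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T          # project onto the top 2 components
explained = (s**2) / (s**2).sum()  # proportion of variance per component
```

Here five features are compressed into two components while keeping the directions of largest variance.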

**Empirical distribution**. Cumulative frequency histogram attached to a statistic (for instance, nearest neighbor distances), and based on observations. When the number of observations tends to infinity and the bin sizes tend to zero, this step function tends to the theoretical cumulative distribution function of the statistic in question.
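
The empirical distribution is easy to compute directly (a minimal sketch of my own):

```python
import numpy as np

def ecdf(sample):
    """Empirical cumulative distribution: sorted values x, F(x) = proportion <= x."""
    x = np.sort(sample)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf(np.array([3.0, 1.0, 2.0, 2.0]))
```

As the sample grows and bins shrink, this step function converges to the theoretical cumulative distribution function.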

**Ensemble methods**. A technique consisting of blending multiple models together, such as many *decision trees* with *logistic regression*, to get the best of each method and outperform each method taken separately. Examples include *boosting*, bagging, and *AdaBoost*. In my book, I discuss *hidden decision trees*.

**Explainable AI**. Automated machine learning techniques that are easy to interpret are referred to as interpretable machine learning or explainable artificial intelligence. As much as possible, the methods discussed in my book belong to that category. The goal is to design black-box systems less likely to generate unexpected results with unintended consequences.

**Feature selection**. Features — as opposed to the model response — are also called independent variables or predictors. Feature selection, akin to *dimension reduction*, aims at finding the minimum subset of variables with enough *predictive power*. It is also used to eliminate redundant features and find *causality* (typically using *hierarchical Bayesian models*), as opposed to mere correlations. Sometimes, two features have poor predictive power when taken separately, but provide improved predictions when combined together.

**Goodness-of-fit**. A *model fitting* criterion or metric to assess how well a model or sub-model fits a dataset, or to measure its *predictive power* on a *validation set*. Examples include *R-squared*, *Chi-squared*, *Kolmogorov-Smirnov*, error rates such as false positives, and other metrics discussed in my book.

**Gradient methods**. Iterative optimization techniques to find the minimum or maximum of a function, such as the *maximum likelihood*. When there are numerous local minima or maxima, use *swarm optimization*. Gradient methods (for instance, stochastic gradient descent or Newton’s method) assume that the function is differentiable. If not, other techniques such as *Monte Carlo simulations* or the *fixed-point algorithm* can be used. Constrained optimization involves using *Lagrange multipliers*.
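
A minimal gradient descent sketch on a differentiable function (my own toy example, not from the book):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient descent; assumes grad is the gradient of a differentiable function."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x, y) = (x - 3)^2 + 2*(y + 1)^2; its gradient is (2(x-3), 4(y+1))
grad = lambda v: np.array([2 * (v[0] - 3), 4 * (v[1] + 1)])
xmin = gradient_descent(grad, [0.0, 0.0])
```

The iterates converge to the unique minimum at (3, -1); for multimodal functions this only finds a local optimum.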

**Graph structures**. Graphs are found in *decision trees*, *neural networks* (connections between *neurons*), in *nearest neighbors methods* (NN graphs), in *hierarchical Bayesian models*, and more.

**Hyperparameter**. A hyperparameter is used to control the learning process: for instance, the dimension, the number of features, parameters, layers (neural networks) or clusters (clustering problems), or the width of a filtering window in image processing. By contrast, the values of other parameters (typically node weights in *neural networks* or regression coefficients) are derived via training.

**Link function**. A link function maps a nonlinear relationship to a linear one so that a linear model can be fit, and then mapped back to the original form using the inverse function. For instance, the *logit link function* is used in *logistic regression*. Generalizations include *quantile* functions and inverse *sigmoids* in *neural networks* to work with additive (linear) parameters.
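
The logit link and its inverse (the sigmoid) in a few lines (a minimal sketch of my own):

```python
import numpy as np

def logit(p):
    """Logit link: maps probabilities in (0, 1) to the real line."""
    return np.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps linear predictions back to probabilities."""
    return 1 / (1 + np.exp(-z))

p = 0.8
z = logit(p)  # the linear model works on this scale
```

A linear model is fit on the `z` scale, and `sigmoid` maps its predictions back to probabilities.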

**Logistic regression**. A generalized linear *regression* method where the binary response (fraud/non-fraud or cancer/non-cancer) is modeled as a probability via the logistic link function. Alternatives to the iterative maximum likelihood solution are discussed in my book.

**Neural network**. A blackbox system used for predictions, optimization, or pattern recognition especially in computer vision. It consists of layers, neurons in each layer, *link functions* to model non-linear interactions, parameters (weights associated to the connections between neurons) and *hyperparameters*. Networks with several layers are called *deep neural networks*. Also, *neurons* are sometimes called nodes.

**NLP**. Natural language processing is a set of techniques to deal with unstructured text data, such as emails, automated customer support, or webpages downloaded with a crawler. An example discussed in my book deals with creating a keyword taxonomy based on parsing Google search results pages.

**Numerical stability**. This issue, occurring in unstable optimization problems (typically those with multiple minima or maxima), is frequently overlooked and leads to poor predictions or high volatility. It is sometimes referred to as an *ill-conditioned problem*. I explain how to fix it in several examples in my book. Not to be confused with numerical precision.

**Overfitting**. Using too many unstable parameters resulting in excellent performance on the *training set*, but poor performance on future data or on the *validation set*. It typically occurs with numerically unstable procedures such as regression (especially polynomial regression) when the training set is not large enough, or in the presence of *wide data* (more features than observations) when using a method not suited to this situation. The opposite is underfitting.
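
Overfitting is easy to reproduce with polynomial regression (my own toy example): a high-degree polynomial fits the noisy training points almost exactly but typically generalizes worse than a low-degree one:

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)  # noisy observations
x_val = np.linspace(0.05, 0.95, 10)
y_val = np.sin(2 * np.pi * x_val)                               # true response

def val_error(degree):
    """Mean squared error of a degree-d polynomial fit, measured on the validation set."""
    coefs = np.polyfit(x_train, y_train, degree)
    return np.mean((np.polyval(coefs, x_val) - y_val) ** 2)

err_low, err_high = val_error(3), val_error(9)
```

The degree-9 polynomial interpolates all 10 training points (zero training error), yet its validation error is typically much larger than the degree-3 fit's.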

**Predictive power**. A metric to assess the goodness-of-fit or performance of a model or subset of features, for instance in the context of *dimensionality reduction* or *feature selection*. Typical metrics include *R-squared*, or *confusion matrices* in classification.

**R-squared**. A *goodness-of-fit* metric to assess the predictive power of a model, measured on a *validation set*. Alternatives include adjusted R-squared, mean absolute error and other metrics discussed in my book.
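
Computing R-squared from scratch (a minimal sketch of my own):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                       # perfect predictions
baseline = r_squared(y, np.full(4, y.mean()))   # predicting the mean only
```

Perfect predictions give 1, and always predicting the mean gives 0; negative values indicate a fit worse than the mean.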

**Random number**. Pseudo-random numbers are sequences of binary digits, usually grouped into blocks, satisfying the properties of independent Bernoulli trials. In my book, the concept is formally defined, and strong pseudo-random number generators are built and used in computer-intensive simulations.

**Regression methods**. I discuss a unified approach to all regression problems in chapter 1 in my book. Traditional techniques include linear, logistic, Bayesian, polynomial and *Lasso regression* (to deal with numerical instability and *overfitting*), solved using optimization techniques, *maximum likelihood* methods, linear algebra (*eigenvalues* and *singular value decomposition*) or stepwise procedures.

**Supervised learning**. Techniques dealing with labeled data (classification) or when the response is known (*regression*). The opposite is *unsupervised learning*, for instance *clustering* problems. In between, you have *semi-supervised learning* and *reinforcement learning* (favoring good decisions). The technique described in chapter 1 of my book fits into unsupervised regression. *Adversarial learning* consists of testing your model against extreme cases intended to make it fail, in order to build better models.

**Synthetic data**. Artificial data simulated using a *generative model*, typically a *mixture model*, to enrich existing datasets and improve the quality of *training sets*. Called *augmented data* when blended with real data.
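
A two-component Gaussian mixture is a minimal generative model for synthetic data (my own sketch; the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 500

# Pick a component for each observation, then sample from that component
component = rng.random(n) < 0.3
synthetic = np.where(component,
                     rng.normal(-2.0, 0.5, n),   # component 1: weight 0.3
                     rng.normal(1.0, 1.0, n))    # component 2: weight 0.7
```

Fitting the mixture parameters to real data first, then sampling, yields synthetic observations that mimic the originals.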

**Tensor**. A matrix generalization with three or more dimensions. A matrix is a two-dimensional tensor. A triple summation with three indices is represented by a three-dimensional tensor, while a double summation involves a standard matrix.
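
A triple summation as a three-dimensional tensor contraction (a minimal NumPy sketch using `einsum`):

```python
import numpy as np

rng = np.random.default_rng(7)
T = rng.normal(size=(2, 3, 4))     # a three-dimensional tensor
a = rng.normal(size=2)
b = rng.normal(size=3)
c = rng.normal(size=4)

# Triple summation: sum over i, j, k of T[i,j,k] * a[i] * b[j] * c[k]
total = np.einsum('ijk,i,j,k->', T, a, b, c)

# Same result written with explicit loops
check = sum(T[i, j, k] * a[i] * b[j] * c[k]
            for i in range(2) for j in range(3) for k in range(4))
```

With a two-dimensional `T`, the same pattern (`'ij,i,j->'`) reduces to the familiar double summation involving a matrix.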

**Training set**. Dataset used to train your model in *supervised learning*. Typically, a portion of the training set is used to train the model, the other part is used as *validation set*.

**Validation set**. A portion of your *training set*, typically 20%, used to measure the actual performance of your predictive algorithm outside the training set. In cross-validation and bootstrapping, the training and validation sets are split into multiple subsets to get a better sense of variations in the predictions.
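
A plain 80/20 split can be sketched as follows (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
idx = rng.permutation(n)             # shuffle the observation indices
n_val = int(0.2 * n)                 # hold out 20% as the validation set
val_idx, train_idx = idx[:n_val], idx[n_val:]
```

The model is fit on `train_idx` only; performance measured on `val_idx` estimates how it behaves outside the training set.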


The post Explainable AI, Blackboxes and Synthetic Data first appeared on Machine Learning Techniques.

- Synthetic data design techniques, and how to identify the business processes where they are most useful
- How to test the quality of synthetic data
- The benefits, and potential detriments, of explainable AI
- Common modern enterprise data issues, such as managing unbalanced, inconsistent, small, outdated, and unstructured data
- Ways to address data leakage, as well as small, wide and unobserved data
- How synthetic data, explainable AI and other modern technologies are used to overcome these issues
- Is a 1 trillion parameter neural network necessarily better than a much smaller one?

This podcast comes with two attachments (videos) that further illustrate the problems discussed:

- Classification in action: An illustration of synthetic data and explainable AI
- Video: Using rich synthetic data to test a curve fitting blackbox, and show how it works

In this Q&A, Vincent offers solutions to problems such as creating rich and meaningful synthetic data, assessing its quality, and using augmented data to enhance predictions and test or benchmark blackbox systems. An important question is whether explainable AI and blackboxes are incompatible, and how the two can happily be “married”. Another issue is what to do when blackboxes are deemed unethical and cannot be used, for instance to automatically decide whether an applicant should receive a loan, when no one can explain why the blackbox led to a rejection. These issues and several others, such as automating data cleaning, are addressed with a focus on solutions.

*To access the podcast, follow this link.*


The post New Book: Intuitive Machine Learning and Explainable AI first appeared on Machine Learning Techniques.

This book covers the foundations of machine learning, with modern approaches to solving complex problems. Emphasis is on scalability, automation, testing, optimizing, and interpretability (explainable AI). For instance, regression techniques — including logistic and Lasso — are presented as a single method, without using advanced linear algebra. There is no need to learn 50 versions when one does it all and more. Confidence regions and prediction intervals are built using parametric bootstrap, without statistical models or probability distributions. Models (including generative models and mixtures) are mostly used to create rich synthetic data to test and benchmark various methods.

Topics covered include clustering and classification, GPU machine learning, ensemble methods including an original boosting technique, elements of graph modeling, deep neural networks, auto-regressive and non-periodic time series, Brownian motions and related processes, simulations, interpolation, random numbers, natural language processing (smart crawling, taxonomy creation and structuring unstructured data), computer vision (shape generation and recognition), curve fitting, cross-validation, goodness-of-fit metrics, feature selection, gradient methods, optimization techniques, and numerical stability.

Methods are accompanied by enterprise-grade Python code, replicable datasets and visualizations, including data animations (gifs, videos, even sound done in Python). The code uses various data structures and library functions sometimes with advanced options. It constitutes a Python tutorial in itself, and an introduction to scientific computing. Some data animations and chart enhancements are done in R. The code, datasets, spreadsheets and data visualizations are also on GitHub.

Chapters are mostly independent from each other, allowing you to read them in any order. A glossary, index, and numerous cross-references make navigation easy and unify all the chapters. The style is very compact, getting to the point quickly, and suitable for business professionals eager to learn a lot of useful material in a limited amount of time. Jargon and arcane theories are absent, replaced by simple English to facilitate reading by non-experts, and to help you discover topics usually made inaccessible to beginners.

While state-of-the-art research is presented in all chapters, the prerequisites to read this book are minimal: an analytic professional background, or a first course in calculus and linear algebra. The original presentation avoids all unnecessary math and statistics, yet without eliminating advanced topics.

*For the table of contents and related material (Python code and so on), visit the GitHub page about this book, here. To obtain your copy, follow this link. To not miss future announcements, sign up for our newsletter, here.*



The post Advanced Machine Learning with Basic Excel first appeared on Machine Learning Techniques.

I discuss ensemble methods combining many mini decision trees, blended with regression, explained in simple English with both Excel and Python implementations. Case study: a natural language processing (NLP) problem. Ideal reading for professionals who want to start light with machine learning (say, with Excel) and move very fast to much more advanced material and Python. The Python code is not just a call to some blackbox functions, but a full-fledged, detailed procedure in its own right. This algorithm is in the same category as boosting, bagging, stacking, and AdaBoost.

The method described here illustrates the concept of ensemble methods, applied to a real life NLP problem: ranking articles published on a website to predict performance of future blog posts yet to be written, and help decide on title and other features to maximize traffic volume and quality, and thus revenue. The method, called hidden decision trees (HDT), implicitly builds a large number of small usable (possibly overlapping) decision trees. Observations that don’t fit in any usable node are classified with an alternate method, typically simplified logistic regression.

This hybrid procedure offers the best of both worlds: decision tree combos and regression models. It is intuitive and simple to implement. The code is written in Python, and I also offer a light version in basic Excel. The interactive Excel version is targeted at analysts interested in learning Python or machine learning. HDT fits in the same category as bagging, boosting, stacking, and AdaBoost. This article encourages you to understand all the details, upgrade the technique if needed, and play with the full code or spreadsheet as if you wrote it yourself. This is in contrast with using blackbox Python functions without understanding their inner workings and limitations. Finally, I discuss how to build model-free confidence intervals for the predicted values.

- Methodology
  - How hidden decision trees (HDT) work
  - NLP case study: summary and findings
  - Parameters
  - Improving the methodology
- Implementation details
  - Correcting for bias
    - Time-adjusted scores
  - Excel spreadsheet
  - Python code and dataset
- Model-free confidence intervals and perfect nodes
  - Interesting asymptotic properties of confidence intervals

The technical article, entitled *Machine Learning Cloud Regression: The Swiss Army Knife of Optimization*, is accessible in the “Free Books and Articles” section, here. The text highlighted in orange in this PDF document consists of keywords that will be incorporated into the index, when I aggregate all my related articles into a single book about innovative machine learning techniques. The text highlighted in blue corresponds to external clickable links, mostly references. And red is used for internal links, pointing to a section, bibliography entry, equation, and so on.

*To not miss future articles, sign up for our newsletter, here.*



The post The Sound that Data Makes first appeared on Machine Learning Techniques.

Then, sound may allow the human brain to identify new patterns in your data set that are not noticeable in scatterplots and other visualizations. This is similar to scatterplots allowing you to see patterns (say, clusters) that tabular data is unable to render. Or to data videos, allowing you to see patterns that static visualizations are unable to render. Also, people with vision problems may find sounds more useful than images to interpret data.

Finally, another purpose of this article is to introduce you to sound processing in Python, and to teach you how to generate sound and music. This basic introduction features some of the fundamental elements: hopefully enough to get you started if you are interested in exploring this topic further.

We are all familiar with static data visualizations. Animated GIFs such as this one bring a new dimension, but they are not new. Data represented as videos is rather new, discussed in some of my recent articles, here and here. However, I am not aware of any dataset represented as a melody. This article may very well feature the first example.

As in data videos, time is a main component. The concept is well suited to time series. In particular, here I generated two time series, each with *n* = 300 observations, equally spaced in time. They represent pure, uncorrelated noise: the first is Gaussian and mapped to the sound frequencies; the second is uniform and mapped to the durations of the musical notes. Each note corresponds to one observation. I used the most standard musical scale, and avoided half-tones [Wiki] — the black keys on a piano — to produce a pleasant melody. To listen to it, click on the box below. Make sure your speakers are on. You may even play it in your office, as it is work-related after all.

Since it represents noise, the melody never repeats itself and has no memory. Yet it seems to exhibit patterns: the patterns of randomness. Random data is actually the most pattern-rich data since, if large enough, it contains all the patterns that exist. If you plot random points in a square, some will appear clustered, some areas will look sparse, and some points will look aligned. The same is true of random musical notes. This will be the topic of a future article, entitled “The Patterns of Randomness”.

The next step is to create melodies for real life data sets, exhibiting auto-correlations and other peculiarities. The bivariate time series used here is pictured below: the red curve is the scaled Gaussian noise linked to note frequencies in the audio; the blue curve is the scaled uniform noise linked to the note durations. As for myself, I plan to create melodies for famous functions in number theory (the Riemann function) and blend the sound with the silent videos that I have produced so far, for instance here.

The musical scale used in my Python code is described in Wikipedia, here. An introduction to sound generation in Python can be found on Stack Overflow, here. For stereo sounds in Python, see here. A more comprehensive article featuring known melodies, with all the bells and whistles, can be found here (part 1) and here (part 2). However, I was not able to make that code work. See also here if you are familiar with Python classes.

I think my very short code (see next section) offers the best bang for the buck. In particular, it assumes no music knowledge and does not use any library other than Numpy and Scipy.

In a WAV file, sounds are typically recorded as waves. These waves are produced by the `get_sine_wave` function, one wave per musical note. The base note has a frequency of 440 Hz. Each octave contains 12 notes, including five half-tones; I skipped those to avoid dissonances. The frequencies double from one octave to the next. I only included audible notes that can be rendered by a standard laptop, hence the instruction `in range(40,65)` in the code.

The last line of code turns the wave values into integers and saves the whole melody as `sound.wav`. Now you can write your own code to listen to your data! Or use the code to test large sequences of random notes, to find short extracts that might be good and original enough to integrate into your own music. You may also try non-sinusoidal waves, for instance a mixture of waves to emulate harmonic pitches (two or more notes at the same time) and instruments other than piano.

```
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

def get_sine_wave(frequency, duration, sample_rate=44100, amplitude=4096):
    t = np.linspace(0, duration, int(sample_rate*duration))
    wave = amplitude*np.sin(2*np.pi*frequency*t)
    return wave

# Create the list of musical notes (skip half-tones)
scale = []
for k in range(40, 65):
    note = 440*2**((k-49)/12)
    if k % 12 not in (0, 2, 5, 7, 10):
        scale.append(note)
M = len(scale)  # number of musical notes

# Generate the data
n = 300
np.random.seed(101)
x = np.arange(n)
y = np.random.normal(0, 1, size=n)           # Gaussian noise -> frequencies
z = np.random.uniform(0.100, 0.300, size=n)  # uniform noise -> durations
ymin, ymax = min(y), max(y)   # renamed to avoid shadowing built-ins min/max
y = 0.999*M*(y - ymin)/(ymax - ymin)         # rescale y to [0, M)
plt.plot(x, y, color='red', linewidth=0.6)
plt.plot(x, 15*z, color='blue', linewidth=0.6)
plt.show()

# Turn the data into music: one note per observation
wave = []
for t in x:
    note = int(y[t])
    duration = z[t]
    frequency = scale[note]
    new_wave = get_sine_wave(frequency, duration=duration, amplitude=2048)
    wave = np.concatenate((wave, new_wave))
wavfile.write('sound.wav', rate=44100, data=wave.astype(np.int16))
```
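For readers who want to try the non-sinusoidal waves suggested above, here is one possible variant of `get_sine_wave` (a sketch of mine, not part of the original code): a mixture of the fundamental frequency and a few overtones at integer multiples, with illustrative weights.

```python
import numpy as np

def get_harmonic_wave(frequency, duration, sample_rate=44100,
                      amplitude=4096, weights=(1.0, 0.5, 0.25)):
    # Mixture of sine waves: the fundamental plus overtones at integer
    # multiples of the base frequency; the weights are an arbitrary
    # choice meant to emulate a richer timbre than a pure sine.
    t = np.linspace(0, duration, int(sample_rate*duration))
    wave = np.zeros_like(t)
    for k, w in enumerate(weights, start=1):
        wave += w*np.sin(2*np.pi*k*frequency*t)
    return amplitude*wave/sum(weights)  # rescale so the peak stays bounded

wave = get_harmonic_wave(440, 0.25)
```

Swapping `get_sine_wave` for `get_harmonic_wave` in the loop above changes the timbre; mixing two different `scale` frequencies instead would produce a chord.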

*To not miss future articles, sign up to our newsletter, here.*

Vincent Granville published in the *Journal of Number Theory*, *Journal of the Royal Statistical Society* (Series B), and *IEEE Transactions on Pattern Analysis and Machine Intelligence*. He is also the author of multiple books, available here. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.

The post The Sound that Data Makes first appeared on Machine Learning Techniques.

]]>The post Machine Learning Cloud Regression: The Swiss Army Knife of Optimization first appeared on Machine Learning Techniques.

]]>Many machine learning and statistical techniques exist as seemingly unrelated, disparate algorithms designed and used by practitioners from various fields, under various names. Why learn 50 types of regressions when you can solve your problems with one simple generic version that covers all of them and more?

The purpose of this article is to unify these techniques under the same umbrella. The data set is viewed as a cloud of points, and the distinction between response and features is blurred. Yet I designed my method to make it backward-compatible with various existing procedures. Using the same method, I cover linear and logistic regression, curve fitting, unsupervised clustering and fitting non-periodic time series, in less than 10 pages plus Python code, case studies and illustrations.

The fairly abstract approach leads to simplified procedures and nice generalizations. For instance, I discuss a generalized logistic regression with the logistic function replaced by any unspecified CDF and solved using empirical distributions. My new unsupervised clustering technique — with an exact solution — identifies the cluster centers prior to classifying the points. I compute prediction intervals even when the data has no response, in particular in curve fitting problems or for the shape of meteorites. Predictions for non-periodic time series such as ocean tides are done with the same method. I also show how to adapt the method to unusual situations, such as fitting a line (not a plane) or two planes in three dimensions.

No statistical theory or probability distributions are involved, except in the design of the synthetic data used to test the method. Confidence regions and estimates are based on parametric bootstrap. I provide a quick illustration for statisticians (used to a different framework) in the section "Example for Statisticians".

This article is not about regression performed in the cloud. It is about considering your data set as a cloud of points or observations, where the concepts of dependent and independent variables (the response and the features) are blurred. It is a very general type of regression, offering backward-compatibility with existing methods. Treating a variable as the response amounts to setting a constraint on the multivariate parameter, and results in an optimization algorithm with Lagrange multipliers. The originality comes from unifying, under the same umbrella, a number of disparate methods, each solving part of the general problem and originating from various fields. I also propose a novel approach to logistic regression, and a generalized R-squared adapted to shape fitting, model fitting, feature selection and dimensionality reduction. In one example, I show how the technique can perform unsupervised clustering, with confidence regions for the cluster centers obtained via parametric bootstrap.

Besides ellipse fitting and its importance in computer vision, an interesting application is non-periodic sums of periodic time series. While rarely discussed in machine learning circles, such models explain many phenomena, for instance ocean tides. They are particularly useful in time-continuous situations where the error is not a white noise, but instead smooth and continuous everywhere, for instance granular temperature forecasts. Another curious application is modeling meteorite shapes. Finally, my methodology is model-free and data-driven, with a focus on numerical stability. Prediction intervals and confidence regions are obtained via bootstrapping. I provide Python code and synthetic data generators for replication purposes.
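As a toy illustration of such non-periodic sums (my own example, not the article's tide or temperature data): two sinusoids with incommensurate frequencies never produce a periodic sum, yet the components can be recovered by ordinary linear least squares once the frequencies are assumed known.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 40, 800)
w = [1.0, np.sqrt(2)]   # incommensurate frequencies -> non-periodic sum
y = 2.0*np.sin(w[0]*t) + 0.7*np.cos(w[1]*t) + rng.normal(0, 0.05, t.size)

# Design matrix with sin and cos columns at each assumed frequency
X = np.column_stack([f(wk*t) for wk in w for f in (np.sin, np.cos)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef recovers approximately [2.0, 0.0, 0.0, 0.7]
```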

1 Introduction: circle fitting

. . . Previous versions of my method

2 Methodology, implementation details and caveats

. . . Solution, R-squared and backward compatibility

. . . Upgrades to the model

3 Case studies

. . . Logistic regression, two ways

. . . Ellipsoid and hyperplane fitting

. . . . . . Curve fitting: 250 examples in one video

. . . . . . Confidence region for the fitted ellipse

. . . . . . Python code

. . . Non-periodic sum of periodic time series

. . . . . . Numerical instability and how to fix it

. . . . . . Python code

. . . Fitting a line in 3D, unsupervised clustering, and other generalizations

. . . . . . Example: confidence region for the cluster centers

. . . . . . Exact solution and caveats

. . . . . . Comparison with K-means clustering

I provide here a comparison with standard regression on the most trivial example, for statisticians. In statistics, fitting a line means estimating *a*, *b* in *y* = *ax* + *b*. In my approach, it means finding *a*, *b*, *c* minimizing the sum of squared errors in *ax* + *by* + *c* = 0. There is no dependent variable. But you fall back on standard regression if you set *b* = -1.

You need a constraint, and *a*^{2} + *b*^{2} + *c*^{2} = 1 leads to a more elegant approach than *b* = -1. The constraint results in a Lagrange multiplier in the least squares optimization. Confidence regions for (*a*, *b*, *c*) are obtained via bootstrap. There is no likelihood function involved. Prediction intervals are for the error between the true *ax* + *by* + *c*, supposed to be zero by design, and the estimated one using the estimated (*a*, *b*, *c*), at a specific (*x*, *y*). I use the notation *θ* for the parameter (*a*, *b*, *c*).
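This one trivial case can be sketched in a few lines. With the constraint *a*² + *b*² + *c*² = 1, minimizing the sum of squared errors amounts to taking the right singular vector of the data matrix [*x* | *y* | 1] associated with its smallest singular value (the Lagrange multiplier condition). The implementation and test data below are my own minimal sketch, not the article's code.

```python
import numpy as np

def fit_line_cloud(x, y):
    # Minimize sum of (a*x_i + b*y_i + c)^2 subject to a^2 + b^2 + c^2 = 1.
    # Solution: right singular vector for the smallest singular value.
    A = np.column_stack((x, y, np.ones(len(x))))
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]  # (a, b, c), unit norm

# Synthetic cloud around the line y = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2*x + 1 + rng.normal(0, 0.1, 200)
a, b, c = fit_line_cloud(x, y)
slope, intercept = -a/b, -c/b   # back to the y = ax + b parametrization
```

Setting *b* = -1 instead would reproduce ordinary least squares; here both variables are treated symmetrically, as in the cloud framework.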

The technical article, entitled *Machine Learning Cloud Regression: The Swiss Army Knife of Optimization*, is accessible in the "Free Books and Articles" section, here. The text highlighted in orange in this PDF document consists of keywords that will be incorporated into the index when I aggregate all my related articles into a single book about innovative machine learning techniques. The text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links, pointing to a section, bibliography entry, equation, and so on.




]]>The post Weird Random Walks: Synthetizing, Testing and Leveraging Quasi-randomness first appeared on Machine Learning Techniques.

]]>I discuss different types of synthetized random walks that are almost perfectly random, in one and two dimensions. Besides the theoretical interest, it provides new modeling tools, especially for physicists, engineers, natural sciences, security, fintech and quant professionals.

The kind of irregularities injected in these random walks are especially weak and hard to detect. The research results presented here are new, focused on applications, and state-of-the-art. In addition to offering original modeling tools, these unusual stochastic processes can be used to benchmark fraud detection systems or to benchmark tests of randomness.

The picture below features a metric that magnifies the very weak patterns, to show that despite all appearances, something is “off”, and definitely not random in my simulated random walks. You can fine-tune various parameters in the accompanying Python code, to produce different types of non-randomness, ranging from totally undetectable to hard to detect.

This is a follow-up to my article “Detecting Subtle Departures from Randomness”, where I introduced the prime test to identify very weak violations of various laws of large numbers. Pseudo-random sequences failing this test usually pass most test batteries, yet are unsuitable for a number of applications, such as security, strong cryptography, or intensive simulations. The purpose here is to build such sequences with very low, slow-building, long-range dependencies, but that otherwise appear as random as pure noise. They are useful not only for testing and benchmarking tests of randomness, but also in their own right to model almost random systems, such as stock market prices. I introduce new categories of random walks (or quasi-Brownian motions subject to constraints), and discuss the peculiarities of each category. For completeness, I included related stochastic processes discussed in some of my previous articles, for instance integrated and 2D clustered Brownian motions. All the processes investigated here are drift-free and symmetric, yet not perfectly random. They all start at zero.
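As a concrete (and deliberately simple) illustration of this kind of process, not the article's actual algorithm: one can build a symmetric, drift-free walk with less entropy than pure noise by fixing the composition of each block of steps. Every block of 2*m* steps contains exactly *m* up-moves in random order, so locally the walk looks like a fair ±1 walk, yet it returns to its starting level at every block boundary. The block size and seed below are arbitrary choices.

```python
import numpy as np

def constrained_walk(n_blocks, m=16, seed=42):
    # Symmetric +/-1 walk: each block of 2*m steps has exactly m up-moves,
    # shuffled uniformly at random. Drift-free and symmetric by design,
    # but with less entropy than a pure random walk.
    rng = np.random.default_rng(seed)
    steps = []
    for _ in range(n_blocks):
        block = np.array([1]*m + [-1]*m)
        rng.shuffle(block)   # random order, fixed composition
        steps.append(block)
    return np.cumsum(np.concatenate(steps))

walk = constrained_walk(100, m=16)
# The walk is back at 0 after every block of 2*m steps, and |walk|
# never exceeds m: a weak, slow-building departure from randomness.
```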

Symmetric unbiased constrained random walks

- Three fundamental properties of pure random walks
- Random walks with more entropy than pure random signal
- Applications
- Algorithm to generate quasi-random sequences
- Variance of the modified random walk
- Random walks with less entropy than pure random signal

Related stochastic processes

- From Brownian motions to clustered Lévy flights
- Integrated Brownian motions and special autoregressive processes

Python code

- Computing probabilities and variances
- Path simulations

The technical article, entitled *Weird Random Walks: Synthetizing, Testing and Leveraging Quasi-randomness*, is accessible in the "Free Books and Articles" section, here. The text highlighted in orange in this PDF document consists of keywords that will be incorporated into the index when I aggregate all my related articles into a single book about innovative machine learning techniques. The text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links, pointing to a section, bibliography entry, equation, and so on.




]]>The post New Perspective on the Riemann Hypothesis first appeared on Machine Learning Techniques.

In about 10 pages (plus Python code, exercises and figures), this article constitutes a crash course on the subject. It covers a large range of topics, both recent and unpublished, in a very compact style. Full of clickable references, the document covers the basics, offering a light reading experience. It also includes plenty of advanced, state-of-the-art material explained as simply as possible. Written by a machine learning professional working on experimental math, it is targeted at other machine learning professionals. Physicists, mathematicians, quants, statisticians and engineers will hopefully find this document easy to read, interesting, and opening up new research horizons. Exercise 8 is particularly intriguing, showing a potential new path to proving the Riemann Hypothesis.

This tutorial provides a solid introduction to the Generalized Riemann Hypothesis and related functions, including Dirichlet series, Euler products, non-integer primes (Beurling primes), Dirichlet characters and Rademacher random multiplicative functions. The topic is usually explained in obscure jargon or inane generalities. On the contrary, this article will intrigue you with the beauty and power of this theory. The summary style is very compact, covering much more than traditionally taught in a first graduate course in analytic number theory. The choice of topics is a little biased, with an emphasis on probabilistic models. My approach, discussing the "hole of the orbit" — called the eye of the Riemann zeta function in a previous article — is particularly intuitive.

The accompanying Python code covers a large class of interesting functions, allowing you to perform as many different experiments as possible. If you are interested in knowing a lot more than the basics and possibly investigating this conjecture using machine learning techniques, this article is for you. The Python code also shows you how to produce beautiful videos of the various functions involved, in particular their orbits. This visual exploration shows that the Riemann zeta function, and a specific Dirichlet L-function (based on the non-trivial character modulo 4), behave very uniquely and similarly, explaining the connection between the Riemann and the Generalized Riemann Hypothesis in pictures and videos rather than words.
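To give a flavor of the orbit computations before diving into the article's code, here is a minimal sketch of mine (the truncation level, t-range and function name are my own choices): the partial sums of a Dirichlet series Σ χ(n) n^(-s) evaluated along the critical line s = 1/2 + it trace a curve (the orbit) in the complex plane.

```python
import numpy as np

def dirichlet_orbit(t_values, sigma=0.5, n_terms=1000, chi=None):
    # Truncated Dirichlet series sum_{n<=N} chi(n)/n^s for s = sigma + i*t.
    # chi=None gives the truncated Riemann zeta series.
    n = np.arange(1, n_terms + 1, dtype=float)
    coeffs = np.ones(n_terms) if chi is None else chi(n)
    return np.array([np.sum(coeffs*n**(-complex(sigma, t)))
                     for t in t_values])

t = np.linspace(0, 50, 500)
orbit_zeta = dirichlet_orbit(t)   # orbit of the truncated zeta series

# Non-trivial Dirichlet character mod 4: chi(n) = 0, 1, 0, -1 for
# n = 0, 1, 2, 3 (mod 4); gives the truncated Dirichlet L-function
chi4 = lambda n: np.where(n % 2 == 0, 0.0, np.where(n % 4 == 1, 1.0, -1.0))
orbit_L4 = dirichlet_orbit(t, chi=chi4)
```

Plotting the real part of an orbit against its imaginary part produces the kind of pictures discussed in the article; see the article itself for the behavior of the hole.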

Introduction

- Key concepts and terminology
- Orbits and holes
- Industrial applications

Euler products

- Finite Euler products
- Generalization using Dirichlet characters
- Infinite Euler products
- Special products
- Probabilistic properties and conjectures

Finite Dirichlet series and generalizations

- Finite Dirichlet series
- Non-trivial cases with infinitely many primes and a hole
- Sums of two cubes, or cuban primes
- Primes associated to elliptic curves
- Analytic continuation, convergence, and functional equation
- Hybrid Dirichlet-Taylor series
- Riemann Hypothesis with cosines replaced by wavelets
- Riemann Hypothesis for Beurling primes
- Stochastic Euler products

Exercises

Python code

- Computing the orbit of various Dirichlet series
- Creating videos of the orbit

The technical article, entitled *New Perspective on the Riemann Hypothesis*, is accessible in the "Free Books and Articles" section, here. The text highlighted in orange in this PDF document consists of keywords that will be incorporated into the index when I aggregate all my related articles into a single book about innovative machine learning techniques. The text highlighted in blue corresponds to external clickable links, mostly references. Red is used for internal links, pointing to a section, bibliography entry, equation, and so on.




]]>The post Synthetic Data in Machine Learning: What, Why, How? first appeared on Machine Learning Techniques.

]]>- 0:00 – Introductions
- 3:24 – How did you become interested in synthetic data?
- 5:36 – How does the corporate world interact with synthetic data?
- 8:31 – Problems that synthetic data can help solve
- 18:55 – Synthetic datasets used by corporations
- 27:55 – What is driving the interest in synthetic data?
- 31:21 – How would you define what synthetic data actually is?
- 38:43 – Creating and sharing high quality synthetic data
- 41:58 – What criteria should be used to measure synthetic data?
- 46:02 – Challenges in scaling from standalone tables to databases
- 49:38 – Data coverage concept and its applications
- 51:30 – Using synthetic data to help solve biases
- 57:13 – Fire round
- 1:00:53 – Conclusions

Nicolai Baldin — Founder & CEO, Synthesized. Nicolai leads Synthesized’s rapid growth, as a top provider of DataOps tools for software testing and data science applications, across the UK, Europe and North America. Nicolai is responsible for the direction and product strategy of Synthesized. For over 8 years, Nicolai has designed and delivered complex ML solutions used by top financial and healthcare institutions. He holds a PhD in Machine Learning and Statistics from the University of Cambridge.

Simon Swan — Machine Learning Lead, Synthesized. Simon contributes to the core technology of Synthesized and is responsible for some of the development processes of the ML team. Prior to joining Synthesized in 2019, he worked in the legal and medical industries as an NLP & Machine Learning engineer. He has an academic background in Statistical Thermodynamics and Computational Linguistics from the University of Cambridge.

Vincent Granville — Founder, MLTechniques.com. Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS). Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.

Who are we? Synthesized is a development framework helping companies create optimized and safe-to-share datasets for use in machine learning, software testing and development, and analytics. Learn more about Fairlens, here. For more details, visit this page.

*You can find more articles on synthetic data, here. To not miss future articles, sign up to our newsletter, here.*


]]>