10 Tips to Boost Performance of Your AI Models

These model enhancement techniques apply to deep neural networks (DNNs) used in AI. The focus is on the core engine that powers all DNNs: gradient descent, layering, and the loss function.

  • Reparameterization — Typically, in DNNs, many different parameter sets lead to the same optimum: loss minimization. DNN models are non-identifiable. This redundancy is a strength, as it increases the odds of landing on a good solution. You can change the parameterization to reduce redundancy and increase identifiability, by reparameterizing and eliminating intermediate layers, though this usually does not improve the results. Or you can do the opposite: add redundant layers to increase leeway. Or you can transform parameters while keeping them in the same range, for instance θ’ = θ² instead of θ, both in [0, 1] (see the first sketch after this list). This flexibility can lead to better results but requires testing.
  • Ghost parameters — Adding artificial, unnecessary, or redundant parameters can make the descent smoother: it gives the gradient descent more chances (more potential paths) to end in a good configuration. You may also use some of these ghost parameters to watermark your DNN, protecting your model against hijacking and unauthorized use.
  • Layer flattening — Instead of working with hierarchical layers, you can optimize the entire structure at once, across all layers. That is, you minimize the loss function globally at each epoch rather than propagating changes back and forth through many layers. This reduces error propagation, eliminates the need for explicit activation functions, and may reduce the risk of getting stuck (vanishing gradients).
  • Sub-layers — In some sense, this is the opposite of layer flattening. At each epoch, you minimize the loss iteratively, for one subset of parameters (a sub-layer) at a time, keeping the other parameters unchanged, much like in the EM algorithm. Each parameter-subset optimization is a sub-epoch within an epoch, and a full epoch consists of going through the full set of parameters. This technique is useful in high-dimensional problems with a complex, high-redundancy parameter structure (see the sketch after this list).
  • Swarm optimization — Handy for lower-dimensional problems with singularities and a non-differentiable loss function. You start the gradient descent with multiple random initial configurations, called particles. For each particle, you explore random neighbors in the parameter space and move towards the neighbor that minimizes the loss. You need a good normalization of the learning rate to make it work; consider adaptive learning rates that depend on the epoch and/or the axis (a different learning rate for each axis, that is, for each parameter). A toy example follows this list.
  • Decaying entropy — Instead of descending all the time, allow occasional random ascents in the gradient descent, globally or for specific parameters chosen at random, especially in the earlier epochs. Think of it as cave exploration: to reach the very bottom of a deep cave, sometimes you must climb up through a siphon and come down on the other side before you can descend further; otherwise you get stuck in the siphon. The hyperparameter controlling the ups and downs is the temperature: at zero, the descent is a pure descent with no upward moves. I call it chaotic gradient descent, as it is different from stochastic gradient descent. A sketch follows this list.
  • Adaptive loss — Some knobs turn the loss function into one that changes over time. In one case, I force the loss function to converge towards the model evaluation metric through a number of transitions over time. These changes typically take place when the gradient descent stops improving, re-accelerating the descent and moving it away from a local minimum (see the sketch after this list).
  • Equalization — In my implementation, this is the feature that led to the biggest improvement: it reduces the number of epochs, accelerating convergence, while eliminating a number of gradient descent issues. It consists of replacing the fixed response y at each epoch by an adaptive response y’ = φ(y; α), where φ belongs to a parametric family of transforms indexed by α and stable under composition and inversion, for instance scale-location transforms. Even though a different, data-driven α is used at each epoch, you do not need to keep track of the successive transforms: at the end, a single inverse transform, estimated via ordinary least squares, maps the final response back to the original one. You apply the same inverse transform to the predicted values. See the sketch after this list.
  • Normalized parameters — In my implementation, all parameters are in [0, 1]; the same is true for the input data after normalization. To use parameters outside that range, use reparameterization, for instance θ’ = 1 / (1 – θ) while optimizing for θ in [0, 1]. This approach eliminates gradient explosion (see the sketch after this list).
  • Math-free gradient — There is no need for the chain rule or other calculus formulas to compute the partial derivatives in the gradient; indeed, you don’t even need to know the mathematical expression of the loss function, as everything works on pure data without closed-form math functions. Numerical precision is critical: some math functions are pre-computed in 0.0001 increments and stored as a table, which is possible because the argument (a parameter) is in [0, 1]. This can significantly increase speed. A sketch follows this list.
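
The toy sketches below illustrate several of these ideas on small synthetic problems. They are my own minimal illustrations under stated assumptions, not the implementation from paper #55; all variable names, data, and hyperparameters are made up. First, reparameterization: the descent runs on u in [0, 1] while the model consumes θ = u², one of the transforms mentioned above.

```python
# Sketch: optimize u in [0, 1] but feed theta = u**2 to the model.
# Toy one-parameter least-squares fit; everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=200)
true_theta = 0.36
y = true_theta * x + rng.normal(scale=0.05, size=200)

def toy_loss(theta):
    return np.mean((y - theta * x) ** 2)

u, lr, eps = 0.9, 0.5, 1e-5            # descend on u, not on theta
for epoch in range(500):
    # numerical derivative of the loss with respect to u (theta = u**2)
    grad_u = (toy_loss((u + eps) ** 2) - toy_loss((u - eps) ** 2)) / (2 * eps)
    u = min(max(u - lr * grad_u, 0.0), 1.0)   # keep u inside [0, 1]

print("estimated theta:", round(u ** 2, 3))   # should be close to 0.36
```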
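
Next, a minimal sketch of the sub-layer idea, assuming a plain least-squares problem: the 12 parameters are split into 4 blocks, and each epoch cycles through the blocks, updating one block (a sub-epoch) while the others stay frozen. Block size, learning rate, and all names are my own choices.

```python
# Sketch: blockwise (sub-layer) descent on a toy linear least-squares problem.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 12))
beta_true = rng.uniform(size=12)
y = X @ beta_true + rng.normal(scale=0.1, size=300)

def loss(beta):
    return np.mean((y - X @ beta) ** 2)

beta = np.zeros(12)
blocks = np.array_split(np.arange(12), 4)     # 4 sub-layers of 3 parameters
lr = 0.05
for epoch in range(200):
    for block in blocks:                      # one sub-epoch per block
        grad = -2 * X[:, block].T @ (y - X @ beta) / len(y)
        beta[block] -= lr * grad              # other parameters stay frozen

print("final loss:", round(loss(beta), 5))    # close to the noise variance (0.01)
```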
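
A toy version of the swarm idea on a non-differentiable two-parameter loss, with a step size that decays with the epoch. The number of particles and neighbors, the loss, and the decay schedule are arbitrary illustrations.

```python
# Sketch: several random starting points ("particles"), each probing random
# neighbors and moving to the best one when it improves the loss.
import numpy as np

rng = np.random.default_rng(2)

def loss(p):                          # non-smooth toy loss with a kink
    return abs(p[0] - 0.3) + (p[1] - 0.7) ** 2

particles = rng.uniform(size=(8, 2))  # 8 particles in [0, 1]^2
for epoch in range(300):
    step = 0.2 / (1 + 0.05 * epoch)   # step size decays with the epoch
    for i, p in enumerate(particles):
        neighbors = p + step * rng.uniform(-1, 1, size=(10, 2))
        neighbors = np.clip(neighbors, 0.0, 1.0)
        best = min(neighbors, key=loss)
        if loss(best) < loss(p):      # move only if the neighbor improves
            particles[i] = best

best_particle = min(particles, key=loss)
print("best particle:", best_particle, "loss:", round(loss(best_particle), 4))
```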
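
A sketch of decaying entropy on a one-dimensional loss with several basins. I borrow the acceptance rule from simulated annealing as one way to implement occasional ascents; the temperature schedule is arbitrary, and the chaotic gradient descent described above may differ in its details.

```python
# Sketch: downhill moves are always accepted; uphill moves are accepted
# with a probability driven by a temperature that decays toward zero.
import numpy as np

rng = np.random.default_rng(3)

def loss(theta):                       # toy loss with local and global minima
    return 0.2 * np.sin(12 * theta) + (theta - 0.8) ** 2

theta, step = 0.15, 0.05
for epoch in range(400):
    temperature = 0.3 * 0.99 ** epoch  # decays toward 0 (pure descent)
    candidate = np.clip(theta + step * rng.uniform(-1, 1), 0.0, 1.0)
    delta = loss(candidate) - loss(theta)
    if delta < 0 or rng.uniform() < np.exp(-delta / max(temperature, 1e-12)):
        theta = candidate              # accept descent, or an occasional ascent

print("theta:", round(theta, 3), "loss:", round(loss(theta), 4))
```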
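
A sketch of an adaptive loss on a one-parameter toy model: the working loss is a blend of MSE (training loss) and MAE (standing in for the evaluation metric), and the blend weight shifts toward the metric whenever the descent stalls. The blending schedule, the stall threshold, and the choice of MAE as the metric are my assumptions.

```python
# Sketch: a loss that drifts from the training loss toward the evaluation
# metric whenever the gradient descent stops improving.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(size=400)
y = 2.0 * x + rng.normal(scale=0.1, size=400)

def blended_loss(theta, w):
    err = y - theta * x
    # w = 0: pure MSE (training loss); w = 1: pure MAE (evaluation metric)
    return (1 - w) * np.mean(err ** 2) + w * np.mean(np.abs(err))

theta, w, lr, eps = 0.0, 0.0, 0.1, 1e-5
previous = blended_loss(theta, w)
for epoch in range(300):
    grad = (blended_loss(theta + eps, w) - blended_loss(theta - eps, w)) / (2 * eps)
    theta -= lr * grad
    current = blended_loss(theta, w)
    if previous - current < 1e-6 and w < 1.0:   # descent stalled: shift the loss
        w = min(w + 0.1, 1.0)
        current = blended_loss(theta, w)
    previous = current

print("theta:", round(theta, 3), "final blend weight:", round(w, 1))
```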
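
A sketch of the equalization bookkeeping with scale-location transforms: the working response is re-centered and re-scaled with a data-driven α at each epoch, and because these transforms are closed under composition and inversion, a single OLS fit at the end maps the working response back to the original one (the same map would be applied to predictions). The per-epoch rule for α is a placeholder, and the actual gradient descent step is omitted.

```python
# Sketch: adaptive response via scale-location transforms, recovered at the
# end with one ordinary least squares fit.
import numpy as np

rng = np.random.default_rng(5)
y_original = rng.normal(loc=3.0, scale=2.0, size=500)

y_work = y_original.copy()
for epoch in range(10):
    alpha = 1.0 / (1.0 + y_work.std())          # data-driven scale (placeholder rule)
    y_work = alpha * (y_work - y_work.mean())   # scale-location transform of the response
    # ... one epoch of gradient descent would run against y_work here ...

# The composition of scale-location transforms is itself scale-location,
# so a single OLS fit recovers the map from y_work back to y_original.
A = np.column_stack([y_work, np.ones_like(y_work)])
slope, intercept = np.linalg.lstsq(A, y_original, rcond=None)[0]
y_recovered = slope * y_work + intercept

print("max recovery error:", float(np.abs(y_recovered - y_original).max()))
```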
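
A sketch of normalized parameters on a toy exponential-decay fit: the descent runs on u in [0, 1), while the model uses the unbounded value θ’ = 1 / (1 – u). The data, clipping bound, and learning rate are illustrative.

```python
# Sketch: gradient descent on a bounded parameter u, with the model
# consuming the reparameterized, unbounded value 1 / (1 - u).
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(size=300)
y = np.exp(-2.5 * x) + rng.normal(scale=0.02, size=300)   # true decay rate 2.5

def loss(u):
    theta = 1.0 / (1.0 - u)            # unbounded parameter from bounded u
    return np.mean((y - np.exp(-theta * x)) ** 2)

u, lr, eps = 0.1, 0.5, 1e-6
for epoch in range(1000):
    grad = (loss(u + eps) - loss(u - eps)) / (2 * eps)
    u = min(max(u - lr * grad, 0.0), 0.999)   # stay inside [0, 1)

print("recovered rate:", round(1.0 / (1.0 - u), 3))       # close to 2.5
```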
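
Finally, a sketch of the math-free gradient: the partial derivative comes from a central difference of the loss evaluated on the data, and a square root is served from a table pre-computed in 0.0001 increments over [0, 1]. Only the table granularity and the data-only gradient come from the text above; the toy model, step sizes, and the choice of the square root function are mine.

```python
# Sketch: numerical gradient (no chain rule) plus a pre-computed lookup table
# for a function whose argument lies in [0, 1].
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(size=500)
y = np.sqrt(0.42 * x) + rng.normal(scale=0.01, size=500)

# Pre-computed table of sqrt over [0, 1] in 0.0001 increments.
GRID = np.linspace(0.0, 1.0, 10001)
SQRT_TABLE = np.sqrt(GRID)

def table_sqrt(v):
    idx = np.clip(np.rint(np.asarray(v) * 10000).astype(int), 0, 10000)
    return SQRT_TABLE[idx]

def loss(theta):                     # evaluated on data only, no closed-form derivative
    return np.mean((y - table_sqrt(theta * x)) ** 2)

theta, lr, h = 0.9, 0.5, 1e-3
for epoch in range(400):
    grad = (loss(theta + h) - loss(theta - h)) / (2 * h)   # numerical partial derivative
    theta = min(max(theta - lr * grad, 0.0), 1.0)          # parameter stays in [0, 1]

print("estimated theta:", round(theta, 3))                 # close to 0.42
```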

Parameters in my model have an intuitive interpretation, such as centers and skewness, corresponding to kernels in high-dimensional adaptive kernel density estimation, or to mixtures of Gaussians that can model any type of response. In some cases, the unknown (predicted) centers are expected and visible in the response, as in clustering problems. In other cases, depending on the settings, they are not, and act as hidden or latent parameters.

For the Python implementation, use cases, illustrations, and full documentation, download paper #55 here. Sign up here for our newsletter so you don’t miss future articles.
