The procedure used to carry out the learning process is called the training (or learning) strategy. The training strategy is applied to the neural network to obtain the lowest possible loss. This is done by searching for a set of parameters that fit the neural network to the data set.

A general training strategy consists of two different concepts: a loss index and an optimization algorithm.

The loss index plays a vital role in the use of neural networks. It defines the task the neural network is required to perform and provides a measure of the quality of the representation that the network is required to learn. The choice of a suitable loss index depends on the application.

When setting a loss index, two different terms must be chosen: an error term and a regularization term. $$ loss\_index = error\_term + regularization\_term $$

The error is the most important term in the loss expression. It measures how well the neural network fits the data set.

These errors can be measured over different subsets of the data. In this regard, the **training error** refers to the error measured on the training samples, the **selection error** is measured on the selection samples, and the **testing error** is measured on the testing samples.

Next, the most important errors used in the field of neural networks are described:

- Mean squared error.
- Normalized squared error.
- Weighted squared error.
- Cross entropy error.
- Minkowski error.

The mean squared error calculates the average squared error between the outputs from the neural network and the targets in the data set. $$mean\_squared\_error = \frac{\sum \left(outputs - targets\right)^2}{samples\_number}$$
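As a concrete illustration, the formula above can be sketched in a few lines of NumPy (a minimal sketch, not Neural Designer's implementation):

```python
import numpy as np

def mean_squared_error(outputs, targets):
    """Average squared difference between network outputs and targets."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean((outputs - targets) ** 2)

# Example with three samples:
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))  # (0 + 0.25 + 1) / 3 ≈ 0.4167
```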

The normalized squared error divides the squared error between the outputs from the neural network and the targets in the data set by a normalization coefficient. If the normalized squared error has a value of unity, then the neural network is predicting the data 'on the mean', while a value of zero means a perfect prediction of the data. $$normalized\_squared\_error = \frac{\sum \left(outputs - targets\right)^2}{normalization\_coefficient}$$

This can be considered the default error term when solving approximation problems.
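A sketch of this error follows; the normalization coefficient is assumed here to be the sum of squared deviations of the targets from their mean, a common choice that makes the always-predict-the-mean model score exactly 1:

```python
import numpy as np

def normalized_squared_error(outputs, targets):
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Assumed normalization: squared deviation of the targets from their mean.
    normalization_coefficient = np.sum((targets - targets.mean()) ** 2)
    return np.sum((outputs - targets) ** 2) / normalization_coefficient

targets = np.array([1.0, 2.0, 3.0])
print(normalized_squared_error(targets, targets))                     # 0.0: perfect prediction
print(normalized_squared_error(np.full(3, targets.mean()), targets))  # 1.0: predicting 'on the mean'
```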

The weighted squared error is used in binary classification applications with unbalanced targets, i.e., when the numbers of positive and negative samples are very different. It gives a different weight to errors belonging to positive and negative samples. $$weighted\_squared\_error = positives\_weight \sum (outputs-positive\_targets)^2 \\ + negatives\_weight \sum (outputs-negative\_targets)^2$$

The cross-entropy error is used in binary classification problems, in which the target variable can only take the values 0 or 1. Its main advantage is that it imposes a high penalty when a target labeled 0 is predicted with a probability close to 1, and vice versa. It is defined as: $$cross\_entropy\_error = -\sum \left( target \cdot \log(output) + (1-target) \cdot \log(1-output) \right)$$

A perfect model should have 0 cross-entropy error.
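The standard binary cross-entropy can be sketched as follows; the outputs are clipped away from exactly 0 and 1 so the logarithms stay finite (a common numerical safeguard, added here as an assumption):

```python
import numpy as np

def cross_entropy_error(outputs, targets, eps=1e-12):
    """Binary cross-entropy; outputs are predicted probabilities in (0, 1)."""
    outputs = np.clip(np.asarray(outputs, dtype=float), eps, 1 - eps)
    targets = np.asarray(targets, dtype=float)
    return -np.sum(targets * np.log(outputs) + (1 - targets) * np.log(1 - outputs))

# A confident wrong prediction is penalized far more than a confident right one.
print(cross_entropy_error([0.99], [1]))  # ≈ 0.01
print(cross_entropy_error([0.99], [0]))  # ≈ 4.6
```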

One of the potential difficulties with the above errors is that they can receive an excessively large contribution from points with large errors (outliers). If the error distribution has long tails, the solution can be dominated by a small number of points with particularly large errors.

On such occasions, to achieve good generalization, it is preferable to choose a more suitable error method. The Minkowski error is the sum, over the training samples, of the absolute difference between outputs and targets raised to an exponent that can vary between 1 and 2, divided by the number of samples. That exponent is called the Minkowski parameter, and its default value is 1.5. $$minkowski\_error = \frac{\sum\left|outputs - targets\right|^{minkowski\_parameter}}{samples\_number}$$
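A quick sketch shows how a smaller exponent damps the influence of an outlier compared with the mean squared error:

```python
import numpy as np

def minkowski_error(outputs, targets, minkowski_parameter=1.5):
    """Mean of |output - target|^p; p between 1 and 2 damps the effect of outliers."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean(np.abs(outputs - targets) ** minkowski_parameter)

errors = np.array([0.1, 0.1, 5.0])           # one outlier among small errors
print(np.mean(errors ** 2))                  # mean squared error ≈ 8.34, dominated by the outlier
print(minkowski_error(errors, np.zeros(3)))  # Minkowski error (p=1.5) ≈ 3.75
```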

For all the error methods that we have seen above, the gradient can be found analytically using the so-called back-propagation algorithm.

A solution is said to be regular when small changes in the input variables lead to small changes in the outputs. An approach for non-regular problems is to control the effective complexity of the neural network. This can be achieved by including a regularization term in the loss index.

Regularization terms usually measure the values of the parameters in the neural network. Adding that term to the error will cause the neural network to have smaller weights and biases, which will force its response to be smoother.

The most used types of regularization are the following:

The L1 regularization method consists of the sum of the absolute values of all the parameters in the neural network. $$ l1\_regularization = regularization\_weight \cdot \sum |parameters|$$

The L2 regularization method consists of the sum of the squares of all the parameters in the neural network. $$ l2\_regularization = regularization\_weight \cdot \sum parameters^{2}$$
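Both terms are direct to compute; in this sketch the regularization weight of 0.01 is purely illustrative:

```python
import numpy as np

def l1_regularization(parameters, regularization_weight=0.01):
    """Weighted sum of absolute parameter values."""
    return regularization_weight * np.sum(np.abs(parameters))

def l2_regularization(parameters, regularization_weight=0.01):
    """Weighted sum of squared parameter values."""
    return regularization_weight * np.sum(np.asarray(parameters) ** 2)

parameters = np.array([0.5, -2.0, 1.5])
print(l1_regularization(parameters))  # 0.01 * 4.0 = 0.04
print(l2_regularization(parameters))  # 0.01 * 6.5 = 0.065
```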

As we can see, the regularization term is weighted by a parameter. If the solution is too smooth, the weight must be decreased. Conversely, if the solution oscillates too much, the weight is increased.

The gradient for the regularization terms described above can be computed straightforwardly.

The loss index depends on the function represented by the neural network, and it is measured on the data set. It can be visualized as a hyper-surface with the parameters as coordinates, see the next figure.

The learning problem for neural networks can then be stated as finding a neural network function for which the loss index takes on a minimum value. That is, to find the set of parameters that minimize the above function.

As stated above, the learning problem for neural networks consists of searching for a set of parameters at which the loss index takes a minimum value. The necessary condition states that, if the neural network is at a minimum of the loss index, the gradient is zero.

The loss index is, in general, a non-linear function of the parameters. Consequently, it is not possible to find a closed-form solution for the minima. Instead, we consider a search through the parameter space consisting of a succession of steps, or epochs. At each epoch, the loss is reduced by adjusting the neural network parameters. The change of parameters between two epochs is called the parameter increment.

In this way, to train a neural network, we start with some parameter vector (often chosen at random) and generate a sequence of parameter vectors so that the loss index is reduced at each iteration of the algorithm. The figure below is a state diagram of the training procedure.

The optimization algorithm stops when a specified condition is satisfied. Some stopping criteria commonly used are:

- The loss improvement in one epoch is less than a set value.
- Loss has been minimized to a goal value.
- A maximum number of epochs is reached.
- The maximum amount of computing time has been exceeded.
- The error on the selection subset increases for several consecutive epochs (early stopping).
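The criteria above can be sketched as a single check evaluated after each epoch; the threshold values here are hypothetical defaults, and the computing-time criterion is omitted for brevity:

```python
def should_stop(epoch, loss_history, selection_error_history,
                min_improvement=1e-6, loss_goal=1e-3,
                max_epochs=1000, patience=5):
    """Return True when any of the stopping criteria is satisfied."""
    if epoch >= max_epochs:
        return True
    if loss_history and loss_history[-1] <= loss_goal:
        return True
    if len(loss_history) >= 2 and loss_history[-2] - loss_history[-1] < min_improvement:
        return True
    # Early stopping: selection error has risen for `patience` consecutive epochs.
    recent = selection_error_history[-(patience + 1):]
    if len(recent) == patience + 1 and all(a < b for a, b in zip(recent, recent[1:])):
        return True
    return False
```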

The optimization algorithm determines how the adjustment of the parameters in the neural network takes place. There are many different optimization algorithms, with a variety of different computation and storage requirements. Moreover, no single algorithm is best suited to all problems.

Next, the most used optimization algorithms are described:

- Gradient descent.
- Conjugate gradient.
- Quasi-Newton method.
- Levenberg-Marquardt algorithm.
- Stochastic gradient descent.
- Adaptive moment estimation (Adam).

The simplest optimization algorithm is gradient descent. Here, the parameters are updated at each epoch in the direction of the negative gradient of the loss index. $$ new\_parameters = parameters \\ - loss\_gradient \cdot learning\_rate$$

The learning rate is usually adjusted at each epoch using line minimization.
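The update rule can be sketched as a short loop; for simplicity this sketch uses a fixed learning rate rather than line minimization, and the quadratic toy loss is purely illustrative:

```python
import numpy as np

def gradient_descent(loss_gradient, parameters, learning_rate=0.1, epochs=100):
    """Plain gradient descent: step against the gradient at every epoch."""
    parameters = np.asarray(parameters, dtype=float)
    for _ in range(epochs):
        parameters = parameters - learning_rate * loss_gradient(parameters)
    return parameters

# Toy loss: sum(p^2), whose gradient is 2p; the minimum is at the origin.
final = gradient_descent(lambda p: 2 * p, [3.0, -4.0])
print(final)  # close to [0, 0]
```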

In the conjugate gradient algorithm, the search is performed along conjugate directions, which generally produces faster convergence than gradient descent directions. $$ new\_parameters = parameters \\ - conjugate\_gradient \cdot learning\_rate$$

As before, the learning rate is adjusted at each epoch using line minimization.

Newton's method uses the Hessian of the loss function, a matrix of second derivatives, to calculate the learning direction. Since it uses higher-order information, the learning direction points to the minimum of the loss function with greater accuracy. The drawback is that calculating the Hessian matrix is computationally very expensive.

The quasi-Newton method is based on Newton's method but does not require the calculation of second derivatives. Instead, the quasi-Newton method computes an approximation of the inverse Hessian at each iteration of the algorithm, using only gradient information. $$ new\_parameters = parameters \\ - inverse\_hessian\_approximation \cdot gradient \cdot learning\_rate$$

The learning rate is adjusted here at each epoch using line minimization.

Another training method is the Levenberg-Marquardt algorithm. It is designed to approach second-order training speed without having to compute the Hessian matrix.

The Levenberg-Marquardt algorithm can only be applied when the loss index has the form of a sum of squares (such as the sum squared error, the mean squared error, or the normalized squared error). It requires computing the gradient and the Jacobian matrix of the loss index. $$ new\_parameters = parameters \\ - \left(Jacobian^{T} \cdot Jacobian + damping\_parameter \cdot identity\right)^{-1} \cdot gradient$$
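A single update step can be sketched with NumPy. In the standard formulation, the increment solves the damped normal equations $(J^{T}J + \lambda I)\,\Delta = -J^{T}r$, where $r$ is the residual vector; the linear toy problem below is illustrative:

```python
import numpy as np

def levenberg_marquardt_step(jacobian, residuals, parameters, damping_parameter=1e-3):
    """One Levenberg-Marquardt update for a sum-of-squares loss."""
    J = np.asarray(jacobian, dtype=float)
    r = np.asarray(residuals, dtype=float)
    gradient = J.T @ r                          # gradient of the loss (up to a factor of 2)
    hessian_approximation = J.T @ J + damping_parameter * np.eye(J.shape[1])
    return parameters - np.linalg.solve(hessian_approximation, gradient)

# Linear least squares y = a*x: one step lands almost exactly on the solution a = 2.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x
a = np.array([0.0])
J = x.reshape(-1, 1)        # derivative of the residual a*x - y with respect to a
residuals = a[0] * x - y
print(levenberg_marquardt_step(J, residuals, a))  # ≈ [2.0]
```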

Stochastic gradient descent has a different nature from the above algorithms. At every epoch, it updates the parameters many times, using batches of data. $$ new\_parameters = parameters \\ - batch\_gradient \cdot learning\_rate + momentum$$
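A mini-batch loop with a classical momentum term can be sketched as follows; the momentum coefficient, learning rate, and toy problem are illustrative choices, not recommended settings:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def sgd_momentum(batch_gradient, data, parameters, learning_rate=0.1,
                 momentum=0.5, batch_size=2, epochs=50):
    """Mini-batch stochastic gradient descent with momentum (one common variant)."""
    parameters = np.asarray(parameters, dtype=float)
    velocity = np.zeros_like(parameters)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            velocity = momentum * velocity - learning_rate * batch_gradient(parameters, batch)
            parameters = parameters + velocity
    return parameters

# Toy problem: minimize mean((p - x)^2); the minimizer is the data mean, 2.5.
data = np.array([1.0, 2.0, 3.0, 4.0])
final = sgd_momentum(lambda p, batch: 2 * (p - batch.mean()), data, [0.0])
print(final)  # hovers near [2.5]
```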

The Adam algorithm is similar to gradient descent, but implements a more sophisticated method for calculating the training direction which usually produces faster convergence. $$ new\_parameters = parameters \\ - \frac{gradient\_exponential\_decay}{\sqrt{square\_gradient\_exponential\_decay}}\cdot learning\_rate$$
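The exponential-decay averages in the formula above can be sketched directly; this follows the widely used Adam update with bias correction, on an illustrative quadratic loss:

```python
import numpy as np

def adam(loss_gradient, parameters, learning_rate=0.1,
         beta1=0.9, beta2=0.999, eps=1e-8, epochs=200):
    """Adam: exponential moving averages of the gradient and its square."""
    parameters = np.asarray(parameters, dtype=float)
    m = np.zeros_like(parameters)  # gradient exponential decay average
    v = np.zeros_like(parameters)  # squared-gradient exponential decay average
    for t in range(1, epochs + 1):
        g = loss_gradient(parameters)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        parameters = parameters - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return parameters

# Toy quadratic loss sum(p^2) with gradient 2p; approaches the minimum at the origin.
print(adam(lambda p: 2 * p, [3.0, -4.0]))
```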

For small data sets (10 variables, 10000 samples), the Levenberg-Marquardt algorithm is recommended, due to its high speed and precision.

For intermediate problems, the quasi-Newton method or the conjugate gradient will perform well.

For big data sets (1000 variables, 1000000 samples), stochastic gradient descent or adaptive moment estimation (Adam) are the best choices.

The quasi-Newton method is the default optimization algorithm in Neural Designer.

The article *5 algorithms to train a neural network* in the Neural Designer blog contains more information about this subject.