4. Training strategy

The procedure used to carry out the learning process is called the training (or learning) strategy. The training strategy is applied to the neural network to obtain the minimum possible loss. This is done by searching for a set of parameters that fit the neural network to the data set.

A general training strategy consists of two different concepts: the loss index and the optimization algorithm.

The loss index plays an important role in the use of a neural network. It defines the task the neural network is required to do and provides a measure of the quality of the representation that the neural network is required to learn. The choice of a suitable loss index depends on the particular application.

When setting a loss index, two different terms must be chosen: an error term and a regularization term. $$ loss\_index = error\_term + regularization\_term $$

The error is the most important term in the loss expression. It measures how well the neural network fits the training instances in the data set. The most important errors used in the field of neural networks are described next.

The mean squared error calculates the average squared error between the outputs from the neural network and the targets in the data set. $$mean\_squared\_error = \frac{\sum \left(outputs - targets\right)^2}{instances\_number}$$
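
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a single output per instance; the function name and the sample values are chosen here for the example and are not part of any library.

```python
import numpy as np

def mean_squared_error(outputs, targets):
    """Average of the squared differences between outputs and targets."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    instances_number = len(targets)
    return np.sum((outputs - targets) ** 2) / instances_number

# Illustrative values: only the last instance is predicted wrongly.
outputs = np.array([1.0, 2.0, 3.0])
targets = np.array([1.0, 2.0, 5.0])
mse = mean_squared_error(outputs, targets)  # (0 + 0 + 4) / 3
```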

The normalized squared error divides the squared error between the outputs from the neural network and the targets in the data set by a normalization coefficient. If the normalized squared error has a value of unity then the neural network is predicting the data 'in the mean', while a value of zero means perfect prediction of the data. $$normalized\_squared\_error = \frac{\sum \left(outputs - targets\right)^2}{normalization\_coefficient}$$

This can be considered the default error term when solving approximation problems.
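
The following sketch illustrates the two reference values mentioned above. It assumes that the normalization coefficient is the sum of squared deviations of the targets from their mean, which is the choice that makes a prediction "in the mean" score exactly one; the text does not define the coefficient, so treat this as an assumption.

```python
import numpy as np

def normalized_squared_error(outputs, targets):
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Assumed normalization coefficient: squared deviation of the
    # targets from their own mean.
    normalization_coefficient = np.sum((targets - targets.mean()) ** 2)
    return np.sum((outputs - targets) ** 2) / normalization_coefficient

targets = np.array([1.0, 2.0, 3.0, 6.0])

# Predicting the mean of the targets for every instance gives 1.
mean_prediction = np.full_like(targets, targets.mean())
nse_mean = normalized_squared_error(mean_prediction, targets)

# Predicting the targets exactly gives 0.
nse_perfect = normalized_squared_error(targets, targets)
```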

The weighted squared error is used in binary classification applications with unbalanced targets, i.e., when the numbers of positives and negatives are very different. It gives a different weight to the errors of positive and negative instances. $$\begin{aligned} weighted\_squared\_error &= positives\_weight \cdot \sum (outputs-positive\_targets)^2 \\ &+ negatives\_weight \cdot \sum (outputs-negative\_targets)^2 \end{aligned}$$
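
A minimal sketch of the weighted squared error follows. The rule used to pick the weights (the ratio of negatives to positives for the positives weight, and one for the negatives weight) is a hypothetical choice for this example; the text does not prescribe one.

```python
import numpy as np

def weighted_squared_error(outputs, targets, positives_weight, negatives_weight):
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    positive = targets == 1.0              # mask of positive instances
    squared_errors = (outputs - targets) ** 2
    return (positives_weight * squared_errors[positive].sum()
            + negatives_weight * squared_errors[~positive].sum())

# One positive and four negatives: an unbalanced data set.
targets = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
outputs = np.array([0.5, 0.5, 0.0, 0.0, 0.0])

# Hypothetical weighting: balance the two classes by their counts.
positives_weight = np.sum(targets == 0.0) / np.sum(targets == 1.0)  # 4.0
negatives_weight = 1.0

wse = weighted_squared_error(outputs, targets, positives_weight, negatives_weight)
# 4.0 * 0.25 (positive instance) + 1.0 * 0.25 (one negative instance)
```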

One of the potential difficulties of the above errors is that they can receive too large a contribution from points with large errors (outliers). If the error distribution has long tails, the solution can be dominated by a very small number of points with particularly large errors. In such cases, to achieve good generalization, it is preferable to choose a more suitable error method. The Minkowski error is the sum, over the training instances, of the absolute difference between the outputs and the targets raised to an exponent that can vary between 1 and 2, divided by the number of instances. That exponent is called the Minkowski parameter, and its default value is 1.5. $$minkowski\_error = \frac{\sum\left|outputs - targets\right|^{minkowski\_parameter}}{instances\_number}$$
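
The effect of the exponent on outliers can be checked with a small sketch. It takes the absolute value of the difference so that a non-integer exponent is well defined; the sample data, with a single large outlier, are made up for the illustration.

```python
import numpy as np

def minkowski_error(outputs, targets, minkowski_parameter=1.5):
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    differences = np.abs(outputs - targets)
    return np.sum(differences ** minkowski_parameter) / len(targets)

targets = np.zeros(4)
outputs = np.array([0.1, -0.1, 0.1, 9.0])  # the last instance is an outlier

error_default = minkowski_error(outputs, targets)        # exponent 1.5
error_squared = minkowski_error(outputs, targets, 2.0)   # same as the mean squared error
```

With an exponent of 2 the outlier contributes 81 to the sum, while with 1.5 it contributes 27, so the smaller exponent reduces its weight relative to the other instances.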

For all the error methods that we have seen above, the gradient can be found analytically using the so-called back-propagation algorithm.

In addition, all these errors can be measured over different subsets of the data. In this regard, the training error refers to the error measured on the training instances of the data set, the selection error is measured on the selection instances and the testing error is measured on the testing instances.

A solution is said to be regular when small changes in the input variables lead to small changes in the outputs. An approach for non-regular problems is to control the effective complexity of the neural network. This can be achieved by adding a regularization term to the loss index.

Regularization terms usually measure the values of the parameters in the neural network. Adding that term to the error will cause the neural network to have smaller weights and biases, and this will force its response to be smoother.

The L1 regularization method consists of the sum of the absolute values of all the parameters in the neural network. $$ l1\_regularization = regularization\_weight \cdot \sum |parameters|$$

The L2 regularization method consists of the sum of the squares of all the parameters in the neural network. $$ l2\_regularization = regularization\_weight \cdot \sum parameters^{2}$$

As we can see, the regularization term is weighted by a parameter. If the solution is too smooth, the weight should be decreased. Conversely, if the solution oscillates too much, the weight should be increased.
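
Both regularization terms, and the way they enter the loss index, can be sketched as follows. The function names, the parameter values and the error value of 0.2 are illustrative assumptions.

```python
import numpy as np

def l1_regularization(parameters, regularization_weight):
    return regularization_weight * np.sum(np.abs(parameters))

def l2_regularization(parameters, regularization_weight):
    return regularization_weight * np.sum(np.asarray(parameters) ** 2)

def loss_index(error_term, parameters, regularization_weight):
    """Loss index = error term + regularization term (L2 chosen here)."""
    return error_term + l2_regularization(parameters, regularization_weight)

parameters = np.array([0.5, -2.0, 1.0])   # weights and biases, flattened
l1 = l1_regularization(parameters, 0.01)  # 0.01 * (0.5 + 2.0 + 1.0)
l2 = l2_regularization(parameters, 0.01)  # 0.01 * (0.25 + 4.0 + 1.0)
loss = loss_index(0.2, parameters, 0.01)  # 0.2 is a made-up error value
```

A larger regularization weight penalizes large parameters more strongly, which is what makes the response smoother.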

The gradient for the regularization terms described above can be computed in a straightforward manner.

The loss index depends on the function represented by the neural network, and it is measured on the data set. It can be visualized as a hyper-surface with the parameters as coordinates, see the next figure.

The learning problem for neural networks can then be stated as finding a neural network function for which the loss index takes on a minimum value. That is, to find the set of parameters that minimize the above function.

As stated above, the learning problem for neural networks consists of searching for a set of parameters at which the loss index takes a minimum value. The necessary condition states that if the neural network is at a minimum of the loss index, then the gradient is zero.

The loss index is, in general, a non-linear function of the parameters. As a consequence, it is not possible to find closed-form solutions for the minima. Instead, we consider a search through the parameter space consisting of a succession of steps, or epochs. At each epoch, the parameters of the neural network are adjusted so that the loss decreases. The change of parameters between two epochs is called the parameters increment.

In this way, to train a neural network we start with some parameter vector (often chosen at random) and generate a sequence of parameter vectors, so that the loss index is reduced at each iteration of the algorithm. The figure below is a state diagram of the training procedure.

The optimization algorithm stops when a specified condition is satisfied. Some stopping criteria commonly used are:

- The parameters increment norm is less than a minimum value.
- The loss improvement in one epoch is less than a set value.
- The loss has been minimized to a goal value.
- The norm of the loss index gradient falls below a goal.
- A maximum number of epochs is reached.
- A maximum amount of computing time has been exceeded.
- The error on the selection subset increases during a number of epochs.
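
The criteria above can be sketched as checks inside a training loop. The example below uses a plain gradient step with a fixed learning rate on a toy quadratic loss; the argument names and default thresholds are hypothetical, and the selection error criterion is left out because it needs a separate selection subset.

```python
import time
import numpy as np

def train(loss, gradient, parameters, learning_rate=0.1,
          minimum_increment_norm=1e-9, minimum_loss_decrease=1e-12,
          loss_goal=1e-10, gradient_norm_goal=1e-6,
          maximum_epochs=1000, maximum_time=10.0):
    """Gradient steps that return the parameters and the criterion that fired."""
    start = time.time()
    old_loss = loss(parameters)
    for epoch in range(1, maximum_epochs + 1):
        grad = gradient(parameters)
        if np.linalg.norm(grad) < gradient_norm_goal:
            return parameters, "gradient norm goal"
        increment = -learning_rate * grad
        parameters = parameters + increment
        new_loss = loss(parameters)
        if new_loss < loss_goal:
            return parameters, "loss goal"
        if np.linalg.norm(increment) < minimum_increment_norm:
            return parameters, "minimum parameters increment norm"
        if old_loss - new_loss < minimum_loss_decrease:
            return parameters, "minimum loss decrease"
        if time.time() - start > maximum_time:
            return parameters, "maximum time"
        old_loss = new_loss
    return parameters, "maximum epochs"

# Toy loss with its minimum at (3, 3).
loss = lambda p: np.sum((p - 3.0) ** 2)
gradient = lambda p: 2.0 * (p - 3.0)

final_parameters, reason = train(loss, gradient, np.zeros(2))
```

On this toy problem the loss goal is the first criterion to be met; with other thresholds or another loss, a different one may stop the training first.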

The optimization algorithm determines the way in which the adjustment of the parameters in the neural network takes place. There are many different optimization algorithms, which have a variety of different computation and storage requirements. Moreover, no single algorithm is best suited to all problems. Next, the most used optimization algorithms are described.

The simplest optimization algorithm is gradient descent. With this method, the parameters are updated at each epoch in the direction of the negative gradient of the loss index. $$ new\_parameters = parameters - loss\_gradient \cdot learning\_rate$$

The learning rate is usually adjusted at each epoch using line minimization.
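
As an illustration, the sketch below runs gradient descent with the learning rate chosen at each epoch by a one-dimensional line minimization. Golden-section search on a fixed interval is used here as one possible line minimization method; the text does not say which method is meant, so this choice, the interval and the toy loss are all assumptions.

```python
import numpy as np

def golden_section(phi, lower=0.0, upper=1.0, tolerance=1e-8):
    """Minimize a one-dimensional unimodal function phi on [lower, upper]."""
    ratio = (np.sqrt(5.0) - 1.0) / 2.0
    while upper - lower > tolerance:
        a = upper - ratio * (upper - lower)
        b = lower + ratio * (upper - lower)
        if phi(a) < phi(b):
            upper = b   # the minimum lies in [lower, b]
        else:
            lower = a   # the minimum lies in [a, upper]
    return 0.5 * (lower + upper)

def gradient_descent(loss, gradient, parameters, epochs=50):
    for _ in range(epochs):
        grad = gradient(parameters)
        # Learning rate that minimizes the loss along the negative gradient.
        learning_rate = golden_section(lambda rate: loss(parameters - rate * grad))
        parameters = parameters - learning_rate * grad
    return parameters

# Toy quadratic loss with differently scaled directions; minimum at the origin.
scales = np.array([1.0, 10.0])
loss = lambda p: 0.5 * np.sum(scales * p ** 2)
gradient = lambda p: scales * p

minimum = gradient_descent(loss, gradient, np.array([10.0, 1.0]))
```

Even with an optimal learning rate at every epoch, the iterates zigzag on this badly scaled loss, which is the behaviour that conjugate directions are meant to avoid.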

In the conjugate gradient algorithm, the search is performed along conjugate directions, which generally produces faster convergence than gradient descent directions. $$ new\_parameters = parameters - conjugate\_gradient \cdot learning\_rate$$

As before, the learning rate is adjusted at each epoch using line minimization.
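
A minimal sketch of the idea on a quadratic loss follows. It uses the Fletcher-Reeves formula to build the conjugate directions, which is one of several common choices and an assumption here, and it takes the line minimization in closed form because the loss is quadratic. The direction is stored with its sign already reversed, so the update adds it.

```python
import numpy as np

def conjugate_gradient(A, b, parameters, epochs):
    """Minimize 0.5 * p^T A p - b^T p for a symmetric positive definite A."""
    gradient = A @ parameters - b
    direction = -gradient
    for _ in range(epochs):
        if np.linalg.norm(gradient) < 1e-12:
            break
        # Exact line minimization along the current direction.
        learning_rate = -(gradient @ direction) / (direction @ A @ direction)
        parameters = parameters + learning_rate * direction
        new_gradient = A @ parameters - b
        # Fletcher-Reeves coefficient for the next conjugate direction.
        beta = (new_gradient @ new_gradient) / (gradient @ gradient)
        direction = -new_gradient + beta * direction
        gradient = new_gradient
    return parameters

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
solution = conjugate_gradient(A, b, np.zeros(2), epochs=2)
```

For a quadratic loss with two parameters, two conjugate directions are enough to reach the minimum, up to rounding.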

Newton's method uses the Hessian of the loss function, which is a matrix of second derivatives, to calculate the learning direction. Since it uses second-order information, the learning direction points to the minimum of the loss function with higher accuracy. The drawback is that calculating the Hessian matrix is very computationally expensive.
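
The Newton update can be sketched as solving a linear system with the Hessian. The quadratic loss below is an illustrative assumption; on such a loss a single Newton step lands on the minimum.

```python
import numpy as np

def newton_step(parameters, gradient, hessian):
    """One Newton update: solve hessian @ increment = -gradient."""
    increment = np.linalg.solve(hessian, -gradient)
    return parameters + increment

# Quadratic loss 0.5 * p^T H p - b^T p, whose Hessian is the constant matrix H.
hessian = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

parameters = np.zeros(2)
gradient = hessian @ parameters - b
minimum = newton_step(parameters, gradient, hessian)
```

Solving the linear system avoids forming the inverse Hessian explicitly, but the cost of computing and factorizing the Hessian still grows quickly with the number of parameters.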

The quasi-Newton method is based on Newton's method, but does not require calculation of second derivatives. Instead, the quasi-Newton method computes an approximation of the inverse Hessian at each iteration of the algorithm, by only using gradient information. $$ new\_parameters = parameters - inverse\_hessian\_approximation \cdot gradient \cdot learning\_rate$$

The learning rate is adjusted here at each epoch using line minimization.
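
The sketch below uses the BFGS formula to update the inverse Hessian approximation. BFGS is one well-known quasi-Newton update and is an assumption here, since the text does not name a specific formula. The loss is a toy quadratic, so the line minimization can be written in closed form.

```python
import numpy as np

def bfgs_update(inverse_hessian, parameters_increment, gradient_increment):
    """BFGS update of the inverse Hessian approximation."""
    s, y = parameters_increment, gradient_increment
    rho = 1.0 / (y @ s)
    identity = np.eye(len(s))
    left = identity - rho * np.outer(s, y)
    right = identity - rho * np.outer(y, s)
    return left @ inverse_hessian @ right + rho * np.outer(s, s)

def quasi_newton(A, b, parameters, epochs):
    """Minimize 0.5 * p^T A p - b^T p using only gradient information."""
    inverse_hessian = np.eye(len(parameters))   # initial approximation
    gradient = A @ parameters - b
    for _ in range(epochs):
        if np.linalg.norm(gradient) < 1e-12:
            break
        direction = -inverse_hessian @ gradient
        # Exact line minimization for a quadratic loss.
        learning_rate = -(gradient @ direction) / (direction @ A @ direction)
        increment = learning_rate * direction
        parameters = parameters + increment
        new_gradient = A @ parameters - b
        inverse_hessian = bfgs_update(inverse_hessian, increment,
                                      new_gradient - gradient)
        gradient = new_gradient
    return parameters, inverse_hessian

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
solution, inverse_hessian = quasi_newton(A, b, np.zeros(2), epochs=2)
```

In this sketch the matrix `A` is used only to evaluate gradients and the step length; the search direction relies solely on the approximation built from gradient differences.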

Another training method is the Levenberg-Marquardt algorithm. It is designed to approach second-order training speed without having to compute the Hessian matrix.

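The Levenberg-Marquardt algorithm can only be applied when the loss index has the form of a sum of squares (such as the sum squared error, the mean squared error or the normalized squared error). It requires computing the gradient of the loss index and the Jacobian matrix of the error terms. The Hessian is approximated from the Jacobian (it is proportional to the transposed Jacobian times the Jacobian), and a damping parameter multiplying the identity matrix $I$ is added to it. $$ new\_parameters = parameters - \left(hessian\_approximation + damping\_parameter \cdot I\right)^{-1} \cdot gradient$$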
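
A minimal sketch of a Levenberg-Marquardt loop for a sum squared error follows. It uses the usual damped step, with the Hessian approximated from the Jacobian of the error terms, and a simple rule that divides or multiplies the damping parameter by 10; that rule, the exponential model and the data are assumptions made for the example.

```python
import numpy as np

def levenberg_marquardt(errors, jacobian, parameters, damping=1e-3, epochs=100):
    """Minimize sum(errors(p) ** 2); jacobian(p) holds d errors / d parameters."""
    loss = np.sum(errors(parameters) ** 2)
    for _ in range(epochs):
        e = errors(parameters)
        J = jacobian(parameters)
        gradient = 2.0 * J.T @ e
        hessian_approximation = 2.0 * J.T @ J
        damped = hessian_approximation + damping * np.eye(len(parameters))
        step = np.linalg.solve(damped, -gradient)
        new_loss = np.sum(errors(parameters + step) ** 2)
        if new_loss < loss:
            # Successful step: accept it and move towards Newton-like behaviour.
            parameters, loss = parameters + step, new_loss
            damping /= 10.0
        else:
            # Failed step: reject it and move towards gradient descent.
            damping *= 10.0
    return parameters

# Illustrative model a * exp(b * x) fitted to noise-free data with a=2, b=-1.5.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
errors = lambda p: p[0] * np.exp(p[1] * x) - y
jacobian = lambda p: np.column_stack([np.exp(p[1] * x),
                                      p[0] * x * np.exp(p[1] * x)])

fitted = levenberg_marquardt(errors, jacobian, np.array([1.0, 0.0]))
```

With a large damping parameter the step approaches a short gradient descent step, and with a small one it approaches the Newton step built on the approximate Hessian.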

Stochastic gradient descent has a different nature from the above algorithms. At every epoch, it updates the parameters many times using batches of data. $$ new\_parameters = parameters - batch\_gradient \cdot learning\_rate$$
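
The batch updates can be sketched on a small linear model trained with the mean squared error. The synthetic data, the batch size of 50 and the learning rate of 0.05 are arbitrary values chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free data set: 1000 instances, 3 input variables.
inputs = rng.normal(size=(1000, 3))
true_parameters = np.array([1.0, -2.0, 0.5])
targets = inputs @ true_parameters

def batch_gradient(parameters, batch_inputs, batch_targets):
    """Gradient of the mean squared error over one batch of a linear model."""
    batch_errors = batch_inputs @ parameters - batch_targets
    return 2.0 * batch_inputs.T @ batch_errors / len(batch_targets)

parameters = np.zeros(3)
learning_rate, batch_size = 0.05, 50

for epoch in range(20):
    order = rng.permutation(len(inputs))          # reshuffle at every epoch
    for start in range(0, len(inputs), batch_size):
        batch = order[start:start + batch_size]
        grad = batch_gradient(parameters, inputs[batch], targets[batch])
        parameters = parameters - learning_rate * grad
```

Each epoch here performs 20 parameter updates, one per batch, instead of a single update computed from the whole data set.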

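The Adam algorithm (adaptive moment estimation) is similar to stochastic gradient descent, but implements a more sophisticated method for calculating the training direction, which usually produces faster convergence. It scales the update using running averages of the batch gradient and of its square. $$ new\_parameters = parameters - \frac{gradient\_average}{\sqrt{squared\_gradient\_average} + \epsilon} \cdot learning\_rate$$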
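
A single Adam update can be sketched as follows, using the commonly published form of the algorithm with bias-corrected moving averages. The default values for `beta_1`, `beta_2` and `epsilon` are the ones usually quoted for Adam; whether a given tool uses the same defaults is not stated in the text.

```python
import numpy as np

def adam_step(parameters, gradient, first_moment, second_moment, step,
              learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """One Adam update; step is the 1-based count of updates done so far."""
    # Exponential moving averages of the gradient and of its square.
    first_moment = beta_1 * first_moment + (1.0 - beta_1) * gradient
    second_moment = beta_2 * second_moment + (1.0 - beta_2) * gradient ** 2
    # Bias correction, which matters during the first updates.
    corrected_first = first_moment / (1.0 - beta_1 ** step)
    corrected_second = second_moment / (1.0 - beta_2 ** step)
    parameters = parameters - learning_rate * corrected_first / (
        np.sqrt(corrected_second) + epsilon)
    return parameters, first_moment, second_moment

parameters = np.array([1.0, -1.0])
first_moment = np.zeros(2)
second_moment = np.zeros(2)
gradient = np.array([0.5, -4.0])   # made-up batch gradient

parameters, first_moment, second_moment = adam_step(
    parameters, gradient, first_moment, second_moment, step=1, learning_rate=0.1)
```

On the first update each parameter moves by approximately the learning rate, whatever the size of its gradient component, which shows how Adam normalizes the step per parameter.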

For very small data sets (10 variables, 1,000 instances), the Levenberg-Marquardt algorithm is recommended, due to its high speed. For intermediate problems, the quasi-Newton method or the conjugate gradient will perform well. For very big data sets (1,000 variables, 1,000,000 instances), stochastic gradient descent or adaptive moment estimation (Adam) are the best choices. The quasi-Newton method is the default optimization algorithm in Neural Designer.

The article "5 algorithms to train a neural network" on our blog contains more information about this subject.