# 4. Training strategy

The procedure used to carry out the learning process is called the training (or learning) strategy. The training strategy is applied to the neural network to obtain the minimum loss possible. This is done by searching for parameters that fit the neural network to the data set.

A general training strategy consists of two different concepts:

- A loss index.
- An optimization algorithm.

## 4.1. Loss index

The loss index plays a vital role in the use of neural networks. It defines the task the neural network is required to perform and provides a measure of the quality of the representation that the network must learn. The choice of a suitable loss index depends on the application. For instance, in classification, a simple measure of quality is the error rate:

$$ error\_rate = \frac{false\_positives+false\_negatives}{total\_instances} $$

When setting a loss index, two different terms must be chosen: an error term and a regularization term.

$$ loss\_index = error\_term + regularization\_term $$
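For illustration, this composition can be sketched in a few lines of NumPy. The function name and the particular choice of mean squared error plus L2 regularization are just one possible combination, not a fixed prescription:

```python
import numpy as np

def loss_index(outputs, targets, parameters, regularization_weight=0.01):
    """Sketch of a loss index: an error term plus a regularization term."""
    error_term = np.mean((outputs - targets) ** 2)                         # e.g. mean squared error
    regularization_term = regularization_weight * np.sum(parameters ** 2)  # e.g. L2 regularization
    return error_term + regularization_term
```
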

### Error term

The error is the most important term in the loss expression. It measures how the neural network fits the data set.

The error can be measured over different subsets of the data. In this regard, the **training error** refers to the error measured on the training samples, the **selection error** is measured on the selection samples, and the **testing error** is measured on the testing samples.

Next, we describe the most important errors used in the field of neural networks:

- Mean squared error.
- Normalized squared error.
- Weighted squared error.
- Cross entropy error.
- Minkowski error.

### Mean squared error (MSE)

The mean squared error calculates the average squared error between the neural network outputs and the data set’s targets.

$$mean\_squared\_error = \frac{\sum \left(outputs - targets\right)^2}{samples\_number}$$
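A direct NumPy translation of this formula (the function name is ours, for illustration only):

```python
import numpy as np

def mean_squared_error(outputs, targets):
    """Average squared difference between outputs and targets."""
    return np.sum((outputs - targets) ** 2) / len(targets)
```
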

### Normalized squared error (NSE)

The normalized squared error divides the squared error between the outputs from the neural network and the targets in the data set by a normalization coefficient. If the normalized squared error has a value of unity, then the neural network predicts the data ‘on the mean’, while a value of zero means a perfect prediction of the data.

$$normalized\_squared\_error = \frac{\sum \left(outputs - targets\right)^2}{normalization\_coefficient}$$

The normalized squared error is the default error term when solving approximation problems.
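A small sketch of this error in NumPy. We assume here that the normalization coefficient is the squared deviation of the targets from their mean, which is what gives the "on the mean" interpretation above; the function name is ours:

```python
import numpy as np

def normalized_squared_error(outputs, targets):
    # Assumed normalization: squared deviation of the targets from their mean,
    # so that always predicting the target mean gives an error of exactly 1.
    normalization_coefficient = np.sum((targets - targets.mean()) ** 2)
    return np.sum((outputs - targets) ** 2) / normalization_coefficient
```
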

### Weighted squared error (WSE)

The weighted squared error is used in binary classification applications with unbalanced targets, i.e., when the numbers of positives and negatives are very different. It gives a different weight to errors belonging to positive and negative samples.

$$weighted\_squared\_error = \\ positives\_weight \cdot \sum \left(outputs - positive\_targets\right)^2 \\ + negatives\_weight \cdot \sum \left(outputs - negative\_targets\right)^2$$
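As an illustration, the weighting can be expressed in NumPy as follows, assuming binary targets encoded as 0 and 1 (the function name is ours):

```python
import numpy as np

def weighted_squared_error(outputs, targets, positives_weight, negatives_weight):
    """Squared error with separate weights for positive and negative samples."""
    positive = targets == 1  # assumed 0/1 target encoding
    return (positives_weight * np.sum((outputs[positive] - targets[positive]) ** 2)
            + negatives_weight * np.sum((outputs[~positive] - targets[~positive]) ** 2))
```
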

### Cross entropy error

The cross-entropy error is used in binary classification problems, where the target variable can only take the values 0 or 1. Its main advantage is that it imposes a high penalty when a target labeled 0 is predicted with a probability close to 1, and vice versa. It is defined as:

$$cross\_entropy\_error =\\ -\sum \left( target \cdot \log \left( output \right) + \left( 1-target \right) \cdot \log \left( 1-output \right) \right)$$

A perfect model should have 0 cross-entropy error.
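A minimal NumPy sketch of this error; the clipping of the outputs away from 0 and 1 is our addition, to avoid taking the logarithm of zero:

```python
import numpy as np

def cross_entropy_error(outputs, targets, eps=1e-12):
    """Binary cross-entropy between predicted probabilities and 0/1 targets."""
    outputs = np.clip(outputs, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(targets * np.log(outputs) + (1.0 - targets) * np.log(1.0 - outputs))
```

Note how a confident wrong prediction (for example output 0.99 for a target of 0) contributes a large error, while a perfect prediction contributes essentially zero.
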

### Minkowski error (ME)

One of the potential difficulties of the above errors is that they can receive an enormous contribution from points with significant errors (outliers). If the distribution has long tails, the solution can be dominated by a few points with significant errors.

On such occasions, to achieve good generalization, it is preferable to choose a more suitable error method. The Minkowski error is the mean, over the training samples, of the absolute difference between outputs and targets raised to an exponent that can vary between 1 and 2. That exponent is called the Minkowski parameter; its default value is 1.5.

$$minkowski\_error = \frac{\sum\left|outputs - targets\right|^{minkowski\_parameter}}{samples\_number}$$
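In NumPy this reads (function name ours, for illustration):

```python
import numpy as np

def minkowski_error(outputs, targets, minkowski_parameter=1.5):
    """Mean absolute difference raised to the Minkowski parameter (1 to 2)."""
    return np.sum(np.abs(outputs - targets) ** minkowski_parameter) / len(targets)
```

With an exponent below 2, a large residual contributes less than it would to the mean squared error, which is what makes this error more robust to outliers.
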

We can calculate the gradient analytically for all the error methods we have seen above using the so-called back-propagation algorithm.

### Regularization term

A solution is regular when small changes in the input variables lead to small output changes. An approach for non-regular problems is to control the effective complexity of the neural network. We can achieve this by including a regularization term in the loss index.

Regularization terms usually measure the values of the parameters in the neural network. Adding that term to the error will cause the neural network to have smaller weights and biases, forcing its response to be smoother.

The most used types of regularization are the following:

### L1 regularization

The L1 regularization method consists of the sum of the absolute values of all the parameters in the neural network.

$$ l1\_regularization = regularization\_weight \cdot \sum |parameters|$$

### L2 regularization

The L2 regularization method consists of the squared sum of all the parameters in the neural network.

$$ l2\_regularization = regularization\_weight \cdot \sum parameters^{2}$$
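Both regularization terms are one-liners in NumPy (function names ours):

```python
import numpy as np

def l1_regularization(parameters, regularization_weight):
    """Weighted sum of absolute parameter values."""
    return regularization_weight * np.sum(np.abs(parameters))

def l2_regularization(parameters, regularization_weight):
    """Weighted sum of squared parameter values."""
    return regularization_weight * np.sum(parameters ** 2)
```
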

As we can see, the regularization term is weighted by a parameter. If the solution is too smooth, we need to decrease the weight. Conversely, we need to increase the weight if the solution oscillates too much.

The gradient for the regularization terms described above can be computed straightforwardly.

### Loss function

The loss index depends on the function represented by the neural network and is measured on the data set. It can be visualized as a hyper-surface with the parameters as coordinates. See the following figure.

The learning problem for neural networks can then be stated as finding a neural network function for which the loss index takes on a minimum value. That is, to find the parameters that minimize the above function.

## 4.2. Optimization algorithm

As said, the learning problem for neural networks consists of searching for a set of parameters at which the loss index takes a minimum value. The necessary condition states that the gradient is zero when the neural network is at a minimum of the loss index.

The loss index is, in general, a non-linear function of the parameters. Consequently, closed-form expressions for the minima cannot be found. Instead, we consider a search through the parameter space consisting of a succession of steps, or epochs. At each epoch, the loss decreases as the neural network parameters are adjusted. The change in the parameters between two epochs is called the parameter increment.

In this way, to train a neural network, we start with some parameter vector (often chosen at random). We then generate a sequence of parameter vectors that reduces the loss index at each iteration of the algorithm. The figure below is a state diagram of the training procedure.

The optimization algorithm stops when a specified condition is satisfied. Some stopping criteria commonly used are:

- The loss improvement in one epoch is less than a set value.
- Loss has been minimized to a goal value.
- A maximum number of epochs is reached.
- The maximum amount of computing time has been exceeded.
- The error on the selection subset increases for several consecutive epochs.
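These criteria can be sketched in a toy training loop. Everything here (function names, the one-dimensional loss, the thresholds) is illustrative only:

```python
def train(step_fn, w, max_epochs=1000, loss_goal=1e-6, min_improvement=1e-9):
    """Toy training loop illustrating three common stopping criteria."""
    previous_loss = float("inf")
    for epoch in range(max_epochs):                  # criterion: maximum number of epochs
        w, loss = step_fn(w)
        if loss <= loss_goal:                        # criterion: loss goal reached
            return w, "goal"
        if previous_loss - loss < min_improvement:   # criterion: improvement too small
            return w, "no_improvement"
        previous_loss = loss
    return w, "max_epochs"

def gd_step(w, learning_rate=0.1):
    """One gradient-descent step on the toy loss (w - 3)^2."""
    w = w - learning_rate * 2.0 * (w - 3.0)
    return w, (w - 3.0) ** 2

w, reason = train(gd_step, 0.0)
```

On this toy problem the loop terminates because the loss goal is reached, well before the epoch limit.
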

The optimization algorithm determines how the parameters of the neural network are adjusted. Different algorithms have different computation and storage requirements, and no single one is best suited to all problems.

Next, the most used optimization algorithms are described:

- Gradient descent.
- Conjugate gradient.
- Quasi-Newton method.
- Levenberg-Marquardt algorithm.
- Stochastic gradient descent.
- Adaptive linear momentum.

### Gradient descent (GD)

The most straightforward optimization algorithm is gradient descent. Here, the parameters are updated at each epoch in the direction of the negative gradient of the loss index.

$$ new\_parameters = parameters \\ - loss\_gradient \cdot learning\_rate$$

The learning rate is usually adjusted at each epoch using line minimization.
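As an illustration, here is gradient descent on the mean squared error of a small noise-free linear model. The data, fixed learning rate, and epoch count are our assumptions for the sketch (in particular, no line minimization is performed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                          # targets of a noise-free linear model

w = np.zeros(3)                         # initial parameters
learning_rate = 0.1
for epoch in range(200):
    loss_gradient = 2.0 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w = w - learning_rate * loss_gradient             # step along the negative gradient
```
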

### Conjugate gradient (CG)

In the conjugate gradient algorithm, the search is performed along conjugate directions, which generally produces faster convergence than gradient descent directions.

$$ new\_parameters = parameters \\ - conjugate\_gradient \cdot learning\_rate$$

The learning rate is adjusted at each epoch using line minimization.

### Quasi-Newton method (QNM)

Newton’s method uses the Hessian of the loss function, a matrix of second derivatives, to calculate the learning direction. Since it uses second-order information, the learning direction points to the minimum of the loss function with higher accuracy. The drawback is that calculating the Hessian matrix is very computationally expensive.

The quasi-Newton method is based on Newton’s method but does not require the calculation of second derivatives. Instead, the quasi-Newton method computes an approximation of the inverse Hessian at each iteration of the algorithm using only gradient information.

$$ new\_parameters = parameters \\ - inverse\_hessian\_approximation \cdot gradient \cdot learning\_rate$$

The learning rate is adjusted here at each epoch using line minimization.
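The following is a sketch of the idea in the style of the BFGS update, the most common quasi-Newton scheme, applied to a toy quadratic loss. The function names, the backtracking line search, and the quadratic test problem are our assumptions; a production implementation would differ in many details:

```python
import numpy as np

def quasi_newton(f, grad, x, epochs=100, tol=1e-8):
    """BFGS-style sketch: the inverse Hessian is approximated from gradients only."""
    n = len(x)
    identity = np.eye(n)
    H = identity.copy()                      # inverse-Hessian approximation
    g = grad(x)
    for _ in range(epochs):
        if np.linalg.norm(g) < tol:
            break
        direction = -H @ g
        t = 1.0                              # backtracking (Armijo) line minimization
        while f(x + t * direction) > f(x) + 1e-4 * t * (g @ direction):
            t *= 0.5
        x_new = x + t * direction
        g_new = grad(x_new)
        s, y_vec = x_new - x, g_new - g
        sy = s @ y_vec
        if sy > 1e-12:                       # BFGS update of the inverse Hessian
            rho = 1.0 / sy
            H = ((identity - rho * np.outer(s, y_vec)) @ H
                 @ (identity - rho * np.outer(y_vec, s)) + rho * np.outer(s, s))
        x, g = x_new, g_new
    return x

# toy quadratic loss: f(x) = 0.5 x^T A x - b^T x, minimum at A^{-1} b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = quasi_newton(lambda x: 0.5 * x @ A @ x - b @ x,
                     lambda x: A @ x - b,
                     np.zeros(2))
```
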

### Levenberg-Marquardt algorithm (LM)

Another training method is the Levenberg-Marquardt algorithm. It is designed to approach second-order training speed without computing the Hessian matrix.

The Levenberg-Marquardt algorithm can only be applied when the loss index has the form of a sum of squares (as the sum squared error, the mean squared error, or the normalized squared error). It requires computing the gradient and the Jacobian matrix of the loss index.

$$ new\_parameters = parameters \\ - \left(Jacobian^{T} \cdot Jacobian + damping\_parameter \cdot identity\right)^{-1} \cdot gradient$$
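For illustration, here is the update applied to the simplest possible sum-of-squares problem, fitting a straight line. The toy model, damping value, and variable names are our assumptions:

```python
import numpy as np

# Levenberg-Marquardt sketch: fit the toy model y = a*x + b by least squares.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0                        # targets generated with a = 2, b = 1
p = np.zeros(2)                          # parameters [a, b]
damping_parameter = 1e-3
for epoch in range(20):
    residuals = p[0] * x + p[1] - y                    # outputs - targets
    J = np.stack([x, np.ones_like(x)], axis=1)         # Jacobian of the residuals
    gradient = J.T @ residuals
    step_matrix = J.T @ J + damping_parameter * np.eye(2)
    p = p - np.linalg.solve(step_matrix, gradient)
```

Note that only the residuals and their Jacobian are needed; the Hessian is never computed.
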

### Stochastic gradient descent (SGD)

Stochastic gradient descent has a different nature from the algorithms above. At every epoch, it updates the parameters many times, using batches of data.

$$ new\_parameters = parameters \\ - batch\_gradient \cdot learning\_rate + momentum \cdot parameter\_increment$$
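A sketch of mini-batch SGD with momentum on the same kind of toy linear problem as before. The batch size, momentum value, and learning rate are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                                   # noise-free linear targets

w = np.zeros(3)
velocity = np.zeros(3)                           # momentum term (previous increment)
learning_rate, momentum, batch_size = 0.05, 0.9, 32
for epoch in range(100):
    order = rng.permutation(len(y))              # shuffle the samples each epoch
    for start in range(0, len(y), batch_size):
        batch = order[start:start + batch_size]
        batch_gradient = 2.0 * X[batch].T @ (X[batch] @ w - y[batch]) / batch_size
        velocity = momentum * velocity - learning_rate * batch_gradient
        w = w + velocity                         # many parameter updates per epoch
```
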

### Adaptive linear momentum (ADAM)

The Adam algorithm is similar to gradient descent but implements a more sophisticated method for calculating the training direction, which usually produces faster convergence.

$$ new\_parameters = parameters \\ - \frac{gradient\_exponential\_decay}{\sqrt{square\_gradient\_exponential\_decay}}\cdot learning\_rate$$
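A sketch of one Adam update, following the standard formulation with bias correction (the decay rates, epsilon, and toy one-dimensional loss are our illustrative choices):

```python
import numpy as np

def adam_update(w, gradient, m, v, t, learning_rate=0.01,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: exponential decay averages of the gradient and its square."""
    m = beta1 * m + (1.0 - beta1) * gradient        # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * gradient ** 2   # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                  # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# minimize the toy loss (w - 3)^2
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 3001):
    w, m, v = adam_update(w, 2.0 * (w - 3.0), m, v, t)
```
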

### Performance considerations

For small data sets (on the order of 10 variables and 10,000 samples), the Levenberg-Marquardt algorithm is recommended due to its high speed and precision.

For intermediate problems, the quasi-Newton method or the conjugate gradient will perform well.

For big data sets (on the order of 1,000 variables and 1,000,000 samples), stochastic gradient descent or adaptive linear momentum are the best choices.

The quasi-Newton method is the default optimization algorithm in Neural Designer.

The article *5 algorithms to train a neural network* in the Neural Designer blog contains more information about this subject.

Model Selection ⇒