The training strategy is the method that drives the learning process of a neural network.
It searches for the parameter values that best adapt the neural network to the data set.
A general training strategy combines two key ideas:
1. Loss index
The loss index defines the task for the neural network and measures how well it learns.
The choice of a suitable loss index depends on the application.
When setting a loss index, we must choose two different terms: an error term and a regularization term.
$$\text{loss index} = \text{error term} + \text{regularization term}$$
Error term
The error is the most important term in the loss index.
It measures how well the neural network fits the data set.
We can calculate errors on the training, selection, and testing samples.
Next, we describe the most important errors used in machine learning:
- Mean squared error.
- Normalized squared error.
- Weighted squared error.
- Cross-entropy error.
- Minkowski error.
We calculate the gradient of all these error methods using the backpropagation algorithm.
Mean squared error (MSE)
The mean squared error is the average of the squared differences between the neural network outputs and the dataset’s targets.
$$\text{mean squared error} = \frac{\sum (outputs – targets)^2}{\text{samples number}}$$
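As an illustration, here is a minimal NumPy sketch of the mean squared error (the function name is our own, not taken from any particular library):

```python
import numpy as np

def mean_squared_error(outputs, targets):
    """Average of the squared differences between outputs and targets."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean((outputs - targets) ** 2)

# outputs [0.9, 0.2] vs targets [1.0, 0.0]:
# squared differences are 0.01 and 0.04, so the mean is 0.025
error = mean_squared_error([0.9, 0.2], [1.0, 0.0])
```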
Normalized squared error (NSE)
The normalized squared error is the squared difference between outputs and targets divided by a normalization factor.
A value of zero means the network predicts the data perfectly.
A value of one means the network only predicts the average of the data.
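A small sketch of the normalized squared error. The text does not spell out the normalization factor, so we assume the common choice, the sum of squared deviations of the targets from their mean, which is consistent with the zero/one interpretation above:

```python
import numpy as np

def normalized_squared_error(outputs, targets):
    """Squared error divided by a normalization factor; here we assume the
    factor is the sum of squared deviations of the targets from their mean."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    normalization = np.sum((targets - targets.mean()) ** 2)
    return np.sum((outputs - targets) ** 2) / normalization

perfect = normalized_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # 0.0
average = normalized_squared_error([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])  # 1.0
```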
Weighted squared error (WSE)
The weighted squared error is used in binary classification with unbalanced data.
This happens when the number of positive and negative samples is very different.
It assigns weights so that positives and negatives contribute equally to the error.
$$\text{weighted squared error} =
\text{positives weight} \cdot \sum (\text{outputs} – \text{positive targets})^2
+ \text{negatives weight} \cdot \sum (\text{outputs} – \text{negative targets})^2$$
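A hedged NumPy sketch of the weighted squared error, splitting the samples by their target class:

```python
import numpy as np

def weighted_squared_error(outputs, targets, positives_weight, negatives_weight):
    """Squared error with separate weights for positive and negative samples."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    positive = targets == 1
    return (positives_weight * np.sum((outputs[positive] - targets[positive]) ** 2)
            + negatives_weight * np.sum((outputs[~positive] - targets[~positive]) ** 2))

# 1 positive and 3 negatives: weighting positives 3x balances both classes.
error = weighted_squared_error(outputs=[0.5, 0.5, 0.0, 0.0],
                               targets=[1, 0, 0, 0],
                               positives_weight=3.0, negatives_weight=1.0)
```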
Cross-entropy error
The cross-entropy error is used in both binary and multi-class classification problems.
It penalizes heavily when the network assigns a high probability to the wrong class.
For binary classification, the cross-entropy error is
$$\text{cross-entropy error} =
-\sum \Big( \text{targets} \cdot \ln(\text{outputs}) + (1 – \text{targets}) \cdot \ln(1 – \text{outputs}) \Big)$$
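A minimal sketch of the binary cross-entropy error. The clipping of the outputs away from 0 and 1 is our own numerical safeguard, not part of the definition:

```python
import numpy as np

def binary_cross_entropy(outputs, targets, epsilon=1e-12):
    """Binary cross-entropy; outputs are clipped so the logarithms stay finite."""
    outputs = np.clip(np.asarray(outputs, dtype=float), epsilon, 1 - epsilon)
    targets = np.asarray(targets, dtype=float)
    return -np.sum(targets * np.log(outputs)
                   + (1 - targets) * np.log(1 - outputs))

# A confident wrong prediction is penalized much more heavily
# than a confident right one.
confident_right = binary_cross_entropy([0.99], [1])
confident_wrong = binary_cross_entropy([0.01], [1])
```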
Minkowski error (ME)
The previous errors can be very sensitive to outliers.
In such cases, the Minkowski error provides better generalization.
It raises the differences between outputs and targets to a power called the Minkowski parameter.
This parameter ranges from 1 to 2, with a default value of 1.5.
$$\text{Minkowski error} =
\frac{\sum \big\lvert \text{outputs} – \text{targets} \big\rvert^{\text{Minkowski parameter}}}{\text{samples number}}$$
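A short sketch of the Minkowski error. We take the absolute value of the differences so the non-integer power stays well defined:

```python
import numpy as np

def minkowski_error(outputs, targets, minkowski_parameter=1.5):
    """Mean of the absolute differences raised to the Minkowski parameter."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean(np.abs(outputs - targets) ** minkowski_parameter)

# An outlier (difference of 4) dominates the squared error (parameter 2)
# far more than the Minkowski error with parameter 1.5.
squared = minkowski_error([0.0, 4.0], [0.0, 0.0], minkowski_parameter=2.0)    # 8.0
minkowski = minkowski_error([0.0, 4.0], [0.0, 0.0], minkowski_parameter=1.5)  # 4.0
```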
Regularization term
A model is regular when small changes in the inputs only cause small changes in the outputs.
If the model is not regular, it may overfit the training data and fail to generalize.
To avoid this, we add a regularization term to the loss index.
The term keeps the network’s weights and biases small, making the model simpler and smoother.
The main types of regularization are:
- L1 regularization.
- L2 regularization.
We can easily compute the gradients of both regularization terms.
L1 regularization
L1 regularization adds up the absolute values of all the network’s parameters.
It drives the parameters to small values, which makes the model simpler and less likely to overfit.
$$\text{L1 regularization} =
\text{regularization weight} \cdot \sum \lvert \text{parameters} \rvert$$
L2 regularization
L2 regularization is the sum of the squared values of all the network’s parameters.
It also drives the parameters to small values, making the model more regular.
$$\text{L2 regularization} =
\text{regularization weight} \cdot \sum \text{parameters}^2$$
As the formulas show, the regularization weight controls the contribution of the regularization term to the loss index.
We decrease this weight if the model is too smooth and increase it if the model oscillates too much.
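The two regularization terms above can be sketched directly from their formulas:

```python
import numpy as np

def l1_regularization(parameters, regularization_weight):
    """Weighted sum of the absolute values of the parameters."""
    return regularization_weight * np.sum(np.abs(parameters))

def l2_regularization(parameters, regularization_weight):
    """Weighted sum of the squared values of the parameters."""
    return regularization_weight * np.sum(np.asarray(parameters, dtype=float) ** 2)

parameters = [-1.0, 2.0, -3.0]
l1 = l1_regularization(parameters, regularization_weight=0.1)  # 0.1 * 6  = 0.6
l2 = l2_regularization(parameters, regularization_weight=0.1)  # 0.1 * 14 = 1.4
```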
Loss function
The loss index depends on the neural network function and the data set.
We can imagine it as a surface in many dimensions, with the parameters as coordinates.
The following figure shows this idea.
Training a neural network consists of finding the parameter values that minimize the loss index.
2. Optimization algorithm
An optimization algorithm is the method used to adjust the parameters of a neural network to minimize the loss index.
Training starts with random parameters and iteratively improves them until reaching a minimum of the loss index.
The following figure shows this process.
The optimization algorithm stops when a chosen condition is met.
Standard stopping criteria include:
- The maximum number of epochs is reached.
- The maximum computing time is exceeded.
- The selection error increases for several epochs.
The optimization algorithm decides how to adjust the neural network parameters.
Different algorithms have different computational and memory requirements, and no one is best for all problems.
Next, we describe the most common optimization algorithms.
- Gradient descent.
- Newton’s method.
- Quasi-Newton method.
- Levenberg-Marquardt algorithm.
- Stochastic gradient descent.
- Adaptive moment estimation (Adam).
Gradient descent (GD)
The simplest optimization algorithm is gradient descent.
It updates the parameters each epoch in the direction of the negative gradient of the loss index.
A factor called the learning rate controls the change of parameters.
$$\text{New parameters} =
\text{parameters} – \text{loss gradient} \cdot \text{learning rate}$$
The main drawback of gradient descent is that it can converge very slowly.
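The update rule above can be sketched as a small loop. We use a one-parameter quadratic loss f(w) = (w – 3)² as a stand-in for the loss index, with gradient 2(w – 3):

```python
# Minimal gradient descent loop on f(w) = (w - 3)^2.
def gradient_descent(loss_gradient, parameters, learning_rate=0.1, epochs=100):
    for _ in range(epochs):
        # Move against the gradient, scaled by the learning rate.
        parameters = parameters - learning_rate * loss_gradient(parameters)
    return parameters

# Starting from 0, the parameter converges toward the minimum at 3.
w = gradient_descent(lambda w: 2 * (w - 3), parameters=0.0)
```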
Newton's method (NM)
Newton’s method uses the Hessian matrix, which contains all the second derivatives of the loss function.
Unlike gradient descent, which only relies on first derivatives, Newton’s method uses curvature information to find better training directions.
$$\text{New parameters} =
\text{parameters} – \text{loss Hessian}^{-1} \cdot \text{loss gradient} \cdot \text{learning rate}$$
The main drawback of Newton’s method is the high computational cost of calculating the Hessian matrix.
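A sketch of a single Newton step on the same quadratic loss f(w) = (w – 3)². Because the loss is quadratic, one full step (learning rate 1) lands exactly on the minimum:

```python
import numpy as np

def newton_step(parameters, gradient, hessian, learning_rate=1.0):
    """One Newton update: subtract Hessian^-1 * gradient, scaled by the rate."""
    return parameters - learning_rate * np.linalg.solve(hessian, gradient)

w = np.array([0.0])
# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and constant Hessian [[2]].
w = newton_step(w, gradient=np.array([2 * (w[0] - 3)]), hessian=np.array([[2.0]]))
```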
Quasi-Newton method (QNM)
Newton’s method uses the Hessian matrix of second derivatives to set the learning direction.
This gives high accuracy but is very expensive to compute.
Quasi-Newton methods avoid this by approximating the inverse Hessian with only gradient information.
Line minimization algorithms adjust the learning rate at each epoch.
$$\text{New parameters} =
\text{parameters} – \text{inverse Hessian approximation} \cdot \text{gradient} \cdot \text{learning rate}$$
The Quasi-Newton method makes training faster than gradient descent and less costly than Newton’s method.
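A one-dimensional sketch of the quasi-Newton idea: instead of computing the second derivative, approximate it from two successive gradients (a secant update). Full quasi-Newton methods such as BFGS generalize this to a matrix approximation of the inverse Hessian:

```python
def quasi_newton_1d(loss_gradient, w_old, w_new, epochs=50):
    """Newton-like updates using a secant estimate of the curvature."""
    g_old, g_new = loss_gradient(w_old), loss_gradient(w_new)
    for _ in range(epochs):
        if g_new == g_old:  # gradient stopped changing: converged
            break
        curvature = (g_new - g_old) / (w_new - w_old)  # secant estimate of f''
        w_old, g_old = w_new, g_new
        w_new = w_new - g_new / curvature
        g_new = loss_gradient(w_new)
    return w_new

# On f(w) = (w - 3)^2 the secant estimate is exact, so convergence is immediate.
w = quasi_newton_1d(lambda w: 2 * (w - 3), w_old=0.0, w_new=1.0)
```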
Levenberg-Marquardt algorithm (LM)
The Levenberg–Marquardt algorithm is another optimizer that achieves near second-order speed without computing the Hessian matrix.
This method applies only when the loss index is a sum of squares, such as the mean squared error or the normalized squared error.
It requires the gradient and the Jacobian matrix of the loss index.
$$\text{New parameters} =
\text{parameters} – \big( \text{Jacobian}^{T} \cdot \text{Jacobian} + \text{damping parameter} \cdot I \big)^{-1} \cdot \text{gradient}$$
The Levenberg-Marquardt algorithm is very fast but requires a lot of memory, so it is recommended only for small networks.
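A sketch of one Levenberg-Marquardt step on a linear least-squares fit y = w · x (the damping parameter interpolates between a Gauss-Newton step for small values and a gradient-descent-like step for large ones):

```python
import numpy as np

def levenberg_marquardt_step(parameters, jacobian, residuals, damping=1e-3):
    """One LM update: solve (J^T J + damping * I) * step = J^T * residuals."""
    jtj = jacobian.T @ jacobian
    identity = np.eye(jtj.shape[0])
    step = np.linalg.solve(jtj + damping * identity, jacobian.T @ residuals)
    return parameters - step

x = np.array([[1.0], [2.0], [3.0]])  # Jacobian of the residuals w * x - y
y = np.array([2.0, 4.0, 6.0])        # data generated with w = 2
w = np.array([0.0])
# With negligible damping, one step solves this linear problem exactly.
w = levenberg_marquardt_step(w, jacobian=x, residuals=x @ w - y, damping=1e-9)
```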
Stochastic gradient descent (SGD)
Stochastic gradient descent (SGD) works differently from the previous algorithms.
It updates the parameters several times in each epoch using small batches of data.
$$\text{New parameters} =
\text{parameters} – \text{batch gradient} \cdot \text{learning rate} + \text{momentum}$$
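A sketch of mini-batch SGD with momentum on a one-parameter model y = w · x trained with the mean squared error (the momentum formulation with a velocity term is a common variant, assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x                          # data generated with w = 3

w, velocity = 0.0, 0.0
learning_rate, momentum_weight, batch_size = 0.1, 0.5, 20

for epoch in range(50):
    order = rng.permutation(len(x))  # new random batches each epoch
    for start in range(0, len(x), batch_size):
        batch = order[start:start + batch_size]
        # Gradient of the batch mean squared error with respect to w.
        batch_gradient = np.mean(2.0 * (w * x[batch] - y[batch]) * x[batch])
        velocity = momentum_weight * velocity - learning_rate * batch_gradient
        w = w + velocity
```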
Adaptive moment estimation (Adam)
The Adam algorithm is similar to stochastic gradient descent but uses a more advanced way to calculate the training direction.
It also adapts the learning rate for each parameter, which usually makes convergence faster.
$$\text{New parameters} =
\text{parameters} – \frac{\text{gradient exponential decay}}{\sqrt{\text{square gradient exponential decay}}} \cdot \text{learning rate}$$
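A sketch of the Adam update on f(w) = (w – 3)². The bias-correction terms are part of the standard Adam formulation, assumed here even though the formula above omits them:

```python
import numpy as np

def adam(loss_gradient, w, learning_rate=0.05, beta1=0.9, beta2=0.999,
         epsilon=1e-8, epochs=1000):
    m, v = 0.0, 0.0
    for t in range(1, epochs + 1):
        g = loss_gradient(w)
        m = beta1 * m + (1 - beta1) * g       # gradient exponential decay
        v = beta2 * v + (1 - beta2) * g ** 2  # squared gradient exponential decay
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return w

# Starting from 0, the parameter moves toward the minimum at 3.
w = adam(lambda w: 2 * (w - 3), w=0.0)
```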
Performance considerations
For small datasets (about 10 variables and 10,000 samples), the Levenberg-Marquardt algorithm is recommended for its speed and precision.
For medium-sized problems, the quasi-Newton method works well.
For large data sets (about 1,000 variables and 1,000,000 samples), adaptive moment estimation (Adam) is the best choice.
The article “5 Algorithms to Train a Neural Network” in the Neural Designer blog contains more information about this subject.