By Alberto Quesada, Artelnics.

The procedure used to carry out the learning process in a neural network is called the optimization algorithm (or optimizer).

There are many different optimization algorithms. All have different characteristics and performance in terms of memory requirements, speed and precision.

The learning problem is formulated in terms of the minimization of a loss index, \(f\). This is a function which measures the performance of a neural network on a data set.

The loss index is, in general, composed of an error and a regularization terms. The error term evaluates how a neural network fits the data set. The regularization term is used to prevent overfitting, by controlling the effective complexity of the neural network.

The loss function depends on the adaptative parameters (biases and synaptic weights) in the neural network. We can conveniently group them together into a single n-dimensional weight vector \(\mathbf{w}\).

The picture below represents the loss function \(f(\mathbf{w})\).

As we can see in the previous picture, the minimum of the loss function occurs at the point \(\mathbf{w}^{*}\). At any point \(A\), we can calculate the first and second derivatives of the loss function. The first derivatives are grouped in the gradient vector, whose elements can be written as

$$\nabla_i f(\mathbf{w}) = \frac{\partial f}{\partial w_{i}}, $$for \( i = 1,\ldots,n \).

Similarly, the second derivatives of the loss function can be grouped in the Hessian matrix,

$$\mathbf{H}_{i,j} f(\mathbf{w}) = \frac{\partial^{2} f}{\partial w_{i}\partial w_{j}}, $$for \( i,j = 0,1,\ldots \).

The problem of minimizing continuous and differentiable functions of many variables has been widely studied. Many of the conventional approaches to this problem are directly applicable to that of training neural networks.

Although the loss function depends on many parameters, one-dimensional optimization methods are of great importance here. Indeed, they are very often used in the training process of a neural network.

Certainly, many training algorithms first compute a training direction \(\mathbf{d}\) and then a training rate \(\eta\); that minimizes the loss in that direction, \(f(\eta)\). The next picture illustrates this one-dimensional function.

The points \(\eta_1\) and \(\eta_2\) define an interval that contains the minimum of \(f\), \(\eta^{*}\).

In this regard, one-dimensional optimization methods search for the minimum of a given one-dimensional function. Some of the algorithms which are widely used are the golden section method and Brent's method. Both reduce the bracket of a minimum until the distance between the two outer points in the bracket is less than a defined tolerance.

The learning problem for neural networks is formulated as searching of a parameter vector \(w^{*}\) at which the loss function \(f\) takes a minimum value. The necessary condition states that if the neural network is at a minimum of the loss function, then the gradient is the zero vector.

The loss function is, in general, a non-linear function of the parameters. As a consequence, it is not possible to find closed training algorithms for the minima. Instead, we consider a search through the parameter space consisting of a succession of steps. At each step, the loss will decrease by adjusting the neural network parameters.

In this way, to train a neural network we start with some parameter vector (often chosen at random). Then, we generate a sequence of parameters, so that the loss function is reduced at each iteration of the algorithm. The change of loss between two steps is called the loss decrement. The training algorithm stops when a specified condition, or stopping criterion, is satisfied.

Now, we describe the most important training algorithms for neural networks.

Gradient descent, also known as steepest descent, is the simplest training algorithm. It requires information from the gradient vector, and hence it is a first order method.

Let denote \(f(\mathbf{w}^{(i)})=f^{(i)}\) and \(\nabla f(\mathbf{w}^{(i)})=\mathbf{g}^{(i)}\). The method begins at a point \(\mathbf{w}^{(0)}\) and, until a stopping criterion is satisfied, moves from \(\mathbf{w}^{(i)}\) to \(\mathbf{w}^{(i+1)}\) in the training direction \(\mathbf{d}^{(i)}=-\mathbf{g}^{(i)}\). Therefore, the gradient descent method iterates in the following way:

$$w^{(i+1)} = w^{(i)} - \mathbf{g}^{(i)}\eta^{(i)}, $$for \( i = 0,1,\ldots \).

The parameter \(\eta\) is the training rate. This value can either set to a fixed value or found by one-dimensional optimization along the training direction at each step. An optimal value for the training rate obtained by line minimization at each successive step is generally preferable. However, there are still many software tools that only use a fixed value for the training rate.

The next picture is an activity diagram of the training process with gradient descent. As we can see, the parameter vector is improved in two steps: First, the gradient descent training direction is computed. Second, a suitable training rate is found.

The gradient descent training algorithm has the severe drawback of requiring many iterations for functions which have long, narrow valley structures. Indeed, the downhill gradient is the direction in which the loss function decreases most rapidly, but this does not necessarily produce the fastest convergence. The following picture illustrates this issue.

Gradient descent is the recommended algorithm when we have very big neural networks, with many thousand parameters. The reason is that this method only stores the gradient vector (size \(n\)), and it does not store the Hessian matrix (size \(n^{2}\)).

The Newton's method is a second order algorithm because it makes use of the Hessian matrix. The objective of this method is to find better training directions by using the second derivatives of the loss function.

Let denote \(f(\mathbf{w}^{(i)})=f^{(i)}\), \(\nabla f(\mathbf{w}^{(i)})=\mathbf{g}^{(i)}\) and \(\mathbf{H} f(\mathbf{w}^{(i)})=\mathbf{H}^{(i)}\). Consider the quadratic approximation of \(f\) at \(\mathbf{w}^{(0)}\) using the Taylor's series expansion

$$ f = f^{(0)} + g^{(0)} \cdot(\mathbf{w} - \mathbf{w}^{(0)}) \\ + 0.5\cdot(\mathbf{w} - \mathbf{w}^{(0)})^{2}\cdot \mathbf{H}^{(0)} $$\(\mathbf{H}^{(0)}\) is the Hessian matrix of \(f\) evaluated at the point \(\mathbf{w}^{(0)}\). By setting \(g\) equal to \(0\) for the minimum of \(f(\mathbf{w})\), we obtain the next equation

$$ g = g^{(0)} + \mathbf{H}^{(0)}\cdot(\mathbf{w} - \mathbf{w}^{(0)}) = 0 $$Therefore, starting from a parameter vector \(\mathbf{w}^{(0)}\), the Newton's method iterates as follows

$$ \mathbf{w}^{(i+1)}=\mathbf{w}^{(i)}-\mathbf{H}^{(i)-1}\cdot \mathbf{g}^{(i)} $$for \( i = 0,1,\ldots \).

The vector \(\mathbf{H}^{(i)-1} \cdot \mathbf{g}^{(i)}\) is known as the Newton's step. Note that this change for the parameters may move towards a maximum rather than a minimum. This occurs if the Hessian matrix is not positive definite. Thus, the function evaluation is not guaranteed to be reduced at each iteration. to prevent such troubles, the Newton's method equation is usually modified as:

$$ \mathbf{w}^{(i+1)}=\mathbf{w}^{(i)}-(\mathbf{H}^{(i)-1}\cdot \mathbf{g}^{(i)})\eta $$for \( i = 0,1,\ldots \).

The training rate, \(\eta\), can either be set to a fixed value or found by line minimization. The vector \(\mathbf{d}^{(i)}=\mathbf{H}^{(i)-1}\cdot \mathbf{g}^{(i)}\) is now called the Newton's training direction.

The state diagram for the training process with the Newton's method is depicted in the next figure. Here improvement of the parameters is performed by obtaining first the Newton's training direction and then a suitable training rate.

The picture below illustrates the performance of this method. As we can see, the Newton's method requires less steps than gradient descent to find the minimum value of the loss function.

However, the Newton's method has the difficulty that the exact evaluation of the Hessian and its inverse are quite expensive in computational terms.

The conjugate gradient method can be regarded as something intermediate between gradient descent and Newton's method. It is motivated by the desire to accelerate the typically slow convergence associated with gradient descent. This method also avoids the information requirements associated with the evaluation, storage, and inversion of the Hessian matrix, as required by the Newton's method.

In the conjugate gradient training algorithm, the search is performed along conjugate directions which produces generally faster convergence than gradient descent directions. These training directions are conjugated with respect to the Hessian matrix.

Let denote d the training direction vector. Then, starting with an initial parameter vector \(\mathbf{w}^{(0)}\) and an initial training direction vector \(\mathbf{d}^{(0)}=-\mathbf{g}^{(0)}\), the conjugate gradient method constructs a sequence of training directions as:

$$ \mathbf{d}^{(i+1)}=\mathbf{g}^{(i+1)}+\mathbf{d}^{(i)}\cdot \gamma^{(i)}, $$for \( i = 0,1,\ldots \).

Here \(\gamma\) is called the conjugate parameter, and there are different ways to calculate it. Two of the most used are due to Fletcher and Reeves and to Polak and Ribiere. For all conjugate gradient algorithms, the training direction is periodically reset to the negative of the gradient.

The parameters are then improved according to the next expression. The training rate, \(\eta\), is usually found by line minimization.

$$ \mathbf{w}^{(i+1)}=\mathbf{w}^{(i)}+\mathbf{d}^{(i)}\cdot \eta^{(i)} $$for \( i = 0,1,\ldots \).

The picture below depicts an activity diagram for the training process with the conjugate gradient. Here improvement of the parameters is done by first computing the conjugate gradient training direction and then suitable training rate in that direction.

This method has proved to be more effective than gradient descent in training neural networks. Since it does not require the Hessian matrix, conjugate gradient is also recommended when we have very big neural networks.

Application of the Newton's method is computationally expensive, since it requires many operations to evaluate the Hessian matrix and compute its inverse. Alternative approaches, known as quasi-Newton or variable metric methods, are developed to solve that drawback. These methods, instead of calculating the Hessian directly and then evaluating its inverse, build up an approximation to the inverse Hessian at each iteration of the algorithm. This approximation is computed using only information on the first derivatives of the loss function.

The Hessian matrix is composed of the second partial derivatives of the loss function. The main idea behind the quasi-Newton method is to approximate the inverse Hessian by another matrix \(\mathbf{G}\), using only the first partial derivatives of the loss function. Then, the quasi-Newton formula can be expressed as:

$$ \mathbf{w}^{(i+1)}=\mathbf{w}^{(i)}-(\mathbf{G}^{(i)}\cdot \mathbf{g}^{(i)})\cdot\eta^{(i)} $$for \( i = 0,1,\ldots \).

The training rate \(\eta\) can either be set to a fixed value or found by line minimization. The inverse Hessian approximation \(\mathbf{G}\) has different flavours. Two of the most used are the Davidon–Fletcher–Powell formula (DFP) and the Broyden–Fletcher–Goldfarb–Shanno formula (BFGS).

The activity diagram of the quasi-Newton training process is illustrated bellow. Improvement of the parameters is performed by first obtaining the quasi-Newton training direction and then finding a satisfactory training rate.

This is the default method to use in most cases: It is faster than gradient descent and conjugate gradient, and the exact Hessian does not need to be computed and inverted.

The Levenberg-Marquardt algorithm, also known as the damped least-squares method, has been designed to work specifically with loss functions which take the form of a sum of squared errors. It works without computing the exact Hessian matrix. Instead, it works with the gradient vector and the Jacobian matrix.

Consider a loss function which can be expressed as a sum of squared errors of the form

$$ f = \sum_{i=1}^{m} e_{i}^{2} $$Here \(m\) is the number of instances in the data set.

We can define the Jacobian matrix of the loss function as that containing the derivatives of the errors with respect to the parameters,

$$ \mathbf{J}_{i,j} = \frac{\partial e_{i}}{\partial \mathbf{w}_{j}}, $$for \( i=1,\ldots,m \) and \( j = 1,\ldots,n \).

Where \(m\) is the number of instances in the data set and \(n\) is the number of parameters in the neural network. Note that the size of the Jacobian matrix is \(m\cdot n\).

The gradient vector of the loss function can be computed as:

$$\nabla f = 2 \mathbf{J}^{T}\cdot \mathbf{e}$$Here \(\mathbf{e}\) is the vector of all error terms.

Finally, we can approximate the Hessian matrix with the following expression.

$$\mathbf{H} f \approx 2 \mathbf{J}^{T}\cdot \mathbf{J} + \lambda \mathbf{I}$$Where \(\lambda\) is a damping factor that ensures the positiveness of the Hessian and \(I\) is the identity matrix.

The next expression defines the parameters improvement process with the Levenberg-Marquardt algorithm

$$ \mathbf{w}^{(i+1)}=\mathbf{w}^{(i)} \\ - (\mathbf{J}^{(i)T}\cdot \mathbf{J}^{(i)}+ \lambda^{(i)}\mathbf{I})^{-1} \cdot (2\mathbf{J}^{(i)T}\cdot \mathbf{e}^{(i)}), $$for \( i = 0,1,\ldots \).

When the damping parameter \(\lambda\) is zero, this is just Newton's method, using the approximate Hessian matrix. On the other hand, when \(\lambda\) is large, this becomes gradient descent with a small training rate.

The parameter \(\lambda\) is initialized to be large so that first updates are small steps in the gradient descent direction. If any iteration happens to result in a failure, then \(\lambda\) is increased by some factor. Otherwise, as the loss decreases, \(\lambda\) is decreased, so that the Levenberg-Marquardt algorithm approaches the Newton method. This process typically accelerates the convergence to the minimum.

The picture below represents a state diagram for the training process of a neural network with the Levenberg-Marquardt algorithm. The first step is to calculate the loss, the gradient and the Hessian approximation. Then the damping parameter is adjusted so as to reduce the loss at each iteration.

As we have seen the Levenberg-Marquardt algorithm is a method tailored for functions of the type sum-of-squared-error. That makes it to be very fast when training neural networks measured on that kind of errors. However, this algorithm has some drawbacks. The first one is that it cannot be applied to functions such as the root mean squared error or the cross entropy error. Also, it is not compatible with regularization terms. Finally, for very big data sets and neural networks, the Jacobian matrix becomes huge, and therefore it requires a lot of memory. Therefore, the Levenberg-Marquardt algorithm is not recommended when we have big data sets and/or neural networks.

The next graph depicts the computational speed and the memory requirements of the training algorithms discussed in this post. As we can see, the slowest training algorithm is usually gradient descent, but it is the one requiring less memory. On the contrary, the fastest one might be the Levenberg-Marquardt algorithm, but usually requires a lot of memory. A good compromise might be the quasi-Newton method.

To conclude, if our neural networks has many thousands of parameters we can use gradient descent or conjugate gradient, to save memory. If we have many neural networks to train with just a few thousands of instances and a few hundreds of parameters, the best choice might be the Levenberg-Marquardt algorithm. In the rest of situations, the quasi-Newton method will work well.

- Retail store sales forecasting.
- Methods binary classification.
- Customer segmentation using advanced analytics.