{"id":3368,"date":"2025-09-24T10:59:22","date_gmt":"2025-09-24T08:59:22","guid":{"rendered":"https:\/\/neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/"},"modified":"2025-11-27T15:17:47","modified_gmt":"2025-11-27T14:17:47","slug":"5_algorithms_to_train_a_neural_network","status":"publish","type":"blog","link":"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/","title":{"rendered":"5 algorithms to train a neural network"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"3368\" class=\"elementor elementor-3368\" data-elementor-post-type=\"blog\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-43f3fba1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"43f3fba1\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6f55d1c1\" data-id=\"6f55d1c1\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5b5cbde elementor-widget elementor-widget-text-editor\" data-id=\"5b5cbde\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t\t\t\t\t\t<section><p><span style=\"color: var( --e-global-color-text ); font-family: var( --e-global-typography-text-font-family ), Sans-serif; font-size: var( --e-global-typography-text-font-size ); font-weight: var( --e-global-typography-text-font-weight );\">The training algorithm orchestrates the learning process in a <a style=\"font-family: var( --e-global-typography-text-font-family ), Sans-serif; font-size: var( --e-global-typography-text-font-size ); font-weight: var( --e-global-typography-text-font-weight ); background-color: #ffffff;\" 
href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/neural-network\">neural network<\/a>, while the <a style=\"font-family: var( --e-global-typography-text-font-family ), Sans-serif; font-size: var( --e-global-typography-text-font-size ); font-weight: var( --e-global-typography-text-font-weight ); background-color: #ffffff;\" href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#OptimizationAlgorithm\">optimization algorithm<\/a> (or optimizer) fine-tunes the model&#8217;s parameters during this training.<\/span><\/p><p>There are many different optimization algorithms. They differ in memory requirements, processing speed, and numerical precision.<\/p><p>This post first formulates the learning problem for neural networks. Then, it describes some essential <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#OptimizationAlgorithm\">optimization algorithms<\/a>. Finally, it compares those algorithms&#8217; memory, speed, and precision.<\/p><ul><li><a href=\"#LearningProblem\">Learning problem<\/a>.<\/li><li><a href=\"#GradientDescent\">1. Gradient descent<\/a>.<\/li><li><a href=\"#NewtonMethod\">2. Newton&#8217;s method<\/a>.<\/li><li><a href=\"#ConjugateGradient\">3. Conjugate gradient<\/a>.<\/li><li><a href=\"#Quasi-Newton\">4. Quasi-Newton method<\/a>.<\/li><li><a href=\"#Levenberg-Marquardt\">5. Levenberg-Marquardt algorithm<\/a>.<\/li><li><a href=\"#PerformanceComparison\">Performance comparison<\/a>.<\/li><li><a href=\"#Conclusions\">Conclusions<\/a>.<\/li><\/ul><p><a href=\"https:\/\/www.neuraldesigner.com\">Neural Designer<\/a> includes many different <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#OptimizationAlgorithm\">optimization algorithms<\/a>.\u00a0This allows you to always get the best models from your data. 
You can download a free trial <a href=\"https:\/\/www.neuraldesigner.com\/free-trial\">here<\/a>.<\/p><\/section><section><h2>Learning problem<\/h2><p>The learning problem is formulated as minimizing a <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LossIndex\">loss index<\/a>\u00a0$(f)$. It is a function that measures the performance of a <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/neural-network\">neural network<\/a>\u00a0on a <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/data-set\">data set<\/a>.<\/p><p>The <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LossIndex\">loss index<\/a> generally comprises an error term and a regularization term. The <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#ErrorTerm\">error term<\/a> evaluates how well a neural network fits the data set. The <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#RegularizationTerm\">regularization term<\/a> prevents overfitting by controlling the model&#8217;s complexity.<\/p><p>The <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LossIndex\">loss<\/a> function depends on the neural network&#8217;s adaptive parameters (biases and synaptic weights). We can group them into a single n-dimensional weight vector $(\\mathbf{w})$.<\/p><p>The following figure represents the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LossIndex\">loss<\/a> function $(f(\\mathbf{w}))$.<\/p><p><img decoding=\"async\" src=\"https:\/\/www.neuraldesigner.com\/images\/loss-function.svg\" alt=\"Neural network loss function\" \/><\/p><p>As we can see, the minimum of the loss function occurs at the point $(\\mathbf{w}^{*})$. 
At any point $(A)$, we can calculate the first and second derivatives of the loss function.<\/p><p>The gradient vector groups the first derivatives,<\/p><p>$$\\nabla_i f(\\mathbf{w}) = \\frac{\\partial f}{\\partial w_{i}}$$<\/p><p>for $( i = 1,\\ldots,n )$.<\/p><p>Similarly, the Hessian matrix groups the second derivatives,<\/p><p>$$\\mathbf{H}_{i,j} f(\\mathbf{w}) = \\frac{\\partial^{2} f}{\\partial w_{i}\\partial w_{j}},$$<\/p><p>for $( i,j = 1,\\ldots,n )$.<\/p><p>The problem of minimizing continuous and differentiable functions of many variables is widely studied. Many conventional approaches to this problem can be applied directly to training neural networks.<\/p><\/section><section><h2>One-dimensional optimization<\/h2><p>Although the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LossIndex\">loss function<\/a> depends on many parameters, one-dimensional optimization methods are essential here.<br \/>Indeed, they are very often used in the training process of a <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/neural-network\">neural network<\/a>.<\/p><p>Many training algorithms first compute a training direction $(\\mathbf{d})$ and then a training rate $(\\eta)$ that minimizes the loss in that direction, $(f(\\eta))$. The following figure illustrates this one-dimensional function.<\/p><p><img decoding=\"async\" src=\"https:\/\/www.neuraldesigner.com\/images\/one-dimensional-optimization.svg\" alt=\"One dimensional optimization\" \/><\/p><p>The points $(\\eta_1)$ and $(\\eta_2)$ define an interval that contains the minimum of $(f)$, $(\\eta^{*})$.<\/p><p>In this regard, one-dimensional optimization methods search for the minimum of one-dimensional functions. Two of the most widely used are the golden-section method and <a href=\"https:\/\/mathworld.wolfram.com\/BrentsMethod.html\">Brent&#8217;s method<\/a>. 
Both shrink the bracket around the minimum until the distance between its outer points is less than a defined tolerance.<\/p><\/section><section><h2>Multidimensional optimization<\/h2><p>The learning problem for <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/neural-network\">neural networks<\/a> is formulated as searching for a parameter vector $(\\mathbf{w}^{*})$ at which the loss function $(f)$ takes a minimum value. The necessary condition states that if the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/neural-network\">neural network<\/a> is at a minimum of the loss function, then the gradient is the zero vector.<\/p><p>The <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LossIndex\">loss function<\/a> is generally a non-linear function of the parameters. Consequently, it is generally not possible to find a closed-form solution for the minima. Instead, we search through the parameter space in a succession of steps. At each step, the loss decreases by adjusting the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/neural-network\">neural network<\/a> parameters.<\/p><p>In this way, to train a <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/neural-network\">neural network<\/a>, we start with some parameter vector (often chosen at random). Then, we generate a sequence of parameter vectors so that the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LossIndex\">loss<\/a> function\u00a0is reduced at each iteration of the algorithm. The change of loss between two steps is called the loss decrement. The <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy\">training algorithm<\/a> stops when a specified condition, or stopping criterion, is satisfied.<\/p><\/section><section><h2>1. 
Gradient descent (GD)<\/h2><p><a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#GradientDescent\">Gradient descent<\/a> is the most straightforward training algorithm. It requires only information from the gradient vector, so it is a first-order method.<\/p><p>Let us denote $(f(\\mathbf{w}^{(i)})=f^{(i)})$ and $(\\nabla f(\\mathbf{w}^{(i)})=\\mathbf{g}^{(i)})$. The method begins at a point $(\\mathbf{w}^{(0)})$ and, until a stopping criterion is satisfied, moves from $(\\mathbf{w}^{(i)})$ to $(\\mathbf{w}^{(i+1)})$ in the training direction $(\\mathbf{d}^{(i)}=-\\mathbf{g}^{(i)})$. Therefore, the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#GradientDescent\">gradient descent<\/a> method iterates in the following way:<\/p><p>$$\\mathbf{w}^{(i+1)} = \\mathbf{w}^{(i)} - \\mathbf{g}^{(i)}\\eta^{(i)}, $$<\/p><p>for $( i = 0,1,\\ldots )$.<\/p><p>The parameter $(\\eta)$ is the training rate. This value can be set to a fixed value or found by one-dimensional optimization along the training direction at each step. An optimal value for the training rate obtained by line minimization at each successive step is generally preferable. However, many software tools still use only a fixed value for the training rate.<\/p><p>The following picture is an activity diagram of the training process with <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#GradientDescent\">gradient descent<\/a>. As we can see, the algorithm improves the parameters in two steps: First, it computes the gradient descent training direction. 
Second, it finds a suitable training rate.<\/p><p><img decoding=\"async\" src=\"https:\/\/www.neuraldesigner.com\/images\/gradient_descent_algorithm_big.webp\" alt=\"Gradient descent diagram\" \/><\/p><p>The <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#GradientDescent\">gradient descent<\/a> training algorithm has the severe drawback of requiring many iterations for functions that have long, narrow valley structures. Indeed, the downhill gradient is the direction in which the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LossIndex\">loss<\/a> function decreases most rapidly, but this does not necessarily produce the fastest convergence. The following picture illustrates this issue.<\/p><p><img decoding=\"async\" src=\"https:\/\/www.neuraldesigner.com\/images\/gradient_descent_graph_big.webp\" alt=\"Gradient descent picture\" \/><\/p><p><a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#GradientDescent\">Gradient descent<\/a> is the recommended algorithm for massive neural networks with many thousands of parameters. The reason is that this method only stores the gradient vector, of size $(n)$, and it does not keep the Hessian matrix, of size $(n^{2})$.<\/p><\/section><section><h2>2. Newton&#8217;s method (NM)<\/h2><p>Newton&#8217;s method is a second-order algorithm because it uses the Hessian matrix. 
This method aims to find better training directions by using the second derivatives of the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LossIndex\">loss<\/a> function.<\/p><p>Let us denote $(f(\\mathbf{w}^{(i)})=f^{(i)})$, $(\\nabla f(\\mathbf{w}^{(i)})=\\mathbf{g}^{(i)})$ and $(\\mathbf{H} f(\\mathbf{w}^{(i)})=\\mathbf{H}^{(i)})$. Consider the quadratic approximation of $(f)$ at $(\\mathbf{w}^{(0)})$ using the Taylor series expansion<\/p><p>$$ f = f^{(0)} + \\mathbf{g}^{(0)} \\cdot (\\mathbf{w} - \\mathbf{w}^{(0)}) + 0.5 \\cdot (\\mathbf{w} - \\mathbf{w}^{(0)})^{T} \\cdot \\mathbf{H}^{(0)} \\cdot (\\mathbf{w} - \\mathbf{w}^{(0)}) $$<\/p><p>$(\\mathbf{H}^{(0)})$ is the Hessian matrix of $(f)$ evaluated at the point $(\\mathbf{w}^{(0)})$. By setting the gradient $(\\mathbf{g})$ of this approximation equal to $(\\mathbf{0})$ at the minimum of $(f(\\mathbf{w}))$, we obtain the following equation<\/p><p>$$ \\mathbf{g} = \\mathbf{g}^{(0)} + \\mathbf{H}^{(0)}\\cdot(\\mathbf{w} - \\mathbf{w}^{(0)}) = \\mathbf{0} $$<\/p><p>Therefore, starting from a parameter vector $(\\mathbf{w}^{(0)})$, Newton&#8217;s method iterates as follows:<\/p><p>$$ \\mathbf{w}^{(i+1)}=\\mathbf{w}^{(i)}-\\mathbf{H}^{(i)-1}\\cdot \\mathbf{g}^{(i)} $$<\/p><p>for $( i = 0,1,\\ldots )$.<\/p><p>The vector $(\\mathbf{H}^{(i)-1} \\cdot \\mathbf{g}^{(i)})$ is known as Newton&#8217;s step. Note that this parameter change may move towards a maximum rather than a minimum. This occurs if the Hessian matrix is not positive definite. Thus, Newton&#8217;s method is not guaranteed to reduce the loss index at each iteration. 
To prevent that, the equation of Newton&#8217;s method is usually modified as follows:<\/p><p>$$ \\mathbf{w}^{(i+1)}=\\mathbf{w}^{(i)}-(\\mathbf{H}^{(i)-1}\\cdot \\mathbf{g}^{(i)})\\eta^{(i)} $$<\/p><p>for $( i = 0,1,\\ldots )$.<\/p><p>The training rate, $(\\eta)$, can either be set to a fixed value or found by line minimization. The vector $(\\mathbf{d}^{(i)}=-\\mathbf{H}^{(i)-1}\\cdot \\mathbf{g}^{(i)})$ is now called Newton&#8217;s training direction.<\/p><p>The following figure depicts the state diagram for the training process with Newton&#8217;s method. The parameters are improved by obtaining a suitable training direction and rate.<\/p><p><img decoding=\"async\" src=\"https:\/\/www.neuraldesigner.com\/images\/newton_algorithm_big.webp\" alt=\"Newton's method diagram\" \/><\/p><p>The picture below illustrates the performance of this method. As we can see, Newton&#8217;s method requires fewer steps than gradient descent to find the minimum value of the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LossIndex\">loss<\/a> function.<\/p><p><img decoding=\"async\" src=\"https:\/\/www.neuraldesigner.com\/images\/newton_graph_big.webp\" alt=\"Newton's method graph\" \/><\/p><p>However, Newton&#8217;s method is difficult to apply in practice because the exact evaluation of the Hessian and its inverse is computationally expensive.<\/p><\/section><section><h2>3. Conjugate gradient (CG)<\/h2><p>The <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#ConjugateGradient\">conjugate gradient method<\/a> can be regarded as an intermediate between gradient descent and Newton&#8217;s method.<br \/>It is motivated by the desire to accelerate the typically slow convergence associated with gradient descent. 
This method also avoids the evaluation, storage, and inversion of the Hessian matrix that Newton&#8217;s method requires.<\/p><p>In the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#ConjugateGradient\">conjugate gradient<\/a> training algorithm, the search is performed along conjugate directions. They generally produce faster convergence than gradient descent directions. These training directions are conjugate with respect to the Hessian matrix.<\/p><p>Let $(\\mathbf{d})$ denote the training direction vector. Then, starting with an initial parameter vector $(\\mathbf{w}^{(0)})$ and an initial training direction vector $(\\mathbf{d}^{(0)}=-\\mathbf{g}^{(0)})$, the conjugate gradient method constructs a sequence of training directions as:<\/p><p>$$ \\mathbf{d}^{(i+1)}=-\\mathbf{g}^{(i+1)}+\\mathbf{d}^{(i)}\\cdot \\gamma^{(i)}, $$<\/p><p>for $( i = 0,1,\\ldots )$.<\/p><p>Here, $(\\gamma)$ is called the conjugate parameter, and there are different ways to calculate it. Two of the most used are the Fletcher-Reeves and Polak-Ribi\u00e8re formulas. For all <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#ConjugateGradient\">conjugate gradient algorithms<\/a>, the training direction is periodically reset to the negative of the gradient.<\/p><p>The parameters are then improved according to the following expression.<\/p><p>$$ \\mathbf{w}^{(i+1)}=\\mathbf{w}^{(i)}+\\mathbf{d}^{(i)}\\cdot \\eta^{(i)} $$<\/p><p>for $( i = 0,1,\\ldots )$.<\/p><p>The training rate, $(\\eta)$, is usually found by line minimization.<\/p><p>The picture below depicts an activity diagram for the training process with the conjugate gradient. Here, the improvement of the parameters is done in two steps. First, the algorithm computes the conjugate gradient training direction. 
Second, it finds a suitable training rate in that direction.<\/p><p><img decoding=\"async\" src=\"https:\/\/www.neuraldesigner.com\/images\/conjugate_gradient_algorithm_big.webp\" alt=\"Conjugate gradient diagram\" \/><\/p><p>This method has proved more effective than <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#GradientDescent\">gradient descent<\/a> in training <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/neural-network\">neural networks<\/a>. Since it does not require the Hessian matrix, the conjugate gradient also performs well with very large neural networks.<\/p><\/section><section><h2>4. Quasi-Newton method (QNM)<\/h2><p>The application of Newton&#8217;s method is computationally expensive. Indeed, it requires many operations to evaluate the Hessian matrix and compute its inverse. Alternative approaches, known as <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#QuasiNewtonMethod\">quasi-Newton<\/a> methods, were developed to overcome that drawback. Instead of calculating the Hessian directly and then evaluating its inverse, these methods build up an approximation to the inverse Hessian. This approximation is computed using only information on the first derivatives of the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LossIndex\">loss function<\/a>.<\/p><p>The Hessian matrix comprises the second partial derivatives of the loss function. Thus, the main idea behind the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#QuasiNewtonMethod\">quasi-Newton method<\/a> is approximating the inverse Hessian by another matrix $(\\mathbf{G})$, using only the first partial derivatives of the loss function. 
Then, the quasi-Newton formula is expressed as<\/p><p>$$ \\mathbf{w}^{(i+1)}=\\mathbf{w}^{(i)}-(\\mathbf{G}^{(i)}\\cdot \\mathbf{g}^{(i)})\\cdot\\eta^{(i)} $$<\/p><p>for $( i = 0,1,\\ldots )$.<\/p><p>The training rate $(\\eta)$ can be set to a fixed value or be found by line minimization. The inverse Hessian approximation $(\\mathbf{G})$ has different flavors. Two of the most used ones are the Davidon\u2013Fletcher\u2013Powell (DFP) and the Broyden\u2013Fletcher\u2013Goldfarb\u2013Shanno (BFGS) formulas.<\/p><p>The activity diagram of the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#QuasiNewtonMethod\">quasi-Newton training<\/a> process is illustrated below. Improvement of the parameters is performed in two steps. First, the algorithm obtains the quasi-Newton training direction. Second, it finds a satisfactory training rate. This is the default method to use in most cases.<\/p><p><img decoding=\"async\" src=\"https:\/\/www.neuraldesigner.com\/images\/quasi-newton_algorithm_big.webp\" alt=\"Quasi newton algorithm diagram\" \/><\/p><p>It is faster than <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#GradientDescent\">gradient descent<\/a> and <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#ConjugateGradient\">conjugate gradient<\/a>, and the exact Hessian does not need to be computed and inverted.<\/p><\/section><section><h2>5. Levenberg-Marquardt algorithm (LM)<\/h2><p>The <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LevenbergMarquardtAlgorithm\">Levenberg-Marquardt algorithm<\/a> is designed to work specifically with loss functions defined by a sum of squared errors. 
It works without computing the exact Hessian matrix. Instead, it works with the gradient vector and the Jacobian matrix.<\/p><p>Consider a loss function which takes the form of a sum of squared errors,<\/p><p>$$ f = \\sum_{i=1}^{m} e_{i}^{2} $$<\/p><p>Here $(m)$ is the number of training samples.<\/p><p>We can define the Jacobian matrix of the loss function as that containing the derivatives of the errors with respect to the parameters,<\/p><p>$$ \\mathbf{J}_{i,j} = \\frac{\\partial e_{i}}{\\partial w_{j}}, $$<\/p><p>for $( i=1,\\ldots,m )$ and $( j = 1,\\ldots,n )$.<\/p><p>Here, $(m)$ is the number of samples in the data set, and $(n)$ is the number of parameters in the neural network.<br \/>Note that the size of the Jacobian matrix is $(m\\times n)$.<\/p><p>We can compute the gradient vector of the loss function as<\/p><p>$$\\nabla f = 2 \\mathbf{J}^{T}\\cdot \\mathbf{e}$$<\/p><p>Here $(\\mathbf{e})$ is the vector of all error terms.<\/p><p>Finally, we can approximate the Hessian matrix with the following expression.<\/p><p>$$\\mathbf{H} f \\approx 2 \\mathbf{J}^{T}\\cdot \\mathbf{J} + \\lambda \\mathbf{I}$$<\/p><p>Here, $(\\lambda)$ is a damping factor that ensures the positive definiteness of the Hessian approximation, and $(\\mathbf{I})$ is the identity matrix.<\/p><p>The next expression defines the parameter improvement process with the Levenberg-Marquardt algorithm<\/p><p>$$ \\mathbf{w}^{(i+1)}=\\mathbf{w}^{(i)} - (2\\mathbf{J}^{(i)T}\\cdot \\mathbf{J}^{(i)}+ \\lambda^{(i)}\\mathbf{I})^{-1} \\cdot (2\\mathbf{J}^{(i)T}\\cdot \\mathbf{e}^{(i)}), $$<\/p><p>for $( i = 0,1,\\ldots )$.<\/p><p>When the damping parameter $(\\lambda)$ is zero, this is just Newton&#8217;s method using the approximate Hessian matrix. 
On the other hand, when $(\\lambda)$ is large, this becomes gradient descent with a small training rate.<\/p><p>The parameter $(\\lambda)$ is initialized to be large, so the first updates are small steps in the gradient descent direction.<br \/>If an iteration fails to reduce the loss, then $(\\lambda)$ is increased by some factor. Otherwise, $(\\lambda)$ is reduced as the loss decreases, so the Levenberg-Marquardt algorithm approaches Newton&#8217;s method. This process typically accelerates the convergence to the minimum.<\/p><p>The picture below represents a state diagram for the training process of a neural network with the Levenberg-Marquardt algorithm. The first step is calculating the loss, the gradient, and the Hessian approximation. Then, the damping parameter is adjusted to reduce the loss at each iteration.<\/p><p><img decoding=\"async\" src=\"https:\/\/www.neuraldesigner.com\/images\/levenberg_algorithm_big.webp\" alt=\"Levenberg-Marquardt algorithm diagram\" \/><\/p><p>As we have seen, the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LevenbergMarquardtAlgorithm\">Levenberg-Marquardt algorithm<\/a> is tailored to sum-of-squared-error functions.<br \/>That makes it very fast when training <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/neural-network\">neural networks<\/a> measured on that error.<\/p><p>However, this algorithm has some drawbacks. First, it cannot minimize functions such as the root mean squared error or the cross-entropy error. Also, the Jacobian matrix becomes enormous for big data sets and neural networks, requiring a lot of memory. 
Therefore, the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LevenbergMarquardtAlgorithm\">Levenberg-Marquardt algorithm<\/a> is not recommended for big data sets or neural networks.<\/p><\/section><section><h2>Performance comparison<\/h2><p>The following chart depicts the computational speed and the memory requirements of the training algorithms discussed in this post.<\/p><p><img decoding=\"async\" src=\"https:\/\/www.neuraldesigner.com\/images\/performance-comparison.svg\" alt=\"Performance comparison between algorithms\" \/><\/p><p>As we can see, the slowest training algorithm is usually <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#GradientDescent\">gradient descent<\/a>, but it is the one requiring the least memory.<\/p><p>On the other hand, the fastest one might be the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LevenbergMarquardtAlgorithm\">Levenberg-Marquardt algorithm<\/a>, but it usually requires a lot of memory.<\/p><p>A good compromise might be the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#QuasiNewtonMethod\">quasi-Newton method<\/a>.<\/p><\/section><section><h2>Conclusions<\/h2><p>If our <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/neural-network\">neural network<\/a> has many thousands of parameters, we can use <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#GradientDescent\">gradient descent<\/a> or <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#ConjugateGradient\">conjugate gradient<\/a>\u00a0to save memory.<\/p><p>If we have many neural networks to train with just a few thousand samples and a few hundred parameters, the best choice might be the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#LevenbergMarquardtAlgorithm\">Levenberg-Marquardt algorithm<\/a>.<\/p><p>In the rest of the 
situations, the <a href=\"https:\/\/www.neuraldesigner.com\/learning\/tutorials\/training-strategy#QuasiNewtonMethod\">quasi-Newton method<\/a> will work well.<\/p><p><a href=\"https:\/\/www.neuraldesigner.com\">Neural Designer<\/a> implements all those optimization algorithms. You can download the <a href=\"https:\/\/www.neuraldesigner.com\/free-trial\">free trial<\/a> to see how they work in practice.<\/p><\/section><section><h3>Related posts<\/h3><\/section>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"author":20,"featured_media":2664,"template":"","categories":[],"tags":[36],"class_list":["post-3368","blog","type-blog","status-publish","has-post-thumbnail","hentry","tag-tutorials"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>5 algorithms to train a neural network<\/title>\n<meta name=\"description\" content=\"Explore training algorithms for neural networks, from gradient descent to the Levenberg-Marquardt algorithm.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Understanding Neural Network Optimization Algorithms | Neural Designer Blog\" \/>\n<meta property=\"og:description\" content=\"Explore various optimization algorithms used in neural network learning, including gradient descent, Newton&#039;s method, conjugate gradient, quasi-Newton method, and Levenberg-Marquardt algorithm. 
Learn how these algorithms impact memory, speed, and precision in this Neural Designer blog post.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/\" \/>\n<meta property=\"og:site_name\" content=\"Neural Designer\" \/>\n<meta property=\"article:modified_time\" content=\"2025-11-27T14:17:47+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/algorithms-train-network.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"Understanding Neural Network Optimization Algorithms | Neural Designer Blog\" \/>\n<meta name=\"twitter:description\" content=\"Explore various optimization algorithms used in neural network learning, including gradient descent, Newton&#039;s method, conjugate gradient, quasi-Newton method, and Levenberg-Marquardt algorithm. Learn how these algorithms impact memory, speed, and precision in this Neural Designer blog post.\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/algorithms-train-network.webp\" \/>\n<meta name=\"twitter:site\" content=\"@NeuralDesigner\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"13 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/\",\"url\":\"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/\",\"name\":\"5 algorithms to train a neural network\",\"isPartOf\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/algorithms-train-network.webp\",\"datePublished\":\"2025-09-24T08:59:22+00:00\",\"dateModified\":\"2025-11-27T14:17:47+00:00\",\"description\":\"Explore training algorithms for neural networks, from gradient descent to the Levenberg-Marquardt 
algorithm.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/#primaryimage\",\"url\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/algorithms-train-network.webp\",\"contentUrl\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/algorithms-train-network.webp\",\"width\":1200,\"height\":628},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.neuraldesigner.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Blog\",\"item\":\"https:\/\/www.neuraldesigner.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"5 algorithms to train a neural network\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.neuraldesigner.com\/#website\",\"url\":\"https:\/\/www.neuraldesigner.com\/\",\"name\":\"Neural Designer\",\"description\":\"Explanable AI Platform\",\"publisher\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.neuraldesigner.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.neuraldesigner.com\/#organization\",\"name\":\"Neural 
Designer\",\"url\":\"https:\/\/www.neuraldesigner.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png\",\"contentUrl\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png\",\"width\":1024,\"height\":223,\"caption\":\"Neural Designer\"},\"image\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/NeuralDesigner\",\"https:\/\/es.linkedin.com\/showcase\/neuraldesigner\/\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"5 algorithms to train a neural network","description":"Explore training algorithms for neural networks, from gradient descent to the Levenberg-Marquardt algorithm.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/","og_locale":"en_US","og_type":"article","og_title":"Understanding Neural Network Optimization Algorithms | Neural Designer Blog","og_description":"Explore various optimization algorithms used in neural network learning, including gradient descent, Newton's method, conjugate gradient, quasi-Newton method, and Levenberg-Marquardt algorithm. 
Learn how these algorithms impact memory, speed, and precision in this Neural Designer blog post.","og_url":"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/","og_site_name":"Neural Designer","article_modified_time":"2025-11-27T14:17:47+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/algorithms-train-network.webp","type":"image\/webp"}],"twitter_card":"summary_large_image","twitter_title":"Understanding Neural Network Optimization Algorithms | Neural Designer Blog","twitter_description":"Explore various optimization algorithms used in neural network learning, including gradient descent, Newton's method, conjugate gradient, quasi-Newton method, and Levenberg-Marquardt algorithm. Learn how these algorithms impact memory, speed, and precision in this Neural Designer blog post.","twitter_image":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/algorithms-train-network.webp","twitter_site":"@NeuralDesigner","twitter_misc":{"Est. 
reading time":"13 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/","url":"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/","name":"5 algorithms to train a neural network","isPartOf":{"@id":"https:\/\/www.neuraldesigner.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/#primaryimage"},"image":{"@id":"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/#primaryimage"},"thumbnailUrl":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/algorithms-train-network.webp","datePublished":"2025-09-24T08:59:22+00:00","dateModified":"2025-11-27T14:17:47+00:00","description":"Explore training algorithms for neural networks, from gradient descent to the Levenberg-Marquardt algorithm.","breadcrumb":{"@id":"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/#primaryimage","url":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/algorithms-train-network.webp","contentUrl":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/algorithms-train-network.webp","width":1200,"height":628},{"@type":"BreadcrumbList","@id":"https:\/\/www.neuraldesigner.com\/blog\/5_algorithms_to_train_a_neural_network\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.neuraldesigner.com\/"},{"@type":"ListItem","position":2,"name":"Blog","item":"https:\/\/www.neuraldesigner.com\/blog\/"},{"@type":"ListItem","position":3,"name":"5 
algorithms to train a neural network"}]},{"@type":"WebSite","@id":"https:\/\/www.neuraldesigner.com\/#website","url":"https:\/\/www.neuraldesigner.com\/","name":"Neural Designer","description":"Explainable AI Platform","publisher":{"@id":"https:\/\/www.neuraldesigner.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.neuraldesigner.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.neuraldesigner.com\/#organization","name":"Neural Designer","url":"https:\/\/www.neuraldesigner.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png","contentUrl":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png","width":1024,"height":223,"caption":"Neural 
Designer"},"image":{"@id":"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/NeuralDesigner","https:\/\/es.linkedin.com\/showcase\/neuraldesigner\/"]}]}},"_links":{"self":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog\/3368","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/users\/20"}],"version-history":[{"count":1,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog\/3368\/revisions"}],"predecessor-version":[{"id":21404,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog\/3368\/revisions\/21404"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/media\/2664"}],"wp:attachment":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/media?parent=3368"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/categories?post=3368"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/tags?post=3368"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}