## Demystifying the inner workings of BFGS optimization

*17 min read · Nov 26, 2020*

Surely anyone who has dabbled in machine learning is familiar with gradient descent, and possibly even its close counterpart, stochastic gradient descent. If you have more than dabbled, then you’re surely also aware of the fancier extensions like gradient descent with momentum and Adam optimization.

Perhaps less well-known are a class of optimization algorithms known as quasi-Newton methods. Though these optimization methods are less fervently advertised in popular accounts of machine learning, they hold an important place in the arsenal of machine learning practitioners.

The goal of this article is to provide an introduction to the mathematical formulation of BFGS optimization, by far the most widely used quasi-Newton method. As such, the focus will be on the mathematical derivation of results, rather than the application of BFGS in code. It is my hope that by the end of this article, you will have gained an appreciation for what BFGS is, how it works, and why it was developed. A top priority of mine when crafting this article was to make it as accessible as possible. To this end, any non-trivial derivation will be explicitly shown, and the dreaded phrase “it is straightforward to show” will never make an appearance.

Instead of jumping right into quasi-Newton methods and BFGS, my strategy is to start off by doing a run-through of a few of the more basic optimization methods first, and explore their deficiencies. This would then provide a natural segue to quasi-Newton methods and how they aim to address these deficiencies. This might seem like a long-winded way of getting to the main topic, but I believe it is well-justified if we are to truly appreciate the motivation behind developing BFGS, and to get a sense of where it stands within the landscape of optimization methods.

So without further ado, let’s start from the very beginning, with gradient descent.

We begin with a lightning quick review of gradient descent, which is an iterative method for finding a local minimum of a real-valued, differentiable objective function *f*(**x**).

To find a local minimum, we start off at a random initial point and iteratively take steps proportional to the negative gradient of the function *f* at the current point. Since the gradient of a function points in the direction of steepest ascent, the *negative* gradient points in the direction of steepest *descent*, and thus at each step of gradient descent, we are moving in the direction where *f*(**x**) decreases the fastest. Symbolically, an iteration of gradient descent is written as

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\,\nabla f(\mathbf{x}_k), \tag{1}$$

where 𝛼 is a positive real number known as the learning rate, which controls the step size taken at each iteration. Too large of a learning rate and our model diverges out of control; too small of a learning rate and our model becomes highly inefficient, taking unnecessarily long to converge to the minimum. There is no a priori way of choosing a good learning rate that lies in the sweet spot between these two extremes, and in practice the optimal learning rate is typically found empirically, by looping through various values and seeing how they perform.

Looking at equation (1), we see that gradient descent is a first-order optimization method, as it uses first-order information (i.e., the gradient) to find the minimum. While this often reliably gets the job done, its main disadvantage lies in the fact that it is quite inefficient, even for a suitably chosen learning rate. By approximating our objective function linearly at each step, we are only working with very limited local information, and we thus have to be cautious and restrain ourselves to small step sizes at each iteration.
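To make equation (1) concrete, here is a minimal gradient descent sketch in Python; the learning rate, tolerance, and test function below are illustrative choices, not prescriptions:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    # Iterate x_{k+1} = x_k - lr * grad(x_k) until the gradient is tiny.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - lr * g
    return x

# Illustrative objective: f(x) = (x - 3)^2, with gradient 2(x - 3); minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3.0), x0=[0.0])
```

Halving or doubling `lr` here is a quick way to observe the sensitivity to the learning rate discussed above.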

Perhaps we can do better, by obtaining more local information of the objective function at each iteration, in hopes that we can make more well-informed steps. A natural extension would be to look at the second-order behavior of the objective function.

For simplicity, let’s first consider a function *f* of one variable. Consulting Taylor, we know that the second-order approximation of *f*(*xᴋ*+ϵ) about the point *xᴋ* is given by

$$f(x_k + \epsilon) \approx f(x_k) + f'(x_k)\,\epsilon + \frac{1}{2} f''(x_k)\,\epsilon^2.$$

Our Taylor approximation is minimized when

$$\frac{d}{d\epsilon}\left[f(x_k) + f'(x_k)\,\epsilon + \frac{1}{2} f''(x_k)\,\epsilon^2\right] = f'(x_k) + f''(x_k)\,\epsilon = 0,$$

which corresponds to a step size of

$$\epsilon = -\frac{f'(x_k)}{f''(x_k)}.$$

We can thus try a new iteration scheme

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)},$$

which also takes into account the second-order behavior of the objective function.
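As a sketch, this one-dimensional Newton scheme takes only a few lines of Python; the quartic test function is an illustrative choice:

```python
def newton_1d(fprime, fsecond, x0, tol=1e-10, max_iter=100):
    # Iterate x_{k+1} = x_k - f'(x_k)/f''(x_k); assumes f'' > 0 near the minimum.
    x = float(x0)
    for _ in range(max_iter):
        step = fprime(x) / fsecond(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Illustrative objective: f(x) = x^4, with f'(x) = 4x^3 and f''(x) = 12x^2; minimum at x = 0.
x_min = newton_1d(lambda x: 4 * x**3, lambda x: 12 * x**2, x0=2.0)
```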

Generalizing to *n* dimensions, the first derivative is replaced by the gradient ∇*f* and the second derivative is replaced by the *Hessian* *H*.

In *n* dimensions, our new iterative scheme is thus written as

$$\mathbf{x}_{k+1} = \mathbf{x}_k - H_k^{-1}\,\nabla f(\mathbf{x}_k). \tag{2}$$

This method of optimization, where we take into account the objective function’s second-order behavior in addition to its first-order behavior, is known as Newton’s method. In each iteration *k*, Newton’s method approximates the function *f* at the point **x***ᴋ* with a paraboloid, and then proceeds to minimize that approximation by stepping to the minimum of that paraboloid (there is actually a caveat here that we will get to later). Notice that in contrast to regular gradient descent, there is no longer a need to set a learning rate parameter, because now our step size is determined exactly by the distance to the minimum of the fitted paraboloid at that point.
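In code, a single *n*-dimensional Newton step might look like the following sketch (the quadratic objective is an illustrative choice; note that we solve the linear system rather than explicitly forming the inverse Hessian):

```python
import numpy as np

def newton_step(grad, hess, x):
    # One Newton update: x - H^{-1} grad(x), computed by solving H p = grad(x).
    return x - np.linalg.solve(hess(x), grad(x))

# Illustrative convex quadratic f(x) = 0.5 x^T A x - b^T x:
# gradient A x - b, Hessian A, minimum where A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
# For a quadratic, a single Newton step lands exactly on the minimum.
```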

To see how much we stand to gain from computing the Hessian, consider figure 1 below where we want to minimize our loss function, starting from the point (-5, 5). For a suitably chosen learning rate, gradient descent takes 229 steps to converge to the minimum. On the other hand, Newton’s method converges to the minimum in only six steps!

If this is your first time learning about Newton’s method, you’re probably just as shocked as I was when it was *my* first time. “Surely this is too good to be true!” you might exclaim. Why ever use gradient descent if Newton’s method converges to the minimum so much faster?

It turns out that the benefits we gain from Newton’s method come at a cost. There are two main issues with Newton’s method:

- It is sensitive to initial conditions. This is especially apparent if our objective function is non-convex. Unlike gradient descent, which ensures that we’re always going downhill by always moving in the direction opposite the gradient, Newton’s method fits a paraboloid to the local curvature and proceeds to move to the stationary point of that paraboloid. Depending on the local behavior of our initial point, Newton’s method could take us to a maximum or a saddle point instead of a minimum. This problem is exacerbated in higher-dimensional non-convex functions, where saddle points are much more likely to occur than minima. We thus see that Newton’s method is really only appropriate for minimizing convex objective functions. The silver lining, though, is that a considerable number of optimization problems in machine learning are convex, such as linear (and ridge) regression, logistic regression, and support vector machines. An obvious example of a non-convex model is a neural network. For the rest of this article, we will restrict our attention to convex objective functions; correspondingly, we require the Hessian to be positive-definite.
- Newton’s method is very computationally expensive. While the computation of the gradient scales as *O*(*n*), the computation of the inverse Hessian scales as *O*(*n*³) (computing the Hessian scales as *O*(*n*²), and inverting it scales as *O*(*n*³)). As the dimension of our problem increases, the overhead in memory and time gets out of hand very quickly. For example, in 50 dimensions, we’ll have to calculate 50(50+1)/2 = 1275 values for the Hessian at each step, and then perform approximately another 50³ operations to invert it. It’s clear at this point that the benefit of an increased convergence rate will be far outweighed by the large cost of the additional computation time.

Despite its limited practical use, Newton’s method is still of great theoretical value. We did see how efficient second-order optimization can be if used correctly. What if we could somehow leverage the efficiency gained from considering second-order behavior, but avoid the computational cost of calculating the inverse Hessian? In other words, can we get a sort of hybrid between gradient descent and Newton’s method, where we can have faster convergence than gradient descent, but lower operational cost per iteration than Newton’s method?

It turns out there’s a class of optimization methods, called *quasi-Newton methods*, that does just that.

We went through Newton’s method for optimization, which, in contrast to vanilla gradient descent, leverages second-order behavior in addition to first-order behavior at each step, making for a much faster convergence to the minimum. However, we also saw the downsides to Newton’s method — one of which is how computationally expensive it is to both calculate the Hessian and to invert it, especially when dimensions get large. Quasi-Newton methods are a class of optimization methods that attempt to address this issue.

Recall that in Newton’s method, we make the following update at each iteration:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - H_k^{-1}\,\nabla f(\mathbf{x}_k),$$

where the Hessian is computed and inverted at each step. In quasi-Newton methods, instead of computing the actual Hessian, we just *approximate it* with a positive-definite matrix *B*, which is updated from iteration to iteration using information computed from previous steps (we require *B* to be positive-definite because we are optimizing a convex function, and this automatically takes care of the symmetry requirement of the Hessian). We immediately see that this scheme would yield a much less costly algorithm compared to Newton’s method, because instead of computing a large number of new quantities at each iteration, we’re largely making use of previously computed quantities.

At this stage the nature of the update scheme for *B* has been left vague, because the specific update for *B* is given by the specific quasi-Newton method used. There is however one condition that all quasi-Newton methods must share, and that is the Hessian approximation *B* must satisfy the *quasi-Newton condition* (or *secant equation*):

$$\nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k) = B_{k+1}\left(\mathbf{x}_{k+1} - \mathbf{x}_k\right), \tag{3}$$

which is obtained from the first-order Taylor expansion of ∇*f* about **x***ᴋ*, evaluated at **x***ᴋ*₊₁ (we can also view this as a sort of finite-difference equation for the gradient itself). We can rewrite the quasi-Newton condition more succinctly by letting **y***ᴋ* = ∇*f*(**x***ᴋ*₊₁) − ∇*f*(**x***ᴋ*) and Δ**x***ᴋ* = **x***ᴋ*₊₁ − **x***ᴋ*, so that we have

$$B_{k+1}\,\Delta\mathbf{x}_k = \mathbf{y}_k. \tag{4}$$

Furthermore, pre-multiplying the quasi-Newton condition by Δ**x***ᴋ*ᵀ gives Δ**x***ᴋ*ᵀ *Bᴋ*₊₁ Δ**x***ᴋ* = Δ**x***ᴋ*ᵀ **y***ᴋ*, so our requirement of positive-definiteness (i.e., the *curvature condition*) can be expressed as Δ**x***ᴋ*ᵀ **y***ᴋ* > 0.
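For intuition, we can check the curvature condition numerically on a hypothetical convex quadratic, where it always holds; the matrix and points below are arbitrary illustrative choices:

```python
import numpy as np

# Illustrative convex quadratic f(x) = 0.5 x^T A x with positive-definite A:
# here y_k = A @ dx, so dx^T y_k = dx^T A dx > 0 for any non-zero step,
# and the curvature condition holds automatically.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = lambda x: A @ x

x_k = np.array([1.0, -2.0])
x_k1 = np.array([0.3, 0.4])
dx = x_k1 - x_k
y = grad(x_k1) - grad(x_k)
curvature = dx @ y  # positive for this convex objective
```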

Before we go further, however, let’s take a step back and consider only one dimension, where our intuition is strong. The secant equation, in one dimension, is

$$f''_{k+1}\left(x_{k+1} - x_k\right) = f'(x_{k+1}) - f'(x_k),$$

and the curvature condition is satisfied by requiring

$$\frac{f'(x_{k+1}) - f'(x_k)}{x_{k+1} - x_k} > 0.$$

Solving for *f*″ and substituting into Newton’s method in one dimension, we get (after relabeling indices so that the new iterate is *xᴋ*₊₁)

$$x_{k+1} = x_k - \frac{x_k - x_{k-1}}{f'(x_k) - f'(x_{k-1})}\,f'(x_k).$$

We thus have here an optimization method that leverages the (approximate) second-order behavior of the objective function in order to converge faster, without actually taking any second derivatives. Instead, at each iteration *k*+1, we construct an approximate inverse second derivative using only quantities from previous steps — in this case the first derivative from the previous two iterations *k* and *k*−1.

This method, which approximates Newton’s method in one dimension by replacing the second derivative with its finite-difference approximation, is known as the secant method, a subclass of quasi-Newton methods.
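A minimal sketch of the secant method; the objective and the two starting points are illustrative choices:

```python
def secant_minimize(fprime, x0, x1, tol=1e-10, max_iter=100):
    # Newton's method in 1-D with f'' replaced by its finite-difference
    # approximation, built from the two most recent first derivatives.
    for _ in range(max_iter):
        g0, g1 = fprime(x0), fprime(x1)
        if abs(g1 - g0) < 1e-15:
            break
        x2 = x1 - g1 * (x1 - x0) / (g1 - g0)
        x0, x1 = x1, x2
        if abs(x1 - x0) < tol:
            break
    return x1

# Illustrative objective: f(x) = (x - 2)^2 + 1, with f'(x) = 2(x - 2); minimum at x = 2.
x_min = secant_minimize(lambda x: 2 * (x - 2.0), x0=0.0, x1=1.0)
```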

Now back to our *n*-dimensional quasi-Newton condition (equation 4). At first glance, it looks like we could just work analogously to our one-dimensional case: we can simply solve for *Bᴋ*₊₁ directly, and substitute it into Newton’s iterative step (2). Job done, right?

Not quite, actually. Despite looking deceptively similar to our one-dimensional case, remember that *B* is in general a symmetric *n* × *n* matrix, with *n*(*n*+1)/2 components, whereas our equation only has *n* components. This means that we’re trying to solve for *n*(*n*+1)/2 unknowns with only *n* equations, making this an underdetermined system. In fact, we were only able to find a unique solution to the secant equation in one dimension because the number of unknown components of the Hessian, 1(1+1)/2 = 1, coincides with the single equation that we have. In general, there are *n*(*n*+1)/2 − *n* = *n*(*n*−1)/2 unknowns that we cannot solve for.

This is where quasi-Newton methods come in, where the secant method is generalized to multidimensional objective functions. Instead of approximating the second derivative merely by using the finite difference like in the secant method, quasi-Newton methods have to impose additional constraints. The common theme still runs through though — at each iteration *k*+1, the new Hessian approximation *Bᴋ*₊₁ is obtained using only previous *gradient* information.

Various quasi-Newton methods have been developed over the years, and they differ in how the approximate Hessian *B* is updated at each iteration. As of today, the most widely used quasi-Newton method is the BFGS method, and this will be our focus for the remainder of this article.

It’s been somewhat of a long trek so far, so let’s pause for a moment and do a quick recap before moving on. Our objective is to find the minimum of a (twice-differentiable) convex function. A simple approach to this is gradient descent — starting from some initial point, we slowly move downhill by taking iterative steps proportional to the negative gradient of the function at each point. Newton then taught us that we can take far fewer steps and converge much more quickly to the minimum if we also consider the second-order behavior of the function, by computing the (inverse) Hessian at each step. This comes at a cost, however, as calculating (and inverting) the Hessian takes a lot of resources. A compromise is to instead just *approximate* the Hessian at each step, subject to the quasi-Newton condition. In one dimension, this just amounts to approximating the second derivative by replacing it with a finite-difference approximation; this method of optimization is called the secant method. In more than one dimension, the quasi-Newton condition does not uniquely specify the Hessian estimate *B*, and we need to impose further constraints on *B* to solve for it. Different quasi-Newton methods offer their own way of constraining the solution. Here, we will focus on one of the most popular methods, known as the BFGS method. The name is an acronym of the algorithm’s creators: Broyden, Fletcher, Goldfarb, and Shanno, who each came up with the algorithm independently in 1970 [7–10].

We saw previously that in *n* > 1 dimensions, the quasi-Newton condition (4) is underdetermined. To determine an update scheme for *B*, then, we will need to impose additional constraints. One such constraint that we’ve already mentioned is the symmetry and positive-definiteness of *B*: these properties should be preserved in each update. Another desirable property we want is for *Bᴋ*₊₁ to be sufficiently close to *Bᴋ* at each update *k*+1. A formal way to characterize the notion of “closeness” for matrices is the matrix norm. Thus, we should look to minimize the quantity ||*Bᴋ*₊₁ − *Bᴋ*||. Putting all our conditions together, we can formulate our problem as

$$\min_{B_{k+1}} \left\lVert B_{k+1} - B_k \right\rVert \quad \text{subject to} \quad B_{k+1} = B_{k+1}^{\mathsf{T}}, \quad B_{k+1}\,\Delta\mathbf{x}_k = \mathbf{y}_k. \tag{5}$$

Referring back to Newton’s method, we recall that it’s actually the Hessian’s inverse, rather than the Hessian itself, that makes an appearance. So instead of computing the approximate Hessian *B*, we can just as well compute the approximate inverse *B*⁻¹ directly. We can thus alternatively formulate (5) as

$$\min_{B_{k+1}^{-1}} \left\lVert B_{k+1}^{-1} - B_k^{-1} \right\rVert \quad \text{subject to} \quad B_{k+1}^{-1} = \left(B_{k+1}^{-1}\right)^{\mathsf{T}}, \quad B_{k+1}^{-1}\,\mathbf{y}_k = \Delta\mathbf{x}_k. \tag{6}$$

(Notice the inverted quasi-Newton condition.) To repeat, we minimize the change in *B*⁻¹ at each iteration, subject to the (inverted) quasi-Newton condition and the requirement that it be symmetric. We have also previously mentioned many times the requirement that *B* (and by extension *B*⁻¹) be positive-definite; it turns out this property comes along for the ride when we solve this problem. We will show this eventually; in the meantime, I’ll have to ask you to hold your breath.

To solve for *Bᴋ*₊₁, we still have to specify the particular matrix norm we’re using in (6). Different choices of norm result in different quasi-Newton methods, characterized by different resulting update schemes. In the BFGS method, the norm is chosen to be a weighted version of the Frobenius norm

$$\lVert A \rVert_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} \left|a_{ij}\right|^2},$$

which is just the square root of the sum of the squared absolute values of the matrix elements.

Solving (6) from scratch is no easy feat, and we will not go through it here. For the highly mathematically inclined reader, references [4–6] give detailed derivations. For our purposes, it suffices to know that the problem ends up being equivalent to updating our approximate Hessian at each iteration by adding two symmetric, rank-one matrices *U* and *V*:

$$B_{k+1} = B_k + U_k + V_k. \tag{7}$$

To fulfill the aforementioned conditions, the update matrices can be chosen to be of the form *U* = *a* **u u**ᵀ and *V* = *b* **v v**ᵀ, where **u** and **v** are linearly independent non-zero vectors, and *a* and *b* are constants. The outer product of any non-zero vector with itself is always rank one, and since *U* and *V* are each the outer product of a vector with itself, they are also symmetric, taking care of the requirement that the Hessian approximation remains symmetric upon updates. We then have

$$B_{k+1} = B_k + a\,\mathbf{u}\mathbf{u}^{\mathsf{T}} + b\,\mathbf{v}\mathbf{v}^{\mathsf{T}}.$$
Since *Uᴋ* and *Vᴋ* are rank-one (and **u** and **v** are linearly independent), their sum is rank-two, and an update of this form is known as a rank-two update. Thus if we start out with a symmetric matrix *B*₀, then *Bᴋ* will remain symmetric for all *k*. The rank-two condition guarantees the “closeness” of *Bᴋ* and *Bᴋ*₊₁ at each iteration.
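These rank and symmetry claims are easy to sanity-check numerically; the random vectors below are arbitrary illustrative choices (generic draws are linearly independent):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(4)
v = rng.standard_normal(4)

U = np.outer(u, u)   # outer product of a vector with itself: rank one, symmetric
V = np.outer(v, v)
rank_U = np.linalg.matrix_rank(U)
rank_sum = np.linalg.matrix_rank(U + V)  # two independent rank-one terms: rank two
symmetric = np.allclose(U, U.T) and np.allclose(U + V, (U + V).T)
```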

Furthermore, we have to impose the quasi-Newton condition *Bᴋ*₊₁ Δ**x***ᴋ* = **y***ᴋ*:

$$B_k\,\Delta\mathbf{x}_k + a\,\mathbf{u}\left(\mathbf{u}^{\mathsf{T}}\Delta\mathbf{x}_k\right) + b\,\mathbf{v}\left(\mathbf{v}^{\mathsf{T}}\Delta\mathbf{x}_k\right) = \mathbf{y}_k.$$

At this point, a natural choice for **u** and **v** would be **u** = **y***ᴋ* and **v** = *Bᴋ* Δ**x***ᴋ*. We then have

$$a = \frac{1}{\mathbf{y}_k^{\mathsf{T}}\,\Delta\mathbf{x}_k}, \qquad b = -\frac{1}{\Delta\mathbf{x}_k^{\mathsf{T}}\,B_k\,\Delta\mathbf{x}_k}.$$

Substituting *a* and *b* back into (7), we have the BFGS update in all its glory:

$$B_{k+1} = B_k + \frac{\mathbf{y}_k\,\mathbf{y}_k^{\mathsf{T}}}{\mathbf{y}_k^{\mathsf{T}}\,\Delta\mathbf{x}_k} - \frac{B_k\,\Delta\mathbf{x}_k\,\Delta\mathbf{x}_k^{\mathsf{T}}\,B_k}{\Delta\mathbf{x}_k^{\mathsf{T}}\,B_k\,\Delta\mathbf{x}_k}. \tag{8}$$

As advertised, we are updating the approximate Hessian at each iteration using only previous *gradient* information. Note, however, that in practice we actually want to compute *B*⁻¹ directly, because it’s the inverse Hessian that appears in a Newton update. To invert (8), we make use of the Woodbury formula, which tells us how to invert the sum of an invertible matrix *A* and a rank-*k* correction:

$$\left(A + UCV\right)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + VA^{-1}U\right)^{-1}VA^{-1}.$$

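The Woodbury formula is easy to verify numerically on a small hypothetical example; the matrices below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # a well-conditioned invertible matrix
U = 0.5 * rng.standard_normal((n, k))
C = np.eye(k)
V = 0.5 * rng.standard_normal((k, n))

Ainv = np.linalg.inv(A)
# Woodbury: (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
woodbury = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv
direct = np.linalg.inv(A + U @ C @ V)
```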
We can thus obtain the formula for a BFGS update for the inverse of *B*. We first rewrite the update (8) in a form more suitable for applying the Woodbury formula (to avoid clutter, we’ll suppress the subscripts *k*+1 and *k*, instead denoting the update as *B*₊):

$$B_+ = B + \begin{bmatrix}\mathbf{y} & B\,\Delta\mathbf{x}\end{bmatrix} \begin{bmatrix}\dfrac{1}{\mathbf{y}^{\mathsf{T}}\Delta\mathbf{x}} & 0\\[6pt] 0 & -\dfrac{1}{\Delta\mathbf{x}^{\mathsf{T}}B\,\Delta\mathbf{x}}\end{bmatrix} \begin{bmatrix}\mathbf{y}^{\mathsf{T}}\\ \Delta\mathbf{x}^{\mathsf{T}}B\end{bmatrix}.$$

Plugging into the Woodbury formula, we obtain

$$B_+^{-1} = \left(I - \frac{\Delta\mathbf{x}\,\mathbf{y}^{\mathsf{T}}}{\mathbf{y}^{\mathsf{T}}\Delta\mathbf{x}}\right) B^{-1} \left(I - \frac{\mathbf{y}\,\Delta\mathbf{x}^{\mathsf{T}}}{\mathbf{y}^{\mathsf{T}}\Delta\mathbf{x}}\right) + \frac{\Delta\mathbf{x}\,\Delta\mathbf{x}^{\mathsf{T}}}{\mathbf{y}^{\mathsf{T}}\Delta\mathbf{x}}. \tag{9}$$

So while equation (8) is cleaner and allows for more straightforward analysis, equation (9) is what we actually want to compute in practice. Again, each update requires only previous *gradient* information: at each iteration, we update the value of *B*⁻¹ using only the values of Δ**x***ᴋ* = **x***ᴋ*₊₁ − **x***ᴋ* and **y***ᴋ* = ∇*f*(**x***ᴋ*₊₁) − ∇*f*(**x***ᴋ*) from the previous steps, in accordance with equation (9). By directly estimating the inverse Hessian at each step, we completely do away with the laborious *O*(*n*³) operations of inverting the Hessian, as in Newton’s method.
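Putting the pieces together, here is a minimal BFGS sketch built around the inverse update of equation (9). The Armijo backtracking line search, the curvature-condition guard, and the choice *B*₀⁻¹ = *I* are common practical choices rather than part of the core derivation, and the quadratic test problem is illustrative:

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    # Minimal BFGS sketch: maintain an approximate inverse Hessian Binv and
    # refresh it with equation (9)'s rank-two update at each step.
    x = np.asarray(x0, dtype=float)
    n = x.size
    Binv = np.eye(n)  # B0^{-1} = I: the very first step is a gradient step
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -Binv @ g                       # quasi-Newton search direction
        t = 1.0                             # Armijo backtracking line search
        while f(x + t * p) > f(x) + 1e-4 * t * (g @ p):
            t *= 0.5
        x_new = x + t * p
        g_new = grad(x_new)
        dx, y = x_new - x, g_new - g
        if y @ dx > 1e-12:                  # curvature condition; else skip update
            rho = 1.0 / (y @ dx)
            I = np.eye(n)
            Binv = (I - rho * np.outer(dx, y)) @ Binv @ (I - rho * np.outer(y, dx)) \
                   + rho * np.outer(dx, dx)
        x, g = x_new, g_new
    return x

# Illustrative convex quadratic f(x) = 0.5 x^T A x - b^T x, minimized where A x = b.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_min = bfgs(lambda x: 0.5 * x @ A @ x - b @ x, lambda x: A @ x - b, x0=np.zeros(2))
```

Production implementations typically add Wolfe-condition line searches and further safeguards, but the core of the algorithm is exactly this rank-two update.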

We are now also in a position to show that our rank-two update preserves positive-definiteness. From (9), we can calculate the quantity

$$\mathbf{z}^{\mathsf{T}} B_+^{-1}\,\mathbf{z} = \mathbf{w}^{\mathsf{T}} B^{-1}\,\mathbf{w} + \frac{\left(\Delta\mathbf{x}^{\mathsf{T}}\mathbf{z}\right)^2}{\mathbf{y}^{\mathsf{T}}\Delta\mathbf{x}}, \qquad \text{where } \mathbf{w} = \mathbf{z} - \frac{\Delta\mathbf{x}^{\mathsf{T}}\mathbf{z}}{\mathbf{y}^{\mathsf{T}}\Delta\mathbf{x}}\,\mathbf{y},$$

for a non-zero vector **z**. If *Bᴋ*⁻¹ is positive-definite, then both terms are non-negative (recall the curvature condition **y**ᵀΔ**x** > 0), with the second term being zero only if Δ**x**ᵀ**z** = 0. Positive-definiteness is thus preserved by BFGS’s rank-two update. If we were to go back to (7) and repeat the development but with a rank-one update instead, we wouldn’t end up with a positive-definiteness-preserving update [4, 5]. This explains why BFGS uses a rank-two update: it is the update with the lowest rank that preserves positive-definiteness.

One final implementation detail that we previously glossed over: since an update of *B*⁻¹ depends on its previous value, we have to initialize *B*₀⁻¹ to kick off the algorithm. There are two natural ways to do this. The first approach is to set *B*₀⁻¹ to the identity matrix, in which case the first step will be equivalent to vanilla gradient descent, and subsequent updates will incrementally refine it to get closer to the inverse Hessian. Another approach would be to compute and invert the true Hessian at the initial point. This would start the algorithm off with more efficient steps, but comes at an initial cost of computing the true Hessian and inverting it.

Being an introductory piece, the aim of this discussion was to present quasi-Newton methods and BFGS in a manner that is as accessible as possible. As such, we’ve really only just skimmed the surface of all the mathematical intricacies and performance considerations that underlie BFGS and other quasi-Newton methods. There is a wealth of resources out there on this subject, a very small selection of which I have included in the references below. I encourage readers who are interested in diving deeper to seek out these resources. Be forewarned though — some of these references aren’t easy!

Thank you for making it all the way to the end! Hopefully you’ve gained an appreciation for how BFGS works under the hood, so that next time you call **scipy.optimize.minimize(method='BFGS')**, instead of viewing it as a complete black box, you’ll smile to yourself knowingly, fully aware that BFGS is simply iteratively making positive-definite-preserving rank-two updates to the loss function’s approximate inverse Hessian, to efficiently find you its minimum.
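For instance, a minimal scipy call might look like this; the quadratic objective is an illustrative stand-in for a real loss function:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative convex objective with minimum at (1, -2).
f = lambda x: (x[0] - 1.0)**2 + 10.0 * (x[1] + 2.0)**2
result = minimize(f, x0=np.zeros(2), method='BFGS')  # gradient estimated numerically
x_min = result.x  # approximately (1, -2)
```

Passing an analytic gradient via the `jac` argument avoids the finite-difference gradient estimates.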

**Cursory run-throughs:**

[1] BFGS on Wikipedia

[2] Quasi-Newton methods on Wikipedia

[3] Newton’s method on Wikipedia

**Advanced references on BFGS:**

[4] J. Nocedal and S. Wright, *Numerical Optimization* (Chapter 6), Springer, 2nd edition, 2006.

[5] Oxford University lecture notes by R. Hauser

[6] MIT 18.335 lecture notes by S. G. Johnson

**The original papers:**

[7] C. G. Broyden, “The convergence of a class of double-rank minimization algorithms”, *J. Inst. Math. Appl.* **6**, 76–90 (1970)

[8] R. Fletcher, “A new approach to variable metric algorithms”, *Comp. J.* **13**, 317–322 (1970)

[9] D. Goldfarb, “A family of variable-metric methods derived by variational means”, *Math. Comp.* **24**, 23–26 (1970)

[10] D. F. Shanno, “Conditioning of quasi-Newton methods for function minimization”, *Math. Comp.* **24**, 647–656 (1970)