 
QUASI-NEWTON METHODS
David F. Gleich
February 29, 2012

The material here is from Chapter 6 in Nocedal and Wright, and Section 12.3 in Griva, Sofer, and Nash.

The idea behind Quasi-Newton methods is to make an optimization algorithm with only a function value and gradient converge more quickly than steepest descent. That is, a Quasi-Newton method does not require a means to evaluate the Hessian matrix at the current iterate, as in a Newton method. Instead, the algorithm constructs a matrix that resembles the Hessian as it proceeds. In fact, there are many ways of doing this, and so there is really a family of Quasi-Newton methods.

1 quasi-newton in one variable: the secant method

In a one-dimensional problem, approximating the Hessian simplifies to approximating the second derivative:

$$ f''(x) \approx \frac{f'(x+h) - f'(x)}{h}. $$

Thus, the fact that this is possible is not unreasonable. Using a related approximation in a one-dimensional optimization algorithm results in a procedure called the Secant method:

$$ \underbrace{x_{k+1} = x_k - \frac{1}{f''(x_k)} f'(x_k)}_{\text{one-dimensional Newton}} \quad\to\quad x_{k+1} = x_k - \underbrace{\frac{x_k - x_{k-1}}{f'(x_k) - f'(x_{k-1})}}_{\approx\, 1/f''(x_k)} f'(x_k). $$

This new update tries to approximate the Newton update by approximating the second-derivative information. The secant method converges superlinearly under appropriate conditions, so this idea checks out in one dimension.

2 quasi-newton in general

Quasi-Newton methods are line-search methods that compute the search direction by trying to approximate the Newton direction "$H(x_k) p = -g$" without using the matrix $H(x_k)$. They work by computing a matrix $B_k$ that "behaves like" $H(x_k)$.
Once we compute $x_{k+1}$, we update $B_k \to B_{k+1}$. Thus, a Quasi-Newton method has the general iteration:

    initialize $B_0$, and $k = 0$
    for $k = 0, \ldots$ and while $x_k$ does not satisfy the conditions we want ...
        solve for the search direction $B_k p_k = -g_k$
        compute a line search $\alpha_k$
        update $x_{k+1} = x_k + \alpha_k p_k$
        update $B_{k+1}$ based on $x_{k+1}$

We can derive different Quasi-Newton methods by changing how we update $B_{k+1}$ from $B_k$.
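To make the loop concrete, here is a minimal sketch of the general iteration specialized to one variable, where $B_k$ is a scalar $b_k$ and the linear solve is a division. It assumes a unit step ($\alpha_k = 1$, no line search) and refreshes $b_k$ with the secant slope from Section 1, so it reduces to the secant method applied to $f'$. The function name `quasi_newton_1d`, the starting value `b0 = 1`, and the test problem $f(x) = e^x - 2x$ are choices made for this illustration only.

```python
import math


def quasi_newton_1d(fprime, x0, b0=1.0, tol=1e-10, max_iter=50):
    """One-variable quasi-Newton iteration with a unit step (alpha_k = 1).

    The scalar b plays the role of B_k. It is refreshed with the secant
    slope (f'(x_{k+1}) - f'(x_k)) / (x_{k+1} - x_k).
    """
    x, b = x0, b0
    g = fprime(x)
    for _ in range(max_iter):
        if abs(g) <= tol:        # stop once the derivative is small enough
            break
        p = -g / b               # "solve" B_k p_k = -g_k (a scalar division)
        x_new = x + p            # unit step; a real method would line-search
        g_new = fprime(x_new)
        if g_new != g:           # keep the old slope if the secant is flat
            b = (g_new - g) / (x_new - x)
        x, g = x_new, g_new
    return x


# Illustrative problem: f(x) = exp(x) - 2x, whose minimizer is x = ln 2.
def fprime(x):
    return math.exp(x) - 2.0


x_star = quasi_newton_1d(fprime, x0=0.0)
```

This sketch has no safeguard against a negative or tiny `b`, which is one reason practical methods add a line search and conditions on the update.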
3 the secant condition

While there are many ways of updating $B_{k+1}$ from $B_k$, a random choice is unlikely to provide any benefit, and may make things considerably worse. Thus, we want to start from a principled approach. Recall that the Newton direction $H_k p_k = -g_k$ arises as the unconstrained minimizer of

$$ m^N_k(p) = f_k + g_k^T p + \tfrac{1}{2} p^T H_k p $$

when $H_k$ is positive definite. The model for Quasi-Newton methods uses $B_k$ instead of $H_k$:

$$ m^Q_k(p) = f_k + g_k^T p + \tfrac{1}{2} p^T B_k p, $$

so one common requirement for $B_k$ is that it remains positive definite. This requirement is relaxed for some Quasi-Newton methods. However, all Quasi-Newton methods require:

$$ \nabla m^Q_{k+1}(0) = g(x_{k+1}) \quad\text{and}\quad \nabla m^Q_{k+1}(-\alpha_k p_k) = g(x_k). $$

In other words, the model function $m^Q_{k+1}(p)$ has the same gradient as $f$ at $x_k$ and $x_{k+1}$. Because $\nabla m^Q_{k+1}(p) = g(x_{k+1}) + B_{k+1} p$, the first condition holds automatically. The second imposes a condition on $B_{k+1}$:

$$ \nabla m^Q_{k+1}(-\alpha_k p_k) = g(x_{k+1}) - \alpha_k B_{k+1} p_k = g(x_k) \quad\to\quad B_{k+1} \alpha_k p_k = g(x_{k+1}) - g(x_k). $$

Note that $\alpha_k p_k = x_{k+1} - x_k$. If we define

$$ s_k = x_{k+1} - x_k \quad\text{and}\quad y_k = g(x_{k+1}) - g(x_k), $$

then Quasi-Newton methods require

$$ B_{k+1} s_k = y_k, $$

which is called the secant condition. If we write this out for a one-dimensional problem:

$$ b_{k+1} (x_{k+1} - x_k) = f'(x_{k+1}) - f'(x_k). $$

This equation is identical to the approximation of $f''(x_k)$ used in the secant method.

Quiz: Is it always possible to find such a $B_{k+1}$? Suppose that $B_k$ is symmetric, positive definite. Show that we need $y_k^T s_k > 0$ in order for $B_{k+1}$ to be positive definite. If $B_k = 1$ for a one-dimensional problem, find a function where this isn't true.

4 finding the update

We are getting closer to figuring out how to find such an update. There are many ways to derive the following updates; I'll just list them and state their properties.

4.1 DAVIDON, FLETCHER, POWELL (DFP)

Let $\rho_k = \dfrac{1}{y_k^T s_k}$. The DFP update is

$$ B_{k+1} = (I - \rho_k y_k s_k^T) B_k (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T. $$

Clearly this matrix is symmetric when $B_k$ is. Also, $B_{k+1}$ is positive definite when $B_k$ is positive definite and $y_k^T s_k > 0$.

Quiz: Show that $B_{k+1}$ is positive definite.

This choice of $B_{k+1}$ has the following optimality property: it solves

$$ \min_{B} \; \| B - B_k \|_W \quad\text{subject to}\quad B^T = B, \;\; B s_k = y_k, $$

where $W$ is a weight based on the average Hessian.
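The DFP formula can be checked numerically on a small example. The sketch below expands the product $(I - \rho_k y_k s_k^T) B_k (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T$ entry by entry for a symmetric $B_k$, then verifies the secant condition and positive definiteness on a 2-by-2 case. The helper names (`dot`, `matvec`, `dfp_update`) and the data `B0`, `s`, `y` are made up for this illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, u):
    return [dot(row, u) for row in M]


def dfp_update(B, s, y):
    """DFP update of a symmetric B (nested lists).

    Entry-wise expansion of (I - rho y s^T) B (I - rho s y^T) + rho y y^T,
    using s^T B = (B s)^T, which is valid because B is symmetric.
    """
    ys = dot(y, s)
    if ys <= 0.0:
        raise ValueError("curvature condition y^T s > 0 is violated")
    rho = 1.0 / ys
    Bs = matvec(B, s)
    c = rho * rho * dot(s, Bs) + rho   # coefficient of y y^T
    n = len(s)
    return [[B[i][j] - rho * (y[i] * Bs[j] + Bs[i] * y[j]) + c * y[i] * y[j]
             for j in range(n)] for i in range(n)]


# Illustrative data: B_0 = I with a step s and gradient change y, y^T s = 2 > 0.
B0 = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 0.0]
y = [2.0, 1.0]
B1 = dfp_update(B0, s, y)              # [[2.0, 1.0], [1.0, 1.75]]

# Secant condition: B1 s - y should vanish.
secant_residual = [a - b for a, b in zip(matvec(B1, s), y)]

# A symmetric 2x2 matrix is positive definite iff both leading minors are > 0.
is_pd = B1[0][0] > 0 and B1[0][0] * B1[1][1] - B1[0][1] * B1[1][0] > 0
```

Raising an error when $y_k^T s_k \le 0$ is one possible design choice; skipping the update in that case is another common one.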
4.2 BROYDEN, FLETCHER, GOLDFARB, SHANNO (BFGS) – "STANDARD"

Because we compute the search direction by solving a system with the approximate Hessian matrix, $B_k p_k = -g_k$, the BFGS update constructs an approximation of the inverse Hessian instead. Suppose that $T_k$ "behaves like" $H(x_k)^{-1}$. Then

$$ T_{k+1} y_k = s_k $$

is the secant condition for the inverse. This helps because now we can find search directions via $p_k = -T_k g_k$, that is, via a matrix-vector multiplication instead of a linear solve. The BFGS method uses the update:

$$ T_{k+1} = (I - \rho_k s_k y_k^T) T_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T. $$

By the same proof as for DFP, $T_{k+1}$ is positive definite when $T_k$ is positive definite and $y_k^T s_k > 0$. This choice has the following optimality property: it solves

$$ \min_{T} \; \| T - T_k \|_W \quad\text{subject to}\quad T^T = T, \;\; T y_k = s_k, $$

where $W$ is a weight based on the average Hessian.

4.3 SYMMETRIC RANK-1 (SR1) – FOR TRUST REGION METHODS

Both of the previous updates were rank-2 changes to $B_k$ (or $T_k$). The SR1 method is a rank-1 update to $B_k$. Unfortunately, this update is not guaranteed to preserve positive definiteness. Nonetheless, it's frequently used in practice and is a reasonable choice for Trust Region methods that don't require a positive definite approximate Hessian. Any rank-1 symmetric matrix has the form $\sigma v v^T$, and so the update is:

$$ B_{k+1} = B_k + \sigma v v^T. $$

Applying the secant equation forces $v$ to be a multiple of $y_k - B_k s_k$, and we have:

$$ B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k} $$

or

$$ T_{k+1} = T_k + \frac{(s_k - T_k y_k)(s_k - T_k y_k)^T}{(s_k - T_k y_k)^T y_k}. $$

The SR1 method tends to generate better approximations to the true Hessian than the other methods. For instance, if the search directions $p_k$ are all linearly independent for $k = 1, \ldots, n$, and $f(x)$ is a simple quadratic model, then $T_n$ is the inverse of the true Hessian.

4.4 BROYDEN CLASS

The Broyden class is a linear combination of the BFGS and the DFP method:

$$ B_{k+1} = (1 - \phi) B^{\text{BFGS}}_{k+1} + \phi B^{\text{DFP}}_{k+1}. $$

(This form requires the BFGS update for $B$ and not $T$.)
There are all sorts of great properties of the Broyden class, e.g., for the right choice of parameters, it'll reproduce the CG method.
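Putting the pieces together, the following is a rough sketch of the general iteration from Section 2 using the BFGS inverse update from Section 4.2, so that each search direction is $p_k = -T_k g_k$. Several parts are assumptions of this sketch rather than anything prescribed above: the backtracking (Armijo) line search with constant `1e-4`, the rule that skips the update when $y_k^T s_k$ is not safely positive, the choice $T_0 = I$, and the small quadratic test problem. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, u):
    return [dot(row, u) for row in M]


def bfgs_inverse_update(T, s, y):
    """(I - rho s y^T) T (I - rho y s^T) + rho s s^T for a symmetric T."""
    rho = 1.0 / dot(y, s)
    Ty = matvec(T, y)
    c = rho * rho * dot(y, Ty) + rho   # coefficient of s s^T
    n = len(s)
    return [[T[i][j] - rho * (s[i] * Ty[j] + Ty[i] * s[j]) + c * s[i] * s[j]
             for j in range(n)] for i in range(n)]


def bfgs_minimize(f, grad, x0, tol=1e-6, max_iter=100):
    n = len(x0)
    x = list(x0)
    T = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # T_0 = I
    g = grad(x)
    for k in range(max_iter):
        if math.sqrt(dot(g, g)) <= tol:
            break
        p = [-v for v in matvec(T, g)]   # p_k = -T_k g_k, no linear solve
        # Backtracking line search on the sufficient-decrease condition.
        alpha, fx, slope = 1.0, f(x), dot(g, p)
        for _ in range(40):
            x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
            if f(x_new) <= fx + 1e-4 * alpha * slope:
                break
            alpha *= 0.5
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        # Update only when the curvature condition y^T s > 0 holds safely.
        if dot(y, s) > 1e-12 * math.sqrt(dot(s, s) * dot(y, y)):
            T = bfgs_inverse_update(T, s, y)
        x, g = x_new, g_new
    return x


# Illustrative problem: f(x) = 0.5 x^T A x - b^T x, minimized where A x = b.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]


def f(x):
    return 0.5 * dot(x, matvec(A, x)) - dot(b, x)


def grad(x):
    return [a - bi for a, bi in zip(matvec(A, x), b)]


x_min = bfgs_minimize(f, grad, [0.0, 0.0])   # exact minimizer is (0.2, 0.4)
```

Swapping `bfgs_inverse_update` for a different update rule changes the method while leaving the rest of the loop untouched, which mirrors how the family of Quasi-Newton methods is defined.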