
SLIDE 1

Computational Optimization

Quasi-Newton Methods (2/22), NW Chapter 8

SLIDE 2

Theorem 3.4

Suppose $f$ is twice continuously differentiable and the steepest descent sequence converges to a point $x^*$ satisfying the second-order sufficient conditions (SOSC). Let $\lambda_1 \le \cdots \le \lambda_n$ be the eigenvalues of $\nabla^2 f(x^*)$. Then for all $k$ sufficiently large,

$$ f(x^{k+1}) - f(x^*) \le r^2 \left[\, f(x^k) - f(x^*) \,\right], \qquad r \in \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1},\; 1 \right). $$
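To see what this bound means in practice: if $\lambda_1 = 1$ and $\lambda_n = 100$ (condition number 100), then $r$ can be taken no smaller than $99/101 \approx 0.980$, so $r^2 \approx 0.961$ and the guaranteed reduction in $f(x^k) - f(x^*)$ is only about 4% per iteration. Badly conditioned problems make steepest descent very slow, which motivates the scaled and quasi-Newton methods on the following slides.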

SLIDE 3

Choosing better directions

Steepest descent: simple and cheap per iteration, but can converge very slowly when the problem is badly conditioned. Modified Newton's method: expensive per iteration, but converges quickly. Goal: first-order methods with Newton-like behavior.

SLIDE 4

Scaled Steepest Descent

Pick an approximation $D_k$ of the Hessian and do a change of variables:

$$ x^{k+1} = x^k - \alpha_k D_k^{-1} \nabla f(x^k) $$

Let $S = D^{-1/2}$ and $x = Sy$. The problem becomes $\min_y h(y) = f(Sy)$.
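A minimal NumPy sketch of this iteration, assuming a fixed scaling matrix D and a fixed step size (both choices are illustrative simplifications, not from the slides):

```python
import numpy as np

def scaled_steepest_descent(grad, D, x0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Scaled steepest descent: x_{k+1} = x_k - alpha_k D^{-1} grad f(x_k),
    with a fixed step size alpha_k = alpha for simplicity."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(D, -g)   # solve D p = -g instead of forming D^{-1}
        x = x + alpha * p
    return x

# Example: badly conditioned quadratic f(x) = 0.5 x'Qx, scaled with D = diag(Q).
Q = np.diag([1.0, 100.0])
grad = lambda x: Q @ x
x_min = scaled_steepest_descent(grad, np.diag(np.diag(Q)), [1.0, 1.0])
```

With D equal to the (here diagonal) Hessian of this quadratic, the scaled direction coincides with the Newton direction and the iterates shrink straight toward the minimizer; with D = I the method reduces to plain steepest descent and zigzags slowly.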

SLIDE 5

Scaled Steepest Descent…

Steepest descent in the $y$ variables:

$$ y^{k+1} = y^k - \alpha_k \nabla h(y^k) = y^k - \alpha_k S \nabla f(Sy^k) $$

Multiply by $S$ and use $SS = D^{-1}$:

$$ S y^{k+1} = S y^k - \alpha_k SS\, \nabla f(Sy^k) \quad\Longrightarrow\quad x^{k+1} = x^k - \alpha_k D^{-1} \nabla f(x^k) $$

Thus the convergence rate of steepest descent applies in this space, where for a quadratic $f(x) = \tfrac12 x'Qx - b'x$ we have $h(y) = f(Sy) = \tfrac12\, y'SQS\,y - b'Sy$.

SLIDE 6

Scaled Steepest Descent…

The convergence rate is therefore governed by the eigenvalues of $SQS = D^{-1/2} Q D^{-1/2}$: $\lambda_1$ is the smallest eigenvalue of $SQS$ and $\lambda_n$ the largest. Choose $S$ close to $Q^{-1/2}$ to make $\lambda_1 / \lambda_n$ close to 1; note that $Q^{-1/2}\, Q\, Q^{-1/2} = I$.

SLIDE 7

Cheap Newton Approximation

Use just the diagonal of the Hessian: linear storage and computation, and the inverse is trivial. Limited effectiveness.

$$ S = D_k^{-1/2} = \begin{bmatrix} \left(\dfrac{\partial^2 f}{\partial x_1^2}\right)^{-1/2} & & \\ & \left(\dfrac{\partial^2 f}{\partial x_2^2}\right)^{-1/2} & \\ & & \ddots \end{bmatrix} $$

SLIDE 8

Quasi-Newton Methods

Newton's method solves

$$ \nabla^2 f(x^k)\, p = -\nabla f(x^k) $$

Instead, substitute an approximation $B_k$ to get Newton-like directions:

$$ B_k\, p = -\nabla f(x^k) $$

SLIDE 9

Quasi-Newton Methods

Better yet, estimate the inverse of the Hessian directly:

$$ B_k\, p = -\nabla f(x^k), \qquad H_k = B_k^{-1} \;\Longrightarrow\; p = -H_k \nabla f(x^k) $$

SLIDE 10

1-dimensional case

In the 1-d case we might estimate the second derivative by (change in derivative) / (change in x):

$$ f''(x^k) \approx \frac{f'(x^k) - f'(x^{k-1})}{x^k - x^{k-1}} $$

Substituting this into Newton's method gives the secant method:

$$ x^{k+1} = x^k - \frac{x^k - x^{k-1}}{f'(x^k) - f'(x^{k-1})}\, f'(x^k) $$

(The slide's figure shows the iterates $x^{k-1}$, $x^k$, $x^{k+1}$ on the graph of $f'$.)
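A short Python sketch of the secant iteration above (the function name and test problem are illustrative):

```python
def secant_method(fprime, x0, x1, tol=1e-10, max_iter=100):
    """Secant method for f'(x) = 0: Newton's method with f'' replaced by
    the slope (f'(x1) - f'(x0)) / (x1 - x0) through the last two iterates."""
    for _ in range(max_iter):
        g0, g1 = fprime(x0), fprime(x1)
        if abs(g1) < tol or g1 == g0:
            break
        x0, x1 = x1, x1 - (x1 - x0) / (g1 - g0) * g1
    return x1

# Example: find a stationary point of f(x) = x^4 - 3x^2 + x
# by solving f'(x) = 4x^3 - 6x + 1 = 0.
x_stat = secant_method(lambda x: 4 * x**3 - 6 * x + 1, 1.0, 1.5)
```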

SLIDE 11

1-d convergence

The secant method has superlinear convergence with rate

$$ r = \tfrac{1}{2}\left(1 + \sqrt{5}\right) \approx 1.618 \quad \text{(the "golden ratio" again!)} $$

But the secant method only applies to the 1-d case.

SLIDE 12

Secant Condition

The 1-d condition

$$ f''(x^k)\,(x^k - x^{k-1}) = f'(x^k) - f'(x^{k-1}) $$

generalizes to

$$ \nabla^2 f(x^k)\,(x^k - x^{k-1}) = \nabla f(x^k) - \nabla f(x^{k-1}) $$

So we want

$$ B_k\,(x^k - x^{k-1}) = \nabla f(x^k) - \nabla f(x^{k-1}) $$

SLIDE 13

Another way to think about it

Approximate $f$ by a quadratic model at the current iterate:

$$ m(p) = f(x^k) + \nabla f(x^k)'p + \tfrac12\, p'B_k p $$

Its gradient at $p = 0$ equals the gradient at the current iterate: $\nabla m(0) = \nabla f(x^k)$. We also want the model's gradient to equal the gradient at the old iterate $x^{k-1} = x^k - \alpha_{k-1} p^{k-1}$:

$$ \nabla m(x^{k-1} - x^k) = \nabla m(-\alpha_{k-1} p^{k-1}) = \nabla f(x^k) - \alpha_{k-1} B_k p^{k-1} = \nabla f(x^{k-1}) $$

So

$$ \alpha_{k-1} B_k p^{k-1} = \nabla f(x^k) - \nabla f(x^{k-1}) $$

SLIDE 14

Quadratic Case

For $\min\, \tfrac12 x'Qx - b'x$:

$$ \nabla f(x^k) - \nabla f(x^{k-1}) = (Qx^k - b) - (Qx^{k-1} - b) = Q\,(x^k - x^{k-1}) $$

So $B_k$ should act like $Q$ along the direction $x^k - x^{k-1}$. Let

$$ s^k = x^{k+1} - x^k, \qquad y^k = \nabla f(x^{k+1}) - \nabla f(x^k) $$

So the quasi-Newton condition becomes $B_{k+1}\, s^k = y^k$.

SLIDE 15

Choice of B

At each step we get information about $Q$ along the direction $x^k - x^{k-1}$. Use it to update our estimate of $Q$. There are many possible ways to do this and still satisfy the quasi-Newton condition.

SLIDE 16

BFGS Update

Update by adding two matrices, each an outer product (rank one):

$$ B_{k+1} = B_k + \alpha\, a a' + \beta\, b b' $$

We need the quasi-Newton condition to hold:

$$ B_{k+1} s^k = B_k s^k + \alpha\, a\,(a's^k) + \beta\, b\,(b's^k) = y^k $$

So we make $\alpha\, a\,(a's^k) = y^k$ and $\beta\, b\,(b's^k) = -B_k s^k$.

SLIDE 17

BFGS Update

To make $\beta\, b\,(b's^k) = -B_k s^k$, pick $b = B_k s^k$. Then

$$ \beta\, b\,(b's^k) = \beta\,(B_k s^k)\big((B_k s^k)'s^k\big) = \beta\,\big(s^k{}'B_k s^k\big)\, B_k s^k $$

so define

$$ \beta = -\frac{1}{s^k{}'B_k s^k} $$

SLIDE 18

BFGS Update

To make $\alpha\, a\,(a's^k) = y^k$, define $a = y^k$. Then

$$ \alpha\, a\,(a's^k) = \alpha\,(y^k{}'s^k)\, y^k $$

so define

$$ \alpha = \frac{1}{y^k{}'s^k} $$

SLIDE 19

BFGS Update

The final update is

$$ B_{k+1} = B_k - \frac{(B_k s^k)(B_k s^k)'}{s^k{}'B_k s^k} + \frac{y^k y^k{}'}{y^k{}'s^k} $$

This is called a BFGS family update, for Broyden, Fletcher, Goldfarb, and Shanno.
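The update is a one-liner in NumPy; this sketch (the function name is illustrative) follows the formula above directly:

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update: B+ = B - (Bs)(Bs)'/(s'Bs) + yy'/(y's)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```

A quick sanity check: `bfgs_update(B, s, y) @ s` reproduces `y` (up to rounding), i.e. the result satisfies the quasi-Newton condition by construction.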

SLIDE 20

Key Ideas

This is called a rank-two update, since it adds two rank-one matrices. We want $B_k$ to be positive definite and symmetric, and we want to solve

$$ B_k\, p^k = -\nabla f(x^k) $$

efficiently. There are two possible ways (updating the Cholesky factors, or updating the inverse directly; both appear later in these slides).

SLIDE 21

Descent directions

We need $B_{k+1}$ to be positive definite. A necessary condition is the curvature condition:

$$ B_{k+1} s^k = y^k \;\Longrightarrow\; s^k{}'B_{k+1} s^k = s^k{}'y^k > 0 $$

Enforce this for general functions using the Wolfe or strong Wolfe conditions.

SLIDE 22

Wolfe Conditions

For $0 < c_1 < c_2 < 1$:

$$ f(x^{k+1}) \le f(x^k) + c_1\, \alpha_k \nabla f(x^k)'p^k $$

$$ \nabla f(x^{k+1})'p^k \ge c_2\, \nabla f(x^k)'p^k \quad\Longleftrightarrow\quad \nabla f(x^{k+1})'s^k \ge c_2\, \nabla f(x^k)'s^k $$

These imply the curvature condition:

$$ y^k{}'s^k = \big(\nabla f(x^{k+1}) - \nabla f(x^k)\big)'s^k \ge (c_2 - 1)\, \nabla f(x^k)'s^k = (c_2 - 1)\, \alpha_k \nabla f(x^k)'p^k > 0 $$

(The last quantity is positive because $p^k$ is a descent direction, so $\nabla f(x^k)'p^k < 0$, while $c_2 - 1 < 0$.)
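Checking the two Wolfe inequalities is mechanical; here is a small Python helper (the constants c1 = 1e-4 and c2 = 0.9 are conventional defaults, not mandated by the slides):

```python
def wolfe_conditions_hold(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step length alpha along p."""
    slope = grad(x) @ p                  # directional derivative, < 0 for descent
    x_new = x + alpha * p
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope
    curvature = grad(x_new) @ p >= c2 * slope
    return sufficient_decrease and curvature
```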

SLIDE 23

Guaranteeing B p.d. and sym.

Lemma 11.5 in Nash and Sofer: if $B_k$ is symmetric positive definite, then $B_{k+1}$ is positive definite if and only if $y^k{}'s^k > 0$. So enforce this condition in the line search procedure using the Wolfe conditions:

$$ \big[\nabla f(x^k) - \nabla f(x^{k-1})\big]'\,\big[x^k - x^{k-1}\big] > 0 $$

SLIDE 24

Quasi-Newton Algorithm with BFGS update

Start with $x^0$ and a symmetric positive definite $B_0$, e.g. $B_0 = I$. For $k = 0, 1, \dots, K$:

If $x^k$ is optimal, then stop. Solve

$$ B_k\, p^k = -\nabla f(x^k) $$

using a modified Cholesky factorization. Perform a line search satisfying the Wolfe conditions and set $x^{k+1} = x^k + \alpha_k p^k$. Update

$$ s^k = x^{k+1} - x^k, \qquad y^k = \nabla f(x^{k+1}) - \nabla f(x^k) $$

$$ B_{k+1} = B_k - \frac{(B_k s^k)(B_k s^k)'}{s^k{}'B_k s^k} + \frac{y^k y^k{}'}{y^k{}'s^k} $$
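Putting the whole loop together, a compact and deliberately simplified Python sketch; it uses SciPy's Wolfe line search and a dense solve in place of the modified Cholesky factorization, so it illustrates the algorithm rather than reproducing the book's implementation:

```python
import numpy as np
from scipy.optimize import line_search

def bfgs_minimize(f, grad, x0, tol=1e-8, max_iter=200):
    x = np.asarray(x0, dtype=float)
    B = np.eye(len(x))                        # B0 = I
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:           # "x_k is optimal": gradient ~ 0
            break
        p = np.linalg.solve(B, -g)            # solve B p = -grad f(x)
        alpha = line_search(f, grad, x, p)[0] # Wolfe line search
        if alpha is None:                     # line search failed; tiny fallback step
            alpha = 1e-4
        x_new = x + alpha * p
        s, y = x_new - x, grad(x_new) - g
        Bs = B @ s                            # rank-two BFGS update
        B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
        x = x_new
    return x

# Example: the Rosenbrock function.
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
print(bfgs_minimize(f, grad, [-1.2, 1.0]))   # converges to (1, 1)
```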

SLIDE 25

Add Wolfe Condition to Linesearch

The Wolfe condition is an approximation to the optimality condition of the exact line search, and is used together with the Armijo (sufficient decrease) condition:

$$ \min_\alpha\; g(\alpha) = f(x^k + \alpha p^k), \qquad \text{optimality condition: } g'(\alpha) = \nabla f(x^k + \alpha p^k)'p^k = 0 $$

We want

$$ \big|\nabla f(x^k + \alpha p^k)'p^k\big| \le \eta\, \big|\nabla f(x^k)'p^k\big| \quad \text{for } 1 > \eta > 0 $$

SLIDE 26

Theorem 8.5 – global convergence

Assume we start with a symmetric positive definite $B_0$, $f$ is twice continuously differentiable, the level set of $x^0$ is convex, and the eigenvalues of the Hessian on that level set are bounded and strictly positive. Then BFGS converges to the minimizer of $f$.

SLIDE 27

Theorem 8.6

Assume BFGS converges to $x^*$ and the Hessian is Lipschitz continuous in a neighborhood of $x^*$. Then the quasi-Newton BFGS algorithm converges superlinearly.

SLIDE 28

Easy update of Cholesky Fact.

We don't need to refactorize the whole matrix each time, only a much simpler matrix. With $B_k = LL'$, we want $B_{k+1} = \bar L \bar L'$:

$$ B_{k+1} = LL' - \frac{(LL's^k)(LL's^k)'}{s^k{}'LL's^k} + \frac{y^k y^k{}'}{y^k{}'s^k} = L\left(I - \frac{\tilde s\,\tilde s'}{\tilde s'\tilde s} + \frac{\tilde y\,\tilde y'}{\tilde y'\tilde s}\right)L' $$

where $\tilde s = L's^k$ and $\tilde y$ solves $L\tilde y = y^k$. If $\hat L \hat L'$ is a Cholesky factorization of the inner matrix, then $B_{k+1} = (L\hat L)(L\hat L)'$, and the product $L\hat L$ is again lower triangular. The update costs $O(n^2)$.
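A direct NumPy transcription of this factor update, assuming the curvature condition y's > 0 holds so the inner matrix is positive definite. For clarity it calls a dense Cholesky on the inner matrix, which costs O(n^3); the O(n^2) figure on the slide requires a specialized rank-two factor update instead:

```python
import numpy as np

def bfgs_cholesky_update(L, s, y):
    """Given B = L L', return a factor of B+ after a BFGS update, via
    B+ = L (I - ~s ~s'/(~s'~s) + ~y ~y'/(~y'~s)) L' = (L Lhat)(L Lhat)'."""
    s_t = L.T @ s                        # ~s = L' s
    y_t = np.linalg.solve(L, y)          # ~y solves L ~y = y
    inner = (np.eye(len(s)) - np.outer(s_t, s_t) / (s_t @ s_t)
             + np.outer(y_t, y_t) / (y_t @ s_t))
    L_hat = np.linalg.cholesky(inner)    # inner matrix is p.d. when y's > 0
    return L @ L_hat                     # product of lower triangulars is lower triangular
```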
SLIDE 29

Practical considerations (see pages 200-201)

Line searches that don't satisfy the Wolfe conditions may not satisfy the curvature condition; then $p^k$ need not be a descent direction, so some kind of recovery strategy is needed (the book suggests damped Newton). We can also eliminate solving the Newton equation, by maintaining the inverse $H_k = B_k^{-1}$ directly, as on the next slides.

SLIDE 30

Calculating H

We want $H_k = B_k^{-1}$. The book shows how to derive the update for $H$ directly:

$$ H_{k+1} = \left(I - \rho_k\, s^k y^k{}'\right) H_k \left(I - \rho_k\, y^k s^k{}'\right) + \rho_k\, s^k s^k{}', \qquad \rho_k = \frac{1}{y^k{}'s^k} $$
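In code, the inverse form removes the linear solve from each iteration: the step is simply p = -H @ grad(x). A sketch of the update formula above (the function name is illustrative):

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS update applied to the inverse approximation H = B^{-1}:
    H+ = (I - rho s y') H (I - rho y s') + rho s s', rho = 1/(y's)."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)   # V = I - rho y s'
    return V.T @ H @ V + rho * np.outer(s, s)   # V' = I - rho s y'
```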

SLIDE 31

Finding H

We want $H_{k+1} y^k = s^k$ such that $H_{k+1}$ is as close as possible to $H_k$:

$$ \min_H\; \|H - H_k\| \quad \text{subject to } H = H',\; H y^k = s^k $$

We can go back and forth between $H$ and $B$ using the Sherman-Morrison-Woodbury formula (see page 605).

SLIDE 32

Quasi-Newton Methods: Pros and Cons

Pros: globally converges to a local minimizer; superlinear convergence without computing the Hessian; works great in practice and is widely used. Cons: more complicated than steepest descent; the best implementations require sophisticated linear algebra, careful line searches, and handling of the curvature condition; and you have to watch out for numerical error.