
SLIDE 1

Computational Optimization

Quasi-Newton Methods (2/22), NW Chapter 8

SLIDE 2

Theorem 3.4

Suppose $f$ is twice continuously differentiable and the steepest descent sequence converges to a point $x^*$ satisfying the second-order sufficient conditions (SOSC). Let $\lambda_1 \le \cdots \le \lambda_n$ be the eigenvalues of $\nabla^2 f(x^*)$. Then for all $k$ sufficiently large,

$$ f(x^{k+1}) - f(x^*) \le r^2 \left[\, f(x^k) - f(x^*) \,\right], \qquad r \in \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1},\; 1 \right). $$
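To see what this bound means in practice: if $\lambda_1 = 1$ and $\lambda_n = 100$ (condition number 100), then $r$ can be taken no smaller than $99/101 \approx 0.980$, so $r^2 \approx 0.961$ and the guaranteed reduction in $f(x^k) - f(x^*)$ is only about 4% per iteration. Badly conditioned problems make steepest descent very slow, which motivates the scaled and quasi-Newton methods on the following slides.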

SLIDE 3

Choosing better directions

Steepest descent: simple and cheap per iteration, but can converge very slowly when the problem is badly conditioned. Modified Newton's method: expensive per iteration, but converges quickly. Goal: first-order methods with Newton-like behavior.

SLIDE 4

Scaled Steepest Descent

Pick an approximation $D_k$ of the Hessian and do a change of variables:

$$ x^{k+1} = x^k - \alpha_k D_k^{-1} \nabla f(x^k) $$

Let $S = D^{-1/2}$ and $x = Sy$. The problem becomes $\min_y h(y) = f(Sy)$.
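A minimal NumPy sketch of this iteration, assuming a fixed scaling matrix D and a fixed step size (both choices are illustrative simplifications, not from the slides):

```python
import numpy as np

def scaled_steepest_descent(grad, D, x0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Scaled steepest descent: x_{k+1} = x_k - alpha_k D^{-1} grad f(x_k),
    with a fixed step size alpha_k = alpha for simplicity."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(D, -g)   # solve D p = -g instead of forming D^{-1}
        x = x + alpha * p
    return x

# Example: badly conditioned quadratic f(x) = 0.5 x'Qx, scaled with D = diag(Q).
Q = np.diag([1.0, 100.0])
grad = lambda x: Q @ x
x_min = scaled_steepest_descent(grad, np.diag(np.diag(Q)), [1.0, 1.0])
```

With D equal to the (here diagonal) Hessian of this quadratic, the scaled direction coincides with the Newton direction and the iterates shrink straight toward the minimizer; with D = I the method reduces to plain steepest descent and zigzags slowly.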

SLIDE 5

Scaled Steepest Descent…

Steepest descent in the $y$ variables:

$$ y^{k+1} = y^k - \alpha_k \nabla h(y^k) = y^k - \alpha_k S \nabla f(Sy^k) $$

Multiply by $S$ and use $SS = D^{-1}$:

$$ S y^{k+1} = S y^k - \alpha_k SS\, \nabla f(Sy^k) \quad\Longrightarrow\quad x^{k+1} = x^k - \alpha_k D^{-1} \nabla f(x^k) $$

Thus the convergence rate of steepest descent applies in this space, where for a quadratic $f(x) = \tfrac12 x'Qx - b'x$ we have $h(y) = f(Sy) = \tfrac12\, y'SQS\,y - b'Sy$.

SLIDE 6

Scaled Steepest Descent…

The convergence rate is therefore governed by the eigenvalues of $SQS = D^{-1/2} Q D^{-1/2}$: $\lambda_1$ is the smallest eigenvalue of $SQS$ and $\lambda_n$ the largest. Choose $S$ close to $Q^{-1/2}$ to make $\lambda_1 / \lambda_n$ close to 1; note that $Q^{-1/2}\, Q\, Q^{-1/2} = I$.

SLIDE 7

Cheap Newton Approximation

Use just the diagonal of the Hessian: linear storage and computation, and the inverse is trivial. Limited effectiveness.

$$ S = D_k^{-1/2} = \begin{bmatrix} \left(\dfrac{\partial^2 f}{\partial x_1^2}\right)^{-1/2} & & \\ & \left(\dfrac{\partial^2 f}{\partial x_2^2}\right)^{-1/2} & \\ & & \ddots \end{bmatrix} $$

SLIDE 8

Quasi-Newton Methods

Newton's method solves

$$ \nabla^2 f(x^k)\, p = -\nabla f(x^k) $$

Instead, substitute an approximation $B_k$ to get Newton-like directions:

$$ B_k\, p = -\nabla f(x^k) $$

SLIDE 9

Quasi-Newton Methods

Better yet, estimate the inverse of the Hessian directly:

$$ B_k\, p = -\nabla f(x^k), \qquad H_k = B_k^{-1} \;\Longrightarrow\; p = -H_k \nabla f(x^k) $$

SLIDE 10

1-dimensional case

In the 1-d case we might estimate the second derivative by (change in derivative) / (change in x):

$$ f''(x^k) \approx \frac{f'(x^k) - f'(x^{k-1})}{x^k - x^{k-1}} $$

Substituting this into Newton's method gives the secant method:

$$ x^{k+1} = x^k - \frac{x^k - x^{k-1}}{f'(x^k) - f'(x^{k-1})}\, f'(x^k) $$

(The slide's figure shows the iterates $x^{k-1}$, $x^k$, $x^{k+1}$ on the graph of $f'$.)
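A short Python sketch of the secant iteration above (the function name and test problem are illustrative):

```python
def secant_method(fprime, x0, x1, tol=1e-10, max_iter=100):
    """Secant method for f'(x) = 0: Newton's method with f'' replaced by
    the slope (f'(x1) - f'(x0)) / (x1 - x0) through the last two iterates."""
    for _ in range(max_iter):
        g0, g1 = fprime(x0), fprime(x1)
        if abs(g1) < tol or g1 == g0:
            break
        x0, x1 = x1, x1 - (x1 - x0) / (g1 - g0) * g1
    return x1

# Example: find a stationary point of f(x) = x^4 - 3x^2 + x
# by solving f'(x) = 4x^3 - 6x + 1 = 0.
x_stat = secant_method(lambda x: 4 * x**3 - 6 * x + 1, 1.0, 1.5)
```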

SLIDE 11

1-d convergence

The secant method has superlinear convergence with rate

$$ r = \tfrac{1}{2}\left(1 + \sqrt{5}\right) \approx 1.618 \quad \text{(the "golden ratio" again!)} $$

But the secant method only applies to the 1-d case.

SLIDE 12

Secant Condition

The 1-d condition

$$ f''(x^k)\,(x^k - x^{k-1}) = f'(x^k) - f'(x^{k-1}) $$

generalizes to

$$ \nabla^2 f(x^k)\,(x^k - x^{k-1}) = \nabla f(x^k) - \nabla f(x^{k-1}) $$

So we want

$$ B_k\,(x^k - x^{k-1}) = \nabla f(x^k) - \nabla f(x^{k-1}) $$

SLIDE 13

Another way to think about it

Approximate $f$ by a quadratic model at the current iterate:

$$ m(p) = f(x^k) + \nabla f(x^k)'p + \tfrac12\, p'B_k p $$

Its gradient at $p = 0$ equals the gradient at the current iterate: $\nabla m(0) = \nabla f(x^k)$. We also want the model's gradient to equal the gradient at the old iterate $x^{k-1} = x^k - \alpha_{k-1} p^{k-1}$:

$$ \nabla m(x^{k-1} - x^k) = \nabla m(-\alpha_{k-1} p^{k-1}) = \nabla f(x^k) - \alpha_{k-1} B_k p^{k-1} = \nabla f(x^{k-1}) $$

So

$$ \alpha_{k-1} B_k p^{k-1} = \nabla f(x^k) - \nabla f(x^{k-1}) $$

SLIDE 14

Quadratic Case

For $\min\, \tfrac12 x'Qx - b'x$:

$$ \nabla f(x^k) - \nabla f(x^{k-1}) = (Qx^k - b) - (Qx^{k-1} - b) = Q\,(x^k - x^{k-1}) $$

So $B_k$ should act like $Q$ along the direction $x^k - x^{k-1}$. Let

$$ s^k = x^{k+1} - x^k, \qquad y^k = \nabla f(x^{k+1}) - \nabla f(x^k) $$

So the quasi-Newton condition becomes $B_{k+1}\, s^k = y^k$.

SLIDE 15

Choice of B

At each step we get information about $Q$ along the direction $x^k - x^{k-1}$. Use it to update our estimate of $Q$. There are many possible ways to do this and still satisfy the quasi-Newton condition.

SLIDE 16

BFGS Update

Update by adding two matrices, each an outer product (rank one):

$$ B_{k+1} = B_k + \alpha\, a a' + \beta\, b b' $$

We need the quasi-Newton condition to hold:

$$ B_{k+1} s^k = B_k s^k + \alpha\, a\,(a's^k) + \beta\, b\,(b's^k) = y^k $$

So we make $\alpha\, a\,(a's^k) = y^k$ and $\beta\, b\,(b's^k) = -B_k s^k$.

SLIDE 17

BFGS Update

To make $\beta\, b\,(b's^k) = -B_k s^k$, pick $b = B_k s^k$. Then

$$ \beta\, b\,(b's^k) = \beta\,(B_k s^k)\big((B_k s^k)'s^k\big) = \beta\,\big(s^k{}'B_k s^k\big)\, B_k s^k $$

so define

$$ \beta = -\frac{1}{s^k{}'B_k s^k} $$

SLIDE 18

BFGS Update

To make $\alpha\, a\,(a's^k) = y^k$, define $a = y^k$. Then

$$ \alpha\, a\,(a's^k) = \alpha\,(y^k{}'s^k)\, y^k $$

so define

$$ \alpha = \frac{1}{y^k{}'s^k} $$

SLIDE 19

BFGS Update

The final update is

$$ B_{k+1} = B_k - \frac{(B_k s^k)(B_k s^k)'}{s^k{}'B_k s^k} + \frac{y^k y^k{}'}{y^k{}'s^k} $$

This is called a BFGS family update, for Broyden, Fletcher, Goldfarb, and Shanno.
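The update is a one-liner in NumPy; this sketch (the function name is illustrative) follows the formula above directly:

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update: B+ = B - (Bs)(Bs)'/(s'Bs) + yy'/(y's)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```

A quick sanity check: `bfgs_update(B, s, y) @ s` reproduces `y` (up to rounding), i.e. the result satisfies the quasi-Newton condition by construction.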

SLIDE 20

Key Ideas

This is called a rank-two update, since it adds two rank-one matrices. We want $B_k$ to be positive definite and symmetric, and we want to solve

$$ B_k\, p^k = -\nabla f(x^k) $$

efficiently. There are two possible ways (updating the Cholesky factors, or updating the inverse directly; both appear later in these slides).

SLIDE 21

Descent directions

We need $B_{k+1}$ to be positive definite. A necessary condition is the curvature condition:

$$ B_{k+1} s^k = y^k \;\Longrightarrow\; s^k{}'B_{k+1} s^k = s^k{}'y^k > 0 $$

Enforce this for general functions using the Wolfe or strong Wolfe conditions.

SLIDE 22

Wolfe Conditions

For $0 < c_1 < c_2 < 1$:

$$ f(x^{k+1}) \le f(x^k) + c_1\, \alpha_k \nabla f(x^k)'p^k $$

$$ \nabla f(x^{k+1})'p^k \ge c_2\, \nabla f(x^k)'p^k \quad\Longleftrightarrow\quad \nabla f(x^{k+1})'s^k \ge c_2\, \nabla f(x^k)'s^k $$

These imply the curvature condition:

$$ y^k{}'s^k = \big(\nabla f(x^{k+1}) - \nabla f(x^k)\big)'s^k \ge (c_2 - 1)\, \nabla f(x^k)'s^k = (c_2 - 1)\, \alpha_k \nabla f(x^k)'p^k > 0 $$

(The last quantity is positive because $p^k$ is a descent direction, so $\nabla f(x^k)'p^k < 0$, while $c_2 - 1 < 0$.)
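Checking the two Wolfe inequalities is mechanical; here is a small Python helper (the constants c1 = 1e-4 and c2 = 0.9 are conventional defaults, not mandated by the slides):

```python
def wolfe_conditions_hold(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step length alpha along p."""
    slope = grad(x) @ p                  # directional derivative, < 0 for descent
    x_new = x + alpha * p
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope
    curvature = grad(x_new) @ p >= c2 * slope
    return sufficient_decrease and curvature
```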

SLIDE 23

Guaranteeing B p.d. and sym.

Lemma 11.5 in Nash and Sofer: if $B_k$ is symmetric positive definite, then $B_{k+1}$ is positive definite if and only if $y^k{}'s^k > 0$. So enforce this condition in the line search procedure using the Wolfe conditions:

$$ \big[\nabla f(x^k) - \nabla f(x^{k-1})\big]'\,\big[x^k - x^{k-1}\big] > 0 $$

SLIDE 24

Quasi-Newton Algorithm with BFGS update

Start with $x^0$ and a symmetric positive definite $B_0$, e.g. $B_0 = I$. For $k = 0, 1, \dots, K$:

If $x^k$ is optimal, then stop. Solve

$$ B_k\, p^k = -\nabla f(x^k) $$

using a modified Cholesky factorization. Perform a line search satisfying the Wolfe conditions and set $x^{k+1} = x^k + \alpha_k p^k$. Update

$$ s^k = x^{k+1} - x^k, \qquad y^k = \nabla f(x^{k+1}) - \nabla f(x^k) $$

$$ B_{k+1} = B_k - \frac{(B_k s^k)(B_k s^k)'}{s^k{}'B_k s^k} + \frac{y^k y^k{}'}{y^k{}'s^k} $$
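Putting the whole loop together, a compact and deliberately simplified Python sketch; it uses SciPy's Wolfe line search and a dense solve in place of the modified Cholesky factorization, so it illustrates the algorithm rather than reproducing the book's implementation:

```python
import numpy as np
from scipy.optimize import line_search

def bfgs_minimize(f, grad, x0, tol=1e-8, max_iter=200):
    x = np.asarray(x0, dtype=float)
    B = np.eye(len(x))                        # B0 = I
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:           # "x_k is optimal": gradient ~ 0
            break
        p = np.linalg.solve(B, -g)            # solve B p = -grad f(x)
        alpha = line_search(f, grad, x, p)[0] # Wolfe line search
        if alpha is None:                     # line search failed; tiny fallback step
            alpha = 1e-4
        x_new = x + alpha * p
        s, y = x_new - x, grad(x_new) - g
        Bs = B @ s                            # rank-two BFGS update
        B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
        x = x_new
    return x

# Example: the Rosenbrock function.
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
print(bfgs_minimize(f, grad, [-1.2, 1.0]))   # converges to (1, 1)
```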

SLIDE 25

Add Wolfe Condition to Linesearch

The Wolfe condition is an approximation to the optimality condition of the exact line search, and is used together with the Armijo (sufficient decrease) condition:

$$ \min_\alpha\; g(\alpha) = f(x^k + \alpha p^k), \qquad \text{optimality condition: } g'(\alpha) = \nabla f(x^k + \alpha p^k)'p^k = 0 $$

We want

$$ \big|\nabla f(x^k + \alpha p^k)'p^k\big| \le \eta\, \big|\nabla f(x^k)'p^k\big| \quad \text{for } 1 > \eta > 0 $$

SLIDE 26

Theorem 8.5 – global convergence

Assume we start with a symmetric positive definite $B_0$, $f$ is twice continuously differentiable, the level set of $x^0$ is convex, and the eigenvalues of the Hessian on that level set are bounded and strictly positive. Then BFGS converges to the minimizer of $f$.

SLIDE 27

Theorem 8.6

Assume BFGS converges to $x^*$ and the Hessian is Lipschitz continuous in a neighborhood of $x^*$. Then the quasi-Newton BFGS algorithm converges superlinearly.

SLIDE 28

Easy update of Cholesky Fact.

We don't need to refactorize the whole matrix each time, only a much simpler matrix. With $B_k = LL'$, we want $B_{k+1} = \bar L \bar L'$:

$$ B_{k+1} = LL' - \frac{(LL's^k)(LL's^k)'}{s^k{}'LL's^k} + \frac{y^k y^k{}'}{y^k{}'s^k} = L\left(I - \frac{\tilde s\,\tilde s'}{\tilde s'\tilde s} + \frac{\tilde y\,\tilde y'}{\tilde y'\tilde s}\right)L' $$

where $\tilde s = L's^k$ and $\tilde y$ solves $L\tilde y = y^k$. If $\hat L \hat L'$ is a Cholesky factorization of the inner matrix, then $B_{k+1} = (L\hat L)(L\hat L)'$, and the product $L\hat L$ is again lower triangular. The update costs $O(n^2)$.
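A direct NumPy transcription of this factor update, assuming the curvature condition y's > 0 holds so the inner matrix is positive definite. For clarity it calls a dense Cholesky on the inner matrix, which costs O(n^3); the O(n^2) figure on the slide requires a specialized rank-two factor update instead:

```python
import numpy as np

def bfgs_cholesky_update(L, s, y):
    """Given B = L L', return a factor of B+ after a BFGS update, via
    B+ = L (I - ~s ~s'/(~s'~s) + ~y ~y'/(~y'~s)) L' = (L Lhat)(L Lhat)'."""
    s_t = L.T @ s                        # ~s = L' s
    y_t = np.linalg.solve(L, y)          # ~y solves L ~y = y
    inner = (np.eye(len(s)) - np.outer(s_t, s_t) / (s_t @ s_t)
             + np.outer(y_t, y_t) / (y_t @ s_t))
    L_hat = np.linalg.cholesky(inner)    # inner matrix is p.d. when y's > 0
    return L @ L_hat                     # product of lower triangulars is lower triangular
```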
SLIDE 29

Practical considerations (see pages 200-201)

Line searches that don't satisfy the Wolfe conditions may not satisfy the curvature condition; then $p^k$ need not be a descent direction, so some kind of recovery strategy is needed (the book suggests damped Newton). We can also eliminate solving the Newton equation, by maintaining the inverse $H_k = B_k^{-1}$ directly, as on the next slides.

SLIDE 30

Calculating H

We want $H_k = B_k^{-1}$. The book shows how to derive the update for $H$ directly:

$$ H_{k+1} = \left(I - \rho_k\, s^k y^k{}'\right) H_k \left(I - \rho_k\, y^k s^k{}'\right) + \rho_k\, s^k s^k{}', \qquad \rho_k = \frac{1}{y^k{}'s^k} $$
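In code, the inverse form removes the linear solve from each iteration: the step is simply p = -H @ grad(x). A sketch of the update formula above (the function name is illustrative):

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS update applied to the inverse approximation H = B^{-1}:
    H+ = (I - rho s y') H (I - rho y s') + rho s s', rho = 1/(y's)."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)   # V = I - rho y s'
    return V.T @ H @ V + rho * np.outer(s, s)   # V' = I - rho s y'
```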

SLIDE 31

Finding H

We want $H_{k+1} y^k = s^k$ such that $H_{k+1}$ is as close as possible to $H_k$:

$$ \min_H\; \|H - H_k\| \quad \text{subject to } H = H',\; H y^k = s^k $$

We can go back and forth between $H$ and $B$ using the Sherman-Morrison-Woodbury formula (see page 605).

SLIDE 32

Quasi-Newton Methods: Pros and Cons

Pros: globally converges to a local minimizer; superlinear convergence without computing the Hessian; works great in practice and is widely used. Cons: more complicated than steepest descent; the best implementations require sophisticated linear algebra, careful line searches, and handling of the curvature condition; and you have to watch out for numerical error.