Computational Optimization: Quasi-Newton Methods, 2/22 (NW Chapter 8)
Theorem 3.4
Suppose $f$ is twice continuously differentiable and the sequence of steepest descent iterates converges to a point $x^*$ satisfying the SOSC. Let
$$r \in \left(\frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1},\; 1\right), \qquad \text{where } \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n \text{ are the eigenvalues of } \nabla^2 f(x^*).$$
Then for all $k$ sufficiently large,
$$f(x_{k+1}) - f(x^*) \le r^2 \left[\, f(x_k) - f(x^*) \,\right].$$
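As a quick illustration (my own sketch, not from the slides), the following Python snippet runs exact-line-search steepest descent on a small made-up quadratic and compares the per-step reduction in f(x_k) - f(x*) with the bound r^2 from the theorem; the matrix Q and the starting point are illustrative assumptions.

import numpy as np

# Hypothetical quadratic f(x) = 0.5 x'Qx, minimizer x* = 0, f(x*) = 0
Q = np.diag([1.0, 10.0])                 # eigenvalues lambda_1 = 1, lambda_n = 10
f = lambda x: 0.5 * x @ Q @ x
x = np.array([10.0, 1.0])

lam = np.linalg.eigvalsh(Q)
r2 = ((lam[-1] - lam[0]) / (lam[-1] + lam[0])) ** 2   # bound r^2 from Theorem 3.4

for k in range(10):
    g = Q @ x                            # gradient of the quadratic
    alpha = (g @ g) / (g @ Q @ g)        # exact line search step for a quadratic
    x_new = x - alpha * g
    print(f"k={k}  ratio={f(x_new) / f(x):.4f}  bound r^2={r2:.4f}")
    x = x_new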
Choosing better directions
Steepest Descent – simple and cheap per iteration, but can converge very slowly if the conditioning is bad. Modified Newton – expensive per iteration, but converges quickly. Goal – first-order methods with Newton-like behavior.
Scaled Steepest Descent
Pick an approximation $D_k$ of the Hessian and take the step
$$x_{k+1} = x_k - \alpha_k D_k^{-1} \nabla f(x_k).$$
Do a change of variables: let $S = D_k^{-1/2}$ and $x = Sy$. Now the problem is $\min_y h(y) = f(Sy)$.
Scaled Steepest Descent…
$$y_{k+1} = y_k - \alpha_k \nabla h(y_k) = y_k - \alpha_k S' \nabla f(Sy_k)$$
Multiply by $S$:
$$Sy_{k+1} = Sy_k - \alpha_k SS' \nabla f(Sy_k) = Sy_k - \alpha_k D_k^{-1} \nabla f(Sy_k),$$
i.e. $x_{k+1} = x_k - \alpha_k D_k^{-1} \nabla f(x_k)$. Thus the convergence rate of steepest descent applies in the $y$ space, where $h(y) = f(Sy)$ and, for a quadratic with Hessian $Q$, the Hessian in $y$ is $S'QS = SQS$.
Scaled Steepest Descent…
Convergence rate governed by the eigenvalues of $SQS$:
$$\lambda_1 = \text{smallest eigenvalue of } SQS, \qquad \lambda_n = \text{largest eigenvalue of } SQS.$$
Choose $S$ close to $Q^{-1/2}$ to make $\lambda_n / \lambda_1$ close to 1; note $Q^{-1/2}\, Q\, Q^{-1/2} = I$.
Cheap Newton Approximation
Use just the diagonal of the Hessian. Linear storage and computation, and the inverse is trivial. Limited effectiveness.
$$S = D_k^{-1/2} = \begin{bmatrix} \left(\dfrac{\partial^2 f}{\partial x_1^2}\right)^{-1/2} & & \\ & \left(\dfrac{\partial^2 f}{\partial x_2^2}\right)^{-1/2} & \\ & & \left(\dfrac{\partial^2 f}{\partial x_3^2}\right)^{-1/2} \end{bmatrix}$$
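A minimal sketch (my illustration, not from the slides) of this diagonally scaled steepest descent, where D_k holds only the diagonal of the Hessian; the test function and fixed step size are assumptions made for the example.

import numpy as np

def diag_scaled_steepest_descent(grad, hess_diag, x0, alpha=1.0, iters=50):
    # Scaled steepest descent: x_{k+1} = x_k - alpha * D_k^{-1} grad f(x_k),
    # where D_k is the diagonal of the Hessian (the cheap Newton approximation).
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        d = np.maximum(hess_diag(x), 1e-8)   # guard against non-positive diagonal entries
        x = x - alpha * grad(x) / d          # D_k^{-1} grad is an elementwise divide
    return x

# Illustrative separable quadratic f(x) = 0.5*(x1^2 + 100*x2^2)
c = np.array([1.0, 100.0])
print(diag_scaled_steepest_descent(lambda x: c * x, lambda x: c, [5.0, 5.0]))  # -> ~[0, 0]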
Quasi-Newton Methods
Newton's Method:
$$\nabla^2 f(x_k)\, p = -\nabla f(x_k)$$
Instead, substitute $B_k$ to get Newton-like directions:
$$B_k\, p = -\nabla f(x_k)$$
Quasi-Newton Methods
Better yet – estimate the inverse of the Hessian directly:
$$B_k\, p = -\nabla f(x_k), \qquad H_k = B_k^{-1},$$
so the direction is just the matrix-vector product $p = -H_k \nabla f(x_k)$.
1-dimensional case
In the 1-d case we might estimate the second derivative by (change in derivative)/(change in $x$):
$$f''(x_k) \approx \frac{f'(x_k) - f'(x_{k-1})}{x_k - x_{k-1}}$$
If you do this you get the secant method:
$$x_{k+1} = x_k - \frac{x_k - x_{k-1}}{f'(x_k) - f'(x_{k-1})}\, f'(x_k)$$
(Figure: secant step from $x_{k-1}$ and $x_k$ to $x_{k+1}$.)
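A minimal Python sketch (illustrative, not from the slides) of the secant iteration applied to f'(x); the example function is an assumption.

def secant(fprime, x_prev, x_curr, tol=1e-10, max_iter=50):
    # x_{k+1} = x_k - (x_k - x_{k-1}) / (f'(x_k) - f'(x_{k-1})) * f'(x_k)
    for _ in range(max_iter):
        g_prev, g_curr = fprime(x_prev), fprime(x_curr)
        if abs(g_curr) < tol:
            break
        x_prev, x_curr = x_curr, x_curr - (x_curr - x_prev) / (g_curr - g_prev) * g_curr
    return x_curr

# Find a stationary point of f(x) = x^4 - 3x^2 + x by solving f'(x) = 4x^3 - 6x + 1 = 0
print(secant(lambda x: 4 * x**3 - 6 * x + 1, 1.0, 2.0))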
1-d convergence
The secant method has superlinear convergence with rate
$$r = \frac{1 + \sqrt{5}}{2} \quad \text{(the “golden ratio” again!)}$$
But the secant method only applies to the 1-d case.
Secant Condition
1-d condition:
$$f''(x_k)\,(x_k - x_{k-1}) = f'(x_k) - f'(x_{k-1})$$
Generalizes to
$$\nabla^2 f(x_k)\,(x_k - x_{k-1}) = \nabla f(x_k) - \nabla f(x_{k-1})$$
So we want
$$B_k\,(x_k - x_{k-1}) = \nabla f(x_k) - \nabla f(x_{k-1})$$
Another way to think about it
Approximate $f$ with a quadratic model:
$$m_k(p) = f(x_k) + \nabla f(x_k)'\,p + \tfrac{1}{2}\, p' B_k\, p$$
Its gradient matches the gradient at the current iterate: $\nabla m_k(0) = \nabla f(x_k)$.
We also want its gradient to match the gradient at the old iterate:
$$\nabla m_k(x_{k-1} - x_k) = \nabla m_k(-\alpha_{k-1} p_{k-1}) = \nabla f(x_k) - \alpha_{k-1} B_k\, p_{k-1} = \nabla f(x_{k-1})$$
So
$$\alpha_{k-1} B_k\, p_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1}).$$
Quadratic Case
For $\min \tfrac{1}{2} x'Qx - b'x$:
$$\nabla f(x_k) - \nabla f(x_{k-1}) = (Qx_k - b) - (Qx_{k-1} - b) = Q\,(x_k - x_{k-1})$$
So $B_k$ should act like $Q$ along the direction $x_k - x_{k-1}$. Let
$$s_k = x_k - x_{k-1}, \qquad y_k = \nabla f(x_k) - \nabla f(x_{k-1}).$$
So the quasi-Newton condition becomes $B_k s_k = y_k$.
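A quick numeric check (illustrative, not from the slides) that for a quadratic the gradient difference y_k really is Q s_k, which is exactly what the quasi-Newton condition asks B_k to reproduce; the matrix Q, vector b, and iterates are made up.

import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])   # assumed s.p.d. matrix
b = np.array([1.0, 2.0])
grad = lambda x: Q @ x - b               # gradient of 0.5 x'Qx - b'x

x_prev, x_curr = np.array([0.0, 0.0]), np.array([1.0, -1.0])
s = x_curr - x_prev
y = grad(x_curr) - grad(x_prev)
print(np.allclose(y, Q @ s))             # True: y_k = Q s_k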
Choice of B
At each step we get information about $Q$ along the direction $x_k - x_{k-1}$. Use it to update our estimate of $Q$. There are many possible ways to do this and still satisfy the quasi-Newton condition.
BFGS Update
Update by adding two rank-one matrices:
$$B_{k+1} = B_k + \alpha\, a_k a_k' + \beta\, b_k b_k'$$
(note that $a_k a_k'$ is an outer product). From the quasi-Newton condition we need
$$B_{k+1} s_k = B_k s_k + \alpha\, a_k a_k' s_k + \beta\, b_k b_k' s_k = y_k,$$
so we make
$$\alpha\, a_k a_k' s_k = y_k \qquad \text{and} \qquad \beta\, b_k b_k' s_k = -B_k s_k.$$
BFGS Update
To make
$$\beta\, b_k b_k' s_k = -B_k s_k,$$
define $b_k = B_k s_k$. Then
$$\beta\, b_k b_k' s_k = \beta\,(B_k s_k)(s_k' B_k s_k) = \beta\,(s_k' B_k s_k)\, B_k s_k,$$
so pick
$$\beta = -\frac{1}{s_k' B_k s_k}.$$
BFGS Update
To make
$$\alpha\, a_k a_k' s_k = y_k,$$
define $a_k = y_k$. Then
$$\alpha\, a_k a_k' s_k = \alpha\, y_k\,(y_k' s_k) = \alpha\,(y_k' s_k)\, y_k,$$
so define
$$\alpha = \frac{1}{y_k' s_k}.$$
BFGS Update
The final update is
$$B_{k+1} = B_k - \frac{(B_k s_k)(B_k s_k)'}{s_k' B_k s_k} + \frac{y_k y_k'}{y_k' s_k}.$$
This is called a BFGS update, for Broyden, Fletcher, Goldfarb, and Shanno.
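A minimal sketch (my illustration, not the lecture's code) of this rank-two update, with a check that the result satisfies the quasi-Newton condition B_{k+1} s_k = y_k; the test vectors are made up and chosen so that y_k's_k > 0.

import numpy as np

def bfgs_update(B, s, y):
    # B_{k+1} = B_k - (B_k s)(B_k s)' / (s'B_k s) + y y' / (y's)
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

B = np.eye(3)
s = np.array([1.0, 2.0, -1.0])
y = np.array([2.0, 1.0, 0.5])            # y's = 3.5 > 0, so the update stays p.d.
B_new = bfgs_update(B, s, y)
print(np.allclose(B_new @ s, y))         # True: quasi-Newton condition holds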
Key Ideas
This update is called a rank-2 update since it adds two rank-one matrices. We want $B_k$ to be p.d. and symmetric, and we want to solve
$$B_k\, p = -\nabla f(x_k)$$
efficiently. There are two possible ways.
Descent directions
Need $B$ to be positive definite. A necessary condition is the curvature condition:
$$B_{k+1} s_k = y_k \;\Longrightarrow\; s_k' B_{k+1} s_k = s_k' y_k > 0.$$
Enforce this for general functions using the Wolfe or strong Wolfe conditions.
Wolfe Conditions
For $0 < c_1 < c_2 < 1$:
$$f(x_{k+1}) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)'\, p_k$$
$$\nabla f(x_{k+1})'\, p_k \ge c_2\, \nabla f(x_k)'\, p_k$$
This implies
$$\nabla f(x_{k+1})'\, s_k \ge c_2\, \nabla f(x_k)'\, s_k,$$
so
$$y_k' s_k = \left(\nabla f(x_{k+1}) - \nabla f(x_k)\right)' s_k \ge (c_2 - 1)\, \nabla f(x_k)'\, s_k = (c_2 - 1)\, \alpha_k \nabla f(x_k)'\, p_k > 0.$$
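As an illustration (not from the slides), a small helper that checks whether a candidate step length satisfies both Wolfe conditions; the constants c1 and c2 and the toy quadratic are assumptions.

import numpy as np

def wolfe_satisfied(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    # Sufficient decrease and curvature conditions for step alpha along p
    g0 = grad(x) @ p                                   # directional derivative at x
    sufficient = f(x + alpha * p) <= f(x) + c1 * alpha * g0
    curvature = grad(x + alpha * p) @ p >= c2 * g0
    return sufficient and curvature

f = lambda x: x @ x                                    # toy quadratic
grad = lambda x: 2 * x
x = np.array([1.0, 1.0]); p = -grad(x)
print(wolfe_satisfied(f, grad, x, p, alpha=0.25))      # True for this example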
Guaranteeing B p.d. and sym.
Lemma 11.5 in Nash and Sofer: if $B_k$ is p.d. and symmetric, then $B_{k+1}$ is p.d. if and only if $y_k' s_k > 0$. So enforce this condition in the linesearch procedure using the Wolfe conditions:
$$\left[\nabla f(x_k) - \nabla f(x_{k-1})\right]' \left[x_k - x_{k-1}\right] > 0$$
Quasi-Newton Algorithm with BFGS update
Start with $x_0$ and $B_0$, e.g. $B_0 = I$. For $k = 0, 1, \dots, K$:
If $x_k$ is optimal, then stop.
Solve $B_k p_k = -\nabla f(x_k)$ using a modified Cholesky factorization.
Perform a linesearch satisfying the Wolfe conditions and set $x_{k+1} = x_k + \alpha_k p_k$.
Update $s_k = x_{k+1} - x_k$, $\; y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$, and
$$B_{k+1} = B_k - \frac{(B_k s_k)(B_k s_k)'}{s_k' B_k s_k} + \frac{y_k y_k'}{y_k' s_k}.$$
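A compact Python sketch of this loop (an illustration under simplifying assumptions, not the book's implementation): it uses a plain linear solve instead of a modified Cholesky factorization, a simple Armijo backtracking search instead of a full Wolfe linesearch, and it skips the update whenever the curvature condition y_k's_k > 0 fails; the Rosenbrock test problem is an assumption.

import numpy as np

def bfgs(f, grad, x0, max_iter=200, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                               # B_0 = I
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:                  # "if x_k is optimal, stop"
            break
        p = np.linalg.solve(B, -g)                   # solve B_k p_k = -grad f(x_k)
        alpha, c1 = 1.0, 1e-4                        # Armijo backtracking
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        s, y = x_new - x, grad(x_new) - g
        if y @ s > 1e-10:                            # only update when y's > 0
            Bs = B @ s
            B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
        x = x_new
    return x

# Illustrative run on the Rosenbrock function; minimizer is (1, 1)
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
print(bfgs(f, grad, [-1.2, 1.0]))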
Add Wolfe Condition to Linesearch
The Wolfe condition is an approximation to the optimality condition for the exact linesearch, and is used together with the Armijo (sufficient decrease) condition.
Exact linesearch: $\min_\alpha\, g(\alpha) = f(x_k + \alpha p_k)$.
Optimality condition: $g'(\alpha) = p_k' \nabla f(x_k + \alpha p_k) = 0$.
Instead we want
$$\left| p_k' \nabla f(x_k + \alpha p_k) \right| \le \eta\, \left| p_k' \nabla f(x_k) \right| \qquad \text{for } 1 > \eta > 0.$$
Theorem 8.5 – global convergence
Assume we start with a symmetric p.d. $B_0$, $f$ is twice continuously differentiable, the level set of $f$ at $x_0$ is convex, and the eigenvalues of the Hessian on that level set are bounded and strictly positive. Then BFGS converges to the minimizer of $f$.
Theorem 8.6
Assume BFGS converges to $x^*$ and the Hessian is Lipschitz continuous in a neighborhood of $x^*$. Then the quasi-Newton BFGS algorithm has superlinear convergence.
Easy update of Cholesky Fact.
We don't need to refactorize the whole matrix each time – just factor a much simpler matrix.
$B_k = LL'$; we want $B_{k+1} = \bar L \bar L'$.
$$B_{k+1} = B_k - \frac{(B_k s_k)(B_k s_k)'}{s_k' B_k s_k} + \frac{y_k y_k'}{y_k' s_k}
= LL' - \frac{(LL's_k)(LL's_k)'}{s_k' LL' s_k} + \frac{y_k y_k'}{y_k' s_k}
= L\left( I - \frac{\hat s\, \hat s'}{\hat s'\hat s} + \frac{\hat y\, \hat y'}{y_k' s_k} \right) L',$$
where $\hat s = L's_k$ and $L\hat y = y_k$. Write the inner matrix as $\hat L \hat L'$ (its Cholesky factors); then
$$B_{k+1} = L \hat L\, \hat L' L', \qquad \bar L = L\hat L.$$
This costs $O(n^2)$.
Practical considerations (see pages 200-201)
Linesearches that don't satisfy the Wolfe conditions may not satisfy the curvature condition; then we may not get a descent direction, so we need some kind of recovery strategy (the book suggests damped Newton). We can also eliminate solving the Newton equation altogether.
Calculating H
Want $H_k = B_k^{-1}$. The book shows the derivation of this directly:
$$H_{k+1} = \left( I - \rho_k s_k y_k' \right) H_k \left( I - \rho_k y_k s_k' \right) + \rho_k s_k s_k', \qquad \rho_k = \frac{1}{y_k' s_k}.$$
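A minimal sketch (illustrative) of this inverse update; with H_k in hand the search direction is just the matrix-vector product p_k = -H_k grad f(x_k), with no linear solve. The test vectors are made up.

import numpy as np

def bfgs_inverse_update(H, s, y):
    # H_{k+1} = (I - rho s y') H_k (I - rho y s') + rho s s',  with rho = 1/(y's)
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)

H = np.eye(3)
s = np.array([1.0, 2.0, -1.0])
y = np.array([2.0, 1.0, 0.5])
H_new = bfgs_inverse_update(H, s, y)
print(np.allclose(H_new @ y, s))         # True: inverse form of the secant condition, H_{k+1} y_k = s_k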
Finding H
We want $H_{k+1} y_k = s_k$ such that $H_{k+1}$ is as close as possible to $H_k$:
$$\min_H \; \| H - H_k \| \quad \text{subject to} \quad H = H', \;\; H y_k = s_k.$$
We can go back and forth between $H$ and $B$ using the Sherman-Morrison-Woodbury formula (see page 605).