


Convex Optimization

(EE227A: UC Berkeley)

Lecture 25

(Newton, quasi-Newton) 23 Apr, 2013

  • Suvrit Sra

Admin

♠ Project poster presentations:

Soda 306 HP Auditorium Fri May 10, 2013 4pm – 8pm

♠ HW5 due on May 02, 2013. Will be released today.

2 / 25

Newton method

◮ Recall numerical analysis: Newton method for solving equations g(x) = 0, x ∈ R.
◮ Key idea: linear approximation.
◮ Suppose we are at some x close to x∗ (the root): g(x + ∆x) = g(x) + g′(x)∆x + o(|∆x|).
◮ The equation g(x + ∆x) = 0 is approximated by g(x) + g′(x)∆x = 0 ⇒ ∆x = −g(x)/g′(x).
◮ If x is close to x∗, we can expect ∆x ≈ ∆x∗ = x∗ − x.
◮ Thus, we may write x∗ ≈ x − g(x)/g′(x).
◮ This suggests the iterative process xk+1 ← xk − g(xk)/g′(xk).

3 / 25
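A minimal sketch of this scalar iteration in Python; the example function, tolerance, and iteration cap are illustrative assumptions, not from the slides:

    def newton_1d(g, gprime, x0, tol=1e-10, max_iter=50):
        # Newton iteration x_{k+1} = x_k - g(x_k)/g'(x_k) for a scalar root of g
        x = x0
        for _ in range(max_iter):
            gx = g(x)
            if abs(gx) < tol:          # |g(x)| small enough: accept x as the root
                break
            x = x - gx / gprime(x)     # Newton step
        return x

    # example: root of g(x) = x^2 - 2, starting close to the root
    print(newton_1d(lambda x: x * x - 2, lambda x: 2 * x, x0=1.5))   # ~1.41421356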

Newton method

◮ Suppose we have a system of nonlinear equations G(x) = 0, where G : Rn → Rn.
◮ Again, arguing as above, we arrive at the Newton system G(x) + G′(x)∆x = 0, where G′(x) is the Jacobian.
◮ Assuming G′(x) is non-degenerate (invertible), we obtain xk+1 = xk − [G′(xk)]−1 G(xk).
◮ This is Newton's method for solving nonlinear equations.

4 / 25

Newton method

min f(x) such that x ∈ Rn. The condition ∇f(x) = 0 is necessary for optimality.

Newton system: ∇f(x) + ∇2f(x)∆x = 0, which leads to

xk+1 = xk − [∇2f(xk)]−1 ∇f(xk),

the Newton method for optimization.

5 / 25
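A sketch of this update for minimizing a smooth f : Rn → R, solving the Newton system by a linear solve rather than forming the inverse; the gradient/Hessian callables and the quadratic test function are assumptions for illustration:

    import numpy as np

    def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=50):
        # Pure Newton for minimization: solve Hess f(x_k) dx = grad f(x_k), then x <- x - dx
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:             # grad f(x) ~ 0: stationary point reached
                break
            x = x - np.linalg.solve(hess(x), g)     # Newton step without forming the inverse
        return x

    # example: quadratic f(x) = 0.5 x^T A x - b^T x, so one Newton step is exact
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    print(newton_minimize(lambda x: A @ x - b, lambda x: A, np.zeros(2)))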

Newton method – remarks

◮ Newton method for equations is more general than minimizing f(x) by finding roots of ∇f(x) = 0.
◮ Reason: not every function G : Rn → Rn is a derivative! Example: consider the linear system Ax − b = 0. Unless A is symmetric, it does not correspond to a derivative. (Why?)
◮ If it were a derivative, then its own derivative would be a Hessian, and we know that Hessians must be symmetric. QED.

6 / 25

Newton method – remarks

◮ In general, the Newton method is highly nontrivial to analyze.

Example: Consider the iteration xk+1 = xk − 1/xk with x0 = 2. It may be viewed as the Newton iteration for exp(x²/2) = 0 (which has no real solution).

It is unknown whether this iteration generates a bounded sequence!

Newton fractals (complex dynamics): z³ − 2z + 2, x⁸ + 15x⁴ − 16.

7 / 25
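A few iterations of this recursion, just to watch the behavior (the 20-step horizon is an arbitrary choice):

    x = 2.0
    for k in range(20):
        x = x - 1.0 / x     # Newton step for exp(x**2 / 2) = 0, which has no real root
        print(k, x)         # the iterates wander instead of converging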

Newton method – alternative view

Quadratic approximation:

φ(x) := f(xk) + ⟨∇f(xk), x − xk⟩ + (1/2)⟨∇2f(xk)(x − xk), x − xk⟩.

Assuming ∇2f(xk) ≻ 0, choose xk+1 as the argmin of φ(x):

φ′(xk+1) = ∇f(xk) + ∇2f(xk)(xk+1 − xk) = 0.

8 / 25

Newton method – convergence

◮ The method breaks down if ∇2f(xk) is not ≻ 0.
◮ Only locally convergent.

Example: Find the root of g(x) = x/√(1 + x²). Clearly, x∗ = 0.
Exercise: Analyze the behavior of the Newton method for this problem. Hint: consider the cases |x0| < 1, x0 = ±1, and |x0| > 1.

Damped Newton method

xk+1 = xk − αk [∇2f(xk)]−1 ∇f(xk)

9 / 25
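A sketch of the damped iteration with a simple backtracking choice of αk; the Armijo constants 1e-4 and 0.5 are conventional assumptions, and ∇2f(xk) ≻ 0 is assumed so that the Newton direction is a descent direction:

    import numpy as np

    def damped_newton(f, grad, hess, x0, tol=1e-8, max_iter=100):
        # Damped Newton: x_{k+1} = x_k - alpha_k [Hess f(x_k)]^{-1} grad f(x_k)
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            d = np.linalg.solve(hess(x), g)         # Newton direction
            alpha = 1.0
            while f(x - alpha * d) > f(x) - 1e-4 * alpha * (g @ d):
                alpha *= 0.5                        # backtrack until sufficient decrease
            x = x - alpha * d
        return x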

Newton – local convergence rate

◮ Suppose the method generates a sequence {xk} → x∗,
◮ where x∗ is a local min, i.e., ∇f(x∗) = 0 and ∇2f(x∗) ≻ 0.
◮ Let g(xk) ≡ ∇f(xk); Taylor's theorem: 0 = g(x∗) = g(xk) + ∇g(xk)(x∗ − xk) + o(‖xk − x∗‖).
◮ Multiply by [∇g(xk)]−1 to obtain xk − x∗ − [∇g(xk)]−1 g(xk) = o(‖xk − x∗‖).
◮ The Newton iteration is xk+1 = xk − [∇g(xk)]−1 g(xk), so xk+1 − x∗ = o(‖xk − x∗‖).
◮ So for xk ≠ x∗ we get

lim_{k→∞} ‖xk+1 − x∗‖ / ‖xk − x∗‖ = lim_{k→∞} o(‖xk − x∗‖) / ‖xk − x∗‖ = 0.

Local superlinear convergence rate.

10 / 25

Newton method – local convergence

Assumptions

  • Lipschitz Hessian: ‖∇2f(x) − ∇2f(y)‖ ≤ M ‖x − y‖
  • Local strong convexity: there exists a local minimum x∗ with ∇2f(x∗) ⪰ µI, µ > 0.
  • Locality: starting point x0 "close enough" to x∗

Theorem. Suppose x0 satisfies ‖x0 − x∗‖ < r := 2µ/(3M). Then ‖xk − x∗‖ < r for all k, and the Newton method converges quadratically:

‖xk+1 − x∗‖ ≤ M ‖xk − x∗‖² / (2(µ − M ‖xk − x∗‖)).

Reading assignment: Read §9.5.3 of Boyd-Vandenberghe.

11 / 25
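A small numeric check of this behavior on a hypothetical smooth, strongly convex function with minimizer x∗ = 0; the printed errors ‖xk − x∗‖ should roughly square at each step:

    import numpy as np

    # hypothetical test function: f(x) = sum_i (exp(x_i) + exp(-x_i)), minimized at x* = 0
    grad = lambda x: np.exp(x) - np.exp(-x)
    hess = lambda x: np.diag(np.exp(x) + np.exp(-x))

    x = np.full(3, 0.5)                  # start "close enough" to x* = 0
    for k in range(6):
        print(k, np.linalg.norm(x))      # error ||x_k - x*||, roughly squares each step
        x = x - np.linalg.solve(hess(x), grad(x))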


Quasi-Newton

12 / 25

Gradient and Newton

(Grad) xk+1 = xk − αk ∇f(xk), αk > 0
(Newton) xk+1 = xk − [∇2f(xk)]−1 ∇f(xk).

Viewpoint for the gradient method. Consider the approximation

φ1(x) := f(xk) + ⟨∇f(xk), x − xk⟩ + (1/(2α)) ‖x − xk‖².

The optimality condition yields

φ1′(x∗) = ∇f(xk) + (1/α)(x∗ − xk) = 0
x∗ = xk − α ∇f(xk).

If α ∈ (0, 1/L], φ1(x) is a global overestimator:

f(x) ≤ φ1(x), ∀x ∈ Rn.

13 / 25

Gradient and Newton

Viewpoint for the Newton method. Consider the quadratic approximation

φ2(x) := f(xk) + ⟨∇f(xk), x − xk⟩ + (1/2)⟨∇2f(xk)(x − xk), x − xk⟩.

The minimum of this function is x∗ = xk − [∇2f(xk)]−1 ∇f(xk).

Something better than φ1, less expensive than φ2?

14 / 25

Quasi-Newton methods

Generic quadratic model

φD(x) := f(xk) + ⟨∇f(xk), x − xk⟩ + (1/2)⟨Hk(x − xk), x − xk⟩.

◮ The matrix Hk ≻ 0 is some positive definite matrix.
◮ It leads to the optimum x∗ = xk − Hk⁻¹ ∇f(xk), i.e., x∗ = xk − Sk ∇f(xk) with Sk = Hk⁻¹.
◮ First-order methods that form a sequence of matrices {Hk} with Hk → ∇2f(x∗), where Hk is constructed using only gradient information, are called variable metric or quasi-Newton methods:

xk+1 = xk − Hk⁻¹ ∇f(xk),  k = 0, 1, . . .
xk+1 = xk − Sk ∇f(xk),  k = 0, 1, . . .

15 / 25

Quasi-Newton method

  • Choose x0 ∈ Rn. Let H0 = I.
    Compute f(x0) and ∇f(x0).
  • For k ≥ 0:
    1 descent direction: dk ← Sk ∇f(xk)
    2 stepsize: search for a good αk > 0
    3 update: xk+1 = xk − αk dk
    4 compute f(xk+1) and ∇f(xk+1)
    5 QN update: Sk → Sk+1

QN schemes differ in how Sk ≡ Hk⁻¹ is updated! (A code sketch of this generic loop follows below.)

16 / 25
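A sketch of the generic loop above, with the QN update of Sk left as a pluggable callback; the backtracking line search and its constants are illustrative assumptions, not part of the slide:

    import numpy as np

    def quasi_newton(f, grad, x0, update_S, tol=1e-8, max_iter=200):
        # Generic quasi-Newton loop: d_k = S_k grad f(x_k), x_{k+1} = x_k - alpha_k d_k
        x = np.asarray(x0, dtype=float)
        S = np.eye(x.size)                                   # S_0 = H_0^{-1} = I
        g = grad(x)
        for _ in range(max_iter):
            if np.linalg.norm(g) < tol:
                break
            d = S @ g                                        # 1. descent direction
            alpha = 1.0                                      # 2. stepsize by backtracking
            while f(x - alpha * d) > f(x) - 1e-4 * alpha * (g @ d):
                alpha *= 0.5
            x_new = x - alpha * d                            # 3. update
            g_new = grad(x_new)                              # 4. new gradient information
            S = update_S(S, x_new - x, g_new - g)            # 5. QN update: S_k -> S_{k+1}
            x, g = x_new, g_new
        return x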

Quasi-Newton methods

Secant equation / QN rule: Sk+1 (∇f(xk+1) − ∇f(xk)) = xk+1 − xk.

◮ Quadratic models from iteration k → k + 1:

φk(x) = ak + ⟨gk, x − xk⟩ + (1/2)⟨H(x − xk), x − xk⟩
φk+1(x) = ak+1 + ⟨gk+1, x − xk+1⟩ + (1/2)⟨H(x − xk+1), x − xk+1⟩

◮ φk′(x) − φk+1′(x) = gk − gk+1 + H(xk+1 − xk)
◮ Setting this to zero, we get gk+1 − gk = H(xk+1 − xk), i.e., S(gk+1 − gk) = xk+1 − xk.
◮ So we construct Hk → Hk+1 or Sk → Sk+1 to respect this.

17 / 25

Hessian updates

◮ Barzilai-Borwein stepsize. Let yk = gk+1 − gk, sk = xk+1 − xk:

min_H ‖H sk − yk‖ subject to H = αI.

◮ Davidon-Fletcher-Powell (DFP): with β := 1/⟨yk, sk⟩,

Hk+1 = (I − β yk skᵀ) Hk (I − β sk ykᵀ) + β yk ykᵀ
Sk+1 = Sk − (Sk yk ykᵀ Sk) / ⟨Sk yk, yk⟩ + β sk skᵀ

◮ Broyden-Fletcher-Goldfarb-Shanno (BFGS):

Sk+1 = (I − β sk ykᵀ) Sk (I − β yk skᵀ) + β sk skᵀ
Hk+1 = Hk − (Hk sk skᵀ Hk) / ⟨Hk sk, sk⟩ + β yk ykᵀ

BFGS is believed to be the most stable, best scheme.
◮ Notice that the updates are computationally "cheap".

18 / 25
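A sketch of the BFGS inverse update Sk → Sk+1 in the product form above; the curvature-condition guard is a common safeguard and an assumption here, not part of the slide:

    import numpy as np

    def bfgs_update(S, s, y):
        # BFGS inverse update: S_{k+1} = (I - beta s y^T) S_k (I - beta y s^T) + beta s s^T
        ys = y @ s
        if ys <= 1e-12:                        # skip the update if the curvature condition fails
            return S
        beta = 1.0 / ys
        V = np.eye(len(s)) - beta * np.outer(y, s)
        S_next = V.T @ S @ V + beta * np.outer(s, s)
        return S_next                          # satisfies the secant equation: S_next @ y ~= s

It plugs directly into the quasi_newton loop sketched earlier, e.g. quasi_newton(f, grad, x0, bfgs_update).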

Limited memory methods

Hessian storage and update has O(n²) cost.

Estimate Hk or Sk using only the previous few iterations; so essentially, use only O(mn) storage, where m ≈ 5-17.

◮ Each step of BFGS is: xk+1 = xk − αk Sk ∇f(xk)
◮ Sk is updated at every iteration using

Sk+1 = Vkᵀ Sk Vk + βk sk skᵀ

where, with sk := xk+1 − xk and yk := ∇f(xk+1) − ∇f(xk),

βk = 1/⟨yk, sk⟩,  Vk = I − βk yk skᵀ.

◮ We use m vector pairs (si, yi), for i = k − m, . . . , k − 1.

19 / 25

Limited memory methods

Unroll the Sk update loop for m iterations to obtain

Sk = (Vk−1ᵀ · · · Vk−mᵀ) Sk⁰ (Vk−m · · · Vk−1)
   + βk−m (Vk−1ᵀ · · · Vk−m+1ᵀ) sk−m sk−mᵀ (Vk−m+1 · · · Vk−1)
   + βk−m+1 (Vk−1ᵀ · · · Vk−m+2ᵀ) sk−m+1 sk−m+1ᵀ (Vk−m+2 · · · Vk−1)
   + · · · + βk−1 sk−1 sk−1ᵀ.

The ultimate aim is to efficiently compute Sk ∇f(xk).

Exercise: Implement a procedure to compute Sk ∇f(xk) efficiently (a sketch follows below).

◮ Typical choice: Sk⁰ = (sk−1ᵀ yk−1 / yk−1ᵀ yk−1) I
◮ This is related to the BB stepsize!

20 / 25
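One way to do the exercise: the standard two-loop recursion evaluates Sk ∇f(xk) directly from the m stored pairs (si, yi) without ever forming Sk. A sketch, assuming the pairs are stored oldest-to-newest in Python lists:

    import numpy as np

    def lbfgs_direction(grad_k, s_list, y_list):
        # Two-loop recursion: returns S_k @ grad_k from the stored (s_i, y_i) pairs,
        # without ever forming the n-by-n matrix S_k.
        q = np.asarray(grad_k, dtype=float).copy()
        alphas = []
        for s, y in zip(reversed(s_list), reversed(y_list)):      # newest to oldest
            a = (s @ q) / (y @ s)
            alphas.append(a)
            q = q - a * y
        if s_list:                                                # S_k^0 = (s^T y / y^T y) I
            gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
        else:
            gamma = 1.0
        r = gamma * q
        for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):   # oldest to newest
            b = (y @ r) / (y @ s)
            r = r + (a - b) * s
        return r        # r = S_k grad f(x_k); the step is x_{k+1} = x_k - alpha_k * r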

Constrained problems

Two-metric projection method

xk+1 = PX(xk − αkSk∇f(xk))

◮ Fundamental problem: not a descent iteration!
◮ We may have f(xk+1) > f(xk) for all αk > 0.
◮ The method might not even recognize a stationary point!

21 / 25


Failure of projected-Newton methods

[Figure: level sets of f in the (x1, x2) plane, with the labeled points xk, xk − Dk∇f(xk), x̄ = xk − (GᵀG)⁻¹(GᵀGxk − Gᵀh), and their projections P+[xk − Dk∇f(xk)] and P+[xk − (GᵀG)⁻¹(GᵀGxk − Gᵀh)].]

22 / 25

Constrained problems

◮ Projected gradient works! BUT
◮ Projected Newton or quasi-Newton does not work!
◮ More careful selection of Sk (or Hk) is needed.
◮ See, e.g., Bertsekas and Gafni (Projected QN) (1984).
◮ With simple bound constraints: LBFGS-B.

23 / 25
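Since projected gradient is the variant that does work, a minimal sketch for simple box constraints lo ≤ x ≤ hi, where the projection PX is a componentwise clip; the fixed stepsize is an illustrative assumption (1/L for an L-smooth f would do):

    import numpy as np

    def projected_gradient(grad, lo, hi, x0, step, max_iter=500):
        # Projected gradient for min f(x) s.t. lo <= x <= hi:
        # x <- P_X(x - step * grad f(x)), with P_X a clip onto the box
        x = np.clip(np.asarray(x0, dtype=float), lo, hi)
        for _ in range(max_iter):
            x = np.clip(x - step * grad(x), lo, hi)
        return x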


Nonsmooth problems

We did not cover many interesting ideas:
♠ Proximal Newton methods
♠ f(x) + r(x) problems (see book chapter)
♠ Nonsmooth BFGS – Lewis, Overton
♠ Nonsmooth LBFGS

24 / 25


References

♥ Y. Nesterov. Introductory Lectures on Convex Optimization (2004).
♥ J. Nocedal, S. J. Wright. Numerical Optimization (1999).
♥ M. Schmidt, D. Kim, S. Sra. Newton-type methods in machine learning. Chapter 13 in Optimization for Machine Learning (2011).

25 / 25