

slide-1
SLIDE 1

Basics of Numerical Optimization: Iterative Methods

Ju Sun

Computer Science & Engineering University of Minnesota, Twin Cities

February 13, 2020

1 / 43

slide-2
SLIDE 2

Find global minimum

1st-order necessary condition: Assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f (x0) = 0. x with ∇f (x) = 0: 1st-order stationary point (1OSP)

2 / 43

slide-3
SLIDE 3

Find global minimum

1st-order necessary condition: Assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0. x with ∇f(x) = 0: 1st-order stationary point (1OSP)

2nd-order necessary condition: Assume f is 2nd-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0 and ∇²f(x0) ⪰ 0.

x with ∇f(x) = 0 and ∇²f(x) ⪰ 0: 2nd-order stationary point (2OSP)

2 / 43

slide-4
SLIDE 4

Find global minimum

1st-order necessary condition: Assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0. x with ∇f(x) = 0: 1st-order stationary point (1OSP)

2nd-order necessary condition: Assume f is 2nd-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0 and ∇²f(x0) ⪰ 0.

x with ∇f(x) = 0 and ∇²f(x) ⪰ 0: 2nd-order stationary point (2OSP)

– Analytic method: find 1OSPs using the gradient first, then study them using the Hessian — for simple functions! e.g., f(x) = ‖y − Ax‖₂², or f(x, y) = x²y² − x³y + y² − 1

2 / 43

slide-5
SLIDE 5

Find global minimum

1st-order necessary condition: Assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0. x with ∇f(x) = 0: 1st-order stationary point (1OSP)

2nd-order necessary condition: Assume f is 2nd-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0 and ∇²f(x0) ⪰ 0.

x with ∇f(x) = 0 and ∇²f(x) ⪰ 0: 2nd-order stationary point (2OSP)

– Analytic method: find 1OSPs using the gradient first, then study them using the Hessian — for simple functions! e.g., f(x) = ‖y − Ax‖₂², or f(x, y) = x²y² − x³y + y² − 1
– Grid search: incurs O(ε⁻ⁿ) cost

2 / 43

slide-6
SLIDE 6

Find global minimum

1st-order necessary condition: Assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0. x with ∇f(x) = 0: 1st-order stationary point (1OSP)

2nd-order necessary condition: Assume f is 2nd-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0 and ∇²f(x0) ⪰ 0.

x with ∇f(x) = 0 and ∇²f(x) ⪰ 0: 2nd-order stationary point (2OSP)

– Analytic method: find 1OSPs using the gradient first, then study them using the Hessian — for simple functions! e.g., f(x) = ‖y − Ax‖₂², or f(x, y) = x²y² − x³y + y² − 1
– Grid search: incurs O(ε⁻ⁿ) cost
– Iterative methods: find 1OSPs/2OSPs by making consecutive small movements

2 / 43

slide-7
SLIDE 7

Iterative methods

Credit: aria42.com

Illustration of iterative methods on the contour/levelset plot (i.e., the function assumes the same value on each curve)

3 / 43

slide-8
SLIDE 8

Iterative methods

Credit: aria42.com

Illustration of iterative methods on the contour/levelset plot (i.e., the function assumes the same value on each curve) Two questions: what direction to move, and how far to move

3 / 43

slide-9
SLIDE 9

Iterative methods

Credit: aria42.com

Illustration of iterative methods on the contour/levelset plot (i.e., the function assumes the same value on each curve)
Two questions: what direction to move, and how far to move
Two possibilities:
– Line-search methods: direction first, size second

3 / 43

slide-10
SLIDE 10

Iterative methods

Credit: aria42.com

Illustration of iterative methods on the contour/levelset plot (i.e., the function assumes the same value on each curve)
Two questions: what direction to move, and how far to move
Two possibilities:
– Line-search methods: direction first, size second
– Trust-region methods: size first, direction second

3 / 43

slide-11
SLIDE 11

Outline

Classic line-search methods Advanced line-search methods Momentum methods Quasi-Newton methods Coordinate descent Conjugate gradient methods Trust-region methods

4 / 43

slide-12
SLIDE 12

Framework of line-search methods

A generic line-search algorithm
Input: initialization x0, stopping criterion (SC), k = 1
1: while SC not satisfied do
2:   choose a direction dk
3:   decide a step size tk
4:   make a step: xk = xk−1 + tk dk
5:   update counter: k = k + 1
6: end while

5 / 43

slide-13
SLIDE 13

Framework of line-search methods

A generic line-search algorithm
Input: initialization x0, stopping criterion (SC), k = 1
1: while SC not satisfied do
2:   choose a direction dk
3:   decide a step size tk
4:   make a step: xk = xk−1 + tk dk
5:   update counter: k = k + 1
6: end while

Four questions:
– How to choose direction dk?
– How to choose step size tk?
– Where to initialize?
– When to stop?

5 / 43

slide-14
SLIDE 14

How to choose a search direction?

We want to decrease the function value toward global minimum...

6 / 43

slide-15
SLIDE 15

How to choose a search direction?

We want to decrease the function value toward global minimum... shortsighted answer: find a direction to decrease most rapidly

for any fixed t > 0, using the 1st-order Taylor expansion: f(xk + t dk+1) − f(xk) ≈ t ⟨∇f(xk), dk+1⟩

6 / 43

slide-16
SLIDE 16

How to choose a search direction?

We want to decrease the function value toward global minimum... shortsighted answer: find a direction to decrease most rapidly

for any fixed t > 0, using the 1st-order Taylor expansion: f(xk + t dk+1) − f(xk) ≈ t ⟨∇f(xk), dk+1⟩

min_{‖v‖₂=1} ⟨∇f(xk), v⟩

6 / 43

slide-17
SLIDE 17

How to choose a search direction?

We want to decrease the function value toward global minimum... shortsighted answer: find a direction to decrease most rapidly

for any fixed t > 0, using the 1st-order Taylor expansion: f(xk + t dk+1) − f(xk) ≈ t ⟨∇f(xk), dk+1⟩

min_{‖v‖₂=1} ⟨∇f(xk), v⟩ ⟹ v = −∇f(xk)/‖∇f(xk)‖₂

6 / 43

slide-18
SLIDE 18

How to choose a search direction?

We want to decrease the function value toward global minimum... shortsighted answer: find a direction to decrease most rapidly

for any fixed t > 0, using the 1st-order Taylor expansion: f(xk + t dk+1) − f(xk) ≈ t ⟨∇f(xk), dk+1⟩

min_{‖v‖₂=1} ⟨∇f(xk), v⟩ ⟹ v = −∇f(xk)/‖∇f(xk)‖₂

Set dk = −∇f(xk): gradient/steepest descent: xk+1 = xk − t∇f(xk)

6 / 43
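To make the update concrete, here is a minimal gradient-descent sketch in Python/NumPy with a fixed step size t; the quadratic test problem and all constants are illustrative placeholders, not taken from the slides.

```python
import numpy as np

def gradient_descent(grad_f, x0, t=0.1, max_iter=500, tol=1e-8):
    """Plain gradient descent: x_{k+1} = x_k - t * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:   # stop near a 1OSP
            break
        x = x - t * g
    return x

# Example: minimize 1/2 x^T A x - b^T x, whose gradient is A x - b (placeholder data)
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
x_star = gradient_descent(lambda x: A @ x - b, x0=np.zeros(2))
```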

slide-19
SLIDE 19

Gradient descent

minx x⊺Ax + b⊺x

7 / 43

slide-20
SLIDE 20

Gradient descent

minx x⊺Ax + b⊺x; typical zig-zag path

7 / 43

slide-21
SLIDE 21

Gradient descent

minx x⊺Ax + b⊺x; typical zig-zag path; conditioning affects the path length

7 / 43

slide-22
SLIDE 22

Gradient descent

minx x⊺Ax + b⊺x; typical zig-zag path; conditioning affects the path length
– remember directional curvature? v⊺∇²f(x)v = d²/dt² f(x + tv)

7 / 43

slide-23
SLIDE 23

Gradient descent

minx x⊺Ax + b⊺x; typical zig-zag path; conditioning affects the path length
– remember directional curvature? v⊺∇²f(x)v = d²/dt² f(x + tv)

– large curvature ↔ narrow valley

7 / 43

slide-24
SLIDE 24

Gradient descent

minx x⊺Ax + b⊺x; typical zig-zag path; conditioning affects the path length
– remember directional curvature? v⊺∇²f(x)v = d²/dt² f(x + tv)

– large curvature ↔ narrow valley – directional curvatures encoded in the Hessian

7 / 43

slide-25
SLIDE 25

How to choose a search direction?

We want to decrease the function value toward global minimum... shortsighted answer: find a direction to decrease most rapidly

8 / 43

slide-26
SLIDE 26

How to choose a search direction?

We want to decrease the function value toward global minimum... shortsighted answer: find a direction to decrease most rapidly farsighted answer: find a direction based on both gradient and Hessian

8 / 43

slide-27
SLIDE 27

How to choose a search direction?

We want to decrease the function value toward the global minimum... shortsighted answer: find a direction to decrease most rapidly; farsighted answer: find a direction based on both gradient and Hessian

for any fixed t > 0, using the 2nd-order Taylor expansion: f(xk + tv) − f(xk) ≈ t⟨∇f(xk), v⟩ + ½t²⟨v, ∇²f(xk)v⟩

8 / 43
slide-28
SLIDE 28

How to choose a search direction?

We want to decrease the function value toward the global minimum... shortsighted answer: find a direction to decrease most rapidly; farsighted answer: find a direction based on both gradient and Hessian

for any fixed t > 0, using the 2nd-order Taylor expansion: f(xk + tv) − f(xk) ≈ t⟨∇f(xk), v⟩ + ½t²⟨v, ∇²f(xk)v⟩

minimizing the right side ⟹ v = −t⁻¹[∇²f(xk)]⁻¹∇f(xk)

8 / 43

slide-29
SLIDE 29

How to choose a search direction?

We want to decrease the function value toward the global minimum... shortsighted answer: find a direction to decrease most rapidly; farsighted answer: find a direction based on both gradient and Hessian

for any fixed t > 0, using the 2nd-order Taylor expansion: f(xk + tv) − f(xk) ≈ t⟨∇f(xk), v⟩ + ½t²⟨v, ∇²f(xk)v⟩

minimizing the right side ⟹ v = −t⁻¹[∇²f(xk)]⁻¹∇f(xk)

grad desc: green; Newton: red

8 / 43

slide-30
SLIDE 30

How to choose a search direction?

We want to decrease the function value toward the global minimum... shortsighted answer: find a direction to decrease most rapidly; farsighted answer: find a direction based on both gradient and Hessian

for any fixed t > 0, using the 2nd-order Taylor expansion: f(xk + tv) − f(xk) ≈ t⟨∇f(xk), v⟩ + ½t²⟨v, ∇²f(xk)v⟩

minimizing the right side ⟹ v = −t⁻¹[∇²f(xk)]⁻¹∇f(xk)

grad desc: green; Newton: red

Set dk = −[∇²f(xk)]⁻¹∇f(xk). Newton's method: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk); t can be set to 1.

8 / 43
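A hedged sketch of the Newton update above in Python/NumPy; solving a linear system instead of forming the inverse is a standard implementation choice, and the test function is a made-up placeholder.

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, t=1.0, max_iter=50, tol=1e-10):
    """Newton's method: x_{k+1} = x_k - t * [hess f(x_k)]^{-1} grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:
            break
        d = np.linalg.solve(hess_f(x), g)  # solve a linear system; never form the inverse
        x = x - t * d
    return x

# Placeholder example: f(x, y) = x^4 + y^2 (small jitter keeps the Hessian invertible)
grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2 + 1e-8, 0.0], [0.0, 2.0]])
print(newton_method(grad, hess, np.array([1.0, 1.0])))
```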

slide-31
SLIDE 31

Why called Newton’s method?

Newton’s method: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk). Recall Newton’s method for root-finding: xk+1 = xk − [f′(xk)]⁻¹ f(xk)

9 / 43

slide-32
SLIDE 32

Why called Newton’s method?

Newton’s method: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk). Recall Newton’s method for root-finding: xk+1 = xk − [f′(xk)]⁻¹ f(xk). Newton’s method for solving the nonlinear system f(x) = 0: xk+1 = xk − [Jf(xk)]† f(xk)

9 / 43

slide-33
SLIDE 33

Why called Newton’s method?

Newton’s method: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk). Recall Newton’s method for root-finding: xk+1 = xk − [f′(xk)]⁻¹ f(xk). Newton’s method for solving the nonlinear system f(x) = 0: xk+1 = xk − [Jf(xk)]† f(xk). Newton’s method for solving ∇f(x) = 0: xk+1 = xk − [∇²f(xk)]⁻¹ ∇f(xk)

9 / 43

slide-34
SLIDE 34

How to choose a search direction?

grad desc: green; Newton: red

Newton’s method takes fewer steps

10 / 43

slide-35
SLIDE 35

How to choose a search direction?

grad desc: green; Newton: red

Newton’s method takes fewer steps

nearsighted choice: cost O(n) per step; gradient/steepest descent: xk+1 = xk − t∇f(xk)

10 / 43

slide-36
SLIDE 36

How to choose a search direction?

grad desc: green; Newton: red

Newton’s method takes fewer steps

nearsighted choice: cost O(n) per step; gradient/steepest descent: xk+1 = xk − t∇f(xk)
farsighted choice: cost O(n³) per step; Newton’s method: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk)

10 / 43

slide-37
SLIDE 37

How to choose a search direction?

grad desc: green; Newton: red

Newton’s method takes fewer steps

nearsighted choice: cost O(n) per step; gradient/steepest descent: xk+1 = xk − t∇f(xk)
farsighted choice: cost O(n³) per step; Newton’s method: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk)
Implication: plain Newton is never used for large-scale problems. More on this later ...

10 / 43

slide-38
SLIDE 38

Problems with Newton’s method

Newton’s method: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk)

11 / 43

slide-39
SLIDE 39

Problems with Newton’s method

Newton’s method: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk)

for any fixed t > 0, using the 2nd-order Taylor expansion: f(xk + tv) − f(xk) ≈ t⟨∇f(xk), v⟩ + ½t²⟨v, ∇²f(xk)v⟩; minimizing the right side ⟹ v = −t⁻¹[∇²f(xk)]⁻¹∇f(xk)

11 / 43

slide-40
SLIDE 40

Problems with Newton’s method

Newton’s method: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk)

for any fixed t > 0, using the 2nd-order Taylor expansion: f(xk + tv) − f(xk) ≈ t⟨∇f(xk), v⟩ + ½t²⟨v, ∇²f(xk)v⟩; minimizing the right side ⟹ v = −t⁻¹[∇²f(xk)]⁻¹∇f(xk)

– ∇²f(xk) may be non-invertible

11 / 43

slide-41
SLIDE 41

Problems with Newton’s method

Newton’s method: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk)

for any fixed t > 0, using the 2nd-order Taylor expansion: f(xk + tv) − f(xk) ≈ t⟨∇f(xk), v⟩ + ½t²⟨v, ∇²f(xk)v⟩; minimizing the right side ⟹ v = −t⁻¹[∇²f(xk)]⁻¹∇f(xk)

– ∇²f(xk) may be non-invertible
– the minimum value is −½⟨∇f(xk), [∇²f(xk)]⁻¹∇f(xk)⟩. If ∇²f(xk) is not positive definite, this may be positive

11 / 43

slide-42
SLIDE 42

Problems with Newton’s method

Newton’s method: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk)

for any fixed t > 0, using the 2nd-order Taylor expansion: f(xk + tv) − f(xk) ≈ t⟨∇f(xk), v⟩ + ½t²⟨v, ∇²f(xk)v⟩; minimizing the right side ⟹ v = −t⁻¹[∇²f(xk)]⁻¹∇f(xk)

– ∇²f(xk) may be non-invertible
– the minimum value is −½⟨∇f(xk), [∇²f(xk)]⁻¹∇f(xk)⟩. If ∇²f(xk) is not positive definite, this may be positive
– solution: e.g., modify the Hessian to ∇²f(xk) + τI with τ sufficiently large

11 / 43

slide-43
SLIDE 43

How to choose step size?

xk = xk−1 + tkdk

12 / 43

slide-44
SLIDE 44

How to choose step size?

xk = xk−1 + tkdk – Naive choice: sufficiently small constant t for all k

12 / 43

slide-45
SLIDE 45

How to choose step size?

xk = xk−1 + tkdk – Naive choice: sufficiently small constant t for all k – Robust and practical choice: back-tracking line search

12 / 43

slide-46
SLIDE 46

How to choose step size?

xk = xk−1 + tk dk
– Naive choice: a sufficiently small constant t for all k
– Robust and practical choice: back-tracking line search
Intuition for back-tracking line search:
– By Taylor's theorem, f(xk + tdk) = f(xk) + t⟨∇f(xk), dk⟩ + o(t‖dk‖₂) when t is sufficiently small — t⟨∇f(xk), dk⟩ dictates the value decrease

12 / 43

slide-47
SLIDE 47

How to choose step size?

xk = xk−1 + tk dk
– Naive choice: a sufficiently small constant t for all k
– Robust and practical choice: back-tracking line search
Intuition for back-tracking line search:
– By Taylor's theorem, f(xk + tdk) = f(xk) + t⟨∇f(xk), dk⟩ + o(t‖dk‖₂) when t is sufficiently small — t⟨∇f(xk), dk⟩ dictates the value decrease
– But we also want t as large as possible to make rapid progress

12 / 43

slide-48
SLIDE 48

How to choose step size?

xk = xk−1 + tk dk
– Naive choice: a sufficiently small constant t for all k
– Robust and practical choice: back-tracking line search
Intuition for back-tracking line search:
– By Taylor's theorem, f(xk + tdk) = f(xk) + t⟨∇f(xk), dk⟩ + o(t‖dk‖₂) when t is sufficiently small — t⟨∇f(xk), dk⟩ dictates the value decrease
– But we also want t as large as possible to make rapid progress
– idea: find the largest possible t∗ that makes sure f(xk + t∗dk) − f(xk) ≤ c t∗⟨∇f(xk), dk⟩ (key condition) for a chosen parameter c ∈ (0, 1), and no less

12 / 43

slide-49
SLIDE 49

How to choose step size?

xk = xk−1 + tk dk
– Naive choice: a sufficiently small constant t for all k
– Robust and practical choice: back-tracking line search
Intuition for back-tracking line search:
– By Taylor's theorem, f(xk + tdk) = f(xk) + t⟨∇f(xk), dk⟩ + o(t‖dk‖₂) when t is sufficiently small — t⟨∇f(xk), dk⟩ dictates the value decrease
– But we also want t as large as possible to make rapid progress
– idea: find the largest possible t∗ that makes sure f(xk + t∗dk) − f(xk) ≤ c t∗⟨∇f(xk), dk⟩ (key condition) for a chosen parameter c ∈ (0, 1), and no less
– details: start from t = 1. If the key condition is not satisfied, set t = ρt for a chosen parameter ρ ∈ (0, 1).

12 / 43

slide-50
SLIDE 50

Back-tracking line search

A widely implemented strategy in numerical optimization packages

Back-tracking line search
Input: initial t > 0, ρ ∈ (0, 1), c ∈ (0, 1)
1: while f(xk + tdk) − f(xk) ≥ ct⟨∇f(xk), dk⟩ do
2:   t = ρt
3: end while
Output: tk = t.

13 / 43
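The back-tracking loop above translates almost line by line into code; a minimal sketch assuming NumPy, with ρ and c set to commonly used placeholder values.

```python
import numpy as np

def backtracking(f, grad_fx, x, d, t0=1.0, rho=0.5, c=1e-4):
    """Shrink t until f(x + t d) - f(x) <= c * t * <grad f(x), d> (key condition)."""
    t = t0
    fx = f(x)
    slope = np.dot(grad_fx, d)                # <grad f(x), d>, negative for a descent direction
    while f(x + t * d) - fx > c * t * slope:  # key condition not yet satisfied
        t *= rho
    return t

# Usage with a gradient-descent direction d = -grad f(x) on a toy quadratic (placeholder)
f = lambda x: 0.5 * x @ x
x = np.array([3.0, -1.0])
g = x.copy()                                  # gradient of 0.5 * ||x||^2 is x
t_k = backtracking(f, g, x, -g)
```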

slide-51
SLIDE 51

Where to initialize?

convex vs. nonconvex functions

14 / 43

slide-52
SLIDE 52

Where to initialize?

convex vs. nonconvex functions – Convex: most iterative methods converge to the global min no matter the initialization

14 / 43

slide-53
SLIDE 53

Where to initialize?

convex vs. nonconvex functions – Convex: most iterative methods converge to the global min no matter the initialization – Nonconvex: initialization matters a lot. Common heuristics: random initialization, multiple independent runs

14 / 43

slide-54
SLIDE 54

Where to initialize?

convex vs. nonconvex functions – Convex: most iterative methods converge to the global min no matter the initialization – Nonconvex: initialization matters a lot. Common heuristics: random initialization, multiple independent runs – Nonconvex: clever initialization is possible with certain assumptions on the data:

14 / 43

slide-55
SLIDE 55

Where to initialize?

convex vs. nonconvex functions – Convex: most iterative methods converge to the global min no matter the initialization – Nonconvex: initialization matters a lot. Common heuristics: random initialization, multiple independent runs – Nonconvex: clever initialization is possible with certain assumptions on the data: https://sunju.org/research/nonconvex/ and sometimes random initialization works!

14 / 43

slide-56
SLIDE 56

When to stop?

1st-order necessary condition: Assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0.
2nd-order necessary condition: Assume f is 2nd-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0 and ∇²f(x0) ⪰ 0.

15 / 43

slide-57
SLIDE 57

When to stop?

1st-order necessary condition: Assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0.
2nd-order necessary condition: Assume f is 2nd-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0 and ∇²f(x0) ⪰ 0.

Fix some positive tolerance values εg, εH, εf, εv. Possibilities:

15 / 43

slide-58
SLIDE 58

When to stop?

1st-order necessary condition: Assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0.
2nd-order necessary condition: Assume f is 2nd-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0 and ∇²f(x0) ⪰ 0.

Fix some positive tolerance values εg, εH, εf, εv. Possibilities:
– ‖∇f(xk)‖₂ ≤ εg

15 / 43

slide-59
SLIDE 59

When to stop?

1st-order necessary condition: Assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0.
2nd-order necessary condition: Assume f is 2nd-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0 and ∇²f(x0) ⪰ 0.

Fix some positive tolerance values εg, εH, εf, εv. Possibilities:
– ‖∇f(xk)‖₂ ≤ εg
– ‖∇f(xk)‖₂ ≤ εg and λmin(∇²f(xk)) ≥ −εH

15 / 43

slide-60
SLIDE 60

When to stop?

1st-order necessary condition: Assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0.
2nd-order necessary condition: Assume f is 2nd-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0 and ∇²f(x0) ⪰ 0.

Fix some positive tolerance values εg, εH, εf, εv. Possibilities:
– ‖∇f(xk)‖₂ ≤ εg
– ‖∇f(xk)‖₂ ≤ εg and λmin(∇²f(xk)) ≥ −εH
– |f(xk) − f(xk−1)| ≤ εf

15 / 43

slide-61
SLIDE 61

When to stop?

1st-order necessary condition: Assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0.
2nd-order necessary condition: Assume f is 2nd-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0 and ∇²f(x0) ⪰ 0.

Fix some positive tolerance values εg, εH, εf, εv. Possibilities:
– ‖∇f(xk)‖₂ ≤ εg
– ‖∇f(xk)‖₂ ≤ εg and λmin(∇²f(xk)) ≥ −εH
– |f(xk) − f(xk−1)| ≤ εf
– ‖xk − xk−1‖₂ ≤ εv

15 / 43
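A small sketch of how the listed criteria might be combined in code; the tolerance values are arbitrary placeholders, and the eigenvalue test on ∇²f(xk) is omitted since it needs the Hessian.

```python
import numpy as np

def should_stop(grad_k, f_k, f_prev, x_k, x_prev,
                eps_g=1e-6, eps_f=1e-10, eps_v=1e-10):
    """True if any of the gradient-norm, function-change, or iterate-change tests fires."""
    return (np.linalg.norm(grad_k) <= eps_g
            or abs(f_k - f_prev) <= eps_f
            or np.linalg.norm(x_k - x_prev) <= eps_v)
```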

slide-62
SLIDE 62

Nonconvex optimization is hard

Nonconvex: Even computing (verifying!) a local minimizer is NP-hard! (see, e.g., [Murty and Kabadi, 1987])

16 / 43

slide-63
SLIDE 63

Nonconvex optimization is hard

Nonconvex: Even computing (verifying!) a local minimizer is NP-hard! (see, e.g., [Murty and Kabadi, 1987])
2nd-order sufficient: ∇f(x0) = 0 and ∇²f(x0) ≻ 0; 2nd-order necessary: ∇f(x0) = 0 and ∇²f(x0) ⪰ 0

16 / 43

slide-64
SLIDE 64

Nonconvex optimization is hard

Nonconvex: Even computing (verifying!) a local minimizer is NP-hard! (see, e.g., [Murty and Kabadi, 1987])
2nd-order sufficient: ∇f(x0) = 0 and ∇²f(x0) ≻ 0; 2nd-order necessary: ∇f(x0) = 0 and ∇²f(x0) ⪰ 0
Cases in between: local shapes around SOSPs are determined by spectral properties of higher-order derivative tensors, calculating which is hard [Hillar and Lim, 2013]!

16 / 43

slide-65
SLIDE 65

Outline

Classic line-search methods Advanced line-search methods Momentum methods Quasi-Newton methods Coordinate descent Conjugate gradient methods Trust-region methods

17 / 43

slide-66
SLIDE 66

Outline

Classic line-search methods Advanced line-search methods Momentum methods Quasi-Newton methods Coordinate descent Conjugate gradient methods Trust-region methods

18 / 43

slide-67
SLIDE 67

Why momentum?

Credit: Princeton ELE522

– GD is cheap (O(n) per step), but its overall convergence is sensitive to conditioning
– Newton's convergence is not sensitive to conditioning, but each step is expensive (O(n³))

19 / 43

slide-68
SLIDE 68

Why momentum?

Credit: Princeton ELE522

– GD is cheap (O(n) per step), but its overall convergence is sensitive to conditioning
– Newton's convergence is not sensitive to conditioning, but each step is expensive (O(n³))
A cheap way to achieve faster convergence?

19 / 43

slide-69
SLIDE 69

Why momentum?

Credit: Princeton ELE522

– GD is cheap (O(n) per step), but its overall convergence is sensitive to conditioning
– Newton's convergence is not sensitive to conditioning, but each step is expensive (O(n³))
A cheap way to achieve faster convergence? Answer: using historical information

19 / 43

slide-70
SLIDE 70

Heavy ball method

In physics, a heavy object has large inertia/momentum — resistance to changing velocity.

20 / 43

slide-71
SLIDE 71

Heavy ball method

In physics, a heavy object has large inertia/momentum — resistance to changing velocity.

xk+1 = xk − αk∇f(xk) + βk(xk − xk−1), where the last term βk(xk − xk−1) is the momentum; due to Polyak

20 / 43

slide-72
SLIDE 72

Heavy ball method

In physics, a heavy object has large inertia/momentum — resistance to changing velocity.

xk+1 = xk − αk∇f(xk) + βk(xk − xk−1), where the last term βk(xk − xk−1) is the momentum; due to Polyak

Credit: Princeton ELE522

History helps to smooth out the zig-zag path!

20 / 43
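A minimal sketch of the heavy-ball update in Python/NumPy; αk and βk are held constant for simplicity, which is a simplifying assumption, not something prescribed by the slide.

```python
import numpy as np

def heavy_ball(grad_f, x0, alpha=0.05, beta=0.9, max_iter=500):
    """Polyak's heavy ball: x_{k+1} = x_k - alpha * grad_f(x_k) + beta * (x_k - x_{k-1})."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - alpha * grad_f(x) + beta * (x - x_prev)  # gradient step + momentum
        x_prev, x = x, x_new
    return x
```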

slide-73
SLIDE 73

Nesterov’s accelerated gradient methods

Another version, due to Y. Nesterov xk+1 = xk + βk (xk − xk−1) − αk∇f (xk + βk (xk − xk−1))

21 / 43

slide-74
SLIDE 74

Nesterov’s accelerated gradient methods

Another version, due to Y. Nesterov xk+1 = xk + βk (xk − xk−1) − αk∇f (xk + βk (xk − xk−1))

Credit: Stanford CS231N

21 / 43

slide-75
SLIDE 75

Nesterov’s accelerated gradient methods

Another version, due to Y. Nesterov xk+1 = xk + βk (xk − xk−1) − αk∇f (xk + βk (xk − xk−1))

Credit: Stanford CS231N

For more info, see Chap 10 of [Beck, 2017] and Chap 2 of [Nesterov, 2018].

21 / 43
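The corresponding sketch for Nesterov's update, again with constant αk, βk as a simplifying assumption; the only change from the heavy ball is that the gradient is evaluated at the look-ahead point.

```python
import numpy as np

def nesterov_agd(grad_f, x0, alpha=0.05, beta=0.9, max_iter=500):
    """x_{k+1} = x_k + beta*(x_k - x_{k-1}) - alpha * grad_f(x_k + beta*(x_k - x_{k-1}))."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        y = x + beta * (x - x_prev)            # look-ahead (extrapolated) point
        x_prev, x = x, y - alpha * grad_f(y)   # gradient taken at the look-ahead point
    return x
```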

slide-76
SLIDE 76

Outline

Classic line-search methods Advanced line-search methods Momentum methods Quasi-Newton methods Coordinate descent Conjugate gradient methods Trust-region methods

22 / 43

slide-77
SLIDE 77

Quasi-Newton methods

quasi-: seemingly; apparently but not really. Newton's method costs O(n²) storage and O(n³) computation per step: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk)

23 / 43

slide-78
SLIDE 78

Quasi-Newton methods

quasi-: seemingly; apparently but not really. Newton's method costs O(n²) storage and O(n³) computation per step: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk)

Idea: approximate ∇²f(xk) or [∇²f(xk)]⁻¹ to allow efficient storage and computation — Quasi-Newton methods

23 / 43

slide-79
SLIDE 79

Quasi-Newton methods

quasi-: seemingly; apparently but not really. Newton's method costs O(n²) storage and O(n³) computation per step: xk+1 = xk − t[∇²f(xk)]⁻¹∇f(xk)

Idea: approximate ∇²f(xk) or [∇²f(xk)]⁻¹ to allow efficient storage and computation — Quasi-Newton methods

Choose Hk to approximate ∇²f(xk) so that we
– avoid calculation of second derivatives
– simplify matrix inversion, i.e., computing the search direction

23 / 43

slide-80
SLIDE 80

Quasi-Newton methods

– Different variants differ in how to compute Hk+1
– Normally Hk⁻¹ or its factorized version is stored to simplify calculation of ∆xk

Credit: UCLA ECE236C

24 / 43

slide-81
SLIDE 81

BFGS method

Broyden–Fletcher–Goldfarb–Shanno (BFGS) method

25 / 43

slide-82
SLIDE 82

BFGS method

Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. Cost of update: O(n²) (vs. O(n³) in Newton's method); storage: O(n²)

25 / 43

slide-83
SLIDE 83

BFGS method

Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. Cost of update: O(n²) (vs. O(n³) in Newton's method); storage: O(n²)

To derive the update equations, three conditions are imposed:
– secant condition: Hk+1 sk = yk (think of a 1st-order Taylor expansion of ∇f)
– curvature condition: sk⊺yk > 0, to ensure that Hk+1 ≻ 0 if Hk ≻ 0
– Hk+1 and Hk are close in an appropriate sense
See Chap 6 of [Nocedal and Wright, 2006]

Credit: UCLA ECE236C

25 / 43
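A sketch of a basic BFGS loop that maintains the inverse-Hessian approximation directly; the rank-two update is the standard formula from [Nocedal and Wright, 2006], while the unit step size (no line search) and the identity initialization are simplifications of mine.

```python
import numpy as np

def bfgs_inverse_update(H_inv, s, y):
    """One BFGS update of the inverse-Hessian approximation, O(n^2).
    s = x_{k+1} - x_k, y = grad f(x_{k+1}) - grad f(x_k); requires s^T y > 0."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H_inv @ V.T + rho * np.outer(s, s)

def bfgs(grad_f, x0, max_iter=100, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    H_inv = np.eye(len(x))                 # initial guess for the inverse Hessian
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        d = -H_inv @ g                     # quasi-Newton direction (matrix-vector product)
        x_new = x + d                      # a line search would normally choose the step
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:                  # curvature condition keeps H_inv positive definite
            H_inv = bfgs_inverse_update(H_inv, s, y)
        x, g = x_new, g_new
    return x
```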

slide-84
SLIDE 84

Limited-memory BFGS (L-BFGS)

26 / 43

slide-85
SLIDE 85

Limited-memory BFGS (L-BFGS)

Cost of update: O(mn) (vs. O(n²) in BFGS); storage: O(mn) (vs. O(n²) in BFGS) — linear in the dimension n! Recall the cost of GD? See Chap 7 of [Nocedal and Wright, 2006]. Credit: UCLA ECE236C

26 / 43

slide-86
SLIDE 86

Outline

Classic line-search methods Advanced line-search methods Momentum methods Quasi-Newton methods Coordinate descent Conjugate gradient methods Trust-region methods

27 / 43

slide-87
SLIDE 87

Block coordinate descent

Consider a function f (x1, . . . , xp) with x1 ∈ Rn1, . . . , xp ∈ Rnp

28 / 43

slide-88
SLIDE 88

Block coordinate descent

Consider a function f(x1, . . . , xp) with x1 ∈ Rn1, . . . , xp ∈ Rnp

A generic block coordinate descent algorithm
Input: initialization (x1,0, . . . , xp,0) (the 2nd subscript indexes the iteration number)
1: for k = 1, 2, . . . do
2:   Pick a block index i ∈ {1, . . . , p}
3:   Minimize wrt the chosen block: xi,k = arg min_{ξ ∈ Rni} f(x1,k−1, . . . , xi−1,k−1, ξ, xi+1,k−1, . . . , xp,k−1)
4:   Leave other blocks unchanged: xj,k = xj,k−1 ∀ j ≠ i
5: end for

28 / 43

slide-89
SLIDE 89

Block coordinate descent

Consider a function f(x1, . . . , xp) with x1 ∈ Rn1, . . . , xp ∈ Rnp

A generic block coordinate descent algorithm
Input: initialization (x1,0, . . . , xp,0) (the 2nd subscript indexes the iteration number)
1: for k = 1, 2, . . . do
2:   Pick a block index i ∈ {1, . . . , p}
3:   Minimize wrt the chosen block: xi,k = arg min_{ξ ∈ Rni} f(x1,k−1, . . . , xi−1,k−1, ξ, xi+1,k−1, . . . , xp,k−1)
4:   Leave other blocks unchanged: xj,k = xj,k−1 ∀ j ≠ i
5: end for

– Also called alternating direction/minimization methods

28 / 43

slide-90
SLIDE 90

Block coordinate descent

Consider a function f(x1, . . . , xp) with x1 ∈ Rn1, . . . , xp ∈ Rnp

A generic block coordinate descent algorithm
Input: initialization (x1,0, . . . , xp,0) (the 2nd subscript indexes the iteration number)
1: for k = 1, 2, . . . do
2:   Pick a block index i ∈ {1, . . . , p}
3:   Minimize wrt the chosen block: xi,k = arg min_{ξ ∈ Rni} f(x1,k−1, . . . , xi−1,k−1, ξ, xi+1,k−1, . . . , xp,k−1)
4:   Leave other blocks unchanged: xj,k = xj,k−1 ∀ j ≠ i
5: end for

– Also called alternating direction/minimization methods
– When n1 = n2 = · · · = np = 1, called coordinate descent

28 / 43

slide-91
SLIDE 91

Block coordinate descent

Consider a function f(x1, . . . , xp) with x1 ∈ Rn1, . . . , xp ∈ Rnp

A generic block coordinate descent algorithm
Input: initialization (x1,0, . . . , xp,0) (the 2nd subscript indexes the iteration number)
1: for k = 1, 2, . . . do
2:   Pick a block index i ∈ {1, . . . , p}
3:   Minimize wrt the chosen block: xi,k = arg min_{ξ ∈ Rni} f(x1,k−1, . . . , xi−1,k−1, ξ, xi+1,k−1, . . . , xp,k−1)
4:   Leave other blocks unchanged: xj,k = xj,k−1 ∀ j ≠ i
5: end for

– Also called alternating direction/minimization methods
– When n1 = n2 = · · · = np = 1, called coordinate descent
– Minimization in Line 3 can be inexact: e.g., xi,k = xi,k−1 − tk ∂f/∂ξ (x1,k−1, . . . , xi−1,k−1, xi,k−1, xi+1,k−1, . . . , xp,k−1)

28 / 43

slide-92
SLIDE 92

Block coordinate descent

Consider a function f(x1, . . . , xp) with x1 ∈ Rn1, . . . , xp ∈ Rnp

A generic block coordinate descent algorithm
Input: initialization (x1,0, . . . , xp,0) (the 2nd subscript indexes the iteration number)
1: for k = 1, 2, . . . do
2:   Pick a block index i ∈ {1, . . . , p}
3:   Minimize wrt the chosen block: xi,k = arg min_{ξ ∈ Rni} f(x1,k−1, . . . , xi−1,k−1, ξ, xi+1,k−1, . . . , xp,k−1)
4:   Leave other blocks unchanged: xj,k = xj,k−1 ∀ j ≠ i
5: end for

– Also called alternating direction/minimization methods
– When n1 = n2 = · · · = np = 1, called coordinate descent
– Minimization in Line 3 can be inexact: e.g., xi,k = xi,k−1 − tk ∂f/∂ξ (x1,k−1, . . . , xi−1,k−1, xi,k−1, xi+1,k−1, . . . , xp,k−1)
– In Line 2, there are many different ways of picking an index, e.g., cyclic, randomized, weighted sampling, etc.

28 / 43

slide-93
SLIDE 93

Block coordinate descent: examples

Least-squares: minx f(x) = ‖y − Ax‖₂²
– ‖y − Ax‖₂² = ‖y − A−i x−i − ai xi‖₂²
– coordinate descent: minξ∈R ‖y − A−i x−i − ai ξ‖₂² ⟹ xi,+ = ⟨y − A−i x−i, ai⟩ / ‖ai‖₂²
(A−i is A with the i-th column removed; x−i is x with the i-th coordinate removed)

29 / 43

slide-94
SLIDE 94

Block coordinate descent: examples

Least-squares: minx f(x) = ‖y − Ax‖₂²
– ‖y − Ax‖₂² = ‖y − A−i x−i − ai xi‖₂²
– coordinate descent: minξ∈R ‖y − A−i x−i − ai ξ‖₂² ⟹ xi,+ = ⟨y − A−i x−i, ai⟩ / ‖ai‖₂²
(A−i is A with the i-th column removed; x−i is x with the i-th coordinate removed)

Matrix factorization: minA,B ‖Y − AB‖F²
– Two groups of variables; consider block coordinate descent
– Updates: A+ = Y B†, B+ = A†Y ((·)† denotes the matrix pseudoinverse)

29 / 43
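A sketch of cyclic coordinate descent for the least-squares example above, keeping a running residual so that each coordinate update costs O(m); the random test data is a placeholder.

```python
import numpy as np

def cd_least_squares(A, y, n_sweeps=100):
    """Cyclic coordinate descent for min_x ||y - A x||_2^2:
    x_i <- <y - A_{-i} x_{-i}, a_i> / ||a_i||_2^2 with all other coordinates fixed."""
    m, n = A.shape
    x = np.zeros(n)
    r = y - A @ x                      # running residual y - A x
    for _ in range(n_sweeps):
        for i in range(n):
            a_i = A[:, i]
            r += a_i * x[i]            # now r = y - A_{-i} x_{-i}
            x[i] = (r @ a_i) / (a_i @ a_i)
            r -= a_i * x[i]            # restore the full residual with the new x_i
    return x

# Placeholder usage on random data
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
x_cd = cd_least_squares(A, y)
```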

slide-95
SLIDE 95

Why block coordinate descent?

– may work with constrained and non-differentiable problems (e.g., minA,B ‖Y − AB‖F² s.t. A orthogonal; Lasso: minx ‖y − Ax‖₂² + λ‖x‖₁)

30 / 43

slide-96
SLIDE 96

Why block coordinate descent?

– may work with constrained and non-differentiable problems (e.g., minA,B ‖Y − AB‖F² s.t. A orthogonal; Lasso: minx ‖y − Ax‖₂² + λ‖x‖₁)

– may be faster than gradient descent or Newton (next)

30 / 43

slide-97
SLIDE 97

Why block coordinate descent?

– may work with constrained and non-differentiable problems (e.g., minA,B ‖Y − AB‖F² s.t. A orthogonal; Lasso: minx ‖y − Ax‖₂² + λ‖x‖₁)

– may be faster than gradient descent or Newton (next) – may be simple and cheap!

30 / 43

slide-98
SLIDE 98

Why block coordinate descent?

– may work with constrained and non-differentiable problems (e.g., minA,B ‖Y − AB‖F² s.t. A orthogonal; Lasso: minx ‖y − Ax‖₂² + λ‖x‖₁)

– may be faster than gradient descent or Newton (next)
– may be simple and cheap!
Some references:
– [Wright, 2015]
– Lecture notes by Prof. Ruoyu Sun

30 / 43

slide-99
SLIDE 99

Outline

Classic line-search methods Advanced line-search methods Momentum methods Quasi-Newton methods Coordinate descent Conjugate gradient methods Trust-region methods

31 / 43

slide-100
SLIDE 100

Conjugate direction methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

32 / 43

slide-101
SLIDE 101

Conjugate direction methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

apply coordinate descent...

32 / 43

slide-102
SLIDE 102

Conjugate direction methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

apply coordinate descent...
– diagonal A: solves the problem in n steps
– non-diagonal A: does not solve the problem in n steps

32 / 43

slide-103
SLIDE 103

Conjugate direction methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

non-diagonal A: does not solve the problem in n steps

33 / 43

slide-104
SLIDE 104

Conjugate direction methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

non-diagonal A: does not solve the problem in n steps
Idea: define n “conjugate directions” {p1, . . . , pn} so that pi⊺Apj = 0 for all i ≠ j — conjugacy as a generalization of orthogonality

33 / 43

slide-105
SLIDE 105

Conjugate direction methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

non-diagonal A: does not solve the problem in n steps
Idea: define n “conjugate directions” {p1, . . . , pn} so that pi⊺Apj = 0 for all i ≠ j — conjugacy as a generalization of orthogonality
– Write P = [p1, . . . , pn]. Can verify that P⊺AP is diagonal and positive

33 / 43

slide-106
SLIDE 106

Conjugate direction methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

non-diagonal A: does not solve the problem in n steps
Idea: define n “conjugate directions” {p1, . . . , pn} so that pi⊺Apj = 0 for all i ≠ j — conjugacy as a generalization of orthogonality
– Write P = [p1, . . . , pn]. Can verify that P⊺AP is diagonal and positive
– Write x = Ps. Then ½ x⊺Ax − b⊺x = ½ s⊺(P⊺AP)s − (P⊺b)⊺s — a quadratic with diagonal P⊺AP

33 / 43

slide-107
SLIDE 107

Conjugate direction methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

non-diagonal A: does not solve the problem in n steps
Idea: define n “conjugate directions” {p1, . . . , pn} so that pi⊺Apj = 0 for all i ≠ j — conjugacy as a generalization of orthogonality
– Write P = [p1, . . . , pn]. Can verify that P⊺AP is diagonal and positive
– Write x = Ps. Then ½ x⊺Ax − b⊺x = ½ s⊺(P⊺AP)s − (P⊺b)⊺s — a quadratic with diagonal P⊺AP
– Perform updates in the s space, but write the equivalent form in x space

33 / 43

slide-108
SLIDE 108

Conjugate direction methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

non-diagonal A: does not solve the problem in n steps
Idea: define n “conjugate directions” {p1, . . . , pn} so that pi⊺Apj = 0 for all i ≠ j — conjugacy as a generalization of orthogonality
– Write P = [p1, . . . , pn]. Can verify that P⊺AP is diagonal and positive
– Write x = Ps. Then ½ x⊺Ax − b⊺x = ½ s⊺(P⊺AP)s − (P⊺b)⊺s — a quadratic with diagonal P⊺AP
– Perform updates in the s space, but write the equivalent form in x space
– The i-th coordinate direction in the s space is pi in the x space

33 / 43

slide-109
SLIDE 109

Conjugate direction methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

non-diagonal A: does not solve the problem in n steps
Idea: define n “conjugate directions” {p1, . . . , pn} so that pi⊺Apj = 0 for all i ≠ j — conjugacy as a generalization of orthogonality
– Write P = [p1, . . . , pn]. Can verify that P⊺AP is diagonal and positive
– Write x = Ps. Then ½ x⊺Ax − b⊺x = ½ s⊺(P⊺AP)s − (P⊺b)⊺s — a quadratic with diagonal P⊺AP
– Perform updates in the s space, but write the equivalent form in x space
– The i-th coordinate direction in the s space is pi in the x space
In short, a change-of-variables trick!

33 / 43

slide-110
SLIDE 110

Conjugate gradient methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

Idea: define n “conjugate directions” {p1, . . . , pn} so that pi⊺Apj = 0 for all i ≠ j — conjugacy as a generalization of orthogonality

34 / 43

slide-111
SLIDE 111

Conjugate gradient methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

Idea: define n “conjugate directions” {p1, . . . , pn} so that pi⊺Apj = 0 for all i ≠ j — conjugacy as a generalization of orthogonality
Generally, there are many choices for {p1, . . . , pn}. Conjugate gradient methods: the choice is based on ideas from steepest descent

34 / 43

slide-112
SLIDE 112

Conjugate gradient methods

Solve the linear system Ax = b ⟺ minx ½ x⊺Ax − b⊺x, with A ≻ 0

Idea: define n “conjugate directions” {p1, . . . , pn} so that pi⊺Apj = 0 for all i ≠ j — conjugacy as a generalization of orthogonality
Generally, there are many choices for {p1, . . . , pn}. Conjugate gradient methods: the choice is based on ideas from steepest descent

34 / 43

slide-113
SLIDE 113

Conjugate gradient methods

CG vs. GD (Green: GD, Red: CG)

35 / 43

slide-114
SLIDE 114

Conjugate gradient methods

CG vs. GD (Green: GD, Red: CG)
– Can be extended to general non-quadratic functions
– Often used to solve subproblems of other iterative methods, e.g., the truncated Newton method and the trust-region subproblem (later)
See Chap 5 of [Nocedal and Wright, 2006]

35 / 43
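A standard conjugate-gradient sketch for Ax = b with A ≻ 0 in Python/NumPy; it follows the textbook recursion (each new direction is made A-conjugate to the previous ones) rather than anything specific to these slides.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Conjugate gradient for A x = b with A symmetric positive definite."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                    # residual = negative gradient of the quadratic
    p = r.copy()                     # first search direction
    rs = r @ r
    for _ in range(max_iter or n):   # at most n steps in exact arithmetic
        Ap = A @ p
        alpha = rs / (p @ Ap)        # exact minimization along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol:
            break
        p = r + (rs_new / rs) * p    # next direction, A-conjugate to the previous ones
        rs = rs_new
    return x

# Check against a direct solve on a small SPD system (placeholder data)
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
assert np.allclose(conjugate_gradient(A, b), np.linalg.solve(A, b))
```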

slide-115
SLIDE 115

Outline

Classic line-search methods Advanced line-search methods Momentum methods Quasi-Newton methods Coordinate descent Conjugate gradient methods Trust-region methods

36 / 43

slide-116
SLIDE 116

Iterative methods

Credit: aria42.com

Illustration of iterative methods on the contour/levelset plot (i.e., the function assumes the same value on each curve)
Two questions: what direction to move, and how far to move
Two possibilities:
– Line-search methods: direction first, size second
– Trust-region methods (TRM): size first, direction second

37 / 43

slide-117
SLIDE 117

Ideas behind TRM

Recall the Taylor expansion: f(xk + d) ≈ f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, ∇²f(xk)d⟩. Credit: [Arezki et al., 2018]

Start with x0. Repeat the following:

38 / 43

slide-118
SLIDE 118

Ideas behind TRM

Recall the Taylor expansion: f(xk + d) ≈ f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, ∇²f(xk)d⟩. Credit: [Arezki et al., 2018]

Start with x0. Repeat the following:
– At xk, approximate f by the quadratic model function (dotted black) mk(d) = f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, Bk d⟩, i.e., mk(d) ≈ f(xk + d), with Bk approximating ∇²f(xk)

38 / 43

slide-119
SLIDE 119

Ideas behind TRM

Recall the Taylor expansion: f(xk + d) ≈ f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, ∇²f(xk)d⟩. Credit: [Arezki et al., 2018]

Start with x0. Repeat the following:
– At xk, approximate f by the quadratic model function (dotted black) mk(d) = f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, Bk d⟩, i.e., mk(d) ≈ f(xk + d), with Bk approximating ∇²f(xk)
– Minimize mk(d) within a trust region {d : ‖d‖ ≤ ∆}, i.e., a norm ball (in red), to obtain dk

38 / 43

slide-120
SLIDE 120

Ideas behind TRM

Recall the Taylor expansion: f(xk + d) ≈ f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, ∇²f(xk)d⟩. Credit: [Arezki et al., 2018]

Start with x0. Repeat the following:
– At xk, approximate f by the quadratic model function (dotted black) mk(d) = f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, Bk d⟩, i.e., mk(d) ≈ f(xk + d), with Bk approximating ∇²f(xk)
– Minimize mk(d) within a trust region {d : ‖d‖ ≤ ∆}, i.e., a norm ball (in red), to obtain dk

– If the approximation is inaccurate, decrease the region size; if the approximation is sufficiently accurate, increase the region size.

38 / 43

slide-121
SLIDE 121

Ideas behind TRM

Recall the Taylor expansion: f(xk + d) ≈ f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, ∇²f(xk)d⟩. Credit: [Arezki et al., 2018]

Start with x0. Repeat the following:
– At xk, approximate f by the quadratic model function (dotted black) mk(d) = f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, Bk d⟩, i.e., mk(d) ≈ f(xk + d), with Bk approximating ∇²f(xk)
– Minimize mk(d) within a trust region {d : ‖d‖ ≤ ∆}, i.e., a norm ball (in red), to obtain dk

– If the approximation is inaccurate, decrease the region size; if the approximation is sufficiently accurate, increase the region size. – If the approximation is reasonably accurate, update the iterate xk+1 = xk + dk.

38 / 43

slide-122
SLIDE 122

Framework of trust-region methods

To measure the approximation quality: ρk := [f(xk) − f(xk + dk)] / [mk(0) − mk(dk)] = actual decrease / model decrease

39 / 43

slide-123
SLIDE 123

Framework of trust-region methods

To measure the approximation quality: ρk := [f(xk) − f(xk + dk)] / [mk(0) − mk(dk)] = actual decrease / model decrease

A generic trust-region algorithm
Input: x0, radius cap ∆̄ > 0, initial radius ∆0, acceptance ratio η ∈ [0, 1/4)
1: for k = 0, 1, . . . do
2:   dk = arg min_d mk(d), s.t. ‖d‖ ≤ ∆k (TR subproblem)
3:   if ρk < 1/4 then
4:     ∆k+1 = ∆k/4
5:   else
6:     if ρk > 3/4 and ‖dk‖ = ∆k then
7:       ∆k+1 = min(2∆k, ∆̄)
8:     else
9:       ∆k+1 = ∆k
10:    end if
11:  end if
12:  if ρk > η then
13:    xk+1 = xk + dk
14:  else
15:    xk+1 = xk
16:  end if
17: end for

39 / 43
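A sketch of the trust-region loop above; the TR subproblem is solved only approximately via the Cauchy point (the model minimizer along −∇f(xk) inside the ball), which is one common cheap choice and my simplification, while the 1/4, 3/4, shrinking, and doubling constants follow the algorithm.

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Approximate TR-subproblem solution: minimize the model along -g with ||d|| <= delta."""
    g_norm = np.linalg.norm(g)
    if g_norm == 0:
        return np.zeros_like(g)
    gBg = g @ B @ g
    tau = 1.0 if gBg <= 0 else min(1.0, g_norm ** 3 / (delta * gBg))
    return -tau * (delta / g_norm) * g

def trust_region(f, grad_f, hess_f, x0, delta0=1.0, delta_max=10.0,
                 eta=0.1, max_iter=200, tol=1e-8):
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, B = grad_f(x), hess_f(x)
        if np.linalg.norm(g) <= tol:
            break
        d = cauchy_point(g, B, delta)
        model_decrease = -(g @ d + 0.5 * d @ B @ d)     # m_k(0) - m_k(d_k)
        rho = (f(x) - f(x + d)) / model_decrease        # actual decrease / model decrease
        if rho < 0.25:
            delta /= 4.0
        elif rho > 0.75 and np.isclose(np.linalg.norm(d), delta):
            delta = min(2.0 * delta, delta_max)
        if rho > eta:                                   # accept the step
            x = x + d
    return x
```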

slide-124
SLIDE 124

Why TRM?

Recall the model function: mk(d) := f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, Bk d⟩

40 / 43

slide-125
SLIDE 125

Why TRM?

Recall the model function: mk(d) := f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, Bk d⟩

– Take Bk = ∇²f(xk)

40 / 43

slide-126
SLIDE 126

Why TRM?

Recall the model function: mk(d) := f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, Bk d⟩

– Take Bk = ∇²f(xk)
– Gradient descent: stops at ∇f(xk) = 0

40 / 43

slide-127
SLIDE 127

Why TRM?

Recall the model function: mk(d) := f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, Bk d⟩

– Take Bk = ∇²f(xk)
– Gradient descent: stops at ∇f(xk) = 0
– Newton's method: [∇²f(xk)]⁻¹∇f(xk) may just stop at ∇f(xk) = 0 or be ill-defined

40 / 43

slide-128
SLIDE 128

Why TRM?

Recall the model function: mk(d) := f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, Bk d⟩

– Take Bk = ∇²f(xk)
– Gradient descent: stops at ∇f(xk) = 0
– Newton's method: [∇²f(xk)]⁻¹∇f(xk) may just stop at ∇f(xk) = 0 or be ill-defined
– Trust-region method: min_d mk(d) s.t. ‖d‖ ≤ ∆k. When ∇f(xk) = 0, mk(d) − f(xk) = ½⟨d, ∇²f(xk)d⟩.

40 / 43

slide-129
SLIDE 129

Why TRM?

Recall the model function: mk(d) := f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, Bk d⟩

– Take Bk = ∇²f(xk)
– Gradient descent: stops at ∇f(xk) = 0
– Newton's method: [∇²f(xk)]⁻¹∇f(xk) may just stop at ∇f(xk) = 0 or be ill-defined
– Trust-region method: min_d mk(d) s.t. ‖d‖ ≤ ∆k. When ∇f(xk) = 0, mk(d) − f(xk) = ½⟨d, ∇²f(xk)d⟩.

If ∇²f(xk) has negative eigenvalues, i.e., there are negative directional curvatures, then ½⟨d, ∇²f(xk)d⟩ < 0 for certain choices of d (e.g., eigenvectors corresponding to the negative eigenvalues)

40 / 43

slide-130
SLIDE 130

Why TRM?

Recall the model function: mk(d) := f(xk) + ⟨∇f(xk), d⟩ + ½⟨d, Bk d⟩

– Take Bk = ∇²f(xk)
– Gradient descent: stops at ∇f(xk) = 0
– Newton's method: [∇²f(xk)]⁻¹∇f(xk) may just stop at ∇f(xk) = 0 or be ill-defined
– Trust-region method: min_d mk(d) s.t. ‖d‖ ≤ ∆k. When ∇f(xk) = 0, mk(d) − f(xk) = ½⟨d, ∇²f(xk)d⟩.

If ∇²f(xk) has negative eigenvalues, i.e., there are negative directional curvatures, then ½⟨d, ∇²f(xk)d⟩ < 0 for certain choices of d (e.g., eigenvectors corresponding to the negative eigenvalues)

TRM can help to move away from “nice” saddle points!

40 / 43

slide-131
SLIDE 131

To learn more about TRM

– A comprehensive reference: [Conn et al., 2000]
– A closely related alternative: the cubic-regularized second-order (CRSOM) method [Nesterov and Polyak, 2006, Agarwal et al., 2018]
– Example implementations of both TRM and CRSOM: Manopt (in Matlab) https://www.manopt.org/ (choosing the Euclidean manifold)

41 / 43

slide-132
SLIDE 132

References i

[Agarwal et al., 2018] Agarwal, N., Boumal, N., Bullins, B., and Cartis, C. (2018). Adaptive regularization with cubics on manifolds. arXiv:1806.00065.
[Arezki et al., 2018] Arezki, Y., Nouira, H., Anwer, N., and Mehdi-Souzani, C. (2018). A novel hybrid trust region minimax fitting algorithm for accurate dimensional metrology of aspherical shapes. Measurement, 127:134–140.
[Beck, 2017] Beck, A. (2017). First-Order Methods in Optimization. Society for Industrial and Applied Mathematics.
[Conn et al., 2000] Conn, A. R., Gould, N. I. M., and Toint, P. L. (2000). Trust Region Methods. Society for Industrial and Applied Mathematics.
[Hillar and Lim, 2013] Hillar, C. J. and Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM, 60(6):1–39.
[Murty and Kabadi, 1987] Murty, K. G. and Kabadi, S. N. (1987). Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129.
[Nesterov, 2018] Nesterov, Y. (2018). Lectures on Convex Optimization. Springer International Publishing.

42 / 43

slide-133
SLIDE 133

References ii

[Nesterov and Polyak, 2006] Nesterov, Y. and Polyak, B. (2006). Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205.
[Nocedal and Wright, 2006] Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer New York.
[Wright, 2015] Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming, 151(1):3–34.

43 / 43