SLIDE 1

First-order methods

(OPTML++ Meeting 2)

Suvrit Sra, Massachusetts Institute of Technology
OPTML++, Fall 2015

SLIDE 2

Outline

– Lect 1: Recap on convexity
– Lect 1: Recap on duality, optimality
– First-order optimization algorithms
– Proximal methods, operator splitting


SLIDE 4

Descent methods: min_x f(x)

[Figure: iterates xk, xk+1, . . . descending toward the minimizer x∗, where ∇f(x∗) = 0]

SLIDE 8

Descent methods

[Figure: at a point x, the gradient ∇f(x) and the negative gradient −∇f(x); candidate steps x − α∇f(x) and x − δ∇f(x) along the negative gradient, and another descent direction d giving x + α₂d]

SLIDE 9

Algorithm

1. Start with some guess x0;
2. For each k = 0, 1, . . .:
   – xk+1 ← xk + αk dk
   – Check when to stop (e.g., if ∇f(xk+1) = 0)
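Below is a minimal NumPy sketch of this generic loop, with the simplest concrete choices filled in: the plain negative-gradient direction dk = −∇f(xk), a fixed stepsize, and a gradient-norm stopping test. The quadratic used at the end is only an illustrative assumption, not something from the slides.

```python
import numpy as np

def gradient_method(grad, x0, stepsize=0.1, max_iter=1000, tol=1e-8):
    """Generic iteration x_{k+1} = x_k + alpha_k * d_k with d_k = -grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:      # stop when the gradient is (numerically) zero
            break
        d = -g                            # descent direction
        x = x + stepsize * d              # x_{k+1} = x_k + alpha_k * d_k
    return x

# Illustrative use on f(x) = 0.5 * ||x||^2, whose gradient is x.
x_min = gradient_method(lambda x: x, x0=np.array([3.0, -4.0]))
```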

SLIDE 13

Gradient methods

xk+1 = xk + αkdk, k = 0, 1, . . .

– Stepsize αk ≥ 0, usually ensures f(xk+1) < f(xk)
– Descent direction dk satisfies ⟨∇f(xk), dk⟩ < 0

Numerous ways to select αk and dk

Usually methods seek monotonic descent f(xk+1) < f(xk)

SLIDE 15

Gradient methods – direction

xk+1 = xk + αkdk, k = 0, 1, . . .

◮ Different choices of direction dk

  • Scaled gradient: dk = −Dk∇f(xk), Dk ≻ 0
  • Newton’s method: Dk = [∇2f(xk)]−1
  • Quasi-Newton: Dk ≈ [∇2f(xk)]−1
  • Steepest descent: Dk = I
  • Diagonally scaled: Dk diagonal with (Dk)ii ≈ [∂2f(xk)/(∂xi)2]−1
  • Discretized Newton: Dk = [H(xk)]−1, H via finite-differences
  • . . .

Exercise: Verify that ⟨∇f(xk), dk⟩ < 0 for the above choices
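As a quick numerical companion to the exercise, the sketch below builds three of the directions above on a small positive-definite quadratic (an illustrative assumption) and checks that ⟨∇f(xk), dk⟩ < 0 for each.

```python
import numpy as np

# Toy quadratic f(x) = 0.5 * x^T Q x - b^T x with Q positive definite.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: Q @ x - b              # gradient of f
hess = lambda x: Q                      # Hessian of f (constant for a quadratic)

x = np.array([5.0, -5.0])
g = grad(x)

directions = {
    "steepest descent (Dk = I)": -g,
    "Newton (Dk = inverse Hessian)": -np.linalg.solve(hess(x), g),
    "diagonally scaled": -g / np.diag(hess(x)),
}

for name, d in directions.items():
    print(f"{name:32s} <grad, d> = {g @ d:.4f}")   # all values should be negative
```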

SLIDE 20

Gradient methods – stepsize

◮ Exact: αk := argmin_{α ≥ 0} f(xk + αdk)

◮ Limited min: αk := argmin_{0 ≤ α ≤ s} f(xk + αdk)

◮ Armijo rule: Given fixed scalars s, β, σ with 0 < β < 1 and 0 < σ < 1 (chosen experimentally), set αk = β^mk s, where we try β^m s for m = 0, 1, . . . until sufficient descent

  f(xk) − f(xk + β^m s dk) ≥ −σ β^m s ⟨∇f(xk), dk⟩

  If ⟨∇f(xk), dk⟩ < 0, such a stepsize is guaranteed to exist. Usually σ is small, σ ∈ [10^−5, 0.1], while β ranges from 1/2 to 1/10 depending on how confident we are about the initial stepsize s.

◮ Constant: αk = 1/L (for a suitable value of L)

◮ Diminishing: αk → 0 but Σ_k αk = ∞.
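A minimal sketch of the Armijo rule as stated above: start from the initial stepsize s and keep multiplying by β until the sufficient-descent inequality holds. The default values of s, β, σ below sit in the ranges mentioned on the slide but are otherwise arbitrary choices.

```python
import numpy as np

def armijo_stepsize(f, grad_xk, xk, dk, s=1.0, beta=0.5, sigma=1e-4, max_backtracks=50):
    """Return alpha_k = beta^m * s for the smallest m with
    f(xk) - f(xk + beta^m * s * dk) >= -sigma * beta^m * s * <grad f(xk), dk>."""
    fk = f(xk)
    slope = grad_xk @ dk                  # <grad f(xk), dk>; must be < 0 for a descent direction
    alpha = s
    for _ in range(max_backtracks):
        if fk - f(xk + alpha * dk) >= -sigma * alpha * slope:
            return alpha                  # sufficient descent achieved
        alpha *= beta                     # otherwise try beta^(m+1) * s
    return alpha                          # fallback if the backtracking budget is exhausted

# Illustrative use on f(x) = 0.5 * ||x||^2 with the steepest-descent direction.
f = lambda x: 0.5 * np.dot(x, x)
xk = np.array([2.0, -1.0])
gk = xk                                   # gradient of f at xk
alpha_k = armijo_stepsize(f, gk, xk, -gk)
```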

SLIDE 24

Gradient methods – nonmonotonic steps∗

– Stepsize computation can be expensive
– Convergence analysis depends on monotonic descent
– Give up search for stepsizes
– Use closed-form formulae for stepsizes
– Don’t insist on monotonic descent? (e.g., diminishing stepsizes do not give monotonic descent)

Barzilai & Borwein stepsizes

xk+1 = xk − αk∇f(xk), k = 0, 1, . . .

αk = ⟨uk, vk⟩ / ‖vk‖²   or   αk = ‖uk‖² / ⟨uk, vk⟩,   where uk = xk − xk−1, vk = ∇f(xk) − ∇f(xk−1)
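The following sketch turns the two BB formulas into a gradient iteration; since no (uk, vk) pair exists at the first step, it falls back to a small fixed stepsize there, which is an implementation choice rather than something prescribed on the slide.

```python
import numpy as np

def bb_gradient(grad, x0, alpha0=1e-3, max_iter=100, variant=1):
    """x_{k+1} = x_k - alpha_k * grad f(x_k) with Barzilai-Borwein stepsizes."""
    x = np.asarray(x0, dtype=float)
    x_prev = g_prev = None
    for k in range(max_iter):
        g = grad(x)
        if x_prev is None:
            alpha = alpha0                 # no history yet: use a fixed initial step
        else:
            u = x - x_prev                 # u_k = x_k - x_{k-1}
            v = g - g_prev                 # v_k = grad f(x_k) - grad f(x_{k-1})
            if np.allclose(v, 0.0):        # gradient stopped changing: (near) stationary
                break
            if variant == 1:
                alpha = (u @ v) / (v @ v)  # alpha_k = <u_k, v_k> / ||v_k||^2
            else:
                alpha = (u @ u) / (u @ v)  # alpha_k = ||u_k||^2 / <u_k, v_k>
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x

# Illustrative use on the convex quadratic f(x) = 0.5 * x^T Q x.
Q = np.array([[10.0, 0.0], [0.0, 1.0]])
x_min = bb_gradient(lambda x: Q @ x, np.array([1.0, 1.0]))
```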

SLIDE 25

Least-squares

SLIDE 26

Nonnegative least squares

min_{x ≥ 0} (1/2)‖Ax − b‖²

intensities, concentrations, frequencies, . . .

Applications

Machine learning, Statistics, Image Processing, Computer Vision, Medical Imaging, Astronomy, Physics, Bioinformatics, Remote Sensing, Engineering, Inverse problems, Finance

SLIDE 27

NNLS: min ‖Ax − b‖² s.t. x ≥ 0

Unconstrained solution: solve ∇f(x) = 0 ⟹ xuc = (AᵀA)⁻¹Aᵀb
Cannot just truncate: x = (xuc)₊

[Figure: the NNLS solution x∗, the unconstrained solution xuc, and its truncation (xuc)₊]

x ≥ 0 makes the problem trickier as the problem size grows
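A small illustrative example (not from the slides) of why truncation fails: take A = [[1, 1], [0, 1]] and b = (−1, 1). Then xuc = (AᵀA)⁻¹Aᵀb = (−2, 1), so the truncation (xuc)₊ = (0, 1) gives ‖Ax − b‖² = 4, whereas the NNLS solution is x∗ = (0, 0) with ‖Ax − b‖² = 2.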

SLIDE 29

Solving NNLS scalably

[Figure: x∗ and xuc, as on the previous slide]

x ← (x − α∇f(x))+

Good choice of α crucial:
◮ Backtracking line-search
◮ Armijo
◮ and many others

Too slow!
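For reference, here is a minimal sketch of the basic projected-gradient iteration x ← (x − α∇f(x))₊ that the next slides set out to accelerate. It uses the fixed stepsize α = 1/L with L = ‖AᵀA‖₂, a safe but conservative assumption, instead of the line-search rules listed above; the random test instance is purely illustrative.

```python
import numpy as np

def nnls_projected_gradient(A, b, max_iter=500):
    """Projected gradient for min_{x >= 0} 0.5 * ||A x - b||^2 with stepsize 1/L."""
    n = A.shape[1]
    L = np.linalg.norm(A.T @ A, 2)            # Lipschitz constant of the gradient
    alpha = 1.0 / L
    x = np.zeros(n)
    for _ in range(max_iter):
        grad = A.T @ (A @ x - b)              # gradient of the least-squares objective
        x = np.maximum(x - alpha * grad, 0.0) # gradient step, then project onto x >= 0
    return x

# Illustrative use on a small random instance.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x = nnls_projected_gradient(A, b)
```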

SLIDE 30

NNLS: a long-studied problem

Method           Remarks              Scalability   Accuracy
NNLS (1976)      MATLAB default       poor          high
FNNLS (1989)     fast NNLS            poor          high
LBFGS-B (1997)   famous solver        fair          medium
TRON (1999)      TR Newton            poor          high
SPG (2000)       spectral proj.       fair+         medium
ASA (2006)       prev. state-of-art   fair+         medium
SBB (2011)       subspace BB steps    very good     medium

SLIDE 31

Spectacular failure of projection

[Plot: objective function value vs. running time (seconds); curve: Naive BB + Projection]

x′ = (x − α∇f(x))+

SLIDE 32

Rescue: occasional line-search?

[Plot: objective function value vs. running time (seconds); curves: Naive BB + Projection and Naive BB + Linesearch]

Mix BB-step with linesearch

SLIDE 34

Can we completely avoid linesearch?

Do not use all coordinates to compute α!
“Subspace-BB” (SBB), Kim, Sra, Dhillon (OMS, 2011)
– Identify fixed variables (those likely to satisfy xi = 0)
– Compute α using free variables (most crucial step!)
– SBB convergence theorem
– Global rate: open problem
– Empirically great!
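The sketch below is a rough reconstruction of the idea from the bullet points only: guess which variables are fixed at the bound, compute a BB stepsize from the remaining free coordinates, and take a projected step. It is a simplified illustration, not the actual SBB algorithm of Kim, Sra, Dhillon (2011).

```python
import numpy as np

def sbb_nnls(A, b, max_iter=200, alpha0=1e-3, eps=1e-10):
    """Simplified subspace-BB-style iteration for min_{x >= 0} 0.5 * ||A x - b||^2."""
    n = A.shape[1]
    x = np.zeros(n)
    g = A.T @ (A @ x - b)
    x_prev, g_prev = x.copy(), g.copy()
    alpha = alpha0
    for _ in range(max_iter):
        # Guess the "fixed" variables: at the bound and pushed against it by the gradient.
        fixed = (x <= eps) & (g > 0)
        free = ~fixed
        # BB stepsize computed only from the free coordinates.
        u = (x - x_prev)[free]
        v = (g - g_prev)[free]
        if u @ v > 0:
            alpha = (u @ u) / (u @ v)          # BB step restricted to the free subspace
        x_prev, g_prev = x.copy(), g.copy()
        x = np.maximum(x - alpha * g, 0.0)     # gradient step followed by projection
        g = A.T @ (A @ x - b)
    return x
```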

SLIDE 35

SBB: simplicity and scalability

[Plot: objective function value vs. running time (seconds); curves: Naive BB + Projection, Naive BB + Linesearch, and SBB]

SLIDE 37

Numerical result

Algorithm            Time     ‖Ax − b‖²   Convg. tol.
LBFGS-B (FORTRAN)    19000s   20.2        1.0E-03
SPG (FORTRAN)        8600s    20.5        3.8E-01
ASA (C++)            1001s    24.5        4.8E-02
SBB (MATLAB)         201s     21.2        8.7E-03

(“medium” 20,000 × 1,350,000 matrix)

SLIDE 40

Back to gradient-descent

Assumption: Lipschitz continuous gradient; denoted f ∈ C¹_L

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂

♣ Gradient vectors of nearby points are close to each other
♣ Objective function has “bounded curvature”
♣ Speed at which the gradient varies is bounded

Lemma (Descent). Let f ∈ C¹_L. Then,

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖₂²

Theorem. Let f ∈ C¹_L and (xk) be the sequence generated as above, with αk = 1/L. Then, f(xk+1) − f(x∗) = O(1/k).
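A quick numerical illustration (not on the slides) of the constant stepsize αk = 1/L: for the least-squares objective f(x) = ½‖Ax − b‖², the gradient is Lipschitz with L = λmax(AᵀA), and running plain gradient descent with that stepsize drives the gap f(xk) − f(x∗) down, consistent with the O(1/k) guarantee. The random instance below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A).max()        # Lipschitz constant of the gradient

x = np.zeros(10)
for k in range(200):
    x = x - (1.0 / L) * grad(x)              # constant stepsize alpha_k = 1/L

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(f(x) - f(x_star))                       # small and shrinking with k, as O(1/k) suggests
```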

SLIDE 41

Linear convergence

Assumption: Strong convexity; denoted f ∈ S¹_{L,µ}

f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (µ/2)‖x − y‖₂²

Setting αk = 2/(µ + L) yields a linear rate (µ > 0)

SLIDE 42

Strongly convex – linear rate

Theorem. If f ∈ S¹_{L,µ} and 0 < α < 2/(L + µ), then the gradient method generates a sequence (xk) that satisfies

‖xk − x∗‖₂² ≤ (1 − 2αµL/(µ + L))^k ‖x0 − x∗‖₂².

Moreover, if α = 2/(L + µ), then

f(xk) − f∗ ≤ (L/2) ((κ − 1)/(κ + 1))^{2k} ‖x0 − x∗‖₂²,

where κ = L/µ is the condition number.
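To get a feel for this rate (an illustrative calculation, not on the slides): with condition number κ = 100 and α = 2/(L + µ), the per-iteration contraction factor is ((κ − 1)/(κ + 1))² = (99/101)² ≈ 0.961, so the bound shrinks by roughly 4% per iteration and about ln(10)/0.04 ≈ 58 iterations are needed for each extra digit of accuracy in f(xk) − f∗.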

SLIDE 44

Gradient methods – lower bounds

xk+1 = xk − αk∇f(xk)

Theorem (Lower bound I, Nesterov). For any x0 ∈ Rⁿ and 1 ≤ k ≤ (n − 1)/2, there is a smooth f such that

f(xk) − f(x∗) ≥ 3L‖x0 − x∗‖₂² / (32(k + 1)²)

Theorem (Lower bound II, Nesterov). For the class of smooth, strongly convex functions, i.e., S∞_{L,µ} (µ > 0, κ > 1),

f(xk) − f(x∗) ≥ (µ/2) ((√κ − 1)/(√κ + 1))^{2k} ‖x0 − x∗‖₂².
