The importance of better models in stochastic optimization
John Duchi (based on joint work with Feng Ruan and Hilal Asi), Stanford University, Les Houches 2019


slide-1
SLIDE 1

The importance of better models in stochastic optimization

John Duchi (based on joint work with Feng Ruan and Hilal Asi) Stanford University Les Houches 2019

slide-2
SLIDE 2

Outline

◮ Motivating experiments
◮ Models in optimization
◮ Stochastic optimization
◮ Stability is better
◮ Nothing gets worse
◮ Beyond convexity
◮ Adaptivity in easy problems
◮ Revisiting experimental results


slide-4
SLIDE 4

Why robustness is important

How much ENERGY is spent in this paper? ≡ How many Toyota Camrys driving from SF to LA?

Guesses: 1? 10? 100? 1000? The answer: 4200.

slide-11
SLIDE 11

Stochastic gradient methods

The problem in this talk:

minimize_x  F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)   subject to x ∈ X

slide-12
SLIDE 12

Stochastic gradient methods

The problem in this talk:

minimize_x  F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)   subject to x ∈ X

Weakly convex functions: for each s, there is some ρ(s) such that

f(x; s) + (ρ(s)/2)‖x‖₂²  is convex in x

◮ add a big enough quadratic, and it becomes convex


slide-15
SLIDE 15

Stochastic gradient methods

The problem in this talk:

minimize_x  F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)   subject to x ∈ X

Stochastic gradient method:  xk+1 = xk − αk gk,   gk ∈ ∂f(xk; Sk)

slide-16
SLIDE 16

Stochastic gradient methods

Stochastic gradient method:  xk+1 = xk − αk gk,   gk ∈ ∂f(xk; Sk)

Why do we use this?

◮ Easy to analyze?
◮ Default in software packages and simple to implement?
◮ It works?
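As a concrete sketch of the update (illustrative code, not from the talk; the least-squares data here are synthetic assumptions), the stochastic gradient method with stepsizes αk = α0/√k:

```python
import numpy as np

# synthetic least-squares data: f(x; (a, b)) = (a^T x - b)^2 / 2
rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star  # noiseless targets, so x_star minimizes F

x = np.zeros(d)
for k in range(1, 5001):
    i = rng.integers(n)               # sample S_k ~ P (uniform on the data)
    g = (A[i] @ x - b[i]) * A[i]      # g_k, a (sub)gradient of f(x_k; S_k)
    x = x - (0.1 / np.sqrt(k)) * g    # stepsize alpha_k = alpha_0 / sqrt(k)

print(np.linalg.norm(x - x_star))
```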

slide-17
SLIDE 17

Linear regression

F(x) = (1/(2m)) Σi (aiᵀ x − bi)²

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM, Truncated, and Prox]


slide-20
SLIDE 20

Outline

◮ Motivating experiments
◮ Models in optimization
◮ Stochastic optimization
◮ Stability is better
◮ Nothing gets worse
◮ Beyond convexity
◮ Adaptivity in easy problems
◮ Revisiting experimental results

slide-21
SLIDE 21

Optimization methods

How do we solve optimization problems?

  • 1. Build a “good” but simple local model of f
  • 2. Minimize the model (perhaps regularizing)
slide-22
SLIDE 22

Optimization methods

How do we solve optimization problems?

  • 1. Build a “good” but simple local model of f
  • 2. Minimize the model (perhaps regularizing)

Gradient descent: Taylor (first-order) model f(y) ≈ fx(y) := f(x) + ∇f(x)T (y − x)

slide-23
SLIDE 23

Optimization methods

How do we solve optimization problems?

  • 1. Build a “good” but simple local model of f
  • 2. Minimize the model (perhaps regularizing)

Newton’s method: Taylor (second-order) model f(y) ≈ fx(y) := f(x) + ∇f(x)T (y − x) + (1/2)(y − x)T ∇2f(x)(y − x)

slide-24
SLIDE 24

Composite optimization problems (other model-able structures)

The problem:

minimize_x  f(x) := h(c(x)),  where h : Rᵐ → R is convex and c : Rⁿ → Rᵐ is smooth

[Fletcher & Watson 80; Fletcher 82; Burke 85; Wright 87; Lewis & Wright 15; Drusvyatskiy & Lewis 16]

slide-25
SLIDE 25

Modeling composite problems

Now we make a convex model of f(x) = h(c(x)) by linearizing c inside h:

f(y) ≈ h(c(x) + ∇c(x)ᵀ(y − x)),   where c(x) + ∇c(x)ᵀ(y − x) = c(y) + O(‖x − y‖²)

slide-29
SLIDE 29

Modeling composite problems

Now we make a convex model  fx(y) := h(c(x) + ∇c(x)ᵀ(y − x))   [Burke 85; Drusvyatskiy, Ioffe, Lewis 16]
slide-31
SLIDE 31

Modeling composite problems

Now we make a convex model  fx(y) := h(c(x) + ∇c(x)ᵀ(y − x))

Example: f(x) = |x² − 1|, with h(z) = |z| and c(x) = x² − 1
slide-34
SLIDE 34

The prox-linear method [Burke, Drusvyatskiy et al.]

Iteratively (1) form a regularized convex model and (2) minimize it:

xk+1 = argmin_{x∈X} { fxk(x) + (1/(2α))‖x − xk‖₂² }
     = argmin_{x∈X} { h(c(xk) + ∇c(xk)ᵀ(x − xk)) + (1/(2α))‖x − xk‖₂² }
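For the scalar example f(x) = |x² − 1| (h(z) = |z|, c(x) = x² − 1), the subproblem min_d |c + c′·d| + d²/(2α) has a closed form, which makes the method's rapid convergence on this example easy to reproduce. A hedged sketch (the stepsize and starting point are made up):

```python
def prox_linear_step(x, alpha=1.0):
    """One prox-linear step for f(x) = |x^2 - 1|, i.e. h(z) = |z|, c(x) = x^2 - 1.

    Minimizes |c + b*d| + d^2 / (2*alpha) over d = y - x in closed form:
    step to the kink c + b*d = 0 if it is within reach (a guarded
    Newton-type step on c); otherwise take the full step of length alpha*|b|.
    """
    c = x * x - 1.0   # c(x)
    b = 2.0 * x       # c'(x), assumed nonzero along the iterates below
    if abs(c) <= alpha * b * b:
        d = -c / b
    else:
        d = -alpha * b if c > 0 else alpha * b
    return x + d

x = 1.3
for _ in range(6):
    x = prox_linear_step(x)
```

Starting from x = 1.3, the iterates approach x⋆ = 1 at a locally quadratic rate, mirroring the |xk − x⋆| values shown on the following slides.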

slide-35
SLIDE 35

The prox-linear method [Burke, Drusvyatskiy et al.]

On the example f(x) = |x² − 1|, successive prox-linear iterates satisfy |xk − x⋆| = 0.3, 0.024, 3·10⁻⁴, 4·10⁻⁸: roughly quadratic convergence.
slide-40
SLIDE 40

Generic(ish) optimization methods

Iterate  xk+1 = argmin_{x∈X} { fxk(x) + (1/(2αk))‖x − xk‖² }

slide-41
SLIDE 41

Generic(ish) optimization methods

Iterate  xk+1 = argmin_{x∈X} { fxk(x) + (1/(2αk))‖x − xk‖² }

◮ Proximal point method (fx = f) [Rockafellar 76]
◮ Gradient descent (fx(y) = f(x) + ⟨∇f(x), y − x⟩)
◮ Newton (fx(y) = f(x) + ⟨∇f(x), y − x⟩ + (1/2)(x − y)ᵀ∇²f(x)(x − y))
◮ Prox-linear (fx(y) = h(c(x) + ∇c(x)ᵀ(y − x)))

slide-42
SLIDE 42

The aProx family for stochastic optimization

Iterate:

◮ Sample Sk iid ∼ P
◮ Update by minimizing the model:  xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk))‖x − xk‖² }

slide-43
SLIDE 43

The aProx family for stochastic optimization

Iterate:

◮ Sample Sk iid ∼ P
◮ Update by minimizing the model:  xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk))‖x − xk‖² }

Examples:

◮ Stochastic gradient method
◮ Stochastic proximal-point (implicit gradient) method, fxk(x) = f(x) [Rockafellar 76; Kulis & Bartlett 10; Karampatziakis & Langford 11; Bertsekas 11; Toulis & Airoldi 17; Ryu & Boyd 16]
◮ Stochastic prox-linear methods [D. & Ruan 18; Davis & Drusvyatskiy 18; Asi & D. 19]

slide-44
SLIDE 44

Models in stochastic optimization

Conditions on our models (convex case):

  • i. Convex model: y ↦ fx(y; s) is convex
  • ii. Lower bound: fx(y; s) ≤ f(y; s)
  • iii. Local correctness: fx(x; s) = f(x; s) and ∂fx(x; s) ⊂ ∂f(x; s)

[D. & Ruan 17; Davis & Drusvyatskiy 18]

slide-45
SLIDE 45

Models in stochastic optimization

Conditions on our models (ρ-weakly convex case):

  • i. Convex model: y ↦ fx(y; s) is convex
  • ii. Lower bound: fx(y; s) ≤ f(y; s) + (ρ(s)/2)‖x − y‖₂²
  • iii. Local correctness: fx(x; s) = f(x; s) and ∂fx(x; s) ⊂ ∂f(x; s)

[D. & Ruan 17; Davis & Drusvyatskiy 18; Asi & D. 19]

slide-46
SLIDE 46

Modeling conditions

[Figure: models fx(y) of f near the point (x0, f(x0))]

◮ Linear model: fx0(y) = f(x0) + ∇f(x0)ᵀ(y − x0)
◮ Truncated model: fx0(y) = [f(x0) + ∇f(x0)ᵀ(y − x0)]₊
slide-49
SLIDE 49

Models in stochastic optimization

[Figure: linear and truncated models at x0, x1]

  • i. (Sub)gradient: fx(y) = f(x) + ⟨f′(x), y − x⟩
  • ii. Truncated: fx(y) = (f(x) + ⟨f′(x), y − x⟩) ∨ inf_x f(x)
  • iii. Bundle/multi-line: fx(y) = max_i {f(xi) + ⟨f′(xi), y − xi⟩}
  • iv. Prox-linear: fx(y) = h(c(x) + ∇c(x)ᵀ(y − x))
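For nonnegative losses (inf_x f(x; s) = 0), minimizing the truncated model plus the proximal term has a closed form: a gradient step whose stepsize is clipped at the Polyak-type value f/‖g‖². A sketch under these assumptions (the single least-squares sample is invented for illustration):

```python
import numpy as np

def truncated_step(x, f_val, g, alpha):
    """argmin_y [f_val + <g, y - x>]_+ + ||y - x||^2 / (2*alpha)
    = x - min(alpha, f_val / ||g||^2) * g: the step never crosses the
    model's zero level set, however large alpha is."""
    return x - min(alpha, f_val / float(g @ g)) * g

# one step on a single least-squares sample f(x) = (a^T x - b0)^2 / 2
rng = np.random.default_rng(1)
a, b0 = rng.normal(size=5), 0.7
x = rng.normal(size=5)
r = a @ x - b0
x_new = truncated_step(x, 0.5 * r * r, r * a, alpha=100.0)  # huge stepsize
# the residual is halved rather than blown up: a @ x_new - b0 == r / 2
```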
slide-50
SLIDE 50

The aProx family

Iterate:

◮ Sample Sk iid ∼ P
◮ Update by minimizing the model:  xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk))‖x − xk‖² }

slide-51
SLIDE 51

Outline

◮ Motivating experiments
◮ Models in optimization
◮ Stochastic optimization
◮ Stability is better
◮ Nothing gets worse
◮ Beyond convexity
◮ Adaptivity in easy problems
◮ Revisiting experimental results

slide-52
SLIDE 52

The aProx family

Iterate:

◮ Sample Sk iid ∼ P
◮ Update by minimizing the model:  xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk))‖x − xk‖² }

slide-53
SLIDE 53

Divergence of a gradient method

[Figure animation: gradient-method iterates diverging]

slide-63
SLIDE 63

Stability guarantees (convex)

Use the full stochastic proximal-point method,

xk+1 = argmin_{x∈X} { f(x; Sk) + (1/(2αk))‖x − xk‖² }.

Theorem (Asi & D. 18). Assume X⋆ = argmin_{x∈X} F(x) is non-empty and E[‖f′(x⋆; S)‖²] ≤ σ². Then

E[dist(xk, X⋆)²] ≤ dist(x0, X⋆)² + σ² (α1² + ··· + αk²)

slide-64
SLIDE 64

Stability guarantees (convex)

Use the full stochastic proximal-point method,

xk+1 = argmin_{x∈X} { f(x; Sk) + (1/(2αk))‖x − xk‖² }.

Theorem (Asi & D. 18). Assume X⋆ = argmin_{x∈X} F(x) is non-empty and E[‖f′(x⋆; S)‖²] ≤ σ². Then

E[dist(xk, X⋆)²] ≤ dist(x0, X⋆)² + σ² (α1² + ··· + αk²)

Theorem (Asi & D. 18). Under the same assumptions,

sup_k dist(xk, X⋆) < ∞   and   dist(xk, X⋆) →a.s. 0.

slide-65
SLIDE 65

Stability guarantees (convex)

Use any model with fx(y; s) ≥ inf_z f(z; s) (i.e. a good lower bound),

xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk))‖x − xk‖² }.

Theorem (Asi & D. 19). Assume X⋆ = argmin_{x∈X} F(x) is non-empty and there exists p < ∞ such that E[‖f′(x; S)‖²] ≤ C(1 + dist(x, X⋆)ᵖ). Then

sup_k dist(xk, X⋆) < ∞   and   dist(xk, X⋆) →a.s. 0.

slide-66
SLIDE 66

Example behaviors

On the least-squares objective F(x) = (1/(2m)) Σi (aiᵀ x − bi)²:

[Figure: optimality gap vs. iteration for SGM and Prox]
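For least squares, the stochastic proximal-point update has a closed form that exposes the stability seen in the plot: the effective stepsize α/(1 + α‖a‖²) is bounded by 1/‖a‖² no matter how large α is. An illustrative sketch with synthetic data (not the talk's experiment):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 10
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star  # noiseless: the interpolation ("easy") regime

x = np.zeros(d)
alpha = 1e6  # absurdly large stepsize; a plain gradient step would diverge
for _ in range(3000):
    i = rng.integers(n)
    a, r = A[i], A[i] @ x - b[i]
    # argmin_y (a^T y - b_i)^2 / 2 + ||y - x||^2 / (2*alpha)
    # = x - alpha / (1 + alpha * ||a||^2) * r * a
    x = x - (alpha / (1.0 + alpha * (a @ a))) * r * a

print(np.linalg.norm(x - x_star))
```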

slide-67
SLIDE 67

Classical asymptotic analysis

Theorem (Polyak & Juditsky 92). Let F be convex and strongly convex in a neighborhood of x⋆, and assume the f(·; S) are globally smooth. For xk generated by the stochastic gradient method,

(1/√k) Σi=1..k (xi − x⋆) →d N(0, ∇²F(x⋆)⁻¹ Cov(∇f(x⋆; S)) ∇²F(x⋆)⁻¹).

slide-68
SLIDE 68

New asymptotic analysis (convex case)

Theorem (Asi & D. 18). Let F be convex and strongly convex in a neighborhood of x⋆, and assume the f(·; S) are smooth near x⋆. If the xk remain bounded and the models fxk(·; Sk) satisfy our conditions, then

(1/√k) Σi=1..k (xi − x⋆) →d N(0, ∇²F(x⋆)⁻¹ Cov(∇f(x⋆; S)) ∇²F(x⋆)⁻¹).

[Figure: empirical confirmation for the truncated model]

slide-69
SLIDE 69

New asymptotic analysis (convex case)

Theorem (Asi & D. 18). Let F be convex and strongly convex in a neighborhood of x⋆, and assume the f(·; S) are smooth near x⋆. If the xk remain bounded and the models fxk(·; Sk) satisfy our conditions, then

(1/√k) Σi=1..k (xi − x⋆) →d N(0, ∇²F(x⋆)⁻¹ Cov(∇f(x⋆; S)) ∇²F(x⋆)⁻¹).

◮ Optimal by the local minimax theorem [Hájek 72; Le Cam 73; D. & Ruan 19]
◮ Key insight: subgradients of fxk(·; Sk) are close to ∇f(xk; Sk)

slide-70
SLIDE 70

Convergence to stationarity in weakly convex cases

Convergence analysis requires the Moreau envelope [Davis & Drusvyatskiy 18]:

Fλ(x) := inf_{y∈X} { F(y) + (λ/2)‖y − x‖₂² }

Important properties:

◮ Proximal mapping: xλ := prox_{F/λ}(x) := argmin_{y∈X} { F(y) + (λ/2)‖y − x‖₂² } satisfies ∇Fλ(x) = λ(x − xλ)
◮ Near-stationarity and decrease: F(xλ) ≤ F(x) and dist(0, ∂F(xλ)) ≤ ‖∇Fλ(x)‖₂

slide-71
SLIDE 71

Convergence to stationarity in weakly convex cases

Fλ(x) := inf_{y∈X} { F(y) + (λ/2)‖y − x‖₂² }

◮ Proximal mapping: xλ := prox_{F/λ}(x) satisfies ∇Fλ(x) = λ(x − xλ)
◮ Near-stationarity and decrease: F(xλ) ≤ F(x) and dist(0, ∂F(xλ)) ≤ ‖∇Fλ(x)‖₂

Convergence: we say the iterates xk converge if ∇Fλ(xk) → 0

slide-72
SLIDE 72

Moreau envelope of the absolute value

For F(x) = |x|,

Fλ(x) = (λ/2)x²  if |x| ≤ 1/λ,   and   Fλ(x) = |x| − 1/(2λ)  if |x| > 1/λ

◮ F′λ(x) = λx for |x| ≤ 1/λ, so |F′λ(x)| = λ dist(x, 0) there
◮ prox step: xλ = 0 if |x| ≤ 1/λ
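A quick numerical sanity check of this closed form (illustrative; the grid minimization is a brute-force stand-in for the infimum):

```python
import numpy as np

def moreau_abs(x, lam):
    """Closed-form Moreau envelope of F(y) = |y|."""
    if abs(x) <= 1.0 / lam:
        return 0.5 * lam * x * x
    return abs(x) - 1.0 / (2.0 * lam)

lam = 2.0
ys = np.linspace(-3.0, 3.0, 200001)  # fine grid that contains 0 exactly
max_gap = 0.0
for x in (0.1, 0.4, 1.0, -2.5):
    # brute-force inf_y |y| + (lam/2) * (y - x)^2 over the grid
    brute = np.min(np.abs(ys) + 0.5 * lam * (ys - x) ** 2)
    max_gap = max(max_gap, abs(brute - moreau_abs(x, lam)))
print(max_gap)  # tiny: the grid minimum matches the closed form
```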

slide-73
SLIDE 73

Convergence in weakly convex cases

Use the regularized stochastic proximal-point method,

xk+1 = argmin_{x∈X} { f(x; Sk) + (ρ(Sk)/2)‖x − xk‖₂² + (1/(2αk))‖x − xk‖₂² }.

Theorem (Asi & D. 19). Let the random f be ρ(s)-weakly convex with E[ρ²(S)] < ∞. With the proximal-point iteration, the iterates xk satisfy

Fλ(xk) →a.s. G   and   Σ_{k≥1} αk ‖∇Fλ(xk)‖₂² < ∞.

slide-74
SLIDE 74

Convergence in weakly convex cases

Theorem (Asi & D. 19). Let the random f be ρ(s)-weakly convex with E[ρ²(S)] < ∞. With the proximal-point iteration, the iterates xk satisfy

Fλ(xk) →a.s. G   and   Σ_{k≥1} αk ‖∇Fλ(xk)‖₂² < ∞.

Proposition (Asi & D. 19). If the iterates xk remain bounded and the image of the stationary points has measure zero, then ∇Fλ(xk) →a.s. 0.

slide-75
SLIDE 75

What is an easy problem?

◮ Interpolation problems [Belkin, Hsu, Mitra 18; Ma, Bassily, Belkin 18]
◮ Overparameterized linear systems (Kaczmarz algorithms) [Strohmer & Vershynin 09; Needell, Srebro, Ward 14; Needell & Tropp 14]
◮ Random projections for linear constraints [Leventhal & Lewis 10]

[Figure: training curves on (a) MNIST, (b) CIFAR-10, (c) SVHN (4 subsamples)]


slide-77
SLIDE 77

What is an easy problem?

minimize_x  F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)

Definition: the problem is easy if there exists x⋆ such that f(x⋆; S) = inf_x f(x; S) with probability 1. [Schmidt & Le Roux 13; Ma, Bassily, Belkin 18; Belkin, Rakhlin, Tsybakov 18]

slide-78
SLIDE 78

What is an easy problem?

minimize_x  F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)

Definition: the problem is easy if there exists x⋆ such that f(x⋆; S) = inf_x f(x; S) with probability 1. [Schmidt & Le Roux 13; Ma, Bassily, Belkin 18; Belkin, Rakhlin, Tsybakov 18]

One additional condition:

  • iv. The models fx satisfy fx(y; s) ≥ inf_{x⋆∈X} f(x⋆; s)

slide-79
SLIDE 79

Easy strongly convex problems

Theorem (Asi & D. 18). Let F satisfy the growth condition F(x) ≥ F(x⋆) + (λ/2) dist(x, X⋆)², where X⋆ = argmin_x F(x), and be easy. Then

E[dist(xk, X⋆)²] ≤ max{ exp(−c(α1 + ··· + αk)), exp(−ck) } · dist(x1, X⋆)².
slide-80
SLIDE 80

Easy strongly convex problems

Theorem (Asi & D. 18). Let F satisfy the growth condition F(x) ≥ F(x⋆) + (λ/2) dist(x, X⋆)², where X⋆ = argmin_x F(x), and be easy. Then

E[dist(xk, X⋆)²] ≤ max{ exp(−c(α1 + ··· + αk)), exp(−ck) } · dist(x1, X⋆)².

◮ Adaptive no matter the stepsizes
◮ Most other results (e.g. for SGM [Schmidt & Le Roux 13; Ma, Bassily, Belkin 18]) require careful stepsize choices

slide-81
SLIDE 81

Sharp problems

Definition: an objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]

◮ Piecewise linear objectives
◮ Hinge loss F(x) = (1/m) Σi [1 − aiᵀ x]₊
slide-84
SLIDE 84

Sharp convex problems

Definition: an objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]

◮ Piecewise linear objectives
◮ Hinge loss F(x) = (1/m) Σi [1 − aiᵀ x]₊
◮ Projection onto intersections: F(x) = (1/m) Σi dist(x, Ci)

slide-85
SLIDE 85

Sharp convex problems

Definition: an objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]

Theorem (Asi & D. 18). Let F have sharp growth and be easy. If F is convex,

E[dist(xk+1, X⋆)²] ≤ max{ exp(−ck), exp(−c(α1 + ··· + αk)) } · dist(x1, X⋆)².
slide-86
SLIDE 86

Sharp weakly convex problems

Definition: an objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]

◮ Phase retrieval: F(x) = (1/m) ‖(Ax)² − (Ax⋆)²‖₁
◮ Blind deconvolution [Charisopoulos et al. 19]

slide-87
SLIDE 87

Sharp weakly convex problems

Theorem (Asi & D. 19). Let F have sharp growth and be easy. There exists c ∈ (0, 1) such that on the event xk → X⋆,

lim sup_k dist(xk, X⋆) / (1 − c)ᵏ < ∞.

slide-88
SLIDE 88

Outline

◮ Motivating experiments
◮ Models in optimization
◮ Stochastic optimization
◮ Stability is better
◮ Nothing gets worse
◮ Beyond convexity
◮ Adaptivity in easy problems
◮ Revisiting experimental results

slide-89
SLIDE 89

Methods

Iterate  xk+1 = argmin_x { fxk(x; Sk) + (1/(2αk))‖x − xk‖₂² }

slide-90
SLIDE 90

Methods

Iterate  xk+1 = argmin_x { fxk(x; Sk) + (1/(2αk))‖x − xk‖₂² }

◮ Stochastic gradient: fxk(x; Sk) = f(xk; Sk) + ⟨f′(xk; Sk), x − xk⟩
◮ Truncated gradient (for f ≥ 0): fxk(x; Sk) = [f(xk; Sk) + ⟨f′(xk; Sk), x − xk⟩]₊
◮ (Stochastic) proximal point: fxk(x; Sk) = f(x; Sk)

slide-91
SLIDE 91

Linear regression with low noise

F(x) = (1/(2m)) Σi (aiᵀ x − bi)²

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM, Truncated, and Prox]

slide-92
SLIDE 92

Linear regression with no noise

F(x) = (1/(2m)) Σi (aiᵀ x − bi)²

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM, Truncated, and Prox]


slide-94
SLIDE 94

Linear regression with “poor” conditioning

[Figure: time to accuracy ǫ = 0.055 vs. initial stepsize α0 for Proximal, SGM, Truncated, and Bundle]

Poor conditioning? κ(A) = 15

slide-95
SLIDE 95

Multiclass hinge loss: no noise

f(x; (a, l)) = max_{i≠l} [1 + ⟨a, xi − xl⟩]₊

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM, Truncated, and Prox]

slide-96
SLIDE 96

Multiclass hinge loss: small label flipping

f(x; (a, l)) = max_{i≠l} [1 + ⟨a, xi − xl⟩]₊

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM, Truncated, and Prox]

slide-97
SLIDE 97

Multiclass hinge loss: substantial label flipping

f(x; (a, l)) = max_{i≠l} [1 + ⟨a, xi − xl⟩]₊

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM, Truncated, and Prox]


slide-99
SLIDE 99

(Robust) Phase retrieval [Candès, Li, Soltanolkotabi 15]

Observations (usually) bi = ⟨ai, x⋆⟩² yield the objective

f(x) = (1/m) Σi |⟨ai, x⟩² − bi|
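As an illustrative sketch (synthetic data, local initialization, and made-up parameters; not the talk's experiment), truncated-model steps on this objective, i.e. subgradient steps with the stepsize clipped at the Polyak-type value f/‖g‖², remain stable even with a large stepsize:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 200, 5
A = rng.normal(size=(m, d))
x_star = rng.normal(size=d)
b = (A @ x_star) ** 2  # noiseless phase-retrieval observations

x = x_star + 0.05 * rng.normal(size=d)  # local initialization near x_star
for _ in range(4000):
    i = rng.integers(m)
    a = A[i]
    r = (a @ x) ** 2 - b[i]                # f(x; i) = |r|
    g = np.sign(r) * 2.0 * (a @ x) * a     # subgradient of |<a, x>^2 - b_i|
    g2 = g @ g
    if g2 > 0 and r != 0:
        x = x - min(1.0, abs(r) / g2) * g  # truncated step, alpha = 1

# error up to the global sign ambiguity x <-> -x
print(min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star)))
```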

slide-100
SLIDE 100

Phase retrieval without noise

F(x) = (1/m) Σi |⟨ai, x⟩² − bi|

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for Proximal, SGM, and Truncated]

slide-101
SLIDE 101

Matrix completion without noise

F(x, y) = Σ_{i,j∈Ω} |⟨xi, yj⟩ − Mij|

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM and Truncated]

slide-102
SLIDE 102

Deep learning experiments

CIFAR-10 dataset: 10-class image classification

[Figure: time to error ǫ = 0.11 vs. initial stepsize α0 for SGM, Truncated, adam, and trunc-adagrad]

slide-103
SLIDE 103

Deep learning experiment: dog recognition

Stanford Dogs: 120-class dog breed classification

[Figure: time to error ǫ = 0.35 vs. initial stepsize α0 for SGM, Truncated, adam, and trunc-adagrad]

slide-104
SLIDE 104

Conclusions

◮ Perhaps blind application of stochastic gradient methods is not the right answer
◮ Care and better modeling can yield improved performance
◮ Computational efficiency is important in model choice

slide-105
SLIDE 105

Conclusions

◮ Perhaps blind application of stochastic gradient methods is not the right answer
◮ Care and better modeling can yield improved performance
◮ Computational efficiency is important in model choice

Questions

◮ Parallelism?
◮ The importance of better models in stochastic optimization. arXiv:1903.08619
◮ Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity. arXiv:1810.05633