SLIDE 1 The importance of better models in stochastic optimization
John Duchi (based on joint work with Feng Ruan and Hilal Asi) Stanford University Les Houches 2019
SLIDE 2
Outline
Motivating experiments Models in optimization Stochastic optimization Stability is better Nothing gets worse Beyond convexity Adaptivity in easy problems Revisiting experimental results
SLIDE 3
Outline
Motivating experiments Models in optimization Stochastic optimization Stability is better Nothing gets worse Beyond convexity Adaptivity in easy problems Revisiting experimental results
SLIDE 4
Why robustness is important
How much ENERGY spent in this paper?
SLIDE 5–10
Why robustness is important
How much ENERGY spent in this paper? ≡ How many Toyota Camrys from SF to LA?
Guesses: 1? 10? 100? 1000? Answer: 4200
SLIDE 11–13 Stochastic gradient methods
The problem in this talk:
minimize_x F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)   subject to x ∈ X
Weakly convex functions: for each s, there is some ρ(s) such that
x ↦ f(x; s) + (ρ(s)/2) ‖x‖² is convex
◮ add a big enough quadratic, and it becomes convex
◮ add a big enough quadratic, it becomes convex
SLIDE 15–16 Stochastic gradient methods
Stochastic gradient method: xk+1 = xk − αk gk,  gk ∈ ∂f(xk; Sk)
Why do we use this?
◮ Easy to analyze?
◮ Default in software packages and simple to implement?
◮ It works?
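The update above can be sketched in a few lines. This is an illustrative toy only: the data stream (minimizer x⋆ = 2) and the schedule αk = α0/√k are made up, and the function name `sgm` is mine:

```python
# Stochastic gradient method: x_{k+1} = x_k - alpha_k * g_k, g_k in df(x_k; S_k),
# on a toy least-squares stream f(x; (a, b)) = 0.5*(a*x - b)^2
# with a in {-1, +1} and b = 2a, so the minimizer is x* = 2.

def sgm(samples, x0=0.0, alpha0=0.5):
    x = x0
    for k, (a, b) in enumerate(samples, start=1):
        alpha = alpha0 / k ** 0.5   # polynomially decaying stepsize
        g = a * (a * x - b)         # gradient of the sampled loss at x
        x -= alpha * g
    return x

stream = [((-1) ** k * 1.0, (-1) ** k * 2.0) for k in range(2000)]
print(round(sgm(stream), 4))   # -> 2.0
```

With this small α0 the iteration is well behaved; the talk's point is precisely that for larger α0 this same loop can be very sensitive to the stepsize choice.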
SLIDE 17–19 Linear regression
F(x) = (1/2m) Σᵢ₌₁ᵐ (aᵢᵀx − bᵢ)²
[Figure: time to ε-accuracy vs. initial stepsize α₀ for SGM, Truncated, and Prox methods]
SLIDE 20
Outline
Motivating experiments Models in optimization Stochastic optimization Stability is better Nothing gets worse Beyond convexity Adaptivity in easy problems Revisiting experimental results
SLIDE 21 Optimization methods
How do we solve optimization problems?
1. Build a “good” but simple local model of f
2. Minimize the model (perhaps regularizing)
SLIDE 22 Optimization methods
Gradient descent: Taylor (first-order) model
f(y) ≈ fx(y) := f(x) + ∇f(x)ᵀ(y − x)
SLIDE 23 Optimization methods
Newton’s method: Taylor (second-order) model
f(y) ≈ fx(y) := f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ∇²f(x)(y − x)
SLIDE 24
Composite optimization problems (other model-able structures)
The problem: minimize_x f(x) := h(c(x)),
where h : Rᵐ → R is convex and c : Rⁿ → Rᵐ is smooth
[Fletcher & Watson 80; Fletcher 82; Burke 85; Wright 87; Lewis & Wright 15; Drusvyatskiy & Lewis 16]
SLIDE 25–30 Modeling composite problems
Now we make a convex model of f(x) = h(c(x)):
fx(y) := h( c(x) + ∇c(x)ᵀ(y − x) )   [Burke 85; Drusvyatskiy, Ioffe, Lewis 16]
SLIDE 31–33 Modeling composite problems
Example: f(x) = |x² − 1|, with h(z) = |z| and c(x) = x² − 1
SLIDE 34–39 The prox-linear method [Burke, Drusvyatskiy et al.]
Iteratively (1) form a regularized convex model and (2) minimize it:
xk+1 = argmin_{x∈X} { h( c(xk) + ∇c(xk)ᵀ(x − xk) ) + (1/(2α)) ‖x − xk‖² }
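For the running example f(x) = |x² − 1| (so h(z) = |z|, c(x) = x² − 1), the one-dimensional prox-linear subproblem can be solved in closed form by soft-thresholding the residual c(x). A sketch under those assumptions (function name is mine):

```python
# Prox-linear step for f(x) = |x**2 - 1|: h(z) = |z|, c(x) = x**2 - 1.
# The subproblem argmin_y |c + g*(y - x)| + (y - x)**2/(2*alpha), g = c'(x),
# reduces to soft-thresholding c at level alpha*g**2.

def prox_linear_step(x, alpha):
    c = x ** 2 - 1.0        # c(x)
    g = 2.0 * x             # c'(x)
    if g == 0.0:            # linearization is constant: stay put
        return x
    shrunk = max(abs(c) - alpha * g ** 2, 0.0)   # soft-threshold |c|
    u = shrunk if c >= 0 else -shrunk            # thresholded residual
    return x + (u - c) / g                       # map back to the iterate

x = 2.0
for _ in range(20):
    x = prox_linear_step(x, alpha=0.5)
print(round(x, 6))   # -> 1.0, a minimizer of |x^2 - 1|
```

Once |c(x)| falls below α·c′(x)², the threshold zeroes the residual and the step coincides with a Newton step on c, which is why the iterates home in on the root x = 1 so quickly here.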
SLIDE 40–41 Generic(ish) optimization methods
Iterate xk+1 = argmin_{x∈X} { fxk(x) + (1/(2αk)) ‖x − xk‖² }
◮ Proximal point method (fx = f) [Rockafellar 76]
◮ Gradient descent (fx(y) = f(x) + ⟨∇f(x), y − x⟩)
◮ Newton (fx(y) = f(x) + ⟨∇f(x), y − x⟩ + (1/2)(y − x)ᵀ∇²f(x)(y − x))
◮ Prox-linear (fx(y) = h(c(x) + ∇c(x)ᵀ(y − x)))
SLIDE 42–43 The aProx family for stochastic optimization
Iterate:
◮ Sample Sk iid ∼ P
◮ Update by minimizing the model:
xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk)) ‖x − xk‖² }
◮ Stochastic gradient method
◮ Stochastic proximal-point (implicit gradient) method, fxk(x; Sk) = f(x; Sk)
[Rockafellar 76; Kulis & Bartlett 10; Karampatziakis & Langford 11; Bertsekas 11; Toulis & Airoldi 17; Ryu & Boyd 16]
◮ Stochastic prox-linear methods [D. & Ruan 18; Davis & Drusvyatskiy 18; Asi & D. 19]
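For a scalar least-squares loss f(x; (a, b)) = ½(ax − b)², the stochastic proximal-point subproblem has an exact closed form. A toy sketch (function name, data, and stepsize are made up) showing why the implicit step is stable even with an enormous stepsize:

```python
# Stochastic proximal-point (implicit) update for f(x; (a, b)) = 0.5*(a*x - b)^2:
# the subproblem argmin_y { 0.5*(a*y - b)**2 + (y - x)**2/(2*alpha) }
# is solved exactly by setting the derivative to zero.

def prox_point_step_ls(x, a, b, alpha):
    # a*(a*y - b) + (y - x)/alpha = 0  =>  closed-form minimizer:
    return x - alpha * (a * x - b) * a / (1.0 + alpha * a ** 2)

# Even with alpha = 100 the iterate cannot overshoot: the implicit step
# damps the update by 1/(1 + alpha*a^2) -- the stability the talk emphasizes.
x = 0.0
for _ in range(50):
    x = prox_point_step_ls(x, a=1.0, b=2.0, alpha=100.0)
print(round(x, 6))   # -> 2.0
```

Contrast with the explicit gradient step x − α·a(ax − b), which for α·a² > 2 flips sign and amplifies the error.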
SLIDE 44 Models in stochastic optimization
Conditions on our models (convex case):
i. y ↦ fx(y; s) is convex
ii. fx(y; s) ≤ f(y; s)
iii. fx(x; s) = f(x; s) and ∂fx(x; s) ⊂ ∂f(x; s)
[D. & Ruan 17; Davis & Drusvyatskiy 18]
SLIDE 45 Models in stochastic optimization
Conditions on our models (ρ-weakly convex case):
i. y ↦ fx(y; s) is convex
ii. fx(y; s) ≤ f(y; s) + (ρ(s)/2) ‖x − y‖²
iii. fx(x; s) = f(x; s) and ∂fx(x; s) ⊂ ∂f(x; s)
[D. & Ruan 17; Davis & Drusvyatskiy 18; Asi & D. 19]
SLIDE 46–48 Modeling conditions
Model fx(y) of f near x:
fx0(y) = f(x0) + ∇f(x0)ᵀ(y − x0)
truncated: fx0(y) = [ f(x0) + ∇f(x0)ᵀ(y − x0) ]₊
[Figure: f with its linear and truncated models at (x0, f(x0))]
SLIDE 49 Models in stochastic optimization
[Figure: linear vs. truncated models at x0, x1]
i. (Sub)gradient: fx(y) = f(x) + ⟨f′(x), y − x⟩
ii. Truncated: fx(y) = (f(x) + ⟨f′(x), y − x⟩) ∨ inf_x f(x)
iii. Bundle/multi-line: fx(y) = maxᵢ { f(xᵢ) + ⟨f′(xᵢ), y − xᵢ⟩ }
iv. Prox-linear: fx(y) = h(c(x) + ∇c(x)ᵀ(y − x))
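For losses with known lower bound 0, the truncated model (ii above, with inf f = 0) also yields a closed-form update: a Polyak-style capped step. A sketch with hypothetical helper names and a made-up toy instance:

```python
# Truncated-model update for a loss with lower bound 0:
# minimizing max(f(x) + <g, y - x>, 0) + ||y - x||^2/(2*alpha) over y gives
# y = x - min(alpha, f(x)/||g||^2) * g, a Polyak-style capped step.

def truncated_step(x, f_val, g, alpha):
    gnorm2 = sum(gi * gi for gi in g)
    if gnorm2 == 0.0:          # flat model: stay put
        return list(x)
    t = min(alpha, f_val / gnorm2)
    return [xi - t * gi for xi, gi in zip(x, g)]

# f(x) = |x1 + x2 - 1| at x = (3, 0): f = 2, g = (1, 1).
# Even a huge alpha is harmless: the cap stops at the model's zero.
print(truncated_step([3.0, 0.0], f_val=2.0, g=[1.0, 1.0], alpha=1e6))
# -> [2.0, -1.0]  (and indeed 2.0 + (-1.0) - 1 = 0)
```

The cap f(x)/‖g‖² is exactly where the linear model crosses its lower bound, so no stepsize choice can drive the iterate past the model's own prediction of the minimum.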
SLIDE 51
Outline
Motivating experiments Models in optimization Stochastic optimization Stability is better Nothing gets worse Beyond convexity Adaptivity in easy problems Revisiting experimental results
SLIDE 52 The aProx family
Iterate:
◮ Sample Sk iid ∼ P
◮ Update by minimizing the model:
xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk)) ‖x − xk‖² }
SLIDE 53–62
Divergence of a gradient method
[Figure animation: gradient iterates diverging]
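A two-line numerical contrast of the phenomenon (a deterministic toy instance with made-up constants, not the talk's example): the gradient step with a too-large stepsize amplifies the error every iteration, while the proximal-point step contracts it no matter the stepsize.

```python
# One least-squares sample f(x) = 0.5*(2x - 2)^2 (minimizer x* = 1).
# With alpha = 1 the gradient step multiplies the error by (1 - alpha*a^2) = -3
# each iteration; the proximal-point step multiplies it by 1/(1 + alpha*a^2) = 1/5.

a, b, alpha = 2.0, 2.0, 1.0
x_gd = x_pp = 0.0
for _ in range(15):
    x_gd -= alpha * a * (a * x_gd - b)                        # gradient step
    x_pp -= alpha * (a * x_pp - b) * a / (1 + alpha * a ** 2) # prox step
print(abs(x_gd - 1) > 1e6, abs(x_pp - 1) < 1e-9)   # -> True True
```

After 15 iterations the gradient iterate has error 3¹⁵ ≈ 1.4 × 10⁷ while the proximal iterate has error 5⁻¹⁵.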
SLIDE 63–64 Stability guarantees (convex)
Use the full stochastic-proximal method,
xk+1 = argmin_{x∈X} { f(x; Sk) + (1/(2αk)) ‖x − xk‖² }
Theorem (Asi & D. 18). Assume X⋆ = argmin_{x∈X} F(x) is non-empty and E[‖f′(x⋆; S)‖²] ≤ σ². Then
E[dist(xk, X⋆)²] ≤ dist(x0, X⋆)² + σ² Σᵢ₌₁ᵏ αᵢ².
Theorem (Asi & D. 18). Under the same assumptions,
sup_k dist(xk, X⋆) < ∞ and dist(xk, X⋆) →a.s. 0.
SLIDE 65 Stability guarantees (convex)
Use any model with fx(y; s) ≥ inf_z f(z; s) (i.e. a good lower bound),
xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk)) ‖x − xk‖² }
Theorem (Asi & D. 19). Assume X⋆ = argmin_{x∈X} F(x) is non-empty and there exists p < ∞ such that E[‖f′(x; S)‖²] ≤ C(1 + dist(x, X⋆)ᵖ). Then
sup_k dist(xk, X⋆) < ∞ and dist(xk, X⋆) →a.s. 0.
SLIDE 66 Example behaviors
On the least-squares objective F(x) = (1/2m) Σᵢ₌₁ᵐ (aᵢᵀx − bᵢ)²
[Figure: iterate behavior of SGM vs. Prox]
SLIDE 67 Classical asymptotic analysis
Theorem (Polyak & Juditsky 92). Let F be convex and strongly convex in a neighborhood of x⋆, and assume the f(x; S) are globally smooth. For xk generated by the stochastic gradient method,
(1/√k) Σᵢ₌₁ᵏ (xᵢ − x⋆) →d N( 0, ∇²F(x⋆)⁻¹ Cov(∇f(x⋆; S)) ∇²F(x⋆)⁻¹ ).
SLIDE 68–69 New asymptotic analysis (convex case)
Theorem (Asi & D. 18). Let F be convex and strongly convex in a neighborhood of x⋆, and assume the f(x; S) are smooth near x⋆. Then if the xk remain bounded and the models fxk(·; Sk) satisfy our conditions,
(1/√k) Σᵢ₌₁ᵏ (xᵢ − x⋆) →d N( 0, ∇²F(x⋆)⁻¹ Cov(∇f(x⋆; S)) ∇²F(x⋆)⁻¹ ).
◮ Optimal by the local minimax theorem [Hájek 72; Le Cam 73]
◮ Key insight: subgradients of fxk(·; Sk) close to ∇f(xk; Sk)
SLIDE 70–71 Convergence to stationarity in weakly convex cases
Convergence requires the Moreau envelope [Davis & Drusvyatskiy 18]:
Fλ(x) := inf_{y∈X} { F(y) + (λ/2) ‖y − x‖² }
Important properties:
◮ Proximal mapping:
xλ := prox_{F/λ}(x) := argmin_{y∈X} { F(y) + (λ/2) ‖y − x‖² },  ∇Fλ(x) = λ(x − xλ)
◮ Near stationarity and decrease:
F(xλ) ≤ F(x) and dist(0, ∂F(xλ)) ≤ ‖∇Fλ(x)‖
Convergence: say the iterates xk converge if ∇Fλ(xk) → 0
SLIDE 72 Moreau envelope of the absolute value
For F(x) = |x|,
Fλ(x) = (λ/2) x²  if |x| ≤ λ⁻¹,   |x| − 1/(2λ)  if |x| > λ⁻¹
◮ F′λ(x) = λx for |x| ≤ λ⁻¹
◮ |F′λ(x)| = λ |x − xλ| = λ dist(x, xλ)
◮ prox step: xλ = 0 if |x| ≤ 1/λ
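The closed form above is easy to check in code. A small sketch (function names are mine; `lam` plays the role of λ) computing the envelope and the prox of the absolute value:

```python
# Moreau envelope and prox of F(y) = |y|, checking the closed form above.

def moreau_abs(x, lam):
    """F_lam(x) = inf_y { |y| + (lam/2)*(y - x)**2 }:
    (lam/2)*x**2 if |x| <= 1/lam, else |x| - 1/(2*lam)."""
    if abs(x) <= 1.0 / lam:
        return 0.5 * lam * x * x
    return abs(x) - 0.5 / lam

def prox_abs(x, lam):
    """x_lam = prox_{F/lam}(x): soft-thresholding at level 1/lam."""
    shrunk = max(abs(x) - 1.0 / lam, 0.0)
    return shrunk if x >= 0 else -shrunk

lam = 2.0
for x in (-1.5, -0.25, 0.0, 0.25, 4.0):
    x_lam = prox_abs(x, lam)
    assert abs(x_lam) <= abs(x)                    # F(x_lam) <= F(x)
    assert moreau_abs(x, lam) <= abs(x) + 1e-12    # envelope lower-bounds F
    assert abs(lam * (x - x_lam)) <= 1.0 + 1e-12   # |grad F_lam| <= 1 = Lip(F)
print(moreau_abs(0.25, lam), moreau_abs(4.0, lam))   # -> 0.0625 3.75
```

Note the gradient identity ∇Fλ(x) = λ(x − xλ) holds in both regimes: inside |x| ≤ 1/λ it gives λx, outside it saturates at ±1, the Lipschitz constant of |·|.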
SLIDE 73–74 Convergence in weakly convex cases
Use the regularized stochastic-proximal point method,
xk+1 = argmin_{x∈X} { f(x; Sk) + (ρ(Sk)/2) ‖x − xk‖² + (1/(2αk)) ‖x − xk‖² }
Theorem (Asi & D. 19). Let the random f be ρ(s)-weakly convex with E[ρ²(S)] < ∞. With the proximal-point iteration, the iterates xk satisfy
Fλ(xk) →a.s. G and Σ_{k=1}^∞ αk ‖∇Fλ(xk)‖² < ∞.
Proposition (Asi & D. 19). If the iterates xk remain bounded and the image of the stationary points has measure zero, then ∇Fλ(xk) →a.s. 0.
SLIDE 75 What is an easy problem?
◮ Interpolation problems [Belkin, Hsu, Mitra 18; Ma, Bassily, Belkin 18]
◮ Overparameterized linear systems (Kaczmarz algorithms) [Strohmer & Vershynin 09; Needell, Srebro, Ward 14; Needell & Tropp 14]
◮ Random projections for linear constraints [Leventhal & Lewis 10]
[Figure: (a) MNIST (b) CIFAR-10 (c) SVHN (4 subsamples)]
SLIDE 76–78 What is an easy problem?
minimize_x F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)
Definition: the problem is easy if there exists x⋆ such that f(x⋆; S) = inf_x f(x; S) with probability 1. [Schmidt & Le Roux 13; Ma, Bassily, Belkin 18; Belkin, Rakhlin, Tsybakov 18]
One additional condition:
iv. The models fx satisfy fx(y; s) ≥ inf_{x⋆∈X} f(x⋆; s)
SLIDE 79–80 Easy strongly convex problems
Theorem (Asi & D. 18). Let the function F satisfy the growth condition
F(x) ≥ F(x⋆) + (λ/2) dist(x, X⋆)², where X⋆ = argmin_x F(x),
and be easy. Then for some c > 0,
E[dist(xk, X⋆)²] ≤ max{ ∏ᵢ₌₁ᵏ (1 + λαᵢ)⁻¹, exp(−ck) } dist(x₁, X⋆)².
◮ Adaptive no matter the stepsizes
◮ Most other results (e.g. for SGM [Schmidt & Le Roux 13; Ma, Bassily, Belkin 18]) require careful stepsize choices
SLIDE 81–83 Sharp problems
Definition: An objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]
◮ Piecewise linear objectives
◮ Hinge loss F(x) = (1/m) Σᵢ₌₁ᵐ [1 − bᵢ⟨aᵢ, x⟩]₊
SLIDE 84 Sharp convex problems
◮ Projection onto intersections: F(x) = (1/m) Σᵢ₌₁ᵐ dist(x, Cᵢ)
SLIDE 85 Sharp convex problems
Definition: An objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]
Theorem (Asi & D. 18). Let F have sharp growth and be easy. If F is convex, the iterates converge linearly: there is c ∈ (0, 1) with
E[dist(xk+1, X⋆)²] ≤ (1 − c)ᵏ dist(x₁, X⋆)².
SLIDE 86–87 Sharp weakly convex problems
Definition: An objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]
◮ Phase retrieval F(x) = (1/m) Σᵢ₌₁ᵐ |⟨aᵢ, x⟩² − bᵢ|
◮ Blind deconvolution [Charisopoulos et al. 19]
Theorem (Asi & D. 19). Let F have sharp growth and be easy. There exists c ∈ (0, 1) such that on the event xk → X⋆,
lim sup_k dist(xk, X⋆) / (1 − c)ᵏ < ∞.
SLIDE 88
Outline
Motivating experiments Models in optimization Stochastic optimization Stability is better Nothing gets worse Beyond convexity Adaptivity in easy problems Revisiting experimental results
SLIDE 89–90 Methods
Iterate xk+1 = argmin_x { fxk(x; Sk) + (1/(2αk)) ‖x − xk‖² }
◮ (Stochastic) subgradient:
fxk(x; Sk) = f(xk; Sk) + ⟨f′(xk; Sk), x − xk⟩
◮ Truncated gradient (f ≥ 0):
fxk(x; Sk) = [ f(xk; Sk) + ⟨f′(xk; Sk), x − xk⟩ ]₊
◮ (Stochastic) proximal point:
fxk(x; Sk) = f(x; Sk)
SLIDE 91 Linear regression with low noise
F(x) = (1/2m) Σᵢ₌₁ᵐ (aᵢᵀx − bᵢ)²
[Figure: time to ε-accuracy vs. initial stepsize α₀ for SGM, Truncated, and Prox]
SLIDE 92 Linear regression with no noise
F(x) = (1/2m) Σᵢ₌₁ᵐ (aᵢᵀx − bᵢ)²
[Figure: time to ε-accuracy vs. initial stepsize α₀ for SGM, Truncated, and Prox]
SLIDE 93–94 Linear regression with “poor” conditioning
[Figure: time to accuracy ε = 0.055 vs. initial stepsize for Proximal, SGM, Truncated, and Bundle]
Poor conditioning? κ(A) = 15
SLIDE 95–97 Multiclass hinge loss: no noise / small label flipping / substantial label flipping
f(x; (a, l)) = max_{i≠l} [1 + ⟨a, xᵢ − x_l⟩]₊
[Figures: time to ε-accuracy vs. initial stepsize α₀ for SGM, Truncated, and Prox]
SLIDE 98–99 (Robust) phase retrieval [Candès, Li, Soltanolkotabi 15]
Observations (usually) bᵢ = ⟨aᵢ, x⋆⟩² yield the objective
f(x) = (1/m) Σᵢ₌₁ᵐ |⟨aᵢ, x⟩² − bᵢ|
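A sketch of this objective and one subgradient (chain rule through z ↦ |z| and x ↦ ⟨aᵢ, x⟩²) on a tiny synthetic noiseless instance, where the problem is "easy" in the talk's sense. All names and data below are mine:

```python
# Robust phase retrieval objective f(x) = (1/m) * sum_i |<a_i, x>^2 - b_i|
# with a subgradient: sign(<a_i, x>^2 - b_i) * 2*<a_i, x> * a_i, averaged over i.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def objective(x, A, b):
    return sum(abs(dot(a, x) ** 2 - bi) for a, bi in zip(A, b)) / len(b)

def subgrad(x, A, b):
    g = [0.0] * len(x)
    for a, bi in zip(A, b):
        r = dot(a, x) ** 2 - bi
        s = (1.0 if r > 0 else -1.0 if r < 0 else 0.0) * 2.0 * dot(a, x) / len(b)
        g = [gj + s * aj for gj, aj in zip(g, a)]
    return g

x_star = [1.0, -2.0]
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
b = [dot(a, x_star) ** 2 for a in A]     # noiseless, so the problem is "easy"
print(objective(x_star, A, b), subgrad(x_star, A, b))   # -> 0.0 [0.0, 0.0]
```

Since b is generated without noise, f(x⋆; s) = 0 = inf_x f(x; s) for every sample, exactly the interpolation condition under which the adaptivity results apply (note −x⋆ attains zero loss too: solutions come in sign pairs).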
SLIDE 100 Phase retrieval without noise
F(x) = (1/m) Σᵢ₌₁ᵐ |⟨aᵢ, x⟩² − bᵢ|
[Figure: time to ε-accuracy vs. initial stepsize α₀ for Proximal, SGM, and Truncated]
SLIDE 101 Matrix completion without noise
F(x, y) = Σᵢⱼ |⟨xᵢ, yⱼ⟩ − Mᵢⱼ|  (sum over observed entries)
[Figure: time to ε-accuracy vs. initial stepsize for SGM and Truncated]
SLIDE 102 Deep learning experiments
CIFAR-10 dataset: 10-class image classification
[Figure: time to error ε = .11 vs. initial stepsize for SGM, Truncated, adam, and trunc-adagrad]
SLIDE 103 Deep learning experiment: dog recognition
Stanford Dogs: 120-class dog breed classification
[Figure: time to error ε = .35 vs. initial stepsize α₀ for SGM, Truncated, adam, and trunc-adagrad]
SLIDE 104–105
Conclusions
◮ Perhaps blind application of stochastic gradient methods is not the right answer
◮ Care and better modeling can yield improved performance
◮ Computational efficiency is important in model choice
Questions
◮ Parallelism?
◮ The importance of better models in stochastic optimization. arXiv:1903.08619
◮ Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity. arXiv:1810.05633