

slide-1
SLIDE 1

Stochastic Algorithms in Machine Learning

Aymeric DIEULEVEUT

EPFL, Lausanne

December 1st, 2017, Journée Algorithmes Stochastiques, Paris Dauphine

1

slide-2
SLIDE 2

Outline

  • 1. Machine learning context.
  • 2. Stochastic algorithms to minimize Empirical Risk.
  • 3. Stochastic Approximation: using stochastic gradient descent (SGD) to minimize Generalization Risk.
  • 4. Markov chain: an insightful point of view on constant step size Stochastic Approximation.

2

slide-3
SLIDE 3

Supervised Machine Learning

Goal: predict a phenomenon from "explanatory variables", given a set of observations.

Bio-informatics. Input: DNA/RNA sequence. Output: disease predisposition / drug responsiveness. n ≈ 10 to 10⁴; d (e.g., number of bases) ≈ 10⁶.

Image classification. Input: handwritten digits / images. Output: digit. n up to 10⁹; d (e.g., number of pixels) ≈ 10⁶.

"Large scale" learning framework: both the number of examples n and the number of explanatory variables d are large.

3

slide-4
SLIDE 4

Supervised Machine Learning

◮ Consider an input/output pair (X, Y) ∈ 𝒳 × 𝒴, following some unknown distribution ρ.

◮ 𝒴 = ℝ (regression) or {−1, 1} (classification).

◮ Goal: find a function θ : 𝒳 → ℝ such that θ(X) is a good prediction for Y.

◮ Prediction as a linear function ⟨θ, Φ(X)⟩ of features Φ(X) ∈ ℝᵈ.

◮ Consider a loss function ℓ : 𝒴 × ℝ → ℝ₊: squared loss, logistic loss, 0-1 loss, etc.

◮ Define the generalization risk (a.k.a. generalization error, "true risk") as R(θ) := E_ρ[ℓ(Y, ⟨θ, Φ(X)⟩)].

4

slide-5
SLIDE 5

Empirical Risk minimization (I)

◮ Data: n observations (xᵢ, yᵢ) ∈ 𝒳 × 𝒴, i = 1, …, n, i.i.d.

◮ n very large, up to 10⁹; computer vision: d = 10⁴ to 10⁶.

◮ Empirical risk (or training error): R̂(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, ⟨θ, Φ(xᵢ)⟩).

◮ Empirical risk minimization (ERM), regularized: find θ̂ solution of

    min_{θ∈ℝᵈ} (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, ⟨θ, Φ(xᵢ)⟩) + µΩ(θ),

a convex data-fitting term plus a regularizer.

5

slide-6
SLIDE 6

Empirical Risk minimization (II)

For example, least-squares regression:

    min_{θ∈ℝᵈ} (1/2n) Σᵢ₌₁ⁿ (yᵢ − ⟨θ, Φ(xᵢ)⟩)² + µΩ(θ),

and logistic regression:

    min_{θ∈ℝᵈ} (1/n) Σᵢ₌₁ⁿ log(1 + exp(−yᵢ⟨θ, Φ(xᵢ)⟩)) + µΩ(θ).

Two fundamental questions: (1) computing θ̂; (2) analyzing θ̂.
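The two regularized objectives above can be written down in a few lines. This is an illustrative sketch: the toy data, the choice Ω(θ) = ‖θ‖²/2, and all variable names are assumptions, not part of the talk.

```python
import numpy as np

# Toy versions of the two regularized ERM objectives.
rng = np.random.default_rng(0)
n, d = 50, 5
Phi = rng.standard_normal((n, d))        # rows are feature vectors Phi(x_i)
y_reg = Phi @ rng.standard_normal(d)     # real targets for least squares
y_cls = np.where(y_reg > 0, 1.0, -1.0)   # +/-1 labels for logistic regression
mu = 0.1                                 # regularization strength (assumption)

def least_squares_objective(theta):
    # (1/2n) sum_i (y_i - <theta, Phi(x_i)>)^2 + mu * Omega(theta)
    r = y_reg - Phi @ theta
    return (r @ r) / (2 * n) + 0.5 * mu * (theta @ theta)

def logistic_objective(theta):
    # (1/n) sum_i log(1 + exp(-y_i <theta, Phi(x_i)>)) + mu * Omega(theta)
    margins = y_cls * (Phi @ theta)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * mu * (theta @ theta)

theta0 = np.zeros(d)
print(least_squares_objective(theta0), logistic_objective(theta0))
```

At θ = 0 the logistic objective equals log 2, a quick sanity check on the data-fitting term.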

Take home

◮ The problem is formalized as a (convex) optimization problem.

◮ In the large-scale setting, the problem is high-dimensional and has many examples.

6

slide-7
SLIDE 7

Stochastic algorithms for ERM

min_{θ∈ℝᵈ} { R̂(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, ⟨θ, Φ(xᵢ)⟩) }.

  • 1. High dimension d ⟹ first-order algorithms. Gradient Descent (GD): θₖ = θₖ₋₁ − γₖ R̂′(θₖ₋₁). Problem: computing the gradient costs O(dn) per iteration.

  • 2. Large n ⟹ stochastic algorithms: Stochastic Gradient Descent (SGD).

7

slide-8
SLIDE 8

Stochastic Gradient descent

◮ Goal: min_{θ∈ℝᵈ} f(θ), given unbiased gradient estimates f′ₙ.

◮ θ∗ := argmin_{ℝᵈ} f(θ).

[Figure: SGD iterates θ₀, θ₁, … around θ∗]

8


slide-9
SLIDE 9

SGD for ERM: f = ˆ R

Loss for a single pair of observations, for any j ≤ n: f_j(θ) := ℓ(y_j, ⟨θ, Φ(x_j)⟩). One observation at each step ⟹ complexity O(d) per iteration.

For the empirical risk R̂(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, ⟨θ, Φ(xᵢ)⟩):

◮ At each step k ∈ ℕ∗, sample Iₖ ∼ 𝒰{1, …, n}, and use f′_{Iₖ}(θₖ₋₁) = ℓ′(y_{Iₖ}, ⟨θₖ₋₁, Φ(x_{Iₖ})⟩).

◮ With Fₖ = σ((xᵢ, yᵢ)₁≤ᵢ≤ₙ, (Iᵢ)₁≤ᵢ≤ₖ):

    E[f′_{Iₖ}(θₖ₋₁) | Fₖ₋₁] = (1/n) Σᵢ₌₁ⁿ ℓ′(yᵢ, ⟨θₖ₋₁, Φ(xᵢ)⟩) = R̂′(θₖ₋₁).

Mathematical framework: smoothness and/or strong convexity.
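The SGD-for-ERM recipe above fits in a short loop. A minimal sketch for least squares; the data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

# SGD on the empirical risk of least squares: at each step, sample I_k uniformly
# in {1,...,n} and use the single-example gradient, an unbiased estimate of
# R_hat'(theta). Cost: O(d) per iteration instead of O(dn) for full GD.
rng = np.random.default_rng(1)
n, d = 200, 10
Phi = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = Phi @ theta_true + 0.1 * rng.standard_normal(n)

def emp_risk(theta):
    return 0.5 * np.mean((y - Phi @ theta) ** 2)

theta = np.zeros(d)
gamma = 0.05                                    # constant step size (assumption)
for _ in range(5000):
    i = rng.integers(n)                         # I_k ~ Uniform{1,...,n}
    grad_i = -(y[i] - Phi[i] @ theta) * Phi[i]  # gradient of f_i at theta
    theta = theta - gamma * grad_i

print(emp_risk(np.zeros(d)), emp_risk(theta))
```

The final empirical risk is far below the risk at the initial point θ₀ = 0.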

9

slide-10
SLIDE 10

Mathematical framework: Smoothness

◮ A function g : ℝᵈ → ℝ is L-smooth if and only if it is twice differentiable and, ∀θ ∈ ℝᵈ, eigenvalues[g″(θ)] ≤ L.

For all θ, θ′ ∈ ℝᵈ:  g(θ) ≤ g(θ′) + ⟨g′(θ′), θ − θ′⟩ + (L/2)‖θ − θ′‖².

10

slide-11
SLIDE 11

Mathematical framework: Strong Convexity

◮ A twice differentiable function g : ℝᵈ → ℝ is µ-strongly convex if and only if, ∀θ ∈ ℝᵈ, eigenvalues[g″(θ)] ≥ µ.

For all θ, θ′ ∈ ℝᵈ:  g(θ) ≥ g(θ′) + ⟨g′(θ′), θ − θ′⟩ + (µ/2)‖θ − θ′‖².

11

slide-12
SLIDE 12

Application to machine learning

◮ We consider a loss that is almost surely convex in θ. Thus R̂ and R are convex.

◮ The Hessian of R̂ is approximately a covariance matrix:

    R̂″(θ) = (1/n) Σᵢ₌₁ⁿ ℓ″(yᵢ, ⟨θ, Φ(xᵢ)⟩) Φ(xᵢ)Φ(xᵢ)ᵀ  ≈  (1/n) Σᵢ₌₁ⁿ Φ(xᵢ)Φ(xᵢ)ᵀ  (≃ E[Φ(X)Φ(X)ᵀ]).

◮ If ℓ is smooth and E[‖Φ(X)‖²] ≤ r², then R is smooth.

◮ If ℓ is µ-strongly convex and the data has an invertible covariance matrix (low correlation/dimension), then R is strongly convex.

12

slide-13
SLIDE 13

Analysis: behaviour of (θₖ)ₖ≥₀

    θₖ = θₖ₋₁ − γₖ f′ₖ(θₖ₋₁)

Importance of the learning rate (or sequence of step sizes) (γₖ)ₖ≥₀. For smooth and strongly convex problems, traditional analysis (Fabian, 1968; Robbins and Siegmund, 1985) shows that θₖ → θ∗ almost surely if

    Σₖ₌₁^∞ γₖ = ∞  and  Σₖ₌₁^∞ γₖ² < ∞,

and asymptotic normality: √k (θₖ − θ∗) →d 𝒩(0, V), for γₖ = γ₀/k, γ₀ ≥ 1/µ.

◮ The limit variance scales as 1/µ².
◮ Very sensitive to ill-conditioned problems.
◮ µ is generally unknown, so the step size is hard to choose...

13

slide-14
SLIDE 14

Polyak Ruppert averaging

Introduced by Polyak and Juditsky (1992) and Ruppert (1988):

    θ̄ₖ = (1/(k+1)) Σᵢ₌₀ᵏ θᵢ.

[Figure: raw iterates θ₁, …, θₙ and averaged iterates θ̄₁, …, θ̄ₙ around θ∗]

◮ Offline averaging reduces the noise effect.

◮ Online computation: θ̄ₖ₊₁ = (1/(k+2)) θₖ₊₁ + ((k+1)/(k+2)) θ̄ₖ.

◮ One could also consider other averaging schemes (e.g., Lacoste-Julien et al. (2012)).
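The online averaging recursion keeps the running mean of the iterates in O(d) memory. A 1-D sketch; the quadratic toy objective f(t) = t²/2 with additive Gaussian gradient noise is an assumption for illustration:

```python
import numpy as np

# Online Polyak-Ruppert averaging: theta_bar_k is the running mean of
# theta_0, ..., theta_k, updated without storing past iterates.
rng = np.random.default_rng(2)
gamma = 0.1
theta = 5.0
theta_bar = theta                # average of the single iterate theta_0
iterates = [theta]               # kept only to verify the recursion below
for k in range(1, 2001):
    grad = theta + rng.normal()  # noisy gradient of f(t) = t^2/2, minimizer 0
    theta -= gamma * grad
    # running-mean recursion: theta_bar_k = theta_k/(k+1) + k/(k+1)*theta_bar_{k-1}
    theta_bar = theta / (k + 1) + k * theta_bar / (k + 1)
    iterates.append(theta)

print(theta, theta_bar)
```

The recursion reproduces the exact mean of all iterates, and the average sits much closer to θ∗ = 0 than the fluctuating final iterate typically does.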

14


slide-15
SLIDE 15

Convex stochastic approximation: convergence

◮ Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012):

◮ Strongly convex: O((µk)⁻¹), attained by averaged stochastic gradient descent with γₖ ∝ (µk)⁻¹.

◮ Non-strongly convex: O(1/√k), attained by averaged stochastic gradient descent with γₖ ∝ 1/√k.

Smooth strongly convex problems:

◮ Rate O(1/(µk)) for γₖ ∝ 1/√k: adapts to strong convexity.

15

slide-16
SLIDE 16

Convergence rate for f(θ̃ₖ) − f(θ∗), smooth f.

                 min R̂                                          min R
            SGD          GD           SAG                        SGD
Convex      O(1/√k)      O(1/k)       –                          O(1/√k)
Stgly-Cvx   O(1/(µk))    O(e^(−µk))   O((1 − (µ ∧ 1/n))^k)       O(1/(µk))

⊖ A gradient descent update costs n times as much as an SGD update. Can we get the best of both worlds?

16


slide-18
SLIDE 18

Methods for finite sum minimization

◮ GD: at step k, use (1/n) Σᵢ₌₁ⁿ f′ᵢ(θₖ).

◮ SGD: at step k, sample iₖ ∼ 𝒰{1, …, n}, use f′_{iₖ}(θₖ).

◮ SAG: at step k,
  ◮ keep a "full gradient" (1/n) Σᵢ₌₁ⁿ f′ᵢ(θ_{kᵢ}), with θ_{kᵢ} ∈ {θ₁, …, θₖ};
  ◮ sample iₖ ∼ 𝒰{1, …, n}, use (1/n) ( Σᵢ₌₁ⁿ f′ᵢ(θ_{kᵢ}) − f′_{iₖ}(θ_{k_{iₖ}}) + f′_{iₖ}(θₖ) ).

⊕ An update costs the same as an SGD update. ⊖ Needs to store all gradients f′ᵢ(θ_{kᵢ}) at "points in the past".
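The SAG update above can be sketched for a least-squares finite sum. The data, step size, and iteration count are illustrative assumptions, not the tuning from Schmidt et al. (2013):

```python
import numpy as np

# SAG sketch: store the last gradient seen for each index i, refresh one entry
# per step, and move along the average of the stored gradients.
rng = np.random.default_rng(3)
n, d = 100, 5
Phi = rng.standard_normal((n, d))
y = Phi @ rng.standard_normal(d)              # noiseless targets: risk can reach 0

grad_table = np.zeros((n, d))                 # f_i' at "points in the past"
grad_sum = np.zeros(d)                        # sum of table rows, kept incrementally
theta = np.zeros(d)
gamma = 0.02                                  # step size (assumption)
for _ in range(30000):
    i = rng.integers(n)
    g_new = -(y[i] - Phi[i] @ theta) * Phi[i]  # fresh gradient f_i'(theta_k)
    grad_sum += g_new - grad_table[i]          # swap the stale entry for the fresh one
    grad_table[i] = g_new
    theta -= gamma * grad_sum / n              # same O(d) cost per step as SGD

print(0.5 * np.mean((y - Phi @ theta) ** 2))
```

The incremental bookkeeping keeps `grad_sum` equal to the sum of stored gradients, and on this noiseless problem the empirical risk is driven essentially to zero, illustrating the linear rate.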

Some references:

◮ SAG: Schmidt et al. (2013); SAGA: Defazio et al. (2014a).
◮ SVRG: Johnson and Zhang (2013) (reduces the memory cost, but 2 epochs...).
◮ FINITO: Defazio et al. (2014b).
◮ S2GD: Konečný and Richtárik (2013)... and many others. See for example Niao He's lecture notes for a nice overview.

17

slide-19
SLIDE 19

Convergence rate for f(θ̃ₖ) − f(θ∗), smooth objective f.

[Same rate table as above.]

GD, SGD, SAG (Fig. from Schmidt et al. (2013)).

18

slide-20
SLIDE 20

Convergence rate for f(θ̃ₖ) − f(θ∗), smooth objective f.

[Same rate table as above, with a row of lower bounds α, β, γ added.]

α: stochastic optimization information-theoretic lower bounds, Agarwal et al. (2012); β: black-box first-order optimization, Nesterov (2004); γ: lower bounds for optimizing finite sums, Agarwal and Bottou (2014); Arjevani and Shamir (2016).

19

slide-21
SLIDE 21

Convergence rate for f(θ̃ₖ) − f(θ∗), smooth objective f.

[Same rate table as above, with accelerated gradient descent (AGD) in place of GD: convex O(1/k²), strongly convex O(e^(−√µ k)).]

Lower bounds: α: stochastic optimization information-theoretic lower bounds, Agarwal et al. (2012); β: black-box first-order optimization, Nesterov (2004); γ: lower bounds for optimizing finite sums, Agarwal and Bottou (2014).

20

slide-22
SLIDE 22

Take home

Stochastic algorithms for Empirical Risk Minimization.

◮ Several algorithms optimize the empirical risk; the most efficient ones are stochastic and rely on the finite-sum structure.

◮ These are stochastic algorithms used to optimize a deterministic function.

◮ Rates depend on the regularity of the function.

21

slide-23
SLIDE 23

What about the generalization risk?

Generalization guarantees:

◮ Uniform upper bound sup_θ |R̂(θ) − R(θ)| (empirical process theory).

◮ More precise: localized complexities (Bartlett et al., 2002), stability (Bousquet and Elisseeff, 2002).

Problems for ERM:

◮ Choosing the regularization (overfitting risk).
◮ How many iterations (i.e., passes over the data)?
◮ Generalization guarantees are generally of order O(1/√n), so there is no need to optimize precisely.

Two important insights:

  • 1. No need to optimize below the statistical error.
  • 2. The generalization risk is more important than the empirical risk.

SGD can be used to minimize the generalization risk.

22

slide-24
SLIDE 24

SGD for the generalization risk: f = R

SGD: key assumption E[f′ₙ(θₙ₋₁) | Fₙ₋₁] = f′(θₙ₋₁).

For the risk R(θ) = E_ρ[ℓ(Y, ⟨θ, Φ(X)⟩)]:

◮ At step 0 < k ≤ n, use a new point, independent of θₖ₋₁: f′ₖ(θₖ₋₁) = ℓ′(yₖ, ⟨θₖ₋₁, Φ(xₖ)⟩).

◮ For 0 ≤ k ≤ n, Fₖ = σ((xᵢ, yᵢ)₁≤ᵢ≤ₖ):

    E[f′ₖ(θₖ₋₁) | Fₖ₋₁] = E_ρ[ℓ′(yₖ, ⟨θₖ₋₁, Φ(xₖ)⟩) | Fₖ₋₁] = E_ρ[ℓ′(Y, ⟨θₖ₋₁, Φ(X)⟩)] = R′(θₖ₋₁).

◮ Single pass through the data; running time = O(nd).
◮ "Automatic" regularization.
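A single-pass sketch: each step consumes a fresh sample, so the stochastic gradient is unbiased for the generalization risk R itself. The synthetic logistic model standing in for ρ, and all constants, are assumptions:

```python
import numpy as np

# Single-pass averaged SGD for the generalization risk of logistic regression.
rng = np.random.default_rng(4)
d = 5
theta_star = np.ones(d) / np.sqrt(d)     # ground-truth parameter (assumption)

def sample():
    # one fresh draw (x, y) from the synthetic distribution rho, y in {-1, +1}
    x = rng.standard_normal(d)
    p = 1.0 / (1.0 + np.exp(-(x @ theta_star)))
    y = 1.0 if rng.random() < p else -1.0
    return x, y

theta = np.zeros(d)
theta_bar = np.zeros(d)
gamma = 0.1                              # constant step size (assumption)
n = 50_000
for k in range(1, n + 1):
    x, y = sample()                      # new point, independent of theta_{k-1}
    grad = -y * x / (1.0 + np.exp(y * (x @ theta)))  # logistic-loss gradient
    theta -= gamma * grad
    theta_bar += (theta - theta_bar) / k  # online Polyak-Ruppert average

print(np.linalg.norm(theta_bar - theta_star))
```

One pass over n samples costs O(nd) total, and the averaged iterate lands near θ∗ with no explicit regularization.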

23

slide-25
SLIDE 25

SGD for the generalization risk: f = R

ERM minimization                        | Gen. risk minimization
several passes: 0 ≤ k                   | one pass: 0 ≤ k ≤ n
(xᵢ, yᵢ) is Fₜ-measurable for any t     | (xᵢ, yᵢ) is Fₜ-measurable only for t ≥ i

24

slide-26
SLIDE 26

Convergence rate for f(θ̃ₖ) − f(θ∗), smooth objective f.

[Same rate table as above; the ERM columns run over 0 ≤ k, the generalization column over a single pass 0 ≤ k ≤ n.]

Lower bounds: α, β, γ, δ. δ: information-theoretic lower bound, statistical theory (Tsybakov, 2003). The gradient does not even exist.

25

slide-27
SLIDE 27

Convergence rate for f(θ̃ₖ) − f(θ∗), smooth objective f.

[Same rate table as above, with the single-pass SGD column expressed in n: convex O(1/√n), strongly convex O(1/(µn)).]

Lower bounds: α, β, γ, δ. δ: information-theoretic lower bound, statistical theory (Tsybakov, 2003). The gradient is unknown.

25

slide-28
SLIDE 28

Least Mean Squares: rate independent of µ

◮ Least-squares: R(θ) = ½ E[(Y − ⟨Φ(X), θ⟩)²] with θ ∈ ℝᵈ.

◮ SGD = the least-mean-squares (LMS) algorithm.

◮ Usually studied with decreasing step sizes and without averaging.

◮ New analysis for averaging and constant step size γ = 1/(4R²) (Bach and Moulines, 2013).

◮ Assume ‖Φ(xₙ)‖ ≤ R and |yₙ − ⟨Φ(xₙ), θ∗⟩| ≤ σ almost surely.

◮ No assumption regarding the lowest eigenvalues of the Hessian.

◮ Main result:

    E R(θ̄ₙ) − R(θ∗) ≤ 4σ²d/n + ‖θ₀ − θ∗‖²/(γn).

◮ Matches the statistical lower bound (Tsybakov, 2003).
◮ Optimal rate with "large" (constant) step sizes.
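A numerical sketch of the averaged constant-step LMS result, in the spirit of Bach and Moulines (2013). The synthetic data, the feature scaling so that ‖Φ(x)‖² concentrates around 1, and the use of the empirical covariance as R″ are all illustrative assumptions:

```python
import numpy as np

# Averaged LMS with constant step gamma = 1/(4 R^2), here with R^2 taken as 1.
rng = np.random.default_rng(5)
d, n = 10, 100_000
theta_star = rng.standard_normal(d)
sigma = 0.5
X = rng.standard_normal((n, d)) / np.sqrt(d)   # ||x||^2 concentrates around 1
y = X @ theta_star + sigma * rng.standard_normal(n)

gamma = 1.0 / 4.0
theta = np.zeros(d)
theta_bar = np.zeros(d)
for k in range(n):
    x = X[k]
    theta += gamma * (y[k] - x @ theta) * x    # least-mean-squares update
    theta_bar += (theta - theta_bar) / (k + 1) # online Polyak-Ruppert average

H = X.T @ X / n                                # empirical covariance ~ R''(theta)
excess = 0.5 * (theta_bar - theta_star) @ H @ (theta_bar - theta_star)
bound = 4 * sigma**2 * d / n + np.linalg.norm(theta_star) ** 2 / (gamma * n)
print(excess, bound)
```

The measured excess risk falls well inside the slide's bound 4σ²d/n + ‖θ₀ − θ∗‖²/(γn), with no dependence on the smallest Hessian eigenvalue.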

26

slide-29
SLIDE 29

Take home

◮ SGD can be used to minimize the true risk directly.
◮ A stochastic algorithm to minimize an unknown function.
◮ No regularization needed; only one pass.
◮ For least squares, with a constant step size, the optimal rate.

Stochastic approximation beyond least squares?

27


slide-31
SLIDE 31

Beyond finite dimensional Least squares

◮ Beyond parametric models: Non-parametric Stochastic Approximation with Large Step Sizes (Dieuleveut and Bach, 2015).

◮ Improved sampling: Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions (Défossez and Bach, 2015).

◮ Acceleration: Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression (Dieuleveut et al., 2016).

◮ Beyond smoothness and Euclidean geometry: Stochastic Composite Least-Squares Regression with Convergence Rate O(1/n) (Flammarion and Bach, 2017).

◮ General smooth and strongly convex optimization: Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains (Dieuleveut et al., 2017).

28

slide-32
SLIDE 32

Beyond least squares. Logistic regression

min_{θ∈ℝᵈ} E[log(1 + exp(−Y ⟨θ, Φ(X)⟩))].

[Figure: log₁₀(R(θ̄ₙ) − R(θ∗)) vs. log₁₀(n).] Logistic regression. Final iterate (dashed) and averaged recursion (plain).

29

slide-33
SLIDE 33

Beyond least squares. Logistic regression, real data

[Figure: log₁₀(R(θ̄ₙ) − R(θ∗)) vs. log₁₀(n).] Logistic regression, Covertype dataset, n = 581012, d = 54. Comparison between a constant learning rate and a learning rate decaying as 1/√n.

30

slide-34
SLIDE 34

Motivation 2/2. Difference between quadratic and logistic loss

Logistic regression:        E R(θ̄ₙ) − R(θ∗) = O(γ²),  with γ = 1/(4R²).
Least-squares regression:   E R(θ̄ₙ) − R(θ∗) = O(1/n), with γ = 1/(4R²).

31

slide-35
SLIDE 35

SGD: a homogeneous Markov chain

Consider an L-smooth and µ-strongly convex function R. SGD with a constant step size γ > 0 is a homogeneous Markov chain:

    θᵞₖ₊₁ = θᵞₖ − γ (R′(θᵞₖ) + εₖ₊₁(θᵞₖ)),

◮ it satisfies the Markov property;
◮ it is homogeneous, for constant γ and (εₖ)ₖ∈ℕ i.i.d.

Also assume:

◮ R′ₖ = R′ + εₖ₊₁ is almost surely L-co-coercive.
◮ Bounded moments: E[‖εₖ(θ∗)‖⁴] < ∞.

32

slide-36
SLIDE 36

Stochastic gradient descent as a Markov Chain: Analysis framework†

◮ Existence of a limit distribution πᵞ, and linear convergence to this distribution: θᵞₖ →d πᵞ.

◮ Convergence of the second-order moments of the chain: θ̄ᵞₖ →_{L²} θ̄ᵞ := E_{πᵞ}[θ] as k → ∞.

◮ Behavior under the limit distribution (γ → 0): θ̄ᵞ = θ∗ + ? Provable convergence improvement with extrapolation tricks.

†Dieuleveut, Durmus, and Bach (2017). 33

slide-37
SLIDE 37

Existence of a limit distribution

Goal: (θᵞₖ)ₖ≥₀ →d πᵞ.

Theorem. For any γ < L⁻¹, the chain (θᵞₖ)ₖ≥₀ admits a unique stationary distribution πᵞ. In addition, for all θ₀ ∈ ℝᵈ and k ∈ ℕ:

    W₂²(θᵞₖ, πᵞ) ≤ (1 − 2µγ(1 − γL))ᵏ ∫_{ℝᵈ} ‖θ₀ − ϑ‖² dπᵞ(ϑ).

Wasserstein metric: a distance between probability measures.

34

slide-38
SLIDE 38

Behavior under the limit distribution

Ergodic theorem: θ̄ₖ → E_{πᵞ}[θ] =: θ̄ᵞ. Where is θ̄ᵞ?

If θ₀ ∼ πᵞ, then θ₁ ∼ πᵞ, with θᵞ₁ = θᵞ₀ − γ(R′(θᵞ₀) + ε₁(θᵞ₀)). Hence E_{πᵞ}[R′(θ)] = 0.

In the quadratic case (linear gradients), Σ E_{πᵞ}[θ − θ∗] = 0, so θ̄ᵞ = θ∗!

In the general case, using E_{πᵞ}[‖θ − θ∗‖⁴] ≤ Cγ², a Taylor expansion of R′, and iterating this reasoning on higher moments of the chain:

    θ̄ᵞ − θ∗ = γ R″(θ∗)⁻¹ R‴(θ∗) [ (R″(θ∗) ⊗ I + I ⊗ R″(θ∗))⁻¹ E_{πᵞ}[ε(θ)⊗²] ] + O(γ²).

Overall, θ̄ᵞ − θ∗ = γ∆ + O(γ²).

35

slide-39
SLIDE 39

Constant learning rate SGD: convergence in the quadratic case

[Figure: iterates θ₀, θ₁, …, θₙ, their averages, and θ∗]

36


slide-43
SLIDE 43

Behavior under the limit distribution (recap)

The same expansion, with the noise evaluated at θ∗:

    θ̄ᵞ − θ∗ ≃ γ R″(θ∗)⁻¹ R‴(θ∗) [ (R″(θ∗) ⊗ I + I ⊗ R″(θ∗))⁻¹ E_ε[ε(θ∗)⊗²] ].

Overall, θ̄ᵞ − θ∗ = γ∆ + O(γ²).

37

slide-44
SLIDE 44

Constant learning rate SGD: convergence in the non-quadratic case

[Figure: iterates θ₀, θ₁, …, θₙ and their averages; the averaged sequence concentrates near θ̄ᵞ, away from θ∗]

38


slide-48
SLIDE 48

Richardson extrapolation

[Figure: averaged iterates for step sizes γ and 2γ around θ∗]

    θ̄ᵞₙ − θ̄ᵞ = O_p(n^(−1/2)),  θᵞₙ − θ̄ᵞ = O_p(γ^(1/2)),  θ∗ − θ̄ᵞ = O(γ).

Recovering convergence closer to θ∗ by Richardson extrapolation: 2θ̄ᵞₙ − θ̄²ᵞₙ.

39
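The extrapolation can be checked numerically: since θ̄ᵞₙ ≈ θ∗ + γ∆ + O(γ²), the combination 2θ̄ᵞₙ − θ̄²ᵞₙ cancels the first-order bias. The 1-D toy objective with f′(t) = t + 0.5(cos t − 1) (strongly convex, θ∗ = 0, non-zero third derivative at 0) is an illustrative assumption, not the talk's experiment:

```python
import math
import numpy as np

# Averaged constant-step SGD on a non-quadratic 1-D objective, then Richardson.
def averaged_sgd(gamma, n, seed):
    noise = np.random.default_rng(seed).standard_normal(n)
    theta, theta_bar = 1.0, 0.0
    for k in range(n):
        grad = theta + 0.5 * (math.cos(theta) - 1.0) + noise[k]
        theta -= gamma * grad
        theta_bar += (theta - theta_bar) / (k + 1)  # online average
    return theta_bar

n = 500_000
bar_g = averaged_sgd(0.1, n, seed=0)
bar_2g = averaged_sgd(0.2, n, seed=1)
richardson = 2.0 * bar_g - bar_2g   # cancels the gamma * Delta bias term
print(bar_g, bar_2g, richardson)
```

The averages carry an O(γ) bias that roughly doubles from γ to 2γ, while the extrapolated point sits much closer to θ∗ = 0.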


slide-54
SLIDE 54

Experiments: smaller dimension

[Figure: log₁₀[R(θ) − R(θ∗)] vs. log₁₀(n).] Synthetic data, logistic regression, n = 8·10⁶.

40

slide-55
SLIDE 55

Experiments: Double Richardson

[Figure: log₁₀[R(θ) − R(θ∗)] vs. log₁₀(n).] Synthetic data, logistic regression, n = 8·10⁶. "Richardson 3γ": estimator built using Richardson extrapolation on 3 different sequences:

    θ̃³ₙ = (8/3) θ̄ᵞₙ − 2 θ̄²ᵞₙ + (1/3) θ̄⁴ᵞₙ.
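The weights 8/3, −2, 1/3 are exactly those that cancel both the γ and γ² bias terms; a short check (my derivation, under the assumed expansion θ̄ᵞₙ ≈ θ∗ + γ∆ + γ²∆′):

```latex
a\,\bar\theta^{\gamma}_n + b\,\bar\theta^{2\gamma}_n + c\,\bar\theta^{4\gamma}_n
\approx (a+b+c)\,\theta^* + (a+2b+4c)\,\gamma\Delta + (a+4b+16c)\,\gamma^2\Delta' .
% Keep theta* and cancel both bias terms:
a+b+c = 1,\qquad a+2b+4c = 0,\qquad a+4b+16c = 0
\;\Longrightarrow\; a = \tfrac{8}{3},\quad b = -2,\quad c = \tfrac{1}{3}.
```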

41

slide-56
SLIDE 56

Conclusion MC

Take home

◮ Asymptotics sometimes matter less than the first iterations: consider large step sizes.
◮ Constant step size SGD is a homogeneous Markov chain.
◮ The difference between least squares and general smooth losses is intuitive.

For smooth strongly convex losses:

◮ Convergence in terms of the Wasserstein distance.
◮ Decomposition into three sources of error: variance, initial conditions, and "drift".
◮ Detailed analysis of the position of the limit point: the direction does not depend on γ at first order ⟹ extrapolation tricks can help.

42

slide-57
SLIDE 57

Further references

Many stochastic algorithms are not covered in this talk (coordinate descent, online Newton, composite optimization, non-convex learning)...

◮ Good introduction: Francis's lecture notes at Orsay.
◮ Book: Convex Optimization: Algorithms and Complexity, Sébastien Bubeck.

43

slide-58
SLIDE 58

Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. (2012). Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249.

Agarwal, A. and Bottou, L. (2014). A lower bound for the optimization of finite sums. ArXiv e-prints.

Arjevani, Y. and Shamir, O. (2016). Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems 29, pages 3540–3548. Curran Associates, Inc.

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems (NIPS).

Bartlett, P. L., Bousquet, O., and Mendelson, S. (2002). Localized Rademacher complexities, pages 44–58. Springer Berlin Heidelberg.

Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526.

Defazio, A., Bach, F., and Lacoste-Julien, S. (2014a). SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654.

Defazio, A., Domke, J., and Caetano, T. (2014b). Finito: a faster, permutable incremental gradient method for big data problems. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1125–1133.

Défossez, A. and Bach, F. (2015). Averaged least-mean-squares: bias-variance trade-offs and optimal sampling distributions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).

43

slide-59
SLIDE 59

Dieuleveut, A. and Bach, F. (2015). Non-parametric stochastic approximation with large step sizes. Annals of Statistics.

Dieuleveut, A., Durmus, A., and Bach, F. (2017). Bridging the gap between constant step size stochastic gradient descent and Markov chains. ArXiv e-prints.

Dieuleveut, A., Flammarion, N., and Bach, F. (2016). Harder, better, faster, stronger convergence rates for least-squares regression. ArXiv e-prints.

Fabian, V. (1968). On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, pages 1327–1332.

Flammarion, N. and Bach, F. (2017). Stochastic composite least-squares regression with convergence rate O(1/n).

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323.

Konečný, J. and Richtárik, P. (2013). Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666.

Lacoste-Julien, S., Schmidt, M., and Bach, F. (2012). A simpler approach to obtaining an O(1/t) rate for the stochastic projected subgradient method. ArXiv e-prints 1212.2002.

Nemirovsky, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, John Wiley & Sons, Inc., New York.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Springer.

43

slide-60
SLIDE 60

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.

Robbins, H. and Siegmund, D. (1985). A convergence theorem for non-negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer.

Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.

Schmidt, M., Le Roux, N., and Bach, F. (2013). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112.

Tsybakov, A. B. (2003). Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory.

43