SLIDE 1

A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

Nicolas Le Roux (1,2), Mark Schmidt (1) and Francis Bach (1)

(1) Sierra project-team, INRIA - École Normale Supérieure, Paris

(2) Now at Criteo

4/12/12

SLIDE 2

Context : Machine Learning for “Big Data”

Large-scale machine learning : large n, large p

n : number of observations (inputs)
p : number of parameters in the model

SLIDE 3

Context : Machine Learning for “Big Data”

Large-scale machine learning : large n, large p

n : number of observations (inputs)
p : number of parameters in the model

Examples : vision, bioinformatics, speech, language, etc.

Pascal large-scale datasets : n = 5 · 10^5, p = 10^3
ImageNet : n = 10^7
Industrial datasets : n > 10^8, p > 10^7

SLIDE 4

Context : Machine Learning for “Big Data”

Large-scale machine learning : large n, large p

n : number of observations (inputs)
p : number of parameters in the model

Examples : vision, bioinformatics, speech, language, etc.

Pascal large-scale datasets : n = 5 · 10^5, p = 10^3
ImageNet : n = 10^7
Industrial datasets : n > 10^8, p > 10^7

Main computational challenge :

Design algorithms for very large n and p.

SLIDE 5

A standard machine learning optimization problem

We want to minimize the sum of a finite set of smooth functions :

    min_{θ ∈ R^p} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

SLIDE 6

A standard machine learning optimization problem

We want to minimize the sum of a finite set of smooth functions :

    min_{θ ∈ R^p} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

For instance, we may have

    f_i(θ) = log(1 + exp(−y_i x_i^⊤ θ)) + (λ/2) ‖θ‖²

SLIDE 7

A standard machine learning optimization problem

We want to minimize the sum of a finite set of smooth functions :

    min_{θ ∈ R^p} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

For instance, we may have

    f_i(θ) = log(1 + exp(−y_i x_i^⊤ θ)) + (λ/2) ‖θ‖²

We will focus on strongly-convex functions g.
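As a concrete illustration of this objective (not part of the original slides), here is a minimal NumPy sketch of the ℓ2-regularized logistic loss and its average over the training set; the names X, y, lam and theta are illustrative:

```python
import numpy as np

def f_i(theta, x_i, y_i, lam):
    """Loss of one example: log(1 + exp(-y_i * x_i^T theta)) + (lam/2) * ||theta||^2."""
    return np.log1p(np.exp(-y_i * x_i.dot(theta))) + 0.5 * lam * theta.dot(theta)

def g(theta, X, y, lam):
    """g(theta) = (1/n) * sum_i f_i(theta), the finite sum we want to minimize."""
    n = X.shape[0]
    return np.mean([f_i(theta, X[i], y[i], lam) for i in range(n)])
```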

SLIDE 8

Deterministic methods

    min_{θ ∈ R^p} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

Gradient descent updates :

    θ_{k+1} = θ_k − α_k g′(θ_k) = θ_k − (α_k/n) ∑_{i=1}^n f_i′(θ_k)

Iteration cost in O(n)
Linear convergence rate O(C^k)
Fancier methods exist but still in O(n)
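A minimal sketch (not from the slides) of one full-gradient step for the logistic objective above; it touches all n examples, hence the O(n) iteration cost:

```python
import numpy as np

def gradient_descent_step(theta, X, y, lam, alpha):
    """One deterministic step: theta <- theta - alpha * g'(theta), cost O(n p)."""
    n = X.shape[0]
    residual = -y / (1.0 + np.exp(y * X.dot(theta)))   # derivative of log(1 + exp(-y z)) w.r.t. z
    grad = X.T.dot(residual) / n + lam * theta          # average loss gradient + regularizer
    return theta - alpha * grad
```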

SLIDE 9

Stochastic methods

    min_{θ ∈ R^p} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

Stochastic gradient descent updates :

    i(k) ∼ U{1, . . . , n}
    θ_{k+1} = θ_k − α_k f_{i(k)}′(θ_k)

Iteration cost in O(1)
Sublinear convergence rate O(1/k)
Bound on the test error valid for one pass
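For comparison, a sketch of one stochastic gradient step (illustrative, not from the slides); only one randomly chosen example is touched, so the per-iteration cost does not depend on n:

```python
import numpy as np

def sgd_step(theta, X, y, lam, alpha, rng):
    """One stochastic step: theta <- theta - alpha * f'_{i(k)}(theta), cost O(p)."""
    i = rng.integers(X.shape[0])                                  # i(k) ~ U{1, ..., n}
    residual = -y[i] / (1.0 + np.exp(y[i] * X[i].dot(theta)))
    grad_i = residual * X[i] + lam * theta
    return theta - alpha * grad_i
```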

SLIDE 10

Hybrid methods

[Figure : log(excess cost) versus time for the stochastic and deterministic methods]

SLIDE 11

Hybrid methods

Goal = linear rate and O(1) iteration cost.

[Figure : log(excess cost) versus time for the stochastic, deterministic and hybrid methods]

SLIDE 12

Related work - Sublinear convergence rate

Stochastic version of full gradient methods

Schraudolph (1999), Sunehag et al. (2009), Ghadimi and Lan (2010), Martens (2010), Xiao (2010)

Momentum, gradient/iterate averaging

Polyak and Juditsky (1992), Tseng (1998), Nesterov (2009), Xiao (2010), Kushner and Yin (2003), Hazan and Kale (2011), Rakhlin et al. (2012)

None of these methods improve on the O(1/k) rate

SLIDE 13

Related work - Linear convergence rate

Constant step-size SG, accelerated SG

Kesten (1958), Delyon and Juditsky (1993), Nedic and Bertsekas (2000)

Linear convergence but only up to a fixed tolerance

Hybrid methods, incremental average gradient

Bertsekas (1997), Blatt et al. (2007), Friedlander and Schmidt (2012)

Linear rate but iterations make full passes through the data

Stochastic methods in the dual

Shalev-Shwartz and Zhang (2012)

Linear rate but limited choice for the f_i’s

SLIDE 14

Stochastic Average Gradient Method

Full gradient update :

    θ_{k+1} = θ_k − (α_k/n) ∑_{i=1}^n f_i′(θ_k)

SLIDE 15

Stochastic Average Gradient Method

Full gradient update :

    θ_{k+1} = θ_k − (α_k/n) ∑_{i=1}^n f_i′(θ_k)

SLIDE 16

Stochastic Average Gradient Method

Stochastic average gradient update :

    θ_{k+1} = θ_k − (α_k/n) ∑_{i=1}^n y_i^k

Memory : y_i^k = f_i′(θ_{k′}) from the last k′ where i was selected.

Random selection of i(k) from {1, 2, . . . , n}.
Only evaluates f_{i(k)}′(θ_k) on each iteration.

SLIDE 17

Stochastic Average Gradient Method

Stochastic average gradient update :

    θ_{k+1} = θ_k − (α_k/n) ∑_{i=1}^n y_i^k

Memory : y_i^k = f_i′(θ_{k′}) from the last k′ where i was selected.

Random selection of i(k) from {1, 2, . . . , n}.
Only evaluates f_{i(k)}′(θ_k) on each iteration.

Stochastic variant of incremental average gradient [Blatt et al., 2007]
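A minimal sketch of the SAG update described above (illustrative code, not from the slides; it reuses the logistic-loss gradient from the earlier sketches). The per-example gradients y_i are kept in memory and their running sum is maintained incrementally, so each iteration evaluates only one new gradient:

```python
import numpy as np

def sag(X, y, lam, alpha, n_iters, rng):
    """Stochastic Average Gradient on the l2-regularized logistic loss (sketch)."""
    n, p = X.shape
    theta = np.zeros(p)
    memory = np.zeros((n, p))   # y_i^k : last gradient computed for each example (O(np) memory)
    grad_sum = np.zeros(p)      # running value of sum_i y_i^k, so each update stays O(p)
    for _ in range(n_iters):
        i = rng.integers(n)                                           # i(k) ~ U{1, ..., n}
        residual = -y[i] / (1.0 + np.exp(y[i] * X[i].dot(theta)))
        grad_i = residual * X[i] + lam * theta                        # f'_{i(k)}(theta_k)
        grad_sum += grad_i - memory[i]                                # swap the old y_i for the new one
        memory[i] = grad_i
        theta = theta - (alpha / n) * grad_sum
    return theta
```

Maintaining grad_sum incrementally is what keeps the iteration cost at O(p) rather than O(np), even though the update formally averages all n stored gradients.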

SLIDE 18

SAG convergence analysis

Assume each f_i′ is Lipschitz-continuous with constant L and the average g is µ-strongly convex.

With step size α_k = 1/(2nL), SAG has a linear convergence rate.

Linear convergence with iteration cost independent of n.

SLIDE 19

SAG convergence analysis

Assume each f_i′ is Lipschitz-continuous with constant L and the average g is µ-strongly convex.

With step size α_k = 1/(2nL), SAG has a linear convergence rate.

Linear convergence with iteration cost independent of n.

With step size α_k = 1/(2nµ), if n ≥ 8 L/µ then

    E[g(θ_k) − g(θ*)] ≤ C (1 − 1/(8n))^k

Rate is “independent” of the condition number.

Constant error reduction after each pass :

    (1 − 1/(8n))^n ≤ exp(−1/8) ≈ 0.8825
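A short illustration (assumptions noted in the comments, not from the slides) of how these two step sizes could be instantiated for the ℓ2-regularized logistic loss used in the earlier sketches:

```python
import numpy as np

# For that logistic loss, each f_i' is Lipschitz with constant L_i <= ||x_i||^2 / 4 + lam,
# and g is mu-strongly convex with mu >= lam; X and lam are the illustrative names used above.
L_const = 0.25 * np.max(np.sum(X ** 2, axis=1)) + lam
mu_const = lam
n = X.shape[0]

alpha_lipschitz = 1.0 / (2 * n * L_const)   # alpha_k = 1/(2nL): linear rate in all cases
alpha_strong = 1.0 / (2 * n * mu_const)     # alpha_k = 1/(2nmu): used when n >= 8 L/mu
```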

SLIDE 20

Comparison with full gradient methods

Assume L = 100, µ = 0.01 and n = 80000 :

Full gradient has rate (1 − µ/L)^2 = 0.9998
Accelerated gradient has rate (1 − √(µ/L)) = 0.9900
SAG (n iterations) multiplies the error by (1 − 1/(8n))^n = 0.8825
Fastest possible first-order method has rate ((√L − √µ)/(√L + √µ))^2 = 0.9608

SLIDE 21

Comparison with full gradient methods

Assume L = 100, µ = 0.01 and n = 80000 :

Full gradient has rate (1 − µ/L)^2 = 0.9998
Accelerated gradient has rate (1 − √(µ/L)) = 0.9900
SAG (n iterations) multiplies the error by (1 − 1/(8n))^n = 0.8825
Fastest possible first-order method has rate ((√L − √µ)/(√L + √µ))^2 = 0.9608

We beat two lower bounds (with additional assumptions) :

Stochastic gradient bound
Full gradient bound
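A quick numerical check of these constants (illustrative, not from the slides):

```python
import numpy as np

L, mu, n = 100.0, 0.01, 80000
print((1 - mu / L) ** 2)                    # full gradient           ~ 0.9998
print(1 - np.sqrt(mu / L))                  # accelerated gradient    = 0.9900
print((1 - 1 / (8 * n)) ** n)               # SAG, one pass           ~ 0.8825
print(((np.sqrt(L) - np.sqrt(mu)) /
       (np.sqrt(L) + np.sqrt(mu))) ** 2)    # first-order lower bound ~ 0.9608
```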

SLIDE 22

Experiments - Training cost

Quantum dataset (n = 50000, p = 78)
ℓ2-regularized logistic regression

[Figure : objective minus optimum (log scale) versus effective passes, comparing AFG, L-BFGS, pegasos, SAG-C and SAG-LS]

SLIDE 23

Experiments - Training cost

RCV1 dataset (n = 20242, p = 47236)
ℓ2-regularized logistic regression

[Figure : objective minus optimum (log scale) versus effective passes, comparing AFG, L-BFGS, pegasos, SAG-C and SAG-LS]

SLIDE 24

Experiments - Testing cost

Quantum dataset (n = 50000, p = 78)
ℓ2-regularized logistic regression

[Figure : test logistic loss versus effective passes, comparing AFG, L-BFGS, pegasos, SAG-C and SAG-LS]

SLIDE 25

Experiments - Testing cost

RCV1 dataset (n = 20242, p = 47236)
ℓ2-regularized logistic regression

[Figure : test logistic loss versus effective passes, comparing AFG, L-BFGS, pegasos, SAG-C and SAG-LS]

SLIDE 26

Reducing memory requirements

    θ_{k+1} = θ_k − (α_k/n) ∑_{i=1}^n y_i^k

y_i^k is the last gradient computed on datapoint i

Memory requirement : O(np)

Smaller for structured models, e.g., linear models :

If f_i(θ) = ℓ(y_i, x_i^⊤ θ), then f_i′(θ) = ℓ′(y_i, x_i^⊤ θ) x_i

Memory requirement : O(n)

We can also use mini-batches
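A sketch of the O(n)-memory variant for linear models (illustrative, building on the SAG sketch above): since the stored gradient of f_i(θ) = ℓ(y_i, x_i^⊤θ) is always a scalar times x_i, only the scalar ℓ′(y_i, x_i^⊤θ) needs to be kept per example:

```python
import numpy as np

def sag_linear(X, y, lam, alpha, n_iters, rng):
    """SAG for a linear model: store one scalar per example (O(n)) instead of one gradient (O(np))."""
    n, p = X.shape
    theta = np.zeros(p)
    scalars = np.zeros(n)        # stored l'(y_i, x_i^T theta) for each example
    weighted_sum = np.zeros(p)   # running value of sum_i scalars[i] * X[i]
    for _ in range(n_iters):
        i = rng.integers(n)
        new_scalar = -y[i] / (1.0 + np.exp(y[i] * X[i].dot(theta)))  # l' for the logistic loss
        weighted_sum += (new_scalar - scalars[i]) * X[i]
        scalars[i] = new_scalar
        # design choice in this sketch: the regularizer uses the current theta rather than being stored
        theta = theta - alpha * (weighted_sum / n + lam * theta)
    return theta
```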

SLIDE 27

Conclusion and Open Problems

Fast theoretical convergence using the ‘sum’ structure common in applications.

SLIDE 28

Conclusion and Open Problems

Fast theoretical convergence using the ‘sum’ structure common in applications.
Simple algorithm, empirically better than theory predicts.

SLIDE 29

Conclusion and Open Problems

Fast theoretical convergence using the ‘sum’ structure common in applications.
Simple algorithm, empirically better than theory predicts.
Allows line-search and approximate optimality measures.

SLIDE 30

Conclusion and Open Problems

Fast theoretical convergence using the ‘sum’ structure common in applications.
Simple algorithm, empirically better than theory predicts.
Allows line-search and approximate optimality measures.

Open problems :

Large-scale distributed implementation.
Determine a tight convergence rate in all cases.
Apply the method to constrained and non-smooth problems.
Speed up the method using non-uniform sampling and non-Euclidean metric.
