
Stochastic Approximation in Hilbert Spaces

Aymeric DIEULEVEUT

Supervised by Francis BACH

September 28, 2017

Outline

1. Introduction:
   ◮ Supervised Machine Learning
   ◮ Stochastic Approximation
2. Finite dimensional results
3. Infinite dimensional results
4. Beyond quadratic loss: interpretation as a Markov chain.

Supervised Machine Learning: definition & applications

Goal: predict a phenomenon from “explanatory variables”, given a set of observations.

Bio-informatics. Input: DNA/RNA sequence; Output: disease predisposition / drug responsiveness. n ≈ 10 to 10^4, d (e.g., number of bases) ≈ 10^6.

Image classification. Input: handwritten digits / images; Output: digit. n up to 10^9, d (e.g., number of pixels) ≈ 10^6.

“Large scale” learning framework: both the number of examples n and the number of explanatory variables d are large.

Supervised Machine Learning: mathematical framework

Consider an input/output pair (X, Y) ∈ X × Y, with (X, Y) ∼ ρ, an unknown distribution. Y = R (regression) or {−1, 1} (classification).

Goal: find g : X → R such that g(X) is a good prediction for Y.

Measure accuracy with a loss function ℓ : Y × R → R₊: squared loss, logistic loss...

Risk (generalization error): R(g) := E_ρ[ℓ(Y, g(X))].

Parametric case: prediction as a linear function g_θ(X) = ⟨θ, Φ(X)⟩ of features Φ(X) ∈ R^d. Notation: R(θ) := R(g_θ).

Non-parametric case: prediction as a function g ∈ H, for H an infinite-dimensional space.

Empirical Risk minimization (I) - Parametric case

◮ Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n, i.i.d.
◮ Empirical risk (or training error):

   R̂(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩).

◮ First approach: (regularized) empirical risk minimization:

   θ̂ := argmin_{θ∈R^d} R̂(θ) + µΩ(θ)   (data-fitting term + regularizer).

Empirical Risk minimization (II) - Parametric case

◮ For example, least-squares regression:

   min_{θ∈R^d} (1/(2n)) Σ_{i=1}^n (y_i − ⟨θ, Φ(x_i)⟩)² + µΩ(θ),

◮ and logistic regression:

   min_{θ∈R^d} (1/n) Σ_{i=1}^n log(1 + exp(−y_i ⟨θ, Φ(x_i)⟩)) + µΩ(θ).

◮ Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂.

Two important insights for ML [Bottou and Bousquet, 2008]:

1. No need to optimize below the statistical error,
2. The true risk is more important than the empirical risk.

(A code sketch of the regularized least-squares problem follows.)
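As an illustration, here is a minimal sketch of the regularized least-squares ERM above, assuming an ℓ₂ regularizer Ω(θ) = ‖θ‖²/2 and identity features Φ(x) = x; the function name and toy data are illustrative, not from the talk:

```python
import numpy as np

def ridge_erm(X, y, mu):
    """Minimize (1/2n) * sum_i (y_i - <theta, x_i>)^2 + (mu/2) * ||theta||^2.

    The objective is quadratic, so the minimizer has the closed form
    theta = (X^T X / n + mu * I)^{-1} X^T y / n.
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + mu * np.eye(d), X.T @ y / n)

# toy data: linear model plus noise
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
theta_star = rng.standard_normal(5)
y = X @ theta_star + 0.1 * rng.standard_normal(1000)
print(ridge_erm(X, y, mu=1e-3))  # close to theta_star
```

The closed form exists only for the quadratic loss; for logistic regression the minimizer has no closed form, which is one motivation for the iterative stochastic methods that follow.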

Stochastic Approximation

◮ Goal: min_{θ∈R^d} f(θ), given unbiased gradient estimates f′_n.
◮ θ∗ := argmin_{R^d} f(θ).
◮ Key algorithm: Stochastic Gradient Descent (SGD) [Robbins and Monro, 1951]:

   θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}),

◮ where E[f′_n(θ_{n−1}) | F_{n−1}] = f′(θ_{n−1}) for a filtration (F_n)_{n≥0}, and θ_n is F_n-measurable.

Polyak-Ruppert averaging

Introduced by Polyak and Juditsky [1992] and Ruppert [1988]:

   θ̄_n = (1/(n+1)) Σ_{k=0}^n θ_k.

◮ Offline averaging reduces the effect of the noise. (A code sketch follows.)
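A minimal sketch of SGD with Polyak-Ruppert averaging on the least-squares risk, doing a single pass over i.i.d. samples; the step size γ = 1/(4R²) anticipates the theorem later in the talk, and all names are illustrative:

```python
import numpy as np

def averaged_sgd_least_squares(X, y, gamma):
    """One pass of SGD on f(theta) = E[(y - <theta, x>)^2] / 2,
    together with the Polyak-Ruppert average of the iterates."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)  # running average of theta_0 (= 0), ..., theta_k
    for k in range(n):
        xk, yk = X[k], y[k]
        grad = (xk @ theta - yk) * xk  # unbiased stochastic gradient
        theta = theta - gamma * grad
        theta_bar += (theta - theta_bar) / (k + 2)
    return theta, theta_bar

rng = np.random.default_rng(1)
X = rng.standard_normal((10_000, 5))
theta_star = rng.standard_normal(5)
y = X @ theta_star + 0.5 * rng.standard_normal(10_000)
R2 = np.mean(np.sum(X**2, axis=1))  # estimate of R^2 = E||x||^2
_, theta_bar = averaged_sgd_least_squares(X, y, gamma=1 / (4 * R2))
print(np.linalg.norm(theta_bar - theta_star))  # small: averaging damps the noise
```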

Stochastic Approximation (SA) in Machine Learning

Loss for a single pair of observations, for any k ≤ n: f_k(θ) = ℓ(y_k, ⟨θ, Φ(x_k)⟩). SA for the true risk R(θ) = E[ℓ(y_k, ⟨θ, Φ(x_k)⟩)]:

◮ For 0 ≤ k ≤ n, F_k = σ((x_i, y_i)_{1≤i≤k}).
◮ At step 0 < k ≤ n, use a new point, independent of θ_{k−1}:

   f′_k(θ_{k−1}) = ℓ′(y_k, ⟨θ_{k−1}, Φ(x_k)⟩),   so that   E[f′_k(θ_{k−1}) | F_{k−1}] = R′(θ_{k−1}).

Single pass through the data – “automatic” regularization. Central algorithm in the thesis.

Outline: bibliography

a) Non-parametric Stochastic Approximation with Large Step-sizes, A. Dieuleveut and F. Bach, in the Annals of Statistics.

b) Harder, Better, Faster, Stronger Convergence Rates for Least-squares Regression, A. Dieuleveut, N. Flammarion and F. Bach, in the Journal of Machine Learning Research.

c) Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains, A. Dieuleveut, A. Durmus, F. Bach, under submission.

[Table: the three papers placed along two axes (quadratic vs. smooth loss; finite-dimensional (FD) vs. non-parametric), and mapped to Part 1, Part 2 and Part 3 of the talk.]

Outline

1. Introduction.
2. A warm up! Results in finite dimension, (d ≫ n)
   ◮ Averaged stochastic descent: adaptivity
   ◮ Acceleration: two optimal rates
3. Non-parametric stochastic approximation
4. Stochastic approximation as a Markov chain: extension to non-quadratic loss functions.

Behavior of Stochastic Approximation in high dimension

Least-squares regression in finite dimension:

   R(θ) = E_ρ[(⟨θ, Φ(X)⟩ − Y)²].

Let Σ = E[Φ(X)Φ(X)^⊤] ∈ R^{d×d}: for θ∗ the best linear predictor,

   R(θ) − R(θ∗) = ‖Σ^(1/2)(θ − θ∗)‖².

Let R² := E[‖Φ(X)‖²] and σ² := E[(Y − ⟨θ∗, Φ(X)⟩)²].

Consider stochastic gradient descent (a.k.a. Least-Mean-Squares) with averaging.

Theorem

For any γ ≤ 1/(4R²), for any α > 1, for any r ≥ 0, for any n ∈ N,

   E R(θ̄_n) − R(θ∗) ≤ 4σ²γ^(1/α) tr(Σ^(1/α)) / n^(1−1/α) + 4 ‖Σ^(1/2−r)(θ∗ − θ_0)‖² / (γ^(2r) n^min(2r,2)).

Theorem 1†, consequences

Theorem

For any γ ≤ 1/(4R²), for any α > 1, for any r ≥ 0, for any n ∈ N,

   E R(θ̄_n) − R(θ∗) ≤ 4σ²γ^(1/α) tr(Σ^(1/α)) / n^(1−1/α)   [Variance]
                      + 4 ‖Σ^(1/2−r)(θ∗ − θ_0)‖² / (γ^(2r) n^min(2r,2))   [Bias].

Limiting cases of the variance term:

   α = 1:    γσ² tr(Σ)
   α → ∞:   σ²d / n

Limiting cases of the bias term:

   r = 1/2:  ‖θ∗ − θ_0‖² / (γn)
   r = 1:    ‖Σ^(−1/2)(θ∗ − θ_0)‖² / (γ²n²)

◮ Recovers Bach and Moulines [2013].
◮ Improves the asymptotic bias.

†Dieuleveut and Bach [2015].

Theorem 1, consequences

Theorem

For any γ ≤ 1/(4R²), for any n ∈ N,

   E R(θ̄_n) − R(θ∗) ≤ inf_{α>1, r≥0} { 4σ²γ^(1/α) tr(Σ^(1/α)) / n^(1−1/α)   [Variance]   + 4 ‖Σ^(1/2−r)(θ∗ − θ_0)‖² / (γ^(2r) n^min(2r,2))   [Bias] }.

Adaptivity: the bound holds with the infimum over (α, r).

[Figure: upper bound γ^(1/α) tr(Σ^(1/α)) / n^(1−1/α) on the variance term, as a function of α, in the regime d ≫ n.]
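The corner cases of the variance term in the table above can be checked directly (a worked limit, using only that Σ has d non-zero eigenvalues µ_i in finite dimension):

```latex
\frac{4\sigma^2\gamma^{1/\alpha}\operatorname{tr}(\Sigma^{1/\alpha})}{n^{1-1/\alpha}}
\xrightarrow[\alpha\to 1]{} 4\sigma^2\gamma\operatorname{tr}(\Sigma),
\qquad
\frac{4\sigma^2\gamma^{1/\alpha}\operatorname{tr}(\Sigma^{1/\alpha})}{n^{1-1/\alpha}}
\xrightarrow[\alpha\to\infty]{} \frac{4\sigma^2 d}{n},
```

since γ^(1/α) → 1, n^(1−1/α) → n, and tr(Σ^(1/α)) = Σ_i µ_i^(1/α) → #{µ_i > 0} = d as α → ∞.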

Limits to SA performance: two lower bounds

Stochastic Approximation in Supervised ML:

◮ builds an estimator given n observations;

   statistical lower bound: σ²d / n.

◮ approximates the minimum of an (L-smooth) function in n iterations, using first-order information;

   optimization lower bound: L‖θ_0 − θ∗‖² / n². (Here, n = t, the number of iterations.)

Theorem 1, for Av-SGD, gives as upper bound:

   σ²d / n + min{ L‖θ_0 − θ∗‖² / n ; L² ‖Σ^(−1/2)(θ_0 − θ∗)‖² / n² }.

Acceleration†

The optimal rate (for deterministic optimization) is achieved by accelerated gradient descent:

   θ_n = η_{n−1} − γ_n f′(η_{n−1})
   η_n = θ_n + δ_n (θ_n − θ_{n−1}).

Problem: acceleration is sensitive to noise [d’Aspremont, 2008].

Combining SGD, acceleration and averaging,

◮ using extra regularization,
◮ and for an “additive” noise model only,

we achieve both of the optimal rates.

Caveat: the LMS recursion does not provide an additive-noise oracle; we use a different recursion with Σ known.

†Dieuleveut, Flammarion, Bach [2016].

Acceleration and averaging

More precisely, we consider:

   θ_n = ν_{n−1} − γ R′_n(ν_{n−1}) − γλ (ν_{n−1} − θ_0)
   ν_n = θ_n + δ (θ_n − θ_{n−1}).

Theorem

For any γ ≤ 1/(2R²), for δ = 1 and λ = 0,

   E R(θ̄_n) − R(θ∗) ≤ 8σ²d / (n + 1) + 36‖θ_0 − θ∗‖² / (γ(n + 1)²).

Optimal rate from both the statistical and the optimization point of view. (A code sketch of this recursion follows.)
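A minimal sketch of this recursion on least squares, with δ = 1 and λ = 0 as in the theorem, stochastic gradients R′_n(ν) = (⟨ν, x_n⟩ − y_n) x_n, and Polyak-Ruppert averaging of the θ iterates; the function name and setup are illustrative, not from the paper:

```python
import numpy as np

def accelerated_averaged_sgd(X, y, gamma):
    """Averaged accelerated stochastic recursion (delta = 1, lambda = 0):
        theta_n = nu_{n-1} - gamma * R'_n(nu_{n-1})
        nu_n    = theta_n + (theta_n - theta_{n-1})
    Returns the Polyak-Ruppert average of theta_0 (= 0), ..., theta_n."""
    n, d = X.shape
    theta_prev = np.zeros(d)
    nu = np.zeros(d)
    theta_bar = np.zeros(d)
    for k in range(n):
        xk, yk = X[k], y[k]
        grad = (nu @ xk - yk) * xk          # stochastic gradient at nu_{k-1}
        theta = nu - gamma * grad
        nu = theta + (theta - theta_prev)   # momentum step with delta = 1
        theta_bar += (theta - theta_bar) / (k + 2)
        theta_prev = theta
    return theta_bar
```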

Outline

1. Introduction.
2. A warm up! Results in finite dimension, (d ≫ n)
3. Non-parametric stochastic approximation
   ◮ Averaged stochastic descent: statistical rate of convergence
   ◮ Acceleration: improving convergence in ill-conditioned regimes
4. Stochastic approximation as a Markov chain: extension to non-quadratic loss functions.

Non-parametric Random Design Least Squares Regression

Goal: min_g R(g) = E_ρ[(Y − g(X))²]

◮ ρ_X: marginal distribution of X on X,
◮ L²_{ρX}: the set of square-integrable functions w.r.t. ρ_X.

The Bayes predictor minimizes the quadratic risk over L²_{ρX}:

   g_ρ(X) = E[Y | X].

Moreover, for any function g in L²_{ρX}, the excess risk is:

   R(g) − R(g_ρ) = ‖g − g_ρ‖²_{L²ρX}.

For H a space of functions, there exists g_H in the closure of H in L²_{ρX} such that

   R(g_H) = inf_{g∈H} R(g).

Reproducing Kernel Hilbert Space

Definition

A Reproducing Kernel Hilbert Space (RKHS) H is a space of functions from X into R such that there exists a reproducing kernel K : X × X → R satisfying:

◮ For any x ∈ X, H contains the function K_x defined by:

   K_x : X → R,   z ↦ K(x, z).

◮ For any x ∈ X and f ∈ H, the reproducing property holds:

   ⟨K_x, f⟩_H = f(x).

Why are RKHS so nice?

◮ Computation:
   ◮ Linear spaces of functions.
   ◮ Existence of gradients (Hilbert structure).
   ◮ Inner products can be computed thanks to the reproducing property.
   ◮ We only deal with functions in span{K_{x_i}, i = 1, ..., n} (representer theorem).
   The algebraic framework is preserved!

◮ Approximation: many kernels satisfy that the closure of H in L²_{ρX} equals L²_{ρX}: there is no approximation error!

◮ Representation: the feature map X → H, x ↦ K_x, maps points from any set into a linear space, in order to apply a linear method.

Stochastic approximation in the RKHS

As R(g) = E[(⟨g, K_X⟩_H − Y)²], for each pair of observations,

   (⟨g, K_{x_n}⟩_H − y_n) K_{x_n} = (g(x_n) − y_n) K_{x_n}

is an unbiased stochastic gradient of R at g. Consider the stochastic gradient recursion, starting from g_0 ∈ H:

   g_n = g_{n−1} − γ (⟨g_{n−1}, K_{x_n}⟩_H − y_n) K_{x_n},

where γ is the step-size. Thus

   g_n = Σ_{i=1}^n a_i K_{x_i},   with a_n = −γ (g_{n−1}(x_n) − y_n).

With averaging,

   ḡ_n = (1/(n+1)) Σ_{k=0}^n g_k.

Total complexity: O(n²). (A code sketch follows.)
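A minimal sketch of this recursion with averaging, taking a Gaussian kernel on one-dimensional inputs as an example; storing the coefficients a_i and evaluating g_{n−1}(x_n) in O(n) gives the O(n²) total cost mentioned above. All names are illustrative:

```python
import numpy as np

def kernel_sgd(x, y, gamma, bandwidth=1.0):
    """Single-pass SGD in the RKHS of a Gaussian kernel, with averaging.

    g_n = sum_{i<=n} a_i K_{x_i}; evaluating g_{n-1}(x_n) costs O(n),
    hence O(n^2) in total. Returns the coefficients of the averaged
    function g_bar_n = (1/(n+1)) * sum_k g_k (with g_0 = 0).
    """
    n = len(x)
    a = np.zeros(n)       # coefficients of g_n
    a_bar = np.zeros(n)   # coefficients of the averaged iterate
    for k in range(n):
        K_row = np.exp(-(x[:k] - x[k]) ** 2 / (2 * bandwidth**2))
        pred = a[:k] @ K_row            # g_{k-1}(x_k)
        a[k] = -gamma * (pred - y[k])
        a_bar += (a - a_bar) / (k + 2)  # running average of coefficients
    return a_bar
```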

Kernel regression: Analysis

Assume E[K(X, X)] and E[Y²] are finite. Define the covariance operator:

   Σ = E[K_X K_X^⊤].

We make two assumptions:

◮ Capacity condition: eigenvalue decay of Σ.
◮ Source condition: position of g_H w.r.t. the kernel space H.

Σ is a trace-class operator that can be decomposed over its eigenspaces. Its powers Σ^τ, τ > 0, are thus well defined.

Capacity condition (CC)

CC(α): for some α > 1, we assume that tr(Σ^(1/α)) < ∞.

If we denote by (µ_i)_{i∈I} the sequence of non-zero eigenvalues of the operator Σ, in decreasing order, then µ_i = O(i^(−α)).

[Figure: eigenvalue decay of the covariance operator, log₁₀(µ_i) vs. log₁₀(i). Left: Sobolev first-order (min) kernel, ρ_X = U[0, 1] → CC(α = 2). Right: Gaussian kernel, ρ_X = U[−1, 1] → CC(α) for all α ≥ 1.]

(An empirical check follows.)
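These decays can be checked numerically from the spectrum of the normalized kernel matrix, whose eigenvalues approximate those of Σ. A sketch assuming the min kernel K(s, t) = min(s, t) on U[0, 1], as in the left panel (sample size and seed are arbitrary):

```python
import numpy as np

# eigenvalues of the empirical covariance operator: spectrum of K / n
n = 2000
x = np.sort(np.random.default_rng(2).uniform(0, 1, n))
K = np.minimum.outer(x, x)           # min kernel (first-order Sobolev space)
mu = np.linalg.eigvalsh(K / n)[::-1]  # eigenvalues in decreasing order

# slope of log mu_i vs log i should be close to -2, i.e. CC(alpha = 2)
i = np.arange(1, 51)
slope = np.polyfit(np.log(i), np.log(mu[:50]), 1)[0]
print(slope)
```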

Source condition (SC)

Concerning the optimal function g_H, we assume:

SC(r): for some r > 0, g_H ∈ Σ^r(L²_{ρX}), i.e., ‖Σ^(−r) g_H‖_{L²ρX} < ∞.

[Figure: illustration of the position of g_H in the three regimes r < 0.5, r = 0.5, r > 0.5.]

NPSA with large step sizes

Theorem

Assume CC(α) and SC(r). Then for any γ ≤ 1/(4R²),

   E R(ḡ_n) − R(g_H) ≤ 4σ²γ^(1/α) tr(Σ^(1/α)) / n^(1−1/α) + 4 ‖Σ^(−r)(g_H − g_0)‖²_{L²ρX} / (γ^(2r) n^min(2r,2)).

For γ = γ_0 n^((α−2αr−1)/(2αr+1)) and (α−1)/(2α) ≤ r ≤ 1,

   E R(ḡ_n) − R(g_H) ≤ n^(−2αr/(2αr+1)) (4σ² tr(Σ^(1/α)) + 4 ‖Σ^(−r)(g_H − g_0)‖²_{L²ρX}).

◮ Statistically optimal rate [Caponnetto and De Vito, 2007].
◮ Beyond: online setting, minimal assumptions...

(The step-size exponent comes from balancing the two terms; see the check below.)
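A worked version of that balance, with γ = n^(−β) (not spelled out on the slide):

```latex
\frac{\gamma^{1/\alpha}}{n^{1-1/\alpha}} \asymp \frac{1}{\gamma^{2r} n^{2r}}
\;\Longleftrightarrow\;
-\frac{\beta}{\alpha} - 1 + \frac{1}{\alpha} = 2r\beta - 2r
\;\Longleftrightarrow\;
\beta = \frac{2\alpha r + 1 - \alpha}{2\alpha r + 1},
```

and substituting γ = n^(−β) back into either term gives the common order n^(−2αr/(2αr+1)); here min(2r, 2) = 2r since r ≤ 1.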

Optimality regions

[Figure: optimality regions in the (α, r) plane, delimited by the lines r = 1/2, r = 1 and r = (α−1)/(2α), with a saturation zone and a zone where the bias dominates the variance (B > V).]

Optimal rates in the RKHS can be achieved via large step sizes and averaging in many situations.

Acceleration: Reproducing kernel Hilbert space setting

We consider the RKHS setting presented before.

Theorem

Assume CC(α) and SC(r). Then for γ = γ_0 n^(−(4rα+2−α)/(2rα+1)) and λ = 1/(γn²), for r ≥ (α−2)/(2α),

   E R(ḡ_n) − R(g_H) ≤ C_{θ_0,θ∗,Σ} n^(−2αr/(2αr+1)).

[Figure: optimality regions in the (α, r) plane for the accelerated algorithm; the boundary moves from r = (α−1)/(2α) down to r = (α−2)/(2α).]

Least squares: some conclusions

◮ Provide optimal rates of convergence under two assumptions for non-parametric regression in Hilbert spaces, via large step sizes and averaging.
◮ Sheds some light on the finite-dimensional case.
◮ Possible to attain simultaneously the optimal rates from the statistical and the optimization points of view.

Outline

1. Introduction.
2. Non-parametric stochastic approximation
3. Faster rates with acceleration
4. Stochastic approximation as a Markov chain: extension to non-quadratic loss functions.
   ◮ Motivation
   ◮ Assumptions
   ◮ Convergence in Wasserstein distance.

Motivation 1/2. Large step sizes!

[Figure: log₁₀(R(θ̄_n) − R(θ∗)) vs. log₁₀(n), logistic regression; final iterate (dashed) and averaged recursion (plain).]

Motivation 2/2. Difference between quadratic and logistic loss

With γ = 1/(4R²):

   Logistic regression:       E R(θ̄_n) − R(θ∗) = O(γ²)
   Least-squares regression:  E R(θ̄_n) − R(θ∗) = O(1/n)

SGD: a homogeneous Markov chain

Consider an L-smooth and µ-strongly convex function R. SGD with a step-size γ > 0 is a homogeneous Markov chain:

   θ^γ_{k+1} = θ^γ_k − γ (R′(θ^γ_k) + ε_{k+1}(θ^γ_k)),

which
◮ satisfies the Markov property,
◮ is homogeneous, for γ constant and (ε_k)_{k∈N} i.i.d.

Also assume:
◮ R′_k = R′ + ε_{k+1} is almost surely L-co-coercive,
◮ bounded moments: E[‖ε_k(θ∗)‖⁴] < ∞.

Stochastic gradient descent as a Markov Chain: Analysis framework†

◮ Existence of a limit distribution π_γ, and linear convergence to this distribution: θ^γ_n → π_γ in distribution.
◮ Convergence of the second-order moments of the chain: θ̄_{n,γ} → θ̄_γ := E_{πγ}[θ] in L² as n → ∞.
◮ Behavior under the limit distribution (γ → 0): θ̄_γ = θ∗ + ?
◮ Provable convergence improvement with extrapolation tricks.

†Dieuleveut, Durmus, Bach [2017].

slide-93
SLIDE 93

Stochastic gradient descent as a Markov Chain: Analysis framework†

◮ Existence of a limit distribution πγ, and linear convergence to

this distribution: θγ

n d

→ πγ.

◮ Convergence of second order moments of the chain,

¯ θn,γ

L2

− →

n→∞

¯ θγ := Eπγ [θ] .

◮ Behavior under the limit distribution (γ → 0): ¯

θγ=θ∗ + ?. Provable convergence improvement with extrapolation tricks.

†Dieuleveut, Durmus, Bach [2017]. 33

slide-94
SLIDE 94

Existence of a limit distribution γ → 0

Goal: (θγ

n)n≥0 d

→ πγ .

Theorem

For any γ < L−1, the chain (θγ

n)n≥0 admits a unique stationary

distribution πγ. In addition for all θ0 ∈ Rd, n ∈ N: W 2

2 (θγ n, πγ) ≤ (1 − 2µγ(1 − γL))n

  • Rd θ0 − ϑ2 dπγ(ϑ) .

34

slide-95
SLIDE 95

Existence of a limit distribution γ → 0

Goal: (θγ

n)n≥0 d

→ πγ .

Theorem

For any γ < L−1, the chain (θγ

n)n≥0 admits a unique stationary

distribution πγ. In addition for all θ0 ∈ Rd, n ∈ N: W 2

2 (θγ n, πγ) ≤ (1 − 2µγ(1 − γL))n

  • Rd θ0 − ϑ2 dπγ(ϑ) .

Wasserstein metric: distance between probability measures.

34

Behavior under the limit distribution

Ergodic theorem: θ̄_n → E_{πγ}[θ] =: θ̄_γ. Where is θ̄_γ?

If θ_0 ∼ π_γ, then θ_1 ∼ π_γ, with

   θ^γ_1 = θ^γ_0 − γ (R′(θ^γ_0) + ε_1(θ^γ_0)).

Taking expectations on both sides gives E_{πγ}[R′(θ)] = 0 (see the worked step below).

In the quadratic case (linear gradients), this reads Σ E_{πγ}[θ − θ∗] = 0: θ̄_γ = θ∗!

In the general case, using E_{πγ}[‖θ − θ∗‖⁴] ≤ Cγ², a Taylor expansion of R′, and iterating this reasoning on the higher moments of the chain:

   θ̄_γ − θ∗ = γ R″(θ∗)⁻¹ R‴(θ∗) [(R″(θ∗) ⊗ I + I ⊗ R″(θ∗))⁻¹ E_{πγ}[ε(θ)^⊗2]] + O(γ²).

Overall, θ̄_γ − θ∗ = γΔ + O(γ²).
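The identity E_{πγ}[R′(θ)] = 0 is a one-line stationarity computation, spelled out here; it uses only θ_1 ∼ θ_0 ∼ π_γ and the unbiasedness of the noise:

```latex
\mathbb{E}_{\pi_\gamma}[\theta_1]
 = \mathbb{E}_{\pi_\gamma}[\theta_0]
 - \gamma\,\mathbb{E}_{\pi_\gamma}\!\left[R'(\theta_0)\right]
 - \gamma\,\underbrace{\mathbb{E}\!\left[\mathbb{E}[\varepsilon_1(\theta_0)\mid \theta_0]\right]}_{=0},
\qquad
\mathbb{E}_{\pi_\gamma}[\theta_1]=\mathbb{E}_{\pi_\gamma}[\theta_0]
\;\Longrightarrow\;
\mathbb{E}_{\pi_\gamma}\!\left[R'(\theta)\right]=0 .
```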

Constant learning rate SGD: convergence in the quadratic case

[Figure: iterates θ_0, θ_1, ..., θ_n of constant-step SGD and their averages θ̄_1, θ̄_2, ..., θ̄_n; in the quadratic case the averaged sequence converges to θ∗.]


Constant learning rate SGD: convergence in the non-quadratic case

[Figure: iterates of constant-step SGD in the non-quadratic case; the averaged sequence converges to θ̄_γ, which is at distance O(γ) from θ∗.]
slide-108
SLIDE 108

Richardson extrapolation

θ∗ ¯ θγ

n − ¯

θγ = Op(n−1=2) ¯ θ1 θn θγ θn θ0 θγ

n − ¯

θγ = Op(γ1=2) θ∗ − ¯ θγ = O(γ) Recovering convergence closer to θ∗ by Richardson extrapolation 2¯ θn,γ − ¯ θn,2γ

39

slide-109
SLIDE 109

Richardson extrapolation

θ∗ ¯ θγ

n − ¯

θγ = Op(n−1=2) ¯ θ1 θn θγ θn θ0 θγ

n − ¯

θγ = Op(γ1=2) θ∗ − ¯ θγ = O(γ) Recovering convergence closer to θ∗ by Richardson extrapolation 2¯ θn,γ − ¯ θn,2γ

39

slide-110
SLIDE 110

Richardson extrapolation

θ∗ ¯ θγ

n − ¯

θγ = Op(n−1=2) ¯ θ1 θn θγ θn θ0 θγ

n − ¯

θγ = Op(γ1=2) θ∗ − ¯ θγ = O(γ) Recovering convergence closer to θ∗ by Richardson extrapolation 2¯ θn,γ − ¯ θn,2γ

39

slide-111
SLIDE 111

Richardson extrapolation

θ∗ ¯ θγ

n − ¯

θγ = Op(n−1=2) ¯ θ1 θn θγ θn θ0 θγ

n − ¯

θγ = Op(γ1=2) θ∗ − ¯ θγ = O(γ) Recovering convergence closer to θ∗ by Richardson extrapolation 2¯ θn,γ − ¯ θn,2γ

39

slide-112
SLIDE 112

Richardson extrapolation

θ∗ ¯ θγ

n − ¯

θγ = Op(n−1=2) ¯ θ1 θn θγ θn θ0 θγ

n − ¯

θγ = Op(γ1=2) θ∗ − ¯ θγ = O(γ) Recovering convergence closer to θ∗ by Richardson extrapolation 2¯ θn,γ − ¯ θn,2γ

39

slide-113
SLIDE 113

Richardson extrapolation

θ∗ ¯ θγ

n − ¯

θγ = Op(n−1=2) ¯ θ1 θn θγ θn θ0 θγ

n − ¯

θγ = Op(γ1=2) θ∗ − ¯ θγ = O(γ) Recovering convergence closer to θ∗ by Richardson extrapolation 2¯ θn,γ − ¯ θn,2γ

39
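A minimal numerical sketch of the extrapolation: averaged constant-step SGD on a one-dimensional, well-specified logistic model, run at steps γ and 2γ (all names and constants are illustrative). Since θ̄_γ − θ∗ = γΔ + O(γ²), the combination 2θ̄_{n,γ} − θ̄_{n,2γ} cancels the first-order term:

```python
import numpy as np

def averaged_sgd(gamma, n, rng):
    """Averaged constant-step SGD on R(theta) = E[log(1 + exp(-y*theta*x))]."""
    theta, theta_bar = 0.0, 0.0
    for k in range(n):
        x = rng.standard_normal()
        # well-specified logistic model with theta* = 2
        y = 1.0 if rng.random() < 1 / (1 + np.exp(-2.0 * x)) else -1.0
        grad = -y * x / (1 + np.exp(y * theta * x))  # stochastic gradient
        theta -= gamma * grad
        theta_bar += (theta - theta_bar) / (k + 2)
    return theta_bar

rng = np.random.default_rng(3)
t1 = averaged_sgd(gamma=0.5, n=200_000, rng=rng)
t2 = averaged_sgd(gamma=1.0, n=200_000, rng=rng)
print(t1, 2 * t1 - t2)  # the extrapolated estimate is closer to theta* = 2
```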

Experiments: smaller dimension

[Figure: log₁₀(R(θ) − R(θ∗)) vs. log₁₀(n); synthetic data, logistic regression, n = 8·10⁶.]
slide-115
SLIDE 115

Experiments: Double Richardson

log10 [R(θ) − R(θ∗)] log10(n) Synthetic data, logistic regression, n = 8.106 “Richardson 3γ”: estimator built using Richardson on 3 different sequences: ˜ θ3

n = 8 3 ¯

θn,γ − 2¯ θn,2γ + 1

3 ¯

θn,4γ

41
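These coefficients are exactly the ones that cancel both the γ and γ² terms in the expansion θ̄_γ = θ∗ + γΔ + γ²Δ′ + O(γ³) (a quick check; Δ′ denotes the hypothetical second-order coefficient, not named on the slide):

```latex
\tfrac{8}{3} - 2 + \tfrac{1}{3} = 1,
\qquad
\tfrac{8}{3}\gamma - 2\,(2\gamma) + \tfrac{1}{3}(4\gamma) = 0,
\qquad
\tfrac{8}{3}\gamma^2 - 2\,(2\gamma)^2 + \tfrac{1}{3}(4\gamma)^2
 = \left(\tfrac{8}{3} - 8 + \tfrac{16}{3}\right)\gamma^2 = 0 .
```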

Conclusion MC

Take-home message:

◮ Precise description of the convergence in terms of the Wasserstein distance.
◮ Decomposition into three sources of error: variance, initial conditions, and “drift”.
◮ Detailed analysis of the position of the limit point: the direction does not depend on γ at first order.
◮ Extrapolation tricks can help.
◮ Beyond: new error decomposition (link with diffusions), ...

Open directions

◮ Markov chains, beyond strong convexity.
◮ Adaptivity for non-parametric regression.
◮ Complexity of non-parametric regression: stochastic gradient descent and random features.
◮ Density estimation.

References

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems (NIPS), 2013.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

A. d’Aspremont. Smooth optimization with approximate gradient. SIAM J. Optim., 19(3):1171–1183, 2008.

A. Dieuleveut and F. Bach. Non-parametric stochastic approximation with large step sizes. Annals of Statistics, 2015.

S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) rate for the stochastic projected subgradient method. ArXiv e-prints 1212.2002, 2012.

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

P. Tarrès and Y. Yao. Online learning as stochastic approximation of regularization paths. IEEE Transactions on Information Theory, (99):5716–5735, 2011.

Y. Ying and M. Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 2008.