SLIDE 1

Exponential convergence of testing error for stochastic gradient methods

Loucas Pillaud-Vivien

INRIA - École Normale Supérieure, Paris, France

Joint work with Alessandro Rudi and Francis Bach
COLT, July 2018

SLIDE 2

Stochastic Gradient Descent

Minimizes a function F given unbiased estimates of its gradients: g_k = g_{k−1} − γ_k ∇F_k(g_{k−1}).

SLIDE 3

Stochastic Gradient Descent

Minimizes a function F given unbiased estimates of its gradients: g_k = g_{k−1} − γ_k ∇F_k(g_{k−1}).

A workhorse in Machine Learning:

◮ n input-output samples (x_i, y_i)_{i=1..n}.
◮ One observation at each step → complexity O(d) per iteration (a minimal sketch follows).
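As a minimal illustration of the per-iteration cost, here is a hedged sketch of one SGD step for least-squares regression on a single sample; the O(d) cost comes from one inner product and one vector update. All names and the toy data are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def sgd_step(g, x_i, y_i, gamma):
    """One stochastic gradient step for F_i(g) = 0.5 * (<g, x_i> - y_i)^2,
    i.e. g_k = g_{k-1} - gamma * grad F_k(g_{k-1}).
    Cost is O(d): one inner product and one scaled vector addition."""
    residual = np.dot(g, x_i) - y_i       # scalar, O(d)
    return g - gamma * residual * x_i     # unbiased gradient step, O(d)

# Toy usage: one pass over n samples, one observation per step.
rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
g = np.zeros(d)
for k in range(n):
    g = sgd_step(g, X[k], y[k], gamma=0.01)
```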

SLIDE 4

Stochastic Gradient Descent

Minimizes a function F given unbiased estimates of its gradients: g_k = g_{k−1} − γ_k ∇F_k(g_{k−1}).

A workhorse in Machine Learning:

◮ n input-output samples (x_i, y_i)_{i=1..n}.
◮ One observation at each step → complexity O(d) per iteration.

Regression problems: best convergence rates O(1/√n) or O(1/n) (Nemirovski and Yudin, 1983; Polyak and Juditsky, 1992).

SLIDE 5

Stochastic Gradient Descent

◮ Regression problems: best convergence rates O(1/√n) or O(1/n).
◮ Can it be faster for classification problems?

SLIDE 6

Stochastic Gradient Descent

◮ Regression problems: best convergence rates O(1/√n) or O(1/n).
◮ Can it be faster for classification problems?

Take home message: Yes, SGD converges exponentially fast in classification error with some margin condition.

SLIDE 7

Binary classification: problem setting

◮ Data: (x, y) ∈ X × {−1, 1} distributed according to ρ.
◮ Prediction: ŷ = sign g(x), with g(x) = ⟨g, φ(x)⟩_H.
◮ Aim: minimize over g ∈ H the 0-1 error
F_01(g) = E ℓ_01(y, g(x)) = E 1_{y g(x) < 0}.

[Figure: plot of the 0-1, square, hinge, and logistic losses.]
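To make these objects concrete, here is a small hedged sketch of the sign prediction ŷ = sign g(x) and the empirical 0-1 error E 1_{y g(x) < 0}. The feature map φ is taken to be the identity for simplicity, and all names and data are illustrative, not from the talk.

```python
import numpy as np

def zero_one_error(g, X, y):
    """Empirical 0-1 error: fraction of samples with y * g(x) < 0,
    where g(x) = <g, phi(x)> and phi is the identity for simplicity."""
    scores = X @ g                 # g(x_i) = <g, x_i>
    return np.mean(y * scores < 0)

# Toy usage with labels in {-1, +1}.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
g_true = rng.standard_normal(5)
y = np.sign(X @ g_true)
g_hat = g_true + 0.3 * rng.standard_normal(5)   # a perturbed predictor
print(zero_one_error(g_hat, X, y))
```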

SLIDE 8

Binary classification: problem setting

◮ Data: (x, y) ∈ X × {−1, 1} distributed according to ρ.
◮ Prediction: ŷ = sign g(x), with g(x) = ⟨g, φ(x)⟩_H.
◮ Aim: minimize over g ∈ H the 0-1 error
F_01(g) = E ℓ_01(y, g(x)) = E 1_{y g(x) < 0}.

From error to losses: as ℓ_01 is non-convex, we use the square loss.

[Figure: 0-1, square, hinge, and logistic losses (same plot as above).]

SLIDE 9

Binary classification: problem setting

◮ Data: (x, y) ∈ X × {−1, 1} distributed according to ρ.
◮ Prediction: ŷ = sign g(x), with g(x) = ⟨g, φ(x)⟩_H.
◮ Aim: minimize over g ∈ H the 0-1 error
F_01(g) = E ℓ_01(y, g(x)) = E 1_{y g(x) < 0}.

From error to losses: as ℓ_01 is non-convex, we use the square loss.

◮ Square loss: F(g) = E ℓ(y, g(x)) = E (y − g(x))², minimized by g_*(x) = E(y|x).
◮ Ridge regression: F_λ(g) = E (y − g(x))² + λ ‖g‖²_H, minimized by g_λ (a kernel ridge sketch follows below).
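Since g_λ lives in the RKHS H, here is a hedged sketch of its empirical counterpart: closed-form kernel ridge regression with a Gaussian kernel. The kernel choice, the bandwidth, the regularization value, and all names are assumptions made for illustration, not the setup used in the talk.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=0.2):
    """Gaussian (RBF) kernel matrix k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def kernel_ridge(X, y, lam, bandwidth=0.2):
    """Empirical minimizer of (1/n) sum_i (y_i - g(x_i))^2 + lam * ||g||_H^2.
    By the representer theorem, g_lam(.) = sum_i alpha_i k(x_i, .),
    with alpha = (K + n * lam * I)^{-1} y."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, bandwidth)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    def g_lam(X_test):
        return gaussian_kernel(X_test, X, bandwidth) @ alpha
    return g_lam

# Toy usage on X = [0, 1] with labels in {-1, +1}.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(100, 1))
y = np.where(X[:, 0] > 0.5, 1.0, -1.0)
g_lam = kernel_ridge(X, y, lam=1e-3)
print(g_lam(np.array([[0.25], [0.75]])))
```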

SLIDE 10

Binary classification: problem setting

◮ Data: (x, y) ∈ X × {−1, 1} distributed according to ρ.
◮ Prediction: ŷ = sign g(x), with g(x) = ⟨g, φ(x)⟩_H.
◮ Aim: minimize over g ∈ H the 0-1 error
F_01(g) = E ℓ_01(y, g(x)) = E 1_{y g(x) < 0}.

From error to losses: as ℓ_01 is non-convex, we use the square loss.

◮ Excess error and excess loss (Bartlett et al., 2006):
  • Excess error: E ℓ_01(y, g(x)) − ℓ_01^*
  • Excess loss: E (y − g(x))² − ℓ^*

If we use existing results for SGD: E ℓ_01(y, g(x)) − ℓ_01^* ≲ 1/√(λn) → not exponential.
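For reference, a hedged restatement (up to constants) of the standard comparison inequality for the square loss from Bartlett et al. (2006), which is what limits the rate obtained by plugging in generic SGD bounds; this is the textbook form of the result, not necessarily the exact constant used in the paper.

```latex
% Comparison inequality for the square loss (Bartlett et al., 2006), up to constants:
% the excess 0-1 error is controlled by the square root of the excess square loss.
\[
  \mathbb{E}\,\ell_{01}\big(y, g(x)\big) - \ell_{01}^{*}
  \;\lesssim\;
  \Big( \mathbb{E}\,\big(y - g(x)\big)^{2} - \ell^{*} \Big)^{1/2}.
\]
% Hence an O(1/(\lambda n)) bound on the excess square loss only yields an
% O(1/\sqrt{\lambda n}) bound on the excess 0-1 error: polynomial, not exponential.
```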

SLIDE 11

Main assumptions

Margin condition (Mammen and Tsybakov, 1999)

◮ Hard inputs to predict: P(y = 1|x) = 1/2, i.e., E(y|x) = 0.
◮ Easy inputs to predict: P(y = 1|x) ∈ {0, 1}, i.e., |E(y|x)| = 1.

→ Margin condition: ∃ δ > 0 such that |E(y|x)| ≥ δ for all x ∈ supp(ρ_X) (a numerical check is sketched below).
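A small numerical sketch of the margin condition, with illustrative names and an illustrative conditional probability: for y ∈ {−1, 1}, E(y|x) = 2 P(y = 1|x) − 1, so the condition |E(y|x)| ≥ δ simply says that P(y = 1|x) stays away from 1/2 on the support of ρ_X.

```python
import numpy as np

def conditional_mean(p_y1):
    """E(y|x) for labels in {-1, +1}: E(y|x) = 2 * P(y=1|x) - 1."""
    return 2.0 * p_y1 - 1.0

def satisfies_margin(p_y1_grid, delta):
    """Check the margin condition |E(y|x)| >= delta on a grid approximating supp(rho_X)."""
    return np.min(np.abs(conditional_mean(p_y1_grid))) >= delta

# Toy example on X = [0, 1]: P(y=1|x) jumps from 0.2 to 0.8 at x = 0.5,
# so |E(y|x)| = 0.6 everywhere and the margin condition holds with delta = 0.6.
xs = np.linspace(0.0, 1.0, 1001)
p_y1 = np.where(xs < 0.5, 0.2, 0.8)
print(satisfies_margin(p_y1, delta=0.6))   # True
print(satisfies_margin(p_y1, delta=0.7))   # False
```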

SLIDE 12

Main assumptions

◮ (A1) Margin condition: ∃ δ > 0 s.t. |E(y|x)| ≥ δ for all x ∈ supp(ρ_X).
◮ (A2) Technical condition: ∃ λ > 0 s.t. sign(E(y|x)) g_λ(x) ≥ δ/2 for all x ∈ supp(ρ_X).

Consequence: for any ĝ such that ‖g_λ − ĝ‖_{L∞} < δ/2, we have sign ĝ(x) = sign(E(y|x)) (indeed, sign(E(y|x)) ĝ(x) ≥ δ/2 − ‖g_λ − ĝ‖_{L∞} > 0).

SLIDE 13

Main assumptions

◮ (A1) Margin condition: ∃ δ > 0 s.t. |E(y|x)| ≥ δ for all x ∈ supp(ρ_X).
◮ (A2) Technical condition: ∃ λ > 0 s.t. sign(E(y|x)) g_λ(x) ≥ δ/2 for all x ∈ supp(ρ_X).

Single-pass SGD through the data on the regularized problem:

g_n = g_{n−1} − γ [(⟨φ(x_n), g_{n−1}⟩ − y_n) φ(x_n) + λ (g_{n−1} − g_0)].

Take the tail-averaged estimator (Jain et al., 2016):

ḡ_n^tail = (2/n) Σ_{i=n/2}^{n} g_i.
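Below is a hedged sketch of this single-pass, regularized SGD recursion with tail averaging, written with a linear feature map φ(x) = x for simplicity. The constant step size, the toy data, and all names are illustrative assumptions, not the exact experimental setup of the paper.

```python
import numpy as np

def regularized_sgd_tail_average(X, y, gamma, lam, g0=None):
    """Single pass of SGD on the regularized least-squares problem:
        g_n = g_{n-1} - gamma * [ (<phi(x_n), g_{n-1}> - y_n) * phi(x_n)
                                   + lam * (g_{n-1} - g_0) ],
    with phi(x) = x, followed by tail averaging over the last n/2 iterates."""
    n, d = X.shape
    g = np.zeros(d) if g0 is None else g0.copy()
    g0 = g.copy()
    tail_sum = np.zeros(d)
    tail_count = 0
    for k in range(n):
        residual = X[k] @ g - y[k]
        g = g - gamma * (residual * X[k] + lam * (g - g0))
        if k >= n // 2:               # keep only the tail of the trajectory
            tail_sum += g
            tail_count += 1
    return tail_sum / tail_count      # tail-averaged estimator

# Toy usage: separable data with labels in {-1, +1}.
rng = np.random.default_rng(3)
n, d = 2000, 10
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
y = np.sign(X @ w)
g_tail = regularized_sgd_tail_average(X, y, gamma=0.01, lam=1e-3)
train_error = np.mean(y * (X @ g_tail) < 0)   # training 0-1 error as a sanity check
print(train_error)
```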

Theorem (exponential convergence of SGD for the test error). Assume n ≥ (1/(λγ)) log(R/δ) and γ ≤ 1/(4R²). Then

E_{x_1,...,x_n} E ℓ_01(y, ḡ_n^tail(x)) − ℓ_01^* ≤ 4 exp(−λ² δ² n / R²).
SLIDE 14

Main result

Theorem (exponential convergence of SGD for the test error). Assume n ≥ (1/(λγ)) log(R/δ) and γ ≤ 1/(4R²). Then

E_{x_1,...,x_n} E ℓ_01(y, ḡ_n^tail(x)) − ℓ_01^* ≤ 4 exp(−λ² δ² n / R²).

◮ Main tool for the proof: a high-probability bound in ‖·‖_{L∞} for the SGD recursion (a result of independent interest).
◮ The excess testing loss is not exponentially convergent.

SLIDE 15

Main result

Theorem (exponential convergence of SGD for the test error). Assume n ≥ (1/(λγ)) log(R/δ) and γ ≤ 1/(4R²). Then

E_{x_1,...,x_n} E ℓ_01(y, ḡ_n^tail(x)) − ℓ_01^* ≤ 4 exp(−λ² δ² n / R²).

◮ Main tool for the proof: a high-probability bound in ‖·‖_{L∞} for the SGD recursion (a result of independent interest).
◮ The excess testing loss is not exponentially convergent.
◮ Reasons to look at the paper and come to the poster session:
  ◮ Sharper bounds
  ◮ A high-probability bound in ‖·‖_{L∞} (usually ‖·‖_{L2}) for the SGD recursion
  ◮ Bounds for regular (full) averaging
  ◮ Bounds under a general low-noise condition

SLIDE 16

Synthetic experiments

◮ Comparing test/train losses and errors for tail-averaged SGD (X = [0, 1], H a Sobolev space).

[Figure: (a) excess losses, (b) excess errors.]

SLIDE 17

Conclusion

Take home message:

◮ Exponential convergence of the test error, but not of the test loss
◮ Importance of the margin condition
◮ High-probability bound for the averaged and regularized SGD

SLIDE 18

Conclusion

Take home message:

◮ Exponential convergence of the test error, but not of the test loss
◮ Importance of the margin condition
◮ High-probability bound for the averaged and regularized SGD

Possible extensions:

◮ No regularization
◮ Study the effect of the regularity of the problem
◮ Beyond least-squares

SLIDE 19

Thank you for your attention! Come see us at the poster session!
