Exponential convergence of testing error for stochastic gradient methods
Loucas Pillaud-Vivien
INRIA - École Normale Supérieure, Paris, France
Joint work with Alessandro Rudi and Francis Bach
COLT - July 2018
Stochastic Gradient
◮ n input-output samples (x_i, y_i), i = 1, …, n
◮ One observation at each step (recursion reconstructed below)
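The recursion itself did not survive the transcript; a hedged reconstruction from the paper's setting (regularized least-squares SGD in an RKHS H, notation mine) is:

```latex
% Hedged reconstruction, not legible in this transcript: SGD on the
% regularized square loss in an RKHS H with kernel feature K_x,
% step size \gamma, regularization \lambda, started at g_0 = 0:
g_i \;=\; g_{i-1} \;-\; \gamma \left[ \big( g_{i-1}(x_i) - y_i \big)\, K_{x_i}
          \;+\; \lambda\, g_{i-1} \right]
```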
◮ Regression problems: best convergence rates O(1/√n) or O(1/n)
◮ Can it be faster for classification problems?
◮ Data: (x, y) ∈ X × {−1, 1} distributed according to ρ
◮ Prediction: ŷ = sign(g(x))
◮ Aim: minimize over g ∈ H the error R(g) = P(sign(g(x)) ≠ y)
[Figure: the 0-1 loss and its convex surrogates (square, hinge, logistic) as functions of the margin y g(x); a sketch to reproduce it follows.]
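A minimal sketch to reproduce the lost figure (axis ranges inferred from the residue; the logistic normalization is a guess):

```python
import numpy as np
import matplotlib.pyplot as plt

# Losses as functions of the margin u = y * g(x):
# the 0-1 loss and three classical convex surrogates.
u = np.linspace(-3, 3, 500)
losses = {
    "0-1":      (u < 0).astype(float),
    "square":   (1 - u) ** 2,          # (y - g)^2 = (1 - u)^2 for y in {-1, 1}
    "hinge":    np.maximum(0, 1 - u),
    "logistic": np.log(1 + np.exp(-u)),
}

for name, values in losses.items():
    plt.plot(u, values, label=name)
plt.ylim(0, 4)
plt.xlabel("margin y g(x)")
plt.legend()
plt.show()
```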
◮ Square loss: F(g) = E ℓ(y, g(x)) = E (y − g(x))², minimum for g_ρ(x) = E(y|x)
◮ Ridge regression: F_λ(g) = E (y − g(x))² + λ‖g‖²_H, minimum for g_λ
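For reference, the ridge minimizer has a standard closed form (a known fact; the covariance-operator notation is mine, not the deck's):

```latex
% With \Sigma = \mathbb{E}[K_x \otimes K_x] the covariance operator on H:
g_\lambda \;=\; \big( \Sigma + \lambda I \big)^{-1} \Sigma\, g_\rho ,
\qquad g_\lambda \xrightarrow[\lambda \to 0]{} g_\rho
```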
◮ Excess error and loss (Bartlett et al., 2006): R(g) − R* ≤ (F(g) − F*)^{1/2}
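This comparison inequality explains why loss rates alone cannot give fast error rates: the square root means even the optimal excess-loss rate transfers slowly (a one-line consequence, in the deck's notation):

```latex
F(g) - F^* = O(1/n)
\;\Longrightarrow\;
R(g) - R^* \;\le\; \big( F(g) - F^* \big)^{1/2} = O(1/\sqrt{n})
```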
◮ Hard inputs to predict: P(y = 1|x) = 1/2, i.e., E(y|x) = 0
◮ Easy inputs to predict: P(y = 1|x) ∈ {0, 1}, i.e., |E(y|x)| = 1
◮ (A1) Margin condition: ∃δ > 0 s.t. |E(y|x)| ≥ δ almost surely
◮ (A2) Technical condition: ∃λ > 0 s.t. sign(E(y|x)) g_λ(x) ≥ δ/2 almost surely
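Why these assumptions yield exponential rates: if an estimator g is uniformly δ/2-close to g_λ, then (A2) forces g to have the Bayes sign everywhere, so its excess 0-1 error is exactly zero; a sketch of this deterministic step (paper's argument, my notation):

```latex
% If \|g - g_\lambda\|_{L^\infty} < \delta/2, then under (A2), for a.e. x:
\operatorname{sign}\!\big(\mathbb{E}[y|x]\big)\, g(x)
\;\ge\; \operatorname{sign}\!\big(\mathbb{E}[y|x]\big)\, g_\lambda(x)
        - \|g - g_\lambda\|_{L^\infty}
\;>\; \tfrac{\delta}{2} - \tfrac{\delta}{2} \;=\; 0
% Hence sign(g) = sign(E[y|x]) a.e. and R(g) = R^*; exponential convergence
% reduces to bounding P( \|\bar g_n - g_\lambda\|_{L^\infty} \ge \delta/2 )
% exponentially in n.
```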
◮ Tail-averaged iterate (Jain et al., 2016):
  ḡ_n = (1/(n/2)) Σ_{i = n/2+1}^{n} g_i
(a runnable sketch follows)
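A minimal runnable sketch of this procedure (Gaussian kernel, one pass over the data; all names and constants are mine, not the paper's code):

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X_i - Z_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def tail_averaged_sgd(X, y, gamma=0.5, lam=1e-3, sigma=1.0):
    """One pass of regularized kernel SGD, tail-averaged over the second half.

    Each iterate is stored through its kernel expansion
    g_i = sum_j alpha[j] K(x_j, .); returns the tail-averaged coefficients.
    """
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    alpha = np.zeros(n)          # coefficients of the current iterate
    alpha_avg = np.zeros(n)      # running tail average
    for i in range(n):
        pred = K[i] @ alpha                  # g_{i-1}(x_i)
        alpha *= (1 - gamma * lam)           # -gamma * lam * g_{i-1}
        alpha[i] -= gamma * (pred - y[i])    # square-loss gradient step
        if i >= n // 2:                      # tail averaging (Jain et al.)
            alpha_avg += alpha / (n - n // 2)
    return alpha_avg
```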
◮ Main tool for the proof: high probability bound in ‖·‖_{L∞} for the SGD iterates
◮ Excess testing loss not exponentially convergent
◮ Motivations to look at the paper and come to the poster session:
  ◮ Sharper bounds
  ◮ High probability bound in ‖·‖_{L∞} (usually ‖·‖_{L2}) for the SGD iterates
  ◮ Bounds for the regular averaging
  ◮ Bounds for general low-noise conditions
◮ Comparing test/train losses/errors for tail-averaged SGD
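A hedged sketch of such an experiment, reusing tail_averaged_sgd and gaussian_kernel from above on synthetic data satisfying the margin condition (all specifics are mine):

```python
# Inputs bounded away from the decision boundary (so (A2) can hold),
# labels flipped with probability 0.1, so |E[y|x]| = 0.8 >= delta.
rng = np.random.default_rng(0)

def sample(n):
    X = rng.uniform(0.2, 1.0, size=(n, 1)) * rng.choice([-1.0, 1.0], size=(n, 1))
    flip = rng.random(n) < 0.1
    y = np.sign(X[:, 0]) * np.where(flip, -1.0, 1.0)
    return X, y

Xtr, ytr = sample(500)
Xte, yte = sample(2000)

alpha = tail_averaged_sgd(Xtr, ytr)
pred = gaussian_kernel(Xte, Xtr) @ alpha

# Test error: excess over the Bayes error (0.1 here) decays exponentially.
# Test loss: excess over F* decays at best polynomially (O(1/n)).
test_error = np.mean(np.sign(pred) != yte)
test_loss = np.mean((pred - yte) ** 2)
print(test_error, test_loss)
```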
◮ Exponential convergence of test error and not test loss
◮ Importance of the margin condition
◮ High probability bound for the averaged and regularized SGD

Perspectives:
◮ No regularization
◮ Study the effect of the regularity of the problem
◮ Beyond least-squares