Interpolation, Growth Conditions, and Stochastic Gradient Descent
Aaron Mishkin, amishkin@cs.ubc.ca
Training neural networks is dangerous work!

Chapter 1: Introduction

Chapter 1: Goal
Premise: modern neural networks are heavily over-parameterized.
(Image credit: https://towardsdatascience.com/challenges-deploying-machine-learning-models-to-production-ded3f9009cb3)
Stochastic gradient methods are the most popular algorithms for fitting ML models:
SGD: w_{k+1} = w_k − η_k ∇f_i(w_k).
But practitioners face major challenges with tuning hyper-parameters such as the step-size η_k.
Idea: exploit “over-parameterization” for better optimization and faster convergence.
We need assumptions to analyze the complexity of SGD.
Goal: minimize f : R^d → R, where
◮ a minimizer exists: f(w*) ≤ f(w) for all w ∈ R^d,
◮ f is L-smooth: ‖∇f(w) − ∇f(u)‖₂ ≤ L‖w − u‖₂ for all w, u ∈ R^d,
◮ (sometimes) f is µ strongly-convex: f(u) ≥ f(w) + ⟨∇f(w), u − w⟩ + (µ/2)‖u − w‖₂² for all w, u ∈ R^d.
Stochastic Oracles: querying O at w_k returns
f(w_k, z_k) and ∇f(w_k, z_k),
which are unbiased:
E_{z_k}[f(w_k, z_k)] = f(w_k) and E_{z_k}[∇f(w_k, z_k)] = ∇f(w_k).
O is L_max individually-smooth if, almost surely,
‖∇f(w, z_k) − ∇f(u, z_k)‖₂ ≤ L_max ‖w − u‖₂ for all w, u ∈ R^d.
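To make the oracle model concrete, here is a minimal Python sketch of a sub-sampling oracle for a finite-sum objective f(w) = (1/n) Σ_i f_i(w); the class name and interface are illustrative assumptions, not part of the talk.

```python
import numpy as np

class SubsamplingOracle:
    """Stochastic first-order oracle for f(w) = (1/n) * sum_i f_i(w).

    Each query draws z_k uniformly from {0, ..., n-1} and returns
    (f(w, z_k), grad f(w, z_k)). Uniform sampling makes both outputs
    unbiased estimates of f(w) and grad f(w).
    """

    def __init__(self, f_i, grad_f_i, n, seed=0):
        self.f_i = f_i            # f_i(w, i): loss of the i-th component
        self.grad_f_i = grad_f_i  # grad_f_i(w, i): its gradient
        self.n = n
        self.rng = np.random.default_rng(seed)

    def query(self, w):
        z = self.rng.integers(self.n)  # z_k ~ Uniform{0, ..., n-1}
        return self.f_i(w, z), self.grad_f_i(w, z)
```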
Definition (Interpolation: Minimizers)
(f, O) satisfies minimizer interpolation if w′ ∈ arg min f ⟹ w′ ∈ arg min f(·, z_k) almost surely.

Definition (Interpolation: Stationary Points)
(f, O) satisfies stationary-point interpolation if ∇f(w′) = 0 ⟹ ∇f(w′, z_k) = 0 almost surely.

Definition (Interpolation: Mixed)
(f, O) satisfies mixed interpolation if w′ ∈ arg min f ⟹ ∇f(w′, z_k) = 0 almost surely.

[Figure: f(w) and a sampled f(w, z) plotted together, marking the minimizers w* and the stationary points w′ that each notion of interpolation constrains.]
Lemma (Interpolation Relationships)
Let (f, O) be arbitrary. Then only the following implications hold:
Minimizer Interpolation ⟹ Mixed Interpolation and Stationary-Point Interpolation ⟹ Mixed Interpolation.
However, if f and f(·, z_k) are invex (almost surely) for all k, then the three definitions are equivalent.
Note: invexity is weaker than convexity and is implied by it.
There are two obvious ways that we can leverage interpolation:
◮ Indirectly, through growth conditions on the stochastic gradients. This was first done using the weak and strong growth conditions by Vaswani et al. [2019a].
◮ Directly, in the convergence analysis. This was first done by Bassily et al. [2018], who analyzed SGD under a curvature condition.
We do both, starting with weak/strong growth.
There are many possible regularity assumptions on O:
Bounded Gradients: E[‖∇f(w, z_k)‖²] ≤ σ²,
Bounded Variance: E[‖∇f(w, z_k)‖²] ≤ ‖∇f(w)‖² + σ²,
Strong Growth + Noise: E[‖∇f(w, z_k)‖²] ≤ ρ‖∇f(w)‖² + σ².
We obtain the strong and weak growth conditions as follows:
Strong Growth + Noise: E[‖∇f(w, z_k)‖²] ≤ ρ‖∇f(w)‖² + σ².
Dropping the noise term (σ² = 0) gives
Strong Growth: E[‖∇f(w, z_k)‖²] ≤ ρ‖∇f(w)‖².
Relaxing the gradient norm via L-smoothness, ‖∇f(w)‖² ≤ 2L(f(w) − f(w*)), gives
Weak Growth: E[‖∇f(w, z_k)‖²] ≤ 2αL(f(w) − f(w*)).
Lemma (Interpolation and Weak Growth)
Assume f is L-smooth and O is L_max individually-smooth. If minimizer interpolation holds, then weak growth also holds with α ≤ L_max/L.

Lemma (Interpolation and Strong Growth)
Assume f is L-smooth and µ strongly-convex and O is L_max individually-smooth. If minimizer interpolation holds, then strong growth also holds with ρ ≤ L_max/µ.

Comments:
◮ Improves on earlier proofs of these implications, which required convexity.
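To fill in the reasoning for the weak-growth lemma, here is a proof sketch, assuming the weak growth condition is normalized as E[‖∇f(w, z_k)‖²] ≤ 2αL(f(w) − f(w*)) as above:

```latex
% Each f(., z_k) is L_max-smooth, so almost surely, for any w:
%   \|\nabla f(w, z_k)\|^2
%     \le 2 L_{\max} \big( f(w, z_k) - \min_u f(u, z_k) \big).
% Taking expectations, and using minimizer interpolation
% (w^* also minimizes f(., z_k) a.s.) plus unbiasedness:
\mathbb{E}\big[\|\nabla f(w, z_k)\|^2\big]
  \le 2 L_{\max}\, \mathbb{E}\big[f(w, z_k) - f(w^*, z_k)\big]
  = 2 L_{\max} \big( f(w) - f(w^*) \big).
% Matching against the weak growth condition
% \mathbb{E}[\|\nabla f(w, z_k)\|^2] \le 2 \alpha L (f(w) - f(w^*))
% shows that \alpha \le L_{\max} / L suffices.
```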
SGD with a constant step-size η:
1.1 Query O for ∇f(w_k, z_k).
1.2 Update the iterate as w_{k+1} = w_k − η∇f(w_k, z_k).
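A minimal sketch of this loop in Python (NumPy), reusing the illustrative SubsamplingOracle interface from earlier; the averaged iterate is returned because the convex rates below are stated for w̄_K:

```python
import numpy as np

def sgd_constant(oracle, w0, eta, num_iters):
    """SGD with a constant step-size eta.

    Each iteration queries the oracle for a stochastic gradient and
    applies w_{k+1} = w_k - eta * grad f(w_k, z_k). Returns the last
    iterate and the running average w_bar of the iterates.
    """
    w = np.asarray(w0, dtype=float).copy()
    w_sum = np.zeros_like(w)
    for _ in range(num_iters):
        _, grad = oracle.query(w)
        w = w - eta * grad
        w_sum += w
    return w, w_sum / num_iters
```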
Prior work for SGD under growth conditions or interpolation: [e.g. Schmidt and Le Roux, 2013].
We still provide many new and improved results!
◮ For example, improved rates for strongly-convex objectives.
The Armijo line-search is a classic solution to step-size selection: choose the largest η_k satisfying
f(w_k − η_k∇f(w_k)) ≤ f(w_k) − c·η_k‖∇f(w_k)‖².
SGD with the stochastic Armijo line-search:
1.1 Query O for f(w_k, z_k) and ∇f(w_k, z_k).
1.2 Set η_k = ∞ and w_{k+1} ← w_k − η_k∇f(w_k, z_k).
1.3 Exactly backtrack until f(w_{k+1}, z_k) ≤ f(w_k, z_k) − c·η_k‖∇f(w_k, z_k)‖².
Note: this evaluates the Armijo condition on f(·, z_k) instead of f, and needs direct access to f(·, z_k) to backtrack.
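A minimal sketch of one such step in Python. Geometric backtracking from a finite eta_max stands in for the exact backtracking (from η_k = ∞) used in the analysis, and all names and constants here are illustrative assumptions:

```python
import numpy as np

def armijo_sgd_step(f_z, grad_f_z, w, z, eta_max=1.0, c=0.5,
                    beta=0.7, max_backtracks=50):
    """One SGD step with the stochastic Armijo line-search.

    The Armijo condition is checked on the sampled function f(., z),
    so the same z is reused for every backtracking trial. Returns the
    new iterate and the accepted step-size.
    """
    loss = f_z(w, z)
    grad = grad_f_z(w, z)
    grad_norm_sq = float(np.dot(grad, grad))

    eta = eta_max
    for _ in range(max_backtracks):
        w_next = w - eta * grad
        # Stochastic Armijo: sufficient decrease on f(., z).
        if f_z(w_next, z) <= loss - c * eta * grad_norm_sq:
            return w_next, eta
        eta *= beta  # shrink the step-size geometrically
    return w - eta * grad, eta  # fall back to the smallest trial step
```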
Lemma (Step-size Bound)
Assume f is L-smooth and O is L_max individually-smooth. Then the maximal step-size η_max satisfying the stochastic Armijo condition obeys
2(1 − c)/L_max ≤ η_max ≤ (f(w_k, z_k) − f(w*, z_k)) / (c‖∇f(w_k, z_k)‖²).
Theorem (Convex + Interpolation)
Assume f is convex, L-smooth, and O is L_max individually-smooth. Assume minimizer interpolation holds and f(·, z_k) is almost-surely convex for all k. Then SGD with the Armijo line-search and c = 1/2 converges as
E[f(w̄_K)] − f(w*) ≤ (L_max / 2K)‖w_0 − w*‖².
Comments:
◮ Up to constants, the line-search is just as fast as the best constant step-size!
◮ Recovers the deterministic rate when L_max = L.
◮ An analogous result improves the rate for strongly-convex functions (from µ̄ to µ).
SGD can be accelerated when minimizer interpolation holds:
◮ Liu and Belkin [2020] prove accelerated convergence for strongly-convex functions.
◮ Acceleration can also be analyzed under strong growth for strongly-convex and convex functions. We follow Vaswani et al. [2019a], but provide tighter rates.
◮ Our dependence on the strong growth constant improves from ρ to √ρ, a factor of √ρ.
Conclusions:
◮ Interpolation is a strong assumption compared to general stochastic optimization problems.
◮ Functions satisfying interpolation are well-behaved globally.
◮ Under interpolation, SGD is competitive with the deterministic case.
◮ Line-searches give nearly parameter-free optimization under interpolation.
◮ Acceleration is possible with a penalty of only √ρ.
Collaborators, left to right: Sharan Vaswani, Issam Laradji, Gauthier Gidel, Mark Schmidt, Simon Lacoste-Julien, Frederik Kunstner, Si Yi Meng, Jonathan Lavington, Yihan Zhou, and Betty Shea.
Least Squares: w* ∈ arg min_w (1/2n) Σ_{i=1}^n (⟨w, x_i⟩ − y_i)².
The sub-sampling oracle sets z_k = i ∼ Uniform({1, …, n}) and returns
f(w, z_k) = (1/2)(⟨w, x_i⟩ − y_i)² and ∇f(w, z_k) = (⟨w, x_i⟩ − y_i) x_i.
Observations:
◮ O is max_i ‖x_i‖₂² individually-smooth, since f_i(w) = (1/2)(⟨w, x_i⟩ − y_i)² is ‖x_i‖₂²-smooth for each i ∈ [n].
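Here is what this sub-sampling oracle might look like in code: a minimal sketch with synthetic data, where the labels are generated by a ground-truth model so that minimizer interpolation holds by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star  # consistent system: every f_i is zero at w_star

def f_z(w, i):
    """f(w, z_k) = 0.5 * (<w, x_i> - y_i)^2 for the sampled index i."""
    return 0.5 * (X[i] @ w - y[i]) ** 2

def grad_f_z(w, i):
    """grad f(w, z_k) = (<w, x_i> - y_i) * x_i."""
    return (X[i] @ w - y[i]) * X[i]

# Individual smoothness constant: L_max = max_i ||x_i||_2^2.
L_max = np.max(np.sum(X ** 2, axis=1))
```

Since y lies in the range of X, w_star minimizes every component loss simultaneously, so minimizer interpolation holds and the constant step-size η = 1/L_max from the theorems below applies.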
Theorem (Convex + Weak Growth)
Assume f is convex, L-smooth, and (f, O) satisfies weak growth with constant α. Then SGD with η = 1/(2αL) converges as
E[f(w̄_K)] − f(w*) ≤ (2αL / K)‖w_0 − w*‖².

Theorem (Convex + Interpolation)
Assume f is convex, L-smooth, and O is L_max individually-smooth. Assume minimizer interpolation holds. Then SGD with η = 1/L_max converges as
E[f(w̄_K)] − f(w*) ≤ (L_max / 2K)‖w_0 − w*‖².
Weak Growth: E[f(w̄_K)] − f(w*) ≤ (2αL / K)‖w_0 − w*‖².
Interpolation: E[f(w̄_K)] − f(w*) ≤ (L_max / 2K)‖w_0 − w*‖².
Comments:
◮ Under minimizer interpolation, α ≤ L_max/L, so the weak-growth rate is at most (2L_max / K)‖w_0 − w*‖², a factor of 4 worse than the direct interpolation analysis.
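Putting the illustrative sketches together (SubsamplingOracle, sgd_constant, and the least-squares data above), one can eyeball the (L_max / 2K) rate numerically:

```python
# Uses SubsamplingOracle, sgd_constant, X, y, f_z, grad_f_z, n, d,
# and L_max from the earlier sketches.
oracle = SubsamplingOracle(f_z, grad_f_z, n)
_, w_bar = sgd_constant(oracle, np.zeros(d), eta=1.0 / L_max,
                        num_iters=5000)
# f(w*) = 0 under interpolation, so this is the suboptimality gap.
gap = 0.5 * np.mean((X @ w_bar - y) ** 2)
print(f"f(w_bar) - f(w*) = {gap:.2e}")
```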
References

Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187, 2017.
Mahmoud Assran and Michael Rabbat. On the convergence of Nesterov's accelerated gradient method in stochastic settings. arXiv preprint arXiv:2002.12414, 2020.
Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792, 2018.
Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564, 2018.
Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. arXiv preprint arXiv:1810.05291, 2018.
Volkan Cevher and Bang Công Vũ. On the linear convergence of the stochastic gradient method with constant step-size. Optimization Letters, 13(5):1177–1187, 2019.
Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Arsany Hany Abdelmessih Guirguis, and Sébastien Louis Alexandre Rouault. Aggregathor: Byzantine machine learning via robust gradient aggregation. In The Conference on Systems and Machine Learning (SysML), 2019.
Yoel Drori and Ohad Shamir. The complexity of finding stationary points with stochastic gradient descent. arXiv preprint arXiv:1910.01845, 2019.
Tomas Geffner and Justin Domke. A rule for gradient estimator selection, with an application to variational inference. arXiv preprint arXiv:1911.01894, 2019.
Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. arXiv preprint arXiv:1901.09401, 2019.
Roger Grosse and Ruslan Salakhudinov. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In International Conference on Machine Learning, pages 2304–2313, 2015.
Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pages 2305–2313, 2015.
Kenji Kawaguchi and Haihao Lu. Ordered SGD: A new stochastic optimization framework for empirical risk minimization. In International Conference on Artificial Intelligence and Statistics, pages 669–679, 2020.
Liping Li, Wei Xu, Tianyi Chen, Georgios B. Giannakis, and Qing Ling. RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1544–1551, 2019.
Chaoyue Liu and Mikhail Belkin. Accelerating SGD with momentum for over-parameterized learning. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020.
Josh Patterson and Adam Gibson. Deep learning: A practitioner's approach. O'Reilly Media, 2017.
Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, pages 8114–8124, 2018.
Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.
Sharan Vaswani, Francis Bach, and Mark W. Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, volume 89 of Proceedings of Machine Learning Research, pages 1195–1204. PMLR, 2019a.
Sharan Vaswani, Aaron Mishkin, Issam H. Laradji, Mark Schmidt, Gauthier Gidel, and Simon Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems 32: NeurIPS 2019, pages 3727–3740, 2019b.
Peng Xu, Farbod Roosta-Khorasani, and Michael W. Mahoney. Second-order optimization for non-convex machine learning: An empirical study. arXiv preprint arXiv:1708.07827, 2017.
Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365, 2016.