  1. The Implicit Regularization of Stochastic Gradient Flow for Least Squares
Alnur Ali¹, Edgar Dobriban², and Ryan J. Tibshirani³
¹Stanford University, ²University of Pennsylvania, ³Carnegie Mellon University

  2. Outline
– Overview
– Continuous-time viewpoint
– Risk bounds
– Numerical examples
– Conclusion

  5. Introduction
◮ Given the sizes of modern data sets, stochastic gradient descent is one of the most widely used optimization algorithms today
– Its computational and statistical properties have been studied for decades (Robbins & Monro, 1951; Fabian, 1968; Ruppert, 1988; Kushner & Yin, 2003; Polyak & Juditsky, 1992; ...)
◮ Recently, there has been a lot of interest in implicit regularization
◮ In particular, a line of work shows that (early-stopped) gradient descent is linked to ℓ2 regularization
◮ This connection is interesting, but also computationally convenient

  7. Introduction
◮ Natural to ask: do the iterates generated by (mini-batch) stochastic gradient descent also possess (implicit) ℓ2 regularity?
◮ Why might there be a connection at all?
– Compare the paths for least squares regression
[Figure: coefficient paths for ridge regression (coefficients vs. 1/λ) and for stochastic gradient descent (coefficients vs. iteration k); the two sets of paths look alike]
◮ In this paper, we'll focus on least squares regression

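The resemblance between the two paths is easy to reproduce. Below is a minimal sketch (not from the slides; the problem sizes, seeds, and variable names are illustrative) that traces the ridge solution path as λ decreases and a constant step size mini-batch SGD path on the same simulated least squares problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Ridge path: beta_ridge(lam) = (X^T X + n*lam*I)^{-1} X^T y, traced as lam decreases
lams = np.logspace(1, -2, 50)
ridge_path = np.array([np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
                       for lam in lams])

# Mini-batch SGD with a constant step size, started at zero
def sgd_path(X, y, eps=0.05, m=10, iters=1000, seed=1):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    for _ in range(iters):
        idx = rng.choice(n, size=m, replace=False)          # mini-batch I
        beta = beta + eps * X[idx].T @ (y[idx] - X[idx] @ beta) / m
        path.append(beta.copy())
    return np.array(path)

path = sgd_path(X, y)
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
# Both paths run from 0 toward the least squares solution; plotting the columns
# of ridge_path against 1/lam and of path against k shows the resemblance
```

Plotting each coefficient of `ridge_path` against 1/λ next to each coefficient of `path` against k reproduces the qualitative picture the slide describes.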
  9. Introduction
◮ Main tool for making the connection: a stochastic differential equation that we call stochastic gradient flow
– Linked to SGD with a constant step size; more on this later
◮ We give a bound on the excess risk of stochastic gradient flow at time t over ridge regression with tuning parameter λ = 1/t
– The results hold across the entire optimization path
– The results do not place strong conditions on the features
– The proofs are simpler than in discrete time
◮ Roughly speaking, the bound decomposes into three parts
– The variance of ridge regression, scaled by a constant less than 1
– The "price of stochasticity": a term that is non-negative but vanishes as time grows
– A term tied to the limiting optimization error: this term is zero in the overparametrized regime, but positive otherwise

  10. Outline
– Overview
– Continuous-time viewpoint
– Risk bounds
– Numerical examples
– Conclusion

  11. Stochastic gradient flow
◮ We consider the stochastic differential equation

dβ(t) = (1/n) Xᵀ(y − Xβ(t)) dt + Q_ε(β(t))^{1/2} dW(t),   (1)

where the drift is just the gradient for least squares regression, and the fluctuations are governed by the covariance of the stochastic gradients; here β(0) = 0,

Q_ε(β) = ε · Cov_I( (1/m) X_Iᵀ (y_I − X_I β) )

is the diffusion coefficient, I ⊆ {1, ..., n} is a mini-batch, and ε > 0 is a (fixed) step size
◮ We call (1) stochastic gradient flow
– It has a few nice properties, and bears several connections to SGD with a constant step size; more on this next

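A sketch of how one might simulate (1) by Euler–Maruyama. It assumes mini-batches drawn uniformly with replacement, so that Cov_I of the stochastic gradient is approximately 1/m times the empirical per-example gradient covariance; the function name and problem setup are illustrative, not from the paper:

```python
import numpy as np

def sgf_simulate(X, y, eps=0.05, m=10, T=5.0, h=0.01, seed=0):
    """Euler-Maruyama simulation of the SDE (1), started at beta(0) = 0."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(int(T / h)):
        resid = y - X @ beta
        drift = X.T @ resid / n                  # least squares gradient term
        G = X * resid[:, None]                   # per-example gradients g_i = x_i r_i
        S = np.cov(G, rowvar=False)              # their empirical covariance
        Q = (eps / m) * S                        # diffusion coefficient Q_eps(beta)
        w, V = np.linalg.eigh(Q)
        Q_half = V @ (np.sqrt(np.clip(w, 0.0, None))[:, None] * V.T)  # Q^{1/2}
        beta = beta + h * drift + np.sqrt(h) * (Q_half @ rng.standard_normal(p))
    return beta

rng = np.random.default_rng(42)
n, p = 200, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)
beta_T = sgf_simulate(X, y)
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
# beta_T drifts toward the least squares solution while fluctuating around it
```

The matrix square root is taken through the eigendecomposition of Q, with eigenvalues clipped at zero to guard against tiny negative values from floating-point error.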
  14. Stochastic gradient flow
◮ Lemma: the Euler discretization of stochastic gradient flow, β̃(k), and constant step size SGD, β(k), share first and second moments, i.e.,

E(β̃(k)) = E(β(k)) and Cov(β̃(k)) = Cov(β(k))

– This implies the prediction errors match
– It also implies that any deviation between the first two moments of stochastic gradient flow and SGD must be due to discretization error
◮ Sanity check: revisiting the solution/optimization paths from earlier
[Figure: coefficient paths for ridge regression (vs. 1/λ), stochastic gradient descent (vs. iteration k), and stochastic gradient flow (vs. time t); all three look alike]

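The first-moment half of the lemma admits a quick Monte Carlo sanity check: because the mini-batch gradient is unbiased and linear in β, the mean of constant step size SGD follows the full-batch gradient descent recursion exactly. A minimal sketch (illustrative setup, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.standard_normal((n, p))
y = X @ np.ones(p) + 0.3 * rng.standard_normal(n)
eps, m, k = 0.1, 5, 50

# Mean recursion: full-batch gradient descent with the same step size
beta_gd = np.zeros(p)
for _ in range(k):
    beta_gd = beta_gd + eps * X.T @ (y - X @ beta_gd) / n

# Monte Carlo average of constant step size SGD iterates at step k
reps = 1000
avg = np.zeros(p)
for r in range(reps):
    rr = np.random.default_rng(r)
    beta = np.zeros(p)
    for _ in range(k):
        idx = rr.choice(n, size=m, replace=True)  # with-replacement mini-batch
        beta = beta + eps * X[idx].T @ (y[idx] - X[idx] @ beta) / m
    avg += beta / reps
# avg should agree with beta_gd up to Monte Carlo error
```

This only checks the mean; the lemma's covariance claim is what distinguishes stochastic gradient flow from the constant covariance process on the next slide.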
  16. Stochastic gradient flow
◮ A number of works consider instead the constant covariance process

dβ(t) = (1/n) Xᵀ(y − Xβ(t)) dt + ((ε/m) · Σ̂)^{1/2} dW(t),   (2)

where Σ̂ = XᵀX/n (cf. Langevin dynamics)
◮ It turns out (theoretically and empirically) that stochastic gradient flow is a more accurate approximation to SGD than (2) is
[Figure: coefficient trajectories of SGD, stochastic gradient flow (non-constant covariance), and the constant covariance process (2); stochastic gradient flow tracks SGD more closely]

  17. Outline
– Overview
– Continuous-time viewpoint
– Risk bounds
– Numerical examples
– Conclusion

  20. Setup
◮ Assume a standard regression model

y = Xβ₀ + η,   η ∼ (0, σ²I)

◮ Fix X; let s_i, i = 1, ..., p, denote the eigenvalues of XᵀX/n
◮ Recall a useful result for (batch) gradient flow (Ali et al., 2018)
– For least squares regression, gradient flow is

β̇(t) = (1/n) Xᵀ(y − Xβ(t)),   β(0) = 0

– It has the solution

β̂_gf(t) = (XᵀX)⁺ (I − exp(−t XᵀX/n)) Xᵀ y

– Then, for any time t ≥ 0 (note the correspondence between t and λ = 1/t),

Bias²(β̂_gf(t); β₀) ≤ Bias²(β̂_ridge(1/t); β₀) and Var(β̂_gf(t)) ≤ 1.6862 · Var(β̂_ridge(1/t)),

so that

Risk(β̂_gf(t); β₀) ≤ 1.6862 · Risk(β̂_ridge(1/t); β₀)

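The closed-form gradient flow solution is cheap to evaluate from the eigendecomposition of XᵀX/n, and can be placed side by side with ridge at the corresponding λ = 1/t. A minimal sketch (function names and problem sizes are illustrative, not from the paper):

```python
import numpy as np

def grad_flow(X, y, t):
    """beta_gf(t) = (X^T X)^+ (I - exp(-t X^T X / n)) X^T y, via eigendecomposition."""
    n, p = X.shape
    s, V = np.linalg.eigh(X.T @ X / n)
    # spectral shrinkage (1 - e^{-t s}) / (n s), with the s -> 0 limit t/n
    shrink = np.where(s > 1e-12, (1.0 - np.exp(-t * s)) / (n * s), t / n)
    return V @ (shrink * (V.T @ (X.T @ y)))

def ridge(X, y, lam):
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)

t = 2.0
b_gf, b_ridge = grad_flow(X, y, t), ridge(X, y, 1.0 / t)
b_ls = np.linalg.lstsq(X, y, rcond=None)[0]
# b_gf and b_ridge apply similar spectral shrinkage; as t grows, b_gf -> b_ls
```

Both estimators shrink each eigendirection of the least squares solution, by 1 − e^{−t s_i} for gradient flow and s_i/(s_i + 1/t) for ridge, which is the spectral view behind the bias and variance comparison above.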
  21. Excess risk bound (over ridge)
◮ Theorem: for any time t > 0 (provided the step size is small enough),

Risk(β̂_sgf(t); β₀) − Risk(β̂_ridge(1/t); β₀)
≤ 0.6862 · Var_η(β̂_ridge(1/t))   (scaled ridge variance)
+ (ε n / m) · Σ_{i=1}^p E_η[ exp(δ_y) · s_i/(s_i − α/2) · (exp(−αt) − exp(−2t s_i)) ]   ("price of stochasticity")
+ (ε n / m) · Σ_{i=1}^p E_η[ γ_y ] · (1 − exp(−2t s_i))   (limiting optimization error)

◮ ε and m denote the step size and mini-batch size, respectively
◮ s_i denote the eigenvalues of the sample covariance matrix
◮ α, γ_y, δ_y depend on n, p, m, ε, s_i, y, but not on t (see the paper for details)

  22. Implications/observations
◮ The second and third (variance) terms ...
– Roughly scale with ε/m (Goyal et al., 2017; Smith et al., 2017; You et al., 2017; Shallue et al., 2019); this is different from gradient flow
– Depend on the signal-to-noise ratio; this too is different from gradient flow (and from linear smoothers in general, because stochastic gradient flow/descent are actually randomized linear smoothers)
– The second term decreases with time, just as a bias would; this is again different from gradient flow (see the lemma in the paper)
