The Implicit Regularization of Stochastic Gradient Flow for Least Squares

Alnur Ali¹, Edgar Dobriban², and Ryan J. Tibshirani³
¹Stanford University, ²University of Pennsylvania, ³Carnegie Mellon University
Outline
Overview
Continuous-time viewpoint
Risk bounds
Numerical examples
Conclusion
Introduction

◮ Given the sizes of modern data sets, stochastic gradient descent is one of the most widely used optimization algorithms today
– Computational and statistical properties have been studied for decades (Robbins & Monro, 1951; Fabian, 1968; Ruppert, 1988; Polyak & Juditsky, 1992; Kushner & Yin, 2003; ...)
◮ Recently, there has been a lot of interest in implicit regularization
◮ In particular, a line of work shows that (early-stopped) gradient descent is linked to ℓ2 regularization
◮ This link is interesting, but also computationally convenient
Introduction

◮ Natural to ask: do the iterates generated by (mini-batch) stochastic gradient descent also possess (implicit) ℓ2 regularity?
◮ Why might there be a connection at all?
– Compare the solution/optimization paths for least squares regression
[Figure: coefficient profiles for ridge regression (x-axis: 1/λ) and stochastic gradient descent (x-axis: iteration k)]
◮ In this paper, we'll focus on least squares regression
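The similarity between the two paths is easy to reproduce. Below is a minimal sketch (problem sizes, step size, and mini-batch size are illustrative, not the talk's) that runs constant step size mini-batch SGD on a small least squares problem: early iterates resemble heavily regularized ridge solutions, while late iterates hover near the least squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small least squares problem (illustrative sizes, not the talk's).
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)

def ridge(lam):
    """Ridge solution at tuning parameter lam: (X^T X + n lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def sgd_path(steps=1000, eps=0.05, m=10):
    """Coefficient path of constant step size mini-batch SGD, started at 0."""
    beta, path = np.zeros(p), []
    for _ in range(steps):
        I = rng.choice(n, size=m, replace=False)            # mini-batch
        beta = beta + eps * X[I].T @ (y[I] - X[I] @ beta) / m
        path.append(beta.copy())
    return np.array(path)

path = sgd_path()
ls = np.linalg.lstsq(X, y, rcond=None)[0]
# Early iterates sit near heavily regularized ridge solutions; the
# final iterate stays close to the (lightly regularized) least squares fit.
print(np.linalg.norm(path[-1] - ls))
```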
Introduction

◮ Main tool for making the connection: a stochastic differential equation that we call stochastic gradient flow
– Linked to SGD with a constant step size; more on this later
◮ We give a bound on the excess risk of stochastic gradient flow at time t, over ridge regression with tuning parameter λ = 1/t
– Results hold across the entire optimization path
– Results do not place strong conditions on the features
– Proofs are simpler than in discrete time
◮ Roughly speaking, the bound decomposes into three parts
– The variance of ridge regression, scaled by a constant less than 1
– The "price of stochasticity": a non-negative term that vanishes as time grows
– A term tied to the limiting optimization error: zero in the overparametrized regime, positive otherwise
Continuous-time viewpoint
Stochastic gradient flow
◮ We consider the stochastic differential equation
dβ(t) = (1/n) Xᵀ(y − Xβ(t)) dt + Qǫ(β(t))^{1/2} dW(t),   (1)
with β(0) = 0; the drift is just the gradient for least squares regression, and the fluctuations are governed by the covariance of the stochastic gradients: the diffusion coefficient is
Qǫ(β) = ǫ · Cov_I( (1/m) X_Iᵀ (y_I − X_I β) ),
where I ⊆ {1, . . . , n} is a mini-batch of size m, and ǫ > 0 is a (fixed) step size
◮ We call (1) stochastic gradient flow
– Has a few nice properties, and bears several connections to SGD with a constant step size; more on this next
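To make (1) concrete, here is a minimal Euler–Maruyama simulation of stochastic gradient flow (an illustrative sketch, not the paper's code; the problem sizes, step size, and time horizon are made up). The diffusion coefficient Qǫ(β) is estimated by Monte Carlo over random mini-batches rather than via a closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m, eps = 100, 5, 10, 0.02        # illustrative sizes

X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)

def minibatch_grad_cov(beta, draws=200):
    """Covariance over mini-batches I of (1/m) X_I^T (y_I - X_I beta),
    estimated by Monte Carlo."""
    grads = []
    for _ in range(draws):
        I = rng.choice(n, size=m, replace=False)
        grads.append(X[I].T @ (y[I] - X[I] @ beta) / m)
    return np.cov(np.array(grads).T)

def sgf_euler(T=2.0, dt=0.01):
    """Euler-Maruyama discretization of stochastic gradient flow:
    d beta = (1/n) X^T (y - X beta) dt + Q_eps(beta)^{1/2} dW."""
    beta = np.zeros(p)
    for _ in range(int(T / dt)):
        drift = X.T @ (y - X @ beta) / n
        Q = eps * minibatch_grad_cov(beta)                 # Q_eps(beta)
        # Symmetric matrix square root via eigendecomposition.
        w, V = np.linalg.eigh(Q)
        sqrtQ = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
        beta = beta + drift * dt + sqrtQ @ (np.sqrt(dt) * rng.standard_normal(p))
    return beta

beta_T = sgf_euler()
# By time T, the sample path has drifted toward the least squares solution,
# with fluctuations on the scale set by Q_eps.
```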
Stochastic gradient flow

◮ Lemma: the Euler discretization of stochastic gradient flow, β̃(k), and constant step size SGD, β(k), share first and second moments, i.e., E(β̃(k)) = E(β(k)) and Cov(β̃(k)) = Cov(β(k))
– Implies the prediction errors match
– Also implies any deviation between the first two moments of stochastic gradient flow and SGD must be due to discretization error
◮ Sanity check: revisiting the solution/optimization paths from earlier
[Figure: coefficient profiles for ridge regression (x-axis: 1/λ), stochastic gradient descent (x-axis: k), and stochastic gradient flow (x-axis: t)]
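The moment matching in the lemma can be illustrated by Monte Carlo for a single step (a toy check, not the paper's proof; sizes are made up). Starting from β = 0, one SGD step is ǫ times a mini-batch gradient, while one Euler step of stochastic gradient flow with step ǫ is ǫ times the full gradient plus √ǫ · Qǫ(0)^{1/2} Gaussian noise; the two one-step distributions should agree in mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m, eps = 200, 3, 20, 0.1         # illustrative sizes

X = rng.standard_normal((n, p))
y = X @ np.ones(p) + rng.standard_normal(n)

reps = 20000

# One SGD step from beta = 0: eps times a mini-batch gradient.
sgd_steps = np.empty((reps, p))
for r in range(reps):
    I = rng.choice(n, size=m, replace=False)
    sgd_steps[r] = eps * X[I].T @ y[I] / m

# One Euler step of stochastic gradient flow from beta = 0:
# eps * full gradient + sqrt(eps) * Q_eps(0)^{1/2} * N(0, I),
# with Q_eps(0) = eps * Cov_I((1/m) X_I^T y_I), estimated by Monte Carlo.
Q = eps * np.cov((sgd_steps / eps).T)
w, V = np.linalg.eigh(Q)
sqrtQ = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
full_grad = X.T @ y / n
sgf_steps = eps * full_grad + np.sqrt(eps) * rng.standard_normal((reps, p)) @ sqrtQ

# Means and covariances of the two one-step distributions agree,
# up to Monte Carlo error.
print(np.abs(sgd_steps.mean(axis=0) - sgf_steps.mean(axis=0)).max())
print(np.abs(np.cov(sgd_steps.T) - np.cov(sgf_steps.T)).max())
```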
Stochastic gradient flow

◮ A number of works instead consider the constant covariance process
dβ(t) = (1/n) Xᵀ(y − Xβ(t)) dt + √(ǫ/m) · Σ̂^{1/2} dW(t),   (2)
where Σ̂ = XᵀX/n (cf. Langevin dynamics)
◮ Turns out (theoretically and empirically) that stochastic gradient flow is a more accurate approximation to SGD than (2) is
[Figure: coefficient paths of SGD compared with non-constant covariance SGF (1) and constant covariance SGF (2)]
Risk bounds
Setup

◮ Assume a standard regression model y = Xβ0 + η, with η ∼ (0, σ²I)
◮ Fix X; let s_i, i = 1, . . . , p, denote the eigenvalues of XᵀX/n
◮ Recall a useful result for (batch) gradient flow (Ali et al., 2018)
– For least squares regression, gradient flow is β̇(t) = (1/n) Xᵀ(y − Xβ(t)), with β(0) = 0
– Has the solution β̂gf(t) = (XᵀX)⁺ (I − exp(−t XᵀX/n)) Xᵀ y
– Then, for any time t ≥ 0 (note the correspondence λ = 1/t),
Bias²(β̂gf(t); β0) ≤ Bias²(β̂ridge(1/t); β0) and Var(β̂gf(t)) ≤ 1.6862 · Var(β̂ridge(1/t)),
so that Risk(β̂gf(t); β0) ≤ 1.6862 · Risk(β̂ridge(1/t); β0)
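The gradient flow solution and the bias/variance comparison with ridge can be checked numerically in the eigenbasis of XᵀX/n (a sketch under the model above; the constant 1.6862 is taken from the slide, and the problem sizes are made up).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 50, 8, 1.0              # illustrative sizes
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)

s, V = np.linalg.eigh(X.T @ X / n)    # eigenvalues s_i of X^T X / n
b = (V.T @ beta0) ** 2                # signal energy per eigendirection

def gf_risk(t):
    """Bias^2 and variance of gradient flow at time t (fixed X, eta ~ (0, sigma^2 I)).
    Gradient flow shrinks the i-th eigendirection by 1 - exp(-t s_i)."""
    bias2 = np.sum(np.exp(-2 * t * s) * b)
    var = sigma**2 / n * np.sum((1 - np.exp(-t * s)) ** 2 / s)
    return bias2, var

def ridge_risk(lam):
    """Bias^2 and variance of ridge regression at tuning parameter lam.
    Ridge shrinks the i-th eigendirection by s_i / (s_i + lam)."""
    bias2 = np.sum((lam / (s + lam)) ** 2 * b)
    var = sigma**2 / n * np.sum(s / (s + lam) ** 2)
    return bias2, var

# Check the bias and variance inequalities along the path, pairing
# time t with ridge parameter lam = 1/t.
for t in [0.01, 0.1, 1.0, 10.0, 100.0]:
    (gb, gv), (rb, rv) = gf_risk(t), ridge_risk(1.0 / t)
    print(t, gb <= rb, gv <= 1.6862 * rv)
```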
Excess risk bound (over ridge)

◮ Theorem: for any time t > 0 (provided the step size is small enough),
Risk(β̂sgf(t); β0) − Risk(β̂ridge(1/t); β0)
≤ 0.6862 · Var_η(β̂ridge(1/t))   (scaled ridge variance)
+ ǫ · (n/m) Σ_{i=1}^p E_η[ exp(δ_y) s_i / (s_i − α/2) · (exp(−αt) − exp(−2t s_i)) ]   ("price of stochasticity")
+ ǫ · (n/m) Σ_{i=1}^p E_η[ γ_y (1 − exp(−2t s_i)) ]   (limiting opt. error)
◮ ǫ and m denote the step size and mini-batch size, respectively
◮ s_i denote the eigenvalues of the sample covariance matrix
◮ α, γ_y, δ_y depend on n, p, m, ǫ, s_i, y, but not on t (see paper for details)
Implications/observations

◮ The second and third (variance) terms ...
– Roughly scale with ǫ/m (Goyal et al., 2017; Smith et al., 2017; You et al., 2017; Shallue et al., 2019); this is different from gradient flow
– Depend on the signal-to-noise ratio; this is different from gradient flow (and from linear smoothers in general, because stochastic gradient flow/descent are actually randomized linear smoothers)
– The second term decreases with time, just as a bias would; this is different from gradient flow (see lemma in the paper)
◮ Proof builds on the gradient flow result, and uses the special covariance structure of the diffusion coefficient Qǫ(β(t)) for least squares
– Results hold across the entire optimization path
– No strong conditions placed on the data matrix X
– Also have the following lower bound under oracle tuning:
inf_{λ≥0} Risk(β̂ridge(λ); β0) ≤ inf_{t≥0} Risk(β̂sgf(t); β0)
– A similar result holds for the coefficient error E_{η,Z} ‖β̂sgf(t) − β̂ridge(1/t)‖²₂ (see theorem in the paper)
Numerical examples
Synthetic data
◮ Below, we show n = 100, p = 10, m = 2
– The bound (Theorem 2) tracks ridge's (and SGD's) risk(s) closely
– The bound / SGD achieve risk comparable to gradient flow in less time
– See the paper for other settings (e.g., high dimensions) and for coefficient error
[Figure: estimation risk against 1/λ or t for Ridge, GF, GD, SGD, and the Theorem 2 bound; left panel: Gaussian features, rho = 0.5; right panel: Student t features, rho = 0.5]
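A scaled-down version of this experiment is sketched below (the noise level, step size, and replication count are made up; n, p, m follow the slide). It compares the Monte Carlo estimation risk of mini-batch SGD along its path with the closed-form ridge risk at λ = 1/t, identifying t = ǫk.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, m, sigma, eps = 100, 10, 2, 1.0, 0.01   # n, p, m as on the slide

X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)
beta0 /= np.linalg.norm(beta0)                # unit-norm signal

s, V = np.linalg.eigh(X.T @ X / n)            # eigenvalues of X^T X / n
b = (V.T @ beta0) ** 2

def ridge_risk(lam):
    """Closed-form estimation risk of ridge at parameter lam (fixed X)."""
    return np.sum((lam / (s + lam)) ** 2 * b) + sigma**2 / n * np.sum(s / (s + lam) ** 2)

def sgd_risk(steps=1500, reps=50):
    """Monte Carlo estimation risk of mini-batch SGD along its path."""
    risks = np.zeros(steps)
    for _ in range(reps):
        y = X @ beta0 + sigma * rng.standard_normal(n)
        beta = np.zeros(p)
        for k in range(steps):
            I = rng.choice(n, size=m, replace=False)
            beta = beta + eps * X[I].T @ (y[I] - X[I] @ beta) / m
            risks[k] += np.sum((beta - beta0) ** 2)
    return risks / reps

risk_sgd = sgd_risk()
# Ridge risk on the matched grid lambda = 1/(eps * k), k = 1, ..., 1500.
risk_ridge = np.array([ridge_risk(1.0 / (eps * (k + 1))) for k in range(1500)])
```

The two risk curves can then be plotted against t = ǫk (respectively 1/λ) to mimic the figure; both dip well below the initial risk ‖β0‖² = 1 along the path.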