The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects
Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, Jinwen Ma
Peking University; Beijing Institute of Big Data Research
The implicit bias of stochastic gradient descent
◮ Compared with gradient descent (GD), stochastic gradient descent (SGD) tends to generalize better.
◮ This is attributed to the noise in SGD.
◮ In this work we study the anisotropic structure of the SGD noise and its importance for escaping and regularization.
Stochastic gradient descent and its variants
Loss function:
  L(θ) := (1/N) ∑_{i=1}^N ℓ(x_i; θ).

Gradient Langevin dynamics (GLD):
  θ_{t+1} = θ_t − η∇_θ L(θ_t) + η ǫ_t,   ǫ_t ∼ N(0, σ_t² I).

Stochastic gradient descent (SGD):
  θ_{t+1} = θ_t − η g̃(θ_t),   g̃(θ_t) = (1/m) ∑_{x∈B_t} ∇_θ ℓ(x; θ_t).

The structure of SGD noise:
  g̃(θ_t) ∼ N(∇L(θ_t), Σ^{sgd}(θ_t)),
  Σ^{sgd}(θ_t) ≈ (1/m) [ (1/N) ∑_{i=1}^N ∇ℓ(x_i; θ_t) ∇ℓ(x_i; θ_t)^T − ∇L(θ_t) ∇L(θ_t)^T ].

SGD reformulation:
  θ_{t+1} = θ_t − η∇L(θ_t) + η ǫ_t,   ǫ_t ∼ N(0, Σ^{sgd}(θ_t)).
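For concreteness, Σ^{sgd}(θ_t) can be estimated from per-sample gradients. Below is a minimal NumPy sketch (ours, not from the slides), assuming the per-sample gradients at θ_t are stacked as the rows of a matrix G:

```python
import numpy as np

def sgd_noise_covariance(G, m):
    """Estimate the SGD noise covariance Sigma_sgd(theta) from per-sample gradients.

    G : (N, D) array whose i-th row is the per-sample gradient grad l(x_i; theta).
    m : mini-batch size.
    Returns Sigma_sgd ~= (1/m) * [ (1/N) sum_i g_i g_i^T - grad L grad L^T ].
    """
    N, _ = G.shape
    full_grad = G.mean(axis=0)                 # grad L(theta)
    second_moment = G.T @ G / N                # (1/N) sum_i g_i g_i^T
    return (second_moment - np.outer(full_grad, full_grad)) / m
```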
GD with unbiased noise
  θ_{t+1} = θ_t − η∇_θ L(θ_t) + ǫ_t,   ǫ_t ∼ N(0, Σ_t).   (1)

Iteration (1) can be viewed as a discretization of the following continuous stochastic differential equation (SDE):

  dθ_t = −∇_θ L(θ_t) dt + Σ_t^{1/2} dW_t.   (2)

Next we study the role of the noise structure Σ_t by analyzing the continuous SDE (2).
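To make iteration (1) concrete, here is a minimal sketch (our own, not the authors' code) of a single update with an arbitrary noise covariance; `grad_fn` and the flat parameter vector `theta` are illustrative assumptions:

```python
import numpy as np

def noisy_gd_step(theta, grad_fn, Sigma, eta, rng):
    """One step of iteration (1): theta <- theta - eta * grad L(theta) + eps, eps ~ N(0, Sigma).

    Choosing Sigma = eta**2 * Sigma_sgd recovers the SGD reformulation above,
    and Sigma = sigma_t**2 * I recovers GLD.
    """
    eps = rng.multivariate_normal(np.zeros(theta.size), Sigma)
    return theta - eta * grad_fn(theta) + eps
```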
Escaping efficiency
Definition (Escaping efficiency)
Suppose the SDE (2) is initialized at a minimum θ_0. For a fixed time t small enough, the escaping efficiency is defined as the increase of the loss potential:

  E_{θ_t}[L(θ_t) − L(θ_0)].   (3)

Under suitable approximations, we can compute the escaping efficiency for SDE (2):

  E[L(θ_t) − L(θ_0)] = ∫_0^t ( −E[∇L^T ∇L] + ½ E[Tr(H_t Σ_t)] ) dt   (4)
                     ≈ ¼ Tr( (I − e^{−2Ht}) Σ )
                     ≈ (t/2) Tr(HΣ).   (5)

Thus Tr(HΣ) serves as an important indicator for measuring the escaping behavior of noises with different structures.
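The small-t approximation (5) can be checked numerically: simulate the SDE (2) around a quadratic minimum L(θ) = ½θᵀHθ with Euler–Maruyama and compare the mean loss increase with (t/2)·Tr(HΣ). A self-contained sketch under these assumptions (ours, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
D, t, n_steps, n_paths = 10, 0.01, 100, 2000
dt = t / n_steps

H = np.diag(np.linspace(0.1, 5.0, D))        # quadratic loss L(theta) = 0.5 * theta^T H theta
Sigma = np.diag(rng.uniform(0.1, 1.0, D))    # a fixed (here diagonal) noise covariance
Sigma_sqrt = np.sqrt(Sigma)                  # valid square root because Sigma is diagonal

theta = np.zeros((n_paths, D))               # every path starts at the minimum theta_0 = 0
for _ in range(n_steps):
    dW = rng.normal(size=(n_paths, D)) * np.sqrt(dt)
    theta += -theta @ H * dt + dW @ Sigma_sqrt          # Euler-Maruyama step of SDE (2)

mean_loss_increase = 0.5 * np.einsum('nd,de,ne->n', theta, H, theta).mean()
print(mean_loss_increase, 0.5 * t * np.trace(H @ Sigma))  # the two values should be close for small t
```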
Factors affecting the escaping behavior
The noise scale. For Gaussian noise ǫ_t ∼ N(0, Σ_t), its scale can be measured by E[ǫ_t^T ǫ_t] = Tr(Σ_t). Based on Tr(HΣ), the larger the noise scale, the faster the escaping. To eliminate the impact of the noise scale, assume that at any given time t,

  Tr(Σ_t) is constant.   (6)

The ill-conditioning of minima. For minima whose Hessian is a scalar matrix, H_t = λI, noises of the same magnitude make no difference, since Tr(H_t Σ_t) = λ Tr(Σ_t).

The structure of the noise. For ill-conditioned minima, the structure of the noise plays an important role in the escaping!
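A tiny numeric illustration of these three points (ours, with a hypothetical H and Σ), under the constraint (6) that Tr(Σ) is held fixed:

```python
import numpy as np

D = 100
trace_budget = 1.0                                   # constraint (6): Tr(Sigma) is fixed

# Ill-conditioned Hessian: one large eigenvalue, the rest tiny.
H = np.diag(np.r_[10.0, np.full(D - 1, 1e-3)])

Sigma_iso = np.eye(D) * trace_budget / D             # isotropic noise
Sigma_aligned = np.zeros((D, D))                     # all variance on the sharp direction
Sigma_aligned[0, 0] = trace_budget

print(np.trace(H @ Sigma_iso))       # ~0.1 : slow escaping
print(np.trace(H @ Sigma_aligned))   # 10.0 : much faster escaping at the same noise scale

# For a scalar Hessian H = lambda * I, the structure is irrelevant: both give lambda * Tr(Sigma).
lam = 2.0
print(np.trace(lam * np.eye(D) @ Sigma_iso), np.trace(lam * np.eye(D) @ Sigma_aligned))
```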
The impact of noise structure
Proposition 1
Let H ∈ R^{D×D} and Σ ∈ R^{D×D} be positive semi-definite. Suppose that:

1. H is ill-conditioned. Let λ_1 ≥ λ_2 ≥ · · · ≥ λ_D be the eigenvalues of H in descending order; for some constant k ≪ D and d > 1/2, the eigenvalues satisfy
     λ_1 > 0,   λ_{k+1}, λ_{k+2}, …, λ_D < λ_1 D^{−d};   (7)

2. Σ is “aligned” with H. Let u_i be the unit eigenvector corresponding to eigenvalue λ_i; for some projection coefficient a > 0,
     u_1^T Σ u_1 ≥ a λ_1 Tr(Σ) / Tr(H).   (8)

Then, for such an anisotropic Σ and its isotropic equivalent Σ̄ = (Tr(Σ)/D) I under constraint (6), the following ratio describes their difference in escaping efficiency:

  Tr(HΣ) / Tr(HΣ̄) = O(a D^{2d−1}),   d > 1/2.   (9)
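The ratio (9) can be checked numerically by constructing an H satisfying (7) and a Σ satisfying (8) with eigenvectors along the standard basis. A sketch under these simplifying assumptions (ours, not the authors' code):

```python
import numpy as np

def escaping_ratio(D, d=1.0, k=1, a=1.0):
    """Tr(H Sigma) / Tr(H Sigma_bar) for an H satisfying (7) and a Sigma satisfying (8)."""
    # Eigenvalues of H: lambda_1 = 1, the remaining D-k eigenvalues below lambda_1 * D^(-d).
    eigs = np.r_[np.ones(k), np.full(D - k, 0.5 * D ** (-d))]
    H = np.diag(eigs)                                # eigenvectors are the standard basis, u_1 = e_1
    # Sigma with Tr(Sigma) = 1 and u_1^T Sigma u_1 = a * lambda_1 * Tr(Sigma) / Tr(H), i.e. condition (8).
    s1 = a * eigs[0] / eigs.sum()
    Sigma = np.diag(np.r_[s1, np.full(D - 1, (1 - s1) / (D - 1))])
    Sigma_bar = np.eye(D) * np.trace(Sigma) / D      # isotropic equivalent with the same trace
    return np.trace(H @ Sigma) / np.trace(H @ Sigma_bar)

d = 1.0
for D in (100, 1000, 10000):
    print(D, escaping_ratio(D, d=d), D ** (2 * d - 1))   # the ratio grows on the order of D^(2d-1)
```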
Analyze the noise of SGD via Proposition 1
By Proposition 1, anisotropic noise satisfying the two conditions indeed helps escape from ill-conditioned minima. Thus, to see the importance of the SGD noise, we only need to show that it meets the two conditions.
◮ Condition 1 naturally holds for neural networks, thanks to their over-parameterization!
◮ See Proposition 2 below for the second condition.
SGD noise and Hessian
Proposition 2
Consider a binary classification problem with data {(x_i, y_i)}_{i∈I}, y ∈ {0, 1}, and the squared loss

  L(θ) = E_{(x,y)}[ (φ ∘ f(x; θ) − y)² ],

where f denotes the network and φ is a threshold activation function,

  φ(f) = min{ max{f, δ}, 1 − δ },   (10)

with δ a small positive constant. Suppose the network f satisfies:

1. it has one hidden layer and piecewise-linear activation;
2. the parameters of its output layer are fixed during training.

Then there is a constant a > 0 such that, for θ close enough to a minimum θ*,

  u(θ)^T Σ(θ) u(θ) ≥ a λ(θ) Tr(Σ(θ)) / Tr(H(θ))   (11)

holds almost everywhere, where λ(θ) and u(θ) are the maximal eigenvalue of the Hessian H(θ) and its corresponding eigenvector.
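The alignment condition (11) is what the projection coefficient â in the FashionMNIST experiments below estimates. A minimal sketch (ours), assuming the noise covariance Σ and the Hessian H are available as dense symmetric matrices:

```python
import numpy as np

def projection_coefficient(Sigma, H):
    """Estimate a_hat = (u1^T Sigma u1) * Tr(H) / (lambda_1 * Tr(Sigma)),
    where lambda_1, u1 are the top eigenpair of the symmetric Hessian H."""
    eigvals, eigvecs = np.linalg.eigh(H)          # eigenvalues in ascending order
    lam1, u1 = eigvals[-1], eigvecs[:, -1]
    return (u1 @ Sigma @ u1) * np.trace(H) / (lam1 * np.trace(Sigma))
```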
Examples of different noise structures
Table: Compared dynamics defined in Eq. (1).

◮ SGD: ǫ_t ∼ N(0, Σ^{sgd}_t). Σ^{sgd}_t is the gradient covariance matrix.
◮ GLD constant: ǫ_t ∼ N(0, ϱ_t² I). ϱ_t is a tunable constant.
◮ GLD dynamic: ǫ_t ∼ N(0, σ_t² I). σ_t is adjusted to force the noise to share the same magnitude as the SGD noise, similarly hereinafter.
◮ GLD diagonal: ǫ_t ∼ N(0, diag(Σ^{sgd}_t)). diag(Σ^{sgd}_t) is the diagonal of the covariance of the SGD noise Σ^{sgd}_t.
◮ GLD leading: ǫ_t ∼ N(0, σ_t Σ̃_t). Σ̃_t is the best low-rank approximation of Σ^{sgd}_t.
◮ GLD Hessian: ǫ_t ∼ N(0, σ_t H̃_t). H̃_t is the best low-rank approximation of the Hessian.
◮ GLD 1st eigven(H): ǫ_t ∼ N(0, σ_t λ_1 u_1 u_1^T). λ_1, u_1 are the maximal eigenvalue and its corresponding unit eigenvector of the Hessian.
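The covariances in the table can be written down directly from Σ^{sgd}_t and the Hessian. Below is a sketch (ours, not the authors' code) of how each compared noise covariance might be constructed; the rank parameter k and the magnitude-matching factor sigma are illustrative stand-ins:

```python
import numpy as np

def low_rank(M, k):
    """Best rank-k approximation of a symmetric PSD matrix M (top-k eigen-directions)."""
    vals, vecs = np.linalg.eigh(M)
    vals, vecs = vals[-k:], vecs[:, -k:]
    return (vecs * vals) @ vecs.T

def noise_covariances(Sigma_sgd, H, rho, sigma, k=20):
    """Noise covariances of the compared dynamics, each plugged into Eq. (1)."""
    D = Sigma_sgd.shape[0]
    vals, vecs = np.linalg.eigh(H)
    lam1, u1 = vals[-1], vecs[:, -1]
    return {
        "SGD":               Sigma_sgd,
        "GLD constant":      rho ** 2 * np.eye(D),
        "GLD dynamic":       sigma ** 2 * np.eye(D),      # sigma tuned to match the SGD noise magnitude
        "GLD diagonal":      np.diag(np.diag(Sigma_sgd)),
        "GLD leading":       sigma * low_rank(Sigma_sgd, k),
        "GLD Hessian":       sigma * low_rank(H, k),
        "GLD 1st eigven(H)": sigma * lam1 * np.outer(u1, u1),
    }
```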
2-D toy example
Figure: 2-D toy example. The compared dynamics (GD, GLD const, GLD diag, GLD leading, GLD Hessian, GLD 1st eigven(H)) are initialized at the sharp minimum. Left: the trajectory of each compared dynamics escaping from the sharp minimum in one run, plotted over the loss contours in the (w1, w2) plane. Right: success rate (%) of arriving at the flat solution in 100 repeated runs.
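The exact 2-D loss from the slides is not reproduced here; as a qualitative stand-in, the sketch below (ours) uses a hypothetical loss with one sharp and one flat minimum, runs iteration (1) from the sharp minimum with noise covariances of equal trace, and counts how often the iterate ends up in the flat basin:

```python
import numpy as np

rng = np.random.default_rng(0)

H_sharp = np.diag([100.0, 1.0])           # sharp, ill-conditioned minimum at (0, 0)
H_flat = np.diag([1.0, 1.0])              # flat minimum at (1, 0)
m_flat = np.array([1.0, 0.0])

def grad(w):
    """Gradient of the piecewise loss L(w) = min(w^T H_sharp w, (w - m_flat)^T H_flat (w - m_flat))."""
    if w @ H_sharp @ w <= (w - m_flat) @ H_flat @ (w - m_flat):
        return 2 * H_sharp @ w
    return 2 * H_flat @ (w - m_flat)

def escape_rate(Sigma, eta=1e-3, steps=200, runs=100):
    """Fraction of runs that end in the flat basin when started at the sharp minimum."""
    Sigma_sqrt = np.linalg.cholesky(Sigma + 1e-12 * np.eye(2))
    successes = 0
    for _ in range(runs):
        w = np.zeros(2)
        for _ in range(steps):
            w = w - eta * grad(w) + Sigma_sqrt @ rng.normal(size=2)
        successes += int(w @ H_sharp @ w > (w - m_flat) @ H_flat @ (w - m_flat))
    return successes / runs

trace_budget = 1e-3
iso = np.eye(2) * trace_budget / 2            # isotropic noise (GLD-const style)
aligned = np.diag([trace_budget, 0.0])        # anisotropic noise along the sharp direction
print(escape_rate(iso), escape_rate(aligned)) # the aligned noise escapes more often
```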
One hidden layer network
Figure: One-hidden-layer neural networks, Tr(H_t Σ_t) and Tr(H_t Σ̄_t) versus iteration. The solid and dotted lines represent the values of Tr(HΣ) and Tr(HΣ̄), respectively. The number of hidden nodes varies in {32, 128, 512}.
FashionMNIST experiments
Figure: FashionMNIST experiments. Left: the first 400 eigenvalues of the Hessian at θ*_GD, the sharp minimum found by GD after 3000 iterations. Middle: the estimate of the projection coefficient in Proposition 1, â = u_1^T Σ u_1 · Tr(H) / (λ_1 Tr(Σ)), versus iteration. Right: Tr(H_t Σ_t) versus Tr(H_t Σ̄_t) during SGD optimization initialized from θ*_GD, where Σ̄_t = (Tr(Σ_t)/D) I denotes the isotropic equivalent of the SGD noise.
FashionMNIST experiments
Figure: FashionMNIST experiments. The compared dynamics are initialized at θ*_GD found by GD, marked by the vertical dashed line at iteration 3000. Left: test accuracy (%) versus iteration; final accuracies are GD 66.22, GLD const 66.33, GLD dynamic 66.17, GLD diag 66.41, GLD leading 68.83, GLD Hessian 68.74, GLD 1st eigven(H) 68.90, SGD 68.79. Right: expected sharpness versus iteration. Expected sharpness (the higher, the sharper) is measured as E_{ν∼N(0,δ²I)}[L(θ + ν)] − L(θ), with δ = 0.01; the expectation is estimated by sampling ν.
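The expected sharpness above can be estimated by Monte Carlo sampling of the perturbation ν. A minimal sketch (ours), assuming `loss_fn` evaluates the training loss at a flat parameter vector:

```python
import numpy as np

def expected_sharpness(loss_fn, theta, delta=0.01, n_samples=100, rng=None):
    """Monte Carlo estimate of E_{nu ~ N(0, delta^2 I)}[L(theta + nu)] - L(theta)."""
    rng = np.random.default_rng() if rng is None else rng
    base = loss_fn(theta)
    perturbed = [loss_fn(theta + delta * rng.standard_normal(theta.shape))
                 for _ in range(n_samples)]
    return float(np.mean(perturbed)) - base
```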