SLIDE 1

Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

Yuan Cao and Quanquan Gu

Computer Science Department

SLIDE 2

Learning Over-parameterized DNNs

Empirical observation on extremely wide deep neural networks (Zhang et al. 2017; Bartlett et al. 2017; Neyshabur et al. 2018; Arora et al. 2019)

SLIDE 3

Learning Over-parameterized DNNs

Empirical observation on extremely wide deep neural networks (Zhang et al. 2017; Bartlett et al. 2017; Neyshabur et al. 2018; Arora et al. 2019)

◮ Why can extremely wide neural networks generalize?
◮ What data can be learned by deep and wide neural networks?

SLIDE 4

Learning Over-parameterized DNNs

◮ Fully connected neural network with width m: f_W(x) = √m · W_L σ(W_{L−1} · · · σ(W_1 x) · · · ).
◮ σ(·) is the ReLU activation function: σ(t) = max(0, t).
◮ Loss on example (x_i, y_i): L_{(x_i,y_i)}(W) = ℓ[y_i · f_W(x_i)], where ℓ(z) = log(1 + exp(−z)).
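As a concrete illustration, here is a minimal JAX sketch of this forward pass and loss. The parameter layout, layer shapes, and function names are assumptions made for illustration, not the authors' code:

```python
import jax.numpy as jnp

def forward(params, x):
    # params = [W_1, ..., W_{L-1}, W_L]; hidden layers have width m, output is a scalar.
    *hidden, W_L = params
    h = x
    for W in hidden:
        h = jnp.maximum(W @ h, 0.0)        # ReLU: sigma(t) = max(0, t)
    m = W_L.shape[1]                       # network width
    return jnp.sqrt(m) * (W_L @ h)[0]      # f_W(x) = sqrt(m) * W_L sigma(W_{L-1} ... sigma(W_1 x))

def loss(params, x, y):
    # L_{(x_i, y_i)}(W) = l[y_i * f_W(x_i)] with l(z) = log(1 + exp(-z))
    return jnp.log1p(jnp.exp(-y * forward(params, x)))
```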

SLIDE 5

Learning Over-parameterized DNNs

Setup as on Slide 4: width-m fully connected ReLU network f_W(x) with loss L_{(x_i,y_i)}(W) = ℓ[y_i · f_W(x_i)].

Algorithm: SGD for DNNs starting at Gaussian initialization
  Initialize W_l^(0) ∼ N(0, 2/m) for l ∈ [L − 1], and W_L^(0) ∼ N(0, 1/m).
  for i = 1, 2, . . . , n do
    Draw (x_i, y_i) from D.
    Update W^(i) = W^(i−1) − η · ∇_W L_{(x_i,y_i)}(W^(i−1)).
  end for
  Output: randomly choose Ŵ uniformly from {W^(0), . . . , W^(n−1)}.
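A sketch of this one-pass (online) SGD in JAX, reusing the hypothetical `forward` and `loss` from the previous sketch; storing every iterate is wasteful but mirrors the algorithm's random-iterate output:

```python
from jax import grad, random
import jax.numpy as jnp

def init_params(key, d, m, L):
    # Entries of W_l^(0) ~ N(0, 2/m) for l in [L-1]; entries of W_L^(0) ~ N(0, 1/m).
    keys = random.split(key, L)
    params = [random.normal(keys[0], (m, d)) * jnp.sqrt(2.0 / m)]
    params += [random.normal(k, (m, m)) * jnp.sqrt(2.0 / m) for k in keys[1:L - 1]]
    params += [random.normal(keys[L - 1], (1, m)) * jnp.sqrt(1.0 / m)]
    return params

def sgd(key, data, d, m, L, eta):
    # data = [(x_1, y_1), ..., (x_n, y_n)], each example drawn fresh from D.
    init_key, pick_key = random.split(key)
    params = init_params(init_key, d, m, L)
    iterates = [params]                                 # W^(0)
    for x, y in data:
        g = grad(loss)(params, x, y)                    # gradient of L_{(x_i,y_i)} at W^(i-1)
        params = [W - eta * gW for W, gW in zip(params, g)]
        iterates.append(params)                         # W^(i)
    # Output: choose uniformly from {W^(0), ..., W^(n-1)}.
    i = int(random.randint(pick_key, (), 0, len(data)))
    return iterates[i]
```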

SLIDE 6

Generalization Bounds for DNNs

Theorem

For any R > 0, if m ≥ Ω(poly(R, L, n)), then with high probability, SGD returns Ŵ that satisfies

  E[L^{0−1}_D(Ŵ)] ≤ inf_{f ∈ F(W^(0), R)} (4/n) Σ_{i=1}^n ℓ[y_i · f(x_i)] + O( LR/√n + √(log(1/δ)/n) ),

where F(W^(0), R) = { f_{W^(0)}(·) + ⟨∇_W f_{W^(0)}(·), W⟩ : ‖W_l‖_F ≤ R · m^{−1/2}, l ∈ [L] }.

SLIDE 7

Generalization Bounds for DNNs

Theorem (restated from Slide 6)

The benchmark function class in the bound,

  F(W^(0), R) = { f_{W^(0)}(·) + ⟨∇_W f_{W^(0)}(·), W⟩ : ‖W_l‖_F ≤ R · m^{−1/2}, l ∈ [L] },

is the Neural Tangent Random Feature (NTRF) model: the network's first-order expansion around its random initialization.
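A hedged JAX sketch of an NTRF prediction, reusing the hypothetical `forward` above; here `W` is an assumed list of per-layer matrices playing the role of the displacement with ‖W_l‖_F ≤ R · m^{−1/2}:

```python
from jax import jvp

def ntrf_predict(params0, W, x):
    # f_{W^(0)}(x) + <grad_W f_{W^(0)}(x), W>, computed as a Jacobian-vector product
    # of the network at its random initialization params0.
    f0, directional = jvp(lambda p: forward(p, x), (params0,), (W,))
    return f0 + directional
```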

SLIDE 8

Generalization Bounds for DNNs

Corollary

Let y = (y_1, . . . , y_n)^⊤ and λ_0 = λ_min(Θ^(L)). If m ≥ Ω(poly(L, n, λ_0^{−1})), then with high probability, SGD returns Ŵ that satisfies

  E[L^{0−1}_D(Ŵ)] ≤ O( L · √( inf_{ỹ: ỹ_i y_i ≥ 1} ỹ^⊤(Θ^(L))^{−1} ỹ / n ) ) + O( √(log(1/δ)/n) ),

where Θ^(L) is the neural tangent kernel (Jacot et al. 2018) Gram matrix: Θ^(L)_{i,j} := lim_{m→∞} m^{−1} ⟨∇_W f_{W^(0)}(x_i), ∇_W f_{W^(0)}(x_j)⟩.
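For intuition, the finite-width counterpart of this Gram matrix can be computed directly from the gradients at initialization. A sketch under the same assumptions as the earlier snippets (exact only in the m → ∞ limit, and quadratic in n):

```python
from jax import grad
import jax.numpy as jnp

def ntk_gram(params0, xs):
    # Theta_{ij} ~= m^{-1} <grad_W f_{W^(0)}(x_i), grad_W f_{W^(0)}(x_j)> at finite width m.
    m = params0[-1].shape[1]
    feats = []
    for x in xs:
        g = grad(forward)(params0, x)                       # per-layer gradient matrices
        feats.append(jnp.concatenate([gl.ravel() for gl in g]))
    F = jnp.stack(feats)                                    # shape (n, #parameters)
    return (F @ F.T) / m
```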

SLIDE 9

Generalization Bounds for DNNs

Corollary

Let y = (y1, . . . , yn)⊤ and λ0 = λmin(Θ(L)). If m ≥ Ω

  • poly(L, n, λ−1

0 )

  • , then with high probability,

SGD returns W that satisfies E

  • L0−1

D

( W)

O

  • L ·

inf

  • yiyi≥1
  • y⊤(Θ(L))−1

y n

  • + O
  • log(1/δ)

n

  • .

where Θ(L) is the neural tangent kernel (Jacot et al. 2018) Gram matrix. Θ(L)

i,j := limm→∞ m−1∇WfW(0)(xi), ∇WfW(0)(xj).

The “classifiability” of the underlying data distribution D can also be measured by the quantity inf

yiyi≥1

  • y⊤(Θ(L))−1

y.
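Since the true label vector y (with y_i ∈ {±1}) is itself feasible for the constraint ỹ_i y_i ≥ 1, this quantity is at most y^⊤(Θ^(L))^{−1} y, which gives a simple upper bound on the leading term of the corollary. A sketch of that upper bound (ignoring constants and log factors):

```python
import jax.numpy as jnp

def classifiability_upper_bound(Theta, y):
    # Leading term of the bound with the feasible choice y~ = y: sqrt(y^T Theta^{-1} y / n).
    n = y.shape[0]
    return jnp.sqrt(y @ jnp.linalg.solve(Theta, y) / n)
```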

SLIDE 10

Overview of the Proof

Key observations

◮ Deep ReLU networks are almost linear in terms of their parameters in a small neighbourhood around random initialization:
  f_{W′}(x_i) ≈ f_W(x_i) + ⟨∇f_W(x_i), W′ − W⟩.
◮ L_{(x_i,y_i)}(W) is Lipschitz continuous and almost convex:
  ‖∇_{W_l} L_{(x_i,y_i)}(W)‖_F ≤ O(√m), l ∈ [L],
  L_{(x_i,y_i)}(W′) ≳ L_{(x_i,y_i)}(W) + ⟨∇_W L_{(x_i,y_i)}(W), W′ − W⟩.
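The first observation can be checked numerically. A sketch reusing the hypothetical `forward` from Slide 4; the returned gap should shrink as the width m grows and as W′ stays close to W:

```python
from jax import jvp

def linearization_gap(params, params_prime, x):
    # f_{W'}(x) - [ f_W(x) + <grad f_W(x), W' - W> ]
    delta = [Wp - W for Wp, W in zip(params_prime, params)]
    f_w, first_order = jvp(lambda p: forward(p, x), (params,), (delta,))
    return forward(params_prime, x) - (f_w + first_order)
```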

SLIDE 11

Overview of the Proof

Key observations

(Same two observations as on Slide 10.)

Optimization for Lipschitz and (almost) convex functions + online-to-batch conversion.

SLIDE 12

Overview of the Proof

Key observations

(Same two observations as on Slide 10.)

Applicable to general loss functions: if ℓ(·) is convex/Lipschitz/smooth, then L_{(x_i,y_i)}(W) is (almost) convex/Lipschitz/smooth.

SLIDE 13

Summary

◮ Generalization bounds for wide DNNs that do not increase in network width.
◮ A random feature model (NTRF model) that naturally connects over-parameterized DNNs with NTK.
◮ A quantification of the "classifiability" of data: inf_{ỹ: ỹ_i y_i ≥ 1} ỹ^⊤(Θ^(L))^{−1} ỹ.
◮ A clean and simple proof framework for neural networks in the "NTK regime" that is applicable to various problem settings.

SLIDE 14

Summary


Thank you!

Poster #141
