SLIDE 1 Over-parameterized nonlinear learning: Gradient descent follows the shortest path?
Samet Oymak and Mahdi Soltanolkotabi
Department of Electrical and Computer Engineering
June 2019
SLIDE 6 Motivation
Modern learning (e.g. deep learning) involves fitting nonlinear models.
Mystery: # of parameters >> # of training data
Challenges:
- Optimization: Why can you find a global optimum despite nonconvexity?
- Generalization: Why is the global optimum any good for prediction?
SLIDE 11 Prelude: over-parametrized linear least-squares
$$\min_{\theta \in \mathbb{R}^p} \mathcal{L}(\theta) := \frac{1}{2}\|X\theta - y\|_{\ell_2}^2$$
with $X \in \mathbb{R}^{n \times p}$ and $n \le p$.
Gradient descent starting from $\theta_0$ has three properties:
- Global convergence
- Converges to the closest global optimum to $\theta_0$
- Follows a direct trajectory
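To make these three properties concrete, here is a minimal NumPy sketch (not from the slides; the random problem instance and step size are illustrative choices). For linear least squares the global optimum closest to $\theta_0$ has the closed form $\theta_0 + X^{\dagger}(y - X\theta_0)$, so we can check both claims directly:

```python
# Gradient descent on an over-parameterized linear least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                            # n <= p: over-parameterized
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

theta0 = rng.standard_normal(p)           # arbitrary initialization
theta = theta0.copy()
eta = 1.0 / np.linalg.norm(X, 2) ** 2     # step size 1 / sigma_max(X)^2

for _ in range(5000):
    theta -= eta * X.T @ (X @ theta - y)  # gradient of (1/2)||X theta - y||^2

# Closest global optimum to theta0 in Euclidean distance.
theta_star = theta0 + np.linalg.pinv(X) @ (y - X @ theta0)

print("training error:", np.linalg.norm(X @ theta - y))               # ~0
print("dist to closest optimum:", np.linalg.norm(theta - theta_star)) # ~0
```

The iterates stay in the affine space $\theta_0 + \text{rowspace}(X)$, which is why gradient descent lands at the projection of $\theta_0$ onto the solution set rather than some other global optimum.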
SLIDE 14 Over-parametrized nonlinear least-squares
$$\min_{\theta \in \mathbb{R}^p} \mathcal{L}(\theta) := \frac{1}{2}\|f(\theta) - y\|_{\ell_2}^2,$$
where $y := [y_1\; y_2\; \dots\; y_n]^T \in \mathbb{R}^n$, $f(\theta) := [f(x_1;\theta)\; f(x_2;\theta)\; \dots\; f(x_n;\theta)]^T \in \mathbb{R}^n$, and $n \le p$.
Run gradient descent: $\theta_{\tau+1} = \theta_\tau - \eta_\tau \nabla\mathcal{L}(\theta_\tau)$
Gradient and Jacobian: $\nabla\mathcal{L}(\theta) = \mathcal{J}(\theta)^T (f(\theta) - y)$, where $\mathcal{J}(\theta) = \frac{\partial f(\theta)}{\partial \theta} \in \mathbb{R}^{n \times p}$ is the Jacobian matrix.
Intuition: the Jacobian replaces the feature matrix $X$.
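The gradient formula above is all that is needed to implement the method. Below is a minimal sketch assuming the model $f(x; \theta) = \tanh(x^T \theta)$, a generalized linear model chosen purely for illustration (the slides do not fix a model here); the Jacobian then has rows $\tanh'(x_i^T\theta)\,x_i^T$, and since $\|\mathcal{J}(\theta)\| \le \|X\|$ for this model, the step size $\eta = 1/\|X\|^2$ is consistent with the $\eta = 1/\beta^2$ choice in the theorem on the next slide:

```python
# Gradient descent on a nonlinear least-squares fit via the Jacobian.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100                            # n <= p: over-parameterized
X = rng.standard_normal((n, p))
y = np.tanh(X @ (rng.standard_normal(p) / np.sqrt(p)))   # realizable labels

def f(theta):                             # f(theta) in R^n
    return np.tanh(X @ theta)

def jacobian(theta):                      # J(theta) in R^{n x p}
    # Row i is tanh'(x_i^T theta) * x_i^T.
    return (1.0 - np.tanh(X @ theta) ** 2)[:, None] * X

theta = np.zeros(p)                       # theta_0
eta = 1.0 / np.linalg.norm(X, 2) ** 2     # eta = 1/||X||^2 <= 1/beta^2

for _ in range(5000):
    r = f(theta) - y                      # residual f(theta) - y
    theta -= eta * jacobian(theta).T @ r  # grad L(theta) = J(theta)^T (f(theta) - y)

print("training error:", np.linalg.norm(f(theta) - y))   # should shrink toward 0
```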
SLIDE 16 Gradient descent trajectory
Assumptions:
- Minimum singular value at initialization: $\sigma_{\min}(\mathcal{J}(\theta_0)) \ge 2\alpha$
- Maximum singular value: $\|\mathcal{J}(\theta)\| \le \beta$
- Jacobian smoothness: $\|\mathcal{J}(\theta_2) - \mathcal{J}(\theta_1)\| \le L\,\|\theta_2 - \theta_1\|_{\ell_2}$
- Initial error: $\|f(\theta_0) - y\|_{\ell_2} \le \frac{\alpha^2}{4L}$

Theorem (Oymak and Soltanolkotabi 2018). Assume the above hold over a ball of radius $R = \frac{\|f(\theta_0) - y\|_{\ell_2}}{\alpha}$ around $\theta_0$ and set $\eta = \frac{1}{\beta^2}$. Then:
- Global convergence: $\|f(\theta_\tau) - y\|_{\ell_2}^2 \le \left(1 - \frac{\alpha^2}{2\beta^2}\right)^{\tau} \|f(\theta_0) - y\|_{\ell_2}^2$
- Converges to near the closest global minimum $\theta^*$ to initialization: $\|\theta_\tau - \theta_0\|_{\ell_2} \le \frac{\beta}{\alpha}\,\|\theta^* - \theta_0\|_{\ell_2}$
- Takes an approximately direct route
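To see where a contraction of this form comes from, here is a heuristic one-step calculation (a sketch of the intuition only, not the proof; it drops the linearization error, which the smoothness and initial-error assumptions control at the cost of constant factors in the rate):

```latex
% Write r_\tau := f(\theta_\tau) - y and J := \mathcal{J}(\theta_\tau).
% Linearizing f around \theta_\tau along the gradient step gives
\begin{align*}
r_{\tau+1} &= f\big(\theta_\tau - \eta J^T r_\tau\big) - y
  \;\approx\; r_\tau - \eta J J^T r_\tau
  \;=\; \big(I - \eta J J^T\big)\, r_\tau, \\
\|r_{\tau+1}\|_{\ell_2} &\le \big\|I - \eta J J^T\big\| \, \|r_\tau\|_{\ell_2}
  \;\le\; \Big(1 - \frac{\alpha^2}{\beta^2}\Big) \|r_\tau\|_{\ell_2},
\end{align*}
% using \eta = 1/\beta^2 and \alpha \le \sigma_{\min}(J) \le \|J\| \le \beta, which
% hold on the radius-R ball: the smoothness bound L R \le \alpha/4 keeps
% \sigma_{\min}(\mathcal{J}(\theta)) \ge 2\alpha - \alpha/4 \ge \alpha throughout.
```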
SLIDE 17 Concrete example: One-hidden layer neural net
Training data: $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
Loss: $\mathcal{L}(v, W) := \frac{1}{2}\sum_{i=1}^{n} \left(y_i - v^T \phi(W x_i)\right)^2$
Algorithm: gradient descent with random Gaussian initialization
Theorem (Oymak and Soltanolkotabi 2019). As long as # parameters ≳ (# of training data)², with high probability:
- Zero training error: $\mathcal{L}(v_\tau, W_\tau) \le (1 - \rho)^{\tau}\, \mathcal{L}(v_0, W_0)$
- Iterates remain close to initialization
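A minimal training sketch of this setting follows (assumptions flagged: the $\tanh$ activation, the initialization scaling, and the step-size heuristic are illustrative choices of mine; the slides only specify gradient descent from random Gaussian initialization). It fits $n$ points with far more than $n^2$ parameters and reports both the training loss and how far the weights moved from initialization:

```python
# One-hidden-layer net x -> v^T phi(W x) trained by gradient descent.
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 10, 5, 200                      # k*(d+1) = 1200 params >> n^2 = 100

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

phi = np.tanh                             # smooth activation (an assumption)
def dphi(z):
    return 1.0 - np.tanh(z) ** 2

W = rng.standard_normal((k, d)) / np.sqrt(d)   # random Gaussian initialization
v = rng.standard_normal(k) / np.sqrt(k)
W0, v0 = W.copy(), v.copy()

def loss(v, W):
    return 0.5 * np.sum((phi(X @ W.T) @ v - y) ** 2)

eta = 0.5 / np.linalg.norm(phi(X @ W0.T), 2) ** 2   # conservative step size

for _ in range(5000):
    Z = X @ W.T                           # (n, k) pre-activations
    r = phi(Z) @ v - y                    # residuals f(v, W) - y
    grad_v = phi(Z).T @ r                 # dL/dv, shape (k,)
    grad_W = (dphi(Z) * np.outer(r, v)).T @ X   # dL/dW, shape (k, d)
    v -= eta * grad_v
    W -= eta * grad_W

print("training loss:", loss(v, W))            # should be near zero
print("||W - W0||_F:", np.linalg.norm(W - W0)) # should stay small
print("||v - v0||:", np.linalg.norm(v - v0))   # should stay small
```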
SLIDE 18 Further results and applications
- Extensions to SGD and other loss functions
- Theoretical justification for: early stopping, robustness to label noise, generalization
- Other applications: fitting generalized linear models, low-rank matrix recovery
SLIDE 19 Conclusion
(Stochastic) gradient descent has three intriguing properties:
- Global convergence
- Converges to near the closest global optimum to initialization
- Follows a direct trajectory
SLIDE 20 Thanks!
Poster: Thursday, 6:30 PM, #95
References:
- Over-parametrized nonlinear learning: Gradient descent follows the shortest path? S. Oymak and M. Soltanolkotabi.
- Towards moderate overparameterization: Global convergence guarantees for training shallow neural networks. S. Oymak and M. Soltanolkotabi.
- Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. M. Li, M. Soltanolkotabi, and S. Oymak.
- Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian. S. Oymak, Z. Fabian, M. Li, and M. Soltanolkotabi.