SLIDE 1 Towards Demystifying Overparameterization in Deep Learning
Mahdi Soltanolkotabi, Department of Electrical and Computer Engineering
April 4, 2019, Mathematics of Imaging Workshop #3, Henri Poincaré Institute
SLIDE 2
Collaborators: Samet Oymak and Mingchen Li
SLIDE 3
Motivation (Theory)
SLIDE 4
Many success stories
Neural networks very effective at learning from data
SLIDE 5
Lots of hype
SLIDE 6
Some failures
SLIDE 7
Need more principled understanding
Deep learning-based AI is increasingly used in human-facing services.
Challenges:
Optimization: Why can they fit?
Generalization: Why can they predict?
Architecture: Which neural nets?
SLIDE 8
This talk: Overparameterization without overfitting
Mystery # of parameters >> # training data
SLIDE 9
Surprising experiment I (stolen from B. Recht)
p parameters, n = 50,000 training samples, d = 3072 feature size, and 10 classes
SLIDE 10
Surprising experiment II-Overfitting to corruption
Add corruption:
Corrupt a fraction of the training labels by replacing each with another label chosen at random
No corruption on test labels
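For concreteness, a minimal sketch of this corruption procedure (the NumPy implementation, the function name corrupt_labels, and the fraction rho are illustrative assumptions, not the experiment's actual code):

```python
import numpy as np

def corrupt_labels(y, num_classes, rho, seed=0):
    """Replace a fraction rho of the integer labels y with a different random label."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    n = len(y)
    idx = rng.choice(n, size=int(rho * n), replace=False)   # which labels to corrupt
    shift = rng.integers(1, num_classes, size=len(idx))      # guarantees a *different* label
    y[idx] = (y[idx] + shift) % num_classes
    return y

# e.g. corrupt_labels(y_train, 10, rho=0.3) corrupts 30% of CIFAR-10 training labels;
# the test labels are left untouched.
```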
SLIDE 11
Surprising experiment III-Robustness
Repeat the same experiment but stop early
SLIDE 12
Surprising experiment III-Robustness
Repeat the same experiment but stop early
SLIDE 13
Benefits of overparameterization for neural networks
Benefit I: Tractable nonconvex optimization Benefit II: Robustness to corruption with early stopping
SLIDE 14
Benefit I: Tractable nonconvex optimization
SLIDE 15
One-hidden layer
y_i = vᵀ φ(W x_i)
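A minimal NumPy sketch of this one-hidden-layer model and the least-squares objective used in the rest of the talk (the softplus activation and the function names are illustrative choices):

```python
import numpy as np

def phi(z):                      # smooth activation (softplus), illustrative choice
    return np.logaddexp(0.0, z)

def predict(W, v, X):
    """f(W, x_i) = v^T phi(W x_i) for every row x_i of X."""
    return phi(X @ W.T) @ v      # shape (n,)

def loss(W, v, X, y):
    """L(W) = 1/2 * sum_i (v^T phi(W x_i) - y_i)^2."""
    r = predict(W, v, X) - y
    return 0.5 * np.dot(r, r)
```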
SLIDE 16 Theory for smooth activations
Data set {(x_i, y_i)}_{i=1}^n with ‖x_i‖_ℓ2 = 1
min_W L(W) := (1/2) Σ_{i=1}^n (vᵀ φ(W x_i) − y_i)²
[plot: a smooth activation φ]
SLIDE 17 Theory for smooth activations
Data set {(x_i, y_i)}_{i=1}^n with ‖x_i‖_ℓ2 = 1
min_W L(W) := (1/2) Σ_{i=1}^n (vᵀ φ(W x_i) − y_i)²
[plot: a smooth activation φ]
Set v at random or balanced (half +, half −)
Run gradient descent W_{τ+1} = W_τ − µ_τ ∇L(W_τ) with random initialization
Theorem (Oymak and Soltanolkotabi 2019)
Assume
Smooth activation: |φ′(z)| ≤ B and |φ′′(z)| ≤ B
Overparameterization: √(kd) ≳ κ(X) n
Initialization W_0 with i.i.d. N(0, 1) entries
Then, with high probability
Zero training error: L(W_τ) ≤ (1 − c λ(X)/n)^{2τ} L(W_0)
Iterates remain close to initialization: ‖W_τ − W_0‖_F / ‖W_0‖_F ≲ √(n/(kd))
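A small numerical sketch of the gradient descent recursion above for this one-hidden-layer model, tracking the training loss and the relative distance from initialization (the softplus activation, step size, and problem sizes are illustrative assumptions, not the constants of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 100, 400                    # kd = 40,000 parameters >> n = 50 samples
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)
v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)]) / np.sqrt(k)   # balanced output layer
W0 = rng.standard_normal((k, d))          # i.i.d. N(0, 1) initialization

phi  = lambda z: np.logaddexp(0.0, z)     # softplus: smooth, |phi'| and |phi''| bounded
dphi = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(W):
    Z = X @ W.T                           # (n, k) pre-activations
    r = phi(Z) @ v - y                    # residual f(W) - y
    G = (r[:, None] * dphi(Z) * v[None, :]).T @ X   # dL/dW, shape (k, d)
    return 0.5 * r @ r, G

W, mu = W0.copy(), 0.25
for tau in range(4000):
    L, G = loss_and_grad(W)
    W = W - mu * G
print("final training loss:", L)
print("relative distance from init:", np.linalg.norm(W - W0) / np.linalg.norm(W0))
```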
SLIDE 18 Dependence on data?
Diversity of input data is important...
X := [x_1ᵀ; x_2ᵀ; ... ; x_nᵀ] ∈ R^{n×d},   κ(X) := n ‖X‖ / λ(X)
Definition (Neural network covariance matrix and eigenvalue)
Neural net covariance matrix Σ(X) := (1/k) E_{W_0}[ (φ′(Xw) φ′(Xw)ᵀ) ⊙ (XXᵀ) ]
Eigenvalue λ(X) := λ_min(Σ(X))
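A Monte Carlo sketch of how Σ(X) and λ(X) can be estimated from this definition; here the average over the k random rows of W_0 is replaced by an average over w ~ N(0, I_d), which has the same expectation (the activation and sample sizes are illustrative assumptions):

```python
import numpy as np

def neural_net_covariance(X, dphi, num_samples=2000, seed=0):
    """Estimate Sigma(X) = E_w[(phi'(Xw) phi'(Xw)^T) o (X X^T)] by Monte Carlo, w ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    S = np.zeros((n, n))
    for _ in range(num_samples):
        a = dphi(X @ rng.standard_normal(d))     # phi'(Xw), shape (n,)
        S += np.outer(a, a)
    S /= num_samples
    return S * (X @ X.T)                          # Hadamard product with X X^T

rng = np.random.default_rng(1)
n, d = 50, 30
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
dphi = lambda z: 1.0 / (1.0 + np.exp(-z))         # phi' for the softplus activation
Sigma = neural_net_covariance(X, dphi)
print("lambda(X) =", np.linalg.eigvalsh(Sigma).min())
```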
SLIDE 19 Hermite expansion
Lemma: Let {µ_r(φ′)}_{r=0}^∞ be the Hermite coefficients of φ′. Then,
Σ(X) = Σ_{r=0}^{+∞} µ_r²(φ′) (XXᵀ) ⊙ ... ⊙ (XXᵀ)   [(r+1)-fold Hadamard product]
In particular, Σ(X) ⪰ µ_1²(φ′) (XXᵀ) ⊙ (XXᵀ) = (E[φ′′(g)])² (XXᵀ) ⊙ (XXᵀ)
arbitrary activation ⇔ quadratic activation
Conclusion: For generic data, e.g. x_i i.i.d. uniform on the unit sphere, κ(X) scales like a constant.
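A sketch of how the Hermite coefficients of φ′ can be computed numerically and the expansion checked on a pair of unit-norm inputs with correlation ρ (Gauss–Hermite quadrature and the softplus activation are illustrative choices):

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He   # probabilists' Hermite polynomials

dphi = lambda z: 1.0 / (1.0 + np.exp(-z))      # phi' for the softplus activation

# Hermite coefficients mu_r(phi') = E[phi'(g) He_r(g)] / sqrt(r!), g ~ N(0, 1),
# via Gauss-Hermite quadrature (weight exp(-x^2/2), total mass sqrt(2*pi)).
nodes, weights = He.hermegauss(80)
def mu(r):
    He_r = He.hermeval(nodes, [0.0] * r + [1.0])
    return weights @ (dphi(nodes) * He_r) / (np.sqrt(2 * np.pi) * np.sqrt(factorial(r)))

# For unit-norm x_i, x_j with correlation rho = <x_i, x_j>:
# E_w[phi'(<x_i, w>) phi'(<x_j, w>)] = sum_r mu_r(phi')^2 * rho^r.
rho = 0.3
series = sum(mu(r) ** 2 * rho ** r for r in range(12))

rng = np.random.default_rng(0)
g = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=200_000)
mc = np.mean(dphi(g[:, 0]) * dphi(g[:, 1]))
print(series, mc)   # the two estimates should closely agree
```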
SLIDE 20 Theory for ReLU activations
Data set {(x_i, y_i)}_{i=1}^n with ‖x_i‖_ℓ2 = 1
min_W L(W) := (1/2) Σ_{i=1}^n (vᵀ φ(W x_i) − y_i)²
[plot: the ReLU activation]
Set v at random or balanced (half +, half −)
Run gradient descent W_{τ+1} = W_τ − µ_τ ∇L(W_τ) with random initialization
Theorem (Oymak and Soltanolkotabi 2019)
Assume
ReLU activation: φ(z) = ReLU(z) = max(0, z)
Overparameterization: √(kd) ≳ κ³(X) n
Initialization W_0 with i.i.d. N(0, 1) entries
Then, with high probability
Zero training error: L(W_τ) ≤ (1 − c λ(X)/n)^{2τ} L(W_0)
Iterates remain close to initialization: ‖W_τ − W_0‖_F / ‖W_0‖_F ≲ √(n/(kd))
SLIDE 21 Theory for SGD
Data set {(x_i, y_i)}_{i=1}^n with ‖x_i‖_ℓ2 = 1
min_W L(W) := (1/2) Σ_{i=1}^n (vᵀ φ(W x_i) − y_i)²
Set v at random or balanced (half +, half −)
Run gradient descent W_{τ+1} = W_τ − µ_τ ∇L(W_τ) with random initialization
Theorem (Oymak and Soltanolkotabi 2019)
Assume
Smooth activation: |φ′(z)| ≤ B and |φ′′(z)| ≤ B
Overparameterization: √(kd) ≳ κ(X) n
Initialization W_0 with i.i.d. N(0, 1) entries
Then, with high probability
Zero training error: E[L(W_τ)] ≤ (1 − c λ(X)/n²)^{2τ} L(W_0)
Iterates remain close to initialization: ‖W_τ − W_0‖_F / ‖W_0‖_F ≲ √(n/(kd))
SLIDE 22
Proof Sketch
SLIDE 23 Prelude: over-parametrized linear least-squares
min_{θ∈R^p} L(θ) := (1/2) ‖Xθ − y‖²_ℓ2
with X ∈ R^{n×p} and n ≤ p. Gradient descent starting from θ_0 has three properties:
Global convergence
Converges to the global optimum closest to θ_0
Total gradient path length is relatively short
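A small numerical illustration of these three properties on an underdetermined least-squares problem; the closest global optimum is computed via the pseudoinverse for comparison (sizes and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 200                                    # n <= p: overparameterized
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
theta0 = rng.standard_normal(p)

# Global optimum closest to theta0: theta0 + min-norm solution of X(theta - theta0) = y - X theta0
theta_star = theta0 + np.linalg.pinv(X) @ (y - X @ theta0)

theta, eta = theta0.copy(), 1.0 / np.linalg.norm(X, 2) ** 2
path_length = 0.0
for _ in range(5000):
    step = eta * X.T @ (X @ theta - y)
    path_length += np.linalg.norm(step)
    theta -= step

print("residual:", np.linalg.norm(X @ theta - y))                           # ~ 0: global convergence
print("distance to closest optimum:", np.linalg.norm(theta - theta_star))   # ~ 0
print("path length vs ||theta* - theta0||:", path_length, np.linalg.norm(theta_star - theta0))
```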
SLIDE 24 Over-parametrized nonlinear least-squares
min_{θ∈R^p} L(θ) := (1/2) ‖f(θ) − y‖²_ℓ2,
where y := [y_1, y_2, ..., y_n]ᵀ ∈ R^n, f(θ) := [f(x_1; θ), f(x_2; θ), ..., f(x_n; θ)]ᵀ ∈ R^n, and n ≤ p.
Gradient descent: start from some initial parameter θ_0 and run
θ_{τ+1} = θ_τ − η_τ ∇L(θ_τ),   ∇L(θ) = J(θ)ᵀ (f(θ) − y).
Here, J(θ) ∈ R^{n×p} is the Jacobian matrix with entries J_{ij} = ∂f(x_i; θ)/∂θ_j.
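A sketch of forming the n × p Jacobian numerically for a generic model and checking the identity ∇L(θ) = J(θ)ᵀ(f(θ) − y); the finite-difference Jacobian and the tanh model are illustrative assumptions:

```python
import numpy as np

def jacobian(f, theta, eps=1e-6):
    """Finite-difference Jacobian J(theta) in R^{n x p}; row i = d f(x_i; theta) / d theta."""
    f0 = f(theta)
    J = np.zeros((len(f0), len(theta)))
    for j in range(len(theta)):
        e = np.zeros_like(theta); e[j] = eps
        J[:, j] = (f(theta + e) - f0) / eps
    return J

# Illustrative nonlinear model: f_i(theta) = tanh(a_i . theta)
rng = np.random.default_rng(0)
n, p = 10, 40
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
f = lambda theta: np.tanh(A @ theta)

theta = rng.standard_normal(p)
J = jacobian(f, theta)
grad_via_jacobian = J.T @ (f(theta) - y)                                    # nabla L = J^T (f - y)
grad_analytic = A.T @ ((1 - np.tanh(A @ theta) ** 2) * (f(theta) - y))
print(np.max(np.abs(grad_via_jacobian - grad_analytic)))                    # small: the two agree
```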
SLIDE 25 Key lemma
Lemma: Suppose the following hold over the ball B(θ_0, R) with R := 4 ‖f(θ_0) − y‖_ℓ2 / α:
Jacobian at initialization: σ_min(J(θ_0)) ≥ 2α
Bounded Jacobian spectrum: ‖J(θ)‖ ≤ β
Lipschitz Jacobian: ‖J(θ̃) − J(θ)‖ ≤ L ‖θ̃ − θ‖_ℓ2
Small initial residual: ‖f(θ_0) − y‖_ℓ2 ≤ α²/(4L)
Then, using step size η ≤ 1/(2β²):
Global geometric convergence: ‖f(θ_τ) − y‖²_ℓ2 ≤ (1 − η α²/2)^τ ‖f(θ_0) − y‖²_ℓ2
Iterates stay close to init.: ‖θ_τ − θ_0‖_ℓ2 ≤ (4/α) ‖f(θ_0) − y‖_ℓ2 ≤ 4 (β/α) ‖θ* − θ_0‖_ℓ2
Total gradient path bounded: Σ_{τ=0}^∞ ‖θ_{τ+1} − θ_τ‖_ℓ2 ≤ (4/α) ‖f(θ_0) − y‖_ℓ2
Key idea: Track the dynamics of a potential that couples the residual and the path length, of the form
V_τ := ‖r_τ‖_ℓ2 + (1/2) Σ_{t=0}^{τ−1} (1 − η β²)^{τ−1−t} ‖θ_{t+1} − θ_t‖_ℓ2.
SLIDE 26 Proof sketch (SGD)
Challenge: show that SGD remains in the local neighborhood
Attempt I: Show ‖θ_τ − θ_0‖_ℓ2 is a super-martingale (see also [Tan and Vershynin 2017])
Attempt II: Show that ‖f(θ_τ) − y‖_ℓ2 + λ ‖θ_τ − θ_0‖_ℓ2 is a super-martingale
Final attempt: Show that (1/K) Σ_{i=1}^K ‖θ_τ − v_i‖_ℓ2 + (3η/n) (···) is a super-martingale. Here, {v_i}_{i=1}^K is a very fine cover of B(θ_0, R).
SLIDE 27 Over-parametrized nonlinear least-squares for neural nets
min_{W∈R^{k×d}} L(W) := (1/2) ‖f(W) − y‖²_ℓ2,
where y := [y_1, y_2, ..., y_n]ᵀ ∈ R^n, f(W) := [f(W, x_1), f(W, x_2), ..., f(W, x_n)]ᵀ ∈ R^n, and n ≤ kd.
Linearization via the Jacobian: J(W) = X ∗ (φ′(XWᵀ) diag(v)), with ∗ the row-wise Khatri–Rao product.
SLIDE 28 Key Techniques
Hadamard product: J(W) J(W)ᵀ = (XXᵀ) ⊙ (φ′(XWᵀ) diag(v)² φ′(XWᵀ)ᵀ)
Theorem (Schur 1913) For two PSD matrices A, B ∈ R^{n×n}:
λ_min(A ⊙ B) ≥ λ_min(A) · min_i B_ii
λ_max(A ⊙ B) ≤ λ_max(A) · max_i B_ii
Random matrix theory: at a random initialization W_0,
J(W_0) J(W_0)ᵀ ≈ k · E_w[ (φ′(Xw) φ′(Xw)ᵀ) ⊙ (XXᵀ) ] = k Σ(X)
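A quick numerical check of both ingredients: the row-wise Khatri–Rao structure of J turns J Jᵀ into a Hadamard product of two PSD Gram matrices, and Schur's theorem bounds its extreme eigenvalues (the softplus activation, a ±1 output layer, and the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 40, 15, 60
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.standard_normal((k, d))
v = rng.choice([-1.0, 1.0], size=k)
dphi = lambda z: 1.0 / (1.0 + np.exp(-z))      # phi' for softplus

# J(W): row i is the row-wise Kronecker product of x_i with (phi'(W x_i) * v)
A = dphi(X @ W.T) * v                           # (n, k)
J = np.einsum('ij,ik->ijk', X, A).reshape(n, d * k)

lhs = J @ J.T
rhs = (X @ X.T) * (A @ A.T)                     # Hadamard product of the two Gram matrices
print("identity error:", np.max(np.abs(lhs - rhs)))

# Schur 1913: lambda_min(A1 o B1) >= lambda_min(A1) * min_i B1_ii (and similarly for lambda_max)
A1, B1 = A @ A.T, X @ X.T                       # both PSD; diag(B1) = 1 since ||x_i|| = 1
print(np.linalg.eigvalsh(lhs).min(), ">=", np.linalg.eigvalsh(A1).min() * np.diag(B1).min())
print(np.linalg.eigvalsh(lhs).max(), "<=", np.linalg.eigvalsh(A1).max() * np.diag(B1).max())
```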
SLIDE 29 Side corollary: Nonconvex matrix recovery
Features: A_1, A_2, ..., A_n ∈ R^{d×d}. Labels: y_1, y_2, ..., y_n.
Solve the nonconvex matrix factorization
min_{U∈R^{d×r}} (1/2) Σ_{i=1}^n (⟨A_i, U Uᵀ⟩ − y_i)²
Theorem (Oymak and Soltanolkotabi 2018)
Assume
i.i.d. Gaussian A_i, arbitrary labels y_i
Initialization at a well-conditioned matrix U_0
Then, gradient descent iterates U_τ converge at a geometric rate to a nearby global optimum as soon as n ≲ dr.
Burer–Monteiro and many others: r ≳ √n. For Gaussian A_i we allow r ≳ n/d.
When n ≈ d r_0: Burer–Monteiro requires r ≳ √(d r_0); ours requires r ≳ r_0.
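A sketch of this nonconvex recovery problem at a modest size with Gaussian measurements and arbitrary labels; the step size is set from the Jacobian at the well-conditioned initialization (all sizes and constants are illustrative assumptions, not those of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 20, 15, 100                        # number of parameters d*r = 300 >= n
A = rng.standard_normal((n, d, d))           # i.i.d. Gaussian measurement matrices A_i
As = A + A.transpose(0, 2, 1)                # A_i + A_i^T, used by the gradient
y = rng.standard_normal(n)                   # arbitrary labels

U = np.linalg.qr(rng.standard_normal((d, r)))[0]     # well-conditioned initialization U_0

def residual(U):
    return np.einsum('nij,ij->n', A, U @ U.T) - y    # <A_i, U U^T> - y_i

# conservative step size from the Jacobian at initialization (rows = vec((A_i + A_i^T) U_0))
J0 = np.stack([(Ai @ U).ravel() for Ai in As])
eta = 0.25 / np.linalg.norm(J0, 2) ** 2

for _ in range(5000):
    res = residual(U)
    G = np.einsum('n,nij->ij', res, As) @ U          # gradient of 1/2 * sum_i res_i^2
    U = U - eta * G

print("final training loss:", 0.5 * np.sum(residual(U) ** 2))   # should approach 0 when d*r >~ n
```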
SLIDE 30
Previous work
Quadratic activations (unrealistic): [Soltanolkotabi, Javanmard, Lee 2018] and [Venturi, Bandeira, Bruna, ...]
Smooth activations: [Du, Lee, Li, Wang, Zhai 2018] — kd ≳ n² versus k ≳ n⁴.
ReLU activation: [Du et al. 2018] — k ≳ n⁴/d³ versus k ≳ n⁶.
Separation: [Li and Liang 2018], [Allen-Zhu, Li, Song 2018] — k ≳ n¹²/δ⁴ versus k ≳ n²⁵????
Begin to move beyond "lazy training" [Chizat & Bach, 2018]; faster convergence rate.
Deep: [Du, Lee, Li, Wang, Zhai 2018] and [Allen-Zhu, Li, Song 2018]
Mean field analysis for infinitely wide networks: [Mei et al., 2018]; [Chizat & Bach, 2018]; [Sirignano & Spiliopoulos, 2018]; [Rotskoff & Vanden-Eijnden, 2018]; [Wei et al., 2018].
SLIDE 31
Related recent literature
Approximation capability: [Barron 1994], [Telgarsky 2016], [Bölcskei, Grohs, Kutyniok, and Petersen 2017]
More over-parameterization (n ≤ ck): [Poston, Lee, Choie, and Kwon 1991], [Haeffele and Vidal 2015], [Nguyen and Hein 2017]
Under-parameterized with resampling: [Oymak 2018], [Ge, Ma, Lee 2017], [Zhong, Song, Jain, Bartlett, and Dhillon 2017], [Brutzkus and Globerson 2017], [Li and Yuan 2017]
Other learning methods (tensors, kernels, etc.): [Janzamin, Sedghi, and Anandkumar 2015], [Goel and Klivans 2017]
Generalization: [Hardt, Recht, Singer 2016], [Brutzkus, Globerson, Malach, and Shalev-Shwartz 2017], [Golowich, Rakhlin, Shamir 2017], [Dziugaite and Roy 2017], [Bartlett, Foster, Telgarsky 2017], [Neyshabur, Bhojanapalli, McAllester, Srebro 2017], [Arora, Ge, Neyshabur, and Zhang 2018], [Arora, Cohen, Hazan 2018], [Azizan, Hassibi 2018]
Interface with statistical physics: [Choromanska, Henaff, Mathieu, Arous, LeCun 2015], [Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein 2018], [Novak, Bahri, Abolafia, Pennington, Sohl-Dickstein 2018]
Many others...
SLIDE 32 The need for overparameterization beyond width
Simple exercise: initialize W at random and fit only the output layer weights
L(v) := (1/2) Σ_{i=1}^n (vᵀ φ(W x_i) − y_i)² = (1/2) ‖Φ v − y‖²_ℓ2,
a simple least-squares problem with solution v̂ := Φᵀ (ΦΦᵀ)⁻¹ y, where Φ := φ(XWᵀ) ∈ R^{n×k}.
Theorem (Oymak and Soltanolkotabi 2019) Fitting the output layer perfectly interpolates the data w.h.p. as soon as k ≳ n.
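A sketch of this exercise: with k ≳ n random hidden units the feature matrix Φ = φ(XWᵀ) has full row rank w.h.p., so the least-squares output layer interpolates the training data (the ReLU choice and the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 30, 400                      # k >= n hidden units
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.standard_normal((k, d))             # random hidden layer, never trained
y = rng.standard_normal(n)

Phi = np.maximum(X @ W.T, 0)                # Phi = phi(X W^T) with phi = ReLU, shape (n, k)
v_hat = Phi.T @ np.linalg.solve(Phi @ Phi.T, y)   # v_hat = Phi^T (Phi Phi^T)^{-1} y
print("training error:", np.linalg.norm(Phi @ v_hat - y))   # ~ 0: perfect interpolation
```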
SLIDE 33
There is still a huge gap!
SLIDE 34
Benefit II: Robustness to corruption
SLIDE 35
Surprising experiment III-Robustness
Repeat the same experiment but stop early
SLIDE 36 Model (without corruption)
Clean data: (ε₀, δ)-clusterable input/label pairs {(x_i, y_i)}_{i=1}^n ∈ R^d × [−1, 1]
L clusters and K classes
[Figure: clusters of radius ε₀ with centers c_1, ..., c_6 grouped into 3 classes; class labels α_1, α_2, α_3 ∈ [−1, 1] with separation δ = 0.9]
SLIDE 37 Robustness to corruption
Clean data points {(x_i, ȳ_i)}_{i=1}^n; corrupt s := ρn labels to get the corrupted data {(x_i, y_i)}_{i=1}^n.
Fit L(W) := (1/2) Σ_{i=1}^n (f(W, x_i) − y_i)² via gradient descent
Theorem (Oymak and Soltanolkotabi 2019)
Assume
Corruption level ρ < 1/16
Cluster radius ε₀ ≲ 1/L²
Overparameterization k × d ≳ κ²(C) L⁴
Starting from random initialization, after τ ∼ L log(1/ρ) iterations, gradient descent finds a model with perfect accuracy, i.e. the label closest to f(W_τ, x_i) is the true label ȳ_i.
SLIDE 38
Learning versus overfitting
Key insight: distance from initialization
Theorem (Oymak and Soltanolkotabi 2019) With early stopping (τ ∼ L log(1/ρ)) the distance is bounded: ‖W_τ − W_0‖_F ≲ √L. To overfit to the corruption you have to travel far: ‖W − W_0‖_F ∝ √s.
SLIDE 39
Proof Sketch
SLIDE 40
High-level intuition
Intuition I: Network should learn when there is no corruption Intuition II: Network should not fit to the corruption
SLIDE 41 Key Idea I
Reminder: Gradient ∇L(θ) = J(θ) (f(θ, X) − y), Jacobian J(θ) = [∂f(θ, x_1)/∂θ  ∂f(θ, x_2)/∂θ  ...  ∂f(θ, x_n)/∂θ]
Key idea I: If ε₀ = 0, there are only L distinct inputs ⇒ J has exactly rank L.
SLIDE 42 Key Idea I
Reminder: Gradient ∇L(θ) = J(θ) (f(θ, X) − y), Jacobian J(θ) = [∂f(θ, x_1)/∂θ  ∂f(θ, x_2)/∂θ  ...  ∂f(θ, x_n)/∂θ]
Key idea I: If ε₀ is small, the inputs are close to L distinct points ⇒ J has approximately rank L.
SLIDE 43 Key Idea II
Key idea II: Two complementary subspaces
Fast (data) subspace F: subspace associated with the top L right singular vectors of J
Slow (noise) subspace S: complement of F
Interaction of Jacobian and residual in the gradient ∇L(θ) = J(θ) (f(θ, X) − y):
the residual decomposes into two terms
r(θ) := f(θ, X) − y = (f(θ, X) − ȳ) [residual w.r.t. true labels] + (ȳ − y) [corruption]
The residual w.r.t. the true labels falls mostly on F and quickly goes to zero
The corruption ȳ − y falls mostly on S and goes to zero very slowly
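A sketch illustrating Key ideas I and II on synthetic clusterable data: the per-sample-gradient Gram matrix JᵀJ has roughly L dominant eigenvalues, the residual w.r.t. the true labels concentrates on the corresponding top-L subspace F, and the corruption concentrates on its complement S (the cluster construction, the softplus activation, and all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
L_clusters, per, d, k, eps = 5, 40, 30, 200, 0.05
n = L_clusters * per
centers = rng.standard_normal((L_clusters, d))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
X = np.repeat(centers, per, axis=0) + eps * rng.standard_normal((n, d))   # clusterable inputs
y_true = np.repeat(np.linspace(-1.0, 1.0, L_clusters), per)               # one clean label per cluster
y = y_true.copy()
corrupt = rng.choice(n, size=n // 10, replace=False)
y[corrupt] = rng.uniform(-1.0, 1.0, size=len(corrupt))                    # corrupted labels

# one-hidden-layer network at random initialization
W0 = rng.standard_normal((k, d))
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
phi  = lambda z: np.logaddexp(0.0, z)        # softplus
dphi = lambda z: 1.0 / (1.0 + np.exp(-z))

# Gram matrix of per-sample gradients: J^T J = (X X^T) o (A A^T)
A = dphi(X @ W0.T) * v
G = (X @ X.T) * (A @ A.T)
eigvals, eigvecs = np.linalg.eigh(G)
print("top L+1 eigenvalues:", eigvals[-(L_clusters + 1):])   # roughly L dominant ones

F = eigvecs[:, -L_clusters:]                 # fast (data) subspace: top-L eigenvectors
energy_in_F = lambda r: np.linalg.norm(F.T @ r) ** 2 / np.linalg.norm(r) ** 2

r_clean = phi(X @ W0.T) @ v - y_true         # residual w.r.t. true labels
corruption = y_true - y
print("clean residual energy in F:", energy_in_F(r_clean))     # close to 1
print("corruption energy in F:    ", energy_in_F(corruption))  # small
```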
SLIDE 44
What about real data?
SLIDE 45
Dataset: CIFAR-10. Model: ResNet20. Task: binary classification (airplane vs. truck). n = 10,000 and p = 270,000.
SLIDE 46
Conclusion
Provable benefits of overparameterization More tractable optimization Robustness to corruption
SLIDE 47
Mandatory Postdoc Announcement
SLIDE 48
References
Theoretical Insights into the Optimization Landscape of Over-parameterized Shallow Neural Networks. M. Soltanolkotabi, A. Javanmard, and J. D. Lee, 2017.
Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path? S. Oymak and M. Soltanolkotabi.
Gradient Descent is Provably Robust to Label Noise for Overparameterized Neural Networks. S. Oymak and M. Soltanolkotabi.
Towards Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks. S. Oymak and M. Soltanolkotabi.
Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks. S. Oymak and M. Soltanolkotabi.
SLIDE 49
Thanks!
Funding acknowledgment