
Theories of Neural Networks Training: Lazy and Mean Field Regimes - Lénaïc Chizat



  1. Theories of Neural Networks Training: Lazy and Mean Field Regimes
     Lénaïc Chizat*, joint work with Francis Bach+
     April 10th 2019 - University of Basel
     *CNRS and Université Paris-Sud, +INRIA and ENS Paris

  2. Introduction

  3. Setting
     Supervised machine learning
     • given input/output training data (x^(1), y^(1)), ..., (x^(n), y^(n))
     • build a function f such that f(x) ≈ y for unseen data (x, y)
     Gradient-based learning
     • choose a parametric class of functions f(w, ·): x ↦ f(w, x)
     • a loss ℓ to compare outputs: squared, logistic, cross-entropy...
     • starting from some w_0, update the parameters using gradients
     Example: stochastic gradient descent with step-sizes (η^(k))_{k≥1}:
     w^(k) = w^(k−1) − η^(k) ∇_w [ℓ(f(w^(k−1), x^(k)), y^(k))]
     [Refs]: Robbins, Monro (1951). A Stochastic Approximation Method. LeCun, Bottou, Bengio, Haffner (1998). Gradient-Based Learning Applied to Document Recognition.
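
A minimal runnable sketch of the SGD update above, using JAX for the gradient; the linear model, squared loss, synthetic data stream and step-size schedule are illustrative placeholders, not taken from the talk.

    import jax
    import jax.numpy as jnp

    def f(w, x):                          # parametric model f(w, x): here a linear model
        return jnp.dot(w, x)

    def loss(w, x, y):                    # squared loss ell(f(w, x), y)
        return 0.5 * (f(w, x) - y) ** 2

    grad_loss = jax.grad(loss)            # gradient with respect to w

    w = jnp.zeros(3)                      # initialization w_0
    key = jax.random.PRNGKey(0)
    for k in range(1, 101):
        key, kx = jax.random.split(key)
        x = jax.random.normal(kx, (3,))   # stream of samples (x^(k), y^(k))
        y = jnp.sum(x)                    # synthetic target
        eta = 1.0 / k                     # step-size schedule (eta^(k))
        w = w - eta * grad_loss(w, x, y)  # w^(k) = w^(k-1) - eta^(k) grad_w ell(...)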

  4. Models
     Linear: linear regression, ad-hoc features, kernel methods: f(w, x) = w · φ(x)
     Non-linear: neural networks (NNs). Example of a vanilla NN:
     f(w, x) = W_L^T σ(W_{L−1}^T σ(... σ(W_1^T x + b_1) ...) + b_{L−1}) + b_L
     with activation σ and parameters w = (W_1, b_1), ..., (W_L, b_L).
     [Figure: a small fully connected network with inputs x[1], x[2] and output y.]
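
A short sketch of such a vanilla NN forward pass; the widths, the tanh activation and the 1/√fan_in initialization scale are illustrative choices, not prescribed by the slide.

    import jax
    import jax.numpy as jnp

    def init_params(key, widths):
        params = []
        for d_in, d_out in zip(widths[:-1], widths[1:]):
            key, kW = jax.random.split(key)
            W = jax.random.normal(kW, (d_in, d_out)) / jnp.sqrt(d_in)   # scale 1/sqrt(fan_in)
            b = jnp.zeros(d_out)
            params.append((W, b))
        return params

    def f(params, x):
        h = x
        for W, b in params[:-1]:
            h = jnp.tanh(h @ W + b)       # sigma(W_l^T h + b_l)
        W_L, b_L = params[-1]
        return h @ W_L + b_L              # affine output layer

    params = init_params(jax.random.PRNGKey(0), [2, 64, 64, 1])
    print(f(params, jnp.array([0.3, -1.2])))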

  5. Challenges for Theory
     Need for new theoretical approaches
     • optimization: non-convex, compositional structure
     • statistics: over-parameterized, works without regularization
     Why should we care?
     • effects of hyper-parameters
     • insights on individual tools in a pipeline
     • more robust, more efficient, more accessible models
     Today's program
     • lazy training
     • global convergence for over-parameterized two-layer NNs
     [Refs]: Zhang, Bengio, Hardt, Recht, Vinyals (2016). Understanding Deep Learning Requires Rethinking Generalization.

  6. Lazy Training

  7. Tangent Model
     Let f(w, x) be a differentiable model and w_0 an initialization.
     [Figure: the map w ↦ f(w, ·) sends the parameter space around w_0 into function space, where f(w_0, ·) lies at some distance from the target f*.]

  8. Tangent Model
     Let f(w, x) be a differentiable model and w_0 an initialization.
     [Figure: same picture, with the affine tangent space attached to function space at f(w_0, ·).]
     Tangent model: T_f(w, x) = f(w_0, x) + (w − w_0) · ∇_w f(w_0, x)
     Scaling the output by α makes the linearization more accurate.
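
A sketch of the tangent model written with jax.jvp, assuming a model f(w, x) whose parameters w form a pytree (such as the list of (W_l, b_l) pairs above); the helper name tangent_model is mine.

    import jax

    def tangent_model(f, w0):
        # Return T_f(w, x) = f(w0, x) + (w - w0) . grad_w f(w0, x)
        def Tf(w, x):
            delta = jax.tree_util.tree_map(lambda a, b: a - b, w, w0)
            f0, df = jax.jvp(lambda v: f(v, x), (w0,), (delta,))
            return f0 + df                # first-order expansion around w0
        return Tf

On the tangent model, training is a linear problem in w, which is what the rescaling result on the next slide exploits.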

  9. Lazy Training Theorem
     Theorem (Lazy training through rescaling). Assume that f(w_0, ·) = 0 and that the loss is quadratic. In the limit of a small step-size and a large scale α, gradient-based methods on the non-linear model αf and on the tangent model T_f learn the same model, up to a O(1/α) remainder.
     • lazy because the parameters hardly move
     • optimization of linear models is rather well understood
     • recovers kernel ridgeless regression with offset f(w_0, ·) and K(x, x') = ⟨∇_w f(w_0, x), ∇_w f(w_0, x')⟩
     [Refs]: Jacot, Gabriel, Hongler (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Du, Lee, Li, Wang, Zhai (2018). Gradient Descent Finds Global Minima of Deep Neural Networks. Allen-Zhu, Li, Liang (2018). Learning and Generalization in Overparameterized Neural Networks [...]. Chizat, Bach (2018). A Note on Lazy Training in Supervised Differentiable Programming.
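
The kernel in the last bullet can be computed directly from parameter gradients. A sketch, assuming a scalar-valued model f(w, x); the helper tangent_kernel is mine.

    import jax
    import jax.numpy as jnp
    from jax.flatten_util import ravel_pytree

    def tangent_kernel(f, w0, x, x_prime):
        # flattened gradient of the scalar output with respect to all parameters at w0
        g = lambda z: ravel_pytree(jax.grad(f)(w0, z))[0]
        return jnp.dot(g(x), g(x_prime))  # K(x, x') = <grad_w f(w0, x), grad_w f(w0, x')>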

  10. Range of Lazy Training
     Criterion for lazy training (informal):
     ‖T_f(w*, ·) − f(w_0, ·)‖  ≪  ‖∇f(w_0, ·)‖² / ‖∇²f(w_0, ·)‖
     (left: distance to the best linear model; right: "flatness" around the initialization)
     → difficult to estimate in general
     Examples
     • Homogeneous models. If for λ > 0, f(λw, x) = λ^L f(w, x), then flatness ∼ ‖w_0‖^L
     • NNs with large layers. Occurs if initialized with scale O(1/√fan_in)
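
A short check of the homogeneity example, spelling out the step the slide skips: differentiating f(λw, x) = λ^L f(w, x) in w gives ∇f(λw, x) = λ^{L−1} ∇f(w, x) and ∇²f(λw, x) = λ^{L−2} ∇²f(w, x), so for an initialization w_0 = λu with ‖u‖ = 1,

    \[
    \frac{\|\nabla f(w_0,\cdot)\|^2}{\|\nabla^2 f(w_0,\cdot)\|}
    = \frac{\lambda^{2(L-1)}\,\|\nabla f(u,\cdot)\|^2}{\lambda^{L-2}\,\|\nabla^2 f(u,\cdot)\|}
    = \lambda^{L}\,\frac{\|\nabla f(u,\cdot)\|^2}{\|\nabla^2 f(u,\cdot)\|}
    \;\sim\; \|w_0\|^{L},
    \]

which is why scaling up the initialization (or the output) pushes a homogeneous model into the lazy regime.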

  11. Large Neural Networks
     Vanilla NN with W^l_{i,j} ~ i.i.d. N(0, τ_w²/fan_in) and b^l_i ~ i.i.d. N(0, τ_b²).
     Model at initialization: as the widths of the layers diverge, f(w_0, ·) ~ GP(0, Σ^L) where
     Σ^{l+1}(x, x') = τ_b² + τ_w² · E_{z_l ~ GP(0, Σ^l)}[σ(z_l(x)) · σ(z_l(x'))].
     Limit tangent kernel: in the same limit, ⟨∇_w f(w_0, x), ∇_w f(w_0, x')⟩ → K^L(x, x') where
     K^{l+1}(x, x') = K^l(x, x') Σ̇^{l+1}(x, x') + Σ^{l+1}(x, x')
     and Σ̇^{l+1}(x, x') = E_{z_l ~ GP(0, Σ^l)}[σ̇(z_l(x)) · σ̇(z_l(x'))].
     → cf. A. Jacot's talk of last week
     [Refs]: Matthews, Rowland, Hron, Turner, Ghahramani (2018). Gaussian Process Behaviour in Wide Deep Neural Networks. Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein (2018). Deep Neural Networks as Gaussian Processes. Jacot, Gabriel, Hongler (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks.
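
A Monte Carlo sketch of the Σ recursion above for one pair of inputs: at each layer, (z_l(x), z_l(x')) is sampled from the bivariate Gaussian with the current covariance and the expectation is replaced by a sample mean. The tanh activation, the values of τ_w², τ_b², the depth and the sample size are illustrative assumptions.

    import jax
    import jax.numpy as jnp

    tau_w2, tau_b2, sigma = 1.0, 0.1, jnp.tanh
    key = jax.random.PRNGKey(0)

    def sigma1(x, xp):                       # base case: covariance of the first pre-activations
        return tau_b2 + tau_w2 * jnp.dot(x, xp) / x.shape[0]

    def next_sigma(key, S_xx, S_xxp, S_xpxp, n_samples=200_000):
        cov = jnp.array([[S_xx, S_xxp], [S_xxp, S_xpxp]])
        z = jax.random.multivariate_normal(key, jnp.zeros(2), cov, (n_samples,))
        return tau_b2 + tau_w2 * jnp.mean(sigma(z[:, 0]) * sigma(z[:, 1]))

    key, kx, kxp = jax.random.split(key, 3)
    x, xp = jax.random.normal(kx, (10,)), jax.random.normal(kxp, (10,))
    S_xx, S_xxp, S_xpxp = sigma1(x, x), sigma1(x, xp), sigma1(xp, xp)
    for _ in range(3):                       # iterate the recursion over depth
        key, k1, k2, k3 = jax.random.split(key, 4)
        S_xx, S_xxp, S_xpxp = (next_sigma(k1, S_xx, S_xx, S_xx),
                               next_sigma(k2, S_xx, S_xxp, S_xpxp),
                               next_sigma(k3, S_xpxp, S_xpxp, S_xpxp))
    print(S_xxp)                             # approximate Sigma^L(x, x')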

  12. Numerical Illustrations
     [Figure: training a 2-layer ReLU NN in the teacher-student setting. Panels (a) "not lazy" and (b) "lazy" show trajectories, with legend entries "circle of radius 1", "gradient flow (+)", "gradient flow (−)". Panels (c) "over-parameterized" and (d) "under-parameterized" show generalization in 100-d versus the initialization scale τ: test loss and population loss at convergence, at the end of training (not yet converged) and best throughout training.]

  13. Lessons to be drawn
     For practice
     • our guess: rather than lazy training, feature selection is why NNs work
     • investigation needed on hard tasks
     For theory
     • in-depth analysis sometimes possible
     • not just one theory for NNs training
     [Refs]: Zhang, Bengio, Singer (2019). Are All Layers Created Equal? Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein (2018). Deep Neural Networks as Gaussian Processes.

  14. Global Convergence for Two-Layer NNs

  15. Two-Layer NNs
     [Figure: a one-hidden-layer network with inputs x[1], x[2] and output y.]
     With activation σ, define φ(w_i, x) = c_i σ(a_i · x + b_i) and
     f(w, x) = (1/m) Σ_{i=1}^m φ(w_i, x)
     Statistical setting: minimize the population loss E_{(x,y)}[ℓ(f(w, x), y)].
     Hard problem: existence of spurious minima even with slight over-parameterization and good initialization.
     [Refs]: Livni, Shalev-Shwartz, Shamir (2014). On the Computational Efficiency of Training Neural Networks. Safran, Shamir (2018). Spurious Local Minima are Common in Two-Layer ReLU Neural Networks.
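
A minimal sketch of this two-layer parameterization with a ReLU activation; the width, the input dimension and the Gaussian initialization are illustrative.

    import jax
    import jax.numpy as jnp

    def init_neurons(key, m, d):
        ka, kb, kc = jax.random.split(key, 3)
        a = jax.random.normal(ka, (m, d))        # input weights a_i
        b = jax.random.normal(kb, (m,))          # biases b_i
        c = jax.random.normal(kc, (m,))          # output weights c_i
        return (a, b, c)

    def f(w, x):
        a, b, c = w
        return jnp.mean(c * jax.nn.relu(a @ x + b))   # (1/m) sum_i phi(w_i, x)

    w = init_neurons(jax.random.PRNGKey(0), m=100, d=100)
    print(f(w, jnp.ones(100)))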

  16. Mean-Field Analysis
     Many-particle limit: training dynamics in the small step-size and infinite-width limit:
     μ_{t,m} = (1/m) Σ_{i=1}^m δ_{w_i(t)}  →  μ_{t,∞}  as  m → ∞
     [Refs]: Nitanda, Suzuki (2017). Stochastic Particle Gradient Descent for Infinite Ensembles. Mei, Montanari, Nguyen (2018). A Mean Field View of the Landscape of Two-Layer Neural Networks. Rotskoff, Vanden-Eijnden (2018). Parameters as Interacting Particles [...]. Sirignano, Spiliopoulos (2018). Mean Field Analysis of Neural Networks. Chizat, Bach (2018). On the Global Convergence of Gradient Descent for Over-parameterized Models [...].
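
The link with the measure-based parameterization of the later slides is immediate: the finite network is the integral of the single-neuron feature map against the empirical measure of its neurons,

    \[
    f(w, x) = \frac{1}{m}\sum_{i=1}^m \phi(w_i, x)
            = \int \phi(u, x)\, d\mu_{t,m}(u),
    \qquad
    \mu_{t,m} = \frac{1}{m}\sum_{i=1}^m \delta_{w_i(t)},
    \]

so gradient-based training moves μ_{t,m}, and the mean-field limit describes the evolution of μ_{t,∞}.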

  17. Global Convergence
     Theorem (Global convergence, informal). In the limit of a small step-size, a large data set and a large hidden layer, NNs trained with gradient-based methods initialized with "sufficient diversity" converge globally.
     • diversity at initialization is key for the success of training
     • highly non-linear dynamics; regularization allowed
     [Refs]: Chizat, Bach (2018). On the Global Convergence of Gradient Descent for Over-parameterized Models [...].

  18. Numerical Illustrations
     [Figure: population loss at convergence vs m for training a 2-layer NN in the teacher-student setting in 100-d; panels (a) ReLU and (b) sigmoid, with legend entries "particle gradient flow", "convex minimization", "below optim. error".]
     This principle is general: e.g. sparse deconvolution.

  19. Idealized Dynamic
     • parameterize the model with a probability measure μ ∈ P(R^d):
     f(μ, x) = ∫ φ(w, x) dμ(w)

  20. Idealized Dynamic
     • parameterize the model with a probability measure μ ∈ P(R^d):
     f(μ, x) = ∫ φ(w, x) dμ(w)
     • consider the population loss over P(R^d):
     F(μ) := E_{(x,y)}[ℓ(f(μ, x), y)]
     → convex in the linear geometry but non-convex in the Wasserstein geometry

  21. Idealized Dynamic
     • parameterize the model with a probability measure μ ∈ P(R^d):
     f(μ, x) = ∫ φ(w, x) dμ(w)
     • consider the population loss over P(R^d):
     F(μ) := E_{(x,y)}[ℓ(f(μ, x), y)]
     → convex in the linear geometry but non-convex in the Wasserstein geometry
     • define the Wasserstein gradient flow:
     μ_0 ∈ P(R^d),  (d/dt) μ_t = −div(μ_t v_t)  where  v_t(w) = −∇F′(μ_t)(w)
     is the Wasserstein gradient of F.
     [Refs]: Bach (2017). Breaking the Curse of Dimensionality with Convex Neural Networks. Ambrosio, Gigli, Savaré (2008). Gradient Flows in Metric Spaces and in the Space of Probability Measures.
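
In practice the idealized dynamic is simulated by its particle discretization: plain gradient descent on m neurons, an explicit Euler scheme for the flow restricted to empirical measures (each particle carrying mass 1/m). A sketch in a teacher-student setting; the dimensions, widths, step-size, step count and data are illustrative assumptions, not the experiment of the talk.

    import jax
    import jax.numpy as jnp

    def phi(u, x):                                 # one neuron: phi(u, x) = c * relu(a . x + b)
        a, b, c = u[:-2], u[-2], u[-1]
        return c * jax.nn.relu(jnp.dot(a, x) + b)

    def f(particles, x):                           # f(mu_m, x) = (1/m) sum_i phi(w_i, x)
        return jnp.mean(jax.vmap(phi, in_axes=(0, None))(particles, x))

    def loss(particles, X, Y):                     # empirical proxy for the population loss F(mu)
        preds = jax.vmap(f, in_axes=(None, 0))(particles, X)
        return 0.5 * jnp.mean((preds - Y) ** 2)

    d, m, n = 20, 512, 1000
    k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
    teacher = jax.random.normal(k1, (5, d + 2))    # hypothetical narrow teacher network
    X = jax.random.normal(k2, (n, d))
    Y = jax.vmap(f, in_axes=(None, 0))(teacher, X)

    particles = jax.random.normal(k3, (m, d + 2))  # spread-out ("diverse") initialization
    eta = 0.02 * m                                 # factor m: each particle has mass 1/m in mu_m
    step = jax.jit(lambda p: p - eta * jax.grad(loss)(p, X, Y))
    for t in range(500):
        particles = step(particles)
    print(loss(particles, X, Y))                   # loss after training the particle system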
