
Theories of Neural Networks Training: Lazy and Mean Field Regimes - Lénaïc Chizat



  1. Theories of Neural Networks Training: Lazy and Mean Field Regimes
     Lénaïc Chizat*, joint work with Francis Bach+
     April 10th 2019 - University of Basel
     *CNRS and Université Paris-Sud, +INRIA and ENS Paris

  2. Introduction

  3. Setting
     Supervised machine learning
     • given input/output training data (x^(1), y^(1)), ..., (x^(n), y^(n))
     • build a function f such that f(x) ≈ y for unseen data (x, y)
     Gradient-based learning
     • choose a parametric class of functions f(w, ·): x ↦ f(w, x)
     • a loss ℓ to compare outputs: squared, logistic, cross-entropy...
     • starting from some w_0, update the parameters using gradients
     Example: stochastic gradient descent with step-sizes (η^(k))_{k≥1}:
     w^(k) = w^(k−1) − η^(k) ∇_w [ℓ(f(w^(k−1), x^(k)), y^(k))]
     [Refs]: Robbins, Monro (1951). A Stochastic Approximation Method. LeCun, Bottou, Bengio, Haffner (1998). Gradient-Based Learning Applied to Document Recognition.
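
A minimal runnable sketch of the SGD update above, using JAX for the gradient; the linear model, squared loss, synthetic data stream and step-size schedule are illustrative placeholders, not taken from the talk.

    import jax
    import jax.numpy as jnp

    def f(w, x):                          # parametric model f(w, x): here a linear model
        return jnp.dot(w, x)

    def loss(w, x, y):                    # squared loss ell(f(w, x), y)
        return 0.5 * (f(w, x) - y) ** 2

    grad_loss = jax.grad(loss)            # gradient with respect to w

    w = jnp.zeros(3)                      # initialization w_0
    key = jax.random.PRNGKey(0)
    for k in range(1, 101):
        key, kx = jax.random.split(key)
        x = jax.random.normal(kx, (3,))   # stream of samples (x^(k), y^(k))
        y = jnp.sum(x)                    # synthetic target
        eta = 1.0 / k                     # step-size schedule (eta^(k))
        w = w - eta * grad_loss(w, x, y)  # w^(k) = w^(k-1) - eta^(k) grad_w ell(...)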

  4. Models
     Linear: linear regression, ad-hoc features, kernel methods: f(w, x) = w · φ(x)
     Non-linear: neural networks (NNs). Example of a vanilla NN:
     f(w, x) = W_L^T σ(W_{L−1}^T σ(... σ(W_1^T x + b_1) ...) + b_{L−1}) + b_L
     with activation σ and parameters w = (W_1, b_1), ..., (W_L, b_L).
     [Figure: a small fully connected network with inputs x[1], x[2] and output y.]
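
A short sketch of such a vanilla NN forward pass; the widths, the tanh activation and the 1/√fan_in initialization scale are illustrative choices, not prescribed by the slide.

    import jax
    import jax.numpy as jnp

    def init_params(key, widths):
        params = []
        for d_in, d_out in zip(widths[:-1], widths[1:]):
            key, kW = jax.random.split(key)
            W = jax.random.normal(kW, (d_in, d_out)) / jnp.sqrt(d_in)   # scale 1/sqrt(fan_in)
            b = jnp.zeros(d_out)
            params.append((W, b))
        return params

    def f(params, x):
        h = x
        for W, b in params[:-1]:
            h = jnp.tanh(h @ W + b)       # sigma(W_l^T h + b_l)
        W_L, b_L = params[-1]
        return h @ W_L + b_L              # affine output layer

    params = init_params(jax.random.PRNGKey(0), [2, 64, 64, 1])
    print(f(params, jnp.array([0.3, -1.2])))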

  5. Challenges for Theory
     Need for new theoretical approaches
     • optimization: non-convex, compositional structure
     • statistics: over-parameterized, works without regularization
     Why should we care?
     • effects of hyper-parameters
     • insights on individual tools in a pipeline
     • more robust, more efficient, more accessible models
     Today's program
     • lazy training
     • global convergence for over-parameterized two-layer NNs
     [Refs]: Zhang, Bengio, Hardt, Recht, Vinyals (2016). Understanding Deep Learning Requires Rethinking Generalization.

  6. Lazy Training

  7. Tangent Model
     Let f(w, x) be a differentiable model and w_0 an initialization.
     [Figure: the map w ↦ f(w, ·) sends the parameter space around w_0 into function space, where f(w_0, ·) lies at some distance from the target f*.]

  8. Tangent Model
     Let f(w, x) be a differentiable model and w_0 an initialization.
     [Figure: same picture, with the affine tangent space attached to function space at f(w_0, ·).]
     Tangent model: T_f(w, x) = f(w_0, x) + (w − w_0) · ∇_w f(w_0, x)
     Scaling the output by α makes the linearization more accurate.
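
A sketch of the tangent model written with jax.jvp, assuming a model f(w, x) whose parameters w form a pytree (such as the list of (W_l, b_l) pairs above); the helper name tangent_model is mine.

    import jax

    def tangent_model(f, w0):
        # Return T_f(w, x) = f(w0, x) + (w - w0) . grad_w f(w0, x)
        def Tf(w, x):
            delta = jax.tree_util.tree_map(lambda a, b: a - b, w, w0)
            f0, df = jax.jvp(lambda v: f(v, x), (w0,), (delta,))
            return f0 + df                # first-order expansion around w0
        return Tf

On the tangent model, training is a linear problem in w, which is what the rescaling result on the next slide exploits.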

  9. Lazy Training Theorem
     Theorem (Lazy training through rescaling). Assume that f(w_0, ·) = 0 and that the loss is quadratic. In the limit of a small step-size and a large scale α, gradient-based methods on the non-linear model αf and on the tangent model T_f learn the same model, up to a O(1/α) remainder.
     • lazy because the parameters hardly move
     • optimization of linear models is rather well understood
     • recovers kernel ridgeless regression with offset f(w_0, ·) and K(x, x') = ⟨∇_w f(w_0, x), ∇_w f(w_0, x')⟩
     [Refs]: Jacot, Gabriel, Hongler (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Du, Lee, Li, Wang, Zhai (2018). Gradient Descent Finds Global Minima of Deep Neural Networks. Allen-Zhu, Li, Liang (2018). Learning and Generalization in Overparameterized Neural Networks [...]. Chizat, Bach (2018). A Note on Lazy Training in Supervised Differentiable Programming.
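
The kernel in the last bullet can be computed directly from parameter gradients. A sketch, assuming a scalar-valued model f(w, x); the helper tangent_kernel is mine.

    import jax
    import jax.numpy as jnp
    from jax.flatten_util import ravel_pytree

    def tangent_kernel(f, w0, x, x_prime):
        # flattened gradient of the scalar output with respect to all parameters at w0
        g = lambda z: ravel_pytree(jax.grad(f)(w0, z))[0]
        return jnp.dot(g(x), g(x_prime))  # K(x, x') = <grad_w f(w0, x), grad_w f(w0, x')>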

  10. Range of Lazy Training
     Criterion for lazy training (informal):
     ‖T_f(w*, ·) − f(w_0, ·)‖  ≪  ‖∇f(w_0, ·)‖² / ‖∇²f(w_0, ·)‖
     (left: distance to the best linear model; right: "flatness" around the initialization)
     → difficult to estimate in general
     Examples
     • Homogeneous models. If for λ > 0, f(λw, x) = λ^L f(w, x), then flatness ∼ ‖w_0‖^L
     • NNs with large layers. Occurs if initialized with scale O(1/√fan_in)
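
A short check of the homogeneity example, spelling out the step the slide skips: differentiating f(λw, x) = λ^L f(w, x) in w gives ∇f(λw, x) = λ^{L−1} ∇f(w, x) and ∇²f(λw, x) = λ^{L−2} ∇²f(w, x), so for an initialization w_0 = λu with ‖u‖ = 1,

    \[
    \frac{\|\nabla f(w_0,\cdot)\|^2}{\|\nabla^2 f(w_0,\cdot)\|}
    = \frac{\lambda^{2(L-1)}\,\|\nabla f(u,\cdot)\|^2}{\lambda^{L-2}\,\|\nabla^2 f(u,\cdot)\|}
    = \lambda^{L}\,\frac{\|\nabla f(u,\cdot)\|^2}{\|\nabla^2 f(u,\cdot)\|}
    \;\sim\; \|w_0\|^{L},
    \]

which is why scaling up the initialization (or the output) pushes a homogeneous model into the lazy regime.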

  11. Large Neural Networks
     Vanilla NN with W^l_{i,j} ~ i.i.d. N(0, τ_w²/fan_in) and b^l_i ~ i.i.d. N(0, τ_b²).
     Model at initialization: as the widths of the layers diverge, f(w_0, ·) ~ GP(0, Σ^L) where
     Σ^{l+1}(x, x') = τ_b² + τ_w² · E_{z_l ~ GP(0, Σ^l)}[σ(z_l(x)) · σ(z_l(x'))].
     Limit tangent kernel: in the same limit, ⟨∇_w f(w_0, x), ∇_w f(w_0, x')⟩ → K^L(x, x') where
     K^{l+1}(x, x') = K^l(x, x') Σ̇^{l+1}(x, x') + Σ^{l+1}(x, x')
     and Σ̇^{l+1}(x, x') = E_{z_l ~ GP(0, Σ^l)}[σ̇(z_l(x)) · σ̇(z_l(x'))].
     → cf. A. Jacot's talk of last week
     [Refs]: Matthews, Rowland, Hron, Turner, Ghahramani (2018). Gaussian Process Behaviour in Wide Deep Neural Networks. Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein (2018). Deep Neural Networks as Gaussian Processes. Jacot, Gabriel, Hongler (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks.
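
A Monte Carlo sketch of the Σ recursion above for one pair of inputs: at each layer, (z_l(x), z_l(x')) is sampled from the bivariate Gaussian with the current covariance and the expectation is replaced by a sample mean. The tanh activation, the values of τ_w², τ_b², the depth and the sample size are illustrative assumptions.

    import jax
    import jax.numpy as jnp

    tau_w2, tau_b2, sigma = 1.0, 0.1, jnp.tanh
    key = jax.random.PRNGKey(0)

    def sigma1(x, xp):                       # base case: covariance of the first pre-activations
        return tau_b2 + tau_w2 * jnp.dot(x, xp) / x.shape[0]

    def next_sigma(key, S_xx, S_xxp, S_xpxp, n_samples=200_000):
        cov = jnp.array([[S_xx, S_xxp], [S_xxp, S_xpxp]])
        z = jax.random.multivariate_normal(key, jnp.zeros(2), cov, (n_samples,))
        return tau_b2 + tau_w2 * jnp.mean(sigma(z[:, 0]) * sigma(z[:, 1]))

    key, kx, kxp = jax.random.split(key, 3)
    x, xp = jax.random.normal(kx, (10,)), jax.random.normal(kxp, (10,))
    S_xx, S_xxp, S_xpxp = sigma1(x, x), sigma1(x, xp), sigma1(xp, xp)
    for _ in range(3):                       # iterate the recursion over depth
        key, k1, k2, k3 = jax.random.split(key, 4)
        S_xx, S_xxp, S_xpxp = (next_sigma(k1, S_xx, S_xx, S_xx),
                               next_sigma(k2, S_xx, S_xxp, S_xpxp),
                               next_sigma(k3, S_xpxp, S_xpxp, S_xpxp))
    print(S_xxp)                             # approximate Sigma^L(x, x')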

  12. Numerical Illustrations
     [Figure: training a 2-layer ReLU NN in the teacher-student setting. Panels (a) "not lazy" and (b) "lazy" show trajectories, with legend entries "circle of radius 1", "gradient flow (+)", "gradient flow (−)". Panels (c) "over-parameterized" and (d) "under-parameterized" show generalization in 100-d versus the initialization scale τ: test loss and population loss at convergence, at the end of training (not yet converged) and best throughout training.]

  13. Lessons to be drawn
     For practice
     • our guess: rather than lazy training, feature selection is why NNs work
     • investigation needed on hard tasks
     For theory
     • in-depth analysis sometimes possible
     • not just one theory for NNs training
     [Refs]: Zhang, Bengio, Singer (2019). Are All Layers Created Equal? Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein (2018). Deep Neural Networks as Gaussian Processes.

  14. Global Convergence for Two-Layer NNs

  15. Two-Layer NNs
     [Figure: a one-hidden-layer network with inputs x[1], x[2] and output y.]
     With activation σ, define φ(w_i, x) = c_i σ(a_i · x + b_i) and
     f(w, x) = (1/m) Σ_{i=1}^m φ(w_i, x)
     Statistical setting: minimize the population loss E_{(x,y)}[ℓ(f(w, x), y)].
     Hard problem: existence of spurious minima even with slight over-parameterization and good initialization.
     [Refs]: Livni, Shalev-Shwartz, Shamir (2014). On the Computational Efficiency of Training Neural Networks. Safran, Shamir (2018). Spurious Local Minima are Common in Two-Layer ReLU Neural Networks.
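
A minimal sketch of this two-layer parameterization with a ReLU activation; the width, the input dimension and the Gaussian initialization are illustrative.

    import jax
    import jax.numpy as jnp

    def init_neurons(key, m, d):
        ka, kb, kc = jax.random.split(key, 3)
        a = jax.random.normal(ka, (m, d))        # input weights a_i
        b = jax.random.normal(kb, (m,))          # biases b_i
        c = jax.random.normal(kc, (m,))          # output weights c_i
        return (a, b, c)

    def f(w, x):
        a, b, c = w
        return jnp.mean(c * jax.nn.relu(a @ x + b))   # (1/m) sum_i phi(w_i, x)

    w = init_neurons(jax.random.PRNGKey(0), m=100, d=100)
    print(f(w, jnp.ones(100)))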

  16. Mean-Field Analysis
     Many-particle limit: training dynamics in the small step-size and infinite-width limit:
     μ_{t,m} = (1/m) Σ_{i=1}^m δ_{w_i(t)}  →  μ_{t,∞}  as  m → ∞
     [Refs]: Nitanda, Suzuki (2017). Stochastic Particle Gradient Descent for Infinite Ensembles. Mei, Montanari, Nguyen (2018). A Mean Field View of the Landscape of Two-Layer Neural Networks. Rotskoff, Vanden-Eijnden (2018). Parameters as Interacting Particles [...]. Sirignano, Spiliopoulos (2018). Mean Field Analysis of Neural Networks. Chizat, Bach (2018). On the Global Convergence of Gradient Descent for Over-parameterized Models [...].
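
The link with the measure-based parameterization of the later slides is immediate: the finite network is the integral of the single-neuron feature map against the empirical measure of its neurons,

    \[
    f(w, x) = \frac{1}{m}\sum_{i=1}^m \phi(w_i, x)
            = \int \phi(u, x)\, d\mu_{t,m}(u),
    \qquad
    \mu_{t,m} = \frac{1}{m}\sum_{i=1}^m \delta_{w_i(t)},
    \]

so gradient-based training moves μ_{t,m}, and the mean-field limit describes the evolution of μ_{t,∞}.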

  17. Global Convergence
     Theorem (Global convergence, informal). In the limit of a small step-size, a large data set and a large hidden layer, NNs trained with gradient-based methods initialized with "sufficient diversity" converge globally.
     • diversity at initialization is key for the success of training
     • highly non-linear dynamics; regularization allowed
     [Refs]: Chizat, Bach (2018). On the Global Convergence of Gradient Descent for Over-parameterized Models [...].

  18. Numerical Illustrations
     [Figure: population loss at convergence vs m for training a 2-layer NN in the teacher-student setting in 100-d; panels (a) ReLU and (b) sigmoid, with legend entries "particle gradient flow", "convex minimization", "below optim. error".]
     This principle is general: e.g. sparse deconvolution.

  19. Idealized Dynamic
     • parameterize the model with a probability measure μ ∈ P(R^d):
     f(μ, x) = ∫ φ(w, x) dμ(w)

  20. Idealized Dynamic
     • parameterize the model with a probability measure μ ∈ P(R^d):
     f(μ, x) = ∫ φ(w, x) dμ(w)
     • consider the population loss over P(R^d):
     F(μ) := E_{(x,y)}[ℓ(f(μ, x), y)]
     → convex in the linear geometry but non-convex in the Wasserstein geometry

  21. Idealized Dynamic
     • parameterize the model with a probability measure μ ∈ P(R^d):
     f(μ, x) = ∫ φ(w, x) dμ(w)
     • consider the population loss over P(R^d):
     F(μ) := E_{(x,y)}[ℓ(f(μ, x), y)]
     → convex in the linear geometry but non-convex in the Wasserstein geometry
     • define the Wasserstein gradient flow:
     μ_0 ∈ P(R^d),  (d/dt) μ_t = −div(μ_t v_t)  where  v_t(w) = −∇F′(μ_t)(w)
     is the Wasserstein gradient of F.
     [Refs]: Bach (2017). Breaking the Curse of Dimensionality with Convex Neural Networks. Ambrosio, Gigli, Savaré (2008). Gradient Flows in Metric Spaces and in the Space of Probability Measures.
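
In practice the idealized dynamic is simulated by its particle discretization: plain gradient descent on m neurons, an explicit Euler scheme for the flow restricted to empirical measures (each particle carrying mass 1/m). A sketch in a teacher-student setting; the dimensions, widths, step-size, step count and data are illustrative assumptions, not the experiment of the talk.

    import jax
    import jax.numpy as jnp

    def phi(u, x):                                 # one neuron: phi(u, x) = c * relu(a . x + b)
        a, b, c = u[:-2], u[-2], u[-1]
        return c * jax.nn.relu(jnp.dot(a, x) + b)

    def f(particles, x):                           # f(mu_m, x) = (1/m) sum_i phi(w_i, x)
        return jnp.mean(jax.vmap(phi, in_axes=(0, None))(particles, x))

    def loss(particles, X, Y):                     # empirical proxy for the population loss F(mu)
        preds = jax.vmap(f, in_axes=(None, 0))(particles, X)
        return 0.5 * jnp.mean((preds - Y) ** 2)

    d, m, n = 20, 512, 1000
    k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
    teacher = jax.random.normal(k1, (5, d + 2))    # hypothetical narrow teacher network
    X = jax.random.normal(k2, (n, d))
    Y = jax.vmap(f, in_axes=(None, 0))(teacher, X)

    particles = jax.random.normal(k3, (m, d + 2))  # spread-out ("diverse") initialization
    eta = 0.02 * m                                 # factor m: each particle has mass 1/m in mu_m
    step = jax.jit(lambda p: p - eta * jax.grad(loss)(p, X, Y))
    for t in range(500):
        particles = step(particles)
    print(loss(particles, X, Y))                   # loss after training the particle system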
