SLIDE 1

An Empirical Look at the Loss Landscape

HEP AI - September 4, 2018

SLIDE 2

Components of training an image classifier

For a fixed architecture (ResNet-56) we have (a minimal training sketch follows the list):

  • 1. Preprocessing: normalize, shift and flip (show examples)
  • 2. Momentum
  • 3. Weight decay (aka L2 regularization)
  • 4. Learning rate scheduling
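
As a concrete point of reference, here is a minimal sketch of these four ingredients (assuming PyTorch/torchvision; the hyperparameter values and normalization statistics are illustrative, and torchvision's ResNet-18 stands in for ResNet-56, which torchvision does not ship):

```python
import torch
import torchvision
import torchvision.transforms as T

# 1. Preprocessing: normalize, shift (random crop with padding) and flip
train_tf = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = torchvision.models.resnet18(num_classes=10)   # stand-in for ResNet-56
criterion = torch.nn.CrossEntropyLoss()

# 2. Momentum and 3. weight decay (L2) live inside the optimizer...
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# 4. ...and learning rate scheduling sits on top of it
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    for x, y in loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    scheduler.step()
```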

SLIDE 3

Components of training an image classifier

Dataset: CIFAR10 raw

SLIDE 4

Components of training an image classifier

Dataset: CIFAR10 processed (normalize, shift and flip)

SLIDE 5

Components of training an image classifier

With all the ingredients (momentum, weight decay, preprocessing) we get 93.1% accuracy on CIFAR10!

  • Remove momentum only: -1.5%
  • Remove weight decay only: -3.2%
  • Remove preprocessing only: -6.3%
  • Remove all three: -12.5%

Which components are essential?

SLIDE 6

Expressivity and overfitting

  • Regression vs. classification: is there a fundamental reason that makes one harder?
  • Is it always possible to memorize the training set? (9 examples in CIFAR100)
  • What’s happening to the loss when the accuracy is stable?

SLIDE 7

State of Image Recognition - http://clarifai.com/

SLIDE 8

State of Image Recognition - http://clarifai.com/

SLIDE 9

State of Image Recognition - http://clarifai.com/

SLIDE 10

State of Image Recognition - http://clarifai.com/

Is all we do still just fancy curve fitting?

SLIDE 11

Geometry of the training surface

SLIDE 12

The Loss Function

  • 1. Take a dataset and split it into two parts: D_train and D_test
  • 2. Form the loss using only D_train:

        L_train(w) = (1/|D_train|) Σ_{(x,y) ∈ D_train} ℓ(y, f(w; x))

  • 3. Find w* = argmin_w L_train(w)
  • 4. ...and hope that it will work on D_test.
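
In code, the same recipe is only a few lines; a schematic sketch (assuming PyTorch; `model`, `loss_fn` and the data split are placeholders, not a specific implementation from the talk):

```python
import torch

def empirical_loss(model, loss_fn, dataset):
    # L_train(w) = (1/|D_train|) Σ_{(x,y) ∈ D_train} ℓ(y, f(w; x))
    return sum(loss_fn(model(x), y) for x, y in dataset) / len(dataset)

def fit(model, loss_fn, train_set, steps=1000, lr=0.1):
    # "Find w* = argmin L_train(w)" in practice means running an iterative
    # optimizer on the training loss (here: plain full-batch gradient descent).
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        empirical_loss(model, loss_fn, train_set).backward()
        opt.step()
    return model   # ...and hope it also does well on D_test
```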

SLIDE 13

The Loss Function

Some quantities:

  • M : number of parameters, w ∈ R^M
  • N : number of neurons in the first layer
  • P : number of examples in the training set, |D_train|
  • d : number of dimensions in the input, x ∈ R^d
  • k : number of classes in the dataset

Question: When do we call a model over-parametrized?
Question: How do we minimize the high-dimensional, non-convex loss?

SLIDE 14

GD is bad, use SGD

“Stochastic gradient learning in neural networks”, Léon Bottou, 1991

SLIDE 15

GD is bad, use SGD

Bourelly (1988)

SLIDE 16

GD is bad, use SGD

Simple fully-connected network on MNIST: M ∼ 450K (right)

[Plot: cost vs. step number for the 500-300 network; SGD train/test and GD train/test curves]

Average number of mistakes: SGD 174, GD 194

SLIDE 17

GD is bad, use SGD

The network has only 5 neurons in the hidden layer!

SLIDE 18

GD vs SGD in the mean field approach

Take ℓ(y, f(w; x)) = (y − f(w; x))², where f(w; x) = (1/N) Σ_{i=1}^N σ(w_i, x).

Expand the square and take the expectation over the data:

    L(w) = Const + (2/N) Σ_{i=1}^N V(w_i) + (1/N²) Σ_{i,j=1}^N U(w_i, w_j)

Population risk in the large-N limit:

    L(ρ) = Const + 2 ∫ V(w) ρ(dw) + ∫∫ U(w_1, w_2) ρ(dw_1) ρ(dw_2)

Proposition: minimizing the two objectives is the same problem.
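
Expanding the square also fixes what the two potentials are (with Const = E[y²]):

    V(w) = −E_{(x,y)}[ y σ(w, x) ],    U(w_1, w_2) = E_x[ σ(w_1, x) σ(w_2, x) ]

so V measures how well a single neuron aligns with the target and U captures the interaction between pairs of neurons.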

SLIDE 19

GD vs SGD in the mean field approach

Write the gradient update per example and rearrange:

    Δw_i = 2η ∇_{w_i} σ(w_i, x) ( y − (1/N) Σ_{j=1}^N σ(w_j, x) )
         = 2η y ∇_{w_i} σ(w_i, x) − 2η ∇_{w_i} σ(w_i, x) (1/N) Σ_{j=1}^N σ(w_j, x)

Taking the expectation over (past) data gives the update for the i-th neuron:

    E(Δw_i | past) / (2η) = −∇_{w_i} V(w_i) − (1/N) Σ_{j=1}^N ∇_{w_i} U(w_i, w_j)

  • Then pass to the large-N limit (with proper timestep scaling)
  • and write the continuity equation for the density.
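
Written out (up to time-rescaling constants), that continuity equation is the mean-field PDE

    ∂_t ρ_t(w) = ∇_w · ( ρ_t(w) ∇_w Ψ(w; ρ_t) ),    Ψ(w; ρ) = V(w) + ∫ U(w, w′) ρ(dw′),

i.e. a gradient flow of L(ρ) over densities; this is the distributional dynamics studied in Mei, Montanari, Nguyen (2018).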

SLIDE 20

GD vs SGD in the mean field approach

References:

  • 1. Mei, Montanari, Nguyen 2018 (above approach)
  • 2. Sirignano, Spiliopoulos 2018 (harder to read)
  • 3. Rotskoff, Vanden-Eijnden 2018 (additional diffusive and noise terms, as well as a CLT)
  • 4. Wang, Mattingly, Lu 2017 (same approach, different problems)

Is it really the case that in the large N limit, GD and SGD are the same?

SLIDE 21

Quick look into Rotskoff and Vanden-Eijnden

Here θ is learning rate / batch size

SLIDE 22

SGD is really special

Where common wisdom may be true (Keskar et al. 2016):

  • F2: fully connected, TIMIT (M = 1.2M)
  • C1: conv-net, CIFAR10 (M = 1.7M)
  • Similar training error, but a gap in the test error.

SLIDE 23

SGD is really special

Moreover, Keskar et al. (2016) observe that:

  • LB (large batch) → sharp minima
  • SB (small batch) → wide minima

Considerations around the idea of sharp/wide minima:

Pardalos et al. 1993 (more recently: Zecchina et al., Bengio et al., ...)

SLIDE 24

LB vs. SB and outlier eigenvalues of the Hessian

MNIST on a simple fully-connected network. Increasing the batch-size leads to larger outlier eigenvalues.

[Plot: distribution of the largest Hessian eigenvalues (by order), small batch vs. large batch, with a heuristic threshold marked]
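
For reference, a rough sketch of how such outlier eigenvalues can be estimated without ever forming the Hessian, using power iteration on Hessian-vector products (assuming PyTorch; `model`, `loss_fn` and the batch tensors `x`, `y` are placeholders, and this is not necessarily the method used for the plot above):

```python
import torch

def top_hessian_eigenvalue(model, loss_fn, x, y, n_iter=50):
    # Estimate the largest eigenvalue of the Hessian of the batch loss
    # via power iteration on Hessian-vector products (double backprop).
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]           # random start vector
    eigval = torch.tensor(0.0)
    for _ in range(n_iter):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        # Hessian-vector product: differentiate <grad, v> w.r.t. the parameters
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        eigval = sum((hvi * vi).sum() for hvi, vi in zip(hv, v))  # Rayleigh quotient
        v = [hvi.detach() for hvi in hv]
    return eigval.item()
```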

SLIDE 25

Geometry of redundant over-parametrization

Figure: w² (left) vs. (w1·w2)² (right)
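
A one-line computation shows why the redundant parametrization flattens the landscape. Writing the scalar weight as a product, the loss w² becomes g(w1, w2) = (w1·w2)², whose Hessian is

    ∇²g(w1, w2) = [ 2·w2²     4·w1·w2 ]
                  [ 4·w1·w2   2·w1²   ]

At a global minimum such as (w1, w2) = (a, 0) this equals diag(0, 2a²): one eigenvalue vanishes and the minimum sits in a flat valley (w1 arbitrary), whereas the original loss w² has constant curvature 2 everywhere.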

SLIDE 26

Searching for sharp basins

Repeating the LB/SB experiment with a twist

  • 1. Train CIFAR10 with a large batch on a bare AlexNet
  • 2. At the end point, switch to a small batch

[Plot: continuous training in two phases; train/test loss and train/test accuracy vs. number of steps (measurements every 100)]

SLIDE 27

Searching for sharp basins

Keep the two points: end of LB training and end of SB continuation.

  • 1. Extend a line away from the LB solution

[Plot: line interpolation between the end points of the two phases; train/test loss and accuracy vs. interpolation coefficient]

SLIDE 28

Searching for sharp basins

Keep the two points: end of LB training and end of SB continuation.

  • 1. Extend a line away from the LB solution
  • 2. Extend a line away from the SB solution

[Plot: line interpolation between the end points of the two phases; train/test loss and accuracy vs. interpolation coefficient]

SLIDE 29

Searching for sharp basins

Keep the two points: end of LB training and end of SB continuation.

  • 1. Extend a line away from the LB solution
  • 2. Extend a line away from the SB solution
  • 3. Extend a line between the two solutions

[Plot: line interpolation between the end points of the two phases; train/test loss and accuracy vs. interpolation coefficient]
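
The interpolation itself is easy to reproduce; a minimal sketch (assuming PyTorch; `model`, `evaluate` and the two weight dictionaries are placeholders, not the talk's code) that evaluates the network along the line through the LB and SB end points, extending past both ends as in the plots above:

```python
import copy
import torch

def evaluate_along_line(model, w_lb, w_sb, evaluate, alphas):
    # w_lb, w_sb: dicts of parameter tensors for the two end points;
    # alpha = 0 is the LB solution, alpha = 1 the SB solution, and values
    # outside [0, 1] extend the line beyond them.
    probe = copy.deepcopy(model)
    results = []
    for alpha in alphas:
        with torch.no_grad():
            for name, p in probe.named_parameters():
                p.copy_((1 - alpha) * w_lb[name] + alpha * w_sb[name])
        results.append((alpha, evaluate(probe)))  # e.g. train/test loss and accuracy
    return results

# sweep matching the figures: interpolation coefficient from -1.0 to 2.0
alphas = [x / 10 for x in range(-10, 21)]
```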

SLIDE 30

Connecting arbitrary solutions

  • 1. Freeman and Bruna 2017: barriers of order 1/M
  • 2. Draxler et al. 2018: no barriers between solutions

String method video: https://cims.nyu.edu/~eve2/string.htm

SLIDE 31

What about GD + noise vs. SGD?

“A walk with SGD”, Xing et al. 2018
String method video: https://cims.nyu.edu/~eve2/string.htm

SLIDE 32

Back to the beginning

Does this mean that any solution, obtained by any method, is in the same basin?

  • 1. Different algorithms
  • 2. Pre-processing vs. no pre-processing
  • 3. MSE vs. log-loss
  • If so, what’s the threshold for M?
  • Is there an under-parametrized regime in which solutions are disconnected?

SLIDE 33

The End

SLIDE 34

Gauss-Newton decomposition of the Hessian

Loss functions of the output s and the label y:

  • MSE: ℓ(s, y) = (s − y)²
  • Hinge: ℓ(s, y) = max{0, 1 − s·y}
  • NLL: ℓ(s, y) = −s_y + log Σ_{y′} exp(s_{y′})

are all convex in their output s = f(w; x).
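
A quick check of the convexity claim: for MSE, ℓ′′(s) = 2 > 0; the hinge loss is a maximum of affine functions of s and hence convex; and for the NLL the Hessian with respect to the logit vector s is diag(p) − p pᵀ with p = softmax(s), which is positive semi-definite.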

SLIDE 35

Gauss-Newton decomposition of the Hessian

With ℓ ∘ f in mind, the gradient and the Hessian of a single-example loss are:

    ∇ℓ(f(w)) = ℓ′(f(w)) ∇f(w)
    ∇²ℓ(f(w)) = ℓ′′(f(w)) ∇f(w) ∇f(w)ᵀ + ℓ′(f(w)) ∇²f(w)

then average over the training data:

    ∇²L(w) = (1/P) Σ_{i=1}^P ℓ′′(f(w; x_i)) ∇f(w; x_i) ∇f(w; x_i)ᵀ + (1/P) Σ_{i=1}^P ℓ′(f(w; x_i)) ∇²f(w; x_i)
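
As a sanity check, here is a minimal sketch of this decomposition on a toy problem (assuming PyTorch; the scalar model f(w; x) = tanh(⟨w, x⟩), the random data and all names are illustrative, not from the talk). It accumulates the Gauss-Newton and residual terms over the training set and compares their sum against the full autograd Hessian:

```python
import torch

torch.manual_seed(0)
P, d = 6, 3
X, y = torch.randn(P, d), torch.randn(P)
w0 = torch.randn(d)

f = lambda w, x: torch.tanh(w @ x)                                   # toy scalar model
L = lambda w: sum((f(w, X[i]) - y[i]) ** 2 for i in range(P)) / P    # average MSE loss

gn_term = torch.zeros(d, d)        # (1/P) Σ ℓ''(f) ∇f ∇fᵀ
residual_term = torch.zeros(d, d)  # (1/P) Σ ℓ'(f) ∇²f
for i in range(P):
    s = f(w0, X[i])
    grad_f = torch.autograd.functional.jacobian(lambda w: f(w, X[i]), w0)
    hess_f = torch.autograd.functional.hessian(lambda w: f(w, X[i]), w0)
    gn_term += 2.0 * torch.outer(grad_f, grad_f) / P     # ℓ''(s) = 2 for MSE
    residual_term += 2.0 * (s - y[i]) * hess_f / P       # ℓ'(s) = 2 (s − y)

full_hessian = torch.autograd.functional.hessian(L, w0)
print(torch.allclose(full_hessian, gn_term + residual_term, atol=1e-4))  # True
```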
