SLIDE 1

An Empirical Look at the Loss Landscape

HEP AI - September 4, 2018

SLIDE 2

Components of training an image classifier

For a fixed architecture (ResNet-56) we have (a minimal training sketch follows the list):

  • 1. Preprocessing: normalize, shift and flip (show examples)
  • 2. Momentum
  • 3. Weight decay (aka L2 regularization)
  • 4. Learning rate scheduling
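
As a concrete point of reference, here is a minimal sketch of these four ingredients (assuming PyTorch/torchvision; the hyperparameter values and normalization statistics are illustrative, and torchvision's ResNet-18 stands in for ResNet-56, which torchvision does not ship):

```python
import torch
import torchvision
import torchvision.transforms as T

# 1. Preprocessing: normalize, shift (random crop with padding) and flip
train_tf = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = torchvision.models.resnet18(num_classes=10)   # stand-in for ResNet-56
criterion = torch.nn.CrossEntropyLoss()

# 2. Momentum and 3. weight decay (L2) live inside the optimizer...
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# 4. ...and learning rate scheduling sits on top of it
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    for x, y in loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    scheduler.step()
```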

SLIDE 3

Components of training an image classifier

Dataset: CIFAR10 raw

SLIDE 4

Components of training an image classifier

Dataset: CIFAR10 processed (normalize, shift and flip)

SLIDE 5

Components of training an image classifier

With all the ingredients (momentum, weight decay, preprocessing) we get 93.1% accuracy on CIFAR10!

  • Remove momentum only: -1.5%
  • Remove weight decay only: -3.2%
  • Remove preprocessing only: -6.3%
  • Remove all three: -12.5%

Which components are essential?

SLIDE 6

Expressivity and overfitting

  • Regression vs. classification: is there a fundamental reason that makes one harder?
  • Is it always possible to memorize the training set? (9 examples in CIFAR100)
  • What’s happening to the loss when the accuracy is stable?

SLIDE 7

State of Image Recognition - http://clarifai.com/

SLIDE 8

State of Image Recognition - http://clarifai.com/

SLIDE 9

State of Image Recognition - http://clarifai.com/

SLIDE 10

State of Image Recognition - http://clarifai.com/

Is all we do still just fancy curve fitting?

SLIDE 11

Geometry of the training surface

SLIDE 12

The Loss Function

  • 1. Take a dataset and split it into two parts: D_train and D_test
  • 2. Form the loss using only D_train:

        L_train(w) = (1/|D_train|) Σ_{(x,y) ∈ D_train} ℓ(y, f(w; x))

  • 3. Find w* = argmin_w L_train(w)
  • 4. ...and hope that it will work on D_test.
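
In code, the same recipe is only a few lines; a schematic sketch (assuming PyTorch; `model`, `loss_fn` and the data split are placeholders, not a specific implementation from the talk):

```python
import torch

def empirical_loss(model, loss_fn, dataset):
    # L_train(w) = (1/|D_train|) Σ_{(x,y) ∈ D_train} ℓ(y, f(w; x))
    return sum(loss_fn(model(x), y) for x, y in dataset) / len(dataset)

def fit(model, loss_fn, train_set, steps=1000, lr=0.1):
    # "Find w* = argmin L_train(w)" in practice means running an iterative
    # optimizer on the training loss (here: plain full-batch gradient descent).
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        empirical_loss(model, loss_fn, train_set).backward()
        opt.step()
    return model   # ...and hope it also does well on D_test
```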

SLIDE 13

The Loss Function

Some quantities:

  • M : number of parameters, w ∈ R^M
  • N : number of neurons in the first layer
  • P : number of examples in the training set, |D_train|
  • d : number of dimensions in the input, x ∈ R^d
  • k : number of classes in the dataset

Question: When do we call a model over-parametrized?
Question: How do we minimize the high-dimensional, non-convex loss?

SLIDE 14

GD is bad, use SGD

“Stochastic gradient learning in neural networks”, Léon Bottou, 1991

SLIDE 15

GD is bad, use SGD

Bourelly (1988)

SLIDE 16

GD is bad, use SGD

Simple fully-connected network on MNIST: M ∼ 450K (right)

[Plot: cost vs. step number for the 500-300 network; SGD train/test and GD train/test curves]

Average number of mistakes: SGD 174, GD 194

SLIDE 17

GD is bad, use SGD

The network has only 5 neurons in the hidden layer!

SLIDE 18

GD vs SGD in the mean field approach

Take ℓ(y, f(w; x)) = (y − f(w; x))², where f(w; x) = (1/N) Σ_{i=1}^N σ(w_i, x).

Expand the square and take the expectation over the data:

    L(w) = Const + (2/N) Σ_{i=1}^N V(w_i) + (1/N²) Σ_{i,j=1}^N U(w_i, w_j)

Population risk in the large-N limit:

    L(ρ) = Const + 2 ∫ V(w) ρ(dw) + ∫∫ U(w_1, w_2) ρ(dw_1) ρ(dw_2)

Proposition: minimizing the two objectives is the same problem.
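
Expanding the square also fixes what the two potentials are (with Const = E[y²]):

    V(w) = −E_{(x,y)}[ y σ(w, x) ],    U(w_1, w_2) = E_x[ σ(w_1, x) σ(w_2, x) ]

so V measures how well a single neuron aligns with the target and U captures the interaction between pairs of neurons.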

SLIDE 19

GD vs SGD in the mean field approach

Write the gradient update per example and rearrange:

    Δw_i = 2η ∇_{w_i} σ(w_i, x) ( y − (1/N) Σ_{j=1}^N σ(w_j, x) )
         = 2η y ∇_{w_i} σ(w_i, x) − 2η ∇_{w_i} σ(w_i, x) (1/N) Σ_{j=1}^N σ(w_j, x)

Taking the expectation over (past) data gives the update for the i-th neuron:

    E(Δw_i | past) / (2η) = −∇_{w_i} V(w_i) − (1/N) Σ_{j=1}^N ∇_{w_i} U(w_i, w_j)

  • Then pass to the large-N limit (with proper timestep scaling)
  • and write the continuity equation for the density.
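
Written out (up to time-rescaling constants), that continuity equation is the mean-field PDE

    ∂_t ρ_t(w) = ∇_w · ( ρ_t(w) ∇_w Ψ(w; ρ_t) ),    Ψ(w; ρ) = V(w) + ∫ U(w, w′) ρ(dw′),

i.e. a gradient flow of L(ρ) over densities; this is the distributional dynamics studied in Mei, Montanari, Nguyen (2018).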

SLIDE 20

GD vs SGD in the mean field approach

References:

  • 1. Mei, Montanari, Nguyen 2018 (above approach)
  • 2. Sirignano, Spiliopoulos 2018 (harder to read)
  • 3. Rotskoff, Vanden-Eijnden 2018 (additional diffusive and noise terms, as well as a CLT)
  • 4. Wang, Mattingly, Lu 2017 (same approach, different problems)

Is it really the case that in the large N limit, GD and SGD are the same?

SLIDE 21

Quick look into Rotskoff and Vanden-Eijnden

Here θ is learning rate / batch size

SLIDE 22

SGD is really special

Where common wisdom may be true (Keskar et al. 2016):

  • F2: fully connected, TIMIT (M = 1.2M)
  • C1: conv-net, CIFAR10 (M = 1.7M)
  • Similar training error, but a gap in the test error.

SLIDE 23

SGD is really special

Moreover, Keskar et al. (2016) observe that:

  • LB (large batch) → sharp minima
  • SB (small batch) → wide minima

Considerations around the idea of sharp/wide minima:

Pardalos et al. 1993 (more recently: Zecchina et al., Bengio et al., ...)

SLIDE 24

LB vs. SB and outlier eigenvalues of the Hessian

MNIST on a simple fully-connected network. Increasing the batch-size leads to larger outlier eigenvalues.

[Plot: distribution of the largest Hessian eigenvalues (by order), small batch vs. large batch, with a heuristic threshold marked]
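
For reference, a rough sketch of how such outlier eigenvalues can be estimated without ever forming the Hessian, using power iteration on Hessian-vector products (assuming PyTorch; `model`, `loss_fn` and the batch tensors `x`, `y` are placeholders, and this is not necessarily the method used for the plot above):

```python
import torch

def top_hessian_eigenvalue(model, loss_fn, x, y, n_iter=50):
    # Estimate the largest eigenvalue of the Hessian of the batch loss
    # via power iteration on Hessian-vector products (double backprop).
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]           # random start vector
    eigval = torch.tensor(0.0)
    for _ in range(n_iter):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        # Hessian-vector product: differentiate <grad, v> w.r.t. the parameters
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        eigval = sum((hvi * vi).sum() for hvi, vi in zip(hv, v))  # Rayleigh quotient
        v = [hvi.detach() for hvi in hv]
    return eigval.item()
```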

SLIDE 25

Geometry of redundant over-parametrization

Figure: w² (left) vs. (w1·w2)² (right)
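
A one-line computation shows why the redundant parametrization flattens the landscape. Writing the scalar weight as a product, the loss w² becomes g(w1, w2) = (w1·w2)², whose Hessian is

    ∇²g(w1, w2) = [ 2·w2²     4·w1·w2 ]
                  [ 4·w1·w2   2·w1²   ]

At a global minimum such as (w1, w2) = (a, 0) this equals diag(0, 2a²): one eigenvalue vanishes and the minimum sits in a flat valley (w1 arbitrary), whereas the original loss w² has constant curvature 2 everywhere.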

SLIDE 26

Searching for sharp basins

Repeating the LB/SB experiment with a twist

  • 1. Train CIFAR10 with a large batch on a bare AlexNet
  • 2. At the end point, switch to a small batch

[Plot: continuous training in two phases; train/test loss and train/test accuracy vs. number of steps (measurements every 100)]

SLIDE 27

Searching for sharp basins

Keep the two points: end of LB training and end of SB continuation.

  • 1. Extend a line away from the LB solution

[Plot: line interpolation between the end points of the two phases; train/test loss and accuracy vs. interpolation coefficient]

SLIDE 28

Searching for sharp basins

Keep the two points: end of LB training and end of SB continuation.

  • 1. Extend a line away from the LB solution
  • 2. Extend a line away from the SB solution

[Plot: line interpolation between the end points of the two phases; train/test loss and accuracy vs. interpolation coefficient]

SLIDE 29

Searching for sharp basins

Keep the two points: end of LB training and end of SB continuation.

  • 1. Extend a line away from the LB solution
  • 2. Extend a line away from the SB solution
  • 3. Extend a line between the two solutions

[Plot: line interpolation between the end points of the two phases; train/test loss and accuracy vs. interpolation coefficient]
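
The interpolation itself is easy to reproduce; a minimal sketch (assuming PyTorch; `model`, `evaluate` and the two weight dictionaries are placeholders, not the talk's code) that evaluates the network along the line through the LB and SB end points, extending past both ends as in the plots above:

```python
import copy
import torch

def evaluate_along_line(model, w_lb, w_sb, evaluate, alphas):
    # w_lb, w_sb: dicts of parameter tensors for the two end points;
    # alpha = 0 is the LB solution, alpha = 1 the SB solution, and values
    # outside [0, 1] extend the line beyond them.
    probe = copy.deepcopy(model)
    results = []
    for alpha in alphas:
        with torch.no_grad():
            for name, p in probe.named_parameters():
                p.copy_((1 - alpha) * w_lb[name] + alpha * w_sb[name])
        results.append((alpha, evaluate(probe)))  # e.g. train/test loss and accuracy
    return results

# sweep matching the figures: interpolation coefficient from -1.0 to 2.0
alphas = [x / 10 for x in range(-10, 21)]
```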

SLIDE 30

Connecting arbitrary solutions

  • 1. Freeman and Bruna 2017: barriers of order 1/M
  • 2. Draxler et al. 2018: no barriers between solutions

String method video: https://cims.nyu.edu/~eve2/string.htm

SLIDE 31

What about GD + noise vs. SGD?

“A walk with SGD”, Xing et al. 2018
String method video: https://cims.nyu.edu/~eve2/string.htm

SLIDE 32

Back to the beginning

Does this mean that any solution, obtained by any method, is in the same basin?

  • 1. Different algorithms
  • 2. Pre-processing vs. no pre-processing
  • 3. MSE vs. log-loss
  • If so, what’s the threshold for M?
  • Is there an under-parametrized regime in which solutions are disconnected?

SLIDE 33

The End

SLIDE 34

Gauss-Newton decomposition of the Hessian

Loss functions of the output s and the label y:

  • MSE: ℓ(s, y) = (s − y)²
  • Hinge: ℓ(s, y) = max{0, 1 − s·y}
  • NLL: ℓ(s, y) = −s_y + log Σ_{y′} exp(s_{y′})

are all convex in their output s = f(w; x).
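
A quick check of the convexity claim: for MSE, ℓ′′(s) = 2 > 0; the hinge loss is a maximum of affine functions of s and hence convex; and for the NLL the Hessian with respect to the logit vector s is diag(p) − p pᵀ with p = softmax(s), which is positive semi-definite.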

SLIDE 35

Gauss-Newton decomposition of the Hessian

With ℓ ∘ f in mind, the gradient and the Hessian of a single-example loss are:

    ∇ℓ(f(w)) = ℓ′(f(w)) ∇f(w)
    ∇²ℓ(f(w)) = ℓ′′(f(w)) ∇f(w) ∇f(w)ᵀ + ℓ′(f(w)) ∇²f(w)

then average over the training data:

    ∇²L(w) = (1/P) Σ_{i=1}^P ℓ′′(f(w; x_i)) ∇f(w; x_i) ∇f(w; x_i)ᵀ + (1/P) Σ_{i=1}^P ℓ′(f(w; x_i)) ∇²f(w; x_i)
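
As a sanity check, here is a minimal sketch of this decomposition on a toy problem (assuming PyTorch; the scalar model f(w; x) = tanh(⟨w, x⟩), the random data and all names are illustrative, not from the talk). It accumulates the Gauss-Newton and residual terms over the training set and compares their sum against the full autograd Hessian:

```python
import torch

torch.manual_seed(0)
P, d = 6, 3
X, y = torch.randn(P, d), torch.randn(P)
w0 = torch.randn(d)

f = lambda w, x: torch.tanh(w @ x)                                   # toy scalar model
L = lambda w: sum((f(w, X[i]) - y[i]) ** 2 for i in range(P)) / P    # average MSE loss

gn_term = torch.zeros(d, d)        # (1/P) Σ ℓ''(f) ∇f ∇fᵀ
residual_term = torch.zeros(d, d)  # (1/P) Σ ℓ'(f) ∇²f
for i in range(P):
    s = f(w0, X[i])
    grad_f = torch.autograd.functional.jacobian(lambda w: f(w, X[i]), w0)
    hess_f = torch.autograd.functional.hessian(lambda w: f(w, X[i]), w0)
    gn_term += 2.0 * torch.outer(grad_f, grad_f) / P     # ℓ''(s) = 2 for MSE
    residual_term += 2.0 * (s - y[i]) * hess_f / P       # ℓ'(s) = 2 (s − y)

full_hessian = torch.autograd.functional.hessian(L, w0)
print(torch.allclose(full_hessian, gn_term + residual_term, atol=1e-4))  # True
```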
