

SLIDE 1

The role of over-parametrisation in NNs

Levent Sagun, EPFL

SLIDE 2

Classical bias-variance dilemma

[Figure: classical U-shaped curve; x-axis: Capacity, y-axis: Error; curves: Train, Test]

SLIDE 3

Classical bias-variance dilemma, or?

[Figure: Error vs Capacity; curves: Train, Test]

SLIDE 4

Observation 1

GD vs SGD

SLIDE 5

Moving on the fixed landscape

  • 1. Take an iid dataset and split it into two parts, D_train & D_test
  • 2. Form the loss using only D_train:
    L_train(θ) = (1 / |D_train|) ∑_{(x,y) ∈ D_train} ℓ(y, f(θ; x))
  • 3. Find: θ* = arg min_θ L_train(θ)
  • 4. ...and hope that it will work on D_test

N: number of parameters, θ ∈ R^N
P: number of examples in the training set, |D_train|
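A minimal sketch of the four steps on a toy problem (the linear model, the squared per-sample loss ℓ, and the closed-form arg min are illustrative assumptions, not details from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. An iid dataset, split into D_train and D_test
d = 10
theta_star = rng.normal(size=d)
X = rng.normal(size=(300, d))
y = X @ theta_star + 0.1 * rng.normal(size=300)
X_tr, y_tr, X_te, y_te = X[:200], y[:200], X[200:], y[200:]

# 2. L_train(theta): average of the per-sample loss l(y, f(theta; x)) over D_train only
def L(theta, X, y):
    return np.mean((X @ theta - y) ** 2)   # squared loss for this linear toy model

# 3. theta_hat = arg min L_train (closed form here, since the toy model is linear)
theta_hat = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]

# 4. ...and hope that it also works on D_test
print("train loss:", L(theta_hat, X_tr, y_tr), "test loss:", L(theta_hat, X_te, y_te))
```

Here N = d = 10 and P = 200, so this toy case is under-parametrised; the slides are about the opposite regime.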

SLIDE 6

Moving on the fixed landscape

  • 1. Take an iid dataset and split it into two parts, D_train & D_test
  • 2. Form the loss using only D_train:
    L_train(θ) = (1 / |D_train|) ∑_{(x,y) ∈ D_train} ℓ(y, f(θ; x))
  • 3. Find: θ* = arg min_θ L_train(θ) by SGD
  • 4. ...and hope that it will work on D_test

N: number of parameters, θ ∈ R^N
P: number of examples in the training set, |D_train|
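A minimal sketch of full-batch GD versus minibatch SGD on the same kind of toy objective (the model, step size, iteration count, and batch size are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy D_train for a linear model with squared loss
d, P = 20, 200
theta_star = rng.normal(size=d)
X = rng.normal(size=(P, d))
y = X @ theta_star + 0.1 * rng.normal(size=P)

def L_train(theta):
    return np.mean((X @ theta - y) ** 2)

def grad(theta, idx):
    r = X[idx] @ theta - y[idx]
    return 2 * X[idx].T @ r / len(idx)

theta_gd, theta_sgd = np.zeros(d), np.zeros(d)
for _ in range(500):
    theta_gd -= 0.01 * grad(theta_gd, np.arange(P))                  # GD: full batch every step
    theta_sgd -= 0.01 * grad(theta_sgd, rng.integers(0, P, size=8))  # SGD: random minibatch of 8
print("GD:", L_train(theta_gd), "SGD:", L_train(theta_sgd))          # both reach a low training loss
```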

SLIDE 7

“Stochastic gradient learning in neural networks”, Léon Bottou, 1991

GD is bad, use SGD

SLIDE 8

GD is bad, use SGD

Bourrely, 1988

SLIDE 9

GD is the same as SGD

Fully connected network on MNIST: N ∼ 450K

Sagun, Guney, LeCun, Ben Arous 2014

SLIDE 10

Different regimes depending on N

Bourrely, 1988

SLIDE 11

GD is the same as SGD

Fully connected network on MNIST: N ∼ 450K

Average number of mistakes: SGD 174, GD 194

Sagun, Guney, LeCun, Ben Arous 2014

SLIDE 12

GD is the same as SGD

Further empirical confirmations on:
  • the over-parametrised optimization landscape (Sagun, Guney, Ben Arous, LeCun 2014):
    Teacher-student setup, landscape of the p-spin model, GD vs SGD on fully-connected MNIST
  • more on GD vs. SGD (together with Bottou in 2016):
    Scrambled labels, noisy inputs, sum mod 10, ...

SLIDE 13

Regime where SGD is really special?

Where common wisdom may be true (Keskar et al. 2016): similar training error, but a gap in the test error.

Fully connected on TIMIT: N = 1.2M; conv-net on CIFAR10: N = 1.7M

SLIDE 14

The 'generalization gap' can be filled

Jastrzębski et al. 2018; Goyal et al. 2018; Shallue and Lee et al. 2018; McCandlish et al. 2018; Smith et al. 2018

SLIDE 15

The 'generalization gap' can be filled

Jastrzębski et al. 2018; Goyal et al. 2018; Shallue and Lee et al. 2018; McCandlish et al. 2018; Smith et al. 2018

SLIDE 16

The 'generalization gap' can be filled

Jastrzębski et al. 2018; Goyal et al. 2018; Shallue and Lee et al. 2018; McCandlish et al. 2018; Smith et al. 2018

SLIDE 17

The 'generalization gap' can be filled

Jastrzębski et al. 2018; Goyal et al. 2018; Shallue and Lee et al. 2018; McCandlish et al. 2018; Smith et al. 2018

SLIDE 18

The 'generalization gap' can be filled

Why is it important?

SLIDE 19

Large batch allows parallel training

SLIDE 20

Large batch allows parallel training

SLIDE 21

Large batch allows parallel training

SLIDE 22

Large batch allows parallel training

SLIDE 23

SGD noise is not Gaussian

A remark on SGD noise...

SLIDE 24

SGD noise is not Gaussian

Jastrzębski et al. 2018; Goyal et al. 2018; Shallue and Lee et al. 2018; McCandlish et al. 2018; Smith et al. 2018

SLIDE 25

SGD noise is not Gaussian

But the noise is not Gaussian!

Jastrzębski et al. 2018; Goyal et al. 2018; Shallue and Lee et al. 2018; McCandlish et al. 2018; Smith et al. 2018

SLIDE 26

SGD noise is not Gaussian

But the noise is not Gaussian! (Simsekli, Sagun, Gurbuzbalaban 2019)
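One hedged way to probe this claim numerically is to collect minibatch gradient noise and look at a simple non-Gaussianity statistic such as excess kurtosis. The toy convex problem below only illustrates the measurement (and may well look nearly Gaussian), whereas Simsekli, Sagun, Gurbuzbalaban 2019 report heavy-tailed noise for deep networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# logistic regression on synthetic data, used only to define minibatch gradients
d, n = 10, 5000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def grad(w, idx):
    p = 1 / (1 + np.exp(-X[idx] @ w))
    return X[idx].T @ (p - y[idx]) / len(idx)

w = rng.normal(size=d)
full_grad = grad(w, np.arange(n))
noise = np.array([grad(w, rng.integers(0, n, size=32)) - full_grad for _ in range(5000)])

# excess kurtosis per coordinate: ~0 for Gaussian noise, clearly positive for heavy tails
z = (noise - noise.mean(axis=0)) / noise.std(axis=0)
print((z ** 4).mean(axis=0) - 3.0)
```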

SLIDE 27

Lessons from Observation 1

Optimization of the training function is easy ... as long as there are enough parameters.
The effects of SGD are a little more subtle ... but the exact reasons are somewhat unclear.

SLIDE 28

Observation 2

A look at the bottom of the loss

SLIDE 29

Different kinds of minima

Continuing with Keskar et al. (2016): large-batch (LB) minima are sharp, small-batch (SB) minima are wide...
Also see Jastrzębski et al. (2018), Chaudhari et al. (2016)...
Older considerations: Pardalos et al. (1993)
Sharpness depends on parametrization: Dinh et al. (2017)

SLIDE 30

Different kinds of minima

Continuing with Keskar et al. (2016): large-batch (LB) minima are sharp, small-batch (SB) minima are wide...
Also see Jastrzębski et al. (2018), Chaudhari et al. (2016)...
Older considerations: Pardalos et al. (1993)
Sharpness depends on parametrization: Dinh et al. (2017)

SLIDE 31

Searching for sharp basins

Repeat LB/SB with a twist: first train with LB, then switch to SB.

Sagun, LeCun, Bottou 2016 & Sagun, Evci, Guney, Dauphin, Bottou 2017

SLIDE 32

Searching for sharp basins

(1) line away from LB

Sagun, LeCun, Bottou 2016 & Sagun, Evci, Guney, Dauphin, Bottou 2017

SLIDE 33

Searching for sharp basins

(1) line away from LB, (2) line away from SB

Sagun, LeCun, Bottou 2016 & Sagun, Evci, Guney, Dauphin, Bottou 2017

SLIDE 34

Searching for sharp basins

(1) line away from LB, (2) line away from SB, (3) line in-between (see the sketch below)

Sagun, LeCun, Bottou 2016 & Sagun, Evci, Guney, Dauphin, Bottou 2017
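A minimal sketch of evaluating the training loss along a straight line between two solutions, one found with a large batch and one with a small batch; the tiny tanh network, the data, and the hyper-parameters are illustrative assumptions, and whether a barrier shows up along the line depends on the particular runs:

```python
import numpy as np

rng = np.random.default_rng(0)

# tiny 1-hidden-layer regression problem
d, h, P = 5, 20, 100
Xd = rng.normal(size=(P, d))
yd = np.sin(Xd @ rng.normal(size=d))

def unpack(theta):
    return theta[:d * h].reshape(h, d), theta[d * h:]

def loss(theta, idx):
    W1, w2 = unpack(theta)
    return np.mean((np.tanh(Xd[idx] @ W1.T) @ w2 - yd[idx]) ** 2)

def grad(theta, idx):
    W1, w2 = unpack(theta)
    Xb, yb = Xd[idx], yd[idx]
    A = np.tanh(Xb @ W1.T)                                   # hidden activations
    r = A @ w2 - yb                                          # residuals
    g_w2 = 2 * A.T @ r / len(idx)
    g_W1 = 2 * ((r[:, None] * (1 - A ** 2) * w2[None, :]).T @ Xb) / len(idx)
    return np.concatenate([g_W1.ravel(), g_w2])

def train(batch, seed, steps=3000, lr=0.05):
    r = np.random.default_rng(seed)
    theta = 0.1 * r.normal(size=d * h + h)
    for _ in range(steps):
        theta -= lr * grad(theta, r.integers(0, P, size=batch))
    return theta

theta_LB = train(batch=P, seed=1)   # "large batch" (here: full batch)
theta_SB = train(batch=5, seed=2)   # "small batch"

# training loss along the line (1 - t) * theta_LB + t * theta_SB
for t in np.linspace(0, 1, 11):
    print(round(t, 1), loss((1 - t) * theta_LB + t * theta_SB, np.arange(P)))
```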

SLIDE 35

Geometry of critical points

Check out the Taylor expansion for local geometry:

L_train(θ + Δθ) ≈ L_train(θ) + Δθᵀ ∇L_train(θ) + ½ Δθᵀ ∇²L_train(θ) Δθ

Local geometry at a critical point (eigenvalues of the Hessian ∇²L_train):
  • all positive → local min
  • all negative → local max
  • some negative → saddle

Moving along eigenvectors & sizes of eigenvalues
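A minimal sketch of looking at these eigenvalues for a tiny fully-connected network (PyTorch on random data; the architecture and sizes are illustrative assumptions), in the spirit of the Hessian spectra on the next slides:

```python
import torch

torch.manual_seed(0)
X = torch.randn(64, 5)
y = torch.randn(64, 1)

def loss_fn(flat_params):
    # tiny 5 -> 10 -> 1 fully-connected net, parameters passed as one flat vector
    W1 = flat_params[:50].reshape(10, 5)
    b1 = flat_params[50:60]
    W2 = flat_params[60:70].reshape(1, 10)
    b2 = flat_params[70:71]
    pred = torch.relu(X @ W1.T + b1) @ W2.T + b2
    return ((pred - y) ** 2).mean()

theta = torch.randn(71)
H = torch.autograd.functional.hessian(loss_fn, theta)    # (71, 71) Hessian of L at theta
eigs = torch.linalg.eigvalsh(H)                           # sorted eigenvalues
print(eigs[:3], eigs[-3:])   # signs tell min/max/saddle; a bulk near zero plus a few outliers is typical
```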

SLIDE 36

A look through the local curvature

Eigenvalues of the Hessian at the beginning and at the end Sagun, LeCun, Bottou 2016 & Sagun, Evci, Guney, Dauphin, Bottou 2017

SLIDE 37

A look through the local curvature

Eigenvalues of the Hessian at the beginning and at the end Sagun, LeCun, Bottou 2016 & Sagun, Evci, Guney, Dauphin, Bottou 2017

SLIDE 38

A look through the local curvature

Increasing the batch-size leads to larger outlier eigenvalues: Sagun, LeCun, Bottou 2016 & Sagun, Evci, Guney, Dauphin, Bottou 2017

SLIDE 39

A look at the structure of the loss

Recall the loss per sample, ℓ(y, f(θ, x)): ℓ is convex (MSE, NLL, hinge...), f is non-linear (CNN, FC with ReLU...).

We can see the Hessian of the loss as:

∇²ℓ(f) = ℓ″(f) ∇f ∇fᵀ + ℓ′(f) ∇²f

A detailed study on this can be found in Papyan 2019.
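A minimal numerical check of this decomposition for a scalar-output toy model with ℓ(f) = ½(f − y)², so that ℓ″ = 1 and ℓ′ = f − y (the model and loss are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
x = torch.randn(3)
y_target = torch.tensor(0.7)

def f(theta):                         # tiny non-linear scalar model
    return torch.tanh(theta[:3] @ x) * theta[3]

def loss(theta):                      # convex outer loss: 1/2 * (f - y)^2
    return 0.5 * (f(theta) - y_target) ** 2

theta0 = torch.randn(4)
H_loss = torch.autograd.functional.hessian(loss, theta0)        # full Hessian of the loss
grad_f = torch.autograd.functional.jacobian(f, theta0)          # ∇f
H_f = torch.autograd.functional.hessian(f, theta0)              # ∇²f
l_prime = f(theta0) - y_target                                  # ℓ'(f); ℓ''(f) = 1 here
H_decomp = torch.outer(grad_f, grad_f) + l_prime * H_f
print(torch.allclose(H_loss, H_decomp, atol=1e-5))              # True: the two terms reproduce the Hessian
```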

SLIDE 40

More on the lack of barriers

  • 1. Freeman and Bruna 2017: barriers of order 1/N
  • 2. Baity-Jesi & Sagun et al. 2018: no barriers in SGD dynamics
  • 3. Xing et al. 2018: no barrier crossing in SGD dynamics
  • 4. Garipov et al. 2018: no barriers between solutions
  • 5. Draxler et al. 2018: no barriers between solutions

SLIDE 41

More on the lack of barriers

  • 1. Freeman and Bruna 2017: barriers of order 1/N
  • 2. Baity-Jesi et al. 2018: no barrier crossing in SGD dynamics
  • 3. Xing et al. 2018: no barrier crossing in SGD dynamics
  • 4. Garipov et al. 2018: no barriers between solutions
  • 5. Draxler et al. 2018: no barriers between solutions

SLIDE 42

More on the lack of barriers

  • 1. Freeman and Bruna 2017: barriers of order 1/N
  • 2. Baity-Jesi et al. 2018: no barrier crossing in SGD dynamics
  • 3. Xing et al. 2018: no barrier crossing in SGD dynamics
  • 4. Garipov et al. 2018: no barriers between solutions
  • 5. Draxler et al. 2018: no barriers between solutions

SLIDE 43

More on the lack of barriers

  • 1. Freeman and Bruna 2017: barriers of order 1/N
  • 2. Baity-Jesi & Sagun et al. 2018: no barriers in SGD dynamics
  • 3. Xing et al. 2018: no barrier crossing in SGD dynamics
  • 4. Garipov et al. 2018: no barriers between solutions
  • 5. Draxler et al. 2018: no barriers between solutions

SLIDE 44

Lessons from Observation 2

A large and connected set of solutions ... possibly only for large N.
The visible effects of SGD are on a tiny subspace ... again, the exact reasons are somewhat unclear.

SLIDE 45

A simple example

SLIDE 46

Lessons from observations

Observation 1: easy to optimize. Observation 2: flat bottom.

f(w) = w²,  f(w₁, w₂) = (w₁ w₂)²

See Lopez-Paz, Sagun 2018 & Gur-Ari, Roberts, Dyer 2018
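A minimal sketch of the two observations on the over-parametrised toy function f(w₁, w₂) = (w₁ w₂)², whose global minima form a connected set ({w₁ = 0} ∪ {w₂ = 0}) with a flat direction along it; the learning rate and starting point are illustrative assumptions:

```python
import numpy as np

def f(w):                       # over-parametrised toy loss: (w1 * w2)^2
    return (w[0] * w[1]) ** 2

def grad(w):
    w1, w2 = w
    return np.array([2 * w1 * w2 ** 2, 2 * w1 ** 2 * w2])

def hessian(w):                 # analytic Hessian of (w1 * w2)^2
    w1, w2 = w
    return np.array([[2 * w2 ** 2, 4 * w1 * w2],
                     [4 * w1 * w2, 2 * w1 ** 2]])

w = np.array([1.5, -0.8])
for _ in range(200):            # plain gradient descent: easy to optimize
    w -= 0.1 * grad(w)

print(f(w))                              # ~0: a global minimum is reached
print(np.linalg.eigvalsh(hessian(w)))    # one eigenvalue ~0: flat direction along the minimum set
```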

SLIDE 47

Defining over-parametrization

Several joint works with: Mario Geiger, Stefano Spigler, Marco Baity-Jesi, Stephane d'Ascoli, Arthur Jacot, Franck Gabriel, Clement Hongler, Giulio Biroli, & Matthieu Wyart

SLIDE 48

Puzzles with partial answers

  • 1. For large N, the dynamics don't get stuck → When is the training landscape nice?
  • 2. Often N >> P, yet it doesn't overfit → What is the relationship of the landscape with generalization?

N: number of parameters, θ ∈ R^N
P: number of examples in the training set, |D_train|

SLIDE 49

Sharper vision through hinge loss

Switch to squared hinge from cross-entropy → precise stopping condition, clear stability condition.

ℓ(y, f(θ, x)) = ½ max(0, 1 − y f(θ, x))²

The Hessian becomes a sum over unsatisfied constraints:

∇²ℓ(f) = ∇f ∇fᵀ + (1 − y f) ∇²f

A local minimum is only possible if (very loose): N/2 < P
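A minimal sketch of the per-sample squared hinge loss and of the resulting stopping condition (labels in {−1, +1}; the example outputs are made up):

```python
import numpy as np

def squared_hinge(y, f):
    # per-sample loss: 1/2 * max(0, 1 - y * f)^2, with y in {-1, +1}
    return 0.5 * np.maximum(0.0, 1.0 - y * f) ** 2

y = np.array([+1, -1, +1, +1])
f = np.array([1.3, -2.0, 0.4, 1.0])          # model outputs f(theta, x) on four samples

print(squared_hinge(y, f))                   # [0, 0, 0.18, 0]: only margins below 1 contribute
unsatisfied = (y * f < 1.0)                  # the "unsatisfied constraints" in the Hessian sum
print(unsatisfied.sum())                     # precise stopping condition: train loss is 0 iff this is 0
```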

SLIDE 50

Sharp transition to OP in NNs

N: number of parameters, θ ∈ R^N
P: number of examples in the training set, |D_train|
N*: critical number of parameters that fits D_train

[Figure: phase diagram with the jamming line N* and an upper bound]

SLIDE 51

Sharp transition to OP in NNs

N: number of parameters, θ ∈ R^N
P: number of examples in the training set, |D_train|
N*: critical number of parameters that fits D_train

[Figure: phase diagram with the jamming line N*, the upper bound N = 2P, and regions with L_train > 0 and L_train = 0]

SLIDE 52

Sharp transition to OP in NNs

N: number of parameters, θ ∈ R^N
P: number of examples in the training set, |D_train|
N*: critical number of parameters that fits D_train

SLIDE 53

Sharp transition to OP in NNs

N: number of parameters, θ ∈ R^N
P: number of examples in the training set, |D_train|
N*: critical number of parameters that fits D_train

[Figure: phase diagram with the jamming line N* and an upper bound]

SLIDE 54

Jamming is linked to Generalization

[Figure: test error vs N and vs N/N*]

Spigler, Geiger, d'Ascoli, Sagun, Biroli, Wyart 2018

SLIDE 55

Recent independent work

Belkin et al., December 31, 2018. The peak itself is also observed in Advani and Saxe 2017. See also Neal et al. 2018 and Neyshabur et al. 2015 & 2017 for related work.

SLIDE 56

Ensembling improves generalization

[Figure: test error of ensembled predictors vs N]

Key: reducing fluctuations or increased regularization with N
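A minimal sketch of the ensembling idea: train several predictors that differ only in their random init and minibatch order, then average their outputs; the linear model with squared hinge loss and the synthetic data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic binary classification data
d = 20
w_true = rng.normal(size=d)
X_tr, X_te = rng.normal(size=(200, d)), rng.normal(size=(2000, d))
y_tr = np.sign(X_tr @ w_true + 0.5 * rng.normal(size=200))
y_te = np.sign(X_te @ w_true)

def train_sgd(seed, steps=2000, lr=0.1, batch=8):
    r = np.random.default_rng(seed)
    w = r.normal(size=d)                                  # each run: its own init and minibatch order
    for _ in range(steps):
        idx = r.integers(0, len(X_tr), size=batch)
        m = y_tr[idx] * (X_tr[idx] @ w)                   # margins
        g = -((y_tr[idx] * (m < 1)) * np.maximum(0, 1 - m))[:, None] * X_tr[idx]
        w -= lr * g.mean(axis=0)                          # SGD step on the squared hinge loss
    return w

ws = [train_sgd(s) for s in range(10)]
single_err = np.mean(np.sign(X_te @ ws[0]) != y_te)
ens_pred = np.sign(np.mean([X_te @ w for w in ws], axis=0))   # average the outputs, then threshold
ens_err = np.mean(ens_pred != y_te)
print("single:", single_err, "ensemble:", ens_err)            # the ensemble is typically at least as good
```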

SLIDE 57

Ensembling improves generalization

Extending to SGD on CNNs with CIFAR10.

[Figure: test error vs number of filters in each CNN layer]

Sagun, Geiger, d'Ascoli, Spigler, Biroli, Wyart 2019 (unpublished)

SLIDE 58

Concluding remarks

Potential impact:
  • A clear definition of OP can help guide the design of models
  • At finite P we have a proposal for the best generalization
  • New directions for theoretical understanding: Belkin et al., 18 March 2019; Hastie et al., 19 March 2019

SLIDE 59

Future work

On the model-data-algorithm interactions:
  • Can we disentangle the algorithm?
  • Can we entangle the model-data interactions to unite a model complexity measure, a data complexity measure, and the role of priors on performance?

SLIDE 60

Thank You!