SLIDE 1

Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks

Alexander Shevchenko, Marco Mondelli

SLIDE 2

Neural Network Training

From a theoretical perspective, training neural networks is difficult (NP-hardness, local/disconnected minima, …), but in practice it works remarkably well!

Two key ingredients of success:

  • Over-parameterization
  • (Stochastic) gradient descent

SLIDE 3

Training Landscape is indeed NICE

  • SGD minima are connected via a piecewise linear path with constant loss [Garipov et al., 2018; Draxler et al., 2018]
  • Mode connectivity was proved assuming properties of well-trained networks (dropout/noise stability) [Kuditipudi et al., 2019]

SLIDE 4

What do we show?

  • Theorem (informal). As the neural network grows wider, the solutions obtained via SGD become increasingly dropout stable, and the barriers between local optima disappear.

Mean-field view:
  • Two layers [Mei et al., 2019]
  • Multiple layers [Araujo et al., 2019]

Quantitative bounds:
  • independent of the input dimension for two-layer networks, scaling linearly with it for multiple layers
  • change in loss scales with the network width as $\frac{1}{\sqrt{\text{width}}}$
  • the number of training samples is only required to scale faster than $\log(\text{width})$

SLIDE 5
Related Work

  • Local minima are globally optimal for deep linear networks and for networks with more neurons than training samples
  • Connected landscape if the number of neurons grows large (two-layer networks, energy gap exponential in input dimension)

Strong assumptions on the model and poor scaling of parameters 😟

SLIDE 6

Warm-up: Two-Layer Networks

Data: $\{(x_1, y_1), \ldots, (x_n, y_n)\} \sim_{\text{i.i.d.}} \mathbb{P}(\mathbb{R}^d \times \mathbb{R})$

Model: $\hat{y}_N(x, \theta) = \frac{1}{N} \sum_{i=1}^{N} a_i\, \sigma(x; w_i)$, with $\theta = (w, a)$

Goal: minimize the loss $L_N(\theta) = \mathbb{E}\,\{(y - \hat{y}_N(x, \theta))^2\}$

Online SGD: $\theta^{k+1} = \theta^k - \alpha N \, \nabla_{\theta^k} (y_k - \hat{y}_N(x_k, \theta^k))^2$

Assumptions:
  • $y$ bounded, $\nabla_w \sigma(x, w)$ sub-Gaussian
  • $\sigma$ bounded and differentiable, $\nabla\sigma$ bounded and Lipschitz
  • initialization of the $a_i$ with bounded support
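To make the setup concrete, here is a minimal numpy sketch of this model and one online-SGD step. The activation $\sigma(x; w) = \tanh(\langle w, x \rangle)$, the dimensions, and the synthetic sample are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Two-layer mean-field model: y_hat(x, theta) = (1/N) * sum_i a_i * sigma(x; w_i),
# with sigma(x; w) = tanh(<w, x>) chosen purely for illustration.
rng = np.random.default_rng(0)
d, N = 10, 1000                        # input dimension, width
w = rng.normal(size=(N, d))            # first-layer weights
a = rng.uniform(-1.0, 1.0, size=N)     # second-layer weights (bounded support)
alpha = 1e-3                           # step size

def predict(x, w, a):
    return np.mean(a * np.tanh(w @ x))

def sgd_step(x, y, w, a, alpha):
    """One online-SGD step on the squared loss (y - y_hat)^2.

    The alpha * N step size cancels the model's 1/N factor, so every
    neuron moves at rate O(alpha) -- the mean-field scaling."""
    z = np.tanh(w @ x)                  # sigma(x; w_i), shape (N,)
    err = y - np.mean(a * z)            # residual
    grad_a = -2.0 * err * z / N                                  # dL/da_i
    grad_w = -2.0 * err * (a * (1.0 - z**2) / N)[:, None] * x    # dL/dw_i
    return w - alpha * N * grad_w, a - alpha * N * grad_a

# one step on a synthetic sample
x_k, y_k = rng.normal(size=d), 1.0
w, a = sgd_step(x_k, y_k, w, a, alpha)
```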

SLIDE 7

Recap: Dropout Stability

$L_M(\theta) = \mathbb{E}\left(y - \frac{1}{M}\sum_{i=1}^{M} a_i\, \sigma(x; w_i)\right)^2$

  • $\theta$ is $\varepsilon_D$-dropout stable if $|L_N(\theta) - L_M(\theta)| \le \varepsilon_D$

SLIDE 8

Recap: Dropout Stability and Connectivity

$L_M(\theta) = \mathbb{E}\left(y - \frac{1}{M}\sum_{i=1}^{M} a_i\, \sigma(x; w_i)\right)^2$

  • $\theta$ is $\varepsilon_D$-dropout stable if $|L_N(\theta) - L_M(\theta)| \le \varepsilon_D$
  • $\theta$ and $\theta'$ are $\varepsilon_C$-connected if there exists a continuous path connecting them along which the loss does not increase by more than $\varepsilon_C$
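Continuing the numpy sketch above, $L_M$ and the stability gap $\varepsilon_D$ can be estimated empirically; the synthetic evaluation set is again just for illustration.

```python
def dropout_loss(X, Y, w, a, M):
    """Empirical L_M: keep the first M neurons; the 1/M normalization is
    exactly the rescaling of the surviving neurons."""
    preds = np.tanh(X @ w[:M].T) @ a[:M] / M
    return np.mean((Y - preds) ** 2)

# stability gap with M = N/2 on a synthetic evaluation set
X_eval, Y_eval = rng.normal(size=(500, d)), rng.normal(size=500)
eps_D = abs(dropout_loss(X_eval, Y_eval, w, a, N)
            - dropout_loss(X_eval, Y_eval, w, a, N // 2))
```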

SLIDE 9

Main Results: Dropout Stability

SLIDE 10

Main Results: Dropout Stability

Change in loss scales as $\sqrt{\frac{\log M}{M}} + \sqrt{\alpha\,(D + \log N)}$

SLIDE 11

Main Results: Dropout Stability

  • Loss change vanishes as $M \gg 1$ and $\alpha \ll (D + \log N)^{-1}$
  • $M$ does not need to scale with $N$ or $D$

SLIDE 12

Main Results: Connectivity

SLIDE 13

Main Results: Connectivity

  • Change in loss scales as $\sqrt{\frac{\log N}{N}} + \sqrt{\alpha\,(D + \log N)}$

SLIDE 14

Main Results: Connectivity

  • Change in loss scales as $\sqrt{\frac{\log N}{N}} + \sqrt{\alpha\,(D + \log N)}$
  • Can connect SGD solutions obtained from different training data (but same data distribution) and different initializations
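Connectivity can be probed numerically by evaluating the loss along a candidate path between two solutions. The straight segment below (reusing the numpy sketch above) is only the crudest such path; the paper's construction is piecewise linear, so a segment merely upper-bounds the barrier.

```python
def path_barrier(X, Y, theta_a, theta_b, steps=21):
    """Largest excess loss along the straight segment between two solutions.

    epsilon_C-connectivity asks for *some* continuous path whose barrier is
    at most epsilon_C, so this is an upper bound on the best achievable."""
    (wa, aa), (wb, ab) = theta_a, theta_b
    def loss(w_, a_):
        preds = np.tanh(X @ w_.T) @ a_ / len(a_)
        return np.mean((Y - preds) ** 2)
    along = [loss((1 - t) * wa + t * wb, (1 - t) * aa + t * ab)
             for t in np.linspace(0.0, 1.0, steps)]
    return max(along) - max(loss(wa, aa), loss(wb, ab))
```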

SLIDE 15

Proof Idea

Discrete dynamics of SGD ⟷ Continuous dynamics of gradient flow

  • $\theta^k$ is close to $N$ i.i.d. particles that evolve with gradient flow
  • $L_N(\theta)$ and $L_M(\theta)$ concentrate to the same limit
  • Dropout stability with $M = N/2$ ⇒ connectivity
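The concentration step can be eyeballed with the sketches above: train networks of increasing width on the same task and watch the gap $|L_N - L_{N/2}|$ shrink. The widths, step count, and constant-target task are illustrative assumptions.

```python
for N_w in (100, 1000, 10000):
    w_i = rng.normal(size=(N_w, d))
    a_i = rng.uniform(-1.0, 1.0, size=N_w)
    for _ in range(2000):                    # online SGD, one sample per step
        x_k, y_k = rng.normal(size=d), 1.0
        w_i, a_i = sgd_step(x_k, y_k, w_i, a_i, alpha)
    gap = abs(dropout_loss(X_eval, Y_eval, w_i, a_i, N_w)
              - dropout_loss(X_eval, Y_eval, w_i, a_i, N_w // 2))
    print(f"N = {N_w:6d}   |L_N - L_(N/2)| = {gap:.4f}")
```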

SLIDE 16

Multilayer Case: Setup

Data: $\{(x_1, y_1), \ldots, (x_n, y_n)\} \sim_{\text{i.i.d.}} \mathbb{P}(\mathbb{R}^{d_x} \times \mathbb{R}^{d_y})$

Model: $\hat{y}_N(x, \theta) = \frac{1}{N} W_{L+1}\, \sigma_L\!\left(\cdots\left(\frac{1}{N} W_2\, \sigma_1(W_1 x)\right)\cdots\right)$

Goal: minimize the loss $L_N(\theta) = \mathbb{E}\,\{\| y - \hat{y}_N(x, \theta) \|^2\}$

Online SGD: $\theta^{k+1} = \theta^k - \alpha N^2 \, \nabla_{\theta^k} \| y_k - \hat{y}_N(x_k, \theta^k) \|^2$

Assumptions:
  • $y$ bounded
  • $\sigma_\ell$ bounded and differentiable, $\nabla\sigma_\ell$ bounded and Lipschitz
  • initialization with bounded support
  • $W_1$ and $W_{L+1}$ stay fixed (random features)
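A numpy sketch of this multilayer mean-field forward pass, with tanh standing in for the generic activations $\sigma_\ell$ (an illustrative assumption):

```python
def predict_multilayer(x, Ws):
    """y_hat = (1/N) W_{L+1} sigma_L( ... (1/N) W_2 sigma_1(W_1 x) ... ).

    Ws = [W1, W2, ..., W_{L+1}]; per the setup above, W1 and W_{L+1} would
    stay fixed during training (random features)."""
    h = np.tanh(Ws[0] @ x)               # sigma_1(W_1 x): no 1/N on the first layer
    for W in Ws[1:-1]:
        h = np.tanh(W @ h / len(h))      # sigma_l((1/N) W_l h)
    return Ws[-1] @ h / len(h)           # (1/N) W_{L+1} sigma_L(...)
```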

SLIDE 17

Multilayer Case: Dropout Stability

Dropout stability: the loss does not change much if we remove part of the neurons from each layer (and suitably rescale the remaining ones).

SLIDE 18

Multilayer Case: Dropout Stability

$L_M(\theta) :=$ loss when we keep at most $M$ neurons per layer

  • $\theta$ is $\varepsilon_D$-dropout stable if $|L_N(\theta) - L_M(\theta)| \le \varepsilon_D$

SLIDE 19

Multilayer Case: Dropout Stability and Connectivity

$L_M(\theta) :=$ loss when we keep at most $M$ neurons per layer

  • $\theta$ is $\varepsilon_D$-dropout stable if $|L_N(\theta) - L_M(\theta)| \le \varepsilon_D$
  • $\theta$ and $\theta'$ are $\varepsilon_C$-connected if there exists a continuous path connecting them along which the loss does not increase by more than $\varepsilon_C$
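Continuing the multilayer sketch, keeping at most $M$ neurons per layer amounts to truncating the weight matrices; since predict_multilayer divides by len(h), the surviving neurons are rescaled automatically. Selecting the *first* $M$ neurons is an illustrative choice.

```python
def keep_M_neurons(Ws, M):
    """Keep the first M neurons in every hidden layer.

    W1 loses output rows, W_{L+1} loses input columns, middle layers lose
    both; predict_multilayer then averages over M instead of N, which is
    exactly the rescaling of the surviving neurons."""
    out = []
    for l, W in enumerate(Ws):
        if l < len(Ws) - 1:
            W = W[:M]        # drop this layer's trailing output units
        if l > 0:
            W = W[:, :M]     # drop inputs coming from dropped units below
        out.append(W)
    return out
```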

SLIDE 20

Multilayer Case: Results

SLIDE 21

Multilayer Case: Results

SLIDE 22

Proof Challenges

Discrete dynamics of SGD ⟷ Continuous dynamics of gradient flow

  • Ideal particles are no longer independent (weights in different layers are correlated)
  • Bound the norm of the weights during training
  • Bound the maximum distance between SGD and ideal particles ([Araujo et al., 2019] bounds the average distance)

SLIDE 23

Numerical Results

  • CIFAR-10 dataset
  • Pretrained VGG-16 features
  • # layers = 3
  • Keep half of the neurons
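As a rough illustration of this setup, the sketch below builds a 3-layer head on frozen VGG-16 features and applies "keep half of the neurons" with rescaling. The width, layer sizes, and first-half selection rule are assumptions for illustration, not the paper's exact protocol.

```python
import torch
from torch import nn
from torchvision.models import vgg16, VGG16_Weights

# Frozen pretrained feature extractor; on 32x32 CIFAR-10 images the VGG-16
# convolutional stack outputs 512 features per image.
features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()

N = 1024                               # hypothetical width
head = nn.Sequential(                  # the 3 trainable fully-connected layers
    nn.Flatten(),
    nn.Linear(512, N), nn.ReLU(),
    nn.Linear(N, N), nn.ReLU(),
    nn.Linear(N, 10),
)

@torch.no_grad()
def keep_half(head):
    """Keep the first half of each hidden layer's units, with rescaling.

    Dropped units are silenced by zeroing their rows and biases; the
    surviving units' outgoing columns are doubled so the next layer sees
    roughly the same average input. Compare the loss of the original and
    the modified head (on a deep copy) to estimate epsilon_D."""
    linears = [m for m in head if isinstance(m, nn.Linear)]
    for lin, nxt in zip(linears[:-1], linears[1:]):
        half = lin.out_features // 2
        lin.weight[half:] = 0.0
        lin.bias[half:] = 0.0
        nxt.weight[:, :half] *= 2.0
```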
SLIDE 26

Numerical Results

SLIDE 27

Conclusion

SLIDE 28

Thank You for Your Attention