Unraveling the mysteries of stochastic gradient descent on deep neural networks - PowerPoint PPT Presentation



SLIDE 1

Unraveling the mysteries of stochastic gradient descent on deep neural networks

Pratik Chaudhari

UCLA VISION LAB

SLIDE 2


The question

x∗ = argmin_x f(x)

where f measures the disagreement of the predictions with the ground truth (e.g., Cat vs. Dog) and x are the weights, aka the parameters.

Stochastic gradient descent:

x_{k+1} = x_k − η ∇f_b(x_k)

Many, many variants: AdaGrad, RMSProp, Adam, SAG, SVRG, Catalyst, APPA, Natasha, Katyusha...

Why is SGD so special?
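A minimal sketch of the SGD update above on a toy least-squares problem; the data, loss, and hyper-parameters are illustrative assumptions, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares loss: f(x) = 1/(2N) * sum_i (a_i . x - y_i)^2
N, d = 1000, 10
A = rng.normal(size=(N, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
x_star = np.linalg.lstsq(A, y, rcond=None)[0]        # full-batch minimizer, for reference

def grad_minibatch(x, batch):
    """Mini-batch gradient, an unbiased estimate of the full gradient."""
    Ab, yb = A[batch], y[batch]
    return Ab.T @ (Ab @ x - yb) / len(batch)

eta, b = 0.05, 32                                    # learning rate and batch size
x = np.zeros(d)
for k in range(2000):
    batch = rng.choice(N, size=b, replace=True)      # sample a mini-batch
    x = x - eta * grad_minibatch(x, batch)           # x_{k+1} = x_k - eta * grad f_b(x_k)

print("distance to the full-batch minimizer:", np.linalg.norm(x - x_star))
```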

SLIDE 3

Empirical evidence: wide “minima”

[Figure: two histograms of the eigenvalues found at the end of training, with frequency on a log scale. One panel covers a positive range up to about 40; the other zooms into a small negative range down to about −0.5, labeled "short negative tail".]
SLIDE 4
A bit of statistical physics

  • Energy landscape of a binary perceptron
  • Wide minima are a large-deviations phenomenon
  • Few wide minima, but they generalize better; many sharp minima [Baldassi et al., '15]
SLIDE 5

Tilting the Gibbs measure

x∗ = argmin_x f(x) = argmax_x e^{−f(x)}

  • Local Entropy [Chaudhari et al., ICLR '17]:

x∗ ≈ argmin_x −log( G_γ ∗ e^{−f} )(x)

    where G_γ is a Gaussian kernel of variance γ.
SLIDE 6

Parle: parallelization of SGD

  • State-of-the-art performance [Chaudhari et al., SysML '18]

[Figure: results for Wide-ResNet on CIFAR-10 and All-CNN on CIFAR-10 (25% data).]
SLIDE 7


The question

Why is SGD so special?

SLIDE 8

A continuous-time view of SGD

  • Diffusion matrix D(x): the variance of the mini-batch gradients
  • Temperature: the ratio of the learning rate and the batch size,

β⁻¹ = η / (2b)

  • For a mini-batch of size b drawn from N samples,

var( ∇f_b(x) ) = D(x) / b,  where  D(x) = (1/N) Σ_{k=1}^{N} ∇f_k(x) ∇f_k(x)ᵀ − ∇f(x) ∇f(x)ᵀ
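A small numerical check of these definitions, assuming a toy per-sample loss so that the per-sample gradients are available in closed form; everything below is an illustrative sketch rather than the talk's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy per-sample losses f_k(x) = 0.5 * (a_k . x - y_k)^2, so grad f_k(x) = (a_k . x - y_k) * a_k
N, d = 500, 5
A, y = rng.normal(size=(N, d)), rng.normal(size=N)
x = rng.normal(size=d)

per_sample_grads = (A @ x - y)[:, None] * A            # row k is grad f_k(x)
full_grad = per_sample_grads.mean(axis=0)              # grad f(x)

# Diffusion matrix: D(x) = (1/N) sum_k grad f_k grad f_k^T  -  grad f grad f^T
D = per_sample_grads.T @ per_sample_grads / N - np.outer(full_grad, full_grad)

# Monte-Carlo check that var(grad f_b(x)) is approximately D(x) / b
b, trials = 25, 20000
mb = np.stack([per_sample_grads[rng.choice(N, b, replace=True)].mean(0) for _ in range(trials)])
emp_cov = np.cov(mb.T, bias=True)
print("relative error of D/b:", np.linalg.norm(emp_cov - D / b) / np.linalg.norm(D / b))

eta = 0.1
print("temperature beta^-1 =", eta / (2 * b))
```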

SLIDE 9

A continuous-time view of SGD

  • Continuous-time limit of the discrete-time updates (we will assume x ∈ Ω ⊂ ℝᵈ):

dx = −∇f(x) dt + √(2 β⁻¹ D(x)) dW(t)

  • The Fokker-Planck (FP) equation gives the distribution ρ(t) induced by SGD on the weight space, where x(t) ∼ ρ(t):

ρ_t = div( ∇f ρ + β⁻¹ div(D ρ) )

    with ∇f ρ the drift term and β⁻¹ div(D ρ) the diffusion term.
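A minimal Euler-Maruyama simulation of the SDE above in two dimensions, under simplifying assumptions (a quadratic loss and a constant, hand-picked diffusion matrix D); it is a sketch of the continuous-time picture, not the experiments in the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

# Quadratic loss f(x) = 0.5 * x.T H x, so grad f(x) = H x (illustrative choice)
H = np.array([[3.0, 0.0], [0.0, 1.0]])
grad_f = lambda x: H @ x

D = np.array([[1.0, 0.3], [0.3, 0.5]])       # constant, non-isotropic diffusion (assumption)
beta_inv = 0.05                              # temperature eta / (2b)
L = np.linalg.cholesky(2 * beta_inv * D)     # L @ L.T = 2 * beta^-1 * D

dt, steps = 1e-3, 100_000
x = np.array([2.0, -1.0])
traj = np.empty((steps, 2))
for t in range(steps):
    # Euler-Maruyama step: dx = -grad f(x) dt + sqrt(2 beta^-1 D) dW
    x = x - grad_f(x) * dt + L @ rng.normal(size=2) * np.sqrt(dt)
    traj[t] = x

# The second half of the trajectory samples (approximately) the steady state
print("empirical steady-state covariance:\n", np.cov(traj[steps // 2:].T))
```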

SLIDE 10

Wasserstein gradient flow

  • The heat equation ρ_t = div( I ∇ρ ) performs steepest descent on the Dirichlet energy

(1/2) ∫ |∇ρ(x)|² dx

  • It is also the steepest descent in the Wasserstein metric for the negative entropy

−H(ρ) = ∫ ρ log ρ dx

  • The JKO discretization

ρ^τ_{k+1} ∈ argmin_ρ { −H(ρ) + W₂²(ρ, ρ^τ_k) / (2τ) }

    converges to trajectories of the heat equation, with steady state ρ^ss_heat = argmin_ρ −H(ρ)

  • Negative entropy is a Lyapunov functional for Brownian motion
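A small sketch of the Lyapunov property in the last bullet: evolve a 1-D density with an explicit finite-difference heat equation and watch the negative entropy decrease. The grid, initial condition, and step sizes are arbitrary choices for illustration.

```python
import numpy as np

# 1-D heat equation rho_t = rho_xx on a periodic grid, explicit finite differences.
n = 256
dx = 1.0 / n
xgrid = np.arange(n) * dx
rho = np.exp(-200 * (xgrid - 0.3) ** 2) + 0.5 * np.exp(-400 * (xgrid - 0.7) ** 2)
rho /= rho.sum() * dx                      # normalize to a probability density

dt = 0.2 * dx ** 2                         # stable explicit time step (dt <= dx^2 / 2)
neg_entropy = lambda r: np.sum(r * np.log(r + 1e-30)) * dx

for step in range(5001):
    if step % 1000 == 0:
        # These values should decrease monotonically: -H(rho) is a Lyapunov functional.
        print(f"step {step:5d}   -H(rho) = {neg_entropy(rho):.6f}")
    lap = (np.roll(rho, 1) - 2 * rho + np.roll(rho, -1)) / dx ** 2
    rho = rho + dt * lap
```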
SLIDE 11

Wasserstein gradient flow: with drift

  • If D = I, the Fokker-Planck equation

ρ_t = div( ∇f ρ + β⁻¹ I ∇ρ )

    has the Jordan-Kinderlehrer-Otto (JKO) functional [Jordan et al., '97] as its Lyapunov functional:

ρ^ss(x) = argmin_ρ  E_{x∼ρ}[ f(x) ] − β⁻¹ H(ρ),

    with E_{x∼ρ}[ f(x) ] the energetic term and −β⁻¹ H(ρ) the entropic term.

  • FP is the steepest descent on the JKO functional in the Wasserstein metric
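A quick check, not on the slide but consistent with the steady state shown two slides later: complete the JKO functional to a KL divergence,

F(ρ) = E_{x∼ρ}[ f(x) ] − β⁻¹ H(ρ) = ∫ f ρ dx + β⁻¹ ∫ ρ log ρ dx = β⁻¹ KL( ρ ∥ e^{−βf} / Z_β ) − β⁻¹ log Z_β,

so, since KL ≥ 0 with equality only at ρ = e^{−βf} / Z_β, the minimizer of the JKO functional is exactly the Gibbs measure of f.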

SLIDE 12

What happens for non-isotropic noise?

ρ_t = div( ∇f ρ + β⁻¹ div(D ρ) )        (drift + diffusion)

  • FP monotonically minimizes the free energy

ρ^ss(x) = argmin_ρ  E_{x∼ρ}[ f(x) ] − β⁻¹ H(ρ)

  • Rewrite it as

F(ρ) = β⁻¹ KL( ρ ∥ ρ^ss );

    compare with |x − x∗| for deterministic optimization.

SLIDE 13

SGD performs variational inference

Theorem [Chaudhari & Soatto, ICLR '18]. The functional

F(ρ) = β⁻¹ KL( ρ ∥ ρ^ss )

is minimized monotonically by trajectories of the Fokker-Planck equation

ρ_t = div( ∇f ρ + β⁻¹ div(D ρ) )

with ρ^ss as the steady-state distribution. Moreover, Φ = −β⁻¹ log ρ^ss up to a constant.

SLIDE 14

Some implications

  • The learning rate should scale linearly with the batch size:

β⁻¹ = η / (2b) should not be small.

  • Sampling with replacement regularizes better than sampling without replacement:

β⁻¹_{w/o replacement} = η / (2b) · (1 − b/N),

    a lower temperature; sampling with replacement also generalizes better.
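A tiny numeric illustration of the two temperatures above; the values of η, b, and N are made up.

```python
# Temperatures for with- vs. without-replacement sampling (illustrative numbers).
eta, b, N = 0.1, 128, 50_000          # learning rate, batch size, dataset size (assumptions)

beta_inv_with = eta / (2 * b)
beta_inv_without = eta / (2 * b) * (1 - b / N)

print(f"with replacement:    beta^-1 = {beta_inv_with:.6f}")
print(f"without replacement: beta^-1 = {beta_inv_without:.6f}")
# Doubling b while doubling eta leaves beta^-1 unchanged, which is the
# linear learning-rate / batch-size scaling rule above.
```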

SLIDE 15

Information Bottleneck Principle

  • Minimize the mutual information of the representation with the training data [Tishby '99, Achille & Soatto '17]:

IB_β(θ) = E_{x∼ρ_θ}[ f(x) ] − β⁻¹ KL( ρ_θ ∥ prior )

  • Minimizing such functionals is hard; SGD does it naturally.

SLIDE 16

Potential Φ vs. original loss f

  • The solution of the variational problem is

ρ^ss(x) = (1/Z_β) e^{−β Φ(x)},

    which is in general not the Gibbs distribution of the original loss, ∝ e^{−β f(x)}.

  • The two losses agree if and only if the noise is isotropic:

D(x) = I  ⇔  Φ(x) = f(x)

  • Key point: the most likely locations of SGD are not the critical points of the original loss.

SLIDE 17

Deep networks have highly non-isotropic noise

  • CIFAR-10:  λ(D) = 0.27 ± 0.84,  rank(D) = 0.34%
  • CIFAR-100: λ(D) = 0.98 ± 2.16,  rank(D) = 0.47%

  • Evaluate neural architectures using the diffusion matrix
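A sketch of how such statistics could be read off a matrix of per-sample gradients. The gradients below are random placeholders rather than gradients of a trained network, and the rank is reported as the fraction of eigenvalues above a small threshold, which may differ from the convention behind the numbers on the slide.

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder per-sample gradients: N samples, d parameters (stand-ins for grad f_k(x)).
N, d = 200, 1000
G = 0.1 * rng.normal(size=(N, d))

g_bar = G.mean(axis=0)
D = G.T @ G / N - np.outer(g_bar, g_bar)          # diffusion matrix D(x)

eig = np.linalg.eigvalsh(D)
rank_frac = np.mean(eig > 1e-6 * eig.max())       # fraction of "non-zero" eigenvalues

print(f"lambda(D) = {eig.mean():.4f} +/- {eig.std():.4f}")
print(f"rank(D)  ~ {100 * rank_frac:.2f}% of dimensions")
```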
SLIDE 18


How different are cats and dogs, really?

SLIDE 19

SGD converges to limit cycles

Theorem [Chaudhari & Soatto, ICLR '18]. The most likely trajectories of SGD follow

ẋ = j(x),

where the "leftover" vector field

j(x) = −∇f(x) + D(x) ∇Φ(x) − β⁻¹ div D(x)

is such that div j(x) = 0.

SLIDE 20

Trajectories of SGD

  • Run SGD for 10⁵ epochs
  • Plot the FFT of x^i_{k+1} − x^i_k for individual weights i (a toy version of this diagnostic is sketched below)
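A minimal sketch of that diagnostic. A real run would record an actual weight trajectory; here a synthetic noisy oscillation stands in for it, so the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for one recorded weight trajectory x_k: a slow oscillation plus noise.
K = 100_000
k = np.arange(K)
x = 0.05 * np.sin(2 * np.pi * k / 500) + 0.01 * rng.normal(size=K)

increments = np.diff(x)                            # x_{k+1} - x_k
spectrum = np.abs(np.fft.rfft(increments))
freqs = np.fft.rfftfreq(increments.size, d=1.0)    # cycles per step

peak = freqs[np.argmax(spectrum[1:]) + 1]          # skip the zero-frequency bin
print(f"dominant frequency ~ {peak:.5f} cycles/step (period ~ {1 / peak:.0f} steps)")
```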

SLIDE 21

An example

[Figure: a two-dimensional force field with labeled regions: points where the gradient vanishes and j(x) = 0, a region where j(x) is very large, a saddle point, and a region where j(x) is small.]
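To make the picture concrete, here is a toy simulation, not from the talk: a two-dimensional diffusion whose drift is −∇f plus a hand-picked divergence-free rotational field standing in for j(x). The trajectory keeps circulating around the minimum instead of settling there; f, the rotation strength, and the noise level are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy 2-D dynamics: -grad f with f(x) = 0.5 * |x|^2, plus a divergence-free rotation
# j(x) = c * (-x_2, x_1). The rotation leaves the steady-state density unchanged but
# makes the most likely paths orbit the minimum.
grad_f = lambda x: x
j = lambda x: 4.0 * np.array([-x[1], x[0]])

beta_inv, dt, steps = 0.01, 1e-3, 50_000
x = np.array([1.0, 0.0])
angles = np.empty(steps)
for t in range(steps):
    x = x + (-grad_f(x) + j(x)) * dt + np.sqrt(2 * beta_inv * dt) * rng.normal(size=2)
    angles[t] = np.arctan2(x[1], x[0])

# Number of full revolutions the trajectory makes around the minimum at the origin.
revs = abs(np.unwrap(angles)[-1] - angles[0]) / (2 * np.pi)
print(f"revolutions around the minimum: {revs:.1f}")
```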

SLIDE 22

Most likely locations are not the critical points of the original loss

Theorem [Chaudhari & Soatto, ICLR '18]. The Ito SDE

dx = −∇f dt + √(2 β⁻¹ D) dW(t)

is equivalent to an A-type SDE

dx = −(D + Q) ∇Φ dt + √(2 β⁻¹ D) dW(t)

with the same steady state ρ^ss ∝ e^{−β Φ(x)}, if

∇f = (D + Q) ∇Φ − β⁻¹ div(D + Q).

SLIDE 23


Knots in our understanding

ARCHITECTURE OPTIMIZATION GENERALIZATION

SLIDE 24


Punchline

Is SGD special?

SLIDE 25

Thank you, questions?


www.pratikac.info

Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. ICLR '18, arXiv:1710.11029.