Reconciling Traditional Optimization Analyses and Modern Deep Learning: the Intrinsic Learning Rate - PowerPoint PPT Presentation

slide-1
SLIDE 1

Reconciling Traditional Optimization Analyses and Modern Deep Learning: the Intrinsic Learning Rate

NeurIPS Dec, 2020

Sanjeev Arora

Princeton University & IAS

Zhiyuan Li*

Princeton University

Kaifeng Lyu*

Tsinghua University

*: these authors contributed equally

slide-2
SLIDE 2

Settings

  • Normalization, e.g. BN, makes the loss scale-invariant:
    L(x_t; B_t) = L(c·x_t; B_t), ∀c > 0
    ∇L(x_t; B_t) = c·∇L(c·x_t; B_t); ∇²L(x_t; B_t) = c²·∇²L(c·x_t; B_t), ∀c > 0
  • Stochastic Gradient Descent and Weight Decay

Notation: B_t: batch of training data at iteration t; x_t: weights of the neural net; μ: Weight Decay (WD) factor, aka ℓ2 regularization; θ_t: Learning Rate (LR) at iteration t.

Our Contribution:

  • 1. Several surprising incompatibilities between normalized nets and traditional analyses.
  • 2. A new theory, based on a new analysis of the SDEs arising from SGD on normalized nets, suggesting that LR does not play the role assumed in most discussions. Our analysis shows that μθ, i.e., WD × LR, is a better measure of the speed of learning, and thus we call μθ the Intrinsic Learning Rate.
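For concreteness, here is a minimal LaTeX restatement of this setting, assuming the standard SGD-with-weight-decay update (the update equation itself did not survive extraction); symbols follow the notation above.

    % Scale invariance of the loss of a normalized net (e.g., with BN):
    %   L(x_t; B_t) = L(c x_t; B_t)   for all c > 0,
    % which, differentiating in x, gives
    %   \nabla L(x_t; B_t)   = c   \nabla L(c x_t; B_t)
    %   \nabla^2 L(x_t; B_t) = c^2 \nabla^2 L(c x_t; B_t)
    %
    % Assumed update rule: SGD with weight decay \mu and learning rate \theta_t
    \[
      x_{t+1} = (1 - \theta_t \mu)\, x_t - \theta_t \nabla L(x_t; B_t)
    \]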

slide-3
SLIDE 3

Conventional Wisdoms in Deep Learning (that we suspect!)

Wisdom 1: As LR → 0, the optimization dynamics converges to a deterministic path (Gradient Flow) along which the training loss strictly decreases.

Wisdom 2: To achieve the best generalization, the LR must be large initially for quite a few epochs.

Wisdom 3: SGD can be modeled via an SDE with a fixed Gaussian noise, namely, as a diffusion process that mixes to some Gibbs-like distribution.

Figure from a blog post by Holden Lee and Andrej Risteski.

[Figures: ResNet with BN trained by full-batch GD + WD on subsampled CIFAR-10; ResNet with BN trained by SGD with different LR schedules on CIFAR-10.]
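For reference, a short LaTeX sketch of the objects Wisdoms 1 and 3 refer to; the Langevin-type noise scale below is the textbook form and is an assumption here, not taken from the slides.

    % Wisdom 1: as the LR theta -> 0, (S)GD approaches gradient flow
    %   dx/dt = -\nabla L(x),  along which  dL/dt = -\|\nabla L(x)\|^2 \le 0,
    % so the training loss is non-increasing along the path.
    %
    % Wisdom 3: SGD modeled as a diffusion with fixed Gaussian noise (Langevin form)
    %   dX_t = -\nabla L(X_t)\, dt + \sqrt{2T}\, dW_t,
    % whose stationary, Gibbs-like distribution is  p(x) \propto \exp(-L(x)/T).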

slide-6
SLIDE 6

Conventional Wisdoms challenged

(Against Wisdom 1): Full-batch gradient descent ≠ gradient flow. The loss goes up as the Hessian blows up when ‖x‖ → 0.

(Against Wisdom 2): Small LR can generalize equally well as large LR.

(Against Wisdom 3): The random-walk/SDE view of SGD is way off. There is no evidence of mixing, as traditionally understood, at least within normal training times.
  • Stochastic Weight Averaging improves accuracy (Izmailov et al., 2018).
  • SWA rules out that SGD, as a diffusion process, is mixing to a unique global equilibrium (a toy sketch of SWA follows this slide).

[Figures: ResNet with BN trained by full-batch GD + WD on subsampled CIFAR-10; ResNet with BN trained by SGD with different LR schedules on CIFAR-10. Same settings as before, trained longer.]
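To make the SWA bullet above concrete, here is a small, self-contained Python sketch of tail-averaging SGD iterates on a hypothetical noisy quadratic; it only illustrates what SWA does mechanically and is not the authors' experimental setup.

    # Toy sketch of Stochastic Weight Averaging (SWA): average the SGD
    # iterates collected near the end of training. On this hypothetical
    # noisy quadratic, the averaged weights typically have lower loss than
    # the last iterate, because averaging cancels the noise-driven
    # fluctuation of the iterates around the minimum.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 50
    A = np.diag(np.linspace(0.1, 2.0, d))      # toy quadratic loss L(w) = 0.5 w^T A w

    def loss(w):
        return 0.5 * w @ (A @ w)

    def stochastic_grad(w):
        return A @ w + 0.1 * rng.standard_normal(d)   # exact gradient + minibatch-like noise

    w = rng.standard_normal(d)
    lr, steps, tail = 0.05, 5000, 1000
    tail_iterates = []

    for t in range(steps):
        w = w - lr * stochastic_grad(w)
        if t >= steps - tail:                  # keep the last `tail` iterates
            tail_iterates.append(w.copy())

    w_swa = np.mean(tail_iterates, axis=0)     # SWA weights: average of the tail iterates
    print(f"last-iterate loss: {loss(w):.4f}")
    print(f"SWA-average loss:  {loss(w_swa):.4f}")   # typically noticeably smaller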

slide-9
SLIDE 9

Conventional Wisdoms challenged

(Against Wisdom 1): Full-batch gradient descent ≠ gradient flow.

(Against Wisdom 2): Small LR can generalize equally well as large LR.

(Against Wisdom 3): The random-walk/SDE view of SGD is way off; there is no evidence of mixing, as traditionally understood, at least within normal training times.

So what's going on?

slide-10
SLIDE 10

SDE-based framework for modeling SGD on Normalized Networks

  • Standard SDE approximation [equation on slide]
  • New parametrization [equation on slide]

Main Theorem:

Effective Learning Rate: θ_t / δ_t, where δ_t = ‖X_t‖²

Intrinsic Learning Rate: μθ

Three implications of the main theorem:

  • 1. The effective learning rate depends on the norm of X_t: it equals θ / ‖X_t‖².
  • 2. μθ alone determines the evolution of the system.
  • 3. The norm converges in O(1/(μθ)) time (steps).

The SDE decomposes into the dynamics of the norm and the dynamics of the direction.
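To see implications 1-3 numerically, here is a small, self-contained Python sketch (not the authors' code): SGD with weight decay on a hypothetical scale-invariant toy loss, tracking the squared parameter norm and the effective LR θ/‖x‖². Under the heuristic above, the norm should settle within a few multiples of 1/(μθ) steps.

    # SGD + weight decay on a scale-invariant toy loss.
    # Toy loss: L(x) = 1 - (x.v)^2 / ||x||^2 with a fixed unit vector v,
    # so L(c x) = L(x) for every c > 0, mimicking a normalized network.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 20
    v = np.zeros(d); v[0] = 1.0                    # target direction

    def grad(x):
        n2 = x @ x
        g = -2.0 * (x @ v) * v / n2 + 2.0 * (x @ v) ** 2 * x / n2 ** 2
        # noise that also scales like 1/||x||, mimicking minibatch gradients
        return g + 0.05 * rng.standard_normal(d) / np.sqrt(n2)

    theta, mu = 0.1, 2e-3                          # LR and WD; intrinsic LR = mu * theta
    x = rng.standard_normal(d)

    for t in range(int(10 / (mu * theta))):        # run ~10 multiples of 1/(mu*theta) steps
        x = (1.0 - theta * mu) * x - theta * grad(x)
        if t % int(0.5 / (mu * theta)) == 0:       # report twice per 1/(mu*theta) steps
            print(f"step {t:6d}  ||x||^2 = {x @ x:7.3f}  effective LR = {theta / (x @ x):.5f}")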

slide-11
SLIDE 11

A conjecture about mixing time in function space

  • SDE (SGD) mixes slowly in parameter space, but in experiments it mixes fast in function space, i.e., shortly after norm convergence, at t = O(1/(μθ)).

  • Fast Equilibrium Conjecture: a network from any non-pathological initialization converges to the function-space equilibrium in O(1/(μθ)) steps, i.e.,

  • at equilibrium one cannot distinguish two networks (from different initializations) via (further training +) evaluation on any input data, but it is possible to distinguish them by looking at their parameters (e.g., via Stochastic Weight Averaging).

slide-12
SLIDE 12

What happens in real-life training - an interpretation

Our theory explains …

  • Why does the test/train error suddenly increase after every LR decay?
  • How many LR decays are necessary to reach the best performance?
  • Only the last one is necessary, if you don't mind training your network longer.
  • What is the benefit of an early large LR, then?
  • Faster convergence to the equilibrium (of the small Intrinsic LR).

Figure from β€˜Wide Resnet’ paper (Zagoruyko & Komodakis, 2017)

Reason: the effective LR is first divided by 10, but its stationary value only decreases by roughly √10.
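A back-of-the-envelope derivation of the √10 figure, under the scale-invariance assumptions from the Settings slide (a rough sketch; the paper's precise statement may differ).

    % For a scale-invariant loss the gradient is orthogonal to x and scales
    % like 1/||x||, so in expectation the squared norm evolves roughly as
    %   ||x_{t+1}||^2 \approx (1 - \theta\mu)^2 ||x_t||^2 + \theta^2 G^2 / ||x_t||^2,
    % where G is the gradient scale on the unit sphere. Balancing the two
    % terms at stationarity:
    %   2\theta\mu\,||x||^2 \approx \theta^2 G^2 / ||x||^2
    %   \Rightarrow ||x||^2 \approx G\,\sqrt{\theta / (2\mu)},
    % so the stationary effective LR is
    %   \theta / ||x||^2 \approx \sqrt{2\mu\theta} / G  \propto  \sqrt{\mu\theta}.
    % Dividing \theta by 10 (with \mu fixed) therefore shrinks the stationary
    % effective LR only by about \sqrt{10} \approx 3.16, even though the
    % effective LR drops by 10x at the moment of the decay.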

See more interpretations in our paper!