

SLIDE 1

An Exponential LR Schedule for Deep Learning

(Strange Behavior of Normalization Layers)

Sanjeev Arora

Princeton & IAS

Zhiyuan Li

Princeton


ICLR, April 2020

SLIDE 2

x_{t+1} ← x_t − η·∇L(x_t). Traditional: start with some LR η; decay it over time.

(extensive literature in optimization justifying this)

Confusingly, exotic LR schedules are also reported to work: triangular [Smith, 2015], cosine [Loshchilov & Hutter, 2016], etc.; no justification in theory.

Learning rate in traditional optimization

SLIDE 3

Result 1 (empirical): It is possible to train today's deep architectures while growing the LR exponentially (i.e., at each iteration multiply it by (1 + c) for some c > 0).

Result 2 (theory): Mathematical proof that an exponential LR schedule can yield (in function space*) every net produced by existing training schedules.

(* In all nets that use batch norm [Ioffe & Szegedy, 2015] or any other normalization scheme.)

Raises questions for theory; highlights the importance of trajectory analysis (vs. landscape).

This work: Exponential LR schedules

Figure: PreResNet32 on CIFAR-10, with the fixed-LR run converted to the Exp LR schedule η_t = 0.1 × 1.481^t.
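
Below is a minimal sketch (not from the slides) of what "growing the LR exponentially" looks like in a training loop. The tiny BN convnet, the random tensors, and the short epoch loop are placeholders for the slide's PreResNet32/CIFAR-10 setup; weight decay is set to 0, as the theory on the next slide requires for the exponential run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder model/data standing in for PreResNet32 + CIFAR-10.
# The theory applies when the loss is scale-invariant (normalization after
# every layer, final layer fixed); this toy net is only meant to show the loop.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0)

growth = 1.481                              # per-epoch growth factor from the figure
x = torch.randn(256, 3, 32, 32)             # dummy data (replace with a CIFAR-10 loader)
y = torch.randint(0, 10, (256,))

for epoch in range(5):
    for i in range(0, 256, 64):
        loss = F.cross_entropy(model(x[i:i + 64]), y[i:i + 64])
        opt.zero_grad()
        loss.backward()
        opt.step()
    for g in opt.param_groups:              # grow the LR exponentially, once per epoch
        g["lr"] *= growth
    print(f"epoch {epoch}: lr = {opt.param_groups[0]['lr']:.4f}")
```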

SLIDE 4

Main Thm (original schedule has fixed LR)

"General training algorithm; fixed LR":

w_{t+1} ← w_t − η·∇(L_t(w_t) + (λ/2)·‖w_t‖²) + γ·(w_t − w_{t−1})

(η: learning rate; γ: momentum; λ: ℓ2 regularizer, aka weight decay; L_t: stochastic loss at round t)

Thm: For nets w/ batch norm or layer norm, the following is equivalent to the above: weight decay 0, momentum γ, and LR schedule η_t = η·α^{−2t−1}, where α (α < 1) is a nonzero root of

x² − (1 + γ − λη)·x + γ = 0.

(Proof shows nets in the new trajectory are equivalent in function space to nets in the original trajectory.)
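
A small sketch of the theorem's bookkeeping: solve the quadratic for α and emit the equivalent exponential schedule η_t = η·α^(−2t−1). The hyperparameters below (η = 0.1, γ = 0.9, λ = 5e-4) are illustrative choices, not values given on the slide; with them, and assuming roughly 391 minibatches per CIFAR-10 epoch, the larger root compounds to about the 1.48×-per-epoch factor shown in the earlier figure.

```python
import math

def exp_lr_schedule(eta, gamma, lam, num_steps):
    """eta_t = eta * alpha**(-(2t + 1)), with alpha a root (< 1) of
    x^2 - (1 + gamma - lam*eta)*x + gamma = 0 (the quadratic from the theorem)."""
    b = 1.0 + gamma - lam * eta
    disc = b * b - 4.0 * gamma
    assert disc >= 0.0, "no real root for these hyperparameters"
    alpha = (b + math.sqrt(disc)) / 2.0     # larger root, just below 1 (assumed choice)
    return alpha, [eta * alpha ** (-(2 * t + 1)) for t in range(num_steps)]

alpha, lrs = exp_lr_schedule(eta=0.1, gamma=0.9, lam=5e-4, num_steps=3)
print(alpha)                # ~0.9995, so the LR grows by alpha**(-2), about 0.1% per step
print(lrs)                  # ~[0.10005, 0.10015, 0.10025]
print(alpha ** (-2 * 391))  # ~1.48: per-epoch growth assuming ~391 steps/epoch (batch 128)
```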

SLIDE 5

Original schedule (Step Decay): K phases. In phase J (iterations U_J through U_{J+1} − 1), use LR η_J*.

More general (original schedule has varying LR)

Thm: Step Decay can be realized with a tapered exponential LR schedule

Tapered Exponential LR schedule (TEXP):

Exponentially growing LR within each phase; when entering a new phase:

  • switch to a slower exponential growth rate;
  • divide the current LR by some constant.

(See details in the paper)
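
The exact TEXP formulas are deferred to the paper, so the sketch below only mirrors the qualitative recipe above: grow the LR exponentially within each phase, and at each phase boundary divide the LR by a constant and switch to a slower growth rate. All constants here are placeholders.

```python
def texp_schedule(lr0, growth_rates, phase_lens, drop_factors):
    """Qualitative TEXP sketch (placeholder constants, not the paper's formulas):
    exponential LR growth within each phase; an LR drop and a slower growth rate
    whenever a new phase starts."""
    lrs, lr = [], lr0
    for phase, (g, n) in enumerate(zip(growth_rates, phase_lens)):
        if phase > 0:
            lr /= drop_factors[phase - 1]    # entering a new phase: divide the LR ...
        for _ in range(n):                   # ... and keep growing, at the slower rate g
            lr *= g
            lrs.append(lr)
    return lrs

schedule = texp_schedule(lr0=0.1,
                         growth_rates=[1.002, 1.001, 1.0005],   # slower each phase
                         phase_lens=[100, 100, 100],
                         drop_factors=[10.0, 10.0])
print(schedule[99], schedule[100])           # LR drops by ~10x at the phase boundary
```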

SLIDE 6

Observation: Batch norm + fixed final layer ⇒ the training loss is scale-invariant: L(c·w) = L(w) for every vector w of net parameters and every c > 0. (True for feed-forward nets, ResNets, DenseNets, etc.; see our appendix.)

Key concept in proof: Scale-invariant training loss

๐œพ๐’– โˆ’๐›‚๐Œ(๐œพ๐’–) ๐’…๐œพ๐’– โˆ’๐›‚๐Œ(๐๐œพ๐’–) Lemma: Gradient for such losses satisfies Scale-invariant loss fn sufficient for state of the art deep learning![Hoffer et al,18]

SLIDE 7

๐œพ๐’–

#, ๐œฝโ€ฒ โ†’ (๐’…๐œพ๐’–, ๐’…๐Ÿ‘๐œฝ)

๐œฝ ๐œพ๐’–

# = ๐’…๐œพ๐’–

๐›‚๐Œ(๐๐œพ๐’–) = ๐"๐Ÿ ๐›‚๐Œ(๐œพ๐’–)

๐œพ๐’– โˆ’๐›‚๐Œ(๐œพ๐’–) ๐œฝ# = c๐Ÿ‘๐œฝ โˆ’๐๐Ÿ‘๐œฝ๐›‚๐Œ(๐๐œพ๐’–) โˆ’๐œฝ๐›‚๐Œ(๐๐œพ๐’–) ๐œฝ ๐œพ๐’–%๐Ÿ

#

, ๐œฝโ€ฒ โ†’ (๐’…๐œพ๐’–%๐Ÿ, ๐’…๐Ÿ‘๐œฝ) ๐œพ๐’–%๐Ÿ

#

= ๐œพ๐’–

# โˆ’ ๐œฝโ€ฒ๐›‚๐Œ(๐œพ๐’– #)

๐œพ๐’–%๐Ÿ = ๐œพ๐’–%๐Ÿ โˆ’ ๐œฝ๐›‚๐Œ(๐œพ๐’–) ๐œฝ# = c๐Ÿ‘๐œฝ ๐œพ๐’–%๐Ÿ

#

๐œพ๐’–%๐Ÿ

Proof sketch ch when momentum=0
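
The chain above can be checked numerically: take one GD step from (w, η) and one from (c·w, c²·η) and confirm the results remain scaled copies. The toy scale-invariant loss from the earlier sketch stands in for a normalized network.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=5), 0.3

def grad(w):                                   # gradient of the toy scale-invariant loss
    n = np.linalg.norm(w)
    u = w / n
    s = a @ u
    return 2.0 * (s - b) * (a - s * u) / n

def gd_step(w, lr):
    return w - lr * grad(w)

w, lr, c = rng.normal(size=5), 0.1, 2.5
w_next = gd_step(w, lr)                        # step from (w, eta)
w_next_scaled = gd_step(c * w, c ** 2 * lr)    # step from (c*w, c^2*eta)
print(np.allclose(w_next_scaled, c * w_next))  # True: trajectories stay scaled copies
```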

SLIDE 8

๐œพ๐’–

0, ๐œฝโ€ฒ โ†’ (๐’…๐œพ๐’–, ๐’…๐Ÿ‘๐œฝ)

๐œพ๐’–"๐Ÿ , ๐œฝโ€ฒ โ†’ (๐’…๐œพ๐’–"๐Ÿ, ๐’…๐Ÿ‘๐œฝ) ๐œพ๐’–"๐Ÿ = ๐œพ๐’–

0 โˆ’ ๐œฝโ€ฒ๐›‚๐Œ(๐œพ๐’– 0)

๐œพ๐’–"๐Ÿ = ๐œพ๐’–"๐Ÿ โˆ’ ๐œฝ๐›‚๐Œ(๐œพ๐’–)

๐œพ๐’–

", ๐œฝโ€ฒ

๐œพ๐’–, ๐œฝ ๐œพ๐’–#๐Ÿ, ๐œฝ ๐œพ๐’–#๐Ÿ

"

, ๐œฝโ€ฒ

Wa Warm-up: up: Equivalence ce of

  • f mo

momen mentum-fr free ca case

State =

SLIDE 9

Warm-up: Equivalence of the momentum-free case

Notation: GD_t is one GD step, (w, η) → (w − η·∇L_t(w), η); Π1^c scales the weights, (w, η) → (c·w, η); Π2^c scales the LR, (w, η) → (w, c·η).

Equivalent scaling: Π1^c ∘ Π2^{c²} does not change the function computed by any net on the trajectory, and it commutes with GD_t:

GD_t ∘ Π1^c ∘ Π2^{c²} = Π1^c ∘ Π2^{c²} ∘ GD_t.
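
The same claim, phrased on states (w, η) with explicit maps. The Π1/Π2 conventions below follow the reconstruction above (Π1 scales the weights, Π2 scales the LR) and are meant as an illustration, not as the paper's exact notation.

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = rng.normal(size=5), 0.3

def grad(w):                         # toy scale-invariant loss, as in the earlier sketches
    n = np.linalg.norm(w)
    u = w / n
    s = a @ u
    return 2.0 * (s - b) * (a - s * u) / n

def GD(state):                       # one GD step on the state (weights, lr)
    w, lr = state
    return (w - lr * grad(w), lr)

def Pi1(c, state):                   # scale the weights by c
    w, lr = state
    return (c * w, lr)

def Pi2(c, state):                   # scale the LR by c
    w, lr = state
    return (w, c * lr)

state, c = (rng.normal(size=5), 0.1), 1.7
lhs = GD(Pi1(c, Pi2(c ** 2, state)))       # scale by (c, c^2), then take a GD step
rhs = Pi1(c, Pi2(c ** 2, GD(state)))       # GD step first, then scale by (c, c^2)
print(np.allclose(lhs[0], rhs[0]), np.isclose(lhs[1], rhs[1]))   # True True: they commute
```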

SLIDE 10

Warm-up: Equivalence of the momentum-free case

Let α = 1 − λη. One step of GD with weight decay, GDWD_t: (w, η) → (α·w − η·∇L_t(w), η), factors through a plain GD step at a scaled LR:

(w, η)
  →Π2^{α^{-1}}→   (w, α^{-1}·η)
  →GD_t→          (w − α^{-1}η·∇L_t(w), α^{-1}·η)
  →Π2^α ∘ Π1^α→   (α·w − η·∇L_t(w), η) = GDWD_t(w, η),

i.e., GDWD_t = Π1^α ∘ Π2^α ∘ GD_t ∘ Π2^{α^{-1}}.
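
The factorization can be verified the same way, reusing the GD/Π1/Π2 maps from the previous sketch; the values of η and λ below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = rng.normal(size=5), 0.3

def grad(w):                          # toy scale-invariant loss, as before
    n = np.linalg.norm(w)
    u = w / n
    s = a @ u
    return 2.0 * (s - b) * (a - s * u) / n

def GD(state):
    w, lr = state
    return (w - lr * grad(w), lr)

def Pi1(c, state):                    # scale weights
    w, lr = state
    return (c * w, lr)

def Pi2(c, state):                    # scale LR
    w, lr = state
    return (w, c * lr)

eta, lam = 0.1, 1e-2
alpha = 1.0 - lam * eta
w = rng.normal(size=5)

gdwd = ((1.0 - lam * eta) * w - eta * grad(w), eta)            # one GD + weight-decay step
factored = Pi1(alpha, Pi2(alpha, GD(Pi2(1.0 / alpha, (w, eta)))))
print(np.allclose(gdwd[0], factored[0]), np.isclose(gdwd[1], factored[1]))   # True True
```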

SLIDE 11

Warm-up: Equivalence of the momentum-free case

With α = 1 − λη (the nonzero root of the quadratic when γ = 0), every step factors as GDWD_t = Π1^α ∘ Π2^α ∘ GD_t ∘ Π2^{α^{-1}}.

Theorem: GD + WD + constant LR = GD + Exp LR (in function space).

Proof: Compose the factorization over t steps. The equivalent scalings Π1^c ∘ Π2^{c²} commute with GD steps, so they can be collected at the front, where they telescope:

Π1^{α^{-t}} ∘ Π2^{α^{-2t}} ∘ GDWD_{t−1} ∘ ⋯ ∘ GDWD_0
    = Π2^{α^{-1}} ∘ GD_{t−1} ∘ Π2^{α^{-2}} ∘ GD_{t−2} ∘ ⋯ ∘ Π2^{α^{-2}} ∘ GD_0 ∘ Π2^{α^{-1}}.

The left side is the constant-LR, weight-decay trajectory w_1, w_2, w_3, … followed by an equivalent scaling, so w_t and w_t' = α^{-t}·w_t define the same function. The right side is plain GD whose LR is multiplied by α^{-2} at every step, i.e., the exponential schedule η_t = η·α^{-2t-1}.
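
Finally, the warm-up theorem can be checked end to end on the toy loss: run GD + weight decay at constant LR and plain GD with the exponential schedule η_t = η·α^(−2t−1) from the same initialization, and confirm that at every step w_t' = α^(−t)·w_t, so the two runs define the same function throughout. The values of η and λ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = rng.normal(size=5), 0.3

def grad(w):                          # toy scale-invariant loss, as before
    n = np.linalg.norm(w)
    u = w / n
    s = a @ u
    return 2.0 * (s - b) * (a - s * u) / n

eta, lam, T = 0.1, 1e-2, 50           # illustrative constant LR and weight decay
alpha = 1.0 - lam * eta               # the nonzero root of the quadratic when momentum = 0

w = v = rng.normal(size=5)            # identical initialization for both runs
ok = True
for t in range(T):
    w = (1.0 - lam * eta) * w - eta * grad(w)            # GD + WD, constant LR eta
    v = v - eta * alpha ** (-(2 * t + 1)) * grad(v)      # plain GD, exponential LR eta_t
    ok = ok and np.allclose(v, alpha ** (-(t + 1)) * w)  # check v_t = alpha^(-t) * w_t

print(ok)   # True: the two trajectories coincide in function space at every step
```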

SLIDE 12

Conclusion

  • Scale invariance (provided by BN) makes the training procedure incredibly robust to LR schedules, even to exponentially growing schedules.
  • The space of good LR schedules in current architectures is vast (hopeless to search for the best schedule??).
  • Current ways of thinking about training/optimization should be rethought;
  • we should focus on the trajectory, not the landscape.