

SLIDE 1

An Exponential LR Schedule for Deep Learning

(Strange Behavior of Normalization Layers)

Sanjeev Arora

Princeton & IAS

Zhiyuan Li

Princeton


ICLR, April 2020

SLIDE 2

x_{t+1} ← x_t − η·∇L(x_t). Traditional: start with some LR η; decay it over time.

(extensive literature in optimization justifying this)

Confusingly, exotic LR schedules are also reported to work: triangular [Smith, 2015], cosine [Loshchilov & Hutter, 2016], etc.; no justification in theory.

Learning rate in traditional optimization

SLIDE 3

Result 1 (empirical): It is possible to train today's deep architectures while growing the LR exponentially (i.e., at each iteration multiply it by (1 + c) for some c > 0).

Result 2 (theory): Mathematical proof that an exponential LR schedule can yield (in function space*) every net produced by existing training schedules.

(* In all nets that use batch norm [Ioffe & Szegedy, 2015] or any other normalization scheme.)

Raises questions for theory; highlights the importance of trajectory analysis (vs. landscape).

This work: Exponential LR schedules

Figure: PreResNet32 on CIFAR-10, with the fixed-LR run converted to the Exp LR schedule η_t = 0.1 × 1.481^t.
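
Below is a minimal sketch (not from the slides) of what "growing the LR exponentially" looks like in a training loop. The tiny BN convnet, the random tensors, and the short epoch loop are placeholders for the slide's PreResNet32/CIFAR-10 setup; weight decay is set to 0, as the theory on the next slide requires for the exponential run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder model/data standing in for PreResNet32 + CIFAR-10.
# The theory applies when the loss is scale-invariant (normalization after
# every layer, final layer fixed); this toy net is only meant to show the loop.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0)

growth = 1.481                              # per-epoch growth factor from the figure
x = torch.randn(256, 3, 32, 32)             # dummy data (replace with a CIFAR-10 loader)
y = torch.randint(0, 10, (256,))

for epoch in range(5):
    for i in range(0, 256, 64):
        loss = F.cross_entropy(model(x[i:i + 64]), y[i:i + 64])
        opt.zero_grad()
        loss.backward()
        opt.step()
    for g in opt.param_groups:              # grow the LR exponentially, once per epoch
        g["lr"] *= growth
    print(f"epoch {epoch}: lr = {opt.param_groups[0]['lr']:.4f}")
```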

SLIDE 4

Main Thm (original schedule has fixed LR)

"General training algorithm; fixed LR":

w_{t+1} ← w_t − η·∇(L_t(w_t) + (λ/2)·‖w_t‖²) + γ·(w_t − w_{t−1})

(η: learning rate; γ: momentum; λ: ℓ2 regularizer, aka weight decay; L_t: stochastic loss at round t)

Thm: For nets w/ batch norm or layer norm, the following is equivalent to the above: weight decay 0, momentum γ, and LR schedule η_t = η·α^{−2t−1}, where α (α < 1) is a nonzero root of

x² − (1 + γ − λη)·x + γ = 0.

(Proof shows nets in the new trajectory are equivalent in function space to nets in the original trajectory.)
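
A small sketch of the theorem's bookkeeping: solve the quadratic for α and emit the equivalent exponential schedule η_t = η·α^(−2t−1). The hyperparameters below (η = 0.1, γ = 0.9, λ = 5e-4) are illustrative choices, not values given on the slide; with them, and assuming roughly 391 minibatches per CIFAR-10 epoch, the larger root compounds to about the 1.48×-per-epoch factor shown in the earlier figure.

```python
import math

def exp_lr_schedule(eta, gamma, lam, num_steps):
    """eta_t = eta * alpha**(-(2t + 1)), with alpha a root (< 1) of
    x^2 - (1 + gamma - lam*eta)*x + gamma = 0 (the quadratic from the theorem)."""
    b = 1.0 + gamma - lam * eta
    disc = b * b - 4.0 * gamma
    assert disc >= 0.0, "no real root for these hyperparameters"
    alpha = (b + math.sqrt(disc)) / 2.0     # larger root, just below 1 (assumed choice)
    return alpha, [eta * alpha ** (-(2 * t + 1)) for t in range(num_steps)]

alpha, lrs = exp_lr_schedule(eta=0.1, gamma=0.9, lam=5e-4, num_steps=3)
print(alpha)                # ~0.9995, so the LR grows by alpha**(-2), about 0.1% per step
print(lrs)                  # ~[0.10005, 0.10015, 0.10025]
print(alpha ** (-2 * 391))  # ~1.48: per-epoch growth assuming ~391 steps/epoch (batch 128)
```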

SLIDE 5

Original schedule (Step Decay): K phases. In phase J (iterations U_J through U_{J+1} − 1), use LR η_J*.

More general (original schedule has varying LR)

Thm: Step Decay can be realized with a tapered exponential LR schedule

Tapered Exponential LR schedule (TEXP):

Exponentially growing LR within each phase; when entering a new phase:

  • switch to a slower exponential growth rate;
  • divide the current LR by some constant.

(See details in the paper)
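
The exact TEXP formulas are deferred to the paper, so the sketch below only mirrors the qualitative recipe above: grow the LR exponentially within each phase, and at each phase boundary divide the LR by a constant and switch to a slower growth rate. All constants here are placeholders.

```python
def texp_schedule(lr0, growth_rates, phase_lens, drop_factors):
    """Qualitative TEXP sketch (placeholder constants, not the paper's formulas):
    exponential LR growth within each phase; an LR drop and a slower growth rate
    whenever a new phase starts."""
    lrs, lr = [], lr0
    for phase, (g, n) in enumerate(zip(growth_rates, phase_lens)):
        if phase > 0:
            lr /= drop_factors[phase - 1]    # entering a new phase: divide the LR ...
        for _ in range(n):                   # ... and keep growing, at the slower rate g
            lr *= g
            lrs.append(lr)
    return lrs

schedule = texp_schedule(lr0=0.1,
                         growth_rates=[1.002, 1.001, 1.0005],   # slower each phase
                         phase_lens=[100, 100, 100],
                         drop_factors=[10.0, 10.0])
print(schedule[99], schedule[100])           # LR drops by ~10x at the phase boundary
```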

SLIDE 6

Observation: Batch norm + fixed final layer ⇒ the training loss is scale-invariant: L(c·w) = L(w) for every vector w of net parameters and every c > 0. (True for feed-forward nets, ResNets, DenseNets, etc.; see our appendix.)

Key concept in proof: Scale-invariant training loss

๐œพ๐’– โˆ’๐›‚๐Œ(๐œพ๐’–) ๐’…๐œพ๐’– โˆ’๐›‚๐Œ(๐๐œพ๐’–) Lemma: Gradient for such losses satisfies Scale-invariant loss fn sufficient for state of the art deep learning![Hoffer et al,18]

SLIDE 7

๐œพ๐’–

#, ๐œฝโ€ฒ โ†’ (๐’…๐œพ๐’–, ๐’…๐Ÿ‘๐œฝ)

๐œฝ ๐œพ๐’–

# = ๐’…๐œพ๐’–

๐›‚๐Œ(๐๐œพ๐’–) = ๐"๐Ÿ ๐›‚๐Œ(๐œพ๐’–)

๐œพ๐’– โˆ’๐›‚๐Œ(๐œพ๐’–) ๐œฝ# = c๐Ÿ‘๐œฝ โˆ’๐๐Ÿ‘๐œฝ๐›‚๐Œ(๐๐œพ๐’–) โˆ’๐œฝ๐›‚๐Œ(๐๐œพ๐’–) ๐œฝ ๐œพ๐’–%๐Ÿ

#

, ๐œฝโ€ฒ โ†’ (๐’…๐œพ๐’–%๐Ÿ, ๐’…๐Ÿ‘๐œฝ) ๐œพ๐’–%๐Ÿ

#

= ๐œพ๐’–

# โˆ’ ๐œฝโ€ฒ๐›‚๐Œ(๐œพ๐’– #)

๐œพ๐’–%๐Ÿ = ๐œพ๐’–%๐Ÿ โˆ’ ๐œฝ๐›‚๐Œ(๐œพ๐’–) ๐œฝ# = c๐Ÿ‘๐œฝ ๐œพ๐’–%๐Ÿ

#

๐œพ๐’–%๐Ÿ

Proof sketch ch when momentum=0
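
The chain above can be checked numerically: take one GD step from (w, η) and one from (c·w, c²·η) and confirm the results remain scaled copies. The toy scale-invariant loss from the earlier sketch stands in for a normalized network.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=5), 0.3

def grad(w):                                   # gradient of the toy scale-invariant loss
    n = np.linalg.norm(w)
    u = w / n
    s = a @ u
    return 2.0 * (s - b) * (a - s * u) / n

def gd_step(w, lr):
    return w - lr * grad(w)

w, lr, c = rng.normal(size=5), 0.1, 2.5
w_next = gd_step(w, lr)                        # step from (w, eta)
w_next_scaled = gd_step(c * w, c ** 2 * lr)    # step from (c*w, c^2*eta)
print(np.allclose(w_next_scaled, c * w_next))  # True: trajectories stay scaled copies
```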

SLIDE 8

๐œพ๐’–

0, ๐œฝโ€ฒ โ†’ (๐’…๐œพ๐’–, ๐’…๐Ÿ‘๐œฝ)

๐œพ๐’–"๐Ÿ , ๐œฝโ€ฒ โ†’ (๐’…๐œพ๐’–"๐Ÿ, ๐’…๐Ÿ‘๐œฝ) ๐œพ๐’–"๐Ÿ = ๐œพ๐’–

0 โˆ’ ๐œฝโ€ฒ๐›‚๐Œ(๐œพ๐’– 0)

๐œพ๐’–"๐Ÿ = ๐œพ๐’–"๐Ÿ โˆ’ ๐œฝ๐›‚๐Œ(๐œพ๐’–)

๐œพ๐’–

", ๐œฝโ€ฒ

๐œพ๐’–, ๐œฝ ๐œพ๐’–#๐Ÿ, ๐œฝ ๐œพ๐’–#๐Ÿ

"

, ๐œฝโ€ฒ

Wa Warm-up: up: Equivalence ce of

  • f mo

momen mentum-fr free ca case

State =

SLIDE 9

Warm-up: Equivalence of the momentum-free case

Notation: GD_t is one GD step, (w, η) → (w − η·∇L_t(w), η); Π1^c scales the weights, (w, η) → (c·w, η); Π2^c scales the LR, (w, η) → (w, c·η).

Equivalent scaling: Π1^c ∘ Π2^{c²} does not change the function computed by any net on the trajectory, and it commutes with GD_t:

GD_t ∘ Π1^c ∘ Π2^{c²} = Π1^c ∘ Π2^{c²} ∘ GD_t.
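
The same claim, phrased on states (w, η) with explicit maps. The Π1/Π2 conventions below follow the reconstruction above (Π1 scales the weights, Π2 scales the LR) and are meant as an illustration, not as the paper's exact notation.

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = rng.normal(size=5), 0.3

def grad(w):                         # toy scale-invariant loss, as in the earlier sketches
    n = np.linalg.norm(w)
    u = w / n
    s = a @ u
    return 2.0 * (s - b) * (a - s * u) / n

def GD(state):                       # one GD step on the state (weights, lr)
    w, lr = state
    return (w - lr * grad(w), lr)

def Pi1(c, state):                   # scale the weights by c
    w, lr = state
    return (c * w, lr)

def Pi2(c, state):                   # scale the LR by c
    w, lr = state
    return (w, c * lr)

state, c = (rng.normal(size=5), 0.1), 1.7
lhs = GD(Pi1(c, Pi2(c ** 2, state)))       # scale by (c, c^2), then take a GD step
rhs = Pi1(c, Pi2(c ** 2, GD(state)))       # GD step first, then scale by (c, c^2)
print(np.allclose(lhs[0], rhs[0]), np.isclose(lhs[1], rhs[1]))   # True True: they commute
```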

SLIDE 10

Warm-up: Equivalence of the momentum-free case

Let α = 1 − λη. One step of GD with weight decay, GDWD_t: (w, η) → (α·w − η·∇L_t(w), η), factors through a plain GD step at a scaled LR:

(w, η)
  →Π2^{α^{-1}}→   (w, α^{-1}·η)
  →GD_t→          (w − α^{-1}η·∇L_t(w), α^{-1}·η)
  →Π2^α ∘ Π1^α→   (α·w − η·∇L_t(w), η) = GDWD_t(w, η),

i.e., GDWD_t = Π1^α ∘ Π2^α ∘ GD_t ∘ Π2^{α^{-1}}.
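
The factorization can be verified the same way, reusing the GD/Π1/Π2 maps from the previous sketch; the values of η and λ below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = rng.normal(size=5), 0.3

def grad(w):                          # toy scale-invariant loss, as before
    n = np.linalg.norm(w)
    u = w / n
    s = a @ u
    return 2.0 * (s - b) * (a - s * u) / n

def GD(state):
    w, lr = state
    return (w - lr * grad(w), lr)

def Pi1(c, state):                    # scale weights
    w, lr = state
    return (c * w, lr)

def Pi2(c, state):                    # scale LR
    w, lr = state
    return (w, c * lr)

eta, lam = 0.1, 1e-2
alpha = 1.0 - lam * eta
w = rng.normal(size=5)

gdwd = ((1.0 - lam * eta) * w - eta * grad(w), eta)            # one GD + weight-decay step
factored = Pi1(alpha, Pi2(alpha, GD(Pi2(1.0 / alpha, (w, eta)))))
print(np.allclose(gdwd[0], factored[0]), np.isclose(gdwd[1], factored[1]))   # True True
```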

SLIDE 11

Warm-up: Equivalence of the momentum-free case

With α = 1 − λη (the nonzero root of the quadratic when γ = 0), every step factors as GDWD_t = Π1^α ∘ Π2^α ∘ GD_t ∘ Π2^{α^{-1}}.

Theorem: GD + WD + constant LR = GD + Exp LR (in function space).

Proof: Compose the factorization over t steps. The equivalent scalings Π1^c ∘ Π2^{c²} commute with GD steps, so they can be collected at the front, where they telescope:

Π1^{α^{-t}} ∘ Π2^{α^{-2t}} ∘ GDWD_{t−1} ∘ ⋯ ∘ GDWD_0
    = Π2^{α^{-1}} ∘ GD_{t−1} ∘ Π2^{α^{-2}} ∘ GD_{t−2} ∘ ⋯ ∘ Π2^{α^{-2}} ∘ GD_0 ∘ Π2^{α^{-1}}.

The left side is the constant-LR, weight-decay trajectory w_1, w_2, w_3, … followed by an equivalent scaling, so w_t and w_t' = α^{-t}·w_t define the same function. The right side is plain GD whose LR is multiplied by α^{-2} at every step, i.e., the exponential schedule η_t = η·α^{-2t-1}.
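
Finally, the warm-up theorem can be checked end to end on the toy loss: run GD + weight decay at constant LR and plain GD with the exponential schedule η_t = η·α^(−2t−1) from the same initialization, and confirm that at every step w_t' = α^(−t)·w_t, so the two runs define the same function throughout. The values of η and λ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = rng.normal(size=5), 0.3

def grad(w):                          # toy scale-invariant loss, as before
    n = np.linalg.norm(w)
    u = w / n
    s = a @ u
    return 2.0 * (s - b) * (a - s * u) / n

eta, lam, T = 0.1, 1e-2, 50           # illustrative constant LR and weight decay
alpha = 1.0 - lam * eta               # the nonzero root of the quadratic when momentum = 0

w = v = rng.normal(size=5)            # identical initialization for both runs
ok = True
for t in range(T):
    w = (1.0 - lam * eta) * w - eta * grad(w)            # GD + WD, constant LR eta
    v = v - eta * alpha ** (-(2 * t + 1)) * grad(v)      # plain GD, exponential LR eta_t
    ok = ok and np.allclose(v, alpha ** (-(t + 1)) * w)  # check v_t = alpha^(-t) * w_t

print(ok)   # True: the two trajectories coincide in function space at every step
```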

SLIDE 12

Conclusion

  • Scale invariance (provided by BN) makes the training procedure incredibly robust to LR schedules, even to exponentially growing schedules.
  • The space of good LR schedules in current architectures is vast (hopeless to search for the best schedule??).
  • Current ways of thinking about training/optimization should be rethought;
  • we should focus on the trajectory, not the landscape.