Reconciling Traditional Optimization Analyses and Modern Deep Learning: the Intrinsic Learning Rate - PowerPoint PPT Presentation

slide-1
SLIDE 1

Reconciling Traditional Optimization Analyses and Modern Deep Learning: the Intrinsic Learning Rate

NeurIPS Dec, 2020

Sanjeev Arora

Princeton University & IAS

Zhiyuan Li*

Princeton University

Kaifeng Lyu*

Tsinghua University

*: these authors contributed equally

slide-2
SLIDE 2

Settings

  • Normalization, e.g. BN, makes the loss scale-invariant:
    L(x_t; B_t) = L(c·x_t; B_t), ∀c > 0
    ∇L(x_t; B_t) = c·∇L(c·x_t; B_t); ∇²L(x_t; B_t) = c²·∇²L(c·x_t; B_t), ∀c > 0
  • Stochastic Gradient Descent and Weight Decay

Notation: B_t: batch of training data at iteration t; x_t: weights of the neural net; μ: Weight Decay (WD) factor, aka ℓ2 regularization; θ_t: Learning Rate (LR) at iteration t.

Our Contribution:

  • 1. Several surprising incompatibilities between normalized nets and traditional analyses.
  • 2. A new theory, based on a new analysis of the SDEs arising from SGD on normalized nets, suggesting that LR does not play the role assumed in most discussions. Our analysis shows that μθ, i.e., WD × LR, is a better measure of the speed of learning, and thus we call μθ the Intrinsic Learning Rate.
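For concreteness, here is a minimal LaTeX restatement of this setting, assuming the standard SGD-with-weight-decay update (the update equation itself did not survive extraction); symbols follow the notation above.

    % Scale invariance of the loss of a normalized net (e.g., with BN):
    %   L(x_t; B_t) = L(c x_t; B_t)   for all c > 0,
    % which, differentiating in x, gives
    %   \nabla L(x_t; B_t)   = c   \nabla L(c x_t; B_t)
    %   \nabla^2 L(x_t; B_t) = c^2 \nabla^2 L(c x_t; B_t)
    %
    % Assumed update rule: SGD with weight decay \mu and learning rate \theta_t
    \[
      x_{t+1} = (1 - \theta_t \mu)\, x_t - \theta_t \nabla L(x_t; B_t)
    \]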

slide-3
SLIDE 3

Conventional Wisdoms in Deep Learning (that we suspect!)

Wisdom 1: As LR → 0, the optimization dynamics converges to a deterministic path (Gradient Flow) along which the training loss strictly decreases.

Wisdom 2: To achieve the best generalization, the LR must be large initially for quite a few epochs.

Wisdom 3: SGD can be modeled via an SDE with a fixed Gaussian noise, namely, as a diffusion process that mixes to some Gibbs-like distribution.

Figure from a blog post by Holden Lee and Andrej Risteski.

[Figures: ResNet with BN trained by full-batch GD + WD on subsampled CIFAR-10; ResNet with BN trained by SGD with different LR schedules on CIFAR-10.]
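For reference, a short LaTeX sketch of the objects Wisdoms 1 and 3 refer to; the Langevin-type noise scale below is the textbook form and is an assumption here, not taken from the slides.

    % Wisdom 1: as the LR theta -> 0, (S)GD approaches gradient flow
    %   dx/dt = -\nabla L(x),  along which  dL/dt = -\|\nabla L(x)\|^2 \le 0,
    % so the training loss is non-increasing along the path.
    %
    % Wisdom 3: SGD modeled as a diffusion with fixed Gaussian noise (Langevin form)
    %   dX_t = -\nabla L(X_t)\, dt + \sqrt{2T}\, dW_t,
    % whose stationary, Gibbs-like distribution is  p(x) \propto \exp(-L(x)/T).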

slide-6
SLIDE 6

Conventional Wisdoms challenged

(Against Wisdom 1): Full-batch gradient descent ≠ gradient flow. The loss goes up as the Hessian blows up when ‖x‖ → 0.

(Against Wisdom 2): Small LR can generalize equally well as large LR.

(Against Wisdom 3): The random-walk/SDE view of SGD is way off. There is no evidence of mixing, as traditionally understood, at least within normal training times.
  • Stochastic Weight Averaging improves accuracy (Izmailov et al., 2018).
  • SWA rules out that SGD, as a diffusion process, is mixing to a unique global equilibrium (a toy sketch of SWA follows this slide).

[Figures: ResNet with BN trained by full-batch GD + WD on subsampled CIFAR-10; ResNet with BN trained by SGD with different LR schedules on CIFAR-10. Same settings as before, trained longer.]
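To make the SWA bullet above concrete, here is a small, self-contained Python sketch of tail-averaging SGD iterates on a hypothetical noisy quadratic; it only illustrates what SWA does mechanically and is not the authors' experimental setup.

    # Toy sketch of Stochastic Weight Averaging (SWA): average the SGD
    # iterates collected near the end of training. On this hypothetical
    # noisy quadratic, the averaged weights typically have lower loss than
    # the last iterate, because averaging cancels the noise-driven
    # fluctuation of the iterates around the minimum.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 50
    A = np.diag(np.linspace(0.1, 2.0, d))      # toy quadratic loss L(w) = 0.5 w^T A w

    def loss(w):
        return 0.5 * w @ (A @ w)

    def stochastic_grad(w):
        return A @ w + 0.1 * rng.standard_normal(d)   # exact gradient + minibatch-like noise

    w = rng.standard_normal(d)
    lr, steps, tail = 0.05, 5000, 1000
    tail_iterates = []

    for t in range(steps):
        w = w - lr * stochastic_grad(w)
        if t >= steps - tail:                  # keep the last `tail` iterates
            tail_iterates.append(w.copy())

    w_swa = np.mean(tail_iterates, axis=0)     # SWA weights: average of the tail iterates
    print(f"last-iterate loss: {loss(w):.4f}")
    print(f"SWA-average loss:  {loss(w_swa):.4f}")   # typically noticeably smaller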

slide-9
SLIDE 9

Conventional Wisdoms challenged

(Against Wisdom 1): Full-batch gradient descent ≠ gradient flow.

(Against Wisdom 2): Small LR can generalize equally well as large LR.

(Against Wisdom 3): The random-walk/SDE view of SGD is way off; there is no evidence of mixing, as traditionally understood, at least within normal training times.

So what's going on?

slide-10
SLIDE 10

SDE-based framework for modeling SGD on Normalized Networks

  • Standard SDE approximation [equation on slide]
  • New parametrization [equation on slide]

Main Theorem:

Effective Learning Rate: θ_t / δ_t, where δ_t = ‖X_t‖²

Intrinsic Learning Rate: μθ

Three implications of the main theorem:

  • 1. The effective learning rate depends on the norm of X_t: it equals θ / ‖X_t‖².
  • 2. μθ alone determines the evolution of the system.
  • 3. The norm converges in O(1/(μθ)) time (steps).

The SDE decomposes into the dynamics of the norm and the dynamics of the direction.
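To see implications 1-3 numerically, here is a small, self-contained Python sketch (not the authors' code): SGD with weight decay on a hypothetical scale-invariant toy loss, tracking the squared parameter norm and the effective LR θ/‖x‖². Under the heuristic above, the norm should settle within a few multiples of 1/(μθ) steps.

    # SGD + weight decay on a scale-invariant toy loss.
    # Toy loss: L(x) = 1 - (x.v)^2 / ||x||^2 with a fixed unit vector v,
    # so L(c x) = L(x) for every c > 0, mimicking a normalized network.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 20
    v = np.zeros(d); v[0] = 1.0                    # target direction

    def grad(x):
        n2 = x @ x
        g = -2.0 * (x @ v) * v / n2 + 2.0 * (x @ v) ** 2 * x / n2 ** 2
        # noise that also scales like 1/||x||, mimicking minibatch gradients
        return g + 0.05 * rng.standard_normal(d) / np.sqrt(n2)

    theta, mu = 0.1, 2e-3                          # LR and WD; intrinsic LR = mu * theta
    x = rng.standard_normal(d)

    for t in range(int(10 / (mu * theta))):        # run ~10 multiples of 1/(mu*theta) steps
        x = (1.0 - theta * mu) * x - theta * grad(x)
        if t % int(0.5 / (mu * theta)) == 0:       # report twice per 1/(mu*theta) steps
            print(f"step {t:6d}  ||x||^2 = {x @ x:7.3f}  effective LR = {theta / (x @ x):.5f}")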

slide-11
SLIDE 11

A conjecture about mixing time in function space

  • SDE (SGD) mixes slowly in parameter space, but in experiments it mixes fast in function space, i.e., shortly after norm convergence, at t = O(1/(μθ)).

  • Fast Equilibrium Conjecture: a network from any non-pathological initialization converges to the function-space equilibrium in O(1/(μθ)) steps, i.e.,

  • at equilibrium one cannot distinguish two networks (from different initializations) via (further training +) evaluation on any input data, but it is possible to distinguish them by looking at their parameters (e.g., via Stochastic Weight Averaging).

slide-12
SLIDE 12

What happens in real-life training - an interpretation

Our theory explains …

  • Why does the test/train error suddenly increase after every LR decay?
  • How many LR decays are necessary to reach the best performance?
  • Only the last one is necessary, if you don't mind training your network longer.
  • What is the benefit of an early large LR, then?
  • Faster convergence to the equilibrium (of the small Intrinsic LR).

Figure from β€˜Wide Resnet’ paper (Zagoruyko & Komodakis, 2017)

Reason: the effective LR is first divided by 10, but its stationary value only decreases by roughly √10.
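A back-of-the-envelope derivation of the √10 figure, under the scale-invariance assumptions from the Settings slide (a rough sketch; the paper's precise statement may differ).

    % For a scale-invariant loss the gradient is orthogonal to x and scales
    % like 1/||x||, so in expectation the squared norm evolves roughly as
    %   ||x_{t+1}||^2 \approx (1 - \theta\mu)^2 ||x_t||^2 + \theta^2 G^2 / ||x_t||^2,
    % where G is the gradient scale on the unit sphere. Balancing the two
    % terms at stationarity:
    %   2\theta\mu\,||x||^2 \approx \theta^2 G^2 / ||x||^2
    %   \Rightarrow ||x||^2 \approx G\,\sqrt{\theta / (2\mu)},
    % so the stationary effective LR is
    %   \theta / ||x||^2 \approx \sqrt{2\mu\theta} / G  \propto  \sqrt{\mu\theta}.
    % Dividing \theta by 10 (with \mu fixed) therefore shrinks the stationary
    % effective LR only by about \sqrt{10} \approx 3.16, even though the
    % effective LR drops by 10x at the moment of the decay.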

See more interpretations in our paper!