
Reconciling Traditional Optimization Analyses and Modern Deep Learning: the Intrinsic Learning Rate - PowerPoint PPT Presentation

  1. Reconciling Traditional Optimization Analyses and Modern Deep Learning: the Intrinsic Learning Rate. Kaifeng Lyu* (Tsinghua University), Sanjeev Arora (Princeton University & IAS), Zhiyuan Li* (Princeton University). NeurIPS, December 2020. *: these authors contributed equally.

  2. Settings
  • Normalization, e.g. BN, gives scale invariance of the loss in the weights:
    L(w_t; B_t) = L(c·w_t; B_t),  ∇L(w_t; B_t) = c·∇L(c·w_t; B_t),  ∇²L(w_t; B_t) = c²·∇²L(c·w_t; B_t),  ∀c > 0.
  • Stochastic gradient descent with weight decay: w_{t+1} = (1 − η_t·λ)·w_t − η_t·∇L(w_t; B_t), where
    B_t is the batch of training data at iteration t, w_t are the weights of the neural net, η_t is the learning rate (LR) at iteration t, and λ is the weight decay (WD) factor, a.k.a. ℓ₂ regularization.
  • Our contributions:
    1. Several surprising incompatibilities between normalized nets and traditional analyses.
    2. A new theory suggesting that LR does not play the role assumed in most discussions, via a new analysis of the SDEs arising from SGD on normalized nets. Our analysis shows that λη, i.e. WD × LR, is a better measure of the speed of learning; we therefore call λη the intrinsic learning rate.
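
The scale-invariance identities above can be sanity-checked numerically. The snippet below is a minimal sketch (a toy conv/BN/linear setup of my own choosing, not the authors' code or experiments): it rescales the weights that feed into a BatchNorm layer by c = 2 and checks that the loss is (nearly) unchanged while the gradient shrinks accordingly, matching ∇L(w) = c·∇L(c·w).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy pieces: a conv layer whose weights feed into a BatchNorm layer, plus a fixed head.
conv = nn.Conv2d(3, 8, kernel_size=3, bias=False)
bn = nn.BatchNorm2d(8, affine=False)      # the normalization is what creates the scale invariance
head = nn.Linear(8, 10)                   # kept fixed; only the conv weights are rescaled

x = torch.randn(16, 3, 8, 8)
y = torch.randint(0, 10, (16,))

def loss_and_grad(scale):
    """Loss and gradient w.r.t. the rescaled conv weights c * w."""
    w = (conv.weight.detach() * scale).requires_grad_(True)
    h = bn(F.conv2d(x, w))                # BN divides out the scale of w (up to its eps)
    loss = F.cross_entropy(head(h.mean(dim=(2, 3))), y)
    (g,) = torch.autograd.grad(loss, w)
    return loss.item(), g

loss1, g1 = loss_and_grad(1.0)
loss2, g2 = loss_and_grad(2.0)
print(f"L(w)  = {loss1:.6f}")
print(f"L(2w) = {loss2:.6f}   # equal up to BN's eps and floating-point error")
print(f"max |grad L(w) - 2*grad L(2w)| = {(g1 - 2 * g2).abs().max().item():.2e}")
```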

  3. Conventional Wisdoms in Deep Learning (that we suspect!)
  • Wisdom 1: As LR → 0, the optimization dynamics converge to a deterministic path (gradient flow) along which the training loss strictly decreases.
  • Wisdom 2: To achieve the best generalization, the LR must be large initially for quite a few epochs.
  • Wisdom 3: SGD can be modeled via an SDE with a fixed Gaussian noise, namely as a diffusion process that mixes to some Gibbs-like distribution.
  (Figures: illustration from a blog post by Holden Lee and Andrej Risteski; ResNet with BN trained by full-batch GD + WD on subsampled CIFAR-10; ResNet with BN trained by SGD with different LR schedules on CIFAR-10.)

  6. Conventional Wisdoms Challenged
  • (Against Wisdom 1) Full-batch gradient descent ≠ gradient flow: the loss goes up as the Hessian blows up when ‖w_t‖ → 0.
  • (Against Wisdom 2) A small LR can generalize equally well as a large LR.
  • (Against Wisdom 3) The random-walk/SDE view of SGD is way off: there is no evidence of mixing as traditionally understood, at least within normal training times. Stochastic Weight Averaging improves accuracy (Izmailov et al., 2018); SWA rules out that SGD, seen as a diffusion process, is mixing to a unique global equilibrium.
  (Figures: ResNet with BN trained by full-batch GD + WD on subsampled CIFAR-10, same setting as before but longer training; ResNet with BN trained by SGD with different LR schedules on CIFAR-10, same setting as before but longer training.)
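
The "against Wisdom 1" point can be reproduced qualitatively on a toy scale-invariant model. The sketch below (my own construction with hand-picked constants, not the paper's ResNet experiment) runs full-batch GD + WD on the scale-invariant logistic loss L(w) = mean_i log(1 + exp(−y_i·⟨w/‖w‖, x_i⟩)): weight decay keeps shrinking ‖w_t‖, the effective step size η/‖w_t‖² (and the Hessian, which scales as 1/‖w_t‖²) keeps growing, and eventually the loss stops decreasing, so GD does not follow gradient flow.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 256
X = rng.normal(size=(n, d))
v = rng.normal(size=d); v /= np.linalg.norm(v)
y = np.sign(X @ v)                          # linearly separable toy labels

def loss_and_grad(w):
    """Scale-invariant logistic loss and its gradient w.r.t. w (the gradient is orthogonal to w)."""
    u = w / np.linalg.norm(w)
    m = y * (X @ u)                         # margins
    loss = np.mean(np.log1p(np.exp(-m)))
    coef = -y / (1.0 + np.exp(m))           # d(loss)/d(margin)
    g_u = (coef[:, None] * X).mean(axis=0)
    g_w = (g_u - (g_u @ u) * u) / np.linalg.norm(w)   # chain rule through u = w/||w||
    return loss, g_w

eta, lam = 0.1, 0.05                        # LR and WD (hand-picked toy values)
w = 2.0 * rng.normal(size=d) / np.sqrt(d)   # start with ||w|| around 2
for t in range(2001):
    loss, g = loss_and_grad(w)
    if t % 200 == 0:
        print(f"step {t:4d}  loss {loss:.4f}  ||w||^2 {w @ w:8.4f}  effective LR {eta / (w @ w):.3f}")
    w = (1 - eta * lam) * w - eta * g       # full-batch GD + weight decay
# Typical output: the loss first decreases, then -- once WD has shrunk ||w|| enough that the
# effective LR exceeds the stability threshold -- the loss rises and oscillates.
```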

  9. So what's going on?

  10. SDE-based Framework for Modeling SGD on Normalized Networks
  • Standard SDE approximation of SGD + WD: dW_t = −(∇L(W_t) + λ·W_t) dt + (η·Σ(W_t))^{1/2} dZ_t, where Σ is the covariance of the stochastic gradient and Z_t is standard Brownian motion.
  • New parametrization: the intrinsic learning rate λη.
  • Main Theorem: the dynamics decompose into the dynamics of the direction W_t/‖W_t‖ and the dynamics of the norm ‖W_t‖², with effective learning rate η_e = η/‖W_t‖².
  • Three implications of the main theorem:
    1. The effective learning rate depends on the norm of W: η_e = η/‖W_t‖².
    2. The intrinsic LR λη alone determines the evolution of the system.
    3. The norm converges in O(1/(λη)) time (steps).
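
To make implications 1-3 concrete, here is a small simulation sketch (the same toy scale-invariant logistic loss as above, with constants of my own choosing, not the paper's experiments): under SGD + WD the squared norm ‖w_t‖² drifts to a stationary value on a time scale of order 1/(λη) steps, and the effective LR η/‖w_t‖² settles near a value of order √(λη) divided by the gradient scale, regardless of where the norm starts.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 20, 256
X = rng.normal(size=(n, d))
v = rng.normal(size=d); v /= np.linalg.norm(v)
y = np.sign(X @ v)

def grad_batch(w, idx):
    """Mini-batch gradient of the scale-invariant logistic loss L(w) = mean log(1+exp(-y <w/||w||, x>))."""
    u = w / np.linalg.norm(w)
    m = y[idx] * (X[idx] @ u)
    coef = -y[idx] / (1.0 + np.exp(m))
    g_u = (coef[:, None] * X[idx]).mean(axis=0)
    return (g_u - (g_u @ u) * u) / np.linalg.norm(w)   # gradient w.r.t. w, orthogonal to w

eta, lam = 0.1, 5e-3            # LR and WD; intrinsic LR = lam * eta = 5e-4
w = 3.0 * rng.normal(size=d)    # deliberately start with a large norm
for t in range(20001):
    if t % 2500 == 0:
        print(f"step {t:6d}  ||w||^2 = {w @ w:8.3f}  effective LR = {eta / (w @ w):.5f}")
    idx = rng.integers(0, n, size=8)                      # mini-batch of size 8 -> noisy gradients
    w = (1 - eta * lam) * w - eta * grad_batch(w, idx)    # SGD + weight decay
# The norm equilibrates after a few multiples of 1/(2*lam*eta) = 1000 steps, and the
# stationary effective LR is on the order of sqrt(2*lam*eta) divided by the gradient scale.
```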

  11. A Conjecture about Mixing Time in Function Space
  • In experiments, the SDE(SGD) mixes slowly in parameter space, but it mixes fast in function space, i.e., shortly after norm convergence, at t = O(1/(λη)).
  • Fast Equilibrium Conjecture: a network from any non-pathological initialization converges to the function-space equilibrium in O(1/(λη)) steps.
  • At equilibrium, one cannot distinguish two networks (from different initializations) via (further training +) evaluation on any input data, but it is possible to distinguish them by looking at their parameters (e.g., via Stochastic Weight Averaging).
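
What "equilibrium in function space but not in parameter space" means can be illustrated with a deliberately small sketch (toy 2-D data and MLPs of my own choosing, nothing like the paper's ResNet experiments): two networks trained with SGD + WD from different random initializations end up making almost the same predictions, yet their weight vectors remain far apart.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_data(n, seed):
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, 2, generator=g)
    y = (x[:, 0] * x[:, 1] > 0).long()          # XOR-like labelling rule
    return x, y

def train(init_seed):
    torch.manual_seed(init_seed)                # different seed -> different init and batch order
    net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
    x, y = make_data(2000, seed=0)              # same training data for both runs
    for _ in range(3000):
        idx = torch.randint(0, x.shape[0], (128,))
        opt.zero_grad()
        F.cross_entropy(net(x[idx]), y[idx]).backward()
        opt.step()
    return net

net_a, net_b = train(1), train(2)
x_test, _ = make_data(5000, seed=123)
with torch.no_grad():
    agree = (net_a(x_test).argmax(1) == net_b(x_test).argmax(1)).float().mean().item()
pa = torch.cat([p.flatten() for p in net_a.parameters()])
pb = torch.cat([p.flatten() for p in net_b.parameters()])
print(f"prediction agreement on test inputs: {agree:.3f}")           # typically close to 1
print(f"relative parameter distance ||a-b||/||a||: {((pa - pb).norm() / pa.norm()).item():.3f}")  # far from 0
```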

  12. What Happens in Real-Life Training -- an Interpretation
  Our theory explains:
  • Why does the test/train error rise again shortly after each LR decay?
    (Figure: training curves from the 'Wide ResNet' paper, Zagoruyko & Komodakis, 2017.)
    Reason: the effective LR is first divided by 10, but its stationary value only decreases by roughly √10.
  • How many LR decays are necessary to reach the best performance?
    Only the last one, if you don't mind training your network longer.
  • What is the benefit of an early large LR then?
    Faster convergence to the equilibrium (of the small intrinsic LR). See more interpretations in our paper!
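
A back-of-the-envelope version of the first bullet (a sketch using the heuristic stationary-norm relation for scale-invariant SGD + WD; the gradient scale G and all constants are my assumptions, not measured values): at equilibrium ‖w‖⁴ ≈ η·G²/(2λ), so the stationary effective LR is η/‖w‖² ≈ √(2λη)/G. Right after a decay the effective LR drops by the full factor of 10, but once the norm has had O(1/(λη)) steps to shrink, the new stationary effective LR is only √10 ≈ 3.2 times smaller.

```python
import math

lam, G = 5e-3, 0.5      # weight decay and a nominal gradient scale on the unit sphere (assumed)

def stationary_effective_lr(eta):
    w_sq = math.sqrt(eta * G**2 / (2 * lam))   # heuristic stationary ||w||^2
    return eta / w_sq                          # = sqrt(2*lam*eta) / G

before, after = stationary_effective_lr(0.1), stationary_effective_lr(0.01)
print(f"stationary effective LR at eta=0.1 : {before:.4f}")
print(f"stationary effective LR at eta=0.01: {after:.4f}")
print(f"ratio: {before / after:.2f}   (= sqrt(10), not 10)")
```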
