
Norm matters: efficient and accurate normalization schemes in deep networks - PowerPoint PPT Presentation



  1. Norm matters: efficient and accurate normalization schemes in deep networks. Elad Hoffer*, Ron Banner*, Itay Golan*, Daniel Soudry. Spotlight, NeurIPS 2018. *Equal contribution.

  2. Batch normalization. Shortcomings:
  • Assumes independence between samples (a problem when modeling time series, RL, GANs, metric learning, etc.)
  • Why does it work? Interaction with other regularization.
  • Significant computational and memory impact, with data-bound operations: up to 25% of computation time in current models (Gitman, 2017).
  • Requires high-precision operations (the sums $\sum_j x_j^2$), which are numerically unstable.
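
The numerical-stability point in the last bullet is easy to reproduce outside of batch-norm: the one-pass variance estimate E[x^2] - E[x]^2 is the difference of two large, nearly equal numbers. A minimal NumPy sketch (a generic illustration, not the authors' code):

```python
import numpy as np

# A generic illustration of why the one-pass variance formula
# Var(x) = E[x^2] - E[x]^2 needs high-precision accumulation:
# it suffers catastrophic cancellation when the mean dominates the spread.
x = (np.random.randn(10_000) * 0.1 + 1000).astype(np.float32)

one_pass = (x ** 2).mean() - x.mean() ** 2      # unstable in float32
two_pass = ((x - x.mean()) ** 2).mean()         # centred first, well-behaved

print(one_pass)   # far from the true value (~0.01), can even come out negative
print(two_pass)   # ~0.01
```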

  3. Batch-norm leads to norm invariance. The key observation:
  • Given an input $x$ and a weight vector $w$, let its direction be $\hat{w} = w / \|w\|$.
  • Batch-norm is norm invariant: $\mathrm{BN}(w^\top x) = \mathrm{BN}(\hat{w}^\top x)$.
  • The weight norm only affects the effective learning rate; e.g. in plain SGD, the update to the direction $\hat{w}$ scales as $\eta / \|w\|^2$.
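
The invariance is easy to check numerically. A minimal NumPy sketch (my own illustration, not code from the paper), applying batch-norm to the pre-activations of a single channel:

```python
import numpy as np

# Minimal sketch: batch-norm output depends only on the direction of w,
# not on its norm, so BN(w.x) == BN(w_hat.x) up to the epsilon term.
def batch_norm(z, eps=1e-5):
    return (z - z.mean(0)) / np.sqrt(z.var(0) + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 16))            # a batch of 128 inputs
w = rng.standard_normal(16)                   # weights of a single channel
w_hat = w / np.linalg.norm(w)                 # unit-norm direction

out_w = batch_norm(x @ w[:, None])            # pre-activations with w
out_w_hat = batch_norm(x @ w_hat[:, None])    # pre-activations with w / ||w||

print(np.allclose(out_w, out_w_hat, atol=1e-5))   # True
```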

  4. Weight decay before BN is redundant
  • Weight decay is equivalent to learning-rate scaling.
  • It can be mimicked by a learning-rate correction.
  [Figure: training curves with WD, without WD, and without WD + LR correction]
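
One way to read the second bullet: since the effective step on the direction scales as $\eta / \|w\|^2$, the shrinking effect of weight decay can be imitated by a norm-dependent learning rate. A sketch of that idea; the exact correction used in the paper may differ, and the $\|w_t\|^2 / \|w_0\|^2$ factor here is an assumption on my part:

```python
import torch

# Sketch of a norm-dependent learning-rate correction for a scale-invariant
# (batch-normalized) weight tensor. Assumption: scaling the step by
# ||w_t||^2 / ||w_0||^2 keeps the effective learning rate eta / ||w_t||^2
# roughly constant, which is the role weight decay plays before BN.
def corrected_sgd_step(w, grad, lr, w0_norm):
    scale = (w.norm() / w0_norm) ** 2           # norm-dependent correction
    with torch.no_grad():
        w -= lr * scale * grad
    return w
```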

  5. Improving weight-norm. This can help make weight-norm work for large-scale models.
  Weight normalization, for a channel $i$: $w_i = g_i \, \frac{v_i}{\|v_i\|}$
  Bounded weight normalization: $w_i = \rho \, \frac{v_i}{\|v_i\|}$, where $\rho$ is a constant determined from the chosen initialization.
  [Figure: ResNet-50, ImageNet]
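
A minimal sketch of the bounded variant for a convolutional layer (module and variable names are mine; this illustrates the formula above, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Bounded weight normalization: w_i = rho_i * v_i / ||v_i||, where rho_i is
# frozen at its value after initialization instead of being a learned gain.
class BoundedWeightNormConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.v = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.05)
        # per-channel norm at initialization, kept constant during training
        self.register_buffer("rho", self.v.detach().flatten(1).norm(dim=1))
        self.padding = kernel_size // 2

    def forward(self, x):
        norm = self.v.flatten(1).norm(dim=1).clamp_min(1e-12)
        w = self.v * (self.rho / norm).view(-1, 1, 1, 1)
        return F.conv2d(x, w, padding=self.padding)
```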

  6. Replacing batch-norm: switching norms
  • Batch normalization is just a scaled $L_2$ normalization: $\hat{x}_j = \frac{x_j - \mu}{\frac{1}{\sqrt{n}} \|x - \mu\|_2}$
  • More numerically stable norms: $\|x\|_1 = \sum_j |x_j|$, $\|x\|_\infty = \max_j |x_j|$
  We use additional scaling constants so that the norm behaves similarly to $L_2$, by assuming the neural input is Gaussian; e.g. for $L_1$: $\sqrt{\tfrac{\pi}{2}} \cdot \tfrac{1}{n}\|x - \mu\|_1 \approx \tfrac{1}{\sqrt{n}}\|x - \mu\|_2$
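
A minimal sketch of the $L_1$ variant for a 2D (batch, features) input, using the $\sqrt{\pi/2}$ Gaussian correction described above (my own illustration, not the paper's code):

```python
import math
import torch

# L1 batch normalization: replace the L2-based standard deviation with the
# mean absolute deviation, rescaled by sqrt(pi/2) so that it matches the
# L2 statistic when the inputs are Gaussian.
def l1_batch_norm(x, eps=1e-5):
    mu = x.mean(dim=0, keepdim=True)
    centered = x - mu
    scale = math.sqrt(math.pi / 2) * centered.abs().mean(dim=0, keepdim=True)
    return centered / (scale + eps)
```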

  7. $L_1$ batch-norm (ImageNet, ResNet)
  [Figure: $L_1$ batch-norm results on ImageNet with ResNet]

  8. Low-precision batch-norm
  • $L_1$ batch-norm alleviates the low-precision difficulties of batch-norm.
  • We can now train ResNet-50 with batch-norm in FP16 without issues:
  [Figure: regular BN in FP16 fails; $L_1$ BN in FP16 works as well as $L_2$ BN in FP32]
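
The reason the $L_1$ statistic survives FP16 is simply that it never squares the activations. A quick sketch of that point (a generic illustration; accumulation is done in float32 for both statistics, so the difference comes only from the per-element values):

```python
import math
import numpy as np

# The L2 statistic squares activations, which can leave the float16 range
# (max ~65504); the L1 statistic only needs |x|, which stays representable.
x = (np.random.randn(4096) * 300).astype(np.float16)   # large activations

squares = x * x                 # float16: overflows to inf for |x| > ~256
absolutes = np.abs(x)           # float16: always finite here

print(np.isinf(squares).any())                            # True
print(squares.mean(dtype=np.float32))                     # inf: L2 statistic is lost
print(math.sqrt(math.pi / 2) * absolutes.mean(dtype=np.float32))  # ~300, a usable sigma estimate
```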

  9. With a few more tricks...
  • We can now train ResNet-18 on ImageNet with the bottleneck operations in INT8:
  [Figure: 8-bit vs. full-precision training curves]
  Also at NeurIPS 2018: "Scalable Methods for 8-bit Training of Neural Networks", *Ron Banner, *Itay Hubara, *Elad Hoffer, Daniel Soudry.
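
For context only, a generic sketch of symmetric per-tensor int8 quantization, to make concrete what running an operation in int8 involves; this is not the specific scheme from the 8-bit training paper referenced above:

```python
import torch

# Symmetric per-tensor int8 quantization: map the range [-max|x|, max|x|]
# onto [-127, 127] with a single scale factor.
def quantize_int8(x):
    scale = x.abs().max().clamp_min(1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.to(torch.float32) * scale

x = torch.randn(64, 64)
q, scale = quantize_int8(x)
print((dequantize_int8(q, scale) - x).abs().max())   # small quantization error
```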

  10. Thank you for your time! Come visit us at poster #27.
