SLIDE 1

Norm matters: efficient and accurate normalization schemes in deep networks


Elad Hoffer*, Ron Banner*, Itay Golan*, Daniel Soudry

Spotlight, NeurIPS 2018

*Equal contribution

Norm Matters - Poster #27

SLIDE 2

Batch normalization

Shortcomings:

  • Assumes independence between samples (a problem when modeling time-series, RL, GANs, metric learning, etc.)
  • Unclear why it works; interacts with other regularization
  • Significant computational and memory impact, with data-bound operations – up to 25% of computation time in current models (Gitman, '17)
  • Requires high-precision operations ($\sum_j y_j^2$) and is numerically unstable (a sketch of the computation follows)
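For reference, a minimal NumPy sketch of the standard batch-norm computation (ignoring the learned scale/shift and running statistics, which the slide does not discuss); the variance is exactly the data-bound sum-of-squares reduction the last bullet refers to:

```python
import numpy as np

def batch_norm_l2(x, eps=1e-5):
    # standard (L2) batch norm over the batch axis: (x - mu) / sqrt(var + eps)
    mu = x.mean(axis=0)
    var = ((x - mu) ** 2).mean(axis=0)      # sum-of-squares reduction over the whole batch
    return (x - mu) / np.sqrt(var + eps)

x = np.random.randn(256, 64).astype(np.float32)
y = batch_norm_l2(x)
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])   # per-feature mean ~0, std ~1
```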


SLIDE 3

Batch-norm leads to norm invariance

The key observation:

  • Given an input $x$ and a weight vector $w$ with direction $\hat{w} = w / \|w\|_2$
  • Batch-norm is norm invariant: $BN(\|w\|\,\hat{w}^{\top}x) = BN(\hat{w}^{\top}x)$
  • The weight norm only affects the effective learning rate; e.g. in SGD the effective step size for the direction $\hat{w}$ scales as $\eta / \|w\|_2^2$ (a numerical check follows)
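A quick numerical check of the invariance claim (a minimal NumPy sketch, not the paper's code): rescaling the weights that feed a batch-norm layer leaves the normalized output unchanged, so only the direction $\hat{w}$ matters and the norm enters only through the effective step size.

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 32))    # a batch of inputs
w = rng.normal(size=(32, 16))      # weights of a linear layer feeding batch-norm

y1 = batch_norm(x @ w)
y2 = batch_norm(x @ (5.0 * w))     # rescale the weights by an arbitrary positive factor
print(np.max(np.abs(y1 - y2)))     # ~1e-6: the normalized output ignores ||w||
```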


SLIDE 4

Weight decay before BN is redundant

  • Weight decay is equivalent to learning-rate scaling
  • It can be mimicked by an appropriate learning-rate correction without any weight decay (a sketch of why follows)
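Why this works, as a minimal NumPy sketch (an illustration of the scale-invariance argument above, not the paper's experiment): once a weight matrix feeds a batch-norm layer, the loss no longer depends on its norm, so the gradient is orthogonal to $w$; weight decay's only remaining effect is to shrink $\|w\|$, which just rescales the effective learning rate $\eta / \|w\|_2^2$ and can therefore be reproduced by correcting the learning rate instead.

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def loss(w, x, t):
    # any loss on batch-normalized outputs is (nearly) invariant to the scale of w
    return np.mean((batch_norm(x @ w) - t) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(512, 16))
t = rng.normal(size=(512, 4))
w = rng.normal(size=(16, 4))

# finite-difference gradient of the loss w.r.t. w
g, h = np.zeros_like(w), 1e-5
for idx in np.ndindex(*w.shape):
    wp, wm = w.copy(), w.copy()
    wp[idx] += h
    wm[idx] -= h
    g[idx] = (loss(wp, x, t) - loss(wm, x, t)) / (2 * h)

# cosine between the gradient and w is ~0: plain gradient steps do not change ||w||
# to first order, so shrinking the norm (weight decay) only rescales the effective LR
print(np.sum(g * w) / (np.linalg.norm(g) * np.linalg.norm(w)))
```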


[Figure: training curves comparing with WD, without WD, and without WD + LR correction]

SLIDE 5

Improving weight-norm

This can help to make weight-norm work for large-scale models


[Figure: ResNet-50, ImageNet]

Weight normalization, for a channel $j$:

$\hat{w}_j = g_j \, \dfrac{w_j}{\|w_j\|_2}$

Bounded Weight Normalization:

$\hat{w}_j = \rho \, \dfrac{w_j}{\|w_j\|_2}$

$\rho$ – a constant determined from the chosen initialization (a sketch follows)
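A minimal NumPy sketch of the bounded variant (an illustration, not the authors' code; the slide only says $\rho$ is "a constant determined from the chosen initialization", so taking the mean initial channel norm below is an assumption):

```python
import numpy as np

def bounded_weight_norm(w, rho):
    # rescale each output channel (first axis) of w to have L2 norm exactly rho
    flat = w.reshape(w.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1)
    return w * (rho / norms).reshape(-1, *([1] * (w.ndim - 1)))

# hypothetical conv weights (out_channels, in_channels, kh, kw)
w = np.random.randn(64, 3, 3, 3).astype(np.float32) * 0.1
# rho fixed from the initialization -- using the mean initial per-channel norm here
rho = np.linalg.norm(w.reshape(64, -1), axis=1).mean()
w_hat = bounded_weight_norm(w, rho)
print(np.linalg.norm(w_hat.reshape(64, -1), axis=1)[:4])   # every channel now has norm rho
```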


SLIDE 6

Replacing Batch-norm – switching norms

  • Batch-normalization is just scaled $L_2$ normalization
  • More numerically stable norms: $\|y\|_1 = \sum_j |y_j|$ and $\|y\|_\infty = \max_j |y_j|$

We use additional scaling constants so that the norm behaves similarly to $L_2$, assuming the neural input is Gaussian, e.g.:


$\hat{y}_j = \dfrac{y_j - \bar{y}}{\frac{1}{\sqrt{n}}\,\|y - \bar{y}\|_2}$, where for Gaussian $y$: $\;\dfrac{1}{\sqrt{n}}\,\|y - \bar{y}\|_2 \approx \sqrt{\dfrac{\pi}{2}} \cdot \dfrac{1}{n}\,\|y - \bar{y}\|_1$
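A minimal NumPy sketch of the resulting $L_1$ batch norm (ignoring the affine parameters and running statistics; not the authors' implementation):

```python
import numpy as np

C_L1 = np.sqrt(np.pi / 2.0)   # scaling constant: the L1 statistic then matches the std for Gaussian inputs

def batch_norm_l1(y, eps=1e-5):
    mu = y.mean(axis=0)
    scale = C_L1 * np.abs(y - mu).mean(axis=0)   # ~ std(y) when y is Gaussian, no squaring needed
    return (y - mu) / (scale + eps)

y = np.random.randn(4096, 128).astype(np.float32)
print(batch_norm_l1(y).std(axis=0)[:3])   # ~1, behaving like standard (L2) batch norm
```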


SLIDE 7

$L_1$ Batch-norm (ImageNet, ResNet)


SLIDE 8

Low precision batch-norm

  • $L_1$ batch-norm alleviates the low-precision difficulties of batch-norm
  • We can now train ResNet-50 with batch-norm in FP16 without issues (illustrated below):


[Figure: regular ($L_2$) BN in FP16 fails; $L_1$ BN in FP16 works as well as $L_2$ BN in FP32]
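A minimal illustration of the precision issue (a NumPy sketch of the mechanism, not a result from the paper): squaring an activation for the $L_2$/variance statistic doubles the exponent range it needs, so moderately large FP16 values overflow, while the $|y|$ terms used by the $L_1$ statistic stay representable.

```python
import numpy as np

# FP16 overflows above ~65504
y = np.float16(300.0)     # a moderately large activation, fine in FP16
print(np.abs(y))          # 300.0 -> the |y| term for the L1 statistic is representable
print(y * y)              # inf   -> the y^2 term for the variance / L2 statistic overflows
```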

SLIDE 9

With a few more tricks…

  • We can now train ResNet-18 on ImageNet with the bottleneck operations in INT8:


“Scalable Methods for 8-bit Training of Neural Networks”, Ron Banner*, Itay Hubara*, Elad Hoffer*, Daniel Soudry (*equal contribution), also at NeurIPS 2018


[Figure: training curves comparing 8-bit and full-precision training]

SLIDE 10

Thank you for your time! Come visit us at poster #27
