Understanding Normalization in Deep Learning. Speaker: Wenqi Shao (PowerPoint PPT Presentation)



SLIDE 1

Understanding Normalization in Deep Learning

Speaker: Wenqi Shao Email: Weqish@link.cuhk.edu.hk

SLIDE 2

Outline

➢ Introduction
➢ Various Normalizers: IN, BN, LN, SN, SSN
➢ A Unified Representation: Meta Norm (MN); Back-propagation & Geometric Interpretation
➢ Why Batch Normalization? Optimization & Generalization
➢ Normalization in Various Computer Vision Tasks

SLIDE 3

Introduction

⚫ Normalization is a well-known technique in deep learning.
⚫ The first normalization method: Batch Normalization (BN). BN achieves the same accuracy with 14 times fewer training steps [1].
⚫ Normalization improves both the optimization and the generalization of a DNN.
⚫ Various normalizers suit different tasks and network architectures:
— Batch Normalization (BN): image classification [1]
— Instance Normalization (IN): image style transfer [2]
— Layer Normalization (LN): recurrent neural networks (RNNs) [3]
— Group Normalization (GN): robust to batch size; image classification, object detection [4]

Normalization methods have been a foundation of various state-of-the-art computer vision tasks

SLIDE 4

Introduction

A very common building block: Convolution, Normalization, ReLU, Convolution, Normalization, ...

⚫ The object of a normalization method is a 4-D tensor $h \in \mathbb{R}^{N \times C \times H \times W}$ with entries $h_{ncij}$:
N: minibatch size (the number of samples); C: number of channels; H: height of a channel; W: width of a channel.
⚫ Normalizers work by standardizing the activations within a specific scope (input to output).
⚫ Two statistics: the mean $\mu$ and the variance $\sigma^2$. Two learnable parameters: the scale parameter $\gamma$ and the shift parameter $\beta$.
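As a minimal sketch of this standardize-then-affine step (NumPy; the scope axes are passed in explicitly, and all names here are illustrative, not from the slides):

```python
import numpy as np

def standardize(h, axes, gamma, beta, eps=1e-5):
    """Standardize h over the given scope (axes), then apply the
    learnable scale (gamma) and shift (beta)."""
    mu = h.mean(axis=axes, keepdims=True)        # mean within the scope
    var = h.var(axis=axes, keepdims=True)        # variance within the scope
    h_hat = (h - mu) / np.sqrt(var + eps)        # standardized activations
    return gamma * h_hat + beta                  # learnable affine transform

# a 4-D feature map: N samples, C channels, H x W spatial grid
h = np.random.randn(4, 3, 8, 8)
gamma = np.ones((1, 3, 1, 1))                    # one scale per channel
beta = np.zeros((1, 3, 1, 1))                    # one shift per channel
out = standardize(h, axes=(0, 2, 3), gamma=gamma, beta=beta)  # BN-style scope
```

Changing `axes` changes the normalizer: `(2, 3)` gives IN-style statistics, `(1, 2, 3)` LN-style.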

SLIDE 5

Various Normalizers: IN, BN, LN and GN

Computing the mean $\mu$ and variance $\sigma^2$ over different scopes produces different normalizers. Given a feature map $h_{ncij} \in \mathbb{R}^{N \times C \times H \times W}$:

IN: $\mu_{IN} = \frac{1}{HW}\sum_{i,j=1}^{H,W} h_{ncij}$, $\quad \sigma^2_{IN} = \frac{1}{HW}\sum_{i,j=1}^{H,W} (h_{ncij} - \mu_{IN})^2$ (one pair of statistics per sample and channel)

BN: $\mu_{BN} = \frac{1}{NHW}\sum_{n,i,j=1}^{N,H,W} h_{ncij}$, $\quad \sigma^2_{BN} = \frac{1}{NHW}\sum_{n,i,j=1}^{N,H,W} (h_{ncij} - \mu_{BN})^2$ (per channel)

LN: $\mu_{LN} = \frac{1}{CHW}\sum_{c,i,j=1}^{C,H,W} h_{ncij}$, $\quad \sigma^2_{LN} = \frac{1}{CHW}\sum_{c,i,j=1}^{C,H,W} (h_{ncij} - \mu_{LN})^2$ (per sample)

GN: $\mu^{g}_{GN} = \frac{1}{C_g HW}\sum_{c,i,j=1}^{C_g,H,W} h_{ncij}$, $\quad \sigma^{g\,2}_{GN} = \frac{1}{C_g HW}\sum_{c,i,j=1}^{C_g,H,W} (h_{ncij} - \mu^{g}_{GN})^2$, where $C_g$ is the number of channels per group.

GN divides the channels into groups and computes within each group the mean and variance for normalization.
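The four normalizers differ only in the axes reduced over; a NumPy sketch (function names are ours, not from the slides):

```python
import numpy as np

def moments(h, axes):
    """Mean and variance of h over the given axes."""
    return h.mean(axis=axes), h.var(axis=axes)

h = np.random.randn(4, 6, 8, 8)           # N=4, C=6, H=W=8

mu_in, var_in = moments(h, (2, 3))        # IN: per (n, c), shape (N, C)
mu_bn, var_bn = moments(h, (0, 2, 3))     # BN: per channel, shape (C,)
mu_ln, var_ln = moments(h, (1, 2, 3))     # LN: per sample, shape (N,)

def gn_moments(h, groups):
    """GN: split C into groups, reduce within (channels-in-group, H, W)."""
    n, c, hh, w = h.shape
    g = h.reshape(n, groups, c // groups, hh, w)
    return g.mean(axis=(2, 3, 4)), g.var(axis=(2, 3, 4))  # shape (N, groups)

mu_gn, var_gn = gn_moments(h, groups=2)
```

GN interpolates between the extremes: one group recovers LN, and one channel per group recovers IN.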

SLIDE 6

Various Normalizers: SN and SSN

The above-mentioned normalization methods use the same normalizer in every normalization layer. Switchable Normalization (SN) is able to learn a different normalizer for each normalization layer [5]:

$\mu_{SN} = p_1 \mu_{IN} + p_2 \mu_{BN} + p_3 \mu_{LN}, \qquad \sigma^2_{SN} = p_1 \sigma^2_{IN} + p_2 \sigma^2_{BN} + p_3 \sigma^2_{LN}$

where $(p_1, p_2, p_3) = \mathrm{softmax}(\lambda_1, \lambda_2, \lambda_3)$ and $\lambda_1, \lambda_2, \lambda_3$ are learnable parameters. The $\lambda_1, \lambda_2, \lambda_3$ learned by SGD in different layers can be different.
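A sketch of the SN statistics (NumPy; the control parameters λ are learnable in the real method, here just fixed numbers for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sn_statistics(h, lam):
    """Switchable Normalization statistics: a softmax-weighted mixture of
    the IN, BN and LN means/variances, weighted by learnable lam."""
    p = softmax(lam)                              # p1 + p2 + p3 = 1
    mu = [h.mean((2, 3), keepdims=True),          # IN mean, (N, C, 1, 1)
          h.mean((0, 2, 3), keepdims=True),       # BN mean, (1, C, 1, 1)
          h.mean((1, 2, 3), keepdims=True)]       # LN mean, (N, 1, 1, 1)
    var = [h.var((2, 3), keepdims=True),
           h.var((0, 2, 3), keepdims=True),
           h.var((1, 2, 3), keepdims=True)]
    mu_sn = p[0] * mu[0] + p[1] * mu[1] + p[2] * mu[2]
    var_sn = p[0] * var[0] + p[1] * var[1] + p[2] * var[2]
    return mu_sn, var_sn

h = np.random.randn(4, 6, 8, 8)
mu_sn, var_sn = sn_statistics(h, lam=np.array([0.5, 1.0, -0.3]))
```

When one λ dominates, the mixture collapses toward a single normalizer, which is exactly the behavior SSN later enforces.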

SLIDE 7

Various Normalizers: SN and SSN

However, SN suffers from overfitting and redundant computation:
— overfitting: $\lambda_1, \lambda_2, \lambda_3$ are optimized without any constraint;
— redundant computation: all of the IN, BN and LN statistics must be computed at the inference stage.

Sparse Switchable Normalization (SSN) learns exactly one normalizer for each normalization layer [6]. Statistics in SSN:

$\mu_{SSN} = p_1 \mu_{IN} + p_2 \mu_{BN} + p_3 \mu_{LN}, \qquad \sigma^2_{SSN} = p_1 \sigma^2_{IN} + p_2 \sigma^2_{BN} + p_3 \sigma^2_{LN}$

such that $p_1 + p_2 + p_3 = 1$ and $p_i \in \{0, 1\}$. SSN is achieved by a novel transformation, SparsestMax, which substitutes for the softmax in SN.
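The following sketch illustrates only the constraint SSN enforces (weights summing to 1 with each entry in {0, 1}), not the SparsestMax operator itself, which is defined in [6]:

```python
import numpy as np

def one_hot_select(lam):
    """Illustration of the SSN constraint (not SparsestMax): the mixture
    weights sum to 1 with each p_i in {0, 1}, i.e. exactly one
    normalizer survives in each layer."""
    p = np.zeros_like(lam)
    p[np.argmax(lam)] = 1.0
    return p

p = one_hot_select(np.array([0.2, 1.3, -0.5]))
```

With one-hot weights, only the selected normalizer's statistics need to be computed at inference, removing SN's redundant computation.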

SLIDE 8

A Unified Representation: Meta Normalization [7]

• Question: is there a universal normalization that includes IN, BN, LN, etc.?

To answer this question, consider the relation between $\mu_{IN}$ and $\mu_{BN}$, $\mu_{LN}$. Arrange the IN means as a matrix

$\mu_{IN} = \begin{pmatrix} \mu_{11} & \cdots & \mu_{1C} \\ \vdots & \ddots & \vdots \\ \mu_{N1} & \cdots & \mu_{NC} \end{pmatrix} \in \mathbb{R}^{N \times C}, \quad \mu_{BN} \in \mathbb{R}^{C}, \quad \mu_{LN} \in \mathbb{R}^{N}.$

Averaging over each column gives the BN means; averaging over each row gives the LN means:

$\mu_{BN} = \frac{1}{N}\,\mathbf{1}_{N \times N}\, \mu_{IN}$ (duplicated over the N rows), $\qquad \mu_{LN} = \frac{1}{C}\, \mu_{IN}\, \mathbf{1}_{C \times C}$ (duplicated over the C columns),

where $\mathbf{1}$ denotes the all-ones matrix.
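These two identities are easy to check numerically (NumPy sketch; shapes are illustrative):

```python
import numpy as np

N, C, H, W = 4, 6, 8, 8
h = np.random.randn(N, C, H, W)
mu_in = h.mean(axis=(2, 3))              # (N, C) matrix of IN means

ones_N = np.ones((N, N))
ones_C = np.ones((C, C))

# column averages, duplicated over the N rows, give the BN means ...
mu_bn_dup = (1.0 / N) * ones_N @ mu_in   # every row equals mu_BN in R^C
# ... and row averages, duplicated over the C columns, give the LN means
mu_ln_dup = (1.0 / C) * mu_in @ ones_C   # every column equals mu_LN in R^N
```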

SLIDE 9

A Unified Representation: Meta Normalization

• MN. We can design a universal normalization by constructing binary matrices U and V as follows:

$\mu_{MN} = \frac{1}{z_U}\, U\, \mu_{IN}\, \frac{1}{z_V}\, V, \qquad \sigma_{MN} = \frac{1}{z_U}\, U\, \sigma_{IN}\, \frac{1}{z_V}\, V,$

where $z_U$ and $z_V$ are normalizing factors, and $U \in \mathbb{R}^{N \times N}$ and $V \in \mathbb{R}^{C \times C}$ are two binary matrices whose elements are either 0 or 1.

Representation capacity. In MN, U aggregates the statistics over the batch of samples, while V aggregates those over the channels. Therefore, different U and V represent different normalization approaches:

◆ Let $U = I$ and $V = I$: MN represents IN.
◆ Let $U = \frac{1}{N}\mathbf{1}_{N \times N}$ and $V = I$: MN turns into BN.
◆ Let $U = I$ and $V = \frac{1}{C}\mathbf{1}_{C \times C}$: MN represents LN.
◆ Let $U = I$ and $V = \frac{2}{C}\begin{pmatrix}\mathbf{1} & \\ & \mathbf{1}\end{pmatrix}$ (block-diagonal): MN represents GN with a group number of 2.
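The four special cases can be checked numerically (NumPy sketch; we keep the normalizing factors z_U, z_V explicit rather than folding them into U and V):

```python
import numpy as np

N, C, H, W = 4, 6, 8, 8
h = np.random.randn(N, C, H, W)
mu_in = h.mean(axis=(2, 3))                            # (N, C)

def mn_mean(U, V, zU, zV):
    """Meta Normalization mean: (1/zU) U mu_IN (1/zV) V."""
    return (U @ mu_in @ V) / (zU * zV)

I_N, I_C = np.eye(N), np.eye(C)
ones_N, ones_C = np.ones((N, N)), np.ones((C, C))
block2 = np.kron(np.eye(2), np.ones((C // 2, C // 2)))  # 2 diagonal blocks of ones

mu1 = mn_mean(I_N, I_C, 1, 1)            # IN
mu2 = mn_mean(ones_N, I_C, N, 1)         # BN (duplicated over rows)
mu3 = mn_mean(I_N, ones_C, 1, C)         # LN (duplicated over columns)
mu4 = mn_mean(I_N, block2, 1, C // 2)    # GN with 2 groups (duplicated in-group)
```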

SLIDE 10

Back-propagation of MN

• MN. Let $\hat h_{ncij}$ be the neuron after normalization; it is then transformed to $\bar h_{ncij} = \gamma \hat h_{ncij} + \beta$.

Back-propagation. What we care most about is back-propagating the gradient of the output, $\frac{\partial L}{\partial \bar h_{ncij}}$, to the gradient of the input, $\frac{\partial L}{\partial h_{ncij}}$.

SLIDE 11

Back-propagation of MN

Back-propagation. Define the gradients with respect to the normalized and the input activations as

$\frac{\partial L}{\partial \hat h_{ncij}} \triangleq \tilde d_{ncij}, \qquad \frac{\partial L}{\partial h_{ncij}} \triangleq d_{ncij}.$

Geometric view of BN: let $U = \frac{1}{N}\mathbf{1}_{N \times N}$ and $V = I$.
Geometric view of LN: let $U = I$ and $V = \frac{1}{C}\mathbf{1}_{C \times C}$.
Geometric view of GN with group number G: let $U = I$ and $V = \frac{G}{C}\,\mathrm{blkdiag}(\mathbf{1}, \ldots, \mathbf{1})$ with G diagonal sub-matrices of ones. Writing $\hat h_{n(c_g)}$ for the vectorized activations of sample $n$ in group $c_g$ (a vector of length $C_g HW$), the back-propagated gradient is

$d_{n(c_g)} = \frac{1}{\sigma_{n(c_g)}} \left( I - \frac{\mathbf{1}\mathbf{1}^{\mathsf T} + \hat h_{n(c_g)} \hat h_{n(c_g)}^{\mathsf T}}{C_g HW} \right) \tilde d_{n(c_g)},$

and the BN and LN cases follow by replacing the scope size $C_g HW$ with $NHW$ and $CHW$ respectively.
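For a single scope of M activations, standardization has the well-known backward pass $d = \frac{1}{\sigma}\big(I - (\mathbf{1}\mathbf{1}^{\mathsf T} + \hat h \hat h^{\mathsf T})/M\big)\tilde d$, matching the formula above; a numeric check against finite differences (NumPy; the loss is an arbitrary stand-in):

```python
import numpy as np

def standardize(x):
    """Standardize a vector over its full scope of M elements."""
    return (x - x.mean()) / np.sqrt(x.var())

def backward(x_hat, sigma, d_tilde):
    """d = (1/sigma) (I - (1 1^T + x_hat x_hat^T) / M) d_tilde."""
    M = x_hat.size
    P = (np.ones((M, M)) + np.outer(x_hat, x_hat)) / M
    return (np.eye(M) - P) @ d_tilde / sigma

rng = np.random.default_rng(0)
x = rng.normal(size=7)
loss = lambda v: np.sum(np.sin(standardize(v)))  # any loss of the normalized output

x_hat = standardize(x)
d_tilde = np.cos(x_hat)                 # dL/d x_hat for this particular loss
d = backward(x_hat, np.sqrt(x.var()), d_tilde)

# compare against central finite differences through the whole forward pass
num = np.zeros_like(x)
step = 1e-5
for k in range(x.size):
    e = np.zeros_like(x)
    e[k] = step
    num[k] = (loss(x + e) - loss(x - e)) / (2 * step)
```

Note the projection-like structure of the subtracted matrix, which the next slide interprets geometrically.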

SLIDE 12

Geometric Interpretation

Projection matrix. Given a matrix A, the projection matrix onto its column space is $P = A (A^{\mathsf T} A)^{-1} A^{\mathsf T}$.

If the columns of A form a basis for some subspace W, then $(I - P)$ is the projection matrix for the orthogonal complement of W. Given a vector y, $Py$ lies in the subspace W and $(I - P)y$ lies in the orthogonal complement of W.

Take BN as an example. Let $A = [\mathbf{1}, \hat h_c]$, where $\hat h_c$ collects the normalized activations of channel c. Since $\hat h_c$ has zero mean and unit variance,

$A^{\mathsf T} A = \begin{pmatrix} \mathbf{1}^{\mathsf T}\mathbf{1} & \mathbf{1}^{\mathsf T}\hat h_c \\ \hat h_c^{\mathsf T}\mathbf{1} & \hat h_c^{\mathsf T}\hat h_c \end{pmatrix} = NHW \cdot I.$

Therefore, the projection matrix corresponding to A is exactly $P = \frac{\mathbf{1}\mathbf{1}^{\mathsf T} + \hat h_c \hat h_c^{\mathsf T}}{NHW}$: the back-propagated gradient $d = \frac{1}{\sigma}(I - P)\tilde d$ removes the components of $\tilde d$ lying in the subspace spanned by $\mathbf{1}$ and $\hat h_c$.
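The BN example can be verified numerically: with $A = [\mathbf{1}, \hat h]$ built from a zero-mean, unit-variance vector, $A^{\mathsf T} A$ is $M \cdot I$ and the resulting projector is $(\mathbf{1}\mathbf{1}^{\mathsf T} + \hat h \hat h^{\mathsf T})/M$ (NumPy sketch; M stands in for NHW):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 32                                   # stands in for N*H*W of one channel
x = rng.normal(size=M)
x_hat = (x - x.mean()) / x.std()         # zero mean, unit variance

A = np.column_stack([np.ones(M), x_hat])
AtA = A.T @ A                            # ~ M * I_2: 1^T x_hat = 0, x_hat^T x_hat = M
P = A @ np.linalg.inv(AtA) @ A.T         # projection onto span{1, x_hat}
```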

SLIDE 13

Why Batch Normalization?

BN has been an indispensable component in various network architectures. The effectiveness of BN has been uncovered from two aspects: optimization and generalization. A more fundamental impact of BatchNorm on the training process is that it makes the optimization landscape significantly smoother [8]. (Figure: the variation, shaded region, in the loss; the ℓ2 changes in the gradient as we move in the gradient direction; the maximum difference, ℓ2 norm, in the gradient over the distance moved in that direction.)

SLIDE 14

Lipschitzness of the Loss

BN causes the landscape to be more well-behaved, inducing favorable properties of Lipschitz continuity. Let's first consider the optimization landscape with respect to the activations.

(Annotations on the slide's equation: the gradient magnitude captures the Lipschitzness of the loss; one factor is empirically less than 1; one subtracted term grows quadratically in the dimension; one quantity is bounded away from zero.)

SLIDE 15

Lipschitzness of the Loss

Let's now turn to the optimization landscape with respect to the weights.

SLIDE 16

Regularization in BN

Batch normalization implicitly discourages reliance on any single channel, suggesting an alternative regularization mechanism by which batch normalization may encourage good generalization performance. BN makes channels equal so that they play a homogeneous role in representing the prediction function. How can this conclusion be verified empirically? [9] measures robustness to the cumulative ablation of channels: networks trained with batch normalization are more robust to these ablations than those trained without it.

SLIDE 17

Regularization in BN

We explore an explicit regularization expression in BN by analyzing a building block of a deep network. BN induces Gaussian priors for the batch mean $\mu_B$ and the batch standard deviation $\sigma_B$ [10]. These priors tell us that $\mu_B$ and $\sigma_B$ produce Gaussian noise; taking the expectation over this noise yields an explicit regularization expression for BN [11]:
◆ the regularization strength ζ is inversely proportional to the batch size M;
◆ $\mu_B$ and $\sigma_B$ produce two different regularization strengths;
◆ $\mu_B$ penalizes the expectation of the activation, implying that a neuron with larger output may be exposed to larger regularization.
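The inverse dependence on batch size is visible already for the batch mean alone: across random minibatches its variance is σ²/M, so quadrupling M quarters the noise injected by BN. A quick simulation (NumPy; the distribution and batch sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_mean_variance(M, trials=10_000):
    """Variance of the minibatch mean across random batches of size M,
    drawn from a population with sigma = 3 (true value: 9 / M)."""
    batches = rng.normal(loc=2.0, scale=3.0, size=(trials, M))
    return batches.mean(axis=1).var()

v16 = batch_mean_variance(16)    # ~ 9 / 16
v64 = batch_mean_variance(64)    # ~ 9 / 64
ratio = v16 / v64                # ~ 4: quadrupling M quarters the noise
```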

SLIDE 18

Normalization in Various Computer Vision Tasks

Image Classification Object Detection Semantic Segmentation

SLIDE 19

Normalization in Various Computer Vision Tasks

Image Classification Object Detection Semantic Segmentation

SLIDE 20

References

  • 1. Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift.
  • 2. Ulyanov D, Vedaldi A, Lempitsky V (2017) Instance normalization: The missing ingredient for fast stylization.
  • 3. Ba JL, Kiros JR, Hinton GE (2016) Layer normalization.
  • 4. Wu Y, He K (2018) Group normalization.
  • 5. Luo P, Ren J, Peng Z (2018) Differentiable learning-to-normalize via switchable normalization.
  • 6. Shao W, Meng T, Li J (2019) Learning sparse switchable normalization via SparsestMax.
  • 7. Luo P (2019) Differentiable learning to learn to normalize. (To appear.)
  • 8. Santurkar S, Tsipras D, Ilyas A, Madry A (2018) How does batch normalization help optimization?
  • 9. Morcos AS, Barrett DGT, Rabinowitz NC, Botvinick M (2019) On the importance of single directions for generalization.
  • 10. Teye M, Azizpour H, Smith K (2018) Bayesian uncertainty estimation for batch normalized deep networks.
  • 11. Luo P, Wang X, Shao W, Peng Z (2018) Understanding regularization in batch normalization.
SLIDE 21

Thanks!