Understanding Normalization in Deep Learning
Speaker: Wenqi Shao Email: Weqish@link.cuhk.edu.hk
Outline
➢ Introduction
➢ Various Normalizers: IN, BN, LN, SN, SSN
➢ A Unified Representation: Meta Norm (MN) - Back-propagation & Geometric Interpretation
➢ Why Batch Normalization? Optimization & Generalization
➢ Normalization in Various Computer Vision Tasks
Introduction
⚫ Normalization is a well-known technique in deep learning.
⚫ The first normalization method was Batch Normalization (BN); BN achieves the same accuracy with 14 times fewer training steps.
⚫ Normalization improves both the optimization and the generalization of a DNN.
⚫ Various normalizers suit different tasks and network architectures:
— Batch Normalization (BN): image classification [1]
— Instance Normalization (IN): image style transfer [2]
— Layer Normalization (LN): recurrent neural networks (RNNs) [3]
— Group Normalization (GN): robust to batch size; image classification, object detection [4]
Normalization methods have become a foundation of various state-of-the-art computer vision tasks.
[Diagram: Convolution → Normalization → ReLU → Convolution → Normalization, applied to a feature map h = {h_{ncij}} ∈ ℝ^{N×C×H×W}.]
Introduction
⚫ Object of a normalization method — a 4-D feature tensor h ∈ ℝ^{N×C×H×W}
N - minibatch size (the number of samples)
C - number of channels
H - height of a channel
W - width of a channel
⚫ A very common building block — Conv + Norm + ReLU
⚫ Normalizers work by standardizing the activations within a specific scope:
ĥ_{ncij} = γ · (h_{ncij} − μ) / √(σ² + ε) + β
⚫ Two statistics: mean μ and variance σ²
⚫ Two learnable parameters: scale parameter γ and shift parameter β
For BN, the statistics are computed over the batch and spatial dimensions, one pair per channel:

μ^{BN}_{c} = (1/NHW) Σ_{n,i,j=1}^{N,H,W} h_{ncij},   (σ^{BN}_{c})² = (1/NHW) Σ_{n,i,j=1}^{N,H,W} (h_{ncij} − μ^{BN}_{c})²
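As a concrete illustration, here is a minimal NumPy sketch (not code from the talk) of a BN-style normalization step; the function name, the ε value and the explicit reshapes are assumptions made for the example.

    import numpy as np

    def batch_norm(h, gamma, beta, eps=1e-5):
        # h: feature map of shape (N, C, H, W); gamma, beta: per-channel scale and shift.
        # BN statistics are computed over the (N, H, W) axes, one pair per channel.
        mu = h.mean(axis=(0, 2, 3), keepdims=True)        # mu_BN, shape (1, C, 1, 1)
        var = h.var(axis=(0, 2, 3), keepdims=True)        # (sigma_BN)^2
        h_hat = (h - mu) / np.sqrt(var + eps)             # standardized activations
        return gamma.reshape(1, -1, 1, 1) * h_hat + beta.reshape(1, -1, 1, 1)

    # In a Conv -> Norm -> ReLU block this would be applied to the convolution
    # output, followed by np.maximum(out, 0).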
Various Normalizers-IN, BN, LN and GN
Calculating the mean μ and variance σ² within different scopes produces different normalizers. Given a feature map h_{ncij} ∈ ℝ^{N×C×H×W}:

IN:  μ^{IN}_{nc} = (1/HW) Σ_{i,j=1}^{H,W} h_{ncij},   (σ^{IN}_{nc})² = (1/HW) Σ_{i,j=1}^{H,W} (h_{ncij} − μ^{IN}_{nc})²

BN:  μ^{BN}_{c} = (1/NHW) Σ_{n,i,j=1}^{N,H,W} h_{ncij},   (σ^{BN}_{c})² = (1/NHW) Σ_{n,i,j=1}^{N,H,W} (h_{ncij} − μ^{BN}_{c})²

LN:  μ^{LN}_{n} = (1/CHW) Σ_{c,i,j=1}^{C,H,W} h_{ncij},   (σ^{LN}_{n})² = (1/CHW) Σ_{c,i,j=1}^{C,H,W} (h_{ncij} − μ^{LN}_{n})²
GN (group g, with C/G channels per group):  μ^{GN}_{ng} = (G/(CHW)) Σ_{c∈group g} Σ_{i,j=1}^{H,W} h_{ncij},   (σ^{GN}_{ng})² = (G/(CHW)) Σ_{c∈group g} Σ_{i,j=1}^{H,W} (h_{ncij} − μ^{GN}_{ng})²
GN divides the channels into groups and computes the mean and variance within each group for normalization.
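A minimal NumPy sketch (an illustration, not the talk's code) of how the four sets of statistics differ only in the axes they average over; the sizes are arbitrary and G is assumed to divide C.

    import numpy as np

    N, C, H, W, G = 8, 6, 4, 4, 3                  # toy sizes; G must divide C
    h = np.random.randn(N, C, H, W)

    mu_in = h.mean(axis=(2, 3))                    # IN: one mean per (n, c), shape (N, C)
    mu_bn = h.mean(axis=(0, 2, 3))                 # BN: one mean per channel, shape (C,)
    mu_ln = h.mean(axis=(1, 2, 3))                 # LN: one mean per sample, shape (N,)
    mu_gn = h.reshape(N, G, C // G, H, W).mean(axis=(2, 3, 4))   # GN: per (n, g), shape (N, G)
    # The variances are obtained with .var(...) over exactly the same axes.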
Various Normalizers-SN and SSN
The above-mentioned normalization methods use the same normalizer in every normalization layer. Switchable Normalization (SN) is able to learn a different normalizer for each normalization layer [5]. Statistics in SN:

μ_SN = p₁ μ_IN + p₂ μ_BN + p₃ μ_LN,   σ²_SN = p₁ σ²_IN + p₂ σ²_BN + p₃ σ²_LN

where (p₁, p₂, p₃) = softmax(λ₁, λ₂, λ₃) and λ₁, λ₂, λ₃ are learnable parameters. The λ's learned by SGD can differ from layer to layer.
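A rough NumPy sketch of SN's weighted statistics under the simplified form shown above (a single weight set shared by means and variances); the names sn_statistics and lam are illustrative, and in practice the λ's are trained by SGD together with the network.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def sn_statistics(h, lam):
        # h: (N, C, H, W) feature map; lam: three learnable control parameters.
        p = softmax(lam)                                              # importance weights p1, p2, p3
        mu_in, var_in = h.mean((2, 3), keepdims=True), h.var((2, 3), keepdims=True)
        mu_bn, var_bn = h.mean((0, 2, 3), keepdims=True), h.var((0, 2, 3), keepdims=True)
        mu_ln, var_ln = h.mean((1, 2, 3), keepdims=True), h.var((1, 2, 3), keepdims=True)
        mu = p[0] * mu_in + p[1] * mu_bn + p[2] * mu_ln               # broadcasts to (N, C, 1, 1)
        var = p[0] * var_in + p[1] * var_bn + p[2] * var_ln
        return mu, var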
Various Normalizers-SN and SSN
However, SN suffers from overfitting and redundant computation:
— overfitting: λ₁, λ₂, λ₃ are optimized without any constraint;
— redundant computation: all of the IN, BN and LN statistics must be computed at the inference stage.
Sparse Switchable Normalization (SSN) learns exactly one normalizer for each normalization layer [6]. Statistics in SSN:

μ_SSN = p₁ μ_IN + p₂ μ_BN + p₃ μ_LN,   σ²_SSN = p₁ σ²_IN + p₂ σ²_BN + p₃ σ²_LN

such that p₁ + p₂ + p₃ = 1 and pₖ ∈ {0, 1}. SSN is achieved by a novel transformation, SparsestMax, which substitutes for the softmax in SN.
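A small sketch of the inference-time consequence of this sparsity (SparsestMax itself, a sparse simplex projection, is not reproduced here); the helper name and the one-hot p are assumptions made for the example.

    import numpy as np

    def ssn_inference_statistics(h, p):
        # p: the learned ratios, one-hot over (IN, BN, LN) after SparsestMax.
        # Only the selected normalizer's statistics are computed, avoiding the
        # redundant computation SN performs at inference time.
        k = int(np.argmax(p))
        axes = [(2, 3), (0, 2, 3), (1, 2, 3)][k]
        return h.mean(axes, keepdims=True), h.var(axes, keepdims=True)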
A Unified Representation: Meta Normalization [7]
Is there a unified representation that covers these normalizers? To answer this question, consider the relation between μ_IN and μ_BN, μ_LN. Arrange the IN means into a matrix:

μ_IN = [μ₁₁ ⋯ μ₁C; ⋮ ⋱ ⋮; μ_N1 ⋯ μ_NC] ∈ ℝ^{N×C},   with μ_BN ∈ ℝ^{C} and μ_LN ∈ ℝ^{N}.

Taking the sum (average) over the batch or over the channels gives

μ_BN = (1/N) 𝟏_{N×N} μ_IN   (each of the N rows duplicates μ_BN),
μ_LN = (1/C) μ_IN 𝟏_{C×C}   (each of the C columns duplicates μ_LN).
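This relation is easy to check numerically; the following NumPy lines (illustrative only) confirm that averaging μ_IN over the batch reproduces μ_BN and averaging it over the channels reproduces μ_LN.

    import numpy as np

    N, C, H, W = 4, 3, 5, 5
    h = np.random.randn(N, C, H, W)

    mu_in = h.mean(axis=(2, 3))                              # shape (N, C)
    mu_bn_dup = np.ones((N, N)) @ mu_in / N                  # every row equals mu_BN
    mu_ln_dup = mu_in @ np.ones((C, C)) / C                  # every column equals mu_LN

    assert np.allclose(mu_bn_dup[0], h.mean(axis=(0, 2, 3)))
    assert np.allclose(mu_ln_dup[:, 0], h.mean(axis=(1, 2, 3)))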
A Unified Representation: Meta Normalization
μ_MN = (1/a_U) U μ_IN (1/a_V) V,   σ_MN = (1/a_U) U σ_IN (1/a_V) V

where a_U and a_V are normalizing factors, and U ∈ ℝ^{N×N} and V ∈ ℝ^{C×C} are binary matrices whose elements are either 0 or 1.
Representation capacity. In MN, U aggregates the statistics across the batch, while V aggregates them across the channels:
◆ Let U = I and V = I; then MN represents IN.
◆ Let U = 𝟏_{N×N} (a_U = N) and V = I; then MN turns into BN.
◆ Let U = I and V = 𝟏_{C×C} (a_V = C); then MN represents LN.
◆ Let U = I and V = block-diag(𝟏, 𝟏) with two diagonal blocks of ones (a_V = C/2); then MN represents GN with a group number of 2.
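The representation capacity can be checked directly; below is an illustrative NumPy sketch (the helper mn_mean is hypothetical) that recovers the IN, BN, LN and GN means from μ_IN with different choices of U, V and the normalizing factors.

    import numpy as np

    def mn_mean(mu_in, U, a_U, V, a_V):
        # mu_MN = (1/a_U) U mu_IN (1/a_V) V, with U in R^{N x N} and V in R^{C x C}.
        return (U @ mu_in @ V) / (a_U * a_V)

    N, C = 4, 6
    mu_in = np.random.randn(N, C)
    I_N, I_C = np.eye(N), np.eye(C)
    ones_N, ones_C = np.ones((N, N)), np.ones((C, C))

    mu_as_in = mn_mean(mu_in, I_N, 1, I_C, 1)                    # IN
    mu_as_bn = mn_mean(mu_in, ones_N, N, I_C, 1)                 # BN (each row is mu_BN)
    mu_as_ln = mn_mean(mu_in, I_N, 1, ones_C, C)                 # LN (each column is mu_LN)
    V_gn = np.kron(np.eye(2), np.ones((C // 2, C // 2)))         # two diagonal blocks of ones
    mu_as_gn = mn_mean(mu_in, I_N, 1, V_gn, C // 2)              # GN with 2 groups

    assert np.allclose(mu_as_bn[0], mu_in.mean(axis=0))
    assert np.allclose(mu_as_ln[:, 0], mu_in.mean(axis=1))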
Back-propagation of MN
Let ĥ_{ncij} denote the neuron after normalization, i.e., the input neuron h_{ncij} is transformed to ĥ_{ncij}. Back-propagation: what we care about most is back-propagating the gradient of the output, ∂L/∂ĥ_{ncij}, to the gradient of the input, ∂L/∂h_{ncij}.
Back-propagation of MN
Back-propagation. Define the output and input gradients as

∂L/∂ĥ_{ncij} ≜ d̂_{ncij},   ∂L/∂h_{ncij} ≜ d_{ncij}.
Geometric View of BN. Let U = 𝟏_{N×N} (a_U = N) and V = I.
Geometric View of LN. Let U = I and V = 𝟏_{C×C} (a_V = C).
Geometric View of GN with group number G. Let U = I and V = block-diag(𝟏, …, 𝟏) with G diagonal sub-matrices of ones (a_V = C/G).
Stacking the normalized activations of one normalization scope into a vector ĥ and the corresponding output gradients into d̂, the back-propagated gradient takes the projection form

d = (1/σ) ( I − (𝟏𝟏ᵀ + ĥ ĥᵀ)/m ) d̂,

where m is the number of elements in the scope (m = NHW for BN, m = CHW for LN, m = (C/G)HW for GN).
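The projection form can be verified against automatic differentiation for a single normalization scope; below is a small PyTorch check (an assumed setup, not from the talk), treating the scope as a flat vector of m elements.

    import torch

    m = 12
    x = torch.randn(m, dtype=torch.double, requires_grad=True)   # activations in one scope
    d_hat = torch.randn(m, dtype=torch.double)                   # arbitrary output gradient

    mu = x.mean()
    sigma = ((x - mu) ** 2).mean().sqrt()                        # biased (1/m) standard deviation
    x_hat = (x - mu) / sigma
    x_hat.backward(d_hat)                                        # autograd gradient w.r.t. x

    ones = torch.ones(m, m, dtype=torch.double)
    P = (ones + torch.outer(x_hat.detach(), x_hat.detach())) / m
    d_formula = (torch.eye(m, dtype=torch.double) - P) @ d_hat / sigma.detach()

    assert torch.allclose(x.grad, d_formula)                     # matches the projection form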
Geometric Interpretation
Projection Matrix. Given a matrix A, the projection matrix is P = A(AᵀA)⁻¹Aᵀ.
If the columns of A form a basis for a subspace W, then P projects onto W, and (I − P) is the projection matrix for the orthogonal complement of W: given a vector y, Py lies in W and (I − P)y lies in the orthogonal complement of W.
Take BN as an example. Let A = [𝟏, ĥ_c]. Since the normalized activations ĥ_c have zero mean and unit variance, 𝟏ᵀĥ_c = 0 and ĥ_cᵀĥ_c = NHW, so

AᵀA = [𝟏ᵀ𝟏  𝟏ᵀĥ_c; ĥ_cᵀ𝟏  ĥ_cᵀĥ_c] = NHW · I.

Therefore the projection matrix corresponding to A is exactly P = (𝟏𝟏ᵀ + ĥ_c ĥ_cᵀ)/NHW, and the back-propagated gradient d_c = (1/σ_c)(I − P) d̂_c is the projection of d̂_c onto the orthogonal complement of span{𝟏, ĥ_c}, scaled by 1/σ_c.
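A few NumPy lines (illustrative only) confirm the algebra: the columns of A = [𝟏, ĥ] are orthogonal with squared norm m, so the general formula P = A(AᵀA)⁻¹Aᵀ collapses to (𝟏𝟏ᵀ + ĥĥᵀ)/m.

    import numpy as np

    m = 10
    x = np.random.randn(m)
    x_hat = (x - x.mean()) / x.std()                   # zero mean, unit variance
    A = np.stack([np.ones(m), x_hat], axis=1)          # columns: the 1-vector and x_hat

    assert np.allclose(A.T @ A, m * np.eye(2))         # A^T A = m I
    P = A @ np.linalg.inv(A.T @ A) @ A.T               # projection onto span{1, x_hat}
    assert np.allclose(P, (np.ones((m, m)) + np.outer(x_hat, x_hat)) / m)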
Why Batch Normalization?
BN has been an indispensable component in various network architectures. The effectiveness of BN has been uncovered from two aspects: optimization and generalization. A more fundamental impact of BatchNorm on the training process is that it makes the optimization landscape significantly smoother [8].
[Figure from [8]: the variation (shaded region) in the loss; the ℓ2 changes in the gradient as we move in the gradient direction; and the maximum difference (ℓ2 norm) in gradient over the distance moved in that direction.]
Lipschitzness of the Loss
BN causes the landscape to be better behaved, inducing favorable Lipschitz-continuity properties. Let's first consider the optimization landscape with respect to the activations.
[Annotated bound from [8]: the gradient magnitude captures the Lipschitzness; the multiplicative factor is empirically less than 1; one subtracted term grows quadratically in the dimension; the other is bounded away from zero.]
Lipschitzness of the Loss
Let's now turn to the optimization landscape with respect to the weights.
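These smoothness claims can be probed empirically in the spirit of [8]; the sketch below (a toy setup with arbitrary sizes, not the paper's experiment) steps a small MLP along its gradient direction and records how much the loss and the gradient change, with and without BN.

    import torch
    import torch.nn as nn

    def landscape_probe(use_bn, steps=(0.05, 0.1, 0.2, 0.4), seed=0):
        # Probe the loss and gradient along the gradient direction for a tiny MLP.
        torch.manual_seed(seed)
        layers = [nn.Linear(20, 64)]
        if use_bn:
            layers.append(nn.BatchNorm1d(64))
        layers += [nn.ReLU(), nn.Linear(64, 2)]
        net = nn.Sequential(*layers)
        x, y = torch.randn(128, 20), torch.randint(0, 2, (128,))
        loss_fn = nn.CrossEntropyLoss()

        def grad_vector():
            net.zero_grad()
            loss_fn(net(x), y).backward()
            return torch.cat([p.grad.flatten() for p in net.parameters()])

        g0 = grad_vector()
        params0 = [p.detach().clone() for p in net.parameters()]
        losses, grad_diffs = [], []
        for s in steps:
            with torch.no_grad():                      # move all weights by -s * g0
                offset = 0
                for p, p0 in zip(net.parameters(), params0):
                    n = p.numel()
                    p.copy_(p0 - s * g0[offset:offset + n].view_as(p))
                    offset += n
            losses.append(loss_fn(net(x), y).item())                 # loss along the direction
            grad_diffs.append((grad_vector() - g0).norm().item())    # change in the gradient
        return losses, grad_diffs

    print("with BN   :", landscape_probe(True))
    print("without BN:", landscape_probe(False))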
Regularization in BN
Batch normalization implicitly discourages reliance on any single channel, suggesting an alternative regularization mechanism by which batch normalization may encourage good generalization performance. BN makes the channels more equal, so that they play a homogeneous role in representing the prediction function. How can this conclusion be verified empirically? [9] measures robustness to cumulative ablation of channels (see the sketch below): networks trained with batch normalization are more robust to these ablations than those trained without batch normalization.
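A hedged PyTorch sketch of a channel-ablation measurement in the spirit of [9]; the function name, the choice of layer and the usage line are hypothetical, and a trained model plus a validation loader are assumed to exist.

    import torch

    @torch.no_grad()
    def ablation_accuracy(model, layer, loader, n_ablate, device="cpu"):
        # Zero out the first n_ablate channels of `layer`'s output, then measure accuracy.
        def hook(_, __, out):
            out[:, :n_ablate] = 0
            return out
        handle = layer.register_forward_hook(hook)
        model.eval()
        correct = total = 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            pred = model(x).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
        handle.remove()
        return correct / total

    # Usage (hypothetical names): compare a BN and a no-BN model as n_ablate grows, e.g.
    #   [ablation_accuracy(bn_model, bn_model.layer3, val_loader, k) for k in range(0, 64, 8)]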
Regularization in BN
We explore an explicit regularization expression for BN by analyzing a building block of a deep network. BN induces Gaussian priors for the batch mean μ_B and the batch standard deviation σ_B [10]. These priors tell us that μ_B and σ_B behave like Gaussian noise; taking the expectation over this noise gives an explicit regularization expression for BN [11]:
◆ the regularization strength ζ is inversely proportional to the batch size M;
◆ μ_B and σ_B produce two different regularization strengths;
◆ μ_B penalizes the expectation of the activation, implying that a neuron with larger output may be exposed to larger regularization.
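A small illustrative NumPy demonstration (not from the slides) of why the noise, and hence the implicit regularization strength, weakens as the batch size M grows: the variance of the batch statistics shrinks roughly like 1/M.

    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.standard_normal(100_000)              # activations of one neuron

    for M in (8, 32, 128):
        batches = rng.choice(population, size=(2000, M))
        mu_b = batches.mean(axis=1)                        # batch means
        sigma_b = batches.std(axis=1)                      # batch standard deviations
        print(f"M={M:4d}  var(mu_B)={mu_b.var():.4f}  var(sigma_B)={sigma_b.var():.4f}")
    # Both variances decrease roughly like 1/M, so the noise injected by the batch
    # statistics is strongest for small batches.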
Normalization in Various Computer Vision Tasks
Image Classification Object Detection Semantic Segmentation
References
[9] A. S. Morcos, D. G. T. Barrett, N. C. Rabinowitz, M. Botvinick. On the importance of single directions for generalization. ICLR 2018.