
SLIDE 1

Normalization Techniques in Training of Deep Neural Networks

August 17th, 2017

Lei Huang (黄雷)
State Key Laboratory of Software Development Environment, Beihang University
Mail: huanglei@nlsde.buaa.edu.cn

SLIDE 2

Outline

  • Introduction to Deep Neural Networks (DNNs)
  • Training DNNs: Optimization
  • Batch Normalization
  • Other Normalization Techniques
  • Centered Weight Normalization
SLIDE 3

Machine learning

  • Dataset: D = {X, Y}
  • Input: X;  Output: Y
  • Learning: Y = F(X) or P(Y|X)
  • Fitting and Generalization
  • Types (view of models):

– Non-parametric model: Y = F(X; x_1, x_2, ..., x_n)
– Parametric model: Y = F(X; θ)

SLIDE 4

Neural network

  • Neural network

– Y = F(X) = g_L(g_{L-1}(... g_1(X)))
– g_j(x) = h(W_j x + b_j)

  • Nonlinear activation

– sigmoid
– ReLU
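
As a minimal illustration of these pieces (hypothetical layer sizes, NumPy only), the sketch below defines the sigmoid and ReLU activations and composes two layers g_2(g_1(x)):

```python
import numpy as np

def sigmoid(a):
    # Squashes each pre-activation into (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    # Keeps positive pre-activations, zeroes the rest.
    return np.maximum(a, 0.0)

def layer(x, W, b, h):
    # One layer g(x) = h(Wx + b).
    return h(W @ x + b)

# Hypothetical sizes: 3 inputs -> 4 hidden units -> 1 output.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

y = layer(layer(x, W1, b1, relu), W2, b2, sigmoid)   # Y = g_2(g_1(X))
print(y)
```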

SLIDE 5

Deep neural network

  • Why deep?

– Powerful representation capacity

SLIDE 6

Key properties of Deep learning

  • End to End learning

– no distinction between feature extractor and classifier

  • “Deep” architectures:

– Hierarchy of simpler non-linear modules

SLIDE 7

Applications and techniques of DNNs

  • Successful applications in a range of domains

– Speech
– Computer Vision
– Natural Language Processing

  • Main techniques in using deep neural networks

– Design the architecture

  • Module selection and Module connection
  • Loss function

– Train the model based on optimization

  • Initialize the parameters
  • Search direction in parameters space
  • Learning rate schedule
  • Regularization techniques
SLIDE 8

Outline

  • Introduction to Deep Neural Networks (DNNs)
  • Training DNNs: Optimization
  • Batch Normalization
  • Other Normalization Techniques
  • Centered Weight Normalization
SLIDE 9


Training of Neural Networks

  • Multi-layer perceptron (example)

  • Example: input x = (1, 0, 0)^T with bias units x_0 = 1 and h_0 = 1, one hidden layer h^(2), output ŷ, sigmoid activation σ
  • MSE Loss: L = (ŷ − y)^2

  • 1 Forward, calculate ŷ:
    a^(2) = W^(2) x,   h^(2) = σ(a^(2))
    a^(3) = W^(3) h^(2),   ŷ = σ(a^(3))

  • 2 Backward, calculate dL/dx:
    dL/dŷ = 2(ŷ − y)
    dL/da^(3) = dL/dŷ · σ(a^(3)) · (1 − σ(a^(3)))
    dL/dh^(2) = dL/da^(3) · W^(3)
    dL/da^(2) = dL/dh^(2) · σ(a^(2)) · (1 − σ(a^(2)))
    dL/dx = dL/da^(2) · W^(2)

  • 3 Calculate the gradients dL/dW:
    dL/dW^(3) = dL/da^(3) · h^(2)
    dL/dW^(2) = dL/da^(2) · x
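
The forward and backward passes above translate directly into code. This is a minimal NumPy sketch (hypothetical shapes and data, not the slide's exact example) of the same computation, followed by a plain SGD update on the weights:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 0.0])            # input
t = np.array([1.0])                      # target
W2 = rng.normal(scale=0.5, size=(4, 3))  # input -> hidden
W3 = rng.normal(scale=0.5, size=(1, 4))  # hidden -> output
lr = 0.1

# 1) Forward
a2 = W2 @ x;  h2 = sigmoid(a2)
a3 = W3 @ h2; y  = sigmoid(a3)
L = np.sum((y - t) ** 2)                 # MSE loss

# 2) Backward: derivatives w.r.t. the activations
dy  = 2.0 * (y - t)
da3 = dy * y * (1.0 - y)                 # sigma'(a3) = y(1 - y)
dh2 = W3.T @ da3
da2 = dh2 * h2 * (1.0 - h2)
dx  = W2.T @ da2                         # dL/dx, propagated back to the input

# 3) Gradients w.r.t. the weights, then one SGD step
dW3 = np.outer(da3, h2)
dW2 = np.outer(da2, x)
W3 -= lr * dW3
W2 -= lr * dW2
print(L)
```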

SLIDE 10

Optimization in Deep Model

  • Challenge:

– Non-convex, with local optima
– Saddle points
– Severe correlation between dimensions and a highly non-isotropic (ill-shaped) parameter space

  • Goal:
  • Update Iteratively:
SLIDE 11

First order optimization

  • First order stochastic gradient descent (SGD):

– The direction is the (negative) gradient
– The gradient is averaged over the sampled examples
– Disadvantages:

  • Over-aggressive steps on ridges
  • Too small steps on plateaus
  • Slow convergence
  • non-robust performance.

Figure 2: zig-zag iteration path for SGD

SLIDE 12

Advanced Optimization

  • Estimate curvature or scale

– Quadratic optimization

  • Newton or quasi-Newton

– Inverse of Hessian

  • Natural Gradient

– Inverse of FIM

– Estimate the scale (an adaptive-update sketch follows this slide)

  • AdaGrad
  • RMSprop
  • Adam

  • Normalize the input/activation

– Intuition: the landscape of the cost w.r.t. the parameters is controlled by the input/activation, L = ℓ(f(x, θ), y)
– Method: stabilize the distribution of the input/activation

  • Normalize the input explicitly
  • Normalize the input implicitly (by constraining the weights)

Iteration path of SGD (red) and NGD (green)
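
As referenced above, here is a minimal sketch of one scale-estimating update (an Adam-style rule; the hyperparameter values are the commonly used defaults, not taken from the slides):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: first/second moment estimates give a per-dimension scale."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(theta) = ||theta||^2 for a toy 2-D parameter.
theta = np.array([1.0, -2.0]); m = np.zeros(2); v = np.zeros(2)
for t in range(1, 501):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # approaches the optimum (0, 0)
```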

SLIDE 13

Some intuitions of normalization for optimization

  • How does normalizing the activation affect the optimization?

– z = w1·x1 + w2·x2 + b
– L = (z − ẑ)^2
– Original inputs: 0 < x1 < 2, 0 < x2 < 0.5
– Rescaled inputs: 0 < x1' = x1/2 < 1, 0 < x2' = 2·x2 < 1

(Contour plots of the loss L(w1, w2) before and after rescaling the inputs: after rescaling, the loss surface is much better conditioned.)
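
A quick numerical check of this intuition (a minimal sketch with made-up input distributions matching the slide's ranges): for a linear model with squared loss, the Hessian with respect to (w1, w2) is 2·E[xxᵀ], so rescaling the inputs directly changes its condition number.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
x1 = rng.uniform(0.0, 2.0, size=n)       # 0 < x1 < 2
x2 = rng.uniform(0.0, 0.5, size=n)       # 0 < x2 < 0.5

def condition_number(x1, x2):
    X = np.stack([x1, x2], axis=1)
    H = 2.0 * (X.T @ X) / n               # Hessian of the squared loss w.r.t. (w1, w2)
    return np.linalg.cond(H)

print(condition_number(x1, x2))           # ill-conditioned
print(condition_number(x1 / 2, x2 * 2))   # rescaled inputs: much better conditioned
```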

SLIDE 14

Outline

  • Introduction to Deep Neural Networks (DNNs)
  • Training DNNs: Optimization
  • Batch Normalization
  • Other Normalization Techniques
  • Centered Weight Normalization
SLIDE 15

Batch Normalization--motivation

  • Solving Internal Covariate Shift
  • Whitening the input benefits optimization (1998, LeCun, Efficient BackProp)

– Centering
– Decorrelating
– Stretching (scaling to unit variance)

(Illustration on a linear model y = Wx with MSE loss: the input is centered, decorrelated, and stretched.)

SLIDE 16

Batch Normalization--method

  • Only standardize the input: full decorrelation is expensive

– Centering
– Stretching

  • How to do it?

– x̂ = (x − E[x]) / √(Var[x])   (centering, then stretching)

SLIDE 17

Batch Normalization--training

  • Forward
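
The forward formulas were shown as an image on the original slide; the following is a minimal NumPy sketch of BN's training-time forward pass over a mini-batch (per-feature mean and variance, followed by a learnable scale γ and shift β), following the standard formulation of the ICML 2015 paper:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch. Returns the output and a cache for the backward pass."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # standardize
    out = gamma * x_hat + beta              # learnable scale and shift
    cache = (x_hat, gamma, var, eps)
    return out, cache

# Usage on random data: each feature of the output is roughly zero-mean, unit-variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 10))
out, cache = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```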
SLIDE 18

Batch Normalization--training

  • Backward
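
Likewise, the backward pass was an image on the slide; here is a sketch of the standard BN gradient (the compact form obtained by differentiating through the batch mean and variance), matching the forward sketch above:

```python
import numpy as np

def batchnorm_backward(dout, cache):
    """dout: (N, D) gradient of the loss w.r.t. the BN output."""
    x_hat, gamma, var, eps = cache
    N = dout.shape[0]
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)
    dx_hat = dout * gamma
    # Gradient w.r.t. the input, accounting for the dependence of mu and var on x.
    dx = (1.0 / N) / np.sqrt(var + eps) * (
        N * dx_hat - dx_hat.sum(axis=0) - x_hat * (dx_hat * x_hat).sum(axis=0)
    )
    return dx, dgamma, dbeta
```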
SLIDE 19

Batch Normalization--Inference

  • Inference (in the paper): use population statistics E[x] and Var[x] estimated over the training set
  • Inference (in practice)

– Running averages maintained during training:

  • E[x] ← β·μ_B + (1 − β)·E[x]
  • Var[x] ← β·σ_B² + (1 − β)·Var[x]
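
A minimal sketch of how the running statistics are maintained during training and then used at inference (the momentum value is illustrative; this pairs with the `batchnorm_forward` sketch above):

```python
import numpy as np

def update_running_stats(x, running_mean, running_var, momentum=0.1):
    """Call once per training mini-batch x of shape (N, D)."""
    mu, var = x.mean(axis=0), x.var(axis=0)
    running_mean = momentum * mu + (1 - momentum) * running_mean
    running_var = momentum * var + (1 - momentum) * running_var
    return running_mean, running_var

def batchnorm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """At test time the fixed running statistics replace the batch statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```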

SLIDE 20

Batch Normalization—how to use

  • Convolution layer
  • Wrapped as a module

– Before or after the nonlinearity? (see the module sketch after this list)

  • For shallow models (fewer than ~11 layers), after the nonlinearity
  • For deep models, before the nonlinearity

– Advantages of placing BN before the nonlinearity

  • For ReLU, about half of the units are activated
  • For sigmoid, it avoids the saturated region

– Advantages of placing BN after the nonlinearity

  • Matches the intuition of whitening the next layer's input
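
In a framework such as PyTorch the two placements are just two module orderings; a minimal sketch (the channel counts are made up) for a convolutional layer wrapped as a module:

```python
import torch.nn as nn

# BN before the nonlinearity (the common choice for deep models):
conv_bn_relu = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)

# BN after the nonlinearity (closer to the "whiten the next layer's input" intuition):
conv_relu_bn = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(128),
)
```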
SLIDE 21

Batch Normalization—how to use

  • Example:

(Left: original residual block, CVPR 2016. Right: pre-activation residual block, ECCV 2016.)
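
For concreteness, a minimal PyTorch sketch of the pre-activation ordering (BN and ReLU moved before each convolution; the channel count and the plain identity shortcut are simplifying assumptions):

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> Conv, twice, plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + x  # identity shortcut (same shape assumed)

block = PreActBlock(64)
print(block(torch.randn(2, 64, 8, 8)).shape)  # torch.Size([2, 64, 8, 8])
```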

SLIDE 22

Batch Normalization—characteristics

  • For accelerating training:

– Weight-scale invariant: not sensitive to the weight initialization
– Adjustable learning rate
– Allows a large learning rate

  • Better conditioning (1998, LeCun)
  • For generalization

– Stochastic, works like Dropout

SLIDE 23

Batch Normalization

  • A routine component in deep feed-forward neural networks, especially CNNs

  • Weaknesses

– Cannot be used for online learning
– Unstable for small mini-batch sizes
– Should be used in RNNs with caution

SLIDE 24

Batch Normalization– for RNN

  • Extra problems that need to be considered:

– Where should BN be placed?
– Sequence data

  • 2016, ICASSP, Batch Normalized Recurrent Neural Networks

– Where the BN module is placed
– The sequence-data problem

  • Frame-wise normalization
  • Sequence-wise normalization
SLIDE 25

Batch Normalization for RNN

  • 2017, ICLR, Recurrent Batch Normalization

– Where the BN module is placed
– The sequence-data problem

  • Frame-wise normalization up to T_max
  • It depends...
SLIDE 26

Outline

  • Introduction to Deep Neural Networks (DNNs)
  • Training DNNs: Optimization
  • Batch Normalization
  • Other Normalization Techniques
  • Centered Weight Normalization
SLIDE 27

Norm-propagation (2016, ICML)

  • Targets BN's drawbacks:

– Cannot be used for online learning
– Unstable for small mini-batch sizes

  • Data-independent, parametric estimate of the mean and variance

– Normalize the input: zero mean and unit variance
– Assume W is (approximately) orthogonal
– Derive how the nonlinearity transforms the distribution

  • ReLU:
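
The slide's ReLU derivation was shown as an image; as an illustrative reconstruction (a standard result that Normalization Propagation relies on, not copied from the slide): if a pre-activation is approximately standard normal, the post-ReLU mean and variance have closed forms, so the output can be re-standardized analytically.

```latex
% For z ~ N(0,1) and ReLU(z) = max(z, 0):
\mathbb{E}[\max(z,0)] = \frac{1}{\sqrt{2\pi}}, \qquad
\mathrm{Var}[\max(z,0)] = \frac{1}{2} - \frac{1}{2\pi},
\quad z \sim \mathcal{N}(0,1)
```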
SLIDE 28

Layer Normalization (2016, Arxiv)

  • Targets BN's drawbacks:

– Cannot be used for online learning
– Unstable for small mini-batch sizes
– Hard to apply to RNNs

  • Normalizes each example over its feature dimensions

(Illustration: BN normalizes over the batch dimension; LN normalizes over the feature dimensions of each example.)
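
A minimal NumPy sketch of that difference (shapes are illustrative): layer normalization computes the mean and variance per example, so it works with a batch size of 1 and at every time step of an RNN.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (N, D). Statistics are computed per example (over D), not per batch."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 16))                      # works even with batch size 1
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean().round(6), y.std().round(3))        # ~0 and ~1 for this single example
```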

SLIDE 29

Natural Neural Network (2015, NIPS)

  • Canonical model (MLP): h_j = g_j(W_j h_{j−1} + b_j)

  • Natural neural network
  • Model parameters:

Ω = {W_1, b_1, …, W_M, b_M}

  • Whitening coefficients:

Φ = {V_0, d_0, …, V_{M−1}, d_{M−1}}

  • How about decorrelating the activations?
SLIDE 30

Weight Normalization (2016, NIPS)

  • Targets BN's drawbacks:

– Cannot be used for online learning
– Unstable for small mini-batch sizes
– Hard to apply to RNNs

  • Expresses the weight with new parameters: w = g · v / ‖v‖
  • Decouples the direction and the length of the weight vector
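
A minimal NumPy sketch of that reparameterization (per output unit; the layer shape is illustrative): the effective weight is built from a direction v and a learned length g, and gradients flow to v and g instead of w.

```python
import numpy as np

def weight_norm(v, g, eps=1e-8):
    """v: (out, in) direction parameters, g: (out,) per-unit lengths.
    Row i of the result is g[i] * v[i] / ||v[i]||."""
    norms = np.linalg.norm(v, axis=1, keepdims=True) + eps
    return g[:, None] * v / norms

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 3))
g = np.ones(4)
w = weight_norm(v, g)
print(np.linalg.norm(w, axis=1))  # each row has length g[i] = 1, direction taken from v
```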
SLIDE 31

Reference

  • Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015 (Batch Normalization)
  • Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks, ICML 2016
  • Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, NIPS 2016
  • Layer Normalization, arXiv:1607.06450, 2016
  • Recurrent Batch Normalization, ICLR 2017
  • Batch Normalized Recurrent Neural Networks, ICASSP 2016
  • Natural Neural Networks, NIPS 2015
  • Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes, ICLR 2017
  • Batch Renormalization, arXiv:1702.03275, 2017
  • Mean-Normalized Stochastic Gradient for Large-Scale Deep Learning, ICASSP 2014
  • Deep Learning Made Easier by Linear Transformations in Perceptrons, AISTATS 2012


SLIDE 32

Outline

  • Introduction to Deep Neural Networks (DNNs)
  • Training DNNs: Optimization
  • Batch Normalization
  • Other Normalization Techniques
  • Centered Weight Normalization
SLIDE 33

Centered Weight Normalization in Accelerating Training of Deep Neural Networks

Lei Huang, Xianglong Liu, Yang Liu, Bo Lang, Dacheng Tao International Conference on Computer Vision (ICCV) 2017

SLIDE 34

Motivation

  • Stable distributions in the hidden layers
  • Initialization methods

– Random Init (1998, Yann LeCun)

  • Zero mean, stable variance

– Xavier Init (2010, Xavier Glorot)
– He Init (2015, Kaiming He)

  • W ~ N(0, 2/n), with n the fan-in (or fan-out) of the layer, e.g. n = k²·c for a k×k convolution over c channels (a short sketch follows this slide)

  • Keep the desired properties during training
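
A minimal sketch of the He initialization mentioned above (NumPy, fully connected case, with the convolutional fan-in noted as a comment):

```python
import numpy as np

def he_init(fan_out, fan_in, rng=np.random.default_rng(0)):
    """Draw W ~ N(0, 2/n) with n = fan_in, so ReLU pre-activations keep a stable variance."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(scale=std, size=(fan_out, fan_in))

W = he_init(256, 128)
print(W.std())  # close to sqrt(2/128)

# For a k x k convolution over c input channels, the fan-in would be n = k * k * c.
```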
SLIDE 35

Method

  • Solution by re-parameterization
  • Formulation: Constrained optimization problem:
SLIDE 36

Method

  • Gradient Information:
  • Using proxy parameter v:
  • Adjustable scale:
SLIDE 37

Method

  • Wrapped as a module for practitioners:

– Forward
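
The forward formulas on this slide were images; the following is a hedged NumPy sketch of the centered-weight-normalization idea described in the ICCV 2017 paper (per output unit: center the proxy parameter v, normalize it to unit length, and optionally apply a learnable scale g; the variable names are mine):

```python
import numpy as np

def cwn_forward(v, g=None, eps=1e-8):
    """v: (out, in) proxy parameters. Returns the effective weight matrix.
    Each row is centered (zero mean) and normalized to unit L2 norm."""
    u = v - v.mean(axis=1, keepdims=True)                           # centering
    w_hat = u / (np.linalg.norm(u, axis=1, keepdims=True) + eps)    # unit length
    return w_hat if g is None else g[:, None] * w_hat

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 3))
w = cwn_forward(v)
print(w.sum(axis=1).round(6), np.linalg.norm(w, axis=1).round(6))   # rows: zero mean, unit norm
```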

SLIDE 38

Method

  • Wrapped as a module for practitioners:

– Backward
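
Correspondingly, a hedged sketch of the backward pass through that reparameterization (gradient with respect to the proxy parameter v, ignoring the optional scale g for brevity; worth verifying against automatic differentiation):

```python
import numpy as np

def cwn_backward(dw_hat, v, eps=1e-8):
    """dw_hat: (out, in) gradient w.r.t. the normalized weight w_hat.
    Returns the gradient w.r.t. the proxy parameter v."""
    u = v - v.mean(axis=1, keepdims=True)
    norm = np.linalg.norm(u, axis=1, keepdims=True) + eps
    w_hat = u / norm
    # Backprop through the normalization: project out the component along w_hat...
    du = (dw_hat - (dw_hat * w_hat).sum(axis=1, keepdims=True) * w_hat) / norm
    # ...then backprop through the centering: subtract the per-row mean.
    return du - du.mean(axis=1, keepdims=True)
```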

SLIDE 39

Discussion

  • Beneficial properties for training

– Stabilizes the distributions

– Better conditioning of the Hessian

  • Regularization effect that improves generalization
SLIDE 40

Experiments

  • Data set

– Yale-B
– SVHN
– Cifar10, Cifar100
– ImageNet

  • Reproducible experiments and Code:

https://github.com/huangleiBuaa/CenteredWN

SLIDE 41

Experiments

  • Ablation study

– YaleB, MLP{128,64,48,48}

SLIDE 42

Experiments

  • MLP

– SVHN, {128,128,128,128,128}

(Figure panels: SGD optimization, SGD + BN, Adam optimization.)

SLIDE 43

Experiments

  • MLP

– SVHN, {128,128,128,128,128}

(Figure panels: SGD optimization, SGD + BN, Adam optimization.)

SLIDE 44

Experiments

  • Cifar10 & Cifar100

– BN-Inception
– Residual Network (56 layers)

BN-Inception (test error, %):
           Cifar-10      Cifar-100
  Plain    6.14 ±0.04    25.52 ±0.15
  WN       6.18 ±0.34    25.49 ±0.35
  WCBN     6.01 ±0.16    24.45 ±0.54

ResNet-56 (test error, %):
           Cifar-10      Cifar-100
  Plain    7.34 ±0.52    29.38 ±0.14
  WN       7.58 ±0.40    29.85 ±0.66
  WCBN     6.85 ±0.25    29.23 ±0.14

SLIDE 45

Experiments

  • ImageNet

– BN-Inception

SLIDE 46

Conclusion and Future Work

  • Future work: apply the CWN module to

– RNNs
– Reinforcement learning scenarios

  • Conclusion:

– CWN shows advantages in accelerating training and achieving better generalization
– The CWN module can be used as a drop-in replacement for the linear module

SLIDE 47

Thanks!