Normalization Techniques in Training of Deep Neural Networks
Lei Huang
State Key Laboratory of Software Development Environment, Beihang University
Mail: huanglei@nlsde.buaa.edu.cn
August 17th, 2017

Outline: Introduction to Deep Learning
A deep network represents a mapping y = F(x) (or a conditional distribution P(y|x)) as a composition of simple layers:

F(x) = f_T(f_{T−1}(⋯ f_1(x))), where each layer computes f_i(x) = σ(W_i x + b_i)
– Design the architecture
– Train the model based on optimization
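The layer composition above can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the talk's code; the layer sizes and random weights are assumptions made for the example:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def layer(x, W, b):
    # One layer: f_i(x) = sigma(W_i x + b_i)
    return sigmoid(W @ x + b)

def forward(x, params):
    # F(x) = f_T(f_{T-1}(... f_1(x))): apply the layers in order
    for W, b in params:
        x = layer(x, W, b)
    return x

# Illustrative two-layer network: 3 inputs -> 4 hidden units -> 2 outputs
rng = np.random.default_rng(0)
params = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((2, 4)), np.zeros(2))]
y = forward(np.ones(3), params)
```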
[Figure: a feed-forward network with inputs x1, x2, x3 and bias unit x0 = 1, a hidden layer with bias unit h0(1) = 1, and a one-hot target such as (1, 0, 0)ᵀ]
Forward pass (two-layer network with sigmoid σ and squared loss L = (ŷ − y)²):

a(2) = W(2) ∙ x        h(2) = σ(a(2))
a(3) = W(3) ∙ h(2)     ŷ = σ(a(3))

Backward pass (chain rule, using σ′(a) = σ(a)(1 − σ(a))):

dL/dŷ    = 2(ŷ − y)
dL/da(3) = dL/dŷ ∙ σ(a(3)) ∙ (1 − σ(a(3)))
dL/dW(3) = dL/da(3) ∙ h(2)ᵀ
dL/dh(2) = W(3)ᵀ ∙ dL/da(3)
dL/da(2) = dL/dh(2) ∙ σ(a(2)) ∙ (1 − σ(a(2)))
dL/dW(2) = dL/da(2) ∙ xᵀ
dL/dx    = W(2)ᵀ ∙ dL/da(2)
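The forward and backward passes above can be verified numerically against a finite-difference approximation. A minimal NumPy sketch; the shapes, random data, and the finite-difference check are assumptions made for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_backward(x, y, W2, W3):
    # Forward: a(2) = W(2)x, h(2) = sigma(a(2)), a(3) = W(3)h(2), yhat = sigma(a(3))
    a2 = W2 @ x
    h2 = sigmoid(a2)
    a3 = W3 @ h2
    yhat = sigmoid(a3)
    L = np.sum((yhat - y) ** 2)
    # Backward, step by step as in the derivation above
    d_yhat = 2.0 * (yhat - y)                        # dL/dyhat = 2(yhat - y)
    d_a3 = d_yhat * sigmoid(a3) * (1 - sigmoid(a3))  # dL/da(3)
    d_W3 = np.outer(d_a3, h2)                        # dL/dW(3) = dL/da(3) h(2)^T
    d_h2 = W3.T @ d_a3                               # dL/dh(2) = W(3)^T dL/da(3)
    d_a2 = d_h2 * sigmoid(a2) * (1 - sigmoid(a2))    # dL/da(2)
    d_W2 = np.outer(d_a2, x)                         # dL/dW(2) = dL/da(2) x^T
    return L, d_W2, d_W3

rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), np.array([1.0, 0.0])
W2, W3 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
L, d_W2, d_W3 = forward_backward(x, y, W2, W3)

# Finite-difference check of one entry of dL/dW(2)
eps = 1e-6
W2p = W2.copy()
W2p[0, 0] += eps
num_grad = (forward_backward(x, y, W2p, W3)[0] - L) / eps
```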
– Non-convex, with local optima
– Saddle points
– Severe correlation between dimensions; a highly non-isotropic (ill-shaped) parameter space
Figure 2: zig-zag iteration path for SGD
– Quadratic optimization
– Inverse of the Hessian
– Inverse of the Fisher information matrix (FIM)
– Estimate the scale
– Intuition: the landscape of the cost with respect to the parameters is controlled by the distribution of the inputs/activations: L = ℓ(f(x; θ), y)
– Method: stabilize the distribution of the inputs/activations
Iteration path of SGD (red) and NGD (green)
Example: a loss L(w1, w2) over two inputs with very different scales, 0 < x1 < 2 and 0 < x2 < 0.5. Rescaling them to x1′ = x1/2 < 1 and x2′ = 2 ∙ x2 < 1 puts both inputs on the same scale and makes the loss surface L(w1, w2) far less elongated.
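The conditioning effect of this rescaling can be measured directly: for a quadratic loss, the Hessian in (w1, w2) is proportional to E[x xᵀ], so its condition number quantifies how elongated the loss surface is. A minimal NumPy sketch; the sample size and uniform input distributions are assumptions matching the ranges above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.uniform(0.0, 2.0, n)    # 0 < x1 < 2
x2 = rng.uniform(0.0, 0.5, n)    # 0 < x2 < 0.5
X = np.stack([x1, x2], axis=1)
Xs = np.stack([x1 / 2, x2 * 2], axis=1)  # rescaled: both in (0, 1)

def cond(X):
    # Hessian of the quadratic loss L(w1, w2) is proportional to E[x x^T];
    # return its condition number (largest / smallest eigenvalue)
    H = X.T @ X / len(X)
    ev = np.linalg.eigvalsh(H)
    return ev[-1] / ev[0]
```

Rescaling alone shrinks the condition number considerably, though the inputs remain correlated, which is what the decorrelation step below addresses.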
Whitening the input: centering, stretching, and decorrelating
y=Wx, MSE loss
Standardization: x̂ = (x − E[x]) / std(x)
(centering and stretching)
Running estimate: σ̂² = β ∙ σ̂² + (1 − β) ∙ var(x)
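The standardization and the running-variance update above can be combined into one training step, as in batch normalization. A minimal NumPy sketch; the batch shape, momentum value, and epsilon are illustrative assumptions:

```python
import numpy as np

def bn_train_step(x, running_var, beta=0.9, eps=1e-5):
    # Standardize each feature over the mini-batch:
    # xhat = (x - E[x]) / std(x)
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    xhat = (x - mu) / np.sqrt(var + eps)
    # Moving average of the variance, kept for use at test time:
    # sigma^2 <- beta * sigma^2 + (1 - beta) * var(x)
    new_running_var = beta * running_var + (1 - beta) * var
    return xhat, new_running_var

rng = np.random.default_rng(0)
x = 5.0 + 2.0 * rng.standard_normal((64, 3))   # batch of 64, 3 features
xhat, rv = bn_train_step(x, running_var=np.ones(3))
```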
Residual block (CVPR 2015) Pre-activation Residual block (ECCV 2016)
– Cannot be used for online learning
– Unstable for small mini-batch sizes
[Figure: normalization axes of BN vs. LN]
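The difference between the two is just the axis over which statistics are taken: BN normalizes each feature across the batch, LN normalizes each example across its features, which is why LN sidesteps the small-batch and online-learning issues above. A minimal NumPy sketch over a 2-D activation matrix (the shapes are illustrative assumptions):

```python
import numpy as np

def batch_norm(X, eps=1e-5):
    # BN: per-feature statistics across the mini-batch (axis 0)
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

def layer_norm(X, eps=1e-5):
    # LN: per-example statistics across the features (axis 1),
    # so it is well-defined even for a batch of size 1
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / (sd + eps)

X = np.array([[1.0, 2.0, 3.0]])   # a "batch" of one example
ln = layer_norm(X)
bn = batch_norm(X)
```

With a batch of one, BN's per-feature statistics degenerate (every output is zero), while LN still produces a sensibly normalized vector.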
Ω = {W1, e1, …, WM, eM}
Φ = {V0, d0, …, VM−1, dM−1}
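One concrete reparameterization in this family, assuming the slide refers to weight-vector reparameterizations such as weight normalization (NIPS 2016, cited below), rewrites each weight vector as w = g ∙ v / ‖v‖, separating direction from scale. A minimal NumPy sketch with illustrative shapes:

```python
import numpy as np

def weight_norm(V, g):
    # w_i = g_i * v_i / ||v_i||: direction comes from v_i, scale from g_i
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return g[:, None] * V / norms

rng = np.random.default_rng(0)
V = rng.standard_normal((4, 3))        # 4 weight vectors of dimension 3
g = np.array([1.0, 2.0, 3.0, 4.0])     # one scale per weight vector
W = weight_norm(V, g)
```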
Ioffe & Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.
Arpit et al. Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks. ICML 2016.
Salimans & Kingma. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. NIPS 2016.
Ren et al. Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes. ICLR 2017.
σ²_o, where o = out ∙ H ∙ W (for convolutional layers, statistics are computed per output channel, pooled over the mini-batch and the spatial dimensions H × W)
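The per-channel pooling over batch and spatial dimensions can be written as one reduction over axes (N, H, W). A minimal NumPy sketch; the tensor layout (N, C, H, W) and shapes are assumptions made for illustration:

```python
import numpy as np

def conv_batch_norm(X, eps=1e-5):
    # For conv layers: one mean/variance per channel, pooled over the
    # mini-batch and spatial dimensions (axes N, H, W), as described above
    mu = X.mean(axis=(0, 2, 3), keepdims=True)
    var = X.var(axis=(0, 2, 3), keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
X = 1.0 + 3.0 * rng.standard_normal((8, 2, 5, 5))  # (N, C, H, W)
Xn = conv_batch_norm(X)
```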
[Figures: training curves comparing SGD, SGD + BN, and Adam]
         Cifar-10       Cifar-100
Plain    6.14 ± 0.04    25.52 ± 0.15
WN       6.18 ± 0.34    25.49 ± 0.35
WCBN     6.01 ± 0.16    24.45 ± 0.54

         Cifar-10       Cifar-100
Plain    7.34 ± 0.52    29.38 ± 0.14
WN       7.58 ± 0.40    29.85 ± 0.66
WCBN     6.85 ± 0.25    29.23 ± 0.14