Fei-Fei Li & Justin Johnson & Serena Yeung
Lecture 7 - April 22, 2019
Administrative: Project Proposal due tomorrow, 4/24, on GradeScope (1 person per …)
[Figure: linear classifier as a computational graph — input x and weights W produce scores s; the hinge loss on s and the regularizer R(W) are summed (+) to give the total loss L]
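To make the graph concrete, here is a minimal numpy sketch (mine, not the slide's code) of the multiclass SVM hinge loss plus L2 regularization:

```python
import numpy as np

def svm_loss(W, x, y, reg=1e-3):
    """Multiclass SVM (hinge) loss for one example.
    W: (C, D) weights, x: (D,) input, y: index of the correct class."""
    s = W.dot(x)                           # scores, shape (C,)
    margins = np.maximum(0, s - s[y] + 1)  # hinge with margin 1
    margins[y] = 0                         # don't count the correct class
    data_loss = margins.sum()
    R = reg * np.sum(W * W)                # L2 regularization
    return data_loss + R                   # total loss L

W = 0.01 * np.random.randn(10, 3072)
x = np.random.randn(3072)
print(svm_loss(W, x, y=3))
```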
[Figure: a 2-layer fully-connected network — input x (3072) → W1 → hidden layer h (100) → W2 → scores s (10)]
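A minimal forward-pass sketch of this 2-layer net (my own illustration; the choice of ReLU for the hidden nonlinearity is an assumption):

```python
import numpy as np

x = np.random.randn(3072)               # flattened 32x32x3 CIFAR-10 image
W1 = 0.01 * np.random.randn(100, 3072)  # first layer weights
W2 = 0.01 * np.random.randn(10, 100)    # second layer weights

h = np.maximum(0, W1.dot(x))            # hidden layer, shape (100,)
s = W2.dot(h)                           # class scores, shape (10,)
```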
[Figure: ConvNet architecture — illustration of LeCun et al. 1998, from CS231n 2017 Lecture 1]
[Figure: a 32×32×3 image; convolve (slide) a 5×5×3 filter over all spatial locations to produce a 28×28 activation map]
Convolution Layer: for a 32×32×3 input, if we had six 5×5 filters, we get 6 separate 28×28 activation maps. We stack these up to get a “new image” of size 28×28×6!
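A naive numpy sketch of this convolution (stride 1, no padding, bias omitted); real implementations use im2col or cuDNN, so this is illustrative only:

```python
import numpy as np

x = np.random.randn(3, 32, 32)          # input: 3 channels, 32x32
filters = np.random.randn(6, 3, 5, 5)   # six 5x5x3 filters
out = np.zeros((6, 28, 28))             # 6 activation maps, each 28x28 = (32-5+1)

for k in range(6):                      # one activation map per filter
    for i in range(28):
        for j in range(28):
            # dot product of the filter with a 5x5x3 patch of the input
            out[k, i, j] = np.sum(x[:, i:i+5, j:j+5] * filters[k])
```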
Sigmoid: σ(x) = 1 / (1 + e⁻ˣ) — squashes numbers to the range [0, 1]; has a nice interpretation as a saturating “firing rate” of a neuron.
3 problems: 1. Saturated neurons “kill” the gradients
[Figure: backpropagation through a sigmoid gate — what happens to the gradient when x = −10? x = 0? x = 10?]
2. Sigmoid outputs are not zero-centered
[Figure: hypothetical 2D case — when the inputs to a neuron are always positive, the gradients on w are all positive or all negative, so updates are restricted to two allowed directions and follow an inefficient zig-zag path]
3. exp() is a bit compute expensive
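To see problem 1 numerically, a quick sketch: the local gradient σ′(x) = σ(x)(1 − σ(x)) is essentially zero once the input saturates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [-10.0, 0.0, 10.0]:
    s = sigmoid(x)
    grad = s * (1 - s)   # local gradient of the sigmoid gate
    print(f"x={x:+.0f}  sigmoid={s:.5f}  local grad={grad:.2e}")
# at x = ±10 the local gradient is ~5e-05, so upstream gradients are "killed"
```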
tanh(x): squashes numbers to the range [−1, 1]; zero-centered (nice), but still kills gradients when saturated. [LeCun et al., 1991]
ReLU (Rectified Linear Unit): f(x) = max(0, x) — does not saturate in the + region, is very computationally efficient, and converges much faster than sigmoid/tanh in practice (e.g. 6x). [Krizhevsky et al., 2012]
An annoyance — hint: what is the gradient when x < 0?
[Figure: backpropagation through a ReLU gate — the gradient is zero whenever x < 0]
Leaky ReLU: f(x) = max(0.01x, x). [Maas et al., 2013] [He et al., 2015]
Parametric ReLU (PReLU): f(x) = max(αx, x); backprop into α (parameter). [Maas et al., 2013] [He et al., 2015]
Exponential Linear Unit (ELU): compared with Leaky ReLU, adds some robustness to noise. [Clevert et al., 2015]
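A compact numpy sketch of these ReLU variants (the α defaults are common choices, not values prescribed by the slides):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):          # small fixed slope for x < 0
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):                    # alpha is learned via backprop
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):                  # smooth; saturates to -alpha for x << 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))
```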
Maxout “neuron”: max(w₁ᵀx + b₁, w₂ᵀx + b₂) — generalizes ReLU and Leaky ReLU, but doubles the number of parameters per neuron. [Goodfellow et al., 2013]
Data preprocessing: zero-center, then normalize the data. (Assume X [NxD] is the data matrix, each example in a row.)
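A minimal numpy sketch of the zero-center / normalize step (the toy data is fabricated for illustration):

```python
import numpy as np

X = np.random.randn(50, 3072) * 5 + 2   # fake data: N=50 examples, D=3072 features

X -= np.mean(X, axis=0)   # zero-center: subtract per-feature mean
X /= np.std(X, axis=0)    # normalize: divide by per-feature std
```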
[Figure: recall the zig-zag problem — non-zero-centered inputs restrict the allowed gradient update directions]
PCA and whitening: after PCA (decorrelation), the data has a diagonal covariance matrix; after whitening, the covariance matrix is the identity matrix.
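A sketch of PCA decorrelation and whitening via the covariance eigendecomposition (assumes zero-centered data; the 1e-5 fudge factor is a common stabilizer):

```python
import numpy as np

X = np.random.randn(500, 10)
X -= X.mean(axis=0)                     # zero-center first

cov = X.T.dot(X) / X.shape[0]           # D x D covariance matrix
U, S, _ = np.linalg.svd(cov)            # eigenvectors U, eigenvalues S

Xrot = X.dot(U)                         # decorrelate: diagonal covariance
Xwhite = Xrot / np.sqrt(S + 1e-5)       # whiten: ~identity covariance
```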
Before normalization: classification loss is very sensitive to changes in the weight matrix; hard to optimize.
After normalization: less sensitive to small changes in weights; easier to optimize.
In practice for images, it is not common to do PCA or whitening; centering (subtracting the mean image or per-channel mean) is typical.
Weight initialization experiment: forward pass for a 6-layer net with hidden size 4096, tanh activations, and weights drawn from a Gaussian with std 0.01.
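A reconstruction of the experiment from this description (the batch size of 16 is my assumption):

```python
import numpy as np

dims = [4096] * 7                 # 6 layers, hidden size 4096
x = np.random.randn(16, dims[0])  # batch of random inputs
hs = []

for Din, Dout in zip(dims[:-1], dims[1:]):
    W = 0.01 * np.random.randn(Din, Dout)  # small Gaussian init
    x = np.tanh(x.dot(W))                  # tanh activations
    hs.append(x)

print([f"{h.std():.4f}" for h in hs])  # stds shrink toward zero layer by layer
```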
All activations tend to zero for deeper network layers. Q: What do the gradients dL/dW look like?
A: All zero, no learning =(
Increase std of initial weights from 0.01 to 0.05
All activations saturate. Q: What do the gradients look like?
A: Local gradients all zero, no learning =(
“Xavier” initialization: std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010
“Just right”: Activations are nicely scaled for all layers!
For conv layers, Din is kernel_size² × input_channels.
Derivation: y = Wx, h = f(y)
Var(y_i) = Din × Var(x_i w_i)   [assume x, w are iid]
= Din × (E[x_i²]E[w_i²] − E[x_i]²E[w_i]²)   [assume x, w independent]
= Din × Var(x_i) × Var(w_i)   [assume x, w are zero-mean]
If Var(w_i) = 1/Din then Var(y_i) = Var(x_i)
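A quick numerical check of this derivation (a sketch assuming zero-mean, unit-variance inputs):

```python
import numpy as np

Din, Dout, N = 4096, 4096, 16
x = np.random.randn(N, Din)                    # Var(x_i) = 1
W = np.random.randn(Din, Dout) / np.sqrt(Din)  # Xavier: std = 1/sqrt(Din)
y = x.dot(W)
print(x.std(), y.std())  # both ~1: variance is preserved before the nonlinearity
```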
Change from tanh to ReLU
Xavier assumes a zero-centered activation function. With ReLU, activations collapse to zero again, no learning =(
Kaiming / MSRA initialization — ReLU correction: std = sqrt(2 / Din)
He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015
“Just right”: Activations are nicely scaled for all layers!
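The same depth experiment with ReLU and the He correction, as a sketch:

```python
import numpy as np

dims = [4096] * 7
x = np.random.randn(16, dims[0])
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = np.random.randn(Din, Dout) * np.sqrt(2.0 / Din)  # std = sqrt(2/Din)
    x = np.maximum(0, x.dot(W))                          # ReLU
    print(f"{x.std():.3f}")  # stds stay roughly constant across layers
```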
Proper initialization is an active area of research:
- Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al, 2013
- Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al, 2015
- Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al, 2015
- All you need is a good init, Mishkin and Matas, 2015
- Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019
Batch Normalization: “you want zero-mean unit-variance activations? just make them so.”
[Ioffe and Szegedy, 2015]
Input x has shape N × D. Compute the per-channel mean μ (shape D) and per-channel variance σ² (shape D); the normalized x̂ = (x − μ) / sqrt(σ² + ε) has shape N × D.
[Ioffe and Szegedy, 2015]
Problem: what if zero-mean, unit variance is too hard a constraint?
Output y = γ x̂ + β (shape N × D), with learnable scale and shift parameters γ, β of shape D. Learning γ = σ, β = μ would recover the identity function.
Estimates depend on minibatch; can’t do this at test-time!
At test time, μ and σ² are replaced by (running) averages of the values seen during training.
During testing, batchnorm becomes a linear operator! It can be fused with the previous fully-connected or conv layer.
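Putting the pieces together, a minimal numpy sketch of a batchnorm forward pass for fully-connected inputs, covering both modes (the momentum value is a common choice, not prescribed here):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mu, running_var,
                      train=True, eps=1e-5, momentum=0.9):
    """x: (N, D); gamma, beta, running_mu, running_var: (D,) arrays."""
    if train:
        mu = x.mean(axis=0)    # per-channel minibatch mean, shape (D,)
        var = x.var(axis=0)    # per-channel minibatch variance, shape (D,)
        # keep running averages (updated in place) for use at test time
        running_mu[:] = momentum * running_mu + (1 - momentum) * mu
        running_var[:] = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mu, running_var  # fixed stats: now a linear operator
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # learnable scale and shift
```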
Batch Normalization for fully-connected networks: x is N × D; normalize over the batch dimension N.
Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D): x is N × C × H × W; normalize over N, H, W.
[Ioffe and Szegedy, 2015]
Usually inserted after fully-connected or convolutional layers, and before the nonlinearity: FC → BN → tanh → FC → BN → tanh → ...
Behaves differently during training and testing: this is a very common source of bugs!
Layer Normalization for fully-connected networks: normalize over the feature dimension D instead of the batch dimension N. Same behavior at train and test! Can be used in recurrent networks.
Ba, Kiros, and Hinton, “Layer Normalization”, arXiv 2016
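Layer norm differs only in the axis the statistics are taken over; a minimal sketch:

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D). Statistics are per-example (over D), not per-batch."""
    mu = x.mean(axis=1, keepdims=True)    # shape (N, 1)
    var = x.var(axis=1, keepdims=True)    # shape (N, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta           # gamma, beta have shape (D,)
```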
Instance Normalization for convolutional networks: normalize over the spatial dimensions H, W only (per example, per channel). Same behavior at train / test!
Ulyanov et al, “Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis”, CVPR 2017
Group Normalization: normalize over groups of channels (and the spatial dimensions) — a middle ground between Layer Norm and Instance Norm. Wu and He, “Group Normalization”, ECCV 2018
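A minimal sketch of group norm for conv features (G = 32 is the paper's default group count; the helper below is hypothetical, not a library API):

```python
import numpy as np

def groupnorm_forward(x, gamma, beta, G=32, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: (1, C, 1, 1); C must be divisible by G."""
    N, C, H, W = x.shape
    xg = x.reshape(N, G, C // G, H, W)           # split channels into G groups
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)  # stats per (example, group)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((xg - mu) / np.sqrt(var + eps)).reshape(N, C, H, W)
    return gamma * x_hat + beta
```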