Lecture 10: Training Neural Networks (Part 1)
Justin Johnson, October 7, 2019
Reminder: A3 is due Monday, October 14 (1 week from today!). Remember to run the validation script!
Last time: deep learning hardware: CPU, GPU, TPU.
Sigmoid, σ(x) = 1 / (1 + exp(-x)): squashes numbers to the range [0, 1], and has a nice interpretation as a saturating "firing rate" of a neuron.
3 problems:
1. Saturated neurons "kill" the gradients
(Sigmoid gate in the computational graph: when the input x is very negative or very positive, the local gradient dσ/dx is nearly zero, so almost no gradient flows backward through the gate.)
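A minimal NumPy sketch (not from the slides) makes the gradient-killing concrete:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # local gradient: dsigma/dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, 0.0, 10.0]:
    print(f"x = {x:6.1f}  sigmoid = {sigmoid(x):.5f}  local grad = {sigmoid_grad(x):.2e}")

# At x = -10 and x = +10 the local gradient is about 4.5e-05, so whatever
# gradient arrives from upstream is multiplied by this tiny number and is
# effectively "killed".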
2. Sigmoid outputs are not zero-centered
(Figure: because all inputs to the next layer are positive, the allowed gradient update directions for a weight vector lie in only two quadrants, so reaching a hypothetical optimal weight vector requires inefficient zig-zag updates. This is also a reason to prefer zero-mean data.)
3. exp() is a bit compute expensive
ReLU (Rectified Linear Unit), f(x) = max(0, x): does not saturate in the positive region, is very computationally efficient, and converges much faster than sigmoid/tanh in practice (e.g. 6x).
Hint: what is the gradient when x < 0?
(ReLU gate: for x < 0 the local gradient is exactly zero, so no gradient flows backward; a neuron that always receives negative inputs becomes a "dead ReLU" and never updates.)
Leaky ReLU: f(x) = max(0.01x, x), so the gradient never becomes exactly zero and the unit will not "die". (Maas et al, "Rectifier Nonlinearities Improve Neural Network Acoustic Models", ICML 2013)
Parametric ReLU (PReLU): f(x) = max(αx, x), where we backprop into α (a learned parameter). (Maas et al, "Rectifier Nonlinearities Improve Neural Network Acoustic Models", ICML 2013; He et al, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", ICCV 2015)
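A brief PyTorch sketch (illustrative, not from the slides) of the fixed-slope and learnable-slope versions:

import torch
import torch.nn as nn

x = torch.randn(8, 16)

leaky = nn.LeakyReLU(negative_slope=0.01)       # fixed slope for x < 0
prelu = nn.PReLU(num_parameters=1, init=0.25)   # slope alpha is a learnable parameter

y1 = leaky(x)
y2 = prelu(x)

# alpha receives gradients like any other parameter and is updated by the optimizer
y2.sum().backward()
print(prelu.weight.grad)  # gradient with respect to alpha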
Exponential Linear Unit (ELU), default α = 1: all the benefits of ReLU, with outputs closer to zero mean; compared with Leaky ReLU, the negative saturation regime adds some robustness to noise.
Scaled Exponential Linear Unit (SELU): a scaled ELU that works better for deep networks. Its "self-normalizing" property means you can train deep SELU networks without BatchNorm.
α = 1.6732632423543772848170429916717, λ = 1.0507009873554804934193349852946
(Klambauer et al, "Self-Normalizing Neural Networks", NeurIPS 2017)
The derivation takes 91 pages of math in the appendix…
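For reference, a minimal NumPy sketch of the SELU nonlinearity using the constants above (an illustration, not the authors' code):

import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # lambda * x for x > 0, lambda * alpha * (exp(x) - 1) for x <= 0
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

x = np.random.randn(10000)
print(selu(x).mean(), selu(x).std())  # output stays roughly zero-mean, unit-variance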
[Bar chart: CIFAR-10 accuracy (roughly 93–96%) of ResNet, Wide ResNet, and DenseNet trained with ReLU, Leaky ReLU, Parametric ReLU, Softplus, ELU, SELU, GELU, and Swish. All activations give broadly similar accuracy, within roughly 1–2% of each other.]
Ramachandran et al, “Searching for activation functions”, ICLR Workshop 2018
Data preprocessing (assume X [N x D] is the data matrix, with each example in a row).
(Recall from the sigmoid discussion: with inputs that are not zero-centered, the allowed gradient update directions are restricted, and reaching a hypothetical optimal weight vector requires zig-zagging. Zero-centering the data avoids this.)
More advanced: decorrelate the data with PCA (so the data has a diagonal covariance matrix), or whiten it (so the covariance matrix is the identity matrix).
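A minimal NumPy sketch of these steps (illustrative, assuming X is the [N x D] data matrix from above):

import numpy as np

X = np.random.randn(500, 3) @ np.random.randn(3, 3) + 5.0  # fake correlated, non-centered data

# Zero-center and normalize (the common case)
X_centered = X - X.mean(axis=0)
X_normalized = X_centered / X_centered.std(axis=0)

# PCA decorrelation and whitening (less common for images)
cov = X_centered.T @ X_centered / X_centered.shape[0]
eigvals, eigvecs = np.linalg.eigh(cov)
X_decorrelated = X_centered @ eigvecs                    # diagonal covariance
X_whitened = X_decorrelated / np.sqrt(eigvals + 1e-5)    # covariance is ~identity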
Before normalization: the classification loss is very sensitive to changes in the weight matrix, so it is hard to optimize. After normalization: the loss is less sensitive to small changes in the weights, so it is easier to optimize.
In practice, it is not common to do PCA or whitening.
Q: What happens if we initialize all W=0, b=0?
A: All outputs are 0 and all gradients are the same! There is no "symmetry breaking", so every neuron learns the same thing.
Next idea: small random numbers (Gaussian with zero mean, std = 0.01). Forward pass for a 6-layer net with hidden size 4096, sketched below:
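Roughly what that experiment looks like in NumPy (a sketch; the batch size of 16 is an assumption):

import numpy as np

dims = [4096] * 7              # 6 layers of hidden size 4096
hs = []                        # record the activations at each layer
x = np.random.randn(16, dims[0])
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = 0.01 * np.random.randn(Din, Dout)   # small random init
    x = np.tanh(x.dot(W))
    hs.append(x)

for i, h in enumerate(hs):
    print(f"layer {i + 1}: mean {h.mean():+.6f}, std {h.std():.6f}")
# The std shrinks toward zero with depth: the activations collapse.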
All activations tend to zero for deeper network layers. Q: What do the gradients dL/dW look like?
A: All zero, no learning =(
Increase std of initial weights from 0.01 to 0.05
All activations saturate. Q: What do the gradients look like?
A: Local gradients are all zero, no learning =(
"Xavier" initialization: std = 1/sqrt(Din)
(Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural networks", AISTATS 2010)
"Just right": activations are nicely scaled for all layers!
For conv layers, Din is kernel_size^2 * input_channels.
Why does "Xavier" initialization (std = 1/sqrt(Din)) work? For y = Wx, each output is y_i = sum_j x_j * w_j, a sum over the Din inputs, so:

Var(y_i) = Din * Var(x_i * w_i)                                [Assume x, w are iid]
         = Din * (E[x_i^2] * E[w_i^2] - E[x_i]^2 * E[w_i]^2)   [Assume x, w independent]
         = Din * Var(x_i) * Var(w_i)                           [Assume x, w are zero-mean]

If Var(w_i) = 1/Din, then Var(y_i) = Var(x_i): the variance of the activations is preserved from layer to layer, which is exactly what std = 1/sqrt(Din) gives.
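Plugged into the earlier forward-pass experiment, the only change is the scale of W (a sketch; the batch and layer sizes are the same illustrative values as before):

import numpy as np

dims = [4096] * 7
x = np.random.randn(16, dims[0])
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = np.random.randn(Din, Dout) / np.sqrt(Din)   # "Xavier": std = 1/sqrt(Din)
    x = np.tanh(x.dot(W))
    print(f"std of activations: {x.std():.4f}")      # stays roughly constant with depth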
Change from tanh to ReLU
Xavier assumes a zero-centered activation function. With ReLU, the activations collapse to zero again: no learning =(
Kaiming / MSRA initialization, i.e. the ReLU correction: std = sqrt(2 / Din).
"Just right": activations are nicely scaled for all layers.
(He et al, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", ICCV 2015)
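In PyTorch this correction is available as a built-in initializer; a brief usage sketch (the layer shapes are illustrative):

import torch.nn as nn

layer = nn.Linear(4096, 4096)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')  # std = sqrt(2 / Din), fan_in mode
nn.init.zeros_(layer.bias)

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
nn.init.kaiming_normal_(conv.weight, nonlinearity='relu')   # here Din = 3 * 3 * 64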
Residual block: x goes through conv, relu, conv to produce F(x), and the block outputs relu(F(x) + x).
If we initialize with MSRA, then Var(F(x)) = Var(x). But then Var(F(x) + x) > Var(x): the variance grows with each block!
Solution: initialize the first conv with MSRA, and initialize the second conv to zero. Then F(x) = 0 at initialization, so Var(x + F(x)) = Var(x).
Zhang et al, “Fixup Initialization: Residual Learning Without Normalization”, ICLR 2019
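A minimal PyTorch sketch of that trick (illustrative; not the full Fixup scheme from the paper, which also rescales branches and adds extra scalars and biases):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        nn.init.kaiming_normal_(self.conv1.weight, nonlinearity='relu')  # MSRA for the first conv
        nn.init.zeros_(self.conv2.weight)   # zero-init the second conv, so F(x) = 0 at init
        nn.init.zeros_(self.conv1.bias)
        nn.init.zeros_(self.conv2.bias)

    def forward(self, x):
        out = self.conv2(F.relu(self.conv1(x)))
        # At init the residual branch contributes nothing, so Var(out + x) = Var(x)
        return F.relu(out + x)

x = torch.randn(2, 64, 32, 32)
print(ResidualBlock(64)(x).std())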
Proper initialization is an active area of research:
Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al, 2013
Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al, 2015
Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al, 2015
All you need is a good init, Mishkin and Matas, 2015
Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019
Regularization: add a term to the loss. The most common choice is L2 regularization, R(W) = sum of W^2, also known as weight decay.
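In practice this is usually handled by the optimizer rather than written into the loss; a brief PyTorch sketch (the coefficient 1e-4 is just an illustrative value):

import torch
import torch.nn as nn

model = nn.Linear(100, 10)
# weight_decay adds lambda * W to each gradient, i.e. the gradient of (lambda/2) * ||W||^2
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)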
Dropout: in each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common. (Srivastava et al, "Dropout: A simple way to prevent neural networks from overfitting", JMLR 2014)
Example forward pass with a 3-layer network using dropout, sketched below:
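A NumPy sketch along the lines of the slide (train-time dropout with p = 0.5; the weight matrices W1/W2/W3, the ReLU nonlinearity, and the omitted biases are assumptions about the surrounding network):

import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(X, W1, W2, W3):
    H1 = np.maximum(0, X.dot(W1))
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 = H1 * U1                          # drop!
    H2 = np.maximum(0, H1.dot(W2))
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 = H2 * U2                          # drop!
    out = H2.dot(W3)
    return out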
Dropout forces the network to have a redundant representation and prevents co-adaptation of features: the "cat score" cannot rely on any single feature (has an ear, has a tail, is furry, has claws, mischievous look), since each of them may be dropped.
Another interpretation: dropout is training a large ensemble of models (that share parameters). Each binary mask is one model. An FC layer with 4096 units has 2^4096 ≈ 10^1233 possible masks! There are only about 10^82 atoms in the universe...
Dropout makes the output random: y = f_W(x, z), where x is the input (image), y is the output (label), and z is a random mask. At test time we want to average out this randomness.
Consider a single neuron a with inputs x, y and weights w1, w2.
At test time we have: E[a] = w1*x + w2*y.
During training (dropping each input with probability 1/2) we have:
E[a] = 1/4 (w1*x + w2*y) + 1/4 (w1*x + 0*y) + 1/4 (0*x + w2*y) + 1/4 (0*x + 0*y) = 1/2 (w1*x + w2*y)
At test time, drop nothing and multiply the output by the dropout probability, so the test-time output matches the training-time expectation.
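In practice the more common variant is "inverted dropout": rescale by 1/p at training time so that test-time code does nothing special. A minimal NumPy sketch (the helper names are mine):

import numpy as np

p = 0.5  # probability of keeping a unit

def dropout_train(H):
    U = (np.random.rand(*H.shape) < p) / p   # mask and rescale by 1/p at train time
    return H * U

def dropout_test(H):
    return H                                  # test time: do nothing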
[Bar chart: per-layer parameter counts for AlexNet and VGG-16 (conv1 through fc8). The fully-connected layers fc6, fc7, fc8 account for the vast majority of the parameters.]
Recall that AlexNet and VGG have most of their parameters in the fully-connected layers; that is where dropout is usually applied.
Later architectures (GoogLeNet, ResNet, etc) use global average pooling instead of fully-connected layers: they don’t use dropout at all!
For ResNet and later architectures, L2 regularization and Batch Normalization are often the only regularizers!
Data augmentation: during training, load the image and its label, apply a random transform to the image, then run the CNN on the transformed image and compute the loss against the original label (e.g. "cat").
Random crops and scales. ResNet training:
1. Pick a random L in the range [256, 480]
2. Resize the training image so its short side = L
3. Sample a random 224 x 224 patch
ResNet testing:
1. Resize the image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 fixed 224 x 224 crops: 4 corners + center, plus their horizontal flips
(A sketch of the training-time cropping is shown below.)
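A rough sketch of the training-time recipe using torchvision helpers (illustrative; torchvision's resize with an int resizes the short side):

import random
from PIL import Image
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def resnet_train_crop(img: Image.Image):
    L = random.randint(256, 480)                               # 1. pick a random L in [256, 480]
    img = TF.resize(img, L)                                     # 2. resize so the short side = L
    i, j, h, w = T.RandomCrop.get_params(img, output_size=(224, 224))
    return TF.crop(img, i, j, h, w)                             # 3. random 224 x 224 patch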
Color jitter. Simple version: randomize contrast and brightness (used in AlexNet, ResNet, etc.).
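A one-line sketch with torchvision (the 0.4 strengths are arbitrary illustrative values):

import torchvision.transforms as T

color_jitter = T.ColorJitter(brightness=0.4, contrast=0.4)  # randomize brightness and contrast
# augmented = color_jitter(img)  # apply to a PIL image during training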
Regularization: a common pattern.
Training: add some randomness.
Testing: marginalize over the randomness.
Examples: Dropout, Batch Normalization, Data Augmentation.
DropConnect:
Training: drop random connections between neurons (set weights to 0).
Testing: use all the connections.
(Wan et al, "Regularization of Neural Networks using DropConnect", ICML 2013)
Fractional Max Pooling:
Training: use randomized pooling regions.
Testing: average predictions over different samples.
(Graham, "Fractional Max Pooling", arXiv 2014)
Stochastic Depth:
Training: randomly skip some residual blocks in a ResNet.
Testing: use the whole network.
(Huang et al, "Deep Networks with Stochastic Depth", ECCV 2016)
Cutout:
Training: set random image regions to zero.
Testing: use the whole image.
Works very well for small datasets like CIFAR; less common for large datasets like ImageNet.
(DeVries and Taylor, "Improved Regularization of Convolutional Neural Networks with Cutout", arXiv 2017)
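A minimal NumPy sketch of cutout at training time (the 16-pixel square is an arbitrary illustrative size):

import numpy as np

def cutout(img, size=16):
    # Zero out a random size x size square of an (H, W, C) image array
    H, W = img.shape[:2]
    y = np.random.randint(H)
    x = np.random.randint(W)
    y0, y1 = max(0, y - size // 2), min(H, y + size // 2)
    x0, x1 = max(0, x - size // 2), min(W, x + size // 2)
    out = img.copy()
    out[y0:y1, x0:x1] = 0
    return out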
Mixup:
Training: train on random blends of pairs of training images. Testing: use the original images.
Randomly blend the pixels of pairs of training images, e.g. 40% cat + 60% dog, and blend the target labels the same way (cat: 0.4, dog: 0.6).
Sample the blend weight from a beta distribution Beta(a, b) with a = b ≈ 0, so blend weights are usually close to 0 or 1.
(Zhang et al, "mixup: Beyond Empirical Risk Minimization", ICLR 2018)
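A rough NumPy sketch of building a mixup batch (illustrative; assumes images has shape [N, H, W, C], labels is one-hot with shape [N, num_classes], and a = 0.2 is an example coefficient):

import numpy as np

def mixup_batch(images, labels, a=0.2):
    # Blend each example with a randomly chosen partner from the same batch
    N = images.shape[0]
    lam = np.random.beta(a, a, size=N)            # blend weights, mostly near 0 or 1 for small a
    perm = np.random.permutation(N)
    lam_img = lam.reshape(N, 1, 1, 1)
    mixed_images = lam_img * images + (1 - lam_img) * images[perm]
    lam_lab = lam.reshape(N, 1)
    mixed_labels = lam_lab * labels + (1 - lam_lab) * labels[perm]
    return mixed_images, mixed_labels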
Regularization, in practice:
Consider dropout for large fully-connected layers.
Batch normalization and data augmentation are almost always a good idea.
Try cutout and mixup, especially for small classification datasets.