DATA ANALYTICS USING DEEP LEARNING
GT 8803 // FALL 2019 // JOY ARULRAJ
LECTURE #12: TRAINING NEURAL NETWORKS (PT 1)
ADMINISTRIVIA
Reminders:
– Integration with Eva
– Code reviews
– Each team must send Pull Requests to Eva
– Activation Functions, Preprocessing, Weight Initialization, Regularization, Gradient Checking
– Babysitting the Learning Process, Parameter Updates, Hyperparameter Optimization
– Model ensembles, Test-time augmentation
– Activation Functions
– Data Preprocessing
– Weight Initialization
– Batch Normalization
Sigmoid, σ(x) = 1 / (1 + e^(-x)): squashes numbers into the range [0, 1]; nice interpretation as a saturating “firing rate” of a neuron.
3 problems:
1. Saturated neurons “kill” the gradients
Sigmoid gate: when the input is far from zero (e.g. x = -10 or x = 10), the local gradient is nearly 0, so almost no gradient flows back through the gate.
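To make the saturation concrete, here is a minimal NumPy sketch (not from the slides): the local gradient of the sigmoid, σ(x) * (1 - σ(x)), is essentially zero at x = -10 and x = 10.

import numpy as np

def sigmoid(x):
    # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, 0.0, 10.0])
s = sigmoid(x)
local_grad = s * (1 - s)   # d(sigmoid)/dx

print(s)           # ~[0.000045, 0.5, 0.999955]
print(local_grad)  # ~[0.000045, 0.25, 0.000045]  -> gradient "killed" at both tails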
2. Sigmoid outputs are not zero-centered
Why is this a problem? If the inputs to a neuron are always positive, the gradients on its weights are either all positive or all negative (only two allowed gradient update directions), so the updates follow a zig-zag path toward the hypothetical optimal weight vector instead of moving straight to it.
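A tiny illustrative sketch (assumed values, not from the slides): with all-positive inputs x, every per-weight gradient dL/dw_i = x_i * (dL/df) shares the sign of the upstream gradient, which is what forces the zig-zag updates.

import numpy as np

x = np.array([0.3, 1.2, 2.5])   # all-positive inputs to one neuron
upstream = -0.7                 # dL/df arriving from the layer above
grad_w = upstream * x           # dL/dw_i = x_i * dL/df

print(grad_w)                   # every entry shares the sign of `upstream`
# Updates can therefore only move w in an all-positive or all-negative direction,
# producing the zig-zag path toward the optimum.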
3. exp() is a bit compute expensive
tanh(x): zero-centered, but still kills gradients when saturated [LeCun et al., 1991]
ReLU (Rectified Linear Unit), f(x) = max(0, x): converges much faster than sigmoid/tanh in practice (e.g. 6x) [Krizhevsky et al., 2012]
Hint: what is the gradient of ReLU when x < 0?
ReLU gate: for x < 0 the local gradient is 0, so no gradient flows back and the neuron stops learning (a “dead ReLU”).
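A minimal sketch of the ReLU gate and its local gradient, with assumed input values:

import numpy as np

def relu(x):
    return np.maximum(0, x)

x = np.array([-10.0, 0.0, 10.0])
out = relu(x)
local_grad = (x > 0).astype(float)   # d(ReLU)/dx: 1 for x > 0, 0 otherwise

print(out)         # [ 0.  0. 10.]
print(local_grad)  # [0. 0. 1.]  -> no gradient at all for x < 0 (and at x = 0 by convention)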
Leaky ReLU, f(x) = max(0.01x, x) [Maas et al., 2013] [He et al., 2015]
Parametric ReLU (PReLU), f(x) = max(αx, x): backprop into α (a learned parameter) [Maas et al., 2013] [He et al., 2015]
ELU (Exponential Linear Unit) [Clevert et al., 2015]: compared with Leaky ReLU, adds some robustness to noise.
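For reference, a short sketch of these three variants using their standard definitions (the alpha values shown are assumed defaults, not taken from the slides):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # small fixed slope for x < 0, so gradients never fully die
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # same form, but alpha is a learned parameter (updated by backprop)
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # smooth exponential saturation for x < 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.linspace(-3, 3, 7)
print(leaky_relu(x))
print(elu(x))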
Maxout “neuron” [Goodfellow et al., 2013]: computes max(w1ᵀx + b1, w2ᵀx + b2), generalizing ReLU and Leaky ReLU at the cost of doubling the parameters per neuron.
Data preprocessing: zero-center the data, then normalize it. (Assume X [NxD] is the data matrix, each example in a row.)
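A minimal NumPy sketch of the two steps, assuming X is the [NxD] data matrix described above (the data here is synthetic):

import numpy as np

X = np.random.randn(100, 3) * 5.0 + 2.0   # fake [NxD] data: N=100 examples, D=3 features

X -= np.mean(X, axis=0)   # zero-center: subtract the per-feature mean
X /= np.std(X, axis=0)    # normalize: divide by the per-feature standard deviation

print(X.mean(axis=0))  # ~0 for every feature
print(X.std(axis=0))   # ~1 for every feature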
(Recall the earlier zig-zag picture: if inputs are not zero-centered, the allowed gradient update directions force a zig-zag path toward the hypothetical optimal weight vector.)
Beyond zero-centering and normalization, the data can also be decorrelated with PCA (data has diagonal covariance matrix) or whitened (covariance matrix is the identity matrix).
Before normalization: classification loss is very sensitive to changes in the weight matrix; hard to optimize.
After normalization: less sensitive to small changes in weights; easier to optimize.
In practice for images: subtract the mean image (mean image = [32,32,3] array), or subtract the per-channel mean (mean along each channel = 3 numbers), optionally also dividing by the per-channel std.
Not common to do PCA or whitening.
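A short sketch of per-channel normalization for a batch of images; the batch shape and computing the statistics directly on this batch are illustrative assumptions (normally the mean/std come from the training set):

import numpy as np

# fake batch of images: N x H x W x C
images = np.random.randint(0, 256, size=(8, 32, 32, 3)).astype(np.float32)

channel_mean = images.mean(axis=(0, 1, 2))   # mean along each channel: 3 numbers
channel_std = images.std(axis=(0, 1, 2))     # std along each channel: 3 numbers

images = (images - channel_mean) / channel_std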
Weight initialization experiment: forward pass for a 6-layer net with hidden size 4096, with weights initialized to small random values (std = 0.01).
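A sketch of this experiment (the layer sizes and the 0.01 scale come from the surrounding slides; the batch size and the printed statistics are assumed):

import numpy as np

dims = [4096] * 7                      # input plus 6 hidden layers of size 4096
x = np.random.randn(16, dims[0])       # a small batch of random inputs
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = 0.01 * np.random.randn(Din, Dout)   # small random init (std = 0.01)
    x = np.tanh(x.dot(W))
    print(x.mean(), x.std())           # the std shrinks toward 0 layer by layer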
All activations tend to zero for deeper network layers.
Q: What do the gradients dL/dW look like?
A: All zero, no learning =(
Increase std of initial weights from 0.01 to 0.05
All activations saturate.
Q: What do the gradients look like?
A: Local gradients all zero, no learning =(
“Xavier” initialization: std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010
“Just right”: Activations are nicely scaled for all layers!
For conv layers, Din is kernel_size^2 * input_channels
Derivation (y = Wx, h = f(y)):
Var(y_i) = Din * Var(x_i * w_i)                              [assume x, w are iid]
         = Din * (E[x_i^2] * E[w_i^2] - E[x_i]^2 * E[w_i]^2)  [assume x, w independent]
         = Din * Var(x_i) * Var(w_i)                          [assume x, w are zero-mean]
If Var(w_i) = 1/Din then Var(y_i) = Var(x_i).
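The same experiment sketch as before, with the weight scale changed to Xavier initialization, std = 1/sqrt(Din); everything else is assumed as above:

import numpy as np

dims = [4096] * 7
x = np.random.randn(16, dims[0])
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = np.random.randn(Din, Dout) / np.sqrt(Din)   # "Xavier": std = 1/sqrt(Din)
    x = np.tanh(x.dot(W))
    print(x.mean(), x.std())   # activation std stays roughly constant across layers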
Change from tanh to ReLU
Xavier assumes a zero-centered activation function.
Activations collapse to zero again, no learning =(
ReLU correction: std = sqrt(2 / Din)
“Just right”: Activations are nicely scaled for all layers!
He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015
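And once more with ReLU activations and the std = sqrt(2 / Din) correction (an illustrative sketch, not the slides' exact code):

import numpy as np

dims = [4096] * 7
x = np.random.randn(16, dims[0])
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = np.random.randn(Din, Dout) * np.sqrt(2.0 / Din)   # He init for ReLU
    x = np.maximum(0, x.dot(W))                           # ReLU instead of tanh
    print(x.mean(), x.std())   # activations stay nicely scaled across layers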
Proper initialization is an active area of research:
– Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, 2010
– Saxe et al., “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”, 2013
– Sussillo and Abbott, “Random walk initialization for training very deep feedforward networks”, 2014
– He et al., “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification”, 2015
– Krähenbühl et al., “Data-dependent initializations of convolutional neural networks”, 2015
– Frankle and Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks”, 2019
Batch Normalization: “you want zero-mean unit-variance activations? just make them so.”
[Ioffe and Szegedy, 2015]
Given input x of shape N x D: compute the per-channel mean (shape D) and per-channel variance (shape D), then the normalized x_hat = (x - mean) / sqrt(var + eps), shape N x D.
Problem: What if zero-mean, unit variance is too hard of a constraint?
Solution: add learnable scale and shift parameters γ and β (each of shape D); the output y = γ * x_hat + β has shape N x D.
Problem: the mean and variance estimates depend on the minibatch; can't do this at test-time!
At test-time, use (running) averages of the mean and variance values seen during training instead of the minibatch estimates. During testing batchnorm thus becomes a linear operator! It can be fused with the previous fully-connected or conv layer.
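A condensed sketch of a batchnorm layer's forward pass under these rules; the momentum value and the initialization of the running statistics are illustrative assumptions:

import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    if train:
        mean = x.mean(axis=0)                 # per-channel mean, shape D
        var = x.var(axis=0)                   # per-channel variance, shape D
        # keep running averages for use at test-time
        running_mean[:] = momentum * running_mean + (1 - momentum) * mean
        running_var[:] = momentum * running_var + (1 - momentum) * var
    else:
        mean, var = running_mean, running_var  # fixed statistics: a linear operator
    x_hat = (x - mean) / np.sqrt(var + eps)    # normalized x, shape N x D
    return gamma * x_hat + beta                # output, shape N x D

# usage sketch
N, D = 16, 4
x = np.random.randn(N, D)
gamma, beta = np.ones(D), np.zeros(D)
running_mean, running_var = np.zeros(D), np.ones(D)
y = batchnorm_forward(x, gamma, beta, running_mean, running_var, train=True)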
Batch Normalization for fully-connected networks normalizes over the batch dimension; Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D) normalizes over the batch and spatial dimensions, keeping one mean and variance per channel.
Usually inserted after fully-connected or convolutional layers and before the nonlinearity: FC → BN → tanh → FC → BN → tanh → ... [Ioffe and Szegedy, 2015]
Batchnorm behaves differently during training and testing: this is a very common source of bugs!
Layer Normalization for fully-connected networks: normalize over the feature dimension instead of the batch dimension (as Batch Normalization does). Same behavior at train and test! Can be used in recurrent networks.
Ba, Kiros, and Hinton, “Layer Normalization”, arXiv 2016
Instance Normalization for convolutional networks: normalize over the spatial dimensions only, separately per sample and per channel (Batch Normalization for convolutional networks normalizes over the batch and spatial dimensions). Same behavior at train / test!
Ulyanov et al., “Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis”, CVPR 2017
Group Normalization: normalize per sample over the spatial dimensions and groups of channels. Wu and He, “Group Normalization”, ECCV 2018
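A compact sketch (tensor shapes and the group count are assumed) contrasting which axes each scheme averages over for an N x C x H x W activation tensor:

import numpy as np

N, C, H, W = 8, 16, 32, 32
x = np.random.randn(N, C, H, W)
eps = 1e-5

# BatchNorm2D: one mean/var per channel, computed over batch and spatial dims
bn = (x - x.mean(axis=(0, 2, 3), keepdims=True)) / np.sqrt(x.var(axis=(0, 2, 3), keepdims=True) + eps)

# LayerNorm: per sample, over all channels and spatial positions
ln = (x - x.mean(axis=(1, 2, 3), keepdims=True)) / np.sqrt(x.var(axis=(1, 2, 3), keepdims=True) + eps)

# InstanceNorm: per sample and per channel, over spatial positions only
inorm = (x - x.mean(axis=(2, 3), keepdims=True)) / np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)

# GroupNorm: per sample, over spatial positions and a group of channels (here 4 groups)
g = 4
xg = x.reshape(N, g, C // g, H, W)
gn = (xg - xg.mean(axis=(2, 3, 4), keepdims=True)) / np.sqrt(xg.var(axis=(2, 3, 4), keepdims=True) + eps)
gn = gn.reshape(N, C, H, W)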
Next time:
– Parameter update schemes
– Learning rate schedules
– Gradient checking
– Regularization (Dropout etc.)
– Babysitting learning
– Hyperparameter search
– Evaluation (Ensembles etc.)
– Transfer learning / fine-tuning