Basics of DL
- Prof. Leal-Taixé and Prof. Niessner
1
What we assume you know:
– Linear Algebra & Programming!
– Basics from the Introduction to Deep Learning lecture
– PyTorch (you can also use TensorFlow)
– You have already trained networks and know
how to debug problems, observe training curves, and prepare training/validation/test data.
2
3
4
On CIFAR-10 On ImageNet
Credit: Li/Karpathy/Johnson
– 2-layers: $f = W_2 \max(0, W_1 x)$
– 3-layers: $f = W_3 \max(0, W_2 \max(0, W_1 x))$
– 4-layers: $f = W_4 \tanh(W_3 \max(0, W_2 \max(0, W_1 x)))$
– 5-layers: $f = W_5 \,\sigma(W_4 \tanh(W_3 \max(0, W_2 \max(0, W_1 x))))$
– … up to hundreds of layers
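As a quick illustration (not from the slides), here is a minimal NumPy sketch of these nested functions; the shapes 16384 → 1000 → 10 are borrowed from the dimensions used a few slides later, and all weight values are random placeholders:

```python
import numpy as np

def relu(z):
    # element-wise max(0, z)
    return np.maximum(0, z)

rng = np.random.default_rng(0)
x  = rng.standard_normal(16384)                  # a flattened 128 x 128 input
W1 = 0.01 * rng.standard_normal((1000, 16384))   # placeholder weights, layer 1
W2 = 0.01 * rng.standard_normal((10, 1000))      # placeholder weights, layer 2
W3 = 0.01 * rng.standard_normal((10, 10))        # placeholder weights, layer 3

f_2layer = W2 @ relu(W1 @ x)                     # f = W2 max(0, W1 x)
f_3layer = W3 @ relu(W2 @ relu(W1 @ x))          # f = W3 max(0, W2 max(0, W1 x))
print(f_2layer.shape, f_3layer.shape)            # (10,) (10,)
```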
5
6
2-layer network: $f = W_2 \max(0, W_1 x)$
– input $x$: 128 × 128 = 16384 values, hidden layer $h$: 1000 units, output scores $f$: 10 (weights $W_1$, $W_2$)

1-layer network: $f = W x$
– input $x$: 128 × 128 = 16384 values, output scores $f$: 10 (weights $W$)
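To make the size difference concrete, a tiny sketch counting the weight parameters of the two networks above (biases ignored; this is just arithmetic on the slide's numbers):

```python
# Weight counts for the two networks above (biases ignored)
in_dim, hidden, out_dim = 128 * 128, 1000, 10

params_1layer = in_dim * out_dim                     # W:  16384 x 10
params_2layer = in_dim * hidden + hidden * out_dim   # W1: 16384 x 1000, W2: 1000 x 10

print(params_1layer)   # 163840
print(params_2layer)   # 16394000
```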
7
Credit: Li/Karpathy/Johnson
8
What is the shape of this function?
9
Loss (Softmax, Hinge) Prediction
10
Evaluate the ground truth score for the image
– Softmax loss: keeps optimizing until the loss is zero
– Hinge loss: saturates whenever it has learned a class “well enough”
11
12
Forward
13
Saturated neurons kill the gradient flow
14
More on zero-mean data later
15
Zero-centered, but still saturates
LeCun 1991
16
Large and consistent gradients; does not saturate; fast convergence. What happens if a ReLU outputs zero? Dead ReLU!
17
Generalization
Linear regimes; does not die; does not saturate; but increases the number of parameters
18
19
A two-layer network with inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, outputs $y_0, y_1$ and targets $t_0, t_1$:
$y_j = A\left(b_{1,j} + \sum_k h_k \, w_{1,j,k}\right)$, $\quad h_k = A\left(b_{0,k} + \sum_i x_i \, w_{0,k,i}\right)$
Per-output loss: $L_j = (y_j - t_j)^2$
Gradient w.r.t. all weights and biases: $\nabla_{w,b} f_{x,t}(w) = \left[\frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{m,n,o}}, \dots, \frac{\partial f}{\partial b_{m,n}}\right]$
Just simple: $A(z) = \max(0, z)$
$\theta^{k+1} = \theta^k - \alpha \, \nabla_\theta L(\theta^k, x_{\{1..n\}}, y_{\{1..n\}})$, $\quad \nabla_\theta L = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i$
Note the terminology: iteration vs epoch
20
$k$ now refers to the $k$-th iteration; $n$ is the number of training samples in the current minibatch; $\nabla_\theta L$ is the gradient for the $k$-th minibatch.
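A minimal sketch of one such minibatch update, assuming a hypothetical grad_fn that returns per-sample gradients (names are illustrative, not from the lecture code):

```python
def sgd_step(theta, grad_fn, x_batch, y_batch, lr=1e-3):
    """One vanilla minibatch gradient descent step: theta <- theta - lr * mean gradient."""
    grads = grad_fn(theta, x_batch, y_batch)   # hypothetical: per-sample gradients, shape (n, ...)
    grad = grads.mean(axis=0)                  # average over the n samples in the minibatch
    return theta - lr * grad
```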
$v^{k+1} = \beta \cdot v^k + \nabla_\theta L(\theta^k)$
$\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$
Exponentially-weighted average of the gradient. Important: the velocity $v^k$ is vector-valued!
21
$\nabla_\theta L(\theta^k)$: gradient of the current minibatch; $v$: velocity; $\beta$: velocity accumulation rate (‘friction’, momentum); $\alpha$: learning rate; $\theta$: model parameters
$\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$
22
The step will be largest when a sequence of gradients all point in the same direction
Hyperparameters are $\alpha$ and $\beta$; $\beta$ is often set to 0.9
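A minimal sketch of the momentum update above (illustrative names; inputs are arrays of the same shape):

```python
def sgd_momentum_step(theta, v, grad, lr=1e-3, beta=0.9):
    """SGD with momentum: the velocity v accumulates an exponentially-weighted
    average of past gradients and has the same shape as theta."""
    v = beta * v + grad        # velocity update
    theta = theta - lr * v     # parameter update
    return theta, v
```

Up to its dampening and Nesterov options, this is the update torch.optim.SGD applies when momentum=0.9 is passed.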
$s^{k+1} = \beta \cdot s^k + (1-\beta)\left[\nabla_\theta L \circ \nabla_\theta L\right]$
$\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$
Hyperparameters: $\alpha$, $\beta$, $\epsilon$
23
$\epsilon$ is typically $10^{-8}$; $\beta$ is often 0.9; $\circ$ denotes element-wise multiplication. The learning rate $\alpha$ needs tuning!
24
X-direction: small gradients. Y-direction: large gradients.
$s^{k+1} = \beta \cdot s^k + (1-\beta)\left[\nabla_\theta L \circ \nabla_\theta L\right]$
$\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$
We are dividing by the square root of the accumulated squared gradients, i.e. the (uncentered) variance of the gradients: directions with large gradients take smaller steps, so we can increase the learning rate!
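The same update as a short sketch (illustrative variable names):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=1e-3, beta=0.9, eps=1e-8):
    """RMSProp: scale each parameter's step by a running average of its squared gradients."""
    s = beta * s + (1.0 - beta) * grad * grad        # element-wise squared gradients
    theta = theta - lr * grad / (np.sqrt(s) + eps)   # per-parameter adaptive step
    return theta, s
```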
Combines Momentum and RMSProp:
$m^{k+1} = \beta_1 \cdot m^k + (1-\beta_1)\,\nabla_\theta L(\theta^k)$
$v^{k+1} = \beta_2 \cdot v^k + (1-\beta_2)\left[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)\right]$
$\theta^{k+1} = \theta^k - \alpha \cdot \frac{m^{k+1}}{\sqrt{v^{k+1}} + \epsilon}$
25
First moment: mean of the gradients. Second moment: (uncentered) variance of the gradients.
Combines Momentum and RMSProp
$m^{k+1} = \beta_1 \cdot m^k + (1-\beta_1)\,\nabla_\theta L(\theta^k)$
$v^{k+1} = \beta_2 \cdot v^k + (1-\beta_2)\left[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)\right]$
$\theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}$
26
$m^{k+1}$ and $v^{k+1}$ are initialized with zero
Typically, bias-corrected moment updates are used: $\hat{m}^{k+1} = \frac{m^{k+1}}{1-\beta_1^{k+1}}$, $\quad \hat{v}^{k+1} = \frac{v^{k+1}}{1-\beta_2^{k+1}}$
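Putting both moments together, a minimal Adam sketch with the bias correction above (illustrative; in practice you would simply use torch.optim.Adam, whose defaults are lr=1e-3, betas=(0.9, 0.999), eps=1e-8):

```python
import numpy as np

def adam_step(theta, m, v, grad, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; k is the 1-based iteration index used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad              # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * grad * grad       # second moment (squared gradients)
    m_hat = m / (1.0 - beta1 ** k)                    # bias correction
    v_hat = v / (1.0 - beta2 ** k)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```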
27
28
29
30
Figure extracted from Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017
31
Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html
32
Find your hyperparameters on the validation set. Typical split: 60% train, 20% validation, 20% test.
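A minimal sketch of such a split (the 60/20/20 ratio from the slide; function name is illustrative):

```python
import numpy as np

def split_dataset(num_samples, train=0.6, val=0.2, seed=0):
    """Shuffle indices and split them into 60% train / 20% validation / 20% test."""
    idx = np.random.default_rng(seed).permutation(num_samples)
    n_train = int(train * num_samples)
    n_val = int(val * num_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_dataset(50000)
```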
33
34
35
36
Krizhevsky 2012
Training time is also a hyperparameter
37
Overfitting
38
Training Set 1 Training Set 2 Training Set 3
39
Srivastava 2014
Forward
40
41
Credit: Li/Karpathy/Johnson
42
Classic 3 × 3 filter kernels applied to the input:
– Edge detection: [[−1, −1, −1], [−1, 8, −1], [−1, −1, −1]]
– Sharpen: [[0, −1, 0], [−1, 5, −1], [0, −1, 0]]
– Box mean: (1/9) · [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
– Gaussian blur: (1/16) · [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
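To see what these kernels do, a small sketch that applies a 3 × 3 kernel to a grayscale image by brute force (illustrative only; real code would use an optimized convolution routine):

```python
import numpy as np

def filter2d(img, kernel):
    """Apply a small filter to a grayscale image ('valid' output, no padding)."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

edge = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
blur = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0

img = np.random.rand(32, 32)          # stand-in grayscale image
print(filter2d(img, edge).shape)      # (30, 30)
```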
43
A 32 × 32 × 3 image (pixels $x$) is convolved with a 5 × 5 × 3 filter (weights $w$), producing one 28 × 28 activation map (also called a feature map).
Convolve: slide the filter over all spatial locations and compute one output value at each; without padding, there are 28 × 28 locations.
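The 28 comes from the usual rule (input − filter)/stride + 1 = (32 − 5)/1 + 1 = 28; a quick PyTorch check (assuming torch is available):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                                    # one 32x32 RGB image (NCHW)
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)   # one 5x5x3 filter
print(conv(x).shape)   # torch.Size([1, 1, 28, 28])  since (32 - 5)/1 + 1 = 28
```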
44
A 32 × 32 × 3 image convolved with five filters produces five 28 × 28 activation maps.
Convolve: let's apply five filters, each with different weights! This is a convolution “layer”.
45
A ConvNet is a concatenation of convolutional layers and activations
Input image 32 × 32 × 3 → Conv + ReLU (5 filters, 5 × 5 × 3) → 28 × 28 × 5 → Conv + ReLU (8 filters, 5 × 5 × 5) → 24 × 24 × 8 → Conv + ReLU (12 filters, 5 × 5 × 8) → 20 × 20 × 12
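The same stack as a hedged PyTorch sketch (filter counts and sizes taken from the slide; stride 1, no padding):

```python
import torch
import torch.nn as nn

# The three conv layers sketched above (stride 1, no padding)
convnet = nn.Sequential(
    nn.Conv2d(3, 5, kernel_size=5), nn.ReLU(),    # 32x32x3 -> 28x28x5
    nn.Conv2d(5, 8, kernel_size=5), nn.ReLU(),    # 28x28x5 -> 24x24x8
    nn.Conv2d(8, 12, kernel_size=5), nn.ReLU(),   # 24x24x8 -> 20x20x12
)
print(convnet(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 12, 20, 20])
```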
46
47
A single depth slice of the input is max pooled with 2 × 2 filters and stride 2: each value of the ‘pooled’ output is the maximum over one 2 × 2 block.
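A small sketch of this pooling operation on an arbitrary 4 × 4 depth slice (the input values here are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Made-up 4x4 depth slice, shaped (batch, channels, height, width)
x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

print(F.max_pool2d(x, kernel_size=2, stride=2))
# tensor([[[[6., 8.],
#           [3., 4.]]]])
```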
48
49
60k parameters
50
[Krizhevsky et al. 2012]
51
[Simonyan and Zisserman 2014]
52
All convolutions: 3 × 3, stride 1, ‘same’ padding; all max pooling: 2 × 2, stride 2. Still very common: VGG-16
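As an illustration of that design rule, a sketch of a generic VGG-style block (not the exact VGG-16 definition):

```python
import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs=2):
    """A VGG-style block: 3x3 'same' convolutions with ReLU, then 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_channels if i == 0 else out_channels, out_channels,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)
```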
53
[He et al. 2015]
Xavier/2 initialization by He et al.
54
[He et al. 2015]
performance starts to degrade
the optimizer cannot properly train the network
55
performance starts to degrade
56
57
58
[Szegedy et al. 2014]
Inception block
59
60
61
Inputs, hidden states, outputs: the hidden state will have its own internal dynamics. More expressive model!
62
Hidden state and output: the same parameters are used for each time step = generalization!
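A minimal sketch of one vanilla RNN time step, just to make the parameter sharing explicit (a common textbook formulation; symbols are illustrative, not the lecture's exact notation):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One vanilla RNN time step; the same weight matrices are reused at every step."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # hidden state with its own dynamics
    y_t = W_hy @ h_t + b_y                            # output at time t
    return h_t, y_t
```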
63
64
excitation backprop
Lecture 7: GANs 1: Generative models, GANs.
for image, RNN for text) /
65
66
(needs to reach 100% because it just memorizes the input)
– It’s irrespective of input !!!
samples
– It’s now conditioned on input data
67
– 5, 10, 100, 1000… – At some point, we should see generalization
number of samples?
68
– Get precise timings! – If an iteration takes > 500 ms, things get dicey…
– Speed up data loading: smaller resolutions, compression, train from SSD – e.g., network training is good idea – Speed up backprop:
pattern? How long till convergence?
69
train for two weeks and we see where we stand.” [because we desperately need those 2%...]
divide #layers you started with by 5.
70
– Evaluation needs to be consistent! – Numbers need to be comparable
– “I’ve added 5 more layers and doubled the training set size, and now I also trained 5 days longer” – it’s better, but WHY?
71
ONLY THINK ABOUT THIS ONCE YOUR TRAINING LOSS GOES DOWN AND YOU CAN OVERFIT! Typically try this order:
72
PROCEED ONLY IF YOU GENERALIZE AND YOU ADDRESSED OVERFITTING ISSUES!
more data!
InceptionNet architectures often perform better (e.g., InceptionNet v4, XceptionNet, etc.)
higher weight)
73
very unlikely – unless you have a bug )
74
1) You didn't try to overfit a single batch first.
2) You forgot to toggle train/eval mode for the net.
3) You forgot to .zero_grad() (in PyTorch) before .backward().
4) You passed softmaxed outputs to a loss that expects raw logits.
5) You didn't use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forgot to include it for the output layer.
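A hedged sketch of point 1), the single-batch overfitting sanity check (function and variable names are illustrative):

```python
import torch

def overfit_single_batch(model, x, y, loss_fn, steps=200, lr=1e-3):
    """Sanity check 1): a healthy model/pipeline should drive the loss on ONE batch to ~0."""
    model.train()                          # 2) remember the train/eval toggle
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()                    # 3) don't forget zero_grad before backward
        loss = loss_fn(model(x), y)        # 4) pass raw logits to the loss, not softmax outputs
        loss.backward()
        opt.step()
    return loss.item()
```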
75
Credit: A. Karpathy
– Start actively discussing -> reach out to us if you have questions!
76