Training Neural Networks
I2DL: Prof. Niessner, Prof. Leal-Taixé
Lecture 5 Recap

Gradient Descent for Neural Networks
A two-layer network with inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, outputs $\hat{y}_0, \hat{y}_1$, and targets $y_0, y_1$:

$$\hat{y}_i = A\Big(b_{1,i} + \sum_j h_j\, w_{1,j,i}\Big), \qquad h_i = A\Big(b_{0,i} + \sum_j x_j\, w_{0,j,i}\Big)$$

Loss function: $L_i = (\hat{y}_i - y_i)^2$

Just simple: $A(x) = \max(0, x)$

$$\nabla_{W,b}\, L(W) = \left[\frac{\partial L}{\partial w_{0,0,0}}, \;\dots,\; \frac{\partial L}{\partial w_{l,m,n}}, \;\dots,\; \frac{\partial L}{\partial b_{l,m}}\right]$$
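A minimal NumPy sketch of this forward pass and loss; the 3-4-2 layer sizes, the random weights, and the sample values are illustrative assumptions, not from the lecture:

```python
import numpy as np

def relu(x):                      # A(x) = max(0, x)
    return np.maximum(0.0, x)

# Assumed shapes matching the sketch above: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W0, b0 = rng.normal(size=(3, 4)), np.zeros(4)   # w_{0,j,i}, b_{0,i}
W1, b1 = rng.normal(size=(4, 2)), np.zeros(2)   # w_{1,j,i}, b_{1,i}

x = np.array([0.5, -1.0, 2.0])    # one input sample (made up)
y = np.array([1.0, 0.0])          # its target (made up)

h     = relu(b0 + x @ W0)         # h_i = A(b_{0,i} + sum_j x_j w_{0,j,i})
y_hat = relu(b1 + h @ W1)         # y_hat_i = A(b_{1,i} + sum_j h_j w_{1,j,i})
loss  = np.sum((y_hat - y) ** 2)  # L = sum_i (y_hat_i - y_i)^2
```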
$$W^{k+1} = W^k - \alpha\, \nabla_W L\big(W^k, x_{\{1..m\}}, y_{\{1..m\}}\big), \qquad \nabla_W L = \frac{1}{m} \sum_{i=1}^{m} \nabla_W L_i$$

– $k$ now refers to the $k$-th iteration
– $m$ training samples in the current minibatch
– $\nabla_W L$ is the gradient for the $k$-th minibatch
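A sketch of one such minibatch update, assuming a hypothetical helper `grad_loss(W, x, y)` that returns the gradient of the loss for a single sample (in practice a framework computes this via backprop):

```python
def sgd_step(W, x_batch, y_batch, grad_loss, lr=1e-2):
    """One minibatch SGD step: W <- W - lr * (1/m) * sum_i grad(L_i).

    W and the returned gradients are arrays (e.g., NumPy) of the same shape.
    """
    m = len(x_batch)
    grad = sum(grad_loss(W, x, y) for x, y in zip(x_batch, y_batch)) / m
    return W - lr * grad
```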
$$v^{k+1} = \beta \cdot v^k + \nabla_W L(W^k), \qquad W^{k+1} = W^k - \alpha \cdot v^{k+1}$$

– Exponentially-weighted average of the gradient
– Important: the velocity $v^k$ is vector-valued!
– $\nabla_W L(W^k)$: gradient of the current minibatch; $v$: velocity; $\beta$: accumulation rate ("friction", momentum); $\alpha$: learning rate; $W$: model parameters
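A minimal sketch of this update; the hyperparameter values are just common defaults, not prescribed by the lecture:

```python
def momentum_step(W, v, grad, lr=1e-2, beta=0.9):
    """SGD with momentum: v <- beta*v + grad, then W <- W - lr*v.

    `grad` is the gradient of the current minibatch; `v` is the
    (vector-valued) velocity, initialized to zeros of the same shape as W.
    """
    v = beta * v + grad
    W = W - lr * v
    return W, v
```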
– X-direction: small gradients
– Y-direction: large gradients
Source: A. Ng
$$s^{k+1} = \beta \cdot s^k + (1-\beta)\,\big[\nabla_W L \circ \nabla_W L\big], \qquad W^{k+1} = W^k - \alpha \cdot \frac{\nabla_W L}{\sqrt{s^{k+1}} + \epsilon}$$

– We're dividing by the squared gradients: the divisor is large along directions with large gradients (Y) and small along directions with small gradients (X)
– (Uncentered) variance of the gradients → second moment
– Can increase the learning rate!
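A minimal sketch of the RMSProp update; the hyperparameter values below are common defaults assumed for illustration:

```python
import numpy as np

def rmsprop_step(W, s, grad, lr=1e-3, beta=0.9, eps=1e-8):
    """RMSProp: keep a running (uncentered) second moment of the gradient
    and divide the update element-wise by its square root."""
    s = beta * s + (1.0 - beta) * grad * grad      # s^{k+1}
    W = W - lr * grad / (np.sqrt(s) + eps)         # element-wise scaling
    return W, s
```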
$$m^{k+1} = \beta_1 \cdot m^k + (1-\beta_1)\,\nabla_W L(W^k), \qquad v^{k+1} = \beta_2 \cdot v^k + (1-\beta_2)\,\big[\nabla_W L(W^k) \circ \nabla_W L(W^k)\big]$$

– $m^{k+1}$ and $v^{k+1}$ are biased towards zero → typically, bias-corrected moment updates:

$$\hat{m}^{k+1} = \frac{m^{k+1}}{1-\beta_1^{\,k+1}}, \qquad \hat{v}^{k+1} = \frac{v^{k+1}}{1-\beta_2^{\,k+1}}, \qquad W^{k+1} = W^k - \alpha \cdot \frac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}$$
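A minimal sketch of one Adam step with bias correction; the default hyperparameters are the commonly used ones and are assumptions here:

```python
import numpy as np

def adam_step(W, m, v, grad, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: first/second moment estimates plus bias correction.
    `k` is the 0-based iteration index, so the correction uses k+1."""
    m = beta1 * m + (1.0 - beta1) * grad           # first moment
    v = beta2 * v + (1.0 - beta2) * grad * grad    # second moment
    m_hat = m / (1.0 - beta1 ** (k + 1))           # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** (k + 1))
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v
```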
Source: http://cs231n.github.io/neural-networks-3/
– Need a high learning rate when far away
– Need a low learning rate when close
$$\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch}} \cdot \alpha_0$$

– E.g., $\alpha_0 = 0.1$, decay_rate $= 1.0$:
  – Epoch 0: 0.1
  – Epoch 1: 0.05
  – Epoch 2: 0.033
  – Epoch 3: 0.025
  ...
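This schedule is easy to check in a couple of lines of Python:

```python
alpha_0, decay_rate = 0.1, 1.0
for epoch in range(4):
    alpha = 1.0 / (1.0 + decay_rate * epoch) * alpha_0
    print(epoch, round(alpha, 3))   # 0.1, 0.05, 0.033, 0.025
```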
[Plot: learning rate over epochs for this schedule]
Many options:
– Step decay: $\alpha = \alpha - t \cdot \alpha$, applied only every n steps; $t$ is the decay rate (often 0.5)
– Exponential decay: $\alpha = t^{\text{epoch}} \cdot \alpha_0$; $t$ is the decay rate ($t < 1.0$)
– $\alpha = \frac{t}{\sqrt{\text{epoch}}} \cdot \alpha_0$; $t$ is the decay rate
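A sketch of these schedules as plain functions of the epoch; the concrete values of t, alpha_0, and the step interval n are illustrative assumptions:

```python
import math

alpha_0 = 0.1
t = 0.5  # decay rate (assumed)

def step_decay(epoch, n=10):
    # alpha = alpha - t*alpha, applied once every n epochs
    return alpha_0 * ((1.0 - t) ** (epoch // n))

def exponential_decay(epoch):
    # alpha = t^epoch * alpha_0
    return (t ** epoch) * alpha_0

def sqrt_decay(epoch):
    # alpha = t / sqrt(epoch) * alpha_0 (guarding epoch 0)
    return t / math.sqrt(max(epoch, 1)) * alpha_0
```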
Manually specify the learning rate for the entire training process:
– Trial and error (the hard way)
– Some experience (only generalizes to some degree)
Consider: #epochs, training set size, network size, etc.
– Given a dataset $\{x_i, y_i\}$
– Take a network $f$ and its parameters $W, b$
– Use SGD (or a variation) to find the optimal parameters $W, b$
– Let's say we have 1 million labeled images
– Let's say our network has 500k parameters
⇒ Extremely expensive to compute
– (So far no "real" learning)
– I.e., train on a known dataset → test with the optimized parameters on an unknown dataset
– Learning means generalizing to different data (i.e., test data)
Training set:
– Use for training your neural network
Validation set:
– Hyperparameter optimization
– Check generalization progress
Test set:
– Only for the very end
– NEVER TO TOUCH DURING DEVELOPMENT OR TRAINING
– Typical splits: Train (60%), Val (20%), Test (20%) or Train (80%), Val (10%), Test (10%)
– The train error comes from the average minibatch error
– Typically, evaluate on a subset of the validation set every n iterations
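A minimal sketch of such a split via shuffled indices; the 60/20/20 ratios are the example values from above and the seed is arbitrary:

```python
import numpy as np

def split_dataset(n_samples, train=0.6, val=0.2, seed=0):
    """Shuffle indices once, then carve out train/val/test (e.g., 60/20/20)."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(train * n_samples)
    n_val   = int(val * n_samples)
    return (idx[:n_train],                      # training set
            idx[n_train:n_train + n_val],       # validation set
            idx[n_train + n_val:])              # test set: only for the very end

train_idx, val_idx, test_idx = split_dataset(10_000)
```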
Find your hyperparameters
[Diagram: dataset split into 60% train, 20% validation, 20% test]
Example scenario:
– Ground truth error: 1%
– Training set error: 5% (gap to ground truth → bias, underfitting)
– Val/test set error: 8% (gap to training error → variance, overfitting)
Credits: A. Ng
[Figure: underfitted vs. appropriate vs. overfitted model fits]
Source: Deep Learning by Adam Gibson and Josh Patterson, O'Reilly Media Inc., 2017
Source: https://srdas.github.io/DLBook/ImprovingModelGeneralization.html
[Plot: training vs. validation loss curves]
Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
[Plot: training vs. validation loss curves]
Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
– Underfitting (loss still decreasing)
– The validation set is easier than the training set
Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
– Underfitting: training and validation losses are still decreasing even at the end of training
– Overfitting: the training loss decreases while the validation loss increases
– Good fit: small gap between training and validation loss, and both go down at the same rate (stable, without fluctuations)
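A toy helper that mirrors these three criteria; the window size and the simple "is it still falling?" test are arbitrary assumptions for illustration, not a standard diagnostic:

```python
def diagnose(train_loss, val_loss, window=10):
    """Rough curve check following the three criteria above (illustrative only)."""
    t_recent, v_recent = train_loss[-window:], val_loss[-window:]
    train_falling = t_recent[-1] < t_recent[0]
    val_falling   = v_recent[-1] < v_recent[0]
    if train_falling and val_falling:
        return "still underfitting: keep training"
    if train_falling and not val_falling:
        return "overfitting: validation loss no longer improves"
    return "converged; good fit if the train-val gap is small"
```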
– Training error not going down
– Validation error not going down
– Performance on validation better than on the training set
– Tests on the train set behave differently than during training
– Training set contains test data
– Debugging the algorithm on test data
Test set: never touch during development or training.
Learning setup + optimization = hyperparameters
– Manual search
– Grid search (structured, for "real" applications):
  – Define ranges for the parameters and select points (e.g., evenly spaced)
  – Iterate over all possible configurations
– Random search: like grid search, but one picks points at random in the predefined ranges
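A sketch of random search; the hyperparameter names, their ranges, and the `evaluate(config)` helper are assumptions for illustration:

```python
import random

def sample_config():
    """Draw each hyperparameter at random from its predefined range."""
    return {
        "lr": 10 ** random.uniform(-4, -1),            # log-uniform learning rate
        "weight_decay": 10 ** random.uniform(-6, -2),
        "batch_size": random.choice([32, 64, 128]),
    }

def random_search(evaluate, budget=20):
    """`evaluate(config)` is a hypothetical helper that trains briefly
    and returns, e.g., the validation accuracy for one configuration."""
    best_score, best_cfg = float("-inf"), None
    for _ in range(budget):
        cfg = sample_config()
        score = evaluate(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```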
[Figure: grid search vs. random search, sampling two parameters (First Parameter vs. Second Parameter)]
– Check if the output is correct
– Overfit → train accuracy should be 100% because the input is just memorized
– Check if the input is handled correctly
– Increase the number of samples: 5, 10, 100, 1000, …
– At some point, you should see generalization
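A sketch of this sanity check in PyTorch; `model`, the tiny batch `(x_small, y_small)`, and the step count are placeholders to swap for your own setup:

```python
import torch
import torch.nn as nn

def overfit_tiny_subset(model, x_small, y_small, steps=500, lr=1e-2):
    """Train on a handful of samples; accuracy should reach ~100% (memorization)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x_small), y_small)
        loss.backward()
        opt.step()
    acc = (model(x_small).argmax(dim=1) == y_small).float().mean()
    return acc.item()   # should be ~1.0, otherwise something is broken
```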
– Softmax with C classes: the initial loss should be about ln(C)
– Find a learning rate that makes the loss drop significantly (exponentially) within ~100 iterations
– Learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4
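The initial-loss sanity check is easy to verify; the batch size and class count below are arbitrary, and the learning-rate sweep is left as a comment since it needs your actual training loop:

```python
import math
import torch
import torch.nn as nn

num_classes = 10
logits = torch.zeros(32, num_classes)             # untrained net ~ uniform logits
targets = torch.randint(0, num_classes, (32,))
loss = nn.CrossEntropyLoss()(logits, targets)
print(loss.item(), "~", math.log(num_classes))    # initial loss check: ln(C) ~ 2.30

# Then sweep learning rates (1e-1, 1e-2, 1e-3, 1e-4) with your training loop and
# keep the one whose loss drops significantly within ~100 iterations.
```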
[Plot: loss over training time for several learning rates]
– Then apply learning-rate decay around what worked in the coarse run
[Figure: grid search over two parameters, e.g., learning rate and learning-rate decay]
– Get precise timings!
– If an iteration exceeds 500 ms, things get dicey
– Data loading: smaller resolution, compression, train from an SSD
– Backprop
– How long until you see some pattern? How long until convergence?
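A minimal way to get such timings; `step_fn` is a placeholder for one full iteration (data loading + forward + backward + update), and the 500 ms threshold is just the rule of thumb from above:

```python
import time

def timed_iteration(step_fn):
    """Wrap one training iteration and report its wall-clock time."""
    start = time.perf_counter()
    step_fn()
    elapsed = time.perf_counter() - start
    if elapsed > 0.5:
        print(f"iteration took {elapsed * 1000:.0f} ms; check data loading / backprop")
    return elapsed
```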
"… train for two weeks and see where we stand."
– Use the simplest network possible
– Rule of thumb: divide the #layers you started with by 5
– Keep the turnaround short; ideally, minutes
– Evaluation needs to be consistent; numbers need to be comparable
– Change only one thing at a time
– "I've added 5 more layers and doubled the training size, and now I also trained 5 days longer. Now it's better, but why?"
– Check later when we talk about dropout…
– Remember to zero the gradients before calling .backward()
– Pass raw logits to losses that expect raw logits (e.g., PyTorch's nn.CrossEntropyLoss), not softmax outputs
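A minimal PyTorch training step that gets these details right; the tiny linear model and the random batch are placeholders for your own network and data:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)                       # stand-in for your network
loss_fn = nn.CrossEntropyLoss()                # expects raw logits, not softmax!
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))

model.train()                                  # toggle train/eval mode explicitly
opt.zero_grad()                                # zero stale gradients first
logits = model(x)                              # raw logits go into the loss
loss = loss_fn(logits, y)
loss.backward()                                # backprop
opt.step()

model.eval()                                   # eval mode for validation/testing
```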
– More about training neural networks: output functions, loss functions, activation functions
– Chapter 6: Deep Feedforward Networks
– Chapter 5.5: Regularization in Neural Networks