Lecture 5 recap
- Prof. Leal-Taixé and Prof. Niessner
Neural Network: Depth and Width

Gradient Descent for Neural Networks
A two-layer network with inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, outputs $y_0, y_1$, and targets $t_0, t_1$:

$y_i = A\big(b_{1,i} + \sum_k h_k \, w_{1,i,k}\big)$, where $h_k = A\big(b_{0,k} + \sum_l x_l \, w_{0,k,l}\big)$

Per-sample loss: $L_i = (y_i - t_i)^2$

Gradient with respect to all weights and biases:

$\nabla_{w,b} f_{x,t}(w) = \left[ \frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{m,n,o}}, \dots, \frac{\partial f}{\partial b_{m,n}} \right]$

The activation can be just simple: $A(z) = \max(0, z)$ (ReLU).
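To make this concrete, here is a minimal NumPy sketch of the forward pass and per-sample loss; the variable names (`W0`, `b0`, `W1`, `b1`) and the layer sizes are illustrative choices, not from the slides.

    import numpy as np

    def relu(z):                         # A(z) = max(0, z)
        return np.maximum(0.0, z)

    def forward(x, W0, b0, W1, b1):
        h = relu(b0 + W0 @ x)            # h_k = A(b_{0,k} + sum_l w_{0,k,l} x_l)
        return relu(b1 + W1 @ h)         # y_i = A(b_{1,i} + sum_k w_{1,i,k} h_k)

    def loss(x, t, W0, b0, W1, b1):
        y = forward(x, W0, b0, W1, b1)
        return np.sum((y - t) ** 2)      # sum of per-output losses (y_i - t_i)^2

    rng = np.random.default_rng(0)
    W0, b0 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs, 4 hidden units
    W1, b1 = rng.normal(size=(2, 4)), np.zeros(2)   # 2 outputs
    print(loss(rng.normal(size=3), np.array([1.0, 0.0]), W0, b0, W1, b1))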
Stochastic Gradient Descent (SGD):

$\theta^{l+1} = \theta^l - \alpha \, \nabla_\theta L(\theta^l, x_{\{1..m\}}, y_{\{1..m\}}), \qquad \nabla_\theta L = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i$
Note the terminology: an iteration processes one minibatch; an epoch is a complete pass over the training set.
$l$ now refers to the $l$-th iteration, $m$ is the number of training samples in the current minibatch, and $\nabla_\theta L$ is the gradient for the $l$-th batch.
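A minimal sketch of the SGD update, assuming a hypothetical `grad_L` that returns the averaged minibatch gradient; the toy quadratic loss is only for illustration.

    import numpy as np

    def sgd_step(theta, grad_L, alpha=1e-2):
        # theta^{l+1} = theta^l - alpha * grad L (averaged over the minibatch)
        return theta - alpha * grad_L(theta)

    # toy quadratic loss L(theta) = 0.5 * ||theta||^2, so grad L = theta
    theta = np.array([1.0, -2.0])
    for _ in range(100):                 # 100 iterations (not epochs!)
        theta = sgd_step(theta, lambda th: th)
    print(theta)                         # has moved toward the optimum at 0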
Gradient Descent with Momentum

$v^{l+1} = \beta \cdot v^l + \nabla_\theta L(\theta^l)$
$\theta^{l+1} = \theta^l - \alpha \cdot v^{l+1}$

The velocity is an exponentially-weighted average of the gradients. Important: the velocity $v^l$ is vector-valued!
Here $\nabla_\theta L(\theta^l)$ is the gradient of the current minibatch, $\beta$ the velocity accumulation rate ('friction', momentum), $\alpha$ the learning rate, $v$ the velocity, and $\theta$ the model parameters.

The step is largest when a sequence of gradients all point in the same direction.

The hyperparameters are $\alpha$ and $\beta$; $\beta$ is often set to 0.9.
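A minimal sketch of the momentum update under the same assumptions (hypothetical `grad` callable, toy quadratic loss):

    import numpy as np

    def momentum_step(theta, v, grad, alpha=1e-2, beta=0.9):
        v = beta * v + grad(theta)       # vector-valued velocity v^{l+1}
        theta = theta - alpha * v        # theta^{l+1} = theta^l - alpha * v^{l+1}
        return theta, v

    theta, v = np.array([1.0, -2.0]), np.zeros(2)
    for _ in range(100):
        theta, v = momentum_step(theta, v, lambda th: th)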
RMSProp

$s^{l+1} = \beta \cdot s^l + (1-\beta)\,[\nabla_\theta L \circ \nabla_\theta L]$
$\theta^{l+1} = \theta^l - \alpha \cdot \dfrac{\nabla_\theta L}{\sqrt{s^{l+1}} + \epsilon}$

Hyperparameters: $\alpha$, $\beta$, $\epsilon$.
$\epsilon$ is typically $10^{-8}$, $\beta$ is often 0.9, and $\circ$ denotes element-wise multiplication. The learning rate $\alpha$ still needs tuning!
[Figure: elongated loss surface; small gradients in the x-direction, large gradients in the y-direction]

We are dividing by the accumulated squared gradients, i.e., the (uncentered) variance of the gradients: steps along directions with large gradients are damped, and steps along directions with small gradients are amplified, so we can increase the learning rate!
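A corresponding RMSProp sketch; `grad` is again a hypothetical callable returning the minibatch gradient.

    import numpy as np

    def rmsprop_step(theta, s, grad, alpha=1e-2, beta=0.9, eps=1e-8):
        g = grad(theta)
        s = beta * s + (1.0 - beta) * g * g            # element-wise g ∘ g
        theta = theta - alpha * g / (np.sqrt(s) + eps)
        return theta, s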
Adam

Combines Momentum and RMSProp:

$m^{l+1} = \beta_1 \cdot m^l + (1-\beta_1)\,\nabla_\theta L(\theta^l)$
$v^{l+1} = \beta_2 \cdot v^l + (1-\beta_2)\,[\nabla_\theta L(\theta^l) \circ \nabla_\theta L(\theta^l)]$
$\theta^{l+1} = \theta^l - \alpha \cdot \dfrac{m^{l+1}}{\sqrt{v^{l+1}} + \epsilon}$
First moment $m$: mean of the gradients. Second moment $v$: (uncentered) variance of the gradients.
In practice, Adam uses bias-corrected moments $\hat m$ and $\hat v$ in the update:

$\theta^{l+1} = \theta^l - \alpha \cdot \dfrac{\hat m^{l+1}}{\sqrt{\hat v^{l+1}} + \epsilon}$
$m^{l+1}$ and $v^{l+1}$ are initialized with zero, so the early estimates are biased towards zero. Typically, one therefore uses bias-corrected moment updates:

$\hat m^{l+1} = \dfrac{m^{l+1}}{1-\beta_1^{\,l+1}}, \qquad \hat v^{l+1} = \dfrac{v^{l+1}}{1-\beta_2^{\,l+1}}$
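A sketch of one Adam step with bias correction, under the same assumptions as the sketches above; `l` is the 0-based iteration index.

    import numpy as np

    def adam_step(theta, m, v, l, grad,
                  alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g        # first moment (mean)
        v = beta2 * v + (1.0 - beta2) * g * g    # second moment (uncentered variance)
        m_hat = m / (1.0 - beta1 ** (l + 1))     # bias correction (m, v start at 0)
        v_hat = v / (1.0 - beta2 ** (l + 1))
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v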
Newton's Method

Idea: use the second derivative. Approximate the loss with a second-order Taylor series expansion (https://en.wikipedia.org/wiki/Taylor_series):

$f(x) \approx f(x_l) + \nabla f(x_l)^T (x - x_l) + \frac{1}{2} (x - x_l)^T H_f(x_l) (x - x_l)$

$\nabla f$ is the first derivative; the Hessian $H_f$ is the second derivative (curvature).
Update step: minimizing the quadratic approximation yields

$x_{l+1} = x_l - H_f(x_l)^{-1} \nabla f(x_l)$

Compared with the SGD update $x_{l+1} = x_l - \alpha \nabla f(x_l)$: we got rid of the learning rate!
For networks with millions of parameters $n$, the Hessian has $n^2$ elements, and the computational complexity of its 'inversion' is roughly $O(n^3)$ per iteration.
Newton's method uses the curvature to take a more direct route to the minimum than gradient descent. [Image from Wikipedia]
Can you apply Newton's method to linear regression? What do you get as a result?
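One way to see the answer: for the least-squares loss the Hessian is constant, so a single Newton step lands exactly on the normal-equations solution.

$f(w) = \|Xw - y\|^2, \qquad \nabla f(w) = 2X^T(Xw - y), \qquad H_f = 2X^T X$

$w_{l+1} = w_l - (2X^T X)^{-1}\, 2X^T (X w_l - y) = (X^T X)^{-1} X^T y$

That is, Newton's method solves linear regression in one step, independent of the starting point $w_l$.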
Gauss-Newton

Newton's method: $x_{l+1} = x_l - H_F(x_l)^{-1} \nabla F(x_l)$

'True' second derivatives are often hard to obtain (e.g., for numerical reasons). For a least-squares objective $F = \|r\|^2$ with residual Jacobian $J_F$, the Hessian is approximated as

$H_F \approx 2 J_F^T J_F$

which gives the Gauss-Newton update

$x_{l+1} = x_l - [2 J_F(x_l)^T J_F(x_l)]^{-1} \nabla F(x_l)$

Rather than inverting, rewrite it as a linear system and solve for the delta vector:

$2 J_F(x_l)^T J_F(x_l) \, (x_l - x_{l+1}) = \nabla F(x_l)$
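A minimal NumPy sketch of one Gauss-Newton step; `residual` and `jacobian` are hypothetical callables returning $r(x)$ and $J_F(x)$ for a least-squares objective $F(x) = \|r(x)\|^2$.

    import numpy as np

    def gauss_newton_step(x, residual, jacobian):
        # F(x) = ||r(x)||^2, so grad F = 2 J^T r and H_F ≈ 2 J^T J.
        r, J = residual(x), jacobian(x)
        # The factors of 2 cancel: solve (J^T J) * delta = J^T r
        # for the delta vector delta = x_l - x_{l+1}.
        delta = np.linalg.solve(J.T @ J, J.T @ r)
        return x - delta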
Levenberg-Marquardt (LM)

A 'damped' version of Gauss-Newton:

$[\,J_F(x_l)^T J_F(x_l) + \mu \cdot I\,] \, (x_l - x_{l+1}) = \nabla F(x_l)$

The damping factor $\mu$ is adjusted in each iteration, ensuring $F(x_l) > F(x_{l+1})$. For large $\mu$ the update behaves like Gradient Descent; the $\mu \cdot I$ term is a Tikhonov regularization.
A common variant scales the damping with the curvature:

$[\,J_F(x_l)^T J_F(x_l) + \mu \cdot \mathrm{diag}(J_F(x_l)^T J_F(x_l))\,] \, (x_l - x_{l+1}) = \nabla F(x_l)$

Instead of a plain Gradient Descent step for large $\mu$, this scales each component of the gradient according to the curvature.
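A sketch of one damped (LM) step with a simple, commonly used damping heuristic (halve $\mu$ on success, double on failure); `residual`, `jacobian`, and `F` are hypothetical callables.

    import numpy as np

    def lm_step(x, mu, residual, jacobian, F):
        r, J = residual(x), jacobian(x)
        A = J.T @ J
        # Damped normal equations: (J^T J + mu * I) * delta = J^T r.
        delta = np.linalg.solve(A + mu * np.eye(len(x)), J.T @ r)
        x_new = x - delta
        if F(x_new) < F(x):          # loss decreased: accept the step,
            return x_new, mu / 2.0   # move closer to Gauss-Newton
        return x, mu * 2.0           # else reject, move toward gradient descent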
In practice, SGD with momentum (or a variant such as Adam) is the method of choice for deep learning. Use Newton, L-BFGS, GN, or LM only if you can do full-batch updates (they don't work well for minibatches!).
Full-batch updates practically never happen in deep learning. Theoretically it would be nice, though, due to the fast convergence of second-order methods.
Overview of solvers and optimizers:
- Linear system solvers: LU, QR, Cholesky, Jacobi, Gauss-Seidel, CG, PCG, etc.
- Non-linear, second order: Newton, Gauss-Newton, LM, (L-)BFGS
- Non-linear, first order: Gradient Descent, SGD
- Others: genetic algorithms, MCMC, Metropolis-Hastings, etc.
- Constrained and convex solvers (Lagrange, ADMM, Primal-Dual, etc.)
Remember: Gradient Descent is an iterative optimization method, not a way to solve a linear system!
Learning Rate

We need a high learning rate when far away from the optimum and a low learning rate when close to it.
One option is $1/t$ decay:

$\alpha = \dfrac{1}{1 + \text{decay\_rate} \cdot \text{epoch}} \cdot \alpha_0$

E.g., with $\alpha_0 = 0.1$ and $\text{decay\_rate} = 1.0$:
- Epoch 0: 0.1
- Epoch 1: 0.05
- Epoch 2: 0.033
- Epoch 3: 0.025 ...
[Plot: Learning Rate over Epochs, decaying from about 0.1 toward zero]
Many options (see the sketch below):
- Step decay: reduce $\alpha$ by a constant factor every few epochs, e.g., $\alpha \leftarrow t \cdot \alpha$; $t$ is the decay rate (often 0.5).
- Exponential decay: $\alpha = t^{\,\text{epoch}} \cdot \alpha_0$; $t$ is the decay rate ($t < 1.0$).
- Related schedules of the same flavor, each controlled by a decay rate $t$.
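The schedules above as small Python helpers; the step-decay interval `every=10` is an illustrative choice, not from the slides.

    def one_over_t_decay(epoch, alpha0=0.1, decay_rate=1.0):
        # alpha = alpha0 / (1 + decay_rate * epoch): 0.1, 0.05, 0.033, 0.025, ...
        return alpha0 / (1.0 + decay_rate * epoch)

    def step_decay(epoch, alpha0=0.1, t=0.5, every=10):
        # multiply alpha by t every `every` epochs (illustrative interval)
        return alpha0 * t ** (epoch // every)

    def exp_decay(epoch, alpha0=0.1, t=0.95):
        # alpha = t^epoch * alpha0 with t < 1.0
        return alpha0 * t ** epoch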
Alternatively, manually specify the learning rate schedule for the entire training process:
- Trial and error (the hard way)
- Some experience (only generalizes to some degree)

Consider: the number of epochs, training set size, network size, etc.
The basic recipe:
- Given a training set $\{x_i, y_i\}$
- Take a network $f$ and its parameters $w$, $b$
- Use SGD (or a variant) to find the optimal parameters $w$, $b$
So far this is pure optimization, with no 'real' learning. Learning means generalizing: train on a known dataset, then test with the optimized parameters on unknown, different data (i.e., test data).
Training set: used for training your neural network.
Validation set: used for hyperparameter optimization and to check generalization progress.
Test set: only for the very end. NEVER TO TOUCH DURING DEVELOPMENT OR TRAINING.
Typical splits: Train (60%), Val (20%), Test (20%), or Train (80%), Val (10%), Test (10%).

The train error comes from the average mini-batch error; for validation, one typically evaluates a subset of the validation set every $n$ iterations.
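A minimal sketch of such a split (here 60/20/20), assuming `X` and `y` are NumPy arrays:

    import numpy as np

    def split_dataset(X, y, val_frac=0.2, test_frac=0.2, seed=0):
        # shuffle once, then cut into train / val / test (here 60/20/20)
        idx = np.random.default_rng(seed).permutation(len(X))
        n_test, n_val = int(len(X) * test_frac), int(len(X) * val_frac)
        test, val = idx[:n_test], idx[n_test:n_test + n_val]
        train = idx[n_test + n_val:]
        return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])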
Training curves are noisy, so they are typically visualized with exponential moving average (EMA) smoothing.
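A sketch of EMA smoothing as it is typically applied to a recorded loss curve; the smoothing factor `gamma=0.9` is an illustrative choice.

    def ema_smooth(values, gamma=0.9):
        # exponentially weighted moving average of a noisy curve
        smoothed, avg = [], values[0]
        for v in values:
            avg = gamma * avg + (1.0 - gamma) * v
            smoothed.append(avg)
        return smoothed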
[Figure: underfitted vs. appropriate vs. overfitted fits. Extracted from Deep Learning by Adam Gibson and Josh Patterson, O'Reilly Media Inc., 2017]
[Figure on improving model generalization. Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html]
Finding hyperparameters:
- Manual search: most common.
- Grid search (structured, for 'real' applications): define ranges for all parameter spaces and select points (usually pseudo-uniformly distributed), then iterate over all possible configurations.
- Random search: like grid search, but points are picked at random within the predefined ranges.
    # train_model and evaluate are placeholders for your training and
    # evaluation routines
    learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]
    regularization_strengths = [1e2, 1e3, 1e4, 1e5]
    num_iters = [500, 1000, 1500]

    best_val = 0
    for learning_rate in learning_rates:
        for reg in regularization_strengths:
            for iterations in num_iters:
                model = train_model(learning_rate, reg, iterations)
                validation_accuracy = evaluate(model)
                if validation_accuracy > best_val:
                    best_val = validation_accuracy
                    best_model = model
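A random-search variant of the same loop, sampling configurations on a log scale instead of enumerating a grid; `train_model` and `evaluate` are the same placeholder routines as above.

    import random

    best_val, best_model = 0.0, None
    for _ in range(20):                              # try 20 random configurations
        learning_rate = 10 ** random.uniform(-5, -2) # log-scale sampling
        reg = 10 ** random.uniform(2, 5)
        iterations = random.choice([500, 1000, 1500])
        model = train_model(learning_rate, reg, iterations)
        validation_accuracy = evaluate(model)
        if validation_accuracy > best_val:
            best_val, best_model = validation_accuracy, model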
[Figure extracted from cs231n]
Cross-validation is the method of choice if your model has low training times: train on subsets of the training data, measure performance on the remaining subset, rotate the held-out subset, and average the results.
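A minimal k-fold cross-validation sketch; `train_and_eval` is a hypothetical callable that trains on the given training fold and returns the score on the held-out fold.

    import numpy as np

    def k_fold_cv(X, y, train_and_eval, k=5):
        # split into k folds; train on k-1, evaluate on the held-out fold
        folds = np.array_split(np.random.default_rng(0).permutation(len(X)), k)
        scores = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(train_and_eval(X[train_idx], y[train_idx],
                                         X[val_idx], y[val_idx]))
        return float(np.mean(scores))   # average the k results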
[Figure: cross-validation results for k=5, performance vs. hyperparameter value; extracted from cs231n]
Find your hyperparameters using the validation set, e.g., with a 60% train / 20% validation / 20% test split.
With a 60/20/20 train/val/test split, compare the error levels:
- Human-level error: 1%
- Training set error: 5% (the gap to human-level error indicates bias, i.e., underfitting)
- Val/test set error: 8% (the gap to the training error indicates variance, i.e., overfitting)
Credits: A. Ng
Next time:
- Discussion of the exercise solution and presentation of exercise 2
- More on training neural networks