Scaling Optimization
Lecture 4 Recap

Neural Network
Source: http://cs231n.github.io/neural-networks-1/
[Figure: a fully-connected network with an input layer, hidden layers 1-3, and an output layer; depth = number of layers, width = number of neurons per layer.]
[Figure: compute graph. Inputs x_0, x_1 (input layer) are multiplied by the weights w_0, w_1 (the unknowns!), summed, and passed through a ReLU activation \max(0, x) to produce the output \hat{y}_0 (output layer, e.g., class label / regression target). An L2 loss (\hat{y}_0 - y_0)^2 compares it against the target y_0.]
We want to compute gradients w.r.t. all weights W.
(btw. I'm not arguing this is the right choice here)
[Figure: the same compute graph widened to two inputs x_0, x_1, weights w_{0,0}, \dots, w_{2,1}, three outputs \hat{y}_0, \hat{y}_1, \hat{y}_2, and one L2 loss term (\hat{y}_i - y_i)^2 per output against the targets y_0, y_1, y_2.]
Again: we want to compute gradients w.r.t. all weights W.
In general, for one layer (input layer x, output layer \hat{y}):

\hat{y}_i = A\big(b_i + \sum_k x_k\, w_{i,k}\big)    (A: activation function, b_i: bias)

Goal: we want to compute gradients of the loss function L w.r.t. all weights W AND all biases b.

L = \sum_i L_i, \qquad L_i = (\hat{y}_i - y_i)^2    (L: sum over the loss per sample, e.g., L2 loss → simply sum up the squares)

\frac{\partial L}{\partial w_{i,k}} = \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial w_{i,k}}    → use the chain rule to compute the partials
– (Directed) compute graph
– Structure the graph into layers
– Compute partial derivatives w.r.t. the weights (unknowns)
– Find the weights based on the gradients

Gradient step: W' = W - \alpha \nabla_W f_{x,y}(W), where

\nabla_W f_{x,y}(W) = \Big[ \frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{l,m,n}}, \dots, \frac{\partial f}{\partial b_{l,m}} \Big]
Optimization: find

x^* = \arg\min_x f(x)

[Figure: a loss landscape with an initialization point and the optimum.]
Follow the slope
Derivative:

\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}
The gradient \nabla_x f(x) points in the direction of greatest increase (in 1-D, the derivative \frac{df(x)}{dx}), so gradient descent steps against it:

x' = x - \alpha \nabla_x f(x)    (\alpha: learning rate)
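To make this concrete, here is a minimal sketch of the gradient step in NumPy; the quadratic objective and all constants are illustrative choices, not from the lecture:

```python
import numpy as np

def f(x):
    # example objective: an anisotropic quadratic bowl with optimum at (0, 0)
    return 0.5 * x[0]**2 + 5.0 * x[1]**2

def grad_f(x):
    # analytic gradient of f: the direction of greatest increase
    return np.array([x[0], 10.0 * x[1]])

x = np.array([2.0, 1.0])       # initialization
alpha = 0.05                   # learning rate
for _ in range(100):
    x = x - alpha * grad_f(x)  # gradient step: x' = x - alpha * grad f(x)
print(x, f(x))                 # x approaches the optimum (0, 0)
```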
The same update with a SMALL learning rate: careful but slow progress toward the minimum.
The same update with a LARGE learning rate: big steps that can overshoot the minimum and even diverge.
x^* = \arg\min_x f(x)

[Figure: a landscape where the path from the initialization ends in a local optimum.]
Gradient descent is not guaranteed to reach the global optimum. What is the gradient when we reach this point? (It is zero, so the updates stop there.)
A function is convex if the line/plane segment between any two points lies above or on its graph.
Source: https://en.wikipedia.org/wiki/Convex_function#/media/File:ConvexFunction.svg
Neural network loss surfaces are non-convex:
– many (different) local minima
– no (practical) way to say which is globally optimal
Source: Li, Qi. (2006). Challenging Registration of Geologic Image Data
There are various ways to visualize gradient descent…
Sources: https://builtin.com/data-science/gradient-descent; A. Geron; http://blog.datumbox.com/wp-content/uploads/2013/10/gradient-descent.png
For a two-layer network (inputs x_k, hidden units h_j, outputs \hat{y}_i):

\hat{y}_i = A\big(b_{1,i} + \sum_j h_j\, w_{1,i,j}\big), \qquad h_j = A\big(b_{0,j} + \sum_k x_k\, w_{0,j,k}\big)

Loss function: L_i = (\hat{y}_i - y_i)^2. Keeping it simple: A(x) = \max(0, x).

\nabla_{W,b}\, f_{x,y}(W) = \Big[ \frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{l,m,n}}, \dots, \frac{\partial f}{\partial b_{l,m}} \Big]
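The two formulas translate almost line-for-line into NumPy. A forward-pass sketch, where the shapes and the random initialization are our own assumptions:

```python
import numpy as np

def relu(x):                       # A(x) = max(0, x)
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
x = rng.standard_normal(3)         # inputs x_k
W0, b0 = rng.standard_normal((4, 3)), np.zeros(4)   # hidden-layer weights and biases
W1, b1 = rng.standard_normal((2, 4)), np.zeros(2)   # output-layer weights and biases
y = np.array([1.0, 0.0])           # targets y_i

h     = relu(b0 + W0 @ x)          # h_j     = A(b_0j + sum_k x_k w_0jk)
y_hat = relu(b1 + W1 @ h)          # y_hat_i = A(b_1i + sum_j h_j w_1ij)
L     = np.sum((y_hat - y) ** 2)   # L = sum_i (y_hat_i - y_i)^2
print(L)
```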
Gradient descent on a single training sample \{x_i, y_i\}:
– Find \theta = \arg\min_\theta L_i(x_i, y_i)
– Initialize \theta^1 with 'random' values (more on that later)
– Update: \theta^{k+1} = \theta^k - \alpha \nabla_\theta L_i(\theta^k, x_i, y_i)
– Iterate until convergence: \|\theta^{k+1} - \theta^k\| < \epsilon
– \theta^{k+1} = \theta^k - \alpha \nabla_\theta L_i(\theta^k, x_i, y_i)
– \nabla_\theta L_i(\theta^k, x_i, y_i) is computed via backpropagation
– Typically: \dim \nabla_\theta L_i(\theta^k, x_i, y_i) = \dim \theta \gg 1 million

Notation: \theta^k are the weights and biases at step k (the current model) and \theta^{k+1} those after the update step; \alpha is the learning rate; \nabla_\theta is the gradient w.r.t. \theta; (x_i, y_i) is the training sample; L_i is the loss function.
Gradient descent on all n training samples \{x_i, y_i\}:

L = \frac{1}{n} \sum_{i=1}^{n} L_i(\theta, x_i, y_i)

– Find \theta = \arg\min_\theta L
\theta^{k+1} = \theta^k - \alpha \nabla_\theta L(\theta^k, x_{\{1..n\}}, y_{\{1..n\}})

\nabla_\theta L(\theta^k, x_{\{1..n\}}, y_{\{1..n\}}) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i(\theta^k, x_i, y_i)

Reminder: the per-sample gradients \nabla_\theta L_i come from backprop. Omitting the \frac{1}{n} is not 'wrong'; it just means rescaling the learning rate.
Can compute the optimal learning rate \alpha using Line Search (optimal for a given set):
1. Compute the gradient: \nabla_\theta L = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i
2. Optimize for the learning rate: \arg\min_\alpha L(\theta^k - \alpha \nabla_\theta L)
3. Update: \theta^{k+1} = \theta^k - \alpha \nabla_\theta L
Not that practical for DL, since we would need to solve this huge problem at every step…
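A crude stand-in for step 2 is to simply try candidate step sizes and keep the best one; a real line-search routine would be smarter, but the idea is the same (objective and constants are again illustrative):

```python
import numpy as np

def line_search_step(theta, loss, grad):
    # one full-batch step where alpha is picked by brute force over a grid
    g = grad(theta)
    candidates = np.logspace(-4, 1, 50)     # trial learning rates
    alpha = min(candidates, key=lambda a: loss(theta - a * g))
    return theta - alpha * g

loss = lambda th: 0.5 * th[0]**2 + 5.0 * th[1]**2
grad = lambda th: np.array([th[0], 10.0 * th[1]])
theta = np.array([2.0, 1.0])
for _ in range(5):
    theta = line_search_step(theta, loss, grad)   # converges in very few steps
```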
Why go stochastic? Gradient descent on the full training set is costly:
– Let's say we have 1 million labeled images
– Let's say our network has 500k parameters
→ Extremely expensive to compute
If we have n training samples, computing the gradient for all of them is O(n). Since the training objective is an average-based minimization, we can express the total loss over the training data as the expectation over all the samples:

\frac{1}{n} \sum_{i=1}^{n} L_i(\theta, x_i, y_i) = \mathbb{E}_{i \sim \{1,\dots,n\}}\big[ L_i(\theta, x_i, y_i) \big]
This expectation can be approximated with a (random) subset of the data:

\mathbb{E}_{i \sim \{1,\dots,n\}}\big[ L_i(\theta, x_i, y_i) \big] \approx \frac{1}{|S|} \sum_{j \in S} L_j(\theta, x_j, y_j) \quad \text{with } S \subseteq \{1, \dots, n\}

Minibatch: choose a subset of the training set of size m \ll n:
B_i = \{ (x_1, y_1), (x_2, y_2), \dots, (x_m, y_m) \}, \qquad \{ B_1, B_2, \dots, B_{n/m} \}
Choosing the minibatch size m:
– Typically a power of 2 → 8, 16, 32, 64, 128…
– A smaller batch size means greater variance in the gradients → noisy updates
– Mostly limited by GPU memory (in the backward pass)
(Epoch = one complete pass through the training set)
Minibatch SGD update:

\theta^{k+1} = \theta^k - \alpha \nabla_\theta L(\theta^k, x_{\{1..m\}}, y_{\{1..m\}}), \qquad \nabla_\theta L = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i

Note the terminology: iteration vs. epoch. k now refers to the k-th iteration, m is the number of training samples in the current minibatch, and \nabla_\theta L is the gradient for the k-th minibatch.
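One epoch of minibatch SGD then looks like this; the data and the grad_loss function below are placeholders we assume for illustration:

```python
import numpy as np

def sgd_epoch(theta, X, Y, grad_loss, alpha=0.01, m=32, rng=None):
    # grad_loss(theta, X_batch, Y_batch) must return the gradient averaged over the batch
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(X))            # shuffle once per epoch
    for start in range(0, len(X), m):
        batch = idx[start:start + m]         # indices of the current minibatch
        theta = theta - alpha * grad_loss(theta, X[batch], Y[batch])
    return theta

# usage: least-squares linear model, gradient of the mean squared error
X, Y = np.random.randn(256, 3), np.random.randn(256)
grad_loss = lambda th, xb, yb: 2.0 * xb.T @ (xb @ th - yb) / len(xb)
theta = sgd_epoch(np.zeros(3), X, Y, grad_loss)
```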
Convergence of SGD: suppose we want to minimize a function F(\theta) with the stochastic approximation

\theta^{k+1} = \theta^k - \alpha_k\, g(\theta^k, X),

where \alpha_1, \alpha_2, \dots is a sequence of positive step sizes and g(\theta^k, X) is an unbiased estimate of \nabla F(\theta^k), i.e., \mathbb{E}\big[ g(\theta^k, X) \big] = \nabla F(\theta^k).

Robbins, H. and Monro, S. "A Stochastic Approximation Method", 1951.
The stochastic approximation \theta^{k+1} = \theta^k - \alpha_k\, g(\theta^k, X) converges to a local (global) minimum if the following conditions are met:
1) \alpha_k \ge 0, \ \forall k \ge 0
2) \sum_{k=1}^{\infty} \alpha_k = \infty
3) \sum_{k=1}^{\infty} \alpha_k^2 < \infty
4) F(\theta) is strictly convex

The sequence proposed by Robbins and Monro is \alpha_k \propto \frac{\alpha}{k}, for k > 0.
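Conditions 2) and 3) are easy to check numerically for this schedule; the base step size below is an arbitrary choice:

```python
# alpha_k = alpha0 / k: the two sums diverge / converge as required
alpha0 = 0.5
alpha = lambda k: alpha0 / k                          # k = 1, 2, 3, ...

print(sum(alpha(k) for k in range(1, 100_000)))       # partial sums grow without bound
print(sum(alpha(k) ** 2 for k in range(1, 100_000)))  # converges (to alpha0^2 * pi^2 / 6)
```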
Problems of SGD: the gradient is scaled equally across all dimensions
→ i.e., we cannot independently scale the directions
→ we need a conservative minimum learning rate to avoid divergence
→ slower than 'necessary'
→ more on this next lecture
[Figure: SGD zig-zagging through a narrow valley. Source: A. Ng]
We're making many steps back and forth along this direction; averaging the accumulated gradients over time would cancel them out. Along the valley itself we would love to go faster, i.e., accumulate gradients over time.
Gradient descent with momentum:

v^{k+1} = \beta \cdot v^k - \alpha \cdot \nabla_\theta L(\theta^k)
\theta^{k+1} = \theta^k + v^{k+1}

v is the velocity, an exponentially-weighted average of the gradients. Important: the velocity v^k is vector-valued!
\nabla_\theta L(\theta^k): gradient of the current minibatch; \beta: velocity accumulation rate ('friction', momentum); \alpha: learning rate; \theta: weights of the model.

[Sutskever et al., ICML'13] On the importance of initialization and momentum in deep learning
\theta^{k+1} = \theta^k + v^{k+1}: the step is largest when a sequence of gradients all point in the same direction. (Source: I. Goodfellow)
The hyperparameters are \alpha and \beta; \beta is often set to 0.9.
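The two momentum lines carry over directly; grad stands for the minibatch gradient \nabla_\theta L, and the toy objective is the same illustrative quadratic as before:

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    # note: the velocity v is vector-valued, one entry per parameter
    v_next = beta * v - alpha * grad(theta)   # exponentially-weighted average of gradients
    return theta + v_next, v_next

grad = lambda th: np.array([th[0], 10.0 * th[1]])
theta, v = np.array([2.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, v = momentum_step(theta, v, grad)
```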
Nesterov momentum: look ahead along the velocity before computing the gradient.

\tilde{\theta}^{k+1} = \theta^k + \beta \cdot v^k
v^{k+1} = \beta \cdot v^k - \alpha \cdot \nabla_\theta L(\tilde{\theta}^{k+1})
\theta^{k+1} = \theta^k + v^{k+1}

Nesterov, Yurii E. "A method for solving the convex programming problem with convergence rate O(1/k^2)." Dokl. Akad. Nauk SSSR (1983). (Source: G. Hinton)
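The only change from plain momentum is where the gradient is evaluated; a sketch under the same assumptions as the momentum code above:

```python
def nesterov_step(theta, v, grad, alpha=0.01, beta=0.9):
    theta_ahead = theta + beta * v                  # look-ahead point
    v_next = beta * v - alpha * grad(theta_ahead)   # gradient at the look-ahead point
    return theta + v_next, v_next
```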
RMSProp divides the learning rate by an exponentially-decaying average of squared gradients.
[Figure: one direction with small gradients, one with large gradients. Source: Andrew Ng]
Hinton et al. "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural networks for machine learning 4.2 (2012): 26-31.
s^{k+1} = \beta \cdot s^k + (1 - \beta)\,[\nabla_\theta L \circ \nabla_\theta L]    (\circ: element-wise multiplication)

\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}

Hyperparameters: \alpha (needs tuning!), \beta (often 0.9), \epsilon (typically 10^{-8})
[Figure: X-direction with small gradients, Y-direction with large gradients. Source: A. Ng]
We're dividing by the square root of the accumulated squared gradients: large gradients get divided by a large value, small gradients by a small value. This (uncentered) variance of the gradients is a second moment: it dampens the oscillating directions so they no longer diverge, and we can increase the learning rate → faster learning.
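As a sketch, the per-coordinate scaling amounts to one extra state vector s (the constants are the typical values named above):

```python
import numpy as np

def rmsprop_step(theta, s, grad, alpha=0.001, beta=0.9, eps=1e-8):
    g = grad(theta)
    s_next = beta * s + (1.0 - beta) * g * g                  # running average of squared gradients
    theta_next = theta - alpha * g / (np.sqrt(s_next) + eps)  # per-coordinate step sizes
    return theta_next, s_next
```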
Idea: combine Momentum and RMSProp.

m^{k+1} = \beta_1 \cdot m^k + (1 - \beta_1)\,\nabla_\theta L(\theta^k)    (first moment: mean of the gradients)
v^{k+1} = \beta_2 \cdot v^k + (1 - \beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]    (second moment: variance of the gradients)

\theta^{k+1} = \theta^k - \alpha \cdot \frac{m^{k+1}}{\sqrt{v^{k+1}} + \epsilon}

Note: this is not yet the update rule of Adam.

[Kingma et al., ICLR'15] Adam: A method for stochastic optimization
m^{k+1} and v^{k+1} are initialized with zero → biased towards zero → we need bias-corrected moment updates:

\hat{m}^{k+1} = \frac{m^{k+1}}{1 - \beta_1^{k+1}}, \qquad \hat{v}^{k+1} = \frac{v^{k+1}}{1 - \beta_2^{k+1}}

\theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}

This is the update rule of Adam.
Adam in full: exponentially-weighted averages of the gradients and of the squared gradients (combines first and second moments):

m^{k+1} = \beta_1 \cdot m^k + (1 - \beta_1)\,\nabla_\theta L(\theta^k)
v^{k+1} = \beta_2 \cdot v^k + (1 - \beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]
\hat{m}^{k+1} = \frac{m^{k+1}}{1 - \beta_1^{k+1}}, \qquad \hat{v}^{k+1} = \frac{v^{k+1}}{1 - \beta_2^{k+1}}
\theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}

Defaults in PyTorch: \beta_1 often 0.9, \beta_2 often 0.999, \epsilon typically 10^{-8}; \alpha needs tuning!
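Putting the four equations together, one Adam iteration in NumPy (a sketch, with g the current minibatch gradient and k counting iterations from 0):

```python
import numpy as np

def adam_step(theta, m, v, g, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g          # first moment: mean of gradients
    v = beta2 * v + (1.0 - beta2) * g * g      # second moment: uncentered variance
    m_hat = m / (1.0 - beta1 ** (k + 1))       # bias correction (moments start at zero)
    v_hat = v / (1.0 - beta2 ** (k + 1))
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```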
Adam is mostly the method of choice. It's actually fun to play around with SGD updates: it's easy, and you get pretty immediate feedback.
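In practice you rarely write these updates yourself; in PyTorch (see the reading list at the end) they are one-liners. The model and data below are placeholders:

```python
import torch

model = torch.nn.Linear(10, 1)                       # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # or torch.optim.SGD(..., momentum=0.9)

x, y = torch.randn(32, 10), torch.randn(32, 1)       # a dummy minibatch
loss = torch.nn.functional.mse_loss(model(x), y)
opt.zero_grad()    # clear gradients from the previous iteration
loss.backward()    # backprop: gradients w.r.t. all parameters
opt.step()         # one optimizer update
```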
Visual comparisons of these optimizers:
Source: http://ruder.io/optimizing-gradient-descent/
Source: https://github.com/Jaewan-Yun/optimizer-visualization
SECOND DERIVATIVE

f: \mathbb{R} \to \mathbb{R}: derivative \frac{df(x)}{dx}
f: \mathbb{R}^m \to \mathbb{R}: gradient \nabla_x f(x) = \big( \frac{df(x)}{dx_1}, \frac{df(x)}{dx_2}, \dots \big)
f: \mathbb{R}^m \to \mathbb{R}^n: Jacobian J \in \mathbb{R}^{n \times m}
f: \mathbb{R}^m \to \mathbb{R}: Hessian H \in \mathbb{R}^{m \times m}
Newton's method: approximate our function by a second-order Taylor series expansion:

L(\theta) \approx L(\theta_0) + (\theta - \theta_0)^T \nabla_\theta L(\theta_0) + \frac{1}{2} (\theta - \theta_0)^T H\, (\theta - \theta_0)

with the first derivative (gradient) and the second derivative (curvature) at \theta_0. More info: https://en.wikipedia.org/wiki/Taylor_series
Differentiate and set to zero to get the update step:

\theta^* = \theta_0 - H^{-1} \nabla_\theta L(\theta_0)

Compare SGD: \theta^{k+1} = \theta^k - \alpha \nabla_\theta L(\theta^k, x_i, y_i).

We got rid of the learning rate!
Update step: \theta^* = \theta_0 - H^{-1} \nabla_\theta L(\theta_0)

But: a network has millions of parameters \theta, the Hessian then has (millions)^2 elements, and the computational complexity of the 'inversion' is roughly cubic per iteration.
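For a small problem the Newton step is perfectly feasible; on the illustrative quadratic from earlier it reaches the optimum in a single step (note that we solve the linear system instead of forming H^{-1}):

```python
import numpy as np

# same anisotropic quadratic as before: f(x) = 0.5*x0^2 + 5*x1^2
grad_f = lambda x: np.array([x[0], 10.0 * x[1]])
H = np.array([[1.0,  0.0],
              [0.0, 10.0]])                     # constant Hessian of a quadratic

x0 = np.array([2.0, 1.0])
x_star = x0 - np.linalg.solve(H, grad_f(x0))    # Newton step, no learning rate
print(x_star)                                   # -> [0. 0.]: exact for quadratics
```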
Newton's method uses the curvature to take a more direct route to the minimum.
[Figure: gradient descent vs. Newton's method. Source: https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization]
Linear least squares: J(\theta) = (y - X\theta)^T (y - X\theta)

Can you apply Newton's method for linear regression? What do you get as a result?
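One way to see the answer (our worked derivation, not spelled out on the slide): differentiate the cost twice and take a single Newton step from any \theta_0.

\nabla_\theta J(\theta) = -2 X^T (y - X\theta), \qquad H = 2 X^T X

\theta_1 = \theta_0 - H^{-1} \nabla_\theta J(\theta_0) = \theta_0 + (X^T X)^{-1} X^T (y - X\theta_0) = (X^T X)^{-1} X^T y

A single Newton step lands exactly on the least-squares solution (the normal equations), because the cost is quadratic and the second-order Taylor expansion is exact.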
\theta^* = \theta_0 - H^{-1} \nabla_\theta L(\theta_0)
Gauss-Newton (GN):
– Newton step: x^{k+1} = x^k - H_f(x^k)^{-1} \nabla f(x^k)
– 'True' second derivatives are often hard to obtain (e.g., numerics)
– For a least-squares objective f(x) = \|F(x)\|^2, approximate H_f \approx 2 J_F^T J_F
– x^{k+1} = x^k - [\,2 J_F(x^k)^T J_F(x^k)\,]^{-1} \nabla f(x^k)
– Rewritten as a linear system (explicitly inverting is numerically unstable):
  2 J_F(x^k)^T J_F(x^k)\,(x^k - x^{k+1}) = \nabla f(x^k)
  → solve for the delta vector x^k - x^{k+1}
Levenberg:
– "damped" version of Gauss-Newton:
  [\,J_F(x^k)^T J_F(x^k) + \lambda \cdot I\,]\,(x^k - x^{k+1}) = \nabla f(x^k)
– The damping factor \lambda is adjusted in each iteration, ensuring f(x^k) > f(x^{k+1})
– Interpolates between Gauss-Newton (small \lambda) and Gradient Descent (large \lambda)
– The damping term \lambda \cdot I acts as a Tikhonov regularization
Levenberg-Marquardt (LM):

[\,J_F(x^k)^T J_F(x^k) + \lambda \cdot \mathrm{diag}(J_F(x^k)^T J_F(x^k))\,]\,(x^k - x^{k+1}) = \nabla f(x^k)

– Instead of a plain Gradient Descent for large \lambda, scale each component of the gradient according to the curvature.
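A sketch of one Levenberg-style iteration for f(x) = \|F(x)\|^2, with a user-supplied residual F and Jacobian J_F; the damping is kept fixed here, whereas a full implementation would adapt \lambda per iteration:

```python
import numpy as np

def levenberg_step(x, F, J, lam):
    # solve [2 J^T J + lam*I] (x - x_next) = grad f(x) instead of inverting
    r, Jx = F(x), J(x)
    grad = 2.0 * Jx.T @ r                          # grad f(x) for f = ||F(x)||^2
    A = 2.0 * Jx.T @ Jx + lam * np.eye(len(x))     # damped Gauss-Newton matrix
    return x - np.linalg.solve(A, grad)

# usage: fit y = exp(a*t) via the residuals F(a) = exp(a*t) - y
t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * t)
F = lambda a: np.exp(a[0] * t) - y
J = lambda a: (t * np.exp(a[0] * t)).reshape(-1, 1)
a = np.array([0.0])
for _ in range(20):
    a = levenberg_step(a, F, J, lam=1e-3)
print(a)  # -> approximately [0.7]
```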
What to use in practice?
– SGD with momentum or Adam
– Newton, L-BFGS, GN, LM only if you can do full-batch updates (doesn't work well for minibatches!!)
  – This practically never happens for DL
  – Theoretically, it would be nice though due to the fast convergence
Linear Systems (Ax = b):
– LU, QR, Cholesky, Jacobi, Gauss-Seidel, CG, PCG, etc.

Non-linear (gradient-based):
– Newton, Gauss-Newton, LM, (L-)BFGS ← second order
– Gradient Descent, SGD ← first order

Others:
– Genetic algorithms, MCMC, Metropolis-Hastings, etc.
– Constrained and convex solvers (Lagrange, ADMM, Primal-Dual, etc.)
And remember: gradient descent is not a way to solve a linear system!
Next steps:
– Check exercises
– Check office hours

Next lecture:
– Training Neural Networks
Further reading:
– Goodfellow et al., "Deep Learning" (2016)
  – Chapter 8: Optimization
– Bishop, "Pattern Recognition and Machine Learning" (2006)
  – Chapter 5.2: Network training (gradient descent)
  – Chapter 5.4: The Hessian Matrix (second order methods)
– PyTorch optimizers: https://pytorch.org/docs/stable/optim.html