Scaling Optimization - I2DL: Prof. Niessner, Prof. Leal-Taixé - PowerPoint PPT Presentation



slide-1
SLIDE 1

Scaling Optimization

1 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-2
SLIDE 2

Lecture 4 Recap

2 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-3
SLIDE 3

Neural Network

Source: http://cs231n.github.io/neural-networks-1/

3 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-4
SLIDE 4

Neural Network

Depth Width

4

Input Layer Hidden Layer 1 Hidden Layer 2 Hidden Layer 3 Output Layer

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-5
SLIDE 5

Compute Graphs → Neural Networks

Compute-graph diagram: inputs $x_0, x_1$ (input layer), weights $w_0, w_1$ (the unknowns!), multiply and add nodes, a ReLU activation $\max(0, x)$, and an L2 loss/cost against the output/target (e.g., class label / regression target). We want to compute gradients w.r.t. all weights $\mathbf{W}$.

(btw. I'm not arguing this is the right choice here)

5 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-6
SLIDE 6

Compute Graphs → Neural Networks

Compute-graph diagram with multiple outputs: inputs $x_0, x_1$ (input layer), weighted sums with weights $w_{i,j}$, predictions $\hat{y}_0, \hat{y}_1, \hat{y}_2$ (output layer), each compared with its target $y_i$ through its own loss/cost term. We want to compute gradients w.r.t. all weights $\mathbf{W}$.

6 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-7
SLIDE 7

Compute Graphs → Neural Networks

Prediction per output (activation function $A$, bias $b_i$): $\hat{y}_i = A\bigl(b_i + \sum_k x_k\, w_{i,k}\bigr)$

$L$: sum over the loss per sample, e.g. L2 loss → simply sum up the squares: $L = \sum_i L_i$, with $L_i = (\hat{y}_i - y_i)^2$

Goal: we want to compute gradients of the loss function $L$ w.r.t. all weights $w$ AND all biases $b$:

$\dfrac{\partial L}{\partial w_{i,k}} = \dfrac{\partial L}{\partial \hat{y}_i} \cdot \dfrac{\partial \hat{y}_i}{\partial w_{i,k}}$ ⟶ use the chain rule to compute the partials

7 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-8
SLIDE 8

Summary

  • We have

– (Directional) compute graph – Structure graph into layers – Compute partial derivatives w.r.t. weights (unknowns)

  • Next

– Find weights based on gradients

Gradient step: $W' = W - \alpha\, \nabla_W f_{x,y}(W)$, with $\nabla_W f_{x,y}(W) = \Bigl(\dfrac{\partial f}{\partial w_{0,0,0}}, \;\dots\;, \dfrac{\partial f}{\partial w_{l,m,n}}, \;\dots\;, \dfrac{\partial f}{\partial b_{l,m}}\Bigr)$

8 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-9
SLIDE 9

Optimization

9 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-10
SLIDE 10

Gradient Descent

$x^* = \arg\min_x f(x)$

10

Optimum

Initialization

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-11
SLIDE 11

Gradient Descent

Follow the slope of the derivative:

$\dfrac{df(x)}{dx} = \lim_{h \to 0} \dfrac{f(x+h) - f(x)}{h}$

11

$x^* = \arg\min_x f(x)$

Initialization

Optimum

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-12
SLIDE 12

Gradient Descent

• From derivative to gradient: $\dfrac{df(x)}{dx} \;\rightarrow\; \nabla_x f(x)$ (direction of greatest increase of the function)
• Gradient steps in the direction of the negative gradient:

$x' = x - \alpha \nabla_x f(x)$, where $\alpha$ is the learning rate (a code sketch follows after this slide)

12 I2DL: Prof. Niessner, Prof. Leal-Taixé
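As referenced above, a minimal sketch of this update rule in Python/NumPy. The toy objective f and its analytic derivative grad_f are illustrative assumptions (they do not appear in the slides); the learning rate alpha is the only hyperparameter:

```python
import numpy as np

def f(x):
    return (x - 3.0) ** 2        # toy convex objective (illustrative assumption)

def grad_f(x):
    return 2.0 * (x - 3.0)       # its analytic derivative df/dx

x = np.float64(0.0)              # initialization
alpha = 0.1                      # learning rate: too small -> slow, too large -> divergence
for step in range(100):
    x = x - alpha * grad_f(x)    # gradient step: x' = x - alpha * df/dx
print(x)                         # approaches the optimum x* = 3
```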

slide-13
SLIDE 13

Gradient Descent

• From derivative to gradient: $\dfrac{df(x)}{dx} \;\rightarrow\; \nabla_x f(x)$ (direction of greatest increase of the function)
• Gradient steps in the direction of the negative gradient:

$x' = x - \alpha \nabla_x f(x)$

13

SMALL learning rate: $x' = x - \alpha \nabla_x f(x)$

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-14
SLIDE 14

Gradient Descent

• From derivative to gradient: $\dfrac{df(x)}{dx} \;\rightarrow\; \nabla_x f(x)$ (direction of greatest increase of the function)
• Gradient steps in the direction of the negative gradient:

$x' = x - \alpha \nabla_x f(x)$

14

LARGE learning rate: $x' = x - \alpha \nabla_x f(x)$

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-15
SLIDE 15

Gradient Descent

Optimum

Not guaranteed to reach the global optimum

$\mathbf{x}^* = \arg\min_{\mathbf{x}} f(\mathbf{x})$

Initialization

What is the gradient when we reach this point?

15 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-16
SLIDE 16

Convergence of Gradient Descent

  • Convex function: all local minima are global minima

If line/plane segment between any two points lies above or on the graph

Source: https://en.wikipedia.org/wiki/Convex_function#/media/File:ConvexFunction.svg

16 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-17
SLIDE 17

Convergence of Gradient Descent

  • Neural networks are non-convex

– many (different) local minima – no (practical) way to say which is globally optimal

Source: Li, Qi. (2006). Challenging Registration of Geologic Image Data

17 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-18
SLIDE 18

Convergence of Gradient Descent

Source: https://builtin.com/data-science/gradient-descent

18 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-19
SLIDE 19

Convergence of Gradient Descent

Source: A. Geron

19 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-20
SLIDE 20

Gradient Descent: Multiple Dimensions

Various ways to visualize…

Source: builtin.com/data-science/gradient-descent

20 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-21
SLIDE 21

Gradient Descent: Multiple Dimensions

Source: http://blog.datumbox.com/wp-content/uploads/2013/10/gradient-descent.png

21 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-22
SLIDE 22

Gradient Descent for Neural Networks

Two-layer network with inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, predictions $\hat{y}_0, \hat{y}_1$ and targets $y_0, y_1$:

$\hat{y}_i = A\bigl(b_{1,i} + \sum_k h_k\, w_{1,i,k}\bigr)$, $\quad h_k = A\bigl(b_{0,k} + \sum_j x_j\, w_{0,k,j}\bigr)$

Loss function: $L_i = (\hat{y}_i - y_i)^2$; just simple: $A(x) = \max(0, x)$

$\nabla_{W,b}\, f_{x,y}(W) = \Bigl(\dfrac{\partial f}{\partial w_{0,0,0}}, \;\dots\;, \dfrac{\partial f}{\partial w_{l,m,n}}, \;\dots\;, \dfrac{\partial f}{\partial b_{l,m}}\Bigr)$

22 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-23
SLIDE 23

Gradient Descent: Single Training Sample

• Given a loss function $L$ and a single training sample $\{x_i, y_i\}$
• Find the best model parameters $\theta = (W, b)$
• Cost: $L_i(\theta, x_i, y_i)$
  – $\theta = \arg\min_\theta L_i(x_i, y_i)$
• Gradient Descent:
  – Initialize $\theta_1$ with 'random' values (more on that later)
  – $\theta_{k+1} = \theta_k - \alpha \nabla_\theta L_i(\theta_k, x_i, y_i)$
  – Iterate until convergence: $\|\theta_{k+1} - \theta_k\| < \epsilon$

23 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-24
SLIDE 24

Gradient Descent: Single Training Sample

– $\theta_{k+1} = \theta_k - \alpha \nabla_\theta L_i(\theta_k, x_i, y_i)$
– $\nabla_\theta L_i(\theta_k, x_i, y_i)$ is computed via backpropagation
– Typically: $\dim\bigl(\nabla_\theta L_i(\theta_k, x_i, y_i)\bigr) = \dim(\theta) \gg 1$ million

Legend: $\theta_k$ = weights and biases at step $k$ (current model); $\theta_{k+1}$ = weights and biases after the update step; $\alpha$ = learning rate; $\nabla_\theta L_i$ = gradient of the loss function w.r.t. $\theta$ for the training sample $\{x_i, y_i\}$.

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-25
SLIDE 25

Gradient Descent: Multiple Training Samples

• Given a loss function $L$ and multiple ($n$) training samples $\{x_i, y_i\}$
• Find the best model parameters $\theta = (W, b)$
• Cost: $L = \dfrac{1}{n} \sum_{i=1}^{n} L_i(\theta, x_i, y_i)$
  – $\theta = \arg\min_\theta L$

25 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-26
SLIDE 26

Gradient Descent: Multiple Training Samples

• Update step for multiple samples (a code sketch follows after this slide):

$\theta_{k+1} = \theta_k - \alpha \nabla_\theta L\bigl(\theta_k, x_{\{1..n\}}, y_{\{1..n\}}\bigr)$

• Gradient is the average / sum over residuals (reminder: the per-sample gradients come from backprop):

$\nabla_\theta L\bigl(\theta_k, x_{\{1..n\}}, y_{\{1..n\}}\bigr) = \dfrac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i\bigl(\theta_k, x_i, y_i\bigr)$

• Often people are lazy and just write $\nabla L = \sum_{i=1}^{n} \nabla_\theta L_i$; omitting the $\frac{1}{n}$ is not 'wrong', it just means rescaling the learning rate

I2DL: Prof. Niessner, Prof. Leal-Taixé
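The sketch referenced above, assuming a hypothetical callable per_sample_grad(theta, x_i, y_i) that returns the backprop gradient for one sample; averaging those gradients gives the full-batch gradient:

```python
import numpy as np

def batch_gradient(theta, xs, ys, per_sample_grad):
    # mean over per-sample gradients; dropping the 1/n just rescales the learning rate
    grads = [per_sample_grad(theta, x_i, y_i) for x_i, y_i in zip(xs, ys)]
    return np.mean(grads, axis=0)

def gradient_descent_step(theta, xs, ys, per_sample_grad, alpha=0.01):
    return theta - alpha * batch_gradient(theta, xs, ys, per_sample_grad)
```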

slide-27
SLIDE 27

Side Note: Optimal Learning Rate

Can compute the optimal learning rate $\alpha$ using line search (optimal for a given set):

1. Compute the gradient: $\nabla_\theta L = \dfrac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i$
2. Optimize for the optimal step size $\alpha$: $\arg\min_\alpha L(\theta_k - \alpha \nabla_\theta L)$
3. Update: $\theta_{k+1} = \theta_k - \alpha \nabla_\theta L$

Not that practical for DL since we would need to solve a huge additional problem at every step…

27 I2DL: Prof. Niessner, Prof. Leal-Taixé
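A coarse, grid-based line-search sketch under stated assumptions (the callables loss(theta) and grad(theta) are placeholders); step 2 above would ideally solve the 1D problem exactly, which is exactly the extra cost that makes this impractical for DL:

```python
import numpy as np

def line_search_step(theta, loss, grad, candidates=np.logspace(-4, 0, 20)):
    g = grad(theta)
    # pick the step size that minimizes the loss along the negative gradient direction
    best_alpha = min(candidates, key=lambda a: loss(theta - a * g))
    return theta - best_alpha * g
```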

slide-28
SLIDE 28

Gradient Descent on Train Set

• Given a large train set with $n$ training samples $\{x_i, y_i\}$
  – Let's say 1 million labeled images
  – Let's say our network has 500k parameters
• The gradient has 500k dimensions
• $n = 1$ million
→ Extremely expensive to compute

28 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-29
SLIDE 29

Stochastic Gradient Descent (SGD)

• If we have $n$ training samples, we need to compute the gradient for all of them, which is $O(n)$
• If we consider the problem as empirical risk minimization, we can express the total loss over the training data as the expectation over all samples:

$\dfrac{1}{n} \sum_{i=1}^{n} L_i(\theta, x_i, y_i) = \mathbb{E}_{i \sim \{1,\dots,n\}}\bigl[L_i(\theta, x_i, y_i)\bigr]$

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-30
SLIDE 30

Stochastic Gradient Descent (SGD)

• The expectation can be approximated with a small subset of the data:

$\mathbb{E}_{i \sim \{1,\dots,n\}}\bigl[L_i(\theta, x_i, y_i)\bigr] \approx \dfrac{1}{|S|} \sum_{j \in S} L_j(\theta, x_j, y_j)$, with $S \subseteq \{1, \dots, n\}$

Minibatch: choose a subset of the train set of size $m \ll n$:
$B_i = \{\{x_1, y_1\}, \{x_2, y_2\}, \dots, \{x_m, y_m\}\}$, giving minibatches $\{B_1, B_2, \dots, B_{n/m}\}$

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-31
SLIDE 31

Stochastic Gradient Descent (SGD)

• Minibatch size is a hyperparameter
  – Typically a power of 2 → 8, 16, 32, 64, 128…
  – Smaller batch size means greater variance in the gradients → noisy updates
  – Mostly limited by GPU memory (in the backward pass)
  – E.g.,
    • Train set has $n = 2^{20}$ (about 1 million) images
    • With batch size $m = 64$: $B_{1 \dots n/m} = B_{1 \dots 16{,}384}$ minibatches

(Epoch = complete pass through the training set; a minibatch sketch in code follows after this slide)

I2DL: Prof. Niessner, Prof. Leal-Taixé
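The minibatch sketch referenced above: splitting a training set (assumed here to be NumPy arrays xs, ys) into shuffled minibatches; one full pass over all of them is one epoch:

```python
import numpy as np

def minibatches(xs, ys, batch_size=64, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(xs))           # shuffle the sample indices
    for start in range(0, len(xs), batch_size):
        idx = order[start:start + batch_size]  # yields n/m minibatches per epoch
        yield xs[idx], ys[idx]

# one epoch: for x_batch, y_batch in minibatches(xs, ys): ... do one SGD step ...
```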

slide-32
SLIDE 32

Stochastic Gradient Descent (SGD)

$\theta_{k+1} = \theta_k - \alpha \nabla_\theta L\bigl(\theta_k, x_{\{1..m\}}, y_{\{1..m\}}\bigr)$, with $\nabla_\theta L = \dfrac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i$

Note the terminology: iteration vs. epoch. $k$ now refers to the $k$-th iteration; $m$ is the number of training samples in the current minibatch; the gradient is computed for the $k$-th minibatch.

32 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-33
SLIDE 33

Convergence of SGD

Suppose we want to minimize a function $F(\theta)$ with the stochastic approximation

$\theta_{k+1} = \theta_k - \alpha_k H(\theta_k, X)$,

where $\alpha_1, \alpha_2, \dots, \alpha_n$ is a sequence of positive step sizes and $H(\theta_k, X)$ is an unbiased estimate of $\nabla F(\theta_k)$, i.e. $\mathbb{E}\bigl[H(\theta_k, X)\bigr] = \nabla F(\theta_k)$.

Robbins, H. and Monro, S. "A Stochastic Approximation Method", 1951.
I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-34
SLIDE 34

Convergence of SGD

The update $\theta_{k+1} = \theta_k - \alpha_k H(\theta_k, X)$ converges to a local (global) minimum if the following conditions are met:

1) $\alpha_n \ge 0, \;\; \forall\, n \ge 0$
2) $\sum_{n=1}^{\infty} \alpha_n = \infty$
3) $\sum_{n=1}^{\infty} \alpha_n^2 < \infty$
4) $F(\theta)$ is strictly convex

The sequence proposed by Robbins and Monro is $\alpha_n \propto \dfrac{\alpha}{n}$, for $n > 0$.

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-35
SLIDE 35

Problems of SGD

  • Gradient is scaled equally across all dimensions

→ i.e., cannot independently scale directions → need to have conservative min learning rate to avoid divergence → Slower than ‘necessary’

  • Finding good learning rate is an art by itself

→ More next lecture

35 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-36
SLIDE 36

Gradient Descent with Momentum

We're making many steps back and forth along one dimension; we would love to track that this is averaging out over time. Along the other dimension we would love to go faster, i.e., accumulate gradients over time.

Source: A. Ng

36 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-37
SLIDE 37

Gradient Descent with Momentum

$v_{k+1} = \beta \cdot v_k - \alpha \cdot \nabla_\theta L(\theta_k)$  (exponentially-weighted average of gradients)
$\theta_{k+1} = \theta_k + v_{k+1}$

Important: the velocity $v_k$ is vector-valued!
Legend: $\nabla_\theta L(\theta_k)$ = gradient of the current minibatch, $\beta$ = velocity accumulation rate ('friction', momentum), $\alpha$ = learning rate, $v$ = velocity, $\theta$ = weights of the model.

37

[Sutskever et al., ICML’13] On the importance of initialization and momentum in deep learning

I2DL: Prof. Niessner, Prof. Leal-Taixé
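A sketch of the momentum update above in NumPy terms; grad is a placeholder for the minibatch gradient function, and alpha/beta are the learning rate and accumulation rate:

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=1e-2, beta=0.9):
    v_new = beta * v - alpha * grad(theta)   # exponentially-weighted average of gradients
    theta_new = theta + v_new                # step along the accumulated velocity
    return theta_new, v_new

# the velocity is vector-valued and starts at zero: v = np.zeros_like(theta)
```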

slide-38
SLIDE 38

Gradient Descent with Momentum

$\theta_{k+1} = \theta_k + v_{k+1}$

The step will be largest when a sequence of gradients all point in the same direction.

Source: I. Goodfellow

Hyperparameters are $\alpha$ and $\beta$; $\beta$ is often set to 0.9.

38 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-39
SLIDE 39

Gradient Descent with Momentum

  • Can it overcome local minima?

$\theta_{k+1} = \theta_k + v_{k+1}$

39 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-40
SLIDE 40

Nesterov Momentum

  • Look-ahead momentum

$\tilde{\theta}_{k+1} = \theta_k + \beta \cdot v_k$
$v_{k+1} = \beta \cdot v_k - \alpha \cdot \nabla_\theta L(\tilde{\theta}_{k+1})$
$\theta_{k+1} = \theta_k + v_{k+1}$

Nesterov, Yurii E. "A method for solving the convex programming problem with convergence rate O(1/k²)." Dokl. Akad. Nauk SSSR, Vol. 269, 1983.

40 I2DL: Prof. Niessner, Prof. Leal-Taixé
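The look-ahead variant as a sketch (same placeholder grad as in the momentum sketch): the only change is evaluating the gradient at the look-ahead point theta + beta * v rather than at theta:

```python
def nesterov_step(theta, v, grad, alpha=1e-2, beta=0.9):
    lookahead = theta + beta * v                 # tentative look-ahead position
    v_new = beta * v - alpha * grad(lookahead)   # gradient evaluated at the look-ahead point
    return theta + v_new, v_new
```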

slide-41
SLIDE 41

Nesterov Momentum

Source: G. Hinton

$\tilde{\theta}_{k+1} = \theta_k + \beta \cdot v_k$
$v_{k+1} = \beta \cdot v_k - \alpha \cdot \nabla_\theta L(\tilde{\theta}_{k+1})$
$\theta_{k+1} = \theta_k + v_{k+1}$

41 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-42
SLIDE 42

Root Mean Squared Prop (RMSProp)

  • RMSProp divides the learning rate by an

exponentially-decaying average of squared gradients.

Small gradients Large gradients

Source: Andrew. Ng Hinton et al. "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural networks for machine learning 4.2 (2012): 26-31.

42 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-43
SLIDE 43

RMSProp

$s_{k+1} = \beta \cdot s_k + (1 - \beta)\,[\nabla_\theta L \circ \nabla_\theta L]$  ($\circ$ = element-wise multiplication)
$\theta_{k+1} = \theta_k - \alpha \cdot \dfrac{\nabla_\theta L}{\sqrt{s_{k+1}} + \epsilon}$

Hyperparameters: $\alpha$ (needs tuning!), $\beta$ (often 0.9), $\epsilon$ (typically $10^{-8}$)

I2DL: Prof. Niessner, Prof. Leal-Taixé
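A sketch of the RMSProp update above; s is the running average of squared gradients and rescales each coordinate of the step (grad is again a placeholder for the minibatch gradient):

```python
import numpy as np

def rmsprop_step(theta, s, grad, alpha=1e-3, beta=0.9, eps=1e-8):
    g = grad(theta)
    s_new = beta * s + (1.0 - beta) * g * g            # element-wise squared gradients
    theta_new = theta - alpha * g / (np.sqrt(s_new) + eps)
    return theta_new, s_new
```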

slide-44
SLIDE 44

RMSProp

X-direction Small gradients Y-Direction Large gradients

Source: A. Ng

$s_{k+1} = \beta \cdot s_k + (1 - \beta)\,[\nabla_\theta L \circ \nabla_\theta L]$
$\theta_{k+1} = \theta_k - \alpha \cdot \dfrac{\nabla_\theta L}{\sqrt{s_{k+1}} + \epsilon}$

We're dividing by the squared gradients:
• the division in the Y-direction (large gradients) will be large
• the division in the X-direction (small gradients) will be small

$s$ is the (uncentered) variance of the gradients → second moment. Can increase the learning rate!

44 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-45
SLIDE 45

RMSProp

  • Dampening the oscillations for high-variance

directions

  • Can use a faster learning rate because it is less likely to diverge

→ speeds up learning
→ second moment

45 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-46
SLIDE 46

Adaptive Moment Estimation (Adam)

Idea: combine Momentum and RMSProp

$m_{k+1} = \beta_1 \cdot m_k + (1 - \beta_1)\,\nabla_\theta L(\theta_k)$  (first moment: mean of gradients)
$v_{k+1} = \beta_2 \cdot v_k + (1 - \beta_2)\,[\nabla_\theta L(\theta_k) \circ \nabla_\theta L(\theta_k)]$  (second moment: variance of gradients)
$\theta_{k+1} = \theta_k - \alpha \cdot \dfrac{m_{k+1}}{\sqrt{v_{k+1}} + \epsilon}$

[Kingma et al., ICLR’15] Adam: A method for stochastic optimization

46

  • Q. What happens at $k = 0$?
  • A. We need bias correction, as $m_0 = 0$ and $v_0 = 0$.

Note: this is not the update rule of Adam.

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-47
SLIDE 47

Adam: Bias Corrected

  • Combines Momentum and RMSProp

$m_{k+1} = \beta_1 \cdot m_k + (1 - \beta_1)\,\nabla_\theta L(\theta_k)$
$v_{k+1} = \beta_2 \cdot v_k + (1 - \beta_2)\,[\nabla_\theta L(\theta_k) \circ \nabla_\theta L(\theta_k)]$

• $m_k$ and $v_k$ are initialized with zero
→ biased towards zero
→ need bias-corrected moment updates:

$\hat{m}_{k+1} = \dfrac{m_{k+1}}{1 - \beta_1^{\,k+1}}$, $\quad \hat{v}_{k+1} = \dfrac{v_{k+1}}{1 - \beta_2^{\,k+1}}$

$\theta_{k+1} = \theta_k - \alpha \cdot \dfrac{\hat{m}_{k+1}}{\sqrt{\hat{v}_{k+1}} + \epsilon}$

This is the update rule of Adam.

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-48
SLIDE 48

Adam

  • Exponentially-decaying mean and variance of the gradients (combines first and second moment)
  • Hyperparameters: $\alpha$, $\beta_1$, $\beta_2$, $\epsilon$

48

$m_{k+1} = \beta_1 \cdot m_k + (1 - \beta_1)\,\nabla_\theta L(\theta_k)$
$v_{k+1} = \beta_2 \cdot v_k + (1 - \beta_2)\,\bigl[\nabla_\theta L(\theta_k) \circ \nabla_\theta L(\theta_k)\bigr]$
$\hat{m}_{k+1} = \dfrac{m_{k+1}}{1 - \beta_1^{\,k+1}}$, $\quad \hat{v}_{k+1} = \dfrac{v_{k+1}}{1 - \beta_2^{\,k+1}}$
$\theta_{k+1} = \theta_k - \alpha \cdot \dfrac{\hat{m}_{k+1}}{\sqrt{\hat{v}_{k+1}} + \epsilon}$

Typical values (defaults in PyTorch): $\beta_1$ often 0.9, $\beta_2$ often 0.999, $\epsilon$ typically $10^{-8}$; $\alpha$ needs tuning!

I2DL: Prof. Niessner, Prof. Leal-Taixé
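A sketch of the bias-corrected Adam update with the default values quoted above; k counts iterations from 0 and m, v start at zero (grad is a placeholder for the minibatch gradient):

```python
import numpy as np

def adam_step(theta, m, v, k, grad, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(theta)
    m = beta1 * m + (1.0 - beta1) * g           # first moment: mean of gradients
    v = beta2 * v + (1.0 - beta2) * g * g       # second moment: uncentered variance
    m_hat = m / (1.0 - beta1 ** (k + 1))        # bias correction (m, v start at zero)
    v_hat = v / (1.0 - beta2 ** (k + 1))
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```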

slide-49
SLIDE 49

There are a few others…

  • ‘Vanilla’ SGD
  • Momentum
  • RMSProp
  • Adagrad
  • Adadelta
  • AdaMax
  • Nadam
  • AMSGrad

Adam is mostly the method of choice for neural networks!

It’s actually fun to play around with SGD updates. It’s easy and you get pretty immediate feedback 

50 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-50
SLIDE 50

Convergence

Source: http://ruder.io/optimizing-gradient-descent/

51 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-51
SLIDE 51

Convergence

Source: http://ruder.io/optimizing-gradient-descent/

52 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-52
SLIDE 52

Convergence

Source: https://github.com/Jaewan-Yun/optimizer-visualization

53 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-53
SLIDE 53

Jacobian and Hessian

  • Derivative: $f: \mathbb{R} \to \mathbb{R}$, $\;\dfrac{df(x)}{dx}$
  • Gradient: $f: \mathbb{R}^m \to \mathbb{R}$, $\;\nabla_x f(x) = \Bigl(\dfrac{df(x)}{dx_1}, \dfrac{df(x)}{dx_2}, \dots\Bigr)$
  • Jacobian: $f: \mathbb{R}^m \to \mathbb{R}^n$, $\;\mathbf{J} \in \mathbb{R}^{n \times m}$
  • Hessian (the second derivative): $f: \mathbb{R}^m \to \mathbb{R}$, $\;\mathbf{H} \in \mathbb{R}^{m \times m}$

54 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-54
SLIDE 54

Newton’s Method

  • Approximate our function by a second-order Taylor series expansion:

$L(\theta) \approx L(\theta_0) + (\theta - \theta_0)^T \nabla_\theta L(\theta_0) + \tfrac{1}{2}\,(\theta - \theta_0)^T \mathbf{H}\,(\theta - \theta_0)$

with the first derivative $\nabla_\theta L(\theta_0)$ and the second derivative (curvature) given by the Hessian $\mathbf{H}$.

55

More info: https://en.wikipedia.org/wiki/Taylor_series

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-55
SLIDE 55

Newton’s Method

  • Differentiate and equate to zero:

Update step: $\theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta L(\theta_0)$

Compare SGD: $\theta_{k+1} = \theta_k - \alpha \nabla_\theta L(\theta_k, x_i, y_i)$

We got rid of the learning rate!

I2DL: Prof. Niessner, Prof. Leal-Taixé
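A sketch of one Newton step for a small parameter vector, assuming placeholder callables grad(theta) and hessian(theta); solving the linear system is preferred over explicitly inverting H, and the next slide shows why this does not scale to millions of parameters:

```python
import numpy as np

def newton_step(theta, grad, hessian):
    g = grad(theta)          # shape (d,)
    H = hessian(theta)       # shape (d, d): d^2 entries, infeasible for d in the millions
    return theta - np.linalg.solve(H, g)   # theta* = theta_0 - H^{-1} * grad
```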

slide-56
SLIDE 56

Newton’s Method

  • Differentiate and equate to zero:

$\theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta L(\theta_0)$

Update step considerations: a network has millions of parameters, so the Hessian has (millions)² elements, and the computational complexity of the 'inversion' per iteration becomes prohibitive.

57 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-57
SLIDE 57

Newton’s Method

  • Gradient Descent (green)
  • Newton’s method exploits

the curvature to take a more direct route

58

Source: https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-58
SLIDE 58

Newton’s Method

$J(\theta) = (\mathbf{y} - \mathbf{X}\theta)^T (\mathbf{y} - \mathbf{X}\theta)$

59

Can you apply Newton’s method for linear regression? What do you get as a result?

I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-59
SLIDE 59

BFGS and L-BFGS

  • Broyden-Fletcher-Goldfarb-Shanno algorithm
  • Belongs to the family of quasi-Newton methods
  • Have an approximation of the inverse of the Hessian
  • BFGS
  • Limited memory: L-BFGS

$\theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta L(\theta_0)$

60 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-60
SLIDE 60

Gauss-Newton

  • Newton step: $x_{k+1} = x_k - H_f(x_k)^{-1} \nabla f(x_k)$
  – 'true' 2nd derivatives are often hard to obtain (e.g., numerics)
  – for a sum-of-squares objective $f = \|F\|^2$: $H_f \approx 2\, J_F^T J_F$
  • Gauss-Newton (GN): $x_{k+1} = x_k - \bigl[2\, J_F(x_k)^T J_F(x_k)\bigr]^{-1} \nabla f(x_k)$
  • Solve a linear system instead (again, inverting a matrix is unstable):

$2\, J_F(x_k)^T J_F(x_k)\,\bigl(x_k - x_{k+1}\bigr) = \nabla f(x_k)$

→ solve for the delta vector $(x_k - x_{k+1})$; a code sketch follows after this slide

61 I2DL: Prof. Niessner, Prof. Leal-Taixé
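The sketch referenced above: one Gauss-Newton step for a least-squares objective, assuming placeholder callables residual(x) and jacobian(x); the Hessian is approximated by 2 J^T J and the delta is obtained by solving the linear system rather than inverting:

```python
import numpy as np

def gauss_newton_step(x, residual, jacobian):
    r = residual(x)                       # residual vector F(x), shape (m,)
    J = jacobian(x)                       # Jacobian J_F(x), shape (m, d)
    grad = 2.0 * J.T @ r                  # gradient of f(x) = ||F(x)||^2
    H_approx = 2.0 * J.T @ J              # Gauss-Newton approximation of the Hessian
    delta = np.linalg.solve(H_approx, grad)   # solve for the delta vector
    return x - delta
```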

slide-61
SLIDE 61

Levenberg

  • Levenberg

– "damped" version of Gauss-Newton:

$\bigl[J_F(x_k)^T J_F(x_k) + \lambda \cdot I\bigr]\,\bigl(x_k - x_{k+1}\bigr) = \nabla f(x_k)$

– The damping factor $\lambda$ is adjusted in each iteration, ensuring $f(x_k) > f(x_{k+1})$

  • If the condition is not fulfilled, increase $\lambda$
  • → trust region
  • → "interpolation" between Gauss-Newton (small $\lambda$) and Gradient Descent (large $\lambda$)

Tikhonov regularization

62 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-62
SLIDE 62

Levenberg-Marquardt

  • Levenberg-Marquardt (LM):

$\bigl[J_F(x_k)^T J_F(x_k) + \lambda \cdot \mathrm{diag}\bigl(J_F(x_k)^T J_F(x_k)\bigr)\bigr]\,\bigl(x_k - x_{k+1}\bigr) = \nabla f(x_k)$

– Instead of a plain Gradient Descent for large $\lambda$, scale each component of the gradient according to the curvature.

  • Avoids slow convergence in components with a small

gradient

63 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-63
SLIDE 63

Which, What, and When?

  • Standard: Adam
  • Fallback option: SGD with momentum
  • Newton, L-BFGS, GN, LM only if you can do full-batch updates (doesn't work well for minibatches!!)

This practically never happens for DL. Theoretically, it would be nice though, due to the fast convergence.

64 I2DL: Prof. Niessner, Prof. Leal-Taixé
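In practice this choice boils down to picking an optimizer from torch.optim; a minimal sketch with a placeholder model and dummy data (the optimizer constructors and the training-loop calls are standard PyTorch, everything else is assumed for illustration):

```python
import torch

model = torch.nn.Linear(10, 2)                              # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # standard choice
# fallback: torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

x = torch.randn(64, 10)                                     # dummy minibatch
y = torch.randn(64, 2)
for step in range(100):
    optimizer.zero_grad()                                   # clear old gradients
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                         # backprop fills .grad
    optimizer.step()                                        # one optimizer update
```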

slide-64
SLIDE 64

General Optimization

  • Linear systems (Ax = b)
    – LU, QR, Cholesky, Jacobi, Gauss-Seidel, CG, PCG, etc.
  • Non-linear (gradient-based)
    – Newton, Gauss-Newton, LM, (L-)BFGS ← second order
    – Gradient Descent, SGD ← first order
  • Others
    – Genetic algorithms, MCMC, Metropolis-Hastings, etc.
    – Constrained and convex solvers (Lagrange, ADMM, Primal-Dual, etc.)

65 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-65
SLIDE 65

Please Remember!

  • Think about your problem and the optimization at hand
  • SGD is specifically designed for minibatches
  • When you can, use a 2nd-order method → it's just faster
  • GD or SGD is not a way to solve a linear system!

66 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-66
SLIDE 66

Next Lecture

  • This week:

– Check exercises – Check office hours 

  • Next lecture

– Training Neural networks

72 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-67
SLIDE 67

See you next week 

73 I2DL: Prof. Niessner, Prof. Leal-Taixé

slide-68
SLIDE 68

Some References to SGD Updates

  • Goodfellow et al. “Deep Learning” (2016),

– Chapter 8: Optimization

  • Bishop “Pattern Recognition and Machine Learning”

(2006),

– Chapter 5.2: Network training (gradient descent) – Chapter 5.4: The Hessian Matrix (second order methods)

  • https://ruder.io/optimizing-gradient-descent/index.html
  • PyTorch Documentation (with further readings)

– https://pytorch.org/docs/stable/optim.html

74 I2DL: Prof. Niessner, Prof. Leal-Taixé