Lecture 5 Recap (Prof. Leal-Taixé and Prof. Niessner)


SLIDE 1

Lecture 5 recap

SLIDE 2

Neural Network

Width and depth of the network (diagram)

SLIDE 3

Gradient Descent for Neural Networks


Two-layer network with inputs x_l, hidden units h_k, outputs ŷ_j and targets y_j:

h_k = A(b_{0,k} + Σ_l w_{0,k,l} · x_l)
ŷ_j = A(b_{1,j} + Σ_k w_{1,j,k} · h_k)

Per-output loss: L_j = (ŷ_j − y_j)²

Gradient with respect to all weights and biases:
∇_{w,b} f = [∂f/∂w_{0,0,0}, …, ∂f/∂w_{m,n,o}, …, ∂f/∂b_{m,n}]

The activation is just a simple ReLU: A(x) = max(0, x)

SLIDE 4

Stochastic Gradient Descent (SGD)

θ_{l+1} = θ_l − α · ∇_θ L(θ_l, x_{1..n}, y_{1..n})

∇_θ L = (1/n) · Σ_{i=1}^{n} ∇_θ L_i

Note the terminology: iteration vs epoch


Here l refers to the l-th iteration, n is the number of training samples in the current batch, and ∇_θ L is the gradient for the l-th batch.
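The update above amounts to one line of code per iteration. A minimal NumPy-style sketch, assuming a hypothetical compute_gradient(theta, x_batch, y_batch) that returns the average minibatch gradient ∇_θ L (e.g., obtained via backprop):

    def sgd_step(theta, x_batch, y_batch, compute_gradient, lr=1e-2):
        # Average gradient over the n samples of the current batch
        grad = compute_gradient(theta, x_batch, y_batch)
        # theta_{l+1} = theta_l - alpha * grad
        return theta - lr * grad

One pass over all minibatches of the training set is an epoch; each sgd_step call is one iteration.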

SLIDE 5

Gradient Descent with Momentum

v_{l+1} = β · v_l + ∇_θ L(θ_l)
θ_{l+1} = θ_l − α · v_{l+1}

Exponentially-weighted average of gradients. Important: the velocity v_l is vector-valued!


∇_θ L(θ_l): gradient of the current minibatch; v: velocity; β: velocity accumulation rate ('friction', momentum); α: learning rate; θ: model parameters.

SLIDE 6

Gradient Descent with Momentum

θ_{l+1} = θ_l − α · v_{l+1}


The step will be largest when a sequence of gradients all point in the same direction.

  • Fig. credit: I. Goodfellow

Hyperparameters are α and β; β is often set to 0.9
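A minimal sketch of the momentum update (hypothetical helper, not the lecture's code; grad is the current minibatch gradient ∇_θ L(θ_l), and the velocity has the same shape as θ):

    def momentum_step(theta, velocity, grad, lr=1e-2, friction=0.9):
        # v_{l+1} = beta * v_l + grad  (exponentially-weighted accumulation)
        velocity = friction * velocity + grad
        # theta_{l+1} = theta_l - alpha * v_{l+1}
        theta = theta - lr * velocity
        return theta, velocity

The velocity is carried across iterations (initialized with zeros), which is why the step grows when consecutive gradients agree in direction.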

SLIDE 7

RMSProp

s_{l+1} = β · s_l + (1 − β) · [∇_θ L ∘ ∇_θ L]
θ_{l+1} = θ_l − α · ∇_θ L / (√(s_{l+1}) + ε)

Hyperparameters: α, β, ε


ε is typically 10⁻⁸ and β is often 0.9; ∘ denotes element-wise multiplication; the learning rate α needs tuning!

SLIDE 8

RMSProp


X-direction: small gradients; Y-direction: large gradients.

  • Fig. credit: A. Ng

s_{l+1} = β · s_l + (1 − β) · [∇_θ L ∘ ∇_θ L]
θ_{l+1} = θ_l − α · ∇_θ L / (√(s_{l+1}) + ε)

We are dividing by the square root of the accumulated squared gradients:

  • Division in Y-Direction will be large
  • Division in X-Direction will be small

The accumulator s is the (uncentered) variance of the gradients → the second moment.

Can increase learning rate!
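A minimal sketch of the RMSProp update (hypothetical helper, not the lecture's code; grad is the current minibatch gradient):

    import numpy as np

    def rmsprop_step(theta, s, grad, lr=1e-3, beta=0.9, eps=1e-8):
        # s_{l+1} = beta * s_l + (1 - beta) * (grad ∘ grad)
        s = beta * s + (1.0 - beta) * grad * grad
        # theta_{l+1} = theta_l - alpha * grad / (sqrt(s_{l+1}) + eps)
        theta = theta - lr * grad / (np.sqrt(s) + eps)
        return theta, s

Directions with large squared gradients are divided by a large value and damped; directions with small gradients keep larger effective steps.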

SLIDE 9

Adaptive Moment Estimation (Adam)

Combines Momentum and RMSProp:

m_{l+1} = β₁ · m_l + (1 − β₁) · ∇_θ L(θ_l)
v_{l+1} = β₂ · v_l + (1 − β₂) · [∇_θ L(θ_l) ∘ ∇_θ L(θ_l)]
θ_{l+1} = θ_l − α · m_{l+1} / (√(v_{l+1}) + ε)


First moment: mean of the gradients. Second moment: (uncentered) variance of the gradients.

SLIDE 10

Adam

Combines Momentum and RMSProp

m_{l+1} = β₁ · m_l + (1 − β₁) · ∇_θ L(θ_l)
v_{l+1} = β₂ · v_l + (1 − β₂) · [∇_θ L(θ_l) ∘ ∇_θ L(θ_l)]

θ_{l+1} = θ_l − α · m̂_{l+1} / (√(v̂_{l+1}) + ε)


m_{l+1} and v_{l+1} are initialized with zero
→ biased towards zero

Typically, bias-corrected moment updates are used:
m̂_{l+1} = m_{l+1} / (1 − β₁^{l+1})
v̂_{l+1} = v_{l+1} / (1 − β₂^{l+1})
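A minimal sketch of an Adam step with bias correction (hypothetical helper; t is the 1-based iteration counter):

    import numpy as np

    def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # First moment: running mean of the gradients
        m = beta1 * m + (1.0 - beta1) * grad
        # Second moment: running (uncentered) variance of the gradients
        v = beta2 * v + (1.0 - beta2) * grad * grad
        # Bias correction, since m and v start at zero
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v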

SLIDE 11

Convergence

SLIDE 12

Convergence

SLIDE 13

Importance of Learning Rate

SLIDE 14

Jacobian and Hessian

  • Derivative
  • Gradient
  • Jacobian
  • Hessian (second derivative)

SLIDE 15

Newton’s method

  • Approximate our function by a second-order Taylor series expansion

https://en.wikipedia.org/wiki/Taylor_series

Terms: first derivative; second derivative (curvature)

SLIDE 16

Newton’s method

  • Differentiate and equate to zero

Update step (compare with SGD): we got rid of the learning rate!
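In our notation (a sketch; the slide's exact symbols are not preserved in this transcript): expanding the loss to second order, L(θ) ≈ L(θ_l) + ∇L(θ_l)ᵀ(θ − θ_l) + ½ (θ − θ_l)ᵀ H (θ − θ_l), and setting the derivative to zero gives the update θ_{l+1} = θ_l − H⁻¹ ∇L(θ_l), with no learning rate. A tiny NumPy illustration on a quadratic, where a single Newton step jumps straight to the minimum:

    import numpy as np

    A = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive-definite Hessian
    b = np.array([1.0, -2.0])

    def grad_and_hess(theta):
        # L(theta) = 0.5 * theta^T A theta - b^T theta
        return A @ theta - b, A

    theta = np.array([5.0, 5.0])
    grad, hess = grad_and_hess(theta)
    # Newton step: theta <- theta - H^{-1} * grad (solve instead of inverting)
    theta = theta - np.linalg.solve(hess, grad)
    print(theta)  # equals the minimizer A^{-1} b of this quadratic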

SLIDE 17

Newton’s method

  • Differentiate and equate to zero

Update step: the parameters of a network are in the millions, so consider the number of elements in the Hessian and the computational complexity of the 'inversion' per iteration.

SLIDE 18

Newton’s method

  • SGD (green)
  • Newton's method exploits the curvature to take a more direct route

Image from Wikipedia

SLIDE 19

Newton’s method

Can you apply Newton’s method for linear regression? What do you get as a result?

SLIDE 20

BFGS and L-BFGS

  • Broyden-Fletcher-Goldfarb-Shanno algorithm
  • Belongs to the family of quasi-Newton methods
  • Have an approximation of the inverse of the Hessian
  • BFGS
  • Limited memory: L-BFGS

SLIDE 21

Gauss-Newton

  • Newton: x_{l+1} = x_l − H_F(x_l)⁻¹ · ∇F(x_l)
    – 'true' 2nd derivatives are often hard to obtain (e.g., numerics)
    – H_F ≈ 2 · J_F^T · J_F
  • Gauss-Newton (GN):
    x_{l+1} = x_l − [2 · J_F(x_l)^T · J_F(x_l)]⁻¹ · ∇F(x_l)
  • Solve the linear system (again, inverting a matrix is unstable):
    2 · J_F(x_l)^T · J_F(x_l) · (x_l − x_{l+1}) = ∇F(x_l)
    → solve for the delta vector x_l − x_{l+1}
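A minimal sketch of one Gauss-Newton step for a least-squares objective F(x) = Σ_i r_i(x)², assuming hypothetical functions residuals(x) and jacobian(x) that return the residual vector r(x) and its Jacobian:

    import numpy as np

    def gauss_newton_step(x, residuals, jacobian):
        r = residuals(x)      # residual vector
        J = jacobian(x)       # Jacobian of the residuals, shape (num_residuals, dim_x)
        # Normal equations: (J^T J) * delta = J^T r, with delta = x_l - x_{l+1}
        # (grad F = 2 J^T r and H_F ≈ 2 J^T J, so the factor 2 cancels)
        delta = np.linalg.solve(J.T @ J, J.T @ r)
        return x - delta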

SLIDE 22

Levenberg

  • Levenberg
    – "damped" version of Gauss-Newton:
      [J_F(x_l)^T · J_F(x_l) + μ · I] · (x_l − x_{l+1}) = ∇F(x_l)
    – The damping factor μ is adjusted in each iteration, ensuring F(x_l) > F(x_{l+1})
      • if the inequality is not fulfilled, increase μ
  • Trust region
  • "Interpolation" between Gauss-Newton (small μ) and Gradient Descent (large μ)
  • The μ · I term corresponds to Tikhonov regularization

SLIDE 23

Levenberg-Marquardt

  • Levenberg-Marquardt (LM)
    [J_F(x_l)^T · J_F(x_l) + μ · diag(J_F(x_l)^T · J_F(x_l))] · (x_l − x_{l+1}) = ∇F(x_l)
    – Instead of a plain Gradient Descent for large μ, scale each component of the gradient according to the curvature
  • Avoids slow convergence in components with a small gradient
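A minimal sketch of the LM step and damping logic (hypothetical helper; the concrete factors 0.5 and 2.0 for adjusting μ are an assumption, not from the slides):

    import numpy as np

    def lm_step(x, residuals, jacobian, mu):
        r = residuals(x)
        J = jacobian(x)
        JtJ = J.T @ J
        # Damped normal equations with curvature-dependent scaling:
        # (J^T J + mu * diag(J^T J)) * delta = J^T r
        delta = np.linalg.solve(JtJ + mu * np.diag(np.diag(JtJ)), J.T @ r)
        x_new = x - delta
        # Accept the step and decrease mu if the objective improved,
        # otherwise reject it and increase mu (trust-region behaviour)
        if np.sum(residuals(x_new) ** 2) < np.sum(r ** 2):
            return x_new, mu * 0.5
        return x, mu * 2.0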

SLIDE 24

Which, what and when?

  • Standard: Adam
  • Fallback option: SGD with momentum
  • Newton, L-BFGS, GN, LM only if you can do full batch updates (doesn't work well for minibatches!!)


This practically never happens for DL. Theoretically it would be nice though, due to fast convergence.

SLIDE 25

General Optimization

  • Linear Systems (Ax = b)

– LU, QR, Cholesky, Jacobi, Gauss-Seidel, CG, PCG, etc.

  • Non-linear (gradient-based)

– Newton, Gauss-Newton, LM, (L)BFGS ← second order
– Gradient Descent, SGD ← first order

  • Others:

– Genetic algorithms, MCMC, Metropolis-Hastings, etc.
– Constrained and convex solvers (Lagrange, ADMM, Primal-Dual, etc.)

SLIDE 26

Please Remember!

  • Think about your problem and optimization at hand
  • SGD is specifically designed for minibatches
  • When you can, use a 2nd order method → it's just faster
  • GD or SGD is not a way to solve a linear system!

SLIDE 27

Importance of Learning Rate

SLIDE 28

Learning Rate


Need a high learning rate when far away from the optimum; need a low learning rate when close to it.

SLIDE 29

Learning Rate Decay

  • α = 1 / (1 + decay_rate · epoch) · α₀
    – E.g., α₀ = 0.1, decay_rate = 1.0
      → Epoch 0: 0.1
      → Epoch 1: 0.05
      → Epoch 2: 0.033
      → Epoch 3: 0.025 ...


(Figure: learning rate over epochs.)
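A quick sketch reproducing the example above (assuming the learning rate is simply divided by 1 + decay_rate · epoch):

    base_lr = 0.1
    decay_rate = 1.0
    for epoch in range(4):
        lr = base_lr / (1.0 + decay_rate * epoch)
        print(epoch, round(lr, 3))   # 0.1, 0.05, 0.033, 0.025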

SLIDE 30

Learning Rate Decay

Many options:

  • Step decay: α = α − t · α (only every n steps)
    – t is the decay rate (often 0.5)
  • Exponential decay: α = t^epoch · α₀
    – t is the decay rate (t < 1.0)
  • α = t / √epoch · α₀
    – t is the decay rate
  • Etc.
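The schedules above as small Python functions (a sketch of one plausible reading of the formulas, not the lecture's reference code):

    import math

    def step_decay(lr, t=0.5, n=10, iteration=0):
        # lr <- lr - t * lr, applied only every n steps
        return lr - t * lr if iteration > 0 and iteration % n == 0 else lr

    def exponential_decay(base_lr, t, epoch):
        # alpha = t^epoch * alpha_0, with t < 1.0
        return (t ** epoch) * base_lr

    def inv_sqrt_decay(base_lr, t, epoch):
        # alpha = t / sqrt(epoch) * alpha_0 (guarding against epoch = 0)
        return base_lr * t / math.sqrt(max(epoch, 1))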
SLIDE 31

Training Schedule

Manually specify the learning rate for the entire training process

  • Manually set the learning rate every n epochs
  • How?
    – Trial and error (the hard way)
    – Some experience (only generalizes to some degree)

Consider: #epochs, training set size, network size, etc.

SLIDE 32

Learning Rate: Implications

  • What if too high?
  • What if too low?
SLIDE 33

Training

  • Given a dataset with ground truth labels
    – {x_i, y_i}
      • For instance, x_i is the i-th training image, with label y_i
      • Often dim(x) ≫ dim(y) (e.g., for classification)
      • i is often in the 100-thousands or millions
    – Take the network f and its parameters w, b
    – Use SGD (or a variation) to find the optimal parameters w, b
  • Gradients from backprop
SLIDE 34

Learning

  • Learning means generalization to an unknown dataset
    – (so far no 'real' learning)
    – I.e., train on a known dataset → test with the optimized parameters on an unknown dataset
  • Basically, we hope that, based on the train set, the optimized parameters will give similar results on different data (i.e., test data)

SLIDE 35

Learning

  • Training set (‘train’):

– Use for training your neural network

  • Validation set (‘val’):

– Hyperparameter optimization
– Check generalization progress

  • Test set (‘test’):

– Only for the very end
– NEVER TO TOUCH DURING DEVELOPMENT OR TRAINING

SLIDE 36

Learning

  • Typical splits
    – Train (60%), Val (20%), Test (20%)
    – Train (80%), Val (10%), Test (10%)
  • During training:
    – The train error comes from the average mini-batch error
    – Typically evaluate on a subset of the validation set every n iterations
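A minimal sketch of such a split for NumPy arrays x and y (the shuffling and the 60/20/20 default are assumptions):

    import numpy as np

    def split_dataset(x, y, val_frac=0.2, test_frac=0.2, seed=0):
        # Shuffle indices, then carve out test, val and train partitions
        idx = np.random.default_rng(seed).permutation(len(x))
        n_test = int(len(x) * test_frac)
        n_val = int(len(x) * val_frac)
        test_idx = idx[:n_test]
        val_idx = idx[n_test:n_test + n_val]
        train_idx = idx[n_test + n_val:]
        return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx]), (x[test_idx], y[test_idx])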

SLIDE 37

Learning

  • Training graph
  • Accuracy
  • Loss

(EMA smoothing)

SLIDE 38

Learning

  • Validation graph
SLIDE 39

Over- and Underfitting


Underfitted Appropriate Overfitted

Figure extracted from Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017

SLIDE 40

Over- and Underfitting


Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html

SLIDE 41

Hyperparameters

  • Network architecture (e.g., num layers, #weights)
  • Number of iterations
  • Learning rate(s) (i.e., solver parameters, decay, etc.)
  • Regularization (more next lecture)
  • Batch size
  • Overall: learning setup + optimization = hyperparameters
SLIDE 42

Hyperparameter Tuning

  • Methods:

– Manual search: most common
– Grid search (structured, for 'real' applications): define ranges for all parameter spaces and select points (usually pseudo-uniformly distributed); iterate over all possible configurations
– Random search: like grid search, but points are picked at random within the predefined ranges

SLIDE 43

Simple Grid Search Example

    learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]
    regularization_strengths = [1e2, 1e3, 1e4, 1e5]
    num_iters = [500, 1000, 1500]

    best_val = 0
    for learning_rate in learning_rates:
        for reg in regularization_strengths:
            for iterations in num_iters:
                model = train_model(learning_rate, reg, iterations)
                validation_accuracy = evaluate(model)
                if validation_accuracy > best_val:
                    best_val = validation_accuracy
                    best_model = model
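For comparison, a random-search variant of the same loop (a sketch: it reuses the hypothetical train_model and evaluate from above and samples the learning rate and regularization strength log-uniformly from the same ranges):

    import random

    best_val, best_model = 0, None
    for _ in range(20):
        learning_rate = 10 ** random.uniform(-5, -2)
        reg = 10 ** random.uniform(2, 5)
        iterations = random.choice([500, 1000, 1500])
        model = train_model(learning_rate, reg, iterations)
        validation_accuracy = evaluate(model)
        if validation_accuracy > best_val:
            best_val, best_model = validation_accuracy, model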

SLIDE 44

Cross Validation

  • Example: k=5

Figure extracted from cs231n

SLIDE 45

Cross Validation


  • Used when the data set is extremely small and/or our method of choice has low training times
  • Partition the data into k subsets, train on k−1 and evaluate performance on the remaining subset
  • To reduce variability: perform on different partitions and average the results
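A minimal k-fold sketch of this procedure (hypothetical train_model and evaluate functions; k = 5 as in the example):

    import numpy as np

    def cross_validate(x, y, train_model, evaluate, k=5):
        folds_x = np.array_split(x, k)
        folds_y = np.array_split(y, k)
        scores = []
        for i in range(k):
            # Fold i is held out for evaluation; the remaining k-1 folds are used for training
            x_val, y_val = folds_x[i], folds_y[i]
            x_train = np.concatenate(folds_x[:i] + folds_x[i + 1:])
            y_train = np.concatenate(folds_y[:i] + folds_y[i + 1:])
            model = train_model(x_train, y_train)
            scores.append(evaluate(model, x_val, y_val))
        # Average over the k partitions to reduce variability
        return float(np.mean(scores))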

slide-46
SLIDE 46

Cross Validation


Results for k=5

Hyperparameter value

Figure extracted from cs231n

SLIDE 47

Basic recipe for machine learning

SLIDE 48

Basic recipe for machine learning

  • Split your data

Find your hyperparameters: train 60% / validation 20% / test 20%

SLIDE 49

Basic recipe for machine learning

  • Split your data

Split: train 60% / validation 20% / test 20%

Human level error ...... 1%
Training set error ..... 5%   (gap to human level → bias, or underfitting)
Val/test set error ..... 8%   (gap to training error → variance, or overfitting)

SLIDE 50

Basic recipe for machine learning


Credits: A. Ng

More on

SLIDE 51

Next lecture

  • Monday: Deadline Ex1!
  • Next Tuesday:

– Discussion of the exercise solution and presentation of exercise 2

  • Next lecture on Dec 6th:

– Training Neural Networks

SLIDE 52

See you next week!
