SLIDE 1

Training Neural Networks

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 1

SLIDE 2

Lecture 5 Recap

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 2

SLIDE 3

Gradient Descent for Neural Networks

Two-layer network: inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, outputs $\hat{y}_0, \hat{y}_1$, targets $y_0, y_1$.

$\hat{y}_i = A\Big(b_{1,i} + \sum_j h_j \, w_{1,i,j}\Big)$, where $h_j = A\Big(b_{0,j} + \sum_k x_k \, w_{0,j,k}\Big)$

Loss function: $L_i = (\hat{y}_i - y_i)^2$

Just simple: $A(x) = \max(0, x)$

$\nabla_{\boldsymbol{W},\boldsymbol{b}}\, f_{\boldsymbol{x},\boldsymbol{y}}(\boldsymbol{W}) = \left[\frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{l,m,n}}, \dots, \frac{\partial f}{\partial b_{m,n}}\right]$

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 3
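To make the slide's formulas concrete, here is a minimal NumPy sketch of the forward pass and loss of such a two-layer ReLU network (layer sizes, random weights, and targets are made-up example values, not from the slide):

```python
import numpy as np

def relu(x):
    # A(x) = max(0, x)
    return np.maximum(0, x)

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                            # inputs x_0..x_2
W0, b0 = rng.standard_normal((4, 3)), np.zeros(4)     # hidden layer weights/biases
W1, b1 = rng.standard_normal((2, 4)), np.zeros(2)     # output layer weights/biases
y = np.array([1.0, 0.0])                              # targets y_0, y_1

h = relu(b0 + W0 @ x)                                 # h_j = A(b_0j + sum_k x_k w_0jk)
y_hat = relu(b1 + W1 @ h)                             # y^_i = A(b_1i + sum_j h_j w_1ij)
loss = np.sum((y_hat - y) ** 2)                       # L = sum_i (y^_i - y_i)^2
print(loss)
```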

SLIDE 4

Stochastic Gradient Descent (SGD)

πœΎπ‘™+1 = πœΎπ‘™ βˆ’ π›½π›ΌπœΎπ‘€(πœΎπ‘™, π’š{1..𝑛}, 𝒛{1..𝑛}) π›ΌπœΎπ‘€ =

1 𝑛 σ𝑗=1 𝑛 π›ΌπœΎπ‘€π‘—

:

𝑙 now refers to 𝑙-th iteration 𝑛 training samples in the current minibatch Gradient for the 𝑙-th minibatch

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 4
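A minimal sketch of one SGD iteration as defined above, assuming a hypothetical helper `grad_fn(theta, x_i, y_i)` that returns the per-sample gradient of $L_i$:

```python
import numpy as np

def sgd_step(theta, minibatch, grad_fn, lr=1e-2):
    # grad_fn(theta, x_i, y_i) returns the per-sample gradient dL_i/dtheta (hypothetical helper)
    grads = [grad_fn(theta, x_i, y_i) for x_i, y_i in minibatch]
    grad = np.mean(grads, axis=0)   # (1/m) * sum_i grad(L_i), averaged over the minibatch
    return theta - lr * grad        # theta^{k+1} = theta^k - alpha * grad
```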

SLIDE 5

Gradient Descent with Momentum

π’˜π‘™+1 = 𝛾 β‹… π’˜π‘™ + π›ΌπœΎπ‘€(πœΎπ‘™) πœΎπ‘™+1 = πœΎπ‘™ βˆ’ 𝛽 β‹… π’˜π‘™+1 Exponentially-weighted average of gradient Important: velocity π’˜π‘™ is vector-valued!

Gradient of current minibatch velocity accumulation rate (β€˜friction’, momentum) learning rate velocity model

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 5
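A minimal sketch of the momentum update as written on the slide; `grad` stands for the current minibatch gradient, and beta = 0.9 is a common default rather than a value from the slide (note that some frameworks fold the learning rate into the velocity instead):

```python
def momentum_step(theta, velocity, grad, lr=1e-2, beta=0.9):
    # v^{k+1} = beta * v^k + grad
    velocity = beta * velocity + grad
    # theta^{k+1} = theta^k - lr * v^{k+1}
    theta = theta - lr * velocity
    return theta, velocity
```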

SLIDE 6

RMSProp

X-direction: small gradients. Y-direction: large gradients.

Source: A. Ng

$\boldsymbol{s}^{k+1} = \beta \cdot \boldsymbol{s}^{k} + (1 - \beta)\,[\nabla_{\boldsymbol{\theta}} L \circ \nabla_{\boldsymbol{\theta}} L]$
$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} - \alpha \cdot \frac{\nabla_{\boldsymbol{\theta}} L}{\sqrt{\boldsymbol{s}^{k+1}} + \epsilon}$

We are dividing by the squared gradients:

  • Division in the Y-direction will be large
  • Division in the X-direction will be small

(Uncentered) variance of the gradients → second momentum. Can increase the learning rate!

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 6
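The same update as a short sketch (the decay rate 0.9 and epsilon are common defaults, not values from the slide):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=1e-3, beta=0.9, eps=1e-8):
    # s^{k+1} = beta * s^k + (1 - beta) * (grad * grad): running average of squared gradients
    s = beta * s + (1.0 - beta) * grad * grad
    # theta^{k+1} = theta^k - lr * grad / (sqrt(s^{k+1}) + eps)
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s
```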

SLIDE 7

Adam

  • Combines Momentum and RMSProp

$\boldsymbol{m}^{k+1} = \beta_1 \cdot \boldsymbol{m}^{k} + (1 - \beta_1)\, \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{k})$
$\boldsymbol{v}^{k+1} = \beta_2 \cdot \boldsymbol{v}^{k} + (1 - \beta_2)\, \big[\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{k}) \circ \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{k})\big]$

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 7

  • $\boldsymbol{m}^{k+1}$ and $\boldsymbol{v}^{k+1}$ are initialized with zero

→ bias towards zero → typically, bias-corrected moment updates:

$\hat{\boldsymbol{m}}^{k+1} = \frac{\boldsymbol{m}^{k+1}}{1 - \beta_1^{\,k+1}}$, $\quad \hat{\boldsymbol{v}}^{k+1} = \frac{\boldsymbol{v}^{k+1}}{1 - \beta_2^{\,k+1}}$

$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} - \alpha \cdot \frac{\hat{\boldsymbol{m}}^{k+1}}{\sqrt{\hat{\boldsymbol{v}}^{k+1}} + \epsilon}$
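Putting the Adam equations together as a small sketch (the defaults beta1 = 0.9, beta2 = 0.999, eps = 1e-8 are the commonly used values, not taken from the slide); `k` is the 1-based iteration counter used for bias correction:

```python
import numpy as np

def adam_step(theta, m, v, grad, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad               # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad * grad        # second moment (RMSProp-style)
    m_hat = m / (1.0 - beta1 ** k)                     # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** k)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```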

SLIDE 8

Training Neural Nets

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 8

SLIDE 9

Learning Rate: Implications

  • What if too high?
  • What if too low?

Source: http://cs231n.github.io/neural-networks-3/

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 9

SLIDE 10

Learning Rate

Need high learning rate when far away. Need low learning rate when close.

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 10

SLIDE 11

Learning Rate Decay

  • $\alpha = \frac{1}{1 + \mathit{decay\_rate} \cdot \mathit{epoch}} \cdot \alpha_0$

– E.g., $\alpha_0 = 0.1$, $\mathit{decay\_rate} = 1.0$ → Epoch 0: 0.1 → Epoch 1: 0.05 → Epoch 2: 0.033 → Epoch 3: 0.025 ...

[Plot: learning rate over epochs]

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 11
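The numeric example above can be reproduced in a couple of lines (a sketch, not library code):

```python
def decayed_lr(alpha0, decay_rate, epoch):
    # alpha = alpha0 / (1 + decay_rate * epoch)
    return alpha0 / (1.0 + decay_rate * epoch)

for epoch in range(4):
    print(epoch, round(decayed_lr(0.1, 1.0, epoch), 3))   # 0.1, 0.05, 0.033, 0.025
```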

SLIDE 12

Learning Rate Decay

Many options:

  • Step decay: $\alpha = \alpha - t \cdot \alpha$ (only every n steps)

– $t$ is the decay rate (often 0.5)

  • Exponential decay: $\alpha = t^{\mathit{epoch}} \cdot \alpha_0$

– $t$ is the decay rate ($t < 1.0$)

  • $\alpha = \frac{t}{\sqrt{\mathit{epoch}}} \cdot \alpha_0$

– $t$ is the decay rate

  • Etc.

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 12
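In practice these schedules are usually not hand-rolled; a hedged PyTorch sketch using built-in schedulers (the model, optimizer, and hyperparameter values are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step decay: multiply the learning rate by gamma every step_size epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
# Exponential decay instead: lr = lr0 * gamma^epoch
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(30):
    # ... train one epoch ...
    scheduler.step()                                 # update the learning rate once per epoch
```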

SLIDE 13

Training Schedule

Manually specify learning rate for entire training process

  • Manually set learning rate every n-epochs
  • How?

– Trial and error (the hard way)
– Some experience (only generalizes to some degree)

Consider: #epochs, training set size, network size, etc.

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 13

SLIDE 14

Basic Recipe for Training

  • Given a dataset with ground truth labels

– $\{x_i, y_i\}$

  • $x_i$ is the $i$-th training image, with label $y_i$
  • Often $\dim(x) \gg \dim(y)$ (e.g., for classification)
  • $i$ is often in the 100-thousands or millions

– Take network $f$ and its parameters $\boldsymbol{W}, \boldsymbol{b}$
– Use SGD (or a variation) to find the optimal parameters $\boldsymbol{W}, \boldsymbol{b}$

  • Gradients from backpropagation

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 14

SLIDE 15

Gradient Descent on Train Set

  • Given a large train set with $n$ training samples $\{\boldsymbol{x}_i, \boldsymbol{y}_i\}$

– Let's say 1 million labeled images
– Let's say our network has 500k parameters

  • Gradient has 500k dimensions
  • $n$ = 1 million
  • Extremely expensive to compute

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 15

SLIDE 16

Learning

  • Learning means generalization to unknown dataset

– (So far no 'real' learning)
– I.e., train on known dataset → test with optimized parameters on unknown dataset

  • Basically, we hope that, based on the train set, the optimized parameters will give similar results on different data (i.e., test data)

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 16

SLIDE 17

Learning

  • Training set ('train'):

– Use for training your neural network

  • Validation set ('val'):

– Hyperparameter optimization
– Check generalization progress

  • Test set ('test'):

– Only for the very end
– NEVER TOUCH DURING DEVELOPMENT OR TRAINING

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 17

SLIDE 18

Learning

  • Typical splits

– Train (60%), Val (20%), Test (20%)
– Train (80%), Val (10%), Test (10%)

  • During training:

– Train error comes from average minibatch error
– Typically take subset of validation every n iterations

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 18
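A minimal sketch of such a split over shuffled indices (the 60/20/20 fractions follow the typical split mentioned above; the helper name is made up):

```python
import numpy as np

def split_dataset(num_samples, val_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle once, then cut into train / val / test index sets.
    idx = np.random.default_rng(seed).permutation(num_samples)
    n_val, n_test = int(val_frac * num_samples), int(test_frac * num_samples)
    val_idx, test_idx = idx[:n_val], idx[n_val:n_val + n_test]
    train_idx = idx[n_val + n_test:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_dataset(1000)   # 600 / 200 / 200 samples
```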

SLIDE 19

Basic Recipe for Machine Learning

  • Split your data

Find your hyperparameters

[Figure: data split into train (60%), validation (20%), test (20%)]

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 19

SLIDE 20

Basic Recipe for Machine Learning

  • Split your data

[Figure: data split into train (60%), validation (20%), test (20%)]

Ground truth error ......... 1%
Training set error ......... 5%
Val/test set error ......... 8%

Bias (underfitting): gap between ground truth and training error. Variance (overfitting): gap between training and val/test error.

Example scenario

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 20

SLIDE 21

Basic Recipe for Machine Learning

Credits: A. Ng


I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 21

SLIDE 22

Over- and Underfitting

Underfitted Appropriate Overfitted

Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 22

SLIDE 23

Over- and Underfitting

Source: https://srdas.github.io/DLBook/ImprovingModelGeneralization.html

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 23

SLIDE 24

Learning Curves

  • Training graphs
  • Accuracy
  • Loss

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 24

SLIDE 25

Learning Curves


Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 25

SLIDE 26

Overfitting Curves

Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 26

SLIDE 27

Other Curves

Underfitting (loss still decreasing)
Validation set is easier than training set

Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 27

SLIDE 28

To Summarize

  • Underfitting

– Training and validation losses decrease even at the end of training

  • Overfitting

– Training loss decreases and validation loss increases

  • Ideal Training

– Small gap between training and validation loss, and both go down at same rate (stable without fluctuations).

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 28

SLIDE 29

To Summarize

  • Bad Signs

– Training error not going down
– Validation error not going down
– Performance on validation better than on training set
– Tests on train set different than during training

  • Bad Practice

– Training set contains test data
– Debug algorithm on test data

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 29

Never touch during development or training

SLIDE 30

Hyperparameters

  • Network architecture (e.g., num layers, #weights)
  • Number of iterations
  • Learning rate(s) (i.e., solver parameters, decay, etc.)
  • Regularization (more in the next lecture)
  • Batch size
  • …
  • Overall:

learning setup + optimization = hyperparameters

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 30

SLIDE 31

Hyperparameter Tuning

  • Methods:

– Manual search:

  • most common 

– Grid search (structured, for 'real' applications)

  • Define ranges for all parameter spaces and select points
  • Usually pseudo-uniformly distributed

→ Iterate over all possible configurations

– Random search:

Like grid search but one picks points at random in the predefined ranges

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 31

[Plots: grid search vs. random search over two hyperparameters (First Parameter × Second Parameter)]
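A minimal sketch of random search over two such hyperparameters (the ranges and the log-uniform sampling are illustrative assumptions, and the training/evaluation helpers are hypothetical):

```python
import random

def sample_config():
    # Pick points at random within predefined ranges (here: log-uniform).
    return {
        "lr": 10 ** random.uniform(-4, -1),             # learning rate in [1e-4, 1e-1]
        "weight_decay": 10 ** random.uniform(-5, -3),   # weight decay in [1e-5, 1e-3]
    }

configs = [sample_config() for _ in range(20)]          # fixed evaluation budget
# best = max(configs, key=lambda cfg: validate(train(cfg)))   # hypothetical helpers
```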

SLIDE 32

How to Start

  • Start with single training sample

– Check if output correct
– Overfit → train accuracy should be 100% because the input is just memorized

  • Increase to handful of samples (e.g., 4)

– Check if input is handled correctly

  • Move from overfitting to more samples

– 5, 10, 100, 1000, ...
– At some point, you should see generalization

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 32
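A hedged PyTorch sketch of this sanity check on a single sample (the tiny network, data, and optimizer settings are placeholders, not from the slide):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
x, y = torch.randn(1, 10), torch.tensor([1])            # a single training sample
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

for step in range(200):                                  # overfit the one sample
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# Loss should approach 0 and train accuracy 100%: the sample is simply memorized.
print(loss.item(), (model(x).argmax(dim=1) == y).float().mean().item())
```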


SLIDE 33

Find a Good Learning Rate

  • Use all training data with small weight decay
  • Perform an initial loss sanity check, e.g., $\log(C)$ for softmax with $C$ classes
  • Find a learning rate that makes the loss drop significantly (exponentially) within 100 iterations
  • Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 33

[Plot: loss over training time]
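The log(C) sanity check in a few lines (a sketch; the batch of all-zero logits just simulates an uninformative, freshly initialized classifier):

```python
import math
import torch

num_classes = 10
print(math.log(num_classes))                              # expected initial loss ≈ 2.303

logits = torch.zeros(8, num_classes)                      # uniform softmax predictions
targets = torch.randint(0, num_classes, (8,))
print(torch.nn.functional.cross_entropy(logits, targets).item())   # ≈ 2.303
```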

SLIDE 34

Coarse Grid Search

  • Choose a few values of learning rate and weight decay around what worked from the previous step

  • Train a few models for a few epochs.
  • Good weight decay to try: 1e-4, 1e-5, 0

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 34

[Plot: grid search over two hyperparameters (First Parameter × Second Parameter)]
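A minimal sketch of such a coarse grid over learning rate and weight decay (the values reuse the ones suggested above; the training/evaluation helper is hypothetical):

```python
import itertools

learning_rates = [1e-1, 1e-2, 1e-3, 1e-4]
weight_decays = [1e-4, 1e-5, 0]

results = {}
for lr, wd in itertools.product(learning_rates, weight_decays):
    # results[(lr, wd)] = train_for_a_few_epochs(lr=lr, weight_decay=wd)  # hypothetical helper
    pass
# best_lr, best_wd = max(results, key=results.get)
```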

SLIDE 35

Refine Grid

  • Pick best models found with coarse grid.
  • Refine grid search around these models.
  • Train them for longer (10-20 epochs) without learning rate decay

  • Study loss curves <- most important debugging tool!

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 35

SLIDE 36

Timings

  • How long does each iteration take?

– Get precise timings!
– If an iteration exceeds 500ms, things get dicey

  • Look for bottlenecks

– Dataloading: smaller resolution, compression, train from SSD
– Backprop

  • Estimate total time

– How long until you see some pattern?
– How long till convergence?

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 36
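A minimal sketch of getting such per-iteration timings, split into data loading and compute (the two helpers are hypothetical placeholders for your dataloader and forward/backward step):

```python
import time

def timed_iteration(fetch_batch, train_step):
    t0 = time.perf_counter()
    batch = fetch_batch()                  # data loading
    t1 = time.perf_counter()
    train_step(batch)                      # forward + backward + optimizer step
    t2 = time.perf_counter()
    print(f"data: {(t1 - t0) * 1e3:.1f} ms, compute: {(t2 - t1) * 1e3:.1f} ms")
```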

SLIDE 37

Network Architecture

  • Frequent mistake: "Let's use this super big network, train for two weeks, and see where we stand."

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 37

  • Instead: start with the simplest network possible

– Rule of thumb: divide the #layers you started with by 5

  • Get debug cycles down

– Ideally, minutes

SLIDE 38

Debugging

  • Use train/validation/test curves

– Evaluation needs to be consistent
– Numbers need to be comparable

  • Only make one change at a time

– "I've added 5 more layers and doubled the training size, and now I also trained 5 days longer. Now it's better, but why?"

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 38

SLIDE 39

Common Mistakes in Practice

  • Did not overfit to single batch first
  • Forgot to toggle train/eval mode for network

– Check later when we talk about dropout…

  • Forgot to call .zero_grad() (in PyTorch) before calling .backward()
  • Passed softmaxed outputs to a loss function that expects raw logits

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 39
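A hedged PyTorch sketch of a training loop that avoids the mistakes listed above (the model, data, and hyperparameters are dummy placeholders):

```python
import torch

model = torch.nn.Linear(784, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = torch.nn.CrossEntropyLoss()      # expects raw logits (applies log-softmax internally)
train_loader = [(torch.randn(32, 784), torch.randint(0, 10, (32,))) for _ in range(3)]  # dummy data

model.train()                                # toggle train mode (matters for dropout/batchnorm)
for x, y in train_loader:
    optimizer.zero_grad()                    # clear old gradients before backward()
    logits = model(x)                        # pass raw logits to the loss, not softmaxed outputs
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()

model.eval()                                 # switch to eval mode before validation/testing
```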

SLIDE 40

Tensorboard: Visualization in Practice

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 40
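A minimal sketch of logging scalars to TensorBoard with PyTorch's SummaryWriter (the run directory, tag names, and dummy loss values are made up); view the curves with `tensorboard --logdir runs`:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/experiment_1")
for epoch in range(10):
    train_loss, val_loss = 1.0 / (epoch + 1), 1.2 / (epoch + 1)   # dummy values
    writer.add_scalar("Loss/train", train_loss, epoch)
    writer.add_scalar("Loss/val", val_loss, epoch)
writer.close()
```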

SLIDE 41

Tensorboard: Compare Train/Val Curves


I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 41

SLIDE 42

Tensorboard: Compare Different Runs


I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 42

SLIDE 43

Tensorboard: Visualize Model Predictions


I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 43

SLIDE 44

Tensorboard: Visualize Model Predictions


I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 44

SLIDE 45

Tensorboard: Compare Hyperparameters


I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 45

SLIDE 46

Next Lecture

  • Next lecture

– More about training neural networks: output functions, loss functions, activation functions

  • Check the exercises 

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 46

SLIDE 47

See you next week 

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 47

SLIDE 48

References

  • Goodfellow et al. "Deep Learning" (2016),

– Chapter 6: Deep Feedforward Networks

  • Bishop "Pattern Recognition and Machine Learning" (2006),

– Chapter 5.5: Regularization in Neural Networks

  • http://cs231n.github.io/neural-networks-1/
  • http://cs231n.github.io/neural-networks-2/
  • http://cs231n.github.io/neural-networks-3/

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 48