SLIDE 1

Basics of DL

  • Prof. Leal-Taixé and Prof. Niessner

SLIDE 2

What we assume you know

  • Linear Algebra & Programming!
  • Basics from the Introduction to Deep Learning lecture
  • PyTorch (can use TensorFlow…)
  • You have already trained several models and know how to debug problems, observe training curves, and prepare training/validation/test data.

SLIDE 3

What is a Neural Network?

SLIDE 4

Neural Network

  • Linear score function f = W x

(Figure panels: On CIFAR-10, On ImageNet)

Credit: Li/Karpathy/Johnson

SLIDE 5

Neural Network

  • Linear score function f = W x
  • Neural network is a nesting of ‘functions’

– 2-layers: f = W₂ max(0, W₁ x)
– 3-layers: f = W₃ max(0, W₂ max(0, W₁ x))
– 4-layers: f = W₄ tanh(W₃ max(0, W₂ max(0, W₁ x)))
– 5-layers: f = W₅ σ(W₄ tanh(W₃ max(0, W₂ max(0, W₁ x))))

– … up to hundreds of layers
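
To make the nesting concrete, here is a minimal sketch (assuming PyTorch; the dimensions are made up) of the 2- and 3-layer score functions written exactly as nested matrix multiplies and ReLUs:

```python
import torch

x = torch.randn(16384)          # input vector (e.g. a flattened image)
W1 = torch.randn(1000, 16384)   # first layer weights
W2 = torch.randn(10, 1000)      # second layer weights

# 2-layer network: f = W2 max(0, W1 x)
f2 = W2 @ torch.relu(W1 @ x)

# 3-layer network: f = W3 max(0, W2' max(0, W1 x))
W2b = torch.randn(500, 1000)
W3 = torch.randn(10, 500)
f3 = W3 @ torch.relu(W2b @ torch.relu(W1 @ x))
print(f2.shape, f3.shape)       # torch.Size([10]) torch.Size([10])
```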

SLIDE 6

Neural Network


2-layer network: f = W₂ max(0, W₁ x)
– input x: 128 × 128 = 16384 values, hidden layer h: 1000 units, output f: 10 scores (weight matrices W₁, W₂)

1-layer network: f = W x
– input x: 128 × 128 = 16384 values, output f: 10 scores

SLIDE 7

Neural Network


Credit: Li/Karpathy/Johnson

SLIDE 8

Loss functions

SLIDE 9

Neural networks

What is the shape of this function?


(Diagram labels: Prediction, Loss (Softmax, Hinge))

SLIDE 10

Loss functions

  • Softmax loss function
  • Hinge Loss (derived from the Multiclass SVM loss)


Evaluate the ground truth score for the image

SLIDE 11

Loss functions

  • Softmax loss function

– Optimizes until the loss is zero

  • Hinge Loss (derived from the Multiclass SVM loss)

– Saturates whenever it has learned a class “well enough”
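
A small sketch (scores and class index made up) comparing the two losses on one score vector; it illustrates that the hinge loss can already be zero while the softmax (cross-entropy) loss still pushes the correct score higher:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[4.0, 1.0, -2.0]])   # raw scores for 3 classes
target = torch.tensor([0])                  # ground-truth class

# Softmax / cross-entropy loss: never exactly zero, keeps optimizing.
ce = F.cross_entropy(scores, target)

# Multiclass hinge loss: zero once the true class beats all margins.
hinge = F.multi_margin_loss(scores, target, margin=1.0)

print(ce.item(), hinge.item())   # roughly 0.05 vs 0.0
```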

SLIDE 12

Activation functions

SLIDE 13

Sigmoid

Forward


Saturated neurons kill the gradient flow
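
A quick check (inputs made up) of why saturated sigmoid neurons kill the gradient: for large-magnitude inputs the local gradient σ(x)(1 − σ(x)) is essentially zero:

```python
import torch

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)
y = torch.sigmoid(x)
y.sum().backward()          # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
print(x.grad)               # ~[0.00005, 0.105, 0.25, 0.105, 0.00005]
```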

SLIDE 14

Problem of positive output


More on zero-mean data later

SLIDE 15

tanh


Zero-centered, but still saturates (in both tails)

LeCun 1991

SLIDE 16

Rectified Linear Units (ReLU)


Large and consistent gradients. Does not saturate. Fast convergence. What happens if a ReLU outputs zero? Dead ReLU.

SLIDE 17

Maxout units


Generalization of ReLUs

Linear regimes; does not die; does not saturate. Increases the number of parameters.
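
A minimal maxout sketch (two linear pieces; the module name and sizes are made up): the output is the maximum over k linear functions of the input, so it has linear regimes, does not die, and does not saturate, at the cost of k times the parameters:

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """max(w1^T x + b1, w2^T x + b2) applied feature-wise (k = 2 pieces)."""
    def __init__(self, in_features, out_features, k=2):
        super().__init__()
        self.k = k
        self.linear = nn.Linear(in_features, out_features * k)
    def forward(self, x):
        z = self.linear(x)                       # (batch, out * k)
        z = z.view(*x.shape[:-1], -1, self.k)    # (batch, out, k)
        return z.max(dim=-1).values

out = Maxout(16, 8)(torch.randn(4, 16))
print(out.shape)    # torch.Size([4, 8])
```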

SLIDE 18

Optimization

SLIDE 19

Gradient Descent for Neural Networks


Example network: inputs x₀, x₁, x₂; hidden units h₀ … h₃; outputs y₀, y₁; targets t₀, t₁.

h_k = A(b_{0,k} + Σ_l w_{0,k,l} · x_l)
y_j = A(b_{1,j} + Σ_k w_{1,j,k} · h_k)
L_j = (y_j − t_j)²

∇_{W,b} f_{x,t}(W) = [∂f/∂w_{0,0,0}, …, ∂f/∂w_{m,n,o}, …, ∂f/∂b_{m,n}]

Just keep it simple: A(x) = max(0, x)
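
That gradient vector is exactly what autograd computes. A small sketch (dimensions made up) for the two-layer net above, with A(x) = max(0, x) and a squared loss against the target t:

```python
import torch

x = torch.randn(3)                      # inputs x0, x1, x2
t = torch.randn(2)                      # targets t0, t1
W0 = torch.randn(4, 3, requires_grad=True)
b0 = torch.zeros(4, requires_grad=True)
W1 = torch.randn(2, 4, requires_grad=True)
b1 = torch.zeros(2, requires_grad=True)

h = torch.relu(b0 + W0 @ x)             # h_k = A(b_{0,k} + sum_l w_{0,k,l} x_l)
y = torch.relu(b1 + W1 @ h)             # y_j = A(b_{1,j} + sum_k w_{1,j,k} h_k)
loss = ((y - t) ** 2).sum()             # sum_j (y_j - t_j)^2

loss.backward()                         # fills in every dL/dw and dL/db entry
print(W0.grad.shape, b1.grad.shape)     # torch.Size([4, 3]) torch.Size([2])
```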

SLIDE 20

Stochastic Gradient Descent (SGD)

θ_{k+1} = θ_k − α · ∇_θ L(θ_k, x_{1..n}, y_{1..n}),   with   ∇_θ L = (1/n) Σ_{i=1}^{n} ∇_θ L_i

Note the terminology: iteration vs epoch


k now refers to the k-th iteration; n is the number of training samples in the current batch; ∇_θ L is the gradient for the current batch.
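
A minimal sketch of one SGD iteration written out by hand (the model, data, and learning rate are placeholders), matching the update θ_{k+1} = θ_k − α ∇_θ L averaged over the n samples of the current batch:

```python
import torch

theta = torch.randn(5, requires_grad=True)              # parameters θ
x_batch, y_batch = torch.randn(8, 5), torch.randn(8)    # n = 8 samples
alpha = 0.01                                             # learning rate α

pred = x_batch @ theta                                   # toy linear model
loss = ((pred - y_batch) ** 2).mean()                    # (1/n) Σ L_i
loss.backward()                                          # ∇_θ L for this batch

with torch.no_grad():
    theta -= alpha * theta.grad                          # θ_{k+1} = θ_k − α ∇_θ L
    theta.grad.zero_()                                   # reset for the next iteration
```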

SLIDE 21

Gradient Descent with Momentum

v_{k+1} = β · v_k + ∇_θ L(θ_k)
θ_{k+1} = θ_k − α · v_{k+1}
Exponentially-weighted average of the gradient. Important: the velocity v_k is vector-valued!


∇_θ L(θ_k): gradient of the current minibatch; v: velocity; β: accumulation rate (‘friction’, momentum); α: learning rate; θ: model parameters.

SLIDE 22

Gradient Descent with Momentum

θ_{k+1} = θ_k − α · v_{k+1}


The step will be largest when a sequence of gradients all point in the same direction

  • Fig. credit: I. Goodfellow

Hyperparameters are α and β; β is often set to 0.9.

SLIDE 23

RMSProp

s_{k+1} = β · s_k + (1 − β)[∇_θ L ∘ ∇_θ L]
θ_{k+1} = θ_k − α · ∇_θ L / (√(s_{k+1}) + ε)
Hyperparameters: α, β, ε


ε is typically 10⁻⁸; β is often 0.9; ∘ denotes element-wise multiplication; the learning rate α needs tuning!

SLIDE 24

RMSProp


X-direction: small gradients. Y-direction: large gradients.

  • Fig. credit: A. Ng

s_{k+1} = β · s_k + (1 − β)[∇_θ L ∘ ∇_θ L];   θ_{k+1} = θ_k − α · ∇_θ L / (√(s_{k+1}) + ε). We’re dividing by the squared gradients:

  • Division in Y-Direction will be large
  • Division in X-Direction will be small

(uncentered) variance of the gradients -> second momentum

Can increase learning rate!

SLIDE 25

Adaptive Moment Estimation (Adam)

Combines Momentum and RMSProp:
m_{k+1} = β₁ · m_k + (1 − β₁) ∇_θ L(θ_k)
v_{k+1} = β₂ · v_k + (1 − β₂)[∇_θ L(θ_k) ∘ ∇_θ L(θ_k)]
θ_{k+1} = θ_k − α · m_{k+1} / (√(v_{k+1}) + ε)


First momentum m: mean of the gradients. Second momentum v: variance of the gradients.

SLIDE 26

Adam

Combines Momentum and RMSProp

m_{k+1} = β₁ · m_k + (1 − β₁) ∇_θ L(θ_k)
v_{k+1} = β₂ · v_k + (1 − β₂)[∇_θ L(θ_k) ∘ ∇_θ L(θ_k)]

θ_{k+1} = θ_k − α · m̂_{k+1} / (√(v̂_{k+1}) + ε)


m_{k+1} and v_{k+1} are initialized with zero -> bias towards zero

Typically, bias-corrected moment updates are used: m̂_{k+1} = m_{k+1} / (1 − β₁^{k+1}),   v̂_{k+1} = v_{k+1} / (1 − β₂^{k+1})
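
A hand-written Adam step matching these updates (a sketch; in practice one would just call torch.optim.Adam). The hyperparameter values are the common defaults, and the bias correction uses the iteration count k:

```python
import torch

def adam_step(theta, grad, m, v, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; theta, grad, m, v are tensors of the same shape."""
    m = beta1 * m + (1 - beta1) * grad              # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** k)                    # bias correction
    v_hat = v / (1 - beta2 ** k)
    theta = theta - alpha * m_hat / (v_hat.sqrt() + eps)
    return theta, m, v

theta = torch.randn(10)
m, v = torch.zeros_like(theta), torch.zeros_like(theta)
grad = torch.randn(10)                              # stand-in for ∇_θ L
theta, m, v = adam_step(theta, grad, m, v, k=1)
```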

SLIDE 27

Convergence

SLIDE 28

Training NNs

SLIDE 29

Importance of Learning Rate

SLIDE 30

Over- and Underfitting


Figure extracted from Deep Learning by Adam Gibson, Josh Patterson, O‘Reily Media Inc., 2017

SLIDE 31

Over- and Underfitting


Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html

SLIDE 32

Basic recipe for machine learning

  • Split your data

Find your hyperparameters on the validation set. Split: train 60%, validation 20%, test 20%.
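
One way to produce that 60/20/20 split (a sketch assuming a generic PyTorch Dataset; the fractions follow the slide):

```python
import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))
n = len(dataset)
n_train, n_val = int(0.6 * n), int(0.2 * n)
n_test = n - n_train - n_val

train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(0))   # reproducible split
# Tune hyperparameters on val_set; touch test_set only once at the very end.
```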

SLIDE 33

Basic recipe for machine learning

SLIDE 34

Regularization

SLIDE 35

Regularization

  • Any strategy that aims to lower the validation error, possibly at the cost of increasing the training error

SLIDE 36

Data augmentation


Krizhevsky 2012
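
Typical augmentations in this spirit (random crops and horizontal flips), sketched with torchvision transforms; the exact parameters are illustrative:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random crop + rescale
    transforms.RandomHorizontalFlip(),      # mirror images half of the time
    transforms.ColorJitter(0.2, 0.2, 0.2),  # mild color perturbation
    transforms.ToTensor(),
])
# Apply only to the training set; validation/test use deterministic resizing.
```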

SLIDE 37

Early stopping

Training time is also a hyperparameter


Overfitting
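
A minimal early-stopping sketch (model and evaluation function are placeholders): stop once the validation error has not improved for a fixed number of epochs, so training time itself becomes a hyperparameter:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # placeholder model
def evaluate(m):                               # placeholder: returns a validation loss
    return torch.rand(1).item()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    # ... one epoch of training would go here ...
    val_loss = evaluate(model)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")   # keep the best snapshot
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                    # stop: validation no longer improves
```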

SLIDE 38

Bagging and ensemble methods

  • Bagging: uses k different datasets


Training Set 1 Training Set 2 Training Set 3

SLIDE 39

Dropout

  • Disable a random set of neurons (typically 50%)


Srivastava 2014

Forward
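
Dropout as a module, a sketch: nn.Dropout disables a random 50% of activations during the forward pass in training and is switched off automatically by model.eval() (forgetting that toggle is mistake #2 on the Karpathy list near the end):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                    nn.Dropout(p=0.5), nn.Linear(64, 10))

x = torch.randn(1, 128)
net.train()
y_train = net(x)     # a random half of the hidden units are zeroed (rest rescaled)
net.eval()
y_eval = net(x)      # dropout disabled: deterministic output
```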

SLIDE 40

How to deal with images?

SLIDE 41

Using CNNs in Computer Vision


Credit: Li/Karpathy/Johnson

SLIDE 42

Image filters

  • Each kernel gives us a different image filter

Example kernels applied to the input image:
Edge detection: [−1 −1 −1; −1 8 −1; −1 −1 −1]
Sharpen: [0 −1 0; −1 5 −1; 0 −1 0]
Box mean: 1/9 · [1 1 1; 1 1 1; 1 1 1]
Gaussian blur: 1/16 · [1 2 1; 2 4 2; 1 2 1]

SLIDE 43

Convolutions on RGB Images


A 32 × 32 × 3 image (pixels x) convolved with one 5 × 5 × 3 filter (weights w) yields a single 28 × 28 activation map (also called a feature map).

Convolve: slide the filter over all spatial locations and compute an output at each; without padding, there are 28 × 28 locations.

SLIDE 44

Convolution Layer


Convolve: let’s apply five filters, each with different weights, to the 32 × 32 × 3 image; the result is 5 activation maps of size 28 × 28. This is a convolution “layer”.
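
The same arithmetic in code, a sketch with nn.Conv2d: five 5 × 5 × 3 filters and no padding, so a 32 × 32 × 3 image yields five 28 × 28 activation maps:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=5)  # five 5x5x3 filters
img = torch.randn(1, 3, 32, 32)        # one 32x32 RGB image (NCHW layout)
maps = conv(img)
print(maps.shape)                      # torch.Size([1, 5, 28, 28])
```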

SLIDE 45

CNN Prototype

A ConvNet is a concatenation of conv layers and activations

Input image 32 × 32 × 3 → Conv + ReLU (5 filters, 5 × 5 × 3) → 28 × 28 × 5 → Conv + ReLU (8 filters, 5 × 5 × 5) → 24 × 24 × 8 → Conv + ReLU (12 filters, 5 × 5 × 8) → 20 × 20 × 12

SLIDE 46

CNN learned filters

SLIDE 47

Pooling Layer: Max Pooling

Example: a single depth slice of the input is max-pooled with 2 × 2 filters and stride 2; each value of the ‘pooled’ output is the maximum over its 2 × 2 window.
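
The same operation with nn.MaxPool2d, a small sketch on a made-up 4 × 4 depth slice:

```python
import torch
import torch.nn as nn

slice_ = torch.tensor([[3., 1., 3., 5.],
                       [6., 7., 9., 3.],
                       [2., 1., 4., 2.],
                       [4., 3., 6., 9.]]).view(1, 1, 4, 4)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(slice_))     # -> [[7., 9.], [4., 9.]]  (max of each 2x2 window)
```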

SLIDE 48

Classic CNN architectures

SLIDE 49

LeNet

  • Digit recognition: 10 classes
  • Conv -> Pool -> Conv -> Pool -> Conv -> FC
  • As we go deeper: width and height decrease, the number of filters increases

60k parameters

SLIDE 50

AlexNet

  • Softmax for 1000 classes

[Krizhevsky et al. 2012]

SLIDE 51

VGGNet

  • Striving for simplicity
  • CONV = 3x3 filters with stride 1, same convolutions
  • MAXPOOL = 2x2 filters with stride 2

[Simonyan and Zisserman 2014]

SLIDE 52

VGGNet


Conv = 3×3, s = 1, same; Maxpool = 2×2, s = 2. Still very common: VGG-16
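
The VGG recipe as a sketch of one block: only 3 × 3 convolutions with stride 1 and ‘same’ padding, followed by 2 × 2 max-pooling with stride 2, which halves the spatial resolution:

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs=2):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),   # 'same' convolution
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))            # halves H and W
    return nn.Sequential(*layers)

block = vgg_block(3, 64)    # e.g. a first block: 224x224x3 -> 112x112x64
```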

SLIDE 53

ResNet


[He et al. 2015]
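
The key ingredient is the skip connection. A minimal residual block sketch (basic block, no downsampling): the convolutional layers only need to learn the residual F(x) that is added onto the identity:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)      # skip connection: output = F(x) + x

y = BasicBlock(64)(torch.randn(1, 64, 32, 32))   # shape preserved: (1, 64, 32, 32)
```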

SLIDE 54

ResNet

  • Xavier/2 initialization (by He et al.)
  • SGD + Momentum (0.9)
  • Learning rate 0.1, divided by 10 when plateau
  • Mini-batch size 256
  • Weight decay of 1e-5
  • No dropout

[He et al. 2015]
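
The training recipe from this slide, sketched as an optimizer/scheduler configuration (the model is a placeholder; the numbers are taken from the bullets above):

```python
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=1000)   # placeholder model
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,                      # learning rate 0.1
                            momentum=0.9,                # SGD + Momentum (0.9)
                            weight_decay=1e-5)           # weight decay of 1e-5
# Divide the learning rate by 10 whenever the validation metric plateaus:
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
# per epoch: scheduler.step(val_loss)
```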

SLIDE 55

ResNet

  • If we make the network deeper, at some point performance starts to degrade
  • Too many parameters: the optimizer cannot properly train the network

SLIDE 56

ResNet

  • If we make the network deeper, at some point performance starts to degrade

SLIDE 57

Inception layer

SLIDE 58

GoogLeNet: using the inception layer


[Szegedy et al. 2014]

Inception block

SLIDE 59

CNN Architectures

SLIDE 60

Recurrent Neural Networks

SLIDE 61

Basic structure of an RNN

  • Multi-layer RNN

Outputs, inputs, and hidden states. The hidden state has its own internal dynamics -> a more expressive model!

SLIDE 62

Basic structure of an RNN

  • We want to have a notion of “time” or “sequence”

Hidden state and output: the same parameters are used for each time step = generalization!
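
The “same parameters for each time step” idea as a sketch: one cell is reused at every step while the hidden state carries the sequence information (sizes are made up):

```python
import torch
import torch.nn as nn

rnn_cell = nn.RNNCell(input_size=8, hidden_size=16)   # shared parameters for all steps
readout = nn.Linear(16, 4)                            # hidden state -> output

x_seq = torch.randn(10, 1, 8)        # sequence of 10 inputs (batch size 1)
h = torch.zeros(1, 16)               # initial hidden state
outputs = []
for x_t in x_seq:                    # the same cell is applied at every time step
    h = rnn_cell(x_t, h)             # hidden state has its own internal dynamics
    outputs.append(readout(h))
print(torch.stack(outputs).shape)    # torch.Size([10, 1, 4])
```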

SLIDE 63

Long Short-Term Memory Units

  • LSTM
SLIDE 64

ADL4CV Content

SLIDE 65

Rough Outline

  • Lecture 1: introduction
  • Lecture 2: advanced architectures (e.g., siamese, capsules, attention)
  • Lecture 3: advanced architectures cont’d
  • Lecture 4: visualization, t-SNE, Grad-CAM (activation heatmaps), deep dream, excitation backprop
  • Lecture 5: Bayesian deep learning
  • Lecture 6: autoencoders, VAE
  • Lecture 7: GANs 1: generative models, GANs
  • Lecture 8: GANs 2: generative models, GANs
  • Lecture 9: CNN++ / Audio<->Visual – autoregressive, PixelCNN
  • Lecture 10: RNN -> NLP <-> Visual Q&A (focus on the cross domain: CNN for image, RNN for text)
  • Lecture 11: multi-dimensional CNNs, 3D DL, video DL: pooling vs fully-conv operations… self-supervised / unsupervised learning
  • Lecture 12: domain adaptation / transfer learning
SLIDE 66

How to train your neural network?

SLIDE 67

Is data loading correct?

  • Data output (target): overfit to a single training sample (needs to reach 100% because it just memorizes the input)
– It’s irrespective of the input!!!
  • Data input: overfit to a handful (e.g., 4) of training samples
– It’s now conditioned on the input data

  • Save and re-load data from PyTorch / TensorFlow
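
A sketch of the first check: train on a single (input, target) pair and verify the loss goes to ~0 / accuracy to 100%; if it does not, the data loading or the loss is broken (model and data are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))  # placeholder
x, y = torch.randn(1, 32), torch.tensor([3])        # a single training sample
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
print(loss.item())   # should be close to 0 – the net simply memorizes the sample
```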
SLIDE 68

Network debugging

  • Move from overfitting to a handful of samples
– 5, 10, 100, 1000…
– At some point, we should see generalization
  • Apply common sense: can we overfit to the current number of samples?

  • Always be aware of network parameter count!
SLIDE 69

Check timings

  • How long does each iteration take?

– Get precise timings!!!
– If an iteration takes > 500 ms, things get dicey…

  • Where is the bottleneck: data loading vs backprop?

– Speed up data loading: smaller resolutions, compression, train from SSD
– Speed up backprop

  • Estimate total timings: how long until you see some pattern? How long till convergence?

SLIDE 70

Network Architecture

  • 100% mistake so far: “let’s use a super big network, train for two weeks, and see where we stand.” [because we desperately need those 2%...]
  • Start with the simplest network possible: rule of thumb, divide the #layers you started with by 5.

  • Get debug cycles down – ideally, minutes!!!
SLIDE 71

Debugging

  • Need train/val/test curves

– Evaluation needs to be consistent!
– Numbers need to be comparable

  • Only make one change at a time

– “I’ve added 5 more layers, doubled the training size, and now I also trained 5 days longer” – it’s better, but WHY?

SLIDE 72

Overfitting

ONLY THINK ABOUT THIS ONCE YOUR TRAINING LOSS GOES DOWN AND YOU CAN OVERFIT! Typically try this order:

  • Network too big? Make it smaller – also makes things faster
  • More regularization; e.g., weight decay
  • Not enough data? Get more – makes things slower!
  • Dropout – makes things slower!
  • Guideline: make training harder -> generalize better
SLIDE 73

Pushing the limits!

PROCEED ONLY IF YOU GENERALIZE AND YOU ADDRESSED OVERFITTING ISSUES!

  • Bigger network -> more capacity, more power – but it also needs more data!
  • Better architecture -> ResNet is typically the standard, but InceptionNet architectures often perform better (e.g., InceptionNet v4, XceptionNet, etc.)
  • Schedules for learning rate decay
  • Class-based re-weighting (e.g., give under-represented classes a higher weight)

  • Hyperparameter tuning: e.g., grid search; apply common sense!
SLIDE 74

Bad signs…

  • Train error doesn’t go down…
  • Validation error doesn’t go down… (ahhh we don’t learn)
  • Validation performs better than train… (trust me, this scenario is very unlikely – unless you have a bug)
  • Testing on the train set gives a different error than training… (forgot dropout?)
  • Often people mess up the last batch in an epoch…
  • Your training set contains test data…
  • You debug your algorithm on test data…
SLIDE 75

“Most common” neural net mistakes

1) You didn’t try to overfit a single batch first.
2) You forgot to toggle train/eval mode for the net.
3) You forgot to .zero_grad() (in PyTorch) before .backward().
4) You passed softmaxed outputs to a loss that expects raw logits.
5) You didn’t use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forgot to include it for the output layer.
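
The same checklist turned into a small sketch of a training step that avoids mistakes 1–5 (model and data are made up):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1, bias=False),  # (5) no bias: BatchNorm provides its own shift
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),                 # (5) output layer keeps its bias
)
criterion = nn.CrossEntropyLoss()                # (4) expects raw logits, not softmax outputs
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
model.train()                                    # (2) toggle train/eval mode explicitly
for step in range(20):                           # (1) first: overfit this single batch
    optimizer.zero_grad()                        # (3) zero_grad() before backward()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```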


Credit: A. Karpathy

SLIDE 76

Next lecture

  • Next Monday: advanced architectures
  • Keep projects in mind!

– Start actively discussing -> reach out to us if you have questions!
