SLIDE 1

Lecture 7 Recap

Prof. Leal-Taixé and Prof. Niessner
SLIDE 2

Beyond Linear

1-layer network: f = Wx, with input x (a 128×128 image) and output f (10 class scores).
SLIDE 3

Neural Network

Width and Depth
SLIDE 4

Optimization
SLIDE 5

Loss functions
SLIDE 6

Neural networks

Prediction → Loss (Softmax, Hinge)

What is the shape of this function?
SLIDE 7

Sigmoid for binary predictions

Inputs x0, x1, x2 with weights θ0, θ1, θ2.

σ(x) = 1 / (1 + e^{-x})

The output can be interpreted as a probability: p(y_i = 1 | x_i, θ).
SLIDE 8

Logistic regression

  • Binary classification

Inputs x0, x1, x2 with weights θ0, θ1, θ2 produce the output Π_i.
SLIDE 9

Logistic regression

  • Loss function
  • Cost function
  • Minimization

Prediction: Π_i = σ(x_i θ)

Loss: L(Π_i, y_i) = y_i log Π_i + (1 − y_i) log(1 − Π_i)

Cost: C(θ) = −(1/n) Σ_{i=1}^{n} [ y_i log Π_i + (1 − y_i) log(1 − Π_i) ]
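A minimal NumPy sketch of this cost function (not from the slides; the data and function name are made up for illustration):

    import numpy as np

    def logistic_regression_cost(X, y, theta):
        # Pi_i = sigma(x_i theta);  C(theta) = -(1/n) sum_i [ y_i log Pi_i + (1 - y_i) log(1 - Pi_i) ]
        pi = 1.0 / (1.0 + np.exp(-X @ theta))           # sigmoid of the scores
        eps = 1e-12                                      # avoid log(0)
        per_sample = y * np.log(pi + eps) + (1 - y) * np.log(1 - pi + eps)
        return -np.mean(per_sample)

    X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])  # first column plays the role of x0 = 1
    y = np.array([1.0, 0.0, 1.0])
    print(logistic_regression_cost(X, y, np.zeros(2)))   # log(2) ~ 0.693 at theta = 0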

SLIDE 10

Softmax regression

  • Cost function for the binary case:

    C(θ) = −(1/n) Σ_{i=1}^{n} [ y_i log Π_i + (1 − y_i) log(1 − Π_i) ]

  • Extension to multiple classes:

    C(θ) = −(1/n) Σ_{i=1}^{n} Σ_{c=1}^{M} y_{i,c} log p_{i,c}

    where y_{i,c} is a binary indicator of whether c is the label for image i, and p_{i,c} is the probability given by our sigmoid function.
SLIDE 11

Softmax formulation

  • What if we have multiple classes?

Softmax: inputs x0, x1, x2 → outputs Π1, Π2, Π3.
SLIDE 12

Softmax formulation

  • Three neurons in the output layer for three classes: Π1, Π2, Π3.
SLIDE 13

Softmax formulation

  • What if we have multiple classes?
  • You can no longer assign Π_i directly to p_{i,c} as in the binary case, because all outputs need to sum to 1:

C(θ) = −(1/n) Σ_{i=1}^{n} Σ_{c=1}^{M} y_{i,c} log p_{i,c},   with   p_{i,c} = Π_{i,c} / Σ_{c} Π_{i,c}
SLIDE 14

Softmax formulation

  • Softmax takes M inputs (scores) and outputs M probabilities (M is the number of classes):

p(cat | x_i) = e^{s_cat} / Σ_c e^{s_c}     (and likewise for p(dog | x_i), p(bird | x_i))

s_cat is the score for class cat given by all the layers of the network; the denominator is the normalization.
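A small sketch of this normalization in NumPy (the scores are made-up numbers, not from the slides):

    import numpy as np

    def softmax(scores):
        # p_c = exp(s_c) / sum_j exp(s_j); subtract max(s) first for numerical stability
        shifted = scores - np.max(scores)
        exps = np.exp(shifted)
        return exps / np.sum(exps)

    scores = np.array([2.0, 1.0, -1.0])              # e.g. s_cat, s_dog, s_bird from the last layer
    print(softmax(scores), softmax(scores).sum())    # three probabilities that sum to 1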

SLIDE 15

Loss functions

  • Softmax loss function (comes from the Maximum Likelihood Estimate):

    L_i = −log( e^{s_{y_i}} / Σ_k e^{s_k} )     — evaluate the ground-truth score for the image

  • Hinge loss (derived from the Multiclass SVM loss):

    L_i = Σ_{k ≠ y_i} max(0, s_k − s_{y_i} + 1)
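A minimal sketch of the two per-sample losses in NumPy (the scores and class index are invented for illustration):

    import numpy as np

    def softmax_loss(scores, y):
        # L_i = -log( exp(s_{y_i}) / sum_k exp(s_k) )
        shifted = scores - np.max(scores)
        log_probs = shifted - np.log(np.sum(np.exp(shifted)))
        return -log_probs[y]

    def hinge_loss(scores, y):
        # L_i = sum_{k != y_i} max(0, s_k - s_{y_i} + 1)
        margins = np.maximum(0.0, scores - scores[y] + 1.0)
        margins[y] = 0.0                   # the correct class does not contribute
        return np.sum(margins)

    scores = np.array([3.2, 5.1, -1.7])    # made-up class scores; ground-truth class 0
    print(softmax_loss(scores, 0), hinge_loss(scores, 0))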

SLIDE 16

Loss functions

  • Softmax loss function
    – Optimizes until the loss is zero
  • Hinge loss (derived from the Multiclass SVM loss)
    – Saturates whenever it has learned a class "well enough"
SLIDE 17

Activation functions
SLIDE 18

Sigmoid

Forward: σ(x) = 1 / (1 + e^{-x})

Backward (chain rule): ∂L/∂x = (∂σ/∂x) · (∂L/∂σ)

For example, at x = 6 the local gradient ∂σ/∂x is almost zero: saturated neurons kill the gradient flow.
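A quick numeric check of this saturation (a sketch, not from the slides):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = 6.0                          # the input value used on the slide
    s = sigmoid(x)
    local_grad = s * (1.0 - s)       # d(sigma)/dx = sigma(x) * (1 - sigma(x))
    print(s, local_grad)             # ~0.9975 and ~0.0025: almost no gradient flows back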

SLIDE 19

Problem of positive output

The sigmoid output is always positive, so the gradients on the weights w1, w2 all share the same sign, leading to zig-zag updates.

More on zero-mean data later.
SLIDE 20

tanh

Zero-centered, but still saturates.

LeCun 1991
SLIDE 21

Rectified Linear Units (ReLU)

Large and consistent gradients; does not saturate; fast convergence.

What happens if a ReLU outputs zero? Dead ReLU.
SLIDE 22

Maxout units

  • Generalization of ReLUs
  • Linear regimes; does not die; does not saturate
  • Increase of the number of parameters
SLIDE 23

Data pre-processing

For images, subtract the mean image (AlexNet) or the per-channel mean (VGG-Net).
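A minimal sketch of both options in NumPy (the image tensor here is random placeholder data):

    import numpy as np

    images = np.random.rand(100, 32, 32, 3).astype(np.float32)  # hypothetical training images (N, H, W, C)

    mean_image = images.mean(axis=0)                 # AlexNet-style: one mean value per pixel and channel
    per_channel_mean = images.mean(axis=(0, 1, 2))   # VGG-style: one mean value per channel

    centered_a = images - mean_image                 # subtract the mean image
    centered_b = images - per_channel_mean           # or subtract the per-channel mean (broadcast over H, W)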

SLIDE 24

Weight initialization
SLIDE 25

How do I start?

Forward pass through a chain of layers with weights w.
SLIDE 26

Initialization is extremely important

Starting from a bad initialization, we are not guaranteed to reach the optimum.
SLIDE 27

How do I start?

Forward pass with all weights w = 0:  f( Σ_i w_i x_i + b )

What happens to the gradients? No symmetry breaking.
SLIDE 28

All weights set to zero

  • The hidden units are all going to compute the same function, and their gradients are going to be the same: the network never breaks symmetry.
SLIDE 29

Small random numbers

  • Gaussian with zero mean and standard deviation 0.01
  • Let us see what happens:
    – Network with 10 layers, 500 neurons each
    – tanh as activation function
    – Unit-Gaussian input data
SLIDE 30

Small random numbers

Forward pass: from the input to the last layer, the activations become zero.
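A rough reproduction of this experiment in NumPy (a sketch of the setup described two slides earlier: 10 layers of 500 tanh neurons, unit-Gaussian input, weights with std 0.01):

    import numpy as np

    np.random.seed(0)
    x = np.random.randn(1000, 500)             # unit-Gaussian input batch
    for layer in range(10):
        W = 0.01 * np.random.randn(500, 500)   # "small random numbers": std 0.01
        x = np.tanh(x @ W)
        print(layer, float(x.std()))           # the std shrinks layer by layer towards 0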

SLIDE 31

Small random numbers

Forward: f( Σ_i w_i x_i + b ), where the weights w_i are small.
SLIDE 32

Small random numbers

Backward: f( Σ_i w_i x_i + b )

  1. The activation function gradient is OK.
  2. Compute the gradients w.r.t. the weights.
SLIDE 33

Small random numbers

f( Σ_i w_i x_i + b )

  1. The activation function gradient is OK.
  2. Compute the gradients w.r.t. the weights: the gradients vanish.
SLIDE 34

Big random numbers

  • Gaussian with zero mean and standard deviation 1
  • Let us see what happens:
    – Network with 10 layers, 500 neurons each
    – tanh as activation function
    – Unit-Gaussian input data
SLIDE 35

Big random numbers

Everything is saturated.
SLIDE 36

How to solve this?

  • Working on the initialization
  • Working on the output generated by each layer
SLIDE 37

Xavier initialization

  • Gaussian with zero mean, but what standard deviation?

Var(s) = Var( Σ_{i}^{n} w_i x_i ) = Σ_{i}^{n} Var(w_i x_i)

Glorot 2010
SLIDE 38

Xavier initialization

  • Gaussian with zero mean, but what standard deviation?

Var(s) = Var( Σ_{i}^{n} w_i x_i ) = Σ_{i}^{n} Var(w_i x_i)
       = Σ_{i}^{n} [E(w_i)]² Var(x_i) + [E(x_i)]² Var(w_i) + Var(x_i) Var(w_i)

(using that the w_i and x_i are independent and zero mean)
SLIDE 39

Xavier initialization

  • Gaussian with zero mean, but what standard deviation?

Var(s) = Var( Σ_{i}^{n} w_i x_i ) = Σ_{i}^{n} Var(w_i x_i)
       = Σ_{i}^{n} [E(w_i)]² Var(x_i) + [E(x_i)]² Var(w_i) + Var(x_i) Var(w_i)
       = Σ_{i}^{n} Var(x_i) Var(w_i) = (n Var(w)) Var(x)

(the w_i and x_i are identically distributed)
SLIDE 40

Xavier initialization

  • Gaussian with zero mean, but what standard deviation?

Var(s) = Var( Σ_{i}^{n} w_i x_i ) = Σ_{i}^{n} Var(w_i x_i)
       = Σ_{i}^{n} [E(w_i)]² Var(x_i) + [E(x_i)]² Var(w_i) + Var(x_i) Var(w_i)
       = Σ_{i}^{n} Var(x_i) Var(w_i) = (n Var(w)) Var(x)

The variance gets multiplied by the number of inputs.
SLIDE 41

Xavier initialization

  • How to ensure that the variance of the output is the same as the input?

We need n Var(w) = 1 in (n Var(w)) Var(x), i.e., Var(w) = 1/n.
SLIDE 42

Xavier initialization

Mitigates the effect of activations going to zero.
SLIDE 43

Xavier initialization with ReLU
SLIDE 44

ReLU kills half of the data

Var(w) = 2/n

He 2015
SLIDE 45

ReLU kills half of the data

Var(w) = 2/n — it makes a huge difference!

He 2015
SLIDE 46

Tips and tricks

  • Use ReLU and Xavier/2 initialization (see the sketch below).
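A minimal sketch of both initialization rules in NumPy (the function names and layer sizes are made up for illustration):

    import numpy as np

    def xavier_init(n_in, n_out):
        # Var(w) = 1/n_in  ->  std = sqrt(1/n_in)   (Glorot 2010, the version derived on the slides)
        return np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)

    def xavier_over_two_init(n_in, n_out):
        # Var(w) = 2/n_in for ReLU layers (He 2015), the "Xavier/2" rule from the slide
        return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

    W1 = xavier_over_two_init(128 * 128, 500)   # e.g. the first layer of a ReLU network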
SLIDE 47

Batch normalization
SLIDE 48

Our goal

  • All we want is that our activations do not die out.
SLIDE 49

Batch normalization

  • Wish: unit Gaussian activations (in our example)
  • Solution: let's do it

x̂^(k) = ( x^(k) − E[x^(k)] ) / √( Var[x^(k)] )

E[x^(k)] is the mean of your mini-batch examples over feature k. D = #features, N = mini-batch size.

Ioffe and Szegedy 2015
SLIDE 50

Batch normalization

  • In each dimension of the features, you have a unit Gaussian (in our example).

x̂^(k) = ( x^(k) − E[x^(k)] ) / √( Var[x^(k)] )

Ioffe and Szegedy 2015
SLIDE 51

Batch normalization

  • In each dimension of the features, you have a unit Gaussian (in our example).
  • For NNs in general → BN normalizes the mean and variance of the inputs to your activation functions.

Ioffe and Szegedy 2015
SLIDE 52

BN layer

  • A layer to be applied after Fully Connected (or Convolutional) layers and before non-linear activation functions.
  • Is it a good idea to have all unit Gaussians before tanh? This normalization might not be the best for the network!

Ioffe and Szegedy 2015
SLIDE 53

Batch normalization

  1. Normalize:  x̂^(k) = ( x^(k) − E[x^(k)] ) / √( Var[x^(k)] )
  2. Allow the network to change the range:  y^(k) = γ^(k) x̂^(k) + β^(k)

γ^(k) and β^(k) will be optimized during backprop; BN is a differentiable function, so we can backprop through it.

Ioffe and Szegedy 2015
SLIDE 54

Batch normalization

  1. Normalize:  x̂^(k) = ( x^(k) − E[x^(k)] ) / √( Var[x^(k)] )
  2. Allow the network to change the range:  y^(k) = γ^(k) x̂^(k) + β^(k)

With γ^(k) = √( Var[x^(k)] ) and β^(k) = E[x^(k)], the network can learn to undo the normalization.

Ioffe and Szegedy 2015
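A minimal NumPy sketch of the BN forward pass described above (the batch data here is invented; a real layer would additionally keep running statistics for test time, as discussed on the next slides):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        # x: (N, D) mini-batch; normalize each feature k over the batch, then scale and shift
        mean = x.mean(axis=0)                     # E[x^(k)] over the mini-batch
        var = x.var(axis=0)                       # Var[x^(k)] over the mini-batch
        x_hat = (x - mean) / np.sqrt(var + eps)   # roughly unit Gaussian per feature
        return gamma * x_hat + beta               # y^(k) = gamma^(k) * x_hat^(k) + beta^(k)

    x = np.random.randn(32, 500) * 3.0 + 5.0      # made-up activations, N = 32, D = 500
    y = batchnorm_forward(x, gamma=np.ones(500), beta=np.zeros(500))
    print(float(y.mean()), float(y.std()))        # approximately 0 and 1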
SLIDE 55

Batch normalization

  • Is it OK to treat dimensions separately? It has been shown empirically that even if features are not decorrelated, convergence is still faster with this method.
  • You can set all biases of the layers before BN to zero, because they will be cancelled out by BN anyway.

Ioffe and Szegedy 2015
SLIDE 56

BN: train vs test time

  • Train time: mean and variance are taken over the mini-batch.
  • Test time: what happens if we can just process one image at a time?
    – No chance to compute a meaningful mean and variance.

x̂^(k) = ( x^(k) − E[x^(k)] ) / √( Var[x^(k)] )
SLIDE 57

BN: train vs test time

Training:
  • Compute mean and variance from mini-batch 1
  • Compute mean and variance from mini-batch 2
  • Compute mean and variance from mini-batch 3
  • ...

Testing:
  • Compute μ_test and σ²_test by running an exponentially weighted average across training mini-batches.
SLIDE 58

BN: what do you get?

  • Very deep nets are much easier to train → more stable gradients.
  • A much larger range of hyperparameters works similarly when using BN.
SLIDE 59

Regularization
SLIDE 60

Regularization

  • Any strategy that aims to lower the validation error while (possibly) increasing the training error.
SLIDE 61

Overfitting and underfitting

Credits: Deep Learning. Goodfellow et al.
SLIDE 62

Overfitting and underfitting

Underfitting: the training error is too big. Overfitting: the generalization gap is too big.

Credits: Deep Learning. Goodfellow et al.
SLIDE 63

Weight decay

  • L2 regularization: add the penalty (λ/2) θᵀθ to the loss
  • Penalizes large weights
  • Improves generalization

The gradient of the penalty is λθ, so the update becomes θ ← θ − ε (∇_θ L + λθ), with learning rate ε and gradient ∇_θ L of the data loss.
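A minimal sketch of one such update in NumPy (the function name, values, and λ are made up; this assumes the update form written above):

    import numpy as np

    def sgd_step_with_weight_decay(theta, grad, lr=0.01, lam=1e-4):
        # theta <- theta - lr * (grad + lam * theta); the extra lam * theta term comes from the L2 penalty
        return theta - lr * (grad + lam * theta)

    theta = np.array([2.0, -3.0])
    grad = np.array([0.5, 0.1])                   # gradient of the data loss only
    theta = sgd_step_with_weight_decay(theta, grad)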

SLIDE 64

Data augmentation

  • A classifier has to be invariant to a wide variety of transformations.
SLIDE 65

Examples of such variation: pose, appearance, illumination.
SLIDE 66

Data augmentation

  • A classifier has to be invariant to a wide variety of transformations.
  • Helping the classifier: generate fake data simulating plausible transformations.
SLIDE 67

Data augmentation

Krizhevsky 2012
SLIDE 68

Data augmentation: random crops

  • Random brightness and contrast changes

Krizhevsky 2012
SLIDE 69

Data augmentation: random crops

  • Training: random crops (see the sketch below)
    – Pick a random L in [256, 480]
    – Resize the training image, short side = L
    – Randomly sample crops of 224×224
  • Testing: fixed set of crops
    – Resize the image at N scales
    – 10 fixed crops of 224×224: 4 corners + center + flips

Krizhevsky 2012
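A minimal sketch of the training-time crop in NumPy (the resizing step is assumed to have happened already; the image here is a placeholder array and the flip is an extra, commonly used augmentation):

    import numpy as np

    def random_crop_and_flip(image, crop=224):
        # image: (H, W, C), already resized so that the short side is a random L in [256, 480]
        h, w, _ = image.shape
        top = np.random.randint(0, h - crop + 1)
        left = np.random.randint(0, w - crop + 1)
        patch = image[top:top + crop, left:left + crop]
        if np.random.rand() < 0.5:
            patch = patch[:, ::-1]                # random horizontal flip
        return patch

    img = np.zeros((256, 340, 3), dtype=np.float32)   # a resized training image (short side = 256)
    patch = random_crop_and_flip(img)                  # one 224x224 training crop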

SLIDE 70

Data augmentation

  • When comparing two networks, make sure to use the same data augmentation!
  • Consider data augmentation a part of your network design.
SLIDE 71

Early stopping

Training time is also a hyperparameter: stop before the model starts overfitting.
SLIDE 72

Early stopping

  • Easy form of regularization

(Figure: parameter trajectory from θ0 towards the optimum θ*, stopped early at θs before overfitting.)
SLIDE 73

Bagging and ensemble methods

  • Train three models and average their results.
  • Use a different algorithm for optimization or change the objective function.
  • If the errors are uncorrelated, the expected combined error will decrease linearly with the ensemble size.
SLIDE 74

Bagging and ensemble methods

  • Bagging: uses k different datasets (Training Set 1, Training Set 2, Training Set 3, ...).
SLIDE 75

Dropout
SLIDE 76

Dropout

  • Disable a random set of neurons (typically 50%) in the forward pass.

Srivastava 2014
SLIDE 77

Dropout: intuition

  • Using half the network = half capacity

Features such as "furry", "has two eyes", "has a tail", "has paws", "has two ears" → redundant representations.
SLIDE 78

Dropout: intuition

  • Using half the network = half capacity
    – Redundant representations
    – Base your scores on more features
  • Consider it as a model ensemble.
SLIDE 79

Dropout: intuition

  • Two models in one (Model 1 and Model 2).
SLIDE 80

Dropout: intuition

  • Using half the network = half capacity
    – Redundant representations
    – Base your scores on more features
  • Consider it as two models in one
    – Training a large ensemble of models, each on a different set of data (mini-batch) and with SHARED parameters.

Reducing co-adaptation between neurons.
SLIDE 81

Dropout: test time

  • All neurons are "turned on" – no dropout.

Conditions at train and test time are not the same.
SLIDE 82

Dropout: test time

  • Test: z = θ1·x + θ2·y
  • Train (dropout probability p = 0.5): averaging over the four dropout masks,

    E[z] = (1/4)·( θ1·0 + θ2·0  +  θ1·x + θ2·0  +  θ1·0 + θ2·y  +  θ1·x + θ2·y ) = (1/2)·(θ1·x + θ2·y)

Weight scaling inference rule: scale by p = 0.5 at test time so that the expected activations match.
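A minimal sketch of dropout in NumPy. This is the "inverted dropout" variant commonly used in practice, which rescales at train time instead of scaling by p at test time; it is equivalent in expectation to the weight scaling rule above (the data and function name are made up):

    import numpy as np

    def dropout_forward(x, p=0.5, train=True):
        # Inverted dropout: rescale the surviving activations by 1/(1 - p) at train time,
        # so that at test time all neurons can simply stay on without extra scaling.
        if train:
            mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)   # drop each unit with probability p
            return x * mask
        return x                                                  # test time: no dropout

    h = np.random.randn(4, 8)                  # some hypothetical activations
    h_train = dropout_forward(h, p=0.5, train=True)
    h_test = dropout_forward(h, train=False)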
SLIDE 83

Dropout: verdict

  • Efficient bagging method with parameter sharing
  • Use it!
  • Dropout reduces the effective capacity of a model → larger models, more training time
SLIDE 84

Recap
SLIDE 85

What do we know so far?

Depth and Width of a network.
SLIDE 86

What do we know so far?

Concept of a 'Neuron': inputs x0, x1, x2 with weights θ0, θ1, θ2.
SLIDE 87

What do we know so far?

Activation functions (non-linearities):
  • Sigmoid: σ(x) = 1 / (1 + e^{-x})
  • tanh: tanh(x)
  • ReLU: max(0, x)
  • Leaky ReLU: max(0.1x, x)
SLIDE 88

What do we know so far?

Backpropagation (worked computational-graph example with forward values and backward gradients).
SLIDE 89

What do we know so far?

SGD variations (Momentum, etc.)
SLIDE 90

What do we know so far?

Regularization and training tricks: Dropout, Batch-Norm, Weight Regularization, Data Augmentation.

  • Batch-Norm: x̂^(k) = ( x^(k) − E[x^(k)] ) / √( Var[x^(k)] )
  • Weight Initialization (e.g., Xavier/2)
  • Weight Regularization, e.g., L2-reg: R(θ) = Σ_j θ_j²
SLIDE 91

Why not only more layers?

  • We cannot make networks arbitrarily complex
    – Why not just go deeper and get better?
    – No structure!!
    – It's just brute force!
    – Optimization becomes hard
    – Performance plateaus / drops!
SLIDE 92

Administrative Things

  • Next Monday (11.06): Starting with CNNs
  • 18.06: Research lecture: DL projects at TUM
  • 25.06: Lecture CNN 2
  • 02.07: Guest lecture
  • 09.07: Lecture on Recurrent Neural Networks
  • Thursday: Solution of the 2nd exercise, presentation of the 3rd