Lecture 7 Recap
- Prof. Leal-Taixé and Prof. Niessner
Beyond linear
1-layer network: f = Wx, with input x (a 128×128 image) and output f (10 class scores).
Neural Network: Width and Depth
Prediction → Loss (Softmax, Hinge)
A neuron combines the inputs x_0, x_1, x_2 with weights θ_0, θ_1, θ_2 and applies the sigmoid
σ(x) = 1 / (1 + e^(−x))
What is the shape of this function? Its output Π_i can be interpreted as a probability p(y_i = 1 | x_i, θ).
Minimization
Π_i = σ(x_i θ)
Per-sample loss: L(Π_i, y_i) = y_i log Π_i + (1 − y_i) log(1 − Π_i)
Cost over n training samples: C(θ) = −(1/n) Σ_{i=1}^n [ y_i log Π_i + (1 − y_i) log(1 − Π_i) ]
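A minimal NumPy sketch of this binary cross-entropy cost, assuming a design matrix X with one row per sample (the function names and toy data are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    # σ(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(theta, X, y, eps=1e-12):
    # Π_i = σ(x_i θ): predicted probability that y_i = 1
    p = sigmoid(X @ theta)
    # C(θ) = -(1/n) Σ [ y_i log Π_i + (1 - y_i) log(1 - Π_i) ]
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# toy example: 4 samples, 3 features (first column acts as a bias)
X = np.array([[1., 0.5, -1.], [1., -0.3, 2.], [1., 1.2, 0.1], [1., -2., -0.5]])
y = np.array([1., 0., 1., 0.])
theta = np.zeros(3)
print(binary_cross_entropy(theta, X, y))  # ≈ 0.693 = log(2) for θ = 0
```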
From binary to multi-class cross-entropy:
C(θ) = −(1/n) Σ_{i=1}^n [ y_i log Π_i + (1 − y_i) log(1 − Π_i) ]   (binary)
C(θ) = −(1/n) Σ_{i=1}^n Σ_{c=1}^M y_{i,c} log p_{i,c}   (multi-class)
y_{i,c}: binary indicator whether c is the label for image i
p_{i,c}: probability given by our sigmoid function
Softmax
For multiple classes the network produces outputs Π_1, Π_2, Π_3 from the inputs x_0, x_1, x_2. These outputs cannot be used as probabilities directly in the multi-class case, because all outputs need to sum to 1.
C(θ) = −(1/n) Σ_{i=1}^n Σ_{c=1}^M y_{i,c} log p_{i,c},   with p_{i,c} = Π_{i,c} / Σ_c Π_{i,c}
Normalizing turns the outputs into probabilities (M is the number of classes).
The normalized outputs Π_1, Π_2, Π_3 become p(dog|X_i), p(cat|X_i), p(bird|X_i), e.g.
p(cat|X_i) = e^(s_cat) / Σ_c e^(s_c)
where s_cat is the score for class cat given by all the layers of the network, and the denominator is the normalization.

Softmax loss:
L_i = −log( e^(s_{y_i}) / Σ_k e^(s_k) )
Evaluate the ground-truth score for the image.
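A small NumPy sketch of the softmax loss for a single image (the scores and class index are made up for illustration):

```python
import numpy as np

def softmax_loss(s, y):
    # numerically stable softmax: p_c = e^{s_c} / Σ_k e^{s_k}
    s = s - np.max(s)
    p = np.exp(s) / np.sum(np.exp(s))
    # L_i = -log p_{y_i}: evaluate the normalized ground-truth score
    return -np.log(p[y])

scores = np.array([2.0, 5.0, -1.0])  # e.g., scores for dog, cat, bird
print(softmax_loss(scores, y=1))     # small loss: the true class (cat) has the largest score
```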
Hinge loss:
L_i = Σ_{k≠y_i} max(0, s_k − s_{y_i} + 1)

Softmax vs. Hinge loss: the softmax loss comes from a Maximum Likelihood Estimate, whereas the hinge loss
– optimizes until the loss is zero
– saturates whenever it has learned a class “well enough”
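For comparison, a sketch of the multi-class hinge loss on the same kind of scores (again with illustrative variable names):

```python
import numpy as np

def hinge_loss(s, y, margin=1.0):
    # L_i = Σ_{k != y_i} max(0, s_k - s_{y_i} + 1)
    margins = np.maximum(0.0, s - s[y] + margin)
    margins[y] = 0.0  # do not sum over the ground-truth class
    return np.sum(margins)

scores = np.array([2.0, 5.0, -1.0])
print(hinge_loss(scores, y=1))  # 0.0: all margins satisfied, the loss saturates
```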
Activation functions: Sigmoid
σ(x) = 1 / (1 + e^(−x))
In the backward pass, ∂L/∂x = (∂σ/∂x) · (∂L/∂σ). For an input far from zero (e.g., x = 6), the local gradient ∂σ/∂x is almost zero.
Saturated neurons kill the gradient flow.
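A quick numeric check of the saturation effect (a sketch; the upstream factor ∂L/∂σ is left out to isolate the local gradient):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # dσ/dx = σ(x) * (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 6.0, 10.0]:
    print(x, sigmoid_grad(x))
# at x = 6 the local gradient is ≈ 0.0025, so almost nothing flows back
```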
Sigmoid outputs are also not zero-centered (more on zero-mean data later), which leads to inefficient, zig-zagging updates of the weights w_1, w_2.

Activation functions: tanh (LeCun 1991)
– Zero-centered
– Still saturates
Activation functions: ReLU
– Large and consistent gradients
– Does not saturate
– Fast convergence
– What happens if a ReLU outputs zero? Dead ReLU: it stops receiving gradient updates.

Generalizations of ReLU (e.g., Leaky ReLU)
– Linear regimes
– Do not die
– Do not saturate
– Increase the number of parameters
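A short sketch contrasting ReLU and Leaky ReLU and their local gradients (the 0.1 slope matches the Leaky ReLU definition used later in this recap):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 0 for x < 0: a unit stuck there receives no updates ("dead ReLU")
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.1):
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.1):
    # small but non-zero slope for x < 0, so the unit does not die
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(x), relu_grad(x))
print(leaky_relu(x), leaky_relu_grad(x))
```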
Data preprocessing: for images, subtract the mean image (AlexNet) or the per-channel mean (VGG-Net).
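A minimal sketch of both variants, assuming a batch of RGB images shaped (N, H, W, 3):

```python
import numpy as np

images = np.random.randint(0, 256, size=(8, 32, 32, 3)).astype(np.float32)  # toy data

# AlexNet-style: subtract the mean image (computed over the training set)
mean_image = images.mean(axis=0)                 # shape (32, 32, 3)
centered_a = images - mean_image

# VGG-style: subtract one mean value per channel
mean_per_channel = images.mean(axis=(0, 1, 2))   # shape (3,)
centered_b = images - mean_per_channel

print(centered_a.mean(), centered_b.mean())      # both ≈ 0
```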
Weight initialization
How we initialize the weights determines where optimization starts; we are not guaranteed to reach the optimum.
What if we initialize all weights to zero, w = 0? Each neuron computes f(Σ_i w_i x_i + b).
What happens to the gradients? There is no symmetry breaking: all neurons compute the same function, so their gradients are going to be the same and the neurons can never become different.
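A tiny sketch of the symmetry problem. It uses a constant initialization (all weights equal) rather than strictly zero so that the identical, non-zero gradients are visible; with w = 0 exactly, every gradient below would simply be zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # one input with 4 features
W1 = np.full((3, 4), 0.5)         # every weight identical (w = 0 is the extreme case)
W2 = np.full((1, 3), 0.5)

h = np.tanh(W1 @ x)               # all 3 hidden units compute the same function
out = W2 @ h
d_out = out - 1.0                 # gradient of a squared-error loss, target 1.0

# backprop by hand: gradient w.r.t. the first-layer weights
d_h = (W2.T @ d_out).ravel()
d_W1 = np.outer(d_h * (1.0 - h ** 2), x)

print(h)      # identical activations
print(d_W1)   # identical rows: no symmetry breaking, the units can never diverge
```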
Initialization with small random weights. Experiment:
– Network with 10 layers, 500 neurons each
– Tanh as activation function
– Input: unit Gaussian data

Forward pass: from the input to the last layer the activations shrink and eventually become zero. Each neuron computes f(Σ_i w_i x_i + b) with small w_i, so the outputs keep getting smaller.

Backward pass: the activation-function gradient is ok, but the gradients w.r.t. the weights depend on the (near-zero) activations, so the gradients vanish.

Initialization with large random weights, same experiment:
– Network with 10 layers, 500 neurons each
– Tanh as activation function
– Input: unit Gaussian data

Now everything is saturated: the tanh units sit at ±1 and, again, the gradients vanish.
Which standard deviation should we use for the weights? (Glorot 2010)

For one output s = Σ_{i=1}^n w_i x_i:
Var(s) = Var(Σ_{i=1}^n w_i x_i)
       = Σ_{i=1}^n Var(w_i x_i)   (w_i and x_i independent)
       = Σ_{i=1}^n [E(w_i)]² Var(x_i) + [E(x_i)]² Var(w_i) + Var(x_i) Var(w_i)
       = Σ_{i=1}^n Var(x_i) Var(w_i)   (zero mean)
       = (n Var(w)) Var(x)   (identically distributed)

The variance of the output gets multiplied by the number of inputs. How do we make sure the output has the same variance as the input? Set n Var(w) = 1, i.e.

Var(w) = 1/n

This mitigates the effect of activations going to zero.
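A minimal sketch of this rule (often called Xavier/Glorot initialization); fan_in is the number of inputs n to the layer:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    # Var(w) = 1 / n  ->  std = sqrt(1 / fan_in)
    std = np.sqrt(1.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = xavier_init(500, 500)
print(W.std())  # ≈ sqrt(1/500) ≈ 0.045
```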
For ReLU activations, which set half of the activations to zero, the variance is doubled instead:
Var(w) = 2/n   (He 2015)
It makes a huge difference!
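A sketch of why the factor matters: propagate unit-Gaussian data through a stack of ReLU layers and look at the scale of the activations (the layer sizes and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 500))  # unit-Gaussian input data

for name, scale in [("Xavier, Var(w)=1/n", 1.0), ("He, Var(w)=2/n", 2.0)]:
    h = x
    for _ in range(10):                                  # 10 layers, 500 neurons each
        n_in = h.shape[1]
        W = rng.normal(0.0, np.sqrt(scale / n_in), size=(n_in, 500))
        h = np.maximum(0.0, h @ W)                       # ReLU
    print(name, "-> std of last-layer activations:", round(float(h.std()), 4))
# with Var(w)=1/n the ReLU activations shrink towards zero layer by layer;
# with Var(w)=2/n their scale stays roughly constant
```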
Batch Normalization (Ioffe and Szegedy 2015)
x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])
E[x^(k)] is the mean of your mini-batch examples over feature k (D = #features, N = mini-batch size). The normalized activations are unit Gaussian (in our example), which lets us control the mean and variance of the inputs to your activation functions.

Batch normalization is usually inserted after Fully Connected (or Convolutional) layers and before non-linear activation functions.

Do we really want unit Gaussians before tanh? This normalization might not be the best for the network!
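A sketch of the normalization at training time for a mini-batch of N examples with D features:

```python
import numpy as np

def batchnorm_forward(x, eps=1e-5):
    # x has shape (N, D): N = mini-batch size, D = #features
    mean = x.mean(axis=0)            # E[x^(k)] per feature k
    var = x.var(axis=0)              # Var[x^(k)] per feature k
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(loc=3.0, scale=5.0, size=(32, 4))
x_hat = batchnorm_forward(x)
print(x_hat.mean(axis=0), x_hat.std(axis=0))  # ≈ 0 and ≈ 1 per feature
```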
Batch normalization adds a learnable scale and shift per feature (Ioffe and Szegedy 2015):
x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)]),   y^(k) = γ^(k) x̂^(k) + β^(k)
These parameters will be learned; the whole operation is a differentiable function, so we can backprop through it. γ^(k) and β^(k) let the network scale and shift the range of the normalized values. In particular, for
γ^(k) = √(Var[x^(k)]),   β^(k) = E[x^(k)]
the network can learn to undo the normalization.

Is it ok to treat dimensions separately? It has been shown empirically that even if features are not decorrelated, convergence is still faster with this method.

Bias terms in the preceding layer are not needed, because they will be cancelled out by BN anyway.
Batch normalization at test time (Ioffe and Szegedy 2015)
BN depends on the mini-batch: what happens at test time when we only get one image at a time?
– There is no chance to compute a meaningful mean and variance for
x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])

Training vs. testing:
– Training: mean and variance are computed from mini-batch 1, mini-batch 2, mini-batch 3, …
– Testing: compute µ_test and σ²_test by running an exponentially weighted average across the training mini-batches.

BN leads to more stable gradients, and the network behaves similarly at training and test time when using BN.
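A sketch of how train and test mode typically differ, using an exponentially weighted running mean and variance (the momentum value 0.9 and the class interface are assumptions, not from the lecture):

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)
        self.running_mean, self.running_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, train=True):
        if train:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # exponentially weighted average across training mini-batches
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # test time: use the stored statistics, works even for a single image
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(4)
rng = np.random.default_rng(0)
for _ in range(100):                                                # "training"
    bn.forward(rng.normal(2.0, 3.0, size=(32, 4)), train=True)
print(bn.forward(rng.normal(2.0, 3.0, size=(1, 4)), train=False))  # single test image
```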
Credits: Deep Learning, Goodfellow et al.
– Underfitting: the training error is too big
– Overfitting: the generalization gap is too big
Weight decay (L2 regularization): the gradient step gets an extra term −λθ_k,
θ_{k+1} = θ_k − ε ∇_θ L(θ_k) − λθ_k
where ε is the learning rate and ∇_θ L(θ_k) is the gradient of the loss; the extra term pulls every weight towards zero.
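A sketch of one such update step (the loss gradient is just a placeholder here, set to zero to isolate the decay term):

```python
import numpy as np

def sgd_step_with_weight_decay(theta, grad, lr=0.1, lam=0.01):
    # θ_{k+1} = θ_k - ε ∇_θ L(θ_k) - λ θ_k
    return theta - lr * grad - lam * theta

theta = np.array([2.0, -1.0, 0.5])
grad = np.zeros(3)                 # even with zero loss gradient...
for _ in range(100):
    theta = sgd_step_with_weight_decay(theta, grad)
print(theta)                       # ...the weights decay towards zero
```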
Data augmentation: a classifier has to be invariant to a wide variety of transformations (pose, appearance, illumination). Since the training set cannot cover all of them, we generate additional training samples by applying plausible transformations to the existing images.
Random crops (Krizhevsky 2012)
Training:
– Pick a random L in [256, 480]
– Resize the training image so its short side is L
– Randomly sample crops of 224×224

Testing:
– Resize the image at N scales
– Take 10 fixed crops of 224×224: 4 corners + center, plus flips

When comparing two networks, make sure to use the same data augmentation! Consider data augmentation as part of your network design.
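A sketch of the training-time recipe in NumPy (the crude nearest-neighbour resize is only there to keep the example dependency-free; helper names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def resize_short_side(img, short):
    # nearest-neighbour resize so that min(H, W) == short
    h, w = img.shape[:2]
    scale = short / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    return img[rows][:, cols]

def random_crop_224(img):
    L = rng.integers(256, 481)            # pick a random L in [256, 480]
    img = resize_short_side(img, L)       # resize so the short side is L
    h, w = img.shape[:2]
    top, left = rng.integers(0, h - 223), rng.integers(0, w - 223)
    return img[top:top + 224, left:left + 224]

image = rng.integers(0, 256, size=(375, 500, 3), dtype=np.uint8)  # toy "photo"
print(random_crop_224(image).shape)  # (224, 224, 3)
```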
Early stopping: training time is also a hyperparameter.
– Stop once the validation error starts to grow, before the network overfits
– Early stopping restricts how far the parameters θ can travel from their initialization θ_0 towards the optimum θ*
– It acts as a regularizer without having to change the objective function

Ensembles: if the models' errors are uncorrelated, the expected (squared) error will decrease linearly with the ensemble size.
The models of an ensemble can, for example, be trained on different subsets of the data (Training Set 1, Training Set 2, Training Set 3).
Dropout (Srivastava 2014)
During the forward pass, randomly set neurons to zero.

Why does it work?
– Redundant representations: base your scores on more features (e.g., furry, has two eyes, has a tail, has paws, has two ears)
– It amounts to training a large ensemble of models (Model 1, Model 2, …), each on a different set of data (mini-batch) and with SHARED parameters
– It reduces co-adaptation between neurons

Dropout at test time: conditions at train and test time are not the same. For two inputs x, y with weights θ_1, θ_2 and dropout probability p = 0.5:
z = θ_1 x + θ_2 y
E[z] = ¼ (θ_1·0 + θ_2·0 + θ_1 x + θ_2·0 + θ_1·0 + θ_2 y + θ_1 x + θ_2 y) = ½ (θ_1 x + θ_2 y)
At test time we therefore scale by the keep probability: the weight scaling inference rule.
Drawback: dropout typically needs larger models and more training time.
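A sketch of the weight scaling inference rule with drop probability p = 0.5 (function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, p=0.5):
    # randomly set units to zero with probability p
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask

def dropout_test_weight_scaling(x, p=0.5):
    # weight scaling inference rule: multiply by the keep probability (1 - p)
    return x * (1.0 - p)

x = np.ones(100_000)
print(dropout_train(x).mean())                # ≈ 0.5: half the units were dropped
print(dropout_test_weight_scaling(x).mean())  # 0.5: matches the training-time expectation
```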
Neural Network: Depth and Width
Concept of a 'Neuron': inputs x_0, x_1, x_2 weighted by θ_0, θ_1, θ_2
Activation Functions (non-linearities)
– Sigmoid: σ(x) = 1 / (1 + e^(−x))
– tanh: tanh(x)
– ReLU: max(0, x)
– Leaky ReLU: max(0.1x, x)
Backpropagation
SGD Variations (Momentum, etc.)
Regularization: Dropout, Batch-Norm, Weight Regularization, Data Augmentation
– Batch-Norm: x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])
– Weight Regularization, e.g., L2-reg: R(θ) = Σ_i θ_i²
Weight Initialization (e.g., Xavier/2)
Why not just go deeper and get better?
– No structure!
– It's just brute force!
– Optimization becomes hard
– Performance plateaus / drops!