Lecture 6 Recap
Prof. Leal-Taixé and Prof. Niessner


SLIDE 1

Lecture 6 Recap

SLIDE 2

Neural Network

(Diagram: depth and width of the network)

SLIDE 3

Gradient Descent for Neural Networks

(Diagram: inputs x_1, x_2, x_3 → hidden units h_1, …, h_4 → outputs y_1, y_2 with targets t_1, t_2)

y_i = A(b_i + Σ_j h_j w_{i,j})
h_j = A(b_j + Σ_k x_k w_{j,k})
L_i = (y_i − t_i)²

∇_{W,b} L = [∂L/∂w_{0,0,0}, …, ∂L/∂w_{l,m,n}, …, ∂L/∂b_{l,m}]

The activation is just a simple ReLU: A(x) = max(0, x)

SLIDE 4

Stochastic Gradient Descent (SGD)

θ^{k+1} = θ^k − α ∇_θ L(θ^k, x_{1..m}, y_{1..m}),   where   ∇_θ L = (1/m) Σ_{i=1}^{m} ∇_θ L_i

+ all variations of SGD: momentum, RMSProp, Adam, …

k refers to the k-th iteration, m is the number of training samples in the current batch, and ∇_θ L is the gradient for the k-th batch.
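As a minimal sketch (my own code with made-up toy data, not from the lecture), one vanilla SGD step on a mini-batch looks like this:

import numpy as np

def sgd_step(theta, grad_fn, x_batch, y_batch, lr=1e-2):
    # theta^{k+1} = theta^k - lr * (1/m) * sum_i grad L_i(theta)
    grads = np.stack([grad_fn(theta, x, y) for x, y in zip(x_batch, y_batch)])
    return theta - lr * grads.mean(axis=0)

# Toy usage: fit y = w*x with a squared-error loss; dL_i/dw = 2*x*(w*x - y).
grad_fn = lambda w, x, y: 2 * x * (w * x - y)
w = np.array([0.0])
x_batch, y_batch = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
for _ in range(100):
    w = sgd_step(w, grad_fn, x_batch, y_batch, lr=0.05)
print(w)   # approaches 2.0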

SLIDE 5

Importance of Learning Rate

SLIDE 6

Over- and Underfitting

(Figure panels: Underfitted, Appropriate, Overfitted)

Figure extracted from Deep Learning by Adam Gibson and Josh Patterson, O'Reilly Media Inc., 2017

SLIDE 7

Over- and Underfitting

Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html

SLIDE 8

Basic Recipe for Machine Learning

  • Split your data: 60% train, 20% validation, 20% test
  • Find your hyperparameters on the validation set
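A minimal sketch of such a split (my own code; the 60/20/20 ratio follows the slide):

import numpy as np

def split_data(X, y, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train, n_val = int(0.6 * len(X)), int(0.2 * len(X))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

X, y = np.arange(100).reshape(100, 1), np.arange(100)       # toy data
(X_tr, y_tr), (X_va, y_va), (X_te, y_te) = split_data(X, y)
print(len(X_tr), len(X_va), len(X_te))                      # 60 20 20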

SLIDE 9

Basic Recipe for Machine Learning

SLIDE 10

Basically…

(Deep learning memes)

SLIDE 11

Fun things…

(Deep learning memes)

SLIDE 12

Fun things…

(Deep learning memes)

SLIDE 13

Fun things…

(Deep learning memes)

SLIDE 14

Going Deep into Neural Networks

SLIDE 15

Simple Starting Points for Debugging

  • Start simple!

– First, overfit to a single training sample (see the sketch after this list)
– Second, overfit to several training samples

  • Always try simple architecture first

– It will verify that you are learning something

  • Estimate timings (how long for each epoch?)
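A minimal sketch of the first debugging step (my own code with a toy linear model, not the lecture's): overfit a single training sample and check that the loss drops to (nearly) zero.

import numpy as np

rng = np.random.default_rng(0)
x, t = rng.standard_normal(10), 1.0      # one training sample: 10 features, scalar target
w, b = np.zeros(10), 0.0                 # a tiny linear "network"

for step in range(500):
    y = w @ x + b                        # forward pass
    grad_y = 2.0 * (y - t)               # d/dy of (y - t)^2
    w -= 0.01 * grad_y * x               # gradient step on this single sample
    b -= 0.01 * grad_y
print((w @ x + b - t) ** 2)              # should be ~0 if learning works at all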
SLIDE 16

Neural Network

  • Problems of going deeper…

– Vanishing gradients (multiplication of chain rule)

  • The impact of small decisions (architecture, activation functions, …)

  • Is my network training correctly?
SLIDE 17

Neural Networks

1) Output functions
2) Functions in neurons
3) Input of data

SLIDE 18

Output Functions

SLIDE 19

Neural Networks

(Diagram: input → network → prediction → loss (Softmax, Hinge). What is the shape of this function?)

SLIDE 20

Sigmoid for Binary Predictions

(Diagram: inputs x_0, x_1, x_2 with weights θ_0, θ_1, θ_2 feeding a sigmoid unit)

σ(x) = 1 / (1 + e^(−x))

The output can be interpreted as a probability p(y_i = 1 | x_i, θ).

SLIDE 21

Softmax Formulation

  • What if we have multiple classes?

(Diagram: inputs x_0, x_1, x_2 with weights θ_0, θ_1, θ_2 feeding outputs Π_i)

SLIDE 22

Softmax Formulation

  • What if we have multiple classes?

(Diagram: inputs x_0, x_1, x_2 feeding a softmax layer)

SLIDE 23

Softmax Formulation

  • What if we have multiple classes?

(Diagram: inputs x_0, x_1, x_2 feeding a softmax layer with two outputs)

Π_1 = e^(x_i θ_1) / (e^(x_i θ_1) + e^(x_i θ_2))
Π_2 = e^(x_i θ_2) / (e^(x_i θ_1) + e^(x_i θ_2))

SLIDE 24

Softmax Formulation

  • Softmax: p(y_i | x, θ) = e^(x θ_i) / Σ_{k=1}^{n} e^(x θ_k)
  • Softmax loss (Maximum Likelihood Estimate): L_i = − log( e^(s_{y_i}) / Σ_k e^(s_k) )

(Pipeline: scores → exp → normalize)

SLIDE 25

Loss Functions

SLIDE 26

Naïve Losses

  • L2 loss (sum of squared differences, SSD): L_2 = Σ_i (y_i − f(x_i))²
    – Prone to outliers
    – Compute-efficient (optimization)
    – Optimum is the mean

  • L1 loss (sum of absolute differences): L_1 = Σ_i |y_i − f(x_i)|
    – Robust
    – Costly to compute
    – Optimum is the median

Worked example from the slide (predictions f(x_i) vs. targets y_i):
  f(x_i): 12 24 42 23 34 32 5 2 12 31 12 31 31 64 5 13
  y_i:    15 20 40 25 34 32 5 2 12 31 12 31 31 64 5 13
  L_1(x, y) = 3 + 4 + 2 + 2 + 0 + … + 0 = 15
  L_2(x, y) = 9 + 16 + 4 + 4 + 0 + … + 0 = 66
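A minimal sketch (my own code) computing both losses on the first four value pairs of the example above:

import numpy as np

def l1_loss(y, f):
    return np.abs(y - f).sum()           # optimum is the median, robust to outliers

def l2_loss(y, f):
    return ((y - f) ** 2).sum()          # optimum is the mean, prone to outliers

f = np.array([12, 24, 42, 23])           # first four predictions f(x_i) from the slide
y = np.array([15, 20, 40, 25])           # corresponding targets y_i
print(l1_loss(y, f), l2_loss(y, f))      # 11 and 33 for these four pairs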
SLIDE 27

Cross-Entropy (Softmax)

Softmax loss: L_i = − log( e^(s_{y_i}) / Σ_k e^(s_k) )

Given a function with weights W and training pairs [x_i; y_i] (input and label),
score function: s = f(x_i, W), e.g. f(x_i, W) = W · [x_0, x_1, …, x_n]^T

Suppose 3 training examples (columns) and 3 classes (rows), with scores:
  cat:    3.2   1.3   2.2
  car:    5.1   4.9   2.5
  chair: −1.7   2.0  −3.1

SLIDE 28

Cross-Entropy (Softmax)

(Same setup as the previous slide.)

First example (true class cat), scores: [3.2, 5.1, −1.7]
SLIDE 29

Cross-Entropy (Softmax)

(Same setup as the previous slide.)

First example, scores [3.2, 5.1, −1.7]  →  exp  →  [24.5, 164.0, 0.18]

SLIDE 30

Cross-Entropy (Softmax)

(Same setup as the previous slide.)

First example, scores [3.2, 5.1, −1.7]  →  exp  →  [24.5, 164.0, 0.18]  →  normalize  →  [0.13, 0.87, 0.00]

SLIDE 31

Cross-Entropy (Softmax)

(Same setup as the previous slide.)

First example: [3.2, 5.1, −1.7]  →  exp  →  [24.5, 164.0, 0.18]  →  normalize  →  [0.13, 0.87, 0.00]  →  −log(x)  →  [2.04, 0.14, 6.94]

Loss of the first example (true class cat): L_1 = −log(0.13) = 2.04
SLIDE 32

Cross-Entropy (Softmax)

(Same setup as the previous slides.)

Per-example losses: L_1 = 2.04, L_2 = 0.079, L_3 = 6.156

Full loss (over all pairs): L = (1/N) Σ_{i=1}^{N} L_i = (2.04 + 0.079 + 6.156) / 3 ≈ 2.76

SLIDE 33

Hinge Loss (SVM Loss)

Multiclass SVM loss: L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)

SLIDE 34

Hinge Loss (SVM Loss)

Multiclass SVM loss: L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)

Given a function with weights W and training pairs [x_i; y_i] (input and label),
score function: s = f(x_i, W), e.g. f(x_i, W) = W · [x_0, x_1, …, x_n]^T

SLIDE 35

Hinge Loss (SVM Loss)

(Same loss and score function as the previous slide.)

Suppose: 3 training examples and 3 classes.

SLIDE 36

Hinge Loss (SVM Loss)

(Same loss and score function as the previous slides.)

Scores for 3 training examples (columns) and 3 classes (rows):
  cat:    3.2   1.3   2.2
  car:    5.1   4.9   2.5
  chair: −1.7   2.0  −3.1

SLIDE 37

Hinge Loss (SVM Loss)

(Same setup as the previous slide.)

First example (true class cat):
L_1 = max(0, 5.1 − 3.2 + 1) + max(0, −1.7 − 3.2 + 1)
    = max(0, 2.9) + max(0, −3.9)
    = 2.9 + 0 = 2.9

SLIDE 38

Hinge Loss (SVM Loss)

(Same setup as the previous slide.)

Second example (true class car):
L_2 = max(0, 1.3 − 4.9 + 1) + max(0, 2.0 − 4.9 + 1)
    = max(0, −2.6) + max(0, −1.9)
    = 0 + 0 = 0

SLIDE 39

Hinge Loss (SVM Loss)

(Same setup as the previous slide.)

Third example (true class chair):
L_3 = max(0, 2.2 − (−3.1) + 1) + max(0, 2.5 − (−3.1) + 1)
    = max(0, 6.3) + max(0, 6.6)
    = 6.3 + 6.6 = 12.9

Per-example losses so far: 2.9, 0, 12.9

SLIDE 40

Hinge Loss (SVM Loss)

(Same setup as the previous slides.)

Per-example losses: L_1 = 2.9, L_2 = 0, L_3 = 12.9

Full loss (over all pairs): L = (1/N) Σ_{i=1}^{N} L_i = (2.9 + 0 + 12.9) / 3 ≈ 5.27

SLIDE 41

Hinge Loss vs. Softmax

Softmax: L_i = − log( e^(s_{y_i}) / Σ_k e^(s_k) )

Hinge loss: L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)

SLIDE 42

Hinge Loss vs. Softmax

Softmax: L_i = − log( e^(s_{y_i}) / Σ_k e^(s_k) )
Hinge loss: L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)

Given the following scores (true class y_i = 0 in each case):
  s = [5, −3, 2]
  s = [5, 10, 10]
  s = [5, −20, −20]

SLIDE 43

Hinge Loss vs. Softmax

(Same formulas and scores as the previous slide; true class y_i = 0.)

Hinge loss:
  s = [5, −3, 2]:     max(0, −3 − 5 + 1) + max(0, 2 − 5 + 1) = 0
  s = [5, 10, 10]:    max(0, 10 − 5 + 1) + max(0, 10 − 5 + 1) = 12
  s = [5, −20, −20]:  max(0, −20 − 5 + 1) + max(0, −20 − 5 + 1) = 0

SLIDE 44

Hinge Loss vs. Softmax

(Same formulas and scores as the previous slide; true class y_i = 0.)

Hinge loss:
  s = [5, −3, 2]:     max(0, −3 − 5 + 1) + max(0, 2 − 5 + 1) = 0
  s = [5, 10, 10]:    max(0, 10 − 5 + 1) + max(0, 10 − 5 + 1) = 12
  s = [5, −20, −20]:  max(0, −20 − 5 + 1) + max(0, −20 − 5 + 1) = 0

Softmax loss (typed into the Google calculator, as on the slide):
  s = [5, −3, 2]:     −ln( e^5 / (e^5 + e^(−3) + e^2) )       ≈ 0.05
  s = [5, 10, 10]:    −ln( e^5 / (e^5 + e^10 + e^10) )        ≈ 5.70
  s = [5, −20, −20]:  −ln( e^5 / (e^5 + e^(−20) + e^(−20)) )  ≈ 2e−11

SLIDE 45

Hinge Loss vs. Softmax

(Same computation as the previous slide.)

  Hinge loss:    0      12     0
  Softmax loss:  0.05   5.70   2e−11

Softmax *always* wants to improve! Hinge loss saturates.
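A minimal sketch (my own code) checking both losses on the three score vectors above:

import numpy as np

def hinge(s, y):
    m = np.maximum(0.0, s - s[y] + 1.0)
    m[y] = 0.0
    return m.sum()

def softmax_ce(s, y):
    e = np.exp(s - s.max())
    return -np.log(e[y] / e.sum())

for s in ([5, -3, 2], [5, 10, 10], [5, -20, -20]):
    s = np.array(s, dtype=float)
    print(hinge(s, 0), softmax_ce(s, 0))
# hinge: 0, 12, 0 -- softmax: ~0.05, ~5.70, ~2e-11 (never exactly 0, so it keeps pushing)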

SLIDE 46

Loss in a Compute Graph

(Compute graph: input data x_i and weights W → score function f(x_i, W) → data loss L_i, plus a regularization loss on W.)

SVM: L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)
Softmax: L_i = − log( e^(s_{y_i}) / Σ_k e^(s_k) )

Full loss: L = (1/N) Σ_{i=1}^{N} L_i + λ R(W),   e.g. L2-reg: R_2(W) = Σ_i w_i²

Given a function with weights W and training pairs [x_i; y_i] (input and labels),
score function: s = f(x_i, W), e.g. f(x_i, W) = W · [x_0, x_1, …, x_n]^T

SLIDE 47

Compute Graphs

SVM: L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)
Softmax: L_i = − log( e^(s_{y_i}) / Σ_k e^(s_k) )

Full loss: L = (1/N) Σ_{i=1}^{N} L_i + λ R(W),   e.g. L2-reg: R_2(W) = Σ_i w_i²

Score function: s = f(x_i, W), e.g. f(x_i, W) = W · [x_0, x_1, …, x_n]^T

SLIDE 48

Compute Graphs

(Same formulas as the previous slide.)

We want to find the optimal W, i.e. the weights are the unknowns of the optimization problem.

Compute the gradient w.r.t. W: the gradient ∇_W L is computed via backpropagation.

SLIDE 49

Weight Regularization & SVM Loss

Multiclass SVM loss: L_i = Σ_{j ≠ y_i} max(0, f(x_i; W)_j − f(x_i; W)_{y_i} + 1)

L2-reg: R_2(W) = Σ_k Σ_l (w_{k,l})²
L1-reg: R_1(W) = Σ_k Σ_l |w_{k,l}|

Full loss: L = (1/N) Σ_{i=1}^{N} Σ_{j ≠ y_i} max(0, f(x_i; W)_j − f(x_i; W)_{y_i} + 1) + λ R(W)

SLIDE 50

Weight Regularization & SVM Loss

(Same loss and regularizers as the previous slide.)

Example: x = [1, 1, 1, 1],  w_1 = [1, 0, 0, 0],  w_2 = [0.25, 0.25, 0.25, 0.25]

Both weight vectors give the same score: w_1ᵀ x = w_2ᵀ x = 1

But the L2 regularizer prefers the spread-out weights:
  R_2(w_1) = 1
  R_2(w_2) = 0.25² + 0.25² + 0.25² + 0.25² = 0.25
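A minimal sketch (my own code) of the example above:

import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

print(w1 @ x, w2 @ x)                    # same score: 1.0 1.0
print((w1 ** 2).sum(), (w2 ** 2).sum())  # R_2: 1.0 vs. 0.25 -> L2-reg prefers the spread-out w2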

SLIDE 51

Activation Functions

SLIDE 52

Neurons

SLIDE 53

Neurons

SLIDE 54

Neural Networks

(Diagram: input → network → prediction → loss (Softmax, Hinge). What is the shape of this function?)

SLIDE 55

Activation Functions for Hidden Units

(Diagram: inputs x_0, x_1, x_2 with weights w_0, w_1, w_2 feeding a unit)

SLIDE 56

Sigmoid

σ(x) = 1 / (1 + e^(−x)),   y_i ∈ {0, 1}

(Diagram: inputs x_0, x_1, x_2 with weights w_0, w_1, w_2 feeding a sigmoid unit)

Can be interpreted as a probability

SLIDE 57

Sigmoid

σ(x) = 1 / (1 + e^(−x))

Forward: x → σ(x), with upstream gradient ∂L/∂σ
Backward: ∂L/∂x = (∂σ/∂x) · (∂L/∂σ)

SLIDE 58

Sigmoid

σ(x) = 1 / (1 + e^(−x)),   backward: ∂L/∂x = (∂σ/∂x) · (∂L/∂σ)

Forward example: x = 6, deep in the flat (saturated) region of the sigmoid

Saturated neurons kill the gradient flow
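A minimal sketch (my own code) of why saturation hurts: the local gradient dσ/dx is already tiny at x = 6.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    return sigmoid(x) * (1.0 - sigmoid(x))   # dσ/dx

print(dsigmoid(0.0))   # 0.25   (active region)
print(dsigmoid(6.0))   # ~0.0025: the upstream gradient ∂L/∂σ gets multiplied by this tiny factor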

SLIDE 59

Sigmoid

σ(x) = 1 / (1 + e^(−x)),   backward: ∂L/∂x = (∂σ/∂x) · (∂L/∂σ)

(Plot: only the non-saturated middle part of the sigmoid is an active region for gradient descent.)

SLIDE 60

Sigmoid

σ(x) = 1 / (1 + e^(−x))

Output is always positive

SLIDE 61

Problem of Positive Output

(Diagram: inputs x_0, x_1, x_2 with weights w_0, w_1, w_2)

f( Σ_i w_i x_i + b )

We want to compute the gradient w.r.t. the weights.

SLIDE 62

Problem of Positive Output

(Same diagram as the previous slide.)

We want to compute the gradient w.r.t. the weights. With z = Σ_i w_i x_i + b:

∂z/∂w_i = x_i > 0   (if the inputs x_i, e.g. sigmoid outputs of the previous layer, are all positive)

SLIDE 63

Problem of Positive Output

(Same diagram as the previous slide.)

With ∂z/∂w_i = x_i > 0, the sign of the gradient w.r.t. each weight is set by the shared upstream factor ∂f/∂z: it is going to be either positive or negative for all weights at once.

SLIDE 64

Problem of Positive Output

(Plot: gradient update directions in the w_1, w_2 plane)

More on zero-mean data later

SLIDE 65

tanh

Zero-centered, but still saturates

LeCun 1991

SLIDE 66

Rectified Linear Units (ReLU)

Large and consistent gradients; does not saturate; fast convergence

Krizhevsky 2012

σ(x) = max(0, x)

SLIDE 67

Rectified Linear Units (ReLU)

Large and consistent gradients; does not saturate; fast convergence. But what happens if a ReLU always outputs zero? A dead ReLU: no gradient flows back through it.

SLIDE 68

Rectified Linear Units (ReLU)

  • Initializing ReLU neurons with slightly positive biases (0.1) makes it likely that they stay active for most inputs: f( Σ_i w_i x_i + b )

SLIDE 69

Leaky ReLU

Does not die

σ(x) = max(0.01x, x)

Maas 2013

SLIDE 70

Parametric ReLU

Does not die

σ(x) = max(αx, x)

One more parameter to backprop into

He 2015
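A minimal sketch (my own code) of the ReLU family from the last few slides; the α for the parametric ReLU is chosen arbitrarily here, in practice it is learned.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # σ(x) = max(0, x)

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)      # σ(x) = max(0.01x, x): does not die

def prelu(x, alpha):
    return np.maximum(alpha * x, x)      # σ(x) = max(αx, x): α is backpropped into

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x), prelu(x, alpha=0.2))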

SLIDE 71

Maxout Units

Goodfellow 2013

(Diagram: inputs x_0, x_1, x_2 feeding two linear units with weights w_{0,1}, …, w_{2,2}, followed by a max)

!"#$%& = max(,-

.# + 0-, ,2 .# + 02)
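A minimal sketch (my own code, toy weights) of a single maxout unit with two linear pieces:

import numpy as np

def maxout(x, w1, b1, w2, b2):
    return np.maximum(w1 @ x + b1, w2 @ x + b2)   # h = max(w1ᵀx + b1, w2ᵀx + b2)

x = np.array([1.0, -2.0, 0.5])
w1, b1 = np.array([0.3, 0.1, -0.2]), 0.0
w2, b2 = np.array([-0.5, 0.4, 0.1]), 0.2
print(maxout(x, w1, b1, w2, b2))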

SLIDE 72

Maxout Units

Piecewise linear approximation of a convex function with N pieces

Goodfellow 2013

SLIDE 73

Maxout Units

  • Generalization of ReLUs
  • Linear regimes
  • Does not die
  • Does not saturate
  • Increases the number of parameters

SLIDE 74

Quick Guide

  • Sigmoid is not really used
  • ReLU is the standard choice
  • Second choice: the ReLU variants or Maxout
  • Recurrent nets will require tanh or similar
SLIDE 75

Neural Networks

Data pre-processing? (Diagram: prediction fed into a loss such as Softmax or Hinge.)

SLIDE 76

Data Pre-Processing

For images, subtract the mean image (AlexNet) or the per-channel mean (VGG-Net).
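A minimal sketch (my own code) of the two mean-subtraction variants, for a toy batch of images of shape (N, H, W, C):

import numpy as np

images = np.random.default_rng(0).random((8, 32, 32, 3))   # toy batch

mean_image   = images.mean(axis=0)             # AlexNet-style: one (H, W, C) mean image
mean_channel = images.mean(axis=(0, 1, 2))     # VGG-style: one mean per channel, shape (3,)

centered_image   = images - mean_image
centered_channel = images - mean_channel
print(centered_image.shape, centered_channel.shape)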

SLIDE 77

Outlook

SLIDE 78

Outlook

Regularization in the optimization:

  • Dropout, weight decay, etc.

Regularization in the architecture:

  • Convolutions! (CNNs)

Initialization of the optimization:

  • How to set the weights at the beginning

Handling limited training data:

  • Data augmentation

SLIDE 79

Next Lecture

  • This week:

– No tutorial due to the holiday!

  • Next lecture June 4th

– More about training neural networks: regularization, BN, dropout, etc.
– Followed by CNNs
