Lecture 6 recap
- Prof. Leal-Taixé and Prof. Niessner
1
Neural Network: Width and Depth
2
!" !# !$ ℎ" ℎ# ℎ$ ℎ& '" '# (" (# ') = +(-#,) + 0
1
ℎ12#,),1) ℎ1 = +(-",1 + 0
4
!42",1,4) 5) = ') − () $ 78,9: ;,< (2) = =: =2","," … … =: =2?,@,A … … =: =-?,@ Just simple: + ! = max(0, !)
3
!"#$ = !" − '()*(!", -{$..0}, 2{$..0}) ()* =
$ 0 ∑56$ 0 ()*5
+ all variations of SGD: momentum, RMSProp, Adam, …
7 now refers to 7-th iteration 8 training samples in the current batch Gradient for the 7-th batch
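As a sanity check, here is a minimal NumPy sketch of this update rule on a toy least-squares problem; `sgd_step`, `grad_fn`, and the toy data are illustrative stand-ins, not the lecture's network.

```python
import numpy as np

def sgd_step(theta, x_batch, y_batch, grad_fn, lr=0.05):
    """One vanilla SGD update: theta^{k+1} = theta^k - lr * grad L (batch mean)."""
    # Average the per-sample gradients over the m samples in the current batch.
    grads = np.stack([grad_fn(theta, x, y) for x, y in zip(x_batch, y_batch)])
    return theta - lr * grads.mean(axis=0)

# Toy model (an assumption for illustration): linear prediction theta . x with
# per-sample loss L_i = (theta . x_i - y_i)^2, so grad_i = 2 (theta . x_i - y_i) x_i.
def grad_fn(theta, x, y):
    return 2.0 * (theta @ x - y) * x

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
x_batch = rng.normal(size=(8, 3))             # m = 8 samples in this batch
y_batch = x_batch @ np.array([1.0, -2.0, 3.0])  # targets from a "true" theta
for k in range(200):                          # k-th iteration
    theta = sgd_step(theta, x_batch, y_batch, grad_fn)
print(theta)  # approaches [1, -2, 3]
```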
4
5
[Figure: model fits ranging from underfitted to appropriate to overfitted]
Figure extracted from Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017
6
Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html
7
Find your hyperparameters: split the data into 60% train, 20% validation, 20% test.
8
9
10
Deep learning memes
11
Deep learning memes
12
Deep learning memes
13
Deep learning memes
14
– First, overfit to a single training sample
– Second, overfit to several training samples
– It will verify that you are learning something
15
– Vanishing gradients (multiplication in the chain rule, saturating activation functions...)
16
17
1) Output functions
2) Functions in neurons
3) Input of data
18
What is the shape of this function? [Pipeline: Prediction → Loss (Softmax, Hinge)]
19
[Figure: inputs $x_0, x_1, x_2$ weighted by $\theta_0, \theta_1, \theta_2$ and passed through a sigmoid]

$\sigma(x) = \frac{1}{1 + e^{-x}}$

The output can be interpreted as a probability: $p(y_i = 1 \mid x_i, \theta)$
20
[Figure: the same network with multiple outputs $\Pi_i$]
21
[Figure: two-output network followed by a Softmax layer]
22
Softmax with two classes:

$\Pi_1 = \frac{e^{x_i \theta_1}}{e^{x_i \theta_1} + e^{x_i \theta_2}}, \qquad \Pi_2 = \frac{e^{x_i \theta_2}}{e^{x_i \theta_1} + e^{x_i \theta_2}}$
23
General case:

$p(y_i \mid x, \theta) = \frac{e^{x \theta_i}}{\sum_{k=1}^{n} e^{x \theta_k}}$

Softmax loss: $L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_k e^{s_k}}\right)$ (exponentiate, then normalize)
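A minimal NumPy sketch of these two formulas; the shift by max(s) is a standard numerical-stability trick (not shown on the slide), and the function names are mine.

```python
import numpy as np

def softmax(s):
    """p_k = exp(s_k) / sum_j exp(s_j); shifting by max(s) avoids overflow."""
    e = np.exp(s - np.max(s))
    return e / e.sum()

def softmax_loss(s, y):
    """L_i = -log(e^{s_y} / sum_k e^{s_k})."""
    return -np.log(softmax(s)[y])

print(softmax(np.array([3.2, 5.1, -1.7])))           # ~ [0.13, 0.87, 0.00]
print(softmax_loss(np.array([3.2, 5.1, -1.7]), 0))   # ~ 2.04
```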
24
25
L2 loss: $L_2 = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2$

L1 loss: $L_1 = \sum_{i=1}^{n} \left|y_i - f(x_i)\right|$

f(x_i): 12 24 42 23 34 32 5 2 12 31 12 31 31 64 5 13
y_i:    15 20 40 25 34 32 5 2 12 31 12 31 31 64 5 13

$L_1(x, y) = 3 + 4 + 2 + 2 + 0 + \dots + 0 = 11$
$L_2(x, y) = 9 + 16 + 4 + 4 + 0 + \dots + 0 = 33$
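Assuming the table above is recovered correctly, both losses can be checked in a few NumPy lines:

```python
import numpy as np

# Predictions f(x_i) and targets y_i from the table above.
y_hat = np.array([12, 24, 42, 23, 34, 32, 5, 2, 12, 31, 12, 31, 31, 64, 5, 13])
y     = np.array([15, 20, 40, 25, 34, 32, 5, 2, 12, 31, 12, 31, 31, 64, 5, 13])

l1 = np.abs(y - y_hat).sum()    # 3 + 4 + 2 + 2 = 11
l2 = ((y - y_hat) ** 2).sum()   # 9 + 16 + 4 + 4 = 33
print(l1, l2)
```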
26
Softmax !" = − log( )*+,
∑, )*.)
Suppose: 3 training examples and 3 classes 1.3 4. 4.9 2.0 3. 3.2 5.1
2.2 2.5
3.1 cat chair “car” Loss scores Score function 0 = 1(2", 4) e.g., 1(2", 4) = 4 ⋅ 26, 27, … , 29 :
Given a function with weights 4, Training pairs [2"; ="] (input and labels)
27
Softmax !" = − log( )*+,
∑, )*.)
Suppose: 3 training examples and 3 classes 1.3 4. 4.9 2.0 3. 3.2 5.1
2.2 2.5
3.1 cat chair “car” Loss scores Score function 0 = 1(2", 4) e.g., 1(2", 4) = 4 ⋅ 26, 27, … , 29 :
Given a function with weights 4, Training pairs [2"; ="] (input and labels)
3.2 5.1
28
Softmax !" = − log( )*+,
∑, )*.)
Suppose: 3 training examples and 3 classes 1.3 4. 4.9 2.0 3. 3.2 5.1
2.2 2.5
3.1 cat chair “car” Loss scores Score function 0 = 1(2", 4) e.g., 1(2", 4) = 4 ⋅ 26, 27, … , 29 :
Given a function with weights 4, Training pairs [2"; ="] (input and labels)
3.2 5.1
24.5 164.0 0.18
exp
29
Softmax !" = − log( )*+,
∑, )*.)
Suppose: 3 training examples and 3 classes 1.3 4. 4.9 2.0 3. 3.2 5.1
2.2 2.5
3.1 cat chair “car” Loss scores Score function 0 = 1(2", 4) e.g., 1(2", 4) = 4 ⋅ 26, 27, … , 29 :
Given a function with weights 4, Training pairs [2"; ="] (input and labels)
3.2 5.1
24.5 164.0 0.18 0.13 0.87 0.00
exp normalize
30
Softmax !" = − log( )*+,
∑, )*.)
Suppose: 3 training examples and 3 classes 1.3 4. 4.9 2.0 3. 3.2 5.1
2.2 2.5
3.1 cat chair “car” Loss 2.0 .04 0. 0.14 6. 6.94 scores Score function 0 = 1(2", 4) e.g., 1(2", 4) = 4 ⋅ 26, 27, … , 29 :
Given a function with weights 4, Training pairs [2"; ="] (input and labels)
3.2 5.1
24.5 164.0 0.18 0.13 0.87 0.00
exp
2.04 0.14 6.94
normalize
31
Softmax !" = − log( )*+,
∑, )*.)
Suppose: 3 training examples and 3 classes 1.3 4. 4.9 2.0 3. 3.2 5.1
2.2 2.5
3.1 cat chair “car” Loss 2.0 .04 0. 0.07 079 6. 6.156 scores Score function 0 = 1(2", 4) e.g., 1(2", 4) = 4 ⋅ 26, 27, … , 29 :
Given a function with weights 4, Training pairs [2"; ="] (input and labels)
3.2 5.1
24.5 164.0 0.18 0.13 0.87 0.00
exp
2.0 .04 0.14 6.94
normalize
! = 1 @ A
"B7 9
!" = = !7 + !D + !E 3 = = 2.04 + 0.079 + 6.156 3 = = O. PQ
32
Multiclass SVM loss: $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$

Given a function with weights $\theta$, training pairs $[x_i; y_i]$ (input and labels), and score function $s = f(x_i, \theta)$, e.g., $f(x_i, \theta) = \theta \cdot [x_0, x_1, \dots, x_n]^\top$.

Suppose: 3 training examples and 3 classes, with the same scores as before:

        image 1 (cat)   image 2 (car)   image 3 (chair)
cat          3.2             1.3              2.2
car          5.1             4.9              2.5
chair       -1.7             2.0             -3.1

$L_1 = \max(0, 5.1 - 3.2 + 1) + \max(0, -1.7 - 3.2 + 1) = \max(0, 2.9) + \max(0, -3.9) = 2.9 + 0 = 2.9$

$L_2 = \max(0, 1.3 - 4.9 + 1) + \max(0, 2.0 - 4.9 + 1) = \max(0, -2.6) + \max(0, -1.9) = 0 + 0 = 0$

$L_3 = \max(0, 2.2 - (-3.1) + 1) + \max(0, 2.5 - (-3.1) + 1) = \max(0, 6.3) + \max(0, 6.6) = 6.3 + 6.6 = 12.9$

Full loss (over all pairs): $L = \frac{1}{n} \sum_{i=1}^{n} L_i = \frac{2.9 + 0 + 12.9}{3} = 5.27$
40
Softmax: !" = − log( )*+,
∑, )*.)
Hinge loss: !" = ∑012, max(0, 8
0 − 82, + 1)
41
Softmax: !" = − log( )*+,
∑, )*.)
Given the following scores: 0 = [5, −3, 2] 0 = [5, 10, 10] 0 = [5, −20, −20] 9" = 0 Hinge loss: !" = ∑:;<, max(0, 0
: − 0<, + 1)
42
Softmax: !" = − log( )*+,
∑, )*.)
Given the following scores: 0 = [5, −3, 2] 0 = [5, 10, 10] 0 = [5, −20, −20] 9" = 0 Hinge loss:
max(0, −3 − 5 + 1) + max 0, 2 − 5 + 1 = 0 max(0, 10 − 5 + 1) + max 0, 10 − 5 + 1 = 12 max(0, −20 − 5 + 1) + max 0, −20 − 5 + 1 = 0
Hinge loss: !" = ∑>?@, max(0, 0
> − 0@, + 1)
43
Softmax: !" = − log( )*+,
∑, )*.)
Given the following scores: 0 = [5, −3, 2] 0 = [5, 10, 10] 0 = [5, −20, −20] 9" = 0 Hinge loss: Softmax loss:
max(0, −3 − 5 + 1) + max 0, 2 − 5 + 1 = 0 max(0, 10 − 5 + 1) + max 0, 10 − 5 + 1 = 12 max(0, −20 − 5 + 1) + max 0, −20 − 5 + 1 = 0
Google…
0.05 05 Google…
5.70 Google…
.e-11 11
Hinge loss: !" = ∑>?@, max(0, 0
> − 0@, + 1)
44
Softmax: !" = − log( )*+,
∑, )*.)
Given the following scores: 0 = [5, −3, 2] 0 = [5, 10, 10] 0 = [5, −20, −20] 9" = 0 Hinge loss: Softmax loss:
max(0, −3 − 5 + 1) + max 0, 2 − 5 + 1 = 0 max(0, 10 − 5 + 1) + max 0, 10 − 5 + 1 = 12 max(0, −20 − 5 + 1) + max 0, −20 − 5 + 1 = 0
Google…
0.05 05 Google…
5.70 Google…
.e-11 11
Softmax *always* wants to improve! Hinge Loss saturates Hinge loss: !" = ∑>?@, max(0, 0
> − 0@, + 1)
45
! "# $("#, !) SVM (# = ∑+,-. max(0, 3
+ − 3-. + 1)
7# (
score function regularization loss data losses
Softmax (# = − log(
;<=. ∑. ;<>)
Full Loss ( =
? @ ∑#A? @
(# + BC(!) e.g., (C-reg: BC ! = ∑#A?
@
D#
C
labeled data input data Given a function with weights !, Training pairs ["#; 7#] (input and labels)
Score function 3 = $("#, !) e.g., $("#, !) = ! ⋅ "I, "?, … , "@ K
46
SVM !" = ∑%&'( max(0, /
% − /'( + 1)
Softmax !" = − log( 789(
∑( 78:)
Full Loss ! =
; < ∑"=; <
!" + >?(@) e.g., !?-reg: >? @ = ∑"=;
<
A"
?
Score function / = B(C", @) e.g., B(C", @) = @ ⋅ CE, C;, … , C< G
47
SVM !" = ∑%&'( max(0, /
% − /'( + 1)
Softmax !" = − log( 789(
∑( 78:)
Full Loss ! =
; < ∑"=; <
!" + >?(@) e.g., !?-reg: >? @ = ∑"=;
<
A"
?
Score function / = B(C", @) e.g., B(C", @) = @ ⋅ CE, C;, … , C< G Want to find optimal @. I.e., weights are unknowns of
Compute gradient w.r.t. @. Gradient HI! is computed via backpropagation
48
Multiclass SVM loss: $L_i = \sum_{j \neq y_i} \max(0, f(x_i; \theta)_j - f(x_i; \theta)_{y_i} + 1)$

$L_2$-reg: $R_2(\theta) = \sum_k \theta_k^2$
$L_1$-reg: $R_1(\theta) = \sum_k |\theta_k|$

Full loss: $L = \frac{1}{n} \sum_{i=1}^{n} \sum_{j \neq y_i} \max(0, f(x_i; \theta)_j - f(x_i; \theta)_{y_i} + 1) + \lambda R(\theta)$
49
Example: $x = [1, 1, 1, 1]$, $\theta_1 = [1, 0, 0, 0]$, $\theta_2 = [0.25, 0.25, 0.25, 0.25]$

$\theta_1^\top x = \theta_2^\top x = 1$, so both weight vectors give the same score (same data loss).

$R_2(\theta_1) = 1$, while $R_2(\theta_2) = 0.25^2 + 0.25^2 + 0.25^2 + 0.25^2 = 0.25$: the $L_2$ regularizer prefers the spread-out weights $\theta_2$.
50
51
52
53
What is the shape of this function? [Pipeline: Prediction → Loss (Softmax, Hinge)]
54
[Figure: inputs $x_0, x_1, x_2$ with weights $w_0, w_1, w_2$ feeding a sigmoid neuron]
55
$\sigma(x) = \frac{1}{1 + e^{-x}}$, with $y_i \in \{0, 1\}$

The output can be interpreted as a probability.
56
Forward: $\sigma(x) = \frac{1}{1 + e^{-x}}$; backward: $\frac{\partial L}{\partial x} = \frac{\partial \sigma}{\partial x} \cdot \frac{\partial L}{\partial \sigma}$
57
Forward: $\sigma(x) = \frac{1}{1 + e^{-x}}$; backward: $\frac{\partial L}{\partial x} = \frac{\partial \sigma}{\partial x} \cdot \frac{\partial L}{\partial \sigma}$

At, e.g., $x = 6$, $\sigma(x) \approx 1$ and $\frac{\partial \sigma}{\partial x} \approx 0$: saturated neurons kill the gradient flow.
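A quick numerical illustration of the saturation effect, using the identity $\frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x))$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # d sigma / dx

for x in [0.0, 2.0, 6.0]:
    print(x, sigmoid(x), sigmoid_grad(x))
# x = 0: grad = 0.25 (active region)
# x = 6: grad ~ 0.0025 -> the upstream gradient dL/dsigma is multiplied
#        by nearly zero: the neuron is saturated
```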
58
The active region for gradient descent is around $x = 0$, where $\frac{\partial \sigma}{\partial x}$ is large.
59
$\sigma(x) = \frac{1}{1 + e^{-x}}$: the output is always positive.
60
[Figure: neuron computing $z = \sum_i w_i x_i + b$, followed by $f(z)$]

We want to compute the gradient w.r.t. the weights.
61
[Figure: neuron computing $z = \sum_i w_i x_i + b$, followed by $f(z)$]

We want to compute the gradient w.r.t. the weights:
$\frac{\partial z}{\partial w_i} = x_i > 0$ (if all inputs are positive, e.g., sigmoid outputs of the previous layer)
62
Since $\frac{\partial L}{\partial w_i} = \frac{\partial f}{\partial z} \cdot x_i$ with $x_i > 0$, the gradient is going to be either positive or negative for all weights at once, depending only on the sign of $\frac{\partial f}{\partial z}$.
63
[Figure: zig-zag gradient updates in the $(w_1, w_2)$ plane]

More on zero-mean data later.
64
tanh [LeCun 1991]:
– Zero-centered
– Still saturates
65
ReLU [Krizhevsky 2012]: $\sigma(x) = \max(0, x)$
– Large and consistent gradients
– Does not saturate
– Fast convergence
66
Large and consistent gradients, does not saturate, fast convergence. But what happens if a ReLU outputs zero? Dead ReLU: for $x < 0$ the gradient is zero, so the neuron may never activate again.
67
Initializing the bias with a small positive value (e.g., 0.1) makes it likely that ReLUs stay active for most inputs.

$f\left(\sum_i w_i x_i + b\right)$
68
Leaky ReLU [Maas 2013]: $\sigma(x) = \max(0.01x, x)$
– Does not die
69
Parametric ReLU [He 2015]: $\sigma(x) = \max(\alpha x, x)$
– Does not die
– One more parameter to backprop into
70
Maxout [Goodfellow 2013]

[Figure: inputs $x_0, x_1, x_2$ feeding two linear units with weights $w_{01}, w_{02}, w_{11}, w_{12}, w_{21}, w_{22}$, combined by a max]
71
!"#$%& = max(,-
.# + 0-, ,2 .# + 02)
Piecewise linear approximation of a convex function with N pieces
Goodfellow 2013
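For concreteness, minimal NumPy sketches of the activations discussed here; the two-piece maxout weights below are made-up example values:

```python
import numpy as np

def relu(x):       return np.maximum(0, x)          # Krizhevsky 2012
def leaky_relu(x): return np.maximum(0.01 * x, x)   # Maas 2013
def prelu(x, a):   return np.maximum(a * x, x)      # He 2015; a is learned

def maxout(x, W, b):
    """Maxout over N pieces: max_n (w_n . x + b_n); W is (N, d), b is (N,)."""
    return np.max(W @ x + b, axis=0)

x = np.array([1.0, -2.0, 0.5])
W = np.array([[0.1, 0.2, -0.3],    # two pieces -> max(w1.x + b1, w2.x + b2)
              [-0.5, 0.4, 0.2]])
b = np.array([0.0, 0.1])
print(relu(x), leaky_relu(x), prelu(x, 0.1), maxout(x, W, b))
```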
72
Maxout:
– Generalization (linear regimes)
– Does not die
– Does not saturate
– Increases the number of parameters
73
74
Data pre-processing? [Pipeline: Prediction → Loss (Softmax, Hinge)]
75
For images: subtract the mean image (AlexNet) or the per-channel mean (VGG-Net).
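A minimal sketch of both conventions, on random stand-in data; in practice the mean is computed on the training set and reused at test time:

```python
import numpy as np

# Stand-in for a training set: (N, H, W, C) float images.
images = np.random.rand(100, 32, 32, 3).astype(np.float32)

mean_image   = images.mean(axis=0)          # (H, W, C): AlexNet-style mean image
channel_mean = images.mean(axis=(0, 1, 2))  # (C,): VGG-style per-channel mean

centered_alexnet = images - mean_image      # subtract the full mean image
centered_vgg     = images - channel_mean    # broadcast-subtract per channel
```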
76
77
78
– Regularization in the optimization
– Regularization in the architecture
– Handling limited training data
– No tutorial due to holiday!
– More about training neural networks: regularization, batch norm, dropout, etc.
– Followed by CNNs
79