Optimization and Backpropagation
Lecture 3

Recap: Neural Network
Linear score function: f = W x
(on CIFAR-10; on ImageNet)
Credit: Li/Karpathy/Johnson
- 2 layers: f = W2 max(0, W1 x)
- 3 layers: f = W3 max(0, W2 max(0, W1 x))
- 4 layers: f = W4 tanh(W3 max(0, W2 max(0, W1 x)))
- 5 layers: f = W5 σ(W4 tanh(W3 max(0, W2 max(0, W1 x))))
- … up to hundreds of layers
Credit: Li/Karpathy/Johnson
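In code (a sketch of my own, not from the slides; the layer sizes are illustrative assumptions), stacking layers is just nesting the same pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumption): 3072-dim input (e.g., a flattened
# CIFAR-10 image), 100 hidden units, 10 classes.
x = rng.standard_normal(3072)
W1 = 0.01 * rng.standard_normal((100, 3072))
W2 = 0.01 * rng.standard_normal((10, 100))

# 2 layers: f = W2 max(0, W1 x)
f2 = W2 @ np.maximum(0, W1 @ x)

# 3 layers: f = W3 max(0, W2' max(0, W1 x)) -- one more nesting of the pattern
W2p = 0.01 * rng.standard_normal((50, 100))
W3 = 0.01 * rng.standard_normal((10, 50))
f3 = W3 @ np.maximum(0, W2p @ np.maximum(0, W1 @ x))
```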
[Figure: fully connected networks with an input layer, hidden layer(s) 1-3, and an output layer; depth = number of layers, width = number of neurons per layer]
Activation functions:
- Sigmoid: σ(x) = 1 / (1 + e^(−x))
- tanh: tanh(x)
- ReLU: max(0, x)
- Leaky ReLU: max(0.1x, x)
- Parametric ReLU: max(βx, x)
- Maxout: max(w1ᵀx + b1, w2ᵀx + b2)
- ELU: f(x) = x if x > 0; α(e^x − 1) if x ≤ 0
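These definitions translate one-to-one into NumPy; a sketch (the parameters α, β and the Maxout weights are as in the list above, tanh is simply np.tanh):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # sigma(x) = 1 / (1 + e^(-x))

def relu(x):
    return np.maximum(0.0, x)           # max(0, x)

def leaky_relu(x):
    return np.maximum(0.1 * x, x)       # max(0.1x, x)

def parametric_relu(x, beta):
    return np.maximum(beta * x, x)      # max(beta*x, x); beta is learned

def elu(x, alpha=1.0):
    # x for x > 0, alpha*(e^x - 1) for x <= 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):
    # max(W1 x + b1, W2 x + b2), taken elementwise over the two affine maps
    return np.maximum(W1 @ x + b1, W2 @ x + b2)
```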
A loss function measures the goodness of the predictions (or, equivalently, the network's performance).
- L1 loss:  L(y, ŷ; θ) = (1/n) Σ_i ‖y_i − ŷ_i‖₁
- MSE loss: L(y, ŷ; θ) = (1/n) Σ_i ‖y_i − ŷ_i‖²₂
- Cross-entropy loss: E(y, ŷ; θ) = −Σ_{i=1}^n Σ_{k=1}^K y_{i,k} · log ŷ_{i,k}
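In code (a sketch; y are the targets, y_hat the network outputs, both as NumPy arrays):

```python
import numpy as np

def l1_loss(y, y_hat):
    # (1/n) * sum_i |y_i - y_hat_i|
    return np.mean(np.abs(y - y_hat))

def mse_loss(y, y_hat):
    # (1/n) * sum_i (y_i - y_hat_i)^2
    return np.mean((y - y_hat) ** 2)

def cross_entropy_loss(y, y_hat, eps=1e-12):
    # -sum_i sum_k y_ik * log(y_hat_ik); y one-hot, y_hat predicted
    # probabilities (eps only guards the log against exact zeros)
    return -np.sum(y * np.log(y_hat + eps))
```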
A neural network is a computational graph:
- it has compute nodes
- it has edges that connect nodes
- it is directional
- it is organized in "layers"
To learn the weights, we need to compute gradients of the loss with respect to them.
Backpropagation: computing ∇_θ L(θ)  (Rumelhart 1986)
Example: f(x, y, z) = (x + y) · z, initialized with x = 1, y = −3, z = 4.

Forward pass (a sum node feeding a mult node):
  d = x + y = −2
  f = d · z = −8

Local derivatives:
  d = x + y  →  ∂d/∂x = 1,  ∂d/∂y = 1
  f = d · z  →  ∂f/∂d = z,  ∂f/∂z = d

What is ∂f/∂x, ∂f/∂y, ∂f/∂z?

Backward pass:
  ∂f/∂f = 1
  ∂f/∂z = d = −2
  ∂f/∂d = z = 4
  Chain rule: ∂f/∂y = (∂f/∂d) · (∂d/∂y) = 4 · 1 = 4
  Chain rule: ∂f/∂x = (∂f/∂d) · (∂d/∂x) = 4 · 1 = 4
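The same example as straight-line Python (a sketch; every backward line is one application of the chain rule):

```python
# forward pass for f(x, y, z) = (x + y) * z
x, y, z = 1.0, -3.0, 4.0
d = x + y              # sum node:  d = -2
f = d * z              # mult node: f = -8

# backward pass, starting from df/df = 1
df_df = 1.0
df_dz = df_df * d      # local derivative of mult w.r.t. z is d: -2
df_dd = df_df * z      # local derivative of mult w.r.t. d is z:  4
df_dx = df_dd * 1.0    # dd/dx = 1  ->  4
df_dy = df_dd * 1.0    # dd/dy = 1  ->  4
```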
Notation for the weights w_{l,i,j}:
- l: which layer
- i: which neuron in the layer
- j: which weight of the neuron
and ŷ_i for the computed output (i-th output dimension, i = 1 … n_out).
[Figure: computational graph of a one-output network. Inputs x0, x1 are multiplied by weights w0, w1 (the unknowns!), summed, passed through a ReLU activation max(0, x), and the prediction ŷ0 is compared against the target y0 (e.g., class label / regression target) with an L2 loss.]

We want to compute gradients w.r.t. all weights W.
(btw., I'm not arguing that ReLU + L2 loss is the right choice here)

[Figure: the same graph with three outputs ŷ0, ŷ1, ŷ2 and targets y0, y1, y2; each output contributes its own squared-error term to the total loss/cost.]
General formulation (inputs x_0 … x_k, outputs ŷ_0 … ŷ_{n_out}):

  ŷ_i = A(b_i + Σ_j x_j · w_{i,j})      A: activation function, b_i: bias
  L = Σ_i (ŷ_i − y_i)²                  L: sum over the loss per output; for the L2 loss, simply sum up the squares

Goal: we want to compute the gradients of the loss function L w.r.t. all weights W AND all biases b:

  ∂L_i/∂w_{i,j} = (∂L_i/∂ŷ_i) · (∂ŷ_i/∂w_{i,j})   → use the chain rule to compute the partials
A more complex computational graph, e.g.

  f(w, x) = 1 / (1 + e^(−(b + w0·x0 + w1·x1)))

[Figure: graph with nodes ·w0, ·w1, + (with bias b), ·(−1), exp(·), +1, 1/(·); the last four nodes together implement the sigmoid function σ(x) = 1 / (1 + e^(−x))]

Forward pass with w0 = 2, x0 = −1, w1 = −3, x1 = −2, b = −3:
  w0·x0 = −2,  w1·x1 = 6,  sum with b: 1
  ·(−1) → −1,  exp(·) → 0.37,  +1 → 1.37,  1/(·) → 0.73

Useful local derivatives:
  f(x) = 1/x      →  ∂f/∂x = −1/x²
  f_a(x) = a + x  →  ∂f/∂x = 1
  f(x) = e^x      →  ∂f/∂x = e^x
  f_a(x) = a·x    →  ∂f/∂x = a

Backward pass (upstream gradient times local derivative):
  output:       ∂f/∂f = 1
  1/(·) node:   1 · (−1/1.37²) = −0.53
  +1 node:      −0.53 · 1 = −0.53
  exp(·) node:  −0.53 · e^(−1) = −0.2
  ·(−1) node:   −0.2 · (−1) = 0.2
  + node: distributes the gradient, so ∂f/∂b = 0.2 and 0.2 flows into both product branches
  · nodes:      ∂f/∂w0 = 0.2 · x0 = −0.2,   ∂f/∂x0 = 0.2 · w0 = 0.4
                ∂f/∂w1 = 0.2 · x1 = −0.4,   ∂f/∂x1 = 0.2 · w1 = −0.6
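Node by node in Python (a sketch following the graph above; the intermediate variable names are mine):

```python
import numpy as np

# f(w, x) = 1 / (1 + exp(-(b + w0*x0 + w1*x1))), with the slide's values
w0, x0, w1, x1, b = 2.0, -1.0, -3.0, -2.0, -3.0

# forward pass
s = w0 * x0 + w1 * x1 + b      # sum node:    1.0
m = -s                         # *(-1) node: -1.0
e = np.exp(m)                  # exp node:    0.37
p = 1.0 + e                    # +1 node:     1.37
f = 1.0 / p                    # 1/x node:    0.73

# backward pass: upstream gradient times local derivative
df_dp = -1.0 / p**2            # 1/x node:       -0.53
df_de = df_dp * 1.0            # +1 node:        -0.53
df_dm = df_de * np.exp(m)      # d(e^m)/dm = e^m: -0.2
df_ds = df_dm * (-1.0)         # *(-1) node:       0.2
df_db = df_ds                  # sum distributes:  0.2
df_dw0 = df_ds * x0            # -0.2
df_dx0 = df_ds * w0            #  0.4
df_dw1 = df_ds * x1            # -0.4
df_dx1 = df_ds * w1            # -0.6
```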
Optimization goal:
  θ* = arg min_θ L(θ)
[Figure: loss landscape with the initialization point and the optimum]
The gradient ∇_x f(x) points in the direction of greatest increase of f.

Gradient descent step (α: learning rate):
  x' = x − α ∇_x f(x)
[Figure: fully connected network with n layers and m neurons per layer]

For a given training pair {x, y}, we want to update all weights, i.e., we need to compute the derivatives w.r.t. all of them:

  ∇_W L_{x,y}(W) = [∂L/∂w_{0,0,0}, …, ∂L/∂w_{l,m,n}]

Gradient step:
  W' = W − α ∇_W L_{x,y}(W)
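A minimal sketch of the update rule (helper names are mine; `grad_fn` stands in for whatever computes ∇_W L for the current training pair):

```python
def gradient_descent(W, grad_fn, lr=0.01, steps=100):
    # vanilla gradient descent: W' = W - lr * grad_L(W), repeated
    for _ in range(steps):
        W = W - lr * grad_fn(W)
    return W

# toy usage on f(w) = w^2 (gradient 2w): iterates shrink toward the minimum at 0
w_opt = gradient_descent(5.0, lambda w: 2.0 * w, lr=0.1, steps=100)
```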
[Figure: GoogLeNet; the gradients are backpropagated through the entire network]
[Szegedy et al., CVPR'15] Going Deeper with Convolutions
Neurons: Forward and Backward Pass
A single compute node z = f(x, y):
- Forward pass: given the incoming activations x and y, compute z = f(x, y).
- Backward pass: given the upstream gradient ∂L/∂z, multiply it by the local gradients ∂z/∂x and ∂z/∂y:
    ∂L/∂x = (∂L/∂z) · (∂z/∂x),    ∂L/∂y = (∂L/∂z) · (∂z/∂y)
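The forward/backward contract of such a node, sketched as a class (multiplication chosen as the example f; caching the inputs during forward is the key detail):

```python
class MultiplyNode:
    """One compute node z = f(x, y) = x * y."""

    def forward(self, x, y):
        self.x, self.y = x, y      # cache activations for the backward pass
        return x * y

    def backward(self, dL_dz):
        # chain rule: upstream gradient times the local gradients
        dL_dx = dL_dz * self.y     # dz/dx = y
        dL_dy = dL_dz * self.x     # dz/dy = x
        return dL_dx, dL_dy
```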
Two-layer example (inputs x0, x1, x2; hidden units h0 … h3; outputs ŷ0, ŷ1; targets y0, y1):

  h_j = A(b_{0,j} + Σ_k x_k · w_{0,j,k})
  ŷ_i = A(b_{1,i} + Σ_j h_j · w_{1,i,j})

Loss function: L_i = (ŷ_i − y_i)²
Activation, just simple: A(x) = max(0, x)
Backpropagation: just go through it layer by layer.

  ∂L_i/∂w_{1,i,j} = (∂L_i/∂ŷ_i) · (∂ŷ_i/∂w_{1,i,j})
  ∂L_i/∂w_{0,j,k} = (∂L_i/∂ŷ_i) · (∂ŷ_i/∂h_j) · (∂h_j/∂w_{0,j,k})

with
  ∂L_i/∂ŷ_i = 2(ŷ_i − y_i)
  ∂ŷ_i/∂w_{1,i,j} = h_j if the pre-activation is > 0, else 0
  …
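A NumPy sketch of exactly this network and its layer-by-layer backward pass (sizes and the random initialization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# sizes matching the figure: 3 inputs, 4 hidden units, 2 outputs
x = rng.standard_normal(3)
y = rng.standard_normal(2)
W0, b0 = rng.standard_normal((4, 3)), np.zeros(4)
W1, b1 = rng.standard_normal((2, 4)), np.zeros(2)

relu = lambda t: np.maximum(0.0, t)

# forward pass, layer by layer
s0 = b0 + W0 @ x       # pre-activations, first layer
h = relu(s0)           # h_j = A(b_0j + sum_k x_k w_0jk)
s1 = b1 + W1 @ h       # pre-activations, second layer
y_hat = relu(s1)       # y_hat_i = A(b_1i + sum_j h_j w_1ij)
L = np.sum((y_hat - y) ** 2)

# backward pass, layer by layer
dL_dyhat = 2.0 * (y_hat - y)      # dL_i/dy_hat_i = 2(y_hat_i - y_i)
dL_ds1 = dL_dyhat * (s1 > 0)      # ReLU: pass gradient where pre-activation > 0
dL_dW1 = np.outer(dL_ds1, h)      # dy_hat_i/dw_1ij = h_j
dL_db1 = dL_ds1
dL_dh = W1.T @ dL_ds1             # gradient flowing into the hidden layer
dL_ds0 = dL_dh * (s0 > 0)
dL_dW0 = np.outer(dL_ds0, x)
dL_db0 = dL_ds0
```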
How many unknown weights does this network have?

  per layer: #neurons · #input channels + #biases
  (here: 4 · 3 + 4 = 16 for the hidden layer, 2 · 4 + 2 = 10 for the output layer)

Note that some activations also have weights (e.g., Parametric ReLU).
Same two-layer network, now with sigmoid outputs and a binary cross-entropy loss:

  scores: s_i = Σ_j h_j · w_{i,j}
  ŷ_i = 1 / (1 + e^(−s_i))
  L = −Σ_{i=1}^{n_out} ( y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) )

Gradients of the weights of the last layer:

  ∂L_i/∂w_{i,j} = (∂L_i/∂ŷ_i) · (∂ŷ_i/∂s_i) · (∂s_i/∂w_{i,j})

  ∂L_i/∂ŷ_i = −y_i/ŷ_i + (1 − y_i)/(1 − ŷ_i) = (ŷ_i − y_i) / (ŷ_i (1 − ŷ_i))
  ∂ŷ_i/∂s_i = ŷ_i (1 − ŷ_i)
  ∂s_i/∂w_{i,j} = h_j

  ⟹ ∂L_i/∂s_i = ŷ_i − y_i    and    ∂L_i/∂w_{i,j} = (ŷ_i − y_i) · h_j
Gradients of the weights of the first layer (chain rule again, one layer further back; here the hidden units are also sigmoids):

  ∂L/∂h_j = Σ_{i=1}^{n_out} (∂L_i/∂ŷ_i) · (∂ŷ_i/∂s_i) · (∂s_i/∂h_j)
          = Σ_{i=1}^{n_out} (ŷ_i − y_i) · w_{i,j}

  ∂L/∂s_j^(1) = Σ_{i=1}^{n_out} (∂L/∂s_i) · (∂s_i/∂h_j) · (∂h_j/∂s_j^(1))
              = Σ_{i=1}^{n_out} (ŷ_i − y_i) · w_{i,j} · h_j (1 − h_j)

  ∂L/∂w_{j,k}^(1) = (∂L/∂s_j^(1)) · (∂s_j^(1)/∂w_{j,k}^(1))
                  = Σ_{i=1}^{n_out} (ŷ_i − y_i) · w_{i,j} · h_j (1 − h_j) · x_k
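A quick numerical sanity check of both formulas (a sketch under the same assumptions: sigmoid hidden units and outputs, no biases; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

x = rng.standard_normal(3)       # 3 inputs (assumption)
y = np.array([1.0, 0.0])         # binary targets, n_out = 2
W0 = rng.standard_normal((4, 3)) # first layer, 4 hidden units
W1 = rng.standard_normal((2, 4)) # last layer

def loss(W0, W1):
    h = sigmoid(W0 @ x)
    y_hat = sigmoid(W1 @ h)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# analytic gradients from the derivation above
h = sigmoid(W0 @ x)
y_hat = sigmoid(W1 @ h)
dL_ds = y_hat - y                          # dL_i/ds_i = y_hat_i - y_i
dL_dW1 = np.outer(dL_ds, h)                # (y_hat_i - y_i) * h_j
dL_dh = W1.T @ dL_ds                       # sum_i (y_hat_i - y_i) * w_ij
dL_dW0 = np.outer(dL_dh * h * (1 - h), x)  # ... * h_j(1 - h_j) * x_k

# finite-difference check on one entry of W1; both numbers should agree
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
print(dL_dW1[0, 0], (loss(W0, W1p) - loss(W0, W1)) / eps)
```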
Example: a tiny network with ReLU activation,

  f(x; w1, w2) = w2 · max(0, w1 · x),

trained with an L2 loss over n samples:

  Σ_{i=1}^n ‖ w2 · max(0, w1 · x_i) − y_i ‖²₂
[Figure: computational graph x → ·w1 → z → max(0,·) → c → ·w2 → ŷ → loss L against target y]

Initialize x = 1, y = 0, w1 = 1/3, w2 = 2.

Loss: L(y, ŷ; θ) = (1/n) Σ_i ‖ŷ_i − y_i‖²₂; in our case n = 1, i = 1:
  L = (ŷ − y)²  →  ∂L/∂ŷ = 2(ŷ − y)

Forward pass:
  z = w1 · x = 1/3
  c = max(0, z) = 1/3
  ŷ = w2 · c = 2/3
  L = (ŷ − y)² = 4/9

Backpropagation:
  ∂L/∂ŷ = 2(ŷ − y) = 4/3

  ∂L/∂w2 = (∂L/∂ŷ) · (∂ŷ/∂w2),  with ŷ = w2 · c  →  ∂ŷ/∂w2 = c
         = 4/3 · 1/3 = 4/9

  ∂L/∂w1 = (∂L/∂ŷ) · (∂ŷ/∂c) · (∂c/∂z) · (∂z/∂w1)
    ŷ = w2 · c     →  ∂ŷ/∂c = w2 = 2            →  ∂L/∂c = 4/3 · 2 = 8/3
    c = max(0, z)  →  ∂c/∂z = 1 if z > 0, else 0 →  ∂L/∂z = 8/3 · 1 = 8/3
    z = x · w1     →  ∂z/∂w1 = x = 1             →  ∂L/∂w1 = 8/3 · 1 = 8/3

Gradient step with respect to the weights w1 and w2:

  θ' = θ − α ∇_θ L
  [w1, w2]ᵀ ← [1/3, 2]ᵀ − α · [8/3, 4/9]ᵀ

But: how to choose a good learning rate α? More in the next lectures…
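As a cross-check (not part of the original slides), the whole example fits in a few lines of plain Python; the variable names mirror the graph above:

```python
# forward pass
x, y = 1.0, 0.0
w1, w2 = 1.0 / 3.0, 2.0
z = w1 * x               # 1/3
c = max(0.0, z)          # ReLU: 1/3
y_hat = w2 * c           # 2/3
L = (y_hat - y) ** 2     # 4/9

# backward pass
dL_dyhat = 2.0 * (y_hat - y)               # 4/3
dL_dw2 = dL_dyhat * c                      # 4/9
dL_dc = dL_dyhat * w2                      # 8/3
dL_dz = dL_dc * (1.0 if z > 0 else 0.0)    # 8/3
dL_dw1 = dL_dz * x                         # 8/3

# one gradient step with some learning rate alpha
alpha = 0.1
w1, w2 = w1 - alpha * dL_dw1, w2 - alpha * dL_dw2
```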
Find your hyperparameters with a data split:

  train 60% | validation 20% | test 20%

Other splits are also possible (e.g., 80%/10%/10%).
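A minimal sketch of such a split (my own helper, not from the slides; it assumes the data is i.i.d., so a random shuffle is appropriate):

```python
import numpy as np

def split_indices(n, train=0.6, val=0.2, seed=0):
    # shuffle once, then cut into train/validation/test (60/20/20 by default)
    idx = np.random.default_rng(seed).permutation(n)
    n_tr, n_val = int(train * n), int(val * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)
```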
[Figure: three fits of the same data; Underfitted | Appropriate | Overfitted]
Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017
How can we prevent our model from overfitting? Regularization.

[Figure: training and generalization error over model capacity; on the left the training error is too high (underfitting), on the right the generalization gap is too big (overfitting). Credits: Deep Learning, Goodfellow et al.]
Add a regularization term to the loss function (more details later):

  L(y, ŷ; θ) = Σ_{i=1}^n ‖ŷ_i − y_i‖² + λ R(θ)

- L2 regularization
- L1 regularization
- Max norm regularization
- Dropout
- Early stopping
- ...
Example: x = [1, 2, 1], w1 = [0, 0.75, 0], w2 = [0.25, 0.5, 0.25].
Both weight vectors produce the same prediction, xᵀw1 = xᵀw2 = 1.5, so the data term of

  L(y, ŷ; θ) = Σ_{i=1}^n (x_iᵀ θ − y_i)² + λ R(θ)

is identical; the regularizer decides which one the minimization prefers.

L2 regularization: R(θ) = Σ_j θ_j²
  w1: 0² + 0.75² + 0² = 0.5625
  w2: 0.25² + 0.5² + 0.25² = 0.375   → minimization prefers w2
w1 ignores two features; w2 takes information from all features.
L1 regularization: R(θ) = Σ_j |θ_j|
  w1: 0 + 0.75 + 0 = 0.75   → minimization prefers w1
  w2: 0.25 + 0.5 + 0.25 = 1
For w1 = [0, 0.75, 0] vs. w2 = [0.25, 0.5, 0.25]:
- L1 regularization enforces sparsity: w1 ignores two features.
- L2 regularization enforces that the weights have similar values: w2 takes information from all features.
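Checking the numbers (a throwaway script, not from the slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 1.0])
w1 = np.array([0.0, 0.75, 0.0])
w2 = np.array([0.25, 0.5, 0.25])

print(x @ w1, x @ w2)                           # same prediction: 1.5, 1.5
print(np.sum(w1**2), np.sum(w2**2))             # L2: 0.5625 vs. 0.375 -> prefers w2
print(np.sum(np.abs(w1)), np.sum(np.abs(w2)))   # L1: 0.75 vs. 1.0    -> prefers w1
```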
Intuition, with example features (furry, has two eyes, has a tail, has paws, has two ears):
- L1 regularization will focus all the attention on a few key features.
- L2 regularization will take all the information into account to make decisions.
Combining nodes: network output + L2 loss + regularization

  L(w1, w2) = Σ_{i=1}^n ‖ w2 · max(0, w1 · x_i) − y_i ‖²₂ + λ R(w1, w2)

With L2 regularization, R(w1, w2) = ‖[w1, w2]‖²₂, this becomes

  L(w1, w2) = Σ_{i=1}^n ‖ w2 · max(0, w1 · x_i) − y_i ‖²₂ + λ (w1² + w2²)
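As a sketch, the combined objective for this tiny network (the function name is mine):

```python
def regularized_loss(w1, w2, xs, ys, lam):
    # sum_i (w2 * max(0, w1 * x_i) - y_i)^2 + lam * (w1^2 + w2^2)
    data_term = sum((w2 * max(0.0, w1 * x) - y) ** 2 for x, y in zip(xs, ys))
    return data_term + lam * (w1 ** 2 + w2 ** 2)

# e.g., on the single sample from the earlier example:
print(regularized_loss(1/3, 2.0, [1.0], [0.0], lam=0.1))  # 4/9 + 0.1*(1/9 + 4)
```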
What is the goal of regularization? What happens to the training error?

[Figure: decision boundaries for increasing regularization strength, λ = 0, 0.00001, 0.001, 1, 10. Credits: University of Washington]
- Check exercises
- Check office hours :)
Next lecture:
- Optimization of neural networks
- In particular, introduction to SGD (our main method!)
Further reading:
- Chapter 6.5 (6.5.1-6.5.3) in http://www.deeplearningbook.org/contents/mlp.html
- Chapter 5.3 in Bishop, Pattern Recognition and Machine Learning
- http://cs231n.github.io/optimization-2/

On regularization:
- Chapter 7.1 (esp. 7.1.1 & 7.1.2) in http://www.deeplearningbook.org/contents/regularization.html
- Chapter 5.5 in Bishop, Pattern Recognition and Machine Learning