Multi-Layer Networks & Back-Propagation
M. Soleymani
Deep Learning, Sharif University of Technology, Spring 2019
Most slides have been adapted from Bhiksha Raj, 11-785, CMU 2019, and some from Fei-Fei Li et al., cs231n, Stanford 2017
Neural networks can act as general function approximators: voice signal → transcription, image → text caption, game state → next move.
More generally, a network maps some input to some output:
– How do we represent the input?
– How do we represent the output?
Recall the perceptron:
– General setting: inputs are real-valued
– A bias $b$ representing a threshold to trigger the perceptron
– Activation functions are not necessarily threshold functions
The bias can also be viewed as the weight of an additional input component that is always set to 1.
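A single neuron with the bias folded in as the weight of a constant input can be sketched as follows (a minimal illustration; the function name, the example weights, and the sigmoid choice are mine, not from the slides):

```python
import math

def perceptron(x, w, f=lambda z: 1.0 / (1.0 + math.exp(-z))):
    """One neuron: append the constant input 1 so the bias is just w[-1]."""
    x_ext = list(x) + [1.0]                          # extra component always set to 1
    z = sum(wi * xi for wi, xi in zip(w, x_ext))     # pre-activation
    return f(z)                                      # activation (sigmoid here)

# weights for a 2-input neuron; the last entry plays the role of the bias
w = [1.0, -2.0, 0.5]
out = perceptron([3.0, 1.0], w)                      # z = 3 - 2 + 0.5 = 1.5
```

Any other differentiable activation could be passed in place of the sigmoid.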
Given training input-output pairs $(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), (\mathbf{x}^{(2)}, \mathbf{y}^{(2)}), \ldots, (\mathbf{x}^{(T)}, \mathbf{y}^{(T)})$:
– We consider a neural network as a parametric function $f(\mathbf{x}; W)$
Input units, hidden units, and output units:
– We will refer to the inputs as the input units
– No neurons here: the "input units" are just the inputs
– We refer to the outputs as the output units
– Intermediate units are "hidden" units
We consider feed-forward networks:
– No loops: neuron outputs do not feed back to their inputs directly or indirectly
– Loopy networks are a future topic
The architecture of the network (how many layers/neurons, which neuron connects to which and how, etc.) must be capable of representing the needed function.
Learning the network means learning its parameters:
– The weights associated with the blue arrows in the picture
– Chosen such that the network computes the desired function
Given training pairs $(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \ldots, (\mathbf{x}^{(T)}, \mathbf{y}^{(T)})$, we view the network as a parametric function $f(\mathbf{x}; W)$ and define a total error measuring the divergence between the output $f(\mathbf{x}; W)$ and the desired output $\mathbf{y}$:
$$E(W) = \frac{1}{T}\sum_{t=1}^{T} \mathrm{loss}\big(f(\mathbf{x}^{(t)}; W),\, \mathbf{y}^{(t)}\big)$$
– We define a divergence (loss) between the actual output of the network and the desired output for the training instances
– And a total error, which is the average divergence over all training instances
With threshold activations, learning the weights is a hard combinatorial-optimization problem:
– Because we cannot compute the influence of small changes to the parameters
Differentiable activation functions (e.g. the logistic sigmoid) allow gradient-based methods to estimate network parameters:
– This makes the output of the network differentiable w.r.t. every parameter in the network
– The logistic activation neuron actually computes the a posteriori probability of the output given the input
The threshold activation has zero derivative everywhere, except at 0 where it is non-differentiable:
– You can vary the weights a lot without changing the error
– There is no indication of which direction to change the weights to reduce error
With differentiable activations, the network output is a smooth function over the input space:
– Small changes in weight can result in non-negligible changes in output
– This enables us to estimate the parameters using gradient descent techniques
Notation: given a training set of input-output pairs $(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), (\mathbf{x}^{(2)}, \mathbf{y}^{(2)}), \ldots, (\mathbf{x}^{(T)}, \mathbf{y}^{(T)})$:
– $\mathbf{x}^{(n)} = \big(x_1^{(n)}, x_2^{(n)}, \ldots, x_d^{(n)}\big)$ is the $n$th input vector
– $\mathbf{y}^{(n)} = \big(y_1^{(n)}, y_2^{(n)}, \ldots, y_K^{(n)}\big)$ is the $n$th desired output
– $\mathbf{o}^{(n)} = \big(o_1^{(n)}, o_2^{(n)}, \ldots, o_K^{(n)}\big)$ is the $n$th vector of actual outputs of the network
The input layer takes a vector of numbers:
– (or may even be just a scalar, if the input layer is of size 1)
– E.g. vector of pixel values
– E.g. vector of speech features
– E.g. real-valued vector representing text
– Other real-valued vectors
The output layer:
– Scalar output: single output neuron
– Vector output: as many output neurons as the dimension of the desired output
Squared error divergence:
$$\mathrm{Err}(\mathbf{y}, \mathbf{o}) = \frac{1}{2}\lVert \mathbf{y} - \mathbf{o} \rVert^2 = \frac{1}{2}\sum_{i}\big(y_i - o_i\big)^2$$
– Squared Euclidean distance between the true and actual output
– Note: this is differentiable
$$\frac{\partial\, \mathrm{Err}(\mathbf{y}, \mathbf{o})}{\partial o_i} = -(y_i - o_i), \qquad \nabla_{\mathbf{o}}\, \mathrm{Err}(\mathbf{y}, \mathbf{o}) = \big[\,o_1 - y_1,\; o_2 - y_2,\; \ldots,\; o_K - y_K\,\big]$$
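A quick numpy sketch of this divergence and its gradient, with a finite-difference check (the function names and example values are mine):

```python
import numpy as np

def sq_err(y, o):
    """Squared-error divergence: 0.5 * ||y - o||^2."""
    return 0.5 * np.sum((y - o) ** 2)

def sq_err_grad(y, o):
    """Gradient w.r.t. the actual output o: [o_i - y_i]."""
    return o - y

y = np.array([1.0, 0.0, 0.0])    # desired output
o = np.array([0.8, 0.1, 0.3])    # actual output
g = sq_err_grad(y, o)            # [-0.2, 0.1, 0.3]

# finite-difference check of the first component
eps = 1e-6
o_pert = o.copy(); o_pert[0] += eps
num = (sq_err(y, o_pert) - sq_err(y, o)) / eps
```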
For binary classification (e.g. "is this a cat?"), a single number is a sufficient representation of the desired output:
– 1 = YES, it's a cat
– 0 = NO, it's not a cat
Training data for such a problem:
– Input: vector of pixel values
– Output: a sigmoid unit
– Learn all weights such that the network does the desired job
– Overlapping classes
– Rosenblatt's perceptron wouldn't work in the first place
In the $(x_1, x_2)$ plane:
– Blue dots (on the floor, at $y = 0$) on the "red" side
– Red dots (suspended at $y = 1$) on the "blue" side
– No line will cleanly separate the two colors
– All (red) dots at $y = 1$ represent instances of class $y = 1$
– All (blue) dots at $y = 0$ are from class $y = 0$
– The data are not linearly separable
Estimate the average value of $y$ in a small window around each point:
– This is an approximation of the probability of $y = 1$ at that point
The logistic regression model (shown when the input is a 2-D variable $(x_1, x_2)$):
– It actually computes the probability that the input belongs to class 1
– Decision: is the output greater than 0.5?
The sigmoid output $\sigma(z)$ can be interpreted as a probability of the desired output:
– Viewed as the probability $P(y = 1 \mid \mathbf{x})$ of class value 1
– Different inputs yield different probabilities
For binary classification, the KL divergence between the output probability distribution $[o, 1-o]$ and the ideal output probability $[y, 1-y]$ is popular:
$$\mathrm{KL}(y, o) = -y \log o - (1 - y)\log(1 - o)$$
$$\frac{\partial\, \mathrm{KL}(y, o)}{\partial o} = \begin{cases} -\dfrac{1}{o} & \text{if } y = 1 \\[4pt] \dfrac{1}{1 - o} & \text{if } y = 0 \end{cases}$$
where $o = \sigma(z)$ is the sigmoid output.
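The divergence and its two derivative cases can be checked numerically with a short script (function names and the value of $o$ are mine):

```python
import math

def kl_div(y, o):
    """Binary divergence: -y*log(o) - (1-y)*log(1-o)."""
    return -y * math.log(o) - (1 - y) * math.log(1 - o)

def kl_grad(y, o):
    """dKL/do: -1/o when y = 1, and 1/(1-o) when y = 0."""
    return -1.0 / o if y == 1 else 1.0 / (1.0 - o)

o, eps = 0.8, 1e-7
num1 = (kl_div(1, o + eps) - kl_div(1, o)) / eps   # y = 1 case
num0 = (kl_div(0, o + eps) - kl_div(0, o)) / eps   # y = 0 case
```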
} Regression problem
– SSE:
$$E = \frac{1}{T}\sum_{t=1}^{T} E_t, \qquad E_t = \frac{1}{2}\big(o^{(t)} - y^{(t)}\big)^2 \quad\text{(one-dimensional output)}$$
$$E_t = \frac{1}{2}\big\lVert \mathbf{o}^{(t)} - \mathbf{y}^{(t)} \big\rVert^2 = \frac{1}{2}\sum_{i=1}^{K}\big(o_i^{(t)} - y_i^{(t)}\big)^2 \quad\text{(multi-dimensional output)}$$
} Classification problem
– Cross-entropy (output layer uses the sigmoid activation function):
$$\mathrm{loss}_t = -y^{(t)} \log o^{(t)} - \big(1 - y^{(t)}\big)\log\big(1 - o^{(t)}\big)$$
For multi-class problems (e.g. the input is an image of a cat, a dog, a camel, a hat, or a flower), the desired output is a one-hot vector: zeros and a single 1 at the position of the class:
– Cat: [1 0 0 0 0]^T
– Dog: [0 1 0 0 0]^T
– Camel: [0 0 1 0 0]^T
– Hat: [0 0 0 1 0]^T
– Flower: [0 0 0 0 1]^T
For an N-class problem, the network has N outputs:
– The desired output is an N-dimensional binary vector (all zeros, with a single 1 at the right place)
– The network produces N probability values that sum to 1.
Parameters are the weights and biases.
The network computes all outputs jointly:
$$(o_1, o_2, \ldots, o_K) = f(x_1, x_2, \ldots, x_d;\; W)$$
– The function $f(\cdot)$ operates on the set of inputs to produce the set of outputs
– Modifying a single parameter in $W$ will affect all outputs
Softmax output layer for multi-class classifier nets:
$$z_i = \sum_{j} w_{ji}^{[L]}\, a_j^{[L-1]}, \qquad o_i = \frac{\exp(z_i)}{\sum_{j}\exp(z_j)}$$
For multi-class classification with a one-hot desired output $\mathbf{y}$ (true class $c$), the KL divergence is:
$$\mathrm{KL}(\mathbf{y}, \mathbf{o}) = -\sum_i y_i \log o_i = -\log o_c$$
$$\frac{\partial\, \mathrm{KL}(\mathbf{y}, \mathbf{o})}{\partial o_i} = \begin{cases} -\dfrac{1}{o_c} & \text{for the } c\text{th component} \\[4pt] 0 & \text{for the remaining components} \end{cases}$$
$$\nabla_{\mathbf{o}}\, \mathrm{KL}(\mathbf{y}, \mathbf{o}) = \Big[\,0 \;\; 0 \;\cdots\; -\tfrac{1}{o_c} \;\cdots\; 0 \;\; 0\,\Big]$$
– The slope is negative w.r.t. $o_c$: increasing $o_c$ will reduce the divergence
– Note: even when $\mathbf{y} = \mathbf{o}$, the derivative of $\mathrm{KL}(\mathbf{y}, \mathbf{o}) = -\log o_c$ is not 0
– The slope remains negative w.r.t. $o_c$, indicating that increasing $o_c$ will always reduce the divergence
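The softmax output and the one-hot KL gradient above can be sketched in a few lines of numpy (the logit values are mine; the max-subtraction is a standard numerical-stability trick, not from the slides):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
o = softmax(z)
c = 0                            # true class
div = -np.log(o[c])              # KL divergence for a one-hot target

grad_o = np.zeros_like(o)
grad_o[c] = -1.0 / o[c]          # only the c-th component is non-zero
```

Note that the gradient is negative only at the true class: pushing $o_c$ up always lowers the divergence.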
Summary of output layers and divergences:
} Regression problem
– SSE: $E = \frac{1}{T}\sum_{t=1}^{T} E_t$, with $E_t = \frac{1}{2}\big(o^{(t)} - y^{(t)}\big)^2$ (one-dimensional output) or $E_t = \frac{1}{2}\lVert \mathbf{o}^{(t)} - \mathbf{y}^{(t)} \rVert^2 = \frac{1}{2}\sum_{i=1}^{K}\big(o_i^{(t)} - y_i^{(t)}\big)^2$ (multi-dimensional output)
} Classification problem
– Binary cross-entropy: $\mathrm{loss}_t = -y^{(t)}\log o^{(t)} - (1 - y^{(t)})\log(1 - o^{(t)})$, where the output layer uses the sigmoid $o = \frac{1}{1 + e^{-z}}$
– Multi-class cross-entropy: $\mathrm{loss}_t = -\log o_{c(t)}$, where the output is found by a softmax layer, $o_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$
Problem statement: given training pairs $(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \ldots, (\mathbf{x}^{(T)}, \mathbf{y}^{(T)})$, minimize
$$E(W) = \frac{1}{T}\sum_{t=1}^{T} \mathrm{loss}\big(\mathbf{o}^{(t)}, \mathbf{y}^{(t)}\big) = \frac{1}{T}\sum_{t=1}^{T} \mathrm{loss}\big(f(\mathbf{x}^{(t)}; W),\, \mathbf{y}^{(t)}\big)$$
with respect to all weights and biases $\{w_{ij}^{[k]}, b_j^{[k]}\}$.
With hidden units, how can we train such nets?
– We need an efficient way of adapting all the weights, not just the last layer.
– Learning the weights going into hidden units is equivalent to learning features.
– This is difficult because nobody is telling us directly what the hidden units should do.
Source: http://3b1b.co
The gradient $\nabla E$ collects the partial derivative of the error with respect to every weight and bias: each component indicates how sensitive the cost is to that weight or bias.
Source: http://3b1b.co
Back-propagation:
– Training algorithm that is used to adjust weights in multi-layer networks (based on the training data)
– The back-propagation algorithm is based on gradient descent
– Uses the chain rule and dynamic programming to efficiently compute gradients
Gradient descent, using the extended notation in which the bias is also represented as a weight:
Total training error:
$$E = \frac{1}{T}\sum_{t=1}^{T} \mathrm{loss}\big(\mathbf{o}^{(t)}, \mathbf{y}^{(t)}\big)$$
– For every layer $k$, for all $i, j$, update:
$$w_{ij}^{[k]} \leftarrow w_{ij}^{[k]} - \eta\, \frac{\partial E}{\partial w_{ij}^{[k]}}$$
The total derivative is an average over training instances:
$$E = \frac{1}{T}\sum_{t=1}^{T} \mathrm{loss}\big(\mathbf{o}^{(t)}, \mathbf{y}^{(t)}\big), \qquad \frac{\partial E}{\partial w_{ij}^{[k]}} = \frac{1}{T}\sum_{t=1}^{T} \frac{\partial\, \mathrm{loss}\big(\mathbf{o}^{(t)}, \mathbf{y}^{(t)}\big)}{\partial w_{ij}^{[k]}}$$
Training by gradient descent:
– For all $i, j, k$, initialize $\frac{\partial E}{\partial w_{ij}^{[k]}} = 0$
– For all $t = 1 : T$:
  – Compute $\frac{\partial\, \mathrm{loss}(\mathbf{o}^{(t)}, \mathbf{y}^{(t)})}{\partial w_{ij}^{[k]}}$
  – Accumulate $\frac{\partial E}{\partial w_{ij}^{[k]}} \mathrel{+}= \frac{1}{T}\,\frac{\partial\, \mathrm{loss}(\mathbf{o}^{(t)}, \mathbf{y}^{(t)})}{\partial w_{ij}^{[k]}}$
– For every layer $k$, for all $i, j$:
$$w_{ij}^{[k]} \leftarrow w_{ij}^{[k]} - \eta\, \frac{\partial E}{\partial w_{ij}^{[k]}}$$
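The accumulate-then-update loop can be sketched on the smallest possible "network", a single sigmoid unit trained with the cross-entropy loss from earlier, for which $\partial\,\mathrm{loss}/\partial z = o - y$ (a standard simplification); the OR-style toy data, learning rate, and names are mine:

```python
import math

def forward(w, x):
    """Single sigmoid unit; the bias is the weight of a constant input 1."""
    z = sum(wi * xi for wi, xi in zip(w, x + [1.0]))
    return 1.0 / (1.0 + math.exp(-z))

# toy training set (invented for illustration): OR-like labels
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w, eta, T = [0.0, 0.0, 0.0], 1.0, len(data)

for epoch in range(5000):
    grad = [0.0] * len(w)                      # initialize dE/dw to 0
    for x, y in data:                          # for all t = 1..T
        o = forward(w, x)
        for j, xj in enumerate(x + [1.0]):
            grad[j] += (o - y) * xj / T        # accumulate (1/T) d loss / d w
    w = [wj - eta * gj for wj, gj in zip(w, grad)]   # gradient step

preds = [forward(w, x) for x, _ in data]
```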
So we first need the derivative of the divergences of the individual training inputs:
$$\frac{\partial E}{\partial w_{ij}^{[k]}} = \frac{1}{T}\sum_{t=1}^{T} \frac{\partial\, \mathrm{loss}\big(\mathbf{o}^{(t)}, \mathbf{y}^{(t)}\big)}{\partial w_{ij}^{[k]}}$$
A brief calculus refresher. For any differentiable function $y = f(x)$ with derivative $\frac{dy}{dx}$, the following must hold for sufficiently small $\Delta x$:
$$\Delta y \approx \frac{dy}{dx}\,\Delta x$$
For any differentiable function $y = f(x_1, x_2, \ldots, x_M)$ with partial derivatives $\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_M}$, the following must hold for sufficiently small $\Delta x_1, \Delta x_2, \ldots, \Delta x_M$:
$$\Delta y \approx \frac{\partial y}{\partial x_1}\Delta x_1 + \frac{\partial y}{\partial x_2}\Delta x_2 + \cdots + \frac{\partial y}{\partial x_M}\Delta x_M$$
Equivalently, $\frac{dz}{dy} \approx \frac{\Delta z}{\Delta y}$ for small perturbations.
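A two-line numerical check of this approximation (the example function is mine):

```python
def f(x):
    return x ** 3                 # dy/dx = 3 x^2

x, dx = 2.0, 1e-6
approx = (f(x + dx) - f(x)) / dx  # should be close to 3 * 2^2 = 12
```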
Chain rule: for any nested function $z = f(g(y))$,
$$\frac{dz}{dy} = \frac{df}{dg}\cdot\frac{dg}{dy}$$
Check: we can confirm that $\Delta z \approx \frac{dz}{dy}\,\Delta y$ for sufficiently small $\Delta y$.
Distributed chain rule: when $z = f\big(g_1(y), g_2(y), \ldots, g_M(y)\big)$,
$$\frac{dz}{dy} = \frac{\partial z}{\partial g_1}\frac{dg_1}{dy} + \frac{\partial z}{\partial g_2}\frac{dg_2}{dy} + \cdots + \frac{\partial z}{\partial g_M}\frac{dg_M}{dy}$$
Check: $\Delta z \approx \frac{dz}{dy}\,\Delta y$.
A small perturbation in $y$ causes small perturbations in each of $g_1, \ldots, g_M$, each of which individually and additively perturbs $z$.
A node with two inputs: $z = f(x, y)$.
Our goal is to compute $\dfrac{\partial\, \mathrm{loss}(\mathbf{o}, \mathbf{y})}{\partial w_{ij}^{[k]}}$ for every weight in the network.
We illustrate on a simple network in which every layer applies an activation $f(\cdot)$:
– The actual network would have many more neurons and inputs
The same network with its weights $w_{ij}^{[k]}$ labeled for each layer and input:
– The actual network would have many more neurons and inputs
Forward computation at each layer $k$:
$$z_j^{[k]} = \sum_{i=0}^{N^{[k-1]}} w_{ij}^{[k]}\, a_i^{[k-1]}, \qquad a_j^{[k]} = f\big(z_j^{[k]}\big)$$
where $a_0^{[k-1]} = 1$ carries the bias weight. The network outputs are $o_j = a_j^{[L]}$, and for the squared-error loss:
$$\mathrm{loss} = \frac{1}{2}\sum_{j}\big(o_j - y_j\big)^2$$
Derivatives at the output layer, for output unit $j$:
$$\frac{\partial\, \mathrm{loss}}{\partial a_j^{[L]}} = \big(a_j^{[L]} - y_j\big)$$
$$\frac{\partial\, \mathrm{loss}}{\partial w_{ij}^{[L]}} = \frac{\partial\, \mathrm{loss}}{\partial a_j^{[L]}} \cdot \frac{\partial a_j^{[L]}}{\partial w_{ij}^{[L]}}, \qquad \frac{\partial a_j^{[L]}}{\partial w_{ij}^{[L]}} = f'\big(z_j^{[L]}\big)\, \frac{\partial z_j^{[L]}}{\partial w_{ij}^{[L]}} = f'\big(z_j^{[L]}\big)\, a_i^{[L-1]}$$
so
$$\frac{\partial\, \mathrm{loss}}{\partial w_{ij}^{[L]}} = \frac{\partial\, \mathrm{loss}}{\partial a_j^{[L]}}\; f'\big(z_j^{[L]}\big)\; a_i^{[L-1]}$$
At a general layer $k$:
$$\frac{\partial\, \mathrm{loss}}{\partial w_{ij}^{[k]}} = \frac{\partial\, \mathrm{loss}}{\partial a_j^{[k]}} \cdot \frac{\partial a_j^{[k]}}{\partial w_{ij}^{[k]}}, \qquad \frac{\partial a_j^{[k]}}{\partial w_{ij}^{[k]}} = \frac{\partial a_j^{[k]}}{\partial z_j^{[k]}}\cdot\frac{\partial z_j^{[k]}}{\partial w_{ij}^{[k]}} = f'\big(z_j^{[k]}\big)\, a_i^{[k-1]}$$
Propagating to the previous layer's activations:
$$\frac{\partial\, \mathrm{loss}}{\partial a_i^{[k-1]}} = \sum_{j=1}^{N^{[k]}} \frac{\partial\, \mathrm{loss}}{\partial a_j^{[k]}} \times \frac{\partial a_j^{[k]}}{\partial z_j^{[k]}} \times \frac{\partial z_j^{[k]}}{\partial a_i^{[k-1]}} = \sum_{j=1}^{N^{[k]}} \frac{\partial\, \mathrm{loss}}{\partial a_j^{[k]}} \times f'\big(z_j^{[k]}\big) \times w_{ij}^{[k]}$$
(Figure: the forward quantities $a_i^{[k-1]} \to z_j^{[k]} \to a_j^{[k]}$, with $z_j^{[k]} = \sum_i w_{ij}^{[k]} a_i^{[k-1]}$ and $a_j^{[k]} = f(z_j^{[k]})$, and the backward derivatives $\partial\,\mathrm{loss}/\partial a_j^{[k]}$ flowing from layer $k$ back to $\partial\,\mathrm{loss}/\partial a_i^{[k-1]}$.)
Define the sensitivity $\delta_j^{[k]} = \dfrac{\partial\, \mathrm{loss}}{\partial a_j^{[k]}}$ (the sensitivity of the loss to $a_j^{[k]}$). Then:
$$\frac{\partial\, \mathrm{loss}}{\partial w_{ij}^{[k]}} = \delta_j^{[k]} \times a_i^{[k-1]} \times f'\big(z_j^{[k]}\big)$$
} Sensitivity vectors can be obtained by running a backward process in the network architecture (hence the name backpropagation):
we compute $\boldsymbol{\delta}^{[k-1]}$ from $\boldsymbol{\delta}^{[k]}$ as
$$\delta_i^{[k-1]} = \sum_{j=1}^{N^{[k]}} \delta_j^{[k]} \times f'\big(z_j^{[k]}\big) \times w_{ij}^{[k]}$$
The recursion starts at the output layer, where $\delta_j^{[L]} = \dfrac{\partial\, \mathrm{loss}}{\partial a_j^{[L]}}$.
1. Feed the training example forward through the network and compute the outputs of all units in the forward step (the $z$'s and $a$'s) and the loss.
2. For each unit, find its sensitivity $\delta$ in the backward step.
3. Update each network weight $w_{ij}^{[k]}$ as
$$w_{ij}^{[k]} \leftarrow w_{ij}^{[k]} - \eta\, \delta_j^{[k]} \times a_i^{[k-1]} \times f'\big(z_j^{[k]}\big)$$
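The forward step, the backward sensitivity recursion, and the weight gradients can be sketched in numpy for a tiny fully-connected network with sigmoid activations and squared-error loss (the sizes, data, and names are invented; biases are omitted for brevity), with a finite-difference check of one weight's gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

sizes = [2, 3, 2]                 # a tiny 2-3-2 network
# W[k][i, j] connects unit i of layer k to unit j of layer k+1
W = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
x = np.array([0.5, -1.0])
y = np.array([1.0, 0.0])

def forward(W, x):
    a, zs, acts = x, [], [x]
    for Wk in W:
        z = a @ Wk                # z_j = sum_i w_ij a_i
        a = sig(z)
        zs.append(z); acts.append(a)
    return zs, acts

def loss(W, x, y):
    return 0.5 * np.sum((forward(W, x)[1][-1] - y) ** 2)

def backprop(W, x, y):
    zs, acts = forward(W, x)
    delta = acts[-1] - y          # output sensitivity d loss / d a
    grads = []
    for k in reversed(range(len(W))):
        fp = sig(zs[k]) * (1 - sig(zs[k]))           # f'(z)
        grads.append(np.outer(acts[k], delta * fp))  # delta_j * f'(z_j) * a_i
        delta = W[k] @ (delta * fp)                  # previous layer's sensitivity
    return grads[::-1]

grads = backprop(W, x, y)
# finite-difference check of one weight gradient
eps = 1e-6
W_pert = [Wk.copy() for Wk in W]; W_pert[0][0, 0] += eps
num = (loss(W_pert, x, y) - loss(W, x, y)) / eps
```

The gradient-descent step would then subtract `eta * grads[k]` from each `W[k]`.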
76
77
78
79
80
81
82
83
84
85
86
87
88
In the cs231n computational-graph example, each input's gradient is [local gradient] × [upstream gradient], e.g.:
– x0: [2] × [0.2] = 0.4
– w0: [-1] × [0.2] = -0.2
The computations can be expressed in terms of vector operations:
– Simpler arithmetic
– Fast matrix libraries make operations much faster
– This is what is actually used in any real system
Using vector notation $\mathbf{z} = g(\mathbf{y})$, i.e. $(z_1, z_2, \ldots, z_M)^\top = g\big((y_1, y_2, \ldots, y_N)^\top\big)$, the Jacobian is
$$\frac{\partial \mathbf{z}}{\partial \mathbf{y}} = \begin{bmatrix} \frac{\partial z_1}{\partial y_1} & \frac{\partial z_1}{\partial y_2} & \cdots & \frac{\partial z_1}{\partial y_N} \\ \frac{\partial z_2}{\partial y_1} & \frac{\partial z_2}{\partial y_2} & \cdots & \frac{\partial z_2}{\partial y_N} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial z_M}{\partial y_1} & \frac{\partial z_M}{\partial y_2} & \cdots & \frac{\partial z_M}{\partial y_N} \end{bmatrix}$$
Check: $\Delta \mathbf{z} = \dfrac{\partial \mathbf{z}}{\partial \mathbf{y}}\, \Delta \mathbf{y}$.
Shapes of derivatives:
– Scalar $z$ w.r.t. a vector $\mathbf{y}$: a row vector $\dfrac{\partial z}{\partial \mathbf{y}} = \Big[\dfrac{\partial z}{\partial y_1} \;\cdots\; \dfrac{\partial z}{\partial y_N}\Big]$
– Vector $\mathbf{z}$ w.r.t. a vector $\mathbf{y}$: the Jacobian matrix with entries $\dfrac{\partial z_i}{\partial y_j}$
– Scalar $z$ w.r.t. a matrix $W$: a matrix of the same shape as $W$ with entries $\dfrac{\partial z}{\partial W_{ij}}$, obtained through the chain rule via an intermediate vector, $\dfrac{\partial z}{\partial W_{ij}} = \dfrac{\partial z}{\partial \mathbf{u}}\,\dfrac{\partial \mathbf{u}}{\partial W_{ij}}$
Jacobian of an element-wise activation layer:
– The number of outputs is identical to the number of inputs
– Diagonal entries are individual derivatives of outputs w.r.t. inputs
– (Not showing the superscript "[k]" in the equations, for brevity)
$$\frac{\partial \mathbf{a}}{\partial \mathbf{z}} = \mathrm{diag}\!\left(\frac{\partial a_1}{\partial z_1},\; \frac{\partial a_2}{\partial z_2},\; \ldots,\; \frac{\partial a_M}{\partial z_M}\right)$$
– The Jacobian is a diagonal matrix
– Diagonal entries are individual derivatives of outputs w.r.t. inputs
For a scalar activation $a_i = f(z_i)$:
$$\frac{\partial \mathbf{a}}{\partial \mathbf{z}} = \mathrm{diag}\big(f'(z_1),\; f'(z_2),\; \ldots,\; f'(z_M)\big)$$
– The Jacobian is a diagonal matrix
– Diagonal entries are individual derivatives of outputs w.r.t. inputs
For the logistic sigmoid, $a_i = \sigma(z_i)$:
$$\frac{\partial \mathbf{a}}{\partial \mathbf{z}} = \mathrm{diag}\Big(\sigma(z_1)\big(1-\sigma(z_1)\big),\; \sigma(z_2)\big(1-\sigma(z_2)\big),\; \ldots\Big)$$
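This diagonal Jacobian is easy to verify numerically, by checking the first-order relation $\Delta\mathbf{a} \approx \frac{\partial\mathbf{a}}{\partial\mathbf{z}}\,\Delta\mathbf{z}$ (the example values are mine):

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
z = np.array([0.5, -1.0, 2.0])
J = np.diag(sig(z) * (1 - sig(z)))   # diagonal Jacobian of the sigmoid layer

# first-order check: Delta a ~= J @ Delta z for a small perturbation
dz = 1e-6 * np.array([1.0, 2.0, -1.0])
da = sig(z + dz) - sig(z)
```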
For general vector activations (e.g. softmax), the Jacobian is a full matrix:
– Entries are partial derivatives of individual outputs w.r.t. individual inputs
$$\frac{\partial \mathbf{a}}{\partial \mathbf{z}} = \begin{bmatrix} \frac{\partial a_1}{\partial z_1} & \frac{\partial a_1}{\partial z_2} & \cdots & \frac{\partial a_1}{\partial z_M} \\ \frac{\partial a_2}{\partial z_1} & \frac{\partial a_2}{\partial z_2} & \cdots & \frac{\partial a_2}{\partial z_M} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial a_N}{\partial z_1} & \frac{\partial a_N}{\partial z_2} & \cdots & \frac{\partial a_N}{\partial z_M} \end{bmatrix}$$
The affine layer in vector form:
$$\mathbf{z}^{[k]} = W^{[k]}\, \mathbf{a}^{[k-1]} + \mathbf{b}^{[k]}$$
Vector chain rule: for $\mathbf{z} = g\big(h(\mathbf{y})\big)$ with intermediate $\mathbf{u} = h(\mathbf{y})$, $\mathbf{z} = g(\mathbf{u})$:
$$\frac{\partial \mathbf{z}}{\partial \mathbf{y}} = \frac{\partial \mathbf{z}}{\partial \mathbf{u}}\, \frac{\partial \mathbf{u}}{\partial \mathbf{y}}$$
Check:
$$\Delta \mathbf{u} = \frac{\partial \mathbf{u}}{\partial \mathbf{y}}\,\Delta \mathbf{y}, \qquad \Delta \mathbf{z} = \frac{\partial \mathbf{z}}{\partial \mathbf{u}}\,\Delta \mathbf{u} \;\Rightarrow\; \Delta \mathbf{z} = \frac{\partial \mathbf{z}}{\partial \mathbf{u}}\, \frac{\partial \mathbf{u}}{\partial \mathbf{y}}\,\Delta \mathbf{y} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}}\,\Delta \mathbf{y}$$
Note the order: the derivative of the outer function comes first.
– Shape check: at each intermediate step of the chain, the derivative has the shape of its denominator.
Worked example:
$$\mathbf{r} = W\mathbf{x}, \qquad g(\mathbf{r}) = \lVert \mathbf{r} \rVert^2 = \mathbf{r}^\top \mathbf{r}$$
We want the derivative of $g$ with respect to the matrix $W$.
First, the derivative with respect to the intermediate result:
$$\frac{\partial g}{\partial \mathbf{r}} = 2\mathbf{r}$$
Then, by the chain rule:
$$\frac{\partial g}{\partial W} = \frac{\partial g}{\partial \mathbf{r}}\, \mathbf{x}^\top = 2\mathbf{r}\,\mathbf{x}^\top, \qquad \mathbf{r} = W\mathbf{x}$$
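The result $\partial g / \partial W = 2\mathbf{r}\mathbf{x}^\top$ can be confirmed with a finite-difference check in numpy (the shapes and seed are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)

r = W @ x
g = r @ r                        # g = ||r||^2 = r^T r
grad_W = 2.0 * np.outer(r, x)    # dg/dW = 2 r x^T

# finite-difference check of one entry
eps = 1e-6
W_pert = W.copy(); W_pert[1, 2] += eps
num = ((W_pert @ x) @ (W_pert @ x) - g) / eps
```

Note the shape check: the gradient has the same shape as $W$, as required for a scalar-valued $g$.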
The whole network is a composition of layers:
$$\mathrm{Output} = \mathbf{a}^{[L]} = f\big(\mathbf{z}^{[L]}\big) = f\big(W^{[L]}\mathbf{a}^{[L-1]}\big) = f\Big(W^{[L]} f\big(W^{[L-1]}\mathbf{a}^{[L-2]}\big)\Big) = f\Big(W^{[L]} f\big(W^{[L-1]} \cdots f\big(W^{[2]} f\big(W^{[1]}\mathbf{x}\big)\big)\big)\Big)$$
For convenience, we use the same activation function for all layers. However, output-layer neurons most commonly do not apply an activation function (they produce class scores or real-valued targets).
In the backward pass through this pipeline ($\mathbf{x} \to W^{[1]} \to f \to \mathbf{a}^{[1]} \to \cdots \to \mathbf{a}^{[L]}$), each activation's Jacobian will be a diagonal matrix for scalar activations.
Rather than deriving the gradients by hand for all parameters, represent the computation as a graph and run backpropagation on the graph to compute the gradients of all inputs/parameters/intermediates.
Nodes implement a forward() / backward() API:
– forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
– backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
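A minimal sketch of that API for a single multiply node (the class and variable names are mine, not a real framework's):

```python
class Multiply:
    """One computational-graph node with the forward()/backward() API."""

    def forward(self, a, b):
        self.a, self.b = a, b      # save intermediates needed for backward
        return a * b

    def backward(self, upstream):
        # chain rule: local gradient times upstream gradient
        return upstream * self.b, upstream * self.a

node = Multiply()
out = node.forward(3.0, -2.0)
da, db = node.backward(0.5)        # gradient of the loss w.r.t. a and b
```

A full framework composes many such nodes and calls backward in reverse topological order.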
Back-propagation provides the gradient of the error function w.r.t. the weights and biases. Two families of questions remain about the learning procedure built from these derivatives:
– Convergence or optimization issues: how do we use the error derivatives?
– Generalization issues: how can we improve its decisions on unseen data?
Further reading:
– http://cs231n.stanford.edu/handouts/derivatives.pdf
– http://cs231n.stanford.edu/handouts/linear-backprop.pdf
– http://cs231n.github.io/optimization-2/