Neural Networks Learning the network: Part 3
11-785, Fall 2020 Lecture 5
Recap: Training the network
• Given a training set of input-output pairs (X_t, d_t), learn the network parameters W to minimize the loss
  Loss(W) = (1/T) Σ_t Div(f(X_t; W), d_t)
  w.r.t. W
• This is a problem of function minimization – an instance of optimization

Before we can train, we must settle:
• What are these input-output pairs?
• What is f() and what are its parameters W?
• What is the divergence Div()?
The network: a layered composition of units with differentiable activations, organized into input units, hidden units and output units.

The desired output d:
• For real-valued prediction: a vector of reals
• For classification: a one-hot vector representation of the label

The actual output Y of the network:
• For real-valued prediction: a vector of reals
• For classification: a probability distribution over labels
For real-valued outputs, the L2 divergence is popular:
  Div(Y, d) = (1/2) Σ_i (y_i − d_i)^2
• The derivative: dDiv/dy_i = y_i − d_i

[Figure: network outputs y_1…y_4 compared to targets d_1…d_4 by the L2 divergence]

For binary classification, where the output Y is a scalar probability and d is 0/1, the Kullback-Leibler (KL) divergence between the output probability and the ideal output probability is popular:
  Div(Y, d) = −d log Y − (1 − d) log(1 − Y)
• Minimum when Y = d
• The derivative:
  dDiv(Y, d)/dY = −1/Y        for d = 1
                =  1/(1 − Y)  for d = 0
• Note: the derivative is not 0 even when Y equals d, its target value – encouraging faster convergence of gradient descent
  – It is 0 for L2, though

[Figure: KL(Y, d) = −d log Y − (1 − d) log(1 − Y) and L2(Y, d) = (Y − d)^2 plotted against Y, for d = 0 and d = 1]
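A small numerical sketch of these two divergences and their derivatives (illustrative code, not from the slides; the variable names are arbitrary):

```python
import numpy as np

def l2_div(y, d):
    """L2 divergence between output y and target d (both vectors)."""
    return 0.5 * np.sum((y - d) ** 2)

def l2_div_grad(y, d):
    """Derivative of the L2 divergence w.r.t. y: simply y - d."""
    return y - d

def binary_kl(y, d):
    """Binary KL divergence: -d log y - (1-d) log(1-y), for scalar y in (0,1), d in {0,1}."""
    return -d * np.log(y) - (1 - d) * np.log(1 - y)

def binary_kl_grad(y, d):
    """Derivative w.r.t. y: -1/y if d == 1, 1/(1-y) if d == 0."""
    return -d / y + (1 - d) / (1 - y)

y, d = 0.999, 1.0
print(binary_kl(y, d), binary_kl_grad(y, d))   # ~0.001 and ~-1.0: still pushing y toward 1
print(l2_div(np.array([y]), np.array([d])),
      l2_div_grad(np.array([y]), np.array([d])))  # both nearly 0 at the target
```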
For multi-class classification, where the desired output d is a one-hot vector, the Kullback-Leibler (KL) divergence between the output probability distribution Y and the ideal output probability d is popular:
  Div(Y, d) = Σ_i d_i log d_i − Σ_i d_i log y_i
• Note: Σ_i d_i log d_i is not a function of Y (and is 0 for one-hot d), so for one-hot d with correct class c this reduces to
  Div(Y, d) = −Σ_i d_i log y_i = −log y_c
• Minimum when Y = d
• The derivative:
  dDiv(Y, d)/dy_i = −1/y_c  for the c-th component
                 = 0       for the remaining components
  i.e.  ∇_Y Div(Y, d) = [0  0  …  −1/y_c  …  0  0]

[Figure: network outputs y_1…y_4 and a one-hot target d compared by the KL divergence]

• Note: the derivative is not 0 when Y = d, even though the divergence attains its minimum there
• The slope is negative w.r.t. y_c: increasing y_c will reduce the divergence
A note on the KL divergence and the cross-entropy (Xent):
  KL(d, Y) = Σ_i d_i log d_i − Σ_i d_i log y_i   and   Xent(d, Y) = −Σ_i d_i log y_i
• The two differ only by a term that does not depend on Y, so the Y that minimizes cross-entropy will minimize the KL divergence
  – In fact, for one-hot d, Xent = KL divergence
• The Xent is not a divergence, though: although it attains its minimum when Y = d, its minimum value is not 0 in general
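A quick check (illustrative, not from the slides) that the multi-class KL divergence and the cross-entropy share the same gradient w.r.t. Y, and coincide for one-hot targets:

```python
import numpy as np

def kl_div(y, d, eps=1e-12):
    """KL divergence: sum_i d_i log(d_i / y_i)."""
    return np.sum(d * (np.log(d + eps) - np.log(y + eps)))

def xent(y, d, eps=1e-12):
    """Cross-entropy: -sum_i d_i log y_i."""
    return -np.sum(d * np.log(y + eps))

def xent_grad(y, d, eps=1e-12):
    """Gradient w.r.t. y -- identical for KL and Xent, since they differ by a constant in y."""
    return -d / (y + eps)

y = np.array([0.7, 0.2, 0.1])   # network output (a probability distribution)
d = np.array([1.0, 0.0, 0.0])   # one-hot target, correct class c = 0

print(kl_div(y, d), xent(y, d))  # equal for one-hot d
print(xent_grad(y, d))           # [-1/0.7, 0, 0]: nonzero only at the c-th component
```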
It is sometimes useful to set the target d to a smoothed version of the one-hot vector: the value 1 − (K − 1)ε in the c-th position (for the correct class c) and ε elsewhere, for some small ε
• "Label smoothing" – aids gradient descent
• The derivative:
  dDiv(Y, d)/dy_i = −(1 − (K − 1)ε)/y_c   for the c-th component
                 = −ε/y_i                for the remaining components

[Figure: network outputs y_1…y_4 and a smoothed target d compared by the KL divergence]

• Negative derivatives encourage increasing the probabilities of all classes, including incorrect classes! (Seems wrong, no?)
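A minimal sketch of a label-smoothed target and the resulting gradient (the values of K and ε are illustrative):

```python
import numpy as np

def smooth_one_hot(c, K, eps):
    """Target with 1-(K-1)*eps at the correct class c and eps elsewhere."""
    d = np.full(K, eps)
    d[c] = 1.0 - (K - 1) * eps
    return d

def xent_grad(y, d):
    """dDiv/dy_i = -d_i / y_i: now negative for *every* class, not just the correct one."""
    return -d / y

y = np.array([0.7, 0.2, 0.1])
d = smooth_one_hot(c=0, K=3, eps=0.01)
print(d)                # [0.98, 0.01, 0.01]
print(xent_grad(y, d))  # all components negative: every class probability is pushed up
```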
ALL TERMS HAVE BEEN DEFINED
• Networks are trained by adjusting their parameters to minimize the average divergence between their actual output and the desired output at a set of "training instances"
  – Input-output samples from the function to be learned
  – The average divergence is the "Loss" to be minimized
• We have chosen:
  – The network itself
  – The manner in which inputs are represented as numbers
  – The manner in which outputs are represented as numbers
  – The divergence function that computes the error between actual and desired outputs
The problem of training: minimize the total loss
  Loss(W) = (1/T) Σ_t Div(f(X_t; W), d_t)
w.r.t. the weights W.

Gradient descent: to minimize any function L(W) w.r.t. W
• Start with an initial guess for W
• Repeatedly move against the gradient of L w.r.t. W
  – For every component w: w = w − η dL/dw
• Until L(W) has converged

Explicitly stating it by component, for the network:
• Using the extended notation: the bias is also a weight
• For every layer k, for all i, j, update:
  w_{i,j}^{(k)} = w_{i,j}^{(k)} − η dLoss/dw_{i,j}^{(k)}
• Until the loss has converged
Training by gradient descent
• Total training Loss:
  Loss = (1/T) Σ_t Div(Y_t, d_t)
• Assuming the bias is also represented as a weight:
  – Initialize all weights
  – Do: for every layer k, for all i, j, update
    w_{i,j}^{(k)} = w_{i,j}^{(k)} − η dLoss/dw_{i,j}^{(k)}
  – Until the loss has converged
• Total derivative:
  dLoss/dw_{i,j}^{(k)} = (1/T) Σ_t dDiv(Y_t, d_t)/dw_{i,j}^{(k)}
Training by gradient descent, in pseudocode:
• Initialize all weights
• Do:
  – For all i, j, k, initialize dLoss/dw_{i,j}^{(k)} = 0
  – For all t = 1…T (over the training instances):
    • Compute the network output Y_t for input X_t
    • Compute dDiv(Y_t, d_t)/dw_{i,j}^{(k)} for all i, j, k
    • dLoss/dw_{i,j}^{(k)} += dDiv(Y_t, d_t)/dw_{i,j}^{(k)}
  – For every layer k, for all i, j, update:
    w_{i,j}^{(k)} = w_{i,j}^{(k)} − (η/T) dLoss/dw_{i,j}^{(k)}
• Until the loss has converged

The derivative of the total training loss is the average of the derivatives of the divergences of the individual training inputs:
  dLoss/dw_{i,j}^{(k)} = (1/T) Σ_t dDiv(Y_t, d_t)/dw_{i,j}^{(k)}
So the key remaining computation is dDiv(Y_t, d_t)/dw_{i,j}^{(k)}, the derivative of the divergence for a single training instance.
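A sketch of this batch gradient-descent loop in Python. Here `backward` is a hypothetical stand-in for the per-instance derivative computation developed later in the lecture; only the structure of the loop is the point:

```python
import numpy as np

def train(weights, data, backward, eta, n_epochs):
    """Batch gradient descent, as in the pseudocode above.

    weights : list of per-layer weight arrays
    data    : list of (x, d) training pairs
    backward: hypothetical function (weights, x, d) -> per-layer arrays dDiv/dW
              (computed by the forward + backward passes described later)
    """
    T = len(data)
    for _ in range(n_epochs):
        grads = [np.zeros_like(W) for W in weights]   # accumulators for sum_t dDiv/dW
        for x, d in data:
            for g, gi in zip(grads, backward(weights, x, d)):
                g += gi                               # sum the per-instance derivatives
        for W, g in zip(weights, grads):
            W -= eta * g / T                          # w = w - (eta/T) * sum_t dDiv/dw
    return weights
```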
A brief calculus refresher:
• For any differentiable function y = f(x) with derivative dy/dx, a small perturbation of the input produces, by definition, Δy ≈ (dy/dx) Δx
• For any differentiable function y = f(x_1, x_2, …, x_M) with partial derivatives ∂y/∂x_i:
  Δy ≈ Σ_i (∂y/∂x_i) Δx_i
• For any nested function y = f(g(x)), the chain rule:
  dy/dx = (df/dg)(dg/dx)
  – Check: we can confirm that a perturbation of x perturbs g(x), which in turn perturbs y through df/dg
• Distributed chain rule: for y = f(g_1(x), g_2(x), …, g_M(x)):
  dy/dx = Σ_i (∂f/∂g_i)(dg_i/dx)
  – Check: a perturbation of x causes perturbations in each of g_1(x)…g_M(x), each of which individually additively perturbs y
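A quick finite-difference check of the chain rule and the distributed chain rule (the functions are purely illustrative):

```python
import numpy as np

# Nested function: y = f(g(x)) with f(u) = u**2, g(x) = sin(x)
f  = lambda u: u ** 2
g  = np.sin
df = lambda u: 2 * u        # f'(u)
dg = np.cos                 # g'(x)

x, h = 0.7, 1e-6
analytic = df(g(x)) * dg(x)                       # chain rule: dy/dx = f'(g(x)) g'(x)
numeric  = (f(g(x + h)) - f(g(x - h))) / (2 * h)  # central finite difference
print(analytic, numeric)                          # agree to ~1e-9

# Distributed chain rule: y = f(g1(x), g2(x)) with f(a, b) = a*b, g1 = sin, g2 = cos
y        = lambda x: np.sin(x) * np.cos(x)
analytic = np.cos(x) * np.cos(x) + np.sin(x) * (-np.sin(x))  # sum over both paths
numeric  = (y(x + h) - y(x - h)) / (2 * h)
print(analytic, numeric)
```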
Our problem: for a single training instance (X, d), compute the derivative dDiv(Y, d)/dw_{i,j}^{(k)} for every weight of the network.

[Figure repeated over the next slides: a small example network – the actual network would have many more neurons and inputs. Each neuron computes an affine combination of its inputs (the "+" blocks) followed by a differentiable activation g(·); the weight w_{i,j}^{(k)} connects the j-th unit of layer k−1 to the i-th unit of layer k; the final outputs are compared to the target d by Div(Y, d).]

What is:
  dDiv(Y, d)/dw_{i,j}^{(k)} ?
• Computing it requires the intermediate and final output values of the network in response to the input
The forward pass
• We will refer to the process of computing the output from an input as the forward pass
• We will illustrate the forward pass in the following slides

[Figure repeated over the next slides: the layered network; layer k computes the affine values z^{(k)} from y^{(k−1)}, and the activation f_k produces y^{(k)}; the computation runs y^{(0)} → z^{(1)} → y^{(1)} → … → z^{(N)} → y^{(N)}, with the final activation f_N at the output layer and constant "1" nodes for the biases]

• Assuming y_0^{(k)} = 1, we extend the output of every layer by a constant 1, to account for the biases – the bias is then simply the weight on this constant input
• We set y^{(0)} = x, the input, for notational convenience
• Each layer first computes its affine combinations and then its activations:
  z_i^{(k)} = Σ_j w_{i,j}^{(k)} y_j^{(k−1)}
  y_i^{(k)} = f_k(z_i^{(k)})
• This is iterated over the entire network:
  ITERATE FOR k = 1:N, for j = 1:layer-width
• The output of the network is Y = y^{(N)}
The forward pass (scalar form):
• Input: a D_0-dimensional vector x, where D_0 is the width of the 0th (input) layer
• Set y^{(0)} = x; set y_0^{(k)} = 1 for all k (the bias terms)
• For k = 1 … N:
  – For i = 1 … D_k:
    z_i^{(k)} = Σ_j w_{i,j}^{(k)} y_j^{(k−1)}
    y_i^{(k)} = f_k(z_i^{(k)})
• Output Y = y^{(N)}
(D_k is the size of the k-th layer.)

We have computed all these intermediate values in the forward computation. We must remember them – we will need them to compute the derivatives.
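A minimal numpy sketch of this forward pass (illustrative: the layer sizes and the tanh activation are arbitrary, and explicit bias vectors are used instead of the constant-1 extended notation):

```python
import numpy as np

def forward(Ws, bs, x, f=np.tanh):
    """Forward pass: z(k) = W(k) y(k-1) + b(k), y(k) = f(z(k)).
    Returns all intermediate zs and ys -- the backward pass will need them."""
    ys, zs = [x], []
    for W, b in zip(Ws, bs):
        z = W @ ys[-1] + b
        zs.append(z)
        ys.append(f(z))
    return zs, ys

# A tiny 2-3-1 network with random weights
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(1)]
zs, ys = forward(Ws, bs, np.array([0.5, -1.0]))
print(ys[-1])   # the network output Y = y(N)
```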
The backward pass

[Figure repeated over the next slides: the same layered network, y^{(0)} → z^{(1)} → … → z^{(N−1)} → y^{(N−1)} → z^{(N)} → y^{(N)}, with the "1" bias nodes, the final activation f_N, and the divergence Div(Y, d) computed from the output]

• First, we compute the divergence Div(Y, d) between the output of the net Y = y^{(N)} and the desired output d
• We then compute dDiv/dy_i^{(N)}, the derivative of the divergence w.r.t. the final output of the network y^{(N)}
  – The derivative w.r.t. the actual output of the final layer of the network is simply the derivative w.r.t. the output of the network
• We then compute dDiv/dz_i^{(N)}, the derivative of the divergence w.r.t. the pre-activation affine combination z^{(N)}, using the chain rule:
  dDiv/dz_i^{(N)} = f_N'(z_i^{(N)}) · dDiv/dy_i^{(N)}
  – dDiv/dy_i^{(N)} has already been computed; f_N'(·) is the derivative of the activation function, evaluated at the z_i^{(N)} computed (and stored) in the forward pass
• Continuing on, we compute dDiv/dw_{i,j}^{(N)}, the derivative of the divergence with respect to the weights of the connections to the output layer:
  dDiv/dw_{i,j}^{(N)} = y_j^{(N−1)} · dDiv/dz_i^{(N)}
  – Because z_i^{(N)} = Σ_j w_{i,j}^{(N)} y_j^{(N−1)}; dDiv/dz_i^{(N)} was just computed, and y_j^{(N−1)} was computed in the forward pass
  – For the bias term (j = 0, with y_0^{(N−1)} = 1): dDiv/dw_{i,0}^{(N)} = dDiv/dz_i^{(N)}
• We then continue with the chain rule to compute dDiv/dy_j^{(N−1)}, the derivative of the divergence w.r.t. the output of the (N−1)-th layer:
  dDiv/dy_j^{(N−1)} = Σ_i w_{i,j}^{(N)} · dDiv/dz_i^{(N)}
  – Because y_j^{(N−1)} influences the divergence through every z_i^{(N)} it feeds (the distributed chain rule); the w_{i,j}^{(N)} are already known
• We continue our way backwards in the order shown, repeating the same three steps at every layer k = N−1, N−2, …, 1:
  dDiv/dz_i^{(k)} = f_k'(z_i^{(k)}) · dDiv/dy_i^{(k)}
  dDiv/dw_{i,j}^{(k)} = y_j^{(k−1)} · dDiv/dz_i^{(k)}
  dDiv/dy_j^{(k−1)} = Σ_i w_{i,j}^{(k)} · dDiv/dz_i^{(k)}
  until we reach the input y^{(0)}
• To initialize the recursion we only need the gradient w.r.t. the network output, dDiv/dy^{(N)}
• We do not need derivatives w.r.t. the constant "1" bias nodes
The backward pass (scalar form):
• Initialize: compute dDiv/dy_i^{(N)} for all i (the gradient w.r.t. the network output)
• For k = N … 1:
  – For i = 1 … D_k:
    dDiv/dz_i^{(k)} = f_k'(z_i^{(k)}) · dDiv/dy_i^{(k)}
    – For j = 0 … D_{k−1}:
      dDiv/dw_{i,j}^{(k)} = y_j^{(k−1)} · dDiv/dz_i^{(k)}
      dDiv/dy_j^{(k−1)} += w_{i,j}^{(k)} · dDiv/dz_i^{(k)}

• This is called "Backpropagation" because the derivative of the loss is propagated "backwards" through the network
• It is very analogous to the forward pass:
  – The backward weighted combination Σ_i w_{i,j}^{(k)} dDiv/dz_i^{(k)} mirrors the forward affine combination
  – Multiplying by the activation derivative f_k'(z_i^{(k)}) is the backward equivalent of applying the activation
• Using the notation dDiv/dv for the derivative of Div w.r.t. a variable v, the pass starts from a D_N-dimensional derivative at the output and works back to the D_0-dimensional derivative at the input (D_0 is the width of the 0th, input, layer)
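A minimal numpy sketch of the scalar-form backward pass just described (illustrative: it uses explicit bias vectors rather than the constant-1 extended notation, and numpy products in place of the explicit i, j loops):

```python
import numpy as np

def backward(Ws, zs, ys, dDiv_dY, fprime):
    """Backward pass implementing the three recursions above.
    zs, ys are the values stored during the forward pass (ys[0] is the input x);
    dDiv_dY is the derivative of the divergence w.r.t. the network output y(N).
    Returns per-layer derivatives dDiv/dW and dDiv/db."""
    N = len(Ws)
    dWs = [None] * N
    dbs = [None] * N
    dy = dDiv_dY                       # dDiv/dy(k), starting at the output, k = N
    for k in reversed(range(N)):
        dz = fprime(zs[k]) * dy        # dDiv/dz_i(k)   = f'(z_i(k)) * dDiv/dy_i(k)
        dWs[k] = np.outer(dz, ys[k])   # dDiv/dw_ij(k)  = y_j(k-1) * dDiv/dz_i(k)
        dbs[k] = dz                    # dDiv/db_i(k)   = dDiv/dz_i(k)
        dy = Ws[k].T @ dz              # dDiv/dy_j(k-1) = sum_i w_ij(k) * dDiv/dz_i(k)
    return dWs, dbs
```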
This derivation of the backward pass assumed that:
1. The computation of the output of one neuron does not directly affect computation of other neurons in the same (or previous) layers
2. Inputs to neurons only combine through weighted addition
3. Activations are actually differentiable
– All of these conditions are frequently not applicable
– Will appear in quiz. Please read the slides.
Vector activations
• So far, each output y_i^{(k)} has depended only on the corresponding z_i^{(k)} (a scalar activation). Some activations (e.g. softmax) operate on the entire vector z^{(k)}: every output depends on all inputs.

[Figure: a layer with scalar activations (each z_i^{(k)} → y_i^{(k)}) contrasted with a layer with a vector activation (all of z^{(k)} → all of y^{(k)})]

• Scalar activation: modifying a z_i^{(k)} only changes the corresponding y_i^{(k)}; each z_i^{(k)} influences one y_i^{(k)}
• Vector activation: modifying a z_i^{(k)} potentially changes all of the outputs; each z_i^{(k)} influences all of y^{(k)}
• The number of outputs of a vector activation is generally the same as the number of inputs (z^{(k)})
• For scalar activations, the derivative of the error w.r.t. the input to the unit is a simple product of derivatives:
  dDiv/dz_i^{(k)} = f_k'(z_i^{(k)}) · dDiv/dy_i^{(k)}
• For vector activations, the derivative of the error w.r.t. any input is a sum of partial derivatives, regardless of the number of outputs:
  dDiv/dz_i^{(k)} = Σ_j (∂y_j^{(k)}/∂z_i^{(k)}) · dDiv/dy_j^{(k)}
• Note: derivatives of scalar activations are just a special case of vector activations, with ∂y_j^{(k)}/∂z_i^{(k)} = 0 for i ≠ j
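A sketch of the vector-activation case for softmax: the derivative w.r.t. each z_i is a sum over all outputs, weighted by the softmax Jacobian (illustrative code, checked against a finite difference):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = dy_i/dz_j = y_i * (delta_ij - y_j)."""
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

z = np.array([1.0, -0.5, 0.3])
dDiv_dy = np.array([0.2, -1.0, 0.4])      # some upstream derivative dDiv/dy

# Vector activation: dDiv/dz_i = sum_j (dy_j/dz_i) * dDiv/dy_j
dDiv_dz = softmax_jacobian(z).T @ dDiv_dy
print(dDiv_dz)

# Finite-difference check of the first component
h, i = 1e-6, 0
zp, zm = z.copy(), z.copy()
zp[i] += h
zm[i] -= h
print((softmax(zp) @ dDiv_dy - softmax(zm) @ dDiv_dy) / (2 * h))  # matches dDiv_dz[0]
```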
The backward pass with vector activations:
• For k = N … 1:
  – For i = 1 … D_k:
    dDiv/dz_i^{(k)} = Σ_j (∂y_j^{(k)}/∂z_i^{(k)}) · dDiv/dy_j^{(k)}
    – For j = 0 … D_{k−1}:
      dDiv/dw_{i,j}^{(k)} = y_j^{(k−1)} · dDiv/dz_i^{(k)}
      dDiv/dy_j^{(k−1)} += w_{i,j}^{(k)} · dDiv/dz_i^{(k)}

[Figure: a softmax output layer y^{(N)} = softmax(z^{(N)}) compared to the target d by the KL divergence]

• Special case: for a softmax output layer trained with the KL divergence, the two steps at the output combine into the simple form dDiv/dz_i^{(N)} = y_i − d_i
• Such special cases are worked out on the slides – please look up – will appear in quiz!
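A numerical check (illustrative) that combining a softmax output with the KL/cross-entropy divergence gives the simple gradient y − d at the pre-activation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def xent(z, d):
    return -np.sum(d * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])
d = np.array([0.0, 1.0, 0.0])       # one-hot target
y = softmax(z)

# Closed form from the special case: dDiv/dz = y - d
print(y - d)

# Finite-difference check, component by component
h = 1e-6
num = np.array([(xent(z + h * np.eye(3)[i], d) - xent(z - h * np.eye(3)[i], d)) / (2 * h)
                for i in range(3)])
print(num)                           # matches y - d
```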
More generally, units may compute other differentiable functions of their inputs – e.g. linear combinations, polynomials, the logistic (softmax), etc.

Multiplicative combination
• Some units combine their inputs multiplicatively – in contrast to the additive (affine) combination we have seen so far

[Figure: a unit in layer k whose value z_i^{(k)} is the product of two outputs of layer k−1]

• Forward:
  z_i^{(k)} = y_j^{(k−1)} · y_l^{(k−1)}
• Backward – each input's derivative is routed through the other input:
  dDiv/dy_j^{(k−1)} = y_l^{(k−1)} · dDiv/dz_i^{(k)}
  dDiv/dy_l^{(k−1)} = y_j^{(k−1)} · dDiv/dz_i^{(k)}
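A one-line check (illustrative values) that the gradient of a product unit routes each input's derivative through the other input:

```python
yj, yl = 0.8, -1.5                # two outputs of layer k-1
dDiv_dz = 2.0                     # upstream derivative dDiv/dz_i(k)

# Backward rule for z = yj * yl
dDiv_dyj = yl * dDiv_dz           # = -3.0
dDiv_dyl = yj * dDiv_dz           # =  1.6

# Finite-difference check on dDiv/dyj, treating Div as dDiv_dz * z locally
h = 1e-6
print(dDiv_dyj, dDiv_dz * ((yj + h) * yl - (yj - h) * yl) / (2 * h))
```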
The general backward pass pseudocode:
• Compute the output Y, the divergence Div(Y, d), and dDiv/dy^{(N)}
• For k = N … 1:
  – For i = 1 : layer width D_k:
    If the layer has a vector activation:
      dDiv/dz_i^{(k)} = Σ_j (∂y_j^{(k)}/∂z_i^{(k)}) · dDiv/dy_j^{(k)}
    Else if the activation is scalar:
      dDiv/dz_i^{(k)} = f_k'(z_i^{(k)}) · dDiv/dy_i^{(k)}
    For j = 0 … D_{k−1}:
      dDiv/dw_{i,j}^{(k)} = y_j^{(k−1)} · dDiv/dz_i^{(k)}
      dDiv/dy_j^{(k−1)} += w_{i,j}^{(k)} · dDiv/dz_i^{(k)}
Non-differentiable activations
• Activations need not be differentiable everywhere
  – E.g. the RELU (Rectified Linear Unit): g(z) = z for z ≥ 0 and g(z) = 0 for z < 0 – not differentiable at z = 0
  – E.g. the "max" function
  – At such points we use subgradients, or "secants"
• A subgradient of a function at a point is any vector v such that f(x) ≥ f(x_0) + v·(x − x_0)
  – Any direction such that moving in that direction increases the function
  – Strictly defined for convex ("bowl" shaped) functions – for non-convex functions, the equivalent concept is a "quasi-secant"
  – The gradient is not always the subgradient, though
• For the RELU we use the subderivative: 1 for z > 0, 0 otherwise
  – At the differentiable points on the curve, this is the same as the gradient
  – Typically, we will use the equation given
• For the max function y = max(z_1, …, z_N), the (sub)derivative is:
  – 1 w.r.t. the largest incoming input
  – 0 for the rest
• Max pooling over subsets of inputs – will be seen in convolutional networks:
  – 1 for the specific component that is maximum in the corresponding input subset
  – 0 otherwise
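A sketch of the sub-derivatives actually used for RELU and max units (the tie-breaking choice at z = 0 and at ties is a convention, not dictated by the slides):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_subgrad(z):
    """Subderivative used in practice: 1 for z > 0, 0 otherwise (0 chosen at z = 0)."""
    return (z > 0).astype(float)

def max_backward(z, dDiv_dy):
    """Max unit y = max(z_1..z_N): derivative is dDiv/dy for the largest input, 0 for the rest."""
    g = np.zeros_like(z)
    g[np.argmax(z)] = dDiv_dy
    return g

z = np.array([-0.3, 1.2, 0.4])
print(relu(z), relu_subgrad(z))      # [0, 1.2, 0.4], [0, 1, 0]
print(max_backward(z, dDiv_dy=2.0))  # [0, 2, 0]: only the max input receives gradient
```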
[Figure: a max-style unit selecting among inputs z_1 … z_N to produce outputs y_1 … y_M]

The backward pass itself is unchanged; the activation-derivative terms it uses may now be subgradients.

Overall training: for each training instance
• Forward pass: pass the instance forward through the net. Store all intermediate outputs of all computation.
• Backward pass: sweep backward through the net, iteratively compute all derivatives w.r.t. the weights.
• The per-instance derivatives are averaged over all T training instances to obtain the derivative of the total loss.
Training the network (complete pseudocode, scalar form):
• Initialize all weights w_{i,j}^{(k)} for all layers
• Do:
  – Initialize Loss = 0; for all i, j, k, initialize dLoss/dw_{i,j}^{(k)} = 0
  – For all t (iterate over training instances):
    • Forward pass: compute the output Y_t; Loss += Div(Y_t, d_t)
    • Backward pass: compute dDiv(Y_t, d_t)/dw_{i,j}^{(k)} for all i, j, k
    • dLoss/dw_{i,j}^{(k)} += dDiv(Y_t, d_t)/dw_{i,j}^{(k)}
  – For all i, j, k, update:
    w_{i,j}^{(k)} = w_{i,j}^{(k)} − (η/T) dLoss/dw_{i,j}^{(k)}
• Until the loss has converged

Vector formulation
• It is much more convenient to think of the process in terms of vector and matrix operations
  – Simpler arithmetic – fast matrix libraries make the operations much faster
  – This is what is actually used in any real system
The vector formulation
• Arrange the weights of layer k into a matrix W^{(k)}, with w_{i,j}^{(k)} in the i-th row and j-th column; similarly arrange the biases into a vector b^{(k)}
• The computation of a layer can then be written in vector notation as (setting y^{(0)} = x):
  z^{(k)} = W^{(k)} y^{(k−1)} + b^{(k)}
  y^{(k)} = f_k(z^{(k)})

The forward pass in vector notation:
• Initialize y^{(0)} = x
• For k = 1 to N:
  z^{(k)} = W^{(k)} y^{(k−1)} + b^{(k)}
  y^{(k)} = f_k(z^{(k)})
• Output Y = y^{(N)}; compute the divergence Div(Y, d)

Using vector notation for the backward pass requires a few facts from vector calculus. Check the following:
• The derivative of a vector activation y = f(z) w.r.t. its (vector) input is a Jacobian matrix J_y(z)
  – The number of outputs is identical to the number of inputs, so the Jacobian is square
  – Entries are partial derivatives of individual outputs w.r.t. individual inputs: [J_y(z)]_{i,j} = ∂y_i/∂z_j
  – Not showing the superscript "(k)" in equations, for brevity
• For scalar (component-wise) activations, the Jacobian is a diagonal matrix
  – Diagonal entries are the individual derivatives of outputs w.r.t. inputs
• For the affine function z = Wy + b, with weight matrix W and bias b producing the vector z, the Jacobian of z w.r.t. y is simply W
• Chain rule for vector functions: for Div(y(z)),
  ∇_z Div = ∇_y Div · J_y(z)
  – Check: note the order – the derivative of the outer function comes first
• For z = Wy + b (y is a vector):
  ∇_y Div = ∇_z Div · W
  – Check: again, the derivative of the outer function comes first
• Derivatives w.r.t. parameters:
  ∇_W Div = y ∇_z Div      ∇_b Div = ∇_z Div
  – Note the reversal of order in ∇_W Div; this is in fact a simplification, using ∇_W z to represent the Jacobian of z w.r.t. W to explicitly illustrate the chain rule
  – In general, ∇_a b represents the derivative of b w.r.t. a
• N.B.: derivatives here are row vectors; the gradient is the transpose of the derivative

The vector backward pass for the full network, derived one new term at a time in the slides:
• Start at the output with ∇_Y Div – the actual derivative depends on the divergence function
• ∇_{z^{(N)}} Div = ∇_Y Div · J_{y^{(N)}}(z^{(N)})
• ∇_{W^{(N)}} Div = y^{(N−1)} ∇_{z^{(N)}} Div ;  ∇_{b^{(N)}} Div = ∇_{z^{(N)}} Div
• ∇_{y^{(N−1)}} Div = ∇_{z^{(N)}} Div · W^{(N)}
• …and so on, backwards through the layers; the Jacobian J_{y^{(k)}}(z^{(k)}) reduces to a diagonal matrix for scalar activations
• In the final step we compute the derivative w.r.t. the input, ∇_{y^{(0)}} Div
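A shape-level sketch of one backward step in this row-vector convention (the sizes are arbitrary, and the tanh Jacobian stands in for a generic scalar activation):

```python
import numpy as np

rng = np.random.default_rng(1)
D_in, D_out = 4, 3
W = rng.standard_normal((D_out, D_in))     # layer weights: z = W y + b
y_prev = rng.standard_normal(D_in)         # y(k-1), stored in the forward pass
z = W @ y_prev                             # z(k) (bias omitted for brevity)

dDiv_dy = rng.standard_normal((1, D_out))  # row vector: derivative w.r.t. y(k)
J = np.diag(1.0 - np.tanh(z) ** 2)         # Jacobian of a scalar (tanh) activation: diagonal

dDiv_dz = dDiv_dy @ J                      # (1 x D_out): outer derivative comes first
dDiv_dW = np.outer(y_prev, dDiv_dz)        # (D_in x D_out): y(k-1) times dDiv/dz
dDiv_dyprev = dDiv_dz @ W                  # (1 x D_in): backward recursion step

print(dDiv_dz.shape, dDiv_dW.shape, dDiv_dyprev.shape)
# The gradient used for the weight update is the transpose of dDiv_dW (D_out x D_in).
```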
The backward pass in vector form:
• Initialize: compute ∇_Y Div (= ∇_{y^{(N)}} Div)
• For k = N … 1:
  – Compute ∇_{z^{(k)}} Div = ∇_{y^{(k)}} Div · J_{y^{(k)}}(z^{(k)})
  – Compute ∇_{W^{(k)}} Div = y^{(k−1)} ∇_{z^{(k)}} Div  and  ∇_{b^{(k)}} Div = ∇_{z^{(k)}} Div
  – Backward recursion step: ∇_{y^{(k−1)}} Div = ∇_{z^{(k)}} Div · W^{(k)}
• Note the analogy to the forward pass
  – Forward recursion step: z^{(k)} = W^{(k)} y^{(k−1)} + b^{(k)}

Training the network (complete pseudocode, vector form):
• Initialize all W^{(k)}, b^{(k)}
• Do:
  – Loss = 0; for all k, initialize ∇_{W^{(k)}} Loss = 0, ∇_{b^{(k)}} Loss = 0
  – For all t (loop through training instances):
    • Forward pass: compute the output Y(X_t) and the divergence Div(Y_t, d_t); Loss += Div(Y_t, d_t)
    • Backward pass, for k = N … 1:
      – ∇_{z^{(k)}} Div = ∇_{y^{(k)}} Div · J_{y^{(k)}}(z^{(k)})
      – ∇_{W^{(k)}} Div(Y_t, d_t) = y^{(k−1)} ∇_{z^{(k)}} Div ;  ∇_{b^{(k)}} Div(Y_t, d_t) = ∇_{z^{(k)}} Div
      – ∇_{y^{(k−1)}} Div = ∇_{z^{(k)}} Div · W^{(k)}
      – ∇_{W^{(k)}} Loss += ∇_{W^{(k)}} Div(Y_t, d_t) ;  ∇_{b^{(k)}} Loss += ∇_{b^{(k)}} Div(Y_t, d_t)
  – For all k, update:
    W^{(k)} = W^{(k)} − (η/T) (∇_{W^{(k)}} Loss)ᵀ ;  b^{(k)} = b^{(k)} − (η/T) (∇_{b^{(k)}} Loss)ᵀ
• Until the loss has converged
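Putting it all together, a compact numpy sketch of the vectorized training loop above, on a toy problem. All choices here (layer sizes, data, learning rate, tanh hidden layer with a linear output, L2 divergence) are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [2, 8, 1]                                   # layer widths, input through output
Ws = [rng.standard_normal((m, n)) * 0.5 for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

X = rng.standard_normal((50, 2))                    # 50 training instances
D = (X[:, :1] * X[:, 1:]).copy()                    # toy targets d_t

eta, T = 0.05, len(X)
for epoch in range(500):
    dWs = [np.zeros_like(W) for W in Ws]
    dbs = [np.zeros_like(b) for b in bs]
    loss = 0.0
    for x, d in zip(X, D):
        # Forward pass (store z's and y's; the last layer is linear)
        ys, zs = [x], []
        for k, (W, b) in enumerate(zip(Ws, bs)):
            zs.append(W @ ys[-1] + b)
            ys.append(np.tanh(zs[-1]) if k < len(Ws) - 1 else zs[-1])
        loss += 0.5 * np.sum((ys[-1] - d) ** 2)
        # Backward pass
        dy = ys[-1] - d                             # dDiv/dy(N) for the L2 divergence
        for k in reversed(range(len(Ws))):
            dz = dy if k == len(Ws) - 1 else (1 - np.tanh(zs[k]) ** 2) * dy
            dWs[k] += np.outer(dz, ys[k])           # gradient (transpose of the row-vector derivative)
            dbs[k] += dz
            dy = Ws[k].T @ dz
    for k in range(len(Ws)):                        # update with (eta/T) * accumulated gradients
        Ws[k] -= eta / T * dWs[k]
        bs[k] -= eta / T * dbs[k]
print(loss / T)                                     # average divergence after training
```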
Example: MNIST digit recognition
• Training data: images of handwritten digits along with their labels
• Output layer: sigmoid output neurons
  – The first ten outputs correspond to the ten digits
  – Ideal output: one of the outputs goes to 1, the others go to 0

Summary
• Neural nets are trained to minimize the average divergence between the output of the network and the desired output, over a set of training instances, with respect to the network parameters
• The gradient of the divergence (for each training instance) w.r.t. the network parameters can be computed using backpropagation
  – Which requires a "forward" pass of inference followed by a "backward" pass of gradient computation
• Still to come: does this really work – and how can we improve it? And how do we deal with large amounts of training data?