Neural Networks Learning the network: Backprop
11-785, Spring 2018 Lecture 4
Design exercise: the input is a binary coded number and the output is a one-hot vector. How many input units? How many output units? What architecture, and what activations?
Learning the network means estimating its weights from data:
– Basically, get input-output pairs (Xi, di) for a number of samples of the input, which will be drawn from the function to be modeled
– Adjust the weights so that the network output matches the desired output di with minimum error
Problem statement: minimize this error w.r.t. the network weights.
– An instance of optimization
A brief detour: optimization.
We want to find the minimum (or maximum) of a function f(x) w.r.t. a variable "x". Caution: in the network optimization problem we would be using "x" for the variable that we're optimizing a function over (the network parameters), and not the input to a neural network.
Finding the minimum of a function: find the value of x where f(x) is minimum.
[Figure: a curve f(x) vs. x, marking the global minimum, an inflection point, a local minimum, and the global maximum.]
Turning points occur where the derivative is zero:
– Solve f'(x) = 0
– Derivatives go from positive to negative, or vice versa, at these points
[Figure: f(x) and its derivative f'(x), with the sign of the slope (+/−) marked along the curve.]
The second derivative f''(x) distinguishes the two cases: it is negative at maxima and positive at minima. So, to classify a turning point:
– Solve f'(x) = 0 for candidate points
– Compute f''(x) at each candidate: if it is positive the point is a minimum, otherwise it is a maximum
The same intuition holds for multivariate functions f(X): at a minimum, shifting X in any direction will increase the value, while for smooth functions miniscule shifts will not result in any change at all; moving X by a tiny amount will not change the value of the function.
The gradient of a function with multivariate input X is the multiplicative factor that gives us the change in f(X) for tiny variations in X:
  df(X) = ∇f(X) dX
where ∇f(X) is the row vector of partial derivatives ∂f/∂x1, …, ∂f/∂xn.
A property of the inner product: for two vectors of fixed lengths, it is maximum when the two vectors are aligned, i.e. when the angle between them is zero. Applied to df(X) = ∇f(X) dX: the increment is largest when the step dX is aligned with ∇f(X)ᵀ. In other words, the function f(X) increases most rapidly if the input increment is perfectly aligned to the gradient. (Some sloppy maths here, with apology: we are comparing row and column vectors.)
Properties of the gradient vector:
– Moving in the direction of the gradient increases f(X) fastest
– Moving exactly opposite to it decreases f(X) fastest
– At the bottom of the bowl (the minimum) the gradient is 0
– The gradient is perpendicular to the level curves (contours of constant f(X))
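To make the steepest-ascent property concrete, here is a small numeric sketch (the quadratic test function, the step length, and the random-direction comparison are illustrative choices, not from the slides): it checks that a tiny step along the gradient increases f more than a step of the same length in any other sampled direction.

```python
import numpy as np

def f(x):
    # An arbitrary smooth test function of two variables (illustrative only).
    return x[0] ** 2 + 3 * x[1] ** 2 + x[0] * x[1]

def grad_f(x):
    # Analytical gradient of f.
    return np.array([2 * x[0] + x[1], 6 * x[1] + x[0]])

x = np.array([1.0, -2.0])
eps = 1e-3                       # tiny step length
g_dir = grad_f(x) / np.linalg.norm(grad_f(x))   # unit vector along the gradient

rng = np.random.default_rng(0)
best_other = -np.inf
for _ in range(1000):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)       # random unit direction
    best_other = max(best_other, f(x + eps * d) - f(x))

print("increase along gradient :", f(x + eps * g_dir) - f(x))
print("best increase elsewhere :", best_other)
# The gradient direction gives (essentially) the largest increase.
```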
The curvature of a multivariate function is given by the second derivative, the Hessian matrix:

  ∇²f(x1, …, xn) :=
  [ ∂²f/∂x1²      ∂²f/∂x1∂x2    …   ∂²f/∂x1∂xn
    ∂²f/∂x2∂x1    ∂²f/∂x2²      …   ∂²f/∂x2∂xn
    …             …             …   …
    ∂²f/∂xn∂x1    ∂²f/∂xn∂x2    …   ∂²f/∂xn²   ]
Unconstrained minimization of a smooth multivariate function:
– At a minimum or maximum the gradient will be 0
– Solve for the X where the gradient equation equals zero: ∇f(X) = 0
– Compute the Hessian ∇²f(X) at the candidate solution and verify that:
  – the Hessian is positive definite (eigenvalues positive) -> to identify local minima
  – the Hessian is negative definite (eigenvalues negative) -> to identify local maxima
Example: minimize
  f(x1, x2, x3) = (x1)² + x1(1 − x2) + (x2)² − x2 x3 + (x3)² + x3
The gradient is
  ∇f = [ 2x1 + 1 − x2,  −x1 + 2x2 − x3,  −x2 + 2x3 + 1 ]ᵀ
Setting ∇f = 0 and solving the three equations gives the candidate solution
  x = [x1, x2, x3]ᵀ = [−1, −1, −1]ᵀ
To check whether this is a minimum, compute the Hessian
  ∇²f = [  2  −1   0
          −1   2  −1
           0  −1   2 ]
Its eigenvalues are λ1 = 3.414, λ2 = 0.586, λ3 = 2. All eigenvalues are positive, so the matrix is positive definite and the candidate point is indeed a minimum.
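A quick numeric check of this worked example (the function used below is the reconstruction given above; variable names are mine): it confirms that the gradient vanishes at the candidate point and that the Hessian's eigenvalues are the positive values listed.

```python
import numpy as np

def grad(x):
    x1, x2, x3 = x
    # Gradient of f(x1,x2,x3) = x1^2 + x1*(1-x2) + x2^2 - x2*x3 + x3^2 + x3
    return np.array([2 * x1 + 1 - x2,
                     -x1 + 2 * x2 - x3,
                     -x2 + 2 * x3 + 1])

H = np.array([[ 2, -1,  0],
              [-1,  2, -1],
              [ 0, -1,  2]], dtype=float)    # Hessian (constant for a quadratic)

x_star = np.array([-1.0, -1.0, -1.0])
print("gradient at candidate:", grad(x_star))           # ~ [0, 0, 0]
print("Hessian eigenvalues  :", np.linalg.eigvalsh(H))  # ~ [0.586, 2, 3.414] -> positive definite
```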
Closed-form solutions are not always available:
– The function to minimize/maximize may have an intractable form
In such cases we use iterative solutions:
– Begin with a "guess" for the optimal X and refine it iteratively until the correct value is obtained
– Start from an initial guess X⁰ for the optimal X
– Update the guess towards a (hopefully) "better" value of f(X)
– Stop when f(X) no longer decreases
The key questions: which direction to step in, and how big must the steps be?
[Figure: a sequence of estimates x0, x1, x2, x3, x4, x5 descending along f(X) towards the minimum.]
The approach of gradient descent (in one dimension first):
– Start at some point
– Find the direction in which to shift this point to decrease the error:
  – a positive derivative: moving left decreases the error
  – a negative derivative: moving right decreases the error
– Shift the point in this direction
As an iterative algorithm:
– Initialize x⁰
– While f'(xᵏ) ≠ 0:
  – if the derivative is positive:  xᵏ⁺¹ = xᵏ − step
  – else:  xᵏ⁺¹ = xᵏ + step
What must "step" be to ensure we actually get to the optimum? A fixed shift can overshoot the minimum; instead, scale the derivative itself:
– Initialize x⁰
– While f'(xᵏ) ≠ 0:  xᵏ⁺¹ = xᵏ − ηᵏ f'(xᵏ)
where ηᵏ is the "step size".
Gradient descent/ascent finds the minimum or maximum of a function iteratively:
– To find a maximum, move in the direction of the gradient:  Xᵏ⁺¹ = Xᵏ + ηᵏ ∇f(Xᵏ)ᵀ
– To find a minimum, move exactly opposite the direction of the gradient:  Xᵏ⁺¹ = Xᵏ − ηᵏ ∇f(Xᵏ)ᵀ
Many solutions exist for choosing the step size ηᵏ (later lecture); the simplest is to use a fixed value for η.
Example: gradient descent on the quadratic
  f(x1, x2) = (x1)² + x1 x2 + 4(x2)²
starting from x_initial = [3, 3]ᵀ.
[Figure: trajectories of the iterates x⁰, x¹, … for fixed step sizes η = 0.1 and η = 0.2, and for an iteration-dependent (shrinking) step size.]
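A minimal gradient-descent sketch on this example, assuming the reconstructed function f(x1, x2) = x1² + x1·x2 + 4·x2², the starting point [3, 3] and a fixed step size of 0.1; the iterates shrink toward the minimum at the origin.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 4 * x[1] ** 2

def grad_f(x):
    return np.array([2 * x[0] + x[1], x[0] + 8 * x[1]])

x = np.array([3.0, 3.0])   # x_initial from the slide
eta = 0.1                  # fixed step size

for k in range(20):
    x = x - eta * grad_f(x)            # gradient descent update
    if k % 5 == 0:
        print(f"iter {k:2d}: x = {x}, f(x) = {f(x):.6f}")
# f(x) decreases toward the minimum at (0, 0).
```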
Convergence: the gradient descent iterations are continued until (at least) one of the following criteria is satisfied:
– the function value stops changing significantly:  |f(xᵏ⁺¹) − f(xᵏ)| < ε1
– the gradient becomes sufficiently small:  ‖∇f(xᵏ)‖ < ε2
With an appropriately chosen step size, for convex (bowl-shaped) functions gradient descent will always find the minimum. For non-convex functions it will find a local minimum or an inflection point.
Returning to our problem: minimize the total divergence between the network outputs and the desired outputs, over the training data, w.r.t. the network weights. This is an instance of optimization, which we will solve with gradient descent. To set it up we must answer:
– What are these input-output pairs?
– What is f() and what are its parameters W?
– What is the divergence div()?
First: what is f() and what are its parameters W?
The network is a feed-forward arrangement of units:
– No loops
– We will refer to the inputs as the input units
– We refer to the outputs as the output units
– Intermediate units are "hidden" units
[Figure: a layered network showing input units, hidden units, and output units.]
What does an individual unit do?
– Standard setup: a differentiable activation function applied to the sum of weighted inputs and a bias:
  y = f( Σi wi xi + b )
– More generally: any differentiable function of the inputs
We will assume the standard setup unless otherwise specified. The parameters of the unit are its weights wi and its bias b; training will require their derivatives.
The network as a whole is a function:
– Function: Y = f(X; W), where W collectively denotes all the weights and biases of all units
– Modifying a single parameter in W will affect all outputs
[Figure: an MLP with an input layer, hidden layers, and an output layer; the output layer may be a softmax.]
Layer terminology:
– We will refer to the inputs as the input layer
– We refer to the outputs as the output layer
– Intermediate layers are "hidden" layers
– A layer of perceptrons whose outputs are computed jointly (e.g. a softmax) can be viewed as a single vector activation
Notation:
– Input to the network: X, with components x_j = y_j⁽⁰⁾
– The weight connecting the ith unit of the (k−1)th layer and the jth unit of the k-th layer is w_ij⁽ᵏ⁾
– The bias to the jth unit of the k-th layer is b_j⁽ᵏ⁾
Back to the problem statement: minimize the total divergence w.r.t. W, an instance of optimization. Next question: what are these input-output pairs?
The input to the network: a vector, one per training instance
– It may even be just a scalar, if the input layer is of size 1
– E.g. a vector of pixel values
– E.g. a vector of speech features
– E.g. a real-valued vector representing text
– Other real-valued vectors
The output of the network:
– Scalar output: a single output neuron
– Vector output: as many output neurons as the dimension of the desired output
For binary classification (e.g. "does this picture show a cat?"), the desired output d is a simple 1/0 representation:
– 1 = yes, it's a cat
– 0 = no, it's not a cat
The actual network output is better viewed as the probability of the class given the input: the same X may occur for both classes, but with different probabilities.
For the binary case, the output neuron uses a sigmoid activation
  σ(z) = 1 / (1 + e^(−z))
so the output is a probability in (0, 1), with the desired output
– 1 = yes, it's a cat
– 0 = no, it's not a cat
Equivalently, with two output neurons and one-hot targets: yes = [1 0], no = [0 1].
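A small sketch of such a sigmoid output unit (the feature values, weights, and bias below are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)), maps any real z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# A single output neuron: weighted sum of features, plus bias, through the sigmoid.
x = np.array([0.2, -1.3, 0.7])     # example input features (illustrative)
w = np.array([1.5, 0.4, -0.8])     # example weights
b = 0.1                            # bias
p_cat = sigmoid(w @ x + b)         # P("it's a cat" | x)
print("P(cat | x) =", p_cat)
```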
Multi-class classification: e.g. does the image show a cat, a dog, a camel, a hat, or a flower? Ordering the classes as [cat dog camel hat flower]ᵀ, the desired outputs are one-hot vectors:
  cat:    [1 0 0 0 0]ᵀ
  dog:    [0 1 0 0 0]ᵀ
  camel:  [0 0 1 0 0]ᵀ
  hat:    [0 0 0 1 0]ᵀ
  flower: [0 0 0 0 1]ᵀ
i.e. each vector has four zeros and a single 1 at the position of that class.
In general, an N-class one-hot representation will have N binary outputs:
– The desired output d is an N-dimensional binary vector (all zeros and a single 1 in the right place)
– The actual output of the network is N probability values that sum to 1, produced by a softmax output layer
[Figure: a multi-class classifier net with hidden layers and a softmax output layer.]
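A minimal sketch of the multi-class output for the five-class example above: a softmax turns arbitrary scores into N probabilities that sum to 1, and the targets are one-hot vectors (the score values are illustrative).

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the output sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

classes = ["cat", "dog", "camel", "hat", "flower"]
scores = np.array([2.0, 0.5, -1.0, 0.1, 0.3])   # illustrative pre-activation scores
probs = softmax(scores)
print(dict(zip(classes, np.round(probs, 3))), "sum =", probs.sum())

# One-hot desired output for class "camel":
d = np.eye(len(classes))[classes.index("camel")]
print("one-hot target:", d)
```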
Example: recognizing which digit an image represents:
– Binary recognition: is this a "2" or not?
– Multi-class recognition: which digit is this? Is this a digit in the first place?
The training problem: learn all the weights such that the network does the desired job.
– Binary case: training data of (image, yes/no) pairs; the input is a vector of pixel values, the output is a sigmoid
– Multi-class case: training data of (image, label) pairs; the input is a vector of pixel values, the output is a vector of class probabilities from a softmax output layer
Back to the problem statement: minimize the total divergence w.r.t. the weights, an instance of optimization.
Next question: what is the divergence div()?
– One common choice is the L2 divergence, Div(Y, d) = ‖Y − d‖²
– Note: this is differentiable
For binary classifiers, where d is 0/1 and the output Y is a probability, the cross-entropy between the output probability distribution and the ideal output probability is popular:
  Div(Y, d) = −d log Y − (1 − d) log(1 − Y)
– Minimum when d = Y
– Derivative:
  dDiv(Y, d)/dY = −1/Y        if d = 1
                   1/(1 − Y)  if d = 0
For multi-class classifiers with one-hot desired outputs, the KL divergence (cross-entropy) is used:
  Div(Y, d) = −Σi di log yi
– Derivative: ∇Y Div(Y, d) = [0  0 … −1/y_c … 0  0], i.e. −1/y_c for the component c corresponding to the true class, and 0 for the remaining components.
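A sketch of these divergences and their derivatives, as written above (function and variable names are mine):

```python
import numpy as np

def binary_xent(y, d):
    # Cross-entropy for a scalar probability y and a 0/1 target d.
    return -(d * np.log(y) + (1 - d) * np.log(1 - y))

def binary_xent_grad(y, d):
    # dDiv/dY = -1/Y if d == 1, else 1/(1 - Y)
    return -1.0 / y if d == 1 else 1.0 / (1.0 - y)

def multiclass_xent(Y, d):
    # KL/cross-entropy divergence for a probability vector Y and a one-hot d.
    return -np.sum(d * np.log(Y))

def multiclass_xent_grad(Y, d):
    # Gradient is -1/y_c at the true-class component c, and 0 elsewhere.
    return -d / Y

Y = np.array([0.7, 0.1, 0.2])
d = np.array([1.0, 0.0, 0.0])
print(multiclass_xent(Y, d), multiclass_xent_grad(Y, d))
print(binary_xent(0.7, 1), binary_xent_grad(0.7, 1))
```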
We now train the network by minimizing this total divergence w.r.t. the weights, using gradient descent.
Total training error over T training instances (X_t, d_t):
  Err = Σ_t Div(Y_t, d_t),   where Y_t = f(X_t; W)
Gradient descent, explicitly stated by component (using the extended notation in which the bias is also a weight):
– Initialize all weights w_ij⁽ᵏ⁾
– Do:
  – For every layer k, for all i, j, update:
    w_ij⁽ᵏ⁾ = w_ij⁽ᵏ⁾ − η · dErr/dw_ij⁽ᵏ⁾
– Until Err has converged
Total derivative: the derivative of the total training error is the sum of the derivatives of the divergences of the individual training inputs:
  dErr/dw_ij⁽ᵏ⁾ = Σ_t dDiv(Y_t, d_t)/dw_ij⁽ᵏ⁾
The training algorithm in full:
– For all layers k, for all i, j, initialize w_ij⁽ᵏ⁾
– Do:
  – For all i, j, k, initialize dErr/dw_ij⁽ᵏ⁾ = 0
  – For all training instances t:
    – Compute the output Y(X_t)
    – Compute dDiv(Y_t, d_t)/dw_ij⁽ᵏ⁾ for all i, j, k
    – Accumulate:  dErr/dw_ij⁽ᵏ⁾ += dDiv(Y_t, d_t)/dw_ij⁽ᵏ⁾
  – For every layer k, for all i, j:
    w_ij⁽ᵏ⁾ = w_ij⁽ᵏ⁾ − (η/T) · dErr/dw_ij⁽ᵏ⁾   (averaging the per-instance derivatives)
– Until Err has converged
So the total training error and the total derivative are both sums over the individual training inputs; the key remaining computation is the derivative of the divergence of a single input, dDiv(Y_t, d_t)/dw_ij⁽ᵏ⁾.
A calculus refresher: the chain rule.
For any differentiable function y = f(x) with derivative dy/dx, a small perturbation Δx produces Δy ≈ (dy/dx) Δx.
For any nested function y = f(g(x)):
  dy/dx = (dy/dg) · (dg/dx)
Check: we can confirm this by noting that a perturbation of x perturbs g(x), which in turn perturbs y.
Distributed chain rule: for y = f(g1(x), g2(x), …, gM(x)), the influence of x flows through each of g1, …, gM. Perturbations in x cause perturbations in each of the gi, each of which individually additively perturbs y:
  dy/dx = Σi (∂f/∂gi) · (dgi/dx)
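A quick finite-difference check of the distributed chain rule, on an arbitrary nested function of my own choosing:

```python
import numpy as np

# y = f(g1(x), g2(x)) with g1(x) = x**2, g2(x) = sin(x), f(a, b) = a * b
g1 = lambda x: x ** 2
g2 = lambda x: np.sin(x)
y = lambda x: g1(x) * g2(x)

x = 0.8
# Distributed chain rule: dy/dx = (df/dg1) * dg1/dx + (df/dg2) * dg2/dx
analytic = g2(x) * 2 * x + g1(x) * np.cos(x)
eps = 1e-6
numeric = (y(x + eps) - y(x - eps)) / (2 * eps)   # central difference
print(analytic, numeric)   # the two agree to ~1e-9
```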
Returning to our problem: compute dDiv(Y, d)/dw_ij⁽ᵏ⁾ for a single training instance. We illustrate with a small network; an actual network would have many more neurons and inputs. Each unit computes a weighted sum of its inputs (plus a bias, drawn as a "1" input) followed by an activation f(.); each yellow ellipse in the figure represents one such perceptron. We want the derivative of the divergence Div(Y, d) w.r.t. each of the weights w_ij⁽¹⁾, w_ij⁽²⁾, w_ij⁽³⁾, … (Derive on board?)
[Figure: a small MLP with labelled weights, bias inputs, activations f(.), and the divergence Div computed at the output.]
A first step: compute all the intermediate and final output values of the network in response to the input, the forward pass. Assuming y_j⁽⁰⁾ = x_j (the input), each layer k computes
  z_j⁽ᵏ⁾ = Σi w_ij⁽ᵏ⁾ y_i⁽ᵏ⁻¹⁾ + b_j⁽ᵏ⁾
  y_j⁽ᵏ⁾ = f_k(z_j⁽ᵏ⁾)
for k = 1 … N, and the final output Y = y⁽ᴺ⁾ feeds the divergence Div(Y, d).
[Figure: the network unrolled layer by layer, showing z⁽¹⁾, y⁽¹⁾, …, z⁽ᴺ⁻¹⁾, y⁽ᴺ⁻¹⁾, z⁽ᴺ⁾, y⁽ᴺ⁾ and the "1" bias inputs, ending at Div(Y, d).]
The forward pass as an algorithm (D_k is the size of the kth layer):
– Set D_0 = the dimension of the input; y_j⁽⁰⁾ = x_j, i.e. the input is a D_0-dimensional vector
– For k = 1 … N:
  – For j = 1 : layer-width D_k:
    z_j⁽ᵏ⁾ = Σi w_ij⁽ᵏ⁾ y_i⁽ᵏ⁻¹⁾ + b_j⁽ᵏ⁾
    y_j⁽ᵏ⁾ = f_k(z_j⁽ᵏ⁾)
– Output Y = y⁽ᴺ⁾
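A minimal sketch of this componentwise forward pass (the tiny 3-2-1 network, its weight values, and the sigmoid activation are illustrative assumptions, not from the slides):

```python
import math

def f_act(z):
    # Illustrative scalar activation: the sigmoid.
    return 1.0 / (1.0 + math.exp(-z))

def forward_componentwise(x, W, b):
    """Literal transcription of the scalar forward pass.
    W[k][i][j] is the weight from unit i of layer k to unit j of layer k+1;
    b[k][j] is the bias of unit j of layer k+1. Returns all y's and z's."""
    y, z = [list(x)], []                            # y^(0) = input
    for k in range(len(W)):                         # for k = 1 .. N
        z_k = [sum(W[k][i][j] * y[k][i] for i in range(len(y[k]))) + b[k][j]
               for j in range(len(b[k]))]           # for j = 1 : layer-width
        z.append(z_k)
        y.append([f_act(zj) for zj in z_k])
    return y, z                                     # y[-1] is the network output Y

# Tiny 3-2-1 example with arbitrary weights (illustrative values).
W = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [[1.0], [-1.0]]]
b = [[0.0, 0.1], [0.05]]
y, z = forward_componentwise([1.0, 2.0, -1.0], W, b)
print("output Y =", y[-1])
```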
Computing the derivatives: the backward pass. Starting from the divergence at the output, we work backwards through the network, reusing the values stored during the forward pass.
– Initialize: the gradient w.r.t. the network output, ∂Div/∂y_i⁽ᴺ⁾ (this depends on the chosen divergence)
– Derivative w.r.t. the pre-activation: multiply by the derivative of the activation function of that layer, evaluated at values computed during the forward pass:
  ∂Div/∂z_i⁽ᵏ⁾ = f_k'(z_i⁽ᵏ⁾) · ∂Div/∂y_i⁽ᵏ⁾
– Derivative w.r.t. the previous layer's output: because ∂z_j⁽ᵏ⁾/∂y_i⁽ᵏ⁻¹⁾ = w_ij⁽ᵏ⁾, each y_i⁽ᵏ⁻¹⁾ collects contributions from every unit it feeds:
  ∂Div/∂y_i⁽ᵏ⁻¹⁾ = Σj w_ij⁽ᵏ⁾ · ∂Div/∂z_j⁽ᵏ⁾
– Derivative w.r.t. a weight:
  ∂Div/∂w_ij⁽ᵏ⁾ = y_i⁽ᵏ⁻¹⁾ · ∂Div/∂z_j⁽ᵏ⁾
(The figure assumes, but does not show, the "1" bias nodes; for the bias, ∂Div/∂b_j⁽ᵏ⁾ = ∂Div/∂z_j⁽ᵏ⁾.)
The backward pass as an algorithm:
– Initialize: ∂Div/∂y_i⁽ᴺ⁾ for all i (the gradient w.r.t. the network output)
– For k = N … 1:
  – For i = 1 : D_k:   ∂Div/∂z_i⁽ᵏ⁾ = f_k'(z_i⁽ᵏ⁾) · ∂Div/∂y_i⁽ᵏ⁾
  – For i = 1 : D_(k−1):   ∂Div/∂y_i⁽ᵏ⁻¹⁾ = Σj w_ij⁽ᵏ⁾ · ∂Div/∂z_j⁽ᵏ⁾
  – For all i, j:   ∂Div/∂w_ij⁽ᵏ⁾ = y_i⁽ᵏ⁻¹⁾ · ∂Div/∂z_j⁽ᵏ⁾,   ∂Div/∂b_j⁽ᵏ⁾ = ∂Div/∂z_j⁽ᵏ⁾
This is called "Backpropagation" because the derivative of the error is propagated "backwards" through the network. It is very analogous to the forward pass: a backward weighted combination of the next layer's derivatives, followed by the backward equivalent of the activation (multiplication by the activation's derivative).
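A matching componentwise backward-pass sketch, mirroring the pseudocode above; it pairs with the forward-pass sketch given earlier, and the caller supplies the activation derivative and the output gradient (all names are mine):

```python
def backward_componentwise(y, z, W, dDiv_dY, f_prime):
    """y, z: stored values from the forward pass (y[0] is the input).
    dDiv_dY: list of dDiv/dy_i^(N), the gradient w.r.t. the network output.
    f_prime: derivative of the scalar activation, evaluated on z values.
    Returns dDiv/dw_ij^(k) and dDiv/db_j^(k) for every layer."""
    N = len(W)
    dy = list(dDiv_dY)                              # dDiv/dy^(N)
    dW, db = [None] * N, [None] * N
    for k in reversed(range(N)):                    # for k = N .. 1
        # dDiv/dz_j^(k) = f'(z_j^(k)) * dDiv/dy_j^(k)
        dz = [f_prime(z[k][j]) * dy[j] for j in range(len(dy))]
        # dDiv/dw_ij^(k) = y_i^(k-1) * dDiv/dz_j^(k);  dDiv/db_j^(k) = dDiv/dz_j^(k)
        dW[k] = [[y[k][i] * dz[j] for j in range(len(dz))] for i in range(len(y[k]))]
        db[k] = dz
        # dDiv/dy_i^(k-1) = sum_j w_ij^(k) * dDiv/dz_j^(k)
        dy = [sum(W[k][i][j] * dz[j] for j in range(len(dz))) for i in range(len(y[k]))]
    return dW, db
```

For a sigmoid activation, f_prime(z) = sigma(z) * (1 - sigma(z)); the returned derivatives can be sanity-checked against finite differences.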
This derivation assumed that:
1. The computation of the output of one neuron does not directly affect the computation of other neurons in the same (or previous) layers
2. Outputs of neurons only combine through weighted addition
3. Activations are actually differentiable
All of these conditions are frequently not applicable. The special cases below will appear in the quiz; please read the slides.
Special case 1: vector activations. A vector activation computes all the outputs of a layer jointly from all the inputs (e.g. softmax), rather than each output from its own pre-activation.
– Scalar activation: modifying a single z_i⁽ᵏ⁾ only changes the corresponding y_i⁽ᵏ⁾; each z influences one y
– Vector activation: modifying a single z_i⁽ᵏ⁾ potentially changes all of y_1⁽ᵏ⁾, y_2⁽ᵏ⁾, …; each z influences all y
For scalar activations, the number of outputs y⁽ᵏ⁾ is the same as the number of inputs z⁽ᵏ⁾, and the derivative of the error w.r.t. the input to the unit is a simple product of derivatives:
  ∂Div/∂z_i⁽ᵏ⁾ = f_k'(z_i⁽ᵏ⁾) · ∂Div/∂y_i⁽ᵏ⁾
For vector activations, the derivative of the error w.r.t. any input is a sum of partial derivatives, regardless of the number of outputs:
  ∂Div/∂z_i⁽ᵏ⁾ = Σj (∂y_j⁽ᵏ⁾/∂z_i⁽ᵏ⁾) · ∂Div/∂y_j⁽ᵏ⁾
Note: derivatives of scalar activations are just a special case of vector activations, with ∂y_j⁽ᵏ⁾/∂z_i⁽ᵏ⁾ = 0 for i ≠ j. The full derivations of the softmax and other special cases are on the slides; please look them up, they will appear in the quiz!
Vector activations can take many forms, e.g. linear combinations, polynomials, logistic (softmax), etc.
Special case 2: multiplicative networks. Some networks contain units that multiply the outputs of previous-layer units, in contrast to the additive combination we have seen so far:
  Forward:  o_i⁽ᵏ⁾ = y_j⁽ᵏ⁻¹⁾ · y_l⁽ᵏ⁻¹⁾
The backward rule follows directly from the product:
  ∂Div/∂y_j⁽ᵏ⁻¹⁾ = (∂o_i⁽ᵏ⁾/∂y_j⁽ᵏ⁻¹⁾) · ∂Div/∂o_i⁽ᵏ⁾ = y_l⁽ᵏ⁻¹⁾ · ∂Div/∂o_i⁽ᵏ⁾
and likewise  ∂Div/∂y_l⁽ᵏ⁻¹⁾ = y_j⁽ᵏ⁻¹⁾ · ∂Div/∂o_i⁽ᵏ⁾.
z(k) y(k-1) y(k)
136
z(k) y(k-1) y(k)
Y, Div
Div(Y,d) fN fN Div y(N) z(N) y(N-1) z(N-1) y(k) z(k) y(k-1) z(k-1)
For k = N…1 For i = 1:layer-width
If layer has vector activation Else if activation is scalar
– For
(,)
– For
z(N) y(N) KL Div d Div softmax
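A quick numeric confirmation of that softmax-plus-cross-entropy simplification, comparing the claimed analytic derivative Y − d against finite differences (the values are illustrative):

```python
import numpy as np

softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
div = lambda z, d: -np.sum(d * np.log(softmax(z)))   # KL/cross-entropy after softmax

z = np.array([1.2, -0.3, 0.5, 0.0])
d = np.array([0.0, 0.0, 1.0, 0.0])                   # one-hot target

analytic = softmax(z) - d                            # claimed simplification: Y - d
numeric = np.array([(div(z + h, d) - div(z - h, d)) / (2 * 1e-6)
                    for h in 1e-6 * np.eye(4)])      # central differences per component
print(analytic)
print(numeric)   # matches the analytic result to ~1e-9
```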
Special case 3: non-differentiable activations. Activations may not be differentiable at all points:
– E.g. the RELU (Rectified Linear Unit):  f(z) = z for z ≥ 0,  f(z) = 0 for z < 0 (not differentiable at z = 0)
– E.g. the "max" function:  y = max(z1, z2, …, zN)
We handle these with subgradients, or "secants": a subgradient of a convex ("bowl"-shaped) function at a point is any vector v such that the linear extrapolation along v never overestimates the function, i.e. f(x') ≥ f(x) + vᵀ(x' − x) for all x'. For non-convex functions the equivalent concept is a "quasi-secant". At the differentiable points on the curve this coincides with the gradient, though the gradient is not always a subgradient; typically we will simply use the derivative equation given, picking one value at the non-differentiable point.
For the max unit y = max(z1, …, zN), the subderivative is:
– 1 w.r.t. the largest incoming input
– 0 for the rest
A related unit is max-pooling over subsets of inputs (will be seen in convolutional networks): the derivative is 1 for the specific component that is maximum in the corresponding input subset, and 0 otherwise.
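A sketch of these subgradient rules (using derivative 0 at the RELU's kink and breaking max ties at the first maximum are my own choices):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_subgrad(z):
    # 1 where z > 0, 0 where z < 0; at z == 0 we simply pick 0 (any value in [0,1] is a subgradient).
    return (z > 0).astype(float)

def max_unit(z):
    return np.max(z)

def max_subgrad(z):
    # 1 w.r.t. the largest incoming input, 0 for the rest (ties broken at the first maximum).
    g = np.zeros_like(z, dtype=float)
    g[np.argmax(z)] = 1.0
    return g

z = np.array([-1.0, 0.5, 2.0, 0.0])
print(relu(z), relu_subgrad(z))
print(max_unit(z), max_subgrad(z))
```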
Putting it all together: the gradient for each training instance is computed in two sweeps:
– Forward pass: pass the instance forward through the net; store all intermediate outputs of all computations
– Backward pass: sweep backward through the net, iteratively computing all derivatives w.r.t. the weights
The complete training algorithm:
– Initialize all weights w_ij⁽ᵏ⁾
– Do:
  – Initialize Err = 0; for all i, j, k initialize dErr/dw_ij⁽ᵏ⁾ = 0
  – For all t (loop over training instances):
    – Forward pass: compute the output Y_t;  Err += Div(Y_t, d_t)
    – Backward pass: compute dDiv(Y_t, d_t)/dw_ij⁽ᵏ⁾ for all i, j, k
    – Accumulate:  dErr/dw_ij⁽ᵏ⁾ += dDiv(Y_t, d_t)/dw_ij⁽ᵏ⁾
  – For all i, j, k, update:
    w_ij⁽ᵏ⁾ = w_ij⁽ᵏ⁾ − (η/T) · dErr/dw_ij⁽ᵏ⁾
– Until Err has converged
Vector formulation: so far the algorithm was stated unit by unit, but it is far more convenient to think of the process in terms of vector operations:
– Simpler arithmetic
– Fast matrix libraries make the operations much faster
The full derivation is on the slides; please read it: this is what is actually used in any real system, and it will appear in the quiz.
Arrange all the weights of layer k into a matrix W_k and, similarly with the biases, into a vector b_k. (Alternatively, by setting an extra constant input of 1, the bias can be absorbed into the weight matrix.) The computation of an entire layer can then be written in vector notation as:
  z_k = W_k y_(k−1) + b_k
  y_k = f_k(z_k)
The forward pass of the complete network, layer by layer:
  y_0 = x
  z_1 = W_1 y_0 + b_1,   y_1 = f_1(z_1)
  z_2 = W_2 y_1 + b_2,   y_2 = f_2(z_2)
  …
  z_N = W_N y_(N−1) + b_N,   Y = y_N = f_N(z_N)
and finally the divergence Div(Y, d) is computed from the output.
The forward pass in vector notation (algorithm):
– Initialize: y_0 = x
– For k = 1 to N (recursion):
    z_k = W_k y_(k−1) + b_k
    y_k = f_k(z_k)
– Output: Y = y_N
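The same forward pass written with matrix operations in numpy (the layer sizes, random weights, and sigmoid activation are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b):
    """Vector-notation forward pass: z_k = W_k y_{k-1} + b_k, y_k = f_k(z_k).
    W[k] has shape (D_k, D_{k-1}); b[k] has shape (D_k,)."""
    ys, zs = [np.asarray(x, float)], []
    for Wk, bk in zip(W, b):
        zs.append(Wk @ ys[-1] + bk)
        ys.append(sigmoid(zs[-1]))
    return ys, zs

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                                         # D_0 .. D_N (illustrative)
W = [rng.normal(size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]
ys, zs = forward(rng.normal(size=sizes[0]), W, b)
print("Y =", ys[-1])
```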
Using vector notation, the derivative of a layer's output y w.r.t. its input z is a Jacobian matrix ∂y/∂z (the superscript "(k)" is not shown in the equations for brevity):
– For scalar activations, the number of outputs is identical to the number of inputs; the Jacobian is a diagonal matrix whose diagonal entries are the individual derivatives of the outputs w.r.t. the inputs, f'(z_i)
– For vector activations (e.g. softmax), the Jacobian is a full matrix whose entries are the partial derivatives of individual outputs w.r.t. individual inputs, ∂y_j/∂z_i
Similarly, the affine stage z_k = W_k y_(k−1) + b_k (weights and bias) produces the vector z_k from y_(k−1); its Jacobian w.r.t. y_(k−1) is simply the weight matrix W_k.
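A small sketch contrasting the two Jacobian shapes, with the sigmoid as the scalar activation and the softmax as the vector activation (both Jacobian formulas are standard):

```python
import numpy as np

z = np.array([1.0, -0.5, 0.3])

# Scalar activation (sigmoid): the Jacobian dy/dz is diagonal.
s = 1.0 / (1.0 + np.exp(-z))
J_sigmoid = np.diag(s * (1 - s))

# Vector activation (softmax): the Jacobian is a full matrix,
# dy_j/dz_i = y_j * (delta_ij - y_i).
y = np.exp(z - z.max()); y /= y.sum()
J_softmax = np.diag(y) - np.outer(y, y)

print(J_sigmoid)
print(J_softmax)
```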
The chain rule in vector notation: for a composition of (vector) functions, the derivative of the composition is the product of the Jacobians of the stages. Note the order: the derivative of the outer function comes first; when the final quantity is a scalar (the divergence) and the innermost argument is a vector, the same rule applies with a gradient in place of the outermost Jacobian. We write ∇_x y to explicitly illustrate the chain rule; in general ∇_x y represents a derivative of y w.r.t. x, and could be a gradient (for scalar y) or a Jacobian (for vector y).
Applying this to the network gives the backward pass in vector notation:
– Initialize: compute ∇_Y Div; the actual gradient depends on the divergence function
– Recursion, for k = N … 1:
    ∇_(z_k) Div = ∇_(y_k) Div · J_(y_k)(z_k)      (the Jacobian J is a diagonal matrix for scalar activations)
    ∇_(y_(k−1)) Div = ∇_(z_k) Div · W_k           (the derivative w.r.t. the layer's input)
– Derivatives w.r.t. the parameters:
    ∇_(W_k) Div = y_(k−1) · ∇_(z_k) Div,    ∇_(b_k) Div = ∇_(z_k) Div
(with gradients treated as row vectors, so ∇_(W_k) Div is an outer product; note the reversal of order relative to the forward pass, which is in fact a simplification). Note the analogy to the forward pass.
Training in vector notation (the complete algorithm):
– Initialize all W_k, b_k
– Do:
  – Initialize Err = 0; for all k, initialize ∇_(W_k)Err = 0, ∇_(b_k)Err = 0
  – For all t (loop over training instances):
    – Forward pass: output Y(X_t); divergence Div(Y_t, d_t);  Err += Div(Y_t, d_t)
    – Backward pass: compute ∇_(W_k)Div(Y_t, d_t) and ∇_(b_k)Div(Y_t, d_t) for all k
    – Accumulate:  ∇_(W_k)Err += ∇_(W_k)Div(Y_t, d_t);  ∇_(b_k)Err += ∇_(b_k)Div(Y_t, d_t)
  – For all k, update:
    W_k = W_k − (η/T) ∇_(W_k)Err;   b_k = b_k − (η/T) ∇_(b_k)Err
– Until Err has converged
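Finally, a compact end-to-end sketch of this algorithm in numpy: a two-layer sigmoid network with an L2 divergence, trained by full-batch gradient descent on a toy random dataset (all sizes, data, and the learning rate are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: T instances of 3-dimensional inputs and 2-dimensional targets.
T = 50
X = rng.normal(size=(T, 3))
D = rng.uniform(size=(T, 2))

sizes = [3, 4, 2]                                   # D_0, D_1, D_2
W = [rng.normal(scale=0.5, size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]
eta = 0.5

for epoch in range(200):
    Err = 0.0
    dW = [np.zeros_like(Wk) for Wk in W]
    db = [np.zeros_like(bk) for bk in b]
    for x, d in zip(X, D):                          # loop over training instances
        # Forward pass: store all intermediate y's and z's.
        ys, zs = [x], []
        for Wk, bk in zip(W, b):
            zs.append(Wk @ ys[-1] + bk)
            ys.append(sigmoid(zs[-1]))
        Err += 0.5 * np.sum((ys[-1] - d) ** 2)      # L2 divergence
        # Backward pass.
        dy = ys[-1] - d                             # gradient of the divergence w.r.t. Y
        for k in reversed(range(len(W))):
            dz = sigmoid(zs[k]) * (1 - sigmoid(zs[k])) * dy
            dW[k] += np.outer(dz, ys[k])            # accumulate grad w.r.t. W_k
            db[k] += dz                             # accumulate grad w.r.t. b_k
            dy = W[k].T @ dz                        # grad w.r.t. the previous layer's output
    # Update step (average over the T instances).
    for k in range(len(W)):
        W[k] -= (eta / T) * dW[k]
        b[k] -= (eta / T) * db[k]
    if epoch % 50 == 0:
        print(f"epoch {epoch:3d}: Err = {Err:.4f}")
```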
Returning to the digit-recognition problems:
– Binary problem ("is this a 2?"): training data of (image, yes/no) pairs and a sigmoid output neuron, trained by backpropagation
– Multi-class problem: training data of (image, label) pairs; the first ten outputs correspond to the ten digits; the ideal output has one of the outputs going to 1 and the others going to 0
Remaining issues: how well does the network learn, and how can we improve it? And how well will it generalize (outside the training data)?