Neural Networks Learning the network: Part 1
11-785, Fall 2020 Lecture 3
Topics for the day
– The problem of learning
– The perceptron rule for perceptrons
– And its inapplicability to multi-layer perceptrons
– Greedy solutions for training the network
– Can model any Boolean function
– Can model any classification boundary
– Can model any continuous valued function
– Networks with fewer than the required number of parameters can be very poor approximators
[Figure: example tasks. A neural network maps a voice signal to its transcription, an image to a text caption, and a game state to the next move]
[Figure: a neural network as a box that maps "something" to "something weird"]
– General setting, inputs are real valued
– A bias representing a threshold to trigger the perceptron
– Activation functions are not necessarily threshold functions
– The parameters of the perceptron: its weights and bias
– The bias can also be viewed as the weight of an additional input component that is always set to 1
– If the bias is not explicitly mentioned, we will implicitly be assuming that every perceptron has an additional input that is always fixed at 1
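To make the 1-extension concrete, here is a minimal Python sketch (not from the slides; the weights, input, and threshold activation are illustrative choices):

```python
import numpy as np

def perceptron(x, w, activation=lambda z: 1.0 if z >= 0 else 0.0):
    """One perceptron using 1-extended notation: the bias is w[-1],
    the weight on an input component that is always fixed at 1."""
    x_ext = np.append(x, 1.0)          # append the constant-1 component
    z = np.dot(w, x_ext)               # weighted sum = w.x + bias
    return activation(z)

# Example (weights and inputs are illustrative, not from the lecture):
w = np.array([1.0, -2.0, 0.5])         # two real weights plus bias 0.5
print(perceptron(np.array([3.0, 1.0]), w))   # -> 1.0, since 3 - 2 + 0.5 >= 0
```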
– No loops: neuron outputs do not feed back to their inputs directly or indirectly
– Loopy networks are a future topic
– The architecture of the network: how many layers/neurons, which neuron connects to which and how, etc.
– It must be capable of representing the needed function
– The weights associated with the blue arrows in the picture
– We must set the parameters such that the network computes the desired function
The network is a function f() with parameters W, which must be set to the appropriate values to get the desired behavior from the net
[Worked example: hand-constructing a network of simple threshold perceptrons over inputs X1 and X2 that produces the desired output for a decision region defined by the points (0,1), (0,−1) and (1,0); each perceptron implements one linear boundary (using weights of ±1) and a further perceptron combines their outputs]
– We will assume the network has the capacity to exactly represent the desired function
– Basically, get input-output pairs for a number of samples of input
– Good sampling: the samples of X will be drawn from the true distribution of the input
– E.g. a set of images and their class labels
– E.g. speech recordings and their transcriptions
[Figure: training data as input-output pairs (Xi, di)]
– For most of the input space we don’t have training samples
– Learning the network: determining the parameters of the network (weights and biases) required for it to model the desired function
– The network must have sufficient capacity to model the function
– We cannot observe the desired function everywhere
– Instead, we draw “training” instances of the function and estimate network parameters to “fit” the input-output relation at these instances
– And hope it fits the function elsewhere as well
– Given a number of training input-output pairs
[Figure: a perceptron with inputs x1, x2, x3, …, xN plus an additional input xN+1 = 1 whose weight WN+1 serves as the bias]
– The perceptron defines a hyperplane that separates two groups of points
– Note: the weight vector W is perpendicular to this hyperplane; any point X on the hyperplane satisfies $\sum_i w_i x_i = W^T X = 0$
Key: Red = +1, Blue = −1
– Given a set of training instances
– Cycle through the training instances, updating the weights whenever an instance is misclassified
Using a +1/-1 representation for classes to simplify notation
– I.e. randomly initialize the normal vector
– Points on one side of the hyperplane will be assigned the +1 class, and those on the other side will be assigned −1
[Figure sequence: the perceptron algorithm in action. After initialization of W, the weights are updated on each misclassified +1 (red) or −1 (blue) instance until all training points are correctly classified]
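A minimal Python sketch of the perceptron rule just described (the toy data, the treatment of boundary points, and the epoch limit are illustrative assumptions):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron rule with 1-extended inputs and +1/-1 labels:
    whenever an instance is misclassified, add y_i * x_i to W."""
    X_ext = np.hstack([X, np.ones((len(X), 1))])   # absorb bias as last weight
    W = np.random.randn(X_ext.shape[1])            # random initial normal vector
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X_ext, y):
            if y_i * np.dot(W, x_i) <= 0:           # misclassified (or on boundary)
                W += y_i * x_i                      # rotate the hyperplane toward x_i
                errors += 1
        if errors == 0:                             # converged: all points classified
            break
    return W

# Toy linearly separable data (illustrative only)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
W = perceptron_train(X, y)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ W))   # -> [ 1.  1. -1. -1.]
```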
[Figure: a more complex two-class problem over the input plane]
– The desired output is 1 in the yellow regions, 0 outside
– Making incremental corrections every time we encounter an error
– Can it be learned from this data?
– The individual classifier actually requires the kind of labelling shown here
– Must know the output of every neuron for every training instance, in order to learn this neuron
– The outputs should be such that the neuron individually has a linearly separable task
– The linear separators must combine to form the desired boundary
– This must be done for every neuron
– Getting any of them wrong will result in incorrect output!
Individual neurons represent one of the lines that compose the figure (linear classifiers)
– The outputs of the individual intermediate neurons are not specified for any training input instance, which makes this a hard problem
Training data only specifies the input and output of the network. Intermediate outputs (outputs of the hidden neurons) are not specified by the training data.
Bernie Widrow
– Inventor of the ADALINE and the LMS algorithm, one of the most influential algorithms in signal processing and machine learning!
– Weighted sum on inputs and bias passed through a thresholding function
Using 1-extended vector notation to account for bias
Error for a single input
[Figure: side-by-side diagrams of the Perceptron and the ADALINE unit, showing the 1-extended inputs, the weighted sum z, the output, the desired output d, and the error ε]
– The same rule applies when the weighted sum is passed through a differentiable function f(z)
– Lookahead: Note that this is exactly backpropagation in multilayer nets if we let f() represent the entire network between z and the output
– This rule (the LMS rule) is among the most widely used learning rules in machine learning and signal processing
– Variants of it appear in almost every problem
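A minimal Python sketch of the ADALINE / LMS update under these definitions (the learning rate, epoch count, and toy data are illustrative assumptions): each instance nudges the weights along the negative gradient of the squared error between the desired output d and the linear response z.

```python
import numpy as np

def adaline_train(X, d, lr=0.01, epochs=50):
    """LMS / delta rule: for each instance, z = W.x and W += lr * (d - z) * x,
    i.e. a gradient step on the squared error (d - z)^2 / 2."""
    X_ext = np.hstack([X, np.ones((len(X), 1))])   # 1-extension for the bias
    W = np.zeros(X_ext.shape[1])
    for _ in range(epochs):
        for x_i, d_i in zip(X_ext, d):
            z = np.dot(W, x_i)
            W += lr * (d_i - z) * x_i              # delta rule update
    return W

# Toy usage with +1/-1 targets (illustrative only)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
d = np.array([+1.0, +1.0, -1.0, -1.0])
W = adaline_train(X, d)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ W))   # -> [ 1.  1. -1. -1.]
```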
– Classify an input
– If there is an error, find the unit whose z is closest to 0
– Flip the output of that unit and compute the new network output
– If the error reduces, take the flipped value as that unit’s desired output and update its weights accordingly
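The sketch below is one possible reading of this greedy flip procedure (the two-layer structure, the function names, and the ADALINE-style inner update are assumptions, not taken from the slides):

```python
import numpy as np

def net_output(W_hidden, w_out, x):
    """Two-layer threshold network: hidden activations in {-1,+1}, then output sign."""
    z = W_hidden @ x                         # hidden weighted sums
    h = np.where(z >= 0, 1.0, -1.0)          # hidden threshold outputs
    return np.sign(w_out @ h), z, h

def greedy_flip_step(W_hidden, w_out, x, d, lr=0.1):
    """If the network errs on (x, d), try flipping the hidden unit whose z is
    closest to 0; keep the flip (by nudging that unit's weights) if it helps."""
    y, z, h = net_output(W_hidden, w_out, x)
    if y == d:
        return W_hidden                      # no error, nothing to do
    j = np.argmin(np.abs(z))                 # unit with the least confident output
    h_try = h.copy()
    h_try[j] = -h_try[j]                     # tentatively flip its output
    if np.sign(w_out @ h_try) == d:          # the flip removes the error
        # train unit j toward the flipped target with an ADALINE-style update
        W_hidden[j] += lr * (h_try[j] - z[j]) * x
    return W_hidden
```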
– Will require a network with sufficient “capacity”
– We learn the network from “training” instances drawn from the target function
– A single perceptron (with threshold-function activation) can be learned in linear time if the classes are linearly separable
– Learning a multi-layer network, however, requires knowledge of the input-output relation for every training instance, for every perceptron in the network
– These must be determined as part of training
– For threshold activations, this is an NP-complete combinatorial optimization problem
– The derivative of the threshold activation is zero everywhere, except at 0 where it is non-differentiable
– You can vary the weights a lot without changing the error
– There is no indication of which direction to change the weights to reduce error
– Actually a function with 0 derivative nearly everywhere, and no derivatives at the boundaries
– Solution: use an activation whose output varies smoothly with its input over much of the input space
– Small changes in weight can result in non-negligible changes in output
– This enables us to estimate the parameters using gradient descent techniques
– The classification error (number of misclassified instances) does not change for small changes in the threshold
– Does not indicate if moving the threshold left was good or not
[Figure: a 1-D example with a soft (sigmoid-like) activation plotted for two threshold positions T1 and T2; the output passes through 0.5 at the threshold]
– Can now quantify how much the output differs from the desired target value (0 or 1)
– Moving the function left or right changes this quantity, even if the classification error itself doesn’t change
[Figure sequence: training pairs (x, y) with y ∈ {0, 1}, collected in increasing numbers]
– The average value of y in the vicinity of any x is an approximation of the probability of Y = 1 at that point
[Figure: data with labels y = 0 and y = 1 along x, with a logistic function fit to it]
– It actually computes the probability that the input belongs to class 1
[Figure: a sigmoid-activation perceptron with inputs x1 and x2. Decision: y > 0.5?]
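A small Python sketch of such a unit (the weights and input are illustrative): the logistic output is read as the probability of class 1, and the decision compares it to 0.5.

```python
import numpy as np

def logistic_perceptron(x, w, b):
    """Sigmoid-activation perceptron: y = 1 / (1 + exp(-(w.x + b))).
    y is interpretable as the probability that the input belongs to class 1."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.5, -0.7])     # illustrative weights for inputs x1, x2
b = 0.2
x = np.array([0.8, 0.3])
y = logistic_perceptron(x, w, b)
print(y, "-> class 1" if y > 0.5 else "-> class 0")
```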
– The output y is well-defined and finite for all inputs and weights
– y is a differentiable function of both the inputs $x_i$ and the weights $w_i$
– The output changes smoothly with changes in either the input or the weights
– Small changes in the parameters result in measurable changes in the output
– $w_{i,j}^{(k)}$ = weight connecting the i-th unit of the (k−1)-th layer to the j-th unit of the k-th layer
– The output is differentiable w.r.t. both the inputs and the weights $w_{i,j}^{(k)}$, and similarly w.r.t. the bias connections
– We can compute how small changes in the parameters change the output
– We will derive the actual derivatives using the chain rule later
– d is the desired output of the network in response to the input X; X and d may both be vectors
– We must set the parameters so that the network produces the desired output for each training input
– Or a close approximation of it
– The architecture of the network must be specified by us
– We will assume the network has the capacity to exactly represent the desired function
– Obtain input-output pairs for a number of samples of input
– Ideally the samples of X will be drawn from the true distribution of the input
– Minimize the average divergence over all training instances (the empirical risk)
– Define a divergence (error) between the actual and desired outputs of the network
Note 1: It’s really a measure of error, but using standard terminology, we will call it a “loss”
Note 2: The empirical risk is only an empirical approximation to the true risk, which is our actual minimization objective
Note 3: For a given training set, the loss is only a function of W
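To make the notation concrete, here is a small Python sketch of the empirical risk (the squared-error divergence and the function names are illustrative assumptions): it is the average per-instance divergence between the network output f(X; W) and the desired output d, and for a fixed training set it depends only on W.

```python
import numpy as np

def divergence(y, d):
    """Per-instance divergence; squared error is used here for illustration."""
    return 0.5 * np.sum((y - d) ** 2)

def empirical_risk(f, W, data):
    """Average divergence over all training pairs (X_i, d_i).
    For a fixed training set this is a function of the parameters W alone."""
    return np.mean([divergence(f(X_i, W), d_i) for X_i, d_i in data])

# Illustrative use with a linear 'network' f(X; W) = W.X
f = lambda X, W: np.dot(W, X)
data = [(np.array([1.0, 2.0]), 1.0), (np.array([-1.0, 0.5]), 0.0)]
print(empirical_risk(f, np.array([0.3, 0.1]), data))
```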
– With threshold activations, learning the network is a combinatorial-optimization problem
– Because we cannot compute the influence of small changes to the parameters on the overall error
– Instead we use differentiable activation functions, which let us use continuous optimization methods to estimate network parameters
– This makes the output of the network differentiable w.r.t every parameter in the network
– The logistic activation perceptron actually computes the a posteriori probability of the output given the input
– We define a divergence between the actual output and the desired output for the training instances
– And a total error, which is the average divergence over all training instances
– This is empirical risk minimization
– The derivative of a scalar function f(X) of a multivariate input X is a row vector: $\frac{df(X)}{dX} = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]$
– Its i-th component gives us how much f increments when only $x_i$ is incremented
– Note: X is now a vector
– Note: X is now a vector
– We will use the $\nabla$ symbol for vector and matrix derivatives
– $\nabla_x f$ denotes the derivative of the function w.r.t. a variable “x”
– In the neural network optimization problem we would be optimizing over the network weights; here “x” is simply the variable that we’re optimizing a function over and not the input to a neural network
[Figure: a function f(x) with its global minimum, a local minimum, an inflection point, and the global maximum marked]
– At a turning point (critical point) the derivative is 0: $\frac{df(x)}{dx} = 0$
– Solve $\frac{df(x)}{dx} = 0$ for x to find candidate optima
– The derivative goes from positive to negative, or vice versa, at such a point
– Solve $\frac{df(x)}{dx} = 0$ for x
– If $\frac{d^2 f(x)}{dx^2} > 0$ at the solution, it is a minimum; otherwise it is a maximum
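A short worked example of this recipe (the function is chosen purely for illustration):

$$
f(x) = x^2 - 2x + 3, \qquad \frac{df}{dx} = 2x - 2 = 0 \;\Rightarrow\; x = 1, \qquad \frac{d^2 f}{dx^2} = 2 > 0,
$$

so $x = 1$ is a minimum, with $f(1) = 2$.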
– These can be local maxima, local minima, or inflection points
– The second derivative is positive (or 0) at minima, negative (or 0) at maxima, and zero at inflection points
[Figure: critical points, where the derivative is 0: a maximum, a minimum, and an inflection point]
– These can be local maxima, local minima, or inflection points
– The second derivative is positive at minima, negative at maxima, and zero at inflection points
[Figure: the sign of the second derivative at the critical points: negative, positive, and zero]
– Shifting x in any direction away from a minimum will increase the value
– For smooth functions, shifting by a minuscule amount will not change the value of the function at all
– The gradient vector $\nabla_X f(X)$
– Moving in the direction of the gradient increases $f(X)$ fastest
– Moving in the opposite direction decreases $f(X)$ fastest
– The gradient is 0 at the optima (both minima and maxima)
– The gradient $\nabla_X f(X)$ is perpendicular to the level curve
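A small numerical illustration of these properties (the function, step size, and comparison direction are arbitrary choices, not from the slides): finite differences approximate the gradient, and a step along the negative gradient decreases the function faster than an equally long step in another direction.

```python
import numpy as np

def f(X):
    return X[0] ** 2 + 3.0 * X[1] ** 2        # simple bowl-shaped function

def numerical_gradient(f, X, eps=1e-6):
    """Finite-difference approximation of the gradient of f at X."""
    g = np.zeros_like(X)
    for i in range(len(X)):
        dX = np.zeros_like(X)
        dX[i] = eps
        g[i] = (f(X + dX) - f(X - dX)) / (2 * eps)
    return g

X = np.array([1.0, 1.0])
g = numerical_gradient(f, X)                   # approximately [2, 6]
step = 0.1
down = X - step * g / np.linalg.norm(g)        # step of length 0.1 along -gradient
other = X + step * np.array([0.0, -1.0])       # step of the same length elsewhere
print(f(X), f(down), f(other))                 # f(down) is the smallest
```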
– For a multivariate function f(X): solve $\nabla_X f(X) = 0$ for candidate points
– Compute the Hessian $\nabla^2_X f(X)$, the n × n matrix of second derivatives, at each candidate
– If the Hessian is positive definite the candidate is a local minimum; if negative definite, a local maximum
[Figure: a function f(X) plotted against X]
– Iterative solution: start from an initial guess for the optimal X
– Update the guess towards a (hopefully) “better” value of f(X)
– Stop when f(X) no longer decreases
– Problems: which direction to step in, and how big the steps must be
[Figure: f(X) vs. X with successive iterates x0, x1, x2, x3, x4, x5 approaching the minimum]
– If the derivative is negative, moving right decreases the error
– If the derivative is positive, moving left decreases the error
– If the derivative is positive: $x \leftarrow x - \text{step}$
– If the derivative is negative: $x \leftarrow x + \text{step}$
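A tiny Python sketch of this fixed-step scheme (the objective, step size, and iteration limit are illustrative):

```python
def sign_step_minimize(df, x, step=0.1, iters=100):
    """Move x against the sign of the derivative by a fixed step."""
    for _ in range(iters):
        d = df(x)
        if d > 0:
            x -= step        # positive derivative: move left
        elif d < 0:
            x += step        # negative derivative: move right
        else:
            break            # zero derivative: stop
    return x

# Minimize f(x) = (x - 2)^2, whose derivative is 2(x - 2)
print(sign_step_minimize(lambda x: 2 * (x - 2), x=0.0))   # ends up oscillating near x = 2
```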
– The overall gradient descent update: $x^{(k+1)} = x^{(k)} - \eta^{(k)} \, \nabla_x f\!\left(x^{(k)}\right)^T$, starting from an initial guess $x^{(0)}$
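A compact Python sketch of this update (the quadratic objective, learning rate, and stopping tolerance are illustrative assumptions):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, tol=1e-6, max_iters=1000):
    """Repeat x <- x - lr * grad(x) until the gradient is (nearly) zero."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - lr * g
    return x

# Minimize f(x) = x1^2 + 3*x2^2; its gradient is [2*x1, 6*x2]
grad = lambda x: np.array([2.0 * x[0], 6.0 * x[1]])
print(gradient_descent(grad, [1.0, 1.0]))     # -> approximately [0, 0]
```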