Neural Networks Learning the network: Part 1
11-785, Spring 2018 Lecture 3
Designing a net..
– Input: An N-D real vector
– Output: A class (binary classification)
– Input units?
– Output units?
– Architecture?
[Figure: network over input X with weights w, w/2, and w/4, 1/-1 outputs, and an activation stage]
– Can model any Boolean function
– Can model any classification boundary
– Can model any continuous valued function
– Networks with fewer than required parameters can be very poor approximators
– Voice signal → N.Net → Transcription
– Image → N.Net → Text caption
– Game state → N.Net → Next move
N.Net Something
Something weird
– General setting: inputs are real valued
– Activation functions are not necessarily threshold functions
– A bias representing a threshold to trigger the perceptron
– The bias can also be viewed as the weight of an additional input component that is always set to 1
– If the bias is not explicitly mentioned, we will implicitly be assuming that every perceptron has an additional input that is always fixed at 1
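The equivalence between an explicit bias and an extra input fixed at 1 can be checked with a small sketch. The function names and the numeric values below are illustrative assumptions, not taken from the lecture.

```python
def weighted_sum_explicit_bias(x, w, b):
    """Weighted sum of the inputs plus an explicit bias term."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def weighted_sum_extended(x, w_ext):
    """Same sum, with the bias folded in as the weight of a constant input 1."""
    x_ext = list(x) + [1.0]  # append the always-1 component
    return sum(wi * xi for wi, xi in zip(w_ext, x_ext))


# Hypothetical input, weights, and bias, chosen only for illustration.
x = [0.5, -2.0, 3.0]
w = [1.0, 0.25, -0.5]
b = 0.75

z_explicit = weighted_sum_explicit_bias(x, w, b)
z_extended = weighted_sum_extended(x, w + [b])
print(z_explicit, z_extended)
```

Both calls compute the same value, which is why the bias can be dropped from the notation once every input vector is extended with a 1.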
– No loops: Neuron outputs do not feed back to their inputs directly or indirectly
– Loopy networks are a future topic
– How many layers/neurons, which neuron connects to which and how, etc.
representing the needed function
– The weights associated with the blue arrows in the picture
such that the network computes the desired function
The network is a function f() with parameters W which must be set to the appropriate values to get the desired behavior from the net
has the capacity to exactly represent
– Basically, get input-output pairs for a number of samples of input
– Good sampling: the samples of the input $X$ will be drawn from its underlying distribution $P(X)$
– E.g. set of images and their class labels
– E.g. speech recordings and their transcription
[Figure: training pairs $(X_i, d_i)$]
don’t have training samples
[Figure: perceptron with inputs through $x_N$, plus a constant input $x_{N+1} = 1$ weighted by $W_{N+1}$]
35
36
–
– For
train
Using a +1/-1 representation for classes to simplify notation
– I.e. randomly initialize the normal vector
– Classification rule
[Figure sequence: perceptron learning, with the +1 class shown in blue]
– Initialization
– Misclassified positive instance
– Updated weight vector
– Updated hyperplane
– Misclassified instance, negative class
– Updated hyperplane
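The sequence above (random initialization, a misclassified instance, an updated weight vector) can be sketched in code. This is a minimal sketch that assumes the standard perceptron update, w ← w + d·x for a misclassified instance with label d in {+1, -1}; the update formula itself does not survive in this transcript, and the toy data, function names, and epoch limit are hypothetical.

```python
import random


def predict(w, x_ext):
    """Classify a 1-extended input: +1 if w.x >= 0, else -1."""
    z = sum(wi * xi for wi, xi in zip(w, x_ext))
    return 1 if z >= 0 else -1


def train_perceptron(samples, max_epochs=100, seed=0):
    """Perceptron learning with +1/-1 labels (assumed standard update rule)."""
    rng = random.Random(seed)
    dim = len(samples[0][0]) + 1  # +1 for the constant input
    w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # random normal vector
    for _ in range(max_epochs):
        mistakes = 0
        for x, d in samples:
            x_ext = list(x) + [1.0]
            if predict(w, x_ext) != d:
                # Misclassified: add the instance if positive, subtract if negative.
                w = [wi + d * xi for wi, xi in zip(w, x_ext)]
                mistakes += 1
        if mistakes == 0:  # every training instance is classified correctly
            break
    return w


# Hypothetical linearly separable toy data: (input, label).
data = [
    ((2.0, 1.0), 1), ((1.5, 2.5), 1), ((3.0, 3.0), 1),
    ((-1.0, -2.0), -1), ((-2.5, -0.5), -1), ((-1.5, -3.0), -1),
]
w = train_perceptron(data)
print(w)
```

The loop only terminates early when the data are linearly separable; otherwise it stops at the epoch limit.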
– 1 in the yellow regions, 0 outside
– Must know the output of every neuron for every training instance, in order to learn this neuron
– The outputs should be such that the neuron individually has a linearly separable task
– The linear separators must combine to form the desired boundary
– This must be done for every neuron
– Getting any of them wrong will result in incorrect output!
Individual neurons represent one of the lines that compose the figure (linear classifiers)
problems
for any training input
instance
Training data only specifies input and output of network
Intermediate outputs (outputs
Bernie Widrow
signal processing and machine learning!
– Weighted sum on inputs and bias passed through a thresholding function
Using 1-extended vector notation to account for bias
Error for a single input
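[Figure: Perceptron and ADALINE side by side, each labelled with input $x$, weighted sum $z$, constant input 1, output $y$, desired output $d$, and error $\delta$]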
$f(z)$
– Lookahead: Note that this is exactly backpropagation in multilayer nets if we let $f(z)$ represent the entire network between $z$ and $y$
machine learning and signal processing
– Variants of it appear in almost every problem
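A minimal sketch of ADALINE training follows, assuming the Widrow-Hoff (LMS, or delta) rule: the error for a single input is measured on the weighted sum z before thresholding, and each weight moves by eta·(d - z)·x. The learning rate, epoch count, and toy data are hypothetical choices for illustration.

```python
def adaline_train(samples, eta=0.05, epochs=200):
    """Train on 1-extended inputs with the delta rule (assumed Widrow-Hoff form)."""
    dim = len(samples[0][0]) + 1
    w = [0.0] * dim
    for _ in range(epochs):
        for x, d in samples:
            x_ext = list(x) + [1.0]  # 1-extended vector accounts for the bias
            z = sum(wi * xi for wi, xi in zip(w, x_ext))
            delta = d - z  # error on the weighted sum, before thresholding
            w = [wi + eta * delta * xi for wi, xi in zip(w, x_ext)]
    return w


def adaline_classify(w, x):
    """Threshold the weighted sum to get a +1/-1 decision."""
    z = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
    return 1 if z >= 0 else -1


# Hypothetical toy data with +1/-1 targets.
data = [
    ((1.0, 0.5), 1), ((0.5, 1.0), 1), ((1.0, 1.0), 1),
    ((-1.0, -0.5), -1), ((-0.5, -1.0), -1), ((-1.0, -1.0), -1),
]
w = adaline_train(data)
print(w, [adaline_classify(w, x) for x, _ in data])
```

Unlike the perceptron rule, this update changes the weights on every instance, including correctly classified ones, because the error is taken on z rather than on the thresholded output.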
– Classify an input
– If error, find the z that is closest to 0
– Flip the output of corresponding unit and compute new output
– If error reduces:
except at 0 where it is non-differentiable
– You can vary the weights a lot without changing the error
– There is no indication of which direction to change the weights to reduce error
– Small changes in weight can result in non-negligible changes in the output
– This enables us to estimate the parameters using gradient descent techniques
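The contrast between a threshold and a differentiable activation can be seen numerically. The sketch below assumes a one-input unit, a squared-error measure, and a sigmoid as the smooth activation; all names and values are illustrative.

```python
import math


def threshold(z):
    """Hard threshold activation: flat everywhere except at 0."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Smooth activation: differentiable everywhere."""
    return 1.0 / (1.0 + math.exp(-z))


def squared_error(activation, w, b, samples):
    """Total squared error of a one-input unit over the samples."""
    return sum((activation(w * x + b) - d) ** 2 for x, d in samples)


# Hypothetical 1-D samples (input, target) and parameters.
samples = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
w, b, eps = 1.0, 0.0, 1e-3

# Change in error caused by nudging the weight by eps.
step_change = (squared_error(threshold, w + eps, b, samples)
               - squared_error(threshold, w, b, samples))
smooth_change = (squared_error(sigmoid, w + eps, b, samples)
                 - squared_error(sigmoid, w, b, samples))
print(step_change, smooth_change)
```

With the threshold unit the small nudge leaves the error unchanged, so it gives no direction to move in; with the sigmoid the error shifts slightly, which is the signal gradient descent relies on.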
[Figure: y plotted against x]
– This is an approximation of the probability of Y=1 at that point
[Figure: instances labelled y=0 and y=1 along x]
– It actually computes the probability that the input belongs to class 1
[Figure: unit with inputs x1 and x2] Decision: y > 0.5?
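The decision rule above can be sketched as a logistic unit whose output is read as the probability of class 1 and compared against 0.5. The weights and bias are hypothetical values chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def prob_class1(x, w, b):
    """Output of the unit, read as the probability that x belongs to class 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


def decide(x, w, b):
    """Decision rule: class 1 if y > 0.5, else class 0."""
    return 1 if prob_class1(x, w, b) > 0.5 else 0


# Hypothetical parameters for a two-input unit.
w, b = [2.0, -1.0], 0.5
p = prob_class1([1.0, 0.5], w, b)
print(p, decide([1.0, 0.5], w, b), decide([-1.0, 1.0], w, b))
```

Because the sigmoid equals 0.5 exactly when the weighted sum is 0, thresholding y at 0.5 gives the same boundary as thresholding the weighted sum at 0.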
is well-defined and finite for all
is a differentiable function of both inputs $x_i$ and weights $w_i$
changes in either the input or the weights
weights (including “bias” weight)
parameter (weight or bias)
– Small changes in the parameters result in measurable changes in output
$w_{i,j}^{(k)}$ = weight connecting the $i$th unit of the $k$th layer to the $j$th unit of the $(k+1)$th layer
is differentiable w.r.t both and
– Small changes in the parameters result in measurable changes in the output
– We will derive the actual derivatives using the chain rule later
$d_i$ is the desired output of the network in response to $X_i$
– $X_i$ and $d_i$ may both be vectors
desired output for each training input
– Or a close approximation of it – The architecture of the network must be specified by us
[Figure: training pairs $(X_i, d_i)$ and the error at each]
Note: The empirical risk is only an empirical approximation to the true risk, which is our actual minimization objective
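As a rough illustration, the empirical risk can be computed as the average error over the training pairs. The sketch below assumes a squared-error divergence and a toy one-parameter model; neither is specified in this part of the text.

```python
def empirical_risk(f, samples, divergence):
    """Average divergence between model outputs and desired outputs."""
    return sum(divergence(f(x), d) for x, d in samples) / len(samples)


def squared_divergence(y, d):
    """Assumed error measure: squared difference."""
    return (y - d) ** 2


def model(x):
    """Hypothetical model f(x) = 2x, standing in for the network."""
    return 2.0 * x


# Hypothetical training pairs (X_i, d_i).
samples = [(0.0, 0.0), (1.0, 2.5), (2.0, 3.5)]
risk = empirical_risk(model, samples, squared_divergence)
print(risk)
```

The value depends on which samples were drawn, which is why it only approximates the true risk over the whole input distribution.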
is a row vector:
The partial derivative with respect to $x_i$ gives us how $y$ increments when only $x_i$ is incremented
Note: is now a vector
function w.r.t a variable “x”
network optimization problem we would be
variable that we’re optimizing a function over and not the input to a neural network
[Figure: plot of f(x) against x, marking the global minimum, an inflection point, a local minimum, and the global maximum]
Find the value $x$ at which $f'(x) = 0$
– Solve $\frac{df(x)}{dx} = 0$
– Derivatives go from positive to negative or vice versa at this point
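– Find the value $x$ at which $f'(x) = 0$: Solve $\frac{df(x)}{dx} = 0$
– If the second derivative $f''(x)$ is positive at the solution, it is a minimum, otherwise it is a maximum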
– These can be local maxima, local minima, or inflection points
The second derivative is:
– Positive (or 0) at minima
– Negative (or 0) at maxima
– Zero at inflection points
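The second-derivative test can be tried on a small example. The function f(x) = x^3 - 3x and the helper names below are illustrative assumptions; its derivative 3x^2 - 3 is zero at x = 1 and x = -1.

```python
def classify_critical_point(second_derivative_value):
    """Apply the second-derivative test at a point where f'(x) = 0."""
    if second_derivative_value > 0:
        return "minimum"
    if second_derivative_value < 0:
        return "maximum"
    return "inconclusive (possible inflection point)"


def f_second(x):
    """Second derivative of the example f(x) = x**3 - 3*x."""
    return 6.0 * x


def g_second(x):
    """Second derivative of g(x) = x**3, whose only critical point is x = 0."""
    return 6.0 * x


critical_points = [1.0, -1.0]  # roots of f'(x) = 3*x**2 - 3
labels = [classify_critical_point(f_second(x)) for x in critical_points]
inflection_label = classify_critical_point(g_second(0.0))
print(labels, inflection_label)
```

A zero second derivative does not settle the question by itself, which matches the "(or 0)" qualifiers in the list above.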
[Figure: critical points, where the derivative is 0: a maximum, a minimum, and an inflection point, with the second derivative negative, positive, and zero respectively]
– Shifting in any direction will increase the value
– For smooth functions, minuscule shifts will not result in any change at all
amount will not change the value of the function
Some sloppy maths here, with apology – comparing row and column vectors
Gradient vector
– Moving in the direction of the gradient increases the function fastest
– Moving in the opposite direction decreases the function fastest
– The gradient is 0 at the peaks and troughs marked in the figure
– The gradient is perpendicular to the level curve
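These properties of the gradient can be checked numerically. The sketch below uses an assumed example function, f(x1, x2) = x1^2 + 2·x2^2, and a central-difference estimate of the gradient; the names and step sizes are illustrative.

```python
import math


def f(x):
    """Assumed example function of two variables."""
    return x[0] ** 2 + 2.0 * x[1] ** 2


def numerical_gradient(func, x, h=1e-6):
    """Central-difference estimate of the gradient of func at x."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += h
        lo[i] -= h
        grad.append((func(hi) - func(lo)) / (2.0 * h))
    return grad


x = [1.0, 1.0]
g = numerical_gradient(f, x)
norm = math.sqrt(sum(gi * gi for gi in g))
unit = [gi / norm for gi in g]   # direction of the gradient
tangent = [-unit[1], unit[0]]    # perpendicular to it (along the level curve)

s = 1e-3  # small step length
up = f([x[0] + s * unit[0], x[1] + s * unit[1]]) - f(x)
down = f([x[0] - s * unit[0], x[1] - s * unit[1]]) - f(x)
along = f([x[0] + s * tangent[0], x[1] + s * tangent[1]]) - f(x)
print(g, up, down, along)
```

Stepping along the gradient raises f, stepping against it lowers f by about the same amount, and stepping along the level curve changes f only at second order.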
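$f(x_1, x_2, x_3) = (x_1)^2 + x_1(1 - x_2) - (x_2)^2 - x_2 x_3 + (x_3)^2 + x_3$
Gradient: $\nabla f = \left[ \frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \frac{\partial f}{\partial x_3} \right]^T$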
[Figure: plot of f(X) against X]
– Start from an initial guess $X_0$ for the optimal $X$
– Update the guess towards a (hopefully) “better” value of $f(X)$
– Stop when $f(X)$ no longer decreases
– Which direction to step in
– How big must the steps be
[Figure: f(X) against X, with successive estimates x0, x1, x2, x3, x4, x5]
– A positive derivative: moving left decreases error
– A negative derivative: moving right decreases error
– If the derivative is positive: $x = x - \text{step}$
– If the derivative is negative: $x = x + \text{step}$
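The iterative procedure can be sketched for a function of one variable: step against the sign of the derivative by a fixed amount, and stop when the function no longer decreases. The example function f(x) = (x - 3)^2 + 1, the step size, and the function names are assumptions made for illustration.

```python
def sign_descent(f, df, x0, step=0.25, max_iters=1000):
    """Fixed-size steps against the sign of the derivative."""
    x = x0
    for _ in range(max_iters):
        slope = df(x)
        if slope == 0:
            break  # critical point reached
        # Positive derivative: move left. Negative derivative: move right.
        candidate = x - step if slope > 0 else x + step
        if f(candidate) >= f(x):
            break  # f no longer decreases, so stop
        x = candidate
    return x


def f(x):
    """Assumed example function with its minimum at x = 3."""
    return (x - 3.0) ** 2 + 1.0


def df(x):
    """Derivative of the example function."""
    return 2.0 * (x - 3.0)


x_from_left = sign_descent(f, df, x0=0.0)
x_from_right = sign_descent(f, df, x0=5.1)
print(x_from_left, x_from_right)
```

With a fixed step the estimate can only land within one step of the minimum, which is one reason the question of how big the steps must be matters; scaling the step by the size of the derivative is a common alternative.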