Chapter 6: Multilayer Neural Networks (Sections 6.1-6.3)
- Introduction
- Feedforward Operation and Classification
- Backpropagation Algorithm
Pattern recognition: two main challenges, representation and matching.
Jain CSE 802, Spring 2013
Example: face identification against a driver-license database. Probe: a 2009 driver license photo. Gallery: 34 million images (30M DMV photos, 4M mugshots). Courtesy: Pete Langenfeld, MSP.
[Figure: the top 10 ranked candidates retrieved from the gallery.]
LFW Standard Protocol: 99.77% accuracy (3,000 genuine and 3,000 impostor pairs; 10-fold CV)
LFW BLUFR Protocol: 88% TAR @ 0.1% FAR (156,915 genuine and ~46M impostor pairs; 10-fold CV)
- Massive parallelism is essential for complex pattern recognition tasks
- Humans take only ~200 ms for most cognitive tasks
- Biological networks achieve excellent recognition performance
- Number of neurons ≈ 10^10 – 10^12; interconnections per neuron ≈ 10^3 – 10^4; total interconnections ≈ 10^14
- Damage to a few neurons or synapses (links) does not appreciably degrade performance
- Nodes are nonlinear, typically analog
[Figure: a single neuron with inputs x1, ..., xd, weights w1, ..., wd, and output Y.]
- Feed-forward networks with one or more hidden layers
- Design questions: how many nodes and hidden layers? How should the network be trained?
[Figure: a fully connected feedforward network with d inputs, a first hidden layer of NH1 units, a second hidden layer of NH2 units, and c outputs.]
Linear discriminant: g(x) = w^t x + w_0, where w is the weight vector and w_0 the bias or threshold weight; the sign of the function value gives the class label.
Quadratic discriminant: g(x) = w_0 + Σ_{i=1}^{d} w_i x_i + Σ_{i=1}^{d} Σ_{j=1}^{d} w_ij x_i x_j. The linear part has (d+1) parameters; the quadratic part adds d(d+1)/2 more.
Generalized linear discriminant: g(x) = Σ_{i=1}^{d̂} a_i y_i(x) = a^t y. Setting the y_i(x) to be monomials results in polynomial discriminant functions; d̂ is the dimensionality of the augmented feature space, and a holds the weights in that space. Note that the function is linear in a.
Here a = (a_1, a_2, ..., a_d̂)^t is the augmented weight vector and y = (y_1(x), y_2(x), ..., y_d̂(x))^t is also called the augmented feature vector.
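To make this concrete, here is a minimal Python/NumPy sketch (the helper name augment_quadratic is illustrative, not from the text) of mapping a pattern to its augmented feature vector with quadratic monomials and evaluating g(x) = a^t y:

```python
import numpy as np

def augment_quadratic(x):
    """Map x = (x_1, ..., x_d) to the augmented feature vector y:
    the constant 1, the d linear terms x_i, and the d(d+1)/2
    quadratic monomials x_i * x_j for i <= j."""
    d = len(x)
    quad = [x[i] * x[j] for i in range(d) for j in range(i, d)]
    return np.concatenate(([1.0], x, quad))

x = np.array([0.5, -1.2])
y = augment_quadratic(x)        # d-hat = 1 + d + d(d+1)/2 = 6 for d = 2
a = np.random.randn(y.size)     # augmented weight vector (random, illustrative)
g = a @ y                       # g(x) = a^t y, linear in a
label = np.sign(g)              # sign of the function value gives the class label
```

Note that for d = 2 the augmented space has d̂ = 6 dimensions, matching the (d+1) + d(d+1)/2 parameter count above.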
The number of misclassified samples is piecewise constant in a (hence not differentiable) and is difficult to optimize directly. Instead, use the perceptron criterion J_p(a) = Σ_{y ∈ Y} (−a^t y), where Y is the set of samples misclassified by a. Now the minimization is mathematically tractable, and hence it is a better criterion function than the number of misclassifications; the criterion is proportional to the sum of distances from the misclassified samples to the decision boundary. Find the a that minimizes this criterion.
(Pattern Classification, Chapter 5, page 230)
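A hedged sketch of batch perceptron training on this criterion, assuming (as in Chapter 5) that class-2 samples are negated so every correctly classified sample satisfies a^t y > 0:

```python
import numpy as np

def perceptron_train(Y, eta=0.1, epochs=100):
    """Batch perceptron on the criterion J_p(a) = sum_misclassified(-a^t y).
    Y is (n, d_hat): augmented samples, with class-2 samples negated so
    a sample is misclassified exactly when a^t y <= 0. The gradient of
    J_p is -sum(y) over the misclassified y, so gradient descent adds
    eta times that sum to a."""
    a = np.zeros(Y.shape[1])
    for _ in range(epochs):
        mis = Y[Y @ a <= 0]          # samples misclassified by current a
        if len(mis) == 0:            # J_p(a) = 0: all samples correct
            break
        a += eta * mis.sum(axis=0)   # a <- a - eta * grad J_p(a)
    return a
```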
Each hidden unit j computes the weighted sum of its inputs:
net_j = Σ_{i=1}^{d} x_i w_ji + w_j0 = Σ_{i=0}^{d} x_i w_ji ≡ w_j^t x
Each hidden unit emits an output y_j = f(net_j), where, e.g.,
f(net) = sgn(net) ≡ { +1 if net ≥ 0; −1 if net < 0 }
Each output unit similarly computes
net_k = Σ_{j=1}^{n_H} y_j w_kj + w_k0 = Σ_{j=0}^{n_H} y_j w_kj ≡ w_k^t y
Hidden unit 1 (boundary x1 + x2 + 0.5 = 0): x1 + x2 + 0.5 ≥ 0 ⇒ y1 = +1, otherwise y1 = −1
Hidden unit 2 (boundary x1 + x2 − 1.5 = 0): x1 + x2 − 1.5 ≤ 0 ⇒ y2 = +1, otherwise y2 = −1
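The sketch below verifies these two hidden units on the four XOR corners. The output unit shown (an AND of y1 and y2) is one assumed choice that completes XOR under this encoding; the exact output weights in the book's figure may differ:

```python
import numpy as np

def sgn(v):
    return np.where(v >= 0, 1, -1)   # +1 if v >= 0, else -1

def xor_net(x1, x2):
    y1 = sgn(x1 + x2 + 0.5)          # hidden unit 1
    y2 = sgn(-(x1 + x2) + 1.5)       # hidden unit 2: +1 iff x1 + x2 - 1.5 <= 0
    return sgn(y1 + y2 - 1)          # assumed output unit: AND of y1, y2

for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x, xor_net(*x))            # prints +1 exactly when x1 != x2
```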
Putting the layers together, output unit k computes the discriminant function
z_k = g_k(x) = f( Σ_{j=1}^{n_H} w_kj f( Σ_{i=1}^{d} w_ji x_i + w_j0 ) + w_k0 )
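A compact sketch of this feedforward operation (NumPy; the weight-matrix names and shapes are my own conventions, with the bias folded in as column 0):

```python
import numpy as np

def feedforward(x, W_h, W_o, f=np.tanh):
    """Forward pass through a d - n_H - c network.
    W_h: (n_H, d+1) input-to-hidden weights, column 0 = bias w_j0.
    W_o: (c, n_H+1) hidden-to-output weights, column 0 = bias w_k0."""
    x_aug = np.concatenate(([1.0], x))   # x_0 = 1 absorbs the bias
    y = f(W_h @ x_aug)                   # y_j = f(net_j), net_j = w_j^t x
    y_aug = np.concatenate(([1.0], y))   # y_0 = 1 absorbs the bias
    return f(W_o @ y_aug)                # z_k = g_k(x) = f(net_k)

# Example: d = 3 inputs, n_H = 5 hidden units, c = 2 outputs
rng = np.random.default_rng(0)
z = feedforward(rng.standard_normal(3),
                rng.standard_normal((5, 4)),
                rng.standard_normal((2, 6)))
```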
Expressive power (Kolmogorov): any continuous function g(x) on the unit hypercube can be represented as
g(x) = Σ_{j=1}^{2n+1} Ξ_j( Σ_{i=1}^{n} ψ_ij(x_i) )
for properly chosen functions Ξ_j and ψ_ij.
Network learning: the weights are initialized with random values and are then changed in a direction that will reduce the error:
J(w) = (1/2) Σ_{k=1}^{c} (t_k − z_k)^2 = (1/2) ||t − z||^2
where t and z are the target and network output vectors of length c, and w represents all the weights of the network.
Gradient descent: Δw = −η ∂J/∂w, where η is the learning rate. For the hidden-to-output weights:
∂J/∂w_kj = (∂J/∂net_k)(∂net_k/∂w_kj) = −δ_k (∂net_k/∂w_kj)
where the sensitivity of output unit k is
δ_k = −∂J/∂net_k = −(∂J/∂z_k)(∂z_k/∂net_k) = (t_k − z_k) f′(net_k)
Since net_k = w_k^t y, therefore ∂net_k/∂w_kj = y_j, and the hidden-to-output weight update is
Δw_kj = η δ_k y_j = η (t_k − z_k) f′(net_k) y_j
For the input-to-hidden weights, the chain rule gives
∂J/∂w_ji = (∂J/∂y_j)(∂y_j/∂net_j)(∂net_j/∂w_ji)
The first factor expands as
∂J/∂y_j = ∂/∂y_j [ (1/2) Σ_{k=1}^{c} (t_k − z_k)^2 ]
= −Σ_{k=1}^{c} (t_k − z_k) ∂z_k/∂y_j
= −Σ_{k=1}^{c} (t_k − z_k) (∂z_k/∂net_k)(∂net_k/∂y_j)
= −Σ_{k=1}^{c} (t_k − z_k) f′(net_k) w_kj
By analogy with the output units, define the sensitivity of hidden unit j:
δ_j ≡ f′(net_j) Σ_{k=1}^{c} w_kj δ_k
i.e., the output sensitivities δ_k are propagated "back" through the weights w_kj. The input-to-hidden weight update is then
Δw_ji = η x_i δ_j = η [ Σ_{k=1}^{c} w_kj δ_k ] f′(net_j) x_i
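The derivation above translates directly into code. Below is a minimal stochastic backpropagation step, assuming f = tanh (so f′(net) = 1 − f(net)^2) and the same bias-as-column-0 convention as the earlier feedforward sketch:

```python
import numpy as np

def backprop_step(x, t, W_h, W_o, eta=0.1):
    """One stochastic backpropagation update for a d - n_H - c network
    with f = tanh, so f'(net) = 1 - f(net)^2. Implements the rules
    derived above: delta_k = (t_k - z_k) f'(net_k),
    delta_j = f'(net_j) sum_k w_kj delta_k,
    Delta w_kj = eta delta_k y_j, Delta w_ji = eta delta_j x_i."""
    x_aug = np.concatenate(([1.0], x))
    y = np.tanh(W_h @ x_aug)                           # hidden outputs
    y_aug = np.concatenate(([1.0], y))
    z = np.tanh(W_o @ y_aug)                           # network outputs

    delta_k = (t - z) * (1 - z ** 2)                   # output sensitivities
    delta_j = (1 - y ** 2) * (W_o[:, 1:].T @ delta_k)  # back-propagated
    W_o += eta * np.outer(delta_k, y_aug)              # hidden-to-output update
    W_h += eta * np.outer(delta_j, x_aug)              # input-to-hidden update
    return 0.5 * np.sum((t - z) ** 2)                  # J(w) for this pattern
```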
In general, the features of a pattern need not manifest in a single hidden node (contrary to the example shown above), and caution must be exercised when analyzing the weights.
[Figure: input-layer-to-hidden-layer weights for a character recognition task. The weights at two hidden nodes are shown as 8x8 patterns; the left node is activated by F, the right node by L, and both are activated by E.]
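A small sketch of this kind of weight inspection (matplotlib; the function name and layout are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def show_hidden_weights(W_h, shape=(8, 8)):
    """Render each hidden unit's input weights (bias column dropped)
    as an image; for 8x8 character bitmaps each weight vector
    reshapes to an 8x8 pattern, as in the figure."""
    fig, axes = plt.subplots(1, W_h.shape[0], squeeze=False)
    for j, ax in enumerate(axes[0]):
        ax.imshow(W_h[j, 1:].reshape(shape), cmap="gray")
        ax.set_axis_off()
    plt.show()
```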
Desirable properties of the activation function f: it must be nonlinear (otherwise the network provides only a linear discriminant) and should saturate (have a maximum and minimum value) to keep the weights and activations bounded; it should be continuous and smooth, and optionally monotonic.
The sigmoid meets these desiderata; it also arises naturally as the form of the discriminant when the data comes from a mixture of Gaussians. An anti-symmetric sigmoid, f(−x) = −f(x), leads to faster learning; a standard choice is f(net) = a·tanh(b·net) with a = 1.716 and b = 2/3.
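For reference, a direct transcription of this activation and the derivative that backpropagation needs:

```python
import numpy as np

A, B = 1.716, 2.0 / 3.0

def f(net):
    """Anti-symmetric sigmoid f(net) = a * tanh(b * net):
    f(-x) = -f(x), saturating at +/- 1.716."""
    return A * np.tanh(B * net)

def f_prime(net):
    """Derivative used by backpropagation: a*b*(1 - tanh(b*net)^2)."""
    return A * B * (1.0 - np.tanh(B * net) ** 2)
```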
Scaling input: differences in the scale and units of the input features are compensated by normalizing them to the same range, [0, 1] or [−1, 1], or by standardizing each feature to zero mean and unit variance; without normalization, the error will hardly depend on features with very small values.
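A minimal standardization sketch (zero mean, unit variance per feature; min-max scaling to [−1, 1] works the same way):

```python
import numpy as np

def standardize(X):
    """Shift and scale each column of the (n, d) training matrix X to
    zero mean and unit variance; reuse mu and sigma on test data so
    both sets see the same transformation."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma
```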
Target values: using the sigmoid's saturation values (+1.716, −1.716) as targets would require the weights to go to infinity to minimize the error; it is better to use targets of +1 and −1, which lie well within the sigmoid's range.
Training with noise: perturb the training samples with noise to generate new "virtual" training patterns.
Number of hidden units: this determines the total number of weights, which should be less than the total number of training examples (preferably about one-tenth as many). A more principled approach is to adjust the network's complexity in response to the training data, e.g., start with a "large" number of hidden units and then "decay", prune, or eliminate weights (see the sketch below).
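A sketch of the "decay" option, using the standard weight-decay shrinkage w_new = w_old(1 − ε) after each update (ε is a small constant; the helper name is mine):

```python
def weight_decay(W, epsilon=1e-4):
    """Shrink every weight toward zero after each update,
    w_new = w_old * (1 - epsilon); weights that the error gradient
    does not keep reinforcing decay away and can then be pruned."""
    return W * (1.0 - epsilon)
```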
Initializing weights: the weights cannot all be set to zero, or no learning takes place; choose them randomly, e.g., uniformly in −1/√d < w_ji < +1/√d, so that all weights reach their final equilibrium values at about the same time (uniform learning); a sketch follows.
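A sketch of this initialization, assuming fan-in-based uniform ranges as stated above:

```python
import numpy as np

def init_weights(d, n_H, c, seed=None):
    """Uniform initialization in (-1/sqrt(fan_in), +1/sqrt(fan_in)):
    fan-in d for input-to-hidden weights, n_H for hidden-to-output.
    Treating the bias column the same way is a simplification."""
    rng = np.random.default_rng(seed)
    W_h = rng.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), size=(n_H, d + 1))
    W_o = rng.uniform(-1 / np.sqrt(n_H), 1 / np.sqrt(n_H), size=(c, n_H + 1))
    return W_h, W_o
```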
[Figure: error per pattern as the number of hidden nodes increases. The network was trained with samples from each class (180 in total), sampled from a mixture of 3 Gaussians; the lowest error occurs at an intermediate number of hidden nodes. This illustrates the rule of thumb that n/10 weights often gives the lowest error.]
Gradient descent can become trapped in local minima of the error surface; in practice this is rarely fatal when the network has many weights and training starts from small initial weights.