Pattern Recognition Two main challenges Representation Matching - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Chapter 6: Multilayer Neural Networks (Sections 6.1-6.3)

  • Introduction
  • Feedforward Operation and Classification
  • Backpropagation Algorithm
slide-2
SLIDE 2

Pattern Recognition

Jain CSE 802, Spring 2013

Two main challenges

  • Representation
  • Matching
slide-3
SLIDE 3

Representation and Matching

slide-4
SLIDE 4

Driver License Information 2009 driver license photo

Gallery: 34 million (30M DMV photos, 4M mugshots)

Courtesy: Pete Langenfeld, MSP

How Good is Face Representation?

slide-5
SLIDE 5

Top-10 retrievals (ranks 1 to 10)

Gallery: 34 million (30M DMV photos, 4M mugshots)

Smile makes a difference!

Courtesy: Pete Langenfeld, MSP

How Good is Face Representation?

slide-6
SLIDE 6

State of the Art in FR: Verification

LFW (2007) IJB-A (2015) FRGC v2.0 (2006) MBGC (2010)

  • D. Wang, C. Otto and A. K. Jain, "Face Search at Scale: 80 Million Gallery", arXiv, July 28, 2015

LFW Standard Protocol

99.77% (Accuracy) 3,000 genuine & 3,000 imposter pairs; 10-fold CV

LFW BLUFR Protocol

88% TAR @ 0.1% FAR 156,915 genuine, ~46M imposter pairs; 10-fold CV

slide-7
SLIDE 7

Neural Networks

  • Massive parallelism is essential for complex recognition tasks (speech & image recognition)
  • Humans take only ~200 ms for most cognitive tasks; this suggests parallel computation in the human brain
  • Biological networks achieve excellent recognition performance via dense interconnection of simple computational elements (neurons)
  • Number of neurons ≈ 10^10 – 10^12
  • Number of interconnections/neuron ≈ 10^3 – 10^4
  • Total number of interconnections ≈ 10^14
  • Damage to a few neurons or synapses (links) does not appear to impair performance (robustness)

slide-8
SLIDE 8

Neuron

  • Nodes are nonlinear, typically analog
  • Each node has an internal threshold or offset

[Figure: a neuron with inputs x1, x2, …, xd, weights w1, …, wd, and output Y]

slide-9
SLIDE 9

Neural Networks

  • Feed-forward networks with one or more hidden layers between input & output nodes
  • How many nodes & hidden layers?
  • Network training?

[Figure: d inputs; first hidden layer (NH1 input units); second hidden layer (NH2 input units); c outputs]

slide-10
SLIDE 10

Form of the Discriminant Function

  • Linear: Hyperplane decision boundaries
  • Non-Linear: Arbitrary decision boundaries
  • Adopt a model and then use the resulting decision boundary

  • Specify the desired decision boundary
slide-11
SLIDE 11

Linear Discriminant Function

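  • For a 2-class problem, a discriminant function that is a linear combination of the input features can be written as

$$g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0$$

where $\mathbf{w}$ is the weight vector and $w_0$ is the bias or threshold weight. The sign of the function value gives the class label.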
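As a rough illustration of a linear discriminant of the form g(x) = w^t x + w_0, the Python sketch below evaluates g and labels a point by its sign. The function names and the example weights are hypothetical, and assigning g(x) = 0 to class 2 is an arbitrary convention chosen here.

```python
def linear_discriminant(x, w, w0):
    """Return g(x) = w^t x + w0 for a feature list x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(x, w, w0):
    """Class 1 if g(x) > 0, otherwise class 2 (ties go to class 2 by convention)."""
    return 1 if linear_discriminant(x, w, w0) > 0 else 2


# Hypothetical weights: the decision boundary is the line x1 + x2 = 1.
w, w0 = [1.0, 1.0], -1.0
print(classify([2.0, 2.0], w, w0))  # g = 3, positive side
print(classify([0.0, 0.0], w, w0))  # g = -1, negative side
```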

slide-12
SLIDE 12

Quadratic Discriminant Function

  • Obtained by adding pair-wise products of features:

$$g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j$$

  • The linear part has (d+1) parameters; the quadratic part adds d(d+1)/2 parameters (taking $w_{ij} = w_{ji}$)
  • g(x) positive implies class 1; g(x) negative implies class 2
  • g(x) = 0 represents a hyperquadric, as opposed to hyperplanes in the linear discriminant case
  • Adding more terms such as $w_{ijk} x_i x_j x_k$ results in polynomial discriminant functions

slide-13
SLIDE 13

Generalized Discriminant Function

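  • A generalized linear discriminant function, where y = f(x), can be written as

$$g(\mathbf{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\mathbf{x})$$

  • Equivalently,

$$g(\mathbf{x}) = \mathbf{a}^t \mathbf{y}$$

where $\mathbf{y} = [y_1(\mathbf{x}), y_2(\mathbf{x}), \ldots, y_{\hat{d}}(\mathbf{x})]^t$ is also called the augmented feature vector, $\hat{d}$ is the dimensionality of the augmented feature space, and $\mathbf{a} = [a_1, a_2, \ldots, a_{\hat{d}}]^t$ holds the weights in the augmented feature space. Note that the function is linear in a.

  • Setting $y_i(\mathbf{x})$ to be monomials results in polynomial discriminant functions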
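To make the augmented feature vector concrete, here is a minimal Python sketch that maps x to monomials up to degree two and evaluates a discriminant that is linear in the weights a. The helper names are hypothetical; keeping only products with i <= j reflects the symmetry x_i x_j = x_j x_i.

```python
def quadratic_features(x):
    """Map x = [x1..xd] to y = [1, x1..xd, xi*xj for all i <= j]."""
    d = len(x)
    y = [1.0] + list(x)
    y += [x[i] * x[j] for i in range(d) for j in range(i, d)]
    return y


def g(x, a):
    """Generalized linear discriminant: linear in a, quadratic in x."""
    y = quadratic_features(x)
    return sum(ai * yi for ai, yi in zip(a, y))


d = 3
y = quadratic_features([1.0, 2.0, 3.0])
# (d + 1) linear-part terms plus d(d + 1)/2 pairwise products
print(len(y), (d + 1) + d * (d + 1) // 2)
```

For a single feature, the weights a = [1, 0, -1] give g(x) = 1 - x^2, so the decision boundary g(x) = 0 is the pair of points x = -1 and x = +1, something no linear discriminant in x alone can produce.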

slide-14
SLIDE 14

Perceptron

  • Perceptron is a linear classifier; it makes predictions based on a linear predictor function combining a set of weights with the feature vector
  • The perceptron algorithm was invented by Rosenblatt in the late 1950s; its first implementation, in custom hardware, was one of the first artificial neural networks to be produced
  • The algorithm allows for online learning; it processes training samples one at a time

slide-15
SLIDE 15

Two-category Linearly Separable Case

  • Let $y_1, y_2, \ldots, y_n$ be n training samples in augmented feature space, which are linearly separable
  • We need to find a weight vector a such that
  • $a^t y > 0$ for examples from the positive class
  • $a^t y < 0$ for examples from the negative class
  • “Normalizing” the input examples by multiplying them with their class label (replace all samples from class 2 by their negatives), find a weight vector a such that
  • $a^t y > 0$ for all the examples (here y is multiplied with class label)
  • The resulting weight vector is called a separating vector or a solution vector
slide-16
SLIDE 16

The Perceptron Criterion Function

  • Goal: Find a weight vector a such that $a^t y > 0$ for all the samples (assuming it exists)
  • Mathematically, this can be expressed as finding a weight vector a that minimizes the no. of samples misclassified
  • This function is piecewise constant (discontinuous, and hence non-differentiable) and is difficult to optimize
  • Perceptron Criterion Function:

$$J_p(\mathbf{a}) = \sum_{\mathbf{y} \in \mathcal{Y}} (-\mathbf{a}^t \mathbf{y})$$

where $\mathcal{Y}$ is the set of samples misclassified by a. Find a that minimizes this criterion.

Now, the minimization is mathematically tractable, and hence it is a better criterion fn. than no. of misclassifications. The criterion is proportional to the sum of distances from the misclassified samples to the decision boundary
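The contrast between the two criteria can be seen in a small Python sketch. It assumes the standard form of the perceptron criterion, the sum of -a^t y over the misclassified (normalized) samples, and treats a^t y <= 0 as a misclassification; the sample values are made up for illustration.

```python
def dot(a, y):
    return sum(ai * yi for ai, yi in zip(a, y))


def num_misclassified(a, samples):
    """Piecewise-constant criterion: count of samples with a^t y <= 0."""
    return sum(1 for y in samples if dot(a, y) <= 0)


def perceptron_criterion(a, samples):
    """J_p(a): sum of -a^t y over the misclassified samples."""
    return sum(-dot(a, y) for y in samples if dot(a, y) <= 0)


# Hypothetical normalized samples (class-2 samples already negated).
samples = [[1, 2, 0], [1, -1, 1], [-1, 1, 1]]
for a in ([0, 1, 0], [0, 2, 0]):
    print(a, num_misclassified(a, samples), perceptron_criterion(a, samples))
```

Scaling a from [0, 1, 0] to [0, 2, 0] leaves the misclassification count at 1 but changes J_p from 1 to 2, which is why J_p gives a usable gradient while the count does not.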

slide-17
SLIDE 17

Fixed-increment Single Sample Perceptron

  • Also called perceptron learning in an online setting
  • For large datasets, this is more efficient compared to batch mode

n = no. of training samples; a = weight vector; k = iteration #

Chapter 5, page 230
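The algorithm listing itself did not survive in this transcript. The Python sketch below follows the commonly stated fixed-increment single-sample rule (add a misclassified normalized sample to the weight vector) and should be read as an assumption about what the slide showed. The `max_passes` cap and the AND-style toy data are hypothetical additions.

```python
def normalize(samples, labels):
    """Augment each x with a leading 1, then negate samples whose label is -1."""
    return [[label * v for v in [1.0] + list(x)] for x, label in zip(samples, labels)]


def fixed_increment_perceptron(ys, max_passes=1000):
    """Cycle through the samples; whenever a^t y <= 0, update a <- a + y."""
    a = [0.0] * len(ys[0])
    for _ in range(max_passes):
        errors = 0
        for y in ys:
            if sum(ai * yi for ai, yi in zip(a, y)) <= 0:
                a = [ai + yi for ai, yi in zip(a, y)]
                errors += 1
        if errors == 0:  # a full pass with no mistakes: a is a solution vector
            return a
    raise RuntimeError("no separating vector found; data may not be linearly separable")


# Toy linearly separable data: label +1 only when both inputs are 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, -1, -1, 1]
ys = normalize(X, labels)
a = fixed_increment_perceptron(ys)
print(a)
```

Because each update touches only one sample, the whole training set never has to be held in a single gradient computation, which is the efficiency argument for the online setting.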

slide-18
SLIDE 18

Perceptron Convergence Theorem

If the training samples are linearly separable, then the sequence of weight vectors given by the fixed-increment single-sample Perceptron will terminate at a solution vector

What happens if the patterns are not linearly separable?

slide-19
SLIDE 19

Multilayer Perceptron

Can we learn the nonlinearity at the same time as the linear discriminant? This is the goal of multilayer neural networks or multilayer Perceptrons

Pattern Classification, Chapter 6


slide-20
SLIDE 20


slide-21
SLIDE 21


Feedforward Operation and Classification

  • A three-layer neural network consists of an input layer, a hidden layer and an output layer interconnected by modifiable (learned) weights represented by links between layers
  • A multilayer neural network implements linear discriminants, but in a space where the inputs have been mapped nonlinearly
  • Figure 6.1 shows a simple three-layer network

slide-22
SLIDE 22

No training is involved here, since we are implementing a known input/output mapping

slide-23
SLIDE 23


slide-24
SLIDE 24


  • A single “bias unit” is connected to each unit in addition to the input units
  • Net activation:

$$net_j = \sum_{i=1}^{d} x_i w_{ji} + w_{j0} = \sum_{i=0}^{d} x_i w_{ji} \equiv \mathbf{w}_j^t \mathbf{x},$$

where the subscript i indexes units in the input layer, j in the hidden layer; $w_{ji}$ denotes the input-to-hidden layer weights at the hidden unit j. (In neurobiology, such weights or connections are called “synapses”)

  • Each hidden unit emits an output that is a nonlinear function of its activation, that is: $y_j = f(net_j)$

slide-25
SLIDE 25


Figure 6.1 shows a simple threshold function:

$$f(net) = \operatorname{sgn}(net) \equiv \begin{cases} 1 & \text{if } net \ge 0 \\ -1 & \text{if } net < 0 \end{cases}$$

  • The function f(.) is also called the activation function or “nonlinearity” of a unit. There are more general activation functions with desirable properties
  • Each output unit similarly computes its net activation based on the hidden unit signals as:

$$net_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} = \mathbf{w}_k^t \mathbf{y},$$

where the subscript k indexes units in the output layer and $n_H$ denotes the number of hidden units

slide-26
SLIDE 26


  • The output units are referred to as $z_k$. An output unit computes the nonlinear function of its net input, emitting $z_k = f(net_k)$
  • In the case of c outputs (classes), we can view the network as computing c discriminant functions $z_k = g_k(\mathbf{x})$; the input x is classified according to the largest discriminant function $g_k(\mathbf{x})$ ∀ k = 1, …, c
  • The three-layer network with the weights listed in fig. 6.1 solves the XOR problem
slide-27
SLIDE 27


  • The hidden unit $y_1$ computes the boundary $x_1 + x_2 + 0.5 = 0$:

$$x_1 + x_2 + 0.5 \ge 0 \Rightarrow y_1 = +1, \qquad x_1 + x_2 + 0.5 < 0 \Rightarrow y_1 = -1$$

  • The hidden unit $y_2$ computes the boundary $x_1 + x_2 - 1.5 = 0$:

$$x_1 + x_2 - 1.5 \ge 0 \Rightarrow y_2 = +1, \qquad x_1 + x_2 - 1.5 < 0 \Rightarrow y_2 = -1$$

  • Output unit emits $z_1 = +1$ if and only if $y_1 = +1$ and $y_2 = -1$

Using the terminology of computer logic, the units are behaving like gates, where the first hidden unit is an OR gate, the second hidden unit is an AND gate, and the output unit implements $z_k$ = $y_1$ AND NOT $y_2$ = ($x_1$ OR $x_2$) AND NOT ($x_1$ AND $x_2$) = $x_1$ XOR $x_2$, which provides the nonlinear decision of fig. 6.1
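The gate description can be checked with a few lines of Python. The sketch assumes inputs coded as -1/+1 and uses the two hidden-unit boundaries quoted on this slide. The output-unit weights (+1, -1, bias -1) are hypothetical values chosen so that the output fires only for y1 = +1 and y2 = -1; they are not taken from fig. 6.1.

```python
def sgn(net):
    """Threshold activation: +1 if net >= 0, otherwise -1."""
    return 1 if net >= 0 else -1


def xor_network(x1, x2):
    y1 = sgn(x1 + x2 + 0.5)  # hidden unit 1: acts as OR on -1/+1 inputs
    y2 = sgn(x1 + x2 - 1.5)  # hidden unit 2: acts as AND on -1/+1 inputs
    # Assumed output weights: z = +1 only when y1 = +1 and y2 = -1
    return sgn(y1 - y2 - 1)


for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x1, x2, xor_network(x1, x2))
```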

slide-28
SLIDE 28


  • General Feedforward Operation – case of c output units

$$g_k(\mathbf{x}) \equiv z_k = f\left( \sum_{j=1}^{n_H} w_{kj} \, f\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right), \quad k = 1, \ldots, c \qquad (1)$$

  • Hidden units enable us to express more complicated nonlinear functions and extend classification capability
  • Activation function does not have to be a sign function; it is often required to be continuous and differentiable
  • We can allow the activation in the output layer to be different from the activation function in the hidden layer or have different activation for each individual unit
  • Assume for now that all activation functions are identical
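A minimal Python sketch of the general feedforward operation for a three-layer network follows, assuming the same activation f at every unit and storing each unit's bias as the first entry of its weight row. The weight layout and the choice of tanh as f are assumptions made for the example.

```python
import math


def feedforward(x, w_hidden, w_output, f=math.tanh):
    """Compute z_k = f(sum_j w_kj * f(sum_i w_ji * x_i + w_j0) + w_k0).

    w_hidden[j] = [w_j0, w_j1, ..., w_jd]
    w_output[k] = [w_k0, w_k1, ..., w_knH]
    """
    y = [f(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in w_hidden]
    z = [f(w[0] + sum(wj * yj for wj, yj in zip(w[1:], y))) for w in w_output]
    return z


def classify(x, w_hidden, w_output):
    """Pick the class whose discriminant g_k(x) = z_k is largest (0-based index)."""
    z = feedforward(x, w_hidden, w_output)
    return max(range(len(z)), key=lambda k: z[k])


# Hypothetical 2-2-2 network.
w_hidden = [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]]
w_output = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(classify([2.0, 0.0], w_hidden, w_output))
```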

slide-29
SLIDE 29


  • Expressive Power of multi-layer Networks

Question: Can every decision boundary be implemented by a three-layer network?

Answer: Yes (due to A. Kolmogorov). “Any continuous function from input to output can be implemented in a three-layer net, given sufficient number of hidden units nH, proper nonlinearities, and weights.”

Any continuous function g(x) defined on the unit cube can be represented in the following form for properly chosen functions $\delta_j$ and $\beta_{ij}$:

$$g(\mathbf{x}) = \sum_{j=1}^{2n+1} \delta_j\left( \sum_{i} \beta_{ij}(x_i) \right), \quad \forall \mathbf{x} \in I^n \ (I = [0, 1];\ n \ge 2)$$

slide-30
SLIDE 30


slide-31
SLIDE 31


  • Network has two modes of operation:
  • Learning

Supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to bring the actual outputs closer to the desired target values

  • Feedforward

The feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to yield a decision from the output units

slide-32
SLIDE 32


Network Learning: Backpropagation Algorithm

  • Goal is to learn the interconnection weights based on the training patterns and the desired outputs
  • In a three-layer network, it is a straightforward matter to understand how the output, and thus the error, depends on the hidden-to-output layer weights
  • The power of backpropagation is that it enables us to compute an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights. This addresses what is known as the credit assignment problem

slide-33
SLIDE 33


slide-34
SLIDE 34


Network Learning

  • Start with an untrained network, present a training pattern to the input layer, pass the signal through the network and determine the output.
  • Let $t_k$ be the k-th target (or desired) output and $z_k$ be the k-th computed output with k = 1, …, c. Let w represent all the weights of the network
  • The training error:

$$J(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} \| \mathbf{t} - \mathbf{z} \|^2$$

  • The backpropagation learning rule is based on gradient descent
  • The weights are initialized with random values and are changed in a direction that will reduce the error:

$$\Delta \mathbf{w} = -\eta \frac{\partial J}{\partial \mathbf{w}}$$

slide-35
SLIDE 35


where η is the learning rate, which indicates the relative size of the change in weights:

$$\mathbf{w}(m+1) = \mathbf{w}(m) + \Delta \mathbf{w}(m)$$

where m indexes the m-th training pattern presented

  • Error on the hidden-to-output weights

$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \cdot \frac{\partial net_k}{\partial w_{kj}} = -\delta_k \frac{\partial net_k}{\partial w_{kj}}$$

where the sensitivity of unit k is defined as:

$$\delta_k = -\frac{\partial J}{\partial net_k}$$

and describes how the overall error changes with the unit’s net activation:

$$\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \cdot \frac{\partial z_k}{\partial net_k} = (t_k - z_k) f'(net_k)$$

slide-36
SLIDE 36


Since $net_k = \mathbf{w}_k^t \mathbf{y}$, therefore:

$$\frac{\partial net_k}{\partial w_{kj}} = y_j$$

Conclusion: the weight update (or learning rule) for the hidden-to-output weights is:

$$\Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k) f'(net_k) y_j$$

  • Learning rule for the input-to-hidden units is more subtle and is the crux of the credit assignment problem
  • Error on the input-to-hidden units: Using the chain rule

$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \cdot \frac{\partial y_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}}$$

slide-37
SLIDE 37


However,

$$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j} \left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial net_k} \cdot \frac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) f'(net_k) w_{kj}$$

Similarly as in the preceding case, we define the sensitivity of a hidden unit:

$$\delta_j \equiv f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k$$

The above equation is the core of the “credit assignment” problem: “The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights $w_{kj}$, all multiplied by $f'(net_j)$”; see fig 6.5

Conclusion: Learning rule for the input-to-hidden weights:

$$\Delta w_{ji} = \eta x_i \delta_j = \eta \underbrace{\left[ \sum_{k=1}^{c} w_{kj} \delta_k \right] f'(net_j)}_{\delta_j} x_i$$
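One way to gain confidence in the sensitivity formulas is to compare them with a finite-difference estimate of the gradient. The Python sketch below does this for a small network, assuming tanh activations (so f'(net) = 1 - f(net)^2) and bias weights stored first in each row. The network size and the numeric values are hypothetical.

```python
import math


def forward(x, W1, W2):
    """Return hidden outputs y and network outputs z (tanh at every unit)."""
    y = [math.tanh(w[0] + sum(a * b for a, b in zip(w[1:], x))) for w in W1]
    z = [math.tanh(w[0] + sum(a * b for a, b in zip(w[1:], y))) for w in W2]
    return y, z


def error(x, t, W1, W2):
    """Squared-error criterion J = 1/2 * sum_k (t_k - z_k)^2."""
    _, z = forward(x, W1, W2)
    return 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))


def sensitivities(x, t, W1, W2):
    """delta_k = (t_k - z_k) f'(net_k); delta_j = f'(net_j) * sum_k w_kj delta_k."""
    y, z = forward(x, W1, W2)
    delta_k = [(tk - zk) * (1 - zk ** 2) for tk, zk in zip(t, z)]
    delta_j = [(1 - y[j] ** 2) * sum(W2[k][j + 1] * delta_k[k] for k in range(len(W2)))
               for j in range(len(W1))]
    return y, delta_k, delta_j


x, t = [0.5, -1.0], [1.0]
W1 = [[0.1, 0.2, -0.3], [-0.2, 0.4, 0.1]]
W2 = [[0.05, -0.6, 0.7]]
y, delta_k, delta_j = sensitivities(x, t, W1, W2)

# Central difference on w_ji for j = 0, i = 1 (stored at W1[0][1]).
eps = 1e-6
Wp = [row[:] for row in W1]
Wm = [row[:] for row in W1]
Wp[0][1] += eps
Wm[0][1] -= eps
numeric = (error(x, t, Wp, W2) - error(x, t, Wm, W2)) / (2 * eps)
analytic = -delta_j[0] * x[0]  # dJ/dw_ji = -delta_j * x_i
print(numeric, analytic)
```

The two printed numbers should agree to several decimal places; a mismatch usually points to a sign or indexing slip in the backward pass.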

slide-38
SLIDE 38

Sensitivity at Hidden Node

slide-39
SLIDE 39

Backpropagation Algorithm

  • More specifically, the “backpropagation of errors” algorithm
  • During training, an error must be propagated from the output layer back to the hidden layer to learn the input-to-hidden weights
  • It is gradient descent in a layered network
  • Exact behavior of the learning algorithm depends on the starting point
  • Start the process with random values of weights; in practice you learn many networks with different initializations


slide-40
SLIDE 40


  • Training protocols:
  • Stochastic: patterns are chosen randomly from the training set; network weights are updated for each pattern
  • Batch: present all patterns before updating weights
  • On-line: present each pattern once & only once (no memory for storing patterns)
  • Stochastic backpropagation algorithm:

    begin initialize n_H, w, criterion θ, η, m ← 0
        do m ← m + 1
            x^m ← randomly chosen pattern
            w_ji ← w_ji + η δ_j x_i;  w_kj ← w_kj + η δ_k y_j
        until ||∇J(w)|| < θ
        return w
    end
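A compact Python sketch of the stochastic protocol is given below. It is not a transcription of the slide's algorithm: it assumes tanh activations, replaces the gradient-norm stopping test with a fixed number of updates for simplicity, and uses a hypothetical toy problem in which the target is the sign of the first input.

```python
import math
import random


def forward(x, W1, W2):
    y = [math.tanh(w[0] + sum(a * b for a, b in zip(w[1:], x))) for w in W1]
    z = [math.tanh(w[0] + sum(a * b for a, b in zip(w[1:], y))) for w in W2]
    return y, z


def total_error(patterns, W1, W2):
    return sum(0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, forward(x, W1, W2)[1]))
               for x, t in patterns)


def train_stochastic(patterns, n_hidden, eta=0.1, steps=3000, seed=0):
    rng = random.Random(seed)
    d, c = len(patterns[0][0]), len(patterns[0][1])
    # Row layout: [bias, weights...]; small random initial values.
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(c)]
    for _ in range(steps):
        x, t = rng.choice(patterns)  # randomly chosen pattern
        y, z = forward(x, W1, W2)
        dk = [(tk - zk) * (1 - zk * zk) for tk, zk in zip(t, z)]
        # Hidden sensitivities use the hidden-to-output weights before they change.
        dj = [(1 - y[j] ** 2) * sum(W2[k][j + 1] * dk[k] for k in range(c))
              for j in range(n_hidden)]
        for k in range(c):  # w_kj <- w_kj + eta * delta_k * y_j (bias input is 1)
            W2[k][0] += eta * dk[k]
            for j in range(n_hidden):
                W2[k][j + 1] += eta * dk[k] * y[j]
        for j in range(n_hidden):  # w_ji <- w_ji + eta * delta_j * x_i
            W1[j][0] += eta * dj[j]
            for i in range(d):
                W1[j][i + 1] += eta * dj[j] * x[i]
    return W1, W2


patterns = [([-1.0, -1.0], [-1.0]), ([-1.0, 1.0], [-1.0]),
            ([1.0, -1.0], [1.0]), ([1.0, 1.0], [1.0])]
W1_init, W2_init = train_stochastic(patterns, n_hidden=3, steps=0)
W1, W2 = train_stochastic(patterns, n_hidden=3, steps=3000)
print(total_error(patterns, W1_init, W2_init), total_error(patterns, W1, W2))
```

Calling the function with `steps=0` and the same seed returns the untrained weights, which gives a baseline error to compare against.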

slide-41
SLIDE 41


  • Stopping criterion
  • The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value θ; there are other stopping criteria that lead to better performance than this one
  • A weight update may reduce the error on the single pattern being presented but can increase the error on the full training set
  • In stochastic backpropagation and batch backpropagation, we must make several passes through the training data

slide-42
SLIDE 42


  • Learning Curves
  • Before training starts, the error on the training set is high; as the learning proceeds, error becomes smaller
  • Error per pattern depends on the amount of training data and the expressive power (such as the number of weights) in the network
  • Average error on an independent test set is typically higher than on the training set, and it can decrease as well as increase
  • A validation set is used in order to decide when to stop training; we do not want to overfit the network and decrease the power of the classifier’s generalization: “Stop training when the error on the validation set is minimum”
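The validation-based stopping rule can be sketched in Python as a small helper that scans a sequence of per-epoch validation errors. The `patience` parameter (how many epochs without improvement to tolerate before halting) is an assumption added for the example and is not part of the slide; the error values are made up.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation errors.

    Training halts once `patience` epochs have passed without a new minimum;
    the weights saved at best_epoch are the ones to keep.
    """
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_errors) - 1


# Hypothetical validation curve: falls, reaches a minimum, then rises (overfitting).
val_errors = [0.90, 0.60, 0.40, 0.35, 0.37, 0.40, 0.45, 0.50]
print(early_stopping_epoch(val_errors))
```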

slide-43
SLIDE 43


slide-44
SLIDE 44

Representation at the Hidden Layer

  • What do the learned weights mean?
  • The weights connecting hidden layer to output layer form a linear discriminant
  • The weights connecting input layer to hidden layer represent a mapping from the input feature space to a latent feature space
  • For each hidden unit, the weights from input layer describe the input pattern that leads to the maximum activation of that node

slide-45
SLIDE 45

Backpropagation as Feature Mapping

  • 64-2-3 sigmoidal network for classifying three characters (E, F, L)
  • Input layer to hidden layer weights for a character recognition task: weights at two hidden nodes represented as 8x8 patterns. Left gets activated for F, right gets activated for L, and both get activated for E
  • Non-linear interactions between the features may cause the features of the pattern to not manifest in a single hidden node (contrary to the example shown above)
  • It may be difficult to draw similar interpretations in large networks and caution must be exercised while analyzing weights

slide-46
SLIDE 46

Practical Techniques for Improving Backpropagation

  • A naïve application of backpropagation procedures may lead to slow convergence and poor performance
  • Some practical suggestions; no theoretical results
  • Activation Function f(.)
  • Must be non-linear (otherwise, 3-layer network is just a linear discriminant) and saturate (have max and min value) to keep weights and activation functions bounded
  • Activation function and its derivative must be continuous and smooth; optionally monotonic
  • Choice may depend on the problem. E.g. Gaussian activation if the data comes from a mixture of Gaussians
  • E.g. sigmoid (most popular), polynomial, tanh, sign function
  • Parameters of activation function (e.g. Sigmoid)
  • Centered at 0, odd function f(-net) = -f(net) (anti-symmetric); leads to faster learning
  • Choice depends on the range of the input values
slide-47
SLIDE 47

Activation Function

The anti-symmetric sigmoid function: f(-x) = -f(x). a = 1.716, b = 2/3.

First & second derivative
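The slide gives only the constants a and b. The Python sketch below assumes the anti-symmetric sigmoid has the common form f(net) = a·tanh(b·net), which is an assumption about the plotted function rather than something stated in the transcript.

```python
import math

A, B = 1.716, 2.0 / 3.0  # constants from the slide


def f(net):
    """Assumed form of the anti-symmetric sigmoid: a * tanh(b * net)."""
    return A * math.tanh(B * net)


def f_prime(net):
    """First derivative: a * b * (1 - tanh(b * net)^2)."""
    return A * B * (1.0 - math.tanh(B * net) ** 2)


print(f(1.0), f(-1.0))     # close to +1 and -1
print(f(10.0), f(-10.0))   # close to the saturation values +a and -a
print(f_prime(0.0))        # slope at the origin, a * b
```

With these constants the function passes very close to ±1 at net = ±1, which is consistent with using ±1 as target values that stay inside the saturation range ±1.716.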

slide-48
SLIDE 48

Practical Considerations

  • Scaling inputs (important not just for neural networks)
  • Large differences in scale of different features due to the choice of units are compensated by normalizing them to be in the same range, [0,1] or [-1,1]; without normalization, error will hardly depend on features with very small values
  • Standardization: Shift the inputs to have zero mean and unit variance
  • Target Values
  • One-of-C representation for the target vector (C is no. of classes). Better to use +1 and –1 that lie well within the range of sigmoid function saturation values (+1.716, -1.716)
  • Higher values (e.g. 1.716, saturation point of a sigmoid) may require the weights to go to infinity to minimize the error
  • Training with Noise
  • For small training sets, it is better to add noise to the input patterns and generate new “virtual” training patterns
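Standardization is simple enough to sketch in plain Python. The helper below shifts and scales each feature column to zero mean and unit variance; the function name, the population-variance convention, and the handling of constant features are choices made for this example.

```python
def standardize(X):
    """Shift and scale each feature (column) of X to zero mean and unit variance."""
    n, d = len(X), len(X[0])
    means = [sum(row[i] for row in X) / n for i in range(d)]
    stds = [(sum((row[i] - means[i]) ** 2 for row in X) / n) ** 0.5 for i in range(d)]
    # A constant feature has zero spread; map it to 0 instead of dividing by zero.
    return [[(row[i] - means[i]) / stds[i] if stds[i] > 0 else 0.0 for i in range(d)]
            for row in X]


# Hypothetical data: the second feature is on a scale 1000 times larger than the first.
X = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
for row in standardize(X):
    print(row)
```

After the transform both columns span the same range, so neither dominates the error simply because of its units.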

slide-49
SLIDE 49

Practical Considerations

  • Number of Hidden units (nH)
  • Governs the expressive power of the network
  • The easier the task, the fewer the nodes needed
  • Rule of thumb: total no. of weights must be less than the number of training examples (preferably 10 times less); no. of hidden units determines the total no. of weights
  • A more principled method is to adjust the network complexity in response to training data; e.g. start with a “large” no. of hidden units and “decay”, prune, or eliminate weights
  • Initializing weights
  • We cannot initialize weights to zero, otherwise learning cannot take place
  • Choose initial weights w such that |w| < w’
  • w’ too small – slow learning; too large – early saturation and no learning
  • w’ is chosen to be 1/√d for input layer, and 1/√nH for hidden layer
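A short Python sketch of this initialization follows. It assumes the weights are drawn uniformly within the stated bound, which the slide does not specify, and the layer sizes are hypothetical.

```python
import random


def init_weights(n_in, n_out, rng=random):
    """Draw an n_out x n_in weight matrix uniformly within +/- 1/sqrt(n_in)."""
    bound = 1.0 / n_in ** 0.5
    return [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]


d, n_hidden, c = 64, 4, 3                       # hypothetical layer sizes
rng = random.Random(0)
w_input_to_hidden = init_weights(d, n_hidden, rng)    # bound 1/sqrt(d)
w_hidden_to_output = init_weights(n_hidden, c, rng)   # bound 1/sqrt(nH)
print(max(abs(w) for row in w_input_to_hidden for w in row))
```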
slide-50
SLIDE 50

Total no. of Weights

Error per pattern with the increase in number of hidden nodes.

  • 2-nH-1 network (with bias) trained on 90 2D-Gaussian patterns from each class (n = 180), sampled from a mixture of 3 Gaussians
  • Minimum test error occurs at 17-21 weights in total (4-5 hidden nodes). This illustrates the rule of thumb that n/10 weights often gives lowest error
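The weight totals quoted here can be reproduced by counting one bias per hidden and output unit, as in this small Python sketch (the function name is hypothetical):

```python
def total_weights(d, n_hidden, c):
    """Number of weights in a d-nH-c network with a bias at each hidden and output unit."""
    return n_hidden * (d + 1) + c * (n_hidden + 1)


n = 180  # training patterns in the example
for n_hidden in (4, 5):
    print(n_hidden, total_weights(2, n_hidden, 1), n / 10)
```

For the 2-nH-1 network this gives 4·nH + 1 weights, so 4 and 5 hidden nodes correspond to 17 and 21 weights, bracketing n/10 = 18.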

slide-51
SLIDE 51

Practical Considerations

  • Learning Rate
  • Small learning rate: slow convergence
  • Large learning rate: high oscillation and slow convergence
slide-52
SLIDE 52

Practical Considerations

  • Momentum
  • Prevents the algorithm from getting stuck at plateaus and local minima
  • Weight decay
  • Avoid overfitting by imposing the condition that weights must be small
  • After each update, weights are decayed by some factor
  • Related to regularization (also used in SVM)
  • Hints
  • Additional input nodes added to NN that are only used during training. Help learn better feature representation.
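The slide does not give update formulas for momentum or weight decay. The Python sketch below uses one common formulation, blending the current backpropagation step with the previous step through a factor alpha and shrinking every weight by (1 - epsilon) after an update. Both parameter names and their default values are assumptions.

```python
def momentum_step(w, grad_step, prev_step, alpha=0.9):
    """Blend the current backprop step with the previous step.

    grad_step is the plain gradient-descent step (already scaled by the learning rate).
    Returns the new weights and the step actually taken.
    """
    step = [(1 - alpha) * g + alpha * p for g, p in zip(grad_step, prev_step)]
    return [wi + s for wi, s in zip(w, step)], step


def decay(w, epsilon=0.01):
    """Weight decay: shrink every weight toward zero after an update."""
    return [wi * (1 - epsilon) for wi in w]


# On a plateau the gradient step is zero, but the previous step keeps the weights moving.
w, step = momentum_step([0.0], grad_step=[0.0], prev_step=[1.0], alpha=0.5)
print(w, step)
print(decay([2.0], epsilon=0.5))
```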

slide-53
SLIDE 53

Practical Considerations

  • Training setup
  • Online, stochastic, batch-mode
  • Stop training
  • Halt when validation error reaches (first) minimum
  • Number of hidden layers
  • More layers -> more complex
  • Networks with more hidden layers are more prone to get caught in local minima
  • The smaller the better (KISS)
  • Criterion function
  • We talked about squared error, but there are others
