Neural Networks
Oliver Schulte - CMPT 726
Bishop PRML Ch. 5
Neural Networks
- Neural networks arise from attempts to model
human/animal brains
- Many models, many claims of biological plausibility
- We will focus on multi-layer perceptrons
- Mathematical properties rather than plausibility
- See Prof. Hadley's CMPT 418
Uses of Neural Networks
- Pros
- Good for continuous input variables.
- General continuous function approximators.
- Highly non-linear.
- Learn feature functions.
- Good to use in continuous domains with little knowledge:
- When you don’t know good features.
- When you don’t know the form of a good functional model.
- Cons
- Not interpretable, “black box”.
- Learning is slow.
- Good generalization can require many datapoints.
Applications
There are many, many applications.
- World-Champion Backgammon Player.
- No Hands Across America Tour.
- Digit Recognition with 99.26% accuracy.
- ...
Outline
- Feed-forward Networks
- Network Training
- Error Backpropagation
- Applications
Feed-forward Networks
- We have looked at generalized linear models of the form:
  y(x, w) = f( Σ_{j=1}^{M} w_j φ_j(x) )
  for fixed non-linear basis functions φ(·)
- We now extend this model by allowing adaptive basis functions and learning their parameters
- In feed-forward networks (a.k.a. multi-layer perceptrons), we let each basis function itself be a non-linear function of a linear combination of the inputs:
  φ_j(x) = f( Σ . . . )
Feed-forward Networks
- Starting with input x = (x1, . . . , xD), construct linear combinations:
  a_j = Σ_{i=1}^{D} w^(1)_{ji} x_i + w^(1)_{j0}
  These a_j are known as activations
- Pass through an activation function h(·) to get output z_j = h(a_j)
- Model of an individual neuron (figure from Russell and Norvig, AIMA2e)
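As a quick illustration (a sketch added here, not part of the original slides), a single unit can be computed in a couple of lines of NumPy; the function name unit_output and the choice h = tanh are assumptions for the example.

```python
import numpy as np

def unit_output(x, w, w0, h=np.tanh):
    # a_j = sum_i w_ji * x_i + w_j0, then z_j = h(a_j)
    a = np.dot(w, x) + w0
    return h(a)
```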
Activation Functions
- Can use a variety of activation functions
- Sigmoidal (S-shaped)
- Logistic sigmoid 1/(1 + exp(−a)) (useful for binary
classification)
- Hyperbolic tangent tanh
- Radial basis function z_j = Σ_i (x_i − w_ji)²
- Softmax
- Useful for multi-class classification
- Identity
- Useful for regression
- Threshold
- . . .
- Needs to be differentiable for gradient-based learning
(later)
- Can use different activation functions in each unit
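To make these concrete, here is a minimal NumPy sketch of the activation functions listed above (added for illustration; the function names are my own, and the threshold version is included only to show why it is unsuitable for gradient-based learning):

```python
import numpy as np

def logistic_sigmoid(a):
    # 1 / (1 + exp(-a)): squashes activations into (0, 1), useful for binary classification
    return 1.0 / (1.0 + np.exp(-a))

def tanh_activation(a):
    # hyperbolic tangent: squashes activations into (-1, 1)
    return np.tanh(a)

def radial_basis(x, w_j):
    # z_j = sum_i (x_i - w_ji)^2
    return np.sum((x - w_j) ** 2)

def softmax(a):
    # exponentiate and normalise: useful for multi-class classification
    e = np.exp(a - np.max(a))  # subtract the max for numerical stability
    return e / np.sum(e)

def identity(a):
    # linear output, useful for regression
    return a

def threshold(a):
    # hard step: not differentiable, so unsuitable for gradient-based learning
    return (a > 0).astype(float)
```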
Feed-forward Networks
[Figure: two-layer network diagram with inputs x_0, x_1, . . . , x_D, hidden units z_0, z_1, . . . , z_M, and outputs y_1, . . . , y_K; first-layer weights w^(1) (e.g. w^(1)_MD) and second-layer weights w^(2) (e.g. w^(2)_KM, w^(2)_10)]
- Connect together a number of these units into a
feed-forward network (DAG)
- Above shows a network with one layer of hidden units
- Implements the function:
  y_k(x, w) = h( Σ_{j=1}^{M} w^(2)_{kj} h( Σ_{i=1}^{D} w^(1)_{ji} x_i + w^(1)_{j0} ) + w^(2)_{k0} )
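To make the formula concrete, below is a minimal NumPy sketch of this forward pass (not from the slides); the names W1, b1, W2, b2 for the two weight layers and the choice of tanh hidden units with identity outputs are illustrative assumptions.

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh, h_out=lambda a: a):
    # y_k = h_out( sum_j W2[k,j] * h( sum_i W1[j,i] * x[i] + b1[j] ) + b2[k] )
    a_hidden = W1 @ x + b1   # hidden activations a_j
    z = h(a_hidden)          # hidden unit outputs z_j = h(a_j)
    a_out = W2 @ z + b2      # output activations
    return h_out(a_out)      # network outputs y_k

# Example with D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
y = forward(rng.normal(size=3), W1, b1, W2, b2)
```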
Hidden Units Compute Basis Functions
- red dots = network function
- dashed line = hidden unit activation function.
- blue dots = data points
Network Training
- Given a specified network structure, how do we set its parameters (weights)?
- As usual, we define a criterion to measure how well our network performs, and optimize against it
- For regression, training data are (x_n, t_n), t_n ∈ R
- Squared error naturally arises:
  E(w) = Σ_{n=1}^{N} { y(x_n, w) − t_n }²
- For binary classification, this is another discriminative model; maximum likelihood gives:
  p(t|w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}
  E(w) = − Σ_{n=1}^{N} { t_n ln y_n + (1 − t_n) ln(1 − y_n) }
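Both error functions are straightforward to evaluate in code; the sketch below (an addition, not from the slides) assumes y holds the network outputs for all N training cases and t the corresponding targets.

```python
import numpy as np

def squared_error(y, t):
    # E(w) = sum_n (y(x_n, w) - t_n)^2, for regression
    return np.sum((y - t) ** 2)

def cross_entropy(y, t, eps=1e-12):
    # E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ], for binary classification
    y = np.clip(y, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```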
Parameter Optimization
[Figure: error surface E(w) over weights w_1, w_2, showing points w_A, w_B, w_C and the gradient ∇E]
- For either of these problems, the error function E(w) is
nasty
- Nasty = non-convex
- Non-convex = has local minima
Descent Methods
- The typical strategy for optimization problems of this sort is a descent method:
  w^(τ+1) = w^(τ) + ∆w^(τ)
- These come in many flavours
  - Gradient descent: uses ∇E(w^(τ))
  - Stochastic gradient descent: uses ∇E_n(w^(τ))
  - Newton-Raphson (second order): uses ∇²E(w^(τ))
- All of these can be used here; stochastic gradient descent is particularly effective
  - Redundancy in training data, escaping local minima
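A minimal sketch of the stochastic gradient descent variant (added for illustration): the learning rate eta, the number of epochs, and the per-case gradient function grad_En are assumptions, not part of the slides.

```python
import numpy as np

def sgd(w, data, grad_En, eta=0.01, epochs=10, seed=0):
    # w_(tau+1) = w_(tau) - eta * grad E_n(w_(tau)), one training case at a time
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(len(data)):  # visit cases in random order
            w = w - eta * grad_En(w, data[n])
    return w
```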
Computing Gradients
- The function y(xn, w) implemented by a network is
complicated
- It isn’t obvious how to compute error function derivatives
with respect to weights
- Numerical method for calculating error derivatives: use finite differences:
  ∂E_n/∂w_ji ≈ [ E_n(w_ji + ε) − E_n(w_ji − ε) ] / (2ε)
- How much computation would this take with W weights in the network?
  - O(W) per derivative, O(W²) total per gradient descent step
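The central-difference formula is easy to code and is often used to sanity-check analytically computed gradients; the sketch below (an illustration, assuming the weights are flattened into a single vector w and En(w) evaluates the error on one case) also makes the O(W²) cost visible.

```python
import numpy as np

def finite_difference_gradient(En, w, eps=1e-6):
    # Approximate dEn/dw_i by central differences, one weight at a time
    grad = np.zeros_like(w)
    for i in range(w.size):                  # O(W) derivatives ...
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        # ... each needing two O(W) evaluations of En: O(W^2) overall
        grad[i] = (En(w_plus) - En(w_minus)) / (2 * eps)
    return grad
```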
Error Backpropagation
- Backprop is an efficient method for computing error derivatives ∂E_n/∂w_ji
  - O(W) to compute derivatives wrt all weights
- First, feed training example x_n forward through the network, storing all activations a_j
- Calculating derivatives for weights connected to output nodes is easy
  - e.g. for linear output nodes y_k = Σ_i w_ki z_i:
    ∂E_n/∂w_ki = ∂/∂w_ki ½(y_n − t_n)² = (y_n − t_n) z_i
- For hidden layers, propagate error backwards from the output nodes
Chain Rule for Partial Derivatives
- A “reminder”
- For f(x, y), with f differentiable wrt x and y, and x and y differentiable wrt u and v:
  ∂f/∂u = (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u)
  and
  ∂f/∂v = (∂f/∂x)(∂x/∂v) + (∂f/∂y)(∂y/∂v)
Error Backpropagation
- We can write
  ∂E_n/∂w_ji = ∂/∂w_ji E_n(a_{j1}, a_{j2}, . . . , a_{jm})
  where {j1, . . . , jm} are the indices of the nodes in the same layer as node j
- Using the chain rule:
  ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji) + Σ_k (∂E_n/∂a_k)(∂a_k/∂w_ji)
  where k runs over all other nodes in the same layer as node j
- Since a_k does not depend on w_ji, all terms in the summation go to 0:
  ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji)
Error Backpropagation cont.
- Introduce the error δ_j ≡ ∂E_n/∂a_j, so that
  ∂E_n/∂w_ji = δ_j ∂a_j/∂w_ji
- The other factor is:
  ∂a_j/∂w_ji = ∂/∂w_ji ( Σ_k w_jk z_k ) = z_i
Error Backpropagation cont.
- The error δ_j can also be computed using the chain rule:
  δ_j ≡ ∂E_n/∂a_j = Σ_k (∂E_n/∂a_k)(∂a_k/∂a_j) = Σ_k δ_k ∂a_k/∂a_j
  where k runs over all nodes in the layer after node j
- Eventually:
  δ_j = h′(a_j) Σ_k w_kj δ_k
- A weighted sum of the later error “caused” by this weight
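Putting the pieces together, here is a hedged NumPy sketch of backpropagation for the one-hidden-layer network from earlier, with tanh hidden units, linear outputs, and squared error on a single training case; the variable names mirror the slides (a for activations, z for hidden outputs, delta for the errors δ), but the code itself is an illustration, not the original derivation. Its output can be checked against the finite-difference sketch from the Computing Gradients slide.

```python
import numpy as np

def backprop_single_case(x, t, W1, b1, W2, b2):
    # Gradients of E_n = 1/2 * ||y - t||^2 for a tanh-hidden, linear-output network
    # Forward pass, storing activations
    a1 = W1 @ x + b1          # hidden activations a_j
    z = np.tanh(a1)           # hidden outputs z_j = h(a_j)
    y = W2 @ z + b2           # linear outputs y_k

    # Output errors: delta_k = dE_n/da_k = y_k - t_k
    delta_out = y - t
    # Hidden errors: delta_j = h'(a_j) * sum_k w_kj delta_k, with h'(a) = 1 - tanh(a)^2
    delta_hidden = (1.0 - z ** 2) * (W2.T @ delta_out)

    # dE_n/dw = delta * (input to that weight): z_i for the second layer, x_i for the first
    grad_W2 = np.outer(delta_out, z)
    grad_b2 = delta_out
    grad_W1 = np.outer(delta_hidden, x)
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_W2, grad_b2
```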
Applications of Neural Networks
- Many success stories for neural networks
- Credit card fraud detection
- Hand-written digit recognition
- Face detection
- Autonomous driving (CMU ALVINN)
Hand-written Digit Recognition
- MNIST - standard dataset for hand-written digit recognition
- 60000 training, 10000 test images
LeNet-5
[Figure: LeNet-5 architecture. INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: feature maps 6@14x14 (subsampling) → C3: feature maps 16@10x10 (convolutions) → S4: feature maps 16@5x5 (subsampling) → C5: layer 120 → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections)]
- LeNet developed by Yann LeCun et al.
- Convolutional neural network
- Local receptive fields (5x5 connectivity)
- Subsampling (2x2)
- Shared weights (reuse same 5x5 “filter”)
- Breaking symmetry
- See
http://www.codeproject.com/KB/library/NeuralNetRecognition.aspx
[Figure: the misclassified test digits, each labelled true digit > predicted digit]
- The 82 errors made by LeNet-5 (0.82% test error rate)
Conclusion
- Readings: Ch. 5.1, 5.2, 5.3
- Feed-forward networks can be used for regression or
classification
- Similar to linear models, except with adaptive non-linear
basis functions
- These allow us to go beyond, e.g., linear decision boundaries
- Different error functions
- Learning is more difficult, error function not convex
- Use stochastic gradient descent, obtain (good?) local
minimum
- Backpropagation for efficient gradient computation