Neural Networks - Oliver Schulte - CMPT 726 - Bishop PRML Ch. 5
SLIDE 1

Neural Networks

Oliver Schulte - CMPT 726 Bishop PRML Ch. 5

SLIDE 2

Neural Networks

  • Neural networks arise from attempts to model human/animal brains
  • Many models, many claims of biological plausibility
  • We will focus on multi-layer perceptrons
  • Mathematical properties rather than plausibility
  • For biological plausibility, see Prof. Hadley’s CMPT 418
SLIDE 3

Uses of Neural Networks

  • Pros
  • Good for continuous input variables.
  • General continuous function approximators.
  • Highly non-linear.
  • Learn feature functions.
  • Good to use in continuous domains with little prior knowledge:
  • When you don’t know good features.
  • When you don’t know the form of a good functional model.
  • Cons
  • Not interpretable, “black box”.
  • Learning is slow.
  • Good generalization can require many datapoints.
SLIDE 4

Applications

There are many, many applications.

  • World-Champion Backgammon Player.
  • No Hands Across America Tour.
  • Digit Recognition with 99.26% accuracy.
  • ...
SLIDE 5

Outline

  • Feed-forward Networks
  • Network Training
  • Error Backpropagation
  • Applications

SLIDE 7

Feed-forward Networks

  • We have looked at generalized linear models of the form:

    y(x, w) = f( Σ_{j=1}^{M} w_j φ_j(x) )

    for fixed non-linear basis functions φ(·)
  • We now extend this model by allowing adaptive basis functions, and learning their parameters
  • In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of a linear combination of the inputs:

    φ_j(x) = f( Σ_{j=1}^{M} . . . )

SLIDE 9

Feed-forward Networks

  • Starting with input x = (x_1, . . . , x_D), construct linear combinations:

    a_j = Σ_{i=1}^{D} w_ji^(1) x_i + w_j0^(1)

    These a_j are known as activations
  • Pass through an activation function h(·) to get output z_j = h(a_j)
  • Model of an individual neuron (figure from Russell and Norvig, AIMA2e)
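In code, a single unit’s computation is just a weighted sum followed by h(·). A minimal sketch (the function names are illustrative, not from the slides):

```python
import math

def neuron_activation(x, w, b):
    # a_j = sum_i w_ji * x_i + w_j0  (b plays the role of the bias w_j0)
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def neuron_output(x, w, b, h=math.tanh):
    # z_j = h(a_j)
    return h(neuron_activation(x, w, b))
```

With x = (1, 2), w = (0.5, −0.25), and bias 0.1, the activation is 0.5·1 − 0.25·2 + 0.1 = 0.1 and the output is tanh(0.1).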

SLIDE 12

Activation Functions

  • Can use a variety of activation functions
  • Sigmoidal (S-shaped)
  • Logistic sigmoid 1/(1 + exp(−a)) (useful for binary classification)
  • Hyperbolic tangent tanh
  • Radial basis function z_j = Σ_i (x_i − w_ji)²
  • Softmax
  • Useful for multi-class classification
  • Identity
  • Useful for regression
  • Threshold
  • . . .
  • Needs to be differentiable for gradient-based learning (later)
  • Can use different activation functions in each unit
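Two of the listed activation functions, sketched directly from their definitions (names are illustrative; the max-subtraction in softmax is a standard numerical-stability trick, not from the slides):

```python
import math

def logistic(a):
    # logistic sigmoid 1 / (1 + exp(-a)); squashes any real a into (0, 1)
    return 1.0 / (1.0 + math.exp(-a))

def softmax(a):
    # exp(a_k) / sum_j exp(a_j); subtracting max(a) avoids overflow
    m = max(a)
    exps = [math.exp(ak - m) for ak in a]
    s = sum(exps)
    return [e / s for e in exps]
```

Softmax outputs are positive and sum to 1, which is what makes it useful for multi-class classification.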
SLIDE 13

Feed-forward Networks

[Figure: two-layer network. Inputs x_0 (bias), x_1, . . . , x_D; hidden units z_0 (bias), z_1, . . . , z_M; outputs y_1, . . . , y_K; weights w^(1) in the first layer, w^(2) in the second]

  • Connect together a number of these units into a feed-forward network (DAG)
  • Above shows a network with one layer of hidden units
  • Implements function:

    y_k(x, w) = h( Σ_{j=1}^{M} w_kj^(2) h( Σ_{i=1}^{D} w_ji^(1) x_i + w_j0^(1) ) + w_k0^(2) )

 
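The network function above, written as a forward pass. A sketch assuming tanh hidden units and linear (identity) outputs; the row-per-unit weight layout is my choice, not from the slides:

```python
import math

def forward(x, W1, b1, W2, b2, h=math.tanh):
    # hidden layer: z_j = h(sum_i W1[j][i] * x_i + b1[j])
    z = [h(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j])
         for j in range(len(W1))]
    # linear outputs: y_k = sum_j W2[k][j] * z_j + b2[k]
    return [sum(w * zj for w, zj in zip(W2[k], z)) + b2[k]
            for k in range(len(W2))]
```

For example, with identity first-layer weights and an output that sums the two hidden units, the input (0.5, −0.5) gives tanh(0.5) + tanh(−0.5) = 0.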

SLIDE 14

Hidden Units Compute Basis Functions

[Figure: network fit to data]
  • red dots = network function
  • dashed line = hidden unit activation function
  • blue dots = data points
SLIDE 15

Outline

  • Feed-forward Networks
  • Network Training
  • Error Backpropagation
  • Applications

SLIDE 16

Network Training

  • Given a specified network structure, how do we set its parameters (weights)?
  • As usual, we define a criterion to measure how well our network performs, then optimize against it
  • For regression, training data are (x_n, t_n), t_n ∈ ℝ
  • Squared error naturally arises:

    E(w) = Σ_{n=1}^{N} {y(x_n, w) − t_n}²

  • For binary classification, this is another discriminative model; maximum likelihood gives:

    p(t|w) = Π_{n=1}^{N} y_n^{t_n} {1 − y_n}^{1−t_n}

    E(w) = − Σ_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)}
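Both error functions translate directly into code (a sketch; y holds the network outputs y_n, t the targets):

```python
import math

def squared_error(y, t):
    # E(w) = sum_n (y_n - t_n)^2
    return sum((yn - tn) ** 2 for yn, tn in zip(y, t))

def cross_entropy(y, t):
    # E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)], with t_n in {0, 1}
    return -sum(tn * math.log(yn) + (1 - tn) * math.log(1 - yn)
                for yn, tn in zip(y, t))
```

For a single positive example predicted at y = 0.5, the cross-entropy is −ln 0.5 = ln 2.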

SLIDE 19

Parameter Optimization

[Figure: error surface E(w) over weights w_1, w_2, with stationary points w_A, w_B, w_C and gradient ∇E]

  • For either of these problems, the error function E(w) is nasty
  • Nasty = non-convex
  • Non-convex = has local minima
SLIDE 20

Descent Methods

  • The typical strategy for optimization problems of this sort is a descent method:

    w^(τ+1) = w^(τ) + Δw^(τ)

  • These come in many flavours
  • Gradient descent: step along −∇E(w^(τ))
  • Stochastic gradient descent: step along −∇E_n(w^(τ)) for a single training point n
  • Newton-Raphson (second order): also uses the Hessian ∇²E
  • All of these can be used here; stochastic gradient descent is particularly effective
  • Redundancy in training data, escaping local minima
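A stochastic gradient descent loop, sketched on a toy problem (the step size eta, step count, and the example problem are illustrative choices, not from the slides):

```python
import random

def sgd(w, grad_En, n_points, eta=0.01, steps=1000, seed=0):
    # w^(tau+1) = w^(tau) - eta * grad E_n(w^(tau)), with n drawn at random
    rng = random.Random(seed)
    for _ in range(steps):
        n = rng.randrange(n_points)
        w = [wi - eta * gi for wi, gi in zip(w, grad_En(w, n))]
    return w

# toy example: E(w) = sum_n (w - t_n)^2 is minimised at the mean of t
t = [1.0, 2.0, 3.0]
w_fit = sgd([0.0], lambda w, n: [2.0 * (w[0] - t[n])], len(t))
```

With a fixed small step size the iterate does not converge exactly but hovers near the minimum, which is often good enough (and the noise can help escape local minima).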
SLIDE 23

Computing Gradients

  • The function y(x_n, w) implemented by a network is complicated
  • It isn’t obvious how to compute error function derivatives with respect to weights
  • Numerical method for calculating error derivatives: use finite differences:

    ∂E_n/∂w_ji ≈ [E_n(w_ji + ε) − E_n(w_ji − ε)] / (2ε)

  • How much computation would this take with W weights in the network?
  • O(W) per derivative, O(W²) total per gradient descent step
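The finite-difference estimate as code (a sketch; in practice this is used mainly as a correctness check for analytic gradients, precisely because of the O(W²) cost):

```python
def numerical_grad(En, w, eps=1e-6):
    # central differences: dEn/dw_i ~ [En(w_i + eps) - En(w_i - eps)] / (2 eps)
    # two evaluations of En (each O(W)) per weight -> O(W^2) per full gradient
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((En(wp) - En(wm)) / (2.0 * eps))
    return g
```

For E(w) = w_0² + 3 w_1 at w = (2, 5) this returns approximately (4, 3), matching the analytic gradient.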
SLIDE 27

Outline

  • Feed-forward Networks
  • Network Training
  • Error Backpropagation
  • Applications

SLIDE 28

Error Backpropagation

  • Backprop is an efficient method for computing error derivatives ∂E_n/∂w_ji
  • O(W) to compute derivatives wrt all weights
  • First, feed training example x_n forward through the network, storing all activations a_j
  • Calculating derivatives for weights connected to output nodes is easy
  • e.g. For linear output nodes y_k = Σ_i w_ki z_i:

    ∂E_n/∂w_ki = ∂/∂w_ki ½(y_n − t_n)² = (y_n − t_n) z_i

  • For hidden layers, propagate error backwards from the output nodes
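For the linear-output case this derivative is one line of code (a sketch; E_n = ½(y − t)² with y = Σ_i w_i z_i over the hidden outputs z):

```python
def output_layer_grad(z, w, t):
    # y = sum_i w_i * z_i;  dEn/dw_i = (y - t) * z_i
    y = sum(wi * zi for wi, zi in zip(w, z))
    return [(y - t) * zi for zi in z]
```

With z = (1, 2), w = (0.5, −1), t = 1 we get y = −1.5 and gradient (−2.5, −5).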
SLIDE 31

Chain Rule for Partial Derivatives

  • A “reminder”
  • For f(x, y), with f differentiable wrt x and y, and x and y differentiable wrt u and v:

    ∂f/∂u = (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u)
    ∂f/∂v = (∂f/∂x)(∂x/∂v) + (∂f/∂y)(∂y/∂v)

SLIDE 32

Error Backpropagation

  • We can write

    ∂E_n/∂w_ji = ∂/∂w_ji E_n(a_j1, a_j2, . . . , a_jm)

    where j1, . . . , jm are the indices of the nodes in the same layer as node j
  • Using the chain rule:

    ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji) + Σ_k (∂E_n/∂a_k)(∂a_k/∂w_ji)

    where k runs over all other nodes in the same layer as node j
  • Since a_k does not depend on w_ji, all terms in the summation go to 0:

    ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji)

SLIDE 35

Error Backpropagation cont.

  • Introduce error δ_j ≡ ∂E_n/∂a_j, so that

    ∂E_n/∂w_ji = δ_j ∂a_j/∂w_ji

  • Other factor is:

    ∂a_j/∂w_ji = ∂/∂w_ji Σ_k w_jk z_k = z_i

SLIDE 36

Error Backpropagation cont.

  • Error δ_j can also be computed using the chain rule:

    δ_j ≡ ∂E_n/∂a_j = Σ_k (∂E_n/∂a_k)(∂a_k/∂a_j)

    where k runs over all nodes in the layer after node j, and ∂E_n/∂a_k = δ_k
  • Eventually:

    δ_j = h′(a_j) Σ_k w_kj δ_k

  • A weighted sum of the later errors “caused” by this unit
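Putting the pieces together for a one-hidden-layer network. A sketch assuming tanh hidden units (so h′(a) = 1 − tanh²(a) = 1 − z²), linear outputs, and E_n = ½ Σ_k (y_k − t_k)²; layout and names are my choices:

```python
import math

def backprop(x, t, W1, b1, W2, b2):
    # forward pass, storing activations a_j and hidden outputs z_j
    a = [sum(w * xi for w, xi in zip(W1[j], x)) + b1[j] for j in range(len(W1))]
    z = [math.tanh(aj) for aj in a]
    y = [sum(w * zj for w, zj in zip(W2[k], z)) + b2[k] for k in range(len(W2))]
    # output errors for linear units: delta_k = y_k - t_k
    d2 = [yk - tk for yk, tk in zip(y, t)]
    # hidden errors: delta_j = h'(a_j) * sum_k w_kj * delta_k
    d1 = [(1.0 - z[j] ** 2) * sum(W2[k][j] * d2[k] for k in range(len(W2)))
          for j in range(len(W1))]
    # dEn/dw = delta * input to that weight; bias gradients are the deltas
    gW2 = [[d2[k] * zj for zj in z] for k in range(len(W2))]
    gW1 = [[d1[j] * xi for xi in x] for j in range(len(W1))]
    return gW1, d1, gW2, d2
```

A useful sanity check is to compare these analytic derivatives against the finite-difference estimate from the earlier slide.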
SLIDE 38

Outline

  • Feed-forward Networks
  • Network Training
  • Error Backpropagation
  • Applications

SLIDE 39

Applications of Neural Networks

  • Many success stories for neural networks
  • Credit card fraud detection
  • Hand-written digit recognition
  • Face detection
  • Autonomous driving (CMU ALVINN)
SLIDE 40

Hand-written Digit Recognition

  • MNIST - standard dataset for hand-written digit recognition
  • 60000 training, 10000 test images
SLIDE 41

LeNet-5

[Figure: LeNet-5 architecture. INPUT 32×32 → (convolutions) C1: feature maps 6@28×28 → (subsampling) S2: f. maps 6@14×14 → (convolutions) C3: f. maps 16@10×10 → (subsampling) S4: f. maps 16@5×5 → C5: layer 120 → F6: layer 84 (full connections) → OUTPUT 10 (Gaussian connections)]

  • LeNet developed by Yann LeCun et al.
  • Convolutional neural network
  • Local receptive fields (5x5 connectivity)
  • Subsampling (2x2)
  • Shared weights (reuse same 5x5 “filter”)
  • Breaking symmetry
  • See

http://www.codeproject.com/KB/library/NeuralNetRecognition.aspx
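The two LeNet building blocks, sketched on toy sizes (LeNet itself uses 5×5 receptive fields and 2×2 subsampling; as in most CNNs, the "convolution" below is really cross-correlation):

```python
def conv2d_valid(img, kernel):
    # every output cell applies the *same* shared kernel weights to a local
    # receptive field; e.g. a 32x32 input with a 5x5 kernel yields 28x28
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(img[r + u][c + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for c in range(len(img[0]) - kw + 1)]
            for r in range(len(img) - kh + 1)]

def subsample_2x2(fmap):
    # average over non-overlapping 2x2 blocks, halving each spatial dimension
    return [[(fmap[2*r][2*c] + fmap[2*r][2*c+1] +
              fmap[2*r+1][2*c] + fmap[2*r+1][2*c+1]) / 4.0
             for c in range(len(fmap[0]) // 2)]
            for r in range(len(fmap) // 2)]
```

Weight sharing is what breaks the symmetry problem differently from fully-connected layers: one 5×5 filter is reused at every image position, so the layer has few parameters regardless of image size.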

SLIDE 42

[Figure: the 82 test digits misclassified by LeNet-5, each labelled true>predicted]

  • The 82 errors made by LeNet5 (0.82% test error rate)
SLIDE 43

Conclusion

  • Readings: Ch. 5.1, 5.2, 5.3
  • Feed-forward networks can be used for regression or classification
  • Similar to linear models, except with adaptive non-linear basis functions
  • These allow us to do more than e.g. linear decision boundaries
  • Different error functions
  • Learning is more difficult, error function not convex
  • Use stochastic gradient descent, obtain (good?) local minimum
  • Backpropagation for efficient gradient computation