SLIDE 1

Artificial Neural Networks

  • Threshold units
  • Gradient descent
  • Multilayer networks
  • Backpropagation
  • Hidden layer representations
  • Example: Face Recognition
  • Advanced topics

SLIDE 2

Connectionist Models

Consider humans:

  • Neuron switching time ≈ .001 second
  • Number of neurons ≈ 10^10
  • Connections per neuron ≈ 10^4–10^5
  • Scene recognition time ≈ .1 second
  • 100 inference steps doesn’t seem like enough

→ much parallel computation

Properties of artificial neural nets (ANNs):

  • Many neuron-like threshold switching units
  • Many weighted interconnections among units
  • Highly parallel, distributed processing
  • Emphasis on tuning weights automatically

SLIDE 3

When to Consider Neural Networks

  • Input is high-dimensional discrete or real-valued (e.g., raw sensor input)

  • Output is discrete or real-valued
  • Output is a vector of values
  • Possibly noisy data
  • Form of target function is unknown
  • Human readability of result is unimportant

Examples:

  • Speech phoneme recognition [Waibel]
  • Image classification [Kanade, Baluja, Rowley]
  • Financial prediction

SLIDE 4

ALVINN drives 70 mph on highways

SLIDE 5

Perceptron

[Figure: a perceptron. Inputs x1, x2, . . . , xn with weights w1, w2, . . . , wn (plus x0 = 1 with bias weight w0) feed a summation unit Σ, followed by a threshold.]

o = 1 if Σ_{i=0}^{n} wi xi > 0, and −1 otherwise (with x0 = 1). Written out:

o(x1, . . . , xn) =
    1 if w0 + w1x1 + · · · + wnxn > 0
    −1 otherwise

Sometimes we’ll use simpler vector notation:

o(x) =
    1 if w · x > 0
    −1 otherwise
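The thresholded output above translates directly into code; a minimal sketch (the function name is illustrative):

```python
def perceptron_output(w, x):
    """Perceptron output: 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.
    w has n+1 entries; w[0] is the bias weight w0 (its input x0 is fixed at 1)."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1
```

For example, `perceptron_output([-0.5, 1.0], [1.0])` returns 1 (net = 0.5 > 0), while `perceptron_output([-0.5, 1.0], [0.0])` returns −1.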

SLIDE 6

Decision Surface of a Perceptron

[Figure: (a) a linearly separable set of + and − examples in the (x1, x2) plane, split by a perceptron’s decision line; (b) a set of examples that no single line can separate.]
Represents some useful functions

  • What weights represent g(x1, x2) = AND(x1, x2)?

But some functions not representable

  • e.g., not linearly separable
  • Therefore, we’ll want networks of these...
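One answer to the AND question above (for inputs in {0, 1}; other weight choices work too): take w0 = −1.5 and w1 = w2 = 1, so the net input exceeds 0 only when both inputs are 1. A quick check:

```python
def perceptron_output(w, x):
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1

w_and = [-1.5, 1.0, 1.0]  # w0 = -1.5, w1 = w2 = 1
for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), perceptron_output(w_and, [x1, x2]))
# only (1, 1) yields 1; the other three inputs yield -1
```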

SLIDE 7

Perceptron training rule

wi ← wi + Δwi

where Δwi = η(t − o)xi

Where:

  • t = c(x) is target value

  • o is perceptron output
  • η is small constant (e.g., .1) called learning rate
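A minimal sketch of the rule in code (the learning rate and epoch cap are illustrative; when o = t the update is zero, so for separable data the weights eventually stop changing):

```python
def train_perceptron(examples, eta=0.1, epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    examples: list of (x, t) pairs with t in {-1, +1};
    w[0] is the bias weight w0 (its input x0 is fixed at 1)."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in examples:
            net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if net > 0 else -1
            w[0] += eta * (t - o)              # x0 = 1
            for i, xi in enumerate(x):
                w[i + 1] += eta * (t - o) * xi
    return w
```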

SLIDE 8

Perceptron training rule

Can prove it will converge

  • If training data is linearly separable
  • and η sufficiently small

SLIDE 9

Gradient Descent

To understand, consider simpler linear unit, where

o = w0 + w1x1 + · · · + wnxn

Let’s learn wi’s that minimize the squared error

E[w] ≡ ½ Σ_{d∈D} (td − od)²

Where D is set of training examples

SLIDE 10

Gradient Descent

[Figure: the error surface E[w] plotted over weights (w0, w1); for a linear unit it is a paraboloid with a single global minimum.]

Gradient:

∇E[w] ≡ [ ∂E/∂w0, ∂E/∂w1, · · · , ∂E/∂wn ]

Training rule:

Δw = −η ∇E[w]

i.e.,

Δwi = −η ∂E/∂wi

SLIDE 11

Gradient Descent

∂E/∂wi = ∂/∂wi ½ Σ_d (td − od)²
       = ½ Σ_d ∂/∂wi (td − od)²
       = ½ Σ_d 2(td − od) ∂/∂wi (td − od)
       = Σ_d (td − od) ∂/∂wi (td − w · xd)

∂E/∂wi = Σ_d (td − od)(−xi,d)
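As a sanity check on the final expression, the analytic gradient Σ_d (td − od)(−xi,d) can be compared against a finite-difference estimate of E (the data and weights below are made up for illustration):

```python
def error(w, data):
    """E[w] = 1/2 * sum over d of (t_d - o_d)^2 for a linear unit o = w . x."""
    return 0.5 * sum((t - sum(wi * xi for wi, xi in zip(w, x))) ** 2
                     for x, t in data)

def analytic_grad(w, data, i):
    """dE/dw_i = sum over d of (t_d - o_d) * (-x_{i,d})."""
    return sum((t - sum(wi * xi for wi, xi in zip(w, x))) * (-x[i])
               for x, t in data)

data = [([1.0, 2.0], 1.0), ([1.0, -1.0], -2.0)]   # each x includes x0 = 1
w = [0.3, -0.7]
eps = 1e-6
for i in range(2):
    wp = w[:]; wp[i] += eps
    wm = w[:]; wm[i] -= eps
    numeric = (error(wp, data) - error(wm, data)) / (2 * eps)
    assert abs(numeric - analytic_grad(w, data, i)) < 1e-6
```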

SLIDE 12

Gradient Descent

Gradient-Descent(training_examples, η)

Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values, and t is the target output value. η is the learning rate (e.g., .05).

  • Initialize each wi to some small random value
  • Until the termination condition is met, Do
    – Initialize each Δwi to zero
    – For each ⟨x, t⟩ in training_examples, Do
      ∗ Input the instance x to the unit and compute the output o
      ∗ For each linear unit weight wi, Do
        Δwi ← Δwi + η(t − o)xi
    – For each linear unit weight wi, Do
      wi ← wi + Δwi
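The Gradient-Descent procedure above, transcribed as a sketch (the learning rate, epoch-count termination test, and seed are illustrative choices):

```python
import random

def gradient_descent(training_examples, eta=0.05, epochs=1000, seed=0):
    """Batch gradient descent for a linear unit o = w . x (x includes x0 = 1).
    Accumulates Delta_wi over all examples, then applies it once per pass."""
    rnd = random.Random(seed)
    n = len(training_examples[0][0])
    w = [rnd.uniform(-0.05, 0.05) for _ in range(n)]   # small random values
    for _ in range(epochs):                            # termination condition
        delta = [0.0] * n
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]
        for i in range(n):
            w[i] += delta[i]
    return w
```

For instance, on examples generated from t = 2·x1 − 1 the learned weights approach w0 ≈ −1, w1 ≈ 2.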

SLIDE 13

Summary

Perceptron training rule guaranteed to succeed if

  • Training examples are linearly separable
  • Sufficiently small learning rate η

Linear unit training rule uses gradient descent

  • Guaranteed to converge to hypothesis with minimum squared error

  • Given sufficiently small learning rate η
  • Even when training data contains noise
  • Even when training data not separable by H

SLIDE 14

Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent: Do until satisfied
  1. Compute the gradient ∇E_D[w]
  2. w ← w − η ∇E_D[w]

Incremental mode Gradient Descent: Do until satisfied
  • For each training example d in D
    1. Compute the gradient ∇E_d[w]
    2. w ← w − η ∇E_d[w]

E_D[w] ≡ ½ Σ_{d∈D} (td − od)²

E_d[w] ≡ ½ (td − od)²

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η made small enough
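The incremental variant moves the weight update inside the per-example loop, following ∇E_d instead of ∇E_D; a sketch (learning rate and epoch cap are illustrative):

```python
def incremental_gradient_descent(training_examples, eta=0.05, epochs=1000):
    """Stochastic gradient descent for a linear unit o = w . x (x includes
    x0 = 1): one weight update per example, using the gradient of E_d."""
    n = len(training_examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                w[i] += eta * (t - o) * x[i]   # w <- w - eta * grad(E_d)
    return w
```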

SLIDE 15

Multilayer Networks of Sigmoid Units

SLIDE 16

Learning Complex Concepts with Gradient Descent

  • Threshold Units
    – Complex decision surfaces
    – But, cannot differentiate threshold rule
  • Linear Units
    – Differentiable
    – But, networks can only learn linear functions

Need a non-linear, differentiable threshold function

SLIDE 17

Sigmoid Unit

[Figure: a sigmoid unit. Inputs x1, . . . , xn with weights w1, . . . , wn (plus x0 = 1 with bias weight w0) feed a summation unit Σ, followed by the sigmoid.]

net = Σ_{i=0}^{n} wi xi,    o = σ(net) = 1 / (1 + e^(−net))

σ(x) is the sigmoid function:

σ(x) = 1 / (1 + e^(−x))

Nice property:

dσ(x)/dx = σ(x)(1 − σ(x))

We can derive gradient descent rules to train

  • One sigmoid unit
  • Multilayer networks of sigmoid units → Backpropagation
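The sigmoid and its “nice property” derivative, as a sketch in code:

```python
import math

def sigmoid(x):
    """The sigmoid function: sigma(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    """d sigma / dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

For example, sigmoid(0) = 0.5 and sigmoid_deriv(0) = 0.25, and a finite-difference estimate of the slope matches the closed form.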

SLIDE 18

Error Gradient for a Sigmoid Unit

∂E/∂wi = ∂/∂wi ½ Σ_{d∈D} (td − od)²
       = ½ Σ_d ∂/∂wi (td − od)²
       = ½ Σ_d 2(td − od) ∂/∂wi (td − od)
       = Σ_d (td − od) (−∂od/∂wi)
       = −Σ_d (td − od) (∂od/∂netd)(∂netd/∂wi)

But we know:

∂od/∂netd = ∂σ(netd)/∂netd = od(1 − od)

∂netd/∂wi = ∂(w · xd)/∂wi = xi,d

So:

∂E/∂wi = −Σ_{d∈D} (td − od) od(1 − od) xi,d

SLIDE 19

Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, Do

  • For each training example, Do
    1. Input the training example to the network and compute the network outputs
    2. For each output unit k
       δk ← ok(1 − ok)(tk − ok)
    3. For each hidden unit h
       δh ← oh(1 − oh) Σ_{k∈outputs} wh,k δk
    4. Update each network weight wi,j:
       wi,j ← wi,j + Δwi,j   where   Δwi,j = η δj xi,j
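A minimal transcription of the algorithm for one hidden layer of sigmoid units (the network sizes, learning rate, epoch count, and the OR training data in the test are illustrative choices):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hidden, w_out, x):
    """Forward pass. Each row of a weight matrix holds one unit's incoming
    weights; index 0 is the bias weight (its input is fixed at 1)."""
    xb = [1.0] + list(x)
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = [1.0] + h
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return h, o

def backprop(examples, n_in, n_hidden, n_out, eta=0.5, epochs=3000, seed=0):
    rnd = random.Random(seed)
    w_hidden = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [[rnd.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
             for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            # 1. compute the network outputs
            h, o = forward(w_hidden, w_out, x)
            # 2. output unit deltas: delta_k = o_k (1 - o_k)(t_k - o_k)
            d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # 3. hidden unit deltas: delta_h = o_h (1 - o_h) sum_k w_hk delta_k
            d_hid = [h[j] * (1 - h[j]) *
                     sum(w_out[k][j + 1] * d_out[k] for k in range(n_out))
                     for j in range(n_hidden)]
            # 4. weight updates: w_ij <- w_ij + eta * delta_j * x_ij
            hb = [1.0] + h
            xb = [1.0] + list(x)
            for k in range(n_out):
                for i in range(n_hidden + 1):
                    w_out[k][i] += eta * d_out[k] * hb[i]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hidden[j][i] += eta * d_hid[j] * xb[i]
    return w_hidden, w_out
```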

SLIDE 20

More on Backpropagation

  • Gradient descent over entire network weight vector
  • Easily generalized to arbitrary directed graphs
  • Will find a local, not necessarily global error minimum
    – In practice, often works well (can run multiple times)

  • Often include weight momentum α

Δwi,j(n) = ηδjxi,j + αΔwi,j(n − 1)

  • Minimizes error over training examples
    – Will it generalize well to subsequent examples?

  • Training can take thousands of iterations → slow!
  • Using network after training is very fast
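The momentum term above keeps a per-weight running Δw; a sketch, with `grad_term` standing in for δj·xi,j (all names and constants are illustrative):

```python
def momentum_step(w, grad_term, prev_dw, eta=0.1, alpha=0.9):
    """One update with momentum: dw(n) = eta * grad_term + alpha * dw(n-1).
    grad_term[i] plays the role of delta_j * x_ij in the backprop rule."""
    dw = [eta * g + alpha * p for g, p in zip(grad_term, prev_dw)]
    w = [wi + di for wi, di in zip(w, dw)]
    return w, dw

# repeated identical gradients: the step grows toward eta / (1 - alpha)
w, dw = [0.0], [0.0]
for _ in range(3):
    w, dw = momentum_step(w, [1.0], dw)
```

With α = 0.9 the steady-state step on a constant gradient is ten times the plain gradient step, which is what lets momentum roll through small plateaus.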

SLIDE 21

Learning Hidden Layer Representations

[Figure: an 8 × 3 × 8 network: 8 input units, 3 hidden units, 8 output units.]

A target function:

Input    →  Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001

Can this be learned??

SLIDE 22

Learning Hidden Layer Representations

A network:

[Figure: the 8 × 3 × 8 network of inputs, hidden units, and outputs.]

Learned hidden layer representation:

Input       Hidden Values      Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
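Rounding each learned hidden vector in the table to 0/1 suggests the network has invented a distinct 3-bit code for each of the eight inputs; a quick check:

```python
# learned hidden values from the table above (one row per input pattern)
hidden = [
    [.89, .04, .08], [.01, .11, .88], [.01, .97, .27], [.99, .97, .71],
    [.03, .05, .02], [.22, .99, .99], [.80, .01, .98], [.60, .94, .01],
]
codes = ["".join("1" if v >= 0.5 else "0" for v in h) for h in hidden]
print(codes)
print(len(set(codes)))  # 8 distinct codes: a compact binary encoding
```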

SLIDE 23

Training

[Figure: sum of squared errors for each output unit, plotted over 2500 training epochs.]

SLIDE 24

Training

[Figure: evolution of the hidden unit encoding for input 01000000 over 2500 training epochs.]

SLIDE 25

Training

[Figure: weights from the inputs to one hidden unit, evolving over 2500 training epochs.]

SLIDE 26

Convergence of Backpropagation

Gradient descent to some local minimum

  • Perhaps not global minimum...
  • Add momentum
  • Stochastic gradient descent
  • Train multiple nets with different initial weights

Nature of convergence

  • Initialize weights near zero
  • Therefore, initial networks near-linear
  • Increasingly non-linear functions possible as training progresses

SLIDE 27

Expressive Capabilities of ANNs

Boolean functions:

  • Every boolean function can be represented by network with single hidden layer
  • but might require exponential (in number of inputs) hidden units

Continuous functions:

  • Every bounded continuous function can be approximated with arbitrarily small error, by network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

SLIDE 28

Overfitting in ANNs

[Figure: error versus number of weight updates (examples 1 and 2), plotting training set error and validation set error for each.]

28

slide-29
SLIDE 29

Neural Nets for Face Recognition

[Figure: the face-recognition network: 30×32 image inputs feeding hidden units, with four outputs (left, strt, rght, up). Typical input images are shown.]

90% accurate learning head pose, and recognizing 1-of-20 faces

29

slide-30
SLIDE 30

Learned Hidden Unit Weights

[Figure: learned weights between the 30×32 inputs and the hidden units, alongside typical input images; outputs are left, strt, rght, up.]

http://www.cs.cmu.edu/~tom/faces.html

30

slide-31
SLIDE 31

Alternative Error Functions

Penalize large weights:

E(w) ≡ ½ Σ_{d∈D} Σ_{k∈outputs} (tkd − okd)² + γ Σ_{i,j} w²_{ji}

Train on target slopes as well as values:

E(w) ≡ ½ Σ_{d∈D} Σ_{k∈outputs} [ (tkd − okd)² + μ Σ_{j∈inputs} ( ∂tkd/∂x_j^d − ∂okd/∂x_j^d )² ]

Tie together weights:

  • e.g., in phoneme recognition network
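The weight penalty γ Σ w²_{ji} contributes 2γw_{ji} to each partial derivative ∂E/∂w_{ji}, so gradient descent on the penalized error just adds a shrink-toward-zero term to every update; a sketch (names and constants are illustrative):

```python
def update_with_weight_decay(w, grad, eta=0.05, gamma=0.001):
    """Gradient step on E(w) + gamma * sum_ji w_ji^2: the penalty adds
    2 * gamma * w_ji to dE/dw_ji, shrinking every weight toward zero."""
    return [wi - eta * (gi + 2 * gamma * wi) for wi, gi in zip(w, grad)]

# with a zero error gradient, only the decay acts: each weight shrinks
# by a factor of (1 - 2 * eta * gamma)
w = update_with_weight_decay([1.0, -2.0], [0.0, 0.0], eta=0.5, gamma=0.1)
```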

SLIDE 32

Recurrent Networks

[Figure: (a) a feedforward network computing y(t + 1) from x(t); (b) a recurrent network whose context units c(t) feed hidden-unit activations back in as inputs; (c) the recurrent network unfolded in time, with copies at steps t − 2, t − 1, and t producing y(t + 1).]
