

slide-1
SLIDE 1

Neural networks

Chapter 20

Chapter 20 1

slide-2
SLIDE 2

Outline

♦ Brains
♦ Neural networks
♦ Perceptrons
♦ Multilayer networks
♦ Applications of neural networks

Chapter 20 2

slide-3
SLIDE 3

Brains

10¹¹ neurons of > 20 types, 10¹⁴ synapses, 1ms–10ms cycle time
Signals are noisy “spike trains” of electrical potential

[Figure: anatomy of a neuron — cell body (soma) with nucleus, dendrites, axon with axonal arborization, and synapses connecting to axons from other cells.]

Chapter 20 3

slide-4
SLIDE 4

McCulloch–Pitts “unit”

Output is a “squashed” linear function of the inputs:

ai ← g(ini) = g(Σj Wj,i aj)

[Figure: a single unit. Input links deliver activations aj, each weighted by Wj,i; a fixed bias input a0 = −1 enters with bias weight W0,i. The input function computes ini = Σj Wj,i aj, the activation function g gives the output ai = g(ini), which is passed along the output links.]
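A minimal Python sketch of one such unit (the helper names and the choice of a sigmoid for g are my own assumptions, not from the slides):

import math

def unit_output(weights, activations, g):
    # Computes ai = g(ini) with ini = sum_j Wj,i * aj.
    # weights[0] is the bias weight W0,i; activations[0] is the fixed bias input a0 = -1.
    in_i = sum(w * a for w, a in zip(weights, activations))
    return g(in_i)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Example call with two real inputs plus the bias input:
print(unit_output([1.5, 1.0, 1.0], [-1.0, 1.0, 1.0], sigmoid))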

Chapter 20 4

slide-5
SLIDE 5

Activation functions

[Figure: two activation functions g(ini) plotted against ini, each rising to +1 — (a) a hard step, (b) a smooth sigmoid.]

(a) is a step function or threshold function
(b) is a sigmoid function 1/(1 + e−x)

Changing the bias weight W0,i moves the threshold location
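The two choices of g in Python, for reference (a small sketch; the function names are mine):

import math

def step(x):
    # Hard threshold: output +1 once the total input reaches 0.
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    # Soft threshold 1 / (1 + e^-x).
    return 1.0 / (1.0 + math.exp(-x))

# With the fixed bias input a0 = -1, the unit computes g(-W0,i + sum of the other
# weighted inputs), so the step unit fires exactly when the weighted inputs reach
# W0,i: changing the bias weight moves the threshold location.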

Chapter 20 5

slide-6
SLIDE 6

Implementing logical functions

McCulloch and Pitts: every Boolean function can be implemented (with a large enough network)

AND? OR? NOT? MAJORITY?

Chapter 20 6

slide-7
SLIDE 7

Implementing logical functions

McCulloch and Pitts: every Boolean function can be implemented (with a large enough network)

AND: W0 = 1.5, W1 = 1, W2 = 1

OR: W0 = 0.5, W1 = 1, W2 = 1

NOT: W0 = −0.5, W1 = −1
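These settings can be checked with a tiny Python script (a sketch; the step unit and helper names are mine, using the bias input a0 = −1 as above):

from itertools import product

def threshold_unit(w0, weights, inputs):
    # Fires (returns 1) when sum_j Wj*xj >= W0, i.e. a step unit with bias input -1.
    total = -w0 + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0

AND = lambda x1, x2: threshold_unit(1.5, [1, 1], [x1, x2])
OR  = lambda x1, x2: threshold_unit(0.5, [1, 1], [x1, x2])
NOT = lambda x1:     threshold_unit(-0.5, [-1], [x1])

for x1, x2 in product((0, 1), repeat=2):
    print(x1, x2, AND(x1, x2), OR(x1, x2))   # truth tables for AND and OR
print(NOT(0), NOT(1))                        # 1 0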

Chapter 20 7

slide-8
SLIDE 8

Network structures

Feed-forward networks:
– single-layer perceptrons
– multi-layer networks
Feed-forward networks implement functions, have no internal state

Recurrent networks:
– Hopfield networks have symmetric weights (Wi,j = Wj,i); g(x) = sign(x), ai = ±1; holographic associative memory
– Boltzmann machines use stochastic activation functions, ≈ MCMC in BNs
– recurrent neural nets have directed cycles with delays ⇒ have internal state (like flip-flops), can oscillate etc.

Chapter 20 8

slide-9
SLIDE 9

Feed-forward example

[Figure: a network with input units 1 and 2, hidden units 3 and 4, and output unit 5; weights W1,3, W1,4, W2,3, W2,4 on the input–hidden links and W3,5, W4,5 on the hidden–output links.]

Feed-forward network = a parameterized family of nonlinear functions:

a5 = g(W3,5 · a3 + W4,5 · a4)
   = g(W3,5 · g(W1,3 · a1 + W2,3 · a2) + W4,5 · g(W1,4 · a1 + W2,4 · a2))
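The same expression transcribed directly into Python (the weight values below are invented, purely to make the call concrete):

import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))   # sigmoid activation

def a5(a1, a2, W):
    # Output of the 2-input, 2-hidden-unit, 1-output network in the figure.
    a3 = g(W["1,3"] * a1 + W["2,3"] * a2)
    a4 = g(W["1,4"] * a1 + W["2,4"] * a2)
    return g(W["3,5"] * a3 + W["4,5"] * a4)

W = {"1,3": 0.5, "2,3": -0.4, "1,4": 0.3, "2,4": 0.8, "3,5": 1.2, "4,5": -0.7}
print(a5(1.0, 0.0, W))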

Chapter 20 9

slide-10
SLIDE 10

Perceptrons

[Figure: (left) a single-layer perceptron — input units connected directly to output units by weights Wj,i; (right) the perceptron output plotted as a function of two inputs x1 and x2, with values between 0 and 1.]

Chapter 20 10

slide-11
SLIDE 11

Expressiveness of perceptrons

Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)

Can represent AND, OR, NOT, majority, etc.

Represents a linear separator in input space:

Σj Wj xj > 0   or   W · x > 0

[Figure: the input space (I1, I2) for three Boolean functions — (a) I1 and I2, (b) I1 or I2, (c) I1 xor I2. In (a) and (b) a straight line separates the positive from the negative points; in (c) no such line exists, so xor is not linearly separable.]
Chapter 20 11

slide-12
SLIDE 12

Perceptron learning

Learn by adjusting weights to reduce error on training set

The squared error for an example with input x and true output y is

E = ½ Err² ≡ ½ (y − hW(x))²

Chapter 20 12

slide-13
SLIDE 13

Perceptron learning

Learn by adjusting weights to reduce error on training set

The squared error for an example with input x and true output y is

E = ½ Err² ≡ ½ (y − hW(x))²

Perform optimization search by gradient descent:

∂E/∂Wj = ?

Chapter 20 13

slide-14
SLIDE 14

Perceptron learning

Learn by adjusting weights to reduce error on training set

The squared error for an example with input x and true output y is

E = ½ Err² ≡ ½ (y − hW(x))²

Perform optimization search by gradient descent:

∂E/∂Wj = Err × ∂Err/∂Wj = Err × ∂/∂Wj (y − g(Σj=0..n Wj xj))

Chapter 20 14

slide-15
SLIDE 15

Perceptron learning

Learn by adjusting weights to reduce error on training set

The squared error for an example with input x and true output y is

E = ½ Err² ≡ ½ (y − hW(x))²

Perform optimization search by gradient descent:

∂E/∂Wj = Err × ∂Err/∂Wj = Err × ∂/∂Wj (y − g(Σj=0..n Wj xj))
       = −Err × g′(in) × xj

Chapter 20 15

slide-16
SLIDE 16

Perceptron learning

Learn by adjusting weights to reduce error on training set

The squared error for an example with input x and true output y is

E = ½ Err² ≡ ½ (y − hW(x))²

Perform optimization search by gradient descent:

∂E/∂Wj = Err × ∂Err/∂Wj = Err × ∂/∂Wj (y − g(Σj=0..n Wj xj))
       = −Err × g′(in) × xj

Simple weight update rule: Wj ← Wj + α × Err × g′(in) × xj

E.g., +ve error ⇒ increase network output ⇒ increase weights on +ve inputs, decrease on −ve inputs

Chapter 20 16

slide-17
SLIDE 17

Perceptron learning

W = random initial values
for iter = 1 to T
    for i = 1 to N (all examples)
        x = input for example i
        y = output for example i
        Wold = W
        Err = y − g(Wold · x)
        for j = 1 to M (all weights)
            Wj = Wj + α · Err · g′(Wold · x) · xj
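A runnable Python version of this loop, assuming g is the sigmoid (the data set, learning rate, and epoch count below are my own choices for illustration):

import math, random

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)

def train_perceptron(examples, alpha=0.5, epochs=1000):
    # examples: list of (x, y) pairs where x already includes the bias input x0 = -1.
    m = len(examples[0][0])
    W = [random.uniform(-0.5, 0.5) for _ in range(m)]
    for _ in range(epochs):
        for x, y in examples:
            in_old = sum(w * xi for w, xi in zip(W, x))   # Wold . x
            err = y - g(in_old)
            W = [w + alpha * err * g_prime(in_old) * xi for w, xi in zip(W, x)]
    return W

# Learn OR (linearly separable); each input vector starts with the bias input -1.
data = [([-1, 0, 0], 0), ([-1, 0, 1], 1), ([-1, 1, 0], 1), ([-1, 1, 1], 1)]
W = train_perceptron(data)
print([round(g(sum(w * xi for w, xi in zip(W, x)))) for x, _ in data])   # should print [0, 1, 1, 1]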

Chapter 20 17

slide-18
SLIDE 18

Perceptron learning contd.

Derivative of sigmoid g(x) can be written in simple form:

g(x) = 1/(1 + e−x)

g′(x) = ?

Chapter 20 18

slide-19
SLIDE 19

Perceptron learning contd.

Derivative of sigmoid g(x) can be written in simple form:

g(x) = 1/(1 + e−x)

g′(x) = e−x/(1 + e−x)² = e−x g(x)²

Also, g(x) = 1/(1 + e−x) ⇒ g(x) + e−x g(x) = 1 ⇒ e−x = (1 − g(x))/g(x)

So g′(x) = [(1 − g(x))/g(x)] g(x)² = (1 − g(x)) g(x)
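A quick numerical sanity check of this identity (a throwaway script, not part of the slides):

import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def g_prime_closed_form(x):
    return (1.0 - g(x)) * g(x)

def g_prime_numeric(x, h=1e-6):
    # Central-difference estimate of the derivative.
    return (g(x + h) - g(x - h)) / (2 * h)

for x in (-2.0, 0.0, 3.0):
    print(x, g_prime_closed_form(x), g_prime_numeric(x))   # the two estimates agree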

Chapter 20 19

slide-20
SLIDE 20

Perceptron learning contd.

Perceptron learning rule converges to a consistent function for any linearly separable data set

[Figure: proportion correct on test set vs. training set size, comparing perceptron and decision tree — (left) MAJORITY on 11 inputs, (right) RESTAURANT data.]

Chapter 20 20

slide-21
SLIDE 21

Multilayer networks

Layers are usually fully connected; numbers of hidden units typically chosen by hand

[Figure: a multilayer network — input units with activations ak, connected by weights Wk,j to hidden units with activations aj, connected by weights Wj,i to output units with activations ai.]

Chapter 20 21

slide-22
SLIDE 22

Expressiveness of MLPs

All continuous functions w/ 1 hidden layer, all functions w/ 2 hidden layers

[Figure: two example output surfaces hW(x1, x2) of a small multilayer network plotted over the inputs x1 and x2, illustrating the nonlinear functions it can represent.]

Chapter 20 22

slide-23
SLIDE 23

Training a MLP

In general have n output nodes,

E ≡ ½ Σi Erri² ,

where Erri = (yi − ai) and the sum Σi runs over all nodes in the output layer.

Need to calculate ∂E/∂Wij for any Wij.

Chapter 20 23

slide-24
SLIDE 24

Training a MLP cont.

Can approximate derivatives by:

f′(x) ≈ (f(x + h) − f(x)) / h

∂E/∂Wij (W) ≈ (E(W + (0, . . . , h, . . . , 0)) − E(W)) / h

What would this entail for a network with n weights?

Chapter 20 24

slide-25
SLIDE 25

Training a MLP cont.

Can approximate derivatives by:

f′(x) ≈ (f(x + h) − f(x)) / h

∂E/∂Wij (W) ≈ (E(W + (0, . . . , h, . . . , 0)) − E(W)) / h

What would this entail for a network with n weights?

  • one iteration would take O(n²) time: each of the n weights needs its own evaluation of E, and each evaluation of E costs O(n)

Complicated networks have tens of thousands of weights, so O(n²) time is intractable. Back-propagation is a recursive method of calculating all of these derivatives in O(n) time.
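The finite-difference idea in Python (a sketch; loss stands for the error E as a function of the whole weight vector, and the quadratic example is made up):

def numerical_gradient(loss, W, h=1e-5):
    # Approximates dE/dWi by perturbing one weight at a time.
    # Each of the len(W) perturbations needs a full evaluation of loss,
    # which itself touches every weight -- hence O(n^2) work per gradient.
    base = loss(W)
    grad = []
    for i in range(len(W)):
        W_perturbed = list(W)
        W_perturbed[i] += h
        grad.append((loss(W_perturbed) - base) / h)
    return grad

print(numerical_gradient(lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2, [0.0, 0.0]))
# approximately [-2.0, 4.0]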

Chapter 20 25

slide-26
SLIDE 26

Back-propagation learning

In general have n output nodes,

E ≡ ½ Σi Erri² ,

where Erri = (yi − ai) and the sum Σi runs over all nodes in the output layer.

Output layer: same as for single-layer perceptron,

Wj,i ← Wj,i + α × aj × ∆i   where ∆i = Erri × g′(ini)

Hidden layers: back-propagate the error from the output layer:

∆j = g′(inj) Σi Wj,i ∆i .

Update rule for weights in hidden layers: Wk,j ← Wk,j + α × ak × ∆j .

Chapter 20 26

slide-27
SLIDE 27

Back-propagation derivation

For a node i in the output layer:

∂E/∂Wj,i = −(yi − ai) ∂ai/∂Wj,i

Chapter 20 27

slide-28
SLIDE 28

Back-propagation derivation

For a node i in the output layer:

∂E/∂Wj,i = −(yi − ai) ∂ai/∂Wj,i
         = −(yi − ai) ∂g(ini)/∂Wj,i

Chapter 20 28

slide-29
SLIDE 29

Back-propagation derivation

For a node i in the output layer:

∂E/∂Wj,i = −(yi − ai) ∂ai/∂Wj,i
         = −(yi − ai) ∂g(ini)/∂Wj,i
         = −(yi − ai) g′(ini) ∂ini/∂Wj,i

Chapter 20 29

slide-30
SLIDE 30

Back-propagation derivation

For a node i in the output layer:

∂E/∂Wj,i = −(yi − ai) ∂ai/∂Wj,i
         = −(yi − ai) ∂g(ini)/∂Wj,i
         = −(yi − ai) g′(ini) ∂ini/∂Wj,i
         = −(yi − ai) g′(ini) ∂/∂Wj,i (Σk Wk,i ak)

Chapter 20 30

slide-31
SLIDE 31

Back-propagation derivation

For a node i in the output layer:

∂E/∂Wj,i = −(yi − ai) ∂ai/∂Wj,i
         = −(yi − ai) ∂g(ini)/∂Wj,i
         = −(yi − ai) g′(ini) ∂ini/∂Wj,i
         = −(yi − ai) g′(ini) ∂/∂Wj,i (Σk Wk,i ak)
         = −(yi − ai) g′(ini) aj = −aj ∆i

where ∆i = (yi − ai) g′(ini)

Chapter 20 31

slide-32
SLIDE 32

Back-propagation derivation: hidden layer

For a node j in a hidden layer:

∂E/∂Wk,j = ?

Chapter 20 32

slide-33
SLIDE 33

“Reminder”: Chain rule for partial derivatives

For f(x, y), with f differentiable wrt x and y, and x and y differentiable wrt u and v:

∂f/∂u = (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u)   and   ∂f/∂v = (∂f/∂x)(∂x/∂v) + (∂f/∂y)(∂y/∂v)

Chapter 20 33

slide-34
SLIDE 34

Back-propagation derivation: hidden layer

For a node j in a hidden layer:

∂E/∂Wk,j = ∂/∂Wk,j E(aj1, aj2, . . . , ajm)

where {ji} are the indices of the nodes in the same layer as node j.

Chapter 20 34

slide-35
SLIDE 35

Back-propagation derivation: hidden layer

For a node j in a hidden layer:

∂E/∂Wk,j = (∂E/∂aj)(∂aj/∂Wk,j) + Σi (∂E/∂ai)(∂ai/∂Wk,j)

where the sum Σi runs over all other nodes i in the same layer as node j.

Chapter 20 35

slide-36
SLIDE 36

Back-propagation derivation: hidden layer

For a node j in a hidden layer:

∂E/∂Wk,j = (∂E/∂aj)(∂aj/∂Wk,j) + Σi (∂E/∂ai)(∂ai/∂Wk,j)
         = (∂E/∂aj)(∂aj/∂Wk,j)   since ∂ai/∂Wk,j = 0 for i ≠ j

Chapter 20 36

slide-37
SLIDE 37

Back-propagation derivation: hidden layer

For a node j in a hidden layer:

∂E/∂Wk,j = (∂E/∂aj)(∂aj/∂Wk,j) + Σi (∂E/∂ai)(∂ai/∂Wk,j)
         = (∂E/∂aj)(∂aj/∂Wk,j)   since ∂ai/∂Wk,j = 0 for i ≠ j
         = (∂E/∂aj) · g′(inj) ak

Chapter 20 37

slide-38
SLIDE 38

Back-propagation derivation: hidden layer

For a node j in a hidden layer:

∂E/∂Wk,j = (∂E/∂aj)(∂aj/∂Wk,j) + Σi (∂E/∂ai)(∂ai/∂Wk,j)
         = (∂E/∂aj)(∂aj/∂Wk,j)   since ∂ai/∂Wk,j = 0 for i ≠ j
         = (∂E/∂aj) · g′(inj) ak

∂E/∂aj = ?

Chapter 20 38

slide-39
SLIDE 39

Back-propagation derivation: hidden layer

For a node j in a hidden layer:

∂E/∂Wk,j = (∂E/∂aj)(∂aj/∂Wk,j) + Σi (∂E/∂ai)(∂ai/∂Wk,j)
         = (∂E/∂aj)(∂aj/∂Wk,j)   since ∂ai/∂Wk,j = 0 for i ≠ j
         = (∂E/∂aj) · g′(inj) ak

∂E/∂aj = ∂/∂aj E(ak1, ak2, . . . , akm)

where {ki} are the indices of the nodes in the layer after node j.

Chapter 20 39

slide-40
SLIDE 40

Back-propagation derivation: hidden layer

For a node j in a hidden layer:

∂E/∂Wk,j = (∂E/∂aj)(∂aj/∂Wk,j) + Σi (∂E/∂ai)(∂ai/∂Wk,j)
         = (∂E/∂aj)(∂aj/∂Wk,j)   since ∂ai/∂Wk,j = 0 for i ≠ j
         = (∂E/∂aj) · g′(inj) ak

∂E/∂aj = Σk (∂E/∂ak)(∂ak/∂aj)

where the sum Σk runs over all nodes k that node j connects to.

Chapter 20 40

slide-41
SLIDE 41

Back-propagation derivation: hidden layer

For a node j in a hidden layer:

∂E/∂Wk,j = (∂E/∂aj)(∂aj/∂Wk,j) + Σi (∂E/∂ai)(∂ai/∂Wk,j)
         = (∂E/∂aj)(∂aj/∂Wk,j)   since ∂ai/∂Wk,j = 0 for i ≠ j
         = (∂E/∂aj) · g′(inj) ak

∂E/∂aj = Σk (∂E/∂ak)(∂ak/∂aj) = Σk (∂E/∂ak) g′(ink) Wj,k

Chapter 20 41

slide-42
SLIDE 42

Back-propagation derivation: hidden layer

If we define ∆j ≡ g′(inj) Σk Wj,k ∆k

then ∂E/∂Wk,j = −∆j ak

Chapter 20 42

slide-43
SLIDE 43

Back-propagation pseudocode

for iter = 1 to T
    for e = 1 to N (all examples)
        x = input for example e
        y = output for example e
        run x forward through network, computing all {ai}, {ini}
        for all nodes i (in reverse order)
            compute ∆i = (yi − ai) × g′(ini)      if i is an output node
                         g′(ini) Σk Wi,k ∆k        otherwise
        for all weights Wj,i
            Wj,i = Wj,i + α × aj × ∆i
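A compact NumPy rendering of this pseudocode for a single hidden layer (a sketch under my own naming; the XOR data, layer sizes, and learning rate are invented for illustration, and convergence is not guaranteed for every random seed):

import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))          # sigmoid activation

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)                      # g'(x) = g(x)(1 - g(x))

def backprop_train(X, Y, n_hidden=4, alpha=0.5, epochs=5000, seed=0):
    # One-hidden-layer network trained with the update rules above, one example at a time.
    rng = np.random.default_rng(seed)
    W_kj = rng.uniform(-0.5, 0.5, (X.shape[1], n_hidden))   # input -> hidden weights
    W_ji = rng.uniform(-0.5, 0.5, (n_hidden, Y.shape[1]))   # hidden -> output weights
    for _ in range(epochs):
        for x, y in zip(X, Y):
            in_j = x @ W_kj;  a_j = g(in_j)                  # forward pass
            in_i = a_j @ W_ji;  a_i = g(in_i)
            delta_i = (y - a_i) * g_prime(in_i)              # output-layer deltas
            delta_j = g_prime(in_j) * (W_ji @ delta_i)       # back-propagated hidden deltas
            W_ji += alpha * np.outer(a_j, delta_i)           # Wj,i <- Wj,i + alpha * aj * delta_i
            W_kj += alpha * np.outer(x, delta_j)             # Wk,j <- Wk,j + alpha * ak * delta_j
    return W_kj, W_ji

# Example: learn XOR; each input vector carries a fixed bias feature of -1.
X = np.array([[-1, 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
W_kj, W_ji = backprop_train(X, Y)
print(np.round(g(g(X @ W_kj) @ W_ji)).ravel())   # expected to approach [0. 1. 1. 0.]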

Chapter 20 43

slide-44
SLIDE 44

Back-propagation learning contd.

At each epoch, sum gradient updates for all examples and apply

Restaurant data:

[Figure: total error on the training set vs. number of epochs, for the restaurant data.]

Usual problems with slow convergence, local minima

Chapter 20 44

slide-45
SLIDE 45

Back-propagation learning contd.

Restaurant data:

[Figure: % correct on test set vs. training set size, comparing a multilayer network and a decision tree on the restaurant data.]

Chapter 20 45

slide-46
SLIDE 46

Handwritten digit recognition

3-nearest-neighbor = 2.4% error
400–300–10 unit MLP = 1.6% error
LeNet: 768–192–30–10 unit MLP = 0.9% error

Chapter 20 46

slide-47
SLIDE 47

Summary

Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)

Perceptrons (one-layer networks) insufficiently expressive

Multi-layer networks are sufficiently expressive; can be trained by gradient descent, i.e., error back-propagation

Many applications: speech, driving, handwriting, credit cards, etc.

Chapter 20 47