

SLIDE 1

Artificial neural networks

Chapter 18, Section 7

Artificial Intelligence, spring 2013, Peter Ljunglöf; based on AIMA slides © Stuart Russell and Peter Norvig, 2004

SLIDE 2

Outline

♦ Brains
♦ Neural networks
♦ Perceptrons
♦ Multilayer perceptrons
♦ Applications of neural networks


SLIDE 3

Brains

10^11 neurons of > 20 types, 10^14 synapses, 1 ms–10 ms cycle time

Signals are noisy “spike trains” of electrical potential

[Figure: a schematic neuron — cell body (soma) with nucleus, dendrites, axon with axonal arborization, and synapses to and from other cells]


SLIDE 4

McCulloch–Pitts simplified neuron

Output is a “squashed” linear function of the inputs:

ai = g(ini) = g(wi · a) = g(Σj wj,i aj)

[Figure: a single unit — input links aj with weights Wj,i feed the input function Σ, producing ini; the activation function g gives the output ai = g(ini); the constant input a0 = −1 carries the bias weight W0,i]

Note that a0 = −1 is a constant input, and w0,i is the bias weight.
This is a gross oversimplification of real neurons, but its purpose is to develop an understanding of what networks of simple units can do.
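As a minimal sketch of this formula (the code and names are mine, not from the slides; g defaults to the sigmoid defined on the next slide):

```python
import math

def unit_output(w, a_in, g=lambda x: 1.0 / (1.0 + math.exp(-x))):
    """One unit: a_i = g(in_i) = g(sum_j w_j * a_j).

    w[0] is the bias weight w0,i; the constant input a0 = -1 is prepended.
    """
    a = [-1.0] + list(a_in)
    in_i = sum(wj * aj for wj, aj in zip(w, a))
    return g(in_i)

print(unit_output([0.5, 1.0, 1.0], [0.3, 0.4]))  # g(0.3 + 0.4 - 0.5) = g(0.2)
```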


SLIDE 5

Activation functions

[Figure: two activation functions g(ini), each rising to +1 — (a) a step, (b) a sigmoid]

(a) is a step function or threshold function: g(x) = 1 if x ≥ 0, else 0
(b) is a sigmoid function: g(x) = 1/(1 + e^−x)

Changing the bias weight w0,i moves the threshold location
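Both functions transcribe directly; the small demo at the end (input values chosen by me) shows how the bias weight sets the threshold, since the constant input a0 = −1 contributes −w0 to ini:

```python
import math

def step(x):
    """(a) threshold function: g(x) = 1 if x >= 0, else 0."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    """(b) sigmoid function: g(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

# A step unit with one input fires exactly when w1 * a1 >= w0,
# because the constant input a0 = -1 contributes -w0 to in_i.
w0, w1 = 0.5, 1.0
print(step(w1 * 0.7 - w0))  # 1.0: input 0.7 clears the threshold 0.5
print(step(w1 * 0.3 - w0))  # 0.0: input 0.3 does not
```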


SLIDE 6

Network structures

Feed-forward networks:
– single-layer perceptrons
– multi-layer networks

Feed-forward networks implement functions, and have no internal state

Recurrent networks have directed cycles with delays
⇒ they have internal state (like flip-flops), can oscillate, etc.
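A toy contrast, entirely my own sketch: a feed-forward unit is a pure function of its current input, while a recurrent unit carries internal state from step to step:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def feedforward_unit(w, x):
    """No internal state: output is a fixed function of the current input."""
    return sigmoid(w * x)

class RecurrentUnit:
    """A directed cycle with a delay: the previous output is fed back,
    so the response depends on the input history (like a flip-flop)."""
    def __init__(self, w_in, w_back):
        self.w_in, self.w_back = w_in, w_back
        self.state = 0.0

    def step(self, x):
        self.state = sigmoid(self.w_in * x + self.w_back * self.state)
        return self.state
```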


SLIDE 7

Feed-forward example

[Figure: a feed-forward network with input units 1 and 2, hidden units 3 and 4, and output unit 5, connected by weights W1,3, W1,4, W2,3, W2,4, W3,5, W4,5]

Feed-forward network = a parameterized family of nonlinear functions:

a5 = g(w3,5 · a3 + w4,5 · a4)
   = g(w3,5 · g(w1,3 · a1 + w2,3 · a2) + w4,5 · g(w1,4 · a1 + w2,4 · a2))

Adjusting the weights changes the function ⇒ this is how we do learning!
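The expression for a5 transcribes line by line; a sketch with made-up weight values (the slide gives none):

```python
import math

def g(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

# Arbitrary illustrative weights for the network in the figure.
w13, w23 = 0.5, 0.8    # into hidden unit 3
w14, w24 = -0.3, 0.1   # into hidden unit 4
w35, w45 = 1.2, -0.7   # into output unit 5

def a5(a1, a2):
    a3 = g(w13 * a1 + w23 * a2)    # hidden unit 3
    a4 = g(w14 * a1 + w24 * a2)    # hidden unit 4
    return g(w35 * a3 + w45 * a4)  # output unit 5

print(a5(0.0, 1.0))  # changing any weight changes the function computed
```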


SLIDE 8

Single-layer perceptrons

[Figure: a single-layer perceptron — input units connected directly to output units by weights Wj,i — and a 3D plot of perceptron output over inputs x1, x2, showing a soft threshold “cliff”]

Output units all operate separately:
– there are no shared weights
– each output unit corresponds to a separate function

Adjusting the weights moves the location, orientation, and steepness of the cliff


SLIDE 9

Expressiveness of perceptrons

Consider a perceptron with g = the step function

Can represent AND, OR, NOT, majority, etc., but not XOR

Represents a linear separator in input space:

Σj wj xj > 0, i.e., w · x > 0

[Figure: input-space plots of (a) x1 AND x2 and (b) x1 OR x2, each with a line separating the 0 and 1 points, and (c) x1 XOR x2, marked “?”, where no single line can separate them]
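As a concrete sketch (the weight values are mine, though standard): step-perceptrons for AND and OR, plus a brute-force check over a weight grid that finds no perceptron matching XOR — a demonstration, not a proof:

```python
from itertools import product

def perceptron(w0, w1, w2):
    """Step-function perceptron over two inputs; w0 is the bias weight
    (paired with the constant input a0 = -1, hence the subtraction)."""
    return lambda x1, x2: 1 if w1 * x1 + w2 * x2 - w0 >= 0 else 0

AND = perceptron(1.5, 1, 1)  # fires iff x1 + x2 >= 1.5, i.e. both inputs on
OR  = perceptron(0.5, 1, 1)  # fires iff x1 + x2 >= 0.5, i.e. any input on

# XOR has no linear separator, so the search below finds nothing.
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
grid = [i / 2 for i in range(-8, 9)]
print(any(all(perceptron(w0, w1, w2)(*x) == y for x, y in xor.items())
          for w0, w1, w2 in product(grid, repeat=3)))  # False
```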


SLIDE 10

Perceptron learning

Learn by adjusting weights to reduce the error on the training set

The perceptron learning rule:

wj ← wj + α(y − h)xj

where h = hw(x) ∈ {0, 1} is the calculated hypothesis, y ∈ {0, 1} is the desired value, and 0 < α < 1 is the learning rate. Or, in other words (a code sketch follows this list):

  • if y = 1, h = 0, add αxj to wj
  • if y = 0, h = 1, subtract αxj from wj
  • otherwise y = h, do nothing
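A direct transcription of the rule as a sketch (names are mine; the bias handling follows the unit definition from slide 4):

```python
def perceptron_update(w, x, y, alpha=0.1):
    """One step of the perceptron learning rule: wj <- wj + alpha*(y - h)*xj.

    w: weights, with w[0] the bias weight paired with constant input -1
    x: input vector; y: desired output in {0, 1}; 0 < alpha < 1
    """
    a = [-1.0] + list(x)
    h = 1 if sum(wj * aj for wj, aj in zip(w, a)) >= 0 else 0
    return [wj + alpha * (y - h) * aj for wj, aj in zip(w, a)]
```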


SLIDE 11

Perceptrons = linear classifiers

Perceptron learning rule converges to a consistent function for any linearly separable data set

But what if the data set is not linearly separable?


SLIDE 12

Data that are not linearly separable

Perceptron learning rule converges to a consistent function for any linearly separable data set

But what can we do if the data set is not linearly separable? Some options (a training-loop sketch follows this list):

  • Stop after a fixed number of iterations
  • Stop when the total error does not change between iterations
  • Let α decrease between iterations
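A hedged sketch combining all three options in one epoch loop (the decay schedule α = 1000/(1000 + t) is a common convention, not one the slides specify):

```python
def train_perceptron(data, n_inputs, max_epochs=100):
    """Perceptron training on possibly non-separable data.

    data: list of (x, y) pairs with y in {0, 1}; w[0] is the bias weight.
    """
    w = [0.0] * (n_inputs + 1)
    prev_errors = None
    for t in range(max_epochs):        # stop after a fixed number of iterations
        alpha = 1000.0 / (1000.0 + t)  # let alpha decrease between iterations
        errors = 0
        for x, y in data:
            a = [-1.0] + list(x)
            h = 1 if sum(wj * aj for wj, aj in zip(w, a)) >= 0 else 0
            if h != y:
                errors += 1
                w = [wj + alpha * (y - h) * aj for wj, aj in zip(w, a)]
        if errors == prev_errors:      # stop when total error stops changing
            break
        prev_errors = errors
    return w
```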


SLIDE 13

Perceptrons vs decision trees

The perceptron learns the majority function easily, while decision-tree learning (DTL) is hopeless

DTL learns the restaurant function easily, while the perceptron cannot even represent it

[Figure: learning curves — proportion correct on test set vs. training set size (10–100) — for the perceptron and the decision tree on MAJORITY with 11 inputs (perceptron wins) and on the RESTAURANT data (decision tree wins)]


SLIDE 14

Multilayer perceptrons

Layers are usually fully connected; the number of hidden units is typically chosen by hand

[Figure: a multilayer network — input units ak feed hidden units aj through weights Wk,j, and hidden units feed output units ai through weights Wj,i]
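A sketch of one fully connected layer and a two-layer network built from it (the shapes and names are my own assumptions):

```python
import math

def layer(W, inputs):
    """One fully connected layer: each row of W holds one unit's weights,
    row[0] being the bias weight for the prepended constant input -1."""
    a = [-1.0] + list(inputs)
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, a))))
            for row in W]

def mlp(W_hidden, W_output, x):
    """ak -> (Wk,j) -> aj -> (Wj,i) -> ai, as in the figure."""
    return layer(W_output, layer(W_hidden, x))

# Example: 2 inputs, 3 hidden units, 1 output unit.
W_hidden = [[0.1, 0.4, -0.6], [0.0, 0.3, 0.2], [-0.2, 0.5, 0.1]]  # 3 x (1+2)
W_output = [[0.2, 0.7, -0.4, 0.5]]                                # 1 x (1+3)
print(mlp(W_hidden, W_output, [1.0, 0.0]))
```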


SLIDE 15

Expressiveness of MLPs

What functions can be described by MLPs?
– with 1 hidden layer: all continuous functions
– with 2 hidden layers: all functions

[Figure: two 3D plots of hW(x1, x2) — a ridge formed by two opposite-facing soft thresholds, and a bump formed by two perpendicular ridges]

Combine two opposite-facing threshold functions to make a ridge
Combine two perpendicular ridges to make a bump
Add bumps of various sizes and locations to fit any surface

The proof requires exponentially many hidden units
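The first two steps of the construction can be sketched directly; the steepness factor 10 and the offsets below are arbitrary choices of mine:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ridge(x):
    """Two opposite-facing soft thresholds: high only for roughly -1 < x < 1."""
    return sigmoid(10 * (x + 1)) - sigmoid(10 * (x - 1))

def bump(x1, x2):
    """Two perpendicular ridges combined: high only near the origin."""
    return sigmoid(10 * (ridge(x1) + ridge(x2) - 1.5))

print(round(bump(0, 0), 3), round(bump(3, 0), 3))  # ~0.993 near 0, ~0.007 away
```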


SLIDE 16

Example: Handwritten digit recognition

MLPs are quite good for complex pattern recognition tasks (but the resulting hypotheses cannot be understood easily)

Error rates on handwritten digit recognition:
– 3-nearest-neighbor classifier: 2.4% error
– MLP (400 inputs, 300 hidden, 10 output): 1.6% error
– LeNet, an MLP specialized for image analysis: 0.9% error
– SVM, without any domain knowledge: 1.1% error
