SLIDE 1

AN INTRODUCTION TO NEURAL NETWORKS

Scott Kuindersma November 12, 2009

SLIDE 2

SLIDE 3

SUPERVISED LEARNING

  • We are given some training data: (x1, y1), ..., (xN, yN)
  • We must learn a function f with f(x) = y
  • If y is discrete, we call it classification
  • If it is continuous, we call it regression
SLIDE 4

ARTIFICIAL NEURAL NETWORKS

  • Artificial neural networks are one technique that can be used to solve supervised learning problems
  • Very loosely inspired by biological neural networks
  • real neural networks are much more complicated, e.g. using spike timing to encode information
  • Neural networks consist of layers of interconnected units
SLIDE 5

PERCEPTRON UNIT

  • The simplest computational neural unit is called a perceptron
  • The input of a perceptron is a real vector x
  • The output is either 1 or -1
  • Therefore, a perceptron can be applied to binary classification problems
  • Whether or not it will be useful depends on the problem... more on this later...
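
As a concrete illustration (not from the slides; Python stands in for the course's MATLAB, and the weight values here are arbitrary), the forward computation is a dot product followed by a threshold:

    import numpy as np

    def perceptron_output(w, x):
        # prepend the bias input x0 = 1, then threshold w . x at zero
        x = np.concatenate(([1.0], x))
        return 1 if np.dot(w, x) >= 0 else -1

    w = np.array([-0.3, 0.5, 0.5])                      # arbitrary [w0, w1, w2]
    print(perceptron_output(w, np.array([1.0, 1.0])))   # -> 1
    print(perceptron_output(w, np.array([1.0, -1.0])))  # -> -1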

SLIDE 6

PERCEPTRON UNIT [MITCHELL 1997]

[Figure: Mitchell's perceptron diagram; the unit computes o(x1, ..., xn) = sign(w0 + w1 x1 + ... + wn xn)]

SLIDE 7

SIGN FUNCTION

sign(y) = 1 if y > 0, and -1 otherwise

SLIDE 8

EXAMPLE

  • Suppose we have a perceptron with 3 weights:
  • On input x1 = 0.5, x2 = 0.0, the perceptron outputs:
  • where x0 = 1
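
The slide's weight values and computed output did not survive extraction; a reconstruction using the perceptron_output sketch above with assumed weights:

    w = np.array([-0.3, 0.5, 0.5])           # assumed; the slide's values were lost
    x = np.array([0.5, 0.0])                 # x1 = 0.5, x2 = 0.0
    # with x0 = 1: w . x = -0.3 + 0.5*0.5 + 0.5*0.0 = -0.05, so the output is -1
    print(perceptron_output(w, x))           # -> -1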
SLIDE 9

LEARNING RULE

  • Now that we know how to calculate the output of a perceptron, we would like to find a way to modify the weights to produce output that matches the training data
  • This is accomplished via the perceptron learning rule: wi <- wi + α(t - o)xi for an input/target pair (x, t), where, again, x0 = 1
  • Loop through the training data until (nearly) all examples are classified correctly (a sketch of this loop follows)
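
A minimal version of that loop (illustrative Python; the learning rate and epoch cap are my choices):

    def train_perceptron(X, T, alpha=0.1, max_epochs=100):
        # X holds one example per row; T holds targets in {1, -1}
        w = np.zeros(X.shape[1] + 1)                # +1 for the bias weight w0
        for _ in range(max_epochs):
            errors = 0
            for x, t in zip(X, T):
                o = perceptron_output(w, x)
                if o != t:
                    # perceptron learning rule: wi <- wi + alpha (t - o) xi
                    w += alpha * (t - o) * np.concatenate(([1.0], x))
                    errors += 1
            if errors == 0:                         # all examples classified correctly
                break
        return w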

SLIDE 10

MATLAB EXAMPLE

SLIDE 11

LIMITATIONS OF THE PERCEPTRON MODEL

  • Can only distinguish between linearly separable classes of inputs
  • Consider the following data: [figure: a two-class dataset that no single straight line separates]
SLIDE 12

PERCEPTRONS AND BOOLEAN FUNCTIONS

  • Suppose we let the values (1, -1) correspond to true and false, respectively
  • Can we describe a perceptron capable of computing the AND function? What about OR? NAND? NOR? XOR?
  • Let’s think about it geometrically
SLIDE 13

BOOLEAN FUNCS CONT’D

[Figure: the AND, OR, NOR, and NAND functions plotted in the (x1, x2) plane; each is linearly separable, so each can be computed by a single perceptron]

SLIDE 14

EXAMPLE: AND

  • Let pAND(x1,x2) be the output of the perceptron with weights w0 = -0.3, w1 = 0.5, w2 = 0.5 on input x1, x2

    x1    x2    pAND(x1,x2)
    -1    -1    -1
    -1     1    -1
     1    -1    -1
     1     1     1
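
A quick check of that table using the perceptron_output sketch from earlier:

    w_and = np.array([-0.3, 0.5, 0.5])
    for x1 in (-1.0, 1.0):
        for x2 in (-1.0, 1.0):
            print(x1, x2, perceptron_output(w_and, np.array([x1, x2])))
    # only the input (1, 1) produces 1, matching the AND truth table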

SLIDE 15

XOR

[Figure: the XOR function in the (x1, x2) plane; no single straight line separates the true points from the false points]

SLIDE 16

XOR

  • XOR cannot be represented by a perceptron, but it can be represented by a small network of perceptrons, e.g., XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)), as sketched below

[Diagram: x1 and x2 feed both an OR unit and a NAND unit; their outputs feed an AND unit whose output is XOR(x1, x2)]
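
The same construction in code, reusing perceptron_output (these particular weights are my choice; any weights realizing OR, NAND, and AND would do):

    w_or   = np.array([ 0.3,  0.5,  0.5])   # false only when both inputs are false
    w_nand = np.array([ 0.3, -0.5, -0.5])   # false only when both inputs are true
    w_and  = np.array([-0.3,  0.5,  0.5])   # true only when both inputs are true

    def xor(x1, x2):
        h1 = perceptron_output(w_or,   np.array([x1, x2]))
        h2 = perceptron_output(w_nand, np.array([x1, x2]))
        return perceptron_output(w_and, np.array([h1, h2]))

    # xor(1, 1) -> -1, xor(1, -1) -> 1, xor(-1, 1) -> 1, xor(-1, -1) -> -1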

SLIDE 17

PERCEPTRON CONVERGENCE

  • The perceptron learning rule is not guaranteed to converge if the data is not linearly separable
  • We can remedy this situation by considering a linear unit and applying gradient descent
  • The linear unit is equivalent to a perceptron without the sign function. That is, its output is given by o = w0 + w1 x1 + ... + wn xn = w · x, where x0 = 1
SLIDE 18

LEARNING RULE DERIVATION

  • Goal: a weight update rule of the form wi <- wi + Δwi
  • First we define a suitable measure of error: E(w) = (1/2) Σd (td - od)², summing over the training examples d
  • Typically we choose a quadratic function so we have a global minimum

SLIDE 19

ERROR SURFACE [MITCHELL 1997]

[Figure: the quadratic error surface over the weight space; gradient descent steps downhill toward the single global minimum]

SLIDE 20

LEARNING RULE DERIVATION

  • The learning algorithm should update each weight in the direction that minimizes the error according to our error function
  • That is, the weight change should look something like Δwi = -α ∂E/∂wi, where α is a small positive learning rate
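
Carrying out that derivative for the linear unit (a step the slides leave implicit; notation follows Mitchell) yields the delta rule:

    ∂E/∂wi = ∂/∂wi [ (1/2) Σd (td - od)² ]
           = Σd (td - od) · ∂/∂wi (td - w · xd)
           = -Σd (td - od) xid

    so  Δwi = -α ∂E/∂wi = α Σd (td - od) xid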
SLIDE 21

GRADIENT DESCENT

SLIDE 22

GRADIENT DESCENT

  • Good: guaranteed to converge to the minimum error weight vector regardless of whether the training data are linearly separable (given that α is sufficiently small)
  • Bad: still can only correctly classify linearly separable data
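
A batch-gradient-descent sketch for the linear unit (the function name, parameters, and fixed epoch count are illustrative assumptions):

    def train_linear_unit(X, T, alpha=0.01, epochs=1000):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1
        w = np.zeros(Xb.shape[1])
        for _ in range(epochs):
            O = Xb @ w                       # linear unit outputs; no sign function
            w += alpha * Xb.T @ (T - O)      # delta rule: alpha * sum_d (td - od) xd
        return w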
SLIDE 23

NETWORKS

  • In general, many-layered networks of threshold units are capable of representing a rich variety of nonlinear decision surfaces
  • However, to use our gradient descent approach on multi-layered networks, we must avoid the non-differentiable sign function
  • Multiple layers of linear units can still only represent linear functions
  • Introducing the sigmoid function...
SLIDE 24

SIGMOID FUNCTION

σ(y) = 1 / (1 + e^(-y))

Its derivative has the convenient form σ'(y) = σ(y)(1 - σ(y)), which backprop exploits.

SLIDE 25

SIGMOID UNIT [MITCHELL 1997]

[Figure: Mitchell's sigmoid unit diagram; the unit computes o = σ(w · x)]

SLIDE 26

EXAMPLE

  • Suppose we have a sigmoid unit k with 3 weights:
  • On input x1 = 0.5, x2 = 0.0, the unit outputs:
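
The unit's weights were again lost in extraction; a sketch with assumed values:

    def sigmoid(y):
        return 1.0 / (1.0 + np.exp(-y))

    w_k = np.array([-0.3, 0.5, 0.5])         # assumed; not the slide's actual weights
    x = np.array([1.0, 0.5, 0.0])            # x0 = 1, x1 = 0.5, x2 = 0.0
    print(sigmoid(np.dot(w_k, x)))           # σ(-0.05) ≈ 0.4875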
SLIDE 27

NETWORK OF SIGMOID UNITS

[Diagram: a feedforward network of sigmoid units. Inputs x0, x1, x2, x3 feed hidden-layer units 2, 3, and 4, which feed output-layer unit 1; wij denotes the weight on the connection from node i to node j (e.g., w02 from input x0 to hidden unit 2, w31 from hidden unit 3 to output unit 1)]
SLIDE 28

EXAMPLE

[Diagram: a concrete network with inputs x0, x1, x2, hidden units 1 and 2, and output unit 3; weight values such as .1, .2, .3, 3.2, .5, and 1.0 label the connections]

SLIDE 29

EXAMPLE

[Diagram: the same weighted network as Slide 28, now shown with a plot of its output over x1, x2 ∈ [-2, 2]; the output varies smoothly between roughly 0.65 and 0.8]
SLIDE 30

BACK-PROPAGATION

  • Really just applying the same gradient descent approach to our network of sigmoid units
  • We use the error function: E(w) = (1/2) Σd Σk (tkd - okd)², summing over training examples d and output units k
SLIDE 31

BACKPROP ALGORITHM
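
The algorithm itself did not survive extraction; as a stand-in, here is a minimal sketch of stochastic backprop for a single hidden layer in the style of Mitchell 1997 (the function name, array shapes, and learning rate are my choices):

    def backprop_epoch(X, T, W_h, W_o, alpha=0.05):
        # X rows already include the bias input x0 = 1; T rows are target vectors.
        # W_h: (n_hidden, n_inputs); W_o: (n_outputs, n_hidden + 1).
        for x, t in zip(X, T):
            # 1. forward pass
            h = sigmoid(W_h @ x)                   # hidden activations
            h_b = np.concatenate(([1.0], h))       # prepend hidden-layer bias
            o = sigmoid(W_o @ h_b)                 # output activations
            # 2. output error terms: δk = ok (1 - ok)(tk - ok)
            d_o = o * (1 - o) * (t - o)
            # 3. hidden error terms: δh = h (1 - h) Σk wkh δk
            d_h = h * (1 - h) * (W_o[:, 1:].T @ d_o)
            # 4. weight updates: w <- w + alpha * δ * input
            W_o += alpha * np.outer(d_o, h_b)
            W_h += alpha * np.outer(d_h, x)
        return W_h, W_o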

SLIDE 32

BACKPROP CONVERGENCE

  • Unfortunately, there may exist many local minima in the error function
  • Therefore we cannot guarantee convergence to an optimal solution as in the single linear unit case
  • Time to convergence is also a concern
  • Nevertheless, backprop does reasonably well in many cases
SLIDE 33

MATLAB EXAMPLE

  • Quadratic decision boundary
  • Single linear unit vs. Three-sigmoid unit backprop network... GO!
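
The original MATLAB demo is not included here; a rough stand-in built from the earlier sketches (the data generation and every parameter are my assumptions):

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))
    T = (X[:, 1] > X[:, 0] ** 2).astype(float)   # quadratic decision boundary

    w_lin = train_linear_unit(X, T)              # a single linear unit can't curve

    Xb = np.hstack([np.ones((200, 1)), X])       # prepend x0 = 1
    W_h = rng.normal(0, 0.5, size=(2, 3))        # two hidden sigmoid units
    W_o = rng.normal(0, 0.5, size=(1, 3))        # plus one sigmoid output unit
    for _ in range(2000):
        W_h, W_o = backprop_epoch(Xb, T.reshape(-1, 1), W_h, W_o, alpha=0.1)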
SLIDE 34

BACK TO ALVINN

  • ALVINN was a 1989 project at CMU in which an autonomous vehicle learned to drive by watching a person drive
  • ALVINN's architecture consists of a single hidden layer back-propagation network
  • The input layer of the network is a 30x32 unit two-dimensional "retina" which receives input from the vehicle's video camera
  • The output layer is a linear representation of the direction the vehicle should travel in order to keep the vehicle on the road

SLIDE 35

ALVINN

SLIDE 36

REPRESENTATIONAL POWER OF NEURAL NETWORKS

  • Every boolean function can be represented by a network with two layers of units
  • Every bounded continuous function can be approximated to arbitrary accuracy by a two-layer network of sigmoid hidden units and linear output units
  • Any function can be approximated to arbitrary accuracy by a three-layer network of sigmoid hidden units and linear output units

SLIDE 37

READING SUGGESTIONS

  • Mitchell, Machine Learning, Chapter 4
  • Russell and Norvig, Artificial Intelligence: A Modern Approach, Chapter 20