slide-1
SLIDE 1

Neural Networks (Perceptrons)

A mathematical perspective. Textbook reference: Sections 11.1-11.2. Additional reference: Nilsson, N. Artificial Intelligence: A New Synthesis, San Francisco: Morgan Kaufmann, 1998. (Chapter 2, Chapter 3 (3.1-3.2))

slide-2
SLIDE 2

Neural networks (NNs)

Nilsson (1998) refers to them as "stimulus-response agents": agents that behave based on motor responses stimulated by immediate sensory inputs. They "learn" these motor responses through exposure to a set of sample inputs, each paired with the action appropriate for that input. We focus on "engineering" such networks rather than on studying biological neurons.

slide-3
SLIDE 3

An artificial neuron

[Figure: an artificial neuron. Inputs x_1, x_2, ..., x_n with weights w_1, w_2, ..., w_n feed a summation unit that computes Σ_{i=1}^n x_i w_i; the sum is compared against the threshold Θ to produce the output f.]

The output is

f = 1 if Σ_{i=1}^n x_i w_i ≥ Θ, and f = 0 otherwise.

Remember that a single neuron is capable of two actions corresponding to the two possible outputs of the neuron.
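As a sketch (the function and the AND example are mine, not from the slides), the thresholded neuron above can be written directly:

```python
def neuron_output(x, w, theta):
    # weighted sum s = sum_i x_i * w_i, then hard threshold at theta
    s = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if s >= theta else 0

# With weights (1, 1) and threshold 2, the neuron computes Boolean AND:
print(neuron_output([1, 1], [1, 1], 2))  # 1
print(neuron_output([1, 0], [1, 1], 2))  # 0
```

The two possible outputs, 0 and 1, are the neuron's two actions.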


slide-4
SLIDE 4

The learning problem

We are given a set, T, of n-dimensional vectors, X, with components x_i, i = 1, ..., n. These vectors are feature vectors computed by the perceptual processing component of a reactive agent. The values can be real or Boolean. For each X in T, we also know the appropriate action, a. These associated actions are sometimes called the labels or the classes of the vectors.

slide-5
SLIDE 5

The learning problem (cont’d)

The set T together with the associated labels is called the training set. The machine learning problem is to find a function, say f, that responds "acceptably" to the members of the training set. Remember that this type of learning is supervised. We would like the action computed by f to agree with the label for as many vectors in T as possible.

slide-6
SLIDE 6

Training a single neuron

[Figure: the hyperplane X . W − Θ = 0 in input space. W/|W| is the unit vector normal to the hyperplane; X . W − Θ > 0 on the side where f = 1, and X . W − Θ < 0 on the side where f = 0.]

Adjusting the threshold Θ changes the position of the hyperplane boundary with respect to the origin; adjusting the weights changes the orientation of the hyperplane.
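A small sketch of this geometry (the helper name and the 2-D numbers are my own): the sign of X . W − Θ identifies the side of the hyperplane an input lies on.

```python
def side(x, w, theta):
    # X . W - theta: positive on the f = 1 side, negative on the f = 0 side,
    # and exactly zero on the hyperplane itself
    return sum(xi * wi for xi, wi in zip(x, w)) - theta

# With W = (1, 1) and theta = 1, the boundary is the line x1 + x2 = 1:
print(side([1, 1], [1, 1], 1))   # 1: the f = 1 side
print(side([0, 0], [1, 1], 1))   # -1: the f = 0 side
print(side([1, 0], [1, 1], 1))   # 0: on the hyperplane
```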

slide-7
SLIDE 7

Augmented vectors

The procedure is simplified if we use a threshold of 0 rather than an arbitrary threshold. This can be achieved by using (n+1)-dimensional "augmented" vectors. The (n+1)-th component of the augmented input vector always has value 1; the (n+1)-th component of the weight vector is set to the negative of the desired threshold value, −Θ.

slide-8
SLIDE 8

Augmented vectors (cont’d)

So rather than checking X . W against Θ, we check X . W − Θ against 0. Using augmented vectors, the output of the neuron is 1 when X . W ≥ 0, and 0 otherwise.
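The equivalence can be checked with a short sketch (the numbers are arbitrary illustration):

```python
def augment(x, w, theta):
    # append 1 to the input vector and -theta to the weight vector
    return x + [1.0], w + [-theta]

x, w, theta = [0.5, 2.0], [1.0, -0.5], 0.25
xa, wa = augment(x, w, theta)
plain = sum(a * b for a, b in zip(x, w)) - theta   # X . W - theta
aug = sum(a * b for a, b in zip(xa, wa))           # augmented X . W
print(plain == aug)  # True
```

Augmentation only moves Θ into the weight vector, so the threshold can then be learned like any other weight.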

slide-9
SLIDE 9

Gradient Descent Method

Define an error function that can be minimized by adjusting weight values. A commonly used error function is the squared error:

ε = Σ_i (d_i − f_i)²

where f_i is the actual response for input X_i, and d_i is the desired response. For fixed T, we see that the error ε depends on the weight values through the f_i.

slide-10
SLIDE 10

Gradient Descent Method (cont’d)

A gradient descent process is useful for finding the minimum of ε: calculate the gradient of ε in weight space and move the weight vector along the negative gradient (downhill). Note that ε, as defined, depends on all the input vectors in T. Instead, use one vector at a time incrementally rather than all at once. The incremental process is an approximation of the "batch" process; nevertheless, it works.

slide-11
SLIDE 11

Gradient Descent Method (cont’d)

The following is a hypothetical error surface in two dimensions. The constant c dictates the size of the learning step.

[Figure: the error E plotted over weight space W; one learning step moves the weights from W_old to W_new, down the error surface toward a local minimum.]

slide-12
SLIDE 12

The procedure

Take one member of T. Adjust the weights if needed. Repeat (a predefined number of times, or until ε is sufficiently small).

slide-13
SLIDE 13

How to adjust the weights

The squared error for a single input vector, X, evoking an output of f, when the desired output is d, is:

ε = (d − f)²

The gradient of ε with respect to the weights is the vector of partial derivatives

∂ε/∂W = (∂ε/∂w_1, ..., ∂ε/∂w_{n+1})

slide-14
SLIDE 14

How to adjust the weights (cont’d)

Since the dependence of ε on W is entirely through the dot product s = X . W, we can use the chain rule to write

∂ε/∂W = (∂ε/∂s) (∂s/∂W)

Because s = X . W,

∂s/∂W = X

Note that ∂ε/∂s = −2 (d − f) (∂f/∂s). Thus

∂ε/∂W = −2 (d − f) (∂f/∂s) X
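The chain-rule result ∂ε/∂W = −2 (d − f) (∂f/∂s) X can be sanity-checked numerically. The check below is my own; it uses the sigmoid of a later slide as a differentiable choice of f, and the input, weights, and target are arbitrary.

```python
import math

def f(s):
    # a differentiable output function: the sigmoid, with df/ds = f (1 - f)
    return 1.0 / (1.0 + math.exp(-s))

x, w, d = [0.3, -1.2, 0.7], [0.5, 0.4, -0.2], 1.0
s = sum(a * b for a, b in zip(x, w))
analytic = [-2 * (d - f(s)) * f(s) * (1 - f(s)) * xi for xi in x]

# Central finite differences on eps(W) = (d - f(X . W))^2:
h = 1e-6
numeric = []
for i in range(len(w)):
    wp, wm = w[:], w[:]
    wp[i] += h
    wm[i] -= h
    ep = (d - f(sum(a * b for a, b in zip(x, wp)))) ** 2
    em = (d - f(sum(a * b for a, b in zip(x, wm)))) ** 2
    numeric.append((ep - em) / (2 * h))

print(all(abs(a - b) < 1e-6 for a, b in zip(analytic, numeric)))  # True
```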

slide-15
SLIDE 15

How to adjust the weights (cont’d)

The remaining problem is to compute ∂f/∂s. The perceptron output, f, is not continuously differentiable with respect to s because of the presence of the threshold function. Most small changes in the dot product do not change f at all, and when f does change, it changes abruptly from 1 to 0 or vice versa. We will look at two methods to compute the differential.

slide-16
SLIDE 16

Computing the differential

Ignore the threshold function and let f = s (the Widrow-Hoff procedure). Replace the threshold function with another nonlinear function that is differentiable (the generalized Delta procedure).

slide-17
SLIDE 17

The Widrow-Hoff Procedure

Suppose we attempt to adjust the weights so that every training vector labeled with a 1 produces a dot product of exactly 1, and every vector labeled with a 0 produces a dot product of exactly −1. In that case, with f = s,

ε = (d − f)² = (d − X . W)²

and ∂f/∂s = 1. Now, the gradient is

∂ε/∂W = −2 (d − f) X

slide-18
SLIDE 18

The Widrow-Hoff Proc. (cont’d)

Moving the weight vector along the negative gradient, and incorporating the factor 2 into a learning rate parameter, c, the new value of the weight vector is given by

W ← W + c (d − f) X

All we need to do now is plug this formula into the "adjust the weights" step of the training procedure.
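A minimal training sketch using this rule (the AND data with ±1 targets, the rate c, and the epoch count are invented for illustration):

```python
def train_widrow_hoff(samples, c=0.05, epochs=200):
    n = len(samples[0][0])
    w = [0.0] * (n + 1)                           # augmented weight vector
    for _ in range(epochs):
        for x, d in samples:
            xa = x + [1.0]                        # augmented input
            f = sum(a * b for a, b in zip(xa, w)) # f = s, no threshold
            # W <- W + c (d - f) X
            w = [wi + c * (d - f) * xi for wi, xi in zip(w, xa)]
    return w

# Boolean AND with targets +1 / -1:
data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w = train_widrow_hoff(data)

# Revert to the threshold at 0 for classification:
preds = [1 if sum(a * b for a, b in zip(x + [1.0], w)) >= 0 else -1
         for x, _ in data]
print(preds)  # [-1, -1, -1, 1]
```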

slide-19
SLIDE 19

The Widrow-Hoff Proc. (cont’d)

We have

W ← W + c (d − f) X

Whenever (d − f) is positive, we add a fraction of the input vector into the weight vector. This addition makes the dot product larger, and (d − f) smaller. Similarly, when (d − f) is negative, we subtract a fraction of the input vector from the weight vector.

slide-20
SLIDE 20

The Widrow-Hoff Proc. (cont’d)

This procedure is also known as the Delta rule. After finding a set of weights that minimizes the squared error (using f = s), we are free to revert to the threshold function for f.

slide-21
SLIDE 21

The generalized Delta procedure

Another way of dealing with the nondifferentiable threshold function is to replace the threshold function by an S-shaped differentiable function called a sigmoid. Usually, the sigmoid function used is:

f(s) = 1 / (1 + e^{−s})

where s is the input and f is the output.

slide-22
SLIDE 22

A Sigmoid Function

[Figure: the sigmoid function f(s) = 1 / (1 + e^{−s}), rising smoothly from 0 toward 1 with f(0) = 0.5.]

It is possible to get sigmoid functions of different "flatness" by adjusting the exponent.

slide-23
SLIDE 23

The generalized Delta procedure (cont’d)

With the sigmoid function,

∂f/∂s = f (1 − f)

Substituting into

∂ε/∂W = −2 (d − f) (∂f/∂s) X

gives

∂ε/∂W = −2 (d − f) f (1 − f) X

The new weight change rule is:

W ← W + c (d − f) f (1 − f) X
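A corresponding training sketch for the generalized Delta rule (the OR data with 1/0 targets and all constants are my own illustration):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_generalized_delta(samples, c=0.5, epochs=2000):
    n = len(samples[0][0])
    w = [0.0] * (n + 1)                      # augmented weight vector
    for _ in range(epochs):
        for x, d in samples:
            xa = x + [1.0]                   # augmented input
            f = sigmoid(sum(a * b for a, b in zip(xa, w)))
            # W <- W + c (d - f) f (1 - f) X
            w = [wi + c * (d - f) * f * (1 - f) * xi
                 for wi, xi in zip(w, xa)]
    return w

# Boolean OR with targets 1 / 0:
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w = train_generalized_delta(data)
preds = [1 if sigmoid(sum(a * b for a, b in zip(x + [1.0], w))) >= 0.5 else 0
         for x, _ in data]
print(preds)  # [0, 1, 1, 1]
```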

slide-24
SLIDE 24

Comparison

Compare Widrow-Hoff and generalized Delta. The desired output, d: Widrow-Hoff: either 1 or −1; generalized Delta: either 1 or 0. The actual output, f: Widrow-Hoff: equals s, the dot product; generalized Delta: the sigmoid of the dot product. The sigmoid can be thought of as implementing a "fuzzy" hyperplane.

slide-25
SLIDE 25

Fuzzy hyperplane

In generalized Delta, there is the added term f (1 − f) due to the presence of the sigmoid function. When f = 0, f (1 − f) is 0. When f = 1, f (1 − f) is 0. When f = 1/2, f (1 − f) reaches its maximum value (1/4). Thus weight changes are made where changes have much effect on f. For an input vector far away from the fuzzy hyperplane, f (1 − f) has a value close to 0, and the generalized Delta rule makes little or no change to the weight values, regardless of the desired output.

slide-26
SLIDE 26

The error-correction procedure

Keep the threshold element. Adjust the weight vector only when the perceptron responds in error, i.e., when (d − f) is 1 or −1. The weight change rule is

W ← W + c (d − f) X

As before, the change is in the direction that helps correct the error. Whether the error is corrected fully depends on c.
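A sketch of the error-correction procedure (the AND data and constants are invented; the loop stops once a full pass makes no corrections):

```python
def train_error_correction(samples, c=1.0, max_passes=100):
    n = len(samples[0][0])
    w = [0.0] * (n + 1)                      # augmented weight vector
    for _ in range(max_passes):
        errors = 0
        for x, d in samples:
            xa = x + [1.0]                   # augmented input
            f = 1 if sum(a * b for a, b in zip(xa, w)) >= 0 else 0
            if f != d:                       # (d - f) is +1 or -1 here
                w = [wi + c * (d - f) * xi for wi, xi in zip(w, xa)]
                errors += 1
        if errors == 0:                      # all training vectors correct
            break
    return w

# Boolean AND with targets 1 / 0:
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w = train_error_correction(data)
preds = [1 if sum(a * b for a, b in zip(x + [1.0], w)) >= 0 else 0
         for x, _ in data]
print(preds)  # [0, 0, 0, 1]
```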

slide-27
SLIDE 27

The error-correction procedure (cont’d)

Note that here both d and f are either 0 or 1, whereas in Widrow-Hoff, d is either +1 or −1, and f is the value of the dot product. It can be proven that if there is some weight vector, W, that produces a correct output for all the input vectors in T, then after a finite number of input vector presentations the error-correction procedure will find such a weight vector and thus make no more weight changes.

slide-28
SLIDE 28

Linearly non-separable inputs

Remember that a single perceptron can only learn linearly separable input vectors. When the input vectors in the training set are not linearly separable, the error-correction procedure will never terminate. Thus, it cannot be used to find a "good enough" answer. On the other hand, the Widrow-Hoff and generalized Delta procedures can find minimum squared error solutions even when the minimum error is not zero.


slide-29
SLIDE 29

Final remarks

Other names for a perceptron are: TLU (Threshold Logic Unit, Nilsson's term) and Adaline (Adaptive Linear Element). Neural networks are a very commonly used structure in machine learning. The input has to be numeric. They are useful when the learned function need not be easily understood (compare to decision trees, which implement a DNF Boolean formula).

slide-30
SLIDE 30

Final remarks (cont’d)

Typical examples are: handwritten character recognition (ZIP codes), speech recognition, and learning to pronounce written text. Designing and training neural networks is still an art, requiring experience and experimentation. Main conference: Neural Information Processing Systems (NIPS); yearly publication of best results: Advances in Neural Information Processing Systems.