Learning: Nearest Neighbor, Perceptrons & Neural Nets
slide-1
SLIDE 1

Learning: Nearest Neighbor, Perceptrons & Neural Nets

Artificial Intelligence CSPP 56553 February 4, 2004

slide-2
SLIDE 2

Nearest Neighbor Example II

  • Credit Rating:
    – Classifier: Good / Poor
    – Features:
      • L = # late payments/yr
      • R = Income/Expenses

  Name  L   R     G/P
  A     0   1.2   G
  B     25  0.4   P
  C     5   0.7   G
  D     20  0.8   P
  E     30  0.85  P
  F     11  1.2   G
  G     7   1.15  G
  H     15  0.8   P

slide-3
SLIDE 3

Nearest Neighbor Example II

  Name  L   R     G/P
  A     0   1.2   G
  B     25  0.4   P
  C     5   0.7   G
  D     20  0.8   P
  E     30  0.85  P
  F     11  1.2   G
  G     7   1.15  G
  H     15  0.8   P

(Scatter plot: the eight points A-H plotted in the L-R plane, L running up to 30 on one axis and R on the other.)

slide-4
SLIDE 4

Nearest Neighbor Example II

(Scatter plot: the training points A-H plus new points I, J, K in the L-R plane.)

  Name  L   R     G/P
  I     6   1.15  G
  J     22  0.45  P
  K     15  1.2   ??

Distance measure: sqrt( (L1 - L2)^2 + [sqrt(10) * (R1 - R2)]^2 )

  • Scaled distance
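A minimal 1-nearest-neighbor sketch using the training table and scaled distance from these slides; the function names are illustrative, not from the original deck.

```python
import math

# Training table from the earlier slide: (name, L, R, class)
train = [
    ("A", 0, 1.20, "G"), ("B", 25, 0.40, "P"), ("C", 5, 0.70, "G"),
    ("D", 20, 0.80, "P"), ("E", 30, 0.85, "P"), ("F", 11, 1.20, "G"),
    ("G", 7, 1.15, "G"), ("H", 15, 0.80, "P"),
]

def scaled_distance(l1, r1, l2, r2):
    # sqrt((L1 - L2)^2 + [sqrt(10) * (R1 - R2)]^2), as on the slide
    return math.sqrt((l1 - l2) ** 2 + (math.sqrt(10) * (r1 - r2)) ** 2)

def classify(l, r):
    # Predict the class of the closest training instance.
    nearest = min(train, key=lambda t: scaled_distance(l, r, t[1], t[2]))
    return nearest[3]

print(classify(6, 1.15))   # I  -> G
print(classify(22, 0.45))  # J  -> P
print(classify(15, 1.20))  # K  -> ?? (comes out P here)
```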
slide-5
SLIDE 5

Nearest Neighbor: Issues

  • Prediction can be expensive if many features
  • Affected by classification noise, feature noise
    – One entry can change the prediction
  • Definition of the distance metric
    – How to combine different features
      • Different types, ranges of values
  • Sensitive to feature selection
slide-6
SLIDE 6

Efficient Implementations

  • Classification cost:
    – Find nearest neighbor: O(n)
      • Compute distance between unknown and all instances
      • Compare distances
    – Problematic for large data sets
  • Alternative:
    – Use binary search to reduce to O(log n)

slide-7
SLIDE 7

Efficient Implementation: K-D Trees

  • Divide instances into sets based on features
    – Binary branching: e.g. "> value"
    – 2^d leaves with d splits along each path; setting 2^d = n gives d = O(log n)
    – To split cases into sets:
      • If there is one element in the set, stop
      • Otherwise pick a feature to split on
        – Find the average position of the two middle objects on that dimension
        – Split the remaining objects based on that average position
        – Recursively split the subsets (see the sketch below)
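A small sketch of that splitting recipe, applied to the credit-rating points from the earlier slides; the function names and the leaf-descent classifier are illustrative, not from the original deck.

```python
# Build a k-d tree by splitting on the average position of the two middle
# objects, then classify by descending to a leaf.
def build_kdtree(points, depth=0):
    # points: list of (features, label) pairs, features a tuple
    if len(points) == 1:
        return points[0]                               # leaf: one training instance
    dim = depth % len(points[0][0])                    # cycle through the features
    points = sorted(points, key=lambda p: p[0][dim])
    mid = len(points) // 2
    # average position of the two middle objects on this dimension
    split = (points[mid - 1][0][dim] + points[mid][0][dim]) / 2.0
    left = [p for p in points if p[0][dim] <= split]
    right = [p for p in points if p[0][dim] > split]
    if not left or not right:                          # all tied: fall back to an index split
        left, right = points[:mid], points[mid:]
    return (dim, split,
            build_kdtree(left, depth + 1), build_kdtree(right, depth + 1))

def classify(tree, x):
    # Descend to a leaf in O(depth) comparisons and return its label.
    while len(tree) == 4:
        dim, split, low, high = tree
        tree = low if x[dim] <= split else high
    return tree[1]

data = [((0, 1.2), "G"), ((25, 0.4), "P"), ((5, 0.7), "G"), ((20, 0.8), "P"),
        ((30, 0.85), "P"), ((11, 1.2), "G"), ((7, 1.15), "G"), ((15, 0.8), "P")]
tree = build_kdtree(data)
print(classify(tree, (6, 1.15)))   # G
```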

slide-8
SLIDE 8

K-D Trees: Classification

(Decision tree diagram for the credit data: root test R > 0.825?; internal tests include L > 17.5?, L > 9?, R > 0.6?, R > 0.75?, R > 1.025?, R > 1.175?; each Yes/No path ends in a Good or Poor leaf.)

slide-9
SLIDE 9

Efficient Implementation: Parallel Hardware

  • Classification cost:
    – # distance computations
      • Constant time with O(n) processors
    – Cost of finding the closest
      • Compute pairwise minima, successively (see the sketch below)
      • O(log n) time
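A serial simulation of that pairwise-minimum reduction, just to show the O(log n) round count; the distances are made-up numbers.

```python
# Each "round" halves the number of candidates, so n distances need only
# O(log n) rounds of parallel comparisons.
def tree_min(distances):
    vals = list(distances)
    rounds = 0
    while len(vals) > 1:
        vals = [min(vals[i:i + 2]) for i in range(0, len(vals), 2)]  # one parallel round
        rounds += 1
    return vals[0], rounds

print(tree_min([6.0, 3.0, 2.3, 1.26, 8.1, 4.0, 5.5, 0.9]))   # (0.9, 3)
```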
slide-10
SLIDE 10

Nearest Neighbor: Analysis

  • Issue:
    – What features should we use?
      • E.g. credit rating: many possible features
        – Tax bracket, debt burden, retirement savings, etc.
    – Nearest neighbor uses ALL of them
    – Irrelevant feature(s) could mislead
  • Fundamental problem with nearest neighbor
slide-11
SLIDE 11

Nearest Neighbor: Advantages

  • Fast training:
    – Just record feature vector / output value pairs
  • Can model a wide variety of functions
    – Complex decision boundaries
    – Weak inductive bias
  • Very generally applicable
slide-12
SLIDE 12

Summary: Nearest Neighbor

  • Nearest neighbor:
    – Training: record input vectors + output values
    – Prediction: closest training instance to the new data
  • Efficient implementations
  • Pros: fast training, very general, little bias
  • Cons: distance metric (scaling), sensitivity to noise & extraneous features

slide-13
SLIDE 13

Learning: Perceptrons

Artificial Intelligence CSPP 56553 February 4, 2004

slide-14
SLIDE 14

Agenda

  • Neural Networks:

– Biological analogy

  • Perceptrons: Single layer networks
  • Perceptron training
  • Perceptron convergence theorem
  • Perceptron limitations
  • Conclusions
slide-15
SLIDE 15

Neurons: The Concept

(Diagram: a neuron with dendrites, cell body, nucleus, and axon.)

Neurons:
  • Receive inputs from other neurons (via synapses)
  • When the input exceeds a threshold, the neuron "fires"
  • Send output along the axon to other neurons

Brain: ~10^11 neurons, ~10^16 synapses

slide-16
SLIDE 16

Artificial Neural Nets

  • Simulated neuron:
    – Node connected to other nodes via links
      • A link plays the role of axon + synapse + dendrite
      • Each link is associated with a weight (like a synapse)
        – The weight multiplies the output of the source node
    – Node combines its inputs via an activation function
      • E.g. sum of weighted inputs passed through a threshold
    – Simpler than real neuronal processes
slide-17
SLIDE 17

Artificial Neural Net

(Diagram: inputs x multiplied by weights w, summed, and passed through a threshold.)

slide-18
SLIDE 18

Perceptrons

  • Single neuron-like element
    – Binary inputs
    – Binary outputs
  • Weighted sum of inputs > threshold
slide-19
SLIDE 19

Perceptron Structure

(Diagram: inputs x0 = 1, x1, x2, x3, ..., xn with weights w0, w1, w2, w3, ..., wn feeding a threshold unit with output y.)

    y = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}

The constant input x0 = 1 with weight w0 compensates for the threshold.
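A one-function sketch of that unit, assuming the x0 = 1 bias convention above; the function name and the example weights are illustrative.

```python
# Minimal perceptron unit: x0 = 1 is a constant bias input, so w0 plays
# the role of (minus) the threshold.
def perceptron_output(weights, inputs):
    x = [1.0] + list(inputs)                              # prepend x0 = 1
    total = sum(w * xi for w, xi in zip(weights, x))      # sum_i w_i x_i
    return 1 if total > 0 else 0

print(perceptron_output([-0.5, 1.0, 1.0], [0, 1]))        # 1: behaves like OR
```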

slide-20
SLIDE 20

Perceptron Example

  • Logical-OR: linearly separable
    – 00: 0; 01: 1; 10: 1; 11: 1

(Plots: the OR examples in the x1-x2 plane, three positives and one negative, with two different separating lines drawn.)
slide-21
SLIDE 21

Perceptron Convergence Procedure

  • Straightforward training procedure
    – Learns linearly separable functions
  • Until the perceptron yields the correct output for all examples:
    – If the perceptron is correct, do nothing
    – If the perceptron is wrong:
      • If it incorrectly says "yes", subtract the input vector from the weight vector
      • Otherwise, add the input vector to the weight vector
slide-22
SLIDE 22

Perceptron Convergence Example

  • LOGICAL-OR samples:

  Sample  x0  x1  x2  Desired output
  S1      1   0   0   0
  S2      1   0   1   1
  S3      1   1   0   1
  S4      1   1   1   1

  • Initial: w = (0, 0, 0); after S2, w = w + S2 = (1, 0, 1)
  • Pass 2: S1: w = w - S1 = (0, 0, 1); S3: w = w + S3 = (1, 1, 1)
  • Pass 3: S1: w = w - S1 = (0, 1, 1)
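A runnable sketch of the convergence procedure from the previous slide, applied to these samples; it finishes with w = (0, 1, 1), matching the trace above.

```python
samples = [
    ((1, 0, 0), 0),   # S1 (x0 is the constant bias input)
    ((1, 0, 1), 1),   # S2
    ((1, 1, 0), 1),   # S3
    ((1, 1, 1), 1),   # S4
]
w = [0, 0, 0]

def output(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

converged = False
while not converged:
    converged = True
    for x, desired in samples:
        o = output(w, x)
        if o == desired:
            continue                                   # correct: do nothing
        converged = False
        if o == 1:                                     # incorrectly said "yes"
            w = [wi - xi for wi, xi in zip(w, x)]      # subtract input from weights
        else:                                          # incorrectly said "no"
            w = [wi + xi for wi, xi in zip(w, x)]      # add input to weights

print(w)   # [0, 1, 1]
```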
slide-23
SLIDE 23

Perceptron Convergence Theorem

  • If there exists a unit vector v such that v \cdot \vec{x} > \delta for every positive example \vec{x} (negative examples can be folded in by negating them), then perceptron training will find a separating weight vector.
  • Sketch of the argument (starting from w = 0):
    – Each update adds a misclassified example, so after k updates

        \vec{w}_k = \vec{x}_1 + \vec{x}_2 + \cdots + \vec{x}_k \quad\text{and}\quad v \cdot \vec{w}_k > k\delta

    – \|w\|^2 increases by at most \|x\|^2 in each iteration (an update only happens when \vec{w} \cdot \vec{x} \le 0):

        \|\vec{w} + \vec{x}\|^2 \le \|\vec{w}\|^2 + \|\vec{x}\|^2, \quad\text{so}\quad \|\vec{w}_k\|^2 \le k\,\|\vec{x}\|_{\max}^2

    – Since v \cdot \vec{w}_k / \|\vec{w}_k\| \le 1, it follows that k\delta / (\sqrt{k}\,\|\vec{x}\|_{\max}) \le 1
  • Converges in k \le (\|\vec{x}\|_{\max}/\delta)^2 = O(1/\delta^2) steps

slide-24
SLIDE 24

Perceptron Learning

  • Perceptrons learn linear decision boundaries
    – E.g. a single line can separate the + and - examples in the (x1, x2) plane
    – But not XOR: no single line separates its positives from its negatives
  • Why no weights can work for XOR (threshold at 0):

  x1  x2
  1   0    XOR = 1, so w1*1 + w2*0 > 0  =>  implies w1 > 0
  0   1    XOR = 1, so w1*0 + w2*1 > 0  =>  implies w2 > 0
  1   1    then w1*1 + w2*1 > 0 follows, but XOR(1, 1) should be false
slide-25
SLIDE 25

Perceptron Example

  • Digit recognition
    – Assume the display = 8 lightable bars
    – Inputs: bars on/off, plus threshold
    – 65 steps to recognize "8"

slide-26
SLIDE 26

Perceptron Summary

  • Motivated by neuron activation
  • Simple training procedure
  • Guaranteed to converge

– IF linearly separable

slide-27
SLIDE 27

Neural Nets

  • Multi-layer perceptrons
    – Inputs: real-valued
    – Intermediate "hidden" nodes
    – Output(s): one (or more) discrete-valued

(Diagram: inputs X1-X4 feed hidden layers, which feed outputs Y1 and Y2.)

slide-28
SLIDE 28

Neural Nets

  • Pro: More general than perceptrons
    – Not restricted to linear discriminants
    – Multiple outputs: one classification each
  • Con: No simple, guaranteed training procedure
    – Use a greedy, hill-climbing procedure to train
    – "Gradient descent", "Backpropagation"

slide-29
SLIDE 29

Solving the XOR Problem

(Diagram: inputs x1, x2 feed hidden nodes 1 and 2 (outputs o1, o2) via weights w11, w21, w12, w22; the hidden outputs feed the output node (output y) via w13, w23; each node also has a bias input of -1 weighted by w01, w02, w03.)

Network topology: 2 hidden nodes, 1 output

Desired behavior:
  x1  x2  o1  o2  y
  0   0   0   0   0
  1   0   0   1   1
  0   1   0   1   1
  1   1   1   1   0

Weights: w11 = w12 = 1; w21 = w22 = 1; w01 = 3/2; w02 = 1/2; w03 = 1/2; w13 = -1; w23 = 1

With these weights, hidden node 1 computes AND(x1, x2), hidden node 2 computes OR(x1, x2), and the output fires for "OR but not AND", i.e. XOR (checked in the sketch below).
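A quick check of that network in code, assuming step-threshold units and the bias-input-of-minus-one convention; it reproduces the desired-behavior table.

```python
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    o1 = step(1 * x1 + 1 * x2 - 1.5)    # w11*x1 + w21*x2 - w01  (acts as AND)
    o2 = step(1 * x1 + 1 * x2 - 0.5)    # w12*x1 + w22*x2 - w02  (acts as OR)
    y = step(-1 * o1 + 1 * o2 - 0.5)    # w13*o1 + w23*o2 - w03  (OR and not AND)
    return o1, o2, y

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, *xor_net(x1, x2))     # reproduces the desired-behavior table
```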

slide-30
SLIDE 30

Neural Net Applications

  • Speech recognition
  • Handwriting recognition
  • NETtalk: Letter-to-sound rules
  • ALVINN: Autonomous driving
slide-31
SLIDE 31

ALVINN

  • Driving as a neural network
  • Inputs:
    – Image pixel intensities
      • I.e. lane lines
  • 5 hidden nodes
  • Outputs:
    – Steering actions
      • E.g. turn left/right; how far
  • Training:
    – Observe human behavior: sample images + steering

slide-32
SLIDE 32

Backpropagation

  • Greedy, hill-climbing procedure
    – Weights are the parameters to change
    – Original hill-climbing changes one parameter per step
      • Slow
    – If the function is smooth, change all parameters per step
      • Gradient descent
    – Backpropagation: computes the current output, works backward to correct the error

slide-33
SLIDE 33

Producing a Smooth Function

  • Key problem:
    – A pure step threshold is discontinuous
      • Not differentiable
  • Solution:
    – Sigmoid (squashed 's' function): the logistic function

        s(z) = \frac{1}{1 + e^{-z}}, \qquad z = \sum_{i=1}^{n} w_i x_i
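A small sketch of the logistic function and the derivative identity used later by backpropagation.

```python
import math

# ds/dz = s(z) * (1 - s(z))
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_deriv(0.0))   # 0.5 0.25
```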

slide-34
SLIDE 34

Neural Net Training

  • Goal:
    – Determine how to change the weights to get the correct output
      • Large change in a weight that produces a large reduction in error
  • Approach:
    – Compute the actual output: o
    – Compare to the desired output: d
    – Determine the effect of each weight w on the error = d - o
    – Adjust the weights
slide-35
SLIDE 35

Neural Net Example

(Diagram: inputs x1, x2 feed hidden nodes z1, z2 (outputs y1, y2) via weights w11, w21, w12, w22; the hidden outputs feed output node z3 (output y3) via w13, w23; each node has a bias input of -1 weighted by w01, w02, w03.)

x_i: i-th sample input vector; \vec{w}: weight vector; y_i^*: desired output for the i-th sample

Sum-of-squares error over the training samples:

    E = \frac{1}{2} \sum_i \left( y_i^* - F(\vec{x}_i, \vec{w}) \right)^2

Full expression of the output in terms of the inputs and weights:

    y_3 = F(\vec{x}, \vec{w}) = s\big( w_{13}\, s(w_{11} x_1 + w_{21} x_2 - w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 - w_{02}) - w_{03} \big)

  • From 6.034 notes, Lozano-Perez
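A minimal sketch of F(x, w) and the sum-of-squares error for this network; the weight dictionary and its placeholder values are illustrative, not from the slides.

```python
import math

def s(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w):
    y1 = s(w["w11"] * x1 + w["w21"] * x2 - w["w01"])    # hidden node z1
    y2 = s(w["w12"] * x1 + w["w22"] * x2 - w["w02"])    # hidden node z2
    return s(w["w13"] * y1 + w["w23"] * y2 - w["w03"])  # output y3 = F(x, w)

def sse(samples, w):
    # E = 1/2 * sum_i (y_i* - F(x_i, w))^2
    return 0.5 * sum((y_star - forward(x1, x2, w)) ** 2
                     for (x1, x2), y_star in samples)

w = {k: 0.1 for k in ["w11", "w21", "w12", "w22", "w13", "w23", "w01", "w02", "w03"]}
samples = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 0)]
print(sse(samples, w))
```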
slide-36
SLIDE 36

Gradient Descent

  • Error: sum-of-squares error of the inputs with the current weights
  • Compute the rate of change of the error with respect to each weight
    – Which weights have the greatest effect on the error?
    – Effectively, partial derivatives of the error with respect to the weights
      • These in turn depend on other weights => chain rule
slide-37
SLIDE 37

Gradient Descent

  • E = G(w)
    – Error as a function of the weights
  • Find the rate of change of the error
    – Follow the steepest rate of change
    – Change the weights so that the error is minimized (see the sketch below)

(Plot: error E = G(w) versus weight w, showing the slope dG/dw, local minima, and successive points w0, w1.)
slide-38
SLIDE 38

MIT AI lecture notes, Lozano-Perez 2000

Gradient of Error

(Same network as before: hidden nodes z1, z2 with outputs y1, y2; output node z3 with output y3.)

    y_3 = F(\vec{x}, \vec{w}) = s\big( w_{13}\, s(w_{11} x_1 + w_{21} x_2 - w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 - w_{02}) - w_{03} \big)

    E = \frac{1}{2} \sum_i \left( y_i^* - F(\vec{x}_i, \vec{w}) \right)^2

    \frac{\partial E}{\partial w_j} = -(y^* - y_3)\, \frac{\partial y_3}{\partial w_j}

For a weight into the output node:

    \frac{\partial y_3}{\partial w_{13}} = \frac{\partial s(z_3)}{\partial w_{13}} = \frac{\partial s(z_3)}{\partial z_3} \frac{\partial z_3}{\partial w_{13}} = \frac{\partial s(z_3)}{\partial z_3}\, y_1

For a weight into a hidden node, the chain rule goes one layer deeper:

    \frac{\partial y_3}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3} \frac{\partial z_3}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3}\, w_{13} \frac{\partial s(z_1)}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3}\, w_{13} \frac{\partial s(z_1)}{\partial z_1}\, x_1

Note: derivative of the sigmoid: \frac{d\, s(z_1)}{d z_1} = s(z_1)\,(1 - s(z_1))

  • From 6.034 notes, Lozano-Perez
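A sketch evaluating those two partial derivatives for this network and checking them against a finite-difference estimate; the weight values are arbitrary placeholders.

```python
import math

def s(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w):
    z1 = w["w11"] * x1 + w["w21"] * x2 - w["w01"]
    z2 = w["w12"] * x1 + w["w22"] * x2 - w["w02"]
    z3 = w["w13"] * s(z1) + w["w23"] * s(z2) - w["w03"]
    return z1, z3, s(z3)                                # keep what the gradients need

w = {"w11": 0.2, "w21": -0.1, "w12": 0.4, "w22": 0.3,
     "w13": 0.5, "w23": -0.2, "w01": 0.1, "w02": 0.0, "w03": 0.2}
x1, x2, y_star = 1.0, 0.0, 1.0

z1, z3, y3 = forward(x1, x2, w)
y1 = s(z1)
dE_dy3 = -(y_star - y3)
dE_dw13 = dE_dy3 * s(z3) * (1 - s(z3)) * y1                                    # output-layer weight
dE_dw11 = dE_dy3 * s(z3) * (1 - s(z3)) * w["w13"] * s(z1) * (1 - s(z1)) * x1   # hidden-layer weight

def numeric(name, eps=1e-6):
    wp, wm = dict(w), dict(w)
    wp[name] += eps
    wm[name] -= eps
    Ep = 0.5 * (y_star - forward(x1, x2, wp)[-1]) ** 2
    Em = 0.5 * (y_star - forward(x1, x2, wm)[-1]) ** 2
    return (Ep - Em) / (2 * eps)

print(dE_dw13, numeric("w13"))   # each pair should agree closely
print(dE_dw11, numeric("w11"))
```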
slide-39
SLIDE 39

From Effect to Update

  • Gradient computation:

– How each weight contributes to performance

  • To train:

– Need to determine how to CHANGE weight based on contribution to performance – Need to determine how MUCH change to make per iteration

  • Rate parameter ‘r’

– Large enough to learn quickly – Small enough reach but not overshoot target values

slide-40
SLIDE 40

Backpropagation Procedure

  • Pick a rate parameter r
  • Until performance is good enough:
    – Do the forward computation to calculate the output
    – Compute beta in the output node z with

        \beta_z = d_z - o_z

    – Compute beta in all other nodes j with

        \beta_j = \sum_k w_{j \to k}\, o_k (1 - o_k)\, \beta_k

    – Compute the change for all weights with

        \Delta w_{i \to j} = r\, o_i\, o_j (1 - o_j)\, \beta_j
slide-41
SLIDE 41

Backprop Example

(Same network as before: inputs x1, x2; hidden nodes z1, z2 with outputs y1, y2; output node z3 with output y3; bias inputs of -1 weighted by w01, w02, w03.)

Forward prop: compute z_i and y_i given x_k, w_l

Betas:

    \beta_3 = y_3^* - y_3
    \beta_2 = y_3 (1 - y_3)\, w_{23}\, \beta_3
    \beta_1 = y_3 (1 - y_3)\, w_{13}\, \beta_3

Weight updates (the bias "input" is -1):

    w_{03} \leftarrow w_{03} + r\,(-1)\, y_3 (1 - y_3)\, \beta_3
    w_{02} \leftarrow w_{02} + r\,(-1)\, y_2 (1 - y_2)\, \beta_2
    w_{01} \leftarrow w_{01} + r\,(-1)\, y_1 (1 - y_1)\, \beta_1

    w_{13} \leftarrow w_{13} + r\, y_1\, y_3 (1 - y_3)\, \beta_3
    w_{23} \leftarrow w_{23} + r\, y_2\, y_3 (1 - y_3)\, \beta_3

    w_{11} \leftarrow w_{11} + r\, x_1\, y_1 (1 - y_1)\, \beta_1
    w_{21} \leftarrow w_{21} + r\, x_2\, y_1 (1 - y_1)\, \beta_1
    w_{12} \leftarrow w_{12} + r\, x_1\, y_2 (1 - y_2)\, \beta_2
    w_{22} \leftarrow w_{22} + r\, x_2\, y_2 (1 - y_2)\, \beta_2
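A runnable sketch of these update rules, applied online to the XOR samples from the earlier slide. The learning rate, epoch count, and random initialization are illustrative choices, and (as the gradient-descent slide warns) a poor initialization can stall in a local minimum.

```python
import math, random

def s(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
w = {k: random.uniform(-0.5, 0.5) for k in
     ["w11", "w21", "w12", "w22", "w13", "w23", "w01", "w02", "w03"]}
samples = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 0)]
r = 0.5

def forward(x1, x2):
    y1 = s(w["w11"] * x1 + w["w21"] * x2 - w["w01"])
    y2 = s(w["w12"] * x1 + w["w22"] * x2 - w["w02"])
    y3 = s(w["w13"] * y1 + w["w23"] * y2 - w["w03"])
    return y1, y2, y3

for _ in range(20000):
    for (x1, x2), y_star in samples:
        y1, y2, y3 = forward(x1, x2)                  # forward prop
        b3 = y_star - y3                              # betas
        b2 = y3 * (1 - y3) * w["w23"] * b3
        b1 = y3 * (1 - y3) * w["w13"] * b3
        w["w03"] += r * (-1) * y3 * (1 - y3) * b3     # bias "input" is -1
        w["w02"] += r * (-1) * y2 * (1 - y2) * b2
        w["w01"] += r * (-1) * y1 * (1 - y1) * b1
        w["w13"] += r * y1 * y3 * (1 - y3) * b3
        w["w23"] += r * y2 * y3 * (1 - y3) * b3
        w["w11"] += r * x1 * y1 * (1 - y1) * b1
        w["w21"] += r * x2 * y1 * (1 - y1) * b1
        w["w12"] += r * x1 * y2 * (1 - y2) * b2
        w["w22"] += r * x2 * y2 * (1 - y2) * b2

for (x1, x2), y_star in samples:
    print(x1, x2, round(forward(x1, x2)[2], 2), y_star)   # outputs should approach the targets
```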

slide-42
SLIDE 42

Backpropagation Observations

  • Procedure is (relatively) efficient

– All computations are local

  • Use inputs and outputs of current node
  • What is “good enough”?

– Rarely reach target (0 or 1) outputs

  • Typically, train until within 0.1 of target
slide-43
SLIDE 43

Neural Net Summary

  • Training:
    – Backpropagation procedure
      • Gradient descent strategy (usual problems)
  • Prediction:
    – Compute outputs based on input vector & weights
  • Pros: very general, fast prediction
  • Cons: training can be VERY slow (1000's of epochs), overfitting
slide-44
SLIDE 44

Training Strategies

  • Online training:
    – Update weights after each sample
  • Offline (batch) training:
    – Compute error over all samples, then update weights
  • Online training is "noisy"
    – Sensitive to individual instances
    – However, may escape local minima

slide-45
SLIDE 45

Training Strategy

  • To avoid overfitting:
    – Split data into training, validation, & test sets
    – Also, avoid excess weights (fewer weights than samples)
  • Initialize with small random weights
    – Small changes have a noticeable effect
  • Use offline training
    – Train until the validation-set error reaches its minimum
  • Evaluate on the test set
    – No more weight changes

slide-46
SLIDE 46

Classification

  • Neural networks are best for classification tasks
    – Single output -> binary classifier
    – Multiple outputs -> multiway classification
      • Applied successfully to learning pronunciation
    – Sigmoid pushes outputs toward a binary classification
  • Not good for regression
slide-47
SLIDE 47

Neural Net Example

  • NETtalk: letter-to-sound by net
  • Inputs:
    – Need context to pronounce
      • 7-letter window: predict the sound of the middle letter
      • 29 possible characters: alphabet + space + "," + "."
    – 7 * 29 = 203 inputs
  • 80 hidden nodes
  • Output: generate 60 phones
    – Nodes map to 26 units: 21 articulatory, 5 stress/syllable
      • Vector quantization of acoustic space
slide-48
SLIDE 48

Neural Net Example: NETtalk

  • Learning to talk:
    – 5 iterations over the 1024 training words: boundaries/stress
    – 10 iterations: intelligible
    – 400 new test words: 80% correct
  • Not as good as DECtalk, but automatic
slide-49
SLIDE 49

Neural Net Conclusions

  • Simulation based on neurons in the brain
  • Perceptrons (single neuron)
    – Guaranteed to find a linear discriminant
      • IF one exists -> problem: XOR
  • Neural nets (multi-layer perceptrons)
    – Very general
    – Backpropagation training procedure
      • Gradient descent - local minima, overfitting issues