Learning: Nearest Neighbor, Perceptrons & Neural Nets
Artificial Intelligence
CSPP 56553
February 4, 2004
Nearest Neighbor Example II
- Credit Rating:
  – Classifier: Good / Poor
  – Features:
    - L = # late payments/yr
    - R = Income/Expenses

  Name  L   R     G/P
  A     0   1.20  G
  B     25  0.40  P
  C     5   0.70  G
  D     20  0.80  P
  E     30  0.85  P
  F     11  1.20  G
  G     7   1.15  G
  H     15  0.80  P
Nearest Neighbor Example II
[Scatter plot: the eight instances A–H from the table above, plotted in the feature plane with L (0–30) on one axis and R on the other]
Nearest Neighbor Example II
[Same scatter plot with three new applicants I, J, K added]

  Name  L   R     G/P
  I     6   1.15  G
  J     22  0.45  P
  K     15  1.20  ??

- Distance measure: sqrt((L1 − L2)^2 + (sqrt(10) · (R1 − R2))^2)
  – Scaled distance (see the sketch below)
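A minimal Python sketch (not part of the original slides) of 1-nearest-neighbor classification with the scaled distance above, using the A–H training instances; the names `scaled_distance` and `classify_nn` are illustrative.

```python
# Illustrative sketch: 1-nearest-neighbor with the scaled distance above.
import math

# Training instances: (L = late payments/yr, R = income/expenses) -> Good/Poor
train = {
    "A": (0, 1.20, "G"), "B": (25, 0.40, "P"), "C": (5, 0.70, "G"),
    "D": (20, 0.80, "P"), "E": (30, 0.85, "P"), "F": (11, 1.20, "G"),
    "G": (7, 1.15, "G"), "H": (15, 0.80, "P"),
}

def scaled_distance(p, q):
    """sqrt((L1-L2)^2 + (sqrt(10)*(R1-R2))^2): R is rescaled so both features matter."""
    (l1, r1), (l2, r2) = p, q
    return math.sqrt((l1 - l2) ** 2 + (math.sqrt(10) * (r1 - r2)) ** 2)

def classify_nn(query):
    """Return the label of the closest training instance."""
    nearest = min(train.values(), key=lambda rec: scaled_distance(query, rec[:2]))
    return nearest[2]

# New applicants I, J, K from the slide
for name, feats in {"I": (6, 1.15), "J": (22, 0.45), "K": (15, 1.2)}.items():
    print(name, classify_nn(feats))
```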
Nearest Neighbor: Issues
- Prediction can be expensive if there are many features
- Affected by classification, feature noise
– One entry can change prediction
- Definition of distance metric
– How to combine different features
- Different types, ranges of values
- Sensitive to feature selection
Efficient Implementations
- Classification cost:
– Find nearest neighbor: O(n)
- Compute distance between the unknown instance and all training instances
- Compare distances
– Problematic for large data sets
- Alternative:
– Use tree-structured search (k-d trees) to reduce to O(log n)
Efficient Implementation: K-D Trees
- Divide instances into sets based on features
– Binary branching: e.g., feature > value
  - 2^d leaves; with n = 2^d instances, path depth d = O(log n)
– To split cases into sets:
  - If there is one element in the set, stop
  - Otherwise pick a feature to split on
    » Find the average position of the two middle objects on that dimension
    » Split remaining objects based on that average position
    » Recursively split the subsets (see the sketch below)
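A short sketch of the construction procedure above, assuming points are plain feature tuples and that features are cycled through by depth; the names `KDNode` and `build_kdtree` are illustrative, not from the slides.

```python
# Illustrative k-d tree construction: pick a feature, split at the average
# position of the two middle objects, recurse on each half.
class KDNode:
    def __init__(self, feature=None, threshold=None, left=None, right=None, point=None):
        self.feature = feature      # index of the splitting feature
        self.threshold = threshold  # split value ("> threshold" goes right)
        self.left, self.right = left, right
        self.point = point          # stored instance at a leaf

def build_kdtree(points, depth=0):
    """points: list of feature tuples. Returns a KDNode."""
    if len(points) == 1:                      # one element: stop
        return KDNode(point=points[0])
    feature = depth % len(points[0])          # e.g., cycle through features by depth
    pts = sorted(points, key=lambda p: p[feature])
    mid = len(pts) // 2
    # average position of the two middle objects on that dimension
    threshold = (pts[mid - 1][feature] + pts[mid][feature]) / 2.0
    left = [p for p in pts if p[feature] <= threshold]
    right = [p for p in pts if p[feature] > threshold]
    if not left or not right:                 # ties: fall back to an index split
        left, right = pts[:mid], pts[mid:]
    return KDNode(feature, threshold,
                  build_kdtree(left, depth + 1),
                  build_kdtree(right, depth + 1))
```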
K-D Trees: Classification
[k-d tree diagram for the credit data: the root splits on R > 0.825; lower levels split on L > 17.5, L > 9, R > 0.6, R > 0.75, R > 1.025, and R > 1.175; leaves are labeled Good or Poor]
Efficient Implementation: Parallel Hardware
- Classification cost:
– # distance computations
- Constant time if O(n) processors are available
– Cost of finding closest
- Compute pairwise minimum, successively
- O(log n) time
Nearest Neighbor: Analysis
- Issue:
– What features should we use?
- E.g. Credit rating: Many possible features
– Tax bracket, debt burden, retirement savings, etc.
  – Nearest neighbor uses ALL features
  – Irrelevant feature(s) could mislead
- Fundamental problem with nearest neighbor
Nearest Neighbor: Advantages
- Fast training:
– Just record feature vector - output value set
- Can model wide variety of functions
– Complex decision boundaries
  – Weak inductive bias
- Very generally applicable
Summary: Nearest Neighbor
- Nearest neighbor:
– Training: record input vectors + output values
  – Prediction: closest training instance to new data
- Efficient implementations
- Pros: fast training, very general, little bias
- Cons: distance metric (scaling), sensitivity to noise & extraneous features
Learning: Perceptrons
Artificial Intelligence CSPP 56553 February 4, 2004
Agenda
- Neural Networks:
– Biological analogy
- Perceptrons: Single layer networks
- Perceptron training
- Perceptron convergence theorem
- Perceptron limitations
- Conclusions
Neurons: The Concept
[Diagram of a neuron: dendrites, cell body, nucleus, axon]
- Neurons:
  – Receive inputs from other neurons (via synapses)
  – When input exceeds threshold, “fires”
  – Sends output along axon to other neurons
- Brain: 10^11 neurons, 10^16 synapses
Artificial Neural Nets
- Simulated Neuron:
– Node connected to other nodes via links
- Links play the role of axon + synapse + dendrite
- Links associated with weight (like synapse)
– Multiplied by output of node
– Node combines input via activation function
- E.g. sum of weighted inputs passed through a threshold
- Simpler than real neuronal processes
Artificial Neural Net
[Diagram of an artificial neuron: inputs x multiplied by weights w, summed, then passed through a threshold]
Perceptrons
- Single neuron-like element
– Binary inputs
  – Binary outputs
- Weighted sum of inputs > threshold
Perceptron Structure
[Diagram: inputs x0 = 1, x1, x2, x3, …, xn with weights w0, w1, w2, w3, …, wn feeding a single summing/threshold unit with output y]

y = 1 if Σ_{i=0}^{n} w_i · x_i > 0, and 0 otherwise

- x0 = 1, so w0 compensates for the threshold
Perceptron Example
- Logical-OR: Linearly separable
– 00: 0; 01: 1; 10: 1; 11: 1
[Plot: the four OR inputs in the (x1, x2) plane; the three positive points (+) are separated from the single negative point (–) by a straight line]
Perceptron Convergence Procedure
- Straight-forward training procedure
– Learns linearly separable functions
- Until the perceptron yields the correct output for all training examples:
  – If the perceptron is correct, do nothing
  – If the perceptron is wrong:
    - If it incorrectly says “yes”, subtract the input vector from the weight vector
    - Otherwise, add the input vector to the weight vector
Perceptron Convergence Example
- LOGICAL-OR:
- Samples:

  Sample  x0  x1  x2  Desired output
  1       1   0   0   0
  2       1   0   1   1
  3       1   1   0   1
  4       1   1   1   1

- Initial: w = (0,0,0); after S2, w = w + s2 = (1,0,1)
- Pass 2: S1: w = w − s1 = (0,0,1); S3: w = w + s3 = (1,1,1)
- Pass 3: S1: w = w − s1 = (0,1,1) (this trace is replayed in the sketch below)
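A sketch of the convergence procedure applied to the LOGICAL-OR samples above; it assumes the “fire iff Σ w_i x_i > 0” convention used in the trace, with x0 = 1 standing in for the threshold.

```python
# Sketch of the convergence procedure on the LOGICAL-OR samples above.
samples = [((1, 0, 0), 0),   # S1
           ((1, 0, 1), 1),   # S2
           ((1, 1, 0), 1),   # S3
           ((1, 1, 1), 1)]   # S4

w = [0, 0, 0]
for p in range(10):                       # a few passes is enough here
    errors = 0
    for x, desired in samples:
        out = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
        if out == desired:
            continue                      # correct: do nothing
        errors += 1
        sign = 1 if desired == 1 else -1  # wrong "no": add x; wrong "yes": subtract x
        w = [wi + sign * xi for wi, xi in zip(w, x)]
        print(f"pass {p + 1}: w = {w}")
    if errors == 0:
        break
print("final weights:", w)
```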
Perceptron Convergence Theorem
- If there exists a weight vector v such that v · x > δ for every positive example x (i.e., the data are linearly separable with margin δ), then perceptron training will find a separating weight vector
- Proof sketch (taking ||v|| = 1 and ||x|| ≤ 1 for all examples):
  – After k updates the weight vector is w = x_1 + x_2 + … + x_k, so v · w > kδ
  – Each update increases ||w||^2 by at most ||x||^2 (an update is only made when w · x ≤ 0), so ||w + x||^2 ≤ ||w||^2 + ||x||^2 and after k updates ||w||^2 ≤ k ||x||^2
  – Since v · w / ||w|| ≤ 1, we get kδ / (√k ||x||) ≤ 1
- Converges in k ≤ O((1/δ)^2) steps
Perceptron Learning
- Perceptrons learn linear decision boundaries
- E.g., + / – points separable by a line in the (x1, x2) plane, but not XOR:

[Plot: a linearly separable set of + and – points in the (x1, x2) plane, next to XOR, which no single line separates]

- Why no weights w1, w2 can compute XOR (inputs in {−1, 1}):

  x1   x2   required
  −1   −1   w1·x1 + w2·x2 < 0   (output 0)
   1   −1   w1·x1 + w2·x2 > 0   (output 1)  => with the first row, implies w1 > 0
  −1    1   w1·x1 + w2·x2 > 0   (output 1)  => with the first row, implies w2 > 0
   1    1   then w1·x1 + w2·x2 > 0, but XOR(1,1) should be 0 – contradiction
Perceptron Example
- Digit recognition
– Assume display = 8 lightable bars
  – Inputs: on/off + threshold
  – 65 steps to recognize “8”
Perceptron Summary
- Motivated by neuron activation
- Simple training procedure
- Guaranteed to converge
– IF linearly separable
Neural Nets
- Multi-layer perceptrons
– Inputs: real-valued
  – Intermediate “hidden” nodes
  – Output(s): one (or more) discrete-valued

[Network diagram: inputs X1–X4 feed hidden nodes, which feed outputs Y1 and Y2]
Neural Nets
- Pro: More general than perceptrons
– Not restricted to linear discriminants
  – Multiple outputs: one classification each
- Con: No simple, guaranteed training procedure
  – Use greedy, hill-climbing procedure to train
  – “Gradient descent”, “Backpropagation”
Solving the XOR Problem
[Diagram: inputs x1, x2 feed two hidden threshold units via weights w11, w21, w12, w22 (bias weights w01, w02 on constant −1 inputs); the hidden outputs feed one output unit y via weights w13, w23 (bias weight w03 on a constant −1 input)]

- Network topology: 2 hidden nodes, 1 output
- Desired behavior:

  x1  x2  o1  o2  y
  0   0   0   0   0
  1   0   0   1   1
  0   1   0   1   1
  1   1   1   1   0

- Weights: w11 = w12 = 1; w21 = w22 = 1; w01 = 3/2; w02 = 1/2; w03 = 1/2; w13 = −1; w23 = 1 (checked in the sketch below)
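A quick check (illustrative, not from the slides) that these weights reproduce the desired behavior, using step-threshold units.

```python
# Check that the weights above implement XOR with step-threshold units.
def step(z):
    return 1 if z > 0 else 0

w11 = w12 = w21 = w22 = 1
w01, w02, w03 = 1.5, 0.5, 0.5
w13, w23 = -1, 1

def net(x1, x2):
    o1 = step(w11 * x1 + w21 * x2 - w01)   # hidden node 1: fires only for (1,1)
    o2 = step(w12 * x1 + w22 * x2 - w02)   # hidden node 2: fires if either input is 1
    y = step(w13 * o1 + w23 * o2 - w03)    # output: o2 AND NOT o1 == XOR
    return o1, o2, y

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, *net(x1, x2))
```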
Neural Net Applications
- Speech recognition
- Handwriting recognition
- NETtalk: Letter-to-sound rules
- ALVINN: Autonomous driving
ALVINN
- Driving as a neural network
- Inputs:
– Image pixel intensities
- I.e. lane lines
- 5 Hidden nodes
- Outputs:
– Steering actions
- E.g. turn left/right; how far
- Training:
– Observe human behavior: sample images, steering
Backpropagation
- Greedy, Hill-climbing procedure
– Weights are the parameters to change
  – Original hill-climbing changes one parameter per step
    - Slow
  – If the function is smooth, change all parameters per step
- Gradient descent
– Backpropagation: Computes current output, works backward to correct error
Producing a Smooth Function
- Key problem:
– Pure step threshold is discontinuous
- Not differentiable
- Solution:
– Sigmoid (squashed ‘s’ function): Logistic fn
z = Σ_{i=1}^{n} w_i x_i

s(z) = 1 / (1 + e^(−z))
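A minimal sketch of the logistic activation, assuming z is the weighted sum of a unit's inputs.

```python
# The logistic ("sigmoid") activation: a smooth, differentiable stand-in
# for the step threshold.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(weights, inputs):
    z = sum(w * x for w, x in zip(weights, inputs))  # weighted sum of inputs
    return sigmoid(z)
```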
Neural Net Training
- Goal:
– Determine how to change weights to get correct output
- Large change in weight to produce large reduction in error
- Approach:
- Compute actual output: o
- Compare to desired output: d
- Determine effect of each weight w on error = d-o
- Adjust weights
Neural Net Example
[Diagram: the 2-2-1 network – inputs x1, x2 with weights w11, w21, w12, w22 and bias weights w01, w02 (on constant −1 inputs) into hidden sums z1, z2 with outputs y1, y2; weights w13, w23 and bias weight w03 into output sum z3 with output y3]

E = (1/2) Σ_i (y_i* − F(x_i, w))^2

- x_i: ith sample input vector
- w: weight vector
- y_i*: desired output for the ith sample
- E: sum-of-squares error over the training samples

Full expression of the output in terms of inputs and weights:

y3 = F(x, w) = s( w13 · s(w11·x1 + w21·x2 − w01) + w23 · s(w12·x1 + w22·x2 − w02) − w03 )
               (the inner sums are z1 and z2; the outer sum is z3)

- From 6.034 notes, Lozano-Pérez
Gradient Descent
- Error: sum-of-squares error over the inputs, with the current weights
- Compute the rate of change of the error with respect to each weight
  – Which weights have the greatest effect on the error?
  – Effectively, partial derivatives of the error wrt the weights
    - These, in turn, depend on other weights => chain rule
Gradient Descent
- E = G(w)
– Error as function of weights
- Find the rate of change of the error
  – Follow the steepest rate of change
  – Change weights s.t. the error is minimized

[Plot: error G(w) as a function of a weight w, with slope dG/dw, local minima, and successive points w0, w1]

- MIT AI lecture notes, Lozano-Pérez 2000
Gradient of Error
y3 = F(x, w) = s( w13 · s(w11·x1 + w21·x2 − w01) + w23 · s(w12·x1 + w22·x2 − w02) − w03 )
               (inner sums z1, z2; outer sum z3)

E = (1/2) Σ_i (y_i* − F(x_i, w))^2

∂E/∂w_j = −(y_i* − y3) · ∂y3/∂w_j

∂y3/∂w13 = ∂s(z3)/∂w13 = (∂s(z3)/∂z3) · (∂z3/∂w13) = (∂s(z3)/∂z3) · y1

∂y3/∂w11 = (∂s(z3)/∂z3) · (∂z3/∂w11) = (∂s(z3)/∂z3) · w13 · (∂s(z1)/∂z1) · (∂z1/∂w11)
         = (∂s(z3)/∂z3) · w13 · (∂s(z1)/∂z1) · x1

[Diagram: the same 2-2-1 network, labeled as before]

Note: derivative of the sigmoid: ds(z1)/dz1 = s(z1)(1 − s(z1))

- From 6.034 notes, Lozano-Pérez
From Effect to Update
- Gradient computation:
– How each weight contributes to performance
- To train:
– Need to determine how to CHANGE each weight based on its contribution to performance
  – Need to determine how MUCH change to make per iteration
    - Rate parameter ‘r’
      – Large enough to learn quickly
      – Small enough to reach, but not overshoot, target values
Backpropagation Procedure
- Pick rate parameter ‘r’
- Until performance is good enough,
– Do the forward computation to calculate the output
  – Compute β for the output node: β_z = d_z − o_z (desired minus actual output)
  – Compute β for all other nodes: β_j = Σ_k w_{j→k} · o_k · (1 − o_k) · β_k
  – Compute the change for every weight: Δw_{i→j} = r · o_i · o_j · (1 − o_j) · β_j
    (o_i is the output of node i; w_{i→j} is the weight on the link from node i to node j)
Backprop Example
[Diagram: the 2-2-1 network – x1, x2 into hidden units (outputs y1, y2) via w11, w21, w12, w22 with bias weights w01, w02; hidden outputs into the output unit (y3) via w13, w23 with bias weight w03; bias inputs are constant −1]

Forward prop: compute the z_i and y_i given the x_k and w_l

β3 = y3* − y3
β1 = y3 (1 − y3) w13 β3
β2 = y3 (1 − y3) w23 β3

Weight updates (bias inputs are −1, hence the (−1) factors):

w03 = w03 + r (−1) y3 (1 − y3) β3
w02 = w02 + r (−1) y2 (1 − y2) β2
w01 = w01 + r (−1) y1 (1 − y1) β1
w13 = w13 + r y1 y3 (1 − y3) β3
w23 = w23 + r y2 y3 (1 − y3) β3
w11 = w11 + r x1 y1 (1 − y1) β1
w21 = w21 + r x2 y1 (1 − y1) β1
w12 = w12 + r x1 y2 (1 − y2) β2
w22 = w22 + r x2 y2 (1 − y2) β2

(these updates are implemented in the sketch below)
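A sketch (not the original notes' code) of these updates for the 2-2-1 network, with biases treated as weights on a constant −1 input; the random initialization, rate r = 0.5, and epoch count are arbitrary choices, and training XOR this way can be slow or stall in a local minimum.

```python
# Sketch of backprop for the 2-2-1 network above, following the update
# equations on this slide.  Biases w01, w02, w03 multiply a constant -1 input.
import math, random

def s(z):                        # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
w = {k: random.uniform(-0.5, 0.5)
     for k in ("w11", "w21", "w01", "w12", "w22", "w02", "w13", "w23", "w03")}
r = 0.5                          # rate parameter (arbitrary choice)

def forward(x1, x2):
    y1 = s(w["w11"] * x1 + w["w21"] * x2 - w["w01"])
    y2 = s(w["w12"] * x1 + w["w22"] * x2 - w["w02"])
    y3 = s(w["w13"] * y1 + w["w23"] * y2 - w["w03"])
    return y1, y2, y3

def backprop(x1, x2, target):
    y1, y2, y3 = forward(x1, x2)
    beta3 = target - y3                          # output node
    beta1 = y3 * (1 - y3) * w["w13"] * beta3     # hidden node 1
    beta2 = y3 * (1 - y3) * w["w23"] * beta3     # hidden node 2
    # weight updates: delta(w_ij) = r * o_i * o_j * (1 - o_j) * beta_j
    w["w13"] += r * y1 * y3 * (1 - y3) * beta3
    w["w23"] += r * y2 * y3 * (1 - y3) * beta3
    w["w03"] += r * (-1) * y3 * (1 - y3) * beta3
    w["w11"] += r * x1 * y1 * (1 - y1) * beta1
    w["w21"] += r * x2 * y1 * (1 - y1) * beta1
    w["w01"] += r * (-1) * y1 * (1 - y1) * beta1
    w["w12"] += r * x1 * y2 * (1 - y2) * beta2
    w["w22"] += r * x2 * y2 * (1 - y2) * beta2
    w["w02"] += r * (-1) * y2 * (1 - y2) * beta2

# Online training on XOR (may need many epochs, and may stall in a local minimum)
xor = [(0, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
for epoch in range(5000):
    for x1, x2, t in xor:
        backprop(x1, x2, t)
for x1, x2, t in xor:
    print(x1, x2, round(forward(x1, x2)[2], 2), "target", t)
```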
Backpropagation Observations
- Procedure is (relatively) efficient
– All computations are local
- Use inputs and outputs of current node
- What is “good enough”?
– Rarely reach target (0 or 1) outputs
- Typically, train until within 0.1 of target
Neural Net Summary
- Training:
– Backpropagation procedure
- Gradient descent strategy (usual problems)
- Prediction:
– Compute outputs based on input vector & weights
- Pros: Very general, Fast prediction
- Cons: Training can be VERY slow (1000’s of epochs); overfitting
Training Strategies
- Online training:
– Update weights after each sample
- Offline (batch) training:
– Compute error over all samples
- Then update weights
- Online training is “noisy”
  – Sensitive to individual instances
  – However, may escape local minima (both loop structures are sketched below)
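A schematic contrast of the two update loops, assuming some per-sample gradient routine (here the hypothetical `gradient_for_sample`, standing in for the backprop updates above).

```python
# Online vs. batch updates, given a per-sample gradient routine.
def train_online(weights, samples, rate, gradient_for_sample):
    for x, target in samples:                       # update after each sample
        g = gradient_for_sample(weights, x, target)
        weights = [w - rate * gi for w, gi in zip(weights, g)]
    return weights

def train_batch(weights, samples, rate, gradient_for_sample):
    total = [0.0] * len(weights)                    # accumulate gradient over all samples
    for x, target in samples:
        g = gradient_for_sample(weights, x, target)
        total = [t + gi for t, gi in zip(total, g)]
    return [w - rate * t for w, t in zip(weights, total)]  # then update once
```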
Training Strategy
- To avoid overfitting:
– Split data into: training, validation, & test
- Also, avoid excess weights (less than # samples)
- Initialize with small random weights
– Small changes have noticeable effect
- Use offline training
– Train until the validation-set error reaches its minimum
- Evaluate on test set
– No more weight changes
Classification
- Neural networks are best for classification tasks
  – Single output -> binary classifier
  – Multiple outputs -> multiway classification
- Applied successfully to learning pronunciation
– Sigmoid pushes to binary classification
- Not good for regression
Neural Net Example
- NETtalk: Letter-to-sound by net
- Inputs:
– Need context to pronounce
- 7-letter window: predict sound of middle letter
- 29 possible characters: alphabet + space + comma + period
  – 7 × 29 = 203 inputs
- 80 Hidden nodes
- Output: Generate 60 phones
– Nodes map to 26 units: 21 articulatory, 5 stress/sil
- Vector quantization of acoustic space
Neural Net Example: NETtalk
- Learning to talk:
– 5 iterations / 1024 training words: bound/stress
  – 10 iterations: intelligible
  – 400 new test words: 80% correct
- Not as good as DecTalk, but automatic
Neural Net Conclusions
- Simulation based on neurons in brain
- Perceptrons (single neuron)
– Guaranteed to find linear discriminant
- IF one exists -> the problem: XOR has none
- Neural nets (Multi-layer perceptrons)
– Very general – Backpropagation training procedure
- Gradient descent - local min, overfitting issues