

SLIDE 1

CSCE 478/878 Lecture 8: Instance-Based Learning

Stephen D. Scott (Adapted from Tom Mitchell’s slides)

November 14, 2006

SLIDE 2

Outline

  • k-Nearest Neighbor
  • Locally weighted regression
  • Radial basis functions
  • Case-based reasoning
  • Lazy and eager learning

SLIDE 3

Nearest Neighbor

Key idea: just store all training examples ⟨x_i, f(x_i)⟩

Need some distance measure between instances (e.g. Euclidean distance, Hamming distance)

Nearest neighbor:

  • Given query instance x_q, first locate the nearest training example x_n, then estimate $\hat{f}(x_q) = f(x_n)$

k-Nearest neighbor:

  • Given x_q, take a vote among its k nearest neighbors (if discrete-valued target function)
    – Let k not be divisible by the number of possible labels (to avoid ties)
  • Take the mean of the f values of the k nearest neighbors if f is real-valued:

    $\hat{f}(x_q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i)$
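A minimal Python sketch of both cases (the function and parameter names are illustrative, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=3, discrete=True):
    """Estimate f(x_q) from the k nearest stored training examples."""
    # Euclidean distance from the query to every stored example
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    values = y_train[nearest]
    if discrete:
        # Majority vote for a discrete-valued target function
        return Counter(values.tolist()).most_common(1)[0][0]
    # Mean of the neighbors' f values for a real-valued target
    return float(values.mean())
```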

SLIDE 4

Voronoi Diagram

Decision surface for 1-NN

[Figure: Voronoi diagram of the 1-NN decision surface over training points labeled + and −, with a query point x_q]

SLIDE 5

When To Consider Nearest Neighbor

  • Instances map to points in ℜ^n (or, at least, one can define some distance measure between instances)
  • Fewer than 20 attributes per instance
    – Avoids the curse of dimensionality, where many irrelevant attributes make distances large, even though distance would be small if only the relevant attributes were used
    – Also, a large number of attributes increases classification complexity
  • Lots of training data

Advantages:

  • Robust to noise
  • Stable
  • Training is very fast
  • Learns complex target functions
  • Doesn't lose information

Disadvantages:

  • Slow at query time (active research area: fast indexing and accessing algorithms)
  • Easily fooled by irrelevant attributes

SLIDE 6

Nearest Neighbor's Behavior in the Limit

Let p(x) be the probability that instance x will be labeled 1 (positive) versus 0 (negative)

Nearest neighbor (k = 1):

  • As the number of training examples → ∞, approaches the Gibbs Algorithm
    – Recall Gibbs has at most twice the expected error of Bayes optimal

k-Nearest neighbor:

  • As the number of training examples → ∞ and k grows large, approaches Bayes optimal (the best possible with the given hypothesis space and prior information)

Bayes optimal: if p(x) > 0.5 then predict 1, else 0

SLIDE 7

Distance-Weighted k-NN

Might want to weight nearer neighbors more heavily. For a discrete-valued target function:

  $\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))$

where δ(v, f(x_i)) = 1 if v = f(x_i) and 0 otherwise. For a continuous target function:

  $\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$

where

  $w_i \equiv \frac{1}{d(x_q, x_i)^2}$

and d(x_q, x_i) is the distance between x_q and x_i

Note it now makes sense to use all training examples instead of just k (Shepard's method), but this increases the time to classify each instance
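A minimal sketch of the continuous case (illustrative names, not from the slides; passing k=None weights every training example, as in Shepard's method):

```python
import numpy as np

def dw_knn_regress(X_train, y_train, x_q, k=None):
    """Distance-weighted estimate of a real-valued f at query x_q."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    if np.any(dists == 0):
        # Query coincides with a stored example: return its f value directly
        return float(y_train[dists == 0].mean())
    # k=None -> Shepard's method: use all training examples
    idx = np.argsort(dists)[:k] if k is not None else np.arange(len(dists))
    w = 1.0 / dists[idx] ** 2               # w_i = 1 / d(x_q, x_i)^2
    return float(np.sum(w * y_train[idx]) / np.sum(w))
```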

SLIDE 8

Curse of Dimensionality

Imagine instances described by 20 attributes, of which only 2 are relevant to the target function

Curse of dimensionality: nearest neighbor is easily misled by high-dimensional X

One approach (sketched below):

  • Stretch the jth axis by weight z_j, where z_1, ..., z_n are chosen to minimize prediction error
  • Use cross-validation to automatically choose the weights z_1, ..., z_n
  • Note that setting z_j to zero eliminates that dimension altogether; see [Moore and Lee, 1994]
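As a sketch of the axis-stretching idea (illustrative code, not from the slides): rescale each feature by its weight before measuring distance, so z_j = 0 drops feature j entirely. The weights z would then be tuned by cross-validation on prediction error:

```python
import numpy as np

def stretched_knn_predict(X_train, y_train, x_q, z, k=3):
    """k-NN after stretching axis j by weight z_j (z_j = 0 removes feature j)."""
    dists = np.linalg.norm((X_train - x_q) * z, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(y_train[nearest].mean())    # real-valued target, for simplicity
```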

SLIDE 9

Locally Weighted Regression

Note k-NN forms a local approximation to f for each query point x_q. Why not form an explicit approximation f̂(x) for the region surrounding x_q?

  • Fit a linear, quadratic, etc. function to the k nearest neighbors
  • Produces a "piecewise approximation" to f
  • Do this for each new query point x_q

Several choices of error to minimize:

  • Squared error over the k nearest neighbors:

    $E_1(x_q) \equiv \frac{1}{2} \sum_{x \in k \text{ nearest nbrs of } x_q} \left( f(x) - \hat{f}(x) \right)^2$

  • Distance-weighted squared error over all neighbors (sketched below):

    $E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} \left( f(x) - \hat{f}(x) \right)^2 K(d(x_q, x))$

    (K is decreasing in its argument)

  • Combine E_1 and E_2
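A minimal sketch of minimizing E_2 with a linear f̂ and a Gaussian kernel (the name, the bandwidth tau, and the closed-form weighted least-squares fit are illustrative choices; the slides leave the fitting method open):

```python
import numpy as np

def lwlr_predict(X_train, y_train, x_q, tau=0.5):
    """Fit a local linear model around x_q by kernel-weighted least squares."""
    A = np.hstack([np.ones((len(X_train), 1)), X_train])   # bias column + inputs
    d2 = np.sum((X_train - x_q) ** 2, axis=1)
    K = np.exp(-d2 / (2 * tau ** 2))     # kernel weight, decreasing in distance
    sw = np.sqrt(K)
    # Weighted least squares: scale each equation by sqrt(K_i), then solve
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y_train * sw, rcond=None)
    return float(np.concatenate(([1.0], x_q)) @ beta)
```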

SLIDE 10

Radial Basis Function (RBF) Networks

  • Global approximation to the target function, in terms of a linear combination of local approximations
  • Used, e.g., for image classification
  • A different kind of neural network
  • Closely related to distance-weighted regression, but "eager" instead of "lazy"

SLIDE 11

RBF Networks (cont’d)

[Figure: an RBF network with input units a_1(x), a_2(x), ..., a_n(x), a hidden layer of kernel units, and an output unit computing f̂(x) from weights w_0, w_1, ..., w_k]

Here the a_i(x) are the attributes describing instance x, and

  $\hat{f}(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x))$

(Note there are no weights from the input layer to the hidden layer.) One common choice for K_u(d(x_u, x)) is

  $K_u(d(x_u, x)) = \exp\left( -\frac{1}{2\sigma_u^2} \, d^2(x_u, x) \right)$

i.e. a Gaussian with mean at x_u and variance σ_u^2, all features independent [note bug on p. 239]
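A minimal sketch of this forward pass (illustrative names; assumes Euclidean distance and one width σ_u per kernel):

```python
import numpy as np

def rbf_predict(x, centers, sigmas, w):
    """f_hat(x) = w_0 + sum_u w_u * K_u(d(x_u, x)) with Gaussian kernels."""
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distances d^2(x_u, x)
    K = np.exp(-d2 / (2 * sigmas ** 2))       # Gaussian kernel activations
    return float(w[0] + np.dot(w[1:], K))     # w[0] is the bias weight w_0
```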

SLIDE 12

Training Radial Basis Function Networks

1. Choose the number of kernel functions (hidden units)

   • If equal to the number of training examples, can fit the training data exactly by placing one center per example
   • Using fewer ⇒ more efficient, less chance of overfitting

2. Choose the center (= mean, for a Gaussian) x_u of each kernel function K_u(d(x_u, x))

   • Use all training instances if enough kernels are available
   • Use a subset of the training instances
   • Scatter them uniformly throughout instance space
   • Can cluster the data and assign one center per cluster (helps answer step 1 also)
   • Can use EM to find the means of a mixture of Gaussians
   • Can also use e.g. EM to find the σ_u's (for a Gaussian)

3. Hold the kernels fixed and train the weights to fit a linear function (the output layer), e.g. by GD or EG (see the sketch below)
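A minimal sketch of step 3, assuming the centers and widths were already chosen (e.g. by clustering); it fits the output weights in closed form by least squares, a stand-in for the GD/EG the slide mentions:

```python
import numpy as np

def train_rbf_weights(X, y, centers, sigmas):
    """Fit w_0, ..., w_k with the kernels held fixed."""
    # Squared distance from every example to every center: shape (n, k)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2 * sigmas ** 2))          # one column per kernel K_u
    Phi = np.hstack([np.ones((len(X), 1)), Phi])   # bias column for w_0
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w
```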

SLIDE 13

Case-Based Reasoning and CADET

Can apply instance-based learning even when X is much more complex; need a different "distance" metric

Case-Based Reasoning is instance-based learning where instances have symbolic logic descriptions, e.g.:

  ((user-complaint error53-on-shutdown)
   (cpu-model PowerPC)
   (operating-system Windows)
   (memory 48meg)
   (installed-apps Excel Netscape VirusScan)
   (disk 1gig)
   (likely-cause ???))

CADET: 75 stored examples of mechanical devices, e.g. water faucets

  • Training example: ⟨qualitative function, mechanical structure⟩
  • New query: desired function
  • Target value: mechanical structure for this function

Distance metric: match qualitative function descriptions

SLIDE 14

Case-Based Reasoning in CADET Example

[Figure: a stored case, a T-junction pipe, showing its qualitative function graph over waterflows Q_1, Q_2, Q_3 and temperatures T_1, T_2, T_3 alongside its structure (T = temperature, Q = waterflow), and a problem specification, a water faucet, with inputs Q_c, T_c (cold) and Q_h, T_h (hot), output Q_m, T_m (mixed), and structure unknown]

E.g. distance measure = size of the largest isomorphic subgraph

SLIDE 15

Case-Based Reasoning in CADET (cont’d)

  • Instances represented by rich structural (symbolic) descriptions, vs. e.g. points in ℜ^n for k-NN
  • Multiple cases retrieved (and combined) to form a solution to the new problem: similar to k-NN, except the combination procedure can rely on knowledge-based reasoning (e.g. can two components be fit together?)
  • Tight coupling between case retrieval, knowledge-based reasoning, and problem solving, e.g. application of rewrite rules in function graphs and backtracking in the search space

Bottom line:

  • Simple matching of cases is useful for tasks such as answering help-desk queries
  • Area of ongoing research, including improving indexing and search methods

SLIDE 16

Lazy and Eager Learning

Lazy: wait for the query before generalizing

  • k-NN, locally weighted regression, case-based reasoning

Eager: generalize before seeing the query

  • Radial basis function networks, ID3, Backpropagation, Naive Bayes

Does it matter?

  • Computation time for training and generalization
  • An eager learner must create a single global approximation; a lazy learner can create many local approximations
  • If they use the same H, lazy can represent more complex functions (e.g. consider H = linear functions), since it considers the query instance x_q before generalizing, i.e. lazy produces a new hypothesis for each new x_q
