

SLIDE 1

Instance-based Learning

Hamid R. Rabiee

Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1

SLIDE 2

Agenda

- Instance-based learning
- K-Nearest Neighbor
- Locally-weighted Regression
- Radial Basis Function Networks
- Case-based Reasoning

SLIDE 3

Instance-based Learning

- Key idea: in contrast to learning methods that construct a general, explicit description of the target function when training examples are provided, instance-based learning constructs the target function only when a new instance must be classified.
- Training consists only of storing all training examples ⟨y_j, g(y_j)⟩, where y_j describes the attributes of each instance and g(y_j) denotes its class (or value).
- Examples: K-Nearest Neighbor

SLIDE 4

K-Nearest Neighbor

- Simple 2-D case: each instance is described by only two values (x, y coordinates).
- Given a query instance x_q, take a vote among its k nearest neighbors to decide its class (return the most common value of f among the k nearest training elements to x_q).
- Need to consider:
  1. Similarity (how to calculate distance)
  2. Number (and weight) of similar (near) instances

SLIDE 5

Similarity

- Euclidean distance: more precisely, let an arbitrary instance x be described by the feature vector (set of attributes)

  $\langle a_1(x), a_2(x), \ldots, a_n(x) \rangle$

  where $a_r(x)$ denotes the value of the r-th attribute of instance x. Then the distance between two instances $x_i$ and $x_j$ is defined to be $d(x_i, x_j)$, where

  $d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \bigl(a_r(x_i) - a_r(x_j)\bigr)^2}$
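To make the distance and the voting rule concrete, here is a minimal k-NN classification sketch (illustrative Python/NumPy, not from the original slides; the function and variable names are ours):

```python
import numpy as np
from collections import Counter

def euclidean(xi, xj):
    # d(x_i, x_j) = sqrt( sum_r (a_r(x_i) - a_r(x_j))^2 )
    return np.sqrt(np.sum((xi - xj) ** 2))

def knn_classify(X_train, y_train, x_query, k=3):
    # distance from the query instance to every stored training instance
    dists = np.array([euclidean(x, x_query) for x in X_train])
    # indices of the k nearest training instances
    nearest = np.argsort(dists)[:k]
    # return the most common class among those k neighbours
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```

Training amounts to storing (X_train, y_train); all of the work happens at query time.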

SLIDE 6

Training data

Number  Lines  Line types  Rectangles  Colors  Mondrian?
1       6      1           10          4       No
2       4      2            8          5       No
3       5      2            7          4       Yes
4       5      1            8          4       Yes
5       5      1           10          5       No
6       6      1            8          6       Yes
7       7      1           14          5       No

Test instance:

Number  Lines  Line types  Rectangles  Colors  Mondrian?
8       7      2            9          4       ?

SLIDE 7

Keep data in normalised form

- One way to normalize the data $a_r(x)$ to $a'_r(x)$ is

  $x'_t = \dfrac{x_t - \bar{x}_t}{\sigma_t}$

  where $\bar{x}_t$ is the mean of the t-th attribute and $\sigma_t$ is the standard deviation of the t-th attribute.
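A short sketch of this normalization (illustrative Python; computing the mean and the population standard deviation from the seven training rows only is what reproduces the normalized table on the next slide):

```python
import numpy as np

def normalize(X_train, x_query):
    # x'_t = (x_t - mean_t) / std_t, computed attribute-wise on the training data
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)          # population standard deviation
    return (X_train - mean) / std, (x_query - mean) / std
```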

SLIDE 8

Normalized training data

Number   Lines   Line types  Rectangles  Colors  Mondrian?
1         0.632  -0.632       0.327      -1.021  No
2        -1.581   1.581      -0.588       0.408  No
3        -0.474   1.581      -1.046      -1.021  Yes
4        -0.474  -0.632      -0.588      -1.021  Yes
5        -0.474  -0.632       0.327       0.408  No
6         0.632  -0.632      -0.588       1.837  Yes
7         1.739  -0.632       2.157       0.408  No

Test instance:

Number   Lines   Line types  Rectangles  Colors  Mondrian?
8         1.739   1.581      -0.131      -1.021  ?

SLIDE 9

Distances of test instance from training data

Example  Distance of test from example  Mondrian?
1        2.517                          No
2        3.644                          No
3        2.395                          Yes
4        3.164                          Yes
5        3.472                          No
6        3.808                          Yes
7        3.490                          No
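As a check, the smallest distance in the table (example 3, the 1-nearest neighbor) can be reproduced from the normalized values of the test instance and example 3:

$d(x_8, x_3) = \sqrt{(1.739 + 0.474)^2 + (1.581 - 1.581)^2 + (-0.131 + 1.046)^2 + (-1.021 + 1.021)^2} = \sqrt{2.213^2 + 0.915^2} \approx 2.395$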

Classification:

1-NN  Yes
3-NN  Yes
5-NN  No
7-NN  No

SLIDE 10

Lazy vs Eager Learning

- The k-NN method does not form an explicit hypothesis regarding the target classification function. It simply computes the classification for each new query instance as needed.
- Implied hypothesis: the following diagram (Voronoi diagram) shows the shape of the implied hypothesis about the decision surface that can be derived for a simple 1-NN case.
- The decision surface is a combination of convex polyhedra surrounding each of the training examples. For every training example, the polyhedron indicates the set of query points whose classification will be completely determined by that training example.

SLIDE 11

Continuous vs Discrete-valued Functions (Classes)

- k-NN works well for discrete-valued target functions.
- Furthermore, the idea can be extended to continuous (real) valued functions. In this case we take the mean of the f values of the k nearest neighbors:

  $\hat{f}(x_q) = \dfrac{1}{k} \sum_{i=1}^{k} f(x_i)$
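A minimal sketch of this regression variant (illustrative Python; NumPy arrays are assumed and the function name is ours):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    # f_hat(x_q) = mean of f over the k nearest training instances
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()
```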

SLIDE 12

When to Consider Nearest Neighbor?

- Instances map to points in R^n
- A moderate number of attributes (e.g., fewer than 20 attributes per instance)
- Lots of training data
- When the target function is complex but can be approximated by separate, local, simple approximations

Advantages:
- Training is very fast
- Can learn complex target functions

Disadvantages:
- Slow at query time
- Easily fooled by irrelevant attributes

SLIDE 13

Collaborative Filtering (AKA Recommender Systems)

- Problem:
  - Predict whether someone will like a webpage, posting, movie, book, etc.
- Previous approach:
  - Look at the content
- Collaborative filtering:
  - Look at what similar users liked
  - Similar users = similar likes and dislikes

SLIDE 14

Collaborative Filtering

- Represent each user by a vector of ratings
- Two types:
  - Yes/No
  - Explicit ratings (e.g., 0 - ***)
- Predict rating
- Similarity (Pearson coefficient)

SLIDE 15

Collaborative Filtering

- Primitive version (see the sketch below):
  - N_i can be the whole database, or only the k nearest neighbors
  - R_jk: rating of user j on item k
  - R̄_j: average of all of user j's ratings
  - The summation in the Pearson coefficient is over all items rated by both users
- In principle, any prediction method can be used for collaborative filtering
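One common neighborhood-based prediction rule consistent with the bullets above predicts user i's rating of item k as user i's average rating plus a Pearson-weighted average of the neighbors' deviations from their own averages. A rough sketch (illustrative Python, not taken from the slides; NaN marks an unrated item, and the function names are ours):

```python
import numpy as np

def pearson(u, v):
    # similarity computed only over items rated by both users
    both = ~np.isnan(u) & ~np.isnan(v)
    if both.sum() < 2:
        return 0.0
    du, dv = u[both] - u[both].mean(), v[both] - v[both].mean()
    denom = np.sqrt((du ** 2).sum() * (dv ** 2).sum())
    return 0.0 if denom == 0 else float((du * dv).sum() / denom)

def predict_rating(R, i, item, k=5):
    # N_i: the k users most similar to user i who have rated the item
    sims = [(pearson(R[i], R[j]), j) for j in range(len(R))
            if j != i and not np.isnan(R[j, item])]
    top = sorted(sims, reverse=True)[:k]
    num = sum(s * (R[j, item] - np.nanmean(R[j])) for s, j in top)
    den = sum(abs(s) for s, _ in top)
    base = np.nanmean(R[i])                  # user i's average rating
    return base if den == 0 else base + num / den
```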

SLIDE 16

Example (Collaborative Filtering)

SLIDE 17

Distance-weighted k-NN

- Might want to weigh nearer neighbors more heavily:

  $\hat{f}(x_q) = \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i)) \qquad \text{where } w_i \equiv \dfrac{1}{d(x_q, x_i)^2}$

  and $d(x_q, x_i)$ is the distance between $x_q$ and $x_i$.

- For continuous functions:

  $\hat{f}(x_q) = \dfrac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i} \qquad \text{where } w_i \equiv \dfrac{1}{d(x_q, x_i)^2}$
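A sketch covering both weighted rules above (illustrative Python, not from the slides; an exact match, d = 0, is handled by returning that training value directly):

```python
import numpy as np
from collections import defaultdict

def weighted_knn(X_train, y_train, x_query, k=3, classify=True):
    d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    if d[nearest[0]] == 0:                    # exact match: use its value directly
        return y_train[nearest[0]]
    w = 1.0 / d[nearest] ** 2                 # w_i = 1 / d(x_q, x_i)^2
    if classify:
        votes = defaultdict(float)
        for wi, yi in zip(w, y_train[nearest]):
            votes[yi] += wi                   # distance-weighted vote
        return max(votes, key=votes.get)
    return (w * y_train[nearest]).sum() / w.sum()   # distance-weighted mean
```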

SLIDE 18

Curse of Dimensionality

- Imagine instances described by 20 attributes, of which only 2 are relevant to the target function: instances that have identical values for the two relevant attributes may nevertheless be distant from one another in the 20-dimensional space.
- Curse of dimensionality: nearest neighbor is easily misled when X is high-dimensional (compare to decision trees).
- One approach: weight each attribute differently (using the training data); see the sketch below.
  1. Stretch the j-th axis by weight z_j, where z_1, ..., z_n are chosen to minimize prediction error
  2. Use cross-validation to automatically choose the weights z_1, ..., z_n
  3. Note that setting z_j to zero eliminates dimension j altogether
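A minimal sketch of scoring one candidate weight vector z by leave-one-out 1-NN error (illustrative Python; in practice z would be searched over, e.g. by cross-validation):

```python
import numpy as np

def loo_error_1nn(X, y, z):
    # stretch the j-th axis by z_j; z_j = 0 eliminates dimension j altogether
    Xz = X * np.asarray(z)
    mistakes = 0
    for i in range(len(Xz)):
        d = np.sqrt(((Xz - Xz[i]) ** 2).sum(axis=1))
        d[i] = np.inf                    # leave the query instance out
        mistakes += y[d.argmin()] != y[i]
    return mistakes / len(Xz)
```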

SLIDE 19

Locally-weighted Regression

- Basic idea:
  - k-NN forms a local approximation to f for each query point x_q
  - Why not form an explicit approximation f(x) for the region surrounding x_q?
    - Fit a linear function to the k nearest neighbors (see the sketch below)
    - Fit a quadratic, ...
    - Thus producing a "piecewise approximation" to f
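A minimal sketch of the linear version: fit an ordinary least-squares line to the k nearest neighbors of the query and evaluate it at the query point (illustrative Python; the names are ours):

```python
import numpy as np

def lwr_predict(X_train, y_train, x_query, k=5):
    # find the k training instances nearest the query point
    d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    # fit a linear function (with intercept) to just those neighbours
    A = np.c_[np.ones(k), X_train[nearest]]
    coef, *_ = np.linalg.lstsq(A, y_train[nearest], rcond=None)
    # evaluate the local linear fit at the query point
    return float(np.r_[1.0, np.atleast_1d(x_query)] @ coef)
```

Repeating this for every query point yields the piecewise approximation illustrated on the next slide.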

SLIDE 20

[Figure: training data with predicted values from simple regression (f1) and from locally-weighted (piecewise) regression (f2, f3, f4)]

SLIDE 21

- Several choices of error to minimize, e.g.:
  - Squared error over the k nearest neighbors
  - Distance-weighted squared error over all neighbors
  - ...

[Figure: simple regression f1 and locally-weighted regressions f2, f3, f4, as on the previous slide]

SLIDE 22

Radial Basis Function Networks

- 'Global' approximation to the target function, in terms of a linear combination of 'local' approximations
- Used, e.g., for image classification
- A different kind of neural network
- Closely related to distance-weighted regression, but 'eager' instead of 'lazy'

SLIDE 23

Radial Basis Function Networks

- The learned function is a weighted sum of kernel functions K_u (see below), where a_i(x) are the attributes describing instance x.
- One common choice for K_u(d(x_u, x)) is a Gaussian (see below).
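A standard form of this approximation, consistent with the description above and with the Gaussian choice discussed on the next slide (stated here as a reconstruction, not verbatim from the slide), is

$\hat{f}(x) = w_0 + \sum_{u=1}^{k} w_u \, K_u\bigl(d(x_u, x)\bigr), \qquad K_u\bigl(d(x_u, x)\bigr) = e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}$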

SLIDE 24

Training Radial Basis Function Networks

- Question 1: What x_u should be used for each kernel function K_u(d(x_u, x))?
  - Scatter them uniformly throughout the instance space
  - Or use the training instances (reflects the instance distribution)
- Question 2: How to train the weights (assuming Gaussian K_u)?
  - First choose the variance (and perhaps the mean) for each K_u, e.g. using EM
  - Then hold the K_u fixed and train the linear output layer; efficient methods exist to fit a linear function (see the sketch below)
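A minimal sketch of that second step, with the Gaussian kernels held fixed and the output weights fitted by linear least squares (illustrative Python; the centres and sigma are assumed to have been chosen already, e.g. as training instances):

```python
import numpy as np

def gaussian_features(X, centers, sigma):
    # K_u(d(x_u, x)) = exp(-d^2(x_u, x) / (2 * sigma_u^2)) for every instance/centre pair
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * np.asarray(sigma) ** 2))

def train_output_weights(X_train, y_train, centers, sigma):
    # with the kernels fixed, the weights w_0 ... w_k solve a linear least-squares problem
    Phi = np.c_[np.ones(len(X_train)), gaussian_features(X_train, centers, sigma)]
    w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    return w

def rbf_predict(X, centers, sigma, w):
    Phi = np.c_[np.ones(len(X)), gaussian_features(X, centers, sigma)]
    return Phi @ w
```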

SLIDE 25

Case-based Reasoning

- Instance-based learning can be applied even when X ≠ R^n.
- However, in this case we need different "distance" metrics.
- For example, case-based reasoning is instance-based learning applied to instances with symbolic logic descriptions:

  ((user-complaint error53-on-shutdown)
   (cpu-model PowerPC)
   (operating-system Windows)
   (network-connection PCIA)
   (memory 48meg)
   (installed-applications Excel Netscape VirusScan)
   (disk 1gig)
   (likely-cause ???))

SLIDE 26

Case-based Reasoning (CBR)

- Objects may include complex structural descriptions of cases and adaptation rules.
- CBR cannot use Euclidean distance measures; distance measures must instead be defined for those complex objects (e.g. semantic nets).
- CBR tries to model human problem-solving:
  - uses past experience (cases) to solve new problems
  - retains solutions to new problems
- CBR is an ongoing area of machine learning research with many applications.

SLIDE 27

CBR Example

Case  Location code  Bedrooms  Recep rooms  Type      Floors  Condition  Price (£)
1     8              2         1            terraced  1       poor       20,500
2     8              2         2            terraced  1       fair       25,000
3     5              1         2            semi      2       good       48,000
4     5              1         2            terraced  2       good       41,000

Test instance:

Case  Location code  Bedrooms  Recep rooms  Type      Floors  Condition  Price (£)
5     7              2         2            semi      1       poor       ???

SLIDE 28

How rules are generated

- There is no unique way of doing it. Here is one possibility:
- Examine cases and look for ones that are almost identical:
  - Case 1 and case 2 give R1: if recep-rooms changes from 2 to 1, then reduce price by £5,000
  - Case 3 and case 4 give R2: if type changes from semi to terraced, then reduce price by £7,000
- CBR challenges:
  - How should cases be represented?
  - How should cases be indexed for fast retrieval?
  - How can good adaptation heuristics be developed?
  - When should old cases be removed?

SLIDE 29

Sharif University of Technology, Computer Engineering Department, Machine Learning Course


Conclusions

- Instance-based learning
- K-Nearest Neighbor: a simple algorithm
- Locally-weighted Regression
- Radial Basis Function Networks
- Case-based Reasoning: for instances where X ≠ R^n

SLIDE 30

Any Questions?

End of Lecture 15. Thank you!

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1