CSE217 INTRODUCTION TO DATA SCIENCE LECTURE 8: SIMILARITY-BASED PREDICTION



SLIDE 1

CSE217 INTRODUCTION TO DATA SCIENCE

Spring 2019 Marion Neumann

LECTURE 8: SIMILARITY-BASED PREDICTION

SLIDE 2

RECAP: CLUSTERING

  • Good clustering:
    • high similarity within each group
    • low similarity across the groups
  → minimize the distance of each data point to its cluster center
  → we learn the grouping from the data based on similarities
  • no labels (no supervision)

SLIDE 3

SIMILARITIES FOR SUPERVISED ML

  • oftentimes clusters are used for prediction tasks
  • cluster news articles → recommend articles in the same group

SLIDE 4

SIMILARITIES FOR SUPERVISED ML

  • What if we had class labels for the prediction task?
  • train a classifier on labelled news articles → recommend articles with a positive predicted label

SLIDE 5

SIMILARITIES FOR CLASSIFICATION

  • New idea: combine both ideas
  • use similarities to predict the class label directly, without computing clusters first
  • possible since we have observed class labels in our training data (supervised learning)

  (Sketch: k-NN classification)

SLIDE 6

SIMILARITIES FOR REGRESSION

  • This also works for regression:

  (Sketch: k-NN regression: predict the average price among the 3 nearest neighbors)

SLIDE 7

K-NEAREST NEIGHBOR MODEL

  • Prediction:

  Classification: f(x) = majority label among { y_i : x_i ∈ N_k(x) }

  Regression: f(x) = (1/k) · Σ_{x_i ∈ N_k(x)} y_i

  where N_k(x) is the set of k nearest neighbors of x.
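The slide's two prediction rules (majority vote for classification, neighbor average for regression) can be sketched in Python. This is a minimal illustration, not the course's reference code; the function and parameter names are my own:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k, task="classification"):
    """Predict for a single query point x from its k nearest neighbors."""
    # Euclidean distance from x to every training input
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices of the k closest training points: the set N_k(x)
    nn_idx = np.argsort(dists)[:k]
    nn_labels = [y_train[i] for i in nn_idx]
    if task == "classification":
        # majority label among the neighbors
        return Counter(nn_labels).most_common(1)[0][0]
    # regression: average of the neighbors' targets
    return sum(nn_labels) / k
```

For large training sets one would use `np.argpartition` or a spatial index instead of a full sort, but the full sort keeps the sketch close to the formulas.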

SLIDE 8

K-NEAREST NEIGHBOR MODEL

  • Algorithm to find the k nearest neighbors of a test input x among training inputs x_1, …, x_n:

  INPUT: test input x, training inputs x_1, …, x_n, k
  take the first k data points as the initial k-NN set; compute d_i = d(x, x_i) for each of them
  max.d = max_i d_i ;  max.id = argmax_i d_i
  FOR i = k+1, …, n:
      d_i = d(x, x_i)
      IF d_i < max.d:
          replace neighbor max.id with x_i
          recompute max.d and max.id over the current k-NN set
      END
  END
  RETURN the k-NN set
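The board's scan-and-replace procedure could be translated to Python as follows (my own rendering of the pseudocode; variable names are assumptions):

```python
import math

def find_knn(train, x, k):
    """Return the indices of the k training points nearest to x.

    Keeps a running set of k candidates and replaces the farthest
    candidate whenever a closer point is found.
    """
    dist = lambda a, b: math.dist(a, b)  # Euclidean distance (Python 3.8+)
    # take the first k data points as the initial k-NN set
    nn = list(range(k))
    d = [dist(x, train[i]) for i in nn]
    # position (within nn) of the farthest current candidate
    max_id = max(range(k), key=lambda j: d[j])
    for i in range(k, len(train)):
        di = dist(x, train[i])
        if di < d[max_id]:
            # replace the current farthest neighbor with point i
            nn[max_id] = i
            d[max_id] = di
            max_id = max(range(k), key=lambda j: d[j])
    return nn
```

This runs in O(n·k) time; a max-heap over the candidate distances would bring the replacement step down to O(log k).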

SLIDE 9

K-NN DISCUSSION

  • Pros:
    • quick "training" (lazy learner: nothing is actually trained)
    • simple and explainable
    • same method for regression and for (multi-class) classification
  • Cons:
    • slow at test time
    • have to select k
    • need to store the entire training data D for test predictions → huge model size

SLIDE 10

HOW TO SET K?

  • Model selection: keep D_TE for evaluation; split the remaining data into D_TR and D_VAL
  • use the validation set to select k:

  FOR k = 1, …, k_max:
      FOR x_i in X_val:
          ŷ_i = k-NN(D_TR, x_i, k)
      END
      perf(k) = avg. perf(ŷ, y_val)
  END
  k* = argmax_k perf(k)
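The slide's validation-set selection loop might look like this in Python (a minimal sketch; the toy classifier and all names are my own, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_classify(X_tr, y_tr, x, k):
    # majority label among the k nearest training points
    nn = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
    return Counter(y_tr[i] for i in nn).most_common(1)[0][0]

def select_k(X_tr, y_tr, X_val, y_val, k_max):
    """Pick the k that maximizes validation accuracy."""
    perf = {}
    for k in range(1, k_max + 1):
        y_hat = [knn_classify(X_tr, y_tr, x, k) for x in X_val]
        perf[k] = np.mean([yh == yv for yh, yv in zip(y_hat, y_val)])
    return max(perf, key=perf.get)  # k* = argmax_k perf(k)
```

Note that `max` over the dict breaks ties in favor of the smallest k, since keys are inserted in increasing order.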
SLIDE 11

CROSS-VALIDATION (CV)

  • Using a fixed validation split D_TR / D_VAL has issues (class discussion!)
  • Solution: perform cross-validation instead:

  FOR k = 1, …, k_max:
      FOR f = 1, …, num_folds:
          split the data into D_TR^f and D_VAL^f
          FOR x_i in X_val^f:
              ŷ_i = k-NN(D_TR^f, x_i, k)
          END
          perf(k, f) = avg. perf(ŷ, y_val^f)
      END
      perf(k) = avg_f perf(k, f)
  END
  k* = argmax_k perf(k)

  • CV on D_TR (with D_TE held out) can also be used for model comparison.
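The slide's cross-validation loop can be sketched with hand-rolled folds rather than a library call (a minimal illustration under my own naming; real code would shuffle before splitting and might use scikit-learn's `KFold`):

```python
import numpy as np
from collections import Counter

def knn_classify(X_tr, y_tr, x, k):
    # majority label among the k nearest training points
    nn = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
    return Counter(y_tr[i] for i in nn).most_common(1)[0][0]

def cv_select_k(X, y, k_max, num_folds):
    """Choose k by average accuracy across num_folds folds."""
    folds = np.array_split(np.arange(len(X)), num_folds)
    perf = {}
    for k in range(1, k_max + 1):
        fold_perf = []
        for f in range(num_folds):
            val_idx = folds[f]          # this fold is D_VAL^f
            tr_idx = np.concatenate(    # the rest is D_TR^f
                [folds[g] for g in range(num_folds) if g != f])
            y_hat = [knn_classify(X[tr_idx], [y[i] for i in tr_idx], X[i], k)
                     for i in val_idx]
            fold_perf.append(np.mean([yh == y[i]
                                      for yh, i in zip(y_hat, val_idx)]))
        perf[k] = np.mean(fold_perf)    # perf(k) = avg_f perf(k, f)
    return max(perf, key=perf.get)      # k* = argmax_k perf(k)
```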

SLIDE 12

SIMILARITY-BASED METHODS

ACTIVITY 2

  • k-NN classification or regression
  • Clustering / k-means

  • If a variable is measured at a higher scale than the other variables, then whatever distance measure we use will be overly influenced by that variable.
SLIDE 13

DATA TRANSFORMATIONS

  • Min-Max scaling
  • Centering
  • Standardization

  (Figure: the same data shown standardized and min-max scaled)
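The three transformations can be sketched in NumPy. These are the standard formulas, not copied from the slide:

```python
import numpy as np

def min_max_scale(x):
    # rescale to [0, 1]: (x - min) / (max - min)
    return (x - x.min()) / (x.max() - x.min())

def center(x):
    # subtract the mean so the result has mean 0
    return x - x.mean()

def standardize(x):
    # mean 0, standard deviation 1: (x - mean) / std
    return (x - x.mean()) / x.std()
```

Applied per feature (e.g. with `axis=0` statistics on a data matrix), these put all variables on a comparable scale before computing distances.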

SLIDE 14

SUMMARY & READING

  • k-NN is an extremely simple and versatile model for supervised machine learning.
  • k-NN is a lazy learner: we do not learn/train a model, we simply use the data directly for predictions.
  • Cross-validation is a better way to evaluate ML models or to perform model selection.

Reading:
  • [DSFS] Ch12: k-NN; Ch10: Working with Data → Rescaling (pp. 132-133)
  • [PDSH] Ch5: Hyperparameters and Model Validation → Thinking about Model Validation [cross-validation] (pp. 359-362)