CSE217 INTRODUCTION TO DATA SCIENCE
LECTURE 8: SIMILARITY-BASED PREDICTION
Spring 2019, Marion Neumann
RECAP: CLUSTERING
- Good clustering
- high similarity within each group
- low similarity across the groups
→ minimize the distance of each data point to its cluster center
→ we learn the grouping from the data based on similarities
- no labels (no supervision)
SIMILARITIES FOR SUPERVISED ML
- oftentimes clusters are used for prediction tasks
- cluster news articles → recommend articles in the same group
SIMILARITIES FOR SUPERVISED ML
- What if we had class labels for the prediction task?
- train a classifier on labelled news articles → recommend articles with a positive predicted label
SIMILARITIES FOR CLASSIFICATION
- New Idea: combine both ideas
- use similarities to predict the class label directly, without computing clusters first
- this is possible since we have observed class labels in our training data (supervised learning)
[Figure: k-NN classification example]
SIMILARITIES FOR REGRESSION
- This also works for regression:
[Figure: k-NN regression example: predict the average price among the 3 nearest neighbors]
K-NEAREST NEIGHBOR MODEL
- Prediction
  - classification: majority vote among the labels of the k nearest neighbors,
    $f(x_t) = \operatorname{mode}\{\, y_i : i \in N_k(x_t) \,\}$
  - regression: average of the target values of the k nearest neighbors,
    $f(x_t) = \frac{1}{k} \sum_{i \in N_k(x_t)} y_i$
  - where $N_k(x_t)$ is the set of k nearest neighbors of $x_t$
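The two prediction rules above can be sketched in Python. This is a minimal illustration, not the lecture's reference code: `knn_predict` is a name of my own, and Euclidean distance is assumed as the similarity measure.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k, task="classification"):
    """Predict the label/value of x_test from its k nearest training points."""
    # Euclidean distances from the test input to every training input
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # indices of the k nearest neighbors N_k(x_t)
    nn = np.argsort(dists)[:k]
    if task == "classification":
        # majority vote (mode) among the neighbors' labels
        return Counter(y_train[nn]).most_common(1)[0][0]
    # regression: average of the neighbors' target values
    return y_train[nn].mean()
```

For example, with three nearby training points labelled 0 and one far-away point labelled 1, a 3-NN classification of a nearby test point returns 0, and 3-NN regression returns the mean of the three nearest targets.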
K-NEAREST NEIGHBOR MODEL
- Algorithm to find NNs
  INPUT: test input x_t, training inputs x_1, ..., x_n, number of neighbors k

  take the first k data points as the initial kNN set:
      N ← {1, ..., k}, with distances d_i = D(x_t, x_i)
      max_id ← argmax_{i ∈ N} d_i
  FOR i = k+1, ..., n:
      d_i ← D(x_t, x_i)
      IF d_i < d_{max_id}:
          replace max_id by i in N
          max_id ← argmax_{j ∈ N} d_j
      END
  END
  RETURN N
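This single-pass neighbor search can be translated directly into Python. A minimal sketch, assuming Euclidean distance; the name `find_knn` is my own.

```python
import numpy as np

def find_knn(X_train, x_test, k):
    """Return indices of the k training points nearest to x_test,
    scanning the data once and keeping the k smallest distances seen."""
    n = len(X_train)
    d = lambda i: np.linalg.norm(X_train[i] - x_test)
    # take the first k data points as the initial kNN set
    nn = list(range(k))
    dists = [d(i) for i in nn]
    max_pos = int(np.argmax(dists))      # position of the current farthest neighbor
    for i in range(k, n):
        di = d(i)
        if di < dists[max_pos]:
            # point i is closer than the farthest current neighbor: swap it in
            nn[max_pos] = i
            dists[max_pos] = di
            max_pos = int(np.argmax(dists))
    return nn
```

Each incoming point is compared only against the worst of the current k neighbors, so the scan is a single pass over the training data.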
K-NN DISCUSSION
- fast training, but slow at test time
- simple, explainable
- lazy learner: no model is trained, the training data is used directly for test prediction
- have to select k
- need to store the entire training data D → huge model size
- the same scheme works for regression, binary, and multi-class classification
HOW TO SET K?
- model selection: keep D_TE held out for (final) evaluation; split a validation set D_VAL off D_TR

  FOR k = 1, ..., k_max:
      FOR (x_i, y_i) in D_VAL:
          ŷ_i ← kNN(D_TR, x_i, k)
      END
      perf(k) ← perf(ŷ, y)
  END
  k* ← argmax_k perf(k)

- select k* using the validation set
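The selection loop above can be sketched in Python. This is a minimal illustration under my own assumptions: `knn_classify` and `select_k` are hypothetical names, Euclidean distance is used, and accuracy serves as the performance measure perf.

```python
import numpy as np
from collections import Counter

def knn_classify(X_tr, y_tr, x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nn = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
    return Counter(y_tr[nn]).most_common(1)[0][0]

def select_k(X_tr, y_tr, X_val, y_val, k_max):
    """Pick k* = argmax_k perf(k), with perf = accuracy on the validation set."""
    perf = {}
    for k in range(1, k_max + 1):
        y_hat = [knn_classify(X_tr, y_tr, x, k) for x in X_val]
        perf[k] = np.mean(np.array(y_hat) == y_val)   # accuracy on D_val
    return max(perf, key=perf.get)                    # k* (first k on ties)
```

Note that D_TE never enters this loop; it stays held out for the final evaluation of the selected model.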
CROSS-VALIDATION (CV)
- Using a fixed validation split D_TR / D_VAL has issues (the result depends on that one split)
- class discussion: solution → perform cross-validation instead

  FOR k = 1, ..., k_max:
      FOR f = 1, ..., num_folds:
          D_VAL ← fold f;  D_TR ← all other folds
          FOR (x_i, y_i) in D_VAL:
              ŷ_i ← kNN(D_TR, x_i, k)
          END
          perf(k, f) ← perf(ŷ, y)
      END
      perf(k) ← avg_f perf(k, f)
  END
  k* ← argmax_k perf(k)

- CV on D_TR (with D_TE held out) can be used for model comparison
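The nested CV loop could look like the following sketch, with manual fold splitting and my own helper names (`knn_classify`, `select_k_cv`); accuracy again stands in for perf.

```python
import numpy as np
from collections import Counter

def knn_classify(X_tr, y_tr, x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nn = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
    return Counter(y_tr[nn]).most_common(1)[0][0]

def select_k_cv(X, y, k_max, num_folds=5):
    """Pick k by average accuracy over num_folds cross-validation folds."""
    folds = np.array_split(np.arange(len(X)), num_folds)
    perf = {}
    for k in range(1, k_max + 1):
        fold_acc = []
        for f in range(num_folds):
            val_idx = folds[f]                     # fold f plays the role of D_VAL
            tr_idx = np.concatenate(               # all other folds form D_TR
                [folds[g] for g in range(num_folds) if g != f])
            y_hat = [knn_classify(X[tr_idx], y[tr_idx], x, k) for x in X[val_idx]]
            fold_acc.append(np.mean(np.array(y_hat) == y[val_idx]))
        perf[k] = np.mean(fold_acc)                # perf(k) = average over folds
    return max(perf, key=perf.get)                 # k* = argmax_k perf(k)
```

Because every point serves as validation data exactly once, the estimate no longer hinges on one particular split.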
SIMILARITY-BASED METHODS
ACTIVITY 2
- k-NN classification or regression
- Clustering/k-means
If a variable is measured at a higher scale than the other variables, then whatever measure we use will be overly influenced by that variable.
DATA TRANSFORMATIONS
- Min-Max scaling: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
- Centering: $x' = x - \bar{x}$
- Standardization: $x' = \frac{x - \bar{x}}{\sigma_x}$
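The three transformations can be written directly with numpy, applied per feature (column); the example matrix is my own.

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 400.0]])

# Min-Max scaling: x' = (x - min) / (max - min), maps each column into [0, 1]
minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Centering: x' = x - mean, gives each column mean 0
centered = X - X.mean(axis=0)

# Standardization: x' = (x - mean) / std, gives each column mean 0 and std 1
standardized = (X - X.mean(axis=0)) / X.std(axis=0)
```

After any of these, the second column (hundreds) no longer dominates the first (single digits) in a distance computation.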
SUMMARY & READING
- K-NN is an extremely simple and versatile model for supervised machine learning.
- K-NN is a lazy learner: we do not learn/train a model, we simply use the data directly for predictions.
- Cross-Validation is a better way to evaluate ML models or select hyperparameters (like k) than a single fixed validation split.
- [DSFS]
  - Ch12: k-NN
  - Ch10: Working with data → Rescaling (p132-133)
- [PDSH]
  - Ch5: Hyperparameters and Model Validation
    - Thinking about Model Validation [cross-validation] (p359-362)