
CSE217 INTRODUCTION TO DATA SCIENCE, LECTURE 8: SIMILARITY-BASED PREDICTION - PowerPoint PPT Presentation



  1. CSE217 INTRODUCTION TO DATA SCIENCE LECTURE 8: SIMILARITY-BASED PREDICTION Spring 2019 Marion Neumann

  2. RECAP: CLUSTERING • Good clustering • high similarity within each group • low similarity across the groups → minimize the distance of each data point to its cluster center → we learn the grouping from the data based on similarities • no labels (no supervision)

  3. SIMILARITIES FOR SUPERVISED ML • oftentimes clusters are used for prediction tasks • cluster news articles → recommend articles in the same group

  4. SIMILARITIES FOR SUPERVISED ML • What if we had class labels for the prediction task? • train a classifier on labelled news articles → recommend articles with a positive predicted label

  5. SIMILARITIES FOR CLASSIFICATION • New idea: combine both ideas • use similarities to predict the class label directly, without computing clusters first • possible since we have observed class labels in our training data (supervised learning) [sketch: k-NN classification]
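
A minimal sketch of this idea in code, assuming scikit-learn is available; the tiny 2-D dataset and labels below are made up purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy labelled training data (made up): 2-D inputs with binary class labels.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Predict the label of a new point directly from its nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[1.1, 0.9]]))  # -> [0], the majority label among its 3 NNs
```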

  6. SIMILARITIES FOR REGRESSION • This also works for regression: predict the average price among the 3 nearest neighbors [sketch: k-NN regression]
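
A similar sketch for regression, again with made-up data (house size in square feet mapped to a sale price); the prediction is simply the average price among the 3 nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data: house size in square feet -> sale price.
X_train = np.array([[800], [950], [1000], [1500], [1600], [2000]])
y_prices = np.array([150_000, 170_000, 180_000, 260_000, 275_000, 340_000])

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_train, y_prices)

# Prediction = average price among the 3 nearest neighbors (800, 950, 1000 sq ft).
print(knn.predict([[900]]))  # -> [166666.67] approximately
```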

  7. K-NEAREST NEIGHBOR MODEL • Prediction • classification: $f(x) = \operatorname{argmax}_{c \in C} \sum_{i \in N_k(x)} \mathbf{1}[y_i = c]$ (majority vote among the neighbors' labels) • regression: $f(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i$ • where $N_k(x)$ is the set of the $k$ nearest neighbors of $x$ in the training data
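
A from-scratch sketch of these two prediction rules; the function and variable names are my own, not from the slide, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k, task="classification"):
    """Predict for a single test point x using its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training input
    nn_idx = np.argsort(dists)[:k]                # N_k(x): indices of the k nearest neighbors
    nn_labels = y_train[nn_idx]
    if task == "classification":
        return Counter(nn_labels).most_common(1)[0][0]  # majority vote among neighbor labels
    return nn_labels.mean()                             # regression: average of neighbor targets
```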

  8. K-NEAREST NEIGHBOR MODEL • Algorithm to find the NNs of a test input x among the training inputs x_1, ..., x_n • INPUT: test input x, training inputs x_1, ..., x_n, number of neighbors k • take the first k data points as the initial k-NN set (store the k indices only), compute their distances d_i = d(x, x_i), and keep max_d = max_i d_i and max_id = argmax_i d_i • FOR i = k+1, ..., n: compute d_i = d(x, x_i); IF d_i < max_d: replace neighbor max_id by i and update max_d, max_id • END FOR • RETURN the k stored indices
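
One possible Python translation of this single-pass neighbor search (my variable names, assuming Euclidean distance):

```python
import numpy as np

def find_knn_indices(x, X_train, k):
    """Return the indices of the k training points nearest to x,
    scanning the data once and replacing the current farthest neighbor."""
    # take the first k data points as the initial k-NN set (indices only)
    nn_idx = list(range(k))
    dists = [np.linalg.norm(X_train[i] - x) for i in range(k)]
    max_pos = int(np.argmax(dists))          # position of the farthest current neighbor
    max_d = dists[max_pos]

    for i in range(k, len(X_train)):
        d_i = np.linalg.norm(X_train[i] - x)
        if d_i < max_d:                      # point i is closer than the farthest neighbor
            nn_idx[max_pos] = i              # replace that neighbor with i
            dists[max_pos] = d_i
            max_pos = int(np.argmax(dists))  # recompute the farthest neighbor
            max_d = dists[max_pos]
    return nn_idx
```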

  9. K-NN DISCUSSION • Pros: no training (lazy learner), simple, explainable, the same approach works for regression and for (multi-class) classification • Cons: slow at test time, have to select k, need to store the entire training data D for test prediction (huge model size)

  10. HOW TO SET K? • model selection: keep D_TE for evaluation and split the rest into D_TR and D_VAL (use a validation set) • FOR k = 1, ..., k_max: • FOR x_i in X_val: predict ŷ_i with k-NN using D_TR • END • perf(k) = performance of the predictions ŷ on the validation labels • END • select k* = argmax_k perf(k), i.e. the k with the best performance on the validation set
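
A sketch of this selection loop, assuming a classification task with accuracy as the performance measure and a scikit-learn train/validation split; X, y stand for the labelled data left after holding out the test set.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def select_k_with_validation_set(X, y, k_max=20):
    # split the remaining data into D_TR and D_VAL
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
    perf = {}
    for k in range(1, k_max + 1):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        perf[k] = knn.score(X_val, y_val)   # accuracy on the validation set
    return max(perf, key=perf.get)          # k* = argmax_k perf(k)
```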

  11. CROSS-VALIDATION (CV) • in-class discussion: using a fixed validation split D_TR / D_VAL has issues → solution: perform cross-validation instead • FOR k = 1, ..., k_max: • FOR f = 1, ..., num_folds: hold out fold f as the validation set, run k-NN on the remaining data, predict ŷ for the points in the held-out fold, and compute perf(k, f) • END • perf(k) = average of perf(k, f) over the folds • END • select k* = argmax_k perf(k) • CV on D_TR / D_TE can also be used for model comparison
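
A corresponding sketch using scikit-learn's cross_val_score to average performance over the folds (again assuming classification accuracy as the measure); X_tr, y_tr stand for the training data with the test set held out.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def select_k_with_cv(X_tr, y_tr, k_max=20, num_folds=5):
    perf = {}
    for k in range(1, k_max + 1):
        knn = KNeighborsClassifier(n_neighbors=k)
        # fit/evaluate on each of the num_folds splits, then average the fold scores
        fold_scores = cross_val_score(knn, X_tr, y_tr, cv=num_folds)
        perf[k] = fold_scores.mean()
    return max(perf, key=perf.get)   # k* = argmax_k perf(k)
```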

  12. SIMILARITY-BASED METHODS (ACTIVITY 2) • k-NN classification or regression • Clustering / k-means • If a variable is measured on a much larger scale than the other variables, then whatever distance measure we use will be overly influenced by that variable.
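
A tiny numeric illustration of this point, with made-up age/income values:

```python
import numpy as np

# Made-up feature vectors: [age in years, income in dollars].
# Income is on a much larger scale, so it dominates the Euclidean distance.
a = np.array([25, 50_000])
b = np.array([60, 51_000])   # very different age, similar income
c = np.array([26, 80_000])   # similar age, very different income

print(np.linalg.norm(a - b))  # ~1000.6  -> b looks "close" to a
print(np.linalg.norm(a - c))  # ~30000.0 -> c looks far, driven almost entirely by income
```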

  13. DATA TRANSFORMATIONS • Min-Max scaling • Centering • Standardization [plots: min-max scaled vs. standardized data]
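
A short sketch of the three transformations applied column-wise to a made-up data matrix:

```python
import numpy as np

# Toy data (made up): each row is a data point, each column a feature.
X = np.array([[25, 50_000],
              [60, 51_000],
              [26, 80_000]], dtype=float)

x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)        # min-max scaling: each feature in [0, 1]

X_centered = X - X.mean(axis=0)                 # centering: each feature has mean 0

X_standardized = X_centered / X.std(axis=0)     # standardization: mean 0, std 1 per feature
```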

  14. SUMMARY & READING • k-NN is an extremely simple and versatile model for supervised machine learning. • k-NN is a lazy learner: we do not learn/train a model, we simply use the data directly for predictions. • Cross-validation is a better way to evaluate ML models or to perform model selection. • [DSFS] • Ch12: k-NN • Ch10: Working with data → Rescaling (p. 132-133) • [PDSH] • Ch5: Hyperparameters and Model Validation • Thinking about Model Validation [cross-validation] (p. 359-362)
