CSE217 INTRODUCTION TO DATA SCIENCE
LECTURE 8: SIMILARITY-BASED PREDICTION
Spring 2019, Marion Neumann
RECAP: CLUSTERING
- Good clustering
- high similarity within each group
- low similarity across the groups
→ minimize the distance of each data point to its cluster center
→ we learn the grouping from the data based on similarities
- no labels (no supervision)
SIMILARITIES FOR SUPERVISED ML
- oftentimes clusters are used for prediction tasks
- cluster news articles → recommend articles in the same group
SIMILARITIES FOR SUPERVISED ML
- What if we had class labels for the prediction task?
- train a classifier on labelled news articles → recommend articles with a positive predicted label
SIMILARITIES FOR CLASSIFICATION
- New Idea: combine both ideas
- use similarities to predict the class label directly, without computing clusters first
- this is possible since we have observed class labels in our training data (supervised learning)
[Figure: k-NN classification example]
SIMILARITIES FOR REGRESSION
- This also works for regression:
[Figure: k-NN regression example: predict the average price among the 3 nearest neighbors]
K-NEAREST NEIGHBOR MODEL
- Prediction
  - classification: majority vote among the labels of the k nearest neighbors,
    $f(x_t) = \operatorname{mode}\{\, y_i : i \in N_k(x_t) \,\}$
  - regression: average of the target values of the k nearest neighbors,
    $f(x_t) = \frac{1}{k} \sum_{i \in N_k(x_t)} y_i$
  - where $N_k(x_t)$ is the set of k nearest neighbors of $x_t$
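The two prediction rules above can be sketched in Python. This is a minimal illustration, not the lecture's reference code: `knn_predict` is a name of my own, and Euclidean distance is assumed as the similarity measure.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k, task="classification"):
    """Predict the label/value of x_test from its k nearest training points."""
    # Euclidean distances from the test input to every training input
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # indices of the k nearest neighbors N_k(x_t)
    nn = np.argsort(dists)[:k]
    if task == "classification":
        # majority vote (mode) among the neighbors' labels
        return Counter(y_train[nn]).most_common(1)[0][0]
    # regression: average of the neighbors' target values
    return y_train[nn].mean()
```

For example, with three nearby training points labelled 0 and one far-away point labelled 1, a 3-NN classification of a nearby test point returns 0, and 3-NN regression returns the mean of the three nearest targets.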
K-NEAREST NEIGHBOR MODEL
- Algorithm to find NNs
  INPUT: test input x_t, training inputs x_1, ..., x_n, number of neighbors k

  take the first k data points as the initial kNN set:
      N ← {1, ..., k}, with distances d_i = D(x_t, x_i)
      max_id ← argmax_{i ∈ N} d_i
  FOR i = k+1, ..., n:
      d_i ← D(x_t, x_i)
      IF d_i < d_{max_id}:
          replace max_id by i in N
          max_id ← argmax_{j ∈ N} d_j
      END
  END
  RETURN N
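This single-pass neighbor search can be translated directly into Python. A minimal sketch, assuming Euclidean distance; the name `find_knn` is my own.

```python
import numpy as np

def find_knn(X_train, x_test, k):
    """Return indices of the k training points nearest to x_test,
    scanning the data once and keeping the k smallest distances seen."""
    n = len(X_train)
    d = lambda i: np.linalg.norm(X_train[i] - x_test)
    # take the first k data points as the initial kNN set
    nn = list(range(k))
    dists = [d(i) for i in nn]
    max_pos = int(np.argmax(dists))      # position of the current farthest neighbor
    for i in range(k, n):
        di = d(i)
        if di < dists[max_pos]:
            # point i is closer than the farthest current neighbor: swap it in
            nn[max_pos] = i
            dists[max_pos] = di
            max_pos = int(np.argmax(dists))
    return nn
```

Each incoming point is compared only against the worst of the current k neighbors, so the scan is a single pass over the training data.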
K-NN DISCUSSION
- fast training, but slow at test time
- simple, explainable
- lazy learner: no model is trained, the training data is used directly for test prediction
- have to select k
- need to store the entire training data D → huge model size
- the same scheme works for regression, binary, and multi-class classification
HOW TO SET K?
- model selection: keep D_TE held out for (final) evaluation; split a validation set D_VAL off D_TR

  FOR k = 1, ..., k_max:
      FOR (x_i, y_i) in D_VAL:
          ŷ_i ← kNN(D_TR, x_i, k)
      END
      perf(k) ← perf(ŷ, y)
  END
  k* ← argmax_k perf(k)

- select k* using the validation set
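The selection loop above can be sketched in Python. This is a minimal illustration under my own assumptions: `knn_classify` and `select_k` are hypothetical names, Euclidean distance is used, and accuracy serves as the performance measure perf.

```python
import numpy as np
from collections import Counter

def knn_classify(X_tr, y_tr, x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nn = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
    return Counter(y_tr[nn]).most_common(1)[0][0]

def select_k(X_tr, y_tr, X_val, y_val, k_max):
    """Pick k* = argmax_k perf(k), with perf = accuracy on the validation set."""
    perf = {}
    for k in range(1, k_max + 1):
        y_hat = [knn_classify(X_tr, y_tr, x, k) for x in X_val]
        perf[k] = np.mean(np.array(y_hat) == y_val)   # accuracy on D_val
    return max(perf, key=perf.get)                    # k* (first k on ties)
```

Note that D_TE never enters this loop; it stays held out for the final evaluation of the selected model.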
CROSS-VALIDATION (CV)
- Using a fixed validation split D_TR / D_VAL has issues (the result depends on that one split)
- class discussion: solution → perform cross-validation instead

  FOR k = 1, ..., k_max:
      FOR f = 1, ..., num_folds:
          D_VAL ← fold f;  D_TR ← all other folds
          FOR (x_i, y_i) in D_VAL:
              ŷ_i ← kNN(D_TR, x_i, k)
          END
          perf(k, f) ← perf(ŷ, y)
      END
      perf(k) ← avg_f perf(k, f)
  END
  k* ← argmax_k perf(k)

- CV on D_TR (with D_TE held out) can be used for model comparison
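The nested CV loop could look like the following sketch, with manual fold splitting and my own helper names (`knn_classify`, `select_k_cv`); accuracy again stands in for perf.

```python
import numpy as np
from collections import Counter

def knn_classify(X_tr, y_tr, x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nn = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
    return Counter(y_tr[nn]).most_common(1)[0][0]

def select_k_cv(X, y, k_max, num_folds=5):
    """Pick k by average accuracy over num_folds cross-validation folds."""
    folds = np.array_split(np.arange(len(X)), num_folds)
    perf = {}
    for k in range(1, k_max + 1):
        fold_acc = []
        for f in range(num_folds):
            val_idx = folds[f]                     # fold f plays the role of D_VAL
            tr_idx = np.concatenate(               # all other folds form D_TR
                [folds[g] for g in range(num_folds) if g != f])
            y_hat = [knn_classify(X[tr_idx], y[tr_idx], x, k) for x in X[val_idx]]
            fold_acc.append(np.mean(np.array(y_hat) == y[val_idx]))
        perf[k] = np.mean(fold_acc)                # perf(k) = average over folds
    return max(perf, key=perf.get)                 # k* = argmax_k perf(k)
```

Because every point serves as validation data exactly once, the estimate no longer hinges on one particular split.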
SIMILARITY-BASED METHODS
ACTIVITY 2
- k-NN classification or regression
- Clustering/k-means
If a variable is measured at a higher scale than the other variables, then whatever measure we use will be overly influenced by that variable.
DATA TRANSFORMATIONS
- Min-Max scaling: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
- Centering: $x' = x - \bar{x}$
- Standardization: $x' = \frac{x - \bar{x}}{\sigma_x}$
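The three transformations can be written directly with numpy, applied per feature (column); the example matrix is my own.

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 400.0]])

# Min-Max scaling: x' = (x - min) / (max - min), maps each column into [0, 1]
minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Centering: x' = x - mean, gives each column mean 0
centered = X - X.mean(axis=0)

# Standardization: x' = (x - mean) / std, gives each column mean 0 and std 1
standardized = (X - X.mean(axis=0)) / X.std(axis=0)
```

After any of these, the second column (hundreds) no longer dominates the first (single digits) in a distance computation.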
SUMMARY & READING
- K-NN is an extremely simple and versatile model for supervised machine learning.
- K-NN is a lazy learner: we do not learn/train a model, we simply use the data directly for predictions.
- Cross-Validation is a better way to evaluate ML models or select hyperparameters (like k) than a single fixed validation split.
- [DSFS]
  - Ch12: k-NN
  - Ch10: Working with data → Rescaling (p132-133)
- [PDSH]
  - Ch5: Hyperparameters and Model Validation
    - Thinking about Model Validation [cross-validation] (p359-362)