CS145: INTRODUCTION TO DATA MINING
7: Vector Data: K Nearest Neighbor
Instructor: Yizhou Sun
yzsun@cs.ucla.edu
October 22, 2017
Methods to Learn: Last Lecture
- Classification: Logistic Regression, Decision Tree, KNN, SVM, NN (vector data); Naïve Bayes for Text (text data)
- Clustering: K-means, hierarchical clustering, DBSCAN, Mixture Models (vector data); PLSA (text data)
- Prediction: Linear Regression, GLM* (vector data)
- Frequent Pattern Mining: Apriori, FP-growth (set data); GSP, PrefixSpan (sequence data)
- Similarity Search: DTW (sequence data)
K Nearest Neighbor
- Introduction
- kNN
- Similarity and Dissimilarity
- Summary
Lazy vs. Eager Learning
- Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
- Eager learning (the methods discussed so far): given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
- Efficiency: lazy learning takes less time in training but more time in predicting
- Accuracy
  - A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form an implicit global approximation to the target function
  - An eager method must commit to a single hypothesis that covers the entire instance space
Lazy Learner: Instance-Based Methods
- Instance-based learning: store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
- Typical approaches
  - k-nearest neighbor approach: instances represented as points in, e.g., a Euclidean space
  - Locally weighted regression: constructs a local approximation
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space
- The nearest neighbors are defined in terms of a distance measure, dist(X1, X2)
- The target function could be discrete- or real-valued
- For discrete-valued functions, k-NN returns the most common value among the k training examples nearest to x_q
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: a query point x_q among “+” and “−” training examples, with the Voronoi diagram of the 1-NN decision surface]
kNN Example
[Figure: worked example of kNN classification]
kNN Algorithm Summary
- Choose K
- For a given new instance x_q, find the K closest training points w.r.t. a distance measure
- Classify x_q by majority vote among the K points
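A minimal sketch of this procedure in Python (assuming Euclidean distance and numeric feature vectors; the function and variable names are illustrative, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of the k nearest neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```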
Discussion on the k-NN Algorithm
- k-NN can also be used for real-valued prediction for a given unknown tuple: return the mean value of the k nearest neighbors
- Distance-weighted nearest neighbor algorithm
  - Weight the contribution of each of the k neighbors according to its distance to the query x_q, giving greater weight to closer neighbors
  - Prediction: ŷ_q = Σ_j w_j y_j / Σ_j w_j, where the x_j’s are x_q’s nearest neighbors
  - Common weighting schemes: w_j = 1 / d(x_q, x_j)², or the Gaussian kernel w_j = exp(−d(x_q, x_j)² / (2τ²))
- Robust to noisy data by averaging over the k nearest neighbors
- Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
  - To overcome it, stretch the axes or eliminate the least relevant attributes
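A sketch of distance-weighted kNN regression using the Gaussian kernel weights above (numpy-based; names are illustrative):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=5, tau=1.0):
    # Distances from the query point to all training points
    d = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argsort(d)[:k]                    # the k nearest neighbors
    w = np.exp(-d[idx] ** 2 / (2 * tau ** 2))  # Gaussian kernel weights
    # Distance-weighted average of the neighbors' target values
    return np.sum(w * y_train[idx]) / np.sum(w)
```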
Selection of k for kNN
- The number of neighbors k
  - Small k: overfitting (high variance, low bias)
  - Large k: brings in too many irrelevant points (high bias, low variance)
- More discussion: http://scott.fortmann-roe.com/docs/BiasVariance.html
Similarity and Dissimilarity
- Similarity
- Numerical measure of how alike two data objects are
- Value is higher when objects are more alike
- Often falls in the range [0,1]
- Dissimilarity (e.g., distance)
- Numerical measure of how different two data objects are
- Lower when objects are more alike
- Minimum dissimilarity is often 0
- Upper limit varies
- Proximity refers to a similarity or dissimilarity
Data Matrix and Dissimilarity Matrix
- Data matrix
  - n data points with p dimensions
  - Two modes

    [ x_11 ... x_1f ... x_1p ]
    [ ...       ...      ... ]
    [ x_i1 ... x_if ... x_ip ]
    [ ...       ...      ... ]
    [ x_n1 ... x_nf ... x_np ]

- Dissimilarity matrix
  - n data points, but registers only the distance
  - A triangular matrix
  - Single mode

    [ 0                           ]
    [ d(2,1)  0                   ]
    [ d(3,1)  d(3,2)  0           ]
    [ :       :       :           ]
    [ d(n,1)  d(n,2)  ...  ...  0 ]
Proximity Measure for Nominal Attributes
- Can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)
- Method 1: simple matching
  - d(i, j) = (p − m) / p, where m is the number of matches and p is the total number of variables
- Method 2: use a large number of binary attributes
  - create a new binary attribute for each of the M nominal states
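A one-function sketch of the simple-matching distance in Python (the function name is illustrative):

```python
def simple_matching_dist(a, b):
    # a, b: equal-length sequences of nominal values (p variables in total)
    p = len(a)
    m = sum(u == v for u, v in zip(a, b))  # number of matching variables
    return (p - m) / p

# d = (3 - 2) / 3, since the two objects match on 2 of 3 attributes
print(simple_matching_dist(["red", "S", "round"], ["red", "M", "round"]))  # 0.333...
```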
Proximity Measure for Binary Attributes
- A contingency table for binary data: for objects i and j, let
  - q = number of attributes where both i and j are 1
  - r = number of attributes where i is 1 and j is 0
  - s = number of attributes where i is 0 and j is 1
  - t = number of attributes where both i and j are 0
- Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t)
- Distance measure for asymmetric binary variables: d(i, j) = (r + s) / (q + r + s)
- Jaccard coefficient (similarity measure for asymmetric binary variables): sim(i, j) = q / (q + r + s)
Dissimilarity between Binary Variables
- Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

- Gender is a symmetric attribute; the remaining attributes are asymmetric binary
- Let the values Y and P be 1, and the value N be 0
- Using the asymmetric binary distance:
  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
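The example values can be checked with a small Python sketch of the asymmetric binary distance (encoding Y/P as 1 and N as 0, as above):

```python
def asym_binary_dist(a, b):
    # a, b: 0/1 vectors over the asymmetric binary attributes
    q = sum(x == 1 and y == 1 for x, y in zip(a, b))  # both 1
    r = sum(x == 1 and y == 0 for x, y in zip(a, b))  # 1 in a, 0 in b
    s = sum(x == 0 and y == 1 for x, y in zip(a, b))  # 0 in a, 1 in b
    return (r + s) / (q + r + s)

jack = [1, 0, 1, 0, 0, 0]  # Fever, Cough, Test-1, ..., Test-4
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(asym_binary_dist(jack, mary))  # 0.33...
print(asym_binary_dist(jack, jim))   # 0.66...
print(asym_binary_dist(jim, mary))   # 0.75
```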
Standardizing Numeric Data
- Z-score: z = (x − μ) / σ
  - x: raw score to be standardized, μ: mean of the population, σ: standard deviation
  - measures the distance between the raw score and the population mean, in units of the standard deviation
  - negative when the raw score is below the mean, positive when above
- An alternative: standardize with the mean absolute deviation
  - s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|), where m_f = (1/n)(x_1f + x_2f + ... + x_nf)
  - standardized measure (modified z-score): z_if = (x_if − m_f) / s_f
- Using the mean absolute deviation is more robust than using the standard deviation
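Both standardizations in Python, as a sketch (function names are illustrative):

```python
import numpy as np

def z_score(x):
    # Classic z-score: (x - mean) / standard deviation
    return (x - x.mean()) / x.std()

def z_score_mad(x):
    # Variant using the mean absolute deviation s_f, more robust to outliers
    m = x.mean()
    s = np.mean(np.abs(x - m))  # mean absolute deviation
    return (x - m) / s
```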
Example: Data Matrix and Dissimilarity Matrix
Data Matrix

  point  attribute1  attribute2
  x1     1           2
  x2     3           5
  x3     2           0
  x4     4           5

Dissimilarity Matrix (with Euclidean Distance)

        x1    x2    x3    x4
  x1    0
  x2    3.61  0
  x3    2.24  5.10  0
  x4    4.24  1.00  5.39  0
Distance on Numeric Data: Minkowski Distance
- Minkowski distance: a popular distance measure
  d(i, j) = (|x_i1 − x_j1|^h + |x_i2 − x_j2|^h + ... + |x_ip − x_jp|^h)^(1/h)
  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
- Properties
  - d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  - d(i, j) = d(j, i) (symmetry)
  - d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
- A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
- h = 1: Manhattan (city block, L1 norm) distance
  - d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
  - E.g., the Hamming distance: the number of bits that differ between two binary vectors
- h = 2: Euclidean (L2 norm) distance
  - d(i, j) = (|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)^(1/2)
- h → ∞: “supremum” (Lmax norm, L∞ norm) distance
  - the maximum difference between any component (attribute) of the vectors
  - d(i, j) = max_f |x_if − x_jf|
Example: Minkowski Distance
  point  attribute1  attribute2
  x1     1           2
  x2     3           5
  x3     2           0
  x4     4           5

Dissimilarity Matrices

Manhattan (L1)
        x1  x2  x3  x4
  x1    0
  x2    5   0
  x3    3   6   0
  x4    6   1   7   0

Euclidean (L2)
        x1    x2    x3    x4
  x1    0
  x2    3.61  0
  x3    2.24  5.10  0
  x4    4.24  1.00  5.39  0

Supremum (L∞)
        x1  x2  x3  x4
  x1    0
  x2    3   0
  x3    2   5   0
  x4    3   1   5   0
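These three matrices can be reproduced with a short numpy sketch (the helper name is illustrative):

```python
import numpy as np

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]], dtype=float)  # x1..x4

def minkowski(a, b, h):
    # L-h norm distance; h = np.inf gives the supremum distance
    if np.isinf(h):
        return np.max(np.abs(a - b))
    return np.sum(np.abs(a - b) ** h) ** (1.0 / h)

for h in (1, 2, np.inf):
    D = np.array([[minkowski(a, b, h) for b in X] for a in X])
    print(f"h = {h}:\n{np.round(D, 2)}\n")
```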
Ordinal Variables
- Order is important, e.g., rank
- Can be treated like interval-scaled
  - replace x_if by its rank r_if ∈ {1, ..., M_f}
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_if = (r_if − 1) / (M_f − 1)
  - compute the dissimilarity using methods for interval-scaled variables
Attributes of Mixed Type
- A database may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, ordinal
- One may use a weighted formula to combine their effects:
  d(i, j) = Σ_{f=1}^p δ_ij^(f) d_ij^(f) / Σ_{f=1}^p δ_ij^(f)
- If f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise
- If f is numeric: use the normalized distance
- If f is ordinal: compute the ranks r_if, set z_if = (r_if − 1) / (M_f − 1), and treat z_if as interval-scaled
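A sketch of this weighted combination in Python (assumes no missing values, so every δ_ij^(f) = 1; numeric attributes are assumed pre-normalized to [0, 1]; all names are illustrative):

```python
def mixed_dist(x, y, types, M=None):
    # types[f] is 'nominal', 'numeric', or 'ordinal'; M[f] is the number
    # of states of ordinal attribute f
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        if t == 'nominal':
            d = 0.0 if x[f] == y[f] else 1.0
        elif t == 'ordinal':
            zx = (x[f] - 1) / (M[f] - 1)  # map ranks onto [0, 1]
            zy = (y[f] - 1) / (M[f] - 1)
            d = abs(zx - zy)
        else:  # numeric, already normalized to [0, 1]
            d = abs(x[f] - y[f])
        num += d   # each delta_ij^(f) = 1 (no missing values)
        den += 1.0
    return num / den
```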
Cosine Similarity
- A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document
- Other vector objects: gene features in micro-arrays, ...
- Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
- Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then
  cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d
Example: Cosine Similarity
- cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d
- Ex: find the similarity between documents 1 and 2.

  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

  d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
  ||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 ≈ 6.481
  ||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = 17^0.5 ≈ 4.123
  cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
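The same computation as a quick numpy check:

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1], dtype=float)

# Cosine similarity: dot product over the product of the vector lengths
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))  # 0.94
```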
Summary
- Instance-based learning: lazy learning vs. eager learning
- K-nearest neighbor algorithm
- Similarity / dissimilarity measures