CS145: INTRODUCTION TO DATA MINING - Clustering Evaluation and Practical Issues - PowerPoint PPT Presentation



SLIDE 1

CS145: INTRODUCTION TO DATA MINING

Clustering Evaluation and Practical Issues

Instructor: Yizhou Sun (yzsun@cs.ucla.edu)

November 7, 2017

SLIDE 2

Learnt Clustering Methods

  • Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
  • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); PLSA (text data)
  • Prediction: Linear Regression; GLM* (vector data)
  • Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
  • Similarity Search: DTW (sequence data)

SLIDE 3

Evaluation and Other Practical Issues

  • Evaluation of Clustering
  • Model Selection
  • Summary


SLIDE 4

Measuring Clustering Quality

  • Two methods: extrinsic vs. intrinsic
  • Extrinsic: supervised, i.e., the ground truth is available
  • Compare a clustering against the ground truth using a clustering quality measure
  • Ex. purity, precision and recall metrics, normalized mutual information
  • Intrinsic: unsupervised, i.e., the ground truth is unavailable
  • Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are
  • Ex. silhouette coefficient
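The intrinsic route can be made concrete with a small sketch. The following pure-Python example (the tiny 1-D dataset is my own illustration, not from the slides) computes the mean silhouette coefficient directly from its definition:

```python
# Silhouette coefficient: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where
# a(i) = mean distance from point i to the other points in its own cluster
# b(i) = smallest mean distance from i to the points of any other cluster

def mean_silhouette(points, labels):
    clusters = set(labels)
    scores = []
    for i, (x, c) in enumerate(zip(points, labels)):
        own = [abs(x - y) for j, (y, lc) in enumerate(zip(points, labels))
               if lc == c and j != i]
        a = sum(own) / len(own)  # cohesion: how compact the own cluster is
        b = min(sum(abs(x - y) for y, lc in zip(points, labels) if lc == other)
                / labels.count(other)
                for other in clusters if other != c)  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two compact, well-separated 1-D clusters -> mean silhouette close to 1
points = [0.0, 0.3, 9.7, 10.0]
labels = [0, 0, 1, 1]
print(mean_silhouette(points, labels))  # close to 1
```

A value near 1 indicates compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned points.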

SLIDE 5

Purity

  • Let D = {d₁, …, d_L} be the output clustering result, and Ω = {ω₁, …, ω_K} be the ground-truth clustering result (ground-truth classes)
  • d_l and ω_k are sets of data points
  • purity(D, Ω) = (1/N) Σ_l max_k |d_l ∩ ω_k|, where N is the total number of data points
  • Match each output cluster d_l to the best ground-truth cluster ω_k
  • Examine the overlap of data points between the two matched clusters
  • Purity is the proportion of data points that are matched

SLIDE 6

Example

  • Clustering output: cluster 1, cluster 2, and cluster 3
  • Ground-truth clustering result: the ×'s, ◊'s, and ○'s
  • Matching: cluster 1 vs. the ×'s, cluster 2 vs. the ○'s, and cluster 3 vs. the ◊'s
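The purity formula is easy to implement. The sketch below uses the overlap counts from the worked NMI example two slides later (cluster 1: 5 ×'s and 1 ○; cluster 2: 1 ×, 4 ○'s and 1 ◊; cluster 3: 2 ×'s and 3 ◊'s), with "x"/"o"/"d" as stand-in class names:

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """purity = (1/N) * sum over output clusters of the best class overlap."""
    n = len(cluster_labels)
    total = 0
    for c in set(cluster_labels):
        # ground-truth classes of the points assigned to output cluster c
        members = [cls for cl, cls in zip(cluster_labels, class_labels) if cl == c]
        total += Counter(members).most_common(1)[0][1]  # best-matching class
    return total / n

# The 17 points from the contingency-table example
clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = (["x"] * 5 + ["o"]) + (["x"] + ["o"] * 4 + ["d"]) + (["x"] * 2 + ["d"] * 3)
print(purity(clusters, classes))  # (5 + 4 + 3) / 17, about 0.706
```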

SLIDE 7

Normalized Mutual Information

  • NMI(D, Ω) = I(D, Ω) / √(H(D) · H(Ω))
  • I(Ω, D) = Σ_l Σ_k P(d_l ∩ ω_k) log [ P(d_l ∩ ω_k) / (P(d_l) · P(ω_k)) ]
    = Σ_l Σ_k (|d_l ∩ ω_k| / N) log ( N · |d_l ∩ ω_k| / (|d_l| · |ω_k|) )
  • H(Ω) = − Σ_k P(ω_k) log P(ω_k) = − Σ_k (|ω_k| / N) log (|ω_k| / N)

SLIDE 8

Example

                Cluster 1   Cluster 2   Cluster 3   Sum
  Crosses (×)       5           1           2        8
  Circles (○)       1           4           0        5
  Diamonds (◊)      0           1           3        4
  Sum               6           6           5      N = 17

  • Each cell gives the overlap |ω_k ∩ d_l|; the row sums give |ω_k| and the column sums give |d_l|
  • NMI = 0.36
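Plugging the contingency table into the formulas from the previous slide reproduces the stated value. A minimal sketch (natural log; the log base cancels in the normalized ratio):

```python
from math import log, sqrt

# Contingency counts n[k][l] = |omega_k ∩ d_l| from the table
# (rows: crosses, circles, diamonds; columns: clusters 1..3)
n = [[5, 1, 2],
     [1, 4, 0],
     [0, 1, 3]]
N = sum(map(sum, n))  # 17

row = [sum(r) for r in n]                       # |omega_k|
col = [sum(r[j] for r in n) for j in range(3)]  # |d_l|

# I(Omega, D) = sum_{k,l} (n_kl / N) * log(N * n_kl / (|omega_k| * |d_l|))
mi = sum(n[k][l] / N * log(N * n[k][l] / (row[k] * col[l]))
         for k in range(3) for l in range(3) if n[k][l] > 0)

# Entropies H(Omega) and H(D)
h_omega = -sum(r / N * log(r / N) for r in row)
h_d = -sum(c / N * log(c / N) for c in col)

nmi = mi / sqrt(h_omega * h_d)
print(round(nmi, 2))  # 0.36, matching the slide
```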

SLIDE 9

Precision and Recall

  • Consider pairs of data points: ideally, two data points in the same ground-truth class should be placed in the same cluster (TP), and two data points in different classes should be placed in different clusters (TN)

                       Same cluster   Different clusters
    Same class             TP                FN
    Different classes      FP                TN

  • Rand Index (RI) = (TP + TN) / (TP + FP + FN + TN)
  • Precision: P = TP / (TP + FP)
  • Recall: R = TP / (TP + FN)
  • F-measure: F = 2P·R / (P + R)

SLIDE 10

Example

  Data point   Output clustering   Ground-truth clustering (class)
      a                1                         2
      b                1                         2
      c                2                         2
      d                2                         1

  • # pairs of data points: 6
  • (a, b): same class, same cluster (TP)
  • (a, c): same class, different clusters (FN)
  • (a, d): different classes, different clusters (TN)
  • (b, c): same class, different clusters (FN)
  • (b, d): different classes, different clusters (TN)
  • (c, d): different classes, same cluster (FP)
  • TP = 1, FP = 1, FN = 2, TN = 2
  • RI = 3/6 = 0.5; P = 1/2, R = 1/3, F = 0.4
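The pair counting above can be automated. A small sketch over all pairs of points, using the example's labels:

```python
from itertools import combinations

def pair_counts(output, truth):
    """Count TP/FP/FN/TN over all pairs of data points."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(output)), 2):
        same_cluster = output[i] == output[j]
        same_class = truth[i] == truth[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Points a, b, c, d from the example slide
output = [1, 1, 2, 2]  # output clustering
truth = [2, 2, 2, 1]   # ground-truth classes
tp, fp, fn, tn = pair_counts(output, truth)
ri = (tp + tn) / (tp + fp + fn + tn)
p, r = tp / (tp + fp), tp / (tp + fn)
f = 2 * p * r / (p + r)
print(tp, fp, fn, tn, ri, f)  # 1 1 2 2, RI = 0.5, F = 0.4
```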

SLIDE 11

Question

  • If we flip the ground-truth cluster labels (2 → 1 and 1 → 2), will the evaluation results be the same?

  Data point   Output clustering   Ground-truth clustering (class)
      a                1                         2
      b                1                         2
      c                2                         2
      d                2                         1
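Pair-based measures only ask whether the two points of a pair agree in label, never what the label names are, so renaming labels cannot change the counts. A quick check on the example data:

```python
from itertools import combinations

def pair_stats(output, truth):
    """TP/FP/FN/TN over all pairs (booleans sum as 0/1)."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(output)), 2):
        sc = output[i] == output[j]  # same output cluster?
        sk = truth[i] == truth[j]    # same ground-truth class?
        tp += sc and sk
        fp += sc and not sk
        fn += sk and not sc
        tn += not sc and not sk
    return tp, fp, fn, tn

output = [1, 1, 2, 2]
truth = [2, 2, 2, 1]
flipped = [3 - t for t in truth]  # 2 -> 1 and 1 -> 2
print(pair_stats(output, truth), pair_stats(output, flipped))
# identical counts, so RI, P, R, and F are all unchanged
```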

SLIDE 12

Evaluation and Other Practical Issues

  • Evaluation of Clustering
  • Model Selection
  • Summary


SLIDE 13

Selecting K in K-means and GMM

  • Selecting K is a model selection problem
  • Methods
  • Heuristics-based methods
  • Penalty method
  • Cross-validation


SLIDE 14

Heuristic Approaches

  • For K-means, plot the sum of squared errors (SSE) for different values of k
  • A bigger k always leads to a smaller cost
  • Knee (elbow) points suggest good candidates for k
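Both observations can be seen on a toy example. The sketch below (my own tiny 1-D data; an exhaustive search over assignments stands in for running K-means many times, which is only feasible because the input is tiny) computes the optimal SSE for each k:

```python
from itertools import product

def best_sse(points, k):
    """Exhaustively search assignments of points to at most k clusters
    and return the smallest sum of squared errors (tiny inputs only)."""
    best = float("inf")
    for assign in product(range(k), repeat=len(points)):
        sse = 0.0
        for c in range(k):
            members = [x for x, a in zip(points, assign) if a == c]
            if members:
                mu = sum(members) / len(members)  # cluster centroid
                sse += sum((x - mu) ** 2 for x in members)
        best = min(best, sse)
    return best

# Two tight 1-D clumps: the cost collapses at k = 2, then barely improves
pts = [0.0, 0.2, 0.4, 10.0, 10.2, 10.4]
sse = {k: best_sse(pts, k) for k in (1, 2, 3)}
print(sse)  # large at k=1, tiny at k=2, only slightly smaller at k=3
```

The sharp drop from k = 1 to k = 2 followed by a negligible drop at k = 3 is exactly the knee that suggests k = 2.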

SLIDE 15

Penalty Method: BIC

  • For model-based clustering, e.g., GMM, choose the k that maximizes the BIC
  • A larger k increases the likelihood, but also increases the penalty term: a trade-off!
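The slide does not spell out the formula; a common form matching its "maximize" convention is BIC = log L − (p/2)·log n, where p is the number of free parameters. The sketch below uses illustrative, made-up training log-likelihoods (they are not computed from data) just to show the trade-off:

```python
from math import log

def gmm_free_params(k, d):
    """Free parameters of a k-component GMM in d dimensions with full
    covariances: (k-1) weights + k*d means + k*d*(d+1)/2 covariance terms."""
    return (k - 1) + k * d + k * d * (d + 1) // 2

def bic(loglik, k, d, n):
    # "Maximize BIC" convention: likelihood reward minus complexity penalty
    return loglik - 0.5 * gmm_free_params(k, d) * log(n)

# Illustrative (made-up) training log-likelihoods for n = 200 points in 2-D:
# the likelihood keeps rising with k, but the penalty grows linearly in k
loglik = {1: -500.0, 2: -430.0, 3: -425.0}
scores = {k: bic(ll, k, d=2, n=200) for k, ll in loglik.items()}
best_k = max(scores, key=scores.get)
print(best_k)  # 2: the small gain from k = 3 no longer pays for the penalty
```

Note that some libraries (e.g. scikit-learn's `GaussianMixture.bic`) report BIC with the opposite sign, so there you would minimize it instead.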

SLIDE 16

Cross-Validation Likelihood

  • The likelihood of the training data will increase when increasing k
  • Instead, compute the likelihood on unseen data
  • For each possible k:
  • Partition the data into a training and a test dataset
  • Learn the GMM parameters on the training dataset and compute the log-likelihood on the test dataset
  • Repeat this multiple times to get an average value
  • Select the k that maximizes the average log-likelihood on the test dataset
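The selection loop can be sketched end to end. To keep the example self-contained, a crude quantile-chunk fit stands in for real EM (sort the training data, cut it into k contiguous chunks, fit one Gaussian per chunk); in practice you would fit an actual GMM, and repeat over several random splits rather than the single deterministic split used here:

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mu, sd):
    return exp(-(x - mu) ** 2 / (2 * sd * sd)) / (sd * sqrt(2 * pi))

def fit_mixture(train, k):
    """Crude stand-in for EM: sort the data, cut it into k contiguous
    chunks, and fit one Gaussian (weight, mean, std) per chunk."""
    xs = sorted(train)
    n = len(xs)
    comps = []
    for i in range(k):
        chunk = xs[round(i * n / k):round((i + 1) * n / k)]
        mu = sum(chunk) / len(chunk)
        sd = max(sqrt(sum((x - mu) ** 2 for x in chunk) / len(chunk)), 1e-3)
        comps.append((len(chunk) / n, mu, sd))
    return comps

def avg_test_loglik(comps, test):
    # Mixture log-likelihood of the held-out points, averaged per point
    return sum(log(sum(w * normal_pdf(x, mu, sd) for w, mu, sd in comps))
               for x in test) / len(test)

# Two well-separated 1-D clumps; alternate points into train and test
data = [0.0, 0.1, -0.1, 0.05, 10.0, 10.1, 9.9, 10.05]
train, test = data[0::2], data[1::2]
scores = {k: avg_test_loglik(fit_mixture(train, k), test) for k in (1, 2, 3)}
best_k = max(scores, key=scores.get)
print(best_k)  # the held-out likelihood peaks at k = 2
```

With k = 1 the single Gaussian is far too wide, and with k = 3 one clump gets split into an overly narrow component, so the held-out log-likelihood selects k = 2.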

SLIDE 17

Evaluation and Other Practical Issues

  • Evaluation of Clustering
  • Model Selection
  • Summary


SLIDE 18

Summary

  • Evaluation of Clustering
  • Purity, NMI, F-measure
  • Model selection
  • How to select k for k-means and GMM
