CS145: INTRODUCTION TO DATA MINING - Clustering Evaluation and Practical Issues - PowerPoint PPT Presentation



SLIDE 1

CS145: INTRODUCTION TO DATA MINING

Clustering Evaluation and Practical Issues

Instructor: Yizhou Sun (yzsun@cs.ucla.edu)

November 7, 2017

SLIDE 2

Learnt Clustering Methods

  • Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
  • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); PLSA (text data)
  • Prediction: Linear Regression; GLM* (vector data)
  • Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
  • Similarity Search: DTW (sequence data)

SLIDE 3

Evaluation and Other Practical Issues

  • Evaluation of Clustering
  • Model Selection
  • Summary


SLIDE 4

Measuring Clustering Quality

  • Two methods: extrinsic vs. intrinsic
  • Extrinsic: supervised, i.e., the ground truth is available
  • Compare a clustering against the ground truth using a clustering quality measure
  • Ex. purity, precision and recall metrics, normalized mutual information
  • Intrinsic: unsupervised, i.e., the ground truth is unavailable
  • Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are
  • Ex. silhouette coefficient
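The intrinsic route can be made concrete with a small sketch. The following pure-Python example (the tiny 1-D dataset is my own illustration, not from the slides) computes the mean silhouette coefficient directly from its definition:

```python
# Silhouette coefficient: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where
# a(i) = mean distance from point i to the other points in its own cluster
# b(i) = smallest mean distance from i to the points of any other cluster

def mean_silhouette(points, labels):
    clusters = set(labels)
    scores = []
    for i, (x, c) in enumerate(zip(points, labels)):
        own = [abs(x - y) for j, (y, lc) in enumerate(zip(points, labels))
               if lc == c and j != i]
        a = sum(own) / len(own)  # cohesion: how compact the own cluster is
        b = min(sum(abs(x - y) for y, lc in zip(points, labels) if lc == other)
                / labels.count(other)
                for other in clusters if other != c)  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two compact, well-separated 1-D clusters -> mean silhouette close to 1
points = [0.0, 0.3, 9.7, 10.0]
labels = [0, 0, 1, 1]
print(mean_silhouette(points, labels))  # close to 1
```

A value near 1 indicates compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned points.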

SLIDE 5

Purity

  • Let D = {d₁, …, d_L} be the output clustering result, and Ω = {ω₁, …, ω_K} be the ground-truth clustering result (ground-truth classes)
  • d_l and ω_k are sets of data points
  • purity(D, Ω) = (1/N) Σ_l max_k |d_l ∩ ω_k|, where N is the total number of data points
  • Match each output cluster d_l to the best ground-truth cluster ω_k
  • Examine the overlap of data points between the two matched clusters
  • Purity is the proportion of data points that are matched

SLIDE 6

Example

  • Clustering output: cluster 1, cluster 2, and cluster 3
  • Ground-truth clustering result: the ×'s, ◊'s, and ○'s
  • Matching: cluster 1 vs. the ×'s, cluster 2 vs. the ○'s, and cluster 3 vs. the ◊'s
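The purity formula is easy to implement. The sketch below uses the overlap counts from the worked NMI example two slides later (cluster 1: 5 ×'s and 1 ○; cluster 2: 1 ×, 4 ○'s and 1 ◊; cluster 3: 2 ×'s and 3 ◊'s), with "x"/"o"/"d" as stand-in class names:

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """purity = (1/N) * sum over output clusters of the best class overlap."""
    n = len(cluster_labels)
    total = 0
    for c in set(cluster_labels):
        # ground-truth classes of the points assigned to output cluster c
        members = [cls for cl, cls in zip(cluster_labels, class_labels) if cl == c]
        total += Counter(members).most_common(1)[0][1]  # best-matching class
    return total / n

# The 17 points from the contingency-table example
clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = (["x"] * 5 + ["o"]) + (["x"] + ["o"] * 4 + ["d"]) + (["x"] * 2 + ["d"] * 3)
print(purity(clusters, classes))  # (5 + 4 + 3) / 17, about 0.706
```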

SLIDE 7

Normalized Mutual Information

  • NMI(D, Ω) = I(D, Ω) / √(H(D) · H(Ω))
  • I(Ω, D) = Σ_l Σ_k P(d_l ∩ ω_k) log [ P(d_l ∩ ω_k) / (P(d_l) · P(ω_k)) ]
    = Σ_l Σ_k (|d_l ∩ ω_k| / N) log ( N · |d_l ∩ ω_k| / (|d_l| · |ω_k|) )
  • H(Ω) = − Σ_k P(ω_k) log P(ω_k) = − Σ_k (|ω_k| / N) log (|ω_k| / N)

SLIDE 8

Example

                Cluster 1   Cluster 2   Cluster 3   Sum
  Crosses (×)       5           1           2        8
  Circles (○)       1           4           0        5
  Diamonds (◊)      0           1           3        4
  Sum               6           6           5      N = 17

  • Each cell gives the overlap |ω_k ∩ d_l|; the row sums give |ω_k| and the column sums give |d_l|
  • NMI = 0.36
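Plugging the contingency table into the formulas from the previous slide reproduces the stated value. A minimal sketch (natural log; the log base cancels in the normalized ratio):

```python
from math import log, sqrt

# Contingency counts n[k][l] = |omega_k ∩ d_l| from the table
# (rows: crosses, circles, diamonds; columns: clusters 1..3)
n = [[5, 1, 2],
     [1, 4, 0],
     [0, 1, 3]]
N = sum(map(sum, n))  # 17

row = [sum(r) for r in n]                       # |omega_k|
col = [sum(r[j] for r in n) for j in range(3)]  # |d_l|

# I(Omega, D) = sum_{k,l} (n_kl / N) * log(N * n_kl / (|omega_k| * |d_l|))
mi = sum(n[k][l] / N * log(N * n[k][l] / (row[k] * col[l]))
         for k in range(3) for l in range(3) if n[k][l] > 0)

# Entropies H(Omega) and H(D)
h_omega = -sum(r / N * log(r / N) for r in row)
h_d = -sum(c / N * log(c / N) for c in col)

nmi = mi / sqrt(h_omega * h_d)
print(round(nmi, 2))  # 0.36, matching the slide
```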

SLIDE 9

Precision and Recall

  • Consider pairs of data points: ideally, two data points in the same ground-truth class should be placed in the same cluster (TP), and two data points in different classes should be placed in different clusters (TN)

                       Same cluster   Different clusters
    Same class             TP                FN
    Different classes      FP                TN

  • Rand Index (RI) = (TP + TN) / (TP + FP + FN + TN)
  • Precision: P = TP / (TP + FP)
  • Recall: R = TP / (TP + FN)
  • F-measure: F = 2P·R / (P + R)

SLIDE 10

Example

  Data point   Output clustering   Ground-truth clustering (class)
      a                1                         2
      b                1                         2
      c                2                         2
      d                2                         1

  • # pairs of data points: 6
  • (a, b): same class, same cluster (TP)
  • (a, c): same class, different clusters (FN)
  • (a, d): different classes, different clusters (TN)
  • (b, c): same class, different clusters (FN)
  • (b, d): different classes, different clusters (TN)
  • (c, d): different classes, same cluster (FP)
  • TP = 1, FP = 1, FN = 2, TN = 2
  • RI = 3/6 = 0.5; P = 1/2, R = 1/3, F = 0.4
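The pair counting above can be automated. A small sketch over all pairs of points, using the example's labels:

```python
from itertools import combinations

def pair_counts(output, truth):
    """Count TP/FP/FN/TN over all pairs of data points."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(output)), 2):
        same_cluster = output[i] == output[j]
        same_class = truth[i] == truth[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Points a, b, c, d from the example slide
output = [1, 1, 2, 2]  # output clustering
truth = [2, 2, 2, 1]   # ground-truth classes
tp, fp, fn, tn = pair_counts(output, truth)
ri = (tp + tn) / (tp + fp + fn + tn)
p, r = tp / (tp + fp), tp / (tp + fn)
f = 2 * p * r / (p + r)
print(tp, fp, fn, tn, ri, f)  # 1 1 2 2, RI = 0.5, F = 0.4
```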

SLIDE 11

Question

  • If we flip the ground-truth cluster labels (2 → 1 and 1 → 2), will the evaluation results be the same?

  Data point   Output clustering   Ground-truth clustering (class)
      a                1                         2
      b                1                         2
      c                2                         2
      d                2                         1
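Pair-based measures only ask whether the two points of a pair agree in label, never what the label names are, so renaming labels cannot change the counts. A quick check on the example data:

```python
from itertools import combinations

def pair_stats(output, truth):
    """TP/FP/FN/TN over all pairs (booleans sum as 0/1)."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(output)), 2):
        sc = output[i] == output[j]  # same output cluster?
        sk = truth[i] == truth[j]    # same ground-truth class?
        tp += sc and sk
        fp += sc and not sk
        fn += sk and not sc
        tn += not sc and not sk
    return tp, fp, fn, tn

output = [1, 1, 2, 2]
truth = [2, 2, 2, 1]
flipped = [3 - t for t in truth]  # 2 -> 1 and 1 -> 2
print(pair_stats(output, truth), pair_stats(output, flipped))
# identical counts, so RI, P, R, and F are all unchanged
```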

SLIDE 12

Evaluation and Other Practical Issues

  • Evaluation of Clustering
  • Model Selection
  • Summary


SLIDE 13

Selecting K in K-means and GMM

  • Selecting K is a model selection problem
  • Methods
  • Heuristics-based methods
  • Penalty method
  • Cross-validation


SLIDE 14

Heuristic Approaches

  • For K-means, plot the sum of squared errors (SSE) for different values of k
  • A bigger k always leads to a smaller cost
  • Knee (elbow) points suggest good candidates for k
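Both observations can be seen on a toy example. The sketch below (my own tiny 1-D data; an exhaustive search over assignments stands in for running K-means many times, which is only feasible because the input is tiny) computes the optimal SSE for each k:

```python
from itertools import product

def best_sse(points, k):
    """Exhaustively search assignments of points to at most k clusters
    and return the smallest sum of squared errors (tiny inputs only)."""
    best = float("inf")
    for assign in product(range(k), repeat=len(points)):
        sse = 0.0
        for c in range(k):
            members = [x for x, a in zip(points, assign) if a == c]
            if members:
                mu = sum(members) / len(members)  # cluster centroid
                sse += sum((x - mu) ** 2 for x in members)
        best = min(best, sse)
    return best

# Two tight 1-D clumps: the cost collapses at k = 2, then barely improves
pts = [0.0, 0.2, 0.4, 10.0, 10.2, 10.4]
sse = {k: best_sse(pts, k) for k in (1, 2, 3)}
print(sse)  # large at k=1, tiny at k=2, only slightly smaller at k=3
```

The sharp drop from k = 1 to k = 2 followed by a negligible drop at k = 3 is exactly the knee that suggests k = 2.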

SLIDE 15

Penalty Method: BIC

  • For model-based clustering, e.g., GMM, choose the k that maximizes the BIC
  • A larger k increases the likelihood, but also increases the penalty term: a trade-off!
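The slide does not spell out the formula; a common form matching its "maximize" convention is BIC = log L − (p/2)·log n, where p is the number of free parameters. The sketch below uses illustrative, made-up training log-likelihoods (they are not computed from data) just to show the trade-off:

```python
from math import log

def gmm_free_params(k, d):
    """Free parameters of a k-component GMM in d dimensions with full
    covariances: (k-1) weights + k*d means + k*d*(d+1)/2 covariance terms."""
    return (k - 1) + k * d + k * d * (d + 1) // 2

def bic(loglik, k, d, n):
    # "Maximize BIC" convention: likelihood reward minus complexity penalty
    return loglik - 0.5 * gmm_free_params(k, d) * log(n)

# Illustrative (made-up) training log-likelihoods for n = 200 points in 2-D:
# the likelihood keeps rising with k, but the penalty grows linearly in k
loglik = {1: -500.0, 2: -430.0, 3: -425.0}
scores = {k: bic(ll, k, d=2, n=200) for k, ll in loglik.items()}
best_k = max(scores, key=scores.get)
print(best_k)  # 2: the small gain from k = 3 no longer pays for the penalty
```

Note that some libraries (e.g. scikit-learn's `GaussianMixture.bic`) report BIC with the opposite sign, so there you would minimize it instead.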

SLIDE 16

Cross-Validation Likelihood

  • The likelihood of the training data will increase when increasing k
  • Instead, compute the likelihood on unseen data
  • For each possible k:
  • Partition the data into a training and a test dataset
  • Learn the GMM parameters on the training dataset and compute the log-likelihood on the test dataset
  • Repeat this multiple times to get an average value
  • Select the k that maximizes the average log-likelihood on the test dataset
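The selection loop can be sketched end to end. To keep the example self-contained, a crude quantile-chunk fit stands in for real EM (sort the training data, cut it into k contiguous chunks, fit one Gaussian per chunk); in practice you would fit an actual GMM, and repeat over several random splits rather than the single deterministic split used here:

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mu, sd):
    return exp(-(x - mu) ** 2 / (2 * sd * sd)) / (sd * sqrt(2 * pi))

def fit_mixture(train, k):
    """Crude stand-in for EM: sort the data, cut it into k contiguous
    chunks, and fit one Gaussian (weight, mean, std) per chunk."""
    xs = sorted(train)
    n = len(xs)
    comps = []
    for i in range(k):
        chunk = xs[round(i * n / k):round((i + 1) * n / k)]
        mu = sum(chunk) / len(chunk)
        sd = max(sqrt(sum((x - mu) ** 2 for x in chunk) / len(chunk)), 1e-3)
        comps.append((len(chunk) / n, mu, sd))
    return comps

def avg_test_loglik(comps, test):
    # Mixture log-likelihood of the held-out points, averaged per point
    return sum(log(sum(w * normal_pdf(x, mu, sd) for w, mu, sd in comps))
               for x in test) / len(test)

# Two well-separated 1-D clumps; alternate points into train and test
data = [0.0, 0.1, -0.1, 0.05, 10.0, 10.1, 9.9, 10.05]
train, test = data[0::2], data[1::2]
scores = {k: avg_test_loglik(fit_mixture(train, k), test) for k in (1, 2, 3)}
best_k = max(scores, key=scores.get)
print(best_k)  # the held-out likelihood peaks at k = 2
```

With k = 1 the single Gaussian is far too wide, and with k = 3 one clump gets split into an overly narrow component, so the held-out log-likelihood selects k = 2.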

SLIDE 17

Evaluation and Other Practical Issues

  • Evaluation of Clustering
  • Model Selection
  • Summary


SLIDE 18

Summary

  • Evaluation of Clustering
  • Purity, NMI, F-measure
  • Model selection
  • How to select k for k-means and GMM
