Machine Learning Lecture Notes on Clustering (IV) 2016-2017, Davide Eynard

SLIDE 1

Machine Learning

Lecture Notes on Clustering (IV) 2016-2017

Davide Eynard

davide.eynard@usi.ch

Institute of Computational Science, Università della Svizzera italiana

SLIDE 2

Lecture outline

  • Cluster Evaluation
  • Internal measures
  • External measures
  • Finding the correct number of clusters
  • Framework for cluster validity

SLIDE 3

Cluster Evaluation

  • Every algorithm has its pros and cons
    • (Not only about cluster quality: complexity, whether the number of clusters must be known in advance, etc.)
  • As far as cluster quality is concerned, we can evaluate (or, better, validate) clusters
  • For supervised classification we have a variety of measures to evaluate how good our model is
    • Accuracy, precision, recall
  • For cluster analysis, the analogous question is: how can we evaluate the "goodness" of the resulting clusters?
  • But most of all... why should we evaluate it?

SLIDE 4

Clusters found in random data

"Clusters are in the eye of the beholder"

SLIDE 5

Why evaluate?

  • To determine the clustering tendency of the dataset, that is, to distinguish whether non-random structure actually exists in the data
  • To determine the correct number of clusters
  • To evaluate how well the results of a cluster analysis fit the data without reference to external information
  • To compare the results of a cluster analysis to externally known results, such as externally provided class labels
  • To compare two sets of clusters to determine which is better

Note:

  • the first three are unsupervised techniques, while the fourth requires external information; the fifth can be carried out either with or without it
  • the last three can be applied to the entire clustering or just to individual clusters

SLIDE 6

Open challenges

Cluster evaluation has a number of challenges:

  • a measure of cluster validity may be quite limited in the scope of its applicability
    • e.g. the dimensionality of the problem: most work has been done only on 2- or 3-dimensional data
  • we need a framework to interpret any measure
    • How good is "10"?
  • if a measure is too complicated to apply or to understand, nobody will use it

SLIDE 7

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:

  • Internal (unsupervised) Indices: Used to measure the goodness of a clustering structure without respect to external information
    • cluster cohesion vs cluster separation
    • e.g. Sum of Squared Error (SSE)
  • External (supervised) Indices: Used to measure the extent to which cluster labels match externally supplied class labels
    • e.g. entropy, purity, precision, accuracy, ...
  • Relative Indices: Used to compare two different clusterings or clusters
    • an internal or external index (e.g. SSE or entropy) that normally evaluates a single clustering/cluster is used here to compare two of them

SLIDE 8

External Measures

  • Entropy
    • The degree to which each cluster consists of objects of a single class
    • For cluster i we compute $p_{ij}$, the probability that a member of cluster i belongs to class j, as $p_{ij} = m_{ij}/m_i$, where $m_i$ is the number of objects in cluster i and $m_{ij}$ is the number of objects of class j in cluster i
    • The entropy of each cluster i is $e_i = -\sum_{j=1}^{L} p_{ij} \log_2 p_{ij}$, where L is the number of classes
    • The total entropy is $e = \sum_{i=1}^{K} \frac{m_i}{m} e_i$, where K is the number of clusters and m is the total number of data points
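
The entropy definitions above can be sketched in a few lines of Python. This is only an illustration: the count table `counts` (rows are clusters, columns are classes) is made up for the example and does not come from the slides, and the function names are hypothetical.

```python
from math import log2

def cluster_entropy(row):
    """Entropy e_i of one cluster, given its per-class counts m_ij."""
    m_i = sum(row)
    # Terms with m_ij = 0 are skipped, following the convention 0 * log2(0) = 0.
    return -sum((m_ij / m_i) * log2(m_ij / m_i) for m_ij in row if m_ij > 0)

def total_entropy(counts):
    """Weighted sum of the cluster entropies, with weights m_i / m."""
    m = sum(sum(row) for row in counts)
    return sum(sum(row) / m * cluster_entropy(row) for row in counts)

# Hypothetical counts: counts[i][j] = number of objects of class j in cluster i.
counts = [[4, 0],   # cluster 0 contains only class 0 (entropy 0)
          [2, 2]]   # cluster 1 is an even mix of both classes (entropy 1)
e = total_entropy(counts)
```

With these counts the two clusters have the same size, so the total entropy is the plain average of the two cluster entropies.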

SLIDE 9

External Measures

  • Purity
    • Another measure of the extent to which a cluster contains objects of a single class
    • Using the previous terminology, the purity of cluster i is $p_i = \max_j p_{ij}$
    • The overall purity is $purity = \sum_{i=1}^{K} \frac{m_i}{m} p_i$
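
As a rough sketch, purity can be computed directly from a table of counts. The table below is invented for illustration (rows are clusters, columns are classes), and the function name is an assumption. Since the weight of cluster i times its purity equals its largest class count divided by m, the weighted sum simplifies to the sum of the row maxima divided by m.

```python
def purity(counts):
    """Overall purity: sum over clusters of (m_i / m) * max_j(m_ij / m_i)."""
    m = sum(sum(row) for row in counts)
    # (m_i / m) * (max_j m_ij / m_i) simplifies to max_j m_ij / m.
    return sum(max(row) for row in counts) / m

# Hypothetical counts: counts[i][j] = number of objects of class j in cluster i.
counts = [[4, 0],
          [2, 2]]
overall_purity = purity(counts)
```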

SLIDE 10

External Measures

  • Precision
    • The fraction of a cluster that consists of objects of a specified class
    • The precision of cluster i with respect to class j is $precision(i, j) = p_{ij}$
  • Recall
    • The extent to which a cluster contains all objects of a specified class
    • The recall of cluster i with respect to class j is $recall(i, j) = m_{ij}/m_j$, where $m_j$ is the number of objects in class j

SLIDE 11

External Measures

  • F-measure
    • A combination of both precision and recall that measures the extent to which a cluster contains only objects of a particular class and all objects of that class
    • The F-measure of cluster i with respect to class j is $F(i, j) = \frac{2 \times precision(i, j) \times recall(i, j)}{precision(i, j) + recall(i, j)}$
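
A minimal sketch of precision, recall and F-measure for a (cluster, class) pair follows. The count table is made up for illustration, the function names are hypothetical, and returning 0 when both precision and recall are 0 is an assumed convention that the slides do not state.

```python
def precision(counts, i, j):
    """Fraction of cluster i that belongs to class j (m_ij / m_i)."""
    return counts[i][j] / sum(counts[i])

def recall(counts, i, j):
    """Fraction of class j that ended up in cluster i (m_ij / m_j)."""
    m_j = sum(row[j] for row in counts)
    return counts[i][j] / m_j

def f_measure(counts, i, j):
    """Harmonic mean of precision and recall for cluster i and class j."""
    p, r = precision(counts, i, j), recall(counts, i, j)
    # Assumed convention: F is 0 when the cluster holds no object of the class.
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# Hypothetical counts: counts[i][j] = number of objects of class j in cluster i.
counts = [[4, 0],
          [2, 2]]
f_00 = f_measure(counts, 0, 0)  # cluster 0 is pure but misses 2 of the 6 class-0 objects
```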

SLIDE 12

External Measures: example

SLIDE 13

Internal measures: Cohesion and Separation

  • Graph-based view
  • Prototype-based view

SLIDE 14

Internal measures: Cohesion and Separation

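  • Cluster Cohesion: Measures how closely related the objects in a cluster are
    • Graph-based: $cohesion(C_i) = \sum_{x \in C_i, y \in C_i} proximity(x, y)$
    • Prototype-based: $cohesion(C_i) = \sum_{x \in C_i} proximity(x, c_i)$
  • Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
    • Graph-based: $separation(C_i, C_j) = \sum_{x \in C_i, y \in C_j} proximity(x, y)$
    • Prototype-based: $separation(C_i, C_j) = proximity(c_i, c_j)$ and $separation(C_i) = proximity(c_i, c)$

Here $c_i$ is the prototype (e.g. the centroid) of cluster $C_i$ and $c$ is the overall prototype of the data.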

SLIDE 15

Cohesion and separation example

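  • Cohesion is measured by the within-cluster sum of squares (SSE): $WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$
  • Separation is measured by the between-cluster sum of squares: $BSS = \sum_i |C_i| (m - m_i)^2$, where $|C_i|$ is the size of cluster i

Here $m_i$ is the mean of cluster i and $m$ is the overall mean of the data.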

SLIDE 16

Cohesion and separation example

  • K=1 cluster:
    • $WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10$
    • $BSS = 4 \times (3 - 3)^2 = 0$
    • $Total = 10 + 0 = 10$
  • K=2 clusters:
    • $WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1$
    • $BSS = 2 \times (3 - 1.5)^2 + 2 \times (4.5 - 3)^2 = 9$
    • $Total = 1 + 9 = 10$
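
The arithmetic above can be checked with a short Python sketch. It assumes the underlying data are the four one-dimensional points 1, 2, 4 and 5 (read off from the terms of the sums), and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def wss(clusters):
    """Within-cluster sum of squares: squared distance of each point to its cluster mean."""
    return sum((x - mean(c)) ** 2 for c in clusters for x in c)

def bss(clusters):
    """Between-cluster sum of squares: cluster size times squared distance to the overall mean."""
    overall = mean([x for c in clusters for x in c])
    return sum(len(c) * (overall - mean(c)) ** 2 for c in clusters)

one_cluster = [[1, 2, 4, 5]]
two_clusters = [[1, 2], [4, 5]]

totals = [wss(c) + bss(c) for c in (one_cluster, two_clusters)]
```

In both cases WSS + BSS comes out as 10: the total sum of squares depends only on the data, so lowering WSS by splitting clusters raises BSS by the same amount.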

SLIDE 17

Evaluating individual clusters and Objects

  • So far, we have focused on the evaluation of a group of clusters
  • Many of these measures, however, can also be used to evaluate individual clusters and objects
    • For example, a cluster with a high cohesion may be considered better than a cluster with a lower one
  • This information can often be used to improve the quality of the clustering
    • Split clusters that are not very cohesive
    • Merge clusters that are not well separated
  • We can also evaluate the objects within a cluster in terms of their contribution to the overall cohesion or separation of the cluster

SLIDE 18

The Silhouette Coefficient

  • The Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as for clusters and clusterings
  • For an individual point i
    • Calculate $a_i$ = average distance of i to the other points in its cluster
    • Calculate $b_i$ = min (average distance of i to the points in another cluster)
    • The silhouette coefficient for the point is then given by $s_i = (b_i - a_i)/\max(a_i, b_i)$
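
The per-point computation can be sketched as follows. This is an illustration under stated assumptions: points are tuples compared with Euclidean distance, there are at least two clusters, and a point that is alone in its cluster gets a score of 0 (a common convention that the slides do not mention). The sample points and labels are made up.

```python
from math import dist

def silhouette(points, labels, i):
    """Silhouette coefficient s_i of the point with index i."""
    def mean_distance_to(cluster):
        # Average distance from point i to the members of `cluster`, excluding i itself.
        others = [p for k, p in enumerate(points) if labels[k] == cluster and k != i]
        return sum(dist(points[i], p) for p in others) / len(others)

    own = labels[i]
    if labels.count(own) == 1:
        return 0.0  # assumed convention for singleton clusters
    a_i = mean_distance_to(own)
    b_i = min(mean_distance_to(c) for c in set(labels) if c != own)
    return (b_i - a_i) / max(a_i, b_i)

# Hypothetical one-dimensional data, written as 1-tuples, with two clusters.
points = [(1.0,), (2.0,), (4.0,), (5.0,)]
labels = [0, 0, 1, 1]
scores = [silhouette(points, labels, i) for i in range(len(points))]
average_silhouette = sum(scores) / len(scores)
```

Averaging the scores over the points of one cluster, or over all points, gives a silhouette value for that cluster or for the whole clustering.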

SLIDE 19

The Silhouette Coefficient

  • The Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as for clusters and clusterings

SLIDE 20

Measuring Cluster Validity via Correlation

If we are given the similarity matrix for a data set and the cluster labels from a cluster analysis of the data set, then we can evaluate the "goodness" of the clustering by looking at the correlation between the similarity matrix and an ideal version of the similarity matrix based on the cluster labels

  • Similarity/Proximity Matrix
  • Ideal Matrix
    • One row and one column for each data point
    • An entry is 1 if the associated pair of points belongs to the same cluster
    • An entry is 0 if the associated pair of points belongs to different clusters

SLIDE 21

Measuring Cluster Validity via Correlation

  • Compute the correlation between the two matrices
    • Since the matrices are symmetric, only the correlation between the n(n − 1)/2 entries above the diagonal needs to be calculated
  • High correlation indicates that points that belong to the same cluster are close to each other
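
A small sketch of this procedure is given below. Several choices are assumptions made for the example: the data are one-dimensional, similarity is defined as 1 minus the distance divided by the largest distance, the correlation is Pearson's, and the points and labels are invented. Only the n(n − 1)/2 pairs with i < j are used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (undefined if either is constant)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def validity_correlation(points, labels):
    """Correlation between the similarity matrix and the ideal matrix of a clustering."""
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]  # upper triangle only
    dists = [abs(points[i] - points[j]) for i, j in pairs]
    d_max = max(dists)
    similarity = [1 - d / d_max for d in dists]  # assumed similarity definition
    ideal = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    return pearson(similarity, ideal)

points = [1, 2, 4, 5]  # hypothetical data
natural = validity_correlation(points, [0, 0, 1, 1])    # groups nearby points together
scrambled = validity_correlation(points, [0, 1, 0, 1])  # mixes near and far points
```

On this toy data the labelling that groups nearby points yields a clearly higher correlation than the scrambled one.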

SLIDE 22

Using Similarity Matrix for Cluster Validation

  • Order the similarity matrix with respect to cluster labels and inspect visually

SLIDE 23

Using Similarity Matrix for Cluster Validation

  • Order the similarity matrix with respect to cluster labels and inspect visually

SLIDE 24

Using Similarity Matrix for Cluster Validation

  • Clusters in random data are not so crisp

SLIDE 25

Using Similarity Matrix for Cluster Validation

  • Clusters in random data are not so crisp

SLIDE 26

Using Similarity Matrix for Cluster Validation

  • Clusters in random data are not so crisp

SLIDE 27

Finding the Correct Number of Clusters

  • Look for the number of clusters for which there is a knee, peak, or dip in the plot of the evaluation measure when it is plotted against the number of clusters
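
One way to make the "knee" idea concrete is sketched below. It is only a heuristic chosen for illustration, not a method given in the slides: the SSE curve is computed by brute force over contiguous splits of sorted one-dimensional data, and the knee is taken to be the K with the largest second difference of that curve. The data and function names are assumptions.

```python
from itertools import combinations

def sse(cluster):
    centre = sum(cluster) / len(cluster)
    return sum((x - centre) ** 2 for x in cluster)

def best_sse(points, k):
    """Smallest total SSE over all splits of the sorted 1-D points into k contiguous groups."""
    pts = sorted(points)
    best = float("inf")
    for cuts in combinations(range(1, len(pts)), k - 1):
        bounds = [0, *cuts, len(pts)]
        total = sum(sse(pts[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best

def knee(curve):
    """K at which the curve bends most sharply; curve[i] is the SSE for K = i + 1."""
    bend = {i + 1: curve[i - 1] - 2 * curve[i] + curve[i + 1]
            for i in range(1, len(curve) - 1)}
    return max(bend, key=bend.get)

points = [1, 2, 4, 5]  # hypothetical data
curve = [best_sse(points, k) for k in range(1, len(points) + 1)]
best_k = knee(curve)
```

For these four points the SSE drops sharply from K=1 to K=2 and only slightly afterwards, so the heuristic picks K=2.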

SLIDE 28

Finding the Correct Number of Clusters

  • Of course, this isn’t always easy...

SLIDE 29

Framework for Cluster Validity

  • Need a framework to interpret any measure
    • For example, if our measure of evaluation has the value "10", is that good, fair, or poor?
  • Statistics provide a framework for cluster validity
    • The more atypical a clustering result is, the more likely it represents valid structure in the data
    • Can compare the values of an index that result from random data or clusterings to those of a clustering result: if the value of the index is unlikely to arise from random data, then the cluster results are considered valid
    • These approaches are more complicated and harder to understand
  • For comparing the results of two different sets of cluster analyses, a framework is less necessary
    • However, there is the question of whether the difference between two index values is significant

SLIDE 30

Statistical Framework for SSE

  • Example
    • Compare an SSE of 0.005 against the SSE of three clusters found in random data
    • Histogram shows the SSE of three clusters in 500 sets of random data points of size 100 distributed over the range 0.2–0.8 for x and y values
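
The experiment described above can be approximated with the following sketch. Much of it is assumed rather than taken from the slides: the random points are drawn uniformly, clustering is a plain K-means with random initial centroids and a capped number of iterations, SSE is the unnormalised sum of squared distances (the scale used in the original histogram may differ), and the seed and function names are arbitrary.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sse(points, k, rng, max_iter=15):
    """Run a basic K-means once and return the SSE of the final clustering."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[c]
                   for c, cl in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def random_sse_distribution(n_sets=500, n_points=100, k=3, low=0.2, high=0.8, seed=0):
    """SSE of k clusters found in n_sets data sets of uniformly random 2-D points."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_sets):
        pts = [(rng.uniform(low, high), rng.uniform(low, high)) for _ in range(n_points)]
        values.append(kmeans_sse(pts, k, rng))
    return values

observed = 0.005  # the SSE value quoted in the example
null_sse = random_sse_distribution()
# Empirical p-value: share of random data sets whose SSE is at least as low as the observed one.
p_value = sum(v <= observed for v in null_sse) / len(null_sse)
```

If essentially no random data set reaches an SSE as low as the observed one, the observed clustering is unlikely to be an artefact of random structure.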

SLIDE 31

Statistical Framework for Correlation

  • Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets

SLIDE 32

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage." Algorithms for Clustering Data, Jain and Dubes

SLIDE 33

Bibliography

  • Slides about clustering for the Data Mining course
    • prof. Salvatore Orlando
  • Tan, Steinbach, Kumar: "Introduction to Data Mining", Ch. 8, http://www-users.cs.umn.edu/~kumar/dmbook/index.php

SLIDE 34
  • The end (really!)
