Machine Learning
Lecture Notes on Clustering (IV) 2016-2017
Davide Eynard
davide.eynard@usi.ch
Institute of Computational Science, Università della Svizzera italiana
Lecture outline
Cluster Evaluation
validate) clusters
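For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters?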
Cluster found in random data
"Clusters are in the eye of the beholder"
Why evaluate?
distinguish whether non-random structure actually exists in the data
without reference to external information
results, such as externally provided class labels
Note:
Open challenges
Cluster evaluation has a number of challenges:
applicability
2- or 3-dimensional data
use it
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
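Internal index: measures the goodness of a clustering structure without respect to external information
External index: measures the extent to which cluster labels match externally supplied class labels
Relative index: compares two different clusterings or clusters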
Internal or external indices (e.g. SSE or entropy) can be used to evaluate a single clustering/cluster or to compare two different ones. In the latter case, they are used as relative indices.
External Measures
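Entropy: the degree to which each cluster consists of objects of a single class
For cluster i, compute the probability that a member of cluster i belongs to class j, as $p_{ij} = m_{ij}/m_i$, where $m_i$ is the number of objects in cluster i and $m_{ij}$ is the number of objects of class j in cluster i
The entropy of each cluster i is $e_i = -\sum_{j=1}^{L} p_{ij} \log_2 p_{ij}$, where L is the number of classes
The total entropy is $e = \sum_{i=1}^{K} \frac{m_i}{m} e_i$, where K is the number of clusters and m is the total number of data points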
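As a rough illustration of the entropy formulas above, the following sketch computes the per-cluster and total entropy from a table of counts. The function name `cluster_entropies` and the example table are hypothetical, and the code assumes that no cluster is empty.

```python
import math

def cluster_entropies(counts):
    """counts[i][j] = m_ij, the number of objects of class j in cluster i.

    Assumes every cluster (row) contains at least one object.
    """
    m = sum(sum(row) for row in counts)  # total number of data points
    per_cluster = []
    total = 0.0
    for row in counts:
        m_i = sum(row)  # size of cluster i
        # Terms with p_ij = 0 are skipped (0 * log 0 is taken as 0).
        e_i = -sum((m_ij / m_i) * math.log2(m_ij / m_i) for m_ij in row if m_ij > 0)
        per_cluster.append(e_i)
        total += (m_i / m) * e_i  # weight each cluster by its relative size
    return per_cluster, total

# Hypothetical table: 2 clusters (rows) x 2 classes (columns).
counts = [[5, 5], [10, 0]]
per_cluster, total = cluster_entropies(counts)
print(per_cluster, total)
```

A cluster split evenly between two classes gets entropy 1, while a cluster holding a single class gets entropy 0, so lower values indicate a better match with the class labels.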
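Purity: another measure of the extent to which a cluster contains objects of a single class
The purity of cluster i is $p_i = \max_j p_{ij}$
The overall purity is $purity = \sum_{i=1}^{K} \frac{m_i}{m} p_i$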
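The size-weighted average above can be sketched in a few lines. The function name `purity` and the counts table are hypothetical, and `counts[i][j]` is assumed to hold the number of objects of class j in cluster i, with no empty clusters.

```python
def purity(counts):
    """Return (per-cluster purities p_i, overall purity) for a counts table."""
    m = sum(sum(row) for row in counts)  # total number of data points
    # p_i = max_j p_ij = (largest class count in cluster i) / (cluster size)
    per_cluster = [max(row) / sum(row) for row in counts]
    overall = sum((sum(row) / m) * p_i for row, p_i in zip(counts, per_cluster))
    return per_cluster, overall

# Hypothetical table: 2 clusters (rows) x 2 classes (columns).
counts = [[8, 2], [1, 9]]
per_cluster, overall = purity(counts)
print(per_cluster, overall)
```

Unlike entropy, higher purity is better; a value of 1 means every cluster contains objects of one class only.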
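Precision: the fraction of a cluster that consists of objects of a specified class
$precision(i, j) = p_{ij}$
Recall: the extent to which a cluster contains all the objects of a specified class
$recall(i, j) = m_{ij}/m_j$, where $m_j$ is the number of objects in class j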
F-measure: a combination of precision and recall that measures the extent to which a cluster contains only objects of a particular class and all objects of that class
$F(i, j) = \frac{2 \times precision(i, j) \times recall(i, j)}{precision(i, j) + recall(i, j)}$
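Precision, recall and the F-measure for one (cluster, class) pair can be read off the same kind of counts table. This is a minimal sketch: the function name `f_measure` and the example table are hypothetical, and `counts[i][j]` is assumed to be the number of objects of class j in cluster i.

```python
def f_measure(counts, i, j):
    """Return (precision, recall, F) of cluster i with respect to class j."""
    m_ij = counts[i][j]
    m_i = sum(counts[i])                 # size of cluster i
    m_j = sum(row[j] for row in counts)  # size of class j
    precision = m_ij / m_i
    recall = m_ij / m_j
    if precision + recall == 0:
        return precision, recall, 0.0    # avoid dividing by zero
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Hypothetical table: 2 clusters (rows) x 2 classes (columns).
counts = [[8, 2], [1, 9]]
p, r, f = f_measure(counts, 0, 0)
print(p, r, f)
```

Here cluster 0 holds 8 of the 9 objects of class 0 (recall 8/9) and 8 of its 10 members belong to that class (precision 0.8), which gives F = 16/19.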
External Measures: example
Internal measures: Cohesion and Separation
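Cohesion measures how closely related the objects in a cluster are
$cohesion(C_i) = \sum_{x \in C_i,\, y \in C_i} proximity(x, y)$
$cohesion(C_i) = \sum_{x \in C_i} proximity(x, c_i)$, where $c_i$ is the centroid of $C_i$
Separation measures how distinct or well separated a cluster is from other clusters
$separation(C_i, C_j) = \sum_{x \in C_i,\, y \in C_j} proximity(x, y)$
$separation(C_i, C_j) = proximity(c_i, c_j)$
$separation(C_i) = proximity(c_i, c)$, where c is the overall centroid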
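The pairwise and centroid-based definitions above can be compared on a toy example. The sketch below assumes squared Euclidean distance as the proximity function, which is only one possible choice; all function names and the two example clusters are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def cohesion_pairwise(cluster, proximity=sq_dist):
    # Sum over all ordered pairs (x, y) of points in the same cluster.
    return sum(proximity(x, y) for x in cluster for y in cluster)

def cohesion_centroid(cluster, proximity=sq_dist):
    c = centroid(cluster)
    return sum(proximity(x, c) for x in cluster)

def separation_pairwise(ci, cj, proximity=sq_dist):
    return sum(proximity(x, y) for x in ci for y in cj)

def separation_centroid(ci, cj, proximity=sq_dist):
    return proximity(centroid(ci), centroid(cj))

# Hypothetical 1-dimensional clusters.
A = [(1.0,), (2.0,)]
B = [(4.0,), (5.0,)]
print(cohesion_pairwise(A), cohesion_centroid(A))
print(separation_pairwise(A, B), separation_centroid(A, B))
```

With a distance as proximity, small cohesion values and large separation values are desirable; with a similarity the reading is reversed.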
Cohesion and separation example
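Cohesion is measured by the within-cluster sum of squares: $WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$
Separation is measured by the between-cluster sum of squares: $BSS = \sum_{i} |C_i| (m - m_i)^2$
where $|C_i|$ is the size of cluster i, $m_i$ is the mean of cluster i and m is the overall mean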
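K = 1 cluster (points 1, 2, 4, 5; m = 3):
$WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10$
$BSS = 4 \times (3 - 3)^2 = 0$
$Total = 10 + 0 = 10$
K = 2 clusters ({1, 2} and {4, 5}; $m_1 = 1.5$, $m_2 = 4.5$):
$WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1$
$BSS = 2 \times (3 - 1.5)^2 + 2 \times (4.5 - 3)^2 = 9$
$Total = 1 + 9 = 10$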
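The two worked cases can be checked numerically. The sketch below recomputes WSS and BSS for the same four points; the function name `wss_bss` is hypothetical and the code handles 1-dimensional data only.

```python
def wss_bss(clusters):
    """Return (WSS, BSS) for a list of clusters of 1-dimensional points."""
    all_points = [x for c in clusters for x in c]
    m = sum(all_points) / len(all_points)  # overall mean
    wss = 0.0
    bss = 0.0
    for c in clusters:
        m_i = sum(c) / len(c)  # mean of this cluster
        wss += sum((x - m_i) ** 2 for x in c)
        bss += len(c) * (m - m_i) ** 2
    return wss, bss

one_cluster = wss_bss([[1, 2, 4, 5]])      # K = 1
two_clusters = wss_bss([[1, 2], [4, 5]])   # K = 2
print(one_cluster)   # (10.0, 0.0)
print(two_clusters)  # (1.0, 9.0)
```

In both cases WSS + BSS equals 10, the total sum of squares of the data around the overall mean, so lowering WSS is the same as raising BSS.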
Evaluating individual clusters and Objects
individual clusters and objects
better than a cluster with a lower one
clustering
contribution to the overall cohesion or separation of the cluster
The Silhouette Coefficient
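The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings
For an individual point i:
$a_i$ = average distance of i to the other points in its cluster
$b_i$ = min (average distance of i to the points in another cluster)
$s_i = (b_i - a_i)/\max(a_i, b_i)$
$s_i$ ranges from -1 to 1; the closer to 1, the better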
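A minimal sketch of the per-point silhouette computation follows. It assumes 1-dimensional points with absolute difference as the distance, at least two clusters, and a score of 0 for points in singleton clusters (a common convention, not something stated in the slides). The function name `silhouette_scores` is hypothetical.

```python
def silhouette_scores(points, labels, dist=lambda x, y: abs(x - y)):
    """Return the silhouette coefficient s_i of every point."""
    clusters = set(labels)
    scores = []
    for i, x in enumerate(points):
        def mean_dist(label):
            # Average distance from point i to the other points with this label.
            ds = [dist(x, y) for k, y in enumerate(points)
                  if labels[k] == label and k != i]
            return sum(ds) / len(ds) if ds else None

        a = mean_dist(labels[i])
        if a is None:            # singleton cluster: assumed convention s_i = 0
            scores.append(0.0)
            continue
        b = min(mean_dist(l) for l in clusters if l != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

scores = silhouette_scores([1, 2, 4, 5], [0, 0, 1, 1])
print(scores)
print(sum(scores) / len(scores))  # average silhouette of the clustering
```

Averaging the scores over a cluster or over all points gives the silhouette of that cluster or of the whole clustering.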
Measuring Cluster Validity via Correlation
If we are given the similarity matrix for a data set and the cluster labels from a cluster analysis of the data set, then we can evaluate the "goodness" of the clustering by looking at the correlation between the similarity matrix and an ideal version of the similarity matrix based on the cluster labels
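The ideal similarity matrix has one row and one column for each data point
An entry is 1 if the associated pair of points belongs to the same cluster
An entry is 0 if the associated pair of points belongs to different clusters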
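Since the matrices are symmetric, only the correlation between the n(n - 1)/2 entries above the diagonal needs to be calculated
High correlation indicates that points that belong to the same cluster are close to each other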
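The correlation between the actual and the ideal similarity matrix can be sketched as follows. Two assumptions are made here that do not come from the slides: the data is 1-dimensional, and similarity is derived from distance as 1 - d/d_max. The function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def validity_correlation(points, labels):
    n = len(points)
    # Only the n(n - 1)/2 pairs above the diagonal are needed (symmetry).
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    dists = [abs(points[i] - points[j]) for i, j in pairs]
    d_max = max(dists)
    similarity = [1 - d / d_max for d in dists]  # assumed similarity measure
    ideal = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    return pearson(similarity, ideal)

points = [1, 2, 4, 5]
good = validity_correlation(points, [0, 0, 1, 1])
bad = validity_correlation(points, [0, 1, 0, 1])
print(good, bad)
```

The labelling that groups nearby points yields a clearly positive correlation, while the labelling that mixes them yields a negative one.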
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect it visually
Finding the Correct Number of Clusters
Look for a knee, peak, or dip in the plot of the evaluation measure when it is plotted against the number of clusters
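To illustrate the idea of reading the number of clusters off such a plot, the sketch below computes the best achievable SSE for K = 1..5 on a small hypothetical 1-dimensional data set. Instead of K-means it uses an exhaustive search over contiguous splits of the sorted values, which is exact in one dimension but only practical for tiny inputs. All names and data values are assumptions made for this example.

```python
from itertools import combinations

def sse(segment):
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** 2 for x in segment)

def best_sse_1d(values, k):
    """Smallest SSE over all splits of the sorted values into k contiguous groups."""
    xs = sorted(values)
    n = len(xs)
    best = float("inf")
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        total = sum(sse(xs[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best

# Hypothetical data with three visible groups.
data = [0.8, 1.0, 1.2, 4.9, 5.0, 5.1, 8.8, 9.0, 9.2]
curve = {k: best_sse_1d(data, k) for k in range(1, 6)}
for k, value in curve.items():
    print(k, round(value, 3))
```

The SSE drops sharply up to K = 3 and barely changes afterwards, which is the knee one would look for in the plot.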
Framework for Cluster Validity
poor?
in the data
those of a clustering result: if the value of the index is unlikely, then the cluster results are valid
framework is less necessary
is significant
Statistical Framework for SSE
100 distributed over the range 0.2–0.8 for x and y values
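The comparison against random data can be sketched as a small Monte Carlo experiment. Everything specific below is an assumption made for illustration: 200 random data sets, K = 3, a bare-bones K-means with deterministic farthest-first initialisation, and a hypothetical clustered data set made of three tight Gaussian blobs.

```python
import random

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans_sse(points, k, iters=50):
    """SSE of a simple K-means run (farthest-first initialisation, an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[idx].append(p)
        new_centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

rng = random.Random(0)

def random_points(n=100, lo=0.2, hi=0.8):
    # n points with x and y drawn uniformly from the range lo..hi
    return [(rng.uniform(lo, hi), rng.uniform(lo, hi)) for _ in range(n)]

# Reference distribution: SSE of three clusters found in purely random data.
null_sse = [kmeans_sse(random_points(), 3) for _ in range(200)]

# Hypothetical clustered data: three tight blobs, 100 points in total.
blobs = [((0.3, 0.3), 34), ((0.7, 0.3), 33), ((0.5, 0.7), 33)]
clustered = [(rng.gauss(cx, 0.02), rng.gauss(cy, 0.02))
             for (cx, cy), size in blobs for _ in range(size)]
observed = kmeans_sse(clustered, 3)

# Fraction of random data sets whose SSE is at least as low as the observed one.
p_value = sum(s <= observed for s in null_sse) / len(null_sse)
print(observed, min(null_sse), p_value)
```

An observed SSE far below anything seen on random data suggests that the structure found is unlikely to be an artifact of the algorithm.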
Statistical Framework for Correlation
clusterings of the following two data sets
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage." Algorithms for Clustering Data, Jain and Dubes
Bibliography
http://www-users.cs.umn.edu/~kumar/dmbook/index.php