Data Mining:
Concepts and Techniques
Cluster Analysis
Li Xiong
Slide credits: Jiawei Han and Micheline Kamber; Tan, Steinbach, Kumar
Chapter 7. Cluster Analysis: Overview
Finding groups of objects (clusters) such that:
Objects in the same group are similar to one another
Objects in different groups are different from the objects in other groups
Unsupervised learning
Intra-cluster distances are minimized; inter-cluster distances are maximized
Marketing research, social network analysis
WWW: Documents and search results clustering
Earthquake studies
Example: gene expression levels at three time points

          Time X   Time Y   Time Z
Gene 1      10        8       10
Gene 2      10        9
Gene 3       4        8.6      3
Gene 4       7        8        3
Gene 5       1        2        3
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Ability to deal with noise and outliers
Ability to deal with high dimensionality
Minimal requirements for domain knowledge to determine input parameters
Incorporation of user-specified constraints
Interpretability and usability
Agreement with "ground truth"
A good clustering will produce high-quality clusters with:
Homogeneity: high intra-class similarity
Separation: low inter-class similarity
Intra-cluster distances are minimized; inter-cluster distances are maximized
Similarity or Dissimilarity between Data Objects
Data matrix: n objects, each described by p variables, x_i = (x_i1, ..., x_if, ..., x_ip)

Manhattan distance:
d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|

Euclidean distance:
d(i, j) = sqrt(|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + ... + |x_ip − x_jp|^2)

Minkowski distance (order q):
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q)^(1/q)

Weighted variants multiply each variable's term by a weight w_f
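As an illustration (not part of the original slides), here is a minimal NumPy sketch of these distances; the function name and the sample vectors are my own.

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance of order q; q=1 gives Manhattan, q=2 Euclidean."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

xi = [1.0, 2.0, 3.0]
xj = [4.0, 0.0, 3.0]
print(minkowski(xi, xj, q=1))  # Manhattan: 5.0
print(minkowski(xi, xj, q=2))  # Euclidean: ~3.606
```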
Other Similarity or Dissimilarity Metrics
Pearson correlation, cosine measure, KL divergence, Bregman divergence, ...

Cosine measure:
sim(X_i, X_j) = (X_i · X_j) / (||X_i|| ||X_j||)

Pearson correlation:
r(X_i, X_j) = Σ_f (x_if − mean(X_i))(x_jf − mean(X_j)) / ((p − 1) σ_i σ_j)
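A small illustrative sketch of the cosine and Pearson measures in NumPy (function names are my own, not from the slides):

```python
import numpy as np

def cosine_sim(x, y):
    """Cosine of the angle between two vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson_corr(x, y):
    """Pearson correlation, i.e., cosine similarity of the centered vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
```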
To compute d(i, j) when variables are of mixed types, combine a per-variable contribution for each variable f:
f is continuous (interval-scaled): use the distance |x_if − x_jf|, with normalization if necessary, and a logarithmic transformation y_if = log(x_if) for ratio-scaled values (values following x_if = A e^(Bt))
f is ordinal: map the value to its rank r_if and use z_if = (r_if − 1) / (M_f − 1), where M_f is the number of ordered states
f is categorical: use a simple matching function, d = 0 if x_if = x_jf, or 1 otherwise
Hamming distance (edit distance) for strings
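The per-variable treatment above can be sketched in Python. This is an illustrative Gower-style average of per-variable contributions, not the exact formula from the slides; the function name and argument layout are my own.

```python
def mixed_dissim(xi, xj, types, ranges=None, levels=None):
    """Average per-variable dissimilarity for two mixed-type objects (illustrative).

    types[f]  : 'continuous', 'ordinal', or 'categorical'
    ranges[f] : (min_f, max_f) for continuous variables
    levels[f] : M_f, the number of ordered states, for ordinal variables
    """
    total = 0.0
    for f, (a, b) in enumerate(zip(xi, xj)):
        kind = types[f]
        if kind == 'continuous':
            lo, hi = ranges[f]
            total += abs(a - b) / (hi - lo)       # normalized |x_if - x_jf|
        elif kind == 'ordinal':
            m = levels[f]
            za = (a - 1.0) / (m - 1.0)            # z_if = (r_if - 1) / (M_f - 1)
            zb = (b - 1.0) / (m - 1.0)
            total += abs(za - zb)
        else:                                     # categorical: simple matching
            total += 0.0 if a == b else 1.0
    return total / len(xi)
```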
Clusters are constructed by optimizing some criterion, e.g., minimizing the sum of square errors
Partitioning method: partition a database D of n objects into a set of k clusters such that a partitioning criterion is optimized, e.g., the sum of squared distances is minimized:
E = Σ_{i=1..k} Σ_{p ∈ C_i} (p − c_i)², where c_i is the representative of cluster C_i
Global optimal: exhaustively enumerate all partitions
Heuristic methods: the k-means and k-medoids algorithms
k-means (MacQueen'67): each cluster is represented by the center of the cluster
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
Given k, randomly choose k initial cluster centers
Partition objects into k nonempty subsets by assigning each object to the cluster with the nearest centroid
Update the centroids, i.e., the mean point of each cluster
Go back to Step 2; stop when there are no more new assignments (see the sketch below)
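A minimal Python/NumPy sketch of these steps (illustrative; the function name and the handling of empty clusters are my own choices):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # step 1: initial centers
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                  # step 2: assign to nearest
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])  # step 3: update means
        if np.allclose(new_centroids, centroids):                  # stop: no change
            break
        centroids = new_centroids
    return labels, centroids
```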
Example
K = 2: arbitrarily choose K objects as initial cluster centers
Assign each object to the most similar center
Update the cluster means; reassign
Repeat until no object changes cluster
K-means Clustering – Details
Initial centroids are often chosen randomly; the clusters produced can vary from one run to another
The centroid is (typically) the mean of the points in the cluster
"Closeness" is measured by Euclidean distance, cosine similarity, correlation, etc.
Most of the convergence happens in the first few iterations; the stopping condition is often relaxed to "until relatively few points change clusters"
Complexity: O(tkn), where n is # objects, k is # clusters, and t is # iterations
Strengths: simple and works well for "regular" disjoint clusters; relatively efficient and scalable (normally k, t << n)
Weaknesses:
Need to specify k, the number of clusters, in advance
Depending on the initial centroids, may terminate at a local optimum (a potential solution is multiple runs with different initial centroids)
Unable to handle noisy data and outliers
Not suitable for clusters of different sizes or non-convex shapes
Importance of Choosing Initial Centroids – Case 1
(Figure: iterations 1–6 of k-means from one choice of initial centroids.)
Importance of Choosing Initial Centroids – Case 2
(Figure: iterations 1–5 of k-means from a different choice of initial centroids.)
Limitations of K-means: Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means: Non-convex Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Variants of k-means differ in: selection of the initial k means; dissimilarity calculations; strategies to calculate cluster means
Handling categorical data (k-modes):
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
PAM (Kaufman and Rousseeuw, 1987)
Each cluster is represented by one of its objects, the medoid
Objective: minimize the total sum of absolute error (total distance of objects to their medoid)
Each iteration considers swaps between the k medoids and the (n − k) remaining instances, compared pair-wise
K = 2: arbitrarily choose k objects as initial medoids (total cost = 20)
Assign each remaining object to the nearest medoid
Randomly select a non-medoid object, O_random
Compute the total cost of swapping a medoid with O_random (total cost = 26)
Swap O and O_random if the quality is improved
Do loop until no change
Pam is more robust than k-means in the presence of
noise and outliers
Pam works efficiently for small data sets but does not
scale well for large data sets.
Complexity: O(k(n − k)² t), where n is # of data points, k is # of clusters, and t is # of iterations (a sketch of the swap step follows below)
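A hedged sketch of PAM's swap step, assuming a precomputed distance matrix D; the function names and the greedy loop structure are my own simplification, not the exact procedure from the slides.

```python
import numpy as np

def total_cost(D, medoids):
    """Sum over all objects of the distance to the nearest medoid.
    D is a precomputed n x n distance matrix."""
    return D[:, medoids].min(axis=1).sum()

def pam_swap_pass(D, medoids):
    """Try every (medoid, non-medoid) exchange and keep it if the total cost decreases."""
    n = D.shape[0]
    medoids = list(medoids)
    best = total_cost(D, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(D, candidate)
                if cost < best:                      # accept the swap only if it improves quality
                    best, medoids, improved = cost, candidate, True
    return medoids, best
```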
Sampling-based method: CLARA (Clustering LARge Applications)
CLARA (Kaufmann and Rousseeuw in 1990) It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM Weakness:
Efficiency depends on the sample size A good clustering based on samples will not
necessarily represent a good clustering of the whole data set if the sample is biased
CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a set of k medoids
PAM examines neighbors for local minimum CLARA works on subgraphs of samples CLARANS examines neighbors dynamically
If local optimum is found, starts with new randomly selected
node in search for a new local optimum
Produces a set of nested clusters organized as a
hierarchical tree
Can be visualized as a dendrogram
A tree like diagram representing a hierarchy of nested
clusters
Clustering obtained by cutting at desired level
Do not have to assume any particular number of
clusters
May correspond to meaningful taxonomies
Two main types of hierarchical clustering
Agglomerative: start with the points as individual clusters and, at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
Divisive: start with one all-inclusive cluster and, at each step, split a cluster until each cluster contains a single point (or there are k clusters)
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains (see the sketch after this list)
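As an illustration (not from the slides), this whole merge loop is available in SciPy; the sample data and the choices of single link and three clusters here are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(20, 2)                 # 20 example points in 2-D
dists = pdist(X)                          # step 1: condensed proximity matrix
Z = linkage(dists, method='single')       # steps 2-6: repeatedly merge the closest clusters
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the resulting dendrogram into 3 clusters
print(labels)
```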
Start with clusters of individual points and a
proximity matrix
(Figures: the proximity matrix over points p1–p5, and the updated matrix after some clusters C1–C5 have been merged.)
How should inter-cluster similarity be defined?
Single link (MIN): smallest distance between points in the two clusters
Complete link (MAX): largest distance between points in the two clusters
Average link: average distance between points in the two clusters
Centroid: distance between the cluster centroids
Nested Clusters Dendrogram
Start with a tree that consists of any single point
In successive steps, look for the closest pair of points (p, q) such that p is in the current tree but q is not
Add q to the tree and put an edge between p and q
Min vs. Max vs. Group Average
(Figures: original points and the two clusters found, illustrating the strengths and limitations of MIN and MAX linkage.)
Less susceptible to noise and outliers
Biased towards globular clusters
Hierarchical Clustering: Major Weaknesses
Do not scale well (N: number of points): space complexity O(N²); time complexity O(N³), reduced to O(N² log N) for some cases/approaches
Cannot undo what was done previously
Quality varies in terms of distance measures; often biased towards globular clusters
BIRCH (1996): uses CF-tree and incrementally adjusts the
quality of sub-clusters
CURE(1998): uses representative points for inter-cluster
distance
ROCK (1999): clustering categorical data by neighbor and
link analysis
CHAMELEON (1999): hierarchical clustering using dynamic
modeling
Birch: Balanced Iterative Reducing and Clustering using
Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD’96)
Main ideas:
Use in-memory clustering feature to summarize
data/cluster
Minimize database scans and I/O cost
Use hierarchical clustering for microclustering and
macroclustering
Fix the problems of hierarchical clustering
Features:
Scales linearly: single scan and improves the quality
with a few additional scans
Weakness: handles only numeric data and is sensitive to the order of the data records
Centroid: the "middle" of a cluster (the mean of its points)
Radius: average distance from member points to the centroid
Diameter: average pair-wise distance within a cluster
Clustering Feature: CF = (N, LS, SS), where N is the number of points, LS is their linear sum, and SS is their square sum
Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190))
A CF entry is compact: it stores significantly less than all of the data points in the sub-cluster
A CF entry has sufficient information to compute the centroid, radius, and diameter of the sub-cluster
The additivity theorem allows us to merge two sub-clusters by simply adding their CF entries
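A small sketch of a CF entry and the quantities derivable from it, using the per-dimension square sum shown in the example above; the function names are my own, and the radius/diameter formulas are the standard identities implied by (N, LS, SS).

```python
import numpy as np

def cf(points):
    """Clustering Feature CF = (N, LS, SS) for a set of points."""
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def centroid_radius_diameter(N, LS, SS):
    ss_total = SS.sum()                         # sum of squared norms of all points
    centroid = LS / N
    radius = np.sqrt((ss_total - LS @ LS / N) / N)
    diameter = np.sqrt((2 * N * ss_total - 2 * LS @ LS) / (N * (N - 1)))
    return centroid, radius, diameter

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
N, LS, SS = cf(pts)                             # (5, [16, 30], [54, 190])
print(centroid_radius_diameter(N, LS, SS))

# Additivity: the CF of two disjoint sub-clusters is the component-wise sum of their CFs
N1, LS1, SS1 = cf(pts[:2]); N2, LS2, SS2 = cf(pts[2:])
merged = (N1 + N2, LS1 + LS2, SS1 + SS2)        # equals cf(pts)
```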
A CF tree is a height-balanced tree that stores the
clustering features for a hierarchical clustering
A nonleaf node in a tree has descendants or “children” The nonleaf nodes store sums of the CFs of their
children
A CF tree has two parameters
Branching factor: the maximum number of children per nonleaf node
Threshold T: the maximum diameter of sub-clusters stored at the leaf nodes
(Figure: structure of a CF tree. The root and nonleaf nodes hold entries CF1, CF2, ... with child pointers child1, child2, ...; leaf nodes hold CF entries and are chained together by prev/next pointers.)
Traverse down from root, find the appropriate
leaf
Follow the "closest"-CF path, w.r.t. intra-
cluster distance measures
Modify the leaf
If the closest-CF leaf cannot absorb, make a
new CF entry.
If there is no room for new leaf, split the
parent node
Traverse back & up
Updating CFs on the path or splitting nodes
Phase 1: Scan database to build an initial in-
memory CF-tree
Subsequent phases become fast, accurate, less order
sensitive
Phase 2: Condense data (optional)
Rebuild the CF-tree with a larger T
Phase 3: Global clustering
Use existing clustering algorithm on CF entries Helps fix problem where natural clusters span nodes
Phase 4: Cluster refining (optional)
Do additional passes over the dataset & reassign data
points to the closest centroid from phase 3
CURE: An Efficient Clustering Algorithm for Large Databases (1998), Sudipto Guha, Rajeev Rastogi, Kyuseok Shim
Main ideas:
Use representative points for inter-cluster distance Random sampling and partitioning
Features:
Handles non-spherical shapes and arbitrary sizes
better
Uses a number of points to represent a cluster Representative points are found by selecting a constant
number of points from a cluster and then “shrinking” them toward the center of the cluster
How to shrink? Move each representative point toward the center of the cluster by a fraction α (the shrinking factor)
Cluster similarity is the similarity of the closest pair of
representative points from different clusters
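An illustrative sketch of the shrinking step and of CURE's inter-cluster distance; the parameter name alpha and its default value are my own assumptions, not values from the slides.

```python
import numpy as np

def shrink_representatives(reps, centroid, alpha=0.3):
    """Move each representative point a fraction alpha toward the cluster centroid."""
    reps = np.asarray(reps, float)
    centroid = np.asarray(centroid, float)
    return reps + alpha * (centroid - reps)

def cluster_distance(reps_a, reps_b):
    """CURE's inter-cluster distance: distance of the closest pair of representative points."""
    reps_a, reps_b = np.asarray(reps_a, float), np.asarray(reps_b, float)
    d = np.linalg.norm(reps_a[:, None, :] - reps_b[None, :, :], axis=2)
    return d.min()
```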
Experimental Results: CURE
Pictures from CURE (Guha, Rastogi, Shim), comparing CURE against centroid-based and single-link hierarchical clustering.
Original Points CURE
ROCK: RObust Clustering using linKs
Major ideas
Use links to measure similarity/proximity Sampling-based clustering
Features:
More meaningful clusters; emphasizes interconnectivity but ignores proximity
Example transactions: ..., {b, d, e}, {c, d, e}
Jaccard similarity between transactions: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
e.g., Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
and Sim = |{c, f}| / |{a, b, c, f}| = 2/4 = 0.5
Link measure: two points P1 and P3 are neighbors if Sim(P1, P3) ≥ θ, where the similarity is computed from the original similarities (e.g., by the Jaccard coefficient)
The number of common neighbors ("links") between two points is then used as the similarity measure, and clusters are merged until the desired clusters have been found
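A hedged sketch of the Jaccard similarity and the link (common-neighbor) computation; the function names, the threshold argument theta, and the toy transactions are my own.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def links(transactions, theta):
    """Number of common neighbors between every pair of transactions.
    Two transactions are neighbors if their Jaccard similarity >= theta.
    (For simplicity, a transaction counts as its own neighbor here.)"""
    n = len(transactions)
    neighbor = [[jaccard(transactions[i], transactions[j]) >= theta
                 for j in range(n)] for i in range(n)]
    return [[sum(neighbor[i][k] and neighbor[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]

T1, T2 = {'a', 'b', 'c'}, {'c', 'd', 'e'}
print(jaccard(T1, T2))   # 1/5 = 0.2
```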
Uses the proximity graph
Start with the proximity matrix
Consider each point as a node in a graph
Each edge between two nodes has a weight, which is the proximity between the two points
Fully connected proximity graph
MIN (single-link) and MAX (complete-link)
Sparsification
Clusters are connected components in the
graph
CHAMELEON
Overall framework: Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Preprocessing Step:
Represent the Data by a Graph
Given a set of points, construct the k-nearest-neighbor
(k-NN) graph to capture the relationship between a point and its k nearest neighbors
Concept of neighborhood is captured dynamically
(even if region is sparse)
Phase 1: Use a multilevel graph partitioning algorithm on
the graph to find a large number of clusters of well- connected vertices
Each cluster should contain mostly points from one
“true” cluster, i.e., is a sub-cluster of a “real” cluster
Phase 2: Use Hierarchical Agglomerative Clustering to
merge sub-clusters
Two clusters are combined if the resulting cluster
shares certain properties with the constituent clusters
Two key properties used to model cluster similarity:
Relative interconnectivity: absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters
Relative closeness: absolute closeness of two clusters normalized by the internal closeness of the clusters
Cluster Merging: Limitations of Current Schemes
Existing schemes are static in nature
MIN or CURE:
merge two clusters based on their closeness (or
minimum distance)
GROUP-AVERAGE or ROCK:
merge two clusters based on their average
connectivity
Limitations of Current Merging Schemes
Closeness schemes will merge (a) and (b)
(a) (b) (c) (d)
Average connectivity schemes will merge (c) and (d)
Chameleon: Clustering Using Dynamic Modeling
Adapt to the characteristics of the data set to find the
natural clusters
Use a dynamic model to measure the similarity between
clusters
Main property is the relative closeness and relative inter-
connectivity of the cluster
Two clusters are combined if the resulting cluster shares certain
properties with the constituent clusters
The merging scheme preserves self-similarity
Clustering based on density. Major features:
Finds clusters of arbitrary shape
Handles noise
Needs density parameters as a termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD'96)
OPTICS: Ankerst, et al. (SIGMOD'99)
DENCLUE: Hinneburg & D. Keim (KDD'98)
CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
(Figure: core point, border point, and noise point; a border point lies in the neighborhood of a core point.)
Two parameters:
Eps: radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
Eps-neighbourhood: N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
Core point: |N_Eps(p)| ≥ MinPts
(Figure: example with MinPts = 5, Eps = 1 cm.)
Directly density-reachable: p is directly density-reachable from a core point q if p belongs to N_Eps(q)
Density-reachable: p is density-reachable from q if there is a chain of points p1 = q, ..., pn = p such that p_{i+1} is directly density-reachable from p_i
Density-connected: p and q are density-connected if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
(Figure: examples with MinPts = 5, Eps = 1 cm.)
A cluster is defined as a maximal set of density-connected
points
(Figure: core, border, and outlier points with Eps = 1 cm, MinPts = 5.)
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
Continue the process until all of the points have been processed (a minimal sketch of this loop follows below)
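A minimal self-contained DBSCAN sketch following these steps (illustrative; the queue-based expansion and the label conventions are my own choices):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch. Returns labels: -1 for noise, 0..k-1 for cluster ids."""
    X = np.asarray(X, float)
    n = len(X)
    # Precompute Eps-neighborhoods (each point is included in its own neighborhood)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(D[i] <= eps)[0].tolist() for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):                          # arbitrarily select a point p
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:         # not a core point: leave as noise for now
            continue
        labels[p] = cluster                     # p is a core point: a cluster is formed
        queue = list(neighbors[p])              # points density-reachable from p
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster             # border or core point joins the cluster
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:
                    queue.extend(neighbors[q])  # expand only through core points
        cluster += 1
    return labels
```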
DBSCAN: Determining Eps and MinPts
Idea: for points in a cluster, their k-th nearest neighbors are at roughly the same distance, while noise points have their k-th nearest neighbor at a farther distance
So plot the sorted distance of every point to its k-th nearest neighbor and look for a "knee"
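An illustrative helper for this k-dist plot; the function name and the default k are my own assumptions.

```python
import numpy as np

def k_dist(X, k=4):
    """Sorted distance of every point to its k-th nearest neighbor.
    A 'knee' in this curve suggests a value for Eps (with MinPts = k)."""
    X = np.asarray(X, float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    kth = np.sort(D, axis=1)[:, k]     # column 0 is the point itself (distance 0)
    return np.sort(kth)
```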
Attempt to optimize the fit between the given data and
some mathematical model
Typical methods
Statistical approach
EM (Expectation maximization)
Machine learning approach
COBWEB
Neural network approach
SOM (Self-Organizing Feature Map)
Assume data are generated by a mixture of probabilistic models
Each cluster can be represented by a probabilistic
model, like a Gaussian (continuous) or a Poisson (discrete) distribution.
Starts with an initial estimate of the parameters of the mixture model
Iteratively refines the parameters using the EM method
Expectation step: computes the expected cluster memberships of the objects, i.e., the expectation of the likelihood given the current parameters
Maximization step: computes maximum-likelihood estimates of the parameters given those memberships
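As a usage illustration (not from the slides), scikit-learn's GaussianMixture runs exactly this EM loop; the data and parameter values below are arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 2)                                     # example data
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)   # EM fitting of the mixture
labels = gmm.predict(X)                                        # most likely component per point
probs = gmm.predict_proba(X)                                   # soft (expected) memberships
```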
Conceptual clustering
Generates a concept description for each concept (class)
Produces a hierarchical category or classification scheme
Related to decision tree learning and mixture model learning
COBWEB (Fisher’87)
A popular and simple method of incremental conceptual
learning
Creates a hierarchical clustering in the form of a
classification tree
Each node refers to a concept and contains a
probabilistic description of that concept
Incrementally builds the classification tree Given a new object
Search for the best node at which to
incorporate the object or add a new node for the object
Update the probabilistic description at each
node
Merging and splitting of nodes are also allowed
Uses a heuristic measure, category utility, to guide construction of the tree
Limitations
The assumption that the attributes are independent of
each other is often too strong because correlation may exist
Not suitable for clustering large databases: the classification tree may become skewed and the probability distributions are expensive to compute and store
Neural network approach for unsupervised learning
Two modes
Training: builds the network using input data Mapping: automatically classifies a new input vector.
Typical methods
SOM (Self-Organizing Feature Map)
Competitive learning
SOMs, also called Kohonen Self-Organizing Feature Maps (KSOMs)
Map high-dimensional input data onto a low-dimensional grid of units, called a map
The distance and proximity relationships (i.e., the topology) are preserved as much as possible
SOM processing is believed to resemble processing that can occur in the brain
The unit whose weight vector is closest to the current object
becomes the winning unit
The winner and its neighbors learn by having their weights
adjusted
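A minimal sketch of one competitive-learning update on a 1-D map; the learning rate, neighborhood radius, and function name are illustrative assumptions rather than values from the slides.

```python
import numpy as np

def som_step(weights, x, lr=0.1, radius=1):
    """One SOM update on a 1-D map: find the winning unit and pull it
    (and its neighbors within `radius` on the map) toward the input vector x."""
    x = np.asarray(x, float)
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))   # unit closest to the input
    for j in range(len(weights)):
        if abs(j - winner) <= radius:                         # neighborhood on the map grid
            weights[j] += lr * (x - weights[j])               # move the unit toward the input
    return winner
```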
SOM clustering of Web articles
The picture on the right: drilling down on the keyword "mining"
Determine the clustering tendency of the data, i.e., distinguish whether non-random structure exists
Determine the correct number of clusters
Evaluate how well the clustering results fit the data without external information
Evaluate how well the clustering results compare to externally known results
Compare different clustering algorithms/results
Clusters found in Random Data
(Figure: a set of random points, and the "clusters" found in them by K-means, DBSCAN, and complete link.)
Unsupervised (internal indices): Used to measure the
goodness of a clustering structure without respect to external information.
Sum of Squared Error (SSE)
Supervised (external indices): Used to measure the extent
to which cluster labels match externally supplied class labels.
Entropy
Relative: Used to compare two different clustering results
Often an external or internal index is used for this function, e.g., SSE
Measures of Cluster Validity
Cluster Cohesion: how closely related are objects in a
cluster
Cluster Separation: how distinct or well-separated a
cluster is from other clusters
Internal Measures: Cohesion and Separation
Cohesion (within-cluster sum of squares): WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²
Separation (between-cluster sum of squares): BSS = Σ_i |C_i| (m − m_i)²
where m is the overall mean of the data and m_i is the mean of cluster C_i
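An illustrative NumPy sketch of these two quantities; the function name is my own.

```python
import numpy as np

def wss_bss(X, labels):
    """Within-cluster (cohesion) and between-cluster (separation) sums of squares."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    m = X.mean(axis=0)                                   # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)                             # cluster mean m_i
        wss += ((Xc - mc) ** 2).sum()                    # sum of (x - m_i)^2
        bss += len(Xc) * ((m - mc) ** 2).sum()           # |C_i| * (m - m_i)^2
    return wss, bss
```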
SSE is good for comparing two clusterings Can also be used to estimate the number of clusters
(Figure: SSE plotted against the number of clusters K for the example data.)
Another example of a more complicated data set
(Figure: SSE of the clusters found using K-means.)
Statistical Framework for SSE
(Figures: histogram of SSE values obtained by clustering random data, and the random data points themselves.)
Compare cluster results with "ground truth" or a manual clustering
Classification-oriented measures: entropy, purity,
precision, recall, F-measures
Similarity-oriented measures: Jaccard scores
External Measures: Classification-Oriented Measures
Entropy: the degree to which each cluster consists of objects of a single class
Precision: the fraction of a cluster that consists of objects of a specified class
Recall: the extent to which a cluster contains all objects of a specified class
External Measure: Similarity-Oriented Measures
Given a reference clustering T and a clustering S, count over all pairs of points:
f11: pairs in the same cluster in both T and S
f10: same cluster in T, different clusters in S
f01: different clusters in T, same cluster in S
f00: different clusters in both T and S
Rand = (f11 + f00) / (f00 + f01 + f10 + f11)
Jaccard = f11 / (f01 + f10 + f11)
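An illustrative sketch of the pair counts and the two indices; the function name and the toy labelings are my own.

```python
from itertools import combinations

def pair_counts(truth, pred):
    """f11, f10, f01, f00 over all pairs of points: same/different cluster
    in the reference clustering T (truth) and in the clustering S (pred)."""
    f11 = f10 = f01 = f00 = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t = truth[i] == truth[j]
        same_s = pred[i] == pred[j]
        if same_t and same_s:
            f11 += 1
        elif same_t:
            f10 += 1
        elif same_s:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00

f11, f10, f01, f00 = pair_counts([0, 0, 1, 1], [0, 0, 0, 1])
rand = (f11 + f00) / (f11 + f10 + f01 + f00)
jaccard = f11 / (f11 + f10 + f01)
```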