Hierarchical and Ensemble Clustering
Ke Chen Reading: [7.8-7.10, EA], [25.5, KPM], [Fred & Jain, 2005]
COMP24111 Machine Learning
Outline
– Introduction
– Cluster Distance Measures
– Agglomerative Algorithm
– Example and …
– Hierarchical clustering: a typical clustering analysis approach that partitions a data set sequentially
– Constructs nested partitions layer by layer by grouping objects into a tree of clusters (without the need to know the number of clusters in advance)
– Uses a (generalised) distance matrix as the clustering criterion
– Agglomerative: a bottom-up strategy
– Divisive: a top-down strategy
– Ensemble clustering: uses multiple clustering results for robustness and to overcome the weaknesses of single clustering algorithms.
Agglomerative and divisive clustering on the data set {a, b, c, d, e}
[Figure: dendrogram over {a, b, c, d, e}. The agglomerative direction runs from Step 0 (five singleton clusters) to Step 4 (one cluster), merging {a, b} and {d, e}, then {c, d, e}, and finally {a, b, c, d, e}; the divisive direction traverses the same tree from Step 4 back to Step 0 by splitting.]
– Single link (min): the smallest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min{ d(xip, xjq) }
– Complete link (max): the largest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max{ d(xip, xjq) }
– Average: the average distance between elements in one cluster and elements in the other, i.e., d(Ci, Cj) = avg{ d(xip, xjq) }
– The distance of a cluster to itself is zero: d(C, C) = 0
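The three measures are easy to express directly. Below is a minimal Python sketch; the function names and the 1-D absolute-difference distance d are illustrative choices, not part of the slides, and any distance function could be substituted:

```python
from itertools import product

def d(x, y):
    """Distance between two single-feature objects (absolute difference)."""
    return abs(x - y)

def single_link(Ci, Cj):
    """Smallest distance between an element of Ci and an element of Cj."""
    return min(d(x, y) for x, y in product(Ci, Cj))

def complete_link(Ci, Cj):
    """Largest distance between an element of Ci and an element of Cj."""
    return max(d(x, y) for x, y in product(Ci, Cj))

def average_link(Ci, Cj):
    """Average distance between elements of Ci and elements of Cj."""
    return sum(d(x, y) for x, y in product(Ci, Cj)) / (len(Ci) * len(Cj))
```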
Example: given a data set of five objects characterised by a single continuous feature, assume that there are two clusters: C1 = {a, b} and C2 = {c, d, e}.
Feature values:
          a  b  c  d  e
  Feature 1  2  4  5  6

Distance matrix (absolute differences):
      a  b  c  d  e
  a   0  1  3  4  5
  b   1  0  2  3  4
  c   3  2  0  1  2
  d   4  3  1  0  1
  e   5  4  2  1  0

Single link, complete link and average distances between C1 and C2:
Single link:
  dist(C1, C2) = min{ d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e) } = min{3, 4, 5, 2, 3, 4} = 2

Complete link:
  dist(C1, C2) = max{ d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e) } = max{3, 4, 5, 2, 3, 4} = 5

Average:
  dist(C1, C2) = ( d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e) ) / 6 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21 / 6 = 3.5
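The same numbers can be checked with a few lines of Python, using the feature values from the table above (the variable names are illustrative):

```python
from itertools import product

feature = {"a": 1, "b": 2, "c": 4, "d": 5, "e": 6}
C1, C2 = ["a", "b"], ["c", "d", "e"]

# All pairwise distances between C1 and C2 on the single feature.
pairs = [abs(feature[p] - feature[q]) for p, q in product(C1, C2)]

print(min(pairs))               # single link   -> 2
print(max(pairs))               # complete link -> 5
print(sum(pairs) / len(pairs))  # average       -> 3.5
```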
The agglomerative algorithm (sketched in code below):
1) Convert all object features into a distance matrix
2) Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning)
3) Repeat until the number of clusters is one (or equals a known number of clusters): merge the two closest clusters and update the distance matrix
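A minimal sketch of this loop in Python/NumPy, assuming a precomputed distance matrix and single-link updates; the function name, the single-link choice and the implementation details are illustrative assumptions:

```python
import numpy as np

def agglomerative(D, k=1):
    """Single-link agglomerative clustering on an n x n distance matrix D.
    Merges the two closest clusters until only k clusters remain and
    returns the clusters as lists of original object indices."""
    clusters = [[i] for i in range(len(D))]   # step 2: every object starts as a cluster
    D = np.asarray(D, dtype=float).copy()
    np.fill_diagonal(D, np.inf)               # ignore self-distances

    while len(clusters) > k:                  # step 3: repeat until k clusters remain
        i, j = np.unravel_index(np.argmin(D), D.shape)
        i, j = min(i, j), max(i, j)
        clusters[i].extend(clusters.pop(j))   # merge cluster j into cluster i
        # Single link: distance to the merged cluster is the minimum of the two rows.
        D[i, :] = np.minimum(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)
    return clusters
```

For instance, agglomerative(D, k=2) on the 5 x 5 distance matrix from the worked example above groups {a, b} and {c, d, e}.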
[Example: the data matrix is converted into a distance matrix using the Euclidean distance.]
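A small sketch of this conversion in Python/NumPy; the data matrix below is a hypothetical placeholder (one object per row), not the values shown on the slide:

```python
import numpy as np

# Hypothetical data matrix: one object per row, one feature per column.
X = np.array([[1.0, 2.0],
              [2.5, 4.5],
              [2.0, 2.0],
              [4.0, 1.5],
              [4.0, 2.5]])

# Pairwise Euclidean distance matrix: D[i, j] = ||X[i] - X[j]||.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(D, 2))
```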
[The example then proceeds iteration by iteration, recomputing the distance matrix after each merge (iteration 3, iteration 4, ...).]
– Start with six singleton clusters: A, B, C, D, E and F
– Merge D and F into cluster (D, F) at distance 0.50
– Merge A and B into (A, B) at distance 0.71
– Merge (D, F) and E into ((D, F), E) at distance 1.00
– Merge ((D, F), E) and C into (((D, F), E), C) at distance 1.41
– Merge (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50, which concludes the computation
[Dendrogram of the six objects; the vertical axis shows the merge distance, which determines cluster lifetimes.]
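With SciPy the same kind of dendrogram can be produced directly. The 2-D coordinates below are hypothetical, chosen only so that single-link merges occur at the distances quoted above (0.50, 0.71, 1.00, 1.41, 2.50); they are not the coordinates used on the slides:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

labels = ["A", "B", "C", "D", "E", "F"]
# Hypothetical 2-D coordinates for the six objects.
X = np.array([[4.0, 2.0],   # A
              [4.5, 2.5],   # B
              [1.5, 2.0],   # C
              [0.0, 0.0],   # D
              [0.5, 1.0],   # E
              [0.5, 0.0]])  # F

Z = linkage(pdist(X), method="single")   # rows: (cluster_i, cluster_j, distance, size)
dendrogram(Z, labels=labels)
plt.ylabel("merge distance (lifetime)")
plt.show()
```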
Lifetime of a cluster: the difference between the distance at which the cluster is created and the distance at which it disappears (merges with another cluster during clustering). E.g. the lifetimes of A, B, C, D, E and F are 0.71, 0.71, 1.41, 0.50, 1.00 and 0.50 respectively, and the lifetime of (A, B) is 2.50 - 0.71 = 1.79, ...
K-cluster lifetime: the distance range from the point where K clusters emerge to the point where they vanish (due to the reduction to K-1 clusters). E.g. the 5-cluster lifetime is 0.71 - 0.50 = 0.21, the 4-cluster lifetime is 1.00 - 0.71 = 0.29, the 3-cluster lifetime is 1.41 - 1.00 = 0.41, and the 2-cluster lifetime is 2.50 - 1.41 = 1.09 (computed in the sketch below).
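A short sketch that computes the K-cluster lifetimes from the sorted merge distances and applies the maximum-lifetime heuristic to pick K; the merge distances are those of the example above:

```python
# Merge distances read off the dendrogram, in increasing order.
merge_distances = [0.50, 0.71, 1.00, 1.41, 2.50]
n = len(merge_distances) + 1          # number of objects (6 here)

# Between the i-th and the (i+1)-th merge there are n - i clusters,
# so the K-cluster lifetime is the gap between consecutive merge distances.
lifetimes = {}
for i, (d_prev, d_next) in enumerate(zip(merge_distances, merge_distances[1:]), start=1):
    lifetimes[n - i] = d_next - d_prev

print(lifetimes)   # roughly {5: 0.21, 4: 0.29, 3: 0.41, 2: 1.09}
best_k = max(lifetimes, key=lifetimes.get)
print(best_k)      # 2 -- cut the dendrogram where the lifetime is largest
```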
Agglomerative Demo
– If the number of clusters is known, the termination condition is given!
– The K-cluster lifetime is the range of threshold values on the dendrogram tree that leads to the identification of K clusters
– Heuristic rule: cut the dendrogram tree at the maximum lifetime to find a "proper" K
– Can never undo what was done previously
– Sensitive to cluster distance measures and noise/outliers
– Less efficient: O(n² log n), where n is the total number of objects
– BIRCH: scalable to large data sets
– ROCK: clustering categorical data
– CHAMELEON: hierarchical clustering using dynamic modelling
– A single clustering algorithm may be affected by various factors, e.g. its initialisation and parameter settings
– An effective treatment: the clustering ensemble
– A simple clustering ensemble algorithm that overcomes the main weaknesses of different clustering methods by exploiting their synergy via evidence accumulation [Fred & Jain, 2005]
– Initial clustering analysis using either different clustering algorithms or a single clustering algorithm run under different conditions, leading to multiple partitions, e.g. K-means with various initial centroid settings and different K, or the agglomerative algorithm with different distance metrics forced to terminate with different numbers of clusters, ...
– Convert the clustering results on the different partitions into binary "distance" matrices
– Evidence accumulation: form a collective "distance" matrix from all the binary "distance" matrices
– Apply a hierarchical clustering algorithm (with a proper cluster distance metric) to the collective "distance" matrix and use the maximum K-cluster lifetime to decide K (see the sketch below)
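A minimal end-to-end sketch of this recipe, assuming scikit-learn's KMeans for the initial partitions and SciPy's single-link agglomerative clustering for the final step; every parameter choice below is illustrative rather than prescribed by the slides:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def evidence_accumulation(X, k_range=range(2, 12), runs_per_k=3, seed=0):
    """Clustering ensemble by evidence accumulation on a data matrix X (n x d)."""
    n = len(X)
    rng = np.random.RandomState(seed)
    collective = np.zeros((n, n))

    # 1) Initial clustering analysis: many K-means partitions under different conditions.
    for k in k_range:
        for _ in range(runs_per_k):
            labels = KMeans(n_clusters=k, n_init=1,
                            random_state=rng.randint(10**6)).fit_predict(X)
            # 2) Binary "distance" matrix: 1 where a pair falls in different clusters.
            # 3) Evidence accumulation: sum into the collective "distance" matrix.
            collective += (labels[:, None] != labels[None, :]).astype(float)

    # 4) Single-link agglomerative clustering on the collective "distance" matrix.
    Z = linkage(squareform(collective), method="single")

    # Choose K by the maximum K-cluster lifetime: gaps between consecutive merge distances.
    gaps = np.diff(Z[:, 2])               # gaps[i] is the lifetime of (n - 1 - i) clusters
    best_k = n - 1 - int(np.argmax(gaps))
    return fcluster(Z, t=best_k, criterion="maxclust"), best_k
```

With the defaults above, evidence_accumulation(X) mirrors the configuration of the later example: 30 K-means partitions (K = 2, ..., 11, three runs each), one collective matrix, single-link clustering, and K chosen by the maximum lifetime.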
Example: four objects A, B, C and D are grouped by one clustering run into two clusters, Cluster 1 (C1) and Cluster 2 (C2). This partition is converted into a binary "distance" matrix D1: the entry for a pair of objects is 1 if they fall in different clusters and 0 if they fall in the same cluster (here D1 contains eight 1-entries, as each cluster holds two objects).
A second clustering run groups the same objects into three clusters, Cluster 1 (C1), Cluster 2 (C2) and Cluster 3 (C3), and is converted in the same way into a second binary "distance" matrix D2 (containing ten 1-entries, as more pairs are now separated).
Evidence accumulation: the binary "distance" matrices are summed element-wise into the collective "distance" matrix DC = D1 + D2. A pair of objects separated in both partitions receives the value 2, a pair separated in only one partition receives 1, and a pair that always shares a cluster receives 0.
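A tiny sketch of this conversion for the four objects; the cluster assignments below are hypothetical (the exact assignments on the slides are not reproduced here), and only the construction of D1, D2 and DC is the point:

```python
import numpy as np

objects = ["A", "B", "C", "D"]
# Hypothetical partitions: run 1 has two clusters, run 2 has three clusters.
labels_1 = np.array([0, 0, 1, 1])      # e.g. C1 = {A, B}, C2 = {C, D}
labels_2 = np.array([0, 0, 1, 2])      # e.g. C1 = {A, B}, C2 = {C}, C3 = {D}

def binary_distance(labels):
    """Entry (i, j) is 1 if objects i and j are in different clusters, else 0."""
    return (labels[:, None] != labels[None, :]).astype(int)

D1 = binary_distance(labels_1)          # eight 1-entries
D2 = binary_distance(labels_2)          # ten 1-entries
DC = D1 + D2                            # collective "distance" matrix
print(DC)
```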
– Data set of 400 data points
– Initial clustering analysis: K-means (K = 2, ..., 11) with 3 initial settings per K, giving 30 partitions in total
– Convert the clustering results to binary "distance" matrices and form the collective "distance" matrix
– Apply the agglomerative algorithm (single link) to the collective "distance" matrix
– Cut the dendrogram tree at the maximum K-cluster lifetime to decide K
– Hierarchical clustering uses a distance matrix to construct a tree of clusters (dendrogram)
– It gives a hierarchical representation without the need to know the number of clusters (a termination condition can be set when the number of clusters is known)
– Weaknesses: can never undo what was done previously; sensitive to cluster distance measures and noise/outliers; less efficient, O(n² log n), where n is the total number of objects
– Clustering ensemble: initial clustering under different conditions, e.g. K-means with different K and initialisations
– Evidence accumulation builds a "collective" distance matrix
– Apply the agglomerative algorithm to the "collective" distance matrix and choose K by the maximum K-cluster lifetime
Online tutorial: how to use hierarchical clustering functions in Matlab:
https://www.youtube.com/watch?v=aYzjenNNOcc