IRDM ‘15/16
Jilles Vreeken
Chapter 5: Clustering
10 Nov 2015

Question of the week: How can we discover groups of objects that are highly similar to each other?

Clustering, where?
Biology
- creation of phylogenies (relations between organisms)
- inferring population structures from clusterings of DNA data
- analysis of genes and cellular processes (co-clustering)
Business
- grouping of consumers into market segments
Computer science
- pre-processing to reduce computation (representative-based methods)
- automatic discovery of similar items
(Figure from Wessmann, ‘Mixture Model Clustering in the analysis of complex diseases’, 2012)
(Figure from Heikinheimo et al., ‘Clustering of European Mammals’, 2007)
1. Basic idea
2. Representative-based clustering
3. Probabilistic clustering
4. Hierarchical clustering
5. Density-based clustering
6. Clustering high-dimensional data
7. Validation

You’ll find this covered in Aggarwal Ch. 6–7 and Zaki & Meira Ch. 13–15.
(Figure: example clusterings, annotated with low inter-cluster similarity, high intra-cluster similarity, and a possible outlier.)
Given a set V of objects and a distance d: V × V → ℝ+ between them, divide V into clusters such that the distance between points in the same cluster is low and the distance between points in different clusters is large.

Note that "low" and "large" are not well defined.

A clustering of V can be
- exclusive (each point belongs to exactly one cluster)
- probabilistic (each point has a probability of belonging to a cluster)
- fuzzy (each point can belong to multiple clusters)

The number of clusters can be pre-defined, or not.
A function d: V × V → ℝ+ is a metric if:
- d(v, w) = 0 if and only if v = w (self-similarity)
- d(v, w) = d(w, v) for all v, w ∈ V (symmetry)
- d(v, w) ≤ d(v, x) + d(x, w) for all v, w, x ∈ V (triangle inequality)

A metric is a distance; if d: V × V → [0, β] for some positive β, then β − d(v, w) is a similarity score.

Common metrics include
- the Minkowski distance $L_p = \left(\sum_{a=1}^{d} |v_a - w_a|^p\right)^{1/p}$ for d-dimensional space; L_1 = Hamming = city-block distance, L_2 = Euclidean distance
- the correlation distance 1 − ρ
- the Jaccard distance 1 − |A ∩ B| / |A ∪ B|

A small code sketch of these measures follows below.
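This is a minimal sketch in Python/numpy (ours, not from the slides); the function names are illustrative:

```python
import numpy as np

def minkowski(v, w, p):
    """L_p distance between two d-dimensional vectors v and w."""
    return float(np.sum(np.abs(np.asarray(v) - np.asarray(w)) ** p) ** (1.0 / p))

def jaccard_distance(A, B):
    """1 - |A intersect B| / |A union B| for two sets A and B."""
    A, B = set(A), set(B)
    return 1.0 - len(A & B) / len(A | B)

v, w = [0.0, 1.0, 2.0], [1.0, 1.0, 0.0]
print(minkowski(v, w, 1))                      # L1 (city-block): 3.0
print(minkowski(v, w, 2))                      # L2 (Euclidean): ~2.236
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```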
For all-numerical data, the sum of squared errors (SSE), $\sum_{a=1}^{d} (v_a - w_a)^2$, is the most common distance measure.

For all-binary data, either Hamming or Jaccard is typically used.

For categorical data, we either
- first convert the data to binary by adding one binary variable per category label, and then use the Hamming distance; or
- count the agreements and disagreements of the category labels with Jaccard.

For mixed data, some combination must be used.
A distance (or dissimilarity) matrix is
- n-by-n for n objects,
- non-negative (d_{i,j} ≥ 0),
- symmetric (d_{i,j} = d_{j,i}), and
- zero on the diagonal (d_{i,i} = 0):

$$\begin{pmatrix} 0 & d_{1,2} & d_{1,3} & \cdots & d_{1,n} \\ d_{1,2} & 0 & d_{2,3} & \cdots & d_{2,n} \\ d_{1,3} & d_{2,3} & 0 & \cdots & d_{3,n} \\ \vdots & & & \ddots & \vdots \\ d_{1,n} & d_{2,n} & d_{3,n} & \cdots & 0 \end{pmatrix}$$
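A small sketch (again ours) that builds such a matrix with Euclidean distance and checks the listed properties:

```python
import numpy as np

def distance_matrix(X):
    """n-by-n matrix of pairwise Euclidean distances for an (n, d) array X."""
    diff = X[:, None, :] - X[None, :, :]      # shape (n, n, d)
    return np.sqrt((diff ** 2).sum(axis=-1))  # shape (n, n)

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = distance_matrix(X)
assert (D >= 0).all()                # non-negative
assert np.allclose(D, D.T)           # symmetric
assert np.allclose(np.diag(D), 0.0)  # zero on the diagonal
```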
2. Representative-based clustering (Aggarwal Ch. 6.3)
Exclusive representative-based clustering:
- the set of objects V is partitioned into k clusters C_1, C_2, …, C_k, with ⋃_i C_i = V and C_i ∩ C_j = ∅ for i ≠ j
- every cluster is represented by a prototype (aka centroid or mean) μ_i
- clustering quality is based on the sum of squared errors (SSE) between the objects in a cluster and the cluster prototype (a small sketch follows below):

$$SSE = \sum_{i=1}^{k} \sum_{x_j \in C_i} \left\lVert x_j - \mu_i \right\rVert_2^2 = \sum_{i=1}^{k} \sum_{x_j \in C_i} \sum_{a=1}^{d} (x_{ja} - \mu_{ia})^2$$
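As a sketch, the SSE objective can be computed directly from this definition; X, labels, and mu are our illustrative names:

```python
import numpy as np

def sse(X, labels, mu):
    """Sum of squared errors: X is (n, d), labels[j] is the cluster index
    of point x_j, and mu is a (k, d) array of cluster prototypes."""
    return sum(float(np.sum((X[labels == i] - mu[i]) ** 2))
               for i in range(len(mu)))
```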
The naïve algorithm goes like this: one by one, generate all possible clusterings, compute the squared error of each, and select the best.

Sadly, this is infeasible; there are too many possible clusterings to try:
- there are k^n different clusterings into k clusters (some possibly empty)
- the number of ways to cluster n points into k non-empty clusters is the Stirling number of the second kind,

$$S(n, k) = {n \brace k} = \frac{1}{k!} \sum_{j=0}^{k} (-1)^j \binom{k}{j} (k - j)^n$$
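A short sketch that evaluates this formula exactly (Python 3.8+ for math.comb), illustrating why exhaustive search is infeasible:

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind, S(n, k), via the formula above."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n
               for j in range(k + 1)) // factorial(k)

print(stirling2(10, 3))   # 9330 ways to split 10 points into 3 non-empty clusters
print(stirling2(100, 5))  # a 68-digit number: exhaustive search is hopeless
```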
The iterative k-means algorithm (a runnable sketch follows below):
1. select k random cluster centroids
2. assign each point to its closest centroid
3. compute the error
4. do
   a. for each cluster C_i, compute the new centroid as μ_i = (1/|C_i|) ∑_{x_j ∈ C_i} x_j
   b. for each element x_j ∈ V, assign x_j to its closest cluster centroid
5. while the error decreases
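A minimal sketch of this loop, assuming X is an (n, d) numpy float array; empty clusters are handled naively by keeping the old centroid:

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    """Iterative k-means; returns (labels, centroids)."""
    # 1. select k random cluster centroids (here: k distinct data points)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    prev_err = np.inf
    while True:
        # 2. assign each point to its closest centroid
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # 3. compute the error (SSE)
        err = d2[np.arange(len(X)), labels].sum()
        # 5. stop as soon as the error no longer decreases
        if err >= prev_err:
            return labels, mu
        prev_err = err
        # 4. recompute each centroid as the mean of its cluster
        for i in range(k):
            if (labels == i).any():  # naive guard against empty clusters
                mu[i] = X[labels == i].mean(axis=0)
```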
(Figure: k-means example on gene-expression data; axes: expression in condition 1 vs. expression in condition 2; clusters k1, k2, k3.)
k-means always converges, eventually:
- on each step the error decreases
- there is only a finite number of possible clusterings
- it converges to a local optimum

At some point a cluster can become empty (all its points are closer to some other centroid). Some options to handle this include:
- split the biggest cluster
- take the furthest point as a singleton cluster

Outliers can yield bad clusterings.
How long does iterative k-means take?
- computing the centroids takes O(nd) time: averaging over a total of n points in d-dimensional space
- computing the cluster assignment takes O(nkd) time: for each of the n points we have to compute the distances to all k centroids in d-dimensional space
- if the algorithm takes t iterations, the total running time is O(tnkd)

But how many iterations will we need?
IRDM ‘15/16
In practice the algorithm usually doesn’t need many
some hundred iterations is usually enough
The worst-case upper bound on the number of iterations is O(n^{kd}), while the worst-case lower bound is superpolynomial: 2^{Ω(√n)} iterations.

The discrepancy between practice and worst-case analysis can be (somewhat) explained with smoothed analysis: if the data is sampled from independent d-dimensional normal distributions with the same variance, iterative k-means will terminate in O(n^k) time with high probability. (Arthur & Vassilvitskii, 2006)
The k-means algorithm converges to a local optimum, which may be poor compared to the global optimum.
k-means++. The key idea: careful initial seeding.
- choose the first centroid uniformly at random from the data points
- let D(x) be the shortest distance from x to any already-selected centroid
- choose the next centroid to be x′ with probability D(x′)² / ∑_{x ∈ V} D(x)² (points that are further away are more probable to be selected)
- repeat until k centroids have been selected, and continue as the normal iterative k-means algorithm

The k-means++ algorithm achieves an O(log k) approximation ratio in expectation: E[cost] ≤ 8(ln k + 2) · OPT. It also converges fast in practice; a seeding sketch follows below.

(Arthur & Vassilvitskii ’07)
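A sketch of the seeding step under the same assumptions as the k-means sketch above; the resulting seeds would replace its random initialisation:

```python
import numpy as np

def kmeans_pp_seeds(X, k, rng=np.random.default_rng(0)):
    """k-means++ seeding: returns a (k, d) array of initial centroids."""
    # first centroid: uniformly at random from the data points
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # D(x)^2: squared distance from each point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # sample the next centroid with probability D(x)^2 / sum_x D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```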
The clusters have to be of roughly equal size.
The clusters have to be of roughly equal density.
The clusters have to be of roughly spherical shape.
3. Probabilistic clustering (Aggarwal Ch. 6.5)
Probabilistic clustering is not exclusive: every object has a certain probability (affinity) to every cluster.

It is representative, in a way: each cluster is represented by some parameters Θ, and the parameters may include (or specify) a cluster centroid.

It requires us to assume a distribution of a cluster; for now, each cluster is an independent Gaussian.

We use the expectation-maximization (EM) algorithm.
We aim at finding the model Θ, i.e. parameters μ_i and Σ_i for each d-dimensional Gaussian cluster, plus the k mixture parameters P(C_i).

The pdf of an object x in cluster C_i is

$$f_i(x) = f(x \mid \mu_i, \Sigma_i) = (2\pi)^{-d/2} \, |\Sigma_i|^{-1/2} \exp\left( -\frac{(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}{2} \right)$$

The total pdf of x is a mixture model of the k cluster Gaussians,

$$f(x) = \sum_{i=1}^{k} f_i(x) \, P(C_i) = \sum_{i=1}^{k} f(x \mid \mu_i, \Sigma_i) \, P(C_i)$$

The log-likelihood of the data D given the parameters Θ then is

$$\log P(D \mid \Theta) = \sum_{j=1}^{n} \log \left( \sum_{i=1}^{k} f(x_j \mid \mu_i, \Sigma_i) \, P(C_i) \right)$$

A sketch evaluating these quantities follows below.
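This sketch writes the pdf and log-likelihood out with numpy, so the formulas above are visible in code:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """f(x | mu, Sigma) for a single d-dimensional Gaussian."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def log_likelihood(X, mus, Sigmas, priors):
    """log P(D | Theta): sum over points of the log of the mixture density."""
    return sum(np.log(sum(p * gaussian_pdf(x, m, S)
                          for m, S, p in zip(mus, Sigmas, priors)))
               for x in X)
```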
The EM algorithm:
- Initialisation: initialise the parameters Θ randomly.
- Expectation (E) step: compute the posterior probability P(C_i | x_j) per Bayes’ theorem,
  $$P(C_i \mid x_j) = \frac{P(x_j \mid C_i) \, P(C_i)}{\sum_{a=1}^{k} P(x_j \mid C_a) \, P(C_a)}$$
- Maximisation (M) step: re-estimate Θ given the P(C_i | x_j).
- Repeat the E and M steps until convergence.
Example: 1-dimensional Gaussians. The pdf is

$$f(x \mid \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right)$$

Initialisation step: each mean μ_i is sampled u.a.r. from the possible values, σ_i² = 1, and P(C_i) = 1/k (every cluster is equiprobable).

Expectation step:

$$w_{ij} = P(C_i \mid x_j) = \frac{f(x_j \mid \mu_i, \sigma_i^2) \, P(C_i)}{\sum_{a=1}^{k} f(x_j \mid \mu_a, \sigma_a^2) \, P(C_a)}$$

Maximisation step:

$$\mu_i = \frac{\sum_{j=1}^{n} w_{ij} x_j}{\sum_{j=1}^{n} w_{ij}} \ \text{(weighted mean)}, \quad \sigma_i^2 = \frac{\sum_{j=1}^{n} w_{ij} (x_j - \mu_i)^2}{\sum_{j=1}^{n} w_{ij}} \ \text{(weighted variance)}, \quad P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n} \ \text{(fraction of weight in cluster i)}$$

A runnable sketch of these updates follows below.
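A direct sketch of these updates; w[i, j] plays the role of w_ij, and the data x is a 1-D numpy array:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_1d(x, k, iters=100, rng=np.random.default_rng(0)):
    """EM for a 1-D Gaussian mixture; returns (mu, var, prior)."""
    n = len(x)
    mu = rng.choice(x, size=k, replace=False)  # means sampled u.a.r. from the data
    var = np.ones(k)                           # sigma_i^2 = 1
    prior = np.full(k, 1.0 / k)                # P(C_i) = 1/k
    for _ in range(iters):
        # E step: w[i, j] = f(x_j | mu_i, var_i) P(C_i), normalised over i
        w = np.array([prior[i] * normal_pdf(x, mu[i], var[i]) for i in range(k)])
        w /= w.sum(axis=0)
        # M step: weighted mean, weighted variance, fraction of weight per cluster
        tot = w.sum(axis=1)
        mu = (w @ x) / tot
        var = np.array([(w[i] @ (x - mu[i]) ** 2) / tot[i] for i in range(k)])
        prior = tot / n
    return mu, var, prior
```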
(Figure: EM on 1-D data; panels: initialisation, iteration 1, iteration 2.)
If we generalise to d-dimensional Gaussians, we need to model the interactions between all dimensions: we need the covariance matrix Σ_i. In practice we only need to estimate its upper triangular part, but that still means estimating d(d+1)/2 parameters; hence, in practice the dimensions are often assumed to be independent, yielding d parameters.

The expectation step is as in 1-D, and the mean and prior P(C_i) are estimated as in 1-D. The variance of cluster C_i in dimension a is

$$\sigma_{ia}^2 = \frac{\sum_{j=1}^{n} w_{ij} (x_{ja} - \mu_{ia})^2}{\sum_{j=1}^{n} w_{ij}}$$
Iterative k-means can be seen as a special case of EM with a different cluster density function:

P(x_j | C_i) = 1 iff centroid i is the closest to point x_j

The posterior probability is then

P(C_i | x_j) = 1 iff point x_j belongs to cluster i

The parameters are the centroids and the P(C_i); the covariance matrix can be ignored.
7. Validation (Aggarwal Ch. 6.9)
Both k-means and EM require the user to define k before the algorithm is run. What if we don’t know the number of clusters beforehand?

The larger the value of k:
- the smaller the error
- the more complex the model
- the higher the risk of over-fitting
As with regression:
- hold out some random points (test set)
- run clustering on the remaining points (training set)
- compute the error with the test set included
- re-iterate with different values of k and select the one with the least error

Normally N-fold cross-validation is used: typically N = 10; the data is divided into N even-sized sets, and cross-validation is run N times, each time keeping one set as the test set and the remaining N−1 sets together as the training set. A sketch follows below.
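A rough sketch of this procedure, reusing the kmeans sketch from earlier; scoring held-out points by squared distance to the nearest learned centroid is one reasonable choice of test error, not prescribed by the slides:

```python
import numpy as np

def cv_error(X, k, N=10, rng=np.random.default_rng(0)):
    """N-fold cross-validated clustering error for a given k."""
    folds = np.array_split(rng.permutation(len(X)), N)
    err = 0.0
    for f in range(N):
        train = np.concatenate([folds[g] for g in range(N) if g != f])
        _, mu = kmeans(X[train], k)   # cluster the training folds
        # error: SSE of held-out points to their nearest learned centroid
        d2 = ((X[folds[f]][:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        err += d2.min(axis=1).sum()
    return err

# choose the k with the smallest cross-validated error, e.g.:
# best_k = min(range(2, 11), key=lambda k: cv_error(X, k))
```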
Let P_Θ(D | C) be the maximised likelihood of clustering C (obtained e.g. via the EM algorithm), and let q(C) be the number of parameters in Θ we need for C. For Gaussians with independent dimensions, q(C) = k × (d + 2): k clusters, and per cluster 1 mixture parameter P(C_i), d variances, and 1 mean (although d-dimensional, it only counts as one parameter).

Main idea: we pay for every parameter in the model.
- In Akaike’s Information Criterion (AIC) we select the k that minimises
  AIC = −log P_Θ(D | C) + q(C)
- In the Bayesian Information Criterion (BIC) we select the k that minimises
  BIC = −log P_Θ(D | C) + (q(C)/2) log n

A sketch of both criteria follows below.

(Akaike, 1974; Schwarz, 1978)
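A sketch for the independent-dimension Gaussian mixture, with q(C) = k(d + 2) as above; fit_em is a hypothetical EM fit returning the maximised log-likelihood:

```python
import numpy as np

def aic(log_lik, k, d):
    """AIC for a k-cluster Gaussian mixture with independent dimensions."""
    return -log_lik + k * (d + 2)

def bic(log_lik, k, d, n):
    """BIC: the parameter penalty additionally grows with log n."""
    return -log_lik + k * (d + 2) / 2 * np.log(n)

# given a (hypothetical) fit_em(X, k) returning the maximised log-likelihood:
# best_k = min(ks, key=lambda k: bic(fit_em(X, k), k, X.shape[1], len(X)))
```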
So far we’ve seen representative and probabilistic clustering; every type of clustering has its strengths and weaknesses. For choosing the number of clusters, cross-validation is a standard method, and AIC and BIC are principled general ways for model selection.