IRDM ‘15/16
Jilles Vreeken
Chapter 5: Clustering
10 Nov 2015

Question of the week: How can we discover groups of objects that are highly similar to each other?

Clustering, where?
Biology
- creation of phylogenies (relations between organisms)
- inferring population structures from clusterings of DNA data
- analysis of genes and cellular processes (co-clustering)
Business
- grouping of consumers into market segments
Computer science
- pre-processing to reduce computation (representative-based methods)
- automatic discovery of similar items
(Figure from Wessmann, ‘Mixture Model Clustering in the analysis of complex diseases’, 2012)
(Figure from Heikinheimo et al., ‘Clustering of European Mammals’, 2007)
1. Basic idea
2. Representative-based clustering
3. Probabilistic clustering
4. Hierarchical clustering
5. Density-based clustering
6. Clustering high-dimensional data
7. Validation

You’ll find this covered in Aggarwal Ch. 6–7 and Zaki & Meira Ch. 13–15.
(Figure: example clusterings, annotated with low inter-cluster similarity, high intra-cluster similarity, and a possible outlier.)
Given a set V of objects and a distance d: V × V → ℝ+ between them, divide V into clusters such that the distance between points in the same cluster is low and the distance between points in different clusters is large.

Note that "low" and "large" are not well defined.

A clustering of V can be
- exclusive (each point belongs to exactly one cluster)
- probabilistic (each point has a probability of belonging to a cluster)
- fuzzy (each point can belong to multiple clusters)

The number of clusters can be pre-defined, or not.
A function d: V × V → ℝ+ is a metric if:
- d(v, w) = 0 if and only if v = w (self-similarity)
- d(v, w) = d(w, v) for all v, w ∈ V (symmetry)
- d(v, w) ≤ d(v, x) + d(x, w) for all v, w, x ∈ V (triangle inequality)

A metric is a distance; if d: V × V → [0, β] for some positive β, then β − d(v, w) is a similarity score.

Common metrics include
- the Minkowski distance $L_p = \left(\sum_{a=1}^{d} |v_a - w_a|^p\right)^{1/p}$ for d-dimensional space; L_1 = Hamming = city-block distance, L_2 = Euclidean distance
- the correlation distance 1 − ρ
- the Jaccard distance 1 − |A ∩ B| / |A ∪ B|

A small code sketch of these measures follows below.
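This is a minimal sketch in Python/numpy (ours, not from the slides); the function names are illustrative:

```python
import numpy as np

def minkowski(v, w, p):
    """L_p distance between two d-dimensional vectors v and w."""
    return float(np.sum(np.abs(np.asarray(v) - np.asarray(w)) ** p) ** (1.0 / p))

def jaccard_distance(A, B):
    """1 - |A intersect B| / |A union B| for two sets A and B."""
    A, B = set(A), set(B)
    return 1.0 - len(A & B) / len(A | B)

v, w = [0.0, 1.0, 2.0], [1.0, 1.0, 0.0]
print(minkowski(v, w, 1))                      # L1 (city-block): 3.0
print(minkowski(v, w, 2))                      # L2 (Euclidean): ~2.236
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```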
For all-numerical data, the sum of squared errors (SSE), $\sum_{a=1}^{d} (v_a - w_a)^2$, is the most common distance measure.

For all-binary data, either Hamming or Jaccard is typically used.

For categorical data, we either
- first convert the data to binary by adding one binary variable per category label, and then use the Hamming distance; or
- count the agreements and disagreements of the category labels with Jaccard.

For mixed data, some combination must be used.
A distance (or dissimilarity) matrix is
- n-by-n for n objects,
- non-negative (d_{i,j} ≥ 0),
- symmetric (d_{i,j} = d_{j,i}), and
- zero on the diagonal (d_{i,i} = 0):

$$\begin{pmatrix} 0 & d_{1,2} & d_{1,3} & \cdots & d_{1,n} \\ d_{1,2} & 0 & d_{2,3} & \cdots & d_{2,n} \\ d_{1,3} & d_{2,3} & 0 & \cdots & d_{3,n} \\ \vdots & & & \ddots & \vdots \\ d_{1,n} & d_{2,n} & d_{3,n} & \cdots & 0 \end{pmatrix}$$
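A small sketch (again ours) that builds such a matrix with Euclidean distance and checks the listed properties:

```python
import numpy as np

def distance_matrix(X):
    """n-by-n matrix of pairwise Euclidean distances for an (n, d) array X."""
    diff = X[:, None, :] - X[None, :, :]      # shape (n, n, d)
    return np.sqrt((diff ** 2).sum(axis=-1))  # shape (n, n)

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = distance_matrix(X)
assert (D >= 0).all()                # non-negative
assert np.allclose(D, D.T)           # symmetric
assert np.allclose(np.diag(D), 0.0)  # zero on the diagonal
```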
2. Representative-based clustering (Aggarwal Ch. 6.3)
Exclusive representative-based clustering:
- the set of objects V is partitioned into k clusters C_1, C_2, …, C_k, with ⋃_i C_i = V and C_i ∩ C_j = ∅ for i ≠ j
- every cluster is represented by a prototype (aka centroid or mean) μ_i
- clustering quality is based on the sum of squared errors (SSE) between the objects in a cluster and the cluster prototype (a small sketch follows below):

$$SSE = \sum_{i=1}^{k} \sum_{x_j \in C_i} \left\lVert x_j - \mu_i \right\rVert_2^2 = \sum_{i=1}^{k} \sum_{x_j \in C_i} \sum_{a=1}^{d} (x_{ja} - \mu_{ia})^2$$
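As a sketch, the SSE objective can be computed directly from this definition; X, labels, and mu are our illustrative names:

```python
import numpy as np

def sse(X, labels, mu):
    """Sum of squared errors: X is (n, d), labels[j] is the cluster index
    of point x_j, and mu is a (k, d) array of cluster prototypes."""
    return sum(float(np.sum((X[labels == i] - mu[i]) ** 2))
               for i in range(len(mu)))
```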
The naïve algorithm goes like this: one by one, generate all possible clusterings, compute the squared error of each, and select the best.

Sadly, this is infeasible; there are too many possible clusterings to try:
- there are k^n different clusterings into k clusters (some possibly empty)
- the number of ways to cluster n points into k non-empty clusters is the Stirling number of the second kind,

$$S(n, k) = {n \brace k} = \frac{1}{k!} \sum_{j=0}^{k} (-1)^j \binom{k}{j} (k - j)^n$$
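A short sketch that evaluates this formula exactly (Python 3.8+ for math.comb), illustrating why exhaustive search is infeasible:

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind, S(n, k), via the formula above."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n
               for j in range(k + 1)) // factorial(k)

print(stirling2(10, 3))   # 9330 ways to split 10 points into 3 non-empty clusters
print(stirling2(100, 5))  # a 68-digit number: exhaustive search is hopeless
```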
The iterative k-means algorithm (a runnable sketch follows below):
1. select k random cluster centroids
2. assign each point to its closest centroid
3. compute the error
4. do
   a. for each cluster C_i, compute the new centroid as μ_i = (1/|C_i|) ∑_{x_j ∈ C_i} x_j
   b. for each element x_j ∈ V, assign x_j to its closest cluster centroid
5. while the error decreases
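A minimal sketch of this loop, assuming X is an (n, d) numpy float array; empty clusters are handled naively by keeping the old centroid:

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    """Iterative k-means; returns (labels, centroids)."""
    # 1. select k random cluster centroids (here: k distinct data points)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    prev_err = np.inf
    while True:
        # 2. assign each point to its closest centroid
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # 3. compute the error (SSE)
        err = d2[np.arange(len(X)), labels].sum()
        # 5. stop as soon as the error no longer decreases
        if err >= prev_err:
            return labels, mu
        prev_err = err
        # 4. recompute each centroid as the mean of its cluster
        for i in range(k):
            if (labels == i).any():  # naive guard against empty clusters
                mu[i] = X[labels == i].mean(axis=0)
```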
(Figure: k-means example on gene-expression data; axes: expression in condition 1 vs. expression in condition 2; clusters k1, k2, k3.)
k-means always converges, eventually:
- on each step the error decreases
- there is only a finite number of possible clusterings
- it converges to a local optimum

At some point a cluster can become empty (all its points are closer to some other centroid). Some options to handle this include:
- split the biggest cluster
- take the furthest point as a singleton cluster

Outliers can yield bad clusterings.
How long does iterative k-means take?
- computing the centroids takes O(nd) time: averaging over a total of n points in d-dimensional space
- computing the cluster assignment takes O(nkd) time: for each of the n points we have to compute the distances to all k centroids in d-dimensional space
- if the algorithm takes t iterations, the total running time is O(tnkd)

But how many iterations will we need?
IRDM ‘15/16
In practice the algorithm usually doesn’t need many
some hundred iterations is usually enough
The worst-case upper bound on the number of iterations is O(n^{kd}), while the worst-case lower bound is superpolynomial: 2^{Ω(√n)} iterations.

The discrepancy between practice and worst-case analysis can be (somewhat) explained with smoothed analysis: if the data is sampled from independent d-dimensional normal distributions with the same variance, iterative k-means will terminate in O(n^k) time with high probability. (Arthur & Vassilvitskii, 2006)
The k-means algorithm converges to a local optimum, which may be poor compared to the global optimum.
k-means++. The key idea: careful initial seeding.
- choose the first centroid uniformly at random from the data points
- let D(x) be the shortest distance from x to any already-selected centroid
- choose the next centroid to be x′ with probability D(x′)² / ∑_{x ∈ V} D(x)² (points that are further away are more probable to be selected)
- repeat until k centroids have been selected, and continue as the normal iterative k-means algorithm

The k-means++ algorithm achieves an O(log k) approximation ratio in expectation: E[cost] ≤ 8(ln k + 2) · OPT. It also converges fast in practice; a seeding sketch follows below.

(Arthur & Vassilvitskii ’07)
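A sketch of the seeding step under the same assumptions as the k-means sketch above; the resulting seeds would replace its random initialisation:

```python
import numpy as np

def kmeans_pp_seeds(X, k, rng=np.random.default_rng(0)):
    """k-means++ seeding: returns a (k, d) array of initial centroids."""
    # first centroid: uniformly at random from the data points
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # D(x)^2: squared distance from each point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # sample the next centroid with probability D(x)^2 / sum_x D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```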
The clusters have to be of roughly equal size.
The clusters have to be of roughly equal density.
The clusters have to be of roughly spherical shape.
3. Probabilistic clustering (Aggarwal Ch. 6.5)
Probabilistic clustering is not exclusive: every object has a certain probability (affinity) to every cluster.

It is representative, in a way: each cluster is represented by some parameters Θ, and the parameters may include (or specify) a cluster centroid.

It requires us to assume a distribution of a cluster; for now, each cluster is an independent Gaussian.

We use the expectation-maximization (EM) algorithm.
We aim at finding the model Θ, i.e. parameters μ_i and Σ_i for each d-dimensional Gaussian cluster, plus the k mixture parameters P(C_i).

The pdf of an object x in cluster C_i is

$$f_i(x) = f(x \mid \mu_i, \Sigma_i) = (2\pi)^{-d/2} \, |\Sigma_i|^{-1/2} \exp\left( -\frac{(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}{2} \right)$$

The total pdf of x is a mixture model of the k cluster Gaussians,

$$f(x) = \sum_{i=1}^{k} f_i(x) \, P(C_i) = \sum_{i=1}^{k} f(x \mid \mu_i, \Sigma_i) \, P(C_i)$$

The log-likelihood of the data D given the parameters Θ then is

$$\log P(D \mid \Theta) = \sum_{j=1}^{n} \log \left( \sum_{i=1}^{k} f(x_j \mid \mu_i, \Sigma_i) \, P(C_i) \right)$$

A sketch evaluating these quantities follows below.
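This sketch writes the pdf and log-likelihood out with numpy, so the formulas above are visible in code:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """f(x | mu, Sigma) for a single d-dimensional Gaussian."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def log_likelihood(X, mus, Sigmas, priors):
    """log P(D | Theta): sum over points of the log of the mixture density."""
    return sum(np.log(sum(p * gaussian_pdf(x, m, S)
                          for m, S, p in zip(mus, Sigmas, priors)))
               for x in X)
```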
The EM algorithm:
- Initialisation: initialise the parameters Θ randomly.
- Expectation (E) step: compute the posterior probability P(C_i | x_j) per Bayes’ theorem,
  $$P(C_i \mid x_j) = \frac{P(x_j \mid C_i) \, P(C_i)}{\sum_{a=1}^{k} P(x_j \mid C_a) \, P(C_a)}$$
- Maximisation (M) step: re-estimate Θ given the P(C_i | x_j).
- Repeat the E and M steps until convergence.
Example: 1-dimensional Gaussians. The pdf is

$$f(x \mid \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right)$$

Initialisation step: each mean μ_i is sampled u.a.r. from the possible values, σ_i² = 1, and P(C_i) = 1/k (every cluster is equiprobable).

Expectation step:

$$w_{ij} = P(C_i \mid x_j) = \frac{f(x_j \mid \mu_i, \sigma_i^2) \, P(C_i)}{\sum_{a=1}^{k} f(x_j \mid \mu_a, \sigma_a^2) \, P(C_a)}$$

Maximisation step:

$$\mu_i = \frac{\sum_{j=1}^{n} w_{ij} x_j}{\sum_{j=1}^{n} w_{ij}} \ \text{(weighted mean)}, \quad \sigma_i^2 = \frac{\sum_{j=1}^{n} w_{ij} (x_j - \mu_i)^2}{\sum_{j=1}^{n} w_{ij}} \ \text{(weighted variance)}, \quad P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n} \ \text{(fraction of weight in cluster i)}$$

A runnable sketch of these updates follows below.
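A direct sketch of these updates; w[i, j] plays the role of w_ij, and the data x is a 1-D numpy array:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_1d(x, k, iters=100, rng=np.random.default_rng(0)):
    """EM for a 1-D Gaussian mixture; returns (mu, var, prior)."""
    n = len(x)
    mu = rng.choice(x, size=k, replace=False)  # means sampled u.a.r. from the data
    var = np.ones(k)                           # sigma_i^2 = 1
    prior = np.full(k, 1.0 / k)                # P(C_i) = 1/k
    for _ in range(iters):
        # E step: w[i, j] = f(x_j | mu_i, var_i) P(C_i), normalised over i
        w = np.array([prior[i] * normal_pdf(x, mu[i], var[i]) for i in range(k)])
        w /= w.sum(axis=0)
        # M step: weighted mean, weighted variance, fraction of weight per cluster
        tot = w.sum(axis=1)
        mu = (w @ x) / tot
        var = np.array([(w[i] @ (x - mu[i]) ** 2) / tot[i] for i in range(k)])
        prior = tot / n
    return mu, var, prior
```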
(Figure: EM on 1-D data; panels: initialisation, iteration 1, iteration 2.)
If we generalise to d-dimensional Gaussians, we need to model the interactions between all dimensions: we need the covariance matrix Σ_i. In practice we only need to estimate its upper triangular part, but that still means estimating d(d+1)/2 parameters; hence, in practice the dimensions are often assumed to be independent, yielding d parameters.

The expectation step is as in 1-D, and the mean and prior P(C_i) are estimated as in 1-D. The variance of cluster C_i in dimension a is

$$\sigma_{ia}^2 = \frac{\sum_{j=1}^{n} w_{ij} (x_{ja} - \mu_{ia})^2}{\sum_{j=1}^{n} w_{ij}}$$
Iterative k-means can be seen as a special case of EM with a different cluster density function:

P(x_j | C_i) = 1 iff centroid i is the closest to point x_j

The posterior probability is then

P(C_i | x_j) = 1 iff point x_j belongs to cluster i

The parameters are the centroids and the P(C_i); the covariance matrix can be ignored.
7. Validation (Aggarwal Ch. 6.9)
Both k-means and EM require the user to define k before the algorithm is run. What if we don’t know the number of clusters beforehand?

The larger the value of k:
- the smaller the error
- the more complex the model
- the higher the risk of over-fitting
As with regression:
- hold out some random points (test set)
- run clustering on the remaining points (training set)
- compute the error with the test set included
- re-iterate with different values of k and select the one with the least error

Normally N-fold cross-validation is used: typically N = 10; the data is divided into N even-sized sets, and cross-validation is run N times, each time keeping one set as the test set and the remaining N−1 sets together as the training set. A sketch follows below.
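A rough sketch of this procedure, reusing the kmeans sketch from earlier; scoring held-out points by squared distance to the nearest learned centroid is one reasonable choice of test error, not prescribed by the slides:

```python
import numpy as np

def cv_error(X, k, N=10, rng=np.random.default_rng(0)):
    """N-fold cross-validated clustering error for a given k."""
    folds = np.array_split(rng.permutation(len(X)), N)
    err = 0.0
    for f in range(N):
        train = np.concatenate([folds[g] for g in range(N) if g != f])
        _, mu = kmeans(X[train], k)   # cluster the training folds
        # error: SSE of held-out points to their nearest learned centroid
        d2 = ((X[folds[f]][:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        err += d2.min(axis=1).sum()
    return err

# choose the k with the smallest cross-validated error, e.g.:
# best_k = min(range(2, 11), key=lambda k: cv_error(X, k))
```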
Let P_Θ(D | C) be the maximised likelihood of clustering C (obtained e.g. via the EM algorithm), and let q(C) be the number of parameters in Θ we need for C. For Gaussians with independent dimensions, q(C) = k × (d + 2): k clusters, and per cluster 1 mixture parameter P(C_i), d variances, and 1 mean (although d-dimensional, it only counts as one parameter).

Main idea: we pay for every parameter in the model.
- In Akaike’s Information Criterion (AIC) we select the k that minimises
  AIC = −log P_Θ(D | C) + q(C)
- In the Bayesian Information Criterion (BIC) we select the k that minimises
  BIC = −log P_Θ(D | C) + (q(C)/2) log n

A sketch of both criteria follows below.

(Akaike, 1974; Schwarz, 1978)
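A sketch for the independent-dimension Gaussian mixture, with q(C) = k(d + 2) as above; fit_em is a hypothetical EM fit returning the maximised log-likelihood:

```python
import numpy as np

def aic(log_lik, k, d):
    """AIC for a k-cluster Gaussian mixture with independent dimensions."""
    return -log_lik + k * (d + 2)

def bic(log_lik, k, d, n):
    """BIC: the parameter penalty additionally grows with log n."""
    return -log_lik + k * (d + 2) / 2 * np.log(n)

# given a (hypothetical) fit_em(X, k) returning the maximised log-likelihood:
# best_k = min(ks, key=lambda k: bic(fit_em(X, k), k, X.shape[1], len(X)))
```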
So far we’ve seen representative and probabilistic clustering; every type of clustering has its strengths and weaknesses. For choosing the number of clusters, cross-validation is a standard method, and AIC and BIC are principled general ways for model selection.