IRDM ’15/16
Chapter 5-2: Clustering
Jilles Vreeken
12 Nov 2015
Revision 1, November 20th: typos fixed (dendrogram)
Revision 2, December 10th: clarified that we do consider a point y a member of its own ε-neighborhood
Mid-term test
When: …th 2015
Where: Günter-Hotz Hörsaal (E2.2)
Material: the first four lectures, the first two homeworks
You are allowed to bring one (1) sheet of A4 paper with handwritten or printed notes on both sides. No other material (notes, books, course materials) or devices (calculator, notebook, cell phone, toothbrush, etc.) is allowed. Bring an ID; either your UdS card, or a passport.
Final exam
When: …th and …th 2016
Oral exam. Can only be taken when you have passed two out of three mid-term tests. More details later.
1. Basic idea
2. Representative-based clustering
3. Probabilistic clustering
4. Validation
5. Hierarchical clustering
6. Density-based clustering
7. Clustering high-dimensional data
You’ll find this covered in Aggarwal Ch. 6 and 7, and Zaki & Meira, Ch. 13–15.
Hierarchical clustering (Aggarwal Ch. 6.4)
Create a clustering for each number of clusters k = 1, 2, …, n. The clusterings must be hierarchical:
▪ every cluster of a k-clustering is a union of some clusters of an l-clustering, for all l > k
▪ i.e. for all k, and for all l > k, every cluster in an l-clustering is a subset of some cluster in the k-clustering
Example: [figures showing the same data clustered for k = 6, 5, 4, 3, 2, and 1]
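As a quick illustration of such a nested family of clusterings, here is a minimal SciPy sketch (the blob data is made up for the example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (10, 2)) for c in (0, 3, 6)])  # three blobs

Z = linkage(X, method="single")  # encodes the full merge hierarchy
for k in (6, 5, 4, 3, 2, 1):
    # cut the hierarchy into k clusters; every cluster for smaller k
    # is a union of clusters obtained for larger k
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, labels)
```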
The difference in height between the tree and its subtrees shows the distance between the two branches.
[Figure: dendrogram; the two branches merge at height ≈ 0.7, i.e. their distance is ≈ 0.7]
Dendrograms show the hierarchy of the clustering.
The number of clusters can be deduced from a dendrogram
▪ look at the higher branches
Outliers can be detected from a dendrogram
▪ single points that are far from all others
Agglomerative: bottom-up
▪ start with n clusters
▪ repeatedly combine the two closest clusters into one bigger cluster
Divisive: top-down
▪ start with 1 cluster
▪ divide the cluster into two
▪ repeatedly divide the largest (per diameter) cluster into smaller clusters
The distance between two points y and z is d(y, z). But what is the distance between two clusters? There are many intuitive definitions, and no universal truth:
▪ different cluster distances yield different clusterings
▪ the choice of cluster distance depends on the application
Some distances between clusters C and D:
▪ minimum distance: d(C, D) = min{d(y, z) : y ∈ C and z ∈ D}
▪ maximum distance: d(C, D) = max{d(y, z) : y ∈ C and z ∈ D}
▪ average distance: d(C, D) = avg{d(y, z) : y ∈ C and z ∈ D}
▪ distance of centroids: d(C, D) = d(μ_C, μ_D), where μ_C is the centroid of C and μ_D is the centroid of D
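To make these four definitions concrete, here is a small NumPy sketch (function and variable names are our own, not from the lecture):

```python
import numpy as np

def cluster_distances(C, D):
    """Four common cluster distances between point sets C and D,
    given as arrays of shape (n_C, dim) and (n_D, dim)."""
    # pairwise Euclidean distances between all points of C and D
    pairwise = np.linalg.norm(C[:, None, :] - D[None, :, :], axis=2)
    return {
        "minimum":  pairwise.min(),    # single link
        "maximum":  pairwise.max(),    # complete link
        "average":  pairwise.mean(),   # group average
        "centroid": np.linalg.norm(C.mean(axis=0) - D.mean(axis=0)),
    }

C = np.array([[0.0, 0.0], [1.0, 0.0]])
D = np.array([[3.0, 0.0], [4.0, 0.0]])
print(cluster_distances(C, D))
# {'minimum': 2.0, 'maximum': 4.0, 'average': 3.0, 'centroid': 3.0}
```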
Single link: the distance between two clusters is the distance between the closest pair of points
▪ d(C, D) = min{d(y, z) : y ∈ C and z ∈ D}
Can handle non-spherical clusters of unequal size.
Sensitive to noise and outliers; produces elongated clusters.
Complete link: the distance between two clusters is the distance between the furthest pair of points
▪ d(C, D) = max{d(y, z) : y ∈ C and z ∈ D}
Less susceptible to noise and outliers.
Breaks the largest clusters; biased towards spherical clusters.
Group average is the average distance between all pairs of points from the two clusters
▪ d(C, D) = avg{d(y, z) : y ∈ C and z ∈ D} = ( Σ_{y∈C, z∈D} d(y, z) ) / ( |C| · |D| )
Mean distance is the distance of the cluster centroids
▪ d(C, D) = d(μ_C, μ_D)
A compromise between single and complete link.
Less susceptible to noise and outliers
▪ similar to complete link
Biased towards spherical clusters
▪ similar to complete link
Ward’s distance between clusters B and C is the increase in the sum of squared errors (SSE) when the two clusters are merged
▪ the SSE of cluster B is SSE_B = Σ_{y∈B} ‖y − μ_B‖²
▪ the difference for merging clusters B and C into cluster D is then d(B, C) = ΔSSE_D = SSE_D − SSE_B − SSE_C
▪ or, equivalently, the weighted mean distance d(B, C) = ( |B|·|C| / (|B| + |C|) ) ‖μ_B − μ_C‖²
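The equivalence of the two forms is easy to check numerically; a minimal NumPy sketch (the data and names are made up):

```python
import numpy as np

def sse(X):
    """Sum of squared distances of the points in X to their centroid."""
    return ((X - X.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(0)
B = rng.normal(0, 1, size=(5, 3))
C = rng.normal(3, 1, size=(8, 3))
D = np.vstack([B, C])  # the merged cluster

delta_sse = sse(D) - sse(B) - sse(C)
weighted = (len(B) * len(C) / (len(B) + len(C))) * \
           ((B.mean(axis=0) - C.mean(axis=0)) ** 2).sum()
print(np.isclose(delta_sse, weighted))  # True: both forms agree
```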
Less susceptible to noise and outliers.
Biased towards spherical clusters.
The hierarchical analogue of k-means
▪ hence many shared pros and cons
▪ can be used to initialise k-means
[Figures: single link, group average, complete link, and Ward’s method compared on several example datasets]
After merging clusters B and C into cluster D, we need to compute D’s distance to another cluster W. The Lance-Williams formula provides a general equation for this:
d(D, W) = β_B · d(B, W) + β_C · d(C, W) + γ · d(B, C) + δ · |d(B, W) − d(C, W)|
The coefficients for the cluster distances we have seen:
▪ single link: β_B = 1/2, β_C = 1/2, γ = 0, δ = −1/2
▪ complete link: β_B = 1/2, β_C = 1/2, γ = 0, δ = 1/2
▪ group average: β_B = |B|/(|B|+|C|), β_C = |C|/(|B|+|C|), γ = 0, δ = 0
▪ mean distance: β_B = |B|/(|B|+|C|), β_C = |C|/(|B|+|C|), γ = −|B|·|C|/(|B|+|C|)², δ = 0
▪ Ward’s method: β_B = (|B|+|W|)/(|B|+|C|+|W|), β_C = (|C|+|W|)/(|B|+|C|+|W|), γ = −|W|/(|B|+|C|+|W|), δ = 0
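For instance, with the single-link coefficients the update reduces to the minimum of the two old distances; a tiny sketch (the function is our own):

```python
def lance_williams_single_link(d_BW, d_CW, d_BC):
    """Distance from the merged cluster D = B ∪ C to another cluster W,
    using the single-link coefficients β_B = β_C = 1/2, γ = 0, δ = −1/2."""
    return 0.5 * d_BW + 0.5 * d_CW - 0.5 * abs(d_BW - d_CW)

# With d(B, W) = 2.0 and d(C, W) = 5.0 this yields 2.0,
# i.e. exactly min(d(B, W), d(C, W)), as single link requires.
print(lance_williams_single_link(2.0, 5.0, 3.0))
```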
Agglomerative clustering takes O(n³) time in most cases
▪ n steps
▪ in each step, an n × n distance matrix must be updated and searched
O(n² log n) time for some approaches that use appropriate data structures
▪ e.g. keep the distances in a heap
▪ each step then takes O(n log n) time
O(n²) space complexity
▪ we have to store the distance matrix
Density-based clustering (Aggarwal Ch. 6.6)
Representative-based clustering can find only convex clusters
▪ but data may contain interesting non-convex clusters
In density-based clustering, a cluster is a “dense area of points”
▪ how to define “dense area”?
Algorithm GENERICGRID(data D, num-ranges p, min-density τ):
▪ discretise each dimension of D into p ranges
▪ determine those cells with density ≥ τ
▪ create a graph H with a node per dense cell, and add an edge if two cells are adjacent
▪ determine the connected components of H
▪ return the points in each component as a cluster
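A compact Python sketch of this generic scheme (our own implementation; we treat cells as adjacent when they differ by at most one range per dimension, an assumption the pseudocode leaves open):

```python
import numpy as np
from collections import defaultdict
from itertools import product

def generic_grid(D, p, tau):
    """Grid-based clustering following GENERICGRID: discretise each
    dimension into p ranges, keep cells with at least tau points, and
    return connected components of adjacent dense cells as clusters."""
    D = np.asarray(D, dtype=float)
    lo, hi = D.min(axis=0), D.max(axis=0)
    # assign every point to a grid cell; clip so max values land in the last range
    cells = np.minimum(((D - lo) / (hi - lo + 1e-12) * p).astype(int), p - 1)
    members = defaultdict(list)
    for i, c in enumerate(map(tuple, cells)):
        members[c].append(i)
    dense = {c for c, pts in members.items() if len(pts) >= tau}
    clusters, seen = [], set()
    for start in dense:  # flood-fill over the dense cells
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            c = stack.pop()
            comp.extend(members[c])
            for off in product((-1, 0, 1), repeat=len(c)):
                nb = tuple(a + b for a, b in zip(c, off))
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(comp)  # point indices of one cluster
    return clusters
```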
The Good
▪ we don’t have to specify k
▪ we can find arbitrarily shaped clusters
The Bad
▪ we have to specify a global minimal density τ
▪ only points in dense cells are part of clusters; all points in neighbouring sparse cells are ignored
The Ugly
▪ we consider only a single, global, rectangular-shaped grid
▪ the number of grid cells increases exponentially with dimensionality
The ε-neighbourhood of a point x of data D is the set of points of D that are within distance ε from x
▪ N_ε(x) = {y ∈ D : d(x, y) ≤ ε}; note that we count x itself as well
▪ parameter ε is set by the user
Point x ∈ D is a core point if |N_ε(x)| ≥ minpts
▪ minpts (aka τ) is a user-supplied parameter
Point x ∈ D is a border point if it is not a core point itself, but x ∈ N_ε(z) for some core point z.
A point x ∈ D that is neither a core point nor a border point is called a noise point.
(Be aware: some definitions count a point as a member of its own ε-neighborhood, some do not. Here we do.)
[Figure: a core point x, a border point, and a noise point, with their ε-neighbourhoods, for minpts = 6]
(minpts was 5, now 6, to make clear that we count x as an ε-neighbour of itself)
Point y ∈ D is directly density reachable from point x ∈ D if
▪ x is a core point
▪ y ∈ N_ε(x)
Point y ∈ D is density reachable from point x ∈ D if there is a chain of points x_0, x_1, …, x_l s.t. x = x_0, y = x_l, and x_i is directly density reachable from x_{i−1} for all i = 1, …, l
▪ not a symmetric relationship (!)
Points x, y ∈ D are density connected if there exists a core point z s.t. both x and y are density reachable from z.
A density-based cluster is a maximal set of density connected points.
[Image from Wikipedia]
DBSCAN(data D, radius ε, min-points minpts):
▪ for each unvisited point x in the data
  ▪ mark x as visited and compute N_ε(x)
  ▪ if |N_ε(x)| ≥ minpts
    ▪ EXPANDCLUSTER(x, ++clusterID)
EXPANDCLUSTER(x, ID):
▪ assign x to cluster ID and set N ← N_ε(x)
▪ for each y ∈ N
  ▪ if y is not visited
    ▪ mark y as visited; if |N_ε(y)| ≥ minpts, set N ← N ∪ N_ε(y)
  ▪ if y does not belong to any cluster
    ▪ assign y to cluster ID
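A direct, runnable translation of this pseudocode into Python (brute-force O(n²) neighbourhoods, no spatial index; label 0 marks noise; a sketch, not a tuned implementation):

```python
import numpy as np

def dbscan(D, eps, minpts):
    """Label each point with a cluster id; 0 marks noise.
    Neighbourhoods include the point itself, as defined above."""
    D = np.asarray(D, dtype=float)
    n = len(D)
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    neigh = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.zeros(n, dtype=int)      # 0 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for i in range(n):
        if visited[i] or len(neigh[i]) < minpts:
            continue
        cluster_id += 1                  # i is an unvisited core point
        visited[i] = True
        labels[i] = cluster_id
        seeds = list(neigh[i])           # EXPANDCLUSTER
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                if len(neigh[j]) >= minpts:   # j is a core point too
                    seeds.extend(neigh[j])
            if labels[j] == 0:                # border or core, not yet assigned
                labels[j] = cluster_id
    return labels
```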
DBSCAN can return either overlapping or non-overlapping clusters
▪ a border point may lie in the ε-neighbourhood of core points of several clusters; ties are broken arbitrarily
The main time complexity comes from computing the neighbourhoods
▪ in total O(n log n) with spatial index structures
▪ these do not work in high dimensions, where the worst case is O(n²)
With the neighbourhoods known, DBSCAN needs only a single pass over the data.
DBSCAN requires two parameters, ε and minpts.
minpts controls the minimum size of a cluster
▪ minpts = 1 allows singleton clusters
▪ minpts = 2 makes DBSCAN essentially a single-link clustering
▪ higher values avoid the long-and-narrow clusters of single link
ε controls the required density
▪ a single ε is not enough if the clusters are of very different density
Other types of clustering (Aggarwal Ch. 6.7–6.8)
So far we’ve seen
▪ representative-based clustering
▪ model-based clustering
▪ hierarchical clustering
▪ density-based clustering
There are many more types of clustering, including
▪ co-clustering
▪ graph clustering (Aggarwal Ch. 6.8)
▪ non-negative matrix factorisation (NMF) (Aggarwal Ch. 6.9)
But we’re not going to discuss these in IRDM.
▪ phew!
Clustering high-dimensional data (Aggarwal Ch. 7.4–7.4.2)
If we compute similarity over many dimensions, all points will be roughly equi-distant. There exist no clusters over many dimensions.
▪ or, are there?
Of course there are!
▪ data can have a much lower intrinsic dimensionality (SVD), i.e. many dimensions are noisy, irrelevant, or copies
▪ data can have clusters embedded in subsets of its dimensions
The full space of data D is its set of attributes A.
A subspace S of D is a subset of A, i.e. S ⊆ A
▪ there exist 2^|A| − 1 non-empty subspaces
A subspace cluster is a cluster C over a subspace S
▪ a group of points that is highly similar over subspace S
In full-dimensional grid-based methods, the grid cells are determined by intersecting the discretisation ranges across all dimensions. What happens for high-dimensional data?
▪ many, many grid cells will be empty
CLIQUE is a generalisation of grid-based clustering to subsets of dimensions.
CLIQUE is the first subspace clustering algorithm (Agrawal et al. 1998)
▪ partition each dimension into p ranges
▪ in each subspace we then have grid cells of the same volume
▪ subspace clusters are connected dense cells in the grid
CLIQUE uses anti-monotonicity to find dense grid cells in subspaces: the higher the dimensionality, the sparser the cells.
Main idea:
▪ every subspace we consider is a “transaction database”, and every cell is then a “transaction”. If a cell is τ-dense, the subspace “itemset” has been “bought”.
▪ we now mine frequent itemsets with minsup = 1
A-priori for subspace clusters: for every level k in the subspace lattice, we check, for all subspaces S at level k, whether S contains dense cells; but only if all subspaces S′ ⊂ S contain dense cells. If S contains dense cells, we report each group of adjacent dense cells as a cluster C over subspace S.
[Figures: the lattice over dimensions A and B, with dense clusters found in subspace A, in subspace B, and then in subspace AB]
To find dense clusters in a subspace, we only have to consider grid cells that are dense in all of its sub-spaces.
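A levelwise Python sketch of this a-priori search for dense cells (our own simplification of CLIQUE, assuming a fixed equal-width grid; all names are made up):

```python
import numpy as np
from collections import Counter
from itertools import combinations

def dense_cells_apriori(D, p, tau):
    """Levelwise (a-priori) search for dense grid cells in subspaces.
    Returns {subspace (tuple of dims): set of dense cells}; a cell in
    subspace S is kept only if all of its projections onto the
    (|S|-1)-dimensional subspaces of S are dense."""
    D = np.asarray(D, dtype=float)
    lo, hi = D.min(axis=0), D.max(axis=0)
    grid = np.minimum(((D - lo) / (hi - lo + 1e-12) * p).astype(int), p - 1)
    n_dims = D.shape[1]

    dense = {}
    for d in range(n_dims):  # level 1: dense cells per single dimension
        cnt = Counter(grid[:, d])
        cells = {(c,) for c, m in cnt.items() if m >= tau}
        if cells:
            dense[(d,)] = cells

    for level in range(2, n_dims + 1):
        for S in combinations(range(n_dims), level):
            subs = list(combinations(S, level - 1))
            if any(T not in dense for T in subs):
                continue  # a-priori pruning: a subspace of S has no dense cells
            cnt = Counter(map(tuple, grid[:, list(S)]))
            cells = {cell for cell, m in cnt.items() if m >= tau
                     and all(tuple(cell[S.index(t)] for t in T) in dense[T]
                             for T in subs)}
            if cells:
                dense[S] = cells
    return dense
```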
CLIQUE was the first subspace clustering algorithm
▪ and it shows
It produces an enormous number of clusters
▪ just like frequent itemset mining
▪ nothing like “a summary of your data”
This, however, is a general problem of subspace clustering
▪ there are exponentially many subspaces
▪ and for each subspace there are exponentially many clusters
Clustering is one of the most important and most used data analysis methods. There exist many different types of clustering
▪ we’ve seen representative, hierarchical, probabilistic, and density-based clustering
Analysis of clustering methods is often difficult. Always think about what you’re doing when you use clustering
▪ in fact, just always think about what you’re doing