12. Clustering
Foundations of Machine Learning
CentraleSupélec, Fall 2017
Chloé-Agathe Azencott
Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr
Learning objectives
- Explain what clustering algorithms can be used for.
- Explain and implement three different ways to evaluate clustering algorithms.
- Implement hierarchical clustering, discuss its various flavors.
- Implement k-means clustering, discuss its advantages and drawbacks.
- Sketch out a density-based clustering algorithm.
Goals of clustering
Group objects that are similar into clusters: classes that are unknown beforehand. E.g.
– group genes that are similarly affected by a disease
– group patients whose genes respond similarly to a disease
– group pixels in an image that belong to the same object (image segmentation).
Applications of clustering
- Understand general characteristics of the data
- Visualize the data
- Infer some properties of a data point based on how it relates to other data points
E.g.
– find subtypes of diseases
– visualize protein families
– find categories among images
– find patterns in financial transactions
– detect communities in social networks
Distances and similarities
Distances & similarities
- Assess how close / far
– data points are from each other
– a data point is from a cluster
– two clusters are from each other
- Distance metric: a function $d$ satisfying, for all $x, x', x''$:
– non-negativity: $d(x, x') \geq 0$, with $d(x, x') = 0$ iff $x = x'$
– symmetry: $d(x, x') = d(x', x)$
– triangle inequality: $d(x, x'') \leq d(x, x') + d(x', x'')$
- E.g. $L_q$ distances: $d_q(x, x') = \left( \sum_{j=1}^{p} |x_j - x'_j|^q \right)^{1/q}$
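A quick numerical illustration (a sketch in NumPy; the points and values of q are arbitrary):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])

# Lq distances for q = 1 (Manhattan), q = 2 (Euclidean), q = infinity (Chebyshev)
for q in [1, 2, np.inf]:
    print(q, np.linalg.norm(x - y, ord=q))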
Distance & similarities
- How do we get similarities?
- Transform distances into similarities?
- Kernels define similarities:
For a given mapping $\varphi$ from the space of objects $X$ to some Hilbert space $H$, the kernel between two objects $x$ and $x'$ is the inner product of their images in the feature space:
$k(x, x') = \langle \varphi(x), \varphi(x') \rangle_H$
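One common way to turn a distance into a similarity, not spelled out on the slide, is the Gaussian (RBF) construction; this is a sketch, with the bandwidth sigma as a free parameter to tune:

import numpy as np

def rbf_similarity(d, sigma=1.0):
    # Maps a distance d >= 0 to a similarity in (0, 1]:
    # identical objects (d = 0) get similarity 1, and the
    # similarity decays smoothly as the distance grows.
    return np.exp(-d**2 / (2 * sigma**2))

print(rbf_similarity(0.0), rbf_similarity(2.0))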
Pearson's correlation
- Measure of the linear correlation between two variables:
$\rho(x, x') = \frac{\sum_{j=1}^{p} (x_j - \bar{x})(x'_j - \bar{x}')}{\sqrt{\sum_{j=1}^{p} (x_j - \bar{x})^2} \sqrt{\sum_{j=1}^{p} (x'_j - \bar{x}')^2}}$
- If the features are centered ($\bar{x} = \bar{x}' = 0$), this is the normalized dot product, i.e. the cosine similarity:
$\rho(x, x') = \frac{\langle x, x' \rangle}{\lVert x \rVert \, \lVert x' \rVert}$
Pearson vs Euclidean
- Pearson's coefficient:
Profiles of similar shapes will be close to each other, even if they differ in magnitude.
- Euclidean distance:
Magnitude is taken into account.
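A small illustration of this contrast (a sketch; the example profiles are made up): two profiles with the same shape but different scales are maximally correlated yet far apart in Euclidean distance.

import numpy as np
from scipy.stats import pearsonr

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 10 * a  # same shape, ten times the magnitude

r, _ = pearsonr(a, b)      # correlation = 1.0: identical shape
d = np.linalg.norm(a - b)  # Euclidean distance: large, magnitude matters
print(r, d)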
Evaluating clusters
- Clustering is unsupervised.
- There is no ground truth. How do we evaluate the quality of a clustering algorithm?
- 1) Based on the shape of the clusters:
Points within the same cluster should be nearby/similar, and points far from each other should belong to different clusters.
- 2) Based on the stability of the clusters:
We should get the same results if we remove some data points, add noise, etc.
- 3) Based on domain knowledge:
The clusters should “make sense”.
Centroids and medoids
- Centroid: mean of the points in the cluster.
- Medoid: point in the cluster that is closest to the centroid.
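A minimal sketch of both definitions in NumPy (points is any (n, p) array):

import numpy as np

def centroid(points):
    # Mean of the points; need not be an actual data point
    return points.mean(axis=0)

def medoid(points):
    # Data point closest to the centroid, per the definition above
    c = centroid(points)
    return points[np.argmin(np.linalg.norm(points - c, axis=1))]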
Cluster shape: Tightness
- Tightness (homogeneity) of cluster $C_k$ with centroid $\mu_k$: the average distance of its points to the centroid,
$T_k = \frac{1}{|C_k|} \sum_{x \in C_k} d(x, \mu_k)$
The smaller $T_k$, the tighter the cluster.
Cluster shape: Separability
- Separability of clusters $C_k$ and $C_l$: the distance between their centroids,
$S_{kl} = d(\mu_k, \mu_l)$
The larger $S_{kl}$, the better separated the two clusters.
Cluster shape: Davies-Bouldin
- Combine cluster tightness (homogeneity) $T_k$ and cluster separation $S_{kl}$ into the Davies-Bouldin index:
$DB = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{T_k + T_l}{S_{kl}}$
The lower the index, the better the clustering.
Cluster shape: Silhouette coefficient
- $a(x)$: how well x fits in its cluster: the mean distance from x to the other points of its cluster.
- $b(x)$: how well x would fit in another cluster: the smallest, over all other clusters, of the mean distance from x to the points of that cluster.
- Silhouette coefficient: $s(x) = \frac{b(x) - a(x)}{\max(a(x), b(x))}$
– if x is very close to the other points of its cluster: s(x) ≈ 1
– if x is very close to the points in another cluster: s(x) ≈ −1
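Both shape criteria are available in scikit-learn; a minimal sketch on toy two-blob data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))      # close to 1 is better
print(davies_bouldin_score(X, labels))  # lower is better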
Cluster stability
- How many clusters?
- E.g. compare the partitions obtained with K=2 and K=3 on perturbed versions of the data (points removed, noise added): the value of K for which cluster assignments stay consistent across perturbations is the more trustworthy choice.
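One possible way to quantify this (a sketch, not the lecture's exact protocol): cluster two random subsamples, then compare the labels they assign to the points they share, e.g. with the adjusted Rand index.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_pairs=20, frac=0.8, seed=0):
    """Average agreement (ARI) between clusterings of random subsamples."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_pairs):
        # Draw two overlapping subsamples of the data
        a = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        b = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        shared = np.intersect1d(a, b)
        # Cluster each subsample independently
        la = KMeans(n_clusters=k, n_init=10).fit_predict(X[a])
        lb = KMeans(n_clusters=k, n_init=10).fit_predict(X[b])
        # Compare the labels of the points the two subsamples share
        ia = {idx: pos for pos, idx in enumerate(a)}
        ib = {idx: pos for pos, idx in enumerate(b)}
        scores.append(adjusted_rand_score([la[ia[s]] for s in shared],
                                          [lb[ib[s]] for s in shared]))
    return float(np.mean(scores))  # close to 1 = stable choice of k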
Domain knowledge
- Do the clusters match natural categories?
– Check with human expertise.
Ontology enrichment analysis
- Ontology:
Entities may be grouped, related within a hierarchy, and subdivided according to similarities and differences. Built by human experts.
- E.g.: The Gene Ontology
http://geneontology.org/
– Describes genes with a common vocabulary, organized in categories.
E.g. cellular process > cell death > programmed cell death > apoptotic process > execution phase of apoptosis
Ontology enrichment analysis
- Enrichment analysis:
Are there more data points from ontology category G in cluster C than expected by chance?
- TANGO [Tanay et al., 2003]
– Assume data points are sampled from a hypergeometric distribution.
– The probability for the intersection of G and C to contain more than t points is:
$p = \sum_{i=t+1}^{\min(|G|,|C|)} \frac{\binom{|G|}{i} \binom{n-|G|}{|C|-i}}{\binom{n}{|C|}}$
where each term is the probability of getting exactly i points from G when drawing |C| points from a total of n samples.
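This hypergeometric tail probability can be computed directly with SciPy (a sketch; the counts are toy numbers):

from scipy.stats import hypergeom

n, size_G, size_C, t = 1000, 40, 100, 10  # total points, |G|, |C|, threshold
# P(|G ∩ C| > t) under the hypergeometric null:
# population of n, |G| "successes", |C| draws
p = hypergeom.sf(t, n, size_G, size_C)
print(p)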
Hierarchical clustering
Group data over a variety of possible scales, in a multi-level hierarchy.
Construction
- Agglomerative approach (bottom-up):
Start with each element in its own cluster; iteratively join neighboring clusters.
- Divisive approach (top-down):
Start with all elements in the same cluster; iteratively separate into smaller clusters.
Dendrogram
- The results of a hierarchical clustering algorithm are presented in a dendrogram.
- The height of each U-shaped link is the distance between the two clusters it joins; branch length reflects cluster distance.
- How many clusters? Cutting the dendrogram at a given height yields a partition of the data; the choice of cutting level determines the number of clusters.
Linkage: connecting two clusters
- Single linkage: distance between the two closest points of the two clusters.
- Complete linkage: distance between the two farthest points of the two clusters.
- Average linkage: mean of the pairwise distances between points of the two clusters.
- Centroid linkage: distance between the centroids of the two clusters.
- Ward: join the clusters so as to minimize the within-cluster variance of the resulting clustering.
(See the code sketch below this list.)
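A minimal sketch with SciPy on toy data; 'ward' is one of the linkage methods listed above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

Z = linkage(X, method='ward')                     # build the full merge tree
labels = fcluster(Z, t=2, criterion='maxclust')   # cut it into 2 clusters
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) plots the dendrogram (via matplotlib).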
Example: Gene expression clustering
Breast cancer survival signature [Bergamaschi et al. 2011] (figure: heatmap of genes × patients, with rows and columns ordered by hierarchical clustering).
Hierarchical clustering
- Advantages
– No need to pre-define the number of clusters.
– Interpretability.
- Drawbacks
– Computational complexity: e.g. single/complete linkage (naive) takes at least O(pn²) to compute all pairwise distances.
– Must decide at which level of the hierarchy to split.
– Lack of robustness (unstable).
K-means
K-means clustering
- Minimize the intra-cluster variance:
$\arg\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$
where $\mu_k$ is the centroid of cluster $C_k$.
- What does this partition of the space look like? For each cluster, the points in that cluster are those that are closer to its centroid than to any other centroid.
- The space is thus divided into a Voronoi tessellation.
Lloyd's algorithm
- The k-means objective cannot be easily optimized exactly.
- We adopt a greedy strategy:
– Partition the data into K clusters at random.
– Compute the centroid of each cluster.
– Assign each point to the cluster whose centroid it is closest to.
– Repeat until cluster membership converges.
(A NumPy sketch of this loop follows below.)
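A minimal NumPy sketch of Lloyd's algorithm with random-partition initialization, as in the steps above (not an optimized implementation; an emptied cluster is simply reseeded at a random point):

import numpy as np

def lloyd_kmeans(X, k, seed=0, max_iter=100):
    """Greedy optimization of the k-means objective."""
    rng = np.random.RandomState(seed)
    labels = rng.randint(k, size=len(X))  # random initial partition
    for _ in range(max_iter):
        # Centroid of each cluster (reseed a cluster if it emptied)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.randint(len(X))] for j in range(k)])
        # Assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # membership converged
            break
        labels = new_labels
    return labels, centroids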
K-means
- Advantages
– Computational time is linear: each iteration computes K·n distances in p dimensions, i.e. O(Knp) per iteration, and the number of iterations can be small if there is indeed a cluster structure in the data.
– Easily implementable.
- Drawbacks
– Need to set K ahead of time.
– Sensitive to noise and outliers.
– Stochastic (can return a different solution on each run).
– The clusters are forced to have convex shapes.
K-means variants
- K-means++
– Seeding algorithm that initializes clusters with centroids “spread out” throughout the data.
– Gives more consistent results across runs than uniformly random initialization.
- K-medoids: use medoids instead of centroids.
- Kernel k-means: find clusters in feature space, allowing non-convex clusters in input space (figure: k-means vs kernel k-means on the same data).
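In scikit-learn, k-means++ seeding is the default; a minimal sketch (X is placeholder data):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).normal(size=(100, 2))  # placeholder data

# init='k-means++' spreads the initial centroids out;
# n_init=10 reruns the stochastic algorithm and keeps the best objective
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)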
Density-based clustering
(Figures: the same data clustered by different algorithms, using scikit-learn's sklearn.cluster module as `cluster`.)
- Hierarchical clustering: cluster.AgglomerativeClustering(linkage='average', n_clusters=3)
- k-means clustering: cluster.KMeans(n_clusters=3)
DBSCAN
- Density-based clustering: clusters are made of dense neighborhoods of points.
- ε-neighborhood of a point x: $N_\varepsilon(x) = \{ y : d(x, y) \leq \varepsilon \}$.
- Core points: points whose ε-neighborhood contains at least a minimum number of points (minPts).
- x and z are density-connected: there exists a chain of core points $x_1, \dots, x_m$ linking x to z, such that consecutive points lie in each other's ε-neighborhood.
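With scikit-learn (a minimal sketch on toy data; eps and min_samples correspond to the ε and minPts above and must be tuned to the data):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_)  # label -1 marks points considered noise (no dense neighborhood)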
Summary
- Clustering: unsupervised approach to group similar data points together.
- Evaluate clustering algorithms based on
– the shape of the clusters,
– the stability of the results,
– the consistency with domain knowledge.
- Hierarchical clustering:
– top-down / bottom-up,
– various linkage functions.
- k-means clustering tries to minimize intra-cluster variance.
- Density-based clustering clusters dense neighborhoods together.
References
- Introduction to Data Mining. P.-N. Tan, M. Steinbach, V. Kumar. Chap. 8: Cluster analysis.
https://www-users.cs.umn.edu/~kumar001/dmbook/ch8.pdf