

SLIDE 1
  • 12. Clustering

Foundations of Machine Learning, CentraleSupélec, Fall 2017. Chloé-Agathe Azencott

Centre for Computational Biology, Mines ParisTech

chloe-agathe.azencott@mines-paristech.fr

SLIDE 2

Learning objectives

  • Explain what clustering algorithms can be used for.
  • Explain and implement three different ways to evaluate clustering algorithms.
  • Implement hierarchical clustering, discuss its various flavors.
  • Implement k-means clustering, discuss its advantages and drawbacks.
  • Sketch out a density-based clustering algorithm.
SLIDE 3

Goals of clustering

Group objects that are similar into clusters: classes that are unknown beforehand.

SLIDE 4

Goals of clustering

Group objects that are similar into clusters: classes that are unknown beforehand.

SLIDE 5

Goals of clustering

Group objects that are similar into clusters: classes that are unknown beforehand. E.g.

– group genes that are similarly affected by a disease
– group patients whose genes respond similarly to a disease
– group pixels in an image that belong to the same object (image segmentation).

SLIDE 6

Applications of clustering

  • Understand general characteristics of the data
  • Visualize the data
  • Infer some properties of a data point based on how it relates to other data points. E.g.

– find subtypes of diseases
– visualize protein families
– find categories among images
– find patterns in financial transactions
– detect communities in social networks

SLIDE 7

Distances and similarities

SLIDE 8

Distances & similarities

  • Assess how close / far
– data points are from each other
– a data point is from a cluster
– two clusters are from each other

  • Distance metric
SLIDE 9

Distances & similarities

  • Assess how close / far
– data points are from each other
– a data point is from a cluster
– two clusters are from each other

  • Distance metric: d must satisfy non-negativity, identity (d(x, x') = 0 iff x = x'), symmetry, and the triangle inequality.
  • E.g. Lq distances (written out below).
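A standard reconstruction of the Lq (Minkowski) distance referred to above:

d_q(x, x') = \left( \sum_{j=1}^{p} |x_j - x'_j|^q \right)^{1/q}

q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.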

SLIDE 10

Distances & similarities

  • How do we get similarities?
SLIDE 11

Distances & similarities

  • Transform distances into similarities?
  • Kernels define similarities:

For a given mapping φ from the space of objects X to some Hilbert space H, the kernel between two objects x and x' is the inner product of their images in the feature space.
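In symbols, together with the induced distance that connects kernels back to distances:

k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}, \qquad d(x, x')^2 = k(x, x) - 2\,k(x, x') + k(x', x')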
SLIDE 12

Pearson's correlation

  • Measure of the linear correlation between two variables
  • If the features are centered: ?
SLIDE 13

Pearson's correlation

  • Measure of the linear correlation between two variables
  • If the features are centered:
  • Normalized dot product = cosine (formulas written out below)
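The standard expressions behind these bullets:

\rho(x, x') = \frac{\sum_{j=1}^{p} (x_j - \bar{x})(x'_j - \bar{x}')}{\sqrt{\sum_{j=1}^{p} (x_j - \bar{x})^2} \, \sqrt{\sum_{j=1}^{p} (x'_j - \bar{x}')^2}}

With centered features (\bar{x} = \bar{x}' = 0) this reduces to the normalized dot product \langle x, x' \rangle / (\lVert x \rVert \lVert x' \rVert), i.e. the cosine of the angle between x and x'.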
SLIDE 14

Pearson vs. Euclidean

  • Pearson's coefficient
Profiles of similar shapes will be close to each other, even if they differ in magnitude.

  • Euclidean distance
Magnitude is taken into account.
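A small illustration of the contrast (the profiles are made up): two vectors with the same shape but different magnitudes are perfectly correlated yet far apart in Euclidean distance.

import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 10 * x  # same shape as x, ten times the magnitude

r, _ = pearsonr(x, y)      # Pearson's correlation: 1.0 (identical shapes)
d = np.linalg.norm(x - y)  # Euclidean distance: ~49.3 (large magnitude gap)
print(r, d)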

SLIDE 15

Pearson vs. Euclidean

SLIDE 16

Evaluating clusters

SLIDE 17

Evaluating clusters

  • Clustering is unsupervised.
  • There is no ground truth. How do we evaluate the quality of a clustering algorithm?

SLIDE 18

Evaluating clusters

  • Clustering is unsupervised.
  • There is no ground truth. How do we evaluate the quality of a clustering algorithm?

  • 1) Based on the shape of the clusters:
Points within the same cluster should be nearby/similar and points far from each other should belong to different clusters.

  • Based on the stability of the clusters:
We should get the same results if we remove some data points, add noise, etc.

  • Based on domain knowledge:
The clusters should “make sense”.

SLIDE 19

Evaluating clusters

  • Clustering is unsupervised.
  • There is no ground truth. How do we evaluate the quality of a clustering algorithm?

  • 1) Based on the shape of the clusters:
Points within the same cluster should be nearby/similar and points far from each other should belong to different clusters.

  • Based on the stability of the clusters:
We should get the same results if we remove some data points, add noise, etc.

  • Based on domain knowledge:
The clusters should “make sense”.

SLIDE 20

Centroids and medoids

  • Centroid: mean of the points in the cluster.
  • Medoid: point in the cluster that is closest to the centroid.

SLIDE 21

Cluster shape: Tightness


SLIDE 22

Cluster shape: Tightness

Tightness of cluster C_k: T_k = \frac{1}{|C_k|} \sum_{x \in C_k} d(x, \mu_k), the average distance of the cluster's points to its centroid \mu_k.

SLIDE 23

Cluster shape: Separability


SLIDE 24

Cluster shape: Separability

Separability of clusters C_k and C_l: S_{kl} = d(\mu_k, \mu_l), the distance between their centroids.

SLIDE 25

Cluster shape: Davies-Bouldin

  • Cluster tightness (homogeneity): T_k
  • Cluster separation: S_{kl}
  • Davies-Bouldin index (lower is better): DB = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{T_k + T_l}{S_{kl}}
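scikit-learn ships an implementation of this index; a minimal sketch on synthetic data (lower scores indicate tighter, better-separated clusters):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))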

SLIDE 26

Cluster shape: Silhouette coefficient

  • a(x): how well x fits in its cluster, measured as the mean distance from x to the other points of its cluster.
  • b(x): how well x would fit in another cluster, measured as the smallest, over all other clusters, of the mean distance from x to that cluster's points.
  • s(x) = (b(x) - a(x)) / max(a(x), b(x))

  • if x is very close to the other points of its cluster: s(x) ≈ 1
  • if x is very close to the points in another cluster: s(x) ≈ -1
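In practice one often reports the mean silhouette over all points; a sketch with scikit-learn, comparing candidate numbers of clusters:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # closer to 1 = better-defined clusters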

SLIDE 27

Evaluating clusters

  • Clustering is unsupervised.
  • There is no ground truth. How do we evaluate the quality of a clustering algorithm?

  • 1) Based on the shape of the clusters:
Points within the same cluster should be nearby/similar and points far from each other should belong to different clusters.

  • 2) Based on the stability of the clusters:
We should get the same results if we remove some data points, add noise, etc.

  • Based on domain knowledge:
The clusters should “make sense”.

SLIDE 28

Cluster stability

  • How many clusters?
SLIDE 29

Cluster stability

  • K=2
  • K=3
SLIDE 30

Cluster stability

  • K=2
  • K=3
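One way to make the stability criterion concrete (a sketch, not from the slides): recluster random subsamples and compare each labeling with a reference clustering via the adjusted Rand index; values of K that give consistently high agreement are stable.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
rng = np.random.default_rng(0)

for k in (2, 3, 4):
    ref = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores = []
    for seed in range(10):
        # recluster a random 80% subsample, compare with the reference on it
        idx = rng.choice(len(X), size=240, replace=False)
        sub = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[idx])
        scores.append(adjusted_rand_score(ref[idx], sub))
    print(k, np.mean(scores))  # consistently high ARI = stable choice of K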
SLIDE 31

Evaluating clusters

  • Clustering is unsupervised.
  • There is no ground truth. How do we evaluate the quality of a clustering algorithm?

  • 1) Based on the shape of the clusters:
Points within the same cluster should be nearby/similar and points far from each other should belong to different clusters.

  • 2) Based on the stability of the clusters:
We should get the same results if we remove some data points, add noise, etc.

  • 3) Based on domain knowledge:
The clusters should “make sense”.

SLIDE 32

Domain knowledge

  • Do the clusters match natural categories?

– Check with human expertise

SLIDE 33

Ontology enrichment analysis

  • Ontology:
Entities may be grouped, related within a hierarchy, and subdivided according to similarities and differences. Built by human experts.

  • E.g.: The Gene Ontology
http://geneontology.org/

– Describes genes with a common vocabulary, organized in categories.
E.g. cellular process > cell death > programmed cell death > apoptotic process > execution phase of apoptosis

SLIDE 34

Ontology enrichment analysis

  • Enrichment analysis:
Are there more data points from ontology category G in cluster C than expected by chance?

  • TANGO [Tanay et al., 2003]
– Assume data points sampled from a hypergeometric distribution
– The probability for the intersection of G and C to contain more than t points is given on the next slide.

SLIDE 35

Ontology enrichment analysis

  • Enrichment analysis:
Are there more data points from ontology category G in cluster C than expected by chance?

  • TANGO [Tanay et al., 2003]
– Assume data points sampled from a hypergeometric distribution
– The probability for the intersection of G and C to contain more than t points is given below; each term of the sum is the probability of getting i points from G when drawing |C| points from a total of n samples.
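Reconstructed from the caption, this is the tail of a hypergeometric distribution:

P(|G \cap C| > t) = \sum_{i=t+1}^{\min(|G|,|C|)} \frac{\binom{|G|}{i} \binom{n-|G|}{|C|-i}}{\binom{n}{|C|}}

A quick check with SciPy (the sizes below are made up for illustration):

from scipy.stats import hypergeom

n, size_G, size_C, t = 1000, 50, 100, 10
# survival function: P(X > t) for X ~ Hypergeom(n total, |G| marked, |C| drawn)
print(hypergeom.sf(t, n, size_G, size_C))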

SLIDE 36

Hierarchical clustering

SLIDE 37

Hierarchical clustering

Group data over a variety of possible scales, in a multi-level hierarchy.

SLIDE 38

Construction

  • Agglomerative approach (bottom-up)
Start with each element in its own cluster.
Iteratively join neighboring clusters.

  • Divisive approach (top-down)
Start with all elements in the same cluster.
Iteratively separate into smaller clusters.

SLIDE 39

Dendrogram

  • The results of a hierarchical clustering algorithm are presented in a dendrogram.
  • Branch length = cluster distance.
SLIDE 40

Dendrogram

  • The results of a hierarchical clustering algorithm are presented in a dendrogram.
  • Height of each U-shaped link = distance between the two merged clusters.

How many clusters?

SLIDE 41

Dendrogram

  • The results of a hierarchical clustering algorithm are presented in a dendrogram.
  • Height of each U-shaped link = distance between the two merged clusters.

(Figure: cutting the dendrogram into four clusters, labeled 2, 1, 3, 4.)

SLIDE 42

Linkage: connecting two clusters

  • Single linkage: the distance between two clusters is the smallest distance between a point of one and a point of the other.
SLIDE 43

Linkage: connecting two clusters

  • Complete linkage: the distance between two clusters is the largest distance between a point of one and a point of the other.
SLIDE 44

Linkage: connecting two clusters

  • Average linkage: the distance between two clusters is the average distance over all pairs of points, one from each cluster.
SLIDE 45

Linkage: connecting two clusters

  • Centroid linkage: the distance between two clusters is the distance between their centroids.
SLIDE 46

Linkage: connecting two clusters

  • Ward: join clusters so as to minimize within-cluster variance (a sketch of all linkages follows below).
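A sketch of these linkage choices with SciPy, on synthetic data; method picks the rule named on the slides:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# method can be 'single', 'complete', 'average', 'centroid' or 'ward'
Z = linkage(X, method='ward')

dendrogram(Z)  # plot the merge tree
plt.show()

labels = fcluster(Z, t=3, criterion='maxclust')  # cut into 3 clusters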

SLIDE 47

Example: Gene expression clustering

Breast cancer survival signature [Bergamashi et al. 2011]

(Figure: gene expression heatmap, rows = genes, columns = patients, annotated with cluster labels 1 and 2.)

SLIDE 48

Hierarchical clustering

  • Advantages
– No need to pre-define the number of clusters
– Interpretability

  • Drawbacks
– Computational complexity ?

SLIDE 49

Hierarchical clustering

  • Advantages
– No need to pre-define the number of clusters
– Interpretability

  • Drawbacks
– Computational complexity
E.g. single/complete linkage (naive): at least O(pn²) to compute all pairwise distances.
– Must decide at which level of the hierarchy to split
– Lack of robustness (unstable)

SLIDE 50

K-means

SLIDE 51

K-means clustering

  • Minimize the intra-cluster variance (objective written out below)
  • What will this partition of the space look like?
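In standard form, the intra-cluster variance objective is:

\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2, \qquad \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x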
SLIDE 52

K-means clustering

  • Minimize the intra-cluster variance
  • For each cluster, the points in that cluster are those that are closer to its centroid than to any other centroid.

SLIDE 53

K-means clustering

  • Minimize the intra-cluster variance
  • Voronoi tessellation
SLIDE 54

Lloyd's algorithm

  • K-means cannot be easily optimized.
  • We adopt a greedy strategy (a sketch in code follows below):

– Partition the data into K clusters at random
– Compute the centroid of each cluster
– Assign each point to the cluster whose centroid it is closest to
– Repeat until cluster membership converges.
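A minimal NumPy sketch of the strategy above (random initial assignment, then alternate centroid updates and reassignment until membership stops changing; empty clusters are not handled):

import numpy as np

def lloyd(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) partition the data into k clusters at random
    labels = rng.integers(k, size=len(X))
    for _ in range(max_iter):
        # 2) compute the centroid of each cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 3) assign each point to the cluster whose centroid is closest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4) repeat until cluster membership converges
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids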

SLIDE 55

K-means

  • Advantages

– What is the computational time of k-means?

SLIDE 56

K-means

  • Advantages

– What is the computational time of k-means? O(Tknp): T is the number of iterations, and each iteration computes k·n distances in p dimensions. T can be small if there's indeed a cluster structure in the data.

SLIDE 57

K-means

  • Advantages
– Computational time is linear
– Easily implementable

  • Drawbacks
– Need to set K ahead of time
– What happens when there are outliers?

SLIDE 58

K-means

  • Advantages
– Computational time is linear
– Easily implementable

  • Drawbacks
– Need to set K ahead of time
– Sensitive to noise and outliers
– Stochastic (different solutions with each random initialization)
– The clusters are forced to have convex shapes

SLIDE 59

K-means variants

  • K-means++
– Seeding algorithm to initialize clusters with centroids “spread out” throughout the data.
– Randomized seeding, with guarantees on the expected quality of the resulting solution.

  • K-medoids
  • Kernel k-means
Find clusters in feature space.

(Figure: k-means vs. kernel k-means.)

SLIDE 60

Density-based clustering

SLIDE 61

Density-based clustering

SLIDE 62

Hierarchical clustering:

cluster.AgglomerativeClustering(linkage='average', n_clusters=3)

SLIDE 63

k-means clustering:

cluster.KMeans(n_clusters=3)
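Both calls, made runnable end to end on synthetic data (parameters as on the slides):

from sklearn import cluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

hier = cluster.AgglomerativeClustering(linkage='average', n_clusters=3)
print(hier.fit_predict(X)[:10])  # cluster labels of the first 10 points

km = cluster.KMeans(n_clusters=3, n_init=10)
print(km.fit_predict(X)[:10])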

SLIDE 64

DBSCAN

  • Density-based clustering: clusters are made of dense neighborhoods of points

SLIDE 65

DBSCAN

  • ε-neighborhood of x: N_ε(x) = {y : d(x, y) ≤ ε}
  • core points: points whose ε-neighborhood contains at least minPts points
  • x and z are density-connected: there is a chain of core points y_1, …, y_m such that y_1 ∈ N_ε(x), y_m ∈ N_ε(z), and y_{i+1} ∈ N_ε(y_i) for each i
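scikit-learn's DBSCAN implements these definitions; eps plays the role of ε and min_samples sets the core-point threshold (the values below are illustrative):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # cluster labels 0, 1, ...; -1 marks noise points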

SLIDE 66

Summary

  • Clustering: unsupervised approach to group similar data points together.

  • Evaluate clustering algorithms based on
– the shape of the clusters
– the stability of the results
– the consistency with domain knowledge.

  • Hierarchical clustering
– top-down / bottom-up
– various linkage functions.

  • k-means clustering tries to minimize intra-cluster variance.
  • Density-based clustering clusters dense neighborhoods together.

SLIDE 67

References

  • Introduction to Data Mining. P.-N. Tan, M. Steinbach, V. Kumar. Chap. 8: Cluster analysis. https://www-users.cs.umn.edu/~kumar001/dmbook/ch8.pdf