12. Clustering


  1. Foundations of Machine Learning
     CentraleSupélec — Fall 2017
     12. Clustering
     Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech
     chloe-agathe.azencott@mines-paristech.fr

  2. Learning objectives
     ● Explain what clustering algorithms can be used for.
     ● Explain and implement three different ways to evaluate clustering algorithms.
     ● Implement hierarchical clustering, discuss its various flavors.
     ● Implement k-means clustering, discuss its advantages and drawbacks.
     ● Sketch out a density-based clustering algorithm.

  3. Goals of clustering: group objects that are similar into clusters, i.e. classes that are unknown beforehand.


  5. Goals of clustering: group objects that are similar into clusters, i.e. classes that are unknown beforehand. E.g.:
     – group genes that are similarly affected by a disease
     – group patients whose genes respond similarly to a disease
     – group pixels in an image that belong to the same object (image segmentation).

  6. Applications of clustering
     ● Understand general characteristics of the data
     ● Visualize the data
     ● Infer some properties of a data point based on how it relates to other data points
     E.g.:
     – find subtypes of diseases
     – visualize protein families
     – find categories among images
     – find patterns in financial transactions
     – detect communities in social networks

  7. Distances and similarities

  8. Distances & similaritjes ● Assess how close / far – data points are from each other – a data point is from a cluster – two clusters are from each other ● Distance metric 8

  9. Distances & similaritjes ● Assess how close / far – data points are from each other – a data point is from a cluster – two clusters are from each other ● Distance metric symmetry triangle inequality ● E.g. Lq distances 9

  10. Distance & similaritjes ● How do we get similaritjes? 10

  11. Distance & similaritjes ● Transform distances into similaritjes? ● Kernels defjne similaritjes For a given mapping from the space of objects X to some Hilbert space H, the kernel between two objects x and x' is the inner product of their images in the feature spaces. 11

  12. Pearson's correlation
     ● Measure of the linear correlation between two variables
     ● If the features are centered: ?

  13. Pearson's correlation
     ● Measure of the linear correlation between two variables
     ● If the features are centered: $\rho(x, x') = \frac{\langle x, x' \rangle}{\|x\| \, \|x'\|}$
     ● Normalized dot product = cosine
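A quick numerical check of this identity (illustrative only): the Pearson correlation computed by NumPy matches the cosine of the two centered vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(size=50)

# Pearson correlation via NumPy
pearson = np.corrcoef(x, y)[0, 1]

# Cosine of the angle between the centered vectors
xc, yc = x - x.mean(), y - y.mean()
cosine = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(np.isclose(pearson, cosine))  # True
```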

  14. Pearson vs. Euclidean
     ● Pearson's coefficient: profiles of similar shape will be close to each other, even if they differ in magnitude.
     ● Euclidean distance: magnitude is taken into account.
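To make the contrast concrete (an illustrative example, not from the slides): two profiles with identical shape but different magnitude are perfectly correlated yet far apart in Euclidean distance.

```python
import numpy as np

profile = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
scaled = 10.0 * profile  # same shape, 10x the magnitude

print(np.corrcoef(profile, scaled)[0, 1])  # 1.0: identical shapes
print(np.linalg.norm(profile - scaled))    # large: magnitude matters
```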

  15. Pearson vs. Euclidean (illustration)

  16. Evaluating clusters

  17. Evaluating clusters
     ● Clustering is unsupervised.
     ● There is no ground truth.
     How do we evaluate the quality of a clustering algorithm?

  18. Evaluating clusters
     ● Clustering is unsupervised. There is no ground truth. How do we evaluate the quality of a clustering algorithm?
     ● 1) Based on the shape of the clusters: points within the same cluster should be nearby/similar, and points far from each other should belong to different clusters.
     ● 2) Based on the stability of the clusters: we should get the same results if we remove some data points, add noise, etc.
     ● 3) Based on domain knowledge: the clusters should “make sense”.


  20. Centroids and medoids
     ● Centroid: mean of the points in the cluster.
     ● Medoid: point in the cluster that is closest to the centroid.
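In code, the two summaries differ by a single step; this sketch follows the slide's definition of the medoid (the cluster point nearest the centroid):

```python
import numpy as np

def centroid_and_medoid(X):
    """X: (n_points, n_features) array of the points in one cluster."""
    centroid = X.mean(axis=0)                     # mean: may not be an actual data point
    dists = np.linalg.norm(X - centroid, axis=1)  # distance of each point to the centroid
    medoid = X[np.argmin(dists)]                  # actual data point closest to the centroid
    return centroid, medoid

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
print(centroid_and_medoid(X))
```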

  21. Cluster shape: Tightness (illustration)

  22. Cluster shape: Tightness
     ● Tightness of cluster $C_k$ with centroid $\mu_k$: $T_k = \frac{1}{|C_k|} \sum_{x \in C_k} d(x, \mu_k)$ (average distance of the cluster's points to its centroid).

  23. Cluster shape: Separability (illustration)

  24. Cluster shape: Separability
     ● Separability of clusters $C_k$ and $C_l$: $S_{kl} = d(\mu_k, \mu_l)$ (distance between their centroids).

  25. Cluster shape: Davies-Bouldin
     ● Cluster tightness (homogeneity): $T_k$
     ● Cluster separation: $S_{kl}$
     ● Davies-Bouldin index: $D = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{T_k + T_l}{S_{kl}}$ (lower is better).
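The index can be computed in a few lines from the definitions above (a sketch; scikit-learn also provides a ready-made sklearn.metrics.davies_bouldin_score using Euclidean distances):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: average over clusters of the worst (T_k + T_l) / S_kl ratio."""
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # T_k: average distance of cluster k's points to its centroid
    T = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                  for k, c in zip(ks, centroids)])
    score = 0.0
    for i in range(len(ks)):
        # S_kl: distance between centroids; keep the worst ratio for cluster i
        ratios = [(T[i] + T[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(ks)) if j != i]
        score += max(ratios)
    return score / len(ks)

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
print(davies_bouldin(X, labels))  # small value: tight, well-separated clusters
```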

  26. Cluster shape: Silhouette coefficient
     ● How well x fits in its cluster: $a(x)$ = average distance of x to the other points of its cluster.
     ● How well x would fit in another cluster: $b(x)$ = lowest average distance of x to the points of another cluster.
     ● Silhouette coefficient: $s(x) = \frac{b(x) - a(x)}{\max(a(x), b(x))}$
     ● If x is very close to the other points of its cluster: s(x) = 1.
     ● If x is very close to the points in another cluster: s(x) = -1.
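scikit-learn implements this coefficient directly; silhouette_score averages s(x) over all points, while silhouette_samples returns the per-point values:

```python
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

print(silhouette_score(X, labels))    # mean s(x): close to 1 for well-separated clusters
print(silhouette_samples(X, labels))  # one value per point
```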

  27. Evaluating clusters
     ● 2) Based on the stability of the clusters: we should get the same results if we remove some data points, add noise, etc.

  28. Cluster stability
     ● How many clusters?

  29. Cluster stability: comparing clusterings with K=2 and K=3 (illustration)

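One way to turn this idea into a number (a sketch, not the lecture's own protocol): recluster random subsamples and measure agreement with a reference clustering using the adjusted Rand index; the "right" K should give consistently high agreement.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_rounds=20, frac=0.8, seed=0):
    """Average agreement (ARI) between a reference k-means clustering of X
    (a NumPy array) and clusterings of random subsamples of X."""
    rng = np.random.default_rng(seed)
    reference = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=int(rng.integers(10**6))).fit_predict(X[idx])
        # ARI is invariant to label permutation, so raw labels can be compared
        scores.append(adjusted_rand_score(reference[idx], labels))
    return float(np.mean(scores))

# A stable K yields a score near 1; an unstable K yields a much lower score.
```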

  31. Evaluating clusters
     ● 3) Based on domain knowledge: the clusters should “make sense”.

  32. Domain knowledge
     ● Do the clusters match natural categories?
     – Check with human expertise

  33. Ontology enrichment analysis
     ● Ontology: entities may be grouped, related within a hierarchy, and subdivided according to similarities and differences. Built by human experts.
     ● E.g. the Gene Ontology, http://geneontology.org/
     – Describes genes with a common vocabulary, organized in categories.
     E.g. cellular process > cell death > programmed cell death > apoptotic process > execution phase of apoptosis


  35. Ontology enrichment analysis
     ● Enrichment analysis: are there more data points from ontology category G in cluster C than expected by chance?
     ● TANGO [Tanay et al., 2003]
     – Assume data points are sampled from a hypergeometric distribution.
     – The probability for the intersection of G and C to contain at least t points is:
       $p = \sum_{i=t}^{\min(|G|, |C|)} \frac{\binom{|G|}{i} \binom{n - |G|}{|C| - i}}{\binom{n}{|C|}}$
     where each term is the probability of getting i points from G when drawing |C| points from a total of n samples.
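This tail probability is exactly the survival function of SciPy's hypergeometric distribution; the sizes below are made up for illustration:

```python
from scipy.stats import hypergeom

n = 1000  # total number of samples (e.g. genes)
G = 50    # size of the ontology category
C = 100   # size of the cluster
t = 12    # observed overlap |G intersect C|

# P(at least t points of G in a random draw of C points out of n);
# sf(t - 1) gives P(X >= t) since sf(k) = P(X > k)
p_value = hypergeom.sf(t - 1, n, G, C)
print(p_value)  # small p-value: the cluster is enriched for category G
```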

  36. Hierarchical clustering

  37. Hierarchical clustering: group data over a variety of possible scales, in a multi-level hierarchy.

  38. Construction
     ● Agglomerative approach (bottom-up): start with each element in its own cluster; iteratively join neighboring clusters.
     ● Divisive approach (top-down): start with all elements in the same cluster; iteratively separate into smaller clusters.

  39. Dendrogram
     ● The results of a hierarchical clustering algorithm are presented in a dendrogram.
     ● Branch length = cluster distance.

  40. Dendrogram
     ● U height = distance.
     How many clusters?

  41. Dendrogram
     ● U height = distance.
     (Figure: cutting the dendrogram at a chosen height yields clusters 1, 2, 3, 4.)

  42. Linkage: connecting two clusters
     ● Single linkage: distance between the two closest points of the two clusters.

  43. Linkage: connecting two clusters
     ● Complete linkage: distance between the two farthest points of the two clusters.

  44. Linkage: connecting two clusters
     ● Average linkage: average of all pairwise distances between the two clusters.

  45. Linkage: connecting two clusters
     ● Centroid linkage: distance between the centroids of the two clusters.

  46. Linkage: connecting two clusters
     ● Ward: join clusters so as to minimize the within-cluster variance.
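All five linkage strategies are available in SciPy's agglomerative clustering; a usage sketch on toy data (method can be 'single', 'complete', 'average', 'centroid', or 'ward'):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),   # toy data: three well-separated blobs
               rng.normal(4, 0.5, (10, 2)),
               rng.normal(8, 0.5, (10, 2))])

Z = linkage(X, method='ward')                     # build the full hierarchy
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the dendrogram into 3 clusters
print(labels)

# dendrogram(Z) draws the tree (requires matplotlib)
```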

  47. Example: gene expression clustering
     Breast cancer survival signature [Bergamashi et al. 2011]
     (Figure: expression heatmap with hierarchical clusterings of (1) genes and (2) patients.)


  49. Hierarchical clustering
     ● Advantages:
     – No need to pre-define the number of clusters
     – Interpretability
     ● Drawbacks:
     – Computational complexity. E.g. single/complete linkage (naive): at least O(pn²) to compute all pairwise distances.
     – Must decide at which level of the hierarchy to split
     – Lack of robustness (unstable)

  50. K-means

  51. K-means clustering
     ● Minimize the intra-cluster variance: $\arg\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2$
     ● What will this partition of the space look like?

  52. K-means clustering
     ● Minimize the intra-cluster variance
     ● For each cluster, the points in that cluster are those that are closer to its centroid than to any other centroid: the centroids induce a Voronoi partition of the space.
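A minimal sketch of the standard alternating algorithm (Lloyd's algorithm) implied by these two slides: assign each point to its nearest centroid (the Voronoi partition), then move each centroid to the mean of its points, and repeat until convergence.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means (Lloyd's algorithm). Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init: k random points
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j]  # leave empty clusters in place
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.default_rng(1).normal(m, 0.5, (20, 2)) for m in (0, 5, 10)])
centroids, labels = kmeans(X, k=3)
print(centroids)
```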
