Data Mining in Bioinformatics Day 8: Clustering in Bioinformatics

  1. Data Mining in Bioinformatics, Day 8: Clustering in Bioinformatics. Clustering Gene Expression Data. Chloé-Agathe Azencott & Karsten Borgwardt. February 10 to February 21, 2014. Machine Learning & Computational Biology Research Group, Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen. Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

  2. Gene expression data. Microarray technology: high-density arrays; probes (or “reporters”, “oligos”); detection of probe-target hybridization by fluorescence or chemiluminescence, e.g. cyanine dyes Cy3 (green) / Cy5 (red).

  3. Gene expression data. Data: X, an n × m matrix of n genes and m experiments (conditions, time points, tissues, patients, cell lines).

  4. Clustering gene expression data. Group samples: group together tissues that are similarly affected by a disease; group together patients that are similarly affected by a disease. Group genes: group together functionally related genes; group together genes that are similarly affected by a disease; group together genes that respond similarly to an experimental condition.

  5. Clustering gene expression data. Applications: build regulatory networks; discover subtypes of a disease; infer unknown gene function; reduce dimensionality. Popularity: PubMed hits: 33 548 for “microarray AND clustering”, 79 201 for “"gene expression" AND clustering”. Toolboxes: MatArray, Cluster3, GeneCluster, Bioconductor, GEO tools, ...

  6. Pre-processing. Pre-filtering: eliminate poorly expressed genes; eliminate genes whose expression remains constant. Missing values: ignore; replace with random numbers; or impute, using continuity of time series or values for similar genes.
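The pre-filtering and time-series imputation steps on this slide could be sketched as follows. This is only an illustration: the function names, the thresholds `min_mean` and `min_var`, and the choice to impute a missing value as the mean of its nearest observed neighbours in time are all assumptions, not something the slides specify.

```python
from statistics import mean, pvariance

def prefilter(profiles, min_mean=1.0, min_var=0.01):
    """Drop poorly expressed genes (low mean) and near-constant genes (low variance).

    profiles maps a gene name to its list of expression values.
    The two thresholds are hypothetical and data-dependent.
    """
    return {
        gene: values
        for gene, values in profiles.items()
        if mean(values) >= min_mean and pvariance(values) >= min_var
    }

def impute_time_series(values):
    """Replace None entries by the mean of the nearest observed neighbours in time.

    Assumes at least one observed value exists; otherwise mean() raises.
    """
    out = list(values)
    for i, v in enumerate(values):
        if v is None:
            left = next((values[j] for j in range(i - 1, -1, -1)
                         if values[j] is not None), None)
            right = next((values[j] for j in range(i + 1, len(values))
                          if values[j] is not None), None)
            out[i] = mean(x for x in (left, right) if x is not None)
    return out

profiles = {"a": [5, 5, 5], "b": [0.1, 0.2, 0.1], "c": [2, 4, 6]}
kept = prefilter(profiles)                      # only gene "c" passes both filters
filled = impute_time_series([1.0, None, 3.0])   # the gap is filled from both sides
```

Imputing from "values for similar genes" would follow the same pattern, with the neighbours taken across genes instead of across time points.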

  7. Pre-processing. Normalization: $\log_2(\text{ratio})$, particularly for time series; $\log_2(\mathrm{Cy5}/\mathrm{Cy3})$ → induction and repression have opposite signs; variance normalization; differential expression.
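A small sketch of the log-ratio transform, showing why induction and repression end up with opposite signs. The helper names are hypothetical, and "variance normalization" is interpreted here, as an assumption, as centring each profile and scaling it to unit variance.

```python
import math
from statistics import mean, pstdev

def log_ratio(cy5, cy3):
    """log2(Cy5 / Cy3): positive when the red channel dominates, negative otherwise."""
    return math.log2(cy5 / cy3)

def standardize(profile):
    """Centre a profile and scale it to unit (population) variance.

    Assumes the profile is not constant, otherwise the division fails.
    """
    mu, sigma = mean(profile), pstdev(profile)
    return [(v - mu) / sigma for v in profile]

induced = log_ratio(4.0, 1.0)     # four-fold induction   ->  2.0
repressed = log_ratio(1.0, 4.0)   # four-fold repression  -> -2.0
z = standardize([1.0, 2.0, 3.0])
```

On the raw ratio scale the same two cases would be 4 and 0.25, which are not symmetric around the "no change" value of 1.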

  8. Distances. Euclidean distance: distance between genes x and y, given n samples (or distance between samples x and y, given n genes): $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$. Emphasis: magnitude. Pearson’s correlation: correlation between genes x and y, given n samples (or correlation between samples x and y, given n genes): $\rho(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$. Emphasis: shape.
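The two measures can be compared on a toy pair of profiles. This is a minimal sketch in plain Python; the function names and the example vectors are made up for illustration.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson's correlation; assumes neither profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * \
          math.sqrt(sum((b - my) ** 2 for b in y))
    return num / den

x = [1.0, 2.0, 3.0, 4.0]
y = [11.0, 12.0, 13.0, 14.0]   # same shape as x, shifted up by 10

d = euclidean(x, y)   # large: the profiles differ in magnitude
r = pearson(x, y)     # maximal: the profiles have identical shape
```

Here the Euclidean distance is large while the correlation is at its maximum, which is the sense in which the former reflects magnitude and the latter shape.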

  9. Distances. [Figure: example expression profiles annotated with d = 8.25 and d = 13.27, ρ = 0.33 and ρ = 0.79.]

  10. Clustering evaluation. Cluster shape. Cluster tightness (homogeneity): $\sum_{i=1}^{k} \underbrace{\frac{1}{|C_i|} \sum_{x \in C_i} d(x, \mu_i)}_{T_i}$. Cluster separation: $\sum_{i=1}^{k} \sum_{j=i+1}^{k} \underbrace{d(\mu_i, \mu_j)}_{S_{i,j}}$. Davies-Bouldin index: $DB := \frac{1}{k} \sum_{i=1}^{k} D_i$, with $D_i := \max_{j : j \neq i} \frac{T_i + T_j}{S_{i,j}}$.
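The Davies-Bouldin index can be computed directly from the definitions of $T_i$, $S_{i,j}$ and $D_i$. The sketch below assumes Euclidean distance for d and the arithmetic mean for the centroid $\mu_i$; the function names and the toy clusters are illustrative only.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def davies_bouldin(clusters):
    """Mean over clusters of the worst (T_i + T_j) / S_ij ratio. Lower is better.

    Assumes at least two clusters with distinct centroids.
    """
    mus = [centroid(c) for c in clusters]
    # T_i: average distance of the members of cluster i to its centroid
    T = [sum(math.dist(x, mu) for x in c) / len(c) for c, mu in zip(clusters, mus)]
    k = len(clusters)
    D = [
        max((T[i] + T[j]) / math.dist(mus[i], mus[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(D) / k

well_separated = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
close_together = [[(0, 0), (0, 1)], [(2, 0), (2, 1)]]

db_far = davies_bouldin(well_separated)
db_near = davies_bouldin(close_together)   # larger, since the clusters are closer
```

Both toy clusterings have the same tightness; only the separation changes, so the index grows as the centroids move closer.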

  11. Clustering evaluation. Cluster stability (image from [von Luxburg, 2009]): does the solution change if we perturb the data? Perturbations: bootstrap; add noise.

  12. Quality of clustering. The Gene Ontology: “The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner”. Cellular Component: where in the cell a gene acts. Molecular Function: function(s) carried out by a gene product. Biological Process: biological phenomena the gene is involved in (e.g. cell cycle, DNA replication, limb formation). Hierarchical organization (“is a”, “is part of”).

  13. Quality of clustering. GO enrichment analysis: TANGO [Tanay, 2003]. Are there more genes from a given GO class G in a given cluster C than expected by chance? Assume genes are sampled from the hypergeometric distribution: $\Pr(|C \cap G| \geq t) = 1 - \sum_{i=0}^{t-1} \frac{\binom{|G|}{i} \binom{n - |G|}{|C| - i}}{\binom{n}{|C|}}$. Correct for multiple hypothesis testing: Bonferroni is too conservative (dependencies between GO groups); use an empirical computation of the null distribution instead.
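The hypergeometric tail probability can be evaluated with exact integer binomials. This is a minimal sketch, not the TANGO implementation: the function names are hypothetical, the sum runs over the outcomes with fewer than t annotated genes, and the Bonferroni helper is included only to show the correction the slide calls too conservative.

```python
from math import comb

def enrichment_p_value(n, size_G, size_C, t):
    """P(|C ∩ G| >= t) when a cluster of size_C genes is drawn without
    replacement from n genes, size_G of which belong to the GO class.

    Assumes 0 <= t <= size_C.
    """
    fewer_than_t = sum(
        comb(size_G, i) * comb(n - size_G, size_C - i) for i in range(t)
    )
    return 1 - fewer_than_t / comb(n, size_C)

def bonferroni(p, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p * n_tests)

# 10 genes, 3 in the GO class, cluster of 4 genes, at least 1 annotated gene
p = enrichment_p_value(10, 3, 4, 1)
p_corrected = bonferroni(p, 5)
```

With t = 0 the sum is empty and the probability is 1, as expected for an event that always holds.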

  14. Quality of clustering. Gene Set Enrichment Analysis (GSEA) [Subramanian et al., 2005]. Use correlation to a phenotype y. Rank genes according to the correlation $\rho_i$ of their expression to y → $L = \{g_1, g_2, \ldots, g_n\}$. $P_{hit}(C, i) = \sum_{j : j \leq i,\, g_j \in C} \frac{|\rho_j|}{\sum_{g_j \in C} |\rho_j|}$ and $P_{miss}(C, i) = \sum_{j : j \leq i,\, g_j \notin C} \frac{1}{n - |C|}$. Enrichment score: $ES(C) = \max_i |P_{hit}(C, i) - P_{miss}(C, i)|$.
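The enrichment score is a running-sum statistic over the ranked list, which a short loop makes concrete. The sketch follows the formulas on this slide (absolute deviation, weights $|\rho_j|$); the function name, argument names and toy data are assumptions for illustration.

```python
def enrichment_score(ranked_genes, rho, gene_set):
    """Maximum absolute gap between the running hit and miss fractions.

    ranked_genes: genes ordered by their correlation to the phenotype.
    rho: dict mapping each gene to its correlation.
    gene_set: the cluster C; assumed non-empty, contained in ranked_genes,
              and not covering the whole list.
    """
    hit_weight = sum(abs(rho[g]) for g in ranked_genes if g in gene_set)
    n_miss = sum(1 for g in ranked_genes if g not in gene_set)
    p_hit = p_miss = best = 0.0
    for g in ranked_genes:
        if g in gene_set:
            p_hit += abs(rho[g]) / hit_weight   # weighted step up on a hit
        else:
            p_miss += 1 / n_miss                # uniform step on a miss
        best = max(best, abs(p_hit - p_miss))
    return best

ranked = ["g1", "g2", "g3", "g4"]
rho = {"g1": 0.9, "g2": 0.8, "g3": -0.1, "g4": -0.7}

es_top = enrichment_score(ranked, rho, {"g1", "g2"})     # set sits at the top of the list
es_spread = enrichment_score(ranked, rho, {"g1", "g4"})  # set is split across both ends
```

A set concentrated at one end of the ranking yields a score near 1, while a set spread over the list yields a smaller score.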

  15. Hierarchical clustering. Linkage. Single linkage: $d(A, B) = \min_{x \in A, y \in B} d(x, y)$. Complete linkage: $d(A, B) = \max_{x \in A, y \in B} d(x, y)$. Average (arithmetic) linkage: $d(A, B) = \sum_{x \in A, y \in B} d(x, y) / (|A||B|)$, also called UPGMA (Unweighted Pair Group Method with Arithmetic Mean). Average (centroid) linkage: $d(A, B) = d\left(\sum_{x \in A} x / |A|, \sum_{y \in B} y / |B|\right)$, also called UPGMC (Unweighted Pair-Group Method using Centroids).
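The four linkage criteria translate almost literally into code. In this sketch the base distance d is assumed to be Euclidean (`math.dist`), and the function names and the two toy clusters are illustrative.

```python
import math
from itertools import product

def single_linkage(A, B, d=math.dist):
    """Distance between the closest pair of points across the two clusters."""
    return min(d(x, y) for x, y in product(A, B))

def complete_linkage(A, B, d=math.dist):
    """Distance between the farthest pair of points across the two clusters."""
    return max(d(x, y) for x, y in product(A, B))

def average_linkage(A, B, d=math.dist):
    """UPGMA: mean of all pairwise distances across the two clusters."""
    return sum(d(x, y) for x, y in product(A, B)) / (len(A) * len(B))

def centroid_linkage(A, B, d=math.dist):
    """UPGMC: distance between the two cluster centroids."""
    ca = [sum(c) / len(A) for c in zip(*A)]
    cb = [sum(c) / len(B) for c in zip(*B)]
    return d(ca, cb)

A = [(0, 0), (0, 2)]
B = [(3, 0), (3, 2)]

values = {
    "single": single_linkage(A, B),
    "complete": complete_linkage(A, B),
    "average": average_linkage(A, B),
    "centroid": centroid_linkage(A, B),
}
```

On any pair of clusters the single-linkage value is at most the average-linkage value, which is at most the complete-linkage value.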

  16. Hierarchical clustering. Construction. Agglomerative approach (bottom-up): start with every element in its own cluster, then iteratively join nearby clusters. Divisive approach (top-down): start with a single cluster containing all elements, then recursively divide it into smaller clusters.
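The agglomerative (bottom-up) construction can be sketched as a naive loop that repeatedly merges the two closest clusters. This is an illustration only: it assumes Euclidean distance, stops at a requested number of clusters `k` rather than building the full tree, and rescans all pairs at every step, so it is slower than the implementations found in standard toolboxes.

```python
import math

def agglomerative(points, k, linkage=min):
    """Merge the two closest clusters until only k clusters remain.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    """
    clusters = [[p] for p in points]          # every element starts on its own
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(
                    math.dist(x, y) for x in clusters[i] for y in clusters[j]
                )
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # join the closest pair
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerative(points, k=3)
```

Recording the merge order and merge distances instead of stopping at `k` would yield the full hierarchy.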

  17. Hierarchical clustering. Advantages: does not require setting the number of clusters; good interpretability. Drawbacks: computationally intensive, $O(n^2 \log n^2)$; hard to decide at which level of the hierarchy to stop; lack of robustness; risk of locking in accidental features (local decisions).

  18. Hierarchical clustering. Dendrograms. [Figure: dendrogram over leaves a, b, c, d, e, f, with internal nodes bc, de, def, bcdef and root abcdef.] In biology: phylogenetic trees; sequence analysis, to infer the evolutionary history of the sequences being compared.

  19. Hierarchical clustering [Eisen et al., 1998]. Motivation: arrange genes according to similarity in pattern of gene expression; graphical display of output; efficient grouping of genes of similar functions.

  20. Hierarchical clustering [Eisen et al., 1998]. Data. Saccharomyces cerevisiae: DNA microarrays containing all ORFs; diauxic shift, mitotic cell division cycle, sporulation, temperature and reducing shocks. Human: 9 800 cDNAs representing ∼8 600 transcripts; fibroblasts stimulated with serum following serum starvation. Data pre-processing: Cy5 (red) and Cy3 (green) fluorescences → $\log_2(\mathrm{Cy5}/\mathrm{Cy3})$.

  21. Hierarchical clustering [Eisen et al., 1998]. Methods. Distance: Pearson’s correlation. Pairwise average-linkage cluster analysis. Ordering of elements: ideally, such that adjacent elements have maximal similarity (impractical); in practice, rank genes by average gene expression or chromosomal position.

  22. Hierarchical clustering [Bar-Joseph et al., 2001]. Fast optimal leaf ordering for hierarchical clustering. n leaves → $2^{n-1}$ possible orderings. Goal: maximize the sum of similarities of adjacent leaves in the ordering. Recursively find, for a node v, the cost $C(v, u_l, u_r)$ of the optimal ordering rooted at v with left-most leaf $u_l$ and right-most leaf $u_r$. Work bottom up: $C(v, u, w) = C(v_l, u, m) + C(v_r, k, w) + \sigma(m, k)$, where $v_l$ and $v_r$ are the children of v, m is the right-most leaf of the left subtree, k is the left-most leaf of the right subtree, $\sigma(m, k)$ is the similarity between m and k, and the best choice of m and k is taken. $O(n^4)$ time, $O(n^2)$ space. Early termination → $O(n^3)$.
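To make the $2^{n-1}$ count and the objective concrete, the sketch below enumerates every ordering obtained by flipping internal nodes of a small tree and keeps the one with the largest sum of adjacent similarities. This is a brute-force illustration, not the dynamic program of Bar-Joseph et al.; the tree encoding (nested tuples), the function names and the similarity values are assumptions.

```python
def orderings(tree):
    """Yield every leaf order reachable by swapping the children of internal nodes.

    A leaf is any non-tuple value; an internal node is a (left, right) tuple.
    """
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for lo in orderings(left):
        for ro in orderings(right):
            yield lo + ro   # children in their original order
            yield ro + lo   # children flipped

def best_ordering(tree, sim):
    """Ordering that maximizes the summed similarity of adjacent leaves."""
    def score(order):
        return sum(sim[frozenset(pair)] for pair in zip(order, order[1:]))
    return max(orderings(tree), key=score)

tree = (("a", "b"), ("c", "d"))
# Hypothetical integer similarities between pairs of leaves
sim = {
    frozenset("ab"): 9, frozenset("cd"): 8, frozenset("bc"): 7,
    frozenset("ac"): 1, frozenset("ad"): 2, frozenset("bd"): 3,
}

all_orders = list(orderings(tree))   # 2**(4 - 1) orderings for 4 leaves
best = best_ordering(tree, sim)
```

Enumeration is only feasible for very small trees, which is why the recursive formulation on the slide matters in practice.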

  23. Hierarchical clustering [Eisen et al., 1998]. Genes “represent” more than a mere cluster together. Genes of similar function cluster together: cluster A: cholesterol biosynthesis; cluster B: cell cycle; cluster C: immediate-early response; cluster D: signaling and angiogenesis; cluster E: tissue remodeling and wound healing.

  24. Hierarchical clustering [Eisen et al., 1998]. Cluster E: genes encoding glycolytic enzymes share a function but are not members of large protein complexes. Cluster J: mini-chromosome maintenance DNA replication complex. Cluster I: 126 genes strongly down-regulated in response to stress, 112 of which encode ribosomal proteins. Yeast responds to favorable growth conditions by increasing the production of ribosomes, through transcriptional regulation of genes encoding ribosomal proteins.

  25. Hierarchical clustering [Eisen et al., 1998]. Validation: randomized data does not cluster.
