Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics


  1. Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics. Clustering Gene Expression Data. Chloé-Agathe Azencott & Karsten Borgwardt, February 18 to March 1, 2013. Machine Learning & Computational Biology Research Group, Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen. Chloé-Agathe Azencott: Data Mining in Bioinformatics, Page 1

  2. Gene expression data. Microarray technology: high-density arrays; probes (or “reporters”, “oligos”); detect probe-target hybridization by fluorescence or chemiluminescence, e.g. cyanine dyes: Cy3 (green) / Cy5 (red).

  3. Gene expression data. Data: $X$ is an $n \times m$ matrix, with $n$ genes and $m$ experiments (conditions, time points, tissues, patients, cell lines).

  4. Clustering gene expression data. Group samples: group together tissues that are similarly affected by a disease; group together patients that are similarly affected by a disease. Group genes: group together functionally related genes; group together genes that are similarly affected by a disease; group together genes that respond similarly to an experimental condition.

  5. Clustering gene expression data. Applications: build regulatory networks; discover subtypes of a disease; infer unknown gene function; reduce dimensionality. Popularity: PubMed hits: 33 548 for “microarray AND clustering”, 79 201 for “"gene expression" AND clustering”. Toolboxes: MatArray, Cluster3, GeneCluster, Bioconductor, GEO tools, ...

  6. Pre-processing. Pre-filtering: eliminate poorly expressed genes; eliminate genes whose expression remains constant. Missing values: ignore; replace with random numbers; or impute, using continuity of time series or values for similar genes.
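
The pre-filtering and imputation steps on slide 6 could be sketched as follows. This is only an illustration: the thresholds `min_mean` and `min_var`, the function names, and the choice of linear interpolation as a way to exploit "continuity of time series" are assumptions, not something the slides specify.

```python
from statistics import mean, pvariance


def prefilter(matrix, min_mean, min_var):
    """Keep genes (rows) that are expressed well enough and are not constant.

    min_mean and min_var are hypothetical thresholds chosen by the user.
    """
    return [row for row in matrix
            if mean(row) >= min_mean and pvariance(row) >= min_var]


def interpolate_missing(series):
    """Fill None entries of a time series by linear interpolation.

    Gaps at either end are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("cannot impute a series with no observed values")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        prev = max((j for j in known if j < i), default=None)
        nxt = min((j for j in known if j > i), default=None)
        if prev is None:
            filled[i] = series[nxt]
        elif nxt is None:
            filled[i] = series[prev]
        else:
            frac = (i - prev) / (nxt - prev)
            filled[i] = series[prev] + frac * (series[nxt] - series[prev])
    return filled


genes = [[5.0, 5.0, 5.0], [1.0, 9.0, 5.0], [0.0, 0.0, 0.1]]
kept = prefilter(genes, min_mean=1.0, min_var=1.0)
imputed = interpolate_missing([1.0, None, 3.0])
```

Here the first gene is dropped because it is constant and the third because it is poorly expressed; imputing from similar genes instead would need a distance between expression profiles.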

  7. Pre-processing. Normalization: background elimination (local; global: negative controls; mismatch probes; source: Nucl. Acids Res. (2002) 30 (4): e15); mean-variance normalization; differential expression $\log_2(\mathrm{Cy5}/\mathrm{Cy3})$ → induction and repression have opposite signs; Lo(w)ess normalization to eliminate intensity bias.
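
A small sketch of two of the normalization steps on slide 7. The log ratio follows the slide directly; reading "mean-variance normalization" as scaling each profile to zero mean and unit variance is an assumption, as are the function names.

```python
import math
from statistics import mean, pstdev


def log_ratio(cy5, cy3):
    """log2(Cy5 / Cy3): positive for induction, negative for repression."""
    return math.log2(cy5 / cy3)


def standardize(profile):
    """Assumed reading of mean-variance normalization: zero mean, unit variance."""
    mu, sigma = mean(profile), pstdev(profile)
    return [(v - mu) / sigma for v in profile]


# Two-fold induction, two-fold repression, and no change.
ratios = [log_ratio(r, g) for r, g in [(200.0, 100.0), (100.0, 200.0), (150.0, 150.0)]]
print(ratios)  # [1.0, -1.0, 0.0]

z = standardize([1.0, 2.0, 3.0])
```

The symmetric values for a two-fold change in either direction are the reason for taking the logarithm of the ratio rather than the ratio itself.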

  8. Distances. Euclidean distance: distance between genes $x$ and $y$, given $n$ samples (or distance between samples $x$ and $y$, given $n$ genes): $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$. Emphasis: magnitude. Pearson’s correlation: correlation between genes $x$ and $y$, given $n$ samples (or correlation between samples $x$ and $y$, given $n$ genes): $\rho(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$. Emphasis: shape.
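
To make the "magnitude" versus "shape" contrast on slide 8 concrete, here is a minimal plain-Python sketch of both measures. The function names and the toy profiles are illustrative only.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                    * sum((b - mean_y) ** 2 for b in y))
    return num / den


# Same shape, very different magnitude.
low = [1.0, 2.0, 3.0, 4.0]
high = [10.0, 20.0, 30.0, 40.0]
d = euclidean(low, high)          # large: the profiles are far apart in magnitude
dissimilarity = 1.0 - pearson(low, high)  # close to 0: the profiles have the same shape
```

Using $1 - \rho$ as a dissimilarity puts the correlation on the same "smaller means more similar" footing as the Euclidean distance.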

  9. Distances. [Figure: example expression profiles annotated with $d = 8.25$ and $d = 13.27$, and with $1 - \rho = 0.67$ and $1 - \rho = 0.21$.]

  10. Clustering evaluation. Cluster shape. Cluster tightness (homogeneity): $\sum_{i=1}^{k} \underbrace{\frac{1}{|C_i|} \sum_{x \in C_i} d(x, \mu_i)}_{T_i}$. Cluster separation: $\sum_{i=1}^{k} \sum_{j=i+1}^{k} \underbrace{d(\mu_i, \mu_j)}_{S_{i,j}}$. Davies-Bouldin index: $DB := \frac{1}{k} \sum_{i=1}^{k} D_i$, with $D_i := \max_{j : j \neq i} \frac{T_i + T_j}{S_{i,j}}$.
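
The Davies-Bouldin index from slide 10 can be computed directly from its definition. The sketch below assumes Euclidean distance for $d$, at least two clusters, and distinct centroids; the function names are illustrative.

```python
import math


def centroid(cluster):
    """Coordinate-wise mean mu_i of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def tightness(cluster):
    """T_i: average distance of the cluster's points to its centroid."""
    mu = centroid(cluster)
    return sum(math.dist(x, mu) for x in cluster) / len(cluster)


def davies_bouldin(clusters):
    """DB = (1/k) * sum_i max_{j != i} (T_i + T_j) / S_ij, with S_ij = d(mu_i, mu_j)."""
    k = len(clusters)
    mus = [centroid(c) for c in clusters]
    t = [tightness(c) for c in clusters]
    worst = [max((t[i] + t[j]) / math.dist(mus[i], mus[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k


clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
db = davies_bouldin(clusters)  # T_1 = T_2 = 1 and S_12 = 10, so DB is about 0.2
```

Lower values indicate tight, well-separated clusters, since the numerator measures tightness and the denominator separation.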

  11. Clustering evaluation. Cluster stability (image from [von Luxburg, 2009]): does the solution change if we perturb the data? Perturbations: bootstrap; add noise.

  12. Quality of clustering. The Gene Ontology: “The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner”. Cellular Component: where in the cell a gene acts. Molecular Function: function(s) carried out by a gene product. Biological Process: biological phenomena the gene is involved in (e.g. cell cycle, DNA replication, limb formation). Hierarchical organization (“is a”, “is part of”).

  13. Quality of clustering. GO enrichment analysis: TANGO [Tanay, 2003]. Are there more genes from a given GO class $G$ in a given cluster $C$ than expected by chance? Assume genes are sampled from the hypergeometric distribution: $\Pr(|C \cap G| \geq t) = 1 - \sum_{i=0}^{t-1} \frac{\binom{|G|}{i} \binom{n - |G|}{|C| - i}}{\binom{n}{|C|}}$. Correct for multiple hypothesis testing: Bonferroni is too stringent (dependencies between GO groups); use empirical computation of the null distribution.
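
A minimal sketch of the hypergeometric tail probability behind the enrichment test on slide 13. It sums the upper tail directly, which equals one minus the probability of seeing fewer than $t$ shared genes. The function and argument names are illustrative and are not TANGO's API; the multiple-testing correction is not covered here.

```python
from math import comb


def enrichment_p_value(n, size_g, size_c, t):
    """Pr(|C ∩ G| >= t) when a cluster of size_c genes is drawn at random
    from n genes, size_g of which belong to the GO class G."""
    total = comb(n, size_c)
    upper = sum(comb(size_g, i) * comb(n - size_g, size_c - i)
                for i in range(t, min(size_g, size_c) + 1))
    return upper / total


# 10 genes, 5 in the GO class, a cluster of 5 that contains all of them:
# only 1 of the comb(10, 5) = 252 possible clusters does that.
p = enrichment_p_value(n=10, size_g=5, size_c=5, t=5)
```

Summing the upper tail avoids the loss of precision that comes from subtracting a number close to one when the p-value is very small.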

  14. Quality of clustering. Gene Set Enrichment Analysis (GSEA) [Subramanian et al., 2005]. Use correlation to a phenotype $y$. Rank genes according to the correlation $\rho_i$ of their expression to $y$ → $L = \{g_1, g_2, \dots, g_n\}$. $P_{\mathrm{hit}}(C, i) = \sum_{j : j \leq i,\, g_j \in C} \frac{|\rho_j|}{\sum_{g_l \in C} |\rho_l|}$ and $P_{\mathrm{miss}}(C, i) = \sum_{j : j \leq i,\, g_j \notin C} \frac{1}{n - |C|}$. Enrichment score: $ES(C) = \max_i |P_{\mathrm{hit}}(C, i) - P_{\mathrm{miss}}(C, i)|$.
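
The running sums on slide 14 translate into a single pass over the ranked gene list. This sketch follows the slide's definitions only; it is not the full published GSEA procedure, and the names are illustrative. It assumes the gene set is non-empty and does not cover every gene.

```python
def enrichment_score(ranked_genes, rho, gene_set):
    """ES(C) = max_i |P_hit(C, i) - P_miss(C, i)| over the ranked list L."""
    hit_total = sum(abs(rho[g]) for g in ranked_genes if g in gene_set)
    n_miss = sum(1 for g in ranked_genes if g not in gene_set)
    p_hit = p_miss = best = 0.0
    for gene in ranked_genes:
        if gene in gene_set:
            p_hit += abs(rho[gene]) / hit_total   # weighted step for a hit
        else:
            p_miss += 1.0 / n_miss                # uniform step for a miss
        best = max(best, abs(p_hit - p_miss))
    return best


# Hypothetical correlations of four genes with the phenotype.
rho = {"a": 0.9, "b": 0.8, "c": -0.1, "d": -0.7}
ranked = sorted(rho, key=rho.get, reverse=True)   # ['a', 'b', 'c', 'd']
es_top = enrichment_score(ranked, rho, {"a", "b"})
es_mixed = enrichment_score(ranked, rho, {"a", "d"})
```

A set concentrated at the top of the ranking (`{"a", "b"}`) reaches a score near 1, while a set split between both ends scores lower.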

  15. Hierarchical clustering. Linkage: single linkage: $d(A, B) = \min_{x \in A, y \in B} d(x, y)$; complete linkage: $d(A, B) = \max_{x \in A, y \in B} d(x, y)$; average (arithmetic) linkage: $d(A, B) = \sum_{x \in A, y \in B} d(x, y) / (|A||B|)$, also called UPGMA (Unweighted Pair Group Method with Arithmetic Mean); average (centroid) linkage: $d(A, B) = d\left(\sum_{x \in A} x / |A|, \sum_{y \in B} y / |B|\right)$, also called UPGMC (Unweighted Pair-Group Method using Centroids).
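
The four linkage criteria on slide 15 differ only in how pairwise distances are aggregated. A plain-Python sketch, assuming Euclidean distance between points and illustrative function names:

```python
import math


def single_linkage(a, b):
    """Smallest distance between a point of A and a point of B."""
    return min(math.dist(x, y) for x in a for y in b)


def complete_linkage(a, b):
    """Largest distance between a point of A and a point of B."""
    return max(math.dist(x, y) for x in a for y in b)


def average_linkage(a, b):
    """UPGMA: mean of all |A||B| pairwise distances."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def centroid_linkage(a, b):
    """UPGMC: distance between the centroids of A and B."""
    def centroid(cluster):
        return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
    return math.dist(centroid(a), centroid(b))


A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (3.0, 1.0)]
values = {f.__name__: f(A, B)
          for f in (single_linkage, complete_linkage, average_linkage, centroid_linkage)}
```

For these two clusters the single and centroid linkages both give 3, the complete linkage gives $\sqrt{10}$, and the arithmetic average falls between the two.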

  16. Hierarchical clustering. Construction. Agglomerative approach (bottom-up): start with every element in its own cluster, then iteratively join nearby clusters. Divisive approach (top-down): start with a single cluster containing all elements, then recursively divide it into smaller clusters.
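
The agglomerative approach on slide 16 can be sketched as a naive loop that repeatedly merges the two closest clusters and records each merge. This is an illustration only: it recomputes all pairwise linkages at every step, uses average linkage with Euclidean distance as an assumed default, and the function names are made up for the example.

```python
import math
from itertools import combinations


def average_linkage(a, b):
    """Mean of all pairwise Euclidean distances between clusters a and b."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def agglomerate(points, linkage=average_linkage):
    """Bottom-up clustering; returns the merge history (left, right, distance)."""
    clusters = [[p] for p in points]          # every element starts on its own
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
history = agglomerate(points)
```

The merge history is the information a dendrogram displays; cutting it at a chosen distance yields a flat clustering.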

  17. Hierarchical clustering. Advantages: does not require setting the number of clusters; good interpretability. Drawbacks: computationally intensive, $O(n^2 \log n^2)$; hard to decide at which level of the hierarchy to stop; lack of robustness; risk of locking in accidental features (local decisions).

  18. Hierarchical clustering. Dendrograms. In biology: phylogenetic trees; sequence analysis (infer the evolutionary history of sequences being compared). [Figure: dendrogram over leaves a, b, c, d, e, f with internal nodes bc, de, def, bcdef, abcdef.]

  19. Hierarchical clustering [Eisen et al., 1998]. Motivation: arrange genes according to similarity in pattern of gene expression; graphical display of output; efficient grouping of genes of similar functions.

  20. Hierarchical clustering [Eisen et al., 1998]. Data. Saccharomyces cerevisiae: DNA microarrays containing all ORFs; diauxic shift, mitotic cell division cycle, sporulation, temperature and reducing shocks. Human: 9 800 cDNAs representing ∼8 600 transcripts; fibroblasts stimulated with serum following serum starvation. Data pre-processing: Cy5 (red) and Cy3 (green) fluorescences → $\log_2(\mathrm{Cy5}/\mathrm{Cy3})$.

  21. Hierarchical clustering [Eisen et al., 1998]. Methods. Distance: Pearson’s correlation. Pairwise average-linkage cluster analysis. Ordering of elements: ideally, such that adjacent elements have maximal similarity (impractical); in practice, weight genes by average gene expression, chromosomal position.

  22. Hierarchical clustering [Bar-Joseph et al., 2001]. Fast optimal leaf ordering for hierarchical clustering. $n$ leaves → $2^{n-1}$ possible orderings. Goal: maximize the sum of similarities of adjacent leaves in the ordering. Recursively find, for a node $v$, the cost $C(v, u_l, u_r)$ of the optimal ordering rooted at $v$ with left-most leaf $u_l$ and right-most leaf $u_r$. Work bottom up: $C(v, u, w) = \max_{m, k} \left[ C(v_l, u, m) + C(v_r, k, w) + \sigma(m, k) \right]$, where $m$ is the right-most leaf of the left subtree $v_l$ and $k$ the left-most leaf of the right subtree $v_r$. $O(n^4)$ time, $O(n^2)$ space. Early termination → $O(n^3)$.
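
The bottom-up recurrence on slide 22 can be illustrated with a direct, unoptimized dynamic program. The sketch below is not the authors' implementation: it represents the tree as nested pairs with integer leaves (an assumption), stores the full ordering alongside each score (so it uses more than the $O(n^2)$ space quoted on the slide), and has no early termination.

```python
def leaf_orderings(tree, sim):
    """For a subtree, map (left-most leaf, right-most leaf) to the best
    (score, ordering) among orderings obtained by flipping internal nodes."""
    if not isinstance(tree, tuple):
        return {(tree, tree): (0, [tree])}    # a single leaf
    left, right = (leaf_orderings(child, sim) for child in tree)
    best = {}
    # Either child may come first, which corresponds to flipping this node.
    for first, second in ((left, right), (right, left)):
        for (u, m), (score1, order1) in first.items():
            for (k, w), (score2, order2) in second.items():
                score = score1 + score2 + sim[m][k]   # m and k become adjacent
                if (u, w) not in best or score > best[(u, w)][0]:
                    best[(u, w)] = (score, order1 + order2)
    return best


def optimal_leaf_order(tree, sim):
    """Best (score, ordering) over all choices of boundary leaves."""
    return max(leaf_orderings(tree, sim).values(), key=lambda entry: entry[0])


# Hypothetical similarities: leaves 0 and 2 are very similar across subtrees.
sim = [[0, 1, 5, 0],
       [1, 0, 0, 0],
       [5, 0, 0, 1],
       [0, 0, 1, 0]]
score, order = optimal_leaf_order(((0, 1), (2, 3)), sim)
```

In this toy example the best orderings place leaves 0 and 2 next to each other, so the default order 0, 1, 2, 3 is flipped inside the left subtree.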

  23. Hierarchical clustering [Eisen et al., 1998]. Genes “present” more than once cluster together. Genes of similar function cluster together: cluster A: cholesterol biosynthesis; cluster B: cell cycle; cluster C: immediate-early response; cluster D: signaling and angiogenesis; cluster E: tissue remodeling and wound healing.

  24. Hierarchical clustering [Eisen et al., 1998]. Cluster E: genes encoding glycolytic enzymes share a function but are not members of large protein complexes. Cluster J: mini-chromosome maintenance DNA replication complex. Cluster I: 126 genes strongly down-regulated in response to stress, 112 of which encode ribosomal proteins. Yeast responds to favorable growth conditions by increasing the production of ribosomes, through transcriptional regulation of genes encoding ribosomal proteins.
