Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics

Data Mining in Bioinformatics
Day 7: Clustering in Bioinformatics
Clustering Gene Expression Data
Chloé-Agathe Azencott & Karsten Borgwardt
February 18 to March 1, 2013
Machine Learning & Computational Biology Research Group, Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen
Chloé-Agathe Azencott: Data Mining in Bioinformatics, Page 1

Gene expression data
Microarray technology
  - High-density arrays
  - Probes (or “reporters”, “oligos”)
  - Detect probe-target hybridization
  - Fluorescence, chemiluminescence
  - E.g. cyanine dyes: Cy3 (green) / Cy5 (red)

Gene expression data
Data
  - X: n × m matrix
  - n genes
  - m experiments: conditions, time points, tissues, patients, cell lines

Clustering gene expression data
Group samples
  - Group together tissues that are similarly affected by a disease
  - Group together patients that are similarly affected by a disease
Group genes
  - Group together functionally related genes
  - Group together genes that are similarly affected by a disease
  - Group together genes that respond similarly to an experimental condition

Clustering gene expression data
Applications
  - Build regulatory networks
  - Discover subtypes of a disease
  - Infer unknown gene function
  - Reduce dimensionality
Popularity
  - PubMed hits: 33 548 for “microarray AND clustering”, 79 201 for “"gene expression" AND clustering”
  - Toolboxes: MatArray, Cluster3, GeneCluster, Bioconductor, GEO tools, ...

Pre-processing
Pre-filtering
  - Eliminate poorly expressed genes
  - Eliminate genes whose expression remains constant
Missing values
  - Ignore
  - Replace with random numbers
  - Impute
    - Continuity of time series
    - Values for similar genes
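
The pre-filtering and imputation steps can be sketched in a few lines of Python. This is a minimal illustration, not a method prescribed by the slides: the function names, the `min_range` threshold, the use of `None` for missing values, and the choice of linear interpolation for "continuity of time series" are all assumptions.

```python
def filter_constant_genes(X, min_range=1.0):
    """Keep rows (genes) whose observed values span at least min_range.

    X is a list of rows; missing measurements are encoded as None
    (an assumption of this sketch).
    """
    kept = []
    for row in X:
        observed = [v for v in row if v is not None]
        if observed and max(observed) - min(observed) >= min_range:
            kept.append(row)
    return kept


def interpolate_missing(series):
    """Fill None entries of a time series by linear interpolation.

    Gaps at either end copy the nearest observed value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled
```

Imputing from "values for similar genes" would instead average the corresponding entries of the nearest rows under one of the distances defined later in the deck.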

Pre-processing
Normalization
  - Background elimination
    - Local
    - Global: negative controls
    - Mismatch probes
  - Mean-variance normalization
  - Differential expression: $\log_2(\mathrm{Cy5}/\mathrm{Cy3})$ → induction and repression have opposite signs
  - Lo(w)ess normalization to eliminate intensity bias
[Figure source: Nucl. Acids Res. (2002) 30 (4): e15]
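
As a small worked example of two of these steps, the sketch below computes the log ratio of the two channels and a mean-variance (z-score) normalization of one expression profile. The function names are hypothetical, and using the population standard deviation is an assumption of the sketch.

```python
import math


def log_ratio(cy5, cy3):
    """log2(Cy5 / Cy3): positive for induction, negative for repression."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)


def standardize(values):
    """Mean-variance normalization: shift to mean 0, scale to variance 1."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # population variance
    if var == 0:
        raise ValueError("constant profile cannot be standardized")
    sd = math.sqrt(var)
    return [(v - mean) / sd for v in values]


# A four-fold induction and a four-fold repression are symmetric around 0.
up = log_ratio(400.0, 100.0)    # 2.0
down = log_ratio(100.0, 400.0)  # -2.0
```

The symmetry of `up` and `down` is the point of taking the logarithm: the raw ratios 4 and 0.25 are not equidistant from 1.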

Distances
Euclidean distance
  - Distance between genes x and y, given n samples (or distance between samples x and y, given n genes)
  - $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
  - Emphasis: magnitude
Pearson’s correlation
  - Correlation between genes x and y, given n samples (or correlation between samples x and y, given n genes)
  - $\rho(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
  - Emphasis: shape
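
The contrast between "magnitude" and "shape" is easy to see on two profiles that differ only by a constant offset. The following sketch implements both measures from their definitions; the example profiles `low` and `high` are made up for illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient between two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


# Same shape, different magnitude: far apart in Euclidean terms,
# yet perfectly correlated (1 - rho is approximately 0).
low = [1.0, 2.0, 3.0, 2.0]
high = [11.0, 12.0, 13.0, 12.0]
d = euclidean(low, high)
dissimilarity = 1.0 - pearson(low, high)
```

Here `d` is 20 while `dissimilarity` is zero up to rounding, so a correlation-based clustering would group the two profiles and a Euclidean one would not.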

Distances
[Figure: example expression profiles compared under both measures; annotations d = 8.25, d = 13.27 and 1 − ρ = 0.67, 1 − ρ = 0.21]

Clustering evaluation
Cluster shape
  - Cluster tightness (homogeneity): $\sum_{i=1}^{k} \underbrace{\frac{1}{|C_i|} \sum_{x \in C_i} d(x, \mu_i)}_{T_i}$
  - Cluster separation: $\sum_{i=1}^{k} \sum_{j=i+1}^{k} \underbrace{d(\mu_i, \mu_j)}_{S_{i,j}}$
  - Davies-Bouldin index: $DB := \frac{1}{k} \sum_{i=1}^{k} D_i$, with $D_i := \max_{j : j \neq i} \frac{T_i + T_j}{S_{i,j}}$
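
A direct transcription of the Davies-Bouldin index into Python might look as follows. It is a sketch under two assumptions: d is the Euclidean distance, and each cluster is given as a non-empty list of points (the function names are hypothetical).

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def davies_bouldin(clusters):
    """Davies-Bouldin index of a clustering; lower values are better."""
    k = len(clusters)
    if k < 2:
        raise ValueError("need at least two clusters")
    mus = [centroid(c) for c in clusters]
    # T[i]: mean distance of the members of cluster i to its centroid
    T = [sum(math.dist(x, mu) for x in c) / len(c)
         for c, mu in zip(clusters, mus)]
    # D[i]: worst ratio of summed tightness to separation S[i][j]
    D = [max((T[i] + T[j]) / math.dist(mus[i], mus[j])
             for j in range(k) if j != i)
         for i in range(k)]
    return sum(D) / k


# Two tight, well-separated 1-D clusters: T_1 = T_2 = 1, S_12 = 10.
score = davies_bouldin([[[0.0], [2.0]], [[10.0], [12.0]]])
```

For this toy input both D_i equal (1 + 1) / 10, so `score` is 0.2 up to rounding; moving the clusters closer together or spreading their members out raises the index.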

Clustering evaluation
Cluster stability
  - Does the solution change if we perturb the data?
    - Bootstrap
    - Add noise
[Image from [von Luxburg, 2009]]
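
One way to quantify "does the solution change" is to re-cluster a perturbed copy of the data and compare the two label vectors. The slides do not name a comparison measure; the sketch below uses the Rand index (fraction of sample pairs on which two clusterings agree) as an assumed choice, and `add_noise` is a hypothetical helper for the "add noise" perturbation.

```python
import random
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of pairs that are grouped consistently in both clusterings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def add_noise(X, sigma, rng):
    """Return a copy of X with Gaussian noise of std sigma on every entry."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in X]


# Relabelled but identical partitions agree on every pair.
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that cut across each other agree on only 2 of 6 pairs.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A stability check would call the clustering routine on `X` and on `add_noise(X, sigma, random.Random(seed))` for several seeds and report the average Rand index; values near 1 suggest a stable solution.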

Quality of clustering
The Gene Ontology
“The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner”
  - Cellular Component: where in the cell a gene acts
  - Molecular Function: function(s) carried out by a gene product
  - Biological Process: biological phenomena the gene is involved in (e.g. cell cycle, DNA replication, limb formation)
  - Hierarchical organization (“is a”, “is part of”)

Quality of clustering
GO enrichment analysis: TANGO [Tanay, 2003]
  - Are there more genes from a given GO class in a given cluster than expected by chance?
  - Assume genes sampled from the hypergeometric distribution (n genes in total, GO class G, cluster C):
    $\Pr(|C \cap G| \geq t) = 1 - \sum_{i=0}^{t-1} \frac{\binom{|G|}{i} \binom{n - |G|}{|C| - i}}{\binom{n}{|C|}}$
  - Correct for multiple hypothesis testing
    - Bonferroni too stringent (dependencies between GO groups)
    - Empirical computation of the null distribution
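
The hypergeometric tail probability Pr(|C ∩ G| ≥ t) can be evaluated exactly with integer binomial coefficients. The sketch below is an illustration only, not the TANGO implementation: the function name and the argument names (`n_genes`, `class_size`, `cluster_size`, `overlap`) are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_pvalue(n_genes, class_size, cluster_size, overlap):
    """P(at least `overlap` class members in the cluster) under the
    hypergeometric null: the cluster is a random draw of `cluster_size`
    genes out of `n_genes`, of which `class_size` belong to the GO class.
    """
    total = comb(n_genes, cluster_size)
    # Probability mass of observing strictly fewer than `overlap` members.
    upper = min(overlap, cluster_size + 1)
    below = sum(
        comb(class_size, i) * comb(n_genes - class_size, cluster_size - i)
        for i in range(upper)
    )
    return 1.0 - below / total


# 10 genes, 5 in the GO class, cluster of 5 containing all 5 class members:
# only one of the comb(10, 5) = 252 possible clusters is that extreme.
p = enrichment_pvalue(10, 5, 5, 5)
```

Because the sum is taken over exact integers before the single division, the result does not suffer from accumulated rounding even for genome-sized n.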

Quality of clustering
Gene Set Enrichment Analysis (GSEA) [Subramanian et al., 2005]
  - Use correlation to a phenotype y
  - Rank genes according to the correlation $\rho_i$ of their expression to y → $L = \{g_1, g_2, \ldots, g_n\}$
  - $P_{hit}(C, i) = \sum_{j : j \leq i,\, g_j \in C} \frac{|\rho_j|}{\sum_{g_l \in C} |\rho_l|}$
  - $P_{miss}(C, i) = \sum_{j : j \leq i,\, g_j \notin C} \frac{1}{n - |C|}$
  - Enrichment score: $ES(C) = \max_i |P_{hit}(C, i) - P_{miss}(C, i)|$
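
The enrichment score is a running-sum statistic and can be computed in one pass over the ranked list. The sketch below follows the definitions of P_hit and P_miss given on this slide; it is not the reference GSEA software, the function name is hypothetical, and |C| is taken to be the number of set members that occur in the ranked list.

```python
def enrichment_score(ranked_genes, rho, gene_set):
    """ES(C) = max_i |P_hit(C, i) - P_miss(C, i)|.

    ranked_genes: genes ordered by their correlation to the phenotype
    rho:          correlation of each gene, in the same order
    gene_set:     the set C being tested
    """
    n = len(ranked_genes)
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == n:
        raise ValueError("gene set must cover some but not all ranked genes")
    hit_norm = sum(abs(r) for r, h in zip(rho, in_set) if h)
    if hit_norm == 0:
        raise ValueError("all correlations in the gene set are zero")
    miss_step = 1.0 / (n - n_hits)

    p_hit = p_miss = best = 0.0
    for r, h in zip(rho, in_set):
        if h:
            p_hit += abs(r) / hit_norm   # weighted step up on a hit
        else:
            p_miss += miss_step          # uniform step on a miss
        best = max(best, abs(p_hit - p_miss))
    return best


# A set concentrated at the top of the ranking reaches the maximal score.
genes = ["a", "b", "c", "d"]
corr = [0.9, 0.8, 0.1, -0.5]
es_top = enrichment_score(genes, corr, {"a", "b"})
```

For this toy ranking `es_top` is 1 up to rounding, because both set members are seen before any non-member.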

Hierarchical clustering
Linkage
  - Single linkage: $d(A, B) = \min_{x \in A,\, y \in B} d(x, y)$
  - Complete linkage: $d(A, B) = \max_{x \in A,\, y \in B} d(x, y)$
  - Average (arithmetic) linkage: $d(A, B) = \frac{1}{|A||B|} \sum_{x \in A,\, y \in B} d(x, y)$, also called UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
  - Average (centroid) linkage: $d(A, B) = d\left(\frac{1}{|A|} \sum_{x \in A} x,\; \frac{1}{|B|} \sum_{y \in B} y\right)$, also called UPGMC (Unweighted Pair-Group Method using Centroids)
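
The four linkage criteria differ only in how they aggregate pairwise distances, which a short sketch makes concrete. It assumes d is the Euclidean distance and that clusters are lists of points; the function names and the example clusters `A` and `B` are made up for illustration.

```python
import math


def single_linkage(A, B):
    """Distance between the closest pair of points."""
    return min(math.dist(x, y) for x in A for y in B)


def complete_linkage(A, B):
    """Distance between the farthest pair of points."""
    return max(math.dist(x, y) for x in A for y in B)


def average_linkage(A, B):
    """UPGMA: mean of all pairwise distances."""
    return sum(math.dist(x, y) for x in A for y in B) / (len(A) * len(B))


def centroid_linkage(A, B):
    """UPGMC: distance between the two cluster centroids."""
    def mean(points):
        return [sum(c) / len(points) for c in zip(*points)]
    return math.dist(mean(A), mean(B))


# Two vertical pairs of points, three units apart horizontally.
A = [[0.0, 0.0], [0.0, 2.0]]
B = [[3.0, 0.0], [3.0, 2.0]]
```

On this example single and centroid linkage both give 3, complete linkage gives the diagonal √13, and average linkage lies in between.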

Hierarchical clustering
Construction
  - Agglomerative approach (bottom-up): start with every element in its own cluster, then iteratively join nearby clusters
  - Divisive approach (top-down): start with a single cluster containing all elements, then recursively divide it into smaller clusters
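
The agglomerative approach can be written as a short loop. The sketch below is a deliberately naive version that recomputes all inter-cluster distances at every step, so it is slower than practical implementations; the function name, the choice of average linkage with Euclidean distance, and the `n_clusters` stopping parameter are assumptions of the illustration.

```python
import math
from itertools import combinations


def agglomerate(points, n_clusters=1):
    """Bottom-up clustering with average linkage.

    Returns (clusters, merges): clusters are lists of point indices,
    merges records (cluster_a, cluster_b, distance) in merge order.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def avg_link(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Pick the closest pair of current clusters (first one on ties).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_link(clusters[p[0]], clusters[p[1]]),
        )
        dist = avg_link(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


data = [[0.0], [1.0], [10.0], [11.0]]
two_clusters, _ = agglomerate(data, n_clusters=2)
_, history = agglomerate(data, n_clusters=1)
```

The `history` list is exactly the information a dendrogram displays: which clusters were joined, and at what height.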

Hierarchical clustering
Advantages
  - Does not require setting the number of clusters in advance
  - Good interpretability
Drawbacks
  - Computationally intensive: $O(n^2 \log n^2)$
  - Hard to decide at which level of the hierarchy to stop
  - Lack of robustness
  - Risk of locking in accidental features (local decisions)

Hierarchical clustering
Dendrograms
  - In biology
    - Phylogenetic trees
    - Sequence analysis: infer the evolutionary history of sequences being compared
[Figure: dendrogram over leaves a, b, c, d, e, f with internal nodes bc, de, def, bcdef, abcdef]

Hierarchical clustering [Eisen et al., 1998]
Motivation
  - Arrange genes according to similarity in pattern of gene expression
  - Graphical display of output
  - Efficient grouping of genes of similar functions

Hierarchical clustering [Eisen et al., 1998]
Data
  - Saccharomyces cerevisiae:
    - DNA microarrays containing all ORFs
    - Diauxic shift; mitotic cell division cycle; sporulation; temperature and reducing shocks
  - Human:
    - 9 800 cDNAs representing ∼ 8 600 transcripts
    - Fibroblasts stimulated with serum following serum starvation
Data pre-processing
  - Cy5 (red) and Cy3 (green) fluorescences → $\log_2(\mathrm{Cy5}/\mathrm{Cy3})$

Hierarchical clustering [Eisen et al., 1998]
Methods
  - Distance: Pearson’s correlation
  - Pairwise average-linkage cluster analysis
  - Ordering of elements:
    - Ideally: such that adjacent elements have maximal similarity (impractical)
    - In practice: weight genes by average gene expression, chromosomal position

Hierarchical clustering [Bar-Joseph et al., 2001]
Fast optimal leaf ordering for hierarchical clustering
  - n leaves → $2^{n-1}$ possible orderings
  - Goal: maximize the sum of similarities of adjacent leaves in the ordering
  - Recursively find, for a node v, the cost $C(v, u_l, u_r)$ of the optimal ordering rooted at v with left-most leaf $u_l$ and right-most leaf $u_r$
  - Work bottom up: $C(v, u, w) = C(v_l, u, m) + C(v_r, k, w) + \sigma(m, k)$, where $v_l$ and $v_r$ are the children of v, m is the right-most leaf of $v_l$, k is the left-most leaf of $v_r$, and the best pair (m, k) is kept
  - $O(n^4)$ time, $O(n^2)$ space
  - Early termination → $O(n^3)$
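
To make the objective concrete, the sketch below enumerates every leaf order that a binary tree admits (flipping the two children at each internal node) and keeps the one with the largest sum of adjacent similarities. This is a brute-force illustration that visits all 2^(n-1) orderings, not the dynamic program of Bar-Joseph et al.; the nested-tuple tree encoding, the function names, and the toy similarity are assumptions.

```python
def orderings(tree):
    """Yield every leaf order reachable by flipping children of internal nodes.

    A tree is either a leaf label or a pair (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for l in orderings(left):
        for r in orderings(right):
            yield l + r   # children in the given order
            yield r + l   # children flipped


def adjacency_score(order, sim):
    """Sum of similarities between neighbouring leaves."""
    return sum(sim(a, b) for a, b in zip(order, order[1:]))


def best_ordering(tree, sim):
    """Exhaustively pick the ordering with maximal adjacency score."""
    return max(
        ((adjacency_score(o, sim), o) for o in orderings(tree)),
        key=lambda pair: pair[0],
    )


# Toy similarity: leaves sit on a line, closer means more similar.
position = {"a": 0, "b": 3, "c": 1, "d": 4}
similarity = lambda x, y: -abs(position[x] - position[y])

tree = (("a", "b"), ("c", "d"))
all_orders = list(orderings(tree))
score, order = best_ordering(tree, similarity)
```

With four leaves there are three internal nodes and therefore eight orderings, which is why exhaustive search stops being an option long before realistic gene counts and a polynomial-time recursion is needed.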

Hierarchical clustering [Eisen et al., 1998]
  - Genes “present” more than once cluster together
  - Genes of similar function cluster together
    - cluster A: cholesterol biosynthesis
    - cluster B: cell cycle
    - cluster C: immediate-early response
    - cluster D: signaling and angiogenesis
    - cluster E: tissue remodeling and wound healing

Hierarchical clustering [Eisen et al., 1998]
  - cluster E: genes encoding glycolytic enzymes share a function but are not members of large protein complexes
  - cluster J: mini-chromosome maintenance DNA replication complex
  - cluster I: 126 genes strongly down-regulated in response to stress
    - 112 of those encode ribosomal proteins
    - Yeast responds to favorable growth conditions by increasing the production of ribosomes, through transcriptional regulation of genes encoding ribosomal proteins
