Topic 2: Functional Genomics
  1. CSCI1950-Z Computational Methods for Biology, Lecture 13. Ben Raphael, March 11, 2009. http://cs.brown.edu/courses/csci1950-z/ Topic 2: Functional Genomics

  2. Biology 101: the Central Dogma (DNA → RNA → protein). What can we measure at each level? DNA: sequencing (expensive) or hybridization (noisy). RNA: sequencing (expensive) or hybridization (noisy). Protein: mass spectrometry (noisy) or hybridization (very noisy!).

  3. DNA base pairing and DNA/RNA base pairing. RNA is single-stranded, and T → U (uracil replaces thymine).
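A minimal sketch of the T → U substitution, assuming a plain string representation of the template strand; the `transcribe` helper is illustrative, not from the slides:

```python
# Illustrative sketch: complement a DNA template strand into RNA.
# In RNA, uracil (U) pairs with adenine (A) in place of thymine (T).
RNA_COMPLEMENT = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(dna_template: str) -> str:
    """Return the RNA complement of a DNA template strand."""
    return "".join(RNA_COMPLEMENT[base] for base in dna_template)

print(transcribe("ATGC"))  # -> "UACG"
```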

  4. RNA Microarrays. Gene expression data: each microarray experiment yields an expression vector u = (u1, …, un), where ui is the expression value of gene i. [Figure: gene expression matrix, genes × samples/conditions; BMC Genomics 2006, 7:279]

  5. Topics:
• Methods for clustering: hierarchical, graph-based (clique finding), matrix-based (PCA).
• Methods for classification: nearest neighbors, support vector machines.
• Data integration: Bayesian networks.
Gene expression data: each microarray experiment yields an expression vector u = (u1, …, un), where ui is the expression value of gene i. Goal: group genes with similar expression patterns over multiple samples/conditions. [Figure: gene expression matrix; BMC Genomics 2006, 7:279]

  6. Clustering. Goal: group data into groups. Input: n data points with an n × n distance matrix. Output: k clusters, where points within a cluster are "closer" to each other than to points in other clusters. Properties of a good clustering/partition:
• Separation: points in different clusters are far apart.
• Homogeneity: points in the same cluster are close.
[Figure: example n × n distance matrix]
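As a concrete illustration of the input, here is a hedged sketch that builds the n × n distance matrix from n data points; NumPy and Euclidean distance are assumptions, since the slide does not fix either:

```python
import numpy as np

def distance_matrix(X: np.ndarray) -> np.ndarray:
    """X: n points as rows. Returns the n x n matrix of Euclidean distances."""
    diff = X[:, None, :] - X[None, :, :]   # pairwise differences, shape (n, n, m)
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.random.rand(5, 10)   # e.g. 5 genes measured under 10 conditions
D = distance_matrix(X)      # D[i, j] = distance between points i and j; D[i, i] = 0
```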

  7. Agglomerative Hierarchical Clustering: iteratively combine the closest groups into larger groups.
C ← { {1}, …, {n} }
While |C| > 1 do:
  [Find closest clusters] (Ci, Cj) ← argmin_{i ≠ j} d(Ci, Cj)
  [Merge] Ck ← Ci ∪ Cj
  [Replace Ci and Cj by Ck] C ← (C \ {Ci, Cj}) ∪ {Ck}
How should the distance d between clusters be computed? (See the next slide.)

  8. Agglomerative Hierarchical Clustering, average linkage: the distance between clusters is the average pairwise distance. Given two disjoint clusters Ci, Cj:
d(Ci, Cj) = (1 / (|Ci| · |Cj|)) Σ_{p ∈ Ci, q ∈ Cj} d_pq
Initialization: assign each xi to its own cluster Ci.
Iteration: find the two clusters Ci and Cj with minimum dij; let Ck = Ci ∪ Cj; delete Ci and Cj.
Termination: when a single cluster remains.
The sequence of merges is recorded as a dendrogram.
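A minimal Python sketch of this initialization/iteration/termination loop with average linkage; it is a direct transcription of the pseudocode rather than an efficient implementation, and the helper name `agglomerate` is illustrative:

```python
def agglomerate(D):
    """D: n x n distance matrix. Returns the merge history (a dendrogram recipe)."""
    n = len(D)
    clusters = {i: [i] for i in range(n)}          # C <- { {1}, ..., {n} }
    merges = []
    while len(clusters) > 1:
        # Average linkage: mean pairwise distance between two clusters.
        def d(ci, cj):
            return sum(D[p][q] for p in clusters[ci] for q in clusters[cj]) / (
                len(clusters[ci]) * len(clusters[cj]))
        # Find the closest pair of clusters.
        ci, cj = min(((a, b) for a in clusters for b in clusters if a < b),
                     key=lambda ab: d(*ab))
        merges.append((ci, cj, d(ci, cj)))
        clusters[ci] = clusters[ci] + clusters.pop(cj)   # Ck = Ci U Cj
    return merges

D = [[0, 2, 6, 10],
     [2, 0, 5, 9],
     [6, 5, 0, 4],
     [10, 9, 4, 0]]
print(agglomerate(D))  # merges {0,1} at 2.0, {2,3} at 4.0, then the two pairs at 7.5
```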

  9. UPGMA Algorithm (Unweighted Pair Group Method with Averages).
Initialization: assign each xi to its own cluster Ci; define one leaf per sequence, each at height 0.
Iteration: find the two clusters Ci and Cj with minimum dij; let Ck = Ci ∪ Cj; add a vertex connecting Ci and Cj and place it at height dij/2; delete Ci and Cj.
Termination: when a single cluster remains.
Linkage choices for disjoint clusters Ci, Cj:
• Single linkage: d(Ci, Cj) = min_{p ∈ Ci, q ∈ Cj} d_pq
• Complete linkage: d(Ci, Cj) = max_{p ∈ Ci, q ∈ Cj} d_pq
• Average linkage: d(Ci, Cj) = (1 / (|Ci| · |Cj|)) Σ_{p ∈ Ci, q ∈ Cj} d_pq
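In practice these three linkages are available off the shelf; a usage sketch with SciPy (assuming the data are expression vectors in the rows of X), where method="average" gives the UPGMA-style average linkage:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.random.rand(20, 10)                         # 20 expression vectors, 10 conditions
Z_single   = linkage(pdist(X), method="single")    # min pairwise distance
Z_complete = linkage(pdist(X), method="complete")  # max pairwise distance
Z_average  = linkage(pdist(X), method="average")   # UPGMA-style average linkage
```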

  10. Agglomerative Hierarchical Clustering: where are the clusters? Cut the dendrogram at some height; each subtree below the cut becomes a cluster, so any number of clusters can be defined. Cluster centers: each cluster is defined by a center/centroid. (Whiteboard)
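A short sketch of cutting the tree and summarizing the resulting clusters by centroids; SciPy's fcluster with criterion="maxclust" cuts the dendrogram into exactly k clusters, and the choice k = 3 below is arbitrary:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 10)
Z = linkage(pdist(X), method="average")
k = 3                                            # any number of clusters can be chosen
labels = fcluster(Z, t=k, criterion="maxclust")  # cut the dendrogram; labels in 1..k
centroids = np.array([X[labels == c].mean(axis=0) for c in range(1, k + 1)])
```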

  11. Another Greedy k-means Move. Let cost(P) be the k-means cost of partition P, and let P_{i→C} be the clustering with point i moved to cluster C. The gain of the move is Δ(i→C) = cost(P) − cost(P_{i→C}). Open question: how many clusters?
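A hedged sketch of the move gain, assuming the usual k-means cost (sum of squared distances of points to their cluster centroid), which the slide does not spell out; the helper names `cost` and `gain` are illustrative:

```python
import numpy as np

def cost(X, labels, k):
    """k-means cost: sum of squared distances of points to their cluster centroid."""
    total = 0.0
    for c in range(k):
        members = X[labels == c]
        if len(members):
            total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def gain(X, labels, k, i, c):
    """Delta(i -> C) = cost(P) - cost(P with point i moved to cluster c)."""
    moved = labels.copy()
    moved[i] = c
    return cost(X, labels, k) - cost(X, moved, k)

X = np.random.rand(30, 5)
labels = np.random.randint(0, 3, size=30)
print(gain(X, labels, 3, i=0, c=1))  # positive gain means the move lowers the cost
```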

  12. Distance graph G(Θ) = (V, E), where V = data points and E = {(i, j) : d(i, j) < Θ}. (Slide example: Θ = 7.)
Cliques:
• A graph is complete provided all possible edges are present.
• A subgraph that is a complete graph is called a clique.
• The separation and homogeneity properties of a good clustering imply: clusters = cliques. (Examples: K3, K4, K5.)
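A small sketch of building G(Θ) from a distance matrix; the 4 × 4 matrix below is illustrative, not the one pictured on the slide:

```python
import numpy as np

def distance_graph(D: np.ndarray, theta: float):
    """Edge set of G(theta): pairs (i, j) with d(i, j) < theta."""
    n = len(D)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if D[i][j] < theta}

D = np.array([[0, 11, 7, 5],
              [11, 0, 4, 6],
              [7, 4, 0, 9],
              [5, 6, 9, 0]])
print(distance_graph(D, theta=7))  # edges below the threshold: (0,3), (1,2), (1,3)
```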

  13. Cliques and clustering. In a good clustering:
1. There is one connected component for each cluster (separation).
2. Each connected component has an edge between every pair of its vertices (homogeneity).
Clique graph: a graph whose connected components are all cliques.
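A sketch testing the clique-graph property, i.e. that every connected component is a complete subgraph; the helper name `is_clique_graph` and the edge representation are assumptions for illustration:

```python
def is_clique_graph(vertices, edges):
    """True if every connected component of (vertices, edges) is a clique."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    for s in vertices:
        if s in seen:
            continue
        comp, stack = set(), [s]          # collect one connected component
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u])
        seen |= comp
        # Homogeneity: every vertex must be adjacent to all others in its component.
        if any(len(adj[u]) != len(comp) - 1 for u in comp):
            return False
    return True

print(is_clique_graph({1, 2, 3, 4}, {(1, 2), (2, 3), (1, 3)}))  # True: K3 + isolated 4
```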

  14. Distance graph → clique graph. Distance graphs from real data have missing edges and extra edges.
Corrupted Cliques Problem. Input: graph G. Output: the smallest number of edges to add or remove to transform G into a clique graph. NP-hard (Shamir, Sharan & Tsur 2004).

  15. Extending a subpartition.
• Suppose we knew an optimal clustering of a subset V′ ⊆ V.
• Extend this clustering to all of V.
Cluster affinity: let N(v, Cj) = number of edges from v to Cj, and define the affinity (relative density) of v to Cj as N(v, Cj) / |Cj|.
Maximum affinity extension: assign v to argmax_j N(v, Cj) / |Cj|, as in the sketch below.
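A minimal sketch of maximum affinity extension; the edge representation (frozensets of endpoints) and the helper name `max_affinity_cluster` are assumptions for illustration:

```python
def max_affinity_cluster(v, clusters, edges):
    """Index of the cluster maximizing N(v, C) / |C|.
    clusters: list of vertex sets; edges: set of frozensets {u, w}."""
    def affinity(C):
        n_edges = sum(1 for u in C if frozenset((v, u)) in edges)  # N(v, C)
        return n_edges / len(C)                                    # N(v, C) / |C|
    return max(range(len(clusters)), key=lambda j: affinity(clusters[j]))

edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (4, 5), (6, 1), (6, 2)]}
clusters = [{1, 2, 3}, {4, 5}]
print(max_affinity_cluster(6, clusters, edges))  # 0: affinity 2/3 vs. 0 for {4, 5}
```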

  16. Parallel Clustering with Cores (PCC) (Ben-Dor et al. 1999). Score(P) = minimum number of edges to add/remove so that partition P becomes a clique graph; this is straightforward to compute since P is known.
PCC: algorithmic analysis. Exhaustively scoring all partitions is very inefficient: the number of partitions of a set S′ into k groups is φ(|S′|, k), the Stirling number of the second kind,
φ(r, k) = (1/k!) Σ_{i=0}^{k} (−1)^i C(k, i) (k − i)^r.
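A direct translation of the Stirling-number formula above (using math.comb, available since Python 3.8); the helper name `stirling2` is illustrative:

```python
from math import comb, factorial

def stirling2(r: int, k: int) -> int:
    """Stirling number of the second kind: partitions of r items into k nonempty groups."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** r for i in range(k + 1)) // factorial(k)

print(stirling2(4, 2))  # 7 ways to split 4 labeled items into 2 nonempty groups
```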

  17. Corrupted cliques random graph model: 1) start with a clique graph H; 2) randomly add/remove edges, each with probability p, to obtain the graph G_{H,p}.
PCC: algorithmic analysis.
• PCC selects two random sets of vertices, so the analysis relies on probability.
• Let PCC(G) denote the output graph (a clique graph).
• For graphs G = (V, E) and G′ = (V′, E′), define Δ(G, G′) = |E Δ E′| = |E \ E′| + |E′ \ E|.
• One can show (see the Shamir notes) that with high probability the output of PCC fits G_{H,p} at least as well as the original clique graph H does: Pr[Δ(PCC(G_{H,p}), G_{H,p}) ≤ Δ(H, G_{H,p})] > 1 − δ.
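A sketch of the G_{H,p} model and the Δ distance; interpreting "randomly add/remove edges with probability p" as an independent flip of each possible edge is an assumption, and the helper names `corrupt` and `delta` are illustrative:

```python
import random
from itertools import combinations

def corrupt(vertices, H_edges, p, seed=0):
    """Flip each possible edge of the clique graph H independently with probability p."""
    rng = random.Random(seed)
    edges = set(H_edges)
    for e in map(frozenset, combinations(vertices, 2)):
        if rng.random() < p:
            edges ^= {e}          # flip: remove the edge if present, add it if absent
    return edges

def delta(E, E2):
    """Graph distance: size of the symmetric difference of the edge sets."""
    return len(E ^ E2)

V = range(6)
H = {frozenset(e) for e in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]}  # two triangles
G = corrupt(V, H, p=0.1)
print(delta(H, G))  # number of corrupted edges
```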

  18. Cluster Affinity Search Technique (CAST). Clustering of gene expression samples: each microarray experiment gives an expression vector x = (x1, …, xn), where xi is the expression value of gene i. Goal: group similar vectors. [Figure: gene expression matrix; BMC Genomics 2006, 7:279]

  19. Distances between vectors: Pearson product-moment correlation coefficient.
r_ij = Σ_{k=1}^{m} (x_ik − x̄_i)(x_jk − x̄_j) / ((m − 1) s_i s_j)
where the sample mean is x̄_i = (1/m) Σ_{k=1}^{m} x_ik and the sample standard deviation is s_i = sqrt((1/(m − 1)) Σ_{k=1}^{m} (x_ik − x̄_i)²).
It measures the linear relationship between vectors x_i and x_j; −1 ≤ r_ij ≤ 1.
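A short sketch matching this formula, using the m − 1 sample statistics; NumPy's built-in corrcoef computes the same quantity and serves as a check (the m − 1 factors cancel either way):

```python
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation with sample mean and sample std (m - 1 denominators)."""
    m = len(x)
    sx, sy = x.std(ddof=1), y.std(ddof=1)
    return ((x - x.mean()) * (y - y.mean())).sum() / ((m - 1) * sx * sy)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.1])
print(pearson(x, y))            # close to 1: strong positive linear relationship
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in agrees
```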
