clustering expression data

Clustering Expression Data (www.cs.washington.edu/527)



  1. Clustering Expression Data
     www.cs.washington.edu/527
     Subscribe, if you didn't get the message last night.

     Clustering Expression Data
     • Why cluster gene expression data?
       – Tissue classification
       – Find biologically related genes
       – First step in inferring regulatory networks
       – Look for common promoter elements
       – Hypothesis generation
       – One of the tools of choice for expression analysis

     Clustering Expression Data
     • What has been done?
       – Hierarchical average-link [Eisen et al. 98]
       – Self Organizing Maps (SOM) [Tamayo et al. 99]
       – CAST [Ben-Dor et al. 99]
       – Support Vector Machines (SVM) [Grundy et al. 00]
       – etc., etc., etc.
     • Why so many methods?
       – Clustering is NP-hard, even with simple objectives, data
       – Hard problem: high dimensionality, noise, …
       – ∴ many heuristic, local search, & approximation algorithms
       – No clear winner

     Clustering Algorithms
     • Partitional
       – CAST (Ben-Dor et al. 1999)
       – k-means, variously initialized (Hartigan 1975)
     • Hierarchical
       – single-link
       – average-link
       – complete-link
     • Random (as a control)
       – Randomly assign genes to clusters
     • Others
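The "Random (as a control)" entry above can be sketched in a few lines of Python. This is only an illustration: the function name `random_clusters`, the gene names, and the fixed seed are assumptions made here, not part of the slides.

```python
import random

def random_clusters(genes, k, seed=0):
    """Assign each gene to one of k clusters uniformly at random (control baseline)."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    return {gene: rng.randrange(k) for gene in genes}

# Hypothetical gene identifiers, used only for illustration.
genes = [f"gene{i}" for i in range(10)]
labels = random_clusters(genes, k=3)
```

A real clustering algorithm should score clearly better than this baseline on whatever quality measure is used; if it does not, the clusters are probably not meaningful.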

  2. Clustering 101
     Ka Yee Yeung
     Center for Expression Arrays, University of Washington
     The following slides are largely from http://staff.washington.edu/kayee/research.html. Errors are mine.

     Overview
     • What is clustering?
     • Similarity/distance metrics
     • Hierarchical clustering algorithms
       – Made popular by Stanford, i.e. [Eisen et al. 1998]
     • K-means
       – Made popular by many groups, e.g. [Tavazoie et al. 1999]
     • Self-organizing map (SOM)
       – Made popular by Whitehead, i.e. [Tamayo et al. 1999]

     What is clustering?
     • Group similar objects together
     • Objects in the same cluster (group) are more similar to each other than objects in different clusters
     • Data exploratory tool

     How to define similarity?
     [Figure: a raw matrix (n genes × p experiments) is converted into a similarity matrix (n genes × n genes); X and Y mark two gene rows, and s marks their entry in the similarity matrix]
     • Similarity metric:
       – A measure of pairwise similarity or dissimilarity
       – Examples:
         • Correlation coefficient
         • Euclidean distance

  3. Similarity metrics
     • Euclidean distance

       $d(X, Y) = \sqrt{\sum_{j=1}^{p} (X[j] - Y[j])^2}$

     • Correlation coefficient

       $r(X, Y) = \frac{\sum_{j=1}^{p} (X[j] - \bar{X})(Y[j] - \bar{Y})}{\sqrt{\sum_{j=1}^{p} (X[j] - \bar{X})^2 \, \sum_{j=1}^{p} (Y[j] - \bar{Y})^2}}$, where $\bar{X} = \frac{1}{p} \sum_{j=1}^{p} X[j]$

     Example

            j=1  j=2  j=3  j=4
       X      1    0   -1    0
       Y      3    2    1    2
       Z     -1    0    1    0
       W      2    0   -2    0

     [Figure: profiles X, Y, Z and W plotted against experiments 1 to 4]

       Correlation (X,Y) = 1     Distance (X,Y) = 4
       Correlation (X,Z) = -1    Distance (X,Z) = 2.83
       Correlation (X,W) = 1     Distance (X,W) = 1.41

     Lessons from the example
     • Correlation: direction only
     • Euclidean distance: magnitude & direction
     • Min # attributes (experiments) to compute pairwise similarity
       – >= 2 attributes for Euclidean distance
       – >= 3 attributes for correlation
     • Array data is noisy → need many experiments to robustly estimate pairwise similarity

     Clustering algorithms
     • Inputs:
       – Raw data matrix or similarity matrix
       – Number of clusters or some other parameters
     • Many different classifications of clustering algorithms:
       – Hierarchical vs partitional
       – Heuristic-based vs model-based
       – Soft vs hard
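The example values above can be checked with a short Python sketch. The function names `euclidean` and `correlation` are chosen here for illustration; the vectors X, Y, Z and W are the ones from the slide.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    p = len(x)
    mean_x, mean_y = sum(x) / p, sum(y) / p
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                    * sum((b - mean_y) ** 2 for b in y))
    return num / den

X = [1, 0, -1, 0]
Y = [3, 2, 1, 2]
Z = [-1, 0, 1, 0]
W = [2, 0, -2, 0]

# Map each profile name to (correlation with X, distance from X).
results = {name: (correlation(X, v), euclidean(X, v))
           for name, v in [("Y", Y), ("Z", Z), ("W", W)]}
```

Y is X shifted up by 2 and W is X scaled by 2, so both correlate perfectly with X even though their distances from X differ. This is the "direction only" versus "magnitude & direction" point in the lessons.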

  4. Hierarchical Clustering [Hartigan 1975]
     • Agglomerative (bottom-up)
     • Algorithm:
       – Initialize: each item a cluster
       – Iterate:
         • select two most similar clusters
         • merge them
       – Halt: when required number of clusters is reached
     [Figure: dendrogram]

     Hierarchical: Single Link
     • cluster similarity = similarity of two most similar members
     - Potentially long and skinny clusters
     + Fast

     Example: single link
     Initial distance matrix:

            1    2    3    4    5
       1    0
       2    2    0
       3    6    3    0
       4   10    9    7    0
       5    9    8    5    4    0

     After merging 1 and 2:

              (1,2)   3    4    5
       (1,2)    0
       3        3     0
       4        9     7    0
       5        8     5    4    0

       $d_{(1,2),3} = \min\{d_{1,3}, d_{2,3}\} = \min\{6, 3\} = 3$
       $d_{(1,2),4} = \min\{d_{1,4}, d_{2,4}\} = \min\{10, 9\} = 9$
       $d_{(1,2),5} = \min\{d_{1,5}, d_{2,5}\} = \min\{9, 8\} = 8$

     Example: single link
     After merging (1,2) and 3:

                (1,2,3)   4    5
       (1,2,3)     0
       4           7      0
       5           5      4    0

       $d_{(1,2,3),4} = \min\{d_{(1,2),4}, d_{3,4}\} = \min\{9, 7\} = 7$
       $d_{(1,2,3),5} = \min\{d_{(1,2),5}, d_{3,5}\} = \min\{8, 5\} = 5$

     [Figure: dendrogram over items 1 to 5, height axis 1 to 5]
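The agglomerative loop with the single-link rule can be sketched as follows. This is a minimal, unoptimized illustration, not the implementation used by any of the cited authors; the function name `single_link_merges` and the returned merge list are assumptions made here. The matrix is the 5-item example from the slide.

```python
def single_link_merges(dist, num_clusters=1):
    """Agglomerative clustering where cluster distance is the minimum
    distance between any two members (single link).

    dist is a full symmetric matrix; items are labelled 1..n as on the slide.
    Returns the final clusters and the list of (merged cluster, distance)."""
    clusters = [(i + 1,) for i in range(len(dist))]  # initialize: each item a cluster
    merges = []
    while len(clusters) > num_clusters:              # halt at the requested count
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i - 1][j - 1]
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)                 # closest pair seen so far
        d, a, b = best
        merged = tuple(sorted(clusters[a] + clusters[b]))
        merges.append((merged, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Distance matrix from the slide example.
D = [
    [0, 2, 6, 10, 9],
    [2, 0, 3, 9, 8],
    [6, 3, 0, 7, 5],
    [10, 9, 7, 0, 4],
    [9, 8, 5, 4, 0],
]
clusters, merges = single_link_merges(D)
# merges: (1,2) at 2, (1,2,3) at 3, (4,5) at 4, then all five items at 5
```

The merge distances 2, 3, 4, 5 are the heights at which the branches join in the dendrogram. Recomputing every cluster distance from scratch on each pass is slow for large inputs; it is done here only to keep the sketch short.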

  5. Example: single link
     After merging (1,2) and 3:

                (1,2,3)   4    5
       (1,2,3)     0
       4           7      0
       5           5      4    0

     After merging 4 and 5:

       $d_{(1,2,3),(4,5)} = \min\{d_{(1,2,3),4}, d_{(1,2,3),5}\} = 5$

     [Figure: dendrogram over items 1 to 5, height axis 1 to 5. Sometimes drawn to a scale.]

     Hierarchical: Complete Link
     • cluster similarity = similarity of two least similar members
     + tight clusters
     - slow

     Example: complete link
     Initial distance matrix:

            1    2    3    4    5
       1    0
       2    2    0
       3    6    3    0
       4   10    9    7    0
       5    9    8    5    4    0

     After merging 1 and 2:

              (1,2)   3    4    5
       (1,2)    0
       3        6     0
       4       10     7    0
       5        9     5    4    0

       $d_{(1,2),3} = \max\{d_{1,3}, d_{2,3}\} = \max\{6, 3\} = 6$
       $d_{(1,2),4} = \max\{d_{1,4}, d_{2,4}\} = \max\{10, 9\} = 10$
       $d_{(1,2),5} = \max\{d_{1,5}, d_{2,5}\} = \max\{9, 8\} = 9$

     Example: complete link
     After merging 4 and 5:

              (1,2)   3   (4,5)
       (1,2)    0
       3        6     0
       (4,5)   10     7     0

       $d_{(1,2),(4,5)} = \max\{d_{(1,2),4}, d_{(1,2),5}\} = \max\{10, 9\} = 10$
       $d_{3,(4,5)} = \max\{d_{3,4}, d_{3,5}\} = \max\{7, 5\} = 7$

     [Figure: dendrogram over items 1 to 5]
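The matrix reductions in the complete-link example follow one rule: when clusters a and b merge, the new cluster's distance to any other cluster c is the larger of the two old distances. A minimal Python sketch of that update is below; the helper name `merge_update` and the dict-of-dicts table layout are assumptions made for illustration.

```python
def merge_update(dist, a, b, combine):
    """Merge clusters a and b in a dict-of-dicts distance table.

    Clusters are tuples of item labels. combine(d_ac, d_bc) gives the
    merged cluster's distance to every other cluster c; passing max
    gives the complete-link rule."""
    new = a + b
    others = [c for c in dist if c not in (a, b)]
    # Copy the part of the table that the merge does not touch.
    out = {c: {e: dist[c][e] for e in others} for c in others}
    out[new] = {new: 0}
    for c in others:
        d = combine(dist[a][c], dist[b][c])
        out[new][c] = d
        out[c][new] = d
    return out

# Distance matrix from the slide example, items labelled 1..5.
D = [
    [0, 2, 6, 10, 9],
    [2, 0, 3, 9, 8],
    [6, 3, 0, 7, 5],
    [10, 9, 7, 0, 4],
    [9, 8, 5, 4, 0],
]
labels = [(i + 1,) for i in range(5)]
table = {labels[i]: {labels[j]: D[i][j] for j in range(5)} for i in range(5)}

step1 = merge_update(table, (1,), (2,), max)   # merge 1 and 2
step2 = merge_update(step1, (4,), (5,), max)   # merge 4 and 5
step3 = merge_update(step2, (1, 2), (3,), max) # merge (1,2) and 3
```

Here the merge order is supplied by hand to mirror the slides; a full implementation would pick the smallest off-diagonal entry at each step. Passing `min` instead of `max` gives the single-link update.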

  6. Example: complete link
     After merging 4 and 5:

              (1,2)   3   (4,5)
       (1,2)    0
       3        6     0
       (4,5)   10     7     0

     After merging (1,2) and 3:

       $d_{(1,2,3),(4,5)} = \max\{d_{(1,2),(4,5)}, d_{3,(4,5)}\} = 10$

     [Figure: dendrogram over items 1 to 5, height axis 1 to 5]

     Hierarchical: Average Link
     • cluster similarity = average similarity of all pairs
     + tight clusters
     - slow

     Example: average link
     Initial distance matrix:

            1    2    3    4    5
       1    0
       2    2    0
       3    6    3    0
       4   10    9    7    0
       5    9    8    5    4    0

     After merging 1 and 2:

              (1,2)   3    4    5
       (1,2)    0
       3        4.5   0
       4        9.5   7    0
       5        8.5   5    4    0

       $d_{(1,2),3} = \frac{1}{2}(d_{1,3} + d_{2,3}) = \frac{6 + 3}{2} = 4.5$
       $d_{(1,2),4} = \frac{1}{2}(d_{1,4} + d_{2,4}) = \frac{10 + 9}{2} = 9.5$
       $d_{(1,2),5} = \frac{1}{2}(d_{1,5} + d_{2,5}) = \frac{9 + 8}{2} = 8.5$

     Example: average link
     After merging 4 and 5:

              (1,2)   3   (4,5)
       (1,2)    0
       3        4.5   0
       (4,5)    9     6     0

       $d_{(1,2),(4,5)} = \frac{1}{4}(d_{1,4} + d_{1,5} + d_{2,4} + d_{2,5}) = 9$
       $d_{3,(4,5)} = \frac{1}{2}(d_{3,4} + d_{3,5}) = 6$

     [Figure: dendrogram over items 1 to 5]
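The average-link values can be reproduced by averaging every cross-cluster pair in the original matrix. The sketch below assumes a helper named `average_link` (not from the slides) and uses the same 5-item matrix.

```python
from itertools import product

def average_link(dist, a, b):
    """Average distance over all pairs (i, j) with i in cluster a, j in cluster b.

    Clusters are tuples of 1-based item labels, as on the slides."""
    pairs = list(product(a, b))
    return sum(dist[i - 1][j - 1] for i, j in pairs) / len(pairs)

# Distance matrix from the slide example.
D = [
    [0, 2, 6, 10, 9],
    [2, 0, 3, 9, 8],
    [6, 3, 0, 7, 5],
    [10, 9, 7, 0, 4],
    [9, 8, 5, 4, 0],
]

d_12_3 = average_link(D, (1, 2), (3,))     # (6 + 3) / 2
d_12_45 = average_link(D, (1, 2), (4, 5))  # (10 + 9 + 9 + 8) / 4
d_3_45 = average_link(D, (3,), (4, 5))     # (7 + 5) / 2
```

Averaging over all member pairs uses every pairwise distance between the two clusters, whereas single and complete link each depend on only one pair.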
