SLIDE 1
Clustering megavariate data

Dhammika Amaratunga
Team Leader - Statistics in Drug Discovery; Senior Research Fellow - Nonclinical Statistics

Rutgers Biostatistics Day, April 2010

Joint work with Javier Cabrera, Yauheniya Cherkas, Vladimir Kovtun, YungSeop Lee, and others

SLIDE 2

Cluster analysis

• Data collected for N samples.
• For each sample, measurements made on G variables.
• Data represented as a G x N matrix.
• The objective is to cluster the N samples into a few classes in such a way that samples within a class are collectively more similar to each other than to samples in any other class.

[Figure: samples C1-C6 grouped into clusters]

SLIDE 3

Cluster analysis methods

 There are many standard approaches available (e.g.,

partitioning methods such as K-means, hierarchical methods such as average linkage, machine learning methods such as self organizing maps)

 For example, hierarchical clustering is one of the

more popular clustering methods.

  • - Define an inter-sample dissimilarity

(e.g., Euclidean distance, 1-Correlation)

  • - Define an inter-cluster dissimilarity

(e.g., Dissimilarity between a pair of clusters is the average dissimilarity between a sample in one cluster and a sample in the other cluster)

  • - Combine “close” samples/clusters sequentially
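These three steps can be illustrated with SciPy's hierarchical clustering routines; a minimal sketch on made-up data (the toy G x N matrix and its two-group structure are invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Toy data: G = 20 variables (rows) for N = 6 samples (columns);
# genes 0-4 separate samples 4-6 from samples 1-3 (made-up values)
G, N = 20, 6
X = rng.normal(0.0, 1.0, size=(G, N))
X[:5, 3:] += 6.0

# Step 1: inter-sample dissimilarity (Euclidean distance between columns)
d = pdist(X.T, metric="euclidean")

# Steps 2-3: average-linkage inter-cluster dissimilarity; "close"
# samples/clusters are combined sequentially
Z = linkage(d, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Cutting the resulting tree at two clusters recovers the two sample groups.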
SLIDE 4

Hierarchical clustering: how it works

[Figure: dendrogram over SAMPLE 1-SAMPLE 7, with leaves ordered 1, 2, 3, 4, 7, 6, 5]

SLIDE 5

The catch

• In many contemporary settings, the data are megavariate, i.e., N << G (e.g., in high-throughput gene expression studies G is around 1,000-50,000 while N is around 10-500); in such cases, most predictors are noninformative and could overwhelm the dissimilarity estimates.
• Example: Use gene expression data to discover unexpected novel classes among the samples (e.g., subtypes of leukemia in leukemia patients).
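A quick simulation (invented numbers, not from the slides) shows how noninformative genes swamp the dissimilarities when N << G:

```python
import numpy as np

rng = np.random.default_rng(1)
G, N = 10000, 6                        # megavariate: N << G
X = rng.normal(0.0, 1.0, size=(G, N))  # mostly noninformative genes
X[:20, 3:] += 3.0                      # only 20 genes distinguish samples 4-6

# Euclidean distance between two samples in the same class vs
# two samples in different classes
within = np.linalg.norm(X[:, 0] - X[:, 1])
between = np.linalg.norm(X[:, 0] - X[:, 3])
ratio_all = between / within           # close to 1: signal is swamped

# Restricted to the 20 informative genes, the separation is obvious
within20 = np.linalg.norm(X[:20, 0] - X[:20, 1])
between20 = np.linalg.norm(X[:20, 0] - X[:20, 3])
```

Over all 10,000 genes the within-class and between-class distances are nearly identical, while over the informative genes alone the classes are well separated.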

SLIDE 6

Case study

• Experiment: Compare the gene expression profiles of 6 KO mice vs 6 WT mice using a microarray with 45,101 genes.

  WT: C1 C2 C3 C4 C5 C6
  KO: T1 T2 T3 T4 T5 T6

• Note 1: Data are available for early-stage and late-stage development of these mice.
• Note 2: These data are useful for illustration but are not representative of a cluster analysis situation, as here the classes are known.

SLIDE 7

Gene expression data

• Gene expression levels (measured via microarrays) for G genes in N samples:

         C1      C2      C3      C4      C5      C6   …
  G1      83      94      82     111     130     122
  G2      16      14       7       2      11      33
  G3     490     879     193     604    1031     962
  G4   46458   49268   74059   44849   42235   44611
  G5      32      70     185      20      25      19
  G6    1067     891     546     906    1038    1098
  G7     118     111      95     896     536     695
  G8      10      30      25      24      31      28
  G9     166     132     162      27     109     213
  G10    136     139      44      62      23     135
  …

• Preprocess and analyze.

SLIDE 8

Biplots of data from knockout experiment

[Figures: early stage (left), late stage (right)]

SLIDE 9

Clustering of data from knockout experiment

[Dendrograms: early stage, MR = 5/12; late stage, MR = 0/12]

SLIDE 10

Filtering

• Problem: With megavariate data, most predictors are noninformative and will overwhelm the dissimilarity estimates.
• Usual (partial) resolution: Filter the genes based on variance or coefficient of variation to reduce the error rates (but which genes are informative?).
• Resolution: Ensemble approach: filter genes repeatedly and apply an ensemble technique.
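The usual variance filter amounts to keeping the genes with the largest variance across samples; a minimal sketch (the matrix sizes and cutoff here are made up):

```python
import numpy as np

def variance_filter(X, keep):
    """Keep the `keep` rows (genes) of the G x N matrix X with the
    largest variance across samples."""
    v = X.var(axis=1)
    idx = np.argsort(v)[::-1][:keep]   # indices of top-variance genes
    return X[idx], idx

# Made-up example: 5000 genes, 12 samples, genes 0-49 high-variance
rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, size=(5000, 12))
X[:50] *= 5.0
Xf, idx = variance_filter(X, keep=50)
```

The catch the slide points out remains: a high-variance gene is not necessarily an informative one, which motivates the repeated-filtering ensemble approach.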

SLIDE 11

[Figure: schematic of the resampling idea. From the full G x N gene expression matrix, repeatedly select n samples and g genes and cluster; a sample-by-sample similarity matrix counts how often each pair of samples clusters together. As the counts accumulate over iterations (S5 and S6, for instance, co-cluster in nearly every draw), they separate the final clusters {S1, S2, S3, S4} and {S5, S6}.]

SLIDE 12

ABC dissimilarities

[Flowchart: Data → simple random sample of cases → random sample of genes (simple, or weighted based on variance) → cluster analysis (HC with average or Ward's linkage, K-means, …) → iterate → input to clustering algorithm]

• ABC(i, j) = 1 - relative frequency of how often samples i and j cluster together.

Ref: Amaratunga, Cabrera and Kovtun (Biostatistics, 2008)
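The scheme above might be sketched as follows. This is a simplified reading of the ABC idea, not the authors' exact algorithm: for brevity only genes are resampled here (the slide also resamples cases), and all sizes and iteration counts are invented:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def abc_dissimilarity(X, k=2, n_iter=200, gene_frac=0.1, seed=0):
    """ABC-style dissimilarity for the N columns (samples) of the
    G x N matrix X: 1 - relative frequency with which samples i and j
    cluster together over repeated random gene subsets."""
    rng = np.random.default_rng(seed)
    G, N = X.shape
    together = np.zeros((N, N))
    for _ in range(n_iter):
        genes = rng.choice(G, size=max(2, int(gene_frac * G)), replace=False)
        Z = linkage(pdist(X[genes].T), method="average")
        lab = fcluster(Z, t=k, criterion="maxclust")
        together += (lab[:, None] == lab[None, :])   # co-clustering counts
    D = 1.0 - together / n_iter
    np.fill_diagonal(D, 0.0)
    return D

# Made-up example: 500 genes, 6 samples, genes 0-99 separate samples 4-6
rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(500, 6))
X[:100, 3:] += 4.0

D = abc_dissimilarity(X)                             # ABC dissimilarities ...
Z = linkage(squareform(D), method="average")         # ... fed to a clusterer
labels = fcluster(Z, t=2, criterion="maxclust")
```

Pairs from the same class end up with small ABC dissimilarity even though any single gene subset gives only a noisy split; the aggregation is what makes the structure visible.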

SLIDE 13

ABC clustering of data from knockout experiment

[Dendrograms: early stage, MR = 2/12; late stage, MR = 0/12]

SLIDE 14

ABC-MDS plot of data from knockout experiment

[Plots: early stage (left), late stage (right)]
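A plot like this can be produced from any dissimilarity matrix with classical multidimensional scaling; a minimal NumPy sketch on a toy ABC-style dissimilarity (not the actual knockout data):

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS: embed an N x N dissimilarity matrix
    in `dim` dimensions via double centering and eigendecomposition."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (D ** 2) @ J         # double-centered squared dissimilarities
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:dim]     # keep the largest eigenvalues
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

# Toy ABC-style dissimilarity: two tight groups {S1,S2,S3} and {S4,S5,S6}
D = np.full((6, 6), 0.9)
D[:3, :3] = 0.1
D[3:, 3:] = 0.1
np.fill_diagonal(D, 0.0)
Y = classical_mds(D)                    # 6 x 2 coordinates for plotting
```

In the embedding, samples with small ABC dissimilarity land close together, so well-separated clusters show up as distinct point clouds.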

SLIDE 15

Within-cluster and between-cluster dissimilarities

SLIDE 16

More proof-of-concept examples

• Try on data in which the clusters are known.

Misclassification Rates (%)

  Method               Golub    AMS     ALL    Colon
  Ward's with ABC       18.1    1.4     0.0     9.7
  Ward's with 1-Cor     23.6    9.7     2.3    48.4
  Single Linkage        47.0   47.0    25.0    37.0
  Complete Linkage      37.5   23.6    41.4    45.0
  Average Linkage       47.2   27.8    26.5    38.7
  K-means               20.8    5.5    42.2    48.4
  PAM                   23.6    8.3     2.3    16.1
  Random Forest         43.0   26.4    48.0    43.5
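The misclassification rates in these tables compare cluster assignments to the known classes; since cluster numbering is arbitrary, one reasonable convention, assumed here rather than stated on the slide, is to take the error under the best one-to-one relabeling of clusters:

```python
import numpy as np
from itertools import permutations

def misclassification_rate(true, pred):
    """Fraction of samples misclassified under the best one-to-one
    relabeling of cluster labels to class labels."""
    true, pred = np.asarray(true), np.asarray(pred)
    clusters = np.unique(pred)
    best = len(true)
    for perm in permutations(np.unique(true)):
        mapping = dict(zip(clusters, perm))          # cluster -> class
        errors = sum(mapping[p] != t for p, t in zip(pred, true))
        best = min(best, errors)
    return best / len(true)

# Example: 1 of 12 samples lands in the "wrong" cluster
true = [0] * 6 + [1] * 6
pred = [2] * 5 + [1] * 7          # clusters arbitrarily named 2 and 1
rate = misclassification_rate(true, pred)
print(rate)  # 1/12, i.e. about 0.083
```

Enumerating label permutations is fine for the 2-4 classes in these examples; for many classes one would switch to an assignment solver.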

SLIDE 17

More proof-of-concept examples (ctd)

• … with feature selection

Misclassification Rates (%)

  Method               Golub    AMS     ALL    Colon
  Ward's with ABC       18.1    1.4     0.0     9.7
  Ward's with 1-Cor      6.9   13.9     0.0    24.2
  Single Linkage        45.8   58.3    26.6    35.5
  Complete Linkage      29.2   13.9     0.0    27.4
  Average Linkage        5.6   30.6     0.0    37.1
  K-means                6.9    6.9     0.0    14.5
  PAM                    8.3   13.9     0.0    12.9
  Random Forest         23.6   12.5     0.0    11.3

SLIDE 18

Hepatotoxicity example (1)

• In this experiment, N = 87 compounds were tested in rats for a certain type of hepatotoxicity.

SLIDE 19

Hepatotoxicity example (2)

• ABC was run on this dataset.

[Figure: two-dimensional plot of the 87 compounds from the ABC analysis, labeled with abbreviated compound names]

SLIDE 20

Hepatotoxicity example (3)

• In this case, it was known that there are 3 genes thought to be implicated in the toxicity of interest.

SLIDE 21

Hepatotoxicity example (4)

• Running ABC with weights proportional to the maximum correlation to these 3 genes gave a much more interesting result.

[Figure: two-dimensional plot of the 87 compounds from the weighted ABC analysis, labeled with abbreviated compound names]

SLIDE 22

Extension: ensemble classifiers

[Flowchart: Data → simple random sample of subjects → simple random sample of genes → construct classifier (tree, as in Random Forest*; LDA; …) → predict using classifier → iterate and collate results → prediction by majority vote]

Ref: Breiman (Machine Learning, 2001), Amaratunga et al. (2009)
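The flow above might be sketched as follows, with a nearest-centroid base learner standing in for the trees/LDA named on the slide (my simplification; all sizes and counts are made up):

```python
import numpy as np

def ensemble_predict(Xtr, ytr, Xte, n_learners=50, gene_frac=0.1, seed=0):
    """Ensemble classifier sketch: each learner is built on a bootstrap
    sample of subjects and a random subset of genes; test subjects are
    labeled by majority vote.  X matrices are G x N (genes x subjects)."""
    rng = np.random.default_rng(seed)
    G, N = Xtr.shape
    classes = np.unique(ytr)
    votes = np.zeros((Xte.shape[1], len(classes)))
    for _ in range(n_learners):
        subj = rng.choice(N, size=N, replace=True)               # bootstrap subjects
        genes = rng.choice(G, size=max(2, int(gene_frac * G)), replace=False)
        if any((ytr[subj] == c).sum() == 0 for c in classes):
            continue                                             # a class is missing; skip
        # Nearest-centroid base learner on the sampled subjects/genes
        cents = np.stack([Xtr[np.ix_(genes, subj[ytr[subj] == c])].mean(axis=1)
                          for c in classes])
        d = np.linalg.norm(Xte[genes].T[:, None, :] - cents[None, :, :], axis=2)
        votes[np.arange(Xte.shape[1]), d.argmin(axis=1)] += 1    # one vote per learner
    return classes[votes.argmax(axis=1)]                         # majority vote

# Made-up example: 1000 genes, 12 training and 4 test subjects
rng = np.random.default_rng(4)
Xtr = rng.normal(0.0, 1.0, size=(1000, 12))
ytr = np.array([0] * 6 + [1] * 6)
Xtr[:100, 6:] += 3.0                     # genes 0-99 separate the classes
Xte = rng.normal(0.0, 1.0, size=(1000, 4))
Xte[:100, 2:] += 3.0                     # test subjects 3-4 are class 1
yhat = ensemble_predict(Xtr, ytr, Xte)
```

Each weak learner sees only a slice of the data, so no single gene subset has to carry the signal; the vote across learners does.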

SLIDE 23

Case study: KO experiment

• Try on data in which the classes are known.

Out-of-bag error rates

  Dataset                        RF      RF(p)   ERF     E-LDA   EE-LDA
  Slc17A5 Day 0                  0.583   0.583   0.167   0.583   0.083
  Slc17A5 Day 18                 0.083   0.083   0.000   0.000   0.000
  Slc17A5 Day 0 (scrambled)      0.750   0.750   0.833   0.833   0.833
  Slc17A5 Day 18 (scrambled)     0.583   0.667   0.667   0.583   0.583

Ref: Amaratunga, Cabrera & Lee (Bioinformatics, 2008)

SLIDE 24

Wrap Up

• Megavariate data are becoming more and more prevalent.
• Megavariate data introduce special challenges:
  - overparametrized and undersampled
  - overfitting and redundancy
  - computationally challenging
• In this setting, ensemble methods are among the best choices for classification.

SLIDE 25

Wrap Up

• Scientific collaborators: Michael McMillian, Jennifer Sasaki
• References:
  - D Amaratunga and J Cabrera (2004) Exploration and Analysis of DNA Microarray and Protein Array Data. John Wiley.
  - D Amaratunga, J Cabrera and V Kovtun (2008) Microarray learning with ABC. Biostatistics.
  - D Amaratunga, J Cabrera and Y S Lee (2008) Enriched random forests. Bioinformatics.
  - D Amaratunga, J Cabrera, Y Cherkas and Y S Lee (2009) Ensemble classifiers, in review.
• Website (recent papers and software): www.amaratunga.com, www.rci.rutgers.edu/~cabrera/DNAMR
• Email: damaratu@its.jnj.com