SIB course 4-8 Feb 2008

Statistical analysis applied to genome and proteome analyses

Sven Bergmann

Department of Medical Genetics University of Lausanne Rue de Bugnon 27 - DGM 328 CH-1005 Lausanne Switzerland work: ++41-21-692-5452 cell: ++41-78-663-4980 http://serverdgm.unil.ch/bergmann

Part1:

Analysis tools for large datasets

  • Standard tools

k-means, PCA, SVD

  • Modular analysis tools

CTWC, ISA, PPA

Why study a large heterogeneous set of expression data?

Large: Better signals from noisy data! Heterogeneous: Global view of the transcription program!

Supervised vs. unsupervised approaches

Large genome-wide data may contain answers to questions we do not ask! Need for both hypothesis-driven and exploratory analyses!

Motivations

How to get large-scale expression data? Pool genome-wide expression measurements from many experiments!

[Figure: expression matrices (genes × conditions) from sets of specific conditions, e.g. stress and cell-cycle, pooled into one large-scale expression data matrix of genes × diverse conditions]

How to make sense of millions of numbers?

New Analysis and Visualization Tools are needed!

Hundreds of samples, thousands of genes

K-means Clustering

http://en.wikipedia.org/wiki/K-means_algorithm

“guess” k=3 (# of clusters)

  • 1. Start with random positions of centroids
  • 2. Assign each data point to closest centroid
  • 3. Move centroids to center of assigned points

Iterate steps 2 and 3 until the cost no longer decreases:

$$V = \sum_{i=1}^{k} \sum_{x_j \in S_i} |x_j - \mu_i|^2$$

with k clusters $S_i$, i = 1,2,...,k and centroids $\mu_i$ (the mean point of all the points $x_j \in S_i$)
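The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's code: the function name `kmeans`, the toy `points`, and the choice to draw the initial centroids from the data points are all assumptions made for the example.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Minimal k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: random starting centroids (assumption: drawn from the data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each data point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, cost cannot decrease further
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    # Cost: sum of squared distances of all points to their centroid.
    cost = sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )
    return centroids, labels, cost

# Hypothetical toy data: three well-separated groups of 2D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]

# Different seeds can end in different local minima, so keep the lowest-cost run.
best = min((kmeans(points, k=3, seed=s) for s in range(10)), key=lambda run: run[2])
print(best[0], best[2])
```

Running several seeds and keeping the cheapest result is one common way to soften the dependence on the starting positions; it does not remove the need to choose k.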

K-means Clustering

Plus:

  • visual
  • intuitive
  • relatively fast

Minus:

  • have to “guess” number of clusters
  • can give different results for distinct “starting seeds”

  • distances computed over all features
  • one cluster only per element
  • no cluster hierarchy

Hierarchical Clustering

Plus:

  • Shows (re-ordered) data
  • Gives hierarchy

Minus:

  • Does not work well for many genes (usually apply cut-off on fold-change)
  • Similarity over all genes/conditions
  • Clusters do not overlap

Principal Component Analysis

Principal components (PCs) are the directions of largest variation in the data; the data are projected onto the subspace they span

http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Example: 2PCs for 3d-data

http://ordination.okstate.edu/PCA.htm

  • Raw data points: {a, …, z}
  • Normalized data points: zero mean (& unit std)!
  • Identification of axes with the most variance: most variance is along PCA1; the direction of most variance perpendicular to PCA1 defines PCA2
  • Cluster?

Reminder: Matrix multiplications

Definition: $(A \cdot B)_{ij} = \sum_k A_{ik} B_{kj}$

http://en.wikipedia.org/wiki/Matrix_multiplication


How do we get the PCs?

  • The PCs are the eigenvectors of the covariance matrix C computed from the (mean-centered) data matrix E:

$$C = E^T \cdot E / (n-1), \qquad C \cdot pc = \lambda \cdot pc$$

[Schematic: $E^T$ (300 × 6k) · E (6k × 300) / (n-1) = C (300 × 300); C (300 × 300) · pc (300 × 1) = λ · pc (300 × 1)]

And how to project?

  • The projected data is just the product of the original data with the PCs:

$$E' = E \cdot PC$$

  • Principal Component or Transformation Matrix:

$$PC = [pc_1, pc_2, \ldots, pc_n]$$

(where n is the number of PCs used)

PCA: Example deletion mutants

[Schematic: E (6k × 300) · PC (300 × n) = E' (6k × n)]

  • The original gene expression profiles are over 300 arrays.
  • The transformed data contain projections on n “eigen-genes” (linear combinations of the 300 arrays shown in red)
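The eigenvector and projection steps can be tried on a tiny data set. The sketch below uses only the Python standard library and finds the leading eigenvectors by power iteration with deflation, a technique swapped in here for illustration (the slides do not say how the eigenvectors were computed). All function names and the toy data are hypothetical.

```python
def center(X):
    """Subtract the column means so every feature has zero mean."""
    means = [sum(col) / len(X) for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Plain matrix product: (A·B)[i][j] = sum_k A[i][k] * B[k][j]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def covariance(E):
    """C = E^T·E / (n-1) for a mean-centered matrix E with n rows."""
    n = len(E)
    return [[x / (n - 1) for x in row] for row in matmul(transpose(E), E)]

def leading_pcs(C, n_pcs, iters=500):
    """Top eigenvectors of C by power iteration with deflation (illustrative only)."""
    C = [row[:] for row in C]
    size = len(C)
    pcs, eigvals = [], []
    for _ in range(n_pcs):
        v = [1.0 + 0.1 * i for i in range(size)]  # arbitrary non-zero start vector
        for _ in range(iters):
            w = [sum(c * x for c, x in zip(row, v)) for row in C]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        # Rayleigh quotient v^T·C·v gives the eigenvalue for the unit vector v.
        lam = sum(vi * sum(c * x for c, x in zip(row, v)) for vi, row in zip(v, C))
        pcs.append(v)
        eigvals.append(lam)
        # Deflation: remove the component just found so the next round finds the next PC.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(size)] for i in range(size)]
    return transpose(pcs), eigvals  # PC matrix: one column per component

# Hypothetical 3D data that varies mostly along (1, 1, 0), less along (1, -1, 0).
raw = [[3.0, 3.0, 0.0], [-3.0, -3.0, 0.0], [1.0, -1.0, 0.0],
       [-1.0, 1.0, 0.0], [0.0, 0.0, 0.1], [0.0, 0.0, -0.1]]
E = center(raw)
PC, eigvals = leading_pcs(covariance(E), n_pcs=2)
E_proj = matmul(E, PC)  # E' = E · PC, here 6 points × 2 PCs
print(eigvals)
print(E_proj)
```

For real expression matrices a numerical library eigen-solver would be the usual choice; the point here is only that the covariance matrix, its eigenvectors and the product E · PC are all that is needed.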

PCA: Example deletion mutants

[Figure: scatter plot of the data projected onto PCA1 and PCA2]

The first 2 “eigen-genes” separate data into 3 clusters

[Figure: scatter plot of the data projected onto PCA1 and PCA3]

Third “eigen-gene” (PCA3) reveals little structure!


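Singular Value Decomposition

$E = U \cdot D \cdot V^T$ (“SVD = bi-PCA”)

  • V: PC matrix of “eigen-genes” (composed of eigenvectors of $C = E^T \cdot E$)
  • U: PC matrix of “eigen-arrays” (composed of eigenvectors of $C' = E \cdot E^T$)
  • D: diagonal matrix

http://public.lanl.gov/mewall/kluwer2002.html

SVD: Matrix representation

$E = U \cdot D \cdot V^T$

[Schematic: E (6k × 300) = U (6k × n, columns $u_1, u_2, \ldots, u_n$) · D (n × n, diagonal entries $\lambda_1, \lambda_2, \ldots, \lambda_n$) · $V^T$ (n × 300, rows $v_1, v_2, \ldots, v_n$)]

  • $u_i$: eigen-arrays
  • $v_i$: eigen-genes
  • $\lambda_i$: singular values (the eigenvalues of $E^T \cdot E$ and $E \cdot E^T$ are $\lambda_i^2$)
  • i = 1, …, n
  • n: rank(E) = #(independent arrays)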

Alter O., Brown P.O., Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 2000; 97:10101-06.

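SVD: What is optimized?

$$E = U \cdot D \cdot V^T = \sum_i \lambda_i \, u_i \, v_i^T \quad \text{(full expansion)}$$

$$E_1 = \lambda_1 \, u_1 \, v_1^T \quad \text{(rank-1 expansion)}$$

$$\Delta = |E - E_1|^2 \quad \text{(sum of squared residuals)}$$

Minimize $\Delta$ for free $u_1$ and $v_1$:

$$E \cdot v_1 = \lambda_1 u_1 \quad \& \quad E^T \cdot u_1 = \lambda_1 v_1$$

implying:

$$E \cdot E^T \cdot u_1 = \lambda_1^2 u_1 \quad \& \quad E^T \cdot E \cdot v_1 = \lambda_1^2 v_1$$

Bergmann et al., Phys. Rev. E 67, 031902 (2003)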
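The two coupled conditions E·v1 = λ1·u1 and E^T·u1 = λ1·v1 suggest a simple way to compute the rank-1 expansion: apply them alternately until u1 and v1 stop changing. The sketch below does that in plain Python; it is an illustrative assumption about how one might compute the leading triplet, not the method used for the slides, and the names `rank1_svd`, `residual` and the toy matrix are hypothetical.

```python
def rank1_svd(E, iters=200):
    """Leading singular triplet (lam, u, v) of E by alternating the two conditions."""
    n_rows, n_cols = len(E), len(E[0])
    v = [1.0 / n_cols ** 0.5] * n_cols  # arbitrary unit start vector
    lam = 0.0
    for _ in range(iters):
        # u is proportional to E·v
        u = [sum(E[g][a] * v[a] for a in range(n_cols)) for g in range(n_rows)]
        norm_u = sum(x * x for x in u) ** 0.5
        u = [x / norm_u for x in u]
        # lam·v = E^T·u, with v normalized to unit length
        w = [sum(E[g][a] * u[g] for g in range(n_rows)) for a in range(n_cols)]
        lam = sum(x * x for x in w) ** 0.5
        v = [x / lam for x in w]
    return lam, u, v

def residual(E, lam, u, v):
    """Delta = |E - lam·u·v^T|^2, the sum of squared residuals."""
    return sum(
        (E[g][a] - lam * u[g] * v[a]) ** 2
        for g in range(len(E))
        for a in range(len(E[0]))
    )

# Hypothetical 3 genes × 2 arrays matrix that is almost rank 1.
E = [[3.0, 4.0], [6.0, 8.0], [6.0, 8.5]]
lam, u, v = rank1_svd(E)
delta = residual(E, lam, u, v)
print(lam, u, v, delta)
```

At the minimum, the residual equals the squared norm of E minus λ1 squared, which gives a quick consistency check on the result.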

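SVD: Example deletion mutants

$$E_1 = \lambda_1 \, u_1 \, v_1^T = \lambda_1 \begin{pmatrix} u_1^{(1)} v_1^{(1)} & \cdots & u_1^{(1)} v_1^{(300)} \\ \vdots & & \vdots \\ u_1^{(6k)} v_1^{(1)} & \cdots & u_1^{(6k)} v_1^{(300)} \end{pmatrix}$$

[Schematic: $u_1$ (6k × 1) · $\lambda_1$ · $v_1^T$ (1 × 300) = $E_1$ (6k × 300), a pattern of high and low blocks set by the entries of $u_1$ and $v_1$]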

SVD: Example deletion mutants

[Figure, shown for n = 1, 2 and 3 components: panels “original data” (genes × arrays), “eigen-arrays U”, “eigen-genes VT” and the reconstruction “SVD(data) = U D VT” (genes × arrays)]

Part1:

Analysis tools for large datasets

  • Standard tools

k-means, PCA, SVD

  • Modular analysis tools

CTWC, ISA, PPA

How to extract biological information from large-scale expression data?

[Figure: large-scale expression data matrix]

Hierarchical clustering and other correlation-based methods may be good for small data sets, but:

Problems with large data:

  • Clusters cannot overlap!
  • Clustering based on correlations over all conditions:
    – sensitive to noise
    – computation intensive

Search for transcription modules: Set of genes co-regulated under a certain set of conditions

  • context specific
  • allow for overlaps

Overview of “modular” analysis tools

  • Cheng Y and Church GM. Biclustering of expression data. (Proc Int Conf Intell Syst Mol Biol. 2000;8:93-103)
  • Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. (Proc Natl Acad Sci U S A. 2000 Oct 24;97(22):12079-84)
  • Tanay A, Sharan R, Kupiec M, Shamir R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. (Proc Natl Acad Sci U S A. 2004 Mar 2;101(9):2981-6)
  • Sheng Q, Moreau Y, De Moor B. Biclustering microarray data by Gibbs sampling. (Bioinformatics. 2003 Oct;19 Suppl 2:ii196-205)
  • Gasch AP and Eisen MB. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. (Genome Biol. 2002 Oct 10;3(11):RESEARCH0059)
  • Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. (Genome Biol. 2000;1(2):RESEARCH0003.)

… and many more!

http://serverdgm.unil.ch/bergmann/Publications/review.pdf

Coupled two-way Clustering

How to “hear” the relevant genes?

[Figure: Song A, Song B]

Inside CTWC: Iterations

  • Init: samples S1; genes G1
  • Depth 1: samples S1(G1) → S2,S3; genes G1(S1) → G2,G3,…G5
  • Depth 2: samples S1(G2) … S1(G5) → S4,S5,S6; S10,S11; None; genes G1(S2), G1(S3) → G6,G7,….G13; G14,…G21
  • Depth 3: samples S2(G1)…S2(G5), S3(G1)…S3(G5) → S12,… …S51; genes G2(S1)…G2(S3) … G5(S1)…G5(S3) → G22… … …G97
  • Depth 4: samples S1(G6) … S1(G21) → S52,... S67; genes G1(S4) … G1(S11) → G98,..G105 … G151,..G160
  • Depth 5: samples S2(G6)...S2(G21), S3(G6)…S3(G21) → S68… …S113; genes G2(S4)...G2(S11) … G5(S4)...G5(S11) → G161… … …G216

Two-way clustering

  • No need for correlations!
  • decomposes data into “transcription modules”
  • integrates external information
  • allows for interspecies comparative analysis

One example in more detail:

The (Iterative) Signature Algorithm:

J Ihmels, G Friedlander, SB, O Sarig, Y Ziv & N Barkai Nature Genetics (2002)

Trip to the “Amazon”:

How to find related items?

[Figure: items × customers matrix; labels: your choice, customers with similar choice, recommended items]

False Positives:

How to find related genes?

[Figure: genes × conditions matrix; labels: your guess, relevant conditions, similarly expressed genes]

J Ihmels, G Friedlander, SB, O Sarig, Y Ziv & N Barkai Nature Genetics (2002)

Signature Algorithm: Score definitions

  • initial guess (genes): $G_I$
  • condition scores: $s_c = \langle E^G_{gc} \rangle_{g \in G_I}$
  • thresholding: $S_C = \{ c \in C : s_c - \langle s_c \rangle_C > t_C \sigma_C \}$
  • gene scores: $s_g = \langle s_c E^C_{gc} \rangle_{c \in S_C}$
  • thresholding: $S_G = \{ g \in G : s_g - \langle s_g \rangle_G > t_G \sigma_G \}$

How to find related genes? Scores and thresholds!

Iterative Signature Algorithm

INPUT → OUTPUT; iterate until OUTPUT = INPUT: “Transcription Module”

SB, J Ihmels & N Barkai Physical Review E (2003)
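The score-and-threshold loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: the two normalized matrices are assumed to be z-scores of each condition over genes and of each gene over conditions, the thresholds are one-sided with hypothetical defaults of 1.0, and the function names and toy matrix are invented for the example.

```python
from statistics import mean, pstdev

def zscore(values):
    """Zero mean, unit standard deviation (all zeros if the values are constant)."""
    m, s = mean(values), pstdev(values)
    return [(x - m) / s if s > 0 else 0.0 for x in values]

def above_threshold(scores, t):
    """Indices whose score exceeds the mean by more than t standard deviations."""
    m, s = mean(scores), pstdev(scores)
    return {i for i, x in enumerate(scores) if x - m > t * s}

def signature_step(E, genes, t_c=1.0, t_g=1.0):
    """One pass: input genes -> condition scores -> gene scores -> output genes."""
    n_genes, n_conds = len(E), len(E[0])
    # Assumed normalizations: E_G z-scores each condition over genes,
    # E_C z-scores each gene over conditions.
    cols = [zscore([E[g][c] for g in range(n_genes)]) for c in range(n_conds)]
    E_G = [[cols[c][g] for c in range(n_conds)] for g in range(n_genes)]
    E_C = [zscore(row) for row in E]
    # Condition scores: average of E_G over the input genes, then threshold.
    s_c = [mean(E_G[g][c] for g in genes) for c in range(n_conds)]
    conds = above_threshold(s_c, t_c)
    # Gene scores: score-weighted average of E_C over the selected conditions.
    s_g = [
        mean(s_c[c] * E_C[g][c] for c in conds) if conds else 0.0
        for g in range(n_genes)
    ]
    return above_threshold(s_g, t_g), conds

def isa(E, seed_genes, t_c=1.0, t_g=1.0, max_iter=50):
    """Iterate the signature step until OUTPUT = INPUT (or the gene set is empty)."""
    genes, conds = set(seed_genes), set()
    for _ in range(max_iter):
        new_genes, conds = signature_step(E, genes, t_c, t_g)
        converged = new_genes == genes
        genes = new_genes
        if converged or not genes:
            break
    return genes, conds

# Hypothetical toy matrix: genes 0-2 are induced in conditions 0-1, all else is 0.
E = [[5.0 if g < 3 and c < 2 else 0.0 for c in range(6)] for g in range(10)]

# The initial guess contains two module genes and one unrelated gene.
module_genes, module_conds = isa(E, seed_genes={0, 1, 5})
print(module_genes, module_conds)
```

Starting from a partly wrong guess, the loop drops the unrelated gene and picks up the missing one, which is the behaviour the INPUT/OUTPUT picture describes.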


Identification of transcription modules using many random “seeds”

random “seeds” → transcription modules

Independent identification: Modules may overlap!

New Tools: Module Visualization

http://serverdgm.unil.ch/bergmann/Fibroblasts/visualiser.html

Gene enrichment analysis

The hypergeometric distribution f(M,A,K,T) gives the probability that K out of A genes with a particular annotation match with a module having M genes if there are T genes in total. http://en.wikipedia.org/wiki/Hypergeometric_distribution
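As a worked example, the probability f(M,A,K,T) and the corresponding enrichment p-value (the chance of seeing K or more annotated genes in the module) can be computed with the standard library. The function names and the numbers in the usage line are hypothetical; the formula is the standard hypergeometric probability.

```python
from math import comb

def hypergeom_pmf(M, A, K, T):
    """P(exactly K of the A annotated genes fall in a module of M genes, T genes total)."""
    return comb(A, K) * comb(T - A, M - K) / comb(T, M)

def enrichment_pvalue(M, A, K, T):
    """P(K or more annotated genes in the module) under random sampling."""
    return sum(hypergeom_pmf(M, A, k, T) for k in range(K, min(M, A) + 1))

# Hypothetical numbers: a module of 50 genes, 100 annotated genes, 10 of them
# in the module, 6000 genes in total.
print(enrichment_pvalue(M=50, A=100, K=10, T=6000))
```

A small p-value means the overlap between module and annotation is unlikely to arise by chance; when many annotations are tested, a multiple-testing correction is normally applied on top.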

Decomposing expression data into annotated transcriptional modules

  • identified >100 transcriptional modules in yeast: high functional consistency!
  • many functional links “waiting” to be verified experimentally

J Ihmels, SB & N Barkai Bioinformatics 2005

Higher-order structure: Module hierarchies and networks

[Figure: module network; links mark correlated and anti-correlated modules]


The challenge of many datasets: How to integrate all the information?

  • Organisms
  • Data types: protein expression, tissue specific expression, interaction data, localization data, …
  • Conditions: developmental, physiological, environmental, experimental, clinical

→ ? → Biological Insight

Mapping Transcription Modules

[Figure labels: BLAST, signature algorithm]

For distant organisms correlation patterns generally are distinct

SB, J Ihmels & N Barkai PLoS Biology (2004)

What about related organisms?

J Ihmels, SB, J Berman & N Barkai Science (2005)

[Figure: pairwise correlation (over all arrays) between genes]

Promoter analysis: The “Rapid Growth Element” AATTTT

Data Integration: Example NCI60

Our (modular) approach: The model

Modules and Co-modules

[Figure: data matrices E and R over genes (G), cell-lines (C) and drugs (D), with gene-modules, drug-modules and co-modules marked]

The Ping-Pong algorithm!

Iteratively refine genes, cell-lines and drugs to get co-modules (steps 1 to 4 in the figure)

  • Co-modules have predictive power for drug-gene associations
  • Co-module analysis provides biological focus through data integration

Take-home Messages:

  • Analysis of large-scale expression data bears great potential to understand global transcription programs and their evolution
  • Innovative analysis tools are needed to extract information from such data
  • (Iterative) Signature & Ping-Pong Algorithms:
    – decompose data into “transcription modules”
    – integrate external information
    – allow for interspecies comparative analysis