Motivations Pool genome-wide expression measurements from many - PDF document

SIB course 4-8 Feb 2008
Statistical analysis applied to genome and proteome analyses

Sven Bergmann
Department of Medical Genetics
University of Lausanne
Rue de Bugnon 27 - DGM 328
CH-1005 Lausanne
Switzerland
work: ++41-21-692-5452
cell: ++41-78-663-4980
http://serverdgm.unil.ch/bergmann

Part1: Analysis tools for large datasets
• Standard tools: k-means, PCA, SVD
• Modular analysis tools: CTWC, ISA, PPA

Motivations
Why study a large heterogeneous set of expression data?
• Large: Better signals from noisy data!
• Heterogeneous: Global view of the transcription program!
Supervised vs. unsupervised approaches
• Large genome-wide data may contain answers to questions we do not ask!
• Need for both hypothesis-driven and exploratory analyses!

How to get large-scale expression data?
Pool genome-wide expression measurements from many experiments!
[Figure: expression matrices (genes x conditions) for sets of specific conditions, labeled cell-cycle and stress, pooled into large-scale expression data over diverse conditions]

How to make sense of millions of numbers?
• Hundreds of samples
• Thousands of genes
New Analysis and Visualization Tools are needed!

K-means Clustering
“guess” k=3 (# of clusters)
http://en.wikipedia.org/wiki/K-means_algorithm

K-means Clustering
“guess” k=3 (# of clusters)
1. Start with random positions of centroids
2. Assign each data point to closest centroid
3. Move centroids to center of assigned points
Iterate 2-3 until the cost V = ∑_i ∑_{x_j ∈ S_i} |x_j - µ_i|² reaches a (local) minimum, with k clusters S_i, i = 1, 2, ..., k and centroids µ_i (the mean point of all the points x_j ∈ S_i)
http://en.wikipedia.org/wiki/K-means_algorithm

Hierarchical Clustering
Plus:
• Shows (re-ordered) data
• Gives hierarchy
Minus:
• Does not work well for many genes (usually apply cut-off on fold-change)
• Similarity over all genes/conditions
• Clusters do not overlap

K-means Clustering
Plus:
• visual
• intuitive
• relatively fast
Minus:
• have to “guess” number of clusters
• can give different results for distinct “starting seeds”
• distances computed over all features
• one cluster only per element
• no cluster hierarchy
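The three k-means steps listed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code used for the slides: the function name `kmeans`, its parameters, the choice of random data points as starting centroids, and the toy data are all hypothetical.

```python
import numpy as np

def kmeans(points, k=3, n_iter=100, seed=0):
    """Plain k-means on an (n_points, n_features) float array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: start with random centroids (assumption: k distinct data points).
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)

    def assign(cents):
        # Squared Euclidean distance of every point to every centroid.
        d = ((points[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)

    for _ in range(n_iter):
        # Step 2: assign each data point to its closest centroid.
        labels = assign(centroids)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: a (local) minimum of the cost
        centroids = new_centroids

    labels = assign(centroids)
    # Cost: sum of squared distances of each point to its own centroid.
    cost = ((points - centroids[labels]) ** 2).sum()
    return labels, centroids, cost

# Toy data: three noisy groups of 2-d points (made up for this example).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
data = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
labels, centroids, cost = kmeans(data, k=3)
```

Running `kmeans` with different `seed` values is a simple way to see the dependence on “starting seeds” mentioned in the Minus list: the returned cost can differ between runs.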

Principal Component Analysis
Principal components (PCs) are projections onto the subspace with the largest variation in the data
http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Example: 2PCs for 3d-data
Raw data points: {a, …, z}
http://ordination.okstate.edu/PCA.htm

Example: 2PCs for 3d-data
Normalized data points: zero mean (& unit std)!
http://ordination.okstate.edu/PCA.htm

Example: 2PCs for 3d-data
Identification of axes with the most variance
• Most variance is along PCA1
• The direction of most variance perpendicular to PCA1 defines PCA2
http://ordination.okstate.edu/PCA.htm

Example: 2PCs for 3d-data
Cluster?
http://ordination.okstate.edu/PCA.htm

Reminder: Matrix multiplications
[Figures: definition, scheme, vectorized form, example]
http://en.wikipedia.org/wiki/Matrix multiplication
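As a small reminder of what the matrix product means, the sketch below writes out the standard definition C[i, j] = ∑_k A[i, k] · B[k, j] with explicit loops and compares it with NumPy's built-in `@` operator. The helper name `matmul_by_definition` and the example matrices are made up for illustration.

```python
import numpy as np

def matmul_by_definition(a, b):
    """C[i, j] = sum over k of A[i, k] * B[k, j], written with explicit loops."""
    n_rows, n_inner = a.shape
    n_inner_b, n_cols = b.shape
    if n_inner != n_inner_b:
        raise ValueError("inner dimensions must agree")
    c = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(n_inner):
                c[i, j] += a[i, k] * b[k, j]
    return c

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # 2 x 3
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])        # 3 x 2

C_loops = matmul_by_definition(A, B)  # 2 x 2 result from the definition
C_vectorized = A @ B                  # same product, computed by NumPy
```

Both results are the 2 x 2 matrix with rows (7, 8) and (16, 17).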

How do we get the PCs?
• The PCs are the eigenvectors of the covariance matrix C computed from the (mean-centered) data matrix E:
C = E^T · E / (n-1)
C · pc = λ · pc

PCA: Example deletion mutants
[Schematic: C (300 x 300) = E^T (300 x 6k) · E (6k x 300) / (n-1); C · pc = λ · pc, with pc a vector of length 300]

And how to project?
• The projected data is just the product of the original data with the PCs:
E’ = E · PC
• Principal Component or Transformation Matrix:
PC = [pc_1, pc_2, …, pc_n]
(where n is the number of PCs used)

PCA: Example deletion mutants
[Schematic: E’ (6k x n) = E (6k x 300) · PC (300 x n)]
• The original gene expression profiles are over 300 arrays.
• The transformed data contain projections on n “eigen-genes” (linear combinations of the 300 arrays shown in red)

PCA: Example deletion mutants
[Scatter plot: PCA2 vs. PCA1]
The first 2 “eigen-genes” separate data into 3 clusters

PCA: Example deletion mutants
[Scatter plot: PCA3 vs. PCA1]
Third “eigen-gene” (PCA3) reveals little structure!
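The two PCA formulas above (eigenvectors of C, then E’ = E · PC) can be tried out with NumPy. This is a minimal sketch, not the analysis behind the deletion-mutant plots. It assumes that rows of E are the observations (genes), that columns are the variables (arrays), that "mean-centered" means subtracting each column mean, and that the n in (n-1) is the number of rows. The function name `pca` and the toy data are hypothetical.

```python
import numpy as np

def pca(E, n_components):
    """PCA via the eigenvectors of the covariance matrix (illustrative sketch)."""
    # Mean-center each column (assumption: columns are the variables).
    E_centered = E - E.mean(axis=0)
    # Covariance matrix C = E^T . E / (n - 1), with n = number of rows.
    C = E_centered.T @ E_centered / (E.shape[0] - 1)
    # eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    PC = eigvecs[:, order]            # transformation matrix [pc_1, ..., pc_n]
    E_proj = E_centered @ PC          # projected data E' = E . PC
    return E_proj, PC, eigvals[order]

# Toy data: 200 observations of 5 variables, dominated by one latent direction.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
E_toy = latent @ np.array([[3.0, 2.0, 1.0, 0.0, 0.0]]) + 0.1 * rng.normal(size=(200, 5))
E_proj, PC, eigvals = pca(E_toy, n_components=2)
```

The variance of each projected column equals the corresponding eigenvalue λ, which is one way to read "PCs are the directions with the largest variation".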

Singular Value Decomposition
“SVD = bi-PCA”
E = U · D · V^T
• V: PC matrix of “eigen-genes” (composed of eigenvectors of C = E^T · E)
• U: PC matrix of “eigen-arrays” (composed of eigenvectors of C’ = E · E^T)
• D: diagonal matrix
Alter O., Brown P.O., Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 2000; 97:10101-06.

SVD: Matrix representation
E = U · D · V^T
[Schematic: E (6k x 300) = U (6k x n, columns u_1, u_2, …, u_n) · D (n x n, diagonal entries λ_1, λ_2, …, λ_n, zeros elsewhere) · V^T (n x 300, rows v_1, v_2, …, v_n)]
• u_i: eigen-arrays
• v_i: eigen-genes
• λ_i: singular values (λ_i² are the eigenvalues of E^T · E and E · E^T)
• i = 1, …, n
• n: rank(E) = #(independent arrays)
http://public.lanl.gov/mewall/kluwer2002.html

SVD: What is optimized?
E = U · D · V^T = ∑_i λ_i · u_i · v_i^T (full expansion)
E_1 = λ_1 · u_1 · v_1^T (rank-1 expansion)
∆ = |E - E_1|² (sum of residuals)
Minimize ∆ for free u_1 and v_1:
E · v_1 = λ_1 · u_1 & E^T · u_1 = λ_1 · v_1
implying:
E · E^T · u_1 = λ_1² · u_1 & E^T · E · v_1 = λ_1² · v_1
Bergmann et al., Phys. Rev. E 67, 031902 (2003)

SVD: Example deletion mutants
E_1 = λ_1 · u_1 · v_1^T
Entry (i, j) of E_1 is λ_1 · u_1(i) · v_1(j), for i = 1, …, 6k and j = 1, …, 300.
[Schematic: E_1 (6k x 300) = u_1 (6k x 1) · λ_1 · v_1^T (1 x 300); the outer product of u_1 and v_1 yields blocks of high and low expression]

SVD: Example deletion mutants
[Figure, n=1: original data (genes x arrays); U (genes x eigen-arrays); V^T (eigen-genes x arrays); SVD(data) = U D V^T (genes x arrays)]

SVD: Example deletion mutants
[Figure, n=2: original data (genes x arrays); U (genes x eigen-arrays); V^T (eigen-genes x arrays); SVD(data) = U D V^T (genes x arrays)]
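The SVD relations above can be checked numerically with `numpy.linalg.svd`. The sketch below uses a small random matrix as a stand-in for the genes x arrays data; the matrix size, the variable names and the helper `rank_n_approximation` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(60, 8))   # toy stand-in for a (genes x arrays) matrix

# Thin SVD: E = U . D . V^T, with the diagonal of D returned as the vector s.
U, s, Vt = np.linalg.svd(E, full_matrices=False)

# Full expansion: sum over i of lambda_i * u_i * v_i^T reproduces E.
E_full = (U * s) @ Vt

# Rank-1 expansion E_1 = lambda_1 * u_1 * v_1^T.
u1, lam1, v1 = U[:, 0], s[0], Vt[0]
E1 = lam1 * np.outer(u1, v1)

# Sum of squared residuals Delta = |E - E_1|^2.
residual = np.sum((E - E1) ** 2)

def rank_n_approximation(U, s, Vt, n):
    """Keep only the first n eigen-arrays, singular values and eigen-genes."""
    return (U[:, :n] * s[:n]) @ Vt[:n]

E2 = rank_n_approximation(U, s, Vt, 2)
E3 = rank_n_approximation(U, s, Vt, 3)
```

For this decomposition the residual of the rank-1 expansion equals the sum of the remaining squared singular values, `np.sum(s[1:] ** 2)`, and it can only shrink as more components are kept (n = 2, 3, ...).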

SVD: Example deletion mutants
[Figure, n=3: original data (genes x arrays); U (genes x eigen-arrays); V^T (eigen-genes x arrays); SVD(data) = U D V^T (genes x arrays)]

Part1: Analysis tools for large datasets
• Standard tools: k-means, PCA, SVD
• Modular analysis tools: CTWC, ISA, PPA

How to extract biological information from large-scale expression data?
Hierarchical clustering and other correlation-based methods may be good for small data sets, but:
Problems with large data:
• Clusters cannot overlap!
• Clustering based on correlations over all conditions:
  - sensitive to noise
  - computation intensive

How to extract biological information from large-scale expression data?
Search for transcription modules:
Set of genes co-regulated under a certain set of conditions
• context specific
• allow for overlaps
[Figure: expression matrix, genes x conditions]

Overview of “modular” analysis tools
• Cheng Y and Church GM. Biclustering of expression data. (Proc Int Conf Intell Syst Mol Biol. 2000;8:93-103)
• Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. (Proc Natl Acad Sci U S A. 2000 Oct 24;97(22):12079-84)
• Tanay A, Sharan R, Kupiec M, Shamir R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. (Proc Natl Acad Sci U S A. 2004 Mar 2;101(9):2981-6)
• Sheng Q, Moreau Y, De Moor B. Biclustering microarray data by Gibbs sampling. (Bioinformatics. 2003 Oct;19 Suppl 2:ii196-205)
• Gasch AP and Eisen MB. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. (Genome Biol. 2002 Oct 10;3(11):RESEARCH0059)
• Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. (Genome Biol. 2000;1(2):RESEARCH0003.)
… and many more!
http://serverdgm.unil.ch/bergmann/Publications/review.pdf

Coupled two-way Clustering
