
Unsupervised joint analysis of arrayCGH, gene expression data and supplementary features
Christine Steinhoff (1), Matteo Pardo (1,2), Martin Vingron (1)
(1) Max Planck Institute for Molecular Genetics, Berlin, Germany
(2) Sensor Lab, INFM-CNR, Brescia, Italy

Outline
o The data and how they have been analyzed until now
o MCASV: Multiple Correspondence Analysis (MCA) with Supplementary Variables
o Results
Bits09, Genova - 2 - Matteo Pardo

Array comparative genomic hybridization (aCGH)
o aCGH measures the (mean) number of copies of DNA stretches along the genome in order to detect copy number aberrations (CNAs)
[Figure: aCGH profile; plotted quantity: log2(sample/control)]
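
As a small worked example of the plotted quantity, the sketch below computes log2(sample/control) for a few copy numbers. It assumes an idealized case that the slide does not state: a diploid control with 2 copies, no measurement noise, and no admixture of normal cells. The function name `log2_ratio` is hypothetical.

```python
import math

def log2_ratio(sample_copies, control_copies=2.0):
    """Return log2(sample/control) for a given (mean) copy number.

    Assumption: the control is diploid (2 copies) unless stated otherwise.
    """
    return math.log2(sample_copies / control_copies)

# Loss of one copy, normal state, gain of one copy, gain of two copies.
for copies in (1, 2, 3, 4):
    print(copies, round(log2_ratio(copies), 3))
```

Under these assumptions a loss gives a negative value, the normal state gives zero, and a gain gives a positive value. Real profiles are noisier and their amplitude is reduced when the sample contains normal cells.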

aCGH for the study of cancer
o It is well established that cancer progresses through the accumulation of genomic and epigenomic aberrations
o Advantages of DNA over expression analysis:
  • Genomic DNA is more stable than mRNA
  • CNAs define key genetic events driving tumorigenesis
o aCGH results:
  • Identification of regions of frequent loss and gain
  • Correlation of copy number aberrations with prognosis in a variety of cancer histologies, including breast cancer and lymphoma
  • New cancer genes pinpointed, for example PPM1D in breast cancer and MITF in melanoma

Expression and aCGH together in breast cancer
o Few studies (~10) have measured genomic and transcriptomic profiles on the same patient cohort
o The causal relation between CNA and gene expression is intuitive: more gene copies → more mRNA
o Typical result: the expression level of about 60% of the genes within highly amplified regions is at least moderately elevated
o The other way around: disease subtypes are first derived from expression arrays, and distinct CNA patterns are then found for them

Data integration: what is there
o The central point is the level at which the fusion (integration, joining) actually happens:
  1. raw data level;
  2. feature level;
  3. decision level ('decision' here means 'after analysis', when a decision could be taken).
o In genomics, what has actually been performed is 'decision fusion': each data source is processed separately and the outputs are then integrated.
o For expression + aCGH: e.g. first determine regions with CNAs (possibly tissue- or patient-specific), then look for differentially expressed (onco)genes inside these regions.
o Natural reason for pushing integration to a later stage: strong heterogeneity does not allow a sensible alignment of the source data.
o But: interaction effects are lost.

Joint analysis of expression and aCGH: what is there
o Berger et al., 2006: unsupervised analysis with generalized singular value decomposition on the fused expression-aCGH matrix:
  • Convenient feature (of any vector space decomposition approach): visualization is intuitive
  • Does not take into account biomedical covariates (grade, ER and p53 status, …)
  • Does not distinguish between gene states
o Lee et al., 2008:
  • Calculate correlations between all pairs of genes, between the CGH and expression matrices
  • Biclustering on the correlation matrix: find modules, then study enrichment
  • No summary plot ("one point one gene")
  • No consideration of medical covariates

Joint analysis of expression and aCGH: what we do
o Berger et al., 2006: unsupervised analysis with generalized singular value decomposition on the fused expression-aCGH matrix:
  • Convenient feature (of any vector space decomposition approach): visualization is intuitive
  • Does not take into account biomedical covariates (grade, ER and p53 status, …)
  • Does not distinguish between gene states
o MCASV has been applied in the social sciences but, to our knowledge, not to the analysis of high-throughput biological data
o Quite common in France ('French school': Analyse Géométrique des Données!)

Correspondence Analysis
o In a few words: PCA for discrete data
o In some more words:
  • Applies to contingency tables (cross-tabulation of two discrete variables)
  • Investigates departure from independence (as the chi-square test does)
  • Investigates similarity between profile vectors (row vectors divided by the row marginals, so that the components of each profile sum to 1)
  • The same applies to the analysis of the columns
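
The two ingredients named above, row profiles and departure from independence, can be illustrated with a minimal NumPy sketch. The contingency table and the function names `row_profiles` and `chi_square_statistic` are made up for illustration and do not come from the talk.

```python
import numpy as np

def row_profiles(table):
    """Divide each row by its row marginal, so every profile sums to 1."""
    table = np.asarray(table, dtype=float)
    return table / table.sum(axis=1, keepdims=True)

def chi_square_statistic(table):
    """Pearson chi-square statistic: departure of the table from independence."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence: (row total * column total) / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return ((table - expected) ** 2 / expected).sum()

# Hypothetical 3x3 cross-tabulation of two discrete variables.
table = np.array([[20, 10, 5],
                  [10, 20, 10],
                  [5, 10, 20]])

print(row_profiles(table))
print(chi_square_statistic(table))
```

A table whose rows all have the same profile gives a statistic of zero. The more the profiles differ, the larger the statistic.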

Correspondence Analysis ctd.
o As in PCA: find low-dimensional projections which maximize a criterion of preserved "information"
o In PCA: the criterion is the total variance (= sum of squared distances from the mean inside the reduced-dimensional space). The metric is Euclidean.
o In CA: the criterion is the total inertia (= weighted sum of squared distances from the barycentre inside the reduced-dimensional space). The metric is chi-square:
  • The weight is given by the row marginal, i.e. profiles associated with more objects count more
  • The chi-square norm weights each profile component by the inverse column marginal, i.e. a category which is less represented counts more
o Once the projection is found, supplementary points can be projected
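
One common way to compute such a projection is a singular value decomposition of the standardized residuals of the table. The sketch below follows that textbook formulation and is not necessarily the implementation used in the talk. The example table and the function name `correspondence_analysis` are hypothetical.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple CA via SVD of the standardized residuals.

    Returns row and column principal coordinates and the principal
    inertias (squared singular values).
    """
    P = np.asarray(table, dtype=float)
    P = P / P.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row marginals (row weights)
    c = P.sum(axis=0)                    # column marginals
    # Residuals from independence, scaled by the chi-square metric.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = U * sv / np.sqrt(r)[:, None]
    col_coords = Vt.T * sv / np.sqrt(c)[:, None]
    return row_coords, col_coords, sv ** 2

table = np.array([[20, 10, 5],
                  [10, 20, 10],
                  [5, 10, 20]])
rows, cols, inertias = correspondence_analysis(table)
print(rows[:, :2])          # rows projected on the first two axes
print(inertias.sum())       # total inertia
```

In this formulation the total inertia equals the chi-square statistic of the table divided by the total count, which links the projection criterion to the departure from independence.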

Multiple CA
o Extension to more than two discrete variables: not straightforward
o Standard MCA: CA on the indicator matrix
o Amounts to maximizing the mean projected inertia, where the mean is taken over all the cross-tabulations of two variables. This includes each variable's self-tabulation (which also has maximal inertia)
o Empirical corrections exist for excluding the contributions of the self-tabulations
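
To make the indicator matrix and the role of the self-tabulations concrete, the following sketch one-hot codes a small categorical matrix and forms the table of all pairwise cross-tabulations (the Burt matrix). The layout (rows are objects, columns are variables), the three levels {-1, 0, 1}, the toy data, and the function name `indicator_matrix` are assumptions made for illustration.

```python
import numpy as np

def indicator_matrix(cat, levels=(-1, 0, 1)):
    """One-hot code each column of `cat` (objects x variables) over `levels`."""
    cat = np.asarray(cat)
    lv = np.array(levels)
    blocks = [(cat[:, [j]] == lv).astype(int) for j in range(cat.shape[1])]
    return np.hstack(blocks)

# Hypothetical data: 4 objects, 3 discretized variables.
cat = np.array([[ 1, 1,  0],
                [ 0, 1, -1],
                [-1, 0,  0],
                [ 1, 1,  1]])

Z = indicator_matrix(cat)   # shape: 4 x (3 variables * 3 levels)
burt = Z.T @ Z              # all pairwise cross-tabulations in one super-table

print(Z)
print(burt[:3, :3])         # self-tabulation of the first variable: diagonal
```

The diagonal blocks of `burt` are the self-tabulations. They are diagonal matrices of category counts, which is why they carry maximal inertia and inflate the total. Levels that never occur produce all-zero columns, which would have to be dropped before running CA on `Z` or `burt`.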

Integrate data with different distributions
o Clinical covariates: discrete categories, e.g.

  grade  stage  died
  2      1      yes
  4      3      no
  2      2      yes

o Continuous data: one distribution is symmetric after appropriate normalization; the other is not symmetric (skewed, approximately lognormal)
→ Discretize all

Pipeline
(1) Data input: aCGH (A), expression (E), covariates (P)
(2) Discretization: CBS, FC
(3) Filtering: Var, Corr; result: categorical matrices Cat_E and Cat_A with entries in {-1, 0, 1}
(4) Indicator coding: indicator matrices I_E, I_A and I_P with entries in {0, 1}
(5) MCASV: B = [I_E I_A]^t [I_E I_A]; B* additionally includes the covariate indicators I_P

Discretization + Filtering
o Input: A, E ∈ R^(N×p)
o Discretization: aCGH (A) with Circular Binary Segmentation (R package DNAcopy); expression (E) with a two-fold change threshold
o Filtering: genes with the highest variance across patients, or genes with the highest correlation between aCGH and expression
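
The expression discretization and the two filters can be sketched in NumPy as follows. Several things here are assumptions and not taken from the slides: values are log2 ratios against a reference, rows are genes and columns are patients, and the correlation is Pearson correlation. All function names are hypothetical. The segmentation of the aCGH data (DNAcopy, in R) is not reproduced.

```python
import numpy as np

def discretize_fold_change(log2_ratios, fold=2.0):
    """Map log2 ratios to {-1, 0, 1} using a fold-change threshold."""
    t = np.log2(fold)
    x = np.asarray(log2_ratios, dtype=float)
    return np.where(x >= t, 1, np.where(x <= -t, -1, 0))

def top_variance_genes(data, k):
    """Indices of the k genes (rows) with highest variance across patients."""
    return np.argsort(np.var(data, axis=1))[::-1][:k]

def top_correlated_genes(acgh, expr, k):
    """Indices of the k genes with highest aCGH/expression correlation."""
    a = acgh - acgh.mean(axis=1, keepdims=True)
    e = expr - expr.mean(axis=1, keepdims=True)
    # Row-wise Pearson correlation; assumes no gene is constant across patients.
    corr = (a * e).sum(axis=1) / np.sqrt((a ** 2).sum(axis=1) * (e ** 2).sum(axis=1))
    return np.argsort(corr)[::-1][:k]

# Hypothetical data: 2 genes x 3 patients.
expr = np.array([[0.0, 1.2, -1.5],
                 [0.1, 0.0, -0.1]])
acgh = np.array([[0.0, 1.0, -1.0],
                 [-0.1, 0.0, 0.1]])

print(discretize_fold_change(expr))
print(top_variance_genes(expr, 1), top_correlated_genes(acgh, expr, 1))
```

Either filter returns row indices that can be used to subset both matrices before the indicator coding, so that the same genes are kept in A and E.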

MCASV
o Reference: Nenadic, O. and Greenacre, M. (2006) Multiple Correspondence Analysis and Related Methods. Chapman & Hall/CRC, London
o Burt matrix B: super-table of all contingency tables (between gene pairs)
o MCA: find the plane maximizing inertia
o Project the covariates on the plane: supplementary table I_P^t [I_E I_A]
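
Projecting supplementary rows onto an already fitted plane can be sketched with the standard CA transition formula: a supplementary profile is placed at the weighted average of the column standard coordinates. In the setting of the talk the active table would presumably be the Burt matrix and the supplementary rows the covariate table; that mapping is an assumption, and the toy tables and function names below are hypothetical.

```python
import numpy as np

def ca_fit(table):
    """Fit simple CA; return row principal and column standard coordinates."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_principal = U * sv / np.sqrt(r)[:, None]
    col_standard = Vt.T / np.sqrt(c)[:, None]
    return row_principal, col_standard, sv

def project_supplementary_rows(sup_table, col_standard, n_dims=2):
    """Place supplementary rows on the first n_dims axes of a fitted CA.

    Supplementary rows do not influence the axes; each row profile is
    averaged over the column standard coordinates.
    """
    sup = np.asarray(sup_table, dtype=float)
    profiles = sup / sup.sum(axis=1, keepdims=True)
    return profiles @ col_standard[:, :n_dims]

active = np.array([[20, 10, 5],
                   [10, 20, 10],
                   [5, 10, 20]])
supplementary = np.array([[8, 3, 1],
                          [2, 2, 6]])

row_principal, col_standard, sv = ca_fit(active)
print(project_supplementary_rows(supplementary, col_standard))
```

A useful sanity check of this design is that an active row, passed through `project_supplementary_rows`, lands on its own principal coordinates on the non-trivial axes.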

Data
Results are shown for the correlation filter, 100 genes

MCA plot: Genes

MCA plot: clinical covariates

MCA plot: Supplementary Variables
• The plot is centered on the genes' mean profile.
• Genes and covariate states near the origin are less informative.

MCA plot: Supplementary Variables
• Each covariate state is placed at the center of the gene patterns (= patients) having that state.
• E.g., the (projection of the) mean gene pattern of the patients with tumor grade 1 is represented by the point Grade.1.

MCA plot: Supplementary Variables
Tumor grades 1, 2 and 3 separate (only) along the first component → the gene pattern of a patient is determined primarily by the tumor grade.

MCA plot: Supplementary Variables
• ER and p53 status also display considerable variation along the first component.
• p53 mutant and ER- lie on the side of higher tumor grade.
• ER- has the highest score on the 1st component → strongest negative indicator?

MCA plot: Supplementary Variables
• Tumor stages separate clearly from each other but show no order.
• This hints at heterogeneity of gene patterns inside each state → lack of genomic support for this classification?

MCA plot: Supplementary Variables
• Node status has no projection on the first component → independent of tumor grade progression.
• It explains part of the remaining information in the data.
• Node- has the largest value on the 2nd MCA component.

Selected known genes
MYC and ERBB2 are well known to be coordinately amplified and overexpressed in breast cancers with bad prognosis

Genes related to clinical state ER-
GO category enrichment
