
Unsupervised joint analysis of arrayCGH, gene expression data and supplementary features


  1. Unsupervised joint analysis of arrayCGH, gene expression data and supplementary features
     Christine Steinhoff (1), Matteo Pardo (1,2), Martin Vingron (1)
     (1) Max Planck Institute for Molecular Genetics, Berlin, Germany
     (2) Sensor Lab, INFM-CNR, Brescia, Italy

  2. Outline
     o The data and how they have been analyzed until now
     o MCASV: Multiple Correspondence Analysis (MCA) with Supplementary Variables
     o Results
     Bits09, Genova - 2 - Matteo Pardo

  3. Array comparative genomic hybridization (aCGH)
     o aCGH measures the (mean) number of copies of DNA stretches along the genome in order to detect copy number aberrations (CNAs)
     o Signal: log2(sample/control)
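As a minimal sketch of the log2(sample/control) signal, assuming per-probe intensities that are proportional to copy number (the function name and the example values are hypothetical, not taken from the slides):

```python
import numpy as np

def log2_ratio(sample, control):
    """Per-probe log2(sample/control) ratio."""
    sample = np.asarray(sample, dtype=float)
    control = np.asarray(control, dtype=float)
    return np.log2(sample / control)

# Hypothetical intensities; the control is assumed diploid (2 copies).
control = np.array([2.0, 2.0, 2.0, 2.0])
sample = np.array([2.0, 4.0, 1.0, 3.0])   # normal, gain, loss, single-copy gain
ratios = log2_ratio(sample, control)
```

Under this assumption a value of 0 means no change, positive values indicate gains and negative values indicate losses.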

  4. aCGH for the study of cancer
     o Well established that cancer progresses through accumulation of genomic and epigenomic aberrations
     o Advantages of DNA over expression analysis:
        • Genomic DNA is more stable than mRNA
        • CNAs define key genetic events driving tumorigenesis
     o aCGH results:
        • Identification of regions of frequent loss and gain
        • Correlation of copy-number aberrations with prognosis in a variety of cancer histologies, including breast and lymphoma
        • Pinpointed new cancer genes, for example PPM1D in breast cancer and MITF in melanoma

  5. Expression and aCGH together in breast cancer
     o Few studies (~10) have measured genomic and transcriptomic profiles on the same patient cohort
     o The causal relation CNA - gene expression is intuitive: more gene copies → more mRNA
     o Typical result: the expression level of about 60% of the genes within highly amplified regions is at least moderately elevated
     o The other way around: disease subtypes are first derived from expression arrays, and distinct CNA patterns are subsequently found for them

  6. Data Integration: what is there
     o The central point is the level at which the fusion (integration, joining) actually happens:
        1. raw data level;
        2. feature level;
        3. decision level ('decision' means as much as 'after analysis', when a decision could be taken).
     o In genomics, what has really been performed is 'decision fusion': each data source is processed separately and the outputs are then integrated.
     o For expression + aCGH: e.g. first determine regions with CNAs (possibly tissue- or patient-specific) and then look for differentially expressed (onco)genes inside these regions
     o Natural reason for pushing integration to a later stage: strong heterogeneity does not allow sensible alignment of source data.
     o But: this loses interaction effects

  7. Joint analysis of expression and aCGH: what is there
     o Berger et al., 2006: unsupervised analysis with generalized singular value decomposition on fused expression-aCGH matrix:
        • Convenient feature (of any vector space decomposition approach): visualization is intuitive
        • Does not take into account biomedical covariates (grade, ER and p53 status, ...)
        • Does not distinguish between gene states
     o Lee et al., 2008:
        • Calculate correlations between all pairs of genes, between cgh and expression matrices
        • Biclustering on correlation matrix: find modules, then study enrichment
        • No summary plot ("one point one gene")
        • No consideration of medical covariates

  8. Joint analysis of expression and aCGH: what we do
     o Berger et al., 2006: unsupervised analysis with generalized singular value decomposition on fused expression-aCGH matrix:
        • Convenient feature (of any vector space decomposition approach): visualization is intuitive
        • Does not take into account biomedical covariates (grade, ER and p53 status, ...)
        • Does not distinguish between gene states
     o MCASV:
        • Has been applied in the context of social sciences but, to our knowledge, not for biological high-throughput data analysis
        • Quite common in France ('French school': Analyse Géométrique des Données!)

  9. Correspondence Analysis
     o In a few words: PCA for discrete data
     o In some more words:
        • Applies to contingency tables (cross tabulation of two discrete variables)
        • Investigates departure from independence (as the chi-square test does)
        • Investigates similarity between profile vectors (row vectors divided by the row marginals, i.e. the components of each profile sum to 1)
        • The same applies to the column analysis
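The two ingredients named on this slide, row profiles and departure from independence, can be sketched as follows. The contingency tables are made-up numbers and the function names are hypothetical; this is only an illustration of the definitions, not the authors' code.

```python
import numpy as np

def row_profiles(N):
    """Divide each row by its row marginal, so every profile sums to 1."""
    N = np.asarray(N, dtype=float)
    return N / N.sum(axis=1, keepdims=True)

def chi_square_statistic(N):
    """Pearson chi-square statistic: departure of N from independence."""
    N = np.asarray(N, dtype=float)
    expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
    return ((N - expected) ** 2 / expected).sum()

# Hypothetical cross tabulation of two discrete variables.
table = np.array([[20, 10, 5],
                  [10, 15, 10],
                  [5, 10, 15]])
profiles = row_profiles(table)
chi2 = chi_square_statistic(table)

# A table built as an outer product is exactly independent:
# all row profiles coincide and the statistic is zero.
independent = np.outer([10, 20], [1, 2, 3])
```

When all row profiles are identical there is nothing for correspondence analysis to display; the more the profiles differ, the larger the statistic.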

  10. Correspondence Analysis ctd.
      o As in PCA: find low-dimensional projections which maximize a criterion of preserved "information"
      o In PCA: the criterion is total variance (= sum of squared distances from the mean inside the reduced dimensional space). The metric is Euclidean.
      o In CA: the criterion is total inertia (= weighted sum of squared distances from the barycentre inside the reduced dimensional space). The metric is chi-square:
         • The weight is given by the row marginal, i.e. profiles associated with more objects count more
         • The chi-square norm weights each profile component by the inverse column marginal, i.e. a category which is less represented counts more
      o Once the projection is found, supplementary points can be projected
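A minimal sketch of simple CA, assuming the usual formulation through an SVD of the standardized residuals. The table, the supplementary row and all function names are hypothetical. It illustrates the total inertia as a mass-weighted sum of squared distances and the projection of a supplementary point.

```python
import numpy as np

def correspondence_analysis(N):
    """CA of a contingency table via SVD of the standardized residuals."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses (row marginals)
    c = P.sum(axis=0)                    # column masses (column marginals)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(N.shape) - 1                 # drop the trivial zero dimension
    U, sv, Vt = U[:, :k], sv[:k], Vt[:k]
    row_principal = (U * sv) / np.sqrt(r)[:, None]
    col_standard = Vt.T / np.sqrt(c)[:, None]
    return row_principal, col_standard, sv, r

def project_supplementary_row(counts, col_standard):
    """Place a new row on the existing axes using its profile."""
    counts = np.asarray(counts, dtype=float)
    return (counts / counts.sum()) @ col_standard

N = np.array([[20, 10, 5],
              [10, 15, 10],
              [5, 10, 15]])
row_principal, col_standard, sv, r = correspondence_analysis(N)
total_inertia = (sv ** 2).sum()
# Hypothetical supplementary row: it does not influence the axes.
sup = project_supplementary_row([12, 9, 4], col_standard)
```

A supplementary point is positioned from its profile alone, so projecting one of the active rows this way returns that row's own coordinates.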

  11. Multiple CA
      o Extension to more than two discrete variables: not straightforward
      o Standard MCA: CA on the indicator matrix
      o Amounts to: maximize the mean projected inertia, where the mean is taken over all cross tabulations of two variables. This includes each variable's self-tabulation (which also has maximal inertia)
      o Empirical corrections exist for excluding the contributions of the self-tabulations
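To make the indicator matrix and the self-tabulations concrete, here is a small sketch with two made-up categorical variables (the data and the helper name are hypothetical). The product Z^t Z stacks all pairwise cross tabulations; its diagonal blocks are the self-tabulations, which are diagonal matrices.

```python
import numpy as np

def indicator_matrix(columns):
    """One-hot code several categorical columns side by side."""
    blocks, labels = [], []
    for j, col in enumerate(columns):
        col = np.asarray(col)
        cats = np.unique(col)            # observed categories, sorted
        blocks.append((col[:, None] == cats[None, :]).astype(float))
        labels.extend((j, cat) for cat in cats)
    return np.hstack(blocks), labels

# Hypothetical covariates for six patients.
grade = [1, 2, 3, 2, 3, 1]
er = ["pos", "pos", "neg", "neg", "neg", "pos"]

Z, labels = indicator_matrix([grade, er])
burt = Z.T @ Z                           # super-table of all cross tabulations

self_tab = burt[:3, :3]                  # grade x grade: diagonal
cross_tab = burt[:3, 3:]                 # grade x ER contingency table
```

Each row of Z contains exactly one 1 per variable, which is why the self-tabulations carry no information about associations between variables and are the target of the corrections mentioned on the slide.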

  12. Integrate data with different distributions
      o Clinical covariates: discrete categories, e.g.
           grade  stage  died
           2      1      yes
           4      3      no
           2      2      yes
      o Distribution panels (figure not preserved), labelled "After appropriate normalization approx. lognormal, symmetric" and "Not symmetric, skew"
      → Discretize all

  13. Pipeline
      (1) Data INPUT: A (aCGH, n x p), E (expression, n x p), P (covariates, p x m)
      (2) Discretization: CBS, FC; A and E become matrices in {-1,0,1}^(n x p), P is categorical (Cat)
      (3) Filtering: Var, Corr
      (4) Indicator coding: indicator matrices I_A, I_E in {0,1}^(3n x p) and I_P in {0,1}^(Cat(m) x p)
      (5) MCASV: B = [I_E I_A]^t [I_E I_A], B* = [I_E I_A]^t I_P

  14. Discretization + Filtering
      o Input: A, E in R^(N x p)
      o Discretization:
         • A: Circular Binary Segmentation (R package DNAcopy)
         • E: two-fold change
      o Filtering:
         • Genes with highest variance across patients
         • Genes with highest correlation between aCGH and expression
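A rough sketch of the expression-side discretization and the two filters. It assumes that E holds log2 ratios against a reference, that the two-fold cut is symmetric and inclusive, and that the correlation is Pearson's; none of this is stated on the slide. Circular Binary Segmentation is not reproduced here (the slide uses the R package DNAcopy for that step). All function names and data are hypothetical.

```python
import numpy as np

def discretize_fold_change(log2_expr, fold=2.0):
    """Map log2 ratios to {-1, 0, 1} with a symmetric fold-change cut."""
    log2_expr = np.asarray(log2_expr, dtype=float)
    cut = np.log2(fold)
    states = np.zeros(log2_expr.shape, dtype=int)
    states[log2_expr >= cut] = 1
    states[log2_expr <= -cut] = -1
    return states

def top_by_variance(X, k):
    """Indices of the k genes (rows) with the highest variance across patients."""
    return np.argsort(np.var(X, axis=1))[::-1][:k]

def top_by_correlation(A, E, k):
    """Indices of the k genes with the highest aCGH/expression correlation."""
    Ac = A - A.mean(axis=1, keepdims=True)
    Ec = E - E.mean(axis=1, keepdims=True)
    num = (Ac * Ec).sum(axis=1)
    den = np.sqrt((Ac ** 2).sum(axis=1) * (Ec ** 2).sum(axis=1))
    corr = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return np.argsort(corr)[::-1][:k]

# Hypothetical data: 2 genes (rows) x 3 patients (columns).
E = np.array([[1.5, -1.2, 0.3],
              [0.1, 0.0, -0.2]])
A = np.array([[0.8, -0.9, 0.1],
              [0.5, 0.4, 0.6]])
E_states = discretize_fold_change(E)
```

Genes with constant values get a correlation of 0 in this sketch, so they are never preferred by the correlation filter.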

  15. MCASV
      o Burt matrix B: super-table of all contingency tables (between gene pairs)
      o MCA: find the plane maximizing inertia
      o Project covariates on the plane: I_P^t [I_E I_A]
      Nenadic, O. and Greenacre, M. (2006) Multiple Correspondence Analysis and Related Methods. Chapman & Hall/CRC, London
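The idea of finding the plane from the gene states and then projecting the covariates onto it can be sketched as below. Two substitutions are made for brevity and are assumptions of this sketch: it runs CA on a patients x categories indicator matrix rather than on the Burt matrix named on the slide, and it places each supplementary category at the centroid of the patients' standard coordinates. The data and all names are hypothetical.

```python
import numpy as np

def one_hot(values):
    """Indicator columns for the observed categories of one variable."""
    values = np.asarray(values)
    cats = np.unique(values)
    return (values[:, None] == cats[None, :]).astype(float)

def mca_with_supplementary(Z_active, Z_sup, n_components=2):
    """CA of the active indicator matrix; supplementary columns are projected."""
    Z_active = np.asarray(Z_active, dtype=float)
    Z_sup = np.asarray(Z_sup, dtype=float)
    P = Z_active / Z_active.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    U, sv, Vt = U[:, :n_components], sv[:n_components], Vt[:n_components]
    row_standard = U / np.sqrt(r)[:, None]               # patients
    col_principal = (Vt.T * sv) / np.sqrt(c)[:, None]    # gene states
    # Supplementary category = centroid of the patients that carry it.
    sup_profiles = Z_sup / Z_sup.sum(axis=0, keepdims=True)
    sup_principal = sup_profiles.T @ row_standard
    return col_principal, sup_principal, sv

# Hypothetical discretized states for six patients.
gene1_expr = [1, 1, 0, 0, -1, -1]
gene1_cgh = [1, 1, 0, 0, 0, -1]
gene2_expr = [0, 1, 0, -1, 0, -1]
grade = [3, 3, 2, 2, 1, 1]

Z_active = np.hstack([one_hot(gene1_expr), one_hot(gene1_cgh), one_hot(gene2_expr)])
Z_grade = one_hot(grade)
gene_points, grade_points, sv = mca_with_supplementary(Z_active, Z_grade)
```

The covariate columns do not enter the SVD, so they do not change the plane; they are only positioned on it afterwards.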

  16. Data
      o Results are shown for the correlation filter, 100 genes

  17. MCA plot: Genes

  18. MCA plot: clinical covariates

  19. MCA plot: Supplementary Variables
      • The plot is centered on the genes' mean profile.
      • Genes and covariate states which are near the origin are less informative.

  20. MCA plot: Supplementary Variables
      • Each covariate state is plotted at the center of the gene patterns (= patients) having that state.
      • E.g., the (projection of the) mean gene pattern of the patients having tumor grade 1 is represented by the point Grade.1.

  21. MCA plot: Supplementary Variables
      • Tumor grades 1, 2 and 3 separate (only) along the first component → the gene pattern of a patient is determined foremost by the tumor grade.

  22. MCA plot: Supplementary Variables
      • ER and p53 status also display considerable variation along the first component
      • p53 mutant and ER– lie on the side of higher tumor grade.
      • ER– has the highest score on the 1st component → strongest negative indicator?

  23. MCA plot: Supplementary Variables
      • Tumor stages separate clearly from each other but show no order.
      • This hints at heterogeneity of gene patterns inside each stage → lack of genomic support for this classification?

  24. MCA plot: Supplementary Variables
      • Node status has no projection on the first component → independent of tumor grade progression.
      • Explains part of the remaining information in the data.
      • Node– has the largest value on the 2nd MCA component

  25. Selected known genes
      • MYC and ERBB2 are well known to be coordinately amplified and overexpressed in breast cancers with bad prognosis

  26. Genes related to clinical state ER–
      • GO category enrichment
