slide-1
SLIDE 1

Unsupervised joint analysis of arrayCGH, gene expression data and supplementary features

Christine Steinhoff (1), Matteo Pardo (1,2), Martin Vingron (1)

(1) Max Planck Institute for Molecular Genetics, Berlin, Germany

(2) Sensor Lab, INFM-CNR, Brescia, Italy

slide-2
SLIDE 2

Bits09, Genova - 2 - Matteo Pardo

Outline

  • The data and how they have been analyzed until now

  • MCASV: Multiple Correspondence Analysis (MCA) with Supplementary Variables

  • Results
slide-3
SLIDE 3

Array comparative genomic hybridization (aCGH)

  • aCGH measures the (mean) number of copies of DNA stretches along the genome in order to detect copy number aberrations (CNA)

[Figure: log2(sample/control)]
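
As a minimal sketch of the quantity named on the figure, the per-probe log2 ratio can be computed as below. The intensity values and the function name `log2_ratio` are hypothetical and are not taken from the talk.

```python
import numpy as np

def log2_ratio(sample, control):
    """Per-probe log2(sample/control); 0 means no copy-number change."""
    sample = np.asarray(sample, dtype=float)
    control = np.asarray(control, dtype=float)
    return np.log2(sample / control)

# Hypothetical intensities for five probes (illustration only).
sample = np.array([100.0, 200.0, 50.0, 100.0, 400.0])
control = np.array([100.0, 100.0, 100.0, 100.0, 100.0])
ratios = log2_ratio(sample, control)
print(ratios)
```

Positive values point to a gain relative to the control, negative values to a loss.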

slide-4
SLIDE 4

aCGH for the study of cancer

  • It is well established that cancer progresses through the accumulation of genomic and epigenomic aberrations

  • Advantages of DNA over expression analysis:
      • Genomic DNA is more stable than mRNA
      • CNAs define key genetic events driving tumorigenesis
  • aCGH results:
      • Identification of regions of frequent loss and gain
      • Correlation of copy-number aberrations with prognosis in a variety of cancer histologies, including breast cancer and lymphoma
      • New cancer genes pinpointed, for example PPM1D in breast cancer and MITF in melanoma

slide-5
SLIDE 5

Expression and aCGH together in breast cancer

  • Few studies (~10) have measured genomic and transcriptomic profiles on the same patient cohort

  • The causal relation between CNA and gene expression is intuitive: more gene copies → more mRNA

  • Typical result: the expression level of about 60% of the genes within highly amplified regions is at least moderately elevated.

  • The other way around: disease subtypes are first derived from expression arrays, and distinct CNA patterns are then found for them

slide-6
SLIDE 6

Data Integration: what is there

  • The central point is the level at which the fusion (integration, joining) actually happens:

1. raw data level;

2. feature level;

3. decision level (‘decision’ here means ‘after analysis’, when a decision could be taken).

  • In genomics, what has really been performed is ‘decision fusion’: each data source is processed separately and the outputs are then integrated.

  • For expression + aCGH: e.g. first determine regions with CNAs (possibly tissue- or patient-specific) and then look for differentially expressed (onco)genes inside these regions

  • Natural reason for pushing integration to a later stage: strong heterogeneity does not allow sensible alignment of the source data.

  • But: interaction effects are lost
slide-7
SLIDE 7

Joint analysis of expression and aCGH: what is there

  • Berger et al., 2006: unsupervised analysis with generalized singular value decomposition on a fused expression-aCGH matrix:
      • Convenient feature (of any vector space decomposition approach): visualization is intuitive
      • Does not take into account biomedical covariates (grade, ER and p53 status, …)
      • Does not distinguish between gene states
  • Lee et al., 2008:
      • Calculate correlations between all pairs of genes, between the aCGH and expression matrices
      • Biclustering on the correlation matrix: find modules, then study enrichment
      • No summary plot (“one point one gene”)
      • No consideration of medical covariates
slide-8
SLIDE 8

Joint analysis of expression and aCGH: what we do

  • Berger et al., 2006: unsupervised analysis with generalized singular value decomposition on a fused expression-aCGH matrix:
      • Convenient feature (of any vector space decomposition approach): visualization is intuitive
      • Does not take into account biomedical covariates (grade, ER and p53 status, …)
      • Does not distinguish between gene states
  • MCASV has been applied in the social sciences but, to our knowledge, not to high-throughput biological data analysis.

  • Quite common in France (‘French school’: Analyse Géométrique des Données!)

slide-9
SLIDE 9

Correspondence Analysis

  • In a few words: PCA for discrete data
  • In some more words:
      • Applies to contingency tables (cross-tabulation of two discrete variables)
      • Investigates departure from independence (as the chi-square test does)
      • Investigates similarity between profile vectors (row vectors divided by the row marginals, i.e. the components of each profile sum to 1)
      • The same applies to the analysis of the columns
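
The two ingredients named above, row profiles and departure from independence, can be sketched on a small contingency table. The counts are hypothetical and the function names are illustrative only.

```python
import numpy as np

def row_profiles(table):
    """Divide each row by its marginal so that every profile sums to 1."""
    table = np.asarray(table, dtype=float)
    return table / table.sum(axis=1, keepdims=True)

def chi_square_statistic(table):
    """Pearson chi-square: squared departure from the independence model."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return float(((table - expected) ** 2 / expected).sum())

# Hypothetical 2 x 2 cross-tabulation of two discrete variables.
table = np.array([[20, 10],
                  [10, 20]])
profiles = row_profiles(table)
chi2 = chi_square_statistic(table)
print(profiles)
print(chi2)
```

Column profiles follow in the same way by applying `row_profiles` to the transposed table.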
slide-10
SLIDE 10

Correspondence Analysis ctd.

  • As in PCA: find low-dimensional projections which maximize a criterion of preserved “information”

  • In PCA: the criterion is total variance (= sum of squared distances from the mean inside the reduced-dimensional space). The metric is Euclidean.

  • In CA: the criterion is total inertia (= weighted sum of squared distances from the barycentre inside the reduced-dimensional space). The metric is chi-square:
      • The weight is given by the row marginal, i.e. profiles associated with more objects count more
      • The chi-square norm weights each profile component by the inverse column marginal, i.e. a category which is less represented counts more
  • Once the projection is found, supplementary points can be projected
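
A minimal sketch of simple CA through the singular value decomposition of the standardized residuals may help here. It assumes the textbook formulation (total inertia equals chi-square divided by the grand total) and uses a hypothetical table; it is not the implementation used for the talk.

```python
import numpy as np

def correspondence_analysis(table, n_components=2):
    """Simple CA of a two-way contingency table via SVD."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses (row marginals)
    c = P.sum(axis=0)                    # column masses (column marginals)
    # Standardized residuals: departure from independence, chi-square metric.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    row_principal = (U[:, :k] * sigma[:k]) / np.sqrt(r)[:, None]
    col_standard = Vt.T[:, :k] / np.sqrt(c)[:, None]
    total_inertia = float(np.sum(sigma ** 2))
    return row_principal, col_standard, total_inertia

def project_supplementary_row(counts, col_standard):
    """Place a new row on the existing axes using only its profile."""
    counts = np.asarray(counts, dtype=float)
    return (counts / counts.sum()) @ col_standard

# Hypothetical 3 x 3 contingency table.
table = np.array([[20, 5, 5],
                  [5, 15, 10],
                  [2, 8, 30]])
rows, cols, inertia = correspondence_analysis(table)
extra = project_supplementary_row([10, 10, 10], cols)
print(inertia)
print(extra)
```

A supplementary point does not influence the axes; it is only positioned on them. Projecting an active row in this way reproduces its own coordinates.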
slide-11
SLIDE 11

Multiple CA

  • Extension to more than two discrete variables: not straightforward

  • Standard MCA: CA on the indicator matrix
  • Amounts to: maximize the mean projected inertia, where the mean is taken over all the tabulations of two variables. This includes each variable’s self-tabulation (which also has maximal inertia)

  • Empirical corrections exist for excluding the contributions of self-tabulation
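
To make the role of self-tabulation concrete, the sketch below builds an indicator matrix for two categorical variables and the corresponding Burt matrix. The variable names echo the covariates mentioned in the talk, but the values are made up for illustration.

```python
import numpy as np

def indicator_matrix(columns):
    """One-hot code each categorical variable and concatenate the blocks."""
    blocks = []
    for values in columns:
        values = np.asarray(values)
        categories = np.unique(values)          # sorted category labels
        blocks.append((values[:, None] == categories[None, :]).astype(int))
    return np.hstack(blocks)

# Hypothetical categorical data for six objects.
grade = ["1", "2", "2", "3", "3", "3"]
er = ["pos", "pos", "neg", "neg", "pos", "neg"]

Z = indicator_matrix([grade, er])   # 6 objects x (3 + 2) categories
burt = Z.T @ Z                      # all pairwise cross-tabulations
print(burt)
```

The diagonal blocks of `burt` are the self-tabulations: they are diagonal matrices of category counts. The off-diagonal blocks are the ordinary contingency tables between different variables.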

slide-12
SLIDE 12

Integrate data with different distributions

Example of clinical covariates:

grade  stage  Died
2      1      Yes
4      3      No
2      2      Yes

Distributions of the data sources:

  • After appropriate normalization: approximately lognormal, symmetric
  • Not symmetric, skewed
  • Discrete categories

→ Discretize all

slide-13
SLIDE 13

Pipeline

Data input → Discretization → Filtering → Indicator coding → MCASV

  • (1) Data input: expression E, aCGH A, clinical covariates $P \in \mathrm{Cat}^{p \times m}$
  • (2) Discretization: fold change (FC) for E, circular binary segmentation (CBS) for A, giving $E, A \in \{-1, 0, 1\}^{n \times p}$
  • (3) Filtering: correlation (Corr) or variance (Var) filter
  • (4) Indicator coding (indicator matrices): $I_E \in \{0,1\}^{3n \times p}$, $I_A \in \{0,1\}^{3n \times p}$, $I_P \in \{0,1\}^{\mathrm{Cat}(m) \times p}$
  • (5) MCASV on the matrices below (with $I_E$ and $I_A$ stacked along the gene-state dimension)

$$B = [I_E\ I_A]\,[I_E\ I_A]^t, \qquad B^* = [I_E\ I_A]\, I_P^t$$
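
The indicator-coding step can be sketched as follows. The sketch assumes one particular reading of the slide's formulas: genes are rows, patients are columns, each gene contributes three 0/1 rows (one per state), B cross-tabulates gene states with gene states, and B* cross-tabulates gene states with covariate categories. All data values and the helper name `indicator_code` are hypothetical.

```python
import numpy as np

# Assumed meaning of the three states: -1 = loss/under, 0 = unchanged, +1 = gain/over.
STATES = (-1, 0, 1)

def indicator_code(discrete):
    """Map an n x p matrix over {-1, 0, 1} to a 3n x p matrix over {0, 1}.

    Rows 3g, 3g+1, 3g+2 flag the states -1, 0, +1 of gene g.
    """
    discrete = np.asarray(discrete)
    flags = np.stack([discrete == s for s in STATES], axis=1)   # n x 3 x p
    return flags.reshape(-1, discrete.shape[1]).astype(int)

# Hypothetical discretized data: n = 2 genes, p = 3 patients.
E = np.array([[1, 0, -1],
              [0, 0, 1]])
A = np.array([[1, 1, -1],
              [0, -1, 1]])
I_E = indicator_code(E)
I_A = indicator_code(A)

Z = np.vstack([I_E, I_A])        # 6n x p, gene states of both sources
B = Z @ Z.T                      # 6n x 6n table between gene states

# One hypothetical covariate (grade) per patient, coded as categories x patients.
grade = np.array([1, 2, 2])
I_P = (grade[None, :] == np.unique(grade)[:, None]).astype(int)
B_star = Z @ I_P.T               # gene states x covariate categories
print(B.shape, B_star.shape)
```

Under this reading, each column of `B_star` counts how often every gene state occurs among the patients in one covariate category.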

slide-14
SLIDE 14

Discretization + Filtering

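  • Discretization:
      • Expression, $E \in \mathbb{R}^{N \times p}$: two-fold change
      • aCGH, $A \in \mathbb{R}^{N \times p}$: Circular Binary Segmentation (R package DNAcopy)
  • Filtering:
      • Genes with highest correlation between aCGH and expression
      • Genes with highest variance across patients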
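
A rough sketch of the fold-change discretization and the two filters follows. It assumes that the inputs are log2 ratios with genes as rows and patients as columns, and that "two-fold change" means |log2 ratio| ≥ 1. The segmentation of aCGH data with DNAcopy is not reproduced. Function names and data are hypothetical.

```python
import numpy as np

def discretize_fold_change(log2_ratios, fold=2.0):
    """Code each value as +1 (up), -1 (down) or 0 using a fold-change cutoff."""
    threshold = np.log2(fold)
    x = np.asarray(log2_ratios, dtype=float)
    return np.where(x >= threshold, 1, np.where(x <= -threshold, -1, 0))

def top_variance_genes(X, k):
    """Indices of the k genes (rows) with the highest variance across patients."""
    variances = np.var(np.asarray(X, dtype=float), axis=1)
    return np.sort(np.argsort(variances)[::-1][:k])

def top_correlated_genes(A, E, k):
    """Indices of the k genes with the highest Pearson correlation of A and E."""
    A = np.asarray(A, dtype=float)
    E = np.asarray(E, dtype=float)
    A_c = A - A.mean(axis=1, keepdims=True)
    E_c = E - E.mean(axis=1, keepdims=True)
    num = (A_c * E_c).sum(axis=1)
    denom = np.sqrt((A_c ** 2).sum(axis=1) * (E_c ** 2).sum(axis=1))
    corr = np.full(A.shape[0], -np.inf)      # constant genes are ranked last
    ok = denom > 0
    corr[ok] = num[ok] / denom[ok]
    return np.sort(np.argsort(corr)[::-1][:k])

# Hypothetical log2 ratios: 3 genes x 3 patients.
E = np.array([[0.1, 1.5, -1.2],
              [0.0, 0.2, -0.1],
              [2.0, -2.0, 0.5]])
A = np.array([[0.0, 1.0, -1.0],
              [0.3, -0.2, 0.1],
              [-1.0, 1.0, 0.0]])
E_discrete = discretize_fold_change(E)
by_variance = top_variance_genes(E, k=2)
by_correlation = top_correlated_genes(A, E, k=1)
print(E_discrete, by_variance, by_correlation)
```

Whether the filter is applied before or after discretization is a design choice the slide does not spell out; here both filters act on the continuous values.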

slide-15
SLIDE 15

MCASV

  • Burt matrix $B$: super-table of all contingency tables (between gene pairs)
  • MCA: find the plane maximizing inertia
  • Project the covariates on the plane

[Equation relating $B$ to the indicator matrices $I_E$, $I_A$ and $I_P^t$]

Nenadic, O. and Greenacre, M. (2006) Multiple Correspondence Analysis and Related Methods. Chapman & Hall/CRC, London

slide-16
SLIDE 16

Data

Show results for correlation filter, 100 genes

slide-17
SLIDE 17

MCA plot: Genes

slide-18
SLIDE 18

MCA plot: clinical covariates

slide-19
SLIDE 19

MCA plot: Supplementary Variables

  • The plot is centered on the genes’ mean profile.

  • Genes and covariate states near the origin are less informative.

slide-20
SLIDE 20

MCA plot: Supplementary Variables

  • Each covariate status is plotted at the center of the gene patterns (= patients) having that status.

  • E.g., the (projection of the) mean gene pattern of the patients having tumor grade 1 is represented by the point Grade.1.
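
The centroid reading can be checked numerically. The sketch below uses a convention that may differ from the talk's: patients are rows of a 0/1 indicator matrix of gene states, and a covariate status is positioned at the mean of the coordinates of the patients having it. It also shows that this equals projecting the pooled gene-state counts of those patients as a single supplementary row. Data and names are hypothetical.

```python
import numpy as np

def ca_on_indicator(Z, n_components=2):
    """CA of a patients x gene-state indicator matrix.

    Returns patient coordinates (profile times column standard coordinates)
    and the column standard coordinates.
    """
    Z = np.asarray(Z, dtype=float)
    P = Z / Z.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    col_standard = Vt.T[:, :n_components] / np.sqrt(c)[:, None]
    patient_coords = (Z / Z.sum(axis=1, keepdims=True)) @ col_standard
    return patient_coords, col_standard

# Hypothetical data: 6 patients, 2 genes with states (-1, 0, +1) each.
Z = np.array([[0, 0, 1, 0, 0, 1],
              [0, 0, 1, 0, 1, 0],
              [0, 1, 0, 0, 1, 0],
              [0, 1, 0, 1, 0, 0],
              [1, 0, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 1]])
grade = np.array([3, 3, 2, 2, 1, 1])   # hypothetical covariate

patients, col_standard = ca_on_indicator(Z)

mask = grade == 1
centroid = patients[mask].mean(axis=0)          # the point "Grade.1"
pooled = Z[mask].sum(axis=0)                    # gene-state counts in grade 1
projected = (pooled / pooled.sum()) @ col_standard
print(centroid, projected)
```

Because every patient row has the same total, averaging the patient coordinates and projecting the pooled counts give the same point.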

slide-21
SLIDE 21

MCA plot: Supplementary Variables

Tumor grades 1, 2 and 3 separate (only) along the first component → the gene pattern of a patient is determined primarily by the tumor grade.

slide-22
SLIDE 22

MCA plot: Supplementary Variables

  • ER and p53 status also display considerable variation along the first component

  • p53 mutant and ER– lie on the side of higher tumor grade.

  • ER– has the highest score on the 1st component → strongest negative indicator?

slide-23
SLIDE 23

MCA plot: Supplementary Variables

  • Tumor stages separate clearly from each other but show no order.
  • This hints at heterogeneity of gene patterns inside each state → lack of genomic support for this classification?

slide-24
SLIDE 24

MCA plot: Supplementary Variables

  • Node status has no projection on the first component → independent of tumor grade progression.

  • It explains part of the remaining information in the data.

  • Node- has the biggest value on the 2nd MCA component

slide-25
SLIDE 25

Selected known genes

MYC and ERBB2 are well known to be amplified and overexpressed coordinately in breast cancers with poor prognosis

slide-26
SLIDE 26

Genes related to clinical state ER-

GO category enrichment

slide-27
SLIDE 27

If there is time… more selected genes around ER-

Two parameters:

  • Input data for MCA:
      1. aCGH+expr, correlation filter
      2. aCGH+expr, variance filter
      3. aCGH, variance filter
      4. expr, variance filter
  • Angle selected around ER- (degrees): 5, 10, 15, (30)

Enrichment investigated with WebGestalt in:

  • GO terms
  • Chromosome localization
  • Biocarta pathways
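
One plausible way to select genes "around" a covariate point is to keep the points on the MCA plane whose direction from the origin lies within a given angle of the covariate's direction. The sketch below implements that reading. Whether the talk measured the angle in exactly this way (for example as a half-width) is an assumption, and the coordinates are invented.

```python
import numpy as np

def select_within_angle(points, reference, max_degrees):
    """Indices of points whose angle to `reference`, seen from the origin,
    is at most `max_degrees`. Points at the origin are never selected."""
    points = np.asarray(points, dtype=float)
    reference = np.asarray(reference, dtype=float)
    norms = np.linalg.norm(points, axis=1) * np.linalg.norm(reference)
    cosines = np.full(len(points), -1.0)
    ok = norms > 0
    cosines[ok] = (points[ok] @ reference) / norms[ok]
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return np.flatnonzero(angles <= max_degrees)

# Hypothetical coordinates on the first two MCA components.
er_negative = np.array([1.0, 0.0])
gene_states = np.array([[2.0, 0.0],
                        [1.0, 0.1],
                        [1.0, 1.0],
                        [-1.0, 0.0],
                        [0.0, 0.0]])

for degrees in (5, 10, 15, 30):
    print(degrees, select_within_angle(gene_states, er_negative, degrees))
```

Widening the angle can only add points, so the selected gene sets are nested across 5, 10, 15 and 30 degrees.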
slide-28
SLIDE 28

Joint, correlation filter

[Figure: chromosome distribution of the selected genes, for angles of 5, 10, 15 and 30 degrees]

slide-29
SLIDE 29

aCGH only, variance filter

slide-30
SLIDE 30

Joint, variance filter

10 degrees: no chromosome localization but significant GO + Biocarta enrichment:

  • Caspase Cascade in Apoptosis
  • Role of Mitochondria in Apoptotic Si
  • HIV-I Nef: negative effector of Fas and TNF

slide-31
SLIDE 31

Conclusions

  • Data integration at the feature level
  • Visualization method
  • Clinical covariates permit interpretation
  • Gene states are displayed
  • Gene sets related to covariates can be selected
  • Interesting enrichments found only by integrating aCGH with expression data

slide-32
SLIDE 32

Improvements…

  • Discretization options to be further tested
  • Metric for ranking genes relative to a covariate state, then use GSEA on the ranked list

  • Alternatively: distinguish between genes from expr and cgh. Do enrichment on the module, as in Lee et al.

  • More systematic investigation:
  • More datasets
  • More covariate states
  • Supervised analysis
slide-33
SLIDE 33

Genes intersections

Supplementary figure 1

Filter   F1   F1acgh  F1expr  F2   # selected genes
F1            85      103     16   179
F1acgh                18      34   193
F1expr                        4    189
F2                                 194

Supplementary figure 2 (10 degree; empty cells were lost in extraction, so row values are listed in slide order)

F1: 4, 5, 26
F1acgh: 1, 36
F1expr: 28
F2: 27
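
Intersection counts of this kind can be produced with a few lines of code. The sketch below uses placeholder gene identifiers and reuses the filter labels only as dictionary keys; it does not reproduce the numbers from the slide.

```python
from itertools import combinations

def pairwise_intersections(gene_sets):
    """Number of shared genes for every pair of named selections."""
    return {
        (a, b): len(gene_sets[a] & gene_sets[b])
        for a, b in combinations(gene_sets, 2)
    }

# Placeholder selections (hypothetical gene identifiers).
selections = {
    "F1": {"g1", "g2", "g3"},
    "F1acgh": {"g2", "g3", "g4"},
    "F1expr": {"g3", "g5"},
    "F2": {"g6"},
}
counts = pairwise_intersections(selections)
sizes = {name: len(genes) for name, genes in selections.items()}
print(counts)
print(sizes)
```

The pair counts fill the upper triangle of such a table, and `sizes` gives the "# selected genes" column.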