TCGA gene expression data Nianxiang Zhang BCB, MDACC 7/14/10



slide-1
SLIDE 1

Assessment of batch effects in TCGA gene expression data

Nianxiang Zhang BCB, MDACC 7/14/10

slide-2
SLIDE 2

Outline

  • TCGA data used
  • Batch effects in TCGA data
  • Identification of batch effects
    – Algorithms
      • PCA and hierarchical clustering
      • Correlation of correlations (CR)
    – Batch effects in TCGA gene expression data on GBM and ovarian cancer
  • Adjustment for batch effects
    – Methods of batch effects adjustment
    – Adjustment for batch effects in TCGA gene expression data on GBM and ovarian cancer
    – Comparison of adjustment methods
  • Implications
slide-3
SLIDE 3

Data

Level 3 gene expression data on GBM and Ovarian cancer

  • 3 platforms
    – Affymetrix U133a
    – Agilent
    – Affymetrix Exon array
  • GBM
    – 11 batches, 372 tumor samples
  • OV
    – 13 batches, 511 samples; 30 samples excluded

slide-4
SLIDE 4

Batch Effects in TCGA

  • TCGA data are collected in multiple batches
  • TCGA data come from multiple platforms, analyses, and institutions
  • Batch effects can be very important for biological and clinical predictions

slide-5
SLIDE 5

OV Data Distribution By Batch

slide-6
SLIDE 6

Ovarian Cancer Data

slide-7
SLIDE 7

GBM Data

slide-8
SLIDE 8

Identification of Batch Effects

  • Standard techniques
    – Principal component analysis (PCA)
    – Clustering analysis (1-Pearson metric, Ward linkage)
  • Correlation of correlations (CR)
    – A scalar index of the similarity of batches in terms of gene-gene interactions
      • CR=1 if batches are identical
      • CR=0 if batches are uncorrelated
slide-9
SLIDE 9

Calculation of Correlation of Correlations (CR)

Uij denotes the correlation of genes i and j in batch 1

Vij denotes the correlation of genes i and j in batch 2

(Scherf, …. Weinstein, Nature Genetics 2000; 24:236)

Permutation test of CR provides the statistical significance of batch effects
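The formula shown on this slide did not survive in the transcript. Assuming CR is the Pearson correlation between the gene-pair correlations of the two batches, taken over all pairs i < j (which is how the next slide describes it), it can be written as follows. The bars denote means over all gene pairs; this notation is an assumption, not taken from the slide.

```latex
\mathrm{CR} =
\frac{\sum_{i<j} \left(U_{ij} - \bar{U}\right)\left(V_{ij} - \bar{V}\right)}
     {\sqrt{\sum_{i<j} \left(U_{ij} - \bar{U}\right)^{2}}\,
      \sqrt{\sum_{i<j} \left(V_{ij} - \bar{V}\right)^{2}}}
```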

slide-10
SLIDE 10

Visualization of the Correlation of Correlations Calculation

(for 4 genes and batches consisting of 4 and 3 samples)

[Figure: two panels, Batch 1 and Batch 2, each listing Gene 1 to Gene 4. Example entries: R12 = Corr(1,2) in Batch 1 and R'12 = Corr(1,2) in Batch 2.]

Then calculate a scalar quantity, the correlation between the vector of 6 correlation coefficients for Batch 1 and the vector of 6 correlation coefficients for Batch 2:

$$\mathrm{CR} = \mathrm{Corr}\left[(R_{12}, R_{13}, R_{14}, R_{23}, R_{24}, R_{34}),\ (R'_{12}, R'_{13}, R'_{14}, R'_{23}, R'_{24}, R'_{34})\right]$$
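The calculation described on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used for the TCGA analysis; the function names and the toy expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def gene_pair_correlations(batch):
    """batch: list of genes, each a list of expression values across the batch's samples.

    Returns the correlations for all gene pairs (i, j) with i < j, in a fixed order.
    """
    return [pearson(batch[i], batch[j])
            for i, j in combinations(range(len(batch)), 2)]


def correlation_of_correlations(batch1, batch2):
    """CR: correlation between the two batches' vectors of gene-gene correlations."""
    return pearson(gene_pair_correlations(batch1), gene_pair_correlations(batch2))


# Toy data (made up): 4 genes, with 4 samples in batch 1 and 3 samples in batch 2.
batch1 = [[1.0, 2.0, 3.0, 4.0],
          [2.0, 1.0, 4.0, 3.0],
          [4.0, 3.0, 2.0, 1.0],
          [1.0, 3.0, 2.0, 5.0]]
batch2 = [[1.0, 2.0, 4.0],
          [2.0, 3.0, 1.0],
          [3.0, 1.0, 2.0],
          [1.0, 4.0, 2.0]]

cr_value = correlation_of_correlations(batch1, batch2)
```

The two batches may contain different numbers of samples, because each batch is first reduced to one correlation per gene pair; only the gene lists need to match.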

slide-11
SLIDE 11

Permutation test of CRs

We scramble the batch labels of the samples in the two batches and calculate CR between the two permuted batches to obtain the distribution of CR under H0.

[Figure: permutation distribution of CR, with the actual CR between the two batches marked (two-sided p).]
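A rough sketch of this permutation test follows, using only the Python standard library. The slide does not say how the two-sided p-value is formed; doubling the smaller tail with a +1 correction is one common convention and is an assumption here, as are the function names, the number of permutations, and the simulated data.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pair_correlations(samples):
    """samples: list of samples, each a list of gene values. Returns all gene-gene correlations."""
    genes = list(zip(*samples))  # transpose to one tuple of values per gene
    return [pearson(genes[i], genes[j])
            for i, j in combinations(range(len(genes)), 2)]


def cr(samples1, samples2):
    """Correlation of correlations between two batches of samples."""
    return pearson(pair_correlations(samples1), pair_correlations(samples2))


def permutation_p_value(samples1, samples2, n_perm=200, seed=0):
    """Two-sided permutation p-value for the observed CR (assumed convention)."""
    rng = random.Random(seed)
    observed = cr(samples1, samples2)
    pooled = list(samples1) + list(samples2)
    n1 = len(samples1)
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # scramble batch labels, keeping the batch sizes
        null.append(cr(pooled[:n1], pooled[n1:]))
    lower = (1 + sum(c <= observed for c in null)) / (n_perm + 1)
    upper = (1 + sum(c >= observed for c in null)) / (n_perm + 1)
    return min(1.0, 2 * min(lower, upper))


# Simulated data (hypothetical): two batches of 8 samples, 5 genes each.
data_rng = random.Random(1)
batch_a = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(8)]
batch_b = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(8)]
p = permutation_p_value(batch_a, batch_b)
```

Shuffling whole samples rather than individual values keeps each sample's gene profile intact, so only the batch assignment is randomized.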

slide-12
SLIDE 12

PCA GBM data

Batch 16, 20 Batch 16

slide-13
SLIDE 13

GBM:Affy

slide-14
SLIDE 14

GBM:Agilent

slide-15
SLIDE 15

GBM:Exon

slide-16
SLIDE 16

Tests for batch effects using CR:GBM

slide-17
SLIDE 17

Q-values for testing batch effects in GBM data

slide-18
SLIDE 18

PCA-Ovarian data

Batch 9, 11

slide-19
SLIDE 19

Ovarian-Affymetrix

slide-20
SLIDE 20

Ovarian-Agilent

slide-21
SLIDE 21

Ovarian-Exon

Batch 9, 11

slide-22
SLIDE 22

Tests for batch effects using CR:OV

slide-23
SLIDE 23

Q-values for batch effects in OV data

slide-24
SLIDE 24

Batch effects in unified OV gene expression data

[Figure panels: Unadjusted Affy U133a Data; Unified Gene Expression Data]

slide-25
SLIDE 25

Adjustment of Batch Effects

  • Empirical Bayes (ComBat)
    – Parametric prior (EBP)
    – Nonparametric prior (EBNP)
  • Median Polish
    – Overall (MP)
    – Within each batch (MPB)
  • ANOVA
    – Naïve ANOVA (AN)
    – With variance shrinkage (WAN)
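The slides do not spell out how any of these methods were implemented. As a rough illustration of the simplest idea in the list, the sketch below removes the fitted batch term of a gene-wise one-way ANOVA with batch as the only factor, which amounts to centering each batch on the gene's grand mean. Whether this matches the AN method used here is an assumption; the function name and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_batches(values, batches):
    """Remove per-batch mean shifts for one gene.

    values:  expression values of a single gene, one per sample
    batches: batch label of each sample, in the same order
    Returns the values with each batch mean moved to the grand mean.
    """
    grand_mean = mean(values)
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_mean = {batch: mean(vals) for batch, vals in by_batch.items()}
    return [value - batch_mean[batch] + grand_mean
            for value, batch in zip(values, batches)]


# Made-up example: one gene measured in two batches with a large shift between them.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = [9, 9, 9, 11, 11, 11]
adjusted = center_batches(values, batches)
```

A location-only correction like this leaves within-batch variance untouched and cannot tell technical shifts from biological ones.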

slide-26
SLIDE 26

Batch effect adjustment

slide-27
SLIDE 27

GBM:Affy

slide-28
SLIDE 28

GBM Agilent

slide-29
SLIDE 29

GBM:Exon data

slide-30
SLIDE 30

Effect of batch effects adjustment on gene expression

slide-31
SLIDE 31

Assessment of batch effects with adjustments

[Figure: CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data, batches 9 and 11-15. x-axis: p-values; y-axis: cumulative probability. Curves: U, EBP, EBNP, MP, MPB, AN, WAN. Annotations: p-value = 0.05; Unadjusted data.]

slide-32
SLIDE 32

Assessment of batch effects with adjustments

[Figure: CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data. x-axis: p-values; y-axis: cumulative probability. Curves: U, EBP, EBNP, MP, MPB, AN, WAN. Annotations: p-value = 0.05; Unadjusted data; MPB adjusted data.]

slide-33
SLIDE 33

Assessment of batch effects with adjustments: OV

[Figure: CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data, batches 9 and 11 to 15. x-axis: p-values; y-axis: cumulative probability. Curves: U, EBP, EBNP, MP, MPB, AN, WAN. Annotations: p-value = 0.05; MPB adjusted data; Unadjusted data.]
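One way to read these CDF plots is at the p-value = 0.05 reference line: the height of a curve there is the fraction of permutation tests that come out significant at that level. The sketch below computes that fraction for a list of p-values; the function name and the p-values are hypothetical and are not taken from the figures.

```python
def ecdf_at(p_values, threshold):
    """Empirical CDF of the p-values evaluated at threshold: fraction with p <= threshold."""
    return sum(p <= threshold for p in p_values) / len(p_values)


# Made-up p-values for two hypothetical versions of the same batch comparisons.
unadjusted_p = [0.01, 0.04, 0.20, 0.60]
adjusted_p = [0.03, 0.30, 0.55, 0.90]

frac_unadjusted = ecdf_at(unadjusted_p, 0.05)
frac_adjusted = ecdf_at(adjusted_p, 0.05)
```

If no batch effects remained, the p-values would be expected to be roughly uniform, so the curve would lie near the diagonal and its height at 0.05 would be close to 0.05.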

slide-34
SLIDE 34

Association of Clinical Outcomes with Batches

Overall survival by batch (TCGA ovarian cancer data)

[Figure: two survival panels, each labeled Batch 9; P<0.001 and P=0.018.]

slide-35
SLIDE 35

Implications

  • Assessments based on the Correlation of Correlations parameter can be used to identify batch effects in TCGA data. This is complemented by principal component analysis and hierarchical clustering.
  • Batch effects exist in TCGA GBM and ovarian cancer data
  • Be cautious when adjusting for batch effects.
    – The batch differences may be technical or biological
      • We do not want to correct biological differences
      • We do want to correct technical differences (bias)
    – Some methods may over-massage the data
  • The impact of batch effects on clinical predictions from the data remains to be determined.

slide-36
SLIDE 36

Thank you!
