Assessment of batch effects in TCGA gene expression data
Nianxiang Zhang, BCB, MDACC, 7/14/10
Outline
- TCGA data used
- Batch effects in TCGA data
- Identification of batch effects
– Algorithms
- PCA and Hierarchical clustering
- Correlation of correlations (CR)
– Batch effects in TCGA gene expression data on GBM and ovarian cancer
- Adjustment for batch effects
– Methods of batch effects adjustment
– Adjustment for batch effects in TCGA gene expression data on GBM and ovarian cancer
– Comparison of adjustment methods
- Implications
Data
Level 3 gene expression data on GBM and Ovarian cancer
- 3 platforms
– Affymetrix U133a
– Agilent
– Affymetrix Exon array
- GBM
– 11 batches, 372 tumor samples
- OV
– 13 batches, 511 samples; 30 samples excluded
Batch Effects in TCGA
- TCGA data are collected in multiple batches
- TCGA data come from multiple platforms, analyses, and institutions
- Batch effects can be very important for biological and clinical predictions
OV Data Distribution By Batch
Ovarian Cancer Data
GBM Data
Identification of Batch Effects
- Standard techniques
– Principal component analysis (PCA)
– Clustering analysis (distance metric 1 − Pearson correlation, Ward linkage)
- Correlation of correlations (CR)
– A scalar index of the similarity of batches in terms of gene-gene interactions
- CR=1 if batches are identical
- CR=0 if batches are uncorrelated
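For the PCA check listed under standard techniques above, a minimal sketch may help. It assumes a genes × samples expression matrix and uses a plain SVD; the function name `pca_scores` and the simulated two-batch data are illustrative assumptions, not part of the talk.

```python
import numpy as np

def pca_scores(expr, n_components=2):
    """expr: genes x samples matrix. Returns (sample scores, explained variance ratios)."""
    x = expr.T - expr.T.mean(axis=0)            # samples x genes, each gene centered
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    var_ratio = s ** 2 / np.sum(s ** 2)
    return scores, var_ratio[:n_components]

# Simulated data (assumption): batch B carries a gene-wise shift relative to batch A.
rng = np.random.default_rng(0)
n_genes = 200
batch_a = rng.normal(size=(n_genes, 10))
shift = rng.normal(scale=2.0, size=(n_genes, 1))
batch_b = rng.normal(size=(n_genes, 8)) + shift
expr = np.hstack([batch_a, batch_b])
labels = np.array(["A"] * 10 + ["B"] * 8)

scores, var_ratio = pca_scores(expr)
for b in ("A", "B"):
    # Samples from a shifted batch sit apart from the others along PC1.
    print(b, "mean PC1 score:", scores[labels == b, 0].mean())
```

Plotting the first two score columns colored by batch label gives the kind of picture used to spot outlying batches.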
Calculation of Correlation of Correlations (CR)
Uij denotes the correlation of genes i and j in batch 1
Vij denotes the correlation of genes i and j in batch 2
(Scherf, …. Weinstein, Nature Genetics 2000; 24:236)
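Written out with the Uij and Vij defined on this slide, and assuming CR is the Pearson correlation taken over all gene pairs i < j (the bars denote means over those pairs), the index would read:

```latex
\mathrm{CR} =
\frac{\sum_{i<j} \left(U_{ij} - \bar{U}\right)\left(V_{ij} - \bar{V}\right)}
     {\sqrt{\sum_{i<j} \left(U_{ij} - \bar{U}\right)^{2}}\,
      \sqrt{\sum_{i<j} \left(V_{ij} - \bar{V}\right)^{2}}}
```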
Permutation test of CR provides the statistical significance of batch effects
[Figure: two panels, Batch 1 and Batch 2, each listing Gene 1 to Gene 4; R12 = Corr(1,2) in Batch 1 and R’12 = Corr(1,2) in Batch 2]
Visualization of the Correlation of Correlations Calculation
(for 4 genes and batches consisting of 4 and 3 samples)
Then calculate a scalar quantity: the correlation between the vector of 6 correlation coefficients for Batch 1 and the vector of 6 correlation coefficients for Batch 2.
CR = Corr[(R12, R13, R14, R23, R24, R34), (R’12, R’13, R’14, R’23, R’24, R’34)]
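The calculation above can be sketched in a few lines. This is a minimal illustration assuming Pearson correlation and genes × samples matrices; the function names and the random toy batches (4 genes, with 4 and 3 samples) are assumptions made for the example.

```python
import numpy as np

def gene_gene_correlations(batch):
    """batch: genes x samples. Returns the correlations for all gene pairs i < j."""
    corr = np.corrcoef(batch)                   # rows (genes) are the variables
    i, j = np.triu_indices(corr.shape[0], k=1)  # order: (1,2),(1,3),(1,4),(2,3),(2,4),(3,4)
    return corr[i, j]

def correlation_of_correlations(batch1, batch2):
    """Pearson correlation between the two vectors of gene-gene correlations."""
    r1 = gene_gene_correlations(batch1)
    r2 = gene_gene_correlations(batch2)
    return float(np.corrcoef(r1, r2)[0, 1])

rng = np.random.default_rng(1)
batch1 = rng.normal(size=(4, 4))   # 4 genes, 4 samples
batch2 = rng.normal(size=(4, 3))   # 4 genes, 3 samples

print(gene_gene_correlations(batch1).shape)          # (6,)
print(correlation_of_correlations(batch1, batch2))   # somewhere in [-1, 1]
print(correlation_of_correlations(batch1, batch1))   # identical batches give CR = 1 (up to rounding)
```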
Permutation test of CRs
We scramble the batch labels of the samples in the two batches and calculate CR between the two permuted batches; repeating this gives the distribution of CR under H0. The actual CR between the two batches is then compared with this distribution (two-sided p).
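A sketch of this permutation test follows. The slide says only that the p-value is two-sided, so the specific formula here (twice the smaller tail, with a +1 correction, capped at 1) is an assumption, as are the function names, the number of permutations, and the toy data.

```python
import numpy as np

def cr(batch1, batch2):
    """Correlation of correlations for two genes x samples matrices."""
    i, j = np.triu_indices(batch1.shape[0], k=1)
    r1 = np.corrcoef(batch1)[i, j]
    r2 = np.corrcoef(batch2)[i, j]
    return float(np.corrcoef(r1, r2)[0, 1])

def cr_permutation_test(batch1, batch2, n_perm=1000, seed=0):
    """Returns (observed CR, null CRs, two-sided p) by shuffling batch labels."""
    rng = np.random.default_rng(seed)
    observed = cr(batch1, batch2)
    pooled = np.hstack([batch1, batch2])
    n1 = batch1.shape[1]
    null = np.empty(n_perm)
    for k in range(n_perm):
        order = rng.permutation(pooled.shape[1])   # scramble the batch labels
        null[k] = cr(pooled[:, order[:n1]], pooled[:, order[n1:]])
    # Assumed two-sided rule: twice the smaller tail probability, capped at 1.
    lower = (np.sum(null <= observed) + 1) / (n_perm + 1)
    upper = (np.sum(null >= observed) + 1) / (n_perm + 1)
    p_value = min(1.0, 2.0 * min(lower, upper))
    return observed, null, p_value

rng = np.random.default_rng(2)
batch1 = rng.normal(size=(6, 8))   # 6 genes, 8 samples
batch2 = rng.normal(size=(6, 7))   # 6 genes, 7 samples
observed, null, p_value = cr_permutation_test(batch1, batch2, n_perm=200)
print("observed CR:", observed, "p-value:", p_value)
```

Each permuted batch needs enough samples for the gene-gene correlations to be defined, which is why the toy batches here are larger than in the 4-gene visualization.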
PCA GBM data
Batch 16, 20 Batch 16
GBM:Affy
GBM:Agilent
GBM:Exon
Tests for batch effects using CR:GBM
Q-values for testing batch effects in GBM data
PCA-Ovarian data
Batch 9, 11
Ovarian-Affymetrix
Ovarian-Agilent
Ovarian-Exon
Batch 9, 11
Tests for batch effects using CR:OV
Q-values for batch effects in OV data
Batch effects in unified OV gene expression data
Unadjusted Affy U133a Data
Unified Gene Expression Data
Adjustment of Batch Effects
- Empirical Bayes (ComBat)
– Parametric prior (EBP)
– Nonparametric prior (EBNP)
- Median Polish
– Overall (MP)
– Within each batch (MPB)
- ANOVA
– Naïve ANOVA (AN)
– With variance shrinkage (WAN)
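Of the methods listed, median polish is the easiest to show compactly. The slides do not say how it is applied to the expression matrix for MP or MPB, so the sketch below shows only Tukey's generic median polish on a small matrix: it alternately removes row and column medians until the matrix is split into an overall level, row effects, column effects, and residuals. The function name and toy matrix are illustrative assumptions.

```python
import numpy as np

def median_polish(x, max_iter=10, tol=1e-8):
    """Tukey median polish: x = overall + row[:, None] + col[None, :] + resid."""
    resid = np.array(x, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])
    col = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        row += row_med
        shift = np.median(col)      # keep column effects centered at median 0
        col -= shift
        overall += shift
        # Sweep column medians out of the residuals into the column effects.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        col += col_med
        shift = np.median(row)      # keep row effects centered at median 0
        row -= shift
        overall += shift
        if max(np.abs(row_med).max(), np.abs(col_med).max()) < tol:
            break
    return overall, row, col, resid

# Toy matrix (assumption): purely additive row and column effects, no noise.
row_eff = np.array([0.0, 1.0, -1.0])
col_eff = np.array([0.0, 2.0, -2.0, 4.0])
x = 5.0 + row_eff[:, None] + col_eff[None, :]

overall, row, col, resid = median_polish(x)
print("largest residual:", np.abs(resid).max())
```

For MPB, one plausible reading is that the fit is run separately on each batch's samples, but that is a guess rather than something the slides state.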
Batch effect adjustment
GBM:Affy
GBM:Agilent
GBM:Exon data
Effect of batch effects adjustment on gene expression
[Figure: cumulative probability (y, 0.0 to 1.0) against p-values (x, 0.0 to 1.0); curves: U, EBP, EBNP, MP, MPB, AN, WAN]
Assessment of batch effects with adjustments
[Figure: CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data, batch 9 and 11-15; curve: unadjusted data; reference line at p-value = 0.05; axes: p-values (x), cumulative distribution function (y)]
Assessment of batch effects with adjustments
[Figure: CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data; curves: unadjusted data and MPB adjusted data; reference line at p-value = 0.05; axes: p-values (x), cumulative distribution function (y)]
Assessment of batch effects with adjustments:OV
[Figure: CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data, batch 9 and 11 to 15; curves: MPB adjusted data and unadjusted data; reference line at p-value = 0.05; axes: p-values (x), cumulative distribution function (y)]
Association of Clinical Outcomes with Batches
Overall survival by batch (TCGA ovarian cancer data)
[Figure: two survival panels, each labeled Batch 9; reported p-values: P<0.001 and P=0.018]
Implications
- Assessments based on the Correlation of Correlations (CR) parameter can be used to identify batch effects in TCGA data. This is complemented by principal component analysis and hierarchical clustering.
- Batch effects exist in TCGA GBM and ovarian cancer data
- Be cautious when adjusting for batch effects.
– The batch differences may be technical or biological
- We do not want to correct biological differences
- We do want to correct technical differences (bias)
– Some methods may over-massage the data
- The impact of batch effects on clinical predictions from the data remains to be determined.