Application of Survival and Multivariate Methods to Gene Expression Data from Two Affymetrix datasets
Linda Warnock, Statistical Sciences UK
Richard Stephens, Transcriptome Analysis UK
Jo Ann Coleman, Statistical Sciences US
2
Outline
- Description of the datasets
- Pre-processing and exploration of the data
- Methods
- Results
- Conclusions
- Recommendations
- References
3
The data
- Harvard
U95 Affymetrix chip type
124 lung adenocarcinoma samples with clinical data
71 females, 53 males
76 with stage I tumour, 48 with stage II - IV tumour
12,625 probe sets
- Michigan
HuGeneFL Affymetrix chip type
86 lung adenocarcinoma samples with clinical data
51 females, 35 males
67 with stage I tumour, 19 with stage III tumour
7,129 probe sets
4
Pre-processing
- Normalisation
.CEL files were loaded into DChip and normalised with the available algorithms using the Perfect Match (PM) data
- Quality Control
Poor quality chips were excluded from the analysis if:
- the percentage of spot outliers > 3 (DChip)
- 3 prime to 5 prime ratio > 3 (MAS5)
- PCA of several chip metrics (e.g. background, overall brightness) identified technical bias in data production
- Final number of samples/ chips in analysis
Harvard: 114 (10 excluded)
Michigan: 70 (16 excluded)
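The two numeric exclusion rules above can be sketched as a simple filter. This is a minimal illustration rather than the authors' pipeline: the `Chip` record, its field names and the example values are hypothetical, and only the two thresholds (more than 3% spot outliers, 3 prime to 5 prime ratio above 3) come from the slide. The PCA-based check is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    # Hypothetical record; field names are illustrative, not dChip/MAS5 output names.
    name: str
    pct_outliers: float   # percentage of spot outliers (from dChip)
    ratio_3p_5p: float    # 3 prime to 5 prime ratio (from MAS5)


def passes_qc(chip: Chip, max_outliers: float = 3.0, max_ratio: float = 3.0) -> bool:
    """Keep a chip only if it stays within both thresholds quoted on the slide."""
    return chip.pct_outliers <= max_outliers and chip.ratio_3p_5p <= max_ratio


# Made-up values for illustration only.
chips = [
    Chip("chip_A", 1.2, 1.1),   # passes both checks
    Chip("chip_B", 4.5, 1.3),   # too many spot outliers
    Chip("chip_C", 0.8, 3.6),   # 3 prime to 5 prime ratio too high
]
kept = [c.name for c in chips if passes_qc(c)]
print(kept)
```

Keeping the thresholds as parameters makes it easy to check how sensitive the final chip count is to the cut-offs.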
5
Aim of analysis and Methods Used
- Aim
Combine data across chip types and identify genes associated with prolonged survival
- Methods
Quality Control Metrics
Affymetrix array comparison spreadsheets to match probe_set ids from Harvard with Michigan (resulted in 6013 genes in common)
Survival plots
PCA plots
Cox PH regression modelling
Meta analysis: Fisher’s Chi-squared method
Volcano plots
6
Harvard chip metric data PCA (190 chips) Coloured by IVT batch
Legend: IVT Batch No. 3
Chip metrics: Background, GAPDH, B-actin, RawQ, Number Present, Average Signal, St.Dev. Backgrnd
7
Harvard chip metric data PCA (190 chips) Coloured by IVT batch
Chip metrics: Background, GAPDH, B-actin, RawQ, Number Present, Average Signal, St.Dev. Backgrnd
8
Harvard Expression Data
Legend (sample type): COID, Ad, NL, SCLC, SQ
PCA, 190 chips
Adenocarcinoma samples, coloured by IVT batch
Batch 3 PCA, 132 chips
9
Survival Plots with strata: sex and tumour stage
Michigan strata: F, III; M, III; F, I; M, I
Harvard strata: F, I; M, I; M, II+; F, II+
** Stage is a predictor for survival
10
Cox PH regression and Meta Analysis of the Clinical Data
Variable  Harv Chi-Sq  Harv P-value  Mich Chi-Sq  Mich P-value  Meta Critical point  Meta P-value
Stage     15.98        <0.0001       25.15        <0.0001       48.2                 <0.0001
Sex       1.63         0.2015        1.96         0.1623        6.84                 0.1446
Age       0.76         0.382         2.87         0.0904        6.73                 0.1508
11
Exploration of the gene expression data (averaged over samples)
Michigan: Mean = 3.1, SD = 0.38
Harvard: Mean = 2.4, SD = 0.50
12
PCA on raw expression data
PCA on all the common genes (score plot): t[1] - 89% of variation, t[2] - 1% of variation
Classes: Class 1, Class 2 (Harvard, Michigan)
Loadings plot for all common genes: axes p[1], p[2]
13
Meta Analysis - combining p-values
- Inverse Chi-Square method (Fisher, 1932)
- Under the null hypothesis p-values have a uniform distribution on (0, 1)
- … so -2 ln(p) has a chi-square distribution with 2 degrees of freedom
- … and, for independent p-values p1 and p2, -2 ln(p1 p2) has a chi-square distribution with 4 degrees of freedom
- A new p-value is created for every gene by combining the p-value from Harvard with the p-value from Michigan
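The combination step can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the software used for the analysis; the function name `fisher_combine` is hypothetical. It relies on the closed form of the chi-square tail probability for an even number of degrees of freedom, and uses the natural logarithm. The example values are the Sex row of the clinical-data table (Harvard p = 0.2015, Michigan p = 0.1623).

```python
import math


def fisher_combine(p_values):
    """Return (statistic, combined p-value) for independent p-values."""
    k = len(p_values)
    # Fisher's statistic: -2 * sum of natural logs; chi-square with 2k df under H0.
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # For even df = 2k the chi-square upper tail has a closed form:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total


# Sex row of the clinical-data table: Harvard p = 0.2015, Michigan p = 0.1623
stat, p_meta = fisher_combine([0.2015, 0.1623])
print(round(stat, 2), round(p_meta, 4))  # 6.84 0.1446
```

The result agrees with the "Meta Critical point" and "Meta P-value" columns reported for Sex, which suggests those columns are the Fisher statistic and its 4 degrees of freedom p-value.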
14
Cox Proportional Hazards Model
- Model the log hazard function against the covariates
$\log h(t; x) = \log h_0(t) + b^{T} x$, equivalently $h(t; x) = h_0(t) \exp(b^{T} x)$
where $b$ is the vector of covariate parameter estimates, $h_0(t)$ is the baseline hazard and $x$ represents the data
- The exponential of the parameter estimate for gene expression is the hazard ratio: the factor by which the hazard is multiplied for every unit increase in log10 expression, i.e. for every 10-fold increase in expression
15
Interpretation of the hazard parameter estimate
Figure: increase in hazard for every 10-fold increase in gene expression (x axis: gene expression, 10 to 1E7; y axis: hazard, 200 to 1200)
Parameter estimate = 2: exp(2) = 7.4, so the hazard is multiplied by about 7 for each 10-fold increase in gene expression
Parameter estimate = 5: exp(5) = 148.4, so the hazard is multiplied by about 148 for each 10-fold increase in gene expression
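The arithmetic behind this interpretation can be checked with a short sketch. It assumes, as the slide implies, that the covariate is log10 expression, so that one unit corresponds to a 10-fold change; the function name `hazard_ratio` is hypothetical.

```python
import math


def hazard_ratio(beta: float, fold_change: float = 10.0) -> float:
    """Factor by which the hazard is multiplied for a given fold change in
    expression, assuming the covariate is log10(expression) with Cox
    coefficient beta."""
    return math.exp(beta * math.log10(fold_change))


for beta in (2.0, 5.0):
    print(f"beta = {beta}: hazard ratio per 10-fold increase = {hazard_ratio(beta):.1f}")
```

Because the effect is multiplicative, a 100-fold increase in expression with a parameter estimate of 2 multiplies the hazard by exp(4), not by 2 x exp(2).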
16
Volcano plots
Cut-off P<0.05: 241 genes selected
17
Agreement between hazard estimates
18
Genes selected from Cox Analysis
- Interpretation
These genes show a significant association with survival after taking the factors of stage, sex and age into account
These are genes which will increase or decrease the chances of survival regardless of the stage of the tumour
19
PCA Score plot of genes selected from survival analysis
Scatter plot of scores (axis M2.t[1]); samples labelled survivor / non-survivor and short survival / long survival
20
List of genes and association with Survival
(FDR = False Discovery Rate)
probe_set   meta p-value  FDR p-value  survival association  Hazard Harv  Hazard Mich  gene name
34777_at    <0.0001       0.0015       Neg                   1.00         5.67         adrenomedullin
40507_at    <0.0001       0.0077       Neg                   8.07         10.34        solute carrier family 2 (facilitated glucose transporter), member 1
1649_at     0.0001        0.0656       Neg                   5.68         22.78        chromosome 20 open reading frame 16
32300_s_at  0.0002        0.1582       Neg                   5.23         13.99        tyrosine hydroxylase
38544_at    0.0003        0.2286       Neg                   2.10         4.99         inhibin, alpha
1269_at     0.0005        0.3053       Pos                   -2.71        -7.04        phosphoinositide-3-kinase, regulatory subunit, polypeptide 1 (p85 alpha)
35693_at    0.0006        0.3195       Neg                   2.48         16.87        hippocalcin-like 1
36133_at    0.0007        0.3428       Neg                   0.46         8.33         desmoplakin (DPI, DPII)
32593_at    0.0009        0.3680       Pos                   -0.17        -10.60       KIAA0084 protein
1904_at     0.0012        0.4366       Neg                   5.36         12.27        c-myc binding protein
1212_at     0.0013        0.4673       Neg                   7.37         14.03        glutathione transferase zeta 1 (maleylacetoacetate isomerase)
31488_s_at  0.0015        0.4690       Neg                   2.08         6.13         Phosphoglycerate kinase {alternatively spliced} [human, phosphoglycerate kinase deficient patient with episodes of muscl, mRNA Partial Mutant, 307 nt]
1317_at     0.0016        0.4690       Neg                   0.13         10.42        macrophage stimulating 1 receptor (c-met-related tyrosine kinase)
41096_at    0.0020        0.4690       Neg                   0.36         1.66         S100 calcium binding protein A8 (calgranulin A)
37026_at    0.0022        0.4690       Neg                   0.38         5.86         core promoter element binding protein
40657_r_at  0.0026        0.4690       Pos                   -3.83        -8.19        adipose most abundant gene transcript 1
21
Papers that the genes have appeared in
- Adrenomedullin
Adrenomedullin functions as an important tumor survival factor in human carcinogenesis. Microsc Res Tech. 2002 Apr 15;57(2):110-9.
- Solute carrier family 2, member 1
Kalir T, Wang BY, Goldfischer M, Haber RS, Reder I, Demopoulos R, Cohen CJ, Burstein DE. Immunohistochemical staining of GLUT1 in benign, borderline, and malignant ovarian epithelia. Cancer. 2002 Feb 15;94(4):1078-82.
- Tyrosine hydroxylase
Lambooy LH, Gidding CE, van den Heuvel LP, Hulsbergen-van de Kaa CA, Ligtenberg M, Bokkerink JP, De Abreu RA. Real-time analysis of tyrosine hydroxylase gene expression: a sensitive and semiquantitative marker for minimal residual disease detection of neuroblastoma. Clin Cancer Res. 2003 Feb;9(2):812-9. PMID: 12576454
- Inhibin
Loss of the expression and localization of inhibin alpha-subunit in high grade prostate cancer. J Clin Endocrinol Metab. 1998 Mar;83(3):969-75.
- PIK3R1
Evidence that phosphatidylinositol 3-kinase- and mitogen-activated protein kinase kinase-4/c-Jun NH2-terminal kinase-dependent pathways cooperate to maintain lung cancer cell survival. J Biol Chem. 2003 Jun 27;278(26):23630-8. Epub 2003 Apr 24.
- Desmoplakin
Young GD, Winokur TS, Cerfolio RJ, Van Tine BA, Chow LT, Okoh V, Garver RI Jr. Differential expression and biodistribution of cytokeratin 18 and desmoplakins in non-small cell lung carcinoma subtypes. Lung Cancer. 2002 May;36(2):133-41.
22
Alternative analysis approach
- Stage has a large effect on survival with stage I
having better survival prospects
- If gene expression can be correlated with tumour
stage then the genes identified can be used as targets for new medications
- Perform ANOVA with gene expression as the
response and stage, age and sex as covariates
23
Volcano plots for ANOVA analysis
Cut-offs: P < 0.05 and 1.5 fold change; 43 genes selected
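The selection behind a volcano plot with these two cut-offs can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name, the dictionary layout and all example values are hypothetical, and treating a 1.5-fold increase and a 1.5-fold decrease symmetrically is an assumption.

```python
import math


def select_genes(results, p_cutoff=0.05, fold_cutoff=1.5):
    """Return probe sets passing both volcano-plot cut-offs.

    `results` maps a probe-set id to (p_value, fold_change), where fold_change
    is a ratio such as stage I mean / later-stage mean (hypothetical layout).
    """
    min_abs_log2 = math.log2(fold_cutoff)
    selected = []
    for probe, (p_value, fold_change) in results.items():
        # Treat 1.5-fold up and 1.5-fold down symmetrically on the log2 scale.
        if p_value < p_cutoff and abs(math.log2(fold_change)) >= min_abs_log2:
            selected.append(probe)
    return selected


# Made-up values for illustration only.
example = {
    "probe_1": (0.001, 2.0),   # significant, 2-fold up
    "probe_2": (0.001, 1.2),   # significant, but small fold change
    "probe_3": (0.200, 3.0),   # large change, not significant
    "probe_4": (0.010, 0.5),   # significant, 2-fold down
}
print(select_genes(example))  # ['probe_1', 'probe_4']
```

Requiring both conditions is what removes genes such as `probe_2`, whose difference is statistically significant but small in magnitude.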
24
Genes selected from ANOVA
- Interpretation
These genes show a significant difference between Stage I and later stages and hence are indicators for survival
If gene expression is higher in stage I then the gene is positively associated with survival
CAVEAT: the analysis is detecting small fold-changes as being significant, so it is questionable whether any genes are truly interesting to an oncologist
25
List of genes and association with Survival
(FDR = False Discovery Rate)
probe_set   meta p-value  FDR p-value  gene
31870_at    <0.0001       <0.0001      CD37 antigen
1288_s_at   <0.0001       0.0003       J04617 /FEATURE=cds /DEFINITION=HUMEF1A Human elongation factor EF-1-alpha gene, complete cds
31962_at    <0.0001       0.0006       ribosomal protein L37a
32466_at    <0.0001       0.0011       ribosomal protein L41
36792_at    <0.0001       0.0013       tropomyosin 1 (alpha)
37892_at    <0.0001       0.0016       collagen, type XI, alpha 1
1385_at     <0.0001       0.0027       transforming growth factor, beta-induced, 68kDa
38111_at    <0.0001       0.0028       chondroitin sulfate proteoglycan 2 (versican)
1237_at     <0.0001       0.0028       immediate early response 3
1179_at     <0.0001       0.0031       Heat Shock Protein, 70 Kda
31775_at    <0.0001       0.0039       Cluster Incl. X65018:H.sapiens mRNA for lung surfactant protein D /cds=(171,1298) /gb=X65018 /gi=34766 /ug=Hs.153415 /len=1400
32305_at    <0.0001       0.0041       collagen, type I, alpha 2
34760_at    <0.0001       0.0043       KIAA0022 gene product
658_at      <0.0001       0.0043       thrombospondin 2
26
Conclusions
- Tumour Stage has a large effect on survival
- Meta analysis provides a solid approach to combining results from different data sets. It can be applied to completely different data types, e.g. Affymetrix chip results could be combined with glass slide results for the same hypothesis
- Only a small set of genes show a significant association with survival; however, there are many genes with very large changes in hazard
- Stage of tumour can be used as a surrogate for survival; however, the magnitude of expression change is small
- Interesting genes have been selected, some of which have appeared in the literature before.
27
Recommendations
- Technology
Use quality control metrics to identify poor chips and redo these chips or correct the source of the problem
- Oncology
Define a severity score which includes other factors as well as Stage, such as differentiation and size of tumour. Use this as a continuous explanatory variable in the survival model
The more clinical information the better. Data such as treatment, time of prognosis, date of operation, economic status, race plus many others may all affect survival.
- Statistical Analysis
Do permutation tests to obtain more robust p-values (as one cannot look for outliers on every gene)
Identify a method for combining estimates of effects across datasets
Repeat the analysis separately for each Stage, or include a Stage by expression interaction in the survival model
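The permutation idea in the first recommendation can be sketched generically. This is only an illustration of the principle, under loud assumptions: a difference in mean expression between two groups stands in for the per-gene Cox statistic, and the function name, parameters and example values are all hypothetical. For the survival model itself, the same scheme would permute the survival outcomes across patients and refit the model each time.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    hits = 0
    for _ in range(n_perm):
        # Shuffle the pooled values to break any link between value and group.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Made-up expression values for two clearly separated groups.
p_separated = permutation_p_value([1, 2, 3, 4, 5, 6], [101, 102, 103, 104, 105, 106])
print(p_separated)
```

Because the reference distribution is built from the data themselves, a single outlying sample has less influence on the p-value than it would under a parametric assumption.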
28
References
- Beer, D.G., et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine 8, 816-824 (2002)
- Bhattacharjee, A., et al. Classification of human lung carcinomas by mRNA expression
profiling reveals distinct adenocarcinoma subclasses. PNAS 98, 13790-13795 (2001)
- Wolfinger, R.D. et al., J. Computational Biol., 8 625-638 (2001)
- Fisher, R.A. Statistical Methods for Research Workers (4th Edition). London: Oliver and Boyd, 1932
- Rhodes, D.R., Barrette, T.R., Rubin, M.A., Ghosh, D., Chinnaiyan, A.M. Meta-Analysis of Microarrays: Interstudy Validation of Gene Expression Profiles Reveals Pathway Dysregulation in Prostate Cancer. Cancer Research 62, 4427-4433 (2002)
- Hedges, L.V., Olkin, I. Statistical Methods for Meta-Analysis, Academic Press, 1985
- Speed, T. Statistical Analysis of Gene Expression Microarray Data, Chapman and
Hall/CRC, 2003
- Allison, P.D., Survival Analysis Using the SAS System: A Practical Guide, Cary, NC: SAS
Institute Inc., 1995
- https://www.affymetrix.com/support/technical/comparison_spreadsheets.affx
29
Acknowledgements
- Priti Hegde, GSK, Transcriptome Analysis US
- Lini Pandite, GSK, Oncology, Discovery Medicine US
- Robert Gagnon, GSK, Statistical Sciences US
- David Wille, GSK, Statistical Sciences UK