Missing data and net survival analysis, Bernard Rachet

Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics
Warwick, 27-29 July 2015
Missing data and net survival analysis
Bernard Rachet

General context
Population-based, routine data:
• Cancer registry data
• Clinical data – tumour, treatment, comorbidity
Cancer survival and the roles played by patient, tumour and health-care factors
(Very) large data sets, but with incomplete information, which we have handled using a multiple imputation procedure with Rubin’s rules

Preliminary results of on-going work

Multiple imputation procedure
Under the Missing At Random (MAR) assumption:
1. Impute the missing data from $f(\mathbf{Y}_M \mid \mathbf{Y}_O)$ to give K ‘complete’ data sets
2. Fit the substantive model to each of the K data sets, to obtain K estimates of the parameters and estimates of their variance
3. Combine them using Rubin’s rules

Multiple imputation steps
Incomplete data → (Imputation) → K completed data sets → (Analysis) → K analysis results → (Pooling) → Final results

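Pooling K estimates – Rubin’s rules
Given K completed data sets, there are:
K estimates $\hat{\theta}_k$, $k = 1, \ldots, K$
with variance $\hat{\sigma}^2_k$, $k = 1, \ldots, K$
Pooled estimate:
$$\hat{\theta}_{MI} = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}_k$$
Total variance:
$$\hat{V}_{MI} = \hat{W} + \left(1 + \frac{1}{K}\right) \hat{B}$$
Within-imputation variance:
$$\hat{W} = \frac{1}{K} \sum_{k=1}^{K} \hat{\sigma}^2_k$$
Between-imputation variance:
$$\hat{B} = \frac{1}{K-1} \sum_{k=1}^{K} \left(\hat{\theta}_k - \hat{\theta}_{MI}\right)^2$$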
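Rubin's rules can be sketched in a few lines of Python. This is only an illustration: the function name `pool_rubin` and the three example estimates and variances are hypothetical and do not come from the presentation.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool K point estimates and their variances with Rubin's rules."""
    k = len(estimates)
    if k < 2 or k != len(variances):
        raise ValueError("need K >= 2 estimates with matching variances")
    pooled = mean(estimates)        # pooled estimate: average of the K estimates
    within = mean(variances)        # within-imputation variance W
    between = variance(estimates)   # between-imputation variance B (divisor K - 1)
    total = within + (1 + 1 / k) * between
    return pooled, within, between, total

# Hypothetical log excess hazard ratios and variances from K = 3 imputed data sets
pooled, within, between, total = pool_rubin([1.0, 1.2, 1.4], [0.04, 0.05, 0.06])
```

The total variance exceeds the average within-imputation variance whenever the K estimates differ, which is how the uncertainty due to the missing data enters the final standard error.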

Multiple imputation procedure
Congeniality
1. Imputation model congenial with substantive model
2. Given the substantive model $f(\mathbf{Y} \mid \mathbf{X})$, $f(\mathbf{Y} \mid \mathbf{X})\, g(\mathbf{X})$ is a congenial imputation model if both $f$ and $g$ are correctly specified
3. Valid inference (under MAR) if $f(\mathbf{Y} \mid \mathbf{X})\, g(\mathbf{X})$ (approximately) represents the data structure and the substantive model

Concepts and measures of interest
Aims: prognosis of a cancer and impact at population level
Concepts:
• Excess hazard
• Excess hazard ratio
• Net survival
• Crude probabilities of death from cancer and other causes
Relative survival data setting:
• Population-based data
• Expected mortality hazard from life tables
• By single year of age and sex, and calendar year, geography, deprivation

Nur et al, 2009 - Settings
Population-based cohort of colorectal cancer patients
Complete information on age, sex, follow-up time, vital status, deprivation, comorbidity, surgical treatment
Tumour stage, morphology and grade: 45% incomplete data
Relative survival data setting:
$$\lambda(x) = \lambda_P(x) + \exp(x\beta)$$
Substantive model: generalised linear model (Dickman et al, Stat Med 2005)
$$\log(\mu_j - d_{Pj}) = \log(y_j) + x\beta$$
The left-hand side is the link function and $\log(y_j)$ is the offset, with
$d_j \sim \text{Poisson}(\mu_j)$; $\mu_j = \lambda_j y_j$; $y_j$ person-time at risk
$d_{Pj}$ expected number of deaths – life tables
Excess hazard ratio (+ Ederer-2 relative survival)
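As a rough illustration of the generalised linear model above, the sketch below computes the Poisson mean implied by the link function (expected population deaths plus person-time multiplied by the excess hazard) and the corresponding log-likelihood contribution for one interval. The function names and all numerical values are hypothetical, and the code does not fit the model.

```python
import math

def expected_deaths(person_time, expected_pop_deaths, x, beta):
    """Poisson mean mu_j implied by log(mu_j - d_Pj) = log(y_j) + x beta."""
    linear_predictor = sum(xi * bi for xi, bi in zip(x, beta))
    excess_hazard = math.exp(linear_predictor)
    return expected_pop_deaths + person_time * excess_hazard

def poisson_loglik(deaths, mu):
    """Log-likelihood contribution of one interval with d_j ~ Poisson(mu_j)."""
    return deaths * math.log(mu) - mu - math.lgamma(deaths + 1)

# Hypothetical interval: 100 person-years, 2 deaths expected from life tables,
# baseline excess hazard 0.05 per person-year, excess hazard ratio 2 for the exposed
beta = [math.log(0.05), math.log(2.0)]
mu_exposed = expected_deaths(100.0, 2.0, [1.0, 1.0], beta)
mu_unexposed = expected_deaths(100.0, 2.0, [1.0, 0.0], beta)
excess_hazard_ratio = math.exp(beta[1])
loglik = poisson_loglik(10, mu_exposed)
```

Because the expected deaths enter additively, the exponentiated coefficients are ratios of excess hazards rather than of overall hazards.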

Data description
Missing information associated with:
• Older ages
• More deprived categories
• Less treatment with curative intent
• Higher probability of death

Variable     Category               Patients
                                    No.       %
All                                 29 563    100.0
Stage        I                      2 193     12.3
             II                     7 326     41.0
             III                    7 726     43.2
             IV                     643       3.6
             Missing                11 684    (39.5)
Morphology   Adenocarcinoma         23 693    90.7
             Mucinous and serous    2 314     8.9
             Other                  128       0.5
             Neoplasm, NOS¹         3 428     (11.6)
Grade        I                      3 212     14.5
             II                     16 047    72.4
             III/IV                 2 907     13.1
             Missing                7 397     (25.0)

Missing information in several variables
Multiple imputation using Full Conditional Specification (chained equations – van Buuren, 1999)
Same basic assumptions as in multiple imputation
Assumes a joint (multivariate) distribution exists without specifying its form:
$$f(Y_{i,1}, Y_{i,2}, \ldots, Y_{i,p}) = f(Y_{i,p} \mid Y_{i,1}, \ldots, Y_{i,p-1})\, f(Y_{i,p-1} \mid Y_{i,1}, \ldots, Y_{i,p-2}) \cdots f(Y_{i,2} \mid Y_{i,1})\, f(Y_{i,1})$$
Imputation model (joint model for the data): $\mathbf{Y} \sim N(\boldsymbol{\beta}, \boldsymbol{\Omega})$
Gibbs sampler to:
1. Estimate the parameters in the joint imputation model
2. Impute the missing data
Multivariate problem split into a series of univariate problems

Imputation models
Outcomes:
• Ordinal regression for stage and grade
• Polytomous regression for morphology
Covariables:
• The other two covariables with incomplete information
• Sex, age, deprivation, comorbidity, treatment, cancer site
• Vital status
• Follow-up time (years): piecewise function (0, 0.5, 1, 2, 3, 4, 5, 5+)
• Time-dependent effects (categorical) for deprivation and age
Substantive (excess hazard) model includes all these variables, with (binary) time-dependent effects

Results
Missing information associated with:
• Older ages
• More deprived categories
• Less treatment with curative intent
• Higher probability of death

Variable     Category               Patients              Data after imputation
                                    No.       %           %
All                                 29 563    100.0
Stage        I                      2 193     12.3        10.1
             II                     7 326     41.0        36.1
             III                    7 726     43.2        47.4
             IV                     643       3.6         6.2
             Missing                11 684    (39.5)
Morphology   Adenocarcinoma         23 693    90.7        90.5
             Mucinous and serous    2 314     8.9         8.9
             Other                  128       0.5         0.5
             Neoplasm, NOS¹         3 428     (11.6)
Grade        I                      3 212     14.5        13.6
             II                     16 047    72.4        72.0
             III/IV                 2 907     13.1        14.4
             Missing                7 397     (25.0)

Results
Excess hazard ratios (EHR) with 95% CI, by period since diagnosis over which the EHR was estimated

Stage, five years**
Stage      Complete-case analysis (16 223 cases)   Multiple imputation (29 563 cases)
I          1.0                                     1.0
II         3.6 (2.7-4.7)                           2.6 (2.2-3.0)
III        10.2 (7.7-13.5)                         7.0 (5.9-8.4)
IV         26.4 (19.6-35.5)                        16.5 (13.8-19.8)
Missing

Age, first year and second to fifth years
Age (years)   Complete-case analysis (16 223 cases)   Multiple imputation (29 563 cases)
              First year       Second to fifth        First year       Second to fifth
15 to 44      1.0              1.0                    1.0              1.0
45 to 54      1.1 (0.8-1.5)    1.3 (1.0-1.6)          1.3 (1.0-1.6)    1.3 (1.1-1.5)
55 to 64      1.4 (1.0-1.9)    1.2 (1.0-1.5)          1.7 (1.4-2.1)    1.3 (1.1-1.5)
65 to 74      2.0 (1.5-2.7)    1.2 (1.0-1.5)          2.4 (2.0-2.9)    1.3 (1.1-1.6)
75 to 84      2.7 (2.0-3.7)    1.1 (0.9-1.4)          3.6 (2.9-4.3)    1.4 (1.2-1.6)
85 to 99      4.0 (2.9-5.5)    0.9 (0.7-1.3)          5.4 (4.4-6.6)    1.5 (1.2-1.9)

Other results – Indicator approach
• Systematically underestimates variance of EHRs
• Overestimates EHRs for tumour morphology
• Underestimates EHRs for age and deprivation
• Does not identify time-dependent effects

Stage-specific survival
[Figure: two panels, "Before imputation" (stages I, II, III, IV and missing) and "After imputation" (stages I, II, III, IV); x-axis: years since diagnosis (0 to 5); y-axis scale: 0 to 100]

Limitations
Tutorial paper – no systematic evaluation
Relatively simple substantive model:
• piecewise model
• categorical variables
Further recent methodological developments in:
• multiple imputation
• net survival, flexible modelling
More systematic evaluation – simulations

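Concepts and measures of interest
Excess hazard:
$$\lambda_E(t) = \lambda_O(t) - \lambda_P(t)$$
$$\lambda_O(t)\,dt = \frac{dN^W(t)}{Y^W(t)}; \qquad \lambda_P(t)\,dt = \frac{\sum_{i=1}^{n} Y_i^W(t)\,\lambda_{Pi}(t)\,dt}{Y^W(t)}; \qquad W_i(t) = \frac{1}{S_{Pi}(t)}$$
where $S_{Pi}(t)$ is the expected probability of surviving up to $t$
Net survival:
$$S_E(t) = \exp\left(-\int_0^t \lambda_E(u)\,du\right)$$
Crude mortality:
$$F_C(t) = \int_0^t S_O(u^-)\,\lambda_E(u)\,du$$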
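To make the difference between net survival and the crude probability of death from cancer concrete, here is a minimal numerical sketch. It assumes that the overall hazard is the sum of the excess and expected hazards, uses simple trapezoidal integration, and takes hypothetical constant hazards (0.20 and 0.05 per year); the function names are illustrative only.

```python
import math

def integrate(f, upper, steps=400):
    """Trapezoidal approximation of the integral of f over [0, upper]."""
    h = upper / steps
    total = 0.5 * (f(0.0) + f(upper))
    for i in range(1, steps):
        total += f(i * h)
    return total * h

def net_survival(excess_hazard, t):
    """S_E(t) = exp(-integral of the excess hazard from 0 to t)."""
    return math.exp(-integrate(excess_hazard, t))

def overall_survival(excess_hazard, expected_hazard, t):
    """S_O(t), assuming overall hazard = excess hazard + expected hazard."""
    return math.exp(-integrate(lambda u: excess_hazard(u) + expected_hazard(u), t))

def crude_cancer_death(excess_hazard, expected_hazard, t):
    """F_C(t) = integral of S_O(u) * excess hazard(u) from 0 to t."""
    return integrate(
        lambda u: overall_survival(excess_hazard, expected_hazard, u) * excess_hazard(u),
        t,
    )

# Hypothetical constant hazards (per year)
lam_e = lambda u: 0.20   # excess (cancer) hazard
lam_p = lambda u: 0.05   # expected (other causes) hazard
net_5y = net_survival(lam_e, 5.0)
crude_5y = crude_cancer_death(lam_e, lam_p, 5.0)
```

In this toy setting the crude probability of dying from cancer within five years is smaller than one minus the net survival, because some patients die from other causes first.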

Modelling approach
Flexible multivariable excess hazard model:
• Excess hazard
• Time-dependent and non-linear effects (splines)
• Variables affecting both mortality processes (cancer and other causes of death) included in the model
Net survival is the mean of the individual net survival functions predicted by the model

Multiple imputation procedure
Congeniality
1. Imputation model congenial with substantive model
2. Given the substantive model $f(\mathbf{Y} \mid \mathbf{X})$, $f(\mathbf{Y} \mid \mathbf{X})\, g(\mathbf{X})$ is a congenial imputation model if both $f$ and $g$ are correctly specified
3. Valid inference (under MAR) if $f(\mathbf{Y} \mid \mathbf{X})\, g(\mathbf{X})$ (approximately) represents the data structure and the substantive model
4. Problematic within the net survival setting and with non-linear and time-dependent effects

Falcaro et al, 2015 – Study settings
Data: 44,461 men diagnosed with a colorectal cancer in 1998-2006, followed up to 2009
Age at diagnosis (continuous), tumour stage (4 categories), deprivation (5 categories)
Missing stage: 30% ($R = 1$ if stage missing)
MCAR: $\text{logit}\,\Pr(R_i = 1 \mid \mathbf{z}_i) = \delta_0$
MAR on X: $\text{logit}\,\Pr(R_i = 1 \mid \mathbf{z}_i) = \alpha_0 + \alpha_1(\text{age}_i - 60)$
MAR: $\text{logit}\,\Pr(R_i = 1 \mid \mathbf{z}_i) = \gamma_0 + \gamma_1(\text{age}_i - 60) + \gamma_2 T_i + \gamma_3 D_i$
100 simulated data sets per scenario
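The missingness mechanisms above can be mimicked with a short simulation. The sketch below covers only the mechanism in which the logit of the probability of missing stage is linear in age centred at 60. The coefficient values, the age range and the function names are hypothetical choices for illustration and are not those used in the study.

```python
import math
import random

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def missing_prob(age, intercept, slope):
    """Pr(R = 1 | age) with logit Pr(R = 1) = intercept + slope * (age - 60)."""
    return expit(intercept + slope * (age - 60))

def impose_missing_stage(records, intercept, slope, rng):
    """Return copies of (age, stage) records with stage set to None when R = 1."""
    result = []
    for age, stage in records:
        is_missing = rng.random() < missing_prob(age, intercept, slope)
        result.append((age, None if is_missing else stage))
    return result

rng = random.Random(2015)
# Hypothetical complete data: ages 40 to 90 and stages 1 to 4
complete = [(rng.uniform(40, 90), rng.randint(1, 4)) for _ in range(5000)]
intercept = math.log(0.3 / 0.7)   # 30% missing at age 60 (assumed value)
slope = 0.03                      # assumed increase in log-odds per year of age
incomplete = impose_missing_stage(complete, intercept, slope, rng)
proportion_missing = sum(stage is None for _, stage in incomplete) / len(incomplete)
```

Setting `slope` to 0 gives the same probability of missingness at every age, which corresponds to the MCAR scenario.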

Distribution on fully observed data and empirical expected distribution in remaining complete records

Substantive model
Flexible log cumulative excess hazard model:
$$\ln \Lambda_E(t \mid \mathbf{x}_i) = s_1(\ln t; \boldsymbol{\gamma}_1, \mathbf{k}_1) + \boldsymbol{\beta}'\mathbf{x}_i + s_2(\text{age}_i; \boldsymbol{\gamma}_2, \mathbf{k}_2)$$
Flexible functions: restricted cubic splines
• Baseline excess hazard: 5 df, 4 internal knots and 2 boundary knots
• Age (continuous): 3 df, 2 internal knots
Covariables: deprivation and stage
Aims: estimate effect of stage (log EHR) and stage-specific net survival at 1, 5 and 10 years since diagnosis
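The link between knots and degrees of freedom can be illustrated with one common restricted cubic spline basis: a linear term plus one term per internal knot, constrained to be linear beyond the boundary knots. Whether the study used exactly this parameterisation is an assumption, and the knot positions below are hypothetical.

```python
def rcs_basis(x, internal_knots, boundary_knots):
    """Restricted cubic spline basis: one linear term plus one term per internal knot."""
    k_min, k_max = boundary_knots

    def cube_plus(u):
        # Truncated cubic: u^3 when u > 0, otherwise 0
        return max(u, 0.0) ** 3

    basis = [x]  # linear term
    for knot in internal_knots:
        weight = (k_max - knot) / (k_max - k_min)
        basis.append(
            cube_plus(x - knot)
            - weight * cube_plus(x - k_min)
            - (1 - weight) * cube_plus(x - k_max)
        )
    return basis

# Hypothetical knots on the log-time scale: 4 internal knots give 5 basis terms (5 df)
baseline_terms = rcs_basis(0.5, [-1.0, 0.0, 1.0, 2.0], (-3.0, 3.0))
# Hypothetical knots on the age scale: 2 internal knots give 3 basis terms (3 df)
age_terms = rcs_basis(65.0, [55.0, 75.0], (30.0, 95.0))
```

With this basis the degrees of freedom equal the number of internal knots plus one, which matches the 5 df and 3 df quoted for the baseline and age functions.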
