Random Survival Forests Using Linked Data to Measure Illness Burden Among People With Cancer: Development and Internal Validation of the SEER-CAHPS Illness Burden Index
Lisa M. Lines, PhD, MPH; Julia Cohen, MA; Michael T. Halpern, MD, PhD; Erin
Disclosures
- The authors declare no conflicts of interest. Funding for this research was provided to LML,
JC, and MTH under National Cancer Institute contract #HHSN-261-2015-00132U.
Background
Linked data:
- SEER Cancer Registry Data
- Medicare CAHPS Surveys
- Medicare FFS Claims & Enrollment Data
Questions:
- Could a simple score identify Medicare CAHPS respondents with high medical needs and
serious illness burdens?
- Are care experiences associated with illness burdens? (not presented today)
Approach:
- Supervised machine learning: random survival forests (RSF) using R
- Predict 1-year mortality with whatever data are available for each person
SEER = Surveillance, Epidemiology, and End Results; CAHPS = Consumer Assessment of Healthcare Providers and Systems
- SEER‐CAHPS includes many kinds of information about people’s health,
with variables differing by year and group – analysis presents a challenge
- Many indices that summarize morbidity use claims data – example: NCI
Combined
- More than half of our sample (Medicare Advantage enrollees) does not
have claims
- We have self‐reported information on a huge range of measures, including
validated measures from other instruments (SF‐12, PHQ‐2) and widely used measures like ADLs
- SEER‐CAHPS provides an opportunity to merge information from different
data sources to improve our understanding of morbidity burden
- More precise assessment allows more accurate comparisons of illness
burden between similar individuals with and without (and before and after) cancer
Regression models vs. RSF methods
- RSF…
– handles the proportionality assumption automatically
– non-parametric, makes no assumptions about underlying distribution of values of the predictor variables (can handle skewed and multi-modal distributions)
– can handle hundreds of independent variables
– can identify survival risk factors without prior knowledge of interactions among variables
– is robust to outliers and does not suffer from convergence problems
– identifies the independent variables that best segregate subgroups as important predictors and identifies interactions among independent variables
– uses imputation techniques to account for missing data
- Survival regression models…
– require assumptions (e.g., proportional hazards)
– may fail to converge when there are too many predictors or outliers
– may fail to converge when there are too many interaction terms
– require laborious effort to account for missing data
– may not be able to handle both imputation and survey weights
Conceptual Model
- Sociodemographic characteristics
- Cancer-related morbidity
- Self-reported health status
- Chronic conditions
- Utilization
- Self-reported activity limitations
- Proxy assistance
- Contextual factors
SEER-CAHPS Illness Burden Index (SCIBI)
Cohorts and groups
Assessed for eligibility (n=4,483,388)
Excluded:
- Comparison beneficiaries outside SEER areas (n=3,400,754)
- Surveyed outside of 2007-2013 period (n=519,284)
- Comparison beneficiaries w/ self-reported cancer history (n=32,430)
- Missing sample weight (n=3,706)
- Survey date after date of death (n=2,060)
- Missing diagnosis date or diagnosed on/after death (n=225)
Analyzed (n=524,929)
- People with cancer (n=116,735)
– Surveyed before diagnosis (n=31,869): MA (n=16,222), FFS (n=15,647)
– Surveyed after diagnosis (n=84,866): MA (n=42,834), FFS (n=42,032)
- People without cancer (n=408,194): MA (n=216,794), FFS (n=191,400)
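The counts in the cohort flow diagram can be cross-checked with simple arithmetic. The short Python sketch below copies the numbers from the slide; the variable names and dictionary labels are illustrative and not taken from the authors' code.

```python
# Counts copied from the cohort flow diagram; names are illustrative only.
assessed = 4_483_388

excluded = {
    "comparison beneficiaries outside SEER areas": 3_400_754,
    "surveyed outside 2007-2013": 519_284,
    "comparison beneficiaries with self-reported cancer history": 32_430,
    "missing sample weight": 3_706,
    "survey date after date of death": 2_060,
    "missing diagnosis date or diagnosed on/after death": 225,
}

# Analysis groups: (cohort, plan type) -> n
groups = {
    ("cancer, pre-diagnosis", "MA"): 16_222,
    ("cancer, pre-diagnosis", "FFS"): 15_647,
    ("cancer, post-diagnosis", "MA"): 42_834,
    ("cancer, post-diagnosis", "FFS"): 42_032,
    ("no cancer", "MA"): 216_794,
    ("no cancer", "FFS"): 191_400,
}

# Everyone assessed minus everyone excluded should equal the analyzed sample.
analyzed = assessed - sum(excluded.values())

with_cancer = sum(n for (cohort, _), n in groups.items() if cohort.startswith("cancer"))
without_cancer = sum(n for (cohort, _), n in groups.items() if cohort == "no cancer")

print(analyzed, with_cancer, without_cancer)
```

The exclusions account for the full difference between the assessed and analyzed samples, and the six analysis groups add up to the analyzed total.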
Classification: A Simplified Example
Population, split on "Needs Help with Personal Care":
- YES: 65% died
- NO: 38% died
These numbers are for example purposes only!
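As a rough illustration of what one split in a tree does, the Python sketch below divides a toy population on a single yes/no variable and reports the share who died in each branch. The records are hypothetical and were constructed only to reproduce the illustrative 65% and 38% figures on the slide; the function and field names are assumptions, not part of the study.

```python
def mortality_by_split(records, feature):
    """Return {feature value: fraction who died} for a binary split."""
    totals, deaths = {}, {}
    for rec in records:
        key = rec[feature]
        totals[key] = totals.get(key, 0) + 1
        deaths[key] = deaths.get(key, 0) + rec["died"]
    return {key: deaths[key] / totals[key] for key in totals}


# Hypothetical toy population: 20 people who need help (13 died)
# and 50 people who do not (19 died).
records = (
    [{"needs_help": True, "died": 1}] * 13
    + [{"needs_help": True, "died": 0}] * 7
    + [{"needs_help": False, "died": 1}] * 19
    + [{"needs_help": False, "died": 0}] * 31
)

rates = mortality_by_split(records, "needs_help")
print(rates)
```

A survival tree repeats this kind of partition at each node, choosing the variable whose split best separates the survival experience of the two branches.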
Steps in RSF process
1. Split each group into annual subsamples
2. In each of those subsamples, take 500 bootstrap samples from the original data
3. Grow a survival tree for each bootstrapped dataset
   a. At each node of the tree, randomly select 10 variables for splitting on
   b. Split on the variable that optimizes the survival splitting criterion
4. Grow the tree to full size with terminal nodes having at least 50 unique cases
   a. Calculate the tree predictor to generate the cumulative hazard estimate (relative risk) of mortality (SCIBI score)
5. Calculate in-bag and out-of-bag (OOB) estimates by averaging the 500 tree predictors
6. Use the OOB estimator to estimate out-of-sample prediction performance
7. Use OOB estimation to calculate variable importance
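Two building blocks of the steps above can be sketched in plain Python: the cumulative hazard estimate computed in a terminal node (shown here with the Nelson-Aalen estimator, the usual choice in RSF software) and the bootstrap bookkeeping that separates in-bag from out-of-bag cases. This is a simplified sketch, not the authors' R code. In particular, node splitting is omitted, so each "tree" is a single node, and all function names and toy data are hypothetical.

```python
import random


def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard at each distinct death time.

    times: follow-up times; events: 1 = died, 0 = censored.
    Returns a list of (time, cumulative hazard) pairs.
    """
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    cumulative, curve = 0.0, []
    for t in death_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        cumulative += deaths / at_risk
        curve.append((t, cumulative))
    return curve


def bootstrap_indices(n, rng):
    """Draw n indices with replacement; return (in-bag, out-of-bag) lists."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


def oob_ensemble(times, events, n_trees=500, seed=0):
    """Average each case's prediction over the trees where it was out-of-bag.

    Simplification: every "tree" is one node with no splits, so its predictor
    is the final Nelson-Aalen cumulative hazard of its bootstrap sample.
    """
    rng = random.Random(seed)
    n = len(times)
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_trees):
        in_bag, out_of_bag = bootstrap_indices(n, rng)
        curve = nelson_aalen([times[i] for i in in_bag],
                             [events[i] for i in in_bag])
        prediction = curve[-1][1] if curve else 0.0
        for i in out_of_bag:
            sums[i] += prediction
            counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical toy data: follow-up in months, 1 = died, 0 = censored.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
curve = nelson_aalen(toy_times, toy_events)
oob_scores = oob_ensemble(toy_times, toy_events)
```

In a real forest, each tree would also split on randomly chosen candidate variables and each case would be scored by the terminal node it falls into, so predictions would differ between individuals.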
Ability of SCIBI Scores to Differentiate 12-month Mortality Risk
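Groups: people with cancer in SEER (surveyed pre-diagnosis or post-diagnosis) and people without cancer in SEER, each by MA and FFS.

| Measure | Cancer, pre-diagnosis, MA | Cancer, pre-diagnosis, FFS | Cancer, post-diagnosis, MA | Cancer, post-diagnosis, FFS | No cancer, MA | No cancer, FFS |
|---|---|---|---|---|---|---|
| N | 16,222 | 15,647 | 42,834 | 42,032 | 216,794 | 191,400 |
| Percent who died within 12 months of survey in all years | 6% | 5% | 7% | 6% | 3% | 3% |
| Percent who died in bottom 25th percentile | 0% | 0% | 0% | 0% | 0% | 0% |
| Percent who died in 99th percentile | 95% | 99% | 96% | 100% | 98% | 97% |
| Error rate | 37% | 11% | 25% | 9% | 23% | 12% |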
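The slides do not define "error rate". A common convention in random survival forest software is to report 1 minus Harrell's concordance index (C), and the Python sketch below assumes that definition. The function name and the toy data are hypothetical.

```python
def concordance_index(times, events, risk):
    """Harrell's C: share of usable pairs in which the case that died
    earlier also had the higher predicted risk (ties in risk count 0.5)."""
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when case i has an observed death that
            # occurs strictly before case j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable


# Hypothetical toy data: follow-up in months, 1 = died, 0 = censored,
# and a predicted risk score for each person.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.4, 0.5, 0.1]

c_index = concordance_index(toy_times, toy_events, toy_risk)
error_rate = 1 - c_index
print(round(c_index, 3), round(error_rate, 3))
```

Under this convention, an error rate of 50% corresponds to ranking no better than chance and 0% to perfect ranking of who dies first.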
Variable Importance (x100)
Top 5 factors by group, shown as factor (VIMP). Groups: people with cancer (surveyed pre-diagnosis or post-diagnosis) and people without cancer, each by MA and FFS.

| Rank | Cancer, pre-diagnosis, MA | Cancer, pre-diagnosis, FFS | Cancer, post-diagnosis, MA | Cancer, post-diagnosis, FFS | No cancer, MA | No cancer, FFS |
|---|---|---|---|---|---|---|
| 1 | Age (18) | Any hospice (186) | General health (21) | Any hospice (163) | Age (44) | Any hospice (93) |
| 2 | Social activity limitations (9) | # inpatient stays (31) | SF12 - physical (15) | # inpatient stays (30) | Needs help w/ personal care (13) | # inpatient stays (36) |
| 3 | SF12 - mental (7) | Any SNF (10) | Cancer stage (14) | Any SNF (8) | Proxy (12) | Age (12) |
| 4 | Mental health (5) | Age (3) | Age (13) | Any DME (4) | General health (8) | Any SNF (8) |
| 5 | Pain (5) | Wheelchair (2) | Needs help w/ personal care (10) | Needs help w/ personal care (3) | ADL - bathing (7) | Lethargy (3) |
Gold – self-report; Blue – Medicare claims; Gray – SEER; Green – Medicare enrollment data
- A numerical indicator of how important a variable is to the classification algorithm
- Based on the increase (or decrease) in misclassification error on the test data if the variable were not available
- We report the top 5 most influential variables (ranked) and their VIMP for variables included in any random survival forest (RSF) within that group
- Provides important new information about what factors influence mortality risk in our sample
- Results are shaded based on whether data were from claims, registry, or survey data
- For FFS beneficiaries, at least half of the variables ranked most important came from claims data and self-reported variables were less important
- Among MA enrollees, self-reported information, such as needing help with routine tasks, provided most of the information
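One common way to approximate "error if the variable were not available" is permutation importance: shuffle one variable's values across cases, re-score, and take the change in error. The Python sketch below assumes that approach with a simple misclassification error. The predictor, data, and names are hypothetical, and the forest in the study would use out-of-bag cases and its own error measure. The slides report VIMP multiplied by 100.

```python
import random


def misclassification(predictions, outcomes):
    """Fraction of cases where the prediction differs from the outcome."""
    return sum(p != y for p, y in zip(predictions, outcomes)) / len(outcomes)


def permutation_vimp(predict, rows, outcomes, feature, seed=0):
    """Error after shuffling one feature minus the baseline error."""
    rng = random.Random(seed)
    baseline = misclassification([predict(r) for r in rows], outcomes)
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    # Copy each row with the shuffled value swapped in for this feature.
    permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
    return misclassification([predict(r) for r in permuted], outcomes) - baseline


# Hypothetical toy data: death depends only on age; "noise" is irrelevant.
ages = [66, 70, 72, 75, 79, 82, 85, 88, 90, 93]
rows = [{"age": a, "noise": k % 3} for k, a in enumerate(ages)]
outcomes = [r["age"] > 80 for r in rows]


def predict(row):
    """Hypothetical stand-in for a fitted model: predicts death if age > 80."""
    return row["age"] > 80


vimp_age = permutation_vimp(predict, rows, outcomes, "age")
vimp_noise = permutation_vimp(predict, rows, outcomes, "noise")
```

Shuffling a variable the model ignores leaves the error unchanged (VIMP of zero), while shuffling an informative variable can only raise the error in this toy setup.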
Caveats and limitations
- Medicare CAHPS has relatively low response rates (<50%)
– Used weights to account for non-response
- Hard to compare results with regression-based approaches
– Literature has other examples
- Error rates much higher for MA enrollees
– FFS error rates are comparable to or better than those in prior studies
- Care experience measures do not necessarily correlate with other quality measures
Conclusions
- Among more than 500,000 Medicare beneficiaries, SCIBI scores were relatively accurate as
measured by the overall error rate (20%)
– Individuals in the 99th percentile of the score had an average mortality rate of 97%
- SCIBI = omnibus measure summarizing functional and health status, other conditions, and
utilization associated with high medical need, serious illness, frailty, and end of life
- Future research needed:
– associations between site-specific risk indicators and SCIBI measures?
– consistency between markers of cancer burden and the care experiences of people with cancer?
This score is available for both people with cancer and without, so that accurate comparisons can be made between populations