
SLIDE 1

Random Survival Forests Using Linked Data to Measure Illness Burden Among People With Cancer: Development and Internal Validation of the SEER-CAHPS Illness Burden Index

Lisa M. Lines, PhD, MPH

Julia Cohen, MA

Michael T. Halpern, MD, PhD

Erin E. Kent, PhD

Michelle A. Mollica, PhD, MPH, RN

AcademyHealth Annual Research Meeting
June 4, 2019 – Washington, DC

SLIDE 2

Disclosures

The authors declare no conflicts of interest.

Funding for this research was provided to LML, JC, and MTH under National Cancer Institute contract #HHSN-261-2015-00132U.

SLIDE 3

Background

Linked data:

– SEER Cancer Registry Data
– Medicare CAHPS Surveys
– Medicare FFS Claims & Enrollment Data

Questions:

– Could a simple score identify Medicare CAHPS respondents with high medical needs and serious illness burdens?
– Are care experiences associated with illness burdens? (not presented today)

Approach:

– Supervised machine learning: random survival forests (RSF) using R
– Predict 1-year mortality with whatever data are available for each person

SEER: Surveillance, Epidemiology, and End Results
CAHPS: Consumer Assessment of Healthcare Providers and Systems

– SEER-CAHPS includes many kinds of information about people's health, with variables differing by year and group, so analysis presents a challenge
– Many indices that summarize morbidity use claims data (example: NCI Combined)
– More than half of our sample (Medicare Advantage enrollees) does not have claims
– We have self-reported information on a huge range of measures, including validated measures from other instruments (SF-12, PHQ-2) and widely used measures like ADLs
– SEER-CAHPS provides an opportunity to merge information from different data sources to improve our understanding of morbidity burden
– More precise assessment allows more accurate comparisons of illness burden between similar individuals with and without (and before and after) cancer

SLIDE 4

Regression models vs. RSF methods

RSF…

– handles the proportionality assumption automatically
– non-parametric, makes no assumptions about underlying distribution of values of the predictor variables (can handle skewed and multi-modal distributions)
– can handle hundreds of independent variables
– can identify survival risk factors without prior knowledge of interactions among variables
– is robust to outliers and does not suffer from convergence problems
– identifies the independent variables that best segregate subgroups as important predictors and identifies interactions among independent variables
– uses imputation techniques to account for missing data

Survival regression models…

– require assumptions (e.g. proportional hazards)
– may fail to converge when there are too many predictors or outliers
– may fail to converge when there are too many interaction terms
– require laborious effort to account for missing data
– may not be able to handle both imputation and survey weights

SLIDE 5

Conceptual Model

– Sociodemographic characteristics
– Cancer-related morbidity
– Self-reported health status
– Chronic conditions
– Utilization
– Self-reported activity limitations
– Proxy assistance
– Contextual factors

SEER-CAHPS Illness Burden Index (SCIBI)

SLIDE 6

Cohorts and groups

Assessed for eligibility (n=4,483,388)

Excluded:

– Comparison beneficiaries outside SEER areas (n=3,400,754)
– Surveyed outside of 2007-2013 period (n=519,284)
– Comparison beneficiaries w/ self-reported cancer history (n=32,430)
– Missing sample weight (n=3,706)
– Survey date after date of death (n=2,060)
– Missing diagnosis date or diagnosed on/after death (n=225)

Analyzed (n=524,929)

People with cancer (n=116,735)

– Surveyed before diagnosis (n=31,869): MA (n=16,222), FFS (n=15,647)
– Surveyed after diagnosis (n=84,866): MA (n=42,834), FFS (n=42,032)

People without cancer (n=408,194)

– MA (n=216,794), FFS (n=191,400)
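As a quick consistency check on the cohort flow, the counts above can be reconciled with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the numbers are transcribed from this slide rather than taken from the underlying data.

```python
# Counts transcribed from the cohort flow on this slide (names are illustrative).
assessed = 4_483_388
excluded = {
    "comparison beneficiaries outside SEER areas": 3_400_754,
    "surveyed outside 2007-2013 period": 519_284,
    "comparison beneficiaries with self-reported cancer history": 32_430,
    "missing sample weight": 3_706,
    "survey date after date of death": 2_060,
    "missing diagnosis date or diagnosed on/after death": 225,
}
analyzed = assessed - sum(excluded.values())

# Six analysis groups: (cohort, plan type) -> n
groups = {
    ("cancer, surveyed before diagnosis", "MA"): 16_222,
    ("cancer, surveyed before diagnosis", "FFS"): 15_647,
    ("cancer, surveyed after diagnosis", "MA"): 42_834,
    ("cancer, surveyed after diagnosis", "FFS"): 42_032,
    ("no cancer", "MA"): 216_794,
    ("no cancer", "FFS"): 191_400,
}
with_cancer = sum(n for (cohort, _), n in groups.items() if cohort.startswith("cancer"))
without_cancer = sum(n for (cohort, _), n in groups.items() if cohort == "no cancer")

print(f"analyzed: {analyzed:,}")
print(f"with cancer: {with_cancer:,}; without cancer: {without_cancer:,}")
```

The exclusions account for the full difference between the assessed and analyzed counts, and the six group sizes add up to the analyzed total.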

SLIDE 7

Classification: A Simplified Example

Population

Needs Help with Personal Care?

– YES: 65% died
– NO: 38% died

These numbers are for example purposes only!

SLIDE 8

Steps in RSF process

1. Split each group into annual subsamples

2. In each of those subsamples, take 500 bootstrap samples from the original data

3. Grow a survival tree for each bootstrapped dataset

   a. At each node of the tree, randomly select 10 variables for splitting on

   b. Split on the variable that optimizes the survival splitting criterion

4. Grow the tree to full size with terminal nodes having at least 50 unique cases

   a. Calculate the tree predictor to generate the cumulative hazard estimate (relative risk) of mortality (SCIBI score)

5. Calculate in-bag and out-of-bag (OOB) estimates by averaging the 500 tree predictors

6. Use the OOB estimator to estimate out-of-sample prediction performance

7. Use OOB estimation to calculate variable importance
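Two of the building blocks in these steps can be sketched in plain Python: drawing a bootstrap sample (which leaves some cases out-of-bag) and computing a cumulative hazard inside a terminal node. This is an illustration only, not the R code used for the analysis. The slides do not name the hazard estimator; the Nelson-Aalen estimator, which is commonly used in random survival forests, is assumed here, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; rows never drawn are out-of-bag (OOB)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


def nelson_aalen(times, events):
    """Cumulative hazard: running sum of (deaths / number at risk) at each death time."""
    deaths = Counter(t for t, died in zip(times, events) if died)
    cumulative, curve = 0.0, []
    for t in sorted(deaths):
        at_risk = sum(1 for u in times if u >= t)
        cumulative += deaths[t] / at_risk
        curve.append((t, cumulative))
    return curve


# Hypothetical toy terminal node: follow-up in months, 1 = died, 0 = censored.
node_times = [2, 5, 5, 8, 12, 12]
node_events = [1, 1, 0, 1, 0, 0]
node_curve = nelson_aalen(node_times, node_events)

# One bootstrap draw from a hypothetical subsample of 200 cases.
rng = random.Random(2019)
in_bag, oob = bootstrap_indices(200, rng)

print(node_curve)
print(f"{len(set(in_bag))} distinct in-bag cases, {len(oob)} OOB cases")
```

In a full forest, each of the 500 trees would get its own bootstrap draw, and a case's OOB estimate would average only the trees for which that case was out-of-bag.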


SLIDE 9

Ability of SCIBI Scores to Differentiate 12-month Mortality Risk

                                                          People with cancer in SEER                      People without cancer in SEER
                                                          Surveyed pre-diagnosis  Surveyed post-diagnosis
                                                          MA          FFS         MA          FFS         MA          FFS
N                                                         16,222      15,647      42,834      42,032      216,794     191,400
Percent who died within 12 months of survey in all years  6%          5%          7%          6%          3%          3%
Percent who died in bottom 25th percentile                0%          0%          0%          0%          0%          0%
Percent who died in 99th percentile                       95%         99%         96%         100%        98%         97%
Error rate                                                37%         11%         25%         9%          23%         12%
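The slides do not define "error rate". For random survival forests it is conventionally one minus Harrell's concordance index, so that 0% means every comparable pair is ranked correctly and 50% is no better than chance. The sketch below assumes that definition; the function name and the toy data are hypothetical.

```python
def concordance_error(times, events, risk):
    """Return 1 - Harrell's C: the share of usable pairs whose risk ordering is wrong."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when subject i is observed to die before subject j's time.
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # earlier death had the higher predicted risk
                elif risk[i] == risk[j]:
                    concordant += 0.5   # ties count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return 1.0 - concordant / usable


# Hypothetical toy data: months of follow-up, death indicator, predicted risk score.
toy_times = [2, 5, 8, 12]
toy_events = [1, 1, 0, 0]
toy_risk = [0.9, 0.4, 0.6, 0.1]

toy_error = concordance_error(toy_times, toy_events, toy_risk)
print(f"error rate: {toy_error:.0%}")
```

In the toy data, four of the five usable pairs are ordered correctly, which gives an error rate of 20%.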


SLIDE 10

Variable Importance (x100)

Top 5 factors by group, with VIMP in parentheses:

People with cancer, surveyed pre-diagnosis, MA: 1. Age (18); 2. Social activity limitations (9); 3. SF12 - mental (7); 4. Mental health (5); 5. Pain (5)

People with cancer, surveyed pre-diagnosis, FFS: 1. Any hospice (186); 2. # inpatient stays (31); 3. Any SNF (10); 4. Age (3); 5. Wheelchair (2)

People with cancer, surveyed post-diagnosis, MA: 1. General health (21); 2. SF12 - physical (15); 3. Cancer stage (14); 4. Age (13); 5. Needs help w/ personal care (10)

People with cancer, surveyed post-diagnosis, FFS: 1. Any hospice (163); 2. # inpatient stays (30); 3. Any SNF (8); 4. Any DME (4); 5. Needs help w/ personal care (3)

People without cancer, MA: 1. Age (44); 2. Needs help w/ personal care (13); 3. Proxy (12); 4. General health (8); 5. ADL - bathing (7)

People without cancer, FFS: 1. Any hospice (93); 2. # inpatient stays (36); 3. Age (12); 4. Any SNF (8); 5. Lethargy (3)

Gold – self-report; Blue – Medicare claims; Gray – SEER; Green – Medicare enrollment data

– VIMP is a numerical indicator of how important a variable is to the classification algorithm
– Based on the increase (or decrease) in misclassification error on the test data if the variable were not available
– We report the top 5 most influential variables (ranked) and their VIMP for variables included in any random survival forest (RSF) within that group
– Provides important new information about what factors influence mortality risk in our sample
– Results are shaded based on whether data were from claims, registry, or survey data
– For FFS beneficiaries, at least half of the variables ranked most important came from claims data, and self-reported variables were less important
– Among MA enrollees, self-reported information, such as needing help with routine tasks, provided most of the information
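One common way to mimic a variable being "not available" is to shuffle its values and see how much the error rate rises. The slides do not say which scheme was used, so the sketch below is an assumption: it uses permutation with a toy scoring function in place of a fitted forest, and all names and data are hypothetical.

```python
import random


def concordance_error(times, events, risk):
    """1 - Harrell's C over pairs where the earlier time is an observed death."""
    concordant, usable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return 1.0 - concordant / usable


def permutation_vimp(rows, times, events, score, column, rng):
    """Increase in error after shuffling one column; larger means more important."""
    baseline = concordance_error(times, events, [score(r) for r in rows])
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted_rows = [dict(r, **{column: v}) for r, v in zip(rows, shuffled)]
    permuted = concordance_error(times, events, [score(r) for r in permuted_rows])
    return permuted - baseline


# Hypothetical toy data: older people die sooner; "noise" is unrelated to survival.
rows = [{"age": 60 + i, "noise": i % 3} for i in range(40)]
times = [100 - i for i in range(40)]
events = [1] * 40


def score(row):
    """Stand-in for a fitted model's risk prediction; it uses age only."""
    return row["age"]


rng = random.Random(0)
vimp_age = permutation_vimp(rows, times, events, score, "age", rng)
vimp_noise = permutation_vimp(rows, times, events, score, "noise", rng)
print(f"VIMP age: {vimp_age:.3f}, VIMP noise: {vimp_noise:.3f}")
```

Because the toy score ignores `noise`, shuffling that column changes nothing and its importance is exactly zero, while shuffling `age` destroys the ranking and raises the error.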

SLIDE 11

Caveats and limitations

Medicare CAHPS has relatively low response rates (<50%)

– Used weights to account for non-response

Hard to compare results with regression-based approaches

– Literature has other examples

Error rates much higher for MA enrollees

– FFS error rates are comparable to or better than those in prior studies

Care experience measures do not necessarily correlate with other quality measures


SLIDE 12

Conclusions

Among more than 500,000 Medicare beneficiaries, SCIBI scores were relatively accurate as measured by the overall error rate (20%)

– Individuals in the 99th percentile of the score had an average mortality rate of 97%

SCIBI = omnibus measure summarizing functional and health status, other conditions, and utilization associated with high medical need, serious illness, frailty, and end of life

Future research needed:

– associations between site-specific risk indicators and SCIBI measures?
– consistency between markers of cancer burden and the care experiences of people with cancer?

This score is available for people both with and without cancer, so that accurate comparisons can be made between populations
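The slides do not state how the overall figures were pooled across the six groups. As a rough check, the sketch below averages the per-group values from the Slide 9 table in two ways: an unweighted mean across groups and a mean weighted by sample size. Both are assumptions about the pooling, not the authors' stated method.

```python
# (group, N, error rate %, % who died in the 99th percentile), from the Slide 9 table.
groups = [
    ("cancer, pre-diagnosis, MA", 16_222, 37, 95),
    ("cancer, pre-diagnosis, FFS", 15_647, 11, 99),
    ("cancer, post-diagnosis, MA", 42_834, 25, 96),
    ("cancer, post-diagnosis, FFS", 42_032, 9, 100),
    ("no cancer, MA", 216_794, 23, 98),
    ("no cancer, FFS", 191_400, 12, 97),
]

total_n = sum(n for _, n, _, _ in groups)

# Unweighted means treat each of the six groups equally.
simple_error = sum(err for _, _, err, _ in groups) / len(groups)
simple_top = sum(top for _, _, _, top in groups) / len(groups)

# Weighted mean lets the large no-cancer groups dominate.
weighted_error = sum(n * err for _, n, err, _ in groups) / total_n

print(f"unweighted error rate: {simple_error:.1f}%")
print(f"sample-size-weighted error rate: {weighted_error:.1f}%")
print(f"unweighted 99th-percentile mortality: {simple_top:.1f}%")
```

The unweighted means come to 19.5% for the error rate and 97.5% for 99th-percentile mortality, close to the 20% and 97% quoted above, while the sample-size-weighted error rate is nearer 18%.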

SLIDE 13


Comments or Questions? Email LLines@RTI.org

More information about SEER-CAHPS: healthcaredelivery.cancer.gov/seer-cahps/
