A Modeling Approach to Compensate for Nonresponse and Selection Bias in Surveys?




SLIDE 1

A Modeling Approach to Compensate for Nonresponse and Selection Bias in Surveys?

Tien-Huan (Amy) Lin, Ismael Flores Cervantes
Westat

JSM 2019 • Denver, Colorado • July 27-August 1

The views presented in this paper are those of the author(s) and do not represent the views of any Government Agency/Department or Westat.

SLIDE 2

Objectives

❯ Can we reduce nonresponse bias for different types of nonresponse … (Vartivarian & Little, 2002)

  • by adjusting on response propensity alone (i.e., r̂)?
  • by incorporating (predicted) survey outcome(s) in response propensity models (i.e., the two-step approach, ŷ → r̂)?

❯ Using modeling tools from the statistical learning area (e.g., gradient boosting) (Morral et al., 2015; Fay & Riddles, 2016; Lin & Flores Cervantes, 2018)

❯ Empirical study: simulation with realistic survey design

  • Unequal sampling probability
  • Indirect correlation between survey outcome and response propensity

SLIDE 3

Types of Nonresponse

❯ Missing Completely At Random (MCAR)

  • Nonresponse is unrelated to any variable in the data

❯ Missing At Random (MAR)

  • Probability to respond depends only on the covariates

❯ Not Missing At Random (NMAR)

  • Probability to respond depends on the unobserved data

[Figure: magnitude of bias for MCAR, MAR, and NMAR (Sikov, 2018)]
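
The three mechanisms can be illustrated with a small simulation. This is a minimal sketch, not the simulation used in the study: the covariate x, the outcome y, the logistic coefficients, and the function name simulate_response are all assumptions made for illustration. It compares the unadjusted respondent mean of y with the population mean under each mechanism.

```python
import math
import random

def simulate_response(n=20000, seed=1):
    """Return the population mean of y and the unadjusted respondent mean
    under each response mechanism (all settings here are illustrative)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]    # covariate observed for everyone
    y = [xi + rng.gauss(0, 1) for xi in x]     # outcome, correlated with x

    def logistic(t):
        return 1 / (1 + math.exp(-t))

    # Probability of responding under each mechanism.
    propensity = {
        "MCAR": lambda xi, yi: 0.6,                  # unrelated to any variable
        "MAR": lambda xi, yi: logistic(0.5 + xi),    # depends on the covariate only
        "NMAR": lambda xi, yi: logistic(0.5 + yi),   # depends on the outcome itself
    }
    result = {"population": sum(y) / n}
    for name, p in propensity.items():
        resp = [yi for xi, yi in zip(x, y) if rng.random() < p(xi, yi)]
        result[name] = sum(resp) / len(resp)
    return result

means = simulate_response()
```

Under MCAR the respondent mean stays close to the population mean. Under MAR the unadjusted mean is off, but the covariate driving response is observed, so an adjustment on x can correct it. Under NMAR the driver of response is not available for adjustment.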

SLIDE 4

Simulation Details

❯ Population

  • 2012 National Health Interview Survey
  • Target population: ages 18 and over
  • N: 57,356

❯ Sample Design

  • Complex sample design:
    - One-stage cluster sample with implicit stratification
  • Bernoulli or Poisson sample selection of ≈500 households
    - Poisson sample selection with probability of selection proportional to household size (with differential error)
  • All persons in sampled households selected (n ≈ 800)

❯ Nonresponse (with and without selection bias) at the person level

❯ Survey outcome (y) is artificial
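
Poisson selection with probability proportional to household size can be sketched as follows. The household sizes below are made up, the function name poisson_pps_sample is hypothetical, and the "differential error" mentioned on the slide is not modeled. Each household is drawn independently, so the realized sample size is random around its expected value.

```python
import random

def poisson_pps_sample(household_sizes, expected_n, seed=1):
    """Poisson sampling: select each household independently with
    probability proportional to its size (capped at 1)."""
    rng = random.Random(seed)
    total = sum(household_sizes)
    sample = []
    for hh_id, size in enumerate(household_sizes):
        pi = min(1.0, expected_n * size / total)       # selection probability
        if rng.random() < pi:
            sample.append((hh_id, size, 1.0 / pi))     # id, size, base weight
    return sample

# Hypothetical frame of households with 1 to 5 persons each.
frame_rng = random.Random(0)
sizes = [frame_rng.randint(1, 5) for _ in range(30000)]

sample = poisson_pps_sample(sizes, expected_n=500)
# All persons in a sampled household are taken (one-stage cluster sample).
persons_in_sample = sum(size for _, size, _ in sample)
# Weighted estimate of the total number of persons on the frame.
ht_total_persons = sum(size * weight for _, size, weight in sample)
```

The base weight is the inverse of the selection probability, which is the weight that the estimators on the later slides start from.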

SLIDE 5

Model Specifications for Survey Outcome and Response Propensity

Synthetic y (e.g., estimate of smokers)

Covariates, with marks as listed on the slide (columns: y, rmar, rnmar)

  • Education*income (* workclass for outcome model): ✓ ✓ †
  • #kids in HH (* sex for response models): ✓ ✓ †
  • Age: ✓ ✓ ✓
  • Age*Sex: ✓ ✓ ✓
  • Race/Ethnicity: ✓ †
  • Race/Ethnicity*Sex: ✓ †
  • Esophagus cancer: ✓ †
  • Lung cancer: ✓ †
  • Throat cancer: ✓ †
  • Kidney cancer: ✓ †
  • Heart condition/disease: ✓
  • Coronary heart disease: ✓
  • Heart attack: ✓
  • COPD: ✓
  • Sex: ✓ ✓
  • Family type: ✓ †
  • Work limitation due to health problem (family member): ✓ ✓
  • Worried food would run out before got money to buy more: ✓ ✓
  • Unable to work now due to health problem (individual): ✓ ✓
  • Any limitation – all conditions: ✓ ✓
  • Region: ✓ ✓

✓ Significant predictor and available for modeling
† Removed from dataset and not available for modeling

SLIDE 6

Model Specifications for Survey Outcome and Response Propensity


Source 1 of NMAR: some covariates removed, introducing unobserved data
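
A small sketch of why removing a covariate (†) turns a MAR mechanism into NMAR from the modeler's point of view. Everything here is assumed for illustration: two binary covariates a and b, the response and outcome probabilities, and a simple weighting-class adjustment standing in for the tree and boosting models used in the study.

```python
import random

def adjusted_mean(rows, cell_key):
    """Weighting-class adjustment: weight each respondent by the inverse of
    the observed response rate in its cell, then take the weighted mean of y."""
    cells = {}
    for row in rows:
        counts = cells.setdefault(cell_key(row), [0, 0])
        counts[0] += 1                   # sampled units in the cell
        counts[1] += row["responded"]    # respondents in the cell
    num = den = 0.0
    for row in rows:
        if row["responded"]:
            n_cell, r_cell = cells[cell_key(row)]
            w = n_cell / r_cell          # inverse of the estimated propensity
            num += w * row["y"]
            den += w
    return num / den

rng = random.Random(7)
rows = []
for _ in range(20000):
    a = rng.random() < 0.5               # covariate kept in the dataset
    b = rng.random() < 0.5               # covariate later removed
    y = 1 if rng.random() < (0.2 + 0.4 * b) else 0
    responded = 1 if rng.random() < (0.3 + 0.2 * a + 0.4 * b) else 0
    rows.append({"a": a, "b": b, "y": y, "responded": responded})

truth = sum(r["y"] for r in rows) / len(rows)
with_b = adjusted_mean(rows, lambda r: (r["a"], r["b"]))   # b available: MAR
without_b = adjusted_mean(rows, lambda r: r["a"])          # b removed: NMAR
```

With b available, the adjusted mean is close to the full-sample mean. Once b is removed, response still depends on it, and the adjustment on a alone leaves a bias.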

SLIDE 7

Model Specifications for Survey Outcome and Response Propensity


Source 2 of NMAR: selection bias

SLIDE 8

Observed Selection Bias for Expected y for NMAR

[Figure: synthetic y (0% to 70%) by income × education (low income, high income; no high school, high school and above), pop vs. nmar]

[Figure: synthetic y (0% to 70%) by #kids in HH × sex (M, F), pop vs. nmar]

SLIDE 9

Simulation Scenarios

❯ Nonresponse scenarios

  • MCAR: Bernoulli sample selection, 100% household RR, 70% person RR
  • MAR: Poisson sample selection, 100% household RR, 59% person RR
  • NMAR: Poisson sample selection, 100% household RR, 43% person RR

❯ Estimators compared

  • HT: Horvitz-Thompson on full sample (true unbiased estimate)
  • BWGT: adjusted using inverse of selection probability (ignoring nonresponse)
  • rpms: tree algorithm to model r
  • xgboost→rpms: uses xgboost to model y and rpms to model r (with ŷ + all x)
  • xgboost→xgboost: uses xgboost to model y and xgboost to model r (with ŷ + all x)

❯ 10,000 simulations for each response propensity assumption

Knottnerus, P. (2003). Sample Survey Theory. New York, NY: Springer.
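
The first three kinds of estimate can be sketched on a toy sample. This is an illustration under stated assumptions, not the study's code: the estimated response propensity is a base-weighted response rate within cells of a single covariate x (a stand-in for rpms or xgboost), means are computed in ratio form (weights normalized by their sum), and the data and function names are made up. In a simulation y is known for nonrespondents, which is what makes the full-sample benchmark computable.

```python
def cell_response_rates(sample, key):
    """Base-weighted response rate within each cell of u[key]."""
    total, resp = {}, {}
    for u in sample:
        k = u[key]
        total[k] = total.get(k, 0.0) + u["w"]
        resp[k] = resp.get(k, 0.0) + u["w"] * u["responded"]
    return {k: resp[k] / total[k] for k in total}

def weighted_mean(units, weight):
    """Ratio-form estimate of the mean of y under the given weights."""
    return sum(weight(u) * u["y"] for u in units) / sum(weight(u) for u in units)

def compare_estimators(sample, key):
    rates = cell_response_rates(sample, key)
    respondents = [u for u in sample if u["responded"]]
    return {
        # Full sample with base weights: benchmark, as if everyone responded.
        "HT": weighted_mean(sample, lambda u: u["w"]),
        # Respondents with base weights only: ignores nonresponse.
        "BWGT": weighted_mean(respondents, lambda u: u["w"]),
        # Respondents with base weights divided by the estimated propensity.
        "ADJ": weighted_mean(respondents, lambda u: u["w"] / rates[u[key]]),
    }

# Toy sample: w is the base weight (1 / selection probability).
sample = [
    {"y": 1, "w": 10.0, "x": "a", "responded": 1},
    {"y": 0, "w": 10.0, "x": "a", "responded": 1},
    {"y": 1, "w": 20.0, "x": "b", "responded": 1},
    {"y": 1, "w": 20.0, "x": "b", "responded": 0},
    {"y": 1, "w": 20.0, "x": "b", "responded": 0},
]
est = compare_estimators(sample, "x")
```

The toy data are built so that response depends only on x, so the adjusted estimate matches the full-sample benchmark while the base-weight-only estimate does not. For the two-step methods, a prediction of y would be added to the predictors of the response model; that step is not shown here.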

SLIDE 10

Bias Assessment for MCAR and MAR

❯ Missing Completely At Random (MCAR)

  • Nonresponse is unrelated to any variable in the data -> unbiased estimate regardless of auxiliary variables and adjustment methods!
  • CONFIRMED!

❯ Missing At Random (MAR)

  • Probability to respond depends only on the covariates -> covariates are observed for all sampled units and estimates should be unbiased
  • Mostly true

Baseline to measure improvement in nonresponse bias

  • r̂
  • ŷ → r̂

SLIDE 11

Bias Assessment for NMAR

❯ Can we reduce nonresponse bias for different types of nonresponse … (Vartivarian & Little, 2002)

  • by adjusting on response propensity alone (i.e., r̂)?
  • by incorporating (predicted) survey outcome(s) in response propensity models (i.e., ŷ → r̂)?

❯ Using modeling tools from the statistical learning area: rpart, rpms, gradient boosting (Morral et al., 2015; Fay & Riddles, 2016; Lin & Flores Cervantes, 2018)

❯ bwgt: worst case scenario, baseline to measure improvement in bias and rmse

❯ All methods have some impact on nonresponse bias, with the level of correction: rpms > xgb+rpms > xgb+xgb.

❯ None of the adjustment methods yields “unbiased” estimates.

❯ Comparing the ŷ → r̂ (i.e., xgb+rpms, xgb+xgb) methods to the r̂ (i.e., rpms) method, under these settings, the r̂ method yields the lowest bias and rmse.

Baseline to measure improvement in nonresponse bias

  • r̂
  • ŷ → r̂

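
Bias and rmse across simulation replicates can be computed as in the sketch below. The replicate estimates and the true value are hypothetical numbers chosen only to show the calculation, and the function name is an assumption.

```python
import math

def bias_and_rmse(estimates, truth):
    """Empirical bias and root mean squared error of replicate estimates."""
    n = len(estimates)
    bias = sum(estimates) / n - truth
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    return bias, rmse

# Hypothetical estimates from four replicates and a hypothetical true value.
bias, rmse = bias_and_rmse([0.42, 0.44, 0.40, 0.46], truth=0.40)
```

The squared rmse equals the squared bias plus the variance of the estimates, so a method can reduce bias and still have a higher rmse if it adds variance.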
SLIDE 12

Conclusion

MCAR: in this baseline assumption, the ŷ → r̂ methods yield the same unbiased results as the Horvitz-Thompson and r̂ methods: they do not have a negative impact on the estimate

MAR: under the assumption that having all data available for modeling should yield unbiased estimates, the ŷ → r̂ methods show no benefit over the r̂ method

NMAR: in this setting, the ŷ of the ŷ → r̂ methods predicts the estimate for respondents and not for the population; consequently, the estimates under these methods show worse results than the r̂ method in terms of bias and mse reduction

SLIDE 13

Results

Contact information:
AmyLin@westat.com
IsmaelFloresCervantes@westat.com