A Modeling Approach to Compensate for Nonresponse and Selection Bias in Surveys?


A Modeling Approach to Compensate for Nonresponse and Selection Bias in Surveys? Tien-Huan (Amy) Lin, Ismael Flores Cervantes Westat JSM 2019 Denver, Colorado July 27-August 1 The views presented in this paper are those of the

SLIDE 1

Tien-Huan (Amy) Lin, Ismael Flores Cervantes Westat

A Modeling Approach to Compensate for Nonresponse and Selection Bias in Surveys?

The views presented in this paper are those of the author(s) and do not represent the views of any Government Agency/Department or Westat

JSM 2019 • Denver, Colorado • July 27-August 1

SLIDE 2

Objectives

❯ Can we reduce nonresponse bias for different types of nonresponse …

  by adjusting on response propensity alone (i.e., r̂)?
  by incorporating (predicted) survey outcome(s) in response propensity models (i.e., ŷ → r̂)?

  － Vartivarian & Little, 2002
  － Using modeling tools from the statistical learning area (i.e., gradient boosting) (Morral et al., 2015; Fay & Riddles, 2016; Lin & Flores Cervantes, 2018)

❯ Empirical study: simulation with realistic survey design

  Unequal sampling probability
  Indirect correlation between survey outcome and response propensity
  Two-step approach

SLIDE 3

Types of Nonresponse

❯ Missing Completely At Random (MCAR)

  Nonresponse is unrelated to any variable in the data

❯ Missing At Random (MAR)

  Probability to respond depends only on the covariates

❯ Not Missing At Random (NMAR)

  Probability to respond depends on the unobserved data

[Figure: magnitude of bias across MCAR, MAR, and NMAR (Sikov, 2018)]
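The three mechanisms can be made concrete with a small simulation. This is a minimal sketch, not the authors' setup: the binary covariate `x`, the binary outcome `y`, and every probability below are illustrative assumptions chosen only to show how the unadjusted respondent mean behaves under each mechanism.

```python
import random

def simulate(mechanism, n=100_000, seed=1):
    """Return (full-sample mean of y, unadjusted respondent mean of y).

    All probabilities are made-up values for illustration only.
    """
    rng = random.Random(seed)
    total_y = resp_y = resp_n = 0
    for _ in range(n):
        x = rng.random() < 0.5                   # covariate observed for everyone
        y = rng.random() < (0.6 if x else 0.2)   # outcome is correlated with x
        if mechanism == "MCAR":
            p = 0.7                              # same propensity for every unit
        elif mechanism == "MAR":
            p = 0.8 if x else 0.4                # depends on the observed covariate only
        else:                                    # "NMAR"
            p = 0.8 if y else 0.4                # depends on the outcome itself
        total_y += y
        if rng.random() < p:                     # unit responds
            resp_y += y
            resp_n += 1
    return total_y / n, resp_y / resp_n

results = {m: simulate(m) for m in ("MCAR", "MAR", "NMAR")}
```

In this toy setup the respondent mean tracks the full-sample mean under MCAR and drifts away under MAR and NMAR. Under MAR the drift could be removed by adjusting within levels of `x`; under NMAR it could not, because the variable driving response is not available for nonrespondents.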

SLIDE 4

Simulation Details

❯ Population

  2012 National Health Interview Survey
  Target population: ages 18 and over
  N: 57,356

❯ Sample Design

  Complex sample design:

  － One-stage cluster sample with implicit stratification

  Bernoulli or Poisson sample selection of ≈500 households

  － Poisson sample selection with probability of selection proportional to household size (with differential error)

  All persons in sampled households selected (n ≈ 800)

  － Nonresponse (with and without selection bias) at the person level
  － Survey outcome (y) is artificial
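The household selection step can be sketched as follows. This is a hedged illustration only: the population of 20,000 households, the household sizes, and the function name `poisson_pps_sample` are hypothetical, and the "differential error" mentioned above is not modeled. Bernoulli sampling is the special case in which every household has the same inclusion probability.

```python
import random

def poisson_pps_sample(sizes, expected_n, rng):
    """Poisson sampling: each unit is included independently with
    probability proportional to its size (capped at 1)."""
    total = sum(sizes)
    sample = []
    for i, s in enumerate(sizes):
        pi = min(1.0, expected_n * s / total)   # inclusion probability
        if rng.random() < pi:
            sample.append((i, pi))
    return sample

rng = random.Random(2019)
# Hypothetical population: 20,000 households with 1 to 6 members each
sizes = [rng.randint(1, 6) for _ in range(20_000)]
sample = poisson_pps_sample(sizes, expected_n=500, rng=rng)

# Horvitz-Thompson estimates: weight each sampled unit by 1 / pi
ht_households = sum(1.0 / pi for _, pi in sample)
ht_persons = sum(sizes[i] / pi for i, pi in sample)
```

Because units are drawn independently, the realized sample size is random and only equals `expected_n` on average, which is why the slide gives the number of households as approximate.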

SLIDE 5

Model Specifications for Survey Outcome and Response Propensity

Covariates: y, rmar, rnmar

Education*income (*workclass for outcome model): ✓ ✓ †
#kids in HH (*sex for response models): ✓ ✓ †
Age: ✓ ✓ ✓
Age*Sex: ✓ ✓ ✓
Race/Ethnicity: ✓ †
Race/Ethnicity*Sex: ✓ †
Esophagus cancer: ✓ †
Lung cancer: ✓ †
Throat cancer: ✓ †
Kidney cancer: ✓ †
Heart condition/disease: ✓
Coronary heart disease: ✓
Heart attack: ✓
COPD: ✓
Sex: ✓ ✓
Family type: ✓ †
Work limitation due to health problem (family member): ✓ ✓
Worried food would run out before got money to buy more: ✓ ✓
Unable to work now due to health problem (individual): ✓ ✓
Any limitation – all conditions: ✓ ✓
Region: ✓ ✓

✓ Significant predictor and available for modeling
† Removed from dataset and not available for modeling

Synthetic y (e.g., estimate of smokers)

SLIDE 6

Model Specifications for Survey Outcome and Response Propensity


Source 1 of NMAR: some covariates removed, introducing unobserved data

SLIDE 7

Model Specifications for Survey Outcome and Response Propensity


Source 2 of NMAR: selection bias

SLIDE 8

Observed Selection Bias for Expected y for NMAR

[Figure: two bar charts of synthetic y (axis 0% to 70%), population (pop) versus NMAR sample (nmar). Panel 1: Income x Education (low and high income; no high school versus high school and above). Panel 2: #kids in HH x Sex (M, F).]

SLIDE 9

Simulation Scenarios

Scenarios

MCAR: Bernoulli sample selection, 100% household RR, 70% person RR
MAR: Poisson sample selection, 100% household RR, 59% person RR
NMAR: Poisson sample selection, 100% household RR, 43% person RR

Estimation methods

1. HT: Horvitz-Thompson on full sample (true unbiased estimate ignoring nonresponse)
2. BWGT: Adjusted using inverse of selection probability
3. rpms: Tree algorithm to model r̂
4. xgboost→rpms: Uses xgboost to model ŷ and rpms to model r̂ (with ŷ + all x)
5. xgboost→xgboost: Uses xgboost to model ŷ and xgboost to model r̂ (with ŷ + all x)

Methods 4 and 5 are the ŷ → r̂ methods.

10,000 simulations for each response propensity assumption

Knottnerus, P. (2003). Sample Survey Theory. New York, NY: Springer.
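The flow of the estimators above can be sketched in a few lines. This is a toy illustration under loud assumptions, not the authors' implementation: simple weighting cells stand in for rpms and xgboost, the covariates `x1` and `x2`, the outcome model, and the response probabilities are invented, and selection probabilities are taken as equal so that every base weight is 1.

```python
import random
from collections import defaultdict

def adjusted_mean(units, cell_of):
    """Respondent mean of y with base weights divided by r_hat, the
    weighted response rate of the unit's cell (a stand-in for a tree
    or boosting model of response propensity)."""
    sampled_w, resp_w = defaultdict(float), defaultdict(float)
    for u in units:
        c = cell_of(u)
        sampled_w[c] += u["w"]
        if u["r"]:
            resp_w[c] += u["w"]
    num = den = 0.0
    for u in units:
        if u["r"]:
            r_hat = resp_w[cell_of(u)] / sampled_w[cell_of(u)]
            num += u["w"] / r_hat * u["y"]
            den += u["w"] / r_hat
    return num / den

rng = random.Random(7)
units = []
for _ in range(50_000):                          # hypothetical full sample
    x1, x2 = rng.random() < 0.5, rng.random() < 0.5
    y = rng.random() < 0.2 + 0.3 * x1 + 0.2 * x2
    r = rng.random() < (0.8 if x1 else 0.4)      # MAR: response depends on x1 only
    units.append({"x1": x1, "x2": x2, "y": int(y), "r": r, "w": 1.0})

# HT analogue: full sample, as if everyone had responded
ht = sum(u["w"] * u["y"] for u in units) / sum(u["w"] for u in units)

# BWGT analogue: respondents only, base weights, no nonresponse adjustment
resp = [u for u in units if u["r"]]
bwgt = sum(u["w"] * u["y"] for u in resp) / sum(u["w"] for u in resp)

# r_hat-only method: propensity cells formed from the covariates
x_cell = lambda u: (u["x1"], u["x2"])
r_only = adjusted_mean(units, x_cell)

# Two-step method: fit y_hat on respondents, then add it to the cells
sums = defaultdict(lambda: [0.0, 0.0])
for u in resp:
    sums[x_cell(u)][0] += u["w"] * u["y"]
    sums[x_cell(u)][1] += u["w"]
y_hat = {c: s[0] / s[1] for c, s in sums.items()}
two_step = adjusted_mean(units, lambda u: x_cell(u) + (round(y_hat[x_cell(u)], 1),))
```

In this toy, `y_hat` is a function of the same cells used for `r_hat`, so the two-step estimate coincides with the `r_hat`-only estimate. Flexible learners would not necessarily behave this way, but the sketch shows where the predicted outcome enters the propensity step and that it is fitted on respondents only.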

SLIDE 10

Bias Assessment for MCAR and MAR

❯ Missing Completely At Random (MCAR)

  Nonresponse is unrelated to any variable in the data -> unbiased estimate regardless of auxiliary variables and adjustment methods!

  CONFIRMED!

❯ Missing At Random (MAR)

  Probability to respond depends only on the covariates -> covariates are observed for all sampled units and estimates should be unbiased

  Mostly true

Baseline to measure improvement in nonresponse bias

SLIDE 11

Bias Assessment for NMAR

❯ Can we reduce nonresponse bias for different types of nonresponse … (Vartivarian & Little, 2002)

  by adjusting on response propensity alone (i.e., r̂)?
  by incorporating (predicted) survey outcome(s) in response propensity models (i.e., ŷ → r̂)?

  － Using modeling tools from the statistical learning area
  － rpart, rpms, gradient boosting (Morral et al., 2015; Fay & Riddles, 2016; Lin & Flores Cervantes, 2018)

bwgt: worst-case scenario, baseline to measure improvement in bias and rmse

❯ All methods have some impact on nonresponse bias, with the level of correction: rpms > xgb+rpms > xgb+xgb.
❯ None of the adjustment methods yields "unbiased" estimates.
❯ Comparing the ŷ → r̂ methods (i.e., xgb+rpms, xgb+xgb) to the r̂ method (i.e., rpms), under these settings, the r̂ method yields the lowest bias and rmse.
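Bias and rmse for a method are summaries over the simulation replicates. A minimal sketch of that computation follows, assuming the standard Monte Carlo definitions; the function name and the replicate values are hypothetical and are not results from the study.

```python
import math

def bias_and_rmse(estimates, truth):
    """Monte Carlo bias and root mean squared error of replicate estimates."""
    s = len(estimates)
    bias = sum(e - truth for e in estimates) / s
    mse = sum((e - truth) ** 2 for e in estimates) / s
    return bias, math.sqrt(mse)

# Made-up replicate estimates for one method, with a true value of 0.40
replicates = [0.43, 0.45, 0.44, 0.46, 0.42]
bias, rmse = bias_and_rmse(replicates, truth=0.40)
```

Since the mean squared error is the squared bias plus the variance of the estimates, the rmse is never smaller than the absolute bias; a method can therefore lower bias and still lose on rmse if it inflates variance.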

SLIDE 12

Conclusion

MCAR: under this baseline assumption, the ŷ → r̂ methods yield the same unbiased results as the Horvitz-Thompson and r̂ methods; incorporating ŷ does not have a negative impact on the estimate.

MAR: under the assumption that having all data available for modeling should yield unbiased estimates, the ŷ → r̂ methods show no benefit over the r̂ method.

NMAR: in this setting, the ŷ of the ŷ → r̂ methods predicts the estimate for respondents and not for the population; consequently, the estimates under these methods show worse results than the r̂ method in terms of bias and MSE reduction.

SLIDE 13


Results

Contact information: AmyLin@westat.com IsmaelFloresCervantes@westat.com