
SLIDE 1

Stop or Continue Data Collection: A Nonignorable Missing Data Approach for Continuous Variables

Thaís Paiva

Federal University of Minas Gerais, Brazil

Jerry Reiter

Duke University

March 14, 2018

JOS and DC-AAPOR Workshop on Responsive and Adaptive Survey Design

SLIDE 2

Outline

  • Introduction
  • Methodology
  • Illustration with Census of Manufactures Data
  • Conclusions
  • References

This research was supported by the NSF NCRN grant (SES-11-31897) awarded to Duke University. Any opinions and conclusions expressed in this article are those of the authors and do not necessarily represent the views of the U.S. Census Bureau. All results have been reviewed to ensure that no confidential information is disclosed.

SLIDE 3

Introduction

SLIDE 4

Motivation

Census of Manufactures Data

  • Survey administered by the U.S. Census Bureau annually, with sample estimates of statistics for all manufacturing establishments with one or more paid employees.
  • Provides statistics on employment, payroll, cost of materials consumed, operating expenses, value of shipments, etc.

SLIDE 5

Adaptive Design

  • Methods that use auxiliary information to tailor and update the sampling scheme throughout the survey:
      • administrative records;
      • paradata (data about the data collection process);
      • actual responses as they are collected.
  • The changes to the survey design can be applied to individuals or to the entire survey.
  • In an ongoing survey, decide whether to:
      1. stop the data collection, or
      2. invest in collecting more data.

SLIDE 7

Decision rule

How to decide whether to stop or not?

Information measure: How different is the nonrespondents' distribution from the respondents'?

Cost measure: How much does it cost to collect more data, and what is the budget?

SLIDE 8

Stopping Rules

Rao et al. (2008): stopping rules for surveys with multiple waves, for binary response variables.

  • Based on standardized differences of the response proportions at each wave, where the proportions are estimated with multiple imputation of the nonresponses.

Wagner and Raghunathan (2010): stopping rules based on the probability of additional data changing the estimates, also for binary response variables.

  • Compares the estimates obtained if data collection stops with the estimates obtained if a follow-up sample is collected.

SLIDE 9

Methodology

SLIDE 10

Methodology

Model for the observed data

  • Continuous multivariate data.
  • The variables are likely correlated, with heavily skewed distributions.
  • The model has to be flexible enough to capture any distributional features of the data.
  • Mixture of multivariate normal distributions.
  • Dirichlet process prior to allow for more flexibility and better density estimation (Ishwaran and James, 2001).

SLIDE 11

Dirichlet Process Mixture Model

$Y_n = (y_1, \dots, y_n)$: $n$ complete $p$-dimensional observations. Assume each variable is standardized.

$z_i \in \{1, \dots, K\}$: component indicator of the $i$-th observation, with probability $\pi_k = P(z_i = k)$.

Each component $k$ follows a multivariate normal (MVN) distribution $N(\mu_k, \Sigma_k)$.

Mixture model:

$$y_i \mid z_i, \mu, \Sigma \sim N(y_i \mid \mu_{z_i}, \Sigma_{z_i})$$

$$z_i \mid \pi \sim \text{Multinomial}(\pi_1, \dots, \pi_K)$$
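As a rough illustration of the two-stage mixture model on this slide, the sketch below first draws a component indicator and then draws the observation from that component's multivariate normal. It is a minimal standard-library sketch, not the authors' code: the helper names `cholesky` and `sample_mixture`, and all numeric values, are assumptions made for the example.

```python
import math
import random

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_mixture(n, weights, means, covs, rng):
    """Draw n points: z_i ~ Multinomial(pi), then y_i ~ N(mu_{z_i}, Sigma_{z_i})."""
    factors = [cholesky(c) for c in covs]
    labels, points = [], []
    for _ in range(n):
        z = rng.choices(range(len(weights)), weights=weights)[0]
        eps = [rng.gauss(0.0, 1.0) for _ in means[z]]
        # y = mu + L * eps, where L is the Cholesky factor of Sigma
        y = [means[z][r] + sum(factors[z][r][c] * eps[c] for c in range(r + 1))
             for r in range(len(eps))]
        labels.append(z)
        points.append(y)
    return labels, points

# Hypothetical two-component, two-variable example (illustrative values only).
weights = [0.7, 0.3]
means = [[-0.5, -0.5], [1.0, 1.2]]
covs = [[[0.3, 0.1], [0.1, 0.3]], [[0.5, -0.2], [-0.2, 0.4]]]
labels, points = sample_mixture(1000, weights, means, covs, random.Random(0))
```

In the model on the slides the parameters are not fixed as they are here; they are inferred from the respondents' data.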

SLIDE 12

Prior specification

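Components:

$$\mu_k \mid \Sigma_k \sim N(\mu_0, h^{-1}\Sigma_k), \qquad \Sigma_k \sim \text{IW}(f, \Phi)$$

$$\Phi = \text{diag}(\phi_1, \dots, \phi_p), \qquad \phi_j \sim \text{Gamma}(a_\phi, b_\phi)$$

Hyperparameters: $a_\phi = b_\phi = 0.25$, $\mu_0 = 0$, degrees of freedom $f = p + 1$, $h = 1$.

Alternative: $\Sigma_k = \sigma I_p$ for all $k$, with $\sigma > 0$, to control the size of the clusters.

Stick-breaking representation for the weights:

$$\pi_k = v_k \prod_{g<k} (1 - v_g), \qquad k = 1, \dots, K$$

$$v_k \sim \text{Beta}(1, \alpha) \ \text{for } k = 1, \dots, K - 1; \qquad v_K = 1$$

$$\alpha \sim \text{Gamma}(a_\alpha, b_\alpha), \qquad a_\alpha = b_\alpha = 0.25$$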
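The truncated stick-breaking construction can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `stick_breaking_weights` is hypothetical, and the concentration parameter is fixed at 1.0 for the example, whereas the slides place a Gamma prior on it.

```python
import random

def stick_breaking_weights(K, alpha, rng):
    """Truncated stick-breaking: v_k ~ Beta(1, alpha) for k < K, and v_K = 1."""
    v = [rng.betavariate(1.0, alpha) for _ in range(K - 1)] + [1.0]
    weights, remaining = [], 1.0
    for v_k in v:
        weights.append(v_k * remaining)  # pi_k = v_k * prod_{g<k} (1 - v_g)
        remaining *= 1.0 - v_k           # length of stick left after component k
    return weights

rng = random.Random(1)
alpha = 1.0  # assumption: fixed here; the slides use alpha ~ Gamma(0.25, 0.25)
pi = stick_breaking_weights(K=10, alpha=alpha, rng=rng)
```

Setting the last stick fraction to 1 assigns whatever remains of the stick to component K, so the K weights sum to one.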

SLIDE 13

Imputation under MNAR

MAR: Generate imputed data from the posterior predictive distribution.

MNAR: Generate imputed data from the altered posterior predictive distribution.

[Diagram: respondents' data $D_R$ → mixture model $(\mu, \Sigma, \pi)$ → replace $\pi$ with $\pi^*$, which reflects a hypothesis for the nonrespondents' pattern → imputation of the nonrespondents' data $D_{NR}$.]
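One way to picture the altered weights $\pi^*$ is as a tilted version of the fitted weights $\pi$. The slides do not give a formula for $\pi^*$, so the exponential tilt below, the function name `altered_weights`, the `tilt` parameter, and the use of a per-cluster score to rank clusters are all assumptions made for illustration.

```python
import math
import random

def altered_weights(weights, cluster_scores, tilt):
    """Hypothetical pi*: reweight clusters by rank, then renormalize.

    tilt > 0 favors top-ranked clusters, tilt < 0 favors bottom-ranked
    clusters, and tilt = 0 leaves the weights unchanged (the MAR case).
    """
    order = sorted(range(len(weights)), key=lambda k: cluster_scores[k])
    rank = {k: r for r, k in enumerate(order)}  # 0 = bottom-ranked cluster
    top = max(len(weights) - 1, 1)
    raw = [w * math.exp(tilt * rank[k] / top) for k, w in enumerate(weights)]
    total = sum(raw)
    return [r / total for r in raw]

pi = [0.5, 0.3, 0.2]        # illustrative fitted mixture weights
scores = [-1.0, 0.0, 1.5]   # illustrative ranking scores, e.g. a component mean
pi_mar = altered_weights(pi, scores, tilt=0.0)
pi_top = altered_weights(pi, scores, tilt=2.0)
pi_bottom = altered_weights(pi, scores, tilt=-2.0)

# Component labels for 200 nonrespondents under the "top" hypothesis; each
# imputed record would then be drawn from N(mu_z, Sigma_z) for its label z.
z_nonresp = random.Random(2).choices(range(len(pi)), weights=pi_top, k=200)
```

Only the weights change between scenarios; the component means and covariances are still the ones estimated from the respondents.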

SLIDE 20

Sensitivity Analysis

  • We need to consider different plausible missingness scenarios for sensitivity analysis.
  • For each scenario $s$, specify the mixture probabilities $\pi^{*(s)}$.
  • Consider a hypothetical population generated by imputing the nonrespondents following each scenario.
  • Evaluate the impact on inferences if we collect follow-up samples (FUS) with varying sizes.

FUS size: $n_F = \delta \, n_{MAX}$, where $\delta \in [0, 1]$ and $n_{MAX}$ is the maximum sample size given the budget.

SLIDE 21

Imputation

For each scenario $s$, generate $m_P$ hypothetical populations by multiple imputation with $\pi^{*(s)}$, using the mixture model parameters $(\mu, \Sigma, \pi^{*(s)})$:

$$\tilde{D}_{NR}^{(s,1)}, \ \tilde{D}_{NR}^{(s,2)}, \ \dots, \ \tilde{D}_{NR}^{(s,m_P)}$$

Each hypothetical population pairs the respondents' data $D_R$ with one imputed set of nonrespondents:

$$\left(D_R, \tilde{D}_{NR}^{(s,1)}\right), \ \left(D_R, \tilde{D}_{NR}^{(s,2)}\right), \ \dots, \ \left(D_R, \tilde{D}_{NR}^{(s,m_P)}\right)$$

SLIDE 23

Imputation

For each scenario $s$ and each imputation $j$, consider a follow-up sample of size $\delta \cdot n_{NR}$. This splits the imputed nonrespondents $\tilde{D}_{NR}^{(s,j)}$ into:

  • $D_{F,\delta}^{(s,j)}$: the follow-up sample;
  • $D_{NF}$: the nonrespondents not followed up.

Option A) and Option B): fit a new model $(\mu, \Sigma, \pi)$ and use multiple imputation (MAR) to generate $m_F$ imputed sets for the nonrespondents not followed up:

$$\tilde{D}_{NF}^{(s,j,1)}, \ \tilde{D}_{NF}^{(s,j,2)}, \ \dots, \ \tilde{D}_{NF}^{(s,j,m_F)}$$
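The follow-up step can be mimicked by drawing a fraction $\delta$ of the imputed nonrespondents and setting the rest aside for re-imputation. The sketch below assumes simple random sampling without replacement, which the slides do not state; `split_follow_up` is a hypothetical helper name and the data are simulated.

```python
import random

def split_follow_up(imputed_nonrespondents, delta, rng):
    """Split imputed nonrespondents into a follow-up sample and the remainder."""
    n_nr = len(imputed_nonrespondents)
    n_f = round(delta * n_nr)                   # follow-up size delta * n_NR
    chosen = set(rng.sample(range(n_nr), n_f))  # assumption: SRS without replacement
    followed = [row for i, row in enumerate(imputed_nonrespondents) if i in chosen]
    not_followed = [row for i, row in enumerate(imputed_nonrespondents) if i not in chosen]
    return followed, not_followed

rng = random.Random(3)
# Stand-in for one imputed nonrespondent data set with two variables.
d_nr = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(40)]
d_f, d_nf = split_follow_up(d_nr, delta=0.25, rng=rng)
```

Repeating the split over a grid of $\delta$ values, for every scenario and imputation, gives the inputs to the comparison on the following slides.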

SLIDE 29

Imputation

Compare the data sets

$$P^{(s,j)} \quad \text{and} \quad \left\{ \tilde{D}_{\delta}^{(s,j,1)}, \dots, \tilde{D}_{\delta}^{(s,j,m_F)} \right\}$$

for each scenario $s$, and for all imputations $j = 1, \dots, m_P$.

SLIDE 30

Utility measures

Propensity scores:

  • Used in observational studies to match covariate characteristics and reduce the impact of confounding factors.
  • The propensity score is the probability of being assigned to the treatment group $T$ given the variables $x$: $e(x) = P(T = 1 \mid x)$.
  • Calculate the propensity score on the merged data set consisting of the population $P^{(s,j)}$ (with $T = 1$) and the data set $\tilde{D}_{\delta}^{(s,j,l)}$ (with $T = 0$) (Woo et al., 2009).
  • Use generalized additive models (GAM), where the linear component of the regression is replaced by a flexible additive function, such as splines.

SLIDE 31

Utility measures

Based on the predicted values of the propensity scores calculated on the merged data set of size $2N$.

Measure $\rho$: let the summary measure be

$$\rho_{\delta}^{(s,j,l)} = \frac{\sum_{i=1}^{2N} (\hat{e}_i - 0.5)^2}{2N}$$

for each value of $\delta$, scenario $s$, population $j$, and imputation $l$. The predicted values should be around 0.5 if the two data sets are comparable.
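The utility measure can be sketched end to end on toy data. To stay within the standard library, the sketch swaps the GAM for a plain logistic regression fitted by gradient ascent; that substitution, the function names, and the one-variable toy data are assumptions of the example and not part of the method described on the slides.

```python
import math

def fit_logistic(xs, ts, steps=500, lr=0.5):
    """Logistic regression by gradient ascent (a simple stand-in for the GAM)."""
    p = len(xs[0])
    beta = [0.0] * (p + 1)  # intercept followed by one slope per variable
    n = len(xs)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for x, t in zip(xs, ts):
            eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
            resid = t - 1.0 / (1.0 + math.exp(-eta))
            grad[0] += resid
            for j, v in enumerate(x):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity_scores(xs, beta):
    """Predicted probabilities e_i = P(T = 1 | x_i) under the fitted model."""
    return [1.0 / (1.0 + math.exp(-(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))))
            for x in xs]

def rho(scores):
    """Mean squared deviation of the predicted propensity scores from 0.5."""
    return sum((e - 0.5) ** 2 for e in scores) / len(scores)

def utility(population, completed):
    """Merge the two data sets (T = 1 and T = 0), fit the model, return rho."""
    xs = population + completed
    ts = [1] * len(population) + [0] * len(completed)
    return rho(propensity_scores(xs, fit_logistic(xs, ts)))

# Toy one-variable data: an identical copy versus a copy shifted by one unit.
population = [[i / 10.0] for i in range(-20, 21)]
rho_same = utility(population, [list(x) for x in population])
rho_shift = utility(population, [[x[0] + 1.0] for x in population])
```

When the two data sets are indistinguishable the fitted scores stay at 0.5 and the measure is essentially zero; the shifted copy gives a clearly larger value.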

SLIDE 32

Illustration with Census of Manufactures Data

SLIDE 33

Illustration with Census of Manufactures Data

Variables: total value of shipments (TVS), total employment (TE), and salary/wages (SW).

Industry: plastics products manufacturing.

Scenarios:

  • MAR;
  • MNAR with higher probabilities for bottom-ranked clusters;
  • MNAR with higher probabilities for top-ranked clusters.

SLIDE 34

Plastic industry - MAR scenario

The values are log transformed and standardized.

SLIDE 35

Plastic industry - MNAR scenario with higher probabilities for bottom ranked clusters

SLIDE 36

Plastic industry - MNAR scenario with higher probabilities for top ranked clusters

SLIDE 37

Plastic industry

[Figure: utility measure $\rho$ versus follow-up fraction $\delta$ (0 to 1) for the MAR, Bottom, and Top scenarios, in two panels labeled $D_R \cup D_{F,\delta}$ and $D_{F,\delta}$.]

SLIDE 38

Conclusions

SLIDE 39

Conclusions

Imputation under MNAR:

  • Flexible model that is able to capture different features of the data.
  • Under MNAR, the missing data distribution is unknown. The method works for different levels of prior information.
  • Interface to facilitate sensitivity analysis.

Adaptive Design:

  • Provides a framework to evaluate the impact of extreme scenarios on the results of the imputation.
  • Through sensitivity analysis, the user can evaluate the costs and benefits of collecting more data.
  • Future work: formal decision rule.

SLIDE 40

Thank you! Obrigada!

thaispaiva@est.ufmg.br

SLIDE 41

References

Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453).

Paiva, T. and Reiter, J. P. (2017). Stop or continue data collection: A nonignorable missing data approach for continuous variables. Journal of Official Statistics, 33(3):579–599.

Rao, R. S., Glickman, M. E., and Glynn, R. J. (2008). Stopping rules for surveys with multiple waves of nonrespondent follow-up. Statistics in Medicine, 27(12):2196–2213.

Wagner, J. and Raghunathan, T. E. (2010). A new stopping rule for surveys. Statistics in Medicine, 29(9):1014–1024.

Woo, M.-J., Reiter, J. P., Oganian, A., and Karr, A. F. (2009). Global measures of data utility for microdata masked for disclosure limitation. Journal of Privacy and Confidentiality, 1(1):7.