
Stop or Continue Data Collection: A Nonignorable Missing Data Approach for Continuous Variables
Thaís Paiva (Federal University of Minas Gerais, Brazil) and Jerry Reiter (Duke University)
March 14, 2018
JOS and DC-AAPOR Workshop on Responsive and Adaptive Survey Design

Outline
• Introduction
• Methodology
• Illustration with Census of Manufactures Data
• Conclusions
• References

This research was supported by the NSF NCRN grant (SES-11-31897) awarded to Duke University. Any opinions and conclusions expressed in this article are those of the authors and do not necessarily represent the views of the U.S. Census Bureau. All results have been reviewed to ensure that no confidential information is disclosed.

Introduction

Motivation
Census of Manufactures Data
• Survey administered by the U.S. Census Bureau annually, with sample estimates of statistics for all manufacturing establishments with one or more paid employees.
• Provides statistics on employment, payroll, cost of materials consumed, operating expenses, value of shipments, etc.

Adaptive Design
• Methods that use auxiliary information to tailor and update the sampling scheme throughout the survey:
  • administrative records;
  • paradata (data about the data collection process);
  • actual responses as they are collected.
• Changes to the survey design can be applied to individuals or to the entire survey.
• In an ongoing survey, decide whether to:
  1. stop the data collection, or
  2. invest in collecting more data.

Decision rule
How to decide to stop or not?
• Information measure: How different is the non-respondents' distribution from the respondents'?
• Cost measure: How much does it cost to collect more data and what is the budget?

Stopping Rules
Rao et al. (2008): stopping rules for surveys with multiple waves, for binary response variables.
• Based on standardized differences of the response proportions at each wave, where the proportions are estimated with multiple imputation of the nonresponses.
Wagner and Raghunathan (2010): stopping rules based on the probability of additional data changing the estimates, also for binary response variables.
• Compares the estimates obtained if data collection stops with the estimates obtained if a follow-up sample is collected.

Methodology

Methodology
Model for the observed data
• Continuous multivariate data.
• The variables are likely correlated, with heavily skewed distributions.
• The model has to be flexible enough to capture the distributional features of the data.
➡
• Mixture of multivariate normal distributions.
• Dirichlet Process prior to allow for more flexibility and better density estimation (Ishwaran and James, 2001).

Dirichlet Process Mixture Model
• $Y_n = (y_1, \ldots, y_n)$: $n$ complete $p$-dimensional observations. Assume each variable is standardized.
• $z_i \in \{1, \ldots, K\}$: component indicator of the $i$-th observation, with probability $\pi_k = P(z_i = k)$.
• Each component $k$ follows a multivariate normal (MVN) distribution $N(\mu_k, \Sigma_k)$.
Mixture model:
$$y_i \mid z_i, \mu, \Sigma \sim N(y_i \mid \mu_{z_i}, \Sigma_{z_i})$$
$$z_i \mid \pi \sim \mathrm{Multinomial}(\pi_1, \ldots, \pi_K)$$
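Summing over the component indicator $z_i$ in the two lines above gives the usual marginal mixture density. This is a standard identity for finite mixtures, written here only to make the model concrete; it uses no symbols beyond those already defined on this slide.

```latex
f(y_i \mid \pi, \mu, \Sigma)
  = \sum_{k=1}^{K} P(z_i = k)\, N(y_i \mid \mu_k, \Sigma_k)
  = \sum_{k=1}^{K} \pi_k \, N(y_i \mid \mu_k, \Sigma_k),
\qquad \sum_{k=1}^{K} \pi_k = 1 .
```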

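Prior specification
Components:
$$\mu_k \mid \Sigma_k \sim N(\mu_0, h^{-1} \Sigma_k)$$
$$\Sigma_k \sim \mathrm{IW}(f, \Phi)$$
$$\Phi = \mathrm{diag}(\phi_1, \ldots, \phi_p), \qquad \phi_j \sim \mathrm{Gamma}(a_\phi, b_\phi)$$
Hyperparameters: $\mu_0 = 0$, $h = 1$, degrees of freedom $f = p + 1$, $a_\phi = b_\phi = 0.25$.
Alternative: $\Sigma_k = \sigma I_p$ for all $k$, with $\sigma > 0$ ⇒ to control the size of the clusters.
Stick-breaking representation for the weights:
$$\pi_k = v_k \prod_{g < k} (1 - v_g) \quad \text{for } k = 1, \ldots, K$$
$$v_k \sim \mathrm{Beta}(1, \alpha) \quad \text{for } k = 1, \ldots, K - 1; \qquad v_K = 1$$
$$\alpha \sim \mathrm{Gamma}(a_\alpha, b_\alpha), \qquad a_\alpha = b_\alpha = 0.25$$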
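The truncated stick-breaking construction can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `stick_breaking_weights` and the values of `K` and `alpha` are assumptions, and `alpha` is held fixed here instead of being drawn from its Gamma prior.

```python
import random

def stick_breaking_weights(K, alpha, rng):
    """Return K mixture weights from a truncated stick-breaking prior."""
    # v_k ~ Beta(1, alpha) for k = 1..K-1, and v_K = 1 so the weights sum to one.
    v = [rng.betavariate(1.0, alpha) for _ in range(K - 1)] + [1.0]
    weights = []
    remaining = 1.0  # length of the stick not yet assigned: prod_{g<k} (1 - v_g)
    for v_k in v:
        weights.append(v_k * remaining)
        remaining *= 1.0 - v_k
    return weights

rng = random.Random(2018)          # fixed seed, for reproducibility only
pi = stick_breaking_weights(K=10, alpha=1.0, rng=rng)
```

Smaller values of `alpha` put most of the mass on the first few components, so fewer clusters carry appreciable weight.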

Imputation under MNAR
Generate imputed data from the:
• MAR ➠ posterior predictive distribution;
• MNAR ➠ altered posterior predictive distribution.
Steps under MNAR:
1. Respondents $D_R$ ➠ mixture model with parameters $\mu$, $\Sigma$, $\pi$.
2. Replace $\pi$ with $\pi^*$, chosen to reflect a hypothesis for the non-respondents' pattern.
3. Imputation of the non-respondents $D_{NR}$ using $\mu$, $\Sigma$, $\pi^*$.
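A minimal sketch of the altered-weights idea follows. It is not the authors' code and makes several assumptions: the component means, the common variance `sigma` (the spherical alternative $\Sigma_k = \sigma I_p$), and both weight vectors are made-up fixed values, whereas a real analysis would use posterior draws from the model fitted to the respondents. The function name `impute_nonrespondents` is hypothetical.

```python
import math
import random

def impute_nonrespondents(n_nr, weights, means, sigma, rng):
    """Draw n_nr records from the mixture, using the supplied component weights."""
    sd = math.sqrt(sigma)  # Sigma_k = sigma * I_p, so sigma is a variance
    components = range(len(weights))
    imputed = []
    for _ in range(n_nr):
        k = rng.choices(components, weights=weights, k=1)[0]  # component label
        imputed.append([rng.gauss(m, sd) for m in means[k]])   # record given label
    return imputed

def mean_of_first_variable(rows):
    return sum(row[0] for row in rows) / len(rows)

rng = random.Random(1)
means = [[-1.0, -1.0, -1.0], [0.0, 0.0, 0.0], [1.5, 1.5, 1.5]]  # illustrative values
pi_hat = [0.5, 0.3, 0.2]    # stand-in for weights estimated from respondents (MAR)
pi_star = [0.2, 0.3, 0.5]   # altered weights favouring the last component (MNAR)

mar_draws = impute_nonrespondents(2000, pi_hat, means, 0.25, rng)
mnar_draws = impute_nonrespondents(2000, pi_star, means, 0.25, rng)
```

Only the weights differ between the two calls, so any shift in the imputed values comes from the hypothesis encoded in `pi_star`, not from a change in the component shapes.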

Sensitivity Analysis
• We need to consider different plausible missingness scenarios for sensitivity analysis.
• For each scenario $s$, specify the mixture probabilities $\pi^{*(s)}$.
• Consider a hypothetical population generated by imputing the nonrespondents following each scenario.
• Evaluate the impact on inferences if we collect follow-up samples (FUS) of varying sizes.
FUS size: $n_F = \delta \, n_{MAX}$, where $\delta \in [0, 1]$ and $n_{MAX}$ is the maximum sample size given the budget.
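One possible way to write down scenario-specific weights is sketched below. The slides do not say how $\pi^{*(s)}$ was constructed, so the exponential tilting used here, the function name `tilt_weights`, the tilt parameter `gamma`, and the cluster ranks are all assumptions made purely for illustration.

```python
import math

def tilt_weights(pi, ranks, gamma):
    """Return weights proportional to pi_k * exp(gamma * rank_k), normalised to sum to one."""
    raw = [p * math.exp(gamma * r) for p, r in zip(pi, ranks)]
    total = sum(raw)
    return [w / total for w in raw]

pi = [0.5, 0.3, 0.2]   # illustrative respondent weights
ranks = [0, 1, 2]      # hypothetical ranking: 0 = bottom cluster, 2 = top cluster

scenarios = {
    "MAR": tilt_weights(pi, ranks, gamma=0.0),      # no tilt: weights unchanged
    "Bottom": tilt_weights(pi, ranks, gamma=-1.0),  # more mass on bottom-ranked clusters
    "Top": tilt_weights(pi, ranks, gamma=1.0),      # more mass on top-ranked clusters
}
```

Setting `gamma` to zero recovers the respondent weights, which gives a convenient MAR baseline for the sensitivity analysis.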

Imputation
For each scenario $(s)$, generate $m_P$ hypothetical populations by imputation with $\pi^{*(s)}$:
• Multiple imputation using $\mu$, $\Sigma$ (from the model for $D_R$) and $\pi^{*(s)}$ gives the imputed non-respondent data sets $\tilde{D}_{NR}^{(s,1)}, \tilde{D}_{NR}^{(s,2)}, \ldots, \tilde{D}_{NR}^{(s,m_P)}$.
• Each hypothetical population combines the respondents with one imputed set: $D_R \cup \tilde{D}_{NR}^{(s,j)}$, for $j = 1, \ldots, m_P$.

Imputation
For each scenario $(s)$ and for each imputation $j$, consider a follow-up sample of size $\delta$:
• Start from the hypothetical population $D_R \cup \tilde{D}_{NR}^{(s,j)}$.
• Draw the follow-up sample $D_{F,\delta}^{(s,j)}$, of size $\delta * n_{NR}$, from $\tilde{D}_{NR}^{(s,j)}$; the remaining non-respondents form $D_{NF}$.
• Option A): new model ($\mu$, $\Sigma$, $\pi$) based on $D_R \cup D_{F,\delta}^{(s,j)}$.
• Option B): new model ($\mu$, $\Sigma$, $\pi$) based on $D_{F,\delta}^{(s,j)}$ only.
• Under either option, multiple imputation (MAR) of $D_{NF}$ with the new model gives $\tilde{D}_{NF}^{(s,j,1)}, \tilde{D}_{NF}^{(s,j,2)}, \ldots, \tilde{D}_{NF}^{(s,j,m_F)}$.
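The follow-up step can be sketched as a split of the imputed non-respondents into a followed-up part and a remainder. This is an illustration under assumptions: simple random sampling without replacement is assumed because the slides do not state the sampling design, and `draw_follow_up` is a hypothetical helper name.

```python
import random

def draw_follow_up(imputed_nr, delta, rng):
    """Split imputed non-respondents into a follow-up sample and the remainder."""
    n_follow = round(delta * len(imputed_nr))  # follow-up size: delta * n_NR
    order = list(range(len(imputed_nr)))
    rng.shuffle(order)                         # simple random sampling (assumed)
    followed = [imputed_nr[i] for i in order[:n_follow]]
    not_followed = [imputed_nr[i] for i in order[n_follow:]]
    return followed, not_followed

rng = random.Random(14)
imputed_nr = [[float(i)] for i in range(100)]  # placeholder records, one variable each
followed, not_followed = draw_follow_up(imputed_nr, delta=0.25, rng=rng)
```

The records in `not_followed` would then be set to missing and re-imputed under MAR with whichever new model (Option A or Option B) is being evaluated.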

Imputation
Compare the data sets
$$P^{(s,j)} \quad \text{and} \quad \left\{ \tilde{D}_\delta^{(s,j,1)}, \ldots, \tilde{D}_\delta^{(s,j,m_F)} \right\}$$
for each scenario $(s)$, and for all imputations $j = 1, \ldots, m_P$.

Utility measures
Propensity scores:
• Used in observational studies to match covariate characteristics and reduce the impact of confounding factors.
• The propensity score is the probability of being assigned to the treatment group $T$ given the variables $x$: $e(x) = P(T = 1 \mid x)$.
• Calculate the propensity score on the merged data set consisting of the population $P^{(s,j)}$ (with $T = 1$) and the data set $\tilde{D}_\delta^{(s,j,l)}$ (with $T = 0$) (Woo et al., 2009).
• Use generalized additive models (GAM), where the linear component of the regression is replaced by a flexible additive function, such as splines.

Utility measures
Based on the predicted values of the propensity scores calculated on the merged data set of size $2N$.
Measure $\rho$: let the summary measure be
$$\rho_\delta^{(s,j,l)} = \frac{\sum_{i=1}^{2N} (\hat{e}_i - 0.5)^2}{2N}$$
for each value of $\delta$, scenario $s$, population $j$, and imputation $l$. The predicted values should be around 0.5 if the two data sets are comparable.
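The measure $\rho$ can be sketched end to end on toy data. Note the substitution: the slides use a GAM for the propensity model, while this sketch swaps in a plain logistic regression fitted by gradient ascent so that it runs with the standard library alone. All function names, the simulated data, and the tuning values (`steps`, `lr`) are assumptions.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear_predictor(beta, x):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def fit_logistic(X, t, steps=300, lr=0.5):
    """Fit P(T = 1 | x) by gradient ascent on the mean log-likelihood."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)  # beta[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for x, ti in zip(X, t):
            resid = ti - sigmoid(linear_predictor(beta, x))
            grad[0] += resid
            for j, v in enumerate(x):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def rho(scores):
    """Mean squared deviation of the predicted propensity scores from 0.5."""
    return sum((e - 0.5) ** 2 for e in scores) / len(scores)

def rho_for(population, completed):
    """Merge the two data sets (T = 1 and T = 0), fit the model, and compute rho."""
    X = population + completed
    t = [1] * len(population) + [0] * len(completed)
    beta = fit_logistic(X, t)
    return rho([sigmoid(linear_predictor(beta, x)) for x in X])

def draw(shift, n, rng):
    """Toy two-variable data set: independent normals centred at `shift`."""
    return [[rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)] for _ in range(n)]

rng = random.Random(7)
population = draw(0.0, 300, rng)
rho_similar = rho_for(population, draw(0.0, 300, rng))  # same distribution
rho_shifted = rho_for(population, draw(1.0, 300, rng))  # shifted distribution
```

When the completed data set resembles the population, the model cannot tell the two apart and the scores stay near 0.5, so `rho_similar` is close to zero; the shifted data set yields a larger value. A linear logistic model can miss differences that a GAM with splines would detect, which is one reason to prefer the more flexible model in practice.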

Illustration with Census of Manufactures Data

Illustration with Census of Manufactures Data
• Variables: total value of shipments (TVS), total employment (TE), and salary/wages (SW).
• Industry: plastics products manufacturing.
• Scenarios: MAR; MNAR with higher probabilities for bottom-ranked clusters; MNAR with higher probabilities for top-ranked clusters.

Plastic industry - MAR scenario
The values are log transformed and standardized.

Plastic industry - MNAR scenario with higher probabilities for bottom-ranked clusters

Plastic industry - MNAR scenario with higher probabilities for top-ranked clusters

Plastic industry
[Figure: two panels, $D_R \cup D_{F,\delta}$ and $D_{F,\delta}$, each plotting $\rho$ against $\delta$ (from 0 to 1) for the MAR, Bottom, and Top scenarios.]

Conclusions
