Testing and Imputing Item Nonresponse as Missing Data, with Big and Normal Survey Data



slide-1
SLIDE 1

Testing and Imputing Item Nonresponse as Missing Data, with Big and Normal Survey Data

NATALIE JACKSON Director of Research PRRI JEFF GILL Distinguished Professor, Government, Mathematics & Statistics American University

slide-2
SLIDE 2

Overview

◮ Types of item nonresponse: Two types of survey item nonresponse matter, distinguished by whether the respondent actually has an opinion or not
◮ Implications for missing data treatment: This distinction matters for missing data classification and treatment, and for how we think about data structures
◮ Data: The ANES sets up a partial test of this theory
◮ Distinguishing between types of item nonresponse: Some models demonstrate a difference between types of nonresponse, including after imputation
◮ Limitations and what’s next: This is still a work in progress

slide-3
SLIDE 3

Types of Item Nonresponse: Sources in the Survey

◮ Don’t Know: In theory, this response comes from respondents who truly do not know the answer to a question
  • In practice, it is not clear whether respondents choose “don’t know” because they really don’t have an attitude, because none of the answer options fit, or because they don’t want to answer
◮ Refuse: In theory, this response originates from respondents who do not want to provide their answer to the question
  • Incidence of outright refusals is generally very low. In practice, reasons for refusing are as murky as with “don’t know”
◮ Skip: This response is completely ambiguous and frequently occurs on web surveys. We have no insight into the motivation for skipping an item

slide-4
SLIDE 4

Types of Item Nonresponse: Two Types that Matter

◮ The ambiguity in “don’t know”, refuse, and skip provides little help for analysis.
◮ A more useful typology of item nonresponse is:
  • True Nonattitudes: The respondent truly does not have the opinion or factual knowledge required to answer the question, often for knowable reasons such as a lack of education, experience, or interest in the topic
  • Hidden Attitudes: The respondent has an opinion or the factual knowledge required to answer the question, but for some (often unknowable) reason, they choose not to disclose it

slide-5
SLIDE 5

Types of Item Nonresponse: Two Types that Matter

◮ Implications of two types:
  • The reason for item nonresponse cannot be assumed consistent across, or even within, respondents:
    – The type of missingness can change within a single variable due to different respondents
    – It can also change within respondents, as they will have distinct reasons for not answering different questions
  • Result: We’re no longer thinking about rows of respondents and columns of variables. We’re thinking about every cell independently

slide-6
SLIDE 6

Implications for Missing Data Treatment: Reminder of Missing Data Types

◮ Goal: $L(\beta \mid X, Y)$ unbiased, efficient, and with the correct standard error.
◮ Define: $Z_{mis} = (X_{mis}, Y_{mis})$ and $Z_{obs} = (X_{obs}, Y_{obs})$
◮ We stipulate an $n \times k$ matrix, $R$, corresponding to $X$, that contains 0 when the $X$ matrix data value is not missing and 1 when it is missing.
◮ Stipulate a probability model for $R$, where $\phi$ is a parameter in this distribution of $R$.
◮ Standard terms from Rubin (1979) for the missing data:
  • MCAR: $p(R \mid Z_{obs}, Z_{mis}) = p(R \mid \phi)$, missingness not related to observed or unobserved data
  • MAR: $p(R \mid Z_{obs}, Z_{mis}) = p(R \mid Z_{obs}, \phi)$, missingness depends only on observed data
  • Non-Ignorable: $p(R \mid Z_{obs}, Z_{mis}) = p(R \mid Z_{obs}, Z_{mis}, \phi)$, missingness depends on unobserved data
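To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the function names, the single covariate, and the logistic coefficients are invented and do not come from the ANES analysis. It draws a missingness indicator for an outcome under each mechanism, then reports the complete-case mean of the outcome and the complete-case regression slope.

```python
import numpy as np


def simulate_mechanisms(n=5000, seed=0):
    """Simulate an always-observed covariate x and an outcome y, then draw
    a missingness indicator for y (True = missing, like a 1 in R) under
    three illustrative mechanisms. All coefficients are arbitrary."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = x + rng.normal(size=n)  # true slope of y on x is 1

    def logistic(t):
        return 1.0 / (1.0 + np.exp(-t))

    r = {
        "MCAR": rng.random(n) < 0.3,                        # ignores x and y
        "MAR": rng.random(n) < logistic(-1.0 + 1.5 * x),    # depends on observed x
        "NMAR": rng.random(n) < logistic(-1.0 + 1.5 * y),   # depends on y itself
    }
    return x, y, r


def complete_case_slope(x, y, missing):
    """Least-squares slope of y on x using only rows where y is observed."""
    observed = ~missing
    return np.polyfit(x[observed], y[observed], 1)[0]


x, y, r = simulate_mechanisms()
for name, missing in r.items():
    mean_obs = float(y[~missing].mean())
    slope = float(complete_case_slope(x, y, missing))
    print(f"{name}: complete-case mean={mean_obs:.2f}, slope={slope:.2f}")
```

In this toy setup the MAR case shifts the complete-case mean of y but leaves the regression on x close to the truth, because missingness depends only on the conditioning variable. The non-ignorable case shifts the mean of y as well, and conditioning on x does not repair it.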

slide-7
SLIDE 7

Implications for Missing Data Treatment: Two Types of Item Nonresponse

True Nonattitudes
◮ When people genuinely don’t know the answer to a factual question or don’t have an attitude to report, typically it’s because they lack education, experience, or interest in the topic
◮ Those factors are generally measured in a survey. Thus, missingness caused by a true nonattitude could be MAR

Hidden Attitudes
◮ Social desirability, the interviewer interaction, views of the topic as private, and many other possible justifications exist for why a respondent would decline to state an attitude, in addition to education, experience, and interest.
◮ Given that we do not/cannot measure these factors in the survey, hidden attitudes are likely Non-Ignorable/NMAR

slide-8
SLIDE 8

Implications for Missing Data Treatment: Two Types of Item Nonresponse

Analysis questions:
◮ Are these two causes of missing data discernible in survey data: Can we model the difference between true nonattitudes and hidden attitudes?
◮ Can we design a model that identifies whether a missing cell is likely to be a true nonattitude or a hidden attitude?

slide-9
SLIDE 9

Data: ANES 2012, Face-to-Face Sample

◮ Dataset chosen because of a question setup that allows a more direct test of this typology than usual
◮ Working with the face-to-face sample only
  • 2012 was the first year a web sample was fielded simultaneously, but given mode differences the web portion is not used here
  • It’s unclear how “don’t know” and refuse were handled in the published questionnaires, but it appears explicit options were not given
  • Item nonresponse is generally much lower in the web sample, so it is not a clear comparison to the face-to-face sample

slide-10
SLIDE 10

Data: The Test Questions

◮ A two-question setup on the respondent ideology question:
◮ libcpre_self asks all respondents to place themselves on the standard 7-point ideology scale (1=extremely liberal) and provides “haven’t thought much about it” as an out
  • 590/2054 respondents did not give a substantive response
◮ libcpre_choose asks those who said “moderate,” “haven’t thought much about it,” or gave any other nonresponse whether, if they had to choose, they would select liberal or conservative
  • 128 respondents still did not give a substantive response when libcpre_choose is combined with libcpre_self
  • 462 respondents gave a substantive answer to libcpre_choose but not to libcpre_self
slide-11
SLIDE 11

Data: The Test Questions

◮ The 128 respondents who didn’t answer the ideology question twice can be thought of as the “true nonattitudes”
  • It’s possible they are just really stubborn hidden attitudes, but the text of the libcpre_choose question, “if you had to choose”, is designed to ferret those out. We’re making the assumption that it worked
◮ The 462 respondents who answered on the second round can be thought of as the “hidden attitudes”
  • It’s possible a few true nonattitudes provided answers in order to mollify interviewers. We’re making the assumption that those cases are few
◮ These are important assumptions that we aren’t able to definitively prove, yet there is some evidence in the data to support them
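The two-question logic can be written down as a small classification rule. The sketch below is illustrative only: the response codes (integers 1 to 7 for the scale, the strings "liberal" and "conservative" for the follow-up, anything else treated as nonresponse) are hypothetical stand-ins, not the actual ANES codes, and the toy respondents are invented.

```python
from collections import Counter

# Hypothetical coding: 1..7 is a substantive answer on the ideology scale;
# anything else (None, "refused", "haven't thought") is nonresponse.
SUBSTANTIVE_SELF = set(range(1, 8))
SUBSTANTIVE_CHOOSE = {"liberal", "conservative"}


def classify(self_resp, choose_resp):
    """Classify one respondent from the first question and the follow-up."""
    if self_resp in SUBSTANTIVE_SELF:
        return "substantive"          # answered the first question
    if choose_resp in SUBSTANTIVE_CHOOSE:
        return "hidden attitude"      # answered only when pressed
    return "true nonattitude"         # no substantive answer either time


# Invented toy respondents: (first question, follow-up)
respondents = [
    (4, "liberal"),               # moderate, then chose a side
    (2, None),                    # answered the scale, follow-up not asked
    (None, "conservative"),       # skipped, then chose
    ("haven't thought", None),    # no answer either time
    ("refused", "liberal"),       # refused, then chose
]

counts = Counter(classify(s, c) for s, c in respondents)
print(counts)

# The second imputation dataset in the deck drops the true nonattitudes.
kept = [r for r in respondents if classify(*r) != "true nonattitude"]
```

Under this rule the three groups are mutually exclusive, which matches the slide arithmetic: the 590 first-question nonrespondents split into 462 hidden attitudes and 128 true nonattitudes.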

slide-12
SLIDE 12

Distinguishing Between Types of Item Nonresponse: Hidden Attitudes vs. True Nonattitudes

Figure 1: Modeling Nonresponse to Ideology Question

slide-13
SLIDE 13

Distinguishing Between Types of Item Nonresponse: Hidden Attitudes vs. True Nonattitudes

◮ Overall: Nonresponders are more likely to be female, nonwhite, less interested, Democrat/Independent, lower education
◮ True nonattitudes compared to hidden attitudes: Less interested, more likely Independent, maybe less likely nonwhite
◮ Model fit is much better for explaining true nonattitudes than it is for explaining all missingness or the hidden attitudes

slide-14
SLIDE 14

Distinguishing Between Types of Item Nonresponse: Hidden Attitudes vs. True Nonattitudes

◮ Tentative conclusion: We have a much better handle on what causes true nonattitudes, and it hinges on attitudinal variables
◮ Back to our analysis questions:
  • Are these two causes of missing data discernible in survey data: Can we model the difference between true nonattitudes and hidden attitudes? Yes
  • Now we turn to imputation questions: Can we treat all missingness equally given the theoretical distinction between true nonattitudes and hidden attitudes? Is it appropriate to impute a response for analytical purposes when the respondent truly does not have an opinion?

slide-15
SLIDE 15

Effects on Imputed Data: Hot Deck Imputation

◮ R package hot.deck used for imputation given missingness on a categorical variable (libcpre_self is 29% missing)

  • Multiple hot deck imputation for categorical variables (Cranmer & Gill 2013)
  • Uses mice for continuous variables
  • 10 imputations per dataset

◮ Two datasets imputed:

  • All 2054 cases
  • 1926 cases imputed, 128 true nonattitudes dropped
slide-16
SLIDE 16

Effects on Imputed Data: Imputed Model Comparison

Figure 2: Coefficients Across 4 Models

slide-17
SLIDE 17

Effects on Imputed Data: Imputed Model Comparison

Figure 3: Ideology Coefficients Across 4 Models

slide-18
SLIDE 18

Takeaways

◮ There is some important difference between the true nonattitudes and revealed hidden attitudes
◮ There is a clear hierarchy of model fit when those values are imputed (listed here from highest to lowest AIC, per Tables 4 to 7):
  • Imputed all 590 missing cases, Imputed 128 true nonattitudes, Imputed 462 missing cases with 128 deleted, No imputation needed
◮ And this is only for one variable in the model. Scrutinizing all variables would presumably result in more differences
◮ That indicates we should consider not imputing true nonattitudes
◮ At minimum, we need to be aware of the differences in type of item nonresponse that we’re imputing

slide-19
SLIDE 19

Limitations and What’s Next:

◮ Primary contribution thus far: Restructuring how we think of missing data using a realistic typology of what we know about item nonresponse (true nonattitudes vs hidden attitudes). Cell-based thinking rather than typical respondent/question framing
◮ We’ve shown that the traditional methods of imputing can be problematic when type of missingness likely varies
◮ We need to refine the simulations into a reliable way to distinguish true nonattitudes from hidden attitudes
◮ We want to generate practical advice for modeling and imputation decisions

slide-20
SLIDE 20

THANK YOU!

slide-21
SLIDE 21

Appendix: Variable Coding

◮ Coding for nonresponse models
  • For libcpre_self and the combined variable that incorporates both libcpre_self and libcpre_choose, and all subsets of these questions used: 0 if a substantive response, 1 if not
◮ Other variables used in the analysis:
  • Gender: dichotomized, female
  • Interest in politics/elections: 5-pt scale
  • Employed: dichotomized, employed
  • Race: dichotomized, nonwhite
  • Age: numeric
  • Party ID: dichotomized, Dem, Ind
  • Education: 5-pt scale
◮ Handful of missing values on explanatory variables imputed using mice to ensure full use of data

slide-22
SLIDE 22

Distinguishing Between Types of Item Nonresponse: Correlates of Overall Nonresponse

Table 1: Modeling Missing Values at Stage One, Full Sample

              Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)      -1.55        0.34    -4.55      0.00     0.12     0.37
Female            0.28        0.12     2.36      0.01     1.09     1.60
Interest          0.40        0.05     7.43      0.00     1.37     1.63
Employed          0.10        0.13     0.80      0.21     0.90     1.36
Nonwhite          0.50        0.13     3.87      0.00     1.33     2.04
Age              -0.00        0.00    -0.57      0.28     0.99     1.00
Democrat          0.36        0.14     2.59      0.00     1.14     1.80
Independent       1.21        0.18     6.55      0.00     2.48     4.55
Education        -0.53        0.06    -8.94      0.00     0.53     0.65

Null deviance: 2228.22 on 2053 degrees of freedom
Residual deviance: 1886.57 on 2045 degrees of freedom
AIC: 1743.101, Adjusted Degrees of Freedom from 10 Imputations: 556.9445
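A note on reading Table 1: the interval columns do not bracket the coefficient estimates themselves, which suggests they are reported on an exponentiated (odds-ratio) scale. That is an inference from the numbers, not something the slide states. The sketch below, with values copied from Table 1 and a hypothetical helper name, checks whether each exponentiated estimate falls inside its reported interval.

```python
import math

# (estimate, reported lower bound, reported upper bound), copied from Table 1
table1 = {
    "(Intercept)": (-1.55, 0.12, 0.37),
    "Female": (0.28, 1.09, 1.60),
    "Interest": (0.40, 1.37, 1.63),
    "Employed": (0.10, 0.90, 1.36),
    "Nonwhite": (0.50, 1.33, 2.04),
    "Age": (-0.00, 0.99, 1.00),
    "Democrat": (0.36, 1.14, 1.80),
    "Independent": (1.21, 2.48, 4.55),
    "Education": (-0.53, 0.53, 0.65),
}


def odds_ratio(estimate):
    """Exponentiate a logit coefficient to get an odds ratio."""
    return math.exp(estimate)


# For each row, does exp(estimate) lie within the reported bounds?
inside = {
    name: lower <= odds_ratio(est) <= upper
    for name, (est, lower, upper) in table1.items()
}

for name, (est, lower, upper) in table1.items():
    print(f"{name:12s} exp(est)={odds_ratio(est):.2f}  interval=[{lower}, {upper}]")
```

Read this way, the Independent row says that the odds of nonresponse for Independents are roughly three times those of the reference group, with the reported interval running from 2.48 to 4.55.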

slide-23
SLIDE 23

Distinguishing Between Types of Item Nonresponse: True Nonattitudes

Table 2: Modeling Missing Values at Stage One, True Nonattitudes

              Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)      -3.83        0.68    -5.62      0.00     0.01     0.07
Female            0.47        0.22     2.17      0.02     1.12     2.30
Interest          0.47        0.10     4.50      0.00     1.35     1.89
Employed          0.03        0.23     0.12      0.45     0.70     1.50
Nonwhite          0.15        0.23     0.66      0.25     0.80     1.71
Age              -0.00        0.01    -0.64      0.26     0.99     1.01
Democrat          1.23        0.34     3.59      0.00     1.95     6.04
Independent       2.49        0.37     6.76      0.00     6.57    22.06
Education        -0.56        0.11    -5.12      0.00     0.47     0.68

Null deviance: 849.4 on 1591 degrees of freedom
Residual deviance: 667.8 on 1583 degrees of freedom
AIC: 625.0275, Adjusted Degrees of Freedom from 10 Imputations: 86.06016

slide-24
SLIDE 24

Distinguishing Between Types of Item Nonresponse: Hidden Attitudes vs. True Nonattitudes

Table 3: Modeling Missing Values at Stage One, True Nonattitudes vs. Hidden Attitudes

              Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)      -2.36        0.73    -3.22      0.00     0.03     0.32
Female            0.15        0.23     0.65      0.26     0.80     1.70
Interest          0.18        0.10     1.75      0.04     1.01     1.42
Employed          0.10        0.24     0.41      0.34     0.75     1.63
Nonwhite         -0.36        0.25    -1.42      0.08     0.46     1.06
Age              -0.00        0.01    -0.43      0.34     0.99     1.01
Democrat          1.17        0.37     3.15      0.00     1.75     5.92
Independent       1.78        0.38     4.69      0.00     3.18    11.10
Education        -0.15        0.14    -1.06      0.15     0.69     1.08

Null deviance: 531.22 on 589 degrees of freedom
Residual deviance: 494.71 on 581 degrees of freedom
AIC: 465.1297, Adjusted Degrees of Freedom from 10 Imputations: 72.40175

slide-25
SLIDE 25

Effects on Imputed Data: Data Setup

◮ Return to original un-edited dataset
◮ Coding for imputed models
  • libcpre_self and libcpre_combined (the combined variable that incorporates both libcpre_self and libcpre_choose): scale is 1-7, extremely liberal to extremely conservative
  • Outcome variable used is 0-1 vote choice; 1 indicates vote for Obama
◮ Other variables used in the analysis:
  • Gender: dichotomized, female
  • Interest in politics/elections: 5-pt scale
  • Employed: dichotomized, employed
  • Race: dichotomized, nonwhite
  • Age: numeric
  • Party ID: 7-pt scale, 1=Strong Dem
  • Education: 5-pt scale

slide-26
SLIDE 26

Multiple Hot-Deck Imputation, Affinity Score Definition

◮ For each respondent, $y_i$ indicates the outcome variable and $x_i$ is a $k$-length vector of only discrete explanatory variables.
◮ If the $i$th case under consideration has $q_i$ missing values in $x_i$, then a potential donor vector, $x_j$, $j \neq i$, will have between $0$ and $k - q_i$ exact matches with $i$.
◮ Define $z_{i,j}$ as the number of variables for which the potential donor $j$ and the recipient $i$ have different values.
◮ Thus $k - q_i - z_{i,j}$ is the number of variables on which $j$ and $i$ are perfectly matched.
◮ This value, scaled by the highest number of possible matches ($k - q_i$), is then the affinity score:

  $$\alpha_{i,j} = \frac{k - q_i - z_{i,j}}{k - q_i}. \qquad (1)$$

◮ The affinity score has the desirable properties that $\alpha_{i,j} = 1$ for $i \in D_R$ (data with responses) and $\alpha_{i,j} = 0$ for $i \in D_{NR}$ (data missing responses).
◮ Cases where the recipient and the donor are both missing values in the same covariate are deducted from $k$ and $q_i$ prior to the calculation of $\alpha_{i,j}$.
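As a worked illustration of equation (1), here is a minimal sketch of the affinity score for one recipient and one potential donor. The function name and the use of `None` for a missing value are assumptions of this sketch, and it is not the hot.deck package's code. It also assumes that a donor value missing where the recipient is observed counts as a difference, which the slide does not specify.

```python
def affinity_score(recipient, donor):
    """Affinity of a potential donor to a recipient, following equation (1).

    recipient, donor: equal-length sequences of discrete values,
    with None marking a missing value (an assumption of this sketch).
    """
    k = len(recipient)
    q = sum(r is None for r in recipient)

    # Covariates missing in BOTH rows are deducted from k and q.
    both_missing = sum(r is None and d is None for r, d in zip(recipient, donor))
    k -= both_missing
    q -= both_missing

    # z: variables observed for the recipient on which the donor differs.
    # Assumption: a missing donor value counts as a difference.
    z = sum(r is not None and r != d for r, d in zip(recipient, donor))

    possible = k - q
    if possible == 0:
        return 0.0  # recipient has nothing observed to match on
    return (k - q - z) / possible


recipient = (1, 0, None, 2)
print(affinity_score(recipient, (1, 1, 3, 2)))  # matches on 2 of 3 observed variables
print(affinity_score(recipient, (1, 0, 3, 2)))  # matches on all observed variables
```

With $k = 4$ and $q_i = 1$, the first donor differs on one observed variable, so $z_{i,j} = 1$ and the score is $(4 - 1 - 1)/(4 - 1) = 2/3$.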

slide-27
SLIDE 27

Effects on Imputed Data: Imputed Models

◮ The thing to focus on in the following tables is model fit

Table 4: Modeling Vote Choice, All Missing Ideology Imputed

                            Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)                     5.27        0.53     9.91      0.00    81.19   467.08
Ideology                       -0.29        0.07    -3.95      0.00     0.66     0.84
Female                         -0.14        0.21    -0.67      0.26     0.61     1.23
Interest                        0.04        0.08     0.51      0.31     0.91     1.19
Presidential job approval      -0.89        0.06   -14.71      0.00     0.37     0.45
Employed                       -0.55        0.23    -2.37      0.03     0.40     0.85
Nonwhite                        1.19        0.21     5.77      0.00     2.33     4.59
Age                             0.00        0.01     0.71      0.25     1.00     1.01
Partisanship                   -0.53        0.06    -8.48      0.00     0.53     0.65
Education                       0.13        0.08     1.60      0.06     1.00     1.29

Null deviance: 2837.75 on 2053 degrees of freedom
Residual deviance: 1140.29 on 2044 degrees of freedom
AIC: 1110.454, Adjusted Degrees of Freedom from 10 Imputations: 38.60938

slide-28
SLIDE 28

Effects on Imputed Data: Imputed Models

Table 5: Modeling Vote Choice, Ideology True Nonattitudes Imputed

                            Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)                     5.92        0.58    10.28      0.00   144.58   962.81
Ideology                       -0.42        0.08    -5.39      0.00     0.58     0.75
Female                         -0.20        0.21    -0.95      0.19     0.58     1.16
Interest                        0.02        0.08     0.30      0.39     0.90     1.17
Presidential job approval      -0.91        0.06   -15.88      0.00     0.37     0.44
Employed                       -0.56        0.23    -2.49      0.02     0.39     0.83
Nonwhite                        1.23        0.21     5.98      0.00     2.44     4.80
Age                             0.00        0.01     0.84      0.20     1.00     1.01
Partisanship                   -0.50        0.06    -8.10      0.00     0.55     0.67
Education                       0.10        0.08     1.24      0.12     0.97     1.27

Null deviance: 2837.75 on 2053 degrees of freedom
Residual deviance: 1099.65 on 2044 degrees of freedom
AIC: 1070.551, Adjusted Degrees of Freedom from 10 Imputations: 21.28132

slide-29
SLIDE 29

Effects on Imputed Data: Imputed Models

Table 6: Modeling Vote Choice, No True Nonattitudes, Others Imputed

                            Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)                     5.75        0.62     9.29      0.00   113.06   865.29
Ideology                       -0.34        0.07    -4.86      0.00     0.63     0.80
Female                         -0.22        0.19    -1.21      0.12     0.59     1.08
Interest                        0.01        0.08     0.10      0.46     0.88     1.16
Presidential job approval      -0.92        0.07   -13.77      0.00     0.36     0.45
Employed                       -0.27        0.20    -1.36      0.10     0.54     1.06
Nonwhite                        1.14        0.21     5.32      0.00     2.19     4.43
Age                             0.01        0.01     0.87      0.21     0.99     1.02
Partisanship                   -0.58        0.06    -9.56      0.00     0.50     0.62
Education                       0.14        0.08     1.69      0.05     1.00     1.31

Null deviance: 2679.6 on 1925 degrees of freedom
Residual deviance: 1000.09 on 1916 degrees of freedom
AIC: 984.8838, Adjusted Degrees of Freedom from 10 Imputations: 15.09675

slide-30
SLIDE 30

Effects on Imputed Data: Imputed Models

Table 7: Modeling Vote Choice, No True Nonattitudes

                            Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)                     6.43        0.67     9.58      0.00   205.98  1877.44
Ideology                       -0.48        0.09    -5.55      0.00     0.53     0.71
Female                         -0.27        0.19    -1.47      0.08     0.56     1.03
Interest                       -0.00        0.08    -0.01      0.50     0.87     1.15
Presidential job approval      -0.94        0.07   -12.74      0.00     0.35     0.44
Employed                       -0.27        0.20    -1.33      0.10     0.55     1.07
Nonwhite                        1.25        0.22     5.60      0.00     2.42     5.03
Age                             0.01        0.01     1.10      0.16     1.00     1.02
Partisanship                   -0.55        0.07    -8.42      0.00     0.52     0.64
Education                       0.09        0.08     1.15      0.13     0.96     1.26

Null deviance: 2679.6 on 1925 degrees of freedom
Residual deviance: 968.57 on 1916 degrees of freedom
AIC: 952.8098, Adjusted Degrees of Freedom from 10 Imputations: 11.67363

slide-31
SLIDE 31

Multiple Hot-Deck Imputation

◮ Focuses strictly on missing discrete values, keeping measurement discrete: for example, a {−2, −1, 0, 1, 2} variable will not get imputations like 1.2834.
◮ The procedure:
  1. Create several replicate copies of the dataset.
  2. Search down columns of the data sequentially looking for missing observations in one replicate.
     (a) When a missing value is found, compute a vector of affinity scores relative to all other rows (cases) for that case with the missing value.
     (b) Create an empirical distribution of potential donors using affinity scores and draw randomly from it to produce a vector of imputations.
     (c) Impute one of these values into the appropriate cell of each duplicate dataset.
  3. Repeat Step 2 until no missing observations remain.
  4. Estimate the statistic of interest for each dataset.
  5. Combine the estimates of the statistic into a single estimate as in regular multiple imputation.
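The procedure above can be sketched in a few lines of Python. This is a toy illustration, not the hot.deck package's implementation. The function names, the use of `None` for missing cells, affinity-weighted sampling as the "empirical distribution of potential donors", and the fallback to equal weights when every affinity is zero are all assumptions of this sketch. The final combination step uses Rubin's rules, the standard way to pool multiply imputed estimates.

```python
import random
from statistics import mean, variance


def affinity(recipient, donor):
    """Share of the recipient's observed variables on which the donor matches."""
    observed = [(r, d) for r, d in zip(recipient, donor) if r is not None]
    if not observed:
        return 0.0
    return sum(r == d for r, d in observed) / len(observed)


def multiple_hot_deck(rows, m=5, seed=0):
    """Return m completed copies of rows (lists of discrete values, None = missing)."""
    rng = random.Random(seed)
    replicates = [[list(row) for row in rows] for _ in range(m)]
    for col in range(len(rows[0])):              # search down each column
        for i, row in enumerate(rows):
            if row[col] is not None:
                continue
            # Potential donors: other rows with this cell observed.
            donors = [j for j, d in enumerate(rows) if j != i and d[col] is not None]
            if not donors:
                continue                          # nothing to borrow; cell stays missing
            weights = [affinity(row, rows[j]) for j in donors]
            if sum(weights) == 0:
                weights = [1.0] * len(donors)     # assumption: fall back to equal weights
            picks = rng.choices(donors, weights=weights, k=m)
            for replicate, j in zip(replicates, picks):
                replicate[i][col] = rows[j][col]  # one draw per replicate
    return replicates


def pool(estimates, variances):
    """Rubin's rules: pooled estimate and total variance across m imputations."""
    m = len(estimates)
    within = mean(variances)
    between = variance(estimates)                 # sample variance, needs m >= 2
    return mean(estimates), within + (1 + 1 / m) * between


rows = [[1, 0, 1], [1, 0, None], [0, 1, 0], [1, 0, 1]]
completed = multiple_hot_deck(rows, m=3)
print(completed[0])
```

Because donors are sampled from observed values only, every imputed value stays on the variable's original discrete scale, which is the property the first bullet emphasizes.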