Testing and Imputing Item Nonresponse as Missing Data, with Big and Normal Survey Data
NATALIE JACKSON, Director of Research, PRRI
JEFF GILL, Distinguished Professor, Government, Mathematics & Statistics, American University
Overview
◮ Types of item nonresponse: There are two types of survey item nonresponse that matter: whether the respondent actually has an opinion, or not
◮ Implications for missing data treatment: The above distinction matters for missing data classification and treatment - and how we think about data structures
◮ Data: ANES sets up a partial test for this theory
◮ Distinguishing between types of item nonresponse: Some models demonstrate a difference between types of nonresponse, including after imputation
◮ Limitations and what’s next: This is still a work in progress
Types of Item Nonresponse: Sources in the Survey
◮ Don’t Know: In theory, this response comes from respondents who truly do not know the answer to a question
- In practice, it is not clear whether respondents choose “don’t know” because they really don’t have an attitude, none of the answer options fit, or they don’t want to answer
◮ Refuse: In theory, this response originates from respondents who do not want to provide their answer to the question
- Incidence of outright refusals is generally very low. In practice, reasons for refusing are as murky as with “don’t know”
◮ Skip: This response is completely ambiguous and frequently occurs on web surveys. We have no insight into the motivation for skipping an item
Types of Item Nonresponse: Two Types that Matter
◮ The ambiguity in “don’t know”, refuse, and skip provides little help for analysis.
◮ A more useful typology of item nonresponse is:
- True Nonattitudes: The respondent truly does not have an opinion or the factual knowledge required to answer the question, often for knowable reasons - a lack of education, experience, or interest in the topic
- Hidden Attitudes: The respondent has an opinion or the factual knowledge required to answer the question, but for some (often unknowable) reason, they choose not to disclose it
Types of Item Nonresponse: Two Types that Matter
◮ Implications of two types:
- The reason for item nonresponse cannot be assumed consistent across - or even within - respondents:
– The type of missingness can change within a single variable due to different respondents
– It can also change within respondents, as they will have distinct reasons for not answering different questions
- Result: We’re no longer thinking about rows of respondents and columns of variables - we’re thinking about every cell independently
Implications for Missing Data Treatment: Reminder of Missing Data Types
◮ Goal: $L(\beta \mid X, Y)$ unbiased, efficient, and with the correct standard error.
◮ Define: $Z_{mis} = (X_{mis}, Y_{mis})$ and $Z_{obs} = (X_{obs}, Y_{obs})$
◮ We stipulate an $n \times k$ matrix, $R$, corresponding to $X$, that contains 0 when the $X$ matrix data value is not missing, and 1 when it is missing.
◮ Stipulate a probability model for $R$ where $\phi$ is a parameter in this distribution of $R$.
◮ Standard terms from Rubin (1979) for the missing data:
- MCAR: $p(R \mid Z_{obs}, Z_{mis}) = p(R \mid \phi)$; missingness is not related to observed or unobserved data
- MAR: $p(R \mid Z_{obs}, Z_{mis}) = p(R \mid Z_{obs}, \phi)$; missingness depends only on observed data
- Non-Ignorable: $p(R \mid Z_{obs}, Z_{mis}) = p(R \mid Z_{obs}, Z_{mis}, \phi)$; missingness depends on unobserved data
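To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration (the variables `education` and `attitude`, the logistic missingness models, the 30% MCAR rate); none of it comes from the ANES data. It generates a missingness indicator R under each mechanism and compares the complete-case mean of `attitude` with the full-data mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: education is always observed, attitude may go missing.
education = rng.normal(size=n)
attitude = 0.5 * education + rng.normal(size=n)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# R = True marks a missing cell, matching the 0/1 convention for R above.
r_mcar = rng.random(n) < 0.3                        # p(R | phi)
r_mar = rng.random(n) < logistic(-1.0 - education)  # p(R | Z_obs, phi)
r_nmar = rng.random(n) < logistic(-1.0 - attitude)  # p(R | Z_obs, Z_mis, phi)

def observed_mean_bias(r):
    """Complete-case mean of attitude minus the full-data mean."""
    return attitude[~r].mean() - attitude.mean()

bias = {
    name: observed_mean_bias(r)
    for name, r in [("MCAR", r_mcar), ("MAR", r_mar), ("NMAR", r_nmar)]
}
for name, value in bias.items():
    print(f"{name}: bias of complete-case mean = {value:+.3f}")
```

Under these assumed models the MCAR bias should sit near zero, while the NMAR case is biased because the chance of a missing cell depends on the unobserved value itself.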
Implications for Missing Data Treatment: Two Types of Item Nonresponse
True Nonattitudes
◮ When people genuinely don’t know the answer to a factual question or don’t have an attitude to report, typically it’s because they lack education, experience, or interest in the topic
◮ Those factors are generally measured in a survey. Thus, missingness caused by a true nonattitude could be MAR
Hidden Attitudes
◮ Social desirability, the interviewer interaction, views of the topic as private, and many other possible justifications exist for why a respondent would decline to state an attitude, in addition to education, experience, and interest.
◮ Given that we do not/cannot measure these factors in the survey, hidden attitudes are likely Non-Ignorable/NMAR
Implications for Missing Data Treatment: Two Types of Item Nonresponse
Analysis questions:
◮ Are these two causes of missing data discernible in survey data: Can we model the difference between true nonattitudes and hidden attitudes?
◮ Can we design a model that identifies whether a missing cell is likely to be a true nonattitude or a hidden attitude?
Data: ANES 2012, Face-to-Face Sample
◮ Dataset chosen because of a question setup that allows a more direct test of this typology than usual
◮ Working with the face-to-face sample only
- 2012 was the first year a web sample was done simultaneously, but given mode differences the web portion is not used here
- It’s unclear how “don’t know” and refuse were handled in the published questionnaires, but it appears explicit options were not given
- Item nonresponse is generally much lower in the web sample, so it is not a clear comparison to the face-to-face sample
Data: The Test Questions
◮ A two-question setup on the respondent ideology question:
◮ libcpre_self asks all respondents to place themselves on the standard 7-point ideology scale (1=extremely liberal), and provides “haven’t thought much about it” as an out
- 590/2054 respondents did not give a substantive response
◮ libcpre_choose asks those who said “moderate” or “haven’t thought much about it”, or gave any other nonresponse, whether they would select liberal or conservative if they had to choose
- 128 respondents still did not give a substantive response when libcpre_choose is combined with libcpre_self
- 462 respondents gave a substantive answer to libcpre_choose but did not to libcpre_self
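The counts above can be checked with a few lines of arithmetic. The sketch below also shows one way the two-question logic could be turned into a per-respondent label. The function `classify` and its boolean inputs are hypothetical helpers for illustration, not part of the ANES files or the authors' code.

```python
# Counts reported on the slide.
n_total = 2054
n_missing_self = 590   # no substantive answer to libcpre_self
n_missing_both = 128   # still no substantive answer after libcpre_choose

# Respondents who answered only once pushed by the follow-up question.
n_answered_followup = n_missing_self - n_missing_both
share_missing_self = n_missing_self / n_total

def classify(self_substantive: bool, choose_substantive: bool) -> str:
    """Label a respondent from the two ideology items (hypothetical helper)."""
    if self_substantive:
        return "substantive"
    if choose_substantive:
        return "hidden attitude"
    return "true nonattitude"

print(n_answered_followup)           # 462
print(round(share_missing_self, 3))  # 0.287
print(classify(False, True))         # hidden attitude
print(classify(False, False))        # true nonattitude
```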
Data: The Test Questions
◮ The 128 respondents who didn’t answer the ideology question twice can be thought of as the “true nonattitudes”
- It’s possible they are just really stubborn hidden attitudes, but the text of the libcpre_choose question - “if you had to choose” - is designed to ferret those out. We’re making the assumption that it worked
◮ The 462 respondents who answered on the second round can be thought of as the “hidden attitudes”
- It’s possible a few true nonattitudes provided answers in order to mollify interviewers. We’re making the assumption that those cases are few
◮ These are important assumptions that we aren’t able to definitively prove - yet there is some evidence in the data to support them
Distinguishing Between Types of Item Nonresponse: Hidden Attitudes vs. True Nonattitudes
Figure 1: Modeling Nonresponse to Ideology Question
Distinguishing Between Types of Item Nonresponse: Hidden Attitudes vs. True Nonattitudes
◮ Overall: Nonresponders are more likely to be female, nonwhite, less interested, Democrat/Independent, lower education
◮ True nonattitudes compared to hidden attitudes: Less interested, more likely Independent, maybe less likely nonwhite
◮ Model fit is much better for explaining true nonattitudes than it is for explaining all missingness or the hidden attitudes
Distinguishing Between Types of Item Nonresponse: Hidden Attitudes vs. True Nonattitudes
◮ Tentative conclusion: We have a much better handle on what causes true nonattitudes, and it hinges on attitudinal variables
◮ Back to our analysis questions:
- Are these two causes of missing data discernible in survey data: Can we model the difference between true nonattitudes and hidden attitudes? Yes
- Now we turn to imputation questions: Can we treat all missingness equally given the theoretical distinction between true nonattitudes and hidden attitudes? Is it appropriate to impute a response for analytical purposes when the respondent truly does not have an opinion?
Effects on Imputed Data: Hot Deck Imputation
◮ R package hot.deck used for imputation given missingness on a categorical variable (libcpre_self is 29% missing)
- Multiple hot deck imputation for categorical variables (Cranmer & Gill 2013)
- Uses mice for continuous variables
- 10 imputations per dataset
◮ Two datasets imputed:
- All 2054 cases
- 1926 cases imputed, 128 true nonattitudes dropped
Effects on Imputed Data: Imputed Model Comparison
Figure 2: Coefficients Across 4 Models
Effects on Imputed Data: Imputed Model Comparison
Figure 3: Ideology Coefficients Across 4 Models
Takeaways
◮ There is some important difference between the true nonattitudes and revealed hidden attitudes
◮ There is a clear hierarchy of model fit when those values are imputed:
- From worst to best fit (by the AIC values in Tables 4-7): imputed all 590 missing cases; imputed 128 true nonattitudes; imputed 462 missing cases with 128 deleted; no imputation needed
◮ And this is only for one variable in the model - scrutinizing all variables would presumably result in more differences
◮ That indicates we should consider not imputing true nonattitudes
◮ At minimum, we need to be aware of the differences in type of item nonresponse that we’re imputing
Limitations and What’s Next:
◮ Primary contribution thus far: Restructuring how we think of missing data using a realistic typology of what we know about item nonresponse (true nonattitudes vs hidden attitudes). Cell-based thinking rather than typical respondent/question framing
◮ We’ve shown that the traditional methods of imputing can be problematic when type of missingness likely varies
◮ We need to refine the simulations into a reliable way to distinguish true nonattitudes and hidden attitudes
◮ We want to generate practical advice for modeling and imputation decisions
THANK YOU!
Appendix: Variable Coding
◮ Coding for nonresponse models
- For libcpre_self and the combined variable that incorporates both libcpre_self and libcpre_choose, and all subsets of these questions used: 0 if a substantive response, 1 if not
◮ Other variables used in the analysis:
- Gender: dichotomized, female
- Interest in politics/elections: 5-pt scale
- Employed: dichotomized, employed
- Race: dichotomized, nonwhite
- Age: numeric
- Party ID: dichotomized, Dem, Ind
- Education: 5-pt scale
◮ Handful of missing values on explanatory variables imputed using mice to ensure full use of data
Distinguishing Between Types of Item Nonresponse: Correlates of Overall Nonresponse
Table 1: Modeling Missing Values at Stage One, Full Sample
               Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)       -1.55        0.34    -4.55      0.00     0.12     0.37
Female             0.28        0.12     2.36      0.01     1.09     1.60
Interest           0.40        0.05     7.43      0.00     1.37     1.63
Employed           0.10        0.13     0.80      0.21     0.90     1.36
Nonwhite           0.50        0.13     3.87      0.00     1.33     2.04
Age               -0.00        0.00    -0.57      0.28     0.99     1.00
Democrat           0.36        0.14     2.59      0.00     1.14     1.80
Independent        1.21        0.18     6.55      0.00     2.48     4.55
Education         -0.53        0.06    -8.94      0.00     0.53     0.65
Null deviance: 2228.22 on 2053 degrees of freedom
Residual deviance: 1886.57 on 2045 degrees of freedom
AIC: 1743.101
Adjusted Degrees of Freedom from 10 Imputations: 556.9445
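A note on reading these tables: the LCI and UCI columns do not bracket the coefficient itself, so they appear to be reported on the exponentiated (odds ratio) scale. The sketch below is a rough check of that reading for the Education row of Table 1. The function name `odds_ratio_interval` and the normal multiplier `z` are assumptions; the slides do not state which multiplier or degrees of freedom were used, so the bounds are not expected to match exactly.

```python
import math

def odds_ratio_interval(estimate, std_error, z=1.96):
    """Exponentiate a logit coefficient and an approximate normal interval.

    z is an assumed multiplier; the slides do not say which one was used.
    """
    low = math.exp(estimate - z * std_error)
    point = math.exp(estimate)
    high = math.exp(estimate + z * std_error)
    return low, point, high

# Education row of Table 1: estimate -0.53, standard error 0.06.
low, odds_ratio, high = odds_ratio_interval(-0.53, 0.06)
print(f"odds ratio = {odds_ratio:.3f}, approx. interval = ({low:.3f}, {high:.3f})")
```

The exponentiated estimate falls inside the reported bounds of 0.53 and 0.65, which is consistent with the odds ratio reading.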
Distinguishing Between Types of Item Nonresponse: True Nonattitudes
Table 2: Modeling Missing Values at Stage One, True Nonattitudes
               Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)       -3.83        0.68    -5.62      0.00     0.01     0.07
Female             0.47        0.22     2.17      0.02     1.12     2.30
Interest           0.47        0.10     4.50      0.00     1.35     1.89
Employment         0.03        0.23     0.12      0.45     0.70     1.50
Nonwhite           0.15        0.23     0.66      0.25     0.80     1.71
Age               -0.00        0.01    -0.64      0.26     0.99     1.01
Democrat           1.23        0.34     3.59      0.00     1.95     6.04
Independent        2.49        0.37     6.76      0.00     6.57    22.06
Education         -0.56        0.11    -5.12      0.00     0.47     0.68
Null deviance: 849.4 on 1591 degrees of freedom
Residual deviance: 667.8 on 1583 degrees of freedom
AIC: 625.0275
Adjusted Degrees of Freedom from 10 Imputations: 86.06016
Distinguishing Between Types of Item Nonresponse: Hidden Attitudes vs. True Nonattitudes
Table 3: Modeling Missing Values at Stage One, True Nonattitudes vs. Hidden Attitudes
               Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)       -2.36        0.73    -3.22      0.00     0.03     0.32
Female             0.15        0.23     0.65      0.26     0.80     1.70
Interest           0.18        0.10     1.75      0.04     1.01     1.42
Employed           0.10        0.24     0.41      0.34     0.75     1.63
Nonwhite          -0.36        0.25    -1.42      0.08     0.46     1.06
Age               -0.00        0.01    -0.43      0.34     0.99     1.01
Democrat           1.17        0.37     3.15      0.00     1.75     5.92
Independent        1.78        0.38     4.69      0.00     3.18    11.10
Education         -0.15        0.14    -1.06      0.15     0.69     1.08
Null deviance: 531.22 on 589 degrees of freedom
Residual deviance: 494.71 on 581 degrees of freedom
AIC: 465.1297
Adjusted Degrees of Freedom from 10 Imputations: 72.40175
Effects on Imputed Data: Data Setup
◮ Return to original un-edited dataset
◮ Coding for nonresponse models
- libcpre_self and libcpre_combined - the combined variable that incorporates both libcpre_self and libcpre_choose - scale is 1-7, extremely liberal to extremely conservative
- Outcome variable used is 0-1 vote choice; 1 indicates vote for Obama
◮ Other variables used in the analysis:
- Gender: dichotomized, female
- Interest in politics/elections: 5-pt scale
- Employed: dichotomized, employed
- Race: dichotomized, nonwhite
- Age: numeric
- Party ID: 7-pt scale, 1=Strong Dem
- Education: 5-pt scale
Multiple Hot-Deck Imputation, Affinity Score Definition
◮ For each respondent, $y_i$ indicates the outcome variable and $x_i$ is a $k$-length vector of only discrete explanatory variables.
◮ If the $i$-th case under consideration has $q_i$ missing values in $x_i$, then a potential donor vector, $x_j$, $j \neq i$, will have between 0 and $k - q_i$ exact matches with $i$.
◮ Define $z_{i,j}$ as the number of variables for which the potential donor $j$ and the recipient $i$ have different values.
◮ Thus $k - q_i - z_{i,j}$ is the number of variables on which $j$ and $i$ are perfectly matched.
◮ This value, scaled by the highest number of possible matches ($k - q_i$), is then the affinity score:
\[ \alpha_{i,j} = \frac{k - q_i - z_{i,j}}{k - q_i} \tag{1} \]
◮ The affinity score has the desirable properties that $\alpha_{i,j} = 1$ for $i \in D_R$ (data with responses) and $\alpha_{i,j} = 0$ for $i \in D_{NR}$ (data missing responses).
◮ Cases where the recipient and the donor are both missing values in the same covariate are deducted from $k$ and $q_i$ prior to the calculation of $\alpha_{i,j}$.
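The affinity score in equation (1) can be sketched in a few lines. This is an illustrative reading of the definition above, not the hot.deck package's implementation: the function name `affinity`, the use of NaN to mark missing cells, and the choice to count a donor's missing value as a mismatch are all assumptions.

```python
import numpy as np

def affinity(recipient, donor):
    """Affinity score (k - q_i - z_ij) / (k - q_i) for one recipient/donor pair.

    NaN marks a missing cell (an assumption of this sketch).
    """
    recipient = np.asarray(recipient, dtype=float)
    donor = np.asarray(donor, dtype=float)

    # Covariates missing in BOTH rows are deducted from k and q_i first.
    both_missing = np.isnan(recipient) & np.isnan(donor)
    recipient, donor = recipient[~both_missing], donor[~both_missing]

    k = recipient.size
    q = int(np.isnan(recipient).sum())
    if k == q:
        return 0.0  # nothing left to compare on

    # z counts differing values among covariates the recipient observed.
    # A donor value that is missing there compares unequal, so it counts
    # as a mismatch (assumption).
    comparable = ~np.isnan(recipient)
    z = int((recipient[comparable] != donor[comparable]).sum())
    return (k - q - z) / (k - q)

recipient = [1, 2, np.nan, 0]              # k = 4, q_i = 1
print(affinity(recipient, [1, 2, 5, 0]))   # all three comparable values match
print(affinity(recipient, [1, 3, 5, 1]))   # two of three differ
```

With k = 4 and one missing value, the first donor matches on all three comparable covariates (score 1), and the second matches on only one of three (score 1/3).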
Effects on Imputed Data: Imputed Models
◮ The thing to focus on in the following tables is model fit
Table 4: Modeling Vote Choice, All Missing Ideology Imputed
                            Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)                     5.27        0.53     9.91      0.00    81.19   467.08
Ideology                       -0.29        0.07    -3.95      0.00     0.66     0.84
Female                         -0.14        0.21    -0.67      0.26     0.61     1.23
Interest                        0.04        0.08     0.51      0.31     0.91     1.19
Presidential job approval      -0.89        0.06   -14.71      0.00     0.37     0.45
Employed                       -0.55        0.23    -2.37      0.03     0.40     0.85
Nonwhite                        1.19        0.21     5.77      0.00     2.33     4.59
Age                             0.00        0.01     0.71      0.25     1.00     1.01
Partisanship                   -0.53        0.06    -8.48      0.00     0.53     0.65
Education                       0.13        0.08     1.60      0.06     1.00     1.29
Null deviance: 2837.75 on 2053 degrees of freedom
Residual deviance: 1140.29 on 2044 degrees of freedom
AIC: 1110.454
Adjusted Degrees of Freedom from 10 Imputations: 38.60938
Effects on Imputed Data: Imputed Models
Table 5: Modeling Vote Choice, Ideology True Nonattitudes Imputed
                            Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)                     5.92        0.58    10.28      0.00   144.58   962.81
Ideology                       -0.42        0.08    -5.39      0.00     0.58     0.75
Female                         -0.20        0.21    -0.95      0.19     0.58     1.16
Interest                        0.02        0.08     0.30      0.39     0.90     1.17
Presidential job approval      -0.91        0.06   -15.88      0.00     0.37     0.44
Employed                       -0.56        0.23    -2.49      0.02     0.39     0.83
Nonwhite                        1.23        0.21     5.98      0.00     2.44     4.80
Age                             0.00        0.01     0.84      0.20     1.00     1.01
Partisanship                   -0.50        0.06    -8.10      0.00     0.55     0.67
Education                       0.10        0.08     1.24      0.12     0.97     1.27
Null deviance: 2837.75 on 2053 degrees of freedom
Residual deviance: 1099.65 on 2044 degrees of freedom
AIC: 1070.551
Adjusted Degrees of Freedom from 10 Imputations: 21.28132
Effects on Imputed Data: Imputed Models
Table 6: Modeling Vote Choice, No True Nonattitudes, Others Imputed
                            Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)                     5.75        0.62     9.29      0.00   113.06   865.29
Ideology                       -0.34        0.07    -4.86      0.00     0.63     0.80
Female                         -0.22        0.19    -1.21      0.12     0.59     1.08
Interest                        0.01        0.08     0.10      0.46     0.88     1.16
Presidential job approval      -0.92        0.07   -13.77      0.00     0.36     0.45
Employed                       -0.27        0.20    -1.36      0.10     0.54     1.06
Nonwhite                        1.14        0.21     5.32      0.00     2.19     4.43
Age                             0.01        0.01     0.87      0.21     0.99     1.02
Partisanship                   -0.58        0.06    -9.56      0.00     0.50     0.62
Education                       0.14        0.08     1.69      0.05     1.00     1.31
Null deviance: 2679.6 on 1925 degrees of freedom
Residual deviance: 1000.09 on 1916 degrees of freedom
AIC: 984.8838
Adjusted Degrees of Freedom from 10 Imputations: 15.09675
Effects on Imputed Data: Imputed Models
Table 7: Modeling Vote Choice, No True Nonattitudes
                            Estimate  Std. Error  t value  Pr(>|t|)  95% LCI  95% UCI
(Intercept)                     6.43        0.67     9.58      0.00   205.98  1877.44
Ideology                       -0.48        0.09    -5.55      0.00     0.53     0.71
Female                         -0.27        0.19    -1.47      0.08     0.56     1.03
Interest                       -0.00        0.08    -0.01      0.50     0.87     1.15
Presidential job approval      -0.94        0.07   -12.74      0.00     0.35     0.44
Employed                       -0.27        0.20    -1.33      0.10     0.55     1.07
Nonwhite                        1.25        0.22     5.60      0.00     2.42     5.03
Age                             0.01        0.01     1.10      0.16     1.00     1.02
Partisanship                   -0.55        0.07    -8.42      0.00     0.52     0.64
Education                       0.09        0.08     1.15      0.13     0.96     1.26
Null deviance: 2679.6 on 1925 degrees of freedom
Residual deviance: 968.57 on 1916 degrees of freedom
AIC: 952.8098
Adjusted Degrees of Freedom from 10 Imputations: 11.67363
Multiple Hot-Deck Imputation
◮ Focuses strictly on missing discrete values, keeping measurement discrete: for example, a {−2, −1, 0, 1, 2} variable will not get imputations like 1.2834.
◮ First create several replicate copies of the dataset, and then perform the steps:
1. Search down columns of the data sequentially looking for missing observations in one replicate.
(a) When a missing value is found, compute a vector of affinity scores relative to all other rows (cases) for that case with the missing value.
(b) Create an empirical distribution of potential donors using affinity scores and draw randomly from it to produce a vector of imputations.
(c) Impute one of these values into the appropriate cell of each duplicate dataset.
2. Repeat Step 1 until no missing observations remain.
3. Estimate the statistic of interest for each dataset.
4. Combine the estimates of the statistic into a single estimate as in regular multiple imputation.
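The steps above can be sketched end to end on a toy dataset. This is a simplified illustration under stated assumptions, not the hot.deck package's algorithm: NaN marks missing cells, donors are drawn with probability proportional to their affinity score, each replicate is imputed independently, every column has at least one observed value, and the pooling step uses the standard multiple-imputation combining rules (Rubin's rules) for a single scalar estimate. The names `affinity_scores`, `hot_deck_once`, and `pool` are hypothetical.

```python
import numpy as np

def affinity_scores(data, i):
    """Affinity of every row to recipient row i; NaN marks a missing cell."""
    rec = data[i]
    scores = np.zeros(len(data))
    for j, don in enumerate(data):
        if j == i:
            continue
        keep = ~(np.isnan(rec) & np.isnan(don))  # drop covariates missing in both
        r, d = rec[keep], don[keep]
        observed = ~np.isnan(r)
        if observed.sum() > 0:
            scores[j] = np.sum(r[observed] == d[observed]) / observed.sum()
    return scores

def hot_deck_once(data, rng):
    """Fill each missing cell with a value from a donor drawn by affinity."""
    filled = data.copy()
    n, k = data.shape
    for col in range(k):                      # step 1: search down columns
        has_value = ~np.isnan(data[:, col])
        for i in np.flatnonzero(~has_value):
            weights = affinity_scores(data, i)    # step 1(a)
            weights[~has_value] = 0.0             # donors must have the value
            if weights.sum() == 0:                # no similar donor: fall back
                weights = has_value.astype(float)
            donor = rng.choice(n, p=weights / weights.sum())  # step 1(b)
            filled[i, col] = data[donor, col]     # step 1(c)
    return filled

def pool(estimates, variances):
    """Combine per-dataset estimates with Rubin's rules (step 4)."""
    m = len(estimates)
    within = np.mean(variances)
    between = np.var(estimates, ddof=1)
    return np.mean(estimates), within + (1 + 1 / m) * between

rng = np.random.default_rng(1)
data = np.array([[1, 0, 2], [1, 0, np.nan], [0, 1, 1],
                 [0, 1, np.nan], [1, 0, 2], [0, 1, 1]], dtype=float)

imputed = [hot_deck_once(data, rng) for _ in range(10)]    # 10 replicates
estimates = [d[:, 2].mean() for d in imputed]              # step 3
variances = [d[:, 2].var(ddof=1) / len(d) for d in imputed]
pooled_mean, pooled_var = pool(estimates, variances)
print(pooled_mean, pooled_var)
```

In this toy data each incomplete row has perfect matches only among rows sharing its observed values, so the imputed values stay on the original discrete scale (1 or 2) rather than taking fractional values.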