SLIDE 1
Data Screening and Missing Value Analysis
James H. Steiger

Theorem (The Fundamental Theorem of Modern Data Analysis). Storage media are cheap. Data are expensive.
SLIDE 2
SLIDE 3
Corollary 1. In some cases, the data are priceless.
Corollary 2. You will suffer a major data loss event. There are many ways this can happen. The only questions are (a) when, and (b) whether the data will be restorable from a backup.
SLIDE 4
Don’t Lose Data
- Own a CD-DVD burner.
- Have a backup plan. Back up all your research
related manuscripts and data files every week.
- Every time you work on a data file, start by
saving it under a new name, according to a precise code, like this: PROJECT5_2006_01_30_A.SAV
SLIDE 5
- After several weeks, you will have a succession
of data files that will allow you to restore to any
earlier time. This can be useful in case one of your “data analysis” sessions is discovered, after the fact, to be a “data destruction” session!
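The dated naming scheme above is easy to automate. A minimal Python sketch (the function name is illustrative, not from the slides):

```python
from datetime import date

def versioned_name(project, revision="A", ext="SAV", when=None):
    """Build a dated file name in the slide's style, e.g. PROJECT5_2006_01_30_A.SAV."""
    when = when or date.today()
    return f"{project}_{when:%Y_%m_%d}_{revision}.{ext}"

print(versioned_name("PROJECT5", "A", when=date(2006, 1, 30)))
# PROJECT5_2006_01_30_A.SAV
```

Bumping the trailing revision letter within a day, and the date across days, yields the restorable succession of files described above.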
SLIDE 6
Don’t Throw Away Data
- Never dichotomize or categorize continuous
data. You are simply throwing away
information and increasing error variance.
- Use missing data imputation to replace missing
data, rather than deleting data casewise.
- As data become available, screen them
immediately for univariate and multivariate
outliers and missing values. You may be able to
recover some lost information and correct recording errors.
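A crude sketch of such a screening pass for one variable, using a z-score rule for univariate outliers (the function name and the |z| cutoff are illustrative; multivariate outliers need something like Mahalanobis distance and are not shown):

```python
from statistics import mean, stdev

def screen(values, z_cut=3.0):
    """Return indices of missing values and of univariate outliers (|z| > z_cut)."""
    missing = [i for i, v in enumerate(values) if v is None]
    observed = [v for v in values if v is not None]
    m, s = mean(observed), stdev(observed)
    outliers = [i for i, v in enumerate(values)
                if v is not None and abs(v - m) / s > z_cut]
    return missing, outliers

scores = [12, 14, 11, None, 13, 99, 12, 15]
print(screen(scores, 2.0))  # ([3], [5]): one missing case, one likely recording error
```

Flagged cases (like the 99 above) can then be checked against the original records before any analysis is run.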
SLIDE 7
Plan Your Data Files Carefully
- Add descriptive information to the variable
names
- Use long variable names
- Remember that multivariate data are indeed
multivariate! Never destroy the multivariate structure by splitting the file unless you make very careful use of unique ID values so that you can reconstruct the complete file.
SLIDE 8
Document All Analysis Activities
- You will be amazed how quickly you will
forget what you did during a complex data analysis.
- Yet, what you did determines the appropriate
probability model for analyzing your data! (Not all textbook authors recognize this, and we will comment extensively on this later in the course.)
- Analysis documentation can be done in SPSS
by logging commands and by placing text fields
SLIDE 9
with notes about the analysis in the SPSS
output files you create.
- There is an overwhelming tendency for most
people to put insufficient detail in such
comments. Resist this tendency. You will
almost always save time in the long run by putting extremely careful (“overly detailed”) notes in your data analyses.
SLIDE 10
- Advanced, modern systems (like R combined
with Sweave) allow you to produce one file that contains a narrative about your data analysis, along with all the commands that produce your analysis! If you discover an error, you can redo months of work in seconds by simply rerunning the file.
SLIDE 11
Dealing with Missing Data Many basic multivariate procedures assume complete data. But many data sets have some missing values. What to do? One approach is to delete cases with missing data, and perform the analysis on the reduced data set. Another approach is to try to replace missing data to “fill out” the data set.
SLIDE 12
Casewise Deletion Often the default procedure is “casewise deletion.” Casewise deletion simply discards any case that has missing data on any of the variables in the
analysis. Casewise deletion can, in many studies,
reduce the operational sample size to only 70-75%
of its initial value.
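The sample-size cost is easy to see in a toy example. A minimal sketch (pure Python; the function name is illustrative):

```python
def casewise_delete(rows):
    """Keep only cases with no missing (None) entries on any variable."""
    return [r for r in rows if None not in r]

data = [
    (1.2, 3.4, 5.6),
    (2.1, None, 4.4),   # dropped: one variable missing
    (0.9, 2.8, None),   # dropped: one variable missing
    (1.7, 3.0, 5.1),
]
complete = casewise_delete(data)
print(len(complete) / len(data))  # 0.5: half the sample survives
```

Two isolated missing values, on different variables, cost half the cases here; with many variables, even sparse missingness compounds in exactly this way.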
SLIDE 13
Missing Data Imputation An alternative is to “impute” or reconstruct missing data values, using procedures that range from extraordinarily simple to very complex.
SLIDE 14
Dangers of Casewise Deletion Casewise deletion seems, at first glance, to be “unbiased” and “fair” since it does not involve “making up data” to replace the missing values. However, it can actually be quite dangerous, resulting in biased estimates that are far worse than any problem resulting from “making up data.”
SLIDE 15
Missing Data Theory When missing data are a problem in an analysis, we try to develop an understanding of an underlying mechanism that led to the data being missing. The classic reference is Little & Rubin (1987), Statistical Analysis with Missing Data. For a superb introductory chapter, see Chapter 3 of Frank Harrell’s Regression Modeling Strategies.
SLIDE 16
Missing Data Theory Several potential mechanisms for missing data can be discussed, and several technical terms are commonly employed to discuss them:
SLIDE 17
MCAR (Missing Completely At Random)
- Data are missing for reasons completely
unrelated to the values of any variable in the study or characteristic of the subject, including the value of the missing data.
- Examples: A sensor failed at random during a
trial. A survey response was lost in the mail.
- This is the easiest type of missing data to deal
with.
SLIDE 18
MAR (Missing at Random)
- The probability that a value is missing depends
on values of the variables that were actually
measured.
- Given the values of other observed variables,
subjects having missing values are only randomly different from other subjects.
SLIDE 19
MAR (Missing at Random)
- Example (Harrell, 2001, p.41). Consider a
survey in which females are less likely to provide their personal income than males, but the likelihood of responding is independent of a woman’s actual income. If we have the sexes of all subjects, and we have income data for some females, then we still can construct unbiased income estimates.
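Harrell's income example suggests a simple conditional imputation: fill each missing income with the mean income of the respondent's own sex, which is unbiased under MAR given sex. A pure-Python sketch (function and field names are illustrative):

```python
from statistics import mean

def impute_by_group(records, group_key, value_key):
    """Replace missing values with the mean of the case's own group
    (reasonable under MAR, conditional on the grouping variable)."""
    means = {}
    for g in {r[group_key] for r in records}:
        obs = [r[value_key] for r in records
               if r[group_key] == g and r[value_key] is not None]
        means[g] = mean(obs)
    return [dict(r, **{value_key: means[r[group_key]]})
            if r[value_key] is None else r
            for r in records]

people = [
    {"sex": "F", "income": 40.0},
    {"sex": "F", "income": None},   # filled with the female mean, not the overall mean
    {"sex": "F", "income": 50.0},
    {"sex": "M", "income": 60.0},
    {"sex": "M", "income": 70.0},
]
filled = impute_by_group(people, "sex", "income")
print(filled[1]["income"])  # 45.0
```

Using the overall mean instead (53.33 here) would drag the imputed female incomes toward the male values, which is exactly the bias MAR-based conditioning avoids.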
SLIDE 20
IM (Informative Missing)
- Data are more likely to be missing if their true
values are higher or lower.
- Example: People with very high incomes in a
survey are more likely to refuse to provide income data.
- This is the most difficult kind of missing data to
deal with.
SLIDE 21
Modeling and Dealing with Missing Data Some potential approaches and issues (Harrell, 2001, p. 45):
- Imputation of missing values for one of the
variables can ignore all other information. For example, missing values can be replaced with a constant such as the mean or the median of nonmissing values on that variable.
- Imputation can be based on information not
otherwise used.
SLIDE 22
- Imputations can be based on information
obtained only by analyzing interrelationships
among the X’s.
- Imputations can be based on relationships
between X’s, and between X and Y.
- Imputations can take into account the reason for
non-response, if known.
- Ignoring known relationship between X and Y
for non-missing variables during imputation can bias the regression coefficient toward zero.
SLIDE 23
Imputation Algorithms Single Imputation of Conditional Means For a single X that is unrelated to other X’s, the mean or median may be substituted with little loss
of efficiency.
Multiple Imputation Uses random draws from the estimated conditional distribution of the X value given the other X values
SLIDE 24
and (possibly) Y. Usually these draws are repeated (and the analysis repeated) several times.
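A deliberately crude sketch of the repeated-draw idea (a real multiple-imputation procedure draws from the conditional distribution given the other X's and possibly Y, and pools estimates with between-imputation variance; here each missing value is simply drawn from a normal fitted to the observed values of that one variable):

```python
import random
from statistics import mean, stdev

def multiple_impute(values, m=5, seed=0):
    """Produce m completed data sets, filling each missing value with a
    random draw from a normal fitted to the observed values (crude sketch)."""
    rng = random.Random(seed)
    obs = [v for v in values if v is not None]
    mu, sigma = mean(obs), stdev(obs)
    return [[rng.gauss(mu, sigma) if v is None else v for v in values]
            for _ in range(m)]

data = [10.0, None, 12.0, 11.0, None, 13.0]
sets = multiple_impute(data)
# Analyze each completed set, then pool the estimates across the m analyses.
pooled = mean(mean(s) for s in sets)
print(round(pooled, 2))
```

The key point the slide makes survives even in this sketch: the analysis is run once per completed data set, and the m results are then combined, so imputation uncertainty is carried into the final estimates rather than hidden.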
SLIDE 25
General Guidelines Proportion of missing ≤ .05: Doesn’t matter much how you impute missings or whether you adjust variance of regression coefficient estimates for missingness. Casewise deletion analysis is a reasonable option.
SLIDE 26
Proportion of missing .05 to .15: If predictor is unrelated to other predictors, you can use a reasonable constant value; otherwise, impute using a customized model to predict the predictor from all other predictors. Variance estimates of coefficients may need to be adjusted.
Proportion of missing greater than .15: Same as above, but the need to adjust variance estimates is even greater.
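These rules of thumb can be written down as a small decision function (the thresholds follow the slide; the function name and wording are illustrative):

```python
def imputation_advice(prop_missing):
    """Rule-of-thumb guidance keyed to the proportion of missing values."""
    if prop_missing <= 0.05:
        return "any simple imputation; casewise deletion is a reasonable option"
    if prop_missing <= 0.15:
        return ("constant if the predictor is unrelated to the others, else "
                "model-based imputation; consider adjusting coefficient variances")
    return "model-based imputation; variance adjustment is even more important"

print(imputation_advice(0.03))
```

The proportion is computed per variable, so a data set can fall under different rows of the guideline for different predictors.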
SLIDE 27
Generating a Correlation Matrix with Pairwise Deletion One common procedure is to compute each correlation on the complete cases for those two variables only. This procedure generates correlations that are based generally on larger sample sizes than would be obtained with casewise
deletion. One problem is that a matrix of such