Data Screening and Missing Value Analysis, James H. Steiger



slide-1
SLIDE 1

Data Screening and Missing Value Analysis

James H. Steiger

slide-2
SLIDE 2

Theorem (The Fundamental Theorem of Modern Data Analysis). Storage media are cheap. Data are expensive.

slide-3
SLIDE 3

Corollary 1. In some cases, the data are priceless.

Corollary 2. You will suffer a major data loss event. There are many ways this can happen. The only questions are (a) when, and (b) whether the data will be restorable from a backup.

slide-4
SLIDE 4

Don’t Lose Data

  • Own a CD-DVD burner.
  • Have a backup plan. Back up all your research-related manuscripts and data files every week.
  • Every time you work on a data file, start by saving it under a new name, according to a precise code, like this: PROJECT5_2006_01_30_A.SAV

slide-5
SLIDE 5
  • After several weeks, you will have a succession of data files that will allow you to restore to any earlier time. This can be useful in case one of your “data analysis” sessions is discovered, after the fact, to be a “data destruction” session!
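The naming code can be generated rather than typed by hand. The following is a minimal Python sketch, assuming the code consists of a project name, the date as YYYY_MM_DD, and a letter for the session within that day. The function names `versioned_name` and `next_version_name` are hypothetical and are not part of the original slides.

```python
import string
from datetime import date


def versioned_name(project, day, letter="A", ext="SAV"):
    """Build a name such as PROJECT5_2006_01_30_A.SAV."""
    return f"{project}_{day:%Y_%m_%d}_{letter}.{ext}"


def next_version_name(project, day, existing_names):
    """Return the first name for this day that is not already in use."""
    for letter in string.ascii_uppercase:
        name = versioned_name(project, day, letter)
        if name not in existing_names:
            return name
    raise ValueError("more than 26 versions saved on one day")


print(versioned_name("PROJECT5", date(2006, 1, 30)))
```

Because the date is written year first, an alphabetical directory listing is also a chronological one.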

slide-6
SLIDE 6

Don’t Throw Away Data

  • Never dichotomize or categorize continuous data. You are simply throwing away information and increasing error variance.
  • Use missing data imputation to replace missing data, rather than deleting data casewise.
  • As data become available, screen them immediately for univariate and multivariate outliers and missing values. You may be able to recover some lost information and correct recording errors.
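A first screening pass can be sketched in a few lines of Python. The choices below are assumptions for illustration, not the author's procedure: missing values are coded as `None`, univariate outliers are flagged by z-score, and multivariate outliers are ranked by squared Mahalanobis distance, shown here for two variables only so that the covariance matrix can be inverted by hand.

```python
from statistics import mean, stdev


def count_missing(rows):
    """Number of missing (None) entries in each column."""
    return [sum(1 for row in rows if row[j] is None) for j in range(len(rows[0]))]


def univariate_outliers(values, z_cutoff=3.0):
    """Indices of observed values whose |z| exceeds the cutoff."""
    observed = [v for v in values if v is not None]
    m, s = mean(observed), stdev(observed)
    return [i for i, v in enumerate(values)
            if v is not None and abs(v - m) / s > z_cutoff]


def mahalanobis_sq_2d(pairs):
    """Squared Mahalanobis distance of each complete (x, y) pair from the centroid."""
    xs, ys = [p[0] for p in pairs], [p[1] for p in pairs]
    mx, my, n = mean(xs), mean(ys), len(pairs)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    return [(syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my)
             + sxx * (y - my) ** 2) / det for x, y in pairs]


# Heights in cm with one likely recording error (1700 instead of 170).
heights = list(range(160, 180)) + [1700]
print(univariate_outliers(heights))
```

A case such as (3, 8) in data where x and y otherwise rise together is unremarkable on each variable separately, but it receives the largest Mahalanobis distance. That is the sense in which an outlier can be multivariate without being univariate.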

slide-7
SLIDE 7

Plan Your Data Files Carefully

  • Add descriptive information to the variable names
  • Use long variable names
  • Remember that multivariate data are indeed multivariate! Never destroy the multivariate structure by splitting the file unless you make very careful use of unique ID values so that you can reconstruct the complete file.
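The role of the unique ID can be illustrated with a short Python sketch. It assumes each case is a dictionary and that the ID column is kept in both parts of the split. The function names and the example variables (`anxiety`, `depression`, `income`) are hypothetical.

```python
def split_by_variables(records, id_key, first_vars):
    """Split each case into two parts; both parts keep the ID."""
    part_a = [{id_key: r[id_key], **{v: r[v] for v in first_vars}} for r in records]
    part_b = [{k: v for k, v in r.items() if k not in first_vars} for r in records]
    return part_a, part_b


def merge_on_id(part_a, part_b, id_key):
    """Rebuild the complete file, refusing to guess when IDs are not unique or do not match."""
    ids_a = [r[id_key] for r in part_a]
    ids_b = [r[id_key] for r in part_b]
    if len(set(ids_a)) != len(ids_a) or len(set(ids_b)) != len(ids_b):
        raise ValueError("ID values must be unique within each file")
    if set(ids_a) != set(ids_b):
        raise ValueError("the two files do not contain the same cases")
    lookup = {r[id_key]: r for r in part_b}
    return [{**r, **lookup[r[id_key]]} for r in part_a]


records = [
    {"id": 101, "anxiety": 12, "depression": 7, "income": 41000},
    {"id": 102, "anxiety": 30, "depression": 22, "income": 58000},
    {"id": 103, "anxiety": 18, "depression": 9, "income": 36500},
]
part_a, part_b = split_by_variables(records, "id", ["anxiety", "depression"])
part_b.reverse()  # row order may change after a split; the ID makes that harmless
print(merge_on_id(part_a, part_b, "id") == records)
```

Matching on the ID rather than on row position is the point of the design: sorting or filtering one part silently breaks a positional match, whereas a mismatch in IDs raises an error.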

slide-8
SLIDE 8

Document All Analysis Activities

  • You will be amazed how quickly you will forget what you did during a complex data analysis.
  • Yet, what you did determines the appropriate probability model for analyzing your data! (Not all textbook authors recognize this, and we will comment extensively on this later in the course.)
  • Analysis documentation can be done in SPSS by logging commands and by placing text fields with notes about the analysis in the SPSS output files you create.

slide-9
SLIDE 9

  • There is an overwhelming tendency for most people to put insufficient detail in such comments. Resist this tendency. You will almost always save time in the long run by putting extremely careful (“overly detailed”) notes in your data analyses.

slide-10
SLIDE 10
  • Advanced, modern systems (like R combined with Sweave) allow you to produce one file that contains a narrative about your data analysis, along with all the commands that produce your analysis! If you discover an error, you can redo months of work in seconds by simply rerunning the file.

slide-11
SLIDE 11

Dealing with Missing Data

Many basic multivariate procedures assume complete data. But many data sets have some missing values. What to do? One approach is to delete cases with missing data, and perform the analysis on the reduced data set. Another approach is to try to replace missing data to “fill out” the data set.

slide-12
SLIDE 12

Casewise Deletion

Often the default procedure is “casewise deletion.” Casewise deletion simply discards any case that has missing data on any of the variables in the analysis. Casewise deletion can, in many studies, reduce the operational sample size to only 70-75% of its initial value.
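
A small simulation shows how quickly the loss accumulates. The numbers are assumptions chosen for illustration, not figures from the slides: six variables, each missing independently with probability .05, and missing values coded as `None`. Under those assumptions the expected proportion of complete cases is .95 to the sixth power, about .735.

```python
import random


def casewise_delete(rows):
    """Keep only the cases with no missing value on any variable."""
    return [row for row in rows if all(v is not None for v in row)]


def simulate_rows(n_cases, n_vars, p_missing, seed=0):
    """Standard normal data in which each value is missing independently."""
    rng = random.Random(seed)
    return [[None if rng.random() < p_missing else rng.gauss(0, 1)
             for _ in range(n_vars)]
            for _ in range(n_cases)]


rows = simulate_rows(n_cases=10_000, n_vars=6, p_missing=0.05)
retained = len(casewise_delete(rows)) / len(rows)
expected = (1 - 0.05) ** 6  # probability that all six values are present
print(f"retained {retained:.3f}, expected about {expected:.3f}")
```

Even with only 5% missing on any single variable, roughly a quarter of the cases are discarded.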
slide-13
SLIDE 13

Missing Data Imputation

An alternative is to “impute” or reconstruct missing data values, using procedures that range from extraordinarily simple to very complex.

slide-14
SLIDE 14

Dangers of Casewise Deletion

Casewise deletion seems, at first glance, to be “unbiased” and “fair” since it does not involve “making up data” to replace the missing values. However, it can actually be quite dangerous, resulting in biased estimates that are far worse than any problem resulting from “making up data.”

slide-15
SLIDE 15

Missing Data Theory

When missing data are a problem in an analysis, we try to develop an understanding of an underlying mechanism that led to the data being missing. The classic reference is Little & Rubin (1987), Statistical Analysis with Missing Data. For a superb introductory chapter, see Chapter 3 of Frank Harrell’s Regression Modeling Strategies.

slide-16
SLIDE 16

Missing Data Theory

Several potential mechanisms for missing data can be discussed, and several technical terms are commonly employed to discuss them:

slide-17
SLIDE 17

MCAR (Missing Completely At Random)

  • Data are missing for reasons completely unrelated to the values of any variable in the study or characteristic of the subject, including the value of the missing data.
  • Examples: A sensor failed at random during a trial. A survey response was lost in the mail.
  • This is the easiest type of missing data to deal with.

slide-18
SLIDE 18

MAR (Missing at Random)

  • The probability that a value is missing depends on values of the variables that were actually measured.
  • Given the values of other observed variables, subjects having missing values are only randomly different from other subjects.

slide-19
SLIDE 19

MAR (Missing at Random)

  • Example (Harrell, 2001, p. 41). Consider a survey in which females are less likely to provide their personal income than males, but the likelihood of responding is independent of a woman’s actual income. If we have the sexes of all subjects, and we have income data for some females, then we still can construct unbiased income estimates.
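The income example can be simulated. The sketch below is an illustration under invented assumptions, none of which come from Harrell: mean incomes of 40 for females and 60 for males (arbitrary units), response rates of .5 and .9, and response independent of income within each sex. It compares the complete-case mean with an estimate that averages the within-sex means, weighted by each sex's share of the full sample.

```python
import random
from statistics import mean


def simulate_survey(n=20_000, seed=1):
    """Return (sex, true_income, reported_income_or_None) for each subject."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        sex = rng.choice(["F", "M"])
        income = rng.gauss(40.0 if sex == "F" else 60.0, 10.0)
        p_respond = 0.5 if sex == "F" else 0.9  # depends on sex only (MAR)
        reported = income if rng.random() < p_respond else None
        people.append((sex, income, reported))
    return people


def income_estimates(people):
    """True mean, complete-case mean, and sex-weighted mean of reported income."""
    true_mean = mean([income for _, income, _ in people])
    naive = mean([rep for _, _, rep in people if rep is not None])
    adjusted = 0.0
    for sex in ("F", "M"):
        share = sum(1 for s, _, _ in people if s == sex) / len(people)
        reported = [rep for s, _, rep in people if s == sex and rep is not None]
        adjusted += share * mean(reported)
    return true_mean, naive, adjusted


true_mean, naive, adjusted = income_estimates(simulate_survey())
print(f"true {true_mean:.2f}  complete-case {naive:.2f}  sex-weighted {adjusted:.2f}")
```

The complete-case mean is pulled toward the male mean because males are over-represented among responders. Conditioning on sex, which is observed for everyone, removes that distortion.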

slide-20
SLIDE 20

IM (Informative Missing)

  • Data are more likely to be missing if their true values are higher or lower.
  • Example: People with very high incomes in a survey are more likely to refuse to provide income data.
  • This is the most difficult kind of missing data to deal with.
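A companion simulation illustrates why informative missingness is harder. The settings are invented for illustration: incomes are normal with mean 50 and standard deviation 10, and people with income above 60 refuse with probability .8, against .1 for everyone else.

```python
import random
from statistics import mean


def simulate_informative_missing(n=20_000, seed=2):
    """Return (true_mean, observed_mean) when high incomes are often withheld."""
    rng = random.Random(seed)
    incomes, observed = [], []
    for _ in range(n):
        income = rng.gauss(50.0, 10.0)
        p_refuse = 0.8 if income > 60.0 else 0.1  # depends on the value itself
        incomes.append(income)
        if rng.random() >= p_refuse:
            observed.append(income)
    return mean(incomes), mean(observed)


true_mean, observed_mean = simulate_informative_missing()
print(f"true {true_mean:.2f}  observed {observed_mean:.2f}")
```

The observed mean falls below the true mean. Unlike the MAR case, no fully observed variable in this setup explains who is missing, so the bias cannot be removed by conditioning on observed data alone.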

slide-21
SLIDE 21

Modeling and Dealing with Missing Data

Some potential approaches and issues (Harrell, 2001, p. 45):

  • Imputation of missing values for one of the variables can ignore all other information. For example, missing values can be replaced with a constant such as the mean or the median of the nonmissing values on that variable.
  • Imputation can be based on information not otherwise used.
slide-22
SLIDE 22
  • Imputations can be based on information obtained only by analyzing interrelationships among the X’s.
  • Imputations can be based on relationships between X’s, and between X and Y.
  • Imputations can take into account the reason for non-response, if known.
  • Ignoring a known relationship between X and Y for non-missing variables during imputation can bias the regression coefficient toward zero.
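
The simplest approach in the list above, replacing missing values with a constant, can be sketched as follows. Missing values are assumed to be coded as `None`, and the function name `impute_constant` is hypothetical.

```python
from statistics import mean, median


def impute_constant(values, how="median"):
    """Replace each None with the mean or median of the nonmissing values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if how == "median" else mean(observed)
    return [fill if v is None else v for v in values]


print(impute_constant([1.0, None, 3.0, 8.0], how="median"))
print(impute_constant([1.0, None, 3.0, 8.0], how="mean"))
```

This uses no information from any other variable, which is exactly the limitation the first bullet describes.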

slide-23
SLIDE 23

Imputation Algorithms

Single Imputation of Conditional Means

For a single X that is unrelated to other X’s, the mean or median may be substituted with little loss of efficiency.

Multiple Imputation

Uses random draws from the estimated conditional distribution of the X value given the other X values and (possibly) Y.

slide-24
SLIDE 24

Usually these draws are repeated (and the analysis repeated) several times.
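
The difference between the two algorithms can be sketched in Python. This is a deliberately simplified illustration, not a full multiple imputation procedure: the conditional distribution of the incomplete variable is estimated by a straight-line regression on one fully observed variable, draws are normal around the fitted value, and the regression coefficients themselves are treated as fixed. All function names are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual standard deviation."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (sse / (len(xs) - 2)) ** 0.5


def _fit_on_complete(target, helper):
    pairs = [(h, t) for h, t in zip(helper, target) if t is not None]
    return fit_line([h for h, _ in pairs], [t for _, t in pairs])


def impute_conditional_mean(target, helper):
    """Single imputation: fill each None with its predicted value."""
    a, b, _ = _fit_on_complete(target, helper)
    return [a + b * h if t is None else t for h, t in zip(helper, target)]


def multiply_impute(target, helper, m=5, seed=0):
    """m completed data sets, each with random draws around the prediction."""
    rng = random.Random(seed)
    a, b, s = _fit_on_complete(target, helper)
    return [[rng.gauss(a + b * h, s) if t is None else t
             for h, t in zip(helper, target)]
            for _ in range(m)]


helper = [1, 2, 3, 4, 5, 6]
target = [2.1, 3.9, None, 8.2, 9.8, None]
single = impute_conditional_mean(target, helper)
datasets = multiply_impute(target, helper)
# Repeat the analysis (here, just a mean) on each data set, then average.
pooled_mean = mean([mean(d) for d in datasets])
print(single, pooled_mean)
```

The variation among the `m` analysis results reflects uncertainty due to the missing values, which a single imputation hides.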

slide-25
SLIDE 25

General Guidelines

Proportion of missing ≤ .05: It doesn’t matter much how you impute missing values or whether you adjust the variance of regression coefficient estimates for missingness. Casewise deletion analysis is a reasonable option.

slide-26
SLIDE 26

Proportion of missing .05 to .15: If a predictor is unrelated to other predictors, you can use a reasonable constant value; otherwise impute using a customized model to predict the predictor from all other predictors. Variance estimates of coefficients may need to be adjusted.

Proportion of missing greater than .15: Same as above, but the need to adjust variance estimates is even greater.
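
These guidelines amount to a small decision table, which can be written down as a Python sketch. The function names and the wording of the returned strings are hypothetical, and whether a predictor is "related to other predictors" is a judgment the analyst must supply.

```python
def proportion_missing(values):
    """Share of entries coded as None."""
    return sum(1 for v in values if v is None) / len(values)


def imputation_guideline(prop_missing, related_to_other_predictors):
    """Map a proportion of missing values to the guideline for that range."""
    if not 0.0 <= prop_missing <= 1.0:
        raise ValueError("prop_missing must lie between 0 and 1")
    if prop_missing <= 0.05:
        return "any imputation method, or casewise deletion"
    method = ("customized model from the other predictors"
              if related_to_other_predictors else "reasonable constant")
    if prop_missing <= 0.15:
        return method + "; variance estimates may need adjustment"
    return method + "; variance adjustment needed even more"


income = [52.0, None, 47.5, 61.0, None, 39.0, 58.5, 44.0, 50.0, 63.5]
print(imputation_guideline(proportion_missing(income), related_to_other_predictors=True))
```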

slide-27
SLIDE 27

Generating a Correlation Matrix with Pairwise Deletion

One common procedure is to compute each correlation on the complete cases for those two variables only. This procedure generates correlations that are based generally on larger sample sizes than would be obtained with casewise deletion. One problem is that a matrix of such correlations may not be “positive definite.”
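
The positive definiteness problem is easy to reproduce. The Python sketch below uses a contrived data set, invented for illustration, in which each pair of variables is observed on a different block of cases. Positive definiteness is tested by attempting a Cholesky factorization, which succeeds only for positive definite matrices. Missing values are coded as `None`, and all names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(rows):
    """Correlation matrix in which each entry uses the cases complete for that pair."""
    p = len(rows[0])
    r = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            pairs = [(row[i], row[j]) for row in rows
                     if row[i] is not None and row[j] is not None]
            r[i][j] = r[j][i] = pearson([a for a, _ in pairs], [b for _, b in pairs])
    return r


def is_positive_definite(matrix, tol=1e-10):
    """True if a Cholesky factorization exists with all pivots above tol."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= tol:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True


rows = [
    (1, 1, None), (2, 2, None), (3, 3, None),   # x1 and x2 observed
    (1, None, 1), (2, None, 2), (3, None, 3),   # x1 and x3 observed
    (None, 1, 3), (None, 2, 2), (None, 3, 1),   # x2 and x3 observed
]
r = pairwise_correlations(rows)
print(r, is_positive_definite(r))
```

Here x1 correlates perfectly and positively with both x2 and x3, yet x2 and x3 correlate perfectly and negatively. No single complete data set could produce that pattern, and the matrix fails the test. In this contrived example, casewise deletion would leave no cases at all.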