SLIDE 1
Data Screening and Missing Value Analysis
James H. Steiger

Theorem (The Fundamental Theorem of Modern Data Analysis). Storage media are cheap. Data are expensive.
SLIDE 2
SLIDE 3
Corollary 1. In some cases, the data are priceless.
Corollary 2. You will suffer a major data loss event. There are many ways this can happen. The only questions are (a) when, and (b) whether the data will be restorable from a backup.
SLIDE 4
Don’t Lose Data
- Own a CD-DVD burner.
- Have a backup plan. Back up all your research
related manuscripts and data files every week.
- Every time you work on a data file, start by
saving it under a new name, according to a precise code, like this: PROJECT5_2006_01_30_A.SAV
SLIDE 5
- After several weeks, you will have a succession
of data files that will allow you to restore to any
earlier time. This can be useful in case one of your “data analysis” sessions is discovered, after the fact, to be a “data destruction” session!
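The dated naming scheme above is easy to automate. A minimal Python sketch (the function name is illustrative, not from the slides):

```python
from datetime import date

def versioned_name(project, revision="A", ext="SAV", when=None):
    """Build a dated file name in the slide's style, e.g. PROJECT5_2006_01_30_A.SAV."""
    when = when or date.today()
    return f"{project}_{when:%Y_%m_%d}_{revision}.{ext}"

print(versioned_name("PROJECT5", "A", when=date(2006, 1, 30)))
# PROJECT5_2006_01_30_A.SAV
```

Bumping the trailing revision letter within a day, and the date across days, yields the restorable succession of files described above.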
SLIDE 6
Don’t Throw Away Data
- Never dichotomize or categorize continuous
data. You are simply throwing away
information and increasing error variance.
- Use missing data imputation to replace missing
data, rather than deleting data casewise.
- As data become available, screen them
immediately for univariate and multivariate
outliers and missing values. You may be able to
recover some lost information and correct recording errors.
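A crude sketch of such a screening pass for one variable, using a z-score rule for univariate outliers (the function name and the |z| cutoff are illustrative; multivariate outliers need something like Mahalanobis distance and are not shown):

```python
from statistics import mean, stdev

def screen(values, z_cut=3.0):
    """Return indices of missing values and of univariate outliers (|z| > z_cut)."""
    missing = [i for i, v in enumerate(values) if v is None]
    observed = [v for v in values if v is not None]
    m, s = mean(observed), stdev(observed)
    outliers = [i for i, v in enumerate(values)
                if v is not None and abs(v - m) / s > z_cut]
    return missing, outliers

scores = [12, 14, 11, None, 13, 99, 12, 15]
print(screen(scores, 2.0))  # ([3], [5]): one missing case, one likely recording error
```

Flagged cases (like the 99 above) can then be checked against the original records before any analysis is run.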
SLIDE 7
Plan Your Data Files Carefully
- Add descriptive information to the variable
names
- Use long variable names
- Remember that multivariate data are indeed
multivariate! Never destroy the multivariate structure by splitting the file unless you make very careful use of unique ID values so that you can reconstruct the complete file.
SLIDE 8
Document All Analysis Activities
- You will be amazed how quickly you will
forget what you did during a complex data analysis.
- Yet, what you did determines the appropriate
probability model for analyzing your data! (Not all textbook authors recognize this, and we will comment extensively on this later in the course.)
- Analysis documentation can be done in SPSS
by logging commands and by placing text fields
SLIDE 9
with notes about the analysis in the SPSS
output files you create.
- There is an overwhelming tendency for most
people to put insufficient detail in such
comments. Resist this tendency. You will
almost always save time in the long run by putting extremely careful (“overly detailed”) notes in your data analyses.
SLIDE 10
- Advanced, modern systems (like R combined
with Sweave) allow you to produce one file that contains a narrative about your data analysis, along with all the commands that produce your analysis! If you discover an error, you can redo months of work in seconds by simply rerunning the file.
SLIDE 11
Dealing with Missing Data Many basic multivariate procedures assume complete data. But many data sets have some missing values. What to do? One approach is to delete cases with missing data, and perform the analysis on the reduced data set. Another approach is to try to replace missing data to “fill out” the data set.
SLIDE 12
Casewise Deletion Often the default procedure is “casewise deletion.” Casewise deletion simply discards any case that has missing data on any of the variables in the
analysis. Casewise deletion can, in many studies,
reduce the operational sample size to only 70-75%
of its initial value.
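The sample-size cost is easy to see in a toy example. A minimal sketch (pure Python; the function name is illustrative):

```python
def casewise_delete(rows):
    """Keep only cases with no missing (None) entries on any variable."""
    return [r for r in rows if None not in r]

data = [
    (1.2, 3.4, 5.6),
    (2.1, None, 4.4),   # dropped: one variable missing
    (0.9, 2.8, None),   # dropped: one variable missing
    (1.7, 3.0, 5.1),
]
complete = casewise_delete(data)
print(len(complete) / len(data))  # 0.5: half the sample survives
```

Two isolated missing values, on different variables, cost half the cases here; with many variables, even sparse missingness compounds in exactly this way.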
SLIDE 13
Missing Data Imputation An alternative is to “impute” or reconstruct missing data values, using procedures that range from extraordinarily simple to very complex.
SLIDE 14
Dangers of Casewise Deletion Casewise deletion seems, at first glance, to be “unbiased” and “fair” since it does not involve “making up data” to replace the missing values. However, it can actually be quite dangerous, resulting in biased estimates that are far worse than any problem resulting from “making up data.”
SLIDE 15
Missing Data Theory When missing data are a problem in an analysis, we try to develop an understanding of an underlying mechanism that led to the data being missing. The classic reference is Little & Rubin (1987), Statistical Analysis with Missing Data. For a superb introductory chapter, see Chapter 3 of Frank Harrell’s Regression Modeling Strategies.
SLIDE 16
Missing Data Theory Several potential mechanisms for missing data can be discussed, and several technical terms are commonly employed to discuss them:
SLIDE 17
MCAR (Missing Completely At Random)
- Data are missing for reasons completely
unrelated to the values of any variable in the study or characteristic of the subject, including the value of the missing data.
- Examples: A sensor failed at random during a
trial. A survey response was lost in the mail.
- This is the easiest type of missing data to deal
with.
SLIDE 18
MAR (Missing at Random)
- The probability that a value is missing depends
on values of the variables that were actually
measured.
- Given the values of other observed variables,
subjects having missing values are only randomly different from other subjects.
SLIDE 19
MAR (Missing at Random)
- Example (Harrell, 2001, p.41). Consider a
survey in which females are less likely to provide their personal income than males, but the likelihood of responding is independent of a woman’s actual income. If we have the sexes of all subjects, and we have income data for some females, then we still can construct unbiased income estimates.
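Harrell's income example suggests a simple conditional imputation: fill each missing income with the mean income of the respondent's own sex, which is unbiased under MAR given sex. A pure-Python sketch (function and field names are illustrative):

```python
from statistics import mean

def impute_by_group(records, group_key, value_key):
    """Replace missing values with the mean of the case's own group
    (reasonable under MAR, conditional on the grouping variable)."""
    means = {}
    for g in {r[group_key] for r in records}:
        obs = [r[value_key] for r in records
               if r[group_key] == g and r[value_key] is not None]
        means[g] = mean(obs)
    return [dict(r, **{value_key: means[r[group_key]]})
            if r[value_key] is None else r
            for r in records]

people = [
    {"sex": "F", "income": 40.0},
    {"sex": "F", "income": None},   # filled with the female mean, not the overall mean
    {"sex": "F", "income": 50.0},
    {"sex": "M", "income": 60.0},
    {"sex": "M", "income": 70.0},
]
filled = impute_by_group(people, "sex", "income")
print(filled[1]["income"])  # 45.0
```

Using the overall mean instead (53.33 here) would drag the imputed female incomes toward the male values, which is exactly the bias MAR-based conditioning avoids.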
SLIDE 20
IM (Informative Missing)
- Data are more likely to be missing if their true
values are higher or lower.
- Example: People with very high incomes in a
survey are more likely to refuse to provide income data.
- This is the most difficult kind of missing data to
deal with.
SLIDE 21
Modeling and Dealing with Missing Data Some potential approaches and issues (Harrell, 2001, p. 45):
- Imputation of missing values for one of the
variables can ignore all other information. For example, missing values can be replaced with a constant such as the mean or the median of nonmissing values on that variable.
- Imputation can be based on information not
otherwise used.
SLIDE 22
- Imputations can be based on information
obtained only by analyzing interrelationships
among the X’s.
- Imputations can be based on relationships
between X’s, and between X and Y.
- Imputations can take into account the reason for
non-response, if known.
- Ignoring known relationship between X and Y
for non-missing variables during imputation can bias the regression coefficient toward zero.
SLIDE 23
Imputation Algorithms Single Imputation of Conditional Means For a single X that is unrelated to other X’s, the mean or median may be substituted with little loss
of efficiency.
Multiple Imputation Uses random draws from the estimated conditional distribution of the X value given the other X values
SLIDE 24
and (possibly) Y. Usually these draws are repeated (and the analysis repeated) several times.
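A deliberately crude sketch of the repeated-draw idea (a real multiple-imputation procedure draws from the conditional distribution given the other X's and possibly Y, and pools estimates with between-imputation variance; here each missing value is simply drawn from a normal fitted to the observed values of that one variable):

```python
import random
from statistics import mean, stdev

def multiple_impute(values, m=5, seed=0):
    """Produce m completed data sets, filling each missing value with a
    random draw from a normal fitted to the observed values (crude sketch)."""
    rng = random.Random(seed)
    obs = [v for v in values if v is not None]
    mu, sigma = mean(obs), stdev(obs)
    return [[rng.gauss(mu, sigma) if v is None else v for v in values]
            for _ in range(m)]

data = [10.0, None, 12.0, 11.0, None, 13.0]
sets = multiple_impute(data)
# Analyze each completed set, then pool the estimates across the m analyses.
pooled = mean(mean(s) for s in sets)
print(round(pooled, 2))
```

The key point the slide makes survives even in this sketch: the analysis is run once per completed data set, and the m results are then combined, so imputation uncertainty is carried into the final estimates rather than hidden.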
SLIDE 25
General Guidelines Proportion of missing ≤ .05: Doesn’t matter much how you impute missings or whether you adjust variance of regression coefficient estimates for missingness. Casewise deletion analysis is a reasonable option.
SLIDE 26
Proportion of missing .05 to .15: If predictor is unrelated to other predictors, you can use a reasonable constant value; otherwise, impute using a customized model to predict the predictor from all other predictors. Variance estimates of coefficients may need to be adjusted.
Proportion of missing greater than .15: Same as above, but the need to adjust variance estimates is even greater.
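These rules of thumb can be written down as a small decision function (the thresholds follow the slide; the function name and wording are illustrative):

```python
def imputation_advice(prop_missing):
    """Rule-of-thumb guidance keyed to the proportion of missing values."""
    if prop_missing <= 0.05:
        return "any simple imputation; casewise deletion is a reasonable option"
    if prop_missing <= 0.15:
        return ("constant if the predictor is unrelated to the others, else "
                "model-based imputation; consider adjusting coefficient variances")
    return "model-based imputation; variance adjustment is even more important"

print(imputation_advice(0.03))
```

The proportion is computed per variable, so a data set can fall under different rows of the guideline for different predictors.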
SLIDE 27
Generating a Correlation Matrix with Pairwise Deletion One common procedure is to compute each correlation on the complete cases for those two variables only. This procedure generates correlations that are based generally on larger sample sizes than would be obtained with casewise
deletion. One problem is that a matrix of such