Dealing with missing values, part 2
Applied Multivariate Statistics - Spring 2012
Overview
- More on Single Imputation: Shortcomings
- Multiple Imputation: Accounting for uncertainty
- Appl. Multivariate Statistics - Spring 2012
Single Imputation
- Unconditional Mean
- Unconditional Distribution
- Conditional Mean
- Conditional Distribution
Ordered from easy / inaccurate (top) to hard / accurate (bottom)
Example: Blood Pressure - Revisited
- Blood pressure of 30 participants, measured in January (X) and February (Y)
- MCAR: Delete 23 Y values at random
- MAR: Keep Y only where X > 140 (follow-up)
- MNAR: Record Y only where Y > 140 (test everybody again but only keep the values of critical participants)
Example: Blood Pressure
Black points are missing (MAR)
Unconditional Mean
+ Mean of Y ok
- Variance of Y wrong (too small: all imputed values are identical)
Unconditional Distribution
+ Mean of Y ok, Variance better
- Correlation between X and Y wrong
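The two unconditional schemes above can be compared on toy data. This is a minimal Python sketch, not the lecture's blood pressure data: the sample size, the model coefficients and the 50% MCAR rate are assumptions made for illustration.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
n = 2000  # assumed large toy sample so the effects are clearly visible

# Toy data with assumed parameters (not the lecture's data set)
x = [rng.gauss(130, 20) for _ in range(n)]
y = [60 + 0.5 * xi + rng.gauss(0, 10) for xi in x]

# MCAR: each Y value is missing with probability 0.5
missing = [rng.random() < 0.5 for _ in range(n)]
y_obs = [yi for yi, m in zip(y, missing) if not m]

# Unconditional mean: every missing Y gets the mean of the observed Ys
mean_obs = statistics.fmean(y_obs)
y_mean = [mean_obs if m else yi for yi, m in zip(y, missing)]

# Unconditional distribution: every missing Y is drawn from the observed Ys
y_draw = [rng.choice(y_obs) if m else yi for yi, m in zip(y, missing)]

var_true = statistics.variance(y)
var_mean = statistics.variance(y_mean)
var_draw = statistics.variance(y_draw)
cor_true = pearson(x, y)
cor_draw = pearson(x, y_draw)

print("variance: complete", var_true, "mean-imputed", var_mean, "draw-imputed", var_draw)
print("correlation: complete", cor_true, "draw-imputed", cor_draw)
```

Mean imputation shrinks the variance of Y, while random draws roughly restore it but weaken the correlation with X, because the drawn values do not depend on X.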
Conditional Mean
+ Conditional Mean of Y ok
+ Correlation ok
- (Conditional) Variance wrong
Imputation model: Y = 84 + 0.3*X
Conditional Distribution
+ Conditional Mean of Y ok
+ Correlation ok
+ Conditional Variance of Y ok
Imputation model: Y = 84 + 0.3*X + e, with e ~ N(0, 23^2)
95%-CI for the intercept (estimate 84): [-234; 402]
95%-CI for the slope (estimate 0.3): [-1.7; 2.4]
Problem: We ignore this uncertainty in the estimated coefficients
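The two conditional schemes can be sketched in the same way. Again this is a Python illustration on toy data with assumed parameters; `fit_line` is a hypothetical helper written for this sketch, not a function from any package.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept a, slope b and residual sd for y = a + b*x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    sigma = (rss / (len(xs) - 2)) ** 0.5
    return a, b, sigma

rng = random.Random(2)
n = 2000  # assumed toy sample size

# Toy data with assumed parameters (not the lecture's data set)
x = [rng.gauss(130, 20) for _ in range(n)]
y = [60 + 0.5 * xi + rng.gauss(0, 10) for xi in x]
missing = [rng.random() < 0.5 for _ in range(n)]  # MCAR

# Fit the imputation model on the complete cases only
x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]
a, b, sigma = fit_line(x_obs, y_obs)

# Conditional mean: impute the fitted value a + b*x
y_cmean = [a + b * xi if m else yi for xi, yi, m in zip(x, y, missing)]

# Conditional distribution: fitted value plus a random residual e ~ N(0, sigma^2)
y_cdist = [a + b * xi + rng.gauss(0, sigma) if m else yi
           for xi, yi, m in zip(x, y, missing)]

var_true = statistics.variance(y)
var_cmean = statistics.variance(y_cmean)
var_cdist = statistics.variance(y_cdist)
print("variance: complete", var_true, "cond. mean", var_cmean, "cond. distribution", var_cdist)
```

Both schemes treat the fitted (a, b, sigma) as if they were the true values, which is exactly the uncertainty that single imputation ignores.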
Problem of Single Imputation
- Too optimistic: the imputation model (e.g. a and b in Y = a + bX) is only estimated; it is not the true model
- Thus, imputed values have some uncertainty
- Single Imputation ignores this uncertainty
- Coverage probability of confidence intervals is wrong
- Solution: Multiple Imputation, which incorporates both
  - residual error
  - model uncertainty (excluding model mis-specification)
Multiple Imputation: Idea
- Impute several times
- Do the standard analysis for each imputed data set; get estimate and std. error
- Aggregate results
Multiple Imputation: Idea
- Need special imputation schemes that include both
  - uncertainty of residuals
  - uncertainty of model (e.g. values of intercept a and slope b)
- Rough idea:
  - Fill in random values
  - Iteratively predict values for each variable until some convergence is reached (as in missForest)
  - Sample values for residuals AND for (a,b)
  - Gibbs sampler is used
- Excellent for intuition (by one of the big guys in the field):
  http://sites.stat.psu.edu/~jls/mifaq.html
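The idea of sampling both the residuals and (a, b) can be sketched for a single incomplete variable. Note the swapped-in technique: this Python sketch approximates the draw of (a, b) by refitting on a bootstrap resample of the observed cases, which is not the Gibbs sampler used by mice. The data, the 40% missingness rate and the helper names `fit_line` and `impute_once` are assumptions for illustration.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept a, slope b and residual sd for y = a + b*x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    sigma = (rss / (len(xs) - 2)) ** 0.5
    return a, b, sigma

def impute_once(x, y, rng):
    """Return one completed copy of y; None marks a missing value."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    # Model uncertainty: refit (a, b) on a bootstrap resample of the observed cases
    boot = [rng.choice(obs) for _ in obs]
    a, b, sigma = fit_line([p[0] for p in boot], [p[1] for p in boot])
    # Residual uncertainty: add a random residual to every predicted value
    return [a + b * xi + rng.gauss(0, sigma) if yi is None else yi
            for xi, yi in zip(x, y)]

rng = random.Random(3)
n = 200  # assumed toy sample size

# Toy data with assumed parameters; about 40% of Y is set to missing (MCAR)
x = [rng.gauss(130, 20) for _ in range(n)]
y_full = [60 + 0.5 * xi + rng.gauss(0, 10) for xi in x]
y = [None if rng.random() < 0.4 else yi for yi in y_full]

m = 5  # number of imputations
completed = [impute_once(x, y, rng) for _ in range(m)]

# Standard analysis on each completed data set, here simply the mean of Y
means = [statistics.fmean(c) for c in completed]
print(means)
```

The m estimates differ from each other; that spread is the between-imputation variability that the aggregation step uses.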
Multiple Imputation: Intuition
Predict missing values accounting for
- Uncertainty of residuals
- Uncertainty of parameter estimates
Multiple Imputation: Gibbs sampler (Not for exam)
- Iteration t; repeat until convergence:
  For each variable i:
  $\theta_i^{*(t)} \sim P(\theta_i \mid Y_i^{obs}, Y_{-i}^{(t)})$
  $Y_i^{*(t)} \sim P(Y_i \mid Y_i^{obs}, Y_{-i}^{(t)}, \theta_i^{*(t)})$
  where $Y_i^{(t)} = (Y_i^{obs}, Y_i^{*(t)})$
- Intuition:
  - First step: sample (a,b)
  - Second step: predict missings using y = a + bx + e
R package: mice (Multivariate Imputation by Chained Equations)
- mice has good default settings; don't worry about the data type
- Defaults for data types of columns:
  - numeric: Predictive Mean Matching (pmm)
    (like fancy linear regression; faster alternative: linear regression)
  - factor, 2 lev: Logistic Regression (logreg)
  - factor, >2 lev: Multinomial logit model (polyreg)
  - ordered, >2 lev: Ordered logit model (polr)
Aggregation of estimates
- $\hat{Q}_i$: Estimate of imputation i
  $U_i$: Variance of estimate (= square of std. error)
- Assume: $(\hat{Q} - Q) / \sqrt{U} \approx N(0, 1)$
- Average estimate: $\bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}_j$
- Within-imputation variance: $\bar{U} = \frac{1}{m} \sum_{j=1}^{m} U_j$
- Between-imputation variance: $B = \frac{1}{m-1} \sum_{j=1}^{m} (\hat{Q}_j - \bar{Q})^2$
- Total variance: $T = \bar{U} + \left( 1 + \frac{1}{m} \right) B$
- Approximately: $(\bar{Q} - Q) / \sqrt{T} \sim t_\nu$ with $\nu = (m - 1) \left( 1 + \frac{m \bar{U}}{(1 + m) B} \right)^2$
- 95%-CI: $\bar{Q} \pm t_{\nu; 0.975} \sqrt{T}$
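The aggregation formulas can be checked by hand with a few lines of Python. The function name `pool_estimates` and the input numbers are hypothetical; this is a sketch of the formulas, not the `pool` function of mice. It stops before the confidence interval because the t quantile has to come from a table or a statistics library, and it assumes B > 0.

```python
import statistics

def pool_estimates(q, u):
    """Pool m estimates q with variances u (assumes the estimates are not all equal)."""
    m = len(q)
    q_bar = statistics.fmean(q)    # average estimate
    u_bar = statistics.fmean(u)    # within-imputation variance
    b = statistics.variance(q)     # between-imputation variance (divides by m - 1)
    t = u_bar + (1 + 1 / m) * b    # total variance
    nu = (m - 1) * (1 + m * u_bar / ((1 + m) * b)) ** 2  # degrees of freedom
    return q_bar, u_bar, b, t, nu

# Hypothetical estimates and squared standard errors from m = 5 imputed data sets
q = [10.0, 12.0, 11.0, 13.0, 9.0]
u = [4.0, 4.0, 4.0, 4.0, 4.0]

q_bar, u_bar, b, t, nu = pool_estimates(q, u)
print(q_bar, u_bar, b, t, nu)
```

For these inputs the average estimate is 11, the within-imputation variance is 4, the between-imputation variance is 2.5, the total variance is 7, and the degrees of freedom are about 21.8.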
Multiple Imputation with MICE
Do manually if you have a non-standard analysis
How much uncertainty due to missings?
- Relative increase in variance due to nonresponse: $r = \frac{(1 + \frac{1}{m}) B}{\bar{U}}$
- Fraction (or rate) of missing information fmi: $fmi = \frac{r + \frac{2}{\nu + 3}}{r + 1}$
  (!! Not the same as fraction of missing OBSERVATIONS)
- Proportion of the total variance that is attributed to the missing data: $\lambda = \frac{(1 + \frac{1}{m}) B}{T}$
Returned by mice
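The three quantities can be computed directly from the within- and between-imputation variances. The function name `missing_info` and the input values are hypothetical, and the sketch uses the degrees of freedom formula from the aggregation slide; it is not the output routine of mice.

```python
def missing_info(u_bar, b, m):
    """Return r, fmi and lambda from within-variance u_bar, between-variance b, m imputations."""
    t = u_bar + (1 + 1 / m) * b                          # total variance
    r = (1 + 1 / m) * b / u_bar                          # relative increase in variance
    nu = (m - 1) * (1 + m * u_bar / ((1 + m) * b)) ** 2  # degrees of freedom
    fmi = (r + 2 / (nu + 3)) / (r + 1)                   # fraction of missing information
    lam = (1 + 1 / m) * b / t                            # share of total variance due to missing data
    return r, fmi, lam

# Hypothetical inputs: within-variance 4, between-variance 2.5, m = 5 imputations
r, fmi, lam = missing_info(4.0, 2.5, 5)
print(r, fmi, lam)
```

With these inputs r is 0.75, fmi is about 0.475 and lambda is about 0.429, so fmi and lambda are close but not identical.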
How many imputations?
- Surprisingly few!
- Efficiency compared to $m = \infty$ depends on fmi: $eff = \left( 1 + \frac{fmi}{m} \right)^{-1}$
- Examples (eff in %):
  m    fmi=0.1  fmi=0.3  fmi=0.5  fmi=0.7  fmi=0.9
  3    97       91       86       81       77
  5    98       94       91       88       85
  10   99       97       95       93       92
  20   100      99       98       97       96
  Slide annotations on the table: Oftentimes OK; Perfect!
Rule of thumb:
- Preliminary analysis: m = 5
- Paper: m = 20 or even m = 50
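The example efficiencies follow directly from the efficiency formula. A short Python sketch that recomputes them, with the function name `efficiency` chosen for illustration:

```python
def efficiency(fmi, m):
    """Relative efficiency of m imputations compared to infinitely many."""
    return 1 / (1 + fmi / m)

fmis = [0.1, 0.3, 0.5, 0.7, 0.9]

# Efficiency in percent, rounded to whole numbers, for each number of imputations m
table = {m: [round(100 * efficiency(f, m)) for f in fmis] for m in (3, 5, 10, 20)}

for m, row in table.items():
    print(m, row)
```

Even with fmi = 0.5, five imputations already reach about 91% efficiency, which is why few imputations are often enough for a preliminary analysis.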
Concepts to know
- Idea of mice
- How to aggregate results from imputed data sets?
- How many imputations?
R functions to know
- mice, with, pool
Next time
- Multidimensional Scaling
- Distance metrics