Dealing with missing values part 2 - Applied Multivariate Statistics

SLIDE 1

Dealing with missing values – part 2

Applied Multivariate Statistics – Spring 2012

SLIDE 2

Overview

  • More on Single Imputation: Shortcomings
  • Multiple Imputation: Accounting for uncertainty

  • Appl. Multivariate Statistics - Spring 2012

SLIDE 3

Single Imputation

Methods, ordered from easy but inaccurate to hard but accurate:

  • Unconditional Mean
  • Unconditional Distribution
  • Conditional Mean
  • Conditional Distribution

SLIDE 4

Example: Blood Pressure - Revisited

  • 30 participants measured in January (X) and February (Y)
  • MCAR: Delete 23 Y values randomly
  • MAR: Keep Y only where X > 140 (follow-up)
  • MNAR: Record Y only where Y > 140 (test everybody again, but keep only the values of critical participants)

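The three missingness mechanisms of the blood pressure example can be sketched in a few lines of Python. This is only an illustration: the simulated readings (mean 125, standard deviation 25, and the relation between X and Y) and all function names are assumptions, not the data used in the lecture.

```python
import random

def simulate_blood_pressure(n=30, seed=1):
    """Simulate January (x) and February (y) readings; all numbers are made up."""
    rng = random.Random(seed)
    x = [rng.gauss(125, 25) for _ in range(n)]
    y = [84 + 0.3 * xi + rng.gauss(0, 23) for xi in x]
    return x, y

def mcar(y, n_delete, seed=2):
    """MCAR: delete n_delete values of y completely at random."""
    rng = random.Random(seed)
    drop = set(rng.sample(range(len(y)), n_delete))
    return [None if i in drop else yi for i, yi in enumerate(y)]

def mar(x, y, threshold=140):
    """MAR: keep y only where the (always observed) x exceeds the threshold."""
    return [yi if xi > threshold else None for xi, yi in zip(x, y)]

def mnar(y, threshold=140):
    """MNAR: keep y only where y itself exceeds the threshold."""
    return [yi if yi > threshold else None for yi in y]

x, y = simulate_blood_pressure()
y_mcar = mcar(y, n_delete=23)   # None marks a missing value
y_mar = mar(x, y)
y_mnar = mnar(y)
```

The only difference between the mechanisms is what the decision to delete depends on: nothing (MCAR), the observed X (MAR), or the possibly unobserved Y itself (MNAR).
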
SLIDE 5

Example: Blood Pressure

[Figure: blood pressure example data. Black points are missing (MAR).]

SLIDE 6

Unconditional Mean

Replace every missing Y by the mean of the observed Y values.

+ Mean of Y ok
- Variance of Y wrong (too small: all imputed values are identical)

SLIDE 7

Unconditional Distribution

Replace every missing Y by a random draw from the distribution of the observed Y values.

+ Mean of Y ok, Variance better
- Correlation between X and Y wrong (imputed values ignore X)

SLIDE 8

Conditional Mean

Replace every missing Y by its prediction from the regression of Y on X:

Y = 84 + 0.3*X

+ Conditional Mean of Y ok
+ Correlation ok
- (Conditional) Variance wrong (imputed values lie exactly on the regression line)

SLIDE 9

Conditional Distribution

Replace every missing Y by its regression prediction plus a random residual:

Y = 84 + 0.3*X + e,  e ~ N(0, 23^2)

+ Conditional Mean of Y ok
+ Correlation ok
+ Conditional Variance of Y ok

SLIDE 10

Conditional Distribution

Y = 84 + 0.3*X + e,  e ~ N(0, 23^2)

95%-CI for the intercept (84): [-234; 402]
95%-CI for the slope (0.3): [-1.7; 2.4]

Problem: We ignore the uncertainty of these estimates
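To compare the four single imputation methods side by side, a minimal Python sketch is given here. The data are simulated under made-up settings, the function names `fit_line` and `impute` are hypothetical, and "unconditional distribution" is implemented as a random draw from the observed Y values, which is one possible reading of that method.

```python
import random
import statistics

METHODS = (
    "unconditional mean",
    "unconditional distribution",
    "conditional mean",
    "conditional distribution",
)

def fit_line(xs, ys):
    """Least-squares intercept a, slope b and residual sd for y = a + b*x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (len(xs) - 2)) ** 0.5

def impute(x, y, method, rng):
    """Return a completed copy of y, where None marks a missing value."""
    x_obs = [xi for xi, yi in zip(x, y) if yi is not None]
    y_obs = [yi for yi in y if yi is not None]
    a, b, sd = fit_line(x_obs, y_obs)

    def fill(xi):
        if method == "unconditional mean":
            return statistics.fmean(y_obs)
        if method == "unconditional distribution":
            return rng.choice(y_obs)              # draw from the observed Y values
        if method == "conditional mean":
            return a + b * xi                     # point on the fitted line
        if method == "conditional distribution":
            return a + b * xi + rng.gauss(0, sd)  # fitted line plus residual noise
        raise ValueError(f"unknown method: {method}")

    return [fill(xi) if yi is None else yi for xi, yi in zip(x, y)]

# Simulated data (assumed settings), with about half of Y deleted at random.
rng = random.Random(0)
x = [rng.gauss(125, 25) for _ in range(200)]
y = [84 + 0.3 * xi + rng.gauss(0, 23) for xi in x]
y = [yi if rng.random() < 0.5 else None for yi in y]

for method in METHODS:
    completed = impute(x, y, method, rng)
    print(f"{method:27s} mean={statistics.fmean(completed):6.1f} "
          f"sd={statistics.stdev(completed):5.1f}")
```

All four variants treat the fitted a, b and residual sd as if they were the true values. That is exactly the uncertainty that single imputation ignores.
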

SLIDE 11

Problem of Single Imputation

  • Too optimistic: The imputation model (e.g. Y = a + bX) is only estimated; it is not the true model
  • Thus, imputed values have some uncertainty
  • Single Imputation ignores this uncertainty
  • Coverage probability of confidence intervals is wrong (the intervals are too narrow)
  • Solution: Multiple Imputation

Multiple Imputation incorporates both

  • residual error
  • model uncertainty (excluding model mis-specification)

SLIDE 12

Multiple Imputation: Idea

  • Impute several times
  • Do standard analysis for each imputed data set; get estimate and std. error
  • Aggregate results

SLIDE 13

Multiple Imputation: Idea

  • Need special imputation schemes that include both
      • uncertainty of residuals
      • uncertainty of model (e.g. values of intercept a and slope b)
  • Rough idea:
      • Fill in random values
      • Iteratively predict values for each variable until some convergence is reached (as in missForest)
      • Sample values for residuals AND for (a,b)
  • Gibbs sampler is used
  • Excellent for intuition (by one of the big guys in the field): http://sites.stat.psu.edu/~jls/mifaq.html

SLIDE 14

Multiple Imputation: Intuition

Predict missing values accounting for

  • Uncertainty of residuals
  • Uncertainty of parameter estimates
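
One way to account for both sources of uncertainty is sketched below in Python. The sketch assumes a normal linear model y = a + b*x + e with a noninformative prior, so that the residual sd, the slope and the level of the line are drawn from their posterior before the missing values are predicted. This is an illustration under those assumptions, not the code of any particular package, and the name `draw_imputation` is hypothetical.

```python
import random
import statistics

def draw_imputation(x, y, rng):
    """One completed copy of y (None = missing) from a Bayesian linear model."""
    x_obs = [xi for xi, yi in zip(x, y) if yi is not None]
    y_obs = [yi for yi in y if yi is not None]
    n = len(y_obs)
    mx, my = statistics.fmean(x_obs), statistics.fmean(y_obs)
    sxx = sum((xi - mx) ** 2 for xi in x_obs)
    b_hat = sum((xi - mx) * (yi - my) for xi, yi in zip(x_obs, y_obs)) / sxx
    rss = sum((yi - my - b_hat * (xi - mx)) ** 2 for xi, yi in zip(x_obs, y_obs))

    # Uncertainty of parameter estimates: draw sigma, then slope and level.
    sigma = (rss / rng.gammavariate((n - 2) / 2, 2)) ** 0.5  # RSS / chi-square(n - 2)
    b = rng.gauss(b_hat, sigma / sxx ** 0.5)
    level = rng.gauss(my, sigma / n ** 0.5)  # height of the line at x = mx

    # Uncertainty of residuals: add a fresh residual to every prediction.
    return [level + b * (xi - mx) + rng.gauss(0, sigma) if yi is None else yi
            for xi, yi in zip(x, y)]

# Made-up data with a MAR pattern: y is kept only where x > 140.
rng = random.Random(42)
x = list(range(100, 160))
y = [84 + 0.3 * xi + rng.gauss(0, 23) if xi > 140 else None for xi in x]

imputations = [draw_imputation(x, y, rng) for _ in range(5)]  # m = 5 completed data sets
```

Each call uses a different line and different residuals, so the m completed data sets differ from each other. The spread between them reflects how uncertain the imputed values are.
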

SLIDE 20

Multiple Imputation: Gibbs sampler (Not for exam)

  • Iteration t; repeat until convergence:

For each variable i:

$\theta_i^{*(t)} \sim P(\theta_i \mid Y_i^{\mathrm{obs}}, Y_{-i}^{(t)})$
$Y_i^{*(t)} \sim P(Y_i \mid Y_i^{\mathrm{obs}}, Y_{-i}^{(t)}, \theta_i^{*(t)})$

where $Y_i^{(t)} = (Y_i^{\mathrm{obs}}, Y_i^{*(t)})$

Intuition:

  • First line: Sample (a,b)
  • Second line: Predict missings using y = a + bx + e
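The iteration can be sketched in Python for two numeric variables that both have missing values. Several simplifications are assumed here: each conditional model is a normal linear regression on the other variable, the loop runs a fixed number of iterations instead of checking convergence, and the names `draw_for_target` and `chained_imputation` are hypothetical. It is meant as an aid to intuition, not as a description of how mice is implemented.

```python
import random
import statistics

def draw_for_target(pred, target, missing, rng):
    """One step for one variable: sample the parameters, then the missing values."""
    rows = [(p, t) for p, t, m in zip(pred, target, missing) if not m]
    n = len(rows)
    mp = statistics.fmean([p for p, _ in rows])
    mt = statistics.fmean([t for _, t in rows])
    spp = sum((p - mp) ** 2 for p, _ in rows)
    b_hat = sum((p - mp) * (t - mt) for p, t in rows) / spp
    rss = sum((t - mt - b_hat * (p - mp)) ** 2 for p, t in rows)
    # First line of the sampler: draw the parameters (sigma, slope, level).
    sigma = (rss / rng.gammavariate((n - 2) / 2, 2)) ** 0.5
    b = rng.gauss(b_hat, sigma / spp ** 0.5)
    level = rng.gauss(mt, sigma / n ** 0.5)
    # Second line of the sampler: draw the missing values given the parameters.
    return [level + b * (p - mp) + rng.gauss(0, sigma) if m else t
            for p, t, m in zip(pred, target, missing)]

def chained_imputation(x, y, n_iter=10, seed=0):
    """Return one completed (x, y) pair; None marks a missing value."""
    rng = random.Random(seed)
    miss_x = [v is None for v in x]
    miss_y = [v is None for v in y]
    obs_x = [v for v in x if v is not None]
    obs_y = [v for v in y if v is not None]
    # Start: fill in random values drawn from the observed ones.
    cur_x = [rng.choice(obs_x) if m else v for v, m in zip(x, miss_x)]
    cur_y = [rng.choice(obs_y) if m else v for v, m in zip(y, miss_y)]
    for _ in range(n_iter):
        cur_x = draw_for_target(cur_y, cur_x, miss_x, rng)
        cur_y = draw_for_target(cur_x, cur_y, miss_y, rng)
    return cur_x, cur_y

# Made-up data: every 5th x and every 3rd y is missing.
rng = random.Random(7)
x_full = [rng.gauss(125, 25) for _ in range(60)]
y_full = [84 + 0.3 * xi + rng.gauss(0, 23) for xi in x_full]
x = [None if i % 5 == 0 else v for i, v in enumerate(x_full)]
y = [None if i % 3 == 0 else v for i, v in enumerate(y_full)]

x_imp, y_imp = chained_imputation(x, y)
```

Running `chained_imputation` with m different seeds gives m completed data sets.
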

SLIDE 21

R package: mice (Multivariate Imputation by Chained Equations)

  • MICE has good default settings; don’t worry about the data type
  • Defaults for data types of columns:
      • numeric: Predictive Mean Matching (pmm) (like fancy linear regression; faster alternative: linear regression)
      • factor, 2 lev: Logistic Regression (logreg)
      • factor, >2 lev: Multinomial logit model (polyreg)
      • ordered, >2 lev: Ordered logit model (polr)

SLIDE 22

Aggregation of estimates

  • $\hat{Q}_i$: Estimate of imputation i
    $U_i$: Variance of estimate (= square of std. error)
  • Assume: $(\hat{Q} - Q) / \sqrt{U} \approx N(0, 1)$
  • Average estimate: $\bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}_j$
  • Within-imputation variance: $\bar{U} = \frac{1}{m} \sum_{j=1}^{m} U_j$
  • Between-imputation variance: $B = \frac{1}{m-1} \sum_{j=1}^{m} (\hat{Q}_j - \bar{Q})^2$
  • Total variance: $T = \bar{U} + \left(1 + \frac{1}{m}\right) B$
  • Approximately: $(\bar{Q} - Q) / \sqrt{T} \sim t_\nu$ with $\nu = (m-1) \left(1 + \frac{m \bar{U}}{(1+m) B}\right)^2$
  • 95%-CI: $\bar{Q} \pm t_{\nu; 0.975} \sqrt{T}$
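The aggregation can be written as a short Python function. It is a sketch of the pooling formulas (commonly known as Rubin's rules), with total variance T = U_bar + (1 + 1/m) B. The function name `pool_estimates` and the example numbers are made up. The Python standard library has no t quantile, so the confidence interval itself is left out; it would need a t table or a statistics library.

```python
import statistics

def pool_estimates(estimates, variances):
    """Combine m estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)      # average estimate
    u_bar = statistics.fmean(variances)      # within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance (divides by m - 1)
    t = u_bar + (1 + 1 / m) * b              # total variance
    # Degrees of freedom of the t approximation; undefined if b == 0.
    df = (m - 1) * (1 + m * u_bar / ((1 + m) * b)) ** 2
    return {"q_bar": q_bar, "u_bar": u_bar, "b": b, "t": t, "df": df}

# Made-up results of the same analysis on m = 3 imputed data sets.
pooled = pool_estimates(estimates=[1.0, 2.0, 3.0], variances=[0.5, 0.5, 0.5])
```

The 95% confidence interval is then q_bar plus or minus the 0.975 quantile of the t distribution with df degrees of freedom, times the square root of t.
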

SLIDE 23

Multiple Imputation with MICE

Do manually, if you have a non-standard analysis

SLIDE 24

How much uncertainty due to missings?

  • Relative increase in variance due to nonresponse: $r = \frac{\left(1 + \frac{1}{m}\right) B}{\bar{U}}$
  • Fraction (or rate) of missing information fmi: $\mathrm{fmi} = \frac{r + \frac{2}{\nu + 3}}{r + 1}$
    (!! Not the same as fraction of missing OBSERVATIONS)
  • Proportion of the total variance that is attributed to the missing data: $\lambda = \frac{\left(1 + \frac{1}{m}\right) B}{T}$

Returned by mice
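A small Python sketch of the three quantities follows, computed from the within-imputation variance U_bar, the between-imputation variance B and the number of imputations m. The function name `missing_information` and the input numbers are made up. The degrees of freedom are written as (m - 1)(1 + 1/r)^2, which is the same expression as (m - 1)(1 + m U_bar / ((1 + m) B))^2.

```python
def missing_information(u_bar, b, m):
    """Return (r, fmi, lam) for within variance u_bar, between variance b, m imputations."""
    r = (1 + 1 / m) * b / u_bar            # relative increase in variance
    t = u_bar + (1 + 1 / m) * b            # total variance
    df = (m - 1) * (1 + 1 / r) ** 2        # degrees of freedom nu
    fmi = (r + 2 / (df + 3)) / (r + 1)     # fraction of missing information
    lam = (1 + 1 / m) * b / t              # share of total variance due to missing data
    return r, fmi, lam

r, fmi, lam = missing_information(u_bar=0.5, b=1.0, m=3)
```

Since T = U_bar (1 + r), lam equals r / (1 + r). The fmi formula adds 2 / (nu + 3) to the numerator, so fmi is always slightly larger than lam.
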

SLIDE 25

How many imputations?

  • Surprisingly few!
  • Efficiency compared to $m = \infty$ depends on fmi: $\mathrm{eff} = \left(1 + \frac{\mathrm{fmi}}{m}\right)^{-1}$
  • Examples (eff in %):

   m   fmi=0.1  fmi=0.3  fmi=0.5  fmi=0.7  fmi=0.9
   3      97       91       86       81       77
   5      98       94       91       88       85
  10      99       97       95       93       92
  20     100       99       98       97       96

Annotations: "Oftentimes OK", "Perfect!"

Rule of thumb:

  • Preliminary analysis: m = 5
  • Paper: m = 20 or even m = 50

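The efficiency table can be reproduced directly from eff = (1 + fmi/m)^(-1). The Python names below are chosen for illustration only.

```python
def efficiency(fmi, m):
    """Efficiency of m imputations relative to infinitely many imputations."""
    return 1 / (1 + fmi / m)

FMI_VALUES = (0.1, 0.3, 0.5, 0.7, 0.9)
M_VALUES = (3, 5, 10, 20)

# Efficiency in percent, rounded to whole numbers, one row per m.
table = {m: [round(100 * efficiency(fmi, m)) for fmi in FMI_VALUES] for m in M_VALUES}

for m, row in table.items():
    print(f"{m:3d}", *(f"{eff:4d}" for eff in row))
```

Even with fmi = 0.9, twenty imputations already reach 96% of the efficiency of infinitely many.
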
SLIDE 26

Concepts to know

  • Idea of mice
  • How to aggregate results from imputed data sets?
  • How many imputations?

SLIDE 27

R functions to know

  • mice, with, pool

SLIDE 28

Next time

  • Multidimensional Scaling
  • Distance metrics
