

SLIDE 1

Exploring the multivariate structure of missing values using the R package VIM

Matthias Templ (1, 2), Andreas Alfons (1), Peter Filzmoser (1)

(1) Department of Statistics and Probability Theory, Vienna University of Technology
(2) Department of Methodology, Statistics Austria

Rennes, July 8, 2009

Templ, Alfons, Filzmoser (TUW) Exploring the structure of missing values Rennes, July 8, 2009 1 / 18

SLIDE 2

Content

1 Motivation

2 Visualization of missing values

3 Conclusions

SLIDE 3

Motivation

Missing values

Real data sets often contain missing values:

$$
X = \begin{pmatrix}
x_{11} & \cdots & \cdots & x_{1p} \\
\vdots & \mathrm{NA} & & \vdots \\
\vdots & & \mathrm{NA} & \mathrm{NA} \\
x_{n1} & \cdots & \cdots & x_{np}
\end{pmatrix},
$$

with n observations, p variables, and some missing values (NA).

Examples: nonresponse in surveys, element concentration below detection limit in chemical analyses.
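As a minimal sketch of how such a matrix can be summarised before any plotting, the following Python snippet counts the missing cells per variable and the frequency of each row-wise missingness pattern. It is written in plain Python rather than R, and the toy data, the `NA` stand-in and the function names are illustrative assumptions, not part of VIM.

```python
from collections import Counter

NA = None  # stand-in for R's NA (assumption for this sketch)

# Toy data matrix: n = 5 observations (rows), p = 3 variables (columns).
X = [
    [1.2, 3.4, 5.0],
    [0.7, NA, 4.1],
    [NA, NA, 3.3],
    [2.5, 1.9, NA],
    [1.1, 2.2, 6.0],
]

def missing_per_variable(rows):
    """Number of missing cells in each column."""
    p = len(rows[0])
    return [sum(row[j] is NA for row in rows) for j in range(p)]

def missing_patterns(rows):
    """Frequency of each row-wise pattern; True marks a missing cell."""
    return Counter(tuple(v is NA for v in row) for row in rows)

counts = missing_per_variable(X)
patterns = missing_patterns(X)
print(counts)
for pattern, freq in patterns.most_common():
    print(pattern, freq)
```

The pattern table is the multivariate part: it shows which variables tend to be missing together, which per-variable counts alone cannot reveal.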

SLIDE 4

Motivation

Comments on missing values

  • Most statistical methods can only be applied to complete data.

  • In order to select an appropriate imputation method (especially for model-based imputation), it is necessary to know the multivariate structure of the missing values beforehand.

  • Visualizing missing values may not only help to detect the missing value mechanisms, but also to gain insight into the quality and various other aspects of the data.

SLIDE 5

Motivation

Missing value mechanisms

Three important cases (e.g., Little and Rubin 2002):

  • MCAR (Missing Completely At Random): $P(X_{miss} \mid X) = P(X_{miss})$

  • MAR (Missing At Random): $P(X_{miss} \mid X) = P(X_{miss} \mid X_{obs})$

  • MNAR (Missing Not At Random): $P(X_{miss} \mid X) = P(X_{miss} \mid X_{obs}, X_{miss})$

where $X = (X_{obs}, X_{miss})$ denotes the complete data, and $X_{obs}$ and $X_{miss}$ are the observed and missing parts, respectively.
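The practical difference between the three cases can be illustrated with a small simulation. The sketch below is plain Python, not VIM, and everything in it is an assumption made for illustration: two correlated variables `x` and `y`, the deletion probabilities, and the function names. Values of `y` are deleted with a probability that is constant (MCAR), depends on the always-observed `x` (MAR), or depends on `y` itself (MNAR).

```python
import random
from statistics import mean

def simulate(mechanism, n=20000, seed=1):
    """Return rows (x, y, missing), where `missing` flags a deleted y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = x + rng.gauss(0, 1)
        if mechanism == "MCAR":
            p = 0.3                      # same probability for every row
        elif mechanism == "MAR":
            p = 0.5 if x > 0 else 0.1    # depends only on observed x
        elif mechanism == "MNAR":
            p = 0.5 if y > 0 else 0.1    # depends on the value that goes missing
        else:
            raise ValueError(mechanism)
        rows.append((x, y, rng.random() < p))
    return rows

def x_gap(rows):
    """Mean of x where y is missing minus mean of x where y is observed."""
    x_miss = [x for x, _, m in rows if m]
    x_obs = [x for x, _, m in rows if not m]
    return mean(x_miss) - mean(x_obs)

def observed_y_mean(rows):
    """Mean of the y values that survive deletion (the true mean is 0)."""
    return mean([y for _, y, m in rows if not m])

gap_mcar = x_gap(simulate("MCAR"))
gap_mar = x_gap(simulate("MAR"))
ybar_mcar = observed_y_mean(simulate("MCAR"))
ybar_mnar = observed_y_mean(simulate("MNAR"))
print(gap_mcar, gap_mar, ybar_mcar, ybar_mnar)
```

Under MCAR the distribution of `x` looks the same whether `y` is missing or not, while under MAR it shifts; this is the kind of difference that plots split by missingness make visible. Under MNAR the observed values of `y` are themselves biased, which cannot be checked from the observed data alone.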

SLIDE 6

Visualization of missing values

Visualization of missing values

  • Well-known books and almost all articles about missing values do not address visualization.

  • Visualization tools for missing values are rarely, if at all, implemented in SAS, SPSS, Stata or even R.

  • Through linking, missing values can be highlighted in GGobi (Cook and Swayne 2007) and Mondrian (Theus 2002).

  • MANET (Unwin et al. 1996, Theus et al. 1997) is quite powerful, but only available for older Apple systems with PowerPC architecture and Mac OS.

  • Visualization tools for missing values need to be available for the R community so that visualization of missing values, imputation and analysis can all be done from within R, without the need for additional software.

SLIDE 7

Visualization of missing values

Histogram and spinogram

[Two panels: a histogram of age (counts) and a spinogram of age (proportions from 0 to 1), each with bars split into missing/observed in py010n.]

Figure: Austrian EU-SILC data from 2004 with missings generated in variable age.
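The numbers behind such a pair of plots can be sketched in a few lines. The snippet below is plain Python with made-up toy values, not the EU-SILC data and not VIM code; the function name and the half-open binning rule are assumptions. For each bin of one variable it returns the bin count (the bar height in the histogram, the bar width in the spinogram) and the share of observations that are missing in a second variable (the highlighted part of each bar).

```python
from bisect import bisect_right

def missing_share_by_bin(values, is_missing, breaks):
    """For each bin [breaks[i], breaks[i+1]) return (count, share missing)."""
    counts = [0] * (len(breaks) - 1)
    missing = [0] * (len(breaks) - 1)
    for v, m in zip(values, is_missing):
        i = bisect_right(breaks, v) - 1
        if 0 <= i < len(counts):      # values outside the breaks are skipped
            counts[i] += 1
            missing[i] += int(m)
    return [(c, (k / c if c else 0.0)) for c, k in zip(counts, missing)]

# Toy data: age is fully observed, a second variable is missing for some rows.
age = [18, 22, 27, 33, 38, 45, 52, 58, 61, 64]
other_missing = [True, True, False, False, True,
                 False, False, False, False, False]

bins = missing_share_by_bin(age, other_missing, breaks=[15, 30, 40, 50, 65])
for (count, share), lo in zip(bins, [15, 30, 40, 50]):
    print(f"from {lo}: n={count}, share missing={share:.2f}")
```

In this toy example the share of missing values falls with age, a pattern that would speak against MCAR for the second variable.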

SLIDE 8

Visualization of missing values

Marginplot

[Marginplot of the variables pek_n and py130n.]

Figure: Austrian EU-SILC data from 2004.

SLIDE 9

Visualization of missing values

Scatterplot matrix

[Scatterplot matrix of the variables pek_n, age and py010n.]

Figure: Austrian EU-SILC data from 2004.

SLIDE 10

Visualization of missing values

Matrixplot

[Matrixplot of the variables P001000, r007000, py010n, py035n, py050n, py090n, py100n, pek_n, bundesld and age, plotted against the observation index.]

Figure: Austrian EU-SILC data from 2004.

SLIDE 11

Visualization of missing values

Parallel coordinate plot

[Parallel coordinate plot with the axes sex, pek_g, P033000, P029000, P014000, P001000, age and bundesld.]

Figure: Austrian EU-SILC data from 2004.

SLIDE 12

Visualization of missing values

Parallel boxplots

[Parallel boxplots of pek_n: boxes for the observations with observed (obs.) and with missing (miss.) values in each of py010n, py035n, py050n, py070n, py080n, py090n, py100n, py110n, py130n and py140n, with the group sizes printed in the plot.]

Figure: Austrian EU-SILC data from 2004.
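The grouping behind parallel boxplots can be sketched as follows. This is plain Python on toy values, not the EU-SILC data and not VIM code; the function names and the choice of the "inclusive" quantile method are assumptions. The target variable is split according to whether a second variable is observed or missing, and box statistics are computed for each group.

```python
from statistics import quantiles

def box_stats(values):
    """Group size and five-number summary of a list of numbers."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"n": len(values), "min": min(values), "q1": q1,
            "median": med, "q3": q3, "max": max(values)}

def split_by_missingness(target, other):
    """Split observed target values by whether `other` is observed or missing."""
    obs = [t for t, o in zip(target, other) if t is not None and o is not None]
    miss = [t for t, o in zip(target, other) if t is not None and o is None]
    return obs, miss

# Toy data: None marks a missing value in the second variable.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 3.0, 4.0]
other = [10, None, 30, 40, None, 60, None, 80]

obs, miss = split_by_missingness(target, other)
stats_obs = box_stats(obs)
stats_miss = box_stats(miss)
print("obs. :", stats_obs)
print("miss.:", stats_miss)
```

Clearly different boxes for the two groups indicate that missingness in the second variable is related to the target variable. Small groups deserve caution, which is why showing the group sizes alongside the boxes is useful.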

SLIDE 13

Conclusions

General Statements

  • The detection of missing value mechanisms is quite complex when using models or tests.

  • Statistical methods frequently lead to only vague statements about the missing value mechanisms.

  • Non-robust methods lead to erroneous statements about missing value mechanisms for data containing outliers.

  • Visualization tools are easier to handle and more powerful, but flexible, easy-to-use visualization software is required.

SLIDE 14

Conclusions

The R package VIM

The R package VIM (Templ and Filzmoser 2008, Templ and Alfons 2009) ...

  • has all previously shown plots implemented, along with some more.

  • is a tool for explorative data analysis of data with missing values.

  • makes it possible to analyze the multivariate structure of missing values.

  • comes with a graphical user interface (GUI).

  • contains interactive features.

  • allows producing high-quality graphics for publications.

  • is available on CRAN (http://cran.r-project.org/package=VIM).

SLIDE 15

Conclusions

Graphical user interface of the R Package VIM

Figure: VIM GUI

SLIDE 16

Conclusions

Acknowledgments

This work was partly funded by the European Union (represented by the European Commission) within the 7th framework programme for research (Theme 8, Socio-Economic Sciences and Humanities, Project AMELI (Advanced Methodology for European Laeken Indicators), Grant Agreement No. 217322). Visit http://ameli.surveystatistics.net for more information.

SLIDE 17

Conclusions

References I

  • D. Cook and D.F. Swayne. Interactive and Dynamic Graphics for Data Analysis: With R and GGobi. Springer, New York, 2007. ISBN 978-0-387-71761-6.

  • R.J.A. Little and D.B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, 2nd edition, 2002. ISBN 0-471-18386-5.

  • M. Templ and A. Alfons. VIM: Visualization and Imputation of Missing Values, 2009. URL http://cran.r-project.org/package=VIM. R package version 1.3.

  • M. Templ and P. Filzmoser. Visualization of missing values using the R-package VIM. Research Report CS-2008-1, Department of Statistics and Probability Theory, Vienna University of Technology, 2008. URL http://www.statistik.tuwien.ac.at/forschung/CS/CS-2008-1complete.pdf.

SLIDE 18

Conclusions

References II

  • M. Theus. Interactive data visualization using Mondrian. Journal of Statistical Software, 7(11): 1–9, 2002. URL http://www.jstatsoft.org/v07/i11.

  • M. Theus, H. Hofmann, B. Siegl, and A. Unwin. MANET - Extensions to interactive statistical graphics for missing values. In New Techniques and Technologies for Statistics II, pages 247–259. IOS Press, 1997.

  • A. Unwin, G. Hawkins, H. Hofmann, and B. Siegl. Interactive graphics for data sets with missing values: MANET. Journal of Computational and Graphical Statistics, 5(2): 113–122, 1996.
