

  1. The Use of Uncertainty to Choose the Matching Variables in Statistical Matching
     Marcello D’Orazio* (madorazi@istat.it), Marco Di Zio* (dizio@istat.it), Mauro Scanu* (scanu@istat.it)
     *Italian National Institute of Statistics (Istat)
     NTTS 2015 conference, Brussels, 10-12 March 2015

  2. Statistical Matching (data fusion or synthetic matching)
     A series of statistical methods for integrating two data sources (usually samples) referring to the same target population.
     Objective: study the relationship between variables not jointly observed in a single data source.
     • Source A observes X and Y.
     • Source B observes X and Z.
     • The X variables are common to both sources; Y and Z are NOT jointly observed.
     Uncertainty to choose the matching variables, M. D’Orazio, M. Di Zio, M. Scanu – NTTS 2015, Brussels

  3. Objectives of Statistical Matching
     • micro: derive a “synthetic” data set with X, Y and Z; for instance:
       - A filled in with Z
       - A and B concatenated, with Z filled in A and Y filled in B (file concatenation)
     • macro: estimation of parameters; for instance:
       - a correlation coefficient
       - a regression coefficient
       - a contingency table
     Various methods are available, depending on the objective (micro or macro) and on the framework (parametric, nonparametric or mixed).

  4. Matching Variables
     A and B may share many common variables X, but NOT all the X variables will be used. It is necessary to select just the most relevant Xs, called matching variables, i.e. the subset of the Xs connected, at the same time, with Y and Z.
     Many methods can be applied to identify the best predictors of Y and the best predictors of Z, but they imply separate analyses on A and B.
     Proposal: perform a single analysis for choosing the matching variables, by searching for the set of common variables that is most effective in reducing the uncertainty on the relationship between Y and Z.
     Uncertainty is due to lack of information: Y and Z are NOT jointly observed.

  5. Uncertainty bounds
     Focus: the X, Y and Z variables are categorical.
     Objective of SM: estimation of the probabilities
     $$p_{jk} = \Pr(Y = j, Z = k), \quad j = 1, \dots, J, \quad k = 1, \dots, K$$
     In this case the uncertainty set can be computed by resorting to the Fréchet bounds. By conditioning on the X, it is possible to conclude that the probability will lie in the interval
     $$\left[ p_{jk}^{(low)},\ p_{jk}^{(up)} \right], \qquad p_{jk}^{(low)} = \sum_h p_{h} \max\{0,\ p_{j|h} + p_{k|h} - 1\}, \qquad p_{jk}^{(up)} = \sum_h p_{h} \min\{p_{j|h},\ p_{k|h}\}$$
     where $p_h = \Pr(X = h)$, $p_{j|h} = \Pr(Y = j \mid X = h)$ and $p_{k|h} = \Pr(Z = k \mid X = h)$.
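The bounds on slide 5 can be computed directly from the three distributions involved. The following is a minimal Python sketch, not the authors' R code: the function name `frechet_bounds` and the toy probabilities are illustrative assumptions, and the distributions are taken as known rather than estimated from samples.

```python
from itertools import product


def frechet_bounds(p_x, p_y_given_x, p_z_given_x):
    """Conditional Frechet bounds for p_jk = Pr(Y=j, Z=k).

    p_x[h]            : Pr(X=h)
    p_y_given_x[h][j] : Pr(Y=j | X=h)
    p_z_given_x[h][k] : Pr(Z=k | X=h)
    Returns (low, up), two J x K nested lists.
    """
    J = len(p_y_given_x[0])
    K = len(p_z_given_x[0])
    low = [[0.0] * K for _ in range(J)]
    up = [[0.0] * K for _ in range(J)]
    for h, ph in enumerate(p_x):
        for j, k in product(range(J), range(K)):
            pj = p_y_given_x[h][j]
            pk = p_z_given_x[h][k]
            low[j][k] += ph * max(0.0, pj + pk - 1.0)
            up[j][k] += ph * min(pj, pk)
    return low, up


# Hypothetical toy distributions: binary X, Y and Z.
p_x = [0.5, 0.5]
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]
p_z_given_x = [[0.8, 0.2], [0.3, 0.7]]
low, up = frechet_bounds(p_x, p_y_given_x, p_z_given_x)

# Same marginals for Y and Z, but ignoring X (a single X category).
low0, up0 = frechet_bounds([1.0], [[0.55, 0.45]], [[0.55, 0.45]])
```

In this toy case, conditioning on X narrows the interval for Pr(Y = 0, Z = 0) from about [0.10, 0.55] to about [0.35, 0.50].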

  6. Proposed method for choosing the matching variables
     Step 0) order the Xs according to their ability to minimize the average width of the uncertainty intervals:
     $$d = \frac{1}{J \times K} \sum_{j,k} \left( \hat{p}_{jk}^{(up)} - \hat{p}_{jk}^{(low)} \right)$$
     where the two terms are the estimated upper and lower bounds for Pr(Y = j, Z = k).
     Step 1) for all the possible combinations of the starting variable(s) with each of the remaining ones, taken in the order of step (0), evaluate the associated uncertainty in terms of d.
     Step 2) select the combination of variables that gives the largest decrease in d and go back to step (1).
     The method was tested with artificial data.
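The stepwise search on slide 6 can be sketched as a greedy forward selection. This is a hedged Python illustration, not the authors' R implementation. The helper names (`marginal`, `avg_width`, `select_matching_vars`) are hypothetical. The inputs are assumed to be probability tables stored as dictionaries keyed by `(x_1, ..., x_p, y)` and `(x_1, ..., x_p, z)`, and both tables are assumed to have the same X margin.

```python
def marginal(table, keep):
    """Sum a {cell: probability} table onto the positions listed in keep."""
    out = {}
    for cell, p in table.items():
        key = tuple(cell[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def avg_width(tab_xy, tab_xz, xs):
    """d: average width of the Frechet intervals given the Xs at positions xs."""
    xs = list(xs)
    last = len(next(iter(tab_xy))) - 1  # position of Y (and of Z)
    p_x = marginal(tab_xy, xs)          # assumes tab_xz has the same X margin
    p_xy = marginal(tab_xy, xs + [last])
    p_xz = marginal(tab_xz, xs + [last])
    ys = sorted({cell[-1] for cell in p_xy})
    zs = sorted({cell[-1] for cell in p_xz})
    total = 0.0
    for h, ph in p_x.items():
        if ph <= 0.0:
            continue
        for j in ys:
            pj = p_xy.get(h + (j,), 0.0) / ph
            for k in zs:
                pk = p_xz.get(h + (k,), 0.0) / ph
                total += ph * (min(pj, pk) - max(0.0, pj + pk - 1.0))
    return total / (len(ys) * len(zs))


def select_matching_vars(tab_xy, tab_xz, n_x):
    """Greedy search: returns [(chosen X positions, d), ...], one entry per step."""
    # Step 0: order the Xs by the d each one achieves on its own.
    order = sorted(range(n_x), key=lambda v: avg_width(tab_xy, tab_xz, [v]))
    chosen, remaining = [order[0]], order[1:]
    path = [(tuple(chosen), avg_width(tab_xy, tab_xz, chosen))]
    while remaining:
        # Steps 1-2: add the remaining X that yields the smallest d.
        best = min(remaining, key=lambda v: avg_width(tab_xy, tab_xz, chosen + [v]))
        chosen.append(best)
        remaining.remove(best)
        path.append((tuple(chosen), avg_width(tab_xy, tab_xz, chosen)))
    return path


# Hypothetical toy population: Y and Z depend on X1 only; X2 is irrelevant.
pr_y0 = {0: 0.9, 1: 0.2}  # Pr(Y=0 | X1)
pr_z0 = {0: 0.8, 1: 0.3}  # Pr(Z=0 | X1)
tab_xy, tab_xz = {}, {}
for x1 in (0, 1):
    for x2 in (0, 1):
        tab_xy[(x1, x2, 0)] = 0.25 * pr_y0[x1]
        tab_xy[(x1, x2, 1)] = 0.25 * (1 - pr_y0[x1])
        tab_xz[(x1, x2, 0)] = 0.25 * pr_z0[x1]
        tab_xz[(x1, x2, 1)] = 0.25 * (1 - pr_z0[x1])

path = select_matching_vars(tab_xy, tab_xz, n_x=2)
```

In this toy population X1 alone gives d of about 0.15 and X2 alone about 0.45, so X1 is selected first; adding X2 leaves d at about 0.15. The sketch has no stopping rule: it simply reports d for every step of the path.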

  7. The data (1)
     Bayesian networks are used to generate two artificial samples sharing 3 binary Xs.
     [Figures: true association structure; association structure in A; association structure in B]
     Output of the procedure:

     X variables   No. of Xs   d
     X1            1           0.1703
     X1*X3         2           0.1703
     X1*X3*X2      3           0.1699

  8. The data (2)
     Artificial data resembling EU-SILC: two artificial samples and 7 common variables.
     Output of the procedure:

     No. of Xs   d        Best   Combination of X variables
     1           0.0878   Yes    c.age
     2           0.0781   Yes    c.age*sex
     3           0.0714   Yes    c.age*sex*edu7
     4           0.0608   No     c.age*sex*edu7*area5
     5           0.0411   No     c.age*sex*edu7*area5*hsize5
     6           0.0225   Yes    c.age*sex*edu7*area5*hsize5*urb
     7           0.0162   Yes    c.age*edu7*marital*sex*hsize5*area5*urb

     The combinations found with 4 and 5 Xs are very close to optimal.

  9. Conclusions
     Pros:
     - avoids separate analyses
     - is able to find the best solutions or solutions close to them
     - is fully automatic; the code is written in R and related to the package StatMatch (D’Orazio, 2015)
     Cons:
     - dependence on the initial ordering of the variables
     - absence of a stopping rule: by increasing the no. of Xs the uncertainty always decreases, but the tables become very sparse

  10. Essential References
      D’Orazio, M., Di Zio, M. and Scanu, M. (2006) Statistical Matching: Theory and Practice. Wiley, New York.
      D’Orazio, M. (2015) “StatMatch: Statistical Matching”, R package version 1.2.3. http://CRAN.R-project.org/package=StatMatch
