The Use of Uncertainty to Choose the Matching Variables in Statistical Matching
Marcello D’Orazio* (madorazi@istat.it), Marco Di Zio* (dizio@istat.it), Mauro Scanu* (scanu@istat.it)
*Italian National Institute of Statistics (Istat)
NTTS 2015
Statistical Matching (data fusion or synthetic matching)
A series of statistical methods for integrating two data sources (usually samples) that refer to the same target population. Objective: study the relationship between variables not jointly observed in a single data source.
Uncertainty to choose the matching variables, M. D’Orazio, M. Di Zio, M. Scanu – NTTS 2015, Brussels
Objectives of Statistical Matching
Source A observes (X, Y); source B observes (X, Z). The X variables are common to both sources; Y and Z are NOT jointly observed.
micro: derive a “synthetic” data set with X, Y and Z; for instance:
- A filled in with Z
- A and B concatenated, with Z filled in A and Y filled in B (file concatenation)
macro: estimation of parameters; for instance:
- the correlation coefficient between Y and Z
- a regression coefficient relating Y and Z
- the contingency table of Y and Z
Various methods are available, depending on the objective (micro or macro) and on the framework (parametric, nonparametric or mixed).
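As an illustration of the micro objective "A filled in with Z", here is a minimal Python sketch of one common nonparametric approach, random hot deck within donation classes. The slides do not say which imputation method the authors have in mind, so this is only an assumed example; the function name `random_hot_deck`, the record layout (lists of dicts) and the toy data are hypothetical.

```python
import random

def random_hot_deck(recipient, donor, match_vars, z="Z", seed=0):
    """Fill each recipient record with a Z value drawn at random from the
    donor records that share the same values of the matching variables."""
    rng = random.Random(seed)
    # Group the donor Z values by the values of the matching variables.
    classes = {}
    for rec in donor:
        classes.setdefault(tuple(rec[v] for v in match_vars), []).append(rec[z])
    filled = []
    for rec in recipient:
        pool = classes.get(tuple(rec[v] for v in match_vars))
        new = dict(rec)
        # No donor in the class: leave Z missing rather than guess.
        new[z] = rng.choice(pool) if pool else None
        filled.append(new)
    return filled

# Toy files (made up): A observes (X, Y), B observes (X, Z).
A = [{"X": 0, "Y": 1}, {"X": 1, "Y": 0}, {"X": 2, "Y": 1}]
B = [{"X": 0, "Z": "a"}, {"X": 1, "Z": "b"}]
synthetic = random_hot_deck(A, B, match_vars=["X"])
```

The third record of A has no donor with the same X, which shows why the choice of matching variables matters: too many of them leave donation classes empty.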
Matching Variables
A and B may share many common variables X.
NOT all the X variables will be used. It is necessary to select just the most relevant Xs, called matching variables, i.e. the subset of the Xs connected, at the same time, with Y and Z.
Many methods can be applied to identify the best predictors of Y and the best predictors of Z. They imply separate analyses on A and B.
Proposal: perform a single analysis for choosing the matching variables, by searching for the set of common variables most effective in reducing the uncertainty on the relationship between Y and Z.
Uncertainty is due to lack of information: Y and Z are NOT jointly observed.
Uncertainty bounds
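Focus on the case in which the X, Y and Z variables are all categorical.
Objective of SM: estimation of the probabilities
$$p_{jk} = \Pr(Y = j, Z = k), \quad j = 1, \dots, J, \quad k = 1, \dots, K$$
In this case the uncertainty set can be computed by resorting to the Fréchet bounds. By conditioning on the X variables, it is possible to conclude that the probability $p_{jk}$ will lie in the interval:
$$\left[ \underline{p}_{jk}, \overline{p}_{jk} \right] = \left[ \sum_h p_h \max\{0,\; p_{j|h} + p_{k|h} - 1\},\; \sum_h p_h \min\{p_{j|h},\; p_{k|h}\} \right]$$
where $p_h = \Pr(X = h)$, $p_{j|h} = \Pr(Y = j \mid X = h)$ and $p_{k|h} = \Pr(Z = k \mid X = h)$.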
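The conditional Fréchet bounds can be computed directly from the three sets of probabilities involved. The Python sketch below assumes they are already available as plain lists; the function name `frechet_bounds` and the toy probabilities are made up for illustration and are not taken from the slides.

```python
def frechet_bounds(p_x, p_y_given_x, p_z_given_x):
    """Return (lower, upper), two J x K nested lists with the Frechet bounds
    for Pr(Y=j, Z=k) obtained by conditioning on X.

    p_x[h]            = Pr(X = h)
    p_y_given_x[h][j] = Pr(Y = j | X = h)
    p_z_given_x[h][k] = Pr(Z = k | X = h)
    """
    J = len(p_y_given_x[0])
    K = len(p_z_given_x[0])
    lower = [[0.0] * K for _ in range(J)]
    upper = [[0.0] * K for _ in range(J)]
    for h, p_h in enumerate(p_x):
        for j in range(J):
            for k in range(K):
                p_j = p_y_given_x[h][j]
                p_k = p_z_given_x[h][k]
                lower[j][k] += p_h * max(0.0, p_j + p_k - 1.0)
                upper[j][k] += p_h * min(p_j, p_k)
    return lower, upper

# Toy inputs (made up): binary X, Y and Z.
p_x = [0.5, 0.5]
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]
p_z_given_x = [[0.8, 0.2], [0.3, 0.7]]
lower, upper = frechet_bounds(p_x, p_y_given_x, p_z_given_x)
```

With these inputs the first cell is bounded by roughly 0.35 and 0.5: the interval is narrow because X is strongly associated with both Y and Z.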
Proposed method for choosing the matching variables
Step 0) order the Xs according to their ability to minimize the average width of the estimated uncertainty bounds:
$$d = \frac{1}{J \times K} \sum_{j,k} \left( \hat{\overline{p}}_{jk} - \hat{\underline{p}}_{jk} \right)$$
where $\hat{\overline{p}}_{jk}$ and $\hat{\underline{p}}_{jk}$ are the estimated upper and lower Fréchet bounds for $\Pr(Y = j, Z = k)$.
Step 1) evaluate d for all the possible combinations of the starting variable(s) with each of the remaining ones, ordered as in step (0).
Step 2) select the combination of variables that gives the largest decrease in d and go back to step (1).
Method tested with artificial data.
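The steps above can be sketched in Python as a greedy forward search. This is an illustrative reading of the procedure, not the authors' R code: the function names, the record layout (lists of dicts) and the toy data are hypothetical, and two simplifying assumptions are made here, namely that Pr(X = h) is estimated by pooling A and B and that only X cells observed in both files are used.

```python
from collections import Counter
from itertools import product

def avg_width(A, B, xs, y="Y", z="Z"):
    """Average width d of the estimated Frechet bounds for Pr(Y=j, Z=k),
    conditioning on the X variables listed in xs."""
    y_levels = sorted({r[y] for r in A})
    z_levels = sorted({r[z] for r in B})
    key = lambda r: tuple(r[v] for v in xs)
    n_xy = Counter((key(r), r[y]) for r in A)
    n_xz = Counter((key(r), r[z]) for r in B)
    n_xa = Counter(key(r) for r in A)
    n_xb = Counter(key(r) for r in B)
    # Assumption: keep only the X cells observed in both files.
    cells = [h for h in n_xa if h in n_xb]
    total = sum(n_xa[h] + n_xb[h] for h in cells)
    d = 0.0
    for h in cells:
        p_h = (n_xa[h] + n_xb[h]) / total      # pooled estimate of Pr(X = h)
        for j, k in product(y_levels, z_levels):
            p_j = n_xy[(h, j)] / n_xa[h]       # Pr(Y = j | X = h), from A
            p_k = n_xz[(h, k)] / n_xb[h]       # Pr(Z = k | X = h), from B
            d += p_h * (min(p_j, p_k) - max(0.0, p_j + p_k - 1.0))
    return d / (len(y_levels) * len(z_levels))

def select_matching_variables(A, B, x_names, y="Y", z="Z"):
    """Return the search path as a list of (combination, d) pairs."""
    # Step 0: order the Xs by the d each one achieves on its own.
    remaining = sorted(x_names, key=lambda v: avg_width(A, B, [v], y, z))
    chosen = [remaining.pop(0)]
    path = [(tuple(chosen), avg_width(A, B, chosen, y, z))]
    while remaining:
        # Step 1: d for the current set combined with each remaining X.
        scored = [(avg_width(A, B, chosen + [v], y, z), v) for v in remaining]
        # Step 2: keep the combination with the smallest d, then repeat.
        best_d, best_v = min(scored, key=lambda t: t[0])
        chosen.append(best_v)
        remaining.remove(best_v)
        path.append((tuple(chosen), best_d))
    return path

# Toy data (made up): Y and Z are determined by X1; X2 is unrelated to both.
A = [{"X1": a, "X2": b, "Y": a} for a in (0, 1) for b in (0, 1) for _ in range(5)]
B = [{"X1": a, "X2": b, "Z": a} for a in (0, 1) for b in (0, 1) for _ in range(5)]
path = select_matching_variables(A, B, ["X2", "X1"])
```

On the toy data X1 is picked first because it alone removes all the uncertainty, while X2 alone leaves d at 0.5. The loop runs until every X has been added, mirroring the fact that the procedure has no stopping rule.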
The data (1)
Bayesian networks are used to generate two artificial samples sharing 3 binary Xs with the following association structure:
[Figure: true association structure; association structure in A; association structure in B]
Output of the procedure:
X variables   No. of Xs   d
X1            1           0.1703
X1*X3         2           0.1703
X1*X3*X2      3           0.1699
The data (2)
Artificial data resembling EU-SILC. Two artificial samples and 7 common variables.
Output of the procedure:
Combination of X variables                 No. of Xs   d        Best
c.age                                      1           0.0878   Yes
c.age*sex                                  2           0.0781   Yes
c.age*sex*edu7                             3           0.0714   Yes
c.age*sex*edu7*area5                       4           0.0608   No
c.age*sex*edu7*area5*hsize5                5           0.0411   No
c.age*sex*edu7*area5*hsize5*urb            6           0.0225   Yes
c.age*edu7*marital*sex*hsize5*area5*urb    7           0.0162   Yes
The found combinations with 4 and 5 Xs are very close to optimality
Conclusions
Pros:
- avoids separate analyses
- is able to find best solutions or solutions close to them
- is fully automatic; the code is written in R and related to the package StatMatch (D’Orazio, 2015)
Cons:
- dependence on the initial ordering of the variables
- absence of a stopping rule: as the no. of Xs increases, the uncertainty never increases, but the tables become very sparse
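The sparseness issue can be made concrete by tracking how many observations are available, on average, per cell of the table that cross-classifies the chosen Xs with Y (or Z). The Python sketch below is only a rough diagnostic under assumed inputs; it is not a stopping rule proposed in the slides, and the function name and the numbers are made up.

```python
from math import prod

def avg_cell_count(n, x_levels, other_levels):
    """Average number of observations per cell when n records are
    cross-classified by the X variables (with the given numbers of
    categories) and by Y or Z (with other_levels categories)."""
    return n / (prod(x_levels) * other_levels)

# Hypothetical example: 1000 records, Y with 4 categories,
# binary Xs added one at a time.
counts = [avg_cell_count(1000, [2] * m, 4) for m in range(1, 8)]
```

Each added binary X halves the average cell count, so the estimated conditional probabilities behind the bounds rest on fewer and fewer observations even as d keeps shrinking.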
Essential References
D’Orazio, M., Di Zio, M. and Scanu, M. (2006) Statistical Matching: Theory and Practice. Wiley, New York.
D’Orazio, M. (2015) “StatMatch: Statistical Matching”, R package version 1.2.3. http://CRAN.R-project.org/package=StatMatch