SLIDE 1

The Use of Uncertainty to Choose the Matching Variables in Statistical Matching

Marcello D’Orazio* (madorazi@istat.it) Marco Di Zio* (dizio@istat.it) Mauro Scanu* (scanu@istat.it) *Italian National Institute of Statistics (Istat) NTTS 2015 conference, Brussels, 10-12 March 2015

SLIDE 2

Statistical Matching (data fusion or synthetic matching)

A series of statistical methods for integrating two data sources (usually samples) that refer to the same target population. Objective: study the relationship between variables not jointly observed in a single data source.

Source A: Y, X
Source B: X, Z

The X variables are in common; Y and Z are NOT jointly observed.

Uncertainty to choose the matching variables, M. D’Orazio, M. Di Zio, M. Scanu – NTTS 2015, Brussels

SLIDE 3

Objectives of Statistical Matching

  • micro: derive a “synthetic” data set with X, Y and Z; for instance:
      • A filled in with Z
      • the concatenation of A and B, with Z filled in A and Y filled in B (file concatenation)
  • macro: estimation of parameters describing the relationship between Y and Z; for instance:
      • a correlation coefficient
      • a regression coefficient
      • a contingency table

Various methods are available, depending on the objective (micro or macro) and on the framework (parametric, nonparametric or mixed).
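As an illustration of the micro objective "A filled in with Z", here is a minimal Python sketch of one possible nonparametric approach: a random hot deck within classes defined by the matching variables. The slides do not prescribe this method. The function name `fill_a_with_z`, the record layout (lists of dictionaries) and the toy variables `sex` and `age` are all assumptions made for the example.

```python
import random
from collections import defaultdict


def fill_a_with_z(rec_a, rec_b, match_vars, seed=0):
    """Fill each record of A with a Z value donated by a record of B
    that has the same values on the matching variables (random hot deck).

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group the donor Z values of B by the combination of matching variables.
    donors = defaultdict(list)
    for b in rec_b:
        donors[tuple(b[v] for v in match_vars)].append(b["Z"])

    filled = []
    for a in rec_a:
        pool = donors.get(tuple(a[v] for v in match_vars))
        # No donor in the same class: leave Z missing (None).
        z = rng.choice(pool) if pool else None
        filled.append({**a, "Z": z})
    return filled


# Toy data: A observes (X, Y), B observes (X, Z).
A = [
    {"sex": "F", "age": "young", "Y": 1},
    {"sex": "M", "age": "old", "Y": 0},
    {"sex": "F", "age": "old", "Y": 1},
]
B = [
    {"sex": "F", "age": "young", "Z": "a"},
    {"sex": "F", "age": "young", "Z": "b"},
    {"sex": "M", "age": "old", "Z": "c"},
]

synthetic = fill_a_with_z(A, B, match_vars=["sex", "age"])
```

The third record of A has no donor in B with the same `sex` and `age`, so its Z stays missing. This is a small-scale version of the problem that arises when too many matching variables are used.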

SLIDE 4

Matching Variables

A and B may share many common variables X.

NOT all the X variables will be used. It is necessary to select just the most relevant Xs, called matching variables, i.e. the subset of the Xs connected, at the same time, with Y and Z.

Many methods can be applied to identify the best predictors of Y and the best predictors of Z, but they imply separate analyses on A and B.

Proposal: perform a single analysis for choosing the matching variables, by searching for the set of common variables most effective in reducing the uncertainty on the relationship between Y and Z.

Uncertainty is due to lack of information: Y and Z are NOT jointly observed.

SLIDE 5

Uncertainty bounds

Focus: X, Y and Z are categorical variables.

Objective of SM: estimation of the probabilities

$$p_{jk} = \Pr(Y = j, Z = k), \qquad j = 1, \ldots, J, \quad k = 1, \ldots, K$$

In this case the uncertainty set can be computed by resorting to the Fréchet bounds. By conditioning on the X variables, it is possible to conclude that the probability $p_{jk}$ will lie in the interval:

$$\left[ p_{jk}^{(low)},\; p_{jk}^{(up)} \right] = \left[ \sum_h p_{h} \max\{0,\; p_{j|h} + p_{k|h} - 1\},\;\; \sum_h p_{h} \min\{p_{j|h},\; p_{k|h}\} \right]$$

where $h$ indexes the categories of X (the cells of their cross-classification when there are several Xs), $p_{h} = \Pr(X = h)$, $p_{j|h} = \Pr(Y = j \mid X = h)$ and $p_{k|h} = \Pr(Z = k \mid X = h)$.
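The bounds above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the authors' R code: the function name `frechet_bounds_given_x` and the toy probabilities are invented for the example, and the inputs are taken as known probabilities rather than estimates from samples.

```python
import numpy as np


def frechet_bounds_given_x(p_x, p_y_given_x, p_z_given_x):
    """Fréchet bounds for p_jk = Pr(Y=j, Z=k), conditioning on X.

    p_x:          shape (H,),   Pr(X=h)
    p_y_given_x:  shape (H, J), Pr(Y=j | X=h)
    p_z_given_x:  shape (H, K), Pr(Z=k | X=h)
    Returns (low, up), each of shape (J, K).
    """
    w = np.asarray(p_x, dtype=float)[:, None, None]         # (H, 1, 1)
    py = np.asarray(p_y_given_x, dtype=float)[:, :, None]   # (H, J, 1)
    pz = np.asarray(p_z_given_x, dtype=float)[:, None, :]   # (H, 1, K)

    # Bounds within each category h of X, then averaged with weights Pr(X=h).
    low = (w * np.maximum(0.0, py + pz - 1.0)).sum(axis=0)
    up = (w * np.minimum(py, pz)).sum(axis=0)
    return low, up


# Toy example (hypothetical numbers): one binary X, binary Y and Z.
p_x = [0.5, 0.5]
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]
p_z_given_x = [[0.8, 0.2], [0.3, 0.7]]

low, up = frechet_bounds_given_x(p_x, p_y_given_x, p_z_given_x)

# For comparison: the bounds without conditioning on X (a single category).
p_y = np.average(p_y_given_x, axis=0, weights=p_x)
p_z = np.average(p_z_given_x, axis=0, weights=p_x)
low0, up0 = frechet_bounds_given_x([1.0], [p_y], [p_z])
```

In this toy case, conditioning on X narrows the interval for the first cell from [0.10, 0.55] to [0.35, 0.50]. This narrowing is what the selection of matching variables tries to exploit.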

SLIDE 6

Proposed method for choosing the matching variables

Step 0) Order the Xs according to their ability to minimize the average width of the uncertainty bounds:

$$d = \frac{1}{J \times K} \sum_{j,k} \left( \hat{p}_{jk}^{(up)} - \hat{p}_{jk}^{(low)} \right)$$

Step 1) Combine the starting variable(s) with each of the remaining ones, ordered as in step (0), and evaluate the uncertainty of every such combination in terms of d.

Step 2) Select the combination of variables that gives the largest decrease in d and go back to step (1).

The method was tested with artificial data.
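The steps above can be sketched as a greedy forward search. The following Python code is a minimal illustration under several assumptions and is not the authors' R implementation: the names `avg_width` and `greedy_select` are hypothetical, the inputs are (estimated) joint probability tables of (X1, ..., Xm, Y) and (X1, ..., Xm, Z) stored as NumPy arrays with the same X axes, the weights of the X cells are taken as the average of the two X marginals, and cells empty in either table are skipped.

```python
import numpy as np


def avg_width(p_xy, p_xz, keep):
    """Average width d of the Fréchet bounds for Pr(Y=j, Z=k) when
    conditioning on the X axes listed in `keep`.

    p_xy: shape (c1, ..., cm, J), joint probabilities of (X1..Xm, Y)
    p_xz: shape (c1, ..., cm, K), joint probabilities of (X1..Xm, Z)
    """
    m = p_xy.ndim - 1
    J, K = p_xy.shape[-1], p_xz.shape[-1]
    # Marginalize over the X variables that are not used for conditioning.
    drop = tuple(a for a in range(m) if a not in keep)
    xy = (p_xy.sum(axis=drop) if drop else p_xy).reshape(-1, J)
    xz = (p_xz.sum(axis=drop) if drop else p_xz).reshape(-1, K)

    pa, pb = xy.sum(axis=1), xz.sum(axis=1)   # X marginals from each table
    ok = (pa > 0) & (pb > 0)                  # assumption: skip empty cells
    py = (xy[ok] / pa[ok][:, None])[:, :, None]   # Pr(Y=j | X=h)
    pz = (xz[ok] / pb[ok][:, None])[:, None, :]   # Pr(Z=k | X=h)
    w = 0.5 * (pa[ok] + pb[ok])               # assumption: pooled weights
    w = (w / w.sum())[:, None, None]

    low = (w * np.maximum(0.0, py + pz - 1.0)).sum(axis=0)
    up = (w * np.minimum(py, pz)).sum(axis=0)
    return float((up - low).mean())           # (1 / (J*K)) * sum of widths


def greedy_select(p_xy, p_xz):
    """Return the search path as a list of (selected X axes, d)."""
    m = p_xy.ndim - 1
    # Step 0: order the Xs by the d obtained using each one alone.
    order = sorted(range(m), key=lambda a: avg_width(p_xy, p_xz, [a]))
    selected = [order[0]]
    path = [(tuple(selected), avg_width(p_xy, p_xz, selected))]
    while len(selected) < m:                  # no stopping rule, as in the slides
        remaining = [a for a in order if a not in selected]
        # Steps 1-2: try each remaining X and keep the one giving the smallest d.
        best = min(remaining, key=lambda a: avg_width(p_xy, p_xz, selected + [a]))
        selected.append(best)
        path.append((tuple(selected), avg_width(p_xy, p_xz, selected)))
    return path


# Toy tables (hypothetical): axis 0 is a pure-noise X, axis 1 drives Y and Z.
p_y_given = np.array([[0.9, 0.1], [0.2, 0.8]])
p_z_given = np.array([[0.8, 0.2], [0.3, 0.7]])
p_xy = np.tile(0.25 * p_y_given, (2, 1, 1))   # shape (2, 2, 2)
p_xz = np.tile(0.25 * p_z_given, (2, 1, 1))

path = greedy_select(p_xy, p_xz)
d_noise_only = avg_width(p_xy, p_xz, [0])
```

Ties in d are broken by the step (0) ordering, which is one way the result can depend on the initial ordering of the variables. In the toy tables the informative variable (axis 1) is selected first with d = 0.15, while the noise variable alone gives d = 0.45 and adds nothing once axis 1 is in the set.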

SLIDE 7

The data (1)

Bayesian networks are used to generate two artificial samples sharing 3 binary Xs with the following association structure:

[Figure: three network diagrams. Panels: true association structure; association structure in A; association structure in B]

Output of the procedure:

X variables    No. of Xs    d
X1             1            0.1703
X1*X3          2            0.1703
X1*X3*X2       3            0.1699

SLIDE 8

The data (2)

Artificial data resembling EU-SILC: two artificial samples and 7 common variables.

Output of the procedure:

Combination of X variables                 No. of Xs    d         Best
c.age                                      1            0.0878    Yes
c.age*sex                                  2            0.0781    Yes
c.age*sex*edu7                             3            0.0714    Yes
c.age*sex*edu7*area5                       4            0.0608    No
c.age*sex*edu7*area5*hsize5                5            0.0411    No
c.age*sex*edu7*area5*hsize5*urb            6            0.0225    Yes
c.age*edu7*marital*sex*hsize5*area5*urb    7            0.0162    Yes

The combinations found with 4 and 5 Xs are not the best ones, but they are very close to optimal.

SLIDE 9

Conclusions

Pros:

  • avoids separate analyses
  • is able to find the best solutions or solutions close to them
  • is fully automatic; the code is written in R and related to the package StatMatch (D’Orazio, 2015)

Cons:

  • dependence on the initial ordering of the variables
  • absence of a stopping rule: as the no. of Xs increases, the uncertainty always decreases but the tables become very sparse

SLIDE 10

Essential References

D’Orazio, M., Di Zio, M., and Scanu, M. (2006) Statistical Matching: Theory and Practice. Wiley, New York.

D’Orazio, M. (2015) “StatMatch: Statistical Matching”, R package version 1.2.3. http://CRAN.R-project.org/package=StatMatch
