Variance Estimation in Presence of Imputation: an Application to ISTAT Survey Data

SLIDE 1

Variance Estimation in Presence of Imputation: an Application to ISTAT Survey Data

Marco Di Zio, Stefano Falorsi, Ugo Guarnera, Orietta Luzi, Paolo Righi paolo.righi@istat.it

European Conference on Quality in Official Statistics - 2008, Rome 8-11 July 2008

SLIDE 2

OUTLINE

  • Imputation and Variance Estimation in Official Statistics
  • Adjusted Jackknife (AJ) variance estimation under Hot Deck (HD) imputation

  • Delete-A-Group Jackknife (DAGJK) and Extended DAGJK (EDAGJK)
  • DAGJK and EDAGJK with Rao and Shao adjustment for imputation
  • Description of a Monte Carlo simulation study on real survey data
  • Results

SLIDE 3

Imputation and Variance Estimation in Official Statistics

  • In Official Statistics, item nonresponse in survey data is generally dealt with by imputation

  • In variance estimation, imputed data are usually treated as if they were true values, without taking into account the additional variability due to the adjustment process. Standard formulas can therefore seriously underestimate the variance

  • National Statistical Institutes usually do not apply variance estimation methods that account for imputed data, because of both theoretical and computational problems

  • Our goal is to study methods belonging to the jackknife family, focusing on their feasibility with respect to official statistical data

SLIDE 4

Adjusted Jackknife (AJ) variance estimator under Hot Deck (HD) imputation

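  • Rao and Shao (1992) proposed an AJ variance estimator under HD imputation that is consistent under some assumptions on the response model

  • For a stratified multistage sampling design with negligible finite population correction factor, the AJ variance estimator is

$$\widehat{Var}(\hat{Y}_I) = \sum_h \sum_{k \in h} \frac{n_h - 1}{n_h} \left( \hat{Y}_{Ih}^{(k)} - \hat{Y}_I \right)^2 \qquad (1)$$

where

$$\hat{Y}_I = \sum_h \left( \sum_{k \in A_{Rh}} w_k y_k + \sum_{k \in A_{Mh}} w_k y_k^{*} \right) \qquad (2)$$

Here $A_{Rh}$ and $A_{Mh}$ denote the responding and the missing (imputed) units in stratum $h$, and $y_k^{*}$ is the imputed value.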

SLIDE 5

Adjusted Jackknife (AJ) variance estimator under HD imputation

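  • The term $\hat{Y}_{Ih}^{(k)}$ in (1) is the estimate of $\hat{Y}_I$ when unit $k \in h$ is omitted:

$$\hat{Y}_{Ih}^{(k)} = \sum_c \left[ \sum_{i \in A_{Rc}} w_i^{(k)} y_i + \sum_{i \in A_{Mc}} w_i^{(k)} \left( y_i^{*} + \hat{y}_{Rc}^{(k)} - \bar{y}_{Rc} \right) \right] \qquad (3)$$

with

$$\hat{y}_{Rc}^{(k)} = \frac{\sum_{i \in A_{Rc}} w_i^{(k)} y_i}{\sum_{i \in A_{Rc}} w_i^{(k)}}, \qquad \bar{y}_{Rc} = \frac{\sum_{i \in A_{Rc}} w_i y_i}{\sum_{i \in A_{Rc}} w_i}$$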
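As a rough illustration of (1)-(3), the sketch below computes the adjusted jackknife variance in a simplified setting. It assumes a single-stage stratified design in which every unit is its own PSU, the usual delete-one replicate weights (0 for the omitted unit, $n_h/(n_h-1)$ times the weight for the other units of its stratum, unchanged elsewhere) and at least two units per stratum. The record fields `h`, `cell`, `w`, `y` and `resp` are hypothetical names, not taken from the slides.

```python
from collections import Counter, defaultdict

def adjusted_total(units, rep_w):
    """Replicate total as in (3): imputed values are shifted by the change
    in the respondent mean of their imputation cell."""
    full = defaultdict(lambda: [0.0, 0.0])  # cell -> [sum w*y, sum w], full weights
    rep = defaultdict(lambda: [0.0, 0.0])   # cell -> same sums, replicate weights
    for u, rw in zip(units, rep_w):
        if u["resp"]:
            full[u["cell"]][0] += u["w"] * u["y"]
            full[u["cell"]][1] += u["w"]
            rep[u["cell"]][0] += rw * u["y"]
            rep[u["cell"]][1] += rw
    total = 0.0
    for u, rw in zip(units, rep_w):
        if u["resp"]:
            total += rw * u["y"]
        else:
            r, f = rep[u["cell"]], full[u["cell"]]
            # Assumption: no shift if the replicate leaves the cell without respondents.
            shift = r[0] / r[1] - f[0] / f[1] if r[1] > 0 else 0.0
            total += rw * (u["y"] + shift)
    return total

def aj_variance(units):
    """Adjusted jackknife variance (1), deleting one unit at a time."""
    n_h = Counter(u["h"] for u in units)
    y_full = sum(u["w"] * u["y"] for u in units)  # imputed estimator (2)
    var = 0.0
    for k, uk in enumerate(units):
        n = n_h[uk["h"]]
        rep_w = [
            0.0 if i == k
            else u["w"] * n / (n - 1) if u["h"] == uk["h"]
            else u["w"]
            for i, u in enumerate(units)
        ]
        var += (n - 1) / n * (adjusted_total(units, rep_w) - y_full) ** 2
    return var
```

With complete data (every `resp` set to `True`) the shift term never applies and the function reduces to the ordinary delete-one jackknife.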

  • HD imputation replaces the missing values $y_k$ of an incomplete unit (recipient) with the observed values $y_k^{*}$ from another record (donor) chosen among the complete units of the same survey

  • In Random HD, the donor is randomly selected from a pool of units belonging to a subset of records (imputation cell $c$) having the same levels of some categorical variables
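Random HD within imputation cells can be sketched as follows. This is a minimal version that assumes donors are drawn with equal probability inside each cell (the slides do not say whether selection is weighted) and that missing values are coded as `None`. The field names `cell`, `y` and `resp` are hypothetical.

```python
import random

def random_hot_deck(units, rng):
    """Return copies of `units` with missing `y` (None) filled from a random
    donor in the same imputation cell; `resp` flags the original respondents."""
    donors = {}
    for u in units:
        if u["y"] is not None:
            donors.setdefault(u["cell"], []).append(u["y"])
    imputed = []
    for u in units:
        v = dict(u, resp=u["y"] is not None)
        if not v["resp"]:
            if u["cell"] not in donors:
                raise ValueError(f"no donor available in cell {u['cell']!r}")
            v["y"] = rng.choice(donors[u["cell"]])  # equal-probability donor
        imputed.append(v)
    return imputed

sample = [
    {"cell": "a", "y": 1},
    {"cell": "a", "y": None},   # recipient, gets a value from cell "a"
    {"cell": "b", "y": 0},
]
completed = random_hot_deck(sample, random.Random(2008))
```

Keeping the `resp` flag on each record matters later: the Rao and Shao adjustment needs to know which values were observed and which were imputed.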

SLIDE 6

Adjusted Jackknife (AJ) variance estimator under HD imputation

  • Advantages of using the jackknife in Official Statistics:
    – No model assumptions are needed;
    – Unit and item nonresponse are easily handled by this method;
    – Variances of nonlinear statistics and of domain estimates can be easily calculated by external users.

  • Drawbacks:
    – The jackknife becomes computer intensive for large-scale surveys;
    – It is sometimes unsuitable for the sampling designs typically adopted in Official Statistics (strata with small sample sizes lead to upward-biased estimates).

SLIDE 7

Delete-A-Group Jackknife (DAGJK) and Extended DAGJK (EDAGJK)

  • To overcome these problems we consider the DAGJK method
  • DAGJK is based on the following jackknife procedure:
    – Primary Sampling Units (PSUs) in the same stratum are randomly ordered;
    – From this ordering, the PSUs are systematically allocated into G groups;
    – For the g-th group, the replicate-g weights for each elementary unit k are computed:

SLIDE 8

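Delete-A-Group Jackknife (DAGJK) and Extended DAGJK (EDAGJK)

$$w_k^{(g)} = \begin{cases} w_k, & \text{when } k \in h \text{ and no PSU} \in h \text{ belongs to group } g \\ 0, & \text{when } k \in \text{PSU in group } g \\ \dfrac{n_h}{n_h - n_h^{(g)}}\, w_k, & \text{otherwise} \end{cases} \qquad (4)$$

where $n_h$ is the number of sample PSUs in stratum $h$ and $n_h^{(g)}$ is the number of those PSUs allocated to group $g$.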

  • The precision of the variance estimates improves as the number of random groups increases

  • DAGJK produces biased estimates when the number of sample PSUs in a stratum is small (fewer than 5)

  • Kott (2001) proposed the EDAGJK to handle the latter case
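The group allocation and the replicate weights in (4) can be sketched as below. The reading of "systematically allocated" as a round-robin assignment over the concatenated random orderings is an assumption, and the names `psu_stratum`, `group`, `psu` and `w` are hypothetical.

```python
import random
from collections import Counter

def assign_groups(psu_stratum, G, rng):
    """Randomly order PSUs within each stratum, then allocate them to the
    groups 0..G-1 in turn along the concatenated ordering."""
    by_h = {}
    for psu, h in psu_stratum.items():
        by_h.setdefault(h, []).append(psu)
    group, pos = {}, 0
    for h in sorted(by_h):
        psus = sorted(by_h[h])
        rng.shuffle(psus)
        for psu in psus:
            group[psu] = pos % G
            pos += 1
    return group

def dagjk_weights(units, psu_stratum, group, g):
    """Replicate-g weights as in (4), one weight per elementary unit."""
    n_h = Counter(psu_stratum.values())
    n_hg = Counter(h for psu, h in psu_stratum.items() if group[psu] == g)
    weights = []
    for u in units:
        h = psu_stratum[u["psu"]]
        if n_hg[h] == 0:                # stratum untouched by group g
            weights.append(u["w"])
        elif group[u["psu"]] == g:      # unit belongs to a deleted PSU
            weights.append(0.0)
        else:                           # remaining PSUs of the stratum
            weights.append(u["w"] * n_h[h] / (n_h[h] - n_hg[h]))
    return weights
```

Only G sets of replicate weights are needed, whatever the number of PSUs, which is what makes the method attractive for large-scale surveys.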

SLIDE 9

Delete-A-Group Jackknife (DAGJK) and Extended DAGJK (EDAGJK)

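  • EDAGJK is based on the following replicate weights

$$w_k^{(g)} = \begin{cases} w_k, & \text{when } k \in h \text{ and no PSU} \in h \text{ belongs to group } g \\ w_k \left[ 1 - (n_h - 1) Z \right], & \text{when } k \in \text{PSU in group } g \\ w_k (1 + Z), & \text{otherwise} \end{cases} \qquad (5)$$

where $Z^2 = G / \left[ (G - 1)\, n_h (n_h - 1) \right]$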

  • The DAGJK or EDAGJK variance estimator using the weights in formulas (4) or (5) is

$$\widehat{Var}(\hat{Y}) = \frac{G - 1}{G} \sum_g \left( \hat{Y}^{(g)} - \hat{Y} \right)^2 \qquad (6)$$

with $\hat{Y}^{(g)} = \sum_{k \in s} w_k^{(g)} y_k$.
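A minimal sketch of the EDAGJK weights (5) and of the variance estimator (6) follows. It assumes that each group contains at most one PSU per stratum and that every stratum has at least two sample PSUs. The estimator is scaled by $(G-1)/G$, the factor under which the weights in (5) reproduce the usual stratified variance. All function and field names are hypothetical.

```python
import math
from collections import Counter

def edagjk_weights(units, psu_stratum, group, g, G):
    """Replicate-g weights as in (5)."""
    n_h = Counter(psu_stratum.values())
    hit = {h for psu, h in psu_stratum.items() if group[psu] == g}
    weights = []
    for u in units:
        h = psu_stratum[u["psu"]]
        if h not in hit:                       # stratum untouched by group g
            weights.append(u["w"])
            continue
        n = n_h[h]
        z = math.sqrt(G / ((G - 1) * n * (n - 1)))
        if group[u["psu"]] == g:               # PSU in group g
            weights.append(u["w"] * (1 - (n - 1) * z))
        else:                                  # other PSUs of the stratum
            weights.append(u["w"] * (1 + z))
    return weights

def group_jackknife_variance(units, replicate_weights):
    """Variance estimator (6) from a list of G replicate weight vectors."""
    G = len(replicate_weights)
    y_full = sum(u["w"] * u["y"] for u in units)
    dev2 = 0.0
    for w_g in replicate_weights:
        y_g = sum(w * u["y"] for w, u in zip(w_g, units))
        dev2 += (y_g - y_full) ** 2
    return (G - 1) / G * dev2

# One stratum with four single-unit PSUs, each in its own group.
psu_stratum = {f"p{i}": "A" for i in range(4)}
group = {f"p{i}": i for i in range(4)}
units = [{"psu": f"p{i}", "w": 2.0, "y": float(i + 1)} for i in range(4)]
reps = [edagjk_weights(units, psu_stratum, group, g, 4) for g in range(4)]
variance = group_jackknife_variance(units, reps)
```

In this toy case the result coincides with the textbook stratified variance $w^2 n_h s^2$ of the estimated total, which is a convenient check on the weights.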

SLIDE 10

DAGJK and EDAGJK with Rao and Shao adjustment for imputation

  • We propose a DAGJK or EDAGJK version with the Rao and Shao adjustment for HD imputation

  • The method obtains $\hat{Y}_I^{(g)}$ by using the DAGJK or EDAGJK replicate weights in (3):

$$\hat{Y}_I^{(g)} = \sum_c \left[ \sum_{i \in A_{Rc}} w_i^{(g)} y_i + \sum_{i \in A_{Mc}} w_i^{(g)} \left( y_i^{*} + \hat{y}_{Rc}^{(g)} - \bar{y}_{Rc} \right) \right] \qquad (7)$$

with

$$\hat{y}_{Rc}^{(g)} = \frac{\sum_{i \in A_{Rc}} w_i^{(g)} y_i}{\sum_{i \in A_{Rc}} w_i^{(g)}}$$
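Plugging the adjusted replicates (7) into the delete-a-group estimator gives the variance estimator of the imputed total. The slides do not display this last step, so the expression below is an inference from (2), (6) and (7), written with the $(G-1)/G$ scaling of the delete-a-group jackknife:

```latex
\widehat{Var}(\hat{Y}_I)
  = \frac{G-1}{G} \sum_{g=1}^{G} \left( \hat{Y}_I^{(g)} - \hat{Y}_I \right)^2 ,
\qquad
\hat{Y}_I = \sum_c \left( \sum_{i \in A_{Rc}} w_i y_i + \sum_{i \in A_{Mc}} w_i y_i^{*} \right)
```

Here $\hat{Y}_I$ is the imputed estimator (2), regrouped by imputation cell instead of by stratum.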

SLIDE 11

Description of a Monte Carlo simulation study on real survey data

  • The population of an Italian geographical region, Lazio (except for the province of Rome), with 1,372,572 units has been considered

  • 250 samples have been selected according to the Italian Labour Force sampling design:
    – The municipalities of each province are ordered by population size, and strata of municipalities with population size equal to a given threshold are formed. Strata with only one municipality are referred to as self-representing (S-R) strata (7);
    – In each S-R stratum a sample of households (PSUs) is selected (stratified cluster design);

SLIDE 12

Description of a Monte Carlo simulation study on real survey data

  • In each non-self-representing (NS-R) stratum a probability-proportional-to-size (pps) sample of 2 municipalities (PSUs) is drawn, and a sample of households is then selected (two-stage stratified design);

  • There are many NS-R strata with a non-negligible PSU sampling fraction

Frequency of NS-R strata by PSU sampling fraction

  PSU sampling fraction   < 20%   20%-40%   40%-60%   > 60%   Total
  Frequency                   8         5         2       3      18

  • The total of the variable employment (employed/not employed) has been considered

SLIDE 13

Description of a Monte Carlo simulation study on real survey data

  • A Missing at Random (MAR) mechanism has been simulated using 8 different missing rates depending on the values of 2 covariates: X1 (levels: 1, 2, 3, 4), the household type, and a domain indicator for whether the unit belongs to an S-R or an NS-R stratum.

Missing rate for the simulated nonresponse mechanism

          X1 = 1   X1 = 2   X1 = 3   X1 = 4
  NS-R      10%      20%      30%      40%
  S-R       40%      30%      20%      10%

  • The number of PSUs (municipalities + households) is 552
  • The HD method is applied with imputation cells defined as above
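The simulated nonresponse mechanism can be sketched as follows, using the eight missing rates from the table. The field names `domain`, `x1` and `y`, and the coding of a missing value as `None`, are assumptions made for the example.

```python
import random

# Missing rates by (stratum type, household type X1), as in the table.
MISSING_RATE = {
    ("NS-R", 1): 0.10, ("NS-R", 2): 0.20, ("NS-R", 3): 0.30, ("NS-R", 4): 0.40,
    ("S-R", 1): 0.40, ("S-R", 2): 0.30, ("S-R", 3): 0.20, ("S-R", 4): 0.10,
}

def apply_mar(units, rng):
    """Return copies of `units` where `y` is set to None with a probability
    that depends only on the observed covariates (MAR)."""
    out = []
    for u in units:
        v = dict(u)
        if rng.random() < MISSING_RATE[(u["domain"], u["x1"])]:
            v["y"] = None
        out.append(v)
    return out

sample = [
    {"domain": "S-R", "x1": 1, "y": 1},
    {"domain": "NS-R", "x1": 4, "y": 0},
]
with_missing = apply_mar(sample, random.Random(2008))
```

Because the response probability is constant within each combination of the two covariates, using those combinations as imputation cells matches the mechanism that generated the missing values.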

SLIDE 14

Results: Relative Bias (RB) and Relative Root Mean Square Error (RRMSE) of EDAGJK by number of random groups (RG)

                  Without missing data     With imputed data
  Number of RG       RB      RRMSE           RB      RRMSE
        5           0.07      0.83          0.07      0.83
       15           0.09      0.50          0.11      0.57
       30           0.09      0.44          0.11      0.48
       50           0.08      0.38          0.10      0.43
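The slides do not state how RB and RRMSE are computed. The sketch below uses the conventional Monte Carlo definitions, which is an assumption: both are taken relative to a reference ("true") variance, typically the empirical variance of the point estimates over the simulated samples.

```python
import math

def relative_bias(var_estimates, true_var):
    """Mean variance estimate minus the reference variance, relative to it."""
    mean_est = sum(var_estimates) / len(var_estimates)
    return (mean_est - true_var) / true_var

def relative_rmse(var_estimates, true_var):
    """Root mean square error of the variance estimates, relative to the
    reference variance."""
    mse = sum((v - true_var) ** 2 for v in var_estimates) / len(var_estimates)
    return math.sqrt(mse) / true_var

# Two hypothetical variance estimates around a reference variance of 10.
rb = relative_bias([9.0, 11.0], 10.0)
rrmse = relative_rmse([9.0, 11.0], 10.0)
```

In the toy call the two estimates are unbiased on average but each misses the reference by 10%, so RB and RRMSE differ.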

SLIDE 15

Results: Boxplots of the variance estimates of the methods EDAGJK, DAGJK, standard formula and jackknife

[Figure: boxplots of the variance estimates by method]

SLIDE 16

Results: Confidence intervals of the methods: 95% CI coverage and CI relative length (RL)

                        Without missing data         With imputed data
  Method                95% CI coverage   CI RL      95% CI coverage   CI RL
  EDAGJK (30 RG)             90.5          18.7           92.5          23.1
  DAGJK (30 RG)              97.5          24.5           98.0          29.8
  Standard formula           91.5          18.8           88.0          19.5
  Jackknife                  97.5          24.9           98.5          30.8
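Coverage and relative length could be computed along the following lines. The slides give no formulas, so everything here is an assumption: a normal-approximation interval (estimate ± 1.96 standard errors), coverage as the percentage of samples whose interval contains the true total, and relative length as the average interval length in percent of the true total.

```python
import math

def ci_summary(estimates, variances, true_total, z=1.96):
    """Return (coverage %, mean relative CI length %) over simulated samples."""
    covered, rel_len = 0, 0.0
    for est, var in zip(estimates, variances):
        half = z * math.sqrt(var)
        covered += est - half <= true_total <= est + half
        rel_len += 2 * half / true_total
    n = len(estimates)
    return 100 * covered / n, 100 * rel_len / n

# Two hypothetical samples, each with estimate 100 and variance estimate 25.
coverage, rel_length = ci_summary([100.0, 100.0], [25.0, 25.0], 105.0)
```

Under these definitions a variance estimator that is biased upward yields both higher coverage and longer intervals, which is the pattern to look for when comparing methods.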

SLIDE 17

Conclusion

  • Variance estimation that takes imputed data into account is a pressing goal in Official Statistics

  • The proposed approach based on EDAGJK with the Rao and Shao adjustment seems to perform well in terms of precision of the variance estimates while remaining computationally feasible

  • The empirical results suggest that the approach is suitable for the complex designs usually used in National Statistical Institutes

  • Further analyses are needed to take into account a finite population correction factor in the variance estimator

  • Finally, an empirical study with calibration estimators is needed

SLIDE 18

References

  • Brick, J.M., Jones, M.E., Kalton, G., Valliant, R. (2005). Variance estimation with hot deck imputation: a simulation study of three methods. Survey Methodology, 31, 151-159.
  • Kott, P.S. (2001). The delete-a-group jackknife. Journal of Official Statistics, 17, 521-526.
  • Kott, P.S. (2006). Delete-a-group variance estimation for the general regression estimator under Poisson sampling. Journal of Official Statistics, 22, 759-767.
  • Lee, H., Rancourt, E., Särndal, C.-E. (1995). Jackknife variance for data with imputed values. Proceedings of the Statistical Society of Canada Survey Methods Section, 111-115.
  • Rao, J.N.K., Shao, J. (1992). Jackknife variance estimation with survey data under hot deck imputation. Biometrika, 79, 811-822.
  • Rust, K. (1985). Variance estimation for complex estimators in sample surveys. Journal of Official Statistics, 1, 381-397.
  • Shao, J., Steel, P. (1999). Variance estimation for survey data with composite estimation and nonnegligible sampling fractions. Journal of the American Statistical Association, 94, 254-265.
  • Wolter, K.M. (1985). Introduction to variance estimation. New York, Springer Verlag.