SLIDE 1

Using the Superpopulation Model for Imputations and Variance Computation in Survey Sampling

Petr Novák, Václav Kosina Czech Statistical Office

Petr Novák, Václav Kosina Using the Superpopulation Model for Imputations and Variance

SLIDE 2

Introduction

Situation: Let us have a population of $N$ units: $n$ sampled (sam) and $N-n$ unknown (imp). We want to estimate the population total $Y = \sum_{i=1}^{N} y_i$.


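Model assumptions: $y_i = \beta x_i + \epsilon_i$, where the $\epsilon_i$ are independent random variables with $E\epsilon_i = 0$ and $\mathrm{var}\,\epsilon_i = c_i\sigma^2$; $x_i$ and $c_i$ are known constants for all $i = 1, \dots, N$; $\beta$ and $\sigma^2$ are unknown parameters.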

SLIDE 4

Imputation

Estimation: Estimate $\beta$ from the sampled part using the least squares method:

$$\hat\beta = \frac{\sum_{sam} w_i x_i y_i / c_i}{\sum_{sam} w_i x_i^2 / c_i},$$

where the $w_i$ are some appropriate weights. Note: constant weights and $c_i = x_i$ give

$$\hat\beta = \frac{\sum_{sam} y_i}{\sum_{sam} x_i}.$$


Data imputation: For each unit from the unknown part we impute $\hat y_i = x_i\hat\beta$. The estimate of the population total is then

$$\hat Y = \sum_{sam} y_i + \sum_{imp} \hat y_i.$$
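The estimation and imputation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `wls_slope` and `impute_total` and the toy numbers are made up for this example.

```python
def wls_slope(x, y, c, w=None):
    """Least squares slope through the origin over the sampled units:
    beta_hat = sum(w*x*y/c) / sum(w*x^2/c)."""
    if w is None:
        w = [1.0] * len(x)  # constant weights
    num = sum(wi * xi * yi / ci for wi, xi, yi, ci in zip(w, x, y, c))
    den = sum(wi * xi * xi / ci for wi, xi, ci in zip(w, x, c))
    return num / den


def impute_total(x_sam, y_sam, c_sam, x_imp, w=None):
    """Return (beta_hat, imputed values, estimated population total)."""
    beta_hat = wls_slope(x_sam, y_sam, c_sam, w)
    y_imp_hat = [xi * beta_hat for xi in x_imp]  # y_hat_i = x_i * beta_hat
    return beta_hat, y_imp_hat, sum(y_sam) + sum(y_imp_hat)


# Toy data (hypothetical): four sampled units and three unsampled ones.
x_sam = [2.0, 4.0, 6.0, 8.0]
y_sam = [5.0, 9.0, 16.0, 20.0]
x_imp = [3.0, 5.0, 7.0]

# Passing c_i = x_i with constant weights reduces beta_hat to sum(y)/sum(x).
beta_hat, y_imp_hat, y_total = impute_total(x_sam, y_sam, x_sam, x_imp)
```

With these inputs the slope equals the ratio of the sampled totals, which matches the note on constant weights and $c_i = x_i$.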

SLIDE 6

Differences from classic techniques

Classic reweighting approach: the $y_i$ are treated as constants. Randomness enters through the sample inclusion indicators. The error is computed through $\mathrm{var}\,\hat Y$.

Superpopulation model approach: the $y_i$ are treated as random variables. The real $y_i$ from the imputed part are predicted with $\hat y_i = x_i\hat\beta$. The error is computed through $\mathrm{mse}\,\hat Y = E(\hat Y - Y)^2$.

SLIDE 7

Error computation

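The least squares estimator is unbiased ($E\hat\beta = \beta$). Therefore $E\hat y_i = E x_i\hat\beta = x_i\beta = E y_i$. Write $\hat Y_{imp} = \sum_{imp}\hat y_i$ and $Y_{imp} = \sum_{imp} y_i$; the sampled parts of $\hat Y$ and $Y$ cancel. The mean square error of the prediction is then

$$\begin{aligned}
\mathrm{mse}\,\hat Y &= E(\hat Y - Y)^2 = E(\hat Y_{imp} - Y_{imp})^2 \\
&= E(\hat Y_{imp} - E\hat Y_{imp} - Y_{imp} + EY_{imp})^2 \\
&= E(\hat Y_{imp} - E\hat Y_{imp})^2 + E(Y_{imp} - EY_{imp})^2 - 2E(\hat Y_{imp} - E\hat Y_{imp})(Y_{imp} - EY_{imp}) \\
&= \mathrm{var}\,\hat Y_{imp} + \mathrm{var}\,Y_{imp}.
\end{aligned}$$

The cross term vanishes because $\hat Y_{imp}$ depends only on the sampled $y_i$, which are independent of the $y_i$ in the imputed part.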
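A small Monte Carlo experiment can make this decomposition plausible. The sketch below assumes normally distributed errors, constant weights and $c_i = x_i$; the slide requires none of these, and the function `simulate` and all numbers are hypothetical.

```python
import random
import statistics


def simulate(x_sam, x_imp, beta, sigma2, n_rep=20000, seed=1):
    """Return (empirical mse, var of Y_hat_imp, var of Y_imp) over n_rep draws."""
    rng = random.Random(seed)
    X_sam, X_imp = sum(x_sam), sum(x_imp)
    est, real = [], []
    for _ in range(n_rep):
        # y_i = beta*x_i + eps_i with var(eps_i) = c_i*sigma2 and c_i = x_i
        # (normal errors are an assumption of this sketch).
        y_sam = [beta * x + rng.gauss(0.0, (x * sigma2) ** 0.5) for x in x_sam]
        y_imp = [beta * x + rng.gauss(0.0, (x * sigma2) ** 0.5) for x in x_imp]
        beta_hat = sum(y_sam) / X_sam   # least squares with w_i = 1, c_i = x_i
        est.append(X_imp * beta_hat)    # Y_hat_imp, the imputed total
        real.append(sum(y_imp))         # Y_imp, the true total of the same units
    mse = sum((e - r) ** 2 for e, r in zip(est, real)) / n_rep
    return mse, statistics.pvariance(est), statistics.pvariance(real)


mse, var_est, var_real = simulate([1.0, 2.0, 3.0, 4.0], [2.0, 5.0],
                                  beta=2.0, sigma2=1.0)
# mse should be close to var_est + var_real, up to simulation noise.
```

The two variances are computed from separate lists, so any remaining gap between `mse` and their sum reflects the empirical cross term, which shrinks as `n_rep` grows.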

SLIDE 8

Variance computation

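The variance of estimated values, with $X_{imp} = \sum_{imp} x_i$, is

$$\mathrm{var}\,\hat Y_{imp} = \mathrm{var}(X_{imp}\hat\beta) = X_{imp}^2\,\mathrm{var}\,\hat\beta = X_{imp}^2\,\frac{\sum_{sam} w_i^2 x_i^2 / c_i}{\left(\sum_{sam} w_i x_i^2 / c_i\right)^2}\,\sigma^2.$$

We denote $\mathrm{var}\,\hat\beta$ as $\sigma_\beta^2$.

The variance of the predicted real values is $\mathrm{var}\,Y_{imp} = \sum_{imp} c_i\sigma^2$. Denote $C_{imp} := \sum_{imp} c_i$. We get

$$\mathrm{mse}\,\hat Y = X_{imp}^2\,\sigma_\beta^2 + C_{imp}\,\sigma^2.$$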


Possible estimators for $\sigma^2$:

$$\frac{1}{n-1}\sum_{sam}\frac{(y_i - \hat\beta x_i)^2}{c_i}, \qquad \frac{1}{\sum w_i - \bar w}\sum_{sam}\frac{w_i\,(y_i - \hat\beta x_i)^2}{c_i}.$$
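Putting the pieces together, a plug-in estimate of the mean square error can be sketched as follows. The function name `estimate_mse` and the toy data are hypothetical, and the sketch uses the first of the two estimators of $\sigma^2$ (the one with $n - 1$ in the denominator); it is an illustration, not the authors' code.

```python
def estimate_mse(x_sam, y_sam, c_sam, x_imp, c_imp, w=None):
    """Plug-in estimate of mse(Y_hat) = X_imp^2 * var(beta_hat) + C_imp * sigma^2."""
    n = len(x_sam)
    w = [1.0] * n if w is None else w
    rows = list(zip(w, x_sam, y_sam, c_sam))
    B = sum(wi * xi ** 2 / ci for wi, xi, _, ci in rows)  # denominator of beta_hat
    beta_hat = sum(wi * xi * yi / ci for wi, xi, yi, ci in rows) / B
    # Squared residuals scaled by 1/c_i, divided by n - 1.
    sigma2_hat = sum((yi - beta_hat * xi) ** 2 / ci
                     for _, xi, yi, ci in rows) / (n - 1)
    # var(beta_hat) = sum(w^2 x^2 / c) / B^2 * sigma^2
    var_beta_hat = sum(wi ** 2 * xi ** 2 / ci
                       for wi, xi, _, ci in rows) / B ** 2 * sigma2_hat
    mse_hat = sum(x_imp) ** 2 * var_beta_hat + sum(c_imp) * sigma2_hat
    return beta_hat, sigma2_hat, mse_hat


# Toy data (hypothetical), using c_i = x_i for sampled and unsampled units.
x_sam = [2.0, 4.0, 6.0, 8.0]
y_sam = [5.0, 9.0, 16.0, 20.0]
x_imp = [3.0, 5.0, 7.0]
beta_hat, sigma2_hat, mse_hat = estimate_mse(x_sam, y_sam, x_sam, x_imp, x_imp)
```

The function needs $c_i$ for the unsampled units as well, which reflects the requirement that the auxiliary constants are known for the whole population.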

SLIDE 10

Special cases

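If $w_i \equiv \text{const.}$ and $c_i = x_i$, we get $\sigma_\beta^2 = \frac{1}{X_{sam}}\sigma^2$ and therefore

$$\mathrm{mse}\,\hat Y = X_{imp}^2\,\frac{\sigma^2}{X_{sam}} + X_{imp}\,\sigma^2 = \frac{X_{imp} X_{all}}{X_{sam}}\,\sigma^2.$$

If we have no auxiliary information available and set $x_i \equiv 1$, we impute the sample mean for each unit. We then get the commonly used formula

$$\mathrm{mse}\,\hat Y = \frac{(N-n)N}{n}\,\sigma^2 = \frac{N^2}{n}\left(1 - \frac{n}{N}\right)\sigma^2.$$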
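The intermediate algebra for the first special case can be written out as below. The reading $X_{sam} = \sum_{sam} x_i$ and $X_{all} = X_{sam} + X_{imp}$ is assumed from the slide notation, and $w$ stands for the constant weight.

```latex
% Special case: w_i \equiv w (constant) and c_i = x_i, so C_{imp} = X_{imp}.
\begin{align*}
\sigma_\beta^2
  &= \frac{\sum_{sam} w^2 x_i^2 / x_i}{\bigl(\sum_{sam} w x_i^2 / x_i\bigr)^2}\,\sigma^2
   = \frac{w^2 X_{sam}}{w^2 X_{sam}^2}\,\sigma^2
   = \frac{\sigma^2}{X_{sam}}, \\
\operatorname{mse}\hat Y
  &= X_{imp}^2\,\frac{\sigma^2}{X_{sam}} + X_{imp}\,\sigma^2
   = \frac{X_{imp}\,(X_{imp} + X_{sam})}{X_{sam}}\,\sigma^2
   = \frac{X_{imp} X_{all}}{X_{sam}}\,\sigma^2 .
\end{align*}
```

Setting $x_i \equiv 1$ gives $X_{sam} = n$, $X_{imp} = N - n$ and $X_{all} = N$, which yields the second formula.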

SLIDE 11

Chain imputation

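Situation: $x_i$ not known, but estimated from $z_i$.

Model: $y_i \mid x_i \sim (x_i\beta_{yx},\, c_i\sigma_{yx}^2)$, $\quad x_i \sim (z_i\beta_{xz},\, d_i\sigma_{xz}^2)$.

With the help of conditional variance decomposition we get

$$\begin{aligned}
\mathrm{mse}(\hat Y) &= \mathrm{var}\,\hat Y_{imp} + \mathrm{var}\,Y_{imp} \\
&= E\,\mathrm{var}[\hat Y_{imp} \mid X] + \mathrm{var}\,E[\hat Y_{imp} \mid X] + E\,\mathrm{var}[Y_{imp} \mid X] + \mathrm{var}\,E[Y_{imp} \mid X] \\
&\;\;\vdots \\
&= E\,\mathrm{mse}(\hat Y \mid X) + \beta_{yx}^2\,\mathrm{mse}(\hat X).
\end{aligned}$$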

SLIDE 12

Chain imputation

Estimated error:

$$\widehat{\mathrm{mse}}\,\hat Y = \mathrm{mse}(\hat Y \mid \hat X) + \hat\beta_{yx}^2\,\mathrm{mse}\,\hat X.$$

The chain structure can be followed up and stacked until we get to an auxiliary variable which is known for all units, e.g. administrative data.
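One way to read the chained error formula is as two imputation steps whose errors are added. The sketch below makes several assumptions that the slides do not state: $x$ and $y$ are observed on the same sample, only $z$ is known for the remaining units, weights are constant, $\sigma^2$ is estimated with $n - 1$ in the denominator, and $c_i$, $d_i$ are available for all units. The helper names (`fit`, `total_mse`, `chained_mse`) and the data are hypothetical.

```python
def fit(x, y, c):
    """Through-origin least squares with w_i = 1.
    Returns (beta_hat, sigma2_hat, estimated var of beta_hat)."""
    B = sum(xi ** 2 / ci for xi, ci in zip(x, c))
    beta = sum(xi * yi / ci for xi, yi, ci in zip(x, y, c)) / B
    sigma2 = sum((yi - beta * xi) ** 2 / ci
                 for xi, yi, ci in zip(x, y, c)) / (len(x) - 1)
    return beta, sigma2, sigma2 / B  # with w_i = 1, var(beta_hat) = sigma^2 / B


def total_mse(x_imp, c_imp, sigma2, var_beta):
    """X_imp^2 * var(beta_hat) + C_imp * sigma^2 for one imputation step."""
    return sum(x_imp) ** 2 * var_beta + sum(c_imp) * sigma2


def chained_mse(z_sam, x_sam, y_sam, d_sam, c_sam, z_imp, d_imp, c_imp):
    # Step 1: impute the unknown x_i from z_i and estimate mse(X_hat).
    b_xz, s2_xz, vb_xz = fit(z_sam, x_sam, d_sam)
    x_imp_hat = [zi * b_xz for zi in z_imp]
    mse_x = total_mse(z_imp, d_imp, s2_xz, vb_xz)
    # Step 2: impute y_i from the imputed x_i, treating X_hat as if it were X.
    b_yx, s2_yx, vb_yx = fit(x_sam, y_sam, c_sam)
    y_total = sum(y_sam) + b_yx * sum(x_imp_hat)
    mse_cond = total_mse(x_imp_hat, c_imp, s2_yx, vb_yx)
    # Chain formula: conditional mse plus beta_yx^2 * mse(X_hat).
    return y_total, mse_cond, mse_cond + b_yx ** 2 * mse_x


# Toy data (hypothetical): four sampled units, two units with only z known.
ones = [1.0] * 4
y_total, mse_cond, mse_chain = chained_mse(
    [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8], [4.0, 8.5, 12.0, 16.5],
    ones, ones, [2.0, 5.0], [1.0, 1.0], [1.0, 1.0])
```

The second term is never negative, so the chained error is at least as large as the error computed as if the imputed $x_i$ were exact.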

SLIDE 13

Stratification level shifts

Situation: The population is divided into strata (size class, NACE, region). There are several stratification levels, going from relatively small groups to larger ones. When there are not enough responding units to estimate $\beta$ in one stratum, we use the estimates from the corresponding higher-level stratum.

[Figure: stratification levels S2, S1, S0]
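The level shift can be pictured as a simple fallback rule. The sketch below is only an illustration: the threshold `min_resp`, the function `pick_stratum`, the respondent counts, and the ordering of the labels from finest (S2) to coarsest (S0) are all assumptions, since the slides do not state them.

```python
def pick_stratum(path, n_resp, min_resp=5):
    """path: stratum labels ordered from the finest to the coarsest level.
    Returns the first label with at least min_resp responding units."""
    for label in path:
        if n_resp.get(label, 0) >= min_resp:
            return label
    return path[-1]  # nothing qualifies: fall back to the coarsest level


# Hypothetical respondent counts per stratification level.
n_resp = {"S2": 2, "S1": 4, "S0": 30}
level_strict = pick_stratum(["S2", "S1", "S0"], n_resp, min_resp=5)
level_loose = pick_stratum(["S2", "S1", "S0"], n_resp, min_resp=3)
```

The slope $\hat\beta$ for a base stratum would then be estimated from the respondents of the chosen level.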

SLIDE 14

Stratification level shifts

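If the estimated total of the whole population divided into strata $m_1, \dots, m_K$ is $\hat Y = \sum_j \hat Y_{m_j}$, the mean square error is

$$\begin{aligned}
\mathrm{mse}\,\hat Y &= \mathrm{var}\,\hat Y_{imp} + \mathrm{var}\,Y_{imp} = \mathrm{var}\sum_j \hat Y^{imp}_{m_j} + \mathrm{var}\sum_j Y^{imp}_{m_j} \\
&= \sum_j \mathrm{var}\,\hat Y^{imp}_{m_j} + \sum_{j \neq k} \mathrm{cov}(\hat Y^{imp}_{m_j}, \hat Y^{imp}_{m_k}) + \sum_j \mathrm{var}\,Y^{imp}_{m_j}.
\end{aligned}$$

Both the variances of the estimated values and those of the real values can be computed with the methods above.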

SLIDE 15

Stratification level shifts - covariance computation

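Covariance computation: Let $m_1$ and $m_2$ be two basic strata, with $\hat\beta$ estimated from superstrata $S_1$ and $S_2$ respectively. Denote $S_d = S_1 \cap S_2$, which is the smaller of $S_1$ and $S_2$ if the stratification levels are well ordered. Denote $S = S_1 \cup S_2$, which is then the larger of both. Then

$$\begin{aligned}
\mathrm{cov}(\hat Y_{m_1}, \hat Y_{m_2}) &= \mathrm{cov}(X^{imp}_{m_1}\hat\beta_{S_1},\, X^{imp}_{m_2}\hat\beta_{S_2}) = X^{imp}_{m_1} X^{imp}_{m_2}\,\mathrm{cov}(\hat\beta_{S_1}, \hat\beta_{S_2}) \\
&= X^{imp}_{m_1} X^{imp}_{m_2}\,\mathrm{cov}\left(\frac{\sum_{S_1^{sam}} w_i x_i y_i / c_i}{\sum_{S_1^{sam}} w_i x_i^2 / c_i},\; \frac{\sum_{S_2^{sam}} w_i x_i y_i / c_i}{\sum_{S_2^{sam}} w_i x_i^2 / c_i}\right).
\end{aligned}$$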

SLIDE 16

Stratification level shifts - covariance computation

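The variables $y_i$ belonging to either $S_1$ or $S_2$ but not to $S_d$ are mutually independent, so only the sampled units of $S_d$ contribute to the covariance. Denote as $B_{S_1}$ and $B_{S_2}$ the sums in the denominators:

$$\begin{aligned}
\mathrm{cov}(\hat Y_{m_1}, \hat Y_{m_2}) &= \frac{X^{imp}_{m_1} X^{imp}_{m_2}}{B_{S_1} B_{S_2}}\,\mathrm{var}\left(\sum_{S_d^{sam}} w_i x_i y_i / c_i\right) \\
&= \frac{X^{imp}_{m_1} X^{imp}_{m_2}}{B_{S_1} B_{S_2}} \sum_{S_d^{sam}} \frac{w_i^2 x_i^2}{c_i^2}\,\mathrm{var}\,y_i \\
&= \frac{X^{imp}_{m_1} X^{imp}_{m_2}}{B_{S_1} B_{S_2}} \sum_{S_d^{sam}} \frac{w_i^2 x_i^2}{c_i}\,\sigma^2_{S_d} \\
&= X^{imp}_{m_1} X^{imp}_{m_2}\,\frac{B_{S_d}}{B_S}\,\sigma^2_{\beta_{S_d}}.
\end{aligned}$$

The last step uses $B_{S_1} B_{S_2} = B_{S_d} B_S$ and $\sigma^2_{\beta_{S_d}} = \sum_{S_d^{sam}} (w_i^2 x_i^2 / c_i)\,\sigma^2_{S_d} / B_{S_d}^2$.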

This way we can compute all the covariances between base strata and the mean square error of the whole sum.
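The covariance between two base strata can be sketched as below. The data layout (a dict of sampled units per superstratum), the function names `denominator` and `cov_totals`, and the numbers are hypothetical; the sketch also assumes that a unit carries the same weight in both superstrata and that one common `sigma2` applies to the shared units.

```python
def denominator(units):
    """B_S = sum over the sampled units of w_i * x_i^2 / c_i."""
    return sum(u["w"] * u["x"] ** 2 / u["c"] for u in units.values())


def cov_totals(X1_imp, X2_imp, S1, S2, sigma2):
    """cov(Y_hat_m1, Y_hat_m2) when the slopes come from superstrata S1 and S2.
    S1, S2: dicts mapping unit id -> {"w", "x", "c"} for the sampled units."""
    common = S1.keys() & S2.keys()  # sampled units of S_d, the intersection
    shared = sum(S1[i]["w"] ** 2 * S1[i]["x"] ** 2 / S1[i]["c"] for i in common)
    return X1_imp * X2_imp * shared * sigma2 / (denominator(S1) * denominator(S2))


# Toy nested superstrata (hypothetical): S1 is contained in S2, so S_d = S1.
S1 = {1: {"w": 1.0, "x": 2.0, "c": 2.0}, 2: {"w": 1.0, "x": 4.0, "c": 4.0}}
S2 = dict(S1)
S2[3] = {"w": 1.0, "x": 6.0, "c": 6.0}
cov = cov_totals(10.0, 20.0, S1, S2, sigma2=1.5)
```

Summing such covariances over all pairs of base strata, together with the per-stratum variances, gives the mean square error of the overall total.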

SLIDE 17

Stratification level shifts - chained imputations

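If we have a sophisticated stratification structure and chained imputations, we also need to compute the chained covariance. The covariances are computed with the help of conditional covariance decomposition:

$$\begin{aligned}
\mathrm{cov}(\hat Y_{m_1}, \hat Y_{m_2}) &= E\,\mathrm{cov}[\hat Y_{m_1}, \hat Y_{m_2} \mid X] + \mathrm{cov}(E[\hat Y_{m_1} \mid X],\, E[\hat Y_{m_2} \mid X]) \\
&= E\,\mathrm{cov}[\hat Y_{m_1}, \hat Y_{m_2} \mid X] + \beta_{S_1}\beta_{S_2}\,\mathrm{cov}(\hat X_{m_1}, \hat X_{m_2}).
\end{aligned}$$

Computing the mean of the first term with respect to $X$ would be rather difficult, so we substitute it with an estimate based on $\hat X$:

$$\widehat{\mathrm{cov}}(\hat Y_{m_1}, \hat Y_{m_2}) = \mathrm{cov}[\hat Y_{m_1}, \hat Y_{m_2} \mid \hat X] + \hat\beta_{S_1}\hat\beta_{S_2}\,\mathrm{cov}(\hat X_{m_1}, \hat X_{m_2}).$$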

SLIDE 18

Choosing the weights

If no stratification shifts are involved and no outliers are present, we can use $w_i \equiv 1$.

If we compute $\hat\beta$ from a superstratum $S$ consisting of basic strata $k = 1, \dots, K$, we can use $w_i \equiv N_k/n_k$ for units from stratum $k$. Data from the larger strata then influence the estimates more than data from the smaller strata.

If we apply some outlier-detection methods, we can use $w_i = 0$ for data which may not fit the model, so that they will not influence the estimates.
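The three weighting rules can be combined in one small helper. This is a sketch under assumptions: the function `make_weights`, the stratum labels and the counts are made up, and the set of outliers is taken as given because the slides do not specify a detection method.

```python
def make_weights(stratum_of, N, n, outliers=()):
    """w_i = N_k / n_k for a unit in stratum k, and w_i = 0 for flagged outliers.
    stratum_of: unit id -> stratum label; N, n: population and sample sizes."""
    return {i: 0.0 if i in outliers else N[k] / n[k]
            for i, k in stratum_of.items()}


# Hypothetical superstratum made of two basic strata.
stratum_of = {"a": "small", "b": "small", "c": "large"}
N = {"small": 10, "large": 100}
n = {"small": 2, "large": 1}

weights = make_weights(stratum_of, N, n)
weights_without_b = make_weights(stratum_of, N, n, outliers={"b"})
```

Setting `N[k] == n[k]` for every stratum reproduces the constant weights $w_i \equiv 1$ of the first case.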

SLIDE 19

Conclusions

The superpopulation model allows us to:
  • Estimate the target variable for each unit separately.
  • Report the estimated population totals with respect to any groupings of choice, regardless of the sampling plan.
  • Easily compute the mean square error of the estimated sums.
  • Develop methods of error computation in complex stratification and chaining structures.

Drawbacks:
  • The approach is model-based; the results may be inaccurate if the assumptions are not met, especially the linear dependence of $y_i$ on $x_i$ and the choice of $c_i$.
  • Auxiliary variables $x_i$ and $c_i$ must be available for all units.
