 
Using the Superpopulation Model for Imputations and Variance Computation in Survey Sampling
Petr Novák, Václav Kosina
Czech Statistical Office
Introduction

Situation: Consider a population of $N$ units, of which $n$ are sampled (sam) and $N-n$ are unknown (imp). We want to estimate the population total $Y = \sum_{i=1}^{N} y_i$.

Model assumptions:
- $y_i = \beta x_i + \epsilon_i$,
- the $\epsilon_i$ are independent random variables with $E\epsilon_i = 0$ and $\mathrm{var}\,\epsilon_i = c_i\sigma^2$,
- $x_i$ and $c_i$ are known constants for all $i = 1, \dots, N$,
- $\beta$ and $\sigma^2$ are unknown parameters.
Imputation

Estimation: Estimate $\beta$ from the sampled part using the (weighted) least squares method:
$$\hat\beta = \frac{\sum_{sam} w_i x_i y_i / c_i}{\sum_{sam} w_i x_i^2 / c_i},$$
where the $w_i$ are some appropriate weights.

Note: constant weights and $c_i = x_i$ give $\hat\beta = \frac{\sum_{sam} y_i}{\sum_{sam} x_i}$.

Data imputation: For each unit from the unknown part we impute $\hat y_i = x_i\hat\beta$. The estimate of the population total is then
$$\hat Y = \sum_{sam} y_i + \sum_{imp} \hat y_i.$$
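The estimation and imputation steps above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the toy numbers are assumptions made for the example and do not come from the slides.

```python
def estimate_beta(x, y, c, w):
    """Weighted least squares slope for the model y_i = beta * x_i + eps_i."""
    num = sum(wi * xi * yi / ci for xi, yi, ci, wi in zip(x, y, c, w))
    den = sum(wi * xi * xi / ci for xi, ci, wi in zip(x, c, w))
    return num / den


def impute_total(x_sam, y_sam, c_sam, w_sam, x_imp):
    """Return (estimated total, beta_hat): observed y plus imputed x_i * beta_hat."""
    beta_hat = estimate_beta(x_sam, y_sam, c_sam, w_sam)
    y_imputed = [xi * beta_hat for xi in x_imp]
    return sum(y_sam) + sum(y_imputed), beta_hat


# Toy data (hypothetical): constant weights and c_i = x_i,
# so beta_hat reduces to sum(y) / sum(x).
x_sam = [1.0, 2.0, 3.0]
y_sam = [2.0, 5.0, 5.0]
x_imp = [4.0, 5.0]
total, beta_hat = impute_total(x_sam, y_sam, x_sam, [1.0] * 3, x_imp)
print(beta_hat, total)
```

With these toy values the slope equals the ratio 12 / 6 = 2, and the estimated total is 12 + 2 * 9 = 30.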
Differences from classic techniques

Classic reweighting approach:
- The $y_i$ are treated as constants.
- Randomness enters through the sample inclusion indicators.
- The error is computed through $\mathrm{var}\,\hat Y$.

Superpopulation model approach:
- The $y_i$ are treated as random variables.
- The real $y_i$ from the imputed part are predicted with $\hat y_i = x_i\hat\beta$.
- The error is computed through $\mathrm{mse}\,\hat Y = E(\hat Y - Y)^2$.
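Error computation

The least squares estimator is unbiased ($E\hat\beta = \beta$). Therefore
$$E\hat y_i = E x_i\hat\beta = x_i\beta = E y_i.$$
Writing $\hat Y_{imp} = \sum_{imp}\hat y_i$ and $Y_{imp} = \sum_{imp} y_i$, the mean square error of the prediction is then
$$\begin{aligned}
\mathrm{mse}\,\hat Y &= E(\hat Y - Y)^2 = E(\hat Y_{imp} - Y_{imp})^2 \\
&= E(\hat Y_{imp} - E\hat Y_{imp} - Y_{imp} + EY_{imp})^2 \\
&= E(\hat Y_{imp} - E\hat Y_{imp})^2 + E(Y_{imp} - EY_{imp})^2 - 2E(\hat Y_{imp} - E\hat Y_{imp})(Y_{imp} - EY_{imp}) \\
&= \mathrm{var}\,\hat Y_{imp} + \mathrm{var}\,Y_{imp}.
\end{aligned}$$
The cross term vanishes because $\hat Y_{imp}$ depends only on the sampled $y_i$ and $Y_{imp}$ only on the unsampled ones, which are independent under the model.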
Variance computation

The variance of the estimated values is
$$\mathrm{var}\,\hat Y_{imp} = \mathrm{var}(X_{imp}\hat\beta) = X_{imp}^2\,\mathrm{var}\,\hat\beta = X_{imp}^2\,\frac{\sum_{sam} w_i^2 x_i^2 / c_i}{\left(\sum_{sam} w_i x_i^2 / c_i\right)^2}\,\sigma^2,$$
where $X_{imp} = \sum_{imp} x_i$. We denote $\mathrm{var}\,\hat\beta$ as $\sigma_\beta^2$.

The variance of the predicted real values is
$$\mathrm{var}\,Y_{imp} = \sum_{imp} c_i\sigma^2.$$
Denote $C_{imp} := \sum_{imp} c_i$. We get
$$\mathrm{mse}\,\hat Y = X_{imp}^2\sigma_\beta^2 + C_{imp}\sigma^2.$$

Possible estimators for $\sigma^2$:
$$\frac{1}{n-1}\sum_{sam}\frac{(y_i - \hat\beta x_i)^2}{c_i}, \qquad \frac{1}{\sum_{sam} w_i - \bar w}\sum_{sam}\frac{w_i (y_i - \hat\beta x_i)^2}{c_i}.$$
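The variance formulas above translate directly into code. The following Python sketch uses the first (unweighted) estimator of $\sigma^2$; the function names and the toy inputs are illustrative assumptions.

```python
def sigma2_unweighted(x, y, c, beta_hat):
    """First estimator of sigma^2: scaled squared residuals divided by n - 1."""
    n = len(x)
    rss = sum((yi - beta_hat * xi) ** 2 / ci for xi, yi, ci in zip(x, y, c))
    return rss / (n - 1)


def mse_total(x_sam, c_sam, w_sam, x_imp, c_imp, sigma2):
    """mse(Y_hat) = X_imp^2 * var(beta_hat) + C_imp * sigma^2."""
    a = sum(wi ** 2 * xi ** 2 / ci for xi, ci, wi in zip(x_sam, c_sam, w_sam))
    b = sum(wi * xi ** 2 / ci for xi, ci, wi in zip(x_sam, c_sam, w_sam))
    var_beta = a / b ** 2 * sigma2      # sigma_beta^2
    big_x_imp = sum(x_imp)              # X_imp
    big_c_imp = sum(c_imp)              # C_imp
    return big_x_imp ** 2 * var_beta + big_c_imp * sigma2


# Toy data (hypothetical): c_i = x_i and unit weights.
x_sam, y_sam = [1.0, 2.0, 3.0], [2.0, 5.0, 5.0]
x_imp = [4.0, 5.0]
s2 = sigma2_unweighted(x_sam, y_sam, x_sam, beta_hat=2.0)
print(mse_total(x_sam, x_sam, [1.0] * 3, x_imp, x_imp, s2))
```

In practice the value of `beta_hat` would come from the least squares fit on the same sample; it is passed in here to keep the sketch self-contained.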
Special cases

If $w_i \equiv \mathrm{const.}$ and $c_i = x_i$, we get
$$\sigma_\beta^2 = \frac{1}{X_{sam}}\sigma^2$$
and therefore
$$\mathrm{mse}\,\hat Y = \frac{X_{imp}^2}{X_{sam}}\sigma^2 + X_{imp}\sigma^2 = \frac{X_{imp} X_{all}}{X_{sam}}\sigma^2,$$
where $X_{sam} = \sum_{sam} x_i$ and $X_{all} = X_{sam} + X_{imp}$.

If we have no auxiliary information available and set $x_i \equiv 1$, we impute the sample mean for each unit. We then get the commonly used formula
$$\mathrm{mse}\,\hat Y = \frac{(N-n)N}{n}\sigma^2 = N^2\left(1 - \frac{n}{N}\right)\frac{\sigma^2}{n}.$$
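As a quick numerical check of the two special cases, the Python sketch below evaluates both closed forms and confirms that they agree when $x_i \equiv 1$. The function names and the numbers are assumptions chosen for illustration.

```python
def mse_ratio_case(x_sam, x_imp, sigma2):
    """Constant weights and c_i = x_i: mse = X_imp * X_all / X_sam * sigma^2."""
    big_x_sam = sum(x_sam)
    big_x_imp = sum(x_imp)
    return big_x_imp * (big_x_sam + big_x_imp) / big_x_sam * sigma2


def mse_mean_imputation(N, n, sigma2):
    """No auxiliary information (x_i = 1): mse = N^2 * (1 - n/N) * sigma^2 / n."""
    return N ** 2 * (1 - n / N) * sigma2 / n


# Hypothetical population: N = 10 units, n = 4 sampled, sigma^2 = 2.
N, n, sigma2 = 10, 4, 2.0
ratio_form = mse_ratio_case([1.0] * n, [1.0] * (N - n), sigma2)
mean_form = mse_mean_imputation(N, n, sigma2)
print(ratio_form, mean_form)
```

Both expressions evaluate to (N - n) * N / n * sigma^2, which is 30 for these inputs up to floating-point rounding.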
Chain imputation

Situation: $x_i$ is not known, but is estimated from $z_i$.

Model:
$$y_i \mid x_i \sim (x_i\beta_{yx},\ c_i\sigma_{yx}^2), \qquad x_i \sim (z_i\beta_{xz},\ d_i\sigma_{xz}^2).$$

With the help of the conditional variance decomposition we get
$$\begin{aligned}
\mathrm{mse}(\hat Y) &= \mathrm{var}\,\hat Y_{imp} + \mathrm{var}\,Y_{imp} \\
&= E\,\mathrm{var}[\hat Y_{imp} \mid X] + \mathrm{var}\,E[\hat Y_{imp} \mid X] + E\,\mathrm{var}[Y_{imp} \mid X] + \mathrm{var}\,E[Y_{imp} \mid X] \\
&\ \ \vdots \\
&= E\,\mathrm{mse}(\hat Y \mid X) + \beta_{yx}^2\,\mathrm{mse}(\hat X).
\end{aligned}$$
Chain imputation

Estimated error:
$$\widehat{\mathrm{mse}}\,\hat Y = \widehat{\mathrm{mse}}(\hat Y \mid \hat X) + \hat\beta_{yx}^2\,\widehat{\mathrm{mse}}\,\hat X.$$

The chain structure can be followed up and stacked until we reach an auxiliary variable that is known for all units, e.g. administrative data.
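The stacking of chained errors can be illustrated with a short recursion. In this Python sketch each stage is described by its conditional mse and its estimated slope; the function name, the stage representation and the numbers are assumptions made for the example, and the conditional mse values are taken as given rather than computed.

```python
def chained_mse(stages):
    """Stack chained imputation errors.

    `stages` is ordered from the variable imputed directly from fully known
    data outward to the target variable. Each entry is a pair
    (conditional_mse, beta_hat), where beta_hat is the slope linking this
    variable to the previous one in the chain.
    """
    mse = 0.0  # the fully known auxiliary variable carries no imputation error
    for conditional_mse, beta_hat in stages:
        mse = conditional_mse + beta_hat ** 2 * mse
    return mse


# Hypothetical two-stage chain: x imputed from known z, then y imputed from x.
stages = [
    (4.0, 0.5),   # mse(X_hat | Z) and beta_xz (the slope is unused, since mse(Z) = 0)
    (10.0, 2.0),  # mse(Y_hat | X_hat) and beta_yx
]
print(chained_mse(stages))
```

For these inputs the result is 10 + 2^2 * 4 = 26. A longer chain only adds further entries at the front of the list.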
Stratification level shifts

Situation:
- The population is divided into strata (size class, NACE, region).
- There are several stratification levels, going from relatively small groups to larger ones.
- When there are not enough responding units to estimate $\beta$ in one stratum, we use the estimate from the corresponding higher-level stratum.

[Figure: stratification levels S0, S1 and S2.]
Stratification level shifts

If the estimated total of the whole population, divided into strata $m_1, \dots, m_K$, is
$$\hat Y = \sum_j \hat Y_{m_j},$$
the mean square error is
$$\begin{aligned}
\mathrm{mse}\,\hat Y &= \mathrm{var}\,\hat Y_{imp} + \mathrm{var}\,Y_{imp} \\
&= \mathrm{var}\sum_j \hat Y^{imp}_{m_j} + \mathrm{var}\sum_j Y^{imp}_{m_j} \\
&= \sum_j \mathrm{var}\,\hat Y^{imp}_{m_j} + \sum_{j \neq k} \mathrm{cov}(\hat Y^{imp}_{m_j}, \hat Y^{imp}_{m_k}) + \sum_j \mathrm{var}\,Y^{imp}_{m_j}.
\end{aligned}$$
The variances of both the estimated and the real values can be computed with the methods from above.
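Once the per-stratum variances and the pairwise covariances are available, the total mean square error is a plain sum. The Python sketch below assumes they are supplied as lists and a square matrix; the function name, the data layout and the numbers are illustrative assumptions.

```python
def total_mse(var_estimated, cov_estimated, var_real):
    """Combine per-stratum terms into the mse of the overall total.

    var_estimated[j]    : var of the imputed total in stratum j
    cov_estimated[j][k] : cov between imputed totals of strata j and k
                          (diagonal entries are ignored)
    var_real[j]         : var of the real unobserved total in stratum j
    """
    k = len(var_estimated)
    cov_sum = sum(
        cov_estimated[i][j] for i in range(k) for j in range(k) if i != j
    )
    return sum(var_estimated) + cov_sum + sum(var_real)


# Hypothetical values for two strata sharing a superstratum estimate.
var_estimated = [1.0, 2.0]
cov_estimated = [[0.0, 0.5],
                 [0.5, 0.0]]
var_real = [3.0, 4.0]
print(total_mse(var_estimated, cov_estimated, var_real))
```

Note that each unordered pair of strata contributes twice, once as (j, k) and once as (k, j), matching the sum over all $j \neq k$.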
Stratification level shifts - covariance computation

Let $m_1$ and $m_2$ be two basic strata, with $\hat\beta$ estimated from superstrata $S_1$ and $S_2$ respectively.

Denote $S_d = S_1 \cap S_2$, which is the smaller of $S_1$ and $S_2$ if the stratification levels are well ordered. Denote $S = S_1 \cup S_2$, which is then the larger of the two.

Then
$$\begin{aligned}
\mathrm{cov}(\hat Y_{m_1}, \hat Y_{m_2}) &= \mathrm{cov}(X^{imp}_{m_1}\hat\beta_{S_1},\ X^{imp}_{m_2}\hat\beta_{S_2}) = X^{imp}_{m_1} X^{imp}_{m_2}\,\mathrm{cov}(\hat\beta_{S_1}, \hat\beta_{S_2}) \\
&= X^{imp}_{m_1} X^{imp}_{m_2}\,\mathrm{cov}\!\left(\frac{\sum_{S_1^{sam}} w_i x_i y_i / c_i}{\sum_{S_1^{sam}} w_i x_i^2 / c_i},\ \frac{\sum_{S_2^{sam}} w_i x_i y_i / c_i}{\sum_{S_2^{sam}} w_i x_i^2 / c_i}\right).
\end{aligned}$$
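Stratification level shifts - covariance computation

The variables $y_i$ belonging to either $S_1$ or $S_2$ but not to $S_d$ are mutually independent, so only the units in $S_d$ contribute to the covariance. Denote as $B_{S_1}$ and $B_{S_2}$ the sums in the denominators:
$$\begin{aligned}
\mathrm{cov}(\hat Y_{m_1}, \hat Y_{m_2}) &= \frac{X^{imp}_{m_1} X^{imp}_{m_2}}{B_{S_1} B_{S_2}}\,\mathrm{var}\!\left(\sum_{S_d^{sam}} w_i x_i y_i / c_i\right) \\
&= \frac{X^{imp}_{m_1} X^{imp}_{m_2}}{B_{S_1} B_{S_2}} \sum_{S_d^{sam}} w_i^2 x_i^2 / c_i^2\ \mathrm{var}\,y_i \\
&= \frac{X^{imp}_{m_1} X^{imp}_{m_2}}{B_{S_1} B_{S_2}} \sum_{S_d^{sam}} w_i^2 x_i^2 / c_i\ \sigma^2 \\
&= X^{imp}_{m_1} X^{imp}_{m_2}\,\frac{B_{S_d}}{B_S}\,\sigma^2_{\beta_{S_d}}.
\end{aligned}$$
The last step uses $B_{S_1} B_{S_2} = B_{S_d} B_S$ and $\sigma^2_{\beta_{S_d}} = \sum_{S_d^{sam}} w_i^2 x_i^2 / c_i\ \sigma^2 / B_{S_d}^2$.

This way we can compute all the covariances between base strata and the mean square error of the whole sum.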
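The covariance formula can be checked numerically. The following Python sketch computes it in both forms, once from the sum over the common sample and once through $B_{S_d}/B_S$ and $\sigma^2_{\beta_{S_d}}$. All names and numbers are assumptions for illustration, and the same weights $w_i$ are assumed in both superstrata.

```python
def b_sum(x, c, w):
    """Denominator sum B = sum of w_i * x_i^2 / c_i over a sample."""
    return sum(wi * xi ** 2 / ci for xi, ci, wi in zip(x, c, w))


def a_sum(x, c, w):
    """Numerator sum A = sum of w_i^2 * x_i^2 / c_i over a sample."""
    return sum(wi ** 2 * xi ** 2 / ci for xi, ci, wi in zip(x, c, w))


def strata_covariance(x_imp_m1, x_imp_m2, a_sd, b_s1, b_s2, sigma2):
    """cov(Y_hat_m1, Y_hat_m2) from the sum over the common sample S_d."""
    return x_imp_m1 * x_imp_m2 * a_sd * sigma2 / (b_s1 * b_s2)


# Hypothetical nested superstrata: S_d is a subset of S, with c_i = x_i, w_i = 1.
x_sd = [1.0, 2.0]            # sampled x in the smaller superstratum S_d
x_s = x_sd + [3.0]           # sampled x in the larger superstratum S
w_sd, w_s = [1.0] * 2, [1.0] * 3
sigma2 = 2.0

b_sd = b_sum(x_sd, x_sd, w_sd)
b_s = b_sum(x_s, x_s, w_s)
a_sd = a_sum(x_sd, x_sd, w_sd)

direct = strata_covariance(4.0, 5.0, a_sd, b_sd, b_s, sigma2)
var_beta_sd = a_sd / b_sd ** 2 * sigma2          # sigma^2_{beta_{S_d}}
compact = 4.0 * 5.0 * b_sd / b_s * var_beta_sd   # final line of the derivation
print(direct, compact)
```

Both values agree up to floating-point rounding, which mirrors the last step of the derivation.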
Stratification level shifts - chained imputations

If we have a sophisticated stratification structure together with chained imputations, we also need to compute the chained covariance.

The covariances are computed with the help of the conditional covariance decomposition:
$$\begin{aligned}
\mathrm{cov}(\hat Y_{m_1}, \hat Y_{m_2}) &= E\,\mathrm{cov}[\hat Y_{m_1}, \hat Y_{m_2} \mid X] + \mathrm{cov}(E[\hat Y_{m_1} \mid X],\ E[\hat Y_{m_2} \mid X]) \\
&= E\,\mathrm{cov}[\hat Y_{m_1}, \hat Y_{m_2} \mid X] + \beta_{S_1}\beta_{S_2}\,\mathrm{cov}(\hat X_{m_1}, \hat X_{m_2}).
\end{aligned}$$

Computing the mean of the first term with respect to $X$ would be rather difficult, so we substitute it with an estimate based on $\hat X$:
$$\widehat{\mathrm{cov}}(\hat Y_{m_1}, \hat Y_{m_2}) = \widehat{\mathrm{cov}}[\hat Y_{m_1}, \hat Y_{m_2} \mid \hat X] + \hat\beta_{S_1}\hat\beta_{S_2}\,\widehat{\mathrm{cov}}(\hat X_{m_1}, \hat X_{m_2}).$$
Choosing the weights

- If no stratification shifts are involved and no outliers are present, we can use $w_i \equiv 1$.
- If we compute $\hat\beta$ from a superstratum $S$ consisting of basic strata $k = 1, \dots, K$, we can use $w_i \equiv N_k / n_k$ for units from stratum $k$. Data from the larger strata then influence the estimates more than data from the smaller strata.
- If we apply some outlier-detection method, we can use $w_i = 0$ for data which may not fit the model, so that they do not influence the estimates.
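The weighting rules above can be combined in one helper. This Python sketch is a minimal illustration: the function name, the argument layout and the choice to count flagged outliers in $n_k$ are assumptions, not something the slides specify.

```python
from collections import Counter


def choose_weights(strata, population_sizes, is_outlier=None):
    """Return w_i = N_k / n_k for a unit in stratum k, or 0 for flagged outliers.

    strata           : stratum label of each sampled unit
    population_sizes : mapping from stratum label to population size N_k
    is_outlier       : optional flags from some outlier-detection step

    Assumption: n_k counts every sampled unit in stratum k, including
    units that are flagged as outliers.
    """
    n_k = Counter(strata)
    if is_outlier is None:
        is_outlier = [False] * len(strata)
    return [
        0.0 if flagged else population_sizes[k] / n_k[k]
        for k, flagged in zip(strata, is_outlier)
    ]


# Hypothetical sample: two units from stratum "a" (one flagged), one from "b".
weights = choose_weights(["a", "a", "b"], {"a": 10, "b": 6}, [False, True, False])
print(weights)
```

Using $w_i \equiv 1$ corresponds to skipping this helper altogether and passing a list of ones to the estimator.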