Current Trends in Small Area Estimation Research
Partha Lahiri
JPSM, University of Maryland, College Park, USA
Paper to be presented at Q2008, Rome, Italy, July 10, 2008
What is a Small Area?
- A subpopulation of interest, for which the sample size
is not adequate to produce reliable direct estimates.
- Example:
Geographic Region | Small Area
Nation | State
State | County, school district

Demographic Group | Small Domain
Broad group | Narrow groups by sex/race/ethnicity
Examples
- Survey of drug use in Nebraska, N = 4300. Boone
County has n = 14, and only 1 white female aged 25-44 was sampled.
- In SAIPE, about one-third of the counties are in the
sample.
- In NHANES III, a majority of US states have no
sample.
A Historical Note
- 11th century England and 17th century Canada
– Based on census or administrative records.
- Recent 3 decades
– Increasing demand for small area statistics, due to growing use in formulating policies and programs in the allocation of government funds and in regional planning.
Design Issues
Ref: Singh et al. (1994), Marker (2001), Rao (2003)
- Stratification – Use a large number of smaller strata
- Degree of Clustering – Minimize clustering
- Sample Allocation – Reallocate sample from large
planned domains to smaller planned domains
- Rolling samples (ACS), multiple frames
- In the Canadian LFS, max(CV) for UI regions was
reduced by about half using compromise allocation.
Planned Domains:
- Minimize a weighted sum of sampling variances of
direct small area estimators subject to fixed overall sample size. Ref: Longford (2006)
- Minimize total sample size (or cost) subject to desired
tolerances on the area sampling variances and on the aggregate sampling variance. Ref: Rao (2007)
- Achieve (approx.) equal RRMSE of GREG for the
planned domains subject to a fixed cost. Ref: Gabler, Ganninger, Münnich, and others
- Achieve equal RRMSE of EBP (or, the estimator to be
used) for the planned domains subject to a fixed cost.
However, “the client will always require more than specified at the design stage” (Fuller, 1999).
Issues in Small Area Estimation
1. Definition of small areas
2. Identification of relevant sources of information
3. Method of combining information
4. Small area estimates
5. Accuracy of the SAE method
6. Robust validation
7. Computer programming
8. Presentation of SAE statistics
Borrowing Strength:
- Relevant Source of Information
– Census data
– Administrative information
– Related surveys
- Method of Combining Information
– Choices of good small area models
– Use of a good statistical methodology
Synthetic Estimators
1944 Radio Listening Survey, Hansen, Hurwitz and Madow (1953, pp. 483-486): To estimate the median number of radio stations heard during the day for over 500 counties (small areas). The following explicit regression equation, based on data for 85 counties, was used:
$$\hat{y}_i = 0.52 + 0.74\,x_i,$$
where, for county $i$,
- $y_i$: estimate obtained from the personal interview survey
- $x_i$: estimate obtained from the mail survey
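As a minimal sketch, the fitted regression can be applied to any county for which the covariate $x_i$ is available. The function name and the example inputs are hypothetical; only the coefficients 0.52 and 0.74 come from the slide.

```python
def synthetic_estimate(x, intercept=0.52, slope=0.74):
    """Regression-synthetic prediction y_hat = intercept + slope * x."""
    return intercept + slope * x


# Hypothetical covariate values for three counties outside the 85-county sample.
x_values = [2.0, 3.5, 5.0]
predictions = [synthetic_estimate(x) for x in x_values]
```

The same coefficients are used for every county, which is what makes the estimator synthetic: no county-specific effect enters the prediction.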
County Crop Production (Stasny et al., 1991)
To estimate wheat production for each county of Kansas:
$$\hat{y}_{ij} = \hat{\beta}_0 + \hat{\beta}_1 x_{1ij} + \cdots + \hat{\beta}_p x_{pij},$$
where
- $y_{ij}$: wheat production of the $j$th farm in the $i$th county
- $x_{ij} = (1, x_{1ij}, \ldots, x_{pij})'$: a vector of auxiliary variables
- Regression-synthetic estimator:
$$\hat{Y}_i = \sum_j \hat{y}_{ij} = N_i \hat{\beta}_0 + X_{i1} \hat{\beta}_1 + \cdots + X_{ip} \hat{\beta}_p$$
- The total number of farms $N_i$ and the totals of the auxiliary variables $X_{il}$ ($l = 1, \ldots, p$) are known.
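The point of the regression-synthetic form is that only county totals are needed, not farm-level covariates. The sketch below checks this on made-up numbers; the farm data, the coefficients, and the function name are all hypothetical.

```python
def regression_synthetic_total(n_farms, x_totals, beta_hat):
    """Return N_i * beta_0 + sum_l X_il * beta_l for one county."""
    intercept, slopes = beta_hat[0], beta_hat[1:]
    return n_farms * intercept + sum(X * b for X, b in zip(x_totals, slopes))


# Hypothetical county with 3 farms and p = 2 auxiliary variables per farm.
farms = [(10.0, 1.0), (20.0, 0.0), (15.0, 2.0)]
beta_hat = [5.0, 2.0, -1.0]  # hypothetical fitted coefficients (beta_0, beta_1, beta_2)

# Summing the farm-level predictions ...
unit_sum = sum(beta_hat[0] + beta_hat[1] * x1 + beta_hat[2] * x2 for x1, x2 in farms)

# ... gives the same value as using the known county totals alone.
x_totals = [sum(f[0] for f in farms), sum(f[1] for f in farms)]
total = regression_synthetic_total(len(farms), x_totals, beta_hat)
```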
Ratio Adjustment:
$$\hat{Y}_{i,\mathrm{adj}} = \frac{\hat{Y}_i}{\sum_k \hat{Y}_k}\,\hat{Y},$$
where $\hat{Y}$ is the direct design-based estimate for the state from a large probability sample.
NCHS synthetic State estimates for health variables: assume homogeneity within carefully constructed post-strata.
More refined synthetic estimation: SPREE.
World Bank Method: Elbers et al. (2003), Haslett-Jones (2005)
Off-the-Shelf Methods: Schirm and Zaslavsky (1997)
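The ratio adjustment on this slide rescales the synthetic county estimates so that they add up to the direct state estimate. A minimal sketch with hypothetical numbers and a hypothetical function name:

```python
def ratio_adjust(synthetic_totals, state_direct_total):
    """Scale county estimates so that they sum to the direct state estimate."""
    total = sum(synthetic_totals)
    return [y * state_direct_total / total for y in synthetic_totals]


# Hypothetical synthetic county totals and a hypothetical direct state estimate.
county_synthetic = [40.0, 60.0, 100.0]
adjusted = ratio_adjust(county_synthetic, state_direct_total=220.0)
```

Each county keeps its share of the synthetic total; only the overall level changes.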
Basic Area Level Model
To estimate small area means $Y_i$ using direct design-based estimates $y_i$ and area level auxiliary variables $x_i$'s.
A Basic Area-Level Model:
Level 1: $\hat{\theta}_i = g(y_i) \sim \text{ind. } N(\theta_i, \psi_i)$
Level 2: $\theta_i = g(Y_i) \sim \text{ind. } N(x_i^T \beta, \tau^2)$
Fay and Herriot (1979): $g(Y_i) = \log(Y_i)$
Carter and Rolph (1974), Efron and Morris (1975): $g(Y_i) = \arcsin\sqrt{Y_i}$
SAIPE: $g(Y_i) = Y_i$ for state level estimation of the proportion of poor school-age children, and $g(Y_i) = \log(Y_i)$ for county level poverty counts of school-age children.
The model can be written as a simple linear mixed normal model:
$$\hat{\theta}_i = \theta_i + e_i = x_i^T \beta + v_i + e_i,$$
where
- $e_i$: sampling error; $e_i \sim \text{ind. } N(0, \psi_i)$
- $v_i$: area specific random effects; $v_i \sim \text{iid } N(0, \tau^2)$
Supplementary Information Used
- Per-Capita Income for the county
- Value of housing for the place
- Value of housing for the county
- IRS-adjusted gross income per exemption for the place
- IRS-adjusted gross income per exemption for the
county
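The BP:
$$\hat{\theta}_i^{BP} = E(\theta_i \mid \hat{\theta}_i; \beta, \tau^2) = x_i^T \beta + \gamma_i(\hat{\theta}_i - x_i^T \beta) = \gamma_i \hat{\theta}_i + (1 - \gamma_i) x_i^T \beta,$$
where
$$\gamma_i = \frac{\tau^2}{\tau^2 + \psi_i}.$$
EBP (or EBLUP):
$$\hat{\theta}_i^{EBP} = E(\theta_i \mid \hat{\theta}_i; \hat{\beta}, \hat{\tau}^2)$$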
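The BP is a weighted average of the direct estimate and the regression prediction, with more weight on the direct estimate when the sampling variance is small relative to the model variance. A minimal sketch, assuming the model parameters are given; the function name and the numbers are hypothetical. Plugging in estimates of the parameters gives the EBP.

```python
import numpy as np


def best_predictor(theta_hat, x, beta, tau2, psi):
    """Return (BP, gamma) with BP_i = gamma_i*theta_hat_i + (1-gamma_i)*x_i'beta."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    x = np.asarray(x, dtype=float)
    beta = np.asarray(beta, dtype=float)
    psi = np.asarray(psi, dtype=float)
    gamma = tau2 / (tau2 + psi)        # shrinkage weights, one per area
    regression_part = x @ beta         # x_i' beta for every area
    return gamma * theta_hat + (1.0 - gamma) * regression_part, gamma


# Hypothetical two-area example with an intercept-only model.
bp, gamma = best_predictor(
    theta_hat=[10.0, 20.0],
    x=[[1.0], [1.0]],
    beta=[12.0],
    tau2=4.0,
    psi=[4.0, 12.0],   # the second area has the noisier direct estimate
)
```

In this example the second area is pulled harder toward the regression value 12 because its sampling variance is larger.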
Different MSE of EBP:
(i) $E(\hat{\theta}_i^{EBP} - \theta_i)^2$
(ii) $E[(\hat{\theta}_i^{EBP} - \theta_i)^2 \mid \hat{\theta}_i]$
(iii) $E[(\hat{\theta}_i^{EBP} - \theta_i)^2 \mid \theta_i]$
(iv) $E[(\hat{\theta}_i^{EBP} - \theta_i)^2 \mid \hat{\theta}_i,\ i = 1, \ldots, m]$
- Most research has focused on estimating the unconditional MSE (i).
$$MSE(\hat{\theta}_i^{EBP}) \approx g_{1i}(\tau^2) + g_{2i}(\tau^2) + g_{3i}(\tau^2)$$
- $g_{1i}(\tau^2) = MSE(\hat{\theta}_i^{BP})$
- $g_{2i}(\tau^2)$: the extra variability due to the estimation of $\beta$
- $g_{3i}(\tau^2)$: the extra variability due to the estimation of $\tau^2$
Ref: Prasad and Rao (1990) and Datta and Lahiri (2000)
The terms $g_{2i}(\tau^2)$ and $g_{3i}(\tau^2)$ are of the same order, which is lower than that of the leading term $g_{1i}(\tau^2)$.
PR and DL obtained a second-order unbiased (or nearly unbiased) estimator of the unconditional MSE using the above approximation and correcting the bias of $g_{1i}(\hat{\tau}^2)$.
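For the basic area-level model the three terms have well-known closed forms, sketched below. The slides do not spell these out, so treat the details as assumptions: $g_{1i} = \gamma_i \psi_i$, $g_{2i} = (1-\gamma_i)^2 x_i^T [\sum_j x_j x_j^T / (\tau^2 + \psi_j)]^{-1} x_i$, and $g_{3i} = \psi_i^2 (\tau^2 + \psi_i)^{-3}\, \mathrm{Var}(\hat{\tau}^2)$, with the asymptotic variance $2 m^{-2} \sum_j (\tau^2 + \psi_j)^2$ of the Prasad-Rao moment estimator of $\tau^2$. Function names are hypothetical.

```python
import numpy as np


def mse_terms(x, psi, tau2):
    """Return (g1, g2, g3) for every area under the basic area-level model."""
    x = np.asarray(x, dtype=float)
    psi = np.asarray(psi, dtype=float)
    m = len(psi)
    v = tau2 + psi                          # marginal variances of theta_hat_i
    gamma = tau2 / v
    g1 = gamma * psi                        # MSE of the BP (all parameters known)
    info = (x / v[:, None]).T @ x           # sum_j x_j x_j' / (tau2 + psi_j)
    quad = np.einsum("ij,jk,ik->i", x, np.linalg.inv(info), x)
    g2 = (1.0 - gamma) ** 2 * quad          # cost of estimating beta
    var_tau2 = 2.0 * np.sum(v ** 2) / m ** 2   # assumed: moment estimator of tau2
    g3 = psi ** 2 / v ** 3 * var_tau2       # cost of estimating tau2
    return g1, g2, g3


def second_order_mse_estimate(x, psi, tau2_hat):
    """g1 + g2 + 2*g3 at tau2_hat; the extra g3 corrects the bias of g1(tau2_hat)."""
    g1, g2, g3 = mse_terms(x, psi, tau2_hat)
    return g1 + g2 + 2.0 * g3


# Hypothetical example: 4 areas, intercept only, equal sampling variances.
x_demo = np.ones((4, 1))
psi_demo = np.ones(4)
g1_demo, g2_demo, g3_demo = mse_terms(x_demo, psi_demo, tau2=1.0)
mse_demo = second_order_mse_estimate(x_demo, psi_demo, tau2_hat=1.0)
```

With only four areas the correction terms are large relative to $g_{1i}$; they shrink as the number of areas $m$ grows.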
Longford (2007): The PR MSE estimator did not perform well in estimating design-based MSE for the EURAREA project.
Zhang (2007): The PR MSE estimator, averaged over areas, tracks the average of design-based MSE for large m, if the model holds.
Different resampling methods [jackknife and parametric bootstrap] have been proposed by Butar and Lahiri (2003), Jiang and Lahiri (2002), Wan (2002), Hall and Maiti (2006), Pfeffermann and Glickman (2004), and Chatterjee and Lahiri (2007). Compared to the Taylor series method, they performed well in simulations; see Fabrizi et al. (2007) and Pereira and Pedro (2008).
Issues:
The method uses a simple model and results in an EBP which is design-consistent.
Normality: The EBP method is extendable to specified non-normal distributions for the sampling and random effects. For unspecified non-normality of the sampling and random effects, one can use EBLUP [Lahiri and Rao, 1995] or certain adaptive [Lahiri, 2002; Fabrizi and Trivisano, 2007] or linear EB [Ghosh and Lahiri, 1987; Cocchi and Mouchart] methods.
Known sampling variances $\psi_i$: GVF type methods are generally used. The method usually does not consider small area effects, and the uncertainty in estimating the sampling variances is not included in the EBP.
In some situations, standard estimates [REML, ML, ANOVA, etc.] of the model variance $\tau^2$ can be zero. When $\hat{\tau}^2$ is zero, the EBLUP reduces to the regression synthetic estimate. One way to avoid the problem is to use the ADM or AML estimates [Morris, 1987; Li and Lahiri, 2007].
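The zero-estimate problem is easy to reproduce. The sketch below uses one ANOVA-type choice, the Prasad-Rao moment estimator truncated at zero, as an assumed example of a "standard estimate"; the data and the function name are hypothetical. When the direct estimates vary less than their sampling variances would predict, the estimate is zero and every shrinkage weight $\gamma_i$ is zero. The ADM and AML alternatives are not implemented here.

```python
import numpy as np


def moment_tau2(theta_hat, x, psi):
    """Prasad-Rao type moment estimate of tau2, truncated at zero."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    x = np.asarray(x, dtype=float)
    psi = np.asarray(psi, dtype=float)
    m, p = x.shape
    beta_ols = np.linalg.lstsq(x, theta_hat, rcond=None)[0]
    resid = theta_hat - x @ beta_ols
    leverage = np.einsum("ij,jk,ik->i", x, np.linalg.inv(x.T @ x), x)
    raw = (np.sum(resid ** 2) - np.sum(psi * (1.0 - leverage))) / (m - p)
    return max(0.0, float(raw))


x_demo = np.ones((4, 1))
psi_demo = np.ones(4)

# Direct estimates that barely differ across areas: the raw estimate is negative.
tau2_zero = moment_tau2([1.0, 1.1, 0.9, 1.0], x_demo, psi_demo)
gamma_zero = tau2_zero / (tau2_zero + psi_demo)   # all zeros: pure synthetic estimate

# Direct estimates with large between-area variation: the estimate is positive.
tau2_pos = moment_tau2([0.0, 10.0, 0.0, 10.0], x_demo, psi_demo)
```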
A simple back transformation is often used to obtain the estimate of $Y_i$. The optimum property of the BP is lost by such a back transformation.
The BP of $Y_i = g^{-1}(\theta_i)$:
$$E\left(g^{-1}(\theta_i) \mid \hat{\theta}_i; \beta, \tau^2\right)$$
An EBP of $Y_i$:
$$E\left(g^{-1}(\theta_i) \mid \hat{\theta}_i; \hat{\beta}, \hat{\tau}^2\right)$$
The rationale behind the transformation $g(\cdot)$ rests on the Taylor series argument, and the transformation is used primarily to stabilize the variance. A direct modeling of the direct estimates is possible, but this is likely to lead to a non-linear non-normal mixed model.
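As a worked illustration under the normal area-level model, take the Fay-Herriot choice $g(Y_i) = \log(Y_i)$, so that $g^{-1}(\theta_i) = \exp(\theta_i)$. With $\beta$ and $\tau^2$ known, $\theta_i$ given $\hat{\theta}_i$ is normal with mean $\hat{\theta}_i^{BP}$ and variance $\gamma_i \psi_i$, and the lognormal mean formula gives the BP of $Y_i$. This is a sketch of a standard calculation, not a formula taken from the slides:

```latex
\begin{align*}
\theta_i \mid \hat{\theta}_i &\sim N\!\left(\hat{\theta}_i^{BP},\ \gamma_i \psi_i\right),
  \qquad \gamma_i = \frac{\tau^2}{\tau^2 + \psi_i}, \\
E\!\left(e^{\theta_i} \mid \hat{\theta}_i; \beta, \tau^2\right)
  &= \exp\!\left(\hat{\theta}_i^{BP} + \tfrac{1}{2}\gamma_i \psi_i\right)
  \;>\; \exp\!\left(\hat{\theta}_i^{BP}\right).
\end{align*}
```

The simple back transformation $\exp(\hat{\theta}_i^{BP})$ omits the factor $\exp(\gamma_i \psi_i / 2)$, so it is systematically too small whenever $\tau^2 > 0$.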
Confidence Interval:
The intuitive interval [Cox, 1976]
$$\hat{\theta}_i^{EBP} \pm 1.96 \sqrt{g_{1i}(\hat{\tau}^2)}$$
has an undercoverage problem. The correction
$$\hat{\theta}_i^{EBP} \pm 1.96 \sqrt{mse_i^{PR}}$$
does not solve the problem: it has either an undercoverage or an overcoverage problem.
Parametric bootstrap interval:
$$\left(\hat{\theta}_i^{EBP} - L \sqrt{g_{1i}(\hat{\tau}^2)},\ \hat{\theta}_i^{EBP} - U \sqrt{g_{1i}(\hat{\tau}^2)}\right),$$
where L and U are obtained from the parametric bootstrap histogram of
$$\frac{\hat{\theta}_i^{*EBP} - \theta_i^*}{\sqrt{g_{1i}(\hat{\tau}^{*2})}}$$
[Ref: Chatterjee, Lahiri and Li, 2008].
Hall and Maiti (2006) have an alternative parametric bootstrap method, but the method is synthetic (Rao, 2005).
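A parametric bootstrap interval of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: $\tau^2$ is estimated by a truncated moment estimator with a small positive floor (so the pivot is always defined), $\beta$ by weighted least squares, and the pivot is taken as $(\theta_i^* - \hat{\theta}_i^{*EBP}) / \sqrt{g_{1i}(\hat{\tau}^{*2})}$, so its quantiles are added to the EBP. With the opposite sign of the pivot, the quantiles would be subtracted instead. All function names are hypothetical.

```python
import numpy as np


def fit_area_model(theta_hat, x, psi, floor=1e-4):
    """Return (beta_hat, tau2_hat, EBP, g1) for the basic area-level model."""
    m, p = x.shape
    beta_ols = np.linalg.lstsq(x, theta_hat, rcond=None)[0]
    resid = theta_hat - x @ beta_ols
    leverage = np.einsum("ij,jk,ik->i", x, np.linalg.inv(x.T @ x), x)
    raw = (resid @ resid - np.sum(psi * (1.0 - leverage))) / (m - p)
    tau2 = max(floor, float(raw))            # assumed floor keeps g1 > 0
    w = 1.0 / (tau2 + psi)
    xw = x * w[:, None]
    beta = np.linalg.solve(xw.T @ x, xw.T @ theta_hat)   # weighted least squares
    gamma = tau2 * w
    ebp = gamma * theta_hat + (1.0 - gamma) * (x @ beta)
    return beta, tau2, ebp, gamma * psi


def bootstrap_interval(theta_hat, x, psi, n_boot=500, level=0.95, seed=0):
    """Parametric bootstrap interval for each theta_i, calibrated by a pivot."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    x = np.asarray(x, dtype=float)
    psi = np.asarray(psi, dtype=float)
    rng = np.random.default_rng(seed)
    beta, tau2, ebp, g1 = fit_area_model(theta_hat, x, psi)
    m = len(theta_hat)
    pivots = np.empty((n_boot, m))
    for b in range(n_boot):
        # Generate bootstrap truths and direct estimates from the fitted model.
        theta_star = x @ beta + rng.normal(0.0, np.sqrt(tau2), m)
        theta_hat_star = theta_star + rng.normal(0.0, np.sqrt(psi), m)
        _, _, ebp_star, g1_star = fit_area_model(theta_hat_star, x, psi)
        pivots[b] = (theta_star - ebp_star) / np.sqrt(g1_star)
    alpha = 1.0 - level
    q_lo, q_hi = np.quantile(pivots, [alpha / 2.0, 1.0 - alpha / 2.0], axis=0)
    return ebp + q_lo * np.sqrt(g1), ebp + q_hi * np.sqrt(g1)
```

Because the model is refitted in every replicate, the bootstrap quantiles reflect the uncertainty in $\hat{\beta}$ and $\hat{\tau}^2$, which the fixed multiplier 1.96 ignores.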
Estimation of Small Area Proportions: Two Basic Area Models
Ref: Liu, Lahiri and Kalton (2007)
Model 1:
Level 1: $p_{iw} \mid P_i \sim \text{ind } N\left(P_i, \frac{P_i(1 - P_i)}{n_i}\,\mathrm{deff}_i\right)$
Level 2: $\text{logit}(P_i) \sim \text{ind } N(x_i^T \beta, \tau^2)$
Model 2:
Level 1: $p_{iw} \mid P_i \sim \text{ind Beta}\left(P_i, \frac{P_i(1 - P_i)}{n_i}\,\mathrm{deff}_i\right)$
Level 2: $\text{logit}(P_i) \sim \text{ind } N(x_i^T \beta, \tau^2)$
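To simulate from or fit Model 2 with standard software, the Beta distribution has to be expressed through its two shape parameters. Assuming the two arguments of Beta(., .) on the slide are the mean and the variance, matching moments gives shapes $a = P_i k$ and $b = (1 - P_i) k$ with $k = n_i / \mathrm{deff}_i - 1$. The function name and the numbers below are hypothetical.

```python
def beta_shapes(P, n, deff):
    """Shape parameters (a, b) of a Beta with mean P and variance P(1-P)*deff/n."""
    k = n / deff - 1.0   # effective sample size minus one
    if k <= 0:
        raise ValueError("need n > deff for a valid Beta distribution")
    return P * k, (1.0 - P) * k


# Hypothetical area: true proportion 0.2, sample size 100, design effect 2.
a, b = beta_shapes(P=0.2, n=100, deff=2.0)

# Check the implied mean and variance against the targets.
implied_mean = a / (a + b)
implied_var = a * b / ((a + b) ** 2 * (a + b + 1.0))
```

The requirement $n_i > \mathrm{deff}_i$ shows where Model 2 breaks down: with an effective sample size of one or less, no Beta distribution has the stated variance.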
Hierarchical Bayes method [using MCMC] is more straightforward than the corresponding EBP method. However, the method requires careful specification of the prior [usually non-informative] for the hyperparameters $\beta$ and $\tau^2$. Datta et al. (2005) and Ganesh and Lahiri (2005) proposed priors that have good frequentist properties.
Other Issues:
Unit Level Models: Battese, Harter and Fuller (1988), Kott (1989), Prasad and Rao (1999), You and Rao (2002), Hall and Maiti (2006), Jiang and Lahiri (2006), Lahiri and Mukherjee (2007)
Robust Estimation: Sinha and Rao (2007) and a number of papers by Chambers, Tzavidis, Pratesi, Ranalli, Salvati
Informative Sampling: Pfeffermann and Sverchkov (2007)
Model Selection: Meza and Lahiri (2005), and a number of papers by Jiang, J.S. Rao and others.
Spatial Modeling: Pratesi, Salvati and Molina (2007), Zhou and You (2007), Ganesh (2007)
Benchmarking: Rao (2003, sec 7.2), You, Rao and Dick (2004)
[Fig. 2: Player Prediction; series shown: True, Direct, Reg, EM, EMReg]
[Fig. 3: Plot of proposed estimates and the corresponding preliminary and final weighted link relatives]