Current Trends in Small Area Estimation Research Partha Lahiri - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Current Trends in Small Area Estimation Research

Partha Lahiri JPSM, University of Maryland, College Park, USA Paper to be presented at Q2008, Rome, Italy, July 10, 2008

slide-2
SLIDE 2

What is a Small Area?

  • A subpopulation of interest, for which the sample size is not adequate to produce reliable direct estimates.
  • Example:

Geographic Region    Small Area
Nation               State
State                County, school district

Demographic Group    Small Domain
Broad group          Narrow groups by sex/race/ethnicity

slide-3
SLIDE 3

Examples

  • Survey of drug use in Nebraska, N = 4300. Boone County has n = 14, and only one white female aged 25-44 was sampled.
  • In SAIPE, about one-third of the counties are in the sample.
  • In NHANES III, a majority of US states have no sample.

slide-4
SLIDE 4

A Historical Note

  • 11th century England and 17th century Canada
    – Based on census or administrative records.
  • Past three decades
    – Increasing demand for small area statistics, due to their growing use in formulating policies and programs, in the allocation of government funds, and in regional planning.

slide-5
SLIDE 5

Design Issues

Ref: Singh et al. (1994), Marker (2001), Rao (2003)

  • Stratification – Use a large number of smaller strata
  • Degree of Clustering – Minimize clustering
  • Sample Allocation – Reallocate sample from large planned domains to smaller planned domains
  • Rolling samples (ACS), multiple frames
  • In the Canadian LFS, max(CV) for UI regions was reduced by about half using compromise allocation.

slide-6
SLIDE 6

Planned Domains:

  • Minimize a weighted sum of sampling variances of direct small area estimators subject to fixed overall sample size. Ref: Longford (2006)
  • Minimize total sample size (or cost) subject to desired tolerances on the area sampling variances and on the aggregate sampling variance. Ref: Rao (2007)
  • Achieve (approx.) equal RRMSE of GREG for the planned domains subject to a fixed cost. Ref: Gabler, Ganninger, Münnich, and others
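The first criterion on this slide has a closed form in the simplest setting. The sketch below assumes simple random sampling within each planned domain and negligible finite population corrections, so the sampling variance in domain i is S_i^2 / n_i. Under that assumption, minimizing the weighted sum of variances subject to a fixed total sample size gives n_i proportional to sqrt(w_i) * S_i. The function name, the weights and the standard deviations are illustrative and are not taken from Longford (2006).

```python
import numpy as np

def allocate(weights, sds, n_total):
    """Real-valued allocation minimizing sum_i w_i * S_i**2 / n_i
    subject to sum_i n_i = n_total.

    Assumes SRS within each domain and no finite population correction;
    the Lagrange solution is n_i proportional to sqrt(w_i) * S_i."""
    weights = np.asarray(weights, dtype=float)
    sds = np.asarray(sds, dtype=float)
    shares = np.sqrt(weights) * sds
    return n_total * shares / shares.sum()

# Hypothetical inputs: three domains, the third given four times the weight.
n_i = allocate(weights=[1.0, 1.0, 4.0], sds=[2.0, 1.0, 1.0], n_total=500)
```

Raising the weight of a small domain moves sample toward it, which is the kind of compromise between domain-level and aggregate precision that the design slides describe.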

slide-7
SLIDE 7
  • Achieve equal RRMSE of EBP (or, the estimator to be used) for the planned domains subject to a fixed cost.

However, “the client will always require more than specified at the design stage” (Fuller, 1999).

slide-8
SLIDE 8

Issues in Small Area Estimation

1. Definition of small areas
2. Identification of relevant sources of information
3. Method of combining information
4. Small area estimates
5. Accuracy of the SAE method
6. Robust validation
7. Computer programming
8. Presentation of SAE statistics

slide-9
SLIDE 9

Borrowing Strength:

  • Relevant Source of Information
    – Census data
    – Administrative information
    – Related surveys
  • Method of Combining Information
    – Choices of good small area models
    – Use of a good statistical methodology

slide-10
SLIDE 10

Synthetic Estimators

1944 Radio Listening Survey, Hansen, Hurwitz and Madow (1953, pp. 483-486): To estimate the median number of radio stations heard during the day for over 500 counties (small areas), the following explicit regression equation, based on data for 85 counties, was used:

$\hat{y}_i = 0.52 + 0.74 x_i$

where, for county i,

  • $y_i$: estimate obtained from the personal interview survey
  • $x_i$: estimate obtained from the mail survey

slide-11
SLIDE 11

County Crop Production (Stasny et al., 1991)

To estimate wheat production for each county of Kansas:

$\hat{y}_{ij} = \hat{\beta}_0 + \hat{\beta}_1 x_{1ij} + \cdots + \hat{\beta}_p x_{pij}$, where

  • $y_{ij}$: wheat production of the jth farm in the ith county
  • $x_{ij} = (1, x_{1ij}, \ldots, x_{pij})'$: a vector of auxiliary variables
  • Regression-synthetic estimator:

$\hat{Y}_i = \sum_j \hat{y}_{ij} = N_i \hat{\beta}_0 + X_{i1} \hat{\beta}_1 + \cdots + X_{ip} \hat{\beta}_p$

  • The total no. of farms $N_i$ and the totals of the auxiliary variables $X_{il}$ ($l = 1, \ldots, p$) are known.
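The regression-synthetic estimator can be sketched in a few lines. This is a minimal illustration, assuming the coefficients are fitted by ordinary least squares on the pooled sample (the slide does not say how they were estimated). The function name and the toy numbers are hypothetical.

```python
import numpy as np

def regression_synthetic_totals(y, X, N_area, X_area_totals):
    """Synthetic area totals N_i*b0 + sum_l X_il*b_l.

    y: (n,) sampled unit values pooled over all areas.
    X: (n, p) auxiliary variables for the sampled units (no intercept column).
    N_area: (m,) known number of units in each area.
    X_area_totals: (m, p) known area totals of the auxiliary variables.
    Assumption: coefficients come from a pooled OLS fit."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    N_area = np.asarray(N_area, dtype=float)
    X_area_totals = np.asarray(X_area_totals, dtype=float)
    return N_area * beta_hat[0] + X_area_totals @ beta_hat[1:]

# Toy data lying exactly on y = 2 + 3x, so the fit recovers (2, 3).
totals = regression_synthetic_totals(
    y=[2.0, 5.0, 8.0, 11.0],
    X=[[0.0], [1.0], [2.0], [3.0]],
    N_area=[10, 20],
    X_area_totals=[[15.0], [40.0]],
)
```

Only the known totals N_i and X_il differ between areas; the coefficients are shared, which is what makes the estimator synthetic.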

slide-12
SLIDE 12

Ratio Adjustment:

$\hat{Y}_{i,\mathrm{adj}} = \dfrac{\hat{Y}_i}{\sum_i \hat{Y}_i} \hat{Y}$, where $\hat{Y}$ is the direct design-based estimate for the state from a large probability sample.

NCHS synthetic State estimates for health variables: assume homogeneity within carefully constructed post-strata.

More refined synthetic estimation: SPREE.

World Bank Method: Elbers et al. (2003), Haslett-Jones (2005)

Off-the-Shelf Methods: Schirm and Zaslavsky (1997)
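The ratio adjustment at the top of this slide rescales the synthetic estimates so that they add up to the reliable direct estimate for the larger area. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
import numpy as np

def ratio_adjust(synthetic, direct_total):
    """Scale synthetic area estimates so they sum to the direct
    design-based estimate for the larger area (e.g. the state)."""
    synthetic = np.asarray(synthetic, dtype=float)
    return synthetic * direct_total / synthetic.sum()

# Hypothetical county-level synthetic totals and a direct state total.
adjusted = ratio_adjust([65.0, 160.0, 75.0], direct_total=330.0)
```

The adjustment preserves the relative sizes of the area estimates and changes only their common scale.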

slide-13
SLIDE 13

Basic Area Level Model

To estimate small area means $Y_i$ using direct design-based estimates $y_i$ and area level auxiliary variables $x_i$’s.

A Basic Area-Level Model:

Level 1: $\hat{\theta}_i = g(y_i) \sim \text{ind. } N(\theta_i, \psi_i)$
Level 2: $\theta_i = g(Y_i) \sim \text{ind. } N(x_i^T \beta, \tau^2)$

Fay and Herriot (1979): $g(Y_i) = \log(Y_i)$

slide-14
SLIDE 14

Carter and Rolph (1974), Efron and Morris (1975): $g(Y_i) = \arcsin(\sqrt{Y_i})$

SAIPE: $g(Y_i) = Y_i$ for state level estimation of the proportion of poor school-age children, and $g(Y_i) = \log(Y_i)$ for county level poverty counts of school-age children.

The model can be written as a simple linear mixed normal model:

$\hat{\theta}_i = \theta_i + e_i = x_i^T \beta + v_i + e_i$, where

  • $e_i$: sampling error; $e_i \sim \text{ind. } N(0, \psi_i)$
  • $v_i$: area specific random effects; $v_i \sim \text{iid } N(0, \tau^2)$

slide-15
SLIDE 15

Supplementary Information Used

  • Per-Capita Income for the county
  • Value of housing for the place
  • Value of housing for the county
  • IRS-adjusted gross income per exemption for the place
  • IRS-adjusted gross income per exemption for the county

slide-16
SLIDE 16

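The BP:

$\hat{\theta}_i^{BP} = E(\theta_i \mid \hat{\theta}_i; \beta, \tau^2) = x_i^T \beta + \gamma_i (\hat{\theta}_i - x_i^T \beta) = \gamma_i \hat{\theta}_i + (1 - \gamma_i) x_i^T \beta$, where $\gamma_i = \dfrac{\tau^2}{\tau^2 + \psi_i}$

EBP (or EBLUP):

$\hat{\theta}_i^{EBP} = E(\theta_i \mid \hat{\theta}_i; \hat{\beta}, \hat{\tau}^2)$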
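The BP formula is a weighted average of the direct estimate and the regression prediction, with weight gamma_i on the direct estimate. A minimal sketch for known beta and tau^2 (the function name and numbers are illustrative only):

```python
import numpy as np

def fay_herriot_bp(theta_hat, X, psi, beta, tau2):
    """Best predictor gamma_i*theta_hat_i + (1 - gamma_i)*x_i'beta
    with gamma_i = tau2 / (tau2 + psi_i); beta and tau2 are treated as known."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    psi = np.asarray(psi, dtype=float)
    synthetic = np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    gamma = tau2 / (tau2 + psi)
    return gamma * theta_hat + (1.0 - gamma) * synthetic, gamma

# Two hypothetical areas: a precise direct estimate (psi = 1) and a noisy one (psi = 9).
bp, gamma = fay_herriot_bp(
    theta_hat=[10.0, 20.0], X=[[1.0], [1.0]], psi=[1.0, 9.0], beta=[12.0], tau2=3.0
)
```

The area with the larger sampling variance gets the smaller gamma_i and is pulled more strongly toward the regression prediction. Replacing beta and tau2 by estimates gives the EBP.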

slide-17
SLIDE 17

Different MSE of EBP:

  (i) $E(\hat{\theta}_i^{EBP} - \theta_i)^2$
  (ii) $E[(\hat{\theta}_i^{EBP} - \theta_i)^2 \mid \hat{\theta}_i]$
  (iii) $E[(\hat{\theta}_i^{EBP} - \theta_i)^2 \mid \theta_i]$
  (iv) $E[(\hat{\theta}_i^{EBP} - \theta_i)^2 \mid \hat{\theta}_i, i = 1, \ldots, m]$

  • The majority of research has focused on estimation of the unconditional MSE (i).

slide-18
SLIDE 18

$\mathrm{MSE}(\hat{\theta}_i^{EBP}) \approx g_{1i}(\tau^2) + g_{2i}(\tau^2) + g_{3i}(\tau^2)$

  • $g_{1i}(\tau^2) = \mathrm{MSE}(\hat{\theta}_i^{BP})$
  • $g_{2i}(\tau^2)$: the extra variability due to the estimation of $\beta$
  • $g_{3i}(\tau^2)$: the extra variability due to the estimation of $\tau^2$

Ref: Prasad and Rao (1990) and Datta and Lahiri (2000)

The terms $g_{2i}(\tau^2)$ and $g_{3i}(\tau^2)$ are of the same order, which is lower than that of the leading term $g_{1i}(\tau^2)$. PR and DL obtained a second-order unbiased (or nearly unbiased) estimator of the unconditional MSE using the above approximation and correcting the bias of $g_{1i}(\hat{\tau}^2)$.
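The three terms can be computed directly for the basic area-level model. The sketch below assumes the Prasad-Rao moment estimator of tau^2 (truncated at zero) and weighted least squares for beta, and uses the standard textbook expressions for g1, g2 and g3 under that estimator. The bias correction takes the form g1 + g2 + 2*g3. The function name is hypothetical, and other variance estimators (ML, REML) would need a different correction.

```python
import numpy as np

def prasad_rao_mse(theta_hat, X, psi):
    """EBP and Prasad-Rao type MSE estimate for the area-level model.

    Assumptions: moment estimator of tau^2 from OLS residuals,
    WLS estimator of beta, known sampling variances psi."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    X = np.asarray(X, dtype=float)
    psi = np.asarray(psi, dtype=float)
    m, p = X.shape

    # Moment estimator of tau^2, truncated at zero.
    XtX_inv = np.linalg.inv(X.T @ X)
    resid = theta_hat - X @ (XtX_inv @ (X.T @ theta_hat))
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    tau2 = max(0.0, (resid @ resid - np.sum(psi * (1.0 - leverage))) / (m - p))

    # WLS for beta, shrinkage factors and the EBP.
    w = 1.0 / (tau2 + psi)
    V_beta = np.linalg.inv(X.T @ (w[:, None] * X))
    beta = V_beta @ (X.T @ (w * theta_hat))
    gamma = tau2 * w
    ebp = gamma * theta_hat + (1.0 - gamma) * (X @ beta)

    g1 = gamma * psi                                   # MSE of the BP
    g2 = (1.0 - gamma) ** 2 * np.einsum("ij,jk,ik->i", X, V_beta, X)
    var_tau2 = 2.0 * np.sum((tau2 + psi) ** 2) / m**2  # asymptotic variance of tau2_hat
    g3 = psi**2 * w**3 * var_tau2
    return ebp, g1 + g2 + 2.0 * g3, tau2

# Small deterministic illustration (intercept-only model, equal psi).
ebp, mse, tau2 = prasad_rao_mse(
    theta_hat=[0.0, 2.0, 4.0, 6.0], X=np.ones((4, 1)), psi=np.ones(4)
)
```

The factor 2 on g3 is the bias correction: the plug-in g1 underestimates g1 at the true tau^2 by approximately g3 for this estimator.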

slide-19
SLIDE 19

Longford (2007): The PR MSE estimator did not perform well in estimating design-based MSE for the EURAREA project.

Zhang (2007): The PR MSE estimator, averaged over areas, tracks the average design-based MSE for large m, if the model holds.

Different resampling methods [jackknife and parametric bootstrap] have been proposed by Butar and Lahiri (2003), Jiang, Lahiri and Wan (2002), Hall and Maiti (2006), Pfeffermann and Glickmann (2004), and Chatterjee and Lahiri (2007). Compared to the Taylor series method, they performed well in simulations; see Fabrizi et al. (2007) and Pereira and Pedro (2008).

slide-20
SLIDE 20

Issues:

The method uses a simple model and results in an EBP which is design-consistent.

Normality: The EBP method is extendable to specified non-normal distributions for the sampling errors and random effects. For unspecified non-normality of the sampling errors and random effects, one can use EBLUP [Lahiri and Rao, 1995] or certain adaptive [Lahiri, 2002; Fabrizi and Trivisano, 2007] or linear EB [Ghosh and Lahiri, 1987; Cocchi and Mouchart] methods.

slide-21
SLIDE 21

Known sampling variances $\psi_i$: GVF type methods are generally used. The method usually does not consider small area effects, and the uncertainty in estimating the sampling variances is not included in the EBP.

In some situations, standard estimates [REML, ML, ANOVA, etc.] of the model variance $\tau^2$ can be zero. When $\hat{\tau}^2$ is zero, EBLUP reduces to the regression synthetic estimate. One way to avoid the problem is to use the ADM or AML estimates [Morris, 1987; Li and Lahiri, 2007].

slide-22
SLIDE 22

A simple back transformation is often used to obtain the estimate of $Y_i$. The optimum property of the BP is lost by such a back transformation.

The BP of $Y_i = g^{-1}(\theta_i)$: $E(g^{-1}(\theta_i) \mid \hat{\theta}_i; \beta, \tau^2)$

An EBP of $Y_i$: $E(g^{-1}(\theta_i) \mid \hat{\theta}_i; \hat{\beta}, \hat{\tau}^2)$

The rationale behind the transformation $g(\cdot)$ rests on the Taylor series argument, and it is used primarily to stabilize the variance. A direct modeling of the direct estimates is possible, but this is likely to lead to a non-linear non-normal mixed model.
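As a worked case, take the log transformation, $g(Y_i) = \log(Y_i)$. Under the two-level normal model, the conditional distribution of $\theta_i$ given $\hat{\theta}_i$ is normal with mean $\hat{\theta}_i^{BP}$ and variance $\gamma_i \psi_i$, where $\gamma_i = \tau^2 / (\tau^2 + \psi_i)$. The lognormal mean formula then gives the BP of $Y_i$ in closed form, which shows what the simple back transformation misses:

```latex
\begin{aligned}
\theta_i \mid \hat{\theta}_i
  &\sim N\!\left(\hat{\theta}_i^{BP},\ \gamma_i \psi_i\right),
  \qquad \gamma_i = \frac{\tau^2}{\tau^2 + \psi_i}, \\
E\!\left(e^{\theta_i} \mid \hat{\theta}_i; \beta, \tau^2\right)
  &= \exp\!\left(\hat{\theta}_i^{BP} + \tfrac{1}{2}\gamma_i \psi_i\right)
  \;>\; \exp\!\left(\hat{\theta}_i^{BP}\right).
\end{aligned}
```

The simple back-transformed value $\exp(\hat{\theta}_i^{BP})$ is therefore smaller than the BP of $Y_i$ by the factor $\exp(\gamma_i \psi_i / 2)$, which grows with the sampling variance.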

slide-23
SLIDE 23

Confidence Interval:

The intuitive interval [Cox, 1976] $\hat{\theta}_i^{EBP} \pm 1.96 \sqrt{g_{1i}(\hat{\tau}^2)}$ has an undercoverage problem.

The correction $\hat{\theta}_i^{EBP} \pm 1.96 \sqrt{mse_i^{PR}}$ does not solve the problem: it has either an undercoverage or an overcoverage problem.

slide-24
SLIDE 24

Parametric bootstrap interval:

$\left(\hat{\theta}_i^{EBP} + L \sqrt{g_{1i}(\hat{\tau}^2)},\ \hat{\theta}_i^{EBP} + U \sqrt{g_{1i}(\hat{\tau}^2)}\right)$,

where L and U are obtained from the parametric bootstrap histogram of

$\dfrac{\theta_i^* - \hat{\theta}_i^{*EBP}}{\sqrt{g_{1i}(\hat{\tau}^{*2})}}$ [Ref: Chatterjee, Lahiri and Li, 2008]

Hall and Maiti (2006) have an alternative parametric bootstrap method, but the method is synthetic (Rao, 2005).
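The interval construction can be sketched as follows. This is an illustration of the general recipe, not the authors' implementation. It assumes a moment estimator of tau^2 floored at a small positive value (a hypothetical choice, made so that g1 is never exactly zero), weighted least squares for beta, and equal-tailed bootstrap quantiles for L and U. All names and the simulated data are illustrative.

```python
import numpy as np

def fit_fh(theta_hat, X, psi, floor=1e-2):
    """Fit the area-level model; returns (EBP, g1, beta_hat, tau2_hat).

    Assumption: tau^2 is a moment estimate floored at `floor`."""
    m, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)            # OLS hat matrix
    resid = theta_hat - H @ theta_hat
    tau2 = (resid @ resid - np.sum(psi * (1.0 - np.diag(H)))) / (m - p)
    tau2 = max(floor, tau2)
    w = 1.0 / (tau2 + psi)
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * theta_hat))
    gamma = tau2 * w
    ebp = gamma * theta_hat + (1.0 - gamma) * (X @ beta)
    return ebp, gamma * psi, beta, tau2

def bootstrap_interval(theta_hat, X, psi, level=0.95, B=1000, seed=0):
    """Parametric bootstrap prediction interval for each theta_i."""
    rng = np.random.default_rng(seed)
    ebp, g1, beta, tau2 = fit_fh(theta_hat, X, psi)
    m = len(theta_hat)
    pivots = np.empty((B, m))
    for b in range(B):
        # Generate theta* and theta_hat* from the fitted two-level model.
        theta_star = X @ beta + rng.normal(0.0, np.sqrt(tau2), m)
        theta_hat_star = theta_star + rng.normal(0.0, np.sqrt(psi))
        ebp_star, g1_star, _, _ = fit_fh(theta_hat_star, X, psi)
        pivots[b] = (theta_star - ebp_star) / np.sqrt(g1_star)
    alpha = 1.0 - level
    L = np.quantile(pivots, alpha / 2, axis=0)
    U = np.quantile(pivots, 1.0 - alpha / 2, axis=0)
    return ebp + L * np.sqrt(g1), ebp + U * np.sqrt(g1)

# Illustrative simulated data (not from the talk).
rng = np.random.default_rng(1)
m = 20
X = np.column_stack([np.ones(m), rng.uniform(0, 1, m)])
psi = rng.uniform(0.5, 1.5, m)
theta = X @ np.array([1.0, 2.0]) + rng.normal(0, 1.0, m)
theta_hat = theta + rng.normal(0, np.sqrt(psi))
lower, upper = bootstrap_interval(theta_hat, X, psi, B=500)
```

Because L and U come from the bootstrap distribution of the studentized pivot rather than from the normal table, the interval need not be symmetric around the EBP.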

slide-25
SLIDE 25

Estimation of Small Area Proportions: Two Basic Area Models

Ref: Liu, Lahiri and Kalton (2007)

Model 1:
Level 1: $p_{iw} \mid P_i \sim \text{ind } N\left(P_i, \dfrac{P_i(1 - P_i)}{n_i} \mathrm{deff}_i\right)$
Level 2: $\mathrm{logit}(P_i) \sim \text{ind } N(x_i^T \beta, \tau^2)$

Model 2:
Level 1: $p_{iw} \mid P_i \sim \text{ind } \mathrm{Beta}\left(P_i, \dfrac{P_i(1 - P_i)}{n_i} \mathrm{deff}_i\right)$
Level 2: $\mathrm{logit}(P_i) \sim \text{ind } N(x_i^T \beta, \tau^2)$
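In Model 2 the Beta distribution is written in terms of its mean and variance. To simulate from it or evaluate its density, one needs the usual shape parameters. The sketch below assumes the pair (P_i, P_i(1 - P_i) deff_i / n_i) is read as (mean, variance) and solves for the shapes by moment matching. The function name and numbers are illustrative, and the effective sample size n / deff must exceed 1.

```python
def beta_shape_parameters(P, n, deff):
    """Shape parameters (a, b) of a Beta law with mean P and
    variance P*(1 - P)*deff/n, obtained by moment matching.

    Assumption: the slide's Beta(mean, variance) notation; requires n/deff > 1."""
    n_eff = n / deff                      # effective sample size
    return P * (n_eff - 1.0), (1.0 - P) * (n_eff - 1.0)

# Hypothetical area: true proportion 0.2, sample size 150, design effect 1.5.
a, b = beta_shape_parameters(P=0.2, n=150, deff=1.5)
mean = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1.0))
```

Unlike the normal law in Model 1, the Beta law keeps the direct estimate inside (0, 1), which matters for small proportions and small samples.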

slide-26
SLIDE 26

The hierarchical Bayes method [using MCMC] is more straightforward than the corresponding EBP method. However, the method requires careful specification of priors [usually non-informative] for the hyperparameters $\beta$ and $\tau^2$. Datta et al. (2005) and Ganesh and Lahiri (2005) proposed priors that have good frequentist properties.

slide-27
SLIDE 27

Other Issues:

Unit Level Models: Battese, Harter and Fuller (1988), Kott (1989), Prasad and Rao (1999), You and Rao (2002), Hall and Maiti (2006), Jiang and Lahiri (2006), Lahiri and Mukherjee (2007)

Robust Estimation: Sinha and Rao (2007) and a number of papers by Chambers, Tzavidis, Pratesi, Ranalli, Salvati

Informative Sampling: Pfeffermann and Sverchkov (2007)

slide-28
SLIDE 28

Model Selection: Meza and Lahiri (2005), and a number of papers by Jiang, J.S. Rao and others.

Spatial Modeling: Pratesi, Salvati and Molina (2007), Zhou and You (2007), Ganesh (2007)

Benchmarking: Rao (2003, sec 7.2), You, Rao and Dick (2004)

slide-29
SLIDE 29
Fig. 2: Player Prediction (series: True, Direct, Reg, EM, EMReg)

slide-30
SLIDE 30
slide-31
SLIDE 31
slide-32
SLIDE 32

Fig 3: Plot of proposed estimates and the corresponding preliminary and final weighted link relatives
