SLIDE 1

useR! 2009, Rennes 8.-10. July 2009 – p. 1

Size Estimation - Statistical Models for Underreporting

Gerhard Neubauer, Gordana Djuraš & Herwig Friedl

JOANNEUM RESEARCH and Technical University, Graz

SLIDE 2

1 Introduction

SLIDE 3

Underreporting

Any sample of count data may be incomplete:

■ criminology: crimes with an aspect of shame (sexuality, domestic violence) or theft of low-value goods
■ public health: infectious (HIV) or chronic (diabetes) disease data
■ production: error counts in a production process
■ traffic accidents with minor damage

Aim: estimation of the total number of cases

SLIDE 6

Binomial Model

Event reported   Yes   No
                 $R$   $1 - R$

$R \sim \mathrm{Bernoulli}(\pi)$

iid sample of size $\lambda$ $\Rightarrow$ $Y = \sum_i R_i \sim \mathrm{Binomial}(\lambda, \pi)$

$E(Y) = \mu = \lambda\pi$, $\mathrm{var}(Y) = \sigma^2 = \lambda\pi(1 - \pi)$

$Y$: the number of reported events
$\pi$: the reporting probability
$\lambda$: the total number of events (size parameter)
$U = \lambda - Y$: the number of unreported events
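The reporting mechanism can be illustrated with a small simulation. This is a minimal sketch, not code from the talk: the function name `simulate_reporting` and the values `lam = 200`, `pi = 0.7` are made up for illustration.

```python
import random

def simulate_reporting(total_events, report_prob, rng):
    """Draw Y ~ Binomial(total_events, report_prob) as a sum of Bernoulli indicators R_i."""
    reported = sum(1 for _ in range(total_events) if rng.random() < report_prob)
    unreported = total_events - reported  # U = lambda - Y
    return reported, unreported

rng = random.Random(2009)          # fixed seed so the run is reproducible
lam, pi = 200, 0.7                 # illustrative values, not taken from the slides
y, u = simulate_reporting(lam, pi, rng)

mean_y = lam * pi                  # E(Y) = lambda * pi
var_y = lam * pi * (1 - pi)        # var(Y) = lambda * pi * (1 - pi)
```

In practice only `y` is observed; `lam` and `u` are the unknown quantities the size estimation is after.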

SLIDE 7

Binomial Model

Both $\lambda$ and $\pi$ have to be estimated. With $\lambda$ unknown, the binomial is no longer a member of the exponential family.

SLIDE 8

Estimation

For $T$ iid samples $Y_t$:

Method of Moments

For the binomial we have $\mathrm{var}(Y) = \mu - \mu^2/\lambda \le \mu$, so the method is limited to data with $s^2 < \bar{y}$.

For $s^2 > \bar{y}$:

  • 1. Regression approach using ML: $Y_t \overset{\text{ind}}{\sim} \mathrm{Binomial}(\lambda_t, \pi)$ with $\lambda_t = f(x_t, \beta)$, Neubauer & Friedl (2006)

  • 2. Mixed model approaches
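
The method of moments for the binomial can be sketched as follows. Equating $\bar{y}$ with $\lambda\pi$ and $s^2$ with $\lambda\pi(1-\pi)$ gives $\hat{\pi} = 1 - s^2/\bar{y}$ and $\hat{\lambda} = \bar{y}/\hat{\pi}$. The function name and the toy counts are hypothetical, and this is not the estimator implemented in the package.

```python
from statistics import mean, variance

def binomial_moment_estimates(counts):
    """Method-of-moments estimates (lambda_hat, pi_hat) for iid Binomial(lambda, pi) counts."""
    y_bar = mean(counts)
    s2 = variance(counts)          # sample variance with divisor T - 1
    if s2 >= y_bar:
        # var(Y) = mu - mu^2 / lambda <= mu, so overdispersed data cannot be matched
        raise ValueError("method of moments requires s^2 < y_bar")
    pi_hat = 1 - s2 / y_bar
    lambda_hat = y_bar / pi_hat
    return lambda_hat, pi_hat

# Toy data, invented for illustration
lambda_hat, pi_hat = binomial_moment_estimates([4, 6, 5, 5])
```

The explicit error for $s^2 \ge \bar{y}$ mirrors the limitation stated on the slide, which is what motivates the regression and mixed model approaches.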

SLIDE 9

2 Alternative Models

SLIDE 10

Beta-Binomial

Random reporting probability $P$:

$Y_t \mid P \sim \mathrm{Binomial}(\lambda, p)$
$P \sim \mathrm{Beta}(\gamma, \delta)$
$Y_t \sim \text{Beta-Binomial}(\lambda, \gamma, \delta)$

Mean-variance relation:

$\mathrm{var}(Y_t) = \left(\mu - \frac{\mu^2}{\lambda}\right)\phi$, $\quad \phi = \frac{\lambda + \gamma + \delta}{1 + \gamma + \delta} \ge 1$
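A short numerical illustration of the mean-variance relation, assuming $\mu = \lambda\gamma/(\gamma+\delta)$ as the beta-binomial mean. The function name and parameter values are made up for the example.

```python
def beta_binomial_moments(lam, gamma, delta):
    """Mean, variance and dispersion factor phi of Beta-Binomial(lam, gamma, delta)."""
    pi = gamma / (gamma + delta)                       # mean reporting probability
    mu = lam * pi
    phi = (lam + gamma + delta) / (1 + gamma + delta)  # phi >= 1 for lam >= 1
    var = (mu - mu**2 / lam) * phi
    return mu, var, phi

# Illustrative values only
mu, var, phi = beta_binomial_moments(lam=100, gamma=8.0, delta=2.0)
```

With these values the variance exceeds the mean, so the model can accommodate data with $s^2 > \bar{y}$ that the plain binomial cannot.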

SLIDE 11

Poisson

Random total number of cases $L$:

$Y_t \mid L \sim \mathrm{Binomial}(l, \pi)$
$L \sim \mathrm{Poisson}(\lambda)$
$Y_t \sim \mathrm{Poisson}(\lambda\pi)$

Parameters not identified
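The marginal distribution follows by summing over the unobserved total $l$. The derivation below is the standard Poisson thinning argument, added here as a worked step; it shows that only the product $\lambda\pi$ enters the distribution of $Y_t$, which is why $\lambda$ and $\pi$ cannot be separated.

```latex
\begin{aligned}
P(Y_t = y)
  &= \sum_{l \ge y} \binom{l}{y} \pi^{y} (1-\pi)^{l-y} \, \frac{e^{-\lambda} \lambda^{l}}{l!} \\
  &= \frac{e^{-\lambda} (\lambda\pi)^{y}}{y!} \sum_{m \ge 0} \frac{\{\lambda(1-\pi)\}^{m}}{m!} \\
  &= \frac{e^{-\lambda\pi} (\lambda\pi)^{y}}{y!}
\end{aligned}
```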

SLIDE 12

Negative Binomial

Additional randomness in $E(L)$:

$L \mid K \sim \mathrm{Poisson}(k\lambda)$
$Y_t \mid K \sim \mathrm{Poisson}(k\lambda\pi)$
$K \sim \mathrm{Gamma}(\omega, \omega)$
$Y_t \sim \text{Negative Binomial}(\omega, 1 - \pi)$

$\omega$: the number of unreported cases
$\pi$: the reporting probability

Mean-variance relation:

$\mathrm{var}(Y_t) = \mu + \frac{\mu^2}{\omega} \ge \mu$
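The mean-variance relation can be checked with the law of total variance. This worked step assumes $\mathrm{Gamma}(\omega, \omega)$ means shape $\omega$ and rate $\omega$, so that $E(K) = 1$ and $\mathrm{var}(K) = 1/\omega$, and writes $\mu = \lambda\pi$; the slide does not state the parametrization.

```latex
\begin{aligned}
\mathrm{var}(Y_t)
  &= E\{\mathrm{var}(Y_t \mid K)\} + \mathrm{var}\{E(Y_t \mid K)\} \\
  &= E(K\mu) + \mathrm{var}(K\mu) \\
  &= \mu + \frac{\mu^{2}}{\omega}
\end{aligned}
```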

SLIDE 13

Beta-Poisson

Consider both binomial parameters as random:

$Y \mid L, P \sim \mathrm{Binomial}(L, P)$
$L \sim \mathrm{Poisson}(\lambda)$
$P \sim \mathrm{Beta}(\gamma, \delta)$
$Y \sim \text{Beta-Poisson}(\lambda, \gamma, \delta)$

$E(Y) = \lambda\pi = \mu$ where $\pi = \frac{\gamma}{\gamma + \delta}$

$\mathrm{var}(Y) = \mu\phi$ with $\phi = 1 + \frac{\lambda(1 - \pi)}{1 + \gamma + \delta} \ge 1$
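The moment formulas can be checked by simulating the hierarchy directly. This is a rough sketch under invented parameter values; the helper names are hypothetical, and the Poisson draw uses Knuth's multiplication method because the Python standard library has no Poisson sampler.

```python
import math
import random
from statistics import mean, variance

def poisson_draw(lam, rng):
    """Knuth's method: count uniforms multiplied until the product drops below exp(-lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def beta_poisson_draw(lam, gamma, delta, rng):
    """One draw of Y: L ~ Poisson(lam), P ~ Beta(gamma, delta), Y | L, P ~ Binomial(L, P)."""
    total = poisson_draw(lam, rng)
    p = rng.betavariate(gamma, delta)
    return sum(1 for _ in range(total) if rng.random() < p)

rng = random.Random(1)
lam, gamma, delta = 20.0, 8.0, 2.0            # illustrative values only
sample = [beta_poisson_draw(lam, gamma, delta, rng) for _ in range(20000)]

pi = gamma / (gamma + delta)
mu_theory = lam * pi                                           # E(Y) = lambda * pi
var_theory = mu_theory * (1 + lam * (1 - pi) / (1 + gamma + delta))  # var(Y) = mu * phi
mu_sim, var_sim = mean(sample), variance(sample)
```

Comparing `mu_sim` with `mu_theory` and `var_sim` with `var_theory` gives a quick plausibility check of $\phi$; agreement is only approximate because of Monte Carlo error.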

SLIDE 14

Generalized Poisson Distribution

Consul (1989)

Moments:

$E(Y) = \frac{\theta}{1 - \tau}$, $\quad \mathrm{var}(Y) = \frac{\theta}{(1 - \tau)^3}$

■ $\tau = 0$: $E(Y) = \mathrm{var}(Y)$ $\Rightarrow$ Poisson($\theta$)
■ $0 < \tau < 1$: $E(Y) < \mathrm{var}(Y)$ $\Rightarrow$ Neg. Binomial
■ $\tau < 0$: $E(Y) > \mathrm{var}(Y)$ $\Rightarrow$ Binomial
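The moments can be verified numerically from the probability function. The slide does not show the pmf; the sketch below assumes the usual form from Consul's work, $P(Y = y) = \theta(\theta + \tau y)^{y-1} e^{-\theta - \tau y} / y!$, and is restricted to $0 \le \tau < 1$, where the support is all non-negative integers. The function name, parameter values and truncation point are illustrative choices.

```python
import math

def gp_log_pmf(y, theta, tau):
    """Log pmf of the generalized Poisson distribution, assuming 0 <= tau < 1."""
    return (math.log(theta) + (y - 1) * math.log(theta + tau * y)
            - theta - tau * y - math.lgamma(y + 1))

theta, tau = 2.0, 0.3                       # illustrative values only
support = range(401)                        # truncation; the tail beyond is negligible here
probs = [math.exp(gp_log_pmf(y, theta, tau)) for y in support]

total = sum(probs)
gp_mean = sum(y * p for y, p in zip(support, probs))
gp_var = sum((y - gp_mean) ** 2 * p for y, p in zip(support, probs))

mean_formula = theta / (1 - tau)            # E(Y) from the slide
var_formula = theta / (1 - tau) ** 3        # var(Y) from the slide
```

Setting `tau = 0.0` in `gp_log_pmf` reduces the expression to the ordinary Poisson log pmf, matching the first bullet.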

SLIDE 15

Conditional Poisson Models

Motivation: $\pi \to 1$ leads to $Y \to \lambda$ in the binomial approach.

Assume $Y \mid L \sim \mathrm{Poisson}(L)$.

Choose $p(L)$ such that $E(Y) = \lambda\pi$ and $\mathrm{var}(Y) = \lambda\pi\phi$.

For example:

$L \sim \mathrm{Binomial}(\lambda, \pi)$: $\quad 1 < \phi = 2 - \pi < 2$
$L \sim \text{Negative Binomial}(\lambda, \pi)$: $\quad 2 < \phi = \frac{2 - \pi}{1 - \pi} < \infty$
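Both dispersion factors follow from the law of total variance, since $Y \mid L \sim \mathrm{Poisson}(L)$ gives $E(Y) = E(L)$ and $\mathrm{var}(Y) = E(L) + \mathrm{var}(L)$. The binomial case is worked out below. For the negative binomial case the slide does not state the parametrization; the second line assumes one with $E(L) = \lambda\pi$ and $\mathrm{var}(L) = \lambda\pi/(1-\pi)$, which reproduces the stated $\phi$.

```latex
\begin{aligned}
L \sim \mathrm{Binomial}(\lambda, \pi):\quad
  \mathrm{var}(Y) &= \lambda\pi + \lambda\pi(1-\pi) = \lambda\pi\,(2 - \pi) \\
L \sim \text{Negative Binomial}(\lambda, \pi):\quad
  \mathrm{var}(Y) &= \lambda\pi + \frac{\lambda\pi}{1-\pi} = \lambda\pi\,\frac{2 - \pi}{1 - \pi}
\end{aligned}
```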

SLIDE 16

Regression Modelling

For all models except the GP we use

$\lambda_{t,\beta} = \exp(x_t'\beta)$ and $\pi_\alpha = \frac{\exp(\alpha)}{1 + \exp(\alpha)}$

$x_t$: $d$-vector of known regressors
$\beta$: $d$-vector of unknown parameters

For the GP model we use

$\theta_{t,\beta} = \exp(x_t'\beta)$ and $\tau_\alpha = 1 - \exp(-\alpha)$

Test $\alpha = 0$ $\Rightarrow$ Identify Poisson misdispersion
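The three link functions are simple to write down. The sketch below uses hypothetical function names and is not the package's code; it only illustrates how the unconstrained parameters map to $\lambda_t$, $\pi$ and $\tau$.

```python
import math

def size_parameter(x_t, beta):
    """Log link: lambda_t = exp(x_t' beta), always positive."""
    return math.exp(sum(x * b for x, b in zip(x_t, beta)))

def reporting_probability(alpha):
    """Logit link: pi = exp(alpha) / (1 + exp(alpha)), always in (0, 1)."""
    return math.exp(alpha) / (1 + math.exp(alpha))

def gp_dispersion(alpha):
    """GP link: tau = 1 - exp(-alpha), always below 1."""
    return 1 - math.exp(-alpha)

# alpha = 0 gives tau = 0, the Poisson case; the sign of alpha gives the direction
tau_poisson = gp_dispersion(0.0)
tau_over = gp_dispersion(0.5)     # 0 < tau < 1: overdispersion
tau_under = gp_dispersion(-0.5)   # tau < 0: underdispersion
```

Because $\tau_\alpha = 0$ exactly when $\alpha = 0$, a test of $\alpha = 0$ in the GP model is a test of the Poisson assumption, and the sign of the estimate indicates over- or underdispersion.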

SLIDE 17

3 Implementation

SLIDE 18

R package sizEst

Full ML estimation of all models: done (Conditional Poisson: in progress)
Testing competing models: in development
Testing parameters within models: done
Model diagnostics: done

Main functions: arrayEst, sizEst

Implemented methods:
sizEst: plot, predict, residuals, summary
summary.arrayEst

SLIDE 19

4 Real Data Application

SLIDE 20

Stroke Data

Hypothesis: Slight strokes are not seen in hospitals
Data: Hospital discharges

Output from function arrayEst()

              iterations    loglik  chi.sq  gradient      p1
GP 0.5                 3  -405.256  94.602     0.000  0.4594
NegBin 0.5             3  -405.195  94.411     0.000  0.4604
BetaBin 0.5           19  -402.135  80.779     0.000  0.8230
BetaPois 0.5          30  -408.170  80.811     0.000  0.9097

SLIDE 21

Stroke Data

Output from function sizEst()

Distribution: BetaPois
Formula: y ~ beta01 + T.cos1 + T.sin1 - 1

        Estimate  Std. Error  t value  Pr(>|t|)
alpha1     2.309       0.289    8.002
beta01     5.071       0.025  204.435
T.cos1     0.060       0.016    3.673
T.sin1    -0.083       0.016   -5.249
Theta     11.231

Performance measures:

             measures
loglik       -408.170
chi.sq         80.811
df.residual    92.000
aic           824.341
bic           834.599

Reporting Probabilities:

         lower  estimate   upper
alpha1  0.8622    0.9097  0.9571
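
The reported numbers can be cross-checked with a few lines of arithmetic. The reporting probability is the logistic transform of alpha1. The AIC and BIC lines assume the usual definitions with $k = 4$ counted parameters and $n = \text{df.residual} + k = 96$ observations; both values are inferred here, not stated on the slide.

```python
import math

# Values copied from the sizEst() output
alpha1 = 2.309
loglik = -408.170
df_residual = 92

# Assumptions, not stated on the slide: 4 counted parameters, n = df.residual + k
k = 4
n = df_residual + k

pi_hat = math.exp(alpha1) / (1 + math.exp(alpha1))   # logistic transform of alpha1
aic = -2 * loglik + 2 * k
bic = -2 * loglik + k * math.log(n)
```

Under these assumptions the three values agree with the printed estimate 0.9097, aic 824.341 and bic 834.599 up to rounding of the inputs.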

SLIDE 22

Stroke Data

[Figure, Distribution=BetaPois, p0=0.3, four panels: mean function from Quasi-Poisson model (blue) and BetaPois model (red), y-axis Frequency; lambda function from BetaPois model (red), y-axis Frequency; histogram of Pearson residuals with N(0,1) density; ACF of Pearson residuals against lag.]

$\hat{\pi} = 0.91$

SLIDE 23

Crime Register Data

SIMO: Austrian online crime register

■ Time range: 2004 - 2007
■ weekly counts
■ 132 regions
■ different crime categories

In most cases Poisson overdispersion

SLIDE 24

GP model estimates

[Figure, two panels: Shop Lifting and Bicycle Theft, each showing Frequency against Week.]

Estimates in panel order: $\hat{\pi} = 0.69$, $\hat{\pi} = 0.67$

SLIDE 25

Beta-Poisson model estimates

[Figure, two panels: Shop Lifting and Bicycle Theft, each showing Frequency against Week.]

Estimates in panel order: $\hat{\pi} = 0.65$, $\hat{\pi} = 0.52$

SLIDE 27

Summary

■ Great variety of models
■ MLE-based implementation in R
■ Good performance for simulated data
■ Reasonable estimates for real data

Future work

■ Implement Conditional Poisson models
■ Non-nested testing for more than two models