SLIDE 1

Generalized sample selection model

Małgorzata Wojtyś¹, Giampiero Marra²

¹Plymouth University, ²University College London

XLII Konferencja "Statystyka Matematyczna", Będlewo, November 29, 2016

Małgorzata Wojtyś1, Giampiero Marra2 Generalized sample selection model

SLIDE 2

Plan

  • Sample selection problem: Classical Heckman model
  • Generalized model using GAM and copulae
  • Estimation approach
  • Real-life application example

SLIDE 3

Motivating example

Example: HIV prevalence

P(HIV positive) ∼ socio-economic and health characteristics

  • Some individuals in the sample refused to say whether they are HIV positive. They may differ in important characteristics from individuals who did answer the question.
  • If the link between the decision to provide an answer and being HIV positive exists and is not only through observables, then sample selection bias arises and a univariate equation model is not appropriate.

SLIDE 6

Sample selection

Regression of primary interest: $Y_i^* \sim x_i^{(1)}$, $i = 1, \dots, n$, where $x_i^{(1)}$ is a row vector of predictors.

But: observations on some $Y_i^*$ are missing, based on a combination of observed and unobserved characteristics.

Observables: $Y_i = Y_i^* U_i$, where $U_i$ is a binary selection variable, $U_i \in \{0, 1\}$.

Selection mechanism: $P(U_i = 1) \sim x_i^{(2)}$, where $x_i^{(2)}$ is a vector of covariates.

SLIDE 10

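Classical Heckman (1979) model

For $i = 1, \dots, n$:

$$Y_i^* = x_i^{(1)} \beta^{(1)} + \varepsilon_{1i}, \qquad U_i^* = x_i^{(2)} \beta^{(2)} + \varepsilon_{2i},$$

where

$$\begin{pmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix} \right).$$

Latent variables: $Y_i^*$, $U_i^*$.

Observables: $U_i = I(U_i^* > 0)$ (⇒ probit regression), $Y_i = Y_i^* U_i$.

Modifications: e.g. bivariate t-distribution (Marchenko & Genton, 2012), Archimedean copulas (Smith, 2003).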
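
The effect of the error correlation ρ on the observed sample can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes a single covariate in each equation, uniformly distributed covariates, and hypothetical function names (`simulate_heckman`, `selected_residual_mean`). It draws from the two-equation model above and averages $Y - x^{(1)}\beta^{(1)}$ over the selected units only.

```python
import random
import statistics


def simulate_heckman(n, beta1, beta2, sigma, rho, seed=0):
    """Draw n units from the two-equation model with one covariate per equation.

    Returns tuples (x1, x2, y, u) with u = I(u_star > 0) and y = y_star * u,
    so y is 0 for units whose outcome is not observed.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)  # illustrative covariate distribution
        x2 = rng.uniform(-1.0, 1.0)
        # eps2 ~ N(0, 1), eps1 ~ N(0, sigma^2), Cov(eps1, eps2) = rho * sigma
        eps2 = rng.gauss(0.0, 1.0)
        eps1 = sigma * (rho * eps2 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0))
        y_star = beta1 * x1 + eps1
        u_star = beta2 * x2 + eps2
        u = 1 if u_star > 0 else 0
        data.append((x1, x2, y_star * u, u))
    return data


def selected_residual_mean(data, beta1):
    """Average of y - x1 * beta1 over the selected units (u = 1)."""
    return statistics.fmean(y - beta1 * x1 for x1, _, y, u in data if u == 1)
```

With ρ = 0 the average residual among selected units should be close to zero, while with ρ > 0 it is shifted upwards. This is the bias that a regression fitted only to the observed outcomes would absorb.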

SLIDE 11

Generalized sample selection model

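Random component: $Y^* \sim f_1$ belongs to an exponential family of distributions:

$$f_1(y \mid \eta_1, \phi) = \exp\left\{ \frac{y\eta_1 - b(\eta_1)}{\phi} + c(y, \phi) \right\}$$

for some $b(\cdot)$ and $c(\cdot)$. It holds that $E(Y^*) = b'(\eta_1)$ and $\mathrm{Var}(Y^*) = \phi\, b''(\eta_1)$.

Selection variable: $U = I(U^* > 0)$ and $U^* \sim f_2$, where

$$f_2(u \mid \eta_2) = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{(u - \eta_2)^2}{2} \right\},$$

implying the probit regression model for $U$.

$F(y, u)$ is the joint cdf of $(Y^*, U^*)$; $F_1(y)$ and $F_2(u)$ are the marginal cdfs. $C_\theta$ is the copula such that

$$F(y, u) = C_\theta(F_1(y), F_2(u)),$$

where $\theta$ is the dependence parameter of the copula.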
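
To make the copula representation concrete, here is a minimal sketch that builds the joint cdf $F(y, u)$ from two margins and a copula. It is an illustration under stated assumptions rather than the package's implementation: the outcome margin is taken to be normal with mean `eta1` and standard deviation `sigma`, the copula is Clayton with θ > 0, and the function names are hypothetical.

```python
from statistics import NormalDist


def clayton_cdf(u, v, theta):
    """Clayton copula C_theta(u, v) for theta > 0 and u, v in (0, 1]."""
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)


def joint_cdf(y, u, eta1, eta2, sigma, theta):
    """F(y, u) = C_theta(F1(y), F2(u)) with normal margins (an assumption)."""
    f1 = NormalDist(eta1, sigma).cdf(y)  # marginal cdf of Y*
    f2 = NormalDist(eta2, 1.0).cdf(u)    # marginal cdf of U*, unit variance (probit)
    return clayton_cdf(f1, f2, theta)
```

Two sanity checks follow from the definition: for very large `y` the joint cdf reduces to $F_2(u)$, so `joint_cdf(y, 0, ...)` tends to $P(U = 0) = F_2(0)$, and as θ approaches 0 the Clayton copula approaches the independence copula $uv$.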

SLIDE 18

Likelihood

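Likelihood of an observed outcome $(Y, U)$:

$$L = \begin{cases} P(U = 0) & \text{if } U = 0, \\ f_{Y|U}(Y \mid U = 1)\, P(U = 1) & \text{if } U = 1. \end{cases}$$

It holds that

$$f_{Y|U}(y \mid U = 1) = \frac{\partial}{\partial y} P(Y \le y \mid U = 1) = \frac{\partial}{\partial y} \frac{P(Y^* \le y,\, U^* > 0)}{P(U = 1)} = \frac{\partial}{\partial y} \frac{F_1(y) - F(y, 0)}{P(U = 1)} = \frac{1}{P(U = 1)} \left( f_1(y) - \frac{\partial}{\partial y} F(y, 0) \right).$$

So

$$L = \begin{cases} P(U = 0) = F_2(0) & \text{if } U = 0, \\ f_1(y) - \dfrac{\partial}{\partial y} F(y, 0) \Big|_{y = Y} & \text{if } U = 1. \end{cases}$$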

SLIDE 20

Log-likelihood

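So:

$$L(Y, U) = F_2(0)^{1-U} \times \left( f_1(y) - \frac{\partial}{\partial y} F(y, 0) \Big|_{y = Y} \right)^{U}.$$

Using the copula representation, we obtain the log-likelihood

$$\ell = (1 - U) \log F_2(0) + U \log\big( f_1(Y)\, (1 - z(Y, \eta_1, \eta_2)) \big),$$

where

$$z(y, \eta_1, \eta_2) = \frac{\partial}{\partial v} C_\theta(v, F_2(0)) \Big|_{v \to F_1(y)}.$$

The function $z$ can also be expressed as

$$z(y, \eta_1, \eta_2) = P(U = 0)\, f_{Y^*|U}(y \mid U = 0)\, \big( f_1(y \mid \eta_1) \big)^{-1}.$$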
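
The log-likelihood contribution of one observation can be sketched as follows. This is a hedged illustration, not the SemiParSampleSel code: it assumes a normal outcome margin with mean `eta1` and standard deviation `sigma`, a Clayton copula with θ > 0 (for which the partial derivative has the closed form used in `clayton_z`), and hypothetical function names.

```python
import math
from statistics import NormalDist


def clayton_z(v, w, theta):
    """d/dv C_theta(v, w) for the Clayton copula, theta > 0.

    Written as (1 + v^theta * (w^-theta - 1))^(-(1 + theta) / theta), which is
    algebraically equal to v^(-theta-1) * (v^-theta + w^-theta - 1)^(-1/theta-1).
    """
    return (1.0 + v ** theta * (w ** (-theta) - 1.0)) ** (-(1.0 + theta) / theta)


def loglik_contribution(y, u, eta1, eta2, sigma, theta):
    """ell = (1 - U) log F2(0) + U log(f1(Y) (1 - z)) for a single observation."""
    f2_0 = NormalDist(eta2, 1.0).cdf(0.0)  # F2(0) = P(U = 0)
    if u == 0:
        return math.log(f2_0)
    margin = NormalDist(eta1, sigma)       # assumed normal outcome margin
    z = clayton_z(margin.cdf(y), f2_0, theta)
    return math.log(margin.pdf(y)) + math.log(1.0 - z)
```

A quick consistency check is that the two branches define a proper distribution: $F_2(0)$ plus the integral of $f_1(y)(1 - z(y))$ over $y$ should equal 1. Note that in the far lower tail, where $F_1(y)$ underflows towards 0, $z$ rounds to 1 and the logarithm is undefined; a production implementation would need to guard against that.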

SLIDE 21

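Write the log-likelihood as $\ell = (1 - U) \log F_2(0) + U \{ \log f_1(Y) + \log(1 - z(Y, \eta_1, \eta_2)) \}$. Since $\frac{\partial}{\partial \eta_1} \log f_1(Y) = (Y - b'(\eta_1))/\phi$ and $E(Y^*) = b'(\eta_1)$, differentiating with respect to $\eta_1$ gives

$$\frac{\partial}{\partial \eta_1} \ell = \frac{U (Y - \mu_1)}{\phi} - U\, \frac{\frac{\partial}{\partial \eta_1} z(Y, \eta_1, \eta_2)}{1 - z(Y, \eta_1, \eta_2)},$$

where $\mu_1 = E(Y^*) = b'(\eta_1)$. As $E\left( \frac{\partial}{\partial \eta_1} \ell \right) = 0$ and $E\{ U (Y - \mu_1) \} = \mathrm{Cov}(U, Y^*)$,

$$\mathrm{Cov}(U, Y^*) = \phi\, E\left[ U\, \frac{\frac{\partial}{\partial \eta_1} z(Y, \eta_1, \eta_2)}{1 - z(Y, \eta_1, \eta_2)} \right],$$

which provides another interpretation for the function $z(Y, \eta_1, \eta_2)$.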

SLIDE 26

Systematic component

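We assume that for $j = 1, 2$

$$\eta_j(x^{(j)}) = \eta_1^{(j)}(x_1^{(j)}) + \eta_2^{(j)}(x_2^{(j)}) + \dots + \eta_{D_j}^{(j)}(x_{D_j}^{(j)}).$$

The functions $\eta_j(x^{(j)})$ are unknown.

Spline basis functions are used to approximate every $\eta_l^{(j)}(x)$ by

$$\sum_{k=-p+1}^{K} \beta_{k,l}^{(j)} B_{k,p}(x) \quad \text{for } l = 1, \dots, D_j.$$

Vector of parameters: $\delta = (\beta^{(1)}, \beta^{(2)}, \phi, \theta)$.

Maximise the penalized likelihood:

$$\ell_p(\delta) = \ell(\delta) - \frac{1}{2} \delta^T \tilde{S}_\lambda \delta,$$

where $\tilde{S}_\lambda$ is the penalty matrix, with zeros corresponding to $\phi$ and $\theta$, and $\lambda$ is the vector of smoothing parameters.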
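
The role of the zero-padded penalty matrix can be shown with a short sketch. The slides do not specify the form of the penalty, so the second-order difference penalty used here (in the style of P-splines) is an assumption made for illustration only, as are the function names and the restriction to a single smooth term with one smoothing parameter.

```python
def second_difference_penalty(k):
    """k x k matrix D^T D, where D takes second-order differences of coefficients."""
    d = [[0.0] * k for _ in range(k - 2)]
    for r in range(k - 2):
        d[r][r], d[r][r + 1], d[r][r + 2] = 1.0, -2.0, 1.0
    return [[sum(row[i] * row[j] for row in d) for j in range(k)] for i in range(k)]


def padded_penalty(block, lam, n_unpenalized):
    """lam * block, extended by zero rows and columns for unpenalized parameters."""
    k = len(block)
    size = k + n_unpenalized
    s = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            s[i][j] = lam * block[i][j]
    return s


def penalized_loglik(loglik, delta, s):
    """ell_p(delta) = ell(delta) - 0.5 * delta^T S delta."""
    m = len(delta)
    quad = sum(delta[i] * s[i][j] * delta[j] for i in range(m) for j in range(m))
    return loglik - 0.5 * quad
```

Here `delta` would hold the spline coefficients followed by φ and θ, and `padded_penalty(block, lam, 2)` leaves the last two entries unpenalized. With this particular penalty, coefficients that lie on a straight line incur no penalty, and changing φ or θ never changes the penalty term.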

SLIDE 27

R package

SemiParSampleSel allows distributions: normal, Gamma, Bernoulli; and copulas: normal, Clayton, Joe, FGM, AMH, Frank, Gumbel and their 90° rotations.

[Figure: contour plots for six copulas, both axes running from −3 to 3. Panel titles: Clayton (0 ≤ τ < 1), Joe (0 ≤ τ < 1), FGM (−0.22 < τ < 0.22), AMH (−0.1817 ≤ τ < 1/3), Frank (−1 < τ < 1), Gumbel (0 ≤ τ < 1).]

SLIDE 31

RAND Health Insurance example

Dependent variable: annual medical expenses of an individual.

Predictors: socio-economic characteristics (income, health, age, family size, etc.).

n = 5574, out of which 4281 used health services.

We fit 10 models using all available copulas and thin plate regression splines with basis dimension 10.

⇒ AIC is minimal for the 90° rotated Clayton copula.

In this model: $\hat{\theta} = -2$ $(-2.5, -1.6)$ ⇒ Kendall's τ = −0.5, negative and significantly different from zero.
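
The step from θ to Kendall's τ can be checked with a few lines. For the Clayton copula with θ > 0, τ = θ/(θ + 2). The sketch below additionally assumes that the 90° rotated Clayton copula is parameterized by a negative θ with τ equal to minus the Clayton value at −θ; this convention is an assumption, although it does reproduce the τ = −0.5 reported for $\hat{\theta} = -2$. The function names are hypothetical.

```python
def clayton_tau(theta):
    """Kendall's tau of the Clayton copula, theta > 0."""
    if theta <= 0:
        raise ValueError("theta must be positive for the Clayton copula")
    return theta / (theta + 2.0)


def clayton90_tau(theta):
    """Kendall's tau of the 90-degree rotated Clayton copula (assumed theta < 0)."""
    if theta >= 0:
        raise ValueError("theta must be negative for the 90-degree rotation")
    return -clayton_tau(-theta)
```

Applying `clayton90_tau` to the interval endpoints −2.5 and −1.6 gives values of roughly −0.56 and −0.44, so the whole interval maps to negative dependence under this convention.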

SLIDE 32

Fitted smooth curves

[Figure: five panels of fitted smooth curves, with vertical axes labelled s(pi,4.35), s(inc,3.4), s(fam,1), s(educdec,2.18) and s(xage,7.09), each plotted against the corresponding covariate.]

pi - participation incentive payment; inc - family income; fam - family size; educdec - education of household head in years; xage - age of the individual in years.

SLIDE 33

Fitted smooth curves for "naive" GAM

[Figure: five panels of fitted smooth curves, with vertical axes labelled s(pi,2.25), s(inc,1.65), s(fam,1.64), s(educdec,1) and s(xage,8.11), each plotted against the corresponding covariate.]

pi - participation incentive payment; inc - family income; fam - family size; educdec - education of household head in years; xage - age of the individual in years.

SLIDE 34

References:

  • Wojtyś M, Marra G, Radice R (2016) Copula Regression Spline Sample Selection Models: The R Package SemiParSampleSel. Journal of Statistical Software.
  • Heckman J (1979) Sample selection bias as a specification error. Econometrica 47:153-1.
  • Marra G, Radice R (2013) Estimation of a regression spline sample selection model. Comp Stat Data Anal.
  • Nelsen R (2006) An Introduction to Copulas. New York: Springer.
  • Smith M (2003) Modelling sample selection using Archimedean copulas. Econometrics Journal, 6:99–123.

SLIDE 35

Theoretical foundations of the model

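Let $\delta_0$ be such that $E\,G(\delta_0) = 0$, where $G(\delta) = \frac{\partial}{\partial \delta} \ell(\delta)$.

The aim: to prove that $\hat{\delta} \to \delta_0$ and $\hat{\eta} := X\hat{\delta} \to \eta_0 := X\delta_0$ in probability.

Denote:

$$G_n(\delta) = \frac{\partial}{\partial \delta} \ell(\delta) = \left( \frac{\partial \ell_p}{\partial \alpha}(\delta),\ \frac{\partial \ell}{\partial \beta}(\delta),\ \frac{\partial \ell_p}{\partial \theta}(\delta) \right), \qquad H_n(\delta) = \frac{\partial^2}{\partial \delta\, \partial \delta^t} \ell(\delta).$$

Typical assumptions:

$$G_n(\delta_0) = O_P(n^{1/2}), \quad E H_n(\delta_0) = O(n), \quad H_n(\delta_0) - E H_n(\delta_0) = O_P(n^{1/2}), \quad \tilde{S}_\lambda = o(n^{1/2}).$$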

SLIDE 36

Assumptions

  • A1. All partial derivatives up to order 3 of the copula function $C_\theta(u, v)$ w.r.t. $u$, $v$ and $\theta$ exist and are bounded.
  • A2. The function $z(\cdot)$ is bounded away from 1.
  • A3. $\max_{l = 1, 2;\ j = 1, \dots, D_l} \lambda_j^{(l)} = o(n^{\gamma})$, where $\gamma \le \frac{2}{2p+3}$.
  • A4. The explanatory variables $x^{(1)}$ and $x^{(2)}$ are distributed on unit cubes $[0, 1]^{D_1}$ and $[0, 1]^{D_2}$, respectively.
  • A5. The knots of the B-spline basis are equidistantly located so that $\kappa_k - \kappa_{k-1} = K_n^{-1}$ for $k = 1, \dots, K_n$, and the dimension of the spline basis satisfies $K_n = O(n^{1/(2p+3)})$.
  • A6. $K_n$ is such that $(D_1 + D_2)(K_n + p) < n$.

SLIDE 37

Theoretical foundations of the model

  • Theorem. Under assumptions (A1)-(A6),

(a) the estimate $\hat{\delta}$ has the asymptotic expansion

$$\hat{\delta} - \delta_0 = -F_p(\delta_0)^{-1} G_p(\delta_0) (1 + o(1)) + o_P(K_n / n);$$

(b) the estimate $\hat{\eta}(x)$ has the asymptotic expansion

$$\hat{\eta}(x) - \eta_0(x) \approx X(x) F_p^{-1} G_p(\delta_0),$$

which implies

$$\mathrm{MSE}(\hat{\eta}(x)) = E(\hat{\eta}(x) - \eta_0(x))^2 = O(n^{-(2p+2)/(2p+3)}).$$
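
As a small worked example of the rates appearing in assumption A5 and in the theorem, the exponents can be evaluated for a given $p$. The sketch assumes that $p$ is the index of the B-spline basis $B_{k,p}$ used earlier; the function name is hypothetical.

```python
from fractions import Fraction


def spline_rate_exponents(p):
    """Exponents of n in K_n = O(n^{1/(2p+3)}) and MSE = O(n^{-(2p+2)/(2p+3)})."""
    knots_exponent = Fraction(1, 2 * p + 3)
    mse_exponent = Fraction(-(2 * p + 2), 2 * p + 3)
    return knots_exponent, mse_exponent
```

For p = 3 this gives $K_n = O(n^{1/9})$ and an MSE of order $n^{-8/9}$, so the rate approaches the parametric $n^{-1}$ as $p$ grows.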
