Computational treatment of the error distribution in nonparametric - - PowerPoint PPT Presentation



SLIDE 1

Estimation Asymptotic results Bandwidth selection Simulations Data Analysis

Computational treatment of the error distribution in nonparametric regression with right-censored and selection-biased data

Géraldine LAURENT Jointly with Cédric HEUCHENNE

QuantOM, HEC-ULg Management School-University of Liege

Tuesday, 24 August 2010

SLIDE 2

Between 1987 and 1997, the Spanish Institute for Statistics studied unemployment among the active population, and in particular among married women. For these data, we note that

  • the unemployment duration is not always completely observed,
  • the age of the woman influences the time until her next job.

SLIDE 3

[Figure: scatter plot. Axes: Woman age (in months), Unemployment duration (in months). Legend: Censored, Observed.]

SLIDE 4

Estimation Asymptotic results Bandwidth selection Simulations Data Analysis

SLIDE 5

Estimation

SLIDE 6

We consider the nonparametric regression model Y = m(X) + σ(X)ε where

  • Y is the response variable
  • X is the covariate
  • m(·) = E[Y | ·] and σ²(·) = Var[Y | ·] are unknown smooth functions

  • ε is independent of X, with E[ε] = 0 and Var[ε] = 1

SLIDE 7

Particularities of (X, Y)

  • (X, Y) is obtained from cross-sectional sampling
  • Y is subject to right censoring.

We study the variable Y delimited by T ≤ Y ≤ C where

  • T is the truncation variable
  • C is the censoring variable.

SLIDE 8

Real World

[Figure: timeline with axis label "Time".]

We use F to denote a cdf.

SLIDE 9

Real World

[Figure: timeline with axis label "Time" and a marked "Truncation Time".]

We use F to denote a cdf.

SLIDE 10

Real World

[Figure: timeline with axis label "Time" and a marked "Truncation Time".]

We use F to denote a cdf.

SLIDE 11

Intermediate Observed World

[Figure: timeline with axis label "Time" and a marked "Truncation Time"; labelled points Y1, C2, C6, Y4, C5, Y3.]

We use H to denote a cdf and n to denote the sample size.

SLIDE 12

Observed World

[Figure: timeline with axis label "Time" and a marked "Truncation Time"; labelled points Y1, Y4, Y3.]

We use H to denote a cdf.

SLIDE 13

Aim: estimation of the error distribution $F_\varepsilon(e) = P(\varepsilon \le e)$ from (X, Y) observed subject to T ≤ Y ≤ C, where

  • the distribution $F_{T|X}$ is a parametric distribution
  • the distribution $F_{C-T|X}$ is completely unknown

SLIDE 14

Assumptions:

  • the variables Y and T are independent, conditionally on X
  • for each value x, the support of $F_{Y|X}(\cdot \mid x)$ is included in the support of $F_{T|X}(\cdot \mid x)$
  • the lower bound of the support of T is zero
  • the variables (T, Y) and C − T are independent, conditionally on T ≤ Y, X

SLIDE 15

We have

$$H_{X,Y}(x, y) = P(X \le x, Y \le y \mid T \le Y \le C) = (E[w(X, Y)])^{-1} \int_{r \le x} \int_{s \le y} w(r, s)\, dF_{X,Y}(r, s),$$

where the weight function w(x, y) is defined by

$$w(x, y) = \int_{t \le y} \{1 - G(y - t \mid x)\}\, dF_{T|X}(t \mid x)$$

and $G(z \mid x) = P(C - T \le z \mid X = x, T \le Y)$.

SLIDE 16

In particular, if C = T + τ where τ is a positive constant, the weight function is

$$w(x, y) = \int_{0 \vee (y - \tau)}^{y} dF_{T|X}(t \mid x)$$

by applying the same procedure.
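
As an illustration only, suppose that T given X = x is uniform on [0, a] and does not depend on x (an assumption made for this example, not a statement from the slide). The integral then has the closed form w(x, y) = F_T(y) − F_T(max(0, y − τ)). A minimal Python sketch, where the names `uniform_cdf` and `weight_constant_window` are hypothetical:

```python
def uniform_cdf(t, a):
    """cdf of a Uniform[0, a] truncation variable (assumed for illustration)."""
    if t <= 0.0:
        return 0.0
    if t >= a:
        return 1.0
    return t / a


def weight_constant_window(y, tau, a):
    """w(x, y) = integral of dF_T over [max(0, y - tau), y] when C = T + tau.

    F_T is taken to be Uniform[0, a] and independent of x in this sketch.
    """
    lower = max(0.0, y - tau)
    return uniform_cdf(y, a) - uniform_cdf(lower, a)


# Example call: observation window of 18 time units, truncation on [0, 100].
w_short = weight_constant_window(10.0, 18.0, 100.0)  # window cut off at 0
w_long = weight_constant_window(50.0, 18.0, 100.0)   # full window inside [0, a]
```

Under this assumption the weight grows linearly in y until y reaches τ and is constant afterwards, as long as y stays within the support of T.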

SLIDE 17

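We obtain

$$F_{X,Y}(x, y) = \int_{r \le x} \int_{s \le y} \frac{E[w(X, Y)]}{w(r, s)}\, dH_{X,Y}(r, s).$$

Therefore,

$$F_\varepsilon(e) = P\left( \frac{Y - m(X)}{\sigma(X)} \le e \right) = \iint_{\left\{(x,y):\, \frac{y - m(x)}{\sigma(x)} \le e\right\}} dF_{X,Y}(x, y) = \iint_{\left\{(x,y):\, \frac{y - m(x)}{\sigma(x)} \le e\right\}} \frac{E[w(X, Y)]}{w(x, y)}\, dH_{X,Y}(x, y).$$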
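
One consequence of the expression for F_{X,Y} is worth spelling out, since it shows how E[w(X, Y)] can be recovered from the observed distribution alone. Letting x and y tend to infinity, the left-hand side tends to 1, which gives (assuming w > 0 on the support of H_{X,Y}):

```latex
1 = E[w(X, Y)] \iint \frac{1}{w(r, s)}\, dH_{X,Y}(r, s)
\quad \Longrightarrow \quad
E[w(X, Y)] = \left( \iint \frac{1}{w(r, s)}\, dH_{X,Y}(r, s) \right)^{-1}.
```

In other words, E[w(X, Y)] is the inverse of the mean of 1/w under the observed distribution.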

SLIDE 18

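Thus, the estimator is

$$\hat F_\varepsilon(e) = \frac{1}{M} \sum_{i=1}^{n} \frac{\hat E[w(X, Y)]}{\hat w(X_i, Y_i)}\, I\{\hat\varepsilon_i \le e, \Delta_i = 1\}$$

with

$$\hat\varepsilon_i = \frac{Y_i - \hat m(X_i)}{\hat\sigma(X_i)}, \qquad M = \sum_{i=1}^{n} \Delta_i, \qquad \hat E[w(X, Y)] = \left( \frac{1}{M} \sum_{i=1}^{n} \frac{\Delta_i}{\hat w(X_i, Y_i)} \right)^{-1},$$

where the functions $\hat m(\cdot)$, $\hat\sigma(\cdot)$ and $\hat w(\cdot, \cdot)$ are nonparametric estimators.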
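
The estimator is a weighted empirical cdf of the uncensored residuals. A minimal Python sketch, assuming the residuals and the weights have already been computed elsewhere; the function name `error_cdf_estimate` and its argument layout are hypothetical:

```python
def error_cdf_estimate(e, residuals, weights, deltas):
    """Weighted empirical cdf of the residuals, evaluated at the point e.

    residuals[i] : estimated residual (Y_i - m_hat(X_i)) / sigma_hat(X_i)
    weights[i]   : estimated weight w_hat(X_i, Y_i), assumed strictly positive
    deltas[i]    : 1 if Y_i is uncensored, 0 otherwise
    """
    m_uncensored = sum(deltas)
    # E_hat[w] is the inverse of the mean of 1 / w_hat over uncensored points.
    mean_inv_w = sum(d / w for d, w in zip(deltas, weights)) / m_uncensored
    e_w_hat = 1.0 / mean_inv_w
    total = sum(
        e_w_hat / w
        for r, w, d in zip(residuals, weights, deltas)
        if d == 1 and r <= e
    )
    return total / m_uncensored
```

With constant weights this reduces to the ordinary empirical cdf of the uncensored residuals, and for any positive weights the estimate reaches 1 once e exceeds every uncensored residual.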

SLIDE 19

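For G(t|x), we use the Beran (1981) estimator defined by

$$\hat G(t \mid x) = 1 - \prod_{Z_i \le t,\, \Delta_i = 0} \left\{ 1 - \frac{W_i(x, h_n)}{\sum_{j=1}^{n} W_j(x, h_n)\, I\{Z_j \ge Z_i\}} \right\}$$

where

  • $Z_i = \min(C_i - T_i, Y_i - T_i)$ and $\Delta_i = I\{Y_i \le C_i\}$
  • $W_i(x, h_n) = K\left(\frac{x - X_i}{h_n}\right) \Big/ \sum_{j=1}^{n} K\left(\frac{x - X_j}{h_n}\right)$ are the Nadaraya-Watson weights
  • K is a kernel function
  • $h_n$ is a bandwidth sequence tending to 0 as n → ∞

$$\Rightarrow \quad \hat w(x, y) = \int_{t \le y} \{1 - \hat G(y - t \mid x)\}\, dF_{T|X}(t \mid x)$$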
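
The Beran estimator can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the Gaussian kernel is an assumption (the slides do not name K), ties among the Z_i are not treated specially, and the function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as an assumed kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw_weights(x, xs, h, kernel=gaussian_kernel):
    """Nadaraya-Watson weights W_i(x, h) = K((x - X_i)/h) / sum_j K((x - X_j)/h)."""
    k = [kernel((x - xi) / h) for xi in xs]
    total = sum(k)
    return [ki / total for ki in k]


def beran_cdf(t, x, xs, zs, deltas, h, kernel=gaussian_kernel):
    """Beran estimate of G(t | x), the conditional cdf of C - T.

    A censoring time is observed exactly when delta_i = 0, so the product
    runs over the indices with Z_i <= t and delta_i = 0.
    """
    w = nw_weights(x, xs, h, kernel)
    survival = 1.0
    for i, (zi, di) in enumerate(zip(zs, deltas)):
        if zi <= t and di == 0:
            # Kernel mass of the observations still at risk at time Z_i.
            at_risk = sum(wj for wj, zj in zip(w, zs) if zj >= zi)
            survival *= 1.0 - w[i] / at_risk
    return 1.0 - survival
```

When all covariate values coincide the weights are equal and the estimate reduces to a Kaplan-Meier estimator for the censoring variable. If the bandwidth is so small that every kernel value underflows to zero, `nw_weights` divides by zero, so a practical version would need a guard.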

SLIDE 20

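The estimators of m(·) and σ(·) are given by

$$\hat m(x) = \frac{\sum_{i=1}^{n} W_i(x, h_n)\, Y_i\, \Delta_i / \hat w(x, Y_i)}{\sum_{i=1}^{n} W_i(x, h_n)\, \Delta_i / \hat w(x, Y_i)}, \qquad \hat\sigma^2(x) = \frac{\sum_{i=1}^{n} W_i(x, h_n)\, \Delta_i\, (Y_i - \hat m(x))^2 / \hat w(x, Y_i)}{\sum_{i=1}^{n} W_i(x, h_n)\, \Delta_i / \hat w(x, Y_i)},$$

an extension of the estimators in de Uña-Alvarez and Iglesias-Pérez (2008).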
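
Both estimators are kernel-weighted averages over the uncensored observations, with each observation additionally divided by its estimated weight. A minimal Python sketch under stated assumptions: the Gaussian kernel is assumed, `w_hat` is a user-supplied callable standing in for the weight estimator, and the name `weighted_nw_fit` is hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as an assumed kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def weighted_nw_fit(x, xs, ys, deltas, w_hat, h, kernel=gaussian_kernel):
    """Return (m_hat(x), sigma2_hat(x)) from the uncensored observations.

    w_hat(x, y) returns the estimated weight at (x, y). The normalising
    constant of the Nadaraya-Watson weights cancels in both ratios, so raw
    kernel values are used directly.
    """
    coeffs = [
        kernel((x - xi) / h) / w_hat(x, yi) if d == 1 else 0.0
        for xi, yi, d in zip(xs, ys, deltas)
    ]
    denom = sum(coeffs)
    m_hat = sum(c * yi for c, yi in zip(coeffs, ys)) / denom
    sigma2_hat = sum(c * (yi - m_hat) ** 2 for c, yi in zip(coeffs, ys)) / denom
    return m_hat, sigma2_hat
```

Censored observations get a zero coefficient, so `w_hat` is never evaluated for them.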

SLIDE 21

Asymptotic results

SLIDE 22

Under some assumptions,

$$\hat F_\varepsilon(e) - F_\varepsilon(e) = \sum_{i=1}^{n} V(X_i, Y_i, Z_i, \Delta_i, e) + o_p(n^{-1/2})$$

uniformly in e.

=> Weak convergence of the process $\sqrt{n}\,(\hat F_\varepsilon(e) - F_\varepsilon(e)) \to \Omega(e)$, where Ω is a zero-mean Gaussian process with a complicated covariance function.

SLIDE 23

Bandwidth selection

SLIDE 24

We want to determine the smoothing parameter $h_n$ which minimizes

$$\mathrm{MISE} = E\left[ \int \left\{ \hat F_{\varepsilon,h_n}(e) - F_\varepsilon(e) \right\}^2 de \right].$$

We consider a bootstrap procedure which is an extension of Li and Datta (2001).

SLIDE 25

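For b = 1, ..., B and for i = 1, ..., n:

Step 1 Generate $X^*_{i,b}$ from

$$\hat F_X(\cdot) = \sum_{j=1}^{n} \frac{\hat E[w(X, Y)]}{\hat E[w(X, Y) \mid X = \cdot]}\, I\{X_j \le \cdot, \Delta_j = 1\},$$

where

$$\hat E[w(X, Y) \mid X = \cdot] = \sum_{j=1}^{n} W_j(\cdot, g_n)\, \Delta_j \Big/ \sum_{j=1}^{n} \frac{W_j(\cdot, g_n)\, \Delta_j}{\hat w(\cdot, Y_j)}$$

and $g_n$ is a pilot bandwidth.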

SLIDE 26

Step 2 Generate $Y^*_{i,b}$ from

$$\hat F_{Y|X}(\cdot \mid X^*_{i,b}) = \sum_{j=1}^{n} \frac{\hat E[w(X, Y) \mid X = X^*_{i,b}]\; W_j(X^*_{i,b}, g_n)}{\hat w(X^*_{i,b}, Y_j) \left( \sum_{k=1}^{n} W_k(X^*_{i,b}, g_n)\, \Delta_k \right)}\, I\{Y_j \le \cdot, \Delta_j = 1\}$$

Step 3 Draw $T^*_{i,b}$ from the distribution $F_{T|X}(\cdot \mid X^*_{i,b})$.

  • If $T^*_{i,b} > Y^*_{i,b}$, then reject $(X^*_{i,b}, Y^*_{i,b}, T^*_{i,b})$ and go to Step 1.
  • Otherwise, go to Step 4.

Step 4 Select at random $V^*_{i,b}$ from $\hat G(\cdot \mid X^*_{i,b})$ calculated with $g_n$.

Step 5 Define

  • $Z^*_{i,b} = \min(Y^*_{i,b} - T^*_{i,b}, V^*_{i,b})$
  • $\Delta^*_{i,b} = I\{Y^*_{i,b} - T^*_{i,b} \le V^*_{i,b}\}$.
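
The control flow of Steps 1 to 5 for a single bootstrap observation can be sketched as follows. This is a sketch under stated assumptions: the four samplers are hypothetical callables standing in for draws from the estimated distributions, and the `max_tries` safeguard is an addition that is not part of the procedure on the slides.

```python
import random


def draw_bootstrap_observation(sample_x, sample_y_given_x, sample_t_given_x,
                               sample_v_given_x, rng, max_tries=10_000):
    """Draw one tuple (X*, T*, Z*, Delta*) following Steps 1 to 5.

    sample_x(rng)            : draw from the estimate of F_X         (Step 1)
    sample_y_given_x(x, rng) : draw from the estimate of F_{Y|X}     (Step 2)
    sample_t_given_x(x, rng) : draw from the parametric F_{T|X}      (Step 3)
    sample_v_given_x(x, rng) : draw from the Beran estimate of G     (Step 4)
    """
    for _ in range(max_tries):
        x = sample_x(rng)
        y = sample_y_given_x(x, rng)
        t = sample_t_given_x(x, rng)
        if t > y:
            continue  # truncated draw: reject and restart from Step 1
        v = sample_v_given_x(x, rng)
        z = min(y - t, v)                 # Step 5: observed duration
        delta = 1 if y - t <= v else 0    # Step 5: censoring indicator
        return x, t, z, delta
    raise RuntimeError("no accepted draw; P(T <= Y) may be very small")


# Toy usage with made-up samplers, only to show the calling convention.
rng = random.Random(0)
obs = draw_bootstrap_observation(
    lambda r: 1.0,
    lambda x, r: 5.0,
    lambda x, r: r.uniform(0.0, 10.0),
    lambda x, r: 2.0,
    rng,
)
```

Repeating the call n times gives one resample, and repeating that B times gives the B resamples used to compare bandwidths.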

SLIDE 27

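Compute $\hat F^*_{\varepsilon,h_n,b}$, the estimator of the error distribution based on

  • bandwidth $h_n$
  • resample $\{(X^*_{i,b}, T^*_{i,b}, Z^*_{i,b}, \Delta^*_{i,b}) : i = 1, \ldots, n\}$.

The bandwidth minimizing the MISE can then be approximated by

$$\operatorname*{argmin}_{h_n}\; B^{-1} \sum_{b=1}^{B} \int \left\{ \hat F^*_{\varepsilon,h_n,b}(e) - \hat F_{\varepsilon,g_n}(e) \right\}^2 de.$$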
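
Once the bootstrap curves are available, the selection itself is a small computation. A minimal Python sketch, assuming every curve has been evaluated on a common finite grid of e values and that the integral is approximated by the trapezoidal rule (both are assumptions of this sketch); the function names are hypothetical.

```python
def integrated_squared_error(grid, f_values, g_values):
    """Trapezoidal approximation of the integral of (f - g)^2 over the grid."""
    sq = [(f - g) ** 2 for f, g in zip(f_values, g_values)]
    return sum(
        0.5 * (sq[k] + sq[k + 1]) * (grid[k + 1] - grid[k])
        for k in range(len(grid) - 1)
    )


def select_bandwidth(grid, boot_curves, pilot_curve):
    """Return the candidate bandwidth minimising the bootstrap criterion.

    boot_curves : dict mapping each candidate h to a list of B curves,
                  each being a bootstrap estimate evaluated on the grid
    pilot_curve : the estimate computed with the pilot bandwidth g,
                  evaluated on the same grid
    """
    def criterion(h):
        curves = boot_curves[h]
        return sum(
            integrated_squared_error(grid, c, pilot_curve) for c in curves
        ) / len(curves)

    return min(boot_curves, key=criterion)


# Toy usage with made-up curves on a three-point grid.
grid = [0.0, 1.0, 2.0]
pilot = [0.0, 0.5, 1.0]
curves = {0.1: [[0.0, 0.5, 1.0], [0.0, 0.6, 1.0]], 0.5: [[0.2, 0.7, 1.0]]}
best_h = select_bandwidth(grid, curves, pilot)
```

The grid should cover the range of the estimated residuals, since the criterion ignores everything outside it.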

SLIDE 28

Simulations

SLIDE 29

We consider

  • model Y = X + ε where
      • X ∼ U([1.7321; 2])
      • ε ∼ U([−√3; √3])
  • model log Y = X + ε where
      • X ∼ U([0; 1])
      • ε ∼ N(0; 1)
  • model Y = X² + X·ε where
      • X ∼ U([2; 2√3])
      • ε ∼ U([−√3; √3])
  • model log Y = X² + X·ε where
      • X ∼ U([0; 1])
      • ε ∼ N(0; 1)

where X and ε are independent in each model.

SLIDE 30

Homoscedastic model: Y = X + ε

Dist. of T          Dist. of C − T      Censoring proportion   MISE (×10⁻³)
Unif([0; 4])        Exp(2/5)            0.37                   5.5
Unif([0; 4])        Exp(2/7)            0.28                   4.9
Unif([0; X + 2])    Exp(2/5)            0.36                   5.2
Unif([0; X + 2])    Exp(2/7)            0.29                   5.0
4∗Beta(0.5; 1)      Exp(2/7)            0.34                   4.2
4∗Beta(0.5; 1)      Exp(2/9)            0.29                   4.0
Unif([0; 4])        Exp(1/(X + 1.5))    0.28                   4.6
4∗Beta(0.5; 1)      Exp(1/(X² − 1))     0.34                   4.5
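
To make the sampling scheme concrete, the first design in the table can be simulated as follows. This is a sketch under stated assumptions: Exp(2/5) is read as an exponential distribution with rate 0.4 (the slides do not say whether the parameter is a rate or a mean), T is drawn independently of X, and the function name `simulate_observed_sample` is hypothetical.

```python
import random


def simulate_observed_sample(n, rng, t_max=4.0, censor_rate=0.4):
    """Simulate n observed tuples (X, T, Z, Delta) for the model Y = X + eps.

    X ~ U([1.7321; 2]), eps ~ U([-sqrt(3); sqrt(3)]), T ~ Unif([0; t_max]),
    C - T ~ exponential with rate censor_rate (assumed parametrisation).
    A unit enters the sample only if T <= Y.
    """
    sqrt3 = 3.0 ** 0.5
    sample = []
    while len(sample) < n:
        x = rng.uniform(1.7321, 2.0)
        y = x + rng.uniform(-sqrt3, sqrt3)
        t = rng.uniform(0.0, t_max)
        if t > y:
            continue  # truncated: this unit is never observed
        c_minus_t = rng.expovariate(censor_rate)
        z = min(y - t, c_minus_t)                 # observed duration
        delta = 1 if y - t <= c_minus_t else 0    # 1 = uncensored
        sample.append((x, t, z, delta))
    return sample


sample = simulate_observed_sample(200, random.Random(1))
```

Because larger values of Y are more likely to satisfy T ≤ Y, the retained sample over-represents long durations, which is the selection bias that the weight function corrects.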

SLIDE 31

Heteroscedastic model: Y = X² + X·ε

Dist. of T          Dist. of C − T      Censoring proportion   MISE (×10⁻³)
Unif([0; 18])       Exp(0.1)            0.34                   6.9
Unif([0; 18])       Exp(0.05)           0.19                   6.2
18∗Beta(0.5; 1)     Exp(1/12)           0.35                   6.3
18∗Beta(0.5; 1)     Exp(1/15)           0.29                   6.2
Unif([0; X + 16])   Exp(1/12)           0.29                   6.2
18∗Beta(0.5; 1)     Exp(1/(2X² − 1))    0.30                   6.6

SLIDE 32

Interval containing 90% of the values of $\hat F_\varepsilon$

[Figure, two panels. x-axis: epsilon; y-axis: cumulative distribution function. Curves in each panel: real cdf of error; 5%, 50% and 95% quantiles of the estimated ε cdf.
Panel 1: Y = X² + X·ε where T ∼ 18∗Beta(0.5; 1) and C − T ∼ Exp(1/12).
Panel 2: log(Y) = X² + X·ε where T ∼ Exp(2) and C − T ∼ Exp(0.375).]

SLIDE 33

Data analysis

SLIDE 34

For the real data, we suppose that

  • the number of time periods is equal to 1009, of which only 446 are uncensored;
  • the distribution of T is uniform (Wang, 1991);
  • the variable C is defined by C = T + τ, where τ is a constant equal to 18 months.

The bootstrap approximation gives 70 months as the optimal bandwidth.

SLIDE 35

Representation of $\hat F_{Y|X}$ for various ages.

[Figure. x-axis: Unemployment Time (months); y-axis: Cumulative distribution function. Curves: cdf of Y|X at 25 years, 30 years, 50 years and 55 years.]

SLIDE 36

Thank you for your attention

SLIDE 37

  • ASGHARIAN, M., M'LAN, C. E., WOLFSON, D. B. (2002): Length-biased sampling with right-censoring: an unconditional approach. Journal of the American Statistical Association 97, 201-209.

  • BERAN, R. (1981): Nonparametric regression with randomly censored survival data. Technical Report, University of California, Berkeley.

  • de UÑA-ALVAREZ, J., IGLESIAS-PÉREZ, M.C. (2008): Nonparametric estimation of a conditional distribution from length-biased data. Annals of the Institute of Statistical Mathematics, in press. doi: 10.1007/s10463-008-0178-0.

  • LI, G., DATTA, S. (2001): A bootstrap approach to non-parametric regression for right censored data. Annals of the Institute of Statistical Mathematics 53, 708-729.

SLIDE 38

  • OJEDA-CABRERA, J.L., VAN KEILEGOM, I. (2008): Goodness-of-fit tests for parametric regression with selection biased data. Journal of Statistical Planning and Inference 139 (8), 2836-2850.

  • VAN KEILEGOM, I., AKRITAS, M.G. (1999): Transfer of tail information in censored regression models. The Annals of Statistics 27 (5), 1745-1784.

  • WANG, M.-C. (1991): Nonparametric estimation from cross-sectional survival data. Journal of the American Statistical Association 86, 130-143.