 
Estimation Asymptotic results Bandwidth selection Simulations Data Analysis
Computational treatment of the error distribution in nonparametric regression with right-censored and selection-biased data
Géraldine LAURENT, jointly with Cédric HEUCHENNE
QuantOM, HEC-ULg Management School, University of Liege
Tuesday, 24 August 2010
Between 1987 and 1997, the Spanish Institute for Statistics studied unemployment among active people, and married women in particular. For these data, we note that
• the unemployment duration will not always be completely observed,
• the woman's age affects her prospects of finding a job.
[Figure: scatter plot of unemployment duration (in months) against woman age (in months), with censored and observed points marked separately.]
Estimation
We consider the nonparametric regression model
$$Y = m(X) + \sigma(X)\,\varepsilon$$
where
• $Y$ is the response variable
• $X$ is the covariate
• $m(\cdot) = E[Y \mid \cdot]$ and $\sigma^2(\cdot) = \mathrm{Var}[Y \mid \cdot]$ are unknown smooth functions
• $\varepsilon$ is independent of $X$, with $E[\varepsilon] = 0$ and $\mathrm{Var}[\varepsilon] = 1$
Particular features of $(X, Y)$:
• $(X, Y)$ is obtained from cross-sectional sampling
• $Y$ is subject to right censoring.
We study the variable $Y$ restricted by $T \le Y \le C$, where
• $T$ is the truncation variable
• $C$ is the censoring variable.
[Figure: real world. Durations shown along a time axis, then with the truncation time marked.] Notation: $F$ denotes a cdf in the real world.
[Figure: intermediate observed world. Observations Y1, C2, Y3, Y4, C5, C6 shown along a time axis with the truncation time marked.] Notation: $H$ denotes a cdf and $n$ the sample size.
[Figure: observed world. Observations Y1, Y3, Y4 shown along a time axis with the truncation time marked.] Notation: $H$ denotes a cdf.
Aim: estimation of the error distribution $F_\varepsilon(e) = \mathbb{P}(\varepsilon \le e)$ from $(X, Y)$ observed under $T \le Y \le C$, where
• the distribution $F_{T|X}$ is a parametric distribution
• the distribution $F_{C-T|X}$ is completely unknown
Assumptions:
• the variables $Y$ and $T$ are independent, conditionally on $X$
• for each value $x$, the support of $F_{Y|X}(\cdot \mid x)$ is included in the support of $F_{T|X}(\cdot \mid x)$
• the lower bound of the support of $T$ is zero
• the variables $(T, Y)$ and $C - T$ are independent, conditionally on $T \le Y$ and $X$
We have
$$H_{X,Y}(x, y) = \mathbb{P}(X \le x, Y \le y \mid T \le Y \le C) = \left(E[w(X, Y)]\right)^{-1} \int_{r \le x} \int_{s \le y} w(r, s)\, dF_{X,Y}(r, s),$$
where the weight function $w(x, y)$ is defined by
$$w(x, y) = \int_{t \le y} \{1 - G(y - t \mid x)\}\, dF_{T|X}(t \mid x)$$
and $G(z \mid x) = \mathbb{P}(C - T \le z \mid X = x, T \le Y)$.
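The weight integral can be approximated numerically for a fixed covariate value $x$. The sketch below is only an illustration: it assumes that $F_{T|X}$ has a density, and the names `weight`, `g_cdf` and `t_density` are hypothetical, not taken from the talk.

```python
def weight(y, g_cdf, t_density, steps=2000):
    """Midpoint-rule approximation of
    w(x, y) = int_0^y {1 - G(y - t | x)} f_T(t | x) dt for a fixed x.

    g_cdf:     callable z -> G(z | x)      (assumed supplied by the user)
    t_density: callable t -> f_T(t | x)    (assumes F_T|X has a density)
    """
    if y <= 0.0:
        return 0.0  # the support of T starts at zero, so the integral is empty
    width = y / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * width  # midpoint of the k-th sub-interval
        total += (1.0 - g_cdf(y - t)) * t_density(t)
    return total * width


# Illustrative inputs: T | X ~ Uniform(0, 10) and no censoring at all (G = 0),
# in which case w(x, y) reduces to F_T(y | x).
no_censoring = weight(2.0, lambda z: 0.0, lambda t: 0.1)
```

With `G` identically zero the weight reduces to $F_{T|X}(y \mid x)$, which gives a quick sanity check of the quadrature.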
In particular, if $C = T + \tau$ where $\tau$ is a positive constant, the same argument gives the weight function
$$w(x, y) = \int_{0 \vee (y - \tau)}^{y} dF_{T|X}(t \mid x).$$
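For a continuous $F_{T|X}$, the fixed-window weight is simply a difference of two cdf values. A minimal sketch, assuming purely for illustration that $T \mid X$ is Uniform$(0, a)$ (the talk only says that $F_{T|X}$ is parametric; `uniform_cdf`, `weight_fixed_window` and `upper` are hypothetical names):

```python
def uniform_cdf(t, upper):
    """Cdf of a Uniform(0, upper) truncation variable (illustrative assumption)."""
    if t <= 0.0:
        return 0.0
    return min(t / upper, 1.0)


def weight_fixed_window(y, tau, upper):
    """w(x, y) = F_T(y | x) - F_T(max(0, y - tau) | x) when C = T + tau."""
    lower_limit = max(0.0, y - tau)
    return uniform_cdf(y, upper) - uniform_cdf(lower_limit, upper)


# y = 2, window tau = 1, T ~ Uniform(0, 10): the mass of T on [1, 2]
w_example = weight_fixed_window(2.0, 1.0, 10.0)
```

When $y \le \tau$ the lower limit is zero and the weight equals $F_{T|X}(y \mid x)$.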
We obtain
$$F_{X,Y}(x, y) = \int_{r \le x} \int_{s \le y} \frac{E[w(X, Y)]}{w(r, s)}\, dH_{X,Y}(r, s).$$
Therefore,
$$F_\varepsilon(e) = \mathbb{P}\left(\frac{Y - m(X)}{\sigma(X)} \le e\right) = \iint_{\left\{(x, y):\, \frac{y - m(x)}{\sigma(x)} \le e\right\}} dF_{X,Y}(x, y) = \iint_{\left\{(x, y):\, \frac{y - m(x)}{\sigma(x)} \le e\right\}} \frac{E[w(X, Y)]}{w(x, y)}\, dH_{X,Y}(x, y).$$
Thus, the estimator is
$$\hat F_\varepsilon(e) = \frac{1}{M} \sum_{i=1}^{n} \frac{\hat E[w(X, Y)]}{\hat w(X_i, Y_i)}\, I\{\hat\varepsilon_i \le e, \Delta_i = 1\}$$
with
$$\hat\varepsilon_i = \frac{Y_i - \hat m(X_i)}{\hat\sigma(X_i)}, \qquad M = \sum_{i=1}^{n} \Delta_i, \qquad \hat E[w(X, Y)] = \left(\frac{1}{M} \sum_{i=1}^{n} \frac{\Delta_i}{\hat w(X_i, Y_i)}\right)^{-1},$$
where the functions $\hat m(\cdot)$, $\hat\sigma(\cdot)$ and $\hat w(\cdot, \cdot)$ are nonparametric estimators.
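Once the residuals and the estimated weights are available, the estimator is a weighted empirical cdf over the uncensored observations. The sketch below takes them as plain lists; the function name and argument layout are assumptions made for illustration.

```python
def error_cdf_estimate(e, residuals, deltas, weights):
    """Weighted empirical cdf of the standardized residuals at the point e.

    residuals[i]: (Y_i - m_hat(X_i)) / sigma_hat(X_i)
    deltas[i]:    1 if Y_i is uncensored, 0 otherwise
    weights[i]:   w_hat(X_i, Y_i), assumed strictly positive
    """
    m_count = sum(deltas)  # M, the number of uncensored observations
    mean_inverse_weight = sum(d / w for d, w in zip(deltas, weights)) / m_count
    e_w_hat = 1.0 / mean_inverse_weight  # estimate of E[w(X, Y)]
    total = sum(
        e_w_hat / w
        for r, d, w in zip(residuals, deltas, weights)
        if d == 1 and r <= e
    )
    return total / m_count


# Toy inputs (made up for the example, not data from the talk)
res = [-1.0, 0.0, 1.0, 2.0]
dlt = [1, 1, 0, 1]
wts = [0.5, 1.0, 0.7, 2.0]
value_at_zero = error_cdf_estimate(0.0, res, dlt, wts)
```

The normalization by $\hat E[w(X, Y)]$ guarantees that the estimate reaches exactly 1 once $e$ exceeds every uncensored residual.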
For $G(t \mid x)$, we use the Beran (1981) estimator defined by
$$\hat G(t \mid x) = 1 - \prod_{Z_i \le t,\, \Delta_i = 0} \left\{1 - \frac{W_i(x, h_n)}{\sum_{j=1}^{n} W_j(x, h_n)\, I\{Z_j \ge Z_i\}}\right\}$$
where
• $Z_i = \min(C_i - T_i, Y_i - T_i)$ and $\Delta_i = I\{Y_i \le C_i\}$
• $W_i(x, h_n) = K\left(\frac{x - X_i}{h_n}\right) \Big/ \sum_{j=1}^{n} K\left(\frac{x - X_j}{h_n}\right)$ are the Nadaraya-Watson weights
• $K$ is a kernel function
• $h_n$ is a bandwidth sequence tending to 0 as $n \to \infty$
$$\Longrightarrow \quad \hat w(x, y) = \int_{t \le y} \{1 - \hat G(y - t \mid x)\}\, dF_{T|X}(t \mid x)$$
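The product-limit form above translates directly into a loop over the censored observations. This is a rough sketch: the Epanechnikov kernel is an assumption (the talk does not name a kernel), and the helper names are hypothetical.

```python
def epanechnikov(u):
    """Epanechnikov kernel (an assumed choice; any kernel K could be used)."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def nw_weights(x, xs, h):
    """Nadaraya-Watson weights W_i(x, h) = K((x - X_i)/h) / sum_j K((x - X_j)/h)."""
    k = [epanechnikov((x - xi) / h) for xi in xs]
    total = sum(k)
    if total == 0.0:
        raise ValueError("no observation within one bandwidth of x")
    return [ki / total for ki in k]


def beran_g(t, x, xs, zs, deltas, h):
    """Beran-type estimate of G(t | x).

    The product runs over the observations with Z_i <= t and delta_i = 0,
    i.e. those for which C_i - T_i itself was observed.
    """
    w = nw_weights(x, xs, h)
    survival = 1.0
    for i, (zi, di) in enumerate(zip(zs, deltas)):
        if di == 0 and zi <= t and w[i] > 0.0:
            at_risk = sum(wj for wj, zj in zip(w, zs) if zj >= zi)
            survival *= 1.0 - w[i] / at_risk
    return 1.0 - survival


# Toy data with identical covariates, so all four weights equal 1/4
g_at_3 = beran_g(3.0, 0.0, [0.0] * 4, [1.0, 2.0, 3.0, 4.0], [0, 1, 0, 1], 1.0)
```

With equal weights the estimate coincides with the usual Kaplan-Meier estimate computed on the censoring times.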
The estimators of $m(\cdot)$ and $\sigma(\cdot)$ are given by
$$\hat m(x) = \frac{\sum_{i=1}^{n} \dfrac{W_i(x, h_n)\, Y_i\, \Delta_i}{\hat w(x, Y_i)}}{\sum_{i=1}^{n} \dfrac{W_i(x, h_n)\, \Delta_i}{\hat w(x, Y_i)}}, \qquad \hat\sigma^2(x) = \frac{\sum_{i=1}^{n} \dfrac{W_i(x, h_n)\, \Delta_i\, (Y_i - \hat m(x))^2}{\hat w(x, Y_i)}}{\sum_{i=1}^{n} \dfrac{W_i(x, h_n)\, \Delta_i}{\hat w(x, Y_i)}},$$
an extension of the estimators in de Uña-Alvarez and Iglesias-Pérez (2008).
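Because both estimators are ratios, the normalizing constant of the Nadaraya-Watson weights cancels and raw kernel values suffice. A minimal sketch under assumed names, with an Epanechnikov kernel chosen only for illustration and `w_hat` passed in as a callable:

```python
import math


def epanechnikov(u):
    """Epanechnikov kernel (an assumed choice, not specified in the talk)."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def m_sigma_hat(x, xs, ys, deltas, w_hat, h):
    """Weighted kernel estimates of m(x) and sigma(x).

    w_hat: callable (x, y) -> estimated weight, assumed strictly positive.
    """
    coef = [
        epanechnikov((x - xi) / h) * di / w_hat(x, yi)
        for xi, yi, di in zip(xs, ys, deltas)
    ]
    denom = sum(coef)
    if denom == 0.0:
        raise ValueError("no uncensored observation within one bandwidth of x")
    m = sum(c * yi for c, yi in zip(coef, ys)) / denom
    variance = sum(c * (yi - m) ** 2 for c, yi in zip(coef, ys)) / denom
    return m, math.sqrt(variance)


# Toy data: constant weight, third observation censored and therefore ignored
m_est, s_est = m_sigma_hat(
    0.0, [0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [1, 1, 0], lambda x, y: 1.0, 1.0
)
```

Censored observations enter only through $\hat w$ (via $\hat G$), never directly through the sums.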
Asymptotic results
Under some assumptions,
$$\hat F_\varepsilon(e) - F_\varepsilon(e) = \sum_{i=1}^{n} V(X_i, Y_i, Z_i, \Delta_i, e) + o_p(n^{-1/2})$$
uniformly in $e$.
$\Longrightarrow$ Weak convergence of the process $\sqrt{n}\,(\hat F_\varepsilon(e) - F_\varepsilon(e)) \to \Omega(e)$, where $\Omega$ is a Gaussian process with zero mean and a complicated covariance function.
Bandwidth selection
We want to determine the smoothing parameter $h_n$ which minimizes
$$\mathrm{MISE} = E\left[\int \left\{\hat F_{\varepsilon, h_n}(e) - F_\varepsilon(e)\right\}^2 de\right].$$
We consider a bootstrap procedure that extends the one of Li and Datta (2001).
For $b = 1, \ldots, B$ and for $i = 1, \ldots, n$:
Step 1. Generate $X^*_{i,b}$ from
$$\hat F_X(\cdot) = \sum_{j=1}^{n} \frac{\hat E[w(X, Y)]}{\hat E[w(X, Y) \mid X = \cdot]}\, I\{X_j \le \cdot, \Delta_j = 1\},$$
where
$$\hat E[w(X, Y) \mid X = \cdot] = \frac{\sum_{j=1}^{n} W_j(\cdot, g_n)\, \Delta_j}{\sum_{j=1}^{n} W_j(\cdot, g_n)\, \Delta_j / \hat w(\cdot, Y_j)}$$
and $g_n$ is a pilot bandwidth.
Step 2. Generate $Y^*_{i,b}$ from
$$\hat F_{Y|X}(\cdot \mid X^*_{i,b}) = \sum_{j=1}^{n} \frac{\hat E[w(X, Y) \mid X = X^*_{i,b}]\; W_j(X^*_{i,b}, g_n)}{\hat w(X^*_{i,b}, Y_j) \left(\sum_{k=1}^{n} W_k(X^*_{i,b}, g_n)\, \Delta_k\right)}\, I\{Y_j \le \cdot, \Delta_j = 1\}.$$
Step 3. Draw $T^*_{i,b}$ from the distribution $F_{T|X}(\cdot \mid X^*_{i,b})$.
• If $T^*_{i,b} > Y^*_{i,b}$, then reject $(X^*_{i,b}, Y^*_{i,b}, T^*_{i,b})$ and go to Step 1.
• Otherwise, go to Step 4.
Step 4. Select at random $V^*_{i,b}$ from $\hat G(\cdot \mid X^*_{i,b})$ calculated with $g_n$.
Step 5. Define
• $Z^*_{i,b} = \min(Y^*_{i,b} - T^*_{i,b}, V^*_{i,b})$
• $\Delta^*_{i,b} = I\{Y^*_{i,b} - T^*_{i,b} \le V^*_{i,b}\}$.
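The control flow of Steps 1 to 5 for a single bootstrap observation can be sketched as a rejection loop. The four samplers are passed in as callables and stand in for draws from $\hat F_X$, $\hat F_{Y|X}$, $F_{T|X}$ and $\hat G$; their names, the `max_tries` safeguard and the toy samplers in the usage lines are assumptions made for illustration.

```python
import random


def draw_bootstrap_point(draw_x, draw_y_given_x, draw_t_given_x,
                         draw_v_given_x, rng, max_tries=10000):
    """Return one resampled tuple (X*, T*, Z*, Delta*).

    Each sampler is a user-supplied callable (hypothetical interface):
    draw_x(rng), draw_y_given_x(x, rng), draw_t_given_x(x, rng),
    draw_v_given_x(x, rng).
    """
    for _ in range(max_tries):
        x = draw_x(rng)                 # Step 1
        y = draw_y_given_x(x, rng)      # Step 2
        t = draw_t_given_x(x, rng)      # Step 3
        if t > y:
            continue                    # reject and go back to Step 1
        v = draw_v_given_x(x, rng)      # Step 4
        z = min(y - t, v)               # Step 5
        delta = 1 if y - t <= v else 0
        return x, t, z, delta
    raise RuntimeError("rejection step did not accept within max_tries")


# Toy samplers, only to exercise the control flow
rng = random.Random(1)
point = draw_bootstrap_point(
    lambda r: 1.0,
    lambda x, r: 5.0,
    lambda x, r: r.uniform(0.0, 10.0),
    lambda x, r: 2.0,
    rng,
)
```

Repeating the call $n$ times gives one resample, and repeating that $B$ times gives the full set of bootstrap samples.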
Compute $\hat F^*_{\varepsilon, h_n, b}$, the estimated error distribution based on
• the bandwidth $h_n$
• the resample $\{(X^*_{i,b}, T^*_{i,b}, Z^*_{i,b}, \Delta^*_{i,b}) : i = 1, \ldots, n\}$.
The bandwidth minimizing the MISE can then be approximated by
$$\arg\min_{h_n}\; B^{-1} \sum_{b=1}^{B} \int \left\{\hat F^*_{\varepsilon, h_n, b}(e) - \hat F_{\varepsilon, g_n}(e)\right\}^2 de.$$
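In practice the integral has to be discretized. The sketch below evaluates the curves on a grid of $e$ values, uses the trapezoidal rule (an assumed choice of quadrature) and picks the candidate bandwidth with the smallest bootstrap criterion; the function names and the dictionary layout are hypothetical.

```python
def integrated_squared_diff(f_values, g_values, grid):
    """Trapezoidal approximation of the integral of (f - g)^2 over the grid."""
    total = 0.0
    for k in range(len(grid) - 1):
        left = (f_values[k] - g_values[k]) ** 2
        right = (f_values[k + 1] - g_values[k + 1]) ** 2
        total += 0.5 * (left + right) * (grid[k + 1] - grid[k])
    return total


def select_bandwidth(bootstrap_curves, pilot_curve, grid):
    """Return the candidate h with the smallest bootstrap MISE approximation.

    bootstrap_curves: dict mapping each candidate h to a list of B curves,
                      each curve being the bootstrap estimate on the grid.
    pilot_curve:      the estimate computed with the pilot bandwidth g_n.
    """
    def criterion(h):
        curves = bootstrap_curves[h]
        return sum(
            integrated_squared_diff(c, pilot_curve, grid) for c in curves
        ) / len(curves)

    return min(bootstrap_curves, key=criterion)


# Toy curves (made up) on a three-point grid
grid = [0.0, 1.0, 2.0]
pilot = [0.0, 0.5, 1.0]
candidates = {
    0.1: [[0.0, 0.5, 1.0], [0.0, 0.6, 1.0]],
    0.2: [[0.2, 0.7, 1.0]],
}
best_h = select_bandwidth(candidates, pilot, grid)
```

A finer grid or a different quadrature rule changes only `integrated_squared_diff`; the selection step stays the same.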
Simulations
We consider
• model $Y = X + \varepsilon$ where
  • $X \sim U([1.7321; 2])$
  • $\varepsilon \sim U([-\sqrt{3}; \sqrt{3}])$
• model $\log Y = X + \varepsilon$ where
  • $X \sim U([0; 1])$
  • $\varepsilon \sim N(0; 1)$
• model $Y = X^2 + X \varepsilon$ where
  • $X \sim U([2; 2\sqrt{3}])$
  • $\varepsilon \sim U([-\sqrt{3}; \sqrt{3}])$
• model $\log Y = X^2 + X \varepsilon$ where
  • $X \sim U([0; 1])$
  • $\varepsilon \sim N(0; 1)$
where $X$ and $\varepsilon$ are independent in each model.
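The four designs can be generated as follows before any truncation or censoring is applied. This is only a sketch: the truncation and censoring mechanisms used in the simulations are not described on the slide and are left out, the function name and model numbering are hypothetical, and the interval $[2; 2\sqrt{3}]$ for the third model is a reading of a garbled slide.

```python
import math
import random


def simulate_model(model, n, seed=0):
    """Draw n pairs (X, Y) from one of the four simulation models (1 to 4)."""
    rng = random.Random(seed)
    root3 = math.sqrt(3.0)
    data = []
    for _ in range(n):
        if model == 1:    # Y = X + eps
            x = rng.uniform(1.7321, 2.0)
            y = x + rng.uniform(-root3, root3)
        elif model == 2:  # log Y = X + eps
            x = rng.uniform(0.0, 1.0)
            y = math.exp(x + rng.gauss(0.0, 1.0))
        elif model == 3:  # Y = X^2 + X * eps (bounds of X are an assumption)
            x = rng.uniform(2.0, 2.0 * root3)
            y = x * x + x * rng.uniform(-root3, root3)
        elif model == 4:  # log Y = X^2 + X * eps
            x = rng.uniform(0.0, 1.0)
            y = math.exp(x * x + x * rng.gauss(0.0, 1.0))
        else:
            raise ValueError("model must be 1, 2, 3 or 4")
        data.append((x, y))
    return data


sample = simulate_model(1, 500, seed=42)
```

In each design the uniform errors on $[-\sqrt{3}; \sqrt{3}]$ and the standard normal errors have mean 0 and variance 1, and the covariate ranges keep $Y$ positive, as expected for a duration.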