

1. Adaptive Estimation of the Distribution Function and its Density in Sup-Norm Loss. Evarist Giné and Richard Nickl, Department of Mathematics, University of Connecticut.

2. → Let $X_1, \ldots, X_n$ be i.i.d. with completely unknown law $P$ on $\mathbb{R}$.
→ Define also $P_n = n^{-1} \sum_{i=1}^n \delta_{X_i}$, the measure consisting of point masses at the observations ('empirical measure').

3. → We want to find 'data-driven' functions $T(y, X_1, \ldots, X_n)$, $y \in \mathbb{R}$, that optimally estimate
(A) the distribution function $F(y) = \int_{-\infty}^{y} dP(x)$;
(B) its density function $f(y) = \frac{d}{dy} F(y)$;
in sup-norm loss on the real line.

4. Case (A): A classical minimax result is
$$\liminf_n \inf_{T_n} \sup_F \sqrt{n}\, E \sup_{y \in \mathbb{R}} |T_n(y) - F(y)| \geq c > 0.$$
→ The natural candidate for $T_n$ is the sample cdf $F_n(y) = \int_{-\infty}^{y} dP_n(t)$, which is an efficient estimator of $F$ in $\ell^\infty(\mathbb{R})$.
Case (B): If $f$ is contained in some Hölder space $C^t(\mathbb{R})$ with norm $\|\cdot\|_t$, then one has
$$\liminf_n \inf_{T_n} \sup_{\|f\|_t \leq D} \left(\frac{n}{\log n}\right)^{t/(2t+1)} E \|T_n - f\|_\infty \geq c(D) > 0.$$
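As a quick numerical illustration of Case (A) (our own sketch, not from the slides), the scaled distance $\sqrt{n}\,\sup_y |F_n(y) - F(y)|$ can be computed exactly at the jump points of $F_n$ and stays stochastically bounded as $n$ grows:

```python
import numpy as np
from scipy.stats import norm

# Illustrative sketch: sqrt(n) * sup_y |F_n(y) - F(y)| stays bounded in n,
# consistent with the minimax lower bound for Case (A). For continuous F the
# supremum over y is attained at the jump points of the empirical cdf F_n.
rng = np.random.default_rng(0)
for n in [100, 1000, 10000]:
    X = np.sort(rng.standard_normal(n))          # sample from P = N(0, 1)
    F = norm.cdf(X)                               # true cdf at the order statistics
    upper = np.abs(np.arange(1, n + 1) / n - F)   # F_n just above each jump
    lower = np.abs(np.arange(0, n) / n - F)       # F_n just below each jump
    print(n, np.sqrt(n) * max(upper.max(), lower.max()))
```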

5. → Clearly, the step function $F_n$ cannot be used to estimate the density $f$ of $F$.
→ Can one outperform $F_n$ as an estimator for $F$, in the sense that a differentiable $F$ can be estimated without knowing a priori that $F$ is smooth?
→ Perhaps somewhat surprisingly, the answer is yes.

6. Theorem 1 (Giné, Nickl (2008, PTRF)). Let $X_1, \ldots, X_n$ be i.i.d. on $\mathbb{R}$ with unknown law $P$. Then there exists a purely data-driven estimator $\hat{F}_n(s)$ that satisfies
$$\sqrt{n}(\hat{F}_n - F) \rightsquigarrow G_P \quad \text{in } \ell^\infty(\mathbb{R}),$$
where $G_P$ is the $P$-Brownian bridge. Furthermore, if $P$ has a density $f \in C^t(\mathbb{R})$ for some $0 < t \leq T < \infty$ (where $T$ is arbitrary but fixed), then $\hat{F}_n$ has a density $\hat{f}_n$ with probability approaching one, and
$$\sup_{f: \|f\|_t \leq D} E \sup_{y \in \mathbb{R}} |\hat{f}_n(y) - f(y)| = O\left(\left(\frac{\log n}{n}\right)^{t/(2t+1)}\right).$$

7. → This estimator can be written down explicitly (it is a nonlinear estimator based on kernel estimators with adaptive bandwidth choice), and we refer to the paper for details.
Questions:
A) Can (and should) the estimator $\hat{F}_n$ be implemented in practice?
B) Can one obtain reasonable asymptotic or even nonasymptotic risk bounds for the adaptive convergence rates? To what extent is this phenomenon purely asymptotic?

8. → To (partially) answer these questions, wavelets turned out to be more versatile than kernels. If $\phi$, $\psi$ are father and mother wavelets and if
$$\hat{\alpha}_k = \frac{1}{n} \sum_{i=1}^n \phi(X_i - k), \qquad \hat{\beta}_{\ell k} = \frac{1}{n} \sum_{i=1}^n 2^{\ell/2} \psi(2^\ell X_i - k),$$
then, for $j \in \mathbb{N}$, the (linear) wavelet density estimator is, with $\psi_{\ell k}(x) = 2^{\ell/2} \psi(2^\ell x - k)$,
$$f_n^W(y, j) = \sum_k \hat{\alpha}_k \phi(y - k) + \sum_{\ell=0}^{j-1} \sum_k \hat{\beta}_{\ell k} \psi_{\ell k}(y).$$
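To make the display concrete, here is a minimal sketch (our own, not code from the paper) of $f_n^W(y, j)$ for the Haar wavelet, where $\phi = 1_{[0,1)}$ and $\psi = 1_{[0,1/2)} - 1_{[1/2,1)}$; all function names are illustrative:

```python
import numpy as np

def haar_phi(x):
    # father wavelet: indicator of [0, 1)
    return ((0.0 <= x) & (x < 1.0)).astype(float)

def haar_psi(x):
    # mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1)
    return haar_phi(2 * x) - haar_phi(2 * x - 1)

def wavelet_density_estimate(X, y, j):
    """Evaluate the linear wavelet density estimator f_n^W(y, j) on a grid y."""
    est = np.zeros_like(y, dtype=float)
    # father part: sum_k alpha_k phi(y - k)
    for k in range(int(np.floor(X.min())), int(np.floor(X.max())) + 1):
        alpha_k = haar_phi(X - k).mean()               # empirical coefficient
        est += alpha_k * haar_phi(y - k)
    # detail part: sum_{l < j} sum_k beta_lk psi_lk(y), psi_lk(x) = 2^{l/2} psi(2^l x - k)
    for l in range(j):
        scale = 2.0 ** (l / 2)
        for k in range(int(np.floor(2**l * X.min())), int(np.floor(2**l * X.max())) + 1):
            beta_lk = (scale * haar_psi(2**l * X - k)).mean()
            est += beta_lk * scale * haar_psi(2**l * y - k)
    return est
```

For the Haar basis the result is piecewise constant on dyadic intervals of length $2^{-j}$, which makes the projection interpretation on the next slide easy to see.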

9. → This estimator is a projection of the empirical measure $P_n$ onto the space $V_j$ spanned by the associated wavelet basis functions at resolution level $j$. If $\phi, \psi$ are the Battle-Lemarié wavelets, this corresponds to a projection onto the classical Schoenberg spaces spanned by (dyadic) $B$-splines.
→ It was shown in Giné and Nickl (2007): If $2^{j_n} \simeq (n/\log n)^{1/(2t+1)}$ and if $f \in C^t(\mathbb{R})$, then
$$E \sup_{y \in \mathbb{R}} |f_n^W(y, j_n) - f(y)| = O\left(\left(\frac{\log n}{n}\right)^{t/(2t+1)}\right)$$

10. and, if $F_n^W(s) := \int_{-\infty}^s f_n^W(y)\, dy$, that
$$\sqrt{n}(F_n^W - F) \rightsquigarrow G_P \quad \text{in } \ell^\infty(\mathbb{R}).$$
→ However, this is of limited practical importance, since $f \in C^t(\mathbb{R})$ is rarely known, and hence the choice $2^{j_n} \simeq (n/\log n)^{1/(2t+1)}$ is not feasible.
→ A natural way to choose the resolution level $j_n$ is to perform some model selection procedure on the sequence of nested spaces (or 'candidate models') $V_j$.

11. HARD THRESHOLDING
The hard thresholding wavelet density estimator introduced by Donoho, Johnstone, Kerkyacharian and Picard (1996) is
$$f_n^T(y) = \sum_k \hat{\alpha}_k \phi(y - k) + \sum_{\ell=0}^{j_0-1} \sum_k \hat{\beta}_{\ell k} \psi_{\ell k}(y) + \sum_{\ell=j_0}^{j_1-1} \sum_k \hat{\beta}_{\ell k}\, 1\!\left[|\hat{\beta}_{\ell k}| > \tau \sqrt{\ell/n}\right] \psi_{\ell k}(y),$$
where $2^{j_1} \simeq n/\log n$ and $j_0 \to \infty$ depending on the maximal smoothness up to which one wants to adapt.

12. Theorem 2 (Giné-Nickl (2007), Thm 8). For a (reasonable) choice of $\tau$, and under a moment assumption of arbitrary order on $f \in C^t(\mathbb{R})$, one can prove Theorem 1 with $\hat{F}_n$ the hard thresholding estimator.
→ This already gives an answer to the first question, since the hard thresholding estimator can be implemented without much difficulty.
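A sketch of such an implementation (our own illustration, reusing haar_psi and wavelet_density_estimate from the Haar sketch above, with the threshold $\tau\sqrt{\ell/n}$ as in the displayed formula):

```python
def hard_threshold_estimate(X, y, j0, j1, tau):
    """Hard-thresholding wavelet density estimator f_n^T on a grid y (Haar basis)."""
    n = len(X)
    est = wavelet_density_estimate(X, y, j0)     # keep all levels below j0
    for l in range(j0, j1):                      # threshold levels j0, ..., j1 - 1
        scale = 2.0 ** (l / 2)
        for k in range(int(np.floor(2**l * X.min())), int(np.floor(2**l * X.max())) + 1):
            beta_lk = (scale * haar_psi(2**l * X - k)).mean()
            if abs(beta_lk) > tau * np.sqrt(l / n):   # keep only significant coefficients
                est += beta_lk * scale * haar_psi(2**l * y - k)
    return est
```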

13. LEPSKI'S METHOD
→ In the model selection context, Lepski's (1991) method can be briefly described as follows:
a) Start with the smallest model $V_{j_{\min}}$; compare it to a nested sequence of larger models $\{V_j\}$, $j_{\min} \leq j \leq j_{\max}$;
b) choose the smallest $j$ for which all relevant blocks of wavelet coefficients between $j$ and $j_{\max}$ are insignificant as compared to a certain threshold.

14. Formally, if $\mathcal{J}$ is the set of candidate resolution levels between $j_{\min}$ and $j_{\max}$, we define
$$\hat{j}_n = \min\left\{ j \in \mathcal{J} : \|f_n^W(j) - f_n^W(l)\|_\infty \leq T_{n,j,l} \ \ \forall\, l > j,\ l \in \mathcal{J} \right\},$$
where $T_{n,j,l}$ is a threshold discussed later.
→ Note that, unlike hard thresholding procedures, Lepski's method does not discard irrelevant blocks at resolution levels that are smaller than $\hat{j}_n$.
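The selection rule translates directly into a short loop (a sketch with our own naming; the threshold is passed in as a function, since its concrete form only appears on a later slide, and the sup-norm is approximated by a maximum over an evaluation grid):

```python
def lepski_select(X, j_min, j_max, threshold, grid):
    """Smallest j such that f_n^W(j) is within threshold(n, j, l) of every finer f_n^W(l)."""
    n = len(X)
    # precompute each candidate estimator on a common grid
    fits = {j: wavelet_density_estimate(X, grid, j) for j in range(j_min, j_max + 1)}
    for j in range(j_min, j_max + 1):
        if all(np.max(np.abs(fits[j] - fits[l])) <= threshold(n, j, l)
               for l in range(j + 1, j_max + 1)):
            return j
    return j_max
```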

15. → The crucial point is of course the choice of the threshold $T_{n,j,l}$. The general principle behind Lepski's proof is that one needs a sharp estimate for the 'variance term' of the linear estimator underlying the procedure.
→ In the i.i.d. density model on $\mathbb{R}$ with sup-norm loss, this means that one needs exact exponential inequalities (involving constants!) for
$$\sup_{y \in \mathbb{R}} |f_n^W(y, j) - E f_n^W(y, j)|.$$

16. → In the Gaussian white noise model often assumed in the literature, exponential inequalities are immediate. Tsybakov (1998), for example, works with a trigonometric basis and ends up with a stationary Gaussian process, and then one has the Rice formula at hand.
→ Otherwise, one needs empirical processes: Talagrand's (1996) inequality, with sharp constants (Massart (2000), Bousquet (2003), Klein and Rio (2005)), can be used here.

17. → To apply Talagrand's inequality, one needs sharp moment bounds for suprema of empirical processes. The constants in these inequalities (Talagrand (1994), Einmahl and Mason (2000), Giné and Guillou (2001), Giné and Nickl (2007)) are not useful in adaptive estimation.
→ To tackle this problem, we adapt an idea from machine learning due to Koltchinskii (2001, 2006) and Bartlett, Boucheron and Lugosi (2002), and use Rademacher processes.

18. → The following symmetrization inequality is well known: if the $\varepsilon_i$ are i.i.d. Rademacher variables independent of the sample, then
$$E \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n (f(X_i) - Pf) \right| \leq 2\, E \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \varepsilon_i f(X_i) \right|,$$
and the r.h.s. can be estimated by the (supremum of the) 'Rademacher process'
$$\sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \varepsilon_i f(X_i) \right|,$$
which is 'purely data-driven' and concentrates (again by Talagrand) in a 'Bernstein way' nicely around its expectation.
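A quick Monte Carlo illustration of the symmetrization step (ours, not from the slides), with $\mathcal{F}$ a small class of half-line indicators $f_s = 1_{(-\infty, s]}$; note that the inequality relates expectations, while the code compares single realizations:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 2000
X = rng.standard_normal(n)
s_grid = np.linspace(-3, 3, 61)                 # F = {1_(-inf, s] : s on a grid}
indicators = (X[:, None] <= s_grid)             # n x |grid| matrix of f(X_i)
Pf = norm.cdf(s_grid)                           # Pf for each indicator
emp_sup = np.abs(indicators.sum(axis=0) - n * Pf).max()
eps = rng.choice([-1.0, 1.0], size=n)           # i.i.d. Rademacher signs
rad_sup = np.abs((eps[:, None] * indicators).sum(axis=0)).max()
print(emp_sup, 2 * rad_sup)                     # E[emp_sup] <= 2 E[rad_sup]
```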

19. → In our setup, if
$$K_l(x, y) = \sum_k 2^l \phi(2^l x - k)\, \phi(2^l y - k)$$
is a wavelet projection kernel, and if the $\varepsilon_i$ are i.i.d. Rademachers, we set
$$R(n, l) = 2 \sup_{y \in \mathbb{R}} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i K_l(X_i, y) \right|.$$
→ We choose the threshold ($\|\Phi\|_2$ is a constant that depends only on $\phi$):
$$T(n, j, l) = R(n, l) + 7 \|\Phi\|_2\, \|p_n(j_{\max})\|_\infty^{1/2} \sqrt{\frac{2^l l}{n}}.$$
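For the Haar wavelet, $K_l(x, y) = 2^l\, 1\{x, y \text{ lie in the same dyadic interval of length } 2^{-l}\}$, so $R(n, l)$ reduces to a maximum of signed bin counts. The sketch below is our own illustration and makes two explicit assumptions: the constant $\|\Phi\|_2$ is simply set to 1, and the linear estimate at level $j_{\max}$ stands in for $\|p_n(j_{\max})\|_\infty$:

```python
def rademacher_term(X, l, rng):
    """R(n, l) for the Haar projection kernel: a max of signed dyadic bin sums."""
    n = len(X)
    eps = rng.choice([-1.0, 1.0], size=n)
    bins = np.floor(2**l * X).astype(int)          # dyadic bin of each observation
    signed = {}
    for b, e in zip(bins, eps):
        signed[b] = signed.get(b, 0.0) + e
    return 2 * (2**l / n) * max(abs(s) for s in signed.values())

def lepski_threshold(X, j, l, j_max, rng, phi_norm=1.0):
    """Sketch of T(n, j, l) = R(n, l) + 7 ||Phi||_2 ||p_n(j_max)||_inf^{1/2} sqrt(2^l l / n)."""
    n = len(X)
    grid = np.linspace(X.min(), X.max(), 512)
    p_sup = wavelet_density_estimate(X, grid, j_max).max()  # stand-in for ||p_n(j_max)||_inf
    return rademacher_term(X, l, rng) + 7 * phi_norm * np.sqrt(p_sup) * np.sqrt(2**l * l / n)
```

This plugs into the lepski_select sketch above via threshold=lambda n, j, l: lepski_threshold(X, j, l, j_max, rng).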

20. Theorem 3 (GN 2008). Let $X_1, \ldots, X_n$ be i.i.d. on $\mathbb{R}$ with common law $P$ and uniformly continuous density $f$. Let
$$\hat{F}_n(s) = \int_{-\infty}^s f_n^W(y, \hat{j}_n)\, dy.$$
Then
$$\sqrt{n}(\hat{F}_n - F) \rightsquigarrow G_P \quad \text{in } \ell^\infty(\mathbb{R}).$$
If, in addition, $f \in C^t(\mathbb{R})$ for some $0 < t \leq r$, then also
$$\sup_{f: \|f\|_t \leq D} E \sup_{y \in \mathbb{R}} |f_n^W(y, \hat{j}_n) - f(y)| = O\left(\left(\frac{\log n}{n}\right)^{t/(2t+1)}\right).$$
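Putting the pieces together in the same illustrative Haar setup, the adaptive distribution function estimator is the running integral of the selected density estimate (trapezoidal rule on the evaluation grid):

```python
rng = np.random.default_rng(2)
X = rng.standard_normal(1000)
grid = np.linspace(X.min() - 1, X.max() + 1, 2048)
j_min, j_max = 1, 6
j_hat = lepski_select(X, j_min, j_max,
                      lambda n, j, l: lepski_threshold(X, j, l, j_max, rng), grid)
f_hat = wavelet_density_estimate(X, grid, j_hat)       # adaptive density estimate
# F_hat(s) = integral of f_hat up to s, accumulated along the grid
F_hat = np.concatenate([[0.0],
                        np.cumsum((f_hat[1:] + f_hat[:-1]) / 2 * np.diff(grid))])
```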

21. → The following theorem uses the previous proof, as well as the exact almost sure law of the logarithm for wavelet density estimators (GN (2007)).
Theorem 4. Let the conditions of Theorem 3 hold. Then, if $f \in C^t(\mathbb{R})$ for some $0 < t \leq 1$, and if $\phi$ is the Haar wavelet, we have
$$\limsup_n \left(\frac{n}{\log n}\right)^{t/(2t+1)} E \|f_n^W(\hat{j}_n) - f\|_\infty \leq A(p_0),$$
where
$$A(p_0) = 26.6 \left(\frac{\|f\|_\infty^t\, \|f\|_t}{\sqrt{2 \log 2}\,(1+t)}\right)^{1/(2t+1)}.$$

22. → For example, if $t = 1$, then $A(p_0) \leq 20 \|f\|_\infty^{1/3} \|Df\|_\infty^{1/3}$.
→ The best possible constant in the minimax risk is derived in Korostelev and Nussbaum (1999) for densities supported in $[0, 1]$, and our bound misses the one there by $\simeq 20$.
→ Some loss of efficiency in the asymptotic constant of any adaptive estimator is to be expected in our estimation problem, cf. Lepski (1992) and also Tsybakov (1998).
