

1. Estimating risk
   Maximilian Kasy, Department of Economics, Harvard University
   May 4, 2018

2. Introduction
   - Some of the topics about which I learned from Gary:
     - The normal means model.
     - Finite-sample risk and point estimation.
     - Shrinkage and tuning.
     - Random coefficients and empirical Bayes.
   - This talk:
     - A brief review of these topics.
     - Building on that, some new results from my own work.

3. The normal means model
   - $\theta, X \in \mathbb{R}^k$
   - $X \sim N(\theta, \Sigma)$
   - Estimator $\hat{\theta}(X)$ of $\theta$ ("almost differentiable")
   - Mean squared error:
     $$MSE(\hat{\theta}, \theta) = \tfrac{1}{k}\, E_\theta\big[\|\hat{\theta} - \theta\|^2\big] = \tfrac{1}{k} \sum_j E_\theta\big[(\hat{\theta}_j - \theta_j)^2\big].$$
   - We would like to estimate $MSE(\hat{\theta}, \theta)$ in order to
     1. choose tuning parameters to minimize estimated MSE,
     2. choose between estimators to minimize estimated MSE,
     3. use it as a theoretical tool for proving dominance results.
   - A key ingredient for machine learning!
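
To make the target concrete, here is a minimal simulation sketch (not from the slides; the crude shrinkage rule and all parameter values are illustrative assumptions) that Monte Carlo-approximates $MSE(\hat{\theta}, \theta)$ in the normal means model with $\Sigma = \sigma^2 I$, for two candidate estimators whose estimated risks one might compare:

```python
import numpy as np

# Illustrative sketch: Monte Carlo approximation of the compound risk
# MSE(theta_hat, theta) = (1/k) E ||theta_hat(X) - theta||^2
# in the normal means model X ~ N(theta, sigma^2 I).
rng = np.random.default_rng(0)
k, sigma, n_rep = 50, 1.0, 20_000
theta = 0.5 * rng.normal(size=k)                 # a fixed parameter vector

X = theta + sigma * rng.normal(size=(n_rep, k))  # rows: independent draws of X

mse_unshrunk = np.mean((X - theta) ** 2)         # theta_hat = X, risk equals sigma^2
mse_shrunk = np.mean((X / 2.0 - theta) ** 2)     # a crude shrinkage rule, theta_hat = X / 2
print(mse_unshrunk, mse_shrunk)                  # comparing (estimated) risks guides the choice
```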

4. Roadmap
   - Review:
     - Covariance penalties,
     - Stein's Unbiased Risk Estimate (SURE),
     - Cross-validation (CV).
   - Panel version of the (normal) means model:
     - $X \in \mathbb{R}^k$ as the sample mean of $n$ i.i.d. draws $Y_i$.
     - $\Rightarrow$ $n$-fold cross-validation.
   - Two results that are new (I think):
     - Large $n$: CV approximates SURE.
     - Large $k$: CV and SURE converge to MSE and yield oracle-optimal tuning ("uniform loss consistency").

5. References
   - Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151.
   - Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467):619–632.
   - Abadie, A. and Kasy, M. (2018). Choosing among regularized estimators in empirical economics. Working paper.
   - Fessler, P. and Kasy, M. (2018). How to use economic theory to improve estimators: Shrinking toward theoretical restrictions. Working paper.
   - Kasy, M. and Mackey, L. (2018). Approximate cross-validation. Work in progress.

6. Covariance penalty
   - Efron (2004): Adding and subtracting $\theta_j$ gives
     $$(\hat{\theta}_j - X_j)^2 = (\hat{\theta}_j - \theta_j)^2 + 2(\hat{\theta}_j - \theta_j)(\theta_j - X_j) + (\theta_j - X_j)^2.$$
   - Thus $MSE(\hat{\theta}, \theta) = \tfrac{1}{k} \sum_j MSE_j$, where
     $$\begin{aligned}
     MSE_j &= E_\theta\big[(\hat{\theta}_j - \theta_j)^2\big] \\
           &= E_\theta\big[(\hat{\theta}_j - X_j)^2\big] + 2\, E_\theta\big[(\hat{\theta}_j - \theta_j)(X_j - \theta_j)\big] - E_\theta\big[(X_j - \theta_j)^2\big] \\
           &= E_\theta\big[(\hat{\theta}_j - X_j)^2\big] + 2\,\mathrm{Cov}_\theta(\hat{\theta}_j, X_j) - \mathrm{Var}_\theta(X_j).
     \end{aligned}$$
   - First term: in-sample prediction error (observed).
   - Second term: covariance penalty (depends on the unobserved $\theta$).
   - Third term: irreducible prediction error; does not depend on $\hat{\theta}$.
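
The decomposition can be checked numerically. The following sketch (illustrative; the shrinkage rule and constants are assumptions of the example, not material from the talk) verifies that the Monte Carlo MSE matches the in-sample error plus twice the covariance penalty minus the irreducible error:

```python
import numpy as np

# Illustrative check of Efron's decomposition:
# MSE_j = E[(theta_hat_j - X_j)^2] + 2 Cov(theta_hat_j, X_j) - Var(X_j).
rng = np.random.default_rng(1)
k, sigma, n_rep, lam = 20, 1.0, 200_000, 0.7
theta = rng.normal(size=k)

X = theta + sigma * rng.normal(size=(n_rep, k))   # replications of X ~ N(theta, sigma^2 I)
theta_hat = X / (1.0 + lam)                       # illustrative componentwise shrinkage rule

mse = np.mean((theta_hat - theta) ** 2)                                        # direct Monte Carlo MSE
in_sample = np.mean((theta_hat - X) ** 2)                                      # E[(theta_hat_j - X_j)^2]
cov_pen = np.mean([np.cov(theta_hat[:, j], X[:, j])[0, 1] for j in range(k)])  # covariance penalty
var_x = np.mean(np.var(X, axis=0))                                             # irreducible error Var(X_j)
print(mse, in_sample + 2 * cov_pen - var_x)                                    # approximately equal
```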

7. Stein's Unbiased Risk Estimate
   - Stein (1981): For the normal pdf $\varphi_\sigma$ with variance $\sigma^2$,
     $$\varphi_\sigma'(x - \theta) = -\frac{x - \theta}{\sigma^2} \cdot \varphi_\sigma(x - \theta).$$
   - Suppose for a moment that $\Sigma = \sigma^2 I$.
   - Then, by integration by parts,
     $$\begin{aligned}
     \mathrm{Cov}_\theta(\hat{\theta}_j, X_j) &= \int E_\theta\big[\hat{\theta}_j \mid X_j = x_j\big]\,(x_j - \theta_j)\,\varphi_\sigma(x_j - \theta_j)\, dx_j \\
     &= -\sigma^2 \cdot \int E_\theta\big[\hat{\theta}_j \mid X_j = x_j\big]\,\varphi_\sigma'(x_j - \theta_j)\, dx_j \\
     &= \sigma^2 \cdot \int \partial_{x_j} E_\theta\big[\hat{\theta}_j \mid X_j = x_j\big]\,\varphi_\sigma(x_j - \theta_j)\, dx_j \\
     &= \sigma^2 \cdot E_\theta\big[\partial_{X_j} \hat{\theta}_j\big].
     \end{aligned}$$
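
The key identity $\mathrm{Cov}_\theta(\hat{\theta}_j, X_j) = \sigma^2 \cdot E_\theta[\partial_{X_j} \hat{\theta}_j]$ can also be checked by simulation, even for an estimator with kinks such as soft thresholding (which is almost differentiable). The estimator and constants in this sketch are illustrative assumptions:

```python
import numpy as np

# Illustrative check of Stein's identity for a single component X_j ~ N(theta_j, sigma^2).
rng = np.random.default_rng(3)
sigma, lam, theta_j, n_rep = 1.0, 0.8, 0.4, 1_000_000
x = theta_j + sigma * rng.normal(size=n_rep)                   # draws of X_j

theta_hat = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)      # soft-thresholding estimator
lhs = np.mean((theta_hat - theta_hat.mean()) * (x - theta_j))  # Cov(theta_hat_j, X_j)
rhs = sigma ** 2 * np.mean(np.abs(x) > lam)                    # sigma^2 * E[d theta_hat_j / d X_j]
print(lhs, rhs)                                                # approximately equal
```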

8. SURE (continued)
   - Thus
     $$MSE = \tfrac{1}{k} \sum_j MSE_j = \tfrac{1}{k} \sum_j E_\theta\big[(\hat{\theta}_j - X_j)^2 + 2\sigma^2 \cdot \partial_{X_j}\hat{\theta}_j - \sigma^2\big].$$
   - For non-diagonal $\Sigma$, by a change of coordinates we get more generally
     $$MSE = \tfrac{1}{k}\, E_\theta\Big[\|\hat{\theta} - X\|^2 + 2\,\mathrm{trace}\big(\hat{\theta}' \cdot \Sigma\big) - \mathrm{trace}(\Sigma)\Big],$$
     where $\hat{\theta}'$ denotes the Jacobian of $\hat{\theta}(X)$ with respect to $X$.
   - All terms on the right-hand side are observed! Sample version:
     $$SURE = \tfrac{1}{k}\Big[\|\hat{\theta} - X\|^2 + 2\,\mathrm{trace}\big(\hat{\theta}' \cdot \Sigma\big) - \mathrm{trace}(\Sigma)\Big].$$
   - Key assumptions that we used:
     - $X$ is normally distributed.
     - $\Sigma$ is known.
     - $\hat{\theta}$ is almost differentiable.
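
As a concrete instance of the sample version, here is a sketch of SURE for the componentwise ridge estimator $\hat{\theta}_j = X_j / (1 + \lambda)$ with $\Sigma = \sigma^2 I$ and $\sigma$ known; for this estimator the divergence $\sum_j \partial_{X_j} \hat{\theta}_j$ equals $k / (1 + \lambda)$. The function name and interface are assumptions of this example:

```python
import numpy as np

def sure_ridge(x, lam, sigma):
    """SURE for the componentwise ridge estimator when Sigma = sigma^2 * I (sigma known)."""
    k = x.size
    theta_hat = x / (1.0 + lam)                 # ridge shrinkage
    divergence = k / (1.0 + lam)                # sum_j d(theta_hat_j)/d(x_j)
    in_sample = np.sum((theta_hat - x) ** 2)    # in-sample prediction error
    return (in_sample + 2 * sigma ** 2 * divergence - k * sigma ** 2) / k
```

Minimizing `sure_ridge(x, lam, sigma)` over a grid of $\lambda$ values then gives a data-driven tuning choice.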

9. Panel setting and cross-validation
   - Assume a panel structure: $X$ is a sample average, with $i = 1, \ldots, n$ and $j = 1, \ldots, k$,
     $$X = \tfrac{1}{n} \sum_i Y_i, \qquad Y_i \sim \text{i.i.d.}\ (\theta, n \cdot \Sigma).$$
   - Leave-one-out mean and estimator:
     $$X_{-i} = \tfrac{1}{n-1} \sum_{i' \neq i} Y_{i'}, \qquad \hat{\theta}_{-i} = \hat{\theta}(X_{-i}).$$
   - $n$-fold cross-validation:
     $$CV_i = \|Y_i - \hat{\theta}_{-i}\|^2, \qquad CV = \tfrac{1}{n} \sum_i CV_i.$$
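
A direct implementation sketch of this criterion (illustrative; the generic `estimator` argument stands in for $\hat{\theta}(\cdot)$), for a panel $Y$ of shape $(n, k)$:

```python
import numpy as np

def cv_panel(Y, estimator):
    """n-fold CV: CV = (1/n) sum_i || Y_i - theta_hat(X_{-i}) ||^2."""
    n = Y.shape[0]
    total = 0.0
    for i in range(n):
        X_minus_i = np.delete(Y, i, axis=0).mean(axis=0)   # leave-one-out mean X_{-i}
        theta_minus_i = estimator(X_minus_i)               # leave-one-out estimate theta_hat_{-i}
        total += np.sum((Y[i] - theta_minus_i) ** 2)       # CV_i
    return total / n

# Example: cv_panel(Y, lambda x: x / (1.0 + 0.5))  # ridge shrinkage with lambda = 0.5
```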

10. Large n: SURE ≈ CV
   Proposition. Suppose $\hat{\theta}(\cdot)$ is continuously differentiable in a neighborhood of $\theta$, and suppose $X^n = \tfrac{1}{n} \sum_i Y_i^n$ with $(Y_i^n - \theta)/\sqrt{n}$ i.i.d. with expectation $0$ and variance $\Sigma$. Let $\hat{\Sigma}^n = \tfrac{1}{n^2} \sum_i (Y_i^n - X^n)(Y_i^n - X^n)'$. Then
     $$CV^n = \|X^n - \hat{\theta}^n\|^2 + 2\,\mathrm{trace}\big(\hat{\theta}' \cdot \hat{\Sigma}^n\big) + (n-1)\,\mathrm{trace}\big(\hat{\Sigma}^n\big) + o_p(1)$$
   as $n \to \infty$.
   - A new result, I believe.
   - "For large $n$, CV is the same as SURE, plus the irreducible forecasting error" $n \cdot \mathrm{trace}(\Sigma) = E_\theta\big[\|Y_i - \theta\|^2\big]$.
   - Does not require normality or known $\Sigma$!
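
A quick simulation sketch of the proposition's message (all design choices here are illustrative assumptions: ridge shrinkage, $\Sigma = I$, and a particular plug-in normalization for $\hat{\Sigma}$), comparing $CV$ with the SURE-type expansion on the right-hand side:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, lam = 200, 30, 0.5
theta = rng.normal(size=k)
Y = theta + np.sqrt(n) * rng.normal(size=(n, k))     # Y_i ~ i.i.d.(theta, n * I), so X ~ (theta, I)
X = Y.mean(axis=0)
ridge = lambda x: x / (1.0 + lam)                    # theta_hat(x), with Jacobian I / (1 + lam)

# Left-hand side: n-fold cross-validation CV = (1/n) sum_i ||Y_i - theta_hat(X_{-i})||^2
cv = np.mean([np.sum((Y[i] - ridge((Y.sum(axis=0) - Y[i]) / (n - 1))) ** 2) for i in range(n)])

# Right-hand side: SURE-type expansion with a plug-in covariance estimate for Var(X)
Sigma_hat = (Y - X).T @ (Y - X) / (n * (n - 1))      # normalization is an assumption of this sketch
rhs = (np.sum((X - ridge(X)) ** 2)
       + 2 * np.trace(Sigma_hat) / (1.0 + lam)       # trace(theta_hat' * Sigma_hat) for ridge
       + (n - 1) * np.trace(Sigma_hat))
print(cv, rhs)                                       # approximately equal for large n
```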

11. Sketch of proof
   - Let $s = \sqrt{n-1}$ and omit the superscript $n$. Write
     $$U_i = \tfrac{1}{s}(Y_i - X), \quad U_i \sim (0, \Sigma), \qquad Y_i = X + s U_i, \qquad X_{-i} = X - \tfrac{1}{s} U_i,$$
     $$\hat{\theta}(X_{-i}) = \hat{\theta}(X) - \tfrac{1}{s}\, \hat{\theta}'(X) \cdot U_i + \Delta_i, \quad \Delta_i = o\big(\tfrac{1}{s} U_i\big), \qquad \hat{\Sigma} = \tfrac{1}{n} \sum_i U_i U_i'.$$
   - Then
     $$\begin{aligned}
     CV_i &= \|Y_i - \hat{\theta}_{-i}\|^2 = \big\|X + s U_i - \big(\hat{\theta}(X) - \tfrac{1}{s}\, \hat{\theta}'(X) \cdot U_i + \Delta_i\big)\big\|^2 \\
     &= \|X - \hat{\theta}\|^2 + 2\big\langle U_i, \hat{\theta}'(X) \cdot U_i\big\rangle + s^2 \|U_i\|^2 \\
     &\quad + 2\big\langle X - \hat{\theta}, \big(s + \tfrac{1}{s}\hat{\theta}'(X)\big) U_i\big\rangle + \tfrac{1}{s^2}\big\|\hat{\theta}'(X) \cdot U_i\big\|^2 + 2\big\langle \Delta_i, Y_i - \hat{\theta}_{-i}\big\rangle.
     \end{aligned}$$
   - Averaging over $i$ (the term involving $\sum_i U_i$ vanishes by construction),
     $$CV = \tfrac{1}{n} \sum_i CV_i = \|X - \hat{\theta}\|^2 + 2\,\mathrm{trace}\big(\hat{\theta}' \cdot \hat{\Sigma}\big) + (n-1)\,\mathrm{trace}\big(\hat{\Sigma}\big) + 0 + o_p\big(\tfrac{1}{n}\big).$$

12. Large k: SURE, CV ≈ MSE
   - Abadie and Kasy (2018): Random effects (empirical Bayes) perspective:
     $$(X_j, \theta_j) \sim \text{i.i.d.}\ \pi, \qquad E_\pi[X_j \mid \theta_j] = \theta_j.$$
   - Unbiasedness of SURE and CV:
     $$E_\theta[SURE] = MSE, \qquad E_\theta[CV] = E_\theta[CV_i] = MSE_{n-1},$$
     where the subscript $n-1$ refers to the estimator based on $n-1$ observations.
   - Law of large numbers: for fixed $\pi$ and $n$,
     $$\mathrm{plim}_{k \to \infty}\, \big(SURE - MSE\big) = 0, \qquad \mathrm{plim}_{k \to \infty}\, \big(CV - MSE_{n-1}\big) = 0.$$
   - Questions:
     - Does this hold uniformly over $\pi$?
     - If so, does this yield oracle-optimal tuning parameters?

13. Componentwise estimators
   - The answer requires more structure on the estimators. Assume componentwise estimators $\hat{\theta}_j = m(X_j, \lambda)$. Examples:
     - Ridge: $m_R(x, \lambda) = \frac{1}{1 + \lambda}\, x$.
     - Lasso: $m_L(x, \lambda) = \mathbf{1}(x < -\lambda)(x + \lambda) + \mathbf{1}(x > \lambda)(x - \lambda)$.
   - Denote
     $$SE(\lambda) = \tfrac{1}{k} \sum_{j=1}^k \big(m(X_j, \lambda) - \theta_j\big)^2 \quad \text{(squared-error loss)},$$
     $$MSE(\lambda) = E_\theta[SE(\lambda)] \quad \text{(compound risk)},$$
     $$\overline{MSE}(\lambda) = E_\pi[MSE(\lambda)] = E_\pi[SE(\lambda)] \quad \text{(empirical Bayes risk)},$$
   - and let $\widehat{MSE}(\lambda)$ be an estimator of MSE, e.g. SURE or CV.
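
The following sketch implements the two componentwise maps and uses an MSE estimate to pick $\lambda$ on a grid; SURE for the lasso map (with $\Sigma = \sigma^2 I$ and $\sigma$ known) uses the fact that its divergence equals the number of unthresholded components. Function names, the grid, and these choices are assumptions of the example:

```python
import numpy as np

def m_ridge(x, lam):
    """Componentwise ridge map m_R(x, lambda) = x / (1 + lambda)."""
    return x / (1.0 + lam)

def m_lasso(x, lam):
    """Componentwise lasso (soft-thresholding) map m_L(x, lambda)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sure_lasso(x, lam, sigma):
    """SURE for the lasso map when Sigma = sigma^2 I; divergence counts unthresholded components."""
    k = x.size
    theta_hat = m_lasso(x, lam)
    divergence = np.sum(np.abs(x) > lam)
    return (np.sum((theta_hat - x) ** 2) + 2 * sigma ** 2 * divergence - k * sigma ** 2) / k

def choose_lambda(x, sigma, lam_grid):
    """Pick the lambda on the grid minimizing estimated MSE (here: SURE for the lasso)."""
    values = [sure_lasso(x, lam, sigma) for lam in lam_grid]
    return lam_grid[int(np.argmin(values))]

# Example usage (illustrative): lam_hat = choose_lambda(x, sigma=1.0, lam_grid=np.linspace(0.0, 5.0, 101))
```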

14. Theorem (Uniform loss consistency)
   Assume that, as $k \to \infty$,
   $$\sup_{\pi \in Q}\, P_\pi\Big( \sup_{\lambda \in [0, \infty]} \big| SE(\lambda) - \overline{MSE}(\lambda) \big| > \varepsilon \Big) \to 0 \quad \forall\, \varepsilon > 0,$$
   $$\sup_{\pi \in Q}\, P_\pi\Big( \sup_{\lambda \in [0, \infty]} \big| \widehat{MSE}(\lambda) - \overline{MSE}(\lambda) - v_\pi \big| > \varepsilon \Big) \to 0 \quad \forall\, \varepsilon > 0.$$
   Then
   $$\sup_{\pi \in Q}\, P_\pi\Big( \Big| SE(\hat{\lambda}) - \inf_{\lambda \in [0, \infty]} SE(\lambda) \Big| > \varepsilon \Big) \to 0 \quad \forall\, \varepsilon > 0,$$
   where $\hat{\lambda} \in \operatorname{argmin}_{\lambda \in [0, \infty]} \widehat{MSE}(\lambda)$.

15. Theorem (Uniform convergence)
   Suppose that $\sup_{\pi \in Q} E_\pi[X_j^4] < \infty$. Then, under some conditions on $m$ (satisfied for ridge and lasso), the assumptions of the previous theorem are satisfied.
   Remarks:
   - This is an extension of the Glivenko-Cantelli theorem.
   - Conditions on $m$ are needed to get uniformity over $\lambda$.
   - We only need (and get) uniform convergence of $\widehat{MSE} - \overline{MSE} - v_\pi$ to $0$ for some constant $v_\pi$.
   - For CV, we get uniform loss consistency relative to the estimator using the $\lambda$ that is optimal for $SE_{n-1}$ (thus shrinking a bit too much for small $n$), where $n \approx$ sample size / number of parameters.

16. Outlook and work in progress
   1. Approximate CV using a first-order approximation to the leave-one-out estimator, in penalized M-estimator settings:
      $$\hat{\beta}_{-i}(\lambda) - \hat{\beta}(\lambda) \approx \Big[ \sum_j m_{bb}\big(X_j, \hat{\beta}(\lambda)\big) + \pi_{bb}\big(\hat{\beta}(\lambda), \lambda\big) \Big]^{-1} \cdot m_b\big(X_i, \hat{\beta}(\lambda)\big),$$
      where $m_b$ and $m_{bb}$ denote first and second derivatives of the M-estimation objective with respect to $\beta$, and $\pi_{bb}$ the second derivative of the penalty.
      - A fast alternative to CV for the tuning of neural nets, etc.
      - Additional acceleration by only calculating this for a subset of $i$, $j$.
   2. Risk reductions for shrinkage toward inequality restrictions.
      - Relevant for many restrictions implied by economic theory.
      - Proving uniform dominance using SURE, extending James-Stein.
      - Open question: a smooth choice of "degrees of freedom" that is not too conservative.
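
As a concrete sketch of the first item, here is the one-Newton-step leave-one-out approximation specialized to a ridge-penalized least-squares objective, where the sums of second derivatives take a simple closed form. The loss, penalty, and function names are assumptions of this example; the slide's formula covers general penalized M-estimators:

```python
import numpy as np

def approximate_loo_ridge(X, y, lam):
    """One Newton step from the full-sample estimate toward each leave-one-out estimate.

    Objective: 0.5 * sum_i (y_i - x_i' beta)^2 + 0.5 * lam * ||beta||^2.
    """
    n, p = X.shape
    H = X.T @ X + lam * np.eye(p)               # sum_j m_bb(X_j, beta) + pi_bb(beta, lam)
    beta = np.linalg.solve(H, X.T @ y)          # full-sample penalized estimate
    H_inv = np.linalg.inv(H)
    loo_fits = np.empty(n)
    for i in range(n):
        g_i = X[i] * (X[i] @ beta - y[i])       # m_b(X_i, beta): per-observation gradient
        beta_minus_i = beta + H_inv @ g_i       # approximate leave-one-out coefficient
        loo_fits[i] = X[i] @ beta_minus_i
    return loo_fits

# Approximate CV criterion: np.mean((y - approximate_loo_ridge(X, y, lam)) ** 2)
```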

17. Thank you!
