Estimating risk. Maximilian Kasy, Department of Economics, Harvard University. PowerPoint presentation.


SLIDE 1

Estimating risk

Maximilian Kasy

Department of Economics, Harvard University

May 4, 2018

1 / 17

SLIDE 2

Estimating risk Introduction

Introduction

◮ Some of the topics about which I learned from Gary:
  ◮ The normal means model.
  ◮ Finite sample risk and point estimation.
  ◮ Shrinkage and tuning.
  ◮ Random coefficients and empirical Bayes.

◮ This talk:
  ◮ Brief review of these topics.
  ◮ Building on that, some new results from my own work.

2 / 17

SLIDE 3


The normal means model

◮ θ, X ∈ ℝ^k
◮ X ∼ N(θ, Σ)
◮ Estimator θ̂(X) of θ ("almost differentiable")
◮ Mean squared error:

$$\mathrm{MSE}\big(\hat\theta, \theta\big) = \frac{1}{k}\,E_\theta\big[\|\hat\theta - \theta\|^2\big] = \frac{1}{k}\sum_j E_\theta\big[(\hat\theta_j - \theta_j)^2\big].$$

◮ Would like to estimate MSE(θ̂, θ), in order to
  1. choose tuning parameters to minimize estimated MSE,
  2. choose between estimators to minimize estimated MSE,
  3. use it as a theoretical tool for proving dominance results.

◮ Key ingredient for machine learning!

3 / 17
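As a quick numerical companion (not from the slides), a minimal Python sketch of this risk notion: simulate the normal means model and estimate the MSE of a rule by Monte Carlo. The function and parameter choices are illustrative assumptions.

```python
import numpy as np

def mse(estimator, theta, X_draws):
    """Monte Carlo estimate of MSE = (1/k) E_theta ||theta_hat(X) - theta||^2."""
    k = theta.shape[0]
    return np.mean([np.sum((estimator(x) - theta) ** 2) / k for x in X_draws])

rng = np.random.default_rng(0)
k, sigma = 50, 1.0
theta = np.zeros(k)                                          # true means (all zero here)
X_draws = theta + sigma * rng.standard_normal((10_000, k))   # X ~ N(theta, sigma^2 I)

print(mse(lambda x: x, theta, X_draws))        # unshrunk rule: MSE near sigma^2 = 1
print(mse(lambda x: 0.5 * x, theta, X_draws))  # shrinkage helps when theta = 0: near 0.25
```

The second rule illustrates the point of the deck: at θ = 0, shrinking toward zero cuts the risk, which is exactly why estimating MSE to tune the amount of shrinkage matters.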

SLIDE 4


Roadmap

◮ Review:
  ◮ Covariance penalties,
  ◮ Stein's Unbiased Risk Estimate (SURE),
  ◮ Cross-Validation (CV).
◮ Panel version of the (normal) means model:
  ◮ X ∈ ℝ^k as the sample mean of n i.i.d. draws Y_i.
  ◮ ⇒ n-fold Cross-Validation.
◮ Two results that are new (I think):
  ◮ Large n ⇒ CV approximates SURE.
  ◮ Large k ⇒ CV and SURE converge to MSE and yield oracle-optimal tuning ("uniform loss consistency").

4 / 17

SLIDE 5


References

◮ Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151.

◮ Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467):619–632.

◮ Abadie, A. and Kasy, M. (2018). Choosing among regularized estimators in empirical economics. Working paper.

◮ Fessler, P. and Kasy, M. (2018). How to use economic theory to improve estimators: Shrinking toward theoretical restrictions. Working paper.

◮ Kasy, M. and Mackey, L. (2018). Approximate cross-validation. Work in progress.

5 / 17

SLIDE 6

Estimating risk SURE and CV

Covariance penalty

◮ Efron (2004): Adding and subtracting $\theta_j$ gives

$$\big(\hat\theta_j - X_j\big)^2 = \big(\hat\theta_j - \theta_j\big)^2 + 2\,\big(\hat\theta_j - \theta_j\big)\big(\theta_j - X_j\big) + \big(\theta_j - X_j\big)^2.$$

◮ Thus $\mathrm{MSE}(\hat\theta,\theta) = \frac{1}{k}\sum_j \mathrm{MSE}_j$, where

$$\begin{aligned}
\mathrm{MSE}_j = E_\theta\big[(\hat\theta_j - \theta_j)^2\big]
&= E_\theta\big[(\hat\theta_j - X_j)^2\big] + 2\,E_\theta\big[(\hat\theta_j - \theta_j)(X_j - \theta_j)\big] - E_\theta\big[(X_j - \theta_j)^2\big] \\
&= E_\theta\big[(\hat\theta_j - X_j)^2\big] + 2\,\mathrm{Cov}_\theta\big(\hat\theta_j, X_j\big) - \mathrm{Var}_\theta(X_j).
\end{aligned}$$

◮ First term: in-sample prediction error (observed).
◮ Second term: covariance penalty (depends on the unobserved θ).
◮ Third term: irreducible prediction error; does not depend on θ̂.

6 / 17
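The decomposition can be checked numerically. A hedged Python sketch (the scalar linear shrinkage rule θ̂ = cX and all constants are my illustrative choices): the directly computed MSE should match in-sample error plus twice the covariance penalty minus the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, c = 2.0, 1.0, 0.7
X = theta + sigma * rng.standard_normal(200_000)   # X ~ N(theta, sigma^2), scalar case
theta_hat = c * X                                  # a simple linear shrinkage rule

mse_direct = np.mean((theta_hat - theta) ** 2)
# Efron (2004): in-sample error + 2 * covariance penalty - variance.
decomposed = (np.mean((theta_hat - X) ** 2)
              + 2 * np.cov(theta_hat, X)[0, 1]
              - np.var(X))
print(mse_direct, decomposed)   # the two agree up to simulation noise
```

Here the covariance penalty 2·Cov(θ̂, X) is computable only because the simulation knows the truth; SURE, on the next slides, is precisely a way to estimate it from data alone.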

SLIDE 7


Stein’s Unbiased Risk Estimate

◮ Stein (1981): For the normal pdf $\varphi_\sigma$ with variance $\sigma^2$,

$$\varphi_\sigma'(x - \theta) = -\frac{x-\theta}{\sigma^2}\,\varphi_\sigma(x-\theta).$$

◮ Suppose for a moment that $\Sigma = \sigma^2 I$.
◮ Then, by partial integration,

$$\begin{aligned}
\mathrm{Cov}_\theta\big(\hat\theta_j, X_j\big)
&= \int E_\theta\big[\hat\theta_j \mid X_j = x_j\big]\,(x_j - \theta_j)\,\varphi_\sigma(x_j - \theta_j)\,dx_j \\
&= \sigma^2 \int -E_\theta\big[\hat\theta_j \mid X_j = x_j\big]\,\varphi_\sigma'(x_j - \theta_j)\,dx_j \\
&= \sigma^2 \int \partial_{x_j} E_\theta\big[\hat\theta_j \mid X_j = x_j\big]\,\varphi_\sigma(x_j - \theta_j)\,dx_j
= \sigma^2\, E_\theta\big[\partial_{X_j}\hat\theta_j\big].
\end{aligned}$$

7 / 17

SLIDE 8


◮ Thus

$$\mathrm{MSE} = \frac{1}{k}\sum_j \mathrm{MSE}_j = \frac{1}{k}\sum_j E_\theta\Big[\big(\hat\theta_j - X_j\big)^2 + 2\,\sigma^2\,\partial_{X_j}\hat\theta_j - \sigma^2\Big].$$

◮ For non-diagonal Σ, a change of coordinates gives more generally

$$\mathrm{MSE} = \frac{1}{k}\,E_\theta\Big[\|\hat\theta - X\|^2 + 2\,\mathrm{trace}\big(\hat\theta'\cdot\Sigma\big) - \mathrm{trace}(\Sigma)\Big].$$

◮ All terms on the right hand side are observed! Sample version:

$$\mathrm{SURE} = \frac{1}{k}\Big[\|\hat\theta - X\|^2 + 2\,\mathrm{trace}\big(\hat\theta'\cdot\Sigma\big) - \mathrm{trace}(\Sigma)\Big].$$

◮ Key assumptions that we used:
  ◮ X is normally distributed.
  ◮ Σ is known.
  ◮ θ̂ is almost differentiable.

8 / 17
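For a concrete componentwise rule, SURE has a closed form. A Python sketch (my illustrative setup: Σ = σ²I and the ridge rule θ̂_j = X_j/(1+λ), whose derivative ∂θ̂_j/∂X_j = 1/(1+λ) is constant), comparing SURE against the realized squared error, which the simulation can compute because it knows θ:

```python
import numpy as np

def sure_ridge(x, lam, sigma2):
    """SURE for theta_hat_j = x_j / (1 + lam) when X ~ N(theta, sigma2 * I):
    (1/k) ||theta_hat - x||^2 + 2 * sigma2 / (1 + lam) - sigma2,
    since d theta_hat_j / d x_j = 1 / (1 + lam) for every j."""
    k = x.shape[0]
    theta_hat = x / (1 + lam)
    return np.sum((theta_hat - x) ** 2) / k + 2 * sigma2 / (1 + lam) - sigma2

rng = np.random.default_rng(2)
k, sigma2, lam = 1000, 1.0, 1.0
theta = rng.standard_normal(k)
x = theta + rng.standard_normal(k)
true_se = np.sum((x / (1 + lam) - theta) ** 2) / k   # realized squared error (needs theta)
print(sure_ridge(x, lam, sigma2), true_se)           # close for large k
```

Note that SURE uses only x, λ, and the known σ², matching the slide's three assumptions; θ never appears.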

SLIDE 9


Panel setting and cross-validation

◮ Assume panel structure: X is a sample average over i = 1,…,n, with components j = 1,…,k,

$$X = \frac{1}{n}\sum_i Y_i, \qquad Y_i \sim_{\text{i.i.d.}} (\theta,\; n\cdot\Sigma).$$

◮ Leave-one-out mean and estimator:

$$X_{-i} = \frac{1}{n-1}\sum_{i'\neq i} Y_{i'}, \qquad \hat\theta_{-i} = \hat\theta(X_{-i}).$$

◮ n-fold cross-validation:

$$\mathrm{CV} = \frac{1}{n}\sum_i \mathrm{CV}_i, \qquad \mathrm{CV}_i = \|Y_i - \hat\theta_{-i}\|^2.$$

9 / 17
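A minimal Python sketch of this n-fold (leave-one-out) CV in the panel means model. For simplicity it draws Y_i with identity variance rather than the n·Σ scaling above, and the ridge rule is an illustrative choice:

```python
import numpy as np

def loo_cv(Y, estimator):
    """n-fold CV: CV = (1/n) sum_i || Y_i - theta_hat(X_{-i}) ||^2."""
    n = Y.shape[0]
    total = 0.0
    for i in range(n):
        X_minus_i = (Y.sum(axis=0) - Y[i]) / (n - 1)   # leave-one-out mean
        total += np.sum((Y[i] - estimator(X_minus_i)) ** 2)
    return total / n

rng = np.random.default_rng(3)
n, k = 100, 20
theta = rng.standard_normal(k)
Y = theta + rng.standard_normal((n, k))          # Y_i i.i.d. with mean theta
print(loo_cv(Y, lambda x: x))                    # CV for the unshrunk rule
print(loo_cv(Y, lambda x: x / (1 + 0.1)))        # CV for a ridge rule, lam = 0.1
```

Because the rule is evaluated on the held-out Y_i, CV estimates the out-of-sample error directly; no covariance penalty, normality, or knowledge of Σ is needed.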

SLIDE 10

Estimating risk Large n

Large n: SURE ≈ CV

Proposition

Suppose θ̂(·) is continuously differentiable in a neighborhood of θ, and suppose $X^n = \frac{1}{n}\sum_i Y_i^n$ with $(Y_i^n - \theta)/\sqrt{n}$ i.i.d. with expectation 0 and variance Σ. Let

$$\hat\Sigma^n = \frac{1}{n^2}\sum_i \big(Y_i^n - X^n\big)\big(Y_i^n - X^n\big)'.$$

Then

$$\mathrm{CV}^n = \|X^n - \hat\theta^n\|^2 + 2\,\mathrm{trace}\big(\hat\theta'\cdot\hat\Sigma^n\big) + (n-1)\,\mathrm{trace}\big(\hat\Sigma^n\big) + o_p(1)$$

as n → ∞.

◮ New result, I believe.
◮ "For large n, CV is the same as SURE, plus the irreducible forecasting error" $n\cdot\mathrm{trace}(\Sigma) = E_\theta\big[\|Y_i - \theta\|^2\big]$.
◮ Does not require normality or a known Σ!

10 / 17
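The proposition can be eyeballed in simulation. A Python sketch (a linear ridge rule, so θ̂′ = I/(1+λ) exactly; the data-generating choices are mine, not the paper's), comparing exact leave-one-out CV with the SURE-like right-hand side:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, lam = 2000, 5, 0.5
c = 1 / (1 + lam)                                # ridge rule x -> c * x, so theta_hat' = c * I
theta = np.ones(k)
Y = theta + rng.standard_normal((n, k))          # Y_i i.i.d.; X = mean has variance I/n
X = Y.mean(axis=0)

# Exact leave-one-out CV for the ridge rule.
cv = np.mean([np.sum((Y[i] - c * (X * n - Y[i]) / (n - 1)) ** 2) for i in range(n)])

# Right-hand side of the proposition.
Sigma_hat = (Y - X).T @ (Y - X) / n**2           # (1/n^2) sum_i (Y_i - X)(Y_i - X)'
rhs = (np.sum((X - c * X) ** 2)
       + 2 * c * np.trace(Sigma_hat)             # 2 * trace(theta_hat' @ Sigma_hat)
       + (n - 1) * np.trace(Sigma_hat))
print(cv, rhs)   # difference vanishes as n grows
```

The (n−1)·trace term is the irreducible forecasting error from the slide; the first two terms are the SURE-type part.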

SLIDE 11


Sketch of proof

◮ Let $s = \sqrt{n-1}$ and omit the superscript n:

$$U_i = \frac{1}{s}(Y_i - X), \qquad U_i \sim (0,\Sigma), \qquad X_{-i} = X - \frac{1}{s}\,U_i, \qquad Y_i = X + s\,U_i,$$

$$\hat\theta(X_{-i}) = \hat\theta(X) - \frac{1}{s}\,\hat\theta'(X)\cdot U_i + \Delta_i, \qquad \Delta_i = o\Big(\frac{1}{s}\,\|U_i\|\Big), \qquad \hat\Sigma = \frac{1}{n}\sum_i U_i U_i'.$$

◮ Then

$$\begin{aligned}
\mathrm{CV}_i = \|Y_i - \hat\theta_{-i}\|^2
&= \Big\|X + s\,U_i - \Big(\hat\theta - \frac{1}{s}\,\hat\theta'(X)\cdot U_i + \Delta_i\Big)\Big\|^2 \\
&= \|X - \hat\theta\|^2 + 2\,\big\langle U_i,\; \hat\theta'(X)\cdot U_i\big\rangle + s^2\,\|U_i\|^2 \\
&\quad + 2\,\Big\langle X - \hat\theta,\; \Big(s + \tfrac{1}{s}\,\hat\theta'\Big)U_i\Big\rangle
+ \frac{1}{s^2}\,\big\|\hat\theta'(X)\cdot U_i\big\|^2
+ 2\,\big\langle \Delta_i,\; Y_i - \hat\theta_{-i}\big\rangle,
\end{aligned}$$

$$\mathrm{CV} = \frac{1}{n}\sum_i \mathrm{CV}_i
= \|X - \hat\theta\|^2 + 2\,\mathrm{trace}\big(\hat\theta'\cdot\hat\Sigma\big) + (n-1)\,\mathrm{trace}(\hat\Sigma) + 0 + o_p(1).$$

11 / 17

SLIDE 12

Estimating risk Large k

Large k: SURE, CV ≈ MSE

◮ Abadie and Kasy (2018): Random effects (empirical Bayes) perspective:

$$(X_j, \theta_j) \sim_{\text{i.i.d.}} \pi, \qquad E_\pi[X_j \mid \theta_j] = \theta_j.$$

◮ Unbiasedness of SURE and CV:

$$E_\theta[\mathrm{SURE}] = \mathrm{MSE}, \qquad E_\theta[\mathrm{CV}] = E_\theta[\mathrm{CV}_i] = \mathrm{MSE}_{n-1}.$$

◮ Law of large numbers: for fixed π and n,

$$\operatorname{plim}_{k\to\infty}\big(\mathrm{SURE} - \mathrm{MSE}\big) = 0, \qquad \operatorname{plim}_{k\to\infty}\big(\mathrm{CV} - \mathrm{MSE}_{n-1}\big) = 0.$$

◮ Questions:
  ◮ Does this hold uniformly over π?
  ◮ If so, does this yield oracle-optimal tuning parameters?

12 / 17

SLIDE 13


Componentwise estimators

◮ The answer requires more structure on the estimators. Assume componentwise rules

$$\hat\theta_j = m(X_j, \lambda).$$

Examples:
  ◮ Ridge: $m_R(x,\lambda) = \frac{1}{1+\lambda}\,x$.
  ◮ Lasso: $m_L(x,\lambda) = \mathbf{1}(x < -\lambda)\,(x+\lambda) + \mathbf{1}(x > \lambda)\,(x-\lambda)$.

◮ Denote

$$\begin{aligned}
\mathrm{SE}(\lambda) &= \frac{1}{k}\sum_{j=1}^{k} \big(m(X_j,\lambda) - \theta_j\big)^2, && \text{(squared error loss)} \\
\mathrm{MSE}(\lambda) &= E_\theta[\mathrm{SE}(\lambda)], && \text{(compound risk)} \\
\overline{\mathrm{MSE}}(\lambda) &= E_\pi[\mathrm{MSE}(\lambda)] = E_\pi[\mathrm{SE}(\lambda)], && \text{(empirical Bayes risk)}
\end{aligned}$$

◮ and let $\widehat{\mathrm{MSE}}(\lambda)$ denote an estimator of MSE, e.g. SURE or CV.

13 / 17
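Both example rules, and SURE-based tuning over a grid of λ values, fit in a few lines of Python. A sketch under my illustrative assumptions (Σ = σ²I with σ² = 1 known; grid and sample sizes are arbitrary choices); note the lasso derivative 1(|x| > λ) exists almost everywhere, which is exactly the "almost differentiable" requirement:

```python
import numpy as np

def m_ridge(x, lam):
    """Componentwise ridge rule: x / (1 + lam)."""
    return x / (1 + lam)

def m_lasso(x, lam):
    """Componentwise lasso rule (soft-thresholding)."""
    return np.where(x < -lam, x + lam, 0.0) + np.where(x > lam, x - lam, 0.0)

def sure(x, lam, m, d_m, sigma2=1.0):
    """SURE for theta_hat_j = m(x_j, lam) with Sigma = sigma2 * I."""
    k = x.shape[0]
    return (np.sum((m(x, lam) - x) ** 2) / k
            + 2 * sigma2 * np.mean(d_m(x, lam)) - sigma2)

# Derivatives in x: constant for ridge; 1 outside [-lam, lam] for lasso.
d_ridge = lambda x, lam: np.full_like(x, 1 / (1 + lam))
d_lasso = lambda x, lam: (np.abs(x) > lam).astype(float)

rng = np.random.default_rng(5)
theta = rng.standard_normal(2000)
x = theta + rng.standard_normal(2000)

grid = np.linspace(0.0, 5.0, 51)
lam_ridge = grid[np.argmin([sure(x, l, m_ridge, d_ridge) for l in grid])]
print(lam_ridge)   # SURE-minimizing ridge penalty (near 1 for this design)
```

With θ_j ∼ N(0,1) and unit noise, the oracle ridge penalty is λ = 1, so the SURE minimizer landing near 1 is a sanity check of the large-k consistency the next slides formalize.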

SLIDE 14


Theorem (Uniform loss consistency)

Assume that, as k → ∞,

$$\sup_{\pi\in Q}\; P\Big(\sup_{\lambda\in[0,\infty]} \big|\mathrm{SE}(\lambda) - \overline{\mathrm{MSE}}(\lambda)\big| > \varepsilon\Big) \to 0 \qquad \forall\, \varepsilon > 0,$$

$$\sup_{\pi\in Q}\; P\Big(\sup_{\lambda\in[0,\infty]} \big|\widehat{\mathrm{MSE}}(\lambda) - \overline{\mathrm{MSE}}(\lambda) - v_\pi\big| > \varepsilon\Big) \to 0 \qquad \forall\, \varepsilon > 0.$$

Then

$$\sup_{\pi\in Q}\; P\Big(\mathrm{SE}(\hat\lambda) - \inf_{\lambda\in[0,\infty]} \mathrm{SE}(\lambda) > \varepsilon\Big) \to 0 \qquad \forall\, \varepsilon > 0,$$

where $\hat\lambda \in \operatorname{argmin}_{\lambda\in[0,\infty]} \widehat{\mathrm{MSE}}(\lambda)$.

14 / 17

SLIDE 15


Theorem (Uniform convergence)

Suppose that $\sup_{\pi\in Q} E_\pi[X^4] < \infty$. Under some conditions on m (satisfied for Ridge and Lasso), the assumptions of the previous theorem are satisfied.

Remarks:

◮ Extension of the Glivenko-Cantelli theorem.
◮ Need conditions on m to get uniformity over λ.
◮ Only need (and get) uniform convergence of $\widehat{\mathrm{MSE}} - \overline{\mathrm{MSE}} - v_\pi$ to 0 for some constant $v_\pi$.
◮ For CV, get uniform loss consistency relative to the estimator using the λ optimal for $\mathrm{SE}_{n-1}$ (thus shrinking a bit too much for small n), where n ≈ sample size / # of parameters.

15 / 17

SLIDE 16


Outlook and work in progress

1. Approximate CV, using a first-order approximation to the leave-one-out estimator in penalized M-estimator settings:

$$\hat\beta_{-i}(\lambda) - \hat\beta(\lambda) \approx \Big[\sum_j m_{bb}\big(X_j, \hat\beta(\lambda)\big) + \pi_{bb}\big(\hat\beta(\lambda), \lambda\big)\Big]^{-1} \cdot m_b\big(X_i, \hat\beta(\lambda)\big),$$

where $m_b$ and $m_{bb}$ denote first and second derivatives of the objective $m$ with respect to β, and $\pi_{bb}$ the second derivative of the penalty π.

  ◮ Fast alternative to CV for tuning of neural nets, etc.
  ◮ Additional acceleration by only calculating this for a subset of i, j.

2. Risk reductions for shrinkage toward inequality restrictions.

  ◮ Relevant for many restrictions implied by economic theory.
  ◮ Proving uniform dominance using SURE, extending James-Stein.
  ◮ Open question: a smooth choice of "degrees of freedom" that is not too conservative.

16 / 17
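For a penalized M-estimator with quadratic loss and ridge penalty, the displayed approximation reduces to a single linear solve. A hedged Python sketch (the ridge-regression setting and all names are my illustrative choices, not the paper's implementation), comparing the one-step approximation against the exact leave-one-out fit:

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Penalized M-estimator: argmin_b 0.5*||y - Z b||^2 + 0.5*lam*||b||^2."""
    k = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ y)

rng = np.random.default_rng(6)
n, k, lam = 200, 5, 3.0
Z = rng.standard_normal((n, k))
y = Z @ np.ones(k) + rng.standard_normal(n)

beta = ridge_fit(Z, y, lam)
H = Z.T @ Z + lam * np.eye(k)            # sum_j m_bb(X_j, beta) + pi_bb(beta, lam)

i = 0
grad_i = -Z[i] * (y[i] - Z[i] @ beta)    # m_b(X_i, beta): gradient of obs. i's loss
beta_loo_approx = beta + np.linalg.solve(H, grad_i)
beta_loo_exact = ridge_fit(np.delete(Z, i, 0), np.delete(y, i, 0), lam)
print(np.max(np.abs(beta_loo_approx - beta_loo_exact)))   # small: one Newton step suffices
```

The approximation reuses the full-sample Hessian H for every i, so n leave-one-out fits cost one factorization plus n cheap solves rather than n refits.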

SLIDE 17


Thank you!

17 / 17