

SLIDE 1

Conditional Predictive Inference Post Model Selection

Hannes Leeb

Department of Statistics, Yale University

Model Selection Workshop, Vienna, July 25, 2008


SLIDES 2-3

Problem statement

Predictive inference post model selection, in a setting with large dimension and (comparatively) small sample size.

Example (Stenbakken & Souders, 1987, 1991): predict the performance of D/A converters. Select 64 explanatory variables from a total of 8,192, based on a sample of size 88.

Features of this example:

  • Large number of candidate models
  • Selected model is complex in relation to sample size
  • Focus on predictive performance and inference, not on correctness
  • Model is selected and fitted to the data once and then used repeatedly for prediction


SLIDE 4

Problem statement

Predictive inference post model selection, in a setting with large dimension and (comparatively) small sample size.

Problem studied here: Given a training sample of size n and a collection M of candidate models, find a 'good' model m ∈ M and conduct predictive inference based on the selected model, conditional on the training sample.

Features:

  • #M ≫ n, i.e., potentially many candidate models
  • |m| ∼ n, i.e., potentially complex candidate models
  • no strong regularity conditions


SLIDES 5-7

Overview of results

We consider a model selector and a prediction interval post model selection (both based on a variant of generalized cross-validation) in linear regression with random design. For Gaussian data we show: the prediction interval is 'approximately valid and short' conditional on the training sample, except on an event whose probability is less than

$$ C_1 \,\#M\, \exp\!\big[-C_2\,(n-|M|)\big], $$

where #M denotes the number of candidate models, and |M| denotes the number of parameters in the most complex candidate model. This finite-sample result holds uniformly over all data-generating processes that we consider.


SLIDES 8-10

The data-generating process

Gaussian linear model with random design: Consider a response y that is related to a (possibly infinite) number of explanatory variables xj, j ≥ 1, by

$$ y \;=\; \sum_{j=1}^{\infty} x_j \theta_j \;+\; u \qquad\qquad (1) $$

with x1 = 1. Assume that u has mean zero and is uncorrelated with the xj's. Moreover, assume that the xj's for j > 1 and u are jointly non-degenerate Gaussian, such that the sum converges in L2.

The unknown parameters here are θ, the variance of u, as well as the means and the variance/covariance structure of the xj's. No further regularity conditions are imposed.

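To make the setting concrete, here is a minimal simulation sketch (mine, not from the talk) that draws n independent realizations of (x, y) as in (1); the expansion is truncated to p regressors for simulation purposes, and all dimensions and the sparsity pattern are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): sample size n, truncation level p.
n, p = 100, 50
theta = np.zeros(p)
theta[:5] = rng.normal(size=5)   # a few non-zero coefficients
sigma_u = 1.0

# x_1 = 1 (intercept); x_2, ..., x_p and u jointly Gaussian,
# u has mean zero and is independent of the x_j's.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
u = rng.normal(scale=sigma_u, size=n)
Y = X @ theta + u
```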

SLIDE 11

The candidate models and predictors

Consider a sample (X, Y) of n independent realizations of (x, y) as in (1), and a collection M of candidate models. Each model m ∈ M is assumed to satisfy |m| < n − 1, and each model is fit to the data by least squares. Given a new set of explanatory variables x(f), the corresponding response y(f) is predicted, when using model m, by

$$ \hat y^{(f)}(m) \;=\; \sum_{j=1}^{\infty} x^{(f)}_j \,\tilde\theta_j(m). $$

Here (x(f), y(f)) is another independent realization from (1), and θ̃(m) is the restricted least-squares estimator corresponding to m (with θ̃j(m) = 0 for coordinates excluded from m).

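A minimal sketch of the restricted least-squares predictor just described (helper names are mine): a candidate model m is represented as a list of column indices, and the estimator θ̃(m) is zero outside m.

```python
import numpy as np

def fit_restricted_ls(X, Y, m):
    """Regress Y on the columns of X listed in the candidate model m;
    return the full-length coefficient vector, zero outside m."""
    theta_tilde = np.zeros(X.shape[1])
    coef, *_ = np.linalg.lstsq(X[:, m], Y, rcond=None)
    theta_tilde[m] = coef
    return theta_tilde

def predict(x_f, theta_tilde):
    # Out-of-sample prediction: y_hat^(f)(m) = sum_j x_j^(f) * theta_tilde_j(m).
    return x_f @ theta_tilde
```

For example, fit_restricted_ls(X, Y, [0, 3, 7]) fits the model that uses the intercept and regressors 3 and 7.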

SLIDES 12-15

Two goals

(i) Select a 'good' model from M for prediction out-of-sample, and (ii) conduct predictive inference based on the selected model, both conditional on the training sample.

Two quantities of interest: For m ∈ M, let ρ²(m) denote the conditional mean-squared error of the predictor ŷ(f)(m) given the training sample, i.e.,

$$ \rho^2(m) \;=\; E\Big[ \big( y^{(f)} - \hat y^{(f)}(m) \big)^2 \,\Big|\, X, Y \Big]. $$

For m ∈ M, the conditional distribution of the prediction error ŷ(f)(m) − y(f) given the training sample is

$$ \hat y^{(f)}(m) - y^{(f)} \,\Big|\, X, Y \;\sim\; N\big(\nu(m), \delta^2(m)\big) \;\equiv\; L(m). $$

Note that ρ²(m) = ν²(m) + δ²(m).


SLIDES 16-17

A useful observation

Write σ²(m) for the conditional variance of the response given those explanatory variables that are included in model m, i.e., σ²(m) = Var[y | xj included in model m, j ≥ 1].

Lemma:

$$ \delta^2(m) \;\sim\; \sigma^2(m)\left(1 + \frac{\chi^2_{|m|-1}}{\chi^2_{n-|m|+1}}\right), $$

where the χ²-random variables are independent. Similarly,

$$ \nu^2(m) \;\sim\; \frac{1}{n}\,\sigma^2(m)\left(1 + \frac{\chi^2_{|m|-1}}{\chi^2_{n-|m|+1}}\right), $$

and

$$ \hat\sigma^2(m) \;=\; \frac{\mathrm{RSS}(m)}{n-|m|} \;\sim\; \sigma^2(m)\,\frac{\chi^2_{n-|m|}}{n-|m|}. $$

[The Lemma extends Theorem 1.3 of Breiman & Friedman (1983).]

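The χ² representation in the Lemma is easy to check by simulation. The sketch below is my own, with illustrative sizes: with i.i.d. N(0,1) regressors and unit error variance, the conditional prediction-error variance reduces to δ²(m) = Σ_{j≥2}(θj − θ̃j(m))² + 1 and σ²(m) = 1 + Σ_{j∉m} θj², so one can compare simulated draws of δ²(m) against draws from the claimed χ² ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 60, 12, 20000          # illustrative sizes (assumptions)
m = list(range(6))                  # candidate model: intercept + 5 regressors
theta = np.zeros(p)
theta[:8] = rng.normal(size=8)      # two relevant coordinates lie outside m

delta2 = np.empty(reps)
for r in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    Y = X @ theta + rng.normal(size=n)
    coef, *_ = np.linalg.lstsq(X[:, m], Y, rcond=None)
    d = theta.copy()
    d[m] -= coef                    # d = theta - theta_tilde(m)
    delta2[r] = d[1:] @ d[1:] + 1.0 # conditional variance of prediction error

sigma2_m = 1.0 + theta[len(m):] @ theta[len(m):]   # Var[y | x_j, j in m]
ref = sigma2_m * (1 + rng.chisquare(len(m) - 1, reps)
                  / rng.chisquare(n - len(m) + 1, reps))
print(delta2.mean(), ref.mean())    # should agree up to Monte Carlo error
```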

SLIDES 18-21

Estimators for ρ²(m)

Note that

$$ E\big[\rho^2(m)\big] \;=\; \sigma^2(m)\,\frac{n-2}{n-1-|m|}\left(1+\frac{1}{n}\right). $$

  • The Sp criterion (Tukey, 1967): Sp(m) = σ̂²(m) (n − 2)/(n − 1 − |m|).
  • The GCV criterion (Craven & Wahba, 1978): GCV(m) = σ̂²(m) n/(n − |m|).
  • An auxiliary criterion: ρ̂²(m) = σ̂²(m) n/(n + 1 − |m|).

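In code, all three criteria are simple rescalings of the residual sum of squares; a sketch (mine):

```python
def criteria(rss, n, k):
    """Sp, GCV, and the auxiliary estimator rho2_hat for a candidate model
    with |m| = k parameters, residual sum of squares rss, sample size n."""
    sigma2_hat = rss / (n - k)
    sp = sigma2_hat * (n - 2) / (n - 1 - k)
    gcv = sigma2_hat * n / (n - k)
    rho2_hat = sigma2_hat * n / (n + 1 - k)
    return sp, gcv, rho2_hat
```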

SLIDES 22-25

Performance of ρ̂²(m)

Want: ρ̂²(m)/ρ²(m) ≈ 1 or, equivalently, log(ρ̂²(m)/ρ²(m)) ≈ 0, with high probability.

Theorem: For each ε > 0, we have

$$ P\left( \left| \log \frac{\hat\rho^2(m)}{\rho^2(m)} \right| > \varepsilon \right) \;\le\; 6\,\exp\!\left[ -\frac{n-|m|}{8}\,\frac{\varepsilon^2}{\varepsilon+8} \right], $$

for each sample size n and uniformly over all data-generating processes as in (1).

A similar result holds for the absolute difference |ρ̂²(m) − ρ²(m)|, uniformly over all data-generating processes with bounded variance, i.e., where Var[y] ≤ s² (with an upper bound of the form C1 exp[−(n − |m|) C(ε, s²)]; here s² is a fixed constant).

Method of proof: Chernoff's method or variations thereof (Gaussian case); the Marčenko-Pastur law (non-Gaussian case).

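To get a feel for the Theorem's bound, one can simply evaluate it numerically; an illustrative computation (mine), for ε = 0.5 and models with |m| = n/10:

```python
import math

def tail_bound(n, k, eps):
    # The Theorem's bound: 6 * exp(-((n - k) / 8) * eps^2 / (eps + 8)).
    return 6 * math.exp(-((n - k) / 8) * eps ** 2 / (eps + 8))

for n in (100, 1000, 10000):
    print(n, tail_bound(n, k=n // 10, eps=0.5))
```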

SLIDES 26-27

Selecting the empirically best model

Write m* and m̂ for the truly best and the empirically best candidate model, i.e., m* = argmin_{m∈M} ρ²(m) and m̂ = argmin_{m∈M} ρ̂²(m). Moreover, write |M| for the number of parameters in the most complex candidate model.

Corollary: For each fixed sample size n and uniformly over all data-generating processes as in (1), we have

$$ P\left( \log \frac{\rho^2(\hat m)}{\rho^2(m^*)} > \varepsilon \right) \;\le\; 6\,\exp\!\left[ \log \#M \;-\; \frac{n-|M|}{16}\,\frac{\varepsilon^2}{\varepsilon+16} \right], $$

$$ P\left( \left| \log \frac{\hat\rho^2(\hat m)}{\rho^2(\hat m)} \right| > \varepsilon \right) \;\le\; 6\,\exp\!\left[ \log \#M \;-\; \frac{n-|M|}{8}\,\frac{\varepsilon^2}{\varepsilon+8} \right], $$

for each ε > 0.

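Selecting the empirically best model is then a brute-force minimization of ρ̂²(m) over the candidate collection; a sketch (mine), with each model given as a list of column indices:

```python
import numpy as np

def select_model(X, Y, models):
    """Return m_hat = argmin_{m in models} rho2_hat(m) and its criterion value."""
    n = len(Y)
    best_m, best_val = None, np.inf
    for m in models:
        coef, *_ = np.linalg.lstsq(X[:, m], Y, rcond=None)
        rss = float(np.sum((Y - X[:, m] @ coef) ** 2))
        val = (rss / (n - len(m))) * n / (n + 1 - len(m))
        if val < best_val:
            best_m, best_val = m, val
    return best_m, best_val
```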

SLIDES 28-33

Other model selectors

Consider AIC (Akaike, 1969), AICc (Hurvich & Tsai, 1989), FPE (Akaike, 1970), and BIC (Schwarz, 1978). Taking the exponential of the objective functions of AIC, AICc and BIC, and using the fact that GCV(m) = (1/n) RSS(m)/(1 − |m|/n)² ≈ ρ²(m), we get

$$ \mathrm{AIC}(m) \;=\; \tfrac{1}{n}\mathrm{RSS}(m)\, e^{2|m|/n} \;\approx\; \rho^2(m)\, e^{2|m|/n}\,(1-|m|/n)^2, $$
$$ \mathrm{AICc}(m) \;=\; \tfrac{1}{n}\mathrm{RSS}(m)\, e^{2\frac{|m|-1}{n-|m|-2}} \;\approx\; \rho^2(m)\, e^{2\frac{|m|-1}{n-|m|-2}}\,(1-|m|/n)^2, $$
$$ \mathrm{FPE}(m) \;=\; \tfrac{1}{n}\mathrm{RSS}(m)\, \frac{1+|m|/n}{1-|m|/n} \;\approx\; \rho^2(m)\,(1+|m|/n)(1-|m|/n), $$
$$ \mathrm{BIC}(m) \;=\; \tfrac{1}{n}\mathrm{RSS}(m)\, e^{\log(n)\,|m|/n} \;\approx\; \rho^2(m)\, n^{|m|/n}\,(1-|m|/n)^2. $$

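On this common scale, the four selectors are again simple rescalings of RSS(m)/n; a sketch (mine), following the display above with k = |m|:

```python
import math

def selector_scores(rss, n, k):
    """AIC, AICc, FPE, BIC (and GCV) on the exponentiated, GCV-comparable scale."""
    base = rss / n
    return {
        "AIC":  base * math.exp(2 * k / n),
        "AICc": base * math.exp(2 * (k - 1) / (n - k - 2)),
        "FPE":  base * (1 + k / n) / (1 - k / n),
        "BIC":  base * math.exp(math.log(n) * k / n),   # = base * n**(k / n)
        "GCV":  base / (1 - k / n) ** 2,
    }
```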

SLIDES 34-38

Simulation Scenario I

Consider one sample of size n = 1300 from (1) with E[xj] = 0, E[xi xj] = δi,j, and E[u²] = 1. The first 1000 components of θ are shown (in absolute value) below; the remaining components are zero.

[Figure: absolute values of the first 1000 components of θ]

The non-zero coefficients of θ are 'sparse': most are small, but there are a few groups of adjacent large coefficients. Choose candidate models that can pick out the few important groups: divide the first 1000 coefficients of θ into 20 consecutive blocks of equal length, and consider all candidate models that include or exclude one block at a time, resulting in 2^20 candidate models. The model space is searched using a general-to-specific greedy strategy (one possible reading is sketched below).

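The talk does not spell out the greedy search, so the following is only one plausible reading (mine): start from the model that includes all 20 blocks and repeatedly drop the block whose removal most improves ρ̂², stopping when no single removal helps.

```python
import numpy as np

def rho2_hat(X, Y, cols):
    n, k = len(Y), len(cols)
    coef, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
    rss = float(np.sum((Y - X[:, cols] @ coef) ** 2))
    return (rss / (n - k)) * n / (n + 1 - k)

def greedy_block_search(X, Y, blocks):
    """General-to-specific search over models that include or exclude whole
    blocks; `blocks` is a list of column-index arrays (block 0 is assumed to
    contain the intercept and is never dropped)."""
    active = set(range(len(blocks)))
    cols = lambda s: np.concatenate([blocks[b] for b in sorted(s)])
    score = rho2_hat(X, Y, cols(active))
    while len(active) > 1:
        drop, best = None, score
        for b in sorted(active - {0}):
            s = rho2_hat(X, Y, cols(active - {b}))
            if s < best:
                drop, best = b, s
        if drop is None:                 # no single removal improves rho2_hat
            break
        active.discard(drop)
        score = best
    return active, score
```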

SLIDES 39-48

Simulation Scenario I

Results for X Gaussian, u Gaussian, Runs 1-4:

[Figures: Runs 1-4]

SLIDES 49-50

Simulation Scenario I

Results for X Exponential, u Bernoulli (scaled and centered), Run 1: [Figure]

Results for X Bernoulli, u Exponential (scaled and centered), Run 1: [Figure]

SLIDE 51

Simulation Scenario II

Consider the same setting as in Scenario I but, instead of a 'sparse' parameter θ, a case where none of the candidate models fits particularly well. The first 1000 components of θ are shown (in absolute value) below; the remaining components are zero.

[Figure: absolute values of the first 1000 components of θ]

SLIDES 52-54

Simulation Scenario II

Results for X Gaussian, u Gaussian, Run 1: [Figure]

Results for X Exponential, u Bernoulli (scaled and centered), Run 1: [Figure]

Results for X Bernoulli, u Exponential (scaled and centered), Run 1: [Figure]

SLIDES 55-56

Predictive Inference based on model m

Idea: Estimate the conditional distribution of the prediction error, i.e., L(m) ≡ N(ν(m), δ²(m)), by L̂(m) ≡ N(0, δ̂²(m)), where δ̂²(m) is defined like ρ̂²(m) before.

Theorem: For each fixed sample size n and uniformly over all data-generating processes as in (1), we have

$$ P\left( \big\| \hat L(m) - L(m) \big\|_{TV} \;>\; \frac{1}{\sqrt n} + \varepsilon \right) \;\le\; 7\,\exp\!\left[ -\frac{n-|m|}{2}\,\frac{\varepsilon^2}{\varepsilon+2} \right] $$

for each ε ≤ log(2) ≈ 0.69.

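For given parameters, the total-variation distance between the plug-in estimate L̂(m) and L(m) is easy to evaluate numerically; a sketch (mine) using the identity TV = ½ ∫ |f₁ − f₂|:

```python
import numpy as np
from scipy.stats import norm

def tv_normals(mu1, sd1, mu2, sd2, grid=200001, width=12.0):
    """Total-variation distance between N(mu1, sd1^2) and N(mu2, sd2^2),
    by numerical integration of |f1 - f2| on a wide grid."""
    lo = min(mu1 - width * sd1, mu2 - width * sd2)
    hi = max(mu1 + width * sd1, mu2 + width * sd2)
    x = np.linspace(lo, hi, grid)
    f1, f2 = norm.pdf(x, mu1, sd1), norm.pdf(x, mu2, sd2)
    return 0.5 * float(np.sum(np.abs(f1 - f2)) * (x[1] - x[0]))

# e.g. L(m) = N(0.1, 1.0) versus the plug-in L_hat(m) = N(0, 1.05):
print(tv_normals(0.1, 1.0, 0.0, np.sqrt(1.05)))
```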

SLIDES 57-59

Prediction intervals post model selection

Recall that ŷ(f)(m) − y(f) | X, Y ∼ N(ν(m), δ²(m)) ≡ L(m) for each m ∈ M. Based on model m, the 'prediction interval' ŷ(f)(m) − ν(m) ± q_{α/2} δ(m) has coverage probability 1 − α conditional on the training sample X, Y, but it is infeasible. In terms of the width of this interval, the 'best' model is one that minimizes δ(m); set m° = argmin_{m∈M} δ²(m). For fixed m ∈ M, a feasible prediction interval is

$$ I(m):\quad \hat y^{(f)}(m) \;\pm\; q_{\alpha/2}\,\hat\delta(m). $$

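A sketch of the feasible interval I(m) (mine); per the slide, δ̂²(m) is computed with the same rescaling as ρ̂²(m), and q_{α/2} is the standard-normal quantile:

```python
import numpy as np
from scipy.stats import norm

def prediction_interval(X, Y, m, x_f, alpha=0.05):
    """Feasible interval I(m): y_hat^(f)(m) +/- q_{alpha/2} * delta_hat(m)."""
    n, k = len(Y), len(m)
    coef, *_ = np.linalg.lstsq(X[:, m], Y, rcond=None)
    y_hat = float(x_f[m] @ coef)
    rss = float(np.sum((Y - X[:, m] @ coef) ** 2))
    delta_hat = np.sqrt((rss / (n - k)) * n / (n + 1 - k))
    q = norm.ppf(1 - alpha / 2)
    return y_hat - q * delta_hat, y_hat + q * delta_hat
```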

SLIDE 60

Prediction interval is approx. valid & adaptive

Proposition: For each ε ≤ log 2 and each fixed sample size n, we have

$$ \Big| (1-\alpha) - P\big( y^{(f)} \in I(\hat m) \,\big|\, Y, X \big) \Big| \;\le\; \frac{1}{\sqrt n} + \varepsilon \qquad\text{and}\qquad \left| \log \frac{\hat\delta(\hat m)}{\delta(m^\circ)} \right| \;\le\; \varepsilon, $$

except on an event whose probability is not larger than

$$ 11\,\exp\!\left[ \log \#M \;-\; \frac{n-|M|}{2}\,\frac{\varepsilon^2}{\varepsilon+2} \right], $$

uniformly over all data-generating processes as in (1).


SLIDES 61-63

Conclusion

Caution: The ‘large p / small n’ behavior of model selectors can be markedly different from their properties for ‘small p / large n’. Proof of concept: The two goals are achievable In ‘large p / small n’ settings and under minimal assumptions, good models can be found, and the resulting prediction intervals post model selection are approximately valid and adaptive (in finite samples with high probability uniformly over all data-generating processes considered).
