
slide-1
SLIDE 1

Optimal Inference After Model Selection

Will Fithian Joint work with Dennis Sun & Jonathan Taylor December 11, 2015

slide-2
SLIDE 2

Outline

  1 Introduction
  2 Inference After Selection
  3 Linear Regression
  4 Other Examples

slide-3
SLIDE 3

Two Stages

Two stages of a statistical investigation:

  1. Selection: Choose a probabilistic model for the data, formulate an inference problem. Ask a question.
  2. Inference: Attempt the problem using data & selected model. Answer the question.

slide-4
SLIDE 4

Two Stages

Two stages of a statistical investigation:

  1. Selection: Choose a probabilistic model for the data, formulate an inference problem. Ask a question.
  2. Inference: Attempt the problem using data & selected model. Answer the question.

Classical admonishment: no looking at data until stage 2.
Actual practice: choose variables, check for interactions, overdispersion, ...
slide-5
SLIDE 5

Two Stages

Two stages of a statistical investigation:

  1. Selection: Choose a probabilistic model for the data, formulate an inference problem. Ask a question.
  2. Inference: Attempt the problem using data & selected model. Answer the question.

Classical admonishment: no looking at data until stage 2.
Actual practice: choose variables, check for interactions, overdispersion, ...

How should we relax the classical view?

slide-6
SLIDE 6

Naive Inference After Selection

What is wrong with naive inference after selection?

Example (File Drawer Effect): Observe independent Yi ∼ N(µi, 1), i = 1, . . . , n.

  1. Restrict attention to apparently large effects: Î = {i : |Yi| > 1}
  2. Run a nominal level-α test of H0,i : µi = 0 for each i ∈ Î (e.g., α = 0.05: reject if |Yi| > 1.96)

slide-7
SLIDE 7

Naive Inference After Selection

What is wrong with naive inference after selection?

Example (File Drawer Effect): Observe independent Yi ∼ N(µi, 1), i = 1, . . . , n.

  1. Restrict attention to apparently large effects: Î = {i : |Yi| > 1}
  2. Run a nominal level-α test of H0,i : µi = 0 for each i ∈ Î (e.g., α = 0.05: reject if |Yi| > 1.96)

“Everyone knows” this is invalid. Why?
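The invalidity is easy to see by simulation. A minimal Monte Carlo sketch (hypothetical code, not from the slides) under the global null, where every µi = 0:

```python
import numpy as np

# File drawer effect under the global null (all mu_i = 0): among the
# selected effects |Y_i| > 1, the nominal 5% rule |Y_i| > 1.96 rejects
# at rate P(|Y| > 1.96) / P(|Y| > 1) = 0.05 / 0.317, about 16%.
rng = np.random.default_rng(0)
y = rng.normal(size=1_000_000)
selected = np.abs(y) > 1.0
naive_rate = np.mean(np.abs(y[selected]) > 1.96)
```

The empirical selective type I error comes out roughly three times the nominal 5%.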

slide-8
SLIDE 8

Naive Inference After Selection

Problem: frequency properties among selected nulls:

  (# false rejections) / (# true nulls tested) → PH0,i(i ∈ Î, reject H0,i) / P(i ∈ Î) = PH0,i(reject H0,i | i ∈ Î)

slide-9
SLIDE 9

Naive Inference After Selection

Problem: frequency properties among selected nulls:

  (# false rejections) / (# true nulls tested) → PH0,i(i ∈ Î, reject H0,i) / P(i ∈ Î) = PH0,i(reject H0,i | i ∈ Î)

Solution: directly control the selective type I error rate PH0,i(reject H0,i | i ∈ Î)

Example: PH0,i(|Yi| > 2.41 | |Yi| > 1) = 0.05

slide-10
SLIDE 10

Naive Inference After Selection

Problem: frequency properties among selected nulls:

  (# false rejections) / (# true nulls tested) → PH0,i(i ∈ Î, reject H0,i) / P(i ∈ Î) = PH0,i(reject H0,i | i ∈ Î)

Solution: directly control the selective type I error rate PH0,i(reject H0,i | i ∈ Î)

Example: PH0,i(|Yi| > 2.41 | |Yi| > 1) = 0.05

Guiding principle when asking random questions: the answer must be valid, given that the question was asked.
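The 2.41 cutoff above can be recovered numerically; a sketch using SciPy (the variable names are mine):

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Find c with P(|Y| > c | |Y| > 1) = 0.05 under H0: mu = 0, i.e. solve
# 2 * P(Y > c) = alpha * P(|Y| > 1).  Slide value: c ≈ 2.41.
alpha = 0.05
p_select = 2 * norm.sf(1.0)            # P(|Y| > 1) ≈ 0.317
c = brentq(lambda t: 2 * norm.sf(t) - alpha * p_select, 1.0, 10.0)
```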

slide-11
SLIDE 11

False Coverage-Statement Rate

Benjamini & Yekutieli (2005): CIs for selected parameters, e.g.

  • selected genes in GWAS
  • selected treatment in clinical trials

Analog of FDR, the false coverage-statement rate:

  FCR = E[ (# non-covering CIs) / (1 ∨ # CIs constructed) ] ≤ α

Conditional inference used as a device for FCR control (Weinstein, Fithian, & Benjamini 2013). Also used to correct bias (e.g., Sampson & Sill, 2005; Zöllner & Pritchard, 2007; Zhong & Prentice, 2008).

Difference in perspective: should we average over questions?

slide-12
SLIDE 12

Motivating Example 1: Verifying the Winner

Setup: Quinnipiac poll of 667 Iowa Republicans, May 2014:

  Rank  Candidate        Result
  1.    Scott Walker     21%
  2.    Rand Paul        13%
  3.    Marco Rubio      13%
  4.    Ted Cruz         12%
  ...
  14.   Bobby Jindal     1%
  15.   Lindsey Graham   0%

Question: Is Scott Walker really winning? By how much?

Problem: Winner’s curse. This is “question selection,” not really “model selection.” Related to subset selection (Gupta & Nagel 1967, and others).
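A stylized Monte Carlo sketch of the winner’s curse (hypothetical setup, not the poll data: 15 candidates whose true effects are all equal), showing why a naive interval for the winner misbehaves:

```python
import numpy as np

# Winner's curse sketch: K = 15 candidates, all true means 0.  The naive
# 95% interval for the winner's mean, Y_max ± 1.96, misses 0 whenever
# Y_max > 1.96, which happens with probability 1 - 0.975**15 ≈ 0.32.
rng = np.random.default_rng(0)
K, reps = 15, 200_000
winners = rng.normal(size=(reps, K)).max(axis=1)
miss_rate = np.mean(winners > 1.96)
```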

slide-13
SLIDE 13

Motivating Example 2: Inference After Model Checking

Two-sample problem: X1, . . . , Xm i.i.d. ∼ F1; Y1, . . . , Yn i.i.d. ∼ F2

slide-14
SLIDE 14

Motivating Example 2: Inference After Model Checking

Two-sample problem: X1, . . . , Xm i.i.d. ∼ F1; Y1, . . . , Yn i.i.d. ∼ F2

Test the Gaussian model based on the normalized residuals

  R = ( (X1 − X̄)/SX, . . . , (Xm − X̄)/SX, (Y1 − Ȳ)/SY, . . . , (Yn − Ȳ)/SY )

  • If the test rejects, use a permutation test (e.g., Wilcoxon): F1 = ?, F2 = ?, H0 : F1 = F2
  • Otherwise, use the two-sample t-test: F1 = N(µ, σ²), F2 = N(ν, τ²), H0 : µ = ν

Model selection, in the strong sense.

slide-15
SLIDE 15

Motivating Example 3: Regression After Variable Selection

E.g., solve the lasso at fixed λ > 0 (Tibshirani, 1996):

  γ̂ = argmin_γ ‖Y − Xγ‖²₂ + λ‖γ‖₁

“Active set” E = {j : γ̂j ≠ 0} induces the selected model M(E):

  Y ∼ N( XE βE, σ²In )
slide-16
SLIDE 16

Motivating Example 3: Regression After Variable Selection

E.g., solve the lasso at fixed λ > 0 (Tibshirani, 1996):

  γ̂ = argmin_γ ‖Y − Xγ‖²₂ + λ‖γ‖₁

“Active set” E = {j : γ̂j ≠ 0} induces the selected model M(E):

  Y ∼ N( XE βE, σ²In )

Can we get valid tests / intervals for βE_j, j ∈ E? Lee, Sun, Sun, & Taylor (2013) studied a slightly different problem (inference w.r.t. a different model).

slide-17
SLIDE 17

Random Model, Random Null

Testing null hypothesis H0 in model M:

  Selective error rate: PM,H0(reject H0 | (M, H0) selected)
  Nominal error rate: PM,H0(reject H0)

slide-18
SLIDE 18

Random Model, Random Null

Testing null hypothesis H0 in model M:

  Selective error rate: PM,H0(reject H0 | (M, H0) selected)
  Nominal error rate: PM,H0(reject H0)

“Kosher” adaptive selection: two independent experiments

  • Select M, H0 based on exploratory experiment 1
  • Test using confirmatory experiment 2
slide-19
SLIDE 19

Random Model, Random Null

Testing null hypothesis H0 in model M:

  Selective error rate: PM,H0(reject H0 | (M, H0) selected)
  Nominal error rate: PM,H0(reject H0)

“Kosher” adaptive selection: two independent experiments

  • Select M, H0 based on exploratory experiment 1
  • Test using confirmatory experiment 2

M, H0 random, but no adjustment necessary: PM,H0(reject H0 | (M, H0) selected) = PM,H0(reject H0).

slide-20
SLIDE 20

Data Splitting

Assume Y = (Y1, Y2) with Y1 ⊥⊥ Y2. Data splitting mimics the exploratory / confirmatory split:

  • Select model based on Y1
  • Analyze Y2 as though model chosen “ahead of time.”

Again, no adjustment necessary: PM,H0(reject H0 | (M, H0) selected) = PM,H0(reject H0).

slide-21
SLIDE 21

Data Splitting

Assume Y = (Y1, Y2) with Y1 ⊥⊥ Y2. Data splitting mimics the exploratory / confirmatory split:

  • Select model based on Y1
  • Analyze Y2 as though model chosen “ahead of time.”

Again, no adjustment necessary: PM,H0(reject H0 | (M, H0) selected) = PM,H0(reject H0).

Objections to data splitting:

  • less data for selection
  • less data for inference
  • not always possible (e.g., autocorrelated data)
slide-22
SLIDE 22

Data Carving

Think of the data as “revealed in stages.” Let A = {(M, H0) selected}:

  F0 ⊆ F(1A(Y)) [used for selection] ⊆ F(Y) [used for inference]

slide-23
SLIDE 23

Data Carving

Think of the data as “revealed in stages.” Let A = {(M, H0) selected}:

  F0 ⊆ F(1A(Y)) [used for selection] ⊆ F(Y) [used for inference]

Conditioning on A in stage two ⇐⇒ Y ∈ A is excluded as evidence against H0

slide-24
SLIDE 24

Data Carving

Think of the data as “revealed in stages.” Let A = {(M, H0) selected}:

  F0 ⊆ F(1A(Y)) [used for selection] ⊆ F(Y) [used for inference]

Conditioning on A in stage two ⇐⇒ Y ∈ A is excluded as evidence against H0

Data splitting conditions on Y1 instead of 1A(Y1):

  F0 ⊆ F(1A(Y1)) [used for selection] ⊆ F(Y1) [wasted] ⊆ F(Y1, Y2) [used for inference]

Data carving: use all leftover information for inference.

slide-25
SLIDE 25

Lasso Partition

[Figure: sample space partitioned by the lasso selection event. Yellow region: {y : variables 1, 3 selected}]

slide-26
SLIDE 26

Lasso Partition

fit <- glmnet(X, Y)
M.hat <- which(coef(fit, s = lambda)[-1] != 0)  # s = penalty; [-1] drops the intercept row

slide-27
SLIDE 27

Goals

Prior work on linear regression after selection with σ² known: Lockhart et al. (2014), Tibshirani et al. (2014), Lee et al. (2013), Loftus & Taylor (2014), Lee & Taylor (2014), ...

Our goals:

  1 Formalize inference after selection
  2 Understand power — can it be improved?
  3 Generalize to unknown σ²
  4 Generalize to other exponential families

slide-28
SLIDE 28

Outline

  1 Introduction
  2 Inference After Selection
  3 Linear Regression
  4 Other Examples

slide-29
SLIDE 29

Selective Hypothesis Tests

Setup: Observe Y ∼ F on a space (Y, F), F unknown.

Question space: collection Q of all candidate testing problems q. A testing problem is a pair q = (M, H0) of

  • model M(q) (family of distributions)
  • null hypothesis H0(q) ⊆ M(q). (wlog H1 = M \ H0)
slide-30
SLIDE 30

Selective Hypothesis Tests

Setup: Observe Y ∼ F on a space (Y, F), F unknown.

Question space: collection Q of all candidate testing problems q. A testing problem is a pair q = (M, H0) of

  • model M(q) (family of distributions)
  • null hypothesis H0(q) ⊆ M(q). (wlog H1 = M \ H0)

Two stages:

  1. Selection: Select a subset Q(Y) ⊆ Q to test.
  2. Inference: Test H0 vs. M \ H0 for each q = (M, H0) ∈ Q(Y).

slide-31
SLIDE 31

Selective Hypothesis Tests

Design a hypothesis test φq : Y → [0, 1] for question q. We only care about its behavior on the selection event Aq = {q ∈ Q(Y)}, the event that q was asked.

slide-32
SLIDE 32

Selective Hypothesis Tests

Design a hypothesis test φq : Y → [0, 1] for question q. We only care about its behavior on the selection event Aq = {q ∈ Q(Y)}, the event that q was asked.

Test φq is a selective level-α test if

  EF[φq(Y) | Aq] ≤ α, ∀F ∈ H0

Selective power function: Powφq(F | Aq) = EF[φq(Y) | Aq]

slide-33
SLIDE 33

Selective Hypothesis Tests

Design a hypothesis test φq : Y → [0, 1] for question q. We only care about its behavior on the selection event Aq = {q ∈ Q(Y)}, the event that q was asked.

Test φq is a selective level-α test if

  EF[φq(Y) | Aq] ≤ α, ∀F ∈ H0

Selective power function: Powφq(F | Aq) = EF[φq(Y) | Aq]

NB: Selective level is defined w.r.t. F ∈ M(q) ⇒ tests can be designed “one (M, H0) at a time.”

slide-34
SLIDE 34

What If the Model Is Wrong?

Some (all?) models M are probably misspecified (F ∉ M), and we don’t know which.

Non-adaptive inference:

  • Size of φ defined w.r.t. the selected model M
  • Guarantees vacuous when F ∉ M
  • Try to select a correct or “close enough” M

Adaptive inference:

  • Same situation: selective size of φq defined w.r.t. M(q)
  • Benefit: allowed to check model
slide-35
SLIDE 35

Conditioning on Selection Variables

Sometimes we want to condition on more than Aq:

  {Sq = s} ⊆ Aq ⊆ Y

More generally, we can condition on a finer selection variable Sq(Y), with Aq ∈ F(Sq).

slide-36
SLIDE 36

Conditioning on Selection Variables

Sometimes we want to condition on more than Aq:

  {Sq = s} ⊆ Aq ⊆ Y

More generally, we can condition on a finer selection variable Sq(Y), with Aq ∈ F(Sq), e.g.

  • Sq(Y) = Y1 (data splitting)
  • Sq(Y) = active variables and signs (inference after lasso)

Reason: tractable computation
  • can control FCR with Sq(Y) = (1Aq(Y), |Q(Y)|)
Reason: stronger inferential guarantee

slide-37
SLIDE 37

Conditioning Discards Information

φq has selective level α w.r.t. Sq if

  EF[φq(Y) | Sq(Y)] ≤ α a.s. on Aq, ∀F ∈ H0

More stringent when Sq is finer. Finest: Sq(Y) = Y; coarsest: Sq(Y) = 1Aq(Y).

Cost: conditioning on Sq ⇐⇒ ignoring evidence in Sq

slide-38
SLIDE 38

Leftover Information

After conditioning on S(Y) = s, the leftover information is

  IY|S(θ; s) = Var[ ∇ℓ(θ; Y | S = s) | S = s ]

Can characterize:

  E[ IY|S(θ; S) ] = IY(θ) − IS(θ) ≤ IY(θ)

IS(θ): the (average) price of selection
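A small numerical illustration under assumptions matching the next slide’s example (Y ∼ N(µ, 1) with selection A = {Y > 3}; in this one-parameter exponential family the leftover information is the conditional variance of the sufficient statistic):

```python
import numpy as np
from scipy.stats import truncnorm

# Leftover Fisher information for Y ~ N(mu, 1) given A = {Y > 3}:
# here it equals Var_mu(Y | Y > 3).  It is near 0 when mu is far below
# the cutoff (selection consumes nearly all the information) and
# approaches the full information 1 as mu grows.
def leftover_info(mu, cutoff=3.0):
    return truncnorm.var(cutoff - mu, np.inf, loc=mu, scale=1.0)

info_far = leftover_info(0.0)    # ≈ 0.07
info_near = leftover_info(6.0)   # ≈ 0.99
```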

slide-39
SLIDE 39

Leftover Information

Example: Y ∼ N(µ, 1), A = {Y > 3}.

[Figure: left panel, “Leftover Fisher Information” as a function of µ; right panel, “Selective Confidence Interval” vs. nominal CI as a function of the observed Y.]

slide-40
SLIDE 40

Selective Tests for Exponential Families

Goal: Test H0 : θ = θ0, with nuisance parameter ζ, where Y has density

  fθ,ζ(y) = exp( θ T(y) + ζ′U(y) − ψ(θ, ζ) ) f0(y)
slide-41
SLIDE 41

Selective Tests for Exponential Families

Goal: Test H0 : θ = θ0, with nuisance parameter ζ, where Y has density

  fθ,ζ(y) = exp( θ T(y) + ζ′U(y) − ψ(θ, ζ) ) f0(y)

Selection event A:

  Y | A ∼ exp( θ T(y) + ζ′U(y) − ψA(θ, ζ) ) f0(y) 1A(y)
slide-42
SLIDE 42

Selective Tests for Exponential Families

Goal: Test H0 : θ = θ0, with nuisance parameter ζ, where Y has density

  fθ,ζ(y) = exp( θ T(y) + ζ′U(y) − ψ(θ, ζ) ) f0(y)

Selection event A:

  Y | A ∼ exp( θ T(y) + ζ′U(y) − ψA(θ, ζ) ) f0(y) 1A(y)

Conditioning on U eliminates ζ; base the test on the one-parameter family Lθ(T | U, Y ∈ A).

Side constraint (selective unbiasedness): Eθ[φ(Y) | A] ≥ α, ∀θ ≠ θ0

slide-43
SLIDE 43

Selective Tests for Exponential Families

Y | Y ∈ A ∼ exp( θ T(y) + ζ′U(y) − ψA(θ, ζ) ) f0(y) 1A(y)

Proposal (Fithian, Sun & Taylor 2014)

The UMPU selective level-α test φ of H0 : θ = θ0 rejects for {T < C1(U)} ∪ {T > C2(U)}, with the Ci chosen so that

  Eθ0[φ(T, U) | U, A] = α                           (selective level α)
  Eθ0[T φ(T, U) | U, A] = α Eθ0[T | U, A]           (selectively unbiased)

Follows from Lehmann & Scheffé (1955). Solve for the cutoffs using Monte Carlo (sampling can be hard). Also shown: data splitting is typically inadmissible.

slide-44
SLIDE 44

Data Splitting is Inadmissible

Compare the optimal test to data splitting for Y1, Y2 i.i.d. ∼ N(µ, 1), A = {Y1 > 3}. The optimal test is based on L(Y1 + Y2 | Y1 > 3); data splitting is based on L(Y2).

[Figure: left panel, “Leftover Fisher Information” vs. µ; right panel, “Expected CI Length” vs. µ; data carving retains more information and gives shorter intervals than data splitting.]
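The comparison can be reproduced approximately by simulation. A rough Monte Carlo sketch (my construction, using one-sided tests of H0 : µ = 0 for simplicity; the slide compares leftover information and CI length instead):

```python
import numpy as np
from scipy.stats import norm, truncnorm

# Y1, Y2 iid N(mu, 1); selection A = {Y1 > 3}.  Data carving tests with
# T = Y1 + Y2 conditional on A; data splitting uses only Y2.
rng = np.random.default_rng(0)
N, alpha, cutoff = 200_000, 0.05, 3.0

def carving_stat(mu, n):
    y1 = truncnorm.rvs(cutoff - mu, np.inf, loc=mu, size=n, random_state=rng)
    y2 = rng.normal(mu, 1.0, size=n)
    return y1 + y2

crit = np.quantile(carving_stat(0.0, N), 1 - alpha)  # selective null quantile
power_carving = np.mean(carving_stat(1.0, N) > crit)
power_splitting = norm.sf(norm.isf(alpha) - 1.0)     # exact, ≈ 0.26
```

At µ = 1 the carving test is already somewhat more powerful, consistent with the inadmissibility claim above.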

slide-45
SLIDE 45

Outline

  1 Introduction
  2 Inference After Selection
  3 Linear Regression
  4 Other Examples

slide-46
SLIDE 46

Linear Regression

Gaussian response Y ∈ Rn, regressors X ∈ Rn×p. Select active set E ⊆ {1, . . . , p} based on lasso, LARS, forward stepwise, ...

Inference w.r.t. the selected linear model Y ∼ N(XE βE, σ²In). Exponential family in (βE, σ²) ⇒ ∃ UMPU selective test for H0 : βE_j = 0

slide-47
SLIDE 47

Linear Regression: Selected Model

Y has density

  (2πσ²)^(−n/2) exp( −(1/(2σ²)) (y − XE β)′(y − XE β) )

slide-48
SLIDE 48

Linear Regression: Selected Model

Rewriting in exponential-family form:

  Y ∼ exp( (1/σ²) Σ_{k∈E} βk Xk′y − (1/(2σ²)) ‖y‖² − ψ(β, σ²) ) f0(y)
slide-49
SLIDE 49

Linear Regression: Selected Model

Y ∼ exp( (1/σ²) Σ_{k∈E} βk Xk′y − (1/(2σ²)) ‖y‖² − ψ(β, σ²) ) f0(y)

σ² known: T(y) = Xj′y, U(y) = XE∖j′y. The selective z-test for βj on event A is based on

  Lβj( Xj′Y | XE∖j′Y, A )

  • Condition on an (n − |E|)-dim. hyperplane ∩ A
  • Hit-and-run MCMC (typically A = polytope)
  • Exact level-α tests possible w/o mixing (Besag & Clifford, 1989)

slide-50
SLIDE 50

Linear Regression: Selected Model

Y ∼ exp( (1/σ²) Σ_{k∈E} βk Xk′y − (1/(2σ²)) ‖y‖² − ψ(β, σ²) ) f0(y)

σ² unknown: T(y) = Xj′y, U(y) = (XE∖j′y, ‖y‖²). The selective t-test for βj on event A is based on

  Lβj/σ²( Xj′Y | XE∖j′Y, ‖Y‖², A )

  • Condition on an (n − |E|)-dim. hyperplane ∩ sphere ∩ A
  • Sample using the ball {‖y‖ ≤ ‖Y‖} instead of the sphere, then adjust

slide-51
SLIDE 51

Saturated Model

What if we don’t believe the linear model?

slide-52
SLIDE 52

Saturated Model

What if we don’t believe the linear model?

Idea: Y ∼ N(µ, σ²In) (saturated model); define least-squares parameters for “model” E ⊆ {1, . . . , p}:

  θE ≜ argmin_θ E‖Y − XE θ‖² = (XE′XE)^(−1) XE′µ

Used by Berk et al. (2012), Taylor et al. (2014), Lee et al. (2013), Loftus & Taylor (2014), Lee & Taylor (2014), others.

slide-53
SLIDE 53

Saturated Model

What if we don’t believe the linear model?

Idea: Y ∼ N(µ, σ²In) (saturated model); define least-squares parameters for “model” E ⊆ {1, . . . , p}:

  θE ≜ argmin_θ E‖Y − XE θ‖² = (XE′XE)^(−1) XE′µ

Used by Berk et al. (2012), Taylor et al. (2014), Lee et al. (2013), Loftus & Taylor (2014), Lee & Taylor (2014), others.

Parameters are linear contrasts: θE_j = η′µ

σ² known: the test of H0 : θE_j = 0 is based on LθE_j( η′Y | P⊥η Y, A )

slide-54
SLIDE 54

Linear Regression: Saturated Model

LθE_j( η′Y | P⊥η Y, A ): a Gaussian truncated to a “slice”

slide-55
SLIDE 55

Linear Regression: Saturated Model

LθE_j( η′Y | P⊥η Y, A ): a Gaussian truncated to a “slice”

σ² unknown: must also condition on ‖Y‖; line ∩ sphere leaves only 2 points in the support.

slide-56
SLIDE 56

Saturated vs. Selected z-Test

Usual z-statistic: Z = η′Y / (σ‖η‖)

Selected-model z-test based on LβE_j( Z | XE∖j′Y, A )
Saturated-model z-test based on LθE_j( Z | P⊥η Y, A )

  • Selected-model test more powerful (conditions on less)
  • Saturated-model test more robust (valid under weaker assumptions)
  • Hybrid approaches exist

slide-57
SLIDE 57

Simulation

Setup: regression with n = 100, p = 200, Y ∼ N(Xβ, In). True βj = 7 for j = 1, . . . , 7 and βj = 0 for j > 7. X Gaussian, pairwise correlation 0.3 between variables (columns normalized).

slide-58
SLIDE 58

Simulation

Setup: regression with n = 100, p = 200, Y ∼ N(Xβ, In). True βj = 7 for j = 1, . . . , 7 and βj = 0 for j > 7. X Gaussian, pairwise correlation 0.3 between variables (columns normalized).

Split the data into Y(1) = (Y1, . . . , Yn1), Y(2) = (Yn1+1, . . . , Y100).

Selection: lasso on Y(1) using λ = 2E‖X′ǫ‖∞, ǫ ∼ N(0, I), as suggested by Negahban et al. (2012).

Inference: two procedures

  • Data splitting (Splitn1): use Y(2) for inference
  • Data carving (Carven1): selected-model z-test
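A runnable sketch of the Split_n1 baseline under assumptions simplified from the slide (independent Gaussian design rather than correlation 0.3, a hand-picked fixed penalty rather than 2E‖X′ǫ‖∞, and the lasso fit by plain ISTA; all names are mine):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p, n1, lam = 100, 200, 50, 6.5
beta = np.zeros(p)
beta[:7] = 7.0                                   # seven strong signals
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

def lasso_ista(A, b, lam, n_iter=2000):
    # proximal gradient descent for 0.5*||b - A g||^2 + lam*||g||_1
    L = np.linalg.norm(A, 2) ** 2                # gradient Lipschitz constant
    g = np.zeros(A.shape[1])
    for _ in range(n_iter):
        u = g + A.T @ (b - A @ g) / L
        g = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)
    return g

# Stage 1 (selection): lasso on the first n1 rows, columns standardized
X1 = X[:n1] / np.linalg.norm(X[:n1], axis=0)
E = np.flatnonzero(lasso_ista(X1, y[:n1], lam))
E = E[:30]                    # cap so the held-out OLS stage is well-posed

# Stage 2 (inference): naive OLS z-tests on the held-out rows, sigma = 1
X2 = X[n1:, E]
bhat, *_ = np.linalg.lstsq(X2, y[n1:], rcond=None)
se = np.sqrt(np.diag(np.linalg.pinv(X2.T @ X2)))
pvals = 2 * norm.sf(np.abs(bhat) / se)
```

The carving procedure would instead run the selected-model z-test of Slide 49 on all 100 observations, conditioning on the selection event.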

slide-59
SLIDE 59

Selection–Inference Tradeoff

As n1 varies, there is a tradeoff between model selection quality and power.

[Figure: probability vs. # data points used for selection (20 to 100): screening probability, power of carving, power of splitting.]

slide-60
SLIDE 60

Selection–Inference Tradeoff

Robustness: the same plot for t5 errors.

[Figure: probability vs. # data points used for selection (20 to 100): screening probability, power of carving, power of splitting.]

slide-61
SLIDE 61

Outline

  1 Introduction
  2 Inference After Selection
  3 Linear Regression
  4 Other Examples

slide-62
SLIDE 62

Motivation: Iowa Caucus

Setup: Quinnipiac poll of n = 667 Iowa Republicans:

  Rank  Candidate        Result  Votes∗
  1.    Scott Walker     21%     140
  2.    Rand Paul        13%     87
  3.    Marco Rubio      13%     87
  4.    Ted Cruz         12%     80
  ...
  14.   Bobby Jindal     1%      7
  15.   Lindsey Graham   0%

Question: Is Scott Walker really winning?
Answer: Yes (p = 0.00053), by at least 2%; p = 0.022 for the Gupta & Nagel method.

slide-63
SLIDE 63

Winner vs. Runner-Up Test

Theorem (F 2015):

Let [d] denote the index of the largest count. Conclude that π[d] > max_{j<d} π[j] if the exact, two-sided binomial level-α test of H0 : π[d] ≤ π[d−1] rejects. This is a valid level-α procedure.

An analogous result is known for Gaussians (Gutmann & Maymin, 1987).
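Plugging in the Republican poll counts from the earlier slide (Walker 140, Paul 87), a sketch of the head-to-head binomial comparison the theorem licenses:

```python
from scipy.stats import binomtest

# Under H0: pi_Walker <= pi_Paul, Walker's share of the 140 + 87
# head-to-head votes is Binomial(227, 1/2); the exact two-sided test
# matches the p = 0.00053 quoted on the earlier slide up to rounding.
p = binomtest(140, 140 + 87, 0.5).pvalue
```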

slide-64
SLIDE 64

Winner vs. Runner-Up Test

Theorem (F 2015):

Let [d] denote the index of the largest count. Conclude that π[d] > max_{j<d} π[j] if the exact, two-sided binomial level-α test of H0 : π[d] ≤ π[d−1] rejects. This is a valid level-α procedure. An analogous result is known for Gaussians (Gutmann & Maymin, 1987).

The conditional approach leads to:

  • Lower confidence bound for πSW − max_{j≠SW} πj
  • Subset selection rule
  • Stepdown procedure yielding confident ranks
slide-65
SLIDE 65

Stepdown Procedure

Stepdown procedure: start with #1, keep rejecting until p > .05.

Quinnipiac poll of n = 692 Iowa Democrats:

  Rank  Candidate        Result  Votes
  1.∗   Hillary Clinton  60%     415
  2.∗   Bernie Sanders   15%     104
  3.∗   Joe Biden        11%     76
  4.∗   Don’t Know       7%      48
  5.    Jim Webb         3%      21
  6.    Martin O’Malley  3%      21
  7.    Lincoln Chafee   0%

FWER controlled at α = 0.05

slide-66
SLIDE 66

Sequential Model Selection

New work (Fithian, Taylor, Tibshirani, & Tibshirani): generate a nested model sequence in algorithmic fashion

  M0(Y) ⊆ M1(Y) ⊆ · · · ⊆ Md(Y) ⊆ M∞

e.g.

  • Forward stepwise, lasso
  • Graphical lasso
  • “Best first” decision tree

Goal: select the least complex model consistent with the data; control FDR, FWER (type I error = # of extra steps)

  • Need to condition on the subpath M0, . . . , Mk
  • Null p-values are i.i.d. uniform (use ForwardStop, accumulation tests)
  • Forward stepwise, lasso: 2p linear constraints after k steps
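The ForwardStop rule mentioned above has a one-line form; a sketch (G'Sell et al.'s rule, my implementation):

```python
import numpy as np

# ForwardStop: given sequential null p-values p_1, ..., p_d along the
# path, choose k_hat = max{ k : -(1/k) * sum_{i<=k} log(1 - p_i) <= alpha },
# which controls FDR at alpha when the null p-values are i.i.d. uniform.
def forward_stop(pvals, alpha=0.05):
    pvals = np.asarray(pvals, dtype=float)
    stats = -np.cumsum(np.log1p(-pvals)) / np.arange(1, len(pvals) + 1)
    ok = np.flatnonzero(stats <= alpha)
    return int(ok[-1] + 1) if ok.size else 0   # number of steps to keep
```

For example, five tiny p-values followed by five large ones keep exactly the first five steps.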

slide-67
SLIDE 67

Diabetes Example

  Step  Variable  Nominal p  Saturated p  Max-t p
  1     bmi       0.00       0.00         0.00
  2     ltg       0.00       0.00         0.00
  3     map       0.00       0.05         0.00
  4     age:sex   0.00       0.33         0.02
  5     bmi:map   0.00       0.76         0.08
  6     hdl       0.00       0.25         0.06
  7     sex       0.00       0.00         0.00
  8     glu2      0.02       0.03         0.32
  9     age2      0.11       0.55         0.94
  10    map:glu   0.17       0.91         0.91
  11    tc        0.15       0.37         0.25
  12    ldl       0.06       0.15         0.01
  13    ltg2      0.00       0.07         0.04
  14    age:ldl   0.19       0.97         0.85
  15    age:tc    0.08       0.15         0.03
  16    sex:map   0.18       0.05         0.40
  17    glu       0.23       0.45         0.58
  18    tch       0.31       0.71         0.82
  19    sex:tch   0.22       0.40         0.51
  20    sex:bmi   0.27       0.60         0.44

slide-68
SLIDE 68

Conclusions

  • Conditioning on selection generalizes data splitting
  • Doable in interesting problems
  • Conditioning ⇐⇒ discarding information
  • Knowledge of the selection protocol allows us not to “overcorrect”

slide-69
SLIDE 69

The End

Thanks!