From selective inference to adaptive data analysis
Xiaoying Tian Harris
December 9, 2016
Acknowledgement
My advisor:
◮ Jonathan Taylor
Other coauthors:
◮ Snigdha Panigrahi
◮ Jelena Markovic
◮ Nan Bi
Model selection
◮ Observe data (y, X), X ∈ R^{n×p}, y ∈ R^n
◮ model = lm(y ∼ X1 + X2 + X3 + X4)
  model = lm(y ∼ X1 + X2 + X4)
  model = lm(y ∼ X1 + X3 + X4)
◮ Inference after model selection
  1. Use the data to select a set of variables E
  2. Normal z-tests to get p-values
◮ Problem: inflated significance
  1. Normal z-tests need adjustment
  2. Selection is biased towards “significance”
Inflated Significance
Setup:
◮ X ∈ R^{100×200} has i.i.d. normal entries
◮ y = Xβ + ε, ε ∼ N(0, I)
◮ β = (5, . . . , 5, 0, . . . , 0), with the first 10 entries equal to 5
◮ LASSO, with nonzero coefficient set E
◮ z-tests, null p-values for i ∈ E, i ∉ {1, . . . , 10}
[Figure: histogram of null p-values after selection; concentrated near 0 rather than uniform]
Inflated Significance
Same setup, now showing the selective p-values:
[Figure: histogram of selective null p-values after selection; approximately uniform]
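The inflation in the first histogram can be reproduced with a much smaller toy example. This is a sketch, not the slide's LASSO simulation: we draw many independent null observations, keep the ones that look like “big effects”, and apply naive two-sided z-tests to the kept ones.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 20,000 independent null observations y_i ~ N(0, 1); every H0: mu_i = 0 is true
y = rng.standard_normal(20_000)

# Select the apparent "big effects" (a stand-in for the slide's LASSO selection)
selected = y[y > 1]

# Naive two-sided z-test p-values for the selected coordinates
pvals = 2 * stats.norm.sf(np.abs(selected))

# Without selection, ~5% of null p-values fall below 0.05;
# after selection the exact fraction is P(|y| > 1.96 | y > 1), about three times larger
print(round((pvals < 0.05).mean(), 3))
```

The selection step is what breaks the z-test: among the survivors of `y > 1`, small p-values are over-represented even though every null hypothesis is true.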
Selective inference: features and caveat
◮ Specific to particular selection procedures
◮ Exact post-selection tests
◮ More powerful tests
Selective inference: popping the hood
Consider the selection for “big effects”:
◮ X1, . . . , Xn i.i.d. ∼ N(0, 1), X̄ = (1/n) Σ_{i=1}^n X_i
◮ Select for “big effects”: X̄ > 1
◮ Observation: X̄_obs = 1.1, with n = 5
◮ Normal z-test vs. selective test for H0 : µ = 0
[Figure: original distribution of X̄ vs. the conditional distribution after selection, truncated at 1]
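For this toy example the selective test is just a truncated-normal tail probability, so both p-values have closed forms. A sketch of the computation (not the talk's code):

```python
import numpy as np
from scipy.stats import norm

n, xbar_obs, cutoff = 5, 1.1, 1.0
sd = 1 / np.sqrt(n)                  # X̄ ~ N(0, 1/n) under H0: mu = 0

# Naive z-test: ignores that we only test because X̄ > 1
p_naive = norm.sf(xbar_obs, scale=sd)

# Selective test: reference measure is N(0, 1/n) truncated to (1, ∞),
# so the p-value is a ratio of tail probabilities
p_selective = norm.sf(xbar_obs, scale=sd) / norm.sf(cutoff, scale=sd)

# p_naive is well below 0.05; p_selective is not significant at all
print(round(p_naive, 4), round(p_selective, 4))
```

The same observation X̄_obs = 1.1 looks highly significant to the naive z-test but unremarkable once the reference measure accounts for the selection X̄ > 1.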
Selective inference: in a nutshell
◮ Selection, e.g. X̄ > 1
◮ Change of the reference measure
  ◮ the conditional distribution, e.g. N(µ, 1/n) truncated at 1
◮ Target of inference may depend on the outcome of selection
  ◮ Example: selection by the LASSO
What is the “selected” model?
Suppose a set of variables E is suggested by the data for further investigation.
◮ Selected model, Fithian et al. (2014):
  M_E = {N(X_E β_E, σ_E² I) : β_E ∈ R^{|E|}, σ_E² > 0}. Target is β_E.
◮ Full model, Lee et al. (2016), Berk et al. (2013):
  M = {N(µ, σ²I) : µ ∈ R^n}. Target is β_E(µ) = X_E^† µ.
◮ Nonparametric model:
  M = {F^{⊗n} : (X, Y) ∼ F}. Target is β_E(F) = E_F[X_E^T X_E]^{−1} E_F[X_E · Y].
A tool for valid inference after exploratory data analysis.
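For a fixed selected set E, the plug-in estimates of these targets all reduce to the least-squares fit on X_E: X_E^† y estimates the full-model target X_E^† µ, and the sample-moment version of E_F[X_E^T X_E]^{−1} E_F[X_E · Y] gives the same vector. A small sketch (the data and the set E here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, 0.0]) + rng.standard_normal(n)
E = [0, 1, 3]                        # a selected set, assumed given

XE = X[:, E]
# Full-model / selected-model plug-in: least squares on E, i.e. X_E^† y
beta_E = np.linalg.pinv(XE) @ y
# Nonparametric plug-in: sample moments for E_F[X_E^T X_E]^{-1} E_F[X_E · Y]
beta_F = np.linalg.solve(XE.T @ XE / n, XE.T @ y / n)
print(np.allclose(beta_E, beta_F))
```

The models differ in what the target *means* (a true coefficient, a projection of µ, or a population moment), not in how the point estimate is computed.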
Selective inference on a DAG
[Diagram: ω → (X*, y*) → E, with (X, y) → Ē]
◮ Incorporate randomness through ω
  1. (X*, y*) = (X, y)
  2. (X*, y*) = (X1, y1)
  3. (X*, y*) = (X, y + ω)
◮ Reference measure conditions on E, the yellow node
◮ Target of inference can be Ē
  1. Not E, but depends on the data through E
  2. “Liberating” the target of inference from selection
  3. Ē incorporates knowledge from the previous literature
From selective inference to adaptive data analysis
Denote the data by S.
[Diagram: ω → E, with S → Ē]

From selective inference to adaptive data analysis
Denote the data by S.
[Diagram: two rounds of selection, ω1 → E1 and ω2 → E2, with S feeding both and target Ē]
Reference measure after selection
◮ Given any point null F0, use the conditional distribution F0* as the reference measure,
  dF0*/dF0 (S) = ℓ_F(S).
◮ ℓ_F is called the selective likelihood ratio. It depends on the selection algorithm and the randomization distribution ω ∼ G.
◮ Tests of the form H0 : θ(F) = θ0 can be reduced to tests of point nulls, e.g. via
  ◮ the score test
  ◮ conditioning in exponential families
Computing the reference measure after selection
◮ The selection map Q̂ results from an optimization problem,
  β̂(S, ω) = argmin_β ℓ(S; β) + P(β) + ω^T β.
  E is the active set of β̂.
◮ Selection region: A(S) = {ω : Q̂(S, ω) = E}, ω ∼ G,
  dF0*/dF0 (S) = ∫_{A(S)} dG(ω).
[Diagram: ω → E, with S feeding the selection]
◮ The event {Q̂(S, ω) = E} is difficult to describe directly.
Computing the reference measure after selection
◮ Instead of describing {Q̂(S, ω) = E} directly, let ẑ(S, ω) be the subgradient of the optimization problem and change variables from ω to (β̂_E, ẑ_{−E}).
[Diagram: the selection node E now determined by (β̂_E, ẑ_{−E}) and S]
◮ The selection event becomes {(β̂_E, ẑ_{−E}) ∈ B}, where B depends only on the penalty P.
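For the randomized LASSO (squared-error loss ℓ, penalty P = λ‖·‖₁), the KKT conditions make the map between ω and (β̂_E, ẑ_{−E}) explicit: at the optimum, X^T(Xβ̂ − y) + λẑ + ω = 0. A sketch with a hand-rolled coordinate-descent solver (an illustration of the change of variables, not the talk's implementation):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def randomized_lasso(X, y, omega, lam, n_iter=500):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 + omega^T b by coordinate descent."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]       # partial residual
            beta[j] = soft_threshold(X[:, j] @ r - omega[j], lam) / col_sq[j]
    return beta

rng = np.random.default_rng(3)
n, p, lam = 50, 10, 10.0
X = rng.standard_normal((n, p))
y = X @ np.r_[2.0, 2.0, 2.0, np.zeros(p - 3)] + rng.standard_normal(n)
omega = rng.standard_normal(p)                         # randomization, omega ~ G

beta_hat = randomized_lasso(X, y, omega, lam)
E = beta_hat != 0

# KKT stationarity: X^T(X beta_hat - y) + lam * z_hat + omega = 0, so the
# subgradient z_hat (and hence omega) is recoverable from (beta_hat, z_hat)
z_hat = -(X.T @ (X @ beta_hat - y) + omega) / lam
print(E.sum(), np.allclose(z_hat[E], np.sign(beta_hat[E]), atol=1e-3))
```

On the active set the subgradient equals sign(β̂_E), and off the active set it lies in [−1, 1]; that box constraint is exactly the set B, which depends only on the penalty.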
Monte-Carlo sampler for the conditional distribution
Suppose F0 has density f0 and G has density g.
[Diagram: E determined by (β̂_E, ẑ_{−E}) and S]
dF0*/dF0 (S) = ∫_B g(ω(S, β̂_E, ẑ_{−E})) |J(S, β̂_E, ẑ_{−E})| dβ̂_E dẑ_{−E},
where ω(·) is the reconstruction map given by the KKT conditions and J is its Jacobian.
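With randomization, the conditional distribution can also be approximated by simple rejection: draw (S, ω) jointly under (F0, G) and keep S whenever the selection event occurs. A sketch for the running mean example, with a Gaussian G as an assumed randomization (the slide's sampler targets the change-of-variables density directly; this is only the brute-force version of the same measure):

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_draws = 5, 200_000
sd = 1 / np.sqrt(n)

# Joint draws under the null F0 (X̄ ~ N(0, 1/n)) and randomization G = N(0, 0.25)
xbar = rng.normal(0.0, sd, size=n_draws)
omega = rng.normal(0.0, 0.5, size=n_draws)

# Keep the draws where the *randomized* statistic crosses the selection cutoff
kept = xbar[xbar + omega > 1.0]

# The conditional (selective) distribution of X̄ is shifted upward, but
# smoother than the hard truncation at 1 obtained without randomization
print(round(kept.mean(), 2), round(xbar.mean(), 2))
```

Rejection sampling is wasteful when the selection event is rare, which is why a sampler built on the explicit density over B is preferable in practice.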