

SLIDE 1

High-Dimensional Variable Selection in Nonlinear Models that Controls the False Discovery Rate

Lucas Janson

Harvard University Department of Statistics

CMSA Big Data Conference, August 18, 2017 Collaborators: Emmanuel Candès (Stanford), Yingying Fan, Jinchi Lv (USC)

SLIDE 2

Problem Statement

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 0 / 18

SLIDES 3-6

Controlled Variable Selection

Given:
Y, an outcome of interest (AKA response or dependent variable)
X1, . . . , Xp, a set of p potential explanatory variables (AKA covariates, features, or independent variables)

How can we select important explanatory variables with few mistakes?

Applications to:
Medicine/genetics/health care
Economics/political science
Industry/technology

SLIDES 7-11

Controlled Variable Selection (cont'd)

What is an important variable?
We consider Xj to be unimportant if the conditional distribution of Y given X1, . . . , Xp does not depend on Xj. Formally, Xj is unimportant if it is conditionally independent of Y given X-j:

Y ⊥⊥ Xj | X-j

Markov blanket of Y: the smallest set S such that Y ⊥⊥ X-S | XS
For GLMs with no stochastically redundant covariates, this is equivalent to {j : βj ≠ 0}

To make sure we do not make too many mistakes, we seek a selected set Ŝ that controls the false discovery rate (FDR):

FDR(Ŝ) = E[ #{j ∈ Ŝ : Xj unimportant} / #{j ∈ Ŝ} ] ≤ q (e.g. q = 10%)

"Here is a set of variables Ŝ, 90% of which I expect to be important"
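As a concrete reading of the FDR definition above, here is a minimal Python sketch (illustrative only; the function name and toy ground truth are my own, not from the talk) that computes the false discovery proportion of one selected set:

```python
import numpy as np

def false_discovery_proportion(selected, important):
    """FDP = #{selected but unimportant} / #{selected}; 0 if nothing is selected.
    `selected` and `important` are sets of variable indices."""
    if len(selected) == 0:
        return 0.0
    false_discoveries = len(set(selected) - set(important))
    return false_discoveries / len(selected)

# Toy example: variables 0-4 truly important, we select {0, 1, 2, 9}.
# One of the four selections (variable 9) is unimportant, so FDP = 0.25.
print(false_discovery_proportion({0, 1, 2, 9}, {0, 1, 2, 3, 4}))
```

The FDR is the expectation of this proportion over repeated experiments; controlling it at q = 10% is the sense in which "90% of Ŝ is expected to be important."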

SLIDES 12-15

Sneak Peek

New interpretation of knockoffs solves the controlled variable selection problem:
Allows any model for Y and X1, . . . , Xp
Allows any dimension (including p > n)
Finite-sample (non-asymptotic) control of the FDR
Practical performance on real problems

Analysis of the genetic basis of Crohn's disease (WTCCC, 2007):
≈ 5,000 subjects (≈ 40% with Crohn's disease)
≈ 375,000 single nucleotide polymorphisms (SNPs) for each subject
The original analysis of the data made 9 discoveries by running marginal tests and selecting p-values to target an FDR of 10%
Model-free knockoffs used the same FDR target of 10% and made 18 discoveries, with many of the new discoveries confirmed by a larger meta-analysis

SLIDES 16-20

Review of Methods for Controlled Variable Selection

What is required for valid inference?

            Low dimensions   Model for Y   Asymptotic regime   Sparsity   Random design
OLSp+BHq    Yes              Yes           No                  No         No
MLp+BHq     Yes              Yes           Yes                 No         No
HDp+BHq     No               Yes           Yes                 Yes       Yes
Orig KnO    Yes              Yes           No                  No         No
New KnO     No               No            No                  No         Yes*

SLIDE 21

The Knockoffs Idea

SLIDES 22-26

Knockoffs (Barber and Candès, 2015)

y and Xj are n × 1 column vectors of data: n draws from the random variables Y and Xj, respectively; design matrix X := [X1 ··· Xp]

(1) Construct knockoffs: the knockoffs X̃j (with X̃ := [X̃1 ··· X̃p]) must satisfy

[X X̃]⊤[X X̃] = ( X⊤X            X⊤X − diag{s} )
               ( X⊤X − diag{s}  X⊤X           )

(2) Compute knockoff statistics:
Sufficiency: Wj is only a function of [X X̃]⊤[X X̃] and [X X̃]⊤y
Antisymmetry: swapping the values of Xj and X̃j flips the sign of Wj

(3) Find the knockoff threshold:
Order the variables by decreasing |Wj| and proceed down the list
Select only the variables with positive Wj, stopping at the last point where (#negatives)/(#positives) ≤ q

Comments:
Finite-sample FDR control, and leverages sparsity for power
Requires the data to follow a low-dimensional (n ≥ p) Gaussian linear model
Canonical approach: condition on X, rely heavily on a model for y
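Step (3) can be sketched in a few lines of Python (a minimal illustration, not the authors' code; the `max(1, ·)` guard in the denominator is my own addition to avoid division by zero):

```python
import numpy as np

def knockoff_threshold(W, q):
    """Smallest t > 0 among the |W_j| at which
    #{j : W_j <= -t} / #{j : W_j >= t} <= q; returns inf if none qualifies."""
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        negatives = np.sum(W <= -t)
        positives = max(1, np.sum(W >= t))
        if negatives / positives <= q:
            return t
    return np.inf

def knockoff_select(W, q):
    """Indices of the variables selected: those with W_j >= threshold."""
    return np.flatnonzero(W >= knockoff_threshold(W, q))

# Toy statistics: strong positives for the first variables, small noise elsewhere.
W = np.array([5.0, 4.0, 3.0, 2.5, 2.0, -1.0, 0.5, -0.3])
print(knockoff_select(W, q=0.2))  # selects indices 0-4 and 6 at threshold t = 0.5
```

Scanning thresholds from small to large |Wj| and stopping at the first one where the negatives-to-positives ratio drops to q is the same as "proceeding down the list until the last time the ratio is ≤ q."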

SLIDES 27-30

Generalizing the Knockoffs Procedure

(1) Construct knockoffs:
Artificial versions ("knockoffs") of each variable
Act as controls for assessing the importance of the original variables

(2) Compute knockoff statistics:
A scalar statistic Wj for each variable
Measures how much more important a variable appears than its knockoff
Positive Wj means the original looks more important, with strength measured by the magnitude

(3) Find the knockoff threshold: (same as before)
Order the variables by decreasing |Wj| and proceed down the list
Select only the variables with positive Wj, stopping at the last point where (#negatives)/(#positives) ≤ q

Coin-flipping property: the key to knockoffs is that steps (1) and (2) are done specifically to ensure that, conditional on |W1|, . . . , |Wp|, the signs of the unimportant/null Wj are independently ±1 with probability 1/2

SLIDE 31

New Interpretation of Knockoffs

SLIDES 32-37

Knockoffs Without a Model for Y (Candès et al., 2016)

Instead of modeling y and conditioning on X, condition on y and model X (shifts the burden of knowledge from y onto X)
Explicitly, the rows (Xi,1, . . . , Xi,p) of X are i.i.d. ∼ G, where G can be arbitrary but is assumed known
Compared to original knockoffs, this removes:
the restriction on dimension
the linear model requirement for Y | X1, . . . , Xp
the "sufficiency" constraint on Wj
Note the rows of X must be i.i.d., not the columns (covariates)
Nothing about y's distribution is assumed or need be known
Robust to overfitting X's distribution in preliminary experiments

SLIDES 38-44

Robustness

[Figure: power and FDR versus the relative Frobenius norm error of the estimated covariance, for knockoffs built from the exact covariance ("Exact Cov"), the graphical lasso ("Graph. Lasso"), and empirical covariance estimates ("50% Emp. Cov" through "100% Emp. Cov"). Covariates are AR(1) with autocorrelation coefficient 0.3; n = 800, p = 1500; target FDR is 10%. Y comes from a binomial linear model with logit link function and 50 nonzero entries.]

SLIDES 45-49

Shifting the Burden of Knowledge

When is it appropriate?
1. Subjects sampled from a population, and
2a. Xj highly structured, well-studied, or well-understood, OR
2b. a large set of unsupervised X data (without Y's)

For instance, many genome-wide association studies satisfy all conditions:
1. Subjects sampled from a population (oversampling cases still valid)
2a. Strong spatial structure: linkage disequilibrium models, e.g., Markov chains, are well-studied and work well
2b. Other studies have collected the same or similar SNP arrays on different subjects

SLIDES 50-52

The New Knockoffs Procedure

(1) Construct knockoffs: exchangeability (=_D denotes equality in distribution)

[X1 ··· Xj ··· Xp  X̃1 ··· X̃j ··· X̃p] =_D [X1 ··· X̃j ··· Xp  X̃1 ··· Xj ··· X̃p]

(2) Compute knockoff statistics:
A variable importance measure Z
An antisymmetric function fj : R² → R, i.e., fj(z1, z2) = −fj(z2, z1)
Wj = fj(Zj, Z̃j), where Zj and Z̃j are the variable importances of Xj and X̃j, respectively

(3) Find the knockoff threshold: (same as before)
Order the variables by decreasing |Wj| and proceed down the list
Select only the variables with positive Wj, stopping at the last point where (#negatives)/(#positives) ≤ q

SLIDE 53

Step (1): Construct Knockoffs

SLIDES 54-56

Knockoff Construction

Proof that valid knockoff variables can be generated for any X distribution
If (X1, . . . , Xp) is multivariate Gaussian, exchangeability reduces to matching first and second moments when Xj and X̃j are swapped. For Cov(X1, . . . , Xp) = Σ:

Cov(X1, . . . , Xp, X̃1, . . . , X̃p) = ( Σ            Σ − diag{s} )
                                     ( Σ − diag{s}  Σ           )

For non-Gaussian X, this still yields second-order-correct approximate knockoffs
Linear algebra and semidefinite programming are used to find a good s
Recently: constructions for Markov chains and HMMs (Sesia et al., 2017)
Constructions are also possible for grouped variables (Dai and Barber, 2016)
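In the Gaussian case, the joint covariance above determines a conditional Gaussian from which knockoffs can be sampled directly: X̃ | X ~ N(X(I − Σ⁻¹D), 2D − DΣ⁻¹D) with D = diag{s}. A minimal sketch (my own illustration, assuming the equi-correlated choice s = min(2·λmin(Σ), 1)·1 rather than the semidefinite-programming choice mentioned on the slide):

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, rng=None):
    """Sample model-X knockoffs for rows X_i ~ N(0, Sigma).

    Joint covariance of (X, X_tilde) is [[Sigma, Sigma - D], [Sigma - D, Sigma]]
    with D = diag{s}, giving X_tilde | X ~ N(X(I - Sigma^{-1} D), 2D - D Sigma^{-1} D).
    """
    rng = np.random.default_rng(rng)
    p = Sigma.shape[0]
    # Equi-correlated s: largest common value keeping the joint covariance PSD.
    s = min(2.0 * np.linalg.eigvalsh(Sigma)[0], 1.0) * np.ones(p)
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    mean = X @ (np.eye(p) - Sigma_inv_D)
    cond_cov = 2.0 * D - D @ Sigma_inv_D
    L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))  # jitter for stability
    return mean + rng.standard_normal(X.shape) @ L.T

# AR(1) covariance with autocorrelation 0.3, as in the robustness experiment.
p = 5
Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = np.random.default_rng(0).multivariate_normal(np.zeros(p), Sigma, size=1000)
X_tilde = gaussian_knockoffs(X, Sigma, rng=1)
print(X_tilde.shape)  # (1000, 5)
```

By construction the knockoffs have the same marginal covariance Σ as the originals, which is exactly the moment-matching condition the slide describes.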

SLIDE 57

Step (2): Compute Knockoff Statistics

SLIDES 58-60

Strategy for Choosing Knockoff Statistics

Recall that Wj is an antisymmetric function fj of Zj and Z̃j (the variable importances of Xj and X̃j, respectively):

Wj = fj(Zj, Z̃j) = −fj(Z̃j, Zj)

For example:
Z is the magnitude of a fitted coefficient β from a lasso regression of y on [X X̃]
fj(z1, z2) = z1 − z2
Lasso Coefficient Difference (LCD) statistic: Wj = |βj| − |β̃j|
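The LCD statistic is easy to sketch with scikit-learn (an illustrative implementation with a fixed penalty of my own choosing; the toy knockoffs are an independent copy of X, which is valid only because here Σ = I):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lcd_statistics(X, X_tilde, y, alpha=0.05):
    """Lasso Coefficient Difference: W_j = |beta_j| - |beta_tilde_j|,
    with both coefficients fitted jointly on the augmented design [X, X_tilde]."""
    p = X.shape[1]
    beta = Lasso(alpha=alpha, max_iter=10000).fit(np.hstack([X, X_tilde]), y).coef_
    return np.abs(beta[:p]) - np.abs(beta[p:])

# Toy data: y depends on the first 3 of 10 independent covariates.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 10))
X_tilde = rng.standard_normal((300, 10))  # independent copy: knockoffs for Sigma = I
y = X[:, :3] @ np.array([2.0, -2.0, 1.5]) + rng.standard_normal(300)
W = lcd_statistics(X, X_tilde, y)
print(np.round(W, 2))  # large positive W for the 3 signal variables, near 0 elsewhere
```

Feeding these W into the knockoff threshold from step (3) completes the procedure.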

slide-61
SLIDE 61

Exchangeability Endows Coin-Flipping

Recall exchangeability property: for any j, [X1 ··· Xj ··· Xp ˜ X1 ··· ˜ Xj ··· ˜ Xp]

D

= [X1 ··· ˜ Xj ··· Xp ˜ X1 ··· Xj ··· ˜ Xp]

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 13 / 18

slide-62
SLIDE 62

Exchangeability Endows Coin-Flipping

Recall exchangeability property: for any j, [X1 ··· Xj ··· Xp ˜ X1 ··· ˜ Xj ··· ˜ Xp]

D

= [X1 ··· ˜ Xj ··· Xp ˜ X1 ··· Xj ··· ˜ Xp] Coin-flipping property for Wj:

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 13 / 18

slide-63
SLIDE 63

Exchangeability Endows Coin-Flipping

Recall exchangeability property: for any j, [X1 ··· Xj ··· Xp ˜ X1 ··· ˜ Xj ··· ˜ Xp]

D

= [X1 ··· ˜ Xj ··· Xp ˜ X1 ··· Xj ··· ˜ Xp] Coin-flipping property for Wj: for any unimportant variable j,

  • Zj,

Zj

  • :=
  • Zj
  • y,
  • · · · Xj· · · ˜

Xj· · ·

  • ,
  • Zj
  • y,
  • · · · Xj· · · ˜

Xj· · ·

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 13 / 18

slide-64
SLIDE 64

Exchangeability Endows Coin-Flipping

Recall exchangeability property: for any j, [X1 ··· Xj ··· Xp ˜ X1 ··· ˜ Xj ··· ˜ Xp]

D

= [X1 ··· ˜ Xj ··· Xp ˜ X1 ··· Xj ··· ˜ Xp] Coin-flipping property for Wj: for any unimportant variable j,

  • Zj,

Zj

  • :=
  • Zj
  • y,
  • · · · Xj· · · ˜

Xj· · ·

  • ,
  • Zj
  • y,
  • · · · Xj· · · ˜

Xj· · ·

D

=

  • Zj
  • y,
  • · · · ˜

Xj· · · Xj· · ·

  • ,
  • Zj
  • y,
  • · · · ˜

Xj· · · Xj· · ·

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 13 / 18

slide-69
SLIDE 69

Exchangeability Endows Coin-Flipping

Recall the exchangeability property: for any j,

[X1 · · · Xj · · · Xp  X̃1 · · · X̃j · · · X̃p]  =_D  [X1 · · · X̃j · · · Xp  X̃1 · · · Xj · · · X̃p]

(=_D denotes equality in distribution). Coin-flipping property for Wj: for any unimportant variable j,

(Zj, Z̃j) := ( Zj(y, [· · · Xj · · · X̃j · · ·]), Z̃j(y, [· · · Xj · · · X̃j · · ·]) )
         =_D ( Zj(y, [· · · X̃j · · · Xj · · ·]), Z̃j(y, [· · · X̃j · · · Xj · · ·]) )    (exchangeability)
           = ( Z̃j(y, [· · · Xj · · · X̃j · · ·]), Zj(y, [· · · Xj · · · X̃j · · ·]) )    (swapping the columns just relabels the pair)
           = (Z̃j, Zj)

so, by antisymmetry of fj,

Wj = fj(Zj, Z̃j)  =_D  fj(Z̃j, Zj) = −fj(Zj, Z̃j) = −Wj

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 13 / 18
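The coin-flipping property can be checked numerically. Below is a minimal sketch (not code from the talk) for the simplest valid construction: when the features are independent, a fresh independent copy of X is itself a valid knockoff matrix. The marginal-correlation importance Zj = |Xjᵀy| is a hypothetical stand-in for any importance measure; the signs of the null Wj then behave like fair coin flips.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k, reps = 200, 20, 3, 100        # k signal variables, the rest null
signs = []
for _ in range(reps):
    X = rng.standard_normal((n, p))
    X_tilde = rng.standard_normal((n, p))  # independent copy: valid knockoffs here
    y = X[:, :k] @ np.full(k, 1.0) + rng.standard_normal(n)
    Z = np.abs(X.T @ y)                    # importance of each original variable
    Z_tilde = np.abs(X_tilde.T @ y)        # importance of each knockoff
    W = Z - Z_tilde                        # antisymmetric statistic f(u, v) = u - v
    signs.extend((W[k:] > 0).tolist())     # keep only the null coordinates
frac_pos = float(np.mean(signs))           # close to 1/2: null W_j are sign-symmetric
```

The fraction of positive null Wj concentrates near 1/2, which is exactly what the derivation above predicts.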

slide-75
SLIDE 75

Adaptivity and Prior Information in Wj

Recall LCD: Wj = |βj| − |β̃j|, where βj, β̃j come from an ℓ1-penalized regression of y on [X X̃].

Adaptivity:
Cross-validation (on [X X̃]) to choose the penalty parameter in LCD
Higher-level adaptivity: CV to choose the best-fitting model for inference
− E.g., fit a random forest and an ℓ1-penalized regression; derive feature importances from whichever has lower CV error; FDR control remains strict
Can even let the analyst look at a (masked version of the) data to choose the Z function

Prior information:
Bayesian approach: choose a prior and model; Zj could be the posterior probability that Xj contributes to the model
FDR control remains strict even if the prior is wrong or the MCMC has not converged

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 14 / 18
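The LCD statistic can be sketched end to end with a small hand-rolled coordinate-descent lasso, a stand-in for any ℓ1-penalized solver. The simulated data, the penalty λ = 50, and the independent-copy knockoffs are illustrative assumptions (the latter is valid here because the features are independent):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y.copy()                          # residual y - X @ beta (beta = 0 initially)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]        # remove column j's contribution
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * beta[j]        # add the updated contribution back
    return beta

rng = np.random.default_rng(0)
n, p, k = 300, 10, 3
X = rng.standard_normal((n, p))
X_tilde = rng.standard_normal((n, p))     # independent features: fresh copy is a valid knockoff
y = X[:, :k] @ np.full(k, 2.0) + rng.standard_normal(n)

beta = lasso_cd(np.hstack([X, X_tilde]), y, lam=50.0)
W = np.abs(beta[:p]) - np.abs(beta[p:])   # LCD: W_j = |beta_j| - |beta_tilde_j|
```

For the signal variables, the original coefficient dominates its knockoff's, so those Wj come out large and positive.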

slide-76
SLIDE 76

Step (3): Find the Knockoff Threshold

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 14 / 18

slide-93
SLIDE 93

Find the Knockoff Threshold

Example with p = 10 and q = 20% = 1/5:

[Bar chart: the |Wj| sorted in decreasing magnitude, each bar marked by the sign of Wj. Scanning the threshold downward, the ratio #{negative Wj above the threshold} / #{positive Wj above the threshold} is tracked; τ̂ is the smallest threshold at which this ratio is still ≤ q. Here the ratio equals 1/5 = q at τ̂, and the surviving positive Wj are W1, W4, W5, W6, W7.]

Ŝ = {1, 4, 5, 6, 7}

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 15 / 18
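The threshold scan can be written in a few lines. This sketches the plain knockoff rule (the knockoff+ variant adds 1 to the numerator); the W values below are hypothetical but chosen so the scan reproduces the slide's outcome Ŝ = {1, 4, 5, 6, 7}:

```python
import numpy as np

def knockoff_threshold(W, q):
    """Smallest magnitude t with estimated FDP #{W_j <= -t} / #{W_j >= t} <= q."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp = (W <= -t).sum() / max(1, (W >= t).sum())
        if fdp <= q:
            return t
    return np.inf  # no feasible threshold: select nothing

# Hypothetical W vector with p = 10
W = np.array([4.0, -1.2, 0.8, 3.5, 3.0, 2.5, 2.0, -0.6, 0.4, -1.8])
tau = knockoff_threshold(W, q=0.2)
S = [j + 1 for j in range(len(W)) if W[j] >= tau]  # 1-indexed, as on the slide
# tau = 1.8, S = [1, 4, 5, 6, 7]
```

At t = 1.8 exactly one negative and five positive Wj survive, giving the estimated FDP 1/5 = q, so τ̂ = 1.8 and the five positives are selected.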

slide-97
SLIDE 97

Intuition for FDR Control

FDR = E[ #{null Xj selected} / #{total Xj selected} ]
    = E[ #{null j : Wj > τ̂} / #{j : Wj > τ̂} ]
    ≈ E[ #{null j : Wj < −τ̂} / #{j : Wj > τ̂} ]    (coin-flipping: null Wj have symmetric signs)
    ≤ E[ #{j : Wj < −τ̂} / #{j : Wj > τ̂} ] ≤ q     (by the choice of τ̂)

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 16 / 18

slide-98
SLIDE 98

GWAS Application

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 16 / 18

slide-105
SLIDE 105

Genetic Analysis of Crohn’s Disease

2007 case-control study by the WTCCC: n ≈ 5,000, p ≈ 375,000; preprocessing mirrored the original analysis
Strong spatial structure: second-order knockoffs generated using a genetic covariance estimate (Wen and Stephens, 2010)
The entire analysis took 6 hours of serial computation time; 1 hour in parallel
Knockoffs made twice as many discoveries as the original analysis
− Some new discoveries were confirmed in a larger study
− Some are corroborated by work on nearby genes: promising candidates
− Similar result when HMM knockoffs were applied to the same data (Sesia et al., 2017)

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 17 / 18

slide-106
SLIDE 106

Discussion

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 17 / 18

slide-111
SLIDE 111

Summary and Next Steps

By conditioning on Y and modeling X, knockoffs can be applied to high-dimensional and nonlinear problems, where the method is powerful, flexible, and appears robust.

Some future directions for research:
Theoretical: rigorous guarantees on robustness
Methodological: develop knockoff constructions for new X distributions
Applied: team up with domain experts who know/control their X, e.g., gene knockout/knockdown, climate change modeling

Thank you!

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 18 / 18

slide-112
SLIDE 112

Appendix

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 18 / 18

slide-113
SLIDE 113

References

Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist., 43(5):2055–2085.

Candès, E., Fan, Y., Janson, L., and Lv, J. (2016). Panning for gold: Model-free knockoffs for high-dimensional controlled variable selection. arXiv preprint arXiv:1610.02351.

Dai, R. and Barber, R. F. (2016). The knockoff filter for FDR control in group-sparse and multitask regression. arXiv preprint arXiv:1602.03589.

Sesia, M., Sabatti, C., and Candès, E. (2017). Gene hunting with knockoffs for hidden Markov models. arXiv preprint arXiv:1706.04677.

Wen, X. and Stephens, M. (2010). Using linear predictors to impute allele frequencies from summary or pooled genotype data. Ann. Appl. Stat., 4(3):1158–1182.

WTCCC (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678.

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 18 / 18

slide-114
SLIDE 114

Simulations in Low-Dimensional Linear Model

[Two-panel plot: Power (left) and FDR (right) vs. coefficient amplitude; methods: BHq Marginal, BHq Max Lik., MF Knockoffs, Orig. Knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix is i.i.d. N(0, 1/n), n = 3000, p = 1000, and y comes from a Gaussian linear model with 60 nonzero regression coefficients having equal magnitudes and random signs. The noise variance is 1.

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 18 / 18

slide-115
SLIDE 115

Simulations in Low-Dimensional Nonlinear Model

[Two-panel plot: Power (left) and FDR (right) vs. coefficient amplitude; methods: BHq Marginal, BHq Max Lik., MF Knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix is i.i.d. N(0, 1/n), n = 3000, p = 1000, and y comes from a binomial linear model with logit link function, with 60 nonzero regression coefficients having equal magnitudes and random signs.

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 18 / 18

slide-116
SLIDE 116

Simulations in High Dimensions

[Two-panel plot: Power (left) and FDR (right) vs. coefficient amplitude; methods: BHq Marginal, MF Knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix is i.i.d. N(0, 1/n), n = 3000, p = 6000, and y comes from a binomial linear model with logit link function, with 60 nonzero regression coefficients having equal magnitudes and random signs.

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 18 / 18

slide-117
SLIDE 117

Simulations in High Dimensions with Dependence

[Two-panel plot: Power (left) and FDR (right) vs. autocorrelation coefficient; methods: BHq Marginal, MF Knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix has AR(1) columns, and marginally each Xj ∼ N(0, 1/n). n = 3000, p = 6000, and y follows a binomial linear model with logit link function, with 60 nonzero coefficients with random signs and randomly selected locations.

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 18 / 18

slide-120
SLIDE 120

Checking Sensitivity to Misspecification Error

                                              Concern about misspecifying:
                                              Y | X     X
Canonical approach (model Y, not X)           Yes       No
Knockoffs (model X, not Y)                    No        Yes
Misspecification replicated in simulation?    No        Yes

Can actually check sensitivity to misspecification error!

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 18 / 18

slide-121
SLIDE 121

Robustness on Real Data

[Two-panel plot: Power (left) and FDR (right) vs. coefficient amplitude.]

Figure: Power and FDR (target is 10%) for model-free knockoffs applied to subsamples of chromosome 1 of a real genetic design matrix; n ≈ 1,400.
Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 18 / 18

slide-125
SLIDE 125

Computation of Second-Order Knockoffs

Given Cov(X1, . . . , Xp) = Σ, we need

Cov(X1, . . . , Xp, X̃1, . . . , X̃p) = [ Σ            Σ − diag{s} ]
                                      [ Σ − diag{s}  Σ           ]

Equicorrelated (EQ) (fast, less powerful): s_j^EQ = 2λmin(Σ) ∧ 1 for all j

Semidefinite program (SDP) (slower, more powerful):
    minimize Σ_j |1 − s_j^SDP|  subject to  s_j^SDP ≥ 0,  diag{s^SDP} ⪯ 2Σ

(New) Approximate SDP:
Approximate Σ as block-diagonal so that the SDP separates
Bisection search over a scalar multiplier of the solution to account for the approximation
Faster than the SDP, more powerful than EQ, and easily parallelizable

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 18 / 18
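For Gaussian X the second-order construction can be sampled in closed form: given the joint covariance above, X̃ | X is Gaussian with row-wise mean X(I − Σ⁻¹ diag{s}) and covariance 2 diag{s} − diag{s} Σ⁻¹ diag{s}. A minimal numpy sketch with the equicorrelated choice (the AR(1) covariance is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 2000
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])  # illustrative AR(1) covariance

# Equicorrelated choice: s_j = 2*lambda_min(Sigma) wedge 1 for all j
s = min(2.0 * np.linalg.eigvalsh(Sigma).min(), 1.0)
D = s * np.eye(p)

Sinv = np.linalg.inv(Sigma)
mean_mat = np.eye(p) - Sinv @ D           # row-wise: E[X_tilde | X] = X @ mean_mat
cond_cov = 2.0 * D - D @ Sinv @ D         # Cov[X_tilde | X] (PSD, possibly singular)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
evals, evecs = np.linalg.eigh(cond_cov)   # eigendecomposition handles singular cond_cov
root = evecs * np.sqrt(np.clip(evals, 0.0, None))
X_tilde = X @ mean_mat + rng.standard_normal((n, p)) @ root.T

# Empirically, Cov([X, X_tilde]) matches [[Sigma, Sigma - D], [Sigma - D, Sigma]]
G = np.block([[Sigma, Sigma - D], [Sigma - D, Sigma]])
emp = np.cov(np.hstack([X, X_tilde]).T)
```

With s = 2λmin(Σ) the conditional covariance is singular, which is why an eigendecomposition rather than a Cholesky factor is used for sampling.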

slide-132
SLIDE 132

Sequential Independent Pairs Generates Valid Knockoffs

Algorithm 1: Sequential Conditional Independent Pairs
for j = 1, . . . , p do
    Sample X̃j from L(Xj | X−j, X̃1:j−1), conditionally independently of Xj
end

Proof sketch (discrete case):
Denote the PMF of (X1:p, X̃1:j−1) by L(X−j, Xj, X̃1:j−1).
The conditional PMF of X̃j | X1:p, X̃1:j−1 is then L(X−j, X̃j, X̃1:j−1) / Σu L(X−j, u, X̃1:j−1).
Hence the joint PMF of (X1:p, X̃1:j) is
    L(X−j, Xj, X̃1:j−1) L(X−j, X̃j, X̃1:j−1) / Σu L(X−j, u, X̃1:j−1),
which is symmetric in Xj and X̃j, so swapping them leaves the joint distribution unchanged.

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 18 / 18
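The proof sketch can be verified by exact enumeration for p = 2 binary features: run the two steps of Algorithm 1 in closed form, then check that the joint PMF of (X1, X2, X̃1, X̃2) is invariant under swapping either Xj with X̃j. The 2×2 joint PMF below is a hypothetical example.

```python
import numpy as np

# Hypothetical joint PMF: P[a, b] = Pr(X1 = a, X2 = b), a, b in {0, 1}
P = np.array([[0.30, 0.10],
              [0.15, 0.45]])

# Step j = 1: sample X~1 from L(X1 | X2), conditionally independently of X1
P_t1_given_x2 = P / P.sum(axis=0, keepdims=True)   # [t1, x2]

# Exact joint of (X1, X2, X~1): Pr(x1, x2) * Pr(X1 = t1 | X2 = x2)
q3 = np.einsum('ab,tb->abt', P, P_t1_given_x2)     # axes [x1, x2, t1]

# Step j = 2: sample X~2 from L(X2 | X1, X~1), conditionally independently of X2
marg = q3.sum(axis=1)                              # [x1, t1] = Pr(x1, t1)
P_t2_given = q3 / marg[:, None, :]                 # [x1, t2, t1]

# Exact joint of (X1, X2, X~1, X~2)
q4 = np.einsum('abt,adt->abtd', q3, P_t2_given)    # axes [x1, x2, t1, t2]

# Knockoff exchangeability: swapping X_j with X~_j leaves the joint unchanged
swap1_ok = np.allclose(q4, q4.transpose(2, 1, 0, 3))  # X1 <-> X~1
swap2_ok = np.allclose(q4, q4.transpose(0, 3, 2, 1))  # X2 <-> X~2
```

Because the check enumerates the distribution exactly rather than sampling, both swap tests hold to machine precision, mirroring the symmetry argument in the proof sketch.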

slide-140
SLIDE 140

Proof of Control

FDR = E[ #{null Xj selected} / #{total Xj selected} ]
    = E[ #{null j : Wj > τ̂} / #{j : Wj > τ̂} ]
    ≈ E[ #{null j : Wj < −τ̂} / #{j : Wj > τ̂} ]    (coin-flipping for null Wj)
    ≤ E[ #{j : Wj < −τ̂} / #{j : Wj > τ̂} ] ≤ q     (by the choice of τ̂)

More precisely:

mFDR = E[ #{null Xj selected} / (q⁻¹ + #{total Xj selected}) ]
     = E[ #{null j : Wj > τ̂} / (q⁻¹ + #{j : Wj > τ̂}) ]
     = E[ (#{null j : Wj > τ̂} / (1 + #{null j : Wj < −τ̂})) · ((1 + #{null j : Wj < −τ̂}) / (q⁻¹ + #{j : Wj > τ̂})) ]
     ≤ q

The first factor has expectation at most 1 by a supermartingale argument with τ̂ a stopping time; the second factor is at most q by the definition of τ̂.

Lucas Janson (Harvard Statistics) Knockoffs for HD Controlled Variable Selection 18 / 18