Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate
Lucas Janson, Stanford Department of Statistics
WADAPT Workshop, NIPS, December 2016
Collaborators: Emmanuel Candès (Stanford), Yingying Fan, and Jinchi Lv
Problem Statement
Controlled Variable Selection

Given:
- Y, an outcome of interest (AKA response or dependent variable)
- X1, ..., Xp, a set of p potential explanatory variables (AKA covariates, features, or independent variables)

How can we select important explanatory variables with few mistakes?

Applications:
- Medicine/genetics/health care
- Economics/political science
- Industry/technology
Controlled Variable Selection

What is an important variable? We consider Xj to be unimportant if the conditional distribution of Y given X1, ..., Xp does not depend on Xj. Formally, Xj is unimportant if it is conditionally independent of Y given X-j (all the variables other than Xj):

    Y ⊥⊥ Xj | X-j

Markov blanket of Y: the smallest set S such that Y ⊥⊥ X-S | XS.

To make sure we do not make too many mistakes, we seek to select a set Ŝ that controls the false discovery rate (FDR):

    FDR(Ŝ) = E[ #{j ∈ Ŝ : Xj unimportant} / #{j ∈ Ŝ} ] ≤ q   (e.g., q = 10%)

"Here is a set of variables Ŝ, 90% of which I expect to be important."
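For concreteness, here is a minimal Python sketch of the quantity inside the expectation (the false discovery proportion, whose expectation is the FDR); `fdp` and its arguments are hypothetical names used only for illustration:

```python
import numpy as np

def fdp(selected, unimportant):
    """False discovery proportion of a selected set.

    selected:    indices j in the selected set S_hat
    unimportant: indices of the truly unimportant variables (known
                 only in simulations); FDR = E[fdp].
    """
    selected = np.asarray(selected)
    if selected.size == 0:
        return 0.0  # empty selection makes no false discoveries
    return np.isin(selected, unimportant).mean()
```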
Sneak Peek

Model-free knockoffs solves the controlled variable selection problem:
- Any model for Y and X1, ..., Xp
- Any dimension (including p > n)
- Finite-sample (non-asymptotic) control of the FDR
- Practical performance on real problems

Application: the genetic basis of Crohn's disease (WTCCC, 2007)
- ≈ 5,000 subjects (≈ 40% with Crohn's disease)
- ≈ 375,000 single nucleotide polymorphisms (SNPs) for each subject
- The original analysis of the data made 9 discoveries by running marginal tests of association on each SNP and applying a p-value cutoff corresponding (by a Bayesian argument, under assumptions) to an FDR of 10%
- Model-free knockoffs used the same FDR of 10% and made 18 discoveries, with many of the new discoveries confirmed by a larger meta-analysis
Methods for Controlled Variable Selection

What is required for valid inference?

Method      | Low dimensions | Model for Y | Asymptotic regime | Sparsity | Random design
OLSp + BHq  | Yes            | Yes         | No                | No       | No
MLp + BHq   | Yes            | Yes         | Yes               | No       | No
HDp + BHq   | No             | Yes         | Yes               | Yes      | Yes
Orig. KnO   | Yes            | Yes         | No                | No       | No
MF KnO      | No             | No          | No                | No       | Yes*

(BHq: Benjamini-Hochberg procedure at level q; KnO: knockoffs.)
The Knockoffs Framework

The generic knockoffs procedure for controlling the FDR at level q:

(1) Construct knockoffs:
- Artificial versions ("knockoffs") of each variable
- Act as controls for assessing the importance of the original variables

(2) Compute knockoff statistics:
- A scalar statistic Wj for each variable
- Measures how much more important a variable appears than its knockoff
- Positive Wj denotes the original is more important, with strength measured by magnitude

(3) Find the knockoff threshold (sketched in code after this slide):
- Order the variables by decreasing |Wj|
- Going down the list, select variables with positive Wj
- Stop at the last time the ratio of negatives to positives is below q

Coin-flipping property: the key to the knockoffs procedure is that steps (1) and (2) are done specifically to ensure that, conditional on |W1|, ..., |Wp|, the signs of the unimportant/null Wj are independently ±1 with probability 1/2.
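A minimal Python sketch of step (3); `knockoff_threshold` is a hypothetical helper name. The slide's stopping rule corresponds to the smallest feasible threshold below; the knockoff+ variant of Barber and Candès (2015) adds 1 to the count of negatives, which is what yields exact FDR control.

```python
import numpy as np

def knockoff_threshold(W, q, plus=True):
    """Data-dependent threshold for knockoff statistics W.

    Select {j : W_j >= T}. With plus=True, 1 is added to the count of
    negatives (the knockoff+ rule, giving exact FDR control).
    """
    W = np.asarray(W)
    offset = 1 if plus else 0
    for t in np.sort(np.abs(W[W != 0])):  # candidate thresholds |W_j|
        # estimated FDP at t: sign-flipped knockoffs stand in for false positives
        if (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            return t
    return np.inf  # no feasible threshold: select nothing

# usage: selected = np.where(W >= knockoff_threshold(W, q=0.10))[0]
```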
The Model-Free Knockoffs Procedure

The model-free knockoffs procedure for controlling the FDR at level q:

(1) Construct knockoffs satisfying exchangeability: for each j,

    [X1 · · · Xj · · · Xp  X̃1 · · · X̃j · · · X̃p] =_d [X1 · · · X̃j · · · Xp  X̃1 · · · Xj · · · X̃p],

where =_d denotes equality in distribution (this requires the joint distribution of X1, ..., Xp to be known)

(2) Compute knockoff statistics:
- A variable importance measure Z
- An antisymmetric function fj : R² → R, i.e., fj(z1, z2) = −fj(z2, z1)
- Wj = fj(Zj, Z̃j), where Zj and Z̃j are the variable importances of Xj and X̃j, respectively

(3) Find the knockoff threshold: just requires the coin-flipping property
Known Covariate Distribution

Model-free knockoffs is surprisingly robust to overfitting. Treating the covariate distribution as known is a reasonable approximation when:
1. Subjects are sampled from a population, and
2a. the Xj are highly structured, well-studied, or well-understood, OR
2b. a large set of unsupervised X data (without Y's) is available.

For instance, many genome-wide association studies satisfy all conditions:
1. Subjects are sampled from a population (oversampling cases is still valid)
2a. Strong spatial structure: linkage disequilibrium models, e.g., Markov chains, are well-studied and work well
2b. Other studies have collected the same or similar SNP arrays on different subjects
Knockoff Construction

Valid model-free knockoff variables can always be generated:

Algorithm 1: Sequential Conditional Independent Pairs
    for j = 1, ..., p do
        Sample X̃j from L(Xj | X-j, X̃1:(j−1))
    end

If (X1, ..., Xp) is multivariate Gaussian, exchangeability reduces to matching first and second moments when Xj and X̃j are swapped. For Cov(X1, ..., Xp) = Σ:

    Cov(X1, ..., Xp, X̃1, ..., X̃p) = [ Σ             Σ − diag{s} ]
                                     [ Σ − diag{s}   Σ           ]

In the non-Gaussian case, this construction can be thought of as second-order-correct model-free knockoffs.
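A minimal sketch of this Gaussian construction, assuming mean-zero covariates and the simple equicorrelated choice of s (a real analysis, like the Crohn's disease application later, would instead choose s by solving an SDP); `gaussian_knockoffs` is a hypothetical helper:

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, rng=None):
    """Sample Gaussian knockoffs for mean-zero rows X ~ N(0, Sigma).

    Uses the equicorrelated choice s_j = min(1, 2 * lambda_min(Sigma)),
    valid when Sigma is a correlation matrix, then samples X_tilde | X
    from the Gaussian conditional implied by the joint covariance
    [[Sigma, Sigma - D], [Sigma - D, Sigma]], D = diag(s).
    """
    rng = np.random.default_rng(rng)
    p = Sigma.shape[0]
    s = np.full(p, min(1.0, 2.0 * np.linalg.eigvalsh(Sigma).min()))
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)   # Sigma^{-1} D
    mean = X - X @ Sigma_inv_D                # conditional means, row-wise
    cond_cov = 2 * D - D @ Sigma_inv_D        # 2D - D Sigma^{-1} D
    # tiny jitter guards against numerical loss of positive definiteness
    L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))
    return mean + rng.standard_normal(X.shape) @ L.T
```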
Exchangeability Endows Coin-Flipping

Recall the exchangeability property: for any j,

    [X1 · · · Xj · · · Xp  X̃1 · · · X̃j · · · X̃p] =_d [X1 · · · X̃j · · · Xp  X̃1 · · · Xj · · · X̃p]

Coin-flipping property for Wj: for any unimportant variable j,

    ( Zj(y, [X1 ··Xj ··Xp  X̃1 ··X̃j ··X̃p]),  Z̃j(y, [X1 ··Xj ··Xp  X̃1 ··X̃j ··X̃p]) )
    =_d ( Zj(y, [X1 ··X̃j ··Xp  X̃1 ··Xj ··X̃p]),  Z̃j(y, [X1 ··X̃j ··Xp  X̃1 ··Xj ··X̃p]) )
        [by exchangeability; since j is unimportant, Y ⊥⊥ Xj | X-j, so swapping Xj and X̃j leaves the joint distribution with y unchanged]
    = ( Z̃j(y, [X1 ··Xj ··Xp  X̃1 ··X̃j ··X̃p]),  Zj(y, [X1 ··Xj ··Xp  X̃1 ··X̃j ··X̃p]) )
        [swapping columns j and p + j simply exchanges the two importances]

Therefore

    Wj = fj(Zj, Z̃j) =_d fj(Z̃j, Zj) = −fj(Zj, Z̃j) = −Wj,

i.e., Wj =_d −Wj: the null Wj are symmetrically distributed around zero. (A small simulation check follows below.)
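A small end-to-end simulation of the property under an assumed ground truth, reusing the `gaussian_knockoffs` sketch above, with the marginal correlation |Xjᵀy| as the importance Z and the signed difference as the antisymmetric f:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 50, 5
# AR(1) correlation matrix, as in the robustness simulations
Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = np.zeros(p)
beta[:k] = 1.0                   # only the first k variables matter
y = X @ beta + rng.standard_normal(n)

X_tilde = gaussian_knockoffs(X, Sigma, rng=rng)
W = np.abs(X.T @ y) - np.abs(X_tilde.T @ y)   # W_j = Z_j - Z~_j

print((W[k:] > 0).mean())   # null signs: close to 1/2 on average
```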
Adaptivity and Prior Information in Wj

Lasso Coefficient Difference (LCD): run an ℓ1-penalized regression of y on [X X̃] and set Wj = |βj| − |β̃j| (a code sketch follows after this slide)

Adaptivity:
- Cross-validation (on [X X̃]) to choose the penalty parameter in the lasso
- Higher-level adaptivity: CV to choose the best-fitting model for inference
- For example, fit both a random forest and an ℓ1-penalized regression, and derive feature importances from whichever has the lower CV error, still with strict FDR control

Prior information:
- Bayesian approach: choose a prior and model, and Zj could be the posterior probability that Xj contributes to the model
- FDR control remains strict even if the prior is wrong or the MCMC has not converged
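A minimal sketch of the LCD statistic using scikit-learn's cross-validated lasso (library choice is mine; `lcd_statistics` is a hypothetical helper):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lcd_statistics(X, X_tilde, y, cv=5):
    """Lasso Coefficient Difference: W_j = |beta_j| - |beta_tilde_j|.

    The lasso is fit on the augmented design [X, X_tilde]; choosing the
    penalty by CV on the augmented matrix treats originals and knockoffs
    symmetrically, so the coin-flipping property is preserved.
    """
    p = X.shape[1]
    XX = np.hstack([X, X_tilde])
    beta = LassoCV(cv=cv).fit(XX, y).coef_
    return np.abs(beta[:p]) - np.abs(beta[p:])

# Putting the sketched pieces together:
# W = lcd_statistics(X, gaussian_knockoffs(X, Sigma), y)
# selected = np.where(W >= knockoff_threshold(W, q=0.10))[0]
```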
Summary and Next Steps

Summary:
- The controlled variable selection problem arises in many important modern statistical applications, but had remained unsolved in all but the simplest settings
- Model-free knockoffs is a powerful, adaptive, and robust solution whenever there is considerable outside information on the covariate distribution, which includes some of the most pressing applications, such as GWAS

Next steps:
- Theoretical: rigorous results on robustness
- Applied: domain-specific knockoff constructions and knockoff statistics for interesting applications, e.g., gene knockout/knockdown

Thank you!
Appendix
References

- Athey, S., Imbens, G. W., and Wager, S. (2016). Efficient inference of average treatment effects in high dimensions via approximate residual balancing. arXiv preprint arXiv:1604.07125.
- Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist., 43(5):2055–2085.
- Candès, E., Fan, Y., Janson, L., and Lv, J. (2016). Panning for gold: Model-free knockoffs for high-dimensional controlled variable selection. arXiv preprint arXiv:1610.02351.
- Lee, J. D., Sun, D. L., Sun, Y., and Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. Ann. Statist., 44(3):907–927.
- van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist., 42(3):1166–1202.
- Wen, X. and Stephens, M. (2010). Using linear predictors to impute allele frequencies from summary or pooled genotype data. Ann. Appl. Stat., 4(3):1158–1182.
- WTCCC (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678.
Original Knockoffs (Barber and Candès, 2015)

y and Xj are n × 1 column vectors of data: n draws from the random variables Y and Xj, respectively; design matrix X := [X1 · · · Xp].

(1) Construct knockoffs: with X̃ := [X̃1 · · · X̃p], the knockoffs must satisfy

    [X X̃]⊤[X X̃] = [ X⊤X             X⊤X − diag{s} ]
                   [ X⊤X − diag{s}   X⊤X           ]

(2) Compute knockoff statistics:
- Sufficiency: Wj is only a function of [X X̃]⊤[X X̃] and [X X̃]⊤y
- Antisymmetry: swapping the values of Xj and X̃j flips the sign of Wj

Comments:
- Finite-sample (non-asymptotic) FDR control
- Sparsity-based Wj gives greater power than OLS+BHq
- Requires the data to follow a Gaussian linear model
- Can only be run in low dimensions (n ≥ p)
- The sufficiency requirement restricts the choice of Wj, limiting power/adaptivity
Robustness Simulations

[Two plots: power and FDR versus relative Frobenius norm error of the covariance estimate used to construct the knockoffs. Curves: exact covariance, graphical lasso, and empirical covariance estimates based on 50%, 62.5%, 75%, 87.5%, and 100% of the data.]

Figure: Covariates are AR(1) with autocorrelation coefficient 0.3. n = 800, p = 1500, and the target FDR is 10%. Y comes from a binomial linear model with logit link function with 50 nonzero coefficients.
Robustness on Real Data

[Two boxplot panels: power and FDR versus coefficient amplitude.]

Figure: Power and FDR (target is 10%) for model-free knockoffs applied to subsamples of a real genetic design matrix. n ≈ 1,400, p ≈ 70,000, and each boxplot represents 10 different logistic regression models with 60 nonzero coefficients, while each sample in each boxplot is an average over 10 design matrices drawn from actual SNP data.
Genetic Analysis of Crohn's Disease

- 2007 case-control study of Crohn's disease by the WTCCC; n ≈ 5,000, p ≈ 375,000; preprocessing mirrored the original analysis
- Strong spatial structure: second-order approximate SDP knockoffs built on the covariance estimate of Wen and Stephens (2010), which shrinks the off-diagonal entries of the empirical covariance using HapMap spatial structure
- Nearby SNPs had very high correlations, which affects power
- SNPs were clustered into groups of average size ≈ 5, with each group represented by a single SNP chosen by a t-test on a held-out subset of the data, reducing p from ≈ 375,000 to ≈ 70,000
- Robustness checked by running the entire procedure on repeated subsamples of the larger design matrix, with a simulated response
- Model-free knockoffs makes twice as many discoveries as the original analysis
- Some new discoveries confirmed in a larger study; others corroborated by work on nearby genes: promising candidates
Simulations in Low-Dimensional Linear Model

[Two plots: power and FDR versus coefficient amplitude. Methods: BHq with marginal p-values, BHq with maximum-likelihood p-values, MF knockoffs, and original knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix is i.i.d. N(0, 1/n), n = 3000, p = 1000, and y comes from a Gaussian linear model with 60 nonzero regression coefficients having equal magnitudes and random signs. The noise variance is 1.
Simulations in Low-Dimensional Nonlinear Model

[Two plots: power and FDR versus coefficient amplitude. Methods: BHq with marginal p-values, BHq with maximum-likelihood p-values, and MF knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix is i.i.d. N(0, 1/n), n = 3000, p = 1000, and y comes from a binomial linear model with logit link function and 60 nonzero regression coefficients having equal magnitudes and random signs.
Simulations in High Dimensions

[Two plots: power and FDR versus coefficient amplitude. Methods: BHq with marginal p-values and MF knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix is i.i.d. N(0, 1/n), n = 3000, p = 6000, and y comes from a binomial linear model with logit link function and 60 nonzero regression coefficients having equal magnitudes and random signs.
Simulations in High Dimensions with Dependence

[Two plots: power and FDR versus autocorrelation coefficient. Methods: BHq with marginal p-values and MF knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix has AR(1) columns, and marginally each Xj ∼ N(0, 1/n). n = 3000, p = 6000, and y follows a binomial linear model with logit link function and 60 nonzero coefficients with random signs and randomly selected locations.