Ultrahigh dimensional variable selection: Beyond the linear model



SLIDE 1

Ultrahigh dimensional variable selection: Beyond the linear model

Jianqing Fan
Princeton University
With Richard Samworth and Yichao Wu; Rui Song
http://www.princeton.edu/~jqfan

May 16, 2009

Jianqing Fan (Princeton University) High-dimensional variable selection Yale University 1 / 43

SLIDE 2

Outline

1. Introduction
2. Large-scale screening
3. Moderate-scale selection
4. Iterative feature selection
5. Numerical Studies

SLIDE 3

Introduction

SLIDE 4

Introduction

High-dimensional variable selection characterizes many contemporary statistical problems:

- Bioinformatics: disease classification using microarray, proteomics, and fMRI data.
- Document or text classification: e-mail spam detection.
- Association studies between phenotypes and SNPs.

SLIDE 5

Growth of Dimensionality

Dimensionality grows rapidly when interactions are considered:

- Portfolio selection and network modeling: 2,000 stocks involve over 2 million unknown parameters in the covariance matrix.
- Gene-gene interaction: pairwise interactions among 5,000 genes result in about 12.5 million features.
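A quick sanity check of these counts (a minimal sketch of our own; the variable names are illustrative, not from the talk):

```python
# Back-of-envelope parameter counts quoted on this slide.
p_stocks = 2000
cov_params = p_stocks * (p_stocks + 1) // 2   # free parameters in a symmetric covariance matrix
print(cov_params)          # 2_001_000: "over 2 million"

p_genes = 5000
pairwise = p_genes * (p_genes - 1) // 2       # gene-gene interaction terms
print(p_genes + pairwise)  # 12_502_500: about "12.5 million features"
```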

SLIDE 6

Aims of High-dimensional Regression and Classification

- To construct as effective a method as possible to predict future observations.
- To gain insight into the relationship between features and response for scientific purposes, as well as, hopefully, to construct an improved prediction method.

(Bickel, 2008, discussion of the SIS paper, JRSS-B.)

SLIDE 7

Challenges with Ultrahigh Dimensionality

- Computational cost
- Estimation accuracy
- Stability

Key idea: large-scale screening followed by moderate-scale searching.

SLIDE 8

Large-scale screening

SLIDE 9

Independence learning

Regression: feature ranking by correlation learning (Fan and Lv, 2008, JRSS-B). When $Y = \pm 1$, this amounts to ranking by two-sample $t$-statistics.

Classification: feature ranking by two-sample $t$-tests or other tests (Tibshirani et al., 03; Fan and Fan, 2008).

SIS: with an appropriate threshold (e.g., keeping the top $n$ variables), the relevant features are contained in the selected set (Fan and Lv, 08), relying on a joint-normality assumption.

Other independence learning: Hall, Titterington and Xue (2009) derive such a method from an empirical-likelihood point of view.
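A minimal sketch of the correlation-screening step on simulated data (our own illustration; the data and the $n/\log n$ cutoff are assumptions in the spirit of Fan and Lv, 08):

```python
import numpy as np

def sis_screen(X, y, top_k):
    """Rank features by absolute marginal correlation with y; keep the top_k."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:top_k]

rng = np.random.default_rng(0)
n, p = 200, 5000
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(n)
print(sis_screen(X, y, top_k=int(n / np.log(n))))  # features 0 and 1 should rank first
```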

SLIDE 12

Model setting

GLIM: $f_Y(y \mid X = x; \theta) = \exp\{(y\theta - b(\theta))/\phi + c(y, \phi)\}$ with canonical link $b'^{-1}(\mu) = \theta = x^T\beta$.

Objective: find a sparse $\beta$ minimizing $Q(\beta) = \sum_{i=1}^n L(Y_i, x_i^T\beta)$.

- GLIM: $L(Y_i, x_i^T\beta) = b(x_i^T\beta) - Y_i x_i^T\beta$.
- Classification ($Y = \pm 1$):
  ⋆ SVM: $L(Y_i, x_i^T\beta) = (1 - Y_i x_i^T\beta)_+$.
  ⋆ AdaBoost: $L(Y_i, x_i^T\beta) = \exp(-Y_i x_i^T\beta)$.
- Robustness: $L(Y_i, x_i^T\beta) = |Y_i - x_i^T\beta|$.
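These losses are straightforward to code; a minimal numpy sketch (ours, for illustration; the logistic $b(\theta) = \log(1 + e^\theta)$ is one concrete choice of GLIM):

```python
import numpy as np

def glim_loss(y, u):
    """b(u) - y*u with b(t) = log(1 + e^t): logistic regression, y in {0, 1}."""
    return np.logaddexp(0.0, u) - y * u

def hinge_loss(y, u):   # SVM, y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * u)

def exp_loss(y, u):     # AdaBoost, y in {-1, +1}
    return np.exp(-y * u)

def abs_loss(y, u):     # robust (L1) regression
    return np.abs(y - u)
```

Here `u` stands for the linear predictor $x_i^T\beta$.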

SLIDE 14

Questions

1. How to screen discrete variables (genome-wide association studies)?
2. Do these screening procedures have the sure screening property?
3. How large must the selected model be in order to have SIS? The arguments in Fan and Lv (2008) cannot be applied here.

SLIDE 17

Independence learning

Marginal utility: letting $\hat L_0 = \min_{\beta_0} n^{-1}\sum_{i=1}^n L(Y_i, \beta_0)$, define

$$\hat L_j = \hat L_0 - \min_{\beta_0, \beta_j} n^{-1}\sum_{i=1}^n L(Y_i, \beta_0 + X_{ij}\beta_j) \quad \text{(Wilks)},$$

or use $\hat\beta_j^M$ (Wald), assuming $E X_j^2 = 1$.

Feature ranking: select the features with the largest marginal utilities:

$$\hat{\mathcal M}_{\nu_n} = \{j : \hat L_j \ge \nu_n\}, \qquad \hat{\mathcal M}^w_{\gamma_n} = \{j : |\hat\beta_j^M| \ge \gamma_n\}.$$

Dimensionality reduction: from $p_n = O(\exp(n^a))$ down to $O(n^b)$ (e.g., from 10,000 features to 200).
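A sketch of the Wilks-type marginal utility for logistic regression (ours; the helper names are assumptions, and each feature gets its own two-parameter fit):

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, x, y):
    """Negative logistic log-likelihood, y in {0, 1}; x=None means intercept only."""
    u = params[0] + (params[1] * x if x is not None else 0.0)
    return np.sum(np.logaddexp(0.0, u) - y * u)

def marginal_utilities(X, y):
    n = len(y)
    L0 = minimize(neg_loglik, x0=[0.0], args=(None, y)).fun / n
    Lj = np.empty(X.shape[1])
    for j in range(X.shape[1]):            # one marginal fit per feature
        fit = minimize(neg_loglik, x0=[0.0, 0.0], args=(X[:, j], y))
        Lj[j] = L0 - fit.fun / n           # hat L_j = hat L_0 - marginal minimum
    return Lj                              # screen by {j : Lj >= nu_n}
```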

SLIDE 19

Theoretical Basis – Population Aspect I

Marginal utility: $L_j^\star = E\,\ell(Y, \beta_0^M) - \min_{\beta_0, \beta_j} E\,\ell(Y, \beta_0 + \beta_j X_j)$.

Likelihood-ratio view (Fan and Song, 09).

Theorem 1: $L_j^\star = 0 \iff \operatorname{cov}(Y, X_j) = \operatorname{cov}(b'(X^T\beta^\star), X_j) = 0 \iff \beta_j^M = 0$.

For Gaussian covariates, the conclusion holds if $\operatorname{cov}(X^T\beta^\star, X_j) = 0$, i.e., under independence.

SLIDE 20

Theoretical Basis – Population Aspect II

True model: $\mathcal M_\star = \{j : \beta_j^\star \ne 0\}$, where $\beta^\star = \arg\min E\,L(Y, X^T\beta)$.

Theorem 2: If $|\operatorname{cov}(b'(X^T\beta^\star), X_j)| \ge c_1 n^{-\kappa}$ for $j \in \mathcal M_\star$, then

$$\min_{j \in \mathcal M_\star} |\beta_j^M| \ge c_1 n^{-\kappa}, \qquad \min_{j \in \mathcal M_\star} |L_j^\star| \ge c_2 n^{-2\kappa}.$$

If $\{X_j,\ j \notin \mathcal M_\star\}$ is independent of $\{X_i,\ i \in \mathcal M_\star\}$, then $L_j^\star = 0$ for $j \notin \mathcal M_\star$.

For Gaussian covariates, the conclusion holds if $|\operatorname{cov}(X^T\beta^\star, X_j)| \ge c_1 n^{-\kappa}$, the minimum condition even for least squares.

SLIDE 22

Sampling Aspect: Sure independence screening

Theorem 3: If $\nu_n = c n^{-2\kappa}$ for $\kappa < 1/2$ and $\log s_n = o(n^{1-2\kappa})$, then

$$P\bigl(\mathcal M_\star \subset \hat{\mathcal M}_{\nu_n}\bigr) \to 1 \quad \text{exponentially fast.}$$

- No conditions on the covariance matrix! This is a SIS property with the size of the selected set controlled.
- Note that $\hat L_j - L_j^\star = O(\log p / n^{1/2})$ while the minimum signal is $O(n^{-2\kappa})$. How to deal with this? Appeal to the invariance of rankings under monotonic transforms.
- Screening using the Wald statistic $\hat\beta_j^M$ also has the SIS property.

SLIDE 25

Screening by MMLE

Let $\hat{\mathcal M}^w_{\gamma_n} = \{j : |\hat\beta_j^M| \ge \gamma_n\}$.

1. $P(\max_j |\hat\beta_j^M - \beta_j^M| > c_3 n^{-\kappa}) = o(1)$ if $\log p_n = o(n^{1-2\kappa})$.
2. $P(\mathcal M_\star \subset \hat{\mathcal M}^w_{\gamma_n}) \to 1$ if $\gamma_n = c_0 n^{-\kappa}$ with $c_0 < c_1/2$.
3. What is the selected model size? We establish
$$\|\beta^M\|^2 = O(\|\Sigma\beta^\star\|^2) = O\{\lambda_{\max}(\Sigma)\,\beta^{\star T}\Sigma\beta^\star\} = O(\lambda_{\max}(\Sigma)).$$
4. Hence $\#\{j : |\beta_j^M| \ge \gamma_n\} = O_P\{\gamma_n^{-2}\lambda_{\max}(\Sigma)\}$, and so is the selected model size.

SLIDE 28

Sampling Aspect: Controlling number of features

Theorem 4: If $\log p_n = o(n^{1-2\kappa})$, then $P\bigl[\,|\hat{\mathcal M}_{\nu_n}| \le O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}\,\bigr] \to 1$.

- We establish $\|L^\star\|^2 = O(\|\beta^M\|^2) = O(\|\Sigma\beta^\star\|^2)$.
- The number of selected covariates depends on the population covariance; it is bounded by $O(\gamma_n^{-2}\|\Sigma\beta^\star\|^2) = O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}$.

SLIDE 30

Moderate-scale selection

SLIDE 31

Moderate-scale Model Selectors

Penalized likelihood: $n^{-1}\sum_{i=1}^n L(Y_i, \beta_0 + x_{i,d}^T\beta) + \sum_{j=1}^d p_\lambda(|\beta_j|)$, which simultaneously estimates coefficients and selects variables.

- Penalties: Lasso (Tibshirani, 96), LARS (Efron et al., 04), adaptive Lasso (Zou, 06), approximate sparsity (Huang and Zhang, 06), SCAD (Fan and Li, 01, 06; Fan and Peng, 04).
- Algorithms: LQA (Fan and Li, 01), MM (Hunter and Li, 05), LLA (Zou and Li, 08), and PLUS (Zhang, 07).

[Plot: the SCAD penalty $p_\lambda(|\beta|)$ as a function of $\beta$.]

Dantzig selector (Candès and Tao, 07):
$$\min_{\beta \in \mathbb R^{p_n}} \|\beta\|_1 \quad \text{subject to} \quad \|X^T r\|_\infty \le \lambda_{p_n}\sigma,$$
with $\lambda_{p_n} > 0$, $r = y - X\beta$, and $\sigma$ the noise level; approximately equivalent to the Lasso (Bickel et al., 2008).

SLIDE 34

Connections among penalized least-squares

PLS: $\|y - X\beta\|^2 + \sum_{i=1}^{p_n} p_\lambda(|\beta_i|)$.

LLA: with initial value $\beta_0$ (Zou and Li, 08),
$$\|y - X\beta\|^2 + \sum_{i=1}^{p_n} \bigl\{p_\lambda(|\beta_{i,0}|) + p'_\lambda(|\beta_{i,0}|)(|\beta_i| - |\beta_{i,0}|)\bigr\}.$$

Weighted $L_1$: $\|y - X\beta\|^2 + \sum_{i=1}^{p_n} w(|\beta_{i,0}|)\,|\beta_i|$.

Fan and Li (01) stressed unbiasedness. Convergence: the objective function decreases at each iteration.

[Plots: the SCAD penalty with its local linear majorant, and a one-dimensional illustration of the approximation.]
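A minimal sketch of one LLA step for SCAD-penalized least squares, solved as a weighted $L_1$ problem by coordinate descent (our own illustration; the $(2n)^{-1}$ scaling of the squared loss is a convention we choose here):

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD derivative p'_lambda(t) for t >= 0 (Fan and Li, 01)."""
    return lam * ((t <= lam) + (t > lam) * np.maximum(a * lam - t, 0) / ((a - 1) * lam))

def lla_step(X, y, beta0, lam, n_sweeps=50):
    """One LLA step: minimize (2n)^{-1} ||y - X b||^2 + sum_j w_j |b_j|
    with weights w_j = p'_lambda(|beta0_j|), via cyclic coordinate descent."""
    n, p = X.shape
    w = scad_deriv(np.abs(beta0), lam)
    beta = beta0.astype(float).copy()
    col_ms = (X ** 2).sum(axis=0) / n      # mean square of each column
    r = y - X @ beta                       # current residual
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * beta[j]         # partial residual excluding feature j
            z = X[:, j] @ r / n
            beta[j] = np.sign(z) * max(abs(z) - w[j], 0.0) / col_ms[j]
            r -= X[:, j] * beta[j]
    return beta
```

Iterating `lla_step` from the previous estimate reproduces the decreasing-objective behavior noted above.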

SLIDE 35

Risk Comparisons of Penalized Least-Squares

Penalized least-squares: minimize $(z - \theta)^2 + p_\lambda(|\theta|)$, with risk $R(\hat\theta, \theta) = E_\theta(\hat\theta - \theta)^2$ and $Z \sim N(\theta, 1)$; $\lambda = 2$ for hard thresholding.

[Plot: risk curves of the SCAD, hard-, and soft-thresholding estimators as functions of $\theta$.]
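The three thresholding rules are explicit, so the risk curves can be reproduced by Monte Carlo (a sketch; the closed-form SCAD rule below is the standard one for this Gaussian-mean problem, and the values of $\theta$ and $\lambda$ are illustrative):

```python
import numpy as np

def hard(z, lam):
    return z * (np.abs(z) > lam)

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0)

def scad(z, lam, a=3.7):
    """Closed-form SCAD thresholding rule (Fan and Li, 01)."""
    out = np.where(np.abs(z) <= 2 * lam, soft(z, lam),
                   ((a - 1) * z - np.sign(z) * a * lam) / (a - 2))
    return np.where(np.abs(z) > a * lam, z, out)

theta, lam = 1.5, 2.0
z = theta + np.random.default_rng(0).standard_normal(100_000)
for rule in (hard, soft, scad):  # Monte Carlo risk E(thetahat - theta)^2
    print(rule.__name__, np.mean((rule(z, lam) - theta) ** 2))
```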

SLIDE 36

Iterative feature selection

SLIDE 37

Drawback of Independence Screening

False negative: a feature with $\operatorname{cov}(X_j, X^T\beta^\star) = 0$ cannot be selected, yet it can be a signature variable. Example: if $\{X_j\}_{j=1}^{J+1}$ have common correlation $\rho$, then
$$\operatorname{cov}(X_{J+1},\ X_1 + \cdots + X_J - J\rho\, X_{J+1}) = 0.$$

False positive: predictors that are jointly unimportant but marginally important are ranked too high:
$$\operatorname{cov}(X_{J+1},\ X_1 + \cdots + X_J - 0.2\, X_{p+1}) = J\rho.$$
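A quick numeric check of the false-negative example (ours; $J$, $\rho$, and $p$ are illustrative):

```python
import numpy as np

J, rho, p = 5, 0.5, 8
Sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)  # equicorrelated design

beta = np.zeros(p)
beta[:J] = 1.0           # X_1 + ... + X_J
beta[J] = -J * rho       # minus J*rho times X_{J+1}
print(Sigma[J] @ beta)   # cov(X_{J+1}, X^T beta) = 0: invisible to marginal screening
```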

SLIDE 40

Iterative feature selection

1. (Large-scale screening) Apply SIS to pick a set $\mathcal A_1$. (Moderate-scale selection) Employ a penalized likelihood to select a subset $\mathcal M_1$ of these indices.

2. (Large-scale screening) Rank the remaining features by their additional (conditional) contribution
$$L_j^{(2)} = \min_{\beta_0,\ \beta_{\mathcal M_1},\ \beta_j}\ n^{-1}\sum_{i=1}^n L\bigl(Y_i,\ \beta_0 + x_{i,\mathcal M_1}^T\beta_{\mathcal M_1} + X_{ij}\beta_j\bigr),$$
resulting in a new feature set $\mathcal A_2$. This improves on Fan and Lv (08), who set $\beta_{\mathcal M_1} = \hat\beta_{\mathcal M_1}$ from the previous fit.

SLIDE 42

Iterative feature selection (II)

3. (Moderate-scale selection) Minimize with respect to $\beta_{\mathcal M_1}$ and $\beta_{\mathcal A_2}$
$$\sum_{i=1}^n L\bigl(Y_i,\ \beta_0 + x_{i,\mathcal M_1}^T\beta_{\mathcal M_1} + x_{i,\mathcal A_2}^T\beta_{\mathcal A_2}\bigr) + \sum_{j \in \mathcal M_1 \cup \mathcal A_2} p_\lambda(|\beta_j|),$$
resulting in $\mathcal M_2$. This allows deletion, an improvement over ISIS (Fan and Lv, 08).

4. Repeat steps 1-3 until $|\mathcal M_\ell| = d$ (prescribed) or $\mathcal M_\ell = \mathcal M_{\ell-1}$. A sketch of the whole loop follows.
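A high-level sketch of the iteration (our own pseudocode-style Python; `screen_conditional` and `penalized_fit` stand in for the screening and selection steps above and are assumptions, not a published API):

```python
def isis(X, y, d, screen_conditional, penalized_fit, max_iter=10):
    """Iterative SIS: alternate conditional screening and penalized selection.

    screen_conditional(X, y, selected) -> candidate set A_l (large-scale screening)
    penalized_fit(X, y, candidates)    -> selected subset M_l (moderate-scale selection)
    """
    selected = set()
    for _ in range(max_iter):
        candidates = screen_conditional(X, y, selected)
        new_selected = penalized_fit(X, y, selected | candidates)  # may delete variables
        if new_selected == selected or len(new_selected) >= d:     # stopping rules (step 4)
            return new_selected
        selected = new_selected
    return selected
```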

SLIDE 43

Reduction of false selection rates

Variant 1: randomly split the sample to obtain two screened sets $\hat{\mathcal A}^{(1)}$ and $\hat{\mathcal A}^{(2)}$, and take $\hat{\mathcal A} = \hat{\mathcal A}^{(1)} \cap \hat{\mathcal A}^{(2)}$.

Intuition: if both halves have the SIS property, then so does $\hat{\mathcal A}$, with a lower false selection rate (FSR).

Theorem 1: with prescribed $d$,
$$P\bigl(|\hat{\mathcal A} \cap \mathcal M_\star^c| \ge r\bigr) \le \binom{d}{r}^2 \Big/ \binom{p - |\mathcal M_\star|}{r} \le \frac{1}{r!}\left(\frac{d^2}{p - |\mathcal M_\star|}\right)^r,$$
a blessing of dimensionality!

Variant 2: recruit as many variables into the equal-sized sets $\hat{\mathcal A}^{(1)}$ and $\hat{\mathcal A}^{(2)}$ as required so that $|\hat{\mathcal A}| = d$ (prescribed).
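The bound in Theorem 1 is easy to evaluate numerically (a sketch; the values of $p$, $|\mathcal M_\star|$, $d$, and $r$ are illustrative):

```python
from math import comb, factorial

p, s, d, r = 10_000, 6, 50, 2   # s plays the role of |M_star|
exact = comb(d, r) ** 2 / comb(p - s, r)
loose = (d ** 2 / (p - s)) ** r / factorial(r)
print(exact, loose)  # about 0.030 <= 0.031: two false selections are already unlikely
```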

SLIDE 46

Numerical Studies

SLIDE 47

Design of Simulations

Contexts: ⋆ logistic regression; ⋆ Poisson regression; ⋆ $L_1$-regression; ⋆ multiclass SVM.

Covariates: $p = 1000$, each $X_i \sim N(0,1)$ marginally.

1. $X_1, \dots, X_p \sim$ i.i.d. $N(0,1)$.
2. $\operatorname{corr}(X_i, X_4) = 1/\sqrt 2$ and otherwise $\operatorname{corr}(X_i, X_j) = 1/2$.
3. The same, except $\operatorname{corr}(X_i, X_{p+1}) = 0$.

SLIDE 48

Logistic regression, independent covariates

$\beta_1 = 1.24$, $\beta_2 = -1.34$, $\beta_3 = -1.35$, $\beta_4 = -1.80$, $\beta_5 = -1.58$, $\beta_6 = -1.60$. Bayes test error: 0.1368. $n = 400$, $N_{\text{sim}} = 100$.

| | SIS | ISIS | Var2-SIS | LASSO | NSC |
|---|---|---|---|---|---|
| med $\|\hat\beta - \beta\|_1$ | 1.11 | 1.25 | 1.21 | 8.48 | N/A |
| med $\|\hat\beta - \beta\|_2^2$ | 0.49 | 0.52 | 0.52 | 1.70 | N/A |
| True positive | 0.99 | 0.84 | 0.91 | 1.00 | 0.34 |
| Med. model size | 6 | 6 | 6 | 94 | 3 |
| $2Q(\hat\beta_0, \hat\beta)$ (training) | 237 | 247 | 243 | 164 | N/A |
| AIC | 250 | 260 | 256 | 353 | N/A |
| BIC | 278 | 285 | 282 | 725 | N/A |
| $2Q(\hat\beta_0, \hat\beta)$ (test) | 272 | 273 | 273 | 319 | N/A |
| 0-1 test error | 0.14 | 0.14 | 0.14 | 0.17 | 0.36 |

SLIDE 49

Logistic regression, difficult case: false negative

$\beta_1 = 4$, $\beta_2 = 4$, $\beta_3 = 4$, $\beta_4 = -6\sqrt 2$, so that $\operatorname{cov}(X_4, X^T\beta^\star) = 0$. $X_4$ is a signature variable: Bayes error 0.107 with $X_4$ and 0.344 without it.

| | Van-SIS | ISIS | Var2-ISIS | LASSO | NSC |
|---|---|---|---|---|---|
| med $\|\hat\beta - \beta\|_1$ | 20.1 | 1.94 | 1.85 | 21.6 | N/A |
| med $\|\hat\beta - \beta\|_2^2$ | 9.41 | 1.05 | 0.98 | 9.11 | N/A |
| True positive | 0.00 | 1.00 | 1.00 | 0.00 | 0.21 |
| Med. model size | 16 | 4 | 4 | 91 | 16.5 |
| $2Q(\hat\beta_0, \hat\beta)$ (training) | 307 | 187 | 187 | 127 | N/A |
| AIC | 334 | 196 | 195 | 311 | N/A |
| BIC | 386 | 212 | 212 | 672 | N/A |
| $2Q(\hat\beta_0, \hat\beta)$ (test) | 344 | 204 | 204 | 259 | N/A |
| 0-1 test error | .193 | .109 | .109 | 0.141 | 0.377 |

SLIDE 50

Logistic regression, the most difficult case

$\beta_1 = 4$, $\beta_2 = 4$, $\beta_3 = 4$, $\beta_4 = -6\sqrt 2$, $\beta_{p+1} = 4/3$, with $\operatorname{cov}(X_4, X^T\beta^\star) = 0$. Bayes error: 0.1040.

| | Van-SIS | ISIS | Var2-ISIS | LASSO | NSC |
|---|---|---|---|---|---|
| med $\|\hat\beta - \beta\|_1$ | 20.6 | 2.69 | 3.24 | 23.2 | N/A |
| med $\|\hat\beta - \beta\|_2^2$ | 9.46 | 1.36 | 1.59 | 9.11 | N/A |
| True positive | 0.00 | 0.90 | 0.98 | 0.00 | 0.17 |
| Med. model size | 16 | 5 | 5 | 102 | 10 |
| $2Q(\hat\beta_0, \hat\beta)$ (training) | 269 | 188 | 188 | 109 | N/A |
| AIC | 289 | 198 | 199 | 311 | N/A |
| BIC | 337 | 218 | 219 | 714 | N/A |
| $2Q(\hat\beta_0, \hat\beta)$ (test) | 361 | 225 | 226 | 276 | N/A |
| 0-1 test error | .193 | .112 | .112 | .146 | .387 |

SLIDE 51

Poisson regression, independent covariates

$\beta_0 = 5$, $\beta_1 = -0.54$, $\beta_2 = 0.53$, $\beta_3 = -0.50$, $\beta_4 = -0.49$, $\beta_5 = -0.41$, $\beta_6 = 0.52$; $n = 200$, $N_{\text{sim}} = 100$.

| | SIS | ISIS | Var2-ISIS | LASSO |
|---|---|---|---|---|
| med $\|\hat\beta - \beta\|_1$ | .070 | .124 | .122 | .197 |
| med $\|\hat\beta - \beta\|_2^2$ | .023 | .032 | .033 | .054 |
| True positive | .76 | 1.00 | 1.00 | 1.00 |
| Med. model size | 12 | 18 | 17 | 27 |
| $2Q(\hat\beta_0, \hat\beta)$ (training) | 1561 | 1502 | 1510 | 1534 |
| AIC | 1586 | 1538 | 1542 | 1587 |
| BIC | 1627 | 1597 | 1595 | 1674 |
| $2Q(\hat\beta_0, \hat\beta)$ (test) | 1558 | 1594 | 1589 | 1645 |

SLIDE 52

Poisson Regression, difficult case

$\beta_0 = 5$, $\beta_1 = 0.6$, $\beta_2 = 0.6$, $\beta_3 = 0.6$, $\beta_4 = -0.9\sqrt 2$, so that $\operatorname{cov}(X_4, X^T\beta^\star) = 0$.

| | ISIS | Var2-ISIS | LASSO |
|---|---|---|---|
| med $\|\hat\beta - \beta\|_1$ | .271 | .225 | 3.07 |
| med $\|\hat\beta - \beta\|_2^2$ | .072 | .068 | 1.29 |
| True positive | 1.00 | .97 | 0.00 |
| Median final model size | 18 | 16 | 174 |
| $2Q(\hat\beta_0, \hat\beta)$ (training) | 1494 | 1509 | 1364 |
| AIC | 1531 | 1541 | 1718 |
| BIC | 1590 | 1596 | 2293 |
| $2Q(\hat\beta_0, \hat\beta)$ (test) | 1629 | 1615 | 2213 |

SLIDE 53

Poisson Regression, the most difficult case

$\beta_0 = 5$, $\beta_1 = 0.6$, $\beta_2 = 0.6$, $\beta_3 = 0.6$, $\beta_4 = -0.9\sqrt 2$, $\beta_{p+1} = -0.15$, with $\operatorname{cov}(X_4, X^T\beta^\star) = 0$.

| | Van-ISIS | Var2-ISIS | LASSO |
|---|---|---|---|
| med $\|\hat\beta - \beta\|_1$ | .254 | .232 | 3.09 |
| med $\|\hat\beta - \beta\|_2^2$ | .068 | .068 | 1.29 |
| True positive | .97 | .91 | 0.00 |
| Median final model size | 18 | 16 | 174 |
| $2Q(\hat\beta_0, \hat\beta)$ (training) | 1500 | 1516 | 1367 |
| AIC | 1536 | 1547 | 1715 |
| BIC | 1595 | 1600 | 2294 |
| $2Q(\hat\beta_0, \hat\beta)$ (test) | 1640 | 1631 | 2389 |

SLIDE 54

Neuroblastoma Data (MAQC-II)

1. 251 patients from the German Neuroblastoma Trials NB90-NB2004, diagnosed between 1989 and 2004, aged 0 to 296 months (median 15 months).
2. Neuroblastoma is a common paediatric solid cancer (15%).
3. 251 customized oligonucleotide microarrays with $p = 10{,}707$ probes.
4. Focus on "3-year event-free survival": whether each patient survived 3 years after the diagnosis of neuroblastoma ($n = 239$, with 49 "+" and 190 "−").
5. Aim: to study which genes are responsible for neuroblastoma and its risk association.

SLIDE 57

Results

Training set and endpoints:

1. "3-y EFS": random $n = 125$ subjects (25 "+" and 100 "−").
2. "Gender": random 120 males and 50 females; total 246.

Testing set: the remaining subjects.

| Endpoint | | SIS | ISIS | var2-ISIS | LASSO | NSC | Total |
|---|---|---|---|---|---|---|---|
| 3-y EFS | No. of predictors | 5 | 23 | 12 | 57 | 9413 | 10,707 |
| | Test errors | 19 | 22 | 21 | 22 | 24 | 114 |
| Gender | No. of predictors | 6 | 2 | 2 | 42 | 3 | 10,707 |
| | Test errors | 4 | 4 | 4 | 5 | 4 | 126 |

SLIDE 59

Multi-category Classification

SLIDE 60

The ISIS method

Linear classifier: $\arg\max_k f_k(x)$, where $f_k(x) \equiv \beta_{0k} + x^T\beta_k$.

Loss: $L(Y, f(x; B)) = \sum_{j \ne Y} [1 + f_j(x)]_+$.

Marginal utility of the $j$-th feature (Lee et al., 2004; Liu et al., 2007):
$$L_j = \min_B \sum_{i=1}^n L(Y_i, f(X_{ij}; B)) + \tfrac12 \sum_k \beta_{jk}^2 \quad \text{(for identifiability)}.$$
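A sketch of this multicategory hinge loss for one observation (ours, for illustration):

```python
import numpy as np

def multicat_hinge(y, f):
    """L(y, f) = sum over j != y of [1 + f_j]_+ ; y is the class index,
    f the vector of classifier scores f_k(x)."""
    loss = np.maximum(0.0, 1.0 + f)
    return loss.sum() - loss[y]   # drop the j = y term

print(multicat_hinge(0, np.array([2.0, -1.5, 0.3])))  # 0.0 + 1.3 = 1.3
```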

SLIDE 61

Simulation Experiments

Design: $\tilde X_1, \dots, \tilde X_4 \sim U[-\sqrt 3, \sqrt 3]$ and $\tilde X_5, \dots, \tilde X_p \sim N(0,1)$.

- Case 1: $X_j = \tilde X_j$ for $j = 1, \dots, p$.
- Case 2: $X_1 = \tilde X_1 - \sqrt 2\,\tilde X_5$, $X_2 = \tilde X_2 + \sqrt 2\,\tilde X_5$, $X_3 = \tilde X_3 - \sqrt 2\,\tilde X_5$, $X_4 = \tilde X_4 + \sqrt 2\,\tilde X_5$, and $X_j = \sqrt 3\,\tilde X_j$ for $j = 5, \dots, p$.

Response: 4 categories with $P(Y = k \mid \tilde X = \tilde x) \propto \exp\{f_k(\tilde x)\}$, where $f_1(\tilde x) = -a\tilde x_1 + a\tilde x_4$, $f_2(\tilde x) = a\tilde x_1 - a\tilde x_2$, $f_3(\tilde x) = a\tilde x_2 - a\tilde x_3$, $f_4(\tilde x) = a\tilde x_3 - a\tilde x_4$, with $a = 5/\sqrt 3$.

SLIDE 63

Simulation results, n = 400

| | SIS | ISIS | Var2-ISIS | LASSO | NSC |
|---|---|---|---|---|---|
| Case 1: True positive | 1.00 | 1.00 | 1.00 | 0.00 | 0.68 |
| Median model size | 2.5 | 4 | 5 | 19 | 4 |
| 0-1 test error | 0.306 | .301 | .292 | .330 | .452 |
| Standard error | .007 | .006 | .006 | .008 | .021 |
| Case 2: True positive | .10 | 1.00 | 1.00 | .33 | .30 |
| Median model size | 4 | 11 | 9 | 54 | 9 |
| 0-1 test error | .436 | .304 | .298 | .430 | .624 |
| Standard error | .007 | .007 | .006 | .004 | .008 |

Test errors are based on $200n$ test cases.

SLIDE 64

Children Cancer Data

Classification: ⋆ neuroblastoma (NB); ⋆ rhabdomyosarcoma (RMS); ⋆ non-Hodgkin lymphoma (NHL); ⋆ Ewing family of tumors (EWS).

Data: cDNA microarrays with 2,308 genes (from 6,567).
Training: 63 samples (12 NB, 20 RMS, 8 NHL, 23 EWS).
Testing: 20 samples (6 NB, 5 RMS, 3 NHL, 6 EWS).

Results: all methods have zero test errors.

| Method | ISIS | var2-ISIS | LASSO | NSC |
|---|---|---|---|---|
| # selected genes | 15 | 14 | 71 | 343 |

SLIDE 66

Summary and Conclusion

1. Propose large-scale screening and moderate-scale selection:
   ◮ uses conditional independence screening;
   ◮ allows variable deletion in the process;
   ◮ gains estimation accuracy, computational expediency, and algorithmic stability.
2. Applicable to many contexts: ⋆ GLIM; ⋆ robust regression; ⋆ machine learning.
3. Demonstrate its utility via extensive simulations; it handles well the most difficult cases.
4. Provide a theoretical foundation for independence learning.

SLIDE 70

The End

Happy Birthday!
