

SLIDE 1

A new Bayesian variable selection criterion based on a g-prior extension for p > n

Yuzo Maruyama (CSIS, The University of Tokyo, Japan) and Edward George (Department of Statistics, University of Pennsylvania)

SLIDE 2

Overview: Our recommended Bayes factor

            

Our recommended Bayes factor against the null model has the exact closed form

$$
\mathrm{BF}[M_\gamma;M_N]=
\begin{cases}
\bigl(\mathrm{sv}[X_\gamma]\,\|\hat\beta^{\mathrm{MP}}_{\mathrm{LSE}}[\gamma]\|\bigr)^{-n+1}
&\text{if } q_\gamma\ge n-1,\\[2ex]
\dfrac{d_{q_\gamma}^{\,q_\gamma}\,(1-R^2_\gamma)^{-\frac{n-q_\gamma}{2}+\frac34}\,
B\!\left(\frac{q_\gamma}{2}+\frac14,\ \frac{n-q_\gamma}{2}-\frac34\right)}
{\mathrm{sv}[X_\gamma]^{\,q_\gamma}\,\bigl(1-R^2_\gamma+d_{q_\gamma}^2\|\hat\beta_{\mathrm{LSE}}[\gamma]\|^2\bigr)^{\frac14+\frac{q_\gamma}{2}}\,
B\!\left(\frac14,\ \frac{n-q_\gamma}{2}-\frac34\right)}
&\text{if } q_\gamma\le n-2.
\end{cases}
$$
◮ A criterion based on full Bayes, but we need no MCMC
◮ An exact closed form by using a special prior
◮ applicable for p > n as well as n > p
◮ model selection consistency and good numerical performance
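Below is a minimal numerical sketch of this criterion (our own illustration, not from the talk): the function name log_bf, the use of numpy/scipy, and the convention that ˆβLSE[γ] is computed for the normalized response (y − ȳ1n)/‖y − ȳ1n‖, matching the definition of ˆβ^MP_LSE[γ] later in the deck, are all our assumptions.

```python
import numpy as np
from scipy.special import betaln

def log_bf(y, Xg):
    """Log of the recommended Bayes factor BF[M_gamma; M_N] above (a sketch)."""
    n, q = Xg.shape
    Xc = Xg - Xg.mean(axis=0)              # column-centered design matrix
    v = y - y.mean()                       # v = y - ybar 1_n
    U, d, _ = np.linalg.svd(Xc, full_matrices=False)
    r = min(q, n - 1)
    U, d = U[:, :r], d[:r]                 # d_1 >= ... >= d_r > 0 assumed
    z = U.T @ v                            # z_i = u_i' v
    sv = np.exp(np.log(d).mean())          # sv[X_gamma]: geometric mean of d_i
    if q >= n - 1:                         # many regressors case
        beta_mp = np.sqrt(np.sum((z / d) ** 2)) / np.linalg.norm(v)
        return (-n + 1) * np.log(sv * beta_mp)
    # few regressors case (q <= n - 2); beta_LSE is assumed to be the LSE
    # for the normalized response v/||v||, matching the beta_MP convention
    R2 = (z @ z) / (v @ v)
    beta2 = np.sum((z / d) ** 2) / (v @ v)     # ||beta_LSE||^2
    dq = d[-1]                                 # smallest singular value d_q
    return (q * np.log(dq / sv)
            + (-(n - q) / 2 + 0.75) * np.log(1 - R2)
            - (q / 2 + 0.25) * np.log(1 - R2 + dq ** 2 * beta2)
            + betaln(q / 2 + 0.25, (n - q) / 2 - 0.75)
            - betaln(0.25, (n - q) / 2 - 0.75))
```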

SLIDE 3

Outline:
◮ Introduction
◮ Priors
◮ Sketch of the calculation of the marginal density
◮ The estimation after selection
◮ Model selection consistency
◮ Numerical experiments
◮ Summary and Future work

SLIDE 4

Full model

◮ Y | {α, β, σ²} ∼ Nn(α1n + Xβ, σ²I)
◮ α: an intercept parameter
◮ 1n = (1, 1, . . . , 1)′
◮ X = (X1, . . . , Xp): an n × p standardized design matrix, rank X = min(n − 1, p)
◮ β: a p × 1 vector of unknown coefficients
◮ σ²: an unknown variance

Since there is usually a subset of useless regressors in the full model, we would like to choose a good sub-model with only important regressors.

SLIDE 5

Submodel

◮ submodel Mγ: Y | {α, βγ, σ²} ∼ Nn(α1n + Xγβγ, σ²I)
◮ Assume the intercept is always included
◮ Xγ: the n × qγ matrix whose columns are the γth subset of X1, . . . , Xp; rank Xγ = min(n − 1, qγ)
◮ βγ: a qγ × 1 vector of unknown regression coefficients
◮ qγ: the number of regressors of Mγ
◮ The null model, the special case of a sub-model with no regressors: MN : Y | {α, σ²} ∼ Nn(α1n, σ²I)

SLIDE 6

Variable selection in the Bayesian framework

◮ It entails the specification of priors
  ◮ on the models: Pr(Mγ)
  ◮ on the parameters p(α, βγ, σ²) of each model
◮ Assumption: equal model space probability, Pr(Mγ) = Pr(Mγ′) for any γ ≠ γ′
◮ Choose as best the model Mγ which maximizes the posterior probability

Pr(Mγ | y) = mγ(y) / Σγ′ mγ′(y)

◮ mγ(y): the marginal density under Mγ; larger mγ(y) is better!
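For instance, under the equal-prior assumption the posterior model probabilities are just normalized marginal densities; a minimal Python sketch (function name and example values are ours):

```python
import numpy as np

def posterior_probs(log_m):
    """Pr(M_gamma | y) from log marginal densities log m_gamma(y),
    assuming equal prior model probabilities."""
    log_m = np.asarray(log_m, dtype=float)
    w = np.exp(log_m - log_m.max())    # subtract the max for numerical stability
    return w / w.sum()

print(posterior_probs([-10.2, -9.7, -13.1]))   # the largest m_gamma(y) wins
```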

SLIDE 7

Variable selection in the Bayesian framework

◮ the marginal density

$$ m_\gamma(y) = \iiint p(y \mid \alpha, \beta_\gamma, \sigma^2)\, p(\alpha, \beta_\gamma, \sigma^2)\, d\alpha\, d\beta_\gamma\, d\sigma^2 $$

◮ Recall that we consider the full Bayes method, which means the joint prior density p(α, βγ, σ²) does not depend on the data, unlike the empirical Bayes method.
◮ The Bayes factor is often used to express Pr(Mγ | y):

Pr(Mγ | y) = BF(Mγ; MN) / Σγ′ BF(Mγ′; MN), where BF(Mγ; MN) = mγ(y) / mN(y)

SLIDE 8

Outline (current section: Priors)

SLIDE 9

Priors

◮ The form of our joint prior density:

p(α, βγ, σ²) = p(α) p(σ²) p(βγ | σ²) = 1 × σ⁻² × ∫ p(βγ | g, σ²) p(g) dg

◮ 1 × σ⁻²: a popular non-informative prior
◮ improper, but justified because α and σ² are included in all submodels
◮ p(βγ | g, σ²) and p(g) are specified below

SLIDE 10

The original Zellner’s g-prior

◮ prior of regression coefficients
◮ Zellner’s (1986) g-prior is popular:

p(βγ | σ², g) = N_{qγ}(0, g σ² (X′γXγ)⁻¹)

◮ It is applicable in the traditional situation p + 1 < n ⇒ qγ + 1 < n for any Mγ
◮ There are many papers which use g-priors, including George and Foster (2000, Biometrika) and Liang et al. (2008, JASA)
SLIDE 11

The beauty of the g-prior

◮ The marginal density of y given g and σ²:

$$ m_\gamma(y \mid g, \sigma^2) \propto \exp\Bigl[\frac{g}{g+1}\Bigl(\max_{\alpha,\beta_\gamma}\log p(y \mid \alpha, \beta_\gamma, \sigma^2) \;-\; \frac{q_\gamma}{2}\,\frac{g+1}{g}\log(g+1)\Bigr)\Bigr] $$

◮ Under known σ², setting g⁻¹(g + 1) log(g + 1) = 2 or = log n leads to AIC (Akaike, 1974) and BIC (Schwarz, 1978), respectively
◮ several studies consider how to choose g based on non-full-Bayesian methods
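As a numerical aside (not on the slide), the g implied by a given per-parameter penalty c can be found by solving (g + 1)g⁻¹ log(g + 1) = c; a sketch assuming scipy is available:

```python
import numpy as np
from scipy.optimize import brentq

def g_for_penalty(c):
    """Solve (g + 1)/g * log(g + 1) = c for g > 0 (requires c > 1)."""
    return brentq(lambda g: (g + 1) / g * np.log(g + 1) - c, 1e-8, 1e12)

print(g_for_penalty(2.0))          # g reproducing the AIC penalty (about 3.9)
print(g_for_penalty(np.log(30)))   # g reproducing the BIC penalty for n = 30
```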

SLIDE 12

Many regressors case (p > n)

◮ In modern statistics, treating the (very) many regressors case (p > n) becomes more and more important
◮ the original Zellner’s g-prior is not available (X′γXγ is singular)
◮ R² is always 1 in the case where qγ ≥ n − 1 ⇒ naive AIC and BIC methods do not work
◮ When we do not use the original g-prior, Bayesian methods are available in the many regressors case, for example β ∼ N(0, σ²λI)
◮ an inverse-gamma conjugate prior for σ² is also available

SLIDE 13

Many regressors case (p > n)

◮ The integral with respect to λ still remains in mγ(y) as long as the full Bayes method is considered.
◮ Needless to say, it must then be calculated by numerical methods like MCMC or by approximations like the Laplace method.
◮ We do not have a comparative advantage in numerical methods...
◮ We like exact analytical results very much.

SLIDE 14

A variant of Zellner’s g-prior

◮ a special variant of the g-prior which enables us
  ◮ not only to calculate the marginal density analytically (closed form!)
  ◮ but also to treat the many regressors case
◮ [KEY] singular value decomposition of Xγ:

$$ X_\gamma = U_\gamma D_\gamma W_\gamma' = \sum_{i=1}^{r} d_i[\gamma]\, u_i[\gamma]\, w_i'[\gamma] $$

◮ r = rank Xγ = min(qγ, n − 1); the n − 1 comes from X being the centered matrix
◮ singular values d1[γ] ≥ · · · ≥ dr[γ] > 0

SLIDE 15

A special variant of g-prior

$$
p_\beta(\beta \mid g, \sigma^2)=
\begin{cases}
\displaystyle\prod_{i=1}^{n-1} p_i(w_i'\beta \mid g, \sigma^2)\times \text{arbitrary } p_{\#}(W_{\#}'\beta) & \text{if } q \ge n,\\[1ex]
\displaystyle\prod_{i=1}^{q} p_i(w_i'\beta \mid g, \sigma^2) & \text{if } q \le n-1,
\end{cases}
\qquad
p_i(\cdot \mid g, \sigma^2) = N\!\Bigl(0,\ \frac{\sigma^2}{d_i^2}\{\nu_i(1+g)-1\}\Bigr)
$$

W#: a q × (q − r) matrix from the orthogonal complement of W

c.f. the original g-prior:

$$
p_\beta(\beta \mid g, \sigma^2)=\prod_{i=1}^{q} p_i(w_i'\beta \mid g, \sigma^2) \quad \text{if } q \le n-1,
\qquad
p_i(\cdot \mid g, \sigma^2) = N\!\Bigl(0,\ \frac{g\,\sigma^2}{d_i^2}\Bigr)
$$

SLIDE 16

A special variant of g-prior

◮ ν1, . . . , νr (νi ≥ 1), where r = min{n − 1, q}, are hyperparameters we have to fix
◮ q ≤ n − 1 ⇒ (Z′Z)⁻¹ exists; ν1 = · · · = νq = 1 ⇒ the original Zellner’s prior
◮ the descending order ν1 ≥ · · · ≥ νr, like νi = d_i²/d_r² for 1 ≤ i ≤ r (our recommendation; see the sketch below), is reasonable for our purpose
◮ the numerical experiments and the estimation after selection support this choice
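A sketch of the implied per-component prior variances σ²{νi(1 + g) − 1}/d_i² under the recommended νi = d_i²/d_r² (our own illustration; with νi ≡ 1 it recovers the original g-prior variances gσ²/d_i²):

```python
import numpy as np

def prior_variances(d, g, sigma2):
    """Variance of w_i' beta: sigma^2 (nu_i (1 + g) - 1) / d_i^2,
    with the recommended nu_i = d_i^2 / d_r^2 (d sorted descending)."""
    nu = d ** 2 / d[-1] ** 2               # nu_1 >= ... >= nu_r = 1
    return sigma2 * (nu * (1.0 + g) - 1.0) / d ** 2
```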

SLIDE 17

Outline (current section: Sketch of the calculation of the marginal density)

SLIDE 18

Sketch of the calculation of the marginal density

◮ we have prepared all of the priors except for g (a prior on g is given later)
◮ the marginal density of y given g, i.e. the marginal density after the integration w.r.t. α, β, σ²:

$$ m_\gamma(y \mid g) = C(n, y)\,\bigl\{(g+1)(1-R_\gamma^2) + GR_\gamma^2\bigr\}^{-(n-1)/2}\,(1+g)^{(n-1)/2 - r/2} \prod_{i=1}^{r} \nu_i^{-1/2} $$

where GR²γ means the "generalized" R²γ:

$$ GR_\gamma^2 = \sum_{i=1}^{r} \frac{\bigl(u_i'\{y - \bar y 1_n\}\bigr)^2}{\nu_i\,\|y - \bar y 1_n\|^2} $$
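Both R²γ and GR²γ are cheap to compute from the SVD pieces above; a minimal sketch with the recommended νi = d_i²/d_r² (names are ours):

```python
import numpy as np

def r2_and_gr2(y, U, d):
    """R^2_gamma and the generalized GR^2_gamma from the centered SVD."""
    v = y - y.mean()                   # y - ybar 1_n
    z = U.T @ v                        # u_i'(y - ybar 1_n)
    nu = d ** 2 / d[-1] ** 2           # recommended nu_i
    r2 = (z @ z) / (v @ v)
    gr2 = np.sum(z ** 2 / nu) / (v @ v)
    return r2, gr2
```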

SLIDE 19

Many regressors case

◮ rank X = r = n − 1, so R²γ = 1
◮ mγ(y | g) then does not depend on g:

$$ m_\gamma(y) = m_\gamma(y \mid g) = C(n, y)\,\prod_{i=1}^{n-1} \nu_i^{-1/2}\,\bigl(GR_\gamma^2\bigr)^{-(n-1)/2} $$

◮ If ν1 = · · · = νn−1 = 1, GR²γ just becomes 1 and hence mγ(y) = C(n, y)
◮ this does not work for model selection because it always takes the same value in the many regressors case
◮ That is why the choice of ν is important.

SLIDE 20

few regressors case (q ≤ n − 2)

◮ pg(g) = {B(a + 1, b + 1)}⁻¹ g^b (1 + g)^{−a−b−2}
◮ it is proper if a > −1 and b > −1
◮ Liang et al. (2008, JASA) "hyper-g priors": b = 0, pg(g) = (a + 1)(g + 1)^{−a−2}
◮ b = (n − 5 − r)/2 − a is chosen to get a simple closed form of the marginal density
◮ −1 < a < −1/2 ensures that the marginal density of every sub-model is well defined
◮ The median of that interval, a = −3/4, is our recommendation
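A quick numerical sanity check (ours, with toy values) that pg(g) is a proper density for a, b > −1:

```python
import numpy as np
from scipy import integrate, special

def pg_pdf(g, a, b):
    """p(g) = g^b (1 + g)^(-a - b - 2) / B(a + 1, b + 1), proper for a, b > -1."""
    return g ** b * (1.0 + g) ** (-a - b - 2) / special.beta(a + 1, b + 1)

# e.g. a = -3/4 and b = (n - 5 - r)/2 - a with n = 30, r = q = 5 (toy values)
a = -0.75
b = (30 - 5 - 5) / 2 - a
val, _ = integrate.quad(pg_pdf, 0, np.inf, args=(a, b))
print(val)   # ~ 1.0
```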

SLIDE 21

Sketch of the calculation of the marginal density

◮ When b = (n − 5)/2 − r/2 − a, the integration w.r.t. g reduces to a beta function:

$$ m_\gamma(y) = \int m_\gamma(y \mid g)\, p(g)\, dg = C(n, y)\,\frac{B(q/2 + a + 1,\ b + 1)}{B(a + 1,\ b + 1)}\,\frac{\bigl(1 - R_\gamma^2 + GR_\gamma^2\bigr)^{-(n-1)/2 + b + 1}}{\bigl(1 - R_\gamma^2\bigr)^{b+1}\prod_{i=1}^{r}\nu_i^{1/2}} $$

◮ When b ≠ (n − 5)/2 − r/2 − a, an integral involving R²γ and GR²γ remains in mγ(y) ⇒ the need for MCMC or approximation
◮ Liang et al. (2008, JASA) take b = 0, ν1 = · · · = νr = 1 and use the Laplace approximation
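The collapse of the g-integral into a beta function can be verified numerically; a sketch with hypothetical values of R²γ and GR²γ, dropping the factors C(n, y) and ∏νi^{1/2} that do not involve g:

```python
import numpy as np
from scipy import integrate, special

n, q, a = 30, 5, -0.75                  # toy values (ours, not from the talk)
r = q                                   # few regressors case
b = (n - 5) / 2 - r / 2 - a
R2, GR2 = 0.6, 0.5                      # hypothetical R^2_gamma and GR^2_gamma

def integrand(g):
    m_g = (1 + g) ** ((n - 1 - r) / 2) * ((1 + g) * (1 - R2) + GR2) ** (-(n - 1) / 2)
    p_g = g ** b * (1 + g) ** (-a - b - 2) / special.beta(a + 1, b + 1)
    return m_g * p_g

numeric, _ = integrate.quad(integrand, 0, np.inf)
closed = (special.beta(q / 2 + a + 1, b + 1) / special.beta(a + 1, b + 1)
          * (1 - R2 + GR2) ** (-(n - 1) / 2 + b + 1) * (1 - R2) ** (-(b + 1)))
print(numeric, closed)                  # the two numbers should agree
```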

SLIDE 22

Our recommended BF

◮ After inserting our recommended hyperparameters a = −3/4, b = (n − 5)/2 − r/2 − a and νi = d_i²/d_r², our criterion BF[Mγ; MN] = mγ(y)/mN(y) becomes

$$
\mathrm{BF}[M_\gamma;M_N]=
\begin{cases}
\bigl(\mathrm{sv}[X_\gamma]\,\|\hat\beta^{\mathrm{MP}}_{\mathrm{LSE}}[\gamma]\|\bigr)^{-n+1}
&\text{if } q_\gamma\ge n-1,\\[2ex]
\dfrac{d_{q_\gamma}^{\,q_\gamma}\,(1-R^2_\gamma)^{-\frac{n-q_\gamma}{2}+\frac34}\,
B\!\left(\frac{q_\gamma}{2}+\frac14,\ \frac{n-q_\gamma}{2}-\frac34\right)}
{\mathrm{sv}[X_\gamma]^{\,q_\gamma}\,\bigl(1-R^2_\gamma+d_{q_\gamma}^2\|\hat\beta_{\mathrm{LSE}}[\gamma]\|^2\bigr)^{\frac14+\frac{q_\gamma}{2}}\,
B\!\left(\frac14,\ \frac{n-q_\gamma}{2}-\frac34\right)}
&\text{if } q_\gamma\le n-2.
\end{cases}
$$

◮ It is exactly proportional to the posterior probability
◮ based on fundamental aggregated information of y and Xγ

SLIDE 23

Our recommended BF

◮ ˆβLSE[γ]: the ordinary LSE
◮ ˆβ^MP_LSE[γ]: the LSE using the Moore-Penrose inverse matrix of Xγ:

$$ \hat\beta^{\mathrm{MP}}_{\mathrm{LSE}}[\gamma] = \sum_{i=1}^{n-1} w_i[\gamma]\,\frac{u_i'[\gamma](y - \bar y 1_n)}{d_i[\gamma]\,\|y - \bar y 1_n\|} = \frac{X_\gamma^{-}(y - \bar y 1_n)}{\|y - \bar y 1_n\|} $$

◮ sv[Xγ]: the geometric mean of the singular values of Xγ,

$$ \mathrm{sv}[X_\gamma] = \Bigl(\prod_{i=1}^{r} d_i[\gamma]\Bigr)^{1/r}, $$

one of the most important scalar summaries of the design matrix Xγ
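As a usage sketch (synthetic data; numbers are illustrative only), the log_bf function from the Overview section scores submodels on both sides of the qγ boundary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 50                                   # a p > n setting
X = rng.standard_normal((n, p))
y = 1 + 2 * X[:, :4].sum(axis=1) + rng.standard_normal(n)

print(log_bf(y, X[:, :4]))    # q = 4  <= n - 2: the beta-function branch
print(log_bf(y, X[:, :40]))   # q = 40 >= n - 1: the (sv * ||beta_MP||)^(1-n) branch
```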
SLIDE 24

Interpretation of many regressors case

◮ ˆβ^MP_LSE[γ]: the minimizer of ‖β‖ among the solutions of the equation

$$ \frac{y - \bar y 1_n}{\|y - \bar y 1_n\|} = X_\gamma \beta $$

under each submodel Mγ
◮ ‖ˆβ^MP_LSE[γ]‖ itself is not comparable across submodels
◮ sv[Xγ] × ‖ˆβ^MP_LSE[γ]‖ is comparable
◮ the smallest sv[Xγ] × ‖ˆβ^MP_LSE[γ]‖ means the best among the submodels Mγ which satisfy qγ ≥ n − 1
SLIDE 25

Outline (current section: The estimation after selection)

SLIDE 26

The estimation after selection

◮ In order to avoid the identifiability problem when n < q, we consider the estimator of Xβ:

$$ X\hat\beta_{\mathrm{BAYES}} = \sum_{i=1}^{\min(q, n-1)} \Bigl(1 - \frac{E[(1+g)^{-1} \mid y]}{\nu_i}\Bigr)(u_i'v)\, u_i, \qquad \text{c.f. } X\hat\beta_{\mathrm{LSE}} = \sum_{i=1}^{\min(q, n-1)} (u_i'v)\, u_i, $$

where v = y − ȳ1n
◮ u1: the normalized first principal component
◮ . . .
◮ u_{min(q,n−1)}: the normalized last principal component
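A sketch of this component-wise shrinkage estimator; since the slide does not give a closed form for E[(1 + g)⁻¹ | y], it is passed in as a scalar, and all names are ours:

```python
import numpy as np

def xbeta_bayes(y, U, d, shrink):
    """X beta_BAYES = sum_i (1 - shrink / nu_i) (u_i' v) u_i, v = y - ybar 1_n,
    with shrink = E[(1 + g)^{-1} | y] and the recommended nu_i = d_i^2 / d_r^2."""
    v = y - y.mean()
    nu = d ** 2 / d[-1] ** 2               # nu_1 >= ... >= nu_r = 1
    coef = (1.0 - shrink / nu) * (U.T @ v) # smaller nu_i => stronger shrinkage
    return U @ coef
```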

SLIDE 27

The estimation after selection

◮ The descending order ν1 ≥ · · · ≥ ν_{min(q,n−1)} is reasonable
◮ less important components get shrunk more!
◮ See Hastie, Tibshirani and Friedman’s book.
◮ On the other hand, the original Zellner’s g-prior cannot make such a reasonable effect; it shrinks all components equally:

$$ \bigl(1 - E[(1+g)^{-1} \mid y]\bigr)\, X\hat\beta_{\mathrm{LSE}} $$

◮ This effect supports the descending order of ν

SLIDE 28

Outline (current section: Model selection consistency)

SLIDE 29

Model selection consistency

◮ the case where p is fixed and n is large
◮ Definition: plim_{n→∞} Pr(Mγ | y) = 1 if Mγ is the true model
◮ A standard assumption: there exists a positive definite matrix Hγ such that lim_{n→∞} (1/n) X′γXγ = Hγ
◮ Our criterion has model selection consistency!

SLIDE 30

Outline (current section: Numerical experiments)

SLIDE 31

Numerical experiments

possible regressors: p = 16

◮ correlated case: x1, . . . , x8 ∼ N(0, 1) with a correlation structure over the pairs (x1, x2), (x3, x4), (x5, x6), (x7, x8) (correlations 0.9, −0.7, 0.5, −0.3 and 0.1 in the original slide's diagram); x9, . . . , x13 ∼ N(0, 1); x14, x15, x16 ∼ U(−1, 1)
◮ simple case: x1, . . . , x16 ∼ N(0, 1), independent

SLIDE 32

Numerical experiments

n = 30 (hence the so-called n > p case); 4 true models of the form

Y = 1 + 2 Σ_{i∈{true}} x_i + {normal error term N(0, 1)}

◮ full model (qT = 16)
◮ x1, . . . , x10, x11, x14 (qT = 12)
◮ x1, x2, x5, x6, x9, x10, x11, x14 (qT = 8)
◮ x1, x2, x5, x6 (qT = 4)

SLIDE 33

Numerical experiments

competitors of our BF:

◮ AIC = −2 × max. log-likelihood + 2(q + 2)
◮ AICc = −2 × max. log-likelihood + 2(q + 2) · n/(n − q − 3)
◮ BIC = −2 × max. log-likelihood + q log n
◮ ZE: BF[Mγ; MN] with a = −3/4, ν1 = · · · = νq = 1 (to isolate the effect of the descending ν)
◮ EB: the empirical Bayes criterion of George and Foster (2000), max_g mγ(y | g, σ̂²) with σ̂² = RSS/(n − q − 1) (to isolate the effect of full Bayes)
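For reference, the three information criteria follow from the residual sum of squares under a Gaussian likelihood, where q + 2 counts the regression coefficients plus the intercept and the variance (a sketch; names are ours):

```python
import numpy as np

def aic_aicc_bic(rss, n, q):
    """AIC, AICc, BIC from the residual sum of squares of a submodel."""
    ll = -n / 2 * (np.log(2 * np.pi * rss / n) + 1)   # max Gaussian log-likelihood
    aic = -2 * ll + 2 * (q + 2)
    aicc = -2 * ll + 2 * (q + 2) * n / (n - q - 3)
    bic = -2 * ll + q * np.log(n)
    return aic, aicc, bic
```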

SLIDE 34

Table: Frequency with which the true model ranked top (N = 500; bigger is better)

          qT = 16          qT = 12
          cor    simple    cor    simple
BF        0.71   0.98      0.73   0.86
ZE        0.40   0.94      0.63   0.87
EB        0.41   0.95      0.63   0.87
AIC       0.95   1.00      0.23   0.22
AICc      0.25   0.82      0.67   0.85
BIC       0.88   0.99      0.41   0.41

          qT = 8           qT = 4
          cor    simple    cor    simple
BF        0.69   0.77      0.66   0.68
ZE        0.68   0.78      0.67   0.69
EB        0.67   0.76      0.66   0.65
AIC       0.09   0.08      0.05   0.05
AICc      0.52   0.55      0.25   0.24
BIC       0.31   0.27      0.23   0.22

SLIDE 35

Numerical experiments (findings)

◮ [correlated and simple] AIC and BIC perform very badly in all settings except qT = 16.
◮ [correlated and simple] AICc is bad for qT = 16 and 4, while it is good for qT = 8, 12.
◮ [simple] BF, ZE and EB are very similar: there is no effect of the extension of Zellner’s g-prior with descending ν.
◮ [correlated] EB, ZE and BF are very similar for qT = 4, 8, but BF is much better for qT = 12, 16.

In summary, our BF is the best in most cases and extremely stable. The extension of Zellner’s g-prior with descending ν is quite effective.

SLIDE 36

Numerical experiments

(in-sample) predictive error of the selected model:

$$ \frac{(\hat y^* - \alpha_T 1_n - X_T\beta_T)'(\hat y^* - \alpha_T 1_n - X_T\beta_T)}{n\sigma^2} $$

◮ XT, αT, βT are the true values
◮ ŷ∗ = ȳ1n + Xγ∗ ˆβγ∗, where Xγ∗ is the selected design matrix
◮ ˆβγ∗: the selected Bayes estimator for BF, ZE, EB
◮ ˆβγ∗: the selected LSE for AIC, BIC, AICc

SLIDE 37

Table: The in-sample predictive error (mean) (N = 500; smaller is better)

            qT = 16                 qT = 12
            cor       simple        cor       simple
oracle      17/30 (≃0.57)  17/30    13/30 (≃0.43)  13/30
BF          0.70      0.57          0.52      0.45
ZE          1.02      0.66          0.59      0.45
EB          1.00      0.65          0.58      0.45
AIC         0.56      0.56          0.54      0.54
AICc        1.29      0.98          0.56      0.46
BIC         0.58      0.56          0.53      0.52

            qT = 8                  qT = 4
            cor       simple        cor       simple
oracle      9/30 (=0.30)    0.30    5/30 (≃0.17)   0.17
BF          0.37      0.35          0.26      0.25
ZE          0.41      0.34          0.27      0.24
EB          0.41      0.35          0.27      0.25
AIC         0.51      0.51          0.48      0.48
AICc        0.42      0.39          0.36      0.35
BIC         0.46      0.45          0.39      0.38

SLIDE 38

Numerical experiments

◮ 14 true regressors: x1, x2, . . . , x10, x11, x12, x14, x15
◮ n = 12 ⇒ the n < qT < p case
◮ a non-identifiable model is true
◮ no competitor among ZE, EB, AIC, BIC, AICc is applicable here
◮ The true model never ranked top

frequency of the number of regressors in the selected model (an identifiable model is always selected, so the 12-16 bin is empty):

            0-7    8-9    10-11   12-16
correlated  0.21   0.56   0.23    0.00
simple      0.26   0.54   0.20    0.00

SLIDE 39

Numerical experiments

the frequency of each regressor in the selected model among N = 500:

            x1 (T)  x2 (T)  x3 (T)  x4 (T)  x5 (T)  x6 (T)  x7 (T)  x8 (T)
correlated  0.67    0.61    0.43    0.47    0.63    0.59    0.56    0.56
simple      0.54    0.54    0.54    0.54    0.54    0.57    0.55    0.55

            x9 (T)  x10 (T) x11 (T) x12 (T) x13 (F) x14 (T) x15 (T) x16 (F)
correlated  0.59    0.58    0.58    0.60    0.40    0.41    0.47    0.40
simple      0.54    0.56    0.52    0.50    0.34    0.54    0.58    0.39

◮ on average, the true variables are selected more often

SLIDE 40

Where is the true model?

◮ consider the average rank of each sub-model
◮ the true model is top with respect to the average rank, both in the correlated case and in the simple structure case
◮ (the average rank of the true model)/2^16 is about 0.03
◮ Although our criterion has the ability to find the true model on average, a smaller identifiable model is selected as the best

SLIDE 41

Where is the true model?

◮ The frequency of the true model among the (16 × 15)/2 = 120 candidates with 14 regressors:

            1st    1st-2nd   1st-3rd
correlated  0.14   0.22      0.26
simple      0.13   0.20      0.26

◮ Not bad! If the true number of regressors is given, the analytical criterion sv[Xγ] × ‖ˆβ^MP_LSE[γ]‖ works
◮ To our knowledge, there has been no analytical criterion available when the numbers of regressors are the same and R² = 1.

SLIDE 42

Numerical experiment (findings)

◮ We assumed equal model space prior probability, Pr(Mγ) = 2^{−p}
◮ Under the equal model space prior probability, a submodel which is identifiable is selected.
◮ When a larger (non-identifiable, non-sparse) model is expected, an unequal model space prior probability may lead to the choice of such a reasonable non-sparse sub-model, e.g. (see the sketch below)
  ◮ Pr(Mγ) = w^{qγ}(1 − w)^{p−qγ}
  ◮ Pr(Mγ) ∝ B(α + qγ, β + p − qγ)
◮ We have just started considering this issue...
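A sketch of the two unequal model-space priors mentioned above, on the log scale (names and default values are ours):

```python
import numpy as np
from scipy import special

def log_model_prior(q, p, kind="beta-binomial", w=0.5, alpha=1.0, beta=1.0):
    """Independent-Bernoulli prior Pr(M) = w^q (1 - w)^(p - q), or
    beta-binomial prior Pr(M) proportional to B(alpha + q, beta + p - q)."""
    if kind == "bernoulli":
        return q * np.log(w) + (p - q) * np.log(1 - w)
    return special.betaln(alpha + q, beta + p - q)   # unnormalized
```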

SLIDE 43

Outline (current section: Summary and Future work)

SLIDE 44

Summary and Future work

Summary

◮ a BF with a beautiful closed form
◮ consistency for large n and fixed p
◮ very good numerical performance when n > p
◮ a reasonable estimator of Xβ after selection

Future work

◮ find a reasonable unequal model space prior probability
◮ comparison with some famous methods, including the elastic net

FYI: an older version of our paper is on arXiv.