

SLIDE 1

Mixture of g Priors for Bayesian Variable Selection

Paper by Feng Liang, Rui Paulo, et al.
Presented by Sheng Zhang

Department of Statistics, University of Wisconsin-Madison

April 30, 2010

SLIDE 2

Outline

1. Introduction
2. Zellner's g priors
3. Mixture of g priors
4. Consistency
5. Discussion


SLIDE 4

Basic Setup

- Consider Y ∼ N(µ, In/φ), where Y = (y1, y2, . . . , yn)^T, µ = (µ1, µ2, . . . , µn)^T, In is the n × n identity matrix, and φ is the precision parameter
- Potential centered predictors X1, . . . , Xp
- Only consider the case n ≥ p + 2
- Index the model space by the p × 1 vector γ, where γj = 0 if Xj is excluded and γj = 1 if Xj is included
- Under model Mγ: µ = 1nα + Xγβγ

SLIDE 5

Key Idea of Bayesian Variable Selection

- Put priors on the unknowns θγ = (α, βγ, φ) ∈ Θγ
- Update the prior probabilities of models p(Mγ) to

    p(Mγ|Y) = p(Mγ) p(Y|Mγ) / Σγ′ p(Mγ′) p(Y|Mγ′),

  where p(Y|Mγ) = ∫Θγ p(Y|θγ, Mγ) p(θγ|Mγ) dθγ, and p(Mγ) could be the uniform prior 1/2^p
- Choose the model with the greatest p(Mγ|Y)
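As a concrete illustration of this update (not from the slides), here is a minimal Python sketch that enumerates all 2^p models under the uniform prior p(Mγ) = 1/2^p; `log_marginal` is a hypothetical callable standing in for log p(Y|Mγ):

```python
from itertools import product
import numpy as np

def posterior_model_probs(log_marginal, p):
    """Enumerate gamma in {0,1}^p and return p(M_gamma | Y) under the
    uniform model prior p(M_gamma) = 1/2^p (constant, so it cancels)."""
    gammas = list(product([0, 1], repeat=p))
    log_post = np.array([log_marginal(g) for g in gammas])
    log_post -= log_post.max()        # stabilize before exponentiating
    post = np.exp(log_post)
    return gammas, post / post.sum()

# gammas, probs = posterior_model_probs(my_log_marginal, p=3)
# chosen = gammas[int(np.argmax(probs))]   # model with greatest p(M_gamma | Y)
```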

SLIDE 6

The Goal of the Paper

- Y|α, βγ, φ, Mγ ∼ N(1nα + Xγβγ, In/φ)
- p(α, φ|Mγ) ∝ 1/φ
- βγ|φ, Mγ ∼ N(0, (g/φ)(Xγ^T Xγ)^{−1})   (Zellner's g prior)
- Much previous work involves the choice or calibration of g; g acts as a dimensionality penalty
- The goal of the paper is to propose a new family of priors on g, the hyper-g prior family, that guarantees:
  - robustness to mis-specification of g
  - closed-form marginal likelihoods
  - computational efficiency
  - desirable consistency properties in model selection


SLIDE 8

Null-Based Bayes Factors (1)

- The Bayes factor comparing each model Mγ to a base model Mb is

    BF[Mγ : Mb] = p(Y|Mγ) / p(Y|Mb)

- To compare two models Mγ and Mγ′:

    BF[Mγ : Mγ′] = BF[Mγ : Mb] / BF[Mγ′ : Mb]

- The posterior probability can be written as

    p(Mγ|Y) = p(Mγ) BF[Mγ : Mb] / Σγ′ p(Mγ′) BF[Mγ′ : Mb]
SLIDE 9

Null-Based Bayes Factors (2)

- Take Mb = MN, the null model: H0: βγ = 0 vs. H1: βγ ≠ 0
- Recall p(α, φ|Mγ) ∝ 1/φ and βγ|φ, Mγ ∼ N(0, (g/φ)(Xγ^T Xγ)^{−1})
- Closed form of the marginal likelihood:

    p(Y|Mγ, g) = [Γ((n−1)/2) / (π^{(n−1)/2} √n ‖Y − Ȳ‖^{n−1})] × (1+g)^{(n−1−pγ)/2} [1 + g(1−R²γ)]^{−(n−1)/2}

- The null model p(Y|MN) corresponds to R²γ = 0 and pγ = 0, so

    BF[Mγ : MN] = (1+g)^{(n−1−pγ)/2} [1 + g(1−R²γ)]^{−(n−1)/2}
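This Bayes factor is easy to compute; the following helper (a sketch, not code from the paper) works on the log scale, which avoids overflow for large n:

```python
import numpy as np

def log_bf_null(g, n, p_gamma, R2):
    """log BF[M_gamma : M_N] for a fixed g."""
    return (0.5 * (n - 1 - p_gamma) * np.log1p(g)
            - 0.5 * (n - 1) * np.log1p(g * (1.0 - R2)))

# e.g. the unit information prior g = n:
# np.exp(log_bf_null(g=100, n=100, p_gamma=3, R2=0.6))
```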

SLIDE 10

Paradoxes of fixed g Priors – Bartlett’s Paradox

When g → ∞ while n and pγ are fixed:

    BF[Mγ : MN] = (1+g)^{(n−1−pγ)/2} [1 + g(1−R²γ)]^{−(n−1)/2} → 0

This means that, regardless of the information in the data, the Bayes factor always favors the null model. This is due to the large spread of the prior induced by the noninformative choice of g.
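A quick numerical illustration of the paradox (values chosen only for illustration): the fit is strong, yet the Bayes factor collapses once g is large enough.

```python
import numpy as np

def log_bf(g, n, p, R2):
    return 0.5 * (n - 1 - p) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1 - R2))

# Strong signal (R2 = 0.9), fixed n = 20, p_gamma = 3:
for g in [1e2, 1e4, 1e6, 1e8, 1e10]:
    print(f"g = {g:.0e}   BF = {np.exp(log_bf(g, 20, 3, 0.9)):.3g}")
# log BF ~ -(p_gamma/2) log g + const, so BF -> 0 as g -> infinity
```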

SLIDE 11

Paradoxes of fixed g Priors – Information Paradox

Suppose ‖β̂γ‖² → ∞ so that R²γ → 1, while n and pγ are fixed. We expect BF[Mγ : MN] → ∞. However, as R²γ → 1,

    BF[Mγ : MN] = (1+g)^{(n−1−pγ)/2} [1 + g(1−R²γ)]^{−(n−1)/2} → (1+g)^{(n−pγ−1)/2},

which is a constant!
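The same computation illustrates this numerically (again with illustrative values): as R²γ → 1 the Bayes factor flattens out at (1+g)^{(n−pγ−1)/2} instead of diverging.

```python
import numpy as np

def log_bf(g, n, p, R2):
    return 0.5 * (n - 1 - p) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1 - R2))

n, p, g = 20, 3, 100.0
for R2 in [0.9, 0.99, 0.9999, 0.999999]:
    print(f"R2 = {R2}   BF = {np.exp(log_bf(g, n, p, R2)):.4g}")
print("plateau (1+g)^((n-p-1)/2) =", (1.0 + g) ** ((n - p - 1) / 2))
```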

SLIDE 12

Choices of g

- Unit information prior: g = n (BF behaves like BIC)
- Risk inflation criterion: g = p² (minimax perspective)
- Benchmark prior: g = max(n, p²) (BRIC)
- Local empirical Bayes: the MLE of p(Y|Mγ, g) under the nonnegativity constraint,

    ĝγ^{EBL} = max(Fγ − 1, 0), where Fγ = (R²γ/pγ) / ((1−R²γ)/(n−1−pγ))

- Global empirical Bayes:

    ĝ^{EBG} = argmax_{g>0} Σγ p(Mγ) (1+g)^{(n−1−pγ)/2} / [1 + g(1−R²γ)]^{(n−1)/2}
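The local EB estimate is a one-liner (a sketch; the function name is mine):

```python
def g_hat_local_eb(R2, n, p_gamma):
    """Local empirical Bayes estimate of g: the maximizer of p(Y | M_gamma, g)
    over g >= 0, i.e. the F statistic minus one, truncated at zero."""
    F = (R2 / p_gamma) / ((1.0 - R2) / (n - 1 - p_gamma))
    return max(F - 1.0, 0.0)

# g_hat_local_eb(0.6, n=100, p_gamma=3)  ->  47.0
```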

SLIDE 13

Choices of g and Information Paradox

For fixed n and p:

- The unit information prior, risk inflation criterion, and benchmark prior do not resolve the information paradox
- The two EB approaches do have the desirable behavior

Theorem 1: In the setting of the information paradox with fixed n, p < n, and R²γ → 1, for both the global and local EB estimates of g,

    BF[Mγ : MN] = (1+g)^{(n−1−pγ)/2} [1 + g(1−R²γ)]^{−(n−1)/2} → ∞

Proof: by direct checking.


SLIDE 15

Desirable π(g)

- With g ∼ π(g), the Bayes factor becomes

    BF[Mγ : MN] = ∫₀^∞ (1+g)^{(n−1−pγ)/2} [1 + g(1−R²γ)]^{−(n−1)/2} π(g) dg

- The posterior mean of µ under Mγ ≠ MN:

    E[µ|Mγ, Y] = 1nα̂ + E[g/(1+g) | Mγ, Y] Xγβ̂γ,

  where α̂ and β̂γ are the least squares estimates of α and βγ, and E[g/(1+g) | Mγ, Y] is regarded as a shrinkage factor
- The optimal Bayes estimate of µ under squared error loss:

    E[µ|Y] = 1nα̂ + Σ_{γ: Mγ ≠ MN} p(Mγ|Y) E[g/(1+g) | Mγ, Y] Xγβ̂γ

- g appears everywhere: in Bayes factors, posterior means, and predictions
- We want priors that lead to tractable computation of these quantities, along with consistent model selection and risk properties

SLIDE 16

Zellner-Siow Cauchy Priors

- Jeffreys (1961) rejected normal priors essentially for reasons related to Bayes factor paradoxes
- The Cauchy prior is the simplest prior that satisfies the basic consistency requirement for hypothesis testing
- The Zellner-Siow priors can be represented as a mixture of g priors with an Inv-Gamma(1/2, n/2) prior on g:

    π(g) = [(n/2)^{1/2} / Γ(1/2)] g^{−3/2} e^{−n/(2g)}

- The corresponding integrals are approximated by the Laplace approximation
- As the model dimensionality increases, the accuracy of the approximation decreases
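The mixture representation makes simulation from the Zellner-Siow prior straightforward; a minimal sketch, assuming scipy's inverse-gamma parameterization (shape a and scale b give density b^a/Γ(a) · g^{−a−1} e^{−b/g}, so a = 1/2, b = n/2 matches π(g) above):

```python
from scipy import stats

n = 100
zs_prior = stats.invgamma(a=0.5, scale=n / 2)       # g ~ Inv-Gamma(1/2, n/2)
g_draws = zs_prior.rvs(size=10_000, random_state=0)
# Mixing beta_gamma | g, phi ~ N(0, (g/phi)(X'X)^{-1}) over these draws
# yields the Zellner-Siow multivariate Cauchy prior.
```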

SLIDE 17

Hyper-g Priors (1)

    π(g) = [(a−2)/2] (1+g)^{−a/2},   g > 0

- Only the case a > 2, for which π(g) is a proper prior, is considered
- This prior leads to the shrinkage factor g/(1+g) ∼ Beta(1, a/2 − 1)
- Values of a ≥ 4 tend to put more mass on shrinkage values near 0, which is undesirable; hence only 2 < a ≤ 4 is considered
- When a = 4, g/(1+g) has a uniform distribution
- When a = 3, most of the mass is near 1
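Because the induced shrinkage factor is Beta, drawing from the hyper-g prior reduces to one Beta draw plus a transformation; a small sketch (seed and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
for a in (3.0, 4.0):
    u = rng.beta(1.0, a / 2 - 1.0, size=10_000)   # u = g/(1+g) ~ Beta(1, a/2 - 1)
    g = u / (1.0 - u)
    print(f"a = {a}: median shrinkage = {np.median(u):.2f}, median g = {np.median(g):.2f}")
```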

SLIDE 18

Hyper-g Priors (2)

- Main advantage of the hyper-g prior: it leads to a closed-form posterior distribution of g in terms of the Gaussian hypergeometric function
- The posterior distribution of g:

    p(g|Y, Mγ) = [(pγ + a − 2) / (2 · 2F1((n−1)/2, 1; (pγ+a)/2; R²γ))] × (1+g)^{(n−1−pγ−a)/2} [1 + (1−R²γ)g]^{−(n−1)/2}

- 2F1(a, b; c; z) is convergent for real |z| < 1 with c > b > 0, and for z = ±1 only if c > a + b and b > 0
- When evaluating the Gaussian hypergeometric function, numerical overflow is problematic for moderate to large n and large R²γ
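The density above transcribes directly into Python via `scipy.special.hyp2f1`; as the slide notes, this can overflow for large n with R²γ near 1, and an arbitrary-precision routine such as `mpmath.hyp2f1` is one possible workaround (my suggestion, not the paper's):

```python
from scipy.special import hyp2f1

def posterior_density_g(g, n, p_gamma, a, R2):
    """p(g | Y, M_gamma) under the hyper-g prior (formula above)."""
    norm = (p_gamma + a - 2) / (2.0 * hyp2f1((n - 1) / 2, 1.0, (p_gamma + a) / 2, R2))
    return (norm * (1.0 + g) ** ((n - 1 - p_gamma - a) / 2)
                 * (1.0 + (1.0 - R2) * g) ** (-(n - 1) / 2))
```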

SLIDE 19

Hyper-g Priors (3)

The Gaussian hypergeometric function appears in many quantities of interest:

    BF[Mγ : MN] = [(a−2)/(pγ+a−2)] 2F1((n−1)/2, 1; (pγ+a)/2; R²γ)

    E[g | Mγ, Y] = [2/(pγ+a−4)] 2F1((n−1)/2, 2; (pγ+a)/2; R²γ) / 2F1((n−1)/2, 1; (pγ+a)/2; R²γ)

    E[g/(1+g) | Mγ, Y] = [2/(pγ+a)] 2F1((n−1)/2, 2; (pγ+a)/2 + 1; R²γ) / 2F1((n−1)/2, 1; (pγ+a)/2; R²γ)
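All three quantities can be computed with `scipy.special.hyp2f1`; a sketch (note that E[g | Mγ, Y] is finite only when pγ + a > 4):

```python
from scipy.special import hyp2f1

def hyper_g_summaries(n, p_gamma, a, R2):
    """Bayes factor vs. M_N, E[g | M_gamma, Y], and the posterior shrinkage
    E[g/(1+g) | M_gamma, Y] under the hyper-g prior."""
    f1 = hyp2f1((n - 1) / 2, 1.0, (p_gamma + a) / 2, R2)
    bf = (a - 2.0) / (p_gamma + a - 2.0) * f1
    e_g = 2.0 / (p_gamma + a - 4.0) * hyp2f1((n - 1) / 2, 2.0, (p_gamma + a) / 2, R2) / f1
    shrink = 2.0 / (p_gamma + a) * hyp2f1((n - 1) / 2, 2.0, (p_gamma + a) / 2 + 1.0, R2) / f1
    return bf, e_g, shrink

# hyper_g_summaries(n=50, p_gamma=3, a=3.0, R2=0.5)
```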


SLIDE 21

Overview

The following three aspects of consistency are considered:

1) the "information paradox", where R²γ → 1
2) the asymptotic consistency of model posterior probabilities as n → ∞
3) the asymptotic consistency for prediction

All three are studied under the assumption that a true model Mγ exists.

SLIDE 22

Consistency–Information Paradox (1)

Theorem 2: To resolve the information paradox for all n and p < n, it suffices to have

    ∫₀^∞ (1+g)^{(n−1−pγ)/2} π(g) dg = ∞   for all pγ ≤ p

In the case of minimal sample size (n = p + 2), it suffices to have ∫₀^∞ (1+g)^{1/2} π(g) dg = ∞.

Proof: The Bayes factor BF[Mγ : MN] is a monotonically increasing function of R²γ. By the monotone convergence theorem, it converges to ∫₀^∞ (1+g)^{(n−1−pγ)/2} π(g) dg as R²γ → 1. Hence non-integrability of (1+g)^{(n−1−pγ)/2} π(g) is a necessary and sufficient condition for resolving the information paradox.

SLIDE 23

Consistency–Information Paradox (2)

- The Zellner-Siow prior satisfies the condition
- When a ≤ n − pγ + 1, the hyper-g prior satisfies the condition
- A fixed g prior corresponds to a degenerate prior placing a point mass at the selected value of g, so no fixed choice of g resolves the paradox

SLIDE 24

Consistency–Model Selection Consistency (1)

- Want: plim_n p(Mγ|Y) = 1 when Mγ is the true model, where the probability measure is the sampling distribution under the true model
- Equivalently: plim_n BF[Mγ′ : Mγ] = 0 for all Mγ′ ≠ Mγ
- Assumption (a): for any Mγ′ that does not contain Mγ,

    lim_{n→∞} (1/n) βγ^T Xγ^T (I − Pγ′) Xγ βγ = bγ′ ∈ (0, ∞),

  where Pγ′ is the projection matrix onto the span of Xγ′
- Fernández et al. (2001) have shown consistency for BRIC and BIC under this assumption

SLIDE 25

Consistency–Model Selection Consistency (2)

- Theorem 3: Assume assumption (a) holds. When the true model is not the null model (Mγ ≠ MN), the posterior probabilities under empirical Bayes, Zellner-Siow priors, and hyper-g priors are consistent for model selection; when Mγ = MN, consistency still holds for the Zellner-Siow prior, but not for the hyper-g prior or the local and global empirical Bayes procedures
- The Z-S prior on g depends on n, while the EB and hyper-g priors don't
- For EB and hyper-g priors, under MN the null model is still the model with the highest posterior probability, although that probability is bounded away from 1; EB and hyper-g priors could thus be considered consistent in a weaker sense (under a 0-1 loss)
- The hyper-g/n prior is proposed to resolve the inconsistency under MN:

    π(g) = [(a−2)/(2n)] (1 + g/n)^{−a/2}

SLIDE 26

Consistency–Model Selection Consistency (3: proof)

The following preliminary results from Fernández et al. (2001) are cited without proof. Under the assumed true model Mγ:

1) If Mγ is nested within or equal to a model Mγ′, then

    plim_{n→∞} RSSγ′/n = 1/φ   (R1)

2) For any model Mγ′ that does not contain Mγ, under assumption (a),

    plim_{n→∞} RSSγ′/n = 1/φ + bγ′   (R2)

where RSSγ = (1 − R²γ) ‖Y − Ȳ‖² is the residual sum of squares.

SLIDE 27

Consistency–Model Selection Consistency (4: proof)

First consider the consistency result for the local EB estimate when Mγ ≠ MN. Noting that R²γ′ → c ∈ (0, 1) when Mγ ∩ Mγ′ ≠ ∅, we have:

    ĝγ′^{EBL} = [(R²γ′/pγ′) / ((1 − R²γ′)/(n − 1 − pγ′))] (1 + op(1))

    BF^{EBL}[Mγ′ : MN] ∼_P [1/(1 − R²γ′)]^{(n−1−pγ′)/2} (n − 1 − pγ′)^{(n−1−pγ′)/2} / (n − 1)^{(n−1)/2}

    BF^{EBL}[Mγ′ : Mγ] ∼_p [1/n^{(pγ′−pγ)/2}] [(RSSγ/n) / (RSSγ′/n)]^{n/2}

SLIDE 28

Consistency–Model Selection Consistency (5: proof)

a) Mγ ∩ Mγ′ ≠ ∅ and Mγ ⊄ Mγ′. Applying (R1) and (a),

    plim_{n→∞} (RSSγ/n) / (RSSγ′/n) = (1/φ) / (1/φ + bγ′) < 1,

so [(RSSγ/n)/(RSSγ′/n)]^{n/2} →p 0, hence BF^{EBL}[Mγ′ : Mγ] →p 0.

b) Mγ ⊂ Mγ′. Since (RSSγ/RSSγ′)^{n/2} →d exp(χ²_{pγ′−pγ}/2) (Fernández et al. 2001), together with the fact that 1/n^{(pγ′−pγ)/2} → 0, we have BF^{EBL}[Mγ′ : Mγ] →p 0.

SLIDE 29

Consistency–Model Selection Consistency (6: proof)

c) Mγ ∩ Mγ′ = ∅. In this case nR²γ′ →d χ²_{pγ′}/(1 + φbγ′). Since

    BF^{EBL}[Mγ′ : MN] = (1 + g)^{(n−1−pγ′)/2} / [1 + (1 − R²γ′)g]^{(n−1)/2} ≤ (1 − R²γ′)^{−(n−1)/2},

we have BF^{EBL}[Mγ′ : MN] = Op(1). On the other hand, since

    BF^{EBL}[Mγ : MN] ∼_P (n − 1)^{−pγ/2} (1 − R²γ)^{−n/2},

where the second factor goes to ∞ exponentially fast, BF^{EBL}[Mγ′ : Mγ] →p 0.

SLIDE 30

Consistency–Model Selection Consistency (7: proof)

- Similarly, we can obtain consistency for the global EB, Zellner-Siow, hyper-g, and hyper-g/n priors when Mγ ≠ MN
- When Mγ = MN, only the Z-S prior is still consistent. The proof is similar to the case Mγ ≠ MN; the only difference is that R²γ′ → 0 if Mγ′ ≠ MN

SLIDE 31

Consistency–Prediction Consistency (1)

- The optimal point estimator under squared error loss is

    Ŷ*_n = α̂ + Σγ (x*γ)^T β̂γ p(Mγ|Y) ∫₀^∞ [g/(1+g)] π(g|Mγ, Y) dg

- Ŷ*_n is consistent under prediction if

    plim_n Ŷ*_n = E[Y*] = α + (x*γ)^T βγ
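A sketch of how this estimator is assembled, assuming the per-model ingredients (posterior probabilities, shrinkage factors E[g/(1+g) | Mγ, Y], and least squares fits) have already been computed; all names here are hypothetical:

```python
import numpy as np

def predict_bma(alpha_hat, models, x_star):
    """models: iterable of (post_prob, shrinkage, beta_hat, idx) per non-null
    M_gamma, where idx selects the covariates included in that model; the
    null model contributes only alpha_hat."""
    y_star = alpha_hat
    for post_prob, shrinkage, beta_hat, idx in models:
        y_star += post_prob * shrinkage * (x_star[idx] @ beta_hat)
    return y_star
```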

SLIDE 32

Consistency–Prediction Consistency (2)

- Theorem 4: Ŷ*_n is consistent under prediction for the empirical Bayes, hyper-g, hyper-g/n, and Zellner-Siow priors
- When Mγ = MN, β̂γ → 0 by the consistency of the LSE; hence the prediction consistency of Ŷ*_n follows
- When Mγ ≠ MN, p(Mγ|Y) → 1 by Theorem 3. Using the consistency of the LSE, it suffices to show

    plim_n ∫₀^∞ [g/(1+g)] π(g|Mγ, Y) dg = 1

  The result follows by applying the Laplace approximation


SLIDE 34

Discussion

Advantages of mixture of g priors:

- They resolve some of the paradoxes
- They perform as well as other default choices

Limitations:

- Numerical problems for large n and large R²γ
- Zellner-Siow priors require pγ < n − 2, and the hyper-g prior requires pγ < n − 3 − a

Future work:

- Consider using other priors on p(Mγ)
- Look into the case where Xγ is not of full rank
- The large p, small n problem

SLIDE 35

THANK YOU