Bayesian Variable Selection via Spike-and-Slab Priors: Annotated Bibliography
Marina Vannucci
Department of Statistics, Rice University, Houston, TX 77030, USA
June 10, 2013
This is a collection of references and readings related to the topics addressed in my short course. Only the main references are given, with some annotations.
- Linear Regression Models: Mixture priors for Bayesian variable selection in univariate linear regression models were originally proposed by Leamer (1978) and Mitchell & Beauchamp (1988) and made popular by George & McCulloch (1993, 1997), Geweke (1996), Clyde et al. (1996), Smith & Kohn (1996), Carlin & Chib (1995) and Raftery et al. (1997). Brown et al. (1998a, 2002) extended the construction to multivariate linear regression models. Reviews of special features of the selection priors and of computational aspects can be found in Chipman et al. (2001) and Clyde & George (2004). See also O'Hara & Sillanpää (2009) for a more recent review paper.
- Common choices for the priors on the regression coefficients of the regression model assume that the βj's are a priori independent given the selection parameter γ, for example, by choosing hj = c for every j in the prior model (slide 3, part 1). Brown et al. (1998a) investigate the case of hj chosen to be proportional to the j-th diagonal element of (X′X)⁻¹, while Smith & Kohn (1996) propose the use of Zellner's g-prior, see Zellner (1986), of the type βγ | σ² ∼ N(0, c(X′γXγ)⁻¹σ²). This prior has an intuitive interpretation, as it uses the design matrix of the current experiment. Liang et al. (2008) and Cui & George (2008) have investigated formulations that use a fully Bayesian approach by imposing mixtures of g-priors on c. They also propose hyper-g priors for c, which lead to closed-form marginal likelihoods and nonlinear shrinkage via Empirical Bayes procedures.
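One reason the g-prior is so widely used is that, with a flat prior on the intercept and p(σ²) ∝ 1/σ², the marginal likelihood of each model is available in closed form. A minimal sketch in Python of the resulting log Bayes factor against the null (intercept-only) model, with g held fixed and the formula taken in the form reported by Liang et al. (2008); the function name and interface are mine:

```python
import numpy as np

def log_bayes_factor_gprior(y, X, gamma, g):
    """Log Bayes factor of the model indexed by `gamma` (a list of column
    indices of X) against the intercept-only null model, under Zellner's
    g-prior beta_gamma | sigma^2 ~ N(0, g (X'X)^-1 sigma^2):
        BF = (1+g)^{(n-1-p_g)/2} * [1 + g(1 - R^2_gamma)]^{-(n-1)/2}."""
    n = len(y)
    yc = y - y.mean()                       # center out the intercept
    Xg = X[:, gamma]
    p_g = Xg.shape[1]
    if p_g == 0:
        return 0.0                          # null model vs itself
    Xc = Xg - Xg.mean(axis=0)
    # R^2 of the least-squares fit of the centered response on X_gamma
    beta_hat, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    rss = np.sum((yc - Xc @ beta_hat) ** 2)
    r2 = 1.0 - rss / np.sum(yc ** 2)
    return 0.5 * (n - 1 - p_g) * np.log1p(g) \
         - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))
```

Only the ordinary R² of each submodel is needed, which is what makes full enumeration or stochastic search over models cheap under this prior.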
- Independent Bernoulli priors on the γj's with a Beta hyperprior, w ∼ Beta(a, b), with a, b to be chosen, are used for example by Brown et al. (1998b). An attractive feature of these priors is that appropriate choices of w that depend on p impose an a priori multiplicity penalty, as argued in Scott & Berger (2010). Applications of Bayesian variable selection models to the analysis of genomic data have looked into priors on γ that exploit the complex dependence structure between genes (variables), as captured via underlying biological processes and/or networks. Some of these contributions include Li & Zhang (2010) and Stingo et al. (2010, 2011).
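The multiplicity penalty is easy to see once w is integrated out: each model then receives a beta-binomial prior probability that depends only on its size. A small sketch (the function name is mine):

```python
from math import lgamma

def log_model_prior(k, p, a=1.0, b=1.0):
    """Log prior probability of one specific model that includes k of p
    candidate variables, after integrating w ~ Beta(a, b) out of the
    independent Bernoulli(w) priors on the gamma_j's:
        p(gamma) = B(a + k, b + p - k) / B(a, b).
    With a = b = 1 this equals 1 / ((p + 1) * C(p, k))."""
    betaln = lambda x, y: lgamma(x) + lgamma(y) - lgamma(x + y)
    return betaln(a + k, b + p - k) - betaln(a, b)
```

With a = b = 1, any particular single-variable model has prior probability 1/((p + 1)p), which shrinks as more candidate predictors are added to the pool — the automatic multiplicity adjustment argued for by Scott & Berger (2010).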
- When a large number of predictors makes the full exploration of the model space infeasible, Markov chain Monte Carlo (MCMC) methods can be used as stochastic searches to quickly and efficiently explore the posterior distribution looking for "good" models, i.e., models with high posterior probability, see George & McCulloch (1997). The most popular is the Metropolis scheme (MC3), proposed by Madigan & York (1995) in the context of model selection for discrete graphical models and subsequently adapted to variable selection, see Raftery et al. (1997) and Brown et al. (1998b, 2002), among others. Improved MCMC schemes have been proposed to achieve an even faster exploration of the posterior space, see for example the shotgun stochastic search algorithm of Hans et al. (2007) and the evolutionary Monte Carlo schemes combined with parallel tempering proposed by Bottolo & Richardson (2010) and Bottolo et al. (2011) (software available at http://www.bgx.org.uk/software.html).
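The flavor of such a stochastic search can be conveyed in a few lines. The sketch below is a bare-bones Metropolis search over inclusion vectors that flips one indicator per iteration; it accepts any unnormalized log posterior over models, and the toy score used in the usage note is made up for illustration, not taken from the papers cited above:

```python
import numpy as np

def mc3_search(log_post, p, n_iter=5000, seed=0):
    """Bare-bones MC3-style stochastic search over inclusion vectors.

    `log_post` maps a boolean vector gamma of length p to an unnormalized
    log posterior. One indicator is flipped per iteration and the move is
    accepted with the usual Metropolis probability (the proposal is
    symmetric). Returns the highest-scoring model visited and crude
    Monte Carlo estimates of the inclusion frequencies."""
    rng = np.random.default_rng(seed)
    gamma = np.zeros(p, dtype=bool)
    lp = log_post(gamma)
    best, best_lp = gamma.copy(), lp
    visits = np.zeros(p)
    for _ in range(n_iter):
        j = rng.integers(p)                      # coordinate to flip
        prop = gamma.copy()
        prop[j] = not prop[j]
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:  # Metropolis accept step
            gamma, lp = prop, lp_prop
            if lp > best_lp:
                best, best_lp = gamma.copy(), lp
        visits += gamma
    return best, visits / n_iter
```

For instance, with the toy score `lambda g: 3.0 * g[0] + 3.0 * g[1] - g[2:].sum()` over p = 10 variables, the search quickly settles on the model that includes exactly the first two coordinates.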
- Variable selection can be achieved by thresholding marginal posterior probabilities of inclusion. Barbieri & Berger (2004) define the median-probability model, which is the model that includes those covariates having posterior inclusion probability at least 1/2, and show that, under many circumstances, this model has greater predictive power than the most probable model. Another method chooses a cut-off threshold based on the expected false discovery rate, see Newton et al. (2004).
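Both rules are short once the marginal posterior inclusion probabilities have been estimated. A sketch, with the FDR-based rule implemented in the spirit of Newton et al. (2004) (exact details vary across papers, and both function names are mine):

```python
import numpy as np

def median_probability_model(incl_probs):
    """Barbieri & Berger (2004): keep every covariate whose marginal
    posterior inclusion probability is at least 1/2."""
    return np.flatnonzero(np.asarray(incl_probs) >= 0.5)

def fdr_selection(incl_probs, alpha=0.05):
    """Select the largest set of covariates, taken in decreasing order of
    inclusion probability, whose estimated Bayesian FDR -- the average of
    (1 - p_j) over the selected set -- stays below alpha."""
    probs = np.asarray(incl_probs)
    order = np.argsort(-probs)                     # most probable first
    fdr = np.cumsum(1.0 - probs[order]) / np.arange(1, probs.size + 1)
    k = int(np.sum(fdr <= alpha))                  # FDR curve is nondecreasing
    return np.sort(order[:k])
```

The FDR rule is typically stricter than the 1/2 cut-off when many inclusion probabilities sit in the middle of the unit interval.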
- Extensions to Probit and Logit Models: The prior models for variable selection described above can be easily applied to other modeling settings, where a response variable is expressed as a linear combination of the predictors. For example, Bayesian variable selection for probit models is investigated by Sha et al. (2004) and Kwon et al. (2007), within the data augmentation framework of Albert & Chib (1993). Holmes & Held (2006) (with correction in Bayesian Analysis (2011), 6(2)) and Tüchler (2008) considered logistic models; see also Polson & Scott (2013) for an alternative data augmentation scheme. Gustafson & Lefebvre (2008) extended the methodologies to settings where the subset of predictors associated with the propensity to belong to a class varies with the class. Sha et al. (2006) considered accelerated failure time models for survival data.
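The Albert & Chib (1993) construction is simple to sketch: conditional on latent Gaussian utilities z, the probit model becomes a linear regression, and conditional on the coefficients each z_i is a truncated normal. A minimal Gibbs sampler in Python for a fixed set of predictors — with a ridge-type N(0, cI) prior on beta and no selection step, both simplifications of mine:

```python
import numpy as np
from statistics import NormalDist

def probit_gibbs(y, X, n_iter=2000, c=100.0, seed=0):
    """Albert & Chib (1993) data augmentation for probit regression:
    alternate sampling the latent z_i (truncated normals, truncated to
    z_i > 0 when y_i = 1 and z_i < 0 when y_i = 0) and the coefficients
    beta (multivariate normal), under beta ~ N(0, c I) and unit variance.
    Returns the chain of beta draws."""
    rng = np.random.default_rng(seed)
    std = NormalDist()
    n, p = X.shape
    V = np.linalg.inv(X.T @ X + np.eye(p) / c)   # posterior covariance of beta
    L = np.linalg.cholesky(V)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        m = X @ beta
        lo = np.array([std.cdf(-mi) for mi in m])   # P(z_i < 0 | m_i)
        u = rng.uniform(size=n)
        # inverse-CDF sampling from the appropriate truncated normal
        q = np.where(y == 1, lo + u * (1.0 - lo), u * lo)
        q = np.clip(q, 1e-12, 1 - 1e-12)
        z = m + np.array([std.inv_cdf(qi) for qi in q])
        beta = V @ (X.T @ z) + L @ rng.standard_normal(p)
        draws[t] = beta
    return draws
```

Because every full conditional is a standard distribution, a spike-and-slab selection step over the columns of X slots naturally into this scheme, which is the route taken by Sha et al. (2004).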
- Generalized Linear Models: Probit and logit models, in particular, belong to the more general class of generalized linear models (GLMs) of McCullagh & Nelder (1989), which assume that the distribution of the response variable comes from the exponential family. Conditional densities in the general GLM framework cannot be obtained directly, and the resulting mixture posterior may be difficult to sample using standard MCMC methods due to multimodality. Some attempts at Bayesian variable selection methods for GLMs were made by Raftery (1996), who proposed approximate Bayes factors, and by Ntzoufras et al. (2003), who developed a method to jointly select variables and the link function. See also Ibrahim et al. (2000) and Chen et al. (2003).
- Covariance Selection in Models with Random Effects: Among possible extensions of linear models, we also mention the class of mixed models, which include random effects capturing heterogeneity among subjects, Laird & Ware (1982). One challenge in developing SSVS approaches for random effects models is the constraint that the random effects covariance matrix needs to be positive semi-definite. Chen & Dunson (2003) imposed mixture priors on the regression coefficients of the fixed effects and achieved simultaneous selection of the random effects by imposing variable selection priors on the components of a special LDU decomposition of the random effects covariance. A similar approach, based on the Cholesky decomposition, was proposed by Frühwirth-Schnatter & Tüchler (2008). Cai & Dunson (2006) extended the approach to generalized linear mixed models (GLMMs) and Kinney & Dunson (2007) to logistic mixed effects models for binary data. Finally, MacLehose et al. (2007), Dunson et al. (2008) and Yang (2012) considered Bayesian nonparametric approaches that use spiked Dirichlet process priors. These approaches model the unknown distribution of the regression coefficients via a Dirichlet process prior with a spike-and-slab centering distribution, which allows different predictors to have identical coefficients while performing variable selection. There, the clustering induced by the Dirichlet process is on the univariate regression coefficients and strength is borrowed across covariates. Kim et al. (2010) consider similar priors in a random effects model to cluster the coefficient vectors across samples.
- Regularization Priors: With spike-and-slab priors, all possible models are embodied within a hierarchical formulation and variable selection is carried out model-wise. Regularization approaches, instead, use priors with just one continuous component and rely on the shrinkage properties of Bayesian estimators. Examples include the Laplace prior and the ridge prior. The Laplace prior, in particular, has a singularity (a non-differentiable peak) at the origin, which promotes intensive shrinkage towards the zero prior mean. These priors can be expressed as scale mixtures of normal distributions to facilitate computation. Popular regularized regression techniques include the Bayesian LASSO of Park & Casella (2008) and Hans (2009), which is equivalent to MAP estimation under a normal/exponential (Laplace) prior, and the normal scale mixture priors proposed by Griffin & Brown (2010). Li & Lin (2010) proposed the Bayesian elastic net, which encourages a grouping effect in which strongly correlated predictors tend to come in or out of the model together. Lasso procedures tend to overshrink large coefficients due to the relatively light tails of the Laplace prior. To overcome this issue, Carvalho et al. (2010) and Armagan et al. (2013) have proposed the horseshoe prior and the generalized double Pareto shrinkage prior for linear models, respectively. Under these priors, the posterior summary measures (mean or median) are never zero with positive probability, and zeroing out the redundant variables then needs to be carried out by thresholding the estimated coefficients. A solution is to augment the shrinkage priors to include a point mass at zero, see for example Hans (2010).
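The scale-mixture representation mentioned above is easy to check by simulation: mixing a zero-mean normal over an exponentially distributed variance yields exactly a Laplace distribution, which is the identity the Gibbs samplers of Park & Casella (2008) and Hans (2009) exploit. A small numerical sketch (the function name is mine):

```python
import numpy as np

def laplace_via_normal_mixture(lam, size, seed=0):
    """Draw Laplace(rate = lam) variates via the scale mixture of normals:
        tau2 ~ Exponential(rate = lam^2 / 2),  beta | tau2 ~ N(0, tau2).
    The marginal density of beta is then (lam / 2) * exp(-lam * |beta|)."""
    rng = np.random.default_rng(seed)
    tau2 = rng.exponential(scale=2.0 / lam**2, size=size)  # scale = 1/rate
    return rng.normal(0.0, np.sqrt(tau2))
```

The draws reproduce the Laplace moments — variance 2/λ² and kurtosis 6 — i.e. the sharp peak and heavier-than-normal tails responsible for the shrinkage behavior discussed above.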
- Mixture Models: Bayesian variable selection has been applied also to clustering settings that use finite and infinite mixture models. The first approach to variable selection for model-based clustering was put forward by Tadesse et al. (2005), who formulated the clustering in terms of a finite mixture of Gaussian distributions with an unknown number of components and then introduced latent variables to identify discriminating variables. The authors used a reversible jump Markov chain Monte Carlo technique to allow for the creation and deletion of clusters. Raftery & Dean (2006) considered a similar approach, with a likelihood formulation that avoids the independence assumption between the noisy and discriminating variables. Kim et al. (2006) proposed an alternative modeling approach that uses infinite mixture models via Dirichlet process priors. Hoff (2006) adopted a mixture of Gaussian distributions where different clusters are identified by mean shifts and Bayes factors are computed to identify discriminating variables. This method allows separate subsets of variables to discriminate different groups of observations. Stingo & Vannucci (2011) investigated adaptations of the Bayesian variable selection method to the simpler supervised clustering setting, also known as discriminant analysis. In that setting, the number of groups and the sample labels are available and the aim is to derive a classification rule that will assign further cases to their correct groups. In their application they use MRF priors to account for network prior structures among the variables (genes). The methodologies can be extended to more complex mixture models. Stingo et al. (2013) considered a hierarchical mixture model that incorporates discriminating variables, network priors and mixture components that depend on selected covariates. Chung & Dunson (2009) investigated infinite mixtures of univariate regressions, which also incorporate selection of the covariates, and an inferential approach based on the probit stick-breaking process. See also Papathomas et al. (2012) for discrete covariates.
References
Albert, J. & Chib, S. (1993), 'Bayesian analysis of binary and polychotomous response data', Journal of the American Statistical Association 88, 669–679.
Armagan, A., Dunson, D. & Lee, J. (2013), 'Generalized double Pareto shrinkage', Statistica Sinica 23, 119–143.
Barbieri, M. M. & Berger, J. O. (2004), 'Optimal predictive model selection', Annals of Statistics 32(3), 870–897.
Bottolo, L., Chadeau-Hyam, M., Hastie, D., Langley, S., Petretto, E., Tiret, L., Tregouet, D. & Richardson, S. (2011), 'ESS++: a C++ object-oriented algorithm for Bayesian stochastic search model exploration', Bioinformatics 27, 587–588.
Bottolo, L. & Richardson, S. (2010), 'Evolutionary stochastic search', Bayesian Analysis 5(3), 583–618.
Brown, P., Vannucci, M. & Fearn, T. (1998a), 'Multivariate Bayesian variable selection and prediction', Journal of the Royal Statistical Society, Series B 60, 627–641.
Brown, P., Vannucci, M. & Fearn, T. (1998b), 'Multivariate Bayesian variable selection and prediction', Journal of Chemometrics 12(3), 173–182.
Brown, P., Vannucci, M. & Fearn, T. (2002), 'Bayes model averaging with selection of regressors', Journal of the Royal Statistical Society, Series B 64, 519–536.
Cai, B. & Dunson, D. (2006), 'Bayesian covariance selection in generalized linear mixed models', Biometrics 62, 446–457.
Carlin, B. P. & Chib, S. (1995), 'Bayesian model choice via Markov chain Monte Carlo methods', Journal of the Royal Statistical Society, Series B 57, 473–484.
Carvalho, C., Polson, N. & Scott, J. (2010), 'The horseshoe estimator for sparse signals', Biometrika 97, 465–480.
Chen, M.-H., Ibrahim, J., Shao, Q.-M. & Weiss, R. (2003), 'Prior elicitation for model selection and estimation in generalized linear mixed models', Journal of Statistical Planning and Inference 111, 57–76.
Chen, Z. & Dunson, D. (2003), 'Random effects selection in linear mixed models', Biometrics 59, 762–769.
Chipman, H., George, E. & McCulloch, R. (2001), The practical implementation of Bayesian model selection, in 'Model Selection', IMS, pp. 67–116.
Chung, Y. & Dunson, D. (2009), 'Nonparametric Bayes conditional distribution modeling with variable selection', Journal of the American Statistical Association 104, 1646–1660.
Clyde, M., DeSimone, H. & Parmigiani, G. (1996), 'Prediction via orthogonalized model mixing', Journal of the American Statistical Association 91, 1197–1208.
Clyde, M. & George, E. (2004), 'Model uncertainty', Statistical Science 19(1), 81–94.
Cui, W. & George, E. I. (2008), 'Empirical Bayes vs. fully Bayes variable selection', Journal of Statistical Planning and Inference 138, 888–900.
Dunson, D., Herring, A. & Engel, S. (2008), 'Bayesian selection and clustering of polymorphisms in functionally related genes', Journal of the American Statistical Association 103, 534–546.
Frühwirth-Schnatter, S. & Tüchler, R. (2008), 'Bayesian parsimonious covariance estimation for hierarchical linear mixed models', Statistics and Computing 18(1), 1–13.
George, E. & McCulloch, R. (1993), 'Variable selection via Gibbs sampling', Journal of the American Statistical Association 88, 881–889.
George, E. & McCulloch, R. (1997), 'Approaches for Bayesian variable selection', Statistica Sinica 7, 339–373.
Geweke, J. (1996), Variable selection and model comparison in regression, in J. M. Bernardo, J. O. Berger, A. P. Dawid & A. F. M. Smith, eds, 'Bayesian Statistics 5: Proceedings of the Fifth Valencia International Meeting', Oxford University Press, pp. 609–620.
Griffin, J. & Brown, P. (2010), 'Inference with normal-gamma prior distributions in regression problems', Bayesian Analysis 5, 171–188.
Gustafson, P. & Lefebvre, G. (2008), 'Bayesian multinomial regression with class-specific predictor selection', Annals of Applied Statistics 2, 1478–1502.
Hans, C. (2009), 'Bayesian lasso regression', Biometrika 96, 835–845.
Hans, C. (2010), 'Model uncertainty and variable selection in Bayesian lasso regression', Statistics and Computing 20, 221–229.
Hans, C., Dobra, A. & West, M. (2007), 'Shotgun stochastic search for "large p" regression', Journal of the American Statistical Association 102(478), 507–516.
Hoff, P. (2006), 'Model-based subspace clustering', Bayesian Analysis 1, 321–344.
Holmes, C. & Held, L. (2006), 'Bayesian auxiliary variable models for binary and multinomial regression', Bayesian Analysis 1, 145–168.
Ibrahim, J., Chen, M.-H. & Ryan, L. (2000), 'Bayesian variable selection for time series count data', Statistica Sinica 10, 971–987.
Kim, S., Dahl, D. & Vannucci, M. (2010), 'Spiked Dirichlet process prior for Bayesian multiple hypothesis testing in random effects models', Bayesian Analysis 4, 707–732.
Kim, S., Tadesse, M. & Vannucci, M. (2006), 'Variable selection in clustering via Dirichlet process mixture models', Biometrika 93(4), 877–893.
Kinney, S. & Dunson, D. (2007), 'Fixed and random effects selection in linear and logistic models', Biometrics 63(3), 690–698.
Kwon, D., Tadesse, M., Sha, N., Pfeiffer, R. & Vannucci, M. (2007), 'Identifying biomarkers from mass spectrometry data with ordinal outcomes', Cancer Informatics 3, 19–28.
Laird, N. & Ware, J. (1982), 'Random effects models for longitudinal data', Biometrics 38, 963–974.
Leamer, E. (1978), 'Regression selection strategies and revealed priors', Journal of the American Statistical Association 73, 580–587.
Li, F. & Zhang, N. (2010), 'Bayesian variable selection in structured high-dimensional covariate space with application in genomics', Journal of the American Statistical Association, to appear.
Li, Q. & Lin, N. (2010), 'The Bayesian elastic net', Bayesian Analysis 5(1), 151–170.
Liang, F., Paulo, R., Molina, G., Clyde, M. & Berger, J. (2008), 'Mixtures of g priors for Bayesian variable selection', Journal of the American Statistical Association 103, 410–423.
MacLehose, R. F., Dunson, D. B., Herring, A. H. & Hoppin, J. A. (2007), 'Bayesian methods for highly correlated exposure data', Epidemiology 18(2), 199–207.
Madigan, D. & York, J. (1995), 'Bayesian graphical models for discrete data', International Statistical Review 63, 215–232.
McCullagh, P. & Nelder, J. (1989), Generalized Linear Models, second edition, Chapman & Hall, London.
Mitchell, T. J. & Beauchamp, J. J. (1988), 'Bayesian variable selection in linear regression', Journal of the American Statistical Association 83, 1023–1036.
Newton, M. A., Noueiry, A., Sarkar, D. & Ahlquist, P. (2004), 'Detecting differential gene expression with a semiparametric hierarchical mixture model', Biostatistics 5(2), 155–176.
Ntzoufras, I., Dellaportas, P. & Forster, J. (2003), 'Bayesian variable and link determination for generalised linear models', Journal of Statistical Planning and Inference 111, 165–180.
O'Hara, R. & Sillanpää, M. (2009), 'A review of Bayesian variable selection methods: what, how and which', Bayesian Analysis 4(1), 85–118.
Papathomas, M., Molitor, J., Hoggart, C., Hastie, D. & Richardson, S. (2012), 'Exploring data from genetic association studies using Bayesian variable selection and the Dirichlet process: application to searching for gene × gene patterns', Genetic Epidemiology 36(6), 663–674.
Park, T. & Casella, G. (2008), 'The Bayesian lasso', Journal of the American Statistical Association 103(482), 681–686.
Polson, N. & Scott, J. (2013), 'Data augmentation for non-Gaussian regression models using variance-mean mixtures', Biometrika 100(2), 549–571.
Raftery, A. (1996), 'Approximate Bayes factors and accounting for model uncertainty in generalized linear models', Biometrika 83, 251–266.
Raftery, A. & Dean, N. (2006), 'Variable selection for model-based clustering', Journal of the American Statistical Association 101, 168–178.
Raftery, A., Madigan, D. & Hoeting, J. (1997), 'Bayesian model averaging for linear regression models', Journal of the American Statistical Association 92(437), 179–191.
Scott, J. & Berger, J. (2010), 'Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem', The Annals of Statistics 38(5), 2587–2619.
Sha, N., Tadesse, M. G. & Vannucci, M. (2006), 'Bayesian variable selection for the analysis of microarray data with censored outcomes', Bioinformatics 22(18), 2262–2268.
Sha, N., Vannucci, M., Tadesse, M., Brown, P., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, N., Buckley, C. & Falciani, F. (2004), 'Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage', Biometrics 60(3), 812–819.
Smith, M. & Kohn, R. (1996), 'Nonparametric regression using Bayesian variable selection', Journal of Econometrics 75, 317–343.
Stingo, F., Chen, Y., Tadesse, M. & Vannucci, M. (2011), 'Incorporating biological information in Bayesian models for the selection of pathways and genes', Annals of Applied Statistics 5(3), 1978–2002.
Stingo, F., Chen, Y., Vannucci, M., Barrier, M. & Mirkes, P. (2010), 'A Bayesian graphical modeling approach to microRNA regulatory network inference', Annals of Applied Statistics 4(4), 2024–2048.
Stingo, F., Guindani, M., Vannucci, M. & Calhoun, V. (2013), 'An integrative Bayesian modeling approach to imaging genetics', Journal of the American Statistical Association, to appear.
Stingo, F. & Vannucci, M. (2011), 'Variable selection for discriminant analysis with Markov random field priors for the analysis of microarray data', Bioinformatics 27(4), 495–501.
Tadesse, M. G., Sha, N. & Vannucci, M. (2005), 'Bayesian variable selection in clustering high-dimensional data', Journal of the American Statistical Association 100, 602–617.
Tüchler, R. (2008), 'Bayesian variable selection for logistic models using auxiliary mixture sampling', Journal of Computational and Graphical Statistics 17(1), 76–94.
Yang, M. (2012), 'Bayesian variable selection for logistic mixed model with nonparametric random effects', Computational Statistics and Data Analysis 56, 2663–2674.
Zellner, A. (1986), On assessing prior distributions and Bayesian regression analysis with g-prior distributions, in P. Goel & A. Zellner, eds, 'Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti', North-Holland/Elsevier, pp. 233–243.