High-Dimensional Multivariate Bayesian Linear Regression with Shrinkage Priors




1. High-Dimensional Multivariate Bayesian Linear Regression with Shrinkage Priors. Ray Bai, Department of Statistics, University of Florida. Joint work with Dr. Malay Ghosh. March 20, 2018.

2. Overview
1. Overview of High-Dimensional Multivariate Linear Regression
2. Multivariate Bayesian Model with Shrinkage Priors (MBSP)
3. Posterior Consistency of MBSP (Low-Dimensional Case; Ultrahigh-Dimensional Case)
4. Implementation of the MBSP Model
5. Simulation Study
6. Yeast Cell Cycle Data Analysis

3. Simultaneous Prediction and Estimation. There are many scenarios where we would want to simultaneously predict q continuous response variables y_1, ..., y_q:
- Longitudinal data: the q response variables represent measurements at q consecutive time points, e.g., mRNA levels at different time points, children's heights at different ages of development, or CD4 cell counts over time for HIV/AIDS patients.
- The data have a group structure: the q response variables represent a "group." In genomics, genes within the same pathway often act together in regulating a biological system.

4. Multivariate Linear Regression. Consider the multivariate linear regression model, Y = XB + E, where Y = (y_1, ..., y_q) is an n × q response matrix of n samples and q response variables, X is an n × p matrix of n samples and p covariates, B ∈ R^{p×q} is the coefficient matrix, and E = (ε_1, ..., ε_n)^T is an n × q noise matrix, where ε_i i.i.d.∼ N_q(0, Σ), i = 1, ..., n. Throughout, we assume that X is centered, so there is no intercept term.
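To make the setup concrete, here is a minimal simulation from this model in Python/numpy; the dimensions, the row-sparsity pattern of B, and the covariance Σ are illustrative choices, not values from the talk.

```python
import numpy as np

# Minimal simulation of Y = XB + E with row-sparse B (dimensions are illustrative).
rng = np.random.default_rng(0)
n, p, q = 50, 10, 3

X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                 # center X, so no intercept is needed

B = np.zeros((p, q))
B[:3] = rng.normal(size=(3, q))     # only the first 3 rows are nonzero (row-sparse B)

Sigma = 0.5 * np.eye(q) + 0.5       # compound-symmetry covariance for the q responses
L = np.linalg.cholesky(Sigma)
E = rng.normal(size=(n, q)) @ L.T   # rows of E are i.i.d. N_q(0, Sigma)

Y = X @ B + E
```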

5. Multivariate Linear Regression. For the multivariate linear regression model, Y_{n×q} = X_{n×p} B_{p×q} + E_{n×q}, where E = (ε_1, ..., ε_n)^T, ε_i i.i.d.∼ N_q(0, Σ), i = 1, ..., n, the matrix Σ represents the covariance structure of the q response variables. We wish to estimate the coefficient matrix B. Model selection from the p covariates is also often desired; this can be done using multivariate generalizations of AIC, BIC, or Mallows' C_p.

6. Multivariate Linear Regression. For the multivariate linear regression model, the usual maximum likelihood estimator (MLE) is the ordinary least squares estimator, B̂ = (X^T X)^{−1} X^T Y. The MLE is unique only if p ≤ n (so that X^T X can be invertible). It is well known that the MLE is an inconsistent estimator of B if p/n → c, c > 0. Variable selection using AIC, BIC, and Mallows' C_p is infeasible for large p, since it requires searching over a model space of 2^p models.
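A short sketch of this computation, assuming X and Y as in the simulation above (the function name is ours):

```python
import numpy as np

def ols_mle(X, Y):
    """B_hat = (X^T X)^{-1} X^T Y, the MLE when X^T X is invertible (needs p <= n)."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# When p > n, X^T X (a p x p matrix) has rank at most n < p, so it is singular
# and no unique minimizer exists; the solve above is then ill-posed.
```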

7. High-Dimensional Multivariate Linear Regression. To handle cases where p is large (including the p > n regime), frequentists typically use penalized regression (e.g. Li et al. (2015), Vincent and Hansen (2014), Wilms and Croux (2017)):
min_B ||Y − XB||_F² + λ ∑_{i=1}^p ||b_i||₂,
where b_i represents the i-th row of B and λ > 0 is a tuning parameter. The group lasso penalty, || · ||₂, shrinks entire rows of B to exactly 0, leading to a sparse estimate of B and facilitating variable selection from the p covariates. We can use the adaptive group lasso penalty to avoid overshrinkage of b_i, i = 1, ..., p. A sketch of this optimization appears below.
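One standard way to solve this group lasso problem is proximal gradient descent with row-wise (block) soft-thresholding; the sketch below is a generic implementation of that idea, not code from the talk, and the step size and iteration count are illustrative.

```python
import numpy as np

def group_lasso(X, Y, lam, n_iter=500):
    """Proximal gradient for: min_B ||Y - XB||_F^2 + lam * sum_i ||b_i||_2."""
    p, q = X.shape[1], Y.shape[1]
    B = np.zeros((p, q))
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1/L, L = Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ B - Y)               # gradient of the smooth loss
        Z = B - step * grad
        row_norms = np.linalg.norm(Z, axis=1, keepdims=True)
        scale = np.maximum(1.0 - step * lam / np.maximum(row_norms, 1e-12), 0.0)
        B = scale * Z                              # block soft-threshold: small rows become exactly 0
    return B
```

In practice λ is chosen by cross-validation or an information criterion.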

8. Bayesian High-Dimensional Multivariate Linear Regression. The Bayesian approach is to put a prior distribution π(B) on B. That is, given the model Y = XB + E and data (X, Y), we have π(B | Y) ∝ f(Y | X, B) π(B). Inference can be conducted through the posterior, π(B | Y).
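In code, the unnormalized log-posterior is just the Gaussian log-likelihood plus the log-prior; a minimal sketch, where `log_prior` is a placeholder for whatever prior one chooses for B:

```python
import numpy as np

def log_posterior(B, X, Y, Sigma, log_prior):
    """Unnormalized log pi(B | Y) = log f(Y | X, B) + log pi(B), up to constants."""
    R = Y - X @ B
    Sigma_inv = np.linalg.inv(Sigma)
    log_lik = -0.5 * np.trace(R @ Sigma_inv @ R.T)  # rows of R are i.i.d. N_q(0, Sigma)
    return log_lik + log_prior(B)
```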

9. Bayesian High-Dimensional Multivariate Linear Regression. To achieve sparsity and variable selection, a common approach is to place spike-and-slab priors on the rows of B (e.g. Brown et al. (1998), Liquet et al. (2017)):
b_i^T i.i.d.∼ (1 − p) δ_{0} + p N_q(0, τ² V), i = 1, ..., p.
Here δ_{0} represents a point mass at 0 ∈ R^q, and V is a q × q symmetric positive definite matrix. τ² can be treated as a tuning parameter, or a prior can be placed on τ². A prior can also be placed on the mixing weight p so that the model adapts to the underlying sparsity; usually, we put a Beta prior on p.

10. Bayesian High-Dimensional Multivariate Linear Regression. For the spike-and-slab approach,
b_i^T i.i.d.∼ (1 − p) δ_{0} + p N_q(0, τ² V), i = 1, ..., p,
with τ² ∼ µ(τ²) and p ∼ Beta(a, b). Taking the posterior median gives a point estimate of B with some rows equal to 0^T, thus recovering a sparse estimate of B and facilitating variable selection. Due to the point mass at 0, however, posterior computation under this model can be very, very slow for large p. A prior simulation sketch follows.
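A quick prior simulation illustrating the row-sparsity this prior induces; the mixing weight (called p on the slide, renamed `w` here to avoid clashing with the number of covariates), τ², and V are illustrative values:

```python
import numpy as np

# Draw rows of B from the spike-and-slab prior (hyperparameters are illustrative).
rng = np.random.default_rng(1)
p, q = 10, 3
w = 0.2                                    # prior inclusion probability
tau2 = 1.0
V = np.eye(q)

B = np.zeros((p, q))
active = rng.random(p) < w                 # which rows escape the point mass at 0
B[active] = rng.multivariate_normal(np.zeros(q), tau2 * V, size=active.sum())
```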

11. Bayesian High-Dimensional Multivariate Linear Regression. Due to the computational inefficiency of discontinuous priors, it is often desirable to put a continuous prior on the parameters of interest. For the multivariate linear regression model, Y = XB + E, our aim is to estimate B. This requires putting a prior density on a p × q matrix. A popular continuous prior to place on B is the matrix-normal prior.

12. The Matrix-Normal Prior. Definition: A random matrix X is said to have the matrix-normal density if X has the density function (on the space R^{a×b})
f(X) = (2π)^{−ab/2} |U|^{−b/2} |V|^{−a/2} exp( −(1/2) tr[ U^{−1} (X − M) V^{−1} (X − M)^T ] ),
where M ∈ R^{a×b}, and U and V are positive definite matrices of dimension a × a and b × b, respectively. If X is distributed as a matrix-normal distribution with the pdf above, we write X ∼ MN_{a×b}(M, U, V).
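Sampling from this distribution is straightforward via Cholesky factors of U and V: if Z has i.i.d. standard normal entries, then M + A Z C^T ∼ MN(M, AA^T, CC^T). A minimal sketch (the helper name is ours):

```python
import numpy as np

def sample_matrix_normal(M, U, V, rng):
    """Draw X ~ MN(M, U, V) via X = M + A Z C^T, with U = A A^T and V = C C^T.
    Requires U and V positive definite (so the Cholesky factors exist)."""
    A = np.linalg.cholesky(U)
    C = np.linalg.cholesky(V)
    Z = rng.normal(size=M.shape)
    return M + A @ Z @ C.T

rng = np.random.default_rng(2)
X = sample_matrix_normal(np.zeros((4, 3)), np.eye(4), 0.5 * np.eye(3) + 0.5, rng)
```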

13. Multivariate Bayesian Model with Shrinkage Priors (MBSP). By adding an additional layer in the Bayesian hierarchy, we can obtain a row-sparse estimate of B. This row-sparse estimate also facilitates variable selection from the p variables. Our model is specified as follows:
Y | X, B, Σ ∼ MN_{n×q}(XB, I_n, Σ),
B | ξ_1, ..., ξ_p, Σ ∼ MN_{p×q}(O, τ diag(ξ_1, ..., ξ_p), Σ),
ξ_i ind∼ π(ξ_i), i = 1, ..., p,
where τ > 0 is a tuning parameter, and π(ξ_i) is a polynomial-tailed prior density of the form π(ξ_i) = K (ξ_i)^{−a−1} L(ξ_i), where K > 0 is the constant of proportionality, a is a positive real number, and L is a positive, measurable, non-constant, slowly varying function over (0, ∞).
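As one concrete instance, the horseshoe choice of π(ξ_i) arises from ξ_i = λ_i² with λ_i half-Cauchy. The sketch below draws B from the MBSP prior under that choice; τ and Σ are illustrative values.

```python
import numpy as np

# Draw B from the MBSP prior with horseshoe local scales (hyperparameters illustrative).
rng = np.random.default_rng(3)
p, q, tau = 10, 3, 0.1
Sigma = np.eye(q)

lam = np.abs(rng.standard_cauchy(p))         # lambda_i ~ half-Cauchy(0, 1)
xi = lam ** 2                                # xi_i = lambda_i^2 has the horseshoe density
L = np.linalg.cholesky(Sigma)
Z = rng.normal(size=(p, q))
B = np.sqrt(tau * xi)[:, None] * (Z @ L.T)   # row i: b_i ~ N_q(0, tau * xi_i * Sigma)
```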

14. Examples of Polynomial-Tailed Priors

Prior | π(ξ_i)/C | L(ξ_i)
Student's t | ξ_i^{−a−1} exp(−a/ξ_i) | exp(−a/ξ_i)
Horseshoe | ξ_i^{−1/2} (1 + ξ_i)^{−1} | ξ_i^{a+1/2} / (1 + ξ_i)
Horseshoe+ | ξ_i^{−1/2} (ξ_i − 1)^{−1} log(ξ_i) | ξ_i^{a+1/2} (ξ_i − 1)^{−1} log(ξ_i)
NEG | (1 + ξ_i)^{−1−a} | {ξ_i/(1 + ξ_i)}^{a+1}
TPBN | ξ_i^{u−1} (1 + ξ_i)^{−a−u} | {ξ_i/(1 + ξ_i)}^{a+u}
GDP | ∫_0^∞ (λ²/2) exp(−λ²ξ_i/2) λ^{2a−1} exp(−ηλ) dλ | ∫_0^∞ t^a exp(−t − η√(2t/ξ_i)) dt
HIB | ξ_i^{u−1} (1 + ξ_i)^{−(a+u)} exp(−s/(1 + ξ_i)) {φ² + (1 − φ²)/(1 + ξ_i)}^{−1} | {ξ_i/(1 + ξ_i)}^{a+u} exp(−s/(1 + ξ_i)) {φ² + (1 − φ²)/(1 + ξ_i)}^{−1}

Table: Polynomial-tailed priors, their respective prior densities π(ξ_i) up to a normalizing constant C, and the slowly varying component L(ξ_i).

15. Sparse Estimation of B: Examples. If π(ξ_j) ind∼ Inverse-Gamma(α_j, γ_j/2), then the marginal density for B, π(B), under the MBSP model is proportional to
∏_{j=1}^p ( ||b_j (τΣ)^{−1/2}||₂² + γ_j )^{−(α_j + q/2)},
which corresponds to a multivariate t-distribution. Here b_j denotes the j-th row of B.
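The scale-mixture representation gives a direct way to sample from this marginal: draw ξ_j from the inverse-gamma, then b_j from the conditional normal. A sketch with illustrative hyperparameters; note numpy parametrizes the gamma by shape and scale, so ξ ∼ Inverse-Gamma(α, γ/2) is the reciprocal of a Gamma(α, scale = 2/γ) draw.

```python
import numpy as np

# One row b_j from the multivariate-t marginal, via the inverse-gamma mixture.
rng = np.random.default_rng(4)
q, alpha, gamma, tau = 3, 2.0, 1.0, 1.0
Sigma = np.eye(q)

xi = 1.0 / rng.gamma(shape=alpha, scale=2.0 / gamma)  # xi ~ Inverse-Gamma(alpha, gamma/2)
L = np.linalg.cholesky(Sigma)
b = np.sqrt(tau * xi) * (rng.normal(size=q) @ L.T)    # b | xi ~ N_q(0, tau * xi * Sigma)
```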

16. Sparse Estimation of B: Examples. If π(ξ_j) ∝ ξ_j^{q/2−1} (1 + ξ_j)^{−1}, then the joint density π(B, ξ_1, ..., ξ_p) under the MBSP model is proportional to
∏_{j=1}^p ξ_j^{−1} (1 + ξ_j)^{−1} exp( −||b_j (τΣ)^{−1/2}||₂² / (2ξ_j) ),
and integrating out the ξ_j's gives a multivariate horseshoe density function.

17. Notation. For any two sequences of positive real numbers {a_n} and {b_n} with b_n ≠ 0:
- a_n = O(b_n) if |a_n/b_n| ≤ M for all n, for some positive real number M independent of n;
- a_n = o(b_n) if lim_{n→∞} a_n/b_n = 0. Therefore, a_n = o(1) if lim_{n→∞} a_n = 0.
For a vector v ∈ R^n, ||v||₂ := √(∑_{i=1}^n v_i²) denotes the ℓ₂ norm. For a matrix A ∈ R^{a×b} with entries a_ij, ||A||_F := √(tr(A^T A)) = √(∑_{i=1}^a ∑_{j=1}^b a_ij²) denotes the Frobenius norm of A. For a symmetric matrix A, we denote its minimum and maximum eigenvalues by λ_min(A) and λ_max(A), respectively.
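A quick numerical check of the two Frobenius-norm formulas above (dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(4, 6))

via_trace = np.sqrt(np.trace(A.T @ A))     # sqrt(tr(A^T A))
via_entries = np.sqrt((A ** 2).sum())      # sqrt(sum of squared entries)
assert np.isclose(via_trace, via_entries)
assert np.isclose(via_trace, np.linalg.norm(A, 'fro'))
```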
