

SLIDE 1

Bayesian Linear Regression

Sudipto Banerjee

Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.

September 15, 2010


SLIDE 2

Linear regression models: a Bayesian perspective

Linear regression is perhaps the most widely used statistical modelling tool. It addresses the following question: how does a quantity of primary interest, y, vary as (depend upon) another quantity, or set of quantities, x? The quantity y is called the response or outcome variable; some people simply refer to it as the dependent variable. The variable(s) x are called explanatory variables, covariates or simply independent variables. In general, we are interested in the conditional distribution of y, given x, parametrized as p(y | θ, x).


SLIDE 3

Linear regression models: a Bayesian perspective

Typically, we have a set of units or experimental subjects i = 1, 2, . . . , n. For each of these units we have measured an outcome yi and a set of explanatory variables $x_i' = (1, x_{i1}, x_{i2}, \ldots, x_{ip})$. The first element of $x_i'$ is often taken as 1 to signify the presence of an "intercept". We collect the outcome and explanatory variables into an n × 1 vector and an n × (p + 1) matrix:

$$
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}; \qquad
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
= \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix}.
$$

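To make the setup concrete, here is a minimal NumPy sketch that assembles y and X from simulated data (the dimensions, variable names, and "true" parameter values are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 2                          # n units, p covariates
covariates = rng.normal(size=(n, p))

# Design matrix: prepend a column of ones for the intercept,
# so X is n x (p + 1) with rows x_i' = (1, x_i1, ..., x_ip).
X = np.column_stack([np.ones(n), covariates])

# Simulated outcomes from a "true" linear model (illustration only).
beta_true = np.array([1.0, 2.0, -0.5])
sigma_true = 1.0
y = X @ beta_true + sigma_true * rng.normal(size=n)
```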

SLIDE 4

Linear regression models: a Bayesian perspective

The linear model is the most fundamental of all serious statistical models, underpinning:

ANOVA: yi is continuous, the xij's are all categorical.
REGRESSION: yi is continuous, the xij's are continuous.
ANCOVA: yi is continuous, the xij's are continuous for some j and categorical for others.


SLIDE 5

Linear regression models: a Bayesian perspective

The Bayesian or hierarchical linear model is given by:

$$y_i \mid \mu_i, \sigma^2, X \overset{ind}{\sim} N(\mu_i, \sigma^2), \quad i = 1, 2, \ldots, n;$$
$$\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = x_i'\beta;$$
$$\beta = (\beta_0, \beta_1, \ldots, \beta_p); \qquad \beta, \sigma^2 \mid X \sim p(\beta, \sigma^2 \mid X).$$

The unknown parameters include the regression parameters and the variance, i.e. θ = {β, σ2}. Here p(β, σ2 | X) ≡ p(θ | X) is the joint prior on the parameters. We assume X is observed without error, and all inference is conditional on X; we suppress the dependence on X in subsequent notation.

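As a sketch of what this specification means computationally, the unnormalized log posterior is the log-likelihood plus the log prior; the function names and the generic log_prior argument below are our own illustrative choices:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(beta, sigma2, y, X):
    """Gaussian log-likelihood: y_i | mu_i, sigma2 ~ N(x_i' beta, sigma2)."""
    mu = X @ beta
    return norm.logpdf(y, loc=mu, scale=np.sqrt(sigma2)).sum()

def log_posterior(beta, sigma2, y, X, log_prior):
    """Unnormalized log posterior: log p(beta, sigma2 | y) up to a constant."""
    return log_likelihood(beta, sigma2, y, X) + log_prior(beta, sigma2)
```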

SLIDE 6

Bayesian regression with flat priors

Specifying p(β, σ2) completes the hierarchical model. All inference proceeds from p(β, σ2 | y). With no prior information, we specify

$$p(\beta, \sigma^2) \propto \frac{1}{\sigma^2},$$

or equivalently p(β) ∝ 1 and p(log(σ2)) ∝ 1. These are NOT probability densities (they do not integrate to any finite number). So why is it that we are even discussing them? Even if the priors are improper, as long as the resulting posterior distributions are proper, we can still conduct legitimate statistical inference.


SLIDE 7

Bayesian regression with flat priors

Computing the posterior distribution

Strategy: factor the joint posterior distribution for β and σ2 as

$$p(\beta, \sigma^2 \mid y) = p(\beta \mid \sigma^2, y) \times p(\sigma^2 \mid y).$$

The conditional posterior distribution of β, given σ2, is

$$\beta \mid \sigma^2, y \sim N(\hat\beta, \sigma^2 V_\beta),$$

where, using some algebra, one finds $\hat\beta = (X'X)^{-1} X' y$ and $V_\beta = (X'X)^{-1}$.

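A minimal NumPy sketch of these two quantities (an explicit inverse is used to mirror the formula; in practice a linear solve or QR decomposition is numerically preferable):

```python
import numpy as np

def posterior_beta_given_sigma2(y, X):
    """Compute beta_hat = (X'X)^{-1} X'y and V_beta = (X'X)^{-1}."""
    XtX = X.T @ X
    V_beta = np.linalg.inv(XtX)      # (p+1) x (p+1)
    beta_hat = V_beta @ (X.T @ y)    # posterior mean of beta given sigma2
    return beta_hat, V_beta
```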

SLIDE 8

Bayesian regression with flat priors

The marginal posterior distribution of σ2: let k = p + 1 be the number of columns of X. Then

$$\sigma^2 \mid y \sim IG\left(\frac{n-k}{2}, \frac{(n-k)s^2}{2}\right),$$

where

$$s^2 = \frac{1}{n-k}\,(y - X\hat\beta)'(y - X\hat\beta)$$

is the classical unbiased estimate of σ2 in the linear regression model. The marginal posterior distribution p(β | y), averaging over σ2, is multivariate t with n − k degrees of freedom. But we rarely use this fact in practice; instead, we sample from the posterior distribution.

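A sketch of s2 and the resulting inverse gamma parameters, using scipy.stats.invgamma, whose (shape, scale) parameterization matches IG(a, b) as invgamma(a, scale=b):

```python
import numpy as np
from scipy.stats import invgamma

def sigma2_marginal_posterior(y, X, beta_hat):
    """Return the IG (shape, scale) of p(sigma^2 | y) and s^2 itself."""
    n, k = X.shape                    # k = p + 1 columns
    resid = y - X @ beta_hat
    s2 = (resid @ resid) / (n - k)    # classical unbiased estimate of sigma^2
    return (n - k) / 2.0, (n - k) * s2 / 2.0, s2

# Example: one draw from p(sigma^2 | y).
# a, b, s2 = sigma2_marginal_posterior(y, X, beta_hat)
# sigma2_draw = invgamma.rvs(a, scale=b)
```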

SLIDE 9

Bayesian regression with flat priors

Algorithm for sampling from the posterior distribution

We draw samples from p(β, σ2 | y) by executing the following steps:

Step 1: Compute $\hat\beta$ and $V_\beta$.
Step 2: Compute $s^2$.
Step 3: Draw M samples from p(σ2 | y):
$$\sigma^{2(j)} \sim IG\left(\frac{n-k}{2}, \frac{(n-k)s^2}{2}\right), \qquad j = 1, \ldots, M.$$
Step 4: For j = 1, . . . , M, draw β(j) from p(β | σ2(j), y):
$$\beta^{(j)} \sim N\left(\hat\beta, \sigma^{2(j)} V_\beta\right).$$

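Putting the four steps together, a minimal sketch of this composition sampler under the flat prior (the function name is ours):

```python
import numpy as np
from scipy.stats import invgamma

def sample_posterior(y, X, M=1000, rng=None):
    """Draw M samples from p(beta, sigma^2 | y) under the flat prior."""
    rng = np.random.default_rng() if rng is None else rng
    n, k = X.shape

    # Steps 1-2: beta_hat, V_beta, and s^2.
    V_beta = np.linalg.inv(X.T @ X)
    beta_hat = V_beta @ (X.T @ y)
    resid = y - X @ beta_hat
    s2 = (resid @ resid) / (n - k)

    # Step 3: sigma^2 draws from the inverse gamma marginal posterior.
    sigma2 = invgamma.rvs((n - k) / 2.0, scale=(n - k) * s2 / 2.0,
                          size=M, random_state=rng)

    # Step 4: beta draws from N(beta_hat, sigma2_j * V_beta), via Cholesky.
    L = np.linalg.cholesky(V_beta)
    z = rng.standard_normal(size=(M, k))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, sigma2
```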

SLIDE 10

Bayesian regression with flat priors

The marginal distribution of each individual regression parameter βj is a (shifted and scaled) univariate t distribution with n − k degrees of freedom. In fact,

$$\frac{\beta_j - \hat\beta_j}{s\sqrt{V_{\beta;jj}}} \sim t_{n-k}.$$

The 95% credible interval for each βj is constructed from the quantiles of this t distribution. It exactly coincides with the classical 95% confidence interval, but the interpretation is direct: the probability of βj falling in that interval, given the observed data, is 0.95. Note: an intercept-only linear model reduces to the simple univariate N(ȳ | µ, σ2/n) likelihood, for which the marginal posterior of µ satisfies

$$\frac{\mu - \bar y}{s/\sqrt{n}} \sim t_{n-1}.$$

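A sketch of the resulting equal-tailed credible intervals built from the t quantiles (matching the classical confidence intervals, as noted above):

```python
import numpy as np
from scipy.stats import t as t_dist

def credible_intervals(y, X, level=0.95):
    """Marginal credible intervals (default 95%) for each beta_j."""
    n, k = X.shape
    V_beta = np.linalg.inv(X.T @ X)
    beta_hat = V_beta @ (X.T @ y)
    resid = y - X @ beta_hat
    s = np.sqrt(resid @ resid / (n - k))

    # Upper-tail quantile of the t distribution with n - k degrees of freedom.
    q = t_dist.ppf(0.5 + level / 2.0, df=n - k)
    half_width = q * s * np.sqrt(np.diag(V_beta))
    return np.column_stack([beta_hat - half_width, beta_hat + half_width])
```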

SLIDE 11

Bayesian predictions from the linear model

Suppose we have observed the new predictors X̃, and we wish to predict the outcome ỹ. If β and σ2 were known exactly, the random vector ỹ would follow N(X̃β, σ2I). But we do not know the model parameters, and this adds to the uncertainty in the predictions. Predictions are carried out by sampling from the posterior predictive distribution, p(ỹ | y):

1. Draw $\{\beta^{(j)}, \sigma^{2(j)}\} \sim p(\beta, \sigma^2 \mid y)$, j = 1, 2, . . . , M.
2. Draw $\tilde y^{(j)} \sim N(\tilde X \beta^{(j)}, \sigma^{2(j)} I)$, j = 1, 2, . . . , M.

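A sketch of this two-step composition sampling, reusing the sample_posterior function sketched earlier (X_new is a hypothetical matrix of new predictors with the same columns as X):

```python
import numpy as np

def posterior_predictive(y, X, X_new, M=1000, rng=None):
    """Draw M samples from p(y_new | y) by composition."""
    rng = np.random.default_rng() if rng is None else rng
    beta, sigma2 = sample_posterior(y, X, M=M, rng=rng)

    # For each posterior draw j, simulate y_new ~ N(X_new beta_j, sigma2_j I).
    mean = beta @ X_new.T                       # M x n_new matrix of means
    noise = rng.standard_normal(mean.shape)
    return mean + np.sqrt(sigma2)[:, None] * noise
```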

SLIDE 12

Bayesian predictions from the linear model

Predictive mean and variance (conditional upon σ2):

$$E(\tilde y \mid \sigma^2, y) = \tilde X \hat\beta; \qquad \operatorname{var}(\tilde y \mid \sigma^2, y) = \sigma^2\left(I + \tilde X V_\beta \tilde X'\right).$$

The posterior predictive distribution, p(ỹ | y), is a multivariate t distribution,

$$t_{n-k}\left(\tilde X \hat\beta,\; s^2\left(I + \tilde X V_\beta \tilde X'\right)\right).$$

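A sketch of the closed-form predictive mean and covariance (conditional on σ2), which can be checked against the Monte Carlo draws above:

```python
import numpy as np

def predictive_mean_cov(y, X, X_new, sigma2):
    """E(y_new | sigma2, y) and var(y_new | sigma2, y) in closed form."""
    V_beta = np.linalg.inv(X.T @ X)
    beta_hat = V_beta @ (X.T @ y)
    mean = X_new @ beta_hat
    cov = sigma2 * (np.eye(X_new.shape[0]) + X_new @ V_beta @ X_new.T)
    return mean, cov
```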

SLIDE 13

Incorporating prior information

$$y_i \mid \mu_i, \sigma^2 \overset{ind}{\sim} N(\mu_i, \sigma^2), \quad i = 1, 2, \ldots, n;$$
$$\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = x_i'\beta;$$
$$\beta = (\beta_0, \beta_1, \ldots, \beta_p); \quad \beta \mid \sigma^2 \sim N(\beta_0, \sigma^2 R_\beta); \quad \sigma^2 \sim IG(a_\sigma, b_\sigma),$$

where Rβ is a fixed correlation matrix. Alternatively,

$$y_i \mid \mu_i, \sigma^2 \overset{ind}{\sim} N(\mu_i, \sigma^2), \quad i = 1, 2, \ldots, n;$$
$$\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = x_i'\beta;$$
$$\beta = (\beta_0, \beta_1, \ldots, \beta_p); \quad \beta \mid \Sigma_\beta \sim N(\beta_0, \Sigma_\beta); \quad \Sigma_\beta \sim IW(\nu, S); \quad \sigma^2 \sim IG(a_\sigma, b_\sigma),$$

where Σβ is a random covariance matrix.

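A minimal sketch of drawing from the first prior specification (the hyperparameter names beta_0_prior, R_beta, a_sigma, b_sigma are illustrative placeholders, not values from the slides):

```python
import numpy as np
from scipy.stats import invgamma

def draw_from_prior(beta_0_prior, R_beta, a_sigma, b_sigma, rng=None):
    """One draw of (beta, sigma^2) from the conjugate-style prior:
    beta | sigma^2 ~ N(beta_0, sigma^2 R_beta), sigma^2 ~ IG(a_sigma, b_sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma2 = invgamma.rvs(a_sigma, scale=b_sigma, random_state=rng)
    L = np.linalg.cholesky(sigma2 * R_beta)
    beta = beta_0_prior + L @ rng.standard_normal(len(beta_0_prior))
    return beta, sigma2
```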

SLIDE 14

The Gibbs sampler

The Gibbs sampler: if θ = (θ1, . . . , θp) are the parameters in our model, we provide a set of initial values $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_p^{(0)})$ and then perform the j-th iteration, say for j = 1, . . . , M, by updating successively from the full conditional distributions:

$$\theta_1^{(j)} \sim p(\theta_1 \mid \theta_2^{(j-1)}, \ldots, \theta_p^{(j-1)}, y)$$
$$\theta_2^{(j)} \sim p(\theta_2 \mid \theta_1^{(j)}, \theta_3^{(j-1)}, \ldots, \theta_p^{(j-1)}, y)$$
$$\vdots$$
(the generic k-th element)
$$\theta_k^{(j)} \sim p(\theta_k \mid \theta_1^{(j)}, \ldots, \theta_{k-1}^{(j)}, \theta_{k+1}^{(j-1)}, \ldots, \theta_p^{(j-1)}, y)$$
$$\vdots$$
$$\theta_p^{(j)} \sim p(\theta_p \mid \theta_1^{(j)}, \ldots, \theta_{p-1}^{(j)}, y)$$

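A generic skeleton of this update scheme (the full-conditional sampler functions are hypothetical placeholders to be supplied per model):

```python
import numpy as np

def gibbs_sampler(initial_theta, full_conditional_samplers, y, M=1000):
    """Generic Gibbs sampler.

    full_conditional_samplers[k](theta, y) must return a draw of theta_k
    from p(theta_k | theta_{-k}, y), using the current values in theta.
    """
    theta = np.array(initial_theta, dtype=float)
    draws = np.empty((M, len(theta)))
    for j in range(M):
        for k, sampler in enumerate(full_conditional_samplers):
            # theta already holds components 1..k-1 from iteration j
            # and components k+1..p from iteration j-1.
            theta[k] = sampler(theta, y)
        draws[j] = theta
    return draws
```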

SLIDE 15

The Gibbs sampler

In principle, the Gibbs sampler will work for extremely complex hierarchical models. The only issue is sampling from the full conditionals: when these are not available in closed form, they may not be amenable to easy sampling. A more general and extremely powerful algorithm, and often easier to code, is the Metropolis-Hastings (MH) algorithm. This algorithm also constructs a Markov chain, but does not require the full conditionals to be tractable. A popular approach is to embed Metropolis steps within Gibbs to draw from full conditionals that cannot be sampled from directly.

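A sketch of one random-walk Metropolis step that could replace a closed-form full-conditional draw inside the Gibbs loop above (log_full_conditional and the proposal scale are illustrative placeholders):

```python
import numpy as np

def metropolis_step(theta, k, log_full_conditional, y, scale=0.5, rng=None):
    """Random-walk Metropolis update for component theta_k, targeting
    p(theta_k | theta_{-k}, y) known only up to a normalizing constant."""
    rng = np.random.default_rng() if rng is None else rng
    proposal = theta.copy()
    proposal[k] = theta[k] + scale * rng.standard_normal()

    log_ratio = (log_full_conditional(proposal, y)
                 - log_full_conditional(theta, y))
    # Accept with probability min(1, exp(log_ratio)).
    if np.log(rng.uniform()) < log_ratio:
        return proposal
    return theta
```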