Regression
Applied Bayesian Statistics
Dr. Earvin Balderama
Department of Mathematics & Statistics, Loyola University Chicago
October 24, 2017
Suppose the data are $Y_i \overset{ind}{\sim} \text{Normal}(\mu, \sigma^2)$. If we want to model $Y_i$ in terms of covariate information, what we are really saying is that $\mu$ varies across the $i = 1, \dots, n$ observations (instead of remaining constant). The (multiple) linear regression model is
$$Y_i \overset{ind}{\sim} \text{Normal}(\mu_i, \sigma^2), \qquad \mu_i = \beta_0 + X_{i1}\beta_1 + \cdots + X_{ip}\beta_p.$$
Bayesian and classical linear regression are similar if $n \gg p$ and the priors are uninformative. However, the results can differ for challenging problems, and the interpretation is always different.
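As a concrete illustration, here is a minimal R sketch that simulates data from this model (the seed, sizes, and coefficient values are all illustrative choices, not from the notes):

```r
# Simulate n observations from the linear regression model above
set.seed(1)                        # illustrative seed
n <- 100; p <- 3                   # illustrative sizes
X <- matrix(rnorm(n * p), n, p)    # covariates
beta <- c(2, 0.5, -1, 0)           # true (beta0, beta1, ..., betap)
sigma <- 1
mu <- beta[1] + X %*% beta[-1]     # mu_i = beta0 + Xi1*beta1 + ... + Xip*betap
Y <- rnorm(n, mean = mu, sd = sigma)
```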
The least squares estimate of $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ is
$$\hat{\beta}_{OLS} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} (Y_i - \mu_i)^2, \quad \text{where } \mu_i = \beta_0 + X_{i1}\beta_1 + \cdots + X_{ip}\beta_p.$$
If the errors are Gaussian, then the likelihood is
$$f(\mathbf{Y} \mid \beta, \sigma^2) \propto \prod_{i=1}^{n} \exp\left[-\frac{(Y_i - \mu_i)^2}{2\sigma^2}\right] = \exp\left[-\frac{\sum_{i=1}^{n} (Y_i - \mu_i)^2}{2\sigma^2}\right],$$
so $\hat{\beta}_{OLS}$ is also the MLE. Note: $\hat{\beta}_{OLS}$ is unbiased even if the errors are non-Gaussian.
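A quick numerical check of this equivalence, on illustrative simulated data: minimizing the sum of squared errors with a generic optimizer recovers the same estimates as R's lm().

```r
# Minimizing the SSE directly reproduces the lm() estimates
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- rnorm(n, mean = 2 + X %*% c(0.5, -1, 0), sd = 1)

sse <- function(b) sum((Y - (b[1] + X %*% b[-1]))^2)   # objective
fit_optim <- optim(rep(0, p + 1), sse, method = "BFGS")
fit_lm    <- lm(Y ~ X)
cbind(optim = fit_optim$par, lm = coef(fit_lm))        # should agree closely
```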
In matrix notation, let $\mathbf{Y} = (Y_1, \dots, Y_n)^T$ be the response vector and $X$ be the $n \times (p+1)$ matrix of covariates (whose first column is all ones, for the intercept). Then the mean of $\mathbf{Y}$ is $X\beta$, and the least squares solution is
$$\hat{\beta}_{OLS} = \operatorname*{argmin}_{\beta} (\mathbf{Y} - X\beta)^T(\mathbf{Y} - X\beta) = (X^TX)^{-1}X^T\mathbf{Y}.$$
If the errors are Gaussian, then the sampling distribution is
$$\hat{\beta}_{OLS} \sim \text{Normal}\left(\beta, \sigma^2 (X^TX)^{-1}\right).$$
Note: If $\sigma^2$ is estimated, then the sampling distribution is multivariate t.
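A minimal R sketch of the closed-form solution (illustrative data; the names X0, beta_hat, and V are mine, not from the notes):

```r
# Closed-form OLS via the normal equations
set.seed(1)
n <- 100; p <- 3
X0 <- matrix(rnorm(n * p), n, p)
Y  <- rnorm(n, mean = 2 + X0 %*% c(0.5, -1, 0), sd = 1)
X  <- cbind(1, X0)                                  # n x (p+1), column of ones first

beta_hat <- solve(crossprod(X), crossprod(X, Y))    # (X'X)^{-1} X'Y
sigma2   <- 1                                       # treat sigma^2 as known for now
V        <- sigma2 * solve(crossprod(X))            # sampling covariance of beta_hat
```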
The likelihood is $Y_i \overset{ind}{\sim} \text{Normal}(\mu_i, \sigma^2)$. We will need to set priors for $\beta_0, \beta_1, \dots, \beta_p, \sigma^2$. Note: For the purpose of setting priors and better interpretability of the coefficient estimates later on, it is helpful to standardize both the response and each covariate to have mean 0 and variance 1 (see the short sketch after this list). Many priors for $\beta$ have been considered:
1. Improper priors
2. Gaussian priors
3. Double-exponential priors
4. ...
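The standardization mentioned in the note above is a one-liner in R; a minimal sketch with made-up data:

```r
# Standardize the response and each covariate to mean 0, variance 1
set.seed(1)
X <- matrix(rnorm(100 * 3), 100, 3)
Y <- rnorm(100, mean = 2 + X %*% c(0.5, -1, 0), sd = 1)
Ys <- as.numeric(scale(Y))   # (Y - mean(Y)) / sd(Y)
Xs <- scale(X)               # column-wise centering and scaling
```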
The flat prior is $f(\beta) = 1$. (This is also the Jeffreys prior.) Note: This prior is improper, but the posterior is proper under the same conditions required by least squares. With $\sigma^2$ known, the posterior is
$$\beta \mid \mathbf{Y} \sim \text{Normal}\left(\hat{\beta}_{OLS}, \sigma^2 (X^TX)^{-1}\right).$$
Therefore, shouldn't the results be similar to least squares? How do they differ?
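A hedged sketch of sampling from this posterior, assuming $\sigma^2$ is known and using the mvtnorm package (my choice; any multivariate normal sampler would do). The point estimates match OLS, but we get a full posterior to summarize.

```r
# Draw from the flat-prior posterior Normal(beta_hat_OLS, sigma^2 (X'X)^{-1})
library(mvtnorm)
set.seed(1)
X0 <- matrix(rnorm(100 * 3), 100, 3)
Y  <- rnorm(100, mean = 2 + X0 %*% c(0.5, -1, 0), sd = 1)
X  <- cbind(1, X0)

sigma2   <- 1                                     # treated as known here
beta_hat <- solve(crossprod(X), crossprod(X, Y))
V        <- sigma2 * solve(crossprod(X))
draws    <- rmvnorm(5000, mean = drop(beta_hat), sigma = V)
colMeans(draws)                              # approximately beta_hat_OLS
apply(draws, 2, quantile, c(0.025, 0.975))   # posterior credible intervals
```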
Because we rarely know $\sigma$, we set a prior for the error variance $\sigma^2$, typically an InverseGamma$(a, b)$ with $a$ and $b$ set to something small, say $a = b = 0.01$. The posterior for $\beta$ then follows a multivariate t centered at $\hat{\beta}_{OLS}$. The Jeffreys prior is $f(\beta, \sigma^2) = \frac{1}{\sigma^2}$, which is the limit as $a, b \to 0$.
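Both full conditionals are conjugate here, so a simple Gibbs sampler works. Below is a minimal sketch under a flat prior on $\beta$ and an InverseGamma(0.01, 0.01) prior on $\sigma^2$; all names and settings are illustrative.

```r
# Gibbs sampler: flat prior on beta, InverseGamma(a, b) prior on sigma^2
library(mvtnorm)
set.seed(1)
X0 <- matrix(rnorm(100 * 3), 100, 3)
Y  <- rnorm(100, mean = 2 + X0 %*% c(0.5, -1, 0), sd = 1)
X  <- cbind(1, X0); n <- nrow(X)

a <- b <- 0.01
XtX_inv  <- solve(crossprod(X))
beta_hat <- XtX_inv %*% crossprod(X, Y)

S <- 5000
beta   <- matrix(NA, S, ncol(X))
sigma2 <- numeric(S)
s2 <- 1                                   # initial value
for (s in 1:S) {
  # beta | sigma^2, Y ~ Normal(beta_hat, sigma^2 (X'X)^{-1})
  beta[s, ] <- rmvnorm(1, mean = drop(beta_hat), sigma = s2 * XtX_inv)
  # sigma^2 | beta, Y ~ InverseGamma(a + n/2, b + SSE/2)
  sse <- sum((Y - X %*% beta[s, ])^2)
  s2  <- 1 / rgamma(1, a + n/2, b + sse/2)
  sigma2[s] <- s2
}
```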
Another common prior is Zellner's g-prior,
$$\beta \sim \text{Normal}\left(\mathbf{0}, \frac{\sigma^2}{g} (X^TX)^{-1}\right).$$
Note: This prior is proper assuming $X$ is full rank. The posterior mean is
$$\frac{1}{1+g} \, \hat{\beta}_{OLS}.$$
Note: $g$ controls the amount of shrinkage. $g = \frac{1}{n}$ is common, and is called the unit information prior.
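A tiny illustration of the resulting shrinkage under the unit information prior (names made up; builds on the OLS sketch above):

```r
# Posterior mean under the g-prior: shrink the OLS estimate toward 0
set.seed(1)
X0 <- matrix(rnorm(100 * 3), 100, 3)
Y  <- rnorm(100, mean = 2 + X0 %*% c(0.5, -1, 0), sd = 1)
X  <- cbind(1, X0); n <- nrow(X)

beta_hat <- solve(crossprod(X), crossprod(X, Y))
g <- 1 / n                      # unit information prior
beta_g <- beta_hat / (1 + g)    # very mild shrinkage when n is large
```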
If there are many covariates, or if the covariates are collinear, then $\hat{\beta}_{OLS}$ is unstable. Independent priors can counteract collinearity:
$$\beta_j \overset{ind}{\sim} \text{Normal}\left(0, \frac{\sigma^2}{g}\right).$$
The posterior mode is
$$\operatorname*{argmin}_{\beta} \sum_{i=1}^{n} (Y_i - \mu_i)^2 + g \sum_{j=1}^{p} \beta_j^2.$$
Note: In classical statistics, this is known as the ridge regression solution and is used to stabilize the least squares solution.
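The ridge posterior mode has the familiar closed form $(X^TX + gI)^{-1}X^T\mathbf{Y}$. A minimal sketch on standardized data, so there is no intercept to worry about (g = 5 is an arbitrary illustrative value):

```r
# Ridge / posterior-mode solution (X'X + g I)^{-1} X'Y on standardized data
set.seed(1)
X <- scale(matrix(rnorm(100 * 3), 100, 3))
Y <- as.numeric(scale(rnorm(100, mean = X %*% c(0.5, -1, 0), sd = 1)))

g <- 5                                        # illustrative shrinkage level
beta_ridge <- solve(crossprod(X) + g * diag(ncol(X)), crossprod(X, Y))
beta_ols   <- solve(crossprod(X), crossprod(X, Y))
cbind(ridge = beta_ridge, ols = beta_ols)     # ridge estimates pulled toward 0
```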
An increasingly popular prior is the double-exponential, or Bayesian LASSO (BLASSO), prior. The prior is $\beta_j \sim \text{DE}(\tau)$ with PDF
$$f(\beta) \propto \exp\left(-\frac{|\beta|}{\tau}\right).$$
Note the absolute value. The shape of the PDF is more peaked at 0 than a Gaussian. This favors settings where there are many $\beta_j$ near zero and a few large $\beta_j$; that is, $p$ is large but most of the covariates are noise.
[Figure: Gaussian vs. BLASSO (double-exponential) prior densities for β, plotted over −3 ≤ β ≤ 3; the BLASSO density is more sharply peaked at 0.]
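A short R sketch that approximately reproduces this figure; the DE density is written out directly so no extra package is needed, and τ = 1/√2 makes both priors have variance 1 (my assumption about how the figure was scaled):

```r
# Standard Gaussian vs. double-exponential (BLASSO) prior densities
curve(dnorm(x), from = -3, to = 3, ylim = c(0, 1),
      xlab = expression(beta), ylab = "Prior density", lty = 1)
tau <- 1 / sqrt(2)                             # DE variance is 2*tau^2 = 1
curve(exp(-abs(x) / tau) / (2 * tau), add = TRUE, lty = 2)
legend("topright", legend = c("Gaussian", "BLASSO"), lty = 1:2)
```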
The posterior mode is
$$\operatorname*{argmin}_{\beta} \sum_{i=1}^{n} (Y_i - \mu_i)^2 + g \sum_{j=1}^{p} |\beta_j|.$$
Note: In classical statistics, this is known as the LASSO solution and is used to add stability by shrinking estimates towards 0, and also by setting some coefficients exactly to 0. Covariates with coefficients set to 0 can be removed from the analysis. Therefore, LASSO performs variable selection and estimation simultaneously!
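For comparison, the classical LASSO solution can be computed with the glmnet package (not mentioned in these notes; note also that its tuning parameter λ is scaled differently from the g above, since glmnet divides the squared-error term by the sample size):

```r
# Classical LASSO via glmnet; many covariates are pure noise by construction
library(glmnet)
set.seed(1)
X <- scale(matrix(rnorm(100 * 10), 100, 10))    # 10 covariates, most are noise
Y <- as.numeric(X %*% c(2, -1.5, rep(0, 8)) + rnorm(100))

fit <- glmnet(X, Y, alpha = 1)    # alpha = 1 gives the LASSO penalty
coef(fit, s = 0.1)                # several coefficients set exactly to 0
```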
In logistic regression, we have a binary response $Y_i \in \{0, 1\}$, and
$$\text{logit}[P(Y_i = 1)] = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip},$$
so that
$$P(Y_i = 1) = \text{expit}(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}) \in [0, 1].$$
The logit link is the log-odds, $\text{logit}(x) = \log\left(\frac{x}{1-x}\right)$, and its inverse is $\text{expit}(x) = \frac{e^x}{1 + e^x}$. The coefficient $\beta_j$ represents the change in the log odds of $Y_i = 1$ corresponding to a one-unit increase in $X_{ij}$.
All of the priors discussed apply. Computationally, however, the full conditional distributions are no longer conjugate, so we must use Metropolis sampling.
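A minimal random-walk Metropolis sketch for logistic regression, with illustrative Normal(0, 4) priors on the coefficients and a hand-tuned proposal standard deviation (all settings are my assumptions, not from the notes):

```r
# Random-walk Metropolis for logistic regression
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))                        # intercept + one covariate
p_true <- 1 / (1 + exp(-(X %*% c(-0.5, 1))))   # expit of the linear predictor
Y <- rbinom(n, 1, p_true)

log_post <- function(beta) {
  eta <- X %*% beta
  sum(Y * eta - log(1 + exp(eta))) +           # Bernoulli log-likelihood
    sum(dnorm(beta, 0, 2, log = TRUE))         # Normal(0, 4) log-prior
}

S <- 5000
beta <- matrix(NA, S, 2)
b    <- c(0, 0)                                # initial value
lp   <- log_post(b)
for (s in 1:S) {
  cand    <- b + rnorm(2, 0, 0.2)              # random-walk proposal
  lp_cand <- log_post(cand)
  if (log(runif(1)) < lp_cand - lp) { b <- cand; lp <- lp_cand }
  beta[s, ] <- b
}
colMeans(beta[-(1:1000), ])                    # posterior means after burn-in
```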
The dlaplace function in the rmutil package gives the density of the double-exponential (Laplace) distribution. Of course, there are also rlaplace, plaplace, etc.
The BLR function in the BLR package is probably the most common for Bayesian linear regression. It also works well for BLASSO, and is super fast.
The MCMClogit function in the MCMCpack package performs Metropolis sampling efficiently for logistic regression. The MCMCpack package also includes functions for several other regression methods, e.g., MCMCpoisson and MCMCprobit for Poisson and probit regression, respectively.
Another option is to code your own MCMC sampler in R.
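A hedged usage sketch for two of these functions; the argument names follow the package documentation as I recall it, so check ?dlaplace and ?MCMClogit before relying on them:

```r
library(rmutil)
library(MCMCpack)

# Double-exponential (Laplace) density: location m, dispersion s
dlaplace(0, m = 0, s = 1)

# Metropolis sampling for logistic regression; b0/B0 give the Normal prior
# mean and precision on the coefficients (illustrative data)
df  <- data.frame(y = rbinom(100, 1, 0.5), x = rnorm(100))
fit <- MCMClogit(y ~ x, data = df, burnin = 1000, mcmc = 10000,
                 b0 = 0, B0 = 0.01)
summary(fit)
```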