Regression: Applied Bayesian Statistics (Dr. Earvin Balderama)

SLIDE 1

Regression

Applied Bayesian Statistics

  • Dr. Earvin Balderama

Department of Mathematics & Statistics Loyola University Chicago

October 24, 2017

Regression Last edited October 24, 2017 by <ebalderama@luc.edu>

SLIDE 2

MCMC

Bayesian linear regression

Suppose the data are
$$Y_i \overset{ind}{\sim} \text{Normal}(\mu, \sigma^2).$$
If we want to model $Y_i$ in terms of covariate information, what we are really saying is that $\mu$ varies across the $i = 1, \ldots, n$ observations (instead of remaining constant). The (multiple) linear regression model is
$$Y_i \overset{ind}{\sim} \text{Normal}(\mu_i, \sigma^2), \qquad \mu_i = \beta_0 + X_{i1}\beta_1 + \cdots + X_{ip}\beta_p.$$
Bayesian and classical linear regression are similar if $n \gg p$ and the priors are uninformative. However, the results can be different for challenging problems, and the interpretation is always different.
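To make the model concrete, here is a minimal simulation sketch in Python (the deck's software is R; the dimensions and coefficient values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from the multiple linear regression model:
# Y_i ~ Normal(mu_i, sigma^2), with mu_i = beta0 + X_i1*beta1 + ... + X_ip*betap
n, p = 100, 3
beta = np.array([1.0, 0.5, -2.0, 0.0])   # (beta0, beta1, ..., betap), chosen arbitrarily
sigma = 1.5

X = rng.normal(size=(n, p))              # covariate matrix
mu = beta[0] + X @ beta[1:]              # the mean varies across observations
Y = rng.normal(loc=mu, scale=sigma)      # responses

print(Y.shape)   # (100,)
```

The key point the code makes visible: the variance is shared, but each observation gets its own mean $\mu_i$.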


SLIDE 3

Ordinary Least Squares

The least squares estimate of $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ is
$$\hat{\beta}_{OLS} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} (Y_i - \mu_i)^2,$$
where $\mu_i = \beta_0 + X_{i1}\beta_1 + \cdots + X_{ip}\beta_p$. If the errors are Gaussian, then the likelihood is
$$f(Y \mid \beta, \sigma^2) \propto \prod_{i=1}^{n} \exp\left[\frac{-(Y_i - \mu_i)^2}{2\sigma^2}\right] = \exp\left[\frac{-\sum_{i=1}^{n}(Y_i - \mu_i)^2}{2\sigma^2}\right].$$
Therefore, if the errors are Gaussian, $\hat{\beta}_{OLS}$ is also the MLE.

Note: $\hat{\beta}_{OLS}$ is unbiased even if the errors are non-Gaussian.
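A quick numerical check of this equivalence (an illustrative Python sketch, not from the slides): because the Gaussian log-likelihood is, up to constants, $-\text{SSE}/(2\sigma^2)$, any $\beta$ that lowers the sum of squares raises the likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
beta_true = np.array([1.0, 2.0, -1.0])
sigma2 = 1.0
Y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

def sse(b):
    r = Y - X @ b
    return r @ r

def gaussian_loglik(b):
    # log f(Y | beta, sigma^2), dropping additive constants: -SSE / (2 sigma^2)
    return -sse(b) / (2 * sigma2)

# Perturbing beta_ols increases the SSE and decreases the likelihood,
# so under Gaussian errors the OLS estimate is also the MLE.
b_alt = beta_ols + np.array([0.1, -0.1, 0.1])
assert sse(beta_ols) < sse(b_alt)
assert gaussian_loglik(beta_ols) > gaussian_loglik(b_alt)
```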


SLIDE 4

Ordinary Least Squares

In matrix notation, let $Y = (Y_1, \ldots, Y_n)^T$ be the response vector and $X$ be the $n \times (p+1)$ matrix of covariates. Then the mean of $Y$ is $X\beta$ and the least squares solution is
$$\hat{\beta}_{OLS} = \underset{\beta}{\operatorname{argmin}} \; (Y - X\beta)^T(Y - X\beta) = (X^TX)^{-1}X^TY.$$
If the errors are Gaussian, then the sampling distribution is
$$\hat{\beta}_{OLS} \sim \text{Normal}\left(\beta, \; \sigma^2 (X^TX)^{-1}\right).$$
Note: If $\sigma^2$ is estimated, then the sampling distribution is multivariate t.
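The closed form above translates directly into a few lines of Python (illustrative sketch; `solve` is used instead of an explicit matrix inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n x (p+1) design
beta = np.array([0.5, 1.0, -1.5])
sigma = 2.0
Y = X @ beta + rng.normal(scale=sigma, size=n)

# Closed-form least squares solution (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Sampling covariance of beta_hat under Gaussian errors: sigma^2 (X^T X)^{-1}
cov_hat = sigma**2 * np.linalg.inv(X.T @ X)
print(beta_hat.round(2), np.sqrt(np.diag(cov_hat)).round(3))
```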


SLIDE 5

Bayesian linear regression

The likelihood is
$$Y_i \overset{ind}{\sim} \text{Normal}(\beta_0 + X_{i1}\beta_1 + \cdots + X_{ip}\beta_p, \; \sigma^2).$$
We will need to set priors for $\beta_0, \beta_1, \ldots, \beta_p, \sigma^2$.

Note: For the purpose of setting priors and better interpretability of coefficient estimates later on, it is helpful to standardize both the response and each covariate to have mean 0 and variance 1.

Many priors for $\beta$ have been considered:

1. Improper priors
2. Gaussian priors
3. Double-exponential priors
4. ...
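The standardization step mentioned in the note can be sketched in a few lines of Python (`standardize` is a hypothetical helper, not part of any package named on the slides):

```python
import numpy as np

def standardize(Z):
    """Rescale each column to have mean 0 and variance 1."""
    Z = np.asarray(Z, dtype=float)
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)

rng = np.random.default_rng(3)
X = rng.normal(loc=10.0, scale=5.0, size=(50, 4))   # raw covariates
Y = rng.normal(loc=-2.0, size=50)                   # raw response

Xs, Ys = standardize(X), standardize(Y)
print(Xs.mean(axis=0).round(8), Ys.std().round(8))
```

After this transformation, a prior like Normal(0, 1) on each coefficient has roughly the same meaning for every covariate.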


SLIDE 6

Improper priors

The flat prior is $f(\beta) = 1$. (This is also the Jeffreys prior.)

Note: This is improper, but the posterior is proper under the same conditions required by least squares. With $\sigma^2$ known, the posterior is
$$\beta \mid Y \sim \text{Normal}\left(\hat{\beta}_{OLS}, \; \sigma^2 (X^TX)^{-1}\right).$$
Therefore, shouldn't the results be similar to least squares? How do they differ?
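Since the flat-prior posterior with known $\sigma^2$ is Gaussian in closed form, we can sample it directly; this Python sketch (illustrative, with arbitrary simulated data) shows the posterior draws centering on the OLS estimate:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)
sigma2 = 1.0                       # treated as known here

XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ Y

# Under the flat prior f(beta) = 1 with sigma^2 known:
# beta | Y ~ Normal(beta_ols, sigma^2 (X^T X)^{-1}) -- draw from it directly.
draws = rng.multivariate_normal(beta_ols, sigma2 * XtX_inv, size=5000)
print(draws.mean(axis=0).round(2))  # close to beta_ols

# Point estimates agree with least squares; the difference is interpretation:
# the draws describe a posterior distribution for beta, not a sampling
# distribution of an estimator.
```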


SLIDE 7

Improper priors

Because we rarely know $\sigma^2$, we set a prior for the error variance, typically $\sigma^2 \sim \text{InverseGamma}(a, b)$ with $a$ and $b$ set to something small, say $a = b = 0.01$. The posterior for $\beta$ then follows a multivariate t centered at $\hat{\beta}_{OLS}$. The Jeffreys prior is $f(\beta, \sigma^2) = 1/\sigma^2$, which is the limit as $a, b \to 0$.
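With a flat prior on $\beta$ and an InverseGamma$(a,b)$ prior on $\sigma^2$, both full conditionals are available in closed form, so a Gibbs sampler works. This is a Python sketch under those assumptions (the slides do not show this sampler; the data are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=1.5, size=n)

a = b = 0.01                               # vague InverseGamma(a, b) prior on sigma^2
XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ Y

# Gibbs sampler; the full conditionals are conjugate:
#   beta | sigma^2, Y  ~ Normal(beta_ols, sigma^2 (X^T X)^{-1})
#   sigma^2 | beta, Y  ~ InvGamma(a + n/2, b + SSE(beta)/2)
sigma2 = 1.0
keep_s2 = []
for it in range(2000):
    beta_draw = rng.multivariate_normal(beta_ols, sigma2 * XtX_inv)
    sse = np.sum((Y - X @ beta_draw) ** 2)
    # InverseGamma(shape, rate) draw via 1 / Gamma(shape, scale=1/rate)
    sigma2 = 1.0 / rng.gamma(a + n / 2, 1.0 / (b + sse / 2))
    if it >= 500:                          # discard burn-in
        keep_s2.append(sigma2)

print(np.mean(keep_s2))   # posterior mean of sigma^2, near the true 1.5^2 = 2.25
```

Marginally over $\sigma^2$, the draws of $\beta$ from such a sampler follow the multivariate t mentioned above.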


SLIDE 8

Multivariate normal prior

Another common prior is Zellner's g-prior:
$$\beta \sim \text{Normal}\left(0, \; \frac{\sigma^2}{g}\left(X^TX\right)^{-1}\right).$$
Note: This prior is proper assuming $X$ is full rank. The posterior mean is
$$\frac{1}{1+g}\,\hat{\beta}_{OLS}.$$
Note: $g$ controls the amount of shrinkage. $g = \frac{1}{n}$ is common, and is called the unit information prior.
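The shrinkage factor can be verified numerically: combining the likelihood precision $X^TX/\sigma^2$ with the g-prior precision $gX^TX/\sigma^2$ gives posterior mean $((1+g)X^TX)^{-1}X^TY = \hat{\beta}_{OLS}/(1+g)$. A Python sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
X = rng.normal(size=(n, 3))                   # standardized covariates, no intercept
Y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Under Zellner's g-prior beta ~ Normal(0, (sigma^2/g)(X^T X)^{-1}),
# the posterior mean is the OLS estimate shrunk toward zero by 1/(1+g).
g = 1.0 / n                                   # unit information prior
post_mean = beta_ols / (1.0 + g)

# Sanity check: computing the posterior mean directly from the combined
# precision (1+g) X^T X / sigma^2 gives the same answer.
direct = np.linalg.solve((1.0 + g) * (X.T @ X), X.T @ Y)
print(np.allclose(post_mean, direct))   # True
```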


SLIDE 9

Univariate Gaussian priors

If there are many covariates, or if the covariates are collinear, then $\hat{\beta}_{OLS}$ is unstable. Independent priors can counteract collinearity:
$$\beta_j \overset{ind}{\sim} \text{Normal}\left(0, \; \frac{\sigma^2}{g}\right).$$
The posterior mode is
$$\underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} (Y_i - \mu_i)^2 + g \sum_{j=1}^{p} \beta_j^2.$$
Note: In classical statistics, this is known as the ridge regression solution and is used to stabilize the least squares solution.
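The ridge objective has the closed-form minimizer $(X^TX + gI)^{-1}X^TY$, which is easy to try on a deliberately collinear design (a Python sketch with illustrative data):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 5
# Collinear design: the last column nearly duplicates the first
X = rng.normal(size=(n, p))
X[:, -1] = X[:, 0] + 0.01 * rng.normal(size=n)
Y = X[:, 0] - X[:, 1] + rng.normal(size=n)

# Posterior mode under independent Normal(0, sigma^2/g) priors
# = ridge solution: (X^T X + g I)^{-1} X^T Y
g = 5.0
beta_ridge = np.linalg.solve(X.T @ X + g * np.eye(p), X.T @ Y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Ridge shrinks the coefficient vector and stabilizes the collinear fit
print(np.linalg.norm(beta_ridge), np.linalg.norm(beta_ols))
```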


SLIDE 10

BLASSO

An increasingly popular prior is the double-exponential or Bayesian LASSO prior. The prior is $\beta_j \sim \text{DE}(\tau)$ with PDF
$$f(\beta) \propto \exp\left(\frac{-|\beta|}{\tau}\right).$$
Note: Basically, the squared term in the Gaussian prior is replaced with an absolute value. The shape of the PDF is more peaked at 0. This favors settings where there are many $\beta_j$ near zero and a few large $\beta_j$; that is, $p$ is large but most of the covariates are noise.
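The "peaked at 0, heavier in the tails" shape can be checked directly by comparing densities at matched variance (a Laplace with scale $\tau$ has variance $2\tau^2$). An illustrative Python sketch:

```python
import numpy as np

def de_pdf(beta, tau=1.0):
    """Normalized double-exponential (Laplace) density with scale tau."""
    return np.exp(-np.abs(beta) / tau) / (2.0 * tau)

def normal_pdf(beta, sd=1.0):
    return np.exp(-beta**2 / (2 * sd**2)) / (sd * np.sqrt(2 * np.pi))

# Laplace(tau=1) has variance 2*tau^2 = 2, so compare against Normal(0, sd=sqrt(2)).
sd = np.sqrt(2.0)
print(de_pdf(0.0), normal_pdf(0.0, sd))   # Laplace is higher at 0 (more peaked)
print(de_pdf(5.0), normal_pdf(5.0, sd))   # Laplace is higher at 5 (heavier tail)
```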


SLIDE 11

BLASSO

[Figure: prior densities for $\beta$, comparing the Gaussian and BLASSO (double-exponential) priors.]


SLIDE 12

BLASSO

The posterior mode is
$$\underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} (Y_i - \mu_i)^2 + g \sum_{j=1}^{p} |\beta_j|.$$
Note: In classical statistics, this is known as the LASSO solution and is used to add stability by shrinking estimates towards 0, and also setting some coefficients to 0. Covariates with coefficients set to 0 can be removed from the analysis. Therefore, LASSO performs variable selection and estimation simultaneously!
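One standard classical way to compute this posterior mode is coordinate descent with soft-thresholding (not covered on the slide; the sketch below is an illustrative Python implementation, with simulated data in which only two covariates carry signal):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, Y, g, iters=200):
    """Coordinate descent for argmin ||Y - X b||^2 + g * sum_j |b_j|."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(p):
            r_j = Y - X @ b + X[:, j] * b[j]   # partial residual excluding j
            b[j] = soft_threshold(X[:, j] @ r_j, g / 2.0) / col_ss[j]
    return b

rng = np.random.default_rng(8)
n, p = 100, 6
X = rng.normal(size=(n, p))
Y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)   # only 2 real signals

b = lasso_cd(X, Y, g=50.0)
print(b.round(2))   # most noise coefficients shrink to exactly 0
```

The absolute-value penalty is what makes exact zeros possible; the ridge ($\beta_j^2$) penalty on the previous slides only shrinks coefficients toward zero.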


SLIDE 13

Logistic regression

In logistic regression, we have a binary response $Y_i \in \{0, 1\}$, and
$$\text{logit}[P(Y_i = 1)] = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}$$
$$P(Y_i = 1) = \text{expit}(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}) \in [0, 1]$$
The logit link is the log-odds:
$$\text{logit}(x) = \log\left(\frac{x}{1-x}\right).$$
The expit transformation is its inverse:
$$\text{expit}(x) = \frac{e^x}{1 + e^x}.$$
Each $\beta_j$ represents the change in the log odds of $Y_i = 1$ corresponding to a one-unit increase in covariate $j$.

All of the priors discussed apply. Computationally, the full conditional distributions are no longer conjugate, so we must use Metropolis sampling.
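A random-walk Metropolis sampler for logistic regression can be sketched as follows (illustrative Python, assuming independent Normal(0, 10²) priors and simulated data; step size and burn-in are arbitrary choices):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_post(beta, X, Y, prior_sd=10.0):
    """Log posterior: Bernoulli log-likelihood + independent Normal(0, prior_sd^2) priors."""
    eta = X @ beta
    # log-likelihood written with the numerically stable log(1 + e^eta) term
    loglik = np.sum(Y * eta - np.logaddexp(0.0, eta))
    logprior = -0.5 * np.sum(beta**2) / prior_sd**2
    return loglik + logprior

rng = np.random.default_rng(9)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = rng.binomial(1, expit(X @ np.array([-0.5, 1.0])))

# Random-walk Metropolis: propose beta' = beta + noise, accept with
# probability min(1, exp(log_post(beta') - log_post(beta))).
beta = np.zeros(2)
cur = log_post(beta, X, Y)
draws = []
for it in range(5000):
    prop = beta + 0.1 * rng.normal(size=2)
    cand = log_post(prop, X, Y)
    if np.log(rng.uniform()) < cand - cur:
        beta, cur = prop, cand
    draws.append(beta.copy())

print(np.mean(draws[1000:], axis=0).round(2))  # near the true (-0.5, 1.0)
```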


SLIDE 14

Bayesian regression packages in R

  • The dlaplace function in the rmutil package gives the density of the double-exponential (Laplace) distribution. Of course, there are also rlaplace, plaplace, etc.
  • The BLR function in the BLR package is probably the most common for Bayesian linear regression. It also works well for BLASSO, and is super fast.
  • The MCMClogit function in the MCMCpack package performs Metropolis sampling efficiently for logistic regression.
  • The MCMCpack package also includes many other functions for several other regression methods, e.g., MCMCpoisson and MCMCprobit for Poisson and probit regression, respectively.
  • Another option is to code your own MCMC sampler in R.
