SLIDE 1

Case Study: Bayesian Linear Regression and Sparse Bayesian Models

Piyush Rai

  • Dept. of CSE, IIT Kanpur

(Mini-course: lecture 2) Nov 05, 2015

SLIDE 2

Recap

SLIDE 3

Maximum Likelihood Estimation (MLE)

We wish to estimate parameters θ from observed data {x_1, ..., x_N}
MLE does this by finding the θ that maximizes the (log-)likelihood p(X|θ):

$$\hat{\theta} = \arg\max_{\theta} \log p(X|\theta) = \arg\max_{\theta} \log \prod_{n=1}^{N} p(x_n|\theta) = \arg\max_{\theta} \sum_{n=1}^{N} \log p(x_n|\theta)$$

MLE thus reduces to solving an optimization problem w.r.t. θ
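As a concrete illustration (my sketch, not from the slides), MLE can be carried out numerically by minimizing the negative log-likelihood. Here the data is assumed Gaussian, so the closed-form answer (sample mean and variance) lets you check the result:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=500)  # observed data {x_1, ..., x_N}

def neg_log_lik(params):
    """Negative log-likelihood of a Gaussian; optimize log-variance so it stays positive."""
    mu, log_var = params
    var = np.exp(log_var)
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var)

theta_hat = minimize(neg_log_lik, x0=np.zeros(2)).x
# theta_hat[0] should be close to X.mean(), exp(theta_hat[1]) close to X.var()
```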

SLIDE 4

Maximum-a-Posteriori (MAP) Estimation

Incorporating prior knowledge p(θ) about the parameters, MAP estimation finds the θ that maximizes the posterior p(θ|X) ∝ p(X|θ)p(θ):

$$\hat{\theta} = \arg\max_{\theta} \log \prod_{n=1}^{N} p(x_n|\theta)\,p(\theta) = \arg\max_{\theta} \sum_{n=1}^{N} \log p(x_n|\theta) + \log p(\theta)$$

MAP thus also reduces to solving an optimization problem w.r.t. θ
The objective function is very similar to MLE's, except for the log p(θ) term
In some sense, MAP is just a "regularized" MLE

SLIDE 5

Bayesian Learning

Both MLE and MAP only give a point estimate (a single best answer) for θ
How can we capture/quantify the uncertainty in θ? We need to infer the full posterior distribution:

$$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)} = \frac{p(X|\theta)\,p(\theta)}{\int_{\theta} p(X|\theta)\,p(\theta)\,d\theta} \propto \text{Likelihood} \times \text{Prior}$$

This requires doing "fully Bayesian" inference
Inference is sometimes easy and sometimes a (very) hard problem
Conjugate priors often make life easy when doing inference

SLIDE 6

Warm-up: Least Squares Regression

Training data: $\{x_n, y_n\}_{n=1}^{N}$. The response is a noisy function of the input:

$$y_n = f(x_n, w) + \epsilon_n$$

Assume a data representation $\phi(x_n) = [\phi_1(x_n), \ldots, \phi_M(x_n)] \in \mathbb{R}^M$
Denote $y = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N$ and $\Phi = [\phi(x_1), \ldots, \phi(x_N)]^\top \in \mathbb{R}^{N \times M}$
Assume a linear (in the parameters) function: $f(x_n, w) = w^\top \phi(x_n)$
Sum-of-squared-errors function:

$$E(w) = \frac{1}{2}\sum_{n=1}^{N} (f(x_n, w) - y_n)^2$$

Classical solution: $\hat{w} = \arg\min_w E(w) = (\Phi^\top\Phi)^{-1}\Phi^\top y$
Classification: replace the least squares loss by some other loss (e.g., logistic)
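A minimal sketch of the closed-form solution (mine, with an assumed polynomial feature map); solving the normal equations is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=50)  # noisy responses

M = 5
Phi = np.vander(x, M, increasing=True)  # N x M design: phi(x) = [1, x, ..., x^{M-1}]

# w_hat = (Phi^T Phi)^{-1} Phi^T y, computed via a linear solve
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```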

SLIDE 7

Regularization

We want functions that are "simple" (and hence "generalize" to future data)
How: penalize "complex" functions, using a regularized loss function

$$\tilde{E}(w) = E(w) + \lambda\,\Omega(w)$$

$\Omega(w)$: a measure of how complex $w$ is (we want it small)
The regularization parameter $\lambda$ trades off data fit vs. model simplicity
For $\Omega(w) = ||w||^2$, the solution is $\hat{w} = \arg\min_w \tilde{E}(w) = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top y$
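Continuing the sketch above (the value of `lam` is an arbitrary assumption), the ridge solution only changes the normal equations by a $\lambda I$ term:

```python
lam = 0.1  # assumed regularization strength
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
```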

SLIDE 8

A Probabilistic Framework for Regression

Recall: $y_n = f(x_n, w) + \epsilon_n$. Assume a zero-mean Gaussian error: $p(\epsilon|\sigma^2) = \mathcal{N}(\epsilon|0, \sigma^2)$
This leads to a Gaussian likelihood model $p(y_n|x_n, w) = \mathcal{N}(y_n|f(x_n, w), \sigma^2)$:

$$p(y_n|x_n, w) = \left(\frac{1}{2\pi\sigma^2}\right)^{1/2} \exp\left(-\frac{1}{2\sigma^2}\,(f(x_n, w) - y_n)^2\right)$$

Joint probability of the data (the likelihood):

$$L(w) = \prod_{n=1}^{N} p(y_n|x_n, w) = \left(\frac{1}{2\pi\sigma^2}\right)^{N/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(f(x_n, w) - y_n)^2\right)$$

SLIDE 9

A Probabilistic Framework for Regression

Let's look at the negative log-likelihood:

$$-\log L(w) = \frac{N}{2}\log\sigma^2 + \frac{N}{2}\log 2\pi + \frac{1}{2\sigma^2}\sum_{n=1}^{N}(f(x_n, w) - y_n)^2$$

Minimizing w.r.t. $w$ leads to the same answer as the unregularized case: $\hat{w} = (\Phi^\top\Phi)^{-1}\Phi^\top y$
We also get an estimate of the error variance:

$$\hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N}(f(x_n, \hat{w}) - y_n)^2$$

SLIDE 10

Specifying a Prior and Computing the Posterior

Let's assume a Gaussian prior on the weight vector $w = [w_1, \ldots, w_M]$:

$$p(w|\alpha) = \prod_{m=1}^{M} p(w_m|\alpha) = \prod_{m=1}^{M} \left(\frac{\alpha}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha}{2}\,w_m^2\right)$$

The posterior:

$$p(w|y, \alpha, \sigma^2) = \frac{\text{likelihood} \times \text{prior}}{\text{normalizing factor}} = \frac{p(y|w, \sigma^2) \times p(w|\alpha)}{p(y|\alpha, \sigma^2)}$$

The posterior $p(w|y, \alpha, \sigma^2)$ will be Gaussian, $\mathcal{N}(\mu, \Sigma)$, with

$$\mu = (\Phi^\top\Phi + \sigma^2\alpha I)^{-1}\Phi^\top y \qquad \Sigma = \sigma^2(\Phi^\top\Phi + \sigma^2\alpha I)^{-1}$$

Instead of a single estimate, we now have a distribution over $w$

SLIDE 11

Maximizing the Posterior

Recall the Gaussian prior on the weight vector $w = [w_1, \ldots, w_M]$:

$$p(w|\alpha) = \prod_{m=1}^{M} p(w_m|\alpha) = \prod_{m=1}^{M} \left(\frac{\alpha}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha}{2}\,w_m^2\right)$$

and the likelihood

$$p(y_n|w, x_n, \sigma^2) \propto \exp\left(-\frac{1}{2\sigma^2}\,(f(x_n, w) - y_n)^2\right)$$

Maximizing the posterior $p(w|y, \alpha, \sigma^2) \propto p(y|w, \sigma^2) \times p(w|\alpha)$ w.r.t. $w$ is equivalent to minimizing

$$E_{\text{MAP}}(w) = \frac{1}{2\sigma^2}\sum_{n=1}^{N}\{f(x_n, w) - y_n\}^2 + \frac{\alpha}{2}\sum_{m=1}^{M} w_m^2$$

This leads to a solution identical to ridge regression with $\lambda = \sigma^2\alpha$

SLIDE 12

Evolution of the Posterior

Posterior updates have a naturally online flavor:

$$p(w|y_1, y_2, y_3) \propto p(y_1, y_2, y_3|w)\,p(w) = p(y_2, y_3|w)\,p(y_1|w)\,p(w) = p(y_2, y_3|w)\,p(w|y_1)$$

i.e., the likelihood w.r.t. $y_2, y_3$ times the posterior after seeing $y_1$
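A sketch of this online flavor for the Gaussian model (mine, continuing the variables above): absorbing one observation at a time into the posterior precision reproduces the batch posterior mean:

```python
precision = alpha * np.eye(M)   # prior N(0, alpha^{-1} I), held as a precision matrix
b = np.zeros(M)                 # running sum of phi_n * y_n / sigma2

for phi_n, y_n in zip(Phi, y):  # process one (phi(x_n), y_n) pair at a time
    precision += np.outer(phi_n, phi_n) / sigma2
    b += phi_n * y_n / sigma2

mu_online = np.linalg.solve(precision, b)  # equals the batch posterior mean mu
```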

SLIDE 13

Let's Compare Predictions

Ridge regression prediction: $f(\hat{w}, x_*)$
MAP estimation (or "pseudo-Bayesian") prediction: $p(y_*|w_{\text{MAP}}, x_*, \sigma^2)$
True Bayesian prediction:

$$p(y_*|x_*, y, X, \sigma^2, \alpha) = \int p(y_*|w, x_*, \sigma^2)\,p(w|y, X, \alpha, \sigma^2)\,dw$$

The true Bayesian way integrates out (marginalizes/averages over) the uncertain variables ($w$ in this case) to get a predictive distribution

SLIDE 14

Not Quite Done Yet..

We haven't really averaged over all the unknowns (which also include $\alpha, \sigma^2$)
Ideally, we would like the posterior over all the unknowns:

$$p(w, \alpha, \sigma^2|y) = \frac{p(y|w, \sigma^2)\,p(w|\alpha)\,p(\alpha)\,p(\sigma^2)}{p(y)}$$

where $p(y) = \int p(y|w, \sigma^2)\,p(w|\alpha)\,p(\alpha)\,p(\sigma^2)\,dw\,d\alpha\,d\sigma^2$ (hard to compute)
Making predictions for new data points requires the predictive distribution:

$$p(y_*|y) = \int p(y_*|w, \sigma^2)\,p(w, \alpha, \sigma^2|y)\,dw\,d\alpha\,d\sigma^2$$

.. again, hard to compute
Approximate Bayesian inference (Type-II maximum likelihood, Laplace approximation, MCMC, variational Bayes, etc.) saves the day..

SLIDE 15

Approximating the Predictive Distribution

Making predictions for new data points:

$$p(y_*|y) = \int p(y_*|w, \sigma^2)\,p(w, \alpha, \sigma^2|y)\,dw\,d\alpha\,d\sigma^2$$
$$= \int p(y_*|w, \sigma^2)\,p(w|\alpha, \sigma^2, y)\,p(\alpha, \sigma^2|y)\,dw\,d\alpha\,d\sigma^2$$
$$\approx \int p(y_*|w, \sigma^2)\,p(w|\alpha, \sigma^2, y)\,\delta(\alpha_{\text{MP}}, \sigma^2_{\text{MP}})\,dw\,d\alpha\,d\sigma^2$$
$$= \int p(y_*|w, \sigma^2)\,p(w|\alpha_{\text{MP}}, \sigma^2_{\text{MP}}, y)\,dw$$

Recall: $p(w|\alpha_{\text{MP}}, \sigma^2_{\text{MP}}, y)$ is a Gaussian; so is $p(y_*|w, \sigma^2)$
Can thus now compute $p(y_*|y) = \int p(y_*|w, \sigma^2)\,p(w|\alpha_{\text{MP}}, \sigma^2_{\text{MP}}, y)\,dw$ in closed form, which is again a Gaussian $\mathcal{N}(y_*|\mu_*, \sigma^2_*)$ with

$$\mu_* = f(x_*, \mu) = \mu^\top\phi(x_*) \qquad \sigma^2_* = \sigma^2_{\text{MP}} + \phi(x_*)^\top\Sigma\,\phi(x_*)$$
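In code, a sketch (mine, reusing `mu`, `Sigma`, `sigma2` from the earlier sketch, with `sigma2` standing in for $\sigma^2_{\text{MP}}$):

```python
x_star = 0.3
phi_star = np.vander([x_star], M, increasing=True).ravel()  # phi(x_star)

mu_star = mu @ phi_star                           # predictive mean
var_star = sigma2 + phi_star @ Sigma @ phi_star   # predictive variance
```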

SLIDE 16

Marginal Likelihood

Hyperparameters $\alpha, \sigma^2$ are estimated by maximizing the marginal likelihood (averaged over the prior on $w$):

$$p(y|\alpha, \sigma^2) = \int p(y|w, \sigma^2)\,p(w|\alpha)\,dw = \frac{1}{(2\pi)^{N/2}}\,|\sigma^2 I + \Phi A^{-1}\Phi^\top|^{-1/2}\,\exp\left(-\frac{1}{2}\,y^\top(\sigma^2 I + \Phi A^{-1}\Phi^\top)^{-1}y\right)$$

(with $A = \alpha I$ here; later, with per-weight hyperparameters, $A = \text{diag}(\alpha_1, \ldots, \alpha_M)$)
Maximizing $p(y|\alpha, \sigma^2)$ w.r.t. $\alpha$ and $\sigma^2$ gives $\alpha_{\text{MP}}$ and $\sigma^2_{\text{MP}}$, respectively
Maximization can be done using gradient-based methods
Can instead assume uniform priors on $\alpha, \sigma^2$ and compute the marginal model probability

$$p(y|\mathcal{M}) = \int p(y|\alpha, \sigma^2)\,p(\alpha)\,p(\sigma^2)\,d\alpha\,d\sigma^2 \approx \frac{1}{S}\sum_{s=1}^{S} p(y|\alpha_s, \sigma^2_s)$$

(useful for model selection)
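A sketch of this Monte Carlo estimate (mine; the sampling ranges for `alpha` and `sigma2` are assumptions, and `Phi`, `y`, `rng` come from the earlier sketch), done in log space for numerical stability:

```python
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_marginal(alpha, sigma2, Phi, y):
    # log p(y | alpha, sigma2) = log N(y | 0, sigma2 I + Phi A^{-1} Phi^T), with A = alpha I
    C = sigma2 * np.eye(len(y)) + Phi @ Phi.T / alpha
    return multivariate_normal(mean=np.zeros(len(y)), cov=C).logpdf(y)

S = 200
alphas_s = rng.uniform(0.1, 10.0, size=S)   # samples from an assumed prior
sigma2s = rng.uniform(0.01, 1.0, size=S)
logs = [log_marginal(a, s2, Phi, y) for a, s2 in zip(alphas_s, sigma2s)]
log_p_y_M = logsumexp(logs) - np.log(S)     # log of (1/S) sum_s p(y | alpha_s, sigma2_s)
```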

SLIDE 17

Sparse Modeling

Want very few elements in w to be nonzero

SLIDE 18

Sparse Bayesian Regression

Recall the Gaussian prior on $w$:

$$p(w|\alpha) = \prod_{m=1}^{M} p(w_m|\alpha) = \prod_{m=1}^{M} \left(\frac{\alpha}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha}{2}\,w_m^2\right)$$

Each component of $w$ is a zero-mean Gaussian: $p(w_m|\alpha) = \mathcal{N}(w_m|0, \alpha^{-1})$
The same hyperparameter $\alpha$ on each entry of $w$ can't impose sparsity on $w$
Let's have a separate inverse variance $\alpha_m$ for each component of $w$:

$$p(w|\alpha) = \prod_{m=1}^{M} p(w_m|\alpha_m) = \prod_{m=1}^{M} \left(\frac{\alpha_m}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha_m}{2}\,w_m^2\right)$$

We now have $M$ hyperparameters $\alpha = [\alpha_1, \ldots, \alpha_M]$, individually controlling the variance of each component $w_m$ of $w$

SLIDE 19

A Hierarchical Prior

Our new hierarchical prior on $w$:

$$p(w|\alpha) = \prod_{m=1}^{M} p(w_m|\alpha_m) = \prod_{m=1}^{M} \left(\frac{\alpha_m}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha_m}{2}\,w_m^2\right)$$

We will assume a gamma prior on each $\alpha_m$: $p(\alpha_m) \propto \alpha_m^{a-1}\exp(-\alpha_m/b)$
The marginal prior on each weight $w_m$, after averaging over $p(\alpha_m)$, is

$$p(w_m) = \int p(w_m|\alpha_m)\,p(\alpha_m)\,d\alpha_m$$

which will be a Student-t distribution
Akin to penalizing $\sum_{m=1}^{M}\log|w_m|$; leads to sparse solutions for $w$ (see the sampling sketch below)
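A small sketch (mine; the values of `a` and `b` are assumptions, not from the slides) of this Gaussian-gamma hierarchy by ancestral sampling, which makes the heavy-tailed, sparsity-inducing Student-t marginal visible:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 0.5, 2.0                                      # assumed gamma shape and scale
alpha_m = rng.gamma(shape=a, scale=b, size=100_000)  # alpha_m ~ Gamma(a, b)
w_m = rng.normal(0.0, 1.0 / np.sqrt(alpha_m))        # w_m | alpha_m ~ N(0, 1/alpha_m)
# Marginally, w_m is Student-t distributed: sharply peaked at zero with heavy tails
```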

SLIDE 20

Sparse Bayesian Regression

Likelihood model:

$$p(y|w, \sigma^2) = (2\pi\sigma^2)^{-N/2}\exp\left(-\frac{1}{2\sigma^2}\,||y - \Phi w||^2\right)$$

Prior on $w$: Gaussian-gamma (Student-t)
Posterior:

$$p(w, \alpha, \sigma^2|y) = \frac{p(y|w, \alpha, \sigma^2)\,p(w, \alpha, \sigma^2)}{p(y)}$$

The posterior $p(w, \alpha, \sigma^2|y)$ is further decomposed as $p(w, \alpha, \sigma^2|y) = p(w|y, \alpha, \sigma^2)\,p(\alpha, \sigma^2|y)$

SLIDE 21

The Posterior

Posterior over the weights will be Gaussian:

$$p(w|y, \alpha, \sigma^2) = \frac{p(y|w, \sigma^2)\,p(w|\alpha)}{p(y|\alpha, \sigma^2)} = (2\pi)^{-M/2}\,|\Sigma|^{-1/2}\exp\left(-\frac{1}{2}(w - \mu)^\top\Sigma^{-1}(w - \mu)\right)$$

where $\Sigma = (\sigma^{-2}\Phi^\top\Phi + A)^{-1}$, $\mu = \sigma^{-2}\Sigma\Phi^\top y$, and $A = \text{diag}(\alpha_1, \alpha_2, \ldots, \alpha_M)$

Note: if $\alpha_m = \infty$ then $\mu_m = 0$

SLIDE 22

Hyperparameter Re-estimation

Posterior over $w$: $p(w|y, \alpha, \sigma^2) = \mathcal{N}(\mu, \Sigma)$
Marginal likelihood (averaged over the prior on $w$):

$$p(y|\alpha, \sigma^2) = \int p(y|w, \sigma^2)\,p(w|\alpha)\,dw = \frac{1}{(2\pi)^{N/2}}\,|\sigma^2 I + \Phi A^{-1}\Phi^\top|^{-1/2}\,\exp\left(-\frac{1}{2}\,y^\top(\sigma^2 I + \Phi A^{-1}\Phi^\top)^{-1}y\right)$$

Maximizing the marginal likelihood $p(y|\alpha, \sigma^2)$ w.r.t. $\alpha = [\alpha_1, \ldots, \alpha_M]$ and $\sigma^2$ gives the re-estimation equations

$$\alpha_m^{\text{new}} = \frac{\gamma_m}{\mu_m^2} \qquad (\sigma^2)^{\text{new}} = \frac{||y - \Phi\mu||^2}{N - \sum_{m=1}^{M}\gamma_m}$$

where $\gamma_m = 1 - \alpha_m\Sigma_{mm}$
Alternate between estimating $w$, $\alpha$, and $\sigma^2$ (a sketch of this loop follows)
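A hedged sketch of the alternating loop (mine; the iteration count, the jitter term, and the pruning threshold are assumptions, and `Phi`, `y`, `M` come from the earlier sketch):

```python
alphas = np.ones(M)   # one precision per weight
sigma2 = 0.1
for _ in range(100):
    A = np.diag(alphas)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)   # posterior covariance
    mu = Sigma @ Phi.T @ y / sigma2                   # posterior mean
    gamma = 1.0 - alphas * np.diag(Sigma)             # gamma_m = 1 - alpha_m * Sigma_mm
    alphas = gamma / (mu ** 2 + 1e-12)                # alpha_m^new = gamma_m / mu_m^2
    sigma2 = np.sum((y - Phi @ mu) ** 2) / (len(y) - gamma.sum())

# Weights whose alpha_m has blown up are effectively pruned (mu_m -> 0)
kept = alphas < 1e6
```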

SLIDE 23

Approximate Bayesian Inference

Bayesian learning routinely needs to deal with intractable integrals, e.g.:
Normalization, when computing the posterior distribution

$$p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\,p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D}|\theta)\,p(\theta)}{\int p(\mathcal{D}|\theta)\,p(\theta)\,d\theta}$$

where the denominator is rarely available in closed analytical form
Marginalization: $p(\theta|\mathcal{D}) = \int p(\theta, \phi|\mathcal{D})\,d\phi$
Expectations: $\mathbb{E}_{p(\theta|\mathcal{D})}[f(\theta)] = \int f(\theta)\,p(\theta|\mathcal{D})\,d\theta$

SLIDE 24

Approximate Bayesian Inference

There are several ways to do approximate inference in Bayesian models:
Sampling-based approximations: Monte Carlo methods, Markov Chain Monte Carlo (MCMC) methods (e.g., Gibbs sampling)
Deterministic approximations: Laplace approximation, Variational Bayes (VB), Expectation Propagation (EP). These treat inference as an optimization problem: finding the parameters of the closest distribution from a chosen family
A very active area of research, with a lot of recent work on scalable inference (online and distributed Bayesian inference)

SLIDE 25

Being Bayesian

SLIDE 26

Other Recent Advances in Bayesian Learning

Bayesian Optimization: used for optimization problems where the objective function is unknown and expensive to evaluate
Close connections to other "hot" areas in ML, e.g., Dropout in Deep Learning vs. approximate Bayesian inference
A lot of ongoing work to automate Bayesian inference
Probabilistic Programming: computer programs to express probabilistic models
Nonparametric Bayesian modeling (or "letting the data speak for itself")

SLIDE 27

Next Talk

Introduction to nonparametric Bayesian modeling
Nonparametric Bayesian regression: Gaussian Process (GP) regression

SLIDE 28

Thanks! Questions?
