


Deterministic Approximations

Laplace and variational approximations

Iain Murray http://iainmurray.net/

Bayesian logistic regression

Already covered in lectures on classification. I will review Murphy pp. 256–259 on the board. Similar material by MacKay, Ch. 41, pp. 492–503. (§41.4 uses non-examinable MCMC methods.)

http://www.inference.phy.cam.ac.uk/mackay/itila/book.html

Posterior distributions

p(θ|D, M) = P(D|θ) p(θ) / P(D|M)

E.g., logistic regression:

p(θ=w) = N(w; 0, σ²I),  P(D|θ=w) = ∏ₙ σ(z⁽ⁿ⁾ w⊤x⁽ⁿ⁾),  labels z⁽ⁿ⁾ ∈ ±1

Integrating a large product of non-linear functions.

Goals: summarize the posterior in simple form, estimate the model evidence P(D|M)
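For concreteness, a minimal sketch (assuming NumPy, with placeholder arrays X for the N×D inputs and z for the ±1 labels) of the unnormalized log posterior that the approximations below all target:

```python
import numpy as np

def log_sigmoid(a):
    # Numerically stable log σ(a) = −log(1 + exp(−a))
    return -np.logaddexp(0.0, -a)

def log_posterior_unnorm(w, X, z, sigma2=1.0):
    """Unnormalized log p(w|D): log N(w; 0, σ²I) + Σₙ log σ(z⁽ⁿ⁾ w⊤x⁽ⁿ⁾),
    dropping constants that do not depend on w."""
    log_prior = -0.5 * np.dot(w, w) / sigma2
    log_lik = np.sum(log_sigmoid(z * (X @ w)))
    return log_prior + log_lik
```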

Non-Gaussian example

p(w) ∝ N(w; 0, 1),  p(w|D) ∝ N(w; 0, 1) σ(10 − 20w)

[Plot: prior p(w) and posterior p(w|D) over w ∈ (−4, 4)]
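One way to see the non-Gaussian shape is to evaluate this one-dimensional posterior on a grid and normalize it numerically; a small sketch, assuming NumPy/SciPy:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit   # expit(a) = σ(a)

w = np.linspace(-4, 4, 1000)
unnorm = norm.pdf(w, 0, 1) * expit(10 - 20 * w)   # N(w; 0, 1) σ(10 − 20w)
post = unnorm / (unnorm.sum() * (w[1] - w[0]))    # normalize on the grid
```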


Posterior after 500 datapoints

N = 500 labels generated with w = 1 at x⁽ⁿ⁾ ∼ N(0, 10²)

p(w) ∝ N(w; 0, 1),  p(w|D) ∝ N(w; 0, 1) ∏ₙ₌₁⁵⁰⁰ σ(w x⁽ⁿ⁾ z⁽ⁿ⁾)

[Plots: prior p(w) and posterior p(w|D) over w ∈ (−4, 4), with a Gaussian fit overlaid on the posterior]
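A sketch of how such a dataset and its grid posterior could be generated (assumed setup, not the lecture's actual code):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
N, w_true = 500, 1.0
x = rng.normal(0.0, 10.0, size=N)                         # x⁽ⁿ⁾ ~ N(0, 10²)
z = np.where(rng.random(N) < expit(w_true * x), 1, -1)    # labels z⁽ⁿ⁾ ∈ ±1

w = np.linspace(-4, 4, 2000)                              # grid over the weight
log_post = -0.5 * w**2 - np.sum(np.logaddexp(0.0, -np.outer(w, x * z)), axis=1)
post = np.exp(log_post - log_post.max())
post /= post.sum() * (w[1] - w[0])                        # normalize on the grid
```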

Gaussian approximations

For a finite parameter vector θ, P(θ | lots of data) is often nearly Gaussian around the mode. We need to identify which Gaussian it is: its mean and covariance.

Laplace Approximation

MAP estimate: θ∗ = arg maxθ [log P(D|θ) + log P(θ)]

Define 'energy': E(θ) = − log P(θ|D) = − log P(D|θ) − log P(θ) + log P(D).

Because ∇θE is zero at θ∗ (a turning point), the Taylor expansion there is: E(θ∗ + δ) ≈ E(θ∗) + ½ δ⊤Hδ

Do the same thing to a Gaussian's energy around its mean, and match terms to identify the Laplace approximation: P(θ|D) ≈ N(θ; θ∗, H⁻¹)
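To make the matching step explicit (a restatement of the argument above, nothing new): write down the energy of a Gaussian and compare coefficients.

```latex
% Energy of a Gaussian N(theta; mu, Sigma), up to an additive constant:
E_{\mathcal{N}}(\theta) = \tfrac{1}{2}\,(\theta - \mu)^\top \Sigma^{-1} (\theta - \mu) + \text{const.}
% Comparing with E(\theta^* + \delta) \approx E(\theta^*) + \tfrac{1}{2}\,\delta^\top H \delta,
% where \delta = \theta - \theta^*, term-by-term matching gives
\mu = \theta^*, \qquad \Sigma = H^{-1}
\quad\Rightarrow\quad p(\theta \mid \mathcal{D}) \approx \mathcal{N}(\theta;\, \theta^*,\, H^{-1}).
```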

Laplace details

The matrix of second derivatives is called the Hessian:

Hᵢⱼ = ∂²[− log P(θ|D)] / ∂θᵢ∂θⱼ, evaluated at θ = θ∗

Find the posterior mode (MAP estimate) θ∗ using your favourite gradient-based optimizer. The log posterior doesn't need to be normalized: constants disappear from the derivatives and second derivatives.
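A minimal end-to-end sketch of this recipe for the Bayesian logistic regression model above (assuming SciPy's BFGS; the Hessian uses the standard analytic form for a logistic likelihood plus Gaussian prior):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_post(w, X, z, sigma2=1.0):
    # E(w) = −log P(w|D) up to a constant: negative log prior + negative log likelihood
    return 0.5 * np.dot(w, w) / sigma2 + np.sum(np.logaddexp(0.0, -z * (X @ w)))

def laplace_fit(X, z, sigma2=1.0):
    D = X.shape[1]
    res = minimize(neg_log_post, np.zeros(D), args=(X, z, sigma2), method="BFGS")
    w_map = res.x
    # Analytic Hessian of E(w): I/σ² + Σₙ sₙ(1 − sₙ) xₙxₙ⊤ with sₙ = σ(w⊤xₙ)
    s = 1.0 / (1.0 + np.exp(-(X @ w_map)))
    H = np.eye(D) / sigma2 + (X * (s * (1 - s))[:, None]).T @ X
    return w_map, np.linalg.inv(H)   # mean and covariance of the Gaussian N(θ∗, H⁻¹)
```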


Laplace picture

Curvature and mode match. We can normalize the Gaussian. The height at the mode won't match exactly! Used to approximate the model likelihood (AKA 'evidence', 'marginal likelihood'):

P(D) = P(D|θ) P(θ) / P(θ|D) ≈ P(D|θ∗) P(θ∗) / N(θ∗; θ∗, H⁻¹) = P(D|θ∗) P(θ∗) |2πH⁻¹|^(1/2)
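In log form, continuing from the laplace_fit sketch above (an assumed helper, not the slides' code):

```python
import numpy as np

def log_evidence_laplace(w_map, H_inv, X, z, sigma2=1.0):
    # log P(D) ≈ log P(D|θ∗) + log P(θ∗) + ½ log |2π H⁻¹|
    D = len(w_map)
    log_lik = -np.sum(np.logaddexp(0.0, -z * (X @ w_map)))
    log_prior = -0.5 * (np.dot(w_map, w_map) / sigma2 + D * np.log(2 * np.pi * sigma2))
    sign, logdet = np.linalg.slogdet(2 * np.pi * H_inv)
    return log_lik + log_prior + 0.5 * logdet
```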

Laplace problems

Weird densities won't work well: we only locally match one mode. The mode may not have much mass, or may have misleading curvature. In high dimensions the mode may be flat in some direction → ill-conditioned Hessian.

Other Gaussian approximations

We can match a Gaussian in other ways than local derivatives. An accurate approximation with a Gaussian may not be possible, but capturing the posterior width is still better than only fitting a point estimate.

Variational methods

Goal: fit a target distribution (e.g., a parameter posterior).

Define:
— a family of possible distributions q(θ)
— a 'variational objective' (says 'how well does q match?')

Optimize the objective: fit the parameters of q(θ), e.g., the mean and covariance of a Gaussian.


Kullback–Leibler Divergence

DKL(p||q) = ∫ p(θ) log [p(θ) / q(θ)] dθ

DKL(p||q) ≥ 0, minimized by p(θ) = q(θ).

Information theory (non-examinable for MLPR): the KL divergence is the average storage wasted by a compression system using model q instead of the true distribution p.
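For the one-dimensional grid examples above, DKL(p||q) can be approximated numerically; a sketch assuming both densities are normalized on the same grid w:

```python
import numpy as np

def kl_divergence_grid(p, q, w, eps=1e-12):
    """Approximate DKL(p||q) = ∫ p(θ) log[p(θ)/q(θ)] dθ with a Riemann sum on grid w."""
    dw = w[1] - w[0]
    return np.sum(p * (np.log(p + eps) - np.log(q + eps))) * dw
```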

Minimizing DKL(p||q)

Select a family: q(θ) = N(θ; µ, Σ). Minimizing DKL(p||q) then matches the mean and covariance of p.

[Plot: target density and its moment-matched Gaussian over w ∈ (−4, 4)]
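In the 1D case this moment matching can be done directly on the grid; a sketch reusing the post and w arrays from the earlier examples:

```python
import numpy as np
from scipy.stats import norm

def moment_match_gaussian(post, w):
    """Fit N(mu, var) by matching the mean and variance of a normalized grid density."""
    dw = w[1] - w[0]
    mu = np.sum(w * post) * dw
    var = np.sum((w - mu)**2 * post) * dw
    return mu, var

mu, var = moment_match_gaussian(post, w)
q = norm.pdf(w, mu, np.sqrt(var))   # the KL(p||q)-optimal Gaussian on this grid
```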

Minimizing DKL(p||q)

Optimizing DKL(p||q) tends to be hard. Even for a Gaussian q: how do we get the mean and covariance of p? MCMC? And the answer may not be what you want:

Considering DKL(q||p)

Murphy Fig 21.1

DKL(q||p) = − ∫ q(θ) log p(θ|D) dθ + ∫ q(θ) log q(θ) dθ

The second integral is the negative entropy, −H(q). The two terms say:
  1. "Don't put probability mass on implausible parameters."
  2. Want to be spread out: high entropy.

H is the standard symbol for entropy. Nothing to do with a Hessian, also H; sorry!


Usual variational methods

Most variational methods in Machine Learning minimize DKL(q||p):
— All parameters are plausible.
— We know how to do it!
(There are other variational principles.)

DKL(q||p): fitting posterior

Fit q to p(θ|D) = p(D|θ) p(θ) / p(D). Substitute into the KL divergence and get a spray of terms:

DKL(q||p) = Eq[log q(θ)] − Eq[log p(D|θ)] − Eq[log p(θ)] + log p(D)

First three terms: minimize their sum, J(q). log p(D): the model evidence, usually intractable. But:

DKL(q||p) ≥ 0 ⇒ log p(D) ≥ −J(q)

We optimize a lower bound on the log marginal likelihood.
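A Monte Carlo sketch of J(q) for a Gaussian q(θ) = N(m, s²) in one dimension (log_joint is an assumed function returning log p(D|θ) + log p(θ), vectorized over θ):

```python
import numpy as np

def neg_elbo(m, log_s, log_joint, n_samples=1000, rng=None):
    """Monte Carlo estimate of J(q) = Eq[log q(θ)] − Eq[log p(D, θ)] for q = N(m, s²).
    Since DKL(q||p) ≥ 0, −J(q) lower-bounds log p(D)."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.exp(log_s)
    theta = m + s * rng.standard_normal(n_samples)    # samples from q
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)   # H(q) in closed form
    return -entropy - np.mean(log_joint(theta))
```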

DKL(q||p): optimization

The literature is full of clever (non-examinable) iterative ways to optimize DKL(q||p); q is not always Gaussian.

Use standard optimizers? The hardest term to evaluate is:

Eq[log p(D|θ)] = Σₙ₌₁ᴺ Eq[log p(xₙ|θ)]

A sum of possibly simple integrals. Stochastic gradient descent is an option.
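A sketch of such a stochastic-gradient fit for the 1D logistic example, using the reparameterization w = m + s·ε so expectations under q can be estimated from single samples and minibatches (assumed setup; the hand-derived gradients apply only to this model, and the step sizes are illustrative):

```python
import numpy as np
from scipy.special import expit   # σ(a)

def grad_log_joint(w, xb, zb, scale):
    # d/dw of log N(w; 0, 1) + scale · Σ_batch log σ(z w x), with scale = N / batch size
    return -w + scale * np.sum(zb * xb * expit(-zb * xb * w))

def fit_gaussian_q(x, z, steps=20000, batch=50, lr=1e-4, seed=0):
    """Stochastic gradient ascent on the ELBO −J(q) for q(w) = N(m, s²).
    Learning rate and number of steps are illustrative and would need tuning."""
    rng = np.random.default_rng(seed)
    m, log_s = 0.0, 0.0
    N = len(x)
    for _ in range(steps):
        idx = rng.integers(0, N, size=batch)      # minibatch of data indices
        eps = rng.standard_normal()
        s = np.exp(log_s)
        w = m + s * eps                           # reparameterized sample from q
        g = grad_log_joint(w, x[idx], z[idx], N / batch)
        m += lr * g                               # d(ELBO)/dm ≈ g
        log_s += lr * (g * s * eps + 1.0)         # d(ELBO)/d log s ≈ g·s·ε + 1 (entropy term)
    return m, np.exp(log_s)
```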

Summary

Laplace approximation:
— Straightforward to apply
— 2nd derivatives → certainty about parameters
— An incremental improvement on the MAP estimate

Variational methods:
— Fit the variational parameters of q (not θ!)
— Usually KL(q||p); compare to KL(p||q)
— Bound the marginal/model likelihood ('the evidence')