
Deterministic Approximations 2

Laplace and variational approximations

Iain Murray http://iainmurray.net/

Posterior distributions

p(θ|D, M) = P(D|θ) p(θ) / P(D|M)

E.g., logistic regression:

p(θ) = N(θ; 0, σ²I),  P(D|θ) = ∏ₙ σ(z(n) w⊤x(n)),  labels z(n) ∈ ±1

The posterior involves a large product of non-linear functions, which we cannot integrate in closed form.

Goals: summarize the posterior in a simple form; estimate the model evidence P(D|M).
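As a concrete sketch (not from the slides), the unnormalized log posterior for this logistic regression model takes a few lines of NumPy; `log_posterior_unnorm` and the argument names are illustrative:

```python
import numpy as np

def log_sigmoid(a):
    # Numerically stable log σ(a) = -log(1 + exp(-a))
    return -np.logaddexp(0.0, -a)

def log_posterior_unnorm(w, X, z, sigma2=1.0):
    """Unnormalized log posterior log P(D|w) + log p(w) for
    p(w) = N(w; 0, sigma2 I), P(D|w) = prod_n sigma(z(n) w'x(n)),
    labels z(n) in {-1, +1}."""
    log_lik = np.sum(log_sigmoid(z * (X @ w)))
    log_prior = -0.5 * np.sum(w ** 2) / sigma2
    return log_lik + log_prior
```

The normalizer P(D|M) is exactly the intractable quantity, so everything downstream works with this unnormalized form.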

Non-Gaussian example

p(w) ∝ N(w; 0, 1),  p(w|D) ∝ N(w; 0, 1) σ(10 − 20w)

[Figure: prior and posterior densities on w ∈ (−4, 4)]

Posterior after 500 datapoints

N = 500 labels generated with w = 1 at x(n) ∼ N(0, 10²):

p(w) ∝ N(w; 0, 1),  p(w|D) ∝ N(w; 0, 1) ∏_{n=1}^{500} σ(w x(n) z(n))

[Figure: prior and posterior on w ∈ (−4, 4), with a Gaussian fit overlaid on the posterior]


Gaussian approximations

Finite parameter vector θ. P(θ | lots of data) is often nearly Gaussian around the mode. We need to identify which Gaussian it is: its mean and covariance.

Laplace Approximation

MAP estimate:

θ∗ = arg maxθ [log P(D|θ) + log P(θ)]

Taylor expand the energy E(θ) = −log P(θ|D) = −log P(D|θ) − log P(θ) + log P(D) at the optimum. Because ∇θE is zero at θ∗ (a turning point):

E(θ∗ + δ) ≃ E(θ∗) + ½ δ⊤Hδ

Doing the same expansion for a Gaussian around its mean identifies the Laplace approximation:

P(θ|D) ≈ N(θ; θ∗, H⁻¹)

Laplace details

The matrix of second derivatives is called the Hessian:

Hij = ∂² [−log P(θ|D)] / ∂θi ∂θj,  evaluated at θ = θ∗

Find the posterior mode (MAP estimate) θ∗ using your favourite gradient-based optimizer. The log posterior doesn't need to be normalized: constants disappear from the derivatives and second derivatives.
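A minimal sketch of this recipe on the earlier 1-D example, assuming SciPy is available (`neg_log_post` and `laplace_fit` are illustrative names): optimize the unnormalized energy, then take the curvature by finite differences:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_post(w):
    # E(w) = -log[N(w; 0, 1) σ(10 - 20w)] up to a constant:
    # 0.5 w² from the prior, softplus(-(10 - 20w)) from -log σ
    return 0.5 * w ** 2 + np.logaddexp(0.0, -(10.0 - 20.0 * w))

def laplace_fit(E, w0, eps=1e-5):
    """Mode and variance of the Laplace approximation N(w*, 1/H)."""
    w_star = minimize(lambda w: E(w[0]), x0=[w0]).x[0]
    # Scalar "Hessian" by a central second difference
    H = (E(w_star + eps) - 2.0 * E(w_star) + E(w_star - eps)) / eps ** 2
    return w_star, 1.0 / H

mode, var = laplace_fit(neg_log_post, 0.0)
```

Because constants cancel in the derivatives, the unnormalized energy is all the optimizer ever sees.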

Laplace picture

Curvature and mode match, and we can normalize the Gaussian, but the height at the mode won't match exactly! Used to approximate the model likelihood (AKA 'evidence', 'marginal likelihood'):

P(D) = P(D|θ) P(θ) / P(θ|D) ≈ P(D|θ∗) P(θ∗) / N(θ∗; θ∗, H⁻¹) = P(D|θ∗) P(θ∗) |2πH⁻¹|^(1/2)
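For the 1-D example this evidence estimate can be checked against simple quadrature; a sketch under illustrative names, treating N(w; 0, 1) as the prior and σ(10 − 20w) as the likelihood factor:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def post_unnorm(w):
    # Unnormalized posterior from the 1-D example:
    # prior N(w; 0, 1) times likelihood factor σ(10 - 20w)
    return np.exp(-0.5 * w ** 2) / np.sqrt(2 * np.pi) * sigmoid(10 - 20 * w)

def energy(w):
    return -np.log(post_unnorm(w))

w = np.linspace(-6, 6, 120001)
w_star = w[np.argmin(energy(w))]      # crude grid search for the mode
eps = 1e-4
H = (energy(w_star + eps) - 2 * energy(w_star) + energy(w_star - eps)) / eps ** 2

# Laplace estimate of the evidence: log Z ≈ -E(w*) + ½ log(2π/H)
log_Z_laplace = -energy(w_star) + 0.5 * np.log(2 * np.pi / H)

# Grid-based log Z for comparison (tails are negligible on this range)
log_Z_grid = np.log(np.sum(post_unnorm(w)) * (w[1] - w[0]))
```

On this skewed posterior the Laplace value comes out noticeably too high (about −0.01 versus about −0.37 from the grid): the height at the mode doesn't match, exactly as the slide warns.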


Laplace problems

Weird densities (which we've seen can happen) won't be approximated well: we only locally match one mode. The mode may not have much mass, or may have misleading curvature. In high dimensions the mode may be flat in some direction, giving an ill-conditioned Hessian.

Other Gaussian approximations

A Gaussian can be matched to the posterior in other ways than derivatives at the mode. An accurate Gaussian approximation may not be possible at all, but capturing the posterior's width is still better than fitting only a point estimate.

Variational methods

Goal: fit a target distribution (e.g., the parameter posterior). Define:
— a family of possible distributions q(θ)
— a 'variational objective' (saying 'how well does q match?')
Optimize the objective: fit the parameters of q(θ), e.g., the mean and covariance of a Gaussian.

Kullback–Leibler Divergence

DKL(p||q) = ∫ p(θ) log [p(θ) / q(θ)] dθ

DKL(p||q) ≥ 0, and equals zero only when q(θ) = p(θ).

Information theory (non-examinable for MLPR): the KL divergence is the average storage wasted by a compression system that uses model q instead of the true distribution p.
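A quick numerical check of the definition, comparing a grid-based integral against the closed form for two 1-D Gaussians (the closed form is a standard result, not derived on the slide):

```python
import numpy as np

def gauss(w, mu, s2):
    return np.exp(-0.5 * (w - mu) ** 2 / s2) / np.sqrt(2 * np.pi * s2)

# D_KL(p||q) = ∫ p log(p/q) dw, by a simple grid sum
w = np.linspace(-12, 12, 120001)
dw = w[1] - w[0]
p = gauss(w, 0.0, 1.0)               # p = N(0, 1)
q = gauss(w, 1.0, 2.0)               # q = N(1, 2)
kl_numeric = np.sum(p * np.log(p / q)) * dw

# Standard closed form for 1-D Gaussians:
# D_KL = ½ (s2p/s2q + (µp-µq)²/s2q - 1 + log(s2q/s2p))
kl_closed = 0.5 * (1 / 2 + 1 / 2 - 1 + np.log(2.0))
```

Swapping p and q in `kl_numeric` gives a different number: the KL divergence is not symmetric, which is why the next slides treat the two directions separately.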


Minimizing DKL(p||q)

Select a family: q(θ) = N(θ; µ, Σ). Minimizing DKL(p||q) matches the mean and covariance of p.

[Figure: a non-Gaussian target on w ∈ (−4, 4) with its moment-matched Gaussian]

Minimizing DKL(p||q)

Optimizing DKL(p||q) tends to be hard: even for a Gaussian q we would need the mean and covariance of p, which might itself require MCMC. And the answer may not be what you want:
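On the 1-D toy posterior from earlier, the DKL(p||q)-optimal Gaussian can be found by computing p's moments on a grid; a small illustrative sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy posterior from earlier: p(w|D) ∝ N(w; 0, 1) σ(10 - 20w)
w = np.linspace(-8, 8, 160001)
dw = w[1] - w[0]
p = np.exp(-0.5 * w ** 2) * sigmoid(10 - 20 * w)
p /= np.sum(p) * dw                   # normalize on the grid

# The Gaussian minimizing D_KL(p||q) has p's mean and variance:
mu = np.sum(w * p) * dw
var = np.sum((w - mu) ** 2 * p) * dw
```

Grids only work in one dimension, which is the point of the slide: for real posteriors these moments are exactly what we can't compute.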

Considering DKL(q||p)

[Figure panels: min KL(p||q); local min KL(q||p); local min KL(q||p)]

DKL(q||p) = −∫ q(θ) log p(θ|D) dθ + ∫ q(θ) log q(θ) dθ

The second term is the negative entropy, −H(q). The two terms say:
1. "Don't put probability mass on implausible parameters."
2. "Want q to be spread out, with high entropy."

DKL(q||p): fitting posterior

Fit q to p(θ|D) = p(D|θ) p(θ) / p(D). Substituting into the KL gives a spray of terms:

DKL(q||p) = Eq[log q(θ)] − Eq[log p(D|θ)] − Eq[log p(θ)] + log p(D)

The first three terms depend on q: minimize their sum, J(q). The last term, log p(D), is the model evidence, usually intractable, but constant in q. Since DKL(q||p) ≥ 0:

log p(D) ≥ −J(q)

We optimize a lower bound on the log marginal likelihood.
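The bound can be checked numerically on the 1-D toy posterior, treating N(w; 0, 1) as the prior and σ(10 − 20w) as the likelihood; this Monte Carlo sketch uses illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_sigmoid(a):
    return -np.logaddexp(0.0, -a)

def log_prior(w):                     # log N(w; 0, 1)
    return -0.5 * w ** 2 - 0.5 * np.log(2 * np.pi)

def log_lik(w):                       # log of the factor σ(10 - 20w)
    return log_sigmoid(10 - 20 * w)

def neg_J(mu, s2, S=200000):
    """Monte Carlo estimate of the bound -J(q) =
    E_q[log p(D|w)] + E_q[log p(w)] - E_q[log q(w)], q(w) = N(w; mu, s2)."""
    w = mu + np.sqrt(s2) * rng.standard_normal(S)
    log_q = -0.5 * (w - mu) ** 2 / s2 - 0.5 * np.log(2 * np.pi * s2)
    return np.mean(log_lik(w) + log_prior(w) - log_q)

# log p(D) by quadrature for comparison: the bound must sit below it
wg = np.linspace(-8, 8, 100001)
log_Z = np.log(np.sum(np.exp(log_prior(wg) + log_lik(wg))) * (wg[1] - wg[0]))
bound = neg_J(0.0, 1.0)               # q set equal to the prior
```

With q equal to the prior the bound is valid but loose (around −4 versus log p(D) ≈ −0.37); optimizing q over µ and s² tightens it.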


DKL(q||p): optimization

The literature is full of clever (non-examinable) iterative ways to optimize DKL(q||p); q is not always Gaussian.

Can we use standard optimizers? The hardest term to evaluate is

Eq[log p(D|θ)] = ∑_{n=1}^{N} Eq[log p(x(n)|θ)],

a sum of possibly simple integrals. Stochastic gradient descent is an option.
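On the same toy posterior, the stochastic-gradient option might look like the following reparameterization sketch (names, step size, and iteration count are illustrative; the gradients are worked out by hand for this 1-D case):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_log_post(w):
    # d/dw of log[N(w; 0, 1) σ(10 - 20w)]: -w from the prior,
    # -20 σ(20w - 10) from the sigmoid factor
    return -w - 20.0 * sigmoid(20.0 * w - 10.0)

# Stochastic gradient ascent on -J(q) for q(w) = N(w; mu, s²),
# using the reparameterization w = mu + s·eps with eps ~ N(0, 1);
# the entropy term ½ log(2πe s²) contributes +1 to the log_s gradient.
mu, log_s = 0.0, 0.0
lr = 0.01
for t in range(5000):
    eps = rng.standard_normal(100)            # minibatch of samples
    w = mu + np.exp(log_s) * eps
    g = grad_log_post(w)
    grad_mu = np.mean(g)
    grad_log_s = np.mean(g * eps) * np.exp(log_s) + 1.0
    mu += lr * grad_mu
    log_s += lr * grad_log_s
```

The fitted q settles below zero with a standard deviation well under the prior's, consistent with DKL(q||p)'s preference for avoiding implausible parameters.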

Summary

Laplace approximation:
— Straightforward to apply; accuracy variable
— 2nd derivatives → certainty of parameters
— Incremental improvement on a MAP estimate

Variational methods:
— Fit variational parameters of q (not θ!)
— KL(p||q) vs. KL(q||p)
— Bound the marginal/model likelihood ('the evidence')