Non-Gaussian likelihoods for Gaussian Processes - Alan Saul



SLIDE 1

Non-Gaussian likelihoods for Gaussian Processes

Alan Saul

SLIDE 2

Outline

◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons

SLIDE 3

GP regression - recap so far

Model the observations as a distorted version of the process fi = f(xi):

yi ∼ N(f(xi), σ²)

f is a non-linear function; in our case we assume it is latent, and assign it a Gaussian process prior.

[Figure: realizations of f(x), observations, and 95% credible intervals for p(y*|y)]

SLIDE 4

GP regression setting

So far we have assumed that the latent values, f, have been corrupted by Gaussian noise. Everything remains analytically tractable.

Gaussian prior: p(f) = N(f|0, Kff), i.e. f ∼ GP(0, Kff)

Gaussian likelihood: y ∼ N(f, σ²I), so p(y|f) = ∏_{i=1}^n p(yi|fi)

Gaussian posterior: p(f|y) ∝ N(y|f, σ²I) N(f|0, Kff)
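The tractable Gaussian case above can be sketched numerically. A minimal sketch, assuming a squared-exponential (RBF) kernel, a zero mean, and an illustrative noise level (none of these are fixed by the slide); `rbf` and `gp_posterior` are hypothetical helper names:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) = variance * exp(-(a-b)^2 / (2 l^2))."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(x, y, xs, noise=0.1):
    """Exact posterior mean/variance of f* at test inputs xs: pure Gaussian algebra."""
    K = rbf(x, x) + noise * np.eye(len(x))       # Kff + sigma^2 I
    Ks = rbf(x, xs)                              # Kf*
    Kss = rbf(xs, xs)                            # K**
    alpha = np.linalg.solve(K, y)
    mean = Ks.T @ alpha                          # E[f* | y]
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)    # Cov[f* | y]
    return mean, np.diag(cov)

x = np.linspace(0, 5, 10)
y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=10)
mu, v = gp_posterior(x, y, np.linspace(0, 5, 50))
```

Everything here is a closed-form linear-algebra computation, which is exactly the tractability the non-Gaussian likelihoods below break.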
SLIDE 5

Outline

◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons

SLIDE 6

Motivation

◮ You have been given some data you wish to model.
◮ You believe that the observations are connected through some underlying unknown function.
◮ You know, from your understanding of the data-generation process, that the observations are not Gaussian.
◮ You still want to learn, as well as possible, the unknown function, and make predictions.

[Figure: example datasets with Poisson, Log-Gaussian, and Bernoulli observations]

SLIDE 7-10

Likelihood

◮ p(y|f) is the probability that we would see some random variables, y, if we knew the latent function values f, which act as parameters.
◮ Given that the observed values of y are fixed, it can also be seen as the likelihood that some latent function values, f, would give rise to the observed values of y. Note this is a function of f, and does not integrate to 1 in f.
◮ Often observations are not generated by simple Gaussian corruption of the underlying latent function, f.
◮ In the case of count data, binary data, etc., we need to choose a different likelihood function.

SLIDE 11

Likelihood

p(y|f) as a function of y, with fixed f

[Figure: six likelihoods at fixed f: Gaussian N(y|µ = f = 2, σ = 3), Log-Gaussian LG(y|µ = f = 2, σ = 0.7), Student-t t(y|µ = f = 2, σ = 3, df = 4), Beta Be(y|a = f = 2, b = 1.6), Bernoulli B(y|p = f = 0.3), Poisson P(y|λ = f = 2)]

SLIDE 12

Likelihood

p(y|f) as a function of f, with fixed y

[Figure: the same six likelihoods as functions of f at fixed y: Gaussian N(y = 3|µ = f, σ = 2), Log-Gaussian LG(y = 3|µ = f, σ = 0.7), Student-t t(y = 3|µ = f, σ = 3, df = 4), Beta Be(y = 0.3|a = f, b = 1.6), Bernoulli B(y = 1|p = f), Poisson P(y = 3|λ = f)]

SLIDE 13

Binary example

◮ Binary outcomes for yi, yi ∈ {0, 1}.
◮ Model the probability that yi = 1 with a transformation of a GP, using a Bernoulli likelihood.
◮ The probability of a 1 must lie between 0 and 1, so use a squashing transformation, λ(fi) = Φ(fi).

p(yi|λ(fi)) = λ(fi) if yi = 1, and 1 − λ(fi) if yi = 0

[Figure: realizations of f(x), and realizations of Φ(f(x)) with binary observations]
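The generative story on this slide can be sketched directly: draw f from a GP prior, squash through Φ, then sample Bernoulli outcomes. A minimal sketch assuming an RBF kernel with an illustrative lengthscale (the slide does not specify either):

```python
import numpy as np
from math import erf

def probit(f):
    """Squashing transformation λ(f) = Φ(f), the standard normal CDF."""
    return 0.5 * (1.0 + erf(f / np.sqrt(2.0)))

rng = np.random.default_rng(1)
x = np.linspace(-20, 20, 200)
# one realization of f from a zero-mean GP prior (RBF kernel assumed)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 5.0**2) + 1e-8 * np.eye(len(x))
f = rng.multivariate_normal(np.zeros(len(x)), K)
p = np.array([probit(fi) for fi in f])   # Φ(f(x)): squashed into (0, 1)
y = rng.binomial(1, p)                   # Bernoulli observations
```

Whatever values the latent draw f takes, Φ maps them into valid probabilities, which is the whole point of the squashing transformation.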

SLIDE 14

Count data example

◮ Only non-negative, discrete values for yi, yi ∈ ℕ.
◮ Model the rate or intensity, λ, of events with a transformation of a Gaussian process.
◮ The rate parameter must remain positive, so use a transformation that maintains positivity: λ(fi) = exp(fi) or λ(fi) = fi².

yi ∼ Poisson(yi|λi = λ(fi)), where Poisson(yi|λi) = λi^{yi} e^{−λi} / yi!

[Figure: realizations of exp(f(x)), observations, and 95% credible intervals for p(y*|y)]
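The same generative sketch for counts, using the exp link from the slide; as before the RBF kernel and its lengthscale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 30, 150)
# one realization of f from a zero-mean GP prior (RBF kernel assumed)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 3.0**2) + 1e-8 * np.eye(len(x))
f = rng.multivariate_normal(np.zeros(len(x)), K)
lam = np.exp(f)            # exp link: the rate λ(f) = exp(f) stays positive
y = rng.poisson(lam)       # non-negative integer counts
```

The exp link guarantees a valid Poisson rate for any real-valued latent draw; λ(f) = f² would work equally well here.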

SLIDE 15

Application example

◮ Chicago crime counts.
◮ Same Poisson likelihood.
◮ 2D input to the kernel.

[Figure: map of estimated crime-count intensity over Chicago, on a longitude/latitude grid]

SLIDE 16

Outline

◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons

SLIDE 17

Non-Gaussian posteriors

◮ Exact computation of the posterior is no longer analytically tractable, because the non-Gaussian likelihood, p(y|f), is not conjugate to the Gaussian process prior.

p(f|y) = p(f) ∏_{i=1}^n p(yi|fi) / ∫ p(f) ∏_{i=1}^n p(yi|fi) df

Why is it so difficult?

SLIDE 18-21

Non-Gaussian posteriors illustrated

◮ Consider one observation, y1 = 1, at input x1.
◮ Can normalise easily with numerical integration, ∫ p(y1 = 1|λ(f1)) p(f1) df1.

[Figure sequence: draws of f from the prior, the squashed process λ(f), and finally the prior p(f1), likelihood p(y1 = 1|f1), and posterior p(f1|y1 = 1) at x1]
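The one-dimensional normalisation on these slides is easy to carry out on a grid. A sketch assuming a standard normal marginal prior p(f1) = N(0, 1) and the probit squashing Φ (the slides do not pin down the prior's variance); note that by symmetry ∫ Φ(f) N(f|0, 1) df = 1/2, so the normaliser should come out near 0.5:

```python
import numpy as np
from math import erf

def probit(f):
    return 0.5 * (1.0 + erf(f / np.sqrt(2.0)))

f1 = np.linspace(-7.5, 7.5, 2001)
df = f1[1] - f1[0]
prior = np.exp(-0.5 * f1**2) / np.sqrt(2.0 * np.pi)   # p(f1) = N(0, 1), assumed
lik = np.array([probit(v) for v in f1])               # p(y1 = 1 | f1) = Φ(f1)
Z = np.sum(lik * prior) * df                          # normalising constant ≈ 0.5
posterior = lik * prior / Z                           # p(f1 | y1 = 1) on the grid
```

With one (or two) latent values this grid approach is fine; it is the exponential growth of the grid with n that makes it hopeless for a full GP posterior.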

SLIDE 22-24

Non-Gaussian posteriors illustrated

◮ Now two observations, y1 = 1 and y2 = 1, at x1 and x2.
◮ Need to calculate the joint posterior, p(f|y) = p(f1, f2|y1 = 1, y2 = 1).
◮ Requires the 2D integral ∫∫ p(y1 = 1, y2 = 1|λ(f1), λ(f2)) p(f1, f2) df1 df2.

[Figure sequence: draws of f from the prior and the squashed process λ(f), with the two inputs x1 and x2 marked]

SLIDE 25-27

Non-Gaussian posteriors illustrated

◮ To find the true posterior values, we need to perform a two-dimensional integral.
◮ Still possible, but things are getting more difficult quickly.

[Figure sequence over (f1, f2): the prior p(f1, f2), the likelihood p(y1 = 1, y2 = 1|f1, f2), and the true posterior p(f1, f2|y1 = 1, y2 = 1)]

SLIDE 28

Approaches to handling non-Gaussian posteriors

Generally fall into two areas:

◮ Sampling methods that obtain samples of the posterior.
◮ Approximation of the posterior with something of known form.

Today we will focus on the latter.

SLIDE 29

Non-Gaussian posterior approximation

◮ Various methods make a Gaussian approximation, p(f|y) ≈ q(f) = N(f|µ = ?, C = ?).
◮ We only need to obtain an approximate posterior at the training locations.
◮ At test locations, the data affect the predictions only via the posterior at the training locations: p(f, f*|x*, x, y) = p(f*|f, x*) p(f|x, y)

SLIDE 30

Why do we want the posterior anyway?

The true posterior, a posterior approximation, or samples from it are needed to make predictions at new locations, x*:

p(f*|x*, x, y) = ∫ p(f*|f, x*) p(f|y, x) df

q(f*|x*, x, y) = ∫ p(f*|f, x*) q(f|x) df
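When q(f) is Gaussian, the second integral above is itself Gaussian and can be written in closed form. A sketch assuming q(f) = N(m, C), an RBF kernel, and a small jitter (`predict_from_q` is a hypothetical helper name):

```python
import numpy as np

def rbf(a, b, l=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

def predict_from_q(x, xs, m, C, jitter=1e-6):
    """Approximate predictive q(f*) = ∫ p(f*|f, x*) q(f|x) df for Gaussian
    q(f) = N(m, C): the integral is Gaussian with the moments computed below."""
    K = rbf(x, x) + jitter * np.eye(len(x))     # Kff (with jitter)
    Ks = rbf(x, xs)                             # Kf*
    Kss = rbf(xs, xs)                           # K**
    A = np.linalg.solve(K, Ks).T                # A = K*f Kff^{-1}
    mean = A @ m                                # E[f*]
    cov = Kss - A @ Ks + A @ C @ A.T            # conditional cov + pushed-through C
    return mean, cov

x = np.linspace(0, 5, 8)
xs = np.linspace(0, 5, 20)
mean, cov = predict_from_q(x, xs, np.sin(x), 0.25 * np.eye(8))
```

The A C Aᵀ term is where the approximate posterior's uncertainty at the training points propagates to the test points; with C = 0 this reduces to the usual noiseless GP conditional.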
SLIDE 31

Outline

◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons

SLIDE 32

Methods overview

Given the choice of a Gaussian approximation to the posterior, how do we choose the parameter values µ and C? There are a number of different methods for setting the parameters of our Gaussian approximation.

SLIDE 33

Parameters' effect - mean

SLIDE 34

Parameters' effect - variance

SLIDE 35

How to choose the parameters?

Two approaches that we might take:

◮ Match the mean and variance at some point, for example the mode.
◮ Attempt to minimise some divergence measure between the approximate distribution and the true distribution.

◮ Laplace takes the former.
◮ Variational Bayes takes the latter.
◮ EP, loosely speaking, takes the latter.

SLIDE 36

Outline

◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons

SLIDE 37

Laplace approximation

Task: for some generic random variable, f, and data, y, find a good approximation to the difficult-to-compute posterior distribution, p(f|y).

Laplace approach: fit a Gaussian by matching the curvature at the mode of the posterior.

◮ Use a second-order Taylor expansion around the mode of the log-posterior.
◮ Use the expansion to find the equivalent Gaussian in probability space.

SLIDE 38

Laplace approximation

◮ The log of a Gaussian distribution, q(f) = N(f|µ, C), is a quadratic function of f.
◮ A second-order Taylor expansion approximates a function using only up-to-quadratic terms.
◮ The Laplace approximation expands the un-normalised log-posterior, and uses the expansion to set the linear and quadratic terms of log q(f).
◮ The first and second derivatives of the log-posterior at the mode therefore match the derivatives of the approximating Gaussian at the same point.

SLIDE 39

Second-order Taylor expansion

p(f|y) = (1/Z) h(f). In our case: h(f) = p(y|f) p(f).

[Figure: the un-normalised density h(f)]

SLIDE 40

Second-order Taylor expansion

log p(f|y) = log 1/Z + log h(f)

[Figure: log h(f)]

SLIDE 41

Second-order Taylor expansion

log p(f|y) = log 1/Z + log h(f)
           ≈ log 1/Z + log h(a) + d log h(a)/da (f − a) + ½ (f − a)ᵀ d² log h(a)/da² (f − a) + · · ·

SLIDE 42

Second-order Taylor expansion

≈ log 1/Z + log h(a) + d log h(a)/da (f − a) + ½ (f − a)ᵀ d² log h(a)/da² (f − a) + · · ·

We want to make the expansion around the mode, f̂, where the first derivative vanishes:

d log h(a)/da |_{a = f̂} = 0

SLIDE 43

Second-order Taylor expansion

log p(f|y) ≈ log 1/Z + log h(f̂) + d log h(f̂)/df̂ (f − f̂)

[Figure: log h(f) with its first-order Taylor expansion at the mode f̂]

SLIDE 44

Second-order Taylor expansion

log p(f|y) ≈ log 1/Z + log h(f̂) + d log h(f̂)/df̂ (f − f̂) + ½ (f − f̂)ᵀ d² log h(f̂)/df̂² (f − f̂)

[Figure: log h(f) with its first- and second-order Taylor expansions at the mode f̂]

SLIDE 45

Second-order Taylor expansion

log p(f|y) ≈ log 1/Z + log h(f̂) + d log h(f̂)/df̂ (f − f̂) + ½ (f − f̂)ᵀ d² log h(f̂)/df̂² (f − f̂) + · · ·

[Figure: log h(f) with its first-, second-, and third-order Taylor expansions at the mode f̂]

SLIDE 46

Second-order Taylor expansion

log p(f|y) ≈ log 1/Z + log h(f̂) + d log h(f̂)/df̂ (f − f̂) + ½ (f − f̂)ᵀ d² log h(f̂)/df̂² (f − f̂)

[Figure: log h(f) with only the second-order Taylor expansion at the mode f̂]

SLIDE 47

Second-order Taylor expansion

p(f|y) ≈ (1/Z) h(f̂) exp( −½ (f − f̂)ᵀ ( −d² log h(f̂)/df̂² ) (f − f̂) )

[Figure: h(f) against the exponential of the second-order Taylor expansion at f̂]

SLIDE 48

Second-order Taylor expansion

p(f|y) ≈ (1/Z) h(f̂) exp( −½ (f − f̂)ᵀ ( −d² log h(f̂)/df̂² ) (f − f̂) )
       = N( f | f̂, ( −d² log h(f̂)/df̂² )⁻¹ )

[Figure: h(f) and the Gaussian N( f | f̂, (−d² log h(f̂)/df̂²)⁻¹ ) around the mode f̂]
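The recipe above can be checked on a scalar example. A sketch using a hypothetical un-normalised density h(f) = f² e^{−f} (a Gamma(3, 1) shape, not from the slides), chosen because its mode and curvature are known exactly: f̂ = 2 and (−d² log h/df̂²)⁻¹ = 2:

```python
import numpy as np

# un-normalised log density log h(f) = 2 log f - f  (i.e. h(f) = f^2 exp(-f))
def log_h(f):
    return 2.0 * np.log(f) - f

fs = np.linspace(0.1, 10.0, 100001)
fhat = fs[np.argmax(log_h(fs))]          # locate the mode on a fine grid
eps = 1e-4                               # central finite difference for curvature
d2 = (log_h(fhat + eps) - 2.0 * log_h(fhat) + log_h(fhat - eps)) / eps**2
mu, var = fhat, -1.0 / d2                # Laplace: q(f) = N(fhat, (-d2 log h)^{-1})
```

The grid search and finite difference stand in for what, in the GP case, become Newton's method and the analytic Hessian of the next slides.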

SLIDE 49

Laplace approximation for Gaussian processes

In our case, h(f) = p(y|f) p(f), so we need to evaluate

−d² log h(f̂)/df̂² = −d²( log p(y|f̂) + log p(f̂) )/df̂² = −d² log p(y|f̂)/df̂² + K⁻¹ = W + K⁻¹,

giving a posterior approximation: p(f|y) ≈ q(f) = N( f | f̂, (W + K⁻¹)⁻¹ )
SLIDE 50

Laplace approximation - algorithm overview

◮ Find the mode, f̂, of the true log-posterior via Newton's method.
◮ Use a second-order Taylor expansion around this modal value.
◮ Form the Gaussian approximation by setting the mean equal to the posterior mode, f̂, and matching the curvature:
◮ p(f|y) ≈ q(f|µ, C) = N( f | f̂, (K⁻¹ + W)⁻¹ )
◮ W = −d² log p(y|f̂)/df̂²
◮ For factorizing likelihoods (most), W is diagonal.
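The steps above can be sketched end to end. A minimal sketch assuming a logistic (sigmoid) squashing rather than the probit Φ of the earlier slides, purely for algebraic convenience, with an RBF kernel and naive matrix inverses (a real implementation would use Cholesky factorisations):

```python
import numpy as np

def rbf(a, b, l=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_mode(K, y, iters=100, tol=1e-10):
    """Newton's method for the mode of p(f|y) ∝ N(f|0, K) ∏_i Bern(y_i|sigmoid(f_i)).
    At the mode: d log p(y|f)/df - K^{-1} f = 0, i.e. y - sigmoid(f) = K^{-1} f."""
    f = np.zeros(len(y))
    Kinv = np.linalg.inv(K)
    for _ in range(iters):
        pi = sigmoid(f)
        grad = y - pi                      # d log p(y|f) / df
        W = pi * (1.0 - pi)                # W = -d^2 log p(y|f) / df^2 (diagonal)
        f_new = np.linalg.solve(Kinv + np.diag(W), W * f + grad)   # Newton step
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    pi = sigmoid(f)
    W = pi * (1.0 - pi)
    return f, np.linalg.inv(Kinv + np.diag(W))   # q(f) = N(fhat, (K^{-1} + W)^{-1})

x = np.linspace(0, 5, 6)
K = rbf(x, x) + 1e-6 * np.eye(6)
y = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
fhat, C = laplace_mode(K, y)
```

Because the Bernoulli log-likelihood is log-concave, this Newton iteration converges to the unique posterior mode, and W is diagonal exactly as the slide notes.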

SLIDE 51-61

Visualization of Laplace

[Figure sequence, built up over slides 51-61: the log prior log p(f), the log likelihood log p(y = 4|λ(f)), and the log posterior log p(f|y = 4) are plotted over f1; the mode f̂ is located and the curvature is evaluated there]

SLIDE 62

Visualization of Laplace

[Figure: prior p(f), likelihood p(y = 4|λ(f)), posterior p(f|y = 4), the Laplace approximation q(f), and the mode f̂]

SLIDE 63

Visualization of Laplace - Bernoulli

[Figure: prior p(f), likelihood p(y = 1|λ(f)), posterior p(f|y = 1), the Laplace approximation q(f), and the mode f̂]

SLIDE 64

Outline

◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons

SLIDE 65

Variational Bayes (VB)

Task: for some generic random variable, z, and data, y, find a good approximation to the difficult-to-compute posterior distribution, p(z|y).

VB approach: minimise a divergence measure between an approximate posterior, q(z), and the true posterior, p(z|y).

◮ KL divergence, KL[q(z) ‖ p(z|y)].
◮ Minimize this with respect to the parameters of q(z).

slide-66
SLIDE 66

KL divergence

◮ Defined for any two distributions q(x) and p(x).
◮ KL[q(x) || p(x)] is the average additional amount of information lost when p(x) is used to approximate q(x). It is a measure of the divergence of one distribution from another.
◮ KL[q(x) || p(x)] = ∫ q(x) log (q(x) / p(x)) dx
◮ Always zero or positive, and not symmetric.
◮ Let's look at how it changes in response to changes in the approximating distribution.
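For two univariate Gaussians the KL divergence has a standard closed form, which is enough to reproduce the behaviour shown in the following plots:

```python
import numpy as np

# Closed-form KL divergence between two univariate Gaussians,
# KL[ N(mu_q, var_q) || N(mu_p, var_p) ].
def kl_gauss(mu_q, var_q, mu_p, var_p):
    return 0.5 * (np.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p
                  - 1.0)

# Zero only when the two distributions coincide...
print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # 0.0
# ...growing as the mean of q(x) moves away from p(x) = N(0, 1)...
print(kl_gauss(-2.0, 1.0, 0.0, 1.0))  # 2.0
# ...and not symmetric in its two arguments:
print(kl_gauss(0.0, 0.3, 0.0, 1.0), kl_gauss(0.0, 1.0, 0.0, 0.3))
```

The asymmetry in the last line is exactly what distinguishes the VB and EP approximations later in the talk.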

slide-67
SLIDE 67

KL varying mean

[Figure: five panels showing q(x) ~ N(µ, 1) for µ = −2.0, −1.0, 0.0, 1.0, 2.0 against p(x) ~ N(0, 1), with KL[q(x) || p(x)] plotted as a function of µ; the divergence is zero at µ = 0 and grows as the means separate.]

slide-72
SLIDE 72

KL varying variance

[Figure: five panels showing q(x) ~ N(0, σ²) for σ² = 0.3, 0.725, 1.15, 1.575, 2.0 against p(x) ~ N(0, 1), with KL[q(x) || p(x)] plotted as a function of σ²; the divergence is zero at σ² = 1.]

slide-77
SLIDE 77

Variational Bayes

We don't have access to, or can't compute for computational reasons, p(z|y) or p(y), and hence KL[q(z) || p(z|y)]. How can we minimize something we can't compute?

◮ We can compute q(z) and p(y|z) for any z.
◮ q(z) is parameterised by 'variational parameters'.
◮ The true posterior, by Bayes' rule, is p(z|y) = p(y|z)p(z) / p(y).
◮ p(y) doesn't change when the variational parameters are changed.

slide-82
SLIDE 82

Variational Bayes - Derivation

KL[q(z) || p(z|y)] = ∫ q(z) log (q(z) / p(z|y)) dz

= ∫ q(z) [ log (q(z) / p(z)) − log p(y|z) + log p(y) ] dz

= KL[q(z) || p(z)] − ∫ q(z) log p(y|z) dz + log p(y)

Rearranging for log p(y):

log p(y) = ∫ q(z) log p(y|z) dz − KL[q(z) || p(z)] + KL[q(z) || p(z|y)]
slide-83
SLIDE 83

Variational Bayes - Derivation

log p(y) = ∫ q(z) log p(y|z) dz − KL[q(z) || p(z)] + KL[q(z) || p(z|y)]

≥ ∫ q(z) log p(y|z) dz − KL[q(z) || p(z)]

◮ The tractable terms give a lower bound on log p(y), since KL[q(z) || p(z|y)] is always non-negative.
◮ Adjust the variational parameters of q(z) to make the tractable terms as large as possible, and thus KL[q(z) || p(z|y)] as small as possible.
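The decomposition can be verified numerically on a toy problem. Assuming a discrete latent z in {0, 1} (so every quantity is exactly computable; the probabilities below are made up), the tractable terms never exceed log p(y), with equality exactly at the true posterior:

```python
import numpy as np

prior = np.array([0.5, 0.5])        # p(z)
lik = np.array([0.8, 0.1])          # p(y | z) for the observed y
evidence = np.sum(prior * lik)      # p(y)
posterior = prior * lik / evidence  # p(z | y) by Bayes' rule

def elbo(q):
    # The tractable terms: E_q[log p(y|z)] - KL[q(z) || p(z)]
    return np.sum(q * np.log(lik)) - np.sum(q * np.log(q / prior))

# The bound holds for any q(z), with equality when q(z) = p(z|y).
print(elbo(np.array([0.5, 0.5])), np.log(evidence))  # bound below log evidence
print(elbo(posterior), np.log(evidence))             # bound equals log evidence
```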

slide-84
SLIDE 84

VB optimisation illustration

slide-85
SLIDE 85

Variational Bayes for Gaussian processes

◮ Make a Gaussian approximation, q(f) = N(f | µ, C), as similar as possible to the true posterior, p(f|y).
◮ Treat µ and C as 'variational parameters', affecting the quality of the approximation.

KL[q(f) || p(f|y)] = ⟨log (q(f) / p(f|y))⟩_q(f)

= ⟨log (q(f) / p(f)) − log p(y|f) + log p(y)⟩_q(f)

= KL[q(f) || p(f)] − ⟨log p(y|f)⟩_q(f) + log p(y)

log p(y) = ⟨log p(y|f)⟩_q(f) − KL[q(f) || p(f)] + KL[q(f) || p(f|y)]

slide-86
SLIDE 86

Variational Bayes for Gaussian processes - bound

log p(y) = ⟨log p(y|f)⟩_q(f) − KL[q(f) || p(f)] + KL[q(f) || p(f|y)]

≥ ⟨log p(y|f)⟩_q(f) − KL[q(f) || p(f)]

◮ Adjust the variational parameters µ and C to make the tractable terms as large as possible, and thus KL[q(f) || p(f|y)] as small as possible.
◮ With a factorizing likelihood, ⟨log p(y|f)⟩_q(f) can be computed as a series of n one-dimensional integrals.
◮ In practice, the number of variational parameters can be reduced by reparameterizing C = (Kff − 2Λ)⁻¹, noting that the bound is constant in the off-diagonal terms of C.
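One common way to compute each of those n one-dimensional integrals is Gauss-Hermite quadrature. The sketch below assumes a Bernoulli likelihood with a probit link, p(yi|fi) = Φ(yi fi) with yi in {−1, +1}; the specific likelihood is an illustrative choice, not fixed by the bound itself.

```python
import numpy as np
from scipy.stats import norm

# <log p(y_i | f_i)>_{q(f_i)} with q(f_i) = N(mu, v), by Gauss-Hermite quadrature.
def expected_log_lik(y, mu, v, degree=50):
    # Probabilists' Gauss-Hermite nodes and weights; weights sum to sqrt(2*pi).
    x, w = np.polynomial.hermite_e.hermegauss(degree)
    f = mu + np.sqrt(v) * x  # change of variables to integrate against N(mu, v)
    return np.sum(w * norm.logcdf(y * f)) / np.sqrt(2.0 * np.pi)

# With a tiny variance this collapses to log Phi(y * mu); with a larger variance
# it is strictly smaller, since log Phi is concave (Jensen's inequality).
print(expected_log_lik(+1, 2.0, 1e-10), norm.logcdf(2.0))
print(expected_log_lik(+1, 2.0, 0.5))
```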

slide-87
SLIDE 87

VB optimisation illustration for Gaussian processes

slide-88
SLIDE 88

Outline

Motivation Non-Gaussian posteriors Approximate methods Laplace approximation Variational Bayes Expectation propagation Comparisons

slide-89
SLIDE 89

Expectation propagation

p(f|y) ∝ p(f) ∏_{i=1}^n p(yi|fi) ≈ q(f) = (1/Zep) p(f) ∏_{i=1}^n ti(fi | Z̃i, µ̃i, σ̃²i) = N(f | µ, Σ)

with ti = Z̃i N(fi | µ̃i, σ̃²i).

◮ Individual likelihood terms, p(yi|fi), are replaced by independent un-normalised 1D Gaussians, ti.
◮ Uses an iterative algorithm to update the ti's, to get a more and more accurate approximation.

slide-94
SLIDE 94

Expectation propagation

1. Remove one factor ti from the approximation q(f).
2. The approximate marginal q(fi) with the ti contribution removed is called the cavity distribution, q−i(fi).
3. Find the ti that minimises KL[p(yi|fi) q−i(fi)/Zi || q(fi)] by matching moments.
4. Repeat until convergence.

This approximately minimises KL[p(f|y) || q(f)] locally, but not globally.
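The moment-matching step can be done numerically for one site. The sketch below assumes a Bernoulli likelihood with a probit link, p(yi|fi) = Φ(yi fi), and made-up cavity parameters; for the probit case the matched moments also have known closed forms, which the comments use as a check.

```python
import numpy as np
from scipy.stats import norm

# Moments of the tilted distribution p(y_i|f_i) * q_{-i}(f_i) on a dense grid.
def tilted_moments(y, mu_cav, var_cav, n=200001):
    f = np.linspace(-12.0, 12.0, n)
    df = f[1] - f[0]
    # Un-normalised tilted distribution
    tilted = norm.cdf(y * f) * norm.pdf(f, mu_cav, np.sqrt(var_cav))
    Z = np.sum(tilted) * df                            # zeroth moment, Z_hat
    mu = np.sum(f * tilted) * df / Z                   # first moment, mu_hat
    var = np.sum(f ** 2 * tilted) * df / Z - mu ** 2   # central second moment
    return Z, mu, var

Z, mu_hat, var_hat = tilted_moments(+1, 0.0, 1.0)
# Probit closed forms: Z_hat = Phi(mu_cav / sqrt(1 + var_cav)) = 0.5 here,
# mu_hat = 1/sqrt(pi) ~ 0.564, var_hat = 1 - 1/pi ~ 0.682.
print(Z, mu_hat, var_hat)
```

The site parameters Z̃i, µ̃i, σ̃²i are then chosen so that the new marginal q(fi) has exactly these moments.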

slide-97
SLIDE 97

Expectation propagation - in math

Steps 1 & 2. First choose a local likelihood contribution, i, to leave out, and find the marginal cavity distribution:

q(f|y) ∝ p(f) ∏_{j=1}^n tj(fj) → p(f) ∏_{j≠i} tj(fj) → ∫ p(f) ∏_{j≠i} tj(fj) df_{j≠i} ∝ q−i(fi)

Step 3.1. Find the Gaussian closest to the tilted distribution:

q̂(fi) ≈ min KL[ p(yi|fi) q−i(fi) / Ẑi || N(fi | µ̂i, σ̂²i) ]

Step 3.2. Compute the parameters of ti(fi | Z̃i, µ̃i, σ̃²i) making the moments of q(fi) match those of Ẑi N(fi | µ̂i, σ̂²i).
slide-98
SLIDE 98

Outline

Motivation Non-Gaussian posteriors Approximate methods Laplace approximation Variational Bayes Expectation propagation Comparisons

slide-99
SLIDE 99

Comparing posterior approximations

[Figure: contour plots over (f1, f2) of the Gaussian prior p(f1, f2) and the likelihood p(y = 1 | f1, f2).]

◮ Gaussian prior between two function values {f1, f2}, at {x1, x2} respectively.
◮ Bernoulli likelihood, with y1 = 1 and y2 = 1.

slide-100
SLIDE 100

Comparing posterior approximations

[Figure: contour plots over (f1, f2) of the true posterior and the Laplace approximation.]

◮ p(f|y) = p(y|f)p(f) / p(y)
◮ The true posterior is non-Gaussian.
◮ Laplace approximates it with a Gaussian centred at the mode of the posterior.

slide-101
SLIDE 101

Comparing posterior approximations

[Figure: contour plots over (f1, f2) of the true posterior and the KL (VB) approximation.]

◮ The true posterior is non-Gaussian.
◮ VB approximates it with the Gaussian that has minimal KL divergence, KL[q(f) || p(f|y)].
◮ This leads to distributions that avoid regions in which p(f|y) is small.
◮ There is a large penalty for assigning density where there is none.

slide-102
SLIDE 102

Comparing posterior approximations

[Figure: contour plots over (f1, f2) of the true posterior and the EP approximation.]

◮ The true posterior is non-Gaussian.
◮ EP tends to put density where p(f|y) is large.
◮ It cares less about assigning density where there is none. This contrasts with the VB method.
slide-103
SLIDE 103

Comparing posterior marginal approximations

[Figure: marginals for f2 compared — the Laplace, EP, and KL (VB) approximations against the true posterior.]

◮ Laplace: a poor approximation.
◮ VB: avoids assigning density to areas where there is none, at the expense of areas where there is some (right tail).
◮ EP: assigns density to areas with density, at the expense of areas where there is none (left tail).

slide-104
SLIDE 104

Pros - Cons - When - Laplace

Laplace approximation

◮ Pros
  ◮ Simple to implement.
  ◮ Fast.
◮ Cons
  ◮ Poor approximation if the mode does not describe the posterior well, for example with a Bernoulli likelihood.
◮ When
  ◮ When the posterior is well characterized by its mode, for example with a Poisson likelihood.

slide-105
SLIDE 105

Pros - Cons - When - VB

Variational Bayes

◮ Pros
  ◮ Principled, in that we directly optimize a measure of divergence between the approximation and the true distribution.
  ◮ Lends itself to sparse extensions.
◮ Cons
  ◮ Requires factorizing likelihoods to avoid an n-dimensional integral.
  ◮ As seen, can underestimate the variance, i.e. become overconfident.
◮ When
  ◮ Applicable to a range of likelihoods.
  ◮ Might need care if you wish to be conservative with predictive uncertainty.

slide-106
SLIDE 106

Pros - Cons - When - EP

EP method

◮ Pros
  ◮ Very effective for certain likelihoods (classification).
  ◮ Also lends itself to sparse approximations.
◮ Cons
  ◮ The standard algorithm is slow, though it is possible to extend it to the sparse case.
  ◮ Not always guaranteed to converge.
  ◮ Can be brittle with initialisation and tricky to implement.
◮ When
  ◮ Binary data (Nickisch and Rasmussen, 2008; Kuß, 2006), perhaps with a truncated likelihood (censored data) (Vanhatalo et al., 2015).
  ◮ In conjunction with sparse methods.

slide-107
SLIDE 107

Pros - Cons - When - MCMC

MCMC methods

◮ Pros
  ◮ In the theoretical limit, gives the true distribution.
◮ Cons
  ◮ Can be very slow.
◮ When
  ◮ If time is not an issue, but exact accuracy is.
  ◮ If you are unsure whether a different approximation is appropriate, it can be used as a "ground truth".

slide-108
SLIDE 108

Conclusion

◮ Many real-world tasks require non-Gaussian observation models.
◮ Non-Gaussian likelihoods cause complications in applying our framework.
◮ There are several different ways to deal with the problem; many are based on Gaussian approximations.
◮ Different methods have their own advantages and disadvantages.

slide-109
SLIDE 109

Questions

Thanks for listening. Any questions?

slide-110
SLIDE 110

Bonus - Heteroscedastic likelihoods

◮ A likelihood whose parameters are governed by two latent functions, f and g.
◮ p(y|f, g) = N(y | µ = f, σ² = exp(g))
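Sampling from this likelihood makes the role of the two functions concrete. The particular f and g below are made-up stand-ins (in the model each would carry a GP prior); the exp link simply keeps the variance positive.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 200)
f = np.sin(3.0 * x)  # stand-in mean function
g = x                # stand-in log-variance function: noise grows with x

# Heteroscedastic Gaussian: p(y | f, g) = N(y | f, exp(g))
y = rng.normal(loc=f, scale=np.sqrt(np.exp(g)))
print(y.shape)
```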
slide-111
SLIDE 111

Bonus - non-Gaussian heteroscedastic likelihoods

[Figure: three panels comparing a standard Gaussian process, a heteroscedastic Gaussian, and a heteroscedastic Student-t fit to the same data.]

◮ A likelihood whose parameters are governed by two latent functions, f and g.
◮ p(y|f, g) = t(y | µ = f, σ² = exp(g), ν = 3.0)

slide-112
SLIDE 112

Bonus - non-Gaussian heteroscedastic likelihoods

[Figure: two spatial maps (contours over latitude/longitude) and two time series over 2006-2016, showing the spatial and temporal components of the intensity.]

◮ Λ(x, t) = λ1(x)µ1(t) + λ2(x)µ2(t)

slide-113
SLIDE 113

References I

Hensman, J., Matthews, A. G. D. G., and Ghahramani, Z. (2015). Scalable variational Gaussian process classification. In 18th International Conference on Artificial Intelligence and Statistics, pages 1-9, San Diego, California, USA.

Kuß, M. (2006). Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning. PhD thesis, TU Darmstadt.

Nickisch, H. and Rasmussen, C. E. (2008). Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research, 9:2035-2078.

Vanhatalo, J., Riihimäki, J., Hartikainen, J., Jylänki, P., Tolvanen, V., and Vehtari, A. (2015). GPstuff. http://mloss.org/software/view/451/.