SLIDE 1

Gaussian Processes Covariance Functions and Classification

Carl Edward Rasmussen

Max Planck Institute for Biological Cybernetics, Tübingen, Germany

Gaussian Processes in Practice, Bletchley Park, July 12th, 2006

SLIDE 2

Outline

Covariance functions encode structure. You can learn about them by

  • sampling,
  • optimizing the marginal likelihood.

GPs with various covariance functions are equivalent to many well known models: large neural networks, splines, relevance vector machines, ...

  • regression with infinitely many Gaussian bumps
  • Rational Quadratic and Matérn

Quick two-page recap of GP regression.

Approximate inference for Gaussian process classification: replace the non-Gaussian intractable posterior by a Gaussian. Expectation Propagation.

SLIDE 3

From random functions to covariance functions

Consider the class of functions (sums of squared exponentials):

f(x) = \lim_{n \to \infty} \frac{1}{n} \sum_i \gamma_i \exp(-(x - i/n)^2), where \gamma_i \sim N(0, 1), \forall i
     = \int_{-\infty}^{\infty} \gamma(u) \exp(-(x - u)^2) \, du, where \gamma(u) \sim N(0, 1), \forall u.

The mean function is:

\mu(x) = E[f(x)] = \int_{-\infty}^{\infty} \exp(-(x - u)^2) \int_{-\infty}^{\infty} \gamma \, p(\gamma) \, d\gamma \, du = 0,

and the covariance function:

E[f(x) f(x')] = \int \exp\big(-(x - u)^2 - (x' - u)^2\big) \, du
              = \int \exp\Big(-2\big(u - \tfrac{x + x'}{2}\big)^2 + \tfrac{(x + x')^2}{2} - x^2 - x'^2\Big) du \;\propto\; \exp\Big(-\tfrac{(x - x')^2}{2}\Big).

Thus, the squared exponential covariance function is equivalent to regression using infinitely many Gaussian shaped basis functions placed everywhere, not just at your training points!
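To make this concrete, here is a minimal sketch (not from the slides; it assumes numpy and an illustrative unit length-scale) that draws sample functions directly from a GP prior with the SE covariance, with no explicit basis functions anywhere:

```python
import numpy as np

def k_se(x1, x2, ell=1.0):
    """Squared exponential covariance between two sets of 1-d inputs."""
    d = x1[:, None] - x2[None, :]              # pairwise differences
    return np.exp(-0.5 * (d / ell) ** 2)

# Draw three functions from the GP prior: each curve is one joint Gaussian
# draw, equivalent to regression with infinitely many Gaussian bumps.
x = np.linspace(-5.0, 5.0, 200)
K = k_se(x, x) + 1e-10 * np.eye(len(x))        # jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```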

SLIDE 4

Why is it dangerous to use only finitely many basis functions?

[Figure: illustration over the input range −10 to 10; the question mark highlights the behaviour of a finite basis-function model away from the basis functions.]

SLIDE 5

Rational quadratic covariance function

The rational quadratic (RQ) covariance function:

k_{RQ}(r) = \Big(1 + \frac{r^2}{2 \alpha \ell^2}\Big)^{-\alpha}, with \alpha, \ell > 0,

can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with different characteristic length-scales. Using \tau = \ell^{-2} and p(\tau | \alpha, \beta) \propto \tau^{\alpha - 1} \exp(-\alpha \tau / \beta):

k_{RQ}(r) = \int p(\tau | \alpha, \beta) \, k_{SE}(r | \tau) \, d\tau \;\propto\; \int \tau^{\alpha - 1} \exp\Big(-\frac{\alpha \tau}{\beta}\Big) \exp\Big(-\frac{\tau r^2}{2}\Big) d\tau \;\propto\; \Big(1 + \frac{r^2}{2 \alpha \ell^2}\Big)^{-\alpha}.
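As a quick numerical check of the scale-mixture view (a sketch, assuming numpy; the length-scale and distance grid are illustrative), the RQ covariance approaches the SE covariance as α grows:

```python
import numpy as np

def k_rq(r, ell=1.0, alpha=2.0):
    """Rational quadratic covariance as a function of input distance r."""
    return (1.0 + r ** 2 / (2.0 * alpha * ell ** 2)) ** (-alpha)

def k_se(r, ell=1.0):
    """Squared exponential covariance, the alpha -> infinity limit of the RQ."""
    return np.exp(-0.5 * r ** 2 / ell ** 2)

r = np.linspace(0.0, 3.0, 7)
for alpha in (0.5, 2.0, 1e6):
    # The maximum discrepancy from the SE shrinks as alpha increases.
    print(alpha, np.max(np.abs(k_rq(r, alpha=alpha) - k_se(r))))
```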

SLIDE 6

Rational quadratic covariance function II

[Figure: left, the RQ covariance as a function of input distance for α = 1/2, α = 2 and α → ∞; right, corresponding sample functions, output f(x) against input x.]

The limit α → ∞ of the RQ covariance function is the SE.

SLIDE 7

Matérn covariance functions

Stationary covariance functions can be based on the Matérn form:

k(x, x') = \frac{1}{\Gamma(\nu) 2^{\nu - 1}} \Big(\frac{\sqrt{2\nu}}{\kappa} |x - x'|\Big)^{\nu} K_\nu\Big(\frac{\sqrt{2\nu}}{\kappa} |x - x'|\Big),

where K_\nu is the modified Bessel function of the second kind of order \nu, and \kappa is the characteristic length scale. Sample functions from Matérn forms are \lceil \nu \rceil - 1 times differentiable; thus, the hyperparameter \nu can control the degree of smoothness.

SLIDE 8

Matérn covariance functions II

Univariate Matérn covariance function with unit characteristic length scale and unit variance:

[Figure: left, the covariance as a function of input distance; right, sample functions, output f(x) against input x, for ν = 1/2, 1, 2 and ν → ∞.]

SLIDE 9

Matérn covariance functions II

It is possible that the most interesting cases for machine learning are \nu = 3/2 and \nu = 5/2, for which

k_{\nu = 3/2}(r) = \Big(1 + \frac{\sqrt{3} r}{\ell}\Big) \exp\Big(-\frac{\sqrt{3} r}{\ell}\Big),

k_{\nu = 5/2}(r) = \Big(1 + \frac{\sqrt{5} r}{\ell} + \frac{5 r^2}{3 \ell^2}\Big) \exp\Big(-\frac{\sqrt{5} r}{\ell}\Big).

Other special cases (see the code sketch below):

  • \nu = 1/2: Laplacian covariance function; sample functions: stationary Brownian motion
  • \nu \to \infty: Gaussian covariance function with smooth (infinitely differentiable) sample functions
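A small sketch of these special cases (assuming numpy; the unit length-scale is illustrative):

```python
import numpy as np

def matern(r, ell=1.0, nu=1.5):
    """Matern covariance for nu in {1/2, 3/2, 5/2} and the SE limit nu -> inf."""
    r = np.abs(r)
    if nu == 0.5:                        # Laplacian / exponential covariance
        return np.exp(-r / ell)
    if nu == 1.5:
        s = np.sqrt(3.0) * r / ell
        return (1.0 + s) * np.exp(-s)
    if nu == 2.5:
        s = np.sqrt(5.0) * r / ell
        return (1.0 + s + 5.0 * r ** 2 / (3.0 * ell ** 2)) * np.exp(-s)
    if np.isinf(nu):                     # SE (Gaussian) covariance
        return np.exp(-0.5 * r ** 2 / ell ** 2)
    raise NotImplementedError("only nu in {1/2, 3/2, 5/2, inf} in this sketch")
```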

SLIDE 10

A Comparison

[Figure] Left: the SE covariance function, log marginal likelihood −15.6; right: the Matérn covariance function with ν = 3/2, log marginal likelihood −18.0.

SLIDE 11

GP regression recap

We use a Gaussian process prior for the latent function:

f | X, \theta \sim N(0, K).

The likelihood is a factorized Gaussian:

y | f \sim \prod_{i=1}^{m} N(y_i | f_i, \sigma_n^2).

The posterior is Gaussian:

p(f | D, \theta) = \frac{p(f | X, \theta) \, p(y | f)}{p(D | \theta)}.

The latent value at the test point, f(x_*), is Gaussian:

p(f_* | D, \theta, x_*) = \int p(f_* | f, X, \theta, x_*) \, p(f | D, \theta) \, df,

and the predictive distribution is Gaussian:

p(y_* | D, \theta, x_*) = \int p(y_* | f_*) \, p(f_* | D, \theta, x_*) \, df_*.

SLIDE 12

Prior and posterior

[Figure: functions drawn from the GP prior (left) and from the posterior (right), output f(x) against input x.]

Predictive distribution:

p(y_* | x_*, x, y) \sim N\big(k(x_*, x)^\top [K + \sigma_{\mathrm{noise}}^2 I]^{-1} y,\;\; k(x_*, x_*) + \sigma_{\mathrm{noise}}^2 - k(x_*, x)^\top [K + \sigma_{\mathrm{noise}}^2 I]^{-1} k(x_*, x)\big).
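A minimal implementation sketch of this predictive distribution (assuming numpy and a kernel function k(X1, X2) that returns the covariance matrix, e.g. the SE sketch above; the noise level is illustrative):

```python
import numpy as np

def gp_predict(X, y, Xs, k, sigma_noise=0.1):
    """Predictive mean and variance of GP regression with Gaussian noise."""
    Ky = k(X, X) + sigma_noise ** 2 * np.eye(len(X))   # K + sigma_noise^2 I
    Ks = k(X, Xs)                                      # k(x, x*)
    L = np.linalg.cholesky(Ky)                         # stable solves via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha                                # k(x*,x)^T [K + s^2 I]^-1 y
    v = np.linalg.solve(L, Ks)
    var = np.diag(k(Xs, Xs)) + sigma_noise ** 2 - np.sum(v ** 2, axis=0)
    return mean, var
```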

SLIDE 13

The marginal likelihood

To choose between models M_1, M_2, \ldots, compare the posteriors of the models:

p(M_i | D) = \frac{p(y | x, M_i) \, p(M_i)}{p(D)}.

The log marginal likelihood

\log p(y | x, M_i) = -\tfrac{1}{2} y^\top K^{-1} y - \tfrac{1}{2} \log |K| - \tfrac{n}{2} \log(2\pi)

is the combination of a data fit term and a complexity penalty: Occam’s Razor is automatic.
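In code, the same quantity can be evaluated as follows (a sketch with the same assumptions as the regression snippet above; here the covariance of the noisy targets, K + σ_n² I, plays the role of K):

```python
import numpy as np

def log_marginal_likelihood(X, y, k, sigma_noise=0.1):
    """log p(y|x) = -1/2 y^T K^-1 y - 1/2 log|K| - n/2 log(2 pi)."""
    n = len(y)
    Ky = k(X, X) + sigma_noise ** 2 * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                    # data fit term
    complexity = -np.sum(np.log(np.diag(L)))       # -1/2 log|K|, the penalty
    return data_fit + complexity - 0.5 * n * np.log(2.0 * np.pi)
```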

SLIDE 14

Binary Gaussian Process Classification

[Figure: left, the latent function f(x) against input x; right, the class probability π(x) against input x.]

The class probability is related to the latent function through:

p(y = 1 | f(x)) = \pi(x) = \Phi(f(x)).

Observations are independent given f, so the likelihood is

p(y | f) = \prod_{i=1}^{n} p(y_i | f_i) = \prod_{i=1}^{n} \Phi(y_i f_i).

SLIDE 15

Likelihood functions

The logistic likelihood, (1 + \exp(-y_i f_i))^{-1}, and the probit likelihood, \Phi(y_i f_i), and their derivatives:

[Figure: the log likelihood, log p(y_i | f_i), and its first and second derivatives as functions of z_i = y_i f_i, for the logistic (left) and probit (right) likelihoods.]

SLIDE 16

Exact expressions

We use a Gaussian process prior for the latent function:

f | X, \theta \sim N(0, K).

The posterior becomes:

p(f | D, \theta) = \frac{p(f | X, \theta) \, p(y | f)}{p(D | \theta)} = \frac{N(f | 0, K)}{p(D | \theta)} \prod_{i=1}^{m} \Phi(y_i f_i),

which is non-Gaussian. The latent value at the test point, f(x_*), is

p(f_* | D, \theta, x_*) = \int p(f_* | f, X, \theta, x_*) \, p(f | D, \theta) \, df,

and the predictive class probability becomes

p(y_* | D, \theta, x_*) = \int p(y_* | f_*) \, p(f_* | D, \theta, x_*) \, df_*,

both of which are intractable to compute.

SLIDE 17

Gaussian Approximation to the Posterior

We approximate the non-Gaussian posterior by a Gaussian:

p(f | D, \theta) \simeq q(f | D, \theta) = N(m, A),

then q(f_* | D, \theta, x_*) = N(f_* | \mu_*, \sigma_*^2), where

\mu_* = k_*^\top K^{-1} m,

\sigma_*^2 = k(x_*, x_*) - k_*^\top (K^{-1} - K^{-1} A K^{-1}) k_*.

Using this approximation:

q(y_* = 1 | D, \theta, x_*) = \int \Phi(f_*) \, N(f_* | \mu_*, \sigma_*^2) \, df_* = \Phi\Big(\frac{\mu_*}{\sqrt{1 + \sigma_*^2}}\Big).
SLIDE 18

What Gaussian?

Some suggestions:

  • local expansion: Laplace’s method
  • optimize a variational lower bound (using Jensen’s inequality):

    \log p(y | X) = \log \int p(y | f) \, p(f) \, df \;\ge\; \int \log\Big(\frac{p(y | f) \, p(f)}{q(f)}\Big) q(f) \, df

  • the Expectation Propagation (EP) algorithm

SLIDE 19

Expectation Propagation

Posterior:

p(f | X, y) = \frac{1}{Z} p(f | X) \prod_{i=1}^{n} p(y_i | f_i),

where the normalizing term is the marginal likelihood

Z = p(y | X) = \int p(f | X) \prod_{i=1}^{n} p(y_i | f_i) \, df.

The exact likelihood, p(y_i | f_i) = \Phi(f_i y_i), makes inference intractable. In EP we use a local likelihood approximation

p(y_i | f_i) \simeq t_i(f_i | \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \tilde{Z}_i \, N(f_i | \tilde{\mu}_i, \tilde{\sigma}_i^2),

where the site parameters are \tilde{Z}_i, \tilde{\mu}_i and \tilde{\sigma}_i^2, such that:

\prod_{i=1}^{n} t_i(f_i | \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = N(\tilde{\mu}, \tilde{\Sigma}) \prod_i \tilde{Z}_i.

SLIDE 20

Expectation Propagation II

We approximate the posterior by:

q(f | X, y) \triangleq \frac{1}{Z_{EP}} p(f | X) \prod_{i=1}^{n} t_i(f_i | \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = N(\mu, \Sigma),

with \mu = \Sigma \tilde{\Sigma}^{-1} \tilde{\mu} and \Sigma = (K^{-1} + \tilde{\Sigma}^{-1})^{-1}.

How do we choose the site parameters? Key idea: iteratively update each site in turn, based on the approximation so far. The approximate posterior for f_i contains three kinds of terms:

1. the prior p(f | X)
2. the approximate likelihoods t_j for all cases j ≠ i
3. the exact likelihood for case i, p(y_i | f_i).

SLIDE 21

The Cavity distribution

The cavity distribution

q_{-i}(f_i) \propto \int p(f | X) \prod_{j \ne i} t_j(f_j | \tilde{Z}_j, \tilde{\mu}_j, \tilde{\sigma}_j^2) \, df_j

can be found by “removing” one term from the approximate posterior marginal q(f_i | X, y) = N(f_i | \mu_i, \sigma_i^2), to get:

q_{-i}(f_i) \triangleq N(f_i | \mu_{-i}, \sigma_{-i}^2),

where \mu_{-i} = \sigma_{-i}^2 (\sigma_i^{-2} \mu_i - \tilde{\sigma}_i^{-2} \tilde{\mu}_i), and \sigma_{-i}^2 = (\sigma_i^{-2} - \tilde{\sigma}_i^{-2})^{-1}.

Now, find \hat{q}(f_i) which matches the desired:

\hat{q}(f_i) \triangleq \hat{Z}_i \, N(\hat{\mu}_i, \hat{\sigma}_i^2) \simeq q_{-i}(f_i) \, p(y_i | f_i),

by matching moments.

SLIDE 22

Expectation Propagation III

The desired moments can be computed in closed form:

\hat{Z}_i = \Phi(z_i),

\hat{\mu}_i = \mu_{-i} + \frac{y_i \, \sigma_{-i}^2 \, N(z_i)}{\Phi(z_i) \sqrt{1 + \sigma_{-i}^2}},

\hat{\sigma}_i^2 = \sigma_{-i}^2 - \frac{\sigma_{-i}^4 \, N(z_i)}{(1 + \sigma_{-i}^2) \, \Phi(z_i)} \Big(z_i + \frac{N(z_i)}{\Phi(z_i)}\Big),

where z_i = \frac{y_i \mu_{-i}}{\sqrt{1 + \sigma_{-i}^2}}. These moments are achieved by setting the site parameters to:

\tilde{\mu}_i = \tilde{\sigma}_i^2 (\hat{\sigma}_i^{-2} \hat{\mu}_i - \sigma_{-i}^{-2} \mu_{-i}),

\tilde{\sigma}_i^2 = (\hat{\sigma}_i^{-2} - \sigma_{-i}^{-2})^{-1},

\tilde{Z}_i = \hat{Z}_i \sqrt{2\pi} \sqrt{\sigma_{-i}^2 + \tilde{\sigma}_i^2} \, \exp\Big(\tfrac{1}{2} (\mu_{-i} - \tilde{\mu}_i)^2 / (\sigma_{-i}^2 + \tilde{\sigma}_i^2)\Big).
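A sketch of a single site update built from these formulas (assuming numpy and scipy; mu_i, sigma2_i denote the current approximate posterior marginal moments of f_i and mu_tilde_i, sigma2_tilde_i the current site parameters; a full EP implementation would sweep over the sites repeatedly and recompute the joint Gaussian N(μ, Σ) after each update):

```python
import numpy as np
from scipy.stats import norm

def ep_site_update(y_i, mu_i, sigma2_i, mu_tilde_i, sigma2_tilde_i):
    """One EP update for a probit site: cavity, moment matching, new site params."""
    # Cavity distribution: remove site i from the approximate marginal.
    sigma2_cav = 1.0 / (1.0 / sigma2_i - 1.0 / sigma2_tilde_i)
    mu_cav = sigma2_cav * (mu_i / sigma2_i - mu_tilde_i / sigma2_tilde_i)

    # Moments of cavity times the exact probit likelihood.
    z = y_i * mu_cav / np.sqrt(1.0 + sigma2_cav)
    ratio = norm.pdf(z) / norm.cdf(z)                  # N(z) / Phi(z)
    mu_hat = mu_cav + y_i * sigma2_cav * ratio / np.sqrt(1.0 + sigma2_cav)
    sigma2_hat = sigma2_cav - sigma2_cav ** 2 * ratio * (z + ratio) / (1.0 + sigma2_cav)

    # New site parameters that reproduce the matched moments.
    sigma2_tilde_new = 1.0 / (1.0 / sigma2_hat - 1.0 / sigma2_cav)
    mu_tilde_new = sigma2_tilde_new * (mu_hat / sigma2_hat - mu_cav / sigma2_cav)
    return mu_tilde_new, sigma2_tilde_new
```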

SLIDE 23

The EP approximation

[Figure: the likelihood, cavity distribution, exact posterior, and Gaussian approximation for a single site, shown on two scales.]

SLIDE 24

Predictive distribution

The latent predictive mean:

E_q[f_* | X, y, x_*] = k_*^\top K^{-1} \mu = k_*^\top K^{-1} (K^{-1} + \tilde{\Sigma}^{-1})^{-1} \tilde{\Sigma}^{-1} \tilde{\mu} = k_*^\top (K + \tilde{\Sigma})^{-1} \tilde{\mu},

and variance:

V_q[f_* | X, y, x_*] = k(x_*, x_*) - k_*^\top (K + \tilde{\Sigma})^{-1} k_*,

which can be plugged into the class probability equation:

q(y_* = 1 | D, \theta, x_*) = \int \Phi(f_*) \, N(f_* | \mu_*, \sigma_*^2) \, df_* = \Phi\Big(\frac{\mu_*}{\sqrt{1 + \sigma_*^2}}\Big).

SLIDE 25

Marginal Likelihood

The EP approximation to the marginal likelihood:

Z_{EP} = q(y | X) = \int p(f) \prod_{i=1}^{n} t_i(f_i | \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) \, df,

which evaluates to:

\log(Z_{EP} | \theta) = -\tfrac{1}{2} \log |K + \tilde{\Sigma}| - \tfrac{1}{2} \tilde{\mu}^\top (K + \tilde{\Sigma})^{-1} \tilde{\mu} + \sum_{i=1}^{n} \log \Phi\Big(\frac{y_i \mu_{-i}}{\sqrt{1 + \sigma_{-i}^2}}\Big) + \tfrac{1}{2} \sum_{i=1}^{n} \log(\sigma_{-i}^2 + \tilde{\sigma}_i^2) + \sum_{i=1}^{n} \frac{(\mu_{-i} - \tilde{\mu}_i)^2}{2 (\sigma_{-i}^2 + \tilde{\sigma}_i^2)},

which has a nice interpretation. It is possible to analytically evaluate the derivatives of the estimated log marginal likelihood w.r.t. the hyperparameters.
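A direct transcription of this expression into code (a sketch, assuming numpy and scipy; mu_cav, sigma2_cav hold the cavity parameters μ₋ᵢ, σ₋ᵢ² and mu_tilde, sigma2_tilde the site parameters, all as 1-d arrays):

```python
import numpy as np
from scipy.stats import norm

def log_z_ep(K, y, mu_tilde, sigma2_tilde, mu_cav, sigma2_cav):
    """EP approximation to the log marginal likelihood, log Z_EP."""
    B = K + np.diag(sigma2_tilde)                       # K + Sigma_tilde
    L = np.linalg.cholesky(B)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, mu_tilde))
    term1 = -np.sum(np.log(np.diag(L)))                 # -1/2 log|K + Sigma_tilde|
    term2 = -0.5 * mu_tilde @ alpha
    z = y * mu_cav / np.sqrt(1.0 + sigma2_cav)
    term3 = np.sum(norm.logcdf(z))                      # sum_i log Phi(z_i)
    term4 = 0.5 * np.sum(np.log(sigma2_cav + sigma2_tilde))
    term5 = np.sum((mu_cav - mu_tilde) ** 2 / (2.0 * (sigma2_cav + sigma2_tilde)))
    return term1 + term2 + term3 + term4 + term5
```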

SLIDE 26

Example

[Figure: left, the predictive class probability p(y = 1 | x) under the Laplace approximation, EP, and the true p(y | X), with Class 1 and Class −1 data; right, the corresponding latent function f(x) under the Laplace and EP approximations.]

SLIDE 27

USPS Digits, 3s vs 5s

[Figure: contour plots over log lengthscale, log(ℓ), and log magnitude, log(σf), with contour levels ranging from −200 to −92 and from 0.25 to 0.89.]

SLIDE 28

USPS Digits, 3s vs 5s

[Figure: MCMC samples of a latent value f compared with the Laplace and EP approximations to p(f|D), for two cases.]

SLIDE 29

The Structure of the posterior

[Figure: a one-dimensional marginal, p(xi) against xi.]

SLIDE 30

Conclusions

Covariance functions for Gaussian processes

  • encode useful information about the functions
  • can be learnt from the data

Whereas inference for regression with Gaussian noise can be done in closed form,

  • non-Gaussian likelihoods (as e.g. in classification) cannot
  • (many) good approximations exist

For the details: Rasmussen and Williams, ‘Gaussian Processes for Machine Learning’, the MIT Press, 2006. For the (Matlab) code: www.GaussianProcess.org/gpml.
