SLIDE 1

CSC2541 Lecture 2 Bayesian Occam’s Razor and Gaussian Processes

Roger Grosse

Roger Grosse CSC2541 Lecture 2 Bayesian Occam’s Razor and Gaussian Processes 1 / 55

SLIDE 2

Adminis-Trivia

Did everyone get my e-mail last week?

If not, let me know. You can find the announcement on Blackboard.

Sign up on Piazza.
Is everyone signed up for a presentation slot?
Form project groups of 3–5. If you don’t know people, try posting to Piazza.

SLIDE 4

Advice on Readings

4–6 readings per week; many are fairly mathematical.
They get lighter later in the term.
Don’t worry about learning every detail. Try to understand the main ideas so you know when you should refer to them.

What problem are they trying to solve?
What is their contribution?
How does it relate to the other papers?
What evidence do they present? Is it convincing?

Reading mathematical material

You’ll get to use software packages, so there is no need to go through the derivations line by line.
What assumptions are they making, and how are those used?
What is the main insight?
Formulas: if you change one variable, how do other things vary?
What guarantees do they obtain? How do those relate to the other algorithms we cover?

Don’t let it become a chore. I chose readings where you still get something from them even if you don’t absorb every detail.

SLIDE 5

This Lecture

Linear regression and smoothing splines
Bayesian linear regression
“Bayesian Occam’s Razor”
Gaussian processes
We’ll put off the Automatic Statistician for later

SLIDE 6

Function Approximation

Many machine learning tasks can be viewed as function approximation, e.g.

object recognition (image → category)
speech recognition (waveform → text)
machine translation (French → English)
generative modeling (noise → image)
reinforcement learning (state → value, or state → action)

In the last few years, neural nets have revolutionized all of these domains, since they’re really good function approximators.
Much of this class will focus on being Bayesian about function approximation.

SLIDE 7

Review: Linear Regression

Probably the simplest function approximator is linear regression. This is a useful starting point since we can solve and analyze it analytically.

Given a training set of inputs and targets $\{(x^{(i)}, t^{(i)})\}_{i=1}^N$
Linear model: $y = w^\top x + b$
Squared error loss: $L(y, t) = \frac{1}{2}(t - y)^2$
Solution 1: solve analytically by setting the gradient to 0: $w = (X^\top X)^{-1} X^\top t$
Solution 2: solve approximately using gradient descent: $w \leftarrow w - \alpha X^\top (y - t)$
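The two solutions above can be compared with a minimal NumPy sketch. The synthetic dataset, the step size `alpha`, the step count, and the function names are all assumptions made for illustration; the bias is absorbed by appending a column of ones to `X`.

```python
import numpy as np

def fit_normal_equations(X, t):
    """Closed-form least squares: solve (X^T X) w = X^T t."""
    return np.linalg.solve(X.T @ X, X.T @ t)

def fit_gradient_descent(X, t, alpha=0.01, num_steps=5000):
    """Gradient descent on the summed squared error: w <- w - alpha X^T (y - t)."""
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        y = X @ w                      # current predictions
        w = w - alpha * X.T @ (y - t)  # step along the negative gradient
    return w

# Synthetic data (assumed for this example): t = 2x - 0.5 + noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
X = np.column_stack([x, np.ones_like(x)])  # last column absorbs the bias b
t = 2.0 * x - 0.5 + 0.1 * rng.standard_normal(50)

w_exact = fit_normal_equations(X, t)
w_gd = fit_gradient_descent(X, t)
print(w_exact, w_gd)
```

With a small enough step size, the gradient descent iterates should approach the closed-form solution; a step size that is too large makes the iteration diverge.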

SLIDE 8

Nonlinear Regression: Basis Functions

We can model a function as linear in a set of basis functions (i.e. a feature mapping): $y = w^\top \phi(x)$
E.g., we can fit a degree-$k$ polynomial using the mapping $\phi(x) = (1, x, x^2, \ldots, x^k)$.
Exactly the same algorithms/formulas as ordinary linear regression: just pretend $\phi(x)$ are the inputs!

Best-fitting cubic polynomial:

[Figure: cubic ($M = 3$) polynomial fit, $t$ plotted against $x$]

— Bishop, Pattern Recognition and Machine Learning
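As a rough sketch of "pretend $\phi(x)$ are the inputs", the snippet below fits a cubic by ordinary least squares on polynomial features. The noisy sinusoid is an assumed toy dataset (similar in spirit to the figure, but not its actual data), and `poly_features` is an illustrative name.

```python
import numpy as np

def poly_features(x, k):
    """Map scalar inputs to phi(x) = (1, x, x^2, ..., x^k), one row per input."""
    return np.vander(x, N=k + 1, increasing=True)

# Toy data (assumed): a noisy sinusoid on [0, 1].
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.shape)

Phi = poly_features(x, k=3)                  # design matrix, shape (10, 4)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # ordinary least squares on the features
y = Phi @ w                                  # fitted cubic evaluated at the inputs
print(w)
```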

Before 2012, feature engineering was the hardest part of building many AI systems. Now it’s done automatically with neural nets.

SLIDE 10

Nonlinear Regression: Smoothing Splines

An alternative approach to nonlinear regression: fit an arbitrary function, but encourage it to be smooth. This is called a smoothing spline.

$$E(f, \lambda) = \underbrace{\sum_{i=1}^N (t^{(i)} - f(x^{(i)}))^2}_{\text{mean squared error}} + \lambda \underbrace{\int (f''(z))^2 \, dz}_{\text{regularizer}}$$

What happens for $\lambda = 0$? $\lambda = \infty$?

Even though $f$ is unconstrained, it turns out the optimal $f$ can be expressed as a linear combination of (data-dependent) basis functions.

I.e., algorithmically, it’s just linear regression! (minus some numerical issues that we’ll ignore)

SLIDE 11

Nonlinear Regression: Smoothing Splines

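Mathematically, we express $f$ as a linear combination of basis functions:

$$f(x) = \sum_i w_i \phi_i(x), \qquad y = f(x) = \Phi w$$

Squared error term (just like in linear regression): $\|t - \Phi w\|^2$

Regularizer:

$$\begin{aligned}
\int (f''(z))^2 \, dz &= \int \Big( \sum_i w_i \phi_i''(z) \Big)^2 dz \\
&= \int \sum_i \sum_j w_i w_j \, \phi_i''(z) \, \phi_j''(z) \, dz \\
&= \sum_i \sum_j w_i w_j \underbrace{\int \phi_i''(z) \, \phi_j''(z) \, dz}_{= \Omega_{ij}} \\
&= w^\top \Omega w
\end{aligned}$$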

SLIDE 12

Nonlinear Regression: Smoothing Splines

Full cost function: $E(w, \lambda) = \|t - \Phi w\|^2 + \lambda w^\top \Omega w$
Optimal solution (derived by setting the gradient to zero): $w = (\Phi^\top \Phi + \lambda \Omega)^{-1} \Phi^\top t$
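A small sketch of the penalized solution, under an assumption made purely for convenience: instead of the data-dependent spline basis, it uses the monomials $\phi_i(z) = z^i$ on $[0, 1]$, for which $\Omega_{ij} = \int_0^1 \phi_i''(z) \phi_j''(z) \, dz$ has a closed form. The dataset and function names are illustrative.

```python
import numpy as np

def roughness_matrix(k):
    """Omega[i, j] = integral over [0, 1] of phi_i''(z) phi_j''(z) dz for phi_i(z) = z^i."""
    Omega = np.zeros((k + 1, k + 1))
    for i in range(2, k + 1):
        for j in range(2, k + 1):
            # phi_i''(z) = i (i - 1) z^(i - 2), so the integrand is a monomial
            # of degree i + j - 4, which integrates to 1 / (i + j - 3) on [0, 1].
            Omega[i, j] = i * (i - 1) * j * (j - 1) / (i + j - 3)
    return Omega

def fit_penalized(Phi, t, Omega, lam):
    """w = (Phi^T Phi + lam * Omega)^{-1} Phi^T t."""
    return np.linalg.solve(Phi.T @ Phi + lam * Omega, Phi.T @ t)

# Toy data (assumed): a noisy sinusoid on [0, 1].
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

k = 5
Phi = np.vander(x, N=k + 1, increasing=True)
Omega = roughness_matrix(k)

w_rough = fit_penalized(Phi, t, Omega, lam=0.0)    # no smoothing: plain least squares
w_smooth = fit_penalized(Phi, t, Omega, lam=1e6)   # heavy smoothing: nearly a straight line
penalty_rough = w_rough @ Omega @ w_rough          # integral of f''^2 for each fit
penalty_smooth = w_smooth @ Omega @ w_smooth
print(penalty_rough, penalty_smooth)
```

This also answers the question about the two extremes numerically: $\lambda = 0$ recovers the unregularized fit, while a very large $\lambda$ drives the curvature penalty toward zero, leaving (almost) a linear fit.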

SLIDE 13

Foreshadowing

SLIDE 14

Linear Regression as Maximum Likelihood

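We can give linear regression a probabilistic interpretation by assuming a Gaussian noise model:

$$t \mid x \sim \mathcal{N}(w^\top x + b, \sigma^2)$$

Linear regression is just maximum likelihood under this model:

$$\begin{aligned}
\frac{1}{N} \sum_{i=1}^N \log p(t^{(i)} \mid x^{(i)}; w, b) &= \frac{1}{N} \sum_{i=1}^N \log \mathcal{N}(t^{(i)}; w^\top x^{(i)} + b, \sigma^2) \\
&= \frac{1}{N} \sum_{i=1}^N \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(t^{(i)} - w^\top x^{(i)} - b)^2}{2\sigma^2} \right) \right] \\
&= \text{const} - \frac{1}{2N\sigma^2} \sum_{i=1}^N (t^{(i)} - w^\top x^{(i)} - b)^2
\end{aligned}$$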

SLIDE 15

Bayesian Linear Regression

Bayesian linear regression considers various plausible explanations for how the data were generated. It makes predictions using all possible regression weights, weighted by their posterior probability.

SLIDE 16

Bayesian Linear Regression

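Leave out the bias for simplicity.
Prior distribution: a broad, spherical (multivariate) Gaussian centered at zero: $w \sim \mathcal{N}(0, \nu^2 I)$
Likelihood: same as in the maximum likelihood formulation: $t \mid x, w \sim \mathcal{N}(w^\top x, \sigma^2)$
Posterior:

$$w \mid \mathcal{D} \sim \mathcal{N}(\mu, \Sigma), \qquad \mu = \sigma^{-2} \Sigma X^\top t, \qquad \Sigma^{-1} = \nu^{-2} I + \sigma^{-2} X^\top X$$

Compare with the linear regression formula: $w = (X^\top X)^{-1} X^\top t$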
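The posterior formulas translate almost directly into NumPy. The sketch below uses an assumed synthetic dataset and an illustrative function name; it also makes the comparison with the maximum likelihood formula concrete by trying a very broad and a very narrow prior.

```python
import numpy as np

def blr_posterior(X, t, nu, sigma):
    """Posterior N(mu, Sigma) over w for prior N(0, nu^2 I) and noise variance sigma^2."""
    D = X.shape[1]
    Sigma_inv = np.eye(D) / nu**2 + X.T @ X / sigma**2
    Sigma = np.linalg.inv(Sigma_inv)
    mu = Sigma @ X.T @ t / sigma**2
    return mu, Sigma

# Synthetic data (assumed for this example).
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
w_true = np.array([1.5, -0.7])
sigma = 0.1
t = X @ w_true + sigma * rng.standard_normal(30)

mu_broad, Sigma_broad = blr_posterior(X, t, nu=1e3, sigma=sigma)  # nearly flat prior
mu_tight, _ = blr_posterior(X, t, nu=1e-2, sigma=sigma)           # strong pull toward zero
w_ml = np.linalg.solve(X.T @ X, X.T @ t)                          # maximum likelihood
print(mu_broad, mu_tight, w_ml)
```

As the prior becomes broad ($\nu \to \infty$), the posterior mean should approach the maximum likelihood solution; a narrow prior shrinks it toward zero.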

SLIDE 17

Bayesian Linear Regression

— Bishop, Pattern Recognition and Machine Learning

SLIDE 18

Bayesian Linear Regression

We can turn this into nonlinear regression using basis functions. E.g., Gaussian basis functions:

$$\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2s^2} \right)$$

— Bishop, Pattern Recognition and Machine Learning

SLIDE 19

Bayesian Linear Regression

Functions sampled from the posterior:

— Bishop, Pattern Recognition and Machine Learning

SLIDE 20

Bayesian Linear Regression

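Posterior predictive distribution:

$$p(t \mid x, \mathcal{D}) = \int p(t \mid x, w) \, p(w \mid \mathcal{D}) \, dw = \mathcal{N}(t \mid \mu^\top x, \sigma^2_{\text{pred}}(x))$$

$$\sigma^2_{\text{pred}}(x) = \sigma^2 + x^\top \Sigma x,$$

where $\mu$ and $\Sigma$ are the posterior mean and covariance of $w$.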
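A short sketch of the predictive mean and variance, assuming a toy 1-D dataset with a bias feature appended (the data, hyperparameter values, and function names are illustrative). It evaluates the prediction at one input inside the training range and one far outside it.

```python
import numpy as np

def blr_posterior(X, t, nu, sigma):
    """Posterior N(mu, Sigma) over w for prior N(0, nu^2 I) and noise variance sigma^2."""
    Sigma = np.linalg.inv(np.eye(X.shape[1]) / nu**2 + X.T @ X / sigma**2)
    mu = Sigma @ X.T @ t / sigma**2
    return mu, Sigma

def blr_predict(X_new, mu, Sigma, sigma):
    """Predictive mean mu^T x and variance sigma^2 + x^T Sigma x for each row x of X_new."""
    mean = X_new @ mu
    var = sigma**2 + np.einsum("ij,jk,ik->i", X_new, Sigma, X_new)
    return mean, var

# Toy data (assumed): inputs in [-1, 1], features (x, 1).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
X = np.column_stack([x, np.ones_like(x)])
t = 0.8 * x + 0.3 + 0.1 * rng.standard_normal(30)

mu, Sigma = blr_posterior(X, t, nu=10.0, sigma=0.1)

x_test = np.array([0.0, 5.0])  # one point inside the data range, one far outside
X_test = np.column_stack([x_test, np.ones_like(x_test)])
mean, var = blr_predict(X_test, mu, Sigma, sigma=0.1)
print(mean, var)
```

The predictive variance never drops below the noise level $\sigma^2$, and it grows as the test input moves away from the training data.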

SLIDE 21

Bayesian Linear Regression

Posterior predictive distribution:

— Bishop, Pattern Recognition and Machine Learning

SLIDE 22

Foreshadowing

SLIDE 23

Foreshadowing

SLIDE 24

Occam’s Razor

Data modeling process according to MacKay:

SLIDE 25

Occam’s Razor

Occam’s Razor: “Entities should not be multiplied beyond necessity.”

Named after the 14th century British theologian William of Occam

Huge number of attempts to formalize mathematically

See Domingos, 1999, “The role of Occam’s Razor in knowledge discovery” for a skeptical overview.

https://homes.cs.washington.edu/~pedrod/papers/dmkd99.pdf

Common misinterpretation: your prior should favor simple explanations

SLIDE 27

Occam’s Razor

Suppose you have a finite set of models, or hypotheses $\{H_i\}_{i=1}^M$ (e.g. polynomials of different degrees).
Posterior inference over models (Bayes’ Rule):

$$p(H_i \mid \mathcal{D}) \propto \underbrace{p(H_i)}_{\text{prior}} \; \underbrace{p(\mathcal{D} \mid H_i)}_{\text{evidence}}$$

Which of these terms do you think is more important?

The evidence is also called the marginal likelihood, since it requires marginalizing out the parameters:

$$p(\mathcal{D} \mid H_i) = \int p(w \mid H_i) \, p(\mathcal{D} \mid w, H_i) \, dw$$

If we’re comparing a handful of hypotheses, $p(H_i)$ isn’t very important, so we can compare them based on marginal likelihood.

SLIDE 28

Occam’s Razor

Suppose $M_1$, $M_2$, and $M_3$ denote a linear, quadratic, and cubic model. $M_3$ is capable of explaining more datasets than $M_1$. But its distribution over $\mathcal{D}$ must integrate to 1, so it must assign lower probability to the ones it can explain.

— Bishop, Pattern Recognition and Machine Learning

SLIDE 29

Occam’s Razor

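How does the evidence (or marginal likelihood) penalize complex models? Approximating the integral:

$$p(\mathcal{D} \mid H_i) = \int p(\mathcal{D} \mid w, H_i) \, p(w \mid H_i) \, dw \simeq \underbrace{p(\mathcal{D} \mid w_{\text{MAP}}, H_i)}_{\text{best-fit likelihood}} \; \underbrace{p(w_{\text{MAP}} \mid H_i) \, \Delta w}_{\text{Occam factor}}$$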

SLIDE 30

Occam’s Razor

Multivariate case:

$$p(\mathcal{D} \mid H_i) \simeq \underbrace{p(\mathcal{D} \mid w_{\text{MAP}}, H_i)}_{\text{best-fit likelihood}} \; \underbrace{p(w_{\text{MAP}} \mid H_i) \, |A|^{-1/2}}_{\text{Occam factor}},$$

where $A = \nabla^2_w \log p(\mathcal{D} \mid w, H_i)$.

The determinant appears because we’re taking the volume. The more parameters in the model, the higher dimensional the parameter space, and the faster the volume decays.

— Bishop, Pattern Recognition and Machine Learning

SLIDE 31

Occam’s Razor

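Analyzing the asymptotic behavior (here $j$ indexes the $N$ data points and $D$ is the number of parameters, i.e. the dimension of $w$):

$$A = \nabla^2_w \log p(\mathcal{D} \mid w, H_i) = \sum_{j=1}^N \underbrace{\nabla^2_w \log p(y_j \mid x_j, w, H_i)}_{A_j} \approx N \, \mathbb{E}[A_j]$$

$$\begin{aligned}
\log \text{Occam factor} &= \log p(w_{\text{MAP}} \mid H_i) + \log |A|^{-1/2} \\
&\approx \log p(w_{\text{MAP}} \mid H_i) + \log |N \, \mathbb{E}[A_j]|^{-1/2} \\
&= \log p(w_{\text{MAP}} \mid H_i) - \frac{1}{2} \log |\mathbb{E}[A_j]| - \frac{D \log N}{2} \\
&= \text{const} - \frac{D \log N}{2}
\end{aligned}$$

Bayesian Information Criterion (BIC): penalize the complexity of your model by $\frac{1}{2} D \log N$.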
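To make the size of the penalty concrete, here is a minimal sketch of the BIC-style score. The log-likelihood values, parameter counts, and dataset size are hypothetical numbers chosen only for illustration, and `bic_score` is an illustrative name.

```python
import math

def bic_score(log_lik_map, num_params, num_data):
    """Approximate log evidence: best-fit log-likelihood minus (D / 2) log N."""
    return log_lik_map - 0.5 * num_params * math.log(num_data)

# Hypothetical comparison: model B fits slightly better but has two more parameters.
N = 100
score_a = bic_score(log_lik_map=-120.0, num_params=2, num_data=N)
score_b = bic_score(log_lik_map=-118.5, num_params=4, num_data=N)
print(score_a, score_b)
```

In this made-up case the extra 1.5 nats of fit does not pay for the two extra parameters, each of which costs $\frac{1}{2} \log 100 \approx 2.3$ nats, so the simpler model scores higher.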

SLIDE 32

Occam’s Razor

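Summary

$$p(H_i \mid \mathcal{D}) \propto p(H_i) \, p(\mathcal{D} \mid H_i)$$

$$p(\mathcal{D} \mid H_i) \simeq p(\mathcal{D} \mid w_{\text{MAP}}, H_i) \, p(w_{\text{MAP}} \mid H_i) \, |A|^{-1/2}$$

Asymptotically, with lots of data, this behaves like

$$\log p(\mathcal{D} \mid H_i) = \log p(\mathcal{D} \mid w_{\text{MAP}}, H_i) - \frac{1}{2} D \log N.$$

Occam’s Razor is about integration, not priors (over hypotheses).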

SLIDE 33

Bayesian Interpolation

So all we need to do is count parameters? Not so fast! Let’s consider the Bayesian analogue of smoothing splines, which MacKay refers to as Bayesian interpolation.

SLIDE 35

Bayesian Interpolation

Recall the smoothing spline objective. How many parameters?

$$E(f, \lambda) = \underbrace{\sum_{i=1}^N (t^{(i)} - f(x^{(i)}))^2}_{\text{mean squared error}} + \lambda \underbrace{\int (f''(z))^2 \, dz}_{\text{regularizer}}$$

Recall we can convert it to basis function regression with one basis function per training example.

So we have $N$ parameters, and hence a log Occam factor $\approx -\frac{1}{2} N \log N$?

You would never prefer this over a constant function! Fortunately, this is not what happens.

For computational convenience, we could choose some other set of basis functions (e.g. polynomials).

SLIDE 36

Bayesian Interpolation

To define a Bayesian analogue of smoothing splines, let’s convert it to a Bayesian basis function regression problem.
The likelihood is easy:

$$p(\mathcal{D} \mid w) = \prod_{i=1}^N \mathcal{N}(y_i \mid w^\top \phi(x_i), \sigma^2)$$

We’d like a prior which favors smoother functions:

$$p(w) \propto \exp\left( -\frac{\lambda}{2} \int (f''(z))^2 \, dz \right) = \exp\left( -\frac{\lambda}{2} w^\top \Omega w \right).$$

Note: this is a zero-mean Gaussian.

SLIDE 37

Bayesian Interpolation

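Posterior distribution and posterior predictive distribution (a special case of Bayesian linear regression, with the basis function values playing the role of the inputs):

$$w \mid \mathcal{D} \sim \mathcal{N}(\mu, \Sigma), \qquad \mu = \sigma^{-2} \Sigma X^\top t, \qquad \Sigma^{-1} = \lambda \Omega + \sigma^{-2} X^\top X$$

$$p(t \mid x, \mathcal{D}) = \mathcal{N}(t \mid \mu^\top x, \; \sigma^2 + x^\top \Sigma x)$$

Optimize the hyperparameters $\sigma$ and $\lambda$ by maximizing the evidence (marginal likelihood).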

This is known as the evidence approximation, or type 2 maximum likelihood.

SLIDE 38

Bayesian Interpolation

This makes reasonable predictions:

SLIDE 39

Bayesian Interpolation

Behavior w/ spherical prior as we add more basis functions:

— Rasmussen and Ghahramani, “Occam’s Razor”

SLIDE 40

Bayesian Interpolation

Behavior w/ smoothness prior as we add more basis functions:

— Rasmussen and Ghahramani, “Occam’s Razor”

SLIDE 41

Towards Gaussian Processes

Splines stop getting more complex as you add more basis functions.
Bayesian Occam’s Razor penalizes the complexity of the distribution over functions, not the number of parameters.
Maybe we can fit infinitely many parameters!
Rasmussen and Ghahramani (2001): in the infinite limit, the distribution over functions approaches a Gaussian process.

SLIDE 42

Towards Gaussian Processes

Gaussian Processes are distributions over functions. They’re actually a simpler and more intuitive way to think about regression, once you’re used to them.

— GPML

SLIDE 43

Towards Gaussian Processes

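A Bayesian linear regression model defines a distribution over functions:

$$f(x) = w^\top \phi(x)$$

Here, $w$ is sampled from the prior $\mathcal{N}(\mu_w, \Sigma_w)$.
Let $f = (f_1, \ldots, f_N)$ denote the vector of function values at $(x_1, \ldots, x_N)$.
The distribution of $f$ is a Gaussian with

$$\mathbb{E}[f_i] = \mu_w^\top \phi(x_i), \qquad \mathrm{Cov}(f_i, f_j) = \phi(x_i)^\top \Sigma_w \phi(x_j)$$

In vectorized form, $f \sim \mathcal{N}(\mu_f, \Sigma_f)$ with

$$\mu_f = \mathbb{E}[f] = \Phi \mu_w, \qquad \Sigma_f = \mathrm{Cov}(f) = \Phi \Sigma_w \Phi^\top$$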
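The vectorized formulas can be sanity-checked by sampling. The sketch below assumes quadratic features and a standard normal prior on $w$ (both illustrative choices), computes $\mu_f$ and $\Sigma_f$ in closed form, and compares $\Sigma_f$ with the empirical covariance of sampled function values.

```python
import numpy as np

def function_value_distribution(Phi, mu_w, Sigma_w):
    """Mean and covariance of f = Phi w when w ~ N(mu_w, Sigma_w)."""
    mu_f = Phi @ mu_w
    Sigma_f = Phi @ Sigma_w @ Phi.T
    return mu_f, Sigma_f

x = np.linspace(-1.0, 1.0, 5)
Phi = np.vander(x, N=3, increasing=True)  # quadratic features (1, x, x^2), an assumed choice
mu_w = np.zeros(3)
Sigma_w = np.eye(3)

mu_f, Sigma_f = function_value_distribution(Phi, mu_w, Sigma_w)

# Monte Carlo check: sample weights from the prior and push them through the features.
rng = np.random.default_rng(0)
W = rng.multivariate_normal(mu_w, Sigma_w, size=100_000)  # one weight vector per row
F = W @ Phi.T                                             # one sampled function per row
Sigma_f_mc = np.cov(F, rowvar=False)
print(np.abs(Sigma_f - Sigma_f_mc).max())
```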

SLIDE 44

Towards Gaussian Processes

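Recall that in Bayesian linear regression, we assume noisy Gaussian observations of the underlying function:

$$y_i \sim \mathcal{N}(f_i, \sigma^2) = \mathcal{N}(w^\top \phi(x_i), \sigma^2).$$

The observations $y$ are jointly Gaussian, just like $f$:

$$\mathbb{E}[y_i] = \mathbb{E}[f(x_i)]$$

$$\mathrm{Cov}(y_i, y_j) = \begin{cases} \mathrm{Var}(f(x_i)) + \sigma^2 & \text{if } i = j \\ \mathrm{Cov}(f(x_i), f(x_j)) & \text{if } i \neq j \end{cases}$$

In vectorized form, $y \sim \mathcal{N}(\mu_y, \Sigma_y)$, with

$$\mu_y = \mu_f, \qquad \Sigma_y = \Sigma_f + \sigma^2 I$$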

SLIDE 45

Towards Gaussian Processes

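Bayesian linear regression is just computing the conditional distribution in a multivariate Gaussian!
Let $y$ and $y'$ denote the observables at the training and test data. They are jointly Gaussian:

$$\begin{pmatrix} y \\ y' \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_y \\ \mu_{y'} \end{pmatrix}, \begin{pmatrix} \Sigma_{yy} & \Sigma_{yy'} \\ \Sigma_{y'y} & \Sigma_{y'y'} \end{pmatrix} \right).$$

The predictive distribution is a special case of the conditioning formula for a multivariate Gaussian:

$$y' \mid y \sim \mathcal{N}(\mu_{y'|y}, \Sigma_{y'|y})$$

$$\mu_{y'|y} = \mu_{y'} + \Sigma_{y'y} \Sigma_{yy}^{-1} (y - \mu_y)$$

$$\Sigma_{y'|y} = \Sigma_{y'y'} - \Sigma_{y'y} \Sigma_{yy}^{-1} \Sigma_{yy'}$$

We’re implicitly marginalizing out $w$!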
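The conditioning formula is short enough to write out directly. This is a minimal sketch with an illustrative function name; the two-variable example (unit variances, covariance 0.8) is an assumption chosen so the result can be checked by hand.

```python
import numpy as np

def condition_gaussian(mu_y, mu_yp, S_yy, S_ypy, S_ypyp, y):
    """Mean and covariance of y' | y for jointly Gaussian (y, y').

    S_ypy is the cross-covariance Sigma_{y'y}; Sigma_{yy'} is its transpose.
    """
    mean = mu_yp + S_ypy @ np.linalg.solve(S_yy, y - mu_y)
    cov = S_ypyp - S_ypy @ np.linalg.solve(S_yy, S_ypy.T)
    return mean, cov

# Two scalar variables with unit variance and covariance 0.8; observe y = 1.
mean, cov = condition_gaussian(
    mu_y=np.array([0.0]),
    mu_yp=np.array([0.0]),
    S_yy=np.array([[1.0]]),
    S_ypy=np.array([[0.8]]),
    S_ypyp=np.array([[1.0]]),
    y=np.array([1.0]),
)
print(mean, cov)  # conditional mean 0.8, conditional variance 1 - 0.8^2 = 0.36
```

Using `np.linalg.solve` rather than forming $\Sigma_{yy}^{-1}$ explicitly is a numerical design choice, not something the formulas require.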

SLIDE 46

Towards Gaussian Processes

The marginal likelihood is just the PDF of a multivariate Gaussian:

$$p(y \mid X) = \mathcal{N}(y; \mu_y, \Sigma_y) = \frac{1}{(2\pi)^{d/2} |\Sigma_y|^{1/2}} \exp\left( -\frac{1}{2} (y - \mu_y)^\top \Sigma_y^{-1} (y - \mu_y) \right)$$
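In practice one usually works with the log of this density. A minimal sketch, assuming a Cholesky factorization is used for the determinant and the quadratic form (a common numerical choice, not something stated on the slide); the small diagonal example is illustrative and can be checked against a sum of univariate Gaussian log-densities.

```python
import numpy as np

def log_marginal_likelihood(y, mu_y, Sigma_y):
    """log N(y; mu_y, Sigma_y), computed via the Cholesky factor Sigma_y = L L^T."""
    d = y.shape[0]
    L = np.linalg.cholesky(Sigma_y)
    alpha = np.linalg.solve(L, y - mu_y)        # L^{-1} (y - mu_y)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log |Sigma_y|
    return -0.5 * alpha @ alpha - 0.5 * log_det - 0.5 * d * np.log(2 * np.pi)

# Check on a diagonal covariance, where the density factorizes.
y = np.array([1.0, -2.0])
mu_y = np.zeros(2)
Sigma_y = np.diag([4.0, 9.0])

value = log_marginal_likelihood(y, mu_y, Sigma_y)
expected = -0.5 * (1.0 / 4.0 + 4.0 / 9.0) - 0.5 * np.log(36.0) - np.log(2 * np.pi)
print(value, expected)
```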

SLIDE 47

Towards Gaussian Processes

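To summarize:

$$\mu_f = \Phi \mu_w, \qquad \Sigma_f = \Phi \Sigma_w \Phi^\top$$

$$\mu_y = \mu_f, \qquad \Sigma_y = \Sigma_f + \sigma^2 I$$

$$\mu_{y'|y} = \mu_{y'} + \Sigma_{y'y} \Sigma_{yy}^{-1} (y - \mu_y)$$

$$\Sigma_{y'|y} = \Sigma_{y'y'} - \Sigma_{y'y} \Sigma_{yy}^{-1} \Sigma_{yy'}$$

$$p(y \mid X) = \mathcal{N}(y; \mu_y, \Sigma_y)$$

After defining $\mu_f$ and $\Sigma_f$, we can forget about $w$ and $x$!
What if we just let $\mu_f$ and $\Sigma_f$ be anything?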

SLIDE 48

Gaussian Processes

When I say let $\mu_f$ and $\Sigma_f$ be anything, I mean let them have an arbitrary functional dependence on the inputs. We need to specify

a mean function: $\mathbb{E}[f(x_i)] = \mu(x_i)$
a covariance function, called a kernel function: $\mathrm{Cov}(f(x_i), f(x_j)) = k(x_i, x_j)$

Let $K_X$ denote the kernel matrix for points $X$. This is a matrix whose $(i, j)$ entry is $k(x_i, x_j)$.
We require that $K_X$ be positive semidefinite for any $X$. Other than that, $\mu$ and $k$ can be arbitrary.
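The positive semidefiniteness requirement can be probed numerically on a particular set of points. In the sketch below, the helper names and both candidate functions are illustrative assumptions: $k(x, x') = x x' + 1$ comes from the basis functions $(x, 1)$ with a unit Gaussian prior on $w$, while $k(x, x') = |x - x'|$ is included as a function that is not a valid kernel.

```python
import numpy as np

def kernel_matrix(k, X):
    """Kernel matrix K_X whose (i, j) entry is k(X[i], X[j])."""
    n = len(X)
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(K, tol=1e-9):
    """True if every eigenvalue of the symmetric matrix K is at least -tol."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.array([0.0, 0.5, 1.0, 1.5, 2.0])

K_linear = kernel_matrix(lambda a, b: a * b + 1.0, X)  # valid: arises from features (x, 1)
K_bad = kernel_matrix(lambda a, b: abs(a - b), X)      # invalid: zero trace, nonzero entries

print(is_psd(K_linear), is_psd(K_bad))
```

Passing this check on one set of points does not prove a function is a valid kernel (the requirement is for any $X$), but a single failure is enough to rule it out.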

SLIDE 49

Gaussian Processes

We’ve just defined a distribution over function values at an arbitrary finite set of points.
This can be extended to a distribution over functions using a kind of black magic called the Kolmogorov Extension Theorem.
This distribution over functions is called a Gaussian process (GP).
We only ever need to compute with distributions over function values. The formulas from a few slides ago are all you need to do regression with GPs.
But distributions over functions are conceptually cleaner.

SLIDE 50

GP Kernels

One way to define a kernel function is to give a set of basis functions and put a Gaussian prior on $w$.
But we have lots of other options. Here’s a useful one, called the squared-exp, or Gaussian, or radial basis function (RBF) kernel:

$$k_{\text{SE}}(x_i, x_j) = \sigma^2 \exp\left( -\frac{\|x_i - x_j\|^2}{2\ell^2} \right)$$

More accurately, this is a kernel family with hyperparameters $\sigma$ and $\ell$.

It gives a distribution over smooth functions:
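A sketch of drawing such functions on a grid, restricted to scalar inputs for simplicity. The grid, the hyperparameter values, and the small diagonal "jitter" added before the Cholesky factorization are assumptions of this example (the jitter is a common numerical safeguard, since the squared-exp kernel matrix is close to singular).

```python
import numpy as np

def se_kernel(xa, xb, sigma=1.0, ell=0.2):
    """Squared-exp kernel matrix between two sets of scalar inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return sigma**2 * np.exp(-sq_dist / (2 * ell**2))

x = np.linspace(0.0, 1.0, 50)
K = se_kernel(x, x)

# Sample f ~ N(0, K) as f = L z with K + jitter * I = L L^T and z ~ N(0, I).
jitter = 1e-6 * np.eye(len(x))
L = np.linalg.cholesky(K + jitter)
rng = np.random.default_rng(0)
samples = L @ rng.standard_normal((len(x), 3))  # each column is one sampled function
print(samples.shape)
```

Plotting each column of `samples` against `x` would show smooth curves whose wiggliness is governed by `ell` and whose vertical scale is governed by `sigma`.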

SLIDE 51

GP Kernels

$$k_{\text{SE}}(x_i, x_j) = \sigma^2 \exp\left( -\frac{(x_i - x_j)^2}{2\ell^2} \right)$$

The hyperparameters determine key properties of the function.

Varying the output variance $\sigma^2$:
Varying the lengthscale $\ell$:

SLIDE 52

GP Kernels

The choice of hyperparameters heavily influences the predictions:

In practice, it’s very important to tune the hyperparameters (e.g. by maximizing the marginal likelihood).
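One simple way to do this tuning is a grid search over the lengthscale, scoring each candidate by the log marginal likelihood. The sketch below assumes a zero mean function, a toy noisy sinusoid, a fixed output scale and noise level, and a coarse hand-picked grid; gradient-based optimization is the more common choice in practice.

```python
import numpy as np

def se_kernel(xa, xb, sigma_f, ell):
    """Squared-exp kernel matrix between two sets of scalar inputs."""
    return sigma_f**2 * np.exp(-(xa[:, None] - xb[None, :]) ** 2 / (2 * ell**2))

def log_marginal_likelihood(x, y, sigma_f, ell, sigma_n):
    """log N(y; 0, K + sigma_n^2 I) for a zero-mean GP with a squared-exp kernel."""
    Sigma_y = se_kernel(x, x, sigma_f, ell) + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(Sigma_y)
    alpha = np.linalg.solve(L, y)
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(L)))        # equals -0.5 * log |Sigma_y|
            - 0.5 * len(x) * np.log(2 * np.pi))

# Toy data (assumed): a noisy sinusoid on [0, 1].
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)

lengthscales = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]
scores = [log_marginal_likelihood(x, y, 1.0, ell, 0.1) for ell in lengthscales]
best_ell = lengthscales[int(np.argmax(scores))]
print(dict(zip(lengthscales, scores)), best_ell)
```

Very short lengthscales treat the data as nearly independent noise, and very long ones cannot follow the oscillation, so the evidence should peak at an intermediate value.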

SLIDE 53

GP Kernels

$$k_{\text{SE}}(x_i, x_j) = \sigma^2 \exp\left( -\frac{(x_i - x_j)^2}{2\ell^2} \right)$$

The squared-exp kernel is stationary because it only depends on $x_i - x_j$. Most kernels we use in practice are stationary.
We can visualize the function $k(0, x)$:

SLIDE 54

GP Kernels

The periodic kernel encodes a probability distribution over periodic functions.

SLIDE 55

GP Kernels

The linear kernel results in a probability distribution over linear functions.

SLIDE 56

GP Kernels

The Matérn kernel is similar to the squared-exp kernel, but less smooth.
See Chapter 4 of GPML for an explanation (advanced).
Imagine trying to get this behavior by designing basis functions!

SLIDE 57

GP Kernels

We get exponentially more flexibility by combining kernels. The sum of two kernels is a kernel.

This is because valid covariance matrices (i.e. PSD matrices) are closed under addition.

The sum of two kernels corresponds to the sum of functions.

Linear + Periodic: e.g. seasonal pattern w/ trend

Additive kernel: $k(x, y, x', y') = k_1(x, x') + k_2(y, y')$

SLIDE 58

GP Kernels

A kernel is like a similarity function on the input space.
The sum of two kernels is like the OR of their similarity.
Amazingly, the product of two kernels is a kernel. (Follows from the Schur Product Theorem.)
The product of two kernels is like the AND of their similarity functions.
Example: the product of a squared-exp kernel (spatial similarity) and a periodic kernel (similar location within cycle) gives a locally periodic function.
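The closure properties can be checked numerically on a grid of points. The lecture does not give a formula for the periodic kernel, so the sketch assumes the standard form $\exp(-2 \sin^2(\pi |x - x'| / p) / \ell^2)$ from GPML; the grid, hyperparameter values, and function names are also illustrative.

```python
import numpy as np

def se_kernel(x, ell=0.3):
    """Squared-exp kernel matrix with unit output variance."""
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell**2))

def periodic_kernel(x, period=0.25, ell=1.0):
    """Standard periodic kernel (assumed form): exp(-2 sin^2(pi |d| / period) / ell^2)."""
    d = np.abs(x[:, None] - x[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell**2)

def linear_kernel(x):
    """Linear kernel k(x, x') = x x'."""
    return np.outer(x, x)

x = np.linspace(0.0, 1.0, 41)
K_per = periodic_kernel(x)
K_sum = linear_kernel(x) + K_per   # "trend OR seasonal": linear + periodic
K_prod = se_kernel(x) * K_per      # "nearby AND same phase": locally periodic

min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()
print(min_eig_sum, min_eig_prod)
```

Both smallest eigenvalues should be nonnegative up to floating-point round-off, consistent with sums and elementwise products of PSD matrices being PSD.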

SLIDE 59

GP Kernels

Modeling CO2 concentrations: trend + (changing) seasonal pattern + short-term variability + noise
Encoding the structure allows sensible extrapolation.

SLIDE 60

Summary

Bayesian linear regression lets us determine uncertainty in our predictions.
We can make it nonlinear by using fixed basis functions.
Bayesian Occam’s Razor is a sophisticated way of penalizing the complexity of a distribution over functions.
Gaussian processes are an elegant framework for doing Bayesian inference directly over functions.
The choice of kernels gives us much more control over what sort of functions our prior would allow or favor.
Next time: Bayesian neural nets, a different way of making Bayesian linear regression more powerful.
