

SLIDE 1

Primer on Bayesian Inference and Gaussian Processes

Guido Sanguinetti

School of Informatics, University of Edinburgh

Dagstuhl, March 2018

SLIDE 2

Talk outline

1. Bayesian regression
2. Gaussian Processes
3. Bayesian prediction with GPs
4. Bayesian Optimisation

SLIDE 3

The Bayesian Way

ALL model ingredients are random variables
Statistical framework for quantifying uncertainty when only some variables are observed
We have prior distributions on unobserved variables, and model the dependence of the observations on the unobserved variables
This allows us to make inferences on the unobserved variables

SLIDE 4

Bayesian inference and predictions

Models consist of joint probability distributions (with a structure) over observed and unobserved (latent) variables
Unobserved variables θ have a prior distribution
The conditional probability of the observations y given the latents, p(y|θ), is called the likelihood

SLIDE 5

Bayesian inference and predictions

Models consist of joint probability distributions (with a structure) over observed and unobserved (latent) variables
Unobserved variables θ have a prior distribution
The conditional probability of the observations y given the latents, p(y|θ), is called the likelihood
The revised belief on the latents is the posterior p(θ|y) = (1/Z) p(y|θ) p(θ)

SLIDE 6

Bayesian inference and predictions

Models consist of joint probability distributions (with a structure) over observed and unobserved (latent) variables
Unobserved variables θ have a prior distribution
The conditional probability of the observations y given the latents, p(y|θ), is called the likelihood
The revised belief on the latents is the posterior p(θ|y) = (1/Z) p(y|θ) p(θ)
The predictive distribution for new observations is

p(y_new | y_old) = ∫ dθ p(y_new|θ) p(θ|y_old)

The difficulty is computing the integrals
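Where the model is conjugate, both integrals are available in closed form. A minimal sketch (a toy example, not from the slides) for a Gaussian mean with known noise variance, where prior, posterior and predictive are all Gaussian:

import numpy as np

# Prior theta ~ N(0, 1); likelihood y_i | theta ~ N(theta, s2) with s2 known.
rng = np.random.default_rng(0)
s2 = 0.5
y = 1.2 + np.sqrt(s2) * rng.standard_normal(20)   # data from theta_true = 1.2

# Posterior p(theta | y) = N(post_mean, post_var), by completing the square.
post_var = 1.0 / (1.0 + len(y) / s2)
post_mean = post_var * y.sum() / s2

# Predictive p(y_new | y) = N(post_mean, post_var + s2): the integral over
# theta of p(y_new | theta) p(theta | y).
print(post_mean, post_var, post_var + s2)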

SLIDE 7

Bayesian supervised (discriminative) learning

We focus on the (discriminative) supervised learning scenario: data are input-output pairs (x, y)
Standard assumption: inputs are noise-free, outputs are noisy observations of a function f(x): y ∼ P s.t. E[y] = f(x)
The function f is a random function

SLIDE 8

Bayesian supervised (discriminative) learning

We focus on the (discriminative) supervised learning scenario: data are input-output pairs (x, y)
Standard assumption: inputs are noise-free, outputs are noisy observations of a function f(x): y ∼ P s.t. E[y] = f(x)
The function f is a random function
Simplest example: f(x) = Σ_i w_i φ_i(x), with φ_i fixed basis functions and w_i random weights
Consequently, the variables f(x_i) at the input points are (correlated) random variables
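A quick illustration of this weight-space construction (a sketch; the Gaussian-bump basis and its centres and width are assumed, not from the slides):

import numpy as np

# Fixed basis functions phi_i (Gaussian bumps) and random weights w ~ N(0, I)
# give random functions f(x) = sum_i w_i * phi_i(x).
rng = np.random.default_rng(1)
centres = np.linspace(-3, 3, 15)

def phi(x):
    # Feature map phi(x) = (phi_1(x), ..., phi_N(x)) for an array of inputs x.
    return np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2 / 0.5 ** 2)

x = np.linspace(-3, 3, 200)
samples = phi(x) @ rng.standard_normal((len(centres), 3))  # three random functions
print(samples.shape)   # (200, 3): each column is one draw of f at the grid points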

SLIDE 9

Important exercise

Let φ_1(x), . . . , φ_N(x) be a fixed set of functions, and let f(x) = Σ_i w_i φ_i(x). If w ∼ N(0, I), compute:

1. The single-point marginal distribution of f(x)
2. The two-point marginal distribution of f(x1), f(x2)

SLIDE 10

Solution (sketch)

Obviously the distributions are Gaussians

SLIDE 11

Solution (sketch)

Obviously the distributions are Gaussians
Obviously both distributions have mean zero

SLIDE 12

Solution (sketch)

Obviously the distributions are Gaussians
Obviously both distributions have mean zero
To compute the (co)variance, take products and expectations and remember that ⟨w_i w_j⟩ = δ_ij
Defining φ(x) = (φ_1(x), . . . , φ_N(x)), we get that ⟨f(x_i) f(x_j)⟩ = φ(x_i)ᵀ φ(x_j)
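A Monte Carlo check of this result (a sketch, reusing the toy Gaussian-bump basis assumed earlier):

import numpy as np

rng = np.random.default_rng(2)
centres = np.linspace(-3, 3, 15)

def phi(x):
    # Feature vector phi(x) = (phi_1(x), ..., phi_N(x)) at a single input x.
    return np.exp(-0.5 * (x - centres) ** 2 / 0.5 ** 2)

x1, x2 = 0.3, 1.1
W = rng.standard_normal((100_000, len(centres)))   # rows are draws of w ~ N(0, I)
f1, f2 = W @ phi(x1), W @ phi(x2)
print(np.mean(f1 * f2))    # empirical <f(x1) f(x2)>
print(phi(x1) @ phi(x2))   # analytic phi(x1)^T phi(x2)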

SLIDE 13

The Gram matrix

Generalising the exercise to more than two points, we get that any finite-dimensional marginal of this process is multivariate Gaussian
The covariance matrix of these marginals is given by evaluating a function of two variables at all pairs of input points
The function is defined by the set of basis functions: k(x_i, x_j) = φ(x_i)ᵀ φ(x_j)
The covariance matrix is often called the Gram matrix and is (necessarily) symmetric and positive definite
Bayesian prediction in regression is then essentially the same as computing conditionals of Gaussians (more later)

SLIDE 14

Stationary variance

We have seen that the variance of a random combination of functions depends on space as Σ_i φ_i²(x)
Given any compact set (e.g. a hypercube with centre in the origin), we can find a finite set of basis functions s.t. Σ_i φ_i²(x) = const (partition of unity, e.g. triangulations or smoother alternatives)
We can construct a sequence of such sets which covers the whole of R^D in the limit
Therefore, we can construct a sequence of priors which all have constant prior variance across all of space
Covariances would still be computed by evaluating a Gram matrix (and need not be constant)

SLIDE 15

Function space view

The argument before shows that we can put a prior over infinite-dimensional spaces of functions s.t. all finite-dimensional marginals are multivariate Gaussian
The constructive argument, often referred to as the weight-space view, is useful for intuition but impractical
It does demonstrate the existence of truly infinite-dimensional Gaussian processes
Once we accept that Gaussian processes exist, we are better off proceeding along a more abstract line

SLIDE 16

GP definition

A Gaussian Process (GP) is a stochastic process indexed by a continuous variable x s.t. all finite-dimensional marginals are multivariate Gaussian
A GP is uniquely defined by its mean and covariance functions, denoted by µ(x) and k(x, x′):

f ∼ GP(µ, k)  ↔  f = (f(x_1), . . . , f(x_N)) ∼ N(µ, K),  µ = (µ(x_1), . . . , µ(x_N)),  K = (k(x_i, x_j))_ij

The covariance function must satisfy some conditions (Mercer's theorem): essentially, it needs to evaluate to a symmetric positive definite matrix for all sets of input points

SLIDE 17

Covariance functions

The covariance function encapsulates the basis functions used → it determines the type of functions which can be sampled
The radial basis function (RBF, or squared exponential) covariance k(x_i, x_j) = α² exp(−(x_i − x_j)² / (2λ²)) corresponds to Gaussian-bump basis functions and yields smooth, bumpy samples
The Ornstein-Uhlenbeck (OU) covariance k(x_i, x_j) = α² exp(−|x_i − x_j| / λ) yields rough paths which are nowhere differentiable
Both RBF and OU are stationary and encode exponentially decaying correlations
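To see the difference, one can sample finite marginals on a dense grid (a sketch; the amplitude α = 1 and lengthscale λ = 0.5 are assumed values):

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 300)
alpha, lam = 1.0, 0.5
d = x[:, None] - x[None, :]

K_rbf = alpha**2 * np.exp(-d**2 / (2 * lam**2))   # smooth, bumpy samples
K_ou = alpha**2 * np.exp(-np.abs(d) / lam)        # rough, nowhere-differentiable samples

jitter = 1e-6 * np.eye(len(x))                    # numerical stabiliser for Cholesky
f_rbf = np.linalg.cholesky(K_rbf + jitter) @ rng.standard_normal(len(x))
f_ou = np.linalg.cholesky(K_ou + jitter) @ rng.standard_normal(len(x))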

SLIDE 18

More on covariance functions

Recent years have seen much work on designing/selecting covariance functions
One line of thought follows the fact that convex combination/multiplication of covariance functions still yields a covariance function
The Automatic Statistician project (Z. Ghahramani) combines these operations with a heuristic search to automatically select a covariance
Another line of research constructs covariance functions out of steady-state autocorrelations of stochastic process models (work primarily by Särkkä and collaborators)

SLIDE 19

Observing GPs

In a regression case, we assume to have observed the function values at some input values, with i.i.d. Gaussian noise of variance σ²
What is the effect of observation noise? Suppose we have a Gaussian vector f ∼ N(µ, Σ), and observations y|f ∼ N(f, σ²I)
Exercise: compute the marginal distribution of y
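A sketch of the answer (standard Gaussian marginalisation; the slide leaves this as an exercise): integrating f out of the joint gives

p(y) = ∫ p(y|f) p(f) df = N(y | µ, Σ + σ²I)

so i.i.d. Gaussian observation noise simply adds σ²I to the covariance.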

SLIDE 20

Predicting with GPs

Suppose we have noisy observations y of a function at inputs x, and want to predict the function value at a new input x_new
The joint prior probability of the function values at the observed and new input points is multivariate Gaussian
By Bayes' theorem, we have

p(f_new|y) ∝ ∫ df(x) p(f_new, f(x)) p(y|f(x))   (1)

where f(x) is the vector of true function values at the input points
For Gaussian observation noise, the integral can be computed analytically

SLIDE 21

Regression calcs

Define

Σ = ( k**   k*ᵀ
      k*    K )   (2)

where k** = k(x*, x*), k*_j = k(x*, x_j) and K_ij = k(x_i, x_j)
Supposing the observation noise is i.i.d. Gaussian, y | f(x) ∼ N(f(x), σ²I), we can obtain the joint distribution of the observations and the new value

p(f*, y) = N(0, Σ_y),   Σ_y = ( k**   k*ᵀ
                                k*    K + σ²I )   (3)

The final result is that p(f* | y) = N(m, v) with

m = k*ᵀ (K + σ²I)⁻¹ y,   v = k** − k*ᵀ (K + σ²I)⁻¹ k*   (4)
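A minimal NumPy sketch of these formulas (the RBF kernel and all hyperparameter values are assumptions for illustration):

import numpy as np

def rbf(a, b, alpha=1.0, lam=0.5):
    # k(x, x') = alpha^2 exp(-(x - x')^2 / (2 lam^2)), evaluated on all pairs.
    return alpha**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * lam**2))

def gp_predict(x, y, x_star, sigma2=0.1):
    Ky = rbf(x, x) + sigma2 * np.eye(len(x))    # K + sigma^2 I
    k_star = rbf(x, x_star)                     # k(x_i, x_*) for all pairs
    m = k_star.T @ np.linalg.solve(Ky, y)       # posterior mean, eq. (4)
    v = rbf(x_star, x_star) - k_star.T @ np.linalg.solve(Ky, k_star)  # posterior covariance
    return m, v

rng = np.random.default_rng(4)
x = np.linspace(0, 5, 20)
y = np.sin(x) + 0.3 * rng.standard_normal(20)
m, v = gp_predict(x, y, np.linspace(0, 5, 100))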

SLIDE 22

Covariance parameters

Covariance functions often depend on hyperparameters (e.g. the amplitude and lengthscale of the RBF covariance)
These can be tuned by optimising the marginal likelihood (so-called type II maximum likelihood)

L = log ∫ df(x) p(f(x)) p(y|f(x)) = −(1/2) log |K + σ²I| − (1/2) yᵀ (K + σ²I)⁻¹ y + const

Usually gradient methods are used; fully Bayesian treatments are complicated by the generally complex functional form (no conjugate prior)
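A sketch of type II maximum likelihood (SciPy's default quasi-Newton optimiser with numerical gradients; a real implementation would use the analytic gradient):

import numpy as np
from scipy.optimize import minimize

def neg_log_marginal(log_params, x, y):
    # Work with log(alpha), log(lam), log(sigma2) to keep parameters positive.
    alpha, lam, s2 = np.exp(log_params)
    K = alpha**2 * np.exp(-(x[:, None] - x[None, :])**2 / (2 * lam**2))
    L = np.linalg.cholesky(K + s2 * np.eye(len(x)))
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + s2 I)^{-1} y
    # -log p(y) = 1/2 y^T (K + s2 I)^{-1} y + 1/2 log|K + s2 I| + n/2 log(2 pi)
    return 0.5 * y @ a + np.log(np.diag(L)).sum() + 0.5 * len(x) * np.log(2 * np.pi)

rng = np.random.default_rng(5)
x = np.linspace(0, 5, 30)
y = np.sin(x) + 0.3 * rng.standard_normal(30)
res = minimize(neg_log_marginal, x0=np.zeros(3), args=(x, y))
print(np.exp(res.x))   # fitted amplitude, lengthscale, noise variance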

SLIDE 23

GP regression - example

[Figure: marginal log-likelihood as a function of µ]

SLIDE 25

GP regression - example

[Figure: marginal log-likelihood as a function of µ, with σ² = 0.2]

SLIDE 26

GP regression - example

[Figure: marginal log-likelihood as a function of µ, with σ² = 1]

SLIDE 27

GP prediction from general observations

Equation (1) holds whatever p(y|f(x)); the posterior statistics can only be computed analytically if the observation noise is Gaussian
For general observation models (e.g. classification, counting), the integral in (1) is no longer analytically computable
Several approximate algorithms are available; we use Expectation Propagation (EP), an established fast approximate inference method

SLIDE 28

Pitfalls of GP prediction

Addition of a new observation always reduces uncertainty at all points → vulnerable to outliers
Optimisation of hyperparameters is often tricky: it works well if σ² is known, otherwise it can be seriously multimodal
MAIN PROBLEM: GP prediction relies on a matrix inversion which scales cubically with the number of points!
Sparsification methods have been proposed, but in high dimension GP regression is likely to be tricky nevertheless

SLIDE 29

Bayesian Optimisation and Active Learning

Active learning proposes a dynamic world-view where learning takes place in cycles, intelligently selecting instances to query
Bayesian Optimisation uses similar ideas to tackle non-convex optimisation
The scenario is that the function to be optimised is unknown but can be queried (at some computational cost)
It replaces a hard optimisation problem with an iterative approach where an easier problem is solved at every iteration
Closely related to Reinforcement Learning

SLIDE 30

BO key idea and terminology

At every step of the algorithm, we have a few function evaluations and want to select a point at which to evaluate the function next
KEY IDEA: treat the function as a random variable, use the existing function evaluations to compute a posterior distribution over functions, and use this distribution to select the new point
The posterior mean is called the surrogate function
The surrogate is used to create an acquisition function which is maximised to find the next evaluation point
The measure of success is the cumulative regret

R_T = (1/T) Σ_{t=1}^{T} (f(x_t) − f(x*))²

SLIDE 31

Exploration-exploitation

It will come as no surprise that GPs can be used in Bayesian Optimisation to construct a surrogate function
Directly optimising the surrogate, however, is a bad idea
One needs a strategy to trade off exploitation (looking around areas which you know to be promising) and exploration (checking out areas which you don't know much about)

SLIDE 32

The GP-UCB rule

GPs also provide uncertainty quantification (in fact, they provide full distributions over outputs at any point)
One can trade off exploration and exploitation by selecting regions where the function could conceivably be high, rather than regions where we expect it to be high
The Gaussian Process Upper Confidence Bound (GP-UCB) algorithm maximises the following acquisition function to select its next point:

F(x) = E[f(x)] + β_t √(var(f(x)))

which is an upper quantile of the single-point marginals
Surprisingly, Srinivas et al proved that this algorithm is globally convergent in probability, i.e. given δ, ε, with probability 1 − ε the regret will become smaller than δ
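A toy GP-UCB loop on a dense grid (a sketch: RBF kernel, fixed hyperparameters, and a constant β, where the theory requires β_t to grow with t; the hidden function is invented for illustration):

import numpy as np

def rbf(a, b, alpha=1.0, lam=0.3):
    return alpha**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * lam**2))

def gp_posterior(x, y, grid, sigma2=1e-4):
    # Posterior mean and pointwise variance on the grid, as in eq. (4).
    Ky = rbf(x, x) + sigma2 * np.eye(len(x))
    ks = rbf(x, grid)
    m = ks.T @ np.linalg.solve(Ky, y)
    var = np.diag(rbf(grid, grid)) - np.sum(ks * np.linalg.solve(Ky, ks), axis=0)
    return m, np.clip(var, 0.0, None)

f_hidden = lambda x: -np.sin(3 * x) - x**2 + 0.7 * x   # function to optimise
grid = np.linspace(-1.0, 2.0, 400)
x_obs = np.array([0.0])
y_obs = f_hidden(x_obs)
beta = 2.0
for t in range(10):
    m, var = gp_posterior(x_obs, y_obs, grid)
    x_next = grid[np.argmax(m + beta * np.sqrt(var))]   # maximise the acquisition
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, f_hidden(x_next))
print(x_obs[np.argmax(y_obs)])   # best query found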

SLIDE 33

The GP-UCB algorithm - example

[Figure series: six slides showing successive iterations of GP-UCB on a one-dimensional example]

SLIDE 39

Why it works

Srinivas et al (IEEE Trans Inf Th 2012) linked the expectation of the cumulative regret to a submodular function
Submodular functions obey a diminishing-returns rule → greedy optimisation provably works for submodular functions
The upper-quantile term β_t must be increased according to a specific schedule for the guarantees to hold

SLIDE 40

Limitations of GP-UCB

The cubic scaling of GP regression limits the dimensionality of the space; in my experience anything above 10 is fanciful
Optimising the acquisition function is still an NP-hard problem! In fact, the acquisition function can be nastier (more multimodal) than the original function
Convergence is not very fast (O(√T)) → in cases where the true function is really expensive to evaluate this could be too much
