

SLIDE 1

Primer on Bayesian Inference and Gaussian Processes

Guido Sanguinetti

School of Informatics, University of Edinburgh

Dagstuhl, March 2018

SLIDE 2

Talk outline

1. Bayesian regression
2. Gaussian Processes
3. Bayesian prediction with GPs
4. Bayesian Optimisation

SLIDE 3

The Bayesian Way

ALL model ingredients are random variables
Statistical framework for quantifying uncertainty when only some variables are observed
We have prior distributions on unobserved variables, and model the dependence of the observations on the unobserved variables
This allows us to make inferences on the unobserved variables

SLIDE 4

Bayesian inference and predictions

Models consist of joint probability distributions (with a structure) over observed and unobserved (latent) variables
Unobserved variables θ have a prior distribution
The conditional probability of the observations y given the latents, p(y|θ), is called the likelihood

SLIDE 5

Bayesian inference and predictions

Models consist of joint probability distributions (with a structure) over observed and unobserved (latent) variables
Unobserved variables θ have a prior distribution
The conditional probability of the observations y given the latents, p(y|θ), is called the likelihood
The revised belief on the latents is the posterior p(θ|y) = (1/Z) p(y|θ) p(θ)

SLIDE 6

Bayesian inference and predictions

Models consist of joint probability distributions (with a structure) over observed and unobserved (latent) variables
Unobserved variables θ have a prior distribution
The conditional probability of the observations y given the latents, p(y|θ), is called the likelihood
The revised belief on the latents is the posterior p(θ|y) = (1/Z) p(y|θ) p(θ)
The predictive distribution for new observations is

p(y_new | y_old) = ∫ dθ p(y_new|θ) p(θ|y_old)

The difficulty is computing the integrals
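Where the model is conjugate, both integrals are available in closed form. A minimal sketch (a toy example, not from the slides) for a Gaussian mean with known noise variance, where prior, posterior and predictive are all Gaussian:

import numpy as np

# Prior theta ~ N(0, 1); likelihood y_i | theta ~ N(theta, s2) with s2 known.
rng = np.random.default_rng(0)
s2 = 0.5
y = 1.2 + np.sqrt(s2) * rng.standard_normal(20)   # data from theta_true = 1.2

# Posterior p(theta | y) = N(post_mean, post_var), by completing the square.
post_var = 1.0 / (1.0 + len(y) / s2)
post_mean = post_var * y.sum() / s2

# Predictive p(y_new | y) = N(post_mean, post_var + s2): the integral over
# theta of p(y_new | theta) p(theta | y).
print(post_mean, post_var, post_var + s2)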

SLIDE 7

Bayesian supervised (discriminative) learning

We focus on the (discriminative) supervised learning scenario: data are input-output pairs (x, y)
Standard assumption: inputs are noise-free, outputs are noisy observations of a function f(x): y ∼ P s.t. E[y] = f(x)
The function f is a random function

SLIDE 8

Bayesian supervised (discriminative) learning

We focus on the (discriminative) supervised learning scenario: data are input-output pairs (x, y)
Standard assumption: inputs are noise-free, outputs are noisy observations of a function f(x): y ∼ P s.t. E[y] = f(x)
The function f is a random function
Simplest example: f(x) = Σ_i w_i φ_i(x), with φ_i fixed basis functions and w_i random weights
Consequently, the variables f(x_i) at the input points are (correlated) random variables
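A quick illustration of this weight-space construction (a sketch; the Gaussian-bump basis and its centres and width are assumed, not from the slides):

import numpy as np

# Fixed basis functions phi_i (Gaussian bumps) and random weights w ~ N(0, I)
# give random functions f(x) = sum_i w_i * phi_i(x).
rng = np.random.default_rng(1)
centres = np.linspace(-3, 3, 15)

def phi(x):
    # Feature map phi(x) = (phi_1(x), ..., phi_N(x)) for an array of inputs x.
    return np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2 / 0.5 ** 2)

x = np.linspace(-3, 3, 200)
samples = phi(x) @ rng.standard_normal((len(centres), 3))  # three random functions
print(samples.shape)   # (200, 3): each column is one draw of f at the grid points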

SLIDE 9

Important exercise

Let φ_1(x), . . . , φ_N(x) be a fixed set of functions, and let f(x) = Σ_i w_i φ_i(x). If w ∼ N(0, I), compute:

1. The single-point marginal distribution of f(x)
2. The two-point marginal distribution of f(x1), f(x2)

SLIDE 10

Solution (sketch)

Obviously the distributions are Gaussians

SLIDE 11

Solution (sketch)

Obviously the distributions are Gaussians
Obviously both distributions have mean zero

SLIDE 12

Solution (sketch)

Obviously the distributions are Gaussians
Obviously both distributions have mean zero
To compute the (co)variance, take products and expectations and remember that ⟨w_i w_j⟩ = δ_ij
Defining φ(x) = (φ_1(x), . . . , φ_N(x)), we get that ⟨f(x_i) f(x_j)⟩ = φ(x_i)ᵀ φ(x_j)
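A Monte Carlo check of this result (a sketch, reusing the toy Gaussian-bump basis assumed earlier):

import numpy as np

rng = np.random.default_rng(2)
centres = np.linspace(-3, 3, 15)

def phi(x):
    # Feature vector phi(x) = (phi_1(x), ..., phi_N(x)) at a single input x.
    return np.exp(-0.5 * (x - centres) ** 2 / 0.5 ** 2)

x1, x2 = 0.3, 1.1
W = rng.standard_normal((100_000, len(centres)))   # rows are draws of w ~ N(0, I)
f1, f2 = W @ phi(x1), W @ phi(x2)
print(np.mean(f1 * f2))    # empirical <f(x1) f(x2)>
print(phi(x1) @ phi(x2))   # analytic phi(x1)^T phi(x2)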

SLIDE 13

The Gram matrix

Generalising the exercise to more than two points, we get that any finite-dimensional marginal of this process is multivariate Gaussian
The covariance matrix of these marginals is given by evaluating a function of two variables at all pairs of input points
The function is defined by the set of basis functions: k(x_i, x_j) = φ(x_i)ᵀ φ(x_j)
The covariance matrix is often called the Gram matrix and is (necessarily) symmetric and positive definite
Bayesian prediction in regression is then essentially the same as computing conditionals of Gaussians (more later)

SLIDE 14

Stationary variance

We have seen that the variance of a random combination of functions depends on space as Σ_i φ_i²(x)
Given any compact set (e.g. a hypercube with centre in the origin), we can find a finite set of basis functions s.t. Σ_i φ_i²(x) = const (partition of unity, e.g. triangulations or smoother alternatives)
We can construct a sequence of such sets which covers the whole of R^D in the limit
Therefore, we can construct a sequence of priors which all have constant prior variance across all of space
Covariances would still be computed by evaluating a Gram matrix (and need not be constant)

SLIDE 15

Function space view

The argument before shows that we can put a prior over infinite-dimensional spaces of functions s.t. all finite-dimensional marginals are multivariate Gaussian
The constructive argument, often referred to as the weight-space view, is useful for intuition but impractical
It does demonstrate the existence of truly infinite-dimensional Gaussian processes
Once we accept that Gaussian processes exist, we are better off proceeding along a more abstract line

SLIDE 16

GP definition

A Gaussian Process (GP) is a stochastic process indexed by a continuous variable x s.t. all finite-dimensional marginals are multivariate Gaussian
A GP is uniquely defined by its mean and covariance functions, denoted by µ(x) and k(x, x′):

f ∼ GP(µ, k)  ↔  f = (f(x_1), . . . , f(x_N)) ∼ N(µ, K),  µ = (µ(x_1), . . . , µ(x_N)),  K = (k(x_i, x_j))_ij

The covariance function must satisfy some conditions (Mercer's theorem): essentially, it needs to evaluate to a symmetric positive definite matrix for all sets of input points

SLIDE 17

Covariance functions

The covariance function encapsulates the basis functions used → it determines the type of functions which can be sampled
The radial basis function (RBF, or squared exponential) covariance k(x_i, x_j) = α² exp(−(x_i − x_j)² / (2λ²)) corresponds to Gaussian-bump basis functions and yields smooth, bumpy samples
The Ornstein-Uhlenbeck (OU) covariance k(x_i, x_j) = α² exp(−|x_i − x_j| / λ) yields rough paths which are nowhere differentiable
Both RBF and OU are stationary and encode exponentially decaying correlations
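To see the difference, one can sample finite marginals on a dense grid (a sketch; the amplitude α = 1 and lengthscale λ = 0.5 are assumed values):

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 300)
alpha, lam = 1.0, 0.5
d = x[:, None] - x[None, :]

K_rbf = alpha**2 * np.exp(-d**2 / (2 * lam**2))   # smooth, bumpy samples
K_ou = alpha**2 * np.exp(-np.abs(d) / lam)        # rough, nowhere-differentiable samples

jitter = 1e-6 * np.eye(len(x))                    # numerical stabiliser for Cholesky
f_rbf = np.linalg.cholesky(K_rbf + jitter) @ rng.standard_normal(len(x))
f_ou = np.linalg.cholesky(K_ou + jitter) @ rng.standard_normal(len(x))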

SLIDE 18

More on covariance functions

Recent years have seen much work on designing/selecting covariance functions
One line of thought follows the fact that convex combination/multiplication of covariance functions still yields a covariance function
The Automatic Statistician project (Z. Ghahramani) combines these operations with a heuristic search to automatically select a covariance
Another line of research constructs covariance functions out of steady-state autocorrelations of stochastic process models (work primarily by Särkkä and collaborators)

SLIDE 19

Observing GPs

In a regression case, we assume to have observed the function values at some input values, with i.i.d. Gaussian noise of variance σ²
What is the effect of observation noise? Suppose we have a Gaussian vector f ∼ N(µ, Σ), and observations y|f ∼ N(f, σ²I)
Exercise: compute the marginal distribution of y
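A sketch of the answer (standard Gaussian marginalisation; the slide leaves this as an exercise): integrating f out of the joint gives

p(y) = ∫ p(y|f) p(f) df = N(y | µ, Σ + σ²I)

so i.i.d. Gaussian observation noise simply adds σ²I to the covariance.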

SLIDE 20

Predicting with GPs

Suppose we have noisy observations y of a function at inputs x, and want to predict the function value at a new input x_new
The joint prior probability of the function values at the observed and new input points is multivariate Gaussian
By Bayes' theorem, we have

p(f_new|y) ∝ ∫ df(x) p(f_new, f(x)) p(y|f(x))   (1)

where f(x) is the vector of true function values at the input points
For Gaussian observation noise, the integral can be computed analytically

SLIDE 21

Regression calcs

Define

Σ = ( k**   k*ᵀ
      k*    K )   (2)

where k** = k(x*, x*), k*_j = k(x*, x_j) and K_ij = k(x_i, x_j)
Supposing the observation noise is i.i.d. Gaussian, y | f(x) ∼ N(f(x), σ²I), we can obtain the joint distribution of the observations and the new value

p(f*, y) = N(0, Σ_y),   Σ_y = ( k**   k*ᵀ
                                k*    K + σ²I )   (3)

The final result is that p(f* | y) = N(m, v) with

m = k*ᵀ (K + σ²I)⁻¹ y,   v = k** − k*ᵀ (K + σ²I)⁻¹ k*   (4)
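A minimal NumPy sketch of these formulas (the RBF kernel and all hyperparameter values are assumptions for illustration):

import numpy as np

def rbf(a, b, alpha=1.0, lam=0.5):
    # k(x, x') = alpha^2 exp(-(x - x')^2 / (2 lam^2)), evaluated on all pairs.
    return alpha**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * lam**2))

def gp_predict(x, y, x_star, sigma2=0.1):
    Ky = rbf(x, x) + sigma2 * np.eye(len(x))    # K + sigma^2 I
    k_star = rbf(x, x_star)                     # k(x_i, x_*) for all pairs
    m = k_star.T @ np.linalg.solve(Ky, y)       # posterior mean, eq. (4)
    v = rbf(x_star, x_star) - k_star.T @ np.linalg.solve(Ky, k_star)  # posterior covariance
    return m, v

rng = np.random.default_rng(4)
x = np.linspace(0, 5, 20)
y = np.sin(x) + 0.3 * rng.standard_normal(20)
m, v = gp_predict(x, y, np.linspace(0, 5, 100))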

SLIDE 22

Covariance parameters

Covariance functions often depend on hyperparameters (e.g. the amplitude and lengthscale of the RBF covariance)
These can be tuned by optimising the marginal likelihood (so-called type II maximum likelihood)

L = log ∫ df(x) p(f(x)) p(y|f(x)) = −(1/2) log |K + σ²I| − (1/2) yᵀ (K + σ²I)⁻¹ y + const

Usually gradient methods are used; fully Bayesian treatments are complicated by the generally complex functional form (no conjugate prior)
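A sketch of type II maximum likelihood (SciPy's default quasi-Newton optimiser with numerical gradients; a real implementation would use the analytic gradient):

import numpy as np
from scipy.optimize import minimize

def neg_log_marginal(log_params, x, y):
    # Work with log(alpha), log(lam), log(sigma2) to keep parameters positive.
    alpha, lam, s2 = np.exp(log_params)
    K = alpha**2 * np.exp(-(x[:, None] - x[None, :])**2 / (2 * lam**2))
    L = np.linalg.cholesky(K + s2 * np.eye(len(x)))
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + s2 I)^{-1} y
    # -log p(y) = 1/2 y^T (K + s2 I)^{-1} y + 1/2 log|K + s2 I| + n/2 log(2 pi)
    return 0.5 * y @ a + np.log(np.diag(L)).sum() + 0.5 * len(x) * np.log(2 * np.pi)

rng = np.random.default_rng(5)
x = np.linspace(0, 5, 30)
y = np.sin(x) + 0.3 * rng.standard_normal(30)
res = minimize(neg_log_marginal, x0=np.zeros(3), args=(x, y))
print(np.exp(res.x))   # fitted amplitude, lengthscale, noise variance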

SLIDE 23

GP regression - example

[Figure: marginal log-likelihood as a function of µ]

SLIDE 25

GP regression - example

[Figure: marginal log-likelihood as a function of µ, with σ² = 0.2]

SLIDE 26

GP regression - example

[Figure: marginal log-likelihood as a function of µ, with σ² = 1]

SLIDE 27

GP prediction from general observations

Equation (1) holds whatever p(y|f(x)); the posterior statistics can only be computed analytically if the observation noise is Gaussian
For general observation models (e.g. classification, counting), the integral in (1) is no longer analytically computable
Several approximate algorithms are available; we use Expectation Propagation (EP), an established fast approximate inference method

SLIDE 28

Pitfalls of GP prediction

Addition of a new observation always reduces uncertainty at all points → vulnerable to outliers
Optimisation of hyperparameters is often tricky: it works well if σ² is known, otherwise it can be seriously multimodal
MAIN PROBLEM: GP prediction relies on a matrix inversion which scales cubically with the number of points!
Sparsification methods have been proposed, but in high dimension GP regression is likely to be tricky nevertheless

SLIDE 29

Bayesian Optimisation and Active Learning

Active learning proposes a dynamic world-view where learning takes place in cycles, intelligently selecting instances to query
Bayesian Optimisation uses similar ideas to tackle non-convex optimisation
The scenario is that the function to be optimised is unknown but can be queried (at some computational cost)
It replaces a hard optimisation problem with an iterative approach where an easier problem is solved at every iteration
Closely related to Reinforcement Learning

SLIDE 30

BO key idea and terminology

At every step of the algorithm, we have a few function evaluations and want to select a point at which to evaluate the function next
KEY IDEA: treat the function as a random variable, use the existing function evaluations to compute a posterior distribution over functions, and use this distribution to select the new point
The posterior mean is called the surrogate function
The surrogate is used to create an acquisition function which is maximised to find the next evaluation point
The measure of success is the cumulative regret

R_T = (1/T) Σ_{t=1}^{T} (f(x_t) − f(x*))²

SLIDE 31

Exploration-exploitation

It will come as no surprise that GPs can be used in Bayesian Optimisation to construct a surrogate function
Directly optimising the surrogate, however, is a bad idea
One needs a strategy to trade off exploitation (looking around areas which you know to be promising) and exploration (checking out areas which you don't know much about)

SLIDE 32

The GP-UCB rule

GPs also provide uncertainty quantification (in fact, they provide full distributions over outputs at any point)
One can trade off exploration and exploitation by selecting regions where the function could conceivably be high, rather than regions where we expect it to be high
The Gaussian Process Upper Confidence Bound (GP-UCB) algorithm maximises the following acquisition function to select its next point:

F(x) = E[f(x)] + β_t √(var(f(x)))

which is an upper quantile of the single-point marginals
Surprisingly, Srinivas et al proved that this algorithm is globally convergent in probability, i.e. given δ, ε, with probability 1 − ε the regret will become smaller than δ
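A toy GP-UCB loop on a dense grid (a sketch: RBF kernel, fixed hyperparameters, and a constant β, where the theory requires β_t to grow with t; the hidden function is invented for illustration):

import numpy as np

def rbf(a, b, alpha=1.0, lam=0.3):
    return alpha**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * lam**2))

def gp_posterior(x, y, grid, sigma2=1e-4):
    # Posterior mean and pointwise variance on the grid, as in eq. (4).
    Ky = rbf(x, x) + sigma2 * np.eye(len(x))
    ks = rbf(x, grid)
    m = ks.T @ np.linalg.solve(Ky, y)
    var = np.diag(rbf(grid, grid)) - np.sum(ks * np.linalg.solve(Ky, ks), axis=0)
    return m, np.clip(var, 0.0, None)

f_hidden = lambda x: -np.sin(3 * x) - x**2 + 0.7 * x   # function to optimise
grid = np.linspace(-1.0, 2.0, 400)
x_obs = np.array([0.0])
y_obs = f_hidden(x_obs)
beta = 2.0
for t in range(10):
    m, var = gp_posterior(x_obs, y_obs, grid)
    x_next = grid[np.argmax(m + beta * np.sqrt(var))]   # maximise the acquisition
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, f_hidden(x_next))
print(x_obs[np.argmax(y_obs)])   # best query found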

SLIDE 33

The GP-UCB algorithm - example

[Figure series: six slides showing successive iterations of GP-UCB on a one-dimensional example]

SLIDE 39

Why it works

Srinivas et al (IEEE Trans Inf Th 2012) linked the expectation of the cumulative regret to a submodular function
Submodular functions obey a diminishing-returns rule → greedy optimisation provably works for submodular functions
The upper-quantile term β_t must be increased according to a specific schedule for the guarantees to hold

SLIDE 40

Limitations of GP-UCB

The cubic scaling of GP regression limits the dimensionality of the space; in my experience anything above 10 is fanciful
Optimising the acquisition function is still an NP-hard problem! In fact, the acquisition function can be nastier (more multimodal) than the original function
Convergence is not very fast (O(√T)) → in cases where the true function is really expensive to evaluate this could be too much
