Sparse Gaussian Process Approximations, Dr. Richard E. Turner (PowerPoint PPT Presentation)



slide-1
SLIDE 1

Sparse Gaussian Process Approximations

  • Dr. Richard E. Turner (ret26@cam.ac.uk)

Computational and Biological Learning Lab, Department of Engineering, University of Cambridge

1 / 90

slide-2
SLIDE 2

Motivating application 1: Audio modelling

[Figure: audio time-series data reconstruction using a GP model; axes: time /s and time /ms]

T = 10^5–10^7 datapoints

2 / 90

slide-3
SLIDE 3

Motivating application 1: Audio modelling

[Figure: audio time-series data reconstruction using a GP model (repeated from Slide 2)]

How can we use GPs in this setting?

3 / 90

slide-4
SLIDE 4

Motivating application 2: non-linear regression

[Figure: average test log-likelihood (nats) on ten UCI regression benchmarks, comparing BNN-deterministic, BNN-sampling, GP and DGP]

Datasets: boston (N = 506, D = 13); concrete (N = 1030, D = 8); energy (N = 768, D = 8); kin8nm (N = 8192, D = 8); naval (N = 11934, D = 16); power (N = 9568, D = 4); protein (N = 45730, D = 9); red wine (N = 1588, D = 11); yacht (N = 308, D = 6); year (N = 515345, D = 90)

4 / 90


slide-9
SLIDE 9

Motivation: Gaussian Process Regression

[Figure: training data, inputs vs. outputs]

9 / 90

slide-10
SLIDE 10

Motivation: Gaussian Process Regression

[Figure: training data, inputs vs. outputs, with a query input marked '?']

9 / 90


slide-12
SLIDE 12

Motivation: Gaussian Process Regression

[Figure: training data, inputs vs. outputs, with a query input marked '?']

inference & learning

9 / 90

slide-13
SLIDE 13

Motivation: Gaussian Process Regression

[Figure: training data, inputs vs. outputs, with a query input marked '?']

inference & learning → intractabilities: computational and analytic

9 / 90

slide-14
SLIDE 14

Motivation: Gaussian Process Regression

9 / 90

slide-15
SLIDE 15

A Brief History of Gaussian Process Approximations

FITC: Snelson et al., “Sparse Gaussian Processes using Pseudo-inputs”
PITC: Snelson et al., “Local and global sparse Gaussian process approximations”
EP: Csató and Opper 2002 / Qi et al., “Sparse-posterior Gaussian Processes for general likelihoods”
VFE: Titsias, “Variational Learning of Inducing Variables in Sparse Gaussian Processes”
DTC / PP: Seeger et al., “Fast Forward Selection to Speed Up Sparse Gaussian Process Regression”

10 / 90

slide-16
SLIDE 16

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference

(references as listed on Slide 15)

10 / 90

slide-17
SLIDE 17

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference; methods employing pseudo-data

(references as listed on Slide 15)

10 / 90

slide-18
SLIDE 18

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference; methods employing pseudo-data

(references as listed on Slide 15)

FITC PITC DTC

10 / 90

slide-19
SLIDE 19

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference; methods employing pseudo-data

(references as listed on Slide 15)

FITC PITC DTC

A Unifying View of Sparse Approximate Gaussian Process Regression Quinonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC)

10 / 90

slide-20
SLIDE 20

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference; exact generative model + approximate inference; methods employing pseudo-data

(references as listed on Slide 15)

FITC PITC DTC

A Unifying View of Sparse Approximate Gaussian Process Regression Quinonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC)

10 / 90

slide-21
SLIDE 21

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference; exact generative model + approximate inference; methods employing pseudo-data

(references as listed on Slide 15)

VFE EP PP FITC PITC DTC

A Unifying View of Sparse Approximate Gaussian Process Regression Quinonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC)

10 / 90


slide-23
SLIDE 23

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference; exact generative model + approximate inference; methods employing pseudo-data

(references as listed on Slide 15)

VFE EP PP FITC PITC DTC

A Unifying View of Sparse Approximate Gaussian Process Regression, Quiñonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC)
A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, Bui, Yan and Turner, 2016 (VFE, EP, FITC, PITC, ...)

10 / 90

slide-24
SLIDE 24

Factor Graphs: introduction / reminder

factor graph examples

11 / 90

slide-25
SLIDE 25

Factor Graphs: introduction / reminder

factor graph examples what is the minimal factor graph for this multivariate Gaussian? 4 dimensional solution:

12 / 90

slide-26
SLIDE 26

Factor Graphs: introduction / reminder

factor graph examples what is the minimal factor graph for this multivariate Gaussian? 4 dimensional solution:

13 / 90

slide-27
SLIDE 27

Fully independent training conditional (FITC) approximation

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

14 / 90

slide-28
SLIDE 28

Fully independent training conditional (FITC) approximation

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

  • 1. augment model with M<T pseudo data

15 / 90

slide-29
SLIDE 29

Fully independent training conditional (FITC) approximation

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

  • 1. augment model with M<T pseudo data
  • 2. remove some of the dependencies

(results in simpler model)

all factors

16 / 90


slide-31
SLIDE 31

Fully independent training conditional (FITC) approximation

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

  • 1. augment model with M<T pseudo data
  • 2. remove some of the dependencies

(results in simpler model)

  • 3. calibrate model

(e.g. using KL divergence, many choices)

equal to exact conditionals all factors

18 / 90
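The augmented factor graph itself is only an image in the original slides; a sketch of the standard FITC construction it describes (with u the M pseudo-data function values at inputs Z, f_t the function value at training input x_t, and K the covariance function evaluated at those points; this notation is assumed rather than taken from the transcript):

\begin{align*}
p(\mathbf{u}) &= \mathcal{N}(\mathbf{u};\, \mathbf{0},\, \mathbf{K}_{\mathbf{uu}}) \\
p(f_t \mid \mathbf{u}) &= \mathcal{N}\!\big(f_t;\; \mathbf{K}_{t\mathbf{u}}\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{u},\; K_{tt} - \mathbf{K}_{t\mathbf{u}}\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{K}_{\mathbf{u}t}\big) \quad \text{independently for } t = 1, \dots, T \\
p(y_t \mid f_t) &= \mathcal{N}(y_t;\, f_t,\, \sigma^2)
\end{align*}

Step 2 (removing dependencies) corresponds to the f_t being conditionally independent given u; step 3 (calibration) sets each p(f_t | u) to the exact GP conditional, so the prior marginals of the new model match the original.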

slide-32
SLIDE 32

Fully independent training conditional (FITC) approximation

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

  • 1. augment model with M<T pseudo data
  • 2. remove some of the dependencies

(results in simpler model)

  • 3. calibrate model

(e.g. using KL divergence, many choices)

equal to exact conditionals all factors indirect posterior approximation

19 / 90

slide-33
SLIDE 33

Fully independent training conditional (FITC) approximation

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

indirect posterior approximation

20 / 90


slide-36
SLIDE 36

Fully independent training conditional (FITC) approximation

How do we make predictions?

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

indirect posterior approximation

23 / 90


slide-41
SLIDE 41

Fully independent training conditional (FITC) approximation

cost of computing the likelihood is O(TM²)

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

indirect posterior approximation

28 / 90


slide-44
SLIDE 44

Fully independent training conditional (FITC) approximation

cost of computing the likelihood is O(TM²)

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

indirect posterior approximation

  • original variances along the diagonal: stops variances collapsing

31 / 90
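The likelihood computation referred to on the preceding slides is only shown pictorially; below is a minimal NumPy sketch of the FITC marginal likelihood for a squared-exponential kernel and Gaussian noise (the function names, hyperparameter values and toy data are illustrative assumptions, not from the slides). It uses the Woodbury identity and the matrix determinant lemma, which is what makes the cost scale as O(TM²) rather than O(T³).

import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    # squared-exponential kernel matrix between two sets of inputs
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def fitc_log_marginal_likelihood(X, y, Z, noise_var=0.1, jitter=1e-6):
    # log N(y | 0, Qff + diag(Kff - Qff) + noise_var*I), with Qff = Kfu Kuu^{-1} Kuf,
    # evaluated in O(T M^2) via the Woodbury identity and matrix determinant lemma
    T, M = X.shape[0], Z.shape[0]
    Kuu = rbf(Z, Z) + jitter * np.eye(M)
    Kuf = rbf(Z, X)
    kff_diag = np.full(T, rbf(X[:1], X[:1])[0, 0])   # prior marginal variances (stationary kernel)
    Luu = np.linalg.cholesky(Kuu)
    A = np.linalg.solve(Luu, Kuf)                    # M x T, so that Qff = A.T @ A
    d = kff_diag - np.sum(A**2, axis=0) + noise_var  # FITC diagonal: keeps the exact marginal variances
    B = np.eye(M) + (A / d) @ A.T                    # I + A D^{-1} A^T
    LB = np.linalg.cholesky(B)
    c = np.linalg.solve(LB, (A / d) @ y)
    quad = y @ (y / d) - c @ c                       # y^T (Qff + D)^{-1} y
    logdet = np.sum(np.log(d)) + 2.0 * np.sum(np.log(np.diag(LB)))
    return -0.5 * (T * np.log(2.0 * np.pi) + logdet + quad)

# toy usage with synthetic (hypothetical) data
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(200)
Z = np.linspace(0.0, 5.0, 15)[:, None]               # M = 15 pseudo-inputs
print(fitc_log_marginal_likelihood(X, y, Z))

In practice the pseudo-input locations Z and the kernel hyperparameters would additionally be optimised by gradient ascent on this quantity, as in the Snelson demo that follows.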

slide-45
SLIDE 45

FITC: Demo (Snelson)

32 / 90


slide-47
SLIDE 47

Fully independent training conditional (FITC) approximation

parametric (although cleverly so): if I see more data, should I add extra pseudo-data?

◮ unnatural from a generative modelling perspective
◮ natural from a prediction perspective (posterior gets more complex)

⇒ lost the elegant separation of model, inference and approximation

example of prior approximation

Extensions: inter-domain GPs (pseudo-data in a different space); partially independent training conditional and tree-structured approximations

34 / 90

slide-48
SLIDE 48

Variational free-energy method (VFE)

lower bound the likelihood

35 / 90


slide-53
SLIDE 53

Variational free-energy method (VFE)

lower bound the likelihood

KL between stochastic processes

40 / 90

slide-54
SLIDE 54

Variational free-energy method (VFE)

lower bound the likelihood
assume approximate posterior factorisation with special form; exact:

KL between stochastic processes

41 / 90
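The equation for the 'special form' is not reproduced in the transcript; in the standard Titsias-style construction it is (a sketch, with u the inducing variables and m, S free variational parameters; symbols assumed):

q(f) = p(f_{\neq \mathbf{u}} \mid \mathbf{u})\, q(\mathbf{u}), \qquad q(\mathbf{u}) = \mathcal{N}(\mathbf{u};\, \mathbf{m},\, \mathbf{S}).

Keeping the conditional prior p(f_{\neq u} | u) exact is what makes the intractable infinite-dimensional terms cancel inside the KL, leaving a finite optimisation over q(u) and the pseudo-input locations.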

slide-55
SLIDE 55

Variational free-energy method (VFE)

true posterior approximate posterior

  • optimise the variational free-energy with respect to these variational parameters

42 / 90

slide-56
SLIDE 56

Variational free-energy method (VFE)

true posterior approximate posterior

same form as prediction from GP-regression

  • optimise the variational free-energy with respect to these variational parameters

43 / 90

slide-57
SLIDE 57

Variational free-energy method (VFE)

true posterior approximate posterior

input locations of 'pseudo' data

  • output locations and covariance of 'pseudo' data

same form as prediction from GP-regression

  • optimise the variational free-energy with respect to these variational parameters

44 / 90

slide-58
SLIDE 58

Variational free-energy method (VFE)

lower bound the likelihood
assume approximate posterior factorisation with special form; exact:

predictive from GP regression

KL between stochastic processes

45 / 90

slide-59
SLIDE 59

Variational free-energy method (VFE)

lower bound the likelihood
assume approximate posterior factorisation with special form; exact:

predictive from GP regression

plug into Free-energy:

KL between stochastic processes

46 / 90


slide-62
SLIDE 62

Variational free-energy method (VFE)

lower bound the likelihood where

DTC like uncertainty based correction

49 / 90

slide-63
SLIDE 63

Variational free-energy method (VFE)

lower bound the likelihood where

DTC like uncertainty based correction KL between two multivariate Gaussians average of quadratic form

50 / 90

slide-64
SLIDE 64

Variational free-energy method (VFE)

lower bound the likelihood, where
make the bound as tight as possible:

DTC like uncertainty based correction KL between two multivariate Gaussians average of quadratic form

51 / 90

slide-65
SLIDE 65

Variational free-energy method (VFE)

lower bound the likelihood, where
make the bound as tight as possible: (DTC)

DTC like uncertainty based correction KL between two multivariate Gaussians average of quadratic form

52 / 90
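The collapsed bound referred to on the last few slides is only shown as an image; for a Gaussian likelihood with noise variance \sigma^2 it takes the standard Titsias (2009) form (a sketch in the assumed notation Q_{ff} = K_{fu} K_{uu}^{-1} K_{uf}):

\mathcal{F} = \log \mathcal{N}\!\big(\mathbf{y};\, \mathbf{0},\, \mathbf{Q}_{\mathbf{ff}} + \sigma^2\mathbf{I}\big) \;-\; \frac{1}{2\sigma^2}\,\mathrm{tr}\big(\mathbf{K}_{\mathbf{ff}} - \mathbf{Q}_{\mathbf{ff}}\big).

The first term is the DTC log marginal likelihood; the trace term is the uncertainty-based correction referred to above, which penalises pseudo-input configurations that fail to capture the prior variance of the training function values.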


slide-67
SLIDE 67

Summary of VFE method

  • optimisation of pseudo-point inputs: VFE has better guarantees than FITC
  • variational methods are known to underfit (and have other biases)
  • no augmentation required: the target is the posterior over functions, which includes the inducing variables

◮ pseudo-input locations are pure variational parameters (they do not parameterise the generative model as they do in FITC)

◮ coherent way of adding pseudo-data: more complex posteriors require more computational resources (more pseudo-points)

Rule of thumb: VFE returns better mean estimates; FITC returns better error-bar estimates.

How should we select M = number of pseudo-points?

54 / 90

slide-68
SLIDE 68

How do we select M = number of pseudo-data?

[Figure: 1-D regression dataset, y against x]

55 / 90


slide-70
SLIDE 70

How do we select M = number of pseudo-data?

[Figure: SMSE and compute time/s, alongside the 1-D regression dataset (y against x)]

57 / 90

slide-71
SLIDE 71

How do we select M = number of pseudo-data?

[Figure: SMSE and compute time/s; × = pseudo-data input locations]

58 / 90

slide-72
SLIDE 72

How do we select M = number of pseudo-data?

[Figure: SMSE and compute time/s for VFE (varying M) compared with the exact GP, on the 1-D regression dataset]

59 / 90


slide-85
SLIDE 85

Power Expectation Propagation and Gaussian Processes

72 / 90

slide-86
SLIDE 86

A Brief History of Gaussian Process Approximations

(recap: history diagram and references as on Slide 23)

73 / 90

slide-87
SLIDE 87

EP pseudo-point approximation

true posterior

74 / 90


slide-89
SLIDE 89

EP pseudo-point approximation

true posterior

marginal likelihood posterior

74 / 90

slide-90
SLIDE 90

EP pseudo-point approximation

true posterior approximate posterior

marginal likelihood posterior

74 / 90


slide-94
SLIDE 94

EP pseudo-point approximation

input locations of 'pseudo' data

  • outputs and covariance of 'pseudo' data

true posterior; approximate posterior

marginal likelihood; posterior

exact joint of new GP regression model

74 / 90

slide-95
SLIDE 95

EP algorithm

75 / 90

slide-96
SLIDE 96

EP algorithm

  • 1. remove

take out one pseudo-observation likelihood

cavity

75 / 90

slide-97
SLIDE 97

EP algorithm

  • 1. remove
  • 2. include

take out one pseudo-observation likelihood
add in one true observation likelihood

cavity; tilted

75 / 90

slide-98
SLIDE 98

EP algorithm

  • 1. remove
  • 2. include
  • 3. project

take out one pseudo-observation likelihood
add in one true observation likelihood
project onto approximating family

cavity; tilted
KL between unnormalised stochastic processes

75 / 90

slide-99
SLIDE 99

EP algorithm

  • 1. remove
  • 2. include
  • 3. project
  • 4. update

take out one pseudo-observation likelihood
add in one true observation likelihood
project onto approximating family
update pseudo-observation likelihood

cavity; tilted
KL between unnormalised stochastic processes

75 / 90

slide-100
SLIDE 100

EP algorithm

  • 1. remove
  • 2. include
  • 3. project
  • 4. update

take out one pseudo-observation likelihood
add in one true observation likelihood
project onto approximating family
update pseudo-observation likelihood

cavity; tilted

  • 1. minimum: moments matched at pseudo-inputs
  • 2. Gaussian regression: matches moments everywhere

KL between unnormalised stochastic processes

75 / 90

slide-101
SLIDE 101

EP algorithm

  • 1. remove
  • 2. include
  • 3. project
  • 4. update

take out one pseudo-observation likelihood
add in one true observation likelihood
project onto approximating family
update pseudo-observation likelihood

cavity; tilted

  • 1. minimum: moments matched at pseudo-inputs
  • 2. Gaussian regression: matches moments everywhere

KL between unnormalised stochastic processes rank 1

75 / 90
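Written as equations, the four steps are (a sketch in generic site notation, with q(f) the current Gaussian process approximation and t_n the n-th pseudo-observation likelihood, or 'site'; these symbols are assumed rather than taken from the slides):

\begin{align*}
\text{1. remove:} \quad & q_{\setminus n}(f) \propto q(f) / t_n(f) && \text{(cavity)}\\
\text{2. include:} \quad & \tilde{p}_n(f) \propto q_{\setminus n}(f)\, p(y_n \mid f) && \text{(tilted)}\\
\text{3. project:} \quad & q^{\mathrm{new}}(f) = \arg\min_{q' \in \text{Gaussians}} \mathrm{KL}\big(\tilde{p}_n(f) \,\|\, q'(f)\big) && \text{(moment matching)}\\
\text{4. update:} \quad & t_n(f) \leftarrow q^{\mathrm{new}}(f) / q_{\setminus n}(f)
\end{align*}

so that after the update the global approximation equals the projected distribution; because each step touches only one observation, the resulting update is rank 1.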

slide-102
SLIDE 102

A Brief History of Gaussian Process Approximations

(recap: history diagram and references as on Slide 23)

76 / 90

slide-103
SLIDE 103

Fixed points of EP = FITC approximation

(history diagram and references as on Slide 23)

77 / 90


slide-105
SLIDE 105

Fixed points of EP = FITC approximation

(history diagram and references as on Slide 23)

This interpretation resolves issues with FITC: why does it work so well? Are we allowed to increase M with N?

77 / 90

slide-106
SLIDE 106

EP algorithm

(EP algorithm recap, as on Slide 101)

78 / 90

slide-107
SLIDE 107

Power EP algorithm (as tractable as EP)

  • 1. remove
  • 2. include
  • 3. project
  • 4. update

take out a fraction of one pseudo-observation likelihood
add in a fraction of one true observation likelihood
project onto approximating family
update pseudo-observation likelihood

cavity; tilted

  • 1. minimum: moments matched at pseudo-inputs
  • 2. Gaussian regression: matches moments everywhere

KL between unnormalised stochastic processes rank 1

79 / 90
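The only change relative to EP is that a fraction \alpha of a likelihood is removed and included at a time (same assumed notation as the EP sketch above):

q_{\setminus n}(f) \propto q(f) / t_n(f)^{\alpha}, \qquad \tilde{p}_n(f) \propto q_{\setminus n}(f)\, p(y_n \mid f)^{\alpha}, \qquad t_n(f) \leftarrow t_n(f)^{1-\alpha}\, q^{\mathrm{new}}(f) / q_{\setminus n}(f).

As the unifying-framework result cited in this deck indicates, \alpha \to 0 recovers the VFE solution and \alpha = 1 recovers EP, whose fixed points coincide with FITC.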

slide-108
SLIDE 108

Power EP: a unifying framework

FITC: Csató and Opper, 2002; Snelson and Ghahramani, 2005
VFE: Titsias, 2009

80 / 90

slide-109
SLIDE 109

Power EP: a unifying framework

[Table: sparse GP approximation papers organised by inference scheme (PEP, VFE, EP, plus inter-domain and structured variants) for GP regression and GP classification; * = optimised pseudo-inputs, ** = structured versions of VFE recover VFE]

[4] Quiñonero-Candela et al., 2005
[5] Snelson et al., 2005
[6] Snelson, 2006
[7] Schwaighofer, 2002
[8] Titsias, 2009
[9] Csató, 2002
[10] Csató et al., 2002
[11] Seeger et al., 2003
[12] Naish-Guzman et al., 2007
[13] Qi et al., 2010
[14] Hensman et al., 2015
[15] Hernández-Lobato et al., 2016
[16] Matthews et al., 2016
[17] Figueiras-Vidal et al., 2009

81 / 90

slide-110
SLIDE 110

How should I set the power parameter α?

6 UCI classification datasets; 20 random splits; M = 10, 50, 100; hypers and inducing inputs optimised
8 UCI regression datasets; 20 random splits; M = 0–200; hypers and inducing inputs optimised

[Figure: MSE rank, error rank and log-loss rank as a function of α]

α = 0.5 does well on average

82 / 90

slide-111
SLIDE 111

References (hyperlinked)

Approximate inference in GPs:
  A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, arXiv preprint 2016

Scalable approximate inference:
  Stochastic Expectation Propagation, NIPS 2015
  Black-box α-divergence Minimization, ICML 2016

Deep Gaussian Processes (incl. comparisons to Bayesian Neural Networks and GPs):
  Deep Gaussian Processes for Regression using Approximate Expectation Propagation, ICML 2016

83 / 90

slide-112
SLIDE 112

GP regression: introducing notation

  • Q1. What's the formal justification for how we were using GPs for regression?

84 / 90

slide-113
SLIDE 113

GP regression: introducing notation

  • Q1. What's the formal justification for how we were using GPs for regression?

generative model (like non-linear regression)

85 / 90

slide-114
SLIDE 114

GP regression: introducing notation

  • Q1. What's the formal justification for how we were using GPs for regression?

generative model (like non-linear regression)
place GP prior over the non-linear function (smoothly wiggling functions expected)

86 / 90

slide-115
SLIDE 115

GP regression: introducing notation

  • Q1. What's the formal justification for how we were using GPs for regression?

generative model (like non-linear regression)
place GP prior over the non-linear function (smoothly wiggling functions expected)
sum of Gaussian variables = Gaussian: induces a GP over the observations

87 / 90
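The equations on this slide are not in the transcript; the generative model being described is, in standard notation (a sketch, symbols assumed):

y_t = f(x_t) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2), \qquad f \sim \mathcal{GP}\big(0,\, k(x, x')\big),

and because a sum of Gaussian variables is Gaussian, the observations themselves form a GP: \mathbf{y} \sim \mathcal{GP}\big(0,\, k(x, x') + \sigma^2 \delta_{xx'}\big).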

slide-116
SLIDE 116

GP regression: introducing notation

  • Q3. How do we make predictions?

predictive mean

88 / 90

slide-117
SLIDE 117

GP regression: introducing notation

  • Q3. How do we make predictions?

predictive mean: linear in the data

89 / 90

slide-118
SLIDE 118

GP regression: introducing notation

  • Q3. How do we make predictions?

predictive mean: linear in the data
predictive covariance: predictive uncertainty = prior uncertainty − reduction in uncertainty (predictions more confident than prior)

90 / 90
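The predictive equations that the labels above annotate are, in standard notation (a sketch; K_{ff}, k_{*f} and k_{**} denote train-train, test-train and test-test kernel evaluations, symbols assumed):

\begin{align*}
\text{predictive mean:} \quad & m(x_*) = \mathbf{k}_{*\mathbf{f}}\big(\mathbf{K}_{\mathbf{ff}} + \sigma^2\mathbf{I}\big)^{-1}\mathbf{y} && \text{(linear in the data)}\\
\text{predictive covariance:} \quad & v(x_*) = k_{**} - \mathbf{k}_{*\mathbf{f}}\big(\mathbf{K}_{\mathbf{ff}} + \sigma^2\mathbf{I}\big)^{-1}\mathbf{k}_{\mathbf{f}*} && \text{(prior uncertainty minus reduction in uncertainty)}
\end{align*}

The subtracted term is non-negative, so the predictions are more confident than the prior, as the slide notes.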

slide-119
SLIDE 119

A brief introduction to the Kullback-Leibler divergence

KL(p1(z) || p2(z)) = Σ_z p1(z) log [ p1(z) / p2(z) ]

Important properties:

Gibbs' inequality: KL(p1(z) || p2(z)) ≥ 0, with equality at p1(z) = p2(z)

◮ proof via Jensen's inequality or differentiation (see MacKay pg. 35)

Non-symmetric: KL(p1(z) || p2(z)) ≠ KL(p2(z) || p1(z))

◮ hence named a divergence and not a distance

Example: binary variables z ∈ {0, 1}, with p(z = 1) = 0.8 and q(z = 1) = ρ

[Figure: KL(q || p) and KL(p || q) plotted against ρ; both vanish at ρ = 0.8]
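Worked out for this binary example (a direct instance of the definition above):

KL(q \| p) = \rho \log\frac{\rho}{0.8} + (1-\rho)\log\frac{1-\rho}{0.2}, \qquad KL(p \| q) = 0.8 \log\frac{0.8}{\rho} + 0.2 \log\frac{0.2}{1-\rho}.

Both vanish at \rho = 0.8, but KL(q \| p) stays bounded as \rho \to 0 or \rho \to 1, whereas KL(p \| q) diverges at both ends, which makes the asymmetry concrete.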

91 / 90