Power Expectation Propagation for Deep Gaussian Processes


SLIDE 1

Power Expectation Propagation for Deep Gaussian Processes

  • Dr. Richard E. Turner (ret26@cam.ac.uk)

Computational and Biological Learning Lab, Department of Engineering, University of Cambridge

with Thang Bui, Yingzhen Li, José Miguel Hernández-Lobato, Daniel Hernández-Lobato, Josiah Yan

1 / 32

SLIDE 6

Motivation: Gaussian Process regression

[Figure: training inputs and outputs, with the outputs at unseen test inputs marked "?"]

inference & learning: the intractabilities are both computational and analytic

2 / 32
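
As a reminder of where the two intractabilities come from (standard GP regression facts, not transcribed from the slide): with a Gaussian likelihood the posterior and marginal likelihood have closed forms but require O(N^3) computation, and with non-Gaussian likelihoods even those closed forms disappear.

```latex
% Exact GP regression with Gaussian noise, y_n = f(x_n) + eps_n, eps_n ~ N(0, sigma^2):
% predictive posterior at test inputs and the log marginal likelihood.
\begin{align}
p(\mathbf{f}_* \mid \mathbf{y})
  &= \mathcal{N}\!\big(\mathbf{f}_*;\,
     \mathbf{K}_{*f}(\mathbf{K}_{ff} + \sigma^2 \mathbf{I})^{-1}\mathbf{y},\,
     \mathbf{K}_{**} - \mathbf{K}_{*f}(\mathbf{K}_{ff} + \sigma^2 \mathbf{I})^{-1}\mathbf{K}_{f*}\big), \\
\log p(\mathbf{y})
  &= -\tfrac{1}{2}\,\mathbf{y}^{\top}(\mathbf{K}_{ff} + \sigma^2 \mathbf{I})^{-1}\mathbf{y}
     -\tfrac{1}{2}\log\big|\mathbf{K}_{ff} + \sigma^2 \mathbf{I}\big|
     -\tfrac{N}{2}\log 2\pi.
\end{align}
% The N x N inverse and determinant are the O(N^3) computational bottleneck;
% non-Gaussian likelihoods remove even these closed forms (the analytic problem).
```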

SLIDE 14

EP pseudo-point approximation

[Figure: the true posterior and marginal likelihood are approximated using a small set of 'pseudo' data, parameterised by the input locations of the 'pseudo' data together with their outputs and a covariance]

3 / 32
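
For orientation, here is the usual form of the pseudo-point posterior written in my own notation (an assumption about notation, not copied from the slide): the exact prior conditional is kept and the N likelihood terms are replaced by pseudo-observation factors on the inducing values u = f(z) at M pseudo-input locations z.

```latex
\begin{equation}
q(f) \;=\; p\!\left(f_{\neq u} \mid u\right) q(u),
\qquad
q(u) \;\propto\; p(u) \prod_{n=1}^{N} \tilde t_n(u).
\end{equation}
% Predictions and the approximate marginal likelihood then cost O(N M^2)
% rather than O(N^3), since M (the number of pseudo-points) is chosen << N.
```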

SLIDE 21

EP algorithm

  • 1. remove: take out one pseudo-observation likelihood (forming the cavity)
  • 2. include: add in one true observation likelihood (forming the tilted distribution)
  • 3. project: project onto the approximating family (KL between unnormalised stochastic processes)
  • 4. update: update the pseudo-observation likelihood (a rank-1 change)

Properties of the KL projection:
  • 1. at the minimum: moments are matched at the pseudo-inputs
  • 2. Gaussian regression: matches moments everywhere

4 / 32
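
The four steps translate almost line for line into code. The sketch below is a minimal, hedged toy version of the EP inner loop: a 1-D model with a Gaussian prior and generic non-Gaussian likelihood sites, using grid-based moment matching in place of the analytic moments and the GP pseudo-point structure used in the talk.

```python
import numpy as np
from scipy.stats import norm

# Minimal EP sketch for a 1-D toy model: p(x | data) ∝ N(x; 0, 1) * prod_n t_n(x),
# with non-Gaussian true sites t_n. Approximate sites are Gaussian, stored as
# natural parameters (precision lam, precision-times-mean eta).
def ep_1d(sites_true, n_sweeps=20, grid=np.linspace(-10.0, 10.0, 2001)):
    dx = grid[1] - grid[0]
    lam = np.zeros(len(sites_true))      # approximate-site precisions
    eta = np.zeros(len(sites_true))      # approximate-site precision * mean
    prior_lam, prior_eta = 1.0, 0.0      # N(0, 1) prior
    for _ in range(n_sweeps):
        for n, t_n in enumerate(sites_true):
            # 1. remove: delete site n from the posterior -> cavity
            cav_lam = prior_lam + lam.sum() - lam[n]
            cav_eta = prior_eta + eta.sum() - eta[n]
            # 2. include: tilted distribution = cavity * true site (on a grid)
            cavity = np.exp(-0.5 * cav_lam * grid**2 + cav_eta * grid)
            tilted = cavity * t_n(grid)
            tilted /= tilted.sum() * dx
            # 3. project: moment-match a Gaussian to the tilted distribution
            m = (grid * tilted).sum() * dx
            v = (((grid - m) ** 2) * tilted).sum() * dx
            # 4. update: new site = matched posterior / cavity (natural params)
            lam[n] = 1.0 / v - cav_lam
            eta[n] = m / v - cav_eta
    post_lam = prior_lam + lam.sum()
    post_eta = prior_eta + eta.sum()
    return post_eta / post_lam, 1.0 / post_lam   # posterior mean and variance

# usage: two probit-like likelihood sites
sites = [lambda x: norm.cdf(3.0 * (x - 1.0)), lambda x: norm.cdf(3.0 * (x + 0.5))]
print(ep_1d(sites))
```

The same remove/include/project/update pattern carries over to the GP case, where the sites live on the pseudo-points and each update is a rank-1 change.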

SLIDE 36

Fixed points of EP = FITC approximation

Csató & Opper (2002); Qi, Abdel-Gawad & Minka (2010)

[Figure: the EP fixed-point solution is equivalent to the FITC approximation]

  • This interpretation resolves philosophical issues with FITC (increase M with N)
  • FITC is known to overfit ⇒ EP over-estimates the marginal likelihood

8 / 32
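
For reference, the standard FITC definitions (well-known results, not transcribed from the figure): FITC replaces the exact GP prior over the training function values with a conditional that factorises given the inducing values, giving a low-rank-plus-diagonal covariance.

```latex
% Q_ff is the low-rank term built from the M pseudo-points:
\begin{align}
\mathbf{Q}_{ff} &= \mathbf{K}_{fu}\,\mathbf{K}_{uu}^{-1}\,\mathbf{K}_{uf}, \\
p_{\mathrm{FITC}}(\mathbf{f}) &= \mathcal{N}\!\big(\mathbf{f};\; \mathbf{0},\;
   \mathbf{Q}_{ff} + \mathrm{diag}(\mathbf{K}_{ff} - \mathbf{Q}_{ff})\big).
\end{align}
% The EP fixed points with pseudo-observation sites recover exactly this model
% (Csato & Opper 2002; Qi, Abdel-Gawad & Minka 2010), at O(N M^2) cost.
```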

SLIDE 38

Power EP algorithm (as tractable as EP)

  • 1. remove: take out a fraction of one pseudo-observation likelihood (forming the cavity)
  • 2. include: add in a fraction of one true observation likelihood (forming the tilted distribution)
  • 3. project: project onto the approximating family (KL between unnormalised stochastic processes)
  • 4. update: update the pseudo-observation likelihood (a rank-1 change)

Properties of the KL projection:
  • 1. at the minimum: moments are matched at the pseudo-inputs
  • 2. Gaussian regression: matches moments everywhere

10 / 32
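
Written out in my notation (a hedged summary of the undamped inner loop; damping and implementation details are omitted, and the cited papers should be consulted for the precise updates), the only change relative to EP is the power α on the removed and included factors:

```latex
\begin{align}
\text{remove:}\quad  & q^{\setminus n}(f) \;\propto\; q(f)\,\big/\,\tilde t_n(u)^{\alpha}, \\
\text{include:}\quad & \tilde p_n(f) \;\propto\; q^{\setminus n}(f)\, p(y_n \mid f)^{\alpha}, \\
\text{project:}\quad & q^{*}(f) \;=\; \operatorname{proj}\big[\tilde p_n(f)\big], \\
\text{update:}\quad  & \tilde t_n(u) \;\leftarrow\; \big(q^{*}(f)\,\big/\,q^{\setminus n}(f)\big)^{1/\alpha}.
\end{align}
% alpha = 1 recovers EP (and hence FITC at the fixed points); alpha -> 0
% recovers the variational free energy (VFE) approximation.
```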

SLIDE 39

Power EP: a unifying framework

  • FITC: Csató and Opper, 2002; Snelson and Ghahramani, 2005
  • VFE: Titsias, 2009

11 / 32

SLIDE 46

Power EP: a unifying framework

  • Approximate blocks of data: structured approximations
    PITC / BCM: Schwaighofer & Tresp, 2002; Snelson, 2006; VFE: Titsias, 2009
  • Place pseudo-data in a different space (pseudo-data in the new space): interdomain transformations (a linear transform)
    Figueiras-Vidal & Lázaro-Gredilla, 2009; Tobar et al., 2015; Matthews et al., 2016

12 / 32
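
A brief note on what 'interdomain' means here, using the standard construction (my summary, not spelled out on the slide): the inducing variables are linear functionals of the GP rather than point evaluations, so they remain jointly Gaussian with f and the same sparse machinery applies.

```latex
% An interdomain inducing variable u_m defined through a linear transform of f,
% with a window/weight function w_m chosen by the user:
\begin{align}
u_m &= \int w_m(x)\, f(x)\, \mathrm{d}x, \\
\operatorname{cov}\big(f(x), u_m\big) &= \int w_m(x')\, k(x, x')\, \mathrm{d}x', \qquad
\operatorname{cov}\big(u_m, u_{m'}\big) = \iint w_m(x)\, w_{m'}(x')\, k(x, x')\, \mathrm{d}x\, \mathrm{d}x'.
\end{align}
% Different choices of w_m place the 'pseudo-data' in a new domain (e.g. a
% frequency-like domain) while keeping all the conditionals Gaussian.
```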

SLIDE 47

Power EP: a unifying framework

[Table: sparse GP regression and GP classification methods (standard, inter-domain and structured variants) arranged by the divergence they correspond to (PEP, VFE or EP); * = optimised pseudo-inputs, ** = structured versions of VFE recover VFE]

Key to the references in the table:
[4] Quiñonero-Candela et al., 2005; [5] Snelson et al., 2005; [6] Snelson, 2006; [7] Schwaighofer, 2002; [8] Titsias, 2009; [9] Csató, 2002; [10] Csató et al., 2002; [11] Seeger et al., 2003; [12] Naish-Guzman et al., 2007; [13] Qi et al., 2010; [14] Hensman et al., 2015; [15] Hernández-Lobato et al., 2016; [16] Matthews et al., 2016; [17] Figueiras-Vidal et al., 2009

13 / 32

SLIDE 48

How should I set the power parameter α?

[Figure: critical-difference (CD) diagrams of average ranks over the tested values of α (0.05 to 1); Error and MLL ranks for classification, SMSE and SMLL ranks for regression; CD indicates a significant difference]

  • 6 UCI classification datasets, 20 random splits, M = 10, 50, 100, hypers and inducing inputs optimised
  • 8 UCI regression datasets, 20 random splits, M = 0-200, hypers and inducing inputs optimised

α = 0.5 does well overall

14 / 32

SLIDE 49

Deep Gaussian processes

  f_l ∼ GP(0, k(·, ·)),    y_n = g(x_n) = f_L(f_{L−1}(· · · f_2(f_1(x_n)))) + ε_n,
  h_{L−1,n} := f_{L−1}(· · · f_1(x_n)),    y_n = f_L(h_{L−1,n}) + ε_n

Deep GPs [Damianou and Lawrence (2013), unsupervised learning] are a multi-layer generalisation of Gaussian processes, equivalent to deep neural networks with infinitely wide hidden layers.

Questions:
  • How can inference and learning be performed tractably?
  • How do Deep GPs compare to alternatives, e.g. Bayesian neural networks?

[Graphical model: x_n → h_{1,n} → h_{2,n} → y_n through f_1, f_2, f_3, with a plate over the N data points]

15 / 32
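
To make the composition y_n = f_L(... f_1(x_n)) + ε_n concrete, here is a hedged toy sketch (my own illustrative code with made-up kernel hyper-parameters, not the model used later in the talk) that draws a single function from a two-layer deep GP prior by sampling each layer on a grid and feeding the draw into the next layer:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between 1-D input vectors a and b."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def sample_gp(inputs, lengthscale, jitter=1e-6):
    """Draw one sample of f(inputs) with f ~ GP(0, k), via a Cholesky factor."""
    K = rbf(inputs, inputs, lengthscale) + jitter * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ np.random.randn(len(inputs))

x = np.linspace(-3.0, 3.0, 200)
h = sample_gp(x, lengthscale=1.5)        # hidden layer: h = f_1(x)
f = sample_gp(h, lengthscale=0.5)        # output layer evaluated at h: f_2(h)
y = f + 0.05 * np.random.randn(len(x))   # observations: y = f_2(f_1(x)) + noise
```

Even with stationary kernels in each layer, the composed draw f_2(f_1(x)) behaves like a sample from a non-stationary, input-warped GP, which is the automatic kernel-design argument made on the next slide.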

SLIDE 51

Pros and cons of Deep GPs

Why deep GPs? Because they are deep and nonparametric, they:
  • discover useful input warpings or dimensionality compression/expansion, i.e. automatic, nonparametric Bayesian kernel design,
  • give a non-Gaussian functional mapping g.

Drawbacks:
  • bottleneck in the hierarchy? need medium/high-dimensional hidden layers, skip links
  • too flexible?
  ◮ how to incorporate prior knowledge, e.g. invariance
  ◮ learnability/identifiability

16 / 32

SLIDE 54

Power EP for Deep GPs

Joint distribution:

  p(f_1) p(f_2) p(f_3) ∏_n p(y_n | h_{2,n}, f_3) p(h_{2,n} | h_{1,n}, f_2) p(h_{1,n} | f_1, x_n)

EP approximation:

  p(f_1) p(f_2) p(f_3) ∏_n s_{3,n}(h_{2,n}) t_{3,n}(u_3) r_{2,n}(h_{2,n}) s_{2,n}(h_{1,n}) t_{2,n}(u_2) × r_{1,n}(h_{1,n}) t_{1,n}(u_1)

Power EP:
  • initialise with all approximate factors = 1
  • incorporate p(h_{1,n} | f_1, x_n), update r_{1,n}(h_{1,n}) t_{1,n}(u_1)
  • incorporate p(h_{2,n} | h_{1,n}, f_2), update r_{2,n}(h_{2,n}) s_{2,n}(h_{1,n}) t_{2,n}(u_2)
  • incorporate p(y_n | h_{2,n}, f_3), update s_{3,n}(h_{2,n}) t_{3,n}(u_3)

Once again: the optimal Gaussian t_{m,n}(u_m) is rank 1; α → 0 recovers Damianou & Lawrence (2013)

17 / 32

SLIDE 55

Power EP for Deep GPs: three key additional ideas

  1. Reduce memory overhead: tie the factors, t_{m,n}(u_m) = t_m(u_m) (Stochastic Expectation Propagation for the inducing variables)
  2. Reduce message-passing overhead: only pass messages down from the inputs to the outputs (ADF for the hidden-unit activities)
  3. Improve hyper-parameter optimisation: optimise the EP energy log Z_EP directly using ADAM

18 / 32
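
Idea 1 is the stochastic EP (SEP) construction from the related-papers list on the final slide. In my notation (a hedged summary with the deep-GP layer indices dropped), a single shared site raised to the N-th power stands in for the N per-datapoint sites:

```latex
\begin{align}
q(u) &\propto p(u)\, t(u)^{N}
   && \text{one tied site: } O(M^2) \text{ memory instead of } O(N M^2), \\
q^{\setminus 1}(u) &\propto q(u) \,/\, t(u)
   && \text{cavity: remove a single copy of the tied site,} \\
q^{*}(u) &= \operatorname{proj}\big[\, q^{\setminus 1}(u)\, p(y_n \mid u)\,\big]
   && \text{moment-match the tilted distribution,} \\
t(u) &\leftarrow t(u)^{1 - 1/N} \big( q^{*}(u) \,/\, q^{\setminus 1}(u) \big)^{1/N}
   && \text{damped update: fold in } 1/N \text{ of the new site.}
\end{align}
```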

SLIDE 59

Training deep GPs

[Figure (shown over slides 19-22): a toy two-input problem, y = g(x1, x2) + noise, modelled by a two-layer deep GP in which hidden functions f11(x1, x2) and f12(x1, x2) feed an output function f2(f11, f12)]

22 / 32

SLIDE 60

Experiment: Value function of the mountain car problem

[Figure: the value function over (x1, x2), a GP fit, and a DGP fit]

23 / 32

SLIDE 61

Experiment: Comparison to Bayesian neural networks

We compared DGPs with GPs and Bayesian neural networks with one and two hidden layers using:
  • VI(G): Graves' VI [diagonal Gaussian, without the reparam. trick]
  • VI(KW): Kingma and Welling's VI [with the reparam. trick]
  • PBP: ADF with Probabilistic Backpropagation
  • Dropout: combining dropout predictions at test time
  • SGLD: Stochastic gradient Langevin dynamics
  • HMC: Hamiltonian Monte Carlo [only for small networks]

24 / 32

SLIDE 62

Experiment: Comparison to Bayesian neural networks

[Figure: critical-difference diagram of MLL average ranks of all methods across all datasets; methods include PBP-1, Dropout-1, SGLD-1, DGP-1 50, SGLD-2, VI(KW)-1, GP 50, DGP-3 100, DGP-2 100, DGP-3 50, DGP-2 50, GP 100, HMC-1, VI(KW)-2 and DGP-1 100]

25 / 32

SLIDE 67

Experiment: Comparison to Bayesian neural networks [Best results]

[Figure (shown over slides 26-30): average test log-likelihood/nats of the best BNN-deterministic, BNN-sampling, GP and DGP configurations on ten UCI regression datasets]

  dataset    N        D
  boston     506      13
  concrete   1030     8
  energy     768      8
  kin8nm     8192     8
  naval      11934    16
  power      9568     4
  protein    45730    9
  red wine   1588     11
  yacht      308      6
  year       515345   90

30 / 32

SLIDE 68

Experiment: Efficiency of organic photovoltaic molecules

  • Dataset: 50k/10k training/test points, 512-dim. binary input features
  • Need error-bars for active learning or Bayesian optimisation

[Figure: test MLL for BNN-VI, GP 200, GP 400, DGP-2 200 and DGP-5 200]

31 / 32

SLIDE 69

References (hyperlinked)

Core material:
  • A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, arXiv preprint
  • Deep Gaussian Processes for Regression using Approximate Expectation Propagation, ICML 2016

Related papers:
  • Stochastic Expectation Propagation, NIPS 2015
  • Black-box α-divergence Minimization, ICML 2016

32 / 32