Modern Gaussian Processes: Scalable Inference and Novel Applications


SLIDE 1

Modern Gaussian Processes: Scalable Inference and Novel Applications

(Part III) Applications, Challenges & Opportunities

Edwin V. Bonilla and Maurizio Filippone

CSIRO’s Data61, Sydney, Australia and EURECOM, Sophia Antipolis, France. July 14th, 2019


SLIDE 2

Outline

  1. Multi-task Learning
  2. The Gaussian Process Latent Variable Model (GPLVM)
  3. Bayesian Optimisation
  4. Deep Gaussian Processes
  5. Other Interesting GP/DGP-based Models

SLIDE 3

Multi-task Learning

SLIDE 4

Data Fusion and Multi-task Learning (1)

  • Sharing information across tasks/problems/modalities
  • Very little data on test task
  • Can model dependencies a priori
  • Correlated GP prior over latent functions

[Figure: graphical models coupling correlated latent functions f1, f2, f3 (hyperparameters θ) with their observations y1, y2, y3]

SLIDE 6

Data Fusion and Multi-task Learning (2)

Multi-task GP (Bonilla et al, NeurIPS, 2008)

  • Cov(fℓ(x), fm(x′)) = K^f_ℓm κ(x, x′)
  • K^f can be estimated from data
  • Kronecker-product covariances
      ◮ ‘Efficient’ computation (see the sketch below)
  • Robot inverse dynamics (Chai et al, NeurIPS, 2009)

Generalisations and other settings:

  • Convolution formalism (Alvarez and Lawrence, JMLR, 2011)
  • GP regression networks (Wilson et al, ICML, 2012)
  • Many more ...
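As a rough illustration of the Kronecker-structured prior above, here is a minimal NumPy sketch (not the paper's implementation): the RBF input kernel, the toy sizes and the random low-rank construction of the task matrix K^f are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    """Squared-exponential input kernel kappa(x, x')."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

# Toy setup: T tasks observed at the same N inputs (sizes are illustrative).
N, T, D = 20, 3, 2
X = np.random.randn(N, D)

# Task covariance K^f, built here from a random low-rank factor so it is PSD;
# in the multi-task GP it is a free-form matrix learned from data.
L = np.random.randn(T, 2)
Kf = L @ L.T + 1e-6 * np.eye(T)

Kx = rbf_kernel(X, X)

# Multi-task prior: Cov(f_l(x_i), f_m(x_j)) = Kf[l, m] * Kx[i, j],
# i.e. a Kronecker product over (tasks x inputs).
K_multi = np.kron(Kf, Kx) + 1e-8 * np.eye(T * N)

# One joint draw of all latent functions from the correlated prior.
f = np.random.multivariate_normal(np.zeros(T * N), K_multi).reshape(T, N)
print(f.shape)  # (3, 20): one latent function per task
```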

SLIDE 7

The Gaussian Process Latent Variable Model (GPLVM)

SLIDE 8

Non-linear Dimensionality Reduction with GPs

The Gaussian Process Latent Variable Model (GPLVM; Lawrence, NeurIPS, 2004):

  • Probabilistic non-linear dimensionality reduction
  • Use independent GPs for each observed dimension
  • Estimate latent projections of the data via maximum likelihood (see the sketch below)

[Figure: GPLVM graphical model: latent variables x̃1, x̃2, x̃3 mapped to observed dimensions x1, ..., xD by independent GPs]
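To make the "estimate latent projections" step concrete, below is a minimal sketch of the GPLVM objective on toy data: D independent GPs share the same latent inputs Z, and Z is found by maximum likelihood. The fixed RBF lengthscale, the noise level, the toy sizes and the use of numerical gradients are illustrative simplifications, not the setup of Lawrence (2004).

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, lengthscale=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def neg_log_marginal(z_flat, Y, Q, noise=0.1):
    """Negative GPLVM log marginal likelihood (up to constants):
    independent GPs over the D observed dimensions, sharing latent inputs Z."""
    N, D = Y.shape
    Z = z_flat.reshape(N, Q)
    K = rbf_kernel(Z) + noise * np.eye(N)
    _, logdet = np.linalg.slogdet(K)
    Kinv_Y = np.linalg.solve(K, Y)
    return 0.5 * D * logdet + 0.5 * np.sum(Y * Kinv_Y)

# Toy data: N points in D observed dimensions, compressed to Q latent dimensions.
rng = np.random.default_rng(0)
N, D, Q = 30, 5, 2
Y = rng.standard_normal((N, D))

Z0 = 0.1 * rng.standard_normal(N * Q)
res = minimize(neg_log_marginal, Z0, args=(Y, Q), method="L-BFGS-B")
Z_learned = res.x.reshape(N, Q)   # latent projections of the data
```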

SLIDE 9

Modelling of Human Poses with GPLVMs

(Grochow et al, SIGGRAPH 2004)

Style-Based Inverse Kinematics: Given a set of constraints, produce the most likely pose

  • High-dimensional data derived from pose information
      ◮ joint angles, vertical orientation, velocities and accelerations
  • GPLVM used to learn low-dimensional trajectories
  • GPLVM predictive distribution used in the cost function for finding new poses under constraints
  • Fig. and cool videos at http://grail.cs.washington.edu/projects/styleik/

SLIDE 10

Bayesian Optimisation

SLIDE 15

Probabilistic Numerics: Bayesian Optimisation (1)

Optimisation of black-box functions:

  • Do not know their implementation
  • Costly to evaluate
  • Use GPs as surrogate models

Vanilla BO iterates:

  1. Get a few samples from the true function
  2. Fit a GP to the samples
  3. Use the GP predictive distribution along with an acquisition function to suggest new sample locations (see the sketch below)

What are sensible acquisition functions?
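A minimal sketch of this loop, using scikit-learn's GP regressor as the surrogate and an upper-confidence-bound acquisition as a stand-in (expected improvement, discussed on the next slide, would slot in the same place); the toy 1-D objective and all settings are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f_true(x):
    """Black-box objective (known here only to run the demo)."""
    return np.sin(3 * x) + 0.5 * x

rng = np.random.default_rng(0)
X_grid = np.linspace(0, 5, 200).reshape(-1, 1)

# 1) Get a few samples from the true function.
X = rng.uniform(0, 5, size=(3, 1))
y = f_true(X).ravel()

for _ in range(10):
    # 2) Fit a GP surrogate to the samples collected so far.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
    gp.fit(X, y)

    # 3) Use the GP predictive distribution with an acquisition function
    #    (upper confidence bound here) to suggest the next sample location.
    mu, sigma = gp.predict(X_grid, return_std=True)
    x_next = X_grid[np.argmax(mu + 2.0 * sigma)]

    # Evaluate the black box at the suggested location and repeat.
    X = np.vstack([X, x_next[None, :]])
    y = np.append(y, f_true(x_next).item())

print("best value found:", y.max())
```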

SLIDE 18

Bayesian Optimisation (2)

A taxonomy of algorithms proposed by D. R. Jones (2001)

  • µ(x⋆), σ²(x⋆): predictive mean and variance
  • I := f(x⋆) − f_best: predictive improvement
  • Expected improvement:

      EI(x⋆) = ∫₀^∞ I p(I) dI

      ◮ Simple ‘analytical form’ (see the sketch below)
      ◮ Exploration-exploitation
  • Fig. from Boyle (2007)

Main idea: Sample x⋆ so as to maximize the EI
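For a Gaussian predictive distribution, the EI integral above has the well-known closed form σ(x⋆)[zΦ(z) + φ(z)] with z = (µ(x⋆) − f_best)/σ(x⋆) (maximisation convention). A small sketch, where the jitter guarding σ → 0 is an illustrative choice:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI(x*) = E[max(f(x*) - f_best, 0)] under the Gaussian predictive
    N(mu, sigma^2); closed form for a maximisation problem."""
    sigma = np.maximum(sigma, 1e-12)   # guard against zero predictive variance
    z = (mu - f_best) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))
```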

SLIDE 19

Bayesian Optimisation (3)

Many cool applications of BO and probabilistic numerics:

  • Optimisation of ML algorithms (Snoek et al, NeurIPS, 2012)
  • Preference learning (Chu and Ghahramani, ICML, 2005; Brochu et al, NeurIPS, 2007; Bonilla et al, NeurIPS, 2010)
  • Multi-task BO (Swersky et al, NeurIPS, 2013)
  • Bayesian Quadrature

See http://probabilistic-numerics.org/ and references therein

SLIDE 20

Deep Gaussian Processes

SLIDE 21

The Deep Learning Revolution

  • Large representational power
  • Big data learning through stochastic optimisation
  • Exploit GPU and distributed computing
  • Automatic differentiation
  • Mature development of regularization (e.g., dropout)
  • Application-specific representations (e.g., convolutional)


SLIDE 22

Is There Any Hope for Gaussian Process Models?

Can we exploit what made Deep Learning successful for practical and scalable learning of Gaussian processes?


SLIDE 23

Deep Gaussian Processes

  • Composition of processes: (f ◦ g)(x)??

SLIDE 24

Teaser — Modern GPs: Flexibility and Scalability

  • Composition of processes: Deep Gaussian Processes

[Figure: two-layer DGP graphical model: inputs X, latent layers F(1), F(2) with hyperparameters θ(1), θ(2), and outputs Y]

Damianou and Lawrence, AISTATS, 2013; Cutajar, Bonilla, Michiardi, Filippone, ICML, 2017
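Composing GP draws is easy to see in code. Below is a minimal sketch of sampling from a two-layer DGP prior on a 1-D grid; the RBF kernels, lengthscales, jitter and layer widths are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def sample_gp_layer(X, lengthscale, n_out, rng):
    """Draw n_out independent GP function values at the inputs X."""
    K = rbf_kernel(X, X, lengthscale) + 1e-6 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    return L @ rng.standard_normal((len(X), n_out))

rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 100).reshape(-1, 1)

# Layer 1: a GP sample g(x), treated as the input of the next layer.
F1 = sample_gp_layer(X, lengthscale=1.0, n_out=1, rng=rng)

# Layer 2: f(g(x)), a second GP evaluated at the first layer's outputs.
F2 = sample_gp_layer(F1, lengthscale=0.5, n_out=1, rng=rng)
```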

SLIDE 25

Learning Deep Gaussian Processes

  • Inference requires calculating integrals of this kind:

      p(Y|X, θ) = ∫ p(Y|F(Nh), θ(Nh)) × p(F(Nh)|F(Nh−1), θ(Nh−1)) × ... × p(F(1)|X, θ(0)) dF(Nh) ... dF(1)

  • Extremely challenging!

SLIDE 26

Inference for DGPs

  • Inducing-variable approximations
      ◮ VI + Titsias
          • Damianou and Lawrence (AISTATS, 2013)
          • Hensman and Lawrence (arXiv, 2014)
          • Salimbeni and Deisenroth (NeurIPS, 2017)
      ◮ EP + FITC: Bui et al. (ICML, 2016)
      ◮ MCMC + Titsias
          • Havasi et al (arXiv, 2018)
  • VI + Random feature-based approximations
      ◮ Gal and Ghahramani (ICML, 2016)
      ◮ Cutajar et al. (ICML, 2017)

SLIDE 28

Example: DGPs with Random Features are Bayesian DNNs

Recall RF approximations to GPs (part II-a). Then we have:

[Figure: random-feature DGP as a Bayesian DNN: X → Φ(0) → F(1) → Φ(1) → F(2) → Y, with spectral frequencies Ω(0), Ω(1), weights W(0), W(1) and hyperparameters θ(0), θ(1)]
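A rough sketch of the corresponding forward pass: trigonometric random features for RBF kernels at each layer (as in Part II-a) followed by linear weights. Here Ω and W are simply drawn from standard normals; in the Bayesian treatment they carry (approximate) posteriors. All sizes and names are illustrative.

```python
import numpy as np

def random_features(F, Omega):
    """RBF random-feature map Phi = [cos(F Omega), sin(F Omega)] / sqrt(N_RF)."""
    FO = F @ Omega
    return np.hstack([np.cos(FO), np.sin(FO)]) / np.sqrt(Omega.shape[1])

rng = np.random.default_rng(0)
N, D_in, D_h, D_out, N_RF = 32, 3, 2, 1, 100

X = rng.standard_normal((N, D_in))

# Spectral frequencies Omega^(l) and weights W^(l) for two layers.
Omega0 = rng.standard_normal((D_in, N_RF))
W0 = rng.standard_normal((2 * N_RF, D_h))
Omega1 = rng.standard_normal((D_h, N_RF))
W1 = rng.standard_normal((2 * N_RF, D_out))

Phi0 = random_features(X, Omega0)    # Phi^(0)
F1 = Phi0 @ W0                       # F^(1): first (approximate) GP layer
Phi1 = random_features(F1, Omega1)   # Phi^(1)
F2 = Phi1 @ W1                       # F^(2): feeds the likelihood for Y
```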

SLIDE 29

Stochastic Variational Inference

  • Define Ψ = (Ω(0), . . . , W(0), . . .)
  • Lower bound for log [p(Y|X, θ)]

      E_q(Ψ)(log [p(Y|X, Ψ, θ)]) − DKL[q(Ψ) ‖ p(Ψ|θ)],   where q(Ψ) approximates p(Ψ|Y, θ).

  • DKL computable analytically if q and p are Gaussian! (see the sketch below)

Optimize the lower bound wrt the parameters of q(Ψ)
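With factorised Gaussian q(Ψ) and prior p(Ψ|θ), the KL term has the usual closed form. A minimal sketch (the function name and the element-wise parameterisation are illustrative):

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Analytic KL[q || p] between factorised Gaussians, summed over all
    entries of Psi (weights and spectral frequencies)."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )
```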

SLIDE 32

Stochastic Variational Inference

  • Assume that the likelihood factorizes:

      p(Y|X, Ψ, θ) = ∏_k p(y_k|x_k, Ψ, θ)

  • Doubly stochastic unbiased estimate of the expectation term (see the sketch below)
      ◮ Mini-batch:
          E_q(Ψ)(log [p(Y|X, Ψ, θ)]) ≈ (n/m) ∑_{k∈I_m} E_q(Ψ)(log [p(y_k|x_k, Ψ, θ)])
      ◮ Monte Carlo:
          E_q(Ψ)(log [p(y_k|x_k, Ψ, θ)]) ≈ (1/N_MC) ∑_{r=1}^{N_MC} log [p(y_k|x_k, Ψ̃_r, θ)],   with Ψ̃_r ∼ q(Ψ)
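A minimal sketch of the doubly stochastic estimate of the expectation term, assuming a hypothetical `sample_psi` that draws Ψ̃_r ∼ q(Ψ) (e.g. via the reparameterisation on the next slide) and a hypothetical `log_lik` that returns Σ_k log p(y_k|x_k, Ψ, θ) over a mini-batch:

```python
import numpy as np

def expected_loglik_estimate(X, Y, sample_psi, log_lik, rng,
                             batch_size=32, n_mc=8):
    """Doubly stochastic estimate of E_q(Psi)[log p(Y | X, Psi, theta)]:
    a mini-batch over data points plus Monte Carlo samples from q(Psi)."""
    n = len(X)
    idx = rng.choice(n, size=batch_size, replace=False)    # mini-batch I_m
    total = 0.0
    for _ in range(n_mc):
        psi = sample_psi(rng)                  # Psi_r ~ q(Psi)
        total += log_lik(Y[idx], X[idx], psi)  # sum_k log p(y_k | x_k, Psi_r)
    # Rescale by n/m for the mini-batch and average over the MC samples.
    return (n / batch_size) * total / n_mc
```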

SLIDE 33

Stochastic Variational Inference

  • Reparameterization trick (see the sketch below):

      (W̃(l)_r)_ij = σ(l)_ij ε(l)_rij + µ(l)_ij,   with ε(l)_rij ∼ N(0, 1)

  • ... same for Ω
  • Variational parameters: µ(l)_ij, (σ²)(l)_ij, ... and the ones for Ω
  • Optimization with automatic differentiation in TensorFlow

Kingma and Welling, ICLR, 2014
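A minimal sketch of the reparameterised draw (plain NumPy rather than TensorFlow for brevity; parameterising the variance through log_var is an illustrative choice):

```python
import numpy as np

def sample_weights(mu, log_var, rng):
    """Reparameterised sample W = mu + sigma * eps with eps ~ N(0, 1).
    The draw is a deterministic function of (mu, log_var), so gradients of the
    Monte Carlo ELBO flow through it under automatic differentiation."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```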

SLIDE 34

Other Interesting GP/DGP-based Models

SLIDE 35

Other Interesting GP/DGP-Based Models (1)

Convolutional GPs and DGPs

  • Wilson et al (NeurIPS, 2016)
  • van der Wilk et al (NeurIPS, 2017)
  • Bradshaw et al (arXiv, 2017)
  • Tran et al (AISTATS, 2019)

Structured Prediction

  • Galliani et al (AISTATS, 2017)

Network-structure discovery

  • Linderman and Adams (ICML, 2014)
  • Dezfouli, Bonilla and Nock (ICML, 2018)

[Figure: uncertainty comparison of CNN, MCD and CNN+GP(RF)]

SLIDE 36

Other Interesting GP/DGP-Based Models (2)

Autoencoders

  • Dai et al (ICLR, 2015); Domingues et al (Mach. Learn., 2018)

Constrained dynamics

  • Lorenzi and Filippone (ICML, 2018)

Reinforcement Learning

  • Rasmussen & Kuss (NIPS, 2004); Engel et al (ICML, 2005)
  • Deisenroth and Rasmussen (ICML, 2011)
  • Martin and Englot (arXiv, 2018)

Doubly stochastic Poisson processes

  • Adams et al (ICML, 2009); Lloyd et al (ICML, 2015)
  • John and Hensman (ICML, 2018)
  • Aglietti, Damoulas and Bonilla (AISTATS, 2019)

SLIDE 37

Conclusions

Applications and extensions of GP models using more complex priors (e.g. coupled priors, compositions) and likelihoods

  • Multi-task GPs by using correlated priors
  • Dimensionality reduction via the GPLVM
  • Probabilistic numerics, e.g. Bayesian optimisation
  • Deep GPs
  • Convolutional GPs
  • Other settings such as RL, structured prediction, Poisson point processes

SLIDE 38

CSIRO’s Data61: Looking for the Next Research Stars in ML

Interested in working at the cutting edge of research in ML and AI? Contact:

  • Richard Nock: richard.nock@data61.csiro.au
  • Edwin Bonilla: edwin.bonilla@data61.csiro.au