
Advanced Introduction to Machine Learning CMU-10715: Gaussian Processes - PowerPoint PPT Presentation



  1. Advanced Introduction to Machine Learning CMU-10715 Gaussian Processes Barnabás Póczos

  2. Introduction

  3. http://www.gaussianprocess.org/ Some of these slides in the intro are taken from D. Lizotte, R. Parr, C. Guestrin

  4. Contents • Introduction • Regression • Properties of Multivariate Gaussian Distributions • Ridge Regression • Gaussian Processes: weight space view (Bayesian ridge regression + kernel trick), function space view (prior distribution over functions + calculation of posterior distributions)

  5. Regression

  6. Why GPs for Regression? Regression methods: linear regression, ridge regression, support vector regression, kNN regression, etc. Motivation 1: all of the above regression methods give point estimates; we would like a method that also provides confidence in its estimates. Motivation 2: let us kernelize linear ridge regression and see what we get…

  7. Why GPs for Regression? GPs can answer the following questions: • Here’s where the function will most likely be (the expected function). • Here are some examples of what it might look like (samples from the posterior distribution). • Here is a prediction of what you’ll see if you evaluate your function at x’, with confidence.

  8. Properties of Multivariate Gaussian Distributions

  9. 1D Gaussian Distribution. Parameters: • mean $\mu$ • variance $\sigma^2$

  10. Multivariate Gaussian

  11. Multivariate Gaussian • A 2-dimensional Gaussian is defined by a mean vector $\mu = [\mu_1, \mu_2]$ and a covariance matrix $\Sigma = \begin{pmatrix} \sigma^2_{1,1} & \sigma^2_{1,2} \\ \sigma^2_{2,1} & \sigma^2_{2,2} \end{pmatrix}$, where $\sigma^2_{i,j} = E[(x_i - \mu_i)(x_j - \mu_j)]$ is the (co)variance. • Note: $\Sigma$ is symmetric and positive semi-definite: $\forall x\colon\; x^T \Sigma x \ge 0$.

  12. Multivariate Gaussian examples: $\mu = (0, 0)$, $\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$

  13. Multivariate Gaussian examples: $\mu = (0, 0)$, $\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$
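
Not part of the slides: a minimal NumPy sketch of the example above, drawing samples from a 2-D Gaussian with $\mu = (0, 0)$ and the 0.8-correlation covariance matrix, and checking that the matrix is positive semi-definite.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])          # symmetric, positive semi-definite

# Draw samples from N(mu, Sigma) and check the empirical covariance.
samples = rng.multivariate_normal(mu, Sigma, size=10_000)
print(np.cov(samples, rowvar=False))    # approximately [[1, 0.8], [0.8, 1]]

# Positive semi-definiteness: all eigenvalues of Sigma are >= 0.
print(np.linalg.eigvalsh(Sigma))        # both eigenvalues are nonnegative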

  14. Useful Properties of Gaussians • Marginal distributions of Gaussians are Gaussian. Given $x = (x_a, x_b) \sim \mathcal{N}(\mu, \Sigma)$ with $\mu = (\mu_a, \mu_b)$ and $\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$, the marginal distribution is $x_a \sim \mathcal{N}(\mu_a, \Sigma_{aa})$.

  15. Marginal distributions of Gaussians are Gaussian

  16. Block Matrix Inversion Theorem. Definition (Schur complements): for a block matrix $M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$, the Schur complement of the block $D$ is $A - B D^{-1} C$, and the Schur complement of the block $A$ is $D - C A^{-1} B$.

  17. Useful Properties of Gaussians • Conditional distributions of Gaussians are Gaussian. Notation: partition $\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$ and its inverse $\Sigma^{-1}$ into blocks of the same shape. Conditional distribution: $x_a \mid x_b \sim \mathcal{N}\big(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\big)$.
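
Not from the slides: a small NumPy sketch of the marginalization and conditioning formulas above, using a made-up 3-D Gaussian split into a 2-D block $x_a$ and a 1-D block $x_b$ (the numbers are only for illustration).

import numpy as np

# A joint Gaussian over (x_a, x_b), partitioned into blocks.
mu = np.array([1.0, -1.0, 0.5])           # mu_a = [1, -1], mu_b = [0.5]
Sigma = np.array([[2.0, 0.3, 0.5],
                  [0.3, 1.0, 0.2],
                  [0.5, 0.2, 1.5]])
a, b = slice(0, 2), slice(2, 3)
S_aa, S_ab = Sigma[a, a], Sigma[a, b]
S_ba, S_bb = Sigma[b, a], Sigma[b, b]

# Marginal: x_a ~ N(mu_a, Sigma_aa), read directly off the blocks.
# Conditional: x_a | x_b ~ N(mu_a + S_ab S_bb^{-1} (x_b - mu_b),
#                            S_aa - S_ab S_bb^{-1} S_ba)
xb = np.array([1.2])
cond_mean = mu[a] + S_ab @ np.linalg.solve(S_bb, xb - mu[b])
cond_cov = S_aa - S_ab @ np.linalg.solve(S_bb, S_ba)
print(cond_mean, cond_cov)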

  18. Higher Dimensions • Visualizing > 3 dimensions is… difficult. • Means and marginals are practical, but then we don’t see correlations between those variables. • Marginals are Gaussian, e.g., $f(6) \sim \mathcal{N}(\mu(6), \sigma^2(6))$. [Figure: visualizing an 8-dimensional Gaussian $f$ by plotting $\mu(i)$ with an error bar of $\sigma^2(i)$ at each index $i = 1, \dots, 8$.]

  19. Yet Higher Dimensions. Why stop there? Don’t panic: it’s just a function

  20. Getting Ridiculous. Why stop there?

  21. Gaussian Process Definition: • a probability distribution indexed by an arbitrary set (integers, reals, finite-dimensional vectors, etc.) • each element gets a Gaussian distribution over the reals with mean $\mu(x)$ • these distributions are dependent/correlated as defined by $k(x, z)$ • any finite subset of indices defines a multivariate Gaussian distribution

  22. Gaussian Process • Distribution over functions…. Yayyy! If our regression model is a GP, then it won’t be a point estimate anymore: it can provide regression estimates with confidence. • The domain (index set) of the functions can be pretty much whatever: reals, real vectors, graphs, strings, sets, …

  23. Bayesian Updates for GPs • How can we do regression and learn the GP from data? • We will be Bayesians today: start with a GP prior, get some data, compute a posterior.

  24. Samples from the prior distribution. Picture is taken from Rasmussen and Williams
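
Not part of the slides (which show figures from Rasmussen and Williams): a NumPy sketch of how such prior samples can be generated, assuming a squared-exponential covariance function; on a finite grid a "sample function" is just one draw from a multivariate Gaussian.

import numpy as np

def sq_exp_kernel(X1, X2, ell=1.0, sigma_f=1.0):
    """Squared-exponential covariance k(x, x') = sigma_f^2 exp(-(x - x')^2 / (2 ell^2))."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(1)
xs = np.linspace(-5, 5, 200)                         # a fine grid of input locations
K = sq_exp_kernel(xs, xs) + 1e-8 * np.eye(len(xs))   # small jitter for numerical stability

# Zero-mean GP prior: any finite set of function values is jointly Gaussian.
prior_samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
print(prior_samples.shape)                           # (3, 200): three functions on the grid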

  25. Samples from the posterior distribution. Picture is taken from Rasmussen and Williams

  26. Prior

  27. Data

  28. Posterior

  29. Contents • Introduction • Ridge Regression • Gaussian Processes: weight space view (Bayesian ridge regression + kernel trick), function space view (prior distribution over functions + calculation of posterior distributions)

  30. Ridge Regression. Linear regression: $\min_w \sum_{i=1}^{n} (y_i - w^T x_i)^2$. Ridge regression: $\min_w \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \|w\|^2$. The Gaussian Process is a Bayesian generalization of kernelized ridge regression.
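
Not in the slides: a minimal NumPy sketch of ridge regression in its standard closed form (rows of X are examples); with lam = 0 it reduces to ordinary linear regression.

import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=1e-2))   # close to w_true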

  31. Contents • Introduction • Ridge Regression • Gaussian Processes: weight space view (Bayesian ridge regression + kernel trick), function space view (prior distribution over functions + calculation of posterior distributions)

  32. Weight Space View. GP = Bayesian ridge regression in feature space + kernel trick to carry out the computations. The training data: $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

  33. Bayesian Analysis of Linear Regression with Gaussian noise

  34. Bayesian Analysis of Linear Regression with Gaussian noise. The likelihood, assuming $y_i = x_i^T w + \varepsilon_i$ with i.i.d. noise $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$: $p(y \mid X, w) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid x_i^T w, \sigma_n^2) = \mathcal{N}(X^T w, \sigma_n^2 I)$.

  35. Bayesian Analysis of Linear Regression with Gaussian noise. The prior: $w \sim \mathcal{N}(0, \Sigma_p)$. Now, we can calculate the posterior: $p(w \mid X, y) \propto p(y \mid X, w)\, p(w)$, which is Gaussian, $w \mid X, y \sim \mathcal{N}\big(\tfrac{1}{\sigma_n^2} A^{-1} X y,\; A^{-1}\big)$ with $A = \tfrac{1}{\sigma_n^2} X X^T + \Sigma_p^{-1}$.

  36. Bayesian Analysis of Linear Regression with Gaussian noise. After “completing the square”, the posterior is seen to be Gaussian; its mean is the MAP estimate of $w$, which coincides with the ridge regression solution.
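
Not in the slides: a NumPy sketch of the posterior computation above, assuming an isotropic prior $\Sigma_p = \tau^2 I$ for simplicity. It checks that the posterior mean (the MAP estimate) equals the ridge solution with $\lambda = \sigma_n^2/\tau^2$, and that the posterior covariance does not involve $y$.

import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 50
X = rng.normal(size=(d, n))                 # inputs stored as columns, as in the slides
w_true = np.array([1.0, -2.0, 0.5])
sigma_n = 0.1
y = X.T @ w_true + sigma_n * rng.normal(size=n)

tau2 = 1.0                                  # prior: w ~ N(0, tau2 * I)  (an assumption here)
Sigma_p = tau2 * np.eye(d)

A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
w_mean = np.linalg.solve(A, X @ y) / sigma_n**2   # posterior mean = MAP estimate
w_cov = np.linalg.inv(A)                          # posterior covariance: no y anywhere

# The MAP estimate coincides with ridge regression with lambda = sigma_n^2 / tau2.
lam = sigma_n**2 / tau2
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
print(np.allclose(w_mean, w_ridge))               # True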

  37. Bayesian Analysis of Linear Regression with Gaussian noise. This posterior covariance matrix doesn’t depend on the observations y, a strange property of Gaussian Processes.

  38. Projections of Inputs into Feature Space. The reviewed Bayesian linear regression suffers from limited expressiveness. To overcome the problem, go to a feature space and do linear regression there: (a) explicit features, (b) implicit features (kernels).

  39. Explicit Features. Linear regression in the feature space: map the inputs with $\phi\colon \mathbb{R}^d \to \mathbb{R}^N$ and use the model $f(x) = \phi(x)^T w$, $w \in \mathbb{R}^N$.

  40. Explicit Features. The predictive distribution after the feature map: $f_* \mid x_*, X, y \sim \mathcal{N}\big(\tfrac{1}{\sigma_n^2}\phi(x_*)^T A^{-1} \Phi y,\; \phi(x_*)^T A^{-1} \phi(x_*)\big)$, where $\Phi = [\phi(x_1), \dots, \phi(x_n)]$ and $A = \tfrac{1}{\sigma_n^2}\Phi \Phi^T + \Sigma_p^{-1}$.

  41. Explicit Features. Shorthands: $\phi_* = \phi(x_*)$, $\Phi$ the matrix of training features, $A = \tfrac{1}{\sigma_n^2}\Phi \Phi^T + \Sigma_p^{-1}$. The predictive distribution after the feature map: $f_* \mid x_*, X, y \sim \mathcal{N}\big(\tfrac{1}{\sigma_n^2}\phi_*^T A^{-1} \Phi y,\; \phi_*^T A^{-1} \phi_*\big)$.

  42. Explicit Features. The predictive distribution after the feature map: $f_* \mid x_*, X, y \sim \mathcal{N}\big(\tfrac{1}{\sigma_n^2}\phi_*^T A^{-1} \Phi y,\; \phi_*^T A^{-1} \phi_*\big)$ (*). A problem with (*) is that it needs an NxN matrix inversion ($N$ = dimension of the feature space). Theorem: (*) can be rewritten as $\mathcal{N}\big(\phi_*^T \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} y,\; \phi_*^T \Sigma_p \phi_* - \phi_*^T \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \Phi^T \Sigma_p \phi_*\big)$, where $K = \Phi^T \Sigma_p \Phi$ is only $n \times n$.

  43. Proofs • Mean expression. We need the Lemma: $\tfrac{1}{\sigma_n^2} A^{-1} \Phi = \Sigma_p \Phi (K + \sigma_n^2 I)^{-1}$, which follows from $\tfrac{1}{\sigma_n^2}\Phi (K + \sigma_n^2 I) = A\, \Sigma_p \Phi$. • Variance expression. We need the Matrix inversion Lemma (Woodbury identity): $(Z + U W V^T)^{-1} = Z^{-1} - Z^{-1} U (W^{-1} + V^T Z^{-1} U)^{-1} V^T Z^{-1}$.
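
Not in the slides: a quick numerical check of the matrix inversion (Woodbury) lemma quoted above. The sizes are arbitrary, and V = U is chosen only so that Z + U W V^T is guaranteed to be invertible.

import numpy as np

rng = np.random.default_rng(4)
N, n = 6, 4
Z = np.diag(rng.uniform(1.0, 2.0, size=N))   # positive definite N x N
W = np.diag(rng.uniform(1.0, 2.0, size=n))   # positive definite n x n
U = rng.normal(size=(N, n))
V = U                                        # keeps Z + U W V^T positive definite

Zi, Wi = np.linalg.inv(Z), np.linalg.inv(W)
lhs = np.linalg.inv(Z + U @ W @ V.T)
rhs = Zi - Zi @ U @ np.linalg.inv(Wi + V.T @ Zi @ U) @ V.T @ Zi
print(np.allclose(lhs, rhs))                 # True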

  44. From Explicit to Implicit Features. Reminder: this was the original formulation (*).

  45. From Explicit to Implicit Features. The feature space always enters in the form of $\phi(x)^T \Sigma_p \phi(x')$ (that is, through $\Phi^T \Sigma_p \Phi$, $\phi_*^T \Sigma_p \Phi$, or $\phi_*^T \Sigma_p \phi_*$). No need to know the explicit N-dimensional features; their inner product is enough. Lemma: $k(x, x') = \phi(x)^T \Sigma_p \phi(x') = \psi(x)^T \psi(x')$ with $\psi(x) = \Sigma_p^{1/2}\phi(x)$, so $k$ is a valid kernel (an inner product of features).
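
Not from the slides: a tiny NumPy check of the statement above, with a made-up quadratic feature map phi and a diagonal $\Sigma_p$. It confirms that $k(x, z) = \phi(x)^T \Sigma_p \phi(z)$ is an ordinary inner product of rescaled features, so only inner products are ever needed.

import numpy as np

def phi(x):
    """A hypothetical explicit feature map R -> R^3 (illustration only)."""
    return np.array([1.0, x, x**2])

Sigma_p = np.diag([1.0, 0.5, 0.25])          # prior covariance of the weights
L = np.linalg.cholesky(Sigma_p)              # Sigma_p = L L^T

def k(x, z):
    """k(x, z) = phi(x)^T Sigma_p phi(z)."""
    return phi(x) @ Sigma_p @ phi(z)

x, z = 0.7, -1.3
psi_x, psi_z = L.T @ phi(x), L.T @ phi(z)    # psi(x) = Sigma_p^{1/2} phi(x) (up to rotation)
print(np.isclose(k(x, z), psi_x @ psi_z))    # True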

  46. Results

  47. Results using Netlab, sin function

  48. Results using Netlab, sin function, increased # of training points

  49. Results using Netlab, sin function, increased noise

  50. Results using Netlab, sinc function

  51. Thanks for the Attention!

  52. Extra Material

  53. Contents • Introduction • Ridge Regression • Gaussian Processes: weight space view (Bayesian ridge regression + kernel trick), function space view (prior distribution over functions + calculation of posterior distributions)

  54. Function Space View • An alternative way to get the previous results • Inference directly in function space. Definition (Gaussian Process): a GP is a collection of random variables such that any finite number of them have a joint Gaussian distribution.

  55. Function Space View. Notations: the mean function $m(x) = \mathbb{E}[f(x)]$ and the covariance function $k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]$.

  56. Function Space View. Gaussian Processes: we write $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$; any finite collection $(f(x_1), \dots, f(x_n))$ is jointly Gaussian with mean vector $(m(x_i))_i$ and covariance matrix $(k(x_i, x_j))_{i,j}$.

  57. Function Space View. Bayesian linear regression is an example of a GP: for $f(x) = \phi(x)^T w$ with $w \sim \mathcal{N}(0, \Sigma_p)$ we get $\mathbb{E}[f(x)] = 0$ and $\mathbb{E}[f(x) f(x')] = \phi(x)^T \Sigma_p \phi(x')$.

  58. Function Space View. Special case

  59. Function Space View. Picture is taken from Rasmussen and Williams

  60. Function Space View. Observation and explanation

  61. Prediction with noise-free observations. Noise-free observations: the training data are $\{(x_i, f_i)\}_{i=1}^{n}$ with $f_i = f(x_i)$ observed without noise.

  62. Prediction with noise-free observations. Goal: given the training data $(X, \mathbf{f})$, compute the predictive distribution of $\mathbf{f}_*$ at the test inputs $X_*$.

  63. Prediction with noise-free observations. Lemma: $\mathbf{f}_* \mid X_*, X, \mathbf{f} \sim \mathcal{N}\big(K(X_*, X) K(X, X)^{-1} \mathbf{f},\; K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*)\big)$. Proof: a bit of calculation using the joint (n+m)-dimensional density. Remarks: the posterior mean interpolates the observations, and the posterior variance is zero at the training inputs.
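
Not in the slides: a NumPy sketch of the noise-free prediction lemma above, assuming a squared-exponential kernel. It illustrates the remark: the posterior mean reproduces the observation at a training input and the posterior variance there is (numerically) zero.

import numpy as np

def sq_exp(A, B, ell=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ell**2)

X = np.array([-2.0, 0.0, 1.5])        # training inputs, observed without noise
f = np.sin(X)
Xs = np.array([-2.0, 0.5, 3.0])       # test inputs; the first one is a training input

K, Ks, Kss = sq_exp(X, X), sq_exp(X, Xs), sq_exp(Xs, Xs)
jitter = 1e-10 * np.eye(len(X))       # tiny jitter only, no observation noise
mean = Ks.T @ np.linalg.solve(K + jitter, f)
cov = Kss - Ks.T @ np.linalg.solve(K + jitter, Ks)

print(mean[0], f[0])                  # posterior mean interpolates the observation at x = -2
print(np.diag(cov))                   # ~0 variance at the training input, larger far away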

  64. Prediction with noise-free observations. Picture is taken from Rasmussen and Williams

  65. Prediction using noisy observations. With $y_i = f(x_i) + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, the joint distribution: $\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\Big(\mathbf{0},\; \begin{pmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix}\Big)$.

  66. Prediction using noisy observations. The posterior for the noisy observations: $\mathbf{f}_* \mid X_*, X, \mathbf{y} \sim \mathcal{N}\big(\bar{\mathbf{f}}_*, \operatorname{cov}(\mathbf{f}_*)\big)$, where $\bar{\mathbf{f}}_* = K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\mathbf{y}$ and $\operatorname{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} K(X, X_*)$. In the weight space view we had the same expressions with $k(x, x') = \phi(x)^T \Sigma_p \phi(x')$.

  67. Prediction using noisy observations. Short notations for a single test point $x_*$: $K = K(X, X)$, $\mathbf{k}_* = k(X, x_*)$; then $\bar{f}_* = \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1}\mathbf{y}$ and $\mathbb{V}[f_*] = k(x_*, x_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1}\mathbf{k}_*$.

  68. Prediction using noisy observations. Two ways to look at the predictive mean: • it is a linear predictor, i.e., a linear combination of the observations $\mathbf{y}$; • it is a linear combination of kernel functions centered at the training points, $\bar{f}(x_*) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x_*)$ with $\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1}\mathbf{y}$, a manifestation of the Representer Theorem. A numerical check of this equivalence is sketched below.
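
Not in the slides: a small NumPy check of the two views above on made-up data with a squared-exponential kernel; the "linear in y" form and the "sum of kernels weighted by alpha" form give the same predictive mean.

import numpy as np

def sq_exp(A, B, ell=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ell**2)

rng = np.random.default_rng(6)
X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.1 * rng.normal(size=X.size)
sigma_n2 = 0.1**2
x_star = 0.42

K = sq_exp(X, X)
k_star = sq_exp(X, np.array([x_star]))[:, 0]
alpha = np.linalg.solve(K + sigma_n2 * np.eye(X.size), y)

mean_linear_in_y = k_star @ np.linalg.solve(K + sigma_n2 * np.eye(X.size), y)
mean_sum_of_kernels = np.sum(alpha * k_star)
print(np.isclose(mean_linear_in_y, mean_sum_of_kernels))   # True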

  69. Prediction using noisy observations. Remarks

  70. GP pseudo code. Inputs: training inputs $X$, targets $\mathbf{y}$, covariance function $k$, noise level $\sigma_n^2$, and test input(s) $x_*$.

  71. GP pseudo code (continued). Outputs: the predictive mean $\bar{f}_*$, the predictive variance $\mathbb{V}[f_*]$, and the log marginal likelihood $\log p(\mathbf{y} \mid X)$.
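
The slides' own pseudocode is not reproduced in this transcript; the following is a sketch of the standard Cholesky-based GP regression procedure (in the spirit of Rasmussen and Williams, Algorithm 2.1) that matches the inputs and outputs listed above. The squared-exponential kernel and the toy data are assumptions made here for the demo.

import numpy as np

def gp_predict(X, y, kernel, sigma_n2, x_star):
    """GP regression via a Cholesky factorization (more stable than inverting
    K + sigma_n^2 I directly). Returns predictive mean, predictive variance,
    and the log marginal likelihood of the training targets."""
    n = len(X)
    K = kernel(X, X)
    L = np.linalg.cholesky(K + sigma_n2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # (K + sigma_n^2 I)^{-1} y

    k_star = kernel(X, x_star)                               # shape (n, m)
    mean = k_star.T @ alpha
    v = np.linalg.solve(L, k_star)
    var = np.diag(kernel(x_star, x_star)) - np.sum(v * v, axis=0)

    log_ml = (-0.5 * y @ alpha
              - np.sum(np.log(np.diag(L)))
              - 0.5 * n * np.log(2 * np.pi))
    return mean, var, log_ml

def sq_exp(A, B, ell=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ell**2)

rng = np.random.default_rng(7)
X = np.linspace(-3, 3, 30)
y = np.sin(X) + 0.1 * rng.normal(size=X.size)
mean, var, log_ml = gp_predict(X, y, sq_exp, 0.1**2, np.array([0.0, 2.5]))
print(mean, var, log_ml)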

  72. Thanks for the Attention!
