Advanced Introduction to Machine Learning CMU-10715: Gaussian Processes


SLIDE 1

Advanced Introduction to Machine Learning CMU-10715

Gaussian Processes

Barnabás Póczos

SLIDE 2

Introduction

SLIDE 3

http://www.gaussianprocess.org/ Some of the slides in this introduction are taken from D. Lizotte, R. Parr, and C. Guestrin.

SLIDE 4

Contents

  • Introduction
    • Regression
    • Properties of Multivariate Gaussian distributions
  • Ridge Regression
  • Gaussian Processes
    • Weight space view: Bayesian ridge regression + kernel trick
    • Function space view: prior distribution over functions + calculation of posterior distributions

SLIDE 5

Regression

SLIDE 6

Why GPs for Regression?

Regression methods: linear regression, ridge regression, support vector regression, kNN regression, etc.

Motivation 1: All of the above regression methods give point estimates. We would like a method that also provides a confidence estimate along with the prediction.

Motivation 2: Let us kernelize linear ridge regression and see what we get…

SLIDE 7

Why GPs for Regression?

GPs can answer the following questions:

  • Here’s where the function will most likely be (the expected function).
  • Here are some examples of what it might look like (sampling from the posterior distribution).
  • Here is a prediction of what you’ll see if you evaluate your function at x’, with confidence.

SLIDE 8

Properties of Multivariate Gaussian Distributions

SLIDE 9

1D Gaussian Distribution

Parameters:

  • Mean, μ
  • Variance, σ²
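For reference, the density these two parameters define (the standard form of the 1D Gaussian, stated here since the slide's formula does not survive in the text):

$$\mathcal{N}(x;\, \mu, \sigma^2) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)$$
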
SLIDE 10

Multivariate Gaussian

SLIDE 11

Multivariate Gaussian

A 2-dimensional Gaussian is defined by

  • a mean vector $\mu = [\mu_1, \mu_2]$
  • a covariance matrix

$$\Sigma = \begin{pmatrix} \sigma^2_{1,1} & \sigma^2_{1,2} \\ \sigma^2_{2,1} & \sigma^2_{2,2} \end{pmatrix}, \qquad \sigma^2_{i,j} = E\big[(x_i - \mu_i)(x_j - \mu_j)\big] \text{ is the (co)variance.}$$

Note: $\Sigma$ is symmetric and positive semi-definite: $\forall x:\; x^T \Sigma x \ge 0$.
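A minimal NumPy sketch (not part of the slides) that samples from a 2-D Gaussian like the one above and checks the empirical mean and covariance; the concrete numbers are placeholder assumptions:

    # Sample from a 2-D Gaussian and verify the empirical mean/covariance.
    import numpy as np

    mu = np.array([0.0, 0.0])                      # mean vector
    Sigma = np.array([[1.0, 0.8],
                      [0.8, 1.0]])                 # symmetric, positive semi-definite

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mu, Sigma, size=100_000)

    print(X.mean(axis=0))                          # close to mu
    print(np.cov(X, rowvar=False))                 # close to Sigma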

SLIDE 12

Multivariate Gaussian examples

 = (0,0)

        1 8 . 8 . 1

SLIDE 13

Multivariate Gaussian examples

 = (0,0)

        1 8 . 8 . 1

SLIDE 14

Useful Properties of Gaussians

Marginal distributions of Gaussians are Gaussian.

Given:

$$x = \begin{pmatrix} x_a \\ x_b \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \right)$$

Marginal distribution:

$$x_a \sim \mathcal{N}(\mu_a, \Sigma_{aa})$$

SLIDE 15

Marginal distributions of Gaussians are Gaussian

SLIDE 16

Block Matrix Inversion

Definition: Schur complement. Theorem: block matrix inversion.
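The slide's equations do not survive in the text; the standard statements they presumably correspond to are:

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad M/D := A - B D^{-1} C \quad \text{(Schur complement of } D \text{ in } M\text{)}$$

$$M^{-1} = \begin{pmatrix} (M/D)^{-1} & -(M/D)^{-1} B D^{-1} \\ -D^{-1} C (M/D)^{-1} & D^{-1} + D^{-1} C (M/D)^{-1} B D^{-1} \end{pmatrix}$$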

SLIDE 17

Useful Properties of Gaussians

Conditional distributions of Gaussians are Gaussian.

Notation:

$$\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}, \qquad \Sigma^{-1} = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}$$

Conditional distribution:

$$x_a \mid x_b \;\sim\; \mathcal{N}\big(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\;\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\big)$$
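A small NumPy sketch (not part of the slides) that evaluates the conditioning formula above on a concrete Gaussian; the 3-D example and the partition are placeholder assumptions:

    # Conditional mean and covariance of x_a given x_b for a partitioned Gaussian.
    import numpy as np

    mu = np.array([0.0, 1.0, -1.0])
    Sigma = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.5, 0.4],
                      [0.3, 0.4, 1.0]])

    a, b = [0], [1, 2]                 # x_a is the first coordinate, x_b the rest
    x_b = np.array([1.5, -0.5])        # observed value of x_b

    S_aa = Sigma[np.ix_(a, a)]
    S_ab = Sigma[np.ix_(a, b)]
    S_bb = Sigma[np.ix_(b, b)]

    mean_cond = mu[a] + S_ab @ np.linalg.solve(S_bb, x_b - mu[b])
    cov_cond  = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)

    print(mean_cond, cov_cond)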

SLIDE 18

Higher Dimensions

Visualizing an 8-dimensional Gaussian f:

  • Visualizing more than 3 dimensions is… difficult.
  • Means and marginals are practical, but then we don’t see correlations between the variables.
  • Marginals are Gaussian, e.g., f(6) ~ N(µ(6), σ²(6)).

(Figure: the components f(1), …, f(8) plotted as means µ(i) with variance bars σ²(i).)

SLIDE 19

Yet Higher Dimensions

Why stop there? Don’t panic: It’s just a function

SLIDE 20

Getting Ridiculous

Why stop there?

SLIDE 21

Gaussian Process

Definition:

  • A probability distribution indexed by an arbitrary set (integers, reals, finite-dimensional vectors, etc.)
  • Each element of the index set gets a Gaussian distribution over the reals with mean µ(x)
  • These distributions are dependent/correlated, as defined by k(x, z)
  • Any finite subset of indices defines a multivariate Gaussian distribution
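A minimal sketch (not part of the slides) of the last point: fix a finite set of indices and draw sample functions from a zero-mean GP prior. The squared-exponential kernel and its length scale are assumptions for illustration:

    # Draw sample paths from a zero-mean GP prior on a finite grid of indices.
    import numpy as np

    def k(x, z, length_scale=1.0):
        # squared-exponential covariance k(x, z) = exp(-(x - z)^2 / (2 l^2))
        return np.exp(-(x[:, None] - z[None, :])**2 / (2.0 * length_scale**2))

    xs = np.linspace(-5.0, 5.0, 200)           # a finite subset of indices
    K = k(xs, xs) + 1e-10 * np.eye(len(xs))    # jitter for numerical stability

    rng = np.random.default_rng(0)
    samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
    print(samples.shape)                       # 3 sample functions evaluated on xs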

SLIDE 22

Gaussian Process

Distribution over functions….

Yayyy! If our regression model is a GP, then it won’t give just a point estimate anymore: it can provide regression estimates with confidence.

The domain (index set) of the functions can be pretty much anything:

  • Reals
  • Real vectors
  • Graphs
  • Strings
  • Sets
SLIDE 23

Bayesian Updates for GPs

  • How can we do regression and learn the GP from data?
  • We will be Bayesians today:
    • Start with a GP prior
    • Get some data
    • Compute a posterior
SLIDE 24

Samples from the prior distribution

Picture is taken from Rasmussen and Williams

SLIDE 25

Samples from the posterior distribution

Picture is taken from Rasmussen and Williams

SLIDE 26

Prior

SLIDE 27

Data

SLIDE 28

Posterior

SLIDE 29

Contents

  • Introduction
  • Ridge Regression
  • Gaussian Processes
    • Weight space view: Bayesian ridge regression + kernel trick
    • Function space view: prior distribution over functions + calculation of posterior distributions

SLIDE 30

Ridge Regression

Linear regression:
Ridge regression:

The Gaussian process is a Bayesian generalization of kernelized ridge regression.
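The objectives themselves do not survive in the text; their standard forms are:

$$\text{Linear regression:}\quad \hat w = \arg\min_w \sum_{i=1}^{n} \big(y_i - x_i^T w\big)^2$$

$$\text{Ridge regression:}\quad \hat w = \arg\min_w \sum_{i=1}^{n} \big(y_i - x_i^T w\big)^2 + \lambda \|w\|^2$$
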
SLIDE 31

Contents

  • Introduction
  • Ridge Regression
  • Gaussian Processes
    • Weight space view: Bayesian ridge regression + kernel trick
    • Function space view: prior distribution over functions + calculation of posterior distributions

SLIDE 32

Weight Space View

GP = Bayesian ridge regression in feature space + Kernel trick to carry out computations

The training data
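In standard notation (the slide's own formula does not survive in the text), the training data are

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}, \qquad x_i \in \mathbb{R}^d, \; y_i \in \mathbb{R}.$$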

SLIDE 33

Bayesian Analysis of Linear Regression with Gaussian noise

SLIDE 34

Bayesian Analysis of Linear Regression with Gaussian noise

The likelihood:
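Assuming the model $y_i = x_i^T w + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$ (the notation of Rasmussen and Williams, where the matrix $X$ collects the inputs as columns), the likelihood is

$$p(y \mid X, w) = \prod_{i=1}^{n} \mathcal{N}\big(y_i;\, x_i^T w,\, \sigma_n^2\big) = \mathcal{N}\big(X^T w,\, \sigma_n^2 I\big).$$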

SLIDE 35

Bayesian Analysis of Linear Regression with Gaussian noise

The prior. Now we can calculate the posterior:
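The formulas do not survive in the text; with a Gaussian prior on the weights, the standard result (Rasmussen and Williams, Ch. 2) is

$$w \sim \mathcal{N}(0, \Sigma_p), \qquad p(w \mid X, y) = \mathcal{N}\Big(\bar w = \tfrac{1}{\sigma_n^2} A^{-1} X y,\; A^{-1}\Big), \qquad A = \sigma_n^{-2} X X^T + \Sigma_p^{-1}.$$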

SLIDE 36

Bayesian Analysis of Linear Regression with Gaussian noise

After “completing the square”, the MAP estimate of the weights coincides with the ridge regression solution.

SLIDE 37

Bayesian Analysis of Linear Regression with Gaussian noise

Note that this posterior covariance matrix does not depend on the observations y, a strange property of Gaussian processes.

SLIDE 38

Projections of Inputs into Feature Space

The Bayesian linear regression reviewed above suffers from limited expressiveness. To overcome this problem, go to a feature space and do linear regression there, using either (a) explicit features or (b) implicit features (kernels).

SLIDE 39

Explicit Features

Linear regression in the feature space

SLIDE 40

Explicit Features

The predictive distribution after feature map:

SLIDE 41

Explicit Features

Shorthands: The predictive distribution after feature map:

SLIDE 42

Explicit Features

The predictive distribution after the feature map is given by (*) below. A problem with (*) is that it needs an N×N matrix inversion, where N is the dimension of the feature space.

Theorem: (*) can be rewritten so that only an n×n matrix inversion is needed (see below).
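The formulas themselves do not survive in the text; in the notation of Rasmussen and Williams, with $\Phi = [\phi(x_1), \dots, \phi(x_n)]$ and $\phi_* = \phi(x_*)$, (*) and its rewritten form are presumably

$$(*)\quad p(f_* \mid x_*, X, y) = \mathcal{N}\Big(\tfrac{1}{\sigma_n^2}\, \phi_*^T A^{-1} \Phi\, y,\;\; \phi_*^T A^{-1} \phi_*\Big), \qquad A = \sigma_n^{-2} \Phi \Phi^T + \Sigma_p^{-1},$$

$$p(f_* \mid x_*, X, y) = \mathcal{N}\Big(\phi_*^T \Sigma_p \Phi\, (K + \sigma_n^2 I)^{-1} y,\;\; \phi_*^T \Sigma_p \phi_* - \phi_*^T \Sigma_p \Phi\, (K + \sigma_n^2 I)^{-1} \Phi^T \Sigma_p \phi_*\Big), \qquad K = \Phi^T \Sigma_p \Phi,$$

which only requires inverting the n×n matrix $K + \sigma_n^2 I$.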

SLIDE 43

Proofs

  • Mean expression. We need:
  • Variance expression. We need:

Matrix inversion lemma:
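The identity referred to here is the standard matrix inversion lemma (stated here since the slide's formula does not survive in the text):

$$(Z + U W V^T)^{-1} = Z^{-1} - Z^{-1} U \big(W^{-1} + V^T Z^{-1} U\big)^{-1} V^T Z^{-1}$$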

SLIDE 44

From Explicit to Implicit Features

Reminder: This was the original formulation:

SLIDE 45

From Explicit to Implicit Features

The feature space always enters in the form of:

Lemma:

No need to know the explicit N dimensional features. Their inner product is enough.
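The form in question is presumably $\phi(x)^T \Sigma_p\, \phi(x')$, which motivates defining the kernel

$$k(x, x') := \phi(x)^T \Sigma_p\, \phi(x') = \psi(x)^T \psi(x'), \qquad \psi(x) := \Sigma_p^{1/2} \phi(x).$$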

SLIDE 46

Results

SLIDE 47

Results using Netlab, Sin function

SLIDE 48

Results using Netlab, Sin function

Increased # of training points

SLIDE 49

Results using Netlab, Sin function

Increased noise

SLIDE 50

Results using Netlab, Sinc function

SLIDE 51

Thanks for the Attention! 

SLIDE 52

Extra Material

SLIDE 53

Contents

  • Introduction
  • Ridge Regression
  • Gaussian Processes
    • Weight space view: Bayesian ridge regression + kernel trick
    • Function space view: prior distribution over functions + calculation of posterior distributions

SLIDE 54

Function Space View

An alternative way to get the previous results: inference directly in function space.

Definition (Gaussian process): A GP is a collection of random variables such that any finite number of them have a joint Gaussian distribution.

SLIDE 55

Function Space View

Notations:
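The notation itself does not survive in the text; the standard GP notation (following Rasmussen and Williams) is

$$f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big), \qquad m(x) = E[f(x)], \qquad k(x, x') = E\big[(f(x) - m(x))(f(x') - m(x'))\big].$$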

SLIDE 56

Function Space View

Gaussian Processes:

SLIDE 57

Function Space View

Bayesian linear regression is an example of a GP.
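The computation does not survive in the text; the standard argument is that for $f(x) = \phi(x)^T w$ with $w \sim \mathcal{N}(0, \Sigma_p)$,

$$E[f(x)] = 0, \qquad E[f(x) f(x')] = \phi(x)^T \Sigma_p\, \phi(x'),$$

so any finite collection of values $f(x_1), \dots, f(x_n)$ is jointly Gaussian, i.e. $f$ is a GP.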

SLIDE 58

Function Space View

Special case

SLIDE 59

Function Space View

Picture is taken from Rasmussen and Williams

SLIDE 60

Function Space View

Observation Explanation

SLIDE 61

Prediction with noise free observations


SLIDE 62

Prediction with noise free observations

Goal:

SLIDE 63

Prediction with noise free observations

Lemma: (stated below)

Proof: a bit of calculation using the joint (n+m)-dimensional density.

Remarks:
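The lemma itself does not survive in the text; the standard noise-free GP prediction formula it presumably states is

$$f_* \mid X_*, X, f \;\sim\; \mathcal{N}\Big(K(X_*, X)\, K(X, X)^{-1} f,\;\; K(X_*, X_*) - K(X_*, X)\, K(X, X)^{-1} K(X, X_*)\Big).$$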

SLIDE 64

Prediction with noise free observations

Picture is taken from Rasmussen and Williams

SLIDE 65

Prediction using noisy observations

The joint distribution:
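The formula does not survive in the text; assuming noisy observations $y = f(X) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$ and a zero-mean GP prior, the joint distribution is

$$\begin{pmatrix} y \\ f_* \end{pmatrix} \sim \mathcal{N}\left( 0,\; \begin{pmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix} \right).$$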

SLIDE 66

Prediction using noisy observations

The posterior for the noisy observations is given below. Compare it with the result we obtained in the weight space view.
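Conditioning the joint distribution above gives the standard result (the slide's own formulas do not survive in the text):

$$f_* \mid X, y, X_* \sim \mathcal{N}\big(\bar f_*, \operatorname{cov}(f_*)\big),$$

$$\bar f_* = K(X_*, X)\, \big[K(X, X) + \sigma_n^2 I\big]^{-1} y, \qquad \operatorname{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\, \big[K(X, X) + \sigma_n^2 I\big]^{-1} K(X, X_*).$$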

SLIDE 67

Prediction using noisy observations

Short notations:

SLIDE 68

Prediction using noisy observations

Two ways to look at it:

  • Linear predictor
  • Manifestation of the Representer Theorem
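In formulas, the "linear predictor" view of the predictive mean reads (standard form, not preserved in the text above):

$$\bar f(x_*) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x_*), \qquad \alpha = \big(K + \sigma_n^2 I\big)^{-1} y.$$
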
SLIDE 69

Prediction using noisy observations

Remarks:

SLIDE 70

GP pseudo code

Inputs:

SLIDE 71

GP pseudo code (continued)

Outputs:
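The pseudo code itself does not survive in the text; the sketch below follows the standard Cholesky-based GP regression algorithm (Rasmussen and Williams, Algorithm 2.1), which these two slides presumably describe. The RBF kernel, noise level, and toy data are assumptions for illustration.

    # GP regression with noisy observations (Cholesky-based, after
    # Rasmussen & Williams, Algorithm 2.1). Kernel, data and noise are assumptions.
    import numpy as np

    def rbf(A, B, length_scale=1.0):
        d2 = (A[:, None] - B[None, :])**2
        return np.exp(-d2 / (2.0 * length_scale**2))

    def gp_regression(X, y, X_star, noise_var=0.1):
        K = rbf(X, X)
        L = np.linalg.cholesky(K + noise_var * np.eye(len(X)))
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))        # (K + sigma_n^2 I)^{-1} y

        K_s = rbf(X, X_star)
        mean = K_s.T @ alpha                                        # predictive mean
        v = np.linalg.solve(L, K_s)
        var = np.diag(rbf(X_star, X_star)) - np.sum(v**2, axis=0)   # predictive variance

        # log marginal likelihood (useful for hyperparameter selection)
        lml = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(X) * np.log(2 * np.pi)
        return mean, var, lml

    # toy usage
    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(-3, 3, size=20))
    y = np.sin(X) + 0.1 * rng.standard_normal(20)
    mean, var, lml = gp_regression(X, y, np.linspace(-3, 3, 100))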

SLIDE 72

Thanks for the Attention! 