SLIDE 1

Introduction to Gaussian Processes

Neil D. Lawrence GPMC 6th February 2017

SLIDE 2

Book

Rasmussen and Williams (2006)

SLIDES 3–4

Outline

◮ The Gaussian Density
◮ Covariance from Basis Functions

SLIDE 5

The Gaussian Density

◮ Perhaps the most common probability density:

p(y|µ, σ²) = 1/√(2πσ²) exp( −(y − µ)²/(2σ²) ) = N(y|µ, σ²)

◮ The Gaussian density.
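As a quick sanity check, a minimal numpy sketch of this density (the helper gaussian_pdf is our own, not from the slides):

    import numpy as np

    def gaussian_pdf(y, mu, sigma2):
        """Evaluate N(y | mu, sigma2) for scalar or array y."""
        return np.exp(-(y - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    # Height example from the next slide: mu = 1.7, sigma2 = 0.0225.
    print(gaussian_pdf(1.7, 1.7, 0.0225))  # density at the mean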

SLIDE 6

Gaussian Density

[Figure: plot of p(h|µ, σ²) against h, height/m]

The Gaussian PDF with µ = 1.7 and variance σ² = 0.0225. Mean shown as a red line. It could represent the heights of a population of students.

SLIDE 7

Gaussian Density

N(y|µ, σ²) = 1/√(2πσ²) exp( −(y − µ)²/(2σ²) )

◮ σ² is the variance of the density and µ is the mean.

SLIDES 8–11

Two Important Gaussian Properties

Sum of Gaussians

◮ Sum of Gaussian variables is also Gaussian:

yi ∼ N(µi, σi²)

◮ And the sum is distributed as

∑_{i=1}^n yi ∼ N( ∑_{i=1}^n µi, ∑_{i=1}^n σi² )

(Aside: as the number of terms increases, the sum of non-Gaussian, finite-variance variables also tends to a Gaussian [central limit theorem].)

SLIDES 12–14

Two Important Gaussian Properties

Scaling a Gaussian

◮ Scaling a Gaussian leads to a Gaussian:

y ∼ N(µ, σ²)

◮ And the scaled density is distributed as

wy ∼ N(wµ, w²σ²)
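A small numpy sketch checking both properties empirically (all the example numbers are our own):

    import numpy as np

    rng = np.random.default_rng(0)
    mus, sigma2s = np.array([1.0, -2.0, 0.5]), np.array([0.3, 1.5, 0.8])

    # Sum of Gaussians: means and variances should add.
    samples = rng.normal(mus, np.sqrt(sigma2s), size=(100000, 3)).sum(axis=1)
    print(samples.mean(), samples.var())   # close to -0.5 and 2.6

    # Scaling: wy ~ N(w*mu, w^2 * sigma^2).
    w, y = 3.0, rng.normal(1.0, np.sqrt(0.3), size=100000)
    print((w * y).mean(), (w * y).var())   # close to 3.0 and 2.7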
SLIDE 15

Linear Function

[Figure: data points and best fit line, y against x]

A linear regression between x and y.

SLIDE 16

Regression Examples

◮ Predict a real value, yi, given some inputs xi.
◮ Predict quality of meat given spectral measurements (Tecator data).
◮ Radiocarbon dating, the C14 calibration curve: predict age given quantity of C14 isotope.
◮ Predict quality of different Go or Backgammon moves given expert rated training data.

SLIDES 17–24

y = mx + c

[Figure: the line y = mx + c plotted against data, y against x, with the intercept c and slope m annotated]

SLIDE 25

y = mx + c

point 1: x = 1, y = 3   ⇒  3 = m + c
point 2: x = 3, y = 1   ⇒  1 = 3m + c
point 3: x = 2, y = 2.5 ⇒  2.5 = 2m + c
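Three equations in two unknowns: the system is overdetermined, and no single (m, c) satisfies all of them. A least-squares sketch in numpy (our own illustration, not from the slides):

    import numpy as np

    # Rows are the three points: [x_i, 1] @ [m, c] = y_i
    A = np.array([[1.0, 1.0], [3.0, 1.0], [2.0, 1.0]])
    y = np.array([3.0, 1.0, 2.5])
    (m, c), *_ = np.linalg.lstsq(A, y, rcond=None)
    print(m, c)  # best-fit slope and intercept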

SLIDES 26–29

A PHILOSOPHICAL ESSAY ON PROBABILITIES.

"The day will come when, by study pursued through several ages, the things now concealed will appear with evidence; and posterity will be astonished that truths so clear had escaped us."

Clairaut then undertook to submit to analysis the perturbations which the comet had experienced by the action of the two great planets, Jupiter and Saturn; after immense calculations he fixed its next passage at the perihelion toward the beginning of April, 1759, which was actually verified by observation. The regularity which astronomy shows us in the movements of the comets doubtless exists also in all phenomena. The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance. Probability is relative, in part to this ignorance, in part to our knowledge. We know that of three or a greater number of events a single one ought to occur; but nothing induces us to believe that one of them will occur rather than the others. In this state of indecision it is impossible for us to announce their occurrence with certainty. It is, however, probable that one of these events, chosen at will, will not occur because we see several cases equally possible which exclude its occurrence, while only a single one favors it.

The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought. The ratio of …
SLIDE 30

y = mx + c + ǫ

point 1: x = 1, y = 3   ⇒  3 = m + c + ǫ1
point 2: x = 3, y = 1   ⇒  1 = 3m + c + ǫ2
point 3: x = 2, y = 2.5 ⇒  2.5 = 2m + c + ǫ3

SLIDES 31–41

Underdetermined System

What about two unknowns and one observation?

y1 = mx1 + c

[Figure: a single data point with candidate lines, y against x]

Can compute m given c:

m = (y1 − c)/x1

Sampling different values of c gives different solutions, e.g.:

c = 1.75 ⇒ m = 1.25
c = −0.777 ⇒ m = 3.78
c = −4.01 ⇒ m = 7.01
c = −0.718 ⇒ m = 3.72
c = 2.45 ⇒ m = 0.545
c = −0.657 ⇒ m = 3.66
c = −3.13 ⇒ m = 6.13
c = −1.47 ⇒ m = 4.47

Assume c ∼ N(0, 4); we find a distribution of solutions.
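A sketch of that sampling procedure, assuming the single observation is x1 = 1, y1 = 3 (matching point 1 earlier; the rest is our own illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    x1, y1 = 1.0, 3.0            # assumed single observation
    c = rng.normal(0.0, 2.0, 5)  # c ~ N(0, 4): standard deviation 2
    m = (y1 - c) / x1            # one slope per sampled intercept
    print(np.c_[c, m])           # a distribution of (c, m) solutions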

SLIDE 42

Probability for Under- and Overdetermined

◮ To deal with the overdetermined system we introduced a probability distribution for the 'variable', ǫi.
◮ For the underdetermined system we introduced a probability distribution for the 'parameter', c.
◮ This is known as a Bayesian treatment.

SLIDES 43–44

Multivariate Prior Distributions

◮ For general Bayesian inference we need multivariate priors.
◮ E.g. for multivariate linear regression:

yi = ∑_j wj xi,j + ǫi = w⊤xi,: + ǫi

(where we've dropped c for convenience), we need a prior over w.
◮ This motivates a multivariate Gaussian density.
◮ We will use the multivariate Gaussian to put a prior directly on the function (a Gaussian process).
SLIDE 45

Prior Distribution

◮ Bayesian inference requires a prior on the parameters.
◮ The prior represents your belief, before you see the data, about the likely value of the parameters.
◮ For linear regression, consider a Gaussian prior on the intercept:

c ∼ N(0, α1)

SLIDE 46

Posterior Distribution

◮ The posterior distribution is found by combining the prior with the likelihood.
◮ The posterior distribution is your belief, after you see the data, about the likely value of the parameters.
◮ The posterior is found through Bayes' Rule:

p(c|y) = p(y|c)p(c) / p(y)

SLIDES 47–49

Bayes Update

p(c) = N(c|0, α1)

p(y|m, c, x, σ²) = N(y|mx + c, σ²)

p(c|y, m, x, σ²) = N( c | (y − mx)/(1 + σ²/α1), (σ⁻² + α1⁻¹)⁻¹ )

Figure: A Gaussian prior combines with a Gaussian likelihood for a Gaussian posterior.
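A minimal sketch of this update for a single observation, in precision form (all the example values are our own):

    import numpy as np

    alpha1, sigma2 = 1.0, 0.1    # prior variance on c, noise variance
    m, x, y = 1.0, 2.0, 2.5      # assumed slope and one data point

    post_var = 1.0 / (1.0 / sigma2 + 1.0 / alpha1)    # (sigma^-2 + alpha1^-1)^-1
    post_mean = (y - m * x) / (1.0 + sigma2 / alpha1)
    print(post_mean, post_var)   # posterior N(c | post_mean, post_var)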

SLIDE 50

Stages to Derivation of the Posterior

◮ Multiply likelihood by prior:
◮ they are "exponentiated quadratics", so the answer is always also an exponentiated quadratic, because exp(a²) exp(b²) = exp(a² + b²).
◮ Complete the square to get the resulting density in the form of a Gaussian.
◮ Recognise the mean and (co)variance of the Gaussian. This is the estimate of the posterior.

SLIDES 51–53

Multivariate Regression Likelihood

◮ Noise corrupted data point:

yi = w⊤xi,: + ǫi

◮ Multivariate regression likelihood:

p(y|X, w) = 1/(2πσ²)^{n/2} exp( −1/(2σ²) ∑_{i=1}^n (yi − w⊤xi,:)² )

◮ Now use a multivariate Gaussian prior:

p(w) = 1/(2πα)^{p/2} exp( −1/(2α) w⊤w )
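A sketch evaluating the log of this likelihood and prior on synthetic data (X, y and all settings are our own):

    import numpy as np

    rng = np.random.default_rng(2)
    n, p, sigma2, alpha = 20, 3, 0.1, 1.0
    X = rng.normal(size=(n, p))
    w = rng.normal(0.0, np.sqrt(alpha), size=p)
    y = X @ w + rng.normal(0.0, np.sqrt(sigma2), size=n)

    log_lik = -0.5 * n * np.log(2 * np.pi * sigma2) \
              - 0.5 * np.sum((y - X @ w)**2) / sigma2
    log_prior = -0.5 * p * np.log(2 * np.pi * alpha) - 0.5 * w @ w / alpha
    print(log_lik, log_prior)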

SLIDE 54

Two Dimensional Gaussian

◮ Consider height, h/m, and weight, w/kg.
◮ Could sample height from a distribution:

h ∼ N(1.7, 0.0225)

◮ And similarly weight:

w ∼ N(75, 36)

SLIDE 55

Height and Weight Models

[Figure: p(h) against h/m and p(w) against w/kg]

Gaussian distributions for height and weight.

SLIDES 56–78

Sampling Two Dimensional Variables

[Figure: joint distribution of w/kg against h/m, with marginal distributions p(h) and p(w); samples of height and weight]

SLIDE 79

Independence Assumption

◮ This assumes height and weight are independent:

p(h, w) = p(h)p(w)

◮ In reality they are dependent (consider the body mass index, w/h²).

SLIDES 80–102

Sampling Two Dimensional Variables

[Figure: joint distribution of w/kg against h/m, with marginal distributions p(h) and p(w)]

SLIDES 103–106

Independent Gaussians

p(w, h) = p(w)p(h)

p(w, h) = 1/√(2πσ1²) · 1/√(2πσ2²) exp( −1/2 [ (w − µ1)²/σ1² + (h − µ2)²/σ2² ] )

In matrix form, with y = [w h]⊤, µ = [µ1 µ2]⊤ and D = diag(σ1², σ2²), this is

p(y) = 1/|2πD|^{1/2} exp( −1/2 (y − µ)⊤D⁻¹(y − µ) )

SLIDES 107–110

Correlated Gaussian

Form a correlated density from the original by rotating the data space using a matrix R:

p(y) = 1/|2πD|^{1/2} exp( −1/2 (R⊤y − R⊤µ)⊤D⁻¹(R⊤y − R⊤µ) )

= 1/|2πD|^{1/2} exp( −1/2 (y − µ)⊤RD⁻¹R⊤(y − µ) )

This gives a covariance matrix:

C⁻¹ = RD⁻¹R⊤

so that

p(y) = 1/|2πC|^{1/2} exp( −1/2 (y − µ)⊤C⁻¹(y − µ) ),  C = RDR⊤
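A sketch constructing C = RDR⊤ from a rotation and sampling from it (the angle and variances are our own choices):

    import numpy as np

    theta = np.pi / 4
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])   # rotation matrix
    D = np.diag([1.0, 0.1])                           # axis-aligned variances
    C = R @ D @ R.T                                   # correlated covariance

    rng = np.random.default_rng(3)
    y = rng.multivariate_normal(mean=np.zeros(2), cov=C, size=1000)
    print(np.cov(y.T))  # should be close to C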

SLIDES 111–115

Recall Univariate Gaussian Properties

1. Sum of Gaussian variables is also Gaussian:

yi ∼ N(µi, σi²)

∑_{i=1}^n yi ∼ N( ∑_{i=1}^n µi, ∑_{i=1}^n σi² )

2. Scaling a Gaussian leads to a Gaussian:

y ∼ N(µ, σ²)

wy ∼ N(wµ, w²σ²)
SLIDES 116–118

Multivariate Consequence

◮ If

x ∼ N(µ, Σ)

◮ And

y = Wx

◮ Then

y ∼ N(Wµ, WΣW⊤)
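A quick numpy check of this consequence (example values are our own):

    import numpy as np

    rng = np.random.default_rng(4)
    mu, Sigma = np.array([1.0, -1.0]), np.array([[1.0, 0.5], [0.5, 2.0]])
    W = np.array([[2.0, 0.0], [1.0, 3.0]])

    x = rng.multivariate_normal(mu, Sigma, size=200000)
    y = x @ W.T                # y = W x for each sample
    print(y.mean(axis=0))      # close to W @ mu
    print(np.cov(y.T))         # close to W @ Sigma @ W.T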
SLIDE 119

Sampling a Function

Multivariate Gaussians

◮ We will consider a Gaussian with a particular structure of covariance matrix.
◮ Generate a single sample from this 25 dimensional Gaussian distribution, f = [f1, f2, ..., f25].
◮ We will plot these points against their index.
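A sketch of that sampling, assuming the covariance is built with the exponentiated quadratic kernel defined later in the deck (the input grid, length scale and variance are our guesses):

    import numpy as np

    x = np.linspace(-1.0, 1.0, 25)                    # inputs indexing the 25 dimensions
    alpha, ell = 1.0, 0.3
    K = alpha * np.exp(-(x[:, None] - x[None, :])**2 / (2 * ell**2))

    rng = np.random.default_rng(5)
    f = rng.multivariate_normal(np.zeros(25), K + 1e-10 * np.eye(25))
    print(f)  # plot f against its index to see a smooth "function"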

SLIDES 120–126

Gaussian Distribution Sample

(a) [Figure: a 25 dimensional correlated random variable, values fi plotted against index i]
(b) [Figure: colormap showing correlations between dimensions i and j]

Figure: A sample from a 25 dimensional Gaussian distribution.

SLIDE 127

Gaussian Distribution Sample

(a) [Figure: a 25 dimensional correlated random variable, values fi plotted against index i]
(b) Correlation between f1 and f2:

[ 1       0.96587 ]
[ 0.96587 1       ]

Figure: A sample from a 25 dimensional Gaussian distribution.

SLIDES 128–131

Prediction of f2 from f1

[Figure: single contour of the joint Gaussian density over (f1, f2), covariance [1 0.96587; 0.96587 1]]

◮ The single contour of the Gaussian density represents the joint distribution, p(f1, f2).
◮ We observe that f1 = −0.313.
◮ Conditional density: p(f2|f1 = −0.313).

SLIDE 132

Prediction with Correlated Gaussians

◮ Prediction of f2 from f1 requires the conditional density.
◮ The conditional density is also Gaussian:

p(f2|f1) = N( f2 | (k1,2/k1,1) f1, k2,2 − k1,2²/k1,1 )

where the covariance of the joint density is given by

K = [ k1,1 k1,2 ; k2,1 k2,2 ]
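A sketch of this two-variable conditioning, using the correlation and observation from the previous slides:

    import numpy as np

    K = np.array([[1.0, 0.96587], [0.96587, 1.0]])
    f1 = -0.313

    cond_mean = K[0, 1] / K[0, 0] * f1
    cond_var = K[1, 1] - K[0, 1]**2 / K[0, 0]
    print(cond_mean, cond_var)  # parameters of N(f2 | cond_mean, cond_var)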

SLIDES 133–136

Prediction of f5 from f1

[Figure: single contour of the joint Gaussian density over (f1, f5), covariance [1 0.57375; 0.57375 1]]

◮ The single contour of the Gaussian density represents the joint distribution, p(f1, f5).
◮ We observe that f1 = −0.313.
◮ Conditional density: p(f5|f1 = −0.313).

SLIDES 137–138

Prediction with Correlated Gaussians

◮ Prediction of f∗ from f requires the multivariate conditional density.
◮ The multivariate conditional density is also Gaussian:

p(f∗|f) = N(f∗|µ, Σ)

µ = K∗,f Kf,f⁻¹ f

Σ = K∗,∗ − K∗,f Kf,f⁻¹ Kf,∗

◮ Here the covariance of the joint density is given by

K = [ Kf,f Kf,∗ ; K∗,f K∗,∗ ]
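These two formulas are the heart of Gaussian process prediction. A minimal sketch, assuming the exponentiated quadratic covariance from the next slide and made-up training data:

    import numpy as np

    def eq_kernel(X1, X2, alpha=1.0, ell=1.0):
        """Exponentiated quadratic covariance between two sets of 1-D inputs."""
        d2 = (X1[:, None] - X2[None, :])**2
        return alpha * np.exp(-d2 / (2 * ell**2))

    X = np.array([-2.0, 0.0, 1.5])         # training inputs (ours)
    f = np.array([0.5, -0.3, 1.0])         # observed function values (ours)
    Xs = np.linspace(-3, 3, 7)             # test inputs

    Kff = eq_kernel(X, X) + 1e-8 * np.eye(len(X))   # jitter for stability
    Ksf = eq_kernel(Xs, X)
    Kss = eq_kernel(Xs, Xs)

    mu = Ksf @ np.linalg.solve(Kff, f)               # posterior mean
    Sigma = Kss - Ksf @ np.linalg.solve(Kff, Ksf.T)  # posterior covariance
    print(mu)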

SLIDE 139

Covariance Functions

Where did this covariance matrix come from?

Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian)

k(x, x′) = α exp( −‖x − x′‖² / (2ℓ²) )

◮ The covariance matrix is built using the inputs to the function, x.
◮ For the example above it was based on Euclidean distance.
◮ The covariance function is also known as a kernel.

SLIDE 140

Covariance Functions

Where did this covariance matrix come from?

k(xi, xj) = α exp( −‖xi − xj‖² / (2ℓ²) )

x1 = −3.0, x2 = 1.20, and x3 = 1.40, with ℓ = 2.00 and α = 1.00.

x1 = −3.0, x1 = −3.0:

k1,1 = 1.00 × exp( −(−3.0 − −3.0)² / (2 × 2.00²) ) = 1.00

[Figure: the covariance matrix k filled in entry by entry from the inputs x]
slide-141
SLIDE 141

Outline

The Gaussian Density Covariance from Basis Functions

SLIDE 142

Basis Function Form

Radial basis functions commonly have the form

φk(xi) = exp( −(xi − µk)² / (2ℓ²) ).

◮ The basis function maps data into a "feature space" in which a linear sum is a nonlinear function.

[Figure: a set of radial basis functions φ(x) plotted against x, with width ℓ = 2 and location parameters µ = [−4 0 4]⊤]
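A sketch of these three basis functions (ℓ = 2, centres at −4, 0 and 4, as in the figure):

    import numpy as np

    def rbf_basis(x, centres, ell=2.0):
        """Radial basis functions evaluated at x; one column per centre."""
        return np.exp(-(x[:, None] - centres[None, :])**2 / (2 * ell**2))

    x = np.linspace(-8, 8, 200)
    Phi = rbf_basis(x, np.array([-4.0, 0.0, 4.0]))
    print(Phi.shape)  # (200, 3); plot each column against x to reproduce the figure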

SLIDE 143

Basis Function Representations

◮ Represent a function by a linear sum over a basis,

f(xi,:; w) = ∑_{k=1}^m wk φk(xi,:),   (1)

◮ Here: m basis functions, φk(·) is the kth basis function, and

w = [w1, ..., wm]⊤.

◮ For the standard linear model: φk(xi,:) = xi,k.

SLIDE 144

Random Functions

Functions derived using

f(x) = ∑_{k=1}^m wk φk(x),

where the elements of w are independently sampled from a Gaussian density, wk ∼ N(0, α).

[Figure: functions sampled using the basis set from the previous figure, f(x) against x. Each line is a separate sample, generated by a weighted sum of the basis set. The weights w are sampled from a Gaussian density with variance α = 1.]
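Sampling such random functions, continuing the basis sketch above (sample counts are our own):

    import numpy as np

    x = np.linspace(-8, 8, 200)
    centres = np.array([-4.0, 0.0, 4.0])
    Phi = np.exp(-(x[:, None] - centres[None, :])**2 / (2 * 2.0**2))

    rng = np.random.default_rng(6)
    W = rng.normal(0.0, 1.0, size=(3, 10))  # 10 weight vectors, alpha = 1
    F = Phi @ W                             # each column is one random function
    print(F.shape)  # (200, 10); plot the columns against x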

SLIDES 145–151

Direct Construction of Covariance Matrix

Use matrix notation to write the function,

f(xi; w) = ∑_{k=1}^m wk φk(xi);

computed at the training data this gives a vector

f = Φw.

w ∼ N(0, αI)

◮ w and f are only related by an inner product.
◮ Φ ∈ ℜ^{n×p} is a design matrix.
◮ Φ is fixed and non-stochastic for a given training set.
◮ f is Gaussian distributed.

SLIDES 152–155

Expectations

◮ We have ⟨f⟩ = Φ⟨w⟩. (We use ⟨·⟩ to denote expectations under prior distributions.)
◮ The prior mean of w was zero, giving ⟨f⟩ = 0.
◮ The prior covariance of f is

K = ⟨ff⊤⟩ − ⟨f⟩⟨f⟩⊤

⟨ff⊤⟩ = Φ⟨ww⊤⟩Φ⊤,

giving

K = αΦΦ⊤.
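An empirical check of K = αΦΦ⊤ by Monte Carlo over the prior on w (a sketch with our own sizes):

    import numpy as np

    rng = np.random.default_rng(7)
    alpha, n, m = 1.0, 5, 3
    Phi = rng.normal(size=(n, m))          # fixed design matrix

    W = rng.normal(0.0, np.sqrt(alpha), size=(m, 200000))
    F = Phi @ W                            # samples of f = Phi w
    K_mc = F @ F.T / W.shape[1]            # Monte Carlo estimate of <f f^T>
    print(np.max(np.abs(K_mc - alpha * Phi @ Phi.T)))  # should be small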

SLIDES 156–159

Covariance between Two Points

◮ The prior covariance between two points xi and xj is

k(xi, xj) = α φ:(xi)⊤ φ:(xj),

or in sum notation

k(xi, xj) = α ∑_{k=1}^m φk(xi) φk(xj)

◮ For the radial basis used this gives

k(xi, xj) = α ∑_{k=1}^m exp( −[(xi − µk)² + (xj − µk)²] / (2ℓ²) ).

SLIDE 160

Covariance Functions

RBF Basis Functions

k(x, x′) = α φ(x)⊤φ(x′)

φk(x) = exp( −(x − µk)² / (2ℓ²) )

µ = [−1 1]⊤

SLIDES 161–165

Selecting Number and Location of Basis

◮ Need to choose
1. location of centers
2. number of basis functions

Restrict analysis to 1-D input, x.

◮ Consider uniform spacing over a region:

k(xi, xj) = α ∑_{k=1}^m φk(xi) φk(xj)

= α ∑_{k=1}^m exp( −(xi − µk)²/(2ℓ²) ) exp( −(xj − µk)²/(2ℓ²) )

= α ∑_{k=1}^m exp( −[xi² + xj² − 2µk(xi + xj) + 2µk²] / (2ℓ²) ).

SLIDES 166–168

Uniform Basis Functions

◮ Set each center location to

µk = a + ∆µ·(k − 1).

◮ Specify the basis functions in terms of their indices,

k(xi, xj) = α′∆µ ∑_{k=1}^m exp( −[xi² + xj² − 2(a + ∆µ·(k − 1))(xi + xj) + 2(a + ∆µ·(k − 1))²] / (2ℓ²) ).

◮ Here we've scaled the variance of the process by ∆µ.

SLIDES 169–173

Infinite Basis Functions

◮ Take µ1 = a and µm = b, so b = a + ∆µ·(m − 1).
◮ This implies

b − a = ∆µ(m − 1), and therefore m = (b − a)/∆µ + 1.

◮ Take the limit as ∆µ → 0, so m → ∞:

k(xi, xj) = α′ ∫_a^b exp( −(xi² + xj²)/(2ℓ²) − [2(µ − ½(xi + xj))² − ½(xi + xj)²] / (2ℓ²) ) dµ,

where we have used a + k·∆µ → µ.

SLIDES 174–176

Result

◮ Performing the integration leads to

k(xi, xj) = α′√(πℓ²) exp( −(xi − xj)²/(4ℓ²) ) × ½ [ erf( (b − ½(xi + xj))/ℓ ) − erf( (a − ½(xi + xj))/ℓ ) ].

◮ Now take the limit as a → −∞ and b → ∞:

k(xi, xj) = α exp( −(xi − xj)²/(4ℓ²) ),

where α = α′√(πℓ²).
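A numerical sanity check of this limit: a dense uniform RBF basis should reproduce the exponentiated quadratic covariance (a sketch with our own numbers):

    import numpy as np

    ell, a, b, m = 1.0, -20.0, 20.0, 4001
    mu = np.linspace(a, b, m)
    dmu = mu[1] - mu[0]
    alpha_prime = 1.0

    def k_basis(xi, xj):
        """Finite-basis covariance alpha' * dmu * sum_k phi_k(xi) phi_k(xj)."""
        phi_i = np.exp(-(xi - mu)**2 / (2 * ell**2))
        phi_j = np.exp(-(xj - mu)**2 / (2 * ell**2))
        return alpha_prime * dmu * np.sum(phi_i * phi_j)

    xi, xj = 0.3, -0.8
    k_limit = alpha_prime * np.sqrt(np.pi * ell**2) * np.exp(-(xi - xj)**2 / (4 * ell**2))
    print(k_basis(xi, xj), k_limit)  # the two values should nearly agree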

SLIDES 177–179

Infinite Feature Space

◮ An RBF model with infinite basis functions is a Gaussian process.
◮ The covariance function is given by the exponentiated quadratic covariance function:

k(xi, xj) = α exp( −(xi − xj)²/(4ℓ²) ).

◮ Note: the functional forms of the covariance function and the basis functions are similar.
◮ This is a special case;
◮ in general they are very different.

Similar results can be obtained for multi-dimensional input models (Williams, 1998; Neal, 1996).

SLIDE 180

Covariance Functions

Where did this covariance matrix come from?

Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian)

k(x, x′) = α exp( −‖x − x′‖² / (2ℓ²) )

◮ The covariance matrix is built using the inputs to the function, x.
◮ For the example above it was based on Euclidean distance.
◮ The covariance function is also known as a kernel.

SLIDE 181

Covariance Functions

RBF Basis Functions

k(x, x′) = α φ(x)⊤φ(x′)

φk(x) = exp( −(x − µk)² / (2ℓ²) )

µ = [−1 1]⊤

SLIDE 182

References I

P. S. Laplace. Essai philosophique sur les probabilités. Courcier, Paris, 2nd edition, 1814. Sixth edition of 1840 translated and reprinted (1951) as A Philosophical Essay on Probabilities, New York: Dover; fifth edition of 1825 reprinted 1986 with notes by Bernard Bru, Paris: Christian Bourgois Éditeur; translated by Andrew Dale (1995) as Philosophical Essay on Probabilities, New York: Springer-Verlag.

R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996. Lecture Notes in Statistics 118.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

C. K. I. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.