SLIDE 1

Lecture 12

Gaussian Process Models

Colin Rundel 02/27/2017

1

SLIDE 2

Multivariate Normal

2

SLIDE 3

Multivariate Normal Distribution

An n-dimensional multivariate normal distribution with covariance $\Sigma$ (positive semidefinite) can be written as

$$\underset{n \times 1}{Y} \sim N(\underset{n \times 1}{\mu},\, \underset{n \times n}{\Sigma}) \quad \text{where } \{\Sigma\}_{ij} = \sigma_{ij} = \rho_{ij}\,\sigma_i\,\sigma_j$$

$$\begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix},\; \begin{pmatrix} \rho_{11}\sigma_1\sigma_1 & \cdots & \rho_{1n}\sigma_1\sigma_n \\ \vdots & \ddots & \vdots \\ \rho_{n1}\sigma_n\sigma_1 & \cdots & \rho_{nn}\sigma_n\sigma_n \end{pmatrix} \right)$$

3

SLIDE 4

Density

For the n-dimensional multivariate normal given on the last slide, the density is

$$(2\pi)^{-n/2}\,\det(\Sigma)^{-1/2}\,\exp\left(-\frac{1}{2}\,(Y-\mu)'\,\Sigma^{-1}\,(Y-\mu)\right)$$

and the log density is

$$-\frac{n}{2}\log 2\pi \;-\; \frac{1}{2}\log\det(\Sigma) \;-\; \frac{1}{2}\,(Y-\mu)'\,\Sigma^{-1}\,(Y-\mu)$$

4
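The log density above can be computed directly, term by term. A minimal numpy sketch (not from the slides), checked against the univariate normal in the last two lines:

```python
import numpy as np

def mvn_log_density(y, mu, Sigma):
    """Log density of an n-dimensional multivariate normal,
    computed term by term from the formula above."""
    n = len(y)
    diff = y - mu
    _, logdet = np.linalg.slogdet(Sigma)        # log det(Sigma), numerically stable
    quad = diff @ np.linalg.solve(Sigma, diff)  # (y - mu)' Sigma^{-1} (y - mu)
    return -n / 2 * np.log(2 * np.pi) - logdet / 2 - quad / 2

# Sanity check against the closed-form univariate normal log density
y = np.array([1.0]); mu = np.array([0.0]); Sigma = np.array([[4.0]])
manual = -0.5 * np.log(2 * np.pi * 4.0) - (1.0 - 0.0) ** 2 / (2 * 4.0)
```

Using `slogdet` and `solve` (rather than `det` and an explicit inverse) is the standard way to keep this stable for larger n.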


SLIDE 8

Sampling

To generate draws from an n-dimensional multivariate normal with mean µ and covariance matrix Σ,

  • Find a matrix A such that Σ = A Aᵗ; most often we use A = Chol(Σ)
  • Draw n iid unit normals (N(0, 1)) as z
  • Construct multivariate normal draws using

Y = µ + A z

5
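The three steps above can be sketched in a few lines of numpy (the mean and covariance here are made up for illustration, not the lecture's example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical target mean and covariance
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

A = np.linalg.cholesky(Sigma)          # step 1: Sigma = A A^t
z = rng.standard_normal((2, 100_000))  # step 2: n iid unit normals per draw
Y = mu[:, None] + A @ z                # step 3: each column is one MVN draw

emp_cov = np.cov(Y)  # empirical covariance should be close to Sigma
```

With enough draws the sample mean and covariance recover µ and Σ, which is a quick way to check the construction.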

SLIDE 9

Bivariate Example

µ = ( )

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

[Figure: samples from the bivariate normal, x vs y, for ρ = ±0.1, ±0.5, ±0.7, ±0.9]

6


SLIDE 13

Marginal distributions

Proposition - For an n-dimensional multivariate normal with mean µ and covariance matrix Σ, any of the possible marginal distributions will also be (multivariate) normal.

For a univariate marginal distribution,

$$y_i \sim N(\mu_i,\, \gamma_{ii})$$

For a bivariate marginal distribution,

$$y_{ij} \sim N\left( \begin{pmatrix} \mu_i \\ \mu_j \end{pmatrix},\; \begin{pmatrix} \gamma_{ii} & \gamma_{ij} \\ \gamma_{ji} & \gamma_{jj} \end{pmatrix} \right)$$

For a k-dimensional marginal distribution,

$$y_{i_1,\ldots,i_k} \sim N\left( \begin{pmatrix} \mu_{i_1} \\ \vdots \\ \mu_{i_k} \end{pmatrix},\; \begin{pmatrix} \gamma_{i_1 i_1} & \cdots & \gamma_{i_1 i_k} \\ \vdots & \ddots & \vdots \\ \gamma_{i_k i_1} & \cdots & \gamma_{i_k i_k} \end{pmatrix} \right)$$

7
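Computationally, marginalizing a multivariate normal is just subsetting: keep the matching entries of µ and the matching block of Σ. A small sketch with made-up numbers:

```python
import numpy as np

def mvn_marginal(mu, Sigma, idx):
    """Marginal of a multivariate normal over the components in idx:
    pick out the matching entries of mu and block of Sigma."""
    idx = np.asarray(idx)
    return mu[idx], Sigma[np.ix_(idx, idx)]

# Illustrative 3-dimensional example
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 3.0]])
m, S = mvn_marginal(mu, Sigma, [0, 2])  # marginal over (y_1, y_3)
```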


SLIDE 15

Conditional Distributions

If we partition the n dimensions into two pieces such that $Y = (Y_1, Y_2)^t$, then

$$\underset{n \times 1}{Y} \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},\; \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$$

with marginals

$$\underset{k \times 1}{Y_1} \sim N(\mu_1,\, \Sigma_{11}), \qquad \underset{(n-k) \times 1}{Y_2} \sim N(\mu_2,\, \Sigma_{22})$$

then the conditional distributions are given by

$$Y_1 \mid Y_2 = a \;\sim\; N\left(\mu_1 + \Sigma_{12}\,\Sigma_{22}^{-1}(a - \mu_2),\; \Sigma_{11} - \Sigma_{12}\,\Sigma_{22}^{-1}\,\Sigma_{21}\right)$$

$$Y_2 \mid Y_1 = b \;\sim\; N\left(\mu_2 + \Sigma_{21}\,\Sigma_{11}^{-1}(b - \mu_1),\; \Sigma_{22} - \Sigma_{21}\,\Sigma_{11}^{-1}\,\Sigma_{12}\right)$$

8
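The conditional formulas translate directly to code. This sketch (not from the slides) checks itself against the standard bivariate result that, with unit variances and correlation ρ, $Y_1 \mid Y_2 = a \sim N(\rho a,\, 1 - \rho^2)$:

```python
import numpy as np

def mvn_conditional(mu1, mu2, S11, S12, S21, S22, a):
    """Mean and covariance of Y1 | Y2 = a, per the formula above."""
    K = S12 @ np.linalg.inv(S22)   # Sigma_12 Sigma_22^{-1}
    cond_mu = mu1 + K @ (a - mu2)
    cond_Sigma = S11 - K @ S21
    return cond_mu, cond_Sigma

# Bivariate check: zero means, unit variances, correlation rho
rho = 0.5
cm, cS = mvn_conditional(
    np.array([0.0]), np.array([0.0]),
    np.array([[1.0]]), np.array([[rho]]),
    np.array([[rho]]), np.array([[1.0]]),
    np.array([2.0]),
)
```

In practice one would use `np.linalg.solve` rather than an explicit inverse; `inv` is kept here to mirror the formula.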


SLIDE 17

Gaussian Processes

From Shumway: A process, Y = {Yt : t ∈ T}, is said to be a Gaussian process if all possible finite-dimensional vectors y = (yt1, yt2, ..., ytn)t, for every collection of time points t1, t2, . . . , tn and every positive integer n, have a multivariate normal distribution.

So far we have only looked at examples of time series where T is discrete (and evenly spaced & contiguous); it turns out things get a lot more interesting when we explore the case where T is defined on a continuous space (e.g. R or some subset of R).

9

SLIDE 18

Gaussian Process Regression

10


SLIDE 23

Parameterizing a Gaussian Process

Imagine we have a Gaussian process defined such that Y = {Yt : t ∈ [0, 1]}.

  • We now have an uncountably infinite set of possible Yts.
  • We will only have a (small) finite number of observations Y1, . . . , Yn with which to say something useful about this infinite-dimensional process.
  • The unconstrained covariance matrix for the observed data can have up to n(n + 1)/2 unique values (p ≫ n).
  • It is therefore necessary to make some simplifying assumptions:
    • Stationarity
    • Simple parameterization of Σ

11

SLIDE 24

Covariance Functions

More on these next week, but for now some simple / common examples.

Exponential covariance:

$$\Sigma(y_t, y_{t'}) = \sigma^2 \exp\left(-|t - t'| \cdot l\right)$$

Squared exponential covariance:

$$\Sigma(y_t, y_{t'}) = \sigma^2 \exp\left(-\left(|t - t'| \cdot l\right)^2\right)$$

Powered exponential covariance (p ∈ (0, 2]):

$$\Sigma(y_t, y_{t'}) = \sigma^2 \exp\left(-\left(|t - t'| \cdot l\right)^p\right)$$

12
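These three kernels are easy to write as functions of the distance d = |t − t'|. A sketch, treating l as a multiplicative (inverse range) parameter to match the Cov(d) = σ² exp(−(l·d)²) form and the JAGS model used later in the deck; note the powered exponential nests the other two at p = 1 and p = 2:

```python
import numpy as np

def exp_cov(d, sigma2, l):
    """Exponential covariance as a function of distance d = |t - t'|."""
    return sigma2 * np.exp(-np.abs(d) * l)

def sq_exp_cov(d, sigma2, l):
    """Squared exponential covariance."""
    return sigma2 * np.exp(-(np.abs(d) * l) ** 2)

def pow_exp_cov(d, sigma2, l, p):
    """Powered exponential covariance, p in (0, 2]."""
    return sigma2 * np.exp(-(np.abs(d) * l) ** p)
```

All three equal σ² at d = 0 and decay monotonically with distance; p controls how quickly the decay sets in.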

SLIDE 25

Covariance Function Decay

[Figure: correlation decay as a function of distance d for the exponential and squared exponential covariances, for l = 1, ..., 10]

13

SLIDE 26

Example

[Figure: the example data, y vs t ∈ [0, 1]]

14


SLIDE 29

Prediction

Our example has 15 observations which we would like to use as the basis for predicting Yt at other values of t (say a grid of values from 0 to 1). For now let's use a squared exponential covariance with σ² = 10 and l = 10.

We therefore want to sample from Ypred | Yobs:

$$Y_{pred} \mid Y_{obs} = y \;\sim\; N\left(\Sigma_{po}\,\Sigma_{obs}^{-1}\,y,\;\; \Sigma_{pred} - \Sigma_{po}\,\Sigma_{obs}^{-1}\,\Sigma_{op}\right)$$

15
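Putting the conditioning and sampling steps together, here is a sketch of that prediction step. The observed values below are made up (the real 15 data points are not recoverable from the slides), and small jitter is added to the diagonals for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(2)

def sq_exp_cov(d, sigma2, l):
    return sigma2 * np.exp(-(d * l) ** 2)

# Hypothetical observed data standing in for the 15 points in the example
t_obs = np.linspace(0.05, 0.95, 15)
y_obs = np.sin(2 * np.pi * t_obs)
t_pred = np.linspace(0, 1, 101)          # prediction grid

sigma2, l = 10.0, 10.0
dist = lambda a, b: np.abs(a[:, None] - b[None, :])
S_obs  = sq_exp_cov(dist(t_obs, t_obs), sigma2, l) + 1e-6 * np.eye(15)
S_pred = sq_exp_cov(dist(t_pred, t_pred), sigma2, l)
S_po   = sq_exp_cov(dist(t_pred, t_obs), sigma2, l)

# Conditional mean and covariance of Y_pred | Y_obs = y
cond_mu = S_po @ np.linalg.solve(S_obs, y_obs)
cond_Sigma = S_pred - S_po @ np.linalg.solve(S_obs, S_po.T)

# One conditional draw, via the Cholesky construction Y = mu + A z
A = np.linalg.cholesky(cond_Sigma + 1e-6 * np.eye(101))
draw = cond_mu + A @ rng.standard_normal(101)
```

Each call to the last two lines produces one curve like the "Draw" slides that follow; the conditional variance shrinks to near zero at the observed points.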

SLIDE 30

Draw 1

[Figure: conditional draw 1, y vs t ∈ [0, 1]]

16

SLIDE 31

Draw 2

[Figure: conditional draw 2, y vs t ∈ [0, 1]]

17

SLIDE 32

Draw 3

[Figure: conditional draw 3, y vs t ∈ [0, 1]]

18

SLIDE 33

Draw 4

[Figure: conditional draw 4, y vs t ∈ [0, 1]]

19

SLIDE 34

Draw 5

[Figure: conditional draw 5, y vs t ∈ [0, 1]]

20

SLIDE 35

Many draws later

[Figure: many conditional draws overlaid, y vs t ∈ [0, 1]]

21

SLIDE 36

Exponential Covariance

[Figure: conditional draws under the exponential covariance, y vs t ∈ [0, 1]]

22

SLIDE 37

Powered Exponential Covariance (p = 1.5)

[Figure: conditional draws under the powered exponential covariance with p = 1.5, y vs t ∈ [0, 1]]

23

SLIDE 38

Back to the squared exponential

[Figure: conditional draws under the squared exponential covariance, y vs t ∈ [0, 1]]

24

SLIDE 39

Changing the range (l)

[Figure: conditional draws under the squared exponential covariance with sigma2 = 10 and l ∈ {5, 7.5, 12.5, 15}, y vs t]

25

SLIDE 40

Effective Range

For the squared exponential covariance

$$\mathrm{Cov}(d) = \sigma^2 \exp\left(-(l \cdot d)^2\right), \qquad \mathrm{Corr}(d) = \exp\left(-(l \cdot d)^2\right)$$

we would like to know, for a given value of l, beyond what distance apart must observations be to have a correlation less than 0.05?

26
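One way to answer: set exp(−(l·d)²) = 0.05 and solve for d, which gives d = √(log 20) / l ≈ 1.73 / l. A small helper (illustrative, not from the slides) that computes this for any threshold:

```python
import numpy as np

def effective_range(l, thresh=0.05):
    """Distance beyond which Corr(d) = exp(-(l*d)^2) drops below thresh."""
    return np.sqrt(-np.log(thresh)) / l

# For l = 1 the correlation falls below 0.05 past d ~ 1.73;
# doubling l halves this effective range.
r = effective_range(1.0)
```

So the effective range scales as 1/l: larger l means correlation dies off over shorter distances, matching the decay plots earlier in the deck.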

SLIDE 41

Changing the scale (σ2)

[Figure: conditional draws under the squared exponential covariance with sigma2 ∈ {5, 15} and l ∈ {5, 10}, y vs t]

27

SLIDE 42

Fitting

model{
  y ~ dmnorm(mu, inverse(Sigma))

  for (i in 1:N) {
    mu[i] <- 0
  }

  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
      Sigma[j,i] <- Sigma[i,j]
    }
  }

  for (k in 1:N) {
    Sigma[k,k] <- sigma2 + 0.01
  }

  sigma2 ~ dlnorm(0, 1)
  l ~ dt(0, 2.5, 1) T(0,)  # Half-Cauchy(0, 2.5)
}

28

SLIDE 43

Trace plots

[Figure: MCMC trace and posterior density plots for l and sigma2]

param    post_mean  post_med  post_lower  post_upper
l        5.981289   5.833655  4.2669795   8.456006
sigma2   2.457979   2.032632  0.8173064   7.168197

29

SLIDE 44

Fitted models

[Figure: fitted models; Post Mean Model (sigma2 = 2.32, l = 6.03) and Post Median Model (sigma2 = 1.89, l = 5.86), y vs t ∈ [0, 1]]

30

SLIDE 45

Forecasting

[Figure: forecasts for t ∈ [0, 1.5] under the Post Mean Model (sigma2 = 2.32, l = 6.03) and Post Median Model (sigma2 = 1.89, l = 5.86)]

31