Lecture 12: Gaussian Process Models
Colin Rundel
02/27/2017
Multivariate Normal
Multivariate Normal Distribution
An n-dimensional multivariate normal distribution with covariance $\Sigma$ (positive semidefinite) can be written as

$$\underset{n \times 1}{Y} \sim \mathcal{N}(\underset{n \times 1}{\mu},\; \underset{n \times n}{\Sigma}) \quad \text{where } \{\Sigma\}_{ij} = \sigma^2_{ij} = \rho_{ij}\,\sigma_i\,\sigma_j$$

$$\begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix},\; \begin{pmatrix} \rho_{11}\sigma_1\sigma_1 & \cdots & \rho_{1n}\sigma_1\sigma_n \\ \vdots & \ddots & \vdots \\ \rho_{n1}\sigma_n\sigma_1 & \cdots & \rho_{nn}\sigma_n\sigma_n \end{pmatrix} \right)$$
Density
For the n-dimensional multivariate normal given on the last slide, the density is

$$(2\pi)^{-n/2}\,\det(\Sigma)^{-1/2}\,\exp\!\left(-\frac{1}{2}\,\underset{1 \times n}{(Y-\mu)'}\,\underset{n \times n}{\Sigma^{-1}}\,\underset{n \times 1}{(Y-\mu)}\right)$$

and the log density is

$$-\frac{n}{2}\log 2\pi \,-\, \frac{1}{2}\log\det(\Sigma) \,-\, \frac{1}{2}(Y-\mu)'\,\Sigma^{-1}\,(Y-\mu)$$
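As a check on this formula, here is a minimal base R sketch of the log density; the function name dmvnorm_log is made up for illustration, and mvtnorm::dmvnorm(y, mu, Sigma, log = TRUE) can be used for comparison.

dmvnorm_log <- function(y, mu, Sigma) {
  n <- length(y)
  U <- chol(Sigma)                   # upper triangular, Sigma = t(U) %*% U
  logdet <- 2 * sum(log(diag(U)))    # log det(Sigma) via the Cholesky factor
  z <- backsolve(U, y - mu, transpose = TRUE)  # solves t(U) z = y - mu
  -n / 2 * log(2 * pi) - logdet / 2 - sum(z^2) / 2
}

Using the Cholesky factor gives both the log determinant and the quadratic form without explicitly inverting $\Sigma$.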
Sampling

To generate draws from an n-dimensional multivariate normal with mean $\mu$ and covariance matrix $\Sigma$:

- Find a matrix $A$ such that $\Sigma = A\,A^t$; most often we use $A = \text{Chol}(\Sigma)$.
- Draw $n$ iid unit normals ($\mathcal{N}(0, 1)$) as $z$.
- Construct the multivariate normal draw using $Y = \mu + A\,z$, as in the sketch below.
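A minimal base R sketch of this recipe; the function name rmvnorm_chol and the example inputs are made up for illustration. Note that R's chol() returns the upper-triangular factor $U$ with $\Sigma = U^t U$, so the lower-triangular $A$ is t(chol(Sigma)).

rmvnorm_chol <- function(n_draws, mu, Sigma) {
  n <- length(mu)
  A <- t(chol(Sigma))                        # lower triangular, Sigma = A %*% t(A)
  z <- matrix(rnorm(n * n_draws), nrow = n)  # n iid N(0,1) values per draw
  mu + A %*% z                               # adds mu to each column; one draw per column
}

# Example: 1000 draws from a bivariate normal with correlation 0.9
draws <- rmvnorm_chol(1000, mu = c(0, 0),
                      Sigma = matrix(c(1, 0.9, 0.9, 1), nrow = 2))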
Bivariate Example

$$\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \qquad \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

[Figure: draws from this bivariate normal (x vs. y) for rho = -0.9, -0.7, -0.5, -0.1, 0.1, 0.5, 0.7, 0.9]
Marginal distributions

Proposition - For an n-dimensional multivariate normal with mean $\mu$ and covariance matrix $\Sigma$, all of the possible marginal distributions will also be (multivariate) normal.

For a univariate marginal distribution,
$$y_i \sim \mathcal{N}(\mu_i,\, \gamma_{ii})$$

For a bivariate marginal distribution,
$$y_{ij} \sim \mathcal{N}\left( \begin{pmatrix} \mu_i \\ \mu_j \end{pmatrix},\; \begin{pmatrix} \gamma_{ii} & \gamma_{ij} \\ \gamma_{ji} & \gamma_{jj} \end{pmatrix} \right)$$

For a k-dimensional marginal distribution,
$$y_{i_1, \ldots, i_k} \sim \mathcal{N}\left( \begin{pmatrix} \mu_{i_1} \\ \vdots \\ \mu_{i_k} \end{pmatrix},\; \begin{pmatrix} \gamma_{i_1 i_1} & \cdots & \gamma_{i_1 i_k} \\ \vdots & \ddots & \vdots \\ \gamma_{i_k i_1} & \cdots & \gamma_{i_k i_k} \end{pmatrix} \right)$$
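In code, marginalizing is just subsetting: keep the corresponding entries of $\mu$ and rows/columns of $\Sigma$. A minimal R sketch, with made-up example values:

mu    <- c(0, 1, 2)
Sigma <- matrix(0.5, nrow = 3, ncol = 3)
diag(Sigma) <- 1                          # unit variances, 0.5 covariances

idx <- c(1, 3)                            # marginal distribution of (y1, y3)
mu_marg    <- mu[idx]
Sigma_marg <- Sigma[idx, idx, drop = FALSE]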
Conditional Distributions

If we partition the n dimensions into two pieces such that $Y = (Y_1, Y_2)^t$ then

$$\underset{n \times 1}{Y} \sim \mathcal{N}\left( \underset{n \times 1}{\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}},\; \underset{n \times n}{\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}} \right)$$

$$\underset{k \times 1}{Y_1} \sim \mathcal{N}(\underset{k \times 1}{\mu_1},\; \underset{k \times k}{\Sigma_{11}}) \qquad \underset{(n-k) \times 1}{Y_2} \sim \mathcal{N}(\underset{(n-k) \times 1}{\mu_2},\; \underset{(n-k) \times (n-k)}{\Sigma_{22}})$$

and the conditional distributions are given by

$$Y_1 \mid Y_2 = a \sim \mathcal{N}\big(\mu_1 + \Sigma_{12}\,\Sigma_{22}^{-1}(a - \mu_2),\; \Sigma_{11} - \Sigma_{12}\,\Sigma_{22}^{-1}\,\Sigma_{21}\big)$$
$$Y_2 \mid Y_1 = b \sim \mathcal{N}\big(\mu_2 + \Sigma_{21}\,\Sigma_{11}^{-1}(b - \mu_1),\; \Sigma_{22} - \Sigma_{21}\,\Sigma_{11}^{-1}\,\Sigma_{12}\big)$$
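A minimal base R sketch of the first conditional formula; the function name cond_mvn and the partition indices are made up for illustration, and $\Sigma_{21} = \Sigma_{12}^t$ by symmetry.

cond_mvn <- function(mu, Sigma, i1, i2, a) {
  S12 <- Sigma[i1, i2, drop = FALSE]
  S22_inv <- solve(Sigma[i2, i2, drop = FALSE])
  list(
    mean = mu[i1] + S12 %*% S22_inv %*% (a - mu[i2]),                # mu1 + S12 S22^-1 (a - mu2)
    cov  = Sigma[i1, i1, drop = FALSE] - S12 %*% S22_inv %*% t(S12)  # S11 - S12 S22^-1 S21
  )
}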
Gaussian Processes

From Shumway: A process, $Y = \{Y_t : t \in T\}$, is said to be a Gaussian process if all possible finite-dimensional vectors $y = (y_{t_1}, y_{t_2}, \ldots, y_{t_n})^t$, for every collection of time points $t_1, t_2, \ldots, t_n$ and every positive integer $n$, have a multivariate normal distribution.

So far we have only looked at examples of time series where $T$ is discrete (and evenly spaced and contiguous); things get a lot more interesting when we explore the case where $T$ is defined on a continuous space (e.g. $\mathbb{R}$ or some subset of $\mathbb{R}$).
Gaussian Process Regression
Parameterizing a Gaussian Process

Imagine we have a Gaussian process defined such that $Y = \{Y_t : t \in [0, 1]\}$:

- We now have an uncountably infinite set of possible $Y_t$s.
- We will only have a (small) finite number of observations $Y_1, \ldots, Y_n$ with which to say something useful about this infinite-dimensional process.
- The unconstrained covariance matrix for the observed data can have up to $n(n+1)/2$ unique values ($p \gg n$).
- It is therefore necessary to make some simplifying assumptions:
  - Stationarity
  - Simple parameterization of $\Sigma$
Covariance Functions

More on these next week, but for now some simple / common examples.

Exponential covariance:
$$\Sigma(y_t, y_{t'}) = \sigma^2 \exp\!\big(-|t - t'| \cdot l\big)$$

Squared exponential covariance:
$$\Sigma(y_t, y_{t'}) = \sigma^2 \exp\!\big(-(|t - t'| \cdot l)^2\big)$$

Powered exponential covariance ($p \in (0, 2]$):
$$\Sigma(y_t, y_{t'}) = \sigma^2 \exp\!\big(-(|t - t'| \cdot l)^p\big)$$
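These are straightforward to implement. A minimal R sketch (function names made up for illustration), each mapping a matrix of pairwise distances to a covariance matrix:

exp_cov     <- function(d, sigma2, l)    sigma2 * exp(-abs(d) * l)
sq_exp_cov  <- function(d, sigma2, l)    sigma2 * exp(-(abs(d) * l)^2)
pow_exp_cov <- function(d, sigma2, l, p) sigma2 * exp(-(abs(d) * l)^p)

# Example: covariance for a grid of time points on [0, 1]
t <- seq(0, 1, length.out = 5)
d <- abs(outer(t, t, "-"))            # matrix of pairwise distances |t - t'|
Sigma <- sq_exp_cov(d, sigma2 = 10, l = 10)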
Covariance Function Decay

[Figure: correlation vs. distance d for the exponential and squared exponential covariances, with l = 1, 2, ..., 10]
Example
[Figure: the 15 example observations, y vs. t on [0, 1]]
Prediction

Our example has 15 observations which we would like to use as the basis for predicting $Y_t$ at other values of $t$ (say a grid of values from 0 to 1). For now let's use a squared exponential covariance with $\sigma^2 = 10$ and $l = 10$.

We therefore want to sample from $Y_{pred} \mid Y_{obs}$:

$$Y_{pred} \mid Y_{obs} = y \sim \mathcal{N}\big(\Sigma_{po}\,\Sigma_{obs}^{-1}\,y,\;\; \Sigma_{pred} - \Sigma_{po}\,\Sigma_{obs}^{-1}\,\Sigma_{op}\big)$$

(No $\mu$ terms appear because the process is modeled with mean zero.)
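A minimal sketch of this conditional in R, reusing sq_exp_cov() and rmvnorm_chol() from the earlier sketches. Here t_obs and y_obs are made-up stand-ins for the lecture's 15 observations, and the small diagonal jitter is a standard numerical-stability assumption, not part of the slide.

t_obs  <- seq(0, 1, length.out = 15)
y_obs  <- sin(2 * pi * t_obs) + rnorm(15, sd = 0.1)   # made-up example data
t_pred <- seq(0, 1, length.out = 100)

S_obs  <- sq_exp_cov(abs(outer(t_obs,  t_obs,  "-")), 10, 10) + diag(1e-6, 15)
S_po   <- sq_exp_cov(abs(outer(t_pred, t_obs,  "-")), 10, 10)
S_pred <- sq_exp_cov(abs(outer(t_pred, t_pred, "-")), 10, 10)

cond_mean <- S_po %*% solve(S_obs, y_obs)             # Sigma_po Sigma_obs^-1 y
cond_cov  <- S_pred - S_po %*% solve(S_obs, t(S_po))  # Sigma_pred - Sigma_po Sigma_obs^-1 Sigma_op

# One conditional draw, via the Cholesky recipe from earlier:
y_draw <- rmvnorm_chol(1, cond_mean, cond_cov + diag(1e-6, 100))

Each call to rmvnorm_chol() here produces one curve like the draws shown on the following slides.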
Draw 1

[Figure: one draw from Y_pred | Y_obs, y vs. t]

Draw 2

[Figure: a second draw from Y_pred | Y_obs, y vs. t]

Draw 3

[Figure: a third draw from Y_pred | Y_obs, y vs. t]

Draw 4

[Figure: a fourth draw from Y_pred | Y_obs, y vs. t]

Draw 5

[Figure: a fifth draw from Y_pred | Y_obs, y vs. t]

Many draws later

[Figure: many draws from Y_pred | Y_obs overlaid, y vs. t]
Exponential Covariance

[Figure: draws from Y_pred | Y_obs under an exponential covariance, y vs. t]

Powered Exponential Covariance (p = 1.5)

[Figure: draws from Y_pred | Y_obs under a powered exponential covariance with p = 1.5, y vs. t]

Back to the squared exponential

[Figure: draws from Y_pred | Y_obs under the squared exponential covariance, y vs. t]
Changing the range (l)
[Figure: draws from Y_pred | Y_obs under the squared exponential covariance with sigma2 = 10 and l = 5, 7.5, 12.5, 15; y vs. t]
Effective Range
For the squared exponential covariance,

$$\text{Cov}(d) = \sigma^2 \exp\!\big(-(l \cdot d)^2\big) \qquad \text{Corr}(d) = \exp\!\big(-(l \cdot d)^2\big)$$

we would like to know, for a given value of $l$, how far apart must two observations be for their correlation to fall below 0.05?
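A quick worked step (not shown on the slide): setting the correlation equal to 0.05 and solving for $d$,

$$\exp\!\big(-(l \cdot d)^2\big) = 0.05 \;\Longrightarrow\; d = \frac{\sqrt{-\log 0.05}}{l} = \frac{\sqrt{\log 20}}{l} \approx \frac{\sqrt{3}}{l}$$

so, for example, $l = 10$ gives an effective range of roughly 0.17.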
Changing the scale (σ²)

[Figure: draws from Y_pred | Y_obs under the squared exponential covariance with sigma2 = 5, 15 and l = 5, 10; y vs. t]
Fitting
model{
  y ~ dmnorm(mu, inverse(Sigma))

  for (i in 1:N) {
    mu[i] <- 0
  }

  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j], 2))
      Sigma[j,i] <- Sigma[i,j]
    }
  }

  for (k in 1:N) {
    Sigma[k,k] <- sigma2 + 0.01
  }

  sigma2 ~ dlnorm(0, 1)
  l ~ dt(0, 2.5, 1) T(0,)   # Half-Cauchy(0, 2.5)
}
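A minimal sketch of how a model like this would be fit from R with rjags; the data objects (y, d, N) stand in for the lecture's observations, and the chain and iteration settings are assumptions, not the lecture's actual values.

library(rjags)

gp_model <- "model{ ... }"   # the JAGS model string shown above

m <- jags.model(textConnection(gp_model),
                data = list(y = y, d = d, N = length(y)),
                n.chains = 1)
update(m, n.iter = 5000)     # burn-in
samp <- coda.samples(m, variable.names = c("sigma2", "l"), n.iter = 10000)
summary(samp)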
Trace plots
[Figure: trace and density plots for l and sigma2 over iterations 6000-16000 (1000 samples)]

param    post_mean  post_med  post_lower  post_upper
l         5.981289  5.833655   4.2669795    8.456006
sigma2    2.457979  2.032632   0.8173064    7.168197
Fitted models
[Figure: fits from the posterior mean model (sigma2 = 2.32, l = 6.03) and posterior median model (sigma2 = 1.89, l = 5.86), y vs. t]
Forecasting

[Figure: forecasts from the posterior mean model (sigma2 = 2.32, l = 6.03) and posterior median model (sigma2 = 1.89, l = 5.86) over t from 0 to 1.5, y vs. t]