SLIDE 1
Tutorials on the Gaussian Random Process and its OR Applications
By
Juta Pichitlamken
Department of Industrial Engineering Kasetsart University, Bangkok juta.p@ku.ac.th http://pirun.ku.ac.th/∼fengjtp/
September 3, 2009
Operations Research Network of Thailand (OR-NET 2009)
SLIDE 2 Motivation: Dagstuhl Seminar
- Near Frankfurt, Germany. Web: http://www.dagstuhl.de/
- Dagstuhl seminars concentrate on computer science, but the one I attended was on
Sampling-based Optimization in the Presence of Uncertainty.
- Organized by Jürgen Branke (Universität Karlsruhe, Germany),
Barry Nelson (Northwestern University, US), Warren Powell (Princeton University, US), and Thomas J. Santner (Ohio State University, US).
- Small workshop of 20-30 people (invitation only) from different
fields (Statistics, Simulation, OR, Computer Sciences, Business Administration, Mathematics, ...).
- Despite this diversity, many people use Gaussian Random
Processes (GP) as modeling tools.
SLIDE 3 Gaussian Process (GP) Application: Spatial Statistics
- Model spatial distribution of environmental or socioeconomic data (e.g.,
geographical distributions of cases of type-A (H1N1) influenza, or data from Geographic Information System (GIS)) for statistical inference.
- Intuition: “Everything is related to everything else, but near things are more
related than distant things.” (Waldo Tobler)
- GP prediction (known as kriging) can be used to model this spatial
dependency, i.e., spatial data is viewed as a realization of a stochastic process.
- To build a model, the stochastic process is assumed stationary: the mean is
constant, and the covariance depends only on the distance between points.
Covariance is a measure of how much two variables change together:
Cov(X, Y) = E[(X − µX)(Y − µY)].
- Classic textbook: Cressie, N.A.C. 1993. Statistics for Spatial Data. Wiley.
Source: Câmara, G., A.M. Monteiro, S.D. Fucks, and M.S. Carvalho. 2004. Spatial Analysis and GIS Primer. Downloadable from www.dpi.inpe.br/gilberto/spatial analysis.html
SLIDE 4 GP Application: Metamodeling of Deterministic Responses
- In OR, a metamodel is a mathematical model of a set of related models
(Xavier 2004).
- Design and analysis of computer experiments: Data are generated from a
computer code (e.g., finite element models) whose responses are deterministic; computation may be time consuming; and large number of factors are involved.
- Given the training data {(xi, yi)}, i = 1, . . . , n, we want to find an approximation
(metamodel) to the computer code. These are called surrogates in the (deterministic) global optimization literature.
- Ex: a computational fluid dynamics (CFD) model to study the oil mist
separator system in the internal combustion engine (Satoh, Kawai, Ishikawa, and Matsuoka 2000).
- Computer models with multiple levels of fidelity for optimization: Huang,
Allen, Notz, and Miller (2006).
- Textbooks: Fang K.-T., R. Li, and A. Sudjianto. 2006. Design and
Modeling for Computer Experiments. Taylor and Francis. Santner, T.J., B.J. Williams, and W. I. Notz. 2003. The Design and Analysis of Computer Experiments. Springer-Verlag.
SLIDE 5 GP Application: Metamodeling of Probabilistic Responses Probabilistic responses (e.g., discrete-event simulation outputs). Approaches:
- 1. Consider the variability in the observed responses and fitting
errors simultaneously: Ankenman, Nelson, and Staum (2008).
- 2. Separate the trend in the data using least squares models.
Then, apply kriging models to model the residuals: van Beers and Kleijnen (2003, 2007).
SLIDE 6 GP Application: Machine learning
- Supervised learning: learn f: input → output from empirical data (the training data set).
– For continuous outputs (aka responses or dependent variables), f is called regression. Ex: in manufacturing systems, WIP = f(throughput).
– For discrete outputs, f is known as classification. Ex: classification of handwritten images into digits (0-9).
- Approach: Assume prior distributions (beliefs over the types of f we will
observe, e.g., mean and variance) to be GP. Once we get actual data, we
can reject f that do not agree with the data, i.e., find the GP parameters that best fit the data.
- Desirable properties of the prior, such as smoothness and stationarity, are controlled by the covariance function.
- Resources:
MacKay, D. 2002. Information Theory, Inference & Learning Algorithms. Cambridge University Press.
Rasmussen, C.E., and C.K.I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press. Also available online: www.gaussianprocess.org/gpml
SLIDE 7 Gaussian Process Regression Demo
Thanks to Rasmussen, C.E., and C.K.I. Williams. 2006. GPML Code.
Sample Y from a zero-mean Gaussian process with the exponential covariance function
Cov(Y(xp), Y(xq)) = σf² exp(−(xp − xq)²/(2ℓ²)) + σn² 1{xp = xq}.  (1)
Note:
- The covariance between outputs is a function of the inputs.
- If xp ≈ xq, then Corr(Y(xp), Y(xq)) ≈ 1.
- Hyperparameters are the length-scale ℓ = 1, signal variance σf² = 1, and noise variance σn² = 0.01.
(A small sampling sketch follows.)
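Not the GPML Matlab demo itself, but a minimal Python/numpy sketch of sampling one realization of Y on a grid of inputs from a zero-mean GP with covariance (1), using the hyperparameters above (σn² = 0.01, so σn = 0.1); names and the input grid are illustrative.

import numpy as np

def sq_exp_cov(x1, x2, ell=1.0, sigma_f=1.0, sigma_n=0.1):
    # sigma_f^2 * exp(-(x1 - x2)^2 / (2 ell^2)), plus sigma_n^2 on matching inputs
    d2 = (x1[:, None] - x2[None, :]) ** 2
    K = sigma_f**2 * np.exp(-d2 / (2.0 * ell**2))
    if x1 is x2:
        K = K + sigma_n**2 * np.eye(len(x1))   # noise term sigma_n^2 1{xp = xq}
    return K

x = np.linspace(-5.0, 5.0, 100)                # grid of inputs
K = sq_exp_cov(x, x)                           # covariance matrix from (1)
y = np.random.multivariate_normal(np.zeros(len(x)), K)   # one realization of Y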
SLIDE 8 Gaussian Process Regression Demo (Cont’d) Consider the following cases:
- 1. Fit a GP assuming zero mean and using the true covariance (1).
- 2. Shorten the length-scale to 0.3 (σf = 1.08, σn = 0.00005); the noise level is
much reduced.
- 3. Lengthen the length-scale to 3 (σf = 1.16, σn = 0.89); the noise level is
higher.
- 4. Assume the hyperparameters are unknown and estimated as the maximizer
of the posterior probability given the training data, using the same
covariance family as (1).
- 5. Assume the hyperparameters are unknown, using the Matérn covariance family.
Directory: D:\GP\gpml_matlab\gpml-demo
SLIDE 9 Definition of GP
- A probability distribution characterizes a random variable (scalar
or vector). A stochastic process governs the properties of
functions.
- GP is a collection of random variables, any finite number of
which have a joint Gaussian distribution.
- Suppose that X ⊆ Rd has positive d-dimensional volume. We
say that Y(x), for x ∈ X, is a Gaussian process if for any L ≥ 1 and any choice of x1, x2, . . . , xL, the vector (Y(x1), Y(x2), . . . , Y(xL)) has a multivariate normal distribution.
- Other examples of GP: Brownian motion processes and
Kalman filters.
SLIDE 10 Review of Joint, Marginal, and Conditional Probability
Suppose we partition y into two groups, yA and yB, so that the joint probability is p(y) = p(yA, yB).
Marginal probability of yA: p(yA) = ∫ p(yA, yB) dyB.
Conditional probability: p(yA|yB) = p(yA, yB) / p(yB), for p(yB) > 0.
Using the definitions of both p(yA|yB) and p(yB|yA), we get Bayes' theorem:
p(yA|yB) = p(yA) p(yB|yA) / p(yB).
In Bayesian terminology, this is posterior = likelihood × prior / marginal likelihood.
SLIDE 11 The Univariate Normal Distribution
Consider Z ∼ Normal with mean 0 and variance 1 (N(0, 1)). We can transform Z to have mean µ and variance σ²: Y = σZ + µ, where Y is also normal with density
p(y|µ, σ²) = (1/(√(2π) σ)) exp(−(y − µ)²/(2σ²)).
SLIDE 12 The Multivariate Normal Distribution (MVN)
Consider Zi ∼ N(0, 1), i = 1, 2, . . . , d, and group them into a random vector:
Z = (Z1, Z2, . . . , Zd)′ ∼ Nd(0, I).
Similar to the N(0, 1) case, we can transform Z to have mean vector µ and covariance matrix Σ:
Y = Σ^{1/2} Z + µ, where Σ^{1/2} Σ^{1/2} = Σ, a positive definite matrix.
Y is normal with density
p(y|µ, Σ) = (1/((2π)^{d/2} |Σ|^{1/2})) exp(−(1/2)(y − µ)′ Σ⁻¹ (y − µ)).
[Figure: bivariate normal densities with small |Σ| vs. large |Σ|.]
(A small sampling sketch follows.)
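A minimal Python sketch of the transformation Y = Σ^{1/2}Z + µ; here a Cholesky factor L with LL′ = Σ serves as the square root, and the values of µ and Σ are illustrative.

import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # symmetric, positive definite
L = np.linalg.cholesky(Sigma)           # L @ L.T == Sigma, one valid "square root"
Z = np.random.standard_normal((2, 10000))
Y = mu[:, None] + L @ Z                 # each column is one draw from N(mu, Sigma)
print(np.cov(Y))                        # sample covariance should be close to Sigma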
SLIDE 13 The Multivariate Normal Distribution (cont'd)
Suppose that W = (W1′, W2′)′ ∼ N((µ1′, µ2′)′, Σ), where Σ is partitioned into blocks Σ1,1, Σ1,2, Σ2,1, Σ2,2.
Then the marginal distribution of W1 is N(µ1, Σ1,1), and the conditional distribution is
[W1|W2] ∼ Nm( µ1 + Σ1,2 Σ2,2⁻¹ (W2 − µ2), Σ1,1 − Σ1,2 Σ2,2⁻¹ Σ2,1 ).  (2)
The product of two Gaussians gives another (unnormalized) Gaussian:
Nd(x|a, A) Nd(x|b, B) = Z⁻¹ Nd(x|c, C),
where c = C(A⁻¹a + B⁻¹b), C = (A⁻¹ + B⁻¹)⁻¹, and
Z⁻¹ = (2π)^{−d/2} |A + B|^{−1/2} exp(−(1/2)(a − b)′(A + B)⁻¹(a − b)).
(A sketch of the conditional computation (2) follows.)
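A minimal Python sketch of the conditional distribution (2); this is exactly the computation behind the GP predictors later on. The partition sizes and numbers below are illustrative.

import numpy as np

def mvn_conditional(mu1, mu2, S11, S12, S22, w2):
    # [W1 | W2 = w2] ~ N(mu1 + S12 S22^{-1}(w2 - mu2), S11 - S12 S22^{-1} S21)
    cond_mean = mu1 + S12 @ np.linalg.solve(S22, w2 - mu2)
    cond_cov = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return cond_mean, cond_cov

# Illustrative 1-1 partition of a bivariate normal with correlation 0.9:
mu1, mu2 = np.array([0.0]), np.array([0.0])
S11, S12, S22 = np.array([[1.0]]), np.array([[0.9]]), np.array([[1.0]])
m, C = mvn_conditional(mu1, mu2, S11, S12, S22, w2=np.array([2.0]))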
SLIDE 14 GP Characterization
Mostly, we consider strongly stationary GPs: (Y(x1), Y(x2), . . . , Y(xL)) has the same distribution as (Y(x1 + h), Y(x2 + h), . . . , Y(xL + h)), i.e., the process is invariant to translation. Therefore, the distribution of Y(x), its mean, and its covariance are the same for all x, and the covariance function must depend only on x1 − x2. Thus, we can characterize a GP by:
- 1. Mean function E[Y(x)]. Generally, we consider E[Y(x)] = 0.
- 2. Process variance C(0) and correlation function R(x1 − x2), or covariance function C(x1 − x2).
Because correlation functions dictate the behavior of the GP (e.g., continuity and differentiability), certain families of correlation functions are generally used. Specification of C(x1 − x2) implies a distribution over functions.
SLIDE 15 Correlation or Covariance Functions
- Recall the intuition about GP: “Points with inputs x which are
close are likely to have similar responses y.” The correlation or the covariance function defines this similarity.
- Covariance matrices Σ must be symmetric and positive
definite, i.e., x′Σx > 0 for all x ≠ 0.
- Correlation functions also imply the smoothness (and thus
differentiability) of predictors.
- Mostly, we consider stationary correlation functions, ones that
depend only on x − x′.
- Commonly used in the DACE community: power exponential and
Matérn.
SLIDE 16 GP Regression
Example: To forecast housing demand as a function of interest rates. Goal: Given training data {(xi, Yi)}, i = 1, . . . , n, find the relationship between x and Y.
Standard linear model: Y(x) = f(x) + ε, where f(x) = x′β, e.g., f(x) = β0 + β1x. Let X be the n × d design matrix. Assume ε ∼ N(0, σn²). We have the likelihood function
p(y|X, β) = ∏i=1,...,n p(yi|xi, β) ⇒ Y|X, β ∼ N(Xβ, σn² I).  (3)
Suppose, before we see the data, we believe that (i.e., the prior distribution)
β ∼ Nd(0, Σp).  (4)
Given the data, we want to know the distribution of β, i.e., the posterior distribution:
p(β|y, X) = p(y|X, β) p(β) / p(y|X).
The marginal likelihood p(y|X) is independent of β, so consider only (3) and (4):
p(β|X, y) ∝ exp(−(1/(2σn²))(y − Xβ)′(y − Xβ)) exp(−(1/2) β′Σp⁻¹β)
⇒ [β|X, y] ∼ Nd((1/σn²) A⁻¹X′y, A⁻¹), where A = (1/σn²) X′X + Σp⁻¹.
SLIDE 17 GP Regression Predictors
Because [β|X, y] ∼ Nd((1/σn²) A⁻¹X′y, A⁻¹), with A = (1/σn²) X′X + Σp⁻¹:
- In our usual multiple regression, Σp⁻¹ = 0 (an uninformative prior); β̂ becomes
(X′X)⁻¹X′y, the ordinary least squares estimate of β.
- In the Bayesian approach, the prediction of f(x0) = f0 is obtained by
averaging over the Gaussian posterior:
[f0|x0, X, y] ∼ N(x0′β̂, x0′A⁻¹x0).
Note that we get a predictive distribution, not just one predicted value:
f̂(x0) = x0′β̂ and Var(f̂(x0)) = x0′A⁻¹x0, so we can construct a
prediction interval. (A small numerical sketch follows.)
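A minimal Python sketch of the posterior of β and the predictive distribution of f(x0) from Slides 16-17; the design matrix, prior Σp, σn, and the test point x0 below are all illustrative.

import numpy as np

def bayes_linreg_posterior(X, y, Sigma_p, sigma_n):
    # [beta | X, y] ~ N(sigma_n^{-2} A^{-1} X'y, A^{-1}), A = sigma_n^{-2} X'X + Sigma_p^{-1}
    A = X.T @ X / sigma_n**2 + np.linalg.inv(Sigma_p)
    A_inv = np.linalg.inv(A)
    beta_hat = A_inv @ X.T @ y / sigma_n**2
    return beta_hat, A_inv

def predictive(x0, beta_hat, A_inv):
    # [f0 | x0, X, y] ~ N(x0' beta_hat, x0' A^{-1} x0)
    return x0 @ beta_hat, x0 @ A_inv @ x0

# Illustrative data with d = 2 (intercept and slope), n = 5:
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([0.9, 2.1, 2.9, 4.2, 5.1])
beta_hat, A_inv = bayes_linreg_posterior(X, y, Sigma_p=np.eye(2), sigma_n=0.5)
mean0, var0 = predictive(np.array([1.0, 2.5]), beta_hat, A_inv)
# A 95% prediction interval for f(x0) is mean0 +/- 1.96 * sqrt(var0).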
SLIDE 18
Generalization of GP Regression
Projections of inputs into feature space, e.g., a scalar x → φ(x) = (1, x, x², x³, . . .)′, give a polynomial regression. Consider the basis function φ(x): Rd → RN. We now have Y(x) = φ(x)′β + ε. To get the predictive distribution, the analysis is the same, but we replace
X ← Φ(X) (whose columns are φ(x) for x in the training set), and
x0 ← φ(x0).
SLIDE 19 Design and Analysis of Computer Experiments (DACE): Basics
Given the training data {(xi, Y(xi))}, i = 1, . . . , n, where Y(·) is deterministic, predict Y(x0) ≡ Y0. Let Yn = (Y(x1), Y(x2), . . . , Y(xn))′ and (Y0, Yn) ∼ F.
- A predictor Ŷ0 = Ŷ(x0) is unbiased if EF[Ŷ(x0)] = EF[Y0].
- Mean squared prediction error: MSPE(Ŷ0, F) = EF[(Ŷ0 − Y0)²].
- Theorem 3.2.1 in Santner et al. (2003):
E[Y0|Yn] is the best MSPE predictor (minimum MSPE) of Y0.
SLIDE 20 DACE: Predicting Outputs from Computer Experiments
Consider another regression model: Y(xi) = φ(xi)′β + Z(xi), 0 ≤ i ≤ n, where φ(·) are known regression functions, β is a p × 1 vector, and Z(x) is a GP with zero mean and covariance Cov(Z(xi), Z(xj)) = σZ² R(xi − xj).
Then (Y0, Yn′)′ is multivariate normal with mean (φ(x0)′β, (Φβ)′)′ and covariance σZ² [1, r0′; r0, R],
where Φij = φj(xi), r0 = (R(x0 − x1), . . . , R(x0 − xn))′, and Rij = R(xi − xj). From Theorem 3.2.1 and the result on the conditional distribution of the MVN (2):
Ŷ0 = E[Y0|Yn] = φ(x0)′β + r0′R⁻¹(Yn − Φβ).
- If a prior on β is given, then
Ŷ0 = φ(x0)′E[β|Yn] + r0′R⁻¹(Yn − ΦE[β|Yn]);
e.g., if [β] ∼ uniform (an uninformative prior), then
[β|Yn] ∼ Np((Φ′R⁻¹Φ)⁻¹Φ′R⁻¹Yn, σZ²(Φ′R⁻¹Φ)⁻¹).  (5)
SLIDE 21 DACE: Prediction When the Correlation Function is Known
We consider the properties of this predictor,
Ŷ0 = φ(x0)′β̂ + r0′R⁻¹(Yn − Φβ̂),  (6)
where, from (5), β̂ = (Φ′R⁻¹Φ)⁻¹Φ′R⁻¹Yn:
- 1. It interpolates the training data: if x0 = xi for some i, 1 ≤ i ≤ n, then Ŷ0 = Yi.
- 2. It is a linear unbiased predictor (LUP) of Y(x0).
- 3. The behavior (e.g., smoothness) of Ŷ(x0) depends on x0 and R(x0 − xi).
(A code sketch of (5) and (6) follows.)
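A minimal Python sketch of the predictor (6) with β̂ computed as in (5); Φ, R, r0, and Yn follow the notation above and are assumed to be supplied by the user.

import numpy as np

def kriging_predict(phi0, Phi, R, r0, Yn):
    # beta_hat = (Phi' R^{-1} Phi)^{-1} Phi' R^{-1} Yn, from (5)
    R_inv_Yn = np.linalg.solve(R, Yn)
    R_inv_Phi = np.linalg.solve(R, Phi)
    beta_hat = np.linalg.solve(Phi.T @ R_inv_Phi, Phi.T @ R_inv_Yn)
    # Y0_hat = phi(x0)' beta_hat + r0' R^{-1} (Yn - Phi beta_hat), from (6)
    return phi0 @ beta_hat + r0 @ np.linalg.solve(R, Yn - Phi @ beta_hat)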
SLIDE 22 DACE: Prediction When the Correlation Function is Unknown
We use correlation estimates (r̂0 and R̂) in (6) to get the empirical BLUP (EBLUP) predictor
Ŷ0 = φ(x0)′β̂ + r̂0′R̂⁻¹(Yn − Φβ̂),
where β̂ = (Φ′R̂⁻¹Φ)⁻¹Φ′R̂⁻¹Yn.
We assume the form (i.e., family) of the correlation function, but we still need to estimate its parameters, say ψ. For example, the exponential correlation function
R(h, ψ) = exp(−Σj=1,...,d |hj/θj|^pj)
has 2d parameters, ψ = (θ1, . . . , θd, p1, . . . , pd). (A small sketch of this correlation family follows.)
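A minimal Python sketch of this exponential correlation family for a single pair of inputs; the ψ values below are illustrative, and pj = 2 gives the Gaussian correlation.

import numpy as np

def power_exp_corr(x1, x2, theta, p):
    # R(h, psi) = exp(-sum_j |h_j / theta_j|^{p_j}), h = x1 - x2, psi = (theta, p)
    h = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-np.sum(np.abs(h / theta) ** p)))

# Illustrative psi for d = 2: theta = (1.0, 0.5), p = (2.0, 2.0)
r = power_exp_corr([0.0, 0.0], [0.3, 0.1],
                   theta=np.array([1.0, 0.5]), p=np.array([2.0, 2.0]))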
SLIDE 23 DACE: Estimating Parameters for Correlation Functions
To estimate ψ (except for the cross-validation method), we assume
[Yn | β, σZ², ψ] ∼ Nn(Φβ, σZ² R).  (7)
- 1. Maximum likelihood EBLUPs: Most popular. Under assumption (7), we
can write a log likelihood function ℓ(β, σZ², ψ), from which the MLEs of β and
σZ² are determined. These β̂ and σ̂Z² are substituted back into ℓ(ψ).
- 2. Restricted maximum likelihood EBLUPs: To get rid of the unknown β,
transform Yn → W ≡ CYn, so that its mean is a zero vector. Under assumption (7), we have the likelihood of W.
- 3. Cross-validation EBLUPs: Let Ŷ−i(ψ) be the predictor when the ith training
point is left out. Find ψ that minimizes MSPE = Σi=1,...,n (Ŷ−i(ψ) − y(xi))².
- 4. Posterior mode EBLUPs: E[Y(x0)|Yn, ψ̂], where ψ̂ maximizes
[ψ|Yn] ∝ [Yn|ψ][ψ]. But finding the right choice of prior for ψ is non-trivial.
From the likelihood or MSPE expressions, we can do a steepest descent search, which uses gradients, to find the optimal ψ. Santner et al. (2003) recommend MLE-EBLUP or REML-EBLUP based on the power exponential R(·). Other considerations: the choice of training sites and the size of the training data. (A sketch of the MLE approach follows.)
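A minimal Python sketch of option 1 (MLE-EBLUP) under assumption (7): fix pj = 2 (Gaussian correlation), profile out β and σZ², and minimize the negative concentrated log likelihood over θ with a derivative-free search rather than the steepest descent mentioned above. X, Yn, and Φ are assumed to be the training design, responses, and regression matrix.

import numpy as np
from scipy.optimize import minimize

def neg_concentrated_loglik(log_theta, X, Yn, Phi):
    # Negative log likelihood of (7) with beta and sigma_Z^2 profiled out (p_j = 2)
    theta = np.exp(log_theta)                               # keep length-scales positive
    diff = X[:, None, :] - X[None, :, :]
    R = np.exp(-np.sum(np.abs(diff / theta) ** 2, axis=2))  # correlation matrix R(psi)
    R = R + 1e-10 * np.eye(len(X))                          # numerical jitter
    R_inv_Phi = np.linalg.solve(R, Phi)
    beta_hat = np.linalg.solve(Phi.T @ R_inv_Phi, Phi.T @ np.linalg.solve(R, Yn))
    resid = Yn - Phi @ beta_hat
    sigma2_hat = resid @ np.linalg.solve(R, resid) / len(Yn)
    _, logdet = np.linalg.slogdet(R)
    return 0.5 * (len(Yn) * np.log(sigma2_hat) + logdet)

# Usage (X, Yn, Phi supplied by the user):
# theta_hat = np.exp(minimize(neg_concentrated_loglik, x0=np.zeros(X.shape[1]),
#                             args=(X, Yn, Phi), method="Nelder-Mead").x)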
SLIDE 24
Stochastic Kriging
Standard kriging considers deterministic responses; a typical model is Y(x) = φ(x)′β + Z(x). In stochastic kriging, responses or observations have noise; thus, Ankenman, Nelson, and Staum (2008) propose the following model:
Yj(x) = φ(x)′β + Z(x) + εj(x),
where Yj(x) is the jth observation at x, and Z(x) and ε(x) are independent. Two types of uncertainty: extrinsic uncertainty Z(x) and intrinsic uncertainty ε(x).
Training data: (xi, {Y1(xi), . . . , Yni(xi)}), for training site i, i = 1, 2, . . . , n.
SLIDE 25 Stochastic Kriging Results
For the case φ(xi)′β = β0 (common in DACE), with β0, ΣZ, and Σε known, the MSE-optimal predictor is
Ŷ(x0) = β0 + ΣZ(x0, ·)′ [ΣZ + Σε]⁻¹ (Ȳ − β0 1n),  (8)
where Ȳ = (Ȳ(x1), Ȳ(x2), . . . , Ȳ(xn))′ and 1n is the n × 1 vector of ones.
Insights:
- The intrinsic uncertainty impacts the prediction everywhere.
- The choice of training sites and their sample sizes influence the MSE.
For the case that Z(x) ∼ GP(0, ΣZ), the εj(xi) are i.i.d. N(0, V(xi)), independent of εh(x) for h ≠ j, and
Σε = Diag(V(x1)/n1, V(x2)/n2, . . . , V(xn)/nn),
the predictor (8) with Σε ← Σ̂ε is unbiased. (A small sketch of (8) follows.)
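A minimal Python sketch of the stochastic kriging predictor (8), assuming β0, ΣZ, ΣZ(x0, ·), and Σε (or their estimates) are already available; in practice V(xi) is estimated from the ni replications at each training site.

import numpy as np

def stochastic_kriging_predict(beta0, SigmaZ_x0, SigmaZ, Sigma_eps, Ybar):
    # Y_hat(x0) = beta0 + SigmaZ(x0,.)' [SigmaZ + Sigma_eps]^{-1} (Ybar - beta0 * 1)
    rhs = Ybar - beta0 * np.ones(len(Ybar))
    return beta0 + SigmaZ_x0 @ np.linalg.solve(SigmaZ + Sigma_eps, rhs)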
SLIDE 26 GP Classification
- Ex: Classification of handwritten digits into 0-9 (10 classes).
Model images with tensors (x is a multi-dimensional array). Many real-world applications, e.g., classification of satellite images or images from medical scans.
- Focus on probabilistic classification: predictions take the form
of class probabilities.
- Output y is discrete (class labels), so the Gaussian likelihood
p(y|x) is not appropriate.
- Approach for two-class classification (+1 and −1): similar to
GP regression, but model a latent function and transform it to get p(y = +1|x) (with range [0, 1]) by using
– the logistic function λ(z) = 1/(1 + exp(−z)) → logistic regression, or
– the standard normal cdf Φ(z) → probit regression.
(A small sketch of both links follows.)
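A minimal Python sketch of the two link ("squashing") functions; both map a latent GP value z to a probability in [0, 1].

import numpy as np
from scipy.stats import norm

def logistic(z):
    # lambda(z) = 1 / (1 + exp(-z)): logistic (logit) link
    return 1.0 / (1.0 + np.exp(-z))

def probit(z):
    # Phi(z), the standard normal cdf: probit link
    return norm.cdf(z)

print(logistic(0.0), probit(0.0))   # both give 0.5 at z = 0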
SLIDE 27 Resources
- Comprehensive listing: www.gaussianprocess.org
- MATLAB code:
– Rasmussen, C.E., and C.K.I. Williams. 2006. Gaussian Processes for Machine Learning. GPML Code: www.gaussianprocess.org/gpml/code/matlab/doc/
– Lophaven, S.N., H.B. Nielsen, and J. Søndergaard. 2002. DACE: A Matlab Kriging Toolbox. www2.imm.dtu.dk/∼hbn/dace/
- In other languages (e.g., C, C++, or Octave, a Matlab-equivalent
freeware): www.gaussianprocess.org and the PeRK toolbox for DACE: www.stat.osu.edu/∼comp exp/book.html
SLIDE 28
References