SLIDE 1

Tutorials on the Gaussian Random Process and its OR Applications

By

Juta Pichitlamken

Department of Industrial Engineering Kasetsart University, Bangkok juta.p@ku.ac.th http://pirun.ku.ac.th/∼fengjtp/

September 3, 2009

Operations Research Network of Thailand (OR-NET 2009)

SLIDE 2

Motivation: Dagstuhl Seminar

  • Near Frankfurt, Germany. Web: http://www.dagstuhl.de/
  • Concentrates on computer science, but the seminar I attended was Sampling-based Optimization in the Presence of Uncertainty.
  • Organized by Jürgen Branke (Universität Karlsruhe, Germany), Barry Nelson (Northwestern University, US), Warren Powell (Princeton University, US), and Thomas J. Santner (Ohio State University, US).
  • Small workshop of 20-30 people (invitation only) from different fields (Statistics, Simulation, OR, Computer Science, Business Administration, Mathematics, ...).
  • Despite this diversity, many participants use Gaussian Random Processes (GPs) as modeling tools.

SLIDE 3

Gaussian Process (GP) Application: Spatial Statistics

  • Model the spatial distribution of environmental or socioeconomic data (e.g., geographical distributions of cases of type-A (H1N1) influenza, or data from a Geographic Information System (GIS)) for statistical inference.
  • Intuition: "Everything is related to everything else, but near things are more related than distant things." (Waldo Tobler)
  • GP prediction (known as kriging) can be used to model this spatial dependency, i.e., spatial data are viewed as a realization of a stochastic process.
  • To build a model, the stochastic process is assumed stationary: the mean is constant, and the covariance depends only on the distance. Covariance is a measure of how much two variables change together:

    Cov(X, Y) = E[(X − µX)(Y − µY)].

  • Classic textbook: Cressie, N.A.C. 1993. Statistics for Spatial Data. Wiley.

Source: Câmara, G., A.M. Monteiro, S.D. Fucks, and M.S. Carvalho. 2004. Spatial Analysis and GIS Primer. Downloadable from www.dpi.inpe.br/gilberto/spatial analysis.html

SLIDE 4

GP Application: Metamodeling of Deterministic Responses

  • In OR, a metamodel is a mathematical model of a set of related models (Xavier 2004).
  • Design and analysis of computer experiments: data are generated from a computer code (e.g., finite element models) whose responses are deterministic; computation may be time-consuming; and a large number of factors may be involved.
  • Given the training data {(xi, yi)}, i = 1, . . . , n, we want to find an approximation (metamodel) to the computer code. These are called surrogates in the (deterministic) global optimization literature.
  • Ex: a computational fluid dynamics (CFD) model to study the oil mist separator system in an internal combustion engine (Satoh, Kawai, Ishikawa, and Matsuoka 2000).
  • Computer models with multiple levels of fidelity for optimization: Huang, Allen, Notz, and Miller (2006).
  • Textbooks: Fang, K.-T., R. Li, and A. Sudjianto. 2006. Design and Modeling for Computer Experiments. Taylor and Francis. Santner, T.J., B.J. Williams, and W.I. Notz. 2003. The Design and Analysis of Computer Experiments. Springer-Verlag.

SLIDE 5

GP Application: Metamodeling of Probabilistic Responses

Probabilistic responses (e.g., discrete-event simulation outputs). Approaches:

  1. Consider the variability in the observed responses and the fitting errors simultaneously: Ankenman, Nelson, and Staum (2008).
  2. Separate out the trend in the data using least-squares models, then apply kriging models to the residuals: van Beers and Kleijnen (2003, 2007).

SLIDE 6

GP Application: Machine Learning

  • Supervised learning: learn f : input → output from empirical data (a training data set).
    – For continuous outputs (aka responses or dependent variables), f is called regression. Ex: in manufacturing systems, WIP = f(throughput or production rate).
    – For discrete outputs, f is known as classification. Ex: classification of handwritten images into digits (0-9).
  • Approach: Assume the prior distribution (beliefs over the types of f we will observe, e.g., mean and variance) to be a GP. Once we get actual data, we can reject the f that do not agree with the data, i.e., find the GP parameters that best fit the data.
  • Desirable properties of the prior: smooth and stationary ← covariance function.
  • Resources: MacKay, D. 2002. Information Theory, Inference & Learning Algorithms. Cambridge University Press. Rasmussen, C.E., and C.K.I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press. Also available online: www.gaussianprocess.org/gpml

SLIDE 7

Gaussian Process Regression Demo

Thanks to Rasmussen, C.E., and C.K.I. Williams. 2006. GPML Code. Sample Y from a zero-mean GP with squared-exponential covariance function:

    Cov(Y(xp), Y(xq)) = σf² exp(−(xp − xq)²/(2ℓ²)) + σn² 1{xp = xq}.    (1)

Note:

  • The covariance between outputs is a function of the inputs.
  • If xp ≈ xq, then Corr(Y(xp), Y(xq)) ≈ 1.
  • Hyperparameters are the length-scale ℓ = 1, signal variance σf² = 1, and noise variance σn² = 0.01.
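The demo itself runs on the GPML MATLAB code; as a language-neutral illustration only, here is a minimal NumPy sketch of sampling one path of Y from covariance (1). The grid x, the seed, and the function name are my own choices:

```python
import numpy as np

def sq_exp_cov(x, ell=1.0, sigma_f=1.0, sigma_n=0.1):
    """Squared-exponential covariance (1) with a noise term on the diagonal."""
    d = x[:, None] - x[None, :]                    # pairwise differences x_p - x_q
    K = sigma_f**2 * np.exp(-0.5 * (d / ell)**2)   # signal covariance
    K += sigma_n**2 * np.eye(len(x))               # sigma_n^2 * 1{x_p = x_q}
    return K

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 100)
K = sq_exp_cov(x, ell=1.0, sigma_f=1.0, sigma_n=0.1)   # sigma_n^2 = 0.01
y = rng.multivariate_normal(np.zeros(len(x)), K)       # one sample path of Y
```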

SLIDE 8

Gaussian Process Regression Demo (Cont'd)

Consider the following cases:

  1. Fit a GP assuming zero mean and using the true covariance (1).
  2. Shorten the length-scale to 0.3 (σf = 1.08, σn = 0.00005); the fitted noise level is much reduced.
  3. Lengthen the length-scale to 3 (σf = 1.16, σn = 0.89); the fitted noise level is higher.
  4. Assume the hyperparameters are unknown and estimate them as the maximizer of the posterior probability given the training data, using the same covariance family as (1).
  5. Assume the hyperparameters are unknown, using the Matérn covariance family.

Directory: D:\GP\gpml_matlab\gpml-demo

SLIDE 9

Definition of GP

  • A probability distribution characterizes a random variable (scalar or vector). A stochastic process governs the properties of functions.
  • A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution.
  • Suppose that X ⊆ Rd has positive d-dimensional volume. We say that Y(x), for x ∈ X, is a Gaussian process if for any L ≥ 1 and any choice of x1, x2, . . . , xL, the vector (Y(x1), Y(x2), . . . , Y(xL)) has a multivariate normal distribution.
  • Other examples of GPs: Brownian motion processes and Kalman filters.

SLIDE 10

Review of Joint, Marginal, and Conditional Probability

Suppose we partition y into two groups, yA and yB, so that the joint probability is p(y) = p(yA, yB).

Marginal probability of yA: p(yA) = ∫ p(yA, yB) dyB.

Conditional probability: p(yA|yB) = p(yA, yB)/p(yB), for p(yB) > 0.

Using the definitions of both p(yA|yB) and p(yB|yA), we get Bayes' theorem:

    p(yA|yB) = p(yA) p(yB|yA) / p(yB).

In Bayesian terminology, this is posterior = likelihood × prior / marginal likelihood.

SLIDE 11

The Univariate Normal Distribution

Consider Z ∼ Normal with mean 0 and variance 1 (N(0, 1)). We can transform Z to have mean µ and variance σ²: Y = σZ + µ, where Y is also normal, with density

    p(y|µ, σ²) = (1/(√(2π) σ)) exp(−(y − µ)²/(2σ²)).
SLIDE 12

The Multivariate Normal Distribution (MVN)

Consider Zi ∼ N(0, 1), i = 1, 2, . . . , d, and group them into a random vector:

    Z = (Z1, Z2, . . . , Zd)′ ∼ Nd(0, I).

Similar to the N(0, 1) case, we can transform Z to have mean vector µ and covariance matrix Σ:

    Y = Σ^(1/2) Z + µ, where Σ^(1/2) Σ^(1/2) = Σ,

a positive definite matrix. Y is normal with

    p(y|µ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2)(y − µ)′ Σ^(−1) (y − µ)).

[Figure: density contours for small |Σ| vs. large |Σ|]
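A minimal NumPy sketch of this transformation, using the lower-triangular Cholesky factor L with LL′ = Σ as one valid square root; the particular µ and Σ are made-up examples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])             # example mean vector
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])         # example symmetric positive definite covariance

L = np.linalg.cholesky(Sigma)          # L @ L.T == Sigma
z = rng.standard_normal(2)             # Z ~ N_d(0, I)
y = L @ z + mu                         # Y = Sigma^(1/2) Z + mu ~ N(mu, Sigma)
```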

SLIDE 13

The Multivariate Normal Distribution (cont'd)

Suppose that (semicolons separate block rows)

    (W1; W2) ∼ Nm+n((µ1; µ2), [Σ1,1, Σ1,2; Σ2,1, Σ2,2]).

Then the marginal distribution of W1 is N(µ1, Σ1,1), and the conditional distribution is

    [W1|W2] ∼ Nm(µ1 + Σ1,2 Σ2,2^(−1) (W2 − µ2), Σ1,1 − Σ1,2 Σ2,2^(−1) Σ2,1).    (2)

The product of two Gaussians gives another (unnormalized) Gaussian:

    Nd(x|a, A) Nd(x|b, B) = Z^(−1) Nd(x|c, C),

where c = C(A^(−1)a + B^(−1)b), C = (A^(−1) + B^(−1))^(−1), and

    Z^(−1) = (2π)^(−d/2) |A + B|^(−1/2) exp(−(1/2)(a − b)′(A + B)^(−1)(a − b)).
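A minimal NumPy sketch of the conditional distribution (2); the function name is mine, and Σ2,1 = Σ1,2′ by symmetry:

```python
import numpy as np

def conditional_mvn(mu1, mu2, S11, S12, S22, w2):
    """Mean and covariance of [W1 | W2 = w2], per equation (2)."""
    mean = mu1 + S12 @ np.linalg.solve(S22, w2 - mu2)   # mu1 + S12 S22^(-1) (w2 - mu2)
    cov = S11 - S12 @ np.linalg.solve(S22, S12.T)       # S11 - S12 S22^(-1) S21
    return mean, cov
```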
SLIDE 14

GP Characterization

Mostly we consider strongly stationary GPs:

    (Y(x1), Y(x2), . . . , Y(xL)) ∼ (Y(x1 + h), Y(x2 + h), . . . , Y(xL + h)),

i.e., the process is invariant to translation. Therefore the distribution of Y(x), its mean, and its covariance matrix are the same for all x, and the covariance function must depend only on x1 − x2. Thus we can characterize a GP by:

  1. Mean function E[Y(x)]. Generally, we consider E[Y(x)] = 0.
  2. Process variance C(0) and correlation function R(x1 − x2), or covariance function C(x1 − x2).

Because correlation functions dictate the behavior of the GP (e.g., continuity and differentiability), certain families of correlation functions are generally used. Specification of C(x1 − x2) implies a distribution over functions.

SLIDE 15

Correlation or Covariance Functions

  • Recall the intuition about GPs: "Points with inputs x which are close are likely to have similar responses y." The correlation or covariance function defines this similarity.
  • Covariance matrices Σ must be symmetric and positive semidefinite, i.e., x′Σx ≥ 0 for all x.
  • Correlation functions also imply the smoothness (and thus differentiability) of predictors.
  • Mostly, we consider stationary correlation functions, ones that depend only on x − x′.
  • Commonly used in the DACE community: power exponential and Matérn.

SLIDE 16

GP Regression

Example: forecasting housing demand as a function of interest rates.
Goal: Given training data {(xi, Yi)}, i = 1, . . . , n, find the relationship between x and Y.

Standard linear model: Y(x) = f(x) + ε, where f(x) = x′β, e.g., f(x) = β0 + β1x. Let X be an n × d design matrix, and assume ε ∼ N(0, σn²). We have the likelihood function

    p(y|X, β) = ∏(i=1 to n) p(yi|xi, β)  ⇒  Y|X, β ∼ N(Xβ, σn² I).    (3)

Suppose, before we see the data, we believe that (the prior distribution)

    β ∼ Nd(0, Σp).    (4)

Given the data, we want to know the distribution of β, i.e., the posterior distribution:

    p(β|y, X) = p(y|X, β) p(β) / p(y|X).

The marginal likelihood p(y|X) is independent of β, so consider only (3) and (4):

    p(β|X, y) ∝ exp(−(1/(2σn²))(y − Xβ)′(y − Xβ)) exp(−(1/2) β′Σp^(−1)β)

    [β|X, y] ∼ Nd((1/σn²) A^(−1)X′y, A^(−1)), where A = (1/σn²) X′X + Σp^(−1).
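A minimal NumPy sketch of this posterior computation (the function name is mine):

```python
import numpy as np

def posterior_beta(X, y, sigma_n, Sigma_p):
    """Posterior mean and covariance of beta: [beta|X, y] ~ N_d(mean, A^(-1))."""
    A = X.T @ X / sigma_n**2 + np.linalg.inv(Sigma_p)   # A = (1/sigma_n^2) X'X + Sigma_p^(-1)
    cov = np.linalg.inv(A)                              # posterior covariance A^(-1)
    mean = cov @ X.T @ y / sigma_n**2                   # (1/sigma_n^2) A^(-1) X'y
    return mean, cov
```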

SLIDE 17

GP Regression Predictors

Because [β|X, y] ∼ Nd((1/σn²) A^(−1)X′y, A^(−1)), where A = (1/σn²) X′X + Σp^(−1):

  • In our usual multiple regression, Σp^(−1) = 0, and the posterior mean becomes β̂ = (X′X)^(−1)X′y, the ordinary least squares estimate of β.
  • In the Bayesian approach, the prediction of f(x0) ≡ f0 is obtained by averaging over all possible β under the Gaussian posterior:

    p(f0|x0, X, y) = ∫ p(f0|x0, β) p(β|X, y) dβ

    [f0|x0, X, y] ∼ N(x0′β̄, x0′A^(−1)x0),

where β̄ = (1/σn²) A^(−1)X′y is the posterior mean. Note that we get a predictive distribution, not just one predicted value: f̂(x0) = x0′β̄ and Var(f̂(x0)) = x0′A^(−1)x0, so we can construct a prediction interval.
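Continuing the sketch from the previous slide, the predictive mean and variance at a new input x0 (a hedged illustration reusing the hypothetical posterior_beta above):

```python
def predict_f0(x0, X, y, sigma_n, Sigma_p):
    """Predictive distribution [f0|x0, X, y] ~ N(x0' beta_bar, x0' A^(-1) x0).
    Assumes posterior_beta from the previous sketch is in scope."""
    mean_beta, cov_beta = posterior_beta(X, y, sigma_n, Sigma_p)
    f0_mean = x0 @ mean_beta      # x0' * posterior mean of beta
    f0_var = x0 @ cov_beta @ x0   # x0' A^(-1) x0
    return f0_mean, f0_var        # 95% interval: f0_mean +/- 1.96 * sqrt(f0_var)
```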

SLIDE 18

Generalization of GP Regression

Project inputs into a feature space, e.g., a scalar x → φ(x) = (1, x, x², x³, . . .)′, so that we have a polynomial regression. Consider the basis function φ(x): Rd → RN. We now have Y(x) = φ(x)′β + ε. To get the predictive distribution, the analysis is the same, but we replace

    X ← Φ(X)    (columns of φ(x) for x in the training set)
    x0 ← φ(x0).
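For instance, a hypothetical cubic feature map in NumPy; the earlier posterior_beta sketch can then be called with the mapped design matrix:

```python
import numpy as np

def phi(x, degree=3):
    """Polynomial feature map: scalar x -> (1, x, x^2, ..., x^degree)'."""
    return np.array([x**k for k in range(degree + 1)])

x_train = np.array([0.1, 0.5, 0.9])
Phi = np.vstack([phi(x) for x in x_train])   # one row phi(x_i)' per training point
# posterior_beta(Phi, y, sigma_n, Sigma_p) now yields polynomial regression.
```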

SLIDE 19

Design and Analysis of Computer Experiments (DACE): Basics

Given the training data {(xi, Y(xi))}, i = 1, . . . , n, where Y(·) is deterministic, predict Y(x0) ≡ Y0. Let Yn = (Y(x1), Y(x2), . . . , Y(xn))′ and (Y0, Yn) ∼ F.

  • Unbiased predictor: EF[Ŷ(x0)] = EF[Y0].
  • Mean squared prediction error: MSPE(Ŷ0, F) = EF[(Ŷ0 − Y0)²].
  • Theorem 3.2.1 in Santner et al. (2003): E[Y0|Yn] is the best MSPE predictor (minimum MSPE) of Y0.

SLIDE 20

DACE: Predicting Outputs from Computer Experiments

Consider another regression model: Y(xi) = φ(xi)′β + Z(xi), 0 ≤ i ≤ n, where φ(·) are known regression functions, β is a p × 1 vector, and Z(x) is a GP with zero mean and covariance

    Cov(Z(xi), Z(xj)) = σZ² R(xi − xj).

Then (semicolons separate block rows)

    (Y0; Yn) ∼ N1+n((φ(x0)′; Φ) β, σZ² [1, r0′; r0, R]),

where Φij = φj(xi), r0 = (R(x0 − x1), . . . , R(x0 − xn))′, and Rij = R(xi − xj). From Theorem 3.2.1 and the result on the conditional distribution of an MVN (2):

  • If β is given, then

    Ŷ0 = E[Y0|Yn] = φ(x0)′β + r0′R^(−1)(Yn − Φβ).

  • If a prior on β is given, then

    Ŷ0 = φ(x0)′E[β|Yn] + r0′R^(−1)(Yn − Φ E[β|Yn]);

e.g., if [β] ∼ uniform (an uninformative prior), then

    [β|Yn] ∼ Np((Φ′R^(−1)Φ)^(−1)Φ′R^(−1)Yn, σZ² (Φ′R^(−1)Φ)^(−1)).    (5)

SLIDE 21

DACE: Prediction When the Correlation Function is Known

We consider the properties of this predictor,

    Ŷ0 = φ(x0)′β̂ + r0′R^(−1)(Yn − Φβ̂),    (6)

where, from (5), β̂ = (Φ′R^(−1)Φ)^(−1)Φ′R^(−1)Yn:

  1. It interpolates the training data: if x0 = xi for some i, 1 ≤ i ≤ n, then Ŷ0 = Yi.
  2. It is the best linear unbiased predictor (BLUP) of Y(x0).
  3. The behavior (e.g., smoothness) of Ŷ(x0) depends on x0 and R(x0 − xi).
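A minimal NumPy sketch of (5) and (6); linear solves replace explicit inverses, and the function name is my own:

```python
import numpy as np

def kriging_predict(phi0, Phi, r0, R, Yn):
    """GLS estimate (5) of beta and the kriging predictor (6) at x0."""
    Rinv_Phi = np.linalg.solve(R, Phi)
    Rinv_Yn = np.linalg.solve(R, Yn)
    beta_hat = np.linalg.solve(Phi.T @ Rinv_Phi, Phi.T @ Rinv_Yn)   # (Phi'R^(-1)Phi)^(-1) Phi'R^(-1)Yn
    resid = Yn - Phi @ beta_hat
    return phi0 @ beta_hat + r0 @ np.linalg.solve(R, resid)         # phi(x0)'b + r0'R^(-1)(Yn - Phi b)
```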

SLIDE 22

DACE: Prediction When the Correlation Function is Unknown

We use correlation estimates (r̂0 and R̂) in (6) to get the empirical BLUP (EBLUP) predictor:

    Ŷ0 = φ(x0)′β̂ + r̂0′R̂^(−1)(Yn − Φβ̂),

where β̂ = (Φ′R̂^(−1)Φ)^(−1)Φ′R̂^(−1)Yn.

We assume the form of the correlation function (i.e., its family), but we still need to estimate its parameters, say ψ. For example, the power exponential correlation function

    R(h, ψ) = exp(−∑(j=1 to d) |hj/θj|^pj)

has 2d parameters, ψ = (θ1, . . . , θd, p1, . . . , pd).
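A minimal NumPy sketch of this correlation function and the resulting correlation matrix (names and training sites are arbitrary choices of mine):

```python
import numpy as np

def power_exp_corr(h, theta, p):
    """R(h, psi) = exp(-sum_j |h_j / theta_j|^p_j), with psi = (theta, p)."""
    h, theta, p = map(np.atleast_1d, (h, theta, p))
    return np.exp(-np.sum(np.abs(h / theta) ** p))

x = np.array([0.0, 0.4, 1.0])                      # three 1-d training sites
R = np.array([[power_exp_corr(xi - xj, 0.5, 2.0)   # R_ij = R(x_i - x_j)
               for xj in x] for xi in x])
```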

SLIDE 23

DACE: Estimating Parameters for Correlation Functions

To estimate ψ (except for the cross-validation method), we assume

    [Yn | β, σZ², ψ] ∼ N(Φβ, σZ² R).    (7)

  1. Maximum likelihood EBLUPs: the most popular. Under assumption (7), we can write a log-likelihood function ℓ(β, σZ², ψ), from which the MLEs of β and σZ² are determined. These β̂ and σ̂Z² are substituted back to give ℓ(ψ). (A sketch follows this list.)
  2. Restricted maximum likelihood (REML) EBLUPs: to get rid of the unknown β, transform Yn → W ≡ CYn so that its mean is a zero vector. Under assumption (7), we have the likelihood of W.
  3. Cross-validation EBLUPs: let Ŷ−i(ψ) be the predictor when the ith training point is left out. Find the ψ that minimizes MSPE = ∑(i=1 to n) (Ŷ−i(ψ) − y(xi))².
  4. Posterior mode EBLUPs: E[Y(x0)|Yn, ψ̂], where ψ̂ maximizes [ψ|Yn] ∝ [Yn|ψ][ψ]. But finding the right choice of prior for ψ is non-trivial.

From the likelihood or MSPE expressions, we can run a gradient-based search (e.g., steepest descent) to find the optimal ψ. Santner et al. (2003) recommend MLE-EBLUP or REML-EBLUP based on the power exponential R(·).

Other considerations: the choice of training sites and the size of the training data.
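A minimal NumPy sketch of the profile log-likelihood idea in item 1, assuming model (7); ψ enters through a correlation function such as power_exp_corr above, and minimizing this over ψ gives the MLE-EBLUP parameters:

```python
import numpy as np

def profile_neg_log_lik(psi, x, Yn, Phi, corr):
    """-log likelihood of psi (up to a constant), with beta and sigma_Z^2
    replaced by their MLEs for the given psi."""
    n = len(Yn)
    R = np.array([[corr(xi - xj, *psi) for xj in x] for xi in x])
    Rinv_Phi = np.linalg.solve(R, Phi)
    beta = np.linalg.solve(Phi.T @ Rinv_Phi, Phi.T @ np.linalg.solve(R, Yn))
    resid = Yn - Phi @ beta
    sigma2 = resid @ np.linalg.solve(R, resid) / n   # MLE of sigma_Z^2 given psi
    _, logdetR = np.linalg.slogdet(R)
    return 0.5 * (n * np.log(sigma2) + logdetR)
```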

SLIDE 24

Stochastic Kriging

Standard kriging considers deterministic responses, and a typical model is Y(x) = φ(x)′β + Z(x). In stochastic kriging, responses or observations have noise; thus, Ankenman, Nelson, and Staum (2008) propose the following model:

    Yj(x) = φ(x)′β + Z(x) + εj(x),

where Yj(x) is the jth observation at x, and Z(x) and εj(x) are independent. There are two types of uncertainty: extrinsic uncertainty Z(x) and intrinsic uncertainty εj(x).

Training data: (xi, {Y1(xi), . . . , Yni(xi)}) for training site i, i = 1, 2, . . . , n.

SLIDE 25

Stochastic Kriging Results

For the case φ(xi)′β = β0 (common in DACE), with β0, ΣZ, and Σε known, the MSE-optimal predictor is

    Ŷ(x0) = β0 + ΣZ(x0, ·)′ [ΣZ + Σε]^(−1) (Ȳ − β0 1n),    (8)

where Ȳ = (Ȳ(x1), Ȳ(x2), . . . , Ȳ(xn))′ and 1n is the n × 1 vector of ones.

Insights:

  • The intrinsic uncertainty impacts the prediction everywhere.
  • The choice of training sites and their sample sizes influences the MSE.

For the case that Z(x) ∼ GP(0, ΣZ), the εj(xi) are i.i.d. N(0, V(xi)) and independent across sites, and

    Σ̂ε = Diag(V̂(x1)/n1, V̂(x2)/n2, . . . , V̂(xn)/nn),

the predictor (8) with Σε ← Σ̂ε is unbiased.
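A minimal NumPy sketch of predictor (8); the function name is mine, and Σ̂ε is formed from sample variances and replication counts at each site:

```python
import numpy as np

def stoch_kriging_predict(beta0, SigZ_x0, SigZ, Sig_eps, Ybar):
    """Y_hat(x0) = beta0 + SigZ(x0,.)' [SigZ + Sig_eps]^(-1) (Ybar - beta0 * 1), per (8)."""
    w = np.linalg.solve(SigZ + Sig_eps, Ybar - beta0 * np.ones(len(Ybar)))
    return beta0 + SigZ_x0 @ w

# Intrinsic covariance estimated from n_i replications at each site:
# Sig_eps_hat = np.diag(sample_var / n_reps)
```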

SLIDE 26

GP Classification

  • Ex: classification of handwritten digits into 0-9 (10 classes). Model images with tensors (x is a multi-dimensional array). Many real-world applications, e.g., classification of satellite images or images from medical scans.
  • Focus on probabilistic classification: predictions take the form of class probabilities.
  • The output y is discrete (class labels), so the Gaussian likelihood p(y|x) is not appropriate.
  • Approach for two-class classification (+1 and −1): similar to GP regression, but model a latent function and transform it to get p(y = +1|x) (whose range is [0, 1]) by using
    – the logistic function λ(z) = 1/(1 + exp(−z)) → logistic regression, or
    – the standard normal cdf Φ(z) → probit regression.
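A minimal sketch of the two squashing functions (using SciPy's normal cdf) that map a latent value to a class probability:

```python
import numpy as np
from scipy.stats import norm

def logistic(z):
    """lambda(z) = 1 / (1 + exp(-z)): the logistic link."""
    return 1.0 / (1.0 + np.exp(-z))

def probit(z):
    """Phi(z), the standard normal cdf: the probit link."""
    return norm.cdf(z)

# Both map a latent f(x) in R to p(y = +1 | x) in [0, 1].
```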

SLIDE 27

Resources

  • Comprehensive listing: www.gaussianprocess.org
  • MATLAB code:
    – Rasmussen, C.E., and C.K.I. Williams. 2006. Gaussian Processes for Machine Learning. GPML code: www.gaussianprocess.org/gpml/code/matlab/doc/
    – Lophaven, S.N., H.B. Nielsen, and J. Søndergaard. 2002. DACE: A Matlab Kriging Toolbox. www2.imm.dtu.dk/∼hbn/dace/
  • In other languages (e.g., C, C++, or Octave, a Matlab-equivalent freeware): www.gaussianprocess.org, and the PeRK toolbox for DACE: www.stat.osu.edu/∼comp exp/book.html

SLIDE 28

References