A Sequential Split-Conquer-Combine Approach for Gaussian Process Modeling in Computer Experiments - PowerPoint PPT Presentation




SLIDE 1

A Sequential Split-Conquer-Combine Approach for Gaussian Process Modeling in Computer Experiments

Chengrui Li

Department of Statistics and Biostatistics, Rutgers University
Joint work with Ying Hung and Min-ge Xie
2017 QPRC, June 13, 2017

SLIDE 2

Outline

  • Introduction
  • A Unified Framework with Theoretical Support
  • Simulation Study
  • Real Data Example
  • Summary

SLIDE 3

Introduction

SLIDE 4

Motivating example: Data center thermal management

  • A data center is an integrated facility housing multiple-unit servers, providing application services or management for data processing.
  • Goal: Design a data center with an efficient heat removal mechanism.
  • Computational Fluid Dynamics (CFD) simulation (n = 26820, p = 9)

Figure 1: Heat map for IBM T. J. Watson Data Center



SLIDE 6

Gaussian process model

  • Gaussian process (GP) model: y = Xβ + Z(x)
  • y: n × 1 vector of observations (e.g., room temperatures)
  • X: n × p design matrix
  • β: p × 1 vector of unknown parameters
  • Z(x): a GP with mean 0 and covariance σ²Σ(θ)
  • Σ(θ): n × n correlation matrix with correlation parameters θ

The ijth element of Σ is defined by the power exponential correlation function

  corr(Z(xi), Z(xj)) = exp(−θ⊤|xi − xj|) = exp(−∑_{k=1}^p θk |xik − xjk|).

  • Remark: σ is assumed known for simplicity in this talk.
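As a concrete illustration, the correlation matrix above can be computed in a few lines; a minimal numpy sketch (the helper name power_exp_corr is an assumption, not from the talk):

```python
import numpy as np

def power_exp_corr(X, theta):
    """corr(Z(xi), Z(xj)) = exp(-sum_k theta_k |x_ik - x_jk|)."""
    diff = np.abs(X[:, None, :] - X[None, :, :])           # pairwise |x_ik - x_jk|, shape (n, n, p)
    return np.exp(-diff @ np.asarray(theta, dtype=float))  # theta-weighted L1 distance, then exp

X = np.array([[0.0, 0.0],
              [0.5, 0.0],
              [1.0, 1.0]])
Sigma = power_exp_corr(X, theta=[1.0, 2.0])
# Diagonal entries are exactly 1; off-diagonals decay with weighted distance.
```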


SLIDE 8

Estimation and prediction

  • Likelihood inference:

    l(β, θ, σ) = −(1/(2σ²))(y − Xβ)⊤Σ⁻¹(θ)(y − Xβ) − (1/2) log|Σ(θ)| − (n/2) log σ²

    So,
  • β̂|θ = arg max_β {l(β|θ, σ²)} = (X⊤Σ⁻¹(θ)X)⁻¹X⊤Σ⁻¹(θ)y
  • θ̂|β̂, σ² = arg max_θ {l(θ|β̂, σ²)}
  • The GP prediction y0 at a new point x0, given parameters (β, θ), follows a normal distribution with mean p0(β, θ) and variance m0(β, θ), where

    p0(β, θ) = x0⊤β + γ(θ)⊤Σ⁻¹(θ)(y − Xβ)
    m0(β, θ) = σ²(1 − γ(θ)⊤Σ⁻¹(θ)γ(θ)),

    and γ(θ) is an n × 1 vector whose ith element equals φ(‖xi − x0‖; θ).
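For fixed θ, the β̂ formula above is a generalized least squares fit. A minimal sketch on simulated toy data (all names and simulation settings are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 2
Xd = rng.uniform(size=(n, p))                    # design matrix (no intercept, for brevity)
beta_true = np.array([2.0, -1.0])

# Power exponential correlation on the inputs.
dist = np.abs(Xd[:, None, :] - Xd[None, :, :])
Sigma = np.exp(-dist @ np.array([5.0, 5.0]))

# Simulate y = X beta + Z with Z ~ N(0, Sigma) (sigma^2 = 1); jitter for stability.
L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(n))
y = Xd @ beta_true + L @ rng.standard_normal(n)

# beta_hat | theta = (X' Sigma^-1 X)^-1 X' Sigma^-1 y  (the GLS form from the slide)
Si_X = np.linalg.solve(Sigma, Xd)
Si_y = np.linalg.solve(Sigma, y)
beta_hat = np.linalg.solve(Xd.T @ Si_X, Xd.T @ Si_y)
```

Using `solve` rather than forming Σ⁻¹ explicitly is the standard numerically stable way to evaluate these expressions.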


SLIDE 10

Two challenges in GP modeling

Computational issue:

  • Estimation and prediction involve Σ⁻¹ and |Σ|, with cost of order O(n³)
  • Not feasible when n is large

Uncertainty quantification of the GP predictor:

  • The plug-in predictive distribution is widely used
  • It underestimates the uncertainty

SLIDE 11

Existing methods

For the computational issue:

  • Change the model to one that is computationally convenient: Rue and Held (2005), Cressie and Johannesson (2008)
  • Approximate the likelihood function: Stein et al. (2004), Furrer et al. (2006), Fuentes (2007), Kaufman et al. (2008)
  • These approaches do not focus on uncertainty quantification and can introduce additional uncertainty

For uncertainty quantification of the GP predictor:

  • Bayesian predictive distribution
  • Bootstrap approach (Luna and Young 2003)
  • Both require intensive computation

SLIDE 12

Solve both problems by a unified framework?

  • Yes!


SLIDE 13

A Unified Framework


SLIDE 15

Introduction to confidence distribution (CD)

Statistical inference (parameter estimation):

  • Point estimate
  • Interval estimate
  • Distribution estimate

Example: X1, ..., Xn i.i.d. N(µ, 1)

  • Point estimate: x̄n = (1/n) ∑_{i=1}^n xi
  • Interval estimate: (x̄n − 1.96/√n, x̄n + 1.96/√n)
  • Distribution estimate: N(x̄n, 1/n)

The idea of the CD approach is to use a sample-dependent distribution (or density) function to estimate the parameter of interest.

  • Wide range of examples: bootstrap distributions, (normalized) likelihood functions, p-value functions, fiducial distributions, some informative priors and Bayesian posteriors, among others (Xie and Singh 2013)
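The three kinds of estimates in the N(µ, 1) example can be computed directly; a small sketch assuming numpy/scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu_true, n = 3.0, 400
x = rng.normal(mu_true, 1.0, size=n)

xbar = x.mean()                                            # point estimate
ci = (xbar - 1.96 / np.sqrt(n), xbar + 1.96 / np.sqrt(n))  # 95% interval estimate
cd = stats.norm(loc=xbar, scale=1.0 / np.sqrt(n))          # distribution estimate N(xbar, 1/n)

# The CD carries both of the others: its median is the point estimate
# and its 2.5%/97.5% quantiles recover the interval estimate.
ci_from_cd = (cd.ppf(0.025), cd.ppf(0.975))
```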

SLIDE 16

Overview: Sequential Split-Conquer-Combine

Figure 2: The Sequential Split-Conquer-Combine approach. The data are split into subsets D1, ..., Dm; in Step a, subset Da is conquered to produce an estimator λ̂a; the individual estimators λ̂1, ..., λ̂m are then combined into λ̂c.


SLIDE 20

Ingredients

  • Split the entire dataset into (correlated) subsets, based on a compactly supported correlation assumption in one dimension
  • Perform a sequential update to create independent subsets, and estimate on each updated subset
  • Combine the estimators
  • Quantify the prediction uncertainty

SLIDE 21

Split

Split the entire dataset into subsets y = {ya}, a = 1, ..., m. Denote the size of ya by na, so that ∑a na = n.

  • Assumption: compactly supported correlation, so that (after sorting the indices according to the X1 values) Σt is block tri-diagonal:

    ⎡ Σ11   Σ12    O     ···        O    ⎤
    ⎢ Σ21   Σ22   Σ23    ···        O    ⎥
    ⎢  ⋮     ⋱     ⋱      ⋱         ⋮    ⎥
    ⎢  O    ···   Σ(m−1)(m−1)   Σ(m−1)m  ⎥
    ⎣  O    ···    O    Σm(m−1)    Σmm   ⎦  (n × n)
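To see why compact support produces exact zero blocks after sorting, here is a toy 1-D sketch; the truncated exponential below is only a stand-in for a genuine compactly supported correlation function (naive truncation does not in general preserve positive definiteness):

```python
import numpy as np

def truncated_exp_corr(x, theta, tau):
    """exp(-theta |xi - xj|), set to exactly 0 beyond the range tau."""
    d = np.abs(x[:, None] - x[None, :])
    return np.where(d < tau, np.exp(-theta * d), 0.0)

x = np.sort(np.random.default_rng(1).uniform(size=30))  # sort by the 1-D input (the X1 values)
Sigma_t = truncated_exp_corr(x, theta=2.0, tau=0.15)
# Entries for points more than tau apart are exactly zero, so after sorting,
# blocks of sufficiently separated points vanish and Sigma_t becomes banded.
```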

SLIDE 22

Sequentially update data

Transform y to y* by sequentially updating

  y*a = ya − La(a−1) y*(a−1),

where L(a+1)a = Σt,(a+1)a Da⁻¹ and Da = Σaa − La(a−1) D(a−1) La(a−1)⊤.

  • The sequential updates are computationally efficient.
  • The updated blocks y*a are independent.

SLIDE 23

Estimation from each subset

Given θ:

  • MLE from the ath subset: β̂a = arg max_β l_t^(a)(β|θ) = (Ca⊤Da⁻¹Ca)⁻¹Ca⊤Da⁻¹y*a
  • An individual CD for the ath updated subset is (cf. Xie and Singh 2013): Np(β̂a, Cov(β̂a))

Given β:

  • MLE from the ath subset: θ̂a = arg max_θ l_t^(a)(θ|β)
  • An individual CD for the ath updated subset is N(θ̂a, Cov(θ̂a))

Significant computational reduction: each Da is much smaller than the original covariance matrix.


SLIDE 25

CD combining

  • Following Singh, Xie and Strawderman (2005), Liu, Liu and Xie (2014), and Yang et al. (2014), a combined CD is Np(β̂c, Sc), where β̂c = (∑a Wa)⁻¹(∑a Wa β̂a) with Wa = Ca⊤Da⁻¹Ca and Sc = Cov(β̂c).
  • A similar framework can be applied to all the parameters (β, θ).

Theorem 1. Under some regularity assumptions, when τ > Op(n^{1/2}) and n → ∞, the SSCC estimator λ̂c = (β̂c, θ̂c) is asymptotically as efficient as the MLE λ̂mle = (β̂mle, θ̂mle).
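The combining step can be sketched as a precision-weighted average (hypothetical helper; the weights Wa stand in for Ca⊤Da⁻¹Ca):

```python
import numpy as np

def combine_cd(beta_hats, W):
    """Combine per-subset CDs N_p(beta_a, W_a^{-1}) with precision weights W_a:
       beta_c = (sum_a W_a)^{-1} (sum_a W_a beta_a),  S_c = (sum_a W_a)^{-1}."""
    W_sum = np.sum(W, axis=0)
    Wb = np.sum([Wa @ ba for Wa, ba in zip(W, beta_hats)], axis=0)
    beta_c = np.linalg.solve(W_sum, Wb)
    S_c = np.linalg.inv(W_sum)
    return beta_c, S_c

# Toy check with two subsets: the more precise subset (W1) dominates.
W1, W2 = 4.0 * np.eye(2), 1.0 * np.eye(2)
b1, b2 = np.array([1.0, 0.0]), np.array([2.0, 1.0])
beta_c, S_c = combine_cd([b1, b2], [W1, W2])
```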


SLIDE 27

GP predictive distribution

  • The GP predictor at a new point x0, given parameters (β, θ), follows a normal distribution with mean p0(β, θ) and variance m0(β, θ), where

    p0(β, θ) = x0⊤β + γ(θ)⊤Σ⁻¹(θ)(y − Xβ)
    m0(β, θ) = σ²(1 − γ(θ)⊤Σ⁻¹(θ)γ(θ)),

    and γ(θ) is an n × 1 vector whose ith element equals φ(‖xi − x0‖; θ).

  • Drawbacks:
    • The computational issue arises again
    • The conventional plug-in predictive distribution underestimates the uncertainty


SLIDE 29

Quantify GP prediction uncertainty using SSCC and CD

An easy-to-compute GP predictor:

  p1(β, θ) = x0⊤β + ∑_{a=1}^m γ*a(θ)⊤Da⁻¹(θ)y*a − ∑_{a=1}^m γ*a(θ)⊤Da⁻¹(θ)Ca(θ)β

  m1(β, θ) = σ²(1 − ∑_{a=1}^m γ*a(θ)⊤Da⁻¹(θ)γ*a(θ))

A better way to quantify uncertainty: use a CD-based predictive distribution function

  Q(y0; y) = ∫_{λ∈Θ} Gλ(y0) dHc(λ; y),

where Gλ(·) is the CDF of N(p1(λ), m1(λ)) and Hc(λ; y) is the joint CD of λ = (β, θ) obtained from the SSCC approach.
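The mixture Q(y0; y) can be approximated by Monte Carlo: draw λ from the combined CD Hc, evaluate the plug-in normal CDF Gλ at each draw, and average. A sketch with hypothetical one-dimensional stand-ins for p1 and m1:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical scalar setup: combined CD H_c for lambda is N(0.5, 0.1^2),
# and p1, m1 are stand-ins for the easy-to-compute predictor mean/variance.
lam_draws = rng.normal(0.5, 0.1, size=20000)      # draws from H_c
p1 = 2.0 + lam_draws                              # predictive means at the draws
m1 = 1.0 + lam_draws**2                           # predictive variances (kept positive)

def Q(y0):
    """Q(y0; y) ~= average over the lambda-draws of G_lambda(y0)."""
    return float(np.mean(stats.norm.cdf(y0, loc=p1, scale=np.sqrt(m1))))
```

Averaging over the CD draws widens the predictive distribution relative to plugging in a single point estimate, which is exactly the underestimation the plug-in approach suffers from.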

SLIDE 30

Theoretical support

Theorem 2. Under some regularity assumptions, when τ > Op(n^{1/2}) and n → ∞,

  p1(β, θ) → p0(β, θ),  m1(β, θ) → m0(β, θ).

  • The easy-to-compute GP predictor is asymptotically equivalent to the original predictor.
  • Compared with the plug-in predictive distribution, the predictive confidence distribution has a smaller average Kullback-Leibler distance to the true predictive distribution (Shen, Liu, and Xie 2017+).

SLIDE 31

Simulation Study

SLIDE 32

Simulation study: setup

Suppose we have the GP model y = Xβ + Z(X), with

  • Sample sizes: n = 1000, 1500, 2000; four covariates (p = 4); x ∈ [0, 1]⁴
  • True parameter values: β = (2, 3, 1, 2, 1.5), θ = (15, 1.5, 2, 3), and σ² = 1
  • The ijth element of Σ(θ) computed by σij = exp(−∑_{k=1}^p θk |xik − xjk|)
  • Number of splits: m = 5
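A sketch of this simulation setup (with a smaller n so the example runs quickly; the model-fitting step is omitted):

```python
import numpy as np

rng = np.random.default_rng(2017)
n, p = 200, 4                                    # smaller n than the study, to keep the sketch fast
theta = np.array([15.0, 1.5, 2.0, 3.0])
beta = np.array([2.0, 3.0, 1.0, 2.0, 1.5])       # intercept + 4 slopes
sigma2 = 1.0

x = rng.uniform(size=(n, p))                     # x in [0, 1]^4
X = np.column_stack([np.ones(n), x])             # design matrix with intercept

# sigma_ij = exp(-sum_k theta_k |x_ik - x_jk|)
d = np.abs(x[:, None, :] - x[None, :, :])
Sigma = np.exp(-d @ theta)

L = np.linalg.cholesky(Sigma + 1e-8 * np.eye(n)) # jitter for numerical stability
y = X @ beta + np.sqrt(sigma2) * (L @ rng.standard_normal(n))
```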

SLIDE 33

Simulation study: analysis results

         n = 1000                              n = 1500                              n = 2000
      MLE          Compact      SSCC        MLE          Compact      SSCC        MLE          Compact      SSCC
β̂0   1.96(0.33)   1.96(0.33)   1.96(0.33)   2.00(0.30)   2.00(0.30)   2.00(0.30)   2.08(0.23)   2.08(0.23)   2.08(0.23)
β̂1   3.03(0.43)   3.02(0.43)   3.02(0.43)   3.01(0.36)   3.01(0.36)   3.01(0.36)   2.90(0.28)   2.90(0.28)   2.90(0.28)
β̂2   1.00(0.23)   1.00(0.23)   1.00(0.23)   0.99(0.21)   0.99(0.21)   0.99(0.21)   1.04(0.20)   1.04(0.20)   1.04(0.20)
β̂3   2.04(0.24)   2.04(0.24)   2.04(0.24)   1.97(0.25)   1.97(0.25)   1.97(0.25)   1.98(0.24)   1.98(0.24)   1.98(0.24)
β̂4   1.53(0.25)   1.53(0.25)   1.53(0.25)   1.52(0.28)   1.52(0.28)   1.52(0.28)   1.49(0.18)   1.49(0.18)   1.49(0.18)
θ̂1  14.72(0.42)  14.72(0.42)  14.80(0.43)  14.65(0.39)  14.65(0.39)  14.65(0.49)  14.69(0.45)  14.74(0.45)  14.79(0.46)
θ̂2   1.49(0.06)   1.49(0.06)   1.51(0.06)   1.49(0.06)   1.49(0.06)   1.50(0.10)   1.49(0.06)   1.49(0.06)   1.49(0.10)
θ̂3   2.00(0.06)   2.00(0.06)   2.01(0.06)   1.98(0.06)   1.98(0.06)   2.00(0.06)   1.99(0.07)   1.99(0.07)   2.00(0.10)
θ̂4   3.01(0.06)   3.01(0.06)   2.99(0.06)   3.00(0.06)   3.00(0.06)   3.00(0.08)   3.01(0.06)   3.01(0.06)   3.00(0.07)
CT  32.46(1.29)  30.61(1.34)   4.54(0.11)  99.66(3.83)  95.90(5.44)  10.39(0.53) 227.63(6.96) 222.18(9.33)  20.32(0.95)

Table 1: Simulation results (mean with standard deviation in parentheses) for n = 1000, 1500, 2000 over 100 replications; CT is computing time in seconds.

SLIDE 34

Simulation study: computing time

Figure: computing time (seconds) versus sample size (n = 1000, 1500, 2000) for the MLE, compact, and SSCC methods.

Remark: the computational complexity is reduced from O(n³) to O(∑a na³), where na is the size of the ath subset and ∑a na = n.

SLIDE 35

Real data example

SLIDE 36

Real data example: IBM data center thermal management

             n = 1800                                n = 3600      n = 26820
Variable   MLE           Compact       SSCC
β1     -8.28(0.10)  -8.29(0.10)  -8.29(0.10)       8.05(0.09)    7.39(0.08)
θ1      0.84(0.01)   0.84(0.01)   0.86(0.01)       0.85(0.01)    0.86(0.01)
β2     -9.00(0.10)  -8.99(0.10)  -8.99(0.10)      10.14(0.09)    9.09(0.08)
θ2      0.76(0.01)   0.76(0.01)   0.79(0.01)       0.77(0.01)    0.76(0.01)
β3     -6.44(0.10)  -6.45(0.10)  -6.45(0.10)       7.08(0.09)    6.59(0.09)
θ3      1.20(0.01)   1.20(0.01)   1.19(0.01)       1.14(0.01)    1.13(0.01)
β4     -5.42(0.11)  -5.41(0.11)  -5.41(0.11)       6.52(0.10)    5.86(0.10)
θ4      1.90(0.01)   1.90(0.01)   1.80(0.01)       1.70(0.01)    1.83(0.01)
β5     -0.08(0.13)  -0.07(0.13)  -0.07(0.13)       0.68(0.13)    0.29(0.13)
θ5      3.50(0.01)   3.50(0.01)   3.40(0.01)       3.39(0.01)    3.50(0.01)
β6     -1.98(0.10)  -1.97(0.10)  -1.97(0.10)       2.12(0.10)    1.79(0.09)
θ6      1.29(0.01)   1.29(0.01)   1.20(0.01)       1.24(0.01)    1.28(0.01)
β7     -3.39(0.06)  -3.41(0.06)  -3.41(0.06)       4.04(0.04)    2.99(0.03)
θ7      0.20(0.01)   0.20(0.01)   0.17(0.01)       0.14(0.01)    0.15(0.01)
β8      2.80(0.08)   2.80(0.08)   2.80(0.08)       1.72(0.07)    0.04(0.06)
θ8      0.60(0.01)   0.60(0.01)   0.50(0.01)       0.62(0.01)    0.50(0.01)
β9     22.33(0.18)  22.35(0.18)  22.35(0.18)      24.75(0.18)   23.90(0.18)
θ9     21.90(0.03)  21.90(0.03)  21.45(0.03)      21.61(0.03)   21.10(0.03)
CT (s)  2768.70      2753.91       55.07           372.67       26257.2

Table 2: Mean, standard deviation, and computing time of the estimates by the MLE, compact, and SSCC methods with subsample sizes n = 1800, 3600 and the entire CFD data (n = 26820); one set of estimates is reported for each of the two larger sample sizes.

SLIDE 37

Real data example: quantification of the prediction uncertainty

Figure 3: Comparison of the CD (SSCC) predictive density and the plug-in predictive density at height = 0 when n = 1800.

SLIDE 38

Summary

SLIDE 39

Summary

  • Introduced a unified framework for GP modeling that dramatically reduces computation and also provides better quantification of the prediction uncertainty:
    • A Sequential Split-Conquer-Combine (SSCC) procedure
    • CD combining
    • A predictive confidence distribution
  • Provides asymptotically equivalent estimation
  • Reduces computational complexity from O(n³) to O(∑a na³), where na is the size of the ath subset and ∑a na = n

SLIDE 40

Thank You!


SLIDE 41

A General Framework of CD Combining

  • Univariate CD combining (Singh, Xie and Strawderman 2005): suppose H1(θ), ..., Hk(θ) are CDs from k studies.
  • Take a continuous function gc(u1, ..., uk): [0, 1]^k → R¹ that is monotonic in each coordinate, and define Gc(t) = P(gc(U1, ..., Uk) ≤ t), where U1, ..., Uk are i.i.d. U(0, 1) random variables.
  • Combined CD function: Hc(θ) = Gc{gc(H1(θ), ..., Hk(θ))} is a CD function for the parameter θ.
  • Multivariate CD combining: Liu, Liu and Xie (2014) and Yang et al. (2014).
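A common concrete choice is gc(u1, ..., uk) = Φ⁻¹(u1) + ... + Φ⁻¹(uk), for which Gc is the CDF of N(0, k). A sketch for combining k normal CDs (illustrative helper name):

```python
import numpy as np
from scipy import stats

def combine_normal_cds(mus, ss):
    """Combine k normal CDs H_i(theta) = Phi((theta - mu_i)/s_i) using
    g_c(u_1, ..., u_k) = sum_i Phi^{-1}(u_i), so G_c is the CDF of N(0, k)."""
    mus, ss = np.asarray(mus, float), np.asarray(ss, float)
    k = len(mus)
    def Hc(theta):
        z = np.sum((theta - mus) / ss)            # g_c(H_1(theta), ..., H_k(theta))
        return float(stats.norm.cdf(z / np.sqrt(k)))
    return Hc

Hc = combine_normal_cds([0.9, 1.1], [0.2, 0.2])
# With equal s_i, the combined CD is centered at the average of the mu_i.
```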

SLIDE 42

Introduction to computer experiments

  • If either
    (i) the components of the process of interest and their interactions are adequately understood, so that the process can be simulated, or
    (ii) the physics of the process is sufficiently well understood to be described by a mathematical model relating the response to the potential factors that affect the outputs,
    then the computer code/simulation can serve as a proxy for the physical process.
  • Features of computer experiments:
    (i) deterministic outputs;
    (ii) traditional principles used in designing physical experiments (e.g., randomization, blocking, replication) are irrelevant;
    (iii) high-dimensional inputs and time-consuming runs.