

1. A Sequential Split-Conquer-Combine Approach for Gaussian Process Modeling in Computer Experiments. Chengrui Li, Department of Statistics and Biostatistics, Rutgers University. Joint work with Ying Hung and Min-ge Xie. 2017 QPRC, June 13, 2017.

2. Outline • Introduction • A Unified Framework with Theoretical Support • Simulation Study • Real Data Example • Summary

3. Introduction

4. Motivating example: data center thermal management • A data center is an integrated facility housing multiple-unit servers and providing application services or management for data processing. • Goal: design a data center with an efficient heat-removal mechanism. • Computational Fluid Dynamics (CFD) simulation ($n = 26820$, $p = 9$). Figure 1: Heat map for the IBM T. J. Watson Data Center.


5. Gaussian process model • Gaussian process (GP) model: $y = X\beta + Z(x)$ • $y$: $n \times 1$ vector of observations (e.g., room temperatures) • $X$: $n \times p$ design matrix • $\beta$: $p \times 1$ vector of unknown parameters • $Z(x)$: a GP with mean $0$ and covariance $\sigma^2 \Sigma(\theta)$ • $\Sigma(\theta)$: $n \times n$ correlation matrix with correlation parameters $\theta$, whose $ij$th element is given by the power exponential function $\mathrm{corr}(Z(x_i), Z(x_j)) = \exp(-\theta^\top |x_i - x_j|) = \exp\big(-\sum_{k=1}^{p} \theta_k |x_{ik} - x_{jk}|\big)$ • Remark: $\sigma$ is assumed known for simplicity in this talk.
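As an illustration of this correlation structure, the following is a minimal sketch in Python/NumPy (my own, not from the talk) of the power exponential correlation matrix; the function name and array layout are assumptions.

```python
import numpy as np

def power_exp_corr(X, theta):
    """Correlation matrix with entries exp(-sum_k theta_k |x_ik - x_jk|)."""
    X = np.asarray(X, dtype=float)                  # n x p input locations
    theta = np.asarray(theta, dtype=float)          # length-p, positive
    diff = np.abs(X[:, None, :] - X[None, :, :])    # pairwise |x_ik - x_jk|, shape (n, n, p)
    return np.exp(-(diff * theta).sum(axis=2))      # n x n correlation matrix
```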


6. Estimation and prediction • Likelihood inference: $l(\beta, \theta, \sigma) = -\frac{1}{2\sigma^2}(y - X\beta)^\top \Sigma^{-1}(\theta)(y - X\beta) - \frac{1}{2}\log|\Sigma(\theta)| - \frac{n}{2}\log(\sigma^2)$. So $\hat{\beta}|\theta = \arg\max_{\beta}\{l(\beta \mid \theta, \sigma^2)\} = (X^\top \Sigma^{-1}(\theta) X)^{-1} X^\top \Sigma^{-1}(\theta) y$ and $(\hat{\theta}, \hat{\sigma}^2) \mid \hat{\beta} = \arg\max_{\theta, \sigma^2}\{l(\theta, \sigma^2 \mid \hat{\beta})\}$. • GP prediction, say $y_0$ at a new point $x_0$, given parameters $(\beta, \theta)$, follows a normal distribution with mean $p_0(\beta, \theta)$ and variance $m_0(\beta, \theta)$, where $p_0(\beta, \theta) = x_0^\top \beta + \gamma(\theta)^\top \Sigma^{-1}(\theta)(y - X\beta)$, $m_0(\beta, \theta) = \sigma^2(1 - \gamma(\theta)^\top \Sigma^{-1}(\theta)\gamma(\theta))$, and $\gamma(\theta)$ is an $n \times 1$ vector whose $i$th element equals $\phi(\|x_i - x_0\|; \theta)$.
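A minimal sketch, assuming $\theta$ (hence $\Sigma(\theta)$) is already given, of the closed-form estimate $\hat{\beta}|\theta$ above; it avoids forming $\Sigma^{-1}$ explicitly via a Cholesky solve, but the factorization still carries the $O(n^3)$ cost discussed on the next slide.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def beta_given_theta(X, y, Sigma):
    """GLS estimate (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y."""
    c = cho_factor(Sigma)                 # O(n^3) Cholesky factorization: the bottleneck
    Si_X = cho_solve(c, X)                # Sigma^{-1} X
    Si_y = cho_solve(c, y)                # Sigma^{-1} y
    return np.linalg.solve(X.T @ Si_X, X.T @ Si_y)
```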


7. Two challenges in GP modeling • Computational issue: estimation and prediction involve $\Sigma^{-1}$ and $|\Sigma|$, whose computation is of order $O(n^3)$ and is not feasible when $n$ is large. • Uncertainty quantification of the GP predictor: the plug-in predictive distribution is widely used, but it underestimates the uncertainty.

8. Existing methods • For the computational issue: change the model to one that is computationally convenient (Rue and Held 2005; Cressie and Johannesson 2008), or approximate the likelihood function (Stein et al. 2004; Furrer et al. 2006; Fuentes 2007; Kaufman et al. 2008). These methods do not focus on uncertainty quantification and introduce additional uncertainty. • For uncertainty quantification of the GP predictor: the Bayesian predictive distribution or a bootstrap approach (Luna and Young 2003), both of which require intensive computation.

9. Can both problems be solved by a unified framework? • Yes!

10. A Unified Framework


11. Introduction to confidence distribution (CD) • Statistical inference (parameter estimation) can take the form of a point estimate, an interval estimate, or a distribution estimate. • Example: $X_1, \ldots, X_n$ i.i.d. $N(\mu, 1)$. Point estimate: $\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$. Interval estimate: $(\bar{x}_n - 1.96/\sqrt{n},\ \bar{x}_n + 1.96/\sqrt{n})$. Distribution estimate: $N(\bar{x}_n, \frac{1}{n})$. • The idea of the CD approach is to use a sample-dependent distribution (or density) function to estimate the parameter of interest. • Wide range of examples: bootstrap distributions, (normalized) likelihood functions, $p$-value functions, fiducial distributions, some informative priors, and Bayesian posteriors, among others (Xie and Singh 2013).
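A small sketch of the normal-mean example above (the data are simulated here purely for illustration): the single CD object $N(\bar{x}_n, 1/n)$ yields the point estimate, the 95% interval, and the full distribution estimate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)              # X_1, ..., X_n i.i.d. N(mu, 1)

cd = stats.norm(loc=x.mean(), scale=1 / np.sqrt(x.size))  # distribution estimate N(xbar, 1/n)
point = cd.mean()                                         # point estimate: xbar
lo, hi = cd.ppf([0.025, 0.975])                           # interval: xbar -/+ 1.96/sqrt(n)
```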

12. Overview: Sequential Split-Conquer-Combine. Figure 2: Sequential Split-Conquer-Combine approach. The data are split into subsets $D_1, D_2, \ldots, D_m$ (split); in step $a = 1, \ldots, m$, subset $D_a$ is sequentially updated to $D_a^*$ and an estimator $\hat{\theta}_a$ is computed from it (conquer); the estimators $\hat{\theta}_1, \ldots, \hat{\theta}_m$ are then combined into $\hat{\theta}_c$ (combine).


13. Ingredients • Split the entire dataset into (correlated) subsets, based on a compactly supported correlation assumption in one dimension. • Perform a sequential update to create independent subsets, and estimate on each updated subset. • Combine the estimators. • Quantify prediction uncertainty.

14. Split • Split the entire dataset into subsets $y = \{y_a\}$, $a = 1, \ldots, m$. Denote the size of $y_a$ by $n_a$, i.e., $\sum n_a = n$. • Assumption: compactly supported correlation, so that, after index sorting according to the $X_1$ values, $$\Sigma_t = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & O & \cdots & O \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} & \ddots & \vdots \\ O & \ddots & \ddots & \ddots & O \\ \vdots & \ddots & \Sigma_{(m-1)(m-2)} & \Sigma_{(m-1)(m-1)} & \Sigma_{(m-1)m} \\ O & \cdots & O & \Sigma_{m(m-1)} & \Sigma_{mm} \end{pmatrix}_{n \times n}.$$
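A sketch of the split step; cutting the sorted indices into equal-size contiguous blocks is my simplifying assumption (the slide only requires that non-adjacent blocks fall outside the correlation support).

```python
import numpy as np

def split_indices(X, m):
    """Index sets for y_1, ..., y_m after sorting on the X_1 values."""
    order = np.argsort(np.asarray(X)[:, 0])   # index sorting according to X_1
    return np.array_split(order, m)           # m consecutive (correlated) subsets
```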

15. Sequentially update the data • Transform $y$ to $y^*$ by the sequential update $y_a^* = y_a - L_{a(a-1)} y_{a-1}^*$, where $L_{(a+1)a} = \Sigma_{t,(a+1)a} D_a^{-1}$ and $D_a = \Sigma_{aa} - L_{a(a-1)} D_{a-1} L_{a(a-1)}^\top$. • The sequential updates are computationally efficient. • The updated blocks $y_a^*$ are independent.
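A minimal sketch (mine, not the authors' code) of this sequential update for a block tridiagonal $\Sigma_t$; `diag[a]` is assumed to hold $\Sigma_{aa}$ and `sub[a]` the subdiagonal block below it.

```python
import numpy as np

def sequential_update(y_blocks, diag, sub):
    """y*_a = y_a - L_{a(a-1)} y*_{a-1}; returns independent blocks and their D_a."""
    D = [diag[0]]                                    # D_1 = Sigma_11
    y_star = [y_blocks[0]]                           # y*_1 = y_1
    for a in range(1, len(y_blocks)):
        L = sub[a - 1] @ np.linalg.inv(D[a - 1])     # L_{a(a-1)} = Sigma_{a(a-1)} D_{a-1}^{-1}
        D.append(diag[a] - L @ D[a - 1] @ L.T)       # block-sized work, never an n x n solve
        y_star.append(y_blocks[a] - L @ y_star[a - 1])
    return y_star, D                                 # Cov(y*_a) = sigma^2 D_a, blocks independent
```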

16. Estimation from each subset • Given $\theta$, the MLE from the $a$th subset is $\hat{\beta}_a = \arg\max_{\beta} l_t^{(a)}(\beta \mid \theta) = (C_a^\top D_a^{-1} C_a)^{-1} C_a^\top D_a^{-1} y_a^*$. An individual CD for the $a$th updated subset is $N_p(\hat{\beta}_a, \mathrm{Cov}(\hat{\beta}_a))$ (cf. Xie and Singh 2013). • Given $\beta$, the MLE from the $a$th subset is $\hat{\theta}_a = \arg\max_{\theta} l_t^{(a)}(\theta \mid \beta)$, and an individual CD for the $a$th updated subset is $N(\hat{\theta}_a, \mathrm{Cov}(\hat{\theta}_a))$. • The computational reduction is significant because $D_a$ is much smaller than the original covariance matrix.
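A sketch of the per-subset estimate, assuming $C_a$ denotes the correspondingly updated design block (the slide does not spell out its construction); it also returns the information weight $W_a$ reused in the combining step.

```python
import numpy as np

def subset_estimate(C_a, y_star_a, D_a, sigma2=1.0):
    """beta_hat_a = (C_a' D_a^{-1} C_a)^{-1} C_a' D_a^{-1} y*_a and its CD spread."""
    Di_C = np.linalg.solve(D_a, C_a)                 # D_a^{-1} C_a
    W_a = C_a.T @ Di_C                               # C_a' D_a^{-1} C_a
    beta_a = np.linalg.solve(W_a, Di_C.T @ y_star_a)
    cov_a = sigma2 * np.linalg.inv(W_a)              # Cov(beta_hat_a); CD is N_p(beta_a, cov_a)
    return beta_a, cov_a, W_a
```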


17. CD combining • Following Singh, Xie and Strawderman (2005), Liu, Liu and Xie (2014), and Yang et al. (2014), a combined CD is $N_p(\beta_c, S_c)$, where $\beta_c = (\sum_a W_a)^{-1}(\sum_a W_a \hat{\beta}_a)$ with $W_a = C_a^\top D_a^{-1} C_a$, and $S_c = \mathrm{Cov}(\beta_c)$. • A similar framework can be applied to all the parameters $(\beta, \theta)$. • Theorem 1: Under some regularity assumptions, when $\tau > O_p(n^{1/2})$ and $n \to \infty$, the SSCC estimator $\hat{\lambda}_c = (\hat{\beta}_c, \hat{\theta}_c)$ is asymptotically as efficient as the MLE $\hat{\lambda}_{mle} = (\hat{\beta}_{mle}, \hat{\theta}_{mle})$.
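A minimal sketch of the combining step with the inverse-variance weights $W_a$ as reconstructed above; under $\mathrm{Cov}(\hat{\beta}_a) = \sigma^2 W_a^{-1}$, the combined covariance $S_c$ reduces to $\sigma^2 (\sum_a W_a)^{-1}$.

```python
import numpy as np

def combine(beta_list, W_list, sigma2=1.0):
    """beta_c = (sum_a W_a)^{-1} sum_a W_a beta_hat_a; combined CD is N_p(beta_c, S_c)."""
    W_sum = sum(W_list)
    beta_c = np.linalg.solve(W_sum, sum(W @ b for W, b in zip(W_list, beta_list)))
    S_c = sigma2 * np.linalg.inv(W_sum)   # S_c = Cov(beta_c) = sigma^2 (sum_a W_a)^{-1}
    return beta_c, S_c
```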

18. GP predictive distribution • The GP predictor at a new point $x_0$, given parameters $(\beta, \theta)$, follows a normal distribution with mean $p_0(\beta, \theta)$ and variance $m_0(\beta, \theta)$, where $p_0(\beta, \theta) = x_0^\top \beta + \gamma(\theta)^\top \Sigma^{-1}(\theta)(y - X\beta)$, $m_0(\beta, \theta) = \sigma^2(1 - \gamma(\theta)^\top \Sigma^{-1}(\theta)\gamma(\theta))$, and $\gamma(\theta)$ is an $n \times 1$ vector whose $i$th element equals $\phi(\|x_i - x_0\|; \theta)$.
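A minimal sketch of the plug-in predictive mean and variance above; `gamma` stands for the correlation vector $\gamma(\theta)$ between the new point and the observed points, assumed computed elsewhere.

```python
import numpy as np

def gp_predict(x0, gamma, X, y, beta, Sigma, sigma2):
    """Plug-in predictive mean p0 and variance m0 at a new point x0."""
    resid = np.linalg.solve(Sigma, y - X @ beta)                  # Sigma^{-1}(y - X beta)
    p0 = x0 @ beta + gamma @ resid                                # p0(beta, theta)
    m0 = sigma2 * (1.0 - gamma @ np.linalg.solve(Sigma, gamma))   # m0(beta, theta)
    return p0, m0
```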
