Nearest Neighbor Gaussian Processes for Large Spatial Data

  1. Nearest Neighbor Gaussian Processes for Large Spatial Data. Abhi Datta¹, Sudipto Banerjee², and Andrew O. Finley³. July 31, 2017. ¹Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland. ²Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles. ³Departments of Forestry and Geography, Michigan State University, East Lansing, Michigan.

  2. Low rank Gaussian Predictive Process: Pros
     • Proper Gaussian process
     • Allows for coherent spatial interpolation at arbitrary resolution
     • Can be used as a prior for spatial random effects in any hierarchical setup for spatial data
     • Computationally tractable

  3. Low rank Gaussian Predictive Process: Cons
     Figure: Comparing full GP vs low-rank GP with 2,500 locations (panels: true w, full GP, PP with 64 knots)
     • Low-rank models like the Predictive Process (PP) often tend to oversmooth
     • Increasing the number of knots can fix this but leads to heavy computation

  4. Sparse matrices
     • Idea: use a sparse matrix instead of a low-rank matrix to approximate the dense full GP covariance matrix
     • Goals:
       • Scalability, both in terms of storage and of computing inverses and determinants
       • Close approximation of full GP inference
       • A proper Gaussian process model like the Predictive Process

  5. Cholesky factors
     • Write the joint density p(w) = p(w_1, w_2, ..., w_n) as
       p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ··· p(w_n | w_1, w_2, ..., w_{n-1})
     • For a Gaussian distribution w ~ N(0, C) this implies
       w_1 = 0 + η_1
       w_2 = a_21 w_1 + η_2
       ...
       w_n = a_n1 w_1 + a_n2 w_2 + ··· + a_{n,n-1} w_{n-1} + η_n

  6. Cholesky factors
     • Write the joint density p(w) = p(w_1, w_2, ..., w_n) as
       p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ··· p(w_n | w_1, w_2, ..., w_{n-1})
     • For a Gaussian distribution w ~ N(0, C), in matrix form this is

       [ w_1 ]   [ 0     0     0     ...  0          0 ] [ w_1 ]   [ η_1 ]
       [ w_2 ]   [ a_21  0     0     ...  0          0 ] [ w_2 ]   [ η_2 ]
       [ w_3 ] = [ a_31  a_32  0     ...  0          0 ] [ w_3 ] + [ η_3 ]
       [  .  ]   [  .     .     .          .         . ] [  .  ]   [  .  ]
       [ w_n ]   [ a_n1  a_n2  a_n3  ...  a_{n,n-1}  0 ] [ w_n ]   [ η_n ]

       i.e., w = Aw + η, where η ~ N(0, D) and D = diag(d_1, d_2, ..., d_n)

  7. Cholesky factors
     • Write the joint density p(w) = p(w_1, w_2, ..., w_n) as
       p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ··· p(w_n | w_1, w_2, ..., w_{n-1})
     • For a Gaussian distribution w ~ N(0, C) this gives (as on the previous slide)
       w = Aw + η, where A is strictly lower triangular, η ~ N(0, D), and D = diag(d_1, d_2, ..., d_n)
     • Cholesky factorization: C^{-1} = (I − A)' D^{-1} (I − A)

  8. Cholesky factors
     • w_{<i} = (w_1, w_2, ..., w_{i-1})'
     • c_i = Cov(w_i, w_{<i}), C_i = Var(w_{<i})
     • The i-th row of A and d_i = Var(η_i) are obtained from p(w_i | w_{<i}) as follows:
       • Solve for the a_ij's from Σ_{j=1}^{i-1} a_ij w_j = E(w_i | w_{<i}) = c_i' C_i^{-1} w_{<i}

  9. Cholesky factors
     • w_{<i} = (w_1, w_2, ..., w_{i-1})'
     • c_i = Cov(w_i, w_{<i}), C_i = Var(w_{<i})
     • The i-th row of A and d_i = Var(η_i) are obtained from p(w_i | w_{<i}) as follows:
       • Solve for the a_ij's from Σ_{j=1}^{i-1} a_ij w_j = E(w_i | w_{<i}) = c_i' C_i^{-1} w_{<i}
       • d_i = Var(w_i | w_{<i}) = σ² − c_i' C_i^{-1} c_i

  11. Cholesky factors
     • w_{<i} = (w_1, w_2, ..., w_{i-1})'
     • c_i = Cov(w_i, w_{<i}), C_i = Var(w_{<i})
     • The i-th row of A and d_i = Var(η_i) are obtained from p(w_i | w_{<i}) as follows:
       • Solve for the a_ij's from Σ_{j=1}^{i-1} a_ij w_j = E(w_i | w_{<i}) = c_i' C_i^{-1} w_{<i}
       • d_i = Var(w_i | w_{<i}) = σ² − c_i' C_i^{-1} c_i
     • For large i, inverting C_i becomes slow
     • The Cholesky factor approach for the full GP covariance matrix C therefore does not offer any computational benefits
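
     The construction above can be written down directly. Below is a minimal numpy sketch of slides 8–11 (the function name and the brute-force dense loop are illustrative, not from the slides); for each i it solves for the i-th row of A and for d_i, which is exactly the step that becomes expensive as i grows.

     ```python
     import numpy as np

     def dense_cholesky_factors(C):
         """Sequential conditioning for w ~ N(0, C):
         row i of A solves a_i' w_<i = E(w_i | w_<i) = c_i' C_i^{-1} w_<i,
         and d_i = Var(w_i | w_<i) = C_ii - c_i' C_i^{-1} c_i."""
         n = C.shape[0]
         A = np.zeros((n, n))
         d = np.zeros(n)
         d[0] = C[0, 0]                   # p(w_1) needs no conditioning
         for i in range(1, n):
             Ci = C[:i, :i]               # Var(w_<i)
             ci = C[i, :i]                # Cov(w_i, w_<i)
             ai = np.linalg.solve(Ci, ci) # cost grows with i: the bottleneck
             A[i, :i] = ai
             d[i] = C[i, i] - ci @ ai
         return A, d                      # C^{-1} = (I - A)' diag(d)^{-1} (I - A)
     ```

     As a check, (np.eye(n) - A).T @ np.diag(1 / d) @ (np.eye(n) - A) reproduces np.linalg.inv(C) up to numerical error; the growing solve inside the loop is what the sparse neighbor sets below avoid.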

  12. Cholesky Factors and Directed Acyclic Graphs (DAGs)
     Figure: a directed acyclic graph on nodes 1–7
     • The number of non-zero entries (sparsity) of A equals the number of arrows in the graph
     • In particular, the sparsity of the i-th row of A equals the number of arrows pointing to i in the DAG

  13. Introducing sparsity via graphical models
     Figure: a directed acyclic graph on nodes 1–7
     p(y_1) p(y_2 | y_1) p(y_3 | y_1, y_2) p(y_4 | y_1, y_2, y_3)
       × p(y_5 | y_1, y_2, y_3, y_4) p(y_6 | y_1, y_2, ..., y_5) p(y_7 | y_1, y_2, ..., y_6)

  14. Introducing sparsity via graphical models
     Figure: a sparse directed acyclic graph on nodes 1–7
     Dropping terms (struck out on the slide) from each conditioning set gives
     p(y_1) p(y_2 | y_1) p(y_3 | y_1, y_2) p(y_4 | y_1, y_2, y_3)
       × p(y_5 | y_2, y_3, y_4) p(y_6 | y_1, y_4, y_5) p(y_7 | y_1, y_2, y_6)

  15. Introducing sparsity via graphical models
     Figure: a sparse directed acyclic graph on nodes 1–7
     • Create a sparse DAG by keeping at most m arrows pointing to each node
     • Set a_ij = 0 for all pairs i, j with no arrow between them
     • Fixing a_ij = 0 introduces conditional independence, and w_j drops out of the conditioning set in p(w_i | {w_k : k < i})

  16. Introducing sparsity via graphical models
     Figure: a sparse directed acyclic graph on nodes 1–7
     • Let N(i) denote the neighbor set of i, i.e., the set of nodes from which there are arrows to i
     • a_ij = 0 for j ∉ N(i), and the nonzero a_ij's are obtained by solving
       E[w_i | w_{N(i)}] = Σ_{j ∈ N(i)} a_ij w_j
     • The above linear system is only m × m
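
     Restricting each conditioning set to N(i) turns the growing solve into an m × m one. A small sketch under the same assumptions as the dense version above (names are illustrative); `neighbors[i]` holds the indices with arrows into i:

     ```python
     import numpy as np

     def sparse_cholesky_factors(C, neighbors):
         """Row i of the sparse A uses only the neighbor set N(i):
         solve C[N(i), N(i)] a = C[N(i), i] (an m x m system), then
         d_i = C[i, i] - C[i, N(i)] a.  A is kept dense here for clarity;
         in practice it has at most m nonzeros per row."""
         n = C.shape[0]
         A = np.zeros((n, n))
         d = np.zeros(n)
         for i in range(n):
             N = list(neighbors[i])      # nodes with arrows pointing to i
             if not N:
                 d[i] = C[i, i]
                 continue
             CN = C[np.ix_(N, N)]        # m x m, no longer growing with i
             cN = C[N, i]
             a = np.linalg.solve(CN, cN)
             A[i, N] = a                 # a_ij = 0 for j not in N(i)
             d[i] = C[i, i] - cN @ a
         return A, d
     ```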

  17. Choosing neighbor sets
     Matérn covariance function:
     C(s_i, s_j) = (1 / (2^{ν−1} Γ(ν))) (||s_i − s_j|| φ)^ν K_ν(||s_i − s_j|| φ);  φ > 0, ν > 0
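
     For reference, a direct translation of the Matérn covariance into Python (scipy's `kv` is the modified Bessel function of the second kind; the optional marginal variance `sigma2` is my addition, not shown on the slide):

     ```python
     import numpy as np
     from scipy.special import gamma, kv

     def matern(si, sj, phi, nu, sigma2=1.0):
         """Matern covariance C(s_i, s_j) with decay phi and smoothness nu."""
         h = np.linalg.norm(np.asarray(si, dtype=float) - np.asarray(sj, dtype=float))
         if h == 0.0:
             return sigma2              # limit of the expression as distance -> 0
         u = h * phi
         return sigma2 * (u ** nu) * kv(nu, u) / (2 ** (nu - 1) * gamma(nu))
     ```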

  18. Choosing neighbor sets
     • Spatial covariance functions decay with distance
     • Vecchia (1988): N(s_i) = the m nearest neighbors of s_i among s_1, s_2, ..., s_{i-1}
     • The nearest points have the highest correlations
     • Theory: the "screening effect" (Stein, 2002)
     • We use Vecchia's choice of m nearest neighbors
     • Other choices, proposed in Stein et al. (2004), Gramacy and Apley (2015), and Guinness (2016), can also be used
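
     Vecchia's neighbor sets depend only on the ordering of the locations and on pairwise distances. A brute-force sketch (a k-d tree would be used for large n; the function name is illustrative):

     ```python
     import numpy as np

     def vecchia_neighbors(coords, m):
         """For each ordered location s_i, return the indices of the (at most) m
         nearest locations among s_1, ..., s_{i-1}.  coords is an (n, 2) array."""
         n = coords.shape[0]
         neighbors = [[]]               # s_1 has no previously ordered neighbors
         for i in range(1, n):
             dists = np.linalg.norm(coords[:i] - coords[i], axis=1)
             neighbors.append(list(np.argsort(dists)[:min(m, i)]))
         return neighbors
     ```

     The output plugs directly into the sparse Cholesky sketch above.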

  19–21. Nearest neighbors (figures)

  22. Sparse precision matrices
     • The neighbor sets and the covariance function C(·, ·) define a sparse Cholesky factor A and a diagonal D
     • C̃^{-1} = (I − A)' D^{-1} (I − A)
     • N(w | 0, C) ≈ N(w | 0, C̃)
     • det(C̃) = Π_{i=1}^{n} d_i
     • C̃^{-1} is sparse with O(nm²) entries
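
     Putting the pieces together, the sparse factors give both the precision matrix and the determinant cheaply. A sketch using scipy's sparse matrices (assuming A and d from the earlier sketches):

     ```python
     import numpy as np
     from scipy import sparse

     def nngp_precision(A, d):
         """C_tilde^{-1} = (I - A)' D^{-1} (I - A) as a sparse matrix, and
         log det(C_tilde) = sum_i log d_i.  With at most m nonzeros per row
         of A, the precision has O(n m^2) nonzero entries."""
         n = len(d)
         IA = sparse.identity(n, format="csr") - sparse.csr_matrix(A)
         Dinv = sparse.diags(1.0 / np.asarray(d))
         Q = (IA.T @ Dinv @ IA).tocsr()
         return Q, np.sum(np.log(d))
     ```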

  24. Extension to a Process
     • We have defined w ~ N(0, C̃) over the set of data locations S = {s_1, s_2, ..., s_n}
     • For s ∉ S, define N(s) as the set of m nearest neighbors of s in S
     • Define w(s) = Σ_{i: s_i ∈ N(s)} a_i(s) w(s_i) + η(s), where the η(s) are independent N(0, d(s))
     • a_i(s) and d(s) are once again obtained by solving an m × m system
     • This yields a well-defined GP over the entire domain
     • Nearest Neighbor GP (NNGP) – Datta et al., JASA (2016)
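
     The same m × m solve gives the process at a new location. A sketch that reuses the hypothetical `matern` helper above to draw w(s_0) for an s_0 outside S:

     ```python
     import numpy as np

     def nngp_interpolate(s0, coords, w, m, phi, nu, sigma2=1.0, rng=None):
         """Draw w(s0): N(s0) = m nearest locations in S, the weights a_i(s0)
         solve an m x m system, and independent noise with variance d(s0) is added."""
         rng = rng or np.random.default_rng()
         s0 = np.asarray(s0, dtype=float)
         N = np.argsort(np.linalg.norm(coords - s0, axis=1))[:m]      # N(s0)
         CN = np.array([[matern(coords[j], coords[k], phi, nu, sigma2) for k in N]
                        for j in N])                                  # Var(w_{N(s0)})
         cN = np.array([matern(coords[j], s0, phi, nu, sigma2) for j in N])
         a = np.linalg.solve(CN, cN)                                  # a_i(s0)
         d0 = sigma2 - cN @ a                                         # d(s0)
         return a @ w[N] + np.sqrt(d0) * rng.standard_normal()
     ```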

  25. Hierarchical spatial regression with NNGP
     Spatial linear model: y(s) = x(s)'β + w(s) + ε(s)
     • w(s) is modeled as an NNGP derived from a GP(0, C(·, · | σ², φ))
     • ε(s) iid ~ N(0, τ²) contributes to the nugget
     • Priors on the parameters β, σ², τ², and φ
     • The only difference from a full GP model is the NNGP prior on w(s)
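
     Because the NNGP prior has an explicit sparse factorization, the log joint density of the spatial linear model can be evaluated cheaply once A and d are built. A minimal sketch of that evaluation (priors on β, σ², τ², φ would be added on top; all names are illustrative):

     ```python
     import numpy as np

     def log_joint(y, X, w, beta, tau2, A, d):
         """log p(y | beta, w, tau2) + log p(w) under the NNGP prior,
         using w' C_tilde^{-1} w = eta' D^{-1} eta with eta = (I - A) w
         and log det(C_tilde) = sum_i log d_i."""
         n = len(y)
         resid = y - X @ beta - w
         log_lik = -0.5 * (n * np.log(2 * np.pi * tau2) + resid @ resid / tau2)
         eta = w - A @ w
         log_prior_w = -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(d))
                               + np.sum(eta ** 2 / d))
         return log_lik + log_prior_w
     ```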
