

SLIDE 1

Nearest Neighbor Gaussian Processes for Large Spatial Data

Abhi Datta¹, Sudipto Banerjee² and Andrew O. Finley³ · July 31, 2017

¹Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland. ²Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles. ³Departments of Forestry and Geography, Michigan State University, East Lansing, Michigan.

SLIDE 2

Low rank Gaussian Predictive Process

Pros

  • Proper Gaussian process
  • Allows for coherent spatial interpolation at arbitrary resolution
  • Can be used as a prior for spatial random effects in any hierarchical setup for spatial data
  • Computationally tractable

SLIDE 3

Low rank Gaussian Predictive Process

Cons

Figure: Comparing full GP vs low-rank GP with 2500 locations (panels: true w; full GP; PP, 64 knots)

  • Low rank models like the Predictive Process (PP) often tend to oversmooth
  • Increasing the number of knots can fix this but leads to heavy computation

SLIDE 4

Sparse matrices

  • Idea: Use a sparse matrix instead of a low rank matrix to approximate the dense full GP covariance matrix
  • Goals:
  • Scalability: both in terms of storage and of computing inverses and determinants
  • Closely approximate full GP inference
  • Proper Gaussian process model, like the Predictive Process

SLIDE 5

Cholesky factors

  • Write a joint density p(w) = p(w1, w2, . . . , wn) as:

    p(w1) p(w2 | w1) p(w3 | w1, w2) · · · p(wn | w1, w2, . . . , wn−1)

  • For a Gaussian distribution w ∼ N(0, C) this implies:

    w1 = 0 + η1;  w2 = a21 w1 + η2;  . . . ;  wn = an1 w1 + an2 w2 + · · · + an,n−1 wn−1 + ηn

SLIDE 7

Cholesky factors

  • Write a joint density p(w) = p(w1, w2, . . . , wn) as:

    p(w1) p(w2 | w1) p(w3 | w1, w2) · · · p(wn | w1, w2, . . . , wn−1)

  • For a Gaussian distribution w ∼ N(0, C) this gives, in matrix form:

\[
\begin{pmatrix} w_1 \\ w_2 \\ w_3 \\ \vdots \\ w_n \end{pmatrix}
=
\begin{pmatrix}
0 & & & & \\
a_{21} & 0 & & & \\
a_{31} & a_{32} & 0 & & \\
\vdots & \vdots & \ddots & \ddots & \\
a_{n1} & a_{n2} & \cdots & a_{n,n-1} & 0
\end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \\ w_3 \\ \vdots \\ w_n \end{pmatrix}
+
\begin{pmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \\ \vdots \\ \eta_n \end{pmatrix}
\;\Longrightarrow\; w = Aw + \eta; \quad \eta \sim N(0, D), \text{ where } D = \mathrm{diag}(d_1, d_2, \ldots, d_n).
\]

  • Cholesky factorization: C⁻¹ = (I − A)′D⁻¹(I − A)

SLIDE 11

Cholesky factors

  • w<i = (w1, w2, . . . , wi−1)′
  • ci = Cov(wi, w<i), Ci = Var(w<i)
  • The ith row of A and di = Var(ηi) are obtained from p(wi | w<i) as follows:
  • Solve for the aij's from \(\sum_{j=1}^{i-1} a_{ij} w_j = E(w_i \mid w_{<i}) = c_i' C_i^{-1} w_{<i}\)
  • \(d_i = \mathrm{Var}(w_i \mid w_{<i}) = \sigma^2 - c_i' C_i^{-1} c_i\)
  • For large i, inverting Ci becomes slow
  • The Cholesky factor approach for the full GP covariance matrix C does not offer any computational benefits (a numerical sketch follows below)
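The construction is easy to check numerically. Below is a minimal base-R sketch (a toy example: the exponential covariance, the seed and all variable names are our illustrative choices, not from the slides) that builds A and D by conditioning each wi on all preceding components and then verifies C⁻¹ = (I − A)′D⁻¹(I − A):

```r
## Dense Cholesky-factor construction: row i needs an (i-1) x (i-1) solve.
set.seed(1)
n <- 5
s <- matrix(runif(2 * n), n, 2)        # n random locations on the unit square
C <- exp(-as.matrix(dist(s)))          # illustrative exponential covariance
A <- matrix(0, n, n)
d <- numeric(n)
d[1] <- C[1, 1]                        # w_1 has no predecessors
for (i in 2:n) {
  prev <- 1:(i - 1)
  A[i, prev] <- solve(C[prev, prev], C[prev, i])   # a_i = C_i^{-1} c_i
  d[i] <- C[i, i] - sum(C[i, prev] * A[i, prev])   # d_i = Var(w_i | w_{<i})
}
Cinv <- t(diag(n) - A) %*% diag(1 / d) %*% (diag(n) - A)
max(abs(Cinv - solve(C)))              # ~ 1e-14: the factorization is exact
```

The (i − 1) × (i − 1) solve in each row is exactly the bottleneck that the sparse construction on the following slides removes.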

SLIDE 12

Cholesky Factors and Directed Acyclic Graphs (DAGs)

Figure: A DAG on nodes 1, . . . , 7.

  • The number of non-zero entries (the sparsity) of A equals the number of arrows in the graph
  • In particular, the sparsity of the ith row of A equals the number of arrows pointing to i in the DAG

SLIDE 13

Introducing sparsity via graphical models

Figure: A DAG on nodes 1, . . . , 7.

p(y1) p(y2 | y1) p(y3 | y1, y2) p(y4 | y1, y2, y3) × p(y5 | y1, y2, y3, y4) p(y6 | y1, y2, . . . , y5) p(y7 | y1, y2, . . . , y6)

SLIDE 14

Introducing sparsity via graphical models

Figure: The DAG on nodes 1, . . . , 7 after pruning arrows.

p(y1) p(y2 | y1) p(y3 | y1, y2) p(y4 | y1, y2, y3) × p(y5 | y2, y3, y4) p(y6 | y1, y4, y5) p(y7 | y1, y2, y6)

(with the struck-out conditioning variables dropped from each factor)

SLIDE 15

Introducing sparsity via graphical models

Figure: A sparse DAG on nodes 1, . . . , 7.

  • Create a sparse DAG by keeping at most m arrows pointing to each node
  • Set aij = 0 for all pairs i, j with no arrow between them
  • Fixing aij = 0 introduces conditional independence, and wj drops out of the conditioning set in p(wi | {wk : k < i})

SLIDE 16

Introducing sparsity via graphical models

Figure: A sparse DAG on nodes 1, . . . , 7.

  • Let N(i) denote the neighbor set of i, i.e., the set of nodes from which there are arrows to i
  • aij = 0 for j ∉ N(i); the nonzero aij's are obtained by solving \(E[w_i \mid w_{N(i)}] = \sum_{j \in N(i)} a_{ij} w_j\)
  • The above linear system is only m × m (see the sketch below)
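A minimal sketch of this sparse construction, reusing the toy C and s from the earlier sketch (vecchia_factor is our hypothetical name; here the neighbor set is the m nearest earlier-ordered locations):

```r
## Sparse construction: condition w_i only on its (at most m) nearest
## neighbors among s_1, ..., s_{i-1}, so every solve is at most m x m.
vecchia_factor <- function(C, s, m) {
  n <- nrow(C)
  H <- as.matrix(dist(s))                            # inter-location distances
  A <- matrix(0, n, n)
  d <- numeric(n)
  d[1] <- C[1, 1]
  for (i in 2:n) {
    prev <- 1:(i - 1)
    Ni <- prev[order(H[i, prev])[1:min(m, i - 1)]]   # neighbor set N(i)
    A[i, Ni] <- solve(C[Ni, Ni], C[Ni, i])           # at most m x m system
    d[i] <- C[i, i] - sum(C[i, Ni] * A[i, Ni])
  }
  list(A = A, d = d)   # Ctilde^{-1} = (I - A)' diag(1/d) (I - A)
}
f <- vecchia_factor(C, s, m = 3)
sum(log(f$d))          # log det(Ctilde) = sum_i log d_i
```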

SLIDE 17

Choosing neighbor sets

Matérn covariance function:
\[
C(s_i, s_j) = \frac{1}{2^{\nu-1}\Gamma(\nu)} \, (\|s_i - s_j\|\,\phi)^{\nu} \, K_\nu(\|s_i - s_j\|\,\phi); \quad \phi > 0, \; \nu > 0,
\]
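For concreteness, a direct base-R transcription of this function (besselK is base R's modified Bessel function of the second kind; the h = 0 limit and the optional σ² scale are our additions):

```r
## Matern covariance as a function of distance h; as h -> 0,
## (h*phi)^nu * besselK(h*phi, nu) -> 2^(nu-1) * gamma(nu), so C -> sigma2.
matern <- function(h, phi, nu, sigma2 = 1) {
  ifelse(h > 0,
         sigma2 * (h * phi)^nu * besselK(h * phi, nu) / (2^(nu - 1) * gamma(nu)),
         sigma2)
}
matern(h = 0.5, phi = 6, nu = 3/2)   # one illustrative covariance value
```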

SLIDE 18

Choosing neighbor sets

  • Spatial covariance functions decay with distance
  • Vecchia (1988): N(si) = the m nearest neighbors of si among s1, s2, . . . , si−1
  • Nearest points have the highest correlations
  • Theory: "screening effect" – Stein, 2002
  • We use Vecchia's choice of m nearest neighbors
  • Other choices proposed in Stein et al. (2004), Gramacy and Apley (2015) and Guinness (2016) can also be used

SLIDE 19

Nearest neighbors

SLIDE 22

Sparse precision matrices

  • The neighbor sets and the covariance function C(·, ·) define a sparse Cholesky factor A
  • N(w | 0, C) ≈ N(w | 0, C̃), where C̃⁻¹ = (I − A)⊤D⁻¹(I − A)

Figure: Sparsity patterns of A, D and C̃⁻¹.

  • \(\det(\tilde{C}) = \prod_{i=1}^{n} d_i\)
  • C̃⁻¹ is sparse with O(nm²) entries

SLIDE 24

Extension to a Process

  • We have defined w ∼ N(0, C̃) over the set of data locations S = {s1, s2, . . . , sn}
  • For s ∉ S, define N(s) as the set of m nearest neighbors of s in S
  • Define \(w(s) = \sum_{i: s_i \in N(s)} a_i(s) w(s_i) + \eta(s)\), where η(s) ∼ N(0, d(s)) independently
  • ai(s) and d(s) are once again obtained by solving an m × m system (see the sketch below)
  • This is a well-defined GP over the entire domain
  • Nearest Neighbor GP (NNGP) – Datta et al., JASA, (2016)
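A sketch of this extension under the toy setup of the earlier sketches (predict_w is our hypothetical helper; C_fun maps distances to covariances):

```r
## NNGP at a new location s0: w(s0) = a(s0)' w_{N(s0)} + eta(s0),
## where a(s0) and d(s0) come from one m x m solve.
predict_w <- function(s0, s, C_fun, w, m = 3) {
  h0 <- sqrt(colSums((t(s) - s0)^2))       # distances from s0 to all of S
  N0 <- order(h0)[1:m]                     # N(s0): m nearest neighbors in S
  CN <- C_fun(as.matrix(dist(s[N0, ])))    # m x m neighbor covariance
  c0 <- C_fun(h0[N0])                      # cross-covariances
  a0 <- solve(CN, c0)                      # a(s0)
  list(mean = sum(a0 * w[N0]),             # a(s0)' w_{N(s0)}
       var  = C_fun(0) - sum(a0 * c0))     # d(s0)
}
## e.g., predict_w(c(0.5, 0.5), s, function(h) exp(-h), w = rnorm(nrow(s)))
```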

SLIDE 25

Hierarchical spatial regression with NNGP

Spatial linear model: y(s) = x(s)′β + w(s) + ε(s)

  • w(s) is modeled as an NNGP derived from a GP(0, C(·, · | σ², φ))
  • ε(s) ∼ iid N(0, τ²) contributes the nugget
  • Priors for the parameters β, σ², τ² and φ
  • The only difference from a full GP model is the NNGP prior on w(s)

SLIDE 26

Hierarchical spatial regression with NNGP

Full Bayesian model:

\[
N(y \mid X\beta + w, \tau^2 I) \times N(w \mid 0, \tilde{C}(\sigma^2, \phi)) \times N(\beta \mid \mu_\beta, V_\beta) \times IG(\tau^2 \mid a_\tau, b_\tau) \times IG(\sigma^2 \mid a_\sigma, b_\sigma) \times \mathrm{Unif}(\phi \mid a_\phi, b_\phi)
\]

Gibbs sampler:

  • Conjugate full conditionals for β, τ², σ² and the w(si)'s
  • Metropolis step for updating φ
  • Posterior predictive distribution at any location s via composition sampling (sketched below):

\[
\int N(y(s) \mid x(s)'\beta + w(s), \tau^2) \, N(w(s) \mid a(s)' w_{N(s)}, d(s)) \, p(w, \beta, \tau^2, \sigma^2, \phi \mid y) \, d(w, \beta, \tau^2, \sigma^2, \phi)
\]
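A sketch of the composition-sampling step, assuming posterior draws are stored as a list (each element carrying fields beta, w, sigma2, tau2, phi) and reusing the hypothetical predict_w helper from the earlier sketch:

```r
## One predictive draw of y(s0) per retained posterior sample:
## first w(s0) | w, theta, then y(s0) | w(s0), beta, tau2.
predict_y <- function(s0, x0, draws, s, m = 3) {
  sapply(draws, function(p) {                      # p: one posterior draw
    Cf <- function(h) p$sigma2 * exp(-p$phi * h)   # exponential C(. | sigma2, phi)
    pw <- predict_w(s0, s, Cf, p$w, m)             # N(a(s0)' w_{N(s0)}, d(s0))
    w0 <- rnorm(1, pw$mean, sqrt(pw$var))
    rnorm(1, sum(x0 * p$beta) + w0, sqrt(p$tau2))  # add the nugget
  })
}
```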

SLIDE 27

Choosing m

  • Run the NNGP in parallel for a few values of m
  • Choose m based on model evaluation metrics
  • Our results suggest that typically m ≈ 20 yields excellent approximations to the full GP

SLIDE 28

Storage and computation

  • Storage:
  • Never needs to store the n × n distance matrix
  • Stores smaller m × m matrices instead
  • Total storage requirement is O(nm²)
  • Computation:
  • Only involves inverting small m × m matrices
  • Total flop count per iteration of the Gibbs sampler is O(nm³)
  • Since m ≪ n, NNGP offers great scalability for large datasets

SLIDE 29

Simulation experiments

  • 2500 locations on a unit square
  • y(s) = β0 + β1 x(s) + w(s) + ε(s)
  • Single covariate x(s) generated as iid N(0, 1)
  • Spatial effects generated from GP(0, σ²R(·, · | φ))
  • R(·, · | φ) is the exponential correlation function with decay φ
  • Candidate models: full GP, Gaussian Predictive Process (GPP) with 64 knots, and NNGP (a simulation sketch follows below)
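A sketch of this data-generating design (the seed and the dense Cholesky draw are our illustrative choices; the true parameter values match the table on a later slide):

```r
## Simulate the synthetic dataset: 2500 locations on the unit square.
set.seed(1)
n <- 2500
coords <- cbind(runif(n), runif(n))
x <- rnorm(n)                                     # single iid N(0, 1) covariate
beta <- c(1, 5); sigma2 <- 1; tau2 <- 0.1; phi <- 12
R <- exp(-phi * as.matrix(dist(coords)))          # exponential correlation
w <- drop(crossprod(chol(sigma2 * R), rnorm(n)))  # w ~ GP(0, sigma2 * R)
y <- beta[1] + beta[2] * x + w + rnorm(n, 0, sqrt(tau2))
```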

SLIDE 30

Fitted Surfaces

Figure: Univariate synthetic data analysis (panels: true w; full GP; GPP, 64 knots; NNGP, m = 10; NNGP, m = 20)

SLIDE 31

Parameter estimates

             True    NNGP, m = 10          NNGP, m = 20          Predictive Process,   Full Gaussian
                                                                 64 knots              Process
  β0         1       1.00 (0.62, 1.31)     1.03 (0.65, 1.34)     1.30 (0.54, 2.03)     1.03 (0.69, 1.34)
  β1         5       5.01 (4.99, 5.03)     5.01 (4.99, 5.03)     5.03 (4.99, 5.06)     5.01 (4.99, 5.03)
  σ²         1       0.96 (0.78, 1.23)     0.94 (0.77, 1.20)     1.29 (0.96, 2.00)     0.94 (0.76, 1.23)
  τ²         0.1     0.10 (0.08, 0.13)     0.10 (0.08, 0.13)     0.08 (0.04, 0.13)     0.10 (0.08, 0.12)
  φ          12      12.93 (9.70, 16.77)   13.36 (9.99, 17.15)   5.61 (3.48, 8.09)     13.52 (9.92, 17.50)

SLIDE 32

Model evaluation

                      NNGP, m = 10   NNGP, m = 20   Predictive Process,   Full Gaussian
                                                    64 knots              Process
  DIC score           2390           2377           13678                 2364
  RMSPE               1.2            1.2            1.68                  1.2
  Run time (minutes)  14.40          46.47          43.36                 560.31

  • NNGP performs on par with the full GP
  • GPP oversmooths and performs much worse, both in terms of parameter estimation and model comparison
  • NNGP yields huge computational gains

SLIDE 33

Multivariate spatial data

  • Point-referenced spatial data often come as multivariate measurements at each location.
  • Examples:
  • Environmental monitoring: stations yield measurements on ozone, NO, CO, and PM2.5.
  • Forestry: measurements of stand characteristics such as age, total biomass, and average tree diameter.
  • Atmospheric modeling: at a given site we observe surface temperature, precipitation and wind speed.
  • We anticipate dependence between measurements:
  • at a particular location
  • across locations

SLIDE 34

Multivariate spatial linear model

  • Spatial linear model for q-variate spatial data:

    yi(s) = xi(s)′βi + wi(s) + εi(s), for i = 1, 2, . . . , q

  • ε(s) = (ε1(s), ε2(s), . . . , εq(s))′ ∼ N(0, E), where E is the q × q noise covariance matrix
  • w(s) = (w1(s), w2(s), . . . , wq(s))′ is modeled as a q-variate Gaussian process

SLIDE 35

Spatially varying coefficients

  • Often the relationship between the (univariate) spatial response and the covariates varies across space
  • The regression coefficients can then be modeled as spatial processes
  • Spatially varying coefficient (SVC) model: y(s) = x(s)′β(s) + ε(s)
  • Even though the response is univariate, β(s) is modeled as a p-variate GP

SLIDE 36

Multivariate GPs

  • Cov(w(si), w(sj)) = C(si, sj | θ) – a q × q cross-covariance matrix
  • Choices for the function C(·, · | θ):
  • Multivariate Matérn
  • Linear model of co-regionalization
  • For data observed at n locations, all choices lead to a dense nq × nq matrix C = Cov(w(s1), w(s2), . . . , w(sn))
  • Not scalable when nq is large

SLIDE 37

Multivariate NNGPs

  • The Cholesky factor approach is similar to the univariate case:

\[
\begin{pmatrix} w(s_1) \\ w(s_2) \\ w(s_3) \\ \vdots \\ w(s_n) \end{pmatrix}
=
\begin{pmatrix}
0 & & & & \\
A_{21} & 0 & & & \\
A_{31} & A_{32} & 0 & & \\
\vdots & \vdots & \ddots & \ddots & \\
A_{n1} & A_{n2} & \cdots & A_{n,n-1} & 0
\end{pmatrix}
\begin{pmatrix} w(s_1) \\ w(s_2) \\ w(s_3) \\ \vdots \\ w(s_n) \end{pmatrix}
+
\begin{pmatrix} \eta(s_1) \\ \eta(s_2) \\ \eta(s_3) \\ \vdots \\ \eta(s_n) \end{pmatrix}
\;\Longrightarrow\; w = Aw + \eta; \quad \eta \sim N(0, D), \; D = \mathrm{diag}(D_1, D_2, \ldots, D_n).
\]

  • The only differences: the w(si)'s and η(si)'s are q × 1 vectors, and the Aij's and Di's are q × q matrices

SLIDE 38

Multivariate NNGPs

  • Choose neighbor sets N(i) for each location si
  • Set Aij = 0 if j ∉ N(i)
  • Solve for the non-zero Aij's from the mq × mq linear system \(\sum_{j \in N(i)} A_{ij} w(s_j) = E(w(s_i) \mid \{w(s_j) : j \in N(i)\})\) (a toy sketch follows below)
  • Multivariate NNGP: w ∼ N(0, C̃), where C̃⁻¹ = (I − A)′D⁻¹(I − A)
  • C̃⁻¹ is sparse with O(nm²) non-zero q × q blocks
  • \(\det(\tilde{C}) = \prod_{i=1}^{n} \det(D_i)\)
  • Storage and computation needs remain linear in n
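A toy sketch of one row of this blocked construction (the separable cross-covariance C(si, sj) = exp(−φ‖si − sj‖) V is only an illustrative choice, as are all names and values):

```r
## Solve for the q x q blocks A_ij, j in N(i), and D_i for one row i.
set.seed(1)
q <- 2; m <- 3; n <- 10
V <- matrix(c(1, 0.5, 0.5, 1), q, q)                # inter-variable covariance
s <- matrix(runif(2 * n), n, 2)
Cb <- function(a, b) exp(-6 * sqrt(sum((s[a, ] - s[b, ])^2))) * V
i <- 8
h2 <- rowSums(sweep(s[1:(i - 1), , drop = FALSE], 2, s[i, ])^2)
Ni <- order(h2)[1:m]                                # m nearest earlier locations
CN <- matrix(0, m * q, m * q)                       # blocked neighbor covariance
for (a in 1:m) for (b in 1:m)
  CN[(a - 1) * q + 1:q, (b - 1) * q + 1:q] <- Cb(Ni[a], Ni[b])
ci <- do.call(rbind, lapply(Ni, function(j) Cb(j, i)))  # (mq) x q cross-covariance
Ai <- solve(CN, ci)                                 # stacked blocks A_i,N(i)
Di <- Cb(i, i) - t(ci) %*% Ai                       # q x q conditional covariance
```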

SLIDE 39

U.S. Forest biomass data

Figure: Observed biomass and NDVI.

  • Forest biomass data from measurements at 114,371 plots
  • NDVI (greenness) is used to predict forest biomass

SLIDE 40

U.S. Forest biomass data

Non-spatial model: Biomass = β0 + β1 NDVI + error;  β̂0 = 1.043, β̂1 = 0.0093

Figure: Residuals and variogram of the residuals.

Strong spatial pattern among residuals

SLIDE 41

Forest biomass dataset

  • n ≈ 10⁵ (forest biomass data) ⇒ a full GP requires ≈ 40 GB of storage and ≈ 140 hrs per iteration
  • We use a spatially varying coefficients NNGP model

Model:

  • Biomass(s) = β0(s) + β1(s) NDVI(s) + ε(s)
  • w(s) = (β0(s), β1(s))⊤ ∼ bivariate NNGP(0, C̃(·, · | θ)) with m = 5
  • Time ≈ 6 seconds per iteration
  • Full inferential output: 41 hours (25,000 MCMC iterations)

SLIDE 42

Forest biomass data

Figure: Observed biomass, fitted biomass, and the β0(s) and βNDVI(s) surfaces.

SLIDE 43

Reducing parameter dimensionality

  • The Gibbs sampler for the NNGP updates w(s1), w(s2), . . . , w(sn) sequentially
  • The dimension of the MCMC for this sequential algorithm is O(n)
  • If the number of data locations n is very large, this high-dimensional MCMC can converge slowly
  • Although each iteration of the NNGP model is very fast, many more MCMC iterations may be required

SLIDE 44

Collapsed NNGP

  • Same model:

    y(s) = x(s)′β + w(s) + ε(s);  w(s) ∼ NNGP(0, C(·, · | θ));  ε(s) ∼ iid N(0, τ²)

  • Vector form: y ∼ N(Xβ + w, τ²I); w ∼ N(0, C̃(θ))
  • Collapsed model: marginalizing out w gives y ∼ N(Xβ, τ²I + C̃(θ))

SLIDE 45

Collapsed NNGP

Model: y ∼ N(Xβ, τ²I + C̃(θ))

  • Only involves a few parameters: β, τ² and θ = (σ², φ)′
  • Drastically reduces the MCMC dimensionality
  • Gibbs sampler updates are based on sparse linear systems involving C̃⁻¹
  • Improved MCMC convergence
  • Can recover the posterior distribution of w | y
  • The complexity of the algorithm depends on the design of the data locations and is not guaranteed to be O(n)

SLIDE 46

Response NNGP

  • w(s) ∼ GP(0, C(·, · | θ)) ⇒ y(s) ∼ GP(x(s)′β, Σ(·, · | τ², θ))
  • Σ(si, sj) = C(si, sj | θ) + τ² δ(si = sj), where δ is the Kronecker delta
  • We can directly derive the NNGP covariance function corresponding to Σ(·, ·)
  • Σ̃ is the NNGP covariance matrix for the n locations
  • Response model: y ∼ N(Xβ, Σ̃) (see the sketch below)
  • Storage and computations are guaranteed to be O(n)
  • Low-dimensional MCMC ⇒ improved convergence
  • Cannot coherently recover w | y
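In the toy setting of the earlier sketches this amounts to running the same sparse construction on Σ instead of C (τ² = 0.1 is an assumed value; vecchia_factor, C and s are from the sketches above):

```r
## Response NNGP: factor Sigma = C + tau2 * I directly; y ~ N(X beta, Sigma_tilde).
Sigma <- C + 0.1 * diag(nrow(C))         # add the nugget on the diagonal
fy <- vecchia_factor(Sigma, s, m = 3)    # same m x m solves as before
sum(log(fy$d))                           # O(n) log-determinant of Sigma_tilde
```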

SLIDE 47

Comparison of NNGP models

                              Sequential   Collapsed   Response
  O(n) time                   Yes          No          Yes
  Recovery of w | y           Yes          Yes         No
  Parameter dimensionality    High         Low         Low

SLIDE 48

Comparison of NNGP models

Figure: Run time per iteration as a function of number of locations for different NNGP models

SLIDE 49

Comparison of NNGP models

Figure: MCMC convergence diagnostics using the Gelman-Rubin shrink factor for different NNGP models on a simulated dataset

SLIDE 50

Summary of Nearest Neighbor Gaussian Processes

  • Sparsity-inducing Gaussian process
  • Constructed from sparse Cholesky factors based on m nearest neighbors
  • Scalability: storage, inverse and determinant of the NNGP covariance matrix are all O(n)
  • Proper Gaussian process; allows inference using hierarchical spatial models and prediction at arbitrary spatial resolution
  • Closely approximates full GP inference; does not oversmooth like low rank models
  • Extension to multivariate NNGP
  • Collapsed and response NNGP models with improved MCMC convergence
  • spNNGP package in R for analyzing large spatial data using NNGP models (an illustrative call follows below)
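For instance, a minimal call might look like the following sketch (argument values are illustrative only, not recommendations; check the exact interface against the spNNGP documentation):

```r
library(spNNGP)
## Fit a latent (sequential) NNGP model to data like the earlier simulated example.
fit <- spNNGP(y ~ x, coords = coords, method = "latent", n.neighbors = 10,
              starting = list(phi = 6, sigma.sq = 1, tau.sq = 0.1),
              tuning = list(phi = 0.5),
              priors = list(phi.Unif = c(3, 30), sigma.sq.IG = c(2, 1),
                            tau.sq.IG = c(2, 0.1)),
              cov.model = "exponential", n.samples = 5000)
summary(fit)
```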
