Fast Direct Methods for Gaussian Processes
Mike O’Neil
Departments of Mathematics New York University
oneil@cims.nyu.edu
December 12, 2015
1
Collaborators
This is joint work with: Siva Ambikasaran (ICTS, Tata Institute), Dan Foreman-Mackey (UW Astronomy), Leslie Greengard (NYU Math, Simons Fnd.), David Hogg (NYU Physics), and Sunli Tang (NYU Math).
2
3
I will present a fast, direct (not iterative) algorithm for rapidly evaluating likelihood functions of the form

L(θ) = e^{-y^t C_θ(x)^{-1} y / 2} / ( (2π)^{n/2} |det C_θ(x)|^{1/2} ).

These likelihood functions universally occur when modeling under Gaussian process priors.
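As a point of reference, here is a minimal dense sketch (assuming NumPy) of evaluating this log-likelihood via a Cholesky factorization; this is the O(n^3) computation the fast direct method is designed to replace.

```python
import numpy as np

def gp_log_likelihood(y, C):
    """log L = -y^t C^{-1} y / 2 - (log det C) / 2 - (n/2) log(2 pi)."""
    n = len(y)
    L = np.linalg.cholesky(C)                  # C = L L^t, O(n^3) work
    alpha = np.linalg.solve(L, y)              # alpha = L^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))  # log det C from the Cholesky diagonal
    return -0.5 * alpha @ alpha - 0.5 * logdet - 0.5 * n * np.log(2.0 * np.pi)
```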
4
The n-dimensional normal distribution has the following properties: C is a positive-definite covariance matrix, with C_ij = Cov(X_i, X_j), and

P(X ∈ Ω) = ∫_Ω (2π)^{-n/2} |det C|^{-1/2} e^{-(y - μ)^t C^{-1} (y - μ)/2} dy
5
Stochastic processes are simply the generalization of random vectors to random functions; a Gaussian process is the infinite-dimensional analogue of the multivariate normal distribution.
6
7
8
We say that f(x) ∼ GP(m(x), k(x, x′)) if and only if

(f(x_1), …, f(x_n)) ∼ N(m, C),  where  m = (m(x_1) ⋯ m(x_n))^t  and  C_ij = k(x_i, x_j).

The function k is the covariance kernel.
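As an illustration, a short sketch (assuming NumPy; the zero mean and squared-exponential kernel below are arbitrary choices) of drawing realizations of a GP at a finite set of points by sampling the corresponding multivariate normal:

```python
import numpy as np

def sample_gp(x, m, k, n_samples=1, jitter=1e-10):
    """Sample f ~ GP(m, k) restricted to the points x."""
    C = k(x[:, None], x[None, :])      # C_ij = k(x_i, x_j)
    C += jitter * np.eye(len(x))       # tiny regularization for finite arithmetic
    return np.random.multivariate_normal(m(x), C, size=n_samples)

x = np.linspace(0.0, 1.0, 200)
paths = sample_gp(x, m=lambda t: np.zeros_like(t),
                  k=lambda s, t: np.exp(-(s - t) ** 2 / 0.1), n_samples=3)
```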
9
This means that each realization of a Gaussian process is a function; the corresponding probability distribution is over a space of functions. The function sampled at a selection of points behaves like a multivariate normal distribution, and each marginal distribution is normal: f(x_i) ∼ N(m(x_i), C_ii).
10
Admissible covariance kernels must give rise to positive-definite covariance matrices: y^t K(x, x) y > 0 for all choices of x, y. Common choices are:

k(x, x′) = e^{-|x - x′|/a}
k(x, x′) = e^{-(x - x′)^2/b}
k(x, x′) = (|x - x′|/c)^ν K_ν(√(2ν) |x - x′|/c)
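In code, these kernels might look as follows (a sketch assuming NumPy/SciPy; the parameters a, b, c, ν are placeholders, and the Matérn here includes the standard 2^{1-ν}/Γ(ν) normalization used later in the talk):

```python
import numpy as np
from scipy.special import gamma, kv   # kv = modified Bessel function K_nu

def k_exponential(r, a=1.0):
    return np.exp(-np.abs(r) / a)

def k_squared_exponential(r, b=1.0):
    return np.exp(-r ** 2 / b)

def k_matern(r, c=1.0, nu=1.5):
    u = np.sqrt(2.0 * nu) * np.abs(r) / c
    u = np.where(u == 0.0, 1e-12, u)  # K_nu is singular at 0; nudge r = 0
    return (2.0 ** (1.0 - nu) / gamma(nu)) * u ** nu * kv(nu, u)
```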
11
12
[Figure 10.26.4: K_ν(x), 0.1 ≤ x ≤ 5, 0 ≤ ν ≤ 4. (c) NIST Handbook]
The covariance kernel controls the regularity and the variability of the Gaussian process: compare k(x, x′) = e^{-(x - x′)^2/2.0} with k(x, x′) = e^{-(x - x′)^2/0.2}. (95% conditional confidence intervals are shaded.)
13
Imagine we have data {x, y}, and assume a model of the form

y_i = f(x_i) + ε_i.

Furthermore, assume that f is a Gaussian process and that ε is i.i.d. Gaussian noise (priors):

f ∼ GP(m_θ, k_θ),  ε_i ∼ N(0, σ^2).

We've assumed that f can depend on some parameters θ.
14
This implies that y is also a Gaussian process:

y(x) ∼ GP(m_θ(x), δ(x - x′) σ^2 + k_θ(x, x′))

The new covariance matrix is C = σ^2 I + K_θ.
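A one-line sketch of assembling this covariance matrix (assuming NumPy and a vectorized kernel k_theta):

```python
import numpy as np

def noisy_covariance(x, k_theta, sigma2):
    """C = sigma^2 I + K_theta, with (K_theta)_ij = k_theta(x_i, x_j)."""
    return sigma2 * np.eye(len(x)) + k_theta(x[:, None], x[None, :])
```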
15
Assume m(x) = 0. Then the joint distribution of y and a new predicted value y* is:

(y, y*) ∼ N( 0, [ C, k(x, x*) ; k(x*, x), k(x*, x*) ] )
16
The conditional distribution (posterior) of the predicted value can then be calculated as

y* | x, y, x* ∼ N(μ, η^2),

where

μ = k(x*, x) C^{-1} y
η^2 = k(x*, x*) - k(x*, x) C^{-1} k(x, x*)    (the Schur complement of C)
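These two formulas translate directly into a dense prediction routine (a sketch assuming NumPy and an elementwise-vectorized kernel; the Cholesky factor is reused for both the mean and the variance):

```python
import numpy as np

def gp_predict(x, y, xs, k, sigma2):
    """Posterior mean and variance at test points xs, given noisy data (x, y)."""
    C = k(x[:, None], x[None, :]) + sigma2 * np.eye(len(x))
    Kxs = k(xs[:, None], x[None, :])                     # cross-covariance k(x*, x)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # C^{-1} y
    V = np.linalg.solve(L, Kxs.T)                        # L^{-1} k(x, x*)
    mu = Kxs @ alpha                                     # k(x*, x) C^{-1} y
    var = k(xs, xs) - np.sum(V ** 2, axis=0)             # Schur complement diagonal
    return mu, var
```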
17
In general, k = k(x, x′; θ_1) and m = m(x; θ_2). The parameters θ have to be fit to the data (inference).

18
There are three main computational bottlenecks in dealing with large covariance matrices: applying C^{-1}, computing log det C, and factoring C = W^t W. The first two appear directly in the likelihood

L(θ) = e^{-y^t C_θ(x)^{-1} y / 2} / ( (2π)^{n/2} |det C_θ(x)|^{1/2} )
19
To avoid large-scale dense linear algebra, various techniques are often used to approximate the inverse and determinant. In many cases, these methods sacrifice the fidelity of the underlying model.
20
We want a data-sparse factorization of the covariance matrix which allows for fast inversion:

A = A_1 A_2 ⋯ A_m  means that  A^{-1} = A_m^{-1} A_{m-1}^{-1} ⋯ A_1^{-1}

and

det A = det A_1 det A_2 ⋯ det A_m.

Many factorizations can be used: HODLR, HBS, HSS, etc.
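A toy numerical check of these two identities (assuming NumPy; the shifted random matrices are just a convenient way to get well-conditioned factors):

```python
import numpy as np

rng = np.random.default_rng(0)
A1 = rng.standard_normal((5, 5)) + 5.0 * np.eye(5)
A2 = rng.standard_normal((5, 5)) + 5.0 * np.eye(5)
A = A1 @ A2

# A^{-1} = A2^{-1} A1^{-1} and det A = det A1 * det A2
assert np.allclose(np.linalg.inv(A), np.linalg.inv(A2) @ np.linalg.inv(A1))
assert np.isclose(np.linalg.det(A), np.linalg.det(A1) * np.linalg.det(A2))
```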
21
What properties of the Gaussian process covariance matrix allow for it to be rapidly factored? For example, most covariance kernels yield matrices which are Hierarchically Off-Diagonal Low-Rank (HODLR).

[Figure: HODLR block structure; legend: full rank, low rank.]
22
The resulting (numerical) factorization will have the form:

K^{(3)} = K_3 × K_2 × K_1 × K_0

[Figure: block structure of the factors; legend: full rank, low-rank, identity matrix, zero matrix.]
23
Physical interpretation?
24
F = -G m_1 m_2 / r^2
25
Calculations required (in FINITE arithmetic): O(N).

Fast Multipole Methods: F = -G m_1 m_2 / r^2
26
Similar algorithms exist for heat flow, where the distribution looks like a Gaussian:

[Figure: temperature vs. distance, Gaussian profile.]
27
Time series data which are close together have a stronger covariance relationship than data which are far apart. Often this covariance is modeled using the heat kernel:

[Figure: temperature vs. distance, Gaussian profile.]
We can easily show how the factorization works for a 1-level scheme:
28
We can easily show how the factorization works for a 1-level scheme:

[ A_11, U V^t ; V U^t, A_22 ] = [ A_11, 0 ; 0, A_22 ] × [ I, A_11^{-1} U V^t ; A_22^{-1} V U^t, I ]
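A quick sanity check of this one-level identity on random blocks (a sketch assuming NumPy; toy sizes only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 2
A11 = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned diagonal blocks
A22 = rng.standard_normal((n, n)) + n * np.eye(n)
U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))

A = np.block([[A11, U @ V.T], [V @ U.T, A22]])
D = np.block([[A11, np.zeros((n, n))], [np.zeros((n, n)), A22]])
S = np.block([[np.eye(n), np.linalg.solve(A11, U @ V.T)],
              [np.linalg.solve(A22, V @ U.T), np.eye(n)]])
assert np.allclose(A, D @ S)   # block diagonal times (identity + low-rank)
```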
29
Assume we have the factors U and V for now. A 2-level factorization is slightly more complicated:

A = [ A_1, U V^t ; V U^t, A_2 ]
  = diag(A_11, A_22, A_33, A_44)
    × diag( [ I, A_11^{-1} U_1 V_1^t ; A_22^{-1} V_1 U_1^t, I ],
            [ I, A_33^{-1} U_2 V_2^t ; A_44^{-1} V_2 U_2^t, I ] )
    × [ I, A_1^{-1} U V^t ; A_2^{-1} V U^t, I ]

where

A_1 = [ A_11, U_1 V_1^t ; V_1 U_1^t, A_22 ]  and  A_2 = [ A_33, U_2 V_2^t ; V_2 U_2^t, A_44 ].
30
How do we invert

A_1 = [ A_11, U_1 V_1^t ; V_1 U_1^t, A_22 ] = [ A_11, 0 ; 0, A_22 ] + [ U_1, 0 ; 0, V_1 ] [ 0, V_1^t ; U_1^t, 0 ] ?

We can use the Sherman-Morrison-Woodbury formula:

(A + UV)^{-1} = A^{-1} - A^{-1} U (I + V A^{-1} U)^{-1} V A^{-1}

If U, V are rank k, then (I + V A^{-1} U) is a k × k matrix.
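A sketch of the resulting inversion routine (assuming NumPy; here U is n × k and V is k × n, so only a small k × k system is ever formed):

```python
import numpy as np

def woodbury_apply_inverse(A_inv_apply, U, V):
    """Return a function applying (A + U V)^{-1} via Sherman-Morrison-Woodbury."""
    AinvU = A_inv_apply(U)                    # n x k
    core = np.eye(U.shape[1]) + V @ AinvU     # the small k x k system
    def apply(x):
        Ainvx = A_inv_apply(x)
        return Ainvx - AinvU @ np.linalg.solve(core, V @ Ainvx)
    return apply

# Toy check against a dense solve
rng = np.random.default_rng(2)
n, k = 50, 3
A = np.diag(rng.uniform(1.0, 2.0, n))
U, V = rng.standard_normal((n, k)), rng.standard_normal((k, n))
solve = woodbury_apply_inverse(lambda x: np.linalg.solve(A, x), U, V)
x = rng.standard_normal(n)
assert np.allclose(solve(x), np.linalg.solve(A + U @ V, x))
```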
31
32
Proceeding recursively…

K^{(3)} = K_3 × K_2 × K_1 × K_0

[Figure: block structure of the factors; legend: full rank, low-rank, identity matrix, zero matrix.]
Two classes of techniques for constructing the low-rank factors: most linear algebraic dense methods scale as O(n^2 k); analytic (polynomial) interpolation of the kernel can scale as O(n k^2). Special function expansions may be fastest.
33
For example, the squared-exponential (Gaussian) function can be written as:

e^{-(t - s)^2/ℓ} = e^{-t^2/ℓ} Σ_{n=0}^∞ (1/n!) (s/√ℓ)^n H_n(t/√ℓ)

Similar (approximate) expansions can be calculated for most kernels in terms of other special functions.
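The expansion is easy to verify numerically (a sketch assuming NumPy; ℓ, the evaluation points, and the truncation order are arbitrary choices):

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermval  # physicists' Hermite polynomials H_n

ell, N = 1.0, 30
t, s = 0.7, -0.4
coeffs = [(s / math.sqrt(ell)) ** n / math.factorial(n) for n in range(N)]
approx = math.exp(-t ** 2 / ell) * hermval(t / math.sqrt(ell), coeffs)
assert np.isclose(approx, math.exp(-(t - s) ** 2 / ell))
```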
34
Matérn kernels:
35
k(r) = (2^{1-ν}/Γ(ν)) (√(2ν) r/ℓ)^ν K_ν(√(2ν) r/ℓ)

Bessel functions obey the addition formula:

K_ν(w - z) = Σ_{j=-∞}^∞ (-1)^j K_{ν+j}(w) I_j(z)
36
When given a form of the covariance kernel, we have the ability to evaluate it anywhere, including at locations other than those supplied in the data:

K(x, y) ≈ Σ_{i=1}^p Σ_{j=1}^p K(x_i, y_j) L_i(x) L_j(y)

Choose the interpolation nodes to be Chebyshev nodes for stable evaluation and formulaic factorizations.
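A compact sketch of this interpolation-based factorization (assuming NumPy; `chebyshev_lowrank` and its helpers are hypothetical names, and the well-separated clusters in the check are what make the block numerically low rank):

```python
import numpy as np

def cheb_nodes(a, b, p):
    """p Chebyshev nodes on [a, b]."""
    return 0.5 * (a + b) + 0.5 * (b - a) * np.cos((2 * np.arange(p) + 1) * np.pi / (2 * p))

def lagrange_basis(nodes, x):
    """Matrix whose columns are the Lagrange basis functions L_i(x) on the nodes."""
    L = np.ones((len(x), len(nodes)))
    for i in range(len(nodes)):
        for j in range(len(nodes)):
            if i != j:
                L[:, i] *= (x - nodes[j]) / (nodes[i] - nodes[j])
    return L

def chebyshev_lowrank(kernel, xs, ys, p):
    """K(xs, ys) ~ U @ W.T, with U built from xs only and W from ys only."""
    xi, yj = cheb_nodes(xs.min(), xs.max(), p), cheb_nodes(ys.min(), ys.max(), p)
    U = lagrange_basis(xi, xs) @ kernel(xi[:, None], yj[None, :])
    W = lagrange_basis(yj, ys)
    return U, W

xs, ys = np.linspace(0.0, 1.0, 100), np.linspace(3.0, 4.0, 100)  # well separated
kernel = lambda x, y: np.exp(-(x - y) ** 2)
U, W = chebyshev_lowrank(kernel, xs, ys, p=12)
assert np.allclose(U @ W.T, kernel(xs[:, None], ys[None, :]), atol=1e-6)
```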
37
Using ACA (LU) with randomized accuracy checking, or Chebyshev interpolation, we can factor A = UV in O(n p^2) time, where p is the numerical rank.

Compute all low-rank factorizations: O(n p^2). On level k, we have to update k other U blocks: O(k n p^2). Summing over levels,

Σ_{k=1}^{log n} O(k p^2 n) = O(p^2 n log^2 n).

Inversion and determinant are both O(p^2 n log n).
Rapidly calculating the determinant of this factorization relies on Sylvester’s Determinant Theorem.
Theorem 4.1 (Sylvester's Determinant Theorem). If A ∈ R^{m×n} and B ∈ R^{n×m}, then det(I_m + AB) = det(I_n + BA), where I_k ∈ R^{k×k} is the identity matrix. In particular, for a rank-p update to the identity matrix, det(I_n + U_{n×p} V_{p×n}) = det(I_p + V_{p×n} U_{n×p}).
The dense determinant of a p × p matrix requires O(p^3) calculations.
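A toy check of the theorem (assuming NumPy), showing the n × n determinant collapsing to a p × p one:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 4
U, V = rng.standard_normal((n, p)), rng.standard_normal((p, n))
big = np.linalg.det(np.eye(n) + U @ V)    # O(n^3)
small = np.linalg.det(np.eye(p) + V @ U)  # O(p^3)
assert np.isclose(big, small)
```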
38
Time taken (in seconds):

n        Assembly   Factor   Solve   Det.   Error
10^4     0.12       0.11     0.008   0.01   10^-13
2×10^4   0.15       0.23     0.016   0.03   10^-13
5×10^4   0.47       0.71     0.036   0.12   10^-12
10^5     1.24       1.46     0.052   0.24   10^-12
2×10^5   2.14       3.12     0.121   0.39   10^-13
5×10^5   6.13       10.2     0.388   0.56   10^-12
10^6     14.1       23.2     0.834   1.52   10^-12

Timings for the one-dimensional Gaussian covariance function. The matrix entry is given as C_ij = δ_ij + exp(-(r_i - r_j)^2), where the r_i are uniformly distributed random points in the interval [-3, 3].

(Results computed on a MacBook Air)
[Figure: solve time vs. system size, comparing a direct dense solver with the fast method in 1D, 2D, and 3D.]
39
(Results computed on a MacBook Air)
Time taken (in seconds):

n        Assembly   Factor   Solve   Det.   Error
10^4     0.28       0.31     0.015   0.03   10^-12
2×10^4   0.61       0.59     0.028   0.06   10^-12
5×10^4   1.62       1.68     0.067   0.15   10^-12
10^5     3.61       3.93     0.123   0.34   10^-12
2×10^5   8.03       10.7     0.236   0.65   10^-12
5×10^5   26.7       31.2     0.632   1.41   10^-11
10^6     51.3       81.9     1.28    3.40   10^-11

Timings for the one-dimensional biharmonic covariance function. The matrix entry is given as C_ij = 2δ_ij + |r_i - r_j|^2 log(|r_i - r_j|), where the r_i are uniformly distributed random points in the interval [-3, 3].

[Figure: solve time vs. system size for the exponential, inverse multiquadric, and biharmonic kernels, compared with a direct dense solver.]
40
41
http://dan.iel.fm/george/
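For context, a sketch of how the george package exposes this machinery (assuming the george ≥ 0.3 API, in which HODLRSolver implements the hierarchical factorization discussed in this talk):

```python
import numpy as np
import george
from george import kernels

x = np.sort(np.random.uniform(0.0, 10.0, 5000))
yerr = 0.1 * np.ones_like(x)
y = np.sin(x) + yerr * np.random.randn(len(x))

gp = george.GP(kernels.ExpSquaredKernel(1.0), solver=george.HODLRSolver)
gp.compute(x, yerr)                 # factor C = sigma^2 I + K hierarchically
print(gp.log_likelihood(y))         # fast determinant + solve
mu, var = gp.predict(y, x, return_var=True)
```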
42
To sample from a high-dimensional Gaussian distribution it is necessary to apply a symmetric factor of C. If C = C^{1/2} C^{1/2} = L L^T = W W^T and z ∼ N(0, I), then x = C^{1/2} z ∼ N(0, C).
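In the dense setting this is just a Cholesky application (a minimal sketch assuming NumPy); the point of the factorization below is to apply such a W in far less than the O(n^2) dense cost per sample:

```python
import numpy as np

def sample_gaussian(C, n_samples=1):
    """Draw x ~ N(0, C) by applying a symmetric factor of C."""
    L = np.linalg.cholesky(C)                  # C = L L^T
    z = np.random.standard_normal((C.shape[0], n_samples))
    return L @ z                               # each column ~ N(0, C)
```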
43
We wish to obtain a factorization of the form:

A = A_κ A_{κ-1} A_{κ-2} ⋯ A_0 · A_0^T ⋯ A_{κ-2}^T A_{κ-1}^T A_κ^T = W W^T

[Figure: block structure of A_3 × A_2 × A_1 × A_0 applied together with its transpose.]
We just presented fast direct algorithms for computing with Gaussian processes. A certain class of hierarchical factorization algorithms can be used to efficiently manipulate the associated covariance matrices.
44
45
46
Even more activities in classical mathematical physics
The best Gaussian process reference: www.gaussianprocess.org/gpml/
For a nice Python package: dan.iel.fm/george/
Reference papers: www.cims.nyu.edu/~oneil
Funding: AFOSR NSSEFF Program Award FA9550-10-1-0180; AIG-NYU Award #A15-0098-001
47