

SLIDE 1

Fast Direct Methods for Gaussian Processes

Mike O’Neil

Department of Mathematics, New York University

oneil@cims.nyu.edu

December 12, 2015


SLIDE 2

Collaborators

This is joint work with:

  • Siva Ambikasaran (ICTS, Tata Institute)
  • Dan Foreman-Mackey (UW Astronomy)
  • Leslie Greengard (NYU Math, Simons Fnd.)
  • David Hogg (NYU Physics)
  • Sunli Tang (NYU Math)


SLIDE 3

The short story: Results

I will present a fast, direct (not iterative) algorithm for rapidly evaluating likelihood functions of the form:

$$\mathcal{L}(\theta) \propto \frac{1}{(\det C)^{1/2}} \, e^{-y^t C^{-1} y/2}$$

These likelihood functions occur universally when modeling under Gaussian process priors.

SLIDE 4

Outline

  • Probability review
  • Gaussian process overview
  • Applications: Regression & prediction
  • Computational bottlenecks
  • A fast algorithm
  • Numerical results
  • Other and future work


SLIDE 5

Multivariate normal

The n-dimensional normal distribution has the following properties: $C$ is a positive-definite covariance matrix with $C_{ij} = \mathrm{Cov}(X_i, X_j)$, and

$$P(\vec{X} \in \Omega) = \int_\Omega \frac{1}{(2\pi)^{n/2} \, |\det C|^{1/2}} \; e^{-(\vec{y} - \vec{\mu})^t C^{-1} (\vec{y} - \vec{\mu})/2} \, d\vec{y}$$
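As a quick illustration (my addition, not from the slides), this density can be evaluated directly with SciPy for a small example; the covariance below is an arbitrary positive-definite choice:

```python
import numpy as np
from scipy.stats import multivariate_normal

n = 3
mu = np.zeros(n)
C = np.eye(n) + 0.5 * np.ones((n, n))   # positive definite: eigenvalues 1, 1, 2.5
y = np.array([0.1, -0.2, 0.3])

# Density of the n-dimensional normal at y, matching the formula above
print(multivariate_normal(mean=mu, cov=C).pdf(y))
```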

SLIDE 6

Stochastic processes

Stochastic processes are simply the generalization of random variables to random functions.

SLIDE 7

Gaussian processes are stochastic processes


SLIDE 8

Gaussian processes

Random functions with structured covariance


SLIDE 9

Gaussian processes

We say that $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$ if and only if

$$f(x_1), \ldots, f(x_n) \sim \mathcal{N}(m, C)$$

where $m = (m(x_1) \, \cdots \, m(x_n))^t$ and $C_{ij} = k(x_i, x_j)$. The function $k$ is the covariance kernel.

SLIDE 10

Gaussian processes

This means that each realization of a Gaussian process is a function; the corresponding probability distribution is over functions.

This function, sampled at a selection of points, behaves like a multivariate normal distribution. Each marginal distribution is normal: $f(x_i) \sim \mathcal{N}(m(x_i), C_{ii})$.

SLIDE 11

Role of the covariance kernel

Admissible covariance kernels must give rise to positive-definite covariance matrices:

$$y^t K(x, x) \, y > 0$$

for all choices of $x$, $y$. Common choices are:

$$k(x, x') = e^{-|x - x'|/a}$$
$$k(x, x') = e^{-(x - x')^2/b}$$
$$k(x, x') = \left( \frac{|x - x'|}{c} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\, |x - x'|}{c} \right)$$
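As a rough Python sketch (my addition), these three kernels might be written as follows; the parameter names a, b, c, and nu mirror the formulas above, and the Matérn-type kernel here omits the $2^{1-\nu}/\Gamma(\nu)$ normalization that appears later in the talk:

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind, K_nu

def k_exponential(x, xp, a=1.0):
    """k(x, x') = exp(-|x - x'| / a)"""
    return np.exp(-np.abs(x - xp) / a)

def k_squared_exponential(x, xp, b=1.0):
    """k(x, x') = exp(-(x - x')^2 / b)"""
    return np.exp(-((x - xp) ** 2) / b)

def k_matern_type(x, xp, c=1.0, nu=1.5):
    """k(x, x') = (|x - x'| / c)^nu * K_nu(sqrt(2 nu) |x - x'| / c)"""
    r = np.maximum(np.abs(x - xp), 1e-12)  # K_nu is singular at r = 0
    return (r / c) ** nu * kv(nu, np.sqrt(2.0 * nu) * r / c)
```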

SLIDE 12

Many options for the covariance kernel

[Figure: example plots of several covariance kernel shapes.]

Figure 10.26.4: $K_\nu(x)$, $0.1 \le x \le 5$, $0 \le \nu \le 4$. ((c) NIST Handbook)

SLIDE 13

Role of the covariance kernel

The covariance kernel controls the regularity and the variability of the Gaussian process:

$$k(x, x') = e^{-(x - x')^2/20} \qquad \text{vs.} \qquad k(x, x') = e^{-(x - x')^2/0.2}$$

(95% conditional confidence intervals are shaded.)

SLIDE 14

GP’s in action: Prediction

Imagine we have data $\{x, y\}$, and assume a model of the form:

$$y_i = f(x_i) + \epsilon_i$$

Furthermore, assume that $f$ is a Gaussian process and that $\epsilon_i$ is i.i.d. Gaussian noise (priors):

$$f \sim \mathcal{GP}(m_\theta, k_\theta), \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

We've assumed that $f$ can depend on some other set of parameters $\theta$, which we may fit later.

SLIDE 15

Prediction

This implies that $y$ is also a Gaussian process:

$$y(x) \sim \mathcal{GP}\big(m_\theta(x), \; \delta(x - x')\,\sigma^2 + k_\theta(x, x')\big)$$

The new covariance matrix is:

$$C = \sigma^2 I + K_\theta$$

SLIDE 16

Prediction

Assume $m(x) = 0$. Then the joint distribution of $y$ and a new predicted value $y^*$ is:

$$\begin{pmatrix} y \\ y^* \end{pmatrix} \sim \mathcal{N}\left( 0, \; \begin{pmatrix} C & k(x, x^*) \\ k(x^*, x) & k(x^*, x^*) \end{pmatrix} \right)$$

SLIDE 17

Prediction

The conditional distribution (posterior) of the predicted value can then be calculated as:

$$y^* \mid x, y, x^* \sim \mathcal{N}(\mu, \eta^2)$$

where

$$\mu = k(x^*, x)\, C^{-1} y$$
$$\eta^2 = k(x^*, x^*) - k(x^*, x)\, C^{-1} k(x, x^*)$$

(The predictive variance $\eta^2$ is the Schur complement of $C$.)
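A minimal dense NumPy sketch of these prediction formulas (my own illustration; the kernel and toy data are placeholders, and the $O(n^3)$ solves here are exactly what the fast algorithm replaces):

```python
import numpy as np

def gp_predict(x, y, x_star, kernel, sigma2=0.01):
    """Posterior mean mu and variance eta^2 at a test point x_star."""
    C = sigma2 * np.eye(len(x)) + kernel(x[:, None], x[None, :])
    k_star = kernel(x_star, x)                       # k(x*, x)
    mu = k_star @ np.linalg.solve(C, y)              # k(x*, x) C^{-1} y
    eta2 = kernel(x_star, x_star) - k_star @ np.linalg.solve(C, k_star)
    return mu, eta2

se = lambda a, b: np.exp(-(a - b) ** 2 / 0.5)        # squared-exponential kernel
x = np.linspace(-3, 3, 50)
y = np.sin(x) + 0.1 * np.random.default_rng(0).standard_normal(50)
print(gp_predict(x, y, 0.25, se))
```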

SLIDE 18

What about the extra parameters?

In general,

$$k = k(x, x'; \theta_1), \qquad m = m(x; \theta_2)$$

The parameters have to be fit to the data (inference), or marginalized away (integrated out).

SLIDE 19

Computational bottlenecks

There are three main computational bottlenecks in dealing with large covariance matrices:

  • Computation/application of $C^{-1}$
  • Computation of $\log \det C$
  • Symmetric factorization $C = W^t W$

These all arise in evaluating the likelihood:

$$\mathcal{L}(\theta) = \frac{e^{-y^t C_\theta(x)^{-1} y / 2}}{(2\pi)^{n/2} \, |\det C_\theta(x)|^{1/2}}$$
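For reference, a dense NumPy sketch of this log-likelihood (my addition) makes the bottlenecks concrete: the single Cholesky factorization below yields $C^{-1}y$, $\log \det C$, and a symmetric factor all at once, but costs $O(n^3)$:

```python
import numpy as np

def dense_gp_loglik(y, C):
    """Naive O(n^3) Gaussian log-likelihood via a dense Cholesky factor."""
    n = len(y)
    L = np.linalg.cholesky(C)                            # C = L L^t  (symmetric factor)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # C^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))            # log det C
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2.0 * np.pi)
```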

SLIDE 20

Alternative existing methods

To avoid large-scale dense linear algebra, various techniques are often used to approximate the inverse and determinant:

  • Element thresholding (banded/sparse matrix)
  • Special-case kernels (exponential)
  • Subsampling
  • Iterative methods

In many cases, these methods sacrifice the fidelity of the underlying model.


SLIDE 21

Hierarchical matrix compression

We want a data-sparse factorization of the covariance matrix which allows for fast inversion:

$$A = A_1 A_2 \cdots A_m$$

means that

$$A^{-1} = A_m^{-1} A_{m-1}^{-1} \cdots A_1^{-1}$$
$$\det A = \det A_1 \, \det A_2 \cdots \det A_m$$

Many factorizations can be used: HODLR, HBS, HSS, etc.

SLIDE 22

Hierarchical matrix compression

What properties of the Gaussian process covariance matrix allow it to be rapidly factored? For example, most covariance kernels yield matrices which are Hierarchically Off-Diagonal Low-Rank (HODLR).

[Figure: HODLR block structure. Legend: full rank; low rank.]

SLIDE 23

Hierarchical matrix compression

The resulting (numerical) factorization will have the form:

$$K^{(3)} = K_3 \times K_2 \times K_1 \times K_0$$

[Figure: block structure of each factor. Legend: full rank; low rank; identity matrix; zero matrix.]

Physical interpretation?

SLIDE 24


Motivation: Physics

$$F = -G \, \frac{m_1 m_2}{r^2}$$

SLIDE 25


Motivation: Physics

Calculations required (in finite-precision arithmetic): $O(N)$, via Fast Multipole Methods, for

$$F = -G \, \frac{m_1 m_2}{r^2}$$

SLIDE 26


Motivation: Physics

Similar algorithms exist for heat flow, where the distribution looks like a Gaussian:

[Figure: temperature vs. distance; a Gaussian-shaped profile.]

SLIDE 27


Connection with statistics

Time series data which occurs close in time has a stronger relationship than data occurring far apart. Often this covariance is modeled using the heat kernel:

[Figure: temperature vs. distance; a Gaussian-shaped profile.]

SLIDE 28

A 1-level scheme - in pictures

We can easily show how the factorization works for a 1-level scheme:

[Figure: block pictures of the 1-level factorization. Legend: full rank; low rank; identity matrix; zero matrix.]

SLIDE 29

A 1-level scheme - in formulae

We can easily show how the factorization works for a 1-level scheme:

$$\begin{pmatrix} A_{11} & UV^t \\ VU^t & A_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & \\ & A_{22} \end{pmatrix} \times \begin{pmatrix} I & A_{11}^{-1} UV^t \\ A_{22}^{-1} VU^t & I \end{pmatrix}$$

Assume we have the factors $U$ and $V$ for now.
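A small NumPy check of this one-level identity (my own sketch, with random diagonal blocks and rank-2 off-diagonal factors):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2
A11 = np.eye(n) + 0.1 * rng.standard_normal((n, n))
A22 = np.eye(n) + 0.1 * rng.standard_normal((n, n))
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))

A = np.block([[A11, U @ V.T], [V @ U.T, A22]])   # the 1-level HODLR matrix
D = np.block([[A11, np.zeros((n, n))], [np.zeros((n, n)), A22]])
S = np.block([[np.eye(n), np.linalg.solve(A11, U @ V.T)],
              [np.linalg.solve(A22, V @ U.T), np.eye(n)]])

assert np.allclose(A, D @ S)   # A = (block diagonal) x (low-rank update of I)
```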

SLIDE 30

A 2-level scheme - in formulae

A 2-level factorization is slightly more complicated…

$$A = \begin{pmatrix} A_1 & UV^t \\ VU^t & A_2 \end{pmatrix} = \begin{pmatrix} A_{11} & & & \\ & A_{22} & & \\ & & A_{33} & \\ & & & A_{44} \end{pmatrix} \times \begin{pmatrix} I & A_{11}^{-1} U_1 V_1^t & & \\ A_{22}^{-1} V_1 U_1^t & I & & \\ & & I & A_{33}^{-1} U_2 V_2^t \\ & & A_{44}^{-1} V_2 U_2^t & I \end{pmatrix} \times \begin{pmatrix} I & A_1^{-1} UV^t \\ A_2^{-1} VU^t & I \end{pmatrix}$$

where

$$A_1 = \begin{pmatrix} A_{11} & U_1 V_1^t \\ V_1 U_1^t & A_{22} \end{pmatrix} \qquad \text{and} \qquad A_2 = \begin{pmatrix} A_{33} & U_2 V_2^t \\ V_2 U_2^t & A_{44} \end{pmatrix}$$

SLIDE 31

A 2-level scheme - in formulae

How do we invert

$$A_1 = \begin{pmatrix} A_{11} & U_1 V_1^t \\ V_1 U_1^t & A_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & \\ & A_{22} \end{pmatrix} + \begin{pmatrix} U_1 & \\ & V_1 \end{pmatrix} \begin{pmatrix} 0 & V_1^t \\ U_1^t & 0 \end{pmatrix} \; ?$$

We can use the Sherman-Morrison-Woodbury formula:

$$(A + UV)^{-1} = A^{-1} - A^{-1} U \left( I + V A^{-1} U \right)^{-1} V A^{-1}$$

If $U$, $V$ have rank $k$, then $I + V A^{-1} U$ is a $k \times k$ matrix.
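A NumPy sketch of this step (my addition, written for an update of the form $UV^t$ with rank-$k$ factors $U$, $V$): only a $k \times k$ dense solve is needed beyond applications of $A^{-1}$.

```python
import numpy as np

def smw_solve(A_inv_apply, U, V, b):
    """Apply (A + U V^t)^{-1} to b using Sherman-Morrison-Woodbury."""
    Ainv_b = A_inv_apply(b)
    Ainv_U = A_inv_apply(U)
    cap = np.eye(U.shape[1]) + V.T @ Ainv_U      # the small k x k matrix
    return Ainv_b - Ainv_U @ np.linalg.solve(cap, V.T @ Ainv_b)

rng = np.random.default_rng(1)
n, k = 100, 3
U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))
b = rng.standard_normal(n)
x = smw_solve(lambda z: z / 5.0, U, V, b)        # here A = 5 I, so A^{-1} z = z / 5
assert np.allclose(x, np.linalg.solve(5.0 * np.eye(n) + U @ V.T, b))
```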

SLIDE 32

A log(n)-level scheme

Proceeding recursively…

$$K^{(3)} = K_3 \times K_2 \times K_1 \times K_0$$

[Figure: block structure of each factor. Legend: full rank; low rank; identity matrix; zero matrix.]

SLIDE 33

Low-rank factorizations

Two classes of techniques:

  • Linear algebraic: SVD, LU, LSR, etc.
  • Analytic: explicit outer-product expansions

Most linear algebraic dense methods scale as $O(n^2 k)$. Analytic (polynomial) interpolation of the kernel can scale as $O(n k^2)$. Special function expansions may be fastest.

SLIDE 34

Analytic approximation

For example, the squared-exponential (Gaussian) function can be written as:

$$e^{-(t-s)^2/\ell} = e^{-t^2/\ell} \sum_{n=0}^{\infty} \frac{1}{n!} \left( \frac{s}{\sqrt{\ell}} \right)^{n} H_n\!\left( \frac{t}{\sqrt{\ell}} \right)$$

Similar (approximate) expansions can be calculated for most kernels in terms of other special functions.
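A quick numerical check of this expansion (my own sketch), truncating the series and comparing against direct evaluation of the kernel:

```python
import numpy as np
from scipy.special import eval_hermite, factorial

def se_hermite(t, s, ell=1.0, nterms=25):
    """Truncated Hermite expansion of exp(-(t - s)^2 / ell)."""
    total = np.zeros_like(t, dtype=float)
    for n in range(nterms):
        total += (s / np.sqrt(ell)) ** n / factorial(n) * eval_hermite(n, t / np.sqrt(ell))
    return np.exp(-t ** 2 / ell) * total

t = np.linspace(-2.0, 2.0, 9)
print(np.max(np.abs(se_hermite(t, 0.3) - np.exp(-(t - 0.3) ** 2))))  # near machine precision
```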

SLIDE 35

Analytic approximation

Matérn kernels:

$$k(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{\ell} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\, r}{\ell} \right)$$

Bessel functions obey the addition formula:

$$K_{\nu}(w - z) = \sum_{j=-\infty}^{\infty} (-1)^j \, K_{\nu+j}(w) \, I_j(z)$$

SLIDE 36

High-order interpolation

When given a form of the covariance kernel, we have the ability to evaluate it anywhere, including at locations other than those supplied in the data:

$$K(x, y) \approx \sum_{i=1}^{p} \sum_{j=1}^{p} K(x_i, y_j) \, L_i(x) \, L_j(y)$$

Choose the interpolation nodes to be Chebyshev nodes for stable evaluation and formulaic factorizations.
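A sketch of how such an interpolation yields an explicit low-rank factorization (my addition; it assumes the two point sets are well separated, so the kernel is smooth over their product):

```python
import numpy as np

def chebyshev_lowrank(kernel, xs, ys, p=16):
    """Low-rank factorization K(xs, ys) ~ U @ V.T via Chebyshev interpolation."""
    def nodes(a, b):
        k = np.arange(p)
        return 0.5 * (a + b) + 0.5 * (b - a) * np.cos((2 * k + 1) * np.pi / (2 * p))

    def lagrange(pts, nds):
        # Evaluate the p Lagrange basis polynomials on the nodes at pts
        L = np.ones((len(pts), p))
        for i in range(p):
            for j in range(p):
                if i != j:
                    L[:, i] *= (pts - nds[j]) / (nds[i] - nds[j])
        return L

    xn, yn = nodes(xs.min(), xs.max()), nodes(ys.min(), ys.max())
    Lx, Ly = lagrange(xs, xn), lagrange(ys, yn)
    Knodes = kernel(xn[:, None], yn[None, :])     # p x p kernel samples at the nodes
    return Lx @ Knodes, Ly                        # U is n x p, V is m x p

# Example: Gaussian kernel between two well-separated clusters of points
xs = np.linspace(0.0, 1.0, 200)
ys = np.linspace(3.0, 4.0, 200)
ker = lambda x, y: np.exp(-(x - y) ** 2)
U, V = chebyshev_lowrank(ker, xs, ys)
print(np.max(np.abs(U @ V.T - ker(xs[:, None], ys[None, :]))))  # small
```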

SLIDE 37

Computational complexity

Using ACA (LU) with randomized accuracy checking, or Chebyshev interpolation, we can factor $A = UV$ in $O(np^2)$ time, where $p$ is the numerical rank.

  • Compute all low-rank factorizations: $O(np^2)$
  • On level $k$, we have to update $k$ other $U$ blocks: $O(knp^2)$

$$\sum_{k=1}^{\log n} O(k p^2 n) = O(p^2 n \log^2 n)$$

Inversion and determinant are both: $O(p^2 n \log n)$

SLIDE 38

Fast determinant calculation

Rapidly calculating the determinant of this factorization relies on Sylvester’s Determinant Theorem.

Theorem 4.1 (Sylvester's Determinant Theorem). If $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$, then

$$\det(I_m + AB) = \det(I_n + BA),$$

where $I_k \in \mathbb{R}^{k \times k}$ is the identity matrix. In particular, for a rank-$p$ update to the identity matrix,

$$\det(I_n + U_{n \times p} V_{p \times n}) = \det(I_p + V_{p \times n} U_{n \times p}).$$

The dense determinant of a $p \times p$ matrix requires $O(p^3)$ calculations.
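A quick numerical confirmation (my addition) that the $n \times n$ determinant of a rank-$p$ update collapses to a $p \times p$ determinant:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 4
U = rng.standard_normal((n, p))
V = rng.standard_normal((p, n))

det_big = np.linalg.det(np.eye(n) + U @ V)    # O(n^3)
det_small = np.linalg.det(np.eye(p) + V @ U)  # O(p^3)
print(det_big, det_small)                     # equal up to rounding error
```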

SLIDE 39

Numerical results

Time taken in seconds:

    n          Assembly   Factor   Solve   Det.   Error
    10^4       0.12       0.11     0.008   0.01   10^-13
    2 x 10^4   0.15       0.23     0.016   0.03   10^-13
    5 x 10^4   0.47       0.71     0.036   0.12   10^-12
    10^5       1.24       1.46     0.052   0.24   10^-12
    2 x 10^5   2.14       3.12     0.121   0.39   10^-13
    5 x 10^5   6.13       10.2     0.388   0.56   10^-12
    10^6       14.1       23.2     0.834   1.52   10^-12

Timings for the one-dimensional Gaussian covariance function. The matrix entry is given as $C_{ij} = \delta_{ij} + \exp(-|r_i - r_j|^2)$, where the $r_i$ are uniformly distributed random points in the interval $[-3, 3]$.

(Results computed on a MacBook Air.)

[Figure: log-log plot of solve time vs. system size, comparing the direct solver with the fast solver in 1D, 2D, and 3D.]

SLIDE 40

Numerical results

(Results computed on a MacBook Air.)

Time taken in seconds:

    n          Assembly   Factor   Solve   Det.   Error
    10^4       0.28       0.31     0.015   0.03   10^-12
    2 x 10^4   0.61       0.59     0.028   0.06   10^-12
    5 x 10^4   1.62       1.68     0.067   0.15   10^-12
    10^5       3.61       3.93     0.123   0.34   10^-12
    2 x 10^5   8.03       10.7     0.236   0.65   10^-12
    5 x 10^5   26.7       31.2     0.632   1.41   10^-11
    10^6       51.3       81.9     1.28    3.40   10^-11

Timings for the one-dimensional biharmonic covariance function. The matrix entry is given as $C_{ij} = 2\delta_{ij} + |r_i - r_j|^2 \log(|r_i - r_j|)$, where the $r_i$ are uniformly distributed random points in the interval $[-3, 3]$.

[Figure: log-log plot of solve time vs. system size for the direct solver and fast solvers with exponential, inverse multiquadric, and biharmonic kernels.]

SLIDE 41

Versus existing Python packages


http://dan.iel.fm/george/

SLIDE 42

Random variate sampling

To sample from a high-dimensional Gaussian distribution it is necessary to apply a symmetric factor of $C$. If

$$C = C^{1/2} \, C^{1/2} = L L^T = W W^T,$$

then

$$x = C^{1/2} z \sim \mathcal{N}(0, C), \qquad \text{where } z \sim \mathcal{N}(0, I).$$
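A dense reference sketch of this sampling step (my addition): a Cholesky factor plays the role of $C^{1/2}$ at $O(n^3)$ cost, which is precisely what the hierarchical symmetric factorization on the next slide accelerates.

```python
import numpy as np

def sample_gaussian(C, nsamples=1, jitter=1e-8, seed=0):
    """Draw x = L z ~ N(0, C), where C = L L^t is a dense Cholesky factorization."""
    n = C.shape[0]
    L = np.linalg.cholesky(C + jitter * np.eye(n))   # jitter guards against rounding
    z = np.random.default_rng(seed).standard_normal((n, nsamples))
    return L @ z                                     # each column ~ N(0, C)

xs = np.linspace(-3.0, 3.0, 400)
C = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / 0.5)  # squared-exponential covariance
samples = sample_gaussian(C, nsamples=3)
```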

SLIDE 43

Hierarchical Symmetric Factorization

We wish to obtain a factorization of the form:

$$A = \underbrace{A_\kappa A_{\kappa-1} A_{\kappa-2} \cdots A_0}_{W} \; \overbrace{A_0^T \cdots A_{\kappa-2}^T A_{\kappa-1}^T A_\kappa^T}^{W^T}$$

[Figure: block pictures of the symmetric factorization $A = A_3 \times A_2 \times A_1 \times A_0 \times A_0^T \times A_1^T \times A_2^T \times A_3^T$.]

SLIDE 44

In conclusion

We have presented fast, direct algorithms for computing with Gaussian processes. A certain class of algorithms can be used to manipulate them efficiently:

  • Hierarchical factorization of covariance matrices
  • Inversion
  • Determinant
  • Symmetric factorization


SLIDE 45

Available Python software


Download from http://dan.iel.fm/george/

SLIDE 46

Other activities

  • Accelerated HSS-HBS matrix factorizations
  • Custom covariances
  • Efficient stable distribution calculations
  • Symmetric factorizations in physics (mobility)


Even more activities

  • Fast algorithms for PDEs and integral equations in classical mathematical physics
  • Electromagnetics, fluids, plasma physics, etc.
SLIDE 47

The best Gaussian process reference: www.gaussianprocess.org/gpml/

For a nice Python package: dan.iel.fm/george/

Reference papers: www.cims.nyu.edu/~oneil

Thanks!

Funding: AFOSR NSSEFF Program Award FA9550-10-1-0180; AIG-NYU Award #A15-0098-001
