An introduction to the holonomic gradient method in statistics
Akimichi Takemura, Univ. Tokyo
September 5, 2012
Items
1. First example: Airy-like function
2. Holonomic function and holonomic gradient method (HGM)
3. Another example: incomplete gamma function
4. Wishart distribution and hypergeometric function of a matrix argument
5. HGM for two-dimensional Wishart matrix
6. Pfaffian system for general dimension
7. Numerical experiments
- Coauthors on HGM: H. Hashiguchi, T. Koyama, H. Nakayama, K. Nishiyama, M. Noro, Y. Numata, K. Ohara, T. Sei, N. Takayama.
- HGM proposed in "Holonomic gradient descent and its application to the Fisher-Bingham integral", Advances in Applied Mathematics, 47, 639–658, N3OST2, 2011.
- We now have about 7 manuscripts on HGM, mainly by the Takayama group.
- The Wishart case is discussed in arXiv:1201.0472v3.
Ex. 1: Airy-like function

An exercise problem by Nobuki Takayama on Sep. 16, 2009, during the "Kobe Gröbner School".

Question: Let $A(x) = \int_0^\infty e^{-t-xt^3}\,dt$, $x > 0$. Derive a differential equation satisfied by $A(x)$.

Answer:
$$1 = (27x^3\partial_x^2 + 54x^2\partial_x + 6x + 1)A(x) = 27x^3 A''(x) + 54x^2 A'(x) + (6x+1)A(x).$$
- Actually this question is pretty hard, even if you are told the answer.
- I was struggling with this problem, wasting lots of paper and wondering why I was doing this exercise.
- After one hour, I suddenly realized that this is indeed an important problem in statistics.
- We try to confirm the given answer.
- First
$$A'(x) = \partial_x \int_0^\infty e^{-t-xt^3}\,dt = -\int_0^\infty t^3 e^{-t-xt^3}\,dt, \qquad A''(x) = \int_0^\infty t^6 e^{-t-xt^3}\,dt.$$
- Hence
$$27x^3 A''(x) + 54x^2 A'(x) + (6x+1)A(x) = \int_0^\infty (27x^3t^6 - 54x^2t^3 + 6x + 1)\,e^{-t-xt^3}\,dt = 1 \ ??$$
- Integration by parts:
$$\partial_t\, e^{-t-xt^3} = -(1+3xt^2)\,e^{-t-xt^3}.$$
Hence
$$0 = \Bigl[t\,e^{-t-xt^3}\Bigr]_0^\infty = \int_0^\infty \partial_t\bigl(t\,e^{-t-xt^3}\bigr)\,dt = A(x) - \int_0^\infty t(1+3xt^2)\,e^{-t-xt^3}\,dt.$$
- Similarly, if you work hard and work out
$$1 = \Bigl[(-9x^2t^4 + 3xt^2 + 6xt - 1)\,e^{-t-xt^3}\Bigr]_0^\infty$$
by integration by parts (how did you get this?), you obtain the answer:
$$\int_0^\infty (27x^3t^6 - 54x^2t^3 + 6x + 1)\,e^{-t-xt^3}\,dt = 1.$$
- Believe and try:
$$\partial_t(-9x^2t^4 + 3xt^2 + 6xt - 1) + (9x^2t^4 - 3xt^2 - 6xt + 1)(1 + 3xt^2)$$
$$= (-36x^2t^3 + 6xt + 6x) + (9x^2t^4 - 3xt^2 - 6xt + 1) + (27x^3t^6 - 9x^2t^4 - 18x^2t^3 + 3xt^2)$$
$$= 27x^3t^6 - 54x^2t^3 + 6x + 1 \quad \text{(OK!)}$$
- Actually Prof. Takayama (or one of his students) did this by computer, while asking us to do it by hand(!) ← integration algorithm for D-modules based on Gröbner bases for D-modules.
Is this exercise related to statistics?
- Change the notation and let
$$A(\theta) = \int_0^\infty e^{-x-\theta x^3}\,dx.$$
- Let
$$f(x;\theta) = \frac{1}{A(\theta)}\,e^{-x-\theta x^3}, \qquad x, \theta > 0.$$
- This is an exponential family with the sufficient statistic $T(x) = x^3$. (We can absorb $e^{-x}$ into the base measure $dx$.)
- Therefore we are evaluating the normalizing constant of an exponential family, together with its derivatives.
- We now know
$$1 = 27\theta^3 A''(\theta) + 54\theta^2 A'(\theta) + (6\theta+1)A(\theta).$$
- Hence $A''(\theta)$, which gives the Fisher information (spelled out below), is automatically obtained from $A(\theta)$ and $A'(\theta)$.
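To make the Fisher information step explicit (a standard exponential family identity, added here for completeness): since $\log f(x;\theta) = -\theta x^3 - x - \log A(\theta)$, the Fisher information is the variance of the sufficient statistic,
$$I(\theta) = \mathrm{Var}_\theta(x^3) = \frac{d^2}{d\theta^2}\log A(\theta) = \frac{A''(\theta)}{A(\theta)} - \left(\frac{A'(\theta)}{A(\theta)}\right)^2,$$
so $I(\theta)$ is available as soon as $A(\theta)$, $A'(\theta)$, $A''(\theta)$ are.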
- Suppose that we have independent observations $x_1, \ldots, x_n$ from $f(x;\theta)$.
- Then the log likelihood $\ell(\theta)$ is written as
$$\ell(\theta) = -\theta \sum_{i=1}^n x_i^3 - n \log A(\theta)$$
and the likelihood equation is
$$0 = -\sum_{i=1}^n x_i^3 - n\,\frac{A'(\theta)}{A(\theta)}.$$
- Can we numerically evaluate $A(\theta)$ and $A'(\theta)$? (Also $A''(\theta)$ for Newton-Raphson?)
- For illustration, we use a simple linear approximation (the Euler method for numerical integration):
$$A(\theta + \Delta\theta) \doteq A(\theta) + \Delta\theta\, A'(\theta), \qquad A'(\theta + \Delta\theta) \doteq A'(\theta) + \Delta\theta\, A''(\theta).$$
- But from the differential equation we know
$$A''(\theta) = \frac{1}{27\theta^3}\Bigl(1 - (6\theta+1)A(\theta) - 54\theta^2 A'(\theta)\Bigr).$$
- Punch line: if you keep the numerical values of $A(\theta)$, $A'(\theta)$ at one point $\theta$, then you can compute these values at a nearby point $\theta + \Delta\theta$.
- At each point, higher-order derivatives $A''(\theta), A'''(\theta), \ldots$ can be computed as needed.
- Hence by numerically solving the ODE, you can compute $A(\theta)$ and its derivatives at any point → "Holonomic Gradient Method".
- For explanation we used the Euler method, but in our actual implementation we use the Runge-Kutta method to solve the ODE (a small sketch follows).
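A minimal sketch of this scheme in Python, assuming SciPy for the one-time quadrature at the starting point; the starting point, target, and step count are illustrative choices, not the authors' implementation.

```python
# A minimal HGM sketch for the Airy-like normalizing constant
# A(theta) = \int_0^\infty exp(-x - theta x^3) dx.
# Quadrature is used once, at theta0, to get initial values of A and A';
# after that only the holonomic ODE
#   A'' = (1 - (6 theta + 1) A - 54 theta^2 A') / (27 theta^3)
# is used (classical Runge-Kutta instead of Euler).
import numpy as np
from scipy.integrate import quad

def rhs(theta, y):
    """First-order system for y = (A, A')."""
    A, A1 = y
    A2 = (1.0 - (6.0 * theta + 1.0) * A - 54.0 * theta**2 * A1) / (27.0 * theta**3)
    return np.array([A1, A2])

def hgm_airy(theta0, theta1, steps=10000):
    # One-time initial values by quadrature at theta0.
    A0, _ = quad(lambda x: np.exp(-x - theta0 * x**3), 0, np.inf)
    A10, _ = quad(lambda x: -x**3 * np.exp(-x - theta0 * x**3), 0, np.inf)
    y, t = np.array([A0, A10]), theta0
    h = (theta1 - theta0) / steps
    for _ in range(steps):                      # RK4 steps along [theta0, theta1]
        k1 = rhs(t, y)
        k2 = rhs(t + h / 2, y + h / 2 * k1)
        k3 = rhs(t + h / 2, y + h / 2 * k2)
        k4 = rhs(t + h, y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y                                    # (A(theta1), A'(theta1))

# Sanity check against direct quadrature at the target point:
A_hgm = hgm_airy(1.0, 5.0)[0]
A_direct, _ = quad(lambda x: np.exp(-x - 5.0 * x**3), 0, np.inf)
print(A_hgm, A_direct)
```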
Holonomic function and holonomic gradient method (HGM)
Univariate homogeneous case:
- A smooth function $f$ is holonomic if $f$ satisfies an ODE of the following form:
$$0 = h_k(x) f^{(k)}(x) + \cdots + h_1(x) f'(x) + h_0(x) f(x),$$
where $h_k(x), \ldots, h_1(x), h_0(x)$ are rational functions of $x$.
- Approximation (a generic implementation sketch follows):
$$\begin{pmatrix} f(x+\Delta x)\\ f'(x+\Delta x)\\ \vdots\\ f^{(k-1)}(x+\Delta x) \end{pmatrix} \doteq \left( I_k + \Delta x \begin{pmatrix} 0 & 1 & & \\ & \ddots & \ddots & \\ & & 0 & 1 \\ -\frac{h_0(x)}{h_k(x)} & -\frac{h_1(x)}{h_k(x)} & \cdots & -\frac{h_{k-1}(x)}{h_k(x)} \end{pmatrix} \right) \begin{pmatrix} f(x)\\ f'(x)\\ \vdots\\ f^{(k-1)}(x) \end{pmatrix}$$
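A generic rendering of this update (an illustrative sketch; the caller supplies the rational coefficient functions $h_0, \ldots, h_k$ as callables):

```python
# One Euler step for a holonomic f: multiply (f, f', ..., f^(k-1)) by
# I + dx * C(x), where C(x) is the companion matrix shown above.
import numpy as np

def companion(h, x):
    """h = [h_0, ..., h_k], a list of k+1 callables; returns the k x k C(x)."""
    k = len(h) - 1
    C = np.zeros((k, k))
    C[:-1, 1:] = np.eye(k - 1)                         # superdiagonal of ones
    C[-1, :] = [-h[i](x) / h[k](x) for i in range(k)]  # last row: -h_i / h_k
    return C

def euler_step(h, x, fvec, dx):
    return fvec + dx * companion(h, x) @ fvec

# Example: f(x) = exp(x^2) satisfies f'(x) - 2x f(x) = 0, i.e. h = [-2x, 1].
h = [lambda x: -2.0 * x, lambda x: 1.0]
x, fvec, dx = 0.0, np.array([1.0]), 1e-4
for _ in range(10000):                # march from x = 0 to x = 1
    fvec = euler_step(h, x, fvec, dx)
    x += dx
print(fvec[0])                        # close to e = 2.71828 (first-order error)
```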
Multivariate case (for simplicity let dim = 2)
- A smooth $f(x,y)$ is holonomic if, for each of $x$ and $y$, fixing the other variable arbitrarily, $f$ satisfies an ODE with rational function coefficients in that variable.
- Namely, if there exist rational functions $h^1_0(x,y), \ldots, h^1_{k_1}(x,y), h^2_0(x,y), \ldots, h^2_{k_2}(x,y)$ in $x, y$ such that
$$\sum_{i=0}^{k_1} h^1_i(x,y)\,\partial_x^i f(x,y) = 0, \qquad \sum_{i=0}^{k_2} h^2_i(x,y)\,\partial_y^i f(x,y) = 0.$$
- Consider $\partial_x^{r_1}\partial_y^{r_2} f(x,y)$. If $r_1 \ge k_1$ or $r_2 \ge k_2$, we can always compute this by recursively applying the differential equations.
- As in the univariate case, if we keep the numerical values of $\partial_x^i\partial_y^j f(x,y)$ in the range $i = 0, \ldots, k_1-1$, $j = 0, \ldots, k_2-1$, then we can always compute the other higher-order derivatives.
- We can also approximate $\partial_x^i\partial_y^j f(x+\Delta x, y+\Delta y)$ by the values $\{\partial_x^i\partial_y^j f(x,y)\}_{i=0,\ldots,k_1-1,\ j=0,\ldots,k_2-1}$.
- We usually only need to keep a subset of $\{\partial_x^i\partial_y^j f(x,y)\}_{i=0,\ldots,k_1-1,\ j=0,\ldots,k_2-1}$ in memory. The subset is given by the set of standard monomials obtained by the division algorithm based on a Gröbner basis.
Which functions are holonomic (Zeilberger (1990))?
- Polynomials and rational functions are holonomic.
- exp(rational) and log(rational) are holonomic.
- $f, g$ holonomic ⇒ $f + g$, $f \times g$ holonomic.
- $f(x_1, \ldots, x_m)$ holonomic ⇒ $\int f(x_1, \ldots, x_m)\,dx_m$ holonomic.
- The restriction of a holonomic $f$ to an affine subspace is holonomic.
- Holonomicity is also defined for generalized functions.
- From the above properties, it is often easy to tell that a given function (such as the Airy-like function) is holonomic, i.e., that it must satisfy a differential equation with rational function coefficients.
- The problem is to find the explicit form of the differential equation.
Ex. 2: Incomplete gamma function
- Consider the incomplete gamma function
$$G(x) = \int_0^x y^{\alpha-1} e^{-y}\,dy, \qquad \alpha, x > 0.$$
- From the general theory, $G(x)$ is holonomic.
- We integrate in the opposite direction and change the scale:
$$\int_0^x (x-y)^{\alpha-1} e^{-(x-y)}\,dy = x^{\alpha-1} e^{-x} \int_0^x \Bigl(1-\frac{y}{x}\Bigr)^{\alpha-1} e^{y}\,dy = x^{\alpha} e^{-x} \int_0^1 (1-z)^{\alpha-1} e^{xz}\,dz$$
- Expand $e^{xz}$ into a Taylor series (not clever?):
$$G(x) = x^{\alpha} e^{-x} \sum_{k=0}^\infty \frac{x^k}{k!} \int_0^1 (1-z)^{\alpha-1} z^k\,dz = x^{\alpha} e^{-x} \sum_{k=0}^\infty \frac{\Gamma(\alpha)\Gamma(k+1)}{k!\,\Gamma(\alpha+k+1)}\,x^k = \frac{1}{\alpha}\,x^{\alpha} e^{-x} \sum_{k=0}^\infty \frac{(1)_k}{(\alpha+1)_k\,k!}\,x^k = \frac{1}{\alpha}\,x^{\alpha} e^{-x}\,{}_1F_1(1;\alpha+1;x),$$
where $(a)_k = a(a+1)\cdots(a+k-1)$.
- Confluent hypergeometric function ${}_1F_1$:
$$F = {}_1F_1(a;c;x) = \sum_{k=0}^\infty \frac{(a)_k}{(c)_k\,k!}\,x^k$$
- Differential equation (ODE) satisfied by $F$:
$$x F''(x) + (c-x) F'(x) - aF = 0$$
(comparison of coefficients: with $c_k = (a)_k/((c)_k\,k!)$, the coefficient of $x^k$ gives $(k+1)(c+k)\,c_{k+1} - (a+k)\,c_k = 0$).
- "Holonomic gradient method": we obtain the value of $F(x+\Delta x)$ from $F(x)$ by solving the ODE.
- If we know $F'(x)$, then we can do the following:
$$F(x+\Delta x) \doteq F(x) + \Delta x\, F'(x)$$
- For the next step we also need to update $F'(x+\Delta x)$:
$$F'(x+\Delta x) \doteq F'(x) + \Delta x\, F''(x)$$
- Further we need $F''(x), \ldots$. But note that at any $x$ we have
$$F''(x) = -\bigl((c-x)F'(x) - aF(x)\bigr)\,\frac{1}{x}.$$
Hence we can keep only $(F(x), F'(x))$ in memory. We can always compute higher-order derivatives from these two.
- In summary, we can use the updating procedure
$$\begin{pmatrix} F(x)\\ F'(x) \end{pmatrix} \to \begin{pmatrix} F(x) + \Delta x\,F'(x)\\ F'(x) + \Delta x\,F''(x) \end{pmatrix}.$$
- In matrix form:
$$\begin{pmatrix} F(x+\Delta x)\\ F'(x+\Delta x) \end{pmatrix} \doteq \left( I_2 + \Delta x \begin{pmatrix} 0 & 1\\ \frac{a}{x} & -\frac{c-x}{x} \end{pmatrix} \right) \begin{pmatrix} F(x)\\ F'(x) \end{pmatrix}.$$
- How about the initial value? Use the series expansion at $x \doteq 0$.
- At $x = 0$ our ODE has a singularity. Hence we have to step away from $x = 0$ by a tiny amount (see the sketch below).
- Generalization to a matrix argument?
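A minimal sketch of this scheme, with the initial values taken from a truncated series near $x = 0$ and the result checked against SciPy's regularized incomplete gamma function. The starting point $x_0$, the step count, and the plain Euler step are illustrative choices:

```python
# HGM for 1F1(a; c; x): propagate (F, F') with the 2x2 matrix
# [[0, 1], [a/x, -(c - x)/x]], starting near the singular point x = 0.
import numpy as np
from scipy.special import gamma, gammainc

def f1f1_series(a, c, x, terms=30):
    """Truncated series values of (1F1, its derivative) for small x."""
    F, F1, coef = 0.0, 0.0, 1.0
    for k in range(terms):
        F += coef * x**k
        if k >= 1:
            F1 += coef * k * x**(k - 1)
        coef *= (a + k) / ((c + k) * (k + 1))   # c_{k+1} = c_k (a+k)/((c+k)(k+1))
    return np.array([F, F1])

def hgm_1f1(a, c, x0, x1, steps=20000):
    y, x = f1f1_series(a, c, x0), x0            # initial values near x = 0
    h = (x1 - x0) / steps
    for _ in range(steps):                      # Euler steps with the matrix above
        P = np.array([[0.0, 1.0], [a / x, -(c - x) / x]])
        y = y + h * (P @ y)
        x += h
    return y

# Check: G(x) = x^alpha e^{-x} 1F1(1; alpha+1; x) / alpha is the (lower)
# incomplete gamma function; compare its regularized value with SciPy.
alpha, x = 1.5, 4.0
F = hgm_1f1(1.0, alpha + 1.0, 0.01, x)[0]
G = x**alpha * np.exp(-x) * F / alpha
print(G / gamma(alpha), gammainc(alpha, x))     # should nearly agree
```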
Wishart distribution and hypergeometric function of a matrix argument
- $W$: $m \times m$ symmetric positive definite ($W > 0$)
- Density of the Wishart distribution with d.f. $n$ and covariance matrix $\Sigma > 0$:
$$f(W) = C \times \frac{|W|^{\frac{n-m-1}{2}}}{|\Sigma|^{\frac{n}{2}}}\,\exp\Bigl(-\frac{1}{2}\,\mathrm{tr}\,W\Sigma^{-1}\Bigr)$$
- $C$ is known (containing gamma functions).
- $\ell_1$: the largest root of $W$.
- We want to evaluate the probability $\Pr(\ell_1 < x)$. Note that $\ell_1 < x \Leftrightarrow W < xI_m$, where $I_m$ is the $m \times m$ identity matrix.
- Hence the probability is given in incomplete gamma form:
$$\Pr(\ell_1 < x) = C \int_{0 < W < xI_m} \frac{|W|^{\frac{n-m-1}{2}}}{|\Sigma|^{\frac{n}{2}}}\,\exp\Bigl(-\frac{1}{2}\,\mathrm{tr}\,W\Sigma^{-1}\Bigr)\,dW$$
- From the general theory, $\Pr(\ell_1 < x)$ is holonomic.
- Just as in dim = 1, $\Pr(\ell_1 < x)$ is written as
$$C \exp\Bigl(-\frac{x}{2}\,\mathrm{tr}\,\Sigma^{-1}\Bigr)\, x^{\frac{1}{2}nm}\, {}_1F_1\Bigl(\frac{m+1}{2};\ \frac{n+m+1}{2};\ \frac{x}{2}\Sigma^{-1}\Bigr)$$
- Hypergeometric function of a matrix argument (Herz (1955)):
$${}_1F_1(a;c;Y) = \frac{\Gamma_m(c)}{\Gamma_m(a)\,\Gamma_m(c-a)} \int_{0<X<I_m} \exp(\mathrm{tr}\,XY)\,|X|^{a-(m+1)/2}\,|I_m - X|^{c-a-(m+1)/2}\,dX,$$
where
$$\Gamma_m(a) = \pi^{\frac{1}{4}m(m-1)} \prod_{i=1}^m \Gamma\Bigl(a - \frac{i-1}{2}\Bigr).$$
- ${}_1F_1(a;c;Y)$ is a symmetric function of the characteristic roots of $Y$ ⇒ its series expression is written in terms of symmetric polynomials.
- Zonal polynomials (A. T. James): $C_\kappa(Y)$, $\kappa \vdash k$, a homogeneous symmetric polynomial of degree $k$ in the characteristic roots of $Y$.
- Pochhammer symbol:
$$(a)_\kappa = \prod_{i=1}^{l} \Bigl(a - \frac{i-1}{2}\Bigr)_{k_i}, \qquad (a)_{k_i} = \prod_{j=1}^{k_i} (a + j - 1)$$
- Series expansion of ${}_1F_1$ (Constantine (1963)):
$${}_1F_1(a;c;Y) = \sum_{k=0}^\infty \sum_{\kappa \vdash k} \frac{(a)_\kappa}{(c)_\kappa}\,\frac{C_\kappa(Y)}{k!}.$$
- This is a beautiful mathematical result. However, for numerical computation zonal polynomials pose enormous combinatorial difficulties, and statisticians pretty much forgot about them.
- The partial differential equations satisfied by $F(Y) = {}_1F_1(a;c;y_1,\ldots,y_m)$ were obtained by Muirhead (1970): $g_i F = 0$, $i = 1,\ldots,m$, where
$$g_i = y_i\partial_i^2 + (c - y_i)\partial_i + \frac{1}{2}\sum_{j\ne i} \frac{y_j}{y_i - y_j}\,(\partial_i - \partial_j) - a.$$
- Can we use this PDE for numerical computation? (Nobody tried this for 40 years.)
- It works! It works very well up to dimension $m = 10$! Like Columbus's egg.
Holonomic gradient method for dimension two
- Two partial differential equations:
$$\Bigl(y_1\partial_1^2 + (c - y_1)\partial_1 + \frac{1}{2}\,\frac{y_2}{y_1 - y_2}\,(\partial_1 - \partial_2) - a\Bigr)F = 0,$$
$$\Bigl(y_2\partial_2^2 + (c - y_2)\partial_2 + \frac{1}{2}\,\frac{y_1}{y_2 - y_1}\,(\partial_2 - \partial_1) - a\Bigr)F = 0.$$
- Let us compute higher-order derivatives from these equations.
- Divide the second equation by $y_2$ and write
$$\partial_1^{n_1}\partial_2^{n_2} F = \partial_1^{n_1}\partial_2^{n_2-2}\Bigl(-\frac{c - y_2}{y_2}\,\partial_2 - \frac{1}{2}\,\frac{y_1}{y_2(y_2 - y_1)}\,(\partial_2 - \partial_1) + \frac{a}{y_2}\Bigr)F.$$
- The RHS becomes messy, but an important fact is that the number of differentiations with respect to $y_2$ is reduced by 1.
- We can keep reducing the number of differentiations as long as some variable carries two or more of them.
- This implies that all higher-order derivatives can be written as rational function combinations of the following 4 square-free mixed derivatives: $F(Y)$, $\partial_1 F(Y)$, $\partial_2 F(Y)$, $\partial_1\partial_2 F(Y)$.
- In algebraic terminology, let $K = \mathbf{C}(y_1, y_2)$ be the field of rational functions and let $R = K\langle\partial_1, \partial_2\rangle = \mathbf{C}(y_1, y_2)\langle\partial_1, \partial_2\rangle$ be the ring of differential operators.
- We take the graded lexicographic term order as the term order.
Theorem 1. $\{g_1, g_2\}$ is a Gröbner basis and $\{1, \partial_1, \partial_2, \partial_1\partial_2\}$ is the set of standard monomials. (This theorem holds for general dimension.)
- [Division algorithm] By repeating the differentiations, for each $n_1, n_2$ there exist rational functions $h^{(n_1,n_2)}_{00}, h^{(n_1,n_2)}_{10}, h^{(n_1,n_2)}_{01}, h^{(n_1,n_2)}_{11}$ in $y_1, y_2$ such that
$$\partial_1^{n_1}\partial_2^{n_2} F = h^{(n_1,n_2)}_{00} F + h^{(n_1,n_2)}_{10}\,\partial_1 F + h^{(n_1,n_2)}_{01}\,\partial_2 F + h^{(n_1,n_2)}_{11}\,\partial_1\partial_2 F.$$
- Hence we only keep $F(Y)$, $\partial_1 F(Y)$, $\partial_2 F(Y)$, $\partial_1\partial_2 F(Y)$ in memory. We can always compute higher-order derivatives from these 4 values. (For dimension $m$, we need to keep $2^m$ square-free mixed derivatives in memory.)
- Actually for $m = 2$ we only need $\partial_1\partial_2^2 F$:
$$\partial_1\partial_2^2 F = \partial_1\Bigl(-\frac{c - y_2}{y_2}\,\partial_2 - \frac{1}{2}\,\frac{y_1}{y_2(y_2 - y_1)}\,(\partial_2 - \partial_1) + \frac{a}{y_2}\Bigr)F$$
$$= \Bigl(-\frac{c - y_2}{y_2}\,\partial_1\partial_2 - \frac{1}{2}\,\frac{1}{(y_2 - y_1)^2}\,(\partial_2 - \partial_1) - \frac{1}{2}\,\frac{y_1}{y_2(y_2 - y_1)}\,(\partial_1\partial_2 - \partial_1^2) + \frac{a}{y_2}\,\partial_1\Bigr)F.$$
(By symmetry, $\partial_1^2\partial_2 F$ is obtained by the interchange $y_1 \leftrightarrow y_2$.)
- We further substitute the expression for $\partial_1^2 F$ (from the first equation) into the RHS.
- Result:
$$\partial_1\partial_2^2 F = \frac{a}{2y_2(y_2 - y_1)}\,F + \Bigl(\frac{3}{4}\,\frac{1}{(y_2 - y_1)^2} + \frac{a}{y_2} - \frac{c - y_1}{2y_2(y_2 - y_1)}\Bigr)\partial_1 F - \frac{3}{4}\,\frac{1}{(y_2 - y_1)^2}\,\partial_2 F - \Bigl(\frac{c - y_2}{y_2} + \frac{1}{2}\,\frac{y_1}{y_2(y_2 - y_1)}\Bigr)\partial_1\partial_2 F$$
$$= h^{(1,2)}_{00} F + h^{(1,2)}_{10}\,\partial_1 F + h^{(1,2)}_{01}\,\partial_2 F + h^{(1,2)}_{11}\,\partial_1\partial_2 F.$$
- Then as in the one-dimensional case, we can update:
$$F(y_1+\Delta y_1, y_2+\Delta y_2) \doteq F(y_1,y_2) + \Delta y_1\,\partial_1 F(y_1,y_2) + \Delta y_2\,\partial_2 F(y_1,y_2),$$
$$\partial_1 F(y_1+\Delta y_1, y_2+\Delta y_2) \doteq \partial_1 F(y_1,y_2) + \Delta y_1\,\partial_1^2 F(y_1,y_2) + \Delta y_2\,\partial_1\partial_2 F(y_1,y_2),$$
[$\partial_2 F$ is similar]
$$\partial_1\partial_2 F(y_1+\Delta y_1, y_2+\Delta y_2) \doteq \partial_1\partial_2 F(y_1,y_2) + \Delta y_1\,\partial_1^2\partial_2 F(y_1,y_2) + \Delta y_2\,\partial_1\partial_2^2 F(y_1,y_2).$$
- Let $\mathbf{F} = (F, \partial_1 F, \partial_2 F, \partial_1\partial_2 F)^t$.
- Our updating formula is
$$\mathbf{F}(Y+\Delta Y) \doteq \mathbf{F}(Y) + \Delta y_1 \times P_1(Y)\,\mathbf{F}(Y) + \Delta y_2 \times P_2(Y)\,\mathbf{F}(Y),$$
where $P_1, P_2$ are $4 \times 4$ matrices consisting of rational functions in $Y$, called the Pfaffian system.
- This simple form works very well for dimension two (a sketch follows).
- For general dimension $m$, the matrices in the Pfaffian system are $2^m \times 2^m$. The size is $2^m = 1024$ for $m = 10$.
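To make the dimension-two case concrete, here is a sketch that fills $P_1$ from the formulas above, obtains $P_2$ by the $y_1 \leftrightarrow y_2$ symmetry (which also swaps the $\partial_1 F$ and $\partial_2 F$ slots), and performs the Euler update along a straight segment. The initial vector at $Y_0$ is assumed to be given, e.g., from a truncated zonal polynomial series near the origin; its computation is a separate problem, discussed later in these slides.

```python
# HGM update for m = 2 with the basis (F, d1F, d2F, d1d2F).
import numpy as np

def P1(a, c, y1, y2):
    """Rows are d/dy1 of (F, d1F, d2F, d1d2F) in that same basis."""
    d = y1 - y2
    return np.array([
        [0.0, 1.0, 0.0, 0.0],                           # d/dy1 F = d1F
        [a / y1, -(c - y1) / y1 - 0.5 * y2 / (y1 * d),  # d1^2 F from g1 F = 0
         0.5 * y2 / (y1 * d), 0.0],
        [0.0, 0.0, 0.0, 1.0],                           # d/dy1 d2F = d1d2F
        [a / (2 * y1 * d), -0.75 / d**2,                # d1^2 d2 F = h^{(2,1)}
         0.75 / d**2 + a / y1 - (c - y2) / (2 * y1 * d),
         -(c - y1) / y1 - 0.5 * y2 / (y1 * d)],
    ])

def P2(a, c, y1, y2):
    """P1 with y1 <-> y2, conjugated by the swap of the d1F and d2F slots."""
    S = np.array([[1.0, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]])
    return S @ P1(a, c, y2, y1) @ S

def hgm_path(a, c, Y0, Y1, F0, steps=10000):
    """Euler update F <- F + dy1 P1 F + dy2 P2 F along the segment Y0 -> Y1."""
    y = np.array(Y0, dtype=float)
    dy = (np.array(Y1, dtype=float) - y) / steps
    Fvec = np.array(F0, dtype=float)
    for _ in range(steps):
        Fvec = Fvec + dy[0] * P1(a, c, *y) @ Fvec + dy[1] * P2(a, c, *y) @ Fvec
        y += dy
    return Fvec
```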
Pfaffian system for general dimension
- Theorem 1 on the non-diagonal region holds for general dimension.
- For the Pfaffian system, we should leave the recursive form as it is, without actually expanding it.
- For $J \subset \{1, \ldots, m\}$, denote the square-free derivative with respect to the variables in $J$ as $\partial_J = \prod_{j \in J} \partial_j$.
- We need $\partial_J\partial_i^2 F = \partial_i^2\partial_J F$ for $i \notin J$.
- Let $I = J \cup \{i\}$. Collect the square-free terms in $\partial_J g_i$ as
$$r(i, J; y) = -\Bigl((c - y_i)\,\partial_I - a\,\partial_J + \frac{1}{2}\sum_{k \notin I} \frac{y_k}{y_i - y_k}\,(\partial_I - \partial_J\partial_k) + \frac{1}{2}\sum_{k \in J} \frac{y_k}{y_i - y_k}\,\partial_I + \frac{1}{2}\sum_{k \in J} \frac{y_i}{(y_i - y_k)^2}\,(\partial_i\partial_{J\setminus\{k\}} - \partial_J)\Bigr).$$
- Separating the terms for which we need recursive differentiation in $\partial_i^2\partial_J F$, we have
$$y_i\,\partial_i^2\partial_J F = r(i, J; y)F + \frac{1}{2}\sum_{k \in J} \frac{1}{y_i - y_k}\,\bigl(y_k\,\partial_k^2\partial_{J\setminus\{k\}}\bigr)F.$$
- This form is suitable for efficient programming (done by N. Takayama).
- When we expand the recursion to the end, we obtain
$$y_i\,\partial_i^2\partial_J F = r(i, J; y)F + \frac{1}{2}\sum_{k_1 \in J} \frac{1}{y_i - y_{k_1}}\,r(k_1, J\setminus\{k_1\}; y)F + \frac{1}{4}\sum_{\substack{k_1, k_2 \in J\\ k_1, k_2\ \mathrm{distinct}}} \frac{1}{(y_i - y_{k_1})(y_{k_1} - y_{k_2})}\,r(k_2, J\setminus\{k_1, k_2\}; y)F + \cdots + \frac{1}{2^{|J|}}\sum_{\substack{k_1, \ldots, k_{|J|} \in J\\ \mathrm{distinct}}} \frac{1}{(y_i - y_{k_1})\cdots(y_{k_{|J|-1}} - y_{k_{|J|}})}\,r(k_{|J|}, \emptyset; y)F.$$
- This form is also useful for theoretical considerations of the Pfaffian system.
- The problem of initial values is also difficult in general dimension. N. Takayama extended the algorithm of Koev-Edelman (2006) to handle partial derivatives.
Numerical experiments
- Statisticians need good numerical performance (recall zonal polynomials).
- We were not sure whether it would work up to dimension $m = 10$.
- How can we check the numerical accuracy?
  - The initial value is taken close to zero. We can check the numerical convergence $\lim_{x\to\infty} \Pr(\ell_1 < x) = 1$.
  - The following easy bounds are available:
$$\Pr[\ell_1 < x \mid \mathrm{diag}(\sigma_1^2, \ldots, \sigma_1^2)] \le \Pr[\ell_1 < x \mid \mathrm{diag}(\sigma_1^2, \ldots, \sigma_m^2)] \le \Pr[\ell_1 < x \mid \mathrm{diag}(\sigma_1^2, 0, \ldots, 0)].$$
- Example: $m = 10$, d.f. $n = 12$, $\Sigma^{-1} = 2\,\mathrm{diag}(1, 2, \ldots, 10)$.
- For $x = 30$ the bounds are $(0.99866943,\ 0.99999998)$.
- On an Intel Core i7 machine, the computation of the initial value at $x_0 = 0.2$ takes 20 seconds. Then with step size 0.001 we solve the PDE up to $x = 30$, which takes 75 seconds. Output: $\Pr[\ell_1 < 30] = 0.999545$.
- This accuracy is somewhat amazing, if we consider that we updated a 1024-dimensional vector 30,000 times.
- [Figure: plot of the cumulative distribution function computed by HGM (legend "by hg"); horizontal axis 5-20, vertical axis 0.2-1.2.]
Summary
- The holonomic gradient method is practical if we implement it efficiently.
- Our work brought a totally new approach to a longstanding difficult problem in multivariate statistics.
- The holonomic gradient method is general and can be applied to many problems.
- We stand at the beginning of applications of D-module theory to statistics!