An introduction to holonomic gradient method in statistics
Akimichi Takemura, Univ. Tokyo, September 5, 2012


SLIDE 1

An introduction to holonomic gradient method in statistics
Akimichi Takemura, Univ. Tokyo
September 5, 2012

SLIDE 2

Items

  • 1. First example: Airy-like function
  • 2. Holonomic function and holonomic gradient method (HGM)
  • 3. Another example: incomplete gamma function
  • 4. Wishart distribution and hypergeometric function of a matrix argument
  • 5. HGM for two-dimensional Wishart matrix
  • 6. Pfaffian system for general dimension
  • 7. Numerical experiments

SLIDE 3

  • Coauthors on HGM: H. Hashiguchi, T. Koyama, H. Nakayama, K. Nishiyama, M. Noro, Y. Numata, K. Ohara, T. Sei, N. Takayama.
  • HGM was proposed in "Holonomic gradient descent and its application to the Fisher-Bingham integral", Advances in Applied Mathematics, 47, 639–658. N3OST2. 2011.
  • We now have about 7 manuscripts on HGM, mainly by the Takayama group.
  • The Wishart case is discussed in arXiv:1201.0472v3.

SLIDE 4

Ex.1: Airy-like function

An exercise problem by Nobuki Takayama on Sep. 16, 2009, during the "Kobe Gröbner School".

Question: Let

    A(x) = \int_0^\infty e^{-t-xt^3}\,dt, \qquad x > 0.

Derive a differential equation satisfied by A(x).

Answer:

    1 = (27x^3\partial_x^2 + 54x^2\partial_x + 6x + 1)A(x)
      = 27x^3 A''(x) + 54x^2 A'(x) + (6x + 1)A(x).
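Before attempting a proof, the claimed identity can be checked numerically. The sketch below is my own illustration, not part of the original slides: it evaluates A(x), A'(x), A''(x) at x = 1 by a simple trapezoidal rule (differentiating under the integral sign gives A'(x) = -\int_0^\infty t^3 e^{-t-xt^3}\,dt and A''(x) = \int_0^\infty t^6 e^{-t-xt^3}\,dt) and checks that 27x^3 A'' + 54x^2 A' + (6x+1)A comes out to 1. The truncation point b = 12 and the grid size are ad hoc choices.

```python
import math

def quad(f, b=12.0, n=120_000):
    # trapezoidal rule on [0, b]; the integrand decays like exp(-t - x t^3),
    # so truncating the integral over [0, infinity) at b = 12 is harmless
    h = b / n
    s = 0.5 * (f(0.0) + f(b))
    for i in range(1, n):
        s += f(i * h)
    return s * h

x = 1.0
w = lambda t: math.exp(-t - x * t**3)     # integrand of A(x)
A  = quad(w)                              # A(x)
A1 = quad(lambda t: -t**3 * w(t))         # A'(x)
A2 = quad(lambda t: t**6 * w(t))          # A''(x)
lhs = 27 * x**3 * A2 + 54 * x**2 * A1 + (6 * x + 1) * A
print(lhs)  # should be very close to 1
```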

SLIDE 5

  • Actually this question is pretty hard, even if you are told the answer.
  • I was struggling with this problem, wasting lots of paper and wondering why I was doing this exercise.
  • After one hour, I suddenly realized that this is indeed an important problem in statistics.

SLIDE 6

  • We try to confirm the given answer.
  • First,

    A'(x) = \partial_x \int_0^\infty e^{-t-xt^3}\,dt = -\int_0^\infty t^3 e^{-t-xt^3}\,dt,
    A''(x) = \int_0^\infty t^6 e^{-t-xt^3}\,dt.

  • Hence

    27x^3 A''(x) + 54x^2 A'(x) + (6x+1)A(x)
      = \int_0^\infty (27x^3 t^6 - 54x^2 t^3 + 6x + 1)\, e^{-t-xt^3}\,dt = 1 ??

SLIDE 7

  • Integration by parts:

    \partial_t\, e^{-t-xt^3} = -(1 + 3xt^2)\, e^{-t-xt^3}.

    Hence

    0 = \left[ t\, e^{-t-xt^3} \right]_0^\infty = A(x) - \int_0^\infty t(1 + 3xt^2)\, e^{-t-xt^3}\,dt.

SLIDE 8

  • Similarly, if you work hard and work out

    1 = \left[ (-9x^2t^4 + 3xt^2 + 6xt - 1)\, e^{-t-xt^3} \right]_0^\infty

    by integration by parts (how did you get this?), you obtain the answer:

    \int_0^\infty (27x^3t^6 - 54x^2t^3 + 6x + 1)\, e^{-t-xt^3}\,dt = 1.

  • Believe it and try:

    \partial_t(-9x^2t^4 + 3xt^2 + 6xt - 1) + (9x^2t^4 - 3xt^2 - 6xt + 1)(1 + 3xt^2)
      = (-36x^2t^3 + 6xt + 6x) + (9x^2t^4 - 3xt^2 - 6xt + 1) + (27x^3t^6 - 9x^2t^4 - 18x^2t^3 + 3xt^2)
      = 27x^3t^6 - 54x^2t^3 + 6x + 1.  (OK!)

SLIDE 9

  • Actually Prof. Takayama (or one of his students) did this by computer, while asking us to do it by hand(!) ← Integration algorithm for D-modules based on Gröbner bases for D-modules.

SLIDE 10

Is this exercise related to statistics?

  • Change the notation and let

    A(\theta) = \int_0^\infty e^{-x-\theta x^3}\,dx.

  • Let

    f(x; \theta) = \frac{1}{A(\theta)}\, e^{-x-\theta x^3}, \qquad x, \theta > 0.

  • This is an exponential family with the sufficient statistic T(x) = x^3. (We can absorb e^{-x} into the base measure dx.)

SLIDE 11

  • Therefore we are evaluating the normalizing constant of an exponential family and its derivatives.
  • We now know

    1 = 27\theta^3 A''(\theta) + 54\theta^2 A'(\theta) + (6\theta + 1)A(\theta).

  • Hence A''(\theta), and with it the Fisher information, is automatically obtained from A(\theta) and A'(\theta).

SLIDE 12

  • Suppose that we have independent observations x_1, \ldots, x_n from f(x; \theta).
  • Then the log likelihood \ell(\theta) is written as

    \ell(\theta) = -\theta \sum_{i=1}^n x_i^3 - n \log A(\theta)

    and the likelihood equation is

    0 = -\sum_{i=1}^n x_i^3 - n\, \frac{A'(\theta)}{A(\theta)}.

  • Can we numerically evaluate A(\theta) and A'(\theta)? (Also A''(\theta) for Newton-Raphson?)

SLIDE 13

  • For illustration, we use the simple linear approximation (Euler method for numerical integration):

    A(\theta + \Delta\theta) \doteq A(\theta) + \Delta\theta\, A'(\theta),
    A'(\theta + \Delta\theta) \doteq A'(\theta) + \Delta\theta\, A''(\theta).

  • But from the differential equation we know

    A''(\theta) = \frac{1}{27\theta^3}\left( 1 - (6\theta + 1)A(\theta) - 54\theta^2 A'(\theta) \right).

SLIDE 14

  • Punch line: if you keep the numerical values of A(\theta), A'(\theta) at one point \theta, then you can compute these values at a nearby \theta + \Delta\theta.
  • At each point, higher-order derivatives A''(\theta), A'''(\theta), \ldots can be computed as needed.
  • Hence by numerically solving the ODE, you can compute A(\theta) and its derivatives at any point → "Holonomic Gradient Method".
  • For explanation we used the Euler method, but in our actual implementation we use the Runge-Kutta method to solve the ODE.
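The scheme above fits in a few lines of Python. This is my own toy illustration (Euler steps, crude quadrature for the starting values at θ = 1), not the authors' implementation, which uses Runge-Kutta:

```python
import math

def quad(f, b=10.0, n=100_000):
    # trapezoidal rule on [0, b] for the fast-decaying integrands
    h = b / n
    return h * (0.5 * (f(0.0) + f(b)) + sum(f(i * h) for i in range(1, n)))

def A_direct(th):
    # A(theta) by direct quadrature (used only for the start and the check)
    return quad(lambda x: math.exp(-x - th * x**3))

def A1_direct(th):
    # A'(theta) = -\int x^3 exp(-x - theta x^3) dx
    return quad(lambda x: -x**3 * math.exp(-x - th * x**3))

# HGM: carry (A, A') from theta = 1 to theta = 2 by Euler steps,
# obtaining A'' at each point from the ODE itself
th, A, A1 = 1.0, A_direct(1.0), A1_direct(1.0)
n_steps = 10_000
h = 1.0 / n_steps
for _ in range(n_steps):
    A2 = (1.0 - (6.0 * th + 1.0) * A - 54.0 * th**2 * A1) / (27.0 * th**3)
    A, A1 = A + h * A1, A1 + h * A2
    th += h
print(A, A_direct(2.0))  # the two values should agree closely
```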

SLIDE 15

Holonomic function and holonomic gradient method (HGM)

Univariate homogeneous case:

  • A smooth function f is holonomic if f satisfies an ODE

    0 = h_k(x) f^{(k)}(x) + \cdots + h_1(x) f'(x) + h_0(x) f(x),

    where h_k(x), \ldots, h_1(x), h_0(x) are rational functions of x.

SLIDE 16

  • Approximation:

    \begin{pmatrix} f(x+\Delta x) \\ f'(x+\Delta x) \\ \vdots \\ f^{(k-1)}(x+\Delta x) \end{pmatrix}
    \doteq \left( I_k + \Delta x
    \begin{pmatrix}
      0 & 1 & & \\
      & \ddots & \ddots & \\
      & & 0 & 1 \\
      -\frac{h_0(x)}{h_k(x)} & -\frac{h_1(x)}{h_k(x)} & \cdots & -\frac{h_{k-1}(x)}{h_k(x)}
    \end{pmatrix} \right)
    \begin{pmatrix} f(x) \\ f'(x) \\ \vdots \\ f^{(k-1)}(x) \end{pmatrix}
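This update can be coded generically: given the coefficients h_0, ..., h_k, the vector (f, f', ..., f^{(k-1)}) is advanced by the companion-type matrix above. A minimal sketch of mine follows; the toy test ODE f'' + f = 0 (h_2 = h_0 = 1, h_1 = 0), whose solution with f(0) = 1, f'(0) = 0 is cos x, is my choice and not from the slides.

```python
import math

def euler_step(vals, coeffs, x, dx):
    """One Euler step for the vector (f, f', ..., f^(k-1)).

    coeffs = [h0, ..., hk] encodes the ODE
    0 = hk(x) f^(k) + ... + h1(x) f' + h0(x) f.
    """
    k = len(vals)
    # top derivative from the ODE: f^(k) = -(h0 f + ... + h_{k-1} f^(k-1)) / hk
    top = -sum(c(x) * v for c, v in zip(coeffs[:k], vals)) / coeffs[k](x)
    derivs = list(vals[1:]) + [top]
    return [v + dx * d for v, d in zip(vals, derivs)]

# toy check: f'' + f = 0, started at (cos 0, -sin 0) = (1, 0)
coeffs = [lambda x: 1.0, lambda x: 0.0, lambda x: 1.0]
vals, x, dx = [1.0, 0.0], 0.0, 1e-4
for _ in range(10_000):
    vals = euler_step(vals, coeffs, x, dx)
    x += dx
print(vals[0], math.cos(1.0))  # vals[0] now approximates cos(1)
```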

SLIDE 17

Multivariate case (for simplicity let dim = 2)

  • A smooth f(x, y) is holonomic if for each of x and y, fixing the other variable arbitrarily, we have a holonomic function in x and in y.
  • Namely, if there exist rational functions h^1_0(x,y), \ldots, h^1_{k_1}(x,y), h^2_0(x,y), \ldots, h^2_{k_2}(x,y) in x, y such that

    \sum_{i=0}^{k_1} h^1_i(x,y)\, \partial_x^i f(x,y) = 0, \qquad
    \sum_{i=0}^{k_2} h^2_i(x,y)\, \partial_y^i f(x,y) = 0.

SLIDE 18

  • Consider \partial_x^{r_1} \partial_y^{r_2} f(x,y). If r_1 \ge k_1 or r_2 \ge k_2, we can always compute this by recursively applying the differential equations.
  • As in the univariate case, if we keep the numerical values of \partial_x^i \partial_y^j f(x,y) in the range i = 0, \ldots, k_1 - 1, j = 0, \ldots, k_2 - 1, then we can always compute the other higher-order derivatives.
  • We can also approximate \partial_x^i \partial_y^j f(x+\Delta x, y+\Delta y) by the values \{\partial_x^i \partial_y^j f(x,y)\}_{i=0,\ldots,k_1-1,\; j=0,\ldots,k_2-1}.

SLIDE 19

  • We usually only need to keep a subset of \{\partial_x^i \partial_y^j f(x,y)\}_{i=0,\ldots,k_1-1,\; j=0,\ldots,k_2-1} in memory. The subset is given by the set of standard monomials obtained by the division algorithm based on a Gröbner basis.

SLIDE 20

Which functions are holonomic (Zeilberger (1990))?

  • Polynomials and rational functions are holonomic.
  • exp(rational), log(rational) are holonomic.
  • f, g holonomic ⇒ f + g, f × g holonomic.
  • f(x_1, \ldots, x_m) holonomic ⇒ \int f(x_1, \ldots, x_m)\,dx_m holonomic.
  • The restriction of a holonomic f to an affine subspace is holonomic.

SLIDE 21

  • Holonomicity is also defined for generalized functions.
  • From the above properties, it is often easy to tell that a given function (such as the Airy-like function) is holonomic, i.e., that it must satisfy a differential equation with rational function coefficients.
  • The problem is to find the explicit form of the differential equation.

SLIDE 22

Ex.2: Incomplete gamma function

  • Consider the incomplete gamma function

    G(x) = \int_0^x y^{\alpha-1} e^{-y}\,dy, \qquad \alpha, x > 0.

  • From the general theory, G(x) is holonomic.
  • We integrate in the opposite direction and change the scale:

    \int_0^x (x-y)^{\alpha-1} e^{-(x-y)}\,dy
      = x^{\alpha-1} e^{-x} \int_0^x \left(1 - \frac{y}{x}\right)^{\alpha-1} e^{y}\,dy
      = x^{\alpha} e^{-x} \int_0^1 (1-z)^{\alpha-1} e^{xz}\,dz

SLIDE 23

  • Expand e^{xz} into a Taylor series (not clever?):

    G(x) = x^\alpha e^{-x} \sum_{k=0}^\infty \frac{1}{k!} x^k \int_0^1 (1-z)^{\alpha-1} z^k\,dz
         = x^\alpha e^{-x} \sum_{k=0}^\infty \frac{\Gamma(\alpha)\Gamma(k+1)}{k!\,\Gamma(\alpha+k+1)}\, x^k
         = \frac{1}{\alpha}\, x^\alpha e^{-x} \sum_{k=0}^\infty \frac{(1)_k}{(\alpha+1)_k\, k!}\, x^k
         = \frac{1}{\alpha}\, x^\alpha e^{-x}\, {}_1F_1(1; \alpha+1; x),

    where (a)_k = a(a+1)\cdots(a+k-1).
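The final identity is easy to sanity-check numerically. The sketch below is my illustration, not from the slides: it sums (1/α) x^α e^{-x} {}_1F_1(1; α+1; x) term by term, using the fact that consecutive series terms differ by the factor x/(α+1+k), and compares against direct quadrature of G(x); for α = 2 the closed form G(x) = 1 - (1+x)e^{-x} is available as a second check.

```python
import math

def G_series(alpha, x, terms=200):
    # (1/alpha) x^alpha e^{-x} 1F1(1; alpha+1; x); since (1)_k = k!,
    # the ratio of consecutive 1F1 terms is x / (alpha + 1 + k)
    s, term = 0.0, 1.0
    for k in range(terms):
        s += term
        term *= x / (alpha + 1.0 + k)
    return x**alpha * math.exp(-x) / alpha * s

def G_direct(alpha, x, n=100_000):
    # trapezoidal rule for int_0^x y^(alpha-1) e^(-y) dy, assuming alpha > 1
    h = x / n
    s = 0.5 * x**(alpha - 1) * math.exp(-x)   # endpoint y = x; y = 0 contributes 0
    for i in range(1, n):
        y = i * h
        s += y**(alpha - 1) * math.exp(-y)
    return s * h

alpha, x = 2.0, 1.5
print(G_series(alpha, x), G_direct(alpha, x), 1.0 - (1.0 + x) * math.exp(-x))
```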

SLIDE 24

  • Confluent hypergeometric function {}_1F_1:

    F = {}_1F_1(a; c; x) = \sum_{k=0}^\infty \frac{(a)_k}{(c)_k\, k!}\, x^k

  • Differential equation (ODE) satisfied by F:

    x F''(x) + (c - x) F'(x) - a F = 0

    (comparison of coefficients: \frac{a+k}{c+k}(k+c) - k - a = 0)

SLIDE 25

  • "Holonomic gradient method": we obtain the value of F(x + \Delta x) from F(x) by solving the ODE.
  • If we know F'(x), then we can do the following:

    F(x + \Delta x) \doteq F(x) + \Delta x\, F'(x).

  • For the next step we also need to update F'(x + \Delta x):

    F'(x + \Delta x) \doteq F'(x) + \Delta x\, F''(x).

SLIDE 26

  • Further we need F''(x), \ldots. But note that at any x we have

    F''(x) = -\left( (c - x)F'(x) - aF(x) \right)\frac{1}{x}.

    Hence we can keep only (F(x), F'(x)) in memory. We can always compute higher-order derivatives from these two.
  • In summary, we can use the updating procedure

    \begin{pmatrix} F(x) \\ F'(x) \end{pmatrix} \to \begin{pmatrix} F(x) + \Delta x\, F'(x) \\ F'(x) + \Delta x\, F''(x) \end{pmatrix}.

SLIDE 27

  • In matrix form:

    \begin{pmatrix} F(x) \\ F'(x) \end{pmatrix} + \Delta x \begin{pmatrix} 0 & 1 \\ \frac{a}{x} & -\frac{c-x}{x} \end{pmatrix} \begin{pmatrix} F(x) \\ F'(x) \end{pmatrix}.

  • How about the initial value? Use the series expansion at x \doteq 0.
  • At x = 0 our ODE has a singularity. Hence we have to step away from x = 0 by a tiny amount.
  • Generalization to matrix argument?
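Slides 25-27 translate directly into code. The sketch below is my own (Euler steps rather than Runge-Kutta): it takes initial values of (F, F') from the series at x₀ = 0.1, just off the singularity at 0, and integrates the ODE to x = 1.5, where the result is compared with the series summed directly (the series converges for all x, so it serves as a reference here).

```python
def f11_and_deriv(a, c, x, terms=80):
    # 1F1(a; c; x) and its x-derivative by termwise summation of the series
    F = dF = 0.0
    term = 1.0                      # (a)_k x^k / ((c)_k k!) at k = 0
    for k in range(terms):
        F += term
        if k > 0:
            dF += term * k / x      # derivative of the k-th term (x != 0)
        term *= (a + k) * x / ((c + k) * (k + 1.0))
    return F, dF

a, c = 1.0, 3.0
x = 0.1                             # start near, not at, the singular point x = 0
F, dF = f11_and_deriv(a, c, x)
n_steps = 100_000
h = (1.5 - x) / n_steps
for _ in range(n_steps):
    d2F = (a * F - (c - x) * dF) / x   # F'' from x F'' + (c - x) F' - a F = 0
    F, dF = F + h * dF, dF + h * d2F
    x += h
F_ref, _ = f11_and_deriv(a, c, 1.5)
print(F, F_ref)  # the HGM value and the series value should agree closely
```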

SLIDE 28

Wishart distribution and hypergeometric function of a matrix argument

  • W: m × m symmetric positive definite (W > 0).
  • Density of the Wishart distribution with d.f. n and covariance matrix \Sigma > 0:

    f(W) = C \times \frac{|W|^{\frac{n-m-1}{2}}}{|\Sigma|^{\frac{n}{2}}} \exp\left( -\frac{1}{2}\,\mathrm{tr}\, W\Sigma^{-1} \right)

  • C is known (containing gamma functions).

SLIDE 29

  • \ell_1: the largest root of W.
  • We want to evaluate the probability \Pr(\ell_1 < x).

    \ell_1 < x \;\Leftrightarrow\; W < xI_m, where I_m is the m × m identity matrix.

  • Hence the probability is given in the incomplete gamma form:

    \Pr(\ell_1 < x) = C \int_{0 < W < xI_m} \frac{|W|^{\frac{n-m-1}{2}}}{|\Sigma|^{\frac{n}{2}}} \exp\left( -\frac{1}{2}\,\mathrm{tr}\, W\Sigma^{-1} \right) dW

  • From the general theory, \Pr(\ell_1 < x) is holonomic.

SLIDE 30

  • Just as in dim = 1, \Pr(\ell_1 < x) is written as

    C \exp\left( -\frac{x}{2}\,\mathrm{tr}\,\Sigma^{-1} \right) x^{\frac{1}{2}nm}\; {}_1F_1\!\left( \frac{m+1}{2};\; \frac{n+m+1}{2};\; \frac{x}{2}\Sigma^{-1} \right)

  • Hypergeometric function of a matrix argument (Herz (1955)):

    {}_1F_1(a; c; Y) = \frac{\Gamma_m(c)}{\Gamma_m(a)\Gamma_m(c-a)} \int_{0 < X < I_m} \exp(\mathrm{tr}\, XY)\, |X|^{a-(m+1)/2}\, |I_m - X|^{c-a-(m+1)/2}\, dX,

    where

    \Gamma_m(a) = \pi^{\frac{1}{4}m(m-1)} \prod_{i=1}^m \Gamma\!\left( a - \frac{i-1}{2} \right).

SLIDE 31

  • {}_1F_1(a; c; Y) is a symmetric function of the characteristic roots of Y ⇒ its series expression is written in terms of symmetric polynomials.
  • Zonal polynomials (A. T. James): C_\kappa(Y), \kappa \vdash k, a homogeneous symmetric polynomial of degree k in the characteristic roots of Y.
  • Pochhammer symbol:

    (a)_\kappa = \prod_{i=1}^{l} \left( a - \frac{i-1}{2} \right)_{k_i}, \qquad (a)_{k_i} = \prod_{j=1}^{k_i} (a + j - 1)

SLIDE 32

  • Series expansion of {}_1F_1 (Constantine (1963)):

    {}_1F_1(a; c; Y) = \sum_{k=0}^\infty \sum_{\kappa \vdash k} \frac{(a)_\kappa}{(c)_\kappa}\, \frac{C_\kappa(Y)}{k!}.

  • This is a beautiful mathematical result. However, for numerical computation zonal polynomials present enormous combinatorial difficulties, and statisticians pretty much forgot about them.

SLIDE 33

  • The partial differential equations satisfied by F(Y) = {}_1F_1(a; c; y_1, \ldots, y_m) were obtained by Muirhead (1970): g_i F = 0, i = 1, \ldots, m, where

    g_i = y_i \partial_i^2 + (c - y_i)\partial_i + \frac{1}{2} \sum_{j \ne i} \frac{y_j}{y_i - y_j}\, (\partial_i - \partial_j) - a.

SLIDE 34

  • Can we use this PDE for numerical computation? (People never tried this for 40 years.)
  • It works! It works very well up to dimension m = 10! Like Columbus's egg.

SLIDE 35

Holonomic gradient method for dimension two

  • Two partial differential equations:

    \left( y_1 \partial_1^2 + (c - y_1)\partial_1 + \frac{1}{2}\, \frac{y_2}{y_1 - y_2}\, (\partial_1 - \partial_2) - a \right) F = 0,
    \left( y_2 \partial_2^2 + (c - y_2)\partial_2 + \frac{1}{2}\, \frac{y_1}{y_2 - y_1}\, (\partial_2 - \partial_1) - a \right) F = 0.

  • Let us compute higher-order derivatives from these equations.

SLIDE 36

  • Divide the second equation by y_2 and write

    \partial_1^{n_1} \partial_2^{n_2} F = \partial_1^{n_1} \partial_2^{n_2-2} \left( -\frac{c}{y_2}\partial_2 + \partial_2 - \frac{1}{2}\, \frac{y_1}{y_2(y_2 - y_1)}\, (\partial_2 - \partial_1) + \frac{a}{y_2} \right) F.

  • The RHS becomes messy, but an important fact is that the total number of differentiations is reduced by 1.
  • We can keep reducing the number of differentiations as long as some variable is differentiated more than once.

SLIDE 37

  • This implies that all higher-order derivatives can be written as rational-function combinations of the following 4 square-free mixed derivatives: F(Y), \partial_1 F(Y), \partial_2 F(Y), \partial_1\partial_2 F(Y).
  • In algebraic terminology, let K = \mathbf{C}(y_1, y_2) be the field of rational functions and let R = K\langle \partial_1, \partial_2 \rangle = \mathbf{C}(y_1, y_2)\langle \partial_1, \partial_2 \rangle be the ring of differential operators.
  • We take the graded lexicographic order as the term order.

slide-38
SLIDE 38

Theorem 1 {g1, g2} is a Gr¨

  • bner basis and

{1, ∂1, ∂2, ∂1∂2} is the set of standard monomials. (This theorem hold for general dimension.)

  • [Division algorithm] by repeating the

differentiations, for each n1, n2 there exist rational functions h(n1,n2)

00

, h(n1,n2)

10

, h(n1,n2)

01

, h(n1,n2)

11

in y1, y2, such that ∂n1

1 ∂n2 2 F = h(n1,n2) 00

F + h(n1,n2)

10

∂1F + h(n1,n2)

01

∂2F + h(n1,n2)

11

∂1∂2F.

37

SLIDE 39

  • Hence we only keep F(Y), \partial_1 F(Y), \partial_2 F(Y), \partial_1\partial_2 F(Y) in memory. We can always compute higher-order derivatives from these 4 values. (For dimension m, we need to keep 2^m square-free mixed derivatives in memory.)

SLIDE 40

  • Actually for m = 2 we only need \partial_1\partial_2^2 F:

    \partial_1\partial_2^2 F = \partial_1 \left( -\frac{c - y_2}{y_2}\partial_2 - \frac{1}{2}\, \frac{y_1}{y_2(y_2 - y_1)}\, (\partial_2 - \partial_1) + \frac{a}{y_2} \right) F
      = \left( -\frac{c - y_2}{y_2}\partial_1\partial_2 - \frac{1}{2}\, \frac{1}{(y_2 - y_1)^2}\, (\partial_2 - \partial_1) - \frac{1}{2}\, \frac{y_1}{y_2(y_2 - y_1)}\, (\partial_1\partial_2 - \partial_1^2) + \frac{a}{y_2}\partial_1 \right) F.

    (By symmetry, \partial_1^2\partial_2 F is obtained by the interchange y_1 \leftrightarrow y_2.)
  • We further substitute the expression for \partial_1^2 F into the RHS.

SLIDE 41

  • Result:

    \partial_1\partial_2^2 F = \frac{a}{2y_2(y_2 - y_1)}\, F
      + \left( \frac{3}{4}\, \frac{1}{(y_2 - y_1)^2} + \frac{a}{y_2} - \frac{c - y_1}{2y_2(y_2 - y_1)} \right) \partial_1 F
      - \frac{3}{4}\, \frac{1}{(y_2 - y_1)^2}\, \partial_2 F
      - \left( \frac{c - y_2}{y_2} + \frac{1}{2}\, \frac{y_1}{y_2(y_2 - y_1)} \right) \partial_1\partial_2 F
      = h^{(1,2)}_{00} F + h^{(1,2)}_{10}\, \partial_1 F + h^{(1,2)}_{01}\, \partial_2 F + h^{(1,2)}_{11}\, \partial_1\partial_2 F.

SLIDE 42

  • Then as in the one-dimensional case, we can update:

    F(y_1+\Delta y_1, y_2+\Delta y_2) \doteq F(y_1,y_2) + \Delta y_1\, \partial_1 F(y_1,y_2) + \Delta y_2\, \partial_2 F(y_1,y_2),
    \partial_1 F(y_1+\Delta y_1, y_2+\Delta y_2) \doteq \partial_1 F(y_1,y_2) + \Delta y_1\, \partial_1^2 F(y_1,y_2) + \Delta y_2\, \partial_1\partial_2 F(y_1,y_2),
    [\partial_2 F is similar]
    \partial_1\partial_2 F(y_1+\Delta y_1, y_2+\Delta y_2) \doteq \partial_1\partial_2 F(y_1,y_2) + \Delta y_1\, \partial_1^2\partial_2 F(y_1,y_2) + \Delta y_2\, \partial_1\partial_2^2 F(y_1,y_2).

SLIDE 43

  • Let

    \mathbf{F} = (F, \partial_1 F, \partial_2 F, \partial_1\partial_2 F)^t.

  • Our updating formula is

    \mathbf{F}(Y + \Delta Y) \doteq \mathbf{F}(Y) + \Delta y_1 \times P_1(Y)\,\mathbf{F}(Y) + \Delta y_2 \times P_2(Y)\,\mathbf{F}(Y),

    where P_1, P_2 are 4 × 4 matrices of rational functions in Y, called the Pfaffian system.
  • This simple form works very well for dimension two.
  • For general dimension m, the matrices in the Pfaffian system are 2^m × 2^m. The size is 2^m = 1024 for m = 10.
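The updating formula itself is easy to demonstrate in code. The following sketch is mine: it implements one Euler step of F(Y + ΔY) ≈ F(Y) + Δy₁P₁(Y)F(Y) + Δy₂P₂(Y)F(Y) for an arbitrary Pfaffian system. As a deliberately trivial stand-in for the Wishart matrices (which are lengthy), it uses F = exp(y₁ + y₂), for which all four entries of (F, ∂₁F, ∂₂F, ∂₁∂₂F) coincide and P₁ = P₂ = I; this toy system is my assumption for illustration only.

```python
def pfaffian_euler(vec, P1, P2, y, dy1, dy2):
    # one Euler step: F(Y + dY) ≈ F(Y) + dy1 P1(Y) F(Y) + dy2 P2(Y) F(Y)
    n = len(vec)
    def matvec(P, v):
        return [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
    a, b = matvec(P1(y), vec), matvec(P2(y), vec)
    return [vec[i] + dy1 * a[i] + dy2 * b[i] for i in range(n)]

# toy Pfaffian system: F = exp(y1 + y2), basis (F, d1F, d2F, d1d2F);
# every basis entry equals F, so P1 = P2 = identity (independent of Y)
I = lambda y: [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
vec, y = [1.0] * 4, (0.0, 0.0)
n_steps = 2_000
h = 1.0 / n_steps
for _ in range(n_steps):                  # walk from (0, 0) to (1, 1)
    vec = pfaffian_euler(vec, I, I, y, h, h)
    y = (y[0] + h, y[1] + h)
print(vec[0])  # approximates exp(2)
```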

SLIDE 44

Pfaffian system for general dimension

  • Theorem 1 on the non-diagonal region holds for general dimension.
  • For the Pfaffian system, we should leave the recursive form as it is, without actually expanding it.

SLIDE 45

  • For J \subset \{1, \ldots, m\}, denote the square-free derivative with respect to the variables in J as

    \partial_J = \prod_{j \in J} \partial_j.

  • We need \partial_J \partial_i^2 F = \partial_i^2 \partial_J F for i \notin J.

SLIDE 46

  • Let I = J \cup \{i\}. Collect the square-free terms in \partial_J g_i as

    r(i, J; y) = -\Bigg( (c - y_i)\partial_I - a\partial_J + \frac{1}{2} \sum_{k \notin I} \frac{y_k}{y_i - y_k}\, (\partial_I - \partial_J\partial_k)
      + \frac{1}{2} \sum_{k \in J} \frac{y_k}{y_i - y_k}\, \partial_I + \frac{1}{2} \sum_{k \in J} \frac{y_i}{(y_i - y_k)^2}\, (\partial_i \partial_{J\setminus\{k\}} - \partial_J) \Bigg).

SLIDE 47

  • Separating the terms for which we need recursive differentiations in \partial_i^2 \partial_J F, we have

    y_i \partial_i^2 \partial_J F = r(i, J; y)F + \frac{1}{2} \sum_{k \in J} \frac{1}{y_i - y_k}\, (y_k \partial_k^2 \partial_{J\setminus\{k\}}) F.

  • This form is suitable for efficient programming (done by N. Takayama).

SLIDE 48

  • When we expand the recursion to the end, we obtain

    y_i \partial_i^2 \partial_J F = r(i, J; y)F
      + \frac{1}{2} \sum_{k_1 \in J} \frac{1}{y_i - y_{k_1}}\, r(k_1, J \setminus \{k_1\}; y)F
      + \frac{1}{4} \sum_{\substack{k_1, k_2 \in J \\ k_1, k_2 \text{ distinct}}} \frac{1}{(y_i - y_{k_1})(y_{k_1} - y_{k_2})}\, r(k_2, J \setminus \{k_1, k_2\}; y)F
      + \cdots + \frac{1}{2^{|J|}} \sum_{\substack{k_1, \ldots, k_{|J|} \in J \\ k_1, \ldots, k_{|J|} \text{ distinct}}} \frac{1}{(y_i - y_{k_1}) \cdots (y_{k_{|J|-1}} - y_{k_{|J|}})}\, r(k_{|J|}, \emptyset; y)F.

SLIDE 49

  • This form is also useful for theoretical consideration of the Pfaffian system.
  • The problem of initial values is also difficult in general dimension. N. Takayama extended the algorithm of Koev-Edelman (2006) to handle partial derivatives.

SLIDE 50

Numerical experiments

  • Statisticians need good numerical performance (recall zonal polynomials).
  • We were not sure whether it would work up to dimension m = 10.
  • How can we check the numerical accuracy?
    – The initial value is taken close to zero. We can check the numerical convergence \lim_{x\to\infty} \Pr(\ell_1 < x) = 1.

SLIDE 51

    – The following easy bounds are available:

      \Pr[\ell_1 < x \mid \mathrm{diag}(\sigma_1^2, \ldots, \sigma_1^2)]
        \le \Pr[\ell_1 < x \mid \mathrm{diag}(\sigma_1^2, \ldots, \sigma_m^2)]
        \le \Pr[\ell_1 < x \mid \mathrm{diag}(\sigma_1^2, 0, \ldots, 0)].

  • Example: m = 10, d.f. n = 12, \Sigma^{-1} = 2\,\mathrm{diag}(1, 2, \ldots, 10).

SLIDE 52

  • For x = 30 the bounds are (0.99866943, 0.99999998).
  • Intel Core i7 machine. The computation of the initial value at x_0 = 0.2 takes 20 seconds. Then with step size 0.001, we solve the PDE up to x = 30, which takes 75 seconds. Output: \Pr[\ell_1 < 30] = 0.999545.
  • This accuracy is somewhat amazing, if we consider that we updated a 1024-dimensional vector 30,000 times.

SLIDE 53

  • Plot of the cumulative distribution function \Pr(\ell_1 < x) computed by HGM. [figure: CDF curve labeled "by hg"; y-axis 0.2 to 1.2, x-axis 5 to 20]

SLIDE 54

Summary

  • The holonomic gradient method is practical if we implement it efficiently.
  • Our work brought a totally new approach to a longstanding difficult problem in multivariate statistics.
  • The holonomic gradient method is general and can be applied to many problems.
  • We stand at the beginning of applications of D-module theory to statistics!