An introduction to the holonomic gradient method in statistics
Akimichi Takemura, Univ. Tokyo
September 5, 2012
Items
1. First example: Airy-like function
2. Holonomic function and holonomic gradient method (HGM)
3. Another example: incomplete gamma function
4. Wishart distribution and hypergeometric function of a matrix argument
5. HGM for two-dimensional Wishart matrix
6. Pfaffian system for general dimension
7. Numerical experiments
- Coauthors on HGM: H. Hashiguchi, T. Koyama, H. Nakayama, K. Nishiyama, M. Noro, Y. Numata, K. Ohara, T. Sei, N. Takayama.
- HGM proposed in "Holonomic gradient descent and its application to the Fisher-Bingham integral", Advances in Applied Mathematics, 47, 639–658, N3OST2, 2011.
- We now have about 7 manuscripts on HGM, mainly by the Takayama group.
- The Wishart case is discussed in arXiv:1201.0472v3.
Ex. 1: Airy-like function

An exercise problem by Nobuki Takayama on Sep. 16, 2009, during the "Kobe Gröbner School".

Question: Let $A(x) = \int_0^\infty e^{-t-xt^3}\,dt$, $x > 0$. Derive a differential equation satisfied by $A(x)$.

Answer:
$$1 = (27x^3\partial_x^2 + 54x^2\partial_x + 6x + 1)A(x) = 27x^3 A''(x) + 54x^2 A'(x) + (6x+1)A(x).$$
- Actually this question is pretty hard, even if you are told the answer.
- I was struggling with this problem, wasting lots of paper and wondering why I was doing this exercise.
- After one hour, I suddenly realized that this is indeed an important problem in statistics.
- We try to confirm the given answer.
- First
$$A'(x) = \partial_x \int_0^\infty e^{-t-xt^3}\,dt = -\int_0^\infty t^3 e^{-t-xt^3}\,dt, \qquad A''(x) = \int_0^\infty t^6 e^{-t-xt^3}\,dt.$$
- Hence
$$27x^3 A''(x) + 54x^2 A'(x) + (6x+1)A(x) = \int_0^\infty (27x^3t^6 - 54x^2t^3 + 6x + 1)\,e^{-t-xt^3}\,dt = 1 \ ??$$
- Integration by parts:
$$\partial_t\, e^{-t-xt^3} = -(1+3xt^2)\,e^{-t-xt^3}.$$
Hence
$$0 = \Bigl[t\,e^{-t-xt^3}\Bigr]_0^\infty = \int_0^\infty \partial_t\bigl(t\,e^{-t-xt^3}\bigr)\,dt = A(x) - \int_0^\infty t(1+3xt^2)\,e^{-t-xt^3}\,dt.$$
- Similarly, if you work hard and work out
$$1 = \Bigl[(-9x^2t^4 + 3xt^2 + 6xt - 1)\,e^{-t-xt^3}\Bigr]_0^\infty$$
by integration by parts (how did you get this?), you obtain the answer:
$$\int_0^\infty (27x^3t^6 - 54x^2t^3 + 6x + 1)\,e^{-t-xt^3}\,dt = 1.$$
- Believe and try:
$$\partial_t(-9x^2t^4 + 3xt^2 + 6xt - 1) + (9x^2t^4 - 3xt^2 - 6xt + 1)(1 + 3xt^2)$$
$$= (-36x^2t^3 + 6xt + 6x) + (9x^2t^4 - 3xt^2 - 6xt + 1) + (27x^3t^6 - 9x^2t^4 - 18x^2t^3 + 3xt^2)$$
$$= 27x^3t^6 - 54x^2t^3 + 6x + 1 \quad \text{(OK!)}$$
- Actually Prof. Takayama (or one of his students) did this by computer, while asking us to do it by hand(!) ← integration algorithm for D-modules based on Gröbner bases for D-modules.
Is this exercise related to statistics?
- Change the notation and let
$$A(\theta) = \int_0^\infty e^{-x-\theta x^3}\,dx.$$
- Let
$$f(x;\theta) = \frac{1}{A(\theta)}\,e^{-x-\theta x^3}, \qquad x, \theta > 0.$$
- This is an exponential family with the sufficient statistic $T(x) = x^3$. (We can absorb $e^{-x}$ into the base measure $dx$.)
- Therefore we are evaluating the normalizing constant of an exponential family, together with its derivatives.
- We now know
$$1 = 27\theta^3 A''(\theta) + 54\theta^2 A'(\theta) + (6\theta+1)A(\theta).$$
- Hence $A''(\theta)$, which gives the Fisher information (spelled out below), is automatically obtained from $A(\theta)$ and $A'(\theta)$.
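To make the Fisher information step explicit (a standard exponential family identity, added here for completeness): since $\log f(x;\theta) = -\theta x^3 - x - \log A(\theta)$, the Fisher information is the variance of the sufficient statistic,
$$I(\theta) = \mathrm{Var}_\theta(x^3) = \frac{d^2}{d\theta^2}\log A(\theta) = \frac{A''(\theta)}{A(\theta)} - \left(\frac{A'(\theta)}{A(\theta)}\right)^2,$$
so $I(\theta)$ is available as soon as $A(\theta)$, $A'(\theta)$, $A''(\theta)$ are.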
- Suppose that we have independent observations $x_1, \ldots, x_n$ from $f(x;\theta)$.
- Then the log likelihood $\ell(\theta)$ is written as
$$\ell(\theta) = -\theta \sum_{i=1}^n x_i^3 - n \log A(\theta)$$
and the likelihood equation is
$$0 = -\sum_{i=1}^n x_i^3 - n\,\frac{A'(\theta)}{A(\theta)}.$$
- Can we numerically evaluate $A(\theta)$ and $A'(\theta)$? (Also $A''(\theta)$ for Newton-Raphson?)
- For illustration, we use a simple linear approximation (the Euler method for numerical integration):
$$A(\theta + \Delta\theta) \doteq A(\theta) + \Delta\theta\, A'(\theta), \qquad A'(\theta + \Delta\theta) \doteq A'(\theta) + \Delta\theta\, A''(\theta).$$
- But from the differential equation we know
$$A''(\theta) = \frac{1}{27\theta^3}\Bigl(1 - (6\theta+1)A(\theta) - 54\theta^2 A'(\theta)\Bigr).$$
- Punch line: if you keep the numerical values of $A(\theta)$, $A'(\theta)$ at one point $\theta$, then you can compute these values at a nearby point $\theta + \Delta\theta$.
- At each point, higher-order derivatives $A''(\theta), A'''(\theta), \ldots$ can be computed as needed.
- Hence by numerically solving the ODE, you can compute $A(\theta)$ and its derivatives at any point → "Holonomic Gradient Method".
- For explanation we used the Euler method, but in our actual implementation we use the Runge-Kutta method to solve the ODE (a small sketch follows).
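A minimal sketch of this scheme in Python, assuming SciPy for the one-time quadrature at the starting point; the starting point, target, and step count are illustrative choices, not the authors' implementation.

```python
# A minimal HGM sketch for the Airy-like normalizing constant
# A(theta) = \int_0^\infty exp(-x - theta x^3) dx.
# Quadrature is used once, at theta0, to get initial values of A and A';
# after that only the holonomic ODE
#   A'' = (1 - (6 theta + 1) A - 54 theta^2 A') / (27 theta^3)
# is used (classical Runge-Kutta instead of Euler).
import numpy as np
from scipy.integrate import quad

def rhs(theta, y):
    """First-order system for y = (A, A')."""
    A, A1 = y
    A2 = (1.0 - (6.0 * theta + 1.0) * A - 54.0 * theta**2 * A1) / (27.0 * theta**3)
    return np.array([A1, A2])

def hgm_airy(theta0, theta1, steps=10000):
    # One-time initial values by quadrature at theta0.
    A0, _ = quad(lambda x: np.exp(-x - theta0 * x**3), 0, np.inf)
    A10, _ = quad(lambda x: -x**3 * np.exp(-x - theta0 * x**3), 0, np.inf)
    y, t = np.array([A0, A10]), theta0
    h = (theta1 - theta0) / steps
    for _ in range(steps):                      # RK4 steps along [theta0, theta1]
        k1 = rhs(t, y)
        k2 = rhs(t + h / 2, y + h / 2 * k1)
        k3 = rhs(t + h / 2, y + h / 2 * k2)
        k4 = rhs(t + h, y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y                                    # (A(theta1), A'(theta1))

# Sanity check against direct quadrature at the target point:
A_hgm = hgm_airy(1.0, 5.0)[0]
A_direct, _ = quad(lambda x: np.exp(-x - 5.0 * x**3), 0, np.inf)
print(A_hgm, A_direct)
```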
Holonomic function and holonomic gradient method (HGM)
Univariate homogeneous case:
- A smooth function $f$ is holonomic if $f$ satisfies an ODE of the following form:
$$0 = h_k(x) f^{(k)}(x) + \cdots + h_1(x) f'(x) + h_0(x) f(x),$$
where $h_k(x), \ldots, h_1(x), h_0(x)$ are rational functions of $x$.
- Approximation (a generic implementation sketch follows):
$$\begin{pmatrix} f(x+\Delta x)\\ f'(x+\Delta x)\\ \vdots\\ f^{(k-1)}(x+\Delta x) \end{pmatrix} \doteq \left( I_k + \Delta x \begin{pmatrix} 0 & 1 & & \\ & \ddots & \ddots & \\ & & 0 & 1 \\ -\frac{h_0(x)}{h_k(x)} & -\frac{h_1(x)}{h_k(x)} & \cdots & -\frac{h_{k-1}(x)}{h_k(x)} \end{pmatrix} \right) \begin{pmatrix} f(x)\\ f'(x)\\ \vdots\\ f^{(k-1)}(x) \end{pmatrix}$$
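A generic rendering of this update (an illustrative sketch; the caller supplies the rational coefficient functions $h_0, \ldots, h_k$ as callables):

```python
# One Euler step for a holonomic f: multiply (f, f', ..., f^(k-1)) by
# I + dx * C(x), where C(x) is the companion matrix shown above.
import numpy as np

def companion(h, x):
    """h = [h_0, ..., h_k], a list of k+1 callables; returns the k x k C(x)."""
    k = len(h) - 1
    C = np.zeros((k, k))
    C[:-1, 1:] = np.eye(k - 1)                         # superdiagonal of ones
    C[-1, :] = [-h[i](x) / h[k](x) for i in range(k)]  # last row: -h_i / h_k
    return C

def euler_step(h, x, fvec, dx):
    return fvec + dx * companion(h, x) @ fvec

# Example: f(x) = exp(x^2) satisfies f'(x) - 2x f(x) = 0, i.e. h = [-2x, 1].
h = [lambda x: -2.0 * x, lambda x: 1.0]
x, fvec, dx = 0.0, np.array([1.0]), 1e-4
for _ in range(10000):                # march from x = 0 to x = 1
    fvec = euler_step(h, x, fvec, dx)
    x += dx
print(fvec[0])                        # close to e = 2.71828 (first-order error)
```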
Multivariate case (for simplicity let dim = 2)
- A smooth $f(x,y)$ is holonomic if, for each of $x$ and $y$, fixing the other variable arbitrarily, $f$ satisfies an ODE with rational function coefficients in that variable.
- Namely, if there exist rational functions $h^1_0(x,y), \ldots, h^1_{k_1}(x,y), h^2_0(x,y), \ldots, h^2_{k_2}(x,y)$ in $x, y$ such that
$$\sum_{i=0}^{k_1} h^1_i(x,y)\,\partial_x^i f(x,y) = 0, \qquad \sum_{i=0}^{k_2} h^2_i(x,y)\,\partial_y^i f(x,y) = 0.$$
- Consider $\partial_x^{r_1}\partial_y^{r_2} f(x,y)$. If $r_1 \ge k_1$ or $r_2 \ge k_2$, we can always compute this by recursively applying the differential equations.
- As in the univariate case, if we keep the numerical values of $\partial_x^i\partial_y^j f(x,y)$ in the range $i = 0, \ldots, k_1-1$, $j = 0, \ldots, k_2-1$, then we can always compute the other higher-order derivatives.
- We can also approximate $\partial_x^i\partial_y^j f(x+\Delta x, y+\Delta y)$ by the values $\{\partial_x^i\partial_y^j f(x,y)\}_{i=0,\ldots,k_1-1,\ j=0,\ldots,k_2-1}$.
- We usually only need to keep a subset of $\{\partial_x^i\partial_y^j f(x,y)\}_{i=0,\ldots,k_1-1,\ j=0,\ldots,k_2-1}$ in memory. The subset is given by the set of standard monomials obtained by the division algorithm based on a Gröbner basis.
Which functions are holonomic (Zeilberger (1990))?
- Polynomials and rational functions are holonomic.
- exp(rational) and log(rational) are holonomic.
- $f, g$ holonomic ⇒ $f + g$, $f \times g$ holonomic.
- $f(x_1, \ldots, x_m)$ holonomic ⇒ $\int f(x_1, \ldots, x_m)\,dx_m$ holonomic.
- The restriction of a holonomic $f$ to an affine subspace is holonomic.
- Holonomicity is also defined for generalized functions.
- From the above properties, it is often easy to tell that a given function (such as the Airy-like function) is holonomic, i.e., that it must satisfy a differential equation with rational function coefficients.
- The problem is to find the explicit form of the differential equation.
Ex. 2: Incomplete gamma function
- Consider the incomplete gamma function
$$G(x) = \int_0^x y^{\alpha-1} e^{-y}\,dy, \qquad \alpha, x > 0.$$
- From the general theory, $G(x)$ is holonomic.
- We integrate in the opposite direction and change the scale:
$$\int_0^x (x-y)^{\alpha-1} e^{-(x-y)}\,dy = x^{\alpha-1} e^{-x} \int_0^x \Bigl(1-\frac{y}{x}\Bigr)^{\alpha-1} e^{y}\,dy = x^{\alpha} e^{-x} \int_0^1 (1-z)^{\alpha-1} e^{xz}\,dz$$
- Expand $e^{xz}$ into a Taylor series (not clever?):
$$G(x) = x^{\alpha} e^{-x} \sum_{k=0}^\infty \frac{x^k}{k!} \int_0^1 (1-z)^{\alpha-1} z^k\,dz = x^{\alpha} e^{-x} \sum_{k=0}^\infty \frac{\Gamma(\alpha)\Gamma(k+1)}{k!\,\Gamma(\alpha+k+1)}\,x^k = \frac{1}{\alpha}\,x^{\alpha} e^{-x} \sum_{k=0}^\infty \frac{(1)_k}{(\alpha+1)_k\,k!}\,x^k = \frac{1}{\alpha}\,x^{\alpha} e^{-x}\,{}_1F_1(1;\alpha+1;x),$$
where $(a)_k = a(a+1)\cdots(a+k-1)$.
- Confluent hypergeometric function ${}_1F_1$:
$$F = {}_1F_1(a;c;x) = \sum_{k=0}^\infty \frac{(a)_k}{(c)_k\,k!}\,x^k$$
- Differential equation (ODE) satisfied by $F$:
$$x F''(x) + (c-x) F'(x) - aF = 0$$
(comparison of coefficients: with $c_k = (a)_k/((c)_k\,k!)$, the coefficient of $x^k$ gives $(k+1)(c+k)\,c_{k+1} - (a+k)\,c_k = 0$).
- "Holonomic gradient method": we obtain the value of $F(x+\Delta x)$ from $F(x)$ by solving the ODE.
- If we know $F'(x)$, then we can do the following:
$$F(x+\Delta x) \doteq F(x) + \Delta x\, F'(x)$$
- For the next step we also need to update $F'(x+\Delta x)$:
$$F'(x+\Delta x) \doteq F'(x) + \Delta x\, F''(x)$$
- Further we need $F''(x), \ldots$. But note that at any $x$ we have
$$F''(x) = -\bigl((c-x)F'(x) - aF(x)\bigr)\,\frac{1}{x}.$$
Hence we can keep only $(F(x), F'(x))$ in memory. We can always compute higher-order derivatives from these two.
- In summary, we can use the updating procedure
$$\begin{pmatrix} F(x)\\ F'(x) \end{pmatrix} \to \begin{pmatrix} F(x) + \Delta x\,F'(x)\\ F'(x) + \Delta x\,F''(x) \end{pmatrix}.$$
- In matrix form:
$$\begin{pmatrix} F(x+\Delta x)\\ F'(x+\Delta x) \end{pmatrix} \doteq \left( I_2 + \Delta x \begin{pmatrix} 0 & 1\\ \frac{a}{x} & -\frac{c-x}{x} \end{pmatrix} \right) \begin{pmatrix} F(x)\\ F'(x) \end{pmatrix}.$$
- How about the initial value? Use the series expansion at $x \doteq 0$.
- At $x = 0$ our ODE has a singularity. Hence we have to step away from $x = 0$ by a tiny amount (see the sketch below).
- Generalization to a matrix argument?
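A minimal sketch of this scheme, with the initial values taken from a truncated series near $x = 0$ and the result checked against SciPy's regularized incomplete gamma function. The starting point $x_0$, the step count, and the plain Euler step are illustrative choices:

```python
# HGM for 1F1(a; c; x): propagate (F, F') with the 2x2 matrix
# [[0, 1], [a/x, -(c - x)/x]], starting near the singular point x = 0.
import numpy as np
from scipy.special import gamma, gammainc

def f1f1_series(a, c, x, terms=30):
    """Truncated series values of (1F1, its derivative) for small x."""
    F, F1, coef = 0.0, 0.0, 1.0
    for k in range(terms):
        F += coef * x**k
        if k >= 1:
            F1 += coef * k * x**(k - 1)
        coef *= (a + k) / ((c + k) * (k + 1))   # c_{k+1} = c_k (a+k)/((c+k)(k+1))
    return np.array([F, F1])

def hgm_1f1(a, c, x0, x1, steps=20000):
    y, x = f1f1_series(a, c, x0), x0            # initial values near x = 0
    h = (x1 - x0) / steps
    for _ in range(steps):                      # Euler steps with the matrix above
        P = np.array([[0.0, 1.0], [a / x, -(c - x) / x]])
        y = y + h * (P @ y)
        x += h
    return y

# Check: G(x) = x^alpha e^{-x} 1F1(1; alpha+1; x) / alpha is the (lower)
# incomplete gamma function; compare its regularized value with SciPy.
alpha, x = 1.5, 4.0
F = hgm_1f1(1.0, alpha + 1.0, 0.01, x)[0]
G = x**alpha * np.exp(-x) * F / alpha
print(G / gamma(alpha), gammainc(alpha, x))     # should nearly agree
```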
Wishart distribution and hypergeometric function of a matrix argument
- $W$: $m \times m$ symmetric positive definite ($W > 0$)
- Density of the Wishart distribution with d.f. $n$ and covariance matrix $\Sigma > 0$:
$$f(W) = C \times \frac{|W|^{\frac{n-m-1}{2}}}{|\Sigma|^{\frac{n}{2}}}\,\exp\Bigl(-\frac{1}{2}\,\mathrm{tr}\,W\Sigma^{-1}\Bigr)$$
- $C$ is known (containing gamma functions).
- $\ell_1$: the largest root of $W$.
- We want to evaluate the probability $\Pr(\ell_1 < x)$. Note that $\ell_1 < x \Leftrightarrow W < xI_m$, where $I_m$ is the $m \times m$ identity matrix.
- Hence the probability is given in incomplete gamma form:
$$\Pr(\ell_1 < x) = C \int_{0 < W < xI_m} \frac{|W|^{\frac{n-m-1}{2}}}{|\Sigma|^{\frac{n}{2}}}\,\exp\Bigl(-\frac{1}{2}\,\mathrm{tr}\,W\Sigma^{-1}\Bigr)\,dW$$
- From the general theory, $\Pr(\ell_1 < x)$ is holonomic.
- Just as in dim = 1, $\Pr(\ell_1 < x)$ is written as
$$C \exp\Bigl(-\frac{x}{2}\,\mathrm{tr}\,\Sigma^{-1}\Bigr)\, x^{\frac{1}{2}nm}\, {}_1F_1\Bigl(\frac{m+1}{2};\ \frac{n+m+1}{2};\ \frac{x}{2}\Sigma^{-1}\Bigr)$$
- Hypergeometric function of a matrix argument (Herz (1955)):
$${}_1F_1(a;c;Y) = \frac{\Gamma_m(c)}{\Gamma_m(a)\,\Gamma_m(c-a)} \int_{0<X<I_m} \exp(\mathrm{tr}\,XY)\,|X|^{a-(m+1)/2}\,|I_m - X|^{c-a-(m+1)/2}\,dX,$$
where
$$\Gamma_m(a) = \pi^{\frac{1}{4}m(m-1)} \prod_{i=1}^m \Gamma\Bigl(a - \frac{i-1}{2}\Bigr).$$
- ${}_1F_1(a;c;Y)$ is a symmetric function of the characteristic roots of $Y$ ⇒ its series expression is written in terms of symmetric polynomials.
- Zonal polynomials (A. T. James): $C_\kappa(Y)$, $\kappa \vdash k$, a homogeneous symmetric polynomial of degree $k$ in the characteristic roots of $Y$.
- Pochhammer symbol:
$$(a)_\kappa = \prod_{i=1}^{l} \Bigl(a - \frac{i-1}{2}\Bigr)_{k_i}, \qquad (a)_{k_i} = \prod_{j=1}^{k_i} (a + j - 1)$$
- Series expansion of ${}_1F_1$ (Constantine (1963)):
$${}_1F_1(a;c;Y) = \sum_{k=0}^\infty \sum_{\kappa \vdash k} \frac{(a)_\kappa}{(c)_\kappa}\,\frac{C_\kappa(Y)}{k!}.$$
- This is a beautiful mathematical result. However, for numerical computation zonal polynomials pose enormous combinatorial difficulties, and statisticians pretty much forgot about them.
- The partial differential equations satisfied by $F(Y) = {}_1F_1(a;c;y_1,\ldots,y_m)$ were obtained by Muirhead (1970): $g_i F = 0$, $i = 1,\ldots,m$, where
$$g_i = y_i\partial_i^2 + (c - y_i)\partial_i + \frac{1}{2}\sum_{j\ne i} \frac{y_j}{y_i - y_j}\,(\partial_i - \partial_j) - a.$$
- Can we use this PDE for numerical computation? (Nobody tried this for 40 years.)
- It works! It works very well up to dimension $m = 10$! Like Columbus's egg.
Holonomic gradient method for dimension two
- Two partial differential equations:
$$\Bigl(y_1\partial_1^2 + (c - y_1)\partial_1 + \frac{1}{2}\,\frac{y_2}{y_1 - y_2}\,(\partial_1 - \partial_2) - a\Bigr)F = 0,$$
$$\Bigl(y_2\partial_2^2 + (c - y_2)\partial_2 + \frac{1}{2}\,\frac{y_1}{y_2 - y_1}\,(\partial_2 - \partial_1) - a\Bigr)F = 0.$$
- Let us compute higher-order derivatives from these equations.
- Divide the second equation by $y_2$ and write
$$\partial_1^{n_1}\partial_2^{n_2} F = \partial_1^{n_1}\partial_2^{n_2-2}\Bigl(-\frac{c - y_2}{y_2}\,\partial_2 - \frac{1}{2}\,\frac{y_1}{y_2(y_2 - y_1)}\,(\partial_2 - \partial_1) + \frac{a}{y_2}\Bigr)F.$$
- The RHS becomes messy, but an important fact is that the number of differentiations with respect to $y_2$ is reduced by 1.
- We can keep reducing the number of differentiations as long as some variable carries two or more of them.
- This implies that all higher-order derivatives can be written as rational function combinations of the following 4 square-free mixed derivatives: $F(Y)$, $\partial_1 F(Y)$, $\partial_2 F(Y)$, $\partial_1\partial_2 F(Y)$.
- In algebraic terminology, let $K = \mathbf{C}(y_1, y_2)$ be the field of rational functions and let $R = K\langle\partial_1, \partial_2\rangle = \mathbf{C}(y_1, y_2)\langle\partial_1, \partial_2\rangle$ be the ring of differential operators.
- We take the graded lexicographic term order as the term order.
Theorem 1. $\{g_1, g_2\}$ is a Gröbner basis and $\{1, \partial_1, \partial_2, \partial_1\partial_2\}$ is the set of standard monomials. (This theorem holds for general dimension.)
- [Division algorithm] By repeating the differentiations, for each $n_1, n_2$ there exist rational functions $h^{(n_1,n_2)}_{00}, h^{(n_1,n_2)}_{10}, h^{(n_1,n_2)}_{01}, h^{(n_1,n_2)}_{11}$ in $y_1, y_2$ such that
$$\partial_1^{n_1}\partial_2^{n_2} F = h^{(n_1,n_2)}_{00} F + h^{(n_1,n_2)}_{10}\,\partial_1 F + h^{(n_1,n_2)}_{01}\,\partial_2 F + h^{(n_1,n_2)}_{11}\,\partial_1\partial_2 F.$$
- Hence we only keep $F(Y)$, $\partial_1 F(Y)$, $\partial_2 F(Y)$, $\partial_1\partial_2 F(Y)$ in memory. We can always compute higher-order derivatives from these 4 values. (For dimension $m$, we need to keep $2^m$ square-free mixed derivatives in memory.)
- Actually for $m = 2$ we only need $\partial_1\partial_2^2 F$:
$$\partial_1\partial_2^2 F = \partial_1\Bigl(-\frac{c - y_2}{y_2}\,\partial_2 - \frac{1}{2}\,\frac{y_1}{y_2(y_2 - y_1)}\,(\partial_2 - \partial_1) + \frac{a}{y_2}\Bigr)F$$
$$= \Bigl(-\frac{c - y_2}{y_2}\,\partial_1\partial_2 - \frac{1}{2}\,\frac{1}{(y_2 - y_1)^2}\,(\partial_2 - \partial_1) - \frac{1}{2}\,\frac{y_1}{y_2(y_2 - y_1)}\,(\partial_1\partial_2 - \partial_1^2) + \frac{a}{y_2}\,\partial_1\Bigr)F.$$
(By symmetry, $\partial_1^2\partial_2 F$ is obtained by the interchange $y_1 \leftrightarrow y_2$.)
- We further substitute the expression for $\partial_1^2 F$ (from the first equation) into the RHS.
- Result:
$$\partial_1\partial_2^2 F = \frac{a}{2y_2(y_2 - y_1)}\,F + \Bigl(\frac{3}{4}\,\frac{1}{(y_2 - y_1)^2} + \frac{a}{y_2} - \frac{c - y_1}{2y_2(y_2 - y_1)}\Bigr)\partial_1 F - \frac{3}{4}\,\frac{1}{(y_2 - y_1)^2}\,\partial_2 F - \Bigl(\frac{c - y_2}{y_2} + \frac{1}{2}\,\frac{y_1}{y_2(y_2 - y_1)}\Bigr)\partial_1\partial_2 F$$
$$= h^{(1,2)}_{00} F + h^{(1,2)}_{10}\,\partial_1 F + h^{(1,2)}_{01}\,\partial_2 F + h^{(1,2)}_{11}\,\partial_1\partial_2 F.$$
- Then as in the one-dimensional case, we can update:
$$F(y_1+\Delta y_1, y_2+\Delta y_2) \doteq F(y_1,y_2) + \Delta y_1\,\partial_1 F(y_1,y_2) + \Delta y_2\,\partial_2 F(y_1,y_2),$$
$$\partial_1 F(y_1+\Delta y_1, y_2+\Delta y_2) \doteq \partial_1 F(y_1,y_2) + \Delta y_1\,\partial_1^2 F(y_1,y_2) + \Delta y_2\,\partial_1\partial_2 F(y_1,y_2),$$
[$\partial_2 F$ is similar]
$$\partial_1\partial_2 F(y_1+\Delta y_1, y_2+\Delta y_2) \doteq \partial_1\partial_2 F(y_1,y_2) + \Delta y_1\,\partial_1^2\partial_2 F(y_1,y_2) + \Delta y_2\,\partial_1\partial_2^2 F(y_1,y_2).$$
- Let $\mathbf{F} = (F, \partial_1 F, \partial_2 F, \partial_1\partial_2 F)^t$.
- Our updating formula is
$$\mathbf{F}(Y+\Delta Y) \doteq \mathbf{F}(Y) + \Delta y_1 \times P_1(Y)\,\mathbf{F}(Y) + \Delta y_2 \times P_2(Y)\,\mathbf{F}(Y),$$
where $P_1, P_2$ are $4 \times 4$ matrices consisting of rational functions in $Y$, called the Pfaffian system.
- This simple form works very well for dimension two (a sketch follows).
- For general dimension $m$, the matrices in the Pfaffian system are $2^m \times 2^m$. The size is $2^m = 1024$ for $m = 10$.
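To make the dimension-two case concrete, here is a sketch that fills $P_1$ from the formulas above, obtains $P_2$ by the $y_1 \leftrightarrow y_2$ symmetry (which also swaps the $\partial_1 F$ and $\partial_2 F$ slots), and performs the Euler update along a straight segment. The initial vector at $Y_0$ is assumed to be given, e.g., from a truncated zonal polynomial series near the origin; its computation is a separate problem, discussed later in these slides.

```python
# HGM update for m = 2 with the basis (F, d1F, d2F, d1d2F).
import numpy as np

def P1(a, c, y1, y2):
    """Rows are d/dy1 of (F, d1F, d2F, d1d2F) in that same basis."""
    d = y1 - y2
    return np.array([
        [0.0, 1.0, 0.0, 0.0],                           # d/dy1 F = d1F
        [a / y1, -(c - y1) / y1 - 0.5 * y2 / (y1 * d),  # d1^2 F from g1 F = 0
         0.5 * y2 / (y1 * d), 0.0],
        [0.0, 0.0, 0.0, 1.0],                           # d/dy1 d2F = d1d2F
        [a / (2 * y1 * d), -0.75 / d**2,                # d1^2 d2 F = h^{(2,1)}
         0.75 / d**2 + a / y1 - (c - y2) / (2 * y1 * d),
         -(c - y1) / y1 - 0.5 * y2 / (y1 * d)],
    ])

def P2(a, c, y1, y2):
    """P1 with y1 <-> y2, conjugated by the swap of the d1F and d2F slots."""
    S = np.array([[1.0, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]])
    return S @ P1(a, c, y2, y1) @ S

def hgm_path(a, c, Y0, Y1, F0, steps=10000):
    """Euler update F <- F + dy1 P1 F + dy2 P2 F along the segment Y0 -> Y1."""
    y = np.array(Y0, dtype=float)
    dy = (np.array(Y1, dtype=float) - y) / steps
    Fvec = np.array(F0, dtype=float)
    for _ in range(steps):
        Fvec = Fvec + dy[0] * P1(a, c, *y) @ Fvec + dy[1] * P2(a, c, *y) @ Fvec
        y += dy
    return Fvec
```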
Pfaffian system for general dimension
- Theorem 1 on the non-diagonal region holds for general dimension.
- For the Pfaffian system, we should leave the recursive form as it is, without actually expanding it.
- For $J \subset \{1, \ldots, m\}$, denote the square-free derivative with respect to the variables in $J$ as $\partial_J = \prod_{j \in J} \partial_j$.
- We need $\partial_J\partial_i^2 F = \partial_i^2\partial_J F$ for $i \notin J$.
- Let $I = J \cup \{i\}$. Collect the square-free terms in $\partial_J g_i$ as
$$r(i, J; y) = -\Bigl((c - y_i)\,\partial_I - a\,\partial_J + \frac{1}{2}\sum_{k \notin I} \frac{y_k}{y_i - y_k}\,(\partial_I - \partial_J\partial_k) + \frac{1}{2}\sum_{k \in J} \frac{y_k}{y_i - y_k}\,\partial_I + \frac{1}{2}\sum_{k \in J} \frac{y_i}{(y_i - y_k)^2}\,(\partial_i\partial_{J\setminus\{k\}} - \partial_J)\Bigr).$$
- Separating the terms for which we need recursive differentiation in $\partial_i^2\partial_J F$, we have
$$y_i\,\partial_i^2\partial_J F = r(i, J; y)F + \frac{1}{2}\sum_{k \in J} \frac{1}{y_i - y_k}\,\bigl(y_k\,\partial_k^2\partial_{J\setminus\{k\}}\bigr)F.$$
- This form is suitable for efficient programming (done by N. Takayama).
- When we expand the recursion to the end, we obtain
$$y_i\,\partial_i^2\partial_J F = r(i, J; y)F + \frac{1}{2}\sum_{k_1 \in J} \frac{1}{y_i - y_{k_1}}\,r(k_1, J\setminus\{k_1\}; y)F + \frac{1}{4}\sum_{\substack{k_1, k_2 \in J\\ k_1, k_2\ \mathrm{distinct}}} \frac{1}{(y_i - y_{k_1})(y_{k_1} - y_{k_2})}\,r(k_2, J\setminus\{k_1, k_2\}; y)F + \cdots + \frac{1}{2^{|J|}}\sum_{\substack{k_1, \ldots, k_{|J|} \in J\\ \mathrm{distinct}}} \frac{1}{(y_i - y_{k_1})\cdots(y_{k_{|J|-1}} - y_{k_{|J|}})}\,r(k_{|J|}, \emptyset; y)F.$$
- This form is also useful for theoretical considerations of the Pfaffian system.
- The problem of initial values is also difficult in general dimension. N. Takayama extended the algorithm of Koev-Edelman (2006) to handle partial derivatives.
Numerical experiments
- Statisticians need good numerical performance (recall zonal polynomials).
- We were not sure whether it would work up to dimension $m = 10$.
- How can we check the numerical accuracy?
  - The initial value is taken close to zero. We can check the numerical convergence $\lim_{x\to\infty} \Pr(\ell_1 < x) = 1$.
  - The following easy bounds are available:
$$\Pr[\ell_1 < x \mid \mathrm{diag}(\sigma_1^2, \ldots, \sigma_1^2)] \le \Pr[\ell_1 < x \mid \mathrm{diag}(\sigma_1^2, \ldots, \sigma_m^2)] \le \Pr[\ell_1 < x \mid \mathrm{diag}(\sigma_1^2, 0, \ldots, 0)].$$
- Example: $m = 10$, d.f. $n = 12$, $\Sigma^{-1} = 2\,\mathrm{diag}(1, 2, \ldots, 10)$.
- For $x = 30$ the bounds are $(0.99866943,\ 0.99999998)$.
- On an Intel Core i7 machine, the computation of the initial value at $x_0 = 0.2$ takes 20 seconds. Then with step size 0.001 we solve the PDE up to $x = 30$, which takes 75 seconds. Output: $\Pr[\ell_1 < 30] = 0.999545$.
- This accuracy is somewhat amazing, if we consider that we updated a 1024-dimensional vector 30,000 times.
- [Figure: plot of the cumulative distribution function computed by HGM (legend "by hg"); horizontal axis 5-20, vertical axis 0.2-1.2.]
Summary
- The holonomic gradient method is practical if we implement it efficiently.
- Our work brought a totally new approach to a longstanding difficult problem in multivariate statistics.
- The holonomic gradient method is general and can be applied to many problems.
- We stand at the beginning of applications of D-module theory to statistics!