The Orlicz-Sobolev-Gauss Exponential Manifold. Giovanni Pistone



slide-1
SLIDE 1

IGAIA IV

Information Geometry and its Applications IV

The Orlicz-Sobolev-Gauss Exponential Manifold

Giovanni Pistone www.giannidiorestino.it Liblice June 13 2016

slide-2
SLIDE 2

My four parts

  • 1. Amari's Information Geometry when the state space is not finite and the model is not parametric
  • 2. An example: computing the Wasserstein distance
  • 3. Gauss-Orlicz-Sobolev model spaces
  • 4. Second order geometry

This talk is dedicated to Michel Métivier, my teacher in Rennes (1973-75)

slide-3
SLIDE 3

Part I

Amari's Information Geometry when the state space is not finite and the model is not parametric

slide-4
SLIDE 4

In IG the velocity is the score

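  • $\theta \mapsto p_\theta$ is a curve
  • The score $\theta \mapsto \frac{d}{d\theta} \log p_\theta$ is an estimating function because $E_\theta\left[\frac{d}{d\theta} \log p_\theta\right] = 0$
  • Fisher-Rao computation:

$$\frac{d}{d\theta} E_\theta[U] = \int U(x)\, \frac{d}{d\theta} p(x;\theta)\, \mu(dx) = \int_{p(x;\theta)>0} U(x)\, \frac{d}{d\theta} \log p(x;\theta)\; p(x;\theta)\, \mu(dx) = \left\langle U - E_\theta[U],\ \frac{d}{d\theta} \log p_\theta \right\rangle_\theta$$

  • $U - E_\theta[U]$ is the statistical gradient of $\theta \mapsto E_\theta[U]$.
  • Cf. recent work by Ay, Jost, Lê, Schwachhöfer on measure models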
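
The Fisher-Rao computation can be checked numerically on a finite sample space. The sketch below is only an illustration under stated assumptions: the four-point sample space, the reference probability `base`, the statistics `u` and `U`, and all helper names are hypothetical and do not come from the slides. It compares a finite-difference derivative of $\theta \mapsto E_\theta[U]$ with the pairing of $U - E_\theta[U]$ against the score.

```python
import math

def gibbs(theta, u, base):
    """Density proportional to exp(theta * u) * base on a finite set."""
    w = [b * math.exp(theta * ui) for ui, b in zip(u, base)]
    z = sum(w)
    return [wi / z for wi in w]

def expect(p, f):
    """Expectation of f under the probability vector p."""
    return sum(pi * fi for pi, fi in zip(p, f))

base = [0.1, 0.2, 0.3, 0.4]   # hypothetical reference probability
u = [1.0, -2.0, 0.5, 3.0]     # hypothetical sufficient statistic of the curve
U = [2.0, 0.0, -1.0, 4.0]     # random variable whose mean is differentiated
theta, eps = 0.3, 1e-6

p = gibbs(theta, u, base)
# Score of the one-parameter exponential curve: u - E_theta[u]
score = [ui - expect(p, u) for ui in u]
mean_score = expect(p, score)
# Right-hand side: <U - E_theta[U], score>_theta
mean_U = expect(p, U)
rhs = expect(p, [(Ui - mean_U) * si for Ui, si in zip(U, score)])
# Left-hand side: central finite difference of theta -> E_theta[U]
lhs = (expect(gibbs(theta + eps, u, base), U)
       - expect(gibbs(theta - eps, u, base), U)) / (2 * eps)
```

In this toy model the score has zero mean and the two sides agree up to discretization error, which is the content of the computation on the slide.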
slide-5
SLIDE 5

IG is the geometry of the statistical bundle

  • P is a set of probabilities on a given sample space (Ω, F)
  • For each $p \in \mathcal{P}$, $B_p \hookrightarrow L^1_0(p)$
  • A statistical bundle is $T\mathcal{P} = \{(p, U) \mid p \in \mathcal{P},\ U \in B_p\}$
  • We expect the fibers $B_p$ to be isomorphic and to express a tangent space at $p \in \mathcal{P}$
  • A chart at $p$: $\sigma_p \colon (q, V) \mapsto (s_p(q), \dot{s}_p(V)) \in B_p \times B_p$
  • S.-i. Amari and M. Kumon. Estimation in the presence of infinitely many nuisance parameters: geometry of estimating functions. Ann. Statist., 16(3):1044–1068, 1988
  • P. Gibilisco and G. Pistone. Connections on non-parametric statistical manifolds by Orlicz space geometry. IDAQP, 1(2):325–347, 1998
  • Cf. Otto, cf. Lê

slide-6
SLIDE 6

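Fibers: $B_p = L^{(\cosh-1)}_0(p)$

  • The exponential space $L^{(\cosh-1)}(p)$ and the mixture space $L^{(\cosh-1)_*}(p)$ are the Orlicz spaces respectively defined by the conjugate Young functions $(\cosh-1)(x) = \cosh x - 1$ and $(\cosh-1)_*(y) = \int_0^y \sinh^{-1}(t)\,dt$, with $xy \le (\cosh-1)(x) + (\cosh-1)_*(y)$
  • The closed unit balls of the exponential and mixture space are, respectively,

$$\left\{ f \,\middle|\, \|f\|_{L^{(\cosh-1)}(p)} \le 1 \right\} = \left\{ f \,\middle|\, \int (\cosh-1)(f(x))\, p(x)\,dx \le 1 \right\}$$

$$\left\{ g \,\middle|\, \|g\|_{L^{(\cosh-1)_*}(p)} \le 1 \right\} = \left\{ g \,\middle|\, \int (\cosh-1)_*(g(x))\, p(x)\,dx \le 1 \right\}$$

  • $B_p = \left\{ U \in L^{(\cosh-1)}(p) \,\middle|\, E_p[U] = 0 \right\}$ is the dual of ${}^*B_p = \left\{ V \in L^{(\cosh-1)_*}(p) \,\middle|\, E_p[V] = 0 \right\}$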
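
The conjugate pair of Young functions can be made concrete with a few lines of Python. This is a minimal sketch, not part of the talk: it uses the closed form $y \sinh^{-1}(y) - \sqrt{1+y^2} + 1$ for the integral of $\sinh^{-1}$, and the names `phi`, `phi_star` and the test grid are hypothetical. It checks the Young inequality on a grid and the equality case $y = \sinh x$.

```python
import math

def phi(x):
    """Young function (cosh - 1)(x)."""
    return math.cosh(x) - 1.0

def phi_star(y):
    """Convex conjugate of cosh - 1: the integral of asinh from 0 to |y|."""
    a = abs(y)
    return a * math.asinh(a) - math.sqrt(1.0 + a * a) + 1.0

grid = [k / 4.0 for k in range(-20, 21)]
# Young inequality: x*y <= phi(x) + phi_star(y) for all x, y
gaps = [phi(x) + phi_star(y) - x * y for x in grid for y in grid]
min_gap = min(gaps)
# Equality holds when y = sinh(x), i.e. y is the derivative of phi at x
x0 = 1.3
equality_gap = phi(x0) + phi_star(math.sinh(x0)) - x0 * math.sinh(x0)
```

The gap is nonnegative everywhere on the grid and vanishes along $y = \sinh x$, which is what makes the two Orlicz spaces a dual pair.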
slide-7
SLIDE 7

Bp is the space of scores

$B_p$ is exactly the space of scores of the Gibbs models through $p$

Theorem

  • 1. $U \in B_p$ iff $E_p[U] = 0$ and $E_p[(\cosh-1)(\rho U)] < \infty$ for some $\rho > 0$
  • 2. $U \in B_p$ iff $E_p[U] = 0$ and the moment generating function $\theta \mapsto E_p\left[e^{\theta U}\right]$ is finite in a neighbourhood of 0
  • 3. The Gibbs model $\theta \mapsto \frac{e^{\theta U}}{E_p[e^{\theta U}]}$ is defined in a neighbourhood of 0 and $\left.\frac{d}{d\theta} E_p\left[e^{\theta U}\right]\right|_{\theta=0} = 0$.
  • 4. The score of the Gibbs model at 0 is $\left.\frac{d}{d\theta} \log p_\theta\right|_{\theta=0} = U$

This set-up applies to the set $\mathcal{P}_>$ of strictly positive densities.

  • G. Pistone and C. Sempi. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Statist., 23(5):1543–1561, October 1995
slide-8
SLIDE 8

Isomorphism of the $L^{(\cosh-1)}(p)$ spaces

Theorem

$L^{(\cosh-1)}(p) = L^{(\cosh-1)}(q)$ as Banach spaces if $\theta \mapsto \int p^{1-\theta} q^{\theta}\, d\mu$ is finite on an open neighbourhood $I$ of $[0, 1]$. This condition is an equivalence relation, written $p \smile q$, and we denote by $\mathcal{E}(p)$ the class containing $p$.

Proof.

Assume $U \in L^{(\cosh-1)}(p)$ and consider the restrictions to the axes of the convex function

$$(s, \theta) \mapsto \int e^{sU} p^{1-\theta} q^{\theta}\, d\mu = \int \exp\left( sU + \theta \log\frac{q}{p} \right) p \, d\mu$$

  • G. Pistone and C. Sempi. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Statist., 23(5):1543–1561, October 1995
  • A. Cena. Geometric structures on the non-parametric statistical manifold. PhD thesis, Dottorato in Matematica, Università di Milano, 2002

slide-9
SLIDE 9

Portmanteau theorem

Theorem

The following statements are equivalent for $p, q \in \mathcal{P}_>$:

  • $q \in \mathcal{E}(p)$;
  • $p \smile q$;
  • $\mathcal{E}(p) = \mathcal{E}(q)$;
  • $L^{(\cosh-1)}(p) = L^{(\cosh-1)}(q)$;
  • $\log\left(\frac{q}{p}\right) \in L^{(\cosh-1)}(p) \cap L^{(\cosh-1)}(q)$;
  • $\frac{q}{p} \in L^{1+\epsilon}(p)$ and $\frac{p}{q} \in L^{1+\epsilon}(q)$ for some $\epsilon > 0$.
  • A. Cena and G. Pistone. Exponential statistical manifold. Ann. Inst. Statist. Math., 59(1):27–56, 2007
  • M. Santacroce, P. Siri, and B. Trivellato. New results on mixture and exponential models by Orlicz spaces. Bernoulli, 22(3):1431–1447, 2016

slide-10
SLIDE 10

Maximal exponential family

  • For each $p \in \mathcal{P}_>$, the moment generating functional is the positive lower-semi-continuous convex function $G_p \colon B_p \ni U \mapsto E_p\left[e^U\right]$, and the cumulant generating functional is the non-negative lower-semi-continuous convex function $K_p = \log G_p$.
  • The interior $S_p$ of the proper domain $\left\{ U \in L^{(\cosh-1)}(p) \,\middle|\, G_p(U) < +\infty \right\}$ is an open convex set containing the open unit ball of $L^{(\cosh-1)}(p)$.
  • For each $p \in \mathcal{P}_>$, the maximal exponential family at $p$ is $\mathcal{E}(p) = \left\{ e^{u - K_p(u)} \cdot p \,\middle|\, u \in S_p \right\}$.

From now on the maximal exponential family of interest is the family of the Maxwell density on $\mathbb{R}^n$, $\mathcal{E}(M)$

slide-11
SLIDE 11

e-chart at $p \in \mathcal{E}(M)$

  • For each $p \in \mathcal{E}(M)$ we define a chart $s_p \colon \mathcal{E}(M) \to S_p \subset B_p$.
  • The chart is defined by $s_p(q) = \log\frac{q}{p} + D(p \,\|\, q) = \log\frac{q}{p} - E_p\left[\log\frac{q}{p}\right]$
  • The inverse of the chart, $e_p = s_p^{-1} \colon S_p \to \mathcal{E}(M)$, is $e_p(U) = \exp(U - K_p(U)) \cdot p$
  • $\{s_p \mid p \in \mathcal{E}(M)\}$ is an affine atlas on $\mathcal{E}(M)$ that defines the exponential manifold
  • The information closure of any $\mathcal{E}(M)$ is $\mathcal{P}$. The reverse information closure of any $\mathcal{E}(M)$ is $\mathcal{P}_>$.
  • I. Csiszár and F. Matúš. Information projections revisited. IEEE Trans. Inform. Theory, 49(6):1474–1490, 2003
  • D. Imparato and B. Trivellato. Geometry of extended exponential models. In Algebraic and geometric methods in statistics, pages 307–326. Cambridge Univ. Press, Cambridge, 2010
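
On a finite sample space the chart and its inverse can be written out directly. The following sketch is an illustration only: the four-point probabilities `p` and `q` and the function names `s`, `e`, `K` are assumptions chosen to mirror the symbols on the slide. It checks that $s_p(q)$ is centered under $p$, that $e_p(s_p(q)) = q$, and that $K_p(s_p(q))$ equals the divergence $D(p \,\|\, q)$.

```python
import math

def expect(p, f):
    """Expectation of f under the probability vector p."""
    return sum(pi * fi for pi, fi in zip(p, f))

def s(p, q):
    """Chart s_p(q) = log(q/p) - E_p[log(q/p)]."""
    log_ratio = [math.log(qi / pi) for pi, qi in zip(p, q)]
    m = expect(p, log_ratio)
    return [x - m for x in log_ratio]

def K(p, U):
    """Cumulant functional K_p(U) = log E_p[exp U]."""
    return math.log(expect(p, [math.exp(x) for x in U]))

def e(p, U):
    """Inverse chart e_p(U) = exp(U - K_p(U)) * p."""
    k = K(p, U)
    return [math.exp(x - k) * pi for pi, x in zip(p, U)]

p = [0.1, 0.2, 0.3, 0.4]      # hypothetical reference density
q = [0.25, 0.25, 0.4, 0.1]    # hypothetical second density
U = s(p, q)
q_back = e(p, U)
# Kullback-Leibler divergence D(p||q) = E_p[log(p/q)]
kl = expect(p, [math.log(pi / qi) for pi, qi in zip(p, q)])
```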

slide-12
SLIDE 12

e-chart at $(p, U) \in T\mathcal{E}(M)$

  • A curve $t \mapsto p(t)$, $p(0) = p$, in the exponential manifold $\mathcal{E}(M)$ is expressed in the chart $s_p$ as $p(t) = e^{U(t) - K_p(U(t))} \cdot p$.
  • The expression of the velocity at $t = 0$ is $\dot{U}(0) = \left.\frac{d}{dt} \log p(t)\right|_{t=0}$
  • It follows that the exponential bundle $T\mathcal{E}(M) = \{(p, U) \mid p \in \mathcal{E}(M),\ U \in B_p\}$ is the expression of the tangent bundle of the exponential manifold
  • The transition map $s_{p_2} \circ e_{p_1} \colon S_{p_1} \to S_{p_2}$ is affine with derivative ${}^{e}\mathbb{U}_{p_1}^{p_2} \colon B_{p_1} \to B_{p_2}$ given by ${}^{e}\mathbb{U}_{p_1}^{p_2} U = U - E_{p_2}[U]$
  • We define an atlas of charts on $T\mathcal{E}(M)$ by $\sigma_p(q, V) = \left( s_p(q),\ {}^{e}\mathbb{U}_{q}^{p} V \right)$
  • G. Pistone. Nonparametric information geometry. In F. Nielsen and F. Barbaresco, editors, Geometric science of information, volume 8085 of Lecture Notes in Comput. Sci., pages 5–36. Springer, Heidelberg, 2013. First International Conference, GSI 2013, Paris, France, August 28-30, 2013, Proceedings

slide-13
SLIDE 13

Cumulant functional

  • The r-divergence $q \mapsto D(p \,\|\, q)$ is represented in the chart centered at $p$ by $D(p \,\|\, e_p(U)) = K_p(U) = \log E_p\left[e^U\right]$.
  • $K_p \colon B_p \to \mathbb{R} \cup \{+\infty\}$ is convex and its proper domain contains the open unit ball of $B_p$. It is infinitely Gâteaux-differentiable on the interior $S_p$ of its proper domain and analytic on the unit ball of $B_p$.
  • For all $V, V_1, V_2, V_3 \in B_p$ the first three derivatives, with $q = e_p(U)$, are:

$$d K_p(U)[V] = E_q[V], \qquad d^2 K_p(U)[V_1, V_2] = \operatorname{Cov}_q(V_1, V_2), \qquad d^3 K_p(U)[V_1, V_2, V_3] = \operatorname{Cov}_q(V_1, V_2, V_3)$$

  • G. Pistone and M. Rogantin. The exponential statistical manifold: mean parameters, orthogonality and space transformations. Bernoulli, 5(4):721–760, August 1999
  • A. Cena and G. Pistone. Exponential statistical manifold. Ann. Inst. Statist. Math., 59(1):27–56, 2007
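
The first two derivative formulas for the cumulant functional can be verified by finite differences on a finite sample space. This is a hedged sketch: the probability vector `p`, the directions `V1`, `V2`, the step `h` and every helper name are hypothetical choices made for the illustration, and $q = e_p(U)$ is the density at which the derivatives are evaluated.

```python
import math

def expect(p, f):
    return sum(pi * fi for pi, fi in zip(p, f))

def center(p, f):
    """Subtract the p-mean so that the result lies in B_p."""
    m = expect(p, f)
    return [x - m for x in f]

def K(p, U):
    """Cumulant functional K_p(U) = log E_p[exp U]."""
    return math.log(expect(p, [math.exp(x) for x in U]))

def shift(U, a, V1, b, V2):
    """Return U + a*V1 + b*V2 componentwise."""
    return [u + a * x + b * y for u, x, y in zip(U, V1, V2)]

p = [0.1, 0.2, 0.3, 0.4]
U = center(p, [0.3, -0.1, 0.2, -0.1])
V1 = center(p, [1.0, 0.0, -1.0, 2.0])
V2 = center(p, [0.5, -1.5, 1.0, 0.0])
k = K(p, U)
q = [math.exp(u - k) * pi for u, pi in zip(U, p)]   # q = e_p(U)

h = 1e-4
# First derivative in the direction V1 by central differences
dK = (K(p, shift(U, h, V1, 0.0, V2)) - K(p, shift(U, -h, V1, 0.0, V2))) / (2 * h)
# Mixed second derivative in the directions V1, V2
d2K = (K(p, shift(U, h, V1, h, V2)) - K(p, shift(U, h, V1, -h, V2))
       - K(p, shift(U, -h, V1, h, V2)) + K(p, shift(U, -h, V1, -h, V2))) / (4 * h * h)

mean_q = expect(q, V1)
cov_q = expect(q, [a * b for a, b in zip(V1, V2)]) - expect(q, V1) * expect(q, V2)
```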
slide-14
SLIDE 14

Pre-dual statistical bundle

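  • Recall that $L^{(\cosh-1)_*}(M)$ is the pre-dual of $L^{(\cosh-1)}(M)$
  • Define the pre-dual statistical bundle with fibers ${}^*B_p = \left\{ V \in L^{(\cosh-1)_*}(M) \,\middle|\, E_p[V] = 0 \right\}$.
  • Compute the adjoint of the transport ${}^{e}\mathbb{U}_{p}^{q}$. For $U \in B_p$ and $V \in {}^*B_q$,

$$\left\langle {}^{e}\mathbb{U}_{p}^{q} U, V \right\rangle_q = \langle U - E_q[U], V \rangle_q = E_q[UV] = E_p\left[ \frac{q}{p}\, UV \right] = \left\langle U, \frac{q}{p} V \right\rangle_p = \left\langle U, {}^{m}\mathbb{U}_{q}^{p} V \right\rangle_p$$

  • Define the charts on ${}^*T\mathcal{E}(M)$ by ${}^*\sigma_p(q, W) = \left( s_p(q),\ {}^{m}\mathbb{U}_{q}^{p} W \right)$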

slide-15
SLIDE 15

Statistical gradient

  • The score of the curve $t \mapsto p(t)$ is a curve in the statistical bundle $t \mapsto (p(t), Dp(t)) \in T\mathcal{E}(M)$ such that for all $X \in L^{(\cosh-1)_*}(M)$ it holds $\frac{d}{dt} E_{p(t)}[X] = \left\langle X - E_{p(t)}[X], Dp(t) \right\rangle_{p(t)}$
  • $Dp(t)$ is the expression in the exponential atlas of the velocity: $Dp(t) = \frac{\dot{p}(t)}{p(t)} = \frac{d}{dt} \log p(t)$
  • The statistical gradient of $F \colon \mathcal{E}(M) \to \mathbb{R}$ is a section of the pre-dual statistical bundle ${}^*T\mathcal{E}(M)$, $p \mapsto (p, \operatorname{grad} F(p)) \in {}^*T\mathcal{E}(M)$, such that for each regular curve $\frac{d}{dt} F(p(t)) = \left\langle \operatorname{grad} F(p(t)), Dp(t) \right\rangle_{p(t)}$
  • L. Malagò, M. Matteucci, and G. Pistone. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th workshop on Foundations of genetic algorithms, FOGA '11, pages 230–242, New York, NY, USA, 2011. ACM
  • G. Pistone. Examples of the application of nonparametric information geometry to statistical physics. Entropy, 15(10):4042–4065, 2013

slide-16
SLIDE 16

Part II An example: computing the Wasserstein distance

  • This is an example of the use of the formalism. There is considerable literature, e.g. F. Otto
  • Unpublished talk at the workshop Computational information geometry for image and signal processing, Sep 21st-25th, 2015 at ICMS, Edinburgh. Finite state space.
  • Unpublished work in progress with Luigi Malagò.
slide-17
SLIDE 17

Transport plan

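$(\mathbb{R}^{2n}, M_{2n}) = (\mathbb{R}^n, M_n) \otimes (\mathbb{R}^n, M_n)$ with projections $X$ and $Y$

  • The marginalization mapping $\mathcal{E}(M_{2n}) \ni \gamma \mapsto (\gamma_1, \gamma_2) \in \mathcal{E}(M_n) \times \mathcal{E}(M_n)$ has fibers $\Gamma(\mu_1, \mu_2) = \{\gamma \in \mathcal{E}(M_{2n}) \mid X_\#\gamma = \mu_1,\ Y_\#\gamma = \mu_2\}$ which are convex subsets
  • If $t \mapsto \gamma(t) \in \Gamma(\mu_1, \mu_2)$, then

$$0 = \frac{d}{dt} E_{\gamma(t)}[f \circ X] = \left\langle f \circ X - E_{\mu_1}[f], D\gamma(t) \right\rangle_{\gamma(t)} = \left\langle f \circ X - E_{\mu_1}[f], E_{\gamma(t)}(D\gamma(t) \mid X) \right\rangle_{\gamma(t)}$$

  • $E_{\gamma(t)}(D\gamma(t) \mid X) = 0$, $E_{\gamma(t)}(D\gamma(t) \mid Y) = 0$, $D\gamma(t) \in B_{\gamma(t)}$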

slide-18
SLIDE 18

Splitting of TΓ(µ1, µ2)

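  • Consider the ANOVA subspaces

$$B_{\gamma,1} = \{f \circ X \mid f \in B_{\gamma_1}\}, \qquad B_{\gamma,2} = \{f \circ Y \mid f \in B_{\gamma_2}\}, \qquad {}^*B_{\gamma,12} = (B_{\gamma,0} + B_{\gamma,1} + B_{\gamma,2})^{\perp}$$

  • For $U \in B_\gamma$, let $U = U_1 + U_2 + U_{12}$ be a splitting. Then

$$E_\gamma(U - (U_1 + U_2) \mid X) = 0, \qquad E_\gamma(U - (U_1 + U_2) \mid Y) = 0$$

  • or, in terms of transport, using ${}^{m}\mathbb{U}_{\gamma}^{M_{2n}} V = \frac{\gamma}{M_{2n}} V$,

$$E_{M_{2n}}\left( {}^{m}\mathbb{U}_{\gamma}^{M_{2n}} U \,\middle|\, X \right) - E_{M_{2n}}\left( \frac{\gamma}{M_{2n}} \,\middle|\, X \right) U_1 - E_{M_{2n}}\left( {}^{m}\mathbb{U}_{\gamma}^{M_{2n}} U_2 \,\middle|\, X \right) = 0$$

$$E_{M_{2n}}\left( {}^{m}\mathbb{U}_{\gamma}^{M_{2n}} U \,\middle|\, Y \right) - E_{M_{2n}}\left( {}^{m}\mathbb{U}_{\gamma}^{M_{2n}} U_1 \,\middle|\, Y \right) - E_{M_{2n}}\left( \frac{\gamma}{M_{2n}} \,\middle|\, Y \right) U_2 = 0$$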
slide-19
SLIDE 19

Gradient flow

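  • Given a cost function $w \colon \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, the expected cost function is $W \colon \mathcal{E}(M_{2n}) \ni \gamma \mapsto E_\gamma[w]$. Then the function $W$ restricted to the open plan $\Gamma(\mu_1, \mu_2) \cap \mathcal{E}(M_{2n})$ has a statistical gradient obtained by projecting the unconstrained gradient $w - E_\gamma[w]$ onto the space of interactions $B_{\gamma,12}$: $\operatorname{grad} W(\gamma) = w - E_\gamma[w] - w_{1,\gamma} - w_{2,\gamma}$
  • The gradient flow equation is $D\gamma(t) = -\left( w - E_\gamma[w] - w_{1,\gamma} - w_{2,\gamma} \right)$.
  • Any solution $t \mapsto \gamma(t)$ of the gradient flow converges to a measure $\gamma^* = \lim_{t\to\infty} \gamma(t)$ in $\Gamma(\mu_1, \mu_2)$ (but not in $\mathcal{E}(M_{2n})$) such that $E_{\gamma^*}[w] = \min \{E_\gamma[w] \mid \gamma \in \Gamma(\mu_1, \mu_2)\}$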

slide-20
SLIDE 20

Part III Gauss-Orlicz-Sobolev model spaces

  • M. R. Grasselli. Dual connections in nonparametric classical information geometry. Ann. Inst. Statist. Math., 62(5):873–896, 2010
  • B. Lods and G. Pistone. Information geometry formalism for the spatially homogeneous Boltzmann equation. Entropy, 17(6):4323–4363, 2015
  • D. Brigo and G. Pistone. Projection based dimensionality reduction for measure valued evolution equations in statistical manifolds. arXiv:1601.04189, 2016
  • D. Brigo and G. Pistone. Eigenfunctions based maximum likelihood estimation of the Fokker-Planck equation and Hellinger projection. Submitted, 2016
  • Luigi Montrucchio and G. Pistone. Unpublished working paper (2016) based on N. Newton's deformed logarithm
slide-21
SLIDE 21

Inclusions

  • 1. If $1 < a < \infty$, $L^\infty(M) \subset L^{(\cosh-1)}(M) \subset L^a(M) \subset L^{(\cosh-1)_*}(M) \subset L^1(M)$
  • 2. Local inclusions hold: if $1 \le a < \infty$ and $\Omega_R = \{x \in \mathbb{R}^n \mid |x| < R\}$, then $L^{(\cosh-1)}(M) \hookrightarrow L^a(\Omega_R)$ and $L^{(\cosh-1)_*}(M) \hookrightarrow L^1(\Omega_R)$
  • 3. The Orlicz space $L^{(\cosh-1)}(M)$ contains all functions $f \in C^2(\mathbb{R}^n; \mathbb{R})$ whose Hessian is uniformly bounded in operator norm. In particular, it contains all polynomials of degree up to 2 and, moreover, all functions which are bounded by such a polynomial.
  • 4. The Orlicz space $L^{(\cosh-1)_*}(M)$ contains all random variables $f \colon \mathbb{R}^n \to \mathbb{R}$ which are bounded by a polynomial, in particular, all polynomials.

slide-22
SLIDE 22

Pointwise density

The space $L^{(\cosh-1)}(M)$ is neither separable nor reflexive. However, we have the following monotone class theorem

  • For each $f \in L^{(\cosh-1)}(M)$ there exist a nonnegative function $h \in L^{(\cosh-1)}(M)$ and a sequence $f_n \in C_0(\mathbb{R}^n)$ with $|f_n| \le h$, $n = 1, 2, \dots$, such that $\lim_{n\to\infty} f_n = f$ a.s.
  • The space $C_0(\mathbb{R}^n)$ is strongly dense in $L^{(\cosh-1)_*}(M)$ and it is weakly*-dense in $L^{(\cosh-1)}(M)$.
  • For each $f \in L^{(\cosh-1)}(M)$ there exist a nonnegative function $h \in L^{(\cosh-1)}(M)$ and a sequence $\varphi_n \in C_0^\infty(\mathbb{R}^n)$ with $|\varphi_n| \le h$, $n = 1, 2, \dots$, such that $\lim_{n\to\infty} \varphi_n = f$ a.s.
  • The space $C_0^\infty(\mathbb{R}^n)$ is strongly dense in $L^{(\cosh-1)_*}(M)$ and it is weakly*-dense in $L^{(\cosh-1)}(M)$.

slide-23
SLIDE 23

Orlicz class

Definition

We define the exponential (Orlicz) class, $C^{(\cosh-1)}(M)$, to be the closure of $C_0(\mathbb{R}^n)$ in the exponential (Orlicz) space $L^{(\cosh-1)}(M)$.

Theorem

Assume $f \in L^{(\cosh-1)}(M)$ and write $f_R(x) = f(x)\,(|x| > R)$. The following conditions are equivalent:

  • 1. The real function $\rho \mapsto \int (\cosh-1)(\rho f(x))\, M(x)\,dx$ is finite for all $\rho > 0$.
  • 2. $f \in C^{(\cosh-1)}(M)$.
  • 3. $\lim_{R\to\infty} \|f_R\|_{L^{(\cosh-1)}(M)} = 0$.

For example $(x \mapsto x^2) \in L^{(\cosh-1)}(M) \setminus C^{(\cosh-1)}(M)$
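
The example of the quadratic function can be made explicit by a direct Gaussian integral. The worked computation below is a sketch under the assumption that $n = 1$ and that $M$ is the standard Gaussian density $M(x) = (2\pi)^{-1/2} e^{-x^2/2}$; it is not taken from the slides.

```latex
% Assumption: n = 1, M(x) = (2\pi)^{-1/2} e^{-x^2/2}, f(x) = x^2.
\[
E_M\!\left[e^{\rho X^2}\right]
  = \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\, e^{-(1-2\rho)x^2/2}\,dx
  = \begin{cases}
      (1-2\rho)^{-1/2}, & \rho < 1/2,\\
      +\infty,          & \rho \ge 1/2,
    \end{cases}
\]
\[
\int (\cosh-1)(\rho x^2)\,M(x)\,dx
  = \tfrac12\left[(1-2\rho)^{-1/2} + (1+2\rho)^{-1/2}\right] - 1
  \quad (0 < \rho < 1/2),
  \qquad
  = +\infty \quad (\rho \ge 1/2).
\]
```

The integral is finite for some $\rho > 0$, so the function belongs to the Orlicz space, but it is not finite for all $\rho > 0$, so condition 1 of the theorem fails and the function is outside the Orlicz class.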

slide-24
SLIDE 24

Translation models

  • We look for statistical models induced by the geometry of the state space. E.g. the $n$-dimensional model defined by the translation of $p = e^{U - K_M(U)} \cdot M \in \mathcal{E}(M)$:

$$p(x; h) = p(x - h) = e^{U(x-h) - K_M(U)}\, e^{h \cdot x - \frac{|h|^2}{2}} \cdot M$$

  • $E_M[U(X - h) + h \cdot X] = \int U(x - h)\, M(x)\,dx = \int U(x)\, e^{-h \cdot x - \frac{|h|^2}{2}} M(x)\,dx = E_M\left[ U e^{-h \cdot X - \frac{|h|^2}{2}} \right]$
  • $p(x; h) = p(x - h) = \exp\left( U_h - K_M(U_h) \right) \cdot M$ with $U_h = \tau_h U + h \cdot X - E_M[\tau_h U] \in B_M$ and $K_M(\tau_h U) = K_M(U) - \frac{1}{2} |h|^2$
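
The exponential factor in the translated density comes from completing the square in the Gaussian weight. The short derivation below assumes that the Maxwell density is the standard Gaussian $M(x) = (2\pi)^{-n/2} e^{-|x|^2/2}$; this normalization is an assumption, since the slides do not spell it out.

```latex
% Assumption: M(x) = (2\pi)^{-n/2} \exp(-|x|^2/2) on R^n.
\[
M(x-h) = (2\pi)^{-n/2} \exp\!\left(-\tfrac12 |x-h|^2\right)
       = (2\pi)^{-n/2} \exp\!\left(-\tfrac12 |x|^2 + h\cdot x - \tfrac12 |h|^2\right)
       = M(x)\, e^{h\cdot x - |h|^2/2},
\]
\[
p(x-h) = e^{U(x-h) - K_M(U)}\, M(x-h)
       = e^{U(x-h) - K_M(U)}\, e^{h\cdot x - |h|^2/2}\, M(x).
\]
```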

slide-25
SLIDE 25

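Translations in $L^{(\cosh-1)}(M)$

  • For each $h \in \mathbb{R}^n$, the mapping $f \mapsto \tau_h f$ is linear and bounded from $L^{(\cosh-1)}(M)$ to itself and $\|\tau_h f\|_{L^{(\cosh-1)}(M)} \le 2\, \|f\|_{L^{(\cosh-1)}(M)}$ if $|h| \le \sqrt{\log 2}$.
  • For each $f \in L^{(\cosh-1)}(M)$ and $h \in \mathbb{R}^n$ we have $\tau_h f \in L^{(\cosh-1)}(M)$. For all $g \in L^{(\cosh-1)_*}(M)$ we have

$$\langle \tau_h f, g \rangle_M = \langle f, \tau_h^* g \rangle_M, \qquad \tau_h^* g(x) = e^{-h \cdot x - \frac{|h|^2}{2}}\, \tau_{-h} g(x),$$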

and |h| √ 2 implies τ ∗

h gL(cosh −1)∗(M) 4 gL(cosh −1)∗(M).

Moreover, $h \mapsto \tau_h^* g$ is continuous in $L^{(\cosh-1)_*}(M)$.

  • If $f \in C^{(\cosh-1)}(M)$ then $\tau_h f \in C^{(\cosh-1)}(M)$, $h \in \mathbb{R}^n$, and the mapping $\mathbb{R}^n \ni h \mapsto \tau_h f$ is continuous in $L^{(\cosh-1)}(M)$.

slide-26
SLIDE 26

Translation by a probability

  • Let $\tau_\mu f(x) = \int f(x - y)\, \mu(dy) = f * \mu(x)$ for $\mu \in \mathcal{P}_e$, namely $\int e^{\frac{1}{2}|h|^2}\, \mu(dh) < \infty$.
  • The mapping $f \mapsto \tau_\mu f$ is linear and bounded from $L^{(\cosh-1)}(M)$ to itself. If, moreover, $\int e^{|h|^2/2}\, \mu(dh) \le \sqrt{2}$, then its norm is bounded by 2.
  • If $f \in C^{(\cosh-1)}(M)$ then $\tau_\mu f \in C^{(\cosh-1)}(M)$. The mapping $\mathcal{P}_e \ni \mu \mapsto \tau_\mu f$ is continuous at $\delta_0$ from the weak convergence to the $L^{(\cosh-1)}(M)$ norm.

slide-27
SLIDE 27

Mollifiers

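  • Let $\omega_\lambda \in C_0^\infty(\mathbb{R}^n)$ be a given family of mollifiers. Let $f \in C^{(\cosh-1)}(M)$. For each $\lambda > 0$ the function $\tau_{\omega_\lambda} f(x) = \int f(x - y)\, \lambda^{-n} \omega(\lambda^{-1} y)\,dy = f * \omega_\lambda(x)$ belongs to $C^\infty(\mathbb{R}^n)$ and $\lim_{\lambda \to 0} f * \omega_\lambda = f$ in $L^{(\cosh-1)}(M)$
  • For each $f \in L^{(\cosh-1)}(M)$ there exist a sequence $f_n \in C_0^\infty(\mathbb{R}^n)$, $n \in \mathbb{N}$, and a bound $h \in L^{(\cosh-1)}(M)$ such that $|f_n(x)| \le h(x)$ and $\lim_{n\to\infty} f_n(x) = f(x)$ for all $x$.
  • For each $f \in C^{(\cosh-1)}(M)$ there exists a sequence $f_n \in C_0^\infty(\mathbb{R}^n)$, $n \in \mathbb{N}$, with $\lim_{n\to\infty} \|f_n - f\|_{L^{(\cosh-1)}(M)} = 0$.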

slide-28
SLIDE 28

Differentiable densities

Definition

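The Orlicz-Sobolev (O-S) spaces with weight $M$ are

$$W^{1,(\cosh-1)}(M) = \left\{ f \in L^{(\cosh-1)}(M) \,\middle|\, \partial_j f \in L^{(\cosh-1)}(M),\ j = 1, \dots, n \right\}$$

$$W^{1,(\cosh-1)_*}(M) = \left\{ f \in L^{(\cosh-1)_*}(M) \,\middle|\, \partial_j f \in L^{(\cosh-1)_*}(M),\ j = 1, \dots, n \right\}$$

where $\partial_j$ is the derivative in the sense of distributions. The spaces $W^{1,(\cosh-1)}(M)$ and $W^{1,(\cosh-1)_*}(M)$ are both Banach spaces for the graph norms.

  • $\langle \partial_j f, \varphi \rangle_M = -\langle f, \partial_j \varphi - X_j \varphi \rangle_M = \langle f, (X_j - \partial_j)\varphi \rangle_M$, that is, $\langle \partial_j f, \varphi \rangle_M = \langle f, \delta_j \varphi \rangle_M$ with $\delta_j \varphi = (X_j - \partial_j)\varphi$
  • Cf. Malliavin Calculus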
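
The Gaussian integration-by-parts formula can be tested numerically in dimension one. The sketch below is an illustration only: it assumes $M$ is the standard Gaussian density on the real line, and the test functions $f(x) = \sin x$ and $\varphi(x) = 1 + x^2$, the integration range and the helper name `gauss_mean` are hypothetical choices. For this pair both sides equal $e^{-1/2}$.

```python
import math

def gauss_mean(g, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoidal approximation of E_M[g(X)], M the standard Gaussian density on R."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        w = 0.5 if k in (0, steps) else 1.0
        total += w * g(x) * math.exp(-0.5 * x * x)
    return total * dx / math.sqrt(2.0 * math.pi)

# f(x) = sin x with derivative cos x; phi(x) = 1 + x^2 with derivative 2x
# delta phi = (X - d/dx) phi = x * (1 + x^2) - 2x = x^3 - x
lhs = gauss_mean(lambda x: math.cos(x) * (1.0 + x * x))      # <f', phi>_M
rhs = gauss_mean(lambda x: math.sin(x) * (x ** 3 - x))       # <f, delta phi>_M
exact = math.exp(-0.5)
```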
slide-29
SLIDE 29

Inclusions

Theorem

Let $R > 0$ and let $\Omega_R$ denote the open sphere of radius $R$.

  • 1. We have the embeddings $W^{1,(\cosh-1)}(\mathbb{R}^n) \subset W^{1,(\cosh-1)}(M) \subset W^{1,(\cosh-1)}(\Omega_R) \subset W^{1,p}(\Omega_R)$, $p \ge 1$.
  • 2. We have the embeddings $W^{1,p}(\mathbb{R}^n) \subset W^{1,(\cosh-1)_*}(\mathbb{R}^n) \subset W^{1,(\cosh-1)_*}(M) \subset W^{1,(\cosh-1)_*}(\Omega_R) \subset W^{1,1}(\Omega_R)$, $p > 1$.
  • 3. Each $u \in W^{1,(\cosh-1)}(M)$ is a.s. continuous and Hölder of all orders on each $\Omega_R$.
slide-30
SLIDE 30

Directional derivative

Theorem

  • For each $f \in W^{1,(\cosh-1)}(M)$, each unit vector $h \in S^n$, and all $t \in \mathbb{R}$, it holds

$$f(x + th) - f(x) = t \int_0^1 \sum_{j=1}^n \partial_j f(x + sth)\, h_j \, ds.$$

Moreover, $|t| \le \sqrt{2}$ implies $\|f(x + th) - f(x)\|_{L^{(\cosh-1)}(M)} \le 2t\, \|\nabla f\|_{L^{(\cosh-1)}(M)}$; in particular, $\lim_{t \to 0} \|f(x + th) - f(x)\|_{L^{(\cosh-1)}(M)} = 0$ uniformly in $h$.

  • For each $f \in W^{1,(\cosh-1)}(M)$ and each $g \in L^{(\cosh-1)_*}(M)$, the mapping $h \mapsto \langle \tau_h f, g \rangle_M$ is differentiable. Conversely, if $f \in L^{(\cosh-1)}(M)$ and $h \mapsto \tau_h f$ is weakly differentiable, then $f \in W^{1,(\cosh-1)}(M)$
  • If $\partial_j f \in C^{(\cosh-1)}(M)$, $j = 1, \dots, n$, then strong differentiability in $L^{(\cosh-1)}(M)$ holds.

slide-31
SLIDE 31

Orlicz-Sobolev class

Definition

The Orlicz-Sobolev-Gauss exponential class is

$$C^{1,(\cosh-1)}(M) = \left\{ f \in W^{1,(\cosh-1)}(M) \,\middle|\, f, \partial_j f \in C^{(\cosh-1)}(M),\ j = 1, \dots, n \right\}$$

  • The translation model is qualified as $p_h = \tau_h p$, $p = e^{U - K_M(U)} \cdot M$, $U \in S_M \cap C^{1,(\cosh-1)}(M)$
  • The score in the direction $j$ is $(x_j - t) - \partial_j U(x - t e_j)$:

$$\frac{\frac{d}{dt} p(x - t e_j)}{p(x - t e_j)} = \frac{\frac{d}{dt}\left[ e^{U(x - t e_j) - K_M(U)} M(x - t e_j) \right]}{p(x - t e_j)} = \frac{\left[ -\partial_j U(x - t e_j) + (x_j - t) \right] e^{U(x - t e_j) - K_M(U)} M(x - t e_j)}{p(x - t e_j)} = (x_j - t) - \partial_j U(x - t e_j)$$

slide-32
SLIDE 32

Calculus in $C^{1,(\cosh-1)}(M)$

Theorem

  • For each $f \in C^{1,(\cosh-1)}(M)$ the sequence $f * \omega_n$, $n \in \mathbb{N}$, belongs to $C^\infty(\mathbb{R}^n) \cap W^{1,(\cosh-1)}(M)$. Precisely, for each $n$ and $j = 1, \dots, n$, we have the equality $\partial_j(f * \omega_n) = (\partial_j f) * \omega_n$; the sequences $f * \omega_n$, respectively $\partial_j f * \omega_n$, $j = 1, \dots, n$, converge to $f$, respectively $\partial_j f$, $j = 1, \dots, n$, strongly in $W^{1,(\cosh-1)_*}(M)$.
  • The same statement is true if $f \in W^{1,(\cosh-1)_*}(M)$.
  • Let $f \in C^{(\cosh-1)}(M)$ and $g \in W^{1,(\cosh-1)_*}(M)$ be given. Then $fg \in W^{1,1}(M)$ and $\partial_j(fg) = \partial_j f\, g + f\, \partial_j g$.
  • Let $F \in C^1(\mathbb{R})$ be given with $\|F'\|_\infty < \infty$. For each $U \in C^{(\cosh-1)}(M)$, we have $F \circ U,\ F' \circ U\, \partial_j U \in C^{(\cosh-1)}(M)$ and $\partial_j F \circ U = F' \circ U\, \partial_j U$; in particular $F(U) \in C^{1,(\cosh-1)}(M)$.

slide-33
SLIDE 33

Exponential family modeled on $C^{1,(\cosh-1)}(M)$

  • Restrict the exponential family $\mathcal{E}(M)$ to $C^{1,(\cosh-1)}(M)$: $\mathcal{E}_1(M) = \left\{ e^{U - K_M(U)} \cdot M \,\middle|\, U \in C^{1,(\cosh-1)}(M) \cap S_M \right\}$
  • Because of $C^{1,(\cosh-1)}(M) \hookrightarrow L^{(\cosh-1)}(M)$ the domain $C^{1,(\cosh-1)}(M) \cap S_M$ is open and the cumulant functional $K_M \colon C^{1,(\cosh-1)}(M) \cap S_M \to \mathbb{R}$ remains convex and differentiable.
  • Every feature of the exponential manifold carries over to this case. Define $B_1(p) = B_p \cap C^{1,(\cosh-1)}(M)$ to be the models for the tangent spaces of $\mathcal{E}_1(M)$. The e-transport acts on these spaces,

$${}^{e}\mathbb{U}_{f}^{g} \colon B_1(f) \ni U \mapsto U - E_g[U] \in B_1(g),$$

so that we can define the statistical bundle to be $T\mathcal{E}_1(M) = \{(g, V) \mid g \in \mathcal{E}_1(M),\ V \in B_1(g)\}$ and take as charts the restrictions of the charts defined on $T\mathcal{E}(M)$.

slide-34
SLIDE 34

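Application: Hyvärinen divergence

  • For each $f, g \in \mathcal{E}_1(M)$ the Hyvärinen divergence is $\mathrm{DH}(g|f) = E_g\left[ |\nabla \log f - \nabla \log g|^2 \right]$.
  • The expression in the chart centered at $M$ is $\mathrm{DH}_M(v|u) = \mathrm{DH}(e_M(v)|e_M(u)) = E_M\left[ |\nabla u - \nabla v|^2\, e^{v - K_M(v)} \right]$, where $f = e_M(u)$, $g = e_M(v)$.
  • $\operatorname{grad}(f \mapsto \mathrm{DH}(g|f)) = -2 \nabla \log g \cdot \nabla \log \frac{f}{g} - 2 \Delta \log \frac{f}{g}$
  • $\operatorname{grad}(g \mapsto \mathrm{DH}(f|g)) = 2 \nabla \log g \cdot \nabla \log \frac{f}{g} + 2 \Delta \log \frac{f}{g} + \mathrm{DH}(f|g)$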
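
A one-dimensional example may help fix the definition of the Hyvärinen divergence. The sketch below assumes that $f$ and $g$ are unit-variance Gaussian densities, i.e. translations of the standard Gaussian $M$; the means `m_f`, `m_g`, the integration range and the function name are hypothetical. In this case $\nabla \log f - \nabla \log g$ is the constant $m_f - m_g$, so the divergence should equal $(m_f - m_g)^2$.

```python
import math

def hyvarinen_unit_gauss(m_g, m_f, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoidal approximation of DH(g|f) = E_g[(d/dx log f - d/dx log g)^2]
    for g = N(m_g, 1) and f = N(m_f, 1) on the real line."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        grad_log_f = -(x - m_f)
        grad_log_g = -(x - m_g)
        g = math.exp(-0.5 * (x - m_g) ** 2) / math.sqrt(2.0 * math.pi)
        w = 0.5 if k in (0, steps) else 1.0
        total += w * (grad_log_f - grad_log_g) ** 2 * g
    return total * dx

m_g, m_f = 0.5, -1.0
numeric = hyvarinen_unit_gauss(m_g, m_f)
closed_form = (m_f - m_g) ** 2
```

Unlike the Kullback-Leibler divergence, the computation only involves gradients of log-densities, so normalizing constants never appear.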

slide-35
SLIDE 35

Example: Elliptic operator

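  • The elliptic operator as a section of the tangent bundle is

$$A p(x) = p(x)^{-1} \sum_{i,j=1}^{d} \frac{\partial}{\partial x_i} \left( a_{ij}(x) \frac{\partial}{\partial x_j} p(x) \right), \qquad x \in \mathbb{R}^d.$$

  • The expression in the statistical bundle is

$$U \mapsto A_M(U) = e^{U - K_M(U)} A\left( e^{U - K_M(U)} \cdot M \right) = \frac{e^{U - K_M(U)}}{e^{U - K_M(U)} \cdot M}\, A\left( e^{U - K_M(U)} \cdot M \right) = M^{-1} L^*\left( e^{U - K_M(U)} \cdot M \right)$$

  • Computation gives

$$M^{-1} L^*\left( e^{U - K_M(U)} \cdot M \right) = e^{U - K_M(U)} \sum_{i,j=1}^{d} \frac{\partial}{\partial x_i} \left( a_{ij}(x) \left( \frac{\partial}{\partial x_j} U(x) - x_j \right) \right) p(x) + e^{U - K_M(U)} \sum_{i,j=1}^{d} a_{ij}(x) \left( \frac{\partial}{\partial x_i} U(x) - x_i \right) \left( \frac{\partial}{\partial x_j} U(x) - x_j \right) p(x).$$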
slide-36
SLIDE 36

Part IV Second order geometry

  • L. Malagò and G. Pistone. Combinatorial optimization with information geometry: Newton method. Entropy, 16:4260–4289, 2014
  • L. Malagò and G. Pistone. Second-order optimization over the multivariate Gaussian distribution. In F. Barbaresco and F. Nielsen, editors, Geometric Science of Information, number 9389 in LNCS, pages 349–358. Springer, 2015

slide-37
SLIDE 37

Parallel transport

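  • e-transport: ${}^{e}\mathbb{U}_{p}^{q} \colon B_p \ni U \mapsto U - E_q[U] \in B_q$.
  • m-transport: for each $V \in {}^*B_q$, ${}^{m}\mathbb{U}_{q}^{p} \colon {}^*B_q \ni V \mapsto \frac{q}{p} V \in {}^*B_p$

Properties

  • $\left\langle U, {}^{m}\mathbb{U}_{q}^{p} V \right\rangle_p = \left\langle {}^{e}\mathbb{U}_{p}^{q} U, V \right\rangle_q$
  • ${}^{e}\mathbb{U}_{q}^{r}\, {}^{e}\mathbb{U}_{p}^{q} = {}^{e}\mathbb{U}_{p}^{r}$
  • ${}^{m}\mathbb{U}_{q}^{r}\, {}^{m}\mathbb{U}_{p}^{q} = {}^{m}\mathbb{U}_{p}^{r}$
  • $\left\langle {}^{e}\mathbb{U}_{p}^{q} U, {}^{m}\mathbb{U}_{p}^{q} V \right\rangle_q = \langle U, V \rangle_p$
  • $d^2 K_p(q)[U, V] = \left\langle {}^{e}\mathbb{U}_{p}^{q} U, {}^{e}\mathbb{U}_{p}^{q} V \right\rangle_q = \left\langle {}^{m}\mathbb{U}_{q}^{p}\, {}^{e}\mathbb{U}_{p}^{q} U, V \right\rangle_p$.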
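
The duality between the two transports and the composition rule for the e-transport are easy to check on a finite sample space. The sketch below is only an illustration: the three probability vectors and the random variables are hypothetical, and the function names `e_transport`, `m_transport`, `pair` are assumptions chosen to mirror the notation of the slide.

```python
def expect(p, f):
    return sum(pi * fi for pi, fi in zip(p, f))

def center(p, f):
    """Subtract the p-mean."""
    m = expect(p, f)
    return [x - m for x in f]

def e_transport(q, U):
    """e-transport to q: U -> U - E_q[U]."""
    return center(q, U)

def m_transport(p, q, V):
    """m-transport from q to p: V -> (q/p) V."""
    return [qi / pi * v for pi, qi, v in zip(p, q, V)]

def pair(p, A, B):
    """Pairing <A, B>_p = E_p[A B]."""
    return expect(p, [a * b for a, b in zip(A, B)])

p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.4, 0.1]
r = [0.4, 0.1, 0.2, 0.3]
U = center(p, [1.0, -2.0, 0.5, 3.0])   # element of B_p
V = center(q, [0.3, 1.0, -1.0, 2.0])   # element of *B_q

# Duality: <U, mU_q^p V>_p = <eU_p^q U, V>_q
lhs = pair(p, U, m_transport(p, q, V))
rhs = pair(q, e_transport(q, U), V)

# Composition: transporting p -> q -> r equals transporting p -> r
two_steps = e_transport(r, e_transport(q, U))
one_step = e_transport(r, U)
```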
slide-38
SLIDE 38

Statistical exponential manifold and bundles

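  • The exponential manifold is the maximal exponential family $\mathcal{E}$ with the affine atlas of global charts $(s_p \colon p \in \mathcal{E})$, $s_p(q) = \log\frac{q}{p} - E_p\left[ \log\frac{q}{p} \right]$.
  • The statistical exponential bundle $T\mathcal{E}(M)$ is the manifold defined on the set $\{(p, V) \mid p \in \mathcal{E},\ V \in B_p\}$ by the affine atlas of global charts

$$\sigma_p \colon (q, V) \mapsto \left( s_p(q),\ {}^{e}\mathbb{U}_{q}^{p} V \right) \in B_p \times B_p, \qquad p \in \mathcal{E}$$

  • The statistical predual bundle ${}^*T\mathcal{E}(M)$ is the manifold defined on the set $\{(p, W) \mid p \in \mathcal{E},\ W \in {}^*B_p\}$ by the affine atlas of global charts

$${}^*\sigma_p \colon (q, W) \mapsto \left( s_p(q),\ {}^{m}\mathbb{U}_{q}^{p} W \right) \in B_p \times {}^*B_p, \qquad p \in \mathcal{E}$$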

slide-39
SLIDE 39

Score and statistical gradient

Definition

$t \mapsto p(t)$ is a curve in $\mathcal{E}(p)$ and $F \colon \mathcal{E}(p) \to \mathbb{R}$.

  • The score of the curve $t \mapsto p(t)$ is a curve in the statistical bundle $t \mapsto (p(t), Dp(t)) \in S\mathcal{E}(p)$ such that for all $X \in L^{(\cosh-1)_*}(p)$ it holds $\frac{d}{dt} E_{p(t)}[X] = \left\langle X - E_{p(t)}[X], Dp(t) \right\rangle_{p(t)}$
  • The statistical gradient of $F$ is a section of the statistical bundle, $p \mapsto (p, \operatorname{grad} F(p))$, such that for each regular curve $t \mapsto p(t)$ it holds $\frac{d}{dt} F(p(t)) = \left\langle \operatorname{grad} F(p(t)), Dp(t) \right\rangle_{p(t)}$

Everything applies if the tangent space is in $C^{1,(\cosh-1)}(p)$, but technical details have to be checked, e.g. for ${}^{m}\mathbb{U}_{p}^{q}$

slide-40
SLIDE 40

Taylor formula in the Statistical Bundle

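  • For a curve $t \mapsto p(t) \in \mathcal{E}(M)$ connecting $p = p(0)$ to $q = p(1)$ and a function $F \colon \mathcal{E}(M) \to \mathbb{R}$ the Taylor formula is

$$F(q) = F(p) + \left.\frac{d}{dt} F(p(t))\right|_{t=0} + \frac{1}{2} \left.\frac{d^2}{dt^2} F(p(t))\right|_{t=0} + R_2(F, p(\cdot))$$

with

$$R_2(F, p(\cdot)) = \int_0^1 (1 - t) \left( \frac{d^2}{dt^2} F(p(t)) - \left.\frac{d^2}{dt^2} F(p(t))\right|_{t=0} \right) dt$$

  • The first derivative is computed with the statistical gradient and the score:

$$F(q) = F(p) + \left\langle \operatorname{grad} F(p(0)), Dp(0) \right\rangle_p + \frac{1}{2} \left.\frac{d}{dt} \left\langle \operatorname{grad} F(p(t)), Dp(t) \right\rangle_{p(t)}\right|_{t=0} + R_2(F, p(\cdot)),$$

  • where $\left.\frac{d}{dt} \left\langle \operatorname{grad} F(p(t)), Dp(t) \right\rangle_{p(t)}\right|_{t=0}$ depends on $p(\cdot)$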
slide-41
SLIDE 41

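Accelerations

  • $p(t) = e^{U(t) - K_p(U(t))} \cdot p$, $U(t) \in B_p$.
  • Let us define the acceleration at $t$ of a curve $t \mapsto p(t) \in \mathcal{E}(M)$. The velocity is defined to be $t \mapsto (p(t), Dp(t)) = \left( p(t), \frac{d}{dt} \log p(t) \right) \in T\mathcal{E}(M)$
  • The exponential acceleration is $t \mapsto (p(t), {}^{e}D^2 p(t)) \in T\mathcal{E}(M)$ with

$${}^{e}D^2 p(t) = \left.\frac{d}{ds}\, {}^{e}\mathbb{U}_{p(s)}^{p(t)} Dp(s) \right|_{s=t} = \ddot{U}(t) - E_{p(t)}\left[ \ddot{U}(t) \right]$$

  • The mixture acceleration is

$${}^{m}D^2 p(t) = \left.\frac{d}{ds}\, {}^{m}\mathbb{U}_{p(s)}^{p(t)} Dp(s) \right|_{s=t} = \frac{\ddot{p}(t)}{p(t)}$$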

slide-42
SLIDE 42

Computation

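$$\begin{aligned}
{}^{m}D^2 p(t) &= \left.\frac{d}{ds}\, {}^{m}\mathbb{U}_{p(s)}^{p(t)} Dp(s) \right|_{s=t} \\
&= \left.\frac{d}{ds}\, {}^{m}\mathbb{U}_{p(s)}^{p(t)} \frac{d}{ds}\left( U(s) - K_p(U(s)) \right) \right|_{s=t} \\
&= \left.\frac{d}{ds}\, {}^{m}\mathbb{U}_{p(s)}^{p(t)} \left( \dot{U}(s) - dK_p(U(s))[\dot{U}(s)] \right) \right|_{s=t} \\
&= \left.\frac{d}{ds}\, \frac{p(s)}{p(t)} \left( \dot{U}(s) - E_{p(t)}\left[ \dot{U}(s) \right] \right) \right|_{s=t} \\
&= \left. \frac{\dot{p}(s)}{p(t)} \left( \dot{U}(s) - E_{p(t)}\left[ \dot{U}(s) \right] \right) + \frac{p(s)}{p(t)} \left( \ddot{U}(s) - E_{p(t)}\left[ \ddot{U}(s) \right] \right) \right|_{s=t} \\
&= \frac{\dot{p}(t)}{p(t)} \left( \dot{U}(t) - E_{p(t)}\left[ \dot{U}(t) \right] \right) + \frac{p(t)}{p(t)} \left( \ddot{U}(t) - E_{p(t)}\left[ \ddot{U}(t) \right] \right) \\
&= \left( \dot{U}(t) - E_{p(t)}\left[ \dot{U}(t) \right] \right)^2 + \left( \ddot{U}(t) - E_{p(t)}\left[ \ddot{U}(t) \right] \right) \\
&= \frac{\ddot{p}(t)}{p(t)}
\end{aligned}$$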

slide-43
SLIDE 43

Autoparallel curves

  • The exponential acceleration is ${}^{e}D^2 p(t) = \ddot{U}(t) - E_{p(t)}\left[ \ddot{U}(t) \right]$
  • The e-acceleration of a 1d-exponential family $t \mapsto e^{tU - K_M(tU)} \cdot M$ is zero, because $\frac{d^2}{dt^2}(tU) = 0$.
  • The mixture acceleration is ${}^{m}D^2 p(t) = \frac{\ddot{p}(t)}{p(t)}$
  • The m-acceleration of a 1d-mixture family $t \mapsto p(0) + t(p(1) - p(0))$ is zero:

$$\frac{\frac{d^2}{dt^2}\left[ p(0) + t(p(1) - p(0)) \right]}{p(0) + t(p(1) - p(0))} = 0$$
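
Both vanishing accelerations can be observed numerically on a finite sample space. The sketch below is an illustration under stated assumptions: the endpoints `p0`, `p1`, the direction `U`, the time `t` and the step `h` are hypothetical, the m-acceleration is computed as the second derivative of the density divided by the density, and the e-acceleration is computed as the second derivative of the log-density centered at $p(t)$, which is what the e-transport definition reduces to.

```python
import math

def expect(p, f):
    return sum(pi * fi for pi, fi in zip(p, f))

def exp_curve(t, p, U):
    """One-dimensional exponential family through p in the direction U."""
    w = [pi * math.exp(t * ui) for pi, ui in zip(p, U)]
    z = sum(w)
    return [wi / z for wi in w]

def mix_curve(t, p0, p1):
    """Mixture segment p0 + t (p1 - p0)."""
    return [a + t * (b - a) for a, b in zip(p0, p1)]

def second_diff(curve, t, h):
    """Central second difference of a vector-valued curve."""
    lo, mid, hi = curve(t - h), curve(t), curve(t + h)
    return [(a - 2.0 * b + c) / h ** 2 for a, b, c in zip(lo, mid, hi)]

p0 = [0.1, 0.2, 0.3, 0.4]
p1 = [0.4, 0.3, 0.2, 0.1]
U = [1.0, -1.0, 0.5, 0.0]
t, h = 0.4, 1e-3

# m-acceleration along the mixture segment: second derivative of p over p
pm = mix_curve(t, p0, p1)
m_acc = [d / x for d, x in
         zip(second_diff(lambda s: mix_curve(s, p0, p1), t, h), pm)]

# e-acceleration along the exponential family: centred second derivative of log p
pe = exp_curve(t, p0, U)
dd_log = second_diff(lambda s: [math.log(x) for x in exp_curve(s, p0, U)], t, h)
mean_dd = expect(pe, dd_log)
e_acc = [d - mean_dd for d in dd_log]
```

Swapping the roles (m-acceleration along the exponential curve, or e-acceleration along the mixture segment) gives nonzero values in general, which is why each curve is autoparallel for only one of the two connections.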

slide-44
SLIDE 44

Taylor’s formulæ

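The computation of $\left.\frac{d}{dt} \left\langle \operatorname{grad} F(p(t)), Dp(t) \right\rangle_{p(t)}\right|_{t=0}$ reduces to one term by choosing an autoparallel segment connecting $p$ and $q$

  • If $t \mapsto p(t)$ is the mixture geodesic connecting $p = p(0)$ to $q = p(1)$,

$$F(q) = F(p) + \left\langle \operatorname{grad} F(p), Dp(0) \right\rangle_p + \frac{1}{2} \left\langle {}^{e}\mathrm{Hess}_{Dp(0)} F(p), Dp(0) \right\rangle_p + R_2^{+}(p, q)$$

  • If $t \mapsto p(t)$ is the exponential geodesic connecting $p = p(0)$ to $q = p(1)$,

$$F(q) = F(p) + \left\langle \operatorname{grad} F(p), Dp(0) \right\rangle_p + \frac{1}{2} \left\langle {}^{m}\mathrm{Hess}_{Dp(0)} F(p), Dp(0) \right\rangle_p + R_2^{-}(p, q)$$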