IGAIA IV
Information Geometry and its Applications IV
The Orlicz-Sobolev-Gauss Exponential Manifold
Giovanni Pistone www.giannidiorestino.it Liblice June 13 2016
My four parts
1. Amari's Information Geometry when the state space is not
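$\frac{d}{d\theta}\log p_\theta$ is an estimating function because
\[ E_\theta\left[\frac{d}{d\theta}\log p_\theta\right] = 0 . \]
For a statistic $U$ that does not depend on $\theta$,
\[ \frac{d}{d\theta}E_\theta[U] = \int U(x)\,\frac{d}{d\theta}p(x;\theta)\,\mu(dx) = \int U(x)\,\frac{d}{d\theta}\log p(x;\theta)\;p(x;\theta)\,\mu(dx) = E_\theta\left[U\,\frac{d}{d\theta}\log p_\theta\right], \]
where the second equality uses $p(x;\theta) > 0$.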
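The two score identities can be checked numerically. The following is a minimal sketch on a hypothetical five-point state space with a one-parameter exponential family; the statistic `T`, the reference density `p0` and the test function `U` are illustrative assumptions, not objects from the talk.

```python
import numpy as np

# Hypothetical finite state space and exponential family
# p(x; theta) = exp(theta * T(x)) * p0(x) / Z(theta)  (illustrative choice).
x = np.arange(5)
T = x.astype(float)          # sufficient statistic (assumption)
p0 = np.full(5, 0.2)         # reference density w.r.t. counting measure
U = np.cos(x)                # a statistic that does not depend on theta

def density(theta):
    w = np.exp(theta * T) * p0
    return w / w.sum()

def score(theta):
    # d/dtheta log p(x; theta) = T(x) - E_theta[T] for this family
    p = density(theta)
    return T - np.sum(T * p)

theta, eps = 0.3, 1e-6
p = density(theta)

# E_theta[score] should vanish.
mean_score = np.sum(score(theta) * p)

# d/dtheta E_theta[U] (central difference) versus E_theta[U * score].
lhs = (np.sum(U * density(theta + eps)) - np.sum(U * density(theta - eps))) / (2 * eps)
rhs = np.sum(U * score(theta) * p)

print(mean_score, lhs, rhs)
```

The derivative on the left is approximated by a central difference, so agreement is expected only up to discretisation error.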
e, Schwachh¨
$\to L^1_0(p)$
TP = {(p, U)|p ∈ P, U ∈ Bp}
space at p ∈ P
sp(V )) ∈ Bp × Bp
IDAQP, 1(2):325–347, 1998
$L^{(\cosh-1)}(p)$ and $L^{(\cosh-1)_*}(p)$ are the Orlicz spaces respectively defined by the conjugate Young functions $(\cosh-1)(x) = \cosh x - 1$ and $(\cosh-1)_*(y)$, which satisfy the Young inequality
\[ xy \le (\cosh-1)(x) + (\cosh-1)_*(y) . \]
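As an illustration of the conjugate pair, the sketch below uses the closed form $(\cosh-1)_*(y) = y\,\operatorname{arsinh}(y) - \sqrt{1+y^2} + 1$, obtained by maximising $xy - \cosh x + 1$ at $\sinh x = y$. It compares that expression with a brute-force supremum on a grid and checks the Young inequality; the grid sizes are arbitrary assumptions.

```python
import numpy as np

def phi(x):
    """Young function (cosh - 1)(x)."""
    return np.cosh(x) - 1.0

def phi_star(y):
    """Closed-form convex conjugate of cosh - 1."""
    return y * np.arcsinh(y) - np.sqrt(1.0 + y * y) + 1.0

def phi_star_numeric(y, grid=np.linspace(-20.0, 20.0, 400001)):
    """Brute-force sup_x (x*y - phi(x)) on a fixed grid (illustrative)."""
    return np.max(grid * y - phi(grid))

# Young inequality: phi(x) + phi_star(y) - x*y >= 0 on a test grid.
xs = np.linspace(-5.0, 5.0, 201)
ys = np.linspace(-5.0, 5.0, 201)
X, Y = np.meshgrid(xs, ys)
min_gap = (phi(X) + phi_star(Y) - X * Y).min()

# Closed form versus numerical conjugate at a few sample points.
conj_err = max(abs(phi_star(y) - phi_star_numeric(y)) for y in (0.0, 0.5, 2.0, -3.0))

print(min_gap, conj_err)
```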
respectively,
∗Bp =
Bp is exactly the space of scores of Gibbs model through p
Theorem
$\alpha \mapsto E_p\left[e^{\alpha U}\right]$ is finite in a neighbourhood of 0
eθU Ep[eθU] is defined in a neighborhood of 0 and d dθEp
d dθ log pθ
This set-up applies to the set P> of strictly positive densities.
measures equivalent to a given one.
Theorem
L(cosh −1) (p) = L(cosh −1) (q) as Banach spaces if θ →
finite on an open neighbourhood I of [0, 1], i.e. It is an equivalence relation p ⌣ q and we denote by E (p) the class containing p.
Proof.
Assume U ∈ L(cosh −1) (p) and consider the restrictions to the axes of the convex function (s, θ) →
p
PhD thesis, Dottorato in Matematica, Università di Milano, 2002
Theorem
The following statements are equivalent for p, q ∈ P>:
p
$\frac{q}{p} \in L^{1+\epsilon}(p)$ and $\frac{p}{q} \in L^{1+\epsilon}(q)$ for some $\epsilon > 0$.
Bernoulli, 22(3):1431–1447, 2016
lower-semi-continuous convex function $G_p \colon B_p \ni U \mapsto E_p\left[e^U\right]$
and
semicontinuous convex function Kp = log Gp.
Sp =
E (p) =
From now on the maximal exponential family of interest is the family of the Maxwell density on Rn, E (M)
$s_p \colon q \mapsto \log\frac{q}{p} - E_p\left[\log\frac{q}{p}\right]$
The inverse $e_p = s_p^{-1} \colon S_p \to E(M)$ is $e_p(U) = \exp\left(U - K_p(U)\right) \cdot p$
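A finite-state sketch of the chart and its inverse may help fix ideas. It assumes the standard expressions $s_p(q) = \log\frac{q}{p} - E_p[\log\frac{q}{p}]$ and $e_p(U) = \exp(U - K_p(U))\cdot p$ with $K_p(U) = \log E_p[e^U]$; the six-point state space and the random densities are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def expect(p, f):
    """Expectation of f under the density p on a finite state space."""
    return float(np.sum(p * f))

def K(p, U):
    """Cumulant functional K_p(U) = log E_p[exp(U)]."""
    return np.log(expect(p, np.exp(U)))

def e_chart(p, U):
    """Inverse chart e_p(U) = exp(U - K_p(U)) * p."""
    return np.exp(U - K(p, U)) * p

def s_chart(p, q):
    """Chart s_p(q) = log(q/p) - E_p[log(q/p)]."""
    L = np.log(q / p)
    return L - expect(p, L)

# Two hypothetical strictly positive densities on six points.
p = rng.dirichlet(np.ones(6))
q = rng.dirichlet(np.ones(6))

U = s_chart(p, q)                 # coordinate of q in the chart centred at p
q_back = e_chart(p, U)            # should reproduce q
kl = expect(p, np.log(p / q))     # D(p || q), expected to equal K_p(U)

print(np.max(np.abs(q_back - q)), expect(p, U), K(p, U) - kl)
```

In finite dimensions every strictly positive density lies in the same maximal exponential family, so the chart is globally defined here; the integrability questions the talk addresses do not show up in this toy setting.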
exponential manifold
information closure of any E (M) is P>.
I. Csiszár and F. Matúš
IEEE Trans. Inform. Theory, 49(6):1474–1490, 2003
In Algebraic and geometric methods in statistics, pages 307–326. Cambridge Univ. Press, Cambridge, 2010
expressed in the chart $s_p$ as $p(t) = e^{U(t) - K_p(U(t))} \cdot p$.
U(0) =
d dt log p(t)
TE (M) = {(p, U)|p ∈ E (M) , U ∈ Bp} is the expression of the tangent bundle of the exponential manifold
${}^e\mathbb{U}_{p_1}^{p_2} \colon B_{p_1} \to B_{p_2}$ given by ${}^e\mathbb{U}_{p_1}^{p_2} U = U - E_{p_2}[U]$
$\sigma_p(q, V) = \left(s_p(q),\, {}^e\mathbb{U}_q^p V\right)$
In F. Nielsen and F. Barbaresco, editors, Geometric science of information, volume 8085 of Lecture Notes in Comput. Sci., pages 5–36. Springer, Heidelberg, 2013. First International Conference, GSI 2013 Paris, France, August 28-30, 2013 Proceedings
at $p$ by $D(p \,\|\, e_p(U)) = K_p(U) = \log E_p\left[e^U\right]$.
the open unit ball of $B_p$. It is infinitely Gâteaux-differentiable on the interior $S_p$ of its proper domain and analytic on the unit ball of $B_p$.
\[ dK_p(U)[V] = E_q[V] \]
\[ d^2K_p(U)[V_1, V_2] = \operatorname{Cov}_q(V_1, V_2) \]
\[ d^3K_p(U)[V_1, V_2, V_3] = \operatorname{Cov}_q(V_1, V_2, V_3) \]
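The derivative formulas can be probed with finite differences on a finite state space. The sketch below takes $q = e_p(U)$ and, for simplicity, differentiates along a single direction $V_1 = V_2 = V_3 = V$, so the third derivative is the third central moment of $V$ under $q$; the dimension, step size and random inputs are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
p = rng.dirichlet(np.ones(n))            # hypothetical reference density

def center(f):
    """Subtract the p-mean so that f belongs to B_p."""
    return f - np.sum(p * f)

U = center(rng.normal(size=n))
V = center(rng.normal(size=n))

def K(W):
    """Cumulant functional K_p(W) = log E_p[exp(W)]."""
    return np.log(np.sum(p * np.exp(W)))

q = np.exp(U - K(U)) * p                 # q = e_p(U)

# Finite-difference derivatives of t -> K_p(U + t V) at t = 0.
f = lambda t: K(U + t * V)
h = 1e-2
d1_fd = (f(h) - f(-h)) / (2 * h)
d2_fd = (f(h) - 2 * f(0.0) + f(-h)) / h**2
d3_fd = (f(2 * h) - 2 * f(h) + 2 * f(-h) - f(-2 * h)) / (2 * h**3)

# Moments under q predicted by the formulas.
Vc = V - np.sum(q * V)
d1 = np.sum(q * V)          # E_q[V]
d2 = np.sum(q * Vc**2)      # Cov_q(V, V)
d3 = np.sum(q * Vc**3)      # third joint cumulant of (V, V, V)

print(d1_fd - d1, d2_fd - d2, d3_fd - d3)
```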
space transformations. Bernoulli, 5(4):721–760, August 1999
∗Bp =
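For $U \in B_p$ and $V \in {}^*B_q$,
\[ \left\langle {}^e\mathbb{U}_p^q U, V\right\rangle_q = E_p\left[\frac{q}{p}\,UV\right] = \left\langle U, \frac{q}{p}V\right\rangle_p = \left\langle U, {}^m\mathbb{U}_q^p V\right\rangle_p \]
\[ {}^*\sigma_p(q, W) = \left(s_p(q),\, {}^m\mathbb{U}_q^p W\right) \]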
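$t \mapsto (p(t), Dp(t)) \in TE(M)$ such that for all $X \in L^{(\cosh-1)_*}(M)$ it holds
\[ \frac{d}{dt}E_{p(t)}[X] = \left\langle X - E_{p(t)}[X],\, Dp(t)\right\rangle_{p(t)} \]
\[ Dp(t) = \frac{\dot p(t)}{p(t)} = \frac{d}{dt}\log p(t) \]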
pre-dual statistical bundle ${}^*TE(M)$, $p \mapsto (p, \operatorname{grad} F(p)) \in {}^*TE(p)$ such that for each regular curve $\frac{d}{dt}F(p(t)) = \left\langle \operatorname{grad} F(p(t)),\, Dp(t)\right\rangle_{p(t)}$
based on the exponential family. In Proceedings of the 11th workshop on Foundations of genetic algorithms, FOGA ’11, pages 230–242, New York, NY, USA, 2011. ACM
Entropy, 15(10):4042–4065, 2013
21st–25th, 2015 at ICMS, Edinburgh. Finite state space.
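$(\mathbb{R}^{2n}, M_{2n}) = (\mathbb{R}^n, M_n) \otimes (\mathbb{R}^n, M_n)$ with projections $X$ and $Y$
$E(M_{2n}) \ni \gamma \mapsto (\gamma_1, \gamma_2) \in E(M_n) \times E(M_n)$ has fibers $\Gamma(\mu_1, \mu_2) = \{\gamma \in E(M_{2n}) \mid X_\#\gamma = \mu_1,\ Y_\#\gamma = \mu_2\}$ which are convex subsets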
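\[ 0 = \frac{d}{dt}E_{\gamma(t)}[f \circ X] = \left\langle f \circ X - E_{\mu_1}[f],\, D\gamma(t)\right\rangle_{\gamma(t)} \]
\[ E_{\gamma(t)}\left(D\gamma(t) \mid Y\right) = 0, \qquad D\gamma(t) \in B_{\gamma(t)} \]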
Bγ,1 = {f ◦ X|f ∈ Bγ1} , Bγ,2 = {f ◦ Y |f ∈ Bγ2} ,
∗Bγ,12 = (Bγ,0 + Bγ,1 + Bγ,2)⊥
Eγ (U − (U1 + U2)|X) = 0 Eγ (U − (U1 + U2)|Y ) = 0
EM2n mUM2n
γ
U
γ M2n
mUM2n
γ
U2
EM2n mUM2n
γ
U
mUM2n
γ
U1
γ M2n
is $W \colon E(M_{2n}) \ni \gamma \mapsto E_\gamma[w]$. Then the function $W$ restricted to the open plan $\Gamma(\mu_1, \mu_2) \cap E(M_{2n})$ has statistical gradient obtained by projecting the unconstrained gradient $w - E_\gamma[w]$ onto the space of interactions $B_{\gamma,12}$: $\operatorname{grad} W \colon \gamma \mapsto w - E_\gamma[w] - w_{1,\gamma} - w_{2,\gamma}$
Dγ(t) = − (w − Eγ [w] − w1,γ − w2,γ) .
γ∗ = limt→∞ γ(t), in Γ(µ1, µ2) (but not in E (M2n)) such that Eγ∗ [w] = min {Eγ [w]|γ ∈ Γ(µ1, µ2)}
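A finite-state analogue of this gradient flow can be sketched as follows. This is not the Gaussian setting of the talk: the state space, the margins `mu1`, `mu2`, the cost `w` and the explicit Euler scheme are all illustrative assumptions. The interaction part of `w` is computed as the residual of a least-squares projection in $L^2(\gamma)$ onto additive functions $a(x) + b(y)$, so that its conditional expectations given $X$ and given $Y$ vanish.

```python
import numpy as np

# Hypothetical margins and cost on a 3 x 3 product state space.
mu1 = np.array([0.2, 0.3, 0.5])
mu2 = np.array([0.4, 0.4, 0.2])
n1, n2 = len(mu1), len(mu2)
xs, ys = np.arange(n1), np.arange(n2)
w = (xs[:, None] - ys[None, :]) ** 2.0     # illustrative cost w(x, y)

# Indicator design matrix spanning the additive functions a(x) + b(y).
A = np.zeros((n1 * n2, n1 + n2))
for i in range(n1):
    for j in range(n2):
        A[i * n2 + j, i] = 1.0
        A[i * n2 + j, n1 + j] = 1.0

def interaction_part(gamma):
    """Residual of w after L2(gamma)-projection onto functions a(x) + b(y)."""
    sw = np.sqrt(gamma.ravel())
    coef, *_ = np.linalg.lstsq(A * sw[:, None], w.ravel() * sw, rcond=None)
    return w - (A @ coef).reshape(n1, n2)

# Explicit Euler steps for d/dt log gamma = -(interaction part of w),
# started from the independent coupling.
gamma = np.outer(mu1, mu2)
costs = [np.sum(gamma * w)]
for _ in range(300):
    r = interaction_part(gamma)
    dt = min(0.05, 0.5 / max(np.abs(r).max(), 1e-12))   # keeps gamma positive
    gamma = gamma * (1.0 - dt * r)
    costs.append(np.sum(gamma * w))

print(costs[0], costs[-1])
print(gamma.sum(axis=1), gamma.sum(axis=0))
```

Because the residual is orthogonal to every function of $x$ alone and of $y$ alone, each Euler step leaves both margins unchanged, and the expected cost decreases by the step size times the squared $L^2(\gamma)$ norm of the residual.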
equation. Entropy, 17(6):4323–4363, 2015
in statistical manifolds. arXiv:1601.04189, 2016
equation and Hellinger projection. submitted, 2016
L∞(M) ⊂ L(cosh −1) (M) ⊂ La(M) ⊂ L(cosh −1)∗ (M) ⊂ L1(M)
$L^{(\cosh-1)}(M) \hookrightarrow L^a(\Omega_R)$, $L^{(\cosh-1)_*}(M) \hookrightarrow L^1(\Omega_R)$
whose Hessian is uniformly bounded in operator norm. In particular, it contains all polynomials of degree up to 2 and, moreover, all functions which are bounded by such a polynomial.
f : Rd → R which are bounded by a polynomial, in particular, all polynomials.
The space L(cosh −1) (M) is not separable nor reflexive. However, we have the following monotone class theorem
$h \in L^{(\cosh-1)}(M)$ and a sequence $f_n \in C_0(\mathbb{R}^n)$ with $|f_n| \le h$, $n = 1, 2, \dots$, such that $\lim_{n\to\infty} f_n = f$ a.s.
weakly∗-dense in L(cosh −1)(M).
$h \in L^{(\cosh-1)}(M)$ and a sequence $\phi_n \in C_0^\infty(\mathbb{R}^n)$ with $|\phi_n| \le h$, $n = 1, 2, \dots$, such that $\lim_{n\to\infty} \phi_n = f$ a.s.
$C_0^\infty(\mathbb{R}^n)$ is strongly dense in $L^{(\cosh-1)_*}(M)$ and it is weakly*-dense in $L^{(\cosh-1)}(M)$.
Definition
We define the exponential (Orlicz) class, C (cosh −1) (M), to be the closure
Theorem
Assume f ∈ L(cosh −1) (M) and write fR(x) = f (x)(|x| > R). The following conditions are equivalent:
ρ > 0.
For example $(x \mapsto x^2) \in L^{(\cosh-1)}(M) \setminus C^{(\cosh-1)}(M)$
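The example can be made concrete in dimension one, assuming $M$ is the standard Gaussian density. There $E_M[e^{\alpha X^2}] = (1-2\alpha)^{-1/2}$ for $\alpha < 1/2$, so $E_M[(\cosh-1)(\alpha X^2)]$ is finite for small $\alpha$ but not for every $\alpha$. The sketch below compares a truncated Riemann sum with the closed form at $\alpha = 0.25$ and shows the truncated integrals growing with the radius at $\alpha = 0.6$; the radii and step size are arbitrary choices.

```python
import numpy as np

def gauss_expect(fun, radius, step=1e-3):
    """Riemann-sum approximation of E_M[fun(X)] truncated to [-radius, radius]."""
    x = np.arange(-radius, radius + step, step)
    M = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(fun(x) * M) * step)

def orlicz_integral(alpha, radius):
    """Truncated value of E_M[(cosh - 1)(alpha * X^2)]."""
    return gauss_expect(lambda x: np.cosh(alpha * x**2) - 1.0, radius)

def closed_form(alpha):
    """Exact value for |alpha| < 1/2, from the Gaussian moment generating function."""
    return 0.5 * ((1 - 2 * alpha) ** -0.5 + (1 + 2 * alpha) ** -0.5) - 1.0

# alpha = 0.25 < 1/2: the integral is finite and matches the closed form.
inside = orlicz_integral(0.25, radius=12.0)
exact = closed_form(0.25)

# alpha = 0.6 > 1/2: truncated integrals keep growing with the radius.
outside = [orlicz_integral(0.6, radius=R) for R in (5.0, 10.0, 20.0)]

print(inside, exact, outside)
```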
\[ p = e^{U - K_M(U)} \cdot M \in E(M), \qquad p(x; h) = p(x - h) = e^{U(x-h) - K_M(U)}\, e^{h \cdot x - \frac{|h|^2}{2}} \cdot M \]
2
M(x)dx = EM
2
with Uh = τhU + h · X − EM [τhU] ∈ BM and KM(τhU) = KM(U) − 1
2 |h|2
$L^{(\cosh-1)}(M)$ to itself and $\|\tau_h f\|_{L^{(\cosh-1)}(M)} \le 2\,\|f\|_{L^{(\cosh-1)}(M)}$ if $|h| \le \sqrt{\log 2}$.
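$\tau_h f \in L^{(\cosh-1)}(M)$. For all $g \in L^{(\cosh-1)_*}(M)$ we have $\langle \tau_h f, g\rangle_M = \langle f, \tau_h^* g\rangle_M$,
\[ \tau_h^* g(x) = e^{-h \cdot x - \frac{|h|^2}{2}}\, \tau_{-h} g(x) , \]
and $|h| \le \sqrt{2}$ implies $\|\tau_h^* g\|_{L^{(\cosh-1)_*}(M)} \le 4\,\|g\|_{L^{(\cosh-1)_*}(M)}$.
Moreover, $h \mapsto \tau_h^* g$ is continuous in $L^{(\cosh-1)_*}(M)$.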
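The adjoint formula follows from the change of variable $y = x - h$ together with $M(y + h) = e^{-h\cdot y - |h|^2/2}M(y)$. A one-dimensional numerical sketch, with $M$ the standard Gaussian density and hypothetical test functions `f` and `g`, is:

```python
import numpy as np

step = 1e-3
x = np.arange(-15.0, 15.0 + step, step)      # truncated integration grid

def M(x):
    """Standard Gaussian density (one-dimensional case, an assumption here)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def pairing(u, v):
    """Riemann-sum approximation of <u, v>_M = E_M[u v]."""
    return float(np.sum(u(x) * v(x) * M(x)) * step)

# Hypothetical test functions and shift.
f = lambda x: np.sin(x) + x**2
g = lambda x: 1.0 / (1.0 + x**2)
h = 0.7

tau_h_f = lambda x: f(x - h)                                  # (tau_h f)(x) = f(x - h)
tau_star_g = lambda x: np.exp(-h * x - 0.5 * h**2) * g(x + h) # e^{-hx - h^2/2} (tau_{-h} g)(x)

lhs = pairing(tau_h_f, g)
rhs = pairing(f, tau_star_g)

# Density identity M(x - h) = exp(h x - h^2 / 2) M(x).
density_gap = np.max(np.abs(M(x - h) - np.exp(h * x - 0.5 * h**2) * M(x)))

print(lhs, rhs, density_gap)
```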
(M) then $\tau_h f \in C^{(\cosh-1)}(M)$, $h \in \mathbb{R}^n$, and the mapping $\mathbb{R}^n \ni h \mapsto \tau_h f$ is continuous in $L^{(\cosh-1)}(M)$.
τµf (x) =
for µ ∈ Pe, namely EM
1 2 |h|2
< ∞.
√ 2, then its norm is bounded by 2.
(M) then $\tau_\mu f \in C^{(\cosh-1)}(M)$. The mapping $P \ni \mu \mapsto \tau_\mu f$ is continuous at $\delta_0$ from the weak convergence to the $L^{(\cosh-1)}(M)$ norm.
0 (Rn). Let
f ∈ C (cosh −1) (M). For each λ > 0 the function τωλf (x) =
belongs to C ∞(Rn) and limλ→0 f ∗ ωλ = f in L(cosh −1) (M)
0 (Rn),
$n \in \mathbb{N}$, and a bound $h \in L^{(\cosh-1)}(M)$ such that $|f_n(x)| \le h(x)$ and $\lim_{n\to\infty} f_n(x) = f(x)$ for all $x$.
(M) there exists a sequence $f_n$ in $C_0^\infty(\mathbb{R}^n)$, $n \in \mathbb{N}$, with $\lim_{n\to\infty} \|f_n - f\|_{L^{(\cosh-1)}(M)} = 0$.
Definition
The Orlicz-Sobolev (O-S) spaces with weight M are W 1,(cosh −1) (M) =
W 1,(cosh −1) (M) and W 1,(cosh −1)∗ (M) are both Banach spaces for the graph norms.
\[ \langle \partial_j f, \phi\rangle_M = \langle f, \delta_j \phi\rangle_M, \qquad \delta_j \phi = (X_j - \partial_j)\phi \]
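The Gaussian integration-by-parts identity can be checked in dimension one with Gauss-Hermite quadrature for the weight $e^{-x^2/2}$ (`numpy.polynomial.hermite_e.hermegauss`). The functions `f` and `phi` below are hypothetical smooth examples, not taken from the talk.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature nodes and weights for the weight exp(-x^2 / 2),
# renormalised so that the weights integrate against the N(0, 1) density.
nodes, weights = hermegauss(60)
weights = weights / np.sqrt(2.0 * np.pi)

def E_M(values):
    """Expectation under the standard Gaussian via quadrature."""
    return float(np.sum(weights * values))

# Hypothetical smooth functions and their derivatives.
f = lambda x: np.sin(x) + x**2
df = lambda x: np.cos(x) + 2 * x
phi = lambda x: x + x**2
dphi = lambda x: 1 + 2 * x

def delta(phi, dphi):
    """Divergence operator: (delta phi)(x) = x * phi(x) - phi'(x)."""
    return lambda x: x * phi(x) - dphi(x)

lhs = E_M(df(nodes) * phi(nodes))                 # <f', phi>_M
rhs = E_M(f(nodes) * delta(phi, dphi)(nodes))     # <f, delta phi>_M

print(lhs, rhs)
```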
Theorem
Let R > 0 and let ΩR denote the open sphere of radius R.
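$W^{1,(\cosh-1)}(\mathbb{R}^n) \subset W^{1,(\cosh-1)}(M) \subset W^{1,(\cosh-1)}(\Omega_R) \subset W^{1,p}(\Omega_R)$, $p \ge 1$.
$W^{1,p}(\mathbb{R}^n) \subset W^{1,(\cosh-1)_*}(\mathbb{R}^n) \subset W^{1,(\cosh-1)_*}(M) \subset W^{1,(\cosh-1)_*}(\Omega_R) \subset W^{1,1}(\Omega_R)$, $p > 1$.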
Theorem
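$t \in \mathbb{R}$, it holds
\[ f(x + th) - f(x) = t \int_0^1 \sum_{j=1}^n \partial_j f(x + sth)\, h_j \, ds . \]
Moreover, $|t| \le \sqrt{2}$ implies $\|f(x + th) - f(x)\|_{L^{(\cosh-1)}(M)} \le 2t\, \|\nabla f\|_{L^{(\cosh-1)}(M)}$; in particular, $\lim_{t\to 0} \|f(x + th) - f(x)\|_{L^{(\cosh-1)}(M)} = 0$ uniformly in $h$.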
mapping $h \mapsto \langle \tau_h f, g\rangle_M$ is differentiable. Vice versa, if $f \in L^{(\cosh-1)}(M)$ and $h \mapsto \tau_h f$ is weakly differentiable, then $f \in W^{1,(\cosh-1)}(M)$
(M), j = 1, . . . , n, then strong differentiability in L(cosh −1) (M) holds.
Definition
The Orlicz-Sobolev-Gauss exponential class is C 1,(cosh −1) (M) =
(M) , j = 1, . . . , n
ph = τhp, p = eU−KM(U) · M, U ∈ SM ∩ C 1,(cosh −1) (M)
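\[
\begin{aligned}
\frac{\frac{d}{dt}p(x - te_j)}{p(x - te_j)}
&= \frac{\frac{d}{dt}\left(e^{U(x - te_j) - K_M(U)}\, M(x - te_j)\right)}{p(x - te_j)} \\
&= \frac{-\partial_j U(x - te_j)\, e^{U(x - te_j) - K_M(U)} M(x - te_j) + (x_j - t)\, e^{U(x - te_j) - K_M(U)} M(x - te_j)}{p(x - te_j)} \\
&= (x_j - t) - \partial_j U(x - te_j)
\end{aligned}
\]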
Theorem
(M) the sequence f ∗ ωn, n ∈ N, belongs to C ∞(Rn) ∩ W 1,(cosh −1) (M). Precisely, for each n and j = 1, . . . , n, we have the equality ∂j(f ∗ ωn) = (∂jf ) ∗ ωn; the sequences f ∗ ωn, respectively ∂jf ∗ ωn, j = 1, . . . , n, converge to f , respectively ∂jf , j = 1, . . . , n, strongly in W 1,(cosh −1)∗ (M).
(M) and g ∈ W 1,(cosh −1)∗ (M). Then fg ∈ W 1,1(M) and ∂j(fg) = ∂jfg + f ∂jg.
U ∈ C (cosh −1) (M), we have F ◦ U, F ′ ◦ U∂jU ∈ C (cosh −1) (M) and ∂jF ◦ U = F ′ ◦ U∂jU, in particular F(U) ∈ C 1,(cosh −1) (M).
(M), E1(M) =
(M) ∩ SM
(M) $\hookrightarrow L^{(\cosh-1)}(M)$ the domain $C^{1,(\cosh-1)}(M) \cap S_M$ is open and the cumulant functional $K_M \colon C^{1,(\cosh-1)}(M) \cap S_M \to \mathbb{R}$ remains convex and differentiable.
Define B1(p) = Bp ∩ C 1,(cosh −1) (M) to be models for the tangent spaces of E1(M). The e-transport acts on these spaces
eUg f : B1(f ) ∋ U → U − Eg [U] ∈ B1(g) ,
so that we can define the statistical bundle to be TE1(M) = {(g, V )|g ∈ E1(M), V ∈ B1(g)} and take as charts the restrictions of the charts defined on TE (M).
Hyvärinen divergence is DH (g|f ) = Eg
.
DHM (v|u) = DH (eM(v)|eM(u)) = EM
, where f = eM(u), g = eM(v).
g − 2∆ log f g
g + 2∆ log f g + DH (f |g)
Ap(x) = p(x)−1
d
∂ ∂xi
∂xj p(x)
x ∈ Rd .
U → AM(U) = eU−KM(U)A(eU−KM(U) · M) = eU−KM(U) eU−KM(U) · M A(eU−KM(U) · M) = M−1L∗(eU−KM(U) · M)
M−1L∗(eU−KM(U) · M) = eU−KM(U)
d
∂ ∂xi
∂ ∂xj U(x) − xj
eU−KM(U)
d
aij(x) ∂ ∂xi U(x) − xi ∂ ∂xj U(x) − xj
Entropy, 16:4260–4289, 2014
In F. Barbaresco and F. Nielsen, editors, Geometric Science of Information, number 9389 in LNCS, pages 349–358. Springer, 2015
${}^e\mathbb{U}_p^q \colon B_p \ni U \mapsto U - E_q[U] \in B_q$.
${}^m\mathbb{U}_q^p \colon {}^*B_q \ni V \mapsto \frac{q}{p}V \in {}^*B_p$
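On a finite state space the two transports and their duality can be checked directly. The sketch below uses hypothetical random densities `p`, `q` and random centred vectors; it verifies that the m-transport of a `q`-centred vector is `p`-centred and that the pairing $\langle {}^e\mathbb{U}_p^q U, V\rangle_q$ equals $\langle U, {}^m\mathbb{U}_q^p V\rangle_p$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 7
p = rng.dirichlet(np.ones(n))     # hypothetical strictly positive densities
q = rng.dirichlet(np.ones(n))

def expect(r, f):
    return float(np.sum(r * f))

def e_transport(U, target):
    """e-transport to the fiber at `target`: U -> U - E_target[U]."""
    return U - expect(target, U)

def m_transport(V, source, target):
    """m-transport from the fiber at `source` to `target`: V -> (source/target) V."""
    return (source / target) * V

# U centred under p (an element of B_p), V centred under q (an element of *B_q).
U = rng.normal(size=n)
U = U - expect(p, U)
V = rng.normal(size=n)
V = V - expect(q, V)

eU = e_transport(U, q)            # element of B_q
mV = m_transport(V, q, p)         # element of *B_p

pair_q = expect(q, eU * V)        # <eU_p^q U, V>_q
pair_p = expect(p, U * mV)        # <U, mU_q^p V>_p

print(expect(q, eU), expect(p, mV), pair_q - pair_p)
```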
Properties
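$\left\langle U, {}^m\mathbb{U}_q^p V\right\rangle_p = \left\langle {}^e\mathbb{U}_p^q U, V\right\rangle_q$
${}^e\mathbb{U}_q^r\, {}^e\mathbb{U}_p^q = {}^e\mathbb{U}_p^r$
${}^m\mathbb{U}_q^r\, {}^m\mathbb{U}_p^q = {}^m\mathbb{U}_p^r$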
pU, mUq pV
eUq
pU, eUq pV
mUp
q eUq pU, V
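the affine atlas of global charts $(s_p \colon p \in E)$, $s_p(q) = \log\frac{q}{p} - E_p\left[\log\frac{q}{p}\right]$
$\{(p, V) \mid p \in E, V \in B_p\}$ by the affine atlas of global charts $\sigma_p \colon (q, V) \mapsto \left(s_p(q),\, {}^e\mathbb{U}_q^p V\right)$, $p \in E$
the set $\{(p, W) \mid p \in E, W \in {}^*B_p\}$ by the affine atlas of global charts ${}^*\sigma_p \colon (q, W) \mapsto \left(s_p(q),\, {}^m\mathbb{U}_q^p W\right)$, $p \in E$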
Definition
t → p(t) is a curve in E (p) and F : E (p) → R.
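$t \mapsto (p(t), Dp(t)) \in SE(p)$ such that for all $X \in L^{(\cosh-1)_*}(p)$ it holds $\frac{d}{dt}E_{p(t)}[X] = \left\langle X - E_{p(t)}[X],\, Dp(t)\right\rangle_{p(t)}$
$p \mapsto (p, \operatorname{grad} f(p)) \in {}^*TE(p)$ such that for each regular curve $t \mapsto p(t)$, it holds $\frac{d}{dt}f(p(t)) = \left\langle \operatorname{grad} f(p(t)),\, Dp(t)\right\rangle_{p(t)}$
Everything applies if the tangent space is in $C^{1,(\cosh-1)}(p)$, but technical details have to be checked, e.g. ${}^m\mathbb{U}_p^q$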
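a function $F \colon E(M) \to \mathbb{R}$ the Taylor formula is
\[ F(q) = F(p) + \frac{d}{dt}F(p(t))\Big|_{t=0} + \frac{1}{2}\frac{d^2}{dt^2}F(p(t))\Big|_{t=0} + R_2(f, p(\cdot)) \]
with
\[ R_2(f, p(\cdot)) = \int_0^1 (1 - t)\left(\frac{d^2}{dt^2}F(p(t)) - \frac{d^2}{dt^2}F(p(t))\Big|_{t=0}\right) dt \]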
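score
\[ F(q) = F(p) + \left\langle \operatorname{grad} F(p(0)),\, Dp(0)\right\rangle_p + \frac{1}{2}\frac{d}{dt}\left\langle \operatorname{grad} F(p(t)),\, Dp(t)\right\rangle_{p(t)}\Big|_{t=0} + R_2(f, p(\cdot)) , \]
$\frac{d}{dt}\left\langle \operatorname{grad} F(p(t)),\, Dp(t)\right\rangle_{p(t)}$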
The velocity is defined to be $t \mapsto (p(t), Dp(t)) = \left(p(t),\, \frac{d}{dt}\log p(t)\right)$
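\[ {}^eD^2p(t) = \frac{d}{ds}\, {}^e\mathbb{U}_{p(s)}^{p(t)} Dp(s)\Big|_{s=t} = \ddot U(t) - E_{p(t)}\left[\ddot U(t)\right] \]
\[ {}^mD^2p(t) = \frac{d}{ds}\, {}^m\mathbb{U}_{p(s)}^{p(t)} Dp(s)\Big|_{s=t} = \frac{\ddot p(t)}{p(t)} \]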
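\[
\begin{aligned}
{}^mD^2p(t) &= \frac{d}{ds}\, {}^m\mathbb{U}_{p(s)}^{p(t)} Dp(s)\Big|_{s=t}
= \frac{d}{ds}\, {}^m\mathbb{U}_{p(s)}^{p(t)} \frac{d}{ds}\big(U(s) - K_p(U(s))\big)\Big|_{s=t} \\
&= \frac{d}{ds}\, {}^m\mathbb{U}_{p(s)}^{p(t)} \big(\dot U(s) - dK_p(U(s))[\dot U(s)]\big)\Big|_{s=t}
= \frac{d}{ds}\, \frac{p(s)}{p(t)}\big(\dot U(s) - E_{p(s)}[\dot U(s)]\big)\Big|_{s=t} \\
&= \left[\frac{\dot p(s)}{p(t)}\big(\dot U(s) - E_{p(s)}[\dot U(s)]\big) + \frac{p(s)}{p(t)}\frac{d}{ds}\big(\dot U(s) - E_{p(s)}[\dot U(s)]\big)\right]_{s=t} \\
&= \big(\dot U(t) - E_{p(t)}[\dot U(t)]\big)^2 + \ddot U(t) - E_{p(t)}[\ddot U(t)] - \operatorname{Var}_{p(t)}\big(\dot U(t)\big) \\
&= \frac{\ddot p(t)}{p(t)}
\end{aligned}
\]
Here the fourth line uses $\frac{d}{ds}E_{p(s)}[\dot U(s)] = E_{p(s)}[\ddot U(s)] + \operatorname{Var}_{p(s)}(\dot U(s))$.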
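\[ {}^eD^2p(t) = \ddot U(t) - E_{p(t)}\left[\ddot U(t)\right] \]
zero, $\frac{d^2}{dt^2}(tU) = 0$.
\[ {}^mD^2p(t) = \frac{\ddot p(t)}{p(t)} \]
$t \mapsto p(0) + t(p(1) - p(0))$ is zero,
\[ \frac{\frac{d^2}{dt^2}\big(p(0) + t(p(1) - p(0))\big)}{p(0) + t(p(1) - p(0))} = 0 \]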
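The mixture acceleration $\ddot p(t)/p(t)$ can be evaluated by finite differences on a finite state space. The sketch below, with hypothetical random densities and a hypothetical direction `U`, checks that it vanishes along a mixture segment and that, along the exponential arc $p(t) \propto e^{tU}p(0)$ (where $\ddot U = 0$), it equals $(U - E_{p(t)}[U])^2 - \operatorname{Var}_{p(t)}(U)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
p0 = rng.dirichlet(np.ones(n))    # hypothetical end points of the arcs
p1 = rng.dirichlet(np.ones(n))
U = rng.normal(size=n)
U = U - np.sum(p0 * U)            # direction in B_{p0}

def mixture_arc(t):
    """m-geodesic: straight segment between p0 and p1."""
    return p0 + t * (p1 - p0)

def exponential_arc(t):
    """e-geodesic: p(t) proportional to exp(t U) p0."""
    w = np.exp(t * U) * p0
    return w / w.sum()

def m_acceleration(curve, t, h=1e-4):
    """Finite-difference approximation of p''(t) / p(t)."""
    return (curve(t + h) - 2 * curve(t) + curve(t - h)) / h**2 / curve(t)

t = 0.3
acc_mix = m_acceleration(mixture_arc, t)      # expected to vanish
acc_exp = m_acceleration(exponential_arc, t)  # non-zero in general

# Closed form along the exponential arc: (U - E[U])^2 - Var(U) under p(t).
pt = exponential_arc(t)
Uc = U - np.sum(pt * U)
acc_exp_formula = Uc**2 - np.sum(pt * Uc**2)

print(np.max(np.abs(acc_mix)), np.max(np.abs(acc_exp - acc_exp_formula)))
```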
The computation of $\frac{d}{dt}\left\langle \operatorname{grad} F(p(t)),\, Dp(t)\right\rangle_{p(t)}$ by choosing an autoparallel segment connecting $p$ and $q$
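$q = p(1)$, $F(q) = F(p) + \left\langle \operatorname{grad} F(p),\, Dp(0)\right\rangle_p + \frac{1}{2}\left\langle {}^e\operatorname{Hess}_{Dp(0)} F(p),\, Dp(0)\right\rangle_p + R_2(p, q)$
$q = p(1)$, $F(q) = F(p) + \left\langle \operatorname{grad} F(p),\, Dp(0)\right\rangle_p + \frac{1}{2}\left\langle {}^m\operatorname{Hess}_{Dp(0)} F(p),\, Dp(0)\right\rangle_p + R_2(p, q)$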