IGAIA 4 Bohemia
Information Geometry
ーー Historical Episodes and Future
with Recent Developments
Shun‐ichi Amari RIKEN Brain Science Institute
Prehistory: Riemannian Geometry
1929 (H. Hotelling): Riemannian metric and Fisher information; location-scale models have constant curvature.
1936: Euclidean distance (multivariate Gaussian).
1946: Bayesian theory and the Jeffreys invariant prior.
1972: invariance, {g, T}, α-connections.
The Riemannian metric (suggested by S. Moriguti) of the Gaussian family N(μ, σ²): geodesics, constant curvature (the Poincaré half-plane). A beautiful structure, but what is its essential meaning?
My background in mathematical engineering: graphs and topology of networks (homology); non-Riemannian geometry of a material manifold (dislocations); information systems, learning and neural networks.
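The Fisher metric of the Gaussian family can be checked numerically. A minimal Monte Carlo sketch (not from the slides; function names are mine) estimating g = E[∂ℓ ∂ℓᵀ] for N(μ, σ²) in coordinates (μ, σ), where the closed form diag(1/σ², 2/σ²) is the Poincaré half-plane metric up to a constant factor:

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score (gradient of log-likelihood) of N(mu, sigma^2) w.r.t. (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu)**2 - sigma**2) / sigma**3
    return np.stack([d_mu, d_sigma])

rng = np.random.default_rng(0)
mu, sigma = 0.7, 1.5
x = rng.normal(mu, sigma, size=500_000)

s = gaussian_score(x, mu, sigma)   # shape (2, N)
G = s @ s.T / x.size               # Monte Carlo estimate of E[s s^T]

# Closed form: Fisher metric diag(1/sigma^2, 2/sigma^2)
G_exact = np.diag([1 / sigma**2, 2 / sigma**2])
print(np.round(G, 3), G_exact)
```

The off-diagonal terms vanish, so (μ, σ) are orthogonal coordinates for this family.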
Fisher's idea: the exponential connection and the mixture connection (Rao, Kano?; Fisher's dream).
Amari and M. Kumon: higher-order power of statistical tests; schematically,
Error = G (1/n) + (H_e^2 + H_m^2)(1/n^2) + K (1/n^3),
with H_e, H_m the exponential and mixture curvature terms.
Reviewers (S. Lauritzen and A. P. Dawid) pointed out Chentsov's work (then a handwritten manuscript).
Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete: "geometry has nothing to do with statistics."
IEEE Trans. Inf. Theory: "this is not Shannon theory"; now well-known.
Cox visited Japan in 1983: a patron of information geometry.
Rao, Efron, Dawid, Barndorff-Nielsen, Lauritzen, Kass, Eguchi and many others; Dodson, Critchley, Marriott, Komaki, Zhang, Ay, Pistone, Gibilisco, Nielsen, …
Statistics, time series and systems, machine learning, signal processing, optimization; brain theory, consciousness; physics, economics, mathematics (Banach manifolds, affine differential geometry and beyond); quantum information, Tsallis entropy.
Many monographs
A new journal (Jun Zhang); where to publish; mailing list and society. Still a small community: united and cooperative, blessed by all.
‐‐ Stochastic approach
[Figure: network with inputs x_1, x_2 and outputs y_1, y_2]
p(x, y) = p(x) p(y | x)
x: state of the brain; y: next state of the brain.
Integrated Information Theory, G. Tononi: Φ.
A necessary condition; is it sufficient?
Full model vs. disconnected model; measure of interaction: N. Ay; information integration: Tononi, Barrett and Seth.
Full model: S_F = { p(x, y) }.
Disconnected model: S_dis = { q(x, y) : q(y | x) = ∏_i q_i(y_i | x_i) }.
Measure of information integration: information geometry (N. Ay).
Definition of Φ by postulates; the disconnected model is specified by Markov conditions.
Φ = min_{q ∈ S_dis} D[ p(x, y) : q(x, y) ],
where D is the Kullback-Leibler divergence,
D_KL[p : q] = Σ_{x, y} p(x, y) log { p(x, y) / q(x, y) }.
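For a small binary system this minimization can be done explicitly: when the disconnected family fixes q(x) = p(x) and factorizes q(y|x) = ∏ q_i(y_i|x_i), the KL minimizer is q_i(y_i|x_i) = p(y_i|x_i), so Φ = Σ_i H(Y_i|X_i) − H(Y|X) (Ay's stochastic interaction). A minimal sketch under that assumption (the channel construction is my own illustration):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def phi_stochastic_interaction(p):
    """Phi = min_{q in S_dis} D_KL[p:q] = sum_i H(Y_i|X_i) - H(Y|X),
    for a table p[x1, x2, y1, y2] with q(x) fixed to p(x)."""
    p_x = p.sum(axis=(2, 3))
    phi = -(entropy(p.ravel()) - entropy(p_x.ravel()))       # -H(Y|X)
    for ax in [(1, 3), (0, 2)]:          # marginalize away (x2,y2) resp. (x1,y1)
        m = p.sum(axis=ax)               # joint table of (x_i, y_i)
        phi += entropy(m.ravel()) - entropy(m.sum(axis=1))   # +H(Y_i|X_i)
    return phi

def channel_joint(cross):
    """Uniform binary inputs; deterministic channel, parallel or crossed wiring."""
    p = np.zeros((2, 2, 2, 2))
    for x1 in range(2):
        for x2 in range(2):
            y1, y2 = (x2, x1) if cross else (x1, x2)
            p[x1, x2, y1, y2] = 0.25
    return p

print(phi_stochastic_interaction(channel_joint(False)))  # parallel wiring: Phi = 0
print(phi_stochastic_interaction(channel_joint(True)))   # crossed wiring:  Phi = 2
```

The parallel channel is already disconnected (Φ = 0), while the crossed channel routes all information across branches (Φ = 2 bits).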
Markov Condition
When the (1→2) branch is deleted, the Markov condition q(y_2 | x_1, x_2) = q(y_2 | x_2) holds: X_1 ⊥ Y_2 | X_2, and symmetrically X_2 ⊥ Y_1 | X_1.
Postulates for the divergence D:
1) D[p : q] ≥ 0, and D[p : q] = 0 when and only when p = q.
2) For independent distributions, D[p : q] = Σ_i d[ p(x_i) : q(x_i) ].
3) D induces a dually flat structure.
Φ_KL = min_{q ∈ S_dis} D_KL[p : q]; a geometric variant gives Φ_geo.
Geometric degree of information integration
For the linear Gaussian model y = A x + ε (A the regression matrix of y on x, Σ = E[ε εᵀ]): in the disconnected model y = A' x + ε', the residual covariance Σ' = E[ε' ε'ᵀ] is diagonal, and
Φ_geo = (1/2) log ( |Σ'| / |Σ| ).
[Figure: connection matrix A between (x_1, x_2) and (y_1, y_2)]
Many other definitions of Φ
Full model and disconnected models.
Full model: the graphical model S_F is an exponential family,
p(x, y) = exp { Σ_i θ_i^X x_i + Σ_i θ_i^Y y_i + Σ_{i,j} θ_{ij}^{XY} x_i y_j + (higher-order terms) − ψ },
with θ-coordinates and η-coordinates η_i^X = E[x_i], η_{ij}^{XY} = E[x_i y_j].
Split Model: Ay, Barrett & Seth
S_H = { q : q(y | x) = ∏_i q_i(y_i | x_i) }.
Φ_H = min_{q ∈ S_H} D_KL[p : q] = D_KL[p : q̂]; the projection satisfies q̂(y_i | x_i) = p(y_i | x_i), giving
Φ_H = Σ_i H(Y_i | X_i) − H(Y | X).
[Figure: projection of p onto S_H; branches X_1→Y_1 and X_2→Y_2 retained]
Mixed Coordinates
ξ = ( η-part: η^X, η_{ii}^{XY}, η^Y ; θ-part: θ_{12}^{XY}, θ_{21}^{XY}, θ_{12}^Y ).
The projection q̂ of p onto S_H keeps the η-part of p and sets θ̂_{12}^{XY} = θ̂_{21}^{XY} = θ̂_{12}^Y = 0.
Markovian Condition
Deleting the cross branches imposes Y_1 ⊥ X_2 | X_1 and Y_2 ⊥ X_1 | X_2.
Φ = 0 when the Markov condition holds, Φ > 0 otherwise; Φ should vanish for p_ind = p_X p_Y (X and Y independent).
Problem: should Φ ≤ I(X; Y) hold?
Graphical model S_Gr: delete the cross branches of the graphical model,
q(x, y) = q(x) q(y_1 | x_1) q(y_2 | x_2);
projecting p onto S_Gr gives q̂ and Φ_Gr = D_KL[p : q̂].
Problem: Gaussian channel
p: y = A x + ε,
p(x, y) ∝ exp { −(1/2) xᵀ Σ_X^{-1} x − (1/2) (y − A x)ᵀ Σ^{-1} (y − A x) }.
Projecting onto S_G gives q̂: y = Â x + ε̂, where Â is not diagonal; what vanish are the mixed coordinates θ̂_{12}^{XY}, θ̂_{21}^{XY}.
Mismatched Decoding Model
S_M: best mismatched decoding from y to x; Φ* = D_KL[p : S_M], the divergence from p to its projection onto S_M.
S_H and S_Gr are dually flat; S_M is not flat. The resulting measures Φ_H, Φ_Gr, Φ_M, Φ_geo differ in general.
[Figure: decoding from (y_1, y_2) back to (x_1, x_2)]
Transfer Entropy; Granger Causality
TE[x_i → y_j] = min_q D_KL[p : q] (branch i → j disconnected)
= H[Y_j | X_{∖i}] − H[Y_j | X].
Transfer entropy is non-additive:
TE[x_i, x_j → y_k, y_m] ≠ TE[x_i → y_k] + TE[x_j → y_m].
Cutting branches yields split models; for partitions of X and Y: subadditivity.
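A toy sketch of the transfer-entropy formula above; the setup (a noisy copy channel from x_1 to y_2, names mine) is my own illustration:

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cond_entropy(joint, cond_axes):
    """H(rest | variables on cond_axes) from a joint probability table."""
    other = tuple(a for a in range(joint.ndim) if a not in cond_axes)
    return H(joint.ravel()) - H(joint.sum(axis=other).ravel())

# Joint over (x1, x2, y2): x1, x2 fair coins; y2 copies x1, flipped w.p. eps
eps = 0.1
p = np.zeros((2, 2, 2))
for x1 in range(2):
    for x2 in range(2):
        for y2 in range(2):
            p[x1, x2, y2] = 0.25 * (1 - eps if y2 == x1 else eps)

# TE[x1 -> y2] = H(Y2 | X2) - H(Y2 | X1, X2)
te = cond_entropy(p.sum(axis=0), (0,)) - cond_entropy(p, (0, 1))
print(round(te, 3))   # 0.531 = 1 - h(0.1): x1 transfers information to y2
```

Disconnecting the x_1 → y_2 branch costs exactly 1 − h(0.1) bits here, since y_2 is independent of x_2.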
Information Geometry of the Hyvärinen Game Score
Following Grünwald, Dawid, Parry, Lauritzen, Hyvärinen.
A score S(x, q) defines the loss L(p, q) = E_p[ S(x, q) ], the entropy H(p) = L(p, p), and the divergence D[p : q] = L(p, q) − H(p) ≥ 0 (proper score). The log score S(x, q) = −log q(x) recovers Shannon entropy and the KL divergence.
Hyvärinen score: with l(x) = log q(x),
S(x, q) = Δ l(x) + (1/2) |∇ l(x)|²,
D[p : q] = (1/2) ∫ p(x) |∇ log p(x) − ∇ log q(x)|² dx,
and D[p : c q] = D[p : q]: normalization is irrelevant.
Parametric case: a model M = { q(x, θ) }; minimize D_S[p : q(·, θ)].
s(x, θ): estimating function; (1/N) Σ_i s(x_i, θ) = 0: estimating equation.
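A sketch of score-matching estimation for a one-dimensional Gaussian, assuming the parametrization below (names and data are illustrative): the normalizing constant never enters the objective, and the minimizer coincides with the sample mean and variance.

```python
import numpy as np
from scipy.optimize import minimize

def empirical_score(params, x):
    """Empirical Hyvarinen score for the unnormalized model
    q(x) ∝ exp(-(x - mu)^2 / (2 v)):  mean of l''(x) + (1/2) l'(x)^2."""
    mu, log_v = params
    v = np.exp(log_v)            # v = sigma^2 > 0
    l1 = -(x - mu) / v           # l'(x), l = log q (normalizer drops out)
    l2 = -1.0 / v                # l''(x)
    return np.mean(l2 + 0.5 * l1**2)

rng = np.random.default_rng(1)
x = rng.normal(2.0, 3.0, size=20_000)

res = minimize(empirical_score, x0=np.zeros(2), args=(x,))
mu_hat, v_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, v_hat)   # close to the sample mean and variance
```

Setting the gradient of the empirical score to zero is exactly the estimating equation of the slide.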
Information geometry of D_S[p : q].
Asymptotic analysis of the estimator:
E[ (θ̂ − θ)(θ̂ − θ)ᵀ ] = (1/N) K^{-1} V K^{-T} ≥ (1/N) G^{-1},
K = E[ ∇_θ s(x, θ) ], V = E[ s(x, θ) s(x, θ)ᵀ ], G: Fisher information matrix.
The estimator is efficient when the estimating function a(x, θ) is proportional to the score ∇_θ log q(x, θ).
The Hyvärinen estimator is Fisher-efficient when q is multivariate Gaussian.
Discrete case: x ranges over the nodes of a graph; derivatives are replaced by differences f(x̃) − f(x) over neighboring nodes x̃ of x, with the corresponding inner product ⟨f, h⟩ over neighborhoods defining a discrete score S(x, q) and divergence D[p : q] from the ratios q(x̃)/q(x).
Self-Organization + Supervised Learning
RBM (restricted Boltzmann machine), auto-encoder, recurrent networks, dropout, contrastive divergence, convolution.
Mathematical Neurons
y = φ( Σ_i w_i x_i − h ) = φ( w · x − h ), with activation function φ(u).
Multilayer Perceptrons
y = Σ_i v_i φ( w_i · x ) = f( x; θ ),
x = (x_1, x_2, ..., x_n), θ = (w_1, ..., w_m; v_1, ..., v_m).
Multilayer Perceptron
y = f( x; θ ), θ = (v_1, ..., v_m; w_1, ..., w_m): the neuromanifold of such functions, a submanifold of the space S of functions of x.
Backpropagation --- stochastic gradient learning
Training set of examples (x_1, y_1), ..., (x_t, y_t);
l( y, x; θ ) = (1/2) { y − f( x; θ ) }² = −log p( y | x; θ );
stochastic gradient update θ_{t+1} = θ_t − η ∇_θ l( y_t, x_t; θ_t ), with ∂f/∂v_i, ∂f/∂w_i computed by backpropagation.
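The update above can be sketched end to end; the toy target, activation and hyperparameters are assumptions for illustration, not the talk's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(u):                      # one common choice of activation
    return np.tanh(u)

def dphi(u):
    return 1.0 - np.tanh(u)**2

# Toy regression target (illustrative): y = sin(x1 + x2)
X = rng.uniform(-1, 1, size=(500, 2))
Y = np.sin(X[:, 0] + X[:, 1])

m = 8                                    # hidden units
W = rng.normal(0.0, 1.0, size=(m, 2))    # w_i
V = rng.normal(0.0, 0.1, size=m)         # v_i
eta = 0.05

def forward(x):
    u = W @ x
    return V @ phi(u), u

def mean_loss():
    return np.mean([(forward(x)[0] - y)**2 / 2 for x, y in zip(X, Y)])

l0 = mean_loss()
for epoch in range(30):
    for x, y in zip(X, Y):               # stochastic gradient: one example at a time
        f, u = forward(x)
        err = f - y                      # dl/df
        gV = err * phi(u)                # dl/dv_i = err * phi(w_i . x)
        gW = (err * V * dphi(u))[:, None] * x[None, :]   # chain rule through phi
        V -= eta * gV
        W -= eta * gW
print(l0, mean_loss())                   # training loss decreases
```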
Geometry of Singular Models
The simplest case: y = Σ_i v_i φ( w_i · x ) + n, e.g.
f( x; w_1, w_2 ) = w_1 φ( J_1 · x ) + w_2 φ( J_2 · x ), φ(u) = ∫_0^u e^{−t²/2} dt.
l( y, x; θ ) = (1/2) { y − f( x; θ ) }², y: teacher signal (a stochastic loss function); backprop is vanilla gradient-descent learning.
Natural gradient stochastic descent: θ_{t+1} = θ_t − η G^{-1}( θ_t ) ∇ l( y_t, x_t; θ_t ), G: Fisher information matrix; invariant steepest descent.
Steepest descent, invariant (Yann Ollivier).
Fisher-efficient; the natural gradient is non-vanishing even with multiple layers; good at singular regions (avoids plateaus at Milnor attractors).
Adaptive natural gradient.
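A minimal sketch of natural gradient descent on the Gaussian family (μ, σ), where G = diag(1/σ², 2/σ²) is known in closed form; the data stream and step size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 2.0, size=50_000)

mu, sigma = 0.0, 1.0
eta = 0.01
for x in data:
    # ordinary gradient of l = -log N(x; mu, sigma^2) w.r.t. (mu, sigma)
    g_mu = -(x - mu) / sigma**2
    g_sigma = 1.0 / sigma - (x - mu)**2 / sigma**3
    # Fisher matrix of N(mu, sigma^2) in (mu, sigma) is diag(1/s^2, 2/s^2),
    # so the natural gradient G^{-1} grad has a closed form:
    mu -= eta * sigma**2 * g_mu
    sigma -= eta * (sigma**2 / 2.0) * g_sigma
print(mu, sigma)   # near the true parameters 3.0 and 2.0
```

Preconditioning by G⁻¹ makes the update invariant to reparametrization of (μ, σ), unlike the vanilla gradient.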
Singular Region in Parameter Space
When J_1 = J_2 = J, the two hidden units merge, f(x) = (w_1 + w_2) φ( J · x ): the singular region R(w, J) = { J_1 = J_2 = J, w_1 + w_2 = w } behaves as a single neuron, and its topology is resolved by blow-down coordinates.
Near the singularity the true model on R is a Milnor attractor of ordinary gradient learning, which stalls on the plateau, while natural gradient learning keeps dθ/dt = O(1) and passes through.
Nihat Ay and S. Amari
A divergence D[p : q] yields, for nearby ξ and ξ + dξ,
D[ ξ : ξ + dξ ] = (1/2) Σ g_ij dξ_i dξ_j + O( |dξ|³ ),
g: a Riemannian metric, positive-definite.
The third derivatives of D give a cubic tensor T_ijk and a pair of dual affine connections; the structure is (M, g, T), whose average connection is ∇⁰, the Levi-Civita connection of g.
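The quadratic expansion can be verified numerically for the Bernoulli family, where the metric induced by the KL divergence is the Fisher information g = 1/(θ(1−θ)) (in nats); a minimal check:

```python
import numpy as np

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta, d = 0.3, 1e-4
# D[theta : theta + d] = (1/2) g d^2 + O(d^3)  =>  g ≈ 2 D / d^2
g_num = 2 * kl_bernoulli(theta, theta + d) / d**2
g_fisher = 1 / (theta * (1 - theta))   # Fisher information of Bernoulli(theta)
print(g_num, g_fisher)                 # both ≈ 4.76
```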
Dual geometry: when (M, D) is dually flat, D is the canonical divergence.
Examples: the KL divergence D[p : q] = ∫ p log ( p / q ) and the χ²-divergence D[p : q] = ∫ ( p − q )² / q. Each divergence induces its own geometry, with its own geodesics.
Standard divergence: a divergence that recovers the original geometry through the exponential map,
D[p : q] = ∫_0^1 t ‖ ξ̇(t) ‖² dt,
along the geodesic ξ(t) from q to p.
Projection: q̂ = argmin_{q ∈ S} D[p : q]; at q̂, the gradient grad_q D[p : q] is orthogonal to S.
[Figure: p, its projection p̂, and the submanifold S]
IEEE ISIT‐2011 Sankt Petersburg
Data Compression in Multiterminal Statistical Inference Shun‐ichi Amari RIKEN Brain Science Institute
T. Berger; Csiszár, Ahlswede, Burnashev, Han, Amari.
Correlated sources X, Y: data compression and statistical inference.
X = (x_1 x_2 ... x_n): n bits, compressed to k_X bits; Y = (y_1 y_2 ... y_n): n bits, compressed to k_Y bits; estimate q̂ from the compressed data.
p(x, y; q), iid; binary case: x, y ∈ {0, 1}, Prob{x = 1} = Prob{y = 1} = 1/2, Prob{x ≠ y} = q.
Encoding (data compression): c maps the 2ⁿ sequences x to 2^{k_X} codewords (n bits → k_X bits).
Extreme case k_X = 1, k_Y = n: a one-bit encoding such as c(x) = sgn( a · x ).
X = (x_1 x_2 ... x_n) compressed to a single bit c; Y = (y_1 y_2 ... y_n) kept at n bits. This works when q = 1/2 (x and y independent), but not for general q.
The general case k_X, k_Y < n: in progress!
Information Geometry and the Transportation Problem (Wasserstein distance)
Entropic relaxation: min_p ⟨c, p⟩ − a H(p), and its dual. New papers:
Exponentially concave function and a new information geometry
Portfolio theory, transportation problem and information geometry (dually projectively flat)
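The entropic relaxation above is solved by Sinkhorn's alternating scaling of a Gibbs kernel; a minimal sketch (the cost matrix and regularization strength are illustrative assumptions):

```python
import numpy as np

def sinkhorn(c, mu, nu, a=0.05, iters=500):
    """Entropic relaxation min <c, p> - a H(p) over plans p with
    row marginal mu and column marginal nu, via alternating scaling."""
    K = np.exp(-c / a)                  # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)              # enforce column marginal
        u = mu / (K @ v)                # enforce row marginal
    return u[:, None] * K * v[None, :]  # transport plan p

c = np.array([[0.0, 1.0],
              [1.0, 0.0]])
mu = np.array([0.5, 0.5])
nu = np.array([0.5, 0.5])
p = sinkhorn(c, mu, nu)
print(np.round(p, 3))   # mass concentrates on the zero-cost diagonal
```

The two scaling factors u, v are exponentials of the dual potentials, which is where the dual flat structure mentioned above enters.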