Information Geometry: Historical Episodes and Future, with Recent Developments (PowerPoint presentation)



SLIDE 1

IGAIA 4, Bohemia

Information Geometry: Historical Episodes and Future, with Recent Developments

Shun-ichi Amari, RIKEN Brain Science Institute

SLIDE 2

Prehistory: Riemannian Geometry

  • H. Hotelling, 1929: Riemannian metric and Fisher information; location-scale model has constant curvature
  • P. C. Mahalanobis, 1936: Euclidean distance (multivariate Gaussian)
  • C. R. Rao, 1945: Cramér-Rao theorem; Riemannian metric
  • H. Jeffreys, 1946: Bayesian theory and Jeffreys' invariant prior

SLIDE 3

Dual Geometry, Invariance

  • N. Chentsov, 1972: invariance, {g, T}, α-connection
  • B. Efron, 1975 (A. P. Dawid): statistical curvature; higher-order asymptotics
  • O. Barndorff-Nielsen, 1976: exponential family; Legendre transform
  • S. Amari, 1982: duality; curvature and statistics (M. Kumon)
  • H. Nagaoka and S. Amari, 1982: duality, Pythagorean theorem
SLIDE 4

Amari's personal history

1958: statistics seminar (master's course at U. Tokyo)

  • S. Kullback, "Information Theory and Statistics"

Riemannian metric (suggested by S. Moriguti)

Gaussian family N(μ, σ²): geodesics, constant curvature (Poincaré half-plane)

beautiful structure → essential meaning?

mathematical engineering: graphs and topology of networks (homology); non-Riemannian geometry of a material manifold (dislocations); information systems, learning and neural networks

SLIDE 5

Statistical curvature and higher-order inference

  • B. Efron, 1975: Fisher's idea; exponential connection and mixture connection
  • A. P. Dawid, 1975: e- and m-connections
  • S. Amari: α-geometry

(Rao, Kano; Fisher's dream)

Amari and M. Kumon: higher-order power of statistical tests

Error = G/n + (H_e² + 2H_m²)/(2n²) + K/n³

SLIDE 6

Amari's paper, Ann. Statist., 1982:
reviewers (S. Lauritzen and A. P. Dawid) pointed to Chentsov's work (handwritten manuscript)

  • H. Nagaoka and S. Amari, 1982 (technical report)
  • Annals of Probability: 7 reviewers
  • Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete: "geometry has nothing to do with statistics"
  • IEEE Trans. Inf. Theory: Shannon theory; now well-known

SLIDE 7

London Workshop: 1984 (D. Cox)

Cox visited Japan in 1983; a patron of information geometry

Rao, Efron, Dawid, Barndorff-Nielsen, Lauritzen, Kass, Eguchi and many others; Dodson, Critchley, Marriott, Komaki, Zhang, Ay, Pistone, Gibilisco, Nielsen, …

SLIDE 8

Information Geometry: a lucky naming

Application areas:

statistics; time series and systems; machine learning; signal processing; optimization; brain theory and consciousness; physics; economics; mathematics (Banach manifolds, affine differential geometry and beyond); quantum information; Tsallis entropy

SLIDE 9

International Conferences

IGAIA series; GSIS series, …

Many monographs

a new journal (Jun Zhang); where to publish; mailing list and society
still a small community; united and cooperative, blessed by all

SLIDE 10

My recent works

  • 1. Systems complexity and consciousness (IIT)
  • 2. Geometry of score matching (Hyvärinen score)
  • 3. Natural gradient descent and topology of deep learning
  • 4. Canonical divergence
  • 5. Multi-terminal statistical inference
  • 6. Information geometry and Wasserstein distance
SLIDE 11

Information Integration and Complexity of Systems: a stochastic approach

[Figure: network with inputs x₁, x₂ and outputs y₁, y₂]

p(x, y) = p(x) p(y | x)

x: state of the brain; y: next state of the brain

SLIDE 12

Integrated Information Theory (G. Tononi): Φ

Necessary condition; sufficient?

SLIDE 13

full model: S_F = { p(x, y) }
disconnected model: S_dis = { q(x, y) }, with q(y | x) = ∏ᵢ q(yᵢ | xᵢ)

measure of interaction: N. Ay
information integration: Tononi; Barrett and Seth
many others

[Figure: full network x₁, x₂ → y₁, y₂ and disconnected network with cross branches removed]

SLIDE 14

Measure of information integration, or system complexity

Information Geometry: N. Ay

SLIDE 15

Definition of Φ: Postulates

1)

2)

3) Disconnected model: Markov conditions

Φ = min_{q ∈ S_dis} D[p(x, y) : q(x, y)]

D_KL[p : q] = Σ_{x,y} p(x, y) log ( p(x, y) / q(x, y) )
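The minimization above can be made concrete on a toy system. The sketch below (my own illustration, not from the slides) uses a random joint distribution over four binary variables and a split-type disconnected family with q(x) = p(x) and q(y | x) = q(y₁ | x₁) q(y₂ | x₂); for this family the KL minimum is attained at q(yᵢ | xᵢ) = p(yᵢ | xᵢ), giving the closed form Φ = Σᵢ H(Yᵢ | Xᵢ) − H(Y | X).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint p(x1, x2, y1, y2) over binary variables, shape (2, 2, 2, 2).
p = rng.random((2, 2, 2, 2))
p /= p.sum()

def entropy(t):
    t = t[t > 0]
    return -np.sum(t * np.log(t))

# H(Y | X) = H(X, Y) - H(X)
p_x = p.sum(axis=(2, 3))
H_Y_given_X = entropy(p) - entropy(p_x)

# H(Y_i | X_i) from the pairwise marginals p(x_i, y_i)
p_x1y1 = p.sum(axis=(1, 3))   # marginal of (x1, y1)
p_x2y2 = p.sum(axis=(0, 2))   # marginal of (x2, y2)
H_Y1_given_X1 = entropy(p_x1y1) - entropy(p_x1y1.sum(axis=1))
H_Y2_given_X2 = entropy(p_x2y2) - entropy(p_x2y2.sum(axis=1))

# For q(x, y) = p(x) q(y1|x1) q(y2|x2), the KL minimum over q is
# attained at q(yi|xi) = p(yi|xi), so:
phi = H_Y1_given_X1 + H_Y2_given_X2 - H_Y_given_X
print(phi)
```

Since Φ is the minimum of a KL divergence, it is always nonnegative, which the computation above respects.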

SLIDE 16

Markov Condition

branch (1 → 2) deleted: Markov condition:

p(x₁, y₂ | x₂) = p(x₁ | x₂) p(y₂ | x₂),  i.e.  x₁ ⊥ y₂ | x₂

S_dis: all branches (xᵢ → yⱼ, i ≠ j) deleted:

X₁ ⊥ Y₂ | X₂,  X₂ ⊥ Y₁ | X₁

[Figure: network x₁, x₂ → y₁, y₂ with cross branches cut]

SLIDE 17

Why KL-divergence?

1) D[p : q] ≥ 0, and = 0 when and only when p = q
2) D[p : q] = Σᵢ d[p(xᵢ), q(xᵢ)] (decomposability)
3)
4) induces a dually flat structure

SLIDE 18

Φ_geo = min_{q ∈ S_dis} D_KL[p(x, y) : q(x, y)]

Geometric degree of information integration

SLIDE 19

Gaussian case

full model: y = Ax + ε,  Σ = E[εεᵀ]
disconnected model: y = A′x + ε′,  Σ′ = E[ε′ε′ᵀ],  A′: diagonal

Φ_geo = (1/2) log ( |Σ′| / |Σ| )

[Figure: linear channel y = Ax + ε over nodes x₁, x₂, y₁, y₂]
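A minimal numerical sketch of the Gaussian case, assuming the slide's log-determinant formula: the full residual covariance comes from an unrestricted regression of y on x, while the disconnected one regresses each yᵢ on xᵢ alone (a per-coordinate least-squares simplification of the exact constrained fit).

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a linear channel y = A x + eps with a non-diagonal A.
n = 100_000
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = rng.normal(size=(n, 2))
eps = 0.5 * rng.normal(size=(n, 2))
y = x @ A.T + eps

# Full model: unrestricted least-squares regression of y on x.
A_full, *_ = np.linalg.lstsq(x, y, rcond=None)
Sigma = np.cov((y - x @ A_full).T)

# Disconnected model: each y_i regressed on x_i only (A' diagonal).
resid = np.empty_like(y)
for i in range(2):
    a_i = np.dot(x[:, i], y[:, i]) / np.dot(x[:, i], x[:, i])
    resid[:, i] = y[:, i] - a_i * x[:, i]
Sigma_dis = np.cov(resid.T)

# Phi_geo = (1/2) log(|Sigma'| / |Sigma|), the slide's formula.
phi_geo = 0.5 * np.log(np.linalg.det(Sigma_dis) / np.linalg.det(Sigma))
print(phi_geo)
```

With the strong cross terms 0.6 and 0.4 in A, the diagonal fit loses predictive power, so Φ_geo comes out clearly positive; it shrinks to zero when A is diagonal.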


SLIDE 21

Many other definitions of Φ

Full model vs. disconnected models

SLIDE 22

Full model S_F: graphical model

p(x, y) = exp{ Σᵢ θᵢ^X xᵢ + Σᵢ θᵢ^Y yᵢ + Σᵢⱼ θᵢⱼ^{XY} xᵢ yⱼ + θ₁₂ x₁x₂ + θ′₁₂ y₁y₂ + (higher-order terms) − ψ }

η-coordinates: ηᵢ^X = E[xᵢ],  ηᵢⱼ^{XY} = E[xᵢ yⱼ], …

an exponential family: θ-coordinates and η-coordinates

There are many disconnected models!!

SLIDE 23

Split Model S_H: Ay; Barrett & Seth

S_H = { q(x, y) : q(y | x) = ∏ᵢ q(yᵢ | xᵢ) }

Φ_H = min_{q ∈ S_H} D_KL[p : q] = D_KL[p : q̂]

each yᵢ depends only on xᵢ

[Figure: split network over x₁, x₂, y₁, y₂; p projected to q̂ ∈ S_H]

SLIDE 24

Mixed Coordinates

ξ = ( η-part ; θ-part ) = ( ηᵢ^X, ηᵢ^Y, … ; θ₁₂^{XY}, θ₂₁^{XY}, θ₁₂^Y )

S_H:  θ̂₁₂^{XY} = θ̂₂₁^{XY} = θ̂₁₂^Y = 0

Markovian condition:

Y₁ ⊥ X₂ | X₁,  Y₂ ⊥ X₁ | X₂

problem: when p(x, y) = p_X(x) p_Y(y) (x and y independent), I(X; Y) = 0 but Φ can be > 0; a good Φ should satisfy Φ(p) ≤ I(X; Y)

SLIDE 25

Split Model S_Gr

q(x, y) = q(x) ∏ᵢ q(yᵢ | xᵢ);  q̂(x) = p(x),  q̂(yᵢ | xᵢ) = p(yᵢ | xᵢ)

graphical model

SLIDE 26

Problem: Gaussian channel

p(x, y):  y = Ax + ε

q̂ ∈ S_Gr:  y = Âx + ε̂,  but Â is not diagonal even with θ̂₁₂^{XY} = θ̂₂₁^{XY} = 0

p(x, y) ∝ exp{ −(1/2) xᵀ Σ_X⁻¹ x − (1/2) (y − Ax)ᵀ Σ⁻¹ (y − Ax) }

SLIDE 27

Mismatched Decoding Model S_M

Best mismatched decoding from y to x, using only the split conditionals q(yᵢ | xᵢ)

Φ* = min_{q ∈ S_M} D_KL[p : q] = D_KL[p : S_M]

[Figure: network nodes x₁, x₂, y₁, y₂; decoder ŷ → x]

SLIDE 28

S_H, S_Gr: dually flat;  S_M: not flat

the corresponding measures Φ_H, Φ_Gr, Φ_M, Φ_geo are ordered according to the inclusions of the models

SLIDE 29

Transfer Entropy; Granger causality

TE[xᵢ → yⱼ] = min_q D_KL[p : q]  (branch i → j disconnected)
            = H[Yⱼ | X∖Xᵢ] − H[Yⱼ | X]

Non-additive:

TE[xᵢ, xⱼ → yₖ, yₘ] ≠ TE[xᵢ → yₖ] + TE[xⱼ → yₘ]

[Figure: network nodes x₁, x₂, y₁, y₂]
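The conditional-entropy form of the transfer entropy is easy to check on a toy joint distribution. A minimal sketch (my own illustration), for TE[x₁ → y₂] = H(Y₂ | X₂) − H(Y₂ | X₁, X₂):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy joint p(x1, x2, y2) over binary variables.
p = rng.random((2, 2, 2))
p /= p.sum()

def entropy(t):
    t = t[t > 0]
    return -np.sum(t * np.log(t))

def cond_entropy(joint, cond_axes):
    """H(remaining vars | vars on cond_axes) = H(joint) - H(conditioning marginal)."""
    marg_axes = tuple(i for i in range(joint.ndim) if i not in cond_axes)
    marg = joint.sum(axis=marg_axes)
    return entropy(joint) - entropy(marg)

# TE[x1 -> y2] = H(Y2 | X2) - H(Y2 | X1, X2)
p_x2y2 = p.sum(axis=0)                  # marginal of (x2, y2)
te = cond_entropy(p_x2y2, (0,)) - cond_entropy(p, (0, 1))
print(te)
```

Because extra conditioning never increases entropy, the value is always nonnegative, consistent with its KL-minimization definition.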

SLIDE 30

Hierarchy: transfer entropy

cutting branches → split models

partitions X = ∪ᵢ Xᵢ (Xᵢ ∩ Xⱼ = ∅),  Y = ∪ᵢ Yᵢ (Yᵢ ∩ Yⱼ = ∅)

partition of X and Y

subadditivity

SLIDE 31

Information Geometry of the Hyvärinen Game Score

Following Grünwald, Dawid, Parry, Lauritzen, Hyvärinen

score S(x, q); loss L(p, q) = E_p[ S(x, q) ]
log score: S(x, q) = −log q(x)

  • cross entropy: H[p, q] = E_p[ S(x, q) ]
  • entropy: H[p] = H[p, p]
  • divergence: D[p : q] = H[p, q] − H[p, p]

SLIDE 32

Hyvärinen score

l(x, q) = log q(x)

S(x, q) = Δ l(x) + (1/2) ‖∇ l(x)‖²

D_S[p : q] = (1/2) E_p[ ‖∇ log p(x) − ∇ log q(x)‖² ]

D_S[p : cq] = D_S[p : q]  (no normalization constant needed)

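The Hyvärinen score can be minimized empirically without knowing the normalization of q. A minimal sketch (my own example, not from the slides) fits a 1-D Gaussian N(μ, σ²) by minimizing the empirical objective E[ l″(x) + ½ l′(x)² ]; for this model the minimizer coincides with the sample mean and variance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
data = rng.normal(loc=2.0, scale=1.5, size=20_000)

def score_matching_objective(theta, x):
    """Empirical Hyvarinen objective  mean[ l''(x) + (1/2) l'(x)^2 ]
    for the model q = N(mu, sigma^2), where l(x) = log q(x)."""
    mu, log_sigma = theta
    s2 = np.exp(2.0 * log_sigma)
    l1 = -(x - mu) / s2          # l'(x)
    l2 = -1.0 / s2               # l''(x), constant in x
    return np.mean(l2 + 0.5 * l1 ** 2)

res = minimize(score_matching_objective, x0=np.array([0.0, 0.0]),
               args=(data,), method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)
```

Setting the gradient of the objective to zero gives μ̂ = mean(x) and σ̂² = mean((x − μ̂)²), so the numerical minimizer should land on the sample statistics, illustrating the Fisher efficiency claimed later for the Gaussian case.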
SLIDE 33

parametric case

M = { q(x, θ) }

θ̂ = argmin_θ D_S[p : q(·, θ)]

s(x, θ) = ∇_θ S(x, θ): estimating function

(1/N) Σᵢ s(xᵢ, θ) = 0: estimating equation

Information geometry of D_S[p : q]

SLIDE 34

Asymptotic analysis of the estimator

E[ (θ̂ − θ)(θ̂ − θ)ᵀ ] = (1/N) K⁻¹ V K⁻ᵀ ≥ (1/N) G⁻¹

K = E[ ∇_θ s(x, θ) ],  V = E[ s(x, θ) s(x, θ)ᵀ ],  G(θ): Fisher information matrix

SLIDE 35

General estimating function a(x, θ):

asymptotic covariance of θ̂ determined by A = E[ ∇_θ a(x, θ) ] and E[ a aᵀ ]

the estimator is efficient when a(x, θ) = c ∇_θ log q(x, θ)


SLIDE 37

Hyvärinen estimator

Fisher efficient when q is multivariate Gaussian

SLIDE 38

Discrete case: graph of nodes x

replace the derivative by the graph difference over neighbors:

(∇f)(x) = f(x̃) − f(x),  x̃ ∈ N(x)  (N(x): neighbors of x)

SLIDE 39

discrete Hyvärinen divergence D_S[p : q]: expectation of the squared differences of the scores f(x̃) − f(x) and h(x̃) − h(x) of p and q over neighbors x̃ ∈ N(x)

SLIDE 40

Deep Learning

Self-Organization + Supervised Learning

RBM (Restricted Boltzmann Machine), auto-encoder, recurrent nets, dropout, contrastive divergence, convolution

SLIDE 41

Mathematical Neurons

y = φ( Σᵢ wᵢ xᵢ − h ) = φ( w·x − h )

φ(u): sigmoidal activation function

[Figure: neuron x → y and graph of φ(u)]

SLIDE 42

Multilayer Perceptrons

y = Σᵢ vᵢ φ( wᵢ·x ) = f(x, θ)

x = (x₁, x₂, …, xₙ),  θ = (w₁, …, w_m; v₁, …, v_m)

[Figure: two-layer network x → hidden units → y]

SLIDE 43

Multilayer Perceptron

y = f(x, θ) = Σᵢ vᵢ φ( wᵢ·x ),  θ = (v₁, …, v_m; w₁, …, w_m)

neuromanifold: the set of functions f(·, θ) inside the space of functions S

SLIDE 44

Backpropagation: stochastic gradient learning

examples (x₁, y₁), …, (x_t, y_t): training set

l(y, x; θ) = (1/2) { y − f(x, θ) }² = −log p(y | x; θ)

θ_{t+1} = θ_t − η ∇_θ l(y_t, x_t; θ_t)

SLIDE 45

singularities

SLIDE 46

Geometry of the singular model

y = Σᵢ vᵢ φ( wᵢ·x ) + n

singular where v = 0 or |w| = 0

SLIDE 47

model: 2 hidden neurons

y = f(x; w₁, w₂, v₁, v₂) + ε = v₁ φ(w₁·x) + v₂ φ(w₂·x) + ε

φ(u) = √(2/π) ∫₀ᵘ e^{−t²/2} dt

SLIDE 48

stochastic loss function:

l(y, x; θ) = (1/2) { y − f(x, θ) }²,  y: teacher signal

backprop: vanilla gradient descent learning

θ_{t+1} = θ_t − η ∇_θ l

SLIDE 49

Natural Gradient Stochastic Descent

θ_{t+1} = θ_t − η G⁻¹(θ_t) ∇_θ l(y_t, x_t; θ_t)

G: Fisher information matrix; invariant; steepest descent in the Riemannian sense

SLIDE 50

Natural gradient is superior

Steepest descent; invariant (Yann Ollivier)

Fisher-efficient; the natural gradient is non-vanishing even in multiple layers; good at singular regions (avoids plateaus: Milnor attractor)
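A minimal sketch of the natural-gradient update θ ← θ − η G⁻¹∇l on a model where the Fisher matrix is known in closed form: fitting the mean and standard deviation of N(μ, σ²), whose Fisher information in (μ, σ) coordinates is diag(1/σ², 2/σ²). This is my own batch-mode illustration, not code from the talk.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(loc=1.0, scale=2.0, size=50_000)

def grad_nll(mu, sigma, x):
    """Gradient of the mean negative log-likelihood of N(mu, sigma^2)."""
    g_mu = (mu - np.mean(x)) / sigma ** 2
    g_sigma = 1.0 / sigma - np.mean((x - mu) ** 2) / sigma ** 3
    return np.array([g_mu, g_sigma])

def fisher(sigma):
    """Fisher information of N(mu, sigma^2) in (mu, sigma) coordinates."""
    return np.diag([1.0 / sigma ** 2, 2.0 / sigma ** 2])

mu, sigma = 0.0, 1.0
eta = 0.5
for _ in range(200):
    g = grad_nll(mu, sigma, data)
    nat = np.linalg.solve(fisher(sigma), g)   # G^{-1} grad: natural gradient
    mu -= eta * nat[0]
    sigma -= eta * nat[1]
print(mu, sigma)
```

Preconditioning by G⁻¹ makes the effective step size independent of the current σ, which is the invariance property claimed on the slide; the iterates converge to the sample mean and standard deviation.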

SLIDE 51

Adaptive Natural Gradient

SLIDE 52

Singular Region in Parameter Space

R(J) = { θ : w₁ = w₂ = J } ∪ { θ : v₁ = 0 } ∪ { θ : v₂ = 0 }

on R the network computes a single-neuron function  f(x) = v φ(J·x),  v = v₁ + v₂
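The degeneracy of the singular region is easy to verify numerically: every split of the output weight with w₁ = w₂ = J, and every "dead unit" setting v₂ = 0, computes the same single-neuron function. A small check (my own illustration, using tanh as the activation):

```python
import numpy as np

rng = np.random.default_rng(6)

def f(x, v1, v2, w1, w2):
    # Two-hidden-unit network output for a batch of inputs x.
    return v1 * np.tanh(x @ w1) + v2 * np.tanh(x @ w2)

x = rng.normal(size=(1000, 3))
J = rng.normal(size=3)
v = 1.7

# Two different splits of v on the branch w1 = w2 = J ...
out_a = f(x, 0.3, v - 0.3, J, J)
out_b = f(x, 1.2, v - 1.2, J, J)
# ... and the dead-unit realization v2 = 0 with w2 arbitrary.
out_c = f(x, v, 0.0, J, rng.normal(size=3))

print(np.max(np.abs(out_a - out_b)), np.max(np.abs(out_a - out_c)))
```

All three parameter points realize f(x) = v tanh(J·x), so the map from parameters to functions collapses a whole region to one point: the source of the singular geometry.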

SLIDE 53

Coordinate transformation

new coordinates:

J = (v₁ w₁ + v₂ w₂) / (v₁ + v₂),  u = w₂ − w₁,  v = v₁ + v₂,  z = (v₂ − v₁) / (v₁ + v₂)

SLIDE 54

Singular Region

R = { (J, u, z) : u = 0 } ∪ { z = ±1 }

SLIDE 55

Milnor attractor

SLIDE 56
  • Fig. 2: trajectories
SLIDE 57

Saddle and plateau

SLIDE 58

Topology of the singular region R

blow-down coordinates:  c = z u,  u = |u| e,  ‖e‖ = 1,  e ∈ S^{n−1}

SLIDE 59

Singular Region

R = { (J, u, z) : u = 0 } ∪ { z = ±1 }

SLIDE 60
SLIDE 61

Sphere Sⁿ and projective space Pⁿ

SLIDE 62

natural gradient learning near the singularity

under the natural gradient, the distance from R changes at rate O(1), so the dynamics leave the singular region quickly whether the true model lies on or off R; vanilla gradient descent is instead trapped near the Milnor attractor

SLIDE 63

Canonical Divergence in a Manifold of Dual Affine Connections

Nihat Ay and S. Amari

SLIDE 64

Divergence and metric

D[ξ : ξ + dξ] = (1/2) g_{ij} dξⁱ dξʲ + O(|dξ|³)

G = (g_{ij}): Riemannian metric, positive-definite

SLIDE 65

Divergence and dual affine connections

Γ_{ijk} = −∂ᵢ∂ⱼ∂ₖ′ D[ξ : ξ′] |_{ξ′=ξ},  Γ*_{ijk} = −∂ₖ∂ᵢ′∂ⱼ′ D[ξ : ξ′] |_{ξ′=ξ}

SLIDE 66

Dual geometry

X g(Y, Z) = g(∇_X Y, Z) + g(Y, ∇*_X Z)

Γ⁰_{ijk} = (Γ_{ijk} + Γ*_{ijk}) / 2: Levi-Civita connection

T_{ijk} = Γ*_{ijk} − Γ_{ijk}

SLIDE 67

Dual geometry: canonical divergence

M dually flat: affine coordinates θ (∇-flat) and η (∇*-flat), potentials ψ(θ), φ(η)

D[p : q] = ψ(θ_p) + φ(η_q) − θ_p · η_q

Bregman divergence
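The canonical divergence can be checked concretely on the simplest exponential family. The sketch below (my own example) uses the Bernoulli family, where θ = log(p/(1−p)), ψ(θ) = log(1 + e^θ), η = p and φ(η) is the negative entropy; with the indexing on the slide, the canonical divergence equals the KL divergence with its arguments reversed.

```python
import numpy as np

# Bernoulli as an exponential family:
#   theta = log(p / (1 - p)),  psi(theta) = log(1 + e^theta)
#   eta = p,                   phi(eta) = eta log eta + (1 - eta) log(1 - eta)
def psi(theta):
    return np.log1p(np.exp(theta))

def phi(eta):
    return eta * np.log(eta) + (1 - eta) * np.log(1 - eta)

def canonical_divergence(p, q):
    """D[p : q] = psi(theta_p) + phi(eta_q) - theta_p * eta_q  (slide formula)."""
    theta_p = np.log(p / (1 - p))
    return psi(theta_p) + phi(q) - theta_p * q

def kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

p, q = 0.3, 0.7
print(canonical_divergence(p, q), kl(q, p))
```

Both printed numbers agree, illustrating that the Bregman construction from the dual potentials reproduces the KL divergence on a dually flat family.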

SLIDE 68

Exponential map and geodesic

X_q(p): inverse exponential map, the tangent vector at q of the ∇-geodesic γ(t) reaching p at t = 1

(for the exponential connection, X_q(p) corresponds to log p − log q)

SLIDE 69

Exponential map divergence

D[p : q] = (1/2) ‖ X_q(p) ‖²  (∇-divergence)

D*[p : q] = (1/2) ‖ X*_q(p) ‖²  (∇*-geodesic)

SLIDE 70

Theorem 1. The exponential map divergence induces the metric g together with the 1/3-mixture of the dual connections.

Standard divergence:  D_stan[p : q] = (1/2) ‖ X^{1/3}_q(p) ‖²

Theorem 2. A weighted exponential map divergence recovers the original geometry (g, ∇, ∇*).

SLIDE 71

D[p : q] = ∫₀¹ ⟨ X_t, ξ̇(t) ⟩ dt = ∫₀¹ t ‖ ξ̇(t) ‖² dt;  weighted version: ∫₀¹ w(t) ‖ ξ̇(t) ‖² dt

along the geodesic ξ(t) from q to p

SLIDE 72

Divergence and projection

projection theorem:

p̂ = argmin_{q ∈ S} D[p : q]

grad_q D[p : q] = c X_q(p)

[Figure: projection of p onto the submanifold S at p̂]

SLIDE 73

IEEE ISIT 2011, Sankt Petersburg

Data Compression in Multiterminal Statistical Inference
Shun-ichi Amari, RIKEN Brain Science Institute

SLIDE 74

A long-standing problem

T. Berger; Csiszár, Ahlswede, Burnashev, Han, Amari

correlated sources X, Y: data compression and statistical inference

X: x₁ x₂ … xₙ → k_X bits
Y: y₁ y₂ … yₙ → k_Y bits
→ estimate q̂

p(x, y; q), iid

SLIDE 75

binary case: x, y ∈ {0, 1};  Prob{x = 1} = Prob{y = 1} = 1/2

Prob{x ≠ y} = q;  p(x, y; q), iid

X: x₁ x₂ … xₙ → k_X bits;  Y: y₁ y₂ … yₙ → k_Y bits;  estimate q̂

SLIDE 76

Encoding: data compression

x ∈ Xⁿ: 2ⁿ messages → codeword c(x) with 2^k values;  n bits → k bits

SLIDE 77

One-bit helper case

k_X = 1,  k_Y = n

c(x) = sgn(a·x)

X: x₁ x₂ … xₙ → c(x): 1 bit;  Y: y₁ y₂ … yₙ: n bits → estimate q̂

SLIDE 78

Is single-bit encoding optimal?

It is optimal when q = 1/2 (x, y independent), but not for general q.

SLIDE 79

Fisher information:  k_X = 1,  k_Y = n

SLIDE 80

Kingo Kobayashi: parity encoding

x₁ ⊕ x₂ ⊕ … ⊕ x_s

in progress!
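Why parities of blocks are informative can be seen in a short simulation (my own sketch, not from the talk): for the binary source above, xᵢ ⊕ yᵢ are iid Bernoulli(q), so the parity mismatch rate over blocks of s bits is (1 − (1 − 2q)^s)/2, and q can be recovered from the parities alone.

```python
import numpy as np

rng = np.random.default_rng(7)

n, s, q = 400_000, 4, 0.1

# Correlated binary sources: y = x XOR noise, noise ~ Bernoulli(q).
x = rng.integers(0, 2, size=n)
y = x ^ (rng.random(n) < q)

# Each terminal sends only the parity of each block of s bits.
px = np.bitwise_xor.reduce(x.reshape(-1, s), axis=1)
py = np.bitwise_xor.reduce(y.reshape(-1, s), axis=1)

# P(px != py) = (1 - (1 - 2q)^s) / 2, so invert to estimate q.
m = np.mean(px != py)
q_hat = (1.0 - (1.0 - 2.0 * m) ** (1.0 / s)) / 2.0
print(q_hat)
```

Each terminal transmits only n/s bits, yet the estimate tracks the true q; how close such schemes come to the full-data Fisher information is the open question the slide refers to.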

SLIDE 81

Information Geometry and the Transportation Problem (Wasserstein distance)

entropic relaxation:  min_P ⟨C, P⟩ − λ H(P), and its dual

New papers:

  • S. Pal and T.-K. L. Wong, "Exponentially concave functions and a new information geometry"
  • portfolio theory, transportation problem and information geometry (dually projectively flat)
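The entropic relaxation above is exactly what Sinkhorn's alternating-scaling algorithm solves; a minimal sketch (my own illustration, with arbitrary marginals and cost):

```python
import numpy as np

rng = np.random.default_rng(8)

# Entropic relaxation of the transportation problem:
#   min_P <C, P> - lam * H(P)   s.t.  P 1 = a,  P^T 1 = b
# The optimum has the form P = diag(u) K diag(v) with K = exp(-C / lam),
# and u, v are found by alternating marginal scalings (Sinkhorn iterations).
n = 5
a = np.full(n, 1.0 / n)
b = rng.random(n); b /= b.sum()
C = rng.random((n, n))
lam = 0.05

K = np.exp(-C / lam)
u = np.ones(n)
for _ in range(2000):
    v = b / (K.T @ u)
    u = a / (K @ v)
P = u[:, None] * K * v[None, :]
print(P.sum(axis=1), P.sum(axis=0))
```

Both marginals of the resulting plan P match a and b to high precision; as λ → 0 the plan approaches an optimal transport plan, which is where the contact with Wasserstein geometry arises.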