Information Geometry: Historical Episodes and Future, with Recent Developments (PowerPoint presentation)



SLIDE 1

IGAIA 4, Bohemia

Information Geometry: Historical Episodes and Future, with Recent Developments

Shun-ichi Amari, RIKEN Brain Science Institute

SLIDE 2

Prehistory: Riemannian Geometry

  • H. Hotelling, 1929: Riemannian metric and Fisher information; location-scale model has constant curvature
  • P. C. Mahalanobis, 1936: Euclidean distance (multivariate Gaussian)
  • C. R. Rao, 1945: Cramér-Rao theorem; Riemannian metric
  • H. Jeffreys, 1946: Bayesian theory and Jeffreys' invariant prior

SLIDE 3

Dual Geometry, Invariance

  • N. Chentsov, 1972: invariance, {g, T}, α-connection
  • B. Efron, 1975 (A. P. Dawid): statistical curvature; higher-order asymptotics
  • O. Barndorff-Nielsen, 1976: exponential family; Legendre transform
  • S. Amari, 1982: duality; curvature and statistics (M. Kumon)
  • H. Nagaoka and S. Amari, 1982: duality, Pythagorean theorem
SLIDE 4

Amari's personal history

1958: statistics seminar (master's course at U. Tokyo)

  • S. Kullback, "Information Theory and Statistics"

Riemannian metric (suggested by S. Moriguti)

Gaussian family N(μ, σ²): geodesics, constant curvature (Poincaré half-plane)

beautiful structure → essential meaning?

mathematical engineering: graphs and topology of networks (homology); non-Riemannian geometry of a material manifold (dislocations); information systems, learning and neural networks

SLIDE 5

Statistical curvature and higher-order inference

  • B. Efron, 1975: Fisher's idea; exponential connection and mixture connection
  • A. P. Dawid, 1975: e- and m-connections
  • S. Amari: α-geometry

(Rao, Kano; Fisher's dream)

Amari and M. Kumon: higher-order power of statistical tests

Error = G/n + (H_e² + 2H_m²)/(2n²) + K/n³

SLIDE 6

Amari's paper, Ann. Statist., 1982:
reviewers (S. Lauritzen and A. P. Dawid) pointed to Chentsov's work (handwritten manuscript)

  • H. Nagaoka and S. Amari, 1982 (technical report)
  • Annals of Probability: 7 reviewers
  • Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete: "geometry has nothing to do with statistics"
  • IEEE Trans. Inf. Theory: Shannon theory; now well-known

SLIDE 7

London Workshop: 1984 (D. Cox)

Cox visited Japan in 1983; a patron of information geometry

Rao, Efron, Dawid, Barndorff-Nielsen, Lauritzen, Kass, Eguchi and many others; Dodson, Critchley, Marriott, Komaki, Zhang, Ay, Pistone, Gibilisco, Nielsen, …

SLIDE 8

Information Geometry: a lucky naming

Application areas:

statistics; time series and systems; machine learning; signal processing; optimization; brain theory and consciousness; physics; economics; mathematics (Banach manifolds, affine differential geometry and beyond); quantum information; Tsallis entropy

SLIDE 9

International Conferences

IGAIA series; GSIS series, …

Many monographs

a new journal (Jun Zhang); where to publish; mailing list and society
still a small community; united and cooperative, blessed by all

SLIDE 10

My recent works

  • 1. Systems complexity and consciousness (IIT)
  • 2. Geometry of score matching (Hyvärinen score)
  • 3. Natural gradient descent and topology of deep learning
  • 4. Canonical divergence
  • 5. Multi-terminal statistical inference
  • 6. Information geometry and Wasserstein distance
SLIDE 11

Information Integration and Complexity of Systems: a stochastic approach

[Figure: network with inputs x₁, x₂ and outputs y₁, y₂]

p(x, y) = p(x) p(y | x)

x: state of the brain; y: next state of the brain

SLIDE 12

Integrated Information Theory (G. Tononi): Φ

Necessary condition; sufficient?

SLIDE 13

full model: S_F = { p(x, y) }
disconnected model: S_dis = { q(x, y) }, with q(y | x) = ∏ᵢ q(yᵢ | xᵢ)

measure of interaction: N. Ay
information integration: Tononi; Barrett and Seth
many others

[Figure: full network x₁, x₂ → y₁, y₂ and disconnected network with cross branches removed]

SLIDE 14

Measure of information integration, or system complexity

Information Geometry: N. Ay

SLIDE 15

Definition of Φ: Postulates

1)

2)

3) Disconnected model: Markov conditions

Φ = min_{q ∈ S_dis} D[p(x, y) : q(x, y)]

D_KL[p : q] = Σ_{x,y} p(x, y) log ( p(x, y) / q(x, y) )
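The minimization above can be made concrete on a toy system. The sketch below (my own illustration, not from the slides) uses a random joint distribution over four binary variables and a split-type disconnected family with q(x) = p(x) and q(y | x) = q(y₁ | x₁) q(y₂ | x₂); for this family the KL minimum is attained at q(yᵢ | xᵢ) = p(yᵢ | xᵢ), giving the closed form Φ = Σᵢ H(Yᵢ | Xᵢ) − H(Y | X).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint p(x1, x2, y1, y2) over binary variables, shape (2, 2, 2, 2).
p = rng.random((2, 2, 2, 2))
p /= p.sum()

def entropy(t):
    t = t[t > 0]
    return -np.sum(t * np.log(t))

# H(Y | X) = H(X, Y) - H(X)
p_x = p.sum(axis=(2, 3))
H_Y_given_X = entropy(p) - entropy(p_x)

# H(Y_i | X_i) from the pairwise marginals p(x_i, y_i)
p_x1y1 = p.sum(axis=(1, 3))   # marginal of (x1, y1)
p_x2y2 = p.sum(axis=(0, 2))   # marginal of (x2, y2)
H_Y1_given_X1 = entropy(p_x1y1) - entropy(p_x1y1.sum(axis=1))
H_Y2_given_X2 = entropy(p_x2y2) - entropy(p_x2y2.sum(axis=1))

# For q(x, y) = p(x) q(y1|x1) q(y2|x2), the KL minimum over q is
# attained at q(yi|xi) = p(yi|xi), so:
phi = H_Y1_given_X1 + H_Y2_given_X2 - H_Y_given_X
print(phi)
```

Since Φ is the minimum of a KL divergence, it is always nonnegative, which the computation above respects.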

SLIDE 16

Markov Condition

branch (1 → 2) deleted: Markov condition:

p(x₁, y₂ | x₂) = p(x₁ | x₂) p(y₂ | x₂),  i.e.  x₁ ⊥ y₂ | x₂

S_dis: all branches (xᵢ → yⱼ, i ≠ j) deleted:

X₁ ⊥ Y₂ | X₂,  X₂ ⊥ Y₁ | X₁

[Figure: network x₁, x₂ → y₁, y₂ with cross branches cut]

SLIDE 17

Why KL-divergence?

1) D[p : q] ≥ 0, and = 0 when and only when p = q
2) D[p : q] = Σᵢ d[p(xᵢ), q(xᵢ)] (decomposability)
3)
4) induces a dually flat structure

SLIDE 18

Φ_geo = min_{q ∈ S_dis} D_KL[p(x, y) : q(x, y)]

Geometric degree of information integration

SLIDE 19

Gaussian case

full model: y = Ax + ε,  Σ = E[εεᵀ]
disconnected model: y = A′x + ε′,  Σ′ = E[ε′ε′ᵀ],  A′: diagonal

Φ_geo = (1/2) log ( |Σ′| / |Σ| )

[Figure: linear channel y = Ax + ε over nodes x₁, x₂, y₁, y₂]
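A minimal numerical sketch of the Gaussian case, assuming the slide's log-determinant formula: the full residual covariance comes from an unrestricted regression of y on x, while the disconnected one regresses each yᵢ on xᵢ alone (a per-coordinate least-squares simplification of the exact constrained fit).

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a linear channel y = A x + eps with a non-diagonal A.
n = 100_000
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = rng.normal(size=(n, 2))
eps = 0.5 * rng.normal(size=(n, 2))
y = x @ A.T + eps

# Full model: unrestricted least-squares regression of y on x.
A_full, *_ = np.linalg.lstsq(x, y, rcond=None)
Sigma = np.cov((y - x @ A_full).T)

# Disconnected model: each y_i regressed on x_i only (A' diagonal).
resid = np.empty_like(y)
for i in range(2):
    a_i = np.dot(x[:, i], y[:, i]) / np.dot(x[:, i], x[:, i])
    resid[:, i] = y[:, i] - a_i * x[:, i]
Sigma_dis = np.cov(resid.T)

# Phi_geo = (1/2) log(|Sigma'| / |Sigma|), the slide's formula.
phi_geo = 0.5 * np.log(np.linalg.det(Sigma_dis) / np.linalg.det(Sigma))
print(phi_geo)
```

With the strong cross terms 0.6 and 0.4 in A, the diagonal fit loses predictive power, so Φ_geo comes out clearly positive; it shrinks to zero when A is diagonal.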


SLIDE 21

Many other definitions of Φ

Full model vs. disconnected models

SLIDE 22

Full model S_F: graphical model

p(x, y) = exp{ Σᵢ θᵢ^X xᵢ + Σᵢ θᵢ^Y yᵢ + Σᵢⱼ θᵢⱼ^{XY} xᵢ yⱼ + θ₁₂ x₁x₂ + θ′₁₂ y₁y₂ + (higher-order terms) − ψ }

η-coordinates: ηᵢ^X = E[xᵢ],  ηᵢⱼ^{XY} = E[xᵢ yⱼ], …

an exponential family: θ-coordinates and η-coordinates

There are many disconnected models!!

SLIDE 23

Split Model S_H: Ay; Barrett & Seth

S_H = { q(x, y) : q(y | x) = ∏ᵢ q(yᵢ | xᵢ) }

Φ_H = min_{q ∈ S_H} D_KL[p : q] = D_KL[p : q̂]

each yᵢ depends only on xᵢ

[Figure: split network over x₁, x₂, y₁, y₂; p projected to q̂ ∈ S_H]

SLIDE 24

Mixed Coordinates

ξ = ( η-part ; θ-part ) = ( ηᵢ^X, ηᵢ^Y, … ; θ₁₂^{XY}, θ₂₁^{XY}, θ₁₂^Y )

S_H:  θ̂₁₂^{XY} = θ̂₂₁^{XY} = θ̂₁₂^Y = 0

Markovian condition:

Y₁ ⊥ X₂ | X₁,  Y₂ ⊥ X₁ | X₂

problem: when p(x, y) = p_X(x) p_Y(y) (x and y independent), I(X; Y) = 0 but Φ can be > 0; a good Φ should satisfy Φ(p) ≤ I(X; Y)

SLIDE 25

Split Model S_Gr

q(x, y) = q(x) ∏ᵢ q(yᵢ | xᵢ);  q̂(x) = p(x),  q̂(yᵢ | xᵢ) = p(yᵢ | xᵢ)

graphical model

SLIDE 26

Problem: Gaussian channel

p(x, y):  y = Ax + ε

q̂ ∈ S_Gr:  y = Âx + ε̂,  but Â is not diagonal even with θ̂₁₂^{XY} = θ̂₂₁^{XY} = 0

p(x, y) ∝ exp{ −(1/2) xᵀ Σ_X⁻¹ x − (1/2) (y − Ax)ᵀ Σ⁻¹ (y − Ax) }

SLIDE 27

Mismatched Decoding Model S_M

Best mismatched decoding from y to x, using only the split conditionals q(yᵢ | xᵢ)

Φ* = min_{q ∈ S_M} D_KL[p : q] = D_KL[p : S_M]

[Figure: network nodes x₁, x₂, y₁, y₂; decoder ŷ → x]

SLIDE 28

S_H, S_Gr: dually flat;  S_M: not flat

the corresponding measures Φ_H, Φ_Gr, Φ_M, Φ_geo are ordered according to the inclusions of the models

SLIDE 29

Transfer Entropy; Granger causality

TE[xᵢ → yⱼ] = min_q D_KL[p : q]  (branch i → j disconnected)
            = H[Yⱼ | X∖Xᵢ] − H[Yⱼ | X]

Non-additive:

TE[xᵢ, xⱼ → yₖ, yₘ] ≠ TE[xᵢ → yₖ] + TE[xⱼ → yₘ]

[Figure: network nodes x₁, x₂, y₁, y₂]
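The conditional-entropy form of the transfer entropy is easy to check on a toy joint distribution. A minimal sketch (my own illustration), for TE[x₁ → y₂] = H(Y₂ | X₂) − H(Y₂ | X₁, X₂):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy joint p(x1, x2, y2) over binary variables.
p = rng.random((2, 2, 2))
p /= p.sum()

def entropy(t):
    t = t[t > 0]
    return -np.sum(t * np.log(t))

def cond_entropy(joint, cond_axes):
    """H(remaining vars | vars on cond_axes) = H(joint) - H(conditioning marginal)."""
    marg_axes = tuple(i for i in range(joint.ndim) if i not in cond_axes)
    marg = joint.sum(axis=marg_axes)
    return entropy(joint) - entropy(marg)

# TE[x1 -> y2] = H(Y2 | X2) - H(Y2 | X1, X2)
p_x2y2 = p.sum(axis=0)                  # marginal of (x2, y2)
te = cond_entropy(p_x2y2, (0,)) - cond_entropy(p, (0, 1))
print(te)
```

Because extra conditioning never increases entropy, the value is always nonnegative, consistent with its KL-minimization definition.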

SLIDE 30

Hierarchy: transfer entropy

cutting branches → split models

partitions X = ∪ᵢ Xᵢ (Xᵢ ∩ Xⱼ = ∅),  Y = ∪ᵢ Yᵢ (Yᵢ ∩ Yⱼ = ∅)

partition of X and Y

subadditivity

SLIDE 31

Information Geometry of the Hyvärinen Game Score

Following Grünwald, Dawid, Parry, Lauritzen, Hyvärinen

score S(x, q); loss L(p, q) = E_p[ S(x, q) ]
log score: S(x, q) = −log q(x)

  • cross entropy: H[p, q] = E_p[ S(x, q) ]
  • entropy: H[p] = H[p, p]
  • divergence: D[p : q] = H[p, q] − H[p, p]

SLIDE 32

Hyvärinen score

l(x, q) = log q(x)

S(x, q) = Δ l(x) + (1/2) ‖∇ l(x)‖²

D_S[p : q] = (1/2) E_p[ ‖∇ log p(x) − ∇ log q(x)‖² ]

D_S[p : cq] = D_S[p : q]  (no normalization constant needed)

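The Hyvärinen score can be minimized empirically without knowing the normalization of q. A minimal sketch (my own example, not from the slides) fits a 1-D Gaussian N(μ, σ²) by minimizing the empirical objective E[ l″(x) + ½ l′(x)² ]; for this model the minimizer coincides with the sample mean and variance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
data = rng.normal(loc=2.0, scale=1.5, size=20_000)

def score_matching_objective(theta, x):
    """Empirical Hyvarinen objective  mean[ l''(x) + (1/2) l'(x)^2 ]
    for the model q = N(mu, sigma^2), where l(x) = log q(x)."""
    mu, log_sigma = theta
    s2 = np.exp(2.0 * log_sigma)
    l1 = -(x - mu) / s2          # l'(x)
    l2 = -1.0 / s2               # l''(x), constant in x
    return np.mean(l2 + 0.5 * l1 ** 2)

res = minimize(score_matching_objective, x0=np.array([0.0, 0.0]),
               args=(data,), method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)
```

Setting the gradient of the objective to zero gives μ̂ = mean(x) and σ̂² = mean((x − μ̂)²), so the numerical minimizer should land on the sample statistics, illustrating the Fisher efficiency claimed later for the Gaussian case.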
SLIDE 33

parametric case

M = { q(x, θ) }

θ̂ = argmin_θ D_S[p : q(·, θ)]

s(x, θ) = ∇_θ S(x, θ): estimating function

(1/N) Σᵢ s(xᵢ, θ) = 0: estimating equation

Information geometry of D_S[p : q]

SLIDE 34

Asymptotic analysis of the estimator

E[ (θ̂ − θ)(θ̂ − θ)ᵀ ] = (1/N) K⁻¹ V K⁻ᵀ ≥ (1/N) G⁻¹

K = E[ ∇_θ s(x, θ) ],  V = E[ s(x, θ) s(x, θ)ᵀ ],  G(θ): Fisher information matrix

SLIDE 35

General estimating function a(x, θ):

asymptotic covariance of θ̂ determined by A = E[ ∇_θ a(x, θ) ] and E[ a aᵀ ]

the estimator is efficient when a(x, θ) = c ∇_θ log q(x, θ)


SLIDE 37

Hyvärinen estimator

Fisher efficient when q is multivariate Gaussian

SLIDE 38

Discrete case: graph of nodes x

replace the derivative by the graph difference over neighbors:

(∇f)(x) = f(x̃) − f(x),  x̃ ∈ N(x)  (N(x): neighbors of x)

SLIDE 39

discrete Hyvärinen divergence D_S[p : q]: expectation of the squared differences of the scores f(x̃) − f(x) and h(x̃) − h(x) of p and q over neighbors x̃ ∈ N(x)

SLIDE 40

Deep Learning

Self-Organization + Supervised Learning

RBM (Restricted Boltzmann Machine), auto-encoder, recurrent nets, dropout, contrastive divergence, convolution

SLIDE 41

Mathematical Neurons

y = φ( Σᵢ wᵢ xᵢ − h ) = φ( w·x − h )

φ(u): sigmoidal activation function

[Figure: neuron x → y and graph of φ(u)]

SLIDE 42

Multilayer Perceptrons

y = Σᵢ vᵢ φ( wᵢ·x ) = f(x, θ)

x = (x₁, x₂, …, xₙ),  θ = (w₁, …, w_m; v₁, …, v_m)

[Figure: two-layer network x → hidden units → y]

SLIDE 43

Multilayer Perceptron

y = f(x, θ) = Σᵢ vᵢ φ( wᵢ·x ),  θ = (v₁, …, v_m; w₁, …, w_m)

neuromanifold: the set of functions f(·, θ) inside the space of functions S

SLIDE 44

Backpropagation: stochastic gradient learning

examples (x₁, y₁), …, (x_t, y_t): training set

l(y, x; θ) = (1/2) { y − f(x, θ) }² = −log p(y | x; θ)

θ_{t+1} = θ_t − η ∇_θ l(y_t, x_t; θ_t)

SLIDE 45

singularities

SLIDE 46

Geometry of the singular model

y = Σᵢ vᵢ φ( wᵢ·x ) + n

singular where v = 0 or |w| = 0

SLIDE 47

model: 2 hidden neurons

y = f(x; w₁, w₂, v₁, v₂) + ε = v₁ φ(w₁·x) + v₂ φ(w₂·x) + ε

φ(u) = √(2/π) ∫₀ᵘ e^{−t²/2} dt

SLIDE 48

stochastic loss function:

l(y, x; θ) = (1/2) { y − f(x, θ) }²,  y: teacher signal

backprop: vanilla gradient descent learning

θ_{t+1} = θ_t − η ∇_θ l

SLIDE 49

Natural Gradient Stochastic Descent

θ_{t+1} = θ_t − η G⁻¹(θ_t) ∇_θ l(y_t, x_t; θ_t)

G: Fisher information matrix; invariant; steepest descent in the Riemannian sense

SLIDE 50

Natural gradient is superior

Steepest descent; invariant (Yann Ollivier)

Fisher-efficient; the natural gradient is non-vanishing even in multiple layers; good at singular regions (avoids plateaus: Milnor attractor)
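A minimal sketch of the natural-gradient update θ ← θ − η G⁻¹∇l on a model where the Fisher matrix is known in closed form: fitting the mean and standard deviation of N(μ, σ²), whose Fisher information in (μ, σ) coordinates is diag(1/σ², 2/σ²). This is my own batch-mode illustration, not code from the talk.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(loc=1.0, scale=2.0, size=50_000)

def grad_nll(mu, sigma, x):
    """Gradient of the mean negative log-likelihood of N(mu, sigma^2)."""
    g_mu = (mu - np.mean(x)) / sigma ** 2
    g_sigma = 1.0 / sigma - np.mean((x - mu) ** 2) / sigma ** 3
    return np.array([g_mu, g_sigma])

def fisher(sigma):
    """Fisher information of N(mu, sigma^2) in (mu, sigma) coordinates."""
    return np.diag([1.0 / sigma ** 2, 2.0 / sigma ** 2])

mu, sigma = 0.0, 1.0
eta = 0.5
for _ in range(200):
    g = grad_nll(mu, sigma, data)
    nat = np.linalg.solve(fisher(sigma), g)   # G^{-1} grad: natural gradient
    mu -= eta * nat[0]
    sigma -= eta * nat[1]
print(mu, sigma)
```

Preconditioning by G⁻¹ makes the effective step size independent of the current σ, which is the invariance property claimed on the slide; the iterates converge to the sample mean and standard deviation.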

SLIDE 51

Adaptive Natural Gradient

SLIDE 52

Singular Region in Parameter Space

R(J) = { θ : w₁ = w₂ = J } ∪ { θ : v₁ = 0 } ∪ { θ : v₂ = 0 }

on R the network computes a single-neuron function  f(x) = v φ(J·x),  v = v₁ + v₂
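The degeneracy of the singular region is easy to verify numerically: every split of the output weight with w₁ = w₂ = J, and every "dead unit" setting v₂ = 0, computes the same single-neuron function. A small check (my own illustration, using tanh as the activation):

```python
import numpy as np

rng = np.random.default_rng(6)

def f(x, v1, v2, w1, w2):
    # Two-hidden-unit network output for a batch of inputs x.
    return v1 * np.tanh(x @ w1) + v2 * np.tanh(x @ w2)

x = rng.normal(size=(1000, 3))
J = rng.normal(size=3)
v = 1.7

# Two different splits of v on the branch w1 = w2 = J ...
out_a = f(x, 0.3, v - 0.3, J, J)
out_b = f(x, 1.2, v - 1.2, J, J)
# ... and the dead-unit realization v2 = 0 with w2 arbitrary.
out_c = f(x, v, 0.0, J, rng.normal(size=3))

print(np.max(np.abs(out_a - out_b)), np.max(np.abs(out_a - out_c)))
```

All three parameter points realize f(x) = v tanh(J·x), so the map from parameters to functions collapses a whole region to one point: the source of the singular geometry.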

SLIDE 53

Coordinate transformation

new coordinates:

J = (v₁ w₁ + v₂ w₂) / (v₁ + v₂),  u = w₂ − w₁,  v = v₁ + v₂,  z = (v₂ − v₁) / (v₁ + v₂)

SLIDE 54

Singular Region

R = { (J, u, z) : u = 0 } ∪ { z = ±1 }

SLIDE 55

Milnor attractor

SLIDE 56
  • Fig. 2: trajectories
SLIDE 57

Saddle and plateau

SLIDE 58

Topology of the singular region R

blow-down coordinates:  c = z u,  u = |u| e,  ‖e‖ = 1,  e ∈ S^{n−1}

SLIDE 59

Singular Region

R = { (J, u, z) : u = 0 } ∪ { z = ±1 }

SLIDE 60
SLIDE 61

Sphere Sⁿ and projective space Pⁿ

SLIDE 62

natural gradient learning near the singularity

under the natural gradient, the distance from R changes at rate O(1), so the dynamics leave the singular region quickly whether the true model lies on or off R; vanilla gradient descent is instead trapped near the Milnor attractor

SLIDE 63

Canonical Divergence in a Manifold of Dual Affine Connections

Nihat Ay and S. Amari

SLIDE 64

Divergence and metric

D[ξ : ξ + dξ] = (1/2) g_{ij} dξⁱ dξʲ + O(|dξ|³)

G = (g_{ij}): Riemannian metric, positive-definite

SLIDE 65

Divergence and dual affine connections

Γ_{ijk} = −∂ᵢ∂ⱼ∂ₖ′ D[ξ : ξ′] |_{ξ′=ξ},  Γ*_{ijk} = −∂ₖ∂ᵢ′∂ⱼ′ D[ξ : ξ′] |_{ξ′=ξ}

SLIDE 66

Dual geometry

X g(Y, Z) = g(∇_X Y, Z) + g(Y, ∇*_X Z)

Γ⁰_{ijk} = (Γ_{ijk} + Γ*_{ijk}) / 2: Levi-Civita connection

T_{ijk} = Γ*_{ijk} − Γ_{ijk}

SLIDE 67

Dual geometry: canonical divergence

M dually flat: affine coordinates θ (∇-flat) and η (∇*-flat), potentials ψ(θ), φ(η)

D[p : q] = ψ(θ_p) + φ(η_q) − θ_p · η_q

Bregman divergence
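The canonical divergence can be checked concretely on the simplest exponential family. The sketch below (my own example) uses the Bernoulli family, where θ = log(p/(1−p)), ψ(θ) = log(1 + e^θ), η = p and φ(η) is the negative entropy; with the indexing on the slide, the canonical divergence equals the KL divergence with its arguments reversed.

```python
import numpy as np

# Bernoulli as an exponential family:
#   theta = log(p / (1 - p)),  psi(theta) = log(1 + e^theta)
#   eta = p,                   phi(eta) = eta log eta + (1 - eta) log(1 - eta)
def psi(theta):
    return np.log1p(np.exp(theta))

def phi(eta):
    return eta * np.log(eta) + (1 - eta) * np.log(1 - eta)

def canonical_divergence(p, q):
    """D[p : q] = psi(theta_p) + phi(eta_q) - theta_p * eta_q  (slide formula)."""
    theta_p = np.log(p / (1 - p))
    return psi(theta_p) + phi(q) - theta_p * q

def kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

p, q = 0.3, 0.7
print(canonical_divergence(p, q), kl(q, p))
```

Both printed numbers agree, illustrating that the Bregman construction from the dual potentials reproduces the KL divergence on a dually flat family.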

SLIDE 68

Exponential map and geodesic

X_q(p): inverse exponential map, the tangent vector at q of the ∇-geodesic γ(t) reaching p at t = 1

(for the exponential connection, X_q(p) corresponds to log p − log q)

SLIDE 69

Exponential map divergence

D[p : q] = (1/2) ‖ X_q(p) ‖²  (∇-divergence)

D*[p : q] = (1/2) ‖ X*_q(p) ‖²  (∇*-geodesic)

SLIDE 70

Theorem 1. The exponential map divergence induces the metric g together with the 1/3-mixture of the dual connections.

Standard divergence:  D_stan[p : q] = (1/2) ‖ X^{1/3}_q(p) ‖²

Theorem 2. A weighted exponential map divergence recovers the original geometry (g, ∇, ∇*).

SLIDE 71

D[p : q] = ∫₀¹ ⟨ X_t, ξ̇(t) ⟩ dt = ∫₀¹ t ‖ ξ̇(t) ‖² dt;  weighted version: ∫₀¹ w(t) ‖ ξ̇(t) ‖² dt

along the geodesic ξ(t) from q to p

SLIDE 72

Divergence and projection

projection theorem:

p̂ = argmin_{q ∈ S} D[p : q]

grad_q D[p : q] = c X_q(p)

[Figure: projection of p onto the submanifold S at p̂]

SLIDE 73

IEEE ISIT 2011, Sankt Petersburg

Data Compression in Multiterminal Statistical Inference
Shun-ichi Amari, RIKEN Brain Science Institute

SLIDE 74

A long-standing problem

T. Berger; Csiszár, Ahlswede, Burnashev, Han, Amari

correlated sources X, Y: data compression and statistical inference

X: x₁ x₂ … xₙ → k_X bits
Y: y₁ y₂ … yₙ → k_Y bits
→ estimate q̂

p(x, y; q), iid

SLIDE 75

binary case: x, y ∈ {0, 1};  Prob{x = 1} = Prob{y = 1} = 1/2

Prob{x ≠ y} = q;  p(x, y; q), iid

X: x₁ x₂ … xₙ → k_X bits;  Y: y₁ y₂ … yₙ → k_Y bits;  estimate q̂

SLIDE 76

Encoding: data compression

x ∈ Xⁿ: 2ⁿ messages → codeword c(x) with 2^k values;  n bits → k bits

SLIDE 77

One-bit helper case

k_X = 1,  k_Y = n

c(x) = sgn(a·x)

X: x₁ x₂ … xₙ → c(x): 1 bit;  Y: y₁ y₂ … yₙ: n bits → estimate q̂

SLIDE 78

Is single-bit encoding optimal?

It is optimal when q = 1/2 (x, y independent), but not for general q.

SLIDE 79

Fisher information:  k_X = 1,  k_Y = n

SLIDE 80

Kingo Kobayashi: parity encoding

x₁ ⊕ x₂ ⊕ … ⊕ x_s

in progress!
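Why parities of blocks are informative can be seen in a short simulation (my own sketch, not from the talk): for the binary source above, xᵢ ⊕ yᵢ are iid Bernoulli(q), so the parity mismatch rate over blocks of s bits is (1 − (1 − 2q)^s)/2, and q can be recovered from the parities alone.

```python
import numpy as np

rng = np.random.default_rng(7)

n, s, q = 400_000, 4, 0.1

# Correlated binary sources: y = x XOR noise, noise ~ Bernoulli(q).
x = rng.integers(0, 2, size=n)
y = x ^ (rng.random(n) < q)

# Each terminal sends only the parity of each block of s bits.
px = np.bitwise_xor.reduce(x.reshape(-1, s), axis=1)
py = np.bitwise_xor.reduce(y.reshape(-1, s), axis=1)

# P(px != py) = (1 - (1 - 2q)^s) / 2, so invert to estimate q.
m = np.mean(px != py)
q_hat = (1.0 - (1.0 - 2.0 * m) ** (1.0 / s)) / 2.0
print(q_hat)
```

Each terminal transmits only n/s bits, yet the estimate tracks the true q; how close such schemes come to the full-data Fisher information is the open question the slide refers to.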

SLIDE 81

Information Geometry and the Transportation Problem (Wasserstein distance)

entropic relaxation:  min_P ⟨C, P⟩ − λ H(P), and its dual

New papers:

  • S. Pal and T.-K. L. Wong, "Exponentially concave functions and a new information geometry"
  • portfolio theory, transportation problem and information geometry (dually projectively flat)
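The entropic relaxation above is exactly what Sinkhorn's alternating-scaling algorithm solves; a minimal sketch (my own illustration, with arbitrary marginals and cost):

```python
import numpy as np

rng = np.random.default_rng(8)

# Entropic relaxation of the transportation problem:
#   min_P <C, P> - lam * H(P)   s.t.  P 1 = a,  P^T 1 = b
# The optimum has the form P = diag(u) K diag(v) with K = exp(-C / lam),
# and u, v are found by alternating marginal scalings (Sinkhorn iterations).
n = 5
a = np.full(n, 1.0 / n)
b = rng.random(n); b /= b.sum()
C = rng.random((n, n))
lam = 0.05

K = np.exp(-C / lam)
u = np.ones(n)
for _ in range(2000):
    v = b / (K.T @ u)
    u = a / (K @ v)
P = u[:, None] * K * v[None, :]
print(P.sum(axis=1), P.sum(axis=0))
```

Both marginals of the resulting plan P match a and b to high precision; as λ → 0 the plan approaches an optimal transport plan, which is where the contact with Wasserstein geometry arises.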