Information Geometry and Its Applications to Machine Learning (PowerPoint PPT Presentation)
slide-1
SLIDE 1

Machine Learning SS: Kyoto U.

Information Geometry

and Its Applications to Machine Learning

Shun-ichi Amari

RIKEN Brain Science Institute

slide-2
SLIDE 2

Information Geometry

  • Manifolds of Probability Distributions

$M = \{p(x)\}$

slide-3
SLIDE 3

Information Geometry

Information sciences: Systems Theory, Information Theory, Statistics, Neural Networks, Combinatorics, Physics, AI, Vision, Optimization

Mathematics: Riemannian Manifolds, Dual Affine Connections, Manifolds of Probability Distributions

slide-4
SLIDE 4

Information Geometry?

Gaussian distributions:

$S = \{p(x;\mu,\sigma)\}$, $p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$

Parameters $\boldsymbol\theta = (\mu,\sigma)$, so $S = \{p(x;\boldsymbol\theta)\}$ is a two-dimensional manifold.

slide-5
SLIDE 5

Manifold of Probability Distributions

Discrete case, $x = 1, 2, 3$:

$S = \{p(x)\}$, $\mathbf p = (p_1, p_2, p_3)$, $p_1 + p_2 + p_3 = 1$  (the probability simplex)

General parametric model: $M = \{p(x;\boldsymbol\theta)\}$

slide-6
SLIDE 6

Invariance

$S = \{p(x,\boldsymbol\theta)\}$: the geometry should be invariant under a change of representation $y = y(x)$ when $y$ is a sufficient statistic, i.e. the same for $p(y,\boldsymbol\theta)$.

Naive distances do not have this property:

$\int \big|p(x,\theta_1) - p(x,\theta_2)\big|^2\,dx \ne \int \big|p(y,\theta_1) - p(y,\theta_2)\big|^2\,dy$

slide-7
SLIDE 7

Two Geometrical Structures

1. Riemannian metric; 2. affine connection (geodesics)

$ds^2 = \sum g_{ij}(\boldsymbol\theta)\,d\theta_i\,d\theta_j$

Fisher information:

$g_{ij} = E\!\left[\frac{\partial \log p}{\partial\theta_i}\,\frac{\partial \log p}{\partial\theta_j}\right]$

Orthogonality via the inner product: $\langle d\boldsymbol\theta_1, d\boldsymbol\theta_2\rangle = d\boldsymbol\theta_1^T G\,d\boldsymbol\theta_2$
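The Fisher metric can be checked numerically by Monte-Carlo: sample from the distribution, form the score vector, and average its outer product. A minimal sketch (the sample size and the choice $\mu = 0$, $\sigma = 2$ are illustrative):

```python
import numpy as np

# Monte-Carlo check of the Fisher metric for the Gaussian family p(x; mu, sigma).
# Theory gives g = diag(1/sigma^2, 2/sigma^2) in (mu, sigma) coordinates.
rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# score components: d log p / d mu and d log p / d sigma
s_mu = (x - mu) / sigma**2
s_sigma = (x - mu) ** 2 / sigma**3 - 1.0 / sigma

S = np.stack([s_mu, s_sigma])     # 2 x N samples of the score
G = S @ S.T / x.size              # g_ij = E[(d_i log p)(d_j log p)]
# G is close to [[0.25, 0], [0, 0.5]] for sigma = 2
```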

slide-8
SLIDE 8

Affine Connection

Covariant derivative $\nabla_X Y$; parallel transport $\Pi$.

Geodesic $\boldsymbol\theta(t)$: $\nabla_{\dot{\boldsymbol\theta}}\dot{\boldsymbol\theta} = 0$; arc length $s = \int\sqrt{\sum g_{ij}\,d\theta_i\,d\theta_j}$.

A geodesic is a "straight line": locally of minimal distance.

slide-9
SLIDE 9

Duality: two affine connections $(\Pi, \Pi^*)$

$\langle X, Y\rangle = \sum g_{ij}X^i Y^j$; duality means $\langle \Pi X, \Pi^* Y\rangle = \langle X, Y\rangle$.

Riemannian (Levi-Civita) geometry is the self-dual case $\Pi = \Pi^*$.

Structure: $\{S, g, \nabla, \nabla^*\}$

slide-10
SLIDE 10

Dual Affine Connections $(\nabla, \nabla^*)$

e-geodesic (exponential): $\log r(x, t) = t\log p(x) + (1 - t)\log q(x) + c(t)$

m-geodesic (mixture): $r(x, t) = t\,p(x) + (1 - t)\,q(x)$

These are the geodesics of the dual pair: $\nabla_{\dot x}\dot x = 0$ and $\nabla^*_{\dot x}\dot x = 0$ respectively.
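The two geodesics are easy to compare on discrete distributions: the m-geodesic mixes probabilities linearly, while the e-geodesic mixes log-probabilities and renormalizes. A small sketch (the two endpoint distributions are illustrative):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

def m_geodesic(p, q, t):
    """Mixture geodesic: linear interpolation of probabilities."""
    return t * p + (1 - t) * q

def e_geodesic(p, q, t):
    """Exponential geodesic: linear interpolation of log-probabilities."""
    r = p**t * q**(1 - t)          # exp(t log p + (1 - t) log q)
    return r / r.sum()             # normalisation supplies c(t)
```

Both curves share the endpoints but pass through different midpoints, which is the whole point of having two connections.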

slide-11
SLIDE 11

Mathematical structure of $S = \{p(x, \boldsymbol\xi)\}$

$l = \log p(x;\boldsymbol\xi)$, $\partial_i = \partial/\partial\xi_i$

$g_{ij}(\boldsymbol\xi) = E[\partial_i l\,\partial_j l]$,  $T_{ijk}(\boldsymbol\xi) = E[\partial_i l\,\partial_j l\,\partial_k l]$

  • α-connection: $\Gamma^{(\alpha)}_{ijk} = \Gamma^{(0)}_{ijk} - \frac{\alpha}{2}T_{ijk}$

$\nabla^{(\alpha)} \leftrightarrow \nabla^{(-\alpha)}$: dually coupled,

$X\langle Y, Z\rangle = \langle\nabla^{(\alpha)}_X Y, Z\rangle + \langle Y, \nabla^{(-\alpha)}_X Z\rangle$

Structure: $\{M, g, T\}$

slide-12
SLIDE 12

Divergence: $D[\mathbf z : \mathbf y]$

$D[\mathbf z : \mathbf y] \ge 0$, with $D[\mathbf z : \mathbf y] = 0$ iff $\mathbf z = \mathbf y$

$D[\mathbf z : \mathbf z + d\mathbf z] = \frac12\sum g_{ij}\,dz_i\,dz_j$, with $(g_{ij})$ positive-definite

slide-13
SLIDE 13

Kullback-Leibler Divergence (a quasi-distance)

$D[p(x):q(x)] = \int p(x)\log\frac{p(x)}{q(x)}\,dx$  (a sum over $x$ in the discrete case)

$D[p:q] \ge 0$, with equality iff $p(x) = q(x)$; asymmetric: $D[p:q] \ne D[q:p]$
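A minimal numeric check of these properties on discrete distributions (the example vectors are arbitrary):

```python
import numpy as np

def kl(p, q):
    """D[p:q] = sum p log(p/q), for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
# kl(p, q) > 0, kl(p, p) == 0, and kl(p, q) != kl(q, p): a quasi-distance
```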

slide-14
SLIDE 14

f-divergence on the space of positive measures $\tilde S = \{\tilde{\mathbf p}\}$, $\tilde p_i > 0$ ($\sum\tilde p_i = 1$ need not hold):

$D_f[\tilde{\mathbf p} : \tilde{\mathbf q}] = \sum_i \tilde p_i\,f\!\left(\frac{\tilde q_i}{\tilde p_i}\right) \ge 0$

$D_f[\tilde{\mathbf p} : \tilde{\mathbf q}] = 0 \Leftrightarrow \tilde{\mathbf p} = \tilde{\mathbf q}$

On $\tilde S$ the divergence is not invariant under $\tilde f(u) = f(u) - c(u - 1)$ (the correction term no longer vanishes when $\sum\tilde p_i \ne 1$).

slide-15
SLIDE 15

α-divergence

$D_\alpha[\tilde{\mathbf p} : \tilde{\mathbf q}] = \frac{4}{1-\alpha^2}\sum_i\left\{\frac{1-\alpha}{2}\,\tilde p_i + \frac{1+\alpha}{2}\,\tilde q_i - \tilde p_i^{\frac{1-\alpha}{2}}\,\tilde q_i^{\frac{1+\alpha}{2}}\right\}$

In the limit it gives the KL divergence extended to positive measures:

$D[\tilde{\mathbf p} : \tilde{\mathbf q}] = \sum_i\left\{\tilde p_i\log\frac{\tilde p_i}{\tilde q_i} + \tilde q_i - \tilde p_i\right\}$

slide-16
SLIDE 16

(α, β)-divergence

$D_{\alpha,\beta}[\tilde{\mathbf p} : \tilde{\mathbf q}] = -\frac{1}{\alpha\beta}\sum_i\left(\tilde p_i^\alpha\,\tilde q_i^\beta - \frac{\alpha}{\alpha+\beta}\,\tilde p_i^{\alpha+\beta} - \frac{\beta}{\alpha+\beta}\,\tilde q_i^{\alpha+\beta}\right)$

For $\alpha + \beta = 1$ (i.e. $\beta = 1 - \alpha$) this reduces to the α-divergence.

slide-17
SLIDE 17

Metric and Connections Induced by a Divergence (Eguchi)

$\partial_i = \partial/\partial z_i$, $\partial'_i = \partial/\partial y_i$

$g_{ij}(\mathbf z) = -\partial_i\partial'_j\,D[\mathbf z : \mathbf y]\big|_{\mathbf y = \mathbf z}$, so $D[\mathbf z : \mathbf z + d\mathbf z] = \frac12\sum g_{ij}\,dz_i\,dz_j$

$\Gamma_{ijk} = -\partial_i\partial_j\partial'_k\,D\big|_{\mathbf y = \mathbf z}$,  $\Gamma^*_{ijk} = -\partial'_i\partial'_j\partial_k\,D\big|_{\mathbf y = \mathbf z}$

This yields a Riemannian metric and a dual pair of affine connections $\{\nabla, \nabla^*\}$.

slide-18
SLIDE 18

Duality: $\{M, g, T\}$

$\partial_k g_{ij} = \Gamma_{kij} + \Gamma^*_{kji}$,  $\Gamma^*_{ijk} = \Gamma_{ijk} - T_{ijk}$

$X\langle Y, Z\rangle = \langle\nabla_X Y, Z\rangle + \langle Y, \nabla^*_X Z\rangle$

slide-19
SLIDE 19

Dually flat manifolds

Examples: exponential families; mixture families; $\{p(x) : x\ \text{discrete}\}$

  • θ coordinates: affine coordinates, e-flat, e-geodesics
  • η coordinates: dual affine coordinates, m-flat, m-geodesics

Canonical divergence $D(P : P')$ in the θ-η coordinate pair; for probability distributions it is the KL divergence. Dually flat does not mean Riemannian-flat.

Exponential family: $p(x, \boldsymbol\theta) = \exp\big\{\sum\theta_i x_i - \psi(\boldsymbol\theta)\big\}$

slide-20
SLIDE 20

Dually flat manifold: potential functions $\psi(\boldsymbol\theta)$, $\varphi(\boldsymbol\eta)$

Exponential family: $p(x, \boldsymbol\theta) = \exp\big\{\sum\theta_i x_i - \psi(\boldsymbol\theta)\big\}$

$\psi(\boldsymbol\theta)$: cumulant generating function; $\varphi(\boldsymbol\eta)$: negative entropy

Legendre transformation:

$\eta_i = \partial_i\psi(\boldsymbol\theta)$, $\theta_i = \partial^i\varphi(\boldsymbol\eta)$, $\psi(\boldsymbol\theta) + \varphi(\boldsymbol\eta) - \sum\theta_i\eta_i = 0$

$g_{ij} = \partial_i\partial_j\psi(\boldsymbol\theta)$, $g^{ij} = \partial^i\partial^j\varphi(\boldsymbol\eta)$

Canonical divergence: $D(P : P') = \psi(\boldsymbol\theta) + \varphi(\boldsymbol\eta') - \sum\theta_i\eta'_i$
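The Legendre relations can be verified concretely on the Bernoulli family, the simplest exponential family. A sketch (the helper names psi, phi, eta are ours, not from the slides):

```python
import numpy as np

# Bernoulli as a 1-parameter exponential family: p(x) = exp(theta*x - psi(theta)), x in {0, 1}.
psi = lambda th: np.log1p(np.exp(th))                    # cumulant generating function
eta = lambda th: 1.0 / (1.0 + np.exp(-th))               # eta = psi'(theta), the mean
phi = lambda e: e * np.log(e) + (1 - e) * np.log(1 - e)  # negative entropy

th = 0.8
e = eta(th)
# Legendre identity: psi(theta) + phi(eta) - theta * eta == 0
gap = psi(th) + phi(e) - th * e
```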

slide-21
SLIDE 21

Manifold with a Convex Function

$S$: coordinates $\boldsymbol\theta = (\theta_1, \ldots, \theta_n)$, equipped with a convex function $\psi(\boldsymbol\theta)$

Examples: the negative entropy $\varphi(p) = \int p(x)\log p(x)\,dx$; the quadratic $\psi(\boldsymbol\theta) = \frac12\sum(\theta_i)^2$

slide-22
SLIDE 22

Riemannian metric and flatness (affine structure)

Bregman divergence:

$D(\boldsymbol\theta, \boldsymbol\theta') = \psi(\boldsymbol\theta) - \psi(\boldsymbol\theta') - \mathrm{grad}\,\psi(\boldsymbol\theta')\cdot(\boldsymbol\theta - \boldsymbol\theta')$

$D(\boldsymbol\theta, \boldsymbol\theta + d\boldsymbol\theta) = \frac12\sum g_{ij}\,d\theta_i\,d\theta_j$, $g_{ij} = \partial_i\partial_j\psi(\boldsymbol\theta)$, $\partial_i = \partial/\partial\theta_i$

θ: geodesic (affine) coordinates, not those of the Levi-Civita connection

Structure: $\{S, \psi(\boldsymbol\theta), \boldsymbol\theta\}$
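A generic Bregman divergence is a one-liner given the convex function and its gradient, and the classical special cases fall out immediately. A sketch (function names are ours):

```python
import numpy as np

def bregman(psi, grad_psi, t, tp):
    """D(theta : theta') = psi(theta) - psi(theta') - grad psi(theta') . (theta - theta')."""
    return psi(t) - psi(tp) - grad_psi(tp) @ (t - tp)

t = np.array([1.0, 2.0])
tp = np.array([0.5, 0.0])
# psi = 0.5 ||theta||^2 gives half the squared Euclidean distance
d_euc = bregman(lambda b: 0.5 * b @ b, lambda b: b, t, tp)

# psi = sum(theta log theta - theta) on positive measures gives extended KL
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])
d_kl = bregman(lambda b: np.sum(b * np.log(b) - b), np.log, p, q)
```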

slide-23
SLIDE 23

Legendre Transformation

$\eta_i = \partial_i\psi(\boldsymbol\theta)$: the map $\boldsymbol\theta \leftrightarrow \boldsymbol\eta$ is one-to-one

$\varphi(\boldsymbol\eta) = \max_{\boldsymbol\theta}\{\boldsymbol\theta\cdot\boldsymbol\eta - \psi(\boldsymbol\theta)\}$, so $\varphi(\boldsymbol\eta) + \psi(\boldsymbol\theta) - \boldsymbol\theta\cdot\boldsymbol\eta = 0$ and $\theta_i = \partial^i\varphi(\boldsymbol\eta)$, $\partial^i = \partial/\partial\eta_i$

$D(\boldsymbol\theta, \boldsymbol\theta') = \psi(\boldsymbol\theta) + \varphi(\boldsymbol\eta') - \boldsymbol\theta\cdot\boldsymbol\eta'$

slide-24
SLIDE 24

Two affine coordinate systems $(\boldsymbol\theta, \boldsymbol\eta)$

θ: geodesics (e-geodesics); η: dual geodesics (m-geodesics)

"Dually orthogonal": $\langle\partial_i, \partial^j\rangle = \delta_i^j$, where $\partial_i = \partial/\partial\theta_i$, $\partial^i = \partial/\partial\eta_i$

$X\langle Y, Z\rangle = \langle\nabla_X Y, Z\rangle + \langle Y, \nabla^*_X Z\rangle$

slide-25
SLIDE 25

Pythagorean Theorem (dually flat manifold)

If the m-geodesic from P to Q is orthogonal at Q to the e-geodesic from Q to R, then

$D[P:Q] + D[Q:R] = D[P:R]$

Euclidean space is the self-dual case: $\boldsymbol\theta = \boldsymbol\eta$, $\psi(\boldsymbol\theta) = \frac12\sum(\theta_i)^2$

slide-26
SLIDE 26

Projection Theorem

$\min_{Q\in M} D[P:Q]$: attained at $Q$ = the m-geodesic projection of $P$ to $M$

$\min_{Q\in M} D[Q:P]$: attained at $Q'$ = the e-geodesic projection of $P$ to $M$

slide-27
SLIDE 27

Two Types of Divergence

Invariant divergence (Chentsov, Csiszár): the f-divergence, which induces the Fisher structure:

$D[p:q] = \int p(x)\,f\!\left(\frac{q(x)}{p(x)}\right)dx$

Flat divergence (Bregman), built from a convex function.

The KL divergence belongs to both classes: it is flat and invariant.

slide-28
SLIDE 28

$S = \{p\}$: space of probability distributions

Invariance gives the invariant divergences (f-divergence: Fisher information metric, α-connection); dual flatness gives the flat (Bregman) divergences from convex functions.

Their intersection is the KL divergence:

$D[p:q] = \int p(x)\log\frac{p(x)}{q(x)}\,dx$

slide-29
SLIDE 29

Space of positive measures $\tilde S = \{\tilde{\mathbf p}\}$, $\tilde p_i > 0$ ($\sum\tilde p_i = 1$ need not hold): vectors, matrices, arrays

Here the f-divergence, α-divergence, and Bregman divergence all apply.

slide-30
SLIDE 30

Applications of Information Geometry

  • Statistical Inference
  • Machine Learning and AI
  • Computer Vision
  • Convex Programming
  • Signal Processing (ICA; Sparse)
  • Information Theory, Systems Theory
  • Quantum Information Geometry

slide-31
SLIDE 31

Applications to Statistics

Curved exponential family: $p(x, u) = \exp\{\boldsymbol\theta(u)\cdot\mathbf x - \psi(\boldsymbol\theta(u))\}$, a submanifold of the full exponential family $p(x, \boldsymbol\theta) = \exp\{\boldsymbol\theta\cdot\mathbf x - \psi(\boldsymbol\theta)\}$

Observations $x_1, \ldots, x_n$; sufficient statistic $\bar{\mathbf x} = \frac1n\sum_{k=1}^n\mathbf x_k$, $\hat{\boldsymbol\eta} = \bar{\mathbf x}$

Estimator: $\hat u = \hat u(x_1, \ldots, x_n)$

slide-32
SLIDE 32

$x$ discrete, $X = \{0, 1, \ldots, n\}$

$S^n = \{p(x) \mid x\in X\}$ is itself an exponential family:

$p(x) = \sum_{i=0}^n p_i\,\delta_i(x) = \exp\Big\{\sum_{i=1}^n\theta_i\,\delta_i(x) - \psi(\boldsymbol\theta)\Big\}$

$\theta_i = \log p_i - \log p_0$, $\psi(\boldsymbol\theta) = -\log p_0$

A statistical model is a submanifold $\{p(x, u)\}$ of $S^n$.

slide-33
SLIDE 33

High-Order Asymptotics

$p(x, u)$; observations $x_1, \ldots, x_n$; estimator $\hat u = \hat u(x_1, \ldots, x_n)$, $\hat{\boldsymbol\eta} = \bar{\mathbf x}$ (quadratic approximation)

Error covariance: $e = E\big[(\hat u - u)(\hat u - u)^T\big] = \frac1n G_1 + \frac1{n^2}G_2 + \cdots$

First order: $G_1 \ge G^{-1}$  (Cramér-Rao: linear theory)

Second order: $G_2 = (\Gamma^{(m)}_M)^2 + 2(H^{(m)}_M)^2 + (H^{(e)}_A)^2$, squared mixture-curvature terms of the model $M$ and the exponential curvature of the ancillary family $A$.

slide-34
SLIDE 34

Information Geometry of Belief Propagation

  • Shun-ichi Amari (RIKEN BSI)
  • Shiro Ikeda (Inst. Statist. Math.)
  • Toshiyuki Tanaka (Kyoto U.)
slide-35
SLIDE 35

Stochastic Reasoning

Joint distribution $p(x, y, z, r, s)$; conditional $p(x, y, z \mid r, s)$, with $x, y, z, \ldots = \pm1$

slide-36
SLIDE 36

Stochastic Reasoning

$q(x_1, x_2, x_3, \ldots \mid \text{observation})$, $\mathbf x = (x_1, x_2, x_3, \ldots)$, $x_i = \pm1$

Maximum likelihood: $\hat{\mathbf x} = \arg\max q(x_1, x_2, x_3, \ldots)$

Least bit-error-rate estimator: $\hat x_i = \mathrm{sgn}\,E[x_i]$

slide-37
SLIDE 37

Mean Value

Marginalization = projection to the independent distributions:

$\Pi\circ q(\mathbf x) = q_1(x_1)\,q_2(x_2)\cdots q_n(x_n)$

$q_i(x_i) = \int q(x_1, \ldots, x_n)\,dx_1\cdots dx_{i-1}\,dx_{i+1}\cdots dx_n$

$\boldsymbol\eta = E_q[\mathbf x] = E_{\Pi q}[\mathbf x]$

slide-38
SLIDE 38

$q(\mathbf x) = \exp\Big\{\sum_{r=1}^L c_r(\mathbf x) + \mathbf h\cdot\mathbf x - \psi\Big\}$, $x_i = \pm1$

e.g. pairwise interactions $c_r(\mathbf x) = w_{ij}x_i x_j$:

$q(\mathbf x) = \exp\Big\{\sum w_{ij}x_i x_j + \sum h_i x_i - \psi\Big\}$

Boltzmann machines, spin glasses, neural networks; Turbo codes, LDPC codes

slide-39
SLIDE 39

Computationally Difficult

Computing $\boldsymbol\eta = E_q[\mathbf x]$ for $q(\mathbf x) = \exp\{\sum_r c_r(\mathbf x) - \psi\}$ is intractable in general.

Approximations: mean-field approximation; belief propagation (exact on trees); CCCP (convex-concave procedure).

slide-40
SLIDE 40

Information Geometry of the Mean-Field Approximation

$M = \{\Pi_i\,p_i(x_i)\}$: the independent distributions, $p(\mathbf x)\in M$

  • m-projection ($\Pi_m$): $p^* = \arg\min_{p\in M} D[q:p]$ (exact marginals, intractable)
  • e-projection ($\Pi_e$): $p^* = \arg\min_{p\in M} D[p:q]$ (the mean-field approximation)

$D[q:p] = \sum_{\mathbf x} q(\mathbf x)\log\frac{q(\mathbf x)}{p(\mathbf x)}$

slide-41
SLIDE 41

Information Geometry

True distribution: $q(\mathbf x) = \exp\{\sum_r c_r(\mathbf x) - \psi\}$

$M_0 = \{p(\mathbf x, \boldsymbol\theta) = \exp\{\boldsymbol\theta\cdot\mathbf x - \psi\}\}$: independent distributions

$M_r = \{p_r(\mathbf x, \boldsymbol\xi_r) = \exp\{c_r(\mathbf x) + \boldsymbol\xi_r\cdot\mathbf x - \psi_r\}\}$, $r = 1, \ldots, L$: one clique term each

slide-42
SLIDE 42

Belief Propagation

$M_r$: $p_r(\mathbf x, \boldsymbol\xi_r) = \exp\{c_r(\mathbf x) + \boldsymbol\xi_r\cdot\mathbf x - \psi_r\}$

$\boldsymbol\theta^{t+1}_r = \Pi_0\circ p_r(\mathbf x, \boldsymbol\xi^t_r) - \boldsymbol\xi^t_r$: the belief for $c_r(\mathbf x)$

$\boldsymbol\xi^{t+1}_r = \sum_{r'\ne r}\boldsymbol\theta^{t+1}_{r'}$

slide-43
SLIDE 43

Belief Prop Algorithm

(Figure: projections between $M_0$, $M_r$, $M_{r'}$, exchanging the messages $\boldsymbol\zeta_r$, $\boldsymbol\zeta_{r'}$ via the projection $\Pi$.)

slide-44
SLIDE 44

Equilibrium of BP: $(\boldsymbol\xi^*_r, \boldsymbol\theta^*)$

1) m-condition: $\boldsymbol\theta^* = \Pi_0\circ p_r(\mathbf x, \boldsymbol\xi^*_r)$ for every $r$: all beliefs share the same expectations; the distributions with these expectations form an m-flat submanifold $M(\boldsymbol\theta^*)$.

2) e-condition: $\boldsymbol\theta^* = \frac{1}{L-1}\sum_r\boldsymbol\xi^*_r$: $q(\mathbf x)$ and the beliefs lie on a common e-flat submanifold.

slide-45
SLIDE 45

Free energy:

$F(\boldsymbol\theta, \boldsymbol\zeta_1, \ldots, \boldsymbol\zeta_L) = D[p_{\boldsymbol\theta} : q] - \sum_r D[p_{\boldsymbol\theta} : p_r]$

Critical point: $\partial F/\partial\boldsymbol\theta = 0$ gives the e-condition, $\partial F/\partial\boldsymbol\zeta_r = 0$ gives the m-condition; $F$ is not convex.

slide-46
SLIDE 46

Belief Propagation keeps the e-condition satisfied at every step; CCCP keeps the m-condition satisfied at every step.

BP iterates $(\boldsymbol\xi_1, \ldots, \boldsymbol\xi_L) \to (\boldsymbol\xi'_1, \ldots, \boldsymbol\xi'_L)$ with

$\boldsymbol\theta'_r = \Pi_0\circ p_r(\mathbf x, \boldsymbol\xi_r) - \boldsymbol\xi_r$, $\boldsymbol\xi'_r = \sum_{r'\ne r}\boldsymbol\theta'_{r'}$, $\boldsymbol\theta' = \frac{1}{L-1}\sum_r\boldsymbol\xi'_r$

slide-47
SLIDE 47

One step in coordinates: $\boldsymbol\theta^{t+1}_r = \Pi_0\circ p_r(\mathbf x, \boldsymbol\xi^{t+1}_r)$, with $\boldsymbol\xi^{t+1}_r = \sum_{r'\ne r}\boldsymbol\theta^{t+1}_{r'}$.

slide-48
SLIDE 48

Convex-Concave Computational Procedure (CCCP) (Yuille)

Split $F = F_1 - F_2$ with $F_1$, $F_2$ convex, and iterate

$\nabla F_1(\boldsymbol\theta^{t+1}) = \nabla F_2(\boldsymbol\theta^t)$

Elimination of double loops.
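A one-dimensional toy run of CCCP, using the split $F_1 = x^4$, $F_2 = 2x^2$ of $f(x) = x^4 - 2x^2$ (our choice of example, not from the slides):

```python
# CCCP sketch on f(x) = x^4 - 2x^2 = F1 - F2, with F1 = x^4 and F2 = 2x^2 both convex.
# Each step solves F1'(x_new) = F2'(x_old):  4 x_new^3 = 4 x_old,  so x_new = x_old**(1/3).
x = 0.2
for _ in range(60):
    x = x ** (1.0 / 3.0)
# x converges monotonically to 1.0, a minimiser of f, with no inner loop
```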

slide-49
SLIDE 49

Boltzmann Machine

Stochastic update of unit $i$: $p(x_i = 1) = \varphi\Big(\sum_j w_{ij}x_j - h_i\Big)$

Stationary distribution: $p(\mathbf x, W) = \exp\Big\{\sum w_{ij}x_i x_j - \sum h_i x_i - \psi\Big\}$

(Figure: network of units $x_1, \ldots, x_4$; the machine's distribution $\hat p(\mathbf x)$ approximates a target $q(\mathbf x)$.)

slide-50
SLIDE 50

Boltzmann machine with hidden units

  • EM algorithm: alternate e-projection and m-projection between the data manifold $D$ and the model manifold $M$

slide-51
SLIDE 51

EM algorithm

Hidden variables $\mathbf y$; model $M = \{p(\mathbf x, \mathbf y; \mathbf u)\}$; observed data $D = \{\mathbf x_1, \ldots, \mathbf x_N\}$

Data manifold: $\bar D = \{p(\mathbf x, \mathbf y) \mid p(\mathbf x) = p_D(\mathbf x)\}$, the distributions whose $\mathbf x$-marginal is the empirical one

E-step: e-projection of the current model to $\bar D$: $\min_{p\in\bar D} KL[p : p(\mathbf x, \mathbf y; \mathbf u)]$

M-step: m-projection of $\hat p$ to $M$: $\min_{\mathbf u} KL[\hat p : p(\mathbf x, \mathbf y; \mathbf u)]$

slide-52
SLIDE 52

SVM: support vector machine

Embedding: $z = \boldsymbol\phi(x)$, $f(x) = \mathbf w\cdot\boldsymbol\phi(x) = \sum_i\alpha_i y_i K(x, x_i)$

Kernel: $K(x, x') = \boldsymbol\phi(x)\cdot\boldsymbol\phi(x') = \sum_i\phi_i(x)\phi_i(x')$

Conformal change of kernel: $\tilde K(x, x') = \rho(x)\rho(x')K(x, x')$, e.g. $\rho(x) = \exp\{-\kappa\,|f(x)|^2\}$

slide-53
SLIDE 53

Signal Processing

ICA: Independent Component Analysis

$\mathbf x_t = A\mathbf s_t$: recover $\mathbf s_t$ from the observations $\mathbf x_t$

Related: sparse component analysis; positive matrix factorization

slide-54
SLIDE 54

Mixture and unmixing

Independent signals $s_1, \ldots, s_n$ mixed into observations $x_1, \ldots, x_m$:

$x_i = \sum_{j=1}^n A_{ij}s_j$, i.e. $\mathbf x = A\mathbf s$

slide-55
SLIDE 55

Independent Component Analysis

$\mathbf x = A\mathbf s$; demixing $\mathbf y = W\mathbf x$, ideally $W = A^{-1}$

  • Observations: x(1), x(2), …, x(t)
  • Recover: s(1), s(2), …, s(t)

slide-56
SLIDE 56
slide-57
SLIDE 57

Space of Matrices: Lie group

Increment at $W$: $d\mathbf X = dW\,W^{-1}$ (a non-holonomic basis; $I \to I + d\mathbf X$ corresponds to $W \to W + dW$)

Invariant metric: $ds^2 = \mathrm{tr}\big(d\mathbf X\,d\mathbf X^T\big) = \mathrm{tr}\big(dW\,W^{-1}W^{-T}dW^T\big)$

Natural gradient: $\tilde\nabla l = \frac{\partial l}{\partial W}W^T W$

slide-58
SLIDE 58

Information Geometry of ICA

Natural gradient; estimating functions; stability and efficiency

$S = \{p(\mathbf y)\}$; independent distributions $I = \{q_1(y_1)q_2(y_2)\cdots q_n(y_n)\}$; model $\{p(W\mathbf x)\}$

Cost: $l(W) = KL[p(\mathbf y; W) : q(\mathbf y)]$

slide-59
SLIDE 59

Semiparametric Statistical Model

$p(\mathbf x; W, r) = |W|\,r(W\mathbf x)$, $W = A^{-1}$

$r(\mathbf s) = \prod_i r_i(s_i)$: unknown nuisance densities; observations x(1), x(2), …, x(t)

slide-60
SLIDE 60

Natural Gradient

$\Delta W = -\eta\,\frac{\partial l(\mathbf y, W)}{\partial W}\,W^T W$
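In practice the natural-gradient update is usually written as $\Delta W = \eta\,(I - E[\varphi(\mathbf y)\mathbf y^T])\,W$, which is the same update expressed through $\mathbf y = W\mathbf x$ for the ICA likelihood. A sketch on synthetic Laplace sources (the mixing matrix, step size, and $\varphi = \tanh$ are illustrative assumptions):

```python
import numpy as np

def natural_gradient_ica(X, n_iter=3000, eta=0.02):
    """Natural-gradient ICA sketch: dW = eta * (I - E[phi(y) y^T]) W.
    phi = tanh assumes roughly super-Gaussian sources."""
    n, T = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        Y = W @ X
        W += eta * (np.eye(n) - np.tanh(Y) @ Y.T / T) @ W
    return W

# demo: unmix two Laplace sources mixed by an illustrative matrix A
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.5], [0.5, 1.0]])
X = A @ S
W = natural_gradient_ica(X)
Y = W @ X
```

After the run, the recovered components in Y are far less correlated than the mixtures in X.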

slide-61
SLIDE 61

Basis given, overcomplete case: sparse solution

$\mathbf x = A\mathbf s = \sum s_i\mathbf a_i$: more basis vectors than observation dimensions, so many $\hat{\mathbf s}$ satisfy $\mathbf x = A\hat{\mathbf s}$; choose a sparse one.
slide-62
SLIDE 62

generalized inverse

ˆ min Σ

2 i

s

sparse solution

ˆ min

i i

∑ s

ˆ ˆ ˆ : x = A As

sparse

2

:

  • norm

L

1

:

  • norm

L

slide-63
SLIDE 63
slide-64
SLIDE 64

Overcomplete Basis and Sparse Solution

$\mathbf x = A\mathbf s = \sum s_i\mathbf a_i$: $\min\sum|s_i|^p$ under $\mathbf x = A\mathbf s$, or the penalized form $\min\|\mathbf x - A\mathbf s\|^2 + \alpha\sum|s_i|^p$

Non-linear denoising.

slide-65
SLIDE 65

Sparse Solution (overcomplete case)

$\min F(\boldsymbol\beta)$, penalty $F(\boldsymbol\beta) = \sum\varphi(\beta_i)$ (a Bayesian prior):

$F_0(\boldsymbol\beta) = \#\{i : \beta_i \ne 0\}$: the sparsest solution ($L_0$)

$F_p(\boldsymbol\beta) = \sum|\beta_i|^p$, $0 < p \le 1$: sparse solutions ($L_1$ in particular)

$F_2(\boldsymbol\beta) = \sum\beta_i^2$: the generalized-inverse solution

slide-66
SLIDE 66

Optimization under a Sparsity Condition

$\min\varphi(\boldsymbol\beta)$ ($\varphi$ a convex function) subject to the constraint $F(\boldsymbol\beta)\le c$

Typical case: $\varphi(\boldsymbol\beta) = \frac12\|\mathbf y - X\boldsymbol\beta\|^2 = \frac12(\boldsymbol\beta - \boldsymbol\beta^*)^T G\,(\boldsymbol\beta - \boldsymbol\beta^*) + \text{const}$, $F_p(\boldsymbol\beta) = \sum|\beta_i|^p$ with $p = 2,\ 1,\ 1/2$

slide-67
SLIDE 67

L1-constrained optimization (LASSO, LARS)

Problem $P_c$: $\min\varphi(\boldsymbol\beta)$ under $F(\boldsymbol\beta)\le c$; solution $\boldsymbol\beta^*(c)$, with $\boldsymbol\beta^*(c)\to\boldsymbol\beta^*$ as $c\to\infty$

Problem $P_\lambda$: $\min\{\varphi(\boldsymbol\beta) + \lambda F(\boldsymbol\beta)\}$; solution $\boldsymbol\beta^*(\lambda)$, with $\boldsymbol\beta^*(\lambda)\to\boldsymbol\beta^*$ as $\lambda\to0$

For $p\ge1$ the two solution paths coincide under a correspondence $\lambda = \lambda(c)$; for $p<1$ the path can be multiple-valued and non-continuous, and the stability of the two problems differs.

slide-68
SLIDE 68

Projection from $\boldsymbol\beta^*$ to the surface $F = c$ (information geometry)

slide-69
SLIDE 69

Convex Cone Programming

$P$: positive semi-definite matrices; convex potential function; dual-geodesic approach

$A\mathbf x = \mathbf b$, $\min\,\mathbf c\cdot\mathbf x$

Related: support vector machines

slide-70
SLIDE 70

  • Fig. 1: unit balls of $F_p$ in $R^2$ ($n = 2$): (a) $p > 1$ (strictly convex), (b) $p = 1$, (c) $p < 1$ (non-convex).

slide-71
SLIDE 71

  • Orthogonal projection in the dual sense: $\boldsymbol\beta^*_c = \arg\min_{\boldsymbol\beta} D_\varphi[\boldsymbol\beta^* : \boldsymbol\beta]$ subject to $F(\boldsymbol\beta) = c$: the dual-geodesic projection

$\boldsymbol\eta^* - \boldsymbol\eta^*_c \propto \nabla F(\boldsymbol\beta^*_c)$

slide-72
SLIDE 72

  • Fig. 5: at a non-smooth point of the surface $F = c$ the normal $\mathbf n = \nabla F$ is a subgradient; the projection condition $\boldsymbol\eta^* - \boldsymbol\eta^*_c \propto \nabla F(\boldsymbol\beta^*_c)$ holds with $\nabla F$ taken as a subgradient.

slide-73
SLIDE 73

LASSO path and LARS path (stagewise solution)

$\min\varphi(\boldsymbol\beta)$ under $F(\boldsymbol\beta)\le c$ vs. $\min\{\varphi(\boldsymbol\beta) + \lambda F(\boldsymbol\beta)\}$

Solutions $\boldsymbol\beta^*(c) \Leftrightarrow \boldsymbol\beta^*(\lambda)$ under the correspondence $c \leftrightarrow \lambda$

slide-74
SLIDE 74

Active set and gradient

Active set: $A(\boldsymbol\beta) = \{i : \beta_i \ne 0\}$

For $p = 1$ the (sub)gradient of the penalty is

$\nabla_i F_1(\boldsymbol\beta) = \begin{cases}\mathrm{sgn}\,\beta_i, & i\in A,\\ [-1, 1], & i\notin A.\end{cases}$

slide-75
SLIDE 75

Solution path

Differentiating the stationarity condition $\nabla\varphi(\boldsymbol\beta^*_c) + \lambda_c\nabla F(\boldsymbol\beta^*_c) = 0$ along the path:

$K_c\,\dot{\boldsymbol\beta}^*_c + \dot\lambda_c\,\nabla F = 0$, so $\dot{\boldsymbol\beta}^*_c = -\dot\lambda_c\,K_c^{-1}\nabla F(\boldsymbol\beta^*_c)$, $\dot{} = d/dc$

with $K_c = G + \lambda_c\,\nabla\nabla F(\boldsymbol\beta^*_c)$; for $L_1$, $\nabla\nabla F = 0$ and $\nabla F = (\mathrm{sgn}\,\beta_i)$.

slide-76
SLIDE 76

Solution path in the subspace of the active set

$\nabla_A\varphi(\boldsymbol\beta^*_\lambda) + \lambda\nabla_A F(\boldsymbol\beta^*_\lambda) = 0$, $\dot{\boldsymbol\beta}^*_A = -\dot\lambda\,K_A^{-1}\nabla_A F$: the active direction

At a turning point the active set changes, $A \to A'$.

slide-77
SLIDE 77

Gradient Descent Method

Covariant gradient: $\nabla L = \{\partial L/\partial x_i\}$; contravariant (natural) gradient: $\tilde\nabla L = \{g^{ji}\,\partial L/\partial x_j\}$

$\min L(\mathbf x + \mathbf a)$ under $\sum g_{ij}a_i a_j = \varepsilon^2$ gives the steepest-descent direction in the metric $g$:

$\mathbf x_{t+1} = \mathbf x_t - c_t\,\tilde\nabla L(\mathbf x_t)$
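On a quadratic loss the difference is easy to see: the contravariant step $G^{-1}\nabla L$ contracts all coordinates uniformly even when the metric is badly conditioned. A sketch (the metric $G$ and step size are illustrative choices):

```python
import numpy as np

# Steepest descent under a metric G uses the contravariant gradient G^{-1} grad L.
# Illustrative loss L(x) = 0.5 x^T G x, whose ordinary gradient is G x.
G = np.array([[1.0, 0.0], [0.0, 100.0]])   # badly conditioned metric (assumed)
grad_L = lambda x: G @ x

x = np.array([1.0, 1.0])
for _ in range(50):
    x = x - 0.5 * np.linalg.solve(G, grad_L(x))   # x <- x - c * G^{-1} grad L = 0.5 x
# every coordinate halves each step, so x -> 0 quickly; the ordinary gradient
# with the same step size would diverge in the stiff coordinate (|1 - 0.5*100| > 1)
```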

slide-78
SLIDE 78

Extended LARS (p = 1) and the Minkovskian gradient

Take a step $\boldsymbol\beta^{t+1} = \boldsymbol\beta^t + \varepsilon\,\mathbf a$ maximizing the decrease of $\psi$ under the $L_1$ constraint $\|\mathbf a\|_1 = 1$. With $\boldsymbol\eta = \nabla\psi$, the maximizer concentrates on the largest gradient components:

$a_i = \begin{cases}-\mathrm{sgn}(\eta_i), & |\eta_i| = \max_j|\eta_j|,\\ 0, & \text{otherwise,}\end{cases}$

the Minkovskian gradient $\tilde\nabla\psi$.

slide-79
SLIDE 79

$i^* = \arg\max_i|f_i|$, $f^* = \max_i|f_i|$; when several indices tie ($f_i = f_j = f^*$) the step is shared among them:

$\tilde\nabla_i F = \begin{cases}1, & \text{for } i^* \text{ (and any tied } j),\\ 0, & \text{otherwise.}\end{cases}$

$\boldsymbol\beta^{t+1} = \boldsymbol\beta^t - \eta\,\tilde\nabla F$: this is LARS.

slide-80
SLIDE 80

$\tilde\nabla F$ interpolates between the Euclidean gradient and the LARS direction: $\tilde\nabla_i F = c\,|f_i|^{1/(p-1)}\,\mathrm{sgn}\,f_i$, which is $\nabla f$ itself in the Euclidean case $p = 2$, and concentrates on the maximal components, $\tilde\nabla F = c\,\mathrm{sgn}(f_{i^*})$, as $p \to 1$.

slide-81
SLIDE 81

λ-trajectory and c-trajectory

  • Ex. 1-dim:

$\varphi(\beta) = \frac12(\beta - \beta^*)^2$

$f_\lambda(\beta) = \varphi(\beta) + \lambda F(\beta) = \frac12(\beta - \beta^*)^2 + \lambda|\beta|^{1/2}$

$L_{1/2}$ constraint: non-convex optimization.

slide-82
SLIDE 82

Problem $P_c$: $\min(\beta - \beta^*)^2$ under $|\beta|^{1/2}\le c$; solution $\hat\beta_c = c^2$ (when $\beta^* > c^2$).

Problem $P_\lambda$: stationarity $\beta - \beta^* + \frac{\lambda}{2\sqrt\beta} = 0$; solution $\hat\beta_\lambda = R_\lambda(\beta^*)$ (Xu Zongben's half-thresholding operator).

Correspondence: $\lambda_c = 2c\,(\beta^* - c^2)$.

slide-83
SLIDE 83

ICCN-Huangshan (黄山)

Sparse Signal Analysis

Shun-ichi Amari (甘利俊一)

RIKEN Brain Science Institute (Collaborator: Masahiro Yukawa, Niigata University)

slide-84
SLIDE 84

Solution Path

The correspondence $c \leftrightarrow \lambda$ (hence $\boldsymbol\beta_c \Leftrightarrow \boldsymbol\beta_\lambda$) is not continuous and not monotone: the path can jump.

slide-85
SLIDE 85

An example of the greedy path (in the $(\beta_1, \beta_2)$ plane)

slide-86
SLIDE 86

Linear Programming: inner method

$\max\sum c_i x_i$ under $\sum_j A_{ij}x_j \ge b_i$; log-barrier $\psi(\mathbf x) = -\sum_i\log\Big(\sum_j A_{ij}x_j - b_i\Big)$

slide-87
SLIDE 87

Convex Programming ━ Inner Method

LP: $A\mathbf x \ge \mathbf b$, $\mathbf x \ge 0$, $\min\,\mathbf c\cdot\mathbf x$

Barrier: $\psi(\mathbf x) = -\sum_i\log\Big(\sum_j A_{ij}x_j - b_i\Big) - \sum_i\log x_i$

$\boldsymbol\eta = \partial\psi(\mathbf x)$: dual coordinates

Simplex method vs. inner method.

slide-88
SLIDE 88

Polynomial-Time Algorithm

Curvature controls the step size: the central path $\mathbf x(t) = \arg\min\{t\,\mathbf c\cdot\mathbf x + \psi(\mathbf x)\}$ behaves like a geodesic, and its mixture curvature $(H^{(m)})^2$ governs the admissible step size, hence the complexity bound.

slide-89
SLIDE 89

Neural Networks

Higher-order correlations

Synchronous firing

Multilayer Perceptron

slide-90
SLIDE 90

Multilayer Perceptrons

$y = \sum v_i\,\varphi(\mathbf w_i\cdot\mathbf x) + n$

$p(y\mid\mathbf x;\boldsymbol\theta) = c\exp\Big\{-\frac12\big(y - f(\mathbf x,\boldsymbol\theta)\big)^2\Big\}$, $f(\mathbf x,\boldsymbol\theta) = \sum v_i\,\varphi(\mathbf w_i\cdot\mathbf x)$

$\mathbf x = (x_1, \ldots, x_n)$, $\boldsymbol\theta = (\mathbf w_1, \ldots, \mathbf w_m;\ v_1, \ldots, v_m)$

slide-91
SLIDE 91

Multilayer Perceptron as a neuromanifold

$y = f(\mathbf x, \boldsymbol\theta) = \sum v_i\,\varphi(\mathbf w_i\cdot\mathbf x)$, $\boldsymbol\theta = (\mathbf w_1, \ldots, \mathbf w_m;\ v_1, \ldots, v_m)$

The parameter space is a manifold embedded in the space of functions $\psi(\mathbf x)$.
slide-92
SLIDE 92

singularities

slide-93
SLIDE 93

Geometry of a singular model

$y = v\,\varphi(\mathbf w\cdot\mathbf x) + n$: the model is singular where $v = 0$ or $|\mathbf w| = 0$, since the unit's output vanishes there and the remaining parameter becomes unidentifiable.
slide-94
SLIDE 94

Backpropagation: gradient learning

Examples $(\mathbf x_1, y_1), (\mathbf x_2, y_2), \ldots$; loss $E(\boldsymbol\theta) = \frac12\big(y_t - f(\mathbf x_t, \boldsymbol\theta)\big)^2 = -\log p(y_t\mid\mathbf x_t;\boldsymbol\theta) + \text{const}$

$\Delta\boldsymbol\theta_t = -\eta\,\frac{\partial E}{\partial\boldsymbol\theta}$, $f = \sum v_i\,\varphi(\mathbf w_i\cdot\mathbf x)$

Natural gradient (Riemannian steepest descent): $\tilde\nabla E = G^{-1}\nabla E$

slide-95
SLIDE 95
slide-96
SLIDE 96
slide-97
SLIDE 97

q-Fisher information

$g^{(q)}_{ij}(p) = \frac{1}{h_q(p)}\,g^F_{ij}(p)$: a conformal transformation of the Fisher metric

q-divergence:

$D_q[p(x) : r(x)] = \frac{1}{(1-q)\,h_q(p)}\Big(1 - \int p(x)^q\,r(x)^{1-q}\,dx\Big)$

slide-98
SLIDE 98

Total Bregman Divergence and its Applications to Shape Retrieval

  • Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari, Frank Nielsen

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010

slide-99
SLIDE 99

Total Bregman Divergence

$TD[\mathbf x : \mathbf y] = \frac{D[\mathbf x : \mathbf y]}{\sqrt{1 + \|\nabla f(\mathbf y)\|^2}}$

  • rotational invariance
  • conformal geometry
slide-100
SLIDE 100

Total Bregman divergence (Vemuri):

$\mathrm{TBD}(p : q) = \frac{\varphi(p) - \varphi(q) - \nabla\varphi(q)\cdot(p - q)}{\sqrt{1 + \|\nabla\varphi(q)\|^2}}$
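A direct transcription of this formula (the quadratic test case is our choice):

```python
import numpy as np

def tbd(phi, grad_phi, p, q):
    """Total Bregman divergence: the Bregman divergence divided by
    sqrt(1 + |grad phi(q)|^2), which makes it rotation-invariant."""
    bd = phi(p) - phi(q) - grad_phi(q) @ (p - q)
    return bd / np.sqrt(1.0 + grad_phi(q) @ grad_phi(q))

# with phi = 0.5 ||x||^2 the numerator is half the squared Euclidean distance
phi = lambda v: 0.5 * v @ v
p = np.array([1.0, 0.0])
q = np.array([0.0, 0.0])
# here grad phi(q) = 0, so tbd = 0.5 / sqrt(1) = 0.5
```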

slide-101
SLIDE 101

Clustering : t-center

{ }

1,

,

m

E x x = L

[ ]

arg min ,

i i

TD

∗ =

x x x

E y

T-center of E

x

slide-102
SLIDE 102

t-center: closed form

$\nabla f(\mathbf x^*) = \frac{\sum_i w_i\,\nabla f(\mathbf x_i)}{\sum_i w_i}$, with weights $w_i = \frac{1}{\sqrt{1 + \|\nabla f(\mathbf x_i)\|^2}}$

slide-103
SLIDE 103

q-super-robust estimator (Eguchi)

Replace the log-likelihood by the q-likelihood: $\hat\xi = \arg\max_\xi\sum_i\log_q p(x_i, \xi)$, where $\log_q u = \frac{u^{1-q}-1}{1-q}$ (and $\log_1 u = \log u$).

Bias-corrected q-estimating function: $s_q(x, \xi) = \partial_\xi\log_q p(x, \xi) - c_q(\xi)$, with $c_q$ chosen so that $E[s_q] = 0$; the estimator solves $\sum_i s_q(x_i, \hat\xi) = 0$ and is highly robust against outliers.

slide-104
SLIDE 104

Conformal change of divergence

$\tilde D[p : q] = \sigma(p)\,D[p : q]$

Induced structure: $\tilde g_{ij} = \sigma\,g_{ij}$, $\tilde T_{ijk} = \sigma\,T_{ijk} + s_k g_{ij} + s_j g_{ik} + s_i g_{jk}$, $s_i = \partial_i\log\sigma$

slide-105
SLIDE 105

The t-center is robust

Contaminate $E = \{\mathbf x_1, \ldots, \mathbf x_n\}$ with an outlier $\mathbf y$ at fraction $\varepsilon$: $\tilde{\mathbf x}^* = \mathbf x^* + \varepsilon\,\mathbf z(\mathbf x^*;\mathbf y)$

$\mathbf z(\mathbf x^*;\mathbf y)$: influence function; robust when $\|\mathbf z\| < c$ as $\mathbf y\to\infty$ (bounded influence)

slide-106
SLIDE 106

$\mathbf z(\mathbf x^*;\mathbf y) = G^{-1}\,w(\mathbf y)\big(\nabla f(\mathbf y) - \nabla f(\mathbf x^*)\big)$, $G = \frac1n\sum_i\nabla\nabla f(\mathbf x_i)$

Since $w(\mathbf y)\,\|\nabla f(\mathbf y)\| = \frac{\|\nabla f(\mathbf y)\|}{\sqrt{1 + \|\nabla f(\mathbf y)\|^2}} < 1$, the influence $\mathbf z$ is bounded: the t-center is robust.

Euclidean case $f = \frac12\|\mathbf x\|^2$: $\mathbf z(\mathbf x^*;\mathbf y) \propto \frac{\mathbf y}{\sqrt{1 + \|\mathbf y\|^2}}$, bounded as $\|\mathbf y\|\to\infty$.

slide-107
SLIDE 107

MPEG7 database

  • Great intraclass variability, and small interclass dissimilarity.

slide-108
SLIDE 108

Other TBD applications

Diffusion tensor imaging (DTI) analysis

[Vemuri]

  • Interpolation
  • Segmentation

Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari and Frank Nielsen, Total Bregman Divergence and its Applications to DTI Analysis, IEEE TMI, to appear

slide-109
SLIDE 109

TBD application-shape retrieval

  • Using MPEG7 database;
  • 70 classes, with 20 shapes each class

(Meizhu Liu)

slide-110
SLIDE 110

Multiterminal Information & Statistical Inference

Two correlated sources $X: x^n$ and $Y: y^n$, jointly distributed as $p(x, y; \theta)$, are encoded separately at rates $R_X$ and $R_Y$ ($M_X = 2^{nR_X}$, $M_Y = 2^{nR_Y}$ codewords); the parameter $\hat\theta$ is estimated from the two compressed streams.

slide-111
SLIDE 111

Marginals and correlation; 0-rate compression; Slepian-Wolf region

$G = G_M + G_C$: the Fisher information splits into a marginal part and a correlation part, for $p(x,y)$ (and $p(x,y,z)$).

slide-112
SLIDE 112

Linear Systems: ARMA

$x_t + b_1 x_{t-1} + \cdots + b_p x_{t-p} = u_t + a_1 u_{t-1} + \cdots + a_q u_{t-q}$

$\boldsymbol\theta = (a_1, \ldots, a_q;\ b_1, \ldots, b_p)$; transfer function $x = f(z;\boldsymbol\theta)\,u$, $f(z;\boldsymbol\theta) = \frac{1 + a_1 z^{-1} + \cdots + a_q z^{-q}}{1 + b_1 z^{-1} + \cdots + b_p z^{-p}}$

AR models: e-flat; MA models: m-flat.

slide-113
SLIDE 113

Machine Learning

Boosting: combination of weak learners

$D = \{(\mathbf x_1, y_1), (\mathbf x_2, y_2), \ldots, (\mathbf x_N, y_N)\}$, $y_i = \pm1$

Weak learner $f(\mathbf x, \mathbf u)$; decision $h(\mathbf x, \mathbf u) = \mathrm{sgn}\,f(\mathbf x, \mathbf u)$

slide-114
SLIDE 114

Weak Learners

Combined classifier: $H(\mathbf x) = \mathrm{sgn}\Big(\sum_t\alpha_t\,h_t(\mathbf x)\Big)$

Weighted error: $\varepsilon_t = \mathrm{Prob}_{W_t}\{h_t(\mathbf x_i)\ne y_i\}$

Weight update (weight distribution over the examples):

$W_{t+1}(i) = c\,W_t(i)\exp\{-\alpha_t\,y_i\,h_t(\mathbf x_i)\}$
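The weight update above is the core of AdaBoost. A minimal sketch with axis-aligned threshold stumps as the weak learners (the stump search and the toy data in the test are our choices):

```python
import numpy as np

def adaboost_stumps(X, y, n_rounds=10):
    """Minimal AdaBoost sketch with axis-aligned threshold stumps."""
    n, d = X.shape
    W = np.full(n, 1.0 / n)                  # weight distribution over examples
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):                   # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] >= thr, 1, -1)
                    err = W[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = s * np.where(X[:, j] >= thr, 1, -1)
        W = W * np.exp(-alpha * y * pred)    # the slide's update W_{t+1} = c W_t e^{-a y h}
        W = W / W.sum()                      # c normalises W back to a distribution
        ensemble.append((alpha, j, thr, s))
    return ensemble

def predict(ensemble, X):
    F = sum(a * s * np.where(X[:, j] >= thr, 1, -1) for a, j, thr, s in ensemble)
    return np.sign(F)
```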

slide-115
SLIDE 115

Boosting ━ generalization

$\tilde Q_{t+1}(y, x) = c\,\tilde Q_t(y, x)\exp\{\alpha_t\,y\,h_t(x)\}$: each round multiplies the model by an exponential factor

$\alpha_t$: $\min_\alpha D[P : \tilde Q_t]$ over $F = \{P(y, x) : E[y\,h_t(x)] = \text{const}\}$

Hence $D[P, \tilde Q_{t+1}] < D[P, \tilde Q_t]$: boosting decreases the divergence to the target monotonically.

slide-116
SLIDE 116

Integration of evidences $x_1, x_2, \ldots, x_m$:

arithmetic mean, geometric mean, harmonic mean; more generally, the α-mean

slide-117
SLIDE 117

Various Means

arithmetic: $\frac{a+b}{2}$;  geometric: $\sqrt{ab}$;  harmonic: $\frac{2}{\frac1a + \frac1b}$

Any other mean?

slide-118
SLIDE 118

Generalized mean: the f-mean

$f(u)$ monotone (the "f-representation" of $u$):

$m_f(a, b) = f^{-1}\Big(\frac{f(a) + f(b)}{2}\Big)$

Scale freedom $m_f(ca, cb) = c\,m_f(a, b)$ forces the α-representation:

$f_\alpha(u) = u^{\frac{1-\alpha}{2}}$ ($\alpha \ne 1$), $f_1(u) = \log u$
slide-119
SLIDE 119
  • α-mean:

$m_\alpha(a, b) = f_\alpha^{-1}\Big(\frac{f_\alpha(a) + f_\alpha(b)}{2}\Big)$, $f_\alpha(u) = u^{\frac{1-\alpha}{2}}$

$\alpha = -1$: $\frac{a+b}{2}$ (arithmetic); $\alpha = 1$: $\sqrt{ab}$ (geometric); $\alpha = 3$: $\frac{2ab}{a+b}$ (harmonic); $\alpha = 0$: $\frac{(\sqrt a + \sqrt b)^2}{4}$; $\alpha\to\infty$: $\min(a, b)$; $\alpha\to-\infty$: $\max(a, b)$

Applied to distributions: $m_\alpha\big(p_1(s), p_2(s)\big)$.
slide-120
SLIDE 120

Family of Distributions

Given $p_1(s), \ldots, p_k(s)$:

mixture family: $p_{\mathrm{mix}}(s) = \sum_{i=1}^k t_i\,p_i(s)$, $\sum t_i = 1$

exponential family: $\log p_{\exp}(s) = \sum t_i\log p_i(s) - \psi$

α-family: $p(x;\boldsymbol\theta) = f_\alpha^{-1}\Big\{\sum_i\theta_i\,f_\alpha\big(p_i(x)\big)\Big\}$ (the α-mixture, up to normalization)