Machine Learning SS: Kyoto U.
Information Geometry
and Its Applications to Machine Learning
Shun-ichi Amari
RIKEN Brain Science Institute
Information Geometry

Information geometry studies the manifold of probability distributions as a Riemannian manifold equipped with dual affine connections. It links mathematics to the information sciences: systems theory, information theory, statistics, neural networks, combinatorics, physics, AI, vision, and optimization.
Gaussian distributions:
$$S = \{p(x;\theta)\}, \qquad \theta = (\mu, \sigma)$$
$$p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$$
Manifold of Probability Distributions

Discrete distributions: $S_n = \{p(x)\}$, $x = 1, \dots, n$. For $n = 3$, $\mathbf{p} = (p_1, p_2, p_3)$ with $p_1 + p_2 + p_3 = 1$: the probability simplex.

A parametric model $M = \{p(x;\theta)\}$ is a submanifold of the full manifold $S = \{p(x,\theta)\}$.
Invariance under different representations: for $y = y(x)$ a sufficient statistic, $p(y,\theta)$ carries the same information. The squared $L_2$ distance is not invariant:
$$\int |p(x,\theta_1) - p(x,\theta_2)|^2\, dx \neq \int |p(y,\theta_1) - p(y,\theta_2)|^2\, dy$$
Two Geometrical Structures

1. Riemannian metric: $ds^2 = \sum g_{ij}(\theta)\, d\theta^i\, d\theta^j$, given by the Fisher information
$$g_{ij} = E\!\left[\frac{\partial \log p}{\partial \theta^i}\,\frac{\partial \log p}{\partial \theta^j}\right]$$
Orthogonality via the inner product $\langle d\theta_1, d\theta_2 \rangle = d\theta_1^T G\, d\theta_2$.

2. Affine connection: the covariant derivative $\nabla_X Y$ and parallel transport $\Pi$ define geodesics $X = X(t)$, the analogues of straight lines (curves of minimal distance).
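A quick numeric sketch of the Fisher metric for the Gaussian family above (assumptions: NumPy only; sample size and seed are arbitrary). It estimates $g_{ij} = E[\partial_i \log p\ \partial_j \log p]$ by Monte Carlo and compares it with the closed form $g = \mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score vector (d/d mu, d/d sigma) of log p(x; mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma
    return np.stack([d_mu, d_sigma], axis=-1)

def fisher_metric_mc(mu, sigma, n=200_000, seed=0):
    """g_ij = E[score_i * score_j], estimated from samples of p itself."""
    x = np.random.default_rng(seed).normal(mu, sigma, size=n)
    s = gaussian_score(x, mu, sigma)
    return s.T @ s / n

mu, sigma = 1.0, 2.0
print(fisher_metric_mc(mu, sigma))            # approx [[0.25, 0], [0, 0.5]]
print(np.diag([1 / sigma**2, 2 / sigma**2]))  # exact Fisher metric
```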
Duality: Two Affine Connections

Two parallel transports $\Pi$ and $\Pi^*$ are dual when they jointly preserve the inner product:
$$\langle X, Y \rangle = \langle \Pi X, \Pi^* Y \rangle, \qquad \langle X, Y \rangle = \sum g_{ij} X^i Y^j$$
Riemannian geometry is the self-dual case $\Pi^* = \Pi$. The resulting structure is $\{S, g, \nabla, \nabla^*\}$.
Dual Affine Connections $\nabla, \nabla^*$: e-geodesic and m-geodesic

Given $p(x)$ and $q(x)$:
$$\text{e-geodesic:}\quad \log r(x,t) = t \log p(x) + (1-t)\log q(x) + c(t)$$
$$\text{m-geodesic:}\quad r(x,t) = t\, p(x) + (1-t)\, q(x)$$
These are the autoparallel curves of the two connections: $\nabla_{\dot x}\dot x = 0$ and $\nabla^*_{\dot x}\dot x = 0$.
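A small illustration of the two geodesics (assumption: distributions on a finite set, represented as NumPy arrays): the m-geodesic mixes probabilities linearly, while the e-geodesic mixes log-probabilities and renormalizes.

```python
import numpy as np

def m_geodesic(p, q, t):
    """r(x,t) = t p(x) + (1-t) q(x): already normalized."""
    return t * p + (1 - t) * q

def e_geodesic(p, q, t):
    """log r(x,t) = t log p(x) + (1-t) log q(x) + c(t); c(t) normalizes."""
    r = np.exp(t * np.log(p) + (1 - t) * np.log(q))
    return r / r.sum()

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
for t in (0.0, 0.5, 1.0):
    print(t, m_geodesic(p, q, t), e_geodesic(p, q, t))
```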
Mathematical structure of $S = \{p(x,\xi)\}$:
$$g_{ij} = E[\partial_i l\, \partial_j l], \qquad T_{ijk} = E[\partial_i l\, \partial_j l\, \partial_k l], \qquad l = \log p(x;\xi), \quad \partial_i = \frac{\partial}{\partial \xi^i}$$
The α-connections:
$$\Gamma^{(\alpha)}_{ijk} = \Gamma^{(0)}_{ijk} - \frac{\alpha}{2} T_{ijk}$$
$\nabla^{(\alpha)} \leftrightarrow \nabla^{(-\alpha)}$: dually coupled,
$$X\langle Y, Z \rangle = \langle \nabla_X Y, Z \rangle + \langle Y, \nabla^*_X Z \rangle$$
Divergence $D[\mathbf{z} : \mathbf{y}]$ (a quasi-distance):
$$D[\mathbf{z} : \mathbf{y}] \ge 0, \quad = 0 \text{ iff } \mathbf{z} = \mathbf{y}; \qquad D[\mathbf{z} : \mathbf{z} + d\mathbf{z}] = \frac{1}{2}\sum g_{ij}\, dz^i\, dz^j, \quad (g_{ij}) \text{ positive-definite}$$

KL-divergence:
$$D[p(x) : q(x)] = \sum_x p(x)\log\frac{p(x)}{q(x)} \ge 0, \quad = 0 \text{ iff } p(x) = q(x); \qquad D[p:q] \neq D[q:p]$$
f-divergence (on $\tilde S_n = \{\tilde{\mathbf{p}} \mid \tilde p_i > 0\}$, where $\sum \tilde p_i = 1$ need not hold):
$$D_f[\tilde{\mathbf{p}} : \tilde{\mathbf{q}}] = \sum_i \tilde p_i\, f\!\left(\frac{\tilde q_i}{\tilde p_i}\right) \ge 0, \qquad D_f[\tilde{\mathbf{p}} : \tilde{\mathbf{q}}] = 0 \iff \tilde{\mathbf{p}} = \tilde{\mathbf{q}}$$
It is not invariant under $f(u) \to f(u) - c(u - 1)$.
α-divergence:
$$D_\alpha[\tilde{\mathbf{p}} : \tilde{\mathbf{q}}] = \frac{4}{1 - \alpha^2}\sum_i\left\{\frac{1-\alpha}{2}\tilde p_i + \frac{1+\alpha}{2}\tilde q_i - \tilde p_i^{\frac{1-\alpha}{2}}\,\tilde q_i^{\frac{1+\alpha}{2}}\right\}$$
(the KL-divergence is the limiting case $\alpha \to \pm 1$).

(α, β)-divergence:
$$D_{\alpha,\beta}[\tilde{\mathbf{p}} : \tilde{\mathbf{q}}] = \sum_i\left\{\frac{\alpha}{\alpha+\beta}\tilde p_i^{\alpha+\beta} + \frac{\beta}{\alpha+\beta}\tilde q_i^{\alpha+\beta} - \tilde p_i^{\alpha}\tilde q_i^{\beta}\right\}$$
$\beta = 1 - \alpha$ recovers the α-divergence; $\alpha = 1$ gives the β-divergence.
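A hedged sketch of the divergences above on the simplex (assumptions: strictly positive NumPy arrays; $\alpha \neq \pm 1$ for the α-divergence; the convention that $\alpha \to -1$ recovers $KL[p:q]$ follows the formula as written).

```python
import numpy as np

def kl(p, q):
    """KL divergence D[p:q] = sum_i p_i log(p_i / q_i)."""
    return np.sum(p * np.log(p / q))

def f_divergence(p, q, f):
    """D_f[p:q] = sum_i p_i f(q_i / p_i), f convex with f(1) = 0."""
    return np.sum(p * f(q / p))

def alpha_divergence(p, q, alpha):
    """4/(1-a^2) sum { (1-a)/2 p + (1+a)/2 q - p^((1-a)/2) q^((1+a)/2) }."""
    a = alpha
    mix = (1 - a) / 2 * p + (1 + a) / 2 * q
    geo = p ** ((1 - a) / 2) * q ** ((1 + a) / 2)
    return 4.0 / (1 - a**2) * np.sum(mix - geo)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(kl(p, q))
print(f_divergence(p, q, lambda u: -np.log(u)))  # f(u) = -log u recovers KL[p:q]
print(alpha_divergence(p, q, alpha=-0.999))      # -> KL[p:q] as alpha -> -1
```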
Metric and Connections Induced by Divergence (Eguchi)

$$g_{ij}(\mathbf{z}) = -\partial_i\,\partial'_j\, D[\mathbf{z} : \mathbf{y}]\big|_{\mathbf{y}=\mathbf{z}}, \qquad D[\mathbf{z} : \mathbf{z} + d\mathbf{z}] = \frac{1}{2}\sum g_{ij}\, dz^i\, dz^j$$
$$\Gamma_{ijk} = -\partial_i\,\partial_j\,\partial'_k\, D[\mathbf{z} : \mathbf{y}]\big|_{\mathbf{y}=\mathbf{z}}, \qquad \Gamma^*_{ijk} = -\partial'_i\,\partial'_j\,\partial_k\, D[\mathbf{z} : \mathbf{y}]\big|_{\mathbf{y}=\mathbf{z}}$$
where $\partial_i = \partial/\partial z^i$, $\partial'_i = \partial/\partial y^i$. A divergence thus induces a Riemannian metric and a pair of dual affine connections $\{\nabla, \nabla^*\}$:
$$X\langle Y, Z \rangle = \langle \nabla_X Y, Z \rangle + \langle Y, \nabla^*_X Z \rangle$$
Dually flat families: the exponential family, the mixture family, and $\{p(x);\ x \text{ discrete}\}$. They are dually flat, though not flat with respect to the Levi-Civita connection.

Exponential family:
$$p(x,\theta) = \exp\left\{\sum_i \theta^i x_i - \psi(\theta)\right\}$$
Potential functions and Legendre transformation:
$$\eta_i = \partial_i \psi(\theta), \qquad \theta^i = \partial^i \varphi(\eta), \qquad \psi(\theta) + \varphi(\eta) - \sum_i \theta^i \eta_i = 0$$
$$g_{ij} = \partial_i \partial_j \psi(\theta), \qquad g^{ij} = \partial^i \partial^j \varphi(\eta)$$
$\psi$: the cumulant generating function; $\varphi$: the negative entropy.

Canonical divergence (in the θ-η coordinates; here the KL-divergence):
$$D(P : P') = \psi(\theta) + \varphi(\eta') - \sum_i \theta^i \eta'_i$$
Dually Flat Manifold: Manifold with Convex Function

$S$ with coordinates $\theta = (\theta^1, \dots, \theta^n)$ and a convex function $\psi(\theta)$. Examples: the negative entropy $\varphi(p) = \int p(x)\log p(x)\, dx$; the quadratic $\psi(\theta) = \frac{1}{2}\sum(\theta^i)^2$.

Riemannian metric and flatness (affine structure) from the Bregman divergence:
$$D[\theta : \theta'] = \psi(\theta) - \psi(\theta') - \operatorname{grad}\psi(\theta')\cdot(\theta - \theta')$$
$$D[\theta : \theta + d\theta] = \frac{1}{2}\sum g_{ij}\, d\theta^i\, d\theta^j, \qquad g_{ij} = \partial_i\partial_j\psi(\theta), \quad \partial_i = \frac{\partial}{\partial\theta^i}$$
$\theta$ is an affine coordinate system: its straight lines are geodesics (not Levi-Civita geodesics). The structure is $\{S, \psi(\theta), \theta\}$.
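A minimal sketch of the Bregman divergence (the convex functions chosen here are illustrations, not from the slides): the quadratic potential gives half the squared Euclidean distance, and the negative entropy on the simplex gives the KL divergence.

```python
import numpy as np

def bregman(theta, theta_p, psi, grad_psi):
    """D[theta : theta'] = psi(theta) - psi(theta') - grad psi(theta').(theta - theta')."""
    return psi(theta) - psi(theta_p) - grad_psi(theta_p) @ (theta - theta_p)

# psi(theta) = (1/2) sum theta_i^2: the self-dual Euclidean case.
psi = lambda t: 0.5 * np.sum(t**2)
a, b = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman(a, b, psi, lambda t: t), 0.5 * np.sum((a - b)**2))  # equal

# phi(p) = sum p log p (negative entropy): Bregman divergence = KL for normalized p, q.
phi = lambda p: np.sum(p * np.log(p))
p, q = np.array([0.5, 0.3, 0.2]), np.array([0.2, 0.5, 0.3])
print(bregman(p, q, phi, lambda p: np.log(p) + 1), np.sum(p * np.log(p / q)))
```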
Legendre Transformation

$$\eta_i = \partial_i\psi(\theta), \qquad \theta \leftrightarrow \eta$$
$$\varphi(\eta) = \max_\theta\{\theta\cdot\eta - \psi(\theta)\}, \qquad \varphi(\eta) + \psi(\theta) - \theta\cdot\eta = 0, \qquad \theta^i = \partial^i\varphi(\eta)$$
$$D[\theta : \theta'] = \psi(\theta) + \varphi(\eta') - \theta\cdot\eta'$$

Two affine coordinate systems $(\theta, \eta)$:
$\theta$: geodesics (e-geodesics); $\eta$: dual geodesics (m-geodesics). They are "dually orthogonal":
$$\langle \partial_i, \partial^j \rangle = \delta_i^j, \qquad \partial_i = \frac{\partial}{\partial\theta^i}, \quad \partial^j = \frac{\partial}{\partial\eta_j}$$
$$X\langle Y, Z \rangle = \langle \nabla_X Y, Z \rangle + \langle Y, \nabla^*_X Z \rangle$$

Pythagorean theorem (dually flat manifold): when the m-geodesic from P to Q meets the e-geodesic from Q to R orthogonally,
$$D[P : Q] + D[Q : R] = D[P : R]$$
Euclidean space: self-dual, $\theta = \eta$, $\psi(\theta) = \frac{1}{2}\sum(\theta^i)^2$.

Projections onto a submanifold $M$:
$$\min_{Q \in M} D[P : Q]: \quad Q = \text{m-geodesic projection of } P \text{ to } M$$
$$\min_{Q \in M} D[Q : P]: \quad Q' = \text{e-geodesic projection of } P \text{ to } M$$
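A concrete instance of the m-projection (assumption: a 2x2 joint distribution as a NumPy array): the m-projection of $p(x,y)$ onto the submanifold $M$ of independent distributions is the product of its marginals, the minimizer of $D[p:q]$ over $q \in M$.

```python
import numpy as np

p = np.array([[0.4, 0.1],
              [0.2, 0.3]])                  # joint p(x, y)
q = np.outer(p.sum(axis=1), p.sum(axis=0))  # product of marginals: m-projection

kl = lambda a, b: np.sum(a * np.log(a / b))
print(q)
print(kl(p, q))   # D[p : q] at the projection; no independent q does better
```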
Invariant divergence (Chentsov, Csiszár): on the space of probability distributions $S = \{p\}$, the f-divergence
$$D[p : q] = \int p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx$$
induces the invariant Fisher structure: the Fisher information metric and the α-connections.

Flat divergence (Bregman): derived from a convex function.

The KL-divergence
$$D[p : q] = \int p(x)\log\frac{p(x)}{q(x)}\, dx$$
belongs to both classes: it is at once invariant (an f-divergence, in fact an α-divergence) and flat (a Bregman divergence).
Applications: statistical inference; machine learning and AI; computer vision; convex programming; signal processing (ICA, sparse coding); information theory; systems theory; quantum information geometry.
Curved exponential family:
$$p(x, u) = \exp\{\theta(u)\cdot x - \psi(\theta(u))\}$$
Observations $x_1, \dots, x_n \sim p(x, u)$; sufficient statistic $\bar{x} = \frac{1}{n}\sum_{k=1}^n x_k$; estimator $\hat u(x_1, \dots, x_n)$, with $\hat\eta = \bar x$.

The discrete case is itself an exponential family. For $x \in X = \{0, 1, \dots, n\}$:
$$S_n = \{p(x)\}: \quad p(x) = \sum_i p_i\,\delta_i(x), \qquad \log p(x) = \sum_i \theta^i \delta_i(x) - \psi$$
$$\theta^i = \log\frac{p_i}{p_0}, \qquad \psi = -\log p_0, \qquad \eta_i = p_i$$
High-Order Asymptotics

Model $p(x, u)$; observations $x_1, \dots, x_n$; estimator $\hat u(x_1, \dots, x_n)$. Error covariance:
$$e = E\big[(\hat u - u)(\hat u - u)^T\big] = \frac{1}{n}G_1 + \frac{1}{n^2}G_2 + \cdots$$
First order: $G_1 \ge G^{-1}$ (Cramér-Rao: linear theory). Second order (quadratic approximation, for the MLE with $\hat\eta = \bar x$), schematically:
$$2G_2 = \big(\Gamma^{(m)}_M\big)^2 + 2\big(H^{(e)}_A\big)^2$$
involving the squared mixture connection of the model $M$ and the squared exponential curvature of the ancillary family $A$.
Stochastic reasoning: $q(x_1, x_2, x_3, \dots \mid \text{observation})$, $\mathbf{x} = (x_1, x_2, x_3, \dots)$, $x_i = \pm 1$.
Maximum likelihood: $\hat{\mathbf{x}} = \arg\max q(x_1, x_2, x_3, \dots)$.
Least bit-error-rate estimator: $\hat x_i = \operatorname{sgn} E[x_i]$.

Marginalization is the projection to independent distributions:
$$q_i(x_i) = \sum_{x_1, \dots, x_n \,\setminus\, x_i} q(x_1, \dots, x_n)$$

The distribution has the form
$$q(\mathbf{x}) = \exp\left\{\sum_{r=1}^L c_r(\mathbf{x}) + \mathbf{h}\cdot\mathbf{x} - \psi\right\}, \qquad x_i = \pm 1,$$
with interaction terms such as $c_r(\mathbf{x}) = w_{ij}x_i x_j$: Boltzmann machines, spin glasses, neural networks, turbo codes, LDPC codes.
Computationally Difficult

The expectation $\eta = E_q[\mathbf{x}]$ requires summing over exponentially many configurations. Approximations: mean-field approximation, belief propagation, tree propagation, CCCP (convex-concave procedure).
Projections to the independent family $M_0 = \{\Pi_i\, p_i(x_i)\}$, $p(\mathbf{x}) \in M_0$:
$$D[q : p] = \sum_{\mathbf{x}} q(\mathbf{x})\log\frac{q(\mathbf{x})}{p(\mathbf{x})}$$
$$p = \arg\min_{p \in M_0} D[q : p]: \text{ m-projection } \Pi_m; \qquad p = \arg\min_{p \in M_0} D[p : q]: \text{ e-projection } \Pi_e$$
Belief Propagation

Split the interactions $r = 1, \dots, L$ into partial models
$$M_r = \left\{p_r(\mathbf{x}, \xi_r) = \exp\{c_r(\mathbf{x}) + \xi_r\cdot\mathbf{x} - \psi_r\}\right\}$$
each keeping one interaction term $c_r$ and replacing the others by a linear belief term $\xi_r$. The BP iteration m-projects each $p_r$ to the independent family $M_0$ and exchanges the resulting beliefs:
$$\theta_r^{t+1} = \Pi_0\, p_r(\mathbf{x}, \xi_r^t) - \xi_r^t \quad (\text{belief for } c_r(\mathbf{x})), \qquad \xi_r^{t+1} = \sum_{r' \neq r}\theta_{r'}^{t+1}$$
Equilibrium of BP $(\theta^*, \xi_r^*)$

1) m-condition: all m-projections agree,
$$p_0(\mathbf{x}, \theta^*) = \Pi_0\, p_r(\mathbf{x}, \xi_r^*) \quad \text{for every } r$$
2) e-condition:
$$\theta^* = \frac{1}{L-1}\sum_r \xi_r^*$$
An equilibrium is a critical point of a free energy $F$: $\partial F/\partial\theta = 0$ gives the e-condition and $\partial F/\partial\zeta = 0$ the m-condition; $F$ is not convex.
Belief propagation keeps the e-condition satisfied throughout:
$$\theta = \frac{1}{L-1}\sum_r \xi_r, \qquad (\xi_1, \xi_2, \dots, \xi_L) \to (\xi'_1, \xi'_2, \dots, \xi'_L) \to \cdots$$
iterating $\xi'_r = \Pi_0\, p_r(\mathbf{x}, \theta) - \theta$ and $\theta' = \frac{1}{L-1}\sum_r \xi'_r$ until the m-condition also holds.

Convex-Concave Computational Procedure (CCCP, Yuille): enforce the m-condition and e-condition in alternating loops; elimination of the double loops.
Boltzmann machine:
$$p(x_i = 1) = \varphi\Big(\sum_j w_{ij}x_j - h_i\Big), \qquad p(\mathbf{x}) = \exp\Big\{\sum w_{ij}x_i x_j + \sum h_i x_i - \psi(w)\Big\}$$
(Figure: units $x_1, x_2, x_3, x_4$; target $q(\mathbf{x})$ and model $\hat p(\mathbf{x})$ in the manifold $B$ of Boltzmann machines.)
EM Algorithm

Hidden variables $\mathbf{y}$: model $M = \{p(\mathbf{x}, \mathbf{y}; \mathbf{u})\}$, observed data $D = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$.
Data manifold: $D_M = \{p(\mathbf{x}, \mathbf{y}) : p(\mathbf{x}) = p_D(\mathbf{x})\}$, the distributions whose $\mathbf{x}$-marginal matches the data.
M-step (m-projection to $M$): $\min_{\mathbf{u}} KL[\hat p(\mathbf{x}, \mathbf{y}) : p(\mathbf{x}, \mathbf{y}; \mathbf{u})]$.
E-step (e-projection to $D_M$): $\min_{\hat p \in D_M} KL[\hat p(\mathbf{x}, \mathbf{y}) : p(\mathbf{x}, \mathbf{y}; \mathbf{u})]$.
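A brief sketch of this projection view of EM on a two-component Gaussian mixture with unit variances (the data, initialization, and iteration count are illustrative assumptions): the E-step fills in the responsibilities, the M-step re-estimates the parameters (m-projection to the model).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

pi = np.array([0.5, 0.5])      # mixing weights
mu = np.array([-1.0, 1.0])     # component means (variances fixed at 1)
for _ in range(50):
    # E-step: responsibilities r(y | x) -- fill in the hidden variable
    dens = pi * np.exp(-0.5 * (x[:, None] - mu)**2) / np.sqrt(2 * np.pi)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize the expected log-likelihood -- m-projection to M
    pi = r.mean(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

print(pi, mu)   # approx [0.3, 0.7] and [-2, 3]
```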
Embedding Kernel; Conformal Change of Kernel

$$\mathbf{z} = \boldsymbol\phi(x), \qquad f(x) = \sum_i w_i\phi_i(x) = \sum_i \alpha_i y_i K(x, x_i), \qquad K(x, x') = \sum_i \phi_i(x)\phi_i(x')$$
Conformal change:
$$K(x, x') \ \longrightarrow\ \tilde K(x, x') = \rho(x)\rho(x')K(x, x'), \qquad \rho(x) = \exp\{-\kappa|f(x)|^2\}$$
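A tiny numeric sketch of the conformal change (the base kernel and decision function here are assumptions for illustration): $\tilde K$ rescales $K$ by $\rho(x) = \exp\{-\kappa|f(x)|^2\}$, leaving it nearly unchanged near the decision boundary $f = 0$ and shrinking it far away.

```python
import numpy as np

K = lambda x, xp: np.exp(-0.5 * (x - xp)**2)   # base Gaussian kernel (assumed)
f = lambda x: x - 0.3                          # assumed decision function, boundary at 0.3
rho = lambda x, kappa=1.0: np.exp(-kappa * f(x)**2)

K_tilde = lambda x, xp: rho(x) * rho(xp) * K(x, xp)

print(K(0.3, 0.4), K_tilde(0.3, 0.4))   # near the boundary: almost unchanged
print(K(2.0, 2.1), K_tilde(2.0, 2.1))   # far from it: strongly damped
```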
Signal Processing

ICA (Independent Component Analysis): observations $\mathbf{x}_t = A\mathbf{s}_t$; recover the source signals $\mathbf{s}_t$. Related: sparse component analysis, positive matrix factorization.

Mixture and unmixture:
$$\mathbf{x} = A\mathbf{s}, \qquad x_i = \sum_{j=1}^n A_{ij}s_j, \qquad \mathbf{y} = W\mathbf{x}, \quad W \approx A^{-1}$$
Recover $s(1), s(2), \dots, s(t)$ from the observed $x(1), x(2), \dots, x(t)$.
Space of Matrices: Lie Group

The multiplicative increment $W \to (I + dX)W$, i.e. $dX = dW\,W^{-1}$, is a non-holonomic basis. The invariant metric is
$$ds^2 = \operatorname{tr}(dX\, dX^T) = \operatorname{tr}\big(dW\,W^{-1}(dW\,W^{-1})^T\big)$$
and the natural gradient of a cost $l(W)$ is $\tilde\nabla l = \nabla l\; W^T W$.
Information Geometry of ICA

Natural gradient; estimating functions; stability and efficiency.
$S = \{p(\mathbf{y})\}$; independent family $I = \{q_1(y_1)q_2(y_2)\cdots q_n(y_n)\}$; model $\{p(W\mathbf{x})\}$.
Contrast function: the KL-divergence to the independent family,
$$l(W) = KL[p(\mathbf{y}; W) : q(\mathbf{y})], \qquad q(\mathbf{y}) = \prod_i q_i(y_i), \qquad r(\mathbf{y}) = \prod_i r_i(y_i)$$
which vanishes when $W = A^{-1}$ (up to scale and permutation), recovering $\mathbf{s}$.
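A minimal batch sketch of natural-gradient ICA (assumptions: nonlinearity $\phi(y) = \tanh y$, which suits super-Gaussian sources such as Laplace; step size and iteration count are arbitrary). The update $W \leftarrow W + \eta\,(I - \phi(\mathbf{y})\mathbf{y}^T)W$ is natural-gradient descent of the contrast on the Lie group of matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 2, 5000
s = rng.laplace(size=(n, T))        # independent super-Gaussian sources
A = rng.normal(size=(n, n))         # unknown mixing matrix
x = A @ s                           # observations x = A s

W = np.eye(n)
eta = 0.02
for _ in range(2000):
    y = W @ x
    C = np.tanh(y) @ y.T / T        # batch estimate of E[phi(y) y^T]
    W += eta * (np.eye(n) - C) @ W  # natural-gradient update

print(W @ A)   # should be close to a scaled permutation matrix
```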
Basis Given: Overcomplete Case, Sparse Solution

$$\mathbf{x} = A\mathbf{s} = \sum_i s_i\mathbf{a}_i$$
With more basis vectors than dimensions there are many solutions $\hat{\mathbf{s}}$ with $\mathbf{x} = A\hat{\mathbf{s}}$:
minimize $\sum \hat s_i^2$ ($L_2$): the generalized-inverse solution;
minimize $\sum |\hat s_i|$ ($L_1$): a sparse solution; non-linear denoising.

Penalty functions $F(\boldsymbol\beta)$ ($\min F$ as a penalty corresponds to a Bayes prior):
$$F_0(\boldsymbol\beta) = \#[\beta_i \neq 0] \ (L_0:\ \text{sparsest solution}); \quad F_1(\boldsymbol\beta) = \sum|\beta_i| \ (L_1); \quad F_p(\boldsymbol\beta) = \sum|\beta_i|^p \ (0 < p \le 1); \quad F_2(\boldsymbol\beta) = \sum\beta_i^2 \ (\text{generalized-inverse solution})$$
Sparse Solution: Overcomplete Case

$$\min \varphi(\boldsymbol\beta) \ (\text{convex}) \quad \text{under the constraint} \quad F(\boldsymbol\beta) \le c$$
Typical case:
$$\varphi(\boldsymbol\beta) = \frac{1}{2}\|\mathbf{y} - X\boldsymbol\beta\|^2 = \frac{1}{2}(\boldsymbol\beta - \boldsymbol\beta^*)^T G(\boldsymbol\beta - \boldsymbol\beta^*), \qquad F_p(\boldsymbol\beta) = \sum_i |\beta_i|^p, \quad p = 2,\ 1,\ 1/2$$
($p = 1$: LASSO, LARS.)

Two formulations:
$P_c$: $\min\varphi(\boldsymbol\beta)$ under $F(\boldsymbol\beta) \le c$; solution $\boldsymbol\beta^*(c)$, with $\boldsymbol\beta^*(c) \to \boldsymbol\beta^*$ as $c \to \infty$.
$P_\lambda$: $\min\varphi(\boldsymbol\beta) + \lambda F(\boldsymbol\beta)$; solution $\boldsymbol\beta^*(\lambda)$, with $\boldsymbol\beta^*(\lambda) \to \boldsymbol\beta^*$ as $\lambda \to 0$.
For $p \ge 1$ the two solutions coincide under a correspondence $\lambda = \lambda(c)$; for $p < 1$ the correspondence is multiple-valued and non-continuous, and the stability differs.
Projection from $\boldsymbol\beta^*$ to $F = c$ (information geometry).

Convex Cone Programming

$P$: the cone of positive semi-definite matrices; a convex potential function; the dual geodesic approach.
$$A\mathbf{x} = \mathbf{b}, \qquad \min \mathbf{c}\cdot\mathbf{x}$$
Related: the support vector machine.
(Figure: constraint regions $F_p \le c$ for (a) $p > 1$, (b) $p = 1$, (c) $p < 1$: non-convex.)

Dual geodesic projection:
$$\min_{\boldsymbol\beta:\ F(\boldsymbol\beta) = c} D[\boldsymbol\beta^* : \boldsymbol\beta], \qquad \boldsymbol\eta_c - \boldsymbol\eta^* \propto \nabla F(\boldsymbol\beta_c) \quad (\text{the dual normal } \mathbf{n} = \nabla F)$$
LASSO Path and LARS Path (Stagewise Solution)

$$\min \varphi(\boldsymbol\beta) \text{ under } F(\boldsymbol\beta) = c \quad \Longleftrightarrow \quad \min \varphi(\boldsymbol\beta) + \lambda F(\boldsymbol\beta), \qquad (c, \lambda): \text{ correspondence } \boldsymbol\beta^*(c) \Leftrightarrow \boldsymbol\beta^*(\lambda)$$
Subgradient of $F_p$ on the active set $A = \{i : \beta_i \neq 0\}$:
$$\nabla F(\boldsymbol\beta)_i = \begin{cases} p\,|\beta_i|^{p-1}\operatorname{sgn}\beta_i, & i \in A \\ [-1, 1] \ (p = 1), \ \ (-\infty, \infty) \ (p < 1), & i \notin A \end{cases}$$

Solution path: differentiating the optimality condition $\nabla\varphi(\boldsymbol\beta^*) + \lambda\nabla F(\boldsymbol\beta^*) = 0$ along the path,
$$K_c\,\dot{\boldsymbol\beta}^* = -\dot\lambda\,\nabla F(\boldsymbol\beta^*), \qquad K_c = G + \lambda\nabla\nabla F(\boldsymbol\beta^*)$$
For $p = 1$, $\nabla\nabla F = 0$ and $\nabla F = (\operatorname{sgn}\beta_i)$ on the active set, so the path lies in the subspace of the active set:
$$\dot{\boldsymbol\beta}^*_A = -\dot\lambda\, K_A^{-1}\nabla F_A(\boldsymbol\beta^*) \quad (\text{the active direction})$$
At a turning point the active set changes: $A \to A'$.
Gradient Descent Method

$$\nabla L = \left\{\frac{\partial L(\mathbf{x})}{\partial x_i}\right\}: \text{covariant}; \qquad \tilde\nabla L = \left\{\sum_j g^{ij}\frac{\partial L(\mathbf{x})}{\partial x_j}\right\}: \text{contravariant}$$
The natural gradient $\tilde\nabla L$ minimizes $L(\mathbf{x} + \mathbf{a})$ under the Riemannian constraint $\varepsilon^2 = \sum g_{ij}a^i a^j$:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta_t\,\tilde\nabla L(\mathbf{x}_t)$$
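A toy sketch of the natural-gradient update (the metric and loss are assumptions chosen so the answer is known): with $\tilde\nabla L = G^{-1}\nabla L$, the iteration heads straight for the minimum even though the metric is anisotropic.

```python
import numpy as np

G = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # Riemannian metric (assumed fixed)
grad_L = lambda x: G @ x              # gradient of the toy loss L(x) = x.G.x / 2

x = np.array([1.0, -2.0])
eta = 0.5
for _ in range(10):
    x = x - eta * np.linalg.solve(G, grad_L(x))   # x -= eta * G^{-1} grad L
print(x)   # -> [0, 0], the minimum
```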
Extended LARS ($p = 1$) and Minkovskian Gradient

Replace the Riemannian constraint by an $L_p$-norm constraint: maximize the decrease of $L(\boldsymbol\beta + \mathbf{a})$ under $\|\mathbf{a}\|_p = \varepsilon$. For general $p$ the steepest direction is
$$(\tilde\nabla F)_i = c\,|f_i|^{1/(p-1)}\operatorname{sgn} f_i, \qquad \mathbf{f} = \nabla F$$
As $p \to 1$ it concentrates on the maximal components: with $f^* = \max_i |f_i|$ and $i^* = \arg\max_i |f_i|$,
$$(\tilde\nabla F)_i = \begin{cases}\operatorname{sgn} f_i, & |f_i| = f^* \\ 0, & \text{otherwise}\end{cases}$$
$$\boldsymbol\beta_{t+1} = \boldsymbol\beta_t - \eta_t\,\tilde\nabla F$$
This reproduces LARS; $p = 2$ is the Euclidean case.
Euclidean case: $\varphi(\boldsymbol\beta) = \frac{1}{2}\|\boldsymbol\beta - \boldsymbol\beta^*\|^2$. Minimizing
$$f(\boldsymbol\beta) = \frac{1}{2}\|\boldsymbol\beta - \boldsymbol\beta^*\|^2 + \lambda\sum_i|\beta_i|$$
decouples per component; the solution is the soft threshold $\beta_i = \operatorname{sgn}(\beta_i^*)\max(|\beta_i^*| - \lambda, 0)$, with the $\lambda \leftrightarrow c$ correspondence as before.
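The decoupled minimization above in two lines (NumPy assumed): the soft-thresholding operator that underlies the LASSO path.

```python
import numpy as np

def soft_threshold(beta_star, lam):
    """argmin_b (1/2)(b - beta*)^2 + lam |b|, applied componentwise."""
    return np.sign(beta_star) * np.maximum(np.abs(beta_star) - lam, 0.0)

print(soft_threshold(np.array([3.0, -0.4, 1.2, 0.05]), lam=0.5))
# [ 2.5 -0.   0.7  0. ]  -- small components are set exactly to zero
```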
$L_{1/2}$ Constraint: Non-Convex Optimization

$$P_c: \quad \min\|\boldsymbol\beta - \boldsymbol\beta^*\|^2 \ \text{ under } \ \sum_i|\beta_i|^{1/2} \le c, \qquad \hat{\boldsymbol\beta}_c = \boldsymbol\beta(c)$$
$$P_\lambda: \quad \nabla f = 0: \ \ \boldsymbol\beta - \boldsymbol\beta^* + \lambda\nabla F(\boldsymbol\beta) = 0$$
$$\hat{\boldsymbol\beta} = R_\lambda(\boldsymbol\beta^*): \ \text{Xu Zongben's (half-thresholding) operator}, \qquad \lambda_c = \lambda_c(\boldsymbol\beta^*)$$
ICCN-Huangshan(黄山)
Shun-ichi Amari (甘利俊一)
RIKEN Brain Science Institute (Collaborator: Masahiro Yukawa, Niigata University)
Solution Path

For $p < 1$ the path is not continuous and not monotone: the correspondences $c \leftrightarrow \lambda$ and $\boldsymbol\beta_c \Leftrightarrow \boldsymbol\beta_\lambda$ jump. (Figure: path of the components $\beta_1$, $\beta_2$.)
Convex Programming: Inner Method

$$LP: \quad A\mathbf{x} \ge \mathbf{b}, \qquad \min \mathbf{c}\cdot\mathbf{x}$$
Log-barrier potential:
$$\psi(\mathbf{x}) = -\sum_i \log\Big(\sum_j A_{ij}x_j - b_i\Big) \ \big(\text{plus } \log x_i \text{ terms for sign constraints}\big), \qquad \boldsymbol\eta = \partial\psi(\mathbf{x})$$
Simplex method vs. inner (interior-point) method: polynomial-time algorithms. The minimizers of $\mathbf{c}\cdot\mathbf{x} + t\,\psi(\mathbf{x})$ trace the central path $\mathbf{x}^*(t)$; its curvature $(H^{(m)})^2$ governs the admissible step size $\boldsymbol\delta$ of the polynomial-time algorithm.
Neural Networks

Higher-order correlations; synchronous firing; multilayer perceptrons.

Multilayer Perceptron

$$y = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x}) + n$$
$$p(y \mid \mathbf{x}; \boldsymbol\theta) = c\exp\Big\{-\frac{1}{2}\big(y - f(\mathbf{x}, \boldsymbol\theta)\big)^2\Big\}, \qquad f(\mathbf{x}, \boldsymbol\theta) = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x})$$
$$\mathbf{x} = (x_1, x_2, \dots, x_n), \qquad \boldsymbol\theta = (\mathbf{w}_1, \dots, \mathbf{w}_m; v_1, \dots, v_m)$$
The parameter space is a neuromanifold embedded in the space of functions $\psi(\mathbf{x})$. Geometry of singular models: for $y = v\,\varphi(\mathbf{w}\cdot\mathbf{x}) + n$ the Fisher metric degenerates in the $(\mathbf{w}, v)$ coordinates.
Backpropagation: Gradient Learning

Examples $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots$:
$$E = \frac{1}{2}\big(y - f(\mathbf{x}, \boldsymbol\theta)\big)^2 = -\log p(y, \mathbf{x}; \boldsymbol\theta), \qquad f = \sum_i v_i\,\varphi(\mathbf{w}_i\cdot\mathbf{x})$$
$$\Delta\boldsymbol\theta_t = -\eta\,\frac{\partial E}{\partial\boldsymbol\theta}$$
Natural gradient (Riemannian):
$$\tilde\nabla E = G^{-1}\nabla E$$
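A compact sketch of this gradient learning rule for a one-hidden-layer perceptron (teacher function, network size, and step size are illustrative assumptions; this is the plain gradient, not the natural-gradient variant).

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 2
W = rng.normal(size=(m, d))              # theta = (W, v), phi = tanh
v = rng.normal(size=m)
target = lambda x: np.sin(x[0]) * x[1]   # assumed teacher function

eta = 0.05
for _ in range(5000):
    x = rng.uniform(-1, 1, d)
    h = np.tanh(W @ x)
    err = v @ h - target(x)                  # f(x, theta) - y
    g_v = err * h                            # dE/dv
    g_W = np.outer(err * v * (1 - h**2), x)  # dE/dW by the chain rule
    v -= eta * g_v
    W -= eta * g_W

test = rng.uniform(-1, 1, (200, d))
print(np.mean([(v @ np.tanh(W @ x) - target(x))**2 for x in test]))  # small MSE
```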
q-Fisher Information

A conformal transformation of the Fisher metric:
$$g^{(q)F}_{ij}(p) = h_q(p)\,g^F_{ij}(p)$$
q-divergence:
$$D_q[p(x) : r(x)] = \frac{1}{(1-q)\,h_q(p)}\left(1 - \int p(x)^q\, r(x)^{1-q}\, dx\right), \qquad h_q(p) = \int p(x)^q\, dx$$
Total Bregman Divergence (Vemuri; Frank Nielsen, IEEE CVPR 2010)

$$TD[\mathbf{x} : \mathbf{y}] = \frac{D[\mathbf{x} : \mathbf{y}]}{\sqrt{1 + \|\nabla f(\mathbf{y})\|^2}}$$
$$TBD(p : q) = \frac{\varphi(p) - \varphi(q) - \nabla\varphi(q)\cdot(p - q)}{\sqrt{1 + |\nabla\varphi(q)|^2}}$$
Clustering: t-Center

$$E = \{\mathbf{x}_1, \dots, \mathbf{x}_m\}, \qquad \mathbf{x}^* = \arg\min_{\mathbf{x}}\sum_i TD[\mathbf{x}_i, \mathbf{x}]$$
The t-center of $E$ satisfies
$$\nabla f(\mathbf{x}^*) = \frac{\sum_i w_i\,\nabla f(\mathbf{x}_i)}{\sum_i w_i}, \qquad w_i = \frac{1}{\sqrt{1 + \|\nabla f(\mathbf{x}_i)\|^2}}$$
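A numeric sketch of the t-center in the Euclidean case $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|^2$ (the sample points are assumptions): since $\nabla f(\mathbf{x}) = \mathbf{x}$, the fixed-point condition becomes a weighted average, which downweights far-away points.

```python
import numpy as np

grad_f = lambda x: x                       # f(x) = |x|^2 / 2
pts = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([10.0, 0.0])]

w = np.array([1 / np.sqrt(1 + grad_f(p) @ grad_f(p)) for p in pts])
t_center = sum(wi * p for wi, p in zip(w, pts)) / w.sum()

print(t_center)               # ~ [0.94, 0]: barely moved by the outlier at x = 10
print(sum(pts) / len(pts))    # plain mean ~ [3.67, 0], dragged by the outlier
```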
q-Super-Robust Estimator (Eguchi)

Replace the log-likelihood by its q-deformation: $\hat\xi$ maximizes $\sum_i \log_q p(x_i, \xi)$, equivalently solves a bias-corrected q-estimating function $\sum_i s_q(x_i, \xi) = 0$ with $s_q(x, \xi) = \partial_\xi \log_q p(x, \xi) - c_q(\xi)$. The resulting estimator is super-robust against outliers.
Conformal Change of Divergence

$$\tilde D[p : q] = \sigma(p)\, D[p : q]$$
$$\tilde g_{ij} = \sigma(p)\, g_{ij}, \qquad \tilde T_{ijk} = \sigma\, T_{ijk} + s_i g_{jk} + s_j g_{ik} + s_k g_{ij}, \qquad s_i = \partial_i\log\sigma$$
Robustness and the Influence Function

Data $E = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$; contaminate with an outlier $\mathbf{y}$: $\tilde{\mathbf{x}}^* = \mathbf{x}^* + \varepsilon\,\mathbf{z}(\mathbf{x}^*; \mathbf{y})$, $\varepsilon = 1/n$.
$\mathbf{z}(\mathbf{x}^*; \mathbf{y})$: the influence function; the estimator is robust when $\|\mathbf{z}\| < c$ as $\mathbf{y} \to \infty$.
For the t-center:
$$\mathbf{z}(\mathbf{x}^*; \mathbf{y}) = G^{-1}\, w(\mathbf{y})\big(\nabla f(\mathbf{y}) - \nabla f(\mathbf{x}^*)\big), \qquad G = \sum_i w_i\,\nabla\nabla f(\mathbf{x}_i)$$
Since
$$w(\mathbf{y})\,\nabla f(\mathbf{y}) = \frac{\nabla f(\mathbf{y})}{\sqrt{1 + \|\nabla f(\mathbf{y})\|^2}} < \infty$$
is bounded, $\mathbf{z}$ is bounded: the t-center is robust. Euclidean case $f = \frac{1}{2}\|\mathbf{x}\|^2$:
$$\mathbf{z}(\mathbf{x}^*; \mathbf{y}) = G^{-1}\,\frac{\mathbf{y} - \mathbf{x}^*}{\sqrt{1 + \|\mathbf{y}\|^2}}$$
Other TBD Applications

Diffusion tensor imaging (DTI) analysis [Vemuri] (with Meizhu Liu):
Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari and Frank Nielsen, "Total Bregman Divergence and its Applications to DTI Analysis," IEEE Transactions on Medical Imaging, to appear.
Multiterminal Information and Statistical Inference

Two terminals observe correlated sequences $X: x^n$ and $Y: y^n$ drawn from $p(x, y; \theta)$ and compress them at rates $R_X$ and $R_Y$ ($M_X = 2^{nR_X}$, $M_Y = 2^{nR_Y}$ messages); the decoder estimates $\hat\theta$ from the compressed data. Regimes: 0-rate compression; Slepian-Wolf. The Fisher information splits into marginal and correlational parts: $G = G_M + G_C$ (cf. $p(x, y)$ vs. $p(x, y, z)$).
Linear Systems: ARMA

$$\big(1 + a_1 z^{-1} + \cdots + a_p z^{-p}\big)x_t = \big(1 + b_1 z^{-1} + \cdots + b_q z^{-q}\big)u_t, \qquad x_t = f(z^{-1}; \boldsymbol\theta)\,u_t$$
$$\boldsymbol\theta = (a_1, \dots, a_p;\ b_1, \dots, b_q)$$
AR models form an e-flat family; MA models an m-flat family.
Machine Learning: Boosting

Combination of weak learners. Training data $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N)\}$, $y_i = \pm 1$.
Weak learners $h(\mathbf{x}, \mathbf{u})$: $y = \operatorname{sgn} f(\mathbf{x}, \mathbf{u})$. Combined classifier:
$$H(\mathbf{x}) = \operatorname{sgn}\Big(\sum_t \alpha_t h_t(\mathbf{x})\Big)$$
At step $t$ the weighted error is $\varepsilon_t = \operatorname{Prob}\{h_t(\mathbf{x}_i) \neq y_i;\ W_t\}$ and the weight distribution is updated by
$$W_{t+1}(i) = c\,W_t(i)\exp\{-\alpha_t\, y_i\, h_t(\mathbf{x}_i)\}$$

Boosting as generalization: each step multiplies the current conditional model by the new weak learner,
$$\tilde Q_t(y \mid \mathbf{x}) \propto \tilde Q_{t-1}(y \mid \mathbf{x})\exp\{\alpha_t\, y\, h_t(\mathbf{x})\}$$
and chooses $\alpha_t$ by $\min_\alpha D[P : \tilde Q_t]$, so the divergence to the empirical distribution $P(y \mid \mathbf{x})$ decreases monotonically:
$$D[P, \tilde Q_{t+1}] < D[P, \tilde Q_t]$$
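A compact sketch of this boosting loop (assumptions: 1-D data and decision stumps as the weak learners; all names are illustrative): $\varepsilon_t$ is the weighted error, $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$, and the weights follow the update above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sign(x + 0.1 * rng.normal(size=200))    # noisy labels in {-1, +1}

def best_stump(x, y, w):
    """Weighted-error-minimizing h(x) = s * sgn(x - theta)."""
    best = None
    for theta in x:
        for s in (+1.0, -1.0):
            err = np.sum(w[s * np.sign(x - theta) != y])
            if best is None or err < best[0]:
                best = (err, theta, s)
    return best

w = np.ones(len(x)) / len(x)
F = np.zeros(len(x))
for t in range(20):
    eps, theta, s = best_stump(x, y, w)
    alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
    h = s * np.sign(x - theta)
    F += alpha * h
    w *= np.exp(-alpha * y * h)    # W_{t+1}(i) ~ W_t(i) exp{-alpha y h}
    w /= w.sum()

print(np.mean(np.sign(F) == y))    # training accuracy of H(x)
```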
Means: Arithmetic, Geometric, Harmonic

$$\text{arithmetic: } \frac{a+b}{2}, \qquad \text{geometric: } \sqrt{ab}, \qquad \text{harmonic: } \frac{2}{\frac{1}{a}+\frac{1}{b}}$$
Any other mean? For a monotone function $f$ (the f-representation of $u$), define the f-mean
$$m_f(a, b) = f^{-1}\left(\frac{f(a) + f(b)}{2}\right)$$
The f-mean is scale-free, $m_f(ca, cb) = c\,m_f(a, b)$, iff
$$f(u) = u^{\frac{1-\alpha}{2}} \ (\alpha \neq 1), \qquad f(u) = \log u \ (\alpha = 1)$$
the α-representation. Special cases:
$$\alpha = -1: \ \frac{a+b}{2}; \qquad \alpha = 1: \ \sqrt{ab}; \qquad \alpha = 3: \ \frac{2ab}{a+b}; \qquad \alpha = 0: \ \frac{a + b + 2\sqrt{ab}}{4}$$
$$\alpha = \infty: \ \min(a, b); \qquad \alpha = -\infty: \ \max(a, b)$$
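The f-mean in code, specialized to the α-representation above (NumPy assumed; $\alpha = 1$ handled separately as the log case).

```python
import numpy as np

def alpha_mean(a, b, alpha):
    """m_f(a,b) = f^{-1}((f(a)+f(b))/2) with f(u) = u^((1-alpha)/2), log u at alpha=1."""
    if alpha == 1:
        return np.sqrt(a * b)        # geometric mean (log-representation)
    k = (1 - alpha) / 2
    return ((a**k + b**k) / 2) ** (1 / k)

a, b = 1.0, 4.0
print(alpha_mean(a, b, -1))   # arithmetic: 2.5
print(alpha_mean(a, b, 1))    # geometric:  2.0
print(alpha_mean(a, b, 3))    # harmonic:   1.6
print(alpha_mean(a, b, 0))    # (a + b + 2 sqrt(ab)) / 4 = 2.25
```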
α-Mean of Distributions

$$m_\alpha\big(p_1(s), p_2(s)\big)$$
For a family $\{p_1(s), \dots, p_k(s)\}$:
mixture family: $p_{mix}(s) = \sum_i t_i\, p_i(s)$, $\sum t_i = 1$;
exponential family: $\log p_{exp}(s) = \sum_i t_i \log p_i(s) - \psi$;
α-family:
$$p(x; \boldsymbol\theta) = f_\alpha^{-1}\Big\{\sum_i \theta^i f_\alpha\big(p_i(x)\big)\Big\}$$
(with $f_\alpha$ the α-representation above; the normalization constant is suppressed).