
Information Geometry and Its Applications to Machine Learning

Machine Learning SS: Kyoto U.
Information Geometry and Its Applications to Machine Learning
Shun-ichi Amari, RIKEN Brain Science Institute
Information Geometry: Manifolds of Probability Distributions $M = \{p(x)\}$


  1. Stochastic Reasoning: joint model $p(x, y, z, r, s)$ and conditional $p(x, y, z \mid r, s)$, with binary variables $x, y, z, \ldots \in \{1, -1\}$.

  2. Stochastic Reasoning: posterior $q(x_1, x_2, x_3, \ldots \mid \text{observation})$, $x = (x_1, x_2, x_3, \ldots)$, $x_i = 1, -1$. Maximum likelihood: $\hat{x} = \arg\max q(x_1, x_2, x_3, \ldots)$. Least bit-error-rate estimator: $\hat{x}_i = \mathrm{sgn}\, E[x_i]$.

  3. Mean Value. Marginalization is projection to independent distributions: $q_0(x) = q_1(x_1)\, q_2(x_2) \cdots q_n(x_n)$, where $q_i(x_i) = \int q(x_1, \ldots, x_n)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_n$, and $\eta = E_q[x] = E_{q_0}[x]$.

  4. $q(x) = \exp\bigl\{\sum_{r=1}^{L} c_r(x) + \sum_i k_i x_i - \psi\bigr\}$, $c_r(x) = c_r(x_{i_1}, \ldots, x_{i_s})$, $x_i \in \{-1, 1\}$. Example: $q(x) = \exp\bigl\{\sum w_{ij} x_i x_j + \sum h_i x_i - \psi\bigr\}$: Boltzmann machine, spin glass, neural networks, Turbo codes, LDPC codes.

  5. Computationally Difficult: computing $q(x) \to \eta = E_q[x]$ for $q(x) = \exp\{\sum_r c_r(x) - \psi\}$ is intractable. Approximations: mean-field approximation, belief propagation (tree propagation), CCCP (convex-concave).
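
A minimal numerical sketch of the mean-field approximation mentioned above, assuming a small Boltzmann-type instance $q(x) \propto \exp\{\tfrac{1}{2}x^T W x + h \cdot x\}$ of the general $\exp\{\sum_r c_r(x) - \psi\}$ family with $x_i = \pm 1$; the fixed-point iteration $m_i \leftarrow \tanh(\sum_j w_{ij} m_j + h_i)$ approximates the intractable means $\eta = E_q[x]$, and the brute-force sum is only for checking on this tiny example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                   # number of +/-1 variables
W = 0.2 * rng.standard_normal((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)                # symmetric couplings, no self-coupling
h = 0.1 * rng.standard_normal(n)

# Naive mean-field fixed point: m_i = tanh(sum_j W_ij m_j + h_i)
m = np.zeros(n)
for _ in range(500):
    m = np.tanh(W @ m + h)

# Exact means E_q[x] by brute force (feasible only because n is tiny)
states = np.array([[1.0 if (s >> i) & 1 else -1.0 for i in range(n)]
                   for s in range(2 ** n)])
log_q = 0.5 * np.einsum("si,ij,sj->s", states, W, states) + states @ h
q = np.exp(log_q - log_q.max())
q /= q.sum()

print("mean-field eta:", np.round(m, 3))
print("exact      eta:", np.round(states.T @ q, 3))
```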

  6. Information Geometry of Mean-Field Approximation: $D[q:p] = \sum_x q(x) \log \frac{q(x)}{p(x)}$; independent family $M_0 = \{\prod_i p_i(x_i)\}$. m-projection: $\Pi_m q = \arg\min_{p \in M_0} D[q:p]$. e-projection: $\Pi_e q = \arg\min_{p_0 \in M_0} D[p_0:q]$.
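
A small numerical illustration of the m-projection onto the independent family $M_0$: for a joint distribution over two binary variables, $\Pi_m q$ is the product of the marginals, and it preserves the expectations $\eta = E_q[x]$ as stated on slide 3. The joint table below is arbitrary.

```python
import numpy as np

# Joint q(x1, x2) over x1, x2 in {-1, +1}, indexed by (0 -> -1, 1 -> +1)
q = np.array([[0.1, 0.3],
              [0.4, 0.2]])          # rows: x1, cols: x2; sums to 1

# m-projection onto independent distributions = product of the marginals
q1 = q.sum(axis=1)                  # marginal of x1
q2 = q.sum(axis=0)                  # marginal of x2
q0 = np.outer(q1, q2)               # q0(x) = q1(x1) q2(x2)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

vals = np.array([-1.0, 1.0])
E_q  = np.array([vals @ q1, vals @ q2])                 # E_q[x_i]
E_q0 = np.array([vals @ q0.sum(axis=1), vals @ q0.sum(axis=0)])

print("D[q:q0]          =", kl(q, q0))                  # KL to the projection
print("E_q[x] vs E_q0[x]:", E_q, E_q0)                  # expectations agree
```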

  7. Information Geometry: $q(x) = \exp\{\sum_r c_r(x) - \psi\}$; $M_0 = \{p(x, \theta) = \exp\{\theta \cdot x - \psi_0\}\}$; $M_r = \{p(x, \xi_r) = \exp\{c_r(x) + \xi_r \cdot x - \psi_r\}\}$, $r = 1, \ldots, L$.

  8. Belief Propagation: $M_r: p(x, \xi_r) = \exp\{c_r(x) + \xi_r \cdot x - \psi_r\}$. Update: $\theta_r^{t+1} = \Pi_0\, p(x, \xi_r^t) - \xi_r^t$ (the belief for $c_r(x)$), $\theta^{t+1} = \sum_r \theta_r^{t+1}$.

  9. Belief Prop Algorithm (figure: iterative projections between $M_0$, $M_r$, $M_{r'}$ with coordinates $\zeta_r$, $\zeta_{r'}$).

  10. Equilibrium of BP $(\theta^*, \xi_r^*)$: 1) m-condition: $p(x, \theta^*) = \Pi_0\, p_r(x, \xi_r^*)$ (an m-flat submanifold $M(\theta^*)$ connects $M_0$ and the $M_r$); 2) e-condition: $\theta^* = \frac{1}{L-1} \sum_r \xi_r^*$ ($q(x)$ belongs to the corresponding e-flat submanifold through $M_0$).

  11. Free energy: $F(\theta, \zeta_1, \ldots, \zeta_L) = D[p_0 : q] - \sum_r D[p_0 : p_r]$. At a critical point, $\partial F / \partial \theta = 0$ gives the e-condition and $\partial F / \partial \zeta_r = 0$ gives the m-condition; $F$ is not convex.

  12. Belief Propagation keeps the e-condition at every step: $\theta' = \frac{1}{L-1} \sum_r \xi_r'$, updating $(\xi_1, \xi_2, \ldots, \xi_L) \to (\xi_1', \xi_2', \ldots, \xi_L')$. CCCP keeps the m-condition at every step: $\theta \to \theta'$ with $\xi_1(\theta'), \xi_2(\theta'), \ldots, \xi_L(\theta')$ determined by $p_0(x, \theta') = \Pi_0\, p_r(x, \xi_r')$.

  13. The iteration in coordinates: $\theta_r^{t+1} = \Pi_0\, p(x, \xi_r^t) - \xi_r^t$, $\theta^{t+1} = \sum_r \theta_r^{t+1}$, $\xi_r^{t+1} = \theta^{t+1} - \theta_r^{t+1}$.

  14. Convex-Concave Computational Procedure (CCCP) (Yuille): decompose $F(\theta) = F_1(\theta) - F_2(\theta)$ and iterate $\nabla F_1(\theta^{t+1}) = \nabla F_2(\theta^t)$; this eliminates the double loop.
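
A one-dimensional sketch of the CCCP update $\nabla F_1(\theta^{t+1}) = \nabla F_2(\theta^t)$, using an illustrative decomposition $F(\theta) = F_1(\theta) - F_2(\theta)$ with $F_1 = \theta^4/4$ and $F_2 = a\theta^2/2$ for which the inner solve has a closed form; this is not the BP free energy itself, only the iteration pattern.

```python
import numpy as np

# F(theta) = F1(theta) - F2(theta), with F1 and F2 both convex
a = 2.0
F   = lambda t: t**4 / 4 - a * t**2 / 2
dF1 = lambda t: t**3            # gradient of the convex part F1
dF2 = lambda t: a * t           # gradient of the subtracted convex part F2

theta = 0.1                     # initial point
for step in range(30):
    # CCCP update: solve dF1(theta_new) = dF2(theta_old) for theta_new
    rhs = dF2(theta)
    theta = np.sign(rhs) * np.abs(rhs) ** (1.0 / 3.0)

print("theta* =", theta, " (stationary points of F are 0 and +/- sqrt(a) =", np.sqrt(a), ")")
print("F(theta*) =", F(theta))  # F decreases monotonically along the iteration
```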

  15. Boltzmann Machine: units $x_1, \ldots, x_4$ with stochastic update $p(x_i = 1 \mid x) = \varphi\bigl(\sum_j w_{ij} x_j - h_i\bigr)$ and equilibrium distribution $p(x) = \exp\bigl\{\sum w_{ij} x_i x_j - \sum h_i x_i - \psi\bigr\}$; the model manifold $B = \{\hat p(x)\}$ approximates a target $q(x)$.
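
A Gibbs-sampling sketch for the Boltzmann machine on this slide, $p(x) \propto \exp\{\tfrac{1}{2}\sum w_{ij} x_i x_j - \sum h_i x_i\}$ with $x_i = \pm 1$; the conditional $p(x_i = 1 \mid x_{\text{rest}}) = \sigma\bigl(2(\sum_j w_{ij} x_j - h_i)\bigr)$ follows from the energy, while $W$, $h$ and the chain length are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
W = 0.15 * rng.standard_normal((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
h = 0.05 * rng.standard_normal(n)

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

x = rng.choice([-1.0, 1.0], size=n)       # random initial configuration
means = np.zeros(n)
kept = 0
for sweep in range(6000):
    for i in range(n):
        # p(x_i = +1 | rest) for p(x) proportional to exp{0.5 x^T W x - h.x}
        field = W[i] @ x - h[i]
        x[i] = 1.0 if rng.random() < sigmoid(2.0 * field) else -1.0
    if sweep >= 1000:                      # discard burn-in, then accumulate
        means += x
        kept += 1

print("Monte Carlo estimate of E[x_i]:", np.round(means / kept, 2))
```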

  16. Boltzmann machine with hidden units: trained by the EM algorithm, which alternates e-projection onto the data manifold $D$ and m-projection back onto the model manifold $M$.

  17. EM algorithm with hidden variables: model $p(x, y; u)$ with observed $x$ and hidden $y$; data $D = \{x_1, \ldots, x_N\}$; model manifold $M = \{p(x, y; u)\}$; data manifold $D = \{p(x, y) : p(x) = p_D(x)\}$, the distributions whose marginal over $x$ equals the empirical distribution. m-projection to $M$: $\min_{p \in M} KL[\hat p(x, y) : p]$. e-projection to $D$: $\min_{\hat p \in D} KL[\hat p : p(x, y; u)]$.

  18. SVM: support vector machine. Embedding $z = \varphi(x)$; $f(x) = \sum_i w_i \varphi_i(x) = \sum_i \alpha_i y_i K(x, x_i)$; kernel $K(x, x') = \sum_i \varphi_i(x) \varphi_i(x')$. Conformal change of kernel: $K(x, x') \to \rho(x)\rho(x') K(x, x')$ with $\rho(x) = \exp\{-\kappa\, |f(x)|^2\}$.
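
A sketch of the conformal change of kernel written above, $\tilde K(x, x') = \rho(x)\rho(x')K(x, x')$ with $\rho(x) = \exp\{-\kappa |f(x)|^2\}$. The Gaussian (RBF) kernel, the parameters $\kappa$, $\gamma$, and the placeholder decision function $f$ are assumptions for illustration; in the conformal-kernel procedure $f$ would come from a first-pass SVM and the modified kernel is then used to retrain.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """K(x, x') = exp(-gamma ||x - x'||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def conformal_kernel(X, Y, f, kappa=1.0, gamma=1.0):
    """K~(x, x') = rho(x) rho(x') K(x, x') with rho(x) = exp(-kappa f(x)^2).

    rho peaks on the boundary f(x) = 0, so resolution is enhanced there
    relative to points far from the boundary."""
    rho_x = np.exp(-kappa * f(X) ** 2)
    rho_y = np.exp(-kappa * f(Y) ** 2)
    return rho_x[:, None] * rbf_kernel(X, Y, gamma) * rho_y[None, :]

# toy usage with a placeholder first-pass decision function f(x) = x_1 + x_2 - 1
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
f = lambda Z: Z[:, 0] + Z[:, 1] - 1.0
print(np.round(conformal_kernel(X, X, f, kappa=0.5, gamma=0.5), 3))
```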

  19. Signal Processing. ICA: Independent Component Analysis: $x_t = A s_t$, recover $s_t$ from $x_t$; related problems: sparse component analysis, positive matrix factorization.

  20. Mixture and unmixture of independent signals: $x_i = \sum_j A_{ij} s_j$, i.e. $x = A s$, with sources $s_1, \ldots, s_n$ and observations $x_1, \ldots, x_m$.

  21. Independent Component Analysis: $x = A s$, $x_i = \sum_j A_{ij} s_j$; demixing $y = W x$, $W = A^{-1}$. Observations: $x(1), x(2), \ldots, x(t)$; recover: $s(1), s(2), \ldots, s(t)$.

  22. Space of Matrices: Lie group. Perturbation $W \to W + dW$ with non-holonomic basis $dX = dW\, W^{-1}$ (equivalently, translate $W$ to $I$ and $W + dW$ to $I + dX$); invariant metric $ds^2 = \mathrm{tr}(dX\, dX^T) = \mathrm{tr}(W^{-T} dW^T\, dW\, W^{-1})$; natural gradient $\tilde\nabla l = \frac{\partial l}{\partial W} W^T W$.

  23. Information Geometry of ICA: $S = \{p(y)\}$, independent submanifold $I = \{q(y) = q_1(y_1) q_2(y_2) \cdots q_n(y_n)\}$, model $\{p(Wx)\}$; natural gradient of $l(W, y)$, e.g. $KL[p(y; W) : q(y)]$; estimating functions; stability and efficiency.

  24. Semiparametric Statistical Model: $p(x; W, r) = |W|\, r(Wx)$, $r(s) = \prod_i r_i(s_i)$, $W = A^{-1}$, with $r$ unknown; observations $x(1), x(2), \ldots, x(t)$.

  25. Natural Gradient: $\Delta W = -\eta\, \dfrac{\partial l(y, W)}{\partial W}\, W^T W$.
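
A sketch of the natural-gradient ICA rule: substituting the ICA log-likelihood loss into $\Delta W = -\eta\, \partial l/\partial W\, W^T W$ gives the equivariant form $\Delta W = \eta\,(I - \varphi(y) y^T)\, W$; here $\varphi(y) = \tanh(y)$ is an assumed nonlinearity suited to super-Gaussian sources, and the mixing matrix and data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 20000

S = rng.laplace(size=(n, T))             # super-Gaussian independent sources
A = rng.standard_normal((n, n))          # unknown mixing matrix
X = A @ S                                # observations x(t) = A s(t)

W = np.eye(n)                            # demixing matrix, y = W x
eta = 0.05
phi = np.tanh                            # assumed score-like nonlinearity

for epoch in range(300):
    Y = W @ X
    C = (phi(Y) @ Y.T) / T               # estimate of E[phi(y) y^T]
    W = W + eta * (np.eye(n) - C) @ W    # natural-gradient update

# W A should be close to a scaled permutation if separation succeeded
print(np.round(W @ A, 2))
```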

  26. Basis Given: overcomplete case and sparse solution. $x = A s = \sum_i s_i a_i$ with more basis vectors $a_i$ than dimensions has many solutions $\hat s$; the sparse solution is the one with many $\hat s_i = 0$: $x_t = A \hat s_t$.

  27. $x = A \hat s$: the generalized inverse gives the minimum $L_2$-norm solution, $\min \sum_i \hat s_i^2$; the sparse solution minimizes the $L_1$ norm, $\min \sum_i |\hat s_i|$.
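
A sketch comparing the two solutions above for an underdetermined system $x = A\hat s$: the minimum-$L_2$-norm solution via the generalized (pseudo) inverse, and the minimum-$L_1$-norm solution obtained by recasting $\min \sum_i |\hat s_i|$ as a linear program ($\hat s = u - v$, $u, v \ge 0$). The problem data are synthetic, and the $L_1$ solution typically comes out much sparser and often recovers the planted sparse vector.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 8, 20                          # overcomplete: more columns than rows
A = rng.standard_normal((m, n))
s_true = np.zeros(n)
s_true[[2, 11]] = [1.5, -2.0]         # sparse ground truth
x = A @ s_true

# generalized-inverse (minimum L2 norm) solution
s_l2 = np.linalg.pinv(A) @ x

# minimum L1 norm: min sum(u + v)  s.t.  A(u - v) = x, u, v >= 0
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * n), method="highs")
s_l1 = res.x[:n] - res.x[n:]

print("L2 solution (dense): ", np.round(s_l2, 2))
print("L1 solution (sparse):", np.round(s_l1, 2))
```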

  28. Overcomplete Basis and Sparse Solution: $x = A s = \sum_i s_i a_i$; $\min \sum_i |s_i|$ ($L_1$); penalized form $\min \|x - A s\| + \alpha \|s\|_p^p$; non-linear denoising.

  29. Sparse Solution: minimize a penalty $F_p(\beta) = \sum_i |\beta_i|^p$ (a Bayes prior). $F_0(\beta) = \#\{i : \beta_i \neq 0\}$: sparsest solution; $F_1(\beta) = \sum_i |\beta_i|$: $L_1$ solution; $0 \le p \le 1$: sparse solutions in the overcomplete case; $F_2(\beta) = \sum_i \beta_i^2$: generalized-inverse solution.

  30. Optimization under Sparsity Condition: $\min \varphi(\beta)$ (a convex function) subject to the constraint $F(\beta) \le c$. Typical case: $\varphi(\beta) = \frac{1}{2}\|y - X\beta\|^2 = \frac{1}{2}(\beta - \beta^*)^T G\, (\beta - \beta^*)$, $F_p(\beta) = \sum_i |\beta_i|^p$ with $p = 2, 1, 1/2$.

  31. $L_1$-constrained optimization (LASSO). Problem $P_c$: $\min \varphi(\beta)$ under $F(\beta) \le c$; the solution $\beta^*_c$ has $\beta^*_c = 0$ at $c = 0$ and $\beta^*_c \to \beta^*$ as $c \to \infty$. Problem $P_\lambda$ (LARS): $\min \varphi(\beta) + \lambda F(\beta)$; the solution $\beta^*_\lambda$ has $\beta^*_\lambda = 0$ at $\lambda = \infty$ and $\beta^*_\lambda \to \beta^*$ as $\lambda \to 0$. For $p \ge 1$ the solutions $\beta^*_c$ and $\beta^*_\lambda$ coincide under the correspondence $\lambda = \lambda_c$; for $p < 1$ the correspondence $\lambda = \lambda_c$ is multiple-valued and not continuous, and stability differs.
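
One standard way to compute the solution of problem $P_\lambda$ for $p = 1$ is iterative soft thresholding (ISTA); the sketch below, with synthetic data, minimizes $\frac{1}{2}\|y - X\beta\|^2 + \lambda \sum_i |\beta_i|$ and shows how the number of non-zero coefficients shrinks as $\lambda$ grows.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by proximal gradient."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[0, 3, 7]] = [2.0, -1.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

for lam in [0.1, 1.0, 10.0]:
    b = ista(X, y, lam)
    print(f"lambda={lam:5.1f}  nonzeros={np.sum(np.abs(b) > 1e-6)}")
```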

  32. Projection from $\beta^*$ to the constraint surface $F = c$ (information geometry; figure).

  33. Convex Cone Programming: the cone $P$ of positive semi-definite matrices; convex potential function; dual geodesic approach; $A x = b$, $\min c \cdot x$; relation to the support vector machine.

  34. Fig. 1: constraint regions $R_c$ for $n = 2$: a) $p > 1$, b) $p = 1$, c) $p < 1$ (non-convex).

  35. Orthogonal projection, dual projection: $\min_\beta D[\beta^* : \beta]$ subject to $F(\beta) = c$ (dual geodesic projection); in dual coordinates, $\eta^*_c - \eta^* \propto \nabla F(\eta^*_c)$.

  36. Fig. 5: the subgradient $n = \nabla F(\eta^*_c)$; the projection direction is proportional to $\nabla F(\eta^*_c)$.

  37. LASSO path and LARS path (stagewise solution): constrained form $\min \varphi(\beta)$ with $F(\beta) = c$ versus penalized form $\min \varphi(\beta) + \lambda F(\beta)$; correspondence between $\beta^*_c$ and $\beta^*_\lambda$, $c \leftrightarrow \lambda$.

  38. Active set and gradient: $A(\beta) = \{i : \beta_i \neq 0\}$. $\nabla_i F(\beta) = p\, \mathrm{sgn}(\beta_i)\, |\beta_i|^{p-1}$ for $i \in A$; for $i \notin A$ the (sub)gradient is an interval, $(-\infty, \infty)$ for $p < 1$ and $[-1, 1]$ for $p = 1$.

  39. Solution path: $\nabla_A \varphi(\beta^*_c) + \lambda_c \nabla_A F(\beta^*_c) = 0$. Differentiating along the path, $\{\nabla_A \nabla_A \varphi(\beta^*_c) + \lambda_c \nabla_A \nabla_A F(\beta^*_c)\}\, \dot\beta^*_c = -\dot\lambda_c \nabla_A F(\beta^*_c)$, so $\frac{d}{dc}\beta^*_c = -\dot\lambda_c\, K_c^{-1} \nabla_A F(\beta^*_c)$ with $K_c = G(\beta^*_c) + \lambda_c \nabla\nabla F(\beta^*_c)$. For $L_1$: $\nabla\nabla F_1 = 0$, $\nabla F_1 = (\mathrm{sgn}\,\beta_i)$.

  40. Solution path in the subspace of the active set: $\nabla_A \varphi(\beta^*_\lambda) + \lambda \nabla_A F(\beta^*_\lambda) = 0$ (active direction); $\dot\beta^*_\lambda = -K^{-1} \nabla_A F(\beta^*_\lambda)$; at a turning point the active set changes, $A \to A'$.

  41. Gradient Descent Method: $\min L(x + a)$ subject to $\sum g_{ij} a^i a^j = \varepsilon^2$. $\nabla L = \{\partial L / \partial x_i\}$: covariant; $\tilde\nabla L = \{\sum_j g^{ij}\, \partial L / \partial x_j\}$: contravariant; update $x_{t+1} = x_t - c\, \tilde\nabla L(x_t)$.

  42. Extended LARS ($p = 1$) and the Minkowskian gradient: with the norm $\|a\|_p = \sum |a_i|^p$, maximize $\psi(\beta + \varepsilon a)$ under $\|a\|_p = 1$, i.e. $\max\, \psi(\beta + \varepsilon a) - \lambda \|a\|_p$. For $p = 1$, with $\eta = \nabla\psi(\beta)$, $\bigl(\tilde\nabla_1 \psi(\beta)\bigr)_i = \mathrm{sgn}(\eta_i)$ if $|\eta_i| = \max\{|\eta_1|, \ldots, |\eta_N|\}$ and $0$ otherwise.

  43. $i^* = \arg\max_i |f_i|$, $|f_{i^*}| = \max_j |f_j|$; $\tilde\nabla F_i = 1$ for $i = i^*$ (and any tied $j^*$), $0$ otherwise; LARS-type update: $\beta^{t+1} = \beta^t - \eta\, \tilde\nabla F$.
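
A sketch of the stagewise rule above on a least-squares objective: at each step only the coordinate $i^*$ with the largest gradient magnitude $|f_{i^*}|$ is moved by a small step $-\eta\,\mathrm{sgn}(f_{i^*})$. The data and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[1, 4]] = [3.0, -2.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta = np.zeros(p)
eta = 0.01                                    # small stagewise step
for t in range(2000):
    f = X.T @ (X @ beta - y)                  # gradient of 0.5 ||y - X beta||^2
    i_star = np.argmax(np.abs(f))             # coordinate with maximal |gradient|
    beta[i_star] -= eta * np.sign(f[i_star])  # move only that coordinate
print("stagewise estimate:", np.round(beta, 2))
```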

  44. Euclidean case: $\tilde\nabla F = \nabla F = f$. For $p = 1$ (limit $\alpha \to 1$): $\tilde\nabla_1 F = c\,\mathrm{sgn}(f_{i^*})\, e_{i^*}$, i.e. the vector $(0, \ldots, 0, \mathrm{sgn}(f_{i^*}), 0, \ldots, 0)^T$ with a single non-zero entry in the maximal coordinate $i^*$.

  45. $L_{1/2}$ constraint: non-convex optimization; the $\lambda$-trajectory and the $c$-trajectory. Example (1-dim): $\varphi(\beta) = \frac{1}{2}(\beta - \beta^*)^2$, $f_\lambda(\beta) = \varphi + \lambda F = \frac{1}{2}(\beta - \beta^*)^2 + \lambda |\beta|^{1/2}$.
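
A brute-force check of the one-dimensional example above: minimizing $f_\lambda(\beta) = \frac{1}{2}(\beta - \beta^*)^2 + \lambda|\beta|^{1/2}$ on a grid shows the minimizer jumping discontinuously to $0$ as $\lambda$ increases, which is the non-continuous $\lambda \leftrightarrow c$ correspondence discussed on the following slides. The grid and the $\lambda$ values are arbitrary.

```python
import numpy as np

beta_star = 1.0
grid = np.linspace(-2.0, 2.0, 400001)        # fine grid that includes beta = 0

def minimizer(lam):
    f = 0.5 * (grid - beta_star) ** 2 + lam * np.sqrt(np.abs(grid))
    return grid[np.argmin(f)]

for lam in [0.1, 0.3, 0.5, 0.7]:
    print(f"lambda={lam:.1f}  argmin beta = {minimizer(lam): .3f}")
# The minimizer jumps from a clearly non-zero value directly to 0 as lambda
# grows: for p < 1 the lambda <-> c correspondence is not continuous.
```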

  46. Problem $P_c$: $\min (\beta - \beta^*)^2$ subject to $|\beta|^{1/2} \le c$, with solution $\hat\beta_c = c^2$ when the constraint is active. Problem $P_\lambda$: $\nabla f = 0$ gives $\beta - \beta^* + \frac{\lambda}{2}\,\mathrm{sgn}(\beta)\,|\beta|^{-1/2} = 0$, solved by Xu Zongben's (half-thresholding) operator $\hat\beta = R_\lambda(\beta^*)$; this induces the correspondence $\lambda_c$ between $c$ and $\lambda$.

  47. ICCN-Huangshan (黄山): Sparse Signal Analysis. Shun-ichi Amari (甘利俊一), RIKEN Brain Science Institute (Collaborator: Masahiro Yukawa, Niigata University).

  48. Solution Path: the correspondence $\lambda \leftrightarrow c$ is not continuous and not monotone; $\beta^*_c \Leftrightarrow \beta^*_\lambda$ exhibits jumps.

  49. An example of the greedy path (figure in the $(\beta_1, \beta_2)$ plane).

  50. Linear Programming: constraints $\sum_j A_{ij} x_j \ge b_i$, objective $\max \sum_i c_i x_i$; barrier potential $\psi(x) = -\sum_i \log\bigl(\sum_j A_{ij} x_j - b_i\bigr)$; inner method.

  51. Convex Programming and the Inner Method. LP: $A x \ge b$, $x \ge 0$, $\min c \cdot x$; barrier $\psi(x) = -\bigl\{\sum_i \log\bigl(\sum_j A_{ij} x_j - b_i\bigr) + \sum_i \log x_i\bigr\}$; dual coordinates $\eta_i = \partial_i \psi(x)$. Simplex method versus inner method.
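
A rough sketch of the inner (log-barrier) method on a tiny LP; the problem data, the gradient-descent inner loop, and the schedule for the barrier weight $t$ are all illustrative simplifications (a practical interior-point solver would take Newton steps along the central path).

```python
import numpy as np

# LP:  min c.x   subject to  A x >= b,  x >= 0
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])                       # feasible set: x1 + x2 >= 1, x >= 0

def barrier(x):
    slack = A @ x - b
    if np.any(slack <= 0) or np.any(x <= 0):
        return np.inf                     # outside the interior
    return -np.sum(np.log(slack)) - np.sum(np.log(x))

def obj(x, t):
    return t * (c @ x) + barrier(x)

def grad(x, t):
    slack = A @ x - b
    return t * c - A.T @ (1.0 / slack) - 1.0 / x

x = np.array([1.0, 1.0])                  # strictly feasible starting point
t = 1.0
for outer in range(20):                   # follow the central path as t grows
    for inner in range(200):              # damped gradient descent on obj(., t)
        g = grad(x, t)
        step = 0.1
        new = x - step * g
        while step > 1e-12 and obj(new, t) > obj(x, t):
            step *= 0.5                   # backtrack to stay feasible and descend
            new = x - step * g
        if obj(new, t) < obj(x, t):
            x = new
    t *= 2.0

print("approximate LP solution:", np.round(x, 4), "(optimum is at (1, 0))")
```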

  52. Polynomial-Time Algorithm: the central path $x^*(t) = \arg\min\{t\, c \cdot x + \psi(x)\}$ behaves like a geodesic, and its curvature $H_M^2$ governs the admissible step-size $\delta$, hence the iteration complexity.

  53. Neural Networks: multilayer perceptron; higher-order correlations; synchronous firing.

  54. Multilayer Perceptrons: $y = \sum_i v_i \varphi(w_i \cdot x) + n$ (noise $n$), $x = (x_1, x_2, \ldots, x_n)$; $p(y \mid x; \theta) = c \exp\bigl\{-\frac{1}{2}\bigl(y - f(x, \theta)\bigr)^2\bigr\}$, $f(x, \theta) = \sum_i v_i \varphi(w_i \cdot x)$, $\theta = (w_1, \ldots, w_m; v_1, \ldots, v_m)$.

  55. Multilayer Perceptron: the neuromanifold is the space of functions $y = f(x, \theta) = \sum_i v_i \varphi(w_i \cdot x)$ parametrized by $\theta = (w_1, \ldots, w_m; v_1, \ldots, v_m)$.

  56. singularities

  57. Geometry of the singular model: $y = v\,\varphi(w \cdot x) + n$; the subset $W: |w|\, v = 0$ is the singular region.

  58. Backpropagation (gradient learning). Examples: $(x_1, y_1), \ldots, (x_t, y_t)$; loss $E = \frac{1}{2}\bigl(y - f(x, \theta)\bigr)^2 = -\log p(y \mid x; \theta)$; update $\Delta\theta_t = -\eta_t\, \partial E / \partial\theta$. Natural gradient (Riemannian steepest descent): $\tilde\nabla E = G^{-1} \nabla E$; $f(x, \theta) = \sum_i v_i \varphi(w_i \cdot x)$.
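
A numpy sketch of gradient learning for the model $f(x, \theta) = \sum_i v_i \varphi(w_i \cdot x)$, preconditioned with an empirical Fisher estimate as a stand-in for the natural gradient $\tilde\nabla E = G^{-1}\nabla E$; the toy data, the damping term, and the learning rate are assumptions, and the empirical Fisher is only an approximation to the true Fisher metric.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, T = 2, 3, 200
X = rng.standard_normal((T, n_in))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(T)       # toy regression target

W = 0.5 * rng.standard_normal((n_hid, n_in))              # hidden weights w_i
v = 0.5 * rng.standard_normal(n_hid)                      # output weights v_i
phi, dphi = np.tanh, lambda u: 1.0 - np.tanh(u) ** 2

def per_example_grads(W, v):
    """Gradients of E = 0.5 (y - f(x, theta))^2 for each example, flattened."""
    U = X @ W.T                                            # pre-activations (T, n_hid)
    H = phi(U)
    err = H @ v - y                                        # f(x) - y
    g_v = err[:, None] * H                                 # dE/dv_i
    g_W = (err[:, None] * dphi(U) * v[None, :])[:, :, None] * X[:, None, :]
    return np.concatenate([g_W.reshape(T, -1), g_v], axis=1)

eta, damping = 0.1, 0.1
for step in range(300):
    G_per = per_example_grads(W, v)
    g = G_per.mean(axis=0)                                 # ordinary gradient
    G = G_per.T @ G_per / T + damping * np.eye(g.size)     # empirical Fisher + damping
    theta = np.concatenate([W.ravel(), v]) - eta * np.linalg.solve(G, g)
    W, v = theta[: n_hid * n_in].reshape(n_hid, n_in), theta[n_hid * n_in:]

mse = np.mean((phi(X @ W.T) @ v - y) ** 2)
print("final mean squared error:", round(float(mse), 4))
```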

  59. Conformal transformation and $q$-Fisher information: $g_{ij}^{(q)}(p) = \dfrac{q}{h_q(p)}\, g_{ij}^F(p)$; $q$-divergence: $D_q[p(x) : r(x)] = \dfrac{1}{(1 - q)\, h_q(p)} \Bigl(1 - \int p(x)^q\, r(x)^{1-q}\, dx\Bigr)$.

  60. Total Bregman Divergence and its Applications to Shape Retrieval • Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari, Frank Nielsen IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010

  61. Total Bregman Divergence: from an ordinary Bregman divergence $D[x : y]$, define $TD[x : y] = \dfrac{D[x : y]}{\sqrt{1 + \|\nabla f(y)\|^2}}$; properties: rotational invariance, conformal geometry.

  62. Total Bregman divergence (Vemuri): $\mathrm{TBD}(p : q) = \dfrac{\varphi(p) - \varphi(q) - \nabla\varphi(q) \cdot (p - q)}{\sqrt{1 + |\nabla\varphi(q)|^2}}$.
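
A small sketch of the formula above with the convex generator $\varphi(x) = \frac{1}{2}\|x\|^2$ chosen purely for illustration, for which the ordinary Bregman divergence is $\frac{1}{2}\|p - q\|^2$ and the total Bregman divergence divides it by $\sqrt{1 + \|q\|^2}$.

```python
import numpy as np

def bregman(p, q, phi, grad_phi):
    return phi(p) - phi(q) - grad_phi(q) @ (p - q)

def total_bregman(p, q, phi, grad_phi):
    """tBD(p:q) = BD(p:q) / sqrt(1 + ||grad phi(q)||^2)."""
    return bregman(p, q, phi, grad_phi) / np.sqrt(1.0 + grad_phi(q) @ grad_phi(q))

# illustrative generator phi(x) = 0.5 ||x||^2, so grad phi(x) = x
phi = lambda x: 0.5 * x @ x
grad_phi = lambda x: x

p = np.array([1.0, 2.0])
q = np.array([0.5, -1.0])
print("Bregman      :", bregman(p, q, phi, grad_phi))      # 0.5 ||p - q||^2
print("total Bregman:", total_bregman(p, q, phi, grad_phi))
```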
