
Information Geometry and Its Applications to Machine Learning

Machine Learning SS: Kyoto U.
Information Geometry and Its Applications to Machine Learning
Shun-ichi Amari, RIKEN Brain Science Institute
Information Geometry: Manifolds of Probability Distributions $M = \{p(x)\}$


  1. Stochastic Reasoning: joint model $p(x, y, z, r, s)$ and conditional $p(x, y, z \mid r, s)$, with binary variables $x, y, z, \ldots \in \{1, -1\}$.

  2. Stochastic Reasoning: posterior $q(x_1, x_2, x_3, \ldots \mid \text{observation})$, $x = (x_1, x_2, x_3, \ldots)$, $x_i = 1, -1$. Maximum likelihood: $\hat{x} = \arg\max q(x_1, x_2, x_3, \ldots)$. Least bit-error-rate estimator: $\hat{x}_i = \mathrm{sgn}\, E[x_i]$.

  3. Mean Value. Marginalization is projection to independent distributions: $q_0(x) = q_1(x_1)\, q_2(x_2) \cdots q_n(x_n)$, where $q_i(x_i) = \int q(x_1, \ldots, x_n)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_n$, and $\eta = E_q[x] = E_{q_0}[x]$.

  4. $q(x) = \exp\bigl\{\sum_{r=1}^{L} c_r(x) + \sum_i k_i x_i - \psi\bigr\}$, $c_r(x) = c_r(x_{i_1}, \ldots, x_{i_s})$, $x_i \in \{-1, 1\}$. Example: $q(x) = \exp\bigl\{\sum w_{ij} x_i x_j + \sum h_i x_i - \psi\bigr\}$: Boltzmann machine, spin glass, neural networks, Turbo codes, LDPC codes.

  5. Computationally Difficult: computing $q(x) \to \eta = E_q[x]$ for $q(x) = \exp\{\sum_r c_r(x) - \psi\}$ is intractable. Approximations: mean-field approximation, belief propagation (tree propagation), CCCP (convex-concave).
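
A minimal numerical sketch of the mean-field approximation mentioned above, assuming a small Boltzmann-type instance $q(x) \propto \exp\{\tfrac{1}{2}x^T W x + h \cdot x\}$ of the general $\exp\{\sum_r c_r(x) - \psi\}$ family with $x_i = \pm 1$; the fixed-point iteration $m_i \leftarrow \tanh(\sum_j w_{ij} m_j + h_i)$ approximates the intractable means $\eta = E_q[x]$, and the brute-force sum is only for checking on this tiny example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                   # number of +/-1 variables
W = 0.2 * rng.standard_normal((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)                # symmetric couplings, no self-coupling
h = 0.1 * rng.standard_normal(n)

# Naive mean-field fixed point: m_i = tanh(sum_j W_ij m_j + h_i)
m = np.zeros(n)
for _ in range(500):
    m = np.tanh(W @ m + h)

# Exact means E_q[x] by brute force (feasible only because n is tiny)
states = np.array([[1.0 if (s >> i) & 1 else -1.0 for i in range(n)]
                   for s in range(2 ** n)])
log_q = 0.5 * np.einsum("si,ij,sj->s", states, W, states) + states @ h
q = np.exp(log_q - log_q.max())
q /= q.sum()

print("mean-field eta:", np.round(m, 3))
print("exact      eta:", np.round(states.T @ q, 3))
```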

  6. Information Geometry of Mean-Field Approximation: $D[q:p] = \sum_x q(x) \log \frac{q(x)}{p(x)}$; independent family $M_0 = \{\prod_i p_i(x_i)\}$. m-projection: $\Pi_m q = \arg\min_{p \in M_0} D[q:p]$. e-projection: $\Pi_e q = \arg\min_{p_0 \in M_0} D[p_0:q]$.
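
A small numerical illustration of the m-projection onto the independent family $M_0$: for a joint distribution over two binary variables, $\Pi_m q$ is the product of the marginals, and it preserves the expectations $\eta = E_q[x]$ as stated on slide 3. The joint table below is arbitrary.

```python
import numpy as np

# Joint q(x1, x2) over x1, x2 in {-1, +1}, indexed by (0 -> -1, 1 -> +1)
q = np.array([[0.1, 0.3],
              [0.4, 0.2]])          # rows: x1, cols: x2; sums to 1

# m-projection onto independent distributions = product of the marginals
q1 = q.sum(axis=1)                  # marginal of x1
q2 = q.sum(axis=0)                  # marginal of x2
q0 = np.outer(q1, q2)               # q0(x) = q1(x1) q2(x2)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

vals = np.array([-1.0, 1.0])
E_q  = np.array([vals @ q1, vals @ q2])                 # E_q[x_i]
E_q0 = np.array([vals @ q0.sum(axis=1), vals @ q0.sum(axis=0)])

print("D[q:q0]          =", kl(q, q0))                  # KL to the projection
print("E_q[x] vs E_q0[x]:", E_q, E_q0)                  # expectations agree
```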

  7. Information Geometry: $q(x) = \exp\{\sum_r c_r(x) - \psi\}$; $M_0 = \{p(x, \theta) = \exp\{\theta \cdot x - \psi_0\}\}$; $M_r = \{p(x, \xi_r) = \exp\{c_r(x) + \xi_r \cdot x - \psi_r\}\}$, $r = 1, \ldots, L$.

  8. Belief Propagation: $M_r: p(x, \xi_r) = \exp\{c_r(x) + \xi_r \cdot x - \psi_r\}$. Update: $\theta_r^{t+1} = \Pi_0\, p(x, \xi_r^t) - \xi_r^t$ (the belief for $c_r(x)$), $\theta^{t+1} = \sum_r \theta_r^{t+1}$.

  9. Belief Prop Algorithm (figure: iterative projections between $M_0$, $M_r$, $M_{r'}$ with coordinates $\zeta_r$, $\zeta_{r'}$).

  10. Equilibrium of BP $(\theta^*, \xi_r^*)$: 1) m-condition: $p(x, \theta^*) = \Pi_0\, p_r(x, \xi_r^*)$ (an m-flat submanifold $M(\theta^*)$ connects $M_0$ and the $M_r$); 2) e-condition: $\theta^* = \frac{1}{L-1} \sum_r \xi_r^*$ ($q(x)$ belongs to the corresponding e-flat submanifold through $M_0$).

  11. Free energy: $F(\theta, \zeta_1, \ldots, \zeta_L) = D[p_0 : q] - \sum_r D[p_0 : p_r]$. At a critical point, $\partial F / \partial \theta = 0$ gives the e-condition and $\partial F / \partial \zeta_r = 0$ gives the m-condition; $F$ is not convex.

  12. Belief Propagation keeps the e-condition at every step: $\theta' = \frac{1}{L-1} \sum_r \xi_r'$, updating $(\xi_1, \xi_2, \ldots, \xi_L) \to (\xi_1', \xi_2', \ldots, \xi_L')$. CCCP keeps the m-condition at every step: $\theta \to \theta'$ with $\xi_1(\theta'), \xi_2(\theta'), \ldots, \xi_L(\theta')$ determined by $p_0(x, \theta') = \Pi_0\, p_r(x, \xi_r')$.

  13. The iteration in coordinates: $\theta_r^{t+1} = \Pi_0\, p(x, \xi_r^t) - \xi_r^t$, $\theta^{t+1} = \sum_r \theta_r^{t+1}$, $\xi_r^{t+1} = \theta^{t+1} - \theta_r^{t+1}$.

  14. Convex-Concave Computational Procedure (CCCP) (Yuille): decompose $F(\theta) = F_1(\theta) - F_2(\theta)$ and iterate $\nabla F_1(\theta^{t+1}) = \nabla F_2(\theta^t)$; this eliminates the double loop.
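
A one-dimensional sketch of the CCCP update $\nabla F_1(\theta^{t+1}) = \nabla F_2(\theta^t)$, using an illustrative decomposition $F(\theta) = F_1(\theta) - F_2(\theta)$ with $F_1 = \theta^4/4$ and $F_2 = a\theta^2/2$ for which the inner solve has a closed form; this is not the BP free energy itself, only the iteration pattern.

```python
import numpy as np

# F(theta) = F1(theta) - F2(theta), with F1 and F2 both convex
a = 2.0
F   = lambda t: t**4 / 4 - a * t**2 / 2
dF1 = lambda t: t**3            # gradient of the convex part F1
dF2 = lambda t: a * t           # gradient of the subtracted convex part F2

theta = 0.1                     # initial point
for step in range(30):
    # CCCP update: solve dF1(theta_new) = dF2(theta_old) for theta_new
    rhs = dF2(theta)
    theta = np.sign(rhs) * np.abs(rhs) ** (1.0 / 3.0)

print("theta* =", theta, " (stationary points of F are 0 and +/- sqrt(a) =", np.sqrt(a), ")")
print("F(theta*) =", F(theta))  # F decreases monotonically along the iteration
```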

  15. Boltzmann Machine: units $x_1, \ldots, x_4$ with stochastic update $p(x_i = 1 \mid x) = \varphi\bigl(\sum_j w_{ij} x_j - h_i\bigr)$ and equilibrium distribution $p(x) = \exp\bigl\{\sum w_{ij} x_i x_j - \sum h_i x_i - \psi\bigr\}$; the model manifold $B = \{\hat p(x)\}$ approximates a target $q(x)$.
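
A Gibbs-sampling sketch for the Boltzmann machine on this slide, $p(x) \propto \exp\{\tfrac{1}{2}\sum w_{ij} x_i x_j - \sum h_i x_i\}$ with $x_i = \pm 1$; the conditional $p(x_i = 1 \mid x_{\text{rest}}) = \sigma\bigl(2(\sum_j w_{ij} x_j - h_i)\bigr)$ follows from the energy, while $W$, $h$ and the chain length are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
W = 0.15 * rng.standard_normal((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
h = 0.05 * rng.standard_normal(n)

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

x = rng.choice([-1.0, 1.0], size=n)       # random initial configuration
means = np.zeros(n)
kept = 0
for sweep in range(6000):
    for i in range(n):
        # p(x_i = +1 | rest) for p(x) proportional to exp{0.5 x^T W x - h.x}
        field = W[i] @ x - h[i]
        x[i] = 1.0 if rng.random() < sigmoid(2.0 * field) else -1.0
    if sweep >= 1000:                      # discard burn-in, then accumulate
        means += x
        kept += 1

print("Monte Carlo estimate of E[x_i]:", np.round(means / kept, 2))
```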

  16. Boltzmann machine with hidden units: trained by the EM algorithm, which alternates e-projection onto the data manifold $D$ and m-projection back onto the model manifold $M$.

  17. EM algorithm with hidden variables: model $p(x, y; u)$ with observed $x$ and hidden $y$; data $D = \{x_1, \ldots, x_N\}$; model manifold $M = \{p(x, y; u)\}$; data manifold $D = \{p(x, y) : p(x) = p_D(x)\}$, the distributions whose marginal over $x$ equals the empirical distribution. m-projection to $M$: $\min_{p \in M} KL[\hat p(x, y) : p]$. e-projection to $D$: $\min_{\hat p \in D} KL[\hat p : p(x, y; u)]$.

  18. SVM: support vector machine. Embedding $z = \varphi(x)$; $f(x) = \sum_i w_i \varphi_i(x) = \sum_i \alpha_i y_i K(x, x_i)$; kernel $K(x, x') = \sum_i \varphi_i(x) \varphi_i(x')$. Conformal change of kernel: $K(x, x') \to \rho(x)\rho(x') K(x, x')$ with $\rho(x) = \exp\{-\kappa\, |f(x)|^2\}$.
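
A sketch of the conformal change of kernel written above, $\tilde K(x, x') = \rho(x)\rho(x')K(x, x')$ with $\rho(x) = \exp\{-\kappa |f(x)|^2\}$. The Gaussian (RBF) kernel, the parameters $\kappa$, $\gamma$, and the placeholder decision function $f$ are assumptions for illustration; in the conformal-kernel procedure $f$ would come from a first-pass SVM and the modified kernel is then used to retrain.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """K(x, x') = exp(-gamma ||x - x'||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def conformal_kernel(X, Y, f, kappa=1.0, gamma=1.0):
    """K~(x, x') = rho(x) rho(x') K(x, x') with rho(x) = exp(-kappa f(x)^2).

    rho peaks on the boundary f(x) = 0, so resolution is enhanced there
    relative to points far from the boundary."""
    rho_x = np.exp(-kappa * f(X) ** 2)
    rho_y = np.exp(-kappa * f(Y) ** 2)
    return rho_x[:, None] * rbf_kernel(X, Y, gamma) * rho_y[None, :]

# toy usage with a placeholder first-pass decision function f(x) = x_1 + x_2 - 1
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
f = lambda Z: Z[:, 0] + Z[:, 1] - 1.0
print(np.round(conformal_kernel(X, X, f, kappa=0.5, gamma=0.5), 3))
```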

  19. Signal Processing. ICA: Independent Component Analysis: $x_t = A s_t$, recover $s_t$ from $x_t$; related problems: sparse component analysis, positive matrix factorization.

  20. Mixture and unmixture of independent signals: $x_i = \sum_j A_{ij} s_j$, i.e. $x = A s$, with sources $s_1, \ldots, s_n$ and observations $x_1, \ldots, x_m$.

  21. Independent Component Analysis: $x = A s$, $x_i = \sum_j A_{ij} s_j$; demixing $y = W x$, $W = A^{-1}$. Observations: $x(1), x(2), \ldots, x(t)$; recover: $s(1), s(2), \ldots, s(t)$.

  22. Space of Matrices: Lie group. Perturbation $W \to W + dW$ with non-holonomic basis $dX = dW\, W^{-1}$ (equivalently, translate $W$ to $I$ and $W + dW$ to $I + dX$); invariant metric $ds^2 = \mathrm{tr}(dX\, dX^T) = \mathrm{tr}(W^{-T} dW^T\, dW\, W^{-1})$; natural gradient $\tilde\nabla l = \frac{\partial l}{\partial W} W^T W$.

  23. Information Geometry of ICA: $S = \{p(y)\}$, independent submanifold $I = \{q(y) = q_1(y_1) q_2(y_2) \cdots q_n(y_n)\}$, model $\{p(Wx)\}$; natural gradient of $l(W, y)$, e.g. $KL[p(y; W) : q(y)]$; estimating functions; stability and efficiency.

  24. Semiparametric Statistical Model: $p(x; W, r) = |W|\, r(Wx)$, $r(s) = \prod_i r_i(s_i)$, $W = A^{-1}$, with $r$ unknown; observations $x(1), x(2), \ldots, x(t)$.

  25. Natural Gradient: $\Delta W = -\eta\, \dfrac{\partial l(y, W)}{\partial W}\, W^T W$.
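
A sketch of the natural-gradient ICA rule: substituting the ICA log-likelihood loss into $\Delta W = -\eta\, \partial l/\partial W\, W^T W$ gives the equivariant form $\Delta W = \eta\,(I - \varphi(y) y^T)\, W$; here $\varphi(y) = \tanh(y)$ is an assumed nonlinearity suited to super-Gaussian sources, and the mixing matrix and data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 20000

S = rng.laplace(size=(n, T))             # super-Gaussian independent sources
A = rng.standard_normal((n, n))          # unknown mixing matrix
X = A @ S                                # observations x(t) = A s(t)

W = np.eye(n)                            # demixing matrix, y = W x
eta = 0.05
phi = np.tanh                            # assumed score-like nonlinearity

for epoch in range(300):
    Y = W @ X
    C = (phi(Y) @ Y.T) / T               # estimate of E[phi(y) y^T]
    W = W + eta * (np.eye(n) - C) @ W    # natural-gradient update

# W A should be close to a scaled permutation if separation succeeded
print(np.round(W @ A, 2))
```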

  26. Basis Given: overcomplete case and sparse solution. $x = A s = \sum_i s_i a_i$ with more basis vectors $a_i$ than dimensions has many solutions $\hat s$; the sparse solution is the one with many $\hat s_i = 0$: $x_t = A \hat s_t$.

  27. $x = A \hat s$: the generalized inverse gives the minimum $L_2$-norm solution, $\min \sum_i \hat s_i^2$; the sparse solution minimizes the $L_1$ norm, $\min \sum_i |\hat s_i|$.
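
A sketch comparing the two solutions above for an underdetermined system $x = A\hat s$: the minimum-$L_2$-norm solution via the generalized (pseudo) inverse, and the minimum-$L_1$-norm solution obtained by recasting $\min \sum_i |\hat s_i|$ as a linear program ($\hat s = u - v$, $u, v \ge 0$). The problem data are synthetic, and the $L_1$ solution typically comes out much sparser and often recovers the planted sparse vector.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 8, 20                          # overcomplete: more columns than rows
A = rng.standard_normal((m, n))
s_true = np.zeros(n)
s_true[[2, 11]] = [1.5, -2.0]         # sparse ground truth
x = A @ s_true

# generalized-inverse (minimum L2 norm) solution
s_l2 = np.linalg.pinv(A) @ x

# minimum L1 norm: min sum(u + v)  s.t.  A(u - v) = x, u, v >= 0
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * n), method="highs")
s_l1 = res.x[:n] - res.x[n:]

print("L2 solution (dense): ", np.round(s_l2, 2))
print("L1 solution (sparse):", np.round(s_l1, 2))
```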

  28. Overcomplete Basis and Sparse Solution: $x = A s = \sum_i s_i a_i$; $\min \sum_i |s_i|$ ($L_1$); penalized form $\min \|x - A s\| + \alpha \|s\|_p^p$; non-linear denoising.

  29. Sparse Solution: minimize a penalty $F_p(\beta) = \sum_i |\beta_i|^p$ (a Bayes prior). $F_0(\beta) = \#\{i : \beta_i \neq 0\}$: sparsest solution; $F_1(\beta) = \sum_i |\beta_i|$: $L_1$ solution; $0 \le p \le 1$: sparse solutions in the overcomplete case; $F_2(\beta) = \sum_i \beta_i^2$: generalized-inverse solution.

  30. Optimization under Sparsity Condition: $\min \varphi(\beta)$ (a convex function) subject to the constraint $F(\beta) \le c$. Typical case: $\varphi(\beta) = \frac{1}{2}\|y - X\beta\|^2 = \frac{1}{2}(\beta - \beta^*)^T G\, (\beta - \beta^*)$, $F_p(\beta) = \sum_i |\beta_i|^p$ with $p = 2, 1, 1/2$.

  31. $L_1$-constrained optimization (LASSO). Problem $P_c$: $\min \varphi(\beta)$ under $F(\beta) \le c$; the solution $\beta^*_c$ has $\beta^*_c = 0$ at $c = 0$ and $\beta^*_c \to \beta^*$ as $c \to \infty$. Problem $P_\lambda$ (LARS): $\min \varphi(\beta) + \lambda F(\beta)$; the solution $\beta^*_\lambda$ has $\beta^*_\lambda = 0$ at $\lambda = \infty$ and $\beta^*_\lambda \to \beta^*$ as $\lambda \to 0$. For $p \ge 1$ the solutions $\beta^*_c$ and $\beta^*_\lambda$ coincide under the correspondence $\lambda = \lambda_c$; for $p < 1$ the correspondence $\lambda = \lambda_c$ is multiple-valued and not continuous, and stability differs.
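
One standard way to compute the solution of problem $P_\lambda$ for $p = 1$ is iterative soft thresholding (ISTA); the sketch below, with synthetic data, minimizes $\frac{1}{2}\|y - X\beta\|^2 + \lambda \sum_i |\beta_i|$ and shows how the number of non-zero coefficients shrinks as $\lambda$ grows.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by proximal gradient."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[0, 3, 7]] = [2.0, -1.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

for lam in [0.1, 1.0, 10.0]:
    b = ista(X, y, lam)
    print(f"lambda={lam:5.1f}  nonzeros={np.sum(np.abs(b) > 1e-6)}")
```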

  32. Projection from $\beta^*$ to the constraint surface $F = c$ (information geometry; figure).

  33. Convex Cone Programming: the cone $P$ of positive semi-definite matrices; convex potential function; dual geodesic approach; $A x = b$, $\min c \cdot x$; relation to the support vector machine.

  34. Fig. 1: constraint regions $R_c$ for $n = 2$: a) $p > 1$, b) $p = 1$, c) $p < 1$ (non-convex).

  35. Orthogonal projection, dual projection: $\min_\beta D[\beta^* : \beta]$ subject to $F(\beta) = c$ (dual geodesic projection); in dual coordinates, $\eta^*_c - \eta^* \propto \nabla F(\eta^*_c)$.

  36. Fig. 5: the subgradient $n = \nabla F(\eta^*_c)$; the projection direction is proportional to $\nabla F(\eta^*_c)$.

  37. LASSO path and LARS path (stagewise solution): constrained form $\min \varphi(\beta)$ with $F(\beta) = c$ versus penalized form $\min \varphi(\beta) + \lambda F(\beta)$; correspondence between $\beta^*_c$ and $\beta^*_\lambda$, $c \leftrightarrow \lambda$.

  38. Active set and gradient: $A(\beta) = \{i : \beta_i \neq 0\}$. $\nabla_i F(\beta) = p\, \mathrm{sgn}(\beta_i)\, |\beta_i|^{p-1}$ for $i \in A$; for $i \notin A$ the (sub)gradient is an interval, $(-\infty, \infty)$ for $p < 1$ and $[-1, 1]$ for $p = 1$.

  39. Solution path: $\nabla_A \varphi(\beta^*_c) + \lambda_c \nabla_A F(\beta^*_c) = 0$. Differentiating along the path, $\{\nabla_A \nabla_A \varphi(\beta^*_c) + \lambda_c \nabla_A \nabla_A F(\beta^*_c)\}\, \dot\beta^*_c = -\dot\lambda_c \nabla_A F(\beta^*_c)$, so $\frac{d}{dc}\beta^*_c = -\dot\lambda_c\, K_c^{-1} \nabla_A F(\beta^*_c)$ with $K_c = G(\beta^*_c) + \lambda_c \nabla\nabla F(\beta^*_c)$. For $L_1$: $\nabla\nabla F_1 = 0$, $\nabla F_1 = (\mathrm{sgn}\,\beta_i)$.

  40. Solution path in the subspace of the active set: $\nabla_A \varphi(\beta^*_\lambda) + \lambda \nabla_A F(\beta^*_\lambda) = 0$ (active direction); $\dot\beta^*_\lambda = -K^{-1} \nabla_A F(\beta^*_\lambda)$; at a turning point the active set changes, $A \to A'$.

  41. Gradient Descent Method: $\min L(x + a)$ subject to $\sum g_{ij} a^i a^j = \varepsilon^2$. $\nabla L = \{\partial L / \partial x_i\}$: covariant; $\tilde\nabla L = \{\sum_j g^{ij}\, \partial L / \partial x_j\}$: contravariant; update $x_{t+1} = x_t - c\, \tilde\nabla L(x_t)$.

  42. Extended LARS ($p = 1$) and the Minkowskian gradient: with the norm $\|a\|_p = \sum |a_i|^p$, maximize $\psi(\beta + \varepsilon a)$ under $\|a\|_p = 1$, i.e. $\max\, \psi(\beta + \varepsilon a) - \lambda \|a\|_p$. For $p = 1$, with $\eta = \nabla\psi(\beta)$, $\bigl(\tilde\nabla_1 \psi(\beta)\bigr)_i = \mathrm{sgn}(\eta_i)$ if $|\eta_i| = \max\{|\eta_1|, \ldots, |\eta_N|\}$ and $0$ otherwise.

  43. $i^* = \arg\max_i |f_i|$, $|f_{i^*}| = \max_j |f_j|$; $\tilde\nabla F_i = 1$ for $i = i^*$ (and any tied $j^*$), $0$ otherwise; LARS-type update: $\beta^{t+1} = \beta^t - \eta\, \tilde\nabla F$.
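
A sketch of the stagewise rule above on a least-squares objective: at each step only the coordinate $i^*$ with the largest gradient magnitude $|f_{i^*}|$ is moved by a small step $-\eta\,\mathrm{sgn}(f_{i^*})$. The data and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[1, 4]] = [3.0, -2.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta = np.zeros(p)
eta = 0.01                                    # small stagewise step
for t in range(2000):
    f = X.T @ (X @ beta - y)                  # gradient of 0.5 ||y - X beta||^2
    i_star = np.argmax(np.abs(f))             # coordinate with maximal |gradient|
    beta[i_star] -= eta * np.sign(f[i_star])  # move only that coordinate
print("stagewise estimate:", np.round(beta, 2))
```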

  44. Euclidean case: $\tilde\nabla F = \nabla F = f$. For $p = 1$ (limit $\alpha \to 1$): $\tilde\nabla_1 F = c\,\mathrm{sgn}(f_{i^*})\, e_{i^*}$, i.e. the vector $(0, \ldots, 0, \mathrm{sgn}(f_{i^*}), 0, \ldots, 0)^T$ with a single non-zero entry in the maximal coordinate $i^*$.

  45. $L_{1/2}$ constraint: non-convex optimization; the $\lambda$-trajectory and the $c$-trajectory. Example (1-dim): $\varphi(\beta) = \frac{1}{2}(\beta - \beta^*)^2$, $f_\lambda(\beta) = \varphi + \lambda F = \frac{1}{2}(\beta - \beta^*)^2 + \lambda |\beta|^{1/2}$.
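
A brute-force check of the one-dimensional example above: minimizing $f_\lambda(\beta) = \frac{1}{2}(\beta - \beta^*)^2 + \lambda|\beta|^{1/2}$ on a grid shows the minimizer jumping discontinuously to $0$ as $\lambda$ increases, which is the non-continuous $\lambda \leftrightarrow c$ correspondence discussed on the following slides. The grid and the $\lambda$ values are arbitrary.

```python
import numpy as np

beta_star = 1.0
grid = np.linspace(-2.0, 2.0, 400001)        # fine grid that includes beta = 0

def minimizer(lam):
    f = 0.5 * (grid - beta_star) ** 2 + lam * np.sqrt(np.abs(grid))
    return grid[np.argmin(f)]

for lam in [0.1, 0.3, 0.5, 0.7]:
    print(f"lambda={lam:.1f}  argmin beta = {minimizer(lam): .3f}")
# The minimizer jumps from a clearly non-zero value directly to 0 as lambda
# grows: for p < 1 the lambda <-> c correspondence is not continuous.
```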

  46. Problem $P_c$: $\min (\beta - \beta^*)^2$ subject to $|\beta|^{1/2} \le c$, with solution $\hat\beta_c = c^2$ when the constraint is active. Problem $P_\lambda$: $\nabla f = 0$ gives $\beta - \beta^* + \frac{\lambda}{2}\,\mathrm{sgn}(\beta)\,|\beta|^{-1/2} = 0$, solved by Xu Zongben's (half-thresholding) operator $\hat\beta = R_\lambda(\beta^*)$; this induces the correspondence $\lambda_c$ between $c$ and $\lambda$.

  47. ICCN-Huangshan (黄山): Sparse Signal Analysis. Shun-ichi Amari (甘利俊一), RIKEN Brain Science Institute (Collaborator: Masahiro Yukawa, Niigata University).

  48. Solution Path: the correspondence $\lambda \leftrightarrow c$ is not continuous and not monotone; $\beta^*_c \Leftrightarrow \beta^*_\lambda$ exhibits jumps.

  49. An example of the greedy path (figure in the $(\beta_1, \beta_2)$ plane).

  50. Linear Programming: constraints $\sum_j A_{ij} x_j \ge b_i$, objective $\max \sum_i c_i x_i$; barrier potential $\psi(x) = -\sum_i \log\bigl(\sum_j A_{ij} x_j - b_i\bigr)$; inner method.

  51. Convex Programming and the Inner Method. LP: $A x \ge b$, $x \ge 0$, $\min c \cdot x$; barrier $\psi(x) = -\bigl\{\sum_i \log\bigl(\sum_j A_{ij} x_j - b_i\bigr) + \sum_i \log x_i\bigr\}$; dual coordinates $\eta_i = \partial_i \psi(x)$. Simplex method versus inner method.
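
A rough sketch of the inner (log-barrier) method on a tiny LP; the problem data, the gradient-descent inner loop, and the schedule for the barrier weight $t$ are all illustrative simplifications (a practical interior-point solver would take Newton steps along the central path).

```python
import numpy as np

# LP:  min c.x   subject to  A x >= b,  x >= 0
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])                       # feasible set: x1 + x2 >= 1, x >= 0

def barrier(x):
    slack = A @ x - b
    if np.any(slack <= 0) or np.any(x <= 0):
        return np.inf                     # outside the interior
    return -np.sum(np.log(slack)) - np.sum(np.log(x))

def obj(x, t):
    return t * (c @ x) + barrier(x)

def grad(x, t):
    slack = A @ x - b
    return t * c - A.T @ (1.0 / slack) - 1.0 / x

x = np.array([1.0, 1.0])                  # strictly feasible starting point
t = 1.0
for outer in range(20):                   # follow the central path as t grows
    for inner in range(200):              # damped gradient descent on obj(., t)
        g = grad(x, t)
        step = 0.1
        new = x - step * g
        while step > 1e-12 and obj(new, t) > obj(x, t):
            step *= 0.5                   # backtrack to stay feasible and descend
            new = x - step * g
        if obj(new, t) < obj(x, t):
            x = new
    t *= 2.0

print("approximate LP solution:", np.round(x, 4), "(optimum is at (1, 0))")
```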

  52. Polynomial-Time Algorithm: the central path $x^*(t) = \arg\min\{t\, c \cdot x + \psi(x)\}$ behaves like a geodesic, and its curvature $H_M^2$ governs the admissible step-size $\delta$, hence the iteration complexity.

  53. Neural Networks: multilayer perceptron; higher-order correlations; synchronous firing.

  54. Multilayer Perceptrons: $y = \sum_i v_i \varphi(w_i \cdot x) + n$ (noise $n$), $x = (x_1, x_2, \ldots, x_n)$; $p(y \mid x; \theta) = c \exp\bigl\{-\frac{1}{2}\bigl(y - f(x, \theta)\bigr)^2\bigr\}$, $f(x, \theta) = \sum_i v_i \varphi(w_i \cdot x)$, $\theta = (w_1, \ldots, w_m; v_1, \ldots, v_m)$.

  55. Multilayer Perceptron: the neuromanifold is the space of functions $y = f(x, \theta) = \sum_i v_i \varphi(w_i \cdot x)$ parametrized by $\theta = (w_1, \ldots, w_m; v_1, \ldots, v_m)$.

  56. singularities

  57. Geometry of the singular model: $y = v\,\varphi(w \cdot x) + n$; the subset $W: |w|\, v = 0$ is the singular region.

  58. Backpropagation (gradient learning). Examples: $(x_1, y_1), \ldots, (x_t, y_t)$; loss $E = \frac{1}{2}\bigl(y - f(x, \theta)\bigr)^2 = -\log p(y \mid x; \theta)$; update $\Delta\theta_t = -\eta_t\, \partial E / \partial\theta$. Natural gradient (Riemannian steepest descent): $\tilde\nabla E = G^{-1} \nabla E$; $f(x, \theta) = \sum_i v_i \varphi(w_i \cdot x)$.
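
A numpy sketch of gradient learning for the model $f(x, \theta) = \sum_i v_i \varphi(w_i \cdot x)$, preconditioned with an empirical Fisher estimate as a stand-in for the natural gradient $\tilde\nabla E = G^{-1}\nabla E$; the toy data, the damping term, and the learning rate are assumptions, and the empirical Fisher is only an approximation to the true Fisher metric.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, T = 2, 3, 200
X = rng.standard_normal((T, n_in))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(T)       # toy regression target

W = 0.5 * rng.standard_normal((n_hid, n_in))              # hidden weights w_i
v = 0.5 * rng.standard_normal(n_hid)                      # output weights v_i
phi, dphi = np.tanh, lambda u: 1.0 - np.tanh(u) ** 2

def per_example_grads(W, v):
    """Gradients of E = 0.5 (y - f(x, theta))^2 for each example, flattened."""
    U = X @ W.T                                            # pre-activations (T, n_hid)
    H = phi(U)
    err = H @ v - y                                        # f(x) - y
    g_v = err[:, None] * H                                 # dE/dv_i
    g_W = (err[:, None] * dphi(U) * v[None, :])[:, :, None] * X[:, None, :]
    return np.concatenate([g_W.reshape(T, -1), g_v], axis=1)

eta, damping = 0.1, 0.1
for step in range(300):
    G_per = per_example_grads(W, v)
    g = G_per.mean(axis=0)                                 # ordinary gradient
    G = G_per.T @ G_per / T + damping * np.eye(g.size)     # empirical Fisher + damping
    theta = np.concatenate([W.ravel(), v]) - eta * np.linalg.solve(G, g)
    W, v = theta[: n_hid * n_in].reshape(n_hid, n_in), theta[n_hid * n_in:]

mse = np.mean((phi(X @ W.T) @ v - y) ** 2)
print("final mean squared error:", round(float(mse), 4))
```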

  59. Conformal transformation and $q$-Fisher information: $g_{ij}^{(q)}(p) = \dfrac{q}{h_q(p)}\, g_{ij}^F(p)$; $q$-divergence: $D_q[p(x) : r(x)] = \dfrac{1}{(1 - q)\, h_q(p)} \Bigl(1 - \int p(x)^q\, r(x)^{1-q}\, dx\Bigr)$.

  60. Total Bregman Divergence and its Applications to Shape Retrieval • Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari, Frank Nielsen IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010

  61. Total Bregman Divergence: from an ordinary Bregman divergence $D[x : y]$, define $TD[x : y] = \dfrac{D[x : y]}{\sqrt{1 + \|\nabla f(y)\|^2}}$; properties: rotational invariance, conformal geometry.

  62. Total Bregman divergence (Vemuri): $\mathrm{TBD}(p : q) = \dfrac{\varphi(p) - \varphi(q) - \nabla\varphi(q) \cdot (p - q)}{\sqrt{1 + |\nabla\varphi(q)|^2}}$.
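
A small sketch of the formula above with the convex generator $\varphi(x) = \frac{1}{2}\|x\|^2$ chosen purely for illustration, for which the ordinary Bregman divergence is $\frac{1}{2}\|p - q\|^2$ and the total Bregman divergence divides it by $\sqrt{1 + \|q\|^2}$.

```python
import numpy as np

def bregman(p, q, phi, grad_phi):
    return phi(p) - phi(q) - grad_phi(q) @ (p - q)

def total_bregman(p, q, phi, grad_phi):
    """tBD(p:q) = BD(p:q) / sqrt(1 + ||grad phi(q)||^2)."""
    return bregman(p, q, phi, grad_phi) / np.sqrt(1.0 + grad_phi(q) @ grad_phi(q))

# illustrative generator phi(x) = 0.5 ||x||^2, so grad phi(x) = x
phi = lambda x: 0.5 * x @ x
grad_phi = lambda x: x

p = np.array([1.0, 2.0])
q = np.array([0.5, -1.0])
print("Bregman      :", bregman(p, q, phi, grad_phi))      # 0.5 ||p - q||^2
print("total Bregman:", total_bregman(p, q, phi, grad_phi))
```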
