Information diffusion kernels
  1. T-122.102 Special Course in Information Technology: Information diffusion kernels. Based on the technical report by John Lafferty and Guy Lebanon (2004), Diffusion Kernels on Statistical Manifolds (CMU-CS-04-101). Sven Laur, Helsinki University of Technology. swen@math.ut.ee, slaur@tcs.hut.fi

  2. Outline
     • The problem and motivation
     • From data to distribution
     • What is a reasonable geometry over the distributions?
       ⋆ Coordinates, tangent vectors, distances, etc.
     • Why heat diffusion?
       ⋆ Geodesic distance vs. Mercer kernel, Gaussian kernels
     • Building a model
     • Extracting an approximate kernel

  3. How to build kernels for discrete data structures?
     • Simple embedding of discrete vectors into $\mathbb{R}^n$
       ⋆ Works with vectors of fixed length
       ⋆ It is an ad hoc technique
     • Embedding via generative models
       ⋆ Theoretically sound
       ⋆ What should the right proximity measure be?
       ⋆ The proximity measure should be independent of the parameterization!

  4. Parameterization-invariant kernel methods
     • Fisher kernels: $K(x, y) = \langle \nabla\ell(x|\theta), \nabla\ell(y|\theta) \rangle$
     • Information diffusion kernels: $K(x, y) = ???$
     • Mutual information kernels (Bayesian prediction probability):
       $K(x, y) = \Pr[y \mid x] \propto \int p(y|\theta)\, p(x|\theta)\, p(\theta)\, d\theta$,
       integrated over the model class $\mathcal{P}$ with prior probability $p(\theta)$.
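To make the Fisher kernel concrete, here is a minimal sketch for the multinomial model, where $\nabla\ell(x|\theta)_j = x_j/\theta_j$. The counts, the reference parameter $\theta$, and the plain inner product are illustrative assumptions (some formulations also insert the inverse Fisher information matrix between the two gradients):

```python
import numpy as np

def score(x, theta):
    # Gradient of the multinomial log-likelihood l(x|theta) = sum_j x_j log(theta_j)
    return x / theta

def fisher_kernel(x, y, theta):
    # K(x, y) = <grad l(x|theta), grad l(y|theta)>, evaluated at a fixed reference theta
    return float(np.dot(score(x, theta), score(y, theta)))

# Hypothetical word counts and a reference parameter (e.g. a corpus-wide MLE)
x = np.array([2.0, 1.0, 0.0])
y = np.array([1.0, 0.0, 3.0])
theta = np.array([0.5, 0.3, 0.2])
print(fisher_kernel(x, y, theta))
```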

  5. Text classification
     • The bag-of-words approach produces a count vector $(x_1, \ldots, x_n)$.
     • Let the model class be the multinomial distribution.
     • The MLE estimate is
       $\hat\theta_{tf}(x) = \frac{1}{x_1 + \cdots + x_n}(x_1, \ldots, x_n)$.
     • A second embedding is inverse document frequency weighting:
       $\hat\theta_{tfidf}(x) = \frac{1}{x_1 w_1 + \cdots + x_n w_n}(x_1 w_1, \ldots, x_n w_n)$, where $w_i = \log(1/f_i)$.
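A minimal numerical sketch of the two embeddings, assuming $f_i$ is the fraction of documents containing term $i$ (the counts and frequencies below are made up for illustration):

```python
import numpy as np

def theta_tf(x):
    # MLE of the multinomial parameter: normalized term counts
    return x / x.sum()

def theta_tfidf(x, f):
    # Inverse document frequency weighting with w_i = log(1 / f_i)
    w = np.log(1.0 / f)
    xw = x * w
    return xw / xw.sum()

x = np.array([3.0, 1.0, 2.0])   # hypothetical term counts of one document
f = np.array([0.5, 0.1, 0.25])  # hypothetical document frequencies f_i
print(theta_tf(x))
print(theta_tfidf(x, f))
```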

  6. What is a statistical manifold?
     • A statistical manifold is a family of probability distributions
       $\mathcal{P} = \{ p(\cdot|\theta): \mathcal{X} \to \mathbb{R} : \theta \in \Theta \}$,
       where $\Theta$ is an open subset of $\mathbb{R}^n$.
     • The parameterization must be unique:
       $p(\cdot|\theta_1) \equiv p(\cdot|\theta_2) \implies \theta_1 = \theta_2$.
     • The parameters $\theta$ can be treated as the coordinate vector of $p(\cdot|\theta)$.

  7. Set of admissible coordinates and distributions
     • A parameterization $\psi$ is admissible iff $\psi$, as a function of the primary parameters $\theta$, is $C^\infty$ smooth.
     • The set of admissible parameterizations is an invariant.
     • We consider only manifolds whose log-likelihood function $\ell(x|\theta) = \log p(x|\theta)$ is $C^\infty$ differentiable w.r.t. $\theta$.
     • The multinomial family satisfies the $C^\infty$ requirement:
       $\ell(x|\theta) = \log \prod_{j=1}^m \theta_j^{x_j} = \sum_{j=1}^m x_j \log \theta_j$.

  8. Geometry ≈ distance measure
     • A distance measure determines the geometry. This can be reversed.
     • Recall that the length of a path $\gamma: [0,1] \to \mathcal{P}$ is
       $d(p, q) = \int_0^1 \|\dot\gamma(t)\|\, dt = \int_0^1 \sqrt{\langle \dot\gamma(t), \dot\gamma(t) \rangle}\, dt$,
       where $\dot\gamma(t)$ is a tangent vector.
     • But the set $\mathcal{P}$ does not have any geometrical structure!
     • We redefine (tangent) vectors: vectors will be operators.
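In the Euclidean case the inner product is just the dot product, and the path-length integral can be approximated by summing chord lengths. A minimal sketch (the curve is an arbitrary illustration; on a statistical manifold the norm would come from the metric fixed on the later slides):

```python
import numpy as np

def path_length(gamma, n_steps=1000):
    # Approximate d = integral_0^1 ||gamma'(t)|| dt by summing chord lengths
    t = np.linspace(0.0, 1.0, n_steps + 1)
    points = np.array([gamma(ti) for ti in t])
    return np.sum(np.linalg.norm(np.diff(points, axis=0), axis=1))

# A quarter of the unit circle; its length should be pi/2 ~ 1.5708
gamma = lambda t: np.array([np.cos(t * np.pi / 2), np.sin(t * np.pi / 2)])
print(path_length(gamma))
```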

  9. What is a vector?
     • A vector will be an operator that maps $C^\infty$ functions $f: \mathcal{P} \to \mathbb{R}$ to reals. For fixed coordinates $\theta$ and a point $p$, natural maps $\left(\frac{\partial}{\partial\theta_i}\right)_p$ emerge:
       $\left(\frac{\partial}{\partial\theta_i}\right)_p (f) = \left.\frac{\partial f}{\partial\theta_i}\right|_p$.
       They will form a basis of the tangent space.
     • For an arbitrary differentiable $\gamma$ we can express
       $\frac{d}{dt} f(\gamma(t)) = \left[ \dot\theta_1(t) \left(\frac{\partial}{\partial\theta_1}\right)_{\gamma(t)} + \cdots + \dot\theta_n(t) \left(\frac{\partial}{\partial\theta_n}\right)_{\gamma(t)} \right] (f)$.
       The operator in the square brackets does not depend on $f$ and has the right type: it will be the speed/tangent vector.
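The bracketed identity is just the chain rule. A quick finite-difference check in coordinates (the function and curve below are arbitrary illustrations):

```python
import numpy as np

f = lambda th: th[0] ** 2 + np.sin(th[1])   # an arbitrary smooth function of coordinates
gamma = lambda t: np.array([t, t ** 2])     # an arbitrary curve in coordinates
t, h = 0.5, 1e-6

# Left-hand side: (f(gamma(t)))' by a central difference in t
lhs = (f(gamma(t + h)) - f(gamma(t - h))) / (2 * h)

# Right-hand side: sum_i theta_i'(t) * (df/dtheta_i), evaluated at gamma(t)
theta_dot = (gamma(t + h) - gamma(t - h)) / (2 * h)
grad = np.array([(f(gamma(t) + h * e) - f(gamma(t) - h * e)) / (2 * h)
                 for e in np.eye(2)])
print(lhs, theta_dot @ grad)  # the two values agree
```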

  10. Is this a reasonable definition?
     • The speed vector $\dot\gamma(t)$ uniquely characterizes the rate of change of an arbitrary admissible function $f$:
       $\dot\gamma(t)(f) = \frac{d}{dt} f(\gamma(t))$.
     • There is a one-to-one correspondence $\dot\gamma(t) \mapsto (\dot\theta_1(t), \ldots, \dot\theta_n(t)) \in \mathbb{R}^n$.
     • There are coordinate transformation formulas between different bases
       $\left(\frac{\partial}{\partial\theta_i}\right)_{i=1}^n$ and $\left(\frac{\partial}{\partial\psi_i}\right)_{i=1}^n$.
     • We really cannot expect more if there is no geometrical structure!

  11. Kullback-Leibler divergence
     • The most reasonable distance measure between adjacent distributions $p$ and $q$ is the symmetrized Kullback-Leibler divergence
       $J(p, q) = D(p \| q) + D(q \| p) = \int p(x) \log \frac{p(x)}{q(x)}\, dx + \int q(x) \log \frac{q(x)}{p(x)}\, dx$.
     • It quantifies the extra cost incurred if we use the wrong distribution.
     • In the discrete case, encoding samples from one distribution with a code optimized for the other needs extra bits; $J(p, q)$ is the total expected overhead over both directions.
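A minimal sketch of $J(p, q)$ for discrete distributions (the two distributions are illustrative and assumed strictly positive, so the logarithms are well defined):

```python
import numpy as np

def symmetrized_kl(p, q):
    # J(p, q) = D(p || q) + D(q || p) for discrete distributions
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(symmetrized_kl(p, q))
```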

  12. What is a reasonable distance metric?
     Consider an infinitesimal movement along the curve $\gamma(t)$.
     • The corresponding change of coordinates is from $\theta$ to $\theta + \dot\theta \Delta t$, and the distance formula gives
       $d(p, q)^2 \approx \Delta t^2 \langle \dot\gamma(t), \dot\gamma(t) \rangle = \Delta t^2 \sum_{i,j=1}^n \dot\theta_i \dot\theta_j \left\langle \frac{\partial}{\partial\theta_i}, \frac{\partial}{\partial\theta_j} \right\rangle$.
     • Under mild regularity conditions
       $J(p, q) \approx \Delta t^2 \sum_{i,j=1}^n \dot\theta_i \dot\theta_j\, g_{ij}$, where
       $g_{ij} = \int p(x) \cdot \frac{\partial \ell(x|\theta)}{\partial\theta_i} \cdot \frac{\partial \ell(x|\theta)}{\partial\theta_j}\, dx$.
     • Hence, the local requirement $d^2(p, q) \approx J(p, q)$ fixes the geometry:
       $\left\langle \frac{\partial}{\partial\theta_i}, \frac{\partial}{\partial\theta_j} \right\rangle = g_{ij}$.
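The $g_{ij}$ are the entries of the Fisher information matrix $G$. A sketch of the local agreement $J(p, q) \approx \Delta\theta^\top G\, \Delta\theta$ for the categorical family, where the standard closed form is $G = \mathrm{diag}(1/\theta_i)$ (the base distribution and perturbation below are illustrative):

```python
import numpy as np

def fisher_matrix(theta):
    # Fisher information of the categorical family: g_ij = delta_ij / theta_i
    return np.diag(1.0 / theta)

def symmetrized_kl(p, q):
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

theta = np.array([0.5, 0.3, 0.2])
delta = 1e-3 * np.array([1.0, -2.0, 1.0])  # small perturbation, sums to zero

quadratic = delta @ fisher_matrix(theta) @ delta
print(symmetrized_kl(theta, theta + delta), quadratic)  # nearly equal
```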

  13. Limitations of geodesic distance
     • The geodesic distance $d(p, q)$ is the length of the shortest path between $p$ and $q$.
     • Geodesic distance cannot always be used for SVM kernels.
       ⋆ An SVM kernel (Mercer kernel) is a computational shortcut for $K(x, y) = \Psi(x) \cdot \Psi(y)$, where $\Psi: \mathbb{R}^n \to \mathbb{R}^d$ is a smooth enough function.
       ⋆ If the geodesic distance corresponds to a Mercer kernel, then there must be only one shortest path between any two points.
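One practical consequence of the Mercer property: a valid kernel must produce a symmetric positive semidefinite Gram matrix on any finite point set. A quick numerical check, here for the Gaussian kernel on random points (the data are arbitrary):

```python
import numpy as np

def is_psd(K, tol=1e-10):
    # A Mercer kernel must yield a symmetric positive semidefinite Gram matrix
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.random.default_rng(0).normal(size=(5, 3))
t = 1.0
# Gaussian kernel Gram matrix: K_ij = exp(-||x_i - x_j||^2 / (4t))
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-D2 / (4 * t))
print(is_psd(K))  # True: the Gaussian kernel is a Mercer kernel
```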

  14. Classification via temperature
     • Consider two classes, "hot" and "cold": each data point has an initial amount of heat $\lambda_i$ concentrated in a small neighborhood.
     • All other points have zero temperature.
     • Fix a time moment $t$. All points with temperature below zero belong to the class "cold", the others to the class "hot".
     • Heat gradually diffuses over the manifold. As $t \to \infty$, all points approach a constant temperature. Varying $t$ gives different levels of smoothing.
     • A large $t$ gives a flatter decision border, i.e. the classification is more robust, but also less sensitive.
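A minimal sketch of this classification rule, using the Euclidean heat kernel as a stand-in for the manifold heat kernel (the points, labels, and query below are made up):

```python
import numpy as np

def heat_kernel(x, y, t):
    # Euclidean heat kernel, standing in for the manifold heat kernel
    return np.exp(-np.sum((x - y) ** 2) / (4 * t))

def temperature(x, points, labels, t):
    # Initial heat +1 at "hot" points and -1 at "cold" points, diffused for time t
    return sum(l * heat_kernel(x, p, t) for p, l in zip(points, labels))

points = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([2.0, 2.0])]
labels = [1, 1, -1]  # hot, hot, cold
x = np.array([1.5, 1.5])
print("hot" if temperature(x, points, labels, t=0.5) > 0 else "cold")
```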

  15. How to model heat diffusion?
     • Classical heat diffusion is given by the partial differential equation
       $\frac{\partial f}{\partial t} - \Delta f = 0$, $\quad f(x, 0) = f(x)$,
       together with Dirichlet or Neumann boundary conditions.
     • In non-Euclidean geometry the Laplace operator has a nasty form:
       $\Delta f = \det(G)^{-1/2} \sum_{i,j=1}^n \frac{\partial}{\partial\theta_j} \left( g^{ij} \det(G)^{1/2} \frac{\partial f}{\partial\theta_i} \right)$,
       where the $g^{ij}$ are the elements of the inverse of the Fisher matrix $G$.

  16. Extracting the kernel
     • In the Euclidean space $\mathbb{R}^n$:
       $\Delta f = \frac{\partial^2 f}{\partial x_1^2} + \cdots + \frac{\partial^2 f}{\partial x_n^2}$.
     • The solution corresponding to the initial condition $f(x)$ is
       $f(x, t) = (4\pi t)^{-n/2} \int \exp\left( \frac{-\|x - y\|^2}{4t} \right) f(y)\, dy$.
     • Alternatively,
       $f(x, t) = \int K_t(x, y) f(y)\, dy$, $\quad K_t(x, y) = (4\pi t)^{-n/2} \exp\left( \frac{-\|x - y\|^2}{4t} \right)$.
     • In SVMs, $f = \lambda_1 \delta_{x_1} + \cdots + \lambda_k \delta_{x_k}$ and the integral collapses to the sum $f(x, t) = \sum_i \lambda_i K_t(x, x_i)$.
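A quick finite-difference check, in one dimension, that this $K_t$ solves the heat equation $\left(\frac{\partial}{\partial t} - \Delta\right) K = 0$ (the sample point and step size are arbitrary):

```python
import numpy as np

def K(x, y, t):
    # One-dimensional Euclidean heat kernel
    return (4 * np.pi * t) ** -0.5 * np.exp(-(x - y) ** 2 / (4 * t))

# Central differences at an arbitrary sample point
x, y, t, h = 0.7, 0.0, 0.5, 1e-4
dt = (K(x, y, t + h) - K(x, y, t - h)) / (2 * h)
dxx = (K(x + h, y, t) - 2 * K(x, y, t) + K(x - h, y, t)) / h ** 2
print(dt - dxx)  # ~0 up to discretization error
```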

  17. Central theoretical result
     Theorem. Let $M$ be a complete Riemannian manifold. Then there exists a kernel function $K$ (the heat kernel) which satisfies the following properties:
     (1) $K(x, y, t) = K(y, x, t)$;
     (2) $\lim_{t \to 0} K(x, y, t) = \delta(x, y)$;
     (3) $\left( \Delta - \frac{\partial}{\partial t} \right) K(x, y, t) = 0$;
     (4) $K(x, y, t) = \int K(x, z, t - s)\, K(z, y, s)\, dz$.
     The assertion means: (1) if $q$ converges parameter-wise to $p$, then $J(p, q) \to 0$;
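Property (4), the semigroup property, can be verified numerically for the one-dimensional Euclidean heat kernel (the points, times, and integration grid below are arbitrary):

```python
import numpy as np

def K(x, y, t):
    # One-dimensional Euclidean heat kernel
    return (4 * np.pi * t) ** -0.5 * np.exp(-(x - y) ** 2 / (4 * t))

# Property (4): K(x, y, t) = integral of K(x, z, t - s) K(z, y, s) dz
x, y, t, s = 0.3, -0.5, 1.0, 0.4
z = np.linspace(-20.0, 20.0, 200001)
dz = z[1] - z[0]
convolution = np.sum(K(x, z, t - s) * K(z, y, s)) * dz
print(K(x, y, t), convolution)  # nearly equal
```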
