

SLIDE 1

T-122.102 Special Course in Information Technology

Information diffusion kernels

Based on the technical report by John Lafferty and Guy Lebanon (2004), Diffusion Kernels on Statistical Manifolds (CMU-CS-04-101)

Sven Laur

Helsinki University of Technology

swen@math.ut.ee, slaur@tcs.hut.fi


SLIDE 2

Outline

  • The problem and motivation
  • From data to distribution
  • What is a reasonable geometry over the distributions?
    ⋆ Coordinates, tangent vectors, distances, etc.
  • Why heat diffusion?
    ⋆ Geodesic distance vs. Mercer kernel, Gaussian kernels.
  • Building a model
  • Extracting an approximate kernel


SLIDE 3

How to build kernels for discrete data structures?

  • Simple embedding of discrete vectors into $\mathbb{R}^n$
    ⋆ Works with vectors of fixed length
    ⋆ It is an ad hoc technique
  • Embedding via generative models
    ⋆ Theoretically sound
    ⋆ What should be the right proximity measure?
    ⋆ The proximity measure should be independent of parameterization!


SLIDE 4

Parameterization invariant kernel methods

  • Fisher kernels (a numerical sketch follows below)

$$K(x, y) = \langle \nabla \ell(x|\theta), \nabla \ell(y|\theta) \rangle$$

  • Information diffusion kernels

$$K(x, y) = \;???$$

  • Mutual information kernels (Bayesian prediction probability)

$$K(x, y) = \Pr[y|x] \propto \int p(y|\theta)\, p(x|\theta)\, p(\theta)\, d\theta,$$

integrated over the model class $\mathcal{P}$ with prior probability $p(\theta)$.
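As an illustration, here is a minimal sketch of the Fisher kernel for the multinomial family. The plain inner product of score vectors follows the formula above; the helper names and the choice of a common base point $\theta$ are ours, and practical variants often insert the inverse Fisher information between the two scores.

```python
import numpy as np

def fisher_score(x, theta):
    # Gradient of the multinomial log-likelihood l(x|theta) = sum_j x_j log theta_j,
    # taken w.r.t. theta: (x_1/theta_1, ..., x_m/theta_m).
    return x / theta

def fisher_kernel(x, y, theta):
    # Plain Fisher kernel K(x, y) = <grad l(x|theta), grad l(y|theta)>.
    return np.dot(fisher_score(x, theta), fisher_score(y, theta))

# Two word-count vectors evaluated at a crude common base point
# (the MLE of the pooled counts; an arbitrary illustrative choice).
x = np.array([3.0, 1.0, 0.0, 2.0])
y = np.array([1.0, 2.0, 1.0, 1.0])
theta = (x + y) / (x + y).sum()
print(fisher_kernel(x, y, theta))
```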


SLIDE 5

Text classification

  • The bag-of-words approach produces a count vector $(x_1, \ldots, x_n)$.
  • Let the model class be the multinomial distribution.
  • The MLE estimate is

$$\theta_{tf}(x) = \frac{1}{x_1 + \cdots + x_n}\,(x_1, \ldots, x_n).$$

  • A second embedding is inverse document frequency weighting:

$$\theta_{tfidf}(x) = \frac{1}{x_1 w_1 + \cdots + x_n w_n}\,(x_1 w_1, \ldots, x_n w_n), \qquad w_i = \log(1/f_i).$$
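A minimal sketch of the two embeddings (the function names and the document-frequency values in the example are illustrative):

```python
import numpy as np

def theta_tf(x):
    # MLE embedding: normalize raw counts onto the multinomial simplex.
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def theta_tfidf(x, doc_freq):
    # IDF-weighted embedding with w_i = log(1/f_i),
    # where f_i is the fraction of documents containing word i.
    x = np.asarray(x, dtype=float)
    w = np.log(1.0 / np.asarray(doc_freq, dtype=float))
    xw = x * w
    return xw / xw.sum()

counts = [3, 0, 2, 5]             # bag-of-words count vector
doc_freq = [0.5, 0.9, 0.1, 0.3]   # document frequencies f_i
print(theta_tf(counts))           # both embeddings sum to 1
print(theta_tfidf(counts, doc_freq))
```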


SLIDE 6

What is a statistical manifold?

  • A statistical manifold is a family of probability distributions

$$\mathcal{P} = \{p(\cdot|\theta) : \mathcal{X} \to \mathbb{R} \;:\; \theta \in \Theta\},$$

where $\Theta$ is an open subset of $\mathbb{R}^n$.

  • The parameterization must be unique:

$$p(\cdot|\theta_1) \equiv p(\cdot|\theta_2) \implies \theta_1 = \theta_2$$

  • The parameters $\theta$ can be treated as the coordinate vector of $p(\cdot|\theta)$.


SLIDE 7

Set of admissible coordinates and distributions

  • The parameterization $\psi$ is admissible iff $\psi$ as a function of the primary parameters $\theta$ is $C^\infty$ smooth.
  • The set of admissible parameterizations is an invariant.
  • We consider only manifolds where the log-likelihood function $\ell(x|\theta) = \log p(x|\theta)$ is $C^\infty$ differentiable w.r.t. $\theta$.
  • The multinomial family satisfies the $C^\infty$ requirement:

$$\ell(x|\theta) = \log \prod_{j=1}^{m} \theta_j^{x_j} = \sum_{j=1}^{m} x_j \log \theta_j.$$


SLIDE 8

Geometry ≈ distance measure

  • A distance measure determines the geometry. This can be reversed.
  • Recall the length of a path $\gamma : [0, 1] \to \mathcal{P}$:

$$d(p, q) = \int_0^1 \|\dot\gamma(t)\|\, dt = \int_0^1 \sqrt{\langle \dot\gamma(t), \dot\gamma(t) \rangle}\, dt,$$

where $\dot\gamma(t)$ is a tangent vector.

  • But the set P does not have any geometrical structure!!!
  • We redefine (tangent) vectors—vectors will be operators.


SLIDE 9

What is a vector?

  • A vector will be an operator that maps $C^\infty$ functions $f : \mathcal{P} \to \mathbb{R}$ to reals. For fixed coordinates $\theta$ and a point $p$, natural maps $\left(\frac{\partial}{\partial\theta_i}\right)_p$ emerge:

$$\left(\frac{\partial}{\partial\theta_i}\right)_p (f) = \left.\frac{\partial f}{\partial\theta_i}\right|_p.$$

They will be the basis of the tangent space.

  • For an arbitrary differentiable $\gamma$ we can express

$$\frac{d}{dt} f(\gamma(t)) = \left[\theta_1'(t)\left(\frac{\partial}{\partial\theta_1}\right)_{\gamma(t)} + \cdots + \theta_n'(t)\left(\frac{\partial}{\partial\theta_n}\right)_{\gamma(t)}\right](f).$$

The operator in the square brackets does not depend on $f$ and has the right type: it will be the speed/tangent vector.
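A quick numerical check of this identity on a toy coordinate curve; the chain rule itself is purely local, so a sketch in plain $\mathbb{R}^2$ coordinates suffices (the functions $f$ and $\gamma$ below are arbitrary choices):

```python
import numpy as np

f = lambda th: np.sin(th[0]) * th[1] ** 2        # a smooth test function f(theta)
gamma = lambda t: np.array([t ** 2, 1.0 + t])    # coordinate curve theta(t)
gamma_dot = lambda t: np.array([2 * t, 1.0])     # its derivative theta'(t)

def grad(f, th, h=1e-6):
    # Numerical partial derivatives (df/dtheta_1, ..., df/dtheta_n).
    return np.array([(f(th + h * e) - f(th - h * e)) / (2 * h)
                     for e in np.eye(len(th))])

t = 0.7
lhs = (f(gamma(t + 1e-6)) - f(gamma(t - 1e-6))) / 2e-6  # f(gamma(t))'
rhs = gamma_dot(t) @ grad(f, gamma(t))                  # the bracketed operator applied to f
print(lhs, rhs)   # these agree up to numerical error
```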


SLIDE 10

Is this a reasonable definition?

  • The speed vector $\dot\gamma(t)$ uniquely characterizes the rate of change of an arbitrary admissible function $f$:

$$\dot\gamma(t)(f) = \left.\frac{d}{dt} f(\gamma(t))\right|_t$$

  • There is a one-to-one correspondence

$$\dot\gamma(t) \;\longleftrightarrow_\theta\; (\dot\theta_1(t), \ldots, \dot\theta_n(t)) \in \mathbb{R}^n.$$

  • There are coordinate transformation formulas between the different bases $\left\{\frac{\partial}{\partial\theta_i}\right\}_{i=1}^n$ and $\left\{\frac{\partial}{\partial\psi_i}\right\}_{i=1}^n$.
  • We really cannot expect more if there is no geometrical structure!!!


SLIDE 11

Kullback-Leibler divergence

  • The most reasonable distance measure between adjacent distributions $p$ and $q$ is the symmetrized Kullback-Leibler divergence

$$J(p, q) = D_{pq} + D_{qp} = \int p(x) \log \frac{p(x)}{q(x)}\, dx + \int q(x) \log \frac{q(x)}{p(x)}\, dx.$$

  • It quantifies the additional cost of using the wrong distribution.
  • In the discrete case it means that encoding with the wrong distribution costs $J(p, q)$ extra bits.
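In code, the symmetrized divergence for discrete distributions is straightforward (a minimal sketch; the natural logarithm measures nats, use log base 2 for bits):

```python
import numpy as np

def J(p, q):
    # Symmetrized KL divergence J(p, q) = D(p||q) + D(q||p).
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(J(p, q), J(q, p))   # symmetric by construction
```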


SLIDE 12

What is a reasonable distance metric?

Consider an infinitesimal movement along the curve γ(t).

  • The corresponding change of coordinates is from $\theta$ to $\theta + \dot\theta\,\Delta t$, and the distance formula gives

$$d(p, q)^2 \approx \Delta t^2 \|\dot\gamma(t)\|^2 = \Delta t^2 \sum_{i,j=1}^{n} \dot\theta_i \dot\theta_j \left\langle \frac{\partial}{\partial\theta_i}, \frac{\partial}{\partial\theta_j} \right\rangle$$

  • Under mild regularity conditions

$$J(p, q) \approx \Delta t^2 \sum_{i,j=1}^{n} \dot\theta_i \dot\theta_j\, g_{ij}, \qquad g_{ij} = \int p(x) \cdot \frac{\partial \ell(x|\theta)}{\partial\theta_i} \cdot \frac{\partial \ell(x|\theta)}{\partial\theta_j}\, dx.$$

  • Hence, the local requirement $d^2(p, q) \approx J(p, q)$ fixes the geometry:

$$\left\langle \frac{\partial}{\partial\theta_i}, \frac{\partial}{\partial\theta_j} \right\rangle = g_{ij}.$$
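A numerical sanity check of this local approximation for the multinomial family, using the diagonal Fisher metric $g_{ij} = \delta_{ij}/\theta_i$ that appears on slide 19 (the particular $p$, direction $v$, and step $\Delta t$ are illustrative):

```python
import numpy as np

def J(p, q):
    # Symmetrized KL divergence.
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

p = np.array([0.5, 0.3, 0.2])
v = np.array([0.02, -0.01, -0.01])   # tangent direction: components sum to zero
dt = 0.1
q = p + dt * v                       # adjacent distribution theta + theta_dot * dt

# dt^2 * sum_ij v_i v_j g_ij with g_ij = delta_ij / theta_i for the multinomial
quadratic_form = dt ** 2 * np.sum(v ** 2 / p)
print(J(p, q), quadratic_form)       # nearly equal for small dt
```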


SLIDE 13

Limitations of geodesic distance

  • The geodesic distance $d(p, q)$ is the length of the shortest path between $p$ and $q$.
  • The geodesic distance cannot always be used for SVM kernels:
    ⋆ An SVM kernel (Mercer kernel) is a computational shortcut for $K(x, y) = \Psi(x) \cdot \Psi(y)$, where $\Psi : \mathbb{R}^n \to \mathbb{R}^d$ is a smooth enough function.
    ⋆ If the geodesic distance corresponds to a Mercer kernel, then there must be only one shortest path between any two points.
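The Mercer condition can be probed empirically: a valid kernel matrix must be symmetric positive semidefinite. A minimal sketch (our own toy example: on a circle there are two shortest paths between antipodal points, exactly the situation the last point warns about):

```python
import numpy as np

def is_psd(K, tol=1e-10):
    # Empirical Mercer check via the eigenvalues of the Gram matrix.
    return bool(np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol))

# Points on a circle; d is the arc-length (geodesic) distance.
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
d = np.abs(angles[:, None] - angles[None, :])
d = np.minimum(d, 2 * np.pi - d)

# Gaussian-of-geodesic-distance matrix; whether it stays PSD
# depends on the geometry and on t.
K = np.exp(-d ** 2 / (4 * 0.5))
print(is_psd(K))
```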


SLIDE 14

Classification via temperature

  • Consider two classes, ”hot” and ”cold”, i.e. each data point has an initial amount of heat $\lambda_i$ concentrated in a small neighborhood.
  • All other points have zero temperature.
  • Fix a time moment $t$. All points with temperature below zero belong to the class ”cold” and the others to the class ”hot”.
  • Heat gradually diffuses over the manifold. As $t \to \infty$, all points approach a constant temperature. Varying $t$ gives different levels of smoothing.
  • Large $t$ gives a flatter decision border, that is, classification is more robust but also less sensitive.


SLIDE 15

How to model heat diffusion?

  • Classical heat diffusion is given by the partial differential equation

$$\frac{\partial f}{\partial t} - \Delta f = 0, \qquad f(x, 0) = f(x),$$

together with Dirichlet or Neumann boundary conditions.

  • In non-Euclidean geometry the Laplace operator has a nasty form:

$$\Delta f = \det(G)^{-1/2} \sum_{i,j=1}^{n} \frac{\partial}{\partial\theta_j} \left( g^{ij} \det(G)^{1/2}\, \frac{\partial f}{\partial\theta_i} \right),$$

where $g^{ij}$ are the elements of the inverse of the Fisher information matrix $G$.


SLIDE 16

Extracting the kernel

  • In the Euclidean space $\mathbb{R}^n$

$$\Delta f = \frac{\partial^2 f}{\partial x_1^2} + \cdots + \frac{\partial^2 f}{\partial x_n^2}.$$

  • The solution corresponding to the initial condition $f(x)$ is

$$f(x, t) = (4\pi t)^{-n/2} \int \exp\left(\frac{-\|x - y\|^2}{4t}\right) f(y)\, dy$$

  • Alternatively,

$$f(x, t) = \int K_t(x, y)\, f(y)\, dy, \qquad K_t(x, y) = (4\pi t)^{-n/2} \exp\left(\frac{-\|x - y\|^2}{4t}\right)$$

  • In SVMs, $f = \lambda_1 \delta_{x_1} + \cdots + \lambda_k \delta_{x_k}$ and the integral collapses to a sum.
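A minimal sketch of this collapse (the points and weights are illustrative; the $\pm 1$ weights match the hot/cold picture of slide 14):

```python
import numpy as np

def K_t(x, y, t):
    # Euclidean heat kernel K_t(x, y) = (4*pi*t)^(-n/2) * exp(-||x - y||^2 / (4t)).
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    return (4 * np.pi * t) ** (-n / 2) * np.exp(-np.sum((x - y) ** 2) / (4 * t))

# With f = lambda_1 delta_{x_1} + ... + lambda_k delta_{x_k} the solution is
# f(x, t) = sum_i lambda_i K_t(x, x_i): a weighted sum over labelled points.
pts = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 2.0])]
lam = [1.0, 1.0, -1.0]              # +1 for "hot", -1 for "cold"
x = np.array([0.5, 0.5])
temperature = sum(l * K_t(x, p, t=0.2) for l, p in zip(lam, pts))
print(temperature)                  # its sign gives the predicted class
```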


SLIDE 17

Central theoretical result

Theorem. Let $M$ be a complete Riemannian manifold. Then there exists a kernel function $K$ (the heat kernel) which satisfies the following properties:
(1) $K(x, y, t) = K(y, x, t)$;
(2) $\lim_{t \to 0} K(x, y, t) = \delta(x, y)$;
(3) $\left(\Delta - \frac{\partial}{\partial t}\right) K(x, y, t) = 0$;
(4) $K(x, y, t) = \int K(x, z, t - s)\, K(z, y, s)\, dz$.

The assertion means: (1) if $q$ converges parameter-wise to $p$, then $J(p, q) \to 0$;


SLIDE 18

A ”slight” drawback!

  • There are few known closed-form solutions of the heat diffusion kernel.
  • The approximation makes things complicated:

$$K_t(x, y) \approx K_t^{(m)}(x, y) = (4\pi t)^{-n/2} \exp\left(\frac{-d^2(x, y)}{4t}\right) \left(\psi_0(x, y) + \psi_1(x, y)\, t + \cdots + \psi_m(x, y)\, t^m\right),$$

where $d(x, y)$ is the geodesic distance.

  • Nasty but closed-form formulas for the approximation terms exist.
  • The approximation error is $O(t^m)$.
  • The approximation does not have to be a Mercer kernel.


SLIDE 19

Example: Geometry of multinomials

It is straightforward to compute the Fisher information matrix of the multinomial family:

$$g_{ij} = \begin{cases} 0, & \text{if } i \neq j, \\ 1/\theta_i, & \text{if } i = j. \end{cases}$$

  • There are no known closed-form solutions.
  • We need an easy way to compute geodesic distances.


SLIDE 20

Isometry—a way to simplify things

  • An isometry is a $C^\infty$ differentiable map $F : \mathcal{P} \to \mathcal{S}$ that preserves the lengths of paths.
  • The model will be the positive portion of the radius-2 sphere in $\mathbb{R}^{n+1}$:

$$\mathcal{S}_+ = \left\{(x_1, \ldots, x_{n+1}) : x_1^2 + \cdots + x_{n+1}^2 = 4,\ x_i > 0\right\}.$$

  • It is easy to verify that

$$F(\theta_1, \ldots, \theta_n) = \left(2\sqrt{\theta_1}, \ldots, 2\sqrt{\theta_{n+1}}\right)$$

preserves lengths, i.e. the lengths of tangent vectors along curves are always the same.

SLIDE 21

Example: Distances of trinomials


SLIDE 22

Explicit form of multinomial kernel

  • Since the shortest paths on spheres are great circles,

$$d(\theta, \theta') = 2 \arccos\left(\tfrac{1}{4}\langle F(\theta), F(\theta')\rangle\right) = 2 \arccos\left(\sqrt{\theta_1 \theta_1'} + \cdots + \sqrt{\theta_{n+1} \theta_{n+1}'}\right),$$

where $\theta_{n+1} = 1 - \theta_1 - \cdots - \theta_n$ and $\theta'_{n+1} = 1 - \theta'_1 - \cdots - \theta'_n$.

  • For a first-order approximation with error $O(t)$ it is sufficient to use

$$K_t(\theta, \theta') = (4\pi t)^{-n/2} \exp\left(\frac{-\arccos^2\left(\langle\sqrt{\theta}, \sqrt{\theta'}\rangle\right)}{t}\right).$$

  • Compared with the Gaussian kernel, it works better if the data are close to the edges of the simplex.
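A minimal implementation of this first-order kernel (the function name and the clipping guard against rounding are ours):

```python
import numpy as np

def diffusion_kernel(theta1, theta2, t):
    # First-order multinomial diffusion kernel:
    # K_t = (4*pi*t)^(-n/2) * exp(-arccos^2(<sqrt(theta1), sqrt(theta2)>) / t),
    # where n is the dimension of the simplex.
    theta1, theta2 = np.asarray(theta1, float), np.asarray(theta2, float)
    n = theta1.size - 1
    cos = np.clip(np.sum(np.sqrt(theta1 * theta2)), -1.0, 1.0)
    return (4 * np.pi * t) ** (-n / 2) * np.exp(-np.arccos(cos) ** 2 / t)

# Two documents embedded onto the simplex via the tf embedding of slide 5:
p = np.array([3, 1, 2]) / 6.0
q = np.array([1, 4, 1]) / 6.0
print(diffusion_kernel(p, q, t=0.5))
```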


SLIDE 23

Gaussian vs. heat kernel


SLIDE 24

Conclusion

  • Information geometry provides parameterization-independent kernels.
  • Devising a kernel for more complex models requires enormous intellectual effort.
  • However, nothing stops us from using already derived kernels.
  • SLT bounds are available: the asymptotic generalization performance is essentially the same as for Gaussian kernels of the same dimension.
