SLIDE 1 Kernel methods for Network Analysis: An introduction
Chiranjib Bhattacharyya
Machine Learning lab Dept of CSA, IISc chiru@csa.iisc.ernet.in http://drona.csa.iisc.ernet.in/~chiru
13th Jan, 2013
SLIDE 2
Computational Biology: Which super-family does this protein structure belong to?
SLIDE 3
Multimedia Who are the actors?
SLIDE 4
Social Networks: How can one run a successful ad campaign on this network?
SLIDE 5
Data Representation as a vector
SLIDE 6
Data Representation as a vector
SLIDE 7
Data Representation as a vector
SLIDE 8
Data Representation as a vector
SLIDE 9
Data Representation as a vector (feature map)
SLIDE 10
When we have feature maps: Linear Classifiers, Principal Component Analysis
SLIDE 11
Similarity may be readily available
Problem: feature maps are not readily available
SLIDE 12
Kernel functions: a formal notion of similarity. Kernel functions are essentially similarity functions. One can easily generalize many existing algorithms using kernel functions; this is sometimes called the kernel trick. Kernels can also help integrate different sources of data.
SLIDE 13 Agenda
1 Kernel Trick
SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?
2 Mathematical Foundations
RKHS, Representer theorem
3 Kernels on Graphs aka Networks
Kernels on vertices of a Graph Kernels on graphs
4 Advanced Topics: Multiple Kernel Learning
SLIDE 14 1 Kernel Trick
SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?
2 Mathematical Foundations
RKHS, Representer theorem
3 Kernels on Graphs aka Networks
Kernels on vertices of a Graph Kernels on graphs
4 Advanced Topics: Multiple Kernel Learning
SLIDE 15
PART 1: KERNEL TRICK
SLIDE 16 1 Kernel Trick
SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?
2 Mathematical Foundations
RKHS, Representer theorem
3 Kernels on Graphs aka Networks
Kernels on vertices of a Graph Kernels on graphs
4 Advanced Topics: Multiple Kernel Learning
SLIDE 17 The problem of classification
Given: training data D = {(x_i, y_i) | i = 1,...,m}, with class labels y_i ∈ {−1, 1}
Find: a classifier f : X → {−1, 1}, f(x) = sign(w⊤x + b)
SLIDE 18 Regularized risk
min_{w,b} C ∑_{i=1}^m max(1 − y_i(w⊤x_i + b), 0) + (1/2)‖w‖²
The (1/2)‖w‖² term is the regularization.
SLIDE 19 Regularized risk
min_{w,b} C ∑_{i=1}^m max(1 − y_i(w⊤x_i + b), 0) + (1/2)‖w‖²   ((1/2)‖w‖² is the regularization term)
The SVM formulation:
min_{w,b,ξ} (1/2)‖w‖² + C ∑_{i=1}^m ξ_i
subject to y_i(w⊤x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0 ∀ i ∈ [m]
SLIDE 20 SVM formulation
maximize_α ∑_{i=1}^m α_i − (1/2) ∑_{ij} α_i α_j y_i y_j x_i⊤x_j
subject to 0 ≤ α_i, ∑_{i=1}^m α_i y_i = 0
SLIDE 21 SVM formulation
maximize_α ∑_{i=1}^m α_i − (1/2) ∑_{ij} α_i α_j y_i y_j x_i⊤x_j
subject to 0 ≤ α_i, ∑_{i=1}^m α_i y_i = 0
w = ∑_{i=1}^m α_i y_i x_i
f(x) = sign(∑_{i=1}^m α_i y_i x_i⊤x + b)
SLIDE 22 C-SVM in feature spaces
Let us work with a feature map Φ(x).
maximize_α −(1/2) ∑_{ij} α_i α_j y_i y_j Φ(x_i)⊤Φ(x_j) + ∑_{i=1}^m α_i
subject to 0 ≤ α_i, ∑_i α_i y_i = 0
f(x) = sign(∑_{i=1}^m α_i y_i Φ(x_i)⊤Φ(x) + b)
Let the dot product between any pair of examples computed in the feature space be denoted by K(x,z) = Φ(x)⊤Φ(z).
SLIDE 23 C-SVM in feature spaces
Let us work with a feature map Φ(x).
maximize_α −(1/2) ∑_{ij} α_i α_j y_i y_j K(x_i, x_j) + ∑_{i=1}^m α_i
subject to 0 ≤ α_i, ∑_i α_i y_i = 0
f(x) = sign(∑_{i=1}^m α_i y_i K(x_i, x) + b)
Let the dot product between any pair of examples computed in the feature space be denoted by K(x,z) = Φ(x)⊤Φ(z).
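Below is a minimal sketch of this kernelized C-SVM in code, using scikit-learn's SVC with a precomputed Gram matrix; the toy dataset and the Gaussian kernel are illustrative assumptions, not part of the slides.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy data (assumption: any labelled sample works here).
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]

def rbf_kernel(A, B, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# The solver only ever sees K(x_i, x_j), never a feature map Phi.
K_tr = rbf_kernel(X_tr, X_tr)
clf = SVC(C=1.0, kernel="precomputed").fit(K_tr, y_tr)

# Prediction also needs only kernel evaluations K(x, x_i).
K_te = rbf_kernel(X_te, X_tr)
print("test accuracy:", clf.score(K_te, y_te))
```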
SLIDE 24 1 Kernel Trick
SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?
2 Mathematical Foundations
RKHS, Representer theorem
3 Kernels on Graphs aka Networks
Kernels on vertices of a Graph Kernels on graphs
4 Advanced Topics: Multiple Kernel Learning
SLIDE 25 Principal Component Analysis(PCA)
Principal directions: Given X = [x_1,...,x_m], find directions of maximum variance (Jolliffe 2002). The direction of maximum variance, v, is given by (1/m) XX⊤v = λv (assuming that Xe = 0). Define v = Xα; then (1/m) XX⊤Xα = λXα, leading to the eigenvalue problem (1/m) Kα = λα, where (K)_{ij} = (X⊤X)_{ij} = x_i⊤x_j.
SLIDE 26 Nonlinear component analysis(Scholkopf et al. 1996)
Compute PCA in feature spaces: replace x_i⊤x_j by Φ(x_i)⊤Φ(x_j).
Principal component of x: in the input space, v⊤x; in the feature space, ∑_{i=1}^m α_i K(x_i, x).
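A small numpy sketch of the kernel PCA computation above; centring of the Gram matrix (the feature-space analogue of the Xe = 0 assumption) is done explicitly, and the Gaussian kernel on random data is only an illustrative choice.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Project onto principal directions in feature space using only the Gram matrix K."""
    m = K.shape[0]
    # Centre the kernel matrix: equivalent to subtracting the feature-space mean.
    one = np.ones((m, m)) / m
    Kc = K - one @ K - K @ one + one @ K @ one
    # (1/m) K alpha = lambda alpha  <=>  eigenvectors of the centred Gram matrix.
    eigval, eigvec = np.linalg.eigh(Kc)
    idx = np.argsort(eigval)[::-1][:n_components]
    eigval, eigvec = eigval[idx], eigvec[:, idx]
    # Rescale alpha so that the implicit direction v = sum_i alpha_i Phi(x_i) has unit norm.
    alpha = eigvec / np.sqrt(np.maximum(eigval, 1e-12))
    # Principal components of the training points: sum_i alpha_i K(x_i, x).
    return Kc @ alpha

# Usage with a Gaussian kernel on random data (illustrative):
X = np.random.RandomState(0).randn(50, 5)
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
Z = kernel_pca(np.exp(-0.5 * sq), n_components=2)
```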
SLIDE 27 We just need the dot product
Let x ∈ IR² and Φ(x) = [x_1², x_2², √2 x_1x_2]⊤. Then K(x,z) = Φ(x)⊤Φ(z) = x_1²z_1² + 2x_1x_2z_1z_2 + x_2²z_2² = (x⊤z)².
More generally, K(x,z) = (x⊤z)^r is a dot product in a feature space of dimension C(d+r−1, r) for x,z ∈ IR^d. If d = 256 and r = 4, the feature space has size 635,376. However, if we know K, one can still solve the SVM formulation without explicitly evaluating Φ.
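A quick numeric check of the d = 2, r = 2 example above, assuming nothing beyond the explicit map Φ(x) = [x_1², x_2², √2 x_1x_2]⊤.

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map from the slide.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(z), (x @ z) ** 2)  # K(x,z) = (x^T z)^2
```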
SLIDE 28 1 Kernel Trick
SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?
2 Mathematical Foundations
RKHS, Representer theorem
3 Kernels on Graphs aka Networks
Kernels on vertices of a Graph Kernels on graphs
4 Advanced Topics: Multiple Kernel Learning
SLIDE 29 Norms, Distances
Norm: ‖Φ(x)‖ = √K(x,x)
Normalized features: Φ̂(x) = Φ(x)/‖Φ(x)‖, so K̂(x,z) = Φ̂(x)⊤Φ̂(z) = K(x,z)/√(K(x,x)K(z,z))
Distances: ‖Φ(x) − Φ(z)‖² = (Φ(x) − Φ(z))⊤(Φ(x) − Φ(z)) = K(x,x) + K(z,z) − 2K(x,z). If Φ is normalized (K(x,x) = 1), then ‖Φ(x) − Φ(z)‖² = 2 − 2K(x,z).
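The distance identity above translates directly into code; this sketch assumes only that some kernel function k is available.

```python
import numpy as np

def kernel_distance(k, x, z):
    """||Phi(x) - Phi(z)|| computed without ever forming Phi."""
    return np.sqrt(max(k(x, x) + k(z, z) - 2.0 * k(x, z), 0.0))

# With a normalised kernel (k(x,x) = 1) this reduces to sqrt(2 - 2 k(x,z)),
# e.g. for the Gaussian kernel:
rbf = lambda x, z: float(np.exp(-0.5 * np.sum((x - z) ** 2)))
x, z = np.array([0.0, 1.0]), np.array([2.0, 0.5])
print(kernel_distance(rbf, x, z), np.sqrt(2 - 2 * rbf(x, z)))
```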
SLIDE 30
In the sequel
Will formalize these notions conditions on K will be discussed K for graphs
SLIDE 31 1 Kernel Trick
SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?
2 Mathematical Foundations
RKHS, Representer theorem
3 Kernels on Graphs aka Networks
Kernels on vertices of a Graph Kernels on graphs
4 Advanced Topics: Multiple Kernel Learning
SLIDE 32
Definition of Kernel functions
SLIDE 33
Kernel function
A function K : X × X → IR is a kernel function if K(x,z) = K(z,x) (symmetric) and K is positive semidefinite, i.e. ∀n and x_1,...,x_n ∈ X, the matrix K_{ij} = K(x_i, x_j) is psd. Recall that a matrix K ∈ IR^{d×d} is psd if u⊤Ku ≥ 0 for all u ∈ IR^d.
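The definition can be checked numerically on any finite sample: build the Gram matrix and verify symmetry and non-negative eigenvalues. The Gaussian kernel on random points below is an illustrative choice.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(30, 4)                          # arbitrary x_1, ..., x_n
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-0.1 * sq)                         # K_ij = K(x_i, x_j)

assert np.allclose(K, K.T)                    # symmetry
assert np.linalg.eigvalsh(K).min() > -1e-10   # positive semidefinite (up to round-off)
```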
SLIDE 34 Examples of Kernel functions
K(x,z) = φ(x)⊤φ(z), where φ : X → IR^d.
K is symmetric, i.e. K(x,z) = K(z,x).
Positive semidefinite: let D = {x_1, x_2,...,x_n} be a set of n arbitrarily chosen elements of X and define K_{ij} = φ(x_i)⊤φ(x_j). For any u ∈ IR^n it is straightforward to see that u⊤Ku = ‖∑_{i=1}^n u_i φ(x_i)‖₂² ≥ 0.
SLIDE 35 Examples of Kernel functions
K(x,z) = x⊤z, with Φ(x) = x.
K(x,z) = (x⊤z)^r, with Φ_{t_1 t_2 ... t_d}(x) = √(r!/(t_1! t_2! ... t_d!)) x_1^{t_1} x_2^{t_2} ... x_d^{t_d}, where ∑_{i=1}^d t_i = r.
K(x,z) = e^{−γ‖x−z‖²}.
SLIDE 36 Kernel Construction
Let K_1 and K_2 be two valid kernels. The following are also valid kernels:
K(x,y) = φ(x)⊤φ(y) for any feature map φ
K(u,v) = K_1(u,v) K_2(u,v)
K = αK_1 + βK_2, with α,β ≥ 0
K̂(x,y) = K(x,y)/√(K(x,x)K(y,y))
SLIDE 37 Kernel Construction
Let K_1 and K_2 be two valid kernels. The following are also valid kernels:
K(x,y) = φ(x)⊤φ(y) for any feature map φ
K(u,v) = K_1(u,v) K_2(u,v)
K = αK_1 + βK_2, with α,β ≥ 0
K̂(x,y) = K(x,y)/√(K(x,x)K(y,y))
Example: K(x,y) = x⊤y is a kernel, hence so is K(x,y) = (x⊤y)^i, and so is K(x,y) = lim_{N→∞} ∑_{i=0}^N (x⊤y)^i / i! = e^{x⊤y}. Normalizing gives K̂(x,y) = e^{−(1/2)‖x−y‖²}.
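A quick numeric confirmation of the construction above: normalizing K(x,y) = e^{x⊤y} yields exactly e^{−(1/2)‖x−y‖²}.

```python
import numpy as np

x, y = np.array([0.3, -1.2, 0.7]), np.array([1.0, 0.4, -0.5])
K = lambda u, v: np.exp(u @ v)
K_hat = K(x, y) / np.sqrt(K(x, x) * K(y, y))        # normalised kernel
assert np.isclose(K_hat, np.exp(-0.5 * np.sum((x - y) ** 2)))
```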
SLIDE 38 Kernel function and feature map
A theorem due to Mercer guarantees a feature map for symmetric, psd kernel functions. Loosely stated: for a symmetric kernel K : X × X → IR, there exists an expansion K(x,z) = Φ(x)⊤Φ(z) iff K is positive semidefinite.
SLIDE 39 What is a Dot product(aka Inner Product)
Let X be a vector space. A dot product ⟨·,·⟩ on X satisfies:
Symmetry: ⟨u,v⟩ = ⟨v,u⟩ for u,v ∈ X
Bilinearity: ⟨αu + βv, w⟩ = α⟨u,w⟩ + β⟨v,w⟩ for u,v,w ∈ X
Positive semidefiniteness: ⟨u,u⟩ ≥ 0 for u ∈ X, and ⟨u,u⟩ = 0 iff u = 0
The induced norm is ‖x‖ = √⟨x,x⟩, so ‖x‖ = 0 ⟹ x = 0.
SLIDE 40 Examples of Dot products
X = IR^n, ⟨u,v⟩ = u⊤v
X = IR^n, ⟨u,v⟩ = ∑_{i=1}^n λ_i u_i v_i with λ_i ≥ 0
X = L₂(X) = {f : ∫ f(x)² dx < ∞}, with ⟨f,g⟩ = ∫ f(x) g(x) dx for f,g ∈ X
SLIDE 41 Cauchy–Schwarz inequality
Cauchy–Schwarz inequality: Let X be an inner product space. Then |⟨x,y⟩| ≤ ‖x‖‖y‖ for all x,y ∈ X, and equality holds iff x = αy for some scalar α.
Proof: For all α ∈ IR, ‖x − αy‖² ≥ 0, i.e. ‖x‖² − 2α⟨x,y⟩ + α²‖y‖² ≥ 0. Taking α = ⟨x,y⟩/‖y‖², the inequality follows by taking square roots. The claim about equality follows from the definition of the norm.
SLIDE 42
Hilbert Space: Basic facts. Definition: An inner product space (H, ⟨·,·⟩_H) is a Hilbert space if it is separable and complete. Denote the norm by ‖·‖_H.
SLIDE 43 Projections in Hilbert space
The orthogonal complement of M ⊂ H is defined as M⊥ = {z | ⟨x,z⟩_H = 0 ∀x ∈ M}.
Hilbert space projection theorem: Let M be a subspace of the Hilbert space (H, ⟨·,·⟩_H). For every x ∈ H the following hold:
There exists a unique Π_M(x) ∈ M such that Π_M(x) = argmin_{z∈M} ‖x − z‖_H.
x − Π_M(x) ∈ M⊥, i.e. ⟨z, x − Π_M(x)⟩_H = 0 ∀z ∈ M.
‖x‖²_H = ‖Π_M(x)‖²_H + ‖y‖²_H, where x = Π_M(x) + y and y ∈ M⊥.
SLIDE 44 1 Kernel Trick
SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?
2 Mathematical Foundations
RKHS, Representer theorem
3 Kernels on Graphs aka Networks
Kernels on vertices of a Graph Kernels on graphs
4 Advanced Topics: Multiple Kernel Learning
SLIDE 45 Reproducing kernel Hilbert Space(RKHS)
Let K be any kernel function. Consider the following set:
H = {f | f(·) = ∑_{i=1}^m α_i K(·, x_i), x_i ∈ X, m ∈ N}
Reproducing property: ∀f ∈ H,
f(x) = ∑_{i=1}^m α_i K(x, x_i) = ⟨∑_{i=1}^m α_i K(·, x_i), K(·, x)⟩ = ⟨f(·), K(·, x)⟩
SLIDE 46 Dot product in RKHS
Dot product: For f,g ∈ H with f(·) = ∑_{i=1}^{m_1} α_i K(·, x_i) and g(·) = ∑_{j=1}^{m_2} β_j K(·, x_j), define
⟨f,g⟩_H = ∑_{i=1}^{m_1} ∑_{j=1}^{m_2} α_i β_j K(x_i, x_j)
As K is symmetric, ⟨f,g⟩_H = ⟨g,f⟩_H.
⟨f,f⟩_H = ∑_{i=1}^m ∑_{j=1}^m α_i α_j K(x_i, x_j). Recall that the matrix K is psd if K is a kernel function, and so ⟨f,f⟩_H ≥ 0.
The Cauchy–Schwarz inequality holds, so |f(x)| = |⟨f, K(·,x)⟩_H| ≤ ‖f‖_H √K(x,x).
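For functions of the form f = ∑_i α_i K(·, x_i), the RKHS quantities above reduce to matrix expressions; a small sketch (the Gaussian kernel and random centres are illustrative assumptions).

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(20, 3)                                  # centres x_1,...,x_m
alpha = rng.randn(20)                                 # coefficients of f
k = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))  # kernel K(.,.)
K = np.array([[k(a, b) for b in X] for a in X])

# ||f||_H^2 = sum_ij alpha_i alpha_j K(x_i, x_j) = alpha^T K alpha >= 0
norm_sq = alpha @ K @ alpha
# Reproducing property: f(x) = <f, K(., x)>_H = sum_i alpha_i K(x_i, x)
x = rng.randn(3)
f_x = alpha @ np.array([k(xi, x) for xi in X])
# Cauchy-Schwarz bound from the slide: |f(x)| <= ||f||_H * sqrt(K(x, x))
assert abs(f_x) <= np.sqrt(norm_sq) * np.sqrt(k(x, x)) + 1e-12
```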
SLIDE 47 Representer theorem
Representer theorem: Let K be a valid kernel defined on X and let H be the corresponding RKHS. Let Ω be an increasing function. The optimization problem
min_{g∈H} G(g) = ∑_{i=1}^m l(g(x_i), y_i) + Ω(‖g‖²_H)
is solved by some g* = ∑_{i=1}^m α_i K(·, x_i).
SLIDE 48 Representer theorem
Representer theorem: Let K be a valid kernel defined on X and let H be the corresponding RKHS. Let Ω be an increasing function. The optimization problem
min_{g∈H} G(g) = ∑_{i=1}^m l(g(x_i), y_i) + Ω(‖g‖²_H)
is solved by some g* = ∑_{i=1}^m α_i K(·, x_i).
Proof: Let M = {∑_{i=1}^m α_i K(·, x_i)}, the span of the functions K(·, x_i), i = 1,...,m. Clearly M is a subspace of H. Take any g ∈ H and write g = g_M + g_per, with g_M ∈ M and g_per ∈ M⊥. Then
g(x_i) = ⟨g, K(·, x_i)⟩ = ⟨g_M + g_per, K(·, x_i)⟩ = ⟨g_M, K(·, x_i)⟩ + ⟨g_per, K(·, x_i)⟩ = ⟨g_M, K(·, x_i)⟩ = g_M(x_i)
As Ω is an increasing function, Ω(‖g‖²_H) ≥ Ω(‖g_M‖²_H).
SLIDE 49 Back to C-SVM formulation
Given a kernel function K defined on X, one can create the RKHS
H = {∑_{i=1}^n β_i K(·, z_i) | z_i ∈ X, n ∈ N}
Classifier: f(x) = sign(g(x) + b), with
min_{g∈H, b∈IR} ∑_{i=1}^m max(0, 1 − y_i(g(x_i) + b)) + ‖g‖²_H
At optimality, g(·) = ∑_{i=1}^m γ_i K(·, x_i) (Representer theorem).
SLIDE 50 1 Kernel Trick
SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?
2 Mathematical Foundations
RKHS, Representer theorem
3 Kernels on Graphs aka Networks
Kernels on vertices of a Graph Kernels on graphs
4 Advanced Topics: Multiple Kernel Learning
SLIDE 51
Many Applications
Graphs are ideal to model molecules, protein-protein interaction networks, metabolic networks, and social networks.
SLIDE 52
Graph Kernels
Kernels on vertices of a graph G = (V,E): compute K(v_i, v_j), where v_i, v_j ∈ V.
Kernels on graphs: compute K(G_1, G_2), where G_1, G_2 are two graphs.
SLIDE 53 1 Kernel Trick
SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?
2 Mathematical Foundations
RKHS, Representer theorem
3 Kernels on Graphs aka Networks
Kernels on vertices of a Graph Kernels on graphs
4 Advanced Topics: Multiple Kernel Learning
SLIDE 54 Diffusion Kernels(Kondor and Lafferty 2002)
Let X = {1,...,m} and let there be some associated edges between the elements; let the adjacency matrix of the resulting graph be A.
Diffusion kernel: K = lim_{s→∞} (I + (β/s) H)^s, where H = A − D and D is diagonal with d_ii = ∑_j a_ij.
K is positive definite and symmetric. Computation is O(m³).
SLIDE 55 Diffusion Kernels(Kondor and Lafferty 2002)
Let X = {1,...,m} and let there be some associated edges between the elements; let the adjacency matrix of the resulting graph be A.
Diffusion kernel: K = lim_{s→∞} (I + (β/s) H)^s, where H = A − D and D is diagonal with d_ii = ∑_j a_ij.
K is positive definite and symmetric. Computation is O(m³).
Since lim_{s→∞} (1 + (β/s) x)^s = e^{βx}, we have K = e^{βH} = ∑_{i=1}^m v_i e^{βλ_i} v_i⊤, where (λ_i, v_i) are the (eigenvalue, eigenvector) pairs of H.
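A sketch of the diffusion-kernel computation above via the matrix exponential; scipy's expm and the small random graph are illustrative choices.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.RandomState(0)
A = (rng.rand(8, 8) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric adjacency, no self-loops
H = A - np.diag(A.sum(axis=1))                 # H = A - D

beta = 0.5
K = expm(beta * H)                             # K = e^{beta H}

# Equivalent eigendecomposition route from the slide: K = sum_i e^{beta lambda_i} v_i v_i^T
lam, V = np.linalg.eigh(H)
K_eig = (V * np.exp(beta * lam)) @ V.T
assert np.allclose(K, K_eig)
assert np.linalg.eigvalsh(K).min() > 0         # K is positive definite
```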
SLIDE 56 Diffusion Kernels(Kondor and Lafferty 2002)
Sometimes K can be computed in closed form for special graphs, e.g. for the complete graph on m vertices:
K(i,j) = (1 + (m−1)e^{−mβ})/m if i = j, and K(i,j) = (1 − e^{−mβ})/m if i ≠ j.
It has a very interesting analogue with the diffusion equation in physics.
SLIDE 57 1 Kernel Trick
SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?
2 Mathematical Foundations
RKHS, Representer theorem
3 Kernels on Graphs aka Networks
Kernels on vertices of a Graph Kernels on graphs
4 Advanced Topics: Multiple Kernel Learning
SLIDE 58
Kernels on graphs
Graph isomorphism: find a mapping g from the vertices of G_1 = (V_1,E_1) to the vertices of G_2 = (V_2,E_2) such that G_1 and G_2 are identical: if (u,v) ∈ E_1 iff (g(u),g(v)) ∈ E_2, then g is an isomorphism.
Subgraph isomorphism: is there a subgraph S of G_1 and a subgraph T of G_2 such that S and T are isomorphic? This is NP-hard, so we need computationally efficient approximations.
SLIDE 59
Desiderata for a kernel function
Computationally efficient; positive definite; can relate graph structures; applicable to a wide variety of graphs.
SLIDE 60 Some Definitions
Let A be an m×n matrix and B a p×q matrix. The Kronecker product A ⊗ B is the mp×nq block matrix
A ⊗ B = [a_11 B ··· a_1n B; ... ; a_m1 B ··· a_mn B]
SLIDE 61 Definitions: Product graph
Let G_1 = (V_1,E_1) and G_2 = (V_2,E_2) be two graphs. G = (V,E) is the product graph of G_1 and G_2 if V = V_1 × V_2 and ((i,i′),(j,j′)) ∈ E iff (i,j) ∈ E_1 and (i′,j′) ∈ E_2. Its adjacency matrix is A(G) = A(G_1) ⊗ A(G_2).
SLIDE 62 Random walk kernel between two graphs (Vishwanathan et al. 2010)
Random walk kernel:
K(G_1,G_2) = ∑_{i,j=1}^{|V|} ∑_{t=0}^∞ λ^t [A^t]_{ij} = e⊤(I − λA)^{−1} e
where |V| is the number of vertices of the product graph of G_1 and G_2, A = A(G_1) ⊗ A(G_2), and e is the all-ones vector.
It counts the number of common walks obtained by performing simultaneous random walks on G_1 and G_2.
Computational complexity is O(n⁶), where n = |V(G_1)| = |V(G_2)|.
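A direct sketch of the random walk kernel above: form the product-graph adjacency with a Kronecker product and solve the linear system instead of forming the inverse. The two toy graphs and the decay λ are illustrative; λ must be small enough for the geometric series to converge.

```python
import numpy as np

A1 = np.array([[0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]], dtype=float)   # triangle
A2 = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]], dtype=float)   # path on 3 vertices

def random_walk_kernel(A1, A2, lam=0.05):
    """K(G1, G2) = e^T (I - lam * A_x)^{-1} e, with A_x = A(G1) kron A(G2)."""
    Ax = np.kron(A1, A2)                   # adjacency of the product graph
    n = Ax.shape[0]
    e = np.ones(n)
    # lam must be small enough that sum_t lam^t Ax^t converges.
    return e @ np.linalg.solve(np.eye(n) - lam * Ax, e)

print(random_walk_kernel(A1, A2))
```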
SLIDE 63
Can we compute it more efficiently(Vishwanathan et al. 2010)
Sylvester equation: Given S, T, and M_0, one can solve for M in M = SMT⊤ + M_0 in O(n³) time; this is the key to evaluating the random walk kernel more efficiently than the naive O(n⁶) computation.
SLIDE 64 1 Kernel Trick
SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?
2 Mathematical Foundations
RKHS, Representer theorem
3 Kernels on Graphs aka Networks
Kernels on vertices of a Graph Kernels on graphs
4 Advanced Topics: Multiple Kernel Learning
SLIDE 65
PART 5: Multiple Kernel Learning
SLIDE 66 Recap of SVMs
On a dataset D = {(x_i, y_i) | i = 1,...,m}, SVMs solve the following problem:
ω(K) = max_α ∑_i α_i − (1/2) α⊤YKYα    (1)
subject to 0 ≤ α_i ≤ C, ∑_i α_i y_i = 0    (2)
where K_{ij} = k(x_i, x_j) is the kernel function evaluated on examples x_i and x_j, and Y = diag(y_1,...,y_m).
The final classifier is y = sign(∑_i α_i y_i K(x, x_i) + b).
SLIDE 67
Recap of SVMs
Does not scale well.
The function ω(K) is a pointwise maximum of a set of functions of K and hence is convex.
If the maximization over α is not unique, then ω(K) is not differentiable; ω(K) may not be differentiable, but subgradients exist.
Let us relax the problem a little and say that µ_i ≥ 0.
SLIDE 68
Learning a linear combination of multiple kernels: Let {K_1,...,K_l} be a given library of kernels. Given a training set of m examples, each K_i = K_i⊤ ∈ IR^{m×m}.
MKL (Lanckriet et al. 2004):
min_K ω(K)  subject to  K = ∑_{i=1}^l µ_i K_i,  trace(K) = c,  K ⪰ 0
SLIDE 69 MKL is a Semi-definite Programming problem
min_z c⊤z  s.t.  F(z) = ∑_{i=1}^l z_i F_i ⪰ 0,  Bz = d
where z ∈ IR^l and F_i = F_i⊤ ∈ IR^{m×m}.
The constraint requires F(z) to be positive semidefinite. This is an instance of a convex optimization problem and can be solved by interior point methods.
SLIDE 70 MKL formulation
SDP formulation:
min_{µ,t,λ,ν≥0} t    (3)
s.t.  [ ∑_{i=1}^l µ_i Y K_i Y⊤   e + ν + λy ; (e + ν + λy)⊤   t ] ⪰ 0    (4)
      ∑_{i=1}^l µ_i K_i ⪰ 0    (5)
SLIDE 71 Reformulation of MKL
The SDP problem can be recast as a QCQP:
max_{α,t} α⊤e − ct    (6)
s.t.  α⊤ Y K_i Y α ≤ r_i t,  i = 1,...,l    (7)
      α⊤y = 0,  0 ≤ α ≤ C    (8)
where r_i = trace(K_i).
QCQPs are instances of SOCPs:
min_z c⊤z    (9)
s.t.  ‖A_i z + b_i‖₂ ≤ c_i⊤ z + d_i    (10)
where A_i ∈ IR^{n_i×l}, b_i ∈ IR^{n_i}, c_i, c, z ∈ IR^l, d_i ∈ IR.
SLIDE 72 Equivalence with Block L1 regularization
Bach et al. (2004) showed that the QCQP formulation is equivalent to
min_{w,b,ξ} (1/2) (∑_{i=1}^l d_i ‖w_i‖)² + C ∑_{i=1}^m ξ_i    (11)
s.t.  y_i (∑_j w_j⊤ φ_j(x_i) + b) ≥ 1 − ξ_i  ∀ i ∈ {1,...,m},  ξ_i ≥ 0    (12)
for a proper choice of d_i. The block L1 norm promotes sparsity, i.e. most of the µ_i = 0.
SLIDE 73 Efficient algorithms for MKL
A trick: Let γ ∈ IB_n = {γ ∈ IR^n | γ_i ≥ 0, ∑_{i=1}^n γ_i = 1}. For any a_i ∈ IR, i = 1,...,n,
(∑_{i=1}^n |a_i|)² ≤ ∑_{i=1}^n a_i²/γ_i
This implies that
(∑_{i=1}^n ‖w_i‖)² ≤ ∑_{i=1}^n ‖w_i‖²/γ_i
where γ lies in the probability simplex. This can be helpful in reformulating the L1 formulation.
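A tiny numeric check of the inequality above, including the fact that equality is attained at γ_i ∝ |a_i| (the random a_i are illustrative).

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.randn(5)
gamma = rng.rand(5); gamma /= gamma.sum()              # arbitrary point of the simplex

lhs = np.sum(np.abs(a)) ** 2
assert lhs <= np.sum(a ** 2 / gamma) + 1e-12           # (sum |a_i|)^2 <= sum a_i^2 / gamma_i

gamma_star = np.abs(a) / np.sum(np.abs(a))             # minimiser: gamma_i proportional to |a_i|
assert np.isclose(lhs, np.sum(a ** 2 / gamma_star))    # equality at the minimiser
```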
SLIDE 74 Solving MKL by reusing SVM solvers
The following problem is equivalent to the block L1 formulation (Rakotomamonjy et al. 2007). Let S_m = {α | 0 ≤ α_i ≤ C, α⊤y = 0} and IB = {µ | µ_i ≥ 0, ∑_{i=1}^l µ_i = 1}:
min_{µ∈IB} J(µ),  where  J(µ) = max_{α∈S_m} ∑_i α_i − (1/2) ∑_{i=1}^l µ_i α⊤ Y K_i Y α
A gradient descent algorithm iterates:
1) Solve the SVM problem with kernel K = ∑_{i=1}^l µ_i K_i.
2) Differentiate J w.r.t. µ and update µ.
See also Sonnenburg et al. 2006.
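A rough sketch of that alternating scheme, reusing scikit-learn's SVC as the inner solver. The step size, the number of iterations, and the crude projection onto the simplex are illustrative assumptions, not details from the cited papers.

```python
import numpy as np
from sklearn.svm import SVC

def simple_mkl(kernels, y, C=1.0, steps=20, lr=0.5):
    """Gradient scheme on mu over the simplex; inner SVM solved with the combined kernel."""
    l, m = len(kernels), len(y)
    mu = np.ones(l) / l
    for _ in range(steps):
        K = sum(m_i * K_i for m_i, K_i in zip(mu, kernels))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        # Recover the full vector of y_i * alpha_i from the sparse dual coefficients.
        dual = np.zeros(m)
        dual[svm.support_] = svm.dual_coef_[0]
        # dJ/dmu_i = -1/2 * alpha^T Y K_i Y alpha = -1/2 * dual^T K_i dual
        grad = np.array([-0.5 * dual @ K_i @ dual for K_i in kernels])
        mu = np.clip(mu - lr * grad, 0.0, None)
        mu = mu / mu.sum() if mu.sum() > 0 else np.ones(l) / l   # crude projection to the simplex
    return mu

# Usage (illustrative): two candidate Gaussian kernels on random data.
rng = np.random.RandomState(0)
X = rng.randn(60, 4); y = np.sign(X[:, 0] + 0.3 * rng.randn(60)).astype(int)
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
mu = simple_mkl([np.exp(-0.1 * sq), np.exp(-2.0 * sq)], y)
print("learned kernel weights:", mu)
```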
SLIDE 75
References
Kernel Methods in Computational Biology, Schölkopf et al., 2004
Kernel Methods for Pattern Analysis, John Shawe-Taylor and Nello Cristianini
Learning with Kernels, Schölkopf and Smola, 2002