slide-1
SLIDE 1

Kernel methods for Network Analysis: An introduction

Chiranjib Bhattacharyya

Machine Learning Lab, Dept of CSA, IISc. chiru@csa.iisc.ernet.in, http://drona.csa.iisc.ernet.in/~chiru

13th Jan, 2013

slide-2
SLIDE 2

Computational Biology: Which super-family does this protein structure belong to?

slide-3
SLIDE 3

Multimedia Who are the actors?

slide-4
SLIDE 4

Social Networks: How can one run a successful ad campaign on this network?

slide-5
SLIDE 5

Data Representation as a vector

slide-6
SLIDE 6

Data Representation as a vector

slide-7
SLIDE 7

Data Representation as a vector

slide-8
SLIDE 8

Data Representation as a vector

slide-9
SLIDE 9

Data Representation as a vector (feature map)

slide-10
SLIDE 10

When we have feature maps, we can apply linear classifiers and Principal Component Analysis.

slide-11
SLIDE 11

Similarity may be readily available

Problem: Feature maps are not readily available

slide-12
SLIDE 12

Kernel functions: a formal notion of similarity. Kernel functions are essentially similarity functions. One can easily generalize many existing algorithms using kernel functions; this is sometimes called the kernel trick. Kernels can also help integrate different sources of data.

slide-13
SLIDE 13

Agenda

1 Kernel Trick

SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?

2 Mathematical Foundations

RKHS, Representer theorem

3 Kernels on Graphs aka Networks

Kernels on vertices of a Graph Kernels on graphs

4 Advanced Topics: Multiple Kernel Learning

slide-14
SLIDE 14

1 Kernel Trick

SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?

2 Mathematical Foundations

RKHS, Representer theorem

3 Kernels on Graphs aka Networks

Kernels on vertices of a Graph Kernels on graphs

4 Advanced Topics: Multiple Kernel Learning

slide-15
SLIDE 15

PART 1: KERNEL TRICK

slide-16
SLIDE 16

1 Kernel Trick

SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?

2 Mathematical Foundations

RKHS, Representer theorem

3 Kernels on Graphs aka Networks

Kernels on vertices of a Graph Kernels on graphs

4 Advanced Topics: Multiple Kernel Learning

slide-17
SLIDE 17

The problem of classification

Given: Training data D = {(xi, yi) | i = 1,...,m}

  • observation xi ∈ X
  • class label yi ∈ {−1, 1}

Find: A classifier f : X → {−1, 1}, f(x) = sign(w⊤x + b)

slide-18
SLIDE 18

Regularized risk

min_{w,b}  C ∑_{i=1}^m max(1 − yi(w⊤xi + b), 0)   [Risk]   +   (1/2)‖w‖²   [Regularization]

slide-19
SLIDE 19

Regularized risk

min_{w,b}  C ∑_{i=1}^m max(1 − yi(w⊤xi + b), 0)   [Risk]   +   (1/2)‖w‖²   [Regularization]

The SVM formulation

min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξi
subject to  yi(w⊤xi + b) ≥ 1 − ξi,  ξi ≥ 0  ∀ i ∈ [m]

slide-20
SLIDE 20

SVM formulation

maximize_α  ∑_{i=1}^m αi − (1/2) ∑_{ij} αi αj yi yj xi⊤xj
subject to  0 ≤ αi,  ∑_{i=1}^m αi yi = 0

slide-21
SLIDE 21

SVM formulation

maximize_α  ∑_{i=1}^m αi − (1/2) ∑_{ij} αi αj yi yj xi⊤xj
subject to  0 ≤ αi,  ∑_{i=1}^m αi yi = 0

w = ∑_{i=1}^m αi yi xi

f(x) = sign(∑_{i=1}^m αi yi xi⊤x + b)

slide-22
SLIDE 22

C-SVM in feature spaces

Let us work with a feature map Φ(x).

maximize_α  ∑_{i=1}^m αi − (1/2) ∑_{ij} αi αj yi yj Φ(xi)⊤Φ(xj)
subject to  0 ≤ αi,  ∑_i αi yi = 0

f(x) = sign(∑_{i=1}^m αi yi Φ(xi)⊤Φ(x) + b)

The dot product between any pair of examples computed in the feature space is denoted by K(x,z) = Φ(x)⊤Φ(z).

slide-23
SLIDE 23

C-SVM in feature spaces

Let us work with a feature map Φ(x).

maximize_α  ∑_{i=1}^m αi − (1/2) ∑_{ij} αi αj yi yj K(xi, xj)
subject to  0 ≤ αi,  ∑_i αi yi = 0

f(x) = sign(∑_{i=1}^m αi yi K(xi, x) + b)

The dot product between any pair of examples computed in the feature space is denoted by K(x,z) = Φ(x)⊤Φ(z).
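A minimal sketch (not from the slides; it assumes Python with NumPy and scikit-learn, and uses an RBF kernel as an example) of training and applying an SVM given only kernel evaluations, i.e. a precomputed Gram matrix:

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(X, Z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

# toy data: two Gaussian blobs with labels in {-1, +1}
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y_train = np.array([-1] * 20 + [+1] * 20)
X_test = rng.normal(0, 1, (5, 2))

# the solver only ever sees K(xi, xj), never Phi(x) explicitly
K_train = rbf_kernel(X_train, X_train)
clf = SVC(C=1.0, kernel="precomputed").fit(K_train, y_train)

# prediction needs K(x, xi) between test and training points
K_test = rbf_kernel(X_test, X_train)
print(clf.predict(K_test))
```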

slide-24
SLIDE 24

1 Kernel Trick

SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?

2 Mathematical Foundations

RKHS, Representer theorem

3 Kernels on Graphs aka Networks

Kernels on vertices of a Graph Kernels on graphs

4 Advanced Topics: Multiple Kernel Learning

slide-25
SLIDE 25

Principal Component Analysis (PCA)

Principal Directions. Given X = [x1,...,xm], find the directions of maximum variance (Jolliffe 2002). The direction of maximum variance, v, is given by (1/m) XX⊤v = λv (assuming Xe = 0, i.e. the data are centered). Define v = Xα; then (1/m) XX⊤Xα = λXα, leading to the eigenvalue problem (1/m) Kα = λα, where (K)ij = (X⊤X)ij = xi⊤xj.

slide-26
SLIDE 26

Nonlinear component analysis (Scholkopf et al. 1996)

Compute PCA in feature spaces: replace xi⊤xj by Φ(xi)⊤Φ(xj).

Principal component of x: in input space, v⊤x; in feature space, ∑_{i=1}^m αi K(xi, x).
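A small sketch (not from the slides; it assumes Python with NumPy and includes the usual centering of the kernel matrix, which the slide does not show) of kernel PCA: eigendecompose K and project a point via ∑ αi K(xi, x):

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

K = rbf_kernel(X, X)
# center the kernel matrix (equivalent to centering Phi(x) in feature space)
m = K.shape[0]
J = np.eye(m) - np.ones((m, m)) / m
Kc = J @ K @ J

# eigenvectors of (1/m) K alpha = lambda alpha, sorted by decreasing eigenvalue
lam, alpha = np.linalg.eigh(Kc / m)
order = np.argsort(lam)[::-1]
lam, alpha = lam[order], alpha[:, order]

# projection of each training point onto the first nonlinear component:
# score_i = sum_j alpha_j K(x_j, x_i), up to the usual normalisation of alpha
scores = Kc @ alpha[:, 0]
print(scores[:5])
```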

slide-27
SLIDE 27

We just need the dot product

Let x ∈ IR² and Φ(x) = [x1², x2², √2 x1x2]⊤. Then

K(x,z) = Φ(x)⊤Φ(z) = x1²z1² + 2 x1x2 z1z2 + x2²z2² = (x⊤z)²

More generally, K(x,z) = (x⊤z)^r is a dot product in a C(d+r−1, r)-dimensional feature space for x, z ∈ IR^d. If d = 256 and r = 4, the feature space has C(259, 4) = 183,181,376 dimensions. However, if we know K one can still solve the SVM formulation without explicitly evaluating Φ.
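A quick numerical check (not in the original slides; plain Python/NumPy) that the explicit degree-2 feature map and the polynomial kernel agree:

```python
import numpy as np

def phi(x):
    # explicit feature map for the homogeneous degree-2 polynomial kernel in R^2
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.5, -0.7])
z = np.array([0.3, 2.0])

lhs = phi(x) @ phi(z)      # dot product in the 3-dimensional feature space
rhs = (x @ z) ** 2         # kernel evaluated directly in the input space
print(lhs, rhs)            # identical up to floating point error
assert np.isclose(lhs, rhs)
```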

slide-28
SLIDE 28

1 Kernel Trick

SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?

2 Mathematical Foundations

RKHS, Representer theorem

3 Kernels on Graphs aka Networks

Kernels on vertices of a Graph Kernels on graphs

4 Advanced Topics: Multiple Kernel Learning

slide-29
SLIDE 29

Norms, Distances

‖Φ(x)‖ = √(⟨Φ(x), Φ(x)⟩) = √(K(x,x))

Normalized features: Φ̂(x) = Φ(x)/‖Φ(x)‖, so K̂(x,z) = Φ̂(x)⊤Φ̂(z) = K(x,z)/√(K(x,x) K(z,z))

Distances: ‖Φ(x) − Φ(z)‖² = (Φ(x) − Φ(z))⊤(Φ(x) − Φ(z)) = K(x,x) + K(z,z) − 2K(x,z). If Φ is normalized, K(x,x) = 1 and ‖Φ(x) − Φ(z)‖² = 2 − 2K(x,z).
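A short sketch (not from the slides; Python/NumPy with an RBF kernel, which is already normalized since K(x,x) = 1) computing feature-space distances purely from kernel evaluations:

```python
import numpy as np

def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

# ||Phi(x) - Phi(z)||^2 = K(x,x) + K(z,z) - 2 K(x,z), without ever forming Phi
dist_sq = rbf(x, x) + rbf(z, z) - 2 * rbf(x, z)

# for a normalized kernel K(x,x) = 1, this simplifies to 2 - 2 K(x,z)
print(dist_sq, 2 - 2 * rbf(x, z))
```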

slide-30
SLIDE 30

In the sequel

We will formalize these notions, discuss the conditions on K, and construct K for graphs.

slide-31
SLIDE 31

1 Kernel Trick

SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?

2 Mathematical Foundations

RKHS, Representer theorem

3 Kernels on Graphs aka Networks

Kernels on vertices of a Graph Kernels on graphs

4 Advanced Topics: Multiple Kernel Learning

slide-32
SLIDE 32

Definition of Kernel functions

slide-33
SLIDE 33

Kernel function

A function K : X × X → IR is a kernel function if
  • K(x,z) = K(z,x) (symmetric)
  • K is positive semidefinite, i.e. ∀ n and x1,...,xn ∈ X, the matrix Kij = K(xi,xj) is psd.
Recall that a matrix K ∈ IR^{n×n} is psd if u⊤Ku ≥ 0 for all u ∈ IR^n.

slide-34
SLIDE 34

Examples of Kernel functions

K(x,z) = φ(x)⊤φ(z), where φ : X → IR^d, is a kernel function:
  • K is symmetric, i.e. K(x,z) = K(z,x).
  • Positive semidefinite: let D = {x1, x2,...,xn} be a set of n arbitrarily chosen elements of X and define Kij = φ(xi)⊤φ(xj). For any u ∈ IR^n it is straightforward to see that u⊤Ku = ‖∑_{i=1}^n ui φ(xi)‖² ≥ 0.

slide-35
SLIDE 35

Examples of Kernel functions

K(x,z) = x⊤z, with Φ(x) = x

K(x,z) = (x⊤z)^r, with Φ_{t1 t2 ... td}(x) = √(r!/(t1! t2! ... td!)) x1^{t1} x2^{t2} ... xd^{td}, where ∑_{i=1}^d ti = r

K(x,z) = e^{−γ‖x−z‖²}

slide-36
SLIDE 36

Kernel Construction

Let K1 and K2 be two valid kernels. Then the following are also valid kernels:
  • K(x,y) = φ(x)⊤φ(y)
  • K(u,v) = K1(u,v) K2(u,v)
  • K = αK1 + βK2, with α, β ≥ 0
  • K̂(x,y) = K(x,y) / √(K(x,x) K(y,y))
slide-37
SLIDE 37

Kernel Construction

Let K1 and K2 be two valid kernels. Then the following are also valid kernels:
  • K(x,y) = φ(x)⊤φ(y)
  • K(u,v) = K1(u,v) K2(u,v)
  • K = αK1 + βK2, with α, β ≥ 0
  • K̂(x,y) = K(x,y) / √(K(x,x) K(y,y))

Applying these rules:
  • K(x,y) = x⊤y is a kernel, hence so is K(x,y) = (x⊤y)^i
  • K(x,y) = lim_{N→∞} ∑_{i=0}^{N} (x⊤y)^i / i! = e^{x⊤y} is a kernel
  • normalizing it gives K̂(x,y) = e^{−(1/2)‖x−y‖²}
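A small numerical check (not in the original; Python/NumPy) that normalizing the kernel e^{x⊤y} as above does yield the Gaussian kernel e^{−(1/2)‖x−y‖²}:

```python
import numpy as np

x = np.array([0.4, -1.2, 2.0])
y = np.array([1.1, 0.3, -0.5])

k = lambda u, v: np.exp(u @ v)                 # valid kernel: limit of sums of (u.v)^i / i!
k_hat = k(x, y) / np.sqrt(k(x, x) * k(y, y))   # normalized kernel
gauss = np.exp(-0.5 * np.sum((x - y) ** 2))    # Gaussian kernel with gamma = 1/2

print(k_hat, gauss)        # the two values coincide
assert np.isclose(k_hat, gauss)
```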

slide-38
SLIDE 38

Kernel function and feature map

A theorem due to Mercer guarantees a feature map for symmetric, psd kernel functions. Loosely stated: for a symmetric kernel K : X × X → IR, there exists an expansion K(x,z) = Φ(x)⊤Φ(z) iff

∫_X ∫_X g(x) g(z) K(x,z) dx dz ≥ 0  for all g ∈ L2(X)
slide-39
SLIDE 39

What is a Dot product (aka Inner Product)?

Let X be a vector space. A dot product ⟨·,·⟩ on X satisfies:
  • Symmetry: ⟨u,v⟩ = ⟨v,u⟩ for u, v ∈ X
  • Bilinearity: ⟨αu + βv, w⟩ = α⟨u,w⟩ + β⟨v,w⟩ for u, v, w ∈ X
  • Positive definiteness: ⟨u,u⟩ ≥ 0 for u ∈ X, and ⟨u,u⟩ = 0 iff u = 0

Norm: ‖x‖ = √(⟨x,x⟩), and ‖x‖ = 0 ⟹ x = 0

slide-40
SLIDE 40

Examples of Dot products

X = IR^n, ⟨u,v⟩ = u⊤v

X = IR^n, ⟨u,v⟩ = ∑_{i=1}^n λi ui vi, with λi ≥ 0

X = L2(X) = {f : ∫_X f(x)² dx < ∞}, and for f, g ∈ X, ⟨f,g⟩ = ∫_X f(x) g(x) dx
slide-41
SLIDE 41

Cauchy-Schwarz inequality

Let X be an inner product space. Then |⟨x,z⟩| ≤ ‖x‖‖z‖ for all x, z ∈ X, and equality holds iff x = αz for some scalar α.

Proof: For all α ∈ IR, ‖x − αz‖² ≥ 0, i.e. ‖x‖² − 2α⟨x,z⟩ + α²‖z‖² ≥ 0. Taking α = ⟨x,z⟩/‖z‖², the inequality follows by taking square roots. The claim about equality follows from the definition of the norm.

slide-42
SLIDE 42

Hilbert Space: Basic facts. Definition: an inner product space (H, ⟨·,·⟩_H) is a Hilbert space if it is separable and complete. Denote the norm by ‖·‖_H.

slide-43
SLIDE 43

Projections in Hilbert space

The orthogonal complement of M ⊂ H is defined as M⊥ = {z | ⟨x,z⟩_H = 0 ∀ x ∈ M}.

Hilbert space Projection theorem. Let M be a (closed) subspace of a Hilbert space (H, ⟨·,·⟩_H). For every x ∈ H the following hold:
  • There exists a unique Π_M(x) ∈ M such that Π_M(x) = argmin_{z∈M} ‖x − z‖_H
  • x − Π_M(x) ∈ M⊥, i.e. ⟨z, x − Π_M(x)⟩_H = 0 ∀ z ∈ M
  • ‖x‖²_H = ‖Π_M(x)‖²_H + ‖y‖²_H, where x = Π_M(x) + y and y ∈ M⊥

slide-44
SLIDE 44

1 Kernel Trick

SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?

2 Mathematical Foundations

RKHS, Representer theorem

3 Kernels on Graphs aka Networks

Kernels on vertices of a Graph Kernels on graphs

4 Advanced Topics: Multiple Kernel Learning

slide-45
SLIDE 45

Reproducing kernel Hilbert Space (RKHS)

Let K be any kernel function. Consider the set

H = { f | f(·) = ∑_{i=1}^m αi K(·, xi), xi ∈ X, m ∈ N }

Reproducing property: for all f ∈ H,

f(x) = ∑_{i=1}^m αi K(x, xi) = ⟨ ∑_{i=1}^m αi K(·, xi), K(·, x) ⟩ = ⟨ f(·), K(·, x) ⟩

slide-46
SLIDE 46

Dot product in RKHS

Dot product: for all f, g ∈ H with f(·) = ∑_{i=1}^{m1} αi K(·, xi) and g(·) = ∑_{j=1}^{m2} βj K(·, zj), define

⟨f, g⟩_H = ∑_{i=1}^{m1} ∑_{j=1}^{m2} αi βj K(xi, zj)

As K is symmetric, ⟨f, g⟩_H = ⟨g, f⟩_H.

⟨f, f⟩_H = ∑_{i=1}^{m} ∑_{j=1}^{m} αi αj K(xi, xj). Recall that the Gram matrix of a kernel function is psd, so ⟨f, f⟩_H ≥ 0.

The Cauchy-Schwarz inequality holds, so |f(x)| = |⟨f, K(·,x)⟩_H| ≤ √(⟨f,f⟩_H) √(K(x,x)).
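A small numerical sketch (not from the slides; Python/NumPy with an RBF kernel) of these facts: f(x) = ∑ αi K(x, xi) is evaluated through the reproducing property, ⟨f,f⟩_H = α⊤Kα ≥ 0, and |f(x)| ≤ √(⟨f,f⟩_H) √(K(x,x)):

```python
import numpy as np

def k(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

rng = np.random.default_rng(2)
xs = rng.normal(size=(6, 2))      # the points x_1, ..., x_m defining f
alpha = rng.normal(size=6)        # the coefficients alpha_i
x = np.array([0.2, -0.4])         # an arbitrary evaluation point

K = np.array([[k(xi, xj) for xj in xs] for xi in xs])

f_x = sum(a * k(x, xi) for a, xi in zip(alpha, xs))   # f(x) = <f, K(., x)>_H
norm_f_sq = alpha @ K @ alpha                          # <f, f>_H = alpha^T K alpha >= 0

print(f_x, norm_f_sq)
assert norm_f_sq >= -1e-10
assert abs(f_x) <= np.sqrt(norm_f_sq) * np.sqrt(k(x, x)) + 1e-10
```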
slide-47
SLIDE 47

Representer theorem

Representer theorem. Let K be a valid kernel defined on X and let H be the corresponding RKHS. Let Ω be an increasing function. The optimization problem

min_{g∈H} G(g) = ∑_{i=1}^m l(g(xi), yi) + Ω(‖g‖²_H)

is solved by some g* = ∑_{i=1}^m αi K(·, xi).

slide-48
SLIDE 48

Representer theorem

Representer theorem. Let K be a valid kernel defined on X and let H be the corresponding RKHS. Let Ω be an increasing function. The optimization problem

min_{g∈H} G(g) = ∑_{i=1}^m l(g(xi), yi) + Ω(‖g‖²_H)

is solved by some g* = ∑_{i=1}^m αi K(·, xi).

Proof: Let M = { ∑_{i=1}^m αi K(·, xi) }, the span of K(·, x1),...,K(·, xm). Clearly M is a subspace of H. Take any g ∈ H and write g = g_M + g_per with g_M ∈ M and g_per ∈ M⊥. Then

g(xi) = ⟨g, K(·, xi)⟩ = ⟨g_M + g_per, K(·, xi)⟩ = ⟨g_M, K(·, xi)⟩ + ⟨g_per, K(·, xi)⟩ = ⟨g_M, K(·, xi)⟩ = g_M(xi),

so the loss terms are unchanged, and as Ω is an increasing function, Ω(‖g‖²_H) ≥ Ω(‖g_M‖²_H).

slide-49
SLIDE 49

Back to C-SVM formulation

Given a kernel function K defined on X, one can create the RKHS

H = { ∑_{i=1}^n βi K(·, zi) | zi ∈ X, n ∈ N }

Classifier: f(x) = sign(g(x) + b)

min_{g∈H, b∈IR}  ∑_{i=1}^m max(0, 1 − yi(g(xi) + b))   [l(g(xi), yi)]   + ‖g‖²_H

At optimality, g(·) = ∑_{i=1}^m γi K(·, xi) (Representer theorem).

slide-50
SLIDE 50

1 Kernel Trick

SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?

2 Mathematical Foundations

RKHS, Representer theorem

3 Kernels on Graphs aka Networks

Kernels on vertices of a Graph Kernels on graphs

4 Advanced Topics: Multiple Kernel Learning

slide-51
SLIDE 51

Many Applications

Graphs are ideal to model:
  • molecules
  • protein-protein interaction networks
  • metabolic networks
  • social networks

slide-52
SLIDE 52

Graph Kernels

Kernels on vertices of a graph G = (V,E): compute K(vi, vj), where vi, vj ∈ V.

Kernels on graphs: compute K(G1, G2), where G1, G2 are two graphs.

slide-53
SLIDE 53

1 Kernel Trick

SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?

2 Mathematical Foundations

RKHS, Representer theorem

3 Kernels on Graphs aka Networks

Kernels on vertices of a Graph Kernels on graphs

4 Advanced Topics: Multiple Kernel Learning

slide-54
SLIDE 54

Diffusion Kernels (Kondor and Lafferty 2002)

Let X = {1,...,m} and let there be some associated edges between them. Let A be the adjacency matrix of the resulting graph.

Diffusion Kernel: K = lim_{s→∞} (I + (β/s) H)^s, where H = A − D and D is diagonal with dii = ∑_j aij.

K is symmetric and positive definite. Computation is O(m³).

slide-55
SLIDE 55

Diffusion Kernels (Kondor and Lafferty 2002)

Let X = {1,...,m} and let there be some associated edges between them. Let A be the adjacency matrix of the resulting graph.

Diffusion Kernel: K = lim_{s→∞} (I + (β/s) H)^s, where H = A − D and D is diagonal with dii = ∑_j aij.

K is symmetric and positive definite. Computation is O(m³).

Since lim_{s→∞} (1 + (β/s) x)^s = e^{βx}, we get K = e^{βH} = ∑_{i=1}^m e^{βλi} vi vi⊤, where (λi, vi) are the eigenvalue/eigenvector pairs of H.
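A minimal sketch (not from the slides; Python with NumPy/SciPy on a small made-up graph) of computing a diffusion kernel both via the matrix exponential and via the eigendecomposition of H:

```python
import numpy as np
from scipy.linalg import expm

# adjacency matrix of a small undirected graph (4 vertices)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))      # degree matrix, d_ii = sum_j a_ij
H = A - D                       # the matrix H = A - D used on the slide
beta = 0.7

K = expm(beta * H)              # diffusion kernel K = e^{beta H}, O(m^3)

# same thing through the eigendecomposition: K = sum_i e^{beta lambda_i} v_i v_i^T
lam, V = np.linalg.eigh(H)
K_eig = (V * np.exp(beta * lam)) @ V.T

print(np.allclose(K, K_eig))    # True: the two constructions agree
```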

slide-56
SLIDE 56

Diffusion Kernels (Kondor and Lafferty 2002)

For special graphs the kernel can sometimes be computed in closed form, e.g. for the complete graph on m vertices:

K(i,j) = (1 + (m−1) e^{−mβ}) / m   if i = j
K(i,j) = (1 − e^{−mβ}) / m          if i ≠ j

The diffusion kernel has a very interesting analogue with the diffusion equation in physics.
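A quick check (not in the original; Python with NumPy/SciPy) that the closed form above matches e^{βH} for a complete graph:

```python
import numpy as np
from scipy.linalg import expm

m, beta = 5, 0.3
A = np.ones((m, m)) - np.eye(m)          # complete graph on m vertices
H = A - np.diag(A.sum(axis=1))           # H = A - D

K = expm(beta * H)

diag = (1 + (m - 1) * np.exp(-m * beta)) / m   # closed form, i = j
off = (1 - np.exp(-m * beta)) / m              # closed form, i != j

print(np.allclose(np.diag(K), diag), np.isclose(K[0, 1], off))   # True True
```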

slide-57
SLIDE 57

1 Kernel Trick

SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?

2 Mathematical Foundations

RKHS, Representer theorem

3 Kernels on Graphs aka Networks

Kernels on vertices of a Graph Kernels on graphs

4 Advanced Topics: Multiple Kernel Learning

slide-58
SLIDE 58

Kernels on graphs

Graph Isomorphism: find a mapping g from the vertices of G1 = (V1, E1) to the vertices of G2 = (V2, E2) such that G1 and G2 are identical; if (u,v) ∈ E1 iff (g(u), g(v)) ∈ E2, then g is an isomorphism.

Subgraph Isomorphism: is there a subgraph S of G1 and a subgraph T of G2 such that S and T are isomorphic? This is NP-hard, so we need computationally efficient approximations.

slide-59
SLIDE 59

Desiderata for a kernel function

  • Computationally efficient
  • Positive definite
  • Can relate graph structures
  • Applicable to a wide variety of graphs

slide-60
SLIDE 60

Some Definitions

Let A be an m×n matrix and B a p×q matrix. The Kronecker product A ⊗ B is the mp×nq block matrix

A ⊗ B = [ a11 B  ···  a1n B
            ⋮     ⋱     ⋮
          am1 B  ···  amn B ]

slide-61
SLIDE 61

Definitions: Product graph

Let G1 = (V1, E1) and G2 = (V2, E2) be two graphs. G = (V, E) is the product graph of G1 and G2 if V = V1 × V2 and ((i,i′),(j,j′)) ∈ E iff (i,j) ∈ E1 and (i′,j′) ∈ E2. Its adjacency matrix is A(G) = A(G1) ⊗ A(G2).
slide-62
SLIDE 62

Random walk kernel between two graphs (Vishwanathan et al. 2010)

Random walk kernel: K(G1, G2) = ∑_{i,j=1}^{|V|} ∑_{t=0}^{∞} λ^t [A^t]_{ij} = e⊤(I − λA)^{−1} e, where |V| is the number of vertices of the product graph of G1 and G2, A = A(G1) ⊗ A(G2) is its adjacency matrix, and e is the all-ones vector.

It counts the number of paths obtained by simultaneous random walks on G1 and G2. The computational complexity is O(n^6), where n = |V(G1)| = |V(G2)|.
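A brute-force sketch (not from the slides; Python/NumPy, small graphs only, with λ chosen small enough that the geometric series converges and I − λA is invertible) of the random walk kernel via the Kronecker product:

```python
import numpy as np

def random_walk_kernel(A1, A2, lam=0.05):
    """K(G1, G2) = e^T (I - lam * A_x)^{-1} e on the product graph, A_x = A1 kron A2."""
    Ax = np.kron(A1, A2)                       # adjacency matrix of the product graph
    n = Ax.shape[0]
    e = np.ones(n)
    return e @ np.linalg.solve(np.eye(n) - lam * Ax, e)

# two small graphs: a triangle and a path on 3 vertices
A_tri = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
A_path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

print(random_walk_kernel(A_tri, A_tri))
print(random_walk_kernel(A_tri, A_path))
```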

slide-63
SLIDE 63

Can we compute it more efficiently? (Vishwanathan et al. 2010)

Sylvester equation: given S, T and M0, one can solve for M in M = S M T⊤ + M0 in O(n³) time.
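A naive sketch (not from the slides; Python/NumPy) of solving M = S M T⊤ + M0 by fixed-point iteration, which converges when ‖S‖‖T‖ < 1, as happens for suitably scaled matrices; dedicated Sylvester-type solvers reach the O(n³) cost cited on the slide directly:

```python
import numpy as np

def solve_stein(S, T, M0, iters=200):
    """Iterate M <- S M T^T + M0; each step costs O(n^3)."""
    M = M0.copy()
    for _ in range(iters):
        M = S @ M @ T.T + M0
    return M

rng = np.random.default_rng(3)
n = 4
S = 0.1 * rng.normal(size=(n, n))   # scaled down so the iteration converges
T = 0.1 * rng.normal(size=(n, n))
M0 = rng.normal(size=(n, n))

M = solve_stein(S, T, M0)
print(np.allclose(M, S @ M @ T.T + M0))   # True: M satisfies the equation
```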

slide-64
SLIDE 64

1 Kernel Trick

SVMs and Non-linear Classification Principal Component Analysis What can we compute with the dot product in feature spaces?

2 Mathematical Foundations

RKHS, Representer theorem

3 Kernels on Graphs aka Networks

Kernels on vertices of a Graph Kernels on graphs

4 Advanced Topics: Multiple Kernel Learning

slide-65
SLIDE 65

PART 5: Multiple Kernel Learning

slide-66
SLIDE 66

Recap of SVMs

On a dataset D = {(xi, yi) | i = 1,...,m}, SVMs solve the following problem:

ω(K) = max_α  ∑_i αi − (1/2) α⊤ Y K Y α    (1)
subject to  0 ≤ αi ≤ C,  ∑_i αi yi = 0    (2)

where Kij = k(xi, xj) is the kernel function evaluated on examples xi and xj, and Y = diag(y). The final classifier is y = sign(∑_i αi yi K(x, xi) + b).

slide-67
SLIDE 67

Recap of SVMs

  • Does not scale well.
  • The function ω(K) is a pointwise maximum of a set of affine functions of K and hence is convex.
  • If the maximizing α is not unique then ω(K) is not differentiable.
  • ω(K) may not be differentiable, but subgradients exist.
  • Let us relax the problem a little and say that µi ≥ 0.

slide-68
SLIDE 68

Learning a linear combination of Multiple Kernels

Let {K1,...,Kl} be a given library of kernels. Given a training set of m examples, each Ki = Ki⊤ ∈ IR^{m×m}.

MKL (Lanckriet et al. 2004):

min_K  ω(K)
subject to  K = ∑_{i=1}^l µi Ki,  trace(K) = c,  K ⪰ 0

slide-69
SLIDE 69

MKL is a Semi-definite Programming problem

A semidefinite program (SDP) has the form

min_z  c⊤z   subject to   F(z) = ∑_{i=1}^l zi Fi ⪰ 0,   Bz = d

where z ∈ IR^l and Fi = Fi⊤ ∈ IR^{m×m}; the constraint requires F(z) to be positive semidefinite. This is an instance of a convex optimization problem and can be solved by interior point methods.

slide-70
SLIDE 70

MKL formulation

SDP formulation:

min_{µ, t, λ, ν≥0}  t    (3)

subject to

[ ∑_{i=1}^l µi Y Ki Y⊤      e + ν + λy ]
[ (e + ν + λy)⊤                 t       ]  ⪰ 0    (4)

∑_{i=1}^l µi Ki ⪰ 0    (5)

slide-71
SLIDE 71

Reformulation of MKL

The SDP problem can be recast as a QCQP:

max_{α, t}  α⊤e − c t    (6)
s.t.  α⊤ Y Ki Y α ≤ ri t,  i = 1,...,l    (7)
      α⊤y = 0,  0 ≤ α ≤ C    (8)

where ri = trace(Ki).

QCQPs are instances of SOCPs:

min_z  c⊤z    (9)
s.t.  ‖Ai z + bi‖₂ ≤ ci⊤z + di    (10)

where Ai ∈ IR^{ni×l}, bi ∈ IR^{ni}, ci, c, z ∈ IR^l, di ∈ IR.

slide-72
SLIDE 72

Equivalence with Block L1 regularization

Bach et al. (2004) showed that the QCQP formulation is equivalent to

min_{w,b,ξ}  (1/2) ( ∑_{i=1}^l di ‖wi‖ )²   [Block L1]   + C ∑_{i=1}^m ξi    (11)
s.t.  yi ( ∑_j wj⊤ φj(xi) + b ) ≥ 1 − ξi  ∀ i ∈ {1,...,m},  ξi ≥ 0    (12)

for a proper choice of di. The block L1 norm promotes sparsity, i.e. most of the µi = 0.

slide-73
SLIDE 73

Efficient algorithms for MKL

A trick: let γ ∈ IB_n = {γ ∈ IR^n | γi ≥ 0, ∑_{i=1}^n γi = 1}. For any ai ∈ IR, i = 1,...,n,

( ∑_{i=1}^n |ai| )²  ≤  ∑_{i=1}^n ai² / γi

This implies that

( ∑_{i=1}^n ‖wi‖ )²  ≤  ∑_{i=1}^n ‖wi‖² / γi

where γ lies in the probability simplex. This can be helpful in reformulating the block L1 formulation.
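A tiny numerical check (not in the original; Python/NumPy) of the inequality, including the fact that choosing γi proportional to |ai| attains equality:

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(size=6)

gamma = rng.random(6)
gamma /= gamma.sum()                          # arbitrary point in the simplex
lhs = np.sum(np.abs(a)) ** 2
print(lhs <= np.sum(a**2 / gamma))            # True for any gamma in the simplex

gamma_star = np.abs(a) / np.abs(a).sum()      # gamma_i proportional to |a_i|
print(np.isclose(lhs, np.sum(a**2 / gamma_star)))   # equality is attained
```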

slide-74
SLIDE 74

Solving MKL by reusing SVM solvers

The following problem is equivalent to the block L1 formulation (Rakotomamonjy et al. 2007). Let S_m = {α | 0 ≤ αi ≤ C, α⊤y = 0} and IB = {µ | µi ≥ 0, ∑_{i=1}^l µi = 1}.

min_{µ∈IB} J(µ),  where  J(µ) = max_{α∈S_m}  α⊤e − (1/2) ∑_{i=1}^l µi α⊤ Y Ki Y α    (13)

A gradient descent algorithm, per iteration:
1. Solve the SVM problem with kernel K = ∑_{i=1}^l µi Ki.
2. Differentiate J with respect to µ and update µ.

See also Sonnenburg et al. 2006.
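A rough sketch (not from the slides; Python with NumPy and scikit-learn, toy data, and a simple clip-and-renormalize update standing in for a proper reduced-gradient step on the simplex) of this scheme: repeatedly solve an SVM with the combined kernel, then move µ using the gradient ∂J/∂µk = −(1/2) α⊤ Y Kk Y α:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(+1, 1, (30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

def rbf(X, gamma):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq)

kernels = [rbf(X, g) for g in (0.1, 1.0, 10.0)]   # the kernel library K_1, ..., K_l
mu = np.full(len(kernels), 1.0 / len(kernels))    # start at the centre of the simplex
step = 0.05

for _ in range(20):
    K = sum(m * Ki for m, Ki in zip(mu, kernels))
    clf = SVC(C=1.0, kernel="precomputed").fit(K, y)

    # dual_coef_ holds alpha_i * y_i for the support vectors, i.e. the vector Y alpha
    sv, d = clf.support_, clf.dual_coef_[0]
    grad = np.array([-0.5 * d @ Ki[np.ix_(sv, sv)] @ d for Ki in kernels])

    mu = np.clip(mu - step * grad, 0.0, None)     # gradient step, then crude projection
    mu = mu / mu.sum()                            # back onto the simplex

print(mu)   # learned kernel weights on the simplex
```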

slide-75
SLIDE 75

References

Kernel Methods in Computational Biology, Scholkopf et al., 2004.
Kernel Methods for Pattern Analysis, John Shawe-Taylor and N. Cristianini.
Learning with Kernels, Scholkopf and Smola, 2002.