Machine learning for computational biology - Jean-Philippe Vert - PowerPoint PPT Presentation


slide-1
SLIDE 1

Machine learning for computational biology

Jean-Philippe Vert (Jean-Philippe.Vert@mines.org)

slide-2
SLIDE 2

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Learning molecular classifiers with network information

5

Kernels for strings and graphs

6

Data integration with kernels

7

Conclusion

slide-3
SLIDE 3

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Learning molecular classifiers with network information

5

Kernels for strings and graphs

6

Data integration with kernels

7

Conclusion

slide-4
SLIDE 4

What’s in your body

1 body = $10^{14}$ human cells (and 100x more non-human cells); 1 cell = $6 \times 10^9$ ACGT coding for 20,000 genes

slide-5
SLIDE 5

Sequencing revolution

slide-6
SLIDE 6

A cancer cell

slide-7
SLIDE 7

A cancer cell

slide-8
SLIDE 8

A cancer cell

slide-9
SLIDE 9

Opportunities

What is your risk of developing a cancer? (prevention)
After diagnosis and treatment, what is the risk of relapse? (prognosis)
What specific treatment will cure your cancer? (personalized medicine)

slide-10
SLIDE 10

Cancer diagnosis

Problem 1

Given the expression levels of 20k genes in a leukemia, is it an acute lymphocytic or myeloid leukemia (ALL or AML)?

slide-11
SLIDE 11

Cancer prognosis

Problem 2

Given the expression levels of 20k genes in a tumour after surgery, is it likely to relapse later?

slide-12
SLIDE 12

Pharmacogenomics / Toxicogenomics

Problem 3

Given the genome of a person, which drug should we give?

slide-13
SLIDE 13

Protein annotation

Data available

Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..
...

Problem 4

Given a newly sequenced protein, is it secreted or not?

slide-14
SLIDE 14

Drug discovery

[Figure: example molecules labeled active / inactive.]

Problem 5

Given a new candidate molecule, is it likely to be active?

slide-15
SLIDE 15

A common topic

slide-16
SLIDE 16

A common topic

slide-17
SLIDE 17

A common topic

slide-18
SLIDE 18

A common topic

slide-19
SLIDE 19

On real data...

slide-20
SLIDE 20

Pattern recognition, aka supervised classification

Challenges

High dimension
Few samples
Structured data
Heterogeneous data
Prior knowledge
Fast and scalable implementations
Interpretable models

slide-21
SLIDE 21

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Learning molecular classifiers with network information

5

Kernels for strings and graphs

6

Data integration with kernels

7

Conclusion

slide-22
SLIDE 22

Linear classifier

slide-23
SLIDE 23

Linear classifier

slide-24
SLIDE 24

Linear classifier

slide-25
SLIDE 25

Linear classifier

slide-26
SLIDE 26

Linear classifier

slide-27
SLIDE 27

Linear classifier

slide-28
SLIDE 28

Linear classifier

slide-29
SLIDE 29

Linear classifier

slide-30
SLIDE 30

Which one is better?

slide-31
SLIDE 31

The margin of a linear classifier

slide-32
SLIDE 32

The margin of a linear classifier

slide-33
SLIDE 33

The margin of a linear classifier

slide-34
SLIDE 34

The margin of a linear classifier

slide-35
SLIDE 35

The margin of a linear classifier

slide-36
SLIDE 36

Largest margin classifier (hard-margin SVM)

slide-37
SLIDE 37

Support vectors

slide-38
SLIDE 38

More formally

The training set is a finite set of n data/class pairs:
$$\mathcal{S} = \left\{ (x_1, y_1), \ldots, (x_n, y_n) \right\},$$
where $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$. We assume (for the moment) that the data are linearly separable, i.e., that there exists $(w, b) \in \mathbb{R}^p \times \mathbb{R}$ such that:
$$w \cdot x_i + b > 0 \text{ if } y_i = 1, \qquad w \cdot x_i + b < 0 \text{ if } y_i = -1.$$

slide-39
SLIDE 39

How to find the largest separating hyperplane?

For a given linear classifier $f(x) = w \cdot x + b$, consider the "tube" defined by the values $-1$ and $+1$ of the decision function:

[Figure: the hyperplanes $w \cdot x + b = +1$, $w \cdot x + b = 0$ and $w \cdot x + b = -1$, with marked points $x_1$ and $x_2$ and the regions $w \cdot x + b > +1$ and $w \cdot x + b < -1$.]

slide-40
SLIDE 40

The margin is $2/\|w\|$

Indeed, the points $x_1$ and $x_2$ satisfy:
$$w \cdot x_1 + b = 0, \qquad w \cdot x_2 + b = 1.$$
By subtracting we get $w \cdot (x_2 - x_1) = 1$, and therefore:
$$\gamma = 2\,\|x_2 - x_1\| = \frac{2}{\|w\|}.$$

slide-41
SLIDE 41

All training points should be on the correct side of the dotted line

For positive examples ($y_i = 1$) this means:
$$w \cdot x_i + b \geq 1.$$
For negative examples ($y_i = -1$) this means:
$$w \cdot x_i + b \leq -1.$$
Both cases are summarized by:
$$\forall i = 1, \ldots, n, \quad y_i \left( w \cdot x_i + b \right) \geq 1.$$
slide-42
SLIDE 42

Finding the optimal hyperplane

Find $(w, b)$ which minimize
$$\|w\|^2$$
under the constraints:
$$\forall i = 1, \ldots, n, \quad y_i \left( w \cdot x_i + b \right) - 1 \geq 0.$$

This is a classical quadratic program on $\mathbb{R}^{p+1}$.

slide-43
SLIDE 43

Lagrangian

In order to minimize:
$$\frac{1}{2}\|w\|^2$$
under the constraints:
$$\forall i = 1, \ldots, n, \quad y_i \left( w \cdot x_i + b \right) - 1 \geq 0,$$
we introduce one dual variable $\alpha_i$ for each constraint, i.e., for each training point. The Lagrangian is:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i \left( w \cdot x_i + b \right) - 1 \right).$$
slide-44
SLIDE 44

Lagrangian

$L(w, b, \alpha)$ is convex quadratic in $w$. It is minimized for:
$$\nabla_w L = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0 \;\Longrightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i.$$
$L(w, b, \alpha)$ is affine in $b$. Its minimum is $-\infty$ except if:
$$\nabla_b L = \sum_{i=1}^{n} \alpha_i y_i = 0.$$

slide-45
SLIDE 45

Dual function

We therefore obtain the Lagrange dual function:
$$q(\alpha) = \inf_{w \in \mathbb{R}^p, \, b \in \mathbb{R}} L(w, b, \alpha) = \begin{cases} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j & \text{if } \sum_{i=1}^{n} \alpha_i y_i = 0, \\ -\infty & \text{otherwise.} \end{cases}$$
The dual problem is: maximize $q(\alpha)$ subject to $\alpha \geq 0$.
slide-46
SLIDE 46

Dual problem

Find $\alpha^* \in \mathbb{R}^n$ which maximizes
$$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j,$$
under the (simple) constraints $\alpha_i \geq 0$ (for $i = 1, \ldots, n$) and
$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$
This is a quadratic program on $\mathbb{R}^n$, with "box constraints". $\alpha^*$ can be found efficiently using dedicated optimization software.
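The dual above is a standard quadratic program, so a generic QP solver can be used. As a rough illustration (not from the original slides), here is a sketch in R with the quadprog package for the hard-margin dual; the data X (an n x p matrix) and labels y (in {-1,+1}) are assumed to be linearly separable, and the small ridge added to the quadratic term is an ad hoc numerical safeguard.

library(quadprog)

svm_hard_margin_dual <- function(X, y) {
  n <- nrow(X)
  Q <- (y %*% t(y)) * (X %*% t(X))      # Q_ij = y_i y_j x_i . x_j
  Dmat <- Q + diag(1e-8, n)             # tiny ridge so the QP solver sees a positive definite matrix
  dvec <- rep(1, n)                     # maximize sum(alpha) - 1/2 alpha' Q alpha
  Amat <- cbind(y, diag(n))             # constraints: y' alpha = 0 (equality), alpha_i >= 0
  bvec <- rep(0, n + 1)
  sol <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
  alpha <- sol$solution
  w <- colSums(alpha * y * X)           # w* = sum_i alpha_i y_i x_i
  sv <- which(alpha > 1e-5)             # support vectors have alpha_i > 0
  b <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)
  list(alpha = alpha, w = w, b = b)
}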

slide-47
SLIDE 47

Recovering the optimal hyperplane

Once $\alpha^*$ is found, we recover $(w^*, b^*)$ corresponding to the optimal hyperplane. $w^*$ is given by:
$$w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i,$$
and the decision function is therefore:
$$f^*(x) = w^* \cdot x + b^* = \sum_{i=1}^{n} \alpha_i^* y_i \, x_i \cdot x + b^*. \qquad (1)$$

slide-48
SLIDE 48

Interpretation: support vectors ($\alpha > 0$); all other points have $\alpha = 0$

slide-49
SLIDE 49

What if data are not linearly separable?

slide-50
SLIDE 50

What if data are not linearly separable?

slide-51
SLIDE 51

What if data are not linearly separable?

slide-52
SLIDE 52

What if data are not linearly separable?

slide-53
SLIDE 53

Soft-margin SVM

Find a trade-off between large margin and few errors. Mathematically:
$$\min_f \left\{ \frac{1}{\text{margin}(f)} + C \times \text{errors}(f) \right\}$$
$C$ is a parameter.
slide-54
SLIDE 54

Soft-margin SVM formulation

The margin of a labeled point $(x, y)$ is
$$\text{margin}(x, y) = y \left( w \cdot x + b \right).$$
The error is $0$ if $\text{margin}(x, y) > 1$, and $1 - \text{margin}(x, y)$ otherwise.

The soft-margin SVM solves:
$$\min_{w, b} \; \|w\|^2 + C \sum_{i=1}^{n} \max\left( 0, 1 - y_i \left( w \cdot x_i + b \right) \right).$$

slide-55
SLIDE 55

Soft-margin SVM and hinge loss

$$\min_{w, b} \; \sum_{i=1}^{n} \ell_{\text{hinge}}\left( w \cdot x_i + b, y_i \right) + \lambda \|w\|_2^2,$$
for $\lambda = 1/C$ and the hinge loss function:
$$\ell_{\text{hinge}}(u, y) = \max(1 - yu, 0) = \begin{cases} 0 & \text{if } yu \geq 1, \\ 1 - yu & \text{otherwise.} \end{cases}$$

[Figure: the hinge loss $\ell(f(x), y)$ as a function of $y f(x)$, with a kink at $1$.]
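As a small illustration (not part of the slides), the hinge-loss form of the objective is straightforward to write down in R; X is assumed to be an n x p matrix, y a vector in {-1,+1}, w a length-p weight vector and b a scalar offset.

hinge <- function(u, y) pmax(1 - y * u, 0)          # l_hinge(u, y) = max(1 - yu, 0)

softmargin_objective <- function(w, b, X, y, lambda) {
  f <- as.vector(X %*% w + b)                       # decision values f(x_i) = w . x_i + b
  sum(hinge(f, y)) + lambda * sum(w^2)              # empirical hinge risk + ridge penalty
}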

slide-56
SLIDE 56

Dual formulation of soft-margin SVM (exercise)

Maximize
$$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j,$$
under the constraints:
$$0 \leq \alpha_i \leq C \text{ for } i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

slide-57
SLIDE 57

Interpretation: bounded and unbounded support vectors

[Figure: unbounded support vectors ($0 < \alpha < C$), bounded support vectors ($\alpha = C$), and other points ($\alpha = 0$).]

slide-58
SLIDE 58

Primal (for large n) vs dual (for large p) optimization

1. Find $(w, b) \in \mathbb{R}^{p+1}$ which solve:
$$\min_{w, b} \; \sum_{i=1}^{n} \ell_{\text{hinge}}\left( w \cdot x_i + b, y_i \right) + \lambda \|w\|_2^2.$$

2. Find $\alpha^* \in \mathbb{R}^n$ which maximizes
$$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j,$$
under the constraints:
$$0 \leq \alpha_i \leq C \text{ for } i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

slide-59
SLIDE 59

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Learning molecular classifiers with network information

5

Kernels for strings and graphs

6

Data integration with kernels

7

Conclusion

slide-60
SLIDE 60

Sometimes linear methods are not interesting

slide-61
SLIDE 61

Solution: nonlinear mapping to a feature space

[Figure: points inside and outside a circle of radius $R$ in $\mathbb{R}^2$ (coordinates $x_1, x_2$), mapped to the plane of coordinates $(x_1^2, x_2^2)$ where they become linearly separable.]

For $x = \binom{x_1}{x_2}$, let $\Phi(x) = \binom{x_1^2}{x_2^2}$. The decision function is:
$$f(x) = x_1^2 + x_2^2 - R^2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}^\top \begin{pmatrix} x_1^2 \\ x_2^2 \end{pmatrix} - R^2 = \beta^\top \Phi(x) + b.$$
slide-62
SLIDE 62

Kernel = inner product in the feature space

Definition

For a given mapping $\Phi : \mathcal{X} \to \mathcal{H}$ from the space of objects $\mathcal{X}$ to some Hilbert space of features $\mathcal{H}$, the kernel between two objects $x$ and $x'$ is the inner product of their images in the feature space:
$$\forall x, x' \in \mathcal{X}, \quad K(x, x') = \Phi(x)^\top \Phi(x').$$

slide-63
SLIDE 63

Example

Let $\mathcal{X} = \mathcal{H} = \mathbb{R}^2$ and for $x = \binom{x_1}{x_2}$, let $\Phi(x) = \binom{x_1^2}{x_2^2}$. Then
$$K(x, x') = \Phi(x)^\top \Phi(x') = (x_1)^2 (x'_1)^2 + (x_2)^2 (x'_2)^2.$$

slide-64
SLIDE 64

The kernel tricks


2 tricks

1

Many linear algorithms (in particular linear SVM) can be performed in the feature space of Φ(x) without explicitly computing the images Φ(x), but instead by computing kernels K(x, x′).

2

It is sometimes possible to easily compute kernels which correspond to complex large-dimensional feature spaces: K(x, x′) is often much simpler to compute than Φ(x) and Φ(x′)

slide-65
SLIDE 65

Trick 1 : SVM in the original space

Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j,$$
under the constraints:
$$0 \leq \alpha_i \leq C \text{ for } i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Predict with the decision function
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, x_i^\top x + b^*.$$

slide-66
SLIDE 66

Trick 1 : SVM in the feature space

Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \Phi(x_i)^\top \Phi(x_j),$$
under the constraints:
$$0 \leq \alpha_i \leq C \text{ for } i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Predict with the decision function
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, \Phi(x_i)^\top \Phi(x) + b^*.$$

slide-67
SLIDE 67

Trick 1 : SVM in the feature space with a kernel

Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j),$$
under the constraints:
$$0 \leq \alpha_i \leq C \text{ for } i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Predict with the decision function
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, K(x_i, x) + b^*.$$

slide-68
SLIDE 68

Trick 2 illustration: polynomial kernel

For $x = (x_1, x_2)^\top \in \mathbb{R}^2$, let $\Phi(x) = \left( x_1^2, \sqrt{2}\, x_1 x_2, x_2^2 \right) \in \mathbb{R}^3$:
$$K(x, x') = x_1^2 {x'_1}^2 + 2 x_1 x_2 x'_1 x'_2 + x_2^2 {x'_2}^2 = \left( x_1 x'_1 + x_2 x'_2 \right)^2 = \left( x^\top x' \right)^2.$$
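A quick numerical check (an illustration, not from the slides) that the degree-2 polynomial kernel equals the inner product after the explicit map $\Phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)$:

phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

x  <- c(1, 2)
xp <- c(3, -1)
sum(x * xp)^2            # (x' x)^2, computed directly from the kernel
sum(phi(x) * phi(xp))    # same value via the explicit feature map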

slide-69
SLIDE 69

Trick 2 illustration: polynomial kernel

More generally, for $x, x' \in \mathbb{R}^p$,
$$K(x, x') = \left( x^\top x' + 1 \right)^d$$
is an inner product in a feature space of all monomials of degree up to $d$ (left as an exercise).

slide-70
SLIDE 70

Combining tricks: learn a polynomial discrimination rule with SVM

Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \left( x_i^\top x_j + 1 \right)^d,$$
under the constraints:
$$0 \leq \alpha_i \leq C \text{ for } i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Predict with the decision function
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \left( x_i^\top x + 1 \right)^d + b^*.$$

slide-71
SLIDE 71

Illustration: toy nonlinear problem

> plot(x,col=ifelse(y>0,1,2),pch=ifelse(y>0,1,2))

[Plot: "Training data", the toy data points in the $(x_1, x_2)$ plane.]

slide-72
SLIDE 72

Illustration: toy nonlinear problem, linear SVM

> library(kernlab)
> svp <- ksvm(x,y,type="C-svc",kernel='vanilladot')
> plot(svp,data=x)

[Plot: SVM classification plot for the linear kernel.]

slide-73
SLIDE 73

Illustration: toy nonlinear problem, polynomial SVM

> svp <- ksvm(x,y,type="C-svc",kernel=polydot(degree=2))
> plot(svp,data=x)

[Plot: SVM classification plot for the degree-2 polynomial kernel.]

slide-74
SLIDE 74

Which functions K(x, x′) are kernels?

Definition

A function $K(x, x')$ defined on a set $\mathcal{X}$ is a kernel if and only if there exists a feature space (Hilbert space) $\mathcal{H}$ and a mapping $\Phi : \mathcal{X} \to \mathcal{H}$, such that, for any $x, x'$ in $\mathcal{X}$:
$$K(x, x') = \left\langle \Phi(x), \Phi(x') \right\rangle_{\mathcal{H}}.$$

slide-75
SLIDE 75

Positive Definite (p.d.) functions

Definition

A positive definite (p.d.) function on the set $\mathcal{X}$ is a function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is symmetric:
$$\forall (x, x') \in \mathcal{X}^2, \quad K(x, x') = K(x', x),$$
and which satisfies, for all $N \in \mathbb{N}$, $(x_1, x_2, \ldots, x_N) \in \mathcal{X}^N$ and $(a_1, a_2, \ldots, a_N) \in \mathbb{R}^N$:
$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j K(x_i, x_j) \geq 0.$$
slide-76
SLIDE 76

Kernels are p.d. functions

Theorem (Aronszajn, 1950)

K is a kernel if and only if it is a positive definite function.


slide-77
SLIDE 77

Proof?

Kernel $\Longrightarrow$ p.d. function:
$$\left\langle \Phi(x), \Phi(x') \right\rangle_{\mathbb{R}^d} = \left\langle \Phi(x'), \Phi(x) \right\rangle_{\mathbb{R}^d},$$
$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j \left\langle \Phi(x_i), \Phi(x_j) \right\rangle_{\mathbb{R}^d} = \left\| \sum_{i=1}^{N} a_i \Phi(x_i) \right\|_{\mathbb{R}^d}^2 \geq 0.$$
P.d. function $\Longrightarrow$ kernel: more difficult...

slide-78
SLIDE 78

Kernel examples

Polynomial (on $\mathbb{R}^d$): $K(x, x') = (x \cdot x' + 1)^d$
Gaussian radial basis function (RBF) (on $\mathbb{R}^d$): $K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$
Laplace kernel (on $\mathbb{R}$): $K(x, x') = \exp\left( -\gamma |x - x'| \right)$
Min kernel (on $\mathbb{R}_+$): $K(x, x') = \min(x, x')$

Exercise

For each kernel, find a Hilbert space $\mathcal{H}$ and a mapping $\Phi : \mathcal{X} \to \mathcal{H}$ such that $K(x, x') = \langle \Phi(x), \Phi(x') \rangle$.

slide-79
SLIDE 79

Example: SVM with a Gaussian kernel

Training:
$$\max_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)$$
subject to $0 \leq \alpha_i \leq C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$.
Prediction:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right) + b^*.$$
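The same Gaussian-kernel SVM can be trained with kernlab, continuing the toy example of the previous slides (x, y are assumed to be that toy data). This is a sketch, not part of the slides; note that the sigma parameter of rbfdot plays the role of $1/(2\sigma^2)$ in the formula above, so the value below is purely illustrative.

library(kernlab)
svp <- ksvm(x, y, type = "C-svc", kernel = "rbfdot", kpar = list(sigma = 0.5), C = 10)
plot(svp, data = x)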

slide-80
SLIDE 80

Example: SVM with a Gaussian kernel

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right) + b^*$$

[Plot: SVM classification plot for the Gaussian kernel on the toy data.]
slide-81
SLIDE 81

Linear vs nonlinear SVM

slide-82
SLIDE 82

Regularity vs data fitting trade-off

slide-83
SLIDE 83

C controls the trade-off

$$\min_f \left\{ \frac{1}{\text{margin}(f)} + C \times \text{errors}(f) \right\}$$

slide-84
SLIDE 84

Why it is important to control the trade-off

slide-85
SLIDE 85

How to choose C in practice

Split your dataset in two ("train" and "test")
Train SVM with different C on the "train" set
Compute the accuracy of the SVM on the "test" set
Choose the C which minimizes the "test" error
(you may repeat this several times = cross-validation)
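A rough sketch (not from the slides) of this procedure with kernlab, using the toy data x, y from the earlier examples; ksvm's cross argument reports the k-fold cross-validation error directly.

library(kernlab)
Cs <- 10^seq(-2, 2, by = 1)
cv_err <- sapply(Cs, function(C) {
  svp <- ksvm(x, y, type = "C-svc", kernel = "rbfdot", C = C, cross = 5)
  cross(svp)                     # 5-fold cross-validation error for this value of C
})
best_C <- Cs[which.min(cv_err)]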

slide-86
SLIDE 86

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Learning molecular classifiers with network information

5

Kernels for strings and graphs

6

Data integration with kernels

7

Conclusion

slide-87
SLIDE 87

Breast cancer prognosis

slide-88
SLIDE 88

Gene selection, molecular signature

The idea

We look for a limited set of genes that are sufficient for prediction. Selected genes should inform us about the underlying biology

slide-89
SLIDE 89

Lack of stability of signatures

[Figure: AUC vs. stability of gene signatures obtained with different feature selection methods (random, t-test, entropy, Bhattacharyya, Wilcoxon, RFE, GFS, Lasso, Elastic Net) and ensemble variants (single-run, ensemble-mean, ensemble-exp, ensemble-ss).]

Haury et al. (2011)

slide-90
SLIDE 90

Gene networks

[Figure: a gene network colored by functional modules: glycan biosynthesis; protein kinases; DNA and RNA polymerase subunits; glycolysis / gluconeogenesis; sulfur metabolism; porphyrin and chlorophyll metabolism; riboflavin metabolism; folate biosynthesis; biosynthesis of steroids, ergosterol metabolism; lysine biosynthesis; phenylalanine, tyrosine and tryptophan biosynthesis; purine metabolism; oxidative phosphorylation, TCA cycle; nitrogen, asparagine metabolism.]

slide-91
SLIDE 91

Gene networks and expression data

Motivation

Basic biological functions usually involve the coordinated action of several proteins:
Formation of protein complexes
Activation of metabolic, signalling or regulatory pathways

Many pathways and protein-protein interactions are already known.
Hypothesis: the weights of the classifier should be “coherent” with respect to this prior knowledge.

slide-92
SLIDE 92

Graph based penalty

$$f_\beta(x) = \beta^\top x, \qquad \min_\beta \; R(f_\beta) + \lambda \Omega(\beta)$$

Prior hypothesis

Genes near each other on the graph should have similar weights.

An idea (Rapaport et al., 2007)

$$\Omega(\beta) = \sum_{i \sim j} \left( \beta_i - \beta_j \right)^2, \qquad \min_{\beta \in \mathbb{R}^p} \; R(f_\beta) + \lambda \sum_{i \sim j} \left( \beta_i - \beta_j \right)^2.$$

slide-93
SLIDE 93

Graph based penalty

$$f_\beta(x) = \beta^\top x, \qquad \min_\beta \; R(f_\beta) + \lambda \Omega(\beta)$$

Prior hypothesis

Genes near each other on the graph should have similar weights.

An idea (Rapaport et al., 2007)

$$\Omega(\beta) = \sum_{i \sim j} \left( \beta_i - \beta_j \right)^2, \qquad \min_{\beta \in \mathbb{R}^p} \; R(f_\beta) + \lambda \sum_{i \sim j} \left( \beta_i - \beta_j \right)^2.$$

slide-94
SLIDE 94

Graph Laplacian

Definition

The Laplacian of the graph is the matrix L = D − A.

[Figure: a graph on vertices 1, 2, 3, 4, 5.]

$$L = D - A = \begin{pmatrix} 1 & 0 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 & 0 \\ -1 & -1 & 3 & -1 & 0 \\ 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & 1 \end{pmatrix}$$

slide-95
SLIDE 95

Spectral penalty as a kernel

Theorem

The function $f(x) = \beta^\top x$, where $\beta$ is the solution of
$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} \ell\left( \beta^\top x_i, y_i \right) + \lambda \sum_{i \sim j} \left( \beta_i - \beta_j \right)^2,$$
is equal to $g(x) = \gamma^\top \Phi(x)$, where $\gamma$ is the solution of
$$\min_{\gamma \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} \ell\left( \gamma^\top \Phi(x_i), y_i \right) + \lambda \gamma^\top \gamma,$$
and where $\Phi(x)^\top \Phi(x') = x^\top K_G x'$ for $K_G = L^*$, the pseudo-inverse of the graph Laplacian.
Proof: left as an exercise.

slide-96
SLIDE 96

Example

[Figure: the same graph on vertices 1, 2, 3, 4, 5.]

$$L^* = \begin{pmatrix} 0.88 & -0.12 & 0.08 & -0.32 & -0.52 \\ -0.12 & 0.88 & 0.08 & -0.32 & -0.52 \\ 0.08 & 0.08 & 0.28 & -0.12 & -0.32 \\ -0.32 & -0.32 & -0.12 & 0.48 & 0.28 \\ -0.52 & -0.52 & -0.32 & 0.28 & 1.08 \end{pmatrix}$$
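This example is easy to reproduce in R (a sketch, not from the slides). The graph on vertices 1, ..., 5 is assumed to have edges (1,3), (2,3), (3,4), (4,5), which is consistent with the pseudo-inverse displayed above.

library(MASS)
A <- matrix(0, 5, 5)
edges <- rbind(c(1, 3), c(2, 3), c(3, 4), c(4, 5))
A[edges] <- 1
A <- A + t(A)                  # symmetric adjacency matrix
D <- diag(rowSums(A))          # degree matrix
L <- D - A                     # graph Laplacian
round(ginv(L), 2)              # pseudo-inverse L*, matching the matrix above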

slide-97
SLIDE 97

Classifiers

[Figure: a gene network colored by functional modules: glycan biosynthesis; protein kinases; DNA and RNA polymerase subunits; glycolysis / gluconeogenesis; sulfur metabolism; porphyrin and chlorophyll metabolism; riboflavin metabolism; folate biosynthesis; biosynthesis of steroids, ergosterol metabolism; lysine biosynthesis; phenylalanine, tyrosine and tryptophan biosynthesis; purine metabolism; oxidative phosphorylation, TCA cycle; nitrogen, asparagine metabolism.]

slide-98
SLIDE 98

Classifier

a) b)

slide-99
SLIDE 99

Other penalties with kernels

$\Phi(x)^\top \Phi(x') = x^\top K_G x'$ with:
$$K_G = (cI + L)^{-1} \quad \text{leads to} \quad \Omega(\beta) = c \sum_{i=1}^{p} \beta_i^2 + \sum_{i \sim j} \left( \beta_i - \beta_j \right)^2.$$
The diffusion kernel:
$$K_G = \text{expm}(-2tL)$$
(a matrix exponential) penalizes high frequencies of $\beta$ in the Fourier domain.
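For completeness, a one-line sketch of the diffusion kernel (assuming the expm package for the matrix exponential, and the Laplacian L built in the sketch above):

library(expm)
tau <- 1                            # diffusion time, illustrative value
KG_diffusion <- expm(-2 * tau * L)  # K_G = expm(-2 t L)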

slide-100
SLIDE 100

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Learning molecular classifiers with network information

5

Kernels for strings and graphs

6

Data integration with kernels

7

Conclusion

slide-101
SLIDE 101

Supervised sequence classification

Data (training)

Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..
...

Goal

Build a classifier to predict whether new proteins are secreted or not.

slide-102
SLIDE 102

String kernels

The idea

Map each string x ∈ X to a vector Φ(x) ∈ F. Train a classifier for vectors on the images Φ(x1), . . . , Φ(xn) of the training set (nearest neighbor, linear perceptron, logistic regression, support vector machine...)

[Figure: protein sequences (maskat..., marssl..., malhtv..., mappsv..., mahtlg..., msises...) mapped by $\Phi$ from the space of strings $\mathcal{X}$ to a vector space $\mathcal{F}$.]

slide-103
SLIDE 103

Example: substring indexation

The approach

Index the feature space by fixed-length strings, i.e., $\Phi(x) = \left( \Phi_u(x) \right)_{u \in \mathcal{A}^k}$, where $\Phi_u(x)$ can be:
the number of occurrences of $u$ in $x$ (without gaps): spectrum kernel (Leslie et al., 2002)
the number of occurrences of $u$ in $x$ up to $m$ mismatches (without gaps): mismatch kernel (Leslie et al., 2004)
the number of occurrences of $u$ in $x$ allowing gaps, with a weight decaying exponentially with the number of gaps: substring kernel (Lodhi et al., 2002)

slide-104
SLIDE 104

Spectrum kernel (1/2)

Kernel definition

The 3-spectrum of x = CGGSLIAMMWFGV is:
(CGG, GGS, GSL, SLI, LIA, IAM, AMM, MMW, MWF, WFG, FGV).
Let $\Phi_u(x)$ denote the number of occurrences of $u$ in $x$. The $k$-spectrum kernel is:
$$K(x, x') := \sum_{u \in \mathcal{A}^k} \Phi_u(x)\, \Phi_u(x').$$

slide-105
SLIDE 105

Spectrum kernel (2/2)

Implementation

The computation of the kernel is formally a sum over $|\mathcal{A}|^k$ terms, but at most $|x| - k + 1$ terms are non-zero in $\Phi(x)$ $\Longrightarrow$ computation in $O(|x| + |x'|)$ with pre-indexation of the strings. Fast classification of a sequence $x$ in $O(|x|)$:
$$f(x) = w \cdot \Phi(x) = \sum_{u} w_u \Phi_u(x) = \sum_{i=1}^{|x|-k+1} w_{x_i \ldots x_{i+k-1}}.$$

Remarks

Works with any string (natural language, time series...).
Fast and scalable, a good default method for string classification.
Variants allow matching of k-mers up to m mismatches.
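A minimal sketch (not from the slides) of the k-spectrum kernel between two strings, in plain R; this is the naive version, without the pre-indexation mentioned above.

spectrum_kernel <- function(x, y, k = 3) {
  kmers <- function(s) {
    n <- nchar(s)
    if (n < k) return(character(0))
    substring(s, 1:(n - k + 1), k:n)     # all k-mers of s
  }
  tx <- table(kmers(x))                  # Phi_u(x): number of occurrences of each k-mer
  ty <- table(kmers(y))
  common <- intersect(names(tx), names(ty))
  sum(as.numeric(tx[common]) * as.numeric(ty[common]))
}

spectrum_kernel("CGGSLIAMMWFGV", "CLIVMMNRLMWFGV", k = 3)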

slide-106
SLIDE 106

Local alignment kernel (Saigo et al., 2004)

CGGSLIAMM----WFGV
|...|||||....||||
C---LIVMMNRLMWFGV

$$s_{S,g}(\pi) = S(C,C) + S(L,L) + S(I,I) + S(A,V) + 2\,S(M,M) + S(W,W) + S(F,F) + S(G,G) + S(V,V) - g(3) - g(4)$$
$$SW_{S,g}(x, y) := \max_{\pi \in \Pi(x,y)} s_{S,g}(\pi) \quad \text{is not a kernel.}$$
$$K_{LA}^{(\beta)}(x, y) = \sum_{\pi \in \Pi(x,y)} \exp\left( \beta\, s_{S,g}(x, y, \pi) \right) \quad \text{is a kernel.}$$
slide-107
SLIDE 107

LA kernel is p.d.: proof (1/2)

Definition: Convolution kernel (Haussler, 1999)

Let $K_1$ and $K_2$ be two p.d. kernels for strings. The convolution of $K_1$ and $K_2$, denoted $K_1 \star K_2$, is defined for any $x, y \in \mathcal{X}$ by:
$$K_1 \star K_2(x, y) := \sum_{x_1 x_2 = x, \; y_1 y_2 = y} K_1(x_1, y_1)\, K_2(x_2, y_2).$$

Lemma

If K1 and K2 are p.d. then K1 ⋆ K2 is p.d..

slide-108
SLIDE 108

LA kernel is p.d.: proof (2/2)

$$K_{LA}^{(\beta)} = \sum_{n=0}^{\infty} K_0 \star \left( K_a^{(\beta)} \star K_g^{(\beta)} \right)^{(n-1)} \star K_a^{(\beta)} \star K_0,$$
with:
The constant kernel: $K_0(x, y) := 1$.
A kernel for letters:
$$K_a^{(\beta)}(x, y) := \begin{cases} 0 & \text{if } |x| \neq 1 \text{ or } |y| \neq 1, \\ \exp\left( \beta S(x, y) \right) & \text{otherwise.} \end{cases}$$
A kernel for gaps:
$$K_g^{(\beta)}(x, y) = \exp\left[ \beta \left( g(|x|) + g(|y|) \right) \right].$$

slide-109
SLIDE 109

The choice of kernel matters

[Figure: number of SCOP families achieving a given ROC50 score for SVM-LA, SVM-pairwise, SVM-Mismatch and SVM-Fisher.]

Performance on the SCOP superfamily recognition benchmark (from Saigo et al., 2004).

slide-110
SLIDE 110

Virtual screening for drug discovery

[Figure: example molecules labeled active / inactive.]

NCI AIDS screen results (from http://cactus.nci.nih.gov).

slide-111
SLIDE 111

Image retrieval and classification

From Harchaoui and Bach (2007).

slide-112
SLIDE 112

Graph kernels

1

Represent each graph x by a vector Φ(x) ∈ H, either explicitly or implicitly through the kernel K(x, x′) = Φ(x)⊤Φ(x′) .

2

Use a linear method for classification in H.


slide-113
SLIDE 113

Graph kernels

1

Represent each graph x by a vector Φ(x) ∈ H, either explicitly or implicitly through the kernel K(x, x′) = Φ(x)⊤Φ(x′) .

2

Use a linear method for classification in H.


slide-114
SLIDE 114

Graph kernels

1

Represent each graph x by a vector Φ(x) ∈ H, either explicitly or implicitly through the kernel K(x, x′) = Φ(x)⊤Φ(x′) .

2

Use a linear method for classification in H.


slide-115
SLIDE 115

Indexing by all subgraphs?

Theorem

Computing all subgraph occurrences is NP-hard.

Proof.

The linear graph of size n is a subgraph of a graph X with n vertices iff X has a Hamiltonian path. The decision problem whether a graph has a Hamiltonian path is NP-complete.

slide-116
SLIDE 116

Indexing by all subgraphs?

Theorem

Computing all subgraph occurrences is NP-hard.

Proof.

The linear graph of size n is a subgraph of a graph X with n vertices iff X has a Hamiltonian path. The decision problem whether a graph has a Hamiltonian path is NP-complete.

slide-117
SLIDE 117

Indexing by all subgraphs?

Theorem

Computing all subgraph occurrences is NP-hard.

Proof.

The linear graph of size n is a subgraph of a graph X with n vertices iff X has a Hamiltonian path. The decision problem whether a graph has a Hamiltonian path is NP-complete.

slide-118
SLIDE 118

Indexing by specific subgraphs

Substructure selection

We can imagine more limited sets of substructures that lead to more computationally efficient indexing (non-exhaustive list):
substructures selected by domain knowledge (MDL fingerprint)
all paths up to length k (Openeye fingerprint, Nicholls 2005)
all shortest paths (Borgwardt and Kriegel, 2005)
all subgraphs up to k vertices (graphlet kernel, Shervashidze et al., 2009)
all frequent subgraphs in the database (Helma et al., 2004)

slide-119
SLIDE 119

Example : Indexing by all shortest paths

[Figure: a graph with vertices labeled A and B, indexed by its vector of shortest-path counts (0, ..., 0, 2, 0, ..., 0, 1, 0, ...).]

Properties (Borgwardt and Kriegel, 2005)

There are $O(n^2)$ shortest paths. The vector of counts can be computed in $O(n^4)$ with the Floyd-Warshall algorithm.

slide-120
SLIDE 120

Example : Indexing by all shortest paths

[Figure: a graph with vertices labeled A and B, indexed by its vector of shortest-path counts (0, ..., 0, 2, 0, ..., 0, 1, 0, ...).]

Properties (Borgwardt and Kriegel, 2005)

There are $O(n^2)$ shortest paths. The vector of counts can be computed in $O(n^4)$ with the Floyd-Warshall algorithm.

slide-121
SLIDE 121

Example : Indexing by all subgraphs up to k vertices

Properties (Shervashidze et al., 2009)

Naive enumeration scales as $O(n^k)$. Enumeration of connected graphlets in $O(n d^{k-1})$ for graphs with degree $\leq d$ and $k \leq 5$. Randomly sample subgraphs if enumeration is infeasible.

slide-122
SLIDE 122

Example : Indexing by all subgraphs up to k vertices

Properties (Shervashidze et al., 2009)

Naive enumeration scales as $O(n^k)$. Enumeration of connected graphlets in $O(n d^{k-1})$ for graphs with degree $\leq d$ and $k \leq 5$. Randomly sample subgraphs if enumeration is infeasible.

slide-123
SLIDE 123

Walks

Definition

A walk of a graph $(V, E)$ is a sequence of vertices $v_1, \ldots, v_n \in V$ such that $(v_i, v_{i+1}) \in E$ for $i = 1, \ldots, n-1$. We denote by $\mathcal{W}_n(G)$ the set of walks with $n$ vertices of the graph $G$, and by $\mathcal{W}(G)$ the set of all walks.

etc...

slide-124
SLIDE 124

Walks = paths

slide-125
SLIDE 125

Walk kernel

Definition

Let $\mathcal{S}_n$ denote the set of all possible label sequences of walks of length $n$ (including vertex and edge labels), and $\mathcal{S} = \cup_{n \geq 1} \mathcal{S}_n$. For any graph $G$ let a weight $\lambda_G(w)$ be associated to each walk $w \in \mathcal{W}(G)$. Let the feature vector $\Phi(G) = \left( \Phi_s(G) \right)_{s \in \mathcal{S}}$ be defined by:
$$\Phi_s(G) = \sum_{w \in \mathcal{W}(G)} \lambda_G(w)\, \mathbf{1}\left( s \text{ is the label sequence of } w \right).$$
A walk kernel is a graph kernel defined by:
$$K_{\text{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1)\, \Phi_s(G_2).$$

slide-126
SLIDE 126

Walk kernel

Definition

Let $\mathcal{S}_n$ denote the set of all possible label sequences of walks of length $n$ (including vertex and edge labels), and $\mathcal{S} = \cup_{n \geq 1} \mathcal{S}_n$. For any graph $G$ let a weight $\lambda_G(w)$ be associated to each walk $w \in \mathcal{W}(G)$. Let the feature vector $\Phi(G) = \left( \Phi_s(G) \right)_{s \in \mathcal{S}}$ be defined by:
$$\Phi_s(G) = \sum_{w \in \mathcal{W}(G)} \lambda_G(w)\, \mathbf{1}\left( s \text{ is the label sequence of } w \right).$$
A walk kernel is a graph kernel defined by:
$$K_{\text{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1)\, \Phi_s(G_2).$$

slide-127
SLIDE 127

Walk kernel examples

The $n$th-order walk kernel is the walk kernel with $\lambda_G(w) = 1$ if the length of $w$ is $n$, 0 otherwise. It compares two graphs through their common walks of length $n$.
The random walk kernel is obtained with $\lambda_G(w) = P_G(w)$, where $P_G$ is a Markov random walk on $G$. In that case we have $K(G_1, G_2) = P(\text{label}(W_1) = \text{label}(W_2))$, where $W_1$ and $W_2$ are two independent random walks on $G_1$ and $G_2$, respectively (Kashima et al., 2003).
The geometric walk kernel is obtained (when it converges) with $\lambda_G(w) = \beta^{\text{length}(w)}$, for $\beta > 0$. In that case the feature space is of infinite dimension (Gärtner et al., 2003).

slide-128
SLIDE 128

Walk kernel examples

The $n$th-order walk kernel is the walk kernel with $\lambda_G(w) = 1$ if the length of $w$ is $n$, 0 otherwise. It compares two graphs through their common walks of length $n$.
The random walk kernel is obtained with $\lambda_G(w) = P_G(w)$, where $P_G$ is a Markov random walk on $G$. In that case we have $K(G_1, G_2) = P(\text{label}(W_1) = \text{label}(W_2))$, where $W_1$ and $W_2$ are two independent random walks on $G_1$ and $G_2$, respectively (Kashima et al., 2003).
The geometric walk kernel is obtained (when it converges) with $\lambda_G(w) = \beta^{\text{length}(w)}$, for $\beta > 0$. In that case the feature space is of infinite dimension (Gärtner et al., 2003).

slide-129
SLIDE 129

Walk kernel examples

The $n$th-order walk kernel is the walk kernel with $\lambda_G(w) = 1$ if the length of $w$ is $n$, 0 otherwise. It compares two graphs through their common walks of length $n$.
The random walk kernel is obtained with $\lambda_G(w) = P_G(w)$, where $P_G$ is a Markov random walk on $G$. In that case we have $K(G_1, G_2) = P(\text{label}(W_1) = \text{label}(W_2))$, where $W_1$ and $W_2$ are two independent random walks on $G_1$ and $G_2$, respectively (Kashima et al., 2003).
The geometric walk kernel is obtained (when it converges) with $\lambda_G(w) = \beta^{\text{length}(w)}$, for $\beta > 0$. In that case the feature space is of infinite dimension (Gärtner et al., 2003).

slide-130
SLIDE 130

Computation of walk kernels

Proposition

These three kernels (nth-order, random and geometric walk kernels) can be computed efficiently in polynomial time.

slide-131
SLIDE 131

Product graph

Definition

Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two graphs with labeled vertices. The product graph $G = G_1 \times G_2$ is the graph $G = (V, E)$ with:
1. $V = \{ (v_1, v_2) \in V_1 \times V_2 \; : \; v_1 \text{ and } v_2 \text{ have the same label} \}$,
2. $E = \{ \left( (v_1, v_2), (v'_1, v'_2) \right) \in V \times V \; : \; (v_1, v'_1) \in E_1 \text{ and } (v_2, v'_2) \in E_2 \}$.

[Figure: two labeled graphs $G_1$ (vertices a, b, c, d, e) and $G_2$ (vertices 1, 2, 3, 4) and their product graph $G_1 \times G_2$ (vertices 1b, 2a, 1d, 1a, 2b, 3c, 4c, 2d, 3e, 4e).]

slide-132
SLIDE 132

Walk kernel and product graph

Lemma

There is a bijection between:

1

The pairs of walks w1 ∈ Wn(G1) and w2 ∈ Wn(G2) with the same label sequences,

2

The walks on the product graph w ∈ Wn(G1 × G2).

Corollary

$$K_{\text{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1)\, \Phi_s(G_2) = \sum_{(w_1, w_2) \in \mathcal{W}(G_1) \times \mathcal{W}(G_2)} \lambda_{G_1}(w_1)\, \lambda_{G_2}(w_2)\, \mathbf{1}\left( l(w_1) = l(w_2) \right) = \sum_{w \in \mathcal{W}(G_1 \times G_2)} \lambda_{G_1 \times G_2}(w).$$

slide-133
SLIDE 133

Walk kernel and product graph

Lemma

There is a bijection between:

1

The pairs of walks w1 ∈ Wn(G1) and w2 ∈ Wn(G2) with the same label sequences,

2

The walks on the product graph w ∈ Wn(G1 × G2).

Corollary

$$K_{\text{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1)\, \Phi_s(G_2) = \sum_{(w_1, w_2) \in \mathcal{W}(G_1) \times \mathcal{W}(G_2)} \lambda_{G_1}(w_1)\, \lambda_{G_2}(w_2)\, \mathbf{1}\left( l(w_1) = l(w_2) \right) = \sum_{w \in \mathcal{W}(G_1 \times G_2)} \lambda_{G_1 \times G_2}(w).$$

slide-134
SLIDE 134

Computation of the nth-order walk kernel

For the $n$th-order walk kernel we have $\lambda_{G_1 \times G_2}(w) = 1$ if the length of $w$ is $n$, 0 otherwise. Therefore:
$$K_{n\text{th-order}}(G_1, G_2) = \sum_{w \in \mathcal{W}_n(G_1 \times G_2)} 1.$$
Let $A$ be the adjacency matrix of $G_1 \times G_2$. Then we get:
$$K_{n\text{th-order}}(G_1, G_2) = \sum_{i,j} \left[ A^n \right]_{i,j} = \mathbf{1}^\top A^n \mathbf{1}.$$
Computation in $O(n |G_1| |G_2| d_1 d_2)$, where $d_i$ is the maximum degree of $G_i$.
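As a small illustration (not from the slides), counting common walks of length n reduces to computing $\mathbf{1}^\top A^n \mathbf{1}$ on the product-graph adjacency matrix; the matrix A below is a toy stand-in for such a product graph.

nth_order_walk_kernel <- function(A, n) {
  An <- diag(nrow(A))
  for (i in seq_len(n)) An <- An %*% A   # A^n by repeated multiplication
  sum(An)                                # sum_{i,j} [A^n]_{ij} = 1' A^n 1
}

A <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)   # path graph on 3 vertices
nth_order_walk_kernel(A, 2)                       # number of walks with 2 edges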

slide-135
SLIDE 135

Computation of random and geometric walk kernels

In both cases $\lambda_G(w)$ for a walk $w = v_1 \ldots v_n$ can be decomposed as:
$$\lambda_G(v_1 \ldots v_n) = \lambda_i(v_1) \prod_{i=2}^{n} \lambda_t(v_{i-1}, v_i).$$
Let $\Lambda_i$ be the vector of $\lambda_i(v)$ and $\Lambda_t$ be the matrix of $\lambda_t(v, v')$:
$$K_{\text{walk}}(G_1, G_2) = \sum_{n=1}^{\infty} \sum_{w \in \mathcal{W}_n(G_1 \times G_2)} \lambda_i(v_1) \prod_{i=2}^{n} \lambda_t(v_{i-1}, v_i) = \sum_{n=0}^{\infty} \Lambda_i^\top \Lambda_t^n \mathbf{1} = \Lambda_i^\top \left( I - \Lambda_t \right)^{-1} \mathbf{1}.$$
Computation in $O(|G_1|^3 |G_2|^3)$.

slide-136
SLIDE 136

Extension: branching walks (Ramon and Gärtner, 2003; Mahé and Vert, 2009)

[Figure: a molecular graph (atoms C, N, O) and the branching tree patterns rooted at one atom.]

$$T(v, n+1) = \sum_{R \subseteq \mathcal{N}(v)} \; \prod_{v' \in R} \lambda_t(v, v')\, T(v', n).$$

slide-137
SLIDE 137

2D Subtree vs walk kernels

[Figure: AUC of the walk kernel vs. the subtree kernel for inhibitor screening on each of 60 cancer cell lines (CCRF-CEM, HL-60(TB), K-562, MOLT-4, ..., BT-549, T-47D).]

Screening of inhibitors for 60 cancer cell lines.

slide-138
SLIDE 138

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Learning molecular classifiers with network information

5

Kernels for strings and graphs

6

Data integration with kernels

7

Conclusion

slide-139
SLIDE 139

Motivation

Assume we observe M types of data and would like to learn a joint model (e.g., predict susceptibility from SNP and expression data). We saw in the previous part how to make kernels $K_1, \ldots, K_M$ for each type of data, and learn with each kernel individually. Can we combine them to learn jointly from heterogeneous data?

slide-140
SLIDE 140

Sum kernel

Definition

Let $K_1, \ldots, K_M$ be $M$ kernels on $\mathcal{X}$. The sum kernel $K_S$ is the kernel on $\mathcal{X}$ defined as
$$\forall x, x' \in \mathcal{X}, \quad K_S(x, x') = \sum_{i=1}^{M} K_i(x, x').$$

slide-141
SLIDE 141

Sum kernel and vector concatenation

Theorem

For $i = 1, \ldots, M$, let $\Phi_i : \mathcal{X} \to \mathcal{H}_i$ be a feature map such that
$$K_i(x, x') = \left\langle \Phi_i(x), \Phi_i(x') \right\rangle_{\mathcal{H}_i}.$$
Then $K_S = \sum_{i=1}^{M} K_i$ can be written as:
$$K_S(x, x') = \left\langle \Phi_S(x), \Phi_S(x') \right\rangle_{\mathcal{H}_S},$$
where $\Phi_S : \mathcal{X} \to \mathcal{H}_S = \mathcal{H}_1 \oplus \ldots \oplus \mathcal{H}_M$ is the concatenation of the feature maps $\Phi_i$:
$$\Phi_S(x) = \left( \Phi_1(x), \ldots, \Phi_M(x) \right)^\top.$$
Therefore, summing kernels amounts to concatenating their feature space representations, which is a quite natural way to integrate different features.
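An illustration (not from the slides) of data integration by kernel summation with kernlab: X1 and X2 are assumed to be two feature matrices describing the same n samples, and y their labels. Whether ksvm accepts a precomputed kernel matrix via kernel = "matrix" should be checked against the installed kernlab version.

library(kernlab)
K1 <- kernelMatrix(rbfdot(sigma = 0.1), X1)
K2 <- kernelMatrix(vanilladot(), X2)
KS <- as.kernelMatrix(K1 + K2)                       # sum kernel = concatenated feature spaces
svp <- ksvm(KS, y, type = "C-svc", kernel = "matrix")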

slide-142
SLIDE 142

Proof

For $\Phi_S(x) = \left( \Phi_1(x), \ldots, \Phi_M(x) \right)^\top$, we easily compute:
$$\left\langle \Phi_S(x), \Phi_S(x') \right\rangle_{\mathcal{H}_S} = \sum_{i=1}^{M} \left\langle \Phi_i(x), \Phi_i(x') \right\rangle_{\mathcal{H}_i} = \sum_{i=1}^{M} K_i(x, x') = K_S(x, x').$$

slide-143
SLIDE 143

Example: data integration with the sum kernel

Y. Yamanishi, J.-P. Vert and M. Kanehisa. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics, Vol. 20 Suppl. 1, 2004, pages i363-i370. DOI: 10.1093/bioinformatics/bth910.

Kernels combined:
Kexp (expression)
Kppi (protein interaction)
Kloc (localization)
Kphy (phylogenetic profile)
Kexp + Kppi + Kloc + Kphy (integration)

slide-144
SLIDE 144

Learning the kernel

Motivation

If we know how to weight each kernel, then we can learn with the weighted kernel
$$K_\eta = \sum_{i=1}^{M} \eta_i K_i.$$
However, usually we don't know... Perhaps we can optimize the weights $\eta_i$ during learning?

slide-145
SLIDE 145

An objective function for K

Theorem

For any p.d. kernel $K$ on $\mathcal{X}$, let
$$J(K) = \min_{f \in \mathcal{H}_K} \left\{ R(f^n) + \lambda \|f\|_{\mathcal{H}_K}^2 \right\}.$$
The function $K \mapsto J(K)$ is convex.
This suggests a principled way to "learn" a kernel: define a convex set of candidate kernels, and minimize $J(K)$ by convex optimization.
slide-146
SLIDE 146

Proof

We can show by strong duality that
$$J(K) = \max_{\gamma \in \mathbb{R}^n} \left\{ -R^*(-2\lambda\gamma) - \lambda \gamma^\top K \gamma \right\}.$$
For each fixed $\gamma$, this is an affine function of $K$, hence convex. A supremum of convex functions is convex.

slide-147
SLIDE 147

MKL (Lanckriet et al., 2004)

We consider the set of convex combinations
$$K_\eta = \sum_{i=1}^{M} \eta_i K_i \quad \text{with} \quad \eta \in \Sigma_M = \left\{ \eta_i \geq 0, \; \sum_{i=1}^{M} \eta_i = 1 \right\}.$$
We optimize both $\eta$ and $f^*$ by solving:
$$\min_{\eta \in \Sigma_M} J(K_\eta) = \min_{\eta \in \Sigma_M} \; \min_{f \in \mathcal{H}_{K_\eta}} \left\{ R(f^n) + \lambda \|f\|_{\mathcal{H}_{K_\eta}}^2 \right\}.$$
The problem is jointly convex in $(\eta, \alpha)$ and can be solved efficiently. The output is both a set of weights $\eta$, and a predictor corresponding to the kernel method trained with kernel $K_\eta$. This method is usually called Multiple Kernel Learning (MKL).

slide-148
SLIDE 148

Example: protein annotation

G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, Vol. 20 no. 16, 2004, pages 2626-2635. doi:10.1093/bioinformatics/bth294.

Kernel   Data                   Similarity measure
KSW      protein sequences      Smith-Waterman
KB       protein sequences      BLAST
KPfam    protein sequences      Pfam HMM
KFFT     hydropathy profile     FFT
KLI      protein interactions   linear kernel
KD       protein interactions   diffusion kernel
KE       gene expression        radial basis kernel
KRND     random numbers         linear kernel

[Figure: ROC score and TP1FP for membrane protein prediction using each kernel alone (B, SW, Pfam, FFT, LI, D, E) and all kernels combined, together with the learned kernel weights.]

slide-149
SLIDE 149

Example: Image classification (Harchaoui and Bach, 2007)

COREL14 dataset

1400 natural images in 14 classes. Compare kernel between histograms (H), walk kernel (W), subtree kernel (TW), weighted subtree kernel (wTW), and a combination by MKL (M).

[Figure: test error of each kernel (H, W, TW, wTW) and of the MKL combination (M); performance comparison on Corel14.]

slide-150
SLIDE 150

Sum kernel vs MKL (Bach et al., 2004)

Learning with the sum kernel (uniform combination) solves
$$\min_{f_1, \ldots, f_M} \; R\left( \sum_{i=1}^{M} f_i^n \right) + \lambda \sum_{i=1}^{M} \|f_i\|_{\mathcal{H}_{K_i}}^2.$$
Learning with MKL (best convex combination) solves
$$\min_{f_1, \ldots, f_M} \; R\left( \sum_{i=1}^{M} f_i^n \right) + \lambda \left( \sum_{i=1}^{M} \|f_i\|_{\mathcal{H}_{K_i}} \right)^2.$$
Although MKL can be thought of as optimizing a convex combination of kernels, it is more correct to think of it as a penalized risk minimization estimator with the group lasso penalty:
$$\Omega(f) = \min_{f_1 + \ldots + f_M = f} \; \sum_{i=1}^{M} \|f_i\|_{\mathcal{H}_{K_i}}.$$

slide-151
SLIDE 151

Example: ridge vs LASSO regression

Take $\mathcal{X} = \mathbb{R}^d$, and for $x = (x_1, \ldots, x_d)^\top$ consider the rank-1 kernels:
$$\forall i = 1, \ldots, d, \quad K_i(x, x') = x_i x'_i.$$
The sum kernel is $K_S(x, x') = \sum_{i=1}^{d} x_i x'_i = x^\top x'$.
Learning with the sum kernel solves a ridge regression problem:
$$\min_{\beta \in \mathbb{R}^d} \; R(X\beta) + \lambda \sum_{i=1}^{d} \beta_i^2.$$
Learning with MKL solves a LASSO regression problem:
$$\min_{\beta \in \mathbb{R}^d} \; R(X\beta) + \lambda \left( \sum_{i=1}^{d} |\beta_i| \right)^2.$$
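A sketch (not from the slides) of this correspondence with glmnet, where X and y are assumed regression data: alpha = 0 gives the ridge penalty (sum kernel case) and alpha = 1 the lasso penalty (MKL case).

library(glmnet)
fit_ridge <- glmnet(X, y, alpha = 0)   # ridge: lambda * sum(beta_i^2)
fit_lasso <- glmnet(X, y, alpha = 1)   # lasso: lambda * sum(|beta_i|)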

slide-152
SLIDE 152

Example: Graph lasso (Jacob et al., 2009)

Graph $G = (V, E)$, $\mathcal{X} = \mathbb{R}^V$. For each edge $e = (i, j)$, define the kernel
$$K_e(x, x') = x_e^\top x'_e = x_i x'_i + x_j x'_j.$$
MKL (aka latent group lasso) with the set $\{ K_e : e \in E \}$ leads to a sparse linear model with connected non-zero components.

slide-153
SLIDE 153

Application: breast cancer prognosis

slide-154
SLIDE 154

Lasso signature (accuracy 0.61)

slide-155
SLIDE 155

Graph Lasso signature (accuracy 0.64)

slide-156
SLIDE 156

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Learning molecular classifiers with network information

5

Kernels for strings and graphs

6

Data integration with kernels

7

Conclusion

slide-157
SLIDE 157

SVM summary

Large margin classifier
Control of the regularization / data fitting trade-off with C
Linear or nonlinear (with the kernel trick)
Extension to strings, graphs... and many others
Data integration

slide-158
SLIDE 158

References

N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337-404, 1950. URL http://www.jstor.org/stable/1990404.

F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning, page 6, New York, NY, USA, 2004. ACM. doi: 10.1145/1015330.1015424.

K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74-81, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2278-5. doi: 10.1109/ICDM.2005.132.

Z. Harchaoui and F. Bach. Image classification with segmentation graph kernels. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), pages 1-8. IEEE Computer Society, 2007. doi: 10.1109/CVPR.2007.383049. URL http://dx.doi.org/10.1109/CVPR.2007.383049.

D. Haussler. Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, 1999.

C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44(4):1402-11, 2004. doi: 10.1021/ci034254q. URL http://dx.doi.org/10.1021/ci034254q.

slide-159
SLIDE 159

References (cont.)

L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 433-440, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. doi: 10.1145/1553374.1553431. URL http://dx.doi.org/10.1145/1553374.1553431.

G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, 2004a. URL http://www.jmlr.org/papers/v5/lanckriet04a.html.

G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626-2635, 2004b. doi: 10.1093/bioinformatics/bth294. URL http://bioinformatics.oupjournals.org/cgi/content/abstract/20/16/2626.

C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res., 5:1435-1455, 2004.

C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: a string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauerdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 564-575, Singapore, 2002. World Scientific.

H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. J. Mach. Learn. Res., 2:419-444, 2002. URL http://www.ai.mit.edu/projects/jmlr/papers/volume2/lodhi02a/abstract.html.

slide-160
SLIDE 160

References (cont.)

P. Mahé and J.-P. Vert. Graph kernels based on tree patterns for molecules. Mach. Learn., 75(1):3-35, 2009. doi: 10.1007/s10994-008-5086-2. URL http://dx.doi.org/10.1007/s10994-008-5086-2.

A. Nicholls. OEChem, version 1.3.4, OpenEye Scientific Software. Website, 2005.

A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. J. Mach. Learn. Res., 9:2491-2521, 2008. URL http://jmlr.org/papers/v9/rakotomamonjy08a.html.

J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In T. Washio and L. De Raedt, editors, Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences, pages 65-74, 2003.

F. Rapaport, A. Zynoviev, M. Dutreix, E. Barillot, and J.-P. Vert. Classification of microarray data using gene networks. BMC Bioinformatics, 8:35, 2007. doi: 10.1186/1471-2105-8-35. URL http://dx.doi.org/10.1186/1471-2105-8-35.

H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682-1689, 2004. URL http://bioinformatics.oupjournals.org/cgi/content/abstract/20/11/1682.

N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In 12th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 488-495, Clearwater Beach, Florida, USA, 2009. Society for Artificial Intelligence and Statistics.

Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics, 20:i363-i370, 2004. URL http://bioinformatics.oupjournals.org/cgi/reprint/19/suppl_1/i323.