

slide-1
SLIDE 1

An Introduction to Kernel Methods for Classification, Regression and Structured Data

Gunnar Rätsch∗, Computational Biology Center, Sloan-Kettering Institute, New York City

∗ previous versions together with Sören Sonnenburg & Cheng Soon Ong
slide-2
SLIDE 2

Tutorial Outline

1 Introduction to Machine Learning

Classification, Regression, and Structured Prediction
Complexity and Model Selection

2 Kernels and basic kernel methods

Large Margin Separation
Non-linear Separation with Kernels

3 Kernels for Structured Data

Substring Kernels for Biological Sequences
Kernels for Graphs & Images

4 Useful Extensions of SVMs

Heterogeneous Data & Multiple Kernel Learning
Understanding the Learned SVM Classifier

5 Structured Output Learning

HMMs & Label Sequence Learning
Semi-Markov Extensions

6 Case Studies (Applications)

Transcription Start Site Prediction and Gene Finding
Tiling Array Analysis and Short Read Alignments


slide-3
SLIDE 3

Material

Supporting Material is available online

Slides, tutorial, example scripts, software, toy datasets, and links: http://bioweb.me/MLSSKernels2012

With contributions from Sören Sonnenburg, Cheng Soon Ong, Bernhard Schölkopf, Petra Philipps, Klaus-Robert Müller, Peter Gehler, Karsten Borgwardt, Christian Widmer, Philipp Drewe and others.


slide-4
SLIDE 4

Part I Introduction to Machine Learning


slide-5
SLIDE 5

Overview: Introduction to Machine Learning

1 Example: Sequence Classification

Running Example

2 Empirical Inference

Learning from Examples
Loss Functions
Measuring Complexity

3 Digestion

Putting Things Together


slide-6
SLIDE 6

Why machine learning?

A lot of data
Data is noisy
No clear biological theory
Large number of features
Complex relationships
Let the data do the talking!


slide-7
SLIDE 7

Running Example: Splicing

Almost all donor splice sites exhibit GU.
Almost all acceptor splice sites exhibit AG.
Not all GUs and AGs are used as splice sites.



slide-9
SLIDE 9

Classification of Sequences

Example: Recognition of splice sites
Every 'AG' is a potential acceptor splice site
The computer has to learn what splice sites look like,

given some known genes/splice sites . . .

Prediction on unknown DNA


slide-10
SLIDE 10

From Sequences to Features

Many algorithms depend on numerical representations.

Each example is a vector of values (features).

Use background knowledge to design good features.

[Sequence diagram: candidate acceptor site with intron/exon boundary]

Example feature table (columns x1 ... x8 are candidate sites; the motif rows are binary indicators whose exact column placement is not recoverable from the extracted slide):
             x1   x2   x3   x4   x5   x6   x7   x8   ...
GC before    0.6  0.2  0.4  0.3  0.2  0.4  0.5  0.5  ...
GC after     0.7  0.7  0.3  0.6  0.3  0.4  0.7  0.6  ...
AGAGAAG      1/0 indicators (motif present)          ...
TTTAG        1/0 indicators (motif present)          ...
Label        +1   +1   +1   −1   −1   +1   −1   −1   ...


slide-11
SLIDE 11

Numerical Representation


slide-12
SLIDE 12

Recognition of Splice Sites

Given: Potential acceptor splice sites

intron exon

Goal: Rule that distinguishes true from false ones

[Scatter plot: GC content before 'AG' vs. GC content after 'AG', true splice sites vs. decoy sites]

Exploit that exons have a higher GC content or that certain motifs are located nearby.


slide-13
SLIDE 13

Recognition of Splice Sites

Given: Potential acceptor splice sites

intron exon

Goal: Rule that distinguishes true from false ones

[Scatter plot: GC content before 'AG' vs. GC content after 'AG'; which rule separates them?]

Exploit that exons have a higher GC content or that certain motifs are located nearby.


slide-14
SLIDE 14

Empirical Inference (=Learning from Examples)

The machine utilizes information from training data to predict the outputs associated with a particular test example.

Use training data to “train” the machine. Use trained machine to perform predictions on test data.


slide-15
SLIDE 15

Supervised Learning = Function Estimation

Basic Notion: We want to estimate the relationship between the examples xi and the associated labels yi.
Formally: We want to choose an estimator f : X → Y.
Intuition: We would like a function f which correctly predicts the label y for a given example x.
Question: How do we measure how well we are doing?


slide-16
SLIDE 16

Loss Function

Basic Notion: We characterize the quality of an estimator by a loss function.
Formally: We define a loss function ℓ : Y × Y → R+, evaluated as ℓ(f(xi), yi).
Intuition: For a given label yi and a given prediction f(xi), we want a positive value telling us how much error there is.


slide-17
SLIDE 17

Classification

[Scatter plot: GC content before 'AG' vs. GC content after 'AG'; which rule separates them?]

In binary classification (Y = {−1, +1}), one may use the 0/1 loss function:
  ℓ(f(xi), yi) = 0 if f(xi) = yi, and 1 if f(xi) ≠ yi.


slide-18
SLIDE 18

Regression

In regression (Y = R), one often uses the squared loss function: ℓ(f(xi), yi) = (f(xi) − yi)².


slide-19
SLIDE 19

Expected vs. Empirical Risk

Expected Risk: This is the average loss on unseen examples. We would like it to be as small as possible, but it is hard to compute.
Empirical Risk: We can compute the average loss on the training data. We define the empirical risk to be
  Remp(f, X, Y) = (1/n) ∑_{i=1}^n ℓ(f(xi), yi).
Basic Notion: Instead of minimizing the expected risk, we minimize the empirical risk. This is called empirical risk minimization.
Question: How do we know that the estimator performs well on unseen data?
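To make the loss functions and the empirical risk concrete, here is a minimal sketch in Python (NumPy assumed; the function names are illustrative and not from the tutorial's software):

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """0/1 loss for binary classification with labels in {-1, +1}."""
    return float(y_pred != y_true)

def squared_loss(y_pred, y_true):
    """Squared loss for regression."""
    return (y_pred - y_true) ** 2

def empirical_risk(f, X, Y, loss):
    """R_emp(f, X, Y) = (1/n) * sum_i loss(f(x_i), y_i)."""
    return np.mean([loss(f(x), y) for x, y in zip(X, Y)])

# Toy usage: a trivial classifier that predicts the sign of the first feature.
X = np.array([[0.6, 0.7], [0.2, 0.7], [-0.4, 0.3], [-0.3, 0.6]])
Y = np.array([+1, +1, -1, +1])
f = lambda x: np.sign(x[0])
print(empirical_risk(f, X, Y, zero_one_loss))  # 0.25 on this toy data
```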


slide-20
SLIDE 20

Simple vs. Complex Functions

[Three scatter plots of GC content before vs. after 'AG', each separated by an increasingly complex decision boundary]

Which function is preferable?

[http://www.franciscans.org]

Occam’s razor (a.k.a. Occam’s Law of Parsimony):

(William of Occam, 14th century)

“Entities should not be multiplied beyond necessity”

(“Do not make the hypothesis more complex than necessary”)



slide-22
SLIDE 22

Summary of Empirical Inference

Learn function f : X → Y given N labeled examples (xi, yi) ∈ X × Y.

Three important ingredients:
Model fθ parametrized with some parameters θ ∈ Θ
Loss function ℓ(f(x), y) measuring the "deviation" between the prediction f(x) and the label y
Complexity term P[f] defining model classes with limited complexity (via nested subsets {f | P[f] ≤ p} ⊆ {f | P[f] ≤ p′} for p ≤ p′)

Most algorithms find θ in fθ by minimizing, for a given regularization parameter C:
  θ∗ = argmin_{θ∈Θ}  ∑_{i=1}^N ℓ(fθ(xi), yi)  +  C · P[fθ]
                      (empirical error)          (complexity term)



slide-26
SLIDE 26

Part II Support Vector Machines and Kernels


slide-27
SLIDE 27

Overview: Support Vector Machines and Kernels

4 Margin Maximization

Some Learning Theory
Support Vector Machines for Binary Classification
Convex Optimization

5 Kernels & the “Trick”

Inflating the Feature Space
Kernel “Trick”
Common Kernels
Results for Running Example

6 Beyond 2-Class Classification

Multiple Kernel Learning
Multi-Class Classification & Regression
Semi-Supervised Learning & Transfer Learning

7 Software & Demonstration


slide-28
SLIDE 28

Why maximize the margin?

[Scatter plot: GC content before 'AG' vs. GC content after 'AG' with a maximum-margin hyperplane w]

Intuitively, it feels the safest. For a small error in the separating hyperplane, we do not suffer too many mistakes. Empirically, it works well. VC theory indicates that it is the right thing to do.


slide-29
SLIDE 29

Approximation & Estimation Error

R(fn) − R∗ = [R(fn) − inf_{f∈F} R(f)] + [inf_{f∈F} R(f) − R∗]
              (estimation error)          (approximation error)

where fn ∈ F is the algorithm's choice with n examples and R∗ is the minimal risk.

F large: small approximation error, but overfitting (large estimation error)
F small: large approximation error; better generalization/estimation, but poor overall performance

Model selection: Choose F to get an optimal tradeoff between approximation and estimation error.


slide-30
SLIDE 30

Estimation Error

How large is R(fn) − inf_{f∈F} R(f)?

Uniform differences: R(fn) − inf_{f∈F} R(f) ≤ 2 sup_{f∈F} |Remp(f) − R(f)|

Finite sample results:
F finite: R(fn) − inf_{f∈F} R(f) ≈ √(log|F|) / √n
F infinite: ?


slide-31
SLIDE 31

Special Case: Complexity of Hyperplanes

[Scatter plot: GC content before 'AG' vs. GC content after 'AG' with a separating hyperplane w]

What is the complexity of hyperplane classifiers?


slide-32
SLIDE 32

Special Case: Complexity of Hyperplanes

[Scatter plot: GC content before 'AG' vs. GC content after 'AG' with a separating hyperplane w]

What is the complexity of hyperplane classifiers? Vladimir Vapnik and Alexey Chervonenkis: the Vapnik-Chervonenkis (VC) dimension

[Vapnik and Chervonenkis, 1971; Vapnik, 1995] [http://tinyurl.com/cl8jo9,http://tinyurl.com/d7lmux]


slide-33
SLIDE 33

VC Dimension

A model class shatters a set of data points if it can correctly classify every possible labeling of them. Lines shatter any 3 points in general position in R², but not 4 points.

VC dimension [Vapnik, 1995]

The VC dimension of a model class F is the maximum h such that some set of h data points can be shattered by the model. (E.g., the VC dimension of hyperplanes in R² is 3.)

  R(fn) ≤ Remp(fn) + √[(h log(2N/h) + h − log(η/4)) / N]
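As a small sanity check on the bound above, here is a sketch of my own (not from the slides) that evaluates the VC confidence term for a few sample sizes:

```python
import numpy as np

def vc_confidence(h, N, eta=0.05):
    """Confidence term sqrt((h*log(2N/h) + h - log(eta/4)) / N) from the VC bound."""
    return np.sqrt((h * np.log(2 * N / h) + h - np.log(eta / 4)) / N)

# The bound shrinks as N grows and grows with the VC dimension h.
for N in (100, 1000, 10000):
    print(N, vc_confidence(h=3, N=N), vc_confidence(h=50, N=N))
```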


slide-34
SLIDE 34

Larger Margin ⇒ Less Complex

Large Margin Hyperplanes ⇒ Small VC Dimension
Hyperplane classifiers with large margins have small VC dimension [Vapnik and Chervonenkis, 1971; Vapnik, 1995].

Maximum Margin ⇒ Minimum Complexity
Minimize complexity by maximizing the margin (irrespective of the dimension of the space).

Useful idea: Find the hyperplane that classifies all points correctly while maximizing the margin (= SVMs).


slide-35
SLIDE 35

Large Margin ⇒ low complexity? - Why?


slide-36
SLIDE 36

Canonical Hyperplanes [Vapnik, 1995]

Note: If c ≠ 0, then {x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}. Hence (cw, cb) describes the same hyperplane as (w, b).

Definition: The hyperplane is in canonical form w.r.t. X∗ = {x1, . . . , xr} if
  min_{xi∈X∗} |⟨w, xi⟩ + b| = 1.

Note that for canonical hyperplanes, the distance of the closest point to the hyperplane (the “margin”) is 1/‖w‖:
  min_{xi∈X∗} |⟨w/‖w‖, xi⟩ + b/‖w‖| = 1/‖w‖.


slide-37
SLIDE 37

Theorem 8 [Vapnik, 1979]

Consider hyperplanes ⟨w, x⟩ = 0, where w is normalized such that they are in canonical form w.r.t. a set of points X∗ = {x1, . . . , xr}, i.e.,
  min_{i=1,...,r} |⟨w, xi⟩| = 1.

The set of decision functions fw(x) = sgn(⟨x, w⟩) defined on X∗ and satisfying the constraint ‖w‖ ≤ Λ has a VC dimension satisfying
  h ≤ R²Λ².

Here, R is the radius of the smallest sphere around the origin containing X∗.


slide-38
SLIDE 38

How to Maximize the Margin? I

Consider linear hyperplanes with parameters w, b:
  f(x) = ∑_{j=1}^d wj xj + b = ⟨w, x⟩ + b


slide-39
SLIDE 39

How to Maximize the Margin? II

Margin maximization is equivalent to minimizing ‖w‖. [Schölkopf and Smola, 2002]



slide-41
SLIDE 41

How to Maximize the Margin? III

[Scatter plot: GC content before 'AG' vs. GC content after 'AG' with a maximum-margin hyperplane w and its margin]

Minimize   ½‖w‖² + C ∑_{i=1}^n ξi
subject to yi(⟨w, xi⟩ + b) ≥ 1 − ξi and ξi ≥ 0 for all i = 1, . . . , n.

Examples on the margin are called support vectors [Vapnik, 1995].



slide-43
SLIDE 43

We have to solve an “Optimization Problem”

minimize_{w,b,ξ}   ½‖w‖² + C ∑_{i=1}^n ξi
subject to         yi(⟨w, xi⟩ + b) ≥ 1 − ξi  for all i = 1, . . . , n
                   ξi ≥ 0                    for all i = 1, . . . , n

Quadratic objective function, linear constraints in w and b: a “Quadratic Optimization Problem” (QP) and a “Convex Optimization Problem” (efficient solution possible; every local minimum is a global minimum).

How to solve it?
General-purpose optimization packages (GNU Linear Programming Kit, CPLEX, Mosek, . . . )
Much faster specialized solvers (liblinear, SVM OCAS, Nieme, SGD, . . . )
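As a concrete illustration of solving this QP, here is a minimal sketch using scikit-learn's off-the-shelf soft-margin SVM (using scikit-learn is my assumption here; the tutorial's own examples use easysvm/Shogun):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: GC content before/after 'AG' for true sites (+1) and decoys (-1).
X = np.array([[0.6, 0.7], [0.5, 0.7], [0.4, 0.6], [0.2, 0.3], [0.3, 0.2], [0.4, 0.3]])
y = np.array([+1, +1, +1, -1, -1, -1])

# C is the regularization constant from the objective (1/2)||w||^2 + C * sum(xi_i).
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.coef_, clf.intercept_)    # w and b of the separating hyperplane
print(clf.support_)                 # indices of the support vectors
print(clf.predict([[0.55, 0.65]]))  # predict a new candidate site
```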



slide-45
SLIDE 45

Lagrange Function (e.g., [Bertsekas, 1995])

Introduce Lagrange multipliers αi ≥ 0 and a Lagrangian
  L(w, b, α) = ½‖w‖² − ∑_{i=1}^m αi (yi · (⟨w, xi⟩ + b) − 1).

L has to be minimized w.r.t. the primal variables w and b and maximized w.r.t. the dual variables αi.

If a constraint is violated, then yi · (⟨w, xi⟩ + b) − 1 < 0:
αi will grow to increase L; w and b want to decrease L, i.e. they have to change such that the constraint is satisfied. If the problem is separable, this ensures that αi < ∞.
If yi · (⟨w, xi⟩ + b) − 1 > 0, then αi = 0: otherwise, L could be increased by decreasing αi (KKT conditions).


slide-46
SLIDE 46

Derivation of the Dual Problem

At the extremum we have ∂L(w, b, α)/∂b = 0 and ∂L(w, b, α)/∂w = 0, i.e.
  ∑_{i=1}^m αi yi = 0   and   w = ∑_{i=1}^m αi yi xi.

Substitute both into L to get the dual problem.


slide-47
SLIDE 47

Dual Problem

Dual: maximize
  W(α) = ∑_{i=1}^m αi − ½ ∑_{i,j=1}^m αi αj yi yj ⟨xi, xj⟩
subject to αi ≥ 0, i = 1, . . . , m, and ∑_{i=1}^m αi yi = 0.


slide-48
SLIDE 48

An Important Detail

minimize_{w,b,ξ}   ½‖w‖² + C ∑_{i=1}^n ξi
subject to         yi(⟨w, xi⟩ + b) ≥ 1 − ξi and ξi ≥ 0 for all i = 1, . . . , n

Representer Theorem: The optimal w can be written as a linear combination of the examples (for appropriate α's):
  w = ∑_{i=1}^n αi xi   ⇒ Plug in! Now optimize for the variables α, b, and ξ!

Corollary: The hyperplane only depends on the scalar products of the examples, ⟨x, x̂⟩ = ∑_{d=1}^D xd x̂d. Remember this!



slide-50
SLIDE 50

An Important Detail

minimize_{α,b,ξ}   ½ ‖∑_{i=1}^N αi xi‖² + C ∑_{i=1}^n ξi
subject to         yi(∑_{j=1}^N αj ⟨xj, xi⟩ + b) ≥ 1 − ξi for all i = 1, . . . , n
                   ξi ≥ 0 for all i = 1, . . . , n

Representer Theorem: The optimal w can be written as a linear combination of the examples (for appropriate α's):
  w = ∑_{i=1}^n αi xi   ⇒ Plug in! Now optimize for the variables α, b, and ξ!

Corollary: The hyperplane only depends on the scalar products of the examples, ⟨x, x̂⟩ = ∑_{d=1}^D xd x̂d. Remember this!
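A minimal sketch of the corollary: once the expansion coefficients α and b are known, training and prediction only ever touch the examples through scalar products (the variable names below are illustrative, not from the tutorial scripts):

```python
import numpy as np

def decision_function(alpha, b, X_train, x):
    """f(x) = sum_i alpha_i * <x_i, x> + b -- w is never formed explicitly."""
    return sum(a * np.dot(xi, x) for a, xi in zip(alpha, X_train)) + b

# Equivalent check: forming w = sum_i alpha_i x_i gives the same value.
X_train = np.array([[0.6, 0.7], [0.2, 0.3], [0.5, 0.6]])
alpha = np.array([0.8, -0.8, 0.0])   # illustrative coefficients, not a trained solution
b = -0.1
x_new = np.array([0.4, 0.5])
w = (alpha[:, None] * X_train).sum(axis=0)
print(decision_function(alpha, b, X_train, x_new), np.dot(w, x_new) + b)
```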



slide-53
SLIDE 53

Nonseparable Problems

[Bennett and Mangasarian, 1992; Cortes and Vapnik, 1995]

If yi · (⟨w, xi⟩ + b) ≥ 1 cannot be satisfied, then αi → ∞.

Modify the constraint to yi · (⟨w, xi⟩ + b) ≥ 1 − ξi with ξi ≥ 0 (“soft margin”) and add ∑_{i=1}^m ξi to the objective function.


slide-54
SLIDE 54

Soft Margin SVMs

C-SVM [Cortes and Vapnik, 1995]: for C > 0, minimize
  τ(w, ξ) = ½‖w‖² + C ∑_{i=1}^m ξi
subject to yi · (⟨w, xi⟩ + b) ≥ 1 − ξi, ξi ≥ 0   (margin 1/‖w‖)

ν-SVM [Schölkopf et al., 2000]: for 0 ≤ ν ≤ 1, minimize
  τ(w, ξ, ρ) = ½‖w‖² − νρ + ∑_{i=1}^m ξi
subject to yi · (⟨w, xi⟩ + b) ≥ ρ − ξi, ξi ≥ 0   (margin ρ/‖w‖)


slide-55
SLIDE 55

The ν-Property

SVs: αi > 0. “Margin errors”: ξi > 0.
The KKT conditions imply: all margin errors are SVs; not all SVs need to be margin errors, and those which are not lie exactly on the edge of the margin.

Proposition:
1 fraction of margin errors ≤ ν ≤ fraction of SVs
2 asymptotically, both fractions converge to ν
3 optimal choice: ν = expected classification error


slide-56
SLIDE 56

Connection between ν-SVC and C-SVC

Proposition: If ν-SV classification leads to ρ > 0, then C-SV classification, with C set a priori to 1/ρ, leads to the same decision function.

Proof: Minimize the primal target, then fix ρ, and minimize only over the remaining variables: nothing will change. Hence the obtained solution w0, b0, ξ0 minimizes the primal problem of C-SVC, for C = 1, subject to yi · (⟨w, xi⟩ + b) ≥ ρ − ξi. To recover the constraint yi · (⟨w, xi⟩ + b) ≥ 1 − ξi, rescale to the set of variables w′ = w/ρ, b′ = b/ρ, ξ′ = ξ/ρ. This leaves us, up to a constant scaling factor ρ², with the C-SV target with C = 1/ρ.


slide-57
SLIDE 57

Recognition of Splice Sites

Given: Potential acceptor splice sites

intron exon

Goal: Rule that distinguishes true from false ones Linear Classifiers with large margin


slide-58
SLIDE 58

Recognition of Splice Sites

Given: Potential acceptor splice sites

intron exon

Goal: Rule that distinguishes true from false ones

[Scatter plot: GC content before 'AG' vs. GC content after 'AG'; the two classes overlap]

More realistic problem? Not linearly separable! Need nonlinear separation? Need more features?



slide-60
SLIDE 60

Nonlinear Separations

Linear separation might not be sufficient! ⇒ Map into a higher dimensional feature space.

Example: all second-order monomials
  Φ : R² → R³,  (x1, x2) ↦ (z1, z2, z3) := (x1², √2 x1x2, x2²)

[Figure: points that are not linearly separable in (x1, x2) become linearly separable in (z1, z2, z3)]


slide-61
SLIDE 61

Kernels and Feature Spaces

Preprocess the data with Φ : X → H, x ↦ Φ(x), where H is a dot product space, and learn the mapping from Φ(x) to y [Boser et al., 1992].
Usually, dim(X) ≪ dim(H). “Curse of dimensionality”?
Crucial issue: capacity matters, not dimensionality.

The VC dimension of (large-margin) hyperplanes is (essentially) independent of the dimensionality.


slide-62
SLIDE 62

Kernel “Trick”

Example: x ∈ R² and Φ(x) := (x1², √2 x1x2, x2²)   [Boser et al., 1992]

  ⟨Φ(x), Φ(x̂)⟩ = ⟨(x1², √2 x1x2, x2²), (x̂1², √2 x̂1x̂2, x̂2²)⟩ = ⟨(x1, x2), (x̂1, x̂2)⟩² = ⟨x, x̂⟩² =: k(x, x̂)

The scalar product in feature space (here R³) can be computed in input space (here R²)!
This also works for higher orders and dimensions ⇒ relatively low-dimensional input spaces ⇒ very high-dimensional feature spaces.
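A quick numerical check of this identity (a sketch of my own, not part of the tutorial scripts): the explicit feature map and the squared scalar product give the same number.

```python
import numpy as np

def phi(x):
    """Explicit second-order monomial map R^2 -> R^3."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, xhat):
    """Kernel trick: <Phi(x), Phi(xhat)> = <x, xhat>^2, computed in input space."""
    return np.dot(x, xhat) ** 2

x, xhat = np.array([0.3, -1.2]), np.array([0.8, 0.5])
print(np.dot(phi(x), phi(xhat)))  # feature-space scalar product
print(k(x, xhat))                 # same value, without ever forming Phi
```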


slide-63
SLIDE 63

General Product Feature Space

[Schölkopf et al., 1996]

How about patterns x ∈ R^N and product features of order d? Here, dim(H) grows like N^d. For instance, N = 16 · 16 and d = 5 ⇒ dimension ≈ 10^10.


slide-64
SLIDE 64

The Kernel Trick, N = d = 2

⟨Φ(x), Φ(x′)⟩ = (x1², √2 x1x2, x2²)(x′1², √2 x′1x′2, x′2²)ᵀ = ⟨x, x′⟩² =: k(x, x′)

The dot product in H can be computed in R².


slide-65
SLIDE 65

The Kernel Trick, II

More generally, for x, x′ ∈ R^N and d ∈ N:
  ⟨x, x′⟩^d = (∑_{j=1}^N xj · x′j)^d = ∑_{j1,...,jd=1}^N (xj1 · · · xjd) · (x′j1 · · · x′jd) = ⟨Φ(x), Φ(x′)⟩

where Φ maps into the space spanned by all ordered products of d input directions.


slide-66
SLIDE 66

Mercer’s Theorem

If k is a continuous kernel of a positive definite integral operator on L2(X) (where X is some compact space), i.e.
  ∫_X ∫_X k(x, x′) f(x) f(x′) dx dx′ ≥ 0,
it can be expanded as
  k(x, x′) = ∑_{i=1}^∞ λi ψi(x) ψi(x′)
using eigenfunctions ψi and eigenvalues λi ≥ 0 [Osuna et al., 1996].


slide-67
SLIDE 67

The Mercer Feature Map

In that case,
  Φ(x) := (√λ1 ψ1(x), √λ2 ψ2(x), . . .)
satisfies ⟨Φ(x), Φ(x′)⟩ = k(x, x′).

Proof: ⟨Φ(x), Φ(x′)⟩ = ⟨(√λ1 ψ1(x), . . .), (√λ1 ψ1(x′), . . .)⟩ = ∑_{i=1}^∞ λi ψi(x) ψi(x′) = k(x, x′)


slide-68
SLIDE 68

Positive Definite Kernels

It can be shown that the admissible class of kernels coincides with the class of positive definite (pd) kernels: kernels which are symmetric (i.e., k(x, x′) = k(x′, x)) and, for any set of training points x1, . . . , xm ∈ X and any a1, . . . , am ∈ R, satisfy
  ∑_{i,j} ai aj Kij ≥ 0,   where Kij := k(xi, xj).

K is called the Gram matrix or kernel matrix. If, for pairwise distinct points, ∑_{i,j} ai aj Kij = 0 ⇒ a = 0, the kernel is called strictly positive definite.
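In practice the condition ∑ ai aj Kij ≥ 0 for all a is equivalent to the Gram matrix having no negative eigenvalues, which is easy to check numerically; a small sketch of my own with illustrative names:

```python
import numpy as np

def is_positive_semidefinite(K, tol=1e-10):
    """Check that the (symmetric) Gram matrix K has no eigenvalue below -tol."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.random.randn(20, 5)
K_linear = X @ X.T                          # linear kernel: always psd
K_sigmoid = np.tanh(0.5 * (X @ X.T) + 1.0)  # sigmoid kernel: not psd in general
print(is_positive_semidefinite(K_linear), is_positive_semidefinite(K_sigmoid))
```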


slide-69
SLIDE 69

Elementary Properties of PD Kernels

Kernels from Feature Maps: If Φ maps X into a dot product space H, then ⟨Φ(x), Φ(x′)⟩ is a pd kernel on X × X.
Positivity on the Diagonal: k(x, x) ≥ 0 for all x ∈ X.
Cauchy-Schwarz Inequality: k(x, x′)² ≤ k(x, x) k(x′, x′).
Vanishing Diagonals: k(x, x) = 0 for all x ∈ X ⇒ k(x, x′) = 0 for all x, x′ ∈ X.


slide-70
SLIDE 70

Some Properties of Kernels

[Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004]

If k1, k2, . . . are pd kernels, then so are:
αk1, provided α ≥ 0
k1 + k2
k1 · k2
k(x, x′) := lim_{n→∞} kn(x, x′), provided it exists
k(A, B) := ∑_{x∈A, x′∈B} k1(x, x′), where A, B are finite subsets of X (using the feature map Φ(A) := ∑_{x∈A} Φ(x))

Further operations to construct kernels from kernels: tensor products, direct sums, convolutions [Haussler, 1999b].


slide-71
SLIDE 71

Putting Things Together . . .

Use Φ(x) instead of x; use a linear classifier on the Φ(x)'s.
From the Representer Theorem: w = ∑_{i=1}^n αi Φ(xi).

Nonlinear separation:
  f(x) = ⟨w, Φ(x)⟩ + b = ∑_{i=1}^n αi ⟨Φ(xi), Φ(x)⟩ + b = ∑_{i=1}^n αi k(xi, x) + b

Trick: k(x, x′) = ⟨Φ(x), Φ(x′)⟩, i.e. do not use Φ, but k!

See e.g. [Müller et al., 2001; Schölkopf and Smola, 2002; Vapnik, 1995] for details.


slide-72
SLIDE 72

Kernel ≈ Similarity Measure

Distance: ‖Φ(x) − Φ(x̂)‖² = ‖Φ(x)‖² − 2⟨Φ(x), Φ(x̂)⟩ + ‖Φ(x̂)‖²
Scalar product: ⟨Φ(x), Φ(x̂)⟩
If ‖Φ(x)‖² = ‖Φ(x̂)‖² = 1, then ⟨Φ(x), Φ(x̂)⟩ = 1 − ½‖Φ(x) − Φ(x̂)‖², i.e. scalar product and distance carry the same information.
Angle between vectors: ⟨Φ(x), Φ(x̂)⟩ / (‖Φ(x)‖ ‖Φ(x̂)‖) = cos ∠(Φ(x), Φ(x̂))

Technical detail: kernel functions have to satisfy certain conditions (Mercer's condition).


slide-73
SLIDE 73

How to Construct a Kernel

At least two ways to get to a kernel:
Construct Φ and think about efficient ways to compute the scalar product ⟨Φ(x), Φ(x̂)⟩
Construct a similarity measure (show Mercer's condition) and think about what it means

What can you do if the kernel is not positive definite? (The optimization problem is then not convex!)
Add a constant to the diagonal (cheap)
Exponentiate the kernel matrix (all eigenvalues become positive)
SVM-pairwise: use the similarities as features


slide-74
SLIDE 74

Common Kernels

See e.g. Müller et al. [2001]; Schölkopf and Smola [2002]; Vapnik [1995]

Polynomial           k(x, x̂) = (⟨x, x̂⟩ + c)^d
Sigmoid              k(x, x̂) = tanh(κ⟨x, x̂⟩ + θ)
RBF                  k(x, x̂) = exp(−‖x − x̂‖² / (2σ²))
Convex combinations  k(x, x̂) = β1 k1(x, x̂) + β2 k2(x, x̂)
Normalization        k(x, x̂) = k′(x, x̂) / √(k′(x, x) k′(x̂, x̂))

Notes: Kernels may be combined in case of heterogeneous data. These kernels are good for real-valued examples. Sequences need special care (coming soon!)
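The table above translates directly into code; a minimal sketch (the function names are mine, for illustration only):

```python
import numpy as np

def polynomial_kernel(x, xhat, c=1.0, d=3):
    return (np.dot(x, xhat) + c) ** d

def rbf_kernel(x, xhat, sigma=1.0):
    return np.exp(-np.sum((x - xhat) ** 2) / (2 * sigma ** 2))

def normalized_kernel(k, x, xhat):
    """Normalize any kernel k so that k_norm(x, x) = 1."""
    return k(x, xhat) / np.sqrt(k(x, x) * k(xhat, xhat))

x, xhat = np.array([0.6, 0.7]), np.array([0.2, 0.3])
print(polynomial_kernel(x, xhat), rbf_kernel(x, xhat, sigma=0.5))
print(normalized_kernel(polynomial_kernel, x, xhat))
```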


slide-75
SLIDE 75

Toy Examples

Linear kernel: k(x, x̂) = ⟨x, x̂⟩.   RBF kernel: k(x, x̂) = exp(−‖x − x̂‖² / (2σ²)).


slide-76
SLIDE 76

Kernel Summary

Nonlinear separation ⇔ linear separation of nonlinearly mapped examples
A mapping Φ defines a kernel by k(x, x̂) := ⟨Φ(x), Φ(x̂)⟩
(Mercer) A kernel defines a mapping Φ (nontrivial)
The choice of kernel has to match the data at hand
The RBF kernel often works pretty well


slide-77
SLIDE 77

Evaluation Measures for Classification

The Contingency Table / Confusion Matrix

TP, FP, FN, TN are absolute counts of true positives, false positives, false negatives and true negatives.
N: sample size
N+ = FN + TP: number of positive examples
N− = FP + TN: number of negative examples
O+ = TP + FP: number of positive predictions
O− = FN + TN: number of negative predictions

outputs \ labeling   y = +1   y = −1   Σ
f(x) = +1            TP       FP       O+
f(x) = −1            FN       TN       O−
Σ                    N+       N−       N


slide-78
SLIDE 78

Evaluation Measures for Classification II

Several commonly used performance measures:

Accuracy                        ACC = (TP + TN) / N
Error rate (1 − accuracy)       ERR = (FP + FN) / N
Balanced error rate             BER = ½ (FN/(FN + TP) + FP/(FP + TN))
Weighted relative accuracy      WRACC = TP/(TP + FN) − FP/(FP + TN)
F1 score                        F1 = 2·TP / (2·TP + FP + FN)
Cross-correlation coefficient   CC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Sensitivity / recall            TPR = TP/N+ = TP/(TP + FN)
Specificity                     TNR = TN/N− = TN/(TN + FP)
1 − sensitivity                 FNR = FN/N+ = FN/(FN + TP)
1 − specificity                 FPR = FP/N− = FP/(FP + TN)
P.p.v. / precision              PPV = TP/O+ = TP/(TP + FP)
False discovery rate            FDR = FP/O+ = FP/(FP + TP)
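A compact sketch that computes several of these measures from the four confusion-matrix counts (a helper of my own, shown for illustration):

```python
import numpy as np

def classification_metrics(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    return {
        "ACC": (tp + tn) / n,
        "BER": 0.5 * (fn / (fn + tp) + fp / (fp + tn)),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "TPR": tp / (tp + fn),   # sensitivity / recall
        "TNR": tn / (tn + fp),   # specificity
        "PPV": tp / (tp + fp),   # precision
        "CC": (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

print(classification_metrics(tp=80, fp=20, fn=10, tn=90))
```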


slide-79
SLIDE 79

Evaluation Measures for Classification III

[left] ROC curve, [right] precision-recall curve, comparing the proposed method with FirstEF, Eponine and McPromotor
(Obtained by varying the bias and recording TPR/FPR or PPV/TPR.)

Use a bias-independent scalar evaluation measure:
Area under the ROC curve (auROC)
Area under the precision-recall curve (auPRC)


slide-80
SLIDE 80

Measuring Performance in Practice

What to do in practice: Split the data into training and validation sets; use the error on the validation set as an estimate of the expected error.

A. Cross-validation: Split the data into c disjoint parts; use each subset as validation set and the rest as training set.

B. Random splits: Randomly split the data set into two parts, for example 80% of the data for training and 20% for validation; repeat this many times.

See, for instance, Duda et al. [2001] for more details.
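A minimal cross-validation loop for estimating the expected error (a sketch under the assumption that scikit-learn is available; the tutorial's own demos use easysvm/Galaxy instead):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def cross_validated_error(X, y, C=1.0, folds=5):
    """Average validation error over `folds` disjoint splits."""
    errors = []
    for train_idx, val_idx in KFold(n_splits=folds, shuffle=True, random_state=0).split(X):
        clf = SVC(kernel="linear", C=C).fit(X[train_idx], y[train_idx])
        errors.append(np.mean(clf.predict(X[val_idx]) != y[val_idx]))
    return float(np.mean(errors))

# Toy data: two Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
y = np.array([+1] * 50 + [-1] * 50)
print(cross_validated_error(X, y, C=1.0))
```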


slide-81
SLIDE 81

Model Selection

Do not train on the “test set”! Use a subset of the data for training; from that subset, split further to select the model.
Model selection = find the best parameters:
Regularization parameter C
Other parameters (introduced later)


slide-82
SLIDE 82

GC-Content-based Splice Site Recognition

Kernel              auROC
Linear              88.2%
Polynomial d = 3    91.4%
Polynomial d = 7    90.4%
Gaussian σ = 100    87.9%
Gaussian σ = 1      88.6%
Gaussian σ = 0.01   77.3%

SVM accuracy of acceptor site recognition using polynomial and Gaussian kernels with different degrees d and widths σ. Accuracy is measured using the area under the ROC curve (auROC) and is computed using five-fold cross-validation.


slide-83
SLIDE 83

Demonstration: Recognition of Splice Sites

Given: Potential acceptor splice sites

intron exon

Goal: Rule that distinguishes true from false ones

Task 1: Train a classifier and predict using 5-fold cross-validation. Evaluate the predictions.
Task 2: Determine the best combination of polynomial degree and SVM's C using 5-fold cross-validation.

http://bioweb.me/svmcompbio http://bioweb.me/mlb-galaxy


slide-84
SLIDE 84

Some Extensions

Multiple Kernel Learning
Semi-Supervised Learning
Multi-class classification
Regression
Domain Adaptation and Multi-Task Learning


slide-85
SLIDE 85

Multiple Kernel Learning (MKL)

Data may consist of sequence and structure information.
Possible solution: add the two kernels, k(x, x′) := ksequence(x, x′) + kstructure(x, x′).
Better solution: mix the two kernels, k(x, x′) := (1 − t) ksequence(x, x′) + t kstructure(x, x′), where t is estimated from the training data.
In general: use the data to find the best convex combination,
  k(x, x′) = ∑_{p=1}^K βp kp(x, x′).

Applications: heterogeneous data; improving interpretability (more on this later).
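A minimal sketch of a fixed convex combination of Gram matrices (learning the weights βp is what MKL solvers such as the SILP implementation in Shogun do; here the weights are simply given as an example):

```python
import numpy as np

def combined_kernel(kernel_matrices, betas):
    """k = sum_p beta_p * k_p with beta_p >= 0 and sum_p beta_p = 1."""
    betas = np.asarray(betas, dtype=float)
    assert np.all(betas >= 0) and np.isclose(betas.sum(), 1.0)
    return sum(b * K for b, K in zip(betas, kernel_matrices))

X = np.random.randn(10, 4)
K_linear = X @ X.T
K_rbf = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 2.0)
K = combined_kernel([K_linear, K_rbf], betas=[0.3, 0.7])
print(K.shape)  # (10, 10) Gram matrix usable by any kernel method
```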



slide-89
SLIDE 89

Example: Combining Heterogeneous Data

Consider data from different domains, e.g. DNA strings, binding energies, conservation, structure, . . .
  k(x, x′) = β1 kdna(xdna, x′dna) + β2 knrg(xnrg, x′nrg) + β3 k3d(x3d, x′3d) + · · ·


slide-90
SLIDE 90

MKL Primal Formulation

min    ½ ∑_{j=1}^M βj ‖wj‖² + C ∑_{i=1}^N ξi
w.r.t. w = (w1, . . . , wM), wj ∈ R^{Dj} ∀j = 1 . . . M,  β ∈ R^M₊,  ξ ∈ R^N₊,  b ∈ R
s.t.   yi (∑_{j=1}^M βj wjᵀ Φj(xi) + b) ≥ 1 − ξi,  ∀i = 1, . . . , N
       ∑_{j=1}^M βj = 1

Properties: equivalent to the SVM for M = 1; the solution is sparse in “blocks”; each block j corresponds to one kernel.


slide-91
SLIDE 91

Solving MKL

SDP [Lanckriet et al., 2004], QCQP [Bach et al., 2004], SILP [Sonnenburg et al., 2006a], SimpleMKL [Rakotomamonjy et al., 2008], Extended Level Set Method [Xu et al., 2009], . . .

SILP is implemented in the shogun-toolbox; examples are available.


slide-92
SLIDE 92

Multi-Class Classification

Real problems often have more than 2 classes. Generalize the SVM to multi-class classification, for K > 2. Three approaches [Schölkopf and Smola, 2002]:

One-vs-rest: For each class, label all other classes as “negative” (K binary problems). ⇒ Simple and hard to beat!
One-vs-one: Compare all classes pairwise (½ K(K − 1) binary problems).
Multi-class loss: Define a new empirical risk term.
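The one-vs-rest scheme is easy to write out directly; a minimal sketch (scikit-learn's LinearSVC stands in for any binary large-margin classifier here, which is my assumption, not the tutorial's toolchain):

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_rest_fit(X, y, classes, C=1.0):
    """Train one binary SVM per class (that class vs. all others)."""
    return {c: LinearSVC(C=C).fit(X, np.where(y == c, 1, -1)) for c in classes}

def one_vs_rest_predict(models, X):
    """Predict the class whose binary classifier is most confident."""
    classes = list(models)
    scores = np.column_stack([models[c].decision_function(X) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]

X = np.random.randn(60, 2) + np.repeat([[0, 0], [4, 0], [0, 4]], 20, axis=0)
y = np.repeat([0, 1, 2], 20)
models = one_vs_rest_fit(X, y, classes=[0, 1, 2])
print(one_vs_rest_predict(models, X[:5]))
```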


slide-93
SLIDE 93

Multi-Class Loss for SVMs

Two-Class SVM:
  minimize_{w,b}   ½‖w‖² + ∑_{i=1}^N ℓ(fw,b(xi), yi)

Multi-Class SVM:
  minimize_{w,b}   ½‖w‖² + ∑_{i=1}^N max_{u≠yi} ℓ(fw,b(xi, yi) − fw,b(xi, u), yi)


slide-94
SLIDE 94

Regression

Examples x ∈ X Labels y ∈ R


slide-95
SLIDE 95

Regression

Squared loss (simplest approach): ℓ(f(xi), yi) := (yi − f(xi))².
Problem: all α's are non-zero ⇒ inefficient!

ε-insensitive loss function: extend the “margin” to regression; establish a “tube” around the line where we can make mistakes:
  ℓ(f(xi), yi) = 0 if |yi − f(xi)| < ε, and |yi − f(xi)| − ε otherwise.
Idea: examples (xi, yi) inside the tube have αi = 0.

Huber's loss (combination of benefits):
  ℓ(f(xi), yi) := ½(yi − f(xi))² if |yi − f(xi)| < γ, and γ|yi − f(xi)| − ½γ² otherwise.

See e.g. Smola and Schölkopf [2001] for other loss functions and more details.
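The three regression losses side by side, as a small sketch of my own (the ε and γ values are arbitrary examples):

```python
def squared_loss(y, f):
    return (y - f) ** 2

def eps_insensitive_loss(y, f, eps=0.1):
    """Zero inside the tube |y - f| < eps, linear outside."""
    return max(0.0, abs(y - f) - eps)

def huber_loss(y, f, gamma=1.0):
    """Quadratic for small residuals, linear for large ones."""
    r = abs(y - f)
    return 0.5 * r ** 2 if r < gamma else gamma * r - 0.5 * gamma ** 2

for r in (0.05, 0.5, 3.0):
    print(r, squared_loss(0.0, r), eps_insensitive_loss(0.0, r), huber_loss(0.0, r))
```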



slide-98
SLIDE 98

Semi-Supervised Learning: What Is It?

For most researchers: SSL = semi-supervised classification.


slide-99
SLIDE 99

Semi-Supervised Learning: How Does It Work?

Cluster Assumption

Points in the same cluster are likely to be of the same class. Equivalent assumption:

Low Density Separation Assumption

The decision boundary lies in a low density region. ⇒ Algorithmic idea: Low Density Separation


slide-100
SLIDE 100

Semi-Supervised SVM

min_{w,b,(yj),(ξk)}   ½ wᵀw + C ∑_i ξi + C∗ ∑_j ξj
s.t.  ξi ≥ 0,  ξj ≥ 0
      yi(wᵀxi + b) ≥ 1 − ξi        (labeled examples i)
      yj(wᵀxj + b) ≥ 1 − ξj        (unlabeled examples j, whose labels yj are optimized as well)

Soft margin S³VM


slide-101
SLIDE 101

Semi-Supervised SVM: Hard to Train

Supervised Support Vector Machine (SVM):
  min_{w,b,(ξk)}   ½ wᵀw + C ∑_i ξi
  s.t. ξi ≥ 0,  yi(wᵀxi + b) ≥ 1 − ξi
Maximize the margin around (labeled) points. Convex optimization problem (QP, quadratic programming).

Semi-Supervised Support Vector Machine (S³VM):
  min_{w,b,(yj),(ξk)}   ½ wᵀw + C ∑_i ξi + C∗ ∑_j ξj
  s.t. ξi ≥ 0,  ξj ≥ 0,  yi(wᵀxi + b) ≥ 1 − ξi,  yj(wᵀxj + b) ≥ 1 − ξj
Maximize the margin around labeled and unlabeled points. Discrete, combinatorial, NP-hard!


slide-102
SLIDE 102

Semi-Supervised SVM: Optimization

⇒ Optimization matters

Comparison of S³VM Optimization Methods
Averaged over splits (and pairs of classes)
Fixed hyperparameters (close to hard margin)
Similar results for other hyperparameter settings

[Chapelle et al., 2006]


slide-103
SLIDE 103

Covariate Shift & Domain Adaptation

The idea of domain adaptation: there is insufficient labeled training data for some problems. Idea: turn to related domains for which more data is available. The so-called source and target domains can be different, but should be related enough to gain something.

Distributional point of view:
Supervised learning: example-label pairs drawn from P(X, Y)
PSource(X, Y) might differ from PTarget(X, Y)
Factorization: P(X, Y) = P(Y|X) · P(X)
Covariate shift: PSource(X) ≠ PTarget(X)
Differing conditionals: PSource(Y|X) ≠ PTarget(Y|X)

→ There are numerous ways to approach this problem!

[Ben-david et al., 2007; Evgeniou and Pontil, 2004; Schweikert et al., 2008]


slide-104
SLIDE 104

Domain Adaptation Methods

Idea [Caruana, 1997]: optimize both models simultaneously and enforce similarity between the solutions.

Approach:
  min_{wS,wT,ξ}   ½‖wS‖² + ½‖wT‖² − B wTᵀwS + C ∑_{i=1}^{m+n} ξi
  s.t.  yi(⟨wS, Φ(xi)⟩ + b) ≥ 1 − ξi,  i = 1, . . . , m
        yi(⟨wT, Φ(xi)⟩ + b) ≥ 1 − ξi,  i = m + 1, . . . , m + n

Equivalent to multi-task kernel learning [Daume III, 2007]:
  KMTK((x, t), (x′, t′)) = γt,t′ K(x, x′) for a suitably chosen Γ (p.s.d.).



slide-106
SLIDE 106

Two ways of leveraging a given taxonomy T

KMTL((x, t), (x′, t′)) = γt,t′K(x, x′)

[Widmer et al., 2010]


slide-107
SLIDE 107

From Taxonomy to Γ

[Phylogenetic taxonomy over worm 1, worm 2, worm 3, fly and plant, with internal nodes A-D and divergence times ranging from 100 million to 1600 million years ago]

Idea: γi,j should be inversely related to the time to the last common ancestor.
Strategies: 1/years, hop distance, . . .



slide-109
SLIDE 109

Hierarchical Top-Down Approach

Idea: Exploit the taxonomy T in a top-down fashion.
Initialization: w0 is trained on the union of all task datasets.
Top-down, for each node i:
  Train on Di = ∪_{j ⪯ i} Dj
  Regularize wi against the parent predictor wparent:
    min_{wi,b}   ½‖wi − wparent‖² + C ∑_{(x,y)∈Di} ℓ(⟨Φ(x), wi⟩ + b, y)
Use the leaf predictors for classification.


slide-110
SLIDE 110

Hierarchical Top-Down Approach:Illustration

(a) Given taxonomy (b) Top-level training (c) Intermediate training (d) Taxon training


slide-111
SLIDE 111

Application to Splicing Data

Formulation as a binary classification problem
Utilize 15 organisms related by a taxonomy
Restricted to at most 10,000 examples per organism



slide-113
SLIDE 113

Results: Splicing Data

Observations:
Union > Plain → conservation
Often: Union > Nearest
MTL methods outperform the baselines
Best performer: Top-Down (& MT-Kernel)


slide-114
SLIDE 114

Available SVM Packages

2-Class Classification (35 hits on http://mloss.org), package names sorted by popularity
Multi-Class Classification (7 hits on http://mloss.org)
Regression (54 hits on http://mloss.org)
More can be found at http://www.kernel-machines.org.


slide-115
SLIDE 115

Easy-to-use Software

Easysvm: an easy-to-use SVM toolbox based on Python and the Shogun toolbox, usable from the command line or within Python
PyML: an easy-to-use Python-based SVM toolbox, usable from the command line or within Python
Shogun toolbox: a powerful toolbox for large-scale data analysis, including many SVM implementations, with support for Python, R, Matlab, and Octave
LibSVM: an SVM library with a graphical interface
SVM-Light: an efficient implementation of SVMs in C, usable from the command line
Galaxy Web Service: a web service for using SVMs, with predefined kernels for real-valued data and string classification (based on Easysvm): http://bioweb.me/mlb-galaxy
