

slide-1
SLIDE 1

Support vector machines and applications in computational biology

Jean-Philippe Vert Jean-Philippe.Vert@mines.org

slide-2
SLIDE 2

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Kernels for strings and graphs

slide-3
SLIDE 3

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Kernels for strings and graphs

slide-4
SLIDE 4

Cancer diagnosis

Problem 1

Given the expression levels of 20k genes in a leukemia, is it an acute lymphocytic or myeloid leukemia (ALL or AML)?

slide-5
SLIDE 5

Cancer prognosis

Problem 2

Given the expression levels of 20k genes in a tumour after surgery, is it likely to relapse later?

slide-6
SLIDE 6

Pharmacogenomics / Toxicogenomics

Problem 3

Given the genome of a person, which drug should we give?

slide-7
SLIDE 7

Protein annotation

Data available

Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP...
...

Problem 4

Given a newly sequenced protein, is it secreted or not?

slide-8
SLIDE 8

Drug discovery

[Figure: candidate molecules labeled active or inactive]

Problem 5

Given a new candidate molecule, is it likely to be active?

slide-9
SLIDE 9

A common topic

slide-10
SLIDE 10

A common topic

slide-11
SLIDE 11

A common topic

slide-12
SLIDE 12

A common topic

slide-13
SLIDE 13

On real data...

slide-14
SLIDE 14

Pattern recognition, aka supervised classification

Challenges

High dimension
Few samples
Structured data
Heterogeneous data
Prior knowledge
Fast and scalable implementations
Interpretable models

slide-15
SLIDE 15

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Kernels for strings and graphs

slide-16
SLIDE 16

Linear classifier

slide-17
SLIDE 17

Linear classifier

slide-18
SLIDE 18

Linear classifier

slide-19
SLIDE 19

Linear classifier

slide-20
SLIDE 20

Linear classifier

slide-21
SLIDE 21

Linear classifier

slide-22
SLIDE 22

Linear classifier

slide-23
SLIDE 23

Linear classifier

slide-24
SLIDE 24

Which one is better?

slide-25
SLIDE 25

The margin of a linear classifier

slide-26
SLIDE 26

The margin of a linear classifier

slide-27
SLIDE 27

The margin of a linear classifier

slide-28
SLIDE 28

The margin of a linear classifier

slide-29
SLIDE 29

The margin of a linear classifier

slide-30
SLIDE 30

Largest margin classifier (hard-margin SVM)

slide-31
SLIDE 31

Support vectors

slide-32
SLIDE 32

More formally

The training set is a finite set of n data/class pairs:
S = {(x1, y1), . . . , (xn, yn)} ,
where xi ∈ Rp and yi ∈ {−1, 1}. We assume (for the moment) that the data are linearly separable, i.e., that there exists (w, b) ∈ Rp × R such that:
w.xi + b > 0 if yi = 1 ,
w.xi + b < 0 if yi = −1 .
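
As a quick illustration (not part of the original slides), here is a toy R sketch of such a linearly separable set and the corresponding decision rule sign(w.x + b); the data and the particular (w, b) below are invented for the example.

# Hypothetical toy example of a linear classifier f(x) = w.x + b
set.seed(1)
x <- rbind(matrix(rnorm(40, mean =  3), ncol = 2),    # 20 points of the positive class
           matrix(rnorm(40, mean = -3), ncol = 2))    # 20 points of the negative class
y <- c(rep(1, 20), rep(-1, 20))
w <- c(1, 1); b <- 0                                  # one separating hyperplane among many
f <- drop(x %*% w + b)                                # decision values w.xi + b
table(predicted = ifelse(f > 0, 1, -1), truth = y)    # every point should fall on its own side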

slide-33
SLIDE 33

How to find the largest separating hyperplane?

For a given linear classifier f(x) = w.x + b, consider the "tube" defined by the values −1 and +1 of the decision function:

[Figure: the hyperplanes w.x+b = −1, w.x+b = 0 and w.x+b = +1, with the regions w.x+b > +1 and w.x+b < −1 on either side]

slide-34
SLIDE 34

The margin is 2/||w||

Indeed, the points x1 and x2 satisfy:
w.x1 + b = 0 ,
w.x2 + b = 1 .
By subtracting we get w.(x2 − x1) = 1, and since x2 − x1 can be taken parallel to w, ||x2 − x1|| = 1/||w||. Therefore:
γ = 2 ||x2 − x1|| = 2/||w|| .

slide-35
SLIDE 35

All training points should be on the correct side of the dotted line

For positive examples (yi = 1) this means:
w.xi + b ≥ 1 .
For negative examples (yi = −1) this means:
w.xi + b ≤ −1 .
Both cases are summarized by:
∀i = 1, . . . , n ,  yi (w.xi + b) ≥ 1 .
slide-36
SLIDE 36

Finding the optimal hyperplane

Find (w, b) which minimize:
||w||²
under the constraints:
∀i = 1, . . . , n ,  yi (w.xi + b) − 1 ≥ 0 .

This is a classical quadratic program on Rp+1.

slide-37
SLIDE 37

Lagrangian

In order to minimize:
(1/2) ||w||²
under the constraints:
∀i = 1, . . . , n ,  yi (w.xi + b) − 1 ≥ 0 ,
we introduce one dual variable αi for each constraint, i.e., for each training point. The Lagrangian is:
L(w, b, α) = (1/2) ||w||² − Σ_{i=1}^n αi (yi (w.xi + b) − 1) .
slide-38
SLIDE 38

Lagrangian

L(w, b, α) is convex quadratic in w. It is minimized for:
∇w L = w − Σ_{i=1}^n αi yi xi = 0  ⇒  w = Σ_{i=1}^n αi yi xi .
L(w, b, α) is affine in b. Its minimum is −∞ except if:
∇b L = Σ_{i=1}^n αi yi = 0 .

slide-39
SLIDE 39

Dual function

We therefore obtain the Lagrange dual function:
q(α) = inf_{w∈Rp, b∈R} L(w, b, α)
     = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n yi yj αi αj xi.xj   if Σ_{i=1}^n αi yi = 0 ,
     = −∞   otherwise.

The dual problem is: maximize q(α) subject to α ≥ 0 .
slide-40
SLIDE 40

Dual problem

Find α∗ ∈ Rn which maximizes
L(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi.xj ,
under the (simple) constraints αi ≥ 0 (for i = 1, . . . , n), and Σ_{i=1}^n αi yi = 0.
This is a quadratic program on Rn, with "box constraints". α∗ can be found efficiently using dedicated optimization software.

slide-41
SLIDE 41

Recovering the optimal hyperplane

Once α∗ is found, we recover (w∗, b∗) corresponding to the optimal hyperplane. w∗ is given by:
w∗ = Σ_{i=1}^n α∗i yi xi ,
and the decision function is therefore:
f∗(x) = w∗.x + b∗ = Σ_{i=1}^n α∗i yi xi.x + b∗ .   (1)

slide-42
SLIDE 42

Interpretation: support vectors

[Figure: support vectors lie on the margin and have α > 0; all other points have α = 0]

slide-43
SLIDE 43

What if data are not linearly separable?

slide-44
SLIDE 44

What if data are not linearly separable?

slide-45
SLIDE 45

What if data are not linearly separable?

slide-46
SLIDE 46

What if data are not linearly separable?

slide-47
SLIDE 47

Soft-margin SVM

Find a trade-off between large margin and few errors. Mathematically:
min_f  1/margin(f) + C × errors(f) ,
where C is a parameter.
slide-48
SLIDE 48

Soft-margin SVM formulation

The margin of a labeled point (x, y) is
margin(x, y) = y (w.x + b) .
The error is:
0 if margin(x, y) > 1 ,
1 − margin(x, y) otherwise.
The soft-margin SVM solves:
min_{w,b}  ||w||² + C Σ_{i=1}^n max (0, 1 − yi (w.xi + b))

slide-49
SLIDE 49

Soft-margin SVM and hinge loss

min_{w,b}  Σ_{i=1}^n ℓhinge(w.xi + b, yi) + λ ||w||² ,
for λ = 1/C and the hinge loss function:
ℓhinge(u, y) = max (1 − yu, 0) = 0 if yu ≥ 1, 1 − yu otherwise.

[Figure: the hinge loss ℓ(f(x), y) as a function of yf(x), zero for yf(x) ≥ 1 and linear below]
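
For concreteness (not from the slides), the hinge loss is one line of R; the test values below are arbitrary.

hinge <- function(u, y) pmax(1 - y * u, 0)   # hinge loss, vectorized over decision values u
hinge(c(-2, 0, 0.5, 1, 3), y = 1)            # 3.0 1.0 0.5 0.0 0.0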

slide-50
SLIDE 50

Dual formulation of soft-margin SVM (exercise)

Maximize
L(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi.xj ,
under the constraints:
0 ≤ αi ≤ C , for i = 1, . . . , n ,
Σ_{i=1}^n αi yi = 0 .
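
As a concrete sketch (not part of the original slides), this dual quadratic program can be solved numerically with kernlab's interior-point solver ipop(); the toy data, the value of C and the tolerance used to select unbounded support vectors are illustrative choices.

library(kernlab)
set.seed(2)
x <- rbind(matrix(rnorm(60, mean =  1.5), ncol = 2),   # toy two-class data, 30 points per class
           matrix(rnorm(60, mean = -1.5), ncol = 2))
y <- c(rep(1, 30), rep(-1, 30))
C <- 10
H <- (y %*% t(y)) * (x %*% t(x))                  # H[i,j] = yi yj xi.xj
qp <- ipop(c = rep(-1, length(y)), H = H,         # minimize -sum(alpha) + (1/2) alpha' H alpha
           A = t(y), b = 0, r = 0,                # equality constraint sum_i alpha_i yi = 0
           l = rep(0, length(y)),                 # box constraints 0 <= alpha_i <= C
           u = rep(C, length(y)))
alpha <- primal(qp)
w <- colSums(alpha * y * x)                       # w = sum_i alpha_i yi xi
sv <- which(alpha > 1e-6 & alpha < C - 1e-6)      # unbounded support vectors (0 < alpha < C)
b <- mean(y[sv] - x[sv, , drop = FALSE] %*% w)    # from yi (w.xi + b) = 1 at those points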

slide-51
SLIDE 51

Interpretation: bounded and unbounded support vectors

[Figure: points away from the margin have α = 0, unbounded support vectors have 0 < α < C, bounded support vectors have α = C]

slide-52
SLIDE 52

Primal (for large n) vs dual (for large p) optimization

1

Find (w, b) ∈ Rp+1 which solve:
min_{w,b}  Σ_{i=1}^n ℓhinge(w.xi + b, yi) + λ ||w||² .

2

Find α∗ ∈ Rn which maximizes
L(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi.xj ,
under the constraints:
0 ≤ αi ≤ C , for i = 1, . . . , n ,
Σ_{i=1}^n αi yi = 0 .

slide-53
SLIDE 53

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Kernels for strings and graphs

slide-54
SLIDE 54

Sometimes linear methods are not interesting

slide-55
SLIDE 55

Solution: nonlinear mapping to a feature space

[Figure: points separated by a circle of radius R in the (x1, x2) plane become linearly separable after the mapping (x1, x2) → (x1², x2²)]

For x = (x1, x2)⊤, let Φ(x) = (x1², x2²)⊤. The decision function is:
f(x) = x1² + x2² − R² = (1, 1) Φ(x) − R² = β⊤Φ(x) + b .
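
A short sketch (not in the original slides) of this idea in R: points separated by a circle in the input space become linearly separable after the explicit map Φ(x) = (x1², x2²); the radius and the data are invented for the illustration.

set.seed(3)
x <- matrix(runif(200, -2, 2), ncol = 2)        # 100 toy points in the square [-2, 2]^2
y <- ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1)   # class = outside (+1) / inside (-1) a circle
phi <- cbind(x[, 1]^2, x[, 2]^2)                # explicit feature map Phi(x) = (x1^2, x2^2)
f <- rowSums(phi) - 1.5                         # linear decision function in the feature space
all(sign(f) == y)                               # TRUE: the circle is a hyperplane after Phi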
slide-56
SLIDE 56

Kernel = inner product in the feature space

Definition

For a given mapping Φ : X → H from the space of objects X to some Hilbert space of features H, the kernel between two objects x and x′ is the inner product of their images in the feature space:
∀x, x′ ∈ X ,  K(x, x′) = Φ(x)⊤Φ(x′) .

[Figure: the map φ from the input space X to the feature space F]

slide-57
SLIDE 57

Example

Let X = H = R² and, for x = (x1, x2)⊤, let Φ(x) = (x1², x2²)⊤. Then:
K(x, x′) = Φ(x)⊤Φ(x′) = (x1)²(x′1)² + (x2)²(x′2)² .

slide-58
SLIDE 58

The kernel tricks


2 tricks

1

Many linear algorithms (in particular linear SVM) can be performed in the feature space of Φ(x) without explicitly computing the images Φ(x), but instead by computing kernels K(x, x′).

2

It is sometimes possible to easily compute kernels which correspond to complex large-dimensional feature spaces: K(x, x′) is often much simpler to compute than Φ(x) and Φ(x′)

slide-59
SLIDE 59

Trick 1 : SVM in the original space

Train the SVM by maximizing
max_{α∈Rn}  Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi⊤xj ,
under the constraints:
0 ≤ αi ≤ C , for i = 1, . . . , n ,
Σ_{i=1}^n αi yi = 0 .

Predict with the decision function
f(x) = Σ_{i=1}^n αi yi xi⊤x + b∗ .

slide-60
SLIDE 60

Trick 1 : SVM in the feature space

Train the SVM by maximizing
max_{α∈Rn}  Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj Φ(xi)⊤Φ(xj) ,
under the constraints:
0 ≤ αi ≤ C , for i = 1, . . . , n ,
Σ_{i=1}^n αi yi = 0 .

Predict with the decision function
f(x) = Σ_{i=1}^n αi yi Φ(xi)⊤Φ(x) + b∗ .

slide-61
SLIDE 61

Trick 1 : SVM in the feature space with a kernel

Train the SVM by maximizing
max_{α∈Rn}  Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj K(xi, xj) ,
under the constraints:
0 ≤ αi ≤ C , for i = 1, . . . , n ,
Σ_{i=1}^n αi yi = 0 .

Predict with the decision function
f(x) = Σ_{i=1}^n αi yi K(xi, x) + b∗ .

slide-62
SLIDE 62

Trick 2 illustration: polynomial kernel

For x = (x1, x2)⊤ ∈ R², let Φ(x) = (x1², √2 x1x2, x2²) ∈ R³:
K(x, x′) = x1²x′1² + 2 x1x2x′1x′2 + x2²x′2²
         = (x1x′1 + x2x′2)²
         = (x⊤x′)² .
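
A two-line numerical check in R (not from the slides) that the explicit map and the kernel agree; the two test vectors are arbitrary.

phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)   # explicit degree-2 feature map
x <- c(1, 2); xp <- c(3, -1)                                  # arbitrary test vectors
c(sum(phi(x) * phi(xp)), sum(x * xp)^2)                       # both equal 1 here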

slide-63
SLIDE 63

Trick 2 illustration: polynomial kernel

More generally, for x, x′ ∈ Rp,
K(x, x′) = (x⊤x′ + 1)^d
is an inner product in a feature space of all monomials of degree up to d (left as an exercise).

slide-64
SLIDE 64

Combining tricks: learn a polynomial discrimination rule with SVM

Train the SVM by maximizing
max_{α∈Rn}  Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj (xi⊤xj + 1)^d ,
under the constraints:
0 ≤ αi ≤ C , for i = 1, . . . , n ,
Σ_{i=1}^n αi yi = 0 .

Predict with the decision function
f(x) = Σ_{i=1}^n αi yi (xi⊤x + 1)^d + b∗ .

slide-65
SLIDE 65

Illustration: toy nonlinear problem

> plot(x,col=ifelse(y>0,1,2),pch=ifelse(y>0,1,2))

[Figure: scatter plot of the toy training data in the (x1, x2) plane, with the two classes shown by different symbols]

slide-66
SLIDE 66

Illustration: toy nonlinear problem, linear SVM

> library(kernlab)
> svp <- ksvm(x,y,type="C-svc",kernel='vanilladot')
> plot(svp,data=x)

[Figure: SVM classification plot showing the linear decision boundary of the vanilladot (linear) kernel in the (x1, x2) plane]

slide-67
SLIDE 67

Illustration: toy nonlinear problem, polynomial SVM

> svp <- ksvm(x,y,type="C-svc",kernel=polydot(degree=2))
> plot(svp,data=x)

[Figure: SVM classification plot showing the nonlinear decision boundary of the degree-2 polynomial kernel in the (x1, x2) plane]

slide-68
SLIDE 68

Which functions K(x, x′) are kernels?

Definition

A function K(x, x′) defined on a set X is a kernel if and only if there exists a feature space (Hilbert space) H and a mapping
Φ : X → H ,
such that, for any x, x′ in X:
K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H .

slide-69
SLIDE 69

Positive Definite (p.d.) functions

Definition

A positive definite (p.d.) function on the set X is a function K : X × X → R that is symmetric:
∀ (x, x′) ∈ X² ,  K(x, x′) = K(x′, x) ,
and which satisfies, for all N ∈ N, (x1, x2, . . . , xN) ∈ X^N and (a1, a2, . . . , aN) ∈ R^N:
Σ_{i=1}^N Σ_{j=1}^N ai aj K(xi, xj) ≥ 0 .
slide-70
SLIDE 70

Kernels are p.d. functions

Theorem (Aronszajn, 1950)

K is a kernel if and only if it is a positive definite function.


slide-71
SLIDE 71

Proof?

Kernel ⇒ p.d. function:
⟨Φ(x), Φ(x′)⟩ = ⟨Φ(x′), Φ(x)⟩ ,
Σ_{i=1}^N Σ_{j=1}^N ai aj ⟨Φ(xi), Φ(xj)⟩ = || Σ_{i=1}^N ai Φ(xi) ||² ≥ 0 .

P.d. function ⇒ kernel: more difficult...

slide-72
SLIDE 72

Example: SVM with a Gaussian kernel

Training: maximize over α ∈ Rn
Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj exp(−||xi − xj||² / 2σ²)
s.t. 0 ≤ αi ≤ C, and Σ_{i=1}^n αi yi = 0.
Prediction:
f(x) = Σ_{i=1}^n αi yi exp(−||x − xi||² / 2σ²) + b∗ .
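
With kernlab this training and prediction is again a one-liner, in the same spirit as the earlier toy examples (the values of sigma and C below are illustrative):

> svp <- ksvm(x,y,type="C-svc",kernel="rbfdot",kpar=list(sigma=1),C=1)
> plot(svp,data=x)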

slide-73
SLIDE 73

Example: SVM with a Gaussian kernel

f(x) = Σ_{i=1}^n αi yi exp(−||x − xi||² / 2σ²) + b∗

[Figure: SVM classification plot with the Gaussian kernel: a nonlinear decision boundary in the (x1, x2) plane]
slide-74
SLIDE 74

Linear vs nonlinear SVM

slide-75
SLIDE 75

Regularity vs data fitting trade-off

slide-76
SLIDE 76

C controls the trade-off

min

f

  • 1

margin(f) + C × errors(f)

slide-77
SLIDE 77

Why it is important to control the trade-off

slide-78
SLIDE 78

How to choose C in practice

Split your dataset in two ("train" and "test").
Train SVMs with different values of C on the "train" set.
Compute the accuracy of each SVM on the "test" set.
Choose the C which minimizes the "test" error.
(You may repeat this several times = cross-validation.)
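
A possible kernlab sketch of this procedure (not from the slides), using the cross argument of ksvm to get a 5-fold cross-validation error for each candidate C; the grid of values is illustrative.

Cs <- 2^seq(-5, 10)                               # candidate values of C
cv_err <- sapply(Cs, function(C)
  cross(ksvm(x, y, type = "C-svc", kernel = "rbfdot", C = C, cross = 5)))
best_C <- Cs[which.min(cv_err)]                   # C with the smallest cross-validation error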

slide-79
SLIDE 79

SVM summary

Large margin
Linear or nonlinear (with the kernel trick)
Control of the regularization / data fitting trade-off with C

slide-80
SLIDE 80

Outline

1

Motivations

2

Linear SVM

3

Nonlinear SVM and kernels

4

Kernels for strings and graphs

slide-81
SLIDE 81

Supervised sequence classification

Data (training)

Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP...
...

Goal

Build a classifier to predict whether new proteins are secreted or not.

slide-82
SLIDE 82

String kernels

The idea

Map each string x ∈ X to a vector Φ(x) ∈ F. Train a classifier for vectors on the images Φ(x1), . . . , Φ(xn) of the training set (nearest neighbor, linear perceptron, logistic regression, support vector machine...)

[Figure: protein sequences (maskat..., marssl..., malhtv..., mappsv..., mahtlg..., msises...) mapped by φ from the space of strings X into the feature space F]

slide-83
SLIDE 83

Example: substring indexation

The approach

Index the feature space by fixed-length strings, i.e., Φ(x) = (Φu(x))_{u∈A^k}, where Φu(x) can be:
the number of occurrences of u in x (without gaps): spectrum kernel (Leslie et al., 2002);
the number of occurrences of u in x up to m mismatches (without gaps): mismatch kernel (Leslie et al., 2004);
the number of occurrences of u in x allowing gaps, with a weight decaying exponentially with the number of gaps: substring kernel (Lodhi et al., 2002).

slide-84
SLIDE 84

Spectrum kernel (1/2)

Kernel definition

The 3-spectrum of x = CGGSLIAMMWFGV is:
(CGG, GGS, GSL, SLI, LIA, IAM, AMM, MMW, MWF, WFG, FGV) .
Let Φu(x) denote the number of occurrences of u in x. The k-spectrum kernel is:
K(x, x′) := Σ_{u∈A^k} Φu(x) Φu(x′) .
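
A minimal R sketch of the k-spectrum kernel (not part of the slides): count the k-mers of each string with substring() and sum the products of the shared counts. kernlab's stringdot(type = "spectrum") provides an optimized (normalized by default) implementation of the same idea.

spectrum_counts <- function(s, k) {
  n <- nchar(s)
  table(substring(s, 1:(n - k + 1), k:n))       # counts of every k-mer occurring in s
}
spectrum_kernel <- function(s1, s2, k = 3) {
  p1 <- spectrum_counts(s1, k); p2 <- spectrum_counts(s2, k)
  shared <- intersect(names(p1), names(p2))     # only shared k-mers contribute
  sum(as.numeric(p1[shared]) * as.numeric(p2[shared]))
}
spectrum_kernel("CGGSLIAMMWFGV", "CLIVMMNRLMWFGV")   # 3 shared 3-mers: MWF, WFG, FGV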

slide-85
SLIDE 85

Spectrum kernel (2/2)

Implementation

The computation of the kernel is formally a sum over |A|^k terms, but at most |x| − k + 1 terms are non-zero in Φ(x) ⇒ computation in O(|x| + |x′|) with pre-indexation of the strings. Fast classification of a sequence x in O(|x|):
f(x) = w · Φ(x) = Σ_u wu Φu(x) = Σ_{i=1}^{|x|−k+1} w_{xi...xi+k−1} .

Remarks

Works with any string (natural language, time series...).
Fast and scalable, a good default method for string classification.
Variants allow matching of k-mers up to m mismatches.

slide-86
SLIDE 86

Local alignment kernel (Saigo et al., 2004)

CGGSLIAMM----WFGV
|...|||||....||||
C---LIVMMNRLMWFGV

sS,g(π) = S(C, C) + S(L, L) + S(I, I) + S(A, V) + 2S(M, M) + S(W, W) + S(F, F) + S(G, G) + S(V, V) − g(3) − g(4)

SWS,g(x, y) := max_{π∈Π(x,y)} sS,g(π) is not a kernel.

K(β)LA(x, y) = Σ_{π∈Π(x,y)} exp(β sS,g(x, y, π)) is a kernel.
slide-87
SLIDE 87

LA kernel is p.d.: proof (1/2)

Definition: Convolution kernel (Haussler, 1999)

Let K1 and K2 be two p.d. kernels for strings. The convolution of K1 and K2, denoted K1 ⋆ K2, is defined for any x, y ∈ X by:
K1 ⋆ K2(x, y) := Σ_{x1x2=x, y1y2=y} K1(x1, y1) K2(x2, y2) .

Lemma

If K1 and K2 are p.d. then K1 ⋆ K2 is p.d.

slide-88
SLIDE 88

LA kernel is p.d.: proof (2/2)

K(β)LA = Σ_{n=0}^∞ K0 ⋆ (K(β)a ⋆ K(β)g)^(n−1) ⋆ K(β)a ⋆ K0 , with:

The constant kernel: K0(x, y) := 1 .
A kernel for letters:
K(β)a(x, y) := 0 if |x| ≠ 1 or |y| ≠ 1 , exp(βS(x, y)) otherwise .
A kernel for gaps:
K(β)g(x, y) := exp[β (g(|x|) + g(|y|))] .

slide-89
SLIDE 89

The choice of kernel matters

[Figure: number of SCOP superfamilies (y-axis) achieving at least a given ROC50 score (x-axis) for SVM-LA, SVM-pairwise, SVM-Mismatch and SVM-Fisher]

Performance on the SCOP superfamily recognition benchmark (from Saigo et al., 2004).

slide-90
SLIDE 90

Virtual screening for drug discovery

[Figure: candidate molecules labeled active or inactive]

NCI AIDS screen results (from http://cactus.nci.nih.gov).

slide-91
SLIDE 91

Image retrieval and classification

From Harchaoui and Bach (2007).

slide-92
SLIDE 92

Graph kernels

1

Represent each graph x by a vector Φ(x) ∈ H, either explicitly or implicitly through the kernel K(x, x′) = Φ(x)⊤Φ(x′) .

2

Use a linear method for classification in H.

[Figure: the space X of graphs]

slide-93
SLIDE 93

Graph kernels

1

Represent each graph x by a vector Φ(x) ∈ H, either explicitly or implicitly through the kernel K(x, x′) = Φ(x)⊤Φ(x′) .

2

Use a linear method for classification in H.

[Figure: graphs in X mapped by φ into the vector space H]

slide-94
SLIDE 94

Graph kernels

1

Represent each graph x by a vector Φ(x) ∈ H, either explicitly or implicitly through the kernel K(x, x′) = Φ(x)⊤Φ(x′) .

2

Use a linear method for classification in H.

[Figure: graphs in X mapped by φ into the vector space H]

slide-95
SLIDE 95

Indexing by all subgraphs?

Theorem

Computing all subgraph occurrences is NP-hard.

Proof.

The linear graph of size n is a subgraph of a graph X with n vertices iff X has a Hamiltonian path. The decision problem of whether a graph has a Hamiltonian path is NP-complete.

slide-96
SLIDE 96

Indexing by all subgraphs?

Theorem

Computing all subgraph occurrences is NP-hard.

Proof.

The linear graph of size n is a subgraph of a graph X with n vertices iff X has a Hamiltonian path. The decision problem of whether a graph has a Hamiltonian path is NP-complete.

slide-97
SLIDE 97

Indexing by all subgraphs?

Theorem

Computing all subgraph occurrences is NP-hard.

Proof.

The linear graph of size n is a subgraph of a graph X with n vertices iff X has a Hamiltonian path. The decision problem of whether a graph has a Hamiltonian path is NP-complete.

slide-98
SLIDE 98

Indexing by specific subgraphs

Substructure selection

We can imagine more limited sets of substructures that lead to more computationally efficient indexing (non-exhaustive list):
substructures selected by domain knowledge (MDL fingerprint);
all paths up to length k (Openeye fingerprint, Nicholls 2005);
all shortest paths (Borgwardt and Kriegel, 2005);
all subgraphs up to k vertices (graphlet kernel, Shervashidze et al., 2009);
all frequent subgraphs in the database (Helma et al., 2004).

slide-99
SLIDE 99

Example : Indexing by all shortest paths

[Figure: a labeled graph and its shortest-path feature vector (0, . . . , 0, 2, 0, . . . , 0, 1, 0, . . .)]

Properties (Borgwardt and Kriegel, 2005)

There are O(n²) shortest paths. The vector of counts can be computed in O(n⁴) with the Floyd-Warshall algorithm.
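
A compact R sketch of the Floyd-Warshall step (not from the slides): it turns an adjacency matrix into the matrix of shortest-path lengths, whose entries can then be counted into the feature vector above.

floyd_warshall <- function(A) {
  n <- nrow(A)
  D <- ifelse(A > 0, 1, Inf)                   # edge -> distance 1, no edge -> Inf
  diag(D) <- 0
  for (k in 1:n) for (i in 1:n) for (j in 1:n)
    D[i, j] <- min(D[i, j], D[i, k] + D[k, j])
  D                                            # D[i, j] = length of a shortest path from i to j
}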

slide-100
SLIDE 100

Example : Indexing by all shortest paths

[Figure: a labeled graph and its shortest-path feature vector (0, . . . , 0, 2, 0, . . . , 0, 1, 0, . . .)]

Properties (Borgwardt and Kriegel, 2005)

There are O(n²) shortest paths. The vector of counts can be computed in O(n⁴) with the Floyd-Warshall algorithm.

slide-101
SLIDE 101

Example : Indexing by all subgraphs up to k vertices

Properties (Shervashidze et al., 2009)

Naive enumeration scales as O(n^k). Enumeration of connected graphlets in O(n d^(k−1)) for graphs with degree ≤ d and k ≤ 5. Randomly sample subgraphs if enumeration is infeasible.

slide-102
SLIDE 102

Example : Indexing by all subgraphs up to k vertices

Properties (Shervashidze et al., 2009)

Naive enumeration scales as O(n^k). Enumeration of connected graphlets in O(n d^(k−1)) for graphs with degree ≤ d and k ≤ 5. Randomly sample subgraphs if enumeration is infeasible.

slide-103
SLIDE 103

Walks

Definition

A walk of a graph (V, E) is a sequence of vertices v1, . . . , vn ∈ V such that (vi, vi+1) ∈ E for i = 1, . . . , n − 1. We denote by Wn(G) the set of walks with n vertices of the graph G, and by W(G) the set of all walks.

etc...

slide-104
SLIDE 104

Walks ≠ paths

slide-105
SLIDE 105

Walk kernel

Definition

Let Sn denote the set of all possible label sequences of walks of length n (including vertex and edge labels), and S = ∪_{n≥1} Sn. For any graph G let a weight λG(w) be associated to each walk w ∈ W(G). Let the feature vector Φ(G) = (Φs(G))_{s∈S} be defined by:
Φs(G) = Σ_{w∈W(G)} λG(w) 1(s is the label sequence of w) .
A walk kernel is a graph kernel defined by:
Kwalk(G1, G2) = Σ_{s∈S} Φs(G1) Φs(G2) .

slide-106
SLIDE 106

Walk kernel

Definition

Let Sn denote the set of all possible label sequences of walks of length n (including vertex and edge labels), and S = ∪_{n≥1} Sn. For any graph G let a weight λG(w) be associated to each walk w ∈ W(G). Let the feature vector Φ(G) = (Φs(G))_{s∈S} be defined by:
Φs(G) = Σ_{w∈W(G)} λG(w) 1(s is the label sequence of w) .
A walk kernel is a graph kernel defined by:
Kwalk(G1, G2) = Σ_{s∈S} Φs(G1) Φs(G2) .

slide-107
SLIDE 107

Walk kernel examples

The nth-order walk kernel is the walk kernel with λG(w) = 1 if the length of w is n, 0 otherwise. It compares two graphs through their common walks of length n.
The random walk kernel is obtained with λG(w) = PG(w), where PG is a Markov random walk on G. In that case we have: K(G1, G2) = P(label(W1) = label(W2)), where W1 and W2 are two independent random walks on G1 and G2, respectively (Kashima et al., 2003).
The geometric walk kernel is obtained (when it converges) with λG(w) = β^length(w), for β > 0. In that case the feature space is of infinite dimension (Gärtner et al., 2003).

slide-108
SLIDE 108

Walk kernel examples

The nth-order walk kernel is the walk kernel with λG(w) = 1 if the length of w is n, 0 otherwise. It compares two graphs through their common walks of length n.
The random walk kernel is obtained with λG(w) = PG(w), where PG is a Markov random walk on G. In that case we have: K(G1, G2) = P(label(W1) = label(W2)), where W1 and W2 are two independent random walks on G1 and G2, respectively (Kashima et al., 2003).
The geometric walk kernel is obtained (when it converges) with λG(w) = β^length(w), for β > 0. In that case the feature space is of infinite dimension (Gärtner et al., 2003).

slide-109
SLIDE 109

Walk kernel examples

The nth-order walk kernel is the walk kernel with λG(w) = 1 if the length of w is n, 0 otherwise. It compares two graphs through their common walks of length n.
The random walk kernel is obtained with λG(w) = PG(w), where PG is a Markov random walk on G. In that case we have: K(G1, G2) = P(label(W1) = label(W2)), where W1 and W2 are two independent random walks on G1 and G2, respectively (Kashima et al., 2003).
The geometric walk kernel is obtained (when it converges) with λG(w) = β^length(w), for β > 0. In that case the feature space is of infinite dimension (Gärtner et al., 2003).

slide-110
SLIDE 110

Computation of walk kernels

Proposition

These three kernels (nth-order, random and geometric walk kernels) can be computed efficiently in polynomial time.

slide-111
SLIDE 111

Product graph

Definition

Let G1 = (V1, E1) and G2 = (V2, E2) be two graphs with labeled vertices. The product graph G = G1 × G2 is the graph G = (V, E) with:

1

V = {(v1, v2) ∈ V1 × V2 : v1 and v2 have the same label} ,

2

E = {((v1, v2), (v′1, v′2)) ∈ V × V : (v1, v′1) ∈ E1 and (v2, v′2) ∈ E2} .

[Figure: two labeled graphs G1 and G2 and their product graph G1 × G2, whose vertices are the label-matched pairs (e.g. 1b, 2a, 3c, 4e)]
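
The construction is easy to sketch in R (an illustrative implementation, not from the slides): vertices of the product graph are the label-matched pairs, and two pairs are connected iff there is an edge in both factor graphs.

product_graph <- function(A1, labels1, A2, labels2) {
  V <- which(outer(labels1, labels2, "=="), arr.ind = TRUE)   # vertex pairs with equal labels
  n <- nrow(V)
  A <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n))
    A[i, j] <- A1[V[i, 1], V[j, 1]] * A2[V[i, 2], V[j, 2]]    # edge iff an edge in both graphs
  A                                                           # adjacency matrix of G1 x G2
}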

slide-112
SLIDE 112

Walk kernel and product graph

Lemma

There is a bijection between:

1

The pairs of walks w1 ∈ Wn(G1) and w2 ∈ Wn(G2) with the same label sequences,

2

The walks on the product graph w ∈ Wn(G1 × G2).

Corollary

Kwalk(G1, G2) = Σ_{s∈S} Φs(G1) Φs(G2)
             = Σ_{(w1,w2)∈W(G1)×W(G2)} λG1(w1) λG2(w2) 1(l(w1) = l(w2))
             = Σ_{w∈W(G1×G2)} λG1×G2(w) .

slide-113
SLIDE 113

Walk kernel and product graph

Lemma

There is a bijection between:

1

The pairs of walks w1 ∈ Wn(G1) and w2 ∈ Wn(G2) with the same label sequences,

2

The walks on the product graph w ∈ Wn(G1 × G2).

Corollary

Kwalk(G1, G2) = Σ_{s∈S} Φs(G1) Φs(G2)
             = Σ_{(w1,w2)∈W(G1)×W(G2)} λG1(w1) λG2(w2) 1(l(w1) = l(w2))
             = Σ_{w∈W(G1×G2)} λG1×G2(w) .

slide-114
SLIDE 114

Computation of the nth-order walk kernel

For the nth-order walk kernel we have λG1×G2(w) = 1 if the length of w is n, 0 otherwise. Therefore:
Knth-order(G1, G2) = Σ_{w∈Wn(G1×G2)} 1 .
Let A be the adjacency matrix of G1 × G2. Then we get:
Knth-order(G1, G2) = Σ_{i,j} [A^n]_{i,j} = 1⊤A^n 1 .
Computation in O(n |G1| |G2| d1 d2), where di is the maximum degree of Gi.
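
Combined with the product-graph sketch above, the nth-order walk kernel reduces to summing the entries of A^n (a small illustrative helper, not from the slides).

nth_order_walk_kernel <- function(A, n) {
  An <- diag(nrow(A))                        # A^0 = identity
  for (step in seq_len(n)) An <- An %*% A    # A^n by repeated multiplication
  sum(An)                                    # 1' A^n 1 = number of length-n walks in G1 x G2
}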

slide-115
SLIDE 115

Computation of random and geometric walk kernels

In both cases λG(w) for a walk w = v1 . . . vn can be decomposed as:
λG(v1 . . . vn) = λi(v1) Π_{i=2}^n λt(vi−1, vi) .
Let Λi be the vector of λi(v) and Λt be the matrix of λt(v, v′):
Kwalk(G1, G2) = Σ_{n=1}^∞ Σ_{w∈Wn(G1×G2)} λi(v1) Π_{i=2}^n λt(vi−1, vi)
             = Σ_{n=0}^∞ Λi Λt^n 1
             = Λi (I − Λt)^{−1} 1 .
Computation in O(|G1|³ |G2|³).
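
A sketch of the closed-form computation for the geometric walk kernel (not from the slides), under the simple convention that a walk with ℓ edges in the product graph gets weight β^ℓ, i.e. λi = 1 and λt = β per edge; it converges when β times the spectral radius of A is below 1.

geometric_walk_kernel <- function(A, beta) {
  n <- nrow(A)                                               # A: adjacency matrix of G1 x G2
  stopifnot(beta * max(abs(eigen(A, only.values = TRUE)$values)) < 1)   # convergence check
  sum(solve(diag(n) - beta * A, rep(1, n)))                  # 1' (I - beta*A)^{-1} 1
}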

slide-116
SLIDE 116

Extension: branching walks (Ramon and Gärtner, 2003; Mahé and Vert, 2009)

[Figure: a molecular graph (C, N, O atoms) and the tree patterns rooted at one of its vertices]

T(v, n + 1) = Σ_{R⊂N(v)} Π_{v′∈R} λt(v, v′) T(v′, n) ,

slide-117
SLIDE 117

2D Subtree vs walk kernels

[Figure: AUC (roughly 70-80%) of the walk and subtree kernels on each of the 60 NCI cancer cell lines]

Screening of inhibitors for 60 cancer cell lines.

slide-118
SLIDE 118

Image classification (Harchaoui and Bach, 2007)

COREL14 dataset

1400 natural images in 14 classes. Comparison of a kernel between histograms (H), walk kernel (W), subtree kernel (TW), weighted subtree kernel (wTW), and a combination (M).

[Figure: test error (roughly 0.05-0.12) of the H, W, TW, wTW and M kernels on Corel14]

Performance comparison on Corel14

slide-119
SLIDE 119

References

N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337-404, 1950. URL http://www.jstor.org/stable/1990404.

K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74-81, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2278-5. doi: 10.1109/ICDM.2005.132.

Z. Harchaoui and F. Bach. Image classification with segmentation graph kernels. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), pages 1-8. IEEE Computer Society, 2007. doi: 10.1109/CVPR.2007.383049. URL http://dx.doi.org/10.1109/CVPR.2007.383049.

D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, 1999.

C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44(4):1402-11, 2004. doi: 10.1021/ci034254q. URL http://dx.doi.org/10.1021/ci034254q.

C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res., 5:1435-1455, 2004.

C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: a string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauerdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 564-575, Singapore, 2002. World Scientific.
slide-120
SLIDE 120

References (cont.)

H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. J. Mach. Learn. Res., 2:419-444, 2002. URL http://www.ai.mit.edu/projects/jmlr/papers/volume2/lodhi02a/abstract.html.

P. Mahé and J.-P. Vert. Graph kernels based on tree patterns for molecules. Mach. Learn., 75(1):3-35, 2009. doi: 10.1007/s10994-008-5086-2. URL http://dx.doi.org/10.1007/s10994-008-5086-2.

A. Nicholls. OEChem, version 1.3.4, OpenEye Scientific Software. Website, 2005.

J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In T. Washio and L. De Raedt, editors, Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences, pages 65-74, 2003.

H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682-1689, 2004. URL http://bioinformatics.oupjournals.org/cgi/content/abstract/20/11/1682.

N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In 12th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 488-495, Clearwater Beach, Florida, USA, 2009. Society for Artificial Intelligence and Statistics.