

slide-1
SLIDE 1

An Introduction to Kernel Methods for Classification, Regression and Structured Data

Gunnar Rätsch∗, Computational Biology Center, Sloan-Kettering Institute, New York City

∗ previous versions together with Sören Sonnenburg & Cheng Soon Ong
slide-2
SLIDE 2

Tutorial Outline

1 Introduction to Machine Learning

Classification, Regression, and Structured Prediction
Complexity and Model Selection

2 Kernels and basic kernel methods

Large Margin Separation
Non-linear Separation with Kernels

3 Kernels for Structured Data

Substring Kernels for Biological Sequences
Kernels for Graphs & Images

4 Useful Extensions of SVMs

Heterogeneous Data & Multiple Kernel Learning
Understanding the Learned SVM Classifier

5 Structured Output Learning

HMMs & Label Sequence Learning
Semi-Markov Extensions

6 Case Studies (Applications)

Transcription Start Site Prediction and Gene Finding
Tiling Array Analysis and Short Read Alignments


slide-3
SLIDE 3

Material

Supporting Material is available online

Slides, tutorial, example scripts, software, toy datasets, and links: http://bioweb.me/MLSSKernels2012

With contributions from Sören Sonnenburg, Cheng Soon Ong, Bernhard Schölkopf, Petra Philipps, Klaus-Robert Müller, Peter Gehler, Karsten Borgwardt, Christian Widmer, Philipp Drewe and others.


slide-4
SLIDE 4

Part I Introduction to Machine Learning


slide-5
SLIDE 5

Overview: Introduction to Machine Learning

1 Example: Sequence Classification

Running Example

2 Empirical Inference

Learning from Examples
Loss Functions
Measuring Complexity

3 Digestion

Putting Things Together


slide-6
SLIDE 6

Why machine learning?

A lot of data
Data is noisy
No clear biological theory
Large number of features
Complex relationships
Let the data do the talking!


slide-7
SLIDE 7

Running Example: Splicing

Almost all donor splice sites exhibit GU.
Almost all acceptor splice sites exhibit AG.
Not all GUs and AGs are used as splice sites.



slide-9
SLIDE 9

Classification of Sequences

Example: Recognition of splice sites
Every 'AG' is a potential acceptor splice site
The computer has to learn what splice sites look like,

given some known genes/splice sites . . .

Prediction on unknown DNA


slide-10
SLIDE 10

From Sequences to Features

Many algorithms depend on numerical representations.

Each example is a vector of values (features).

Use background knowledge to design good features.

[Sequence diagram: candidate acceptor site with intron/exon boundary]

Example feature table (columns x1 ... x8 are candidate sites; the motif rows are binary indicators whose exact column placement is not recoverable from the extracted slide):
             x1   x2   x3   x4   x5   x6   x7   x8   ...
GC before    0.6  0.2  0.4  0.3  0.2  0.4  0.5  0.5  ...
GC after     0.7  0.7  0.3  0.6  0.3  0.4  0.7  0.6  ...
AGAGAAG      1/0 indicators (motif present)          ...
TTTAG        1/0 indicators (motif present)          ...
Label        +1   +1   +1   −1   −1   +1   −1   −1   ...


slide-11
SLIDE 11

Numerical Representation


slide-12
SLIDE 12

Recognition of Splice Sites

Given: Potential acceptor splice sites

intron exon

Goal: Rule that distinguishes true from false ones

[Scatter plot: GC content before 'AG' vs. GC content after 'AG', true splice sites vs. decoy sites]

Exploit that exons have a higher GC content or that certain motifs are located nearby.


slide-13
SLIDE 13

Recognition of Splice Sites

Given: Potential acceptor splice sites

intron exon

Goal: Rule that distinguishes true from false ones

[Scatter plot: GC content before 'AG' vs. GC content after 'AG'; which rule separates them?]

Exploit that exons have a higher GC content or that certain motifs are located nearby.


slide-14
SLIDE 14

Empirical Inference (=Learning from Examples)

The machine utilizes information from training data to predict the outputs associated with a particular test example.

Use training data to “train” the machine. Use trained machine to perform predictions on test data.


slide-15
SLIDE 15

Supervised Learning = Function Estimation

Basic Notion: We want to estimate the relationship between the examples xi and the associated labels yi.
Formally: We want to choose an estimator f : X → Y.
Intuition: We would like a function f which correctly predicts the label y for a given example x.
Question: How do we measure how well we are doing?


slide-16
SLIDE 16

Loss Function

Basic Notion: We characterize the quality of an estimator by a loss function.
Formally: We define a loss function ℓ : Y × Y → R+, evaluated as ℓ(f(xi), yi).
Intuition: For a given label yi and a given prediction f(xi), we want a positive value telling us how much error there is.


slide-17
SLIDE 17

Classification

[Scatter plot: GC content before 'AG' vs. GC content after 'AG'; which rule separates them?]

In binary classification (Y = {−1, +1}), one may use the 0/1 loss function:
  ℓ(f(xi), yi) = 0 if f(xi) = yi, and 1 if f(xi) ≠ yi.


slide-18
SLIDE 18

Regression

In regression (Y = R), one often uses the squared loss function: ℓ(f(xi), yi) = (f(xi) − yi)².


slide-19
SLIDE 19

Expected vs. Empirical Risk

Expected Risk: This is the average loss on unseen examples. We would like it to be as small as possible, but it is hard to compute.
Empirical Risk: We can compute the average loss on the training data. We define the empirical risk to be
  Remp(f, X, Y) = (1/n) ∑_{i=1}^n ℓ(f(xi), yi).
Basic Notion: Instead of minimizing the expected risk, we minimize the empirical risk. This is called empirical risk minimization.
Question: How do we know that the estimator performs well on unseen data?
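To make the loss functions and the empirical risk concrete, here is a minimal sketch in Python (NumPy assumed; the function names are illustrative and not from the tutorial's software):

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """0/1 loss for binary classification with labels in {-1, +1}."""
    return float(y_pred != y_true)

def squared_loss(y_pred, y_true):
    """Squared loss for regression."""
    return (y_pred - y_true) ** 2

def empirical_risk(f, X, Y, loss):
    """R_emp(f, X, Y) = (1/n) * sum_i loss(f(x_i), y_i)."""
    return np.mean([loss(f(x), y) for x, y in zip(X, Y)])

# Toy usage: a trivial classifier that predicts the sign of the first feature.
X = np.array([[0.6, 0.7], [0.2, 0.7], [-0.4, 0.3], [-0.3, 0.6]])
Y = np.array([+1, +1, -1, +1])
f = lambda x: np.sign(x[0])
print(empirical_risk(f, X, Y, zero_one_loss))  # 0.25 on this toy data
```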


slide-20
SLIDE 20

Simple vs. Complex Functions

[Three scatter plots of GC content before vs. after 'AG', each separated by an increasingly complex decision boundary]

Which function is preferable?

[http://www.franciscans.org]

Occam’s razor (a.k.a. Occam’s Law of Parsimony):

(William of Occam, 14th century)

“Entities should not be multiplied beyond necessity”

(“Do not make the hypothesis more complex than necessary”)



slide-22
SLIDE 22

Summary of Empirical Inference

Learn function f : X → Y given N labeled examples (xi, yi) ∈ X × Y.

Three important ingredients:
Model fθ parametrized with some parameters θ ∈ Θ
Loss function ℓ(f(x), y) measuring the "deviation" between the prediction f(x) and the label y
Complexity term P[f] defining model classes with limited complexity (via nested subsets {f | P[f] ≤ p} ⊆ {f | P[f] ≤ p′} for p ≤ p′)

Most algorithms find θ in fθ by minimizing, for a given regularization parameter C:
  θ∗ = argmin_{θ∈Θ}  ∑_{i=1}^N ℓ(fθ(xi), yi)  +  C · P[fθ]
                      (empirical error)          (complexity term)



slide-26
SLIDE 26

Part II Support Vector Machines and Kernels


slide-27
SLIDE 27

Overview: Support Vector Machines and Kernels

4 Margin Maximization

Some Learning Theory
Support Vector Machines for Binary Classification
Convex Optimization

5 Kernels & the “Trick”

Inflating the Feature Space
Kernel “Trick”
Common Kernels
Results for Running Example

6 Beyond 2-Class Classification

Multiple Kernel Learning
Multi-Class Classification & Regression
Semi-Supervised Learning & Transfer Learning

7 Software & Demonstration


slide-28
SLIDE 28

Why maximize the margin?

[Scatter plot: GC content before 'AG' vs. GC content after 'AG' with a maximum-margin hyperplane w]

Intuitively, it feels the safest. For a small error in the separating hyperplane, we do not suffer too many mistakes. Empirically, it works well. VC theory indicates that it is the right thing to do.


slide-29
SLIDE 29

Approximation & Estimation Error

R(fn) − R∗ = [R(fn) − inf_{f∈F} R(f)] + [inf_{f∈F} R(f) − R∗]
              (estimation error)          (approximation error)

where fn ∈ F is the algorithm's choice with n examples and R∗ is the minimal risk.

F large: small approximation error, but overfitting (large estimation error)
F small: large approximation error; better generalization/estimation, but poor overall performance

Model selection: Choose F to get an optimal tradeoff between approximation and estimation error.


slide-30
SLIDE 30

Estimation Error

How large is R(fn) − inf_{f∈F} R(f)?

Uniform differences: R(fn) − inf_{f∈F} R(f) ≤ 2 sup_{f∈F} |Remp(f) − R(f)|

Finite sample results:
F finite: R(fn) − inf_{f∈F} R(f) ≈ √(log|F|) / √n
F infinite: ?


slide-31
SLIDE 31

Special Case: Complexity of Hyperplanes

[Scatter plot: GC content before 'AG' vs. GC content after 'AG' with a separating hyperplane w]

What is the complexity of hyperplane classifiers?


slide-32
SLIDE 32

Special Case: Complexity of Hyperplanes

[Scatter plot: GC content before 'AG' vs. GC content after 'AG' with a separating hyperplane w]

What is the complexity of hyperplane classifiers? Vladimir Vapnik and Alexey Chervonenkis: the Vapnik-Chervonenkis (VC) dimension

[Vapnik and Chervonenkis, 1971; Vapnik, 1995] [http://tinyurl.com/cl8jo9,http://tinyurl.com/d7lmux]


slide-33
SLIDE 33

VC Dimension

A model class shatters a set of data points if it can correctly classify every possible labeling of them. Lines shatter any 3 points in general position in R², but not 4 points.

VC dimension [Vapnik, 1995]

The VC dimension of a model class F is the maximum h such that some set of h data points can be shattered by the model. (E.g., the VC dimension of hyperplanes in R² is 3.)

  R(fn) ≤ Remp(fn) + √[(h log(2N/h) + h − log(η/4)) / N]
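As a small sanity check on the bound above, here is a sketch of my own (not from the slides) that evaluates the VC confidence term for a few sample sizes:

```python
import numpy as np

def vc_confidence(h, N, eta=0.05):
    """Confidence term sqrt((h*log(2N/h) + h - log(eta/4)) / N) from the VC bound."""
    return np.sqrt((h * np.log(2 * N / h) + h - np.log(eta / 4)) / N)

# The bound shrinks as N grows and grows with the VC dimension h.
for N in (100, 1000, 10000):
    print(N, vc_confidence(h=3, N=N), vc_confidence(h=50, N=N))
```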


slide-34
SLIDE 34

Larger Margin ⇒ Less Complex

Large Margin Hyperplanes ⇒ Small VC Dimension
Hyperplane classifiers with large margins have small VC dimension [Vapnik and Chervonenkis, 1971; Vapnik, 1995].

Maximum Margin ⇒ Minimum Complexity
Minimize complexity by maximizing the margin (irrespective of the dimension of the space).

Useful idea: Find the hyperplane that classifies all points correctly while maximizing the margin (= SVMs).


slide-35
SLIDE 35

Large Margin ⇒ low complexity? - Why?


slide-36
SLIDE 36

Canonical Hyperplanes [Vapnik, 1995]

Note: If c ≠ 0, then {x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}. Hence (cw, cb) describes the same hyperplane as (w, b).

Definition: The hyperplane is in canonical form w.r.t. X∗ = {x1, . . . , xr} if
  min_{xi∈X∗} |⟨w, xi⟩ + b| = 1.

Note that for canonical hyperplanes, the distance of the closest point to the hyperplane (the “margin”) is 1/‖w‖:
  min_{xi∈X∗} |⟨w/‖w‖, xi⟩ + b/‖w‖| = 1/‖w‖.


slide-37
SLIDE 37

Theorem 8 [Vapnik, 1979]

Consider hyperplanes ⟨w, x⟩ = 0, where w is normalized such that they are in canonical form w.r.t. a set of points X∗ = {x1, . . . , xr}, i.e.,
  min_{i=1,...,r} |⟨w, xi⟩| = 1.

The set of decision functions fw(x) = sgn(⟨x, w⟩) defined on X∗ and satisfying the constraint ‖w‖ ≤ Λ has a VC dimension satisfying
  h ≤ R²Λ².

Here, R is the radius of the smallest sphere around the origin containing X∗.


slide-38
SLIDE 38

How to Maximize the Margin? I

Consider linear hyperplanes with parameters w, b:
  f(x) = ∑_{j=1}^d wj xj + b = ⟨w, x⟩ + b


slide-39
SLIDE 39

How to Maximize the Margin? II

Margin maximization is equivalent to minimizing ‖w‖. [Schölkopf and Smola, 2002]



slide-41
SLIDE 41

How to Maximize the Margin? III

[Scatter plot: GC content before 'AG' vs. GC content after 'AG' with a maximum-margin hyperplane w and its margin]

Minimize   ½‖w‖² + C ∑_{i=1}^n ξi
subject to yi(⟨w, xi⟩ + b) ≥ 1 − ξi and ξi ≥ 0 for all i = 1, . . . , n.

Examples on the margin are called support vectors [Vapnik, 1995].



slide-43
SLIDE 43

We have to solve an “Optimization Problem”

minimize_{w,b,ξ}   ½‖w‖² + C ∑_{i=1}^n ξi
subject to         yi(⟨w, xi⟩ + b) ≥ 1 − ξi  for all i = 1, . . . , n
                   ξi ≥ 0                    for all i = 1, . . . , n

Quadratic objective function, linear constraints in w and b: a “Quadratic Optimization Problem” (QP) and a “Convex Optimization Problem” (efficient solution possible; every local minimum is a global minimum).

How to solve it?
General-purpose optimization packages (GNU Linear Programming Kit, CPLEX, Mosek, . . . )
Much faster specialized solvers (liblinear, SVM OCAS, Nieme, SGD, . . . )
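As a concrete illustration of solving this QP, here is a minimal sketch using scikit-learn's off-the-shelf soft-margin SVM (using scikit-learn is my assumption here; the tutorial's own examples use easysvm/Shogun):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: GC content before/after 'AG' for true sites (+1) and decoys (-1).
X = np.array([[0.6, 0.7], [0.5, 0.7], [0.4, 0.6], [0.2, 0.3], [0.3, 0.2], [0.4, 0.3]])
y = np.array([+1, +1, +1, -1, -1, -1])

# C is the regularization constant from the objective (1/2)||w||^2 + C * sum(xi_i).
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.coef_, clf.intercept_)    # w and b of the separating hyperplane
print(clf.support_)                 # indices of the support vectors
print(clf.predict([[0.55, 0.65]]))  # predict a new candidate site
```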



slide-45
SLIDE 45

Lagrange Function (e.g., [Bertsekas, 1995])

Introduce Lagrange multipliers αi ≥ 0 and a Lagrangian
  L(w, b, α) = ½‖w‖² − ∑_{i=1}^m αi (yi · (⟨w, xi⟩ + b) − 1).

L has to be minimized w.r.t. the primal variables w and b and maximized w.r.t. the dual variables αi.

If a constraint is violated, then yi · (⟨w, xi⟩ + b) − 1 < 0:
αi will grow to increase L; w and b want to decrease L, i.e. they have to change such that the constraint is satisfied. If the problem is separable, this ensures that αi < ∞.
If yi · (⟨w, xi⟩ + b) − 1 > 0, then αi = 0: otherwise, L could be increased by decreasing αi (KKT conditions).


slide-46
SLIDE 46

Derivation of the Dual Problem

At the extremum we have ∂L(w, b, α)/∂b = 0 and ∂L(w, b, α)/∂w = 0, i.e.
  ∑_{i=1}^m αi yi = 0   and   w = ∑_{i=1}^m αi yi xi.

Substitute both into L to get the dual problem.


slide-47
SLIDE 47

Dual Problem

Dual: maximize
  W(α) = ∑_{i=1}^m αi − ½ ∑_{i,j=1}^m αi αj yi yj ⟨xi, xj⟩
subject to αi ≥ 0, i = 1, . . . , m, and ∑_{i=1}^m αi yi = 0.


slide-48
SLIDE 48

An Important Detail

minimize_{w,b,ξ}   ½‖w‖² + C ∑_{i=1}^n ξi
subject to         yi(⟨w, xi⟩ + b) ≥ 1 − ξi and ξi ≥ 0 for all i = 1, . . . , n

Representer Theorem: The optimal w can be written as a linear combination of the examples (for appropriate α's):
  w = ∑_{i=1}^n αi xi   ⇒ Plug in! Now optimize for the variables α, b, and ξ!

Corollary: The hyperplane only depends on the scalar products of the examples, ⟨x, x̂⟩ = ∑_{d=1}^D xd x̂d. Remember this!



slide-50
SLIDE 50

An Important Detail

minimize_{α,b,ξ}   ½ ‖∑_{i=1}^N αi xi‖² + C ∑_{i=1}^n ξi
subject to         yi(∑_{j=1}^N αj ⟨xj, xi⟩ + b) ≥ 1 − ξi for all i = 1, . . . , n
                   ξi ≥ 0 for all i = 1, . . . , n

Representer Theorem: The optimal w can be written as a linear combination of the examples (for appropriate α's):
  w = ∑_{i=1}^n αi xi   ⇒ Plug in! Now optimize for the variables α, b, and ξ!

Corollary: The hyperplane only depends on the scalar products of the examples, ⟨x, x̂⟩ = ∑_{d=1}^D xd x̂d. Remember this!
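A minimal sketch of the corollary: once the expansion coefficients α and b are known, training and prediction only ever touch the examples through scalar products (the variable names below are illustrative, not from the tutorial scripts):

```python
import numpy as np

def decision_function(alpha, b, X_train, x):
    """f(x) = sum_i alpha_i * <x_i, x> + b -- w is never formed explicitly."""
    return sum(a * np.dot(xi, x) for a, xi in zip(alpha, X_train)) + b

# Equivalent check: forming w = sum_i alpha_i x_i gives the same value.
X_train = np.array([[0.6, 0.7], [0.2, 0.3], [0.5, 0.6]])
alpha = np.array([0.8, -0.8, 0.0])   # illustrative coefficients, not a trained solution
b = -0.1
x_new = np.array([0.4, 0.5])
w = (alpha[:, None] * X_train).sum(axis=0)
print(decision_function(alpha, b, X_train, x_new), np.dot(w, x_new) + b)
```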



slide-53
SLIDE 53

Nonseparable Problems

[Bennett and Mangasarian, 1992; Cortes and Vapnik, 1995]

If yi · (⟨w, xi⟩ + b) ≥ 1 cannot be satisfied, then αi → ∞.

Modify the constraint to yi · (⟨w, xi⟩ + b) ≥ 1 − ξi with ξi ≥ 0 (“soft margin”) and add ∑_{i=1}^m ξi to the objective function.


slide-54
SLIDE 54

Soft Margin SVMs

C-SVM [Cortes and Vapnik, 1995]: for C > 0, minimize
  τ(w, ξ) = ½‖w‖² + C ∑_{i=1}^m ξi
subject to yi · (⟨w, xi⟩ + b) ≥ 1 − ξi, ξi ≥ 0   (margin 1/‖w‖)

ν-SVM [Schölkopf et al., 2000]: for 0 ≤ ν ≤ 1, minimize
  τ(w, ξ, ρ) = ½‖w‖² − νρ + ∑_{i=1}^m ξi
subject to yi · (⟨w, xi⟩ + b) ≥ ρ − ξi, ξi ≥ 0   (margin ρ/‖w‖)


slide-55
SLIDE 55

The ν-Property

SVs: αi > 0. “Margin errors”: ξi > 0.
The KKT conditions imply: all margin errors are SVs; not all SVs need to be margin errors, and those which are not lie exactly on the edge of the margin.

Proposition:
1 fraction of margin errors ≤ ν ≤ fraction of SVs
2 asymptotically, both fractions converge to ν
3 optimal choice: ν = expected classification error


slide-56
SLIDE 56

Connection between ν-SVC and C-SVC

Proposition: If ν-SV classification leads to ρ > 0, then C-SV classification, with C set a priori to 1/ρ, leads to the same decision function.

Proof: Minimize the primal target, then fix ρ, and minimize only over the remaining variables: nothing will change. Hence the obtained solution w0, b0, ξ0 minimizes the primal problem of C-SVC, for C = 1, subject to yi · (⟨w, xi⟩ + b) ≥ ρ − ξi. To recover the constraint yi · (⟨w, xi⟩ + b) ≥ 1 − ξi, rescale to the set of variables w′ = w/ρ, b′ = b/ρ, ξ′ = ξ/ρ. This leaves us, up to a constant scaling factor ρ², with the C-SV target with C = 1/ρ.


slide-57
SLIDE 57

Recognition of Splice Sites

Given: Potential acceptor splice sites

intron exon

Goal: Rule that distinguishes true from false ones Linear Classifiers with large margin


slide-58
SLIDE 58

Recognition of Splice Sites

Given: Potential acceptor splice sites

intron exon

Goal: Rule that distinguishes true from false ones

[Scatter plot: GC content before 'AG' vs. GC content after 'AG'; the two classes overlap]

More realistic problem? Not linearly separable! Need nonlinear separation? Need more features?



slide-60
SLIDE 60

Nonlinear Separations

Linear separation might not be sufficient! ⇒ Map into a higher dimensional feature space.

Example: all second-order monomials
  Φ : R² → R³,  (x1, x2) ↦ (z1, z2, z3) := (x1², √2 x1x2, x2²)

[Figure: points that are not linearly separable in (x1, x2) become linearly separable in (z1, z2, z3)]


slide-61
SLIDE 61

Kernels and Feature Spaces

Preprocess the data with Φ : X → H, x ↦ Φ(x), where H is a dot product space, and learn the mapping from Φ(x) to y [Boser et al., 1992].
Usually, dim(X) ≪ dim(H). “Curse of dimensionality”?
Crucial issue: capacity matters, not dimensionality.

The VC dimension of (large-margin) hyperplanes is (essentially) independent of the dimensionality.


slide-62
SLIDE 62

Kernel “Trick”

Example: x ∈ R² and Φ(x) := (x1², √2 x1x2, x2²)   [Boser et al., 1992]

  ⟨Φ(x), Φ(x̂)⟩ = ⟨(x1², √2 x1x2, x2²), (x̂1², √2 x̂1x̂2, x̂2²)⟩ = ⟨(x1, x2), (x̂1, x̂2)⟩² = ⟨x, x̂⟩² =: k(x, x̂)

The scalar product in feature space (here R³) can be computed in input space (here R²)!
This also works for higher orders and dimensions ⇒ relatively low-dimensional input spaces ⇒ very high-dimensional feature spaces.
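A quick numerical check of this identity (a sketch of my own, not part of the tutorial scripts): the explicit feature map and the squared scalar product give the same number.

```python
import numpy as np

def phi(x):
    """Explicit second-order monomial map R^2 -> R^3."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, xhat):
    """Kernel trick: <Phi(x), Phi(xhat)> = <x, xhat>^2, computed in input space."""
    return np.dot(x, xhat) ** 2

x, xhat = np.array([0.3, -1.2]), np.array([0.8, 0.5])
print(np.dot(phi(x), phi(xhat)))  # feature-space scalar product
print(k(x, xhat))                 # same value, without ever forming Phi
```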


slide-63
SLIDE 63

General Product Feature Space

[Schölkopf et al., 1996]

How about patterns x ∈ R^N and product features of order d? Here, dim(H) grows like N^d. For instance, N = 16 · 16 and d = 5 ⇒ dimension ≈ 10^10.


slide-64
SLIDE 64

The Kernel Trick, N = d = 2

⟨Φ(x), Φ(x′)⟩ = (x1², √2 x1x2, x2²)(x′1², √2 x′1x′2, x′2²)ᵀ = ⟨x, x′⟩² =: k(x, x′)

The dot product in H can be computed in R².


slide-65
SLIDE 65

The Kernel Trick, II

More generally, for x, x′ ∈ R^N and d ∈ N:
  ⟨x, x′⟩^d = (∑_{j=1}^N xj · x′j)^d = ∑_{j1,...,jd=1}^N (xj1 · · · xjd) · (x′j1 · · · x′jd) = ⟨Φ(x), Φ(x′)⟩

where Φ maps into the space spanned by all ordered products of d input directions.


slide-66
SLIDE 66

Mercer’s Theorem

If k is a continuous kernel of a positive definite integral operator on L2(X) (where X is some compact space), i.e.
  ∫_X ∫_X k(x, x′) f(x) f(x′) dx dx′ ≥ 0,
it can be expanded as
  k(x, x′) = ∑_{i=1}^∞ λi ψi(x) ψi(x′)
using eigenfunctions ψi and eigenvalues λi ≥ 0 [Osuna et al., 1996].


slide-67
SLIDE 67

The Mercer Feature Map

In that case,
  Φ(x) := (√λ1 ψ1(x), √λ2 ψ2(x), . . .)
satisfies ⟨Φ(x), Φ(x′)⟩ = k(x, x′).

Proof: ⟨Φ(x), Φ(x′)⟩ = ⟨(√λ1 ψ1(x), . . .), (√λ1 ψ1(x′), . . .)⟩ = ∑_{i=1}^∞ λi ψi(x) ψi(x′) = k(x, x′)


slide-68
SLIDE 68

Positive Definite Kernels

It can be shown that the admissible class of kernels coincides with the class of positive definite (pd) kernels: kernels which are symmetric (i.e., k(x, x′) = k(x′, x)) and, for any set of training points x1, . . . , xm ∈ X and any a1, . . . , am ∈ R, satisfy
  ∑_{i,j} ai aj Kij ≥ 0,   where Kij := k(xi, xj).

K is called the Gram matrix or kernel matrix. If, for pairwise distinct points, ∑_{i,j} ai aj Kij = 0 ⇒ a = 0, the kernel is called strictly positive definite.
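In practice the condition ∑ ai aj Kij ≥ 0 for all a is equivalent to the Gram matrix having no negative eigenvalues, which is easy to check numerically; a small sketch of my own with illustrative names:

```python
import numpy as np

def is_positive_semidefinite(K, tol=1e-10):
    """Check that the (symmetric) Gram matrix K has no eigenvalue below -tol."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.random.randn(20, 5)
K_linear = X @ X.T                          # linear kernel: always psd
K_sigmoid = np.tanh(0.5 * (X @ X.T) + 1.0)  # sigmoid kernel: not psd in general
print(is_positive_semidefinite(K_linear), is_positive_semidefinite(K_sigmoid))
```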


slide-69
SLIDE 69

Elementary Properties of PD Kernels

Kernels from Feature Maps: If Φ maps X into a dot product space H, then ⟨Φ(x), Φ(x′)⟩ is a pd kernel on X × X.
Positivity on the Diagonal: k(x, x) ≥ 0 for all x ∈ X.
Cauchy-Schwarz Inequality: k(x, x′)² ≤ k(x, x) k(x′, x′).
Vanishing Diagonals: k(x, x) = 0 for all x ∈ X ⇒ k(x, x′) = 0 for all x, x′ ∈ X.


slide-70
SLIDE 70

Some Properties of Kernels

[Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004]

If k1, k2, . . . are pd kernels, then so are:
αk1, provided α ≥ 0
k1 + k2
k1 · k2
k(x, x′) := lim_{n→∞} kn(x, x′), provided it exists
k(A, B) := ∑_{x∈A, x′∈B} k1(x, x′), where A, B are finite subsets of X (using the feature map Φ(A) := ∑_{x∈A} Φ(x))

Further operations to construct kernels from kernels: tensor products, direct sums, convolutions [Haussler, 1999b].


slide-71
SLIDE 71

Putting Things Together . . .

Use Φ(x) instead of x; use a linear classifier on the Φ(x)'s.
From the Representer Theorem: w = ∑_{i=1}^n αi Φ(xi).

Nonlinear separation:
  f(x) = ⟨w, Φ(x)⟩ + b = ∑_{i=1}^n αi ⟨Φ(xi), Φ(x)⟩ + b = ∑_{i=1}^n αi k(xi, x) + b

Trick: k(x, x′) = ⟨Φ(x), Φ(x′)⟩, i.e. do not use Φ, but k!

See e.g. [Müller et al., 2001; Schölkopf and Smola, 2002; Vapnik, 1995] for details.


slide-72
SLIDE 72

Kernel ≈ Similarity Measure

Distance: ‖Φ(x) − Φ(x̂)‖² = ‖Φ(x)‖² − 2⟨Φ(x), Φ(x̂)⟩ + ‖Φ(x̂)‖²
Scalar product: ⟨Φ(x), Φ(x̂)⟩
If ‖Φ(x)‖² = ‖Φ(x̂)‖² = 1, then ⟨Φ(x), Φ(x̂)⟩ = 1 − ½‖Φ(x) − Φ(x̂)‖², i.e. scalar product and distance carry the same information.
Angle between vectors: ⟨Φ(x), Φ(x̂)⟩ / (‖Φ(x)‖ ‖Φ(x̂)‖) = cos ∠(Φ(x), Φ(x̂))

Technical detail: kernel functions have to satisfy certain conditions (Mercer's condition).


slide-73
SLIDE 73

How to Construct a Kernel

At least two ways to get to a kernel:
Construct Φ and think about efficient ways to compute the scalar product ⟨Φ(x), Φ(x̂)⟩
Construct a similarity measure (show Mercer's condition) and think about what it means

What can you do if the kernel is not positive definite? (The optimization problem is then not convex!)
Add a constant to the diagonal (cheap)
Exponentiate the kernel matrix (all eigenvalues become positive)
SVM-pairwise: use the similarities as features


slide-74
SLIDE 74

Common Kernels

See e.g. Müller et al. [2001]; Schölkopf and Smola [2002]; Vapnik [1995]

Polynomial           k(x, x̂) = (⟨x, x̂⟩ + c)^d
Sigmoid              k(x, x̂) = tanh(κ⟨x, x̂⟩ + θ)
RBF                  k(x, x̂) = exp(−‖x − x̂‖² / (2σ²))
Convex combinations  k(x, x̂) = β1 k1(x, x̂) + β2 k2(x, x̂)
Normalization        k(x, x̂) = k′(x, x̂) / √(k′(x, x) k′(x̂, x̂))

Notes: Kernels may be combined in case of heterogeneous data. These kernels are good for real-valued examples. Sequences need special care (coming soon!)
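The table above translates directly into code; a minimal sketch (the function names are mine, for illustration only):

```python
import numpy as np

def polynomial_kernel(x, xhat, c=1.0, d=3):
    return (np.dot(x, xhat) + c) ** d

def rbf_kernel(x, xhat, sigma=1.0):
    return np.exp(-np.sum((x - xhat) ** 2) / (2 * sigma ** 2))

def normalized_kernel(k, x, xhat):
    """Normalize any kernel k so that k_norm(x, x) = 1."""
    return k(x, xhat) / np.sqrt(k(x, x) * k(xhat, xhat))

x, xhat = np.array([0.6, 0.7]), np.array([0.2, 0.3])
print(polynomial_kernel(x, xhat), rbf_kernel(x, xhat, sigma=0.5))
print(normalized_kernel(polynomial_kernel, x, xhat))
```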


slide-75
SLIDE 75

Toy Examples

Linear kernel: k(x, x̂) = ⟨x, x̂⟩.   RBF kernel: k(x, x̂) = exp(−‖x − x̂‖² / (2σ²)).


slide-76
SLIDE 76

Kernel Summary

Nonlinear separation ⇔ linear separation of nonlinearly mapped examples
A mapping Φ defines a kernel by k(x, x̂) := ⟨Φ(x), Φ(x̂)⟩
(Mercer) A kernel defines a mapping Φ (nontrivial)
The choice of kernel has to match the data at hand
The RBF kernel often works pretty well


slide-77
SLIDE 77

Evaluation Measures for Classification

The Contingency Table / Confusion Matrix

TP, FP, FN, TN are absolute counts of true positives, false positives, false negatives and true negatives.
N: sample size
N+ = FN + TP: number of positive examples
N− = FP + TN: number of negative examples
O+ = TP + FP: number of positive predictions
O− = FN + TN: number of negative predictions

outputs \ labeling   y = +1   y = −1   Σ
f(x) = +1            TP       FP       O+
f(x) = −1            FN       TN       O−
Σ                    N+       N−       N


slide-78
SLIDE 78

Evaluation Measures for Classification II

Several commonly used performance measures:

Accuracy                        ACC = (TP + TN) / N
Error rate (1 − accuracy)       ERR = (FP + FN) / N
Balanced error rate             BER = ½ (FN/(FN + TP) + FP/(FP + TN))
Weighted relative accuracy      WRACC = TP/(TP + FN) − FP/(FP + TN)
F1 score                        F1 = 2·TP / (2·TP + FP + FN)
Cross-correlation coefficient   CC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Sensitivity / recall            TPR = TP/N+ = TP/(TP + FN)
Specificity                     TNR = TN/N− = TN/(TN + FP)
1 − sensitivity                 FNR = FN/N+ = FN/(FN + TP)
1 − specificity                 FPR = FP/N− = FP/(FP + TN)
P.p.v. / precision              PPV = TP/O+ = TP/(TP + FP)
False discovery rate            FDR = FP/O+ = FP/(FP + TP)
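A compact sketch that computes several of these measures from the four confusion-matrix counts (a helper of my own, shown for illustration):

```python
import numpy as np

def classification_metrics(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    return {
        "ACC": (tp + tn) / n,
        "BER": 0.5 * (fn / (fn + tp) + fp / (fp + tn)),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "TPR": tp / (tp + fn),   # sensitivity / recall
        "TNR": tn / (tn + fp),   # specificity
        "PPV": tp / (tp + fp),   # precision
        "CC": (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

print(classification_metrics(tp=80, fp=20, fn=10, tn=90))
```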


slide-79
SLIDE 79

Evaluation Measures for Classification III

[left] ROC curve, [right] precision-recall curve, comparing the proposed method with FirstEF, Eponine and McPromotor
(Obtained by varying the bias and recording TPR/FPR or PPV/TPR.)

Use a bias-independent scalar evaluation measure:
Area under the ROC curve (auROC)
Area under the precision-recall curve (auPRC)


slide-80
SLIDE 80

Measuring Performance in Practice

What to do in practice: Split the data into training and validation sets; use the error on the validation set as an estimate of the expected error.

A. Cross-validation: Split the data into c disjoint parts; use each subset as validation set and the rest as training set.

B. Random splits: Randomly split the data set into two parts, for example 80% of the data for training and 20% for validation; repeat this many times.

See, for instance, Duda et al. [2001] for more details.
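A minimal cross-validation loop for estimating the expected error (a sketch under the assumption that scikit-learn is available; the tutorial's own demos use easysvm/Galaxy instead):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def cross_validated_error(X, y, C=1.0, folds=5):
    """Average validation error over `folds` disjoint splits."""
    errors = []
    for train_idx, val_idx in KFold(n_splits=folds, shuffle=True, random_state=0).split(X):
        clf = SVC(kernel="linear", C=C).fit(X[train_idx], y[train_idx])
        errors.append(np.mean(clf.predict(X[val_idx]) != y[val_idx]))
    return float(np.mean(errors))

# Toy data: two Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
y = np.array([+1] * 50 + [-1] * 50)
print(cross_validated_error(X, y, C=1.0))
```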


slide-81
SLIDE 81

Model Selection

Do not train on the “test set”! Use a subset of the data for training; from that subset, split further to select the model.
Model selection = find the best parameters:
Regularization parameter C
Other parameters (introduced later)


slide-82
SLIDE 82

GC-Content-based Splice Site Recognition

Kernel              auROC
Linear              88.2%
Polynomial d = 3    91.4%
Polynomial d = 7    90.4%
Gaussian σ = 100    87.9%
Gaussian σ = 1      88.6%
Gaussian σ = 0.01   77.3%

SVM accuracy of acceptor site recognition using polynomial and Gaussian kernels with different degrees d and widths σ. Accuracy is measured using the area under the ROC curve (auROC) and is computed using five-fold cross-validation.


slide-83
SLIDE 83

Demonstration: Recognition of Splice Sites

Given: Potential acceptor splice sites

intron exon

Goal: Rule that distinguishes true from false ones

Task 1: Train a classifier and predict using 5-fold cross-validation. Evaluate the predictions.
Task 2: Determine the best combination of polynomial degree and SVM's C using 5-fold cross-validation.

http://bioweb.me/svmcompbio http://bioweb.me/mlb-galaxy


slide-84
SLIDE 84

Some Extensions

Multiple Kernel Learning
Semi-Supervised Learning
Multi-class classification
Regression
Domain Adaptation and Multi-Task Learning


slide-85
SLIDE 85

Multiple Kernel Learning (MKL)

Data may consist of sequence and structure information.
Possible solution: add the two kernels, k(x, x′) := ksequence(x, x′) + kstructure(x, x′).
Better solution: mix the two kernels, k(x, x′) := (1 − t) ksequence(x, x′) + t kstructure(x, x′), where t is estimated from the training data.
In general: use the data to find the best convex combination,
  k(x, x′) = ∑_{p=1}^K βp kp(x, x′).

Applications: heterogeneous data; improving interpretability (more on this later).
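A minimal sketch of a fixed convex combination of Gram matrices (learning the weights βp is what MKL solvers such as the SILP implementation in Shogun do; here the weights are simply given as an example):

```python
import numpy as np

def combined_kernel(kernel_matrices, betas):
    """k = sum_p beta_p * k_p with beta_p >= 0 and sum_p beta_p = 1."""
    betas = np.asarray(betas, dtype=float)
    assert np.all(betas >= 0) and np.isclose(betas.sum(), 1.0)
    return sum(b * K for b, K in zip(betas, kernel_matrices))

X = np.random.randn(10, 4)
K_linear = X @ X.T
K_rbf = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 2.0)
K = combined_kernel([K_linear, K_rbf], betas=[0.3, 0.7])
print(K.shape)  # (10, 10) Gram matrix usable by any kernel method
```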



slide-89
SLIDE 89

Example: Combining Heterogeneous Data

Consider data from different domains, e.g. DNA strings, binding energies, conservation, structure, . . .
  k(x, x′) = β1 kdna(xdna, x′dna) + β2 knrg(xnrg, x′nrg) + β3 k3d(x3d, x′3d) + · · ·


slide-90
SLIDE 90

MKL Primal Formulation

min    ½ ∑_{j=1}^M βj ‖wj‖² + C ∑_{i=1}^N ξi
w.r.t. w = (w1, . . . , wM), wj ∈ R^{Dj} ∀j = 1 . . . M,  β ∈ R^M₊,  ξ ∈ R^N₊,  b ∈ R
s.t.   yi (∑_{j=1}^M βj wjᵀ Φj(xi) + b) ≥ 1 − ξi,  ∀i = 1, . . . , N
       ∑_{j=1}^M βj = 1

Properties: equivalent to the SVM for M = 1; the solution is sparse in “blocks”; each block j corresponds to one kernel.


slide-91
SLIDE 91

Solving MKL

SDP [Lanckriet et al., 2004], QCQP [Bach et al., 2004], SILP [Sonnenburg et al., 2006a], SimpleMKL [Rakotomamonjy et al., 2008], Extended Level Set Method [Xu et al., 2009], . . .

SILP is implemented in the shogun-toolbox; examples are available.


slide-92
SLIDE 92

Multi-Class Classification

Real problems often have more than 2 classes. Generalize the SVM to multi-class classification, for K > 2. Three approaches [Schölkopf and Smola, 2002]:

One-vs-rest: For each class, label all other classes as “negative” (K binary problems). ⇒ Simple and hard to beat!
One-vs-one: Compare all classes pairwise (½ K(K − 1) binary problems).
Multi-class loss: Define a new empirical risk term.
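The one-vs-rest scheme is easy to write out directly; a minimal sketch (scikit-learn's LinearSVC stands in for any binary large-margin classifier here, which is my assumption, not the tutorial's toolchain):

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_rest_fit(X, y, classes, C=1.0):
    """Train one binary SVM per class (that class vs. all others)."""
    return {c: LinearSVC(C=C).fit(X, np.where(y == c, 1, -1)) for c in classes}

def one_vs_rest_predict(models, X):
    """Predict the class whose binary classifier is most confident."""
    classes = list(models)
    scores = np.column_stack([models[c].decision_function(X) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]

X = np.random.randn(60, 2) + np.repeat([[0, 0], [4, 0], [0, 4]], 20, axis=0)
y = np.repeat([0, 1, 2], 20)
models = one_vs_rest_fit(X, y, classes=[0, 1, 2])
print(one_vs_rest_predict(models, X[:5]))
```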


slide-93
SLIDE 93

Multi-Class Loss for SVMs

Two-Class SVM:
  minimize_{w,b}   ½‖w‖² + ∑_{i=1}^N ℓ(fw,b(xi), yi)

Multi-Class SVM:
  minimize_{w,b}   ½‖w‖² + ∑_{i=1}^N max_{u≠yi} ℓ(fw,b(xi, yi) − fw,b(xi, u), yi)


slide-94
SLIDE 94

Regression

Examples x ∈ X Labels y ∈ R


slide-95
SLIDE 95

Regression

Squared loss (simplest approach): ℓ(f(xi), yi) := (yi − f(xi))².
Problem: all α's are non-zero ⇒ inefficient!

ε-insensitive loss function: extend the “margin” to regression; establish a “tube” around the line where we can make mistakes:
  ℓ(f(xi), yi) = 0 if |yi − f(xi)| < ε, and |yi − f(xi)| − ε otherwise.
Idea: examples (xi, yi) inside the tube have αi = 0.

Huber's loss (combination of benefits):
  ℓ(f(xi), yi) := ½(yi − f(xi))² if |yi − f(xi)| < γ, and γ|yi − f(xi)| − ½γ² otherwise.

See e.g. Smola and Schölkopf [2001] for other loss functions and more details.
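The three regression losses side by side, as a small sketch of my own (the ε and γ values are arbitrary examples):

```python
def squared_loss(y, f):
    return (y - f) ** 2

def eps_insensitive_loss(y, f, eps=0.1):
    """Zero inside the tube |y - f| < eps, linear outside."""
    return max(0.0, abs(y - f) - eps)

def huber_loss(y, f, gamma=1.0):
    """Quadratic for small residuals, linear for large ones."""
    r = abs(y - f)
    return 0.5 * r ** 2 if r < gamma else gamma * r - 0.5 * gamma ** 2

for r in (0.05, 0.5, 3.0):
    print(r, squared_loss(0.0, r), eps_insensitive_loss(0.0, r), huber_loss(0.0, r))
```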



slide-98
SLIDE 98

Semi-Supervised Learning: What Is It?

For most researchers: SSL = semi-supervised classification.


slide-99
SLIDE 99

Semi-Supervised Learning: How Does It Work?

Cluster Assumption

Points in the same cluster are likely to be of the same class. Equivalent assumption:

Low Density Separation Assumption

The decision boundary lies in a low density region. ⇒ Algorithmic idea: Low Density Separation


slide-100
SLIDE 100

Semi-Supervised SVM

min_{w,b,(yj),(ξk)}   ½ wᵀw + C ∑_i ξi + C∗ ∑_j ξj
s.t.  ξi ≥ 0,  ξj ≥ 0
      yi(wᵀxi + b) ≥ 1 − ξi        (labeled examples i)
      yj(wᵀxj + b) ≥ 1 − ξj        (unlabeled examples j, whose labels yj are optimized as well)

Soft margin S³VM


slide-101
SLIDE 101

Semi-Supervised SVM: Hard to Train

Supervised Support Vector Machine (SVM):
  min_{w,b,(ξk)}   ½ wᵀw + C ∑_i ξi
  s.t. ξi ≥ 0,  yi(wᵀxi + b) ≥ 1 − ξi
Maximize the margin around (labeled) points. Convex optimization problem (QP, quadratic programming).

Semi-Supervised Support Vector Machine (S³VM):
  min_{w,b,(yj),(ξk)}   ½ wᵀw + C ∑_i ξi + C∗ ∑_j ξj
  s.t. ξi ≥ 0,  ξj ≥ 0,  yi(wᵀxi + b) ≥ 1 − ξi,  yj(wᵀxj + b) ≥ 1 − ξj
Maximize the margin around labeled and unlabeled points. Discrete, combinatorial, NP-hard!


slide-102
SLIDE 102

Semi-Supervised SVM: Optimization

⇒ Optimization matters

Comparison of S³VM Optimization Methods
Averaged over splits (and pairs of classes)
Fixed hyperparameters (close to hard margin)
Similar results for other hyperparameter settings

[Chapelle et al., 2006]


slide-103
SLIDE 103

Covariate Shift & Domain Adaptation

The idea of domain adaptation: there is insufficient labeled training data for some problems. Idea: turn to related domains for which more data is available. The so-called source and target domains can be different, but should be related enough to gain something.

Distributional point of view:
Supervised learning: example-label pairs drawn from P(X, Y)
PSource(X, Y) might differ from PTarget(X, Y)
Factorization: P(X, Y) = P(Y|X) · P(X)
Covariate shift: PSource(X) ≠ PTarget(X)
Differing conditionals: PSource(Y|X) ≠ PTarget(Y|X)

→ There are numerous ways to approach this problem!

[Ben-david et al., 2007; Evgeniou and Pontil, 2004; Schweikert et al., 2008]


slide-104
SLIDE 104

Domain Adaptation Methods

Idea [Caruana, 1997]: optimize both models simultaneously and enforce similarity between the solutions.

Approach:
  min_{wS,wT,ξ}   ½‖wS‖² + ½‖wT‖² − B wTᵀwS + C ∑_{i=1}^{m+n} ξi
  s.t.  yi(⟨wS, Φ(xi)⟩ + b) ≥ 1 − ξi,  i = 1, . . . , m
        yi(⟨wT, Φ(xi)⟩ + b) ≥ 1 − ξi,  i = m + 1, . . . , m + n

Equivalent to multi-task kernel learning [Daume III, 2007]:
  KMTK((x, t), (x′, t′)) = γt,t′ K(x, x′) for a suitably chosen Γ (p.s.d.).



slide-106
SLIDE 106

Two ways of leveraging a given taxonomy T

KMTL((x, t), (x′, t′)) = γt,t′K(x, x′)

[Widmer et al., 2010]


slide-107
SLIDE 107

From Taxonomy to Γ

[Phylogenetic taxonomy over worm 1, worm 2, worm 3, fly and plant, with internal nodes A-D and divergence times ranging from 100 million to 1600 million years ago]

Idea: γi,j should be inversely related to the time to the last common ancestor.
Strategies: 1/years, hop distance, . . .



slide-109
SLIDE 109

Hierarchical Top-Down Approach

Idea: Exploit the taxonomy T in a top-down fashion.
Initialization: w0 is trained on the union of all task datasets.
Top-down, for each node i:
  Train on Di = ∪_{j ⪯ i} Dj
  Regularize wi against the parent predictor wparent:
    min_{wi,b}   ½‖wi − wparent‖² + C ∑_{(x,y)∈Di} ℓ(⟨Φ(x), wi⟩ + b, y)
Use the leaf predictors for classification.


slide-110
SLIDE 110

Hierarchical Top-Down Approach:Illustration

(a) Given taxonomy (b) Top-level training (c) Intermediate training (d) Taxon training


slide-111
SLIDE 111

Application to Splicing Data

Formulation as a binary classification problem
Utilize 15 organisms related by a taxonomy
Restricted to at most 10,000 examples per organism



slide-113
SLIDE 113

Results: Splicing Data

Observations:
Union > Plain → conservation
Often: Union > Nearest
MTL methods outperform the baselines
Best performer: Top-Down (& MT-Kernel)


slide-114
SLIDE 114

Available SVM Packages

2-Class Classification (35 hits on http://mloss.org), package names sorted by popularity
Multi-Class Classification (7 hits on http://mloss.org)
Regression (54 hits on http://mloss.org)
More can be found at http://www.kernel-machines.org.


slide-115
SLIDE 115

Easy-to-use Software

Easysvm: an easy-to-use SVM toolbox based on Python and the Shogun toolbox, usable from the command line or within Python
PyML: an easy-to-use Python-based SVM toolbox, usable from the command line or within Python
Shogun toolbox: a powerful toolbox for large-scale data analysis, including many SVM implementations, with support for Python, R, Matlab, and Octave
LibSVM: an SVM library with a graphical interface
SVM-Light: an efficient implementation of SVMs in C, usable from the command line
Galaxy Web Service: a web service for using SVMs, with predefined kernels for real-valued data and string classification (based on Easysvm): http://bioweb.me/mlb-galaxy
