SLIDE 1

Support Vector Machines Part 2

Yingyu Liang Computer Sciences 760 Fall 2017

http://pages.cs.wisc.edu/~yliang/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

SLIDE 2

Goals for the lecture

you should understand the following concepts

  • soft margin SVM
  • support vector regression
  • the kernel trick
  • polynomial kernel
  • Gaussian/RBF kernel
  • valid kernels and Mercer’s theorem
  • kernels and neural networks

SLIDE 3

Variants: soft-margin and SVR

SLIDE 4

Hard-margin SVM

  • Optimization (Quadratic Programming):

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \qquad \text{s.t.}\quad y_j(w^\top x_j + b) \ge 1,\ \forall j$$

SLIDE 5

Soft-margin SVM [Cortes & Vapnik, Machine Learning 1995]

  • if the training instances are not linearly separable, the previous formulation will fail
  • we can adjust our approach by using slack variables (denoted by $\xi_j$) to tolerate errors

$$\min_{w,b,\xi_j}\ \frac{1}{2}\|w\|^2 + C\sum_j \xi_j \qquad \text{s.t.}\quad y_j(w^\top x_j + b) \ge 1 - \xi_j,\quad \xi_j \ge 0,\ \forall j$$

  • $C$ determines the relative importance of maximizing margin vs. minimizing slack
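A minimal sketch of this trade-off, assuming scikit-learn's SVC (whose C parameter plays the role of the slack weight above) and a made-up overlapping dataset:

```python
# Overlapping (non-separable) synthetic blobs; sklearn's C is the slack
# weight from the slide.  Illustrative sketch only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C: wide margin, more slack; large C: narrow margin, less slack
    print(f"C={C}: {len(clf.support_)} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```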

SLIDE 6

The effect of $C$ in soft-margin SVM

Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010

SLIDE 7

Hinge loss

  • when we covered neural nets, we talked about minimizing squared loss and cross-entropy loss
  • SVMs minimize hinge loss

[Plot: loss (error) vs. model output $h(x)$ for a positive instance ($y = 1$), comparing squared loss, 0/1 loss, and hinge loss]
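The three losses in the plot are easy to compare directly; a small illustrative sketch with hypothetical model outputs:

```python
# The three losses from the plot, evaluated for a positive example (y = 1).
import numpy as np

def hinge_loss(y, h):     # max(0, 1 - y*h); zero once the margin exceeds 1
    return np.maximum(0.0, 1.0 - y * h)

def zero_one_loss(y, h):  # 1 if the predicted sign is wrong, else 0
    return (np.sign(h) != y).astype(float)

def squared_loss(y, h):   # (y - h)^2; penalizes even confident correct outputs
    return (y - h) ** 2

h = np.linspace(-2.0, 2.0, 5)       # hypothetical model outputs
print(hinge_loss(1, h))             # [3. 2. 1. 0. 0.]
print(zero_one_loss(1, h))          # [1. 1. 1. 0. 0.]
print(squared_loss(1, h))           # [9. 4. 1. 0. 1.]
```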

SLIDE 8

Support Vector Regression

  • the SVM idea can also be applied in regression tasks
  • an $\varepsilon$-insensitive error function specifies that a training instance is well explained if the model's prediction is within $\varepsilon$ of $y_j$

[Figure: the $\varepsilon$-tube around the regression function, bounded by $(w^\top x + b) - y = \varepsilon$ and $y - (w^\top x + b) = \varepsilon$]

SLIDE 9

Support Vector Regression

  • Regression using slack variables (denoted by $\xi_j$, $\hat{\xi}_j$) to tolerate errors

$$\min_{w,b,\xi_j,\hat{\xi}_j}\ \frac{1}{2}\|w\|^2 + C\sum_j \left(\xi_j + \hat{\xi}_j\right)$$

$$\text{s.t.}\quad (w^\top x_j + b) - y_j \le \varepsilon + \xi_j,\quad y_j - (w^\top x_j + b) \le \varepsilon + \hat{\xi}_j,\quad \xi_j, \hat{\xi}_j \ge 0,\ \forall j$$

  • slack variables allow predictions for some training instances to be off by more than $\varepsilon$
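A minimal sketch of $\varepsilon$-insensitive regression, assuming scikit-learn's SVR on synthetic data; epsilon is the tube half-width and C weights the slack:

```python
# epsilon-insensitive regression on synthetic data (illustrative sketch,
# assuming scikit-learn's SVR).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3.0, 3.0, (80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 80)

# epsilon: half-width of the "well explained" tube; C: weight on the slack
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
# support vectors are the points on or outside the tube
print("support vectors:", len(model.support_), "of", len(X))
```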
SLIDE 10

Kernel methods

SLIDE 11

Features

[Figure: an input image $x$ is mapped by "extract features" to a color histogram over red, green, and blue values, i.e., $x \mapsto \varphi(x)$]

SLIDE 12

Features

A proper feature mapping can turn a non-linear problem into a linear one!

SLIDE 13

Recall: SVM dual form

  • Reduces to the dual problem:

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_j \alpha_j - \frac{1}{2}\sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, x_j^\top x_k$$

$$\sum_j \alpha_j y_j = 0,\quad \alpha_j \ge 0$$

  • Since $w = \sum_j \alpha_j y_j x_j$, we have $w^\top x + b = \sum_j \alpha_j y_j\, x_j^\top x + b$
  • These only depend on inner products
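To see the inner-product dependence concretely, this sketch (assuming scikit-learn, where dual_coef_ stores $\alpha_j y_j$ for the support vectors) rebuilds the decision value by hand:

```python
# Rebuilding the decision value from support vectors and inner products
# only; illustrative sketch with a made-up toy dataset.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.0], [-1.5, -0.5]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

x_new = np.array([1.0, 1.0])
# sum_j alpha_j y_j (x_j^T x_new) + b, over support vectors only
f = clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_
print(f, clf.decision_function([x_new]))      # the two values agree
```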

SLIDE 14

Features

  • Using SVM on the feature space $\{\varphi(x_j)\}$: only need $\varphi(x_j)^\top \varphi(x_k)$
  • Conclusion: no need to design $\varphi(\cdot)$, only need to design the kernel

$$k(x_j, x_k) = \varphi(x_j)^\top \varphi(x_k)$$

SLIDE 15

Polynomial kernels

  • Fix degree $d$ and constant $c$:

$$k(x, x') = (x^\top x' + c)^d$$

  • What is $\varphi(x)$?
  • Expand the expression to get $\varphi(x)$
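For $d = 2$ and $x \in \mathbb{R}^2$, expanding the square gives an explicit $\varphi$; a quick numerical check (illustrative sketch):

```python
# Numerical check that (x^T x' + c)^2 = phi(x)^T phi(x') for the explicit
# degree-2 feature map obtained by expanding the square.
import numpy as np

def phi(x, c):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x, xp, c = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 1.0
print((x @ xp + c) ** 2)        # 4.0
print(phi(x, c) @ phi(xp, c))   # 4.0 as well
```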
SLIDE 16

Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

SLIDE 17

SVMs with polynomial kernels

Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010

SLIDE 18

Gaussian/RBF kernels

  • Fix bandwidth $\sigma$:

$$k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$$

  • Also called radial basis function (RBF) kernels
  • What is $\varphi(x)$? Consider the un-normalized version

$$k'(x, x') = \exp\left(x^\top x' / \sigma^2\right)$$

  • Power series expansion:

$$k'(x, x') = \sum_{j=0}^{+\infty} \frac{(x^\top x')^j}{\sigma^{2j}\, j!}$$

SLIDE 19

The RBF kernel illustrated

Figures from openclassroom.stanford.edu (Andrew Ng)

[Figure: RBF-kernel decision boundaries for three bandwidth settings: 10, 100, 1000]

SLIDE 20

Mercer's condition for kernels

  • Theorem: $k(x, x')$ has an expansion

$$k(x, x') = \sum_{j=1}^{+\infty} a_j\, \varphi_j(x)\, \varphi_j(x')$$

if and only if for any function $c(x)$,

$$\int\!\!\int c(x)\, c(x')\, k(x, x')\, dx\, dx' \ge 0$$

(omitting some technical conditions on $k$ and $c$)

SLIDE 21

Constructing new kernels

  • Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{j=0}^{+\infty} a_j\, k^j(x, x')$
  • Example: if $k_1(x, x')$ and $k_2(x, x')$ are kernels, then so is

$$k(x, x') = 2k_1(x, x') + 3k_2(x, x')$$

  • Example: if $k_1(x, x')$ is a kernel, then so is

$$k(x, x') = \exp(k_1(x, x'))$$
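These closure rules can be sanity-checked numerically: the Gram matrix of a valid kernel is positive semidefinite, so its smallest eigenvalue should be (numerically) non-negative. An illustrative sketch on random data:

```python
# Gram matrices of valid kernels are PSD; check that sum, product, and
# exp preserve this on random data (a numerical sanity check, not a proof).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

K1 = X @ X.T                                                    # linear kernel
K2 = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1) / 2)  # RBF kernel

for name, K in [("2*K1 + 3*K2", 2 * K1 + 3 * K2),
                ("K1 * K2 (elementwise)", K1 * K2),
                ("exp(K1)", np.exp(K1))]:
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())  # ~>= 0
```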

SLIDE 22

Kernel algebra

  • given a valid kernel, we can make new valid kernels using a variety of operators

kernel composition ↔ mapping composition:

  • $k(x, v) = k_a(x, v) + k_b(x, v)$ ↔ $\varphi(x) = (\varphi_a(x), \varphi_b(x))$
  • $k(x, v) = \gamma\, k_a(x, v),\ \gamma > 0$ ↔ $\varphi(x) = \sqrt{\gamma}\, \varphi_a(x)$
  • $k(x, v) = k_a(x, v)\, k_b(x, v)$ ↔ $\varphi_l(x) = \varphi_{ai}(x)\, \varphi_{bj}(x)$
  • $k(x, v) = f(x)\, f(v)\, k_a(x, v)$ ↔ $\varphi(x) = f(x)\, \varphi_a(x)$

SLIDE 23

Kernels vs. neural networks

SLIDE 24

Features

[Figure: an input image $x$ is mapped by "extract features" to a color histogram $\varphi(x)$ over red, green, and blue values; "build hypothesis" then fits $y = w^\top \varphi(x)$ on top]

SLIDE 25

Features: part of the model

build hypothesis:

$$y = w^\top \varphi(x)$$

a linear model over the features $\varphi(x)$, but a nonlinear model over the raw input

SLIDE 26

Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

SLIDE 27

Polynomial kernel SVM as two layer neural network

[Network diagram: inputs $x_1, x_2$; a fixed first layer computes the features $x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c$; the output is $y = \mathrm{sign}(w^\top \varphi(x) + b)$]

The first layer is fixed. If we also learn the first layer, it becomes a two-layer neural network.
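One way to see the equivalence: a degree-2 polynomial-kernel SVM and a linear SVM trained on the explicit "first layer" $\varphi$ should produce the same predictions. A sketch assuming scikit-learn's SVC, whose poly kernel is $(\gamma\, x^\top x' + \text{coef0})^{\text{degree}}$:

```python
# A degree-2 poly-kernel SVM vs. a linear SVM on the explicit feature map
# ("fixed first layer"); predictions should match.  Illustrative sketch.
import numpy as np
from sklearn.svm import SVC

def phi(X, c=1.0):
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2,
                     np.full_like(x1, c)], axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0]**2 + X[:, 1]**2 - 1.0)    # circular decision boundary

# gamma=1, coef0=1, degree=2 gives exactly (x^T x' + 1)^2
kernel_svm = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
linear_svm = SVC(kernel="linear", C=1.0).fit(phi(X), y)
agree = (kernel_svm.predict(X) == linear_svm.predict(phi(X))).mean()
print("agreement:", agree)                     # ~1.0
```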

SLIDE 28

Comments on SVMs

  • we can find solutions that are globally optimal (maximize the margin)
      • because the learning task is framed as a convex optimization problem
      • no local minima, in contrast to multi-layer neural nets
  • there are two formulations of the optimization: primal and dual
      • the dual represents the classifier decision in terms of support vectors
      • the dual enables the use of kernel functions
  • we can use a wide range of optimization methods to learn SVMs
      • standard quadratic programming solvers
      • SMO [Platt, 1999]
      • linear programming solvers for some formulations
      • etc.

SLIDE 29

Comments on SVMs

  • kernels provide a powerful way to
      • allow nonlinear decision boundaries
      • represent/compare complex objects such as strings and trees
      • incorporate domain knowledge into the learning task
  • using the kernel trick, we can implicitly use high-dimensional mappings without explicitly computing them
  • one SVM can represent only a binary classification task; multi-class problems are handled using multiple SVMs and some encoding
  • empirically, SVMs have shown (close to) state-of-the-art accuracy for many tasks
  • the kernel idea can be extended to other tasks (anomaly detection, regression, etc.)