
IAML: Support Vector Machines II

Nigel Goddard, School of Informatics, Semester 1


In SVM I

We saw:

◮ Max margin trick
◮ Geometry of the margin and how to compute it
◮ Finding the max margin hyperplane using a constrained optimization problem
◮ Max margin = Min norm


This Time

◮ Non-separable data
◮ The kernel trick


The SVM optimization problem

◮ Last time: the max margin weights can be computed by solving a constrained optimization problem

  min_w ||w||²   s.t.   yi(w⊤xi + w0) ≥ +1 for all i

◮ Many algorithms have been proposed to solve this. One of the earliest efficient algorithms is called SMO [Platt, 1998]. This is outside the scope of the course, but it does explain the name of the SVM method in Weka.

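This is a quadratic program. As a concrete illustration (not from the slides), here is a minimal sketch that solves it directly on a toy 2-d data set with a general-purpose constrained optimizer from SciPy; the data and variable names are purely illustrative, and in practice dedicated solvers such as SMO are used instead.

```python
# Minimal sketch: hard-margin problem  min ||w||^2  s.t.  y_i (w^T x_i + w_0) >= 1
# solved with a general-purpose optimizer on a toy separable data set.
import numpy as np
from scipy.optimize import minimize

# Toy 2-d separable data: class +1 around (2, 2), class -1 around (-2, -2)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (10, 2)), rng.normal(-2, 0.5, (10, 2))])
y = np.array([+1] * 10 + [-1] * 10)

def objective(params):
    w = params[:2]                     # weight vector (w_0 is not penalized)
    return w @ w                       # ||w||^2

def margin_constraints(params):
    w, w0 = params[:2], params[2]
    return y * (X @ w + w0) - 1        # each entry must be >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, w0 = res.x[:2], res.x[2]
print("w =", w, "w0 =", w0)
print("smallest margin value:", np.min(y * (X @ w + w0)))   # should be ~1
```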


Finding the optimum

◮ If you go through some advanced maths (Lagrange multipliers, etc.), it turns out that you can show something remarkable. Optimal parameters look like

  w = Σi αi yi xi

◮ Furthermore, solution is sparse. Optimal hyperplane is determined by just a few examples: call these support vectors


Why a solution of this form?

If you move the points not on the marginal hyperplanes, solution doesn’t change - therefore those points don’t matter.

[Figure: a separable data set with the max margin hyperplane, the margin and the weight vector w; only the points lying on the margin determine the solution]

Finding the optimum

◮ αi = 0 for non-support patterns
◮ Optimization problem to find the αi has no local minima (like logistic regression)
◮ Prediction on new data point x:

  f(x) = sign(w⊤x + w0) = sign(Σi αi yi (xi⊤x) + w0)
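A minimal sketch of these ideas with scikit-learn's SVC on illustrative toy data (a very large C approximates the hard-margin problem above): the fitted model exposes the support vectors and the products αi yi, so we can check both the sparse expansion of w and the prediction formula by hand.

```python
# Sketch: sparsity of the dual solution and prediction via sum_i alpha_i y_i (x_i^T x) + w_0.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)      # very large C ~ hard margin

print("support vectors:", len(clf.support_), "of", len(X))   # only a few points matter

# dual_coef_ stores alpha_i * y_i for each support vector, so w can be
# rebuilt from the support vectors alone:  w = sum_i alpha_i y_i x_i
w = clf.dual_coef_ @ clf.support_vectors_
print("w from alphas :", w.ravel())
print("w from sklearn:", clf.coef_.ravel())      # matches

# Prediction on a new point: sign(sum_i alpha_i y_i (x_i^T x) + w_0)
x_new = np.array([1.0, 1.5])
score = (clf.dual_coef_ @ (clf.support_vectors_ @ x_new))[0] + clf.intercept_[0]
print(np.sign(score), clf.predict([x_new])[0])   # agree
```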


Non-separable training sets

◮ If data set is not linearly separable, the optimization problem that we have given has no solution:

  min_w ||w||²   s.t.   yi(w⊤xi + w0) ≥ +1 for all i

◮ Why?
◮ Solution: Don’t require that we classify all points correctly. Allow the algorithm to choose to ignore some of the points.
◮ This is obviously dangerous (why not ignore all of them?) so we need to give it a penalty for doing so.


[Figure: data points with the margin hyperplanes and weight vector w; one point falls on the wrong side of its margin]

Slack

◮ Solution: Add a “slack” variable ξi ≥ 0 for each training example.
◮ If the slack variable is high, we get to relax the constraint, but we pay a price.
◮ New optimization problem is to minimize

  ||w||² + C Σi ξi^k

subject to the constraints

  w⊤xi + w0 ≥ 1 − ξi for yi = +1
  w⊤xi + w0 ≤ −1 + ξi for yi = −1

◮ Usually set k = 1. C is a trade-off parameter. Large C gives a large penalty to errors.
◮ Solution has the same form, but the support vectors now also include all points where ξi ≠ 0. Why?

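A small illustrative experiment (toy data and values assumed, not from the slides) showing the role of C on a non-separable data set: smaller C tolerates more slack, giving a wider margin and more support vectors, while larger C penalizes violations heavily.

```python
# Sketch: effect of the trade-off parameter C in the slack formulation.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Overlapping blobs: not linearly separable without slack.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)          # geometric margin width
    print(f"C={C:<6} support vectors={len(clf.support_):<4} margin={margin:.3f}")
```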

Think about ridge regression again

◮ Our max margin + slack optimization problem is to minimize

  ||w||² + C Σi ξi^k

subject to the constraints

  w⊤xi + w0 ≥ 1 − ξi for yi = +1
  w⊤xi + w0 ≤ −1 + ξi for yi = −1

◮ This looks even more like ridge regression than the non-slack problem:
◮ C Σi ξi^k measures how well we fit the data
◮ ||w||² penalizes weight vectors with a large norm
◮ So C can be viewed as a regularization parameter, like λ in ridge regression or regularized logistic regression
◮ You’re allowed to make this tradeoff even when the data set is separable!



Why you might want slack in a separable data set

[Figure: the same separable data set fit twice; insisting on zero slack lets a single point force a narrow margin, while paying slack ξ on that point allows a much larger margin on the rest]

Non-linear SVMs

◮ SVMs can be made nonlinear just like any other linear algorithm we’ve seen (i.e., using a basis expansion)
◮ But in an SVM, the basis expansion is implemented in a very special way, using something called a kernel
◮ The reason for this is that kernels can be faster to compute with if the expanded feature space is very high dimensional (even infinite)!
◮ This is a fairly advanced topic mathematically, so we will just go through a high-level version


Kernel

◮ A kernel is in some sense an alternate “API” for specifying to the classifier what your expanded feature space is.
◮ Up to now, we have always given the classifier a new set of training vectors φ(xi) for all i, e.g., just as a list of numbers: φ : R^d → R^D
◮ If D is large, this will be expensive; if D is infinite, this will be impossible


Non-linear SVMs

◮ Transform x to φ(x)
◮ Linear algorithm depends only on x⊤xi. Hence transformed algorithm depends only on φ(x)⊤φ(xi)
◮ Use a kernel function k(xi, xj) such that k(xi, xj) = φ(xi)⊤φ(xj)
◮ (This is called the “kernel trick”, and can be used with a wide variety of learning algorithms, not just max margin.)



Example of kernel

◮ Example 1: for 2-d input space

  φ(xi) = ( xi,1² , √2 xi,1 xi,2 , xi,2² )⊤

then k(xi, xj) = (xi⊤xj)²

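A quick numerical check of Example 1 (illustrative code, not from the slides): the explicit feature map and the kernel computed directly in input space give the same dot product.

```python
# Sketch: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) has phi(xi).phi(xj) = (xi.xj)^2.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(xi, xj):
    return (xi @ xj) ** 2

rng = np.random.default_rng(1)
xi, xj = rng.normal(size=2), rng.normal(size=2)
print(phi(xi) @ phi(xj))   # explicit feature-space dot product
print(k(xi, xj))           # kernel computed directly in input space: same value
```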

Kernels, dot products, and distance

◮ The Euclidean distance squared between two vectors can be computed using dot products:

  d(x1, x2) = (x1 − x2)⊤(x1 − x2) = x1⊤x1 − 2 x1⊤x2 + x2⊤x2

◮ Using a linear kernel k(x1, x2) = x1⊤x2 we can rewrite this as

  d(x1, x2) = k(x1, x1) − 2 k(x1, x2) + k(x2, x2)

◮ Any kernel gives you an associated distance measure this way. Think of a kernel as an indirect way of specifying distances.

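A small sketch (illustrative, not from the slides) of the induced distance d(x1, x2) = k(x1, x1) − 2 k(x1, x2) + k(x2, x2), evaluated with the linear kernel, where it recovers the squared Euclidean distance, and with an RBF kernel.

```python
# Sketch: the distance measure induced by a kernel.
import numpy as np

def kernel_dist(k, x1, x2):
    return k(x1, x1) - 2 * k(x1, x2) + k(x2, x2)

def linear(a, b):
    return a @ b

def rbf(a, b):
    return np.exp(-np.sum((a - b) ** 2))

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(kernel_dist(linear, x1, x2))          # equals ||x1 - x2||^2
print(np.sum((x1 - x2) ** 2))               # check: 13.0
print(kernel_dist(rbf, x1, x2))             # the RBF-induced (squared) distance
```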

Support Vector Machine

◮ A support vector machine is a kernelized maximum margin classifier.
◮ For max margin remember that we had the magic property

  w = Σi αi yi xi

◮ This means we would predict the label of a test example x as

  ŷ = sign[w⊤x + w0] = sign[Σi αi yi xi⊤x + w0]

◮ Kernelizing this we get

  ŷ = sign[Σi αi yi k(xi, x) + b]

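A minimal sketch (toy data and parameter values assumed) reproducing this kernelized prediction by hand with an RBF kernel, using the support vectors and the αi yi coefficients that scikit-learn's SVC exposes.

```python
# Sketch: decision value of an RBF-kernel SVM computed as sum_i alpha_i y_i k(x_i, x) + b.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

x_new = np.array([0.5, 0.0])
# dual_coef_[0, j] holds alpha_j * y_j for the j-th support vector
manual = sum(coef * rbf(sv, x_new)
             for coef, sv in zip(clf.dual_coef_[0], clf.support_vectors_))
manual += clf.intercept_[0]

print("manual decision value :", manual)
print("sklearn decision value:", clf.decision_function([x_new])[0])   # same
```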

Prediction on new example

[Figure: prediction on a new input x. The input is compared to each support vector x1, …, x4 via k(x, xi), the comparisons are weighted and summed, and the output is f(x) = sgn(Σi λi k(x, xi) + b). Example kernels: k(x, xi) = exp(−||x − xi||² / c), k(x, xi) = tanh(κ(x·xi) + Θ), k(x, xi) = (x·xi)^d]

Figure Credit: Bernhard Schölkopf


[Figure: a nonlinear decision boundary between two classes in input space corresponds to a linear boundary in feature space]

Figure Credit: Bernhard Schölkopf

◮ Example 2:

  k(xi, xj) = exp(−||xi − xj||² / α²)

In this case the dimension of φ is infinite, i.e., it can be shown that no φ that maps into a finite-dimensional space will give you this kernel.
◮ We can never calculate φ(x), but the algorithm only needs us to calculate k for different pairs of points.

Choosing φ, C

◮ There are theoretical results, but we will not cover them. (If you want to look them up, there are actually upper bounds on the generalization error: look for VC-dimension and structural risk minimization.)
◮ However, in practice cross-validation methods are commonly used

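A minimal sketch of the cross-validation approach (the data set and parameter grid are illustrative, not a recommendation): choosing C and the RBF kernel width by grid search.

```python
# Sketch: picking C and gamma by cross-validated grid search.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```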

Example application

◮ US Postal Service digit data (7291 examples, 16 × 16 images). Three SVMs using polynomial, RBF and MLP-type kernels were used (see Schölkopf and Smola, Learning with Kernels, 2002 for details)
◮ The three use almost the same (≃ 90%) small sets (4% of the data base) of SVs
◮ All systems perform well (≃ 4% error)
◮ Many other applications, e.g.

  ◮ Text categorization
  ◮ Face detection
  ◮ DNA analysis
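This is not the USPS experiment from the slide, but an analogous mini-experiment on scikit-learn's built-in 8 × 8 digits data shows how such a classifier is set up and how sparse the solution is.

```python
# Sketch: an RBF-kernel SVM on a small handwritten-digit data set.
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)          # 1797 examples, 8x8 images
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", gamma=0.001, C=10).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("fraction of training points that are SVs:",
      len(clf.support_) / len(X_train))
```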

Comparison with linear and logistic regression

◮ Underlying basic idea of linear prediction is the same, but error functions differ
◮ Logistic regression (non-sparse) vs SVM (“hinge loss”, sparse solution)
◮ Linear regression (squared error) vs ε-insensitive error
◮ Linear regression and logistic regression can be “kernelized” too

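A small illustration (not from the slides) of the two error functions being contrasted: the SVM hinge loss is exactly zero once a point's margin exceeds 1, which is what makes the solution sparse, while the logistic loss is never exactly zero.

```python
# Sketch: hinge loss vs logistic loss as a function of the margin m = y * f(x).
import numpy as np

m = np.linspace(-2, 3, 6)                 # a few margin values
hinge = np.maximum(0.0, 1.0 - m)          # SVM: zero once the margin exceeds 1
logistic = np.log(1.0 + np.exp(-m))       # logistic regression: never exactly zero

for mi, h, l in zip(m, hinge, logistic):
    print(f"margin {mi:5.2f}   hinge {h:5.2f}   logistic {l:5.2f}")
```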


SVM summary

◮ SVMs are the combination of max-margin and the kernel trick
◮ Learn linear decision boundaries (like logistic regression, perceptrons)
◮ Pick hyperplane that maximizes margin
◮ Use slack variables to deal with non-separable data
◮ Optimal hyperplane can be written in terms of support patterns
◮ Transform to higher-dimensional space using kernel functions
◮ Good empirical results on many problems
◮ Appears to avoid overfitting in high dimensional spaces (cf regularization)
◮ Sorry for all the maths!
