

SLIDE 1

Support vector machines

CS 446

SLIDE 2

Part 1: linear support vector machines

[Figure: three contour plots of linear predictors fit to the same training data. Panels: Logistic regression; Least squares; SVM.]

SLIDE 3

Part 2: kernelized support vector machines

[Figure: four contour plots of nonlinear decision surfaces on the same data. Panels: ReLU network; Quadratic SVM; RBF SVM; Narrower RBF SVM.]

SLIDE 4

1. Recap: linearly separable data
SLIDES 5–7

Linear classifiers (with Y = {−1, +1})

Linear separability assumption. Assume there is a linear classifier that perfectly classifies the training data S: for some $w_\star \in \mathbb{R}^d$,

$$\min_{(x,y)\in S} y\, x^\top w_\star > 0.$$

Convex program. Finding any such w is a convex (linear!) feasibility problem.

Logistic regression. Alternatively, one can run enough steps of logistic regression.
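To make the logistic regression route concrete, here is a minimal numpy sketch; the toy data, step size, and iteration budget are illustrative assumptions, not from the slides. On separable data, gradient descent on the logistic loss eventually yields a w with positive minimum margin.

```python
# A minimal sketch, assuming toy Gaussian data, a fixed step size, and a
# fixed iteration budget (none of these come from the slides).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
m = X @ np.array([2.0, -1.0])
keep = np.abs(m) > 0.3          # enforce a gap so separation is easy
X, y = X[keep], np.sign(m[keep])

w = np.zeros(2)
for _ in range(5000):
    margins = y * (X @ w)
    # Gradient of the average logistic loss  mean_i log(1 + exp(-margin_i)).
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.5 * grad

print("min margin:", (y * (X @ w)).min())  # positive once w separates S
```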

SLIDES 8–9

Support vector machines (SVMs)

Motivation
◮ Let's first define a good linear separator, and then solve for it.
◮ Let's also find a principled approach to nonseparable data.

Support vector machines (Vapnik and Chervonenkis, 1963)
◮ Characterize a stable solution for linearly separable problems: the maximum margin solution.
◮ Solve for the maximum margin solution efficiently via convex optimization.
◮ The convex dual has valuable structure; it will give useful extensions, and is what we'll optimize.
◮ Extend the optimization problem to non-separable data via convex surrogate losses.
◮ Nonlinear separators via kernels.

SLIDE 10

2. Maximum margin solution
SLIDES 11–17

Maximum margin solution

[Figure sequence: the best linear classifier on the population; an arbitrary linear separator on the training data S; the maximum margin solution on S.]

Why use the maximum margin solution?
(i) It is uniquely determined by S, unlike the linear program.
(ii) It is a particular inductive bias, i.e., an assumption about the problem, that seems to be commonly useful.
◮ We've seen inductive bias before: least squares and logistic regression choose different predictors on the same data.
◮ This particular bias (margin maximization) is common in machine learning and has many nice properties.

Key insight: we can express this as another convex program.

SLIDES 18–24

Distance to decision boundary

Suppose $w \in \mathbb{R}^d$ satisfies $\min_{(x,y)\in S} y\, x^\top w > 0$.

◮ "Maximum margin" shouldn't care about scaling; w and 10w should be equally good.
◮ Thus for each direction $w/\|w\|_2$, we can fix a scaling.

Let $(\tilde{x}, \tilde{y})$ be any example in S that achieves the minimum.

[Figure: hyperplane H with normal w; the point $\tilde{y}\tilde{x}$ lies at distance $\tilde{y}\tilde{x}^\top w / \|w\|_2$ from H.]

◮ Rescale w so that $\tilde{y}\tilde{x}^\top w = 1$. (Now the scaling is fixed.)
◮ The distance from $\tilde{y}\tilde{x}$ to H is $1/\|w\|_2$. This is the (normalized minimum) margin.
◮ This gives the optimization problem

$$\max_{w} \; 1/\|w\|_2 \quad \text{subj. to} \quad \min_{(x,y)\in S} y\, x^\top w = 1.$$

We can relax the constraint to $\ge 1$.
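A quick numeric sanity check of the rescaling argument; the toy points and the weight vector are assumptions for illustration. The geometric margin $\min_{(x,y)\in S} y\, x^\top w / \|w\|_2$ is scale invariant, and after rescaling so the minimum equals 1 it reads off as $1/\|w\|_2$.

```python
# A small numeric check of the margin formula on assumed toy data.
import numpy as np

X = np.array([[3.0, 1.0], [1.0, 3.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([2.0, 1.0])                 # separates this toy set

margins = y * (X @ w)                    # here: [7, 5, 6], all positive
w_scaled = w / margins.min()             # now min_i y_i x_i^T w = 1
print(margins.min() / np.linalg.norm(w))  # geometric margin of direction w
print(1.0 / np.linalg.norm(w_scaled))     # the same value, now 1/||w||_2
```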

SLIDES 25–29

Maximum margin linear classifier

The solution $\hat{w}$ to the following optimization problem:

$$\min_{w\in\mathbb{R}^d} \; \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad y\, x^\top w \ge 1 \text{ for all } (x,y) \in S$$

gives the linear classifier with the largest minimum margin on S, i.e., the maximum margin linear classifier or support vector machine (SVM) classifier.

This is a convex optimization problem; it can be solved in polynomial time.

If there is a solution (i.e., S is linearly separable), then the solution is unique.

We can solve this in a variety of ways (e.g., projected gradient descent); we will work with the dual.

Note: we can also explicitly include an affine expansion, so the decision boundary need not pass through the origin. We'll do our derivations without it.
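Since the slide only asserts that this is a polynomial-time convex QP, here is one hedged way to solve the primal directly; the cvxpy modeling library and the toy data are assumptions (any QP solver would do).

```python
# A minimal sketch of the hard-margin primal, assuming cvxpy and toy data.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
m = X @ np.array([1.0, -0.5])
X, y = X[np.abs(m) > 0.2], np.sign(m[np.abs(m) > 0.2])  # separable toy data

w = cp.Variable(2)
constraints = [cp.multiply(y, X @ w) >= 1]   # y_i x_i^T w >= 1 for all i
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

w_hat = w.value
print("maximum margin 1/||w||_2:", 1.0 / np.linalg.norm(w_hat))
print("min_i y_i x_i^T w:", (y * (X @ w_hat)).min())  # ~1 at the optimum
```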

SLIDE 30

3. SVM dual problem
SLIDES 31–34

Two SVM optimization problems

SVM (primal) problem:

$$\min_{w\in\mathbb{R}^d} \; \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad y_i x_i^\top w \ge 1 \text{ for all } i = 1, \dots, n.$$

SVM dual problem:

$$\max_{\alpha_1, \dots, \alpha_n \ge 0} \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j.$$

We'll derive the SVM dual problem using Lagrangian duality. Looking ahead, the dual will help us with kernels and nonlinear separators.
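For readers who prefer matrix notation, the dual can be rewritten compactly; this is pure algebra on the displayed objective, and the symbol Q is our own notation, not the slides'.

```latex
% Matrix form of the dual. With Q_{ij} := y_i y_j x_i^T x_j, i.e.
% Q = diag(y) X X^T diag(y), which is positive semidefinite, the dual
% is a concave quadratic program over the nonnegative orthant:
\max_{\alpha \ge 0} \; \mathbf{1}^\top \alpha \;-\; \tfrac{1}{2}\, \alpha^\top Q \alpha .
```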

SLIDES 35–39

Lagrange multipliers

Move the constraints into the objective using the method of Lagrange multipliers. Original problem:

$$\min_{w\in\mathbb{R}^d} \; \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad 1 - y_i x_i^\top w \le 0 \text{ for all } i = 1, \dots, n.$$

◮ For each (inequality) constraint $1 - y_i x_i^\top w \le 0$, associate a (non-negative) dual variable $\alpha_i$ (a.k.a. Lagrange multiplier).
◮ Move the constraints into the objective by adding $\sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w)$ and maximizing over $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^n$ with $\alpha \ge 0$ (i.e., $\alpha_i \ge 0$ for all $i$).

Resulting optimization problem (note the lack of explicit constraints on w):

$$\min_{w\in\mathbb{R}^d} \; \max_{\alpha \ge 0} \left[ \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w) \right].$$

Equivalence: if w violates the original i-th constraint (so $1 - y_i x_i^\top w > 0$), then the inner $\max_{\alpha \ge 0}$ will send $\alpha_i \to \infty$ and the objective $\to \infty$. Such a w cannot be the minimizer!
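To spell out the equivalence in one identity (a small added step, consistent with the argument above): the inner maximization acts as an exact penalty for the constraints.

```latex
% The inner maximum is 0 when w is feasible and +infinity otherwise:
\max_{\alpha \ge 0} \sum_{i=1}^n \alpha_i \bigl(1 - y_i x_i^\top w\bigr)
  = \begin{cases}
      0       & \text{if } 1 - y_i x_i^\top w \le 0 \text{ for all } i,\\
      +\infty & \text{otherwise,}
    \end{cases}
% so the min-max objective equals (1/2)||w||_2^2 on the feasible set and
% +infinity off it, recovering the original constrained problem.
```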

SLIDES 40–43

Dual problem

Let's define the Lagrangian L as

$$L(w, \alpha) := \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w).$$

Our primal maximum margin problem was

$$P(w) = \max_{\alpha\ge 0} L(w, \alpha) = \max_{\alpha\ge 0} \left[ \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w) \right].$$

The dual problem $D(\alpha) = \min_w L(w, \alpha)$ can be found in closed form; since $w \mapsto L(w, \alpha)$ is a convex quadratic with minimizer $w = \sum_{i=1}^n \alpha_i y_i x_i$,

$$D(\alpha) = \min_{w\in\mathbb{R}^d} L(w, \alpha) = L\Big(\sum_{i=1}^n \alpha_i y_i x_i,\, \alpha\Big) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i y_i x_i\Big\|_2^2 = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j.$$

(The last form appeared on an earlier slide as the dual problem.)

Key question: does a solution to D give us a solution to P?
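The minimizer asserted above follows from a one-line gradient computation; we spell it out here as a small added step, using only the definitions on this slide.

```latex
% L(., alpha) is a convex quadratic in w, so set its gradient to zero:
\nabla_w L(w, \alpha) \;=\; w - \sum_{i=1}^n \alpha_i y_i x_i \;=\; 0
\quad\Longleftrightarrow\quad
w \;=\; \sum_{i=1}^n \alpha_i y_i x_i \;=:\; v.
% Substituting w = v back into L gives the closed form for D:
L(v, \alpha) \;=\; \tfrac{1}{2}\|v\|_2^2 + \sum_{i=1}^n \alpha_i - v^\top v
            \;=\; \sum_{i=1}^n \alpha_i - \tfrac{1}{2}\|v\|_2^2 .
```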

SLIDES 44–46

Recall:

$$L(w, \alpha) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w) \quad \text{(Lagrangian)},$$

$$P(w) = \max_{\alpha\ge 0} L(w, \alpha) \quad \text{(primal problem)}, \qquad D(\alpha) = \min_{w} L(w, \alpha) \quad \text{(dual problem)}.$$

◮ For general Lagrangians, we have weak duality $P(w) \ge D(\alpha)$, since

$$P(w) = \max_{\alpha'\ge 0} L(w, \alpha') \ge L(w, \alpha) \ge \min_{w'} L(w', \alpha) = D(\alpha).$$

◮ By convexity, we have strong duality $\min_w P(w) = \max_{\alpha\ge 0} D(\alpha)$, and an optimum $\alpha$ for D gives an optimum for P via

$$w = \sum_{i=1}^n \alpha_i y_i x_i = \operatorname*{arg\,min}_{w} L(w, \alpha).$$
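Here is a minimal numpy sketch of this recovery in action, solving the dual by projected gradient ascent; the toy data, step size, and iteration budget are assumptions, since the slides do not prescribe a dual solver. The gradient of D is $1 - Q\alpha$ with $Q_{ij} = y_i y_j x_i^\top x_j$, the constraint $\alpha \ge 0$ is enforced by clipping, and $\hat{w}$ is then read off as $\sum_i \hat{\alpha}_i y_i x_i$.

```python
# A minimal sketch: projected gradient ascent on the SVM dual, assuming
# toy data and a conservative step size.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
m = X @ np.array([1.0, 1.0])
X, y = X[np.abs(m) > 0.5], np.sign(m[np.abs(m) > 0.5])  # separable, with a gap
n = len(y)

Q = np.outer(y, y) * (X @ X.T)               # Q_ij = y_i y_j x_i^T x_j
eta = 1.0 / np.linalg.eigvalsh(Q).max()      # step size from the curvature
alpha = np.zeros(n)
for _ in range(20000):
    alpha = np.maximum(0.0, alpha + eta * (1.0 - Q @ alpha))  # ascend, project

w_hat = (alpha * y) @ X                      # primal optimum w = sum_i a_i y_i x_i
print("min margin:", (y * (X @ w_hat)).min())        # ~1 at the optimum
print("duality gap:", w_hat @ w_hat - alpha.sum())   # ~0 by strong duality
```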

SLIDES 47–50

Support vectors

Optimal solutions $\hat{w}$ and $\hat{\alpha} = (\hat{\alpha}_1, \dots, \hat{\alpha}_n)$ satisfy
◮ $\hat{w} = \sum_{i=1}^n \hat{\alpha}_i y_i x_i = \sum_{i:\, \hat{\alpha}_i > 0} \hat{\alpha}_i y_i x_i$,
◮ $\hat{\alpha}_i > 0 \Rightarrow y_i x_i^\top \hat{w} = 1$ for all $i = 1, \dots, n$ (complementary slackness).

The $y_i x_i$ with $\hat{\alpha}_i > 0$ are called support vectors.

[Figure: separating hyperplane H; the support vectors lie on the margin boundaries.]

◮ Support vector examples satisfy the "margin" constraints with equality.
◮ We would get the same solution even if we kept only the support vector examples.
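As a hedged cross-check with an off-the-shelf solver (an assumption: the slides do not use scikit-learn, and sklearn's SVC additionally fits an intercept, unlike our through-the-origin derivation): SVC with a linear kernel and a very large C approximates the hard-margin SVM, and its support_ attribute indexes exactly the training points with $\hat{\alpha}_i > 0$.

```python
# A hedged cross-check, assuming scikit-learn and toy data; large C
# approximates the hard-margin SVM, and SVC also fits an intercept b.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 2))
m = X @ np.array([1.5, -1.0])
X, y = X[np.abs(m) > 0.5], np.sign(m[np.abs(m) > 0.5])

clf = SVC(kernel="linear", C=1e8).fit(X, y)    # large C ~ hard margin
w_hat, b = clf.coef_.ravel(), clf.intercept_[0]
sv = clf.support_                              # indices with alpha_i > 0

# Complementary slackness: support vectors sit exactly on the margin.
print("support vector margins:", y[sv] * (X[sv] @ w_hat + b))  # each ~1
print("number of support vectors:", len(sv))
```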

SLIDES 51–56

Proof of complementary slackness

For the optimal (feasible) solutions $\hat{w}$ and $\hat{\alpha}$, we have

$$P(\hat{w}) = D(\hat{\alpha}) = \min_{w\in\mathbb{R}^d} L(w, \hat{\alpha}) \quad \text{(by strong duality)}$$
$$\le L(\hat{w}, \hat{\alpha}) = \frac{1}{2}\|\hat{w}\|_2^2 + \sum_{i=1}^n \hat{\alpha}_i (1 - y_i x_i^\top \hat{w})$$
$$\le \frac{1}{2}\|\hat{w}\|_2^2 \quad \text{(constraints are satisfied and } \hat{\alpha}_i \ge 0\text{)}$$
$$= P(\hat{w}).$$

The chain starts and ends at $P(\hat{w})$, so every inequality holds with equality. Each term $\hat{\alpha}_i (1 - y_i x_i^\top \hat{w})$ is nonpositive (since $\hat{\alpha}_i \ge 0$ and $1 - y_i x_i^\top \hat{w} \le 0$), and their sum is zero, so every term must be zero:

$$\hat{\alpha}_i (1 - y_i x_i^\top \hat{w}) = 0 \quad \text{for all } i = 1, \dots, n.$$

If $\hat{\alpha}_i > 0$, then we must have $1 - y_i x_i^\top \hat{w} = 0$.

SLIDE 57

SVM duality summary

Lagrangian:

$$L(w, \alpha) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w).$$

The primal maximum margin problem was

$$P(w) = \max_{\alpha\ge 0} L(w, \alpha) = \max_{\alpha\ge 0} \left[ \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w) \right].$$

Dual problem:

$$D(\alpha) = \min_{w\in\mathbb{R}^d} L(w, \alpha) = L\Big(\sum_{i=1}^n \alpha_i y_i x_i,\, \alpha\Big) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i y_i x_i\Big\|_2^2 = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j.$$

Given a dual optimum $\hat{\alpha}$:
◮ the corresponding primal optimum is $\hat{w} = \sum_{i=1}^n \hat{\alpha}_i y_i x_i$;
◮ strong duality holds: $P(\hat{w}) = D(\hat{\alpha})$;
◮ $\hat{\alpha}_i > 0$ implies $y_i x_i^\top \hat{w} = 1$, and these $y_i x_i$ are support vectors.