Machine Learning
Support Vector Machines
Big picture

- Linear models
- How good is a learning algorithm?
  - Online learning: Perceptron, Winnow
  - PAC, agnostic learning
- Support Vector Machines
How good is a learning algorithm? Learning theory gives bounds of the form

Generalization error ≤ Training error + (a function of the VC dimension)

A low VC dimension gives a tighter bound.
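For reference, the classical VC bound has the following shape (a sketch from standard learning theory; the exact constants are not from the slides). With probability at least $1-\delta$ over a sample of $m$ training examples,

$$\text{err}_D(h) \;\le\; \text{err}_S(h) \;+\; \sqrt{\frac{d_{VC}\left(\ln\frac{2m}{d_{VC}} + 1\right) + \ln\frac{4}{\delta}}{m}}$$

where $d_{VC}$ is the VC dimension of the hypothesis class. The second term shrinks as $d_{VC}$ decreases, which is why a low VC dimension gives a tighter bound.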
So far

- Larger margin → better generalization.
- Among all linear classifiers that separate the data, find the one that maximizes the margin.
- Maximize the margin by minimizing $\mathbf{w}^T\mathbf{w}$, provided $y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1$ for all examples.
- Introduce slack variables, one $\xi_i$ for each example. Slack variables allow the margin constraint above to be violated.
The distance of a point $(x_1, x_2)$ with label $y$ from the hyperplane $b + w_1 x_1 + w_2 x_2 = 0$ is

$$\frac{|w_1 x_1 + w_2 x_2 + b|}{\sqrt{w_1^2 + w_2^2}} \;=\; \frac{y\,(w_1 x_1 + w_2 x_2 + b)}{\|\mathbf{w}\|}$$

where the two sides agree whenever the example is correctly classified.

The hyperplanes $b + w_1 x_1 + w_2 x_2 = 0$, $\tfrac{b}{2} + \tfrac{w_1}{2} x_1 + \tfrac{w_2}{2} x_2 = 0$, and $1000b + 1000 w_1 x_1 + 1000 w_2 x_2 = 0$ are all equivalent. We could multiply or divide the coefficients by any positive number and the sign of the prediction will not change.
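A minimal Python sanity check of these two facts (the weights and the point are made-up values, not from the slides):

```python
import numpy as np

def geometric_margin(w, b, x, y):
    """Signed distance of labeled point (x, y) from the hyperplane w.x + b = 0."""
    return y * (np.dot(w, x) + b) / np.linalg.norm(w)

w, b = np.array([2.0, 1.0]), -1.0   # hypothetical separator
x, y = np.array([1.0, 1.0]), +1     # hypothetical labeled point

print(geometric_margin(w, b, x, y))                # distance to the separator
print(geometric_margin(1000 * w, 1000 * b, x, y))  # same value: scaling changes nothing
```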
This quantity is sometimes called the geometric margin; the numerator alone is called the functional margin.
We can scale the weights to make the optimization easier. Dividing all coefficients of $b + w_1 x_1 + w_2 x_2 = 0$ by some $\gamma > 0$ leaves the distance unchanged:

$$\frac{\frac{w_1}{\gamma} x_1 + \frac{w_2}{\gamma} x_2 + \frac{b}{\gamma}}{\sqrt{\left(\frac{w_1}{\gamma}\right)^2 + \left(\frac{w_2}{\gamma}\right)^2}} \;=\; \frac{w_1 x_1 + w_2 x_2 + b}{\sqrt{w_1^2 + w_2^2}}$$

Key observation: we can choose the scale $\gamma$ so that the numerator is 1 for the points that define the margin. The margin then simplifies to

$$\frac{1}{\sqrt{\left(\frac{w_1}{\gamma}\right)^2 + \left(\frac{w_2}{\gamma}\right)^2}} \;=\; \frac{1}{\sqrt{u_1^2 + u_2^2}}$$

where $u_j = w_j/\gamma$ are the rescaled weights.
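A quick numeric check of the key observation (separator and points are hypothetical): set $\gamma$ to the smallest functional margin, and the closest point's numerator becomes exactly 1.

```python
import numpy as np

w, b = np.array([2.0, 1.0]), -1.0                 # hypothetical separator
X = np.array([[1.0, 1.0], [-1.0, -0.5]])          # hypothetical points
y = np.array([1, -1])

gamma = np.min(y * (X @ w + b))    # functional margin of the closest point
u, c = w / gamma, b / gamma        # rescaled separator

print(np.min(y * (X @ u + c)))     # exactly 1.0 after rescaling
print(1 / np.linalg.norm(u))       # the margin, 1/||u||
```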
Renaming the rescaled weights back to $\mathbf{w}$: the margin of the separator $b + w_1 x_1 + w_2 x_2 = 0$ is

$$\frac{1}{\sqrt{w_1^2 + w_2^2}} \;=\; \frac{1}{\|\mathbf{w}\|}$$

so maximizing the margin is equivalent to minimizing $\mathbf{w}^T\mathbf{w}$ in this setting: minimizing $\mathbf{w}^T\mathbf{w}$ gives us the maximum margin. After rescaling, the condition $y\,\mathbf{w}^T\mathbf{x} \ge 1$ is true for every example, and in particular for the example closest to the separator.
Maximize the margin, such that every example has a functional margin of at least 1.
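In symbols, this is the hard-margin problem: minimize $\mathbf{w}^T\mathbf{w}$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$ for every example $i$. A minimal sketch of solving this small quadratic program with scipy (the toy data and the choice of solver are mine, not the slides'):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables: t = [w1, w2, b]
objective = lambda t: t[0]**2 + t[1]**2                    # w^T w
constraints = [                                            # y_i (w.x_i + b) >= 1
    {"type": "ineq",
     "fun": lambda t, i=i: y[i] * (t[0]*X[i, 0] + t[1]*X[i, 1] + t[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b, " margin =", 1.0 / np.linalg.norm(w))
```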
Even if a separator makes a mistake or two, it may have enough margin that it should generalize well.
Maximize the margin, with every example having a functional margin of at least 1, softened by one slack variable $\xi_i$ per example: $y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1 - \xi_i$, with $\xi_i \ge 0$. Intuition: the slack variable allows examples to "break" into the margin. If the slack value is zero, then the example is either on or outside the margin.
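For a fixed separator, the smallest slack each example needs is forced: $\xi_i = \max(0,\, 1 - y_i\,\mathbf{w}^T\mathbf{x}_i)$. A quick sketch (hypothetical weights and points, with the bias folded into a constant feature):

```python
import numpy as np

w = np.array([1.0, 0.5, -0.5])                       # hypothetical weights
X = np.array([[2, 2, 1], [0.5, 0.5, 1], [-1, -1, 1]], dtype=float)
y = np.array([1, 1, -1])

xi = np.maximum(0.0, 1.0 - y * (X @ w))
print(xi)  # zero slack: on/outside the margin; positive slack: inside or misclassified
```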
The soft-margin objective: maximize the margin and minimize the total slack (i.e., allow as few examples as possible to violate the margin), with a tradeoff between the two terms:

$$\min_{\mathbf{w},\,\xi}\; \mathbf{w}^T\mathbf{w} + C \sum_i \xi_i \quad \text{such that} \quad y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1 - \xi_i,\;\; \xi_i \ge 0$$
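The hyper-parameter $C$ controls this tradeoff. A quick sketch of its effect using scikit-learn's LinearSVC (the toy data, the outlier, and the C values are mine; the slides do not reference any library):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical toy data; the last point sits on the wrong side of the gap
X = np.array([[2, 2], [3, 3], [2.5, 3], [-1, -1], [-2, -1], [2, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)
    print(f"C={C}: margin ~ {margin:.3f}, training accuracy = {clf.score(X, y):.2f}")

# Small C tolerates slack and keeps a wide margin; large C penalizes slack
# heavily and contorts the separator to fix individual mistakes.
```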
Viewed term by term, the objective has two parts: the first maximizes the margin, and the second is a penalty for the prediction on each example.
This penalty can be compared with the 0-1 loss:

- 0-1 loss: if the sign of $y$ and $\mathbf{w}^T\mathbf{x}$ is the same, then there is no penalty; if the signs are different, then the penalty is 1.
- Hinge loss: $\max(0,\, 1 - y\,\mathbf{w}^T\mathbf{x})$. Predictions are penalized even if they are correct but too close to the margin; there is no penalty if $\mathbf{w}^T\mathbf{x}$ is beyond 1 (below −1 for negative examples); incorrect predictions get a linearly increasing penalty with $\mathbf{w}^T\mathbf{x}$.
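A side-by-side tabulation of the two losses in Python (my own illustration of the definitions above):

```python
def zero_one_loss(y, score):
    """1 if the sign of the prediction disagrees with y, else 0."""
    return 1.0 if y * score <= 0 else 0.0

def hinge_loss(y, score):
    """max(0, 1 - y*score): also penalizes correct but low-margin scores."""
    return max(0.0, 1.0 - y * score)

# For a positive example (y = +1):
for score in (-2.0, -0.5, 0.5, 1.0, 2.0):
    print(f"w.x = {score:+.1f}   0-1 loss = {zero_one_loss(1, score):.0f}   "
          f"hinge loss = {hinge_loss(1, score):.1f}")
# Scores >= 1 incur no hinge penalty; scores in (0, 1) are correct under the
# 0-1 loss but still penalized by the hinge; wrong-signed scores grow linearly.
```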
The loss-minimization view: define the notion of "loss" over the training data as a function of a hypothesis; learning = find the hypothesis that has the lowest loss on the training data. Better: also define a regularization function that penalizes complex hypotheses, and find the hypothesis with the lowest [regularizer + loss on the training data]. Capacity control gives better generalization.
The full SVM objective,

$$\min_{\mathbf{w}}\; \underbrace{\mathbf{w}^T\mathbf{w}}_{\text{regularization term}} \;+\; \underbrace{C \sum_i \max\!\left(0,\, 1 - y_i\,\mathbf{w}^T\mathbf{x}_i\right)}_{\text{empirical loss}}$$

- Regularization term: imposes a preference over the hypothesis space and pushes for better generalization; it can be replaced by other regularization terms which impose other preferences.
- Empirical loss: the hinge loss penalizes mistakes; it can be replaced by other loss functions which impose other preferences.
- $C$: a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss.
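To make the loss-minimization view concrete, here is a minimal batch sub-gradient descent sketch for this objective (the implementation, toy data, and step sizes are my own, not prescribed by the slides):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Batch sub-gradient descent on  w^T w + C * sum_i max(0, 1 - y_i w^T x_i)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                        # examples with nonzero hinge loss
        grad = 2 * w - C * (y[active] @ X[active])  # sub-gradient of the two terms
        w -= lr * grad
    return w

# Hypothetical toy data; the constant third feature plays the role of the bias b
X = np.array([[2, 2, 1], [3, 3, 1], [-1, -1, 1], [-2, -1, 1]], dtype=float)
y = np.array([1.0, 1.0, -1.0, -1.0])

w = train_linear_svm(X, y)
print("w =", w, " predictions:", np.sign(X @ w))
```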