SLIDE 1

Lecture 2: The SVM classifier

C4B Machine Learning Hilary 2011 A. Zisserman

  • Review of linear classifiers
  • Linear separability
  • Perceptron
  • Support Vector Machine (SVM) classifier
  • Wide margin
  • Cost function
  • Slack variables
  • Loss functions revisited

Binary Classification

Given training data $(x_i, y_i)$ for $i = 1 \ldots N$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, learn a classifier $f(x)$ such that

$$f(x_i) \begin{cases} \geq 0 & y_i = +1 \\ < 0 & y_i = -1 \end{cases}$$

i.e. $y_i f(x_i) > 0$ for a correct classification.

SLIDE 2

Linear separability

[Figure: examples of linearly separable and not linearly separable point sets]

Linear classifiers

A linear classifier has the form

$$f(x) = w^\top x + b$$

  • in 2D the discriminant is a line
  • $w$ is the normal to the plane, and $b$ the bias
  • $w$ is known as the weight vector

[Figure: in the $(x_1, x_2)$ plane, the line $f(x) = 0$ separates the regions $f(x) > 0$ and $f(x) < 0$]
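To make this concrete, a minimal sketch of the decision rule in Python; the weight vector and bias below are arbitrary illustrative values, not taken from the lecture:

```python
# Minimal linear classifier: f(x) = w^T x + b, class = sign(f(x)).
import numpy as np

w = np.array([1.0, -2.0])   # weight vector (normal to the line in 2D)
b = 0.5                     # bias

def f(x):
    return w @ x + b

def classify(x):
    # y = +1 when f(x) >= 0, y = -1 otherwise (convention from slide 1)
    return 1 if f(x) >= 0 else -1

print(classify(np.array([2.0, 0.0])))   # -> 1
```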

SLIDE 3

Linear classifiers

A linear classifier has the form

$$f(x) = w^\top x + b$$

  • in 3D the discriminant is a plane, and in nD it is a hyperplane

For a K-NN classifier it was necessary to 'carry' the training data. For a linear classifier, the training data is used to learn w and then discarded; only w is needed for classifying new data.

Given linearly separable data $x_i$ labelled into two categories $y_i \in \{-1, 1\}$, find a weight vector w such that the discriminant function

$$f(x_i) = w^\top x_i + b$$

separates the categories for $i = 1, \ldots, N$

  • how can we find this separating hyperplane?

Reminder: The Perceptron Classifier

The Perceptron Algorithm

Write the classifier as

$$f(x_i) = \tilde{w}^\top \tilde{x}_i + w_0 = w^\top x_i$$

where $w = (\tilde{w}, w_0)$ and $x_i = (\tilde{x}_i, 1)$.

  • Initialize w = 0
  • Cycle through the data points { xi, yi }
  • if xi is misclassified then w ← w + α yi xi
  • Until all the data is correctly classified

SLIDE 4

For example in 2D

[Figure: weight vector w before and after an update; for a misclassified negative point (yi = −1) the update is w ← w − α xi]

NB after convergence $w = \sum_i^N \alpha_i y_i x_i$

  • if the data is linearly separable, then the algorithm will converge
  • convergence can be slow …
  • the separating line can end up close to the training data
  • we would prefer a larger margin for generalization

[Figure: perceptron example – two runs in which the learned separating line passes close to the training points]
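A sketch of the algorithm in Python, assuming linearly separable data (otherwise the loop below never terminates); the function name and learning rate are illustrative:

```python
# Perceptron on augmented vectors: w = (w~, w0), x_i -> (x_i, 1).
import numpy as np

def perceptron(X, y, alpha=1.0):
    N, d = X.shape
    Xa = np.hstack([X, np.ones((N, 1))])  # append 1 to each x_i
    w = np.zeros(d + 1)                   # initialize w = 0
    while True:                           # until all data is correct
        mistakes = 0
        for xi, yi in zip(Xa, y):         # cycle through the data points
            if yi * (w @ xi) <= 0:        # x_i is misclassified
                w += alpha * yi * xi      # w <- w + alpha y_i x_i
                mistakes += 1
        if mistakes == 0:
            return w[:d], w[d]            # w~ and the bias w0
```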

SLIDE 5

What is the best w?

  • maximum margin solution: most stable under perturbations of the inputs

Support Vector Machine

[Figure: maximum-margin hyperplane $w^\top x + b = 0$ at distance $\frac{b}{||w||}$ from the origin, with the support vectors highlighted]

The classifier is a sum over the support vectors:

$$f(x) = \sum_i \alpha_i y_i (x_i^\top x) + b$$
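A sketch of evaluating this form in Python; `alpha`, `y_sv`, `X_sv`, and `b` stand for the learned coefficients, support-vector labels, support vectors, and bias (illustrative names, not from the lecture):

```python
# f(x) = sum_i alpha_i y_i (x_i^T x) + b over the support vectors.
import numpy as np

def f(x, alpha, y_sv, X_sv, b):
    return float((alpha * y_sv * (X_sv @ x)).sum() + b)
```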

SLIDE 6

SVM – sketch derivation

  • Since $w^\top x + b = 0$ and $c(w^\top x + b) = 0$ define the same plane, we have the freedom to choose the normalization of w
  • Choose normalization such that $w^\top x_+ + b = +1$ and $w^\top x_- + b = -1$ for the positive and negative support vectors respectively
  • Then subtracting the two equations gives $w^\top (x_+ - x_-) = 2$, so the margin is given by

$$\frac{w^\top (x_+ - x_-)}{||w||} = \frac{2}{||w||}$$

[Figure: Support Vector Machine – support vectors on the planes $w^\top x + b = 1$ and $w^\top x + b = -1$, either side of $w^\top x + b = 0$; Margin $= \frac{2}{||w||}$]

SLIDE 7

SVM – Optimization

  • Learning the SVM can be formulated as an optimization:

$$\max_w \frac{2}{||w||} \quad \text{subject to} \quad w^\top x_i + b \begin{cases} \geq 1 & \text{if } y_i = +1 \\ \leq -1 & \text{if } y_i = -1 \end{cases} \quad \text{for } i = 1 \ldots N$$

  • Or equivalently

$$\min_w ||w||^2 \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \geq 1 \quad \text{for } i = 1 \ldots N$$

  • This is a quadratic optimization problem subject to linear constraints, and there is a unique minimum

SVM – Geometric Algorithm

  • Compute the convex hull of the positive points, and the convex hull of the negative points
  • For each pair of points, one on the positive hull and the other on the negative hull, compute the margin
  • Choose the largest margin
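A hedged sketch of the quadratic program in Python, using scipy's general-purpose SLSQP solver just to make the formulation executable (dedicated QP solvers are the usual choice; this assumes the data is separable so the constraints are feasible):

```python
# Hard-margin SVM: min ||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y):
    N, d = X.shape
    objective = lambda wb: wb[:d] @ wb[:d]        # ||w||^2
    cons = [{'type': 'ineq',
             'fun': lambda wb, i=i: y[i] * (X[i] @ wb[:d] + wb[d]) - 1}
            for i in range(N)]
    res = minimize(objective, np.zeros(d + 1), constraints=cons)
    return res.x[:d], res.x[d]                    # w, b
```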
SLIDE 8

Geometric SVM Ex I

[Figure: two separable classes; the support vectors lie on the class convex hulls]

  • only need to consider points on the hull (internal points irrelevant) for separation
  • hyperplane defined by support vectors

Geometric SVM Ex II

[Figure: a second configuration, with three support vectors on the hulls]

  • only need to consider points on the hull (internal points irrelevant) for separation
  • hyperplane defined by support vectors
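A sketch of the first step in Python, assuming scipy is available; it shows that only the hull vertices of each class remain as candidates:

```python
# Only convex-hull vertices can be support vectors; internal
# points of each class are irrelevant for separation.
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(1)
pos = rng.normal(+2.0, 1.0, size=(30, 2))
neg = rng.normal(-2.0, 1.0, size=(30, 2))
hull_pos = pos[ConvexHull(pos).vertices]   # internal points dropped
hull_neg = neg[ConvexHull(neg).vertices]
print(len(hull_pos), len(hull_neg))        # far fewer candidate points
```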

SLIDE 9

Linear separability again: What is the best w?

  • the points can be linearly separated, but there is a very narrow margin
  • but possibly the large margin solution is better, even though one constraint is violated

In general there is a trade-off between the margin and the number of mistakes on the training data.

Introduce "slack" variables $\xi_i \geq 0$ for misclassified points

[Figure: margin $\frac{2}{||w||}$ between $w^\top x + b = \pm 1$ around $w^\top x + b = 0$; points satisfying the margin have $\xi = 0$, while a violating point (one misclassified point shown) lies a distance $\frac{\xi_i}{||w||}$ beyond its margin plane]

  • for 0 ≤ ξ ≤ 1 the point is between the margin and the correct side of the hyperplane
  • for ξ > 1 the point is misclassified
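A sketch of computing the slack values in Python, assuming a linear classifier (w, b) has already been chosen; the categorisation in the bullets above falls out directly:

```python
# Slack values: xi_i = max(0, 1 - y_i (w^T x_i + b)).
import numpy as np

def slacks(X, y, w, b):
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    # xi == 0      : margin satisfied
    # 0 < xi <= 1  : inside margin, correct side of hyperplane
    # xi > 1       : misclassified
    return xi
```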
SLIDE 10

“Soft” margin solution

The optimization problem becomes

$$\min_{w \in \mathbb{R}^d, \; \xi_i \in \mathbb{R}^+} ||w||^2 + C \sum_i^N \xi_i \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \geq 1 - \xi_i \quad \text{for } i = 1 \ldots N$$

  • Every constraint can be satisfied if ξi is sufficiently large
  • C is a regularization parameter:
    — small C allows constraints to be easily ignored → large margin
    — large C makes constraints hard to ignore → narrow margin
    — C = ∞ enforces all constraints: hard margin
  • This is still a quadratic optimization problem and there is a unique minimum. Note, there is only one parameter, C.

[Figure: feature x vs feature y – the data is linearly separable, but only with a narrow margin]
SLIDE 11

[Figure: decision boundaries for C = ∞ (hard margin) and C = 10 (soft margin)]
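A hedged sketch of this comparison using scikit-learn's linear SVC (an assumption: the lecture itself used a MATLAB demo); a very large C approximates the hard-margin case:

```python
# Effect of C: ~hard margin (C huge) versus soft margin (C = 10).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)) + [2, 2],
               rng.normal(0, 1, (20, 2)) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

for C in [1e6, 10.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, len(clf.support_))   # number of support vectors
```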

SLIDE 12

Application: Pedestrian detection in Computer Vision

Objective: detect (localize) standing humans in an image

  • cf. face detection with a sliding window classifier
  • reduces object detection to binary classification
  • does an image window contain a person or not?

Method: the HOG detector

Training data and features

  • Positive data – 1208 positive window examples
  • Negative data – 1218 negative window examples (initially)

SLIDE 13

Feature: histogram of oriented gradients (HOG)

  • tile the window into 8 x 8 pixel cells
  • each cell is represented by a histogram of gradient orientations (HOG)

Feature vector dimension = 16 x 8 (for tiling) x 8 (orientations) = 1024

[Figure: image patch, its dominant gradient direction, and the HOG – frequency versus orientation]
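A sketch of computing such a descriptor with scikit-image (an assumption: the original detector used its own implementation with block normalisation; the parameters below just mirror the slide's dimension count):

```python
# HOG for a 128x64 window: 16x8 cells of 8x8 pixels, 8 orientations.
import numpy as np
from skimage.feature import hog

window = np.random.rand(128, 64)        # grayscale detection window
feat = hog(window,
           orientations=8,              # 8 orientation bins per cell
           pixels_per_cell=(8, 8),      # tile into 8x8 pixel cells
           cells_per_block=(1, 1))      # no block normalisation here
print(feat.shape)                       # (1024,) = 16 * 8 * 8
```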
SLIDE 14

Algorithm

Training (Learning)

  • Represent each example window by a HOG feature vector $x_i \in \mathbb{R}^d$, with d = 1024
  • Train a SVM classifier $f(x) = w^\top x + b$

Testing (Detection)

  • Sliding window classifier

[Figure: averaged positive training examples]
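A hedged sketch of the sliding-window step in Python (helper names are hypothetical; a real detector adds an image pyramid for scale and non-maximum suppression):

```python
# Score every 128x64 window with the learned f(x) = w^T x + b.
import numpy as np
from skimage.feature import hog

def detect(image, w, b, step=8):
    H, W = image.shape
    hits = []
    for r in range(0, H - 128 + 1, step):
        for c in range(0, W - 64 + 1, step):
            x = hog(image[r:r+128, c:c+64], orientations=8,
                    pixels_per_cell=(8, 8), cells_per_block=(1, 1))
            if w @ x + b > 0:           # window contains a person?
                hits.append((r, c))
    return hits
```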

SLIDE 15

Learned model (Dalal and Triggs, CVPR 2005)

[Figure: visualisation of the learned weight vector w of $f(x) = w^\top x + b$; slide from Deva Ramanan]

SLIDE 16


Optimization

Learning an SVM has been formulated as a constrained optimization problem over w and ξ:

$$\min_{w \in \mathbb{R}^d, \; \xi_i \in \mathbb{R}^+} ||w||^2 + C \sum_i^N \xi_i \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \geq 1 - \xi_i \quad \text{for } i = 1 \ldots N$$

The constraint $y_i \left( w^\top x_i + b \right) \geq 1 - \xi_i$ can be written more concisely as

$$y_i f(x_i) \geq 1 - \xi_i$$

which, since ξi ≥ 0 and the objective drives each ξi down to the smallest value satisfying the constraint, is equivalent to

$$\xi_i = \max(0, 1 - y_i f(x_i))$$

Hence the learning problem is equivalent to the unconstrained optimization problem

$$\min_{w \in \mathbb{R}^d} \underbrace{||w||^2}_{\text{regularization}} + \underbrace{C \sum_i^N \max(0, 1 - y_i f(x_i))}_{\text{loss function}}$$
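A hedged sketch of minimising this unconstrained objective by subgradient descent (not the QP route above; step size, epoch count, and names are illustrative assumptions):

```python
# min ||w||^2 + C sum_i max(0, 1 - y_i (w^T x_i + b)).
import numpy as np

def train_svm_subgradient(X, y, C=10.0, lr=1e-3, epochs=200):
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1     # points with positive hinge loss
        # subgradient of the objective: -y_i x_i for each violator
        grad_w = 2 * w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```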

SLIDE 17

Loss function

$$\min_{w \in \mathbb{R}^d} ||w||^2 + C \sum_i^N \max(0, 1 - y_i f(x_i))$$

[Figure: margin with support vectors on either side of $w^\top x + b = 0$]

Points are in three categories:

  1. $y_i f(x_i) > 1$: point is outside the margin. No contribution to the loss.
  2. $y_i f(x_i) = 1$: point is on the margin. No contribution to the loss, as in the hard margin case.
  3. $y_i f(x_i) < 1$: point violates the margin constraint. Contributes to the loss.

Loss functions

  • SVM uses the “hinge” loss $\max(0, 1 - y_i f(x_i))$
  • an approximation to the 0–1 loss

[Figure: hinge loss plotted against $y_i f(x_i)$]
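A tiny sketch checking that the hinge loss upper-bounds the 0–1 loss at every margin value:

```python
import numpy as np

yf = np.linspace(-2, 3, 11)             # margin values y_i f(x_i)
hinge = np.maximum(0.0, 1.0 - yf)       # SVM "hinge" loss
zero_one = (yf < 0).astype(float)       # 0-1 misclassification loss
assert np.all(hinge >= zero_one)        # hinge upper-bounds 0-1
```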

SLIDE 18

Background reading and more …

  • Next lecture – see that the SVM can be expressed as a sum over the support vectors:

$$f(x) = \sum_i \alpha_i y_i (x_i^\top x) + b$$

  • On web page: http://www.robots.ox.ac.uk/~az/lectures/ml
    — links to SVM tutorials and video lectures
    — MATLAB SVM demo