The Support Vector Machine
Nuno Vasconcelos (Ken Kreutz-Delgado)
UC San Diego
Classification
A classification problem involves two types of variables:
- X: a vector of observations (features) from the world
- Y: the state (class) of the world
We assume that the relationship between X and Y can be well approximated by an "optimal" classifier function f, and the goal of learning is to find a function h that approximates the unknown optimal classifier f, i.e. h ≈ f.
The optimal solution is the Bayes decision rule, which requires knowledge of the class posterior probabilities $P_{Y|X}(i|x)$. Estimating full probability models only to read off the resulting decision boundaries is a computationally intractable (hence bad) strategy.
The guiding principle (due to Vapnik): when solving a problem, do not solve a more general (and thus usually much harder) problem as an intermediate step!
The discriminant (direct) approach is therefore:
1. Postulate a (parametric) family of decision boundaries.
2. Pick the element in this family that produces the best classifier.
Example: two Gaussian classes with means $\mu_0, \mu_1$, common covariance $\Sigma$, and equal priors. The BDR assigns x to the class of smallest Mahalanobis distance. Expanding
$$(x-\mu_i)^T\Sigma^{-1}(x-\mu_i) = x^T\Sigma^{-1}x - 2\mu_i^T\Sigma^{-1}x + \mu_i^T\Sigma^{-1}\mu_i,$$
the quadratic term $x^T\Sigma^{-1}x$ is common to both classes and cancels, so the decision boundary is a hyperplane.
[Figure: a separating hyperplane with normal w through a point x0, and sample points x1, x2, x3, ..., xn.]
For this problem the BDR is
$$h^*(x) = \begin{cases} 0, & \text{if } (x-\mu_0)^T\Sigma^{-1}(x-\mu_0) < (x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\\ 1, & \text{if } (x-\mu_0)^T\Sigma^{-1}(x-\mu_0) > (x-\mu_1)^T\Sigma^{-1}(x-\mu_1) \end{cases}$$
which, after the cancellation above, is equivalent to
$$h^*(x) = \begin{cases} 1, & \text{if } g(x) > 0\\ 0, & \text{if } g(x) < 0 \end{cases}$$
with the linear discriminant
$$g(x) = w^T(x - x_0), \qquad w = \Sigma^{-1}(\mu_1 - \mu_0), \qquad x_0 = \tfrac{1}{2}(\mu_0 + \mu_1).$$
Geometrically,
$$g(x) = \|w\|\,\|x - x_0\|\cos\theta,$$
where θ is the angle between w and x - x0, so the sign of g(x) tells us on which side of the hyperplane (through x0, with normal w) the point x lies.
[Figure: the vector x - x0, the normal w, and the angle θ between them.]
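To make this equivalence concrete, here is a small numpy sketch, with hypothetical means and covariance, checking that the linear rule above reproduces the Mahalanobis-distance BDR:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class parameters (common covariance, equal priors)
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sinv = np.linalg.inv(Sigma)

# Linear discriminant from the text: g(x) = w^T (x - x0)
w = Sinv @ (mu1 - mu0)
x0 = 0.5 * (mu0 + mu1)

def bdr(x):
    """Mahalanobis-distance rule: class 1 iff x is closer to mu1 than to mu0."""
    d0 = (x - mu0) @ Sinv @ (x - mu0)
    d1 = (x - mu1) @ Sinv @ (x - mu1)
    return 1 if d0 > d1 else 0

# The linear rule h(x) = 1{g(x) > 0} agrees with the BDR everywhere
for x in rng.normal(size=(200, 2)) * 3.0:
    assert bdr(x) == (1 if w @ (x - x0) > 0 else 0)
print("linear discriminant matches the Mahalanobis-distance BDR")
```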
Dividing by ‖w‖ gives
$$\frac{g(x)}{\|w\|} = \frac{w^T}{\|w\|}(x - x_0),$$
the projection of x - x0 onto the unit vector in the direction of w: g(x)/‖w‖ is therefore the (signed) distance from x to the hyperplane.
[Figure: the component of x - x0 along the unit normal w/‖w‖ is the distance from x to the plane.]
More generally, any linear discriminant
$$g(x) = w^T x + b, \qquad h^*(x) = \begin{cases} 1, & \text{if } g(x) > 0\\ 0, & \text{if } g(x) < 0 \end{cases}$$
has as its boundary the hyperplane with:
- normal vector w;
- distance |b|/‖w‖ to the origin (attained at the point of the plane closest to the origin);
- signed distance g(x)/‖w‖ from any point x to the boundary.
This suggests learning the hyperplane parameters (w, b) directly from the data, instead of first estimating class probabilities and covariances and then deriving the boundary from them.
It is convenient to relabel the classes as y ∈ {-1, +1}, with y = 1 for points on the positive side of the boundary and y = -1 for points on the negative side. The decision rule becomes
$$h^*(x) = \begin{cases} 1, & \text{if } g(x) > 0\\ -1, & \text{if } g(x) < 0 \end{cases} \;\Longleftrightarrow\; h^*(x) = \operatorname{sgn}\left[g(x)\right].$$
With this convention a point (x, y) is correctly classified if and only if $y\,g(x) > 0$, since correct decisions correspond to either (y = 1 and g(x) > 0) or (y = -1 and g(x) < 0).
Given a training set {(x_i, y_i)}, i = 1, ..., n, the hyperplane (w, b) separates the data if
$$y_i\,(w^T x_i + b) > 0, \qquad \forall i = 1, \dots, n.$$
The margin of the hyperplane is its distance to the closest training point,
$$\gamma = \min_i \frac{|w^T x_i + b|}{\|w\|},$$
and the data are linearly separable by (w, b) if and only if $y_i(w^T x_i + b) > 0$ for all i, i.e. if and only if γ > 0.
Since (w, b) can be rescaled without changing the hyperplane, we adopt the normalization
$$\min_i \left|w^T x_i + b\right| \equiv 1,$$
under which the margin is simply
$$\gamma = \frac{1}{\|w\|}.$$
Maximizing the margin is therefore equivalent to minimizing ‖w‖ (or ‖w‖²/2) subject to the separation constraints.
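The margin of a given separating hyperplane can be computed directly from this definition. A minimal numpy sketch on hypothetical toy data (w and b are simply assumed here, not learned):

```python
import numpy as np

def margin(w, b, X, y):
    """gamma = min_i |w.x_i + b| / ||w||, valid only if (w, b) separates the data."""
    g = X @ w + b
    assert np.all(y * g > 0), "hyperplane does not separate the data"
    return np.min(np.abs(g)) / np.linalg.norm(w)

# Hypothetical separable toy data and hyperplane
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0
print(margin(w, b, X, y))   # distance from the plane to the closest point
```

Rescaling (w, b) so that min_i |wᵀx_i + b| = 1 leaves both the hyperplane and this value unchanged, after which the function returns exactly 1/‖w‖.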
Maximum margin: we look for the separating hyperplane of largest margin, i.e. we solve
$$\min_{w,b}\;\frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left(w^T x_i + b\right) \ge 1,\;\forall i.$$
Why is this a good idea? Think of each point in the training set as a sample from a probability density centered on it: if we were to resample the training set we would not get the same points, so each point is better thought of as a pdf with a certain variance (the sum of these "point pdfs" provides a density estimate, a so-called "kernel estimate"). If the hyperplane has a large margin with respect to the training set, we are safe against this "resampling" uncertainty (as long as the radius of support of each point pdf is smaller than the margin γ).
What really matters is the classifier when applied to new data! The hyperplane itself is an uncertain estimate because it is learned from random data samples: from draw to draw, the hyperplane parameters are random variables with a probability distribution over possible hyperplanes. The larger the margin, the larger the number of hyperplanes that will not originate errors on the data.
The larger the value of γ, the larger the variance allowed on the plane parameter estimates!
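The "safe against resampling" argument can be illustrated numerically: if every point satisfies the canonical constraints, then jittering the points by any perturbation of radius smaller than the margin cannot flip the sign of g. A minimal numpy sketch with hypothetical data and a hypothetical hyperplane:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperplane in canonical form: min_i |w.x_i + b| = 1 would give margin 1/||w||
w = np.array([2.0, -1.0])
b = 0.5
margin = 1.0 / np.linalg.norm(w)

# Hypothetical training points satisfying y_i (w.x_i + b) >= 1
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 1.0]])
y = np.sign(X @ w + b)

# "Resample": jitter every point by noise of radius strictly below the margin
for _ in range(1000):
    noise = rng.normal(size=X.shape)
    noise *= 0.99 * margin / np.linalg.norm(noise, axis=1, keepdims=True)
    assert np.all(np.sign((X + noise) @ w + b) == y)   # labels never flip
print("no sign flips for perturbations smaller than the margin")
```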
This is a quadratic program with linear inequality constraints. Rather than solving it directly, we solve its dual problem, which is easier to solve.
Introduce Lagrange multipliers α_i ≥ 0, one for each constraint, and solve
$$\max_{\alpha \ge 0}\;\min_{w,b}\; L(w, b, \alpha),$$
where
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\left[y_i\left(w^T x_i + b\right) - 1\right]$$
is the Lagrangian. Setting the derivatives with respect to w and b to zero and substituting back leads to the dual problem
$$\max_{\alpha \ge 0}\;\left\{\sum_i \alpha_i - \frac{1}{2}\sum_{ij}\alpha_i\alpha_j\,y_i y_j\,x_i^T x_j\right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0,$$
whose solution determines the optimal hyperplane normal
$$w^* = \sum_i \alpha_i^* y_i\, x_i.$$
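For intuition, here is a minimal sketch of solving this dual numerically with a generic solver (scipy's SLSQP) on hypothetical toy data; the data, variable names, and the choice of solver are illustrative assumptions, and real SVM packages use specialized QP/SMO solvers instead:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # negative of the dual objective (we minimize)
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    method="SLSQP",
    bounds=[(0.0, None)] * n,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x
w_star = (alpha * y) @ X                        # w* = sum_i alpha_i y_i x_i
print("alpha:", np.round(alpha, 3), "w*:", np.round(w_star, 3))
```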
It remains to determine b*. Under the normalization min_i |wᵀx_i + b| = 1 there is always at least one point "on the margin" on each side; otherwise the hyperplane could be shifted and rescaled to obtain an even larger margin (as illustrated in the figure). Let x⁺ be such a point from the y = +1 class and x⁻ one from the y = -1 class. Then
$$\left.\begin{aligned} {w^*}^T x^+ + b^* &= +1\\ {w^*}^T x^- + b^* &= -1 \end{aligned}\right\}\;\Longleftrightarrow\; b^* = -\frac{1}{2}\left({w^*}^T x^+ + {w^*}^T x^-\right).$$
[Figure: the separating hyperplane with margin 1/‖w*‖ on each side and points x⁺, x⁻ on the margin.]
The complementary slackness (Karush-Kuhn-Tucker) conditions state that at the optimum, for each training point, either
- α_i = 0 and $y_i({w^*}^T x_i + b^*) > 1$ (the point is strictly outside the margin), or
- α_i > 0 and $y_i({w^*}^T x_i + b^*) = 1$ (the point is exactly on the margin).
The points with α_i > 0 are called support vectors; all other points have α_i = 0 and do not affect the solution.
The decision function is therefore
$$f(x) = \operatorname{sgn}\left[{w^*}^T x + b^*\right] = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, x_i^T x \;-\; \frac{1}{2}\sum_{i\in SV}\alpha_i^* y_i\, x_i^T\left(x^+ + x^-\right)\right],$$
which can be written compactly as
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, x_i^T\!\left(x - \frac{x^+ + x^-}{2}\right)\right].$$
Only the support vectors (α_i > 0) appear in this expression; all points with α_i = 0 can be discarded after training.
[Figure: support vectors (α_i > 0) on the margin; all remaining points have α_i = 0.]
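Continuing the hypothetical sketch above, b* can be recovered from any support vector and the decision function evaluated directly; the helper name svm_predict and the averaging over support vectors are illustrative choices, not part of the slides:

```python
import numpy as np

def svm_predict(Xq, X, y, alpha, tol=1e-6):
    """Evaluate f(x) = sgn(sum_i alpha_i y_i x_i^T x + b*) for query points Xq.

    X, y, alpha are the training data and dual solution (e.g. from the
    SLSQP sketch above); support vectors are the points with alpha_i > tol.
    """
    sv = alpha > tol                                 # support vector indices
    w = (alpha[sv] * y[sv]) @ X[sv]                  # w* = sum_{SV} alpha_i y_i x_i
    # each support vector satisfies y_i (w^T x_i + b) = 1, i.e. b = y_i - w^T x_i;
    # averaging over all support vectors is a numerical-stability choice that,
    # at the exact optimum, coincides with the x+/x- formula in the text
    b = np.mean(y[sv] - X[sv] @ w)
    return np.sign(Xq @ w + b)
```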
This also sidesteps the curse of dimensionality: the number of samples required for a given precision of pdf estimation, and of pdf-based classification, is exponential in the number of dimensions. For the SVM, although the number of dimensions may be large, the number of parameters is relatively small and there is not much room for overfitting.
In fact, d+1 points are enough to specify the decision rule in R^d!
To see this, let's look at the decision function
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, x_i^T x + b^*\right].$$
This is a thresholding of the quantity
$$\sum_{i\in SV}\alpha_i^* y_i\, x_i^T x.$$
Note that each of the terms $x_i^T x$ is the projection (more precisely, the inner product) of the vector which we wish to classify, x, onto the support vector x_i.
Let $x_{i_1}, \dots, x_{i_k}$ be the support vectors and collect these inner products into the vector
$$z(x) = \left(x_{i_1}^T x, \dots, x_{i_k}^T x\right)^T.$$
The decision function can then be written as
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, x_i^T x + b^*\right] = \operatorname{sgn}\left[\tilde w^{*T} z(x) + b^*\right], \qquad \tilde w^* = \left(\alpha_{i_1}^* y_{i_1}, \dots, \alpha_{i_k}^* y_{i_k}\right)^T.$$
The classifier $(\tilde w^*, b^*)$ acts on z(x): the decision depends on x only through its inner products with the support vectors, i.e. only on the component of x in the span of the support vectors.
[Figure: a query point x and its projections onto the support vectors x_i.]
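A small numpy check of this reformulation, using a hypothetical dual solution (support vectors, multipliers, and offset are made up for illustration):

```python
import numpy as np

# Hypothetical dual solution: support vectors, labels, multipliers, offset
Xsv = np.array([[2.0, 2.0], [-2.0, -1.0]])   # support vectors x_{i_1}, ..., x_{i_k}
ysv = np.array([1.0, -1.0])
a_sv = np.array([0.4, 0.4])                  # alpha_i^* for the support vectors
b = -0.2                                     # b* (hypothetical)

x = np.array([1.0, 0.5])                     # point to classify

# Direct evaluation: sgn( sum_i alpha_i y_i x_i^T x + b )
f_direct = np.sign(np.sum(a_sv * ysv * (Xsv @ x)) + b)

# Reformulation: z(x) = (x_{i_1}^T x, ..., x_{i_k}^T x),  w~ = (alpha_i y_i)
z = Xsv @ x
w_tilde = a_sv * ysv
f_span = np.sign(w_tilde @ z + b)

assert f_direct == f_span   # the two forms agree
```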
SVM summary (linearly separable case). Solve the dual problem
$$\max_{\alpha \ge 0}\;\left\{\sum_i \alpha_i - \frac{1}{2}\sum_{ij}\alpha_i\alpha_j\,y_i y_j\,x_i^T x_j\right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0,$$
which yields the "large margin" linear discriminant function:
$$w^* = \sum_{i\in SV}\alpha_i^* y_i\, x_i, \qquad b^* = -\frac{1}{2}\sum_{i\in SV}\alpha_i^* y_i\, x_i^T\left(x^+ + x^-\right),$$
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, x_i^T x + b^*\right].$$
With a linearly separable training set this ("hard" margin) SVM can always be found. In practice, however, the data is often not separable: there may be points that cross over to the wrong side of the boundary, or that are closer to the boundary than the margin. So how do we handle the latter set of points?
The hard-margin problem
$$\min_{w,b}\;\frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left(w^T x_i + b\right) \ge 1,\;\forall i$$
is relaxed by introducing a slack variable ξ_i ≥ 0 for each constraint, so that the constraints become
$$y_i\left(w^T x_i + b\right) \ge 1 - \xi_i.$$
A point with ξ_i > 0 violates the margin by the amount ξ_i/‖w*‖.
[Figure: margins at distance 1/‖w*‖ on each side of the boundary, and a violating point x_i a distance ξ_i/‖w*‖ inside the margin.]
The soft-margin SVM then solves
$$\min_{w,b,\xi}\;\frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\left(w^T x_i + b\right) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\forall i,$$
where the constant C controls the trade-off between a large margin and a small total amount of slack.
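For fixed (w, b) the optimal slack of each constraint is just the hinge term max(0, 1 - y_i(wᵀx_i + b)), so the soft-margin objective can be evaluated in closed form. A minimal numpy sketch on hypothetical data and a hypothetical hyperplane:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i xi_i, with xi_i = max(0, 1 - y_i (w.x_i + b))."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # per-point slack (hinge)
    return 0.5 * w @ w + C * xi.sum()

# Hypothetical, non-separable toy data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(soft_margin_objective(np.array([1.0, 1.0]), 0.0, X, y, C=1.0))
```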
The dual of the soft-margin problem is
$$\max_{\alpha}\;\left\{\sum_i \alpha_i - \frac{1}{2}\sum_{ij}\alpha_i\alpha_j\,y_i y_j\,x_i^T x_j\right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0,\;\; 0 \le \alpha_i \le C,$$
i.e. the only change with respect to the hard-margin dual is the box constraint α_i ≤ C. The decision function keeps the same form,
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, x_i^T x + b^*\right],$$
where b* can be computed from the points x_i with 0 < α_i < C, which sit exactly on the margin; points with α_i = 0 lie outside the margin, and points with α_i = C violate it.
The upper bound C caps each multiplier, which prevents any single support vector outlier from having an unduly large impact on the decision rule.
[Figure: points outside the margin (α_i = 0), on the margin (0 < α_i < C), and violating the margin (α_i = C).]
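As a short illustration (hypothetical toy data), scikit-learn's SVC, which wraps LIBSVM, exposes the dual solution directly: dual_coef_ stores the products y_i α_i for the support vectors, so the effect of the bound C on the multipliers can be inspected:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data with one overlapping point per class
X = np.array([[2.0, 2.0], [3.0, 1.0], [1.0, -0.5],
              [-2.0, -1.0], [-1.0, -3.0], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    a = np.abs(clf.dual_coef_)            # |y_i alpha_i| = alpha_i for support vectors
    print(f"C={C:6}: {len(clf.support_)} support vectors, "
          f"alpha in [{a.min().round(3)}, {a.max().round(3)}]")
```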
The kernel trick: both the dual problem and the decision function depend on the data only through the inner products $x_i^T x_j$. We can therefore map the data through a feature transformation φ and replace every inner product with a kernel
$$K(x_i, x_j) = \varphi(x_i)^T\varphi(x_j).$$
[Figure: the feature map φ sends the input points x_1, x_2, ..., x_n into a feature space where the classes become linearly separable.]
The dual problem becomes
$$\max_{\alpha}\;\left\{\sum_i \alpha_i - \frac{1}{2}\sum_{ij}\alpha_i\alpha_j\,y_i y_j\,K(x_i, x_j)\right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0,\;\; 0 \le \alpha_i \le C,$$
with
$$b^* = -\frac{1}{2}\sum_{i\in SV}\alpha_i^* y_i\left[K(x_i, x^+) + K(x_i, x^-)\right]$$
and the decision function
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, K(x_i, x) + b^*\right].$$
Nothing in the derivation really changes: we could have simply used ⟨x_i, x_j⟩ to denote the inner product all along, and all of the results above would still hold. The only difference is that we can no longer recover w* explicitly without determining the feature transformation φ, since
$$w^* = \sum_{i\in SV}\alpha_i^* y_i\,\varphi(x_i),$$
which is, for example, a sum of Gaussians ("lives" in an infinite-dimensional function space) when we use the Gaussian kernel. We therefore work directly with the decision function
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, K(x_i, x) + b^*\right].$$
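A minimal numpy sketch of this kernel decision function with a Gaussian (RBF) kernel; the support vectors, multipliers, offset, and kernel width below are hypothetical values used only for illustration:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2.0 * sigma ** 2))

def kernel_svm_decision(x, Xsv, ysv, alpha_sv, b, sigma=1.0):
    """f(x) = sgn( sum_{i in SV} alpha_i y_i K(x_i, x) + b* )."""
    return np.sign(np.sum(alpha_sv * ysv * rbf_kernel(Xsv, x, sigma)) + b)

# Hypothetical support vectors, multipliers, and offset
Xsv = np.array([[1.0, 1.0], [-1.0, -1.0]])
ysv = np.array([1.0, -1.0])
alpha_sv = np.array([0.7, 0.7])
b = 0.0
print(kernel_svm_decision(np.array([0.8, 1.2]), Xsv, ysv, alpha_sv, b))
```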
Kernel selection: there is no generic "optimal" procedure to find the kernel or its parameters. In practice, one usually starts with a default kernel, e.g. the Gaussian, and then determines the kernel parameters, e.g. the variance, by trial and error.
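In practice this trial and error is often automated with cross-validation. A short sketch with scikit-learn's GridSearchCV on hypothetical toy data (note that scikit-learn parameterizes the Gaussian kernel by gamma, which plays the role of an inverse variance, gamma = 1/(2σ²)):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical toy problem: two noisy Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
               rng.normal(-1.0, 1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# Trial and error over the kernel width (gamma) and C, scored by cross-validation
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```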
Training an SVM amounts to solving a quadratic program, and good, heavily optimized packages already exist. Therefore, writing "your own" algorithm is not going to be competitive.
One example is SvmFu: http://five-percent-nation.mit.edu/SvmFu/
For further reading see, e.g., B. Schölkopf and A. Smola, Learning with Kernels, MIT Press, 2002.