Introduction to Machine Learning CMU-10701
Support Vector Machines
Barnabás Póczos & Aarti Singh 2014 Spring
Linear classifiers: which line is better?
http://barnabas-cmu-10701.appspot.com/
Which decision boundary is better?
(Figure: points of Class 1 and Class 2 separated by a linear decision boundary, with the margin marked as the gap to the closest points of each class.)
Data: training samples $(x_i, y_i)$, $i = 1, \dots, n$, with inputs $x_i \in \mathbb{R}^m$ and labels $y_i \in \{-1, +1\}$.
The hyperplane $w \cdot x + b = 0$ splits the input space into two half-spaces: $w \cdot x + b > 0$ on one side and $w \cdot x + b < 0$ on the other.
The weight vector $w$ is normal to the separating hyperplane.
Classify as: $+1$ if $w \cdot x + b \ge 1$; $-1$ if $w \cdot x + b \le -1$; Universe explodes if $-1 < w \cdot x + b < 1$.
Plus-plane: $\{x : w \cdot x + b = +1\}$. Minus-plane: $\{x : w \cdot x + b = -1\}$. Classifier boundary: $\{x : w \cdot x + b = 0\}$.
Classification rule: $\hat{y}(x) = \operatorname{sign}(w \cdot x + b)$. Goal: find the maximum margin classifier. How large is the margin of this classifier?
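A minimal sketch (my own illustration with made-up weights, not code from the lecture) of this classification rule in NumPy:

```python
# Linear classification rule sign(w . x + b); w, b and the test points below
# are hypothetical placeholders.
import numpy as np

def classify(X, w, b):
    """Return +1 / -1 labels for the rows of X under the rule sign(w . x + b)."""
    return np.where(X @ w + b >= 0, 1, -1)

w, b = np.array([1.0, -2.0]), 0.5
print(classify(np.array([[3.0, 1.0], [0.0, 2.0]]), w, b))   # -> [ 1 -1]
```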
The plus-plane is $\{x : w \cdot x + b = +1\}$ and the minus-plane is $\{x : w \cdot x + b = -1\}$.
Let $x^+$ be a point on the plus-plane and $x^-$ the closest point to it on the minus-plane, so $x^+ = x^- + \lambda w$ for some $\lambda > 0$.
From $w \cdot (x^- + \lambda w) + b = 1$ we get $-1 + \lambda\,(w \cdot w) = 1$, so $\lambda = \dfrac{2}{w \cdot w}$.
Margin width: $M = \|x^+ - x^-\| = \lambda \|w\| = \dfrac{2}{\sqrt{w \cdot w}}$.
Maximizing $M = \dfrac{2}{\sqrt{w \cdot w}}$ is the same as minimizing $w \cdot w$!
We can assume $b = 0$ (for example by appending a constant $1$ feature to $x$, so that the bias is absorbed into $w$).
The rule "classify as $+1$ if $w \cdot x_i + b \ge 1$ and $-1$ if $w \cdot x_i + b \le -1$" is the same as requiring $y_i (w \cdot x_i + b) \ge 1$ for every training point.
Hard-margin SVM (primal): find $w$ and $b$ to minimize $\tfrac{1}{2}\, w \cdot w$ subject to $y_i (w \cdot x_i + b) \ge 1$, $i = 1, \dots, n$.
This is a QP problem (m-dimensional): quadratic cost function, linear constraints.
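A minimal sketch (my own, assuming SciPy is available; the toy data are hypothetical, not from the lecture) that solves this small primal QP with a general-purpose constrained optimizer:

```python
# Hard-margin primal: minimize 1/2 w.w subject to y_i (w.x_i + b) >= 1.
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):                  # params = (w_1, ..., w_m, b)
    w = params[:-1]
    return 0.5 * w @ w

constraints = [{"type": "ineq",         # "ineq" means fun(params) >= 0
                "fun": lambda p, i=i: y[i] * (X[i] @ p[:-1] + p[-1]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w:", w, "b:", b, "margin:", 2.0 / np.linalg.norm(w))
```

For real problems one would use a dedicated QP or SVM solver; this is only to make the optimization problem concrete.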
Efficient algorithms exist for QP; in practice they often solve the dual problem instead of the primal.
Moving the constraints into the objective function gives the Lagrangian
$L(w, b, \alpha) = \tfrac{1}{2}\, w \cdot w - \sum_i \alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big]$, with multipliers $\alpha_i \ge 0$.
Solve: $\min_{w, b}\; \max_{\alpha \ge 0}\; L(w, b, \alpha)$.
The $i$-th constraint is active exactly when $\alpha_i > 0$.
Solving: set the derivatives of $L$ to zero:
$\partial L / \partial w = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$, and $\partial L / \partial b = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$.
When $\alpha_i > 0$, the constraint is tight: $y_i (w \cdot x_i + b) = 1$.
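The slides' derivation is shown only as pictures; as a bridge to the next slides, here is the standard substitution of $w = \sum_i \alpha_i y_i x_i$ back into the Lagrangian, which produces the dual objective:

```latex
\begin{aligned}
L(w, b, \alpha)
  &= \tfrac{1}{2}\, w \cdot w - \sum_i \alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big] \\
  &= \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
     - \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
     - b \underbrace{\sum_i \alpha_i y_i}_{=\,0}
     + \sum_i \alpha_i \\
  &= \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).
\end{aligned}
```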
Lagrange function (general form): for a problem $\min_w f(w)$ subject to $g_i(w) \le 0$, define $L(w, \alpha) = f(w) + \sum_i \alpha_i g_i(w)$ with $\alpha_i \ge 0$.
Primal problem: $\min_w \max_{\alpha \ge 0} L(w, \alpha)$, which equals the original constrained problem.
The Lagrange dual problem: $\max_{\alpha \ge 0} \min_w L(w, \alpha)$. Its value never exceeds the primal value (weak duality); for the SVM QP the two optima coincide (strong duality).
The dual of the hard-margin SVM is again a Quadratic Program, but now n-dimensional (one variable $\alpha_i$ per training sample):
maximize $\sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.
Lemma: the primal solution is recovered from the dual solution as $w = \sum_i \alpha_i y_i x_i$ and $b = y_j - w \cdot x_j$ for any $j$ with $\alpha_j > 0$.
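A minimal sketch (assuming the cvxopt package; the toy data are hypothetical, not from the lecture) of solving this dual QP numerically and recovering $w$ and $b$:

```python
# Hard-margin dual: min_a 1/2 a^T P a - 1^T a  s.t.  a >= 0,  y^T a = 0,
# with P_ij = y_i y_j (x_i . x_j).
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))               # -a <= 0, i.e. a >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))         # y^T a = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

w = (alpha * y) @ X                  # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                    # support vectors: alpha_i > 0
b_val = np.mean(y[sv] - X[sv] @ w)   # b from the tight constraints
print("support vectors:", np.where(sv)[0], "w:", w, "b:", b_val)
```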
Hard SVM assumes the samples are linearly separable...
What can we do if the data is not linearly separable???
If the data set is not linearly separable, then by adding new features (mapping the data into a larger feature space) it might become linearly separable.
Make up a new feature! Sort of… a feature computed from the existing ones: map each 1-D point $x_k$ to $(x_k,\, x_k^2)$.
In 1-D no single threshold on $x$ separates the two classes (the figure shows them interleaved around $x = 0$); after adding the squared feature, a straight line separates them. Separable! MAGIC!
Now drop this "augmented" data into our linear SVM.
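A minimal sketch (my own toy data, and scikit-learn's LinearSVC as the linear SVM; neither comes from the lecture) of exactly this trick:

```python
# 1-D data that no single threshold separates becomes linearly separable
# after augmenting each point x with the extra feature x^2.
import numpy as np
from sklearn.svm import LinearSVC

x = np.array([-3.0, -2.5, 2.5, 3.0, -0.5, 0.0, 0.5, 1.0])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])        # far from 0 -> +1, near 0 -> -1

X_orig = x.reshape(-1, 1)                          # original 1-D feature
X_aug = np.column_stack([x, x ** 2])               # augmented features (x, x^2)

svm = LinearSVC(C=1e4, max_iter=10_000)
print(svm.fit(X_orig, y).score(X_orig, y))         # below 1.0: not separable
print(svm.fit(X_aug, y).score(X_aug, y))           # 1.0: separable in (x, x^2)
```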
n points in general position in an (n−1)-dimensional space are always linearly separable by a hyperplane!
⇒ it is good to map the data into high-dimensional spaces.
Having n training points, is it always good enough to map the data into a feature space of dimension n−1? Even if we don't know how many test points we will get, or what they are... we might want to map our data into a huge (even infinite-dimensional) feature space.
Overfitting? Generalization error?... We don't care for now...
Use features of features of features of features….
Solutions: map the data into a high-dimensional feature space
⇒ the training samples become linearly separable in the feature space
⇒ hard SVM can be applied
⇒ but this risks overfitting...
(and hard SVM still assumes the samples are linearly separable)
The hard SVM problem can be rewritten without constraints:
$\min_{w, b}\; \tfrac{1}{2}\, w \cdot w + \sum_i \ell_{\infty}\big( y_i (w \cdot x_i + b) \big)$,
where the loss $\ell_{\infty}(z)$ is $\infty$ for a misclassification / margin violation ($z < 1$) and $0$ for a correct classification ($z \ge 1$).
We can try to solve the soft version of it: instead of hard constraints (which require the points to be linearly separable), your loss is only 1 instead of $\infty$ if you misclassify an instance:
$\min_{w, b}\; \tfrac{1}{2}\, w \cdot w + C \sum_i \ell_{0/1}\big( y_i (w \cdot x_i + b) \big)$,
where $\ell_{0/1}(z) = 1$ for a misclassification ($z < 0$) and $0$ for a correct classification ($z \ge 0$).
The 0-1 loss is not convex in $y f(x)$ ⇒ it is not convex in $w$, either... and we like only convex functions... Let us approximate it with convex functions!
Picture is taken from R. Herbrich
One convex upper bound is the hinge loss $\max\big(0,\, 1 - y_i (w \cdot x_i + b)\big)$, which gives the soft-SVM objective
$\min_{w, b}\; \tfrac{1}{2}\, w \cdot w + C \sum_i \max\big(0,\, 1 - y_i (w \cdot x_i + b)\big)$,
where $C > 0$ trades off the margin against the training errors.
The hinge loss upper bounds the 0-1 loss
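A quick numeric check (my own sketch, not from the slides) that the hinge loss indeed upper bounds the 0-1 loss at every margin value:

```python
# Hinge loss max(0, 1 - z) vs. 0-1 loss 1[z < 0], where z = y * f(x).
import numpy as np

z = np.linspace(-3, 3, 601)              # margin values y * f(x)
zero_one = (z < 0).astype(float)         # 0-1 loss
hinge = np.maximum(0.0, 1.0 - z)         # hinge loss
assert np.all(hinge >= zero_one)         # the upper bound holds everywhere
```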
Equivalently, introducing slack variables $\xi_i \ge 0$:
$\min_{w, b, \xi}\; \tfrac{1}{2}\, w \cdot w + C \sum_i \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$,
where at the optimum $\xi_i = \max\big(0,\, 1 - y_i (w \cdot x_i + b)\big)$. We can use this form, too.
What is the dual form of the primal soft SVM? It is
$\max_{\alpha}\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$,
where the only change from the hard-margin dual is the upper bound $\alpha_i \le C$ (box constraints).
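A minimal sketch (my own; plain subgradient descent on the hinge-loss form of the objective rather than the QP, with hypothetical toy data) of training the soft SVM:

```python
# Subgradient descent on  1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
import numpy as np

def soft_svm_subgradient(X, y, C=1.0, lr=0.01, epochs=500):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                               # inside the margin or misclassified
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])   # toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = soft_svm_subgradient(X, y)
print(np.sign(X @ w + b))               # should reproduce y on this toy set
```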
KKT conditions: at the optimum $\alpha_i \ge 0$, the primal constraints $y_i (w \cdot x_i + b) \ge 1$ hold, and complementary slackness $\alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big] = 0$ holds for every $i$.
(Figure: the decision boundary with the half-spaces $w \cdot x + b > 0$ and $w \cdot x + b < 0$, and the margin marked on both sides.)
Hard SVM: the linear hyperplane is defined by the "support vectors"; moving the other points a little doesn't affect the decision boundary. Only the support vectors are needed to predict the labels of new points.
Only a few $\alpha_j$ can be non-zero: exactly those where the constraint is tight, i.e. $y_j (\langle w, x_j \rangle + b) = 1$. The points with $\alpha_j > 0$ are the support vectors; for all other points $\alpha_j = 0$.
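A minimal sketch (assuming scikit-learn; random toy data of my own) showing that a trained SVM keeps only a few support vectors, which are all the predictor needs:

```python
# Fit a linear SVM and inspect how many training points became support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, size=(50, 2)),
               rng.normal(-2, 1, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=10.0).fit(X, y)
print("support vectors:", clf.support_vectors_.shape[0], "out of", len(y))
print("dual coefficients (alpha_i * y_i) shape:", clf.dual_coef_.shape)
```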
One-vs-rest: learn 3 classifiers separately, class k vs. the rest, giving $(w_k, b_k)$ for $k = 1, 2, 3$; predict $y = \arg\max_k\; w_k \cdot x + b_k$.
But the $w_k$'s may not be on the same scale. Note: if $(w, b)$ separates one class from the rest, then $(a w,\, a b)$ for any $a > 0$ is also a solution, so the raw scores are not directly comparable.
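A minimal sketch (my own, with hypothetical 3-class toy data and scikit-learn's LinearSVC as the binary learner) of this one-vs-rest scheme:

```python
# Train one binary SVM per class (class k vs. the rest) and predict by arg-max
# over the raw scores w_k . x + b_k.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
centers = [[0, 3], [3, 0], [-3, -3]]
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

models = [LinearSVC(C=1.0, max_iter=10_000).fit(X, (y == k).astype(int))
          for k in range(3)]

def predict(x):
    scores = [m.decision_function(x.reshape(1, -1))[0] for m in models]
    return int(np.argmax(scores))

print(predict(np.array([0.2, 2.8])))    # should be class 0 for this toy layout
```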
Multiclass SVM (joint optimization): simultaneously learn the 3 sets of weights; predict $y = \arg\max_k\; w^{(k)} \cdot x + b^{(k)}$.
Constraints: for each training point, the score of the correct class must beat the score of every other class; the margin is the gap between the correct class and the nearest other class.
With joint optimization the $w^{(k)}$'s are on the same scale.
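One standard way to write this joint problem (my formulation of the "margin = gap between correct class and nearest" idea; the slides' exact notation is not visible in this transcript):

```latex
\min_{\{w^{(k)},\, b^{(k)}\}_{k=1}^{K}} \;\; \tfrac{1}{2} \sum_{k=1}^{K} \|w^{(k)}\|^2
\qquad \text{s.t.} \qquad
w^{(y_i)} \cdot x_i + b^{(y_i)} \;\ge\; w^{(k)} \cdot x_i + b^{(k)} + 1
\quad \forall i,\ \forall k \ne y_i .
```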
SVM: hinge loss $\max\big(0,\, 1 - y (w \cdot x + b)\big)$, a convex upper bound on the 0-1 loss.
Logistic regression: log loss (negative log conditional likelihood) $\log\big(1 + e^{-y (w \cdot x + b)}\big)$.
Both the hinge loss and the log loss are convex surrogates for the 0-1 loss.
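A small sketch (my own) comparing the two surrogates as functions of the margin $z = y(w \cdot x + b)$; the hinge loss bounds the 0-1 loss directly, the log loss does so after rescaling by $1/\log 2$:

```python
import numpy as np

z = np.linspace(-4, 4, 801)
hinge = np.maximum(0.0, 1.0 - z)          # SVM surrogate
log_loss = np.log1p(np.exp(-z))           # logistic regression surrogate
zero_one = (z < 0).astype(float)

assert np.all(hinge >= zero_one)
assert np.all(log_loss / np.log(2) >= zero_one)
```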
"Without b": if we fix $b = 0$, the dual has no equality constraint. "With b": the dual keeps the constraint $\sum_i \alpha_i y_i = 0$, which came from $\partial L / \partial b = 0$.
$\Phi(x) = \big(x_1^3,\ x_2^3,\ x_3^3,\ x_1^2 x_2 x_3,\ \dots\big)$
For example, polynomial kernels of degree d: d = 1 gives a linear decision boundary, d = 2 a quadratic one.
m – number of input features, d – degree of the polynomial. The number of degree-d monomials is $\binom{m+d-1}{d}$, which grows fast: for d = 6 and m = 100 it is about 1.6 billion terms.
The feature space becomes really large very quickly!
$\Phi(x)$ is a high-dimensional feature map, but we never need it explicitly, as long as we can compute the dot product $\Phi(x) \cdot \Phi(z)$ fast using some kernel $K(x, z)$.
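A minimal sketch (my own example, using the homogeneous degree-2 kernel rather than any specific kernel from the slides) of this "kernel trick": the kernel value equals the dot product of the explicit quadratic feature maps, without ever forming them:

```python
# K(x, z) = (x . z)^2 equals Phi(x) . Phi(z) with Phi(x) = (x_i * x_j for all i, j).
import numpy as np

def phi(x):
    return np.outer(x, x).ravel()          # explicit m^2-dimensional feature map

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

explicit = phi(x) @ phi(z)                 # m^2 products
kernel = (x @ z) ** 2                      # m products, same value
print(np.isclose(explicit, kernel))        # True
```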
The Gaussian (RBF) kernel $K(x, z) = \exp\big(-\|x - z\|^2 / (2\sigma^2)\big)$ corresponds to an infinite-dimensional feature space (recall the series expansion of the exponential function).
Which functions can be used as kernels??? …and why are they called kernels???
A symmetric function $K$ can be used as a kernel when it is positive semi-definite (Mercer's condition); with a non-PSD "kernel" the QP is no longer convex ⇒ trouble!
A 2nd-order (quadratic) decision boundary in the input space is a conic section: parabola, hyperbola, or ellipse.
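A minimal sketch (assuming scikit-learn; the circular toy data are my own) of fitting an SVM with a degree-2 polynomial kernel, whose decision boundary in the input space is such a conic section:

```python
# Degree-2 polynomial kernel SVM on data whose true boundary is a circle.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 1.0, 1, -1)     # inside the unit circle -> +1

clf = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0).fit(X, y)
print("training accuracy:", clf.score(X, y))          # should be close to 1
```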