CMU-10701 Support Vector Machines, Barnabás Póczos & Aarti Singh (PowerPoint presentation)



slide-1
SLIDE 1

Introduction to Machine Learning CMU-10701

Support Vector Machines

Barnabás Póczos & Aarti Singh 2014 Spring

slide-2
SLIDE 2

http://barnabas-cmu-10701.appspot.com/

slide-3
SLIDE 3
slide-4
SLIDE 4

Linear classifiers: which line is better?

Which decision boundary is better?

slide-5
SLIDE 5

Pick the one with the largest margin!

Data: training points from Class 1 and Class 2.

[Figure: the two classes separated by a linear decision boundary; the half-space where w ∙ x + b > 0 is assigned to one class and the half-space where w ∙ x + b < 0 to the other. The margin is the width of the gap between the boundary and the closest points, measured along the normal vector w.]

slide-6
SLIDE 6

Scaling

Classify as +1 if w ∙ x + b ≥ 1, and as −1 if w ∙ x + b ≤ −1 ("universe explodes" if −1 < w ∙ x + b < 1).

[Figure: the classifier boundary w ∙ x + b = 0 lies between the plus-plane w ∙ x + b = +1 and the minus-plane w ∙ x + b = −1.]

Classification rule: sign(w ∙ x + b). Goal: find the maximum margin classifier. How large is the margin of this classifier?

slide-7
SLIDE 7

Computing the margin width

Let x+ and x- be such that

  • w ∙ x+ + b = +1
  • w ∙ x- + b = −1
  • x+ = x- + λ w
  • |x+ − x-| = M

M = margin width

[Figure: x- lies on the minus-plane and x+ on the plus-plane, with the margin M measured along the direction of w.]

M = 2 / √(w ∙ w)

Maximize M ⇔ minimize w ∙ w!
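Filling in the algebra behind this result (a standard derivation): subtracting w ∙ x- + b = −1 from w ∙ x+ + b = +1 gives w ∙ (x+ − x-) = 2; substituting x+ − x- = λ w yields λ (w ∙ w) = 2, so λ = 2 / (w ∙ w) and M = |x+ − x-| = λ √(w ∙ w) = 2 / √(w ∙ w). Maximizing M is therefore equivalent to minimizing w ∙ w.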

slide-8
SLIDE 8

Observations

8

We can assume b=0

Classify as +1 if w ∙ x + b ≥ 1, and as −1 if w ∙ x + b ≤ −1 ("universe explodes" if −1 < w ∙ x + b < 1).

This is the same as

slide-9
SLIDE 9

The Primal Hard SVM

This is a QP problem (m-dimensional) (Quadratic cost function, linear constraints)
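For reference, the primal hard-margin SVM problem referred to here is the standard one:

$$ \min_{w,b}\ \frac{1}{2}\,\|w\|^{2} \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n. $$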

9

slide-10
SLIDE 10

Quadratic Programming

General form: find u that minimizes a quadratic cost ½ uᵀ R u + dᵀ u + c, subject to a set of linear inequality (and possibly equality) constraints.

10

Efficient algorithms exist for QP; they often solve the dual problem instead of the primal.
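As a concrete illustration of solving this QP, here is a minimal sketch using the cvxpy modeling library on made-up, linearly separable toy data (the data, the choice of cvxpy, and the solver defaults are all assumptions of this example, not part of the lecture):

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data: two well-separated 2-D blobs (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(+2.0, 0.5, size=(20, 2))])
y = np.array([-1.0] * 20 + [+1.0] * 20)

# Primal hard-margin SVM: minimize (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1.
w = cp.Variable(2)
b = cp.Variable()
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()

print("w =", w.value, " b =", b.value)
print("margin width M = 2/||w|| =", 2.0 / np.linalg.norm(w.value))
```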

slide-11
SLIDE 11

Constrained Optimization

11

slide-12
SLIDE 12

Lagrange Multiplier

Moving the constraint into the objective function gives the Lagrangian, which we then optimize.

The constraint is active when α > 0.

12

slide-13
SLIDE 13

Lagrange Multiplier – Dual Variables

Solving:

When α > 0, the constraint is tight.
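To spell out the pattern on a generic problem (standard Lagrangian/KKT facts): for min_x f(x) subject to g(x) ≤ 0, the Lagrangian is L(x, α) = f(x) + α g(x) with dual variable α ≥ 0. At an optimum, stationarity (∇f(x) + α ∇g(x) = 0) and complementary slackness (α g(x) = 0) hold, so either α = 0 and the constraint is inactive, or g(x) = 0 and the constraint is tight.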

13

slide-14
SLIDE 14

From Primal to Dual

Lagrange function:

14

Primal problem:

slide-15
SLIDE 15

Proof cont.

The Lagrange problem:

15

slide-16
SLIDE 16

Proof cont. The Dual Problem

16

slide-17
SLIDE 17

The Dual Hard SVM

This is again a quadratic programming problem, but now n-dimensional (one dual variable per training point). Lemma: the dual solution determines the primal one (see below).
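For reference, the classical dual hard-margin SVM problem is:

$$ \max_{\alpha}\ \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \quad \text{s.t.} \quad \alpha_i \ge 0,\quad \sum_{i=1}^{n} \alpha_i y_i = 0, $$

and the primal weight vector is recovered as w = Σᵢ αᵢ yᵢ xᵢ.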

17

slide-18
SLIDE 18

The Problem with Hard SVM

It assumes samples are linearly separable...

18

What can we do if data is not linearly separable???

slide-19
SLIDE 19

Hard 1-dimensional Dataset

If the data set is not linearly separable, then by adding new features (mapping the data to a larger feature space) it might become linearly separable.

19

slide-20
SLIDE 20

Hard 1-dimensional Dataset

Make up a new feature! Sort of… it is computed from the original feature(s):

z_k = (x_k, x_k²)

[Figure: the original 1-D points around x = 0 become separable once the second coordinate x² is added.]

Separable! MAGIC! Now drop this “augmented” data into our linear SVM.
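A minimal sketch of this trick (the toy data, the (x, x²) feature map, and the use of scikit-learn's LinearSVC are assumptions of this example): map each 1-D point x to z = (x, x²) and train a linear SVM on the augmented data.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 1-D data that is NOT linearly separable on the line:
# the negative class sits around the origin, the positive class on both sides.
x = np.array([-4.0, -3.0, -2.5, -1.0, -0.5, 0.0, 0.5, 1.0, 2.5, 3.0, 4.0])
y = np.array([+1, +1, +1, -1, -1, -1, -1, -1, +1, +1, +1])

# Augment each point with its square: z = (x, x^2).
Z = np.column_stack([x, x ** 2])

clf = LinearSVC(C=100.0)   # large C approximates a hard margin on separable data
clf.fit(Z, y)
print(clf.score(Z, y))     # should be 1.0: the augmented data are separable
```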

20

slide-21
SLIDE 21

Feature mapping

In general, n points in an (n−1)-dimensional space are always linearly separable by a hyperplane! ⇒ it is good to map the data to high-dimensional spaces. Having n training points, is it always good enough to map the data into a feature space of dimension n−1?

  • Nope... We have to think about the test data as well!

Even if we don't know how many test points we will have or what they are... ⇒ We might want to map our data to a huge (∞-dimensional) feature space. Overfitting? Generalization error?... We don't care for now...

21

slide-22
SLIDE 22


How to do feature mapping?

Use features of features of features of features….

22

slide-23
SLIDE 23

The Problem with Hard SVM

Solutions:

  • 1. Use feature transformation to a larger space

The problem: Hard SVM assumes the samples are linearly separable... ⇒ with a feature transformation, the training samples become linearly separable in the feature space ⇒ Hard SVM can be applied ⇒ but this risks overfitting...

23

  • 2. Soft margin SVM instead of Hard SVM
  • We will discuss this now
slide-24
SLIDE 24

Hard SVM

The Hard SVM problem can be rewritten:

where the loss term is ∞ for a misclassification and 0 for a correct classification

24

slide-25
SLIDE 25

From Hard to Soft constraints

Instead of using hard constraints (which require the points to be linearly separable), we can try to solve the soft version of the problem: your loss is only 1, instead of ∞, if you misclassify an instance,

where the loss is 1 for a misclassification and 0 for a correct classification.

25

slide-26
SLIDE 26

Problems with l0-1 loss

It is not convex in y f(x) ⇒ it is not convex in w, either... and we like only convex functions... Let us approximate it with convex functions!

26

slide-27
SLIDE 27

Approximation of the Heaviside step function

Picture is taken from R. Herbrich

27

slide-28
SLIDE 28

Approximations of l0-1 loss

  • Piecewise linear approximations (hinge loss, llin)
  • Quadratic approximation (lquad)

28

slide-29
SLIDE 29

The hinge loss approximation of l0-1

Where,

29

The hinge loss upper bounds the 0-1 loss
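For reference, the standard definition is l_hinge(y, f(x)) = max(0, 1 − y f(x)). If y f(x) ≤ 0 (a misclassification) then 1 − y f(x) ≥ 1, so the hinge loss is at least 1, the value of the 0-1 loss; if y f(x) > 0 the hinge loss is still ≥ 0. Hence the hinge loss upper bounds the 0-1 loss everywhere, and minimizing it minimizes a convex surrogate of the training error.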

slide-30
SLIDE 30

Geometric interpretation: Slack Variables

[Figure: the maximum-margin boundary with margin width M = 2 / √(w ∙ w); points that violate their margin plane are drawn with slack variables ξ measuring how far they lie on the wrong side.]

slide-31
SLIDE 31

The Primal Soft SVM problem

31

Equivalently (with slack variables ξᵢ), where each ξᵢ measures by how much the margin constraint of point i is violated.
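For reference, the standard primal soft SVM is

$$ \min_{w,b,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0, $$

which is equivalent to the unconstrained hinge-loss form min_{w,b} ½ ‖w‖² + C Σᵢ max(0, 1 − yᵢ (w ∙ xᵢ + b)).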

slide-32
SLIDE 32

The Primal Soft SVM problem

We can use this form, too:

32

Equivalently, ... What is the dual form of the primal soft SVM?

slide-33
SLIDE 33

The Dual Soft SVM (using hinge loss)

where
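For reference, the classical dual of the soft SVM is

$$ \max_{\alpha}\ \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{n} \alpha_i y_i = 0. $$

Compared with the hard-margin dual, the only change is the upper bound C on each αᵢ.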

33

slide-34
SLIDE 34

The Dual Soft SVM (using hinge loss)

34

slide-35
SLIDE 35

The Dual Soft SVM (using hinge loss)

35

slide-36
SLIDE 36

SVM classification in the dual space

Solve the dual problem
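Once the dual is solved, a new point x is classified in the standard way as ŷ(x) = sign( Σᵢ αᵢ yᵢ (xᵢ ∙ x) + b ), so only the training points with αᵢ > 0 contribute to the prediction.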

36

slide-37
SLIDE 37

Why is it called Support Vector Machine?

KKT conditions

slide-38
SLIDE 38

Why is it called Support Vector Machine?

[Figure: the decision boundary separating the regions w ∙ x + b > 0 and w ∙ x + b < 0, with the margin marked on each side.]

Hard SVM: the linear hyperplane is defined by the “support vectors”. Moving other points a little doesn't affect the decision boundary.

  • Only need to store the support vectors to predict labels of new points.

38

slide-39
SLIDE 39

Support vectors in Soft SVM

slide-40
SLIDE 40

Support vectors in Soft SVM

  • Margin support vectors
  • Nonmargin support vectors
slide-41
SLIDE 41

Dual SVM Interpretation: Sparsity

Only a few αⱼ can be non-zero: exactly those where the constraint is tight,

(⟨w, xⱼ⟩ + b) yⱼ = 1.

[Figure: points on the margin planes are labeled αⱼ > 0; the remaining points are labeled αⱼ = 0.]

41

slide-42
SLIDE 42

What about multiple classes?

42

slide-43
SLIDE 43

One against all

Learn 3 classifiers separately, class k vs. rest: (w_k, b_k), k = 1, 2, 3. Predict y = arg max_k w_k ∙ x + b_k.

But the w_k may not be on the same scale. Note: (a w_k) ∙ x + (a b_k) is also a solution.
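A minimal sketch of the one-against-all scheme (the toy data and the use of scikit-learn's LinearSVC are assumptions of this example): train one binary classifier per class and predict with the arg max of the decision scores.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 2-D data with 3 classes (assumed for illustration).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 3.0], [-3.0, -2.0], [3.0, -2.0]])
X = np.vstack([rng.normal(c, 0.6, size=(30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

# One-against-all: classifier k separates class k from the rest.
classifiers = [LinearSVC(C=1.0).fit(X, np.where(y == k, 1, -1)) for k in range(3)]

# Predict by arg max_k (w_k . x + b_k); note the scores need not share a scale.
def predict(X_new):
    scores = np.column_stack([c.decision_function(X_new) for c in classifiers])
    return np.argmax(scores, axis=1)

print((predict(X) == y).mean())   # training accuracy on the toy data
```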

43

slide-44
SLIDE 44

Learn 1 classifier: Multi-class SVM

Simultaneously learn 3 sets of weights, with constraints. Predict y = arg max_k w(k) ∙ x + b(k). Margin: the gap between the correct class and the nearest other class.

44

slide-45
SLIDE 45

Learn 1 classifier: Multi-class SVM

Simultaneously learn 3 sets of weights: y = arg max_k w(k) ∙ x + b(k). Joint optimization: the w(k) have the same scale.

45

slide-46
SLIDE 46

What you need to know

  • Maximizing margin
  • Derivation of SVM formulation
  • Slack variables and hinge loss
  • Relationship between SVMs and logistic regression
  • 0/1 loss
  • Hinge loss
  • Log loss
  • Tackling multiple classes
  • One against All
  • Multiclass SVMs

46

slide-47
SLIDE 47

SVM vs. Logistic Regression

SVM: hinge loss. Logistic Regression: log loss (the negative log conditional likelihood).

[Figure: the 0-1 loss together with the hinge loss and the log loss, plotted as functions of y f(x).]
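For reference, with f(x) = w ∙ x + b: the hinge loss is max(0, 1 − y f(x)) and the log loss is log(1 + exp(−y f(x))). Both are convex surrogates of the 0-1 loss 1[y f(x) ≤ 0]; the hinge loss upper bounds it everywhere, and the log loss is the negative log conditional likelihood of the logistic model.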

47

slide-48
SLIDE 48

SVM for Regression

48

slide-49
SLIDE 49

SVM classification in the dual space

49

“Without b” “With b”

slide-50
SLIDE 50

So why solve the dual SVM?

  • There are some quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions (m >> n).

  • But, more importantly, the “kernel trick”!!!

50

slide-51
SLIDE 51

What if data is not linearly separable?

Use features of features of features of features….

For example, polynomials:

Φ(x) = (x₁³, x₂³, x₃³, x₁²x₂x₃, …)

slide-52
SLIDE 52

Dot Product of Polynomials

d=1 d=2

52

slide-53
SLIDE 53

Dot Product of Polynomials

General degree d:
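The key identity behind these slides (standard for the polynomial kernel): the dot product of the degree-d feature vectors can be computed directly from x ∙ z. For example, with d = 2 and two input features,

(x ∙ z)² = (x₁z₁ + x₂z₂)² = x₁²z₁² + 2 x₁x₂ z₁z₂ + x₂²z₂² = Φ(x) ∙ Φ(z), where Φ(u) = (u₁², √2 u₁u₂, u₂²).

So Φ(x) ∙ Φ(z) costs one ordinary dot product plus a power, regardless of how large the implicit feature space is; the same holds for general d with (x ∙ z)^d.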

53

slide-54
SLIDE 54

Higher Order Polynomials

m: number of input features; d: degree of the polynomial.

The number of terms grows fast: for d = 6 and m = 100, about 1.6 billion terms.

54

Feature space becomes really large very quickly!

slide-55
SLIDE 55

The dual formulation only depends on dot-products, not on w!

Φ(x): a high-dimensional feature space, but we never need it explicitly as long as we can compute the dot product fast using some kernel K.

55

slide-56
SLIDE 56
Common Kernels

  • Polynomials of degree d
  • Polynomials of degree up to d
  • Gaussian/Radial kernels (polynomials of all orders – recall series expansion)
  • Sigmoid

Which functions can be used as kernels??? …and why are they called kernels???
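For concreteness, here is a small NumPy sketch of these kernel functions (the parameter names and default values are illustrative assumptions, not fixed by the lecture):

```python
import numpy as np

def poly_kernel(x, z, d=3):
    """Polynomials of degree exactly d."""
    return (x @ z) ** d

def poly_up_to_kernel(x, z, d=3):
    """Polynomials of degree up to d; the +1 brings in the lower-order terms."""
    return (1.0 + x @ z) ** d

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian/RBF kernel; its series expansion contains polynomials of all orders."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    """Sigmoid kernel (not positive semi-definite for every parameter choice)."""
    return np.tanh(kappa * (x @ z) + theta)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(poly_kernel(x, z), poly_up_to_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```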

slide-57
SLIDE 57

Overfitting

  • Huge feature space with kernels, what about overfitting???
  • Maximizing margin leads to a sparse set of support vectors
  • Some interesting theory says that SVMs search for a simple hypothesis with a large margin
  • Often robust to overfitting

58

slide-58
SLIDE 58

What about classification time?

  • For a new input x, if we need to represent Φ(x), we are in trouble!
  • Recall classifier: sign(w ∙ Φ(x) + b)
  • Using kernels we are cool!

59

slide-59
SLIDE 59

A few results

61

slide-60
SLIDE 60

Steve Gunn's SVM toolbox. Results, Iris 2vs13, Linear kernel

62

slide-61
SLIDE 61

Results, Iris 1vs23, 2nd order kernel

63

2nd order decision boundary: (parabola, hyperbola, ellipse)

slide-62
SLIDE 62

Results, Iris 1vs23, 2nd order kernel

64

slide-63
SLIDE 63

Results, Iris 1vs23, 13th order kernel

65

slide-64
SLIDE 64

Results, Iris 1vs23, RBF kernel

66

slide-65
SLIDE 65

Results, Iris 1vs23, RBF kernel

67

slide-66
SLIDE 66

Results, Chessboard, Poly kernel

68

Chessboard dataset

slide-67
SLIDE 67

Results, Chessboard, Poly kernel

69

slide-68
SLIDE 68

Results, Chessboard, Poly kernel

70

slide-69
SLIDE 69

Results, Chessboard, Poly kernel

71

slide-70
SLIDE 70

Results, Chessboard, poly kernel

72

slide-71
SLIDE 71

Results, Chessboard, RBF kernel

73