SLIDE 1

Introduction to Machine Learning CMU-10701

Support Vector Machines

Barnabás Póczos & Aarti Singh, Spring 2014

SLIDE 2

http://barnabas-cmu-10701.appspot.com/

SLIDE 3

Linear classifiers: which line is better?

SLIDE 4

Which decision boundary is better?

SLIDE 5

Pick the one with the largest margin!

[Figure: Class 1 and Class 2 separated by a hyperplane with normal vector w; the margin is the gap between the two classes. Points with w · x + b < 0 fall on one side of the boundary, points with w · x + b > 0 on the other.]

Data: training pairs (x_i, y_i) with labels y_i ∈ {−1, +1}.

SLIDE 6

Scaling

Classify as +1 if w · x + b ≥ +1, and as −1 if w · x + b ≤ −1. (The universe explodes if −1 < w · x + b < 1.)

[Figure: the plus-plane w · x + b = +1, the minus-plane w · x + b = −1, and the classifier boundary w · x + b = 0 between them.]

Classification rule: ŷ = sign(w · x + b). Goal: find the maximum-margin classifier. How large is the margin of this classifier?

SLIDE 7

Computing the margin width

Let x+ and x− be points on the plus- and minus-planes such that

  • w · x+ + b = +1
  • w · x− + b = −1
  • x+ = x− + λw
  • |x+ − x−| = M (the margin width)

Substituting x+ = x− + λw into w · x+ + b = +1 and using w · x− + b = −1 gives λ w · w = 2, so λ = 2 / (w · w) and

M = |x+ − x−| = λ ||w|| = 2 / ||w||.

Maximize M ≡ minimize w · w !
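As a quick numerical check (a minimal sketch, not from the slides, assuming scikit-learn is available): fit a linear SVM with a very large C so it behaves like a hard-margin classifier, and verify that the margin is 2 / ||w||.

```python
# Sketch (assumes scikit-learn): verify M = 2 / ||w|| on a tiny separable set.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

margin = 2.0 / np.linalg.norm(w)
print(margin)   # ~2.0: the two classes are 2 units apart along the first axis
```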

SLIDE 8

Observations

We can assume b = 0 (e.g., by appending a constant feature to every x).

Classify as +1 if w · x + b ≥ 1, and as −1 if w · x + b ≤ −1. (The universe explodes if −1 < w · x + b < 1.)

This is the same as requiring y_i (w · x_i + b) ≥ 1 for every training point (x_i, y_i).

SLIDE 9

The Primal Hard SVM

min_{w, b}  ½ w · w
subject to  y_i (w · x_i + b) ≥ 1,  i = 1, …, n.

This is a QP problem (m-dimensional: one weight per input feature, plus b), with a quadratic cost function and linear constraints.

SLIDE 10

Quadratic Programming

General form: find z minimizing a quadratic objective ½ zᵀ Q z + cᵀ z, subject to linear equality and inequality constraints on z.

Efficient algorithms exist for QP. They often solve the dual problem instead of the primal.
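A minimal sketch of the primal hard-margin SVM written as such a QP, assuming the cvxpy package (a modeling tool chosen here for illustration, not part of the lecture):

```python
# Sketch (assumes cvxpy): the primal hard-margin SVM as a QP,
#   minimize (1/2) w.w   subject to   y_i (w . x_i + b) >= 1  for all i.
import numpy as np
import cvxpy as cp

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print(w.value, b.value)   # approximately w = (1, 0), b = -1 for this toy data
```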

SLIDE 11

Constrained Optimization

SLIDE 12

Lagrange Multiplier

Moving the constraint into the objective function gives the Lagrangian. Solve by making the Lagrangian stationary.

The constraint is active when α > 0.
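A minimal worked example (not on the slide) of moving a constraint into the objective:

```latex
% Worked example (not from the slides): minimize x^2 subject to x >= 1.
% Lagrangian with a multiplier \alpha \ge 0 for the constraint 1 - x \le 0:
\[
  L(x,\alpha) = x^2 + \alpha\,(1 - x), \qquad \alpha \ge 0 .
\]
% Stationarity and complementary slackness:
\[
  \frac{\partial L}{\partial x} = 2x - \alpha = 0 \;\Rightarrow\; x = \tfrac{\alpha}{2},
  \qquad \alpha\,(1 - x) = 0 .
\]
% \alpha = 0 would give x = 0, violating x >= 1, so the constraint is active:
% x = 1 with \alpha = 2 > 0, illustrating "constraint is active when alpha > 0".
```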

SLIDE 13

Lagrange Multiplier – Dual Variables

Solving the stationarity and complementary-slackness conditions: when α > 0, the constraint is tight.

SLIDE 14

From Primal to Dual

Lagrange function: L(w, b, α) = ½ w · w − Σ_i α_i [y_i (w · x_i + b) − 1], with α_i ≥ 0.

Primal problem: min_{w, b} max_{α ≥ 0} L(w, b, α).

SLIDE 15

Proof cont.

The Lagrange problem: min_{w, b} max_{α ≥ 0} L(w, b, α), whose value equals that of the primal problem.

SLIDE 16

Proof cont.: the dual problem

max_{α ≥ 0} min_{w, b} L(w, b, α).

SLIDE 17

The Dual Hard SVM

This is a quadratic program in the n dual variables (one α_i per training point):

max_α  Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
subject to  α_i ≥ 0,  Σ_i α_i y_i = 0.

Lemma: the optimal weight vector is recovered as w = Σ_i α_i y_i x_i.
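A minimal sketch of this dual (again assuming cvxpy), using the identity Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j) = ||Σ_i α_i y_i x_i||² so no explicit quadratic form is needed:

```python
# Sketch (assumes cvxpy): the dual hard-margin SVM, then recover w.
import numpy as np
import cvxpy as cp

X = np.array([[0.0, 0.0], [0.0, 1.0], [-1.0, 0.5],
              [2.0, 0.0], [2.0, 1.0], [3.0, 0.5]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

a = cp.Variable(len(y), nonneg=True)
objective = cp.sum(a) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, a))
problem = cp.Problem(cp.Maximize(objective),
                     [cp.sum(cp.multiply(y, a)) == 0])
problem.solve()

w = X.T @ (a.value * y)        # w = sum_i a_i y_i x_i
print(np.round(a.value, 3))    # points off the margin get a_i ~ 0
print(w)                       # roughly (1, 0), matching the primal solution
```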

SLIDE 18

The Problem with Hard SVM

It assumes the samples are linearly separable...

What can we do if the data is not linearly separable???

SLIDE 19

Hard 1-dimensional Dataset

If the data set is not linearly separable, then by adding new features (mapping the data to a larger feature space) the data might become linearly separable.

SLIDE 20

Hard 1-dimensional Dataset

Make up a new feature! Sort of… it is computed from the original feature(s):

z_k = (x_k, x_k²)

[Figure: the 1-D points, mixed along the line around x = 0, become linearly separable in the (x, x²) plane.]

Separable! MAGIC! Now drop this “augmented” data into our linear SVM.
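A minimal sketch of exactly this trick (assuming scikit-learn) on a 1-D data set that is not linearly separable: negatives in the middle, positives on both sides.

```python
# Sketch (assumes scikit-learn): a 1-D set that is NOT linearly separable
# becomes separable after the feature map z_k = (x_k, x_k**2).
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])      # negatives in the middle

Z = np.column_stack([x, x ** 2])                # the "augmented" data
clf = SVC(kernel="linear", C=1e6).fit(Z, y)

print(clf.predict(Z))                           # separates the classes perfectly
```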

SLIDE 21

Feature mapping

n points in general position in an (n−1)-dimensional space are always linearly separable by a hyperplane! ⇒ it is good to map the data to high-dimensional spaces. Having n training points, is it always good enough to map the data into a feature space of dimension n−1?

  • Nope... we have to think about the test data as well!

Even if we don’t know how many test points we have and what they are, we might want to map our data to a huge (∞-dimensional) feature space. Overfitting? Generalization error? ... We don’t care now...

SLIDE 22

How to do feature mapping?

Use features of features of features of features….

SLIDE 23

The Problem with Hard SVM

It assumes the samples are linearly separable...

Solutions:

  • 1. Use a feature transformation to a larger space
    ⇒ the training samples become linearly separable in the feature space
    ⇒ Hard SVM can be applied ☺
    ⇒ but overfitting...
  • 2. Soft margin SVM instead of Hard SVM (we will discuss this now)
SLIDE 24

Hard SVM

The Hard SVM problem can be rewritten:

where the per-example penalty is ∞ for a misclassification (violated margin constraint) and 0 for a correct classification.

SLIDE 25

From Hard to Soft constraints

Instead of using hard constraints (which require the points to be linearly separable), we can try to solve the soft version: your loss is only 1 instead of ∞ if you misclassify an instance,

where the per-example penalty is 1 for a misclassification and 0 for a correct classification.

SLIDE 26

Problems with l0-1 loss

It is not convex in y·f(x) ⇒ it is not convex in w either... and we like only convex functions... Let us approximate it with convex functions!
SLIDE 27

Approximation of the Heaviside step function

Picture is taken from R. Herbrich.

SLIDE 28

Approximations of l0-1 loss

  • Piecewise linear approximation (hinge loss, l_lin)
  • Quadratic approximation (l_quad)
SLIDE 29

The hinge loss approximation of l0-1

l_hinge(y, f(x)) = max(0, 1 − y f(x)), where f(x) = w · x + b.

The hinge loss upper bounds the 0-1 loss
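A small numeric illustration (not from the slides) of the hinge loss and of the fact that it upper-bounds the 0-1 loss:

```python
# Sketch: hinge loss vs. 0-1 loss as functions of the margin value y * f(x).
import numpy as np

margins = np.linspace(-2.0, 2.0, 9)            # values of y * f(x)
hinge = np.maximum(0.0, 1.0 - margins)         # max(0, 1 - y f(x))
zero_one = (margins <= 0).astype(float)        # 1 on a mistake, 0 otherwise

print(np.all(hinge >= zero_one))               # True: hinge upper-bounds 0-1
```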

SLIDE 30

Geometric interpretation: Slack Variables

[Figure: the soft-margin geometry. The margin width is M = 2 / ||w||; slack variables ξ_1, ξ_2, ξ_7 measure how far individual points fall on the wrong side of their margin plane.]

SLIDE 31

The Primal Soft SVM problem

min_{w, b, ξ}  ½ w · w + C Σ_i ξ_i
subject to  y_i (w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, n.

Equivalently (eliminating the slack variables), minimize ½ w · w + C Σ_i max(0, 1 − y_i (w · x_i + b)).
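A minimal sketch of this primal (again assuming cvxpy), with slack variables ξ_i and trade-off parameter C:

```python
# Sketch (assumes cvxpy): the primal soft-margin SVM,
#   minimize (1/2) w.w + C * sum_i xi_i
#   s.t.     y_i (w . x_i + b) >= 1 - xi_i,   xi_i >= 0.
import numpy as np
import cvxpy as cp

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0], [1.5, 0.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0, -1.0])   # last point violates the margin

C = 1.0
w, b = cp.Variable(X.shape[1]), cp.Variable()
xi = cp.Variable(len(y), nonneg=True)

constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
                     constraints)
problem.solve()

print(w.value, b.value, xi.value)            # the mislabeled point gets xi > 0
```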

SLIDE 32

The Primal Soft SVM problem

We can use this (hinge-loss) form, too. What is the dual form of the primal soft SVM?

SLIDE 33

The Dual Soft SVM (using hinge loss)

max_α  Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
subject to  0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0,

where C is the soft-margin trade-off parameter.

SLIDE 34

The Dual Soft SVM (using hinge loss)

SLIDE 35

The Dual Soft SVM (using hinge loss)

SLIDE 36

SVM classification in the dual space

Solve the dual problem for the α_i, then classify a new point x with f(x) = sign(Σ_i α_i y_i (x_i · x) + b).
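A minimal sketch of the resulting classifier: once the dual α's are known, prediction only needs kernel (dot-product) evaluations against the training points. The function names below are illustrative, not from the slides.

```python
# Sketch: classify in the dual space, f(x) = sign(sum_i a_i y_i K(x_i, x) + b).
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def predict(x_new, X_train, y_train, alpha, b, kernel=rbf_kernel):
    s = sum(a_i * y_i * kernel(x_i, x_new)
            for a_i, y_i, x_i in zip(alpha, y_train, X_train))
    return np.sign(s + b)
```

Since only points with α_i > 0 contribute, the sum can be restricted to the support vectors.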

SLIDE 37

Why is it called Support Vector Machine?

KKT conditions

SLIDE 38

Why is it called Support Vector Machine?

[Figure: the separating hyperplane with margin γ; points on one side satisfy w · x + b > 0, points on the other satisfy w · x + b < 0.]

Hard SVM: the linear hyperplane is defined by the “support vectors”. Moving other points a little doesn’t affect the decision boundary.

  • Only need to store the support vectors to predict labels of new points.

SLIDE 39

Support vectors in Soft SVM

SLIDE 40

Support vectors in Soft SVM

  • Margin support vectors
  • Nonmargin support vectors
SLIDE 41

Dual SVM Interpretation: Sparsity

Only a few α_j can be non-zero: those where the constraint is tight,

(⟨w, x_j⟩ + b) y_j = 1.

[Figure: support vectors on the margin have α_j > 0; all other points have α_j = 0.]
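In practice (a minimal sketch assuming scikit-learn) the solver exposes exactly this sparsity: only the support vectors and their dual coefficients are stored.

```python
# Sketch (assumes scikit-learn): inspect which training points have alpha_j > 0.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [-1.0, 0.5],
              [2.0, 0.0], [2.0, 1.0], [3.0, 0.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.support_)       # indices of points with non-zero alpha_j
print(clf.dual_coef_)     # the corresponding alpha_j * y_j values
```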

SLIDE 42

What about multiple classes?

SLIDE 43

One against all

Learn 3 classifiers separately, class k vs. the rest, giving (w_k, b_k) for k = 1, 2, 3. Predict y = arg max_k w_k · x + b_k.

But the w_k may not be on the same scale. Note: (a w) · x + (a b) is also a solution for any a > 0.
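A minimal sketch of the one-against-all rule (assuming scikit-learn for the per-class binary SVMs); note the scale caveat above still applies to the compared scores.

```python
# Sketch (assumes scikit-learn): one-against-all with 3 linear SVMs.
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y, classes=(0, 1, 2)):
    # one binary classifier per class: class k vs. the rest
    return [LinearSVC(C=1.0).fit(X, np.where(y == k, 1, -1)) for k in classes]

def predict_one_vs_all(models, X):
    # pick the class whose classifier gives the largest score w_k . x + b_k
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)
```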

SLIDE 44

Learn 1 classifier: Multi-class SVM

Simultaneously learn 3 sets of weights. Predict y = arg max_k w^(k) · x + b^(k). Constraints enforce a margin: a gap between the correct class score and the nearest other class.

SLIDE 45

Learn 1 classifier: Multi-class SVM

Simultaneously learn 3 sets of weights; predict y = arg max_k w^(k) · x + b^(k). Joint optimization: the w^(k) have the same scale.

SLIDE 46

What you need to know

  • Maximizing margin
  • Derivation of SVM formulation
  • Slack variables and hinge loss
  • Relationship between SVMs and logistic regression
    • 0/1 loss
    • Hinge loss
    • Log loss
  • Tackling multiple classes
    • One against All
    • Multiclass SVMs

SLIDE 47

SVM vs. Logistic Regression

SVM: hinge loss. Logistic regression: log loss (− log conditional likelihood).

[Figure: the 0-1 loss, the hinge loss, and the log loss plotted as functions of y f(x).]

SLIDE 48

SVM for Regression

SLIDE 49

SVM classification in the dual space

The dual can be written “without b” (no offset term, so the constraint Σ_i α_i y_i = 0 disappears) or “with b”.

SLIDE 50

So why solve the dual SVM?

  • There are some quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions (m >> n).
  • But, more importantly, the “kernel trick”!!!

SLIDE 51

What if data is not linearly separable?

Use features of features of features of features…. For example, polynomials:

Φ(x) = (x1³, x2³, x3³, x1²x2x3, …)

SLIDE 52

Dot Product of Polynomials

Cases d = 1 and d = 2.
SLIDE 53

Dot Product of Polynomials

The general case, degree d: Φ(x) · Φ(z) can be computed directly from x · z (e.g., as (x · z)^d for homogeneous degree-d features).

SLIDE 54

Higher Order Polynomials

m = number of input features, d = degree of the polynomial. The number of terms grows fast: for d = 6 and m = 100, about 1.6 billion terms.

Feature space becomes really large very quickly!
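A tiny numeric check (not from the slides) of where the next slides are heading: for the homogeneous degree-2 feature map Φ(x) = (x1², √2·x1·x2, x2²), the dot product Φ(x) · Φ(z) equals (x · z)², so it can be computed without ever building the feature vector.

```python
# Sketch: the degree-2 polynomial kernel computes Phi(x).Phi(z) implicitly.
import numpy as np

def phi2(x):
    # explicit homogeneous degree-2 features of a 2-D input
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi2(x) @ phi2(z)      # build the features, then dot product
implicit = (x @ z) ** 2           # the kernel: never builds the features
print(explicit, implicit)         # both equal 1.0
```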

SLIDE 55

The dual formulation only depends on dot-products, not on w!

Φ(x) is a high-dimensional feature space, but we never need it explicitly as long as we can compute the dot product Φ(x) · Φ(z) fast using some kernel K(x, z).

SLIDE 56

Finally: The Kernel Trick!

  • Never represent features explicitly
    – Compute dot products in closed form
  • Constant-time high-dimensional dot-products for many classes of features

SLIDE 57
Common Kernels

  • Polynomials of degree d
  • Polynomials of degree up to d
  • Gaussian/Radial kernels (polynomials of all orders: recall the series expansion)
  • Sigmoid

Which functions can be used as kernels??? …and why are they called kernels???
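A minimal sketch of the kernels listed above, written as plain functions (the parameter names are illustrative):

```python
# Sketch: the common kernels from this slide.
import numpy as np

def poly_homogeneous(x, z, d=3):          # polynomials of degree d
    return (x @ z) ** d

def poly_inhomogeneous(x, z, d=3, c=1.0): # polynomials of degree up to d
    return (x @ z + c) ** d

def gaussian_rbf(x, z, gamma=0.5):        # Gaussian / radial kernel
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid(x, z, a=1.0, r=0.0):          # sigmoid kernel
    return np.tanh(a * (x @ z) + r)
```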

SLIDE 58

Overfitting

  • Huge feature space with kernels: what about overfitting???
  • Maximizing the margin leads to a sparse set of support vectors
  • Some interesting theory says that SVMs search for a simple hypothesis with a large margin
  • Often robust to overfitting

SLIDE 59

What about classification time?

  • For a new input x, if we need to represent Φ(x) explicitly, we are in trouble!
  • Recall the classifier: sign(w · Φ(x) + b)
  • Using kernels we are cool! (w = Σ_i α_i y_i Φ(x_i), so w · Φ(x) = Σ_i α_i y_i K(x_i, x).)

SLIDE 60

Kernels in Logistic Regression

  • Define the weights in terms of the features: w = Σ_i α_i Φ(x_i)
  • Derive a simple gradient descent rule on the α_i (see the sketch below)
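A minimal sketch of that idea (not from the slides; names are illustrative): with w = Σ_i α_i Φ(x_i), the model only needs the kernel matrix, and we do gradient descent on α.

```python
# Sketch: kernelized logistic regression, gradient descent on the alphas.
import numpy as np

def fit_kernel_logreg(K, y, lr=0.01, steps=500):
    """K: n x n symmetric kernel matrix, y: labels in {-1, +1}."""
    alpha = np.zeros(len(y))
    for _ in range(steps):
        f = K @ alpha                       # f(x_j) = sum_i alpha_i K(x_i, x_j)
        p = 1.0 / (1.0 + np.exp(y * f))     # sigma(-y_j * f_j)
        grad = -K @ (y * p)                 # gradient of the log loss w.r.t. alpha
        alpha -= lr * grad
    return alpha
```

Prediction on a new point x then only needs the kernel values K(x_i, x): f(x) = Σ_i α_i K(x_i, x).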

SLIDE 61

A few results

SLIDE 62

Steve Gunn's SVM toolbox. Results: Iris, class 2 vs. classes 1 & 3, linear kernel.

SLIDE 63

Results: Iris, class 1 vs. classes 2 & 3, 2nd-order polynomial kernel. A 2nd-order decision boundary is a conic section (parabola, hyperbola, or ellipse).

SLIDE 64

Results: Iris, class 1 vs. classes 2 & 3, 2nd-order polynomial kernel.

SLIDE 65

Results: Iris, class 1 vs. classes 2 & 3, 13th-order polynomial kernel.

SLIDE 66

Results: Iris, class 1 vs. classes 2 & 3, RBF kernel.

SLIDE 67

Results: Iris, class 1 vs. classes 2 & 3, RBF kernel.

SLIDE 68

Chessboard dataset. Results: Chessboard, polynomial kernel.

SLIDE 69

Results: Chessboard, polynomial kernel.

SLIDE 70

Results: Chessboard, polynomial kernel.

SLIDE 71

Results: Chessboard, polynomial kernel.

SLIDE 72

Results: Chessboard, polynomial kernel.

SLIDE 73

Results: Chessboard, RBF kernel.