

SLIDE 1

SVMs and Kernel Methods Lecture 3

David Sontag New York University

Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin

SLIDE 2
Today's lecture

  • Dual form of soft-margin SVM
  • Feature mappings & kernels
  • Convexity, Mercer's theorem
  • (Time permitting) Extensions:
    • Imbalanced data
    • Multi-class
    • Other loss functions
    • L1 regularization

SLIDE 3

Recap of dual SVM derivation

Can solve for optimal w, b as function of α:

  ∂L/∂w = w − Σj αj yj xj = 0   ⇒   w = Σj αj yj xj
  ∂L/∂b = −Σj αj yj = 0         ⇒   Σj αj yj = 0

So, in dual formulation we will solve for α directly!

  • w and b are computed from α (if needed)

Substituting these values back in (and simplifying), we obtain the dual problem:

  maximize over α:   Σj αj − ½ Σj Σk αj αk yj yk (xj · xk)
  subject to:        αj ≥ 0 for all j,   Σj αj yj = 0        (Dual)
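As a concrete illustration (not from the slides), here is a minimal sketch of recovering w from a dual solution; the names alpha, X, and y are assumptions, standing for the optimal dual variables, the data matrix, and the labels.

    import numpy as np

    def primal_weights(alpha, X, y):
        # w = sum_j alpha_j * y_j * x_j, recovered from the dual solution.
        # alpha: (n,) dual variables, X: (n, d) data, y: (n,) labels in {-1, +1}.
        return (alpha * y) @ X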

SLIDE 4

Solving for the offset “b”

Lagrangian:

  L(w, b, α) = ½ ‖w‖² − Σj αj [ yj (w · xj + b) − 1 ],   αj ≥ 0

αj > 0 for some j implies that the corresponding constraint is tight. We use this to obtain b:

  (1) αj > 0  ⇒  yj (w · xj + b) = 1
  (2)         ⇒  w · xj + b = yj        (since yj ∈ {−1, +1})
  (3)         ⇒  b = yj − w · xj
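A matching sketch for the offset (same hypothetical names as above); averaging over all support vectors is a common, numerically safer variant of step (3).

    import numpy as np

    def primal_bias(alpha, X, y, w, tol=1e-8):
        sv = alpha > tol                   # support vectors: alpha_j > 0
        return np.mean(y[sv] - X[sv] @ w)  # b = y_j - w . x_j, averaged over SVs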

SLIDE 5

Dual formulation only depends on dot-products of the features!

First, we introduce a feature mapping φ, replacing each xj by φ(xj).
Next, replace the dot product with an equivalent kernel function: K(xj, xk) = φ(xj) · φ(xk).

Do kernels need to be symmetric?

(constraint: α ≥ 0)

SLIDE 6

Classification rule using dual solution

Using the dual solution, classification needs only dot products between the feature vector of the new example and the support vectors. Using a kernel function, predict with:

  ŷ = sign( Σj αj yj K(xj, x) + b )
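A minimal sketch of this prediction rule, assuming `kernel(u, v)` is any valid kernel function and (alpha, X, y, b) come from training; only the support vectors contribute to the sum.

    def predict(x_new, alpha, X, y, b, kernel, tol=1e-8):
        # f(x) = sign( sum_j alpha_j * y_j * K(x_j, x) + b )
        score = sum(a * yi * kernel(xi, x_new)
                    for a, yi, xi in zip(alpha, y, X) if a > tol)
        return 1 if score + b >= 0 else -1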

SLIDE 7

Dual SVM interpretation: Sparsity

[Figure: separating hyperplane w · x + b = 0 with margin boundaries w · x + b = +1 and w · x + b = −1]

Support Vectors:

  • αj > 0

Non-support Vectors:

  • αj = 0
  • moving them will not change w

Final solution tends to be sparse:

  • αj = 0 for most j
  • don't need to store these points to compute w or make predictions
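Sketch of exploiting this sparsity (hypothetical names as before, numpy arrays assumed): discard everything except the support vectors before storing the model.

    import numpy as np

    def keep_support_vectors(alpha, X, y, tol=1e-8):
        sv = alpha > tol          # alpha_j = 0 for most j; drop those points
        return alpha[sv], X[sv], y[sv]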

SLIDE 8

Soft-margin SVM

Primal:

  minimize over w, b, ξ:   ½ ‖w‖² + C Σj ξj
  subject to:              yj (w · xj + b) ≥ 1 − ξj,   ξj ≥ 0

Solve for w, b, α:

Dual:

  maximize over α:   Σj αj − ½ Σj Σk αj αk yj yk (xj · xk)
  subject to:        0 ≤ αj ≤ C,   Σj αj yj = 0

What changed?

  • Added upper bound of C on αi!
  • Intuitive explanation:
    • Without slack, αi → ∞ when constraints are violated (points misclassified)
    • Upper bound of C limits the αi, so misclassifications are allowed
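As an illustrative aside (not from the slides), scikit-learn's SVC solves exactly this soft-margin dual, and its C parameter is the same upper bound on each αi discussed above.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])   # toy data
    y = np.array([-1, -1, 1, 1])

    clf = SVC(C=1.0, kernel='linear').fit(X, y)
    print(clf.support_)    # indices of the support vectors (alpha_i > 0)
    print(clf.dual_coef_)  # y_i * alpha_i for each support vector, |alpha_i| <= C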
SLIDE 9

Common kernels

  • Polynomials of degree exactly d:   K(u, v) = (u · v)^d
  • Polynomials of degree up to d:     K(u, v) = (u · v + 1)^d
  • Gaussian kernels:                  K(u, v) = exp(−‖u − v‖² / 2σ²)
  • Sigmoid:                           K(u, v) = tanh(η u · v + ν)
  • And many others: very active area of research!
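Minimal sketches of these kernels (parameter names σ, η, ν and the default values are illustrative, not from the slides):

    import numpy as np

    def poly_exact(u, v, d=2):
        return np.dot(u, v) ** d                  # polynomials of degree exactly d

    def poly_up_to(u, v, d=2):
        return (np.dot(u, v) + 1.0) ** d          # polynomials of degree up to d

    def gaussian(u, v, sigma=1.0):
        diff = np.asarray(u) - np.asarray(v)
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

    def sigmoid(u, v, eta=1.0, nu=0.0):
        return np.tanh(eta * np.dot(u, v) + nu)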
SLIDE 10

Polynomial kernel

Polynomials of degree exactly d

d = 1:

  φ(u) · φ(v) = (u1, u2) · (v1, v2) = u1v1 + u2v2 = u · v

d = 2:

  φ(u) · φ(v) = (u1², u1u2, u2u1, u2²) · (v1², v1v2, v2v1, v2²)
              = u1²v1² + 2 u1v1 u2v2 + u2²v2²
              = (u1v1 + u2v2)²
              = (u · v)²

For any d (we will skip the proof):

  φ(u) · φ(v) = (u · v)^d
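A quick numerical check of the d = 2 identity above (values chosen arbitrarily):

    import numpy as np

    u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    phi = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]*x[0], x[1]**2])

    print(phi(u) @ phi(v))    # explicit feature map: 1.0
    print(np.dot(u, v) ** 2)  # kernel shortcut:      1.0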

SLIDE 11

Gaussian kernel

[Figures: Cynthia Rudin, mblondel.org. Gaussian-kernel SVM decision surfaces, showing the support vectors and the level sets w · φ(x) = r for some r]

SLIDE 12

Kernel algebra

[Justin Domke]

Q: How would you prove that the "Gaussian kernel" is a valid kernel?
A: Expand the Euclidean norm as follows:

  exp(−‖u − v‖² / 2σ²) = exp(−‖u‖² / 2σ²) · exp(u · v / σ²) · exp(−‖v‖² / 2σ²)

Then, apply (e) from above.

To see that this is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c):

The feature mapping is infinite dimensional!
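A quick numerical check of the norm expansion used above (illustrative values; σ is the Gaussian width):

    import numpy as np

    u, v, sigma = np.array([1.0, 2.0]), np.array([0.5, -1.0]), 1.5
    lhs = np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma ** 2))
    rhs = (np.exp(-np.dot(u, u) / (2 * sigma ** 2))
           * np.exp(np.dot(u, v) / sigma ** 2)
           * np.exp(-np.dot(v, v) / (2 * sigma ** 2)))
    print(np.isclose(lhs, rhs))   # True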

SLIDE 13

Overfitting?

  • Huge feature space with kernels: should we worry about overfitting?
    – SVM objective seeks a solution with large margin
      • Theory says that large margin leads to good generalization (we will see this in a couple of lectures)
    – But everything overfits sometimes!!!
    – Can control by:

  • Setting C
  • Choosing a better Kernel
  • Varying parameters of the Kernel (width of Gaussian, etc.)
SLIDE 14
How to deal with imbalanced data?

  • In many practical applications we may have imbalanced data sets
  • We may want errors to be equally distributed between the positive and negative classes
  • A slight modification to the SVM objective does the trick!

Class-specific weighting of the slack variables:

  minimize over w, b, ξ:   ½ ‖w‖² + C+ Σ{j : yj = +1} ξj + C− Σ{j : yj = −1} ξj
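As an illustrative aside (not from the slides), scikit-learn exposes exactly this trick: the class_weight argument rescales C per class, playing the role of C+ and C− above.

    from sklearn.svm import SVC

    # Penalize slack on the (rare) positive class 5x more than on the negative class.
    clf = SVC(C=1.0, class_weight={1: 5.0, -1: 1.0})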

SLIDE 15

How do we do multi-class classification?

SLIDE 16

One versus all classification

Learn 3 classifiers:

  • - vs {o,+}, weights w-
  • + vs {o,-}, weights w+
  • o vs {+,-}, weights wo

Predict label using the most confident classifier:

  ŷ = the class y with the largest score  wy · x + by

Any problems? Could we learn this (1-D) dataset?

[Figure: the three weight vectors w+, w−, wo and a 1-D example dataset]
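A minimal sketch of one-versus-all prediction (hypothetical names: W and b map each class label to its learned weight vector and bias):

    import numpy as np

    def predict_one_vs_all(x, W, b):
        # Pick the class whose binary classifier is most confident about x.
        return max(W, key=lambda c: np.dot(W[c], x) + b[c])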

SLIDE 17

Multi-class SVM

Simultaneously learn 3 sets of weights: w+, w−, wo

  • How do we guarantee the correct labels?
  • Need new constraints!

The "score" of the correct class must be better than the "score" of the wrong classes, e.g. for an example xj labeled "+":

  w+ · xj ≥ w− · xj + 1   and   w+ · xj ≥ wo · xj + 1

SLIDE 18

Multi-class SVM

As for the SVM, we introduce slack variables and maximize the margin:

  minimize over w, b, ξ:   ½ Σy ‖wy‖² + C Σj ξj
  subject to:              the score of the correct class beats each wrong class's score by at least 1 − ξj,   ξj ≥ 0

Now can we learn it?

To predict, we use:

  ŷ = the class y with the largest score  wy · x + by

[Figure: the 1-D dataset from slide 16, with b+ = −0.5]