SLIDE 1

Support Vector Machines & Kernels Lecture 6

David Sontag New York University

Slides adapted from Luke Zettlemoyer and Carlos Guestrin, and Vibhav Gogate

SLIDE 2

Dual SVM derivation (1) – the linearly separable case

Original optimization problem (rewrite the constraints as "≤ 0" inequalities):

Lagrangian (one Lagrange multiplier per example):

Our goal now is to solve:
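The equations themselves appear on the slide as images; in the standard form of this derivation they are:

  Primal:      min over w, b of  ½||w||²   subject to   1 − yj(w·xj + b) ≤ 0  for all j

  Lagrangian:  L(w, b, α) = ½||w||² + Σj αj (1 − yj(w·xj + b)),   with αj ≥ 0

  Goal:        min over w, b   max over α ≥ 0   of  L(w, b, α)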

SLIDE 3

Dual SVM derivation (2) – the linearly separable case

Swap min and max. Slater’s condition from convex optimization guarantees that these two optimization problems are equivalent!

(Primal) (Dual)
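Written out in the standard form (the slide shows these as images):

  (Primal)  min over w, b   max over α ≥ 0   of  L(w, b, α)        (Dual)  max over α ≥ 0   min over w, b   of  L(w, b, α)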

SLIDE 4

Dual SVM derivation (3) – the linearly separable case

Can solve for optimal w, b as function of α:

∂L/∂w = w − Σj αj yj xj = 0   ⇒   w = Σj αj yj xj

Substituting these values back in (and simplifying), we obtain the dual problem:

(Dual) — the sums run over all training examples, xi · xj is a dot product of feature vectors, and the αj and yj are scalars
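In the standard form of the derivation (the slide shows it as an image), the dual for the linearly separable case is:

  max over α of  Σj αj − ½ Σi Σj αi αj yi yj (xi · xj)   subject to   αj ≥ 0 for all j,   Σj αj yj = 0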

SLIDE 5

Dual SVM derivation (3) – the linearly separable case

Can solve for optimal w, b as function of α:

∂L/∂w = w − Σj αj yj xj = 0   ⇒   w = Σj αj yj xj

So, in dual formulation we will solve for α directly!

  • w and b are computed from α (if needed)

Substituting these values back in (and simplifying), we obtain the same dual problem as on the previous slide.

SLIDE 6

Dual SVM derivation (3) – the linearly separable case

Lagrangian:

αj > 0 for some j implies the constraint for example j is tight. We use this to obtain b: (1) (2) (3)
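The three numbered steps appear on the slide as images; the standard derivation is: (1) pick any j with αj > 0, so the constraint is tight: yj(w·xj + b) = 1; (2) multiply both sides by yj and use yj² = 1 to get b = yj − w·xj; (3) substitute w = Σi αi yi xi, giving b = yj − Σi αi yi (xi · xj).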

SLIDE 7

Classification rule using dual solution

Using the dual solution, classification requires only dot products between the feature vector of the new example and the support vectors.
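In standard form (the slide’s equation is an image), the classification rule is:

  ŷ(x) = sign( Σj αj yj (xj · x) + b )

Only the support vectors (αj > 0) contribute to the sum.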

SLIDE 8

Dual for the non-separable case

Primal:

Solve for w,b,α:

Dual:
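In standard soft-margin form (the slide shows these as images), the two problems are:

  Primal:  min over w, b, ξ of  ½||w||² + C Σj ξj   subject to   yj(w·xj + b) ≥ 1 − ξj,   ξj ≥ 0

  Dual:    max over α of  Σj αj − ½ Σi Σj αi αj yi yj (xi · xj)   subject to   0 ≤ αj ≤ C,   Σj αj yj = 0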

What changed?

  • Added upper bound of C on αi!
  • Intuitive explanation:
      • Without slack, αi → ∞ when constraints are violated (points misclassified)
      • Upper bound of C limits the αi, so misclassifications are allowed
SLIDE 9

Support vectors

  • Complementary slackness conditions (written out below):
  • Support vectors: points xj such that

(includes all j such that …, but also additional points where …)

  • Note: the SVM dual solution may not be unique!

α*j = 0  ∧  yj(w* · xj + b) ≤ 1
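For reference, the standard soft-margin complementary slackness conditions (restated here; not necessarily in the slide’s exact notation) are αj (1 − ξj − yj(w·xj + b)) = 0 and (C − αj) ξj = 0. In particular, αj > 0 ⇒ yj(w·xj + b) = 1 − ξj, and αj < C ⇒ ξj = 0.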

SLIDE 10

Dual SVM interpretation: Sparsity

[Figure: maximum-margin separator w.x + b = 0 with margin boundaries w.x + b = +1 and w.x + b = -1]

Support Vectors:

  • αj > 0

Non-support Vectors:

  • αj = 0
  • moving them will not change w

Final solution tends to be sparse:

  • αj = 0 for most j
  • don’t need to store these points to compute w or make predictions

SLIDE 11

SVM with kernels

  • Never compute features explicitly!!!

    – Compute dot products in closed form

  • O(n²) time in size of dataset to compute objective

    – much work on speeding this up

Predict with:
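The prediction rule itself appears on the slide as an image; in kernel form it is ŷ(x) = sign( Σj αj yj K(xj, x) + b ), summing only over the support vectors. A minimal sketch in Python, assuming the αj, b, and support vectors have already been obtained by solving the dual (the function and variable names here are hypothetical, not from the slides):

import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def predict(x_new, support_x, support_y, support_alpha, b, kernel=gaussian_kernel):
    # Kernelized decision rule: sign( sum_j alpha_j * y_j * K(x_j, x_new) + b ).
    # Only the support vectors (alpha_j > 0) need to be stored and summed over.
    score = sum(a * y * kernel(xj, x_new)
                for a, y, xj in zip(support_alpha, support_y, support_x))
    return 1 if score + b >= 0 else -1

Because the kernel is evaluated in closed form, the feature mapping is never computed explicitly, which is the point of this slide.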

SLIDE 12

Quadratic kernel

[Tommi Jaakkola]

SLIDE 13

Quadratic kernel

[Cynthia Rudin]

Feature mapping given by:
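The mapping itself appears on the slide as an image. For reference (the slide may use (x·z)² rather than (1 + x·z)²), the quadratic kernel K(x, z) = (1 + x·z)² on two-dimensional inputs corresponds to the explicit feature map

  φ(x) = (1, √2 x1, √2 x2, x1², x2², √2 x1 x2),

since φ(x)·φ(z) = 1 + 2x1z1 + 2x2z2 + x1²z1² + x2²z2² + 2x1x2z1z2 = (1 + x·z)².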

SLIDE 14

Common kernels

  • Polynomials of degree exactly d
  • Polynomials of degree up to d
  • Gaussian kernels
  • And many others: very active area of research!

(e.g., structured kernels that use dynamic programming to evaluate, string kernels, …)

(the exponent of the Gaussian kernel contains the Euclidean distance, squared)
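In their standard forms (the slide shows the formulas as images): polynomials of degree exactly d: K(u, v) = (u·v)^d; polynomials of degree up to d: K(u, v) = (1 + u·v)^d; Gaussian kernel: K(u, v) = exp(−||u − v||² / (2σ²)).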

SLIDE 15

Gaussian kernel

[Cynthia Rudin] [mblondel.org]

[Figure: decision surface of a Gaussian-kernel SVM, with support vectors highlighted and level sets, i.e. w.x = r for some r]

SLIDE 16

Kernel algebra

[Justin Domke]

Q: How would you prove that the “Gaussian kernel” is a valid kernel?

A: Expand the Euclidean norm, then apply rule (e) from the slide’s list of kernel-composition rules.
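In standard notation (the slide’s equations are images): ||u − v||² = ||u||² − 2 u·v + ||v||², so

  exp(−||u − v||² / (2σ²)) = exp(−||u||² / (2σ²)) · exp(u·v / σ²) · exp(−||v||² / (2σ²)),

which has the form f(u) · K′(u, v) · f(v) with K′(u, v) = exp(u·v / σ²); rule (e) presumably covers products of this form.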

To see that this is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c):
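Sketch (the standard argument, not reproduced from the slide): exp(u·v / σ²) = Σk≥0 (u·v)^k / (k! σ^(2k)), a sum of nonnegative multiples of powers of the linear kernel u·v, so repeated use of the sum, scaling, and product rules shows it is a valid kernel.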

The feature mapping is infinite dimensional!

SLIDE 17

Overfitting?

  • Huge feature space with kernels: should we worry about overfitting?

– SVM objective seeks a solution with large margin

  • Theory says that large margin leads to good generalization

(we will see this in a couple of lectures)

– But everything overfits sometimes!!!

– Can control by:

  • Setting C
  • Choosing a better Kernel
  • Varying parameters of the Kernel (width of Gaussian, etc.)