SLIDE 1

Support vector machines (SVMs) Lecture 6

David Sontag New York University

Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin

SLIDE 2

Pegasos vs. Perceptron

Pegasos Algorithm
  Initialize: w_1 = 0, t = 0
  For iter = 1, 2, …, 20
    For j = 1, 2, …, |data|
      t = t + 1
      η_t = 1/(t·λ)
      If y_j (w_t · x_j) < 1:
        w_{t+1} = (1 − η_t λ) w_t + η_t y_j x_j
      Else:
        w_{t+1} = (1 − η_t λ) w_t
  Output: w_{t+1}
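A minimal NumPy sketch of the update above (illustrative names; assumes labels y[j] ∈ {−1, +1} and the fixed 20-pass schedule shown on the slide):

```python
import numpy as np

def pegasos(X, y, lam=0.01, n_iters=20):
    """Pegasos: stochastic subgradient descent on the regularized hinge loss,
    following the update rule on the slide. Assumes y[j] in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_iters):
        for j in range(n):
            t += 1
            eta = 1.0 / (t * lam)              # step size eta_t = 1 / (t * lambda)
            if y[j] * np.dot(w, X[j]) < 1:     # margin violated: hinge loss is active
                w = (1 - eta * lam) * w + eta * y[j] * X[j]
            else:                              # regularization-only (shrinkage) step
                w = (1 - eta * lam) * w
    return w
```

Prediction is then ŷ = sign(w · x).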

SLIDE 3

Pegasos vs. Perceptron

Perceptron Algorithm
  Initialize: w_1 = 0, t = 0
  For iter = 1, 2, …, 20
    For j = 1, 2, …, |data|
      t = t + 1
      If y_j (w_t · x_j) ≤ 0:
        w_{t+1} = w_t + y_j x_j
      Else:
        w_{t+1} = w_t
  Output: w_{t+1}
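For comparison, the same loop with the perceptron update (a sketch under the same data conventions; whether the mistake condition is strict is a minor convention choice):

```python
import numpy as np

def perceptron(X, y, n_iters=20):
    """Perceptron pass: no regularization, no step-size schedule, and an update
    only on sign mistakes (margin 0) rather than on margin violations (margin 1)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        for j in range(n):
            if y[j] * np.dot(w, X[j]) <= 0:   # misclassified (or on the boundary)
                w = w + y[j] * X[j]
    return w
```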

SLIDE 4

Much faster than previous methods

  • 3 datasets (provided by Joachims)

– Reuters CCAT (800K examples, 47K features)
– Physics ArXiv (62K examples, 100K features)
– Covertype (581K examples, 54 features)

Training Time (in seconds):

                 Pegasos   SVM-Perf   SVM-Light
  Reuters            2         77       20,075
  Covertype          6         85       25,514
  Astro-Physics      2          5           80

SLIDE 5

Running time guarantee

Error Decomposition

  • Approximation error:

– Best error achievable by a large-margin predictor
– Error of the population minimizer: w0 = argmin_w E[f(w)] = argmin_w λ‖w‖² + E_{x,y}[loss(⟨w, x⟩; y)]

  • Estimation error:

– Extra error due to replacing E[loss] with the empirical loss: w* = argmin_w f_n(w)

  • Optimization error:

– Extra error due to only optimizing to within finite precision

[Figure: prediction error decomposed into err(w0), err(w*), and err(w)]

[Shalev-Shwartz, Srebro ’08]

Note: w0 here denotes the population minimizer defined above; it does not refer to the initial weight vector of the algorithm.
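One way to make the decomposition explicit, writing w̃ for the weight vector the optimizer actually returns (a telescoping identity; the notation w̃ is mine):

```latex
\mathrm{err}(\tilde{w})
  \;=\; \underbrace{\mathrm{err}(w_0)}_{\text{approximation}}
  \;+\; \underbrace{\mathrm{err}(w^{*}) - \mathrm{err}(w_0)}_{\text{estimation}}
  \;+\; \underbrace{\mathrm{err}(\tilde{w}) - \mathrm{err}(w^{*})}_{\text{optimization}}
```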

SLIDE 6

Running time guarantee

Error decomposition (as on the previous slide), plus the Pegasos guarantee:

After T = Õ(1/ε) updates: err(w_T) < err(w0) + ε, with probability 1 − δ.

[Shalev-Shwartz, Srebro ’08]

SLIDE 7

Extending to multi-class classification

SLIDE 8

One versus all classification

Learn 3 classifiers:

  • - vs {o,+}, weights w-
  • + vs {o,-}, weights w+
  • o vs {+,-}, weights wo

Predict label using: ŷ = argmax_y (w_y · x)

[Figure: the three weight vectors w+, w−, wo, shown on a 1-D dataset]

Any problems? Could we learn this (1-D) dataset?
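A minimal sketch of this prediction rule (hypothetical names; assumes the three weight vectors have already been trained one-vs-rest, e.g. with Pegasos):

```python
import numpy as np

def ova_predict(x, weights):
    """weights: dict mapping class label -> weight vector.
    Return the class whose one-vs-all classifier scores highest."""
    return max(weights, key=lambda label: np.dot(weights[label], x))

# Example with three classes '-', '+', 'o' in 1-D:
weights = {'-': np.array([-1.0]), '+': np.array([1.0]), 'o': np.array([0.1])}
print(ova_predict(np.array([2.0]), weights))  # '+'
```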

SLIDE 9

Multi-class SVM

  • Simultaneously learn 3 sets of weights: w+, w−, wo
  • How do we guarantee the correct labels?
  • Need new constraints!

The “score” of the correct class must be better than the “score” of the wrong classes: w_{y_j} · x_j > w_{y'} · x_j for all y' ≠ y_j

SLIDE 10

Multi-class SVM

As for the binary SVM, we introduce slack variables and maximize the margin.

To predict, we use: ŷ = argmax_y (w_y · x)

Now can we learn the 1-D dataset from before?
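A standard way to write this optimization problem (one common convention; the exact scaling of the regularizer and slack penalty varies across presentations):

```latex
\min_{w,\;\xi \ge 0} \;\; \lambda \sum_{y} \|w_y\|^2 \;+\; \sum_{j} \xi_j
\qquad \text{s.t.} \qquad
w_{y_j} \cdot x_j \;\ge\; w_{y'} \cdot x_j + 1 - \xi_j
\quad \forall j,\; \forall y' \ne y_j
```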

SLIDE 11
How to deal with imbalanced data?

  • In many practical applications we may have imbalanced data sets
  • We may want errors to be equally distributed between the positive and negative classes
  • A slight modification to the SVM objective does the trick:

Class-specific weighting of the slack variables
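A sketch of what this modification looks like, with separate penalties C_pos and C_neg on the slack of positive and negative examples (hypothetical helper; scikit-learn's SVC exposes the same idea via its class_weight argument):

```python
import numpy as np

def weighted_hinge_objective(w, X, y, lam, C_pos, C_neg):
    """Binary SVM objective with class-specific slack weights:
    lam * ||w||^2 + sum_j C_{y_j} * max(0, 1 - y_j <w, x_j>)."""
    margins = y * (X @ w)
    slacks = np.maximum(0.0, 1.0 - margins)   # hinge losses = optimal slack values
    weights = np.where(y > 0, C_pos, C_neg)   # per-example penalty chosen by class
    return lam * np.dot(w, w) + np.sum(weights * slacks)
```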

SLIDE 12

What if the data is not linearly separable?

Use features of features, of features of features, …

Feature space can get really large really quickly!

φ(x) = ( x(1), …, x(n), x(1)x(2), x(1)x(3), …, e^{x(1)}, … )
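To get a feel for how quickly the feature space grows, a quick count of monomial features (illustrative snippet; uses math.comb from Python 3.8+):

```python
from math import comb

def num_monomials_up_to_degree(n, d):
    """Number of monomials of degree <= d in n variables: C(n + d, d)."""
    return comb(n + d, d)

print(num_monomials_up_to_degree(100, 2))  # 5,151
print(num_monomials_up_to_degree(100, 4))  # 4,598,126
```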

SLIDE 13

Key idea #3: the kernel trick

  • High dimensional feature spaces at no extra cost!
  • After every update (of Pegasos), the weight vector can be written in the form:

    w = Σ_i α_i y_i φ(x_i)

  • As a result, prediction can be performed with:

    ŷ ← sign(w · φ(x))
       = sign( (Σ_i α_i y_i φ(x_i)) · φ(x) )
       = sign( Σ_i α_i y_i (φ(x_i) · φ(x)) )
       = sign( Σ_i α_i y_i K(x_i, x) ),   where K(x, x′) = φ(x) · φ(x′)
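A minimal sketch of prediction in this kernelized form (hypothetical names; assumes the coefficients α_i were accumulated during training, e.g. by a kernelized version of Pegasos or the perceptron):

```python
import numpy as np

def kernel_predict(x, support_x, alpha, y, kernel):
    """Compute sign( sum_i alpha_i * y_i * K(x_i, x) ) without ever forming phi(x)."""
    score = sum(a * yi * kernel(xi, x) for xi, a, yi in zip(support_x, alpha, y))
    return np.sign(score)

# e.g. a degree-2 polynomial kernel K(u, v) = (u . v)^2:
poly2 = lambda u, v: np.dot(u, v) ** 2
```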

SLIDE 14

Common kernels

  • Polynomials of degree exactly d
  • Polynomials of degree up to d
  • Gaussian kernels
  • Sigmoid
  • And many others: very active area of research!
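For concreteness, common forms of these kernels as code (parameter names c, sigma, kappa are my own; conventions vary):

```python
import numpy as np

def poly_exact(u, v, d):                 # polynomials of degree exactly d
    return np.dot(u, v) ** d

def poly_up_to(u, v, d, c=1.0):          # polynomials of degree up to d
    return (np.dot(u, v) + c) ** d

def gaussian(u, v, sigma=1.0):           # Gaussian / RBF kernel
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid(u, v, kappa=1.0, c=0.0):     # sigmoid kernel (not PSD for all parameters)
    return np.tanh(kappa * np.dot(u, v) + c)
```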
SLIDE 15

Polynomial kernel

Polynomials of degree exactly d

d = 1:

  φ(u) · φ(v) = (u1, u2) · (v1, v2) = u1v1 + u2v2 = u · v

d = 2:

  φ(u) · φ(v) = (u1², u1u2, u2u1, u2²) · (v1², v1v2, v2v1, v2²)
              = u1²v1² + 2 u1v1 u2v2 + u2²v2²
              = (u1v1 + u2v2)²
              = (u · v)²

For any d (we will skip the proof): φ(u) · φ(v) = (u · v)^d
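A quick numerical check of the d = 2 identity above, with made-up vectors:

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

phi = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]*x[0], x[1]**2])

print(np.dot(phi(u), phi(v)))   # explicit feature map: 1.0
print(np.dot(u, v) ** 2)        # kernel shortcut:      1.0
```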

SLIDE 16

Quadratic kernel

[Tommi Jaakkola]

SLIDE 17

Gaussian kernel

[Figures: Cynthia Rudin, mblondel.org — support vectors, and level sets of the decision function, i.e. points where the score equals some r]

SLIDE 18

Kernel algebra

[Justin Domke] Q: How would you prove that the “Gaussian kernel” is a valid kernel? A: Expand the Euclidean norm as follows: ‖u − v‖² = ‖u‖² − 2 u·v + ‖v‖², so that exp(−‖u − v‖²) = exp(−‖u‖²) · exp(2 u·v) · exp(−‖v‖²). Then, apply (e) from above.

To see that this is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c):

The feature mapping is infinite dimensional!
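A small empirical sanity check (not a proof): the Gaussian kernel matrix of random points should be positive semidefinite up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # 50 random points in R^3

# Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())                    # >= 0 up to floating-point error
```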

SLIDE 19

Dual SVM interpretation: Sparsity

[Figure: separating hyperplane w·x + b = 0 with margin hyperplanes w·x + b = +1 and w·x + b = −1]

Support vectors:

  • α_j > 0

Non-support vectors:

  • α_j = 0
  • moving them will not change w

The final solution tends to be sparse:

  • α_j = 0 for most j
  • we don't need to store these points to compute w or make predictions
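This sparsity is easy to observe with an off-the-shelf solver; for example, scikit-learn's SVC reports which training points it kept as support vectors (illustrative script; requires scikit-learn):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated Gaussian blobs: most points end up far from the margin.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X), "training points")
```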

SLIDE 20

Overfitting?

  • Huge feature space with kernels: should we worry about overfitting?

    – The SVM objective seeks a solution with a large margin
      • Theory says that a large margin leads to good generalization (we will see this in a couple of lectures)
    – But everything overfits sometimes!
    – We can control it by:
      • Setting C
      • Choosing a better kernel
      • Varying the parameters of the kernel (width of the Gaussian, etc.)
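A sketch of turning these knobs in practice with cross-validation over C and the Gaussian kernel width gamma (illustrative values; requires scikit-learn):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Grid over the slack penalty C and the RBF kernel width (gamma ~ 1 / (2 * sigma^2)).
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print(search.best_params_, search.best_score_)
```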