

slide-1
SLIDE 1

Support vector machines Lecture 4

David Sontag New York University

Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin

slide-2
SLIDE 2

Allowing for slack: “Soft margin” SVM

[Figure: separating hyperplane w·x + b = 0 with margin boundaries w·x + b = +1 and w·x + b = −1; points inside the margin or misclassified get slack ξ_j > 0]

min_{w,b} ½||w||² + C Σ_j ξ_j

subject to: y_j (w·x_j + b) ≥ 1 − ξ_j for all j

ξ_j ≥ 0

The ξ_j are the “slack variables”. What is the (optimal) value of ξ_j as a function of w and b?

If y_j (w·x_j + b) ≥ 1, then ξ_j = 0. Otherwise, ξ_j = 1 − y_j (w·x_j + b). Sometimes written as ξ_j = max(0, 1 − y_j (w·x_j + b)).

slide-3
SLIDE 3

Equivalent hinge loss formulation

min_{w,b} ½||w||² + C Σ_j ξ_j

subject to: y_j (w·x_j + b) ≥ 1 − ξ_j for all j

ξ_j ≥ 0

Substituting the optimal slack ξ_j = max(0, 1 − y_j (w·x_j + b)) into the objective, we get:

min_{w,b} ½||w||² + C Σ_j max(0, 1 − y_j (w·x_j + b))

The hinge loss is defined as ℓ_hinge(y, f(x)) = max(0, 1 − y f(x)). The second term is empirical risk minimization, using the hinge loss. The first term, ½||w||², is called regularization; it is used to prevent overfitting!
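As a concrete illustration, here is a minimal numpy sketch of the substituted objective; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Soft-margin SVM objective after substituting the optimal slacks.

    The optimal slack xi_j = max(0, 1 - y_j (w . x_j + b)) is exactly the
    hinge loss, so the objective becomes
        0.5 * ||w||^2 + C * sum_j hinge(y_j, w . x_j + b).
    """
    margins = y * (X @ w + b)                  # y_j (w . x_j + b) for each example
    slacks = np.maximum(0.0, 1.0 - margins)    # optimal xi_j = hinge loss
    return 0.5 * np.dot(w, w) + C * slacks.sum()
```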

slide-4
SLIDE 4

Hinge loss vs. 0/1 loss

Hinge loss upper bounds 0/1 loss!

Hinge loss: ℓ(y, f(x)) = max(0, 1 − y f(x))

0-1 loss: ℓ(y, f(x)) = 1[y f(x) ≤ 0]

[Figure: both losses plotted against the margin y f(x); the hinge curve lies on or above the 0-1 step function]
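A quick numerical check of this upper bound (a sketch; the grid of margin values is arbitrary):

```python
import numpy as np

# Evaluate both losses on a grid of margin values y * f(x).
margins = np.linspace(-2.0, 2.0, 9)
hinge = np.maximum(0.0, 1.0 - margins)       # max(0, 1 - y f(x))
zero_one = (margins <= 0).astype(float)      # 1 if the example is misclassified

assert np.all(hinge >= zero_one)             # hinge loss upper bounds the 0/1 loss
```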

slide-5
SLIDE 5
How to deal with imbalanced data?

  • In many practical applications we may have imbalanced data sets.
  • We may want errors to be equally distributed between the positive and negative classes.
  • A slight modification to the SVM objective does the trick!

Class-specific weighting of the slack variables (i.e., a different penalty C₊ / C₋ for slack on positive and negative examples); a sketch follows below.
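For example, in scikit-learn (mentioned on the software slide below), LinearSVC exposes this class-specific slack weighting through its class_weight argument; the toy data here is purely illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy imbalanced dataset: 95 negative examples, 5 positive examples.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(95, 2) - 1.0, rng.randn(5, 2) + 1.0])
y = np.array([-1] * 95 + [+1] * 5)

# class_weight scales the slack penalty C per class; "balanced" makes the
# weights inversely proportional to the class frequencies, so errors on the
# rare positive class cost as much in total as errors on the negatives.
clf = LinearSVC(C=1.0, class_weight="balanced")
clf.fit(X, y)
```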

slide-6
SLIDE 6

How do we do multi-class classification?

slide-7
SLIDE 7

One versus all classification

Learn 3 classifiers:

  • - vs {o,+}, weights w-
  • + vs {o,-}, weights w+
  • o vs {+,-}, weights wo

Predict label using: ŷ = arg max_y (w_y · x + b_y)

[Figure: a 2D dataset with the three learned decision boundaries, labeled w+, w−, and wo]

Any problems? Could we learn this dataset?
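A sketch of the one-versus-all prediction rule (the weight matrix W and offsets b are assumed to come from the three classifiers trained above; the names are illustrative):

```python
import numpy as np

def one_vs_all_predict(W, b, X):
    """Predict with one-vs-all classifiers: pick the class whose classifier
    scores each example highest.

    W has one row of weights per class, b one offset per class.
    """
    scores = X @ W.T + b              # shape: (n_examples, n_classes)
    return np.argmax(scores, axis=1)  # index of the highest-scoring class
```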

slide-8
SLIDE 8

Multi-class SVM

  • Simultaneously learn 3 sets of weights: w+, w−, wo
  • How do we guarantee the correct labels?
  • Need new constraints!

[Figure: the three weight vectors w+, w−, wo and their decision regions]

The “score” of the correct class must be better than the “score” of wrong classes:

w_{y_j} · x_j + b_{y_j} ≥ w_{y'} · x_j + b_{y'} + 1   for all y' ≠ y_j

slide-9
SLIDE 9

Multi-class SVM

As for the binary SVM, we introduce slack variables and maximize margin:

min ½ Σ_y ||w_y||² + C Σ_j ξ_j

subject to: w_{y_j} · x_j + b_{y_j} ≥ w_{y'} · x_j + b_{y'} + 1 − ξ_j for all y' ≠ y_j, and ξ_j ≥ 0

Now can we learn it?

To predict, we use: ŷ = arg max_y (w_y · x + b_y)
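In scikit-learn, LinearSVC can fit either independent one-vs-rest classifiers or a joint multi-class objective of this flavor (the Crammer-Singer formulation, which differs in detail from the constraints above); the blob dataset is only for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Toy 3-class dataset.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

ovr = LinearSVC(multi_class="ovr").fit(X, y)                # one versus all
joint = LinearSVC(multi_class="crammer_singer").fit(X, y)   # joint multi-class SVM

# In both cases prediction is argmax over the per-class scores w_y . x + b_y.
print(joint.predict(X[:5]))
```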

slide-10
SLIDE 10
Software

  • SVMlight: one of the most widely used SVM packages. Fast optimization, can handle very large datasets, C++ code.
  • LIBSVM (used within Python’s scikit-learn; see the usage sketch below)
  • Both of these handle multi-class, weighted SVM for imbalanced data, etc.
  • There are several new approaches to solving the SVM objective that can be much faster:
    – Stochastic subgradient method (up next!)
    – Distributed computation
  • See http://mloss.org, “machine learning open source software”
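A minimal usage sketch of the LIBSVM wrapper in scikit-learn (sklearn.svm.SVC); the synthetic dataset is illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC  # SVC wraps LIBSVM

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.score(X, y))  # training accuracy
```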

slide-11
SLIDE 11

PEGASOS

Primal Estimated sub-GrAdient SOlver for SVM

Shai Shalev-Shwartz Yoram Singer Nati Srebro

The Hebrew University Jerusalem, Israel

[ICML 2007]

slide-12
SLIDE 12

Support Vector Machines

QP form:

min_{w,b} ½||w||² + C Σ_j ξ_j  subject to  y_j (w·x_j + b) ≥ 1 − ξ_j, ξ_j ≥ 0

More “natural” form:

min_w (λ/2)||w||² + (1/m) Σ_j max(0, 1 − y_j (w·x_j))

Regularization term: (λ/2)||w||².  Empirical loss: the average hinge loss over the m training examples.
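A small numpy sketch of the “natural” form objective (no bias term, as in Pegasos; the names are illustrative). Up to rescaling by 1/λ it matches the QP-form objective with C = 1/(λm):

```python
import numpy as np

def pegasos_objective(w, X, y, lam):
    """(lam / 2) * ||w||^2  +  average hinge loss over the m examples."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```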

slide-13
SLIDE 13

PEGASOS

Subgradient step + projection: at iteration t, pick a subset A_t ⊆ S of the training set, set the step size η_t = 1/(λt), take a subgradient step of the objective evaluated on A_t, and (optionally) project w back onto the ball of radius 1/√λ.

A_t = S: subgradient method.  |A_t| = 1: stochastic (sub)gradient descent.
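A sketch of Pegasos with |A_t| = 1, following the update described above (no bias term; the function and variable names are illustrative):

```python
import numpy as np

def pegasos(X, y, lam, n_iters, seed=0):
    """Stochastic-subgradient Pegasos sketch with |A_t| = 1 and no bias term."""
    rng = np.random.RandomState(seed)
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.randint(m)                       # pick one training example at random
        eta = 1.0 / (lam * t)                    # step size 1 / (lambda * t)
        active = y[i] * np.dot(w, X[i]) < 1.0    # is the hinge loss active on example i?
        w = (1.0 - eta * lam) * w                # shrink: gradient of the regularizer
        if active:
            w = w + eta * y[i] * X[i]            # subgradient of the hinge term
        # optional projection onto the ball of radius 1 / sqrt(lambda)
        radius = 1.0 / np.sqrt(lam)
        norm = np.linalg.norm(w)
        if norm > radius:
            w = (radius / norm) * w
    return w
```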

slide-14
SLIDE 14

Run-Time of Pegasos

  • Choosing |A_t| = 1:

Run-time required for Pegasos to find an ε-accurate solution with probability ≥ 1 − δ:

  • Run-time does not depend on #examples
  • Depends on “difficulty” of problem (λ and ε)
slide-15
SLIDE 15

Experiments

  • 3 datasets (provided by Joachims)

    – Reuters CCAT (800K examples, 47k features)
    – Physics ArXiv (62k examples, 100k features)
    – Covertype (581k examples, 54 features)

Training time (in seconds):

                    Pegasos    SVM-Perf    SVM-Light
    Reuters               2          77       20,075
    Covertype             6          85       25,514
    Astro-Physics         2           5           80

slide-16
SLIDE 16

What’s Next!

  • Learn one of the most interesting and exciting recent advancements in machine learning
    – The “kernel trick”
    – High dimensional feature spaces at no extra cost!
  • But first, a detour
    – Constrained optimization!