SLIDE 1

Support Vector Machines

SLIDE 2

Preview

  • What is a support vector machine?
  • The perceptron revisited
  • Kernels
  • Weight optimization
  • Handling noisy data
SLIDE 3

What Is a Support Vector Machine?

  • 1. A subset of the training examples x (the support vectors)
  • 2. A vector of weights for them, α
  • 3. A similarity function K(x, x′) (the kernel)

Class prediction for new example xq:

f(xq) = sign(Σi αi yi K(xq, xi))   (yi ∈ {−1, 1})
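To make the decision rule concrete, here is a minimal sketch in Python; the function and variable names are illustrative, not from the slides, and the support vectors, labels, weights, and kernel are assumed given.

```python
import numpy as np

def svm_predict(x_q, support_vectors, y, alpha, kernel):
    """Class prediction: f(x_q) = sign(sum_i alpha_i * y_i * K(x_q, x_i))."""
    s = sum(a_i * y_i * kernel(x_q, x_i)
            for a_i, y_i, x_i in zip(alpha, y, support_vectors))
    return np.sign(s)  # labels y_i are in {-1, +1}
```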
SLIDE 4
  • So SVMs are a form of instance-based learning
  • But they’re usually presented as a generalization of the perceptron
  • What’s the relation between perceptrons and IBL?
SLIDE 5

The Perceptron Revisited

The perceptron is a special case of weighted kNN you get when the similarity function is the dot product:

f(xq) = sign(Σj wj xqj)

But wj = Σi αi yi xij, so

f(xq) = sign(Σj Σi αi yi xij xqj) = sign(Σi αi yi (xq · xi))
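As a sanity check, the sketch below verifies this identity numerically: a weight vector built as w = Σi αi yi xi gives the same prediction as the instance-weighted form Σi αi yi (xq · xi). The random data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # training examples x_i
y = np.array([1, -1, 1, 1, -1])    # labels y_i in {-1, +1}
alpha = rng.uniform(size=5)        # instance weights alpha_i
x_q = rng.normal(size=3)           # query example

w = (alpha * y) @ X                # w_j = sum_i alpha_i y_i x_ij
primal = np.sign(w @ x_q)                         # sign(sum_j w_j x_qj)
dual = np.sign(np.sum(alpha * y * (X @ x_q)))     # sign(sum_i alpha_i y_i (x_q . x_i))
assert primal == dual
```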

SLIDE 6

Another View of SVMs

  • Take the perceptron
  • Replace dot product with arbitrary similarity function
  • Now you have a much more powerful learner
  • Kernel matrix: K(x, x′) for x, x′ ∈ Data
  • If the symmetric matrix K is positive semi-definite (i.e., has non-negative eigenvalues), then K(x, x′) is still a dot product, but in a transformed space: K(x, x′) = φ(x) · φ(x′)

  • Also guarantees convex weight optimization problem
  • Very general trick
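The "perceptron with the dot product replaced by a kernel" can be sketched directly: train in the dual form, touching examples only through K. A minimal sketch, assuming data X, labels y, and a kernel function K; not an optimized implementation.

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=10):
    """Dual-form perceptron: learn instance weights alpha instead of w."""
    n = len(y)
    alpha = np.zeros(n)
    G = np.array([[K(xi, xj) for xj in X] for xi in X])  # kernel (Gram) matrix
    for _ in range(epochs):
        for i in range(n):
            # predict with f(x_i) = sign(sum_j alpha_j y_j K(x_i, x_j))
            if np.sign(np.sum(alpha * y * G[i])) != y[i]:
                alpha[i] += 1.0   # mistake: increase this example's weight
    return alpha
```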
SLIDE 7

Examples of Kernels

Linear: K(x, x′) = x · x′
Polynomial: K(x, x′) = (x · x′)^d
Gaussian: K(x, x′) = exp(−‖x − x′‖² / 2σ²)
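These three kernels translate directly into NumPy; the parameter names d and sigma below are illustrative defaults.

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp                       # K(x, x') = x . x'

def polynomial_kernel(x, xp, d=2):
    return (x @ xp) ** d                # K(x, x') = (x . x')^d

def gaussian_kernel(x, xp, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
```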

SLIDE 8

Example: Polynomial Kernel

u = (u1, u2), v = (v1, v2)

(u · v)² = (u1v1 + u2v2)²
  = u1²v1² + u2²v2² + 2u1v1u2v2
  = (u1², u2², √2 u1u2) · (v1², v2², √2 v1v2)
  = φ(u) · φ(v)

  • Linear kernel can’t represent quadratic frontiers
  • Polynomial kernel can
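The expansion above is easy to verify numerically (the values here are arbitrary):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

phi = lambda w: np.array([w[0]**2, w[1]**2, np.sqrt(2) * w[0] * w[1]])

# (u . v)^2 equals phi(u) . phi(v)
assert np.isclose((u @ v) ** 2, phi(u) @ phi(v))
```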
SLIDE 9

Learning SVMs

So how do we:

  • Choose the kernel? Black art
  • Choose the examples? Side effect of choosing weights
  • Choose the weights? Maximize the margin
SLIDE 10

Maximizing the Margin

SLIDE 11

The Weight Optimization Problem

  • Margin = mini yi(w · xi)
  • Easy to increase margin by increasing weights!
  • Instead: Fix margin, minimize weights
  • Minimize w · w
    Subject to yi(w · xi) ≥ 1, for all i
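As an illustration, this problem can be handed to a generic constrained solver. A minimal sketch with scipy.optimize; the toy data and solver choice are assumptions, not from the slides (the slides' formulation has no bias term, so none is used here).

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

objective = lambda w: w @ w                     # minimize w . w
constraints = [{"type": "ineq",                 # scipy convention: fun(w) >= 0
                "fun": lambda w, i=i: y[i] * (X[i] @ w) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(2), constraints=constraints)
print(res.x)   # maximum-margin weight vector
```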

SLIDE 12

Constrained Optimization 101

  • Minimize f(w)
    Subject to hi(w) = 0, for i = 1, 2, . . .
  • At solution w∗, ∇f(w∗) must lie in the subspace spanned by {∇hi(w∗): i = 1, 2, . . .}
  • Lagrangian function: L(w, β) = f(w) + Σi βi hi(w)
  • The βi are the Lagrange multipliers
  • Solve ∇L(w∗, β∗) = 0
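A tiny worked instance, solved symbolically with sympy (the particular f and h are illustrative): minimize f(w) = w1² + w2² subject to h(w) = w1 + w2 − 1 = 0.

```python
import sympy as sp

w1, w2, beta = sp.symbols("w1 w2 beta")

f = w1**2 + w2**2            # objective
h = w1 + w2 - 1              # equality constraint h(w) = 0
L = f + beta * h             # Lagrangian

# Solve grad L = 0 over (w1, w2, beta)
sol = sp.solve([sp.diff(L, v) for v in (w1, w2, beta)], (w1, w2, beta))
print(sol)   # {w1: 1/2, w2: 1/2, beta: -1}
```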
SLIDE 13

Primal and Dual Problems

  • Problem over w is the primal
  • Solve equations for w and substitute
  • Resulting problem over β is the dual
  • If it’s easier, solve dual instead of primal
  • In SVMs:

– Primal problem is over feature weights
– Dual problem is over instance weights

SLIDE 14

Inequality Constraints

  • Minimize f(w)
    Subject to gi(w) ≤ 0, for i = 1, 2, . . .
    hi(w) = 0, for i = 1, 2, . . .
  • Lagrange multipliers for inequalities: αi
  • KKT conditions:
    ∇L(w∗, α∗, β∗) = 0
    α∗i ≥ 0
    gi(w∗) ≤ 0
    α∗i gi(w∗) = 0
  • Complementarity: Either a constraint is active (gi(w∗) = 0) or its multiplier is zero (α∗i = 0)

  • In SVMs: Active constraint ⇒ Support vector
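A one-dimensional worked instance of the KKT conditions (the specific f and g are illustrative): minimize f(w) = w² subject to g(w) = 1 − w ≤ 0. The constraint comes out active with a positive multiplier, exactly the situation that makes a training example a support vector.

```python
import sympy as sp

w, a = sp.symbols("w alpha")

f = w**2                 # objective
g = 1 - w                # inequality constraint g(w) <= 0
L = f + a * g            # Lagrangian

# Stationarity dL/dw = 0 and complementarity alpha * g(w) = 0
for sol in sp.solve([sp.diff(L, w), a * g], (w, a), dict=True):
    if sol[w] >= 1 and sol[a] >= 0:        # feasibility and alpha >= 0
        print(sol)   # {w: 1, alpha: 2}: constraint active, multiplier positive
```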
SLIDE 15

Solution Techniques

  • Use generic quadratic programming solver
  • Use specialized optimization algorithm
  • E.g.: SMO (Sequential Minimal Optimization)

– Simplest method: Update one αi at a time
– But this violates constraints
– Iterate until convergence:

  • 1. Find example xi that violates KKT conditions
  • 2. Select second example xj heuristically
  • 3. Jointly optimize αi and αj
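Below is a minimal sketch of the simplified SMO variant common in teaching material: the second index j is picked at random rather than heuristically, and many refinements are omitted. It follows the usual formulation with a bias term b, whose constraint Σi αi yi = 0 is exactly why single-α updates break the constraints and pairs must be optimized jointly. All names and tolerances are illustrative.

```python
import numpy as np

def smo_simplified(X, y, C=1.0, tol=1e-3, max_passes=10, kernel=np.dot):
    """Simplified SMO: jointly optimize pairs (alpha_i, alpha_j) until
    no example violates the KKT conditions (within tol)."""
    n = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    alpha, b, passes = np.zeros(n), 0.0, 0

    f = lambda i: np.sum(alpha * y * K[i]) + b   # current decision value

    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = f(i) - y[i]
            # 1. does x_i violate the KKT conditions?
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                # 2. pick j != i (here: at random, not heuristically)
                j = np.random.choice([k for k in range(n) if k != i])
                E_j = f(j) - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                # box [L, H] keeping 0 <= alpha <= C and sum_i alpha_i y_i fixed
                if y[i] != y[j]:
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                # 3. jointly optimize alpha_i and alpha_j
                alpha[j] = np.clip(aj_old - y[j] * (E_i - E_j) / eta, L, H)
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                # update the bias so KKT holds for the changed pair
                b1 = b - E_i - y[i] * (alpha[i] - ai_old) * K[i, i] \
                     - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - ai_old) * K[i, j] \
                     - y[j] * (alpha[j] - aj_old) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```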
SLIDE 16

Handling Noisy Data

SLIDE 17

Handling Noisy Data

  • Introduce slack variables ξi
  • Minimize w · w + C Σi ξi
    Subject to yi(w · xi) ≥ 1 − ξi and ξi ≥ 0, for all i
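In practice this soft-margin problem is what off-the-shelf SVM libraries solve. For example, with scikit-learn (the library choice is an assumption, not from the slides), C plays exactly the role of this slack penalty:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2., 2.], [2., 3.], [-1., -1.], [-2., -1.], [1.5, 2.5]])
y = np.array([1, 1, -1, -1, -1])        # the last point is "noisy"

# Small C tolerates slack (wider margin); large C approaches the hard margin.
clf = SVC(C=0.1, kernel="linear").fit(X, y)
print(clf.predict([[0., 0.]]))
```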

SLIDE 18

Bounds

Margin bound: Bound on VC dimension decreases with margin

Leave-one-out bound: E[errorD(h)] ≤ E[# support vectors] / (# examples)
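The leave-one-out bound is easy to read off a trained model: the fraction of training examples that end up as support vectors bounds the expected leave-one-out error. Continuing the scikit-learn sketch above (again an assumption, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2., 2.], [2., 3.], [-1., -1.], [-2., -1.], [1.5, 2.5]])
y = np.array([1, 1, -1, -1, -1])

clf = SVC(C=0.1, kernel="linear").fit(X, y)
loo_bound = len(clf.support_) / len(y)   # (# support vectors) / (# examples)
print(loo_bound)
```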

SLIDE 19

Support Vector Machines: Summary

  • What is a support vector machine?
  • The perceptron revisited
  • Kernels
  • Weight optimization
  • Handling noisy data