SLIDE 1

Support Vector Machines

Léon Bottou, COS 424 – 4/1/2010

slide-2
SLIDE 2

Agenda

Goals
  • Classification, clustering, regression, other.

Representation
  • Parametric vs. kernels vs. nonparametric
  • Probabilistic vs. nonprobabilistic
  • Linear vs. nonlinear
  • Deep vs. shallow

Capacity Control
  • Explicit: architecture, feature selection
  • Explicit: regularization, priors
  • Implicit: approximate optimization
  • Implicit: bayesian averaging, ensembles

Operational Considerations
  • Loss functions
  • Budget constraints
  • Online vs. offline

Computational Considerations
  • Exact algorithms for small datasets.
  • Stochastic algorithms for big datasets.
  • Parallel algorithms.

SLIDE 3

Summary

  1. Maximizing margins.
  2. Soft margins.
  3. Kernels.
  4. Kernels everywhere.

SLIDE 4

The curse of dimensionality

Polynomial classifiers in dimension d

Discriminant function: f(x) = w⊤Φ(x) + b.

  Degree   Dim(Φ(x))   Φ(x)
  1        d           Φ(x) = [xi], 1 ≤ i ≤ d
  2        ≈ d²/2      Φ(x) += [xi xj], 1 ≤ i ≤ j ≤ d
  3        ≈ d³/6      Φ(x) += [xi xj xk], 1 ≤ i ≤ j ≤ k ≤ d
  . . .
  n        ≈ dⁿ/n!

The number of parameters increases quickly. Training such a classifier directly requires a number of examples that increases just as quickly as the number of parameters.
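A quick sketch of that growth, assuming NumPy and scikit-learn are available (the dimension d below is an illustrative choice, not from the slides). The exact number of monomials of degree 1 through n in d variables is C(d+n, n) − 1, which matches the ≈ dⁿ/n! figure in the table when d is large.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

d = 20  # input dimension (illustrative choice)
for degree in (1, 2, 3, 4):
    # exact count of monomials of degree 1..degree in d variables
    print(degree, comb(d + degree, degree) - 1)

# The same blow-up made concrete with scikit-learn:
X = np.random.randn(5, d)
Phi = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)
print(Phi.shape)  # (5, 1770) = (5, comb(d + 3, 3) - 1)
```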

SLIDE 5

Beating the curse of dimensionality?

Capacity ≪ number of parameters

Assume the patterns x1 . . . x2l are known beforehand; the classes are unknown.
Let R = max ‖xi‖. We say that a hyperplane

  w⊤x + b,   with w, x ∈ Rd and ‖w‖ = 1,

separates the patterns with margin ∆ if

  ∀i = 1 . . . 2l    |w⊤xi + b| ≥ ∆

The family F of ∆-margin separating hyperplanes satisfies

  log N(F, D) ≤ h log(2le/h)    with    h ≤ min(R²/∆², d) + 1

SLIDE 6

Maximizing margins

Patterns xi ∈ Rd, classes yi = ±1.

[Figure: separating hyperplane with normal w and margin width 2∆.]

  max_{w,b,∆}  ∆

  subject to   ‖w‖ = 1   and   ∀i  yi(w⊤xi + b) ≥ ∆

SLIDE 7

Maximizing margins

Classic formulation

[Figure: hyperplanes w⊤x + b = +1 and w⊤x + b = −1 bounding the margin.]

  min_{w,b}  ‖w‖²

  subject to   ∀i  yi(w⊤xi + b) ≥ 1

This is a quadratic programming problem with linear constraints.
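As an illustration that this really is a small QP, here is a minimal sketch that hands the hard-margin primal to a generic constrained solver (scipy.optimize.minimize with SLSQP) on a toy separable dataset. The data and starting point are illustrative; real SVM packages solve the dual with specialized algorithms instead.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(theta):            # theta = (w1, w2, b)
    w = theta[:2]
    return w @ w                 # minimize ||w||^2

constraints = [                  # y_i (w.x_i + b) - 1 >= 0 for every i
    {"type": "ineq", "fun": lambda t, i=i: y[i] * (t[:2] @ X[i] + t[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b, y * (X @ w + b))     # every margin should come out >= 1
```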

SLIDE 8

Maximizing margins

Equivalence between the formulations

Let w′ = w/∆ and b′ = b/∆.

The constraint yi(w⊤xi + b) ≥ ∆ becomes yi(w′⊤xi + b′) ≥ 1.

The problem  max_{w,b,∆} ∆ subject to ‖w‖ = 1  becomes  min_{w′,b′} ‖w′‖².

Both discriminant functions w⊤x + b and w′⊤x + b′ describe the same decision boundary.

SLIDE 9

Primal and dual formulation

Karush-Kuhn-Tucker theory
– A refined theory for convex optimization under constraints.
– Construct a dual optimization problem whose constraints are simpler, and whose solution is related to the solution we seek.

SLIDE 10

Primal and dual formulation

Karush-Kuhn-Tucker theory
– A refined theory for convex optimization under constraints.
– Construct a dual optimization problem whose constraints are simpler, and whose solution is related to the solution we seek.

Primal formulation: max margin between the classes.
Dual formulation: min distance between the convex hulls of the two classes.

[Figure: convex hulls of the two classes with closest points A and B.]

SLIDE 11

Dual formulation

Min distance between convex hulls

[Figure: convex hulls of the two classes with closest points A and B.]

– Point A:   Σ_{i∈Pos} βi xi   subject to  βi ≥ 0  and  Σ_{i∈Pos} βi = 1

– Point B:   Σ_{i∈Neg} βi xi   subject to  βi ≥ 0  and  Σ_{i∈Neg} βi = 1

– Vector BA:   Σ_i yi βi xi   subject to  βi ≥ 0,  Σ_i βi = 2,  and  Σ_i yi βi = 0.

SLIDE 12

Dual formulation

Min distance between convex hulls

  min_β  Σ_{ij} yi yj βi βj xi⊤xj

  subject to   ∀i  βi ≥ 0,   Σ_i yi βi = 0,   Σ_i βi = 2

Then w = Σ_i yi βi xi.

Then b is easy to find by projecting all examples on w.

SLIDE 13

Dual formulation

Classic formulation

  max_α  Σ_i αi − (1/2) Σ_{ij} yi yj αi αj xi⊤xj

  subject to   ∀i  αi ≥ 0,   Σ_i yi αi = 0

This is equivalent, with αi = βi ∆⁻², but the proof is nontrivial.

SLIDE 14

Support Vector Machines

Min distance between convex hulls

  min_β  Σ_{ij} yi yj βi βj xi⊤xj

  subject to   ∀i  βi ≥ 0,   Σ_i yi βi = 0,   Σ_i βi = 2

The only nonzero βi are those corresponding to the support vectors.

SLIDE 15

Leave-One-Out

Leave one out = n-fold cross-validation
– Compute classifiers fi using the training set minus example (xi, yi).
– Estimate the test misclassification rate as

  E_LOO = (1/n) Σ_{i=1}^n 1{ yi fi(xi) ≤ 0 }.

Leave one out for the maximal margin classifier
– Removing a non support vector does not change the classifier, hence

  E_LOO ≤ (#support vectors) / (#examples)

– The important quantity is not the dimension but the number of support vectors.
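A hedged sketch of that bound in practice, assuming scikit-learn: fit a nearly hard-margin SVM (large C) and compare #support vectors / #examples with a true leave-one-out estimate. The dataset and the C value are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)

clf = SVC(kernel="linear", C=100.0).fit(X, y)   # large C ~ (nearly) hard margin
bound = len(clf.support_) / len(y)              # #support vectors / #examples

loo_error = 1.0 - cross_val_score(SVC(kernel="linear", C=100.0),
                                  X, y, cv=LeaveOneOut()).mean()
print(f"LOO error {loo_error:.3f}  vs  bound {bound:.3f}")
```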

SLIDE 16

Soft margins

When the examples are not linearly separable, the constraints yi(w⊤xi + b) ≥ 1 cannot be satisfied.

Adding slack variables ξi:

  min_{w,b,ξ}  ‖w‖² + C Σ_{i=1}^n ξi

  subject to   ∀i  yi(w⊤xi + b) ≥ 1 − ξi ,  ξi ≥ 0

Parameter C controls the relative importance of:
– correctly classifying all the training examples,
– obtaining the separation with the largest margin.

This reduces to hard margins when C = ∞.

SLIDE 17

Soft margins and Hinge loss

The soft margin problem

  min_{w,b,ξ}  ‖w‖² + C Σ_{i=1}^n ξi
  subject to   ∀i  yi(w⊤xi + b) ≥ 1 − ξi ,  ξi ≥ 0

is the same thing as the unconstrained problem

  min_{w,b}  ‖w‖² + C Σ_{i=1}^n ℓ(yi(w⊤xi + b))

with the hinge loss

  ℓ(z) = max(0, 1 − z)
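A small NumPy sketch of the unconstrained form, with the hinge loss written out explicitly (the data and the particular w, b are illustrative); this is the objective one could hand to any generic optimizer.

```python
import numpy as np

def hinge(z):
    """The hinge loss l(z) = max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

def soft_margin_objective(w, b, X, y, C):
    """The unconstrained form: ||w||^2 + C * sum_i l(y_i (w.x_i + b))."""
    return w @ w + C * hinge(y * (X @ w + b)).sum()

# Illustrative evaluation on random data with an arbitrary (w, b):
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.choice([-1.0, 1.0], size=50)
print(soft_margin_objective(np.ones(3), 0.0, X, y, C=1.0))
```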

SLIDE 18

Soft Margins

Primal formulation

  min_{w,b,ξ}  ‖w‖² + C Σ_{i=1}^n ξi
  subject to   ∀i  yi(w⊤xi + b) ≥ 1 − ξi ,  ξi ≥ 0

Dual formulation

  max_α  Σ_i αi − (1/2) Σ_{ij} yi yj αi αj xi⊤xj
  subject to   ∀i  0 ≤ αi ≤ C,   Σ_i yi αi = 0

The primal and dual solutions obey the relation

  w = Σ_{i=1}^n yi αi xi .

The threshold b is easy to find once w is known.
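That relation can be checked directly with scikit-learn, since a fitted linear SVC stores yiαi for the support vectors in dual_coef_ (the dataset below is illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=80, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores y_i * alpha_i for the support vectors only.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))   # True: w = sum_i y_i alpha_i x_i
print(clf.intercept_)                        # the threshold b
```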

SLIDE 19

Soft Margins

[Figure: soft-margin solution with examples labeled αi = 0, 0 < αi < C, or αi = C, the last with slack ξi > 0.]

SLIDE 20

Beyond linear separation

Reintroducing the Φ(x)
– Define K(x, v) = Φ(x)⊤Φ(v).
– Dual optimization problem

  max_α  Σ_i αi − (1/2) Σ_{ij} yi yj αi αj K(xi, xj)
  subject to   ∀i  0 ≤ αi ≤ C,   Σ_i yi αi = 0

– Discriminant function

  f(x) = w⊤Φ(x) + b = Σ_{i=1}^n yi αi K(xi, x) + b

Curious fact
– We do not really need to compute Φ(x).
– The dot products K(x, v) = Φ(x)⊤Φ(v) are enough.
– Can we take advantage of this?
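We can: a kernel SVC never materializes Φ(x). As a sketch (scikit-learn, illustrative data and γ), the fitted model keeps only the support vectors and the coefficients yiαi, so the discriminant f(x) = Σ yi αi K(xi, x) + b can be reproduced by hand:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=0)
gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

X_new = X[:5]
K = rbf_kernel(X_new, clf.support_vectors_, gamma=gamma)    # K(x, x_i)
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_      # sum_i y_i alpha_i K(x_i, x) + b
print(np.allclose(f_manual, clf.decision_function(X_new)))  # True
```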

SLIDE 21

Quadratic Kernel

Quadratic basis

  Φ(x) = ( [xi]_i , [xi²]_i , [√2 xi xj]_{i<j} )

Dot product

  Φ(x)⊤Φ(v) = Σ_i xi vi + Σ_i xi² vi² + Σ_{i<j} 2 xi vi xj vj

– Are there d(d + 3)/2 terms to add?

SLIDE 22

Quadratic Kernel

Quadratic basis

  Φ(x) = ( [xi]_i , [xi²]_i , [√2 xi xj]_{i<j} )

Dot product

  Φ(x)⊤Φ(v) = Σ_i xi vi + Σ_i xi² vi² + Σ_{i<j} 2 xi vi xj vj
            = Σ_i xi vi + Σ_{i,j} xi vi xj vj
            = Σ_i xi vi + (Σ_i xi vi)²
            = (x⊤v) + (x⊤v)²

– There are only d terms to add!
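A quick NumPy check of that identity (random vectors, d = 5; the explicit quadratic basis on the left against the shortcut on the right):

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
x, v = rng.normal(size=5), rng.normal(size=5)

def phi(u):
    # The quadratic basis: [u_i], [u_i^2], [sqrt(2) u_i u_j for i < j]
    cross = [np.sqrt(2) * u[i] * u[j] for i, j in combinations(range(len(u)), 2)]
    return np.concatenate([u, u ** 2, cross])

lhs = phi(x) @ phi(v)
rhs = x @ v + (x @ v) ** 2
print(np.isclose(lhs, rhs))   # True: only d products needed on the right-hand side
```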

SLIDE 23

Polynomial kernel

  Degree   Dim(Φ(x))   Φ(x)⊤Φ(v)
  1        d           (x⊤v)
  2        ≈ d²/2      (x⊤v) + (x⊤v)²
  3        ≈ d³/6      (x⊤v) + (x⊤v)² + (x⊤v)³
  . . .
  n        ≈ dⁿ/n!     (1 + x⊤v)ⁿ

The number of parameters increases very quickly, but the total computation remains nearly constant.

SLIDE 24

Linear

SLIDE 25

Quadratic

SLIDE 26

Polynomial degree 3

SLIDE 27

Polynomial degree 5

SLIDE 28

Polynomial kernels and more

Weighted polynomial kernel:   K_d(x, v) = Σ_{i=0}^d (γⁱ/i!) (x⊤v)ⁱ

– This is a polynomial kernel.
– The coefficient γ controls the relative importance of terms of various degrees.

SLIDE 29

Polynomial kernels and more

Weighted polynomial kernel:   K_d(x, v) = Σ_{i=0}^d (γⁱ/i!) (x⊤v)ⁱ

– This is a polynomial kernel.
– The coefficient γ controls the relative importance of terms of various degrees.

Exponential kernel:   K_∞(x, v) = Σ_{i=0}^∞ (γⁱ/i!) (x⊤v)ⁱ = e^{γ x⊤v}

– This is no longer a polynomial kernel.
– The dimension of Φ(x) is infinite.
– The computation remains finite.

SLIDE 30

Radial Basis Function kernel

Radial Basis Functions
– Approximating functions with expressions of the form

  f_w(x) = Σ_i wi F(‖x − xi‖)

– Gaussian kernel:  F(r) = e^{−γr²}

Radial Basis Kernel
– Running an SVM with the kernel K(x, v) = e^{−γ‖x−v‖²} results in the discriminant function

  f_w(x) = Σ_i yi αi e^{−γ‖x−xi‖²}

Questions
– Is there a function Φ that corresponds to this kernel?
– Does this work?
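The next slides answer the second question by sweeping γ. A minimal scikit-learn sketch of the same sweep (toy two-moons data; the C value is illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

for gamma in (0.1, 1.0, 10.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma:<6}  train acc={clf.score(X, y):.2f}  #SV={len(clf.support_)}")
# Small gamma gives a smooth, nearly linear boundary; very large gamma turns every
# example into its own bump and the classifier memorizes the training set.
```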

SLIDE 31

Radial Basis (gamma = 0.1)

SLIDE 32

Radial Basis (gamma = 1)

SLIDE 33

Radial Basis (gamma = 10)

SLIDE 34

Radial Basis (gamma = 100)

SLIDE 35

Radial Basis (gamma = 100)

SLIDE 36

Radial Basis (gamma = 100)

SLIDE 37

Mercer kernel

Definition
– A kernel K(x, v) is a Mercer kernel iff it is
  1. symmetric:  ∀x, v   K(x, v) = K(v, x)
  2. positive:   ∀k, ∀x1 . . . xk, ∀c1 . . . ck   Σ_{i,j=1}^k ci cj K(xi, xj) ≥ 0

Mercer theorem
– For any Mercer kernel K(x, v) there exists a vector space Ω and a function Φ : x → Φ(x) ∈ Ω such that K(x, v) = Φ(x)⊤Φ(v).

Practical consequences
– We can create models by specifying basis functions Φ(x).
– We can also create models by specifying kernels K(x, v).
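The positivity condition can be spot-checked numerically: on any finite set of points, the Gram matrix of a Mercer kernel must be symmetric positive semi-definite. A sketch with the Gaussian kernel (random points and γ are illustrative; the small tolerance absorbs round-off):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

K = rbf_kernel(X, X, gamma=0.7)               # K_ij = exp(-gamma ||x_i - x_j||^2)
print(np.allclose(K, K.T))                    # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-9)   # eigenvalues >= 0 up to round-off
```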

SLIDE 38

Usual and customary kernels

  Kernel         K(x, v)              Decision boundary   Dim(Φ-space)
  linear         x⊤v                  hyperplanes         n
  quadratic      x⊤v + (x⊤v)²         conics              n(n+3)/2
  d-polynomial   (1 + x⊤v)^d          ?                   ≈ nᵈ/d!
  gaussian       exp(−γ‖x − v‖²)      smooth              ∞

SLIDE 39

More kernels

  Kernel                  K(x, v)
  spline                  1 + x⊤v + Σ_{j=1}^d ∫_{−R}^{R} [xj − t]+ [vj − t]+ dt
  multilayer perceptron   tanh(α x⊤v − β)
  sum                     Σ_j λj Kj(x, v),  λj ≥ 0
  tensor product          Π_j Kj(xj, vj)

SLIDE 40

Exotic kernels (1)

The input space need not be a vector space.

Kernels defined on histograms and probability density functions:

  Kernel      K(x, v)
  Kullback    exp(−β (D(x‖v) + D(v‖x)))
  Jensen      exp(−β (D(x ‖ (x+v)/2) + D(v ‖ (x+v)/2)))
  Hellinger   exp(−β ∫ (√x(t) − √v(t))² dt)

SLIDE 41

Exotic kernels (2)

The input space need not be a vector space.

Kernels defined on sequences:

  Kernel     K(x, v)
  Fisher     (∂ log L/∂λ (x))⊤ (∂ log L/∂λ (v)),  where L(·) is the likelihood of an H.M.M.
  string     the number of common substrings of length d
  rational   defined by certain finite state automata

SLIDE 42

Kernels everywhere

Kernel Principal Component Analysis: compute principal subspaces in feature space.

– Eigenvectors in Φ-space are defined as

  E_p = Σ_i α_{i,p} Φ(xi)

– One cannot in general find pre-images e_k such that E_k = Φ(e_k),
  but one can still extract the components in the principal subspace:

  s_k(x) = Σ_i α_{i,k} K(x, xi)

– Related to Isomap, LLE, and Spectral Clustering.
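A minimal scikit-learn sketch of this (the concentric-circles data and the RBF γ are illustrative): KernelPCA never builds Φ(x), it only works with the α_{i,k} expansion above.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit(X)
S = kpca.transform(X)     # s_k(x) = sum_i alpha_{i,k} K(x, x_i)
print(S.shape)            # (300, 2): coordinates in the principal subspace
```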

SLIDE 43

Kernels everywhere

One Class Support Vector Machines: locate the support of the data distribution.

Assume ‖xi‖ = 1. Minimize ‖w‖² subject to ∀i, w⊤xi ≥ 1. Best done in Φ-space, of course.

Example: novelty detection.
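A hedged scikit-learn sketch of novelty detection with a one-class SVM (the ν and γ values and the toy data are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))              # "normal" data only
X_test = np.array([[0.1, -0.2], [6.0, 6.0]])     # one inlier, one obvious outlier

oc = OneClassSVM(kernel="rbf", gamma=0.2, nu=0.05).fit(X_train)
print(oc.predict(X_test))   # [ 1 -1 ]: +1 = consistent with training data, -1 = novelty
```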

SLIDE 44

Kernels everywhere

More kernel algorithms. Kernelizing a standard algorithm was in fashion:

  SVR      Support Vector Regression
  KLDA     Kernel Linear Discriminant Analysis
  LS-SVM   Least Squares Support Vector Machine
  KLR      Kernel Logistic Regression
  . . .

and led to the rediscovery of old algorithms:

  Aizerman-Braverman Potential Functions   ~   Kernel Perceptron, Kernel Adatron, etc.
  Gaussian Processes                       ~   Kriging

SLIDE 45

Conclusion

Soft-margin SVM:
– a classifier using the hinge loss,
– with a kernel representation,
– and capacity control using regularization.

Obvious variants:
– change the loss,
– change the representation,
– change the regularizer. . .

SLIDE 46

Outlook

Success stories
– Text categorization.
– Classification tasks in general: the best classifier can change a lot, but the SVM is rarely far away.

Weak points
– Computationally costly with noisy data.
– L2 regularization works poorly when irrelevant inputs abound.
