
Kernel Machines

Steven J Zeil

Old Dominion Univ.

Fall 2010

Outline

1. Support Vector Machines
   • Optimal Separating HyperPlanes
   • Soft Margin HyperPlanes
2. Kernel Machines
   • Kernel Functions
   • Multi-Classes
   • Regression
   • Outlier Detection
   • Dimensionality Reduction


Seeking Separation

Recall the earlier discussion: the most specific hypothesis S is the tightest rectangle enclosing the positive examples.

prone to false negatives

The most general hypothesis G is the largest rectangle enclosing the positive examples but containing no negative examples.

prone to false positives

Perhaps we should choose something in between?


Support Vector Machines

  • Non-parametric, discriminant-based
  • Defines the discriminant as a combination of support vectors
  • A convex optimization problem with a unique solution


Separating HyperPlanes

In our earlier linear discrimination techniques, we sought any separating hyperplane.

  • Some of the points could come arbitrarily close to the border.
  • Distance from the plane was a measure of confidence.
  • If the classes were not linearly separable, too bad.


Support Vector Machines: Margins

SVMs seek the plane that maximizes the margin between the plane and the closest instances of each class.

  • After training, we do not insist on the margin,
  • but distance from the plane is still an indication of confidence.
  • With minor modification, this can be extended to classes that are not linearly separable.


Defining the Margin

$X = \{x^t, r^t\}$ where $r^t = +1$ if $x^t \in C_1$ and $r^t = -1$ if $x^t \in C_2$.

Find $w$ and $w_0$ such that

  • $w^T x^t + w_0 \ge +1$ for $r^t = +1$
  • $w^T x^t + w_0 \le -1$ for $r^t = -1$
  • or, equivalently, $r^t(w^T x^t + w_0) \ge +1$


Maximizing the Margin

The margin is the distance from the discriminant to the closest instances on either side.

Distance of $x^t$ to the hyperplane: $\dfrac{|w^T x^t + w_0|}{\|w\|}$

Let $\rho$ denote the margin size: $\forall t,\ \dfrac{|w^T x^t + w_0|}{\|w\|} \ge \rho$

If we try to maximize $\rho$ directly, there are infinitely many solutions formed by simply rescaling $w$.

Fix $\rho \|w\| = 1$, and minimize $\|w\|$ to maximize $\rho$:

Minimize $\frac{1}{2}\|w\|^2$ subject to $\forall t,\ r^t(w^T x^t + w_0) \ge +1$
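As a quick check, the distance and constraint formulas above can be computed directly; this is a minimal numpy sketch in which the hyperplane ($w$, $w_0$) and the data points are made-up values for illustration only:

```python
import numpy as np

# Illustrative hyperplane g(x) = w^T x + w0 (made-up values)
w = np.array([2.0, 1.0])
w0 = -1.0

X = np.array([[1.0, 2.0], [0.0, 0.5], [2.0, -1.0]])  # inputs x^t
r = np.array([+1, -1, +1])                            # labels r^t

# Distance of each x^t to the hyperplane: |w^T x^t + w0| / ||w||
dist = np.abs(X @ w + w0) / np.linalg.norm(w)
print(dist, dist.min())        # the margin rho is the smallest distance

# The rescaled constraint r^t (w^T x^t + w0) >= +1, point by point
print(r * (X @ w + w0) >= 1)   # False marks a margin violation
```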


Margin

Circled inputs are the ones that determine the border

After training, we could forget about the others. In fact, we might be able to eliminate some of the others before training.


Derivation (1/3)

Minimize $\frac{1}{2}\|w\|^2$ subject to $\forall t,\ r^t(w^T x^t + w_0) \ge +1$.

$$L_p = \frac{1}{2}\|w\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t(w^T x^t + w_0) - 1 \right] = \frac{1}{2}\|w\|^2 - \sum_{t=1}^{N} \alpha^t r^t(w^T x^t + w_0) + \sum_{t=1}^{N} \alpha^t$$

$$\frac{\partial L_p}{\partial w} = 0 \Rightarrow w = \sum_{t=1}^{N} \alpha^t r^t x^t$$


Derivation (2/3)

$$\frac{\partial L_p}{\partial w} = 0 \Rightarrow w = \sum_{t=1}^{N} \alpha^t r^t x^t$$

$$\frac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_{t=1}^{N} \alpha^t r^t = 0$$

Minimizing $L_p$ is equivalent to maximizing the dual

$$L_d = \frac{1}{2} w^T w - w^T \sum_t \alpha^t r^t x^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t$$

subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0$.


Derivation (3/3)

$$L_d = \frac{1}{2} w^T w - w^T \sum_t \alpha^t r^t x^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t = -\frac{1}{2} w^T w + \sum_t \alpha^t = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$$

subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0$.

Solve numerically. Most $\alpha^t$ are 0. The small number with $\alpha^t > 0$ are the support vectors:

$$w = \sum_{t=1}^{N} \alpha^t r^t x^t$$
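The dual is a quadratic program that off-the-shelf solvers handle. A minimal numerical sketch, assuming scipy is available and using its generic SLSQP solver rather than a dedicated QP or SMO solver (the tiny separable data set is made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny separable 2-D data set (made-up values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
r = np.array([+1.0, +1.0, -1.0, -1.0])
N = len(r)

# Matrix of r^t r^s (x^t)^T x^s terms from the dual
Q = (r[:, None] * r[None, :]) * (X @ X.T)

# Maximize L_d  <=>  minimize -L_d = 1/2 a^T Q a - sum(a)
def neg_dual(a):
    return 0.5 * a @ Q @ a - a.sum()

cons = {'type': 'eq', 'fun': lambda a: a @ r}  # sum_t alpha^t r^t = 0
bnds = [(0, None)] * N                          # alpha^t >= 0

alpha = minimize(neg_dual, np.zeros(N), bounds=bnds, constraints=cons).x

sv = alpha > 1e-6                # support vectors: alpha^t > 0
w = (alpha * r) @ X              # w = sum_t alpha^t r^t x^t
w0 = np.mean(r[sv] - X[sv] @ w)  # average over the support vectors
print(alpha.round(3), w, w0)
```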


Support Vectors

Circled inputs are the support vectors:

$$w = \sum_{t=1}^{N} \alpha^t r^t x^t$$

Compute $w_0$ as the average, over the support vectors, of $w_0 = r^t - w^T x^t$.


Demo

Applet on http://www.csie.ntu.edu.tw/~cjlin/libsvm/


Soft Margin HyperPlanes

Suppose the classes are almost, but not quite, linearly separable:

$$r^t(w^T x^t + w_0) \ge 1 - \xi^t$$

The $\xi^t \ge 0$ are slack variables storing the deviation from the margin:

  • $\xi^t = 0$ means $x^t$ is more than 1 away from the hyperplane
  • $0 < \xi^t < 1$ means $x^t$ is within the margin, but correctly classified
  • $\xi^t \ge 1$ means $x^t$ is misclassified

$\sum_t \xi^t$ is a measure of error. Add it as a penalty term:

$$L_p = \frac{1}{2}\|w\|^2 + C \sum_t \xi^t$$

$C$ is a penalty factor that trades off complexity against data misfit.


Soft Margin Derivation

$$L_p = \frac{1}{2}\|w\|^2 + C \sum_t \xi^t$$

$C$ is a penalty factor that trades off complexity against data misfit.

This leads to the same numerical optimization problem with the new constraint $0 \le \alpha^t \le C$.
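A sketch of the complexity/misfit trade-off in practice, using scikit-learn's SVC (which wraps LIBSVM, the library whose applet is demoed on these slides); the overlapping toy data are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0], [1.5, 1.5]])
r = np.array([+1, +1, -1, -1, -1])   # the last point makes the classes overlap

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, r)
    # Larger C penalizes slack more heavily: fewer margin violations,
    # but a more complex fit (smaller margin)
    print(C, clf.support_.size, clf.coef_, clf.intercept_)
```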


Soft Margin Example

Applet on http://www.csie.ntu.edu.tw/~cjlin/libsvm/


Hinge Loss

$$L_{\text{hinge}}(y^t, r^t) = \begin{cases} 0 & \text{if } y^t r^t \ge 1 \\ 1 - y^t r^t & \text{otherwise} \end{cases}$$

Its behavior in $0..1$ makes it more robust than 0/1 and squared error. It is close to cross-entropy over much of its range.
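A small numpy comparison of the three losses over a range of $y^t r^t$ values (illustrative only):

```python
import numpy as np

yr = np.linspace(-2, 2, 9)            # y^t r^t values

hinge = np.maximum(0.0, 1.0 - yr)     # 0 once y*r >= 1, else 1 - y*r
zero_one = (yr < 0).astype(float)     # 0/1 misclassification loss
squared = (1.0 - yr) ** 2             # squared error against target 1

for row in zip(yr, hinge, zero_one, squared):
    print('%6.2f  %6.2f  %4.1f  %6.2f' % row)
```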


Non-Linear SVM

Replace the inputs $x$ by a sequence of basis functions $z = \phi(x)$.

Linear SVM:

$$w = \sum_t \alpha^t r^t x^t \qquad g(x) = w^T x = \sum_t \alpha^t r^t (x^t)^T x$$

Kernel SVM:

$$w = \sum_t \alpha^t r^t z^t = \sum_t \alpha^t r^t \phi(x^t) \qquad g(x) = w^T \phi(x) = \sum_t \alpha^t r^t \phi(x^t)^T \phi(x)$$


The Kernel

$$g(x) = \sum_t \alpha^t r^t (x^t)^T x$$

$(x^t)^T x$ is a measure of similarity between $x$ and a support vector.

$$g(x) = \sum_t \alpha^t r^t \phi(x^t)^T \phi(x)$$

$\phi(x^t)^T \phi(x)$ can be seen as a similarity measure in the non-linear basis space.

To generalize, let $K(x^t, x) = \phi(x^t)^T \phi(x)$:

$$g(x) = \sum_t \alpha^t r^t K(x^t, x)$$

$K$ is a kernel function.
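This expansion can be checked against a fitted model: in scikit-learn's SVC, `dual_coef_` stores the products $\alpha^t r^t$ for the support vectors, so $g(x)$ can be rebuilt by hand (a sketch with made-up data; the RBF kernel below matches the fitted one):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
r = np.array([-1, -1, +1, +1])
clf = SVC(kernel='rbf', gamma=0.5).fit(X, r)

def K(xt, x, gamma=0.5):
    return np.exp(-gamma * np.sum((xt - x) ** 2))

x_new = np.array([1.5, 0.5])
# g(x) = sum_t alpha^t r^t K(x^t, x) + w0, summed over support vectors
g = clf.intercept_[0] + sum(
    a * K(xt, x_new) for a, xt in zip(clf.dual_coef_[0], clf.support_vectors_))
print(g, clf.decision_function([x_new]))  # the two should agree
```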


Polynomial Kernels

$$K_q(x^t, x) = \left( (x^t)^T x + 1 \right)^q$$

E.g.,

$$K(x, y) = (x^T y + 1)^2 = (x_1 y_1 + x_2 y_2 + 1)^2 = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$

$$\phi(x) = \left[ 1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2 \right]$$

(FWIW)
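A quick numerical confirmation (made-up vectors) that this kernel equals the inner product in the expanded basis:

```python
import numpy as np

def K(x, y):
    # Degree-2 polynomial kernel (x^T y + 1)^2
    return (x @ y + 1) ** 2

def phi(x):
    # The basis expansion listed above
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     np.sqrt(2)*x1*x2, x1**2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(K(x, y), phi(x) @ phi(y))   # both print 4.0
```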


Radial-basis Kernels

$$K(x^t, x) = \exp\left( -\frac{\|x^t - x\|^2}{2 s^2} \right)$$

Other options include sigmoidal (approximated as tanh).
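In code the radial-basis kernel is a one-liner (numpy sketch; the spread parameter name `s` follows the formula above):

```python
import numpy as np

def rbf(xt, x, s=1.0):
    # K(x^t, x) = exp(-||x^t - x||^2 / (2 s^2))
    return np.exp(-np.sum((xt - x) ** 2) / (2 * s ** 2))

print(rbf(np.array([0.0, 0.0]), np.array([1.0, 1.0])))  # exp(-1)
```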


Selecting Kernels

Kernels can be customized to the application. Choose appropriate measures of similarity:

  • Bag of words: normalized cosines between vocabulary vectors
  • Genetics: edit distance between strings
  • Graphs: length of the shortest path between nodes, or number of connecting paths

For input sets with very large dimension, it may be cheaper to pre-compute and save the matrix of kernel values (the Gram matrix) rather than keeping all the inputs available.
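Scikit-learn supports this route directly via `kernel='precomputed'`; a sketch with a made-up one-dimensional data set and the degree-2 polynomial kernel:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [1.0], [2.0], [3.0]])
r = np.array([-1, -1, +1, +1])

gram = (X @ X.T + 1) ** 2              # Gram matrix over the training inputs
clf = SVC(kernel='precomputed').fit(gram, r)

X_new = np.array([[2.5]])
gram_new = (X_new @ X.T + 1) ** 2      # kernel values against training inputs
print(clf.predict(gram_new))
```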


Multi-Classes

1 versus all

  • K separate N-variable problems (see the sketch after this slide)

Pairwise separation

  • K(K−1)/2 separate N-variable problems

Single multiclass optimization:

Minimize
$$\frac{1}{2} \sum_{i=1}^{K} \|w_i\|^2 + C \sum_i \sum_t \xi_i^t$$
subject to
$$w_{z^t}^T x^t + w_{z^t 0} \ge w_i^T x^t + w_{i0} + 2 - \xi_i^t, \quad \forall i \ne z^t$$
where $z^t$ is the index of the class of $x^t$.

  • One $K \cdot N$-variable problem
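A sketch of the 1-versus-all route using scikit-learn's wrapper (SVC on its own trains pairwise classifiers internally; the three-class toy data are made up):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
z = np.array([0, 0, 1, 1, 2, 2])   # class index z^t for each x^t

ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, z)
print(len(ovr.estimators_))        # K = 3 separate problems
print(ovr.predict([[5.0, 5.5], [9.0, 0.5]]))
```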


Regression

Linear regression fits $f(x) = w^T x + w_0$. Instead of using the usual error measure $e_2(r^t, f(x^t)) = [r^t - f(x^t)]^2$, we use a linear, $\epsilon$-sensitive error: $e_\epsilon(r^t, f(x^t)) = \max(0, |r^t - f(x^t)| - \epsilon)$. Errors of less than $\epsilon$ are tolerated, and larger errors have only a linear effect, making the fit more tolerant of noise.
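A sketch of $\epsilon$-sensitive regression via scikit-learn's SVR (made-up noisy 1-D data; `epsilon` sets the tolerated tube around the fit):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 20).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.1, 20)   # noisy line

svr = SVR(kernel='linear', epsilon=0.2, C=10.0).fit(X, y)
# Only points outside the epsilon-tube become support vectors
print(svr.support_.size, svr.predict([[2.5]]))
```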


Linear Regression

$$f(x) = w^T x + w_0 = \sum_t (\alpha_+^t - \alpha_-^t)(x^t)^T x + w_0$$

The function is a combination of a limited set of support vectors.


Kernel Regression

Again, we can replace the inputs by a basis function, eventually leading to a kernel function as a similarity measure. (Figure: polynomial kernel regression.)


Gaussian Kernel Regression


One-Class Kernel Machines

Consider a sphere with center $a$ and radius $R$.

Minimize $R^2 + C \sum_t \xi^t$ subject to

$$\|x^t - a\|^2 \le R^2 + \xi^t$$

An input $x$ is an outlier if it lies outside the sphere.
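A sketch of one-class outlier detection with scikit-learn's OneClassSVM; note that it uses a $\nu$ parameterization rather than the $C$ above, and the data are made up:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (100, 2))        # inliers clustered around the origin

oc = OneClassSVM(kernel='rbf', nu=0.05).fit(X)
# Expect +1 (inside) for the first point, -1 (outlier) for the second
print(oc.predict([[0.1, -0.2], [6.0, 6.0]]))
```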


Gaussian Kernel One-Class


Dimensionality Reduction

Kernel PCA does PCA on the kernel matrix $\phi^T \phi$ instead of on the direct inputs.

For high-dimensional input spaces, we can work on an $N \times N$ problem instead of $D \times D$.
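A sketch with scikit-learn's KernelPCA (made-up data with N=50 points in D=10 dimensions); the eigenproblem is solved on the N×N kernel matrix rather than the D×D covariance:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (50, 10))        # N=50 points, D=10 dimensions

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.1)
Z = kpca.fit_transform(X)
print(Z.shape)                         # (50, 2)
```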
