

slide-1
SLIDE 1

SVM Kernels

COMPSCI 371D — Machine Learning

COMPSCI 371D — Machine Learning SVM Kernels 1 / 27

slide-2
SLIDE 2

Outline

1 Linear Separability and Feature Augmentation
2 Sample Complexity
3 Computational Complexity
4 Kernels and Nonlinear SVMs
5 Mercer’s Conditions
6 Gaussian Kernels and Support Vectors

COMPSCI 371D — Machine Learning SVM Kernels 2 / 27

slide-3
SLIDE 3

Linear Separability and Feature Augmentation

Data Representations

  • Linear separability is a property of the data in a given representation
  • A set that is not linearly separable. Boundary: $x_2 = x_1^2$

COMPSCI 371D — Machine Learning SVM Kernels 3 / 27

slide-4
SLIDE 4

Linear Separability and Feature Augmentation

Feature Transformations

  • $x = (x_1, x_2) \rightarrow z = (z_1, z_2) = (x_1^2, x_2)$
  • Now it is! Boundary: $z_2 = z_1$

COMPSCI 371D — Machine Learning SVM Kernels 4 / 27

slide-5
SLIDE 5

Linear Separability and Feature Augmentation

Feature Augmentation

  • Feature transformation: $x = (x_1, x_2) \rightarrow z = (z_1, z_2) = (x_1^2, x_2)$
  • Problem: We don’t know the boundary!
  • We cannot guess the correct transformation
  • Feature augmentation: $x = (x_1, x_2) \rightarrow z = (z_1, z_2, z_3) = (x_1, x_2, x_1^2)$
  • Why is this better?
  • Add many features in the hope that some combination will help

COMPSCI 371D — Machine Learning SVM Kernels 5 / 27

slide-6
SLIDE 6

Linear Separability and Feature Augmentation

Not Really Just a Hope!

  • Add all monomials of x1, x2 up to some degree k
  • Example: k = 3 ⇒ $d' = \binom{d+k}{d} = \binom{2+3}{2} = 10$ monomials:
    $z = (1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3)$
    (see the enumeration sketch after this slide)
  • From Taylor’s theorem, we know that with k high enough we can approximate any hypersurface by a linear combination of the features in z
  • Issue 1: Sample complexity: More dimensions, more training data (remember the curse)
  • Issue 2: Computational complexity: More features, more work
  • With SVMs, we can address both issues

COMPSCI 371D — Machine Learning SVM Kernels 6 / 27
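Not on the slides: a minimal Python sketch that enumerates the monomials of (x1, x2) up to degree k = 3 and confirms the count $\binom{d+k}{d} = 10$; the helper name monomial_features is made up for illustration.

```python
# Enumerate all monomials of the entries of x with degree <= k (constant 1 included)
# and check the count against the binomial coefficient C(d + k, d).
from itertools import combinations_with_replacement
from math import comb

def monomial_features(x, k):
    d = len(x)
    feats = []
    for degree in range(k + 1):
        # each size-`degree` multiset of coordinate indices is one monomial
        for idx in combinations_with_replacement(range(d), degree):
            val = 1.0
            for i in idx:
                val *= x[i]
            feats.append(val)
    return feats

x = [2.0, 3.0]                      # x = (x1, x2)
z = monomial_features(x, k=3)
print(len(z), comb(2 + 3, 2))       # both print 10
```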

slide-7
SLIDE 7

Sample Complexity

A Detour into Sample Complexity

  • The more training samples we have, the better we generalize
  • With a larger N, the set T represents the model p(x, y) better
  • How to formalize this notion?
  • Introduce a number ε that measures how far from optimal a classifier is
  • The smaller we want ε to be, the bigger N needs to be
  • Easier to think about: the bigger 1/ε (“exactitude”), the bigger N
  • The rate of growth of N(1/ε) is the sample complexity, more or less
  • Removing “more or less” requires care

COMPSCI 371D — Machine Learning SVM Kernels 7 / 27

slide-8
SLIDE 8

Sample Complexity

Various Risks Involved

  • We train a classifier on set T by picking the best h ∈ H: $\hat{h} = \mathrm{ERM}_T(H) \in \arg\min_{h \in H} L_T(h)$
  • Empirical risk actually achieved by $\hat{h}$: $L_T(\hat{h}) = L_T(H) = \min_{h \in H} L_T(h)$
  • When we deploy $\hat{h}$ we want its statistical risk $L_p(\hat{h}) = E_p[\ell(y, \hat{h}(x))]$ to be small. We can get some idea of $L_p(\hat{h})$ by testing $\hat{h}$
  • Typically, $L_p(\hat{h}) > L_T(\hat{h})$
  • More importantly: How small can $L_p(\hat{h})$ conceivably be?
  • $L_p(\hat{h})$ is typically bigger than $L_p(H) = \min_{h \in H} L_p(h)$

COMPSCI 371D — Machine Learning SVM Kernels 8 / 27

slide-9
SLIDE 9

Sample Complexity

Risk Summary

  • Empirical training risk $L_T(\hat{h})$ is just a means to an end
  • That’s what we minimize for training. Ignore that
  • Statistical risk achieved by $\hat{h}$: $L_p(\hat{h})$
  • Smallest statistical risk over all h ∈ H: $L_p(H) = \min_{h \in H} L_p(h)$
  • Obviously $L_p(\hat{h}) \geq L_p(H)$ (by definition of the latter)
  • Typically, $L_p(\hat{h}) > L_p(H)$. Why?
  • Because T is a poor proxy for p(x, y)
  • Also, often $L_p(H) > 0$. Why?
  • Because H may not contain a perfect h
  • Example: Linear classifier for a problem that is not linearly separable

COMPSCI 371D — Machine Learning SVM Kernels 9 / 27

slide-10
SLIDE 10

Sample Complexity

Sample Complexity

  • Typically, $L_p(\hat{h}) > L_p(H) \geq 0$
  • Best we can do is $L_p(\hat{h}) = L_p(H) + \epsilon$ with small ε > 0
  • High performance (large 1/ε) requires lots of data (large N)
  • Sample complexity measures how fast N needs to grow as 1/ε grows
  • It is the rate of growth of N(1/ε)
  • Problem: T is random, so even a huge N might give poor performance once in a while if we have bad luck (a “statistical fluke”)
  • We cannot guarantee that a large N yields a small ε
  • We can guarantee that this happens with high probability

COMPSCI 371D — Machine Learning SVM Kernels 10 / 27

slide-11
SLIDE 11

Sample Complexity

Sample Complexity, Cont’d

  • We can only give a probabilistic guarantee:
  • Given a probability 0 < δ < 1 (think of this as “small”), we can guarantee that if N is large enough, then the probability that $L_p(\hat{h}) \geq L_p(H) + \epsilon$ is less than δ: $P[L_p(\hat{h}) \geq L_p(H) + \epsilon] \leq \delta$
  • The sample complexity for hypothesis space H is the function $N_H(\epsilon, \delta)$ that gives the smallest N for which this bound holds, regardless of the model p(x, y)
  • Tall order: Typically, we can only give asymptotic bounds for $N_H(\epsilon, \delta)$

COMPSCI 371D — Machine Learning SVM Kernels 11 / 27

slide-12
SLIDE 12

Sample Complexity

Sample Complexity for Linear Classifiers and SVMs

  • For a binary linear classifier, the sample complexity is $\Omega\left(\frac{d + \log(1/\delta)}{\epsilon}\right)$
  • Grows linearly with d, the dimensionality of X, and with 1/ε (a worked plug-in follows this slide)
  • Not too bad; this is why linear classifiers are so successful
  • SVMs with bounded data space X do even better
  • “Bounded”: Contained in a hypersphere of finite radius
  • For SVMs with bounded X, the sample complexity is independent of d. No curse!
  • We can augment features to our heart’s content

COMPSCI 371D — Machine Learning SVM Kernels 12 / 27
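Purely illustrative arithmetic (the numbers below are chosen here, not taken from the slides, and the constant hidden inside the Ω is ignored): for d = 100, δ = 0.01, and ε = 0.01,

$$\frac{d + \log(1/\delta)}{\epsilon} = \frac{100 + \log 100}{0.01} \approx 1.05 \times 10^{4},$$

so doubling d roughly doubles the data requirement, while for an SVM on a bounded X the d term drops out entirely.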

slide-13
SLIDE 13

Computational Complexity

What About Computational Complexity?

  • Remember our plan: Go from $x = (x_1, x_2)$ to
    $z = (1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3)$
    in order to make the data separable
  • Can we do this without paying the computational cost?
  • Yes, with SVMs

COMPSCI 371D — Machine Learning SVM Kernels 13 / 27

slide-14
SLIDE 14

Computational Complexity

SVMs and the Representer Theorem

  • Recall the formulation of SVM training: Minimize
    $f(w, \xi) = \frac{1}{2}\|w\|^2 + \gamma \sum_{n=1}^{N} \xi_n$
    with constraints $y_n(w^T x_n + b) - 1 + \xi_n \geq 0$ and $\xi_n \geq 0$
  • Representer theorem:
    $w = \sum_{n \in A(w,b)} \alpha_n y_n x_n$
    $\|w\|^2 = w^T w = \sum_{m \in A(w,b)} \sum_{n \in A(w,b)} \alpha_m \alpha_n y_m y_n x_m^T x_n$

COMPSCI 371D — Machine Learning SVM Kernels 14 / 27
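A small sketch with toy numbers assumed here (not course code): it evaluates the objective $f(w, \xi)$ above and checks the two constraints for a hand-picked (w, b).

```python
# Evaluate f(w, xi) = 1/2 ||w||^2 + gamma * sum_n xi_n and check the SVM constraints
# y_n (w^T x_n + b) - 1 + xi_n >= 0 and xi_n >= 0 on a tiny toy data set.
import numpy as np

def svm_objective(w, xi, gamma):
    return 0.5 * np.dot(w, w) + gamma * np.sum(xi)

def constraints_satisfied(w, b, xi, X, y):
    margins = y * (X @ w + b) - 1.0 + xi
    return bool(np.all(margins >= 0) and np.all(xi >= 0))

X = np.array([[1.0, 2.0], [-1.0, -1.5], [0.5, -0.5]])
y = np.array([1.0, -1.0, -1.0])
w, b = np.array([0.8, 0.6]), -0.1
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # smallest slacks that satisfy the constraints
print(svm_objective(w, xi, gamma=1.0))        # 1.5
print(constraints_satisfied(w, b, xi, X, y))  # True
```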

slide-15
SLIDE 15

Kernels and Nonlinear SVMs

Using the Representer Theorem

  • Representer theorem: $w = \sum_{n \in A(w,b)} \alpha_n y_n x_n$
  • In the constraint $y_n(w^T x_n + b) - 1 + \xi_n \geq 0$ we have
    $w^T x_n = \sum_{m \in A(w,b)} \alpha_m y_m x_m^T x_n$
  • Summary: x appears in an inner product, never alone:
    $\min_{w,b,\xi} \; \frac{1}{2} \sum_{m \in A(u)} \sum_{n \in A(u)} \alpha_m \alpha_n y_m y_n x_m^T x_n + C \sum_{n=1}^{N} \xi_n$
    subject to the constraints
    $y_n \left( \sum_{m \in A(u)} \alpha_m y_m x_m^T x_n + b \right) - 1 + \xi_n \geq 0, \quad \xi_n \geq 0$
COMPSCI 371D — Machine Learning SVM Kernels 15 / 27

slide-16
SLIDE 16

Kernels and Nonlinear SVMs

The Kernel

  • Augment $x \in \mathbb{R}^d$ to $\varphi(x) \in \mathbb{R}^{d'}$, with $d' \gg d$ (typically)
    $\min_{w,b,\xi} \; \frac{1}{2} \sum_{m \in A(u)} \sum_{n \in A(u)} \alpha_m \alpha_n y_m y_n \varphi(x_m)^T \varphi(x_n) + C \sum_{n=1}^{N} \xi_n$
    subject to the constraints
    $y_n \left( \sum_{m \in A(u)} \alpha_m y_m \varphi(x_m)^T \varphi(x_n) + b \right) - 1 + \xi_n \geq 0, \quad \xi_n \geq 0$
  • The value $K(x_m, x_n) \stackrel{\text{def}}{=} \varphi(x_m)^T \varphi(x_n)$ is a number
  • The optimization algorithm needs to know only $K(x_m, x_n)$, not $\varphi(x_n)$. K is called a kernel

COMPSCI 371D — Machine Learning SVM Kernels 16 / 27
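A minimal sketch (not the course's optimizer) of the point just made: once the Gram matrix $K[m, n] = K(x_m, x_n)$ is available, quantities such as the regularizer $\|w\|^2$ can be computed without ever forming ϕ(x). The function names here are assumptions for illustration.

```python
# The training problem only ever touches the Gram matrix K[m, n] = K(x_m, x_n),
# never the augmented vectors phi(x) themselves.
import numpy as np

def gram_matrix(X, kernel):
    N = X.shape[0]
    K = np.empty((N, N))
    for m in range(N):
        for n in range(N):
            K[m, n] = kernel(X[m], X[n])
    return K

def poly3_kernel(x, z):
    return (x @ z + 1.0) ** 3       # equals phi(x)^T phi(z) for the cubic monomial map

def regularizer(alpha, y, K, active):
    # ||w||^2 = sum over m, n in A of alpha_m alpha_n y_m y_n K(x_m, x_n)
    a = alpha[active] * y[active]
    return a @ K[np.ix_(active, active)] @ a

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
K = gram_matrix(X, poly3_kernel)
print(regularizer(np.ones(3), y, K, active=[0, 1, 2]))
```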

slide-17
SLIDE 17

Kernels and Nonlinear SVMs

Decision Rule

  • Same holds for the decision rule:
    $\hat{y} = h(x) = \mathrm{sign}(w^T x + b)$ becomes
    $\hat{y} = h(x) = \mathrm{sign}\left( \sum_{m \in A(w,b)} \alpha_m y_m x_m^T x + b \right)$
    because of the representer theorem $w = \sum_{n \in A(w,b)} \alpha_n y_n x_n$,
    and therefore, after feature augmentation,
    $\hat{y} = h(x) = \mathrm{sign}\left( \sum_{m \in A(w,b)} \alpha_m y_m \varphi(x_m)^T \varphi(x) + b \right)$

COMPSCI 371D — Machine Learning SVM Kernels 17 / 27

slide-18
SLIDE 18

Kernels and Nonlinear SVMs

Kernel Idea 1

  • Start with some ϕ(x) and use the kernel to save computation
  • Example: $\varphi(x) = (1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3)$
  • Don’t know how to simplify. Try this instead:
    $\varphi(x) = (1,\ \sqrt{3}\,x_1,\ \sqrt{3}\,x_2,\ \sqrt{3}\,x_1^2,\ \sqrt{6}\,x_1 x_2,\ \sqrt{3}\,x_2^2,\ x_1^3,\ \sqrt{3}\,x_1^2 x_2,\ \sqrt{3}\,x_1 x_2^2,\ x_2^3)$
  • Can show (see notes) that $K(x, z) = \varphi(x)^T \varphi(z) = (x^T z + 1)^3$
  • Something similar works for any d and k
  • 4 products and 2 sums instead of 10 products and 9 sums
  • Meager savings here, but the savings grow exponentially with d and k, as we know

COMPSCI 371D — Machine Learning SVM Kernels 18 / 27
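A quick numerical check, assuming NumPy is available, that the scaled map above reproduces the cubic kernel: both print statements give the same number, but the second one works in the original two dimensions.

```python
# Verify phi(x)^T phi(z) == (x^T z + 1)^3 for the scaled cubic monomial map.
import numpy as np

def phi(x):
    x1, x2 = x
    s3, s6 = np.sqrt(3.0), np.sqrt(6.0)
    return np.array([1.0, s3*x1, s3*x2, s3*x1**2, s6*x1*x2, s3*x2**2,
                     x1**3, s3*x1**2*x2, s3*x1*x2**2, x2**3])

rng = np.random.default_rng(0)
x, z = rng.standard_normal(2), rng.standard_normal(2)
print(phi(x) @ phi(z))        # explicit augmentation: 10-dimensional inner product
print((x @ z + 1.0) ** 3)     # kernel evaluation: same value, 2-dimensional work
```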

slide-19
SLIDE 19

Kernels and Nonlinear SVMs

Much Better Kernel Idea 2

  • Just come up with K(x, z) without knowing the corresponding ϕ(x)
  • Not just any K: it must behave like an inner product
  • For instance, $x^T z = z^T x$ and $(x^T z)^2 \leq \|x\|^2 \|z\|^2$ (symmetry and Cauchy-Schwarz), so we need at least $K(x, z) = K(z, x)$ and $K^2(x, z) \leq K(x, x)\, K(z, z)$
  • These conditions are necessary, but they are not sufficient
  • Fortunately, there is a theory for this
COMPSCI 371D — Machine Learning SVM Kernels 19 / 27

slide-20
SLIDE 20

Mercer’s Conditions

Mercer Conditions

  • $K(x, z) : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is a kernel function if there exists a ϕ for which $K(x, z) = \varphi(x)^T \varphi(z)$
  • Finite case: Given $x_n \in \mathbb{R}^d$ for n = 1, ..., N (as in T), a symmetric function K(x, z) is a kernel function on that set iff the N × N matrix $A = [K(x_i, x_j)]$ is positive semi-definite
  • Problem: We would like to know if K(x, z) is a kernel for any T, or even for x we have not yet seen
  • Infinite case: K(x, z) is a kernel function iff for every $f : \mathbb{R}^d \rightarrow \mathbb{R}$ such that $\int_{\mathbb{R}^d} f(x)\, dx$ is finite, we have $\int_{\mathbb{R}^d \times \mathbb{R}^d} K(x, z)\, f(x)\, f(z)\, dx\, dz \geq 0$
  • Immediate extension of positive-definiteness to the continuous case

COMPSCI 371D — Machine Learning SVM Kernels 20 / 27
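A sketch of the finite-case test, assuming NumPy: build $A = [K(x_i, x_j)]$ on a sample and check that its eigenvalues are non-negative up to numerical tolerance. The negative squared distance is used here only as an example of a symmetric function that fails the test.

```python
# Finite-sample Mercer check: K is a kernel on the set X iff A = [K(x_i, x_j)]
# is positive semi-definite (all eigenvalues >= 0, up to round-off).
import numpy as np

def is_psd_on_sample(kernel, X, tol=1e-10):
    A = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    eigvals = np.linalg.eigvalsh(A)        # A is symmetric, so eigvalsh applies
    return bool(np.all(eigvals >= -tol))

X = np.random.default_rng(1).standard_normal((20, 2))
poly3 = lambda x, z: (x @ z + 1.0) ** 3                   # a valid kernel
bad = lambda x, z: -float(np.sum((x - z) ** 2))           # symmetric, but not a kernel
print(is_psd_on_sample(poly3, X))   # True
print(is_psd_on_sample(bad, X))     # False: negative eigenvalues appear
```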

slide-21
SLIDE 21

Mercer’s Conditions

The “Kernel Trick”

  • There is a theory for checking the Mercer conditions algorithmically (eigenfunctions instead of eigenvectors)
  • There is a calculus for how to build new kernel functions
  • A whole cottage industry tailors kernels to problems
  • This is rather tricky. However, the Gaussian kernel is very popular:
    $K(x, z) = e^{-\frac{\|x - z\|^2}{\sigma^2}}$
  • A measure of similarity between x and z
  • Gaussian kernels are also called Radial Basis Functions
COMPSCI 371D — Machine Learning SVM Kernels 21 / 27
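A tiny illustration (helper name assumed) of the similarity reading of the Gaussian kernel: K(x, z) equals 1 when x = z and decays toward 0 as the points move apart, with σ setting how quickly "similar" fades.

```python
# Gaussian (RBF) kernel as a similarity score between two points.
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / sigma ** 2)

x = np.array([0.0, 0.0])
for z in ([0.0, 0.0], [0.5, 0.0], [2.0, 0.0]):
    print(z, gaussian_kernel(x, z))   # 1.0, then ~0.78, then ~0.018
```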

slide-22
SLIDE 22

Gaussian Kernels and Support Vectors

Kernels and Support Vectors

  • Recall: The decision rule for an SVM is $h(x) = \mathrm{sign}(w^T \varphi(x) + b)$ (in the transformed space, where the SVM is linear)
  • The separating hyperplane is $w^T \varphi(x) + b = 0$
  • From the representer theorem, $w = \sum_n \alpha_n y_n \varphi(x_n)$, where the sum is over support vectors only
  • Therefore the separating hyperplane is $\sum_n \alpha_n y_n \varphi(x_n)^T \varphi(x) + b = 0$
  • That is, $\sum_n \alpha_n y_n K(x_n, x) + b = 0$
  • $x_n$ and x are in the original space
  • This equation describes the decision boundary induced in the original space

COMPSCI 371D — Machine Learning SVM Kernels 22 / 27
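A sketch of this kernelized decision rule, with toy values standing in for the α, b, and support vectors that training would actually produce.

```python
# h(x) = sign( sum_n alpha_n y_n K(x_n, x) + b ), sum over support vectors only.
import numpy as np

def decide(x, support_X, support_y, alpha, b, kernel):
    score = sum(a * yn * kernel(xn, x)
                for a, yn, xn in zip(alpha, support_y, support_X)) + b
    return np.sign(score)

rbf = lambda u, v, sigma=1.0: np.exp(-np.sum((u - v) ** 2) / sigma ** 2)

# purely illustrative "trained" values
support_X = np.array([[0.0, 0.0], [2.0, 2.0]])
support_y = np.array([1.0, -1.0])
alpha = np.array([0.7, 0.7])
b = 0.0
print(decide(np.array([0.2, 0.1]), support_X, support_y, alpha, b, rbf))   # 1.0
print(decide(np.array([1.9, 2.1]), support_X, support_y, alpha, b, rbf))   # -1.0
```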

slide-23
SLIDE 23

Gaussian Kernels and Support Vectors

The “Kernel Trick:” Summary, Part 1

  • In a linear SVM, feature vectors x always show up in inner products: $x_m^T x_n$ or $x_n^T x$
  • If features are augmented, x → ϕ(x), then ϕ(x) also always shows up in inner products: $\varphi(x_m)^T \varphi(x_n)$ or $\varphi(x_n)^T \varphi(x)$
  • Define a kernel K(x, x′) such that there exists an (often unknown) mapping ϕ(·) for which $K(x, x') = \varphi(x)^T \varphi(x')$
  • We always work with K(x, x′) without ever involving ϕ(x) or ϕ(x′) (which are large, possibly infinite-dimensional)
  • We avoid the computational cost of feature augmentation

COMPSCI 371D — Machine Learning SVM Kernels 23 / 27

slide-24
SLIDE 24

Gaussian Kernels and Support Vectors

The “Kernel Trick:” Summary, Part 2

  • Given K(x, x′), there exists a mapping ϕ(·) for which $K(x, x') = \varphi(x)^T \varphi(x')$ iff K satisfies the Mercer condition: for every $f : \mathbb{R}^d \rightarrow \mathbb{R}$ such that $\int_{\mathbb{R}^d} f(x)\, dx$ is finite, $\int_{\mathbb{R}^d \times \mathbb{R}^d} K(x, z)\, f(x)\, f(z)\, dx\, dz \geq 0$
  • This condition can be verified through eigenfunction computations
  • Important example: The Radial Basis Function (RBF) $K(x, x') = e^{-\frac{\|x - x'\|^2}{\sigma^2}}$ is a kernel
  • What does the decision boundary look like now?

COMPSCI 371D — Machine Learning SVM Kernels 24 / 27

slide-25
SLIDE 25

Gaussian Kernels and Support Vectors

Gaussian Kernels and Support Vectors

  • The decision boundary in the original space is $\sum_n \alpha_n y_n K(x_n, x) + b = 0$, where the sum is over support vectors
  • For RBF SVMs, $\sum_n \alpha_n y_n e^{-\frac{\|x - x_n\|^2}{\sigma^2}} = -b$
  • Simple geometric interpretation

COMPSCI 371D — Machine Learning SVM Kernels 25 / 27

slide-26
SLIDE 26

Gaussian Kernels and Support Vectors

Classification

http://mldemos.b4silio.com

COMPSCI 371D — Machine Learning SVM Kernels 26 / 27
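The slide points to the interactive MLDemos tool; as a rough stand-in, here is a minimal scikit-learn sketch (assuming scikit-learn is installed) that fits an RBF SVM to a ring-shaped data set that is not linearly separable in the original space.

```python
# RBF-kernel SVM classification on two concentric rings.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2 * np.pi, 200)
r = np.concatenate([rng.uniform(0.0, 1.0, 100),     # inner ring: class 0
                    rng.uniform(2.0, 3.0, 100)])    # outer ring: class 1
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([np.zeros(100), np.ones(100)])

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)  # gamma plays the role of 1/sigma^2
print(clf.score(X, y))                # typically 1.0 on this easy problem
print(clf.support_vectors_.shape)     # only a subset of points define the boundary
```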

slide-27
SLIDE 27

Gaussian Kernels and Support Vectors

Regression

http://mldemos.b4silio.com

COMPSCI 371D — Machine Learning SVM Kernels 27 / 27
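A companion sketch for the regression demo, again assuming scikit-learn: support vector regression with the same RBF kernel, fit to a noisy sine curve.

```python
# RBF-kernel support vector regression (SVR) on a noisy 1-D target.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3.0, 3.0, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma=1.0).fit(X, y)
print(reg.score(X, y))       # R^2, close to 1 on this smooth target
print(len(reg.support_))     # indices of the support vectors actually used
```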