SLIDE 1

Support Vector Machine and Kernel Methods

Jiayu Zhou

Department of Computer Science and Engineering

Michigan State University East Lansing, MI USA

February 26, 2017

SLIDE 2

Which Separator Do You Pick?

SLIDE 3

Robustness to Noisy Data

Being robust to noise (measurement error) is good (remember regularization).

SLIDE 4

Thicker Cushion Means More Robustness

We call such hyperplanes fat

SLIDE 5

Two Crucial Questions

1. Can we efficiently find the fattest separating hyperplane?
2. Is a fatter hyperplane better than a thin one?

SLIDE 6

Pulling Out the Bias

Before: x ∈ {1} × R^d, w ∈ R^{d+1}
    x = [1, x1, ..., xd]ᵀ,  w = [w0, w1, ..., wd]ᵀ
    signal = wᵀx

After: x ∈ R^d, w ∈ R^d, bias b ∈ R
    x = [x1, ..., xd]ᵀ,  w = [w1, ..., wd]ᵀ
    signal = wᵀx + b

SLIDE 7

Separating The Data

Hyperplane h = (b, w). h separates the data means: yn(wᵀxn + b) > 0 for all n.
By rescaling the weights and bias,

    min_{n=1,...,N} yn(wᵀxn + b) = 1

SLIDE 8

Distance to the Hyperplane

w is normal to the hyperplane (why?)

wT (x2 − x1) = wT x2 − wT x1 = −b + b = 0

Scalar projection: aᵀb = ∥a∥∥b∥ cos(a, b), so aᵀb/∥b∥ = ∥a∥ cos(a, b).
Let x⊥ be the orthogonal projection of x onto h. The distance to the hyperplane is given by the projection of x − x⊥ onto w (why?):

    dist(x, h) = (1/∥w∥) · |wᵀx − wᵀx⊥| = (1/∥w∥) · |wᵀx + b|

SLIDE 9

Fatness of a Separating Hyperplane

For a data point (xn, yn) correctly classified by h,

    dist(xn, h) = (1/∥w∥) · |wᵀxn + b| = (1/∥w∥) · |yn(wᵀxn + b)| = (1/∥w∥) · yn(wᵀxn + b)

Fatness = distance to the closest point:

    Fatness = min_n dist(xn, h) = (1/∥w∥) · min_n yn(wᵀxn + b) = 1/∥w∥

(using the rescaling min_n yn(wᵀxn + b) = 1).
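To make these two formulas concrete, here is a small numpy check (not from the slides; the hyperplane and points are illustrative and happen to match the toy example coming up):

```python
# Small numeric check of dist(x, h) = |w'x + b| / ||w|| and
# Fatness = min_n y_n (w'x_n + b) / ||w||  (illustrative values only).
import numpy as np

w = np.array([1.0, -1.0])                      # illustrative separating hyperplane
b = -1.0
X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
y = np.array([-1, -1, 1, 1], dtype=float)

dists = np.abs(X @ w + b) / np.linalg.norm(w)          # distance of each point to h
fatness = np.min(y * (X @ w + b)) / np.linalg.norm(w)  # margin of the separator
print(dists, fatness)                                   # fatness = 1/||w|| ≈ 0.707 here
```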

SLIDE 10

Maximizing the Margin

Formal definition of margin:

    margin: γ(h) = 1/∥w∥

NOTE: Bias b does not appear in the margin.

Objective, maximizing the margin:

    min_{b,w} (1/2) wᵀw   subject to: min_{n=1,...,N} yn(wᵀxn + b) = 1

An equivalent objective:

    min_{b,w} (1/2) wᵀw   subject to: yn(wᵀxn + b) ≥ 1 for n = 1, ..., N

SLIDE 11

Example - Our Toy Data Set

    min_{b,w} (1/2) wᵀw   subject to: yn(wᵀxn + b) ≥ 1 for n = 1, ..., N

Training data:

    X = [ 0 0 ; 2 2 ; 2 0 ; 3 0 ],   y = [ −1, −1, +1, +1 ]ᵀ

What is the margin?

SLIDE 12

Example - Our Toy Data Set

    min_{b,w} (1/2) wᵀw   subject to: yn(wᵀxn + b) ≥ 1 for n = 1, ..., N

The training data give the constraints

    (1): −b ≥ 1
    (2): −(2w1 + 2w2 + b) ≥ 1
    (3): 2w1 + b ≥ 1
    (4): 3w1 + b ≥ 1

    (1) + (3) → w1 ≥ 1,   (2) + (3) → w2 ≤ −1
    ⇒ (1/2) wᵀw = (1/2)(w1² + w2²) ≥ 1

Thus: w1 = 1, w2 = −1, b = −1

SLIDE 13

Example - Our Toy Data Set

Given the data X above, the optimal solution is

    w∗ = [ w1 = 1, w2 = −1 ]ᵀ,  b∗ = −1

Optimal hyperplane: g(x) = sign(x1 − x2 − 1)

Margin: 1/∥w∗∥ = 1/√2 ≈ 0.707

For data points (1), (2) and (3): yn(xnᵀw∗ + b∗) = 1 → Support Vectors

SLIDE 14

Solver: Quadratic Programming

    min_{u ∈ R^q} (1/2) uᵀQu + pᵀu   subject to: Au ≥ c

    u∗ ← QP(Q, p, A, c)

(Q = 0 gives linear programming.) http://cvxopt.org/examples/tutorial/qp.html
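As a rough illustration of the QP(Q, p, A, c) call, here is a minimal sketch with cvxopt (assumption: cvxopt is installed; note that cvxopt's solvers.qp minimizes (1/2)uᵀPu + qᵀu subject to Gu ≤ h, so the slide's constraint Au ≥ c is passed as (−A)u ≤ −c):

```python
# Minimal sketch of u* <- QP(Q, p, A, c) on top of cvxopt (assumes cvxopt is installed).
import numpy as np
from cvxopt import matrix, solvers

def qp(Q, p, A, c):
    # Expects float numpy arrays. cvxopt minimizes (1/2) u'Pu + q'u subject to Gu <= h,
    # so the constraint Au >= c becomes (-A)u <= (-c).
    Q, p, A, c = (np.asarray(M, dtype=float) for M in (Q, p, A, c))
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(Q), matrix(p), matrix(-A), matrix(-c))
    return np.array(sol['x']).ravel()   # u*
```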

SLIDE 15

Maximum Margin Hyperplane is QP

    min_{b,w} (1/2) wᵀw   subject to: yn(wᵀxn + b) ≥ 1, ∀n

    min_{u ∈ R^q} (1/2) uᵀQu + pᵀu   subject to: Au ≥ c

Let u = [b; w] ∈ R^{d+1}. Then

    (1/2) wᵀw = (1/2) [b, wᵀ] [ 0  0dᵀ ; 0d  Id ] [b; w] = (1/2) uᵀ [ 0  0dᵀ ; 0d  Id ] u

so Q = [ 0  0dᵀ ; 0d  Id ] and p = 0_{d+1}. The constraint yn(wᵀxn + b) ≥ 1 is [yn, yn xnᵀ] u ≥ 1, so stacking over n:

    A = [ y1 y1x1ᵀ ; ... ; yN yNxNᵀ ],   c = [ 1, ..., 1 ]ᵀ = 1N

SLIDE 16

Back To Our Example

Exercise: for X and y above, with constraints

    (1): −b ≥ 1
    (2): −(2w1 + 2w2 + b) ≥ 1
    (3): 2w1 + b ≥ 1
    (4): 3w1 + b ≥ 1

show the corresponding Q, p, A, c:

    Q = [ 0 0 0 ; 0 1 0 ; 0 0 1 ],   p = [ 0, 0, 0 ]ᵀ
    A = [ −1 0 0 ; −1 −2 −2 ; 1 2 0 ; 1 3 0 ],   c = [ 1, 1, 1, 1 ]ᵀ

Use your QP solver to get u∗ = [b∗, w1∗, w2∗]ᵀ = [−1, 1, −1]ᵀ.

SLIDE 17

Primal QP algorithm for linear-SVM

1. Let p = 0_{d+1} be the (d + 1)-vector of zeros and c = 1N the N-vector of ones. Construct matrices Q and A, where

       Q = [ 0  0dᵀ ; 0d  Id ],   A = [ y1 y1x1ᵀ ; ... ; yN yNxNᵀ ]

2. Return [b∗; w∗] = u∗ ← QP(Q, p, A, c).

3. The final hypothesis is g(x) = sign(xᵀw∗ + b∗).
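A minimal end-to-end sketch of this algorithm on the toy data set, again using cvxopt (the function name hard_margin_svm is mine, not the slides'):

```python
# Sketch: primal QP algorithm for the linear hard-margin SVM (assumes cvxopt + numpy).
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Solve min (1/2) w'w  s.t.  y_n (w'x_n + b) >= 1; return (b*, w*)."""
    N, d = X.shape
    Q = np.zeros((d + 1, d + 1)); Q[1:, 1:] = np.eye(d)      # Q = [0 0' ; 0 I_d]
    p = np.zeros(d + 1)                                       # p = 0_{d+1}
    A = np.hstack([y[:, None], y[:, None] * X])               # rows [y_n, y_n x_n']
    c = np.ones(N)                                            # c = 1_N
    solvers.options['show_progress'] = False
    # cvxopt uses Gu <= h, so pass -A and -c for the constraint Au >= c.
    u = np.array(solvers.qp(matrix(Q), matrix(p), matrix(-A), matrix(-c))['x']).ravel()
    return u[0], u[1:]

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])        # toy data set
y = np.array([-1., -1., 1., 1.])
b, w = hard_margin_svm(X, y)
print(b, w)                    # approximately b* = -1, w* = (1, -1)
print(y * (X @ w + b))         # support vectors attain y_n (w'x_n + b) = 1
```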

SLIDE 18

Link to Regularization

Regularization:  min_w Ein(w)  subject to: wᵀw ≤ C

    Optimal hyperplane:  minimize wᵀw  subject to Ein = 0
    Regularization:      minimize Ein  subject to wᵀw ≤ C

SLIDE 19

How to Handle Non-Separable Data?

(a) Tolerate noisy data points: soft-margin SVM. (b) Inherent nonlinear boundary: non-linear transformation.

SLIDE 20

Non-Linear Transformation

    Φ1(x) = (x1, x2)
    Φ2(x) = (x1, x2, x1², x1x2, x2²)
    Φ3(x) = (x1, x2, x1², x1x2, x2², x1³, x1²x2, x1x2², x2³)
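A quick numpy sketch of these transforms (helper names are mine, for illustration only):

```python
# Illustrative 2nd- and 3rd-order polynomial transforms for x = (x1, x2).
import numpy as np

def phi2(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1 * x2, x2**2])

def phi3(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1 * x2, x2**2,
                     x1**3, x1**2 * x2, x1 * x2**2, x2**3])
```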

SLIDE 21

Non-Linear Transformation

Using the optimal hyperplane with a nonlinear transform Φ: R^d → R^{d̃}:

    zn = Φ(xn)

Solve the hard-margin SVM in the Z-space to obtain (w̃∗, b̃∗):

    min_{b̃,w̃} (1/2) w̃ᵀw̃   subject to: yn(w̃ᵀzn + b̃) ≥ 1, ∀n

Final hypothesis: g(x) = sign(w̃∗ᵀΦ(x) + b̃∗)
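A short sketch of this transform-then-solve workflow, reusing the illustrative phi2 and hard_margin_svm helpers sketched earlier (both are my own names, not part of the slides):

```python
# Sketch: hard-margin SVM in the Z-space z_n = Phi(x_n), then predict in X-space.
import numpy as np

def fit_in_z_space(X, y, phi, svm_solver):
    Z = np.array([phi(x) for x in X])              # z_n = Phi(x_n)
    b_t, w_t = svm_solver(Z, y)                    # hard-margin SVM solved in Z-space
    return lambda x: np.sign(w_t @ phi(x) + b_t)   # g(x) = sign(w~' Phi(x) + b~)

# Example usage with the earlier sketches: g = fit_in_z_space(X, y, phi2, hard_margin_svm)
```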

SLIDE 22

SVM and non-linear transformation

The margin is shaded in yellow, and the support vectors are boxed.

For Φ2, d̃2 = 5 and for Φ3, d̃3 = 9. d̃3 is nearly double d̃2, yet the resulting SVM separator with Φ3 is not severely overfitting (regularization?).

SLIDE 23

Support Vector Machine Summary

A very powerful, easy-to-use linear model which comes with automatic regularization. To fully exploit the SVM: kernels.

Potential robustness to overfitting even after transforming to a much higher dimension. How about infinite-dimensional transforms? The kernel trick.

SLIDE 24

SVM Dual: Formulation

Primal and dual in optimization. The dual view of SVM enables us to exploit the kernel trick. In the primal SVM problem we solve for w ∈ R^d and b, while in the dual problem we solve for α ∈ R^N:

    max_{α ∈ R^N}  ∑_{n=1}^N αn − (1/2) ∑_{m=1}^N ∑_{n=1}^N αn αm yn ym xnᵀxm

    subject to  ∑_{n=1}^N yn αn = 0,  αn ≥ 0, ∀n

which is also a QP problem.

SLIDE 25

SVM Dual: Prediction

We can obtain the primal solution:

    w∗ = ∑_{n=1}^N yn αn∗ xn

where the support vectors are the points with αn∗ > 0. The optimal hypothesis:

    g(x) = sign(w∗ᵀx + b∗) = sign( ∑_{n=1}^N yn αn∗ xnᵀx + b∗ ) = sign( ∑_{αn∗>0} yn αn∗ xnᵀx + b∗ )

SLIDE 26

Dual SVM: Summary

    max_{α ∈ R^N}  ∑_{n=1}^N αn − (1/2) ∑_{m=1}^N ∑_{n=1}^N αn αm yn ym xnᵀxm

    subject to  ∑_{n=1}^N yn αn = 0,  αn ≥ 0, ∀n

    w∗ = ∑_{n=1}^N yn αn∗ xn
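A minimal sketch of solving this dual with cvxopt and recovering (w∗, b∗); the helper name and the support-vector threshold are my own choices, and cvxopt's convention (minimize (1/2)αᵀPα + qᵀα subject to Gα ≤ h, Aα = b) is mapped onto the slide's maximization:

```python
# Sketch: hard-margin dual SVM via cvxopt (assumes cvxopt + numpy).
# Maximizing sum(alpha) - (1/2) alpha'(yy' * XX')alpha  ==  minimizing with
# P = outer(y, y) * (X X') and q = -1, under alpha >= 0 and y'alpha = 0.
import numpy as np
from cvxopt import matrix, solvers

def dual_hard_margin_svm(X, y, tol=1e-6):
    N = X.shape[0]
    P = np.outer(y, y) * (X @ X.T)
    q = -np.ones(N)
    G, h = -np.eye(N), np.zeros(N)           # alpha_n >= 0
    A, b = y.reshape(1, -1), np.zeros(1)     # sum_n y_n alpha_n = 0
    solvers.options['show_progress'] = False
    alpha = np.array(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h),
                                matrix(A), matrix(b))['x']).ravel()
    w = (alpha * y) @ X                      # w* = sum_n y_n alpha_n* x_n
    sv = alpha > tol                         # support vectors: alpha_n* > 0
    b_star = float(np.mean(y[sv] - X[sv] @ w))  # from y_s (w*'x_s + b*) = 1
    return alpha, w, b_star
```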

SLIDE 27

Common SVM Basis Functions

zk = polynomial terms of xk of degree 1 to q
zk = radial basis functions of xk: zk(j) = φj(xk) = exp(−|xk − cj|²/σ²)
zk = sigmoid functions of xk

SLIDE 28

Quadratic Basis Functions

    Φ(x) = ( 1, √2 x1, ..., √2 xd, x1², ..., xd², √2 x1x2, ..., √2 x1xd, √2 x2x3, ..., √2 x_{d−1}x_d )ᵀ

Including the constant term, linear terms, pure quadratic terms, and quadratic cross-terms. The number of terms is approximately d²/2. You may be wondering what those √2s are doing. You'll find out why they're there soon.

SLIDE 29

Dual SVM: Non-linear Transformation

    max_{α ∈ R^N}  ∑_{n=1}^N αn − (1/2) ∑_{m=1}^N ∑_{n=1}^N αn αm yn ym Φ(xn)ᵀΦ(xm)

    subject to  ∑_{n=1}^N yn αn = 0,  αn ≥ 0, ∀n

    w∗ = ∑_{n=1}^N yn αn∗ Φ(xn)

We need to prepare the matrix Q with Qnm = yn ym Φ(xn)ᵀΦ(xm). Cost?

We must do N²/2 dot products to get this matrix ready. Each dot product requires d²/2 additions and multiplications, so the whole thing costs N²d²/4.

SLIDE 30

Quadratic Dot Products

    Φ(a)ᵀΦ(b) = ( 1, √2 a1, ..., √2 ad, a1², ..., ad², √2 a1a2, ..., √2 a_{d−1}a_d )ᵀ ( 1, √2 b1, ..., √2 bd, b1², ..., bd², √2 b1b2, ..., √2 b_{d−1}b_d )

    Constant term:           1
    Linear terms:            ∑_{i=1}^d 2 ai bi
    Pure quadratic terms:    ∑_{i=1}^d ai² bi²
    Quadratic cross-terms:   ∑_{i=1}^d ∑_{j=i+1}^d 2 ai aj bi bj

SLIDE 31

Quadratic Dot Product

Does Φ(a)ᵀΦ(b) look familiar?

    Φ(a)ᵀΦ(b) = 1 + 2 ∑_{i=1}^d ai bi + ∑_{i=1}^d ai² bi² + ∑_{i=1}^d ∑_{j=i+1}^d 2 ai aj bi bj

Try this: (aᵀb + 1)²

    (aᵀb + 1)² = (aᵀb)² + 2 aᵀb + 1
               = ( ∑_{i=1}^d ai bi )² + 2 ∑_{i=1}^d ai bi + 1
               = ∑_{i=1}^d ∑_{j=1}^d ai bi aj bj + 2 ∑_{i=1}^d ai bi + 1
               = ∑_{i=1}^d ai² bi² + 2 ∑_{i=1}^d ∑_{j=i+1}^d ai aj bi bj + 2 ∑_{i=1}^d ai bi + 1

They're the same! And this is only O(d) to compute!
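A quick numeric sanity check of this identity (illustrative helper names; the √2-scaled feature map is the one from the earlier slide):

```python
# Check that Phi(a)'Phi(b) equals (a'b + 1)^2 for the sqrt(2)-scaled quadratic basis.
import numpy as np

def phi_quad(x):
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x**2, np.array(cross)))

a = np.array([0.3, -1.2, 2.0])
b = np.array([1.5, 0.4, -0.7])
print(phi_quad(a) @ phi_quad(b))   # explicit dot product in feature space, O(d^2) work
print((a @ b + 1) ** 2)            # kernel evaluation, only O(d) work; same value
```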

SLIDE 32

Dual SVM: Non-linear Transformation

    max_{α ∈ R^N}  ∑_{n=1}^N αn − (1/2) ∑_{m=1}^N ∑_{n=1}^N αn αm yn ym Φ(xn)ᵀΦ(xm)

    subject to  ∑_{n=1}^N yn αn = 0,  αn ≥ 0, ∀n

    w∗ = ∑_{n=1}^N yn αn∗ Φ(xn)

We need to prepare the matrix Q with Qnm = yn ym Φ(xn)ᵀΦ(xm). Cost?

We must still do N²/2 dot products to get this matrix ready, but now each dot product requires only d additions and multiplications.

SLIDE 33

Higher Order Polynomials

    Φ(x)                        Cost         100-dim
    Quadratic: d²/2 terms       d²N²/4       2.5k N²
    Cubic:     d³/6 terms       d³N²/12      83k N²
    Quartic:   d⁴/24 terms      d⁴N²/48      1.96m N²

    Φ(a)ᵀΦ(b)                   Cost         100-dim
    Quadratic: (aᵀb + 1)²       dN²/2        50 N²
    Cubic:     (aᵀb + 1)³       dN²/2        50 N²
    Quartic:   (aᵀb + 1)⁴       dN²/2        50 N²

SLIDE 34

Dual SVM with Quintic basis functions

    max_{α ∈ R^N}  ∑_{n=1}^N αn − (1/2) ∑_{m=1}^N ∑_{n=1}^N αn αm yn ym Φ(xn)ᵀΦ(xm),   with Φ(xn)ᵀΦ(xm) = (xnᵀxm + 1)⁵

    subject to  ∑_{n=1}^N yn αn = 0,  αn ≥ 0, ∀n

Classification:

    g(x) = sign(w∗ᵀΦ(x) + b∗) = sign( ∑_{αn∗>0} yn αn∗ Φ(xn)ᵀΦ(x) + b∗ ) = sign( ∑_{αn∗>0} yn αn∗ (xnᵀx + 1)⁵ + b∗ )

SLIDE 35

Dual SVM with general kernel functions

    max_{α ∈ R^N}  ∑_{n=1}^N αn − (1/2) ∑_{m=1}^N ∑_{n=1}^N αn αm yn ym K(xn, xm)

    subject to  ∑_{n=1}^N yn αn = 0,  αn ≥ 0, ∀n

Classification:

    g(x) = sign(w∗ᵀΦ(x) + b∗) = sign( ∑_{αn∗>0} yn αn∗ Φ(xn)ᵀΦ(x) + b∗ ) = sign( ∑_{αn∗>0} yn αn∗ K(xn, x) + b∗ )
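A small sketch of this kernelized prediction rule (the function and argument names are mine; α and b∗ are assumed to come from a dual solver such as the earlier sketch):

```python
# Sketch: g(x) = sign( sum_{alpha_n* > 0} y_n alpha_n* K(x_n, x) + b* ).
import numpy as np

def kernel_predict(x, X, y, alpha, b_star, K, tol=1e-6):
    sv = np.where(alpha > tol)[0]                 # only support vectors contribute
    score = sum(y[n] * alpha[n] * K(X[n], x) for n in sv)
    return np.sign(score + b_star)

quintic = lambda a, b: (a @ b + 1) ** 5           # the kernel from the previous slide
```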

SLIDE 36

Kernel Tricks

Replacing the dot product with a kernel function. Not all functions are kernel functions!

A kernel needs to be decomposable: K(a, b) = Φ(a)ᵀΦ(b). Could K(a, b) = (a − b)³ be a kernel function? Could K(a, b) = (a − b)⁴ − (a + b)² be a kernel function?

Mercer's condition

To expand a kernel function K(a, b) into a dot product, i.e., K(a, b) = Φ(a)ᵀΦ(b), K(a, b) has to be a positive semi-definite function: the kernel matrix K is symmetric PSD for any given x1, ..., xN.
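One way to sanity-check a candidate kernel numerically is to build its kernel matrix on a sample of points and inspect the eigenvalues (a finite sample can only falsify, never prove, Mercer's condition); a small illustrative sketch:

```python
# Sketch: empirical check that a kernel matrix K_nm = K(x_n, x_m) is symmetric PSD.
import numpy as np

def kernel_matrix(K, X):
    return np.array([[K(a, b) for b in X] for a in X])

X = np.random.randn(20, 3)
K_poly = kernel_matrix(lambda a, b: (a @ b + 1) ** 2, X)
print(np.allclose(K_poly, K_poly.T))                 # symmetric
print(np.linalg.eigvalsh(K_poly).min() >= -1e-8)     # no significantly negative eigenvalues

K_bad = kernel_matrix(lambda a, b: -np.linalg.norm(a - b), X)
print(np.linalg.eigvalsh(K_bad).min())               # a non-kernel typically shows negative eigenvalues
```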

SLIDE 37

Kernel Design: expression kernel

mRNA expression data:

Each matrix entry is an mRNA expression measurement. Each column is an experiment. Each row corresponds to a gene.

Similar or dissimilar? Kernel (the cosine similarity of two expression profiles):

    K(x, y) = ( ∑_i xi yi ) / ( √(∑_i xi xi) · √(∑_i yi yi) )
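A one-line numpy sketch of this expression kernel (helper name mine):

```python
# Sketch of the expression kernel above: normalized inner product of expression profiles.
import numpy as np

def expression_kernel(x, y):
    return (x @ y) / (np.sqrt(x @ x) * np.sqrt(y @ y))
```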

SLIDE 38

Kernel Design: sequence kernel

Working with non-vectorial data: how do we define a scalar product on a pair of variable-length, discrete strings?

>ICYA_MANSE GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN LVPWVLATDYKNYAINYMENSHPDKKAHSIHAWILSKSKVLEGNTKEVVD NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH >LACB_BOVIN MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALE KFDKALKALPMHIRLSFNPTQLEEQCHI

SLIDE 39

Commonly Used SVM Kernel Functions

K(a, b) = (α · aᵀb + β)^Q is an example of an SVM kernel function. Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right kernel function:

    Radial-basis-style (RBF) / Gaussian kernel: K(a, b) = exp(−γ ∥a − b∥²)
    Sigmoid functions
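A minimal sketch of the Gaussian/RBF kernel (function name mine):

```python
# Gaussian / RBF kernel: K(a, b) = exp(-gamma * ||a - b||^2).
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))
```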

SLIDE 40

2nd Order Polynomial Kernel

K(a, b) = (α · aᵀb + β)²

SLIDE 41

Gaussian Kernels

K(a, b) = exp(−γ ∥a − b∥²). When γ is large, we clearly see that even the protection of a large margin cannot suppress overfitting. However, for a reasonably small γ, the sophisticated boundary discovered by SVM with the Gaussian-RBF kernel looks quite good.

SLIDE 42

Gaussian Kernels

(a) A noisy data set on which a linear classifier appears to work quite well; (b) using the Gaussian-RBF kernel with the hard-margin SVM on the same data leads to overfitting.

SLIDE 43

From hard-margin to soft-margin

When there are outliers, the hard-margin SVM + Gaussian-RBF kernel results in an unnecessarily complicated decision boundary that overfits the training noise. Remedy: a soft formulation that allows small violations of the margin or even some classification errors. Soft margin: allow a margin violation εn ≥ 0 for each data point (xn, yn) and require that yn(wᵀxn + b) ≥ 1 − εn.

εn captures by how much (xn, yn) fails to be separated.

SLIDE 44

Soft-Margin SVM

We modify the hard-margin SVM to the soft-margin SVM by allowing margin violations but adding a penalty term to discourage large violations:

    min_{b,w,ε} (1/2) wᵀw + C ∑_{n=1}^N εn

    subject to: yn(wᵀxn + b) ≥ 1 − εn  for n = 1, ..., N
                εn ≥ 0  for n = 1, ..., N

The meaning of C?

When C is large, it means we care more about violating the margin, which gets us closer to the hard-margin SVM. When C is small, on the other hand, we care less about violating the margin.
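In off-the-shelf soft-margin SVM solvers this trade-off is exactly the C parameter; a hedged sketch with scikit-learn on synthetic data (assuming scikit-learn is installed):

```python
# Sketch: effect of C in a soft-margin linear SVM (assumes scikit-learn + numpy; synthetic data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.sign(X[:, 0] - X[:, 1] + 0.3 * rng.randn(200))   # noisy, roughly linear labels

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Large C: violations are expensive (closer to hard margin); small C: softer margin.
    print(C, int(clf.n_support_.sum()), clf.score(X, y))
```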

SLIDE 45

Soft Margin Example

SLIDE 46

Soft Margin and Hard Margin

    min_{b,w,ε}  (1/2) wᵀw  [margin]  +  C ∑_{n=1}^N εn  [error tolerance]

    subject to: yn(wᵀxn + b) ≥ 1 − εn,  εn ≥ 0,  ∀n

SLIDE 47

The Hinge Loss

The trade-off sounds very similar, right? We have εn ≥ 0 and yn(wᵀxn + b) ≥ 1 − εn, i.e. εn ≥ 1 − yn(wᵀxn + b); at the optimum, εn = max(1 − yn(wᵀxn + b), 0).

The SVM loss (aka hinge loss) function:

    ESVM(b, w) = (1/N) ∑_{n=1}^N max(1 − yn(wᵀxn + b), 0)

The soft-margin SVM can be re-written as the following optimization problem:

    min_{b,w} ESVM(b, w) + λ wᵀw
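A direct numpy transcription of this hinge-loss objective (names mine, for illustration):

```python
# Hinge-loss form of the soft-margin SVM objective: E_SVM(b, w) + lambda * w'w.
import numpy as np

def svm_objective(b, w, X, y, lam):
    margins = y * (X @ w + b)
    e_svm = np.maximum(1.0 - margins, 0.0).mean()   # (1/N) sum_n max(1 - y_n(w'x_n + b), 0)
    return e_svm + lam * (w @ w)
```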

SLIDE 48

Dual Soft-Margin SVM

    max_{α ∈ R^N}  ∑_{n=1}^N αn − (1/2) ∑_{m=1}^N ∑_{n=1}^N αn αm yn ym xnᵀxm

    subject to  ∑_{n=1}^N yn αn = 0,  0 ≤ αn ≤ C, ∀n

    w∗ = ∑_{n=1}^N yn αn∗ xn

SLIDE 49

Summary of Dual SVM

Deliver a large-margin hyperplane, and in so doing control the effective model complexity.
Deal with high- or infinite-dimensional transforms using the kernel trick.
Express the final hypothesis g(x) using only a few support vectors, their corresponding dual variables (Lagrange multipliers), and the kernel.
Control the sensitivity to outliers and regularize the solution through setting C appropriately.

SLIDE 50

Support Vector Machine

Robust classifier: maximum margin
    Design: hard margin: primal objective (QP solver in d); dual objective (QP solver in N, kernel trick, support vectors)
    Design: soft margin (allow training error, hinge loss): primal objective (QP solver in d); dual objective (QP solver in N, kernel trick, support vectors)
