

SLIDE 1

Pattern Recognition 2018 Support Vector Machines

Ad Feelders

Universiteit Utrecht

SLIDE 2

Support Vector Machines

SLIDE 3

Overview

1. Separable Case
2. Kernel Functions
3. Allowing Errors (Soft Margin)
4. SVMs in R

SLIDE 4

Linear Classifier for two classes

Linear model y(x) = w⊤φ(x) + b (7.1) with tn ∈ {−1, +1}. Predict t0 = +1 if y(x0) ≥ 0 and t0 = −1 otherwise. The decision boundary is given by y(x) = 0. This is a linear classifier in feature space φ(x).
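To make the decision rule concrete, here is a minimal R sketch (not from the slides); the feature map, weight vector and bias are made-up values for a two-dimensional input.

# illustration only: identity feature map, made-up w and b
phi <- function(x) x                       # phi(x) = x (no feature mapping)
w   <- c(0.5, 0.5)                         # assumed weight vector
b   <- -3                                  # assumed bias
y   <- function(x) sum(w * phi(x)) + b     # y(x) = w'phi(x) + b
predict_class <- function(x) if (y(x) >= 0) +1 else -1

predict_class(c(5, 2))   # +1, since y(c(5, 2)) = 0.5 >= 0
predict_class(c(1, 3))   # -1, since y(c(1, 3)) = -1 < 0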

SLIDE 5

Mapping

y(x) = w⊤φ(x) + b = 0

φ maps x into a higher-dimensional space where the data is linearly separable.

SLIDE 6

Data linearly separable

Assume training data is linearly separable in feature space, so there is at least one choice of w, b such that:

1. y(xn) > 0 for tn = +1;
2. y(xn) < 0 for tn = −1;

that is, all training points are classified correctly. Putting 1. and 2. together: tny(xn) > 0 for n = 1, . . . , N

SLIDE 7

Maximum Margin

There may be many solutions that separate the classes exactly. Which one gives the smallest prediction error? The SVM chooses the line with the maximal margin, where the margin is the distance between the line and the closest data point. In this way it avoids “low confidence” classifications.

SLIDE 8

Two-class training data

SLIDE 9

Many Linear Separators

SLIDE 10

SVM Decision Boundary

SLIDE 11

Maximize Margin

SLIDE 12

Support Vectors

SLIDE 13

Weight vector is orthogonal to the decision boundary

Consider two points xA and xB both of which lie on the decision surface. Because y(xA) = y(xB) = 0, we have (w⊤xA + b) − (w⊤xB + b) = w⊤(xA − xB) = 0 and so the vector w is orthogonal to the decision surface.
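A quick numerical check of this fact in R; the weight vector, bias, and the two boundary points are made-up (any two points with y(x) = 0 would do):

# made-up w and b; xA and xB both satisfy w'x + b = 0
w  <- c(1, 2)
b  <- -3
xA <- c(3, 0)        # 1*3 + 2*0 - 3 = 0, so xA lies on the decision surface
xB <- c(1, 1)        # 1*1 + 2*1 - 3 = 0, so xB lies on the decision surface
sum(w * (xA - xB))   # 0: w is orthogonal to xA - xB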

SLIDE 14

Distance of a point to a line

[Figure: a point x, its orthogonal projection x⊥ onto the line y(x) = w⊤x + b = 0, the distance r, and the direction of w.]

SLIDE 15

Distance to decision surface (φ(x) = x)

We have

x = x⊥ + r (w/‖w‖),   (4.6)

where w/‖w‖ is the unit vector in the direction of w, x⊥ is the orthogonal projection of x onto the line y(x) = 0, and r is the (signed) distance of x to the line. Multiply (4.6) left and right by w⊤ and add b:

w⊤x + b = w⊤x⊥ + b + r (w⊤w)/‖w‖

The left-hand side is y(x), the first two terms on the right are y(x⊥) = 0, and w⊤w = ‖w‖². So we get

r = y(x) ‖w‖ / ‖w‖² = y(x)/‖w‖   (4.7)

SLIDE 16

Distance of a point to a line

The signed distance of xn to the decision boundary is r = y(xn)/‖w‖. For lines that separate the data perfectly, we have tny(xn) = |y(xn)|, so that the distance is given by

tny(xn)/‖w‖ = tn(w⊤φ(xn) + b)/‖w‖   (7.2)
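In R, formula (7.2) reads as follows; w, b and the points are made-up values, with φ(x) = x:

# made-up linear classifier with phi(x) = x
w <- c(0.5, 0.5)
b <- -3
y <- function(x) sum(w * x) + b

# signed distance of x to the decision boundary: y(x) / ||w||
dist_signed <- function(x) y(x) / sqrt(sum(w^2))

# for correctly classified points, tn * y(xn) / ||w|| is a positive distance
dist_margin <- function(x, tn) tn * y(x) / sqrt(sum(w^2))

dist_signed(c(4, 4))       # positive: the point lies on the +1 side
dist_margin(c(1, 3), -1)   # positive: a correctly classified point of class -1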

SLIDE 17

Maximum margin solution

Now we are ready to define the optimization problem:

arg max_{w,b} { min_n [ tn(w⊤φ(xn) + b) / ‖w‖ ] }   (7.3)

Since 1/‖w‖ does not depend on n, it can be moved outside of the minimization:

arg max_{w,b} { (1/‖w‖) min_n [ tn(w⊤φ(xn) + b) ] }   (7.3)

Direct solution of this problem would be rather complex. A more convenient representation is possible.

SLIDE 18

Canonical Representation

The hyperplane (decision boundary) is defined by w⊤φ(x) + b = 0 Then also κ(w⊤φ(x) + b) = κw⊤φ(x) + κb = 0 so rescaling w → κw and b → κb gives just another representation of the same decision boundary. To resolve this ambiguity, we choose the scaling factor such that ti(w⊤φ(xi) + b) = 1 (7.4) for the points xi closest to the decision boundary.

SLIDE 19

Canonical Representation (square = +1, circle = −1)

[Figure: decision boundary y(x) = 0 with margin boundaries y(x) = −1 and y(x) = +1.]

SLIDE 20

Canonical Representation

In this case we have

tn(w⊤φ(xn) + b) ≥ 1,   n = 1, . . . , N   (7.5)

Quadratic program:

arg min_{w,b} (1/2)‖w‖²   (7.6)

subject to the constraints (7.5). This optimization problem has a unique global minimum.
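As a sketch of what this quadratic program looks like in practice, the snippet below solves the primal problem (7.6) with the quadprog package for a small linearly separable data set (the same six points that appear in the worked example on slide 35). The identity feature map is assumed, and the tiny ridge on the bias is only there because solve.QP requires a strictly positive definite matrix.

library(quadprog)

# linearly separable toy data (phi(x) = x); same points as the example on slide 35
X  <- matrix(c(-2, 2,  1, 3,  3, 1,  3, 6,  4, 4,  6, 5), ncol = 2, byrow = TRUE)
tn <- c(-1, -1, -1, +1, +1, +1)

# variables are (w1, w2, b); objective (1/2)(w1^2 + w2^2) plus a tiny ridge on b
Dmat <- diag(c(1, 1, 1e-8))
dvec <- c(0, 0, 0)

# one constraint tn(w'xn + b) >= 1 per training point (one column of Amat each)
Amat <- t(cbind(tn * X, tn))
bvec <- rep(1, nrow(X))

sol <- solve.QP(Dmat, dvec, Amat, bvec)
w <- sol$solution[1:2]   # maximum-margin weight vector
b <- sol$solution[3]     # bias term
2 / sqrt(sum(w^2))       # width of the margin, 2/||w||

For this data the solution is w ≈ (0.5, 0.5) and b ≈ −3, the same decision boundary as in the worked example on slide 37.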

SLIDE 21

Lagrangian Function

Introduce Lagrange multipliers an ≥ 0 to get the Lagrangian function

L(w, b, a) = (1/2)‖w‖² − Σ_{n=1}^{N} an { tn(w⊤φ(xn) + b) − 1 }   (7.7)

with

∂L(w, b, a)/∂w = w − Σ_{n=1}^{N} an tn φ(xn)

SLIDE 22

Lagrangian Function

and for b:

∂L(w, b, a)/∂b = − Σ_{n=1}^{N} an tn

Equating the derivatives to zero yields the conditions

w = Σ_{n=1}^{N} an tn φ(xn)   (7.8)

and

Σ_{n=1}^{N} an tn = 0   (7.9)

SLIDE 23

Dual Representation

Eliminating w and b from L(w, b, a) gives the dual representation.

L(w, b, a) = (1/2)‖w‖² − Σ_{n=1}^{N} an { tn(w⊤φ(xn) + b) − 1 }

= (1/2)‖w‖² − Σ_{n=1}^{N} an tn w⊤φ(xn) − b Σ_{n=1}^{N} an tn + Σ_{n=1}^{N} an

= (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} an am tn tm φ(xn)⊤φ(xm) − Σ_{n=1}^{N} Σ_{m=1}^{N} an tn am tm φ(xn)⊤φ(xm) + Σ_{n=1}^{N} an

= Σ_{n=1}^{N} an − (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} an tn am tm φ(xn)⊤φ(xm)

SLIDE 24

Dual Representation

Maximize

L̃(a) = Σ_{n=1}^{N} an − (1/2) Σ_{n,m=1}^{N} an tn am tm φ(xn)⊤φ(xm)   (7.10)

with respect to a and subject to the constraints

an ≥ 0,  n = 1, . . . , N   (7.11)

Σ_{n=1}^{N} an tn = 0   (7.12)

SLIDE 25

Kernel Function

We map x to a high-dimensional space φ(x) in which the data is linearly separable. Performing computations in this high-dimensional space may be very expensive. Use a kernel function k that computes a dot product in this space without explicitly performing the mapping (the “kernel trick”):

k(x, x′) = φ(x)⊤φ(x′)

SLIDE 26

Example: polynomial kernel

Suppose x ∈ ℝ³ and φ(x) ∈ ℝ¹⁰ with

φ(x) = (1, √2 x1, √2 x2, √2 x3, x1², x2², x3², √2 x1x2, √2 x1x3, √2 x2x3)

Then

φ(x)⊤φ(z) = 1 + 2x1z1 + 2x2z2 + 2x3z3 + x1²z1² + x2²z2² + x3²z3² + 2x1x2z1z2 + 2x1x3z1z3 + 2x2x3z2z3

But this can be written as

(1 + x⊤z)² = (1 + x1z1 + x2z2 + x3z3)²

which requires far fewer operations to compute.

SLIDE 27

Polynomial kernel: numeric example

Suppose x = (3, 2, 6) and z = (4, 1, 5). Then

φ(x) = (1, 3√2, 2√2, 6√2, 9, 4, 36, 6√2, 18√2, 12√2)
φ(z) = (1, 4√2, 1√2, 5√2, 16, 1, 25, 4√2, 20√2, 5√2)

and

φ(x)⊤φ(z) = 1 + 24 + 4 + 60 + 144 + 4 + 900 + 48 + 720 + 120 = 2025.

But

(1 + x⊤z)² = (1 + (3)(4) + (2)(1) + (6)(5))² = 45² = 2025

is a more efficient way to compute this dot product.
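The same computation can be checked in a few lines of R; the explicit feature map below simply mirrors the φ defined on the previous slide.

# explicit degree-2 polynomial feature map for a 3-dimensional input
phi <- function(x) c(1, sqrt(2) * x, x^2,
                     sqrt(2) * c(x[1] * x[2], x[1] * x[3], x[2] * x[3]))

x <- c(3, 2, 6)
z <- c(4, 1, 5)

sum(phi(x) * phi(z))   # 2025: dot product computed in the 10-dimensional feature space
(1 + sum(x * z))^2     # 2025: the same value via the kernel trick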

SLIDE 28

Kernels

Linear kernel:

k(x, x′) = x⊤x′

Two popular non-linear kernels are the polynomial kernel (of degree M)

k(x, x′) = (x⊤x′ + c)^M

and the Gaussian (or radial) kernel

k(x, x′) = exp(−‖x − x′‖² / 2σ²),   (6.23)

or, equivalently,

k(x, x′) = exp(−γ‖x − x′‖²),  where γ = 1/(2σ²).
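The three kernels as plain R functions (a small sketch; the parameter names are my own):

# linear, polynomial, and Gaussian (radial) kernels
k_linear <- function(x, xp)               sum(x * xp)
k_poly   <- function(x, xp, c = 1, M = 2) (sum(x * xp) + c)^M
k_radial <- function(x, xp, gamma = 0.5)  exp(-gamma * sum((x - xp)^2))

k_poly(c(3, 2, 6), c(4, 1, 5))   # 2025, the polynomial-kernel value from the previous slide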

SLIDE 29

Dual Representation with kernels

Using k(x, x′) = φ(x)⊤φ(x′) we get the dual representation: Maximize

L̃(a) = Σ_{n=1}^{N} an − (1/2) Σ_{n,m=1}^{N} an tn am tm k(xn, xm)   (7.10)

with respect to a and subject to the constraints

an ≥ 0,  n = 1, . . . , N   (7.11)

Σ_{n=1}^{N} an tn = 0   (7.12)

Is this dual “easier” than the original problem?

SLIDE 30

Prediction

Recall that

y(x) = w⊤φ(x) + b   (7.1)

Substituting

w = Σ_{n=1}^{N} an tn φ(xn)   (7.8)

into (7.1), we get

y(x) = b + Σ_{n=1}^{N} an tn k(x, xn)   (7.13)
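Formula (7.13) translates directly into R. In the sketch below, the kernel k, the multipliers a, the labels tn, the training points (one per row of X) and the bias b are assumed to be given.

# y(x) = b + sum_n a_n t_n k(x, x_n), as in (7.13)
svm_decision <- function(x, X, a, tn, b, k) {
  b + sum(a * tn * apply(X, 1, function(xn) k(x, xn)))
}

# the predicted class is the sign of the decision value
svm_predict <- function(x, X, a, tn, b, k) {
  ifelse(svm_decision(x, X, a, tn, b, k) >= 0, +1, -1)
}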

SLIDE 31

Constrained Optimization

Minimize f(x) subject to gi(x) ≥ 0.

Lagrangian function:

L(x, λ) = f(x) − Σ_i λi gi(x)

KKT conditions for a solution:

1. gi(x) ≥ 0
2. λi ≥ 0
3. λi gi(x) = 0

SLIDE 32

Prediction: support vectors

KKT conditions:

an ≥ 0   (7.14)
tny(xn) − 1 ≥ 0   (7.15)
an { tny(xn) − 1 } = 0   (7.16)

From (7.16) it follows that for every data point, either

1. an = 0, or
2. tny(xn) = 1.

The former play no role in making predictions (see (7.13)), and the latter are the support vectors, which lie on the maximum-margin hyperplanes. Only the support vectors play a role in predicting the class of new attribute vectors!

SLIDE 33

Only the support vectors are important for prediction

SLIDE 34

Prediction: computing b

Since for any support vector xn we have tny(xn) = 1, we can use (7.13) to get

tn ( b + Σ_{m∈S} am tm k(xn, xm) ) = 1,   (7.17)

where S denotes the set of support vectors. Hence we have

tn b + tn Σ_{m∈S} am tm k(xn, xm) = 1
tn b = 1 − tn Σ_{m∈S} am tm k(xn, xm)

and since tn ∈ {−1, +1} we have 1/tn = tn, so

b = tn − Σ_{m∈S} am tm k(xn, xm)   (7.17a)

SLIDE 35

Prediction: Example

We receive the following output from the optimization software for fitting a support vector machine with linear kernel and perfect separation of the training data:

 n   xn,1   xn,2   tn    an
 1    −2     2     −1     0
 2     1     3     −1    1/8
 3     3     1     −1    1/8
 4     3     6     +1     0
 5     4     4     +1    1/4
 6     6     5     +1     0

SLIDE 36

Prediction: Example

The figure below is a plot of the same data set, where the dots represent points with class −1, and the crosses points with class +1.

SLIDE 37

Prediction: Example

(a) Compute the value of the SVM bias term b.

Data points with an > 0 are support vectors. Let's take the point x1 = 4, x2 = 4 with class label +1:

b = tm − Σ_{n=1}^{N} an tn xm⊤xn = 1 + (1/8)(4·1 + 4·3) + (1/8)(4·3 + 4·1) − (1/4)(4·4 + 4·4) = 1 + 2 + 2 − 8 = −3

(b) Which class does the SVM predict for the data point x1 = 5, x2 = 2?

y(x) = b + Σ_{n=1}^{N} an tn x⊤xn = −3 − (1/8)(5·1 + 2·3) − (1/8)(5·3 + 2·1) + (1/4)(5·4 + 2·4) = −3 − 11/8 − 17/8 + 7 = 1/2

Since the sign is positive, we predict class +1.
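The same two computations in R, using the data, labels and multipliers from the table on slide 35:

# training points, labels and multipliers from the example (support vectors: rows 2, 3, 5)
X  <- matrix(c(-2, 2,  1, 3,  3, 1,  3, 6,  4, 4,  6, 5), ncol = 2, byrow = TRUE)
tn <- c(-1, -1, -1, +1, +1, +1)
a  <- c(0, 1/8, 1/8, 0, 1/4, 0)
k  <- function(x, xp) sum(x * xp)   # linear kernel

# (a) bias from a support vector, formula (7.17a): b = tn - sum_m am tm k(xn, xm)
n <- 5
b <- tn[n] - sum(a * tn * apply(X, 1, function(xm) k(X[n, ], xm)))
b                                   # -3

# (b) decision value for the new point (5, 2), formula (7.13)
xnew <- c(5, 2)
b + sum(a * tn * apply(X, 1, function(xm) k(xnew, xm)))   # 0.5 > 0, so predict class +1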

SLIDE 38

Prediction: Example

Decision boundary and support vectors.

SLIDE 39

Allowing Errors

So far we assumed that the training data points are linearly separable in feature space φ(x). The resulting SVM gives exact separation of the training data in the original input space x, with a non-linear decision boundary. Class distributions typically overlap, in which case exact separation of the training data leads to poor generalization (overfitting).

SLIDE 40

Allowing Errors

Data points are allowed to be on the “wrong” side of the margin boundary, but with a penalty that increases with the distance from that boundary. For convenience we make this penalty a linear function of the distance to the margin boundary. Introduce slack variables ξn ≥ 0 with one slack variable for each training data point.

SLIDE 41

Definition of Slack Variables

We define ξn = 0 for data points that are on the inside of the correct margin boundary and ξn = |tn − y(xn)| for all other data points.

[Figure: decision boundary y(x) = 0 and margin boundaries y(x) = −1, y(x) = 1, with points labelled ξ = 0, ξ < 1, and ξ > 1.]

SLIDE 42

New objective function

Our goal is to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary. We therefore minimize

C Σ_{n=1}^{N} ξn + (1/2)‖w‖²   (7.21)

where the parameter C > 0 controls the trade-off between the slack variable penalty and the margin.

Alternative view (divide by C and put λ = 1/(2C)):

Σ_{n=1}^{N} ξn + λ‖w‖²

First term represents lack-of-fit (hinge loss) and second term takes care of regularization.
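In R the lack-of-fit term is just the hinge loss max(0, 1 − tn y(xn)); the decision values and labels below are made-up:

# hinge loss xi_n = max(0, 1 - tn * y(xn)) for made-up decision values and labels
hinge <- function(tn, yx) pmax(0, 1 - tn * yx)

yx <- c(2.5, 0.4, -0.3, -1.7)   # decision values y(xn)
tn <- c(+1,  +1,   +1,   -1)    # true labels
hinge(tn, yx)                   # 0.0, 0.6, 1.3, 0.0

# soft-margin objective (7.21) for given slacks and w
softmargin_obj <- function(xi, w, C = 1) C * sum(xi) + 0.5 * sum(w^2)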

SLIDE 43

Dual

Maximize

L̃(a) = Σ_{n=1}^{N} an − (1/2) Σ_{n,m=1}^{N} an tn am tm k(xn, xm)   (7.32)

with respect to a and subject to the constraints

0 ≤ an ≤ C,  n = 1, . . . , N   (7.33)

Σ_{n=1}^{N} an tn = 0   (7.34)

The same as the separable case, except for the constraints an ≤ C.

SLIDE 44

Model Selection

As usual we are confronted with the problem of selecting the appropriate model complexity. The relevant parameters are C and any parameters of the chosen kernel function.

SLIDE 45

SVM in R

LIBSVM is available in package e1071 in R. It can also perform regression and non-binary classification. Non-binary classification is performed as follows: train K(K − 1)/2 binary SVMs, one for each possible pair of classes. To classify a new point, let it be classified by every binary SVM and pick the class with the highest number of votes (a sketch follows below). This is done automatically by the function svm in e1071.
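A minimal sketch of this one-vs-one voting scheme written out by hand, purely for illustration; the iris data and the linear kernel are arbitrary choices, and in practice svm() handles all of this internally.

library(e1071)

data(iris)   # three classes, so K(K-1)/2 = 3 binary SVMs
classes <- levels(iris$Species)
pairs   <- combn(classes, 2, simplify = FALSE)

# train one binary SVM per pair of classes
models <- lapply(pairs, function(p) {
  sub <- droplevels(iris[iris$Species %in% p, ])
  svm(Species ~ ., data = sub, kernel = "linear")
})

# every binary SVM votes; the class with the most votes wins
votes <- sapply(models, function(m) as.character(predict(m, iris[, 1:4])))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == iris$Species)   # training accuracy of the combined one-vs-one classifier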

SLIDE 46

How to in R: analysis of optdigits data

# SVM with radial kernel and gamma=1/62 and cost=1 (default settings)
> optdigits.svm <- svm(optdigits.train[, -c(1, 40, 65)], optdigits.train[, 65])

# make predictions on test set
> svm.pred <- predict(optdigits.svm, optdigits.test[, -c(1, 40, 65)])
> table(optdigits.test[, 65], svm.pred)
[10 x 10 confusion matrix of true digit (rows) versus predicted digit (svm.pred, columns); nearly all counts lie on the diagonal]

# accuracy on test sample is somewhat better than regularized multinomial logit
# which had "only" 95% accuracy
> sum(diag(table(optdigits.test[, 65], svm.pred))) / nrow(optdigits.test)
[1] 0.967724

SLIDE 47

How to in R: analysis of optdigits data

# tune cost parameter with cross-validation
> optdigits.svm.tune <- tune.svm(optdigits.train[, -c(1, 40, 65)],
+                       optdigits.train[, 65], cost = 1:10)
Warning messages:
1: In svm.default(list(V2 = c(0L, 0L, 6L, 0L, 3L, 1L, 0L, 0L, 0L, 0L, :
  Variable(s) V57 constant. Cannot scale data.

# V57 is almost always zero, let's remove it
> optdigits.svm.tune <- tune.svm(optdigits.train[, -c(1, 40, 57, 65)],
+                       optdigits.train[, 65], cost = 1:10)

# show performance for each value of the cost parameter
> optdigits.svm.tune$performances
   cost      error  dispersion
1     1 0.01490643 0.005227509
2     2 0.01202958 0.005934869
3     3 0.01229136 0.005376797
4     4 0.01176917 0.005119188
5     5 0.01150739 0.004803942
6     6 0.01150739 0.004803942
7     7 0.01176849 0.005115891
8     8 0.01150739 0.004803942
9     9 0.01176849 0.005115891
10   10 0.01176849 0.005115891

SLIDE 48

How to in R: analysis of optdigits data

# cost=5 is the smallest among the parameter values that minimize
# cross-validation error
# We fit the SVM with this parameter value on the whole training set:
> optdigits.svm.tuned <- svm(optdigits.train[, -c(1, 40, 57, 65)],
+                       optdigits.train[, 65], cost = 5)

# Generate predictions on the test set
> svm.tuned.pred <- predict(optdigits.svm.tuned,
+                       optdigits.test[, -c(1, 40, 57, 65)])

# Compute accuracy on test set
> sum(diag(table(optdigits.test[, 65], svm.tuned.pred))) / nrow(optdigits.test)
[1] 0.9727323
