
slide-1
SLIDE 1

CZECH TECHNICAL UNIVERSITY IN PRAGUE

Faculty of Electrical Engineering Department of Cybernetics

Optimal separating hyperplane. Basis expansion. Kernel trick. Support vector machine.

Petr Pošík

© 2015, Artificial Intelligence

slide-2
SLIDE 2

Rehearsal


slide-3
SLIDE 3

Linear discrimination function


Binary classification of objects x (classification into 2 classes, dichotomy):

■ For 2 classes, 1 discrimination function is enough.
■ Decision rule:
$$f(x^{(i)}) > 0 \implies \hat{y}^{(i)} = +1, \qquad f(x^{(i)}) < 0 \implies \hat{y}^{(i)} = -1,$$
i.e. $\hat{y}^{(i)} = \operatorname{sign}\bigl(f(x^{(i)})\bigr)$.

Learning of the linear discrimination function by the perceptron algorithm:

■ Optimization of the number of misclassified training examples,
$$J(w, T) = \sum_{i=1}^{|T|} I\bigl(\hat{y}^{(i)} \neq y^{(i)}\bigr).$$
■ The weight vector is a weighted sum of the training points $x^{(i)}$.
■ The perceptron finds any separating hyperplane, if one exists.
■ Among the infinite number of separating hyperplanes, which one is the best?
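To make the rehearsal concrete, here is a minimal sketch of the perceptron algorithm in Python/NumPy; the function name and the learning-rate-free update are one common textbook variant, not code from the original slides:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Minimal perceptron: returns (w, w0) of a separating hyperplane, if found.

    X: (N, D) array of examples; y: (N,) array of labels in {-1, +1}.
    """
    N, D = X.shape
    w, w0 = np.zeros(D), 0.0
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            # Misclassified example: y_i * f(x_i) <= 0
            if y[i] * (X[i] @ w + w0) <= 0:
                # Add y_i * x_i: w stays a weighted sum of training points.
                w += y[i] * X[i]
                w0 += y[i]
                errors += 1
        if errors == 0:          # all points classified correctly
            break
    return w, w0                 # may not separate non-separable data
```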

slide-4
SLIDE 4

Optimal separating hyperplane


slide-5
SLIDE 5

Optimal separating hyperplane


Margin (cz: odstup):

■ "The width of the band in which the decision boundary can move (in the direction of its normal vector) without touching any data point."

Maximum margin linear classifier:

■ Decision boundary: $\{x : xw^T + w_0 = 0\}$
■ Plus 1 level: $\{x : xw^T + w_0 = 1\}$
■ Minus 1 level: $\{x : xw^T + w_0 = -1\}$

Support vectors:

■ Data points x lying at the plus 1 level or the minus 1 level.
■ Only these points influence the decision boundary!

Why would we want to maximize the margin?

■ Intuitively, it is safe.
■ If we make a small error in estimating the boundary, the classification will likely stay correct.
■ The model is invariant with respect to changes of the training set, except changes of the support vectors.
■ There are sound theoretical results (based on the VC dimension) that having a maximum-margin classifier is good.
■ Maximum margin works well in practice.

slide-6
SLIDE 6

Margin size


How do we compute the margin M, given $w = (w_1, \ldots, w_D)$ and $w_0$?

■ Choose two points $x^+$ and $x^-$, lying at the plus 1 level and the minus 1 level, respectively.
■ Compute the margin M as their distance.

We know that:
$$x^+ w^T + w_0 = 1, \qquad x^- w^T + w_0 = -1, \qquad x^- + \lambda w = x^+.$$

And we can derive:
$$(x^+ - x^-)\,w^T = 2$$
$$(x^- + \lambda w - x^-)\,w^T = 2$$
$$\lambda\, w w^T = 2$$
$$\lambda = \frac{2}{w w^T} = \frac{2}{\|w\|^2}$$

Thus the margin size is
$$M = \|x^+ - x^-\| = \|\lambda w\| = \lambda \|w\| = \frac{2}{\|w\|^2}\,\|w\| = \frac{2}{\|w\|}.$$
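As a quick numerical check of $M = 2/\|w\|$, a one-function sketch (the example weight vector is made up for illustration):

```python
import numpy as np

def margin(w):
    """Margin of the canonical hyperplane x @ w + w0 = 0 with +1/-1 levels."""
    return 2.0 / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # illustrative weight vector, ||w|| = 5
print(margin(w))           # 2 / 5 = 0.4
```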


slide-10
SLIDE 10

Optimal separating hyperplane learning


We want to maximize the margin $M = \frac{2}{\|w\|}$ subject to constraints ensuring correct classification of the training set T. This optimization problem can be formulated as a quadratic programming (QP) task.

■ Primal QP task: minimize $w w^T$ with respect to $w_1, \ldots, w_D$, subject to
$$y^{(i)}\bigl(x^{(i)} w^T + w_0\bigr) \geq 1.$$

■ Dual QP task: maximize
$$\sum_{i=1}^{|T|} \alpha_i - \frac{1}{2} \sum_{i=1}^{|T|} \sum_{j=1}^{|T|} \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)} x^{(j)T}$$
with respect to $\alpha_1, \ldots, \alpha_{|T|}$, subject to $\alpha_i \geq 0$ and $\sum_{i=1}^{|T|} \alpha_i y^{(i)} = 0$.

■ From the solution of the dual task, we can compute the solution of the primal task:
$$w = \sum_{i=1}^{|T|} \alpha_i y^{(i)} x^{(i)}, \qquad w_0 = y^{(k)} - x^{(k)} w^T,$$
where $(x^{(k)}, y^{(k)})$ is any support vector, i.e. $\alpha_k > 0$.
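In practice the dual QP is handed to a solver. A minimal sketch using scikit-learn's SVC with a linear kernel (a large C approximates the hard margin); SVC exposes the dual solution, so w and w0 can be recovered exactly as above. The toy data are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative).
X = np.array([[0., 0.], [1., 1.], [2., 0.], [3., 3.], [4., 2.], [3., 5.]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors,
# so w = sum_i alpha_i y_i x_i reduces to one matrix product:
w = clf.dual_coef_ @ clf.support_vectors_
w0 = clf.intercept_

print(w, w0, 2 / np.linalg.norm(w))  # weights, bias, margin size 2/||w||
```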

slide-11
SLIDE 11

Optimal separating hyperplane: concluding remarks


The importance of the dual formulation:

■ The QP task in the dual formulation is easier for QP solvers to handle than the primal formulation.
■ New, unseen examples can be classified using the function
$$f(x, w, w_0) = \operatorname{sign}(x w^T + w_0) = \operatorname{sign}\left(\sum_{i=1}^{|T|} \alpha_i y^{(i)} x^{(i)} x^T + w_0\right),$$
i.e. the discrimination function contains the examples x only in the form of dot products (which will be useful later).
■ The examples with $\alpha_i > 0$ are support vectors, thus the sums may be carried out only over the support vectors.
■ The dual formulation allows for other tricks, which you will learn later.

What if the data are not linearly separable?

■ There is a generalization of the QP task formulation for this case (soft margin).
■ The primal task has double the number of constraints; the task is more complex.
■ The results for the QP task with soft margin are of the same type as before.
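As a sketch of the second point, the dual-form decision rule in Python, summing only over the support vectors; the names alphas, sv_X, sv_y, w0 are illustrative, not from the slides:

```python
import numpy as np

def predict(x, alphas, sv_X, sv_y, w0):
    """Dual-form decision rule: sign(sum_i alpha_i y_i <x_i, x> + w0).

    alphas, sv_y: (S,) arrays for the S support vectors; sv_X: (S, D).
    The new example x enters only through the dot products sv_X @ x.
    """
    return np.sign(np.sum(alphas * sv_y * (sv_X @ x)) + w0)
```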

slide-12
SLIDE 12

Optimal separating hyperplane: demo


[Figure: demo of the optimal separating hyperplane on 2-D data.]

slide-13
SLIDE 13

When a linear decision boundary is not enough...



slide-16
SLIDE 16

Basis expansion


a.k.a. feature space straightening.

Why?

■ A linear decision boundary (or a linear regression model) may not be flexible enough to perform precise classification (regression).
■ The algorithms for fitting linear models can be used to fit non-linear models!

How?

■ Define a new multidimensional image space F.
■ The examples are then transformed into this image space (new features are derived):
$$x \to z = \Phi(x), \qquad x = (x_1, x_2, \ldots, x_D) \to z = (\Phi_1(x), \Phi_2(x), \ldots, \Phi_G(x)),$$
where usually $D \ll G$.
■ In the image space, a linear model is trained. However, this is equivalent to training a non-linear model in the original space:
$$f_G(z) = w_1 z_1 + w_2 z_2 + \ldots + w_G z_G + w_0$$
$$f(x) = f_G(\Phi(x)) = w_1 \Phi_1(x) + w_2 \Phi_2(x) + \ldots + w_G \Phi_G(x) + w_0$$
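A minimal sketch of this recipe in Python: transform 1-D examples with the basis $\Phi(x) = (x, x^2)$ and fit an off-the-shelf linear classifier in the image space. The data and the choice of basis are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def phi(x):
    """Basis expansion Phi: map scalar inputs x to the image space z = (x, x^2)."""
    return np.column_stack([x, x ** 2])

# 1-D data that no single threshold on x can separate (class +1 in the middle).
x = np.array([-3., -2., -1., 0., 1., 2., 3.])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

Z = phi(x)                                   # transform into the image space
model = LogisticRegression().fit(Z, y)       # linear model in the image space
print(model.predict(phi(np.array([0.5, 2.5]))))  # non-linear in the original x
```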

slide-17
SLIDE 17

Two coordinate systems


Feature space: $x = (x_1, x_2, \ldots, x_D)$
Image space: $z = (z_1, z_2, \ldots, z_G)$, e.g.
$$z_1 = \log x_1, \qquad z_2 = x_1^2 x_3, \qquad z_3 = e^{x_2}, \qquad \ldots$$

Non-linear model in the feature space:
$$f(x) = w_1 \log x_1 + w_2\, x_1^2 x_3 + w_3\, e^{x_2} + \ldots + w_0$$

Linear model in the image space:
$$f_G(z) = w_1 z_1 + w_2 z_2 + w_3 z_3 + \ldots + w_0$$

Transformation into a high-dimensional image space → training a linear model in the image space → non-linear model in the feature space.
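The slide's example mapping, written out as a short sketch (weights and input are purely illustrative):

```python
import numpy as np

def phi(x):
    """Map x = (x1, x2, x3, ...) to z = (log x1, x1^2 * x3, exp(x2))."""
    x1, x2, x3 = x[0], x[1], x[2]
    return np.array([np.log(x1), x1 ** 2 * x3, np.exp(x2)])

# f(x) = w . phi(x) + w0 is linear in z but non-linear in x.
w, w0 = np.array([1.0, -0.5, 0.2]), 0.1   # illustrative weights
x = np.array([2.0, 0.5, 3.0])
print(w @ phi(x) + w0)
```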


slide-21
SLIDE 21

Two coordinate systems: graphically


[Figure: left, the feature space (axis x) with the non-linear model; right, the image space (axes x, x²) with the linear model. Transformation into a high-dimensional image space, training a linear model in the image space, non-linear model in the feature space.]

slide-22
SLIDE 22

Basis expansion: remarks


Advantages:

■ A universal, generally usable method.

Disadvantages:

■ We must define which new features shall form the high-dimensional space F.
■ The examples must actually be transformed into the high-dimensional space F.

For certain types of algorithms, there is a method to perform the basis expansion without actually carrying out the mapping!

slide-23
SLIDE 23

Support vector machine


slide-26
SLIDE 26

Optimal separating hyperplane combined with the basis expansion


To reiterate: when using the optimal separating hyperplane, the examples x occur only in the optimization criterion
$$\sum_{i=1}^{|T|} \alpha_i - \frac{1}{2} \sum_{i=1}^{|T|} \sum_{j=1}^{|T|} \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)} x^{(j)T}$$
and in the decision rule
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{|T|} \alpha_i y^{(i)} x^{(i)} x^T + w_0\right).$$

Application of the basis expansion changes the optimization criterion to
$$\sum_{i=1}^{|T|} \alpha_i - \frac{1}{2} \sum_{i=1}^{|T|} \sum_{j=1}^{|T|} \alpha_i \alpha_j y^{(i)} y^{(j)} \Phi(x^{(i)}) \Phi(x^{(j)})^T$$
and the decision rule to
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{|T|} \alpha_i y^{(i)} \Phi(x^{(i)}) \Phi(x)^T + w_0\right).$$

What if we use a scalar function $K(x^{(i)}, x^{(j)})$ instead of the dot product in the image space? The optimization criterion becomes
$$\sum_{i=1}^{|T|} \alpha_i - \frac{1}{2} \sum_{i=1}^{|T|} \sum_{j=1}^{|T|} \alpha_i \alpha_j y^{(i)} y^{(j)} K(x^{(i)}, x^{(j)})$$
and the discrimination function
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{|T|} \alpha_i y^{(i)} K(x^{(i)}, x) + w_0\right).$$
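The substitution is mechanical in code: the dual-form predictor from before, with the dot product swapped for a kernel function K (all names are illustrative):

```python
import numpy as np

def predict_kernel(x, alphas, sv_X, sv_y, w0, K):
    """Kernelized decision rule: sign(sum_i alpha_i y_i K(x_i, x) + w0)."""
    # K(x_i, x) replaces the dot product x_i @ x -- no explicit Phi is needed.
    k = np.array([K(xi, x) for xi in sv_X])
    return np.sign(np.sum(alphas * sv_y * k) + w0)
```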

slide-29
SLIDE 29

Kernel trick


There are functions of two vector arguments K(a, b) which provide values equal to the dot product $\Phi(a)\Phi(b)^T$ of the images of the vectors a and b in a certain high-dimensional image space. Such functions are called kernel functions or kernels.

Kernel trick: Let's have a linear algorithm in which the examples x occur only in dot products.

■ Such an algorithm can be made non-linear by replacing the dot products of examples x with kernels.
■ The result is the same as if the algorithm was trained in some high-dimensional image space with coordinates given by many non-linear basis functions.
■ Thanks to kernels, there is no need to actually perform the mapping; the algorithm is much more efficient.

Frequently used kernels:

■ Polynomial: $K(a, b) = (a b^T + 1)^d$, where d is the degree of the polynomial.
■ Gaussian (RBF): $K(a, b) = \exp\left(-\frac{\|a - b\|^2}{\sigma^2}\right)$, where $\sigma^2$ is the "width" of the Gaussian.
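The two kernels written out as a sketch in Python (d and sigma2 correspond to the hyperparameters above):

```python
import numpy as np

def poly_kernel(a, b, d=3):
    """Polynomial kernel: (a . b + 1)^d."""
    return (a @ b + 1) ** d

def rbf_kernel(a, b, sigma2=1.0):
    """Gaussian (RBF) kernel: exp(-||a - b||^2 / sigma^2)."""
    return np.exp(-np.sum((a - b) ** 2) / sigma2)
```

Either function can be passed as K to the predict_kernel sketch shown earlier.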

slide-31
SLIDE 31

Support vector machine


Support vector machine (SVM) = optimal separating hyperplane + kernel trick
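Putting it together, a minimal sketch using scikit-learn's SVC, which implements this combination (a soft-margin optimal separating hyperplane plus kernels); the data are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Illustrative data: class +1 inside a circle, class -1 outside.
X = rng.uniform(-3, 3, size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 2.0, 1, -1)

# RBF kernel; in scikit-learn, gamma plays the role of 1/sigma^2,
# and C controls the soft margin.
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

print(clf.n_support_)                         # support vector counts per class
print(clf.predict([[0.0, 0.0], [3.0, 3.0]]))  # inside -> 1, outside -> -1
```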

slide-32
SLIDE 32

Demo: SVM with linear kernel


[Figure: SVM with a linear kernel on several 2-D demo datasets.]

slide-33
SLIDE 33

Demo: SVM with Gaussian (RBF) kernel


[Figure: SVM with a Gaussian (RBF) kernel on the same 2-D demo datasets.]