
SLIDE 1

COMP24111: Machine Learning and Optimisation

  • Dr. Tingting Mu

Email: tingting.mu@manchester.ac.uk

Chapter 4: Support Vector Machines

SLIDE 2

Outline

  • Geometry concepts: hyperplane, distance, parallel hyperplane, margin.
  • Basic idea of support vector machine (SVM).
  • Hard-margin SVM
  • Soft-margin SVM
  • Support Vectors
  • Nonlinear classification:

    – Kernel trick
    – Linear basis function model


SLIDE 3

History and Information

  • Vapnik and Lerner (1963) introduced the generalised portrait algorithm. The algorithm implemented by SVMs is a nonlinear generalisation of the generalised portrait algorithm.
  • The support vector machine was first introduced in 1992:
    – Boser et al. A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual Workshop on Computational Learning Theory, 144-152, Pittsburgh, 1992.
  • More on SVM history: http://www.svms.org/history.html
  • Centralised website: http://www.kernel-machines.org
  • Popular textbook:
    – N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, 2000. http://www.support-vector.net
  • Popular libraries: LIBSVM, MATLAB SVM, scikit-learn (machine learning in Python).

SLIDE 4

Hyperplane and Distance

The equation

$$w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b = 0 \;\Leftrightarrow\; \mathbf{w}^T\mathbf{x} + b = 0$$

defines a hyperplane. In 2D space ($w_1 x_1 + w_2 x_2 + b = 0$) it is a straight line; in 3D space ($w_1 x_1 + w_2 x_2 + w_3 x_3 + b = 0$) it is a plane. The weight vector $\mathbf{w}$ gives the direction (normal) of the hyperplane.

[Figure: the hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ drawn in the $(x_1, x_2)$ plane and in 3D space, with the normal vector $\mathbf{w}$.]

SLIDE 5

Hyperplane and Distance

The equation $\mathbf{w}^T\mathbf{x} + b = 0$ defines a hyperplane: a straight line in 2D space, a plane in 3D space.

The signed distance from an arbitrary point $\mathbf{x}$ to the hyperplane is

$$r = \frac{\mathbf{w}^T\mathbf{x} + b}{\sqrt{\sum_{i=1}^{d} w_i^2}} = \frac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|_2}.$$

Whether $r$ is positive or negative depends on which side of the hyperplane $\mathbf{x}$ lies.

SLIDE 6

Hyperplane and Distance

The signed distance from an arbitrary point $\mathbf{x}$ to the hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ is

$$r = \frac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|_2},$$

and its sign depends on which side of the hyperplane $\mathbf{x}$ lies. Setting $\mathbf{x}$ to the origin gives the distance from the origin to the plane, $b / \|\mathbf{w}\|_2$.
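
To make the formula concrete, here is a minimal NumPy sketch of the signed-distance computation; the hyperplane parameters w and b below are illustrative values, not taken from the slides.

```python
import numpy as np

# Signed distance r = (w^T x + b) / ||w||_2 from a point x to the hyperplane w^T x + b = 0.
w = np.array([2.0, 1.0])   # illustrative hyperplane normal
b = -4.0                   # illustrative bias

def signed_distance(x, w, b):
    return (w @ x + b) / np.linalg.norm(w)

print(signed_distance(np.array([3.0, 2.0]), w, b))  # positive: x lies on the side w points towards
print(signed_distance(np.array([0.0, 0.0]), w, b))  # x = origin gives b / ||w||_2
```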

SLIDE 7

Parallel Hyperplanes

  • We focus on two parallel hyperplanes:

$$\begin{cases} \mathbf{w}^T\mathbf{x} + b = 1, \\ \mathbf{w}^T\mathbf{x} + b = -1. \end{cases}$$

  • Geometrically, the distance between these two planes is $2 / \|\mathbf{w}\|_2$; each lies at distance $\rho = 1 / \|\mathbf{w}\|_2$ from the middle hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$.

[Figure: the hyperplanes $\mathbf{w}^T\mathbf{x} + b = -1$, $0$, $+1$ in the $(x_1, x_2)$ plane, with normal vector $\mathbf{w}$ and spacing $\rho$ between neighbouring planes.]

SLIDE 8

Parallel Hyperplanes

  • We focus on two parallel hyperplanes:

$$\begin{cases} \mathbf{w}^T\mathbf{x} + b = 1, \\ \mathbf{w}^T\mathbf{x} + b = -1. \end{cases}$$

  • Geometrically, the distance between these two planes is $2 / \|\mathbf{w}\|_2$.
  • To see why, take a point $\mathbf{z}$ on the upper plane, so $\mathbf{w}^T\mathbf{z} + b = 1$. Its distance to the middle hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ is

$$r = \frac{\mathbf{w}^T\mathbf{z} + b}{\|\mathbf{w}\|_2} = \frac{1}{\|\mathbf{w}\|_2},$$

so $\rho = 1 / \|\mathbf{w}\|_2$ on each side, and the two planes are $2 / \|\mathbf{w}\|_2$ apart.

SLIDE 9

We focus on the binary classification problem in this lecture, and we start from an ideal case: the two classes are linearly separable.

[Figure: two linearly separable clusters of points in the $(x_1, x_2)$ plane.]

SLIDE 10

Separation Margin

  • Given the two parallel hyperplanes below, we separate the two classes of data points by preventing any data point from falling into the margin:

$$\begin{cases} \mathbf{w}^T\mathbf{x} + b \ge 1, & \text{if } y = 1, \\ \mathbf{w}^T\mathbf{x} + b \le -1, & \text{if } y = -1, \end{cases} \qquad\Leftrightarrow\qquad y\,(\mathbf{w}^T\mathbf{x} + b) \ge 1.$$

  • The region bounded by these two hyperplanes is called the separation "margin", whose width is

$$\rho = \frac{2}{\|\mathbf{w}\|_2} = \frac{2}{\sqrt{\mathbf{w}^T\mathbf{w}}}.$$

SLIDE 11

Support Vector Machine (SVM)

  • The aim of SVM is simply to find an optimal hyperplane to separate the two classes of data points with the widest margin.

[Figure: two classes of points in the $(x_1, x_2)$ plane with a candidate separating hyperplane.]

SLIDE 12

Support Vector Machine (SVM)

  • The aim of SVM is simply to find an optimal hyperplane to separate the two classes of data points with the widest margin.

[Figure: the same two classes with another candidate separating hyperplane.]

SLIDE 13

Support Vector Machine (SVM)

  • The aim of SVM is simply to find an optimal hyperplane to separate the two classes of data points with the widest margin.

[Figure: the same two classes with several candidate separating hyperplanes. Which is better?]

SLIDE 14

Support Vector Machine (SVM)

  • The aim of SVM is simply to find an optimal hyperplane to separate the two classes of data points with the widest margin.
  • This can be formulated as a constrained optimisation problem:

$$\min_{\mathbf{w},\,b}\; \frac{1}{2}\,\mathbf{w}^T\mathbf{w} \quad \text{s.t.}\quad y_i\,(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \quad \forall i \in \{1, \dots, N\}.$$

SLIDE 15

Support Vector Machine (SVM)

  • The aim of SVM is simply to find an optimal hyperplane to separate the two classes of data points with the widest margin.
  • This can be formulated as a constrained optimisation problem:

$$\min_{\mathbf{w},\,b}\; \frac{1}{2}\,\mathbf{w}^T\mathbf{w} \quad \text{s.t.}\quad y_i\,(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \quad \forall i \in \{1, \dots, N\}.$$

  • Minimising $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ maximises the margin, whose width is $2 / \sqrt{\mathbf{w}^T\mathbf{w}}$.

SLIDE 16

Support Vector Machine (SVM)

  • The aim of SVM is simply to find an optimal hyperplane to separate the two classes of data points with the widest margin.
  • This results in the following constrained optimisation (a small training sketch follows below):

$$\min_{\mathbf{w},\,b}\; \frac{1}{2}\,\mathbf{w}^T\mathbf{w} \quad \text{s.t.}\quad y_i\,(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \quad \forall i \in \{1, \dots, N\}.$$

  • Minimising $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ maximises the margin (width $2 / \sqrt{\mathbf{w}^T\mathbf{w}}$), while the constraints stop training samples from falling into the margin.
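
As a rough illustration of this formulation, the sketch below fits a linear SVM on a small, linearly separable toy set with scikit-learn (one of the libraries listed earlier); a very large C is used as a stand-in for the hard-margin constraint, and the toy data are made up for the example.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative values).
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 5.0], [4.5, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem (virtually no slack tolerated).
clf = SVC(kernel='linear', C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width 2/||w||_2 =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```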

SLIDE 17

Support Vectors

  • Support vectors: training points that satisfy $y_i\,(\mathbf{w}^T\mathbf{x}_i + b) = 1$.
  • These points are the most difficult to classify and are very important for the location of the optimal hyperplane.

[Figure: the optimal hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ with the upper hyperplane $\mathbf{w}^T\mathbf{x} + b = +1$ and the lower hyperplane $\mathbf{w}^T\mathbf{x} + b = -1$, margin width $2/\|\mathbf{w}\|_2$; the support vectors lie on the two outer hyperplanes.]

SLIDE 18

SVM Training

  • SVM training: the process of solving the following constrained optimisation problem:

$$\min_{\mathbf{w},\,b}\; \frac{1}{2}\,\mathbf{w}^T\mathbf{w} \quad \text{s.t.}\quad y_i\,(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \quad \forall i \in \{1, \dots, N\}.$$

  • The above problem is solved by solving a dual problem (shown on the next slide), whose objective function is

$$L(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j\, \mathbf{x}_i^T \mathbf{x}_j.$$

  • The new variables $\{\lambda_i\}_{i=1}^{N}$ are called Lagrange multipliers. They should be non-negative numbers.
  • A fixed relationship exists between $\mathbf{w}$, $b$ and $\{\lambda_i\}_{i=1}^{N}$.
  • How to derive the dual form can be found in the notes as optional reading material.

SLIDE 19

SVM Training

  • The dual problem is called a quadratic programming (QP) problem in optimisation:

$$\max_{\boldsymbol{\lambda} \in \mathbb{R}^N}\; \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j\, \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.}\quad \begin{cases} \sum_{i=1}^{N} \lambda_i y_i = 0, \\ \lambda_i \ge 0. \end{cases}$$

  • There are many QP solvers available: https://en.wikipedia.org/wiki/Quadratic_programming
  • The SVM we have learned so far is called the hard-margin SVM.
  • One way to solve the QP problem for SVM can be found in the notes as optional reading material; a generic-solver sketch is given below.
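
The code below is only a sketch of what a solver does here: it minimises the negative dual objective with SciPy's general-purpose SLSQP method under the constraints $\sum_i \lambda_i y_i = 0$ and $\lambda_i \ge 0$, then recovers $\mathbf{w}$ and $b$. Dedicated QP solvers are what one would normally use, and the toy data are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative values).
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 5.0], [4.5, 4.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
N = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j

def neg_dual(lam):
    # Maximising the dual objective = minimising its negative.
    return 0.5 * lam @ Q @ lam - lam.sum()

res = minimize(neg_dual, np.zeros(N), method='SLSQP',
               bounds=[(0, None)] * N,                                  # lambda_i >= 0
               constraints={'type': 'eq', 'fun': lambda lam: lam @ y})  # sum_i lambda_i y_i = 0
lam = res.x
w = (lam * y) @ X                   # w = sum_i lambda_i y_i x_i
sv = np.argmax(lam)                 # any point with lambda_i > 0 is a support vector
b = y[sv] - w @ X[sv]               # from y_s (w^T x_s + b) = 1
print("lambda =", np.round(lam, 3), "\nw =", w, "b =", b)
```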

SLIDE 20

So far, we have worked on simple cases where the data patterns are separable. What if the data points are not linearly separable? In practice, no dataset is ideally linearly separable, which means some data points are bound to be misclassified by a linear hyperplane.

[Figure: left, separable data patterns; right, non-separable data patterns in the $(x_1, x_2)$ plane.]

SLIDE 21

Non-separable Patterns

  • We use slack variables $\xi_i \ge 0$ ($i = 1, 2, \dots, N$), each of which measures the deviation of the i-th point from the ideal situation, to relax the previous constraints to:

$$\begin{cases} \mathbf{w}^T\mathbf{x}_i + b \ge 1 - \xi_i, & \text{if } y_i = 1, \\ \mathbf{w}^T\mathbf{x}_i + b \le -(1 - \xi_i), & \text{if } y_i = -1, \end{cases} \qquad\Leftrightarrow\qquad y_i\,(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i.$$

  • We no longer force all the points to stay outside the margin:
    – a point within the region of separation but still on the correct side has $0 < \xi_i \le 1$;
    – a point on the wrong side of the decision boundary has $\xi_i > 1$.
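
To see what the slacks measure, here is a small sketch that evaluates $\xi_i = \max(0,\, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$ for a fixed hyperplane; the values of w, b and the points are made up for illustration.

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -5.0   # illustrative hyperplane
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.4, 2.4], [2.8, 2.8], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1, 1])

xi = np.maximum(0.0, 1 - y * (X @ w + b))
print(xi)   # [0.  0.  1.2 0.4 0. ]
# xi = 0      : outside the margin and correctly classified
# 0 < xi <= 1 : inside the margin but on the correct side
# xi > 1      : on the wrong side of the decision boundary
```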

SLIDE 22

Modified Optimisation

  • In addition to maximising the margin as before, we need to keep all slacks $\xi_i$ as small as possible to minimise the classification errors. The modified SVM optimisation problem becomes:

$$\min_{(\mathbf{w},\,b)\in\mathbb{R}^{d+1},\;\{\xi_i\}_{i=1}^{N}}\; \frac{1}{2}\,\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\quad y_i\,(\mathbf{w}^T\mathbf{x}_i + b) \ge 1-\xi_i,\;\; \xi_i \ge 0, \quad \forall i \in \{1,\dots,N\}.$$

  • $C \ge 0$ is a user-defined parameter which controls the regularisation: the trade-off between model complexity and non-separable patterns.
  • The above constrained optimisation problem can again be converted to a QP problem, with dual problem

$$\max_{\boldsymbol{\lambda} \in \mathbb{R}^N}\; \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j\, \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.}\quad \begin{cases} \sum_{i=1}^{N} \lambda_i y_i = 0, \\ 0 \le \lambda_i \le C. \end{cases}$$

  • This is the soft-margin SVM.
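
A minimal scikit-learn sketch of the soft-margin trade-off, assuming overlapping synthetic data: a smaller C tolerates more slack, which typically widens the margin and increases the number of support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clusters (synthetic, for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0.0, 0.0], size=(30, 2)),
               rng.normal(loc=[2.0, 2.0], size=(30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    width = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width {width:.2f}, {len(clf.support_)} support vectors")
```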

SLIDE 23

Support Vectors

  • Support vectors: training points that satisfy $y_i\,(\mathbf{w}^T\mathbf{x}_i + b) = 1 - \xi_i$ ($\xi_i \ge 0$).
  • These points either lie on one of the two parallel hyperplanes (1), fall within the margin (2), or sit on the wrong side of the separating hyperplane (3).
  • Support vectors represent points that are difficult to classify and are important for deciding the location of the separating hyperplane.

[Figure: the three kinds of support vectors, labelled (1), (2) and (3), around the separating hyperplane in the $(x_1, x_2)$ plane.]

SLIDE 24

So far, we can handle linear cases. What if the data points follow a non-linear pattern?

[Figure: left, linear data patterns; right, non-linear data patterns in the $(x_1, x_2)$ plane.]

SLIDE 25

Handle Nonlinear Data Patterns

  • Each data point is mapped to a new feature space by a function $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^b$:

$$\mathbf{x} = [x_1, x_2, \dots, x_d]^T \;\longmapsto\; \phi(\mathbf{x}) = [\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \dots, \phi_b(\mathbf{x})]^T.$$

  • We wish that the patterns in this new space are linearly separable.
  • Example: https://www.youtube.com/watch?v=9NrALgHFwTo

SLIDE 26

Kernel Trick

  • In the original space, the inner product between two points $\mathbf{x}_i$ and $\mathbf{x}_j$ is

$$\mathbf{x}_i^T \mathbf{x}_j = \sum_{k=1}^{d} x_{ik}\, x_{jk}.$$

  • In the new feature space, the inner product between the two points is

$$\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = \sum_{k=1}^{b} \phi_k(\mathbf{x}_i)\, \phi_k(\mathbf{x}_j).$$

  • The kernel trick avoids defining the mapping function $\phi(\mathbf{x})$ directly; instead, it defines the inner product in the new space using a kernel function:

$$\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j).$$
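
A tiny numerical check of the idea, using the homogeneous degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y})^2$ (a simpler variant of the polynomial kernel on the next slide): the explicit map never has to be formed, yet both routes give the same inner product.

```python
import numpy as np

# Explicit feature map for K(x, y) = (x^T y)^2 in 2D: phi(x) = [x1^2, sqrt(2) x1 x2, x2^2].
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z))   # inner product computed in the new feature space
print((x @ z) ** 2)      # kernel function evaluated in the original space: same value
```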

SLIDE 27

Kernel Trick

  • Examples of kernel functions:
    – Linear: $K(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{y}$. No parameter.
    – Polynomial: $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y} + 1)^p$, where $p$ is a user-defined parameter.
    – Gaussian, also called radial basis function (RBF): $K(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\|\mathbf{x} - \mathbf{y}\|_2^2 \,/\, 2\sigma^2\right)$, where $\sigma$ is the user-defined width parameter.
    – Hyperbolic tangent: $K(\mathbf{x}, \mathbf{y}) = \tanh(\alpha\, \mathbf{x}^T\mathbf{y} + \beta)$, where $\alpha$ and $\beta$ are user-defined parameters.
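
For reference, here are minimal NumPy versions of the kernels in this list; the default parameter values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, p=3):
    return (x @ y + 1) ** p

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def tanh_kernel(x, y, alpha=1.0, beta=0.0):
    return np.tanh(alpha * (x @ y) + beta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y),
      gaussian_kernel(x, y), tanh_kernel(x, y))
```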

SLIDE 28

Kernel SVM

  • The SVM dual problem in the original space:

$$\max_{\boldsymbol{\lambda} \in \mathbb{R}^N}\; \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j\, \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.}\quad \begin{cases} \sum_{i=1}^{N} \lambda_i y_i = 0, \\ 0 \le \lambda_i \le C. \end{cases}$$

  • Kernel SVM with the modified dual problem: replace $\mathbf{x}_i^T \mathbf{x}_j$ with $\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$:

$$\max_{\boldsymbol{\lambda} \in \mathbb{R}^N}\; \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.}\quad \begin{cases} \sum_{i=1}^{N} \lambda_i y_i = 0, \\ 0 \le \lambda_i \le C. \end{cases}$$

SLIDE 29

SVM Decision Function

  • After solving the QP problem for SVM, we obtain the optimal values of the multipliers $\{\lambda_i^*\}_{i=1}^{N}$.
  • Linear SVM decision function:

$$f(\mathbf{x}) = \sum_{i=1}^{N} \lambda_i^* y_i\, \mathbf{x}_i^T \mathbf{x} + b.$$

  • Nonlinear SVM decision function:

$$f(\mathbf{x}) = \sum_{i=1}^{N} \lambda_i^* y_i\, K(\mathbf{x}_i, \mathbf{x}) + b.$$

For example, with the polynomial kernel $f(\mathbf{x}) = \sum_{i=1}^{N} \lambda_i^* y_i\, (\mathbf{x}^T\mathbf{x}_i + 1)^p + b$, and with the Gaussian kernel $f(\mathbf{x}) = \sum_{i=1}^{N} \lambda_i^* y_i \exp\!\left(-\|\mathbf{x} - \mathbf{x}_i\|_2^2 \,/\, 2\sigma^2\right) + b$.

  • The bias parameter $b$ can be estimated from the support vectors.
  • Kernel SVM demos: https://www.youtube.com/watch?v=3liCbRZPrZA and https://www.youtube.com/watch?v=ndNE8he7Nnk
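
The sketch below reproduces the nonlinear decision function from a fitted scikit-learn SVC: its dual_coef_ attribute stores the products $\lambda_i^* y_i$ over the support vectors, so summing the kernel evaluations and adding the intercept recovers decision_function. The toy data and the gamma value are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)    # a simple nonlinear labelling

gamma = 0.5                                    # scikit-learn's RBF kernel: exp(-gamma * ||x - x'||^2)
clf = SVC(kernel='rbf', gamma=gamma, C=1.0).fit(X, y)

x_new = np.array([0.3, -0.7])
k = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
f_manual = clf.dual_coef_[0] @ k + clf.intercept_[0]    # sum_i lambda_i* y_i K(x_i, x) + b
f_sklearn = clf.decision_function(x_new.reshape(1, -1))[0]
print(f_manual, f_sklearn)                              # the two values agree
```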

SLIDE 30

Iris Classification Example

Soft-margin SVM with Gaussian kernel, σ = 1 and C = 1 (left) versus σ = 0.1 and C = 1 (right).

[Figure: two panels plotting petal width (cm) against sepal length (cm), separating Versicolour from Setosa and Virginica; the support vectors are marked in each panel.]
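
A sketch that roughly mirrors this experiment with scikit-learn, assuming the two plotted features are sepal length and petal width and the task is Versicolour versus the other two species; the slide's Gaussian kernel $\exp(-\|\mathbf{x}-\mathbf{y}\|_2^2/2\sigma^2)$ corresponds to scikit-learn's RBF kernel with gamma $= 1/(2\sigma^2)$.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[:, [0, 3]]                # sepal length (cm), petal width (cm)
y = np.where(iris.target == 1, 1, -1)   # Versicolour vs {Setosa, Virginica}

for sigma in (1.0, 0.1):
    clf = SVC(kernel='rbf', gamma=1.0 / (2 * sigma ** 2), C=1.0).fit(X, y)
    print(f"sigma={sigma}: {len(clf.support_)} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```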

SLIDE 31

Iris Classification Example

Soft-margin SVM with polynomial kernel, p = 1 and C = 1 (left) versus p = 1 and C = 0.01 (right). The smaller C gives more support vectors.

[Figure: two panels plotting petal width (cm) against sepal length (cm), separating Versicolour from Setosa and Virginica; the support vectors are marked in each panel.]

SLIDE 32

Linear Basis Function Model

  • An alternative way to handle nonlinear data patterns is to directly formulate the nonlinear functions $\{\phi_i(\mathbf{x})\}_{i=1}^{D}$.
  • These nonlinear functions are called basis functions; each can be viewed as a feature extractor.
  • Apply a linear model to the mapped features (see the sketch below):

$$\hat{y} = w_0 + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}) + \dots + w_D \phi_D(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}),$$

where $\mathbf{w} = [w_0, w_1, w_2, \dots, w_D]^T$ and $\boldsymbol{\phi}(\mathbf{x}) = [1, \phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \dots, \phi_D(\mathbf{x})]^T$.

  • A basis function example (Gaussian):

$$\phi_i(\mathbf{x}) = \exp\!\left(-\frac{\sum_{j=1}^{d} (x_j - \mu_{ij})^2}{2\sigma_i^2}\right) = \exp\!\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_i\|_2^2}{2\sigma_i^2}\right),$$

where $\boldsymbol{\mu}_i$ and $\sigma_i$ are basis function parameters.

  • Another basis function example: for a single input variable, $\boldsymbol{\phi}(x) = [1, x, x^2, \dots, x^D]^T$. This is known as polynomial regression; the case $D = 1$ becomes linear regression.
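
Here is a minimal sketch of $\hat{y} = \mathbf{w}^T \boldsymbol{\phi}(x)$ fitted by ordinary least squares, using Gaussian basis functions with arbitrarily chosen centres $\mu_i$ and width $\sigma$; all data and parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)   # noisy target (illustrative)

centres = np.linspace(0.0, 10.0, 8)             # mu_i, chosen arbitrarily
sigma = 1.0                                      # shared width sigma_i

def design_matrix(x):
    # phi(x) = [1, phi_1(x), ..., phi_D(x)]^T, one Gaussian basis function per centre
    cols = [np.ones_like(x)]
    cols += [np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in centres]
    return np.column_stack(cols)

w, *_ = np.linalg.lstsq(design_matrix(x), y, rcond=None)   # least squares fit of w
y_hat = design_matrix(x) @ w                               # y_hat = w^T phi(x) for every x
print(np.sqrt(np.mean((y - y_hat) ** 2)))                  # training RMS error
```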

SLIDE 33

Iris Classification Example

Incorporate basis functions into a linear least squares model, using

$$\boldsymbol{\phi}(\mathbf{x}) = [1,\; x_1,\; x_2,\; x_1^2,\; x_2^2,\; x_1 x_2]^T.$$

[Figure: training samples and the resulting separation boundary.]
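
A sketch of this construction, assuming the two inputs are sepal length and petal width and that Versicolour is separated from the other two species with labels ±1; the quadratic feature map is exactly the $\boldsymbol{\phi}(\mathbf{x})$ above, and the weights come from linear least squares.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
x1, x2 = iris.data[:, 0], iris.data[:, 3]    # sepal length, petal width (assumed features)
t = np.where(iris.target == 1, 1.0, -1.0)    # Versicolour vs the rest (assumed grouping)

# phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]^T for every training sample
Phi = np.column_stack([np.ones_like(x1), x1, x2, x1 ** 2, x2 ** 2, x1 * x2])
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # linear least squares on the mapped features
pred = np.sign(Phi @ w)                      # separation boundary: w^T phi(x) = 0
print("training accuracy:", (pred == t).mean())
```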

SLIDE 34

A Regression Example

Curve fitting: construct a curve that has the best fit to a series of data points. Method: incorporate basis functions into a linear least squares model,

$$\hat{y} = \mathbf{w}^T \boldsymbol{\phi}(x), \qquad \boldsymbol{\phi}(x) = [1, x, x^2, \dots, x^D]^T.$$

[Figure: two data sets (data 1 and data 2); for each, the training samples and the fitted regression curve are shown for D = 1, D = 3 and D = 7.]
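
A brief sketch of the curve-fitting recipe with the polynomial basis for D = 1, 3 and 7; the synthetic data below stand in for the slide's two data sets.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 12)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)    # synthetic training samples

def poly_design(x, D):
    return np.column_stack([x ** k for k in range(D + 1)])   # phi(x) = [1, x, ..., x^D]^T

for D in (1, 3, 7):
    w, *_ = np.linalg.lstsq(poly_design(x, D), y, rcond=None)
    resid = y - poly_design(x, D) @ w
    print(f"D={D}: training RMS error {np.sqrt(np.mean(resid ** 2)):.3f}")
# The training error always drops as D grows; the next slide shows that a large D
# can still fit new (testing) points poorly.
```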

SLIDE 35

A Regression Example

Testing the fitted curve with new points: the same fits, $\hat{y} = \mathbf{w}^T \boldsymbol{\phi}(x)$ with $\boldsymbol{\phi}(x) = [1, x, x^2, \dots, x^D]^T$, are now shown together with the ground truth and the testing samples.

[Figure: for data 1 and data 2, the training samples, testing samples, ground truth and regression curves for D = 1, D = 3 and D = 7.]

SLIDE 36

Summary

  • In this chapter, we have learned:
    – Support vector machines for classification.
    – Nonlinear classification approaches.
  • In the next chapter, we will talk about neural networks.