COMP24111: Machine Learning and Optimisation
Chapter 4: Support Vector Machines
Dr. Tingting Mu
Email: tingting.mu@manchester.ac.uk

Outline: geometry concepts (hyperplane, distance, parallel hyperplanes, margin); basic idea of support vector machines.
– Boser et al. A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual Workshop on Computational Learning Theory, pp. 144-152, Pittsburgh, 1992.
– N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, 2000. http://www.support-vector.net
Hyperplane geometry (illustrated in 3D space):
A hyperplane is defined by $\mathbf{w}^T\mathbf{x} + b = 0$, where $\mathbf{w}$ is the hyperplane direction (the normal vector), with norm $\|\mathbf{w}\|_2 = \sqrt{\sum_{i=1}^{d} w_i^2}$.
Distance from the origin to the plane: $|b| / \|\mathbf{w}\|_2$.
$r = \frac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|_2}$: distance from an arbitrary point $\mathbf{x}$ to the plane. Whether $r$ is positive or negative depends on which side of the hyperplane $\mathbf{x}$ lies.
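A minimal numpy sketch of these distance formulas; the plane coefficients and the test point below are made-up values for illustration:

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 in 3D: w = (1, 2, 2), b = -3
w = np.array([1.0, 2.0, 2.0])
b = -3.0

norm_w = np.linalg.norm(w)          # ||w||_2 = sqrt(sum_i w_i^2) = 3
dist_origin = abs(b) / norm_w       # distance from the origin to the plane

x = np.array([2.0, 1.0, 3.0])       # an arbitrary point
r = (w @ x + b) / norm_w            # signed distance; the sign tells the side

print(norm_w, dist_origin, r)       # 3.0, 1.0, (2 + 2 + 6 - 3) / 3 = 7/3
```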
An equivalent expression: scale $\mathbf{w}$ and $b$ so that the closest points satisfy $\mathbf{w}^T\mathbf{x} + b = \pm 1$. The two parallel hyperplanes $\mathbf{w}^T\mathbf{x} + b = +1$ and $\mathbf{w}^T\mathbf{x} + b = -1$ lie on either side of the separating hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$, and the margin between them is $\rho = 2 / \|\mathbf{w}\|_2$.
Margin maximisation: the optimal hyperplane maximises the margin $\rho = 2 / \|\mathbf{w}\|_2$, which is equivalent to minimising $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ over $\mathbf{w}$ and $b$.
x1 x2 2 / ||w||2 Optimal hyperplane wTx+b= 0 Upper hyperplane wTx+b= +1 Lower hyperplane wTx+b= -1 w Support vectors Support vectors
17
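As a hedged illustration, one can fit a (nearly) hard-margin linear SVM with scikit-learn and read the margin $2 / \|\mathbf{w}\|_2$ off the learned coefficients; the toy data here are invented:

```python
import numpy as np
from sklearn.svm import SVC

# Invented linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                    # learned hyperplane direction
margin = 2.0 / np.linalg.norm(w)    # margin between the +1 and -1 hyperplanes
print(w, clf.intercept_[0], margin)
```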
Hard-margin SVM, primal problem:
$$\min_{\mathbf{w},b} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{s.t.} \quad y_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) \geq 1, \; \forall i \in \{1,\dots,N\}.$$
Dual problem:
$$\max_{\boldsymbol{\lambda}\in\Re^N} \; \sum_{i=1}^{N}\lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \quad \text{s.t.} \quad \sum_{i=1}^{N}\lambda_i y_i = 0, \; \lambda_i \geq 0, \; \forall i \in \{1,\dots,N\}.$$
How to derive the dual form can be found in the notes as optional reading materials.
The SVM we have learned so far is called the hard-margin SVM.
One way to solve the QP problem for SVM can be found in the notes as optional reading materials.
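One hedged way to solve this dual in code is to hand it to a generic constrained optimiser. The sketch below uses scipy.optimize.minimize on the negated objective with invented toy data; it is not a production QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Invented linearly separable toy data
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

K = X @ X.T                               # Gram matrix of inner products x_i^T x_j
Q = (y[:, None] * y[None, :]) * K

def neg_dual(lam):
    # Negated dual objective: minimise -(sum_i lam_i - 0.5 lam^T Q lam)
    return -(lam.sum() - 0.5 * lam @ Q @ lam)

cons = ({"type": "eq", "fun": lambda lam: lam @ y},)  # sum_i lam_i y_i = 0
bounds = [(0, None)] * N                              # lam_i >= 0 (hard margin)
res = minimize(neg_dual, np.zeros(N), bounds=bounds, constraints=cons)
lam = res.x

w = (lam * y) @ X                         # w = sum_i lam_i y_i x_i
sv = lam > 1e-6                           # support vectors have lam_i > 0
b = np.mean(y[sv] - X[sv] @ w)            # b recovered from the support vectors
print(w, b)
```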
Soft-margin SVM, primal problem:
$$\min_{(\mathbf{w},b)\in\Re^{d+1},\,\{\xi_i\}_{i=1}^{N}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.} \quad y_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) \geq 1 - \xi_i, \; \xi_i \geq 0, \; \forall i \in \{1,\dots,N\}.$$
$C \geq 0$ is a user-defined parameter, which controls the regularisation: it sets the trade-off between model complexity and tolerance of nonseparable patterns.
Dual problem:
$$\max_{\boldsymbol{\lambda}\in\Re^N} \; \sum_{i=1}^{N}\lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \quad \text{s.t.} \quad \sum_{i=1}^{N}\lambda_i y_i = 0, \; 0 \leq \lambda_i \leq C, \; \forall i \in \{1,\dots,N\}.$$
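In practice the soft-margin problem is rarely solved by hand. A sketch with scikit-learn's SVC, whose C parameter plays exactly this regularisation role; the overlapping data are invented:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Invented overlapping (nonseparable) 2D data from two Gaussian clouds
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C tolerates more margin violations -> typically more support vectors
    print(C, len(clf.support_vectors_))
```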
[Figure: the learned hyperplane with direction $\mathbf{w}$; the support vectors are highlighted.]
Support vectors represent points that are difficult to classify and are important for deciding the location of the hyperplane.
Nonlinear mapping: project samples from the original input space to a new feature space using
$$\phi: \Re^d \rightarrow \Re^b, \quad \mathbf{x} \mapsto \phi(\mathbf{x}) = \left[\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \dots, \phi_b(\mathbf{x})\right]^T.$$
Data that are not linearly separable in the original input space may become linearly separable after projection.
Demo: https://www.youtube.com/watch?v=9NrALgHFwTo
The kernel trick:
In the original input space, the inner product is $\mathbf{x}_i^T\mathbf{x}_j = \sum_{k=1}^{d} x_{ik}x_{jk}$. In the new feature space, it becomes $\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j) = \sum_{k=1}^{b} \phi_k(\mathbf{x}_i)\phi_k(\mathbf{x}_j)$.
A kernel function computes this inner product directly from the original samples, without explicitly evaluating $\phi(\mathbf{x})$: $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$.
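A small numpy check of the kernel trick, assuming the classic degree-2 map $\phi(\mathbf{x}) = [x_1^2, \sqrt{2}x_1x_2, x_2^2]^T$ (an illustrative choice, not necessarily the slides' map), for which $\phi(\mathbf{x})^T\phi(\mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2D input (assumed example)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = phi(x) @ phi(z)   # inner product computed in the new feature space
rhs = (x @ z) ** 2      # kernel K(x, z) = (x^T z)^2 computed in the input space
print(lhs, rhs)         # both equal 1.0
```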
Commonly used kernel functions:

Kernel Name | Expression K(x, y) | Comments
Linear | $\mathbf{x}^T\mathbf{y}$ | No parameter
Polynomial | $(\mathbf{x}^T\mathbf{y} + 1)^p$ | $p$ is a user-defined parameter
Gaussian, also called radial basis function (RBF) | $\exp\left(-\|\mathbf{x}-\mathbf{y}\|_2^2 / 2\sigma^2\right)$ | $\sigma$ is the user-defined width parameter
Hyperbolic tangent | $\tanh\left(\alpha\,\mathbf{x}^T\mathbf{y} + \beta\right)$ | $\alpha$ & $\beta$ are user-defined parameters
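The four kernels in the table could be written in numpy as below; this is a sketch, and the parameter defaults are placeholders:

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y                                    # x^T y, no parameter

def polynomial_kernel(x, y, p=2):
    return (x @ y + 1.0) ** p                       # (x^T y + 1)^p

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian / RBF: exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def tanh_kernel(x, y, alpha=1.0, beta=0.0):
    return np.tanh(alpha * (x @ y) + beta)          # hyperbolic tangent

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y),
      rbf_kernel(x, y), tanh_kernel(x, y))
```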
With the linear inner product, the dual problem is
$$\max_{\boldsymbol{\lambda}\in\Re^N} \; \sum_{i=1}^{N}\lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \quad \text{s.t.} \quad \sum_{i=1}^{N}\lambda_i y_i = 0, \; 0 \leq \lambda_i \leq C.$$
After mapping to the new feature space, the dual problem becomes
$$\max_{\boldsymbol{\lambda}\in\Re^N} \; \sum_{i=1}^{N}\lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j y_i y_j \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i=1}^{N}\lambda_i y_i = 0, \; 0 \leq \lambda_i \leq C.$$
Replacing the inner product with a kernel function gives
$$\max_{\boldsymbol{\lambda}\in\Re^N} \; \sum_{i=1}^{N}\lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j y_i y_j K(\mathbf{x}_i,\mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i=1}^{N}\lambda_i y_i = 0, \; 0 \leq \lambda_i \leq C.$$
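In practice this kernelised dual is what SVM libraries solve internally. A hedged sketch with scikit-learn's RBF-kernel SVC, whose fitted dual_coef_ attribute stores the products $\lambda_i^* y_i$ for the support vectors (the data are invented):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = np.where(np.sum(X ** 2, axis=1) > 1.0, 1, -1)   # circular class boundary

# In scikit-learn, gamma corresponds to 1 / (2 sigma^2) in the Gaussian kernel
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print(clf.dual_coef_.shape, clf.support_vectors_.shape)
```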
With the optimal solution $\boldsymbol{\lambda}^*$, the decision function of the linear SVM is $f(\mathbf{x}) = \sum_{i=1}^{N}\lambda_i^* y_i \mathbf{x}_i^T\mathbf{x} + b$, and that of the kernel SVM is $f(\mathbf{x}) = \sum_{i=1}^{N}\lambda_i^* y_i K(\mathbf{x}_i,\mathbf{x}) + b$.
Polynomial: $f(\mathbf{x}) = \sum_{i=1}^{N}\lambda_i^* y_i \left(\mathbf{x}^T\mathbf{x}_i + 1\right)^p + b$.
Gaussian: $f(\mathbf{x}) = \sum_{i=1}^{N}\lambda_i^* y_i \exp\left(-\frac{\|\mathbf{x}-\mathbf{x}_i\|_2^2}{2\sigma^2}\right) + b$.
Kernel SVM demos: https://www.youtube.com/watch?v=3liCbRZPrZA https://www.youtube.com/watch?v=ndNE8he7Nnk
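To connect the formula to code, the Gaussian-kernel decision function can be recomputed by hand from a fitted model's support vectors and compared with the library's own decision_function; a verification sketch on invented data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

gamma = 0.5                        # gamma = 1 / (2 sigma^2)
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

x_new = np.array([0.3, -0.8])
# f(x) = sum_i lambda_i^* y_i K(x_i, x) + b over the support vectors
K = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
f_manual = clf.dual_coef_[0] @ K + clf.intercept_[0]

print(f_manual, clf.decision_function(x_new.reshape(1, -1))[0])  # should agree
```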
[Figures: SVM decision regions on the Iris data, plotting sepal length in cm against petal width in cm, with classes Setosa, Virginica and Versicolour and the support vectors highlighted. The second setting yields more support vectors.]
A basis function example:
$$\phi_i(\mathbf{x}) = \exp\left(-\frac{\sum_{j=1}^{d}\left(x_j - \mu_{ij}\right)^2}{2\sigma_i^2}\right) = \exp\left(-\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|_2^2}{2\sigma_i^2}\right).$$
$\boldsymbol{\mu}_i$ and $\sigma_i$ are basis function parameters.
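This Gaussian basis function translates directly into numpy; the centre and width below are placeholder values:

```python
import numpy as np

def gaussian_basis(x, mu, sigma):
    # phi_i(x) = exp( -||x - mu_i||^2 / (2 sigma_i^2) )
    return np.exp(-np.sum((x - mu) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
mu = np.array([0.0, 1.5])          # basis function centre (placeholder)
sigma = 1.0                        # basis function width (placeholder)
print(gaussian_basis(x, mu, sigma))
```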
Another basis function example: for a single input variable, $\phi(x) = \left[1, x, x^2, \dots, x^D\right]^T$. This is known as polynomial regression. The case of $D = 1$ becomes linear regression.
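A sketch of this polynomial basis expansion; numpy's vander conveniently generates the required powers:

```python
import numpy as np

def poly_basis(x, D):
    # phi(x) = [1, x, x^2, ..., x^D]^T for each scalar input x
    return np.vander(np.atleast_1d(x), D + 1, increasing=True)

print(poly_basis(2.0, 3))          # [[1. 2. 4. 8.]]
# D = 1 gives [1, x]: fitting weights on this basis is ordinary linear regression
```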
A two-dimensional example: $\phi(\mathbf{x}) = \left[x_1^2, x_2^2, x_1x_2\right]^T$.
[Figure: training samples and separation boundary.]
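As a hedged illustration of this idea, one can map invented circle-separated data through such a quadratic feature map and fit a plain linear classifier in the new space:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = np.where(np.sum(X ** 2, axis=1) > 1.0, 1, -1)   # not linearly separable in (x1, x2)

# Quadratic feature map phi(x) = [x1^2, x2^2, x1*x2]^T
Phi = np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]])

clf = LinearSVC(C=10.0).fit(Phi, y)
print(clf.score(Phi, y))           # near-perfect training accuracy after projection
```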
Curve fitting: construct a curve that best fits a series of data points. Method: incorporate basis functions into a linear least squares model,
$$\hat{y} = \mathbf{w}^T\phi(\mathbf{x}), \quad \phi(x) = \left[1, x, x^2, \dots, x^D\right]^T.$$
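A minimal least-squares fit using this polynomial basis; the synthetic data, the true curve and the noise level are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 6, 30)
y = np.sin(x) + rng.normal(0, 0.1, x.shape)      # invented noisy target curve

for D in (1, 3, 7):
    Phi = np.vander(x, D + 1, increasing=True)   # rows phi(x) = [1, x, ..., x^D]
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least squares: y_hat = w^T phi(x)
    y_hat = Phi @ w
    print(D, np.mean((y - y_hat) ** 2))          # training error drops as D grows
```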
[Figure: polynomial curve fitting on two data sets (data 1 and data 2) with D = 1, 3 and 7; each panel shows the training samples and the fitted regression curve.]
Testing the fitted curve $\hat{y} = \mathbf{w}^T\phi(\mathbf{x})$ with new points.
[Figure: for data 1 and data 2, the fitted regression curve plotted with the training samples, the testing samples and the ground truth.]