The Support Vector Machine
Nuno Vasconcelos (Ken Kreutz-Delgado)
UC San Diego
Classification
A classification problem involves two types of variables:
- X: a vector of observations (features) from the world
- Y: the state (class) of the world
We assume that the relationship between X and Y can be well approximated by an "optimal" classifier function f, and the goal of learning is to find a function h that approximates the unknown optimal classifier f, i.e. h ≈ f.
The optimal solution is the Bayes decision rule, which requires knowledge of the class posterior probabilities $P_{Y|X}(i|x)$. Estimating full probability models only to read off the resulting decision boundaries is a computationally intractable (hence bad) strategy.
The guiding principle (due to Vapnik): when solving a problem, do not solve a more general (and thus usually much harder) problem as an intermediate step!
The discriminant (direct) approach is therefore:
1. Postulate a (parametric) family of decision boundaries.
2. Pick the element in this family that produces the best classifier.
Example: two Gaussian classes with means $\mu_0, \mu_1$, common covariance $\Sigma$, and equal priors. The BDR assigns x to the class of smallest Mahalanobis distance. Expanding
$$(x-\mu_i)^T\Sigma^{-1}(x-\mu_i) = x^T\Sigma^{-1}x - 2\mu_i^T\Sigma^{-1}x + \mu_i^T\Sigma^{-1}\mu_i,$$
the quadratic term $x^T\Sigma^{-1}x$ is common to both classes and cancels, so the decision boundary is a hyperplane.
[Figure: a separating hyperplane with normal w through a point x0, and sample points x1, x2, x3, ..., xn.]
For this problem the BDR is
$$h^*(x) = \begin{cases} 0, & \text{if } (x-\mu_0)^T\Sigma^{-1}(x-\mu_0) < (x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\\ 1, & \text{if } (x-\mu_0)^T\Sigma^{-1}(x-\mu_0) > (x-\mu_1)^T\Sigma^{-1}(x-\mu_1) \end{cases}$$
which, after the cancellation above, is equivalent to
$$h^*(x) = \begin{cases} 1, & \text{if } g(x) > 0\\ 0, & \text{if } g(x) < 0 \end{cases}$$
with the linear discriminant
$$g(x) = w^T(x - x_0), \qquad w = \Sigma^{-1}(\mu_1 - \mu_0), \qquad x_0 = \tfrac{1}{2}(\mu_0 + \mu_1).$$
Geometrically,
$$g(x) = \|w\|\,\|x - x_0\|\cos\theta,$$
where θ is the angle between w and x - x0, so the sign of g(x) tells us on which side of the hyperplane (through x0, with normal w) the point x lies.
[Figure: the vector x - x0, the normal w, and the angle θ between them.]
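To make this equivalence concrete, here is a small numpy sketch, with hypothetical means and covariance, checking that the linear rule above reproduces the Mahalanobis-distance BDR:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class parameters (common covariance, equal priors)
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sinv = np.linalg.inv(Sigma)

# Linear discriminant from the text: g(x) = w^T (x - x0)
w = Sinv @ (mu1 - mu0)
x0 = 0.5 * (mu0 + mu1)

def bdr(x):
    """Mahalanobis-distance rule: class 1 iff x is closer to mu1 than to mu0."""
    d0 = (x - mu0) @ Sinv @ (x - mu0)
    d1 = (x - mu1) @ Sinv @ (x - mu1)
    return 1 if d0 > d1 else 0

# The linear rule h(x) = 1{g(x) > 0} agrees with the BDR everywhere
for x in rng.normal(size=(200, 2)) * 3.0:
    assert bdr(x) == (1 if w @ (x - x0) > 0 else 0)
print("linear discriminant matches the Mahalanobis-distance BDR")
```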
Dividing by ‖w‖ gives
$$\frac{g(x)}{\|w\|} = \frac{w^T}{\|w\|}(x - x_0),$$
the projection of x - x0 onto the unit vector in the direction of w: g(x)/‖w‖ is therefore the (signed) distance from x to the hyperplane.
[Figure: the component of x - x0 along the unit normal w/‖w‖ is the distance from x to the plane.]
More generally, any linear discriminant
$$g(x) = w^T x + b, \qquad h^*(x) = \begin{cases} 1, & \text{if } g(x) > 0\\ 0, & \text{if } g(x) < 0 \end{cases}$$
has as its boundary the hyperplane with:
- normal vector w;
- distance |b|/‖w‖ to the origin (attained at the point of the plane closest to the origin);
- signed distance g(x)/‖w‖ from any point x to the boundary.
This suggests learning the hyperplane parameters (w, b) directly from the data, instead of first estimating class probabilities and covariances and then deriving the boundary from them.
It is convenient to relabel the classes as y ∈ {-1, +1}, with y = 1 for points on the positive side of the boundary and y = -1 for points on the negative side. The decision rule becomes
$$h^*(x) = \begin{cases} 1, & \text{if } g(x) > 0\\ -1, & \text{if } g(x) < 0 \end{cases} \;\Longleftrightarrow\; h^*(x) = \operatorname{sgn}\left[g(x)\right].$$
With this convention a point (x, y) is correctly classified if and only if $y\,g(x) > 0$, since correct decisions correspond to either (y = 1 and g(x) > 0) or (y = -1 and g(x) < 0).
Given a training set {(x_i, y_i)}, i = 1, ..., n, the hyperplane (w, b) separates the data if
$$y_i\,(w^T x_i + b) > 0, \qquad \forall i = 1, \dots, n.$$
The margin of the hyperplane is its distance to the closest training point,
$$\gamma = \min_i \frac{|w^T x_i + b|}{\|w\|},$$
and the data are linearly separable by (w, b) if and only if $y_i(w^T x_i + b) > 0$ for all i, i.e. if and only if γ > 0.
Since (w, b) can be rescaled without changing the hyperplane, we adopt the normalization
$$\min_i \left|w^T x_i + b\right| \equiv 1,$$
under which the margin is simply
$$\gamma = \frac{1}{\|w\|}.$$
Maximizing the margin is therefore equivalent to minimizing ‖w‖ (or ‖w‖²/2) subject to the separation constraints.
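The margin of a given separating hyperplane can be computed directly from this definition. A minimal numpy sketch on hypothetical toy data (w and b are simply assumed here, not learned):

```python
import numpy as np

def margin(w, b, X, y):
    """gamma = min_i |w.x_i + b| / ||w||, valid only if (w, b) separates the data."""
    g = X @ w + b
    assert np.all(y * g > 0), "hyperplane does not separate the data"
    return np.min(np.abs(g)) / np.linalg.norm(w)

# Hypothetical separable toy data and hyperplane
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0
print(margin(w, b, X, y))   # distance from the plane to the closest point
```

Rescaling (w, b) so that min_i |wᵀx_i + b| = 1 leaves both the hyperplane and this value unchanged, after which the function returns exactly 1/‖w‖.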
Maximum margin: we look for the separating hyperplane of largest margin, i.e. we solve
$$\min_{w,b}\;\frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left(w^T x_i + b\right) \ge 1,\;\forall i.$$
Why is this a good idea? Think of each point in the training set as a sample from a probability density centered on it: if we were to resample the training set we would not get the same points, so each point is better thought of as a pdf with a certain variance (the sum of these "point pdfs" provides a density estimate, a so-called "kernel estimate"). If the hyperplane has a large margin with respect to the training set, we are safe against this "resampling" uncertainty (as long as the radius of support of each point pdf is smaller than the margin γ).
What really matters is the classifier when applied to new data! The hyperplane itself is an uncertain estimate because it is learned from random data samples: from draw to draw, the hyperplane parameters are random variables with a probability distribution over possible hyperplanes. The larger the margin, the larger the number of hyperplanes that will not originate errors on the data.
The larger the value of γ, the larger the variance allowed on the plane parameter estimates!
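The "safe against resampling" argument can be illustrated numerically: if every point satisfies the canonical constraints, then jittering the points by any perturbation of radius smaller than the margin cannot flip the sign of g. A minimal numpy sketch with hypothetical data and a hypothetical hyperplane:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperplane in canonical form: min_i |w.x_i + b| = 1 would give margin 1/||w||
w = np.array([2.0, -1.0])
b = 0.5
margin = 1.0 / np.linalg.norm(w)

# Hypothetical training points satisfying y_i (w.x_i + b) >= 1
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 1.0]])
y = np.sign(X @ w + b)

# "Resample": jitter every point by noise of radius strictly below the margin
for _ in range(1000):
    noise = rng.normal(size=X.shape)
    noise *= 0.99 * margin / np.linalg.norm(noise, axis=1, keepdims=True)
    assert np.all(np.sign((X + noise) @ w + b) == y)   # labels never flip
print("no sign flips for perturbations smaller than the margin")
```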
This is a quadratic program with linear inequality constraints. Rather than solving it directly, we solve its dual problem, which is easier to solve.
Introduce Lagrange multipliers α_i ≥ 0, one for each constraint, and solve
$$\max_{\alpha \ge 0}\;\min_{w,b}\; L(w, b, \alpha),$$
where
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\left[y_i\left(w^T x_i + b\right) - 1\right]$$
is the Lagrangian. Setting the derivatives with respect to w and b to zero and substituting back leads to the dual problem
$$\max_{\alpha \ge 0}\;\left\{\sum_i \alpha_i - \frac{1}{2}\sum_{ij}\alpha_i\alpha_j\,y_i y_j\,x_i^T x_j\right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0,$$
whose solution determines the optimal hyperplane normal
$$w^* = \sum_i \alpha_i^* y_i\, x_i.$$
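For intuition, here is a minimal sketch of solving this dual numerically with a generic solver (scipy's SLSQP) on hypothetical toy data; the data, variable names, and the choice of solver are illustrative assumptions, and real SVM packages use specialized QP/SMO solvers instead:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # negative of the dual objective (we minimize)
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    method="SLSQP",
    bounds=[(0.0, None)] * n,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x
w_star = (alpha * y) @ X                        # w* = sum_i alpha_i y_i x_i
print("alpha:", np.round(alpha, 3), "w*:", np.round(w_star, 3))
```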
It remains to determine b*. Under the normalization min_i |wᵀx_i + b| = 1 there is always at least one point "on the margin" on each side; otherwise the hyperplane could be shifted and rescaled to obtain an even larger margin (as illustrated in the figure). Let x⁺ be such a point from the y = +1 class and x⁻ one from the y = -1 class. Then
$$\left.\begin{aligned} {w^*}^T x^+ + b^* &= +1\\ {w^*}^T x^- + b^* &= -1 \end{aligned}\right\}\;\Longleftrightarrow\; b^* = -\frac{1}{2}\left({w^*}^T x^+ + {w^*}^T x^-\right).$$
[Figure: the separating hyperplane with margin 1/‖w*‖ on each side and points x⁺, x⁻ on the margin.]
The complementary slackness (Karush-Kuhn-Tucker) conditions state that at the optimum, for each training point, either
- α_i = 0 and $y_i({w^*}^T x_i + b^*) > 1$ (the point is strictly outside the margin), or
- α_i > 0 and $y_i({w^*}^T x_i + b^*) = 1$ (the point is exactly on the margin).
The points with α_i > 0 are called support vectors; all other points have α_i = 0 and do not affect the solution.
The decision function is therefore
$$f(x) = \operatorname{sgn}\left[{w^*}^T x + b^*\right] = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, x_i^T x \;-\; \frac{1}{2}\sum_{i\in SV}\alpha_i^* y_i\, x_i^T\left(x^+ + x^-\right)\right],$$
which can be written compactly as
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, x_i^T\!\left(x - \frac{x^+ + x^-}{2}\right)\right].$$
Only the support vectors (α_i > 0) appear in this expression; all points with α_i = 0 can be discarded after training.
[Figure: support vectors (α_i > 0) on the margin; all remaining points have α_i = 0.]
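Continuing the hypothetical sketch above, b* can be recovered from any support vector and the decision function evaluated directly; the helper name svm_predict and the averaging over support vectors are illustrative choices, not part of the slides:

```python
import numpy as np

def svm_predict(Xq, X, y, alpha, tol=1e-6):
    """Evaluate f(x) = sgn(sum_i alpha_i y_i x_i^T x + b*) for query points Xq.

    X, y, alpha are the training data and dual solution (e.g. from the
    SLSQP sketch above); support vectors are the points with alpha_i > tol.
    """
    sv = alpha > tol                                 # support vector indices
    w = (alpha[sv] * y[sv]) @ X[sv]                  # w* = sum_{SV} alpha_i y_i x_i
    # each support vector satisfies y_i (w^T x_i + b) = 1, i.e. b = y_i - w^T x_i;
    # averaging over all support vectors is a numerical-stability choice that,
    # at the exact optimum, coincides with the x+/x- formula in the text
    b = np.mean(y[sv] - X[sv] @ w)
    return np.sign(Xq @ w + b)
```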
This also sidesteps the curse of dimensionality: the number of samples required for a given precision of pdf estimation, and of pdf-based classification, is exponential in the number of dimensions. For the SVM, although the number of dimensions may be large, the number of parameters is relatively small and there is not much room for overfitting.
In fact, d+1 points are enough to specify the decision rule in R^d!
To see this, let's look at the decision function
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, x_i^T x + b^*\right].$$
This is a thresholding of the quantity
$$\sum_{i\in SV}\alpha_i^* y_i\, x_i^T x.$$
Note that each of the terms $x_i^T x$ is the projection (more precisely, the inner product) of the vector which we wish to classify, x, onto the support vector x_i.
Let $x_{i_1}, \dots, x_{i_k}$ be the support vectors and collect these inner products into the vector
$$z(x) = \left(x_{i_1}^T x, \dots, x_{i_k}^T x\right)^T.$$
The decision function can then be written as
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, x_i^T x + b^*\right] = \operatorname{sgn}\left[\tilde w^{*T} z(x) + b^*\right], \qquad \tilde w^* = \left(\alpha_{i_1}^* y_{i_1}, \dots, \alpha_{i_k}^* y_{i_k}\right)^T.$$
The classifier $(\tilde w^*, b^*)$ acts on z(x): the decision depends on x only through its inner products with the support vectors, i.e. only on the component of x in the span of the support vectors.
[Figure: a query point x and its projections onto the support vectors x_i.]
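A small numpy check of this reformulation, using a hypothetical dual solution (support vectors, multipliers, and offset are made up for illustration):

```python
import numpy as np

# Hypothetical dual solution: support vectors, labels, multipliers, offset
Xsv = np.array([[2.0, 2.0], [-2.0, -1.0]])   # support vectors x_{i_1}, ..., x_{i_k}
ysv = np.array([1.0, -1.0])
a_sv = np.array([0.4, 0.4])                  # alpha_i^* for the support vectors
b = -0.2                                     # b* (hypothetical)

x = np.array([1.0, 0.5])                     # point to classify

# Direct evaluation: sgn( sum_i alpha_i y_i x_i^T x + b )
f_direct = np.sign(np.sum(a_sv * ysv * (Xsv @ x)) + b)

# Reformulation: z(x) = (x_{i_1}^T x, ..., x_{i_k}^T x),  w~ = (alpha_i y_i)
z = Xsv @ x
w_tilde = a_sv * ysv
f_span = np.sign(w_tilde @ z + b)

assert f_direct == f_span   # the two forms agree
```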
SVM summary (linearly separable case). Solve the dual problem
$$\max_{\alpha \ge 0}\;\left\{\sum_i \alpha_i - \frac{1}{2}\sum_{ij}\alpha_i\alpha_j\,y_i y_j\,x_i^T x_j\right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0,$$
which yields the "large margin" linear discriminant function:
$$w^* = \sum_{i\in SV}\alpha_i^* y_i\, x_i, \qquad b^* = -\frac{1}{2}\sum_{i\in SV}\alpha_i^* y_i\, x_i^T\left(x^+ + x^-\right),$$
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, x_i^T x + b^*\right].$$
With a linearly separable training set this ("hard" margin) SVM can always be found. In practice, however, the data is often not separable: there may be points that cross over to the wrong side of the boundary, or that are closer to the boundary than the margin. So how do we handle the latter set of points?
The hard-margin problem
$$\min_{w,b}\;\frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left(w^T x_i + b\right) \ge 1,\;\forall i$$
is relaxed by introducing a slack variable ξ_i ≥ 0 for each constraint, so that the constraints become
$$y_i\left(w^T x_i + b\right) \ge 1 - \xi_i.$$
A point with ξ_i > 0 violates the margin by the amount ξ_i/‖w*‖.
[Figure: margins at distance 1/‖w*‖ on each side of the boundary, and a violating point x_i a distance ξ_i/‖w*‖ inside the margin.]
The soft-margin SVM then solves
$$\min_{w,b,\xi}\;\frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\left(w^T x_i + b\right) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\forall i,$$
where the constant C controls the trade-off between a large margin and a small total amount of slack.
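For fixed (w, b) the optimal slack of each constraint is just the hinge term max(0, 1 - y_i(wᵀx_i + b)), so the soft-margin objective can be evaluated in closed form. A minimal numpy sketch on hypothetical data and a hypothetical hyperplane:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i xi_i, with xi_i = max(0, 1 - y_i (w.x_i + b))."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # per-point slack (hinge)
    return 0.5 * w @ w + C * xi.sum()

# Hypothetical, non-separable toy data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(soft_margin_objective(np.array([1.0, 1.0]), 0.0, X, y, C=1.0))
```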
The dual of the soft-margin problem is
$$\max_{\alpha}\;\left\{\sum_i \alpha_i - \frac{1}{2}\sum_{ij}\alpha_i\alpha_j\,y_i y_j\,x_i^T x_j\right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0,\;\; 0 \le \alpha_i \le C,$$
i.e. the only change with respect to the hard-margin dual is the box constraint α_i ≤ C. The decision function keeps the same form,
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, x_i^T x + b^*\right],$$
where b* can be computed from the points x_i with 0 < α_i < C, which sit exactly on the margin; points with α_i = 0 lie outside the margin, and points with α_i = C violate it.
The upper bound C caps each multiplier, which prevents any single support vector outlier from having an unduly large impact on the decision rule.
[Figure: points outside the margin (α_i = 0), on the margin (0 < α_i < C), and violating the margin (α_i = C).]
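As a short illustration (hypothetical toy data), scikit-learn's SVC, which wraps LIBSVM, exposes the dual solution directly: dual_coef_ stores the products y_i α_i for the support vectors, so the effect of the bound C on the multipliers can be inspected:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data with one overlapping point per class
X = np.array([[2.0, 2.0], [3.0, 1.0], [1.0, -0.5],
              [-2.0, -1.0], [-1.0, -3.0], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    a = np.abs(clf.dual_coef_)            # |y_i alpha_i| = alpha_i for support vectors
    print(f"C={C:6}: {len(clf.support_)} support vectors, "
          f"alpha in [{a.min().round(3)}, {a.max().round(3)}]")
```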
The kernel trick: both the dual problem and the decision function depend on the data only through the inner products $x_i^T x_j$. We can therefore map the data through a feature transformation φ and replace every inner product with a kernel
$$K(x_i, x_j) = \varphi(x_i)^T\varphi(x_j).$$
[Figure: the feature map φ sends the input points x_1, x_2, ..., x_n into a feature space where the classes become linearly separable.]
The dual problem becomes
$$\max_{\alpha}\;\left\{\sum_i \alpha_i - \frac{1}{2}\sum_{ij}\alpha_i\alpha_j\,y_i y_j\,K(x_i, x_j)\right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0,\;\; 0 \le \alpha_i \le C,$$
with
$$b^* = -\frac{1}{2}\sum_{i\in SV}\alpha_i^* y_i\left[K(x_i, x^+) + K(x_i, x^-)\right]$$
and the decision function
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, K(x_i, x) + b^*\right].$$
Nothing in the derivation really changes: we could have simply used ⟨x_i, x_j⟩ to denote the inner product all along, and all of the results above would still hold. The only difference is that we can no longer recover w* explicitly without determining the feature transformation φ, since
$$w^* = \sum_{i\in SV}\alpha_i^* y_i\,\varphi(x_i),$$
which is, for example, a sum of Gaussians ("lives" in an infinite-dimensional function space) when we use the Gaussian kernel. We therefore work directly with the decision function
$$f(x) = \operatorname{sgn}\left[\sum_{i\in SV}\alpha_i^* y_i\, K(x_i, x) + b^*\right].$$
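A minimal numpy sketch of this kernel decision function with a Gaussian (RBF) kernel; the support vectors, multipliers, offset, and kernel width below are hypothetical values used only for illustration:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2.0 * sigma ** 2))

def kernel_svm_decision(x, Xsv, ysv, alpha_sv, b, sigma=1.0):
    """f(x) = sgn( sum_{i in SV} alpha_i y_i K(x_i, x) + b* )."""
    return np.sign(np.sum(alpha_sv * ysv * rbf_kernel(Xsv, x, sigma)) + b)

# Hypothetical support vectors, multipliers, and offset
Xsv = np.array([[1.0, 1.0], [-1.0, -1.0]])
ysv = np.array([1.0, -1.0])
alpha_sv = np.array([0.7, 0.7])
b = 0.0
print(kernel_svm_decision(np.array([0.8, 1.2]), Xsv, ysv, alpha_sv, b))
```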
Kernel selection: there is no generic "optimal" procedure to find the kernel or its parameters. In practice, one usually starts with a default kernel, e.g. the Gaussian, and then determines the kernel parameters, e.g. the variance, by trial and error.
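In practice this trial and error is often automated with cross-validation. A short sketch with scikit-learn's GridSearchCV on hypothetical toy data (note that scikit-learn parameterizes the Gaussian kernel by gamma, which plays the role of an inverse variance, gamma = 1/(2σ²)):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical toy problem: two noisy Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
               rng.normal(-1.0, 1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# Trial and error over the kernel width (gamma) and C, scored by cross-validation
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```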
Training an SVM amounts to solving a quadratic program, and good, heavily optimized packages already exist. Therefore, writing "your own" algorithm is not going to be competitive.
One example is SvmFu: http://five-percent-nation.mit.edu/SvmFu/
For further reading see, e.g., B. Schölkopf and A. Smola, Learning with Kernels, MIT Press, 2002.