Kernel-based Methods and Support Vector Machines
Larry Holder CptS 570 – Machine Learning School of Electrical Engineering and Computer Science Washington State University
References
Müller et al., “An Introduction to Kernel-Based Learning Algorithms,” IEEE Transactions on Neural Networks, 12(2):181–201, 2001.
Estimate function f : R^N → {−1, +1}
Want f minimizing expected error (risk)
R[f] = \int \mathrm{loss}(f(\mathbf{x}), y)\, dP(\mathbf{x}, y)
P(x, y) is unknown, so compute the empirical risk instead:
R_{\mathrm{emp}}[f] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{loss}(f(\mathbf{x}_i), y_i)
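As a concrete illustration (not from the slides), a minimal NumPy sketch of the empirical risk under 0/1 loss:

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp[f]: mean 0/1 loss of decision function f over a sample (X, y)."""
    predictions = np.array([f(x) for x in X])   # each prediction in {-1, +1}
    return np.mean(predictions != y)            # fraction of misclassified points

# Example: a fixed linear classifier f(x) = sign(w . x + b)
w, b = np.array([1.0, -1.0]), 0.0
f = lambda x: np.sign(w @ x + b) or 1.0         # map sign(0) to +1
X = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, -1.0]])
y = np.array([1, -1, 1])
print(empirical_risk(f, X, y))                  # 0.0 here: all points correct
```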
Using Remp[f] to estimate R[f] for small sample sizes risks overfitting
Can restrict the class F of functions from which f is chosen
I.e., restrict the VC dimension h of F
Model selection
Find F such that the learned f ∈ F minimizes the bound on R[f]
With probability 1 − δ and n > h:
R[f] \le R_{\mathrm{emp}}[f] + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\delta}}{n}}
Tradeoff between empirical risk Remp[f] and the confidence term, which grows with the complexity of F
[Figure: expected risk vs. complexity of F; empirical risk decreases while the uncertainty term increases]
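To see the tradeoff numerically, a small sketch (with illustrative values of h and n, not from the slides) evaluating the confidence term of the bound:

```python
import math

def confidence_term(h, n, delta=0.05):
    """Square-root term of the VC bound: grows with VC dimension h, shrinks with n."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / delta)) / n)

n = 1000
for h in (10, 100, 500):
    print(h, round(confidence_term(h, n), 3))
# Larger h (more complex F) widens the gap between R_emp[f] and the bound on R[f].
```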
Consider a training sample separable by the hyperplane f(x) = (w · x) + b
Margin is the minimal distance of a sample to the decision surface
We can bound the VC dimension of the set of hyperplanes by bounding the margin
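The precise form of this bound is not on the slide; the assumed source is Vapnik's margin bound as surveyed in Müller et al.:

```latex
% Margin-based capacity bound (Vapnik; assumed statement, cf. Muller et al.):
% for hyperplanes (w . x) + b with \|w\| \le \Lambda on data inside a
% sphere of radius R (so the margin is at least 1/\Lambda), the VC
% dimension h is bounded independently of the input dimension N:
\[
  h \;\le\; \min\!\left(\lceil R^2 \Lambda^2 \rceil,\, N\right) + 1 .
\]
```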
[Figure: separating hyperplane w and its margin]
Likely to underfit using only hyperplanes
But we can map the data to a nonlinear feature space:
Φ : R^N → F, x ↦ Φ(x)
Difficulty of learning increases with the dimensionality of the feature space
I.e., harder to learn with more features
But difficulty really depends on the complexity of the class of functions being learned
Hyperplanes are easy to learn
Still, mapping to an extremely high-dimensional feature space is computationally expensive
For some feature spaces F and mappings Φ, there is a shortcut for computing scalar products
Kernel functions compute scalar products in F directly from the inputs
Example: kernel k for the degree-2 map Φ(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2):
(\Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\,(y_1^2, \sqrt{2}\,y_1 y_2, y_2^2)^\top = ((x_1, x_2)(y_1, y_2)^\top)^2 = (\mathbf{x} \cdot \mathbf{y})^2 = k(\mathbf{x}, \mathbf{y})
In general, the polynomial kernel k(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^d computes the scalar product for the map onto degree-d monomials
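A quick NumPy check of this identity (illustrative, not from the original slides): the explicit degree-2 feature map and the kernel give the same scalar product.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for 2-D input: (v1^2, sqrt(2) v1 v2, v2^2)."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def k(x, y, d=2):
    """Polynomial kernel: the same scalar product, computed in input space."""
    return (x @ y) ** d

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print(phi(x) @ phi(y))  # 16.0, via the explicit map into F
print(k(x, y))          # 16.0, without ever computing phi
```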
Supervised learning: mapping to nonlinear space
Minimize, subject to the separability constraints (Eq. 8) y_i((\mathbf{w} \cdot \Phi(\mathbf{x}_i)) + b) \ge 1:
\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2
Problem: w resides in F, which may be extremely high (or infinite) dimensional
Solution: remove the explicit dependency on w
Introduce Lagrange multipliers αi ≥ 0, i = 1,…,n (one for each constraint in Eq. 8)
And use the kernel function
L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i((\mathbf{w} \cdot \Phi(\mathbf{x}_i)) + b) - 1 \right)
\frac{\partial L}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0
\frac{\partial L}{\partial \mathbf{w}} = 0 \;\rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \Phi(\mathbf{x}_i)
Substituting the last two equations into the first and replacing (Φ(xi) · Φ(xj)) with the kernel function k(xi, xj) gives the dual problem:
\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)
subject to \alpha_i \ge 0, \; i = 1, \ldots, n, and \sum_{i=1}^{n} \alpha_i y_i = 0
This is a quadratic optimization problem.
Once we have α, we have w and can classify a new point x:
f(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{n} y_i \alpha_i\, k(\mathbf{x}, \mathbf{x}_i) + b \right)
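For a concrete run, a short sketch assuming scikit-learn is available (the data is illustrative): fit a polynomial-kernel SVM and read off the support vectors and their coefficients yiαi.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny two-class sample (illustrative)
X = np.array([[0, 0], [1, 1], [1, 0], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# Degree-2 polynomial kernel k(x, x') = (x . x')^2, matching the earlier example
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0, C=10.0)
clf.fit(X, y)

print(clf.support_)               # indices of support vectors (points with alpha_i > 0)
print(clf.dual_coef_)             # y_i * alpha_i for each support vector
print(clf.predict([[2.0, 2.0]]))  # sign of sum_i y_i alpha_i k(x, x_i) + b
```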
Until now, we have assumed the problem is linearly separable (in feature space)
But if noise is present, this may be a bad assumption
Solution: introduce noise terms (slack variables) ξi ≥ 0 and relax the constraints:
y_i((\mathbf{w} \cdot \Phi(\mathbf{x}_i)) + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n
Now, we want to minimize
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i
where C > 0 determines the tradeoff between maximizing the margin and minimizing the training error
The dual problem is unchanged,
\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)
except that now 0 ≤ αi ≤ C: C limits the size of the Lagrange multipliers αi
Note that many training examples will be well outside the margin
Therefore, their optimal αi = 0; this reduces the optimization problem from n variables to the number of support vectors
The KKT conditions characterize each example by its αi:
\alpha_i = 0 \;\Rightarrow\; y_i f(\mathbf{x}_i) \ge 1 \text{ and } \xi_i = 0
0 < \alpha_i < C \;\Rightarrow\; y_i f(\mathbf{x}_i) = 1 \text{ and } \xi_i = 0
\alpha_i = C \;\Rightarrow\; y_i f(\mathbf{x}_i) \le 1 \text{ and } \xi_i \ge 0
[Figure: soft-margin hyperplane w with margin; support vectors lie on or inside the margin]
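The effect of C can be checked empirically; a short sketch, again assuming scikit-learn and illustrative data, shows that smaller C tolerates more margin violations and so keeps more support vectors:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two noisy, overlapping clusters (illustrative data)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    # Smaller C -> looser margin -> more alpha_i > 0 (more support vectors)
    print(f"C={C}: {clf.support_.size} support vectors")
```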
Fisher’s linear discriminant
Find a linear projection of the features such that the classes are well separated
“Well separated” defined as a large ratio of between-class variance to within-class variance along the projection
Can be solved using kernel methods to obtain a nonlinear discriminant
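A minimal NumPy sketch of the linear Fisher direction, assuming two classes and an invertible within-class scatter matrix (illustrative, not from the slides):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant: w = S_W^{-1} (m1 - m2), the projection
    maximizing between-class over within-class variance."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: summed scatter of each class around its own mean
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S_W, m1 - m2)

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
X2 = rng.normal([3.0, 1.0], 1.0, size=(50, 2))
w = fisher_direction(X1, X2)
print(w)  # projecting the data onto w separates the two classes
```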
Optical pattern and object recognition
Invariant SVM achieved the best error rate, better than humans (2.5%)
Text categorization
Time-series prediction
Gene expression profile analysis
DNA and protein analysis
SVM method (13% error) for classifying DNA sequences
Virtual SVMs, incorporating prior biological knowledge, further improve performance
Principal Components Analysis (PCA) used in feature extraction and de-noising
PCA is a linear method
Kernel-based PCA can achieve non-linear feature extraction
Application to USPS data to reduce noise
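A short kernel-PCA de-noising sketch, assuming scikit-learn (the digits dataset stands in for USPS; parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA

X = load_digits().data / 16.0                      # 8x8 digit images, scaled to [0, 1]
X_noisy = X + np.random.default_rng(0).normal(0, 0.25, X.shape)

# Kernel PCA with an RBF kernel; keep the leading components only.
# fit_inverse_transform learns the pre-image map needed for de-noising.
kpca = KernelPCA(n_components=32, kernel="rbf", gamma=0.05,
                 fit_inverse_transform=True, alpha=0.1)
X_denoised = kpca.inverse_transform(kpca.fit_transform(X_noisy))
print(X_denoised.shape)  # (1797, 64): images projected and mapped back
```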
(+) Kernel-based methods allow linear-speed learning in non-linear spaces
(+) Support vector machines ignore all but the most differentiating training data (those on or inside the margin)
(+) Kernel-based methods, and SVMs in particular, are among the best-performing classifiers on many learning problems
(-) Choosing an appropriate kernel can be difficult
(-) High dimensionality of the original learning problem can still be a computational bottleneck