Kernel-based Methods and Support Vector Machines
Larry Holder
CSE 6363 Machine Learning
Computer Science and Engineering
University of Texas at Arlington
References
Muller et al., "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, 12(2):181-201, 2001.
Learning Problem
Estimate a function f : R^N → {−1, +1}
using n training examples (x_i, y_i) sampled from P(x, y)
Want f minimizing the expected error (risk)
R[f]
P(x, y) is unknown, so compute the empirical
risk R_emp[f]
$$R[f] = \int \mathrm{loss}(f(\mathbf{x}), y)\, dP(\mathbf{x}, y)$$
$$R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{loss}(f(\mathbf{x}_i), y_i)$$
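A minimal Python sketch (assuming NumPy; the 0-1 loss and the toy data are illustrative) of computing the empirical risk R_emp[f] for a given classifier f:

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp[f] = (1/n) * sum_i loss(f(x_i), y_i), using the 0-1 loss."""
    predictions = np.array([f(x) for x in X])
    return np.mean(predictions != y)  # fraction of misclassified training examples

# Toy data and a fixed linear decision rule
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
f = lambda x: 1 if x.sum() > 0 else -1
print(empirical_risk(f, X, y))  # 0.0 on this separable toy set
```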
Overfit
Using R_emp[f] to estimate R[f] for small
n may lead to overfitting
Overfit
Can restrict the class F of f
I.e., restrict the VC dimension h of F
Model selection
Find F such that the learned f ∈ F minimizes
the bound on R[f] obtained from R_emp[f]
With probability 1 − δ and n > h:
$$R[f] \le R_{emp}[f] + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln(\delta/4)}{n}}$$
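The confidence term can be evaluated numerically; a small sketch (the values of h, n, and δ below are illustrative only):

```python
import numpy as np

def vc_confidence(h, n, delta):
    """Confidence term sqrt((h*(ln(2n/h) + 1) - ln(delta/4)) / n) of the VC bound."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(delta / 4)) / n)

# The term shrinks as n grows and grows with the VC dimension h
for n in (100, 1000, 10000):
    print(n, round(vc_confidence(h=10, n=n, delta=0.05), 3))
```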
Overfit
Tradeoff between empirical risk Remp[f]
and uncertainty in estimate of R[f]
[Figure: expected risk, empirical risk, and the uncertainty (confidence) term plotted against the complexity of F]
Margins
Consider a training sample separable by the
hyperplane f(x) = (w · x) + b
The margin is the minimal distance from any sample to the
decision surface
We can bound the VC dimension of the set of
hyperplanes by bounding the margin
[Figure: separating hyperplane with normal vector w and its margin]
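A short sketch (the toy data and the chosen w, b are illustrative) of computing the margin as the minimal distance of any sample to the decision surface:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Minimum over the sample of y_i * ((w . x_i) + b) / ||w||."""
    w = np.asarray(w, dtype=float)
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])
print(geometric_margin(w=[1.0, 1.0], b=0.0, X=X, y=y))
```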
Nonlinear Algorithms
Using only hyperplanes is likely to underfit
But we can map the data into a nonlinear
space and use hyperplanes there
Φ : R^N → F, x ↦ Φ(x)
Curse of Dimensionality
Difficulty of learning increases with the
dimensionality of the problem
I.e., it is harder to learn with more features
But difficulty really depends on the complexity of the learning
algorithm and the VC dimension of the hypothesis class
Hyperplanes are easy to learn
Still, mapping to extremely high-dimensional
spaces makes even hyperplane learning difficult
Kernel Functions
For some feature spaces F and
mappings Φ there is a “trick” for efficiently computing scalar products
Kernel functions compute scalar
products in F without mapping data to
F or even knowing Φ
Kernel Functions
Example: kernel k
$$\Phi : \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$$
$$(\Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)(y_1^2, \sqrt{2}\,y_1 y_2, y_2^2)^\top = ((x_1, x_2)(y_1, y_2)^\top)^2 = (\mathbf{x} \cdot \mathbf{y})^2 = k(\mathbf{x}, \mathbf{y})$$
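A quick numerical check of this identity (the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map Phi: R^2 -> R^3 for the degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 3.0])
y = np.array([2.0, -1.0])

lhs = phi(x) @ phi(y)   # scalar product computed in the feature space F
rhs = (x @ y) ** 2      # kernel k(x, y) = (x . y)^2, computed in the input space
print(lhs, rhs)         # both equal (1*2 + 3*(-1))^2 = 1
```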
Kernel Functions
Gaussian RBF: $k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{c}\right)$
Polynomial: $k(\mathbf{x}, \mathbf{y}) = ((\mathbf{x} \cdot \mathbf{y}) + \theta)^d$
Sigmoidal: $k(\mathbf{x}, \mathbf{y}) = \tanh(\kappa(\mathbf{x} \cdot \mathbf{y}) + \theta)$
Inverse multiquadric: $k(\mathbf{x}, \mathbf{y}) = \frac{1}{\sqrt{\|\mathbf{x} - \mathbf{y}\|^2 + c^2}}$
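These kernels are simple to implement directly; a sketch (the parameter names c, d, theta, and kappa follow the formulas above, and the default values are arbitrary):

```python
import numpy as np

def rbf(x, y, c=1.0):
    """Gaussian RBF kernel: exp(-||x - y||^2 / c)."""
    return np.exp(-np.sum((x - y) ** 2) / c)

def polynomial(x, y, d=2, theta=1.0):
    """Polynomial kernel: ((x . y) + theta)^d."""
    return (x @ y + theta) ** d

def sigmoidal(x, y, kappa=1.0, theta=0.0):
    """Sigmoidal kernel: tanh(kappa * (x . y) + theta)."""
    return np.tanh(kappa * (x @ y) + theta)

def inverse_multiquadric(x, y, c=1.0):
    """Inverse multiquadric kernel: 1 / sqrt(||x - y||^2 + c^2)."""
    return 1.0 / np.sqrt(np.sum((x - y) ** 2) + c ** 2)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(rbf(x, y), polynomial(x, y), sigmoidal(x, y), inverse_multiquadric(x, y))
```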
Support Vector Machines
Supervised learning
Mapping to a nonlinear space
Minimize (subject to Eq. 8):
$$y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \ge 1, \quad i = 1, \ldots, n$$
$$y_i((\mathbf{w} \cdot \Phi(\mathbf{x}_i)) + b) \ge 1, \quad i = 1, \ldots, n \qquad \text{(Eq. 8)}$$
$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2$$
Support Vector Machines
Problem: w resides in F, where
computation is difficult
Solution: remove dependency on w
Introduce Lagrange multipliers
α_i ≥ 0, i = 1, …, n, one for each constraint in Eq. 8
And use the kernel function
Support Vector Machines
$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i((\mathbf{w} \cdot \Phi(\mathbf{x}_i)) + b) - 1 \right)$$
$$\frac{\partial L}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$$
$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \Phi(\mathbf{x}_i)$$
Substituting the last two equations into the first and replacing
(Φ(x_i) · Φ(x_j)) with the kernel function k(x_i, x_j) …
Support Vector Machines
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)$$
Subject to: $\alpha_i \ge 0, \; i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$
This is a quadratic optimization problem.
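Any quadratic-programming solver can handle this problem; a minimal sketch (assuming SciPy is available; the toy data and the linear kernel are illustrative) solving the dual with SLSQP:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

K = X @ X.T                        # linear kernel matrix k(x_i, x_j) = x_i . x_j
Q = (y[:, None] * y[None, :]) * K  # Q_ij = y_i y_j k(x_i, x_j)

def neg_dual(alpha):
    """Negative dual objective; minimizing it maximizes the dual."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * n                             # alpha_i >= 0

alpha = minimize(neg_dual, np.zeros(n), bounds=bounds, constraints=constraints).x
print(np.round(alpha, 4))   # nonzero entries correspond to support vectors
```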
Support Vector Machines
Once we have α, we have w and can
perform classification
$$f(\mathbf{x}) = \mathrm{sgn}\left((\Phi(\mathbf{x}) \cdot \mathbf{w}) + b\right) = \mathrm{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i\, (\Phi(\mathbf{x}) \cdot \Phi(\mathbf{x}_i)) + b\right) = \mathrm{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i\, k(\mathbf{x}, \mathbf{x}_i) + b\right)$$
$$\text{where } b = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j y_j\, k(\mathbf{x}_j, \mathbf{x}_i) \right)$$
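In practice a library solver returns the α_i and b; a sketch with scikit-learn's SVC (assuming scikit-learn; its dual_coef_ stores the products α_i y_i for the support vectors) that rebuilds the decision sum by hand:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)

def decision(x):
    """sum_i alpha_i y_i k(x_i, x) + b, summing only over the support vectors."""
    k = np.exp(-0.5 * np.sum((clf.support_vectors_ - x) ** 2, axis=1))  # RBF, gamma = 0.5
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

x_new = np.array([1.5, 1.0])
print(np.sign(decision(x_new)), clf.predict([x_new])[0])  # the two should agree
```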
SVMs with Noise
Until now, we have assumed the problem is linearly
separable in some space
But if noise is present, this may be a bad
assumption
Solution: introduce noise terms (slack
variables ξ_i) into the classification constraints
$$y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n$$
SVMs with Noise
Now, we want to minimize the objective below,
where C > 0 determines the tradeoff between
empirical error and hypothesis complexity
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
SVMs with Noise
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)$$
Subject to: $0 \le \alpha_i \le C, \; i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$
where C limits the size of the Lagrange multipliers α_i
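Compared with the hard-margin dual, the only change for a solver is the box constraint on α; a sketch (toy data with one deliberately mislabeled point, linear kernel, C = 1.0 chosen arbitrarily):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5], [1.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])   # last point is mislabeled "noise"
n, C = len(y), 1.0

K = X @ X.T
Q = (y[:, None] * y[None, :]) * K

neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()
constraints = {"type": "eq", "fun": lambda a: a @ y}
bounds = [(0.0, C)] * n                      # box constraint 0 <= alpha_i <= C

alpha = minimize(neg_dual, np.zeros(n), bounds=bounds, constraints=constraints).x
print(np.round(alpha, 4))   # a margin violator typically has its alpha clipped at C
```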
Sparsity
Note that many training examples will be
outside the margin
Therefore, their optimal α_i = 0
This reduces the optimization problem from n
variables down to the number of examples on
or inside the margin
$$\alpha_i = 0 \;\Rightarrow\; y_i f(\mathbf{x}_i) \ge 1 \text{ and } \xi_i = 0$$
$$0 < \alpha_i < C \;\Rightarrow\; y_i f(\mathbf{x}_i) = 1 \text{ and } \xi_i = 0$$
$$\alpha_i = C \;\Rightarrow\; y_i f(\mathbf{x}_i) \le 1 \text{ and } \xi_i \ge 0$$
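The sparsity is easy to observe empirically; a sketch (assuming scikit-learn; the dataset and kernel parameters are arbitrary) that counts how many training examples end up with nonzero α_i:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated Gaussian blobs
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print(f"{len(clf.support_)} of {len(X)} training examples are support vectors")
```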
Kernel Methods
Fisher’s linear discriminant
Find a linear projection of the feature
space such that classes are well separated
“Well separated” defined as a large
difference in the means and a small variance along the discriminant
Can be solved using kernel methods to
find nonlinear discriminants
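For the linear case the discriminant direction has a closed form, w = S_W^{-1}(m_1 − m_2); a minimal sketch (synthetic two-class data, NumPy only):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w = Sw^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    return np.linalg.solve(Sw, m1 - m2)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X2 = rng.normal(loc=[-1.0, 0.0], scale=0.5, size=(50, 2))

w = fisher_direction(X1, X2)
print("class 1 projection mean:", (X1 @ w).mean())
print("class 2 projection mean:", (X2 @ w).mean())
```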
Applications
Optical pattern and object recognition
Invariant SVM achieved best error rate
(0.6%) on USPS handwritten digit recognition problem
Better than humans (2.5%)
Text categorization
Time-series prediction
Applications
Gene expression profile analysis
DNA and protein analysis
An SVM method (13% error) for classifying DNA
translation initiation sites outperforms the best neural network (15% error)
Virtual SVMs, incorporating prior biological
knowledge, reached 11-12% error rate
Kernel Methods for Unsupervised Learning
Principal Components Analysis (PCA) is used in
unsupervised learning
PCA is a linear method
Kernel-based PCA can extract non-linear
components using standard kernel techniques
Application to the USPS data for noise reduction
indicated a factor of 8 performance improvement over the linear PCA method
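A sketch of kernel PCA with scikit-learn (the dataset and kernel parameters below are illustrative, not the USPS setup):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Concentric circles: a structure linear PCA cannot "unfold"
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = PCA(n_components=2).fit_transform(X)
nonlinear = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Compare how well the first component separates the two rings
print("linear PC1 class means:", linear[y == 0, 0].mean(), linear[y == 1, 0].mean())
print("kernel PC1 class means:", nonlinear[y == 0, 0].mean(), nonlinear[y == 1, 0].mean())
```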
Summary
(+) Kernel-based methods allow learning with linear
models in non-linear feature spaces
(+) Support vector machines ignore all but the most
differentiating training data (those on or inside the margin)
(+) Kernel-based methods, and SVMs in particular,
are among the best-performing classifiers on many learning problems
(−) Choosing an appropriate kernel can be difficult
(−) High dimensionality of the original learning problem
can still be a computational bottleneck