Kernel-based Methods and Support Vector Machines
Larry Holder
CSE 6363 Machine Learning
Computer Science and Engineering
University of Texas at Arlington
References
Muller et al., "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, 12(2):181-201, 2001.
Learning Problem
Estimate a function f : R^N → {−1, +1}
using n training examples (x_i, y_i) sampled from P(x, y)
Want f minimizing the expected error (risk)
R[f]
P(x, y) is unknown, so compute the empirical
risk R_emp[f]
$$R[f] = \int \mathrm{loss}(f(\mathbf{x}), y)\, dP(\mathbf{x}, y)$$
$$R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{loss}(f(\mathbf{x}_i), y_i)$$
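A minimal Python sketch (assuming NumPy; the 0-1 loss and the toy data are illustrative) of computing the empirical risk R_emp[f] for a given classifier f:

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp[f] = (1/n) * sum_i loss(f(x_i), y_i), using the 0-1 loss."""
    predictions = np.array([f(x) for x in X])
    return np.mean(predictions != y)  # fraction of misclassified training examples

# Toy data and a fixed linear decision rule
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
f = lambda x: 1 if x.sum() > 0 else -1
print(empirical_risk(f, X, y))  # 0.0 on this separable toy set
```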
Overfit
Using R_emp[f] to estimate R[f] for small
n may lead to overfitting
Overfit
Can restrict the class F of f
I.e., restrict the VC dimension h of F
Model selection
Find F such that the learned f ∈ F minimizes
the bound on R[f] obtained from R_emp[f]
With probability 1 − δ and n > h:
$$R[f] \le R_{emp}[f] + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln(\delta/4)}{n}}$$
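The confidence term can be evaluated numerically; a small sketch (the values of h, n, and δ below are illustrative only):

```python
import numpy as np

def vc_confidence(h, n, delta):
    """Confidence term sqrt((h*(ln(2n/h) + 1) - ln(delta/4)) / n) of the VC bound."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(delta / 4)) / n)

# The term shrinks as n grows and grows with the VC dimension h
for n in (100, 1000, 10000):
    print(n, round(vc_confidence(h=10, n=n, delta=0.05), 3))
```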
Overfit
Tradeoff between empirical risk Remp[f]
and uncertainty in estimate of R[f]
[Figure: expected risk, empirical risk, and the uncertainty (confidence) term plotted against the complexity of F]
Margins
Consider a training sample separable by the
hyperplane f(x) = (w · x) + b
The margin is the minimal distance from any sample to the
decision surface
We can bound the VC dimension of the set of
hyperplanes by bounding the margin
[Figure: separating hyperplane with normal vector w and its margin]
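A short sketch (the toy data and the chosen w, b are illustrative) of computing the margin as the minimal distance of any sample to the decision surface:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Minimum over the sample of y_i * ((w . x_i) + b) / ||w||."""
    w = np.asarray(w, dtype=float)
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])
print(geometric_margin(w=[1.0, 1.0], b=0.0, X=X, y=y))
```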
Nonlinear Algorithms
Using only hyperplanes is likely to underfit
But we can map the data into a nonlinear
space and use hyperplanes there
Φ : R^N → F, x ↦ Φ(x)
Curse of Dimensionality
Difficulty of learning increases with the
dimensionality of the problem
I.e., it is harder to learn with more features
But difficulty really depends on the complexity of the learning
algorithm and the VC dimension of the hypothesis class
Hyperplanes are easy to learn
Still, mapping to extremely high-dimensional
spaces makes even hyperplane learning difficult
Kernel Functions
For some feature spaces F and
mappings Φ there is a “trick” for efficiently computing scalar products
Kernel functions compute scalar
products in F without mapping data to
F or even knowing Φ
Kernel Functions
Example: kernel k
$$\Phi : \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$$
$$(\Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)(y_1^2, \sqrt{2}\,y_1 y_2, y_2^2)^\top = ((x_1, x_2)(y_1, y_2)^\top)^2 = (\mathbf{x} \cdot \mathbf{y})^2 = k(\mathbf{x}, \mathbf{y})$$
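A quick numerical check of this identity (the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map Phi: R^2 -> R^3 for the degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 3.0])
y = np.array([2.0, -1.0])

lhs = phi(x) @ phi(y)   # scalar product computed in the feature space F
rhs = (x @ y) ** 2      # kernel k(x, y) = (x . y)^2, computed in the input space
print(lhs, rhs)         # both equal (1*2 + 3*(-1))^2 = 1
```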
Kernel Functions
Gaussian RBF: $k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{c}\right)$
Polynomial: $k(\mathbf{x}, \mathbf{y}) = ((\mathbf{x} \cdot \mathbf{y}) + \theta)^d$
Sigmoidal: $k(\mathbf{x}, \mathbf{y}) = \tanh(\kappa(\mathbf{x} \cdot \mathbf{y}) + \theta)$
Inverse multiquadric: $k(\mathbf{x}, \mathbf{y}) = \frac{1}{\sqrt{\|\mathbf{x} - \mathbf{y}\|^2 + c^2}}$
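These kernels are simple to implement directly; a sketch (the parameter names c, d, theta, and kappa follow the formulas above, and the default values are arbitrary):

```python
import numpy as np

def rbf(x, y, c=1.0):
    """Gaussian RBF kernel: exp(-||x - y||^2 / c)."""
    return np.exp(-np.sum((x - y) ** 2) / c)

def polynomial(x, y, d=2, theta=1.0):
    """Polynomial kernel: ((x . y) + theta)^d."""
    return (x @ y + theta) ** d

def sigmoidal(x, y, kappa=1.0, theta=0.0):
    """Sigmoidal kernel: tanh(kappa * (x . y) + theta)."""
    return np.tanh(kappa * (x @ y) + theta)

def inverse_multiquadric(x, y, c=1.0):
    """Inverse multiquadric kernel: 1 / sqrt(||x - y||^2 + c^2)."""
    return 1.0 / np.sqrt(np.sum((x - y) ** 2) + c ** 2)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(rbf(x, y), polynomial(x, y), sigmoidal(x, y), inverse_multiquadric(x, y))
```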
Support Vector Machines
Supervised learning
Mapping to a nonlinear space
Minimize (subject to Eq. 8):
$$y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \ge 1, \quad i = 1, \ldots, n$$
$$y_i((\mathbf{w} \cdot \Phi(\mathbf{x}_i)) + b) \ge 1, \quad i = 1, \ldots, n \qquad \text{(Eq. 8)}$$
$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2$$
Support Vector Machines
Problem: w resides in F, where
computation is difficult
Solution: remove dependency on w
Introduce Lagrange multipliers
α_i ≥ 0, i = 1, …, n, one for each constraint in Eq. 8
And use the kernel function
Support Vector Machines
$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i((\mathbf{w} \cdot \Phi(\mathbf{x}_i)) + b) - 1 \right)$$
$$\frac{\partial L}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$$
$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \Phi(\mathbf{x}_i)$$
Substituting the last two equations into the first and replacing
(Φ(x_i) · Φ(x_j)) with the kernel function k(x_i, x_j) …
Support Vector Machines
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)$$
Subject to: $\alpha_i \ge 0, \; i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$
This is a quadratic optimization problem.
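Any quadratic-programming solver can handle this problem; a minimal sketch (assuming SciPy is available; the toy data and the linear kernel are illustrative) solving the dual with SLSQP:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

K = X @ X.T                        # linear kernel matrix k(x_i, x_j) = x_i . x_j
Q = (y[:, None] * y[None, :]) * K  # Q_ij = y_i y_j k(x_i, x_j)

def neg_dual(alpha):
    """Negative dual objective; minimizing it maximizes the dual."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * n                             # alpha_i >= 0

alpha = minimize(neg_dual, np.zeros(n), bounds=bounds, constraints=constraints).x
print(np.round(alpha, 4))   # nonzero entries correspond to support vectors
```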
Support Vector Machines
Once we have α, we have w and can
perform classification
$$f(\mathbf{x}) = \mathrm{sgn}\left((\Phi(\mathbf{x}) \cdot \mathbf{w}) + b\right) = \mathrm{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i\, (\Phi(\mathbf{x}) \cdot \Phi(\mathbf{x}_i)) + b\right) = \mathrm{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i\, k(\mathbf{x}, \mathbf{x}_i) + b\right)$$
$$\text{where } b = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j y_j\, k(\mathbf{x}_j, \mathbf{x}_i) \right)$$
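In practice a library solver returns the α_i and b; a sketch with scikit-learn's SVC (assuming scikit-learn; its dual_coef_ stores the products α_i y_i for the support vectors) that rebuilds the decision sum by hand:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)

def decision(x):
    """sum_i alpha_i y_i k(x_i, x) + b, summing only over the support vectors."""
    k = np.exp(-0.5 * np.sum((clf.support_vectors_ - x) ** 2, axis=1))  # RBF, gamma = 0.5
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

x_new = np.array([1.5, 1.0])
print(np.sign(decision(x_new)), clf.predict([x_new])[0])  # the two should agree
```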
SVMs with Noise
Until now, we have assumed the problem is linearly
separable in some space
But if noise is present, this may be a bad
assumption
Solution: introduce noise terms (slack
variables ξ_i) into the classification constraints
$$y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n$$
SVMs with Noise
Now, we want to minimize the objective below,
where C > 0 determines the tradeoff between
empirical error and hypothesis complexity
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
SVMs with Noise
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)$$
Subject to: $0 \le \alpha_i \le C, \; i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$
where C limits the size of the Lagrange multipliers α_i
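Compared with the hard-margin dual, the only change for a solver is the box constraint on α; a sketch (toy data with one deliberately mislabeled point, linear kernel, C = 1.0 chosen arbitrarily):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5], [1.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])   # last point is mislabeled "noise"
n, C = len(y), 1.0

K = X @ X.T
Q = (y[:, None] * y[None, :]) * K

neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()
constraints = {"type": "eq", "fun": lambda a: a @ y}
bounds = [(0.0, C)] * n                      # box constraint 0 <= alpha_i <= C

alpha = minimize(neg_dual, np.zeros(n), bounds=bounds, constraints=constraints).x
print(np.round(alpha, 4))   # a margin violator typically has its alpha clipped at C
```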
Sparsity
Note that many training examples will be
outside the margin
Therefore, their optimal α_i = 0
This reduces the optimization problem from n
variables down to the number of examples on
or inside the margin
$$\alpha_i = 0 \;\Rightarrow\; y_i f(\mathbf{x}_i) \ge 1 \text{ and } \xi_i = 0$$
$$0 < \alpha_i < C \;\Rightarrow\; y_i f(\mathbf{x}_i) = 1 \text{ and } \xi_i = 0$$
$$\alpha_i = C \;\Rightarrow\; y_i f(\mathbf{x}_i) \le 1 \text{ and } \xi_i \ge 0$$
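The sparsity is easy to observe empirically; a sketch (assuming scikit-learn; the dataset and kernel parameters are arbitrary) that counts how many training examples end up with nonzero α_i:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated Gaussian blobs
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print(f"{len(clf.support_)} of {len(X)} training examples are support vectors")
```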
Kernel Methods
Fisher’s linear discriminant
Find a linear projection of the feature
space such that classes are well separated
“Well separated” defined as a large
difference in the means and a small variance along the discriminant
Can be solved using kernel methods to
find nonlinear discriminants
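For the linear case the discriminant direction has a closed form, w = S_W^{-1}(m_1 − m_2); a minimal sketch (synthetic two-class data, NumPy only):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w = Sw^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    return np.linalg.solve(Sw, m1 - m2)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X2 = rng.normal(loc=[-1.0, 0.0], scale=0.5, size=(50, 2))

w = fisher_direction(X1, X2)
print("class 1 projection mean:", (X1 @ w).mean())
print("class 2 projection mean:", (X2 @ w).mean())
```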
Applications
Optical pattern and object recognition
Invariant SVM achieved best error rate
(0.6%) on USPS handwritten digit recognition problem
Better than humans (2.5%)
Text categorization
Time-series prediction
Applications
Gene expression profile analysis
DNA and protein analysis
An SVM method (13% error) for classifying DNA
translation initiation sites outperforms the best neural network (15% error)
Virtual SVMs, incorporating prior biological
knowledge, reached 11-12% error rate
Kernel Methods for Unsupervised Learning
Principal Components Analysis (PCA) is used in
unsupervised learning
PCA is a linear method
Kernel-based PCA can extract non-linear
components using standard kernel techniques
Application to the USPS data for noise reduction
indicated a factor of 8 performance improvement over the linear PCA method
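A sketch of kernel PCA with scikit-learn (the dataset and kernel parameters below are illustrative, not the USPS setup):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Concentric circles: a structure linear PCA cannot "unfold"
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = PCA(n_components=2).fit_transform(X)
nonlinear = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Compare how well the first component separates the two rings
print("linear PC1 class means:", linear[y == 0, 0].mean(), linear[y == 1, 0].mean())
print("kernel PC1 class means:", nonlinear[y == 0, 0].mean(), nonlinear[y == 1, 0].mean())
```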
Summary
(+) Kernel-based methods allow learning with linear
models in non-linear feature spaces
(+) Support vector machines ignore all but the most
differentiating training data (those on or inside the margin)
(+) Kernel-based methods, and SVMs in particular,
are among the best-performing classifiers on many learning problems
(−) Choosing an appropriate kernel can be difficult
(−) High dimensionality of the original learning problem
can still be a computational bottleneck