SLIDE 1

Introduction to Support Vector Machines

Andreas Maletti

Technische Universität Dresden Fakultät Informatik

June 15, 2006

SLIDE 2

1 The Problem
2 The Basics
3 The Proposed Solution

SLIDE 3

Learning by Machines

Learning

  • Rote Learning: memorization (hash tables)
  • Reinforcement: feedback at end (Q-learning [Watkins 89])
  • Induction: generalizing examples (ID3 [Quinlan 79])
  • Clustering: grouping data (CMLIB [Hartigan 75])
  • Analogy: representation similarity (JUPA [Yvon 94])
  • Discovery: unsupervised, no goal
  • Genetic Alg.: simulated evolution (GABIL [DeJong 93])

SLIDE 4

Supervised Learning

Definition

Supervised learning: given nontrivial training data (labels known), predict the labels of test data (labels unknown).

Implementations

  • Rote Learning: hash tables
  • Clustering: Nearest Neighbor [Cover, Hart 67]
  • Induction: Neural Networks [McCulloch, Pitts 43], Decision Trees [Hunt 66], SVMs [Vapnik et al 92]

SLIDE 5

Problem Description—General

Problem

Classify a given input

  • binary classification: two classes
  • multi-class classification: several, but finitely many classes
  • regression: infinitely many classes

Major Applications

  • Handwriting recognition
  • Cheminformatics (Quantitative Structure-Activity Relationship)
  • Pattern recognition
  • Spam detection (HP Labs, Palo Alto)

SLIDE 6

Problem Description—Specific

Electricity Load Prediction Challenge 2001

  • A power plant supplies the energy demand of a region
  • Excess production is expensive
  • The load varies substantially
  • Challenge won by libSVM [Chang, Lin 06]

Problem

  • given: load and temperature for 730 days (≈ 70kB data)
  • predict: load for the next 365 days

SLIDE 7

Example Data

[Figure: electricity load (400–850) against day of year for 1997, with separate curves for the 12:00 and 24:00 loads]

SLIDE 8

Problem Description—Formal

Definition (cf. [Lin 01])

Given a training set S ⊆ Rⁿ × {−1, 1} of correctly classified input data vectors x ∈ Rⁿ, where

  • every input data vector appears at most once in S,
  • there exist input data vectors p and n such that (p, 1) ∈ S as well as (n, −1) ∈ S (non-trivial),

successfully classify unseen input data vectors.
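
For concreteness, a tiny training set in this format might look as follows in Python (a hedged sketch; the vectors and labels are invented for illustration):

```python
import numpy as np

# S ⊆ Rⁿ × {−1, 1}: input vectors paired with their class labels
S = [(np.array([0.2, 0.7]), 1),   # a positive example p
     (np.array([0.9, 0.1]), -1),  # a negative example n
     (np.array([0.3, 0.8]), 1)]

# non-triviality: both classes occur in S
assert any(y == 1 for _, y in S) and any(y == -1 for _, y in S)
```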

SLIDE 9

Linear Classification [Vapnik 63]

  • Given: a training set S ⊆ Rⁿ × {−1, 1}
  • Goal: find a hyperplane that separates Rⁿ into two half-spaces, each containing only elements of one class

SLIDE 10

Representation of Hyperplane

Definition

Hyperplane: n · (x − x₀) = 0

  • n ∈ Rⁿ: weight (normal) vector
  • x ∈ Rⁿ: input vector
  • x₀ ∈ Rⁿ: offset

Alternatively: w · x + b = 0

Decision Function

  • training set S = {(xᵢ, yᵢ) | 1 ≤ i ≤ k}
  • separating hyperplane w · x + b = 0 for S
  • Decision: w · xᵢ + b > 0 if yᵢ = 1, and w · xᵢ + b < 0 if yᵢ = −1

⇒ f(x) = sgn(w · x + b)
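
As a concrete illustration (not from the slides), here is a minimal sketch of this decision function in Python with NumPy; the names decide, w, b, and x are illustrative:

```python
import numpy as np

def decide(w, b, x):
    """Decision function f(x) = sgn(w · x + b) of a separating hyperplane."""
    return int(np.sign(np.dot(w, x) + b))

# Example: the hyperplane x1 + x2 - 1 = 0
w, b = np.array([1.0, 1.0]), -1.0
print(decide(w, b, np.array([2.0, 2.0])))  # 1: positive side
print(decide(w, b, np.array([0.0, 0.0])))  # -1: negative side
```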

SLIDE 11

Learn Hyperplane

Problem

  • Given: training set S
  • Goal: coefficients w and b of a separating hyperplane
  • Difficulty: several or no candidates for w and b

Solution [cf. Vapnik’s statistical learning theory]

Select admissible w and b with maximal margin (minimal distance to any input data vector).

Observation

We can scale w and b such that w · xᵢ + b ≥ 1 if yᵢ = 1, and w · xᵢ + b ≤ −1 if yᵢ = −1.

SLIDE 12

Maximizing the Margin

  • Closest points x₊ and x₋ (with w · x± + b = ±1)
  • Distance between the hyperplanes w · x + b = 1 and w · x + b = −1:

    ((w · x₊ + b) − (w · x₋ + b)) / ‖w‖ = 2 / ‖w‖ = 2 / √(w · w)

  • Hence max_{w,b} 2/√(w · w) ≡ min_{w,b} ½(w · w)

Basic (Primal) Support Vector Machine Form

target: min_{w,b} ½(w · w)
subject to: yᵢ(w · xᵢ + b) ≥ 1  (i = 1, …, k)
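
To make the primal form concrete, here is a hedged sketch that hands it to a generic constrained solver (scipy.optimize.minimize with SLSQP); train_hard_margin and theta are illustrative names, and a dedicated QP solver would be preferable in practice:

```python
import numpy as np
from scipy.optimize import minimize

def train_hard_margin(X, y):
    """Solve min ½(w · w) subject to yi(w · xi + b) >= 1 over theta = (w, b)."""
    k, n = X.shape
    objective = lambda theta: 0.5 * np.dot(theta[:n], theta[:n])
    constraints = [
        {"type": "ineq",  # SLSQP convention: fun(theta) >= 0
         "fun": lambda theta, i=i: y[i] * (np.dot(theta[:n], X[i]) + theta[n]) - 1.0}
        for i in range(k)
    ]
    res = minimize(objective, np.zeros(n + 1), method="SLSQP", constraints=constraints)
    return res.x[:n], res.x[n]  # w, b

# Two separable points; the maximal-margin hyperplane is x = 0 (w = (1,), b = 0)
w, b = train_hard_margin(np.array([[1.0], [-1.0]]), np.array([1, -1]))
```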

SLIDE 13

Non-separable Data

Problem

Maybe a linear separating hyperplane does not exist!

Solution

Allow training errors ξᵢ, penalized by a large penalty parameter C.

Standard (Primal) Support Vector Machine Form

target: min_{w,b,ξ} ½(w · w) + C Σᵢ₌₁ᵏ ξᵢ
subject to: yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0  (i = 1, …, k)

If ξᵢ > 1, then xᵢ is misclassified.

SLIDE 14

Higher Dimensional Feature Spaces

Problem

Data not separable because target function is essentially nonlinear!

Approach

Potentially separable in higher dimensional space

  • Map input vectors nonlinearly into a high-dimensional space (the feature space)
  • Perform the separation there

SLIDE 15

Higher Dimensional Feature Spaces

Literature

  • Classic approach [Cover 65]
  • “Kernel trick” [Boser, Guyon, Vapnik 92]
  • Extension to soft margin [Cortes, Vapnik 95]

Example (cf. [Lin 01])

Mapping φ from R³ into the feature space R¹⁰:

φ(x) = (1, √2x₁, √2x₂, √2x₃, x₁², x₂², x₃², √2x₁x₂, √2x₁x₃, √2x₂x₃)

SLIDE 16

Adapted Standard Form

Definition

Standard (Primal) Support Vector Machine Form

target: min_{w,b,ξ} ½(w · w) + C Σᵢ₌₁ᵏ ξᵢ
subject to: yᵢ(w · φ(xᵢ) + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0  (i = 1, …, k)

  • w is now a vector in the high-dimensional feature space

SLIDE 17

How to Solve?

Problem

Find w and b from the standard SVM form

Solution

Solve via the Lagrangian dual [Bazaraa et al 93]:

max_{α≥0, π≥0} min_{w,b,ξ} L(w, b, ξ, α, π)

where

L(w, b, ξ, α, π) = ½(w · w) + C Σᵢ₌₁ᵏ ξᵢ + Σᵢ₌₁ᵏ αᵢ(1 − ξᵢ − yᵢ(w · φ(xᵢ) + b)) − Σᵢ₌₁ᵏ πᵢξᵢ

SLIDE 18

Simplifying the Dual [Chen et al 03]

Standard (Dual) Support Vector Machine Form

target: min_α ½(αᵀQα) − Σᵢ₌₁ᵏ αᵢ
subject to: y · α = 0 and 0 ≤ αᵢ ≤ C  (i = 1, …, k)
where: Qᵢⱼ = yᵢyⱼ (φ(xᵢ) · φ(xⱼ))

Solution

We obtain w as w = Σᵢ₌₁ᵏ αᵢyᵢφ(xᵢ)
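
A small sketch of how Q and w could be assembled with NumPy, assuming the mapped vectors φ(xᵢ) are available as the rows of a matrix Phi (an illustrative name); with the kernel trick of the next slide, Phi @ Phi.T is replaced by the kernel matrix:

```python
import numpy as np

def dual_matrix(Phi, y):
    """Q with Q_ij = y_i y_j (φ(x_i) · φ(x_j))."""
    return np.outer(y, y) * (Phi @ Phi.T)

def recover_w(Phi, y, alpha):
    """w = Σ_i α_i y_i φ(x_i), the primal solution from the dual one."""
    return Phi.T @ (alpha * y)
```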

SLIDE 19

Where is the Benefit?

  • α ∈ Rᵏ (its dimension is independent of the feature space)
  • Only inner products in the feature space are needed

Kernel Trick

  • Inner products are calculated efficiently on the input vectors via a kernel K:

    K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ)

  • Select an appropriate feature space
  • Avoid the nonlinear transformation into the feature space
  • Benefit from the better separation properties of the feature space

SLIDE 20

Kernels

Example

The mapping φ: R³ → R¹⁰ from above, φ(x) = (1, √2x₁, √2x₂, …, √2x₂x₃), has the kernel K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ) = (1 + xᵢ · xⱼ)².

Popular Kernels

  • Gaussian radial basis function: g(xᵢ, xⱼ) = exp(−γ‖xᵢ − xⱼ‖²) (the feature space is an infinite-dimensional Hilbert space)
  • Polynomial: g(xᵢ, xⱼ) = (xᵢ · xⱼ + 1)ᵈ
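
The claimed identity can be checked numerically; the sketch below spells out the example mapping φ and compares φ(x) · φ(z) with (1 + x · z)² (the names phi and rbf are illustrative):

```python
import numpy as np

def phi(x):
    """Explicit feature map R³ → R¹⁰ from the example."""
    x1, x2, x3 = x
    s = np.sqrt(2)
    return np.array([1, s*x1, s*x2, s*x3, x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

def rbf(x, z, gamma=1.0):
    """Gaussian radial basis function kernel exp(-γ‖x − z‖²)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
assert np.isclose(phi(x) @ phi(z), (1 + x @ z) ** 2)  # kernel trick identity
```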

SLIDE 21

The Decision Function

Observation

  • No need for w, because

    f(x) = sgn(w · φ(x) + b) = sgn(Σᵢ₌₁ᵏ αᵢyᵢ(φ(xᵢ) · φ(x)) + b)

  • Uses only the xᵢ with αᵢ > 0 (the support vectors)
  • Few points determine the separation: the borderline points
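
A hedged sketch of this kernelized decision function, assuming trained multipliers alpha and offset b are given (all names illustrative):

```python
import numpy as np

def decision(K, X, y, alpha, b, x):
    """f(x) = sgn(Σ_i α_i y_i K(x_i, x) + b), summed over support vectors only."""
    sv = alpha > 0  # only support vectors contribute
    total = sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha[sv], y[sv], X[sv]))
    return int(np.sign(total + b))
```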

SLIDE 22

Support Vectors

SLIDE 23

Support Vector Machines

Definition

  • Given: a kernel K and a training set S
  • Goal: a decision function f

target: min_α ½(αᵀQα) − Σᵢ₌₁ᵏ αᵢ, where Qᵢⱼ = yᵢyⱼK(xᵢ, xⱼ)
subject to: y · α = 0 and 0 ≤ αᵢ ≤ C  (i = 1, …, k)
decide: f(x) = sgn(Σᵢ₌₁ᵏ αᵢyᵢK(xᵢ, x) + b)

SLIDE 24

Quadratic Programming

  • Suppose Q is a fully dense k × k matrix
  • 70,000 training points ⇒ 70,000 variables
  • 70,000² · 4 B ≈ 19 GB: a huge problem
  • Traditional methods (Newton, quasi-Newton) cannot be applied directly

  • Current methods:
  • Decomposition [Osuna et al 97], [Joachims 98], [Platt 98]
  • Nearest point of two convex hulls [Keerthi et al 99]

SLIDE 25

Sample Implementation

www.kernel-machines.org

  • Main forum on kernel machines
  • Lists over 250 active researchers
  • 43 competing implementations

libSVM [Chang, Lin 06]

  • Supports binary and multi-class classification and regression
  • Beginner's guide for SVM classification
  • "Out of the box" system (automatic data scaling, parameter selection)
  • Won the EUNITE and IJCNN challenges
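
As a usage illustration (not from the slides): libSVM also underlies scikit-learn's sklearn.svm.SVC, so a minimal classification run could look like this sketch:

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn wraps libSVM

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", C=1.0, gamma=1.0)  # Gaussian kernel, penalty parameter C
clf.fit(X, y)
print(clf.predict([[0.2, 0.1], [1.0, 0.9]]))  # expected: [-1  1]
print(clf.support_vectors_)                   # the x_i with α_i > 0
```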

SLIDE 26

Application Accuracy

Automatic Training using libSVM

Application      Training Data   Features   Classes   Accuracy
Astroparticle    3,089           4          2         96.9%
Bioinformatics   391             20         3         85.2%
Vehicle          1,243           21         2         87.8%

SLIDE 27

References

Books

  • Statistical Learning Theory (Vapnik). Wiley, 1998
  • Advances in Kernel Methods—Support Vector Learning (Schölkopf, Burges, Smola). MIT Press, 1999
  • An Introduction to Support Vector Machines (Cristianini, Shawe-Taylor). Cambridge Univ. Press, 2000
  • Support Vector Machines—Theory and Applications (Wang). Springer, 2005

SLIDE 28

References

Seminal Papers

  • A training algorithm for optimal margin classifiers (Boser, Guyon, Vapnik). COLT’92, ACM Press, 1992
  • Support-Vector Networks (Cortes, Vapnik). Machine Learning 20, 1995
  • Fast training of support vector machines using sequential minimal optimization (Platt). In Advances in Kernel Methods, MIT Press, 1999
  • Improvements to Platt’s SMO algorithm for SVM classifier design (Keerthi, Shevade, Bhattacharyya, Murthy). Technical Report, 1999

SLIDE 29

References

Recent Papers

  • A tutorial on ν-Support Vector Machines (Chen, Lin, Schölkopf). 2003
  • Support Vector and Kernel Machines (Nello Cristianini). ICML tutorial, 2001
  • libSVM: A library for Support Vector Machines (Chang, Lin). System documentation, 2006

SLIDE 30

Sequential Minimal Optimization [Platt 98]

  • Commonly used to solve the standard SVM form
  • Decomposition method with the smallest possible working set, |B| = 2
  • Subproblem solved analytically; no need for optimization software
  • Contained flaws; modified version in [Keerthi et al 99]
  • Karush-Kuhn-Tucker (KKT) conditions of the dual (with E = (1, …, 1)):

    Qα − E + by − λ + µ = 0
    λᵢαᵢ = 0 and µᵢ(C − αᵢ) = 0, with λ ≥ 0, µ ≥ 0

SLIDE 31

Computing b

  • The KKT conditions yield (Qα − E + by)ᵢ ≥ 0 if αᵢ < C, and (Qα − E + by)ᵢ ≤ 0 if αᵢ > 0
  • Let Fᵢ(α) = Σⱼ₌₁ᵏ αⱼyⱼK(xᵢ, xⱼ) − yᵢ and

    I₀ = {i | 0 < αᵢ < C}
    I₁ = {i | yᵢ = 1, αᵢ = 0}    I₂ = {i | yᵢ = −1, αᵢ = C}
    I₃ = {i | yᵢ = 1, αᵢ = C}    I₄ = {i | yᵢ = −1, αᵢ = 0}

  • A case analysis on yᵢ yields bounds on b:

    max{Fᵢ(α) | i ∈ I₀ ∪ I₃ ∪ I₄} ≤ b ≤ min{Fᵢ(α) | i ∈ I₀ ∪ I₁ ∪ I₂}
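
These quantities are easy to express with NumPy; the following sketch assumes a precomputed kernel matrix Kmat (an illustrative name) and returns the two bounds, which also drive the stopping test and working-set selection on the next slides:

```python
import numpy as np

def b_bounds(Kmat, y, alpha, C):
    """max F_i over I0 ∪ I3 ∪ I4 and min F_i over I0 ∪ I1 ∪ I2."""
    F = Kmat @ (alpha * y) - y  # F_i(α) = Σ_j α_j y_j K(x_i, x_j) − y_i
    I0 = (alpha > 0) & (alpha < C)
    lower = I0 | ((y == 1) & (alpha == C)) | ((y == -1) & (alpha == 0))  # I0 ∪ I3 ∪ I4
    upper = I0 | ((y == 1) & (alpha == 0)) | ((y == -1) & (alpha == C))  # I0 ∪ I1 ∪ I2
    return F[lower].max(), F[upper].min()
```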

SLIDE 32

Working Set Selection

Observation (see [Keerthi et al 99])

  • α is not an optimal solution iff

    max{Fᵢ(α) | i ∈ I₀ ∪ I₃ ∪ I₄} > min{Fᵢ(α) | i ∈ I₀ ∪ I₁ ∪ I₂}

Approach

Select the working set B = {i, j} with

    i = arg maxₘ {Fₘ(α) | m ∈ I₀ ∪ I₃ ∪ I₄}
    j = arg minₘ {Fₘ(α) | m ∈ I₀ ∪ I₁ ∪ I₂}

SLIDE 33

The Subproblem

Definition

Let B = {i, j} and N = {1, …, k} \ B. Write αB = (αᵢ, αⱼ) and αN = α|N (similarly for matrices).

B-Subproblem

target: min_{αB} ½(αBᵀ QBB αB) + Σ_{b∈B} αb (Qb,N · αN) − Σ_{b∈B} αb
subject to: y · α = 0 and 0 ≤ αᵢ, αⱼ ≤ C

SLIDE 34

Final Solution

  • Note that −yᵢαᵢ = yN · αN + yⱼαⱼ (from the constraint y · α = 0)
  • Substitute αᵢ = −yᵢ(yN · αN + yⱼαⱼ) into the target
  • This leaves a one-variable optimization problem in αⱼ
  • It can be solved analytically (cf., e.g., [Lin 01])
  • Iterate (yielding a new α) until

    max{Fᵢ(α) | i ∈ I₀ ∪ I₃ ∪ I₄} ≤ min{Fᵢ(α) | i ∈ I₀ ∪ I₁ ∪ I₂} + ε