Support Vector Machines
SLIDE 1

Support Vector Machines

Hypothesis Space

– variable size
– deterministic
– continuous parameters

Learning Algorithm

– linear and quadratic programming
– eager
– batch

SVMs combine three important ideas

– Apply optimization algorithms from Operations Research (Linear Programming and Quadratic Programming)
– Implicit feature transformation using kernels
– Control of overfitting by maximizing the margin

SLIDE 2

White Lie Warning

This first introduction to SVMs describes a special case with simplifying assumptions. We will revisit SVMs later in the quarter and remove the assumptions. The material you are about to see does not describe "real" SVMs.

SLIDE 3

Linear Programming

The linear programming problem is the following:

find w
minimize c · w
subject to w · aᵢ = bᵢ for i = 1, …, m
           wⱼ ≥ 0 for j = 1, …, n

There are fast algorithms for solving linear programs, including the simplex algorithm and Karmarkar's algorithm.
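
Handing this standard form to an off-the-shelf LP solver looks roughly like the sketch below (my own illustration, not part of the slides; the vectors c, aᵢ, and bᵢ are made-up toy data).

```python
# Minimal sketch (not from the slides): solving a standard-form LP
#     minimize  c . w    subject to  w . a_i = b_i,   w_j >= 0
# with an off-the-shelf solver.  The numbers are made-up toy data.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0, 0.0])            # objective coefficients
A = np.array([[1.0, 1.0, 1.0],           # each row is one a_i
              [2.0, 0.0, 1.0]])
b = np.array([4.0, 3.0])                 # the corresponding b_i

# bounds=(0, None) enforces w_j >= 0 for every variable.
res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))
print(res.x, res.fun)                    # optimal w and objective value
```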

SLIDE 4

Formulating LTU Learning as Linear Programming

Encode classes as {+1, –1}. LTU:

h(x) = +1 if w · x ≥ 0
     = –1 otherwise

An example (xᵢ, yᵢ) is classified correctly by h if

yᵢ · w · xᵢ > 0

Basic idea: The constraints on the linear programming problem will be of the form

yᵢ · w · xᵢ > 0

We need to introduce two more steps to convert this into the standard format for a linear program.
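
For concreteness, here is a minimal sketch of the LTU and the correctness condition (my own illustration, not from the slides):

```python
# Sketch of the LTU h(x) and the correctness test y_i * (w . x_i) > 0,
# with labels encoded as {+1, -1}.
import numpy as np

def ltu(w, x):
    """h(x) = +1 if w . x >= 0, -1 otherwise."""
    return 1 if np.dot(w, x) >= 0 else -1

def correctly_classified(w, x_i, y_i):
    """True iff example (x_i, y_i) satisfies y_i * (w . x_i) > 0."""
    return y_i * np.dot(w, x_i) > 0
```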

SLIDE 5

Converting to Standard LP Form

Step 1: Convert to equality constraints by using slack variables

– Introduce one slack variable sᵢ for each training example xᵢ and require that sᵢ ≥ 0:

yᵢ · w · xᵢ – sᵢ = 0

Step 2: Make all variables positive by subtracting pairs

– Replace each wⱼ by a difference of two variables: wⱼ = uⱼ – vⱼ, where uⱼ, vⱼ ≥ 0:

yᵢ · (u – v) · xᵢ – sᵢ = 0

Linear program:

– Find sᵢ, uⱼ, vⱼ
– Minimize (no objective function)
– Subject to:

yᵢ (∑ⱼ (uⱼ – vⱼ) xᵢⱼ) – sᵢ = 0
sᵢ ≥ 0, uⱼ ≥ 0, vⱼ ≥ 0

The linear program will have a solution iff the points are linearly separable.
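
A minimal sketch of this construction using scipy.optimize.linprog (my own code, not from the slides). One assumption differs from the slide: an LP solver cannot express the strict inequality yᵢ · w · xᵢ > 0, so the sketch enforces yᵢ · w · xᵢ ≥ 1 instead, which any strictly separating w can be rescaled to satisfy. A bias term can be handled by appending a constant 1 feature to every xᵢ.

```python
# Sketch (not the slides' exact formulation): LTU learning as an LP with
# variables u_j, v_j (so w_j = u_j - v_j) and slacks s_i, all >= 0.
import numpy as np
from scipy.optimize import linprog

def fit_ltu_lp(X, y):
    """Return a separating w, or None if the LP is infeasible.

    Deviation from the slide: the strict constraint y_i * (w . x_i) > 0 is
    enforced as y_i * (w . x_i) >= 1 (a rescaling of any strict separator).
    """
    m, n = X.shape
    n_vars = 2 * n + m                    # [u_1..u_n, v_1..v_n, s_1..s_m]
    c = np.zeros(n_vars)                  # no objective: pure feasibility
    A_eq = np.zeros((m, n_vars))
    for i in range(m):
        A_eq[i, :n] = y[i] * X[i]         # coefficients of u
        A_eq[i, n:2 * n] = -y[i] * X[i]   # coefficients of v
        A_eq[i, 2 * n + i] = -1.0         # coefficient of s_i
    b_eq = np.ones(m)                     # y_i * (u - v) . x_i - s_i = 1
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    if not res.success:
        return None                       # infeasible: not linearly separable
    u, v = res.x[:n], res.x[n:2 * n]
    return u - v                          # recovered weight vector w
```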

SLIDE 6

Example

30 random data points labeled according to the line x₂ = 1 + x₁

Pink line is the true classifier; blue line is the linear programming fit.

SLIDE 7

What Happens with Non-Separable Data?

Bad News: Linear Program is Infeasible

SLIDE 8

Higher Dimensional Spaces

Theorem: For any data set, there exists a mapping Φ to a higher-dimensional space such that the data is linearly separable.

Φ(x) = (φ₁(x), φ₂(x), …, φ_D(x))

Example: Map to quadratic space

– x = (x₁, x₂) (just two features)
– Φ(x) = (x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂, 1)
– compute a linear separator in this space
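
As a sketch (my own, not from the slides), the quadratic map can be written out directly and applied to every example before running the same linear program as before:

```python
# The quadratic feature map for x = (x1, x2).  Transforming every example
# with this map and then running the earlier LP in the 6-dimensional space
# yields a quadratic decision boundary in the original 2-dimensional space.
import numpy as np

def phi_quadratic(x):
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 ** 2, s * x1 * x2, x2 ** 2, s * x1, s * x2, 1.0])
```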

SLIDE 9

Drawback of this approach

The number of features increases rapidly. This makes the linear program much slower to solve.

SLIDE 10

Kernel Trick

A dot product between two higher-dimensional mappings can sometimes be implemented by a kernel function:

K(xᵢ, xⱼ) = Φ(xᵢ) · Φ(xⱼ)

Example: Quadratic Kernel

K(xᵢ, xⱼ) = (xᵢ · xⱼ + 1)²
          = (xᵢ₁xⱼ₁ + xᵢ₂xⱼ₂ + 1)²
          = xᵢ₁²xⱼ₁² + 2xᵢ₁xᵢ₂xⱼ₁xⱼ₂ + xᵢ₂²xⱼ₂² + 2xᵢ₁xⱼ₁ + 2xᵢ₂xⱼ₂ + 1
          = (xᵢ₁², √2 xᵢ₁xᵢ₂, xᵢ₂², √2 xᵢ₁, √2 xᵢ₂, 1) · (xⱼ₁², √2 xⱼ₁xⱼ₂, xⱼ₂², √2 xⱼ₁, √2 xⱼ₂, 1)
          = Φ(xᵢ) · Φ(xⱼ)
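
A quick numerical check of this identity (my own sketch, not from the slides):

```python
# Verify that the quadratic kernel equals the dot product of the explicit
# quadratic feature maps, K(x_i, x_j) = Phi(x_i) . Phi(x_j).
import numpy as np

def phi_quadratic(x):
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 ** 2, s * x1 * x2, x2 ** 2, s * x1, s * x2, 1.0])

def k_quadratic(xi, xj):
    return (np.dot(xi, xj) + 1.0) ** 2

xi = np.array([0.3, -1.2])
xj = np.array([2.0, 0.5])
# The two expressions agree up to floating-point rounding.
assert np.isclose(k_quadratic(xi, xj),
                  np.dot(phi_quadratic(xi), phi_quadratic(xj)))
```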

SLIDE 11

Idea

Reformulate the LTU linear program so that it only involves dot products between pairs of training examples. Then we can use kernels to compute these dot products. The running time of the algorithm will not depend on the number of dimensions D of the high-dimensional space.

SLIDE 12

Reformulating the LTU Linear Program

Claim: In the online Perceptron, w can be written as

w = ∑ⱼ αⱼ yⱼ xⱼ

Proof:

– Each weight update has the form wₜ := wₜ₋₁ + η gᵢₜ
– gᵢₜ is computed as gᵢₜ = errorᵢₜ yᵢ xᵢ (errorᵢₜ = 1 if xᵢ is misclassified in iteration t; 0 otherwise)
– Hence wₜ = w₀ + ∑ₜ ∑ᵢ η errorᵢₜ yᵢ xᵢ
– But w₀ = (0, 0, …, 0), so

wₜ = ∑ₜ ∑ᵢ η errorᵢₜ yᵢ xᵢ
wₜ = ∑ᵢ (∑ₜ η errorᵢₜ) yᵢ xᵢ
wₜ = ∑ᵢ αᵢ yᵢ xᵢ
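
A sketch of this dual (mistake-count) view of the Perceptron (my own code, not from the slides; it uses the common convention of treating yᵢ · w · xᵢ ≤ 0 as a mistake so the all-zero initial w gets updated):

```python
# Dual form of the online Perceptron: store one count alpha_i per example,
# so that w = sum_i alpha_i * y_i * x_i  (alpha_i = eta * #mistakes on x_i).
import numpy as np

def perceptron_dual(X, y, eta=1.0, epochs=10):
    m, _ = X.shape
    alpha = np.zeros(m)
    for _ in range(epochs):
        for i in range(m):
            w = (alpha * y) @ X               # w = sum_j alpha_j y_j x_j
            if y[i] * np.dot(w, X[i]) <= 0:   # mistake on example i
                alpha[i] += eta
    return alpha

# Recover the primal weight vector afterwards with  w = (alpha * y) @ X.
```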

SLIDE 13

Rewriting the Linear Separator Using Dot Products

Change of variables

– instead of optimizing w, optimize {αⱼ}
– Rewrite the constraint yᵢ w · xᵢ > 0 as yᵢ ∑ⱼ αⱼ yⱼ (xⱼ · xᵢ) > 0, or ∑ⱼ αⱼ yⱼ yᵢ (xⱼ · xᵢ) > 0

This substitution works because

w · xᵢ = (∑ⱼ αⱼ yⱼ xⱼ) · xᵢ = ∑ⱼ αⱼ yⱼ (xⱼ · xᵢ)

SLIDE 14

The Linear Program becomes

– Find {αⱼ}
– minimize (no objective function)
– subject to

∑ⱼ αⱼ yⱼ yᵢ (xⱼ · xᵢ) > 0
αⱼ ≥ 0

– Notes:

The weight αⱼ tells us how "important" example xⱼ is. If αⱼ is non-zero, then xⱼ is called a "support vector". To classify a new data point x, we take its dot product with the support vectors:

∑ⱼ αⱼ yⱼ (xⱼ · x) > 0?

SLIDE 15

Kernel Version of Linear Program

– Find {αⱼ}
– minimize (no objective function)
– subject to

∑ⱼ αⱼ yⱼ yᵢ K(xⱼ, xᵢ) > 0
αⱼ ≥ 0

Classify new x according to

∑ⱼ αⱼ yⱼ K(xⱼ, x) > 0?
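
Putting the pieces together, here is a minimal sketch of the kernelized linear program and the resulting classifier (my own code, not from the slides). As in the earlier LP sketch, the strict "> 0" is enforced as "≥ 1", which any feasible α can be rescaled to satisfy.

```python
# Kernelized feasibility LP: find alpha_j >= 0 with
#     sum_j alpha_j * y_j * y_i * K(x_j, x_i) >= 1   for every example i.
import numpy as np
from scipy.optimize import linprog

def fit_kernel_lp(X, y, kernel):
    y = np.asarray(y, dtype=float)
    m = len(X)
    K = np.array([[kernel(X[j], X[i]) for j in range(m)] for i in range(m)])
    # Row i of A_ub encodes  -sum_j y_i y_j K(x_j, x_i) alpha_j <= -1.
    A_ub = -(y[:, None] * y[None, :] * K)
    b_ub = -np.ones(m)
    res = linprog(np.zeros(m), A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    return res.x if res.success else None     # None: not separable under K

def classify(x, X, y, alpha, kernel):
    """Predict the sign of sum_j alpha_j y_j K(x_j, x)."""
    score = sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X)))
    return 1 if score > 0 else -1
```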

SLIDE 16

Two support vectors (blue) with α₁ = 0.205 and α₂ = 0.338. Equivalent to the line x₂ = –0.0974 + 1.341 x₁.

SLIDE 17

Solving the Non-Separable Case with Cubic Polynomial Kernel

SLIDE 18

Kernels

Dot product: K(xᵢ, xⱼ) = xᵢ · xⱼ

Polynomial of degree d: K(xᵢ, xⱼ) = (xᵢ · xⱼ + 1)ᵈ

Gaussian with scale σ: K(xᵢ, xⱼ) = exp(–||xᵢ – xⱼ||² / σ²)

Polynomials often give strange boundaries. Gaussians generally work well.
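
Written out as code, the three kernels look like this (my own sketch; the default parameter values are arbitrary):

```python
# The three kernels from this slide.
import numpy as np

def dot_kernel(xi, xj):
    return np.dot(xi, xj)

def poly_kernel(xi, xj, d=3):
    return (np.dot(xi, xj) + 1.0) ** d

def gaussian_kernel(xi, xj, sigma=2.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma ** 2)
```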

SLIDE 19

Gaussian kernel with σ² = 4

The Gaussian kernel is equivalent to an infinite-dimensional feature space!

SLIDE 20

Evaluation of SVMs

| Criterion                | Perc | Logistic | LDA | Trees    | Nets     | SVM   | NNbr     |
|--------------------------|------|----------|-----|----------|----------|-------|----------|
| Mixed data               | no   | no       | no  | yes      | no       | no    | no       |
| Missing values           | no   | no       | yes | yes      | no       | no    | somewhat |
| Outliers                 | no   | yes      | no  | yes      | yes      | yes   | yes      |
| Monotone transformations | no   | no       | no  | yes      | somewhat | no    | no       |
| Scalability              | yes  | yes      | yes | yes      | yes      | no    | no       |
| Irrelevant inputs        | no   | no       | no  | somewhat | no       | yes*  | no       |
| Linear combinations      | yes  | yes      | yes | no       | yes      | yes   | somewhat |
| Interpretable            | yes  | yes      | yes | yes      | no       | yes** | no       |
| Accurate                 | yes  | yes      | yes | no       | yes      | yes   | no       |

* = dot product kernel with absolute value penalty
** = dot product kernel

SLIDE 21

Support Vector Machines Summary

Advantages of SVMs

– variable-sized hypothesis space
– polynomial-time exact optimization rather than approximate methods (unlike decision trees and neural networks)
– Kernels allow very flexible hypotheses

Disadvantages of SVMs

– Must choose kernel and kernel parameters: Gaussian, σ
– Very large problems are computationally intractable: quadratic in the number of examples; problems with more than 20,000 examples are very difficult to solve exactly
– Batch algorithm

SLIDE 22

SVMs Unify LTUs and Nearest Neighbor

With Gaussian kernel

– compute distance to a set of "support vector" nearest neighbors
– transform through a Gaussian (nearer neighbors get bigger votes)
– take a weighted sum of those distances for each class
– classify to the class with the most votes