

SLIDE 1

Lecture 10 Support Vector Machines

Oct-20-2008

SLIDE 2

Linear Separators

  • Which of the linear separators is optimal?

[Figure: two classes of points (+ and −) separated by several candidate linear separators]

SLIDE 3

Concept of Margin

  • Recall that in Perceptron, we learned that the convergence rate of the Perceptron algorithm depends on a concept called the margin.

SLIDE 4

Intuition of Margin

  • Consider points A, B, and C.
  • We are quite confident in our prediction for A because it is far from the decision boundary.
  • In contrast, we are not so confident in our prediction for C, because a slight change in the decision boundary may flip the decision.
  • Given a training set, we would like to make all of our predictions correct and confident! This can be captured by the concept of margin.

[Figure: decision boundary w · x + b = 0, with w · x + b > 0 on the positive side and w · x + b < 0 on the negative side; points A, B, and C lie at decreasing distances from the boundary]

SLIDE 5

Functional Margin

  • One possible way to define the margin: y_i (w · x_i + b)
  • We define this as the functional margin of the linear classifier w.r.t. training example (x_i, y_i).
  • The larger the value, the better – really?
  • What if we rescale (w, b) by a factor α? Consider the linear classifier specified by (αw, αb):
    – The decision boundary remains the same
    – Yet, the functional margin gets multiplied by α
    – We can change the functional margin of a linear classifier without changing anything meaningful
    – We need something more meaningful (the rescaling issue is illustrated numerically below)
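To make the rescaling issue concrete, here is a minimal NumPy sketch (the numbers and names are hypothetical, not taken from the lecture) showing that scaling (w, b) by α inflates the functional margin while leaving every prediction, and hence the decision boundary, unchanged:

```python
import numpy as np

# Hypothetical toy numbers, purely to illustrate the rescaling problem above.
w, b = np.array([2.0, -1.0]), 0.5
x_i, y_i = np.array([1.0, 1.0]), +1            # one training example (x_i, y_i)

def functional_margin(w, b, x, y):
    # Functional margin of (w, b) w.r.t. a single example: y * (w . x + b)
    return y * (np.dot(w, x) + b)

alpha = 10.0                                   # rescale (w, b) -> (alpha*w, alpha*b)
print(functional_margin(w, b, x_i, y_i))                   # 1.5
print(functional_margin(alpha * w, alpha * b, x_i, y_i))   # 15.0: multiplied by alpha
# The predicted label sign(w . x + b) is identical under both scalings,
# so the decision boundary itself has not changed at all.
print(np.sign(np.dot(w, x_i) + b) == np.sign(np.dot(alpha * w, x_i) + alpha * b))  # True
```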

SLIDE 6

What we really want

[Figure: two classes of points with the decision boundary w · x + b = 0; points A, B, and C are highlighted]

  • We want the distances between the examples and the decision boundary to be large – this quantity is what we call the geometric margin.
  • But how do we compute the geometric margin of a data point w.r.t. a particular line (parameterized by w and b)?

SLIDE 7

Some basic facts about lines

  • The distance from a point x_1 to the line w · x + b = 0 is |w · x_1 + b| / ||w|| (a quick numeric check follows below).
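As a quick numeric check of this fact (the numbers below are hypothetical, not from the lecture), the distance can be computed directly:

```python
import numpy as np

# Distance from a point x_1 to the line w . x + b = 0 is |w . x_1 + b| / ||w||.
w, b = np.array([3.0, 4.0]), -5.0              # hypothetical line parameters
x_1 = np.array([4.0, 3.0])                     # hypothetical point
dist = abs(np.dot(w, x_1) + b) / np.linalg.norm(w)
print(dist)                                    # (12 + 12 - 5) / 5 = 3.8
```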

SLIDE 8

Geometric Margin

[Figure: points A, B, C on either side of the decision boundary, with γ_A marking the distance from A to the boundary]

  • The geometric margin of (w, b) w.r.t. x_i is the distance from x_i to the decision surface.
  • This distance can be computed as

      γ_i = y_i (w · x_i + b) / ||w||

  • Given a training set S = {(x_i, y_i): i = 1, …, N}, the geometric margin of the classifier w.r.t. S is

      γ = min_{i=1,…,N} γ_i

Note that the points closest to the boundary are called the support vectors – in fact, these are the only points that really matter; the other examples are ignorable (a small numeric sketch of these quantities follows below).
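As a concrete illustration (hypothetical toy data, not from the lecture), the per-example geometric margins, the margin of the classifier w.r.t. the whole set, and the closest points can all be computed directly from the formulas above:

```python
import numpy as np

# Hypothetical toy data; labels y_i are in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0               # some separating hyperplane w . x + b = 0

# Geometric margin of each example: gamma_i = y_i * (w . x_i + b) / ||w||
gamma = y * (X @ w + b) / np.linalg.norm(w)

print(gamma)              # per-example geometric margins
print(gamma.min())        # margin of (w, b) w.r.t. S: the minimum over all examples
print(np.argmin(gamma))   # index of a closest point, i.e. a support-vector candidate
```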

SLIDE 9

What we have done so far

  • We have established that we want to find a linear decision boundary whose margin is the largest.

  • We know how to measure the margin of a linear decision boundary.

  • Now what?
  • We have a new learning objective

– Given a linearly separable training set S = {(x_i, y_i): i = 1, …, N} (this assumption will be relaxed later), we would like to find a linear classifier (w, b) with maximum margin.

SLIDE 10

Maximum Margin Classifier

  • This can be represented as a constrained optimization problem:

      max_{w,b} γ
      subject to: y_i (w · x_i + b) / ||w|| ≥ γ,  i = 1, …, N

  • This optimization problem is in a nasty form, so we need to do some rewriting.
  • Let γ' = γ · ||w||. We can rewrite the problem as:

      max_{w,b} γ' / ||w||
      subject to: y_i (w · x_i + b) ≥ γ',  i = 1, …, N

SLIDE 11

Maximum Margin Classifier

  • Note that we can arbitrarily rescale w and b to make the functional margin γ' large or small.
  • So we can rescale them such that γ' = 1:

      max_{w,b} 1 / ||w||   (or, equivalently, min_{w,b} ||w||²)
      subject to: y_i (w · x_i + b) ≥ 1,  i = 1, …, N

Maximizing the geometric margin is equivalent to minimizing the magnitude of w subject to maintaining a functional margin of at least 1

SLIDE 12

Solving the Optimization Problem

      min_{w,b} (1/2) ||w||²
      subject to: y_i (w · x_i + b) ≥ 1,  i = 1, …, N

  • This results in a quadratic optimization problem with linear inequality constraints.
  • This is a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist.
    – In practice, we can just regard the QP solver as a “black box” without bothering about how it works (one such black-box call is sketched below).
  • You will be spared the excruciating details and jump straight to the solution.
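As one concrete way of treating the QP solver as a black box, here is a minimal sketch using the cvxpy library on hypothetical toy data (the library choice and the data are assumptions, not part of the lecture):

```python
import numpy as np
import cvxpy as cp

# Hypothetical, linearly separable toy data; labels y_i are in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# Primal hard-margin SVM: minimize (1/2)||w||^2
# subject to y_i * (w . x_i + b) >= 1 for every training example.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)               # the maximum-margin (w, b)
print(y * (X @ w.value + b.value))    # functional margins: all >= 1, and the
                                      # examples that hit exactly 1 are the support vectors
```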

SLIDE 13

The solution

  • We cannot give you a closed-form solution into which you can directly plug the numbers for an arbitrary data set.
  • But the solution can always be written in the following form:

      w = Σ_{i=1}^{N} α_i y_i x_i,   s.t. Σ_{i=1}^{N} α_i y_i = 0

  • This is the form of w; b can be calculated accordingly using some additional steps.

  • The weight vector is a linear combination of all the training examples.
  • Importantly, many of the α_i's are zeros.
  • The points that have non-zero α_i's are the support vectors (a small sketch recovering w from the α_i's follows below).
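To see this form in practice, here is a sketch using scikit-learn's SVC with a linear kernel (the library and the toy data are assumptions, not something the lecture uses); it shows that only a few α_i are non-zero and that w is exactly Σ α_i y_i x_i over the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data; a QP/SMO solver inside SVC computes the alpha_i's.
X = np.array([[2.0, 2.0], [3.0, 1.0], [4.0, 4.0],
              [-1.0, -1.0], [-2.0, 0.0], [-3.0, -4.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # very large C approximates the hard margin

# dual_coef_ stores alpha_i * y_i for the support vectors only;
# every other training point has alpha_i = 0 and is ignorable.
print(clf.support_)                            # indices of the support vectors
print(clf.dual_coef_)                          # alpha_i * y_i for those points

# The weight vector is a linear combination of the training examples:
# w = sum_i alpha_i * y_i * x_i (only support vectors contribute).
w_from_alphas = clf.dual_coef_ @ X[clf.support_]
print(np.allclose(w_from_alphas, clf.coef_))   # True: matches the fitted weight vector
```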

SLIDE 14

A Geometrical Interpretation

[Figure: Class 1 and Class 2 points with the maximum-margin boundary; most α_i are zero (α_2 = α_3 = α_4 = α_5 = α_7 = α_9 = α_10 = 0), and only the support vectors carry non-zero weights: α_1 = 0.8, α_6 = 1.4, α_8 = 0.6]

SLIDE 15

A few important notes regarding the geometric interpretation

  • w · x + b = 0 gives the decision boundary
  • positive support vectors lie on the line w · x + b = 1
  • negative support vectors lie on the line w · x + b = −1

  • We can now think of the decision boundary as a tube of a certain width; no points can be inside the tube.
    – Learning involves adjusting the location and orientation of the tube to find the largest tube that fits the given training set (a numeric check of the tube edges is sketched below).
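Continuing the scikit-learn sketch from the solution slide (still hypothetical data), one can check numerically that the support vectors sit on the two edges of the tube while every other training point lies strictly outside it:

```python
# Reusing clf and X from the earlier (hypothetical) SVC sketch:
# support vectors satisfy w . x + b = +1 or -1, i.e. they lie on the tube edges.
edge_values = clf.decision_function(X[clf.support_])   # w . x + b for each support vector
print(np.round(edge_values, 3))                        # approximately +1 and -1
# All remaining training points satisfy |w . x + b| > 1 (strictly outside the tube).
```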