

SLIDE 1

Learning From Data Lecture 23 SVM’s: Maximizing the Margin

  • A Better Hyperplane
  • Maximizing the Margin
  • Link to Regularization

M. Magdon-Ismail
CSCI 4100/6100

SLIDE 2

recap: Linear Models, RBFs, Neural Networks

Linear Model with Nonlinear Transform:
  h(x) = θ( w0 + Σ_{j=1..d̃} w_j Φ_j(x) )

Neural Network:
  h(x) = θ( w0 + Σ_{j=1..m} w_j θ(v_j^t x) )        (fit by gradient descent)

k-RBF-Network:
  h(x) = θ( w0 + Σ_{j=1..k} w_j φ(||x − µ_j||) )     (centers µ_j from k-means)

Neural Network: a generalization of the linear model obtained by adding layers.
Support Vector Machine: a more ‘robust’ linear model.
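For concreteness, here is a small NumPy sketch of the three hypothesis forms above (my own illustration, not from the lecture); taking θ = tanh and a Gaussian bump for φ is an assumption made just for this sketch.

import numpy as np

theta = np.tanh                                    # soft threshold (illustrative choice)

def linear_model(x, w0, w, Phi):
    # h(x) = theta(w0 + sum_j w_j * Phi_j(x)); Phi maps x to the nonlinear features
    return theta(w0 + w @ Phi(x))

def neural_net(x, w0, w, V):
    # one hidden layer: h(x) = theta(w0 + sum_j w_j * theta(v_j^t x)); V has rows v_j
    return theta(w0 + w @ theta(V @ x))

def rbf_network(x, w0, w, centers, r=1.0):
    # h(x) = theta(w0 + sum_j w_j * phi(||x - mu_j||)); phi = Gaussian bump of scale r
    phi = np.exp(-0.5 * (np.linalg.norm(x - centers, axis=1) / r) ** 2)
    return theta(w0 + w @ phi)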


Which separator to pick? − →

SLIDE 3

Which Separator Do You Pick?



Robustness to noise − →

SLIDE 4

Robustness to Noisy Data

Being robust to noise (measurement error) is good (remember regularization).


Thicker cushion means more robust − →

SLIDE 5

Thicker Cushion Means More Robustness

We call such hyperplanes fat.


Two crucial questions − →

SLIDE 6

Two Crucial Questions

1. Can we efficiently find the fattest separating hyperplane?
2. Is a fatter hyperplane better than a thin one?


Pulling out the bias − →

SLIDE 7

Pulling Out the Bias

Before:  x ∈ {1} × R^d,  w ∈ R^(d+1)

  x = (1, x_1, ..., x_d)^t,   w = (w_0, w_1, ..., w_d)^t,   signal = w^t x.

Now:  x ∈ R^d,  b ∈ R,  w ∈ R^d

  x = (x_1, ..., x_d)^t,   w = (w_1, ..., w_d)^t,   bias b,   signal = w^t x + b.


Separating the data − →

SLIDE 8

Separating The Data

w^t x_n + b > 0 on one side of the hyperplane;  w^t x_n + b < 0 on the other.

Hyperplane h = (b, w).

h separates the data means:  y_n(w^t x_n + b) > 0 for all n.

By rescaling the weights and bias, we can take  min_{n=1,...,N} y_n(w^t x_n + b) = 1.

(Renormalize the weights so that the size of the signal w^t x + b is meaningful.)
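As a concrete illustration (mine, not the lecture's), this renormalization divides b and w by min_n y_n(w^t x_n + b), which is positive for any separating hyperplane, so that the minimum becomes exactly 1:

import numpy as np

def rescale_separator(b, w, X, y):
    # X: N x d data matrix, y: length-N vector of +/-1 labels
    rho = np.min(y * (X @ w + b))       # min_n y_n (w^t x_n + b); > 0 if (b, w) separates
    assert rho > 0, "(b, w) does not separate the data"
    return b / rho, w / rho             # now min_n y_n (w^t x_n + b) = 1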


Distance to the hyperplane − →

SLIDE 9

Distance to the Hyperplane

Let x_1, x_2 be any two points on the hyperplane h, and let x be the point whose distance to h we want.

w is normal to the hyperplane:
  w^t(x_2 − x_1) = w^t x_2 − w^t x_1 = −b + b = 0    (because w^t x = −b on the hyperplane)

Unit normal: u = w / ||w||.

  dist(x, h) = |u^t(x − x_1)| = (1/||w||) · |w^t x − w^t x_1| = (1/||w||) · |w^t x + b|.


Fatness of a separating hyperplane − →

SLIDE 10

Fatness of a Separating Hyperplane

dist(x, h) = (1/||w||) · |w^t x + b|

Fatness = distance to the closest data point.

Since h separates the data, |w^t x_n + b| = |y_n(w^t x_n + b)| = y_n(w^t x_n + b), so

  dist(x_n, h) = (1/||w||) · y_n(w^t x_n + b).

Fatness = min_n dist(x_n, h)
        = (1/||w||) · min_n y_n(w^t x_n + b)      ←− the separation condition sets this min to 1
        = 1/||w||                                 ←− the margin γ(h)
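These two formulas translate directly into code; a minimal NumPy sketch (illustrative only):

import numpy as np

def dist_to_hyperplane(x, b, w):
    # dist(x, h) = |w^t x + b| / ||w||
    return abs(w @ x + b) / np.linalg.norm(w)

def margin(b, w, X, y):
    # gamma(h) = min_n y_n (w^t x_n + b) / ||w||; equals 1/||w|| under the
    # normalization min_n y_n (w^t x_n + b) = 1 from the previous slide
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)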


Maximizing the margin − →

SLIDE 11

Maximizing the Margin

margin γ(h) = 1/||w||        ←− the bias b does not appear here

Maximizing the margin is therefore the optimization problem

  minimize_{b,w}   (1/2) w^t w
  subject to:      min_{n=1,...,N} y_n(w^t x_n + b) = 1.


Equivalent form − →

SLIDE 12

Maximizing the Margin

margin γ(h) = 1/||w||

The constraint  min_{n=1,...,N} y_n(w^t x_n + b) = 1  can be replaced by N separate inequalities
without changing the optimum:

  minimize_{b,w}   (1/2) w^t w
  subject to:      y_n(w^t x_n + b) ≥ 1    for n = 1, . . . , N.

(If every inequality held with slack, so that min_n y_n(w^t x_n + b) > 1, then b and w could be
scaled down, reducing w^t w; so at the optimum the minimum is attained with value exactly 1.)


Example – our toy data set − →

SLIDE 13

Example – Our Toy Data Set

The constraints y_n(w^t x_n + b) ≥ 1 for the toy data

  X = [ 0 0          y = [ −1
        2 2                −1
        2 0                +1
        3 0 ]              +1 ]

read

  −b ≥ 1                        (i)
  −(2w_1 + 2w_2 + b) ≥ 1        (ii)
  2w_1 + b ≥ 1                  (iii)
  3w_1 + b ≥ 1                  (iv)

(i) and (iii) give  w_1 ≥ 1;   (ii) and (iii) give  w_2 ≤ −1.
So (1/2)(w_1^2 + w_2^2) ≥ 1, with equality at (b, w_1, w_2) = (−1, 1, −1).

Optimal hyperplane:  g(x) = sign(x_1 − x_2 − 1)
margin:  1/||w*|| = 1/√2 ≈ 0.707.

For data points (i), (ii) and (iii),  y_n(w*^t x_n + b*) = 1  — these are the support vectors.

[Figure: the toy data with the optimal hyperplane x_1 − x_2 − 1 = 0 and margin 0.707.]
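A quick numerical check of this solution (my own sketch, not part of the slides):

import numpy as np

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., +1., +1.])
b_star, w_star = -1.0, np.array([1.0, -1.0])

signals = y * (X @ w_star + b_star)       # y_n (w*^t x_n + b*) for each point
print(signals)                            # [1. 1. 1. 2.] -> points (i)-(iii) are support vectors
print(1 / np.linalg.norm(w_star))         # margin = 1/sqrt(2) ~ 0.707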


Quadratic programming − →

SLIDE 14

Quadratic Programming

Quadratic programming solves problems of the form

  minimize_{u ∈ R^q}   (1/2) u^t Q u + p^t u
  subject to:          A u ≥ c

  u* ← QP(Q, p, A, c)

(Q = 0 gives linear programming.)
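The lecture treats QP(Q, p, A, c) as a black box. As one possible stand-in, here is a hedged sketch built on SciPy's general-purpose SLSQP optimizer; the function name QP and the choice of solver are assumptions for illustration, and a dedicated QP package would normally be preferred.

import numpy as np
from scipy.optimize import minimize

def QP(Q, p, A, c):
    # Solve: minimize (1/2) u^t Q u + p^t u   subject to   A u >= c
    Q, p, A, c = (np.asarray(M, dtype=float) for M in (Q, p, A, c))
    obj  = lambda u: 0.5 * u @ Q @ u + p @ u
    grad = lambda u: Q @ u + p                             # gradient (Q assumed symmetric)
    cons = {'type': 'ineq', 'fun': lambda u: A @ u - c}    # 'ineq' means fun(u) >= 0
    res = minimize(obj, x0=np.zeros(len(p)), jac=grad,
                   constraints=[cons], method='SLSQP')
    return res.x

# usage: u_star = QP(Q, p, A, c)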


Maximum margin hyperplane is QP − →

SLIDE 15

Maximum Margin Hyperplane is QP

The maximum-margin problem

  minimize_{b,w}   (1/2) w^t w
  subject to:      y_n(w^t x_n + b) ≥ 1    for n = 1, . . . , N

matches the QP template

  minimize_{u ∈ R^q}   (1/2) u^t Q u + p^t u
  subject to:          A u ≥ c

with  u = [ b ; w ] ∈ R^(d+1).

Objective:

  (1/2) w^t w = (1/2) [ b  w^t ] [ 0    0_d^t ] [ b ]  =  (1/2) u^t Q u
                                 [ 0_d  I_d   ] [ w ]

  ⇒  Q = [ 0    0_d^t ]      p = 0_{d+1}
         [ 0_d  I_d   ]

Constraints:

  y_n(w^t x_n + b) ≥ 1   ≡   [ y_n   y_n x_n^t ] u ≥ 1

  ⇒  A = [ y_1   y_1 x_1^t ]      c = [ 1 ]
         [ ...   ...       ]          [...]
         [ y_N   y_N x_N^t ]          [ 1 ]


Back to our example − →

SLIDE 16

Back To Our Example

Exercise:

For the toy data set

  X = [ 0 0          y = [ −1
        2 2                −1
        2 0                +1
        3 0 ]              +1 ]

the constraints y_n(w^t x_n + b) ≥ 1 read

  −b ≥ 1                        (i)
  −(2w_1 + 2w_2 + b) ≥ 1        (ii)
  2w_1 + b ≥ 1                  (iii)
  3w_1 + b ≥ 1                  (iv)

Show that

  Q = [ 0 0 0 ]      p = [ 0 ]      A = [ −1   0   0 ]      c = [ 1 ]
      [ 0 1 0 ]          [ 0 ]          [ −1  −2  −2 ]          [ 1 ]
      [ 0 0 1 ]          [ 0 ]          [  1   2   0 ]          [ 1 ]
                                        [  1   3   0 ]          [ 1 ]

Use your QP-solver to obtain  (b*, w_1*, w_2*) = (−1, 1, −1).
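For reference, a self-contained numerical solution of the exercise using the same SciPy-based stand-in for the QP-solver (the expected output is the (b*, w_1*, w_2*) quoted above, up to solver tolerance):

import numpy as np
from scipy.optimize import minimize

# u = (b, w1, w2)
Q = np.diag([0., 1., 1.])
p = np.zeros(3)
A = np.array([[-1.,  0.,  0.],    # (i)
              [-1., -2., -2.],    # (ii)
              [ 1.,  2.,  0.],    # (iii)
              [ 1.,  3.,  0.]])   # (iv)
c = np.ones(4)

res = minimize(lambda u: 0.5 * u @ Q @ u + p @ u, x0=np.zeros(3), method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': lambda u: A @ u - c}])
print(np.round(res.x, 3))         # expected (approximately): [-1.  1. -1.]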


Primal QP algorithm − →

SLIDE 17

Primal QP algorithm for linear-SVM

1: Let p = 0_{d+1} be the (d + 1)-vector of zeros and c = 1_N the N-vector of ones.
   Construct the matrices Q and A, where

     Q = [ 0    0_d^t ]          A = [ y_1  —y_1 x_1^t— ]
         [ 0_d  I_d   ]              [ ...       ...    ]    ←− the ‘signed data matrix’
                                     [ y_N  —y_N x_N^t— ]

2: Return  [ b* ; w* ] = u* ← QP(Q, p, A, c).

3: The final hypothesis is  g(x) = sign(w*^t x + b*).
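Putting the three steps together, here is a sketch of the primal linear-SVM algorithm in NumPy/SciPy. The QP step again uses the general-purpose SLSQP stand-in rather than a dedicated QP solver, and the function names are mine, not the course's.

import numpy as np
from scipy.optimize import minimize

def svm_linear_primal(X, y):
    # Hard-margin linear SVM (primal QP). X: N x d separable data, y: labels in {-1, +1}.
    N, d = X.shape
    Q = np.zeros((d + 1, d + 1))
    Q[1:, 1:] = np.eye(d)                                   # step 1: build Q, p, A, c
    p = np.zeros(d + 1)
    A = np.hstack([y[:, None], y[:, None] * X])             # the 'signed data matrix'
    c = np.ones(N)
    obj  = lambda u: 0.5 * u @ Q @ u + p @ u
    cons = {'type': 'ineq', 'fun': lambda u: A @ u - c}     # A u >= c
    u = minimize(obj, np.zeros(d + 1), jac=lambda u: Q @ u + p,
                 constraints=[cons], method='SLSQP').x      # step 2: u* <- QP(Q, p, A, c)
    return u[0], u[1:]                                      # b*, w*

def g(x, b, w):
    return np.sign(w @ x + b)                               # step 3: final hypothesis

# usage on the toy data set from the earlier slides
X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., 1., 1.])
b_star, w_star = svm_linear_primal(X, y)
print(round(b_star, 3), np.round(w_star, 3))                # approximately -1.0 [ 1. -1.]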


Example: SVM vs PLA − →

SLIDE 18

Example: SVM vs PLA

[Figure: distribution of E_out for PLA over different orderings of the data, compared with E_out(SVM); E_out scale 0.02–0.08.]

PLA depends on the ordering of the data (e.g. a random ordering).
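A rough recreation of this kind of comparison (my own setup and parameter choices, not the experiment behind the figure): run PLA on a random ordering of separable data, fit an (approximately) hard-margin linear SVM with scikit-learn's SVC using a large C, and estimate E_out on a large held-out sample.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_separable_data(N, d=2):
    # random linear target; resample until both classes appear
    w_f, b_f = rng.standard_normal(d), rng.standard_normal()
    while True:
        X = rng.uniform(-1, 1, size=(N, d))
        y = np.sign(X @ w_f + b_f)
        if 0 < np.sum(y == 1) < N:
            return X, y, (w_f, b_f)

def pla(X, y, max_passes=1000):
    # perceptron learning algorithm; the update order is a fresh random permutation each pass
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_passes):
        mistakes = 0
        for n in rng.permutation(N):
            if y[n] * (X[n] @ w + b) <= 0:
                w, b = w + y[n] * X[n], b + y[n]
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

X, y, (w_f, b_f) = make_separable_data(40)
X_test = rng.uniform(-1, 1, size=(100000, 2))
y_test = np.sign(X_test @ w_f + b_f)

w_p, b_p = pla(X, y)
E_pla = np.mean(np.sign(X_test @ w_p + b_p) != y_test)

svm = SVC(kernel='linear', C=1e6).fit(X, y)      # large C approximates the hard margin
E_svm = np.mean(svm.predict(X_test) != y_test)

print(f"E_out(PLA) ~ {E_pla:.3f}    E_out(SVM) ~ {E_svm:.3f}")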


Link to regularization

SLIDE 19

Link to Regularization

Regularization:

  minimize_w   E_in(w)
  subject to:  w^t w ≤ C.

                      optimal hyperplane        regularization
  minimize            w^t w                     E_in
  subject to          E_in = 0                  w^t w ≤ C
