
Support Vector Machines

CSC 411 Tutorial, April 1, 2015
Tutor: Shenlong Wang
Many thanks to Renjie Liao, Jake Snell, Yujia Li and Kevin Swersky for much of the following material.

Brief Review of SVMs


Geometric Intuition

Margin Derivation

(Figure: a point in the (+) class displaced from the separating hyperplane by $d \cdot w / \|w\|$, where $d$ is its distance to the hyperplane.)

• Compute the distance of an arbitrary point $x$ in the (+) class to the separating hyperplane $w^\top x + b = 0$: $\dfrac{w^\top x + b}{\|w\|}$. If we let $y \in \{-1, +1\}$ denote the class of $x$, then the distance becomes $\dfrac{y\,(w^\top x + b)}{\|w\|}$.

• We can set $y\,(w^\top x + b) = 1$ for the point closest to the decision boundary, leading to the problem:

$$\max_{w,\,b} \; \frac{1}{\|w\|} \quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 \;\; \forall i$$
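As a quick illustration of the distance formula (an added sketch, not from the original slides), the following NumPy snippet computes $y(w^\top x + b)/\|w\|$ for a few points and a hand-picked hyperplane; the values of `w`, `b`, and the points are arbitrary illustrative choices:

```python
import numpy as np

# A hand-picked hyperplane w^T x + b = 0 (illustrative values only).
w = np.array([2.0, 1.0])
b = -1.0

# A few labelled points, with labels y in {-1, +1}.
X = np.array([[2.0, 2.0],
              [0.0, 0.0],
              [1.5, -0.5]])
y = np.array([+1, -1, +1])

# Signed distance of each point to the hyperplane: y * (w^T x + b) / ||w||.
# Positive values mean the point lies on the correct side of the boundary.
dist = y * (X @ w + b) / np.linalg.norm(w)
print(dist)

# The geometric margin of this hyperplane on the data set is the smallest such distance.
print("margin:", dist.min())
```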

SVM Problem

• But scaling $w$ and $b$ doesn't change the decision boundary.

• Or equivalently:

$$\min_{w,\,b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 \;\; \forall i$$
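To make the quadratic program concrete, here is a minimal sketch (my addition, not the tutorial's code) that solves the hard-margin problem directly with CVXPY on a tiny, linearly separable toy data set:

```python
import numpy as np
import cvxpy as cp

# Tiny, linearly separable toy data (illustrative only).
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],    # positive class
              [0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])  # negative class
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2   s.t.   y_i (w^T x_i + b) >= 1  for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

print("w =", w.value, "b =", b.value)
# The margin is 1 / ||w||; points with y_i (w^T x_i + b) == 1 are the support vectors.
print("margin =", 1.0 / np.linalg.norm(w.value))
```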

Non-linear SVMs

• For a linear SVM, $f(x) = w^\top x + b$.

• We can just as well work in an alternate feature space: $f(x) = w^\top \phi(x) + b$.

(Image: http://i.imgur.com/WuxyO.png)
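As a rough illustration (not part of the original tutorial code), a kernelized SVM from scikit-learn handles data that no linear SVM can separate in the original feature space; the data set and the `gamma` value are arbitrary choices for the sketch:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM struggles, while an RBF-kernel SVM implicitly maps the data
# into a feature space where the classes become separable.
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

print("linear SVM training accuracy:", linear_svm.score(X, y))
print("RBF SVM training accuracy:   ", rbf_svm.score(X, y))
```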


SVM with polynomial kernel visualization: http://www.youtube.com/watch?v=3liCbRZPrZA


Demo (by Andrej Karpathy and LIBSVM):

http://cs.stanford.edu/people/karpathy/svmjs/demo/
https://www.csie.ntu.edu.tw/~cjlin/libsvm/


SVMs vs Logistic Regression


Logistic Regression


• Assign probability to each outcome
• Train to maximize likelihood
• Linear decision boundary
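A small scikit-learn sketch (my addition) showing these three properties, with an arbitrary two-blob toy data set: `predict_proba` gives the per-outcome probabilities, `fit` maximizes the (penalized) likelihood, and the learned boundary is linear.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two Gaussian blobs (illustrative data).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# fit() maximizes the (regularized) likelihood of the labels.
clf = LogisticRegression().fit(X, y)

# Each point gets a probability for each outcome ...
print(clf.predict_proba(X[:3]))

# ... and the decision boundary is linear: w^T x + b = 0.
print("w =", clf.coef_, "b =", clf.intercept_)
```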

SVMs


• Enforce a margin of separation
• Train to find the maximum margin
• Linear decision boundary

Comparison

• Logistic regression wants to maximize the probability of the data: the greater the distance from each point to the decision boundary, the better.
• SVMs want to maximize the distance from the closest points (the support vectors) to the decision boundary; they don't care about points that aren't support vectors.
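A quick way to see this difference (a sketch I added, not from the slides): refit both models after dropping every point that is not a support vector. The SVM solution stays essentially the same, while the logistic regression weights shift. The data set and `C` value are arbitrary illustrative choices.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
lr = LogisticRegression().fit(X, y)

# Keep only the support vectors and refit both models.
sv = svm.support_                                # indices of the support vectors
svm_sv = SVC(kernel="linear", C=1.0).fit(X[sv], y[sv])
lr_sv = LogisticRegression().fit(X[sv], y[sv])

print("SVM weights (all data / support vectors only):", svm.coef_, svm_sv.coef_)
print("LR  weights (all data / support vectors only):", lr.coef_, lr_sv.coef_)
# The SVM weights should agree up to solver tolerance; the LR weights generally do not.
```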


A Different Take

• Consider an alternate form of the logistic regression decision function: instead of thresholding $p(y=1 \mid x)$ at $0.5$, predict the (+) class whenever the likelihood ratio $\dfrac{p(y=1 \mid x)}{p(y=-1 \mid x)}$ is at least $1$.

• Suppose we don't actually care about the probabilities. All we want to do is make the right decision.
• We can put a constraint on the likelihood ratio, for some constant $c$ (here for a point in the (+) class):

$$\frac{p(y=1 \mid x)}{p(y=-1 \mid x)} \ge c$$

• Take the log of both sides: $\log p(y=1 \mid x) - \log p(y=-1 \mid x) \ge \log c$.

• Recalling that $p(y=1 \mid x) = \dfrac{1}{1 + e^{-(w^\top x + b)}}$ and $p(y=-1 \mid x) = 1 - p(y=1 \mid x) = \dfrac{1}{1 + e^{\,w^\top x + b}}$, the left-hand side reduces to $w^\top x + b$, so the constraint becomes $w^\top x + b \ge \log c$.

• But $c$ is arbitrary, so set it s.t. $\log c = 1$: $w^\top x + b \ge 1$. Similarly, the negative sample case should be $w^\top x + b \le -1$. Try to derive it by yourself.
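The key step above is that, under the logistic model, the log likelihood ratio is exactly the linear score $w^\top x + b$. A tiny numeric check of the identity $\log \sigma(z) - \log \sigma(-z) = z$ (my own sketch, with $z$ standing in for $w^\top x + b$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# z plays the role of w^T x + b at a few hypothetical points.
z = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])

# log p(y=+1|x) - log p(y=-1|x) under the logistic model.
log_ratio = np.log(sigmoid(z)) - np.log(sigmoid(-z))

print(np.allclose(log_ratio, z))  # True: the log likelihood ratio equals w^T x + b
```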

• So now we have $y_i\,(w^\top x_i + b) \ge 1$ for every training point. But this may not have a unique solution, so put a quadratic penalty on the weights to make the solution unique:

$$\min_{w,\,b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 \;\; \forall i$$

• This gives us an SVM! By asking logistic regression to make the right decisions instead of maximizing the probability of the data, we derived an SVM.

Likelihood Ratio

The likelihood ratio $\dfrac{p(y=1 \mid x)}{p(y=-1 \mid x)}$ drives this derivation. Different classifiers assign different costs to $z = y\,(w^\top x + b)$, the log likelihood ratio in favour of the correct class.

LR Cost

• Choose $C(z) = \log\!\big(1 + e^{-z}\big)$ (for a positive example).

• Minimizing $\sum_i \log\!\big(1 + e^{-y_i(w^\top x_i + b)}\big)$ is the same as minimizing the negative log-likelihood objective for logistic regression!
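A short numeric check (my addition) that the cost $\log(1 + e^{-z})$, with $z = y(w^\top x + b)$, really is the negative log-likelihood $-\log p(y \mid x)$ of the logistic model; the parameters and data below are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w, b = np.array([1.5, -2.0]), 0.3      # arbitrary illustrative parameters
X = rng.normal(size=(5, 2))
y = rng.choice([-1, 1], size=5)        # labels in {-1, +1}

z = y * (X @ w + b)

# Cost on the log likelihood ratio ...
cost = np.log1p(np.exp(-z))

# ... versus the negative log-likelihood -log p(y|x) = -log sigmoid(y (w^T x + b)).
nll = -np.log(sigmoid(z))

print(np.allclose(cost, nll))  # True
```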

SVM with Slack Variables

If the data is not linearly separable, we can introduce slack variables $\xi_i \ge 0$, one per training point:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \forall i$$
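Extending the earlier CVXPY sketch with slack variables (again my addition, not the tutorial's code; the toy data and `C = 1.0` are arbitrary choices):

```python
import numpy as np
import cvxpy as cp

# Toy data that is NOT linearly separable (one point from each class is misplaced).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.2, 0.1],     # positive class
              [0.0, 0.0], [-1.0, 0.5], [2.5, 2.5]])   # negative class
y = np.array([1, 1, 1, -1, -1, -1])
C = 1.0

w = cp.Variable(2)
b = cp.Variable()
xi = cp.Variable(6, nonneg=True)  # one slack variable per training point

# min (1/2)||w||^2 + C * sum_i xi_i   s.t.   y_i (w^T x_i + b) >= 1 - xi_i,  xi_i >= 0
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), constraints)
prob.solve()

print("w =", w.value, "b =", b.value)
print("slacks =", np.round(xi.value, 3))  # nonzero slack marks a margin violation
```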

SVM Cost

• Choose $C(z) = \max(0,\, 1 - z)$ (the hinge loss).


Plotted in terms of $z$

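The original plots are not recoverable from this transcript, but they can be reproduced with a few lines of matplotlib (a sketch under the assumption that both costs are plotted against $z = y(w^\top x + b)$):

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-3, 3, 400)

lr_cost = np.log1p(np.exp(-z))    # logistic regression cost: log(1 + exp(-z))
svm_cost = np.maximum(0, 1 - z)   # SVM hinge cost: max(0, 1 - z)

plt.plot(z, lr_cost, label="LR cost: log(1 + exp(-z))")
plt.plot(z, svm_cost, label="SVM cost: max(0, 1 - z)")
plt.axvline(1.0, linestyle="--", linewidth=0.8)  # hinge cost is exactly 0 for z >= 1
plt.xlabel("z = y (w^T x + b)")
plt.ylabel("cost")
plt.legend()
plt.show()
```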


Exploiting the Connection between LR and SVMs


Kernel Trick for LR

• In the dual form, the SVM decision boundary is $f(x) = \sum_i \alpha_i\, y_i\, k(x_i, x) + b$.

• We could plug this into the LR cost: $\sum_i \log\!\big(1 + e^{-y_i f(x_i)}\big)$.
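A minimal sketch of this idea (my own illustration, not the tutorial's code): parameterize $f(x) = \sum_i \alpha_i\, k(x_i, x) + b$ with an RBF kernel (absorbing the labels $y_i$ into the coefficients) and fit the logistic loss by treating the kernel matrix as the feature matrix. The data set and `gamma` are arbitrary, and the L2 penalty here is on $\alpha$ rather than the RKHS norm, so this is only an approximation of "kernel LR".

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

# Toy non-linear problem.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Representer-style parameterization: f(x) = sum_i alpha_i k(x_i, x) + b.
# Fitting a linear LR on the kernel matrix K learns exactly such alpha_i.
K = rbf_kernel(X, X, gamma=2.0)
klr = LogisticRegression(max_iter=1000).fit(K, y)

alpha, b = klr.coef_.ravel(), klr.intercept_[0]
print("training accuracy:", klr.score(K, y))

# To predict at new points X_new: f = rbf_kernel(X_new, X, gamma=2.0) @ alpha + b
```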

Multi-class SVMs

• Recall multi-class logistic regression: $p(y = k \mid x) = \dfrac{\exp(w_k^\top x + b_k)}{\sum_j \exp(w_j^\top x + b_j)}$.

• Suppose instead we just want the decision rule to satisfy $\dfrac{p(y = y_i \mid x_i)}{p(y = j \mid x_i)} \ge c$ for every class $j \ne y_i$.

• Taking logs as before, $(w_{y_i} - w_j)^\top x_i + (b_{y_i} - b_j) \ge 1$ for all $j \ne y_i$.

• Now we have the quadratic program for multi-class SVMs:

$$\min_{W,\,b} \; \frac{1}{2}\sum_k \|w_k\|^2 \quad \text{s.t.} \quad (w_{y_i} - w_j)^\top x_i + (b_{y_i} - b_j) \ge 1 \;\; \forall i, \; \forall j \ne y_i$$
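scikit-learn ships a solver for a closely related soft-margin version of this formulation (the Crammer-Singer multi-class SVM); a quick sketch added here for illustration, with Iris as an arbitrary example data set:

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# multi_class="crammer_singer" optimizes a joint multi-class hinge objective,
# rather than training independent one-vs-rest binary SVMs.
clf = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000).fit(X, y)

print("per-class weight vectors:", clf.coef_.shape)   # one w_k per class
print("training accuracy:", clf.score(X, y))
```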

LR and SVMs are closely linked

Both can be viewed as taking a probabilistic model and minimizing some cost associated with the likelihood ratio. This allows us to extend both models in principled ways.


Which to Use?

Logistic regression:
• Gives calibrated probabilities that can be interpreted as confidence in a decision.
• Unconstrained, smooth objective.
• Can be used within Bayesian models.

SVMs:
• No penalty for examples where the correct decision is made with sufficient confidence, which can lead to good generalization.
• Dual form gives sparse solutions when using the kernel trick, leading to better scalability.
