

SLIDE 1

Support Vector Machines

Prof. Mike Hughes

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/

Many ideas/slides attributable to: Dan Sheldon (U.Mass.); James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

SVM Unit Objectives

Big ideas: Support Vector Machine

  • Why maximize margin?
  • What is a support vector?
  • What is hinge loss?
  • Advantages over logistic regression
      • Less sensitive to outliers
      • Advantages from sparsity when using kernels
  • Disadvantages
      • Not probabilistic
      • Less elegant to do multi-class

SLIDE 3

What will we learn?


[Diagram: overview of machine learning paradigms — supervised learning, unsupervised learning, reinforcement learning. Supervised learning uses data/label pairs {x_n, y_n}, n = 1, ..., N, with a performance measure, and proceeds through training, prediction, and evaluation.]

SLIDE 4

Recall: Kernelized Regression

SLIDE 5

Linear Regression

SLIDE 6

Kernel Function


k(x_i, x_j) = φ(x_i)^T φ(x_j)

Interpretation: a similarity function for x_i and x_j.
Properties:
  • Input: any two feature vectors
  • Output: a real scalar; larger values mean more similar
  • Symmetric: k(x_i, x_j) = k(x_j, x_i)
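As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of two kernels with these properties; the bandwidth parameter gamma is an assumed hyperparameter.

```python
import numpy as np

def linear_kernel(x_i, x_j):
    # k(x_i, x_j) = x_i^T x_j  (phi is the identity map)
    return np.dot(x_i, x_j)

def rbf_kernel(x_i, x_j, gamma=1.0):
    # Squared-exponential / Gaussian kernel: larger value = more similar
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

x_i = np.array([1.0, 2.0])
x_j = np.array([1.5, 1.0])
# Both kernels are symmetric: k(x_i, x_j) == k(x_j, x_i)
print(linear_kernel(x_i, x_j), rbf_kernel(x_i, x_j, gamma=0.5))
```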

SLIDE 7

Common Kernels

SLIDE 8
Kernel Matrices

K = [ k(x_1, x_1)  k(x_1, x_2)  ...  k(x_1, x_N)
      k(x_2, x_1)  k(x_2, x_2)  ...  k(x_2, x_N)
      ...
      k(x_N, x_1)  k(x_N, x_2)  ...  k(x_N, x_N) ]

  • K_train : N x N symmetric matrix (training examples vs. training examples)
  • K_test : T x N matrix (test examples vs. training examples)
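As an illustrative sketch (not from the slides), one way to build these matrices with NumPy, assuming an RBF kernel:

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

def kernel_matrix(X_a, X_b, kernel=rbf_kernel):
    # Entry (i, j) holds kernel(X_a[i], X_b[j])
    return np.array([[kernel(a, b) for b in X_b] for a in X_a])

X_train = np.random.randn(5, 2)   # N = 5 training examples, 2 features
X_test = np.random.randn(3, 2)    # T = 3 test examples

K_train = kernel_matrix(X_train, X_train)  # N x N, symmetric
K_test = kernel_matrix(X_test, X_train)    # T x N: test rows vs. training columns
```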

SLIDE 9

Linear Kernel Regression

SLIDE 10

Radial basis kernel aka Gaussian aka Squared Exponential

SLIDE 11


Compare: Linear Regression

Prediction: a linear transform of the G-dimensional features

    ŷ(x_i, θ) = θ^T φ(x_i) = Σ_{g=1}^G θ_g φ(x_i)_g

Training: solve a G-dimensional optimization problem

    min_θ Σ_{n=1}^N (y_n − ŷ(x_n, θ))^2   (+ L2 penalty, optional)
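For concreteness, a small NumPy sketch of this featurized linear regression; the quadratic feature map φ and the penalty strength below are illustrative assumptions, not from the slides.

```python
import numpy as np

def phi(x):
    # Hypothetical feature transform: G = 3 polynomial features of a scalar x
    return np.array([1.0, x, x ** 2])

x = np.linspace(-1, 1, 20)
y = 2.0 * x ** 2 - x + 0.1 * np.random.randn(20)

Phi = np.stack([phi(v) for v in x])      # N x G design matrix
lam = 0.1                                # optional L2 penalty strength
# Closed-form solution of min_theta sum_n (y_n - theta^T phi(x_n))^2 + lam * ||theta||^2
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

y_hat = Phi @ theta                      # predictions: linear transform of the features
```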

SLIDE 12

Kernelized Linear Regression


  • Prediction:

    ŷ(x_i, α, {x_n}_{n=1}^N) = Σ_{n=1}^N α_n k(x_n, x_i)

  • Training: solve an N-dimensional optimization problem

    min_α Σ_{n=1}^N (y_n − ŷ(x_n, α, X))^2

where X = {x_n}_{n=1}^N denotes the training examples. Can do all needed operations with access only to the kernel (no feature vectors are ever created in memory).
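A minimal NumPy sketch of this idea; the RBF kernel and the small ridge term added for numerical stability are assumptions, not part of the slides.

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

def kernel_matrix(X_a, X_b):
    return np.array([[rbf_kernel(a, b) for b in X_b] for a in X_a])

X_train = np.random.randn(20, 2)
y_train = np.random.randn(20)

# Training: solve an N-dimensional problem for alpha
# (a tiny ridge term keeps the linear solve well conditioned)
K_train = kernel_matrix(X_train, X_train)                      # N x N
alpha = np.linalg.solve(K_train + 1e-6 * np.eye(20), y_train)

# Prediction: y_hat(x_i) = sum_n alpha_n k(x_n, x_i)  ==  K_test @ alpha
X_test = np.random.randn(5, 2)
y_hat = kernel_matrix(X_test, X_train) @ alpha                 # T predictions
```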

SLIDE 13

Math on board

  • Goal: show how kernelized linear prediction reweights each training example

SLIDE 14

Can kernelize any linear model


ŷ(x_i, α, {x_n}_{n=1}^N) = Σ_{n=1}^N α_n k(x_n, x_i)

Regression prediction:

    ŷ(x_i, α, X)

Logistic regression prediction:

    p(Y_i = 1 | x_i) = σ(ŷ(x_i, α, X))
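A small illustrative sketch of the kernelized logistic regression prediction, assuming the RBF kernel from earlier; the alpha values here are stand-ins for weights learned by minimizing the log loss.

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x_i, X_train, alpha, gamma=1.0):
    # p(Y_i = 1 | x_i) = sigma( sum_n alpha_n k(x_n, x_i) )
    score = sum(a * rbf_kernel(x_n, x_i, gamma) for a, x_n in zip(alpha, X_train))
    return sigmoid(score)

X_train = np.random.randn(10, 2)
alpha = np.random.randn(10)          # stand-in weights; would be learned from data
print(predict_proba(np.zeros(2), X_train, alpha))
```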

SLIDE 15

Training: Regression vs. Logistic Regression


Regression:

    min_α Σ_{n=1}^N (y_n − ŷ(x_n, α, X))^2

Logistic regression:

    min_α Σ_{n=1}^N log_loss(y_n, σ(ŷ(x_n, α, X)))
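For reference, a small sketch of the two per-example loss terms, assuming labels coded as 0/1 for the log loss:

```python
import numpy as np

def squared_loss(y, y_hat):
    # Regression objective term: (y_n - y_hat_n)^2
    return (y - y_hat) ** 2

def log_loss(y, p, eps=1e-12):
    # Logistic regression objective term, with y in {0, 1} and p = sigma(y_hat)
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

print(squared_loss(1.0, 0.8))   # small error -> small penalty
print(log_loss(1, 0.01))        # confident wrong prediction -> very large penalty
```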

SLIDE 16

Downsides of LR


Log loss means any misclassified example pays a steep price, so logistic regression is fairly sensitive to outliers.

SLIDE 17

Stepping back

Which do we prefer? Why?

SLIDE 18

Idea: Define Regions Separated by Wide Margin

SLIDE 19

w is perpendicular to the boundary

SLIDE 20

Examples that define the margin are called support vectors


[Figure: the nearest negative example and the nearest positive example define the margin.]

SLIDE 21

Observation: Non-support training examples do not influence margin at all


Could perturb these examples without affecting the boundary

SLIDE 22

How wide is the margin?

SLIDE 23

Small margin


Decision rule:

    ŷ = +1 if w^T x + b ≥ 0
    ŷ = −1 if w^T x + b < 0

(the region where w^T x + b > 0 is predicted positive; where w^T x + b < 0, negative)

Margin: distance to the boundary
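A tiny sketch of this decision rule; the weight vector and offset below are made-up values for illustration.

```python
import numpy as np

def predict_label(x, w, b):
    # +1 if w^T x + b >= 0, else -1
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([2.0, -1.0])   # hypothetical weights (perpendicular to the boundary)
b = 0.5                     # hypothetical offset
print(predict_label(np.array([1.0, 1.0]), w, b))   # -> 1
```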

SLIDE 24

Large margin


Decision rule (same as before):

    ŷ = +1 if w^T x + b ≥ 0
    ŷ = −1 if w^T x + b < 0

Margin: distance to the boundary

SLIDE 25

How wide is the margin?


Distance from nearest positive example to nearest negative example along vector w:

SLIDE 26

How wide is the margin?


Distance from nearest positive example to nearest negative example along vector w:

By construction, we assume the boundary is scaled so that the nearest positive example satisfies w^T x + b = +1 and the nearest negative example satisfies w^T x + b = −1; the margin width is then 2 / ||w||.
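A one-line numerical sketch of that width calculation, using a made-up weight vector:

```python
import numpy as np

w = np.array([2.0, -1.0])                 # hypothetical learned weights
margin_width = 2.0 / np.linalg.norm(w)    # width under the scaling w^T x + b = +/-1 at the support vectors
print(margin_width)                       # about 0.894
```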

SLIDE 27

SVM Training Problem Version 1: Hard margin


Requires all training examples to be correctly classified

This is a constrained quadratic optimization problem. There are standard optimization methods as well as methods specially designed for SVMs.

For each n = 1, 2, ..., N

SLIDE 28

SVM Training Problem Version 1: Hard margin


Requires all training examples to be correctly classified

This is a constrained quadratic optimization problem. There are standard optimization methods as well as methods specially designed for SVMs.

Minimizing this is equivalent to maximizing the margin width in feature space.

For each n = 1, 2, ..., N

SLIDE 29

Soft margin: Allow some misclassifications


Slack at example i: distance on the wrong side of the margin
SLIDE 30

Hard vs. soft constraints


HARD: all positive examples must satisfy the margin constraint.
SOFT: want each positive example to satisfy the constraint up to a slack, with the slack as small as possible (minimize its absolute value).

SLIDE 31

Hinge loss for positive example


Assumes the current example has positive label y = +1
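An illustrative sketch of the hinge loss as a function of the real-valued score w^T x + b (not taken from the slides):

```python
import numpy as np

def hinge_loss(y, score):
    # y in {+1, -1}; score = w^T x + b (or its kernelized version)
    # Zero loss once the example is on the correct side with margin >= 1
    return np.maximum(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.0))    # 0.0 : confidently correct
print(hinge_loss(+1, 0.5))    # 0.5 : correct, but inside the margin
print(hinge_loss(+1, -1.0))   # 2.0 : misclassified; penalty grows only linearly
```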

SLIDE 32


SVM Training Problem Version 2: Soft margin

Tradeoff parameter C controls model complexity.
  • Smaller C: simpler model; encourages a large margin even if we make lots of mistakes
  • Bigger C: avoid mistakes
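A short sketch of how C might be explored with scikit-learn's SVC; the toy data and the particular C values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Smaller C: wider margin, tolerates more training mistakes (simpler model)
# Bigger C:  tries hard to classify every training example correctly
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.score(X, y), 'num support vectors:', len(clf.support_))
```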

SLIDE 33


SVM vs LR: Compare training

Both require tuning a complexity hyperparameter C to avoid overfitting

SLIDE 34


Loss functions: SVM vs Logistic Regr.

SLIDE 35

SVMs: Prediction


Make binary prediction via hard threshold

SLIDE 36

SVMs and Kernels: Prediction


Make binary prediction via hard threshold

Efficient training algorithms based on modern quadratic programming solve the dual of the soft-margin SVM optimization problem.
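As an illustration (not from the slides), the kernelized prediction can be reproduced by hand from a fitted scikit-learn SVC, whose dual_coef_ stores the non-zero alpha_n * y_n values for the support vectors; the toy data below is assumed.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)

gamma = 0.5
clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X, y)

# f(x) = sum_{n in support set} (alpha_n * y_n) k(x_n, x) + b, then hard-threshold at 0
x_new = np.array([0.3, -0.2])
k_vals = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
score = clf.dual_coef_[0] @ k_vals + clf.intercept_[0]
print(np.sign(score), clf.predict(x_new[None, :])[0])   # the two should agree
```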

SLIDE 37

Support vectors are often a small fraction of all examples


[Figure: only the examples nearest the boundary — the nearest negative and nearest positive examples — are support vectors.]

SLIDE 38

Support vectors are defined by non-zero alpha values in the kernel view
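A short sketch of inspecting this sparsity with scikit-learn (toy data assumed):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

clf = SVC(kernel='rbf', C=1.0).fit(X, y)

# Only examples with non-zero alpha (the support vectors) enter the prediction sum
print('support vectors:', len(clf.support_), 'of', len(X), 'training examples')
print('indices of the first few:', clf.support_[:5])
```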

SLIDE 39

SVM + Squared Exponential Kernel

SLIDE 40


                         SVM                                       Logistic Regr.
Loss                     hinge                                     cross entropy (log loss)
Sensitive to outliers    less                                      more
Probabilistic?           no                                        yes
Kernelizable?            Yes, with speed benefits from sparsity    Yes
Multi-class?             Only via a separate one-vs-all            Easy, use softmax
                         classifier for each class

SLIDE 41

Activity

  • KernelRegressionDemo.ipynb
  • Scroll down to SVM vs logistic regression
  • Can you visualize behavior with different C?
  • Try different kernels?
  • Examine alpha vector?
