LECTURE 10: LARGE MARGIN CLASSIFIERS
Prof. Julia Hockenmaier


SLIDE 1

CS446 Introduction to Machine Learning (Spring 2015) University of Illinois at Urbana-Champaign

http://courses.engr.illinois.edu/cs446

  • Prof. Julia Hockenmaier

juliahmr@illinois.edu

LECTURE 10:

LARGE MARGIN CLASSIFIERS

SLIDE 2

Today’s class

Large margin classifiers:
– Why do we care about the margin?
– Perceptron with margin
– Support Vector Machines

Dealing with outliers:
– Soft margins


SLIDE 3

Large margin classifiers

SLIDE 4

What’s the best separating hyperplane?

[Figure: positive (+) and negative (−) training points with a candidate separating hyperplane]

SLIDE 5

What’s the best separating hyperplane?

[Figure: the same points with another candidate separating hyperplane]

SLIDE 6

What’s the best separating hyperplane?

[Figure: the same points with a separating hyperplane; the margin m is marked]

SLIDE 7

Maximum margin classifiers


These decision boundaries are very close to some items in the training data. They have small margins. Minor changes in the data could lead to different decision boundaries.

This decision boundary is as far away from any training items as possible. It has a large margin. Minor changes in the data result in (roughly) the same decision boundary.

SLIDE 8

Maximum margin classifier

Margin = the distance of the decision boundary to the closest items in the training data.

We want to find a classifier whose decision boundary is furthest away from the nearest data points. (This classifier has the largest margin.)

This additional requirement (bias) reduces the variance (i.e. reduces overfitting).


SLIDE 9

Margins

SLIDE 10

Margins

Decision boundary: the hyperplane with f(x) = 0, i.e. wx + b = 0

Distance of the hyperplane wx + b = 0 to the origin:

−b / ‖w‖

Absolute distance of a point x to the hyperplane wx + b = 0:

|wx + b| / ‖w‖

[Figure: the hyperplane wx + b = 0, a point x, and the weight vector w]

SLIDE 11

Margin

If the data are linearly separable, y(i)(wx(i) + b) > 0.

Euclidean distance of x(i) to the decision boundary:

y(i) f(x(i)) / ‖w‖ = y(i)(wx(i) + b) / ‖w‖
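To make the distance formula concrete, here is a small sketch (NumPy and the toy values are my own choices, not from the lecture) that computes the geometric margin of each training item and of the whole dataset:

```python
import numpy as np

# A minimal sketch (assumed toy values): the geometric margin of a labeled point
# (x, y) w.r.t. the hyperplane wx + b = 0 is y * (w @ x + b) / ||w||; the margin
# of the dataset is the minimum over all points.

def geometric_margins(X, y, w, b):
    """Signed Euclidean distances y(i)(w x(i) + b) / ||w|| for every row of X."""
    return y * (X @ w + b) / np.linalg.norm(w)

# Toy data: two positive and two negative points in 2D
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

margins = geometric_margins(X, y, w, b)
print(margins)          # per-point distances; all positive => data separated by this hyperplane
print(margins.min())    # the geometric margin of this hyperplane on the data
```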

SLIDE 12

Functional vs. Geometric margin

Geometric margin (Euclidean distance) of hyperplane wx + b = 0 to point x(i):

y(i) f(x(i)) / ‖w‖ = y(i)(wx(i) + b) / ‖w‖

Functional margin of hyperplane wx + b = 0 to point x(i):

γ = y(i) f(x(i)), i.e. γ = y(i)(wx(i) + b)

SLIDE 13

Rescaling w and b

Rescaling w and b by a factor k to kw and kb does not change the geometric margin (Euclidean distance):

y(i)(wx(i) + b) / ‖w‖                              (geometric margin of x(i) to wx + b = 0)
= y(i)(∑n wn xn(i) + b) / √(∑n wn wn)              …spell out wx and ‖w‖…
= k · y(i)(∑n wn xn(i) + b) / (k · √(∑n wn wn))    …multiply by k/k…
= y(i)(∑n k·wn xn(i) + k·b) / √(∑n k·wn · k·wn)    …move k inside…
= y(i)(kwx(i) + kb) / ‖kw‖                         (geometric margin of x(i) to kwx + kb = 0)
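A quick numerical check of this invariance (toy values assumed), which also previews the next slide: the functional margin scales with k while the geometric margin stays the same:

```python
import numpy as np

# A minimal sketch (toy values assumed): rescaling (w, b) by a factor k changes
# the functional margin y(wx + b) but not the geometric margin y(wx + b) / ||w||.

x, y = np.array([2.0, 1.0]), +1
w, b = np.array([1.0, -1.0]), 0.5
k = 10.0

functional = y * (w @ x + b)
geometric = functional / np.linalg.norm(w)

functional_scaled = y * ((k * w) @ x + k * b)
geometric_scaled = functional_scaled / np.linalg.norm(k * w)

print(functional, functional_scaled)   # differs by the factor k
print(geometric, geometric_scaled)     # identical
```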

SLIDE 14

Rescaling w and b

Rescaling w and b by a factor k does change the functional margin γ by a factor k:

γ = y(i)(wx(i) + b)
kγ = y(i)(kwx(i) + kb)

The point that is closest to the decision boundary has functional margin γmin
– w and b can be rescaled so that γmin = 1
– When learning w and b, we can set γmin = 1

(and still get the same decision boundary)


SLIDE 15

The maximum margin decision boundary

[Figure: positive and negative points with the decision boundary wx = 0 and two parallel hyperplanes through the closest points, wxi = +1 = yi and wxj = −1 = yj; the margin m lies between them]

SLIDE 16

Hinge loss

L(y, f(x)) = max(0, 1 − yf(x))


Loss as a function of yf(x):
– Case 1: yf(x) ≥ 1: x outside of the margin; hinge loss = 0
– Case 2: 0 < yf(x) < 1: x inside the margin; hinge loss = 1 − yf(x)
– Case 3: yf(x) < 0: x misclassified; hinge loss = 1 − yf(x)
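A small sketch of the hinge loss with one example value for each of the three cases (values chosen for illustration):

```python
import numpy as np

# A minimal sketch: hinge loss L(y, f(x)) = max(0, 1 - y f(x)),
# evaluated for the three cases on this slide.

def hinge_loss(y, fx):
    return np.maximum(0.0, 1.0 - y * fx)

print(hinge_loss(+1, 2.0))   # y f(x) >= 1: outside the margin, loss 0
print(hinge_loss(+1, 0.4))   # 0 < y f(x) < 1: inside the margin, loss 0.6
print(hinge_loss(+1, -0.5))  # y f(x) < 0: misclassified, loss 1.5
```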

SLIDE 17

Perceptron with margin


SLIDE 18

Perceptron with Margin

Standard Perceptron update:
– Update w if ym·w·xm < 0

Perceptron with Margin update:
– Define a functional margin γ > 0
– Update w if ym·w·xm < γ
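A minimal sketch of this training rule (my own toy implementation; the lecture only states the update condition, and the additive update w ← w + η·ym·xm used here is the standard perceptron update):

```python
import numpy as np

# A rough sketch of the perceptron-with-margin rule: the weight vector is also
# updated when an example is classified correctly but with functional margin
# below gamma. Data and hyperparameters are assumed for illustration.

def perceptron_with_margin(X, y, gamma=1.0, lr=1.0, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xm, ym in zip(X, y):
            if ym * (w @ xm) < gamma:   # the standard perceptron uses "< 0" here
                w += lr * ym * xm
    return w

# Toy linearly separable data (bias folded in as a constant feature of 1)
X = np.array([[2.0, 2.0, 1.0], [3.0, 1.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, 0.0, 1.0]])
y = np.array([+1, +1, -1, -1])
w = perceptron_with_margin(X, y, gamma=1.0)
print(w, np.sign(X @ w))   # learned weights and training predictions
```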


SLIDE 19

Support Vector Machines


SLIDE 20

The maximum margin decision boundary

[Figure: the same maximum-margin picture as slide 15: decision boundary wx = 0, boundary hyperplanes wxi = +1 and wxj = −1, margin m]

SLIDE 21

The maximum margin decision boundary…

… is defined by two parallel hyperplanes:
– one that goes through the positive data points (yj = +1) that are closest to the decision boundary, and
– one that goes through the negative data points (yj = −1) that are closest to the decision boundary.


SLIDE 22

Support vectors

We can express the separating hyperplane in terms of the data points xj closest to the decision boundary. These data points are called the support vectors.


SLIDE 23

Support vectors

[Figure: the maximum-margin picture again; the points lying on the hyperplanes wxi = +1 and wxj = −1 are the support vectors]

SLIDE 24

Perceptrons and SVMs: Differences in notation

Perceptrons:
– Weight vector has bias term w0 (x0 = dummy value 1)
– Decision boundary: wx = 0

SVMs/Large Margin classifiers:
– Explicit bias term b; weight vector w = (w1…wn)
– Decision boundary: wx + b = 0


SLIDE 25

Support Vector Machines

The functional margin of the data for (w, b) is determined by the points closest to the hyperplane:

γmin = minn [ y(n)(wx(n) + b) ]

Distance of x(n) to the hyperplane wx + b = 0:

(wx(n) + b) / ‖w‖

Learn w in an SVM = maximize the margin:

argmaxw,b { (1/‖w‖) · minn [ y(n)(wx(n) + b) ] }

SLIDE 26

Support Vector Machines

Learn w in an SVM = maximize the margin:

argmaxw,b { (1/‖w‖) · minn [ y(n)(wx(n) + b) ] }

This is difficult to optimize. Let's convert it to an equivalent problem that is easier.

SLIDE 27

Support Vector Machines

Learn w in an SVM = maximize the margin:

argmaxw,b { (1/‖w‖) · minn [ y(n)(wx(n) + b) ] }

Easier equivalent problem:
– We can always rescale w and b without affecting Euclidean distances.
– This allows us to set the functional margin to 1: minn [ y(n)(wx(n) + b) ] = 1

SLIDE 28

Support Vector Machines

Learn w in an SVM = maximize the margin:

argmaxw,b { (1/‖w‖) · minn [ y(n)(wx(n) + b) ] }

Easier equivalent problem: a quadratic program
– Setting minn [ y(n)(wx(n) + b) ] = 1 implies y(n)(wx(n) + b) ≥ 1 for all n
– argmax 1/‖w‖ = argmin w⋅w = argmin ½·w⋅w

argminw ½ w⋅w subject to yi(w⋅xi + b) ≥ 1 ∀i
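As an illustration, this quadratic program can be written almost verbatim with a convex-optimization modeling library; the use of cvxpy and the toy data here are my own choices, not part of the lecture:

```python
import numpy as np
import cvxpy as cp

# A minimal sketch of the hard-margin SVM quadratic program
# (cvxpy and the toy data are assumptions, not from the lecture):
#   minimize   1/2 * w.w
#   subject to y_i (w.x_i + b) >= 1  for all i

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)                 # the maximum-margin hyperplane
print(1.0 / np.linalg.norm(w.value))    # the resulting geometric margin
```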

SLIDE 29

Support Vector Machines

The name “Support Vector Machine” stems from the fact that w* is supported by (i.e. lies in the linear span of) the examples that are exactly at a distance 1/‖w*‖ from the separating hyperplane. These vectors are therefore called support vectors.

Theorem: Let w* be the minimizer of the SVM optimization problem for S = {(xi, yi)}. Let I = {i: yi(w*xi + b) = 1}. Then there exist coefficients αi > 0 such that w* = ∑i∈I αi yi xi.
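One way to see the theorem in practice is via scikit-learn's linear SVC (an assumed tool, not used in the lecture), whose dual_coef_ attribute stores the products αi·yi for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# A small check of w* = sum_{i in I} alpha_i y_i x_i using scikit-learn
# (toy data assumed; a very large C approximates a hard margin).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

w_from_duals = clf.dual_coef_ @ clf.support_vectors_   # dual_coef_ holds alpha_i * y_i
print(clf.support_vectors_)                            # the support vectors
print(w_from_duals, clf.coef_)                         # the two should match
```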


SLIDE 30

The primal representation

The data items x = (x1…xn) have n features.
The weight vector w = (w1…wn) has n elements.

Learning: Find a weight wj for each feature xj.
Classification: Evaluate wx.


SLIDE 31

The dual representation

Learning: Find a weight αj ( ≥ 0) for each data point xj

This requires computing the inner product xixj between all pairs of data items xi and xj

Support vectors = the set of data points xj with non-zero weights αj


w = ∑j αj xj

SLIDE 32

Classifying test data with SVM

In the primal: Compute the inner product between the weight vector and the test item:

wx = 〈w, x〉

In the dual: Compute inner products between the support vectors and the test item:

wx = 〈w, x〉 = 〈∑j αj xj, x〉 = ∑j αj 〈xj, x〉
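A small sketch of both scoring routes on assumed toy support vectors and coefficients (here the labels yj are folded into the coefficients, as in the theorem on slide 29):

```python
import numpy as np

# A minimal sketch: scoring a test point in the primal (w.x) and in the dual
# (sum_j alpha_j <x_j, x>), using assumed toy support vectors and coefficients.
support_vectors = np.array([[2.0, 2.0], [-1.0, -1.0]])
alphas = np.array([0.25, -0.25])     # labels y_j folded into the coefficients
w = alphas @ support_vectors         # w = sum_j alpha_j x_j

x_test = np.array([1.0, 3.0])
primal_score = w @ x_test
dual_score = np.sum(alphas * (support_vectors @ x_test))
print(primal_score, dual_score)      # identical scores
```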


SLIDE 33

Dealing with outliers: Soft margins

SLIDE 34

Dealing with outliers: Slack variables ξi

ξi measures by how much example (xi, yi) fails to achieve margin δ


SLIDE 35

Dealing with outliers: Slack variables ξi

If xi is on the correct side of the margin: ξi = 0
Otherwise: ξi = |yi − wxi|

If ξi = 1: xi is on the decision boundary wxi = 0
If ξi > 1: xi is misclassified

Replace y(n)(wx(n) + b) ≥ 1 (hard margin)
with y(n)(wx(n) + b) ≥ 1 − ξ(n) (soft margin)


SLIDE 36

Soft margins

ξi (slack): how far off is xi from the margin?
C (cost): how much do we have to pay for misclassifying xi?

We want to minimize C∑i ξi and maximize the margin.
C controls the tradeoff between margin and training error.


argminw ½ w⋅w + C ∑i=1..n ξi

subject to ξi ≥ 0 ∀i
and yi(w⋅xi + b) ≥ 1 − ξi ∀i

SLIDE 37

Soft SVMs

Now the optimization problem becomes:

minw ½‖w‖² + C ∑(x,y)∈S max(0, 1 − y·wx)

where the parameter C controls the tradeoff between choosing a large margin (small ‖w‖) and choosing a small hinge loss.
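To make the objective concrete, here is a short sketch that evaluates it for two candidate weight vectors on assumed toy data (the bias b is included in the score, following the earlier slides):

```python
import numpy as np

# A minimal sketch: evaluate the soft-SVM objective
#   1/2 ||w||^2 + C * sum max(0, 1 - y (w.x + b))
# on assumed toy data, for a given (w, b).

def soft_svm_objective(w, b, X, y, C):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * hinge.sum()

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

print(soft_svm_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))
print(soft_svm_objective(np.array([5.0, 5.0]), 0.0, X, y, C=1.0))  # larger ||w||, still zero hinge loss
```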


SLIDE 38

Training SVMs

Traditional approach: Solve the quadratic program.
– This is very slow.

Current approaches: Use variants of stochastic gradient descent or coordinate descent (a rough sketch follows below).
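Here is a rough sketch of one such approach: stochastic sub-gradient descent on the regularized hinge-loss objective, in the style of Pegasos. The step-size schedule, the toy data, and folding the bias into a constant feature are my own simplifications, not details from the lecture:

```python
import numpy as np

# A rough sketch of stochastic sub-gradient descent on
#   lambda/2 ||w||^2 + (1/m) sum max(0, 1 - y w.x)
# (Pegasos-style; hyperparameters and data are assumed for illustration).

def sgd_svm(X, y, lam=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(m):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            if y[i] * (w @ X[i]) < 1:      # hinge loss active: sub-gradient has a data term
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                          # only the regularizer contributes
                w = (1 - eta * lam) * w
    return w

X = np.array([[2.0, 2.0, 1.0], [3.0, 1.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, 0.0, 1.0]])
y = np.array([+1, +1, -1, -1])
w = sgd_svm(X, y)
print(w, np.sign(X @ w))   # learned weights and training predictions
```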

More on Tuesday!
