CS 559: Machine Learning Fundamentals and Applications, 9th Set of Notes - PowerPoint PPT Presentation



slide-1
SLIDE 1

CS 559: Machine Learning Fundamentals and Applications 9th Set of Notes

Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: Philippos.Mordohai@stevens.edu
Office: Lieb 215

1

slide-2
SLIDE 2

Overview

  • Logistic Regression

– Notes by T. Mitchell
– Barber Ch. 17
– HTF Ch. 4

  • Linear Discriminant Functions (slides based on Olga Veksler’s)

– Optimization with gradient descent
– Perceptron Criterion Function

  • Batch perceptron rule
  • Single sample perceptron rule

– Minimum Squared Error (MSE) rule

2

slide-3
SLIDE 3

Overview (cont.)

  • Support Vector Machines (SVM)

– Introduction
– Linear Discriminant

  • Linearly Separable Case
  • Linearly Non-Separable Case

– Kernel Trick

  • Non-Linear Discriminant

– Multi-class SVMs

  • See HTF Ch. 12

3

slide-4
SLIDE 4

Logistic Regression

  • Idea: generative models compute P(Y|X)

by learning P(Y) and P(X|Y)

  • Why not learn P(Y|X) directly?

4

slide-5
SLIDE 5

Logistic Regression

  • Consider learning f: X → Y, where

– X is a vector of real-valued features ⟨X1, …, Xn⟩
– Y is boolean
– assume all Xi are conditionally independent given Y
– model P(Xi | Y = yk) as Gaussian N(μik, σi²)
– model P(Y) as Bernoulli(π)

  • Y is 1 with probability π

5

slide-6
SLIDE 6

Derivation of P(Y|X)

6

slide-7
SLIDE 7

Very Convenient

7

slide-8
SLIDE 8

Very Convenient

8

  • Posteriors sum to 1 and remain in [0, 1]
  • Logit: log [ P(Y=1|x) / P(Y=0|x) ] = α + βx

– is linear in x

  • Probability: P(Y=1|x) = 1 / (1 + exp(−(α + βx)))
slide-9
SLIDE 9

Logistic Function

  • As P → 0, the logit log[P/(1−P)] → −∞
  • At P = 0.5, the logit is 0
  • As P → 1, the logit → +∞

9

slide-10
SLIDE 10

Decision Boundary

  • How do we make decisions given P(Y|X)?

10

slide-11
SLIDE 11

Logistic Regression More Generally

  • Logistic regression when Y is not boolean (but still discrete)

– y ∈ {y1, …, yR}: learn R−1 sets of weights
– for k < R
– for k = R

11

slide-12
SLIDE 12

Training Logistic Regression: MCLE

  • We have L training examples:
  • Maximum likelihood estimate for

parameters W

  • Maximum conditional likelihood estimate

12

slide-13
SLIDE 13

Training Logistic Regression: MCLE

  • Choose parameters <w0 … wn> to maximize

conditional likelihood of training data

  • Training data D=
  • Data likelihood =
  • Data conditional likelihood =

13

slide-14
SLIDE 14

Conditional Log Likelihood

14

slide-15
SLIDE 15

Maximizing Conditional Log Likelihood

15

Good news: l(W) is a concave function of W. Bad news: there is no closed-form solution to maximize l(W).

slide-16
SLIDE 16

Gradient Descent

16

slide-17
SLIDE 17

Maximize Conditional Log Likelihood: Gradient Ascent

17

slide-18
SLIDE 18

Maximize Conditional Log Likelihood: Gradient Ascent

18
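
The slides above give the gradient ascent rule for the conditional log likelihood only as equations on the slide images. Below is a minimal NumPy sketch of that training loop (not part of the original notes); the fixed learning rate eta, the iteration count, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    # P(Y=1 | x, w) for the logistic model
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_gradient_ascent(X, y, eta=0.1, n_iters=2000):
    """Maximize the conditional log likelihood sum_l [y_l ln P(1|x_l,w) + (1-y_l) ln P(0|x_l,w)]
    by gradient ascent. X: (L, n) real features, y: (L,) labels in {0, 1}."""
    L, n = X.shape
    Xa = np.hstack([np.ones((L, 1)), X])   # prepend a constant feature for the intercept w0
    w = np.zeros(n + 1)
    for _ in range(n_iters):
        p = sigmoid(Xa @ w)                # current P(Y=1 | x_l, w) for every example
        w += eta * Xa.T @ (y - p) / L      # gradient step: sum_l x_l (y_l - P(1|x_l, w))
    return w

# Tiny usage example on synthetic, roughly separable data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.hstack([np.zeros(50), np.ones(50)])
print(train_logistic_gradient_ascent(X, y))
```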

slide-19
SLIDE 19

Logistic Regression: Summary

  • Consider learning f: X → Y, where

– X is a vector of real-valued features ⟨X1, …, Xn⟩
– Y is boolean
– assume all Xi are conditionally independent given Y
– model P(Xi | Y = yk) as Gaussian N(μik, σi²)
– model P(Y) as Bernoulli(π)

  • Then P(Y|X) is of this form and we can directly

estimate W

19

slide-20
SLIDE 20

Linear Discriminant Functions

20

slide-21
SLIDE 21

Augmented Feature Vector

  • Linear discriminant function: g(x) = wᵀx + w0

  • Can rewrite it as: g(y) = aᵀy

y is called the augmented feature vector

  • Added a dummy dimension to get a completely equivalent new homogeneous problem

Pattern Classification, Chapter 5 21

slide-22
SLIDE 22
  • Feature augmentation is done for simpler

notation

  • From now on, always assume that we have

augmented feature vectors

– Given samples x1, …, xn, convert them to augmented samples y1, …, yn by adding a new dimension of value 1

22

slide-23
SLIDE 23

Training Error

  • For the rest of this part, assume we have 2 classes

– Samples: y1, …, yn, some in class 1, some in class 2

  • Use the samples to determine the weights a in the discriminant function g(y) = aᵀy

  • What should the criterion for determining a be?

  • For now, suppose we want to minimize the training error (the number of misclassified samples y1, …, yn)

  • Recall that: g(yi) > 0 ⇒ yi classified as c1; g(yi) < 0 ⇒ yi classified as c2
  • Thus the training error is 0 if g(yi) > 0 for all yi in class 1 and g(yi) < 0 for all yi in class 2

23

slide-24
SLIDE 24

“Normalization”

  • Thus training error is 0 if: aᵀyi > 0 for yi in class 1, aᵀyi < 0 for yi in class 2
  • Equivalently, training error is 0 if: aᵀyi > 0 for yi in class 1, aᵀ(−yi) > 0 for yi in class 2
  • This suggests “normalization” (a.k.a. reflection):
  • 1. Replace all examples from class 2 by their negatives: yi ← −yi
  • 2. Seek a weight vector a such that aᵀyi > 0 for all samples

– If such an a exists, it is called a separating or solution vector
– The original samples x1, …, xn can then indeed be separated by a line

24
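
Not in the original deck: a short NumPy sketch of the augmentation and “normalization” (reflection) steps just described, assuming the two classes are given as separate arrays of raw feature vectors.

```python
import numpy as np

def augment(X):
    # Add a dummy feature of value 1 to every sample: x -> y = [1, x]
    return np.hstack([np.ones((X.shape[0], 1)), X])

def augment_and_normalize(X1, X2):
    """Augment both classes, then reflect class-2 samples (y <- -y), so that a
    separating weight vector a must satisfy a^T y_i > 0 for every sample."""
    return np.vstack([augment(X1), -augment(X2)])

# Example with the 2D samples used later in the non-separable example
X1 = np.array([[2, 1], [4, 3], [3, 5]])   # class 1
X2 = np.array([[1, 3], [5, 6]])           # class 2
print(augment_and_normalize(X1, X2))
```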

slide-25
SLIDE 25

Normalization

  • Seek a hyperplane

that separates patterns from different categories

  • Seek hyperplane that

puts the normalized patterns on the same (positive) side

25

slide-26
SLIDE 26

Solution Region

  • Find a weight vector a such that aᵀyi > 0 for all samples

  • In general, there can be many solutions

26

slide-27
SLIDE 27

Solution Region

  • Solution region for a: set of all possible

solutions defined in terms of normal a to the separating hyperplane

27

slide-28
SLIDE 28

Optimization

  • Need to minimize a function of many variables J(x) = J(x1, …, xd)

  • We know how to minimize J(x)

– Take partial derivatives and set them to zero

28

slide-29
SLIDE 29

Optimization

  • However solving analytically is not always

easy

– For example:

  • Sometimes it is not even possible to write

down an analytical expression for the derivative (example later today)

29

slide-30
SLIDE 30

Gradient Descent

  • The gradient ∇J(x) points in the direction of steepest increase of J(x)
  • −∇J(x) points in the direction of steepest decrease

30

slide-31
SLIDE 31

Gradient Descent for minimizing any function J(x)

– Set k = 1 and x(1) to some initial guess for the weight vector
– While not converged:

  • Choose learning rate η(k)
  • x(k+1) = x(k) − η(k) ∇J(x(k))  (update rule)
  • k = k + 1

31
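
The generic loop above can be written in a few lines. The sketch below is not from the slides; the constant learning rate, the threshold test on the step size, and the quadratic example function are assumptions.

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, theta=1e-6, max_iters=10000):
    """Generic gradient descent: x(k+1) = x(k) - eta(k) * grad(x(k)).
    Stops when the step eta * grad(x) becomes smaller than theta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        step = eta * grad(x)               # eta(k) is kept constant here; it could also decay with k
        if np.linalg.norm(step) < theta:   # convergence test
            break
        x = x - step
    return x

# Example: minimize J(x) = (x1 - 3)^2 + (x2 + 1)^2, whose gradient is 2(x - [3, -1])
print(gradient_descent(lambda x: 2 * (x - np.array([3.0, -1.0])), x0=[0.0, 0.0]))
```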


slide-32
SLIDE 32

Gradient Descent

  • Gradient descent is only guaranteed to find

local minima

  • Nevertheless gradient descent is very

popular because it is simple and applicable to any function

32

slide-33
SLIDE 33

Gradient Descent

  • Main issue: how to set parameter η

(learning rate)

– If η is too small, too many iterations are needed
– If η is too large, we may overshoot the minimum and possibly never find it

33

slide-34
SLIDE 34

LDF Criterion Function

  • Find a weight vector a such that aᵀyi > 0 for all samples y1, …, yn
  • Need a criterion function J(a) which is minimized when a is a solution vector
  • Let YM(a) = { yi s.t. aᵀyi < 0 } be the set of examples misclassified by a
  • First natural choice: number of misclassified examples, J(a) = |YM(a)|
  • J(a) is piecewise constant, so gradient descent is useless

34

slide-35
SLIDE 35

Perceptron

35

slide-36
SLIDE 36

Perceptron Criterion Function

  • Perceptron criterion: Jp(a) = Σ over y in YM(a) of (−aᵀy)
  • If y is misclassified, aᵀy < 0
  • Thus Jp(a) > 0
  • Jp(a) is ||a|| times the sum of distances of misclassified examples to the decision boundary
  • Jp(a) is piecewise linear and thus suitable for gradient descent

36

slide-37
SLIDE 37
  • Gradient of Jp(a) is: ∇Jp(a) = Σ over y in YM of (−y)

– YM are the samples misclassified by a(k)
– It is not possible to solve ∇Jp(a) = 0 analytically because of YM

  • Update rule for gradient descent: a(k+1) = a(k) − η(k) ∇Jp(a(k))
  • Thus the gradient descent batch update rule for Jp(a) is: a(k+1) = a(k) + η(k) Σ over y in YM of y
  • It is called the batch rule because it is based on all misclassified examples

Perceptron Batch Rule

37
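
A minimal NumPy sketch of the batch perceptron rule above (not part of the original notes). It assumes Y already holds augmented and normalized samples as rows, so a is a solution exactly when Y @ a > 0 everywhere, and uses a fixed learning rate.

```python
import numpy as np

def perceptron_batch(Y, eta=1.0, max_iters=1000):
    """Batch perceptron rule: a(k+1) = a(k) + eta * sum of the misclassified samples."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iters):
        misclassified = Y[Y @ a <= 0]          # Y_M(a): samples with a^T y <= 0
        if len(misclassified) == 0:            # no mistakes: a separates the (normalized) data
            break
        a = a + eta * misclassified.sum(axis=0)
    return a
```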

slide-38
SLIDE 38

Perceptron Single Sample Rule

  • The gradient descent single sample rule for Jp(a) is: a(k+1) = a(k) + η(k) yM

– Note that yM is one sample misclassified by a(k)
– Must have a consistent way of visiting samples

  • Geometric Interpretation:

– yM is misclassified by a(k): it is on the wrong side of the decision hyperplane
– Adding η yM to a moves the new decision hyperplane in the right direction with respect to yM

38
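
And the matching single sample rule, again as a sketch that is not part of the slides; visiting the samples cyclically is one possible “consistent way of visiting samples”.

```python
import numpy as np

def perceptron_single_sample(Y, eta=1.0, max_passes=1000):
    """Single sample perceptron rule: whenever a^T y <= 0, set a <- a + eta * y."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for y in Y:                   # one cyclic pass over the normalized augmented samples
            if a @ y <= 0:            # y is misclassified by the current a
                a = a + eta * y       # move the hyperplane toward the correct side of y
                mistakes += 1
        if mistakes == 0:             # a full pass with no updates: done
            break
    return a
```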

slide-39
SLIDE 39

Perceptron Single Sample Rule

39

slide-40
SLIDE 40

Perceptron Example

  • Class 1: students who get A
  • Class 2: students who get F
  • O. Veksler

40

slide-41
SLIDE 41
  • Augment samples by adding an extra

feature (dimension) equal to 1

41

  • O. Veksler
slide-42
SLIDE 42
  • Normalize:

– Replace all examples from class 2 by their negative values
– Seek a such that aᵀyi > 0 for all i

42

  • O. Veksler
slide-43
SLIDE 43
  • Single Sample Rule

– A sample yi is misclassified if aᵀyi < 0
– Gradient descent single sample rule: a(k+1) = a(k) + η yi
– Set the learning rate η to a fixed value

43

  • O. Veksler
slide-44
SLIDE 44
  • Set equal initial weights: a(1) = [0.25, 0.25, 0.25, 0.25, 0.25]

  • Visit all samples sequentially, modifying the

weights after each misclassified example

  • New weights

44

  • O. Veksler
slide-45
SLIDE 45
  • New weights

45

  • O. Veksler
slide-46
SLIDE 46
  • New weights

46

  • O. Veksler
slide-47
SLIDE 47
  • Thus the discriminant function is:
  • Converting back to the original features x:

47

  • O. Veksler
slide-48
SLIDE 48
  • Converting back to the original features x:

  • This is just one possible solution vector
  • If we started with weights a(1) = [0, 0.5, 0.5, 0, 0],
  • The solution would be [-1, 1.5, -0.5, -1, -1]
  • In this solution, being tall is the least important feature

48

  • O. Veksler
slide-49
SLIDE 49

LDF: Non-separable Example

  • Suppose we have 2 features

and the samples are:

– Class 1: [2,1], [4,3], [3,5]
– Class 2: [1,3] and [5,6]

  • These samples are not

separable by a line

  • Still would like to get approximate

separation by a line

– A good choice is shown in green
– Some samples may be “noisy”, and we could accept them being misclassified

49

  • O. Veksler
slide-50
SLIDE 50

LDF: Non-separable Example

  • Obtain y1, …, y5 by adding an extra feature and “normalizing”

50

  • O. Veksler
slide-51
SLIDE 51

LDF: Non-separable Example

  • Apply Perceptron single

sample algorithm

  • Initial equal weights a(1) = [1 1 1]

– Line equation x(1) + x(2) + 1 = 0

  • Fixed learning rate η = 1

51

slide-52
SLIDE 52

LDF: Non-separable Example

52

  • O. Veksler
slide-53
SLIDE 53

LDF: Non-separable Example

53

  • O. Veksler
slide-54
SLIDE 54

LDF: Non-separable Example

  • O. Veksler

54

  • y5ᵀa(4) = [-1 -5 -6]·[0 1 -4] = 19 > 0
  • y1ᵀa(4) = [1 2 1]·[0 1 -4] = -2 < 0
  • …
slide-55
SLIDE 55

LDF: Non-separable Example

  • We can continue this forever
  • There is no solution vector a satisfying aᵀyi > 0 for all i

  • Need to stop, but at a good point
  • Solutions at iterations 900 through 915

– Some are good and some are not

  • How do we stop at a good

solution?

55

  • O. Veksler
slide-56
SLIDE 56

Convergence of Perceptron Rules

  • If classes are linearly separable and we use a fixed learning rate, that is η(k) = const

  • Then both the single sample and batch perceptron rules converge to a correct solution (could be any a in the solution space)

  • If classes are not linearly separable:

– The algorithm does not stop, it keeps looking for a solution which does not exist

56

slide-57
SLIDE 57

Convergence of Perceptron Rules

  • If classes are not linearly separable:

– By choosing an appropriate learning rate, we can always ensure convergence
– For example, the inverse linear learning rate η(k) = η(1)/k
– For the inverse linear learning rate, convergence in the linearly separable case can also be proven
– There is no guarantee that we stopped at a good point, but there are good reasons to choose the inverse linear learning rate

57

slide-58
SLIDE 58

Minimum Squared-Error Procedures

58

slide-59
SLIDE 59

Minimum Squared-Error Procedures

  • Idea: convert to easier and better understood problem
  • MSE procedure

– Choose positive constants b1, b2, …, bn
– Try to find a weight vector a such that aᵀyi = bi for all samples yi
– If we can find such a vector, then a is a solution because the bi's are positive
– Consider all the samples (not just the misclassified ones)

59

slide-60
SLIDE 60

MSE Margins

  • If aᵀyi = bi, yi must be at distance bi from the separating hyperplane (normalized by ||a||)

  • Thus b1, b2, …, bn give relative expected distances or “margins” of the samples from the hyperplane

  • Should make bi small if sample i is expected to be near the separating hyperplane, and large otherwise

  • In the absence of any additional information, set b1 = b2 = … = bn = 1

60

slide-61
SLIDE 61

MSE Matrix Notation

  • Need to solve n equations aᵀyi = bi
  • In matrix form: Ya = b

61

slide-62
SLIDE 62

Exact Solution is Rare

  • Need to solve a linear system Ya = b

– Y is an n × (d+1) matrix

  • Exact solution only if Y is non-singular and square (the inverse Y⁻¹ exists)

– a = Y⁻¹b
– (number of samples) = (number of features + 1)
– Almost never happens in practice
– Guaranteed to find the separating hyperplane

62

slide-63
SLIDE 63

Approximate Solution

  • Typically Y is overdetermined, that is it has more rows

(examples) than columns (features)

– If it has more features than examples, should reduce dimensionality

  • Need Ya = b, but no exact solution exists for an over-determined system of equations

– More equations than unknowns

  • Find an approximate solution

– Note that the approximate solution a does not necessarily give the separating hyperplane in the separable case
– But the hyperplane corresponding to a may still be a good solution, especially if there is no separating hyperplane

63

slide-64
SLIDE 64

MSE Criterion Function

  • Minimum squared error approach: find a which minimizes the length of the error vector e = Ya − b

  • Thus minimize the minimum squared error criterion function: J(a) = ||Ya − b||²

  • Unlike the perceptron criterion function, we can optimize the minimum squared error criterion function analytically by setting the gradient to 0

64

slide-65
SLIDE 65

Computing the Gradient

Pattern Classification, Chapter 5 65

slide-66
SLIDE 66

Pseudo-Inverse Solution

  • Setting the gradient to 0: YᵀY a = Yᵀb
  • The matrix YᵀY is square (it has d+1 rows and columns) and it is often non-singular

  • If YᵀY is non-singular, its inverse exists and we can solve for a uniquely: a = (YᵀY)⁻¹Yᵀb

66
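
A small sketch of the pseudo-inverse (least squares) solution, not from the slides. np.linalg.lstsq plays the role of the Matlab call a = Y\b used later in the notes and is numerically preferable to forming (YᵀY)⁻¹ explicitly; the matrix Y below is my own construction of Example 1 from the following slides.

```python
import numpy as np

def mse_solution(Y, b):
    """Least squares solution of Ya = b (minimizes ||Ya - b||^2).
    Equals (Y^T Y)^{-1} Y^T b when Y^T Y is non-singular."""
    a, *_ = np.linalg.lstsq(Y, b, rcond=None)
    return a

# Example 1 data: class 1 = (6,9), (5,7); class 2 = (5,9), (0,4), augmented and "normalized"
Y = np.array([[ 1.0,  6.0,  9.0],
              [ 1.0,  5.0,  7.0],
              [-1.0, -5.0, -9.0],
              [-1.0,  0.0, -4.0]])
b = np.ones(4)
a = mse_solution(Y, b)
print(a, Y @ a)   # if every entry of Y @ a is positive, a gives a separating hyperplane
```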

slide-67
SLIDE 67

MSE Procedures

  • Only guaranteed a separating hyperplane if Ya > 0

– That is, if all elements of the vector Ya = b + ε are positive (where ε may be negative)

  • If ε1, …, εn are small relative to b1, …, bn, then each element of Ya is positive, and a gives a separating hyperplane

– If the approximation is not good, εi may be large and negative for some i, thus bi + εi will be negative and a is not a separating hyperplane

  • In the linearly separable case, the least squares solution a does not necessarily give a separating hyperplane

67

slide-68
SLIDE 68

MSE Procedures

  • We are free to choose b. We may be tempted to make b large as a way to ensure Ya = b > 0

– Does not work
– Let β be a scalar; let's try βb instead of b

  • If a* is a least squares solution to Ya = b, then for any scalar β, the least squares solution to Ya = βb is βa*

  • Thus if the i-th element of Ya is less than 0, that is yiᵀa < 0, then yiᵀ(βa) < 0

– The relative difference between the components of b matters, but not the size of each individual component

68

slide-69
SLIDE 69

LDF using MSE: Example 1

  • Class 1: (6 9), (5 7)
  • Class 2: (5 9), (0 4)
  • Add extra feature and

“normalize”

69

slide-70
SLIDE 70

LDF using MSE: Example 1

  • Choose b = [1 1 1 1]ᵀ

  • In Matlab, a = Y\b solves the least squares problem

  • Note a is an approximation to Ya = b, since no exact solution exists

  • This solution gives a separating hyperplane since Ya > 0

70

a ≈ [2.66, 1.045, −0.944]ᵀ
Ya ≈ [0.44, 1.28, 0.61, 1.11]ᵀ

slide-71
SLIDE 71

LDF using MSE: Example 2

  • Class 1: (6 9), (5 7)
  • Class 2: (5 9), (0 10)
  • The last sample is very far

compared to others from the separating hyperplane

71

slide-72
SLIDE 72

LDF using MSE: Example 2

  • Choose b = [1 1 1 1]ᵀ

  • In Matlab, a = Y\b solves the least squares problem

  • This solution does not provide a separating hyperplane since aᵀy3 < 0

72

slide-73
SLIDE 73

LDF using MSE: Example 2

  • MSE pays too much attention to isolated

“noisy” examples

– such examples are called outliers

  • No problems with convergence
  • Solution ranges from reasonable to good

73

slide-74
SLIDE 74

LDF using MSE: Example 2

  • We can see that the 4th point is very far from the separating hyperplane

– In practice we don't know this

  • A more appropriate b could be:
  • In Matlab, solve a = Y\b

  • This solution gives the separating hyperplane since Ya > 0

74

slide-75
SLIDE 75

Gradient Descent for MSE

  • May wish to find the MSE solution by gradient descent:

1. Computing the inverse of YᵀY may be too costly
2. YᵀY may be close to singular if samples are highly correlated (rows of Y are almost linear combinations of each other), so computing the inverse of YᵀY is not numerically stable

  • As shown before, the gradient is: ∇J(a) = 2Yᵀ(Ya − b)

75

slide-76
SLIDE 76

Widrow-Hoff Procedure

  • Thus the update rule for gradient descent is: a(k+1) = a(k) + η(k) Yᵀ(b − Ya(k))
  • If η(k) = η(1)/k, then a(k) converges to the MSE solution a, that is Yᵀ(Ya − b) = 0

  • The Widrow-Hoff procedure reduces storage requirements by considering single samples sequentially

76
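
Not in the original slides: a sketch of the Widrow-Hoff (LMS) procedure, updating on one sample at a time with the decaying learning rate η(k) = η(1)/k; the initial rate and the number of epochs are arbitrary choices.

```python
import numpy as np

def widrow_hoff(Y, b, eta1=0.1, n_epochs=200):
    """Widrow-Hoff / LMS rule: a <- a + eta(k) (b_i - a^T y_i) y_i, one sample at a time.
    Approaches the MSE solution without forming or inverting Y^T Y."""
    n, d = Y.shape
    a = np.zeros(d)
    k = 1
    for _ in range(n_epochs):
        for i in range(n):                          # consider single samples sequentially
            a = a + (eta1 / k) * (b[i] - Y[i] @ a) * Y[i]
            k += 1
    return a
```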

slide-77
SLIDE 77

LDF Summary

  • Perceptron procedures

– Find a separating hyperplane in the linearly separable case
– Do not converge in the non-separable case
– Can force convergence by using a decreasing learning rate, but are not guaranteed a reasonable stopping point

  • MSE procedures

– Converge in both the separable and the non-separable case
– May not find a separating hyperplane even if classes are linearly separable
– Use the pseudoinverse if YᵀY is not singular and not too large
– Use gradient descent (Widrow-Hoff procedure) otherwise

77

slide-78
SLIDE 78

Support Vector Machines

78

slide-79
SLIDE 79

SVM Resources

  • Burges tutorial

– http://research.microsoft.com/en- us/um/people/cburges/papers/SVMTutorial.pdf

  • Shawe-Taylor and Christianini tutorial

– http://www.support-vector.net/icml-tutorial.pdf

  • Lib SVM

– http://www.csie.ntu.edu.tw/~cjlin/libsvm/

  • LibLinear

– http://www.csie.ntu.edu.tw/~cjlin/liblinear/

  • SVM Light

– http://svmlight.joachims.org/

  • Power Mean SVM (very fast for histogram features)

– https://sites.google.com/site/wujx2001/home/power-mean-svm

79

slide-80
SLIDE 80

SVMs

  • One of the most important developments in pattern recognition in recent years

  • Elegant theory

– Has good generalization properties

  • Have been applied to diverse problems

very successfully

80

slide-81
SLIDE 81

Linear Discriminant Functions

  • A discriminant function is linear if it can be written as g(x) = wᵀx + w0
  • Which separating hyperplane should we choose?

81

slide-82
SLIDE 82

Linear Discriminant Functions

  • Training data is just a subset of all possible data

– Suppose hyperplane is close to sample xi – If we see new sample close to xi, it may be on the wrong side of the hyperplane

  • Poor generalization (performance on unseen data)

82

slide-83
SLIDE 83

Linear Discriminant Functions

  • Hyperplane as far as possible from any sample
  • New samples close to the old samples will be

classified correctly

  • Good generalization

83

slide-84
SLIDE 84

SVM

  • Idea: maximize distance to the closest example
  • For the optimal hyperplane

– distance to the closest negative example = distance to the closest positive example

84

slide-85
SLIDE 85

SVM: Linearly Separable Case

  • SVM: maximize the margin
  • The margin is twice the absolute value of the distance b of the closest example to the separating hyperplane

  • Better generalization (performance on test data)

– in practice
– and in theory

85

slide-86
SLIDE 86

SVM: Linearly Separable Case

  • Support vectors are the samples closest to the separating hyperplane

– They are the most difficult patterns to classify
– Recall the perceptron update rule

  • Optimal hyperplane is completely defined by support

vectors

– Of course, we do not know which samples are support vectors without finding the optimal hyperplane

86

slide-87
SLIDE 87

SVM: Formula for the Margin

  • Absolute distance between x and the boundary g(x) = 0 is |g(x)| / ||w||

  • The distance is unchanged if we rescale the hyperplane parameters (w, w0)
  • Let xi be an example closest to the boundary (on the positive side). Set: wᵀxi + w0 = 1

  • Now the largest margin hyperplane is unique

87

slide-88
SLIDE 88

SVM: Formula for the Margin

  • For uniqueness, set |wᵀxi + w0| = 1 for any sample xi closest to the boundary

  • The distance from the closest sample xi to g(x) = 0 is 1 / ||w||

  • Thus the margin is m = 2 / ||w||

88

slide-89
SLIDE 89

SVM: Optimal Hyperplane

  • Maximize the margin 2 / ||w||
  • Subject to the constraints zi(wᵀxi + w0) ≥ 1, where zi = ±1 is the label of xi
  • Can convert our problem to: minimize J(w) = ||w||² / 2 subject to the same constraints
  • J(w) is a quadratic function, thus there is a single global minimum

89

slide-90
SLIDE 90

SVM: Optimal Hyperplane

  • Use the Kuhn-Tucker theorem to convert our problem to: maximize LD(a) = Σi ai − ½ Σi Σj ai aj zi zj xiᵀxj, subject to ai ≥ 0 and Σi ai zi = 0

  • a = {a1, …, an} are new variables, one for each sample

  • Optimized by quadratic programming

90

slide-91
SLIDE 91

SVM: Optimal Hyperplane

  • After finding the optimal a = {a1, …, an}

  • Final discriminant function: g(x) = Σ over xi in S of ai zi xiᵀx + w0
  • where S is the set of support vectors

91
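
Not part of the slides: a sketch of the linearly separable case using scikit-learn's SVC as the quadratic programming solver (the slides only say “quadratic programming”). The toy data and the very large C used to approximate the hard-margin problem are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data; z in {-1, +1} are the class labels
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, z)   # large C ~ hard margin

print("support vectors:\n", clf.support_vectors_)      # the samples that define the hyperplane
print("w =", clf.coef_[0], " w0 =", clf.intercept_[0])
print("decisions:", np.sign(clf.decision_function(X)))
```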

slide-92
SLIDE 92

SVM: Optimal Hyperplane

  • LD(a) depends on the number of samples, not on the dimension

– samples appear only through the dot products xjᵀxi

  • This will become important when looking for a

nonlinear discriminant function, as we will see soon

92

slide-93
SLIDE 93

SVM: Non-Separable Case

  • Data are most likely to be not linearly separable,

but linear classifier may still be appropriate

  • Can apply SVM in non linearly separable case
  • Data should be “almost” linearly separable for

good performance

93

slide-94
SLIDE 94

SVM: Non-Separable Case

  • Use slack variables ξ1, …, ξn (one for each sample)

  • Change the constraints from zi(wᵀxi + w0) ≥ 1 to zi(wᵀxi + w0) ≥ 1 − ξi
  • ξi is a measure of deviation from the ideal for xi

– ξi > 1: xi is on the wrong side of the separating hyperplane
– 0 < ξi < 1: xi is on the right side of the separating hyperplane but within the region of maximum margin
– ξi < 0: the ideal case for xi

94

slide-95
SLIDE 95

SVM: Non-Separable Case

  • We would like to minimize J(w, ξ1, …, ξn) = ½ ||w||² + β Σi I(ξi)
  • where I(ξi) = 1 if ξi > 0, and 0 otherwise
  • Constrained to zi(wᵀxi + w0) ≥ 1 − ξi
  • β is a constant that measures the relative weight of the first and second terms

– If β is small, we allow a lot of samples to be in non-ideal positions
– If β is large, few samples can be in non-ideal positions

95

slide-96
SLIDE 96

SVM: Non-Separable Case

96

slide-97
SLIDE 97

SVM: Non-Separable Case

  • Unfortunately this minimization problem is

NP-hard due to the discontinuity of I(ξi)

  • Instead, we minimize J(w, ξ1, …, ξn) = ½ ||w||² + β Σi ξi
  • Subject to zi(wᵀxi + w0) ≥ 1 − ξi and ξi ≥ 0

97

slide-98
SLIDE 98

SVM: Non-Separable Case

  • Use the Kuhn-Tucker theorem to convert to the same dual problem, now with the additional constraint 0 ≤ ai ≤ β
  • w is computed using: w = Σi ai zi xi
  • Remember that

98

slide-99
SLIDE 99

Nonlinear Mapping

  • Cover’s theorem: “a pattern-classification problem

cast in a high dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space”

  • One dimensional space, not linearly separable
  • Lift to two-dimensional space with φ(x) = (x, x²)

99

slide-100
SLIDE 100

Nonlinear Mapping

  • To solve a non-linear classification problem with a linear classifier:

1. Project data x to high dimension using the function φ(x)
2. Find a linear discriminant function for the transformed data φ(x)
3. The final nonlinear discriminant function is g(x) = wᵀφ(x) + w0

  • In 2D, the discriminant function is linear
  • In 1D, the discriminant function is not linear

100

slide-101
SLIDE 101

Nonlinear Mapping

  • However, there always exists a mapping of

N samples to an N-dimensional space in which the samples are separable by hyperplanes

101

slide-102
SLIDE 102

Nonlinear SVM

  • Can use any linear classifier after lifting data to a

higher dimensional space. However we will have to deal with the curse of dimensionality

– Poor generalization to test data
– Computationally expensive

  • SVM avoids the curse of dimensionality problems

– Enforcing largest margin permits good generalization

  • It can be shown that generalization in SVM is a function of

the margin, independent of the dimensionality

– Computation in the higher dimensional case is performed only implicitly through the use of kernel functions

102

slide-103
SLIDE 103

Kernels

  • SVM optimization: maximize LD(a) = Σi ai − ½ Σi Σj ai aj zi zj xiᵀxj

  • Note this optimization depends on the samples xi only through the dot products xiᵀxj

  • If we lift xi to high dimension using φ(x), we need to compute the high dimensional products φ(xi)ᵀφ(xj) in the same maximization

  • Idea: find a kernel function K(xi, xj) s.t. K(xi, xj) = φ(xi)ᵀφ(xj)

103

slide-104
SLIDE 104

Kernel Trick

  • Then we only need to compute K(xi, xj) instead of φ(xi)ᵀφ(xj)

  • “Kernel trick”: we do not need to perform operations in the high dimensional space explicitly

104

slide-105
SLIDE 105

Kernel Example

  • Suppose we have two features and K(x, y) = (xᵀy)²

  • Which mapping φ(x) does this correspond to?

105
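
For two features the answer is the standard degree-2 mapping φ(x) = (x1², √2·x1x2, x2²). The short check below is not part of the slides; it just verifies numerically that (xᵀy)² equals φ(x)ᵀφ(y).

```python
import numpy as np

def phi(x):
    # Explicit lifting corresponding to K(x, y) = (x^T y)^2 for two features
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print((x @ y) ** 2)      # kernel value computed in the original 2D space
print(phi(x) @ phi(y))   # same value via the explicit high-dimensional dot product
```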

slide-106
SLIDE 106

Choice of Kernel

  • How to choose the kernel function K(xi, xj)?

– K(xi, xj) should correspond to φ(xi)ᵀφ(xj) in a higher dimensional space
– Mercer's condition tells us which kernel functions can be expressed as a dot product of two vectors
– If K and K′ are kernels, aK + bK′ is a kernel (for a, b ≥ 0)

  • Intuitively: the kernel should measure the similarity between xi and xj

– As the inner product measures similarity of unit vectors
– May be problem-specific

106

slide-107
SLIDE 107

Choice of Kernel

  • Some common choices:

– Polynomial kernel
– Gaussian radial basis kernel
– Hyperbolic tangent (sigmoid) kernel: K(xi, xj) = tanh(k xiᵀxj + c)

  • The mappings φ(xi) never have to be computed!!

107

slide-108
SLIDE 108

Intersection Kernel

  • Feature vectors are histograms
  • When K(xi, xj) is small, xi and xj are dissimilar

  • When K(xi, xj) is large, xi and xj are similar

  • The mapping φ(x) does not exist

108

K(xi, xj) = Σ from k=1 to n of min(xik, xjk)

slide-109
SLIDE 109

More Additive Kernels

  • χ² kernel: K(x, y) = Σ from k=1 to n of 2 xk yk / (xk + yk)
  • Hellinger's kernel: KH(x, y) = Σ from k=1 to n of √(xk yk)
  • Designed for feature vectors that are histograms

– Can be used for other feature vectors

  • Offer very large speed-ups

109
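
A direct NumPy transcription of the three additive kernels above (not from the slides), for histogram feature vectors x and y assumed non-negative and of equal length; the small eps guarding against empty bins is my addition.

```python
import numpy as np

def intersection_kernel(x, y):
    return np.minimum(x, y).sum()

def chi2_kernel(x, y, eps=1e-12):
    return (2.0 * x * y / (x + y + eps)).sum()   # eps avoids division by zero on empty bins

def hellinger_kernel(x, y):
    return np.sqrt(x * y).sum()

# Example with two small (unnormalized) histograms
x = np.array([3.0, 0.0, 5.0, 2.0])
y = np.array([1.0, 4.0, 5.0, 0.0])
print(intersection_kernel(x, y), chi2_kernel(x, y), hellinger_kernel(x, y))
```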

slide-110
SLIDE 110

The Kernel Matrix

  • a.k.a the Gram matrix
  • Contains all necessary information for the

learning algorithm

  • Fuses information about the data and the

kernel (similarity measure)

110

slide-111
SLIDE 111

Bad Kernels

  • The kernel matrix is mostly diagonal

– All points are orthogonal to each other

  • Bad similarity measure
  • Too many irrelevant features in high

dimensional space

  • We need problem-specific knowledge to

choose appropriate kernel

111

slide-112
SLIDE 112

Nonlinear SVM Step-by-Step

  • Start with data x1, …, xn which live in a feature space of dimension d

  • Choose a kernel K(xi, xj) or a function φ(xi) which lifts sample xi to a higher dimensional space

  • Find the maximum margin linear discriminant function in

the higher dimensional space by using quadratic programming package to solve:

112

slide-113
SLIDE 113

Nonlinear SVM Step-by-Step

  • Weight vector w in the high dimensional space: w = Σ over xi in S of ai zi φ(xi)

– where S is the set of support vectors

  • Linear discriminant function of maximum margin in the high dimensional space: g(x) = wᵀφ(x) + w0

  • Non-linear discriminant function in the original space: g(x) = Σ over xi in S of ai zi K(xi, x) + w0

  • Decide class 1 if g(x) > 0, otherwise decide class 2

113

slide-114
SLIDE 114

Nonlinear SVM

  • Nonlinear discriminant function

114

slide-115
SLIDE 115

SVM Example: XOR Problem

  • Class 1: x1 = [1, -1], x2 = [-1, 1]

  • Class 2: x3 = [1, 1], x4 = [-1, -1]

  • Use polynomial kernel of degree 2:

– This kernel corresponds to the mapping

  • Need to maximize

constrained to

115

slide-116
SLIDE 116

SVM Example: XOR Problem

  • After some manipulation …
  • The solution is a1 = a2 = a3 = a4 = 0.25

– satisfies the constraints

  • All samples are support vectors

116

slide-117
SLIDE 117

SVM Example: XOR Problem

  • The weight vector w is:
  • Thus the nonlinear discriminant function is:

117

slide-118
SLIDE 118

SVM Example: XOR Problem

118

slide-119
SLIDE 119

SVM Summary

  • Advantages:

– Based on very strong theory
– Excellent generalization properties
– The objective function has no local minima
– Can be used to find non-linear discriminant functions
– Complexity of the classifier is characterized by the number of support vectors rather than the dimensionality of the transformed space

  • Disadvantages:

– Directly applicable only to two-class problems
– Quadratic programming is computationally expensive
– Need to choose the kernel

119

slide-120
SLIDE 120

Multi-Class SVMs

  • One against all
  • Pairwise
  • These ideas apply to all binary classifiers

when faced with multi-class problems

120

slide-121
SLIDE 121

One-Against-All

  • SVMs can only handle two-class outputs
  • What can be done?
  • Answer: learn N SVM’s

– SVM 1 learns “Output==1” vs “Output != 1”
– SVM 2 learns “Output==2” vs “Output != 2”
– …
– SVM N learns “Output==N” vs “Output != N”

121

slide-122
SLIDE 122

One-Against-All

  • Original idea (Vapnik, 1995): classify x as

ωi if and only if the corresponding SVM accepts x and all other SVMs reject it

122

?

slide-123
SLIDE 123

One-Against-All

  • Modified idea (Vapnik, 1998): classify x

according to the SVM that produces the highest value (use more than sign of decision function)

123
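
A sketch (not from the slides) of this modified one-against-all scheme built from N binary scikit-learn SVCs: each class is trained against the rest, and a new sample is assigned to the class whose SVM produces the highest decision value. The kernel, C, and the placeholder data are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(X, labels, **svc_params):
    """Train one binary SVM per class: "Output == c" vs "Output != c"."""
    classes = np.unique(labels)
    models = [SVC(**svc_params).fit(X, np.where(labels == c, 1, -1)) for c in classes]
    return classes, models

def predict_one_vs_all(classes, models, X):
    """Classify according to the SVM that produces the highest decision value."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return classes[np.argmax(scores, axis=1)]

# Usage sketch with placeholder data (three classes in 2D)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (10, 2)) for c in (-2.0, 0.0, 2.0)])
labels = np.repeat([0, 1, 2], 10)
classes, models = train_one_vs_all(X, labels, kernel="rbf", C=1.0)
print(predict_one_vs_all(classes, models, X))
```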

slide-124
SLIDE 124

Pairwise SVMs

  • Learn N(N-1)/2 SVM’s

– SVM 1 learns “Output==1” vs “Output == 2”
– SVM 2 learns “Output==1” vs “Output == 3”
– …
– SVM M learns “Output==N-1” vs “Output == N”

124

slide-125
SLIDE 125

Pairwise SVMs

  • To classify a new input, apply each SVM

and choose the label that “wins” most often

125