Support Vector Machines (Machine Learning lecture slides)

SLIDE 1

Machine Learning

Support Vector Machines

1

SLIDE 2

Big picture

2

Linear models

SLIDE 3

Big picture

3

Linear models How good is a learning algorithm?

SLIDE 4

Big picture

4

Linear models How good is a learning algorithm? Online learning Perceptron, Winnow

SLIDE 5

Big picture

5

Linear models How good is a learning algorithm? Online learning PAC, Agnostic learning Perceptron, Winnow

SLIDE 6

Big picture

6

Linear models How good is a learning algorithm? Online learning PAC, Agnostic learning Perceptron, Winnow Support Vector Machines

SLIDE 7

Big picture

7

Linear models How good is a learning algorithm? Online learning PAC, Agnostic learning Perceptron, Winnow Support Vector Machines

…. ….

SLIDE 8

This lecture: Support vector machines

  • Training by maximizing margin
  • The SVM objective
  • Solving the SVM optimization problem
  • Support vectors, duals and kernels

8

SLIDE 10

VC dimensions and linear classifiers

What we know so far:

1. If we have $n$ examples, then with probability $1 - \epsilon$, the true error of a hypothesis $h$ with training error $err_S(h)$ is bounded by

$$err_D(h) \;\le\; err_S(h) + \sqrt{\frac{VC(H)\left(\ln\frac{2n}{VC(H)} + 1\right) + \ln\frac{4}{\epsilon}}{n}}$$

(Generalization error on the left; training error plus a function of the VC dimension on the right. A low VC dimension gives a tighter bound.)

10


SLIDE 12

VC dimensions and linear classifiers

What we know so far:

1. If we have $n$ examples, then with probability $1 - \epsilon$, the true error of a hypothesis $h$ with training error $err_S(h)$ is bounded by

$$err_D(h) \;\le\; err_S(h) + \sqrt{\frac{VC(H)\left(\ln\frac{2n}{VC(H)} + 1\right) + \ln\frac{4}{\epsilon}}{n}}$$

2. The VC dimension of a linear classifier in $d$ dimensions is $d + 1$.

12

(Generalization error on the left; training error plus a function of the VC dimension on the right. A low VC dimension gives a tighter bound.)
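As a quick sanity check on how this bound behaves, here is a small numeric sketch (the function name and the chosen numbers are illustrative, not from the slides) that plugs values into the bound above:

```python
import math

def vc_generalization_bound(train_err, n, vc_dim, eps):
    """Upper bound on the true error from the slide:
    err_D(h) <= err_S(h) + sqrt((VC(H)*(ln(2n/VC(H)) + 1) + ln(4/eps)) / n)."""
    width = math.sqrt((vc_dim * (math.log(2 * n / vc_dim) + 1) + math.log(4 / eps)) / n)
    return train_err + width

# A linear classifier in d = 20 dimensions has VC dimension d + 1 = 21.
print(vc_generalization_bound(train_err=0.05, n=10_000, vc_dim=21, eps=0.05))
print(vc_generalization_bound(train_err=0.05, n=100_000, vc_dim=21, eps=0.05))
```

More examples (larger n) shrink the second term, so the guarantee on the true error tightens.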

SLIDE 13

VC dimensions and linear classifiers

What we know so far:

1. If we have $n$ examples, then with probability $1 - \epsilon$, the true error of a hypothesis $h$ with training error $err_S(h)$ is bounded by

$$err_D(h) \;\le\; err_S(h) + \sqrt{\frac{VC(H)\left(\ln\frac{2n}{VC(H)} + 1\right) + \ln\frac{4}{\epsilon}}{n}}$$

2. The VC dimension of a linear classifier in $d$ dimensions is $d + 1$.

13

But are all linear classifiers the same?

SLIDE 14

Recall: Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

14

[Figure: positively and negatively labeled points with a separating hyperplane]
SLIDE 15

Recall: Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

15

[Figure: the same data, with the margin with respect to this hyperplane marked]
SLIDE 16

Which line is a better choice? Why?

[Figure: the same data separated by two different hyperplanes, h1 and h2]

16

SLIDE 17

Which line is a better choice? Why?

[Figure: the two hyperplanes h1 and h2, with a new test point near the boundary]

A new example, not from the training set, might be misclassified if the margin is smaller.

17

SLIDE 18

Data dependent VC dimension

  β€’ Intuitively, larger margins are better
  β€’ Suppose we only consider linear separators with margins Ξ³1 and Ξ³2
    – H1 = linear separators that have a margin Ξ³1
    – H2 = linear separators that have a margin Ξ³2
    – and Ξ³1 > Ξ³2
  β€’ Then the entire set of functions H1 is β€œbetter”

18

SLIDE 19

Data dependent VC dimension

Theorem (Vapnik):

  – Let H be the set of linear classifiers that separate the training set by a margin of at least Ξ³
  – Then VC(H) ≀ min(RΒ² / Ξ³Β², d) + 1
  – R is the radius of the smallest sphere containing the data

19

SLIDE 20

Data dependent VC dimension

Theorem (Vapnik):

  – Let H be the set of linear classifiers that separate the training set by a margin of at least Ξ³
  – Then VC(H) ≀ min(RΒ² / Ξ³Β², d) + 1
  – R is the radius of the smallest sphere containing the data

Larger margin β‡’ lower VC dimension. Lower VC dimension β‡’ better generalization bound.
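To see how the theorem trades margin against dimension, here is a tiny numeric sketch (the function name and the numbers are illustrative):

```python
def margin_vc_bound(radius, margin, d):
    """Vapnik's bound from the slide: VC(H) <= min(R^2 / gamma^2, d) + 1,
    for linear separators with margin at least gamma on data inside a sphere of radius R."""
    return min(radius**2 / margin**2, d) + 1

# In d = 1000 dimensions, the dimension-based bound alone would be d + 1 = 1001.
print(margin_vc_bound(radius=1.0, margin=0.1, d=1000))  # 101.0
print(margin_vc_bound(radius=1.0, margin=0.5, d=1000))  # 5.0
```

A larger margin drives the RΒ²/Ξ³Β² term down, which is exactly the "larger margin β‡’ lower VC dimension" claim.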

20

SLIDE 21

Learning strategy

Find the linear separator that maximizes the margin

21

SLIDE 22

This lecture: Support vector machines

  • Training by maximizing margin
  • The SVM objective
  • Solving the SVM optimization problem
  • Support vectors, duals and kernels

22

SLIDE 23

Support Vector Machines

  β€’ Lower VC dimension β†’ better generalization
  β€’ Vapnik: for linear separators, the VC dimension depends inversely on the margin
    – That is, larger margin β†’ better generalization
  β€’ For the separable case:
    – Among all linear classifiers that separate the data, find the one that maximizes the margin
    – Maximize the margin by minimizing wα΅€w subject to y_i wα΅€x_i β‰₯ 1 for all examples
  β€’ General case:
    – Introduce slack variables, one ΞΎ_i for each example
    – Slack variables allow the margin constraint above to be violated

23

So far

SLIDE 24

Recall: The geometry of a linear classifier

24

Prediction = sgn(b + w1 x1 + w2 x2)

[Figure: positive and negative examples with the hyperplane b + w1 x1 + w2 x2 = 0; the distance of a point from it is]

$$\frac{|w_1 x_1 + w_2 x_2 + b|}{\sqrt{w_1^2 + w_2^2}} = \frac{y\,(w_1 x_1 + w_2 x_2 + b)}{\|\mathbf{w}\|}$$

SLIDE 25

Recall: The geometry of a linear classifier

25

Prediction = sgn(b + w1 x1 + w2 x2)

We only care about the sign, not the magnitude.

[Figure: the same data and the hyperplane b + w1 x1 + w2 x2 = 0]

$$\frac{|w_1 x_1 + w_2 x_2 + b|}{\sqrt{w_1^2 + w_2^2}} = \frac{y\,(w_1 x_1 + w_2 x_2 + b)}{\|\mathbf{w}\|}$$

SLIDE 26

Recall: The geometry of a linear classifier

26

All of these equations describe the same hyperplane:

b + w1 x1 + w2 x2 = 0
2b + 2w1 x1 + 2w2 x2 = 0
1000b + 1000w1 x1 + 1000w2 x2 = 0

We only care about the sign, not the magnitude: we could multiply or divide the coefficients by any positive number and the sign of the prediction would not change.

Prediction = sgn(b + w1 x1 + w2 x2)

$$\frac{|w_1 x_1 + w_2 x_2 + b|}{\sqrt{w_1^2 + w_2^2}} = \frac{y\,(w_1 x_1 + w_2 x_2 + b)}{\|\mathbf{w}\|}$$
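A small numpy sketch of this scale invariance (the point, the weights, and the helper names are made up for illustration):

```python
import numpy as np

def predict(w, b, x):
    """Prediction from the slides: sgn(b + w1*x1 + w2*x2)."""
    return np.sign(b + w @ x)

def signed_distance(w, b, x, y):
    """y * (w^T x + b) / ||w||: positive iff x lies on its correct side."""
    return y * (w @ x + b) / np.linalg.norm(w)

w, b, x = np.array([2.0, -1.0]), 0.5, np.array([1.0, 3.0])
# Scaling (w, b) by a positive constant changes the raw score but not the sign...
print(predict(w, b, x), predict(1000 * w, 1000 * b, x))
# ...and the normalized distance is unchanged as well.
print(signed_distance(w, b, x, y=-1), signed_distance(1000 * w, 1000 * b, x, y=-1))
```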

SLIDE 27

Maximizing margin

  β€’ Margin of a hyperplane = distance of the closest point from the hyperplane:

$$\gamma_{\mathbf{w},b} = \min_i \frac{y_i\,(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

  β€’ We want to maximize this margin over w and b

27

Some people call this the geometric margin. The numerator alone is called the functional margin.

SLIDE 28

Maximizing margin

  β€’ Margin of a hyperplane = distance of the closest point from the hyperplane:

$$\gamma_{\mathbf{w},b} = \min_i \frac{y_i\,(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

  β€’ We want to maximize this margin: $\max_{\mathbf{w},b}\ \gamma_{\mathbf{w},b}$

28

Sometimes this is called the geometric margin. The numerator alone is called the functional margin.
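A one-line numpy sketch of this definition (assuming labels in {βˆ’1, +1}; the function name is illustrative):

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Margin of the hyperplane (w, b) on a dataset (X, y):
    the smallest normalized signed distance, min_i y_i (w^T x_i + b) / ||w||."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```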

SLIDE 29

Recall: The geometry of a linear classifier

29

b + w1 x1 + w2 x2 = 0

We only care about the sign, not the magnitude.

[Figure: the data and the hyperplane]

Prediction = sgn(b + w1 x1 + w2 x2)

$$\frac{|w_1 x_1 + w_2 x_2 + b|}{\sqrt{w_1^2 + w_2^2}}$$

SLIDE 30

Towards maximizing the margin

30

The distance is unchanged if we divide the weights and the bias by any positive constant c:

$$\frac{\left|\tfrac{w_1}{c} x_1 + \tfrac{w_2}{c} x_2 + \tfrac{b}{c}\right|}{\sqrt{\left(\tfrac{w_1}{c}\right)^2 + \left(\tfrac{w_2}{c}\right)^2}} = \frac{|w_1 x_1 + w_2 x_2 + b|}{\sqrt{w_1^2 + w_2^2}}$$

[Figure: the data and the hyperplane b + w1 x1 + w2 x2 = 0]

We only care about the sign, not the magnitude, so we can scale the weights to make the optimization easier.

SLIDE 31

Towards maximizing the margin

31

$$\frac{\left|\tfrac{w_1}{c} x_1 + \tfrac{w_2}{c} x_2 + \tfrac{b}{c}\right|}{\sqrt{\left(\tfrac{w_1}{c}\right)^2 + \left(\tfrac{w_2}{c}\right)^2}} = \frac{|w_1 x_1 + w_2 x_2 + b|}{\sqrt{w_1^2 + w_2^2}}$$

[Figure: the data and the hyperplane b + w1 x1 + w2 x2 = 0]

Key observation: we can scale the weights so that the numerator is 1 for the points that define the margin.

We only care about the sign, not the magnitude, so we can scale the weights to make the optimization easier.

SLIDE 32

Towards maximizing the margin

32

For the points that define the margin, the scaled numerator is 1, so their distance becomes

$$\frac{\left|\tfrac{w_1}{c} x_1 + \tfrac{w_2}{c} x_2 + \tfrac{b}{c}\right|}{\sqrt{\left(\tfrac{w_1}{c}\right)^2 + \left(\tfrac{w_2}{c}\right)^2}} = \frac{1}{\sqrt{\left(\tfrac{w_1}{c}\right)^2 + \left(\tfrac{w_2}{c}\right)^2}}$$

[Figure: the data and the hyperplane b + w1 x1 + w2 x2 = 0]

Key observation: we can scale the weights so that the numerator is 1 for the points that define the margin.

We only care about the sign, not the magnitude, so we can scale the weights to make the optimization easier.

SLIDE 33

Towards maximizing the margin

33

Renaming the rescaled weights w1/c, w2/c back to w1, w2, the margin defined by those points is simply

$$\frac{1}{\sqrt{w_1^2 + w_2^2}} = \frac{1}{\|\mathbf{w}\|}$$

[Figure: the data and the hyperplane b + w1 x1 + w2 x2 = 0]

Key observation: we can scale the weights so that the numerator is 1 for the points that define the margin.

We only care about the sign, not the magnitude, so we can scale the weights to make the optimization easier.

SLIDE 34

Towards maximizing the margin

34

b + w1 x1 + w2 x2 = 0

[Figure: the data and the hyperplane, with the distance of a point]

$$\frac{|w_1 x_1 + w_2 x_2 + b|}{\sqrt{w_1^2 + w_2^2}}$$

SLIDE 35

Maximizing margin

  β€’ Margin of a hyperplane = distance of the closest point from the hyperplane:

$$\gamma_{\mathbf{w},b} = \min_i \frac{y_i\,(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

  β€’ We want to maximize this margin: $\max_{\mathbf{w},b}\ \gamma_{\mathbf{w},b}$
  β€’ In the end we only care about the sign of the score, not its magnitude
    – Set the absolute score (functional margin) of the closest point to be 1 and allow w to adjust itself

35

In this setting, maximizing the margin Ξ³ is equivalent to minimizing wα΅€w.
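One short worked step, not spelled out on the slides but following directly from the definitions above, makes this equivalence explicit:

```latex
% If we rescale (w, b) so that the closest point satisfies y_i (w^T x_i + b) = 1,
% then every example satisfies y_i (w^T x_i + b) >= 1 and the margin becomes
\gamma_{\mathbf{w},b}
  = \min_i \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}
  = \frac{1}{\|\mathbf{w}\|}.
% Maximizing 1 / ||w|| is the same as minimizing ||w||, hence minimizing (1/2) w^T w.
```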

SLIDE 36

Max-margin classifiers

Learning problem:

$$\min_{\mathbf{w}}\ \tfrac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{s.t.}\quad y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1 \ \text{ for all examples } i$$

36

SLIDE 37

Max-margin classifiers

Learning problem:

$$\min_{\mathbf{w}}\ \tfrac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{s.t.}\quad y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1 \ \text{ for all examples } i$$

37

Minimizing wα΅€w gives us the maximum margin.

SLIDE 38

Max-margin classifiers

Learning problem:

$$\min_{\mathbf{w}}\ \tfrac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{s.t.}\quad y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1 \ \text{ for all examples } i$$

38

The constraint is true for every example, in particular for the example closest to the separator. Minimizing wα΅€w gives us the maximum margin.

SLIDE 39

Max-margin classifiers

Learning problem:

$$\min_{\mathbf{w}}\ \tfrac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{s.t.}\quad y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1 \ \text{ for all examples } i$$

  β€’ This is called the β€œhard” Support Vector Machine

We will look at how to solve this optimization problem later.

39

The constraint is true for every example, in particular for the example closest to the separator. Minimizing wα΅€w gives us the maximum margin.
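As a concrete (if approximate) sketch: scikit-learn only exposes the soft-margin solver, but a very large C makes it behave essentially like the hard SVM on separable data. The toy points below are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# A linearly separable toy dataset with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Very large C ~ hard margin: constraint violations become prohibitively expensive.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
# Functional margins y_i (w^T x_i + b): all >= 1 (up to numerical tolerance).
print(y * (X @ w + b))
```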

SLIDE 40

Support Vector Machines

  β€’ Lower VC dimension β†’ better generalization
  β€’ Vapnik: for linear separators, the VC dimension depends inversely on the margin
    – That is, larger margin β†’ better generalization
  β€’ For the separable case:
    – Among all linear classifiers that separate the data, find the one that maximizes the margin
    – Maximize the margin by minimizing wα΅€w subject to y_i wα΅€x_i β‰₯ 1 for all examples
  β€’ General case:
    – Introduce slack variables, one ΞΎ_i for each example
    – Slack variables allow the margin constraint above to be violated

40

So far

SLIDE 41

What if the data is not separable?

41

Hard SVM: maximize the margin; every example has a functional margin of at least 1.

  β€’ This is a constrained optimization problem
  β€’ If the data is not separable, there is no w that will classify the data correctly
  β€’ Infeasible problem, no solution!

SLIDE 43

Dealing with non-separable data

Key idea: Allow some examples to β€œbreak into the margin”

43

[Figure: the data with a stray positive example among the negatives]
SLIDE 45

Dealing with non-separable data

Key idea: Allow some examples to β€œbreak into the margin”

45

[Figure: the data with a stray positive example among the negatives; the separator ignores it]

This separator has a large enough margin that it should generalize well.

SLIDE 46

Dealing with non-separable data

Key idea: Allow some examples to β€œbreak into the margin”

46

So, when computing the margin, ignore the examples that make the margin smaller or make the data inseparable.

[Figure: the data with a stray positive example among the negatives; the separator ignores it]

This separator has a large enough margin that it should generalize well.

SLIDE 47

Soft SVM

  β€’ Hard SVM: maximize the margin; every example has a functional margin of at least 1
  β€’ Introduce one slack variable ΞΎ_i per example
    – and require y_i wα΅€x_i β‰₯ 1 βˆ’ ΞΎ_i and ΞΎ_i β‰₯ 0

47

Intuition: the slack variable allows examples to β€œbreak” into the margin. If the slack value is zero, then the example is either on or outside the margin.


SLIDE 50

Soft SVM

  β€’ Hard SVM: maximize the margin; every example has a functional margin of at least 1
  β€’ Introduce one slack variable ΞΎ_i per example
    – and require y_i wα΅€x_i β‰₯ 1 βˆ’ ΞΎ_i and ΞΎ_i β‰₯ 0
  β€’ New optimization problem for learning:

$$\min_{\mathbf{w},\,\boldsymbol{\xi}}\ \tfrac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_i \xi_i \quad \text{s.t.}\quad y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1 - \xi_i,\ \ \xi_i \ge 0$$

50


SLIDE 52

Soft SVM

52

SLIDE 53

Soft SVM

53

Maximize margin

SLIDE 54

Soft SVM

54

Maximize margin. Minimize total slack (i.e., allow as few examples as possible to violate the margin).

SLIDE 55

Soft SVM

55

Maximize margin. Minimize total slack (i.e., allow as few examples as possible to violate the margin). Tradeoff between the two terms.
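A minimal training sketch for this objective, assuming a bias-free weight vector and labels in {βˆ’1, +1} (the function name, step size, and toy data are illustrative):

```python
import numpy as np

def soft_svm_subgradient_descent(X, y, C=1.0, lr=0.01, steps=1000):
    """Batch sub-gradient descent on the soft-SVM objective
    0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * w^T x_i)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        violators = margins < 1          # examples whose hinge term is active
        # Sub-gradient: w from the regularizer, minus C * sum of y_i x_i over violators.
        grad = w - C * (y[violators] @ X[violators])
        w -= lr * grad
    return w

# Toy usage on four points (labels in {-1, +1}).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(soft_svm_subgradient_descent(X, y))
```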

SLIDE 56

Support Vector Machines

  β€’ Lower VC dimension β†’ better generalization
  β€’ Vapnik: for linear separators, the VC dimension depends inversely on the margin
    – That is, larger margin β†’ better generalization
  β€’ For the separable case:
    – Among all linear classifiers that separate the data, find the one that maximizes the margin
    – Maximize the margin by minimizing wα΅€w subject to y_i wα΅€x_i β‰₯ 1 for all examples
  β€’ General case:
    – Introduce slack variables, one ΞΎ_i for each example
    – Slack variables allow the margin constraint above to be violated

56

So far

SLIDE 57

Support Vector Machines

Eliminate the slack variables to rewrite this. The resulting form is more interpretable.

57

Maximize margin. Minimize total slack (i.e., allow as few examples as possible to violate the margin). Tradeoff between the two terms.
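The rewrite the slide alludes to is the standard one: at the optimum each slack variable equals ΞΎ_i = max(0, 1 βˆ’ y_i wα΅€x_i), so substituting it away gives the unconstrained form

```latex
\min_{\mathbf{w}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w}
  \;+\; C \sum_i \max\!\bigl(0,\; 1 - y_i\,\mathbf{w}^T\mathbf{x}_i\bigr)
```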


SLIDE 59

Maximizing margin and minimizing loss

Three cases:

  β€’ Example is correctly classified and is outside the margin: penalty = 0
  β€’ Example is incorrectly classified: penalty = 1 βˆ’ y_i wα΅€x_i
  β€’ Example is correctly classified but within the margin: penalty = 1 βˆ’ y_i wα΅€x_i

This is the hinge loss function.

59

Maximize margin. Penalty for the prediction.


SLIDE 65

The Hinge Loss

65

[Figure: loss as a function of y wα΅€x]

SLIDE 66

The Hinge Loss

66

[Figure: the 0-1 loss and the hinge loss as functions of y wα΅€x]

SLIDE 67

The Hinge Loss

67

[Figure: the 0-1 loss and the hinge loss as functions of y wα΅€x]

0-1 loss: if the signs of y and wα΅€x are the same, there is no penalty; if they differ, the penalty is 1.

SLIDE 68

The Hinge Loss

68

[Figure: the hinge loss as a function of y wα΅€x]

Hinge loss:

  β€’ No penalty once wα΅€x is at or beyond 1 for positive examples (at or below βˆ’1 for negative examples)
  β€’ Predictions are penalized even if they are correct but too close to the margin
  β€’ Incorrect predictions get a penalty that increases linearly as wα΅€x moves further onto the wrong side
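A tiny numpy sketch of the two losses as functions of the score y wα΅€x (the function names and sample scores are illustrative; a score of exactly 0 is counted as a mistake here):

```python
import numpy as np

def zero_one_loss(scores):
    """0-1 loss on the signed score y * w^T x: 1 on a mistake, else 0."""
    return (scores <= 0).astype(float)

def hinge_loss(scores):
    """Hinge loss max(0, 1 - y * w^T x): zero only when the example is
    correctly classified AND outside the margin (score >= 1)."""
    return np.maximum(0.0, 1.0 - scores)

scores = np.array([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0])  # sample values of y * w^T x
print(zero_one_loss(scores))  # [1. 1. 1. 0. 0. 0.]
print(hinge_loss(scores))     # [3.  1.5 1.  0.5 0.  0. ]
```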

SLIDE 69

Maximizing margin and minimizing loss

We can consider three cases:

  β€’ Example is correctly classified and is outside the margin: penalty = 0
  β€’ Example is incorrectly classified: penalty = 1 βˆ’ y_i wα΅€x_i
  β€’ Example is correctly classified but within the margin: penalty = 1 βˆ’ y_i wα΅€x_i

69

Maximize margin. Penalty for the prediction.

SLIDE 70

General learning principle

Risk minimization

70

Define the notion of β€œloss” over the training data as a function of a hypothesis. Learning = find the hypothesis that has the lowest loss on the training data.

SLIDE 71

General learning principle

Regularized risk minimization

71

Define the notion of β€œloss” over the training data as a function of a hypothesis. Define a regularization function that penalizes over-complex hypotheses. Learning = find the hypothesis that has the lowest [regularizer + loss on the training data].

Capacity control gives better generalization.

SLIDE 72

SVM objective function

72

Regularization term:

  β€’ Maximizes the margin
  β€’ Imposes a preference over the hypothesis space and pushes for better generalization
  β€’ Can be replaced with other regularization terms which impose other preferences

Empirical loss:

  β€’ Hinge loss
  β€’ Penalizes weight vectors that make mistakes
  β€’ Can be replaced with other loss functions which impose other preferences

SLIDE 73

SVM objective function

73

Regularization term:

  β€’ Maximizes the margin
  β€’ Imposes a preference over the hypothesis space and pushes for better generalization
  β€’ Can be replaced with other regularization terms which impose other preferences

Empirical loss:

  β€’ Hinge loss
  β€’ Penalizes weight vectors that make mistakes
  β€’ Can be replaced with other loss functions which impose other preferences

A hyper-parameter controls the tradeoff between a large margin and a small hinge loss.
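To make the "regularizer + loss" decomposition concrete, here is a small sketch in which either piece can be swapped (the names and the logistic alternative are illustrative, not prescribed by the slides):

```python
import numpy as np

def l2_regularizer(w):
    """The margin-maximizing term 0.5 * ||w||^2."""
    return 0.5 * w @ w

def hinge(scores):
    return np.maximum(0.0, 1.0 - scores)

def logistic(scores):
    return np.log1p(np.exp(-scores))

def svm_objective(w, X, y, C=1.0, loss=hinge):
    """Regularizer + C * empirical loss on the scores y_i w^T x_i.
    Swapping `loss` (e.g. for `logistic`) changes the learner while keeping
    the same regularized-risk-minimization template."""
    return l2_regularizer(w) + C * loss(y * (X @ w)).sum()
```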