Machine Learning
Support Vector Machines
Big picture

- Linear models
- How good is a learning algorithm?
  - Online learning: Perceptron, Winnow
  - PAC, agnostic learning
- Support Vector Machines
How good is a learning algorithm? Learning theory gives bounds of the form

Generalization error ≤ Training error + (a function of the VC dimension)

A low VC dimension gives a tighter bound.
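For reference, the classical VC bound has the following shape (a sketch from standard learning theory; the exact constants are not from the slides). With probability at least $1-\delta$ over a sample of $m$ training examples,

$$\text{err}_D(h) \;\le\; \text{err}_S(h) \;+\; \sqrt{\frac{d_{VC}\left(\ln\frac{2m}{d_{VC}} + 1\right) + \ln\frac{4}{\delta}}{m}}$$

where $d_{VC}$ is the VC dimension of the hypothesis class. The second term shrinks as $d_{VC}$ decreases, which is why a low VC dimension gives a tighter bound.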
So far

- Larger margin → better generalization.
- Among all linear classifiers that separate the data, find the one that maximizes the margin.
- Maximize the margin by minimizing $\mathbf{w}^T\mathbf{w}$, provided $y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1$ for all examples.
- Introduce slack variables, one $\xi_i$ for each example. Slack variables allow the margin constraint above to be violated.
The distance of a point $(x_1, x_2)$ with label $y$ from the hyperplane $b + w_1 x_1 + w_2 x_2 = 0$ is

$$\frac{|w_1 x_1 + w_2 x_2 + b|}{\sqrt{w_1^2 + w_2^2}} \;=\; \frac{y\,(w_1 x_1 + w_2 x_2 + b)}{\|\mathbf{w}\|}$$

where the two sides agree whenever the example is correctly classified.

The hyperplanes $b + w_1 x_1 + w_2 x_2 = 0$, $\tfrac{b}{2} + \tfrac{w_1}{2} x_1 + \tfrac{w_2}{2} x_2 = 0$, and $1000b + 1000 w_1 x_1 + 1000 w_2 x_2 = 0$ are all equivalent. We could multiply or divide the coefficients by any positive number and the sign of the prediction will not change.
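A minimal Python sanity check of these two facts (the weights and the point are made-up values, not from the slides):

```python
import numpy as np

def geometric_margin(w, b, x, y):
    """Signed distance of labeled point (x, y) from the hyperplane w.x + b = 0."""
    return y * (np.dot(w, x) + b) / np.linalg.norm(w)

w, b = np.array([2.0, 1.0]), -1.0   # hypothetical separator
x, y = np.array([1.0, 1.0]), +1     # hypothetical labeled point

print(geometric_margin(w, b, x, y))                # distance to the separator
print(geometric_margin(1000 * w, 1000 * b, x, y))  # same value: scaling changes nothing
```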
This quantity is sometimes called the geometric margin; the numerator alone is called the functional margin.
We can scale the weights to make the optimization easier. Dividing all coefficients of $b + w_1 x_1 + w_2 x_2 = 0$ by some $\gamma > 0$ leaves the distance unchanged:

$$\frac{\frac{w_1}{\gamma} x_1 + \frac{w_2}{\gamma} x_2 + \frac{b}{\gamma}}{\sqrt{\left(\frac{w_1}{\gamma}\right)^2 + \left(\frac{w_2}{\gamma}\right)^2}} \;=\; \frac{w_1 x_1 + w_2 x_2 + b}{\sqrt{w_1^2 + w_2^2}}$$

Key observation: we can choose the scale $\gamma$ so that the numerator is 1 for the points that define the margin. The margin then simplifies to

$$\frac{1}{\sqrt{\left(\frac{w_1}{\gamma}\right)^2 + \left(\frac{w_2}{\gamma}\right)^2}} \;=\; \frac{1}{\sqrt{u_1^2 + u_2^2}}$$

where $u_j = w_j/\gamma$ are the rescaled weights.
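A quick numeric check of the key observation (separator and points are hypothetical): set $\gamma$ to the smallest functional margin, and the closest point's numerator becomes exactly 1.

```python
import numpy as np

w, b = np.array([2.0, 1.0]), -1.0                 # hypothetical separator
X = np.array([[1.0, 1.0], [-1.0, -0.5]])          # hypothetical points
y = np.array([1, -1])

gamma = np.min(y * (X @ w + b))    # functional margin of the closest point
u, c = w / gamma, b / gamma        # rescaled separator

print(np.min(y * (X @ u + c)))     # exactly 1.0 after rescaling
print(1 / np.linalg.norm(u))       # the margin, 1/||u||
```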
Renaming the rescaled weights back to $\mathbf{w}$: the margin of the separator $b + w_1 x_1 + w_2 x_2 = 0$ is

$$\frac{1}{\sqrt{w_1^2 + w_2^2}} \;=\; \frac{1}{\|\mathbf{w}\|}$$

so maximizing the margin is equivalent to minimizing $\mathbf{w}^T\mathbf{w}$ in this setting: minimizing $\mathbf{w}^T\mathbf{w}$ gives us the maximum margin. After rescaling, the condition $y\,\mathbf{w}^T\mathbf{x} \ge 1$ is true for every example, and in particular for the example closest to the separator.
Maximize the margin, such that every example has a functional margin of at least 1.
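In symbols, this is the hard-margin problem: minimize $\mathbf{w}^T\mathbf{w}$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$ for every example $i$. A minimal sketch of solving this small quadratic program with scipy (the toy data and the choice of solver are mine, not the slides'):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables: t = [w1, w2, b]
objective = lambda t: t[0]**2 + t[1]**2                    # w^T w
constraints = [                                            # y_i (w.x_i + b) >= 1
    {"type": "ineq",
     "fun": lambda t, i=i: y[i] * (t[0]*X[i, 0] + t[1]*X[i, 1] + t[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b, " margin =", 1.0 / np.linalg.norm(w))
```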
Even if a separator makes a mistake or two, it may have enough margin that it should generalize well.
Maximize the margin, with every example having a functional margin of at least 1, softened by one slack variable $\xi_i$ per example: $y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1 - \xi_i$, with $\xi_i \ge 0$. Intuition: the slack variable allows examples to "break" into the margin. If the slack value is zero, then the example is either on or outside the margin.
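For a fixed separator, the smallest slack each example needs is forced: $\xi_i = \max(0,\, 1 - y_i\,\mathbf{w}^T\mathbf{x}_i)$. A quick sketch (hypothetical weights and points, with the bias folded into a constant feature):

```python
import numpy as np

w = np.array([1.0, 0.5, -0.5])                       # hypothetical weights
X = np.array([[2, 2, 1], [0.5, 0.5, 1], [-1, -1, 1]], dtype=float)
y = np.array([1, 1, -1])

xi = np.maximum(0.0, 1.0 - y * (X @ w))
print(xi)  # zero slack: on/outside the margin; positive slack: inside or misclassified
```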
The soft-margin objective: maximize the margin and minimize the total slack (i.e., allow as few examples as possible to violate the margin), with a tradeoff between the two terms:

$$\min_{\mathbf{w},\,\xi}\; \mathbf{w}^T\mathbf{w} + C \sum_i \xi_i \quad \text{such that} \quad y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1 - \xi_i,\;\; \xi_i \ge 0$$
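The hyper-parameter $C$ controls this tradeoff. A quick sketch of its effect using scikit-learn's LinearSVC (the toy data, the outlier, and the C values are mine; the slides do not reference any library):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical toy data; the last point sits on the wrong side of the gap
X = np.array([[2, 2], [3, 3], [2.5, 3], [-1, -1], [-2, -1], [2, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)
    print(f"C={C}: margin ~ {margin:.3f}, training accuracy = {clf.score(X, y):.2f}")

# Small C tolerates slack and keeps a wide margin; large C penalizes slack
# heavily and contorts the separator to fix individual mistakes.
```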
Viewed term by term, the objective has two parts: the first maximizes the margin, and the second is a penalty for the prediction on each example.
This penalty can be compared with the 0-1 loss:

- 0-1 loss: if the sign of $y$ and $\mathbf{w}^T\mathbf{x}$ is the same, then there is no penalty; if the signs are different, then the penalty is 1.
- Hinge loss: $\max(0,\, 1 - y\,\mathbf{w}^T\mathbf{x})$. Predictions are penalized even if they are correct but too close to the margin; there is no penalty if $\mathbf{w}^T\mathbf{x}$ is beyond 1 (below −1 for negative examples); incorrect predictions get a linearly increasing penalty with $\mathbf{w}^T\mathbf{x}$.
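A side-by-side tabulation of the two losses in Python (my own illustration of the definitions above):

```python
def zero_one_loss(y, score):
    """1 if the sign of the prediction disagrees with y, else 0."""
    return 1.0 if y * score <= 0 else 0.0

def hinge_loss(y, score):
    """max(0, 1 - y*score): also penalizes correct but low-margin scores."""
    return max(0.0, 1.0 - y * score)

# For a positive example (y = +1):
for score in (-2.0, -0.5, 0.5, 1.0, 2.0):
    print(f"w.x = {score:+.1f}   0-1 loss = {zero_one_loss(1, score):.0f}   "
          f"hinge loss = {hinge_loss(1, score):.1f}")
# Scores >= 1 incur no hinge penalty; scores in (0, 1) are correct under the
# 0-1 loss but still penalized by the hinge; wrong-signed scores grow linearly.
```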
The loss-minimization view: define the notion of "loss" over the training data as a function of a hypothesis; learning = find the hypothesis that has the lowest loss on the training data. Better: also define a regularization function that penalizes complex hypotheses, and find the hypothesis with the lowest [regularizer + loss on the training data]. Capacity control gives better generalization.
The full SVM objective,

$$\min_{\mathbf{w}}\; \underbrace{\mathbf{w}^T\mathbf{w}}_{\text{regularization term}} \;+\; \underbrace{C \sum_i \max\!\left(0,\, 1 - y_i\,\mathbf{w}^T\mathbf{x}_i\right)}_{\text{empirical loss}}$$

- Regularization term: imposes a preference over the hypothesis space and pushes for better generalization; it can be replaced by other regularization terms which impose other preferences.
- Empirical loss: the hinge loss penalizes mistakes; it can be replaced by other loss functions which impose other preferences.
- $C$: a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss.
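To make the loss-minimization view concrete, here is a minimal batch sub-gradient descent sketch for this objective (the implementation, toy data, and step sizes are my own, not prescribed by the slides):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Batch sub-gradient descent on  w^T w + C * sum_i max(0, 1 - y_i w^T x_i)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                        # examples with nonzero hinge loss
        grad = 2 * w - C * (y[active] @ X[active])  # sub-gradient of the two terms
        w -= lr * grad
    return w

# Hypothetical toy data; the constant third feature plays the role of the bias b
X = np.array([[2, 2, 1], [3, 3, 1], [-1, -1, 1], [-2, -1, 1]], dtype=float)
y = np.array([1.0, 1.0, -1.0, -1.0])

w = train_linear_svm(X, y)
print("w =", w, " predictions:", np.sign(X @ w))
```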