CS 6355: Structured Prediction
From Binary to Multiclass Classification
We have seen binary classification with linear models, and learning algorithms for them: Perceptron, SVM, logistic regression. Prediction is simple: given an example x, predict the sign of wᵀx.
[Figure: multiclass examples. Many different handwritten images all map to the letter A; object recognition among labels such as car, tire, duck, laptop.]
Setting: inputs 𝐱 ∈ ℝⁿ, labels y ∈ {1, 2, ⋯, L}.

A first idea: learn one weight vector wᵢ per label, score each label as wᵢᵀ𝐱, and predict the label with the highest score.

Question: What is the dimensionality of each wᵢ?
1-vs-all: from the full dataset, construct three binary classifiers, one for each class:

  w_blueᵀx > 0 for blue inputs
  w_redᵀx > 0 for red inputs
  w_greenᵀx > 0 for green inputs

Notation: call w_blueᵀx the score for the blue label.

If each binary problem is solved correctly, Winner Take All will predict the right answer: only the correct label will have a positive score.

But black points are not separable from the rest with a single binary classifier. The decomposition will not work for these cases!
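The 1-vs-all recipe can be sketched in code. This is a minimal, illustrative version (the function names and the toy dataset are invented here, not from the lecture): train one binary perceptron per label, then predict with Winner Take All.

```python
import numpy as np

def train_binary(X, y_binary, epochs=50):
    """Perceptron for one label: targets are +1 for that class, -1 otherwise."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, y_binary):
            if y * (w @ x) <= 0:      # mistake: update
                w += y * x
    return w

def train_one_vs_all(X, labels, num_labels):
    # One binary problem per label: this label vs. the rest.
    return np.array([train_binary(X, np.where(labels == i, 1.0, -1.0))
                     for i in range(num_labels)])

def predict(W, x):
    return int(np.argmax(W @ x))      # Winner Take All

# Tiny linearly separable dataset with a bias feature appended.
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
labels = np.array([0, 1, 2, 2])
W = train_one_vs_all(X, labels, num_labels=3)
print([predict(W, x) for x in X])   # recovers [0, 1, 2, 2]
```

Because each binary subproblem here is separable, every classifier ends up correct in sign, and the argmax then picks the one positive score.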
All-vs-all classification, sometimes called one-vs-one: for every pair of labels, train a binary classifier to distinguish between them, C(L, 2) = L(L−1)/2 classifiers in all. Predict by majority vote (or a tournament) over the pairwise classifiers.

Questions remain. E.g.: What if two classes get the same number of votes? For a tournament, what is the sequence in which the labels compete?
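One-vs-one prediction by majority vote can be sketched as follows; the pre-trained pairwise weights below are made up for illustration, where `pairwise[(i, j)]` scores positive for label i and negative for label j.

```python
import numpy as np

def predict_by_vote(pairwise, x, num_labels):
    votes = np.zeros(num_labels, dtype=int)
    for (i, j), w in pairwise.items():
        winner = i if w @ x > 0 else j   # each pairwise classifier votes
        votes[winner] += 1
    # Ties are a real issue (see the questions above); argmax just
    # takes the first label with the maximum vote count.
    return int(np.argmax(votes))

# C(3, 2) = 3 pairwise classifiers for labels {0, 1, 2} in 2-d.
pairwise = {
    (0, 1): np.array([1.0, -1.0]),
    (0, 2): np.array([1.0, 1.0]),
    (1, 2): np.array([0.0, 1.0]),
}
x = np.array([2.0, 1.0])
print(predict_by_vote(pairwise, x, num_labels=3))   # label 0 wins 2 votes
```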
Error-correcting output codes: represent each of the 8 classes by a binary code of length 3, and train one binary classifier per bit of the code.

  label#   Code
  0        0 0 0
  1        0 0 1
  2        0 1 0
  3        0 1 1
  4        1 0 0
  5        1 0 1
  6        1 1 0
  7        1 1 1

8 classes, code-length = 3. Example: for some example, if the three classifiers predict 0, 1 and 1, then the label is 3.
A longer code adds redundancy. 8 classes, code-length = 5:

  #   Code
  0   0 0 0 0 0
  1   0 0 1 1 0
  2   0 1 0 1 1
  3   0 1 1 0 1
  4   1 0 0 1 1
  5   1 0 1 0 0
  6   1 1 0 0 0
  7   1 1 1 1 1

To predict, choose the label whose code is closest (e.g. in Hamming distance) to the five classifiers' predictions. One-vs-all is a special case.
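Nearest-codeword decoding for the length-5 code above can be sketched directly; the `decode` helper is an invented name, but the code table is the one from the slides.

```python
import numpy as np

# 8-class, length-5 code table from the slides (one row per label).
codes = np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
])

def decode(predicted_bits):
    """Label whose code disagrees with the predicted bits in the fewest places."""
    dists = np.sum(codes != np.array(predicted_bits), axis=1)
    return int(np.argmin(dists))

print(decode([0, 1, 1, 0, 1]))   # exact code for label 3
print(decode([0, 1, 1, 0, 0]))   # last bit flipped: still decodes to 3
```

The redundancy is what buys the error tolerance: with a length-3 code every bit pattern is some label's code, so a single wrong bit classifier always yields a wrong label.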
Exercise: Convince yourself that this is correct
Score for a label: for each of the labels Blue, Red, Green, Black, score = w_labelᵀx.

Multiclass margin:
– Binary SVM: minimize the norm of the weights such that the closest points to the hyperplane have a score of ±1.
– Multiclass SVM: minimize the total norm of the weights such that the true label is scored at least 1 more than the second-best one.
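The second definition can be made concrete. A sketch of computing the multiclass margin of one example (the weights and data here are invented for illustration):

```python
import numpy as np

def multiclass_margin(W, x, y):
    """True label's score minus the best other score; negative on a mistake."""
    scores = W @ x                     # one score per label
    others = np.delete(scores, y)      # scores of all other labels
    return float(scores[y] - others.max())

W = np.array([[2.0, 0.0],    # w for label 0
              [0.0, 1.0],    # w for label 1
              [-1.0, 0.0]])  # w for label 2
x = np.array([1.0, 1.0])
print(multiclass_margin(W, x, y=0))   # 2 - 1 = 1.0: "at least 1 more" holds
```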
Recall the hard binary SVM. The hard multiclass SVM:

  min over w₁, ⋯, w_L:   ½ Σₖ wₖᵀwₖ
  such that   score(yᵢ) − score(k) ≥ 1, i.e.  w_{yᵢ}ᵀxᵢ − wₖᵀxᵢ ≥ 1,  for every example (xᵢ, yᵢ) and every label k ≠ yᵢ

– Constraints: the score for the true label is higher than the score for any other label by 1.
– Objective: size of the weights. Effectively, a regularizer.

Problems with this? What if there is no set of weights that achieves this separation? That is, what if the data is not linearly separable?
Add slack variables ξᵢ ≥ 0, one per example:

  min over w₁, ⋯, w_L:   ½ Σₖ wₖᵀwₖ + C Σᵢ ξᵢ
  such that   w_{yᵢ}ᵀxᵢ − wₖᵀxᵢ ≥ 1 − ξᵢ  for every i and every label k ≠ yᵢ

– The score for the true label is higher than the score for any other label by 1 − ξᵢ. Slack variables: not all examples need to satisfy the margin constraint.
– Total slack: don't allow too many examples to violate the margin constraint.
– Size of the weights: effectively, a regularizer.
– The slack variables can be eliminated, giving an equivalent unconstrained problem.
Solving

  min over w₁, ⋯, w_L:   ½ Σₖ wₖᵀwₖ + C Σᵢ ξᵢ
  s.t.  w_{yᵢ}ᵀxᵢ − wₖᵀxᵢ ≥ 1 − ξᵢ,  ξᵢ ≥ 0

is equivalent to solving

  min over w₁, ⋯, w_L:   ½ Σₖ wₖᵀwₖ + C Σᵢ max(0, max_{k≠yᵢ} wₖᵀxᵢ − w_{yᵢ}ᵀxᵢ + 1)

Why?

– First term: size of the weights. Effectively, a regularizer.
– Second term: the multiclass hinge loss.
– C: the tradeoff hyperparameter.

Prediction: argmaxᵢ wᵢᵀx
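The multiclass hinge loss and the argmax prediction rule above can be written in a few lines of NumPy; this is a sketch with invented weights, where W holds one row of weights per label.

```python
import numpy as np

def multiclass_hinge(W, x, y):
    """max(0, max_{k != y} w_k^T x - w_y^T x + 1)."""
    scores = W @ x
    margins = scores - scores[y] + 1.0
    margins[y] = 0.0                  # exclude the true label from the max
    return max(0.0, margins.max())

def predict(W, x):
    return int(np.argmax(W @ x))      # argmax_i w_i^T x

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
x = np.array([3.0, 1.0])

print(predict(W, x))                  # 0: label 0 scores highest
print(multiclass_hinge(W, x, y=0))    # 0.0: true label wins by at least 1
print(multiclass_hinge(W, x, y=1))    # 3.0: true label 1 loses by 2
```

Note the loss is zero exactly when the margin constraints hold with no slack, which is the "Why?" in miniature.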
Questions?

Comparing the two views of training: one-vs-all asks each wᵢᵀx to be positive for examples of class i, while the multiclass objective trains the weights jointly, asking the true label's score wᵢᵀx to be more than all others.
The Kesler construction: define φ(x, i) ∈ ℝⁿᴸ as the vector with x in the iᵗʰ block, zeros everywhere else. Stacking w₁, ⋯, w_L into one vector w ∈ ℝⁿᴸ gives wᵀφ(x, i) = wᵢᵀx.

Or equivalently: for every example (x, i) in the dataset, and all other labels j, construct binary examples
– Positive examples: φ(x, i) − φ(x, j)
– Negative examples: φ(x, j) − φ(x, i)
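The block structure is easy to check numerically. A sketch (the helper name `phi` and the random weights are invented here) verifying that the stacked inner product recovers the per-label score, so the binary constraint on a difference vector is exactly a multiclass constraint:

```python
import numpy as np

def phi(x, i, num_labels):
    """x in the i-th block of an n*L vector, zeros everywhere else."""
    n = len(x)
    out = np.zeros(n * num_labels)
    out[i * n:(i + 1) * n] = x
    return out

L, n = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(L, n))          # rows w_1 .. w_L
w = W.reshape(-1)                    # stacked weight vector in R^{nL}
x = rng.normal(size=n)

# Block inner product recovers the per-label score: w . phi(x, i) == w_i . x
assert np.allclose(w @ phi(x, 1, L), W[1] @ x)

# The binary constraint w . (phi(x, i) - phi(x, j)) > 0 is exactly
# the multiclass constraint w_i . x > w_j . x:
i, j = 0, 2
diff = phi(x, i, L) - phi(x, j, L)
print(np.isclose(w @ diff, W[i] @ x - W[j] @ x))   # True
```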
Exercise: What does the perceptron update rule look like in terms of the φ's? Interpret the update step.
Note: The binary classification task only expresses preferences over label assignments. This approach extends to training a ranker, and can use partial preferences too; more on this later…
Summary: the score for a label (Blue, Red, Green, Black) is w_labelᵀx, and the multiclass margin can also be written in terms of the Kesler construction, where y is the label that has the highest score. We saw two routes to the same idea:
– Multiclass SVM, via the definition of the learning objective
– Constraint classification, by constructing a binary classification problem
Questions?