Beyond binary classification
Subhransu Maji
CMPSCI 689: Machine Learning
19 February 2015

Administrivia
Mini-project 1 posted
➡ One of three
➡ Decision trees and perceptrons
➡ Theory and programming
➡ Due Wednesday, March 04, 11:55pm
➡ Turn in a hard copy in the CS office
Learning with imbalanced data
One class might be rare (e.g., face detection), and mistakes on the rare class cost more. Why? We want a better F-score (or average precision), not just raw accuracy.

Binary classification minimizes $\mathbb{E}_{(x,y)\sim D}\big[f(x) \neq y\big]$; the $\alpha$-weighted version minimizes $\mathbb{E}_{(x,y)\sim D}\big[\alpha^{[y=+1]}\,[f(x) \neq y]\big]$, so mistakes on the positive (rare) class cost $\alpha$ times more.

Suppose we have an algorithm to train a binary classifier. Can we use it to train the $\alpha$-weighted version?
Sub-sampling algorithm: simulate $D^\alpha$ given access to $D$.
Input: $D$, $\alpha$. Output: a sample $(x, y) \sim D^\alpha$.
➡ Sample $(x, y) \sim D$
➡ Sample $t \sim \text{uniform}(0, 1)$
➡ If $y > 0$ or $t < 1/\alpha$: return $(x, y)$; otherwise repeat
We have sub-sampled the negatives: each negative is kept with probability $1/\alpha$ (see the sketch below).
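A minimal Python sketch of this sampler; `sample_from_D`, a function that draws one labeled example (x, y) from D, is an assumed input rather than something from the slides:

```python
import random

def sample_from_D_alpha(sample_from_D, alpha):
    """Draw one sample from D^alpha via rejection sampling:
    positives are always kept; negatives are kept with
    probability 1/alpha."""
    while True:
        x, y = sample_from_D()        # (x, y) ~ D
        t = random.uniform(0.0, 1.0)  # t ~ uniform(0, 1)
        if y > 0 or t < 1.0 / alpha:
            return x, y
```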
Claim: if a classifier has binary classification error $\epsilon$ on $D^\alpha$, then its $\alpha$-weighted error on $D$ is $\alpha\epsilon$.

$$
\begin{aligned}
\text{Error on } D &= \mathbb{E}_{(x,y)\sim D}\big[\ell^{\alpha}(\hat{y}, y)\big] \\
&= \sum_x \Big( D(x,+1)\,\alpha\,[\hat{y} \neq +1] + D(x,-1)\,[\hat{y} \neq -1] \Big) \\
&= \alpha \sum_x \Big( D(x,+1)\,[\hat{y} \neq +1] + \tfrac{1}{\alpha}\, D(x,-1)\,[\hat{y} \neq -1] \Big) \\
&= \alpha \sum_x \Big( D^{\alpha}(x,+1)\,[\hat{y} \neq +1] + D^{\alpha}(x,-1)\,[\hat{y} \neq -1] \Big) \\
&= \alpha\,\epsilon
\end{aligned}
$$
To train, simply run the binary learner on data drawn from $D^\alpha$ (i.e., sub-sample the negatives); up to scaling, this is the same as replicating each positive example $\alpha$ times.
For some learners we don’t need to keep copies of the positives:
➡ Modify accuracy to the weighted version during training
➡ Take weighted votes during prediction
A small illustration appears below.
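As a small illustration (this helper is hypothetical, not from the slides), the weighted variants amount to giving each positive example a weight of α instead of duplicating it:

```python
def alpha_weighted_error(alpha, y_true, y_pred):
    """Empirical alpha-weighted error: a mistake on a positive
    (y = +1) costs alpha; a mistake on a negative costs 1."""
    cost = sum((alpha if y == 1 else 1.0)
               for y, p in zip(y_true, y_pred) if y != p)
    return cost / len(y_true)
```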
Multi-class classification
Labels are one of K different ones: $y \in \{1, \ldots, K\}$. Some classifiers are inherently multi-class:
➡ k-nearest neighbors: predict the label with the highest vote among the neighbors (break ties arbitrarily)
➡ Decision trees: assign features to splits using the multi-class criterion; at the leaves predict the most frequent label
Question: can we take a binary classifier and turn it into a multi-class one?
One-vs-all: train K classifiers, each to distinguish one class from the rest.
Prediction: pick the class with the highest score,
$i \leftarrow \arg\max_i f_i(x)$, e.g., $i \leftarrow \arg\max_i \mathbf{w}_i^{\top} x$ for linear score functions.
➡ May have to calibrate the weights (e.g., fix the norm to 1) since we are comparing the scores of separately trained classifiers
➡ In practice, doing this right is tricky when there are a large number of classes
A sketch follows.
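A minimal one-vs-all sketch in Python; `train_binary(X, y)` is an assumed black-box binary learner that returns a real-valued scoring function:

```python
import numpy as np

def train_one_vs_all(train_binary, X, y, K):
    """Train K classifiers; the i-th treats class i as positive
    and every other class as negative."""
    return [train_binary(X, np.where(y == i, +1, -1)) for i in range(K)]

def predict_one_vs_all(classifiers, x):
    """Pick the class whose classifier scores x the highest
    (assumes the K scores are comparable, i.e., calibrated)."""
    return int(np.argmax([f(x) for f in classifiers]))
```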
One-vs-one: train K(K−1)/2 classifiers, each to distinguish one class from another. Each classifier votes for the winning class in its pair, and the class with the most votes wins:
$i \leftarrow \arg\max_i \sum_j \operatorname{sign}\!\left(\mathbf{w}_{ij}^{\top} x\right)$, or more generally $i \leftarrow \arg\max_i \sum_j f_{ij}(x)$, where $f_{ji} = -f_{ij}$ and $\mathbf{w}_{ji} = -\mathbf{w}_{ij}$.
A sketch follows.
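A minimal one-vs-one sketch under the same assumption of a black-box `train_binary(X, y)` learner:

```python
import numpy as np

def train_one_vs_one(train_binary, X, y, K):
    """Train K(K-1)/2 classifiers; f_ij separates class i (+1)
    from class j (-1) for every pair i < j."""
    classifiers = {}
    for i in range(K):
        for j in range(i + 1, K):
            mask = (y == i) | (y == j)
            classifiers[(i, j)] = train_binary(
                X[mask], np.where(y[mask] == i, +1, -1))
    return classifiers

def predict_one_vs_one(classifiers, x, K):
    """Each pairwise classifier votes for the winner of its pair;
    the class with the most votes wins (ties broken arbitrarily)."""
    votes = np.zeros(K)
    for (i, j), f in classifiers.items():
        votes[i if f(x) > 0 else j] += 1
    return int(np.argmax(votes))
```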
DAGSVM [Platt et al., NIPS 2000]: organize the K(K−1)/2 pairwise classifiers into a rooted directed acyclic graph; each node eliminates one candidate class, so prediction requires only K−1 classifier evaluations.
[Figure from Platt et al.]
Ranking
Input: a query (e.g., “cats”). Output: a sorted list of items.
The loss function is trickier than in the binary classification case.
For simplicity, let’s assume we are learning to rank the results for a single given query. Two ways to set up learning to rank:
➡ Pointwise: assume each document has a numerical score, and learn a model to predict that score (e.g., linear regression).
➡ Pairwise: approximate ranking by a classification problem, and learn a binary classifier that, given a pair, tells which item is better.
Naive rank train. Let $x_{ij}$ denote features for comparing item i and item j.
Training: create a dataset with binary labels, starting from $D \leftarrow \emptyset$:
➡ If item i is more relevant than item j: $D \leftarrow D \cup \{(x_{ij}, +1)\}$
➡ If item i is less relevant than item j: $D \leftarrow D \cup \{(x_{ij}, -1)\}$
Learn a binary classifier $f$ on $D$.
Ranking: initialize $\text{score} \leftarrow [0, 0, \ldots, 0]$, then for every pair (i, j):
➡ Calculate the prediction: $\hat{y} \leftarrow f(x_{ij})$
➡ Update the scores: $\text{score}_i \leftarrow \text{score}_i + \hat{y}$, $\text{score}_j \leftarrow \text{score}_j - \hat{y}$
Finally, $\text{ranking} \leftarrow \operatorname{argsort}(\text{score})$, from highest to lowest score (sketch below).
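A sketch of the ranking step, assuming a trained classifier `f` that returns ±1 and a hypothetical `compare_features(a, b)` that builds x_ij:

```python
import numpy as np

def rank_items(f, items, compare_features):
    """Score every pair with the binary classifier, then sort.
    f(x_ij) should return +1 if item i beats item j, else -1."""
    M = len(items)
    score = np.zeros(M)
    for i in range(M):
        for j in range(i + 1, M):
            y = f(compare_features(items[i], items[j]))
            score[i] += y
            score[j] -= y
    return np.argsort(-score)  # indices, best-scoring item first
```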
Naive rank train works well for bipartite ranking problems, where items are simply relevant or irrelevant and there is no notion of one item being more relevant than another.
A better strategy accounts for the positions of the items in the list:
➡ Denote a ranking by a permutation $\sigma$, where $\sigma_u < \sigma_v$ means item u is ranked above item v.
➡ Let the space of all permutations of M objects be $\Sigma_M$.
➡ A ranking function maps M items to a permutation: $f : X \rightarrow \Sigma_M$.
➡ A cost function $\omega(i, j)$ gives the penalty for getting a pair in positions i and j out of order.
➡ Ranking loss: $\ell(\sigma, \hat{\sigma}) = \sum_{u \neq v} [\sigma_u < \sigma_v]\,[\hat{\sigma}_v < \hat{\sigma}_u]\;\omega(u, v)$
To be a valid loss function, ω must be:
➡ symmetric: $\omega(i, j) = \omega(j, i)$
➡ monotonic: $\omega(i, j) \leq \omega(i, k)$ if $i < j < k$ or $k < j < i$
➡ satisfy the triangle inequality: $\omega(i, j) + \omega(j, k) \geq \omega(i, k)$
Examples:
➡ All mis-orderings cost equally: $\omega(i, j) = 1$ for $i \neq j$
➡ Only the top K positions matter: $\omega(i, j) = 1$ if $\min(i, j) \leq K$ and $i \neq j$, and $0$ otherwise
ω-rank train. As before, let $x_{ij}$ denote features for comparing item i and item j.
Training: create a dataset with binary labels, starting from $D \leftarrow \emptyset$; each instance now carries a weight:
➡ If $\sigma_i < \sigma_j$ (item i is more relevant): $D \leftarrow D \cup \{(x_{ij}, +1, \omega(i, j))\}$
➡ If $\sigma_i > \sigma_j$ (item j is more relevant): $D \leftarrow D \cup \{(x_{ij}, -1, \omega(i, j))\}$
Learn a binary classifier on $D$, where each instance has a weight.
Ranking works exactly as before: $\text{score} \leftarrow [0, 0, \ldots, 0]$; for every pair, calculate the prediction $\hat{y} \leftarrow f(x_{ij})$ and update the scores $\text{score}_i \leftarrow \text{score}_i + \hat{y}$, $\text{score}_j \leftarrow \text{score}_j - \hat{y}$; finally $\text{ranking} \leftarrow \operatorname{argsort}(\text{score})$. A sketch follows.
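A sketch of the weighted dataset construction; `features(i, j)` and the weighted binary learner are assumed, and applying ω to the true positions σ_i, σ_j is my reading of the slide’s ω(i, j):

```python
def make_topk_omega(K):
    """Example cost: only mis-orderings that touch the top K
    positions matter."""
    return lambda i, j: 1.0 if (i != j and min(i, j) <= K) else 0.0

def build_weighted_pairs(sigma, features, omega):
    """sigma[i] = true position of item i (1 = most relevant).
    Returns (x_ij, label, weight) triples for a weighted
    binary learner."""
    D = []
    M = len(sigma)
    for i in range(M):
        for j in range(M):
            if i != j and sigma[i] < sigma[j]:
                w = omega(sigma[i], sigma[j])
                if w > 0:
                    D.append((features(i, j), +1, w))  # i beats j
                    D.append((features(j, i), -1, w))  # j loses to i
    return D
```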
Collective classification
Predicting multiple correlated variables:
➡ Input: a graph whose vertices carry feature–label pairs $(x, k) \in X \times [K]$ (e.g., labels for the regions of an image).
➡ Let $G(X)$ be the set of all graphs with vertex features in X; a predictor maps an input graph to a labeled graph, $f : G(X) \rightarrow G([K])$.
➡ Loss: the expected number of mislabeled vertices, $\mathbb{E}_{(V,E)\sim D}\big[\sum_{v \in V} [\hat{y}_v \neq y_v]\big]$.
Independent per-vertex predictions $\hat{y}_v \leftarrow f(x_v)$ can be noisy.
Idea: use the labels of nearby vertices as extra features, $x_v \leftarrow [x_v,\ \phi([K], \text{nbhd}(v))]$, e.g., a histogram of the labels in a 5×5 neighborhood.
Stacking: train two classifiers. The first is trained to predict the output from the input; the second is trained on the input plus the output of the first classifier:
$\hat{y}_v^{(1)} \leftarrow f_1(x_v)$
$\hat{y}_v^{(2)} \leftarrow f_2\!\left(x_v,\ \phi\!\left(\hat{y}^{(1)}, \text{nbhd}(v)\right)\right)$
A sketch appears below.
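A sketch of the two-level scheme, assuming trained vertex classifiers `f1` and `f2`, a neighborhood list `nbhds[v]`, and a label-histogram φ:

```python
import numpy as np

def label_histogram(labels, nbhd, K):
    """phi: normalized histogram of predicted labels over a
    vertex's neighborhood."""
    hist = np.zeros(K)
    for u in nbhd:
        hist[labels[u]] += 1
    return hist / max(len(nbhd), 1)

def stacked_predict(f1, f2, X, nbhds, K):
    """First level predicts from raw features; the second level
    sees the raw features plus a histogram of the first level's
    predictions over each vertex's neighborhood."""
    y1 = [f1(x) for x in X]  # first-level predictions per vertex
    return [f2(np.concatenate([X[v], label_histogram(y1, nbhds[v], K)]))
            for v in range(len(X))]
```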
More generally, train a stack of N classifiers, where the ith classifier is trained on the input plus the outputs of the previous i−1 classifiers:
f₁ → f₁ + f₂ → f₁ + f₂ + f₃ → …
Problem: later classifiers see the earlier classifiers’ predictions on their own training data, not on test data, leading to a cascade of overconfident classifiers.
Solution: train each level on held-out data.
Summary
Learning with imbalanced data
➡ Sub-sampling reduces it to binary classification for the weighted loss case
Beyond binary classification
➡ Multi-class: some classifiers are inherently multi-class; others can be combined using one-vs-one or one-vs-all methods
➡ Ranking: loss functions capture the distance between permutations; pointwise and pairwise methods
➡ Collective classification: stacking classifiers trained with held-out data
Some slides are adapted from the CIML book by Hal Daumé III. Images for collective classification are from the PASCAL VOC dataset. Some of the discussion is based on Wikipedia.