SLIDE 1
Multiclass Multilabel Classification with More Classes than Examples
Ohad Shamir, Weizmann Institute of Science
Joint work with Ofer Dekel, MSR
NIPS 2015 Extreme Classification Workshop
SLIDE 2
Extreme Multiclass Multilabel Problems
SLIDE 3
SLIDE 4
SLIDE 5
Categories: 1452 births / 1519 deaths / 15th century in science / Ambassadors of the Republic of Florence / Ballistic experts / Fabulists / Giftedness / Mathematics and culture / Italian inventors / Members of the Guild of Saint Luke / Tuscan painters / People persecuted under anti-homosexuality laws...
SLIDE 6
Problem Definition
- Multiclass multilabel classification
- m training examples, k categories
- m, k → ∞ together
  - Possibly even k > m
- Goal: categorize unseen instances
SLIDE 7
Extreme Multiclass
- Supervised learning starts with binary classification (k = 2) and extends to multiclass learning
  - Theory: VC dimension → Natarajan dimension
  - Algorithms: binary → multiclass
- Usually, one assumes k = 𝒪(1)
- Some exceptions:
  - Hierarchies with prior knowledge on label relationships → not always available
  - Additional assumptions (e.g. the talk by Marius earlier)
SLIDE 8
Application
- Classify the web based on Wikipedia categories
- Training set: all Wikipedia pages (m = 4.2 × 10⁶)
- Labels: all Wikipedia categories (k = 1.1 × 10⁶)
SLIDE 9
Challenges
- Statistical problem: can't get a large (or even moderate) sample from each class
- Computational problem: many classification algorithms will choke on millions of labels
SLIDE 10
Propagating Labels on the Click-Graph
- A bipartite graph derived from search engine logs: clicks encoded as weighted edges between queries and web pages
- Wikipedia pages are labeled web pages
- Labels propagate along edges to other pages (see the sketch below)
[Figure: bipartite click-graph between queries and web pages]
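A minimal sketch of this propagation, assuming the click-graph is given as a (query, page, weight) edge list; all function and variable names here are illustrative, not from the talk:

```python
from collections import defaultdict

def propagate_labels(edges, page_labels, rounds=2):
    """Propagate label scores over a bipartite query-page click graph.

    edges: list of (query, page, weight) triples from click logs.
    page_labels: dict mapping each labeled (Wikipedia) page to its categories.
    rounds: one round pushes scores pages -> queries -> pages.
    """
    by_query, by_page = defaultdict(list), defaultdict(list)
    for q, p, w in edges:
        by_query[q].append((p, w))
        by_page[p].append((q, w))

    # Start from the labeled Wikipedia pages with unit scores.
    scores = {p: {c: 1.0 for c in cats} for p, cats in page_labels.items()}
    for _ in range(rounds):
        # Pages push their label scores to the queries that click into them...
        q_scores = defaultdict(lambda: defaultdict(float))
        for p, cats in scores.items():
            for q, w in by_page[p]:
                for c, s in cats.items():
                    q_scores[q][c] += w * s
        # ...and the queries push the accumulated scores back to their pages.
        new_scores = defaultdict(lambda: defaultdict(float))
        for q, cats in q_scores.items():
            for p, w in by_query[q]:
                for c, s in cats.items():
                    new_scores[p][c] += w * s
        scores = {p: dict(cats) for p, cats in new_scores.items()}
    return scores
```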
SLIDE 11
Example
- http://en.wikipedia.org/wiki/Leonardo_da_Vinci passes multiple labels to http://www.greatItalians.com
- Among them:
  - "Renaissance artists" → good
  - "1452 births" → bad
- Observation: "1452 births" induces many false positives (FP); best to remove it altogether from the classifier's output (FP → TN, TP → FN)
SLIDE 12
Simple Label Pruning Approach
1. Split the dataset into a training set and a validation set
2. Use the training set to build an initial classifier h_init (e.g. by propagating labels over the click-graph)
3. Apply h_init to the validation set; count FP_j and TP_j for each label j
4. For every j ∈ {1, …, k}, remove label j if FP_j / TP_j > (1 − δ)/δ
This defines a new "pruned" classifier h_pruned (a code sketch follows below).
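A minimal sketch of steps 3 and 4, assuming h_init's validation-set predictions and the true label sets are available as parallel lists of Python sets (the data layout is an illustrative assumption):

```python
from collections import Counter

def prune_labels(predicted_sets, true_sets, delta=0.5):
    """Return the set of labels that the pruning rule removes.

    predicted_sets / true_sets: parallel lists of label sets, one pair
        per validation example (h_init's predictions vs. true labels).
    delta: weight of a false positive in the delta-weighted loss.
    """
    fp, tp = Counter(), Counter()
    for pred, true in zip(predicted_sets, true_sets):
        for j in pred:
            if j in true:
                tp[j] += 1  # true positive: predicted and present
            else:
                fp[j] += 1  # false positive: predicted but absent
    # Remove label j when FP_j / TP_j > (1 - delta) / delta, written in
    # cross-multiplied form to avoid dividing by a zero TP_j count.
    return {j for j in set(fp) | set(tp)
            if delta * fp[j] > (1 - delta) * tp[j]}
```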
SLIDE 13
Simple Label Pruning Approach
Explicitly minimizes the empirical risk with respect to the δ-weighted loss:
ℓ(h(x), z) = Σ_{j=1}^k [ δ · 1{h_j(x) = 1, z_j = 0} + (1 − δ) · 1{h_j(x) = 0, z_j = 1} ]
where the first indicator counts false positives (FP) and the second counts false negatives (FN).
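For concreteness, the same loss written as code (a sketch in which label sets stand in for the 0/1 vectors h(x) and z):

```python
def delta_weighted_loss(predicted, true, delta=0.5):
    """delta-weighted loss of one example: delta per false positive
    (label predicted but absent) plus (1 - delta) per false negative
    (label present but not predicted)."""
    fp = len(predicted - true)  # sum_j 1{h_j(x) = 1, z_j = 0}
    fn = len(true - predicted)  # sum_j 1{h_j(x) = 0, z_j = 1}
    return delta * fp + (1 - delta) * fn
```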
SLIDE 14
Main Question
Would this actually reduce the risk? That is, is
𝔼_{x,z} ℓ(h_pruned(x), z) < 𝔼_{x,z} ℓ(h_init(x), z),
i.e., is the difference 𝔼_{x,z} ℓ(h_init(x), z) − 𝔼_{x,z} ℓ(h_pruned(x), z) positive?
SLIDE 15
Baseline Approach
- Prove that, uniformly for all labels j, the empirical ratio F̂P_j / T̂P_j converges to the true ratio FP_j / TP_j
  (where TP_j = Pr(label j present and predicted) and FP_j = Pr(label j predicted but not present))
- Problem: m, k → ∞ together, so many classes have only a handful of examples
SLIDE 16
Uniform Convergence Approach
- The algorithm implicitly chooses a hypothesis from a certain hypothesis class
  - Pruning rules on top of the fixed predictor h_init
- Prove uniform convergence by bounding the VC dimension / Rademacher complexity
- Conclude that if the empirical risk decreases, the risk decreases as well
SLIDE 17
Uniform Convergence Fails
- Unfortunately, no uniform convergence...
- ...and even no algorithm/data-dependent convergence:
𝔼[R(h_pruned) − R̂(h_pruned)] ≥ Σ_{j=1}^k Pr(j pruned) · (TP_j − FP_j) = Σ_{j=1}^k Pr(F̂P_j > T̂P_j) · (TP_j − FP_j)
(R̂ denotes the empirical risk; F̂P_j, T̂P_j are the validation-set counts, so for δ = 1/2 label j is pruned exactly when F̂P_j > T̂P_j)
- Weak correlation between the pruning events and the (TP_j − FP_j) terms in the k ≈ m regime
SLIDE 18
A Less Obvious Approach
- Prove directly that the risk decreases
- Important (but mild) assumption: each example is labeled by at most t labels
- Step 1: The risk of h_pruned is concentrated: for every ε, Pr(|R(h_pruned) − 𝔼 R(h_pruned)| ≥ ε) is small
SLIDE 19
A Less Obvious Approach
- Step 2: It is enough to prove R(h_init) − 𝔼 R(h_pruned) > 0
- Taking δ = 1/2 for simplicity, it can be shown that
R(h_init) − 𝔼 R(h_pruned) > pos − 𝒪(√(‖p‖_{1/2} / m))
where p_j = FP_j + TP_j and ‖p‖_{1/2} = (Σ_{j=1}^k √p_j)²
SLIDE 20
A Less Obvious Approach
- Step 2: It is enough to prove R(h_init) − 𝔼 R(h_pruned) > 0
- Taking δ = 1/2 for simplicity, it can be shown that
R(h_init) − 𝔼 R(h_pruned) > pos − 𝒪(√(‖p‖_{1/2} / m))
where pos = Σ_{j: FP_j ≥ TP_j} (FP_j − TP_j), p_j = FP_j + TP_j, and ‖p‖_{1/2} = (Σ_{j=1}^k √p_j)²
- For a probability vector p, ‖p‖_{1/2} is always at most k (a one-line derivation follows below)
- It is smaller the more non-uniform the distribution is
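The "at most k" claim is a direct consequence of Cauchy-Schwarz; for any probability vector p:
‖p‖_{1/2} = (Σ_{j=1}^k 1 · √p_j)² ≤ (Σ_{j=1}^k 1²) · (Σ_{j=1}^k p_j) = k · 1 = k
with equality exactly when p is uniform, which is why heavily non-uniform (e.g. power-law) distributions give a much smaller value.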
SLIDE 21
Wikipedia Power-Law: α = 1.6
SLIDE 22
Wikipedia Power-Law: α = 1.6
R(h_init) − 𝔼 R(h_pruned) > pos − 𝒪(√(k^{0.4} / m))
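The exponent 0.4 comes from a short calculation. Assuming category frequencies follow p_j ∝ j^{−α} with α = 1.6 (so the normalizer Σ_j j^{−1.6} is Θ(1)):
‖p‖_{1/2} = (Σ_{j=1}^k √p_j)² = Θ((Σ_{j=1}^k j^{−0.8})²) = Θ((k^{0.2})²) = Θ(k^{0.4})
so the error term 𝒪(√(‖p‖_{1/2} / m)) becomes 𝒪(√(k^{0.4} / m)), which vanishes as long as m grows faster than k^{0.4}: a far weaker requirement than m ≫ k.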
SLIDE 23
Experiment
Click graph on the entire web (based on search engine logs)
SLIDE 24
Experiment
Categories from Wikipedia pages propagated twice through graph
SLIDE 25
Experiment
Train/test split of Wikipedia pages. How well do the categories propagated from the training set predict the categories of test-set pages?
SLIDE 26
Experiment
SLIDE 27
Another Less Obvious Approach
R(h_init) − 𝔼 R(h_pruned) = Σ_{j=1}^k Pr(j pruned) · (FP_j − TP_j) = Σ_{j=1}^k Pr(F̂P_j > T̂P_j) · (FP_j − TP_j)
- The correlation between pruning events and the (FP_j − TP_j) terms is weak but positive, even with only a few examples per label
- For large k, the sum will therefore tend to be positive
SLIDE 28
Different Application: Crowdsourcing
(Dekel and S., 2009)
SLIDE 29
Different Application: Crowdsourcing
(Dekel and S., 2009)
SLIDE 30
Different Application: Crowdsourcing
(Dekel and S., 2009)
SLIDE 31
Different Application: Crowdsourcing
(Dekel and S., 2009)
SLIDE 32
Different Application: Crowdsourcing
- How can we improve crowdsourced data?
- Standard approach: repeated labeling, but that is expensive
- A bootstrap approach (see the sketch below):
  - Learn a predictor from the data of all workers
  - Throw away examples labeled by workers who disagree a lot with the predictor
  - Re-train on the remaining examples
- It works! (Under certain assumptions)
- Challenge: workers often label only a handful of examples
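A minimal sketch of this bootstrap loop; the data layout, train_fn interface, and disagreement threshold are illustrative assumptions, not details from the talk:

```python
from collections import defaultdict

def bootstrap_filter(examples, train_fn, max_disagreement=0.3):
    """Re-train after discarding workers who disagree a lot with a
    predictor learned from the data of all workers.

    examples: list of (features, label, worker_id) triples.
    train_fn: maps a list of (features, label) pairs to a predictor f,
        where f(features) returns a label.
    """
    # Step 1: learn an initial predictor from everyone's labels.
    predictor = train_fn([(x, y) for x, y, _ in examples])

    # Step 2: measure each worker's disagreement rate with the predictor.
    stats = defaultdict(lambda: [0, 0])  # worker -> [disagreements, total]
    for x, y, w in examples:
        stats[w][0] += int(predictor(x) != y)
        stats[w][1] += 1

    # Step 3: keep workers below the disagreement threshold and re-train
    # on the remaining examples only.
    kept = {w for w, (d, n) in stats.items() if d / n <= max_disagreement}
    return train_fn([(x, y) for x, y, w in examples if w in kept])
```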
SLIDE 33
Different Application: Crowdsourcing
# examples/worker might be small, but many workers...
SLIDE 34
Different Application: Crowdsourcing
# examples/worker might be small, but many workers...
SLIDE 35
Different Application: Crowdsourcing
# examples/worker might be small, but many workers...
SLIDE 36
Conclusions
- # classes → ∞ violates the assumptions of most multiclass analyses
  - These are often based on generalizations of binary classification
- Possible approach:
  - Avoid the standard analysis
  - "Extreme X" can be a blessing rather than a curse
- Other applications? More complex learning algorithms (e.g. substitution)?
SLIDE 37