Multiclass Multilabel Classification with More Classes than Examples



SLIDE 1

Multiclass Multilabel Classification with More Classes than Examples

Ohad Shamir

Weizmann Institute of Science

Joint work with Ofer Dekel, MSR

NIPS 2015 Extreme Classification Workshop

SLIDE 2

Extreme Multiclass Multilabel Problems

Label set is a folksonomy (a.k.a. collaborative tagging or social tagging)

SLIDE 3–5

[Screenshots: a Wikipedia article (Leonardo da Vinci) and its category list]

Categories: 1452 births / 1519 deaths / 15th century in science / Ambassadors of the Republic of Florence / Ballistic experts / Fabulists / Giftedness / Mathematics and culture / Italian inventors / Members of the Guild of Saint Luke / Tuscan painters / People persecuted under anti-homosexuality laws...

SLIDE 6

Problem Definition

  • Multiclass multilabel classification
  • $m$ training examples, $k$ categories
  • $m, k \to \infty$ together
    – Possibly even $k > m$
  • Goal: categorize unseen instances
SLIDE 7

Extreme Multiclass

  • Supervised learning starts with binary classification ($k = 2$) and extends to multiclass learning
    – Theory: VC dimension → Natarajan dimension
    – Algorithms: binary → multiclass
  • Usually, assume $k = \mathcal{O}(1)$
  • Some exceptions
    – Hierarchy with prior knowledge on relationships – not always available
    – Additional assumptions (e.g. the talk by Marius earlier)

SLIDE 8

Application

  • Classify the web based on Wikipedia categories
  • Training set: all Wikipedia pages ($m = 4.2 \times 10^6$)
  • Labels: all Wikipedia categories ($k = 1.1 \times 10^6$)
SLIDE 9

Challenges

  • Statistical problem: can't get a large (or even moderate) sample from each class
  • Computational problem: many classification algorithms will choke on millions of labels

SLIDE 10

Propagating Labels on the Click-Graph

  • A bipartite graph derived from search-engine logs: clicks encoded as weighted edges
  • Wikipedia pages are labeled web pages
  • Labels propagate along edges to other pages (sketched below)

[Figure: bipartite click graph connecting queries to web pages]
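To make the propagation concrete, here is a minimal sketch of one page → query → page round of label passing. The data layout (`edges` as weighted (query, page) pairs, `labels` as per-page category scores) is assumed for illustration and is not the talk's actual pipeline.

```python
# A minimal sketch of one round of label propagation on the bipartite
# click graph; the data layout here is an illustrative assumption.
from collections import defaultdict

def propagate(edges, labels):
    """edges: {(query, page): click weight}; labels: {page: {category: score}}.
    Push each page's category scores to its queries, then back to pages."""
    query_scores = defaultdict(lambda: defaultdict(float))
    for (q, p), w in edges.items():
        for cat, s in labels.get(p, {}).items():
            query_scores[q][cat] += w * s
    page_scores = defaultdict(lambda: defaultdict(float))
    for (q, p), w in edges.items():
        for cat, s in query_scores[q].items():
            page_scores[p][cat] += w * s
    return {p: dict(c) for p, c in page_scores.items()}

# A Wikipedia page passes its categories to an unlabeled page via a shared query:
edges = {("q", "en.wikipedia.org/wiki/Leonardo_da_Vinci"): 1.0,
         ("q", "www.greatItalians.com"): 0.5}
labels = {"en.wikipedia.org/wiki/Leonardo_da_Vinci":
          {"Renaissance artists": 1.0, "1452 births": 1.0}}
print(propagate(edges, labels)["www.greatItalians.com"])
# {'Renaissance artists': 0.5, '1452 births': 0.5}
```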

SLIDE 11

Example

  • http://en.wikipedia.org/wiki/Leonardo_da_Vinci passes multiple labels to http://www.greatItalians.com
  • Among them
    – "Renaissance artists" – good
    – "1452 births" – bad
  • Observation: "1452 births" induces many false positives (FP): best to remove it altogether from the classifier's output
    – (FP ⇒ TN, TP ⇒ FN)

SLIDE 12

Simple Label Pruning Approach

  1. Split the dataset into a training set and a validation set
  2. Use the training set to build an initial classifier $h_{pre}$ (e.g. by propagating labels over the click-graph)
  3. Apply $h_{pre}$ to the validation set; count false positives $\widehat{FP}_j$ and true positives $\widehat{TP}_j$ for each label $j$
  4. For every $j \in \{1, \dots, k\}$, remove label $j$ if

$$\frac{\widehat{FP}_j}{\widehat{TP}_j} > \frac{1-\gamma}{\gamma}$$

  • This defines a new "pruned" classifier $h_{post}$ (see the sketch below)
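Given per-label validation counts, the pruning step is a few lines. A minimal sketch, assuming hypothetical count dictionaries rather than the authors' code:

```python
# Sketch of step 4: prune label j when FP_j / TP_j > (1 - gamma) / gamma,
# i.e. when suppressing the label lowers the gamma-weighted empirical risk.
def prune_labels(fp, tp, gamma=0.5):
    """fp, tp: dicts mapping label j -> validation FP / TP counts."""
    threshold = (1 - gamma) / gamma
    return {j for j in fp
            if tp.get(j, 0) == 0 or fp[j] / tp[j] > threshold}

# h_post = h_pre with the returned labels suppressed:
fp = {"1452 births": 90, "Renaissance artists": 5}
tp = {"1452 births": 10, "Renaissance artists": 40}
print(prune_labels(fp, tp))  # {'1452 births'}
```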

SLIDE 13

Simple Label Pruning Approach

This explicitly minimizes the empirical risk with respect to the $\gamma$-weighted loss:

$$\ell\big(h(x), y\big) \;=\; \sum_{j=1}^{k} \Big[\, \underbrace{\gamma\;\mathbb{1}\{h_j(x) = 1,\ y_j = 0\}}_{\text{FP (false positive)}} \;+\; \underbrace{(1-\gamma)\;\mathbb{1}\{h_j(x) = 0,\ y_j = 1\}}_{\text{FN (false negative)}} \,\Big]$$
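A direct transcription of this loss, assuming predictions and labels are encoded as 0/1 vectors over the $k$ labels:

```python
# gamma-weighted loss: gamma per false positive, (1 - gamma) per false negative.
def gamma_weighted_loss(y_pred, y_true, gamma=0.5):
    loss = 0.0
    for hj, yj in zip(y_pred, y_true):
        if hj == 1 and yj == 0:      # false positive
            loss += gamma
        elif hj == 0 and yj == 1:    # false negative
            loss += 1 - gamma
    return loss

print(gamma_weighted_loss([1, 0, 1], [0, 1, 1]))  # one FP + one FN = 1.0
```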

SLIDE 14

Main Question

Would this actually reduce the true risk?

$$\mathbb{E}_{(x,y)}\,\ell\big(h_{post}(x),\, y\big) \;\overset{?}{<}\; \mathbb{E}_{(x,y)}\,\ell\big(h_{pre}(x),\, y\big)$$
SLIDE 15

Baseline Approach

  • Prove that, uniformly for all labels $j$,

$$\frac{\widehat{FP}_j}{\widehat{TP}_j} \;\longrightarrow\; \frac{FP_j}{TP_j}$$

where $TP_j = \Pr(\text{label } j \text{ present and predicted})$, $FP_j = \Pr(\text{label } j \text{ absent but predicted})$, and hats denote their empirical counterparts on the validation set

  • Problem: $m, k \to \infty$ together. Many classes only have a handful of examples

SLIDE 16

Uniform Convergence Approach

  • The algorithm implicitly chooses a hypothesis from a certain hypothesis class
    – Pruning rules on top of the fixed predictor $h_{pre}$
  • Prove uniform convergence by bounding the VC dimension / Rademacher complexity
  • Conclude that if the empirical risk decreases, the true risk decreases as well

SLIDE 17

Uniform Convergence Fails

  • Unfortunately, no uniform convergence...
  • ... and even no algorithm/data-dependent convergence! Writing $R(h)$ for the risk and $\widehat{R}(h)$ for its empirical estimate on the validation set:

$$\mathbb{E}\big[R(h_{post}) - \widehat{R}(h_{post})\big] \;\ge\; \sum_{j=1}^{k} \Pr(j \text{ pruned})\,\big(TP_j - FP_j\big) \;=\; \sum_{j=1}^{k} \Pr\big(\widehat{FP}_j > \widehat{TP}_j\big)\,\big(TP_j - FP_j\big)$$

Only weak correlation between the factors in the $m \approx k$ regime.

SLIDE 18

A Less Obvious Approach

  • Prove directly that the risk decreases
  • Important (but mild) assumption: each example is labeled by at most $s$ labels
  • Step 1: the risk of $h_{post}$ is concentrated. For all $\epsilon$,

$$\Pr\Big(\big|R(h_{post}) - \mathbb{E}\big[R(h_{post})\big]\big| \ge \epsilon\Big) \;\le\; \cdots$$

SLIDE 19–20

A Less Obvious Approach

  • Step 2: it is enough to prove $R(h_{pre}) - \mathbb{E}\big[R(h_{post})\big] > 0$
  • Assuming $\gamma = \tfrac{1}{2}$ for simplicity, it can be shown that

$$R(h_{pre}) - \mathbb{E}\big[R(h_{post})\big] \;>\; \mathrm{pos} \;-\; \mathcal{O}\!\left(\frac{\big\|(FP_j + TP_j)_{j=1}^{k}\big\|_{1/2}}{m}\right)$$

where $\mathrm{pos} = \sum_{j:\, FP_j \ge TP_j} \big(FP_j - TP_j\big)$ and $\|w\|_{1/2} = \big(\sum_j \sqrt{w_j}\big)^{2}$

  • For a probability vector, $\|w\|_{1/2}$ is always at most $k$
  • It is smaller the more non-uniform the distribution

SLIDE 21

Wikipedia Power-Law: $r = 1.6$

[Plot: Wikipedia category frequencies follow a power law with exponent $r = 1.6$]

SLIDE 22

Wikipedia Power-Law: $r = 1.6$

Plugging the power law into the bound:

$$R(h_{pre}) - \mathbb{E}\big[R(h_{post})\big] \;>\; \mathrm{pos} \;-\; \mathcal{O}\!\left(\frac{k^{0.4}}{m}\right)$$

SLIDE 23–26

Experiment

  • Click graph on the entire web (based on search-engine logs)
  • Categories from Wikipedia pages propagated twice through the graph
  • Train/test split of the Wikipedia pages: how good are the categories propagated from the training set at predicting the categories of the test-set pages?

[Results figure]
SLIDE 27

Another Less Obvious Approach

$$R(h_{pre}) - \mathbb{E}\big[R(h_{post})\big] \;=\; \sum_{j=1}^{k} \Pr(j \text{ pruned})\,\big(FP_j - TP_j\big) \;=\; \sum_{j=1}^{k} \Pr\big(\widehat{FP}_j > \widehat{TP}_j\big)\,\big(FP_j - TP_j\big)$$

Each summand reflects a weak but positive correlation, even with only a few examples per label. For large $k$, the sum will therefore tend to be positive.

SLIDE 28–31

Different Application: Crowdsourcing (Dekel and Shamir, 2009)

[Figure slides]

SLIDE 32

Different Application: Crowdsourcing

  • How can we improve crowdsourced data?
  • Standard approach: repeated labeling – but expensive
  • A bootstrap approach (sketched below):
    – Learn a predictor from the data of all workers
    – Throw away examples labeled by workers who disagree a lot with the predictor
    – Re-train on the remaining examples
  • Works! (under certain assumptions)
  • Challenge: workers often label only a handful of examples
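A minimal sketch of that bootstrap loop; the `train` routine, the (x, y, worker_id) example layout, and the disagreement threshold are all illustrative assumptions, not details from the talk:

```python
from collections import defaultdict

# Bootstrap filtering: train on everyone, drop workers who disagree
# with the learned predictor too often, then retrain on the rest.
def bootstrap_filter(examples, train, max_disagreement=0.3):
    """examples: list of (x, y, worker_id); train: examples -> predictor h(x)."""
    h = train(examples)
    stats = defaultdict(lambda: [0, 0])          # worker -> [errors, total]
    for x, y, w in examples:
        stats[w][0] += int(h(x) != y)
        stats[w][1] += 1
    kept = {w for w, (e, t) in stats.items() if e / t <= max_disagreement}
    return train([ex for ex in examples if ex[2] in kept])
```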

SLIDE 33–35

Different Application: Crowdsourcing

The number of examples per worker might be small, but there are many workers...

SLIDE 36

Conclusions

  • #classes $\to \infty$ violates the assumptions of most multiclass analyses
    – These are often based on generalizations of binary classification
  • Possible approach
    – Avoid the standard analysis
    – "Extreme X" can be a blessing rather than a curse
  • Other applications? More complex learning algorithms (e.g. substitution)?

SLIDE 37

Thanks!