Multiclass Multilabel Classification with More Classes than Examples



SLIDE 1

Multiclass Multilabel Classification with More Classes than Examples

Ohad Shamir

Weizmann Institute of Science

Joint work with Ofer Dekel, MSR

NIPS 2015 Extreme Classification Workshop

SLIDE 2

Extreme Multiclass Multilabel Problems

Label set is a folksonomy (a.k.a. collaborative tagging or social tagging)

SLIDE 3–5

[Screenshots: a Wikipedia article (Leonardo da Vinci) and its category list]

Categories: 1452 births / 1519 deaths / 15th century in science / Ambassadors of the Republic of Florence / Ballistic experts / Fabulists / Giftedness / Mathematics and culture / Italian inventors / Members of the Guild of Saint Luke / Tuscan painters / People persecuted under anti-homosexuality laws...

SLIDE 6

Problem Definition

  • Multiclass multilabel classification
  • $m$ training examples, $k$ categories
  • $m, k \to \infty$ together
    – Possibly even $k > m$
  • Goal: categorize unseen instances
SLIDE 7

Extreme Multiclass

  • Supervised learning starts with binary classification ($k = 2$) and extends to multiclass learning
    – Theory: VC dimension → Natarajan dimension
    – Algorithms: binary → multiclass
  • Usually, assume $k = \mathcal{O}(1)$
  • Some exceptions
    – Hierarchy with prior knowledge on relationships – not always available
    – Additional assumptions (e.g. the talk by Marius earlier)

SLIDE 8

Application

  • Classify the web based on Wikipedia categories
  • Training set: all Wikipedia pages ($m = 4.2 \times 10^6$)
  • Labels: all Wikipedia categories ($k = 1.1 \times 10^6$)
SLIDE 9

Challenges

  • Statistical problem: can't get a large (or even moderate) sample from each class
  • Computational problem: many classification algorithms will choke on millions of labels

SLIDE 10

Propagating Labels on the Click-Graph

  • A bipartite graph derived from search-engine logs: clicks encoded as weighted edges
  • Wikipedia pages are labeled web pages
  • Labels propagate along edges to other pages (sketched below)

[Figure: bipartite click graph connecting queries to web pages]
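To make the propagation concrete, here is a minimal sketch of one page → query → page round of label passing. The data layout (`edges` as weighted (query, page) pairs, `labels` as per-page category scores) is assumed for illustration and is not the talk's actual pipeline.

```python
# A minimal sketch of one round of label propagation on the bipartite
# click graph; the data layout here is an illustrative assumption.
from collections import defaultdict

def propagate(edges, labels):
    """edges: {(query, page): click weight}; labels: {page: {category: score}}.
    Push each page's category scores to its queries, then back to pages."""
    query_scores = defaultdict(lambda: defaultdict(float))
    for (q, p), w in edges.items():
        for cat, s in labels.get(p, {}).items():
            query_scores[q][cat] += w * s
    page_scores = defaultdict(lambda: defaultdict(float))
    for (q, p), w in edges.items():
        for cat, s in query_scores[q].items():
            page_scores[p][cat] += w * s
    return {p: dict(c) for p, c in page_scores.items()}

# A Wikipedia page passes its categories to an unlabeled page via a shared query:
edges = {("q", "en.wikipedia.org/wiki/Leonardo_da_Vinci"): 1.0,
         ("q", "www.greatItalians.com"): 0.5}
labels = {"en.wikipedia.org/wiki/Leonardo_da_Vinci":
          {"Renaissance artists": 1.0, "1452 births": 1.0}}
print(propagate(edges, labels)["www.greatItalians.com"])
# {'Renaissance artists': 0.5, '1452 births': 0.5}
```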

SLIDE 11

Example

  • http://en.wikipedia.org/wiki/Leonardo_da_Vinci passes multiple labels to http://www.greatItalians.com
  • Among them
    – "Renaissance artists" – good
    – "1452 births" – bad
  • Observation: "1452 births" induces many false positives (FP): best to remove it altogether from the classifier's output
    – (FP ⇒ TN, TP ⇒ FN)

SLIDE 12

Simple Label Pruning Approach

  1. Split the dataset into a training set and a validation set
  2. Use the training set to build an initial classifier $h_{pre}$ (e.g. by propagating labels over the click-graph)
  3. Apply $h_{pre}$ to the validation set; count false positives $\widehat{FP}_j$ and true positives $\widehat{TP}_j$ for each label $j$
  4. For every $j \in \{1, \dots, k\}$, remove label $j$ if

$$\frac{\widehat{FP}_j}{\widehat{TP}_j} > \frac{1-\gamma}{\gamma}$$

  • This defines a new "pruned" classifier $h_{post}$ (see the sketch below)
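Given per-label validation counts, the pruning step is a few lines. A minimal sketch, assuming hypothetical count dictionaries rather than the authors' code:

```python
# Sketch of step 4: prune label j when FP_j / TP_j > (1 - gamma) / gamma,
# i.e. when suppressing the label lowers the gamma-weighted empirical risk.
def prune_labels(fp, tp, gamma=0.5):
    """fp, tp: dicts mapping label j -> validation FP / TP counts."""
    threshold = (1 - gamma) / gamma
    return {j for j in fp
            if tp.get(j, 0) == 0 or fp[j] / tp[j] > threshold}

# h_post = h_pre with the returned labels suppressed:
fp = {"1452 births": 90, "Renaissance artists": 5}
tp = {"1452 births": 10, "Renaissance artists": 40}
print(prune_labels(fp, tp))  # {'1452 births'}
```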

SLIDE 13

Simple Label Pruning Approach

This explicitly minimizes the empirical risk with respect to the $\gamma$-weighted loss:

$$\ell\big(h(x), y\big) \;=\; \sum_{j=1}^{k} \Big[\, \underbrace{\gamma\;\mathbb{1}\{h_j(x) = 1,\ y_j = 0\}}_{\text{FP (false positive)}} \;+\; \underbrace{(1-\gamma)\;\mathbb{1}\{h_j(x) = 0,\ y_j = 1\}}_{\text{FN (false negative)}} \,\Big]$$
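A direct transcription of this loss, assuming predictions and labels are encoded as 0/1 vectors over the $k$ labels:

```python
# gamma-weighted loss: gamma per false positive, (1 - gamma) per false negative.
def gamma_weighted_loss(y_pred, y_true, gamma=0.5):
    loss = 0.0
    for hj, yj in zip(y_pred, y_true):
        if hj == 1 and yj == 0:      # false positive
            loss += gamma
        elif hj == 0 and yj == 1:    # false negative
            loss += 1 - gamma
    return loss

print(gamma_weighted_loss([1, 0, 1], [0, 1, 1]))  # one FP + one FN = 1.0
```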

SLIDE 14

Main Question

Would this actually reduce the true risk?

$$\mathbb{E}_{(x,y)}\,\ell\big(h_{post}(x),\, y\big) \;\overset{?}{<}\; \mathbb{E}_{(x,y)}\,\ell\big(h_{pre}(x),\, y\big)$$
SLIDE 15

Baseline Approach

  • Prove that, uniformly for all labels $j$,

$$\frac{\widehat{FP}_j}{\widehat{TP}_j} \;\longrightarrow\; \frac{FP_j}{TP_j}$$

where $TP_j = \Pr(\text{label } j \text{ present and predicted})$, $FP_j = \Pr(\text{label } j \text{ absent but predicted})$, and hats denote their empirical counterparts on the validation set

  • Problem: $m, k \to \infty$ together. Many classes only have a handful of examples

SLIDE 16

Uniform Convergence Approach

  • The algorithm implicitly chooses a hypothesis from a certain hypothesis class
    – Pruning rules on top of the fixed predictor $h_{pre}$
  • Prove uniform convergence by bounding the VC dimension / Rademacher complexity
  • Conclude that if the empirical risk decreases, the true risk decreases as well

SLIDE 17

Uniform Convergence Fails

  • Unfortunately, no uniform convergence...
  • ... and even no algorithm/data-dependent convergence! Writing $R(h)$ for the risk and $\widehat{R}(h)$ for its empirical estimate on the validation set:

$$\mathbb{E}\big[R(h_{post}) - \widehat{R}(h_{post})\big] \;\ge\; \sum_{j=1}^{k} \Pr(j \text{ pruned})\,\big(TP_j - FP_j\big) \;=\; \sum_{j=1}^{k} \Pr\big(\widehat{FP}_j > \widehat{TP}_j\big)\,\big(TP_j - FP_j\big)$$

Only weak correlation between the factors in the $m \approx k$ regime.

SLIDE 18

A Less Obvious Approach

  • Prove directly that the risk decreases
  • Important (but mild) assumption: each example is labeled by at most $s$ labels
  • Step 1: the risk of $h_{post}$ is concentrated. For all $\epsilon$,

$$\Pr\Big(\big|R(h_{post}) - \mathbb{E}\big[R(h_{post})\big]\big| \ge \epsilon\Big) \;\le\; \cdots$$

SLIDE 19–20

A Less Obvious Approach

  • Step 2: it is enough to prove $R(h_{pre}) - \mathbb{E}\big[R(h_{post})\big] > 0$
  • Assuming $\gamma = \tfrac{1}{2}$ for simplicity, it can be shown that

$$R(h_{pre}) - \mathbb{E}\big[R(h_{post})\big] \;>\; \mathrm{pos} \;-\; \mathcal{O}\!\left(\frac{\big\|(FP_j + TP_j)_{j=1}^{k}\big\|_{1/2}}{m}\right)$$

where $\mathrm{pos} = \sum_{j:\, FP_j \ge TP_j} \big(FP_j - TP_j\big)$ and $\|w\|_{1/2} = \big(\sum_j \sqrt{w_j}\big)^{2}$

  • For a probability vector, $\|w\|_{1/2}$ is always at most $k$
  • It is smaller the more non-uniform the distribution

SLIDE 21

Wikipedia Power-Law: $r = 1.6$

[Plot: Wikipedia category frequencies follow a power law with exponent $r = 1.6$]

SLIDE 22

Wikipedia Power-Law: $r = 1.6$

Plugging the power law into the bound:

$$R(h_{pre}) - \mathbb{E}\big[R(h_{post})\big] \;>\; \mathrm{pos} \;-\; \mathcal{O}\!\left(\frac{k^{0.4}}{m}\right)$$

SLIDE 23–26

Experiment

  • Click graph on the entire web (based on search-engine logs)
  • Categories from Wikipedia pages propagated twice through the graph
  • Train/test split of the Wikipedia pages: how good are the categories propagated from the training set at predicting the categories of the test-set pages?

[Results figure]
SLIDE 27

Another Less Obvious Approach

$$R(h_{pre}) - \mathbb{E}\big[R(h_{post})\big] \;=\; \sum_{j=1}^{k} \Pr(j \text{ pruned})\,\big(FP_j - TP_j\big) \;=\; \sum_{j=1}^{k} \Pr\big(\widehat{FP}_j > \widehat{TP}_j\big)\,\big(FP_j - TP_j\big)$$

Each summand reflects a weak but positive correlation, even with only a few examples per label. For large $k$, the sum will therefore tend to be positive.

SLIDE 28–31

Different Application: Crowdsourcing (Dekel and Shamir, 2009)

[Figure slides]

SLIDE 32

Different Application: Crowdsourcing

  • How can we improve crowdsourced data?
  • Standard approach: repeated labeling – but expensive
  • A bootstrap approach (sketched below):
    – Learn a predictor from the data of all workers
    – Throw away examples labeled by workers who disagree a lot with the predictor
    – Re-train on the remaining examples
  • Works! (under certain assumptions)
  • Challenge: workers often label only a handful of examples
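A minimal sketch of that bootstrap loop; the `train` routine, the (x, y, worker_id) example layout, and the disagreement threshold are all illustrative assumptions, not details from the talk:

```python
from collections import defaultdict

# Bootstrap filtering: train on everyone, drop workers who disagree
# with the learned predictor too often, then retrain on the rest.
def bootstrap_filter(examples, train, max_disagreement=0.3):
    """examples: list of (x, y, worker_id); train: examples -> predictor h(x)."""
    h = train(examples)
    stats = defaultdict(lambda: [0, 0])          # worker -> [errors, total]
    for x, y, w in examples:
        stats[w][0] += int(h(x) != y)
        stats[w][1] += 1
    kept = {w for w, (e, t) in stats.items() if e / t <= max_disagreement}
    return train([ex for ex in examples if ex[2] in kept])
```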

SLIDE 33–35

Different Application: Crowdsourcing

The number of examples per worker might be small, but there are many workers...

SLIDE 36

Conclusions

  • #classes $\to \infty$ violates the assumptions of most multiclass analyses
    – These are often based on generalizations of binary classification
  • Possible approach
    – Avoid the standard analysis
    – "Extreme X" can be a blessing rather than a curse
  • Other applications? More complex learning algorithms (e.g. substitution)?

SLIDE 37

Thanks!