Extreme Classification: A New Paradigm for Ranking & Recommendation - PowerPoint PPT Presentation


SLIDE 1

Extreme Classification
A New Paradigm for Ranking & Recommendation

Manik Varma, Microsoft Research

SLIDE 2

Classification

Binary (Label 1, Label 2): pick one
Multi-label (Label 1, Label 2, Label 3, …, Label L): pick all that apply
Multi-class (Label 1, Label 2, Label 3, …, Label L): pick one
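As a minimal illustration (mine, not from the deck), the three settings differ only in how a model's label scores are turned into a prediction:

```python
import numpy as np

# Hypothetical scores from some trained model for one input point.
scores = np.array([0.9, 0.2, 0.7, 0.1])   # one score per label

# Binary / multi-class: pick the single highest-scoring label.
predicted_class = int(np.argmax(scores))                 # -> 0

# Multi-label: pick all labels whose score clears a threshold.
predicted_labels = np.flatnonzero(scores >= 0.5)         # -> [0, 2]
```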

SLIDE 3

Extreme Multi-label Learning

  • Learning with millions of labels

Predict the set of monetizable Bing queries that might lead to a click on this ad, e.g.: geico auto insurance, geico car insurance, geico insurance, www geico com, care geicos, geico com, need cheap auto insurance wisconsin, cheap car insurance quotes, cheap auto insurance florida, all state car insurance coupon code

MLRF: Multi-label Random Forests [Agrawal, Gupta, Prabhu, Varma WWW 2013]

SLIDE 4

Research Problems

  • Defining millions of labels
  • Obtaining good quality training data
  • Training using limited resources
  • Log time and log space prediction
  • Obtaining discriminative features at scale
  • Performance evaluation
  • Dealing with tail labels and label correlations
  • Dealing with missing and noisy labels
  • Statistical guarantees
  • Applications


SLIDE 5

Extreme Multi-label Learning - People

  • Which people are present in this selfie?


SLIDE 6

Extreme Multi-label Learning – Wikipedia


Labels: Living people, American computer scientists, Formal methods people, Carnegie Mellon University faculty, Massachusetts Institute of Technology alumni, Academic journal editors, Women in technology, Women computer scientists.

SLIDE 7

Reformulating ML Problems

  • Ranking or recommending millions of items

[Figure: each item to be ranked or recommended becomes one label - Label 1, Label 2, Label 3, Label 4, Label 5, …, Label 1M.]

SLIDE 8

FastXML
A Fast, Accurate & Stable Tree-classifier for eXtreme Multi-label Learning

Yashoteja Prabhu (IIT Delhi), Manik Varma (Microsoft Research)

SLIDE 9

FastXML

  • Logarithmic time prediction in milliseconds
  • Ensemble of balanced tree classifiers (prediction sketched below)
  • Accuracy gains of up to 25% over competing methods
  • Nodes partitioned using nDCG
  • Up to 1000x faster training than the state-of-the-art
  • Alternating minimization based optimization
  • Proof of convergence to a stationary point
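
The following sketch (my illustration under assumed data structures, not the released FastXML code) shows why prediction takes milliseconds: each test point follows one root-to-leaf path per tree, and the leaf label distributions of the ensemble are averaged before taking the top-k labels.

```python
import numpy as np

class Node:
    """One node of a (hypothetical) trained FastXML-style tree: internal
    nodes hold a separating hyperplane w, leaves hold the empirical
    distribution over labels of the training points that reached them."""
    def __init__(self, w=None, left=None, right=None, label_dist=None):
        self.w, self.left, self.right = w, left, right
        self.label_dist = label_dist            # None for internal nodes

def route_to_leaf(root, x):
    # O(tree depth) dot products: go left if w.x < 0, else right.
    node = root
    while node.label_dist is None:
        node = node.left if node.w @ x < 0 else node.right
    return node.label_dist

def predict_top_k(trees, x, k=5):
    # Average the leaf distributions over the ensemble, return top-k labels.
    scores = np.mean([route_to_leaf(t, x) for t in trees], axis=0)
    return np.argsort(-scores)[:k]
```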

SLIDE 10

Extreme Multi-Label Learning

  • Problem formulation

X: Users, Y: Items

$f : X \to 2^Y$
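
In code, the training data for such an $f$ is naturally a feature matrix over users plus a sparse binary user × item label matrix; a toy sketch with hypothetical data:

```python
import numpy as np
from scipy.sparse import csr_matrix

# 3 users with 4 features each; 6 items in total (toy numbers).
X = np.random.rand(3, 4)

# Y[j, i] = 1 iff user j likes item i; at extreme scale Y must be sparse.
users, items = [0, 0, 1, 2, 2], [1, 5, 0, 2, 3]
Y = csr_matrix((np.ones(5), (users, items)), shape=(3, 6))

# f maps a user to a subset of the items, e.g. user 0 -> {1, 5}.
print(Y[0].indices)        # [1 5]
```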

SLIDE 11

Extreme Multi-Label Learning

  • Problem formulation

[Figure: $f$ maps a user to the subset of items relevant to them.]

SLIDE 12

Tree Based Extreme Classification

  • Prediction in logarithmic time
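
For example, a balanced binary tree with a million leaves has depth ⌈log₂ 10⁶⌉ = 20, so prediction touches only about 20 internal nodes per tree instead of evaluating one classifier per label.
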
SLIDE 14

FastXML Architecture

[Figure: FastXML architecture diagram.]

SLIDE 15

FastXML

  • Logarithmic time prediction in milliseconds
  • Ensemble of balanced tree classifiers
  • Accuracy gains of up to 25% over competing methods
  • Nodes partitioned using nDCG
  • Up to 1000x faster training than the state-of-the-art
  • Alternating minimization based optimization
  • Proof of convergence to a stationary point

SLIDE 16

FastXML Architecture

[Figure: FastXML architecture diagram.]

SLIDE 17

Learning to Partition a Node

[Figure: training data at the node to be partitioned.]

SLIDE 18

Learning to Partition a Node

X: Space of Users

$$\min_{\mathbf{x}}\ \|\mathbf{x}\|_1 \;-\; D \sum_{j \in \text{Users}} \mathrm{nDCG}\bigl(\mathbf{y}_j, \mathbf{z}_j, \mathbf{x}\bigr)$$

[Figure: the hyperplane $\mathbf{x}$ splits the users at the node - those with $\mathbf{x}^\top \mathbf{y} < 0$ go to one child and those with $\mathbf{x}^\top \mathbf{y} > 0$ to the other.]
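
Applying a candidate hyperplane at a node is then a single matrix-vector product followed by a sign test; a minimal sketch in the slide's notation (mine, with the rows of Y_feats playing the role of the user vectors $\mathbf{y}_j$):

```python
import numpy as np

def split_users(x, Y_feats):
    """Partition the users at a node with hyperplane x.

    Y_feats: (n_users, n_features) matrix whose rows are the y_j.
    Returns indices of users with x.y < 0 and with x.y >= 0
    (the slide draws the two sides as x.y < 0 and x.y > 0)."""
    side = Y_feats @ x
    return np.flatnonzero(side < 0), np.flatnonzero(side >= 0)
```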

SLIDE 19

FastXML

  • Logarithmic time prediction in milliseconds
  • Ensemble of balanced tree classifiers
  • Accuracy gains of up to 25% over competing methods
  • Nodes partitioned using nDCG
  • Up to 1000x faster training than the state-of-the-art
  • Alternating minimization based optimization
  • Proof of convergence to a stationary point

SLIDE 20

Optimizing nDCG

  • nDCG is hard to optimize
  • nDCG is non-convex and non-smooth
  • Large input variations → No change in nDCG
  • Small input variations → Large changes in nDCG

$$\mathrm{nDCG} \;\propto\; \mathrm{like}(j, \mathbf{s}_1) \;+\; \sum_{m=2}^{M} \frac{\mathrm{like}(j, \mathbf{s}_m)}{\log(m+1)}$$

$$\mathrm{like}(j, \mathbf{s}_m) = \begin{cases} 1 & \text{if user } j \text{ likes the item with rank } \mathbf{s}_m \\ 0 & \text{otherwise} \end{cases}$$
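
A direct implementation of this gain (my sketch; the slide leaves the log base unspecified, so base 2 is assumed here, which gives the rank-1 item weight 1):

```python
import numpy as np

def dcg(likes_in_rank_order):
    """likes_in_rank_order[m-1] = 1 iff the user likes the item at rank m."""
    likes = np.asarray(likes_in_rank_order, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, likes.size + 2))  # 1/log(m+1)
    return float(likes @ discounts)

def ndcg(likes_in_rank_order):
    # Normalize by the best achievable ordering of the same likes.
    ideal = dcg(sorted(likes_in_rank_order, reverse=True))
    return dcg(likes_in_rank_order) / ideal if ideal > 0 else 0.0

# Example: liked items at ranks 1 and 3 versus ranks 2 and 3.
print(ndcg([1, 0, 1, 0]), ndcg([0, 1, 1, 0]))
```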

SLIDE 21

Optimizing nDCG

$$\min_{\mathbf{x}}\ \|\mathbf{x}\|_1 \;-\; D \sum_{j \in \text{Users}} \mathrm{nDCG}\bigl(\mathbf{y}_j, \mathbf{z}_j, \mathbf{x}\bigr)$$

SLIDE 22

Optimizing nDCG – Reformulation

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$

where the last term is the nDCG of the ranking $\mathbf{s}_{\varepsilon_j}$ over user $j$'s labels $\mathbf{z}_j$, written as an inner product.

SLIDE 23

Optimizing nDCG – Initialization

$$\varepsilon_j \sim \mathrm{Bernoulli}(0.5), \quad \forall j$$

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$


SLIDE 25

Optimizing nDCG – Initialization

$$\mathbf{s}_\pm^* = \mathrm{rank}\Bigl(\sum_{j:\,\varepsilon_j = \pm 1} O_{\mathbf{z}_j}\,\mathbf{z}_j\Bigr)$$

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$

SLIDE 26

Optimizing nDCG – Repartitioning Users

$$\varepsilon_j^* = \mathrm{sign}\bigl(w_j^- - w_j^+\bigr)$$

$$w_j^\pm = D_\varepsilon(\pm 1)\,\log\!\bigl(1 + e^{\mp \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s\, \mathbf{s}_\pm^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$


SLIDE 28

Optimizing nDCG – Reranking Items

$$\mathbf{s}_\pm^* = \mathrm{rank}\Bigl(\sum_{j:\,\varepsilon_j = \pm 1} O_{\mathbf{z}_j}\,\mathbf{z}_j\Bigr)$$

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$

SLIDE 29

Optimizing nDCG

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$

SLIDE 30

Optimizing nDCG – Hyperplane Separator

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$

With $\boldsymbol{\varepsilon}$ and $\mathbf{s}_\pm$ fixed, the minimization over $\mathbf{x}$ is an $\ell_1$-regularized logistic regression that yields the hyperplane separator.
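
Putting the alternation together, here is a self-contained toy sketch (my simplification, not the released FastXML code: it uses a plain discounted gain in place of $\mathbf{s}^\top O_{\mathbf{z}}\mathbf{z}$ and omits the logistic term from the user-reassignment step):

```python
import numpy as np

def partition_node(Z, n_iters=5, seed=0):
    """Toy alternating minimization at one node.

    Z: (n_users, n_items) 0/1 relevance matrix (its rows are the z_j).
    Returns user assignments eps in {-1, +1} and the two rankings."""
    rng = np.random.default_rng(seed)
    n_users, n_items = Z.shape
    # Discount vector playing the role of O_z (base-2 log, unnormalized).
    disc = 1.0 / np.log2(np.arange(2, n_items + 2))

    def gain(ranking, z):               # the inner-product nDCG term
        return float(z[ranking] @ disc)

    eps = rng.choice([-1, 1], size=n_users)     # eps_j ~ Bernoulli(0.5)
    for _ in range(n_iters):
        # Re-rank items: sort by aggregated relevance on each side.
        s = {sgn: np.argsort(-Z[eps == sgn].sum(axis=0)) for sgn in (-1, 1)}
        # Re-partition users: send each user to the side whose ranking
        # gives it the larger gain (the log-loss term is omitted here).
        for j in range(n_users):
            eps[j] = max((-1, 1), key=lambda sgn: gain(s[sgn], Z[j]))
    return eps, s
```

On the slides, the user step additionally includes the logistic term through $w_j^\pm$, and the remaining minimization over $\mathbf{x}$ with $\boldsymbol{\varepsilon}$ fixed is the $\ell_1$-regularized logistic regression that learns the separator.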

SLIDE 31

Data Set Statistics

Small data sets:

    Data Set    # Training Points   # Test Points   # Dimensions   # Labels
    Delicious   12,920              3,185           500            983
    MediaMill   30,993              12,914          120            101
    RCV1-X      781,265             23,149          47,236         2,456
    BibTeX      4,880               2,515           1,836          159

Large data sets (all counts in millions):

    Data Set    # Training Points   # Test Points   # Dimensions   # Labels
    WikiLSHTC   1.89                0.47            1.62           0.33
    Ads-430K    1.12                0.50            0.088          0.43
    Ads-1M      3.92                1.56            0.16           1.08
    Ads-9M      70.46               22.63           2.08           8.84

SLIDE 32

Results on Small Data Sets

[Figure: Precision@1, @3 and @5 of FastXML, MLRF, LPSR, 1-vs-All, LEML and CS on Delicious, MediaMill, RCV1-X and BibTeX.]
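
P1, P3 and P5 in these charts denote precision at k = 1, 3, 5: the fraction of the top-k predicted labels that are relevant. A minimal reference implementation (mine):

```python
import numpy as np

def precision_at_k(scores, relevant, k):
    """scores: model scores over all labels; relevant: true label indices."""
    top_k = np.argsort(-scores)[:k]
    return float(np.isin(top_k, relevant).mean())

# e.g. P@3 with ground-truth labels {0, 2, 5}:
print(precision_at_k(np.array([.9, .1, .8, .2, .3, .7]), [0, 2, 5], 3))  # 1.0
```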

SLIDE 33

Large Data Sets - WikiLSHTC

Dataset statistics: 1,892,600 training points; 1,617,899 (sparse) features; 325,056 labels; 472,835 test points.

[Figures: Precision@1/3/5 and training time in hours for FastXML, LPSR-NB and LEML.]

Test time (millisec): FastXML 0.33, LPSR-NB 9.00, LEML 243.00.

SLIDE 34

Large Data Sets - Ads

[Figures: Precision@1/3/5 and test time in milliseconds on Ads-430K (FastXML, LPSR-NB, LEML), Ads-1M (FastXML, LPSR-NB) and Ads-9M (FastXML only).]

SLIDE 35

Training Times in Hours Versus Cores

[Figure: training time in hours versus number of cores (1, 2, 4, 8, 16) on WikiLSHTC, Ads-430K and Ads-1M.]

SLIDE 36

Conclusions

  • Extreme classification
    • Tackle applications with millions of labels
    • A new paradigm for recommendation
  • FastXML
    • Significantly higher prediction accuracy
    • Can train on a single desktop
  • Publications and code
    • WWW13, KDD14, NIPS15 papers
    • Code and data available from my website
SLIDE 37

Unbiased Performance Evaluation

Himanshu Jain (IIT Delhi), Yashoteja Prabhu (IIT Delhi), Manik Varma (Microsoft Research)

SLIDE 38

Traditional Loss/Gain Functions

  • Hamming loss
  • Subset 0/1 loss
  • Precision
  • Recall
  • F-score
  • Jaccard distance

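Minimal reference implementations of these losses (my sketch; y_true and y_pred are NumPy 0/1 indicator vectors over the labels of one point):

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    return float(np.mean(y_true != y_pred))

def subset_01_loss(y_true, y_pred):
    return float(not np.array_equal(y_true, y_pred))

def precision(y_true, y_pred):
    # Fraction of predicted labels that are truly relevant.
    return float(y_true[y_pred == 1].mean()) if y_pred.sum() else 0.0

def recall(y_true, y_pred):
    # Fraction of relevant labels that were predicted.
    return float(y_pred[y_true == 1].mean()) if y_true.sum() else 0.0

def f_score(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if p + r else 0.0

def jaccard_distance(y_true, y_pred):
    # 1 minus the Jaccard index of the two label sets.
    union = np.logical_or(y_true, y_pred).sum()
    inter = np.logical_and(y_true, y_pred).sum()
    return float(1 - inter / union) if union else 0.0
```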

SLIDE 39

[Figure: tag sets for five US presidents - Washington, Lincoln, Kennedy, Jefferson and Roosevelt. Generic head tags (history, politics, people, usa, america, president, us citizen, politician, leader) are shared by all five, while tail tags identify individuals: cuban missile crisis and project apollo (Kennedy); emancipation proclamation and abolition of slavery (Lincoln); declaration of independence and acquisition of louisiana (Jefferson); whiskey rebellion and american revolutionary war (Washington); attack on pearl harbour and great depression (Roosevelt); plus founding fathers of the us and assassinated, each shared by two.]

SLIDE 40

Average # of Positive Labels per Point

[Figure: average number of positive labels per point across data sets (log scale: 1–128).]

  • +ve labels are more important than –ve ones
SLIDE 41

Missing Labels


Labels: Living people, American computer scientists, Formal methods people, Carnegie Mellon University faculty, Massachusetts Institute of Technology alumni, Academic journal editors, Women in technology, Women computer scientists.

SLIDE 42

Tail Labels


  • # of relevant labels > # of prediction slots
  • Not all positive labels are equally important
SLIDE 43

Extreme Loss/Gain Functions

  • Accuracy – handle biased ground truth
  • Rareness / Novelty
  • Diversity
  • Explainability

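As one concrete example of an accuracy measure that handles biased ground truth, precision can be propensity-scored so that a correct prediction of a rarely-observed label counts for more. A sketch (my illustration, assuming per-label propensities p_l, the probabilities that a relevant label was actually observed, are available):

```python
import numpy as np

def psp_at_k(scores, y_true, propensities, k):
    """Propensity-scored precision@k: each hit on label l is up-weighted
    by 1 / p_l, so labels that often go missing from the ground truth are
    not unfairly discounted. (Unnormalized; a normalized variant divides
    by the best achievable value under the same weights.)"""
    top_k = np.argsort(-scores)[:k]
    return float(np.sum(y_true[top_k] / propensities[top_k]) / k)
```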

SLIDE 44

Open Research Questions


  • Applications
  • Obtaining good quality training data
  • Log time and space training and prediction
  • Obtaining discriminative features at scale
  • Extreme loss functions
  • Performance evaluation
  • Dealing with tail labels and label correlations
  • Dealing with missing and noisy labels
  • Explore/exploit for tail labels
  • Statistical guarantees
  • Fine-grained classification
SLIDE 45

Acknowledgements

Rahul Agrawal, Kush Bhatia, Shilpa G., Archit Gupta, Himanshu Jain, Prateek Jain, Abhishek Kadian, Purushottam Kar, Abhirup Nath, Ambuj Tewari, C. Yeshwanth

SLIDE 46

Multiple Iterations - Ads-430K

[Figures: objective value at the root node, training time in hours, and Precision@1/3/5 on Ads-430K as a function of the number of w-update iterations (1–5).]

SLIDE 47

Tree Imbalance

[Figure: tree imbalance of FastXML, MLRF and LPSR on the small data sets, and of FastXML and LPSR on the large data sets (log scale).]

SLIDE 48

Variants of FastXML - Small Data Sets

[Figure: Precision@1/3/5 of FastXML, MLRF-nDCG, FastXML-nDCG5 and FastXML-P5 on Delicious, MediaMill, RCV1-X and BibTeX.]

SLIDE 49

Variants of FastXML - Large Data Sets

[Figure: Precision@1/3/5 of FastXML, FastXML-nDCG5 and FastXML-P5 on WikiLSHTC, Ads-430K and Ads-1M.]

SLIDE 50

Random Tree Selection

[Figure: Precision@5 versus number of trees (10–50) on BibTeX, Delicious, MediaMill, RCV1-X, Ads-430K and WikiLSHTC.]