Extreme Classification: A New Paradigm for Ranking & Recommendation - PowerPoint PPT Presentation


SLIDE 1

Extreme Classification
A New Paradigm for Ranking & Recommendation

Manik Varma, Microsoft Research

SLIDE 2

Classification

Binary (Label 1, Label 2): pick one
Multi-label (Label 1, Label 2, Label 3, …, Label L): pick all that apply
Multi-class (Label 1, Label 2, Label 3, …, Label L): pick one
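As a minimal illustration (mine, not from the deck), the three settings differ only in how a model's label scores are turned into a prediction:

```python
import numpy as np

# Hypothetical scores from some trained model for one input point.
scores = np.array([0.9, 0.2, 0.7, 0.1])   # one score per label

# Binary / multi-class: pick the single highest-scoring label.
predicted_class = int(np.argmax(scores))                 # -> 0

# Multi-label: pick all labels whose score clears a threshold.
predicted_labels = np.flatnonzero(scores >= 0.5)         # -> [0, 2]
```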

SLIDE 3

Extreme Multi-label Learning

  • Learning with millions of labels

Predict the set of monetizable Bing queries that might lead to a click on this ad, e.g.: geico auto insurance, geico car insurance, geico insurance, www geico com, care geicos, geico com, need cheap auto insurance wisconsin, cheap car insurance quotes, cheap auto insurance florida, all state car insurance coupon code

MLRF: Multi-label Random Forests [Agrawal, Gupta, Prabhu, Varma WWW 2013]

SLIDE 4

Research Problems

  • Defining millions of labels
  • Obtaining good quality training data
  • Training using limited resources
  • Log time and log space prediction
  • Obtaining discriminative features at scale
  • Performance evaluation
  • Dealing with tail labels and label correlations
  • Dealing with missing and noisy labels
  • Statistical guarantees
  • Applications


SLIDE 5

Extreme Multi-label Learning - People

  • Which people are present in this selfie?


SLIDE 6

Extreme Multi-label Learning – Wikipedia


Labels: Living people, American computer scientists, Formal methods people, Carnegie Mellon University faculty, Massachusetts Institute of Technology alumni, Academic journal editors, Women in technology, Women computer scientists.

SLIDE 7

Reformulating ML Problems

  • Ranking or recommending millions of items

[Figure: each item to be ranked or recommended becomes one label - Label 1, Label 2, Label 3, Label 4, Label 5, …, Label 1M.]

SLIDE 8

FastXML
A Fast, Accurate & Stable Tree-classifier for eXtreme Multi-label Learning

Yashoteja Prabhu (IIT Delhi), Manik Varma (Microsoft Research)

SLIDE 9

FastXML

  • Logarithmic time prediction in milliseconds
  • Ensemble of balanced tree classifiers (prediction sketched below)
  • Accuracy gains of up to 25% over competing methods
  • Nodes partitioned using nDCG
  • Up to 1000x faster training than the state-of-the-art
  • Alternating minimization based optimization
  • Proof of convergence to a stationary point
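
The following sketch (my illustration under assumed data structures, not the released FastXML code) shows why prediction takes milliseconds: each test point follows one root-to-leaf path per tree, and the leaf label distributions of the ensemble are averaged before taking the top-k labels.

```python
import numpy as np

class Node:
    """One node of a (hypothetical) trained FastXML-style tree: internal
    nodes hold a separating hyperplane w, leaves hold the empirical
    distribution over labels of the training points that reached them."""
    def __init__(self, w=None, left=None, right=None, label_dist=None):
        self.w, self.left, self.right = w, left, right
        self.label_dist = label_dist            # None for internal nodes

def route_to_leaf(root, x):
    # O(tree depth) dot products: go left if w.x < 0, else right.
    node = root
    while node.label_dist is None:
        node = node.left if node.w @ x < 0 else node.right
    return node.label_dist

def predict_top_k(trees, x, k=5):
    # Average the leaf distributions over the ensemble, return top-k labels.
    scores = np.mean([route_to_leaf(t, x) for t in trees], axis=0)
    return np.argsort(-scores)[:k]
```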

SLIDE 10

Extreme Multi-Label Learning

  • Problem formulation

X: Users, Y: Items

$f : X \to 2^Y$
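
In code, the training data for such an $f$ is naturally a feature matrix over users plus a sparse binary user × item label matrix; a toy sketch with hypothetical data:

```python
import numpy as np
from scipy.sparse import csr_matrix

# 3 users with 4 features each; 6 items in total (toy numbers).
X = np.random.rand(3, 4)

# Y[j, i] = 1 iff user j likes item i; at extreme scale Y must be sparse.
users, items = [0, 0, 1, 2, 2], [1, 5, 0, 2, 3]
Y = csr_matrix((np.ones(5), (users, items)), shape=(3, 6))

# f maps a user to a subset of the items, e.g. user 0 -> {1, 5}.
print(Y[0].indices)        # [1 5]
```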

SLIDE 11

Extreme Multi-Label Learning

  • Problem formulation

[Figure: $f$ maps a user to the subset of items relevant to them.]

SLIDE 12

Tree Based Extreme Classification

  • Prediction in logarithmic time
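
For example, a balanced binary tree with a million leaves has depth ⌈log₂ 10⁶⌉ = 20, so prediction touches only about 20 internal nodes per tree instead of evaluating one classifier per label.
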
SLIDE 14

FastXML Architecture

[Figure: FastXML architecture diagram.]

SLIDE 15

FastXML

  • Logarithmic time prediction in milliseconds
  • Ensemble of balanced tree classifiers
  • Accuracy gains of up to 25% over competing methods
  • Nodes partitioned using nDCG
  • Up to 1000x faster training than the state-of-the-art
  • Alternating minimization based optimization
  • Proof of convergence to a stationary point

SLIDE 16

FastXML Architecture

[Figure: FastXML architecture diagram.]

SLIDE 17

Learning to Partition a Node

[Figure: training data at the node to be partitioned.]

SLIDE 18

Learning to Partition a Node

X: Space of Users

$$\min_{\mathbf{x}}\ \|\mathbf{x}\|_1 \;-\; D \sum_{j \in \text{Users}} \mathrm{nDCG}\bigl(\mathbf{y}_j, \mathbf{z}_j, \mathbf{x}\bigr)$$

[Figure: the hyperplane $\mathbf{x}$ splits the users at the node - those with $\mathbf{x}^\top \mathbf{y} < 0$ go to one child and those with $\mathbf{x}^\top \mathbf{y} > 0$ to the other.]
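
Applying a candidate hyperplane at a node is then a single matrix-vector product followed by a sign test; a minimal sketch in the slide's notation (mine, with the rows of Y_feats playing the role of the user vectors $\mathbf{y}_j$):

```python
import numpy as np

def split_users(x, Y_feats):
    """Partition the users at a node with hyperplane x.

    Y_feats: (n_users, n_features) matrix whose rows are the y_j.
    Returns indices of users with x.y < 0 and with x.y >= 0
    (the slide draws the two sides as x.y < 0 and x.y > 0)."""
    side = Y_feats @ x
    return np.flatnonzero(side < 0), np.flatnonzero(side >= 0)
```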

SLIDE 19

FastXML

  • Logarithmic time prediction in milliseconds
  • Ensemble of balanced tree classifiers
  • Accuracy gains of up to 25% over competing methods
  • Nodes partitioned using nDCG
  • Up to 1000x faster training than the state-of-the-art
  • Alternating minimization based optimization
  • Proof of convergence to a stationary point

SLIDE 20

Optimizing nDCG

  • nDCG is hard to optimize
  • nDCG is non-convex and non-smooth
  • Large input variations → No change in nDCG
  • Small input variations → Large changes in nDCG

$$\mathrm{nDCG} \;\propto\; \mathrm{like}(j, \mathbf{s}_1) \;+\; \sum_{m=2}^{M} \frac{\mathrm{like}(j, \mathbf{s}_m)}{\log(m+1)}$$

$$\mathrm{like}(j, \mathbf{s}_m) = \begin{cases} 1 & \text{if user } j \text{ likes the item with rank } \mathbf{s}_m \\ 0 & \text{otherwise} \end{cases}$$
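
A direct implementation of this gain (my sketch; the slide leaves the log base unspecified, so base 2 is assumed here, which gives the rank-1 item weight 1):

```python
import numpy as np

def dcg(likes_in_rank_order):
    """likes_in_rank_order[m-1] = 1 iff the user likes the item at rank m."""
    likes = np.asarray(likes_in_rank_order, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, likes.size + 2))  # 1/log(m+1)
    return float(likes @ discounts)

def ndcg(likes_in_rank_order):
    # Normalize by the best achievable ordering of the same likes.
    ideal = dcg(sorted(likes_in_rank_order, reverse=True))
    return dcg(likes_in_rank_order) / ideal if ideal > 0 else 0.0

# Example: liked items at ranks 1 and 3 versus ranks 2 and 3.
print(ndcg([1, 0, 1, 0]), ndcg([0, 1, 1, 0]))
```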

SLIDE 21

Optimizing nDCG

$$\min_{\mathbf{x}}\ \|\mathbf{x}\|_1 \;-\; D \sum_{j \in \text{Users}} \mathrm{nDCG}\bigl(\mathbf{y}_j, \mathbf{z}_j, \mathbf{x}\bigr)$$

SLIDE 22

Optimizing nDCG – Reformulation

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$

where the last term is the nDCG of the ranking $\mathbf{s}_{\varepsilon_j}$ over user $j$'s labels $\mathbf{z}_j$, written as an inner product.

SLIDE 23

Optimizing nDCG – Initialization

$$\varepsilon_j \sim \mathrm{Bernoulli}(0.5), \quad \forall j$$

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$


SLIDE 25

Optimizing nDCG – Initialization

$$\mathbf{s}_\pm^* = \mathrm{rank}\Bigl(\sum_{j:\,\varepsilon_j = \pm 1} O_{\mathbf{z}_j}\,\mathbf{z}_j\Bigr)$$

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$

SLIDE 26

Optimizing nDCG – Repartitioning Users

$$\varepsilon_j^* = \mathrm{sign}\bigl(w_j^- - w_j^+\bigr)$$

$$w_j^\pm = D_\varepsilon(\pm 1)\,\log\!\bigl(1 + e^{\mp \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s\, \mathbf{s}_\pm^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$


SLIDE 28

Optimizing nDCG – Reranking Items

$$\mathbf{s}_\pm^* = \mathrm{rank}\Bigl(\sum_{j:\,\varepsilon_j = \pm 1} O_{\mathbf{z}_j}\,\mathbf{z}_j\Bigr)$$

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$

SLIDE 29

Optimizing nDCG

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$

SLIDE 30

Optimizing nDCG – Hyperplane Separator

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}_\pm}\ \|\mathbf{x}\|_1 \;+\; \sum_{j} D_\varepsilon(\varepsilon_j)\,\log\!\bigl(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\bigr) \;-\; D_s \sum_{j} \mathbf{s}_{\varepsilon_j}^\top O_{\mathbf{z}_j}\,\mathbf{z}_j$$

With $\boldsymbol{\varepsilon}$ and $\mathbf{s}_\pm$ fixed, the minimization over $\mathbf{x}$ is an $\ell_1$-regularized logistic regression that yields the hyperplane separator.
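
Putting the alternation together, here is a self-contained toy sketch (my simplification, not the released FastXML code: it uses a plain discounted gain in place of $\mathbf{s}^\top O_{\mathbf{z}}\mathbf{z}$ and omits the logistic term from the user-reassignment step):

```python
import numpy as np

def partition_node(Z, n_iters=5, seed=0):
    """Toy alternating minimization at one node.

    Z: (n_users, n_items) 0/1 relevance matrix (its rows are the z_j).
    Returns user assignments eps in {-1, +1} and the two rankings."""
    rng = np.random.default_rng(seed)
    n_users, n_items = Z.shape
    # Discount vector playing the role of O_z (base-2 log, unnormalized).
    disc = 1.0 / np.log2(np.arange(2, n_items + 2))

    def gain(ranking, z):               # the inner-product nDCG term
        return float(z[ranking] @ disc)

    eps = rng.choice([-1, 1], size=n_users)     # eps_j ~ Bernoulli(0.5)
    for _ in range(n_iters):
        # Re-rank items: sort by aggregated relevance on each side.
        s = {sgn: np.argsort(-Z[eps == sgn].sum(axis=0)) for sgn in (-1, 1)}
        # Re-partition users: send each user to the side whose ranking
        # gives it the larger gain (the log-loss term is omitted here).
        for j in range(n_users):
            eps[j] = max((-1, 1), key=lambda sgn: gain(s[sgn], Z[j]))
    return eps, s
```

On the slides, the user step additionally includes the logistic term through $w_j^\pm$, and the remaining minimization over $\mathbf{x}$ with $\boldsymbol{\varepsilon}$ fixed is the $\ell_1$-regularized logistic regression that learns the separator.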

SLIDE 31

Data Set Statistics

Small data sets:

    Data Set    # Training Points   # Test Points   # Dimensions   # Labels
    Delicious   12,920              3,185           500            983
    MediaMill   30,993              12,914          120            101
    RCV1-X      781,265             23,149          47,236         2,456
    BibTeX      4,880               2,515           1,836          159

Large data sets (all counts in millions):

    Data Set    # Training Points   # Test Points   # Dimensions   # Labels
    WikiLSHTC   1.89                0.47            1.62           0.33
    Ads-430K    1.12                0.50            0.088          0.43
    Ads-1M      3.92                1.56            0.16           1.08
    Ads-9M      70.46               22.63           2.08           8.84

SLIDE 32

Results on Small Data Sets

[Figure: Precision@1, @3 and @5 of FastXML, MLRF, LPSR, 1-vs-All, LEML and CS on Delicious, MediaMill, RCV1-X and BibTeX.]
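
P1, P3 and P5 in these charts denote precision at k = 1, 3, 5: the fraction of the top-k predicted labels that are relevant. A minimal reference implementation (mine):

```python
import numpy as np

def precision_at_k(scores, relevant, k):
    """scores: model scores over all labels; relevant: true label indices."""
    top_k = np.argsort(-scores)[:k]
    return float(np.isin(top_k, relevant).mean())

# e.g. P@3 with ground-truth labels {0, 2, 5}:
print(precision_at_k(np.array([.9, .1, .8, .2, .3, .7]), [0, 2, 5], 3))  # 1.0
```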

SLIDE 33

Large Data Sets - WikiLSHTC

Dataset statistics: 1,892,600 training points; 1,617,899 (sparse) features; 325,056 labels; 472,835 test points.

[Figures: Precision@1/3/5 and training time in hours for FastXML, LPSR-NB and LEML.]

Test time (millisec): FastXML 0.33, LPSR-NB 9.00, LEML 243.00.

SLIDE 34

Large Data Sets - Ads

[Figures: Precision@1/3/5 and test time in milliseconds on Ads-430K (FastXML, LPSR-NB, LEML), Ads-1M (FastXML, LPSR-NB) and Ads-9M (FastXML only).]

SLIDE 35

Training Times in Hours Versus Cores

[Figure: training time in hours versus number of cores (1, 2, 4, 8, 16) on WikiLSHTC, Ads-430K and Ads-1M.]

SLIDE 36

Conclusions

  • Extreme classification
    • Tackle applications with millions of labels
    • A new paradigm for recommendation
  • FastXML
    • Significantly higher prediction accuracy
    • Can train on a single desktop
  • Publications and code
    • WWW13, KDD14, NIPS15 papers
    • Code and data available from my website
SLIDE 37

Unbiased Performance Evaluation

Himanshu Jain (IIT Delhi), Yashoteja Prabhu (IIT Delhi), Manik Varma (Microsoft Research)

SLIDE 38

Traditional Loss/Gain Functions

  • Hamming loss
  • Subset 0/1 loss
  • Precision
  • Recall
  • F-score
  • Jaccard distance

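Minimal reference implementations of these losses (my sketch; y_true and y_pred are NumPy 0/1 indicator vectors over the labels of one point):

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    return float(np.mean(y_true != y_pred))

def subset_01_loss(y_true, y_pred):
    return float(not np.array_equal(y_true, y_pred))

def precision(y_true, y_pred):
    # Fraction of predicted labels that are truly relevant.
    return float(y_true[y_pred == 1].mean()) if y_pred.sum() else 0.0

def recall(y_true, y_pred):
    # Fraction of relevant labels that were predicted.
    return float(y_pred[y_true == 1].mean()) if y_true.sum() else 0.0

def f_score(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if p + r else 0.0

def jaccard_distance(y_true, y_pred):
    # 1 minus the Jaccard index of the two label sets.
    union = np.logical_or(y_true, y_pred).sum()
    inter = np.logical_and(y_true, y_pred).sum()
    return float(1 - inter / union) if union else 0.0
```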

SLIDE 39

[Figure: tag sets for five US presidents - Washington, Lincoln, Kennedy, Jefferson and Roosevelt. Generic head tags (history, politics, people, usa, america, president, us citizen, politician, leader) are shared by all five, while tail tags identify individuals: cuban missile crisis and project apollo (Kennedy); emancipation proclamation and abolition of slavery (Lincoln); declaration of independence and acquisition of louisiana (Jefferson); whiskey rebellion and american revolutionary war (Washington); attack on pearl harbour and great depression (Roosevelt); plus founding fathers of the us and assassinated, each shared by two.]

SLIDE 40

Average # of Positive Labels per Point

[Figure: average number of positive labels per point across data sets (log scale: 1–128).]

  • +ve labels are more important than –ve ones
SLIDE 41

Missing Labels


Labels: Living people, American computer scientists, Formal methods people, Carnegie Mellon University faculty, Massachusetts Institute of Technology alumni, Academic journal editors, Women in technology, Women computer scientists.

SLIDE 42

Tail Labels


  • # of relevant labels > # of prediction slots
  • Not all positive labels are equally important
SLIDE 43

Extreme Loss/Gain Functions

  • Accuracy – handle biased ground truth
  • Rareness / Novelty
  • Diversity
  • Explainability

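As one concrete example of an accuracy measure that handles biased ground truth, precision can be propensity-scored so that a correct prediction of a rarely-observed label counts for more. A sketch (my illustration, assuming per-label propensities p_l, the probabilities that a relevant label was actually observed, are available):

```python
import numpy as np

def psp_at_k(scores, y_true, propensities, k):
    """Propensity-scored precision@k: each hit on label l is up-weighted
    by 1 / p_l, so labels that often go missing from the ground truth are
    not unfairly discounted. (Unnormalized; a normalized variant divides
    by the best achievable value under the same weights.)"""
    top_k = np.argsort(-scores)[:k]
    return float(np.sum(y_true[top_k] / propensities[top_k]) / k)
```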

SLIDE 44

Open Research Questions


  • Applications
  • Obtaining good quality training data
  • Log time and space training and prediction
  • Obtaining discriminative features at scale
  • Extreme loss functions
  • Performance evaluation
  • Dealing with tail labels and label correlations
  • Dealing with missing and noisy labels
  • Explore/exploit for tail labels
  • Statistical guarantees
  • Fine-grained classification
SLIDE 45

Acknowledgements

Rahul Agrawal, Kush Bhatia, Shilpa G., Archit Gupta, Himanshu Jain, Prateek Jain, Abhishek Kadian, Purushottam Kar, Abhirup Nath, Ambuj Tewari, C. Yeshwanth

SLIDE 46

Multiple Iterations - Ads-430K

[Figures: objective value at the root node, training time in hours, and Precision@1/3/5 on Ads-430K as a function of the number of w-update iterations (1–5).]

SLIDE 47

Tree Imbalance

[Figure: tree imbalance of FastXML, MLRF and LPSR on the small data sets, and of FastXML and LPSR on the large data sets (log scale).]

SLIDE 48

Variants of FastXML - Small Data Sets

[Figure: Precision@1/3/5 of FastXML, MLRF-nDCG, FastXML-nDCG5 and FastXML-P5 on Delicious, MediaMill, RCV1-X and BibTeX.]

SLIDE 49

Variants of FastXML - Large Data Sets

[Figure: Precision@1/3/5 of FastXML, FastXML-nDCG5 and FastXML-P5 on WikiLSHTC, Ads-430K and Ads-1M.]

SLIDE 50

Random Tree Selection

[Figure: Precision@5 versus number of trees (10–50) on BibTeX, Delicious, MediaMill, RCV1-X, Ads-430K and WikiLSHTC.]