Extreme Classification
Manik Varma Microsoft Research
A New Paradigm for Ranking & Recommendation
Classification

Binary: pick one of {Label 1, Label 2}
Multi-class: pick one of {Label 1, Label 2, Label 3, …, Label L}
Multi-label: pick all that apply from {Label 1, Label 2, Label 3, …, Label L}
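The three settings differ only in the shape of the target for each data point. A minimal illustration (variable names are ours, not from the talk):

```python
# One data point, L = 4 candidate labels {0, 1, 2, 3}.
binary_target     = 1          # binary: pick one of two labels
multiclass_target = 2          # multi-class: pick exactly one of the L labels
multilabel_target = {0, 2, 3}  # multi-label: pick any subset of the L labels

# Multi-label targets are subsets, so there are 2**L = 16 possible targets
# instead of L = 4 -- the root of the "extreme" scaling problem.
num_multilabel_targets = 2 ** 4
```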
Extreme Multi-label Learning

Predict the set of monetizable Bing queries that might lead to a click on this ad:

geico auto insurance, geico car insurance, geico insurance, www geico com, care geicos, geico com, need cheap auto insurance wisconsin, cheap car insurance quotes, cheap auto insurance florida, all state car insurance coupon code
MLRF: Multi-label Random Forests [Agrawal, Gupta, Prabhu, Varma WWW 2013]
Research Problems
Extreme Multi-label Learning - People
Extreme Multi-label Learning – Wikipedia
Labels: Living people, American computer scientists, Formal methods people, Carnegie Mellon University faculty, Massachusetts Institute of Technology alumni, Academic journal editors, Women in technology, Women computer scientists.
Reformulating ML Problems

[Figure: Items on one side linked to Labels on the other, with the label set running from Label 1 up to Label 1M.]
FastXML: A Fast, Accurate & Stable Tree-classifier for eXtreme Multi-label Learning

Yashoteja Prabhu (IIT Delhi), Manik Varma (Microsoft Research)
Extreme Multi-Label Learning

X: Users, Y: Items

f : X → 2^Y (learn to map each user to the subset of items that user likes)
Tree Based Extreme Classification

FastXML Architecture

[Figure: an ensemble of trees over the training points; each internal node splits its points with a learnt linear separator, and each leaf holds a ranked list of labels.]
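A sketch of how such a tree ensemble can be used at test time, assuming each internal node stores a learnt separator w and each leaf stores a label distribution. The `Node` layout and `predict` function are illustrative, not FastXML's released code:

```python
import numpy as np

class Node:
    """Illustrative tree node: internal nodes hold a separator w,
    leaves hold scores over the labels seen at that leaf."""
    def __init__(self, w=None, left=None, right=None, label_scores=None):
        self.w, self.left, self.right = w, left, right
        self.label_scores = label_scores  # dict {label: score}; leaves only

def predict(trees, x, k=5):
    """Route x down every tree by the sign of w.x, average the leaf
    label distributions over the ensemble, return the top-k labels."""
    scores = {}
    for root in trees:
        node = root
        while node.label_scores is None:       # descend to a leaf
            node = node.left if node.w @ x < 0 else node.right
        for label, s in node.label_scores.items():
            scores[label] = scores.get(label, 0.0) + s / len(trees)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Prediction cost is O(#trees × tree depth), which is why it matters that the trees stay balanced.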
Learning to Partition a Node

[Figure: the training data at a node being split between its two children.]
X: Space of Users

$$\min_{\mathbf{w}} \; \|\mathbf{w}\|_1 - C_s \sum_{j \in \text{Users}} \text{nDCG}\left(\mathbf{s}^{\,\text{sign}(\mathbf{w}^\top \mathbf{x}_j)}, \mathbf{y}_j\right)$$

[Figure: the user space partitioned by the hyperplane $\mathbf{w}$, with $\mathbf{w}^\top \mathbf{x} < 0$ on one side and $\mathbf{w}^\top \mathbf{x} > 0$ on the other.]
Optimizing nDCG

$$\text{nDCG}_j \propto \text{like}(j, \mathbf{s}_1) + \sum_{m=2}^{M} \frac{\text{like}(j, \mathbf{s}_m)}{\log(m+1)}$$

$$\text{like}(j, \mathbf{s}_m) = \begin{cases} 1 & \text{if user } j \text{ likes the item at rank } m \\ 0 & \text{otherwise} \end{cases}$$
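The gain above, as code (log base 2, so the rank-1 term has discount 1; the function name is ours):

```python
import math

def ndcg(ranking, liked, k=None):
    """nDCG of `ranking` (item ids, best first) for a user who likes the
    items in the set `liked`: sum of 1/log2(m + 1) over liked items at
    (1-based) rank m <= k, normalised by the ideal ordering's value."""
    k = k if k is not None else len(ranking)
    dcg = sum(1.0 / math.log2(m + 1)
              for m, item in enumerate(ranking[:k], start=1) if item in liked)
    ideal = sum(1.0 / math.log2(m + 1)
                for m in range(1, min(k, len(liked)) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```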
Optimizing nDCG

$$\min_{\mathbf{w}} \; \|\mathbf{w}\|_1 - C_s \sum_{j \in \text{Users}} \text{nDCG}\left(\mathbf{s}^{\,\text{sign}(\mathbf{w}^\top \mathbf{x}_j)}, \mathbf{y}_j\right)$$
Optimizing nDCG – Reformulation

$$\min_{\mathbf{w},\,\boldsymbol{\varepsilon},\,\mathbf{s}^{\pm}} \; \|\mathbf{w}\|_1 + C_\varepsilon \sum_j \log\left(1 + e^{-\varepsilon_j \mathbf{w}^\top \mathbf{x}_j}\right) - C_s \sum_j \text{nDCG}\left(\mathbf{s}^{\varepsilon_j}, \mathbf{y}_j\right)$$

Here $\varepsilon_j \in \{-1, +1\}$ is a relaxation of $\text{sign}(\mathbf{w}^\top \mathbf{x}_j)$ assigning user $j$ to the negative or positive child, and $\mathbf{s}^{\pm}$ are the item rankings offered by the two children.
Optimizing nDCG – Initialization

$$\varepsilon_j \sim \text{Bernoulli}(0.5), \quad \forall j$$
Optimizing nDCG – Initialization

$$\mathbf{s}^{\pm*} = \text{rank}\left( \sum_{j:\, \varepsilon_j = \pm 1} \mathbf{y}_j \right)$$
Optimizing nDCG – Repartitioning Users

$$\varepsilon_j^{*} = \text{sign}\left(w_j^{-} - w_j^{+}\right), \qquad w_j^{\pm} = C_\varepsilon \log\left(1 + e^{\mp \mathbf{w}^\top \mathbf{x}_j}\right) - C_s\, \text{nDCG}\left(\mathbf{s}^{\pm}, \mathbf{y}_j\right)$$
Optimizing nDCG – Reranking Items

$$\mathbf{s}^{\pm*} = \text{rank}\left( \sum_{j:\, \varepsilon_j = \pm 1} \mathbf{y}_j \right)$$
Optimizing nDCG – Hyperplane Separator

With $\boldsymbol{\varepsilon}$ and $\mathbf{s}^{\pm}$ fixed, the remaining terms in $\mathbf{w}$ form an $\ell_1$-regularised logistic regression:

$$\min_{\mathbf{w}} \; \|\mathbf{w}\|_1 + C_\varepsilon \sum_j \log\left(1 + e^{-\varepsilon_j \mathbf{w}^\top \mathbf{x}_j}\right)$$
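The steps above (random ε initialisation, ranking s± by aggregating labels, reassigning each user by sign(w_j^- − w_j^+), refitting w) can be sketched as follows. This is our simplified reading of the procedure: plain gradient descent on the logistic term stands in for the ℓ1-regularised solver, and the ranking step simply sorts aggregated label counts.

```python
import numpy as np

def rank_labels(Y, idx):
    """s+-*: order labels by their frequency among the points in idx."""
    counts = Y[idx].sum(axis=0) if len(idx) else np.zeros(Y.shape[1])
    return np.argsort(-counts)

def ndcg_at(ranking, y, k=5):
    """nDCG of a label ranking against a binary label vector y."""
    disc = 1.0 / np.log2(np.arange(2, k + 2))      # 1/log2(m+1), m = 1..k
    ideal = disc[:min(k, int(y.sum()))].sum()
    return (y[ranking[:k]] * disc).sum() / ideal if ideal > 0 else 0.0

def partition_node(X, Y, c_eps=1.0, c_s=1.0, n_iters=5, seed=0):
    """Alternating minimisation sketch of the node objective
    ||w||_1 + C_eps sum_j log(1 + exp(-eps_j w.x_j)) - C_s sum_j nDCG(s^{eps_j}, y_j)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    eps = rng.choice([-1, 1], size=n)               # eps_j ~ Bernoulli(0.5)
    w = np.zeros(d)
    for _ in range(n_iters):
        # Re-rank items in each child.
        s_pos = rank_labels(Y, np.where(eps == +1)[0])
        s_neg = rank_labels(Y, np.where(eps == -1)[0])
        # Repartition users: eps_j* = sign(w_j^- - w_j^+).
        for j in range(n):
            m = X[j] @ w
            w_plus  = c_eps * np.log1p(np.exp(-m)) - c_s * ndcg_at(s_pos, Y[j])
            w_minus = c_eps * np.log1p(np.exp(+m)) - c_s * ndcg_at(s_neg, Y[j])
            eps[j] = 1 if w_minus >= w_plus else -1
        # Refit the separator on the current eps (logistic loss only here).
        for _ in range(100):
            p = 1.0 / (1.0 + np.exp(-(X @ w)))
            w -= 0.5 * X.T @ (p - (eps + 1) / 2) / n
    return w, eps
```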
Data Set Statistics

Small data sets:

  Data Set     # Training Points   # Test Points   # Dimensions   # Labels
  Delicious    12,920              3,185           500            983
  MediaMill    30,993              12,914          120            101
  RCV1-X       781,265             23,149          47,236         2,456
  BibTeX       4,880               2,515           1,836          159

Large data sets (all counts in millions):

  Data Set     # Training Points   # Test Points   # Dimensions   # Labels
  WikiLSHTC    1.89                0.47            1.62           0.33
  Ads-430K     1.12                0.50            0.088          0.43
  Ads-1M       3.92                1.56            0.16           1.08
  Ads-9M       70.46               22.63           2.08           8.84
Results on Small Data Sets

[Bar charts: Precision at k = 1, 3, 5 (P1, P3, P5) for FastXML, MLRF, LPSR, 1-vs-All, LEML and CS on Delicious, MediaMill, RCV1-X and BibTeX.]
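P1, P3 and P5 throughout these results are precision at k = 1, 3, 5: the fraction of a method's k highest-scoring labels that are true labels of the test point. A minimal sketch:

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Fraction of the k highest-scoring labels that are relevant.
    y_true: binary relevance vector; scores: one score per label."""
    topk = np.argsort(-np.asarray(scores, dtype=float))[:k]
    return float(np.asarray(y_true)[topk].sum()) / k
```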
Large Data Sets - WikiLSHTC

  Training Points   1,892,600
  Features          1,617,899 (sparse)
  Labels            325,056
  Test Points       472,835

[Charts: Precision at k (P1, P3, P5) and training time in hours for FastXML, LPSR-NB and LEML.]

Test time (millisec): FastXML 0.33, LPSR-NB 9.00, LEML 243.00
Large Data Sets - Ads

[Charts for Ads-430K, Ads-1M and Ads-9M: Precision at k (P1, P3, P5) and test time in milliseconds. FastXML is compared with LPSR-NB and LEML on Ads-430K, with LPSR-NB on Ads-1M, and is the only method shown on Ads-9M.]
Training Times in Hours Versus Cores

[Charts: training time in hours on 1, 2, 4, 8 and 16 cores for WikiLSHTC, Ads-430K and Ads-1M.]
Himanshu Jain (IIT Delhi), Yashoteja Prabhu (IIT Delhi), Manik Varma (Microsoft Research)
Traditional Loss/Gain Functions
[Figure: Wikipedia tags for Washington, Lincoln, Kennedy, Jefferson and Roosevelt. Head labels such as history, politics, people, usa, america and president are shared by all five, while tail labels such as whiskey rebellion, emancipation proclamation, cuban missile crisis, founding fathers of the us, declaration of independence, acquisition of louisiana, abolition of slavery and project apollo distinguish them.]
[Chart: average # of positive labels per point, on a log-scale axis from 1 to 128.]
Missing Labels
Labels: Living people, American computer scientists, Formal methods people, Carnegie Mellon University faculty, Massachusetts Institute of Technology alumni, Academic journal editors, Women in technology, Women computer scientists.
Tail Labels
Extreme Loss/Gain Functions
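One gain function proposed in this line of work (the Jain, Prabhu & Varma follow-up on missing and tail labels) is propensity-scored precision at k, which up-weights hits on rarely observed labels by the inverse of an estimated observation propensity p_l. A sketch, with the propensity vector taken as given:

```python
import numpy as np

def ps_precision_at_k(y_true, scores, propensities, k):
    """Propensity-scored P@k: each relevant label in the top k counts
    1/p_l instead of 1, so tail labels (low p_l) matter more."""
    topk = np.argsort(-np.asarray(scores, dtype=float))[:k]
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(propensities, dtype=float)
    return float((y[topk] / p[topk]).sum()) / k
```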
Rahul Agrawal, Kush Bhatia, Shilpa G., Archit Gupta, Himanshu Jain, Prateek Jain, Abhishek Kadian, Purushottam Kar, Abhirup Nath, Ambuj Tewari
Multiple Iterations - Ads-430K

[Charts: precision at k (P1, P3, P5) and training time in hours as the number of w update iterations increases from 1 to 5.]
Tree Imbalance

[Charts: tree imbalance for FastXML, MLRF and LPSR on the small data sets, and for FastXML and LPSR on the large data sets.]
Variants of FastXML - Small Data Sets

[Bar charts: P1, P3, P5 for FastXML, MLRF-nDCG, FastXML-nDCG5 and FastXML-P5 on Delicious, MediaMill, RCV1-X and BibTeX.]
Variants of FastXML - Large Data Sets

[Bar charts: P1, P3, P5 for FastXML, FastXML-nDCG5 and FastXML-P5 on WikiLSHTC, Ads-430K and Ads-1M.]
Random Tree Selection

[Chart: P5 on BibTeX, Delicious, MediaMill, RCV1-X, Ads-430K and WikiLSHTC as the fraction of randomly selected trees varies from 0.1 to 0.7.]