Extreme multilabel learning
Charles Elkan Amazon Fellow December 12, 2015
1/32
Massive multilabel classification

In the world of big data, it is common to have many training examples (10^6 instances), high-dimensional data (10^6 features), and many labels to predict (10^4.5 labels).
Numbers are for predicting medical subject headings (MeSH) for documents in PubMed. Amazon datasets are far larger.
2/32
3/32
4/32
NIH can only afford to assign one human indexer per document. How can we measure how accurate the humans are?
Method: Look for articles that were inadvertently indexed twice.
Finding: About 0.1% of PubMed articles are duplicates, usually not exact duplicates.
Causes: Primarily plagiarism and joint issues of journals.
5/32
[Figure. Graph 2: Consistency of individual descriptors. Horizontal axis: base rate (logarithmic scale, 0.005 to 1.0); vertical axis: consistency (0.0 to 1.0).]
6/32
Consistency for concrete terms is better than for abstract terms.

Descriptor Name   Consistency (%)   95% ±   Base Rate (%)
Humans            92.80             0.62    79.58
Female            70.74             1.69    29.81
Male              68.14             1.78    27.61
Animals           76.89             1.93    20.21
...
Time Factors      19.13             3.29     4.08
7/32
At first sight, the training dataset contains 10^12 values. Can it fit in memory? (Yes, easy given sparsity.) What if the dataset is stored in distributed fashion? But: storing dense linear classifiers for 10^4.5 labels with 10^6 features would need roughly 200 gigabytes.
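As a rough check, assuming 8-byte floating-point weights: 10^4.5 labels × 10^6 features ≈ 3 × 10^10 weights, or about 2.5 × 10^11 bytes, on the order of a couple hundred gigabytes, consistent with the figure above.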
8/32
Training is feasible on a single CPU core if we have
sparse features (10^2.5 nonzero features per instance),
sparse labels (10^1.5 positive labels per instance), and
sparse models (10^3 features are enough for each label).
Note: Class imbalance is a non-problem for logistic regression and related methods. Here, a typical class has only 10^1.5/10^4.5 = 0.1% positives.
9/32
We use the F1 measure
F1 = 2 / (1/P + 1/R) = 2·tp / (2·tp + fp + fn).
Why does F1 not depend on the number tn of true negatives? Intuition: For any label, most instances are negative, so give no credit for correct predictions that are easy.
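As a minimal illustration with made-up counts (not from the PubMed experiments), F1 can be computed directly from the confusion counts:

```python
def f1_score(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn); true negatives never appear."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Hypothetical label: 30 true positives, 20 false positives, 20 false negatives.
print(f1_score(tp=30, fp=20, fn=20))  # precision 0.6, recall 0.6, F1 = 0.6
```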
10/32
[Figure: a small labels-by-instances grid of positive predictions (+), with F1 computed per instance and then averaged.]
Average F1 per document reflects the experience of a user who examines the positive predictions for some specific documents.
11/32
ECML 2014 paper with Z. Lipton and B. Narayanaswamy, "Optimal Thresholding of Classifiers to Maximize F1 Measure." Theorem: The probability threshold that maximizes expected F1 is half of the maximum achievable F1.
We can apply the theorem separately for any variant of F1.
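As a practical companion to the theorem, here is a minimal sketch that simply sweeps candidate thresholds on held-out data and keeps the F1-maximizing one for a single label; this is a brute-force alternative, not the closed-form rule, and the data below is made up:

```python
import numpy as np

def best_f1_threshold(y_true, probs):
    """Sweep the predicted probabilities as candidate thresholds and
    return the one that maximizes F1 on this (held-out) data."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(probs):
        pred = probs >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Hypothetical validation data for one label.
y = np.array([1, 0, 1, 1, 0, 0, 0, 1])
p = np.array([0.9, 0.4, 0.7, 0.2, 0.3, 0.6, 0.1, 0.8])
print(best_f1_threshold(y, p))
```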
12/32
1. Correlations between labels are not highly predictive.
2. Optimizing the right measure of success is important.
3. Keeping rare features is important for predicting rare labels.
   We need the word "platypus" to predict the label "monotreme."
4. Standard bag-of-words preprocessing is hard to beat.
   Use log(tf + 1) · idf and L2 length normalization (see the sketch after this list).
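A minimal sketch of this preprocessing in plain NumPy, assuming idf = log(N/df); the exact idf variant and smoothing used for PubMed are not specified here:

```python
import numpy as np

def log_tf_idf(counts):
    """Compute log(tf + 1) * idf with L2 row normalization.

    counts: (n_docs, n_terms) array of raw term counts.
    idf here is log(n_docs / df); other idf variants are common.
    """
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)          # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))       # avoid division by zero
    weights = np.log1p(counts) * idf               # log(tf + 1) * idf
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    return weights / np.maximum(norms, 1e-12)      # L2 length normalization

# Tiny hypothetical corpus: 3 documents, 4 vocabulary terms.
X = np.array([[2, 0, 1, 0],
              [0, 3, 0, 1],
              [1, 1, 0, 0]], dtype=float)
print(log_tf_idf(X).round(3))
```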
13/32
Two ideas to achieve tractability in training:
1. Use a loss function that promotes sparsity of weights.
2. Design the training algorithm to never lose sparsity.
On PubMed data, only 0.3% of weights are ever non-zero during training.
14/32
Features with the largest squared weights for the label "Earthquakes":

Feature       Field   Squared weight
earthquake    A       1.37
earthquake    T       0.99
fukushima     A       0.34
earthquakes   A       0.30
disaster      A       0.29
Disasters     J       0.18
haiti         A       0.18
wenchuan      T       0.18
disasters     A       0.17
wenchuan      A       0.16
(remaining mass)      0.14
A = word in abstract, T = word in title, J = exact journal name. This model has perfect training and test accuracy.
15/32
To solve massive multilabel learning tasks:
1. Linear or logistic regression
2. Training the models for all labels simultaneously
3. Combined L1 and L2 regularization (elastic net)
4. Stochastic gradient descent (SGD)
5. Proximal updates, delayed and applied only when needed
6. Sparse data structures
16/32
X: data matrix, sparse. W: weight matrix, sparse. Y: label matrix, sparse.
Use L1 regularization to find a sparse W that minimizes the discrepancy between f(XW) and the labels Y.
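A minimal sketch of the shapes involved, assuming SciPy sparse matrices and a logistic link for f; the sizes and densities are made up and far smaller than the PubMed data:

```python
import numpy as np
from scipy.sparse import random as sparse_random

n, d, L = 1000, 5000, 200   # instances, features, labels (tiny vs. PubMed)

X = sparse_random(n, d, density=0.01, random_state=0, format="csr")   # data
W = sparse_random(d, L, density=0.001, random_state=1, format="csr")  # weights
Y = sparse_random(n, L, density=0.005, random_state=2, format="csr")  # labels

scores = X @ W                                     # sparse-sparse product
probs = 1.0 / (1.0 + np.exp(-scores.toarray()))    # f = logistic link; dense only
print(probs.shape)                                 # for this tiny example: (1000, 200)
```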
17/32
It is wasteful to learn dense weights when only a few non-zero weights are needed for good accuracy.
The elastic net regularizer sums L1 and squared L2 penalties:
R(W) = Σ_{l=1..L} [ λ1·||w_l||_1 + (λ2/2)·||w_l||_2^2 ].
Like pure L1, it eliminates non-predictive features; like pure squared L2, it spreads weight over correlated predictive features.
18/32
We want to minimize the regularized loss L(w) + R(w), where R is analytically tractable, such as L1 plus squared L2.
Define prox_Q(w̃) = argmin_{w ∈ R^d} (1/2)||w − w̃||_2^2 + Q(w).
Then w_{t+1} = prox_Q(w_t − η g_t), where Q = ηR.
The proximal operator balances two objectives:
1. staying close to the SGD-updated weight vector w_t − η g_t
2. moving toward a weight vector with a lower value of R.
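A minimal sketch of this proximal step for the elastic net penalty, using the standard closed form (soft-thresholding for the L1 part followed by multiplicative shrinkage for the squared-L2 part); the numbers are illustrative, and this is not the talk's implementation:

```python
import numpy as np

def prox_elastic_net(w_tilde, eta, lam1, lam2):
    """Proximal operator of Q(w) = eta * (lam1*||w||_1 + (lam2/2)*||w||_2^2).

    Closed form: soft-threshold by eta*lam1, then shrink by 1/(1 + eta*lam2).
    """
    soft = np.sign(w_tilde) * np.maximum(np.abs(w_tilde) - eta * lam1, 0.0)
    return soft / (1.0 + eta * lam2)

def proximal_sgd_step(w, grad, eta, lam1, lam2):
    """One step: gradient move on the loss, then the proximal correction."""
    return prox_elastic_net(w - eta * grad, eta, lam1, lam2)

# Illustrative numbers only.
w = np.array([0.5, -0.2, 0.0, 1.0])
g = np.array([0.1, -0.3, 0.05, -0.2])
print(proximal_sgd_step(w, g, eta=0.1, lam1=0.05, lam2=0.01))
```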
19/32
[Figure: the proximal step illustrated as the minimum of the sum of the proximity function and the regularizer.]
w_{t+1} = prox_{ηR}(w_t − η g_t) = argmin_w [ (1/2)||w − (w_t − η g_t)||_2^2 + ηR(w) ]
20/32
1,125,160 training instances (all articles since 2011)
969,389 vocabulary words
25,380 labels, so 2.5 × 10^10 potential weights.
TF-IDF and L2 bag-of-words preprocessing.
AdaGrad multiplier α fixed to 1.
L1 and L2 regularization strengths λ1 = 3 × 10^−6 and λ2 = 10^−6, chosen on a small subset of the training data.
21/32
[Figure (repeated from earlier): labels-by-instances grid of positive predictions (+), with F1 computed per instance and then averaged.]
Reflects the experience of a user who looks at the positive predictions for some specific documents.
22/32
Fraction of labels   Per-instance F1
all                  0.52
30%                  0.54
10%                  0.56
3%                   0.59
1%                   0.61

Example-based F1 computed with various subsets of the 25,380 labels, from all labels to the 1% most frequent. Not surprising: more common labels are easier to predict.
23/32
[Figure: nnz(W), the number of non-zero weights, plotted over training iterations.]
Of about 25 billion potential weights, during training at most 80 million are non-zero; at convergence 50 million (0.2%).
24/32
To solve massive multilabel learning tasks:
1. Linear or logistic regression
2. Training the models for all labels simultaneously
3. Combined L1 and L2 regularization (elastic net)
4. Stochastic gradient descent (SGD)
5. Proximal updates, delayed and applied only when needed
6. Sparse data structures
25/32
26/32
If xij = 0 then the prediction f(xi · w) does not depend on wj, and the unregularized derivative with respect to wj is zero.

Algorithm 1: Using delayed updates
for t = 1, ..., T do
    sample xi randomly from the training set
    for each j such that xij ≠ 0 do
        wj ← DelayedUpdate(wj, t, ψj)
        ψj ← t
    end for
    w ← w − η(t) ∇Fi(w)
end for
27/32
Theorem: To bring weight wj current to time k from time ψj in constant time, the FoBoS update with L1 plus squared-L2 regularization and learning rate η(t) is

w_j^(k) = sgn(w_j^(ψj)) · [ |w_j^(ψj)| · Φ(k−1)/Φ(ψj−1) − Φ(k−1) · λ1 · (β(k−1) − β(ψj−1)) ]_+

where Φ(t) = Φ(t−1) · 1/(1 + η(t)·λ2) with base case Φ(−1) = 1,
and β(t) = β(t−1) + η(t)/Φ(t−1) with base case β(−1) = 0.
Here [·]_+ denotes clipping at zero, as in soft thresholding.
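A minimal sketch of how this closed form might be implemented, assuming the running sequences Φ and β are maintained as defined above; names are illustrative and this is not the Julia implementation benchmarked later:

```python
import numpy as np

class DelayedElasticNet:
    """Constant-time lazy FoBoS updates for L1 + squared-L2 regularization."""

    def __init__(self, lam1, lam2):
        self.lam1, self.lam2 = lam1, lam2
        self.phi = [1.0]   # phi[t + 1] stores Phi(t), with Phi(-1) = 1
        self.beta = [0.0]  # beta[t + 1] stores beta(t), with beta(-1) = 0

    def advance(self, eta_t):
        """Record step t: Phi(t) = Phi(t-1)/(1 + eta_t*lam2),
        beta(t) = beta(t-1) + eta_t/Phi(t-1)."""
        self.phi.append(self.phi[-1] / (1.0 + eta_t * self.lam2))
        self.beta.append(self.beta[-1] + eta_t / self.phi[-2])

    def bring_current(self, w_j, psi_j, k):
        """Apply all pending proximal updates from time psi_j to time k in O(1)."""
        scale = self.phi[k] / self.phi[psi_j]      # Phi(k-1) / Phi(psi_j-1)
        penalty = self.phi[k] * self.lam1 * (self.beta[k] - self.beta[psi_j])
        return np.sign(w_j) * max(abs(w_j) * scale - penalty, 0.0)
```

In the loop of Algorithm 1, advance would be called once per SGD step and bring_current would play the role of DelayedUpdate(wj, t, ψj).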
28/32
Datasets:

Dataset     Examples   Features   Labels
rcv1        30,000     47,236     101
bookmarks   87,856     2,150      208

Speed in Julia on one core, in examples per second (xps):

Dataset     Delayed (xps)   Standard (xps)   Speedup
rcv1        555             1.13             489.7
bookmarks   516.8           25               20.7
29/32
On the rcv1 dataset, 101 models train in minutes on one core.
30/32
To learn one linear model: With n examples, d dimensions, and e epochs, standard SGD-based methods use O(nde) time and O(d) space. With d′ average nonzero features per example and v nonzero weights per model, we use O(nd′e) time and O(v) space. Let n′ be the average number of positive examples per label. Future work: Use only O(n′d′e) time.
31/32
Questions? Discussion?
32/32