On-line Hierarchical Multi-label Text Classification
Jesse Read
Supervised by Bernhard (and Eibe and Geoff)
Multi-label Classification
Multi-class ("single-label") classification, e.g.:
- Class set C = {Sports, Environment, Science, Politics}
- For a text document d, select a single class c ∈ C

Multi-label classification, e.g.:
- Label set L = {Sports, Environment, Science, Politics}
- For a text document d, select a label subset S ⊆ L

Doc.  Labels (S ⊆ L)
1     {Sports, Politics}
2     {Science, Politics}
3     {Sports}
4     {Environment, Science}

...how to do multi-label classification?
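Before looking at the transformation methods, here is a minimal sketch of how the toy dataset above could be represented in code; the names LABELS, DOCS and to_indicator are illustrative, not from the slides.

```python
# Toy multi-label dataset from the slide above (names are illustrative).
LABELS = ["Sports", "Environment", "Science", "Politics"]

# Each document is associated with a label subset S of L.
DOCS = {
    1: {"Sports", "Politics"},
    2: {"Science", "Politics"},
    3: {"Sports"},
    4: {"Environment", "Science"},
}

def to_indicator(label_subset):
    """Encode a label subset as a 0/1 vector over LABELS."""
    return [1 if label in label_subset else 0 for label in LABELS]

for doc_id, subset in DOCS.items():
    print(doc_id, to_indicator(subset))   # e.g. 1 -> [1, 0, 0, 1]
```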
Problem Transformation Methods (PT)
Transforming a multi-label problem into a multi-class problem without losing information:
- 1. (LC) Label Combination Method
- 2. (BC) Binary Classifiers Method
- 3. (RT) Ranking Threshold Method
Our toy multi-label problem, with label set L = {Sports, Environment, Science, Politics}:

Doc.  Labels (S ⊆ L)
1     {Sports, Politics}
2     {Science, Politics}
3     {Sports}
4     {Environment, Science}
- 1. Label Combination Method (LC)
Train:

Doc.  Class
1     Sports+Politics
2     Science+Politics
3     Sports
4     Science+Environment

Test:

Doc.  Class
X     ?

(A code sketch of this transformation follows the notes below.)
- May generate many classes for few documents
- Possibly inflexible for time-ordered data
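A minimal sketch of the LC transformation on the toy dataset above; label_combination is an illustrative name, not from the slides.

```python
def label_combination(dataset):
    """LC: treat each distinct label subset as one atomic class."""
    return {doc_id: "+".join(sorted(subset)) for doc_id, subset in dataset.items()}

docs = {1: {"Sports", "Politics"}, 2: {"Science", "Politics"},
        3: {"Sports"}, 4: {"Environment", "Science"}}

print(label_combination(docs))
# {1: 'Politics+Sports', 2: 'Politics+Science', 3: 'Sports', 4: 'Environment+Science'}
# Any single-label (multi-class) learner can now be trained on the combined classes;
# up to 2^|L| classes are possible, hence the sparsity problem noted above.
```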
- 2. Binary Classifiers Method (BC)
Train (one binary problem per label):

Doc.  BSports  BEnvironment  BScience  BPolitics
1     1        0             0         1
2     0        0             1         1
3     1        0             0         0
4     0        1             1         0

Test:

Doc.  BSports  BEnvironment  BScience  BPolitics
X     ?        ?             ?         ?

(A code sketch of this transformation follows the notes below.)
- Slow, need |L| classifiers.
- Assumes that all labels are independent
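A minimal sketch of the BC transformation on the toy dataset above; binary_problems is an illustrative name, not from the slides.

```python
def binary_problems(dataset, labels):
    """For each label l, build a binary dataset: 1 if l is in the document's
    label set, otherwise 0."""
    return {l: {doc_id: int(l in subset) for doc_id, subset in dataset.items()}
            for l in labels}

docs = {1: {"Sports", "Politics"}, 2: {"Science", "Politics"},
        3: {"Sports"}, 4: {"Environment", "Science"}}

for label, targets in binary_problems(docs, ["Sports", "Environment", "Science", "Politics"]).items():
    print(label, targets)
# Sports {1: 1, 2: 0, 3: 1, 4: 0} ...
# One binary classifier is trained per label; at test time the predicted
# label set is the union of the labels whose classifier says "positive".
```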
- 3. Ranking Threshold Method (RT)
Train (one row per document-label pair):

Doc.  Class
1     Sports
1     Politics
2     Science
2     Politics
3     Sports
4     Science
4     Environment

Test:

Doc.  Certainty distribution
X     (Yw, Yx, Yy, Yz) = (?, ?, ?, ?)

(A code sketch of the thresholding step follows the notes below.)
- Difficulty in selecting a threshold
- Assumes that all labels are independent
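A minimal sketch of the RT prediction step. The scores below are invented for illustration; in practice they would come from a classifier's certainty (probability) distribution over the labels.

```python
def ranking_threshold(scores, t):
    """Predict every label whose certainty reaches the threshold t."""
    return {label for label, score in scores.items() if score >= t}

scores = {"Sports": 0.45, "Environment": 0.05, "Science": 0.10, "Politics": 0.40}

print(ranking_threshold(scores, t=0.25))  # {'Sports', 'Politics'}
print(ranking_threshold(scores, t=0.50))  # set() -- a poorly chosen threshold predicts
                                          # no labels at all: the selection difficulty noted above
```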
Algorithm Adaptation Methods
We have seen the three main "Problem Transformation" methods. There are also Algorithm Adaptation methods, for example:
- Modifying the entropy of J48
- Multiple actions for Association Rules
- AdaBoost.MH, AdaBoost.MR
- Modifications to SMO, kNN, ...
Note, though, that most algorithm adaptation methods just use a problem transformation method internally: Association Rules use LC, AdaBoost.MH uses an "AdaBoost Transformation" (AT), and AdaBoost.MR uses RT.

...what about hierarchy?
Hierarchical Classification
Hierarchical classification includes some method of recognising relationships between labels. For text data, we assume a tree-structured topic hierarchy, known as a taxonomy. There are two approaches to hierarchical classification:
- Global Hierarchical (a.k.a. the “big bang” approach)
- Local Hierarchical (a.k.a. the "top down" approach)
Global Hierarchical
[Diagram: the example taxonomy flattened to full-path labels under a single root: Americas.US, Americas.Canada, MidEast.Iraq, MidEast.Iran, Sports.Soccer, Sports.Rugby, Sci/Tech]
+ Improvements in accuracy
− Difficult to maintain; can get very computationally complex

E.g.:
- Stacking (e.g. on BC)
- EM (e.g. on LC)
- Boosting (e.g. with AT)
- Association Rules
- Predictive Clustering Trees (multi-label tree learners)
Local Hierarchical
[Diagram: the example taxonomy as a tree: root → Americas (US, Canada), Mid.East (Iraq, Iran), Sports (Soccer, Rugby), Sci/Tech]
+ Divides up the problem: easy to maintain; intuitive
− Error propagation; accuracy similar to flat PT

E.g. (a code sketch of top-down routing follows this list):
- Pachinko Machine, e.g. Fuzzy Relational Thesauri (FRT)
- Probabilistic
- Hybrid: ECOC, error recovery, can return to higher nodes
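A minimal sketch of local ("top-down") hierarchical classification over the taxonomy shown above, assuming one classifier per internal node; TAXONOMY, top_down and toy_node_classifier are illustrative names, not from the slides.

```python
TAXONOMY = {
    "root":     ["Americas", "Mid.East", "Sports", "Sci/Tech"],
    "Americas": ["US", "Canada"],
    "Mid.East": ["Iraq", "Iran"],
    "Sports":   ["Soccer", "Rugby"],
}

def top_down(doc, node, classify_at):
    """Route doc from `node` towards the leaves; classify_at(doc, node, children)
    returns the child nodes the document is sent to (possibly several)."""
    children = TAXONOMY.get(node)
    if not children:                       # reached a leaf label
        return {node}
    predicted = set()
    for child in classify_at(doc, node, children):
        predicted |= top_down(doc, child, classify_at)
    return predicted

def toy_node_classifier(doc, node, children):
    """Stand-in for a real per-node classifier: pick children whose name, or one
    of whose own children's names, appears in the document text."""
    hits = [c for c in children
            if c.lower() in doc.lower()
            or any(g.lower() in doc.lower() for g in TAXONOMY.get(c, []))]
    return hits or children[:1]            # always descend somewhere

print(top_down("Rugby match report from Canada", "root", toy_node_classifier))
# {'Canada', 'Rugby'} -- a wrong decision near the root would propagate down
# to the leaves, which is the error-propagation drawback noted above.
```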
Multi-label Datasets
Key    |D|     |L|   UC(D, L)  LC(D, L)  Hier.  Seq.  Text
YEAST  2,417   14    198       4.24      N      N     N
MEDC   978     45    94        1.25      N      N     Y
20NG   19,300  20    55        1.03      Y      Y     Y
ENRN   1,702   53    753       3.38      Y      Y     Y
MARX   3,617   101   208       1.13      Y      Y     Y
REUT   6,000   103   811       1.46      Y      N     Y

Where each document d_i has label set S_i ⊆ L:
- |D| = number of documents
- |L| = number of possible labels
- UC(D, L) = |\{S \subseteq L : \exists d \in D, L(d) = S\}|, the number of unique label combinations
- LC(D, L) = \frac{1}{|D|} \sum_{i=1}^{|D|} |S_i|, the average label cardinality (both statistics are sketched in code below)
- Hier. = hierarchical structure defined within dataset
- Seq. = time-ordered data
- Text = text dataset
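A minimal sketch of the two dataset statistics defined above; the function names are illustrative, not from the slides.

```python
def unique_combinations(label_sets):
    """UC(D, L): number of distinct label subsets occurring in the data."""
    return len({frozenset(s) for s in label_sets})

def label_cardinality(label_sets):
    """LC(D, L): average number of labels per document, (1/|D|) * sum |S_i|."""
    return sum(len(s) for s in label_sets) / len(label_sets)

# The toy dataset from the earlier slides:
S = [{"Sports", "Politics"}, {"Science", "Politics"}, {"Sports"}, {"Environment", "Science"}]
print(unique_combinations(S))  # 4
print(label_cardinality(S))    # 1.75
```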
Multi-label Evaluation
- Percentage of correctly classified instances? – Too harsh
- Percentage of correctly classified labels? – Too easy
Let C be a multi-label classifier and, for each document x_i, let S_i ⊆ L be its true label set and Y_i = C(x_i) the predicted label set:

Accuracy(C, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|S_i \cap Y_i|}{|S_i \cup Y_i|}    (1)

(Eq. (1) is sketched in code after the questions below.)

Hierarchical Evaluation:
- Should we give partial credit?
- If so, how?
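A minimal sketch of the flat multi-label accuracy in Eq. (1) above: the Jaccard overlap of the true set S_i and the predicted set Y_i, averaged over the documents.

```python
def multilabel_accuracy(true_sets, pred_sets):
    """Average Jaccard overlap |S ∩ Y| / |S ∪ Y| over all documents."""
    total = 0.0
    for S, Y in zip(true_sets, pred_sets):
        union = S | Y
        total += len(S & Y) / len(union) if union else 1.0  # both empty counts as correct
    return total / len(true_sets)

true = [{"Sports", "Politics"}, {"Science"}]
pred = [{"Sports"},             {"Science", "Environment"}]
print(multilabel_accuracy(true, pred))  # (1/2 + 1/2) / 2 = 0.5
```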
Algorithms
Multi-class algorithms commonly used in prior multi-label work:

Key   Type      Description
NB    Bayes     Naïve Bayes
BAG   Meta      Bagging (with J48)
SMO   Function  Support Vector Machines
J48   Tree      J48
IBk   kNN       k Nearest Neighbor
NN    Neural    Neural Networks

Pilot experiments showed that:
- Default NN too slow
- IBk does not perform well with sparse data
Experiments — Tables
Flat vs Global Hierarchical vs Local Hierarchical
1. Problem Transformation

       LC                              BC                              RT
Key    NB     BAG    SMO    J48     NB     BAG    SMO    J48     NB     BAG    SMO    J48
MEDI   68.05  71.77* 71.10* 72.13*  55.82  75.58* 73.59* 65.83   67.81  64.20  65.72  60.22
20NG   57.47* 57.58* 57.35* 52.74   32.33  -      47.67  41.09   56.05* 47.19  54.61* 50.55
ENRN   32.72* 25.42  -      22.96   21.82  31.35* 30.56* 26.26   15.16  30.25* 24.09  27.82
MARX   48.15* 48.93* 43.26  44.79   32.6   31.69  38.64  33.95   48.44* 36.07  40.46  38.71
REUT   43.76  51.47  -      41.68   18.21  44.09  56.23* 43.83   37.13  45.9   58.65* 45.31

2. Global Hierarchical

       LC-EM                           BC-Stack(RT-NB)                 AT
Key    NB     BAG    SMO    J48     NB     BAG    SMO    J48     BAG    J48
MEDI   67.45  74.71* 70.75  72.31   56.09  70.76  73.65* 65.85   67.06  67.82
20NG   57.48* 57.58* 57.45* 53.39   29.8   -      49.06  40.88   -      -
ENRN   34.6*  25.46  -      23.31   20.66  31.79  27.01  25.35   -      -
MARX   48.18  50.64* 43.29  44.82   39.09  32.08  38.87  34.25   -      -
REUT   43.77  51.49* -      41.69   19.78  43.83  57.32* 43.68   -      -

3. Local Hierarchical

       LC                              BC                              RT
Key    NB     BAG    SMO    J48     NB     BAG    SMO    J48     NB     BAG    SMO    J48
20NG   56.49  58.31* 58.83* 53.48   43.68  -      52.44  42.03   54.87  40.58  53.37  49.26
ENRN   25.96  29.38  27.73  25.23   15.3   34.99* -      26.26   4.67   25.51  23.59  27.63
MARX   48.49  54.57* 42.4   46.84   41.69  38.67  40.34  38.65   46.44  33.59  38.32  41.23
Experiments — 20NG — Accuracy
[Plot: accuracy (20-100) vs. labeled (training) examples, 10 to 100,000 (log scale); curves: LH.LC-SMO, LH.BC-SMO, LH.RT-NB, GH.LC_EM-NB, GH.BC_STACK-SMO, GH.AT_J48]
Experiments — 20NG — Build Time
[Plot: build time vs. labeled (training) examples, 10 to 100,000 (log scale); same curves as above]
Experiments — ENRN — Accuracy
[Plot: accuracy (20-100) vs. labeled (training) examples, 10 to 10,000 (log scale); same curves as above]
Experiments — ENRN — Build Time
[Plot: build time vs. labeled (training) examples, 10 to 10,000 (log scale); same curves as above]
Experiments — MARX — Accuracy
[Plot: accuracy (20-100) vs. labeled (training) examples, 10 to 10,000 (log scale); same curves as above]
Experiments — MARX — Build Time
[Plot: build time vs. labeled (training) examples, 10 to 10,000 (log scale); same curves as above]
Conclusions
Problem Transformation methods:
- No problem transformation method is best on all datasets
- BC and RT might do better with a better selected |S|
- Complexity determined by D, L, LC(D, L) and UC(D, L)
Multi-class algorithms:
- J48 not that great
- BC doesn't go well with Naïve Bayes, RT does, and LC works equally well with either

Hierarchical:
- Global PT-extensions improve on flat
- In practice there is overhead in building local hierarchical classifiers, but in theory they are more flexible
Applications
- Bookmarks (Web browser, del.icio.us, Google Bookmarks)
- Folksonomies (Wikipedia, CiteULike)
- Websites
- Other (Medical Text Classification, etc.)
These all tend to be (or could be):
- Multi-label
- Hierarchical
- Supervised
- On-line
- Time-evolving
Future Work
This is only the beginning. Other things to look into:
- Modelling multi-label hierarchical on-line data
- Topic/“burst” detection (N.B. - not topic creation)
- Minimising human interaction
- Active learning?
- Incremental algorithms
- Adaptive learning