SLIDE 1
On-line Hierarchical Multi-label Text Classification
Jesse Read
September 7, 2007
On-line Hierarchical Multi-label Text Classification 1
SLIDE 2 The Problem
Learning to automatically classify text documents, e.g.:
- Emails
- News Articles, Current Events (websites, RSS feeds)
- “Folksonomies” (Wikipedia, CiteULike)
- Bookmarks (Web browser, del.icio.us, Google Bookmarks)
- Other (e.g. File System, Medical Text Classification)
Each of these examples is (or could be):
- Text
- Multi-label
- Organised in a Hierarchy
- On-line / Streamed (not Batch Learning)
- Affected by Human Interaction
SLIDE 3
Multi-label Classification
Given a label set L = {Sports, Environment, Science, Politics}:

“Single-label” (Multi-class) Classification: for a text document d, the task is to select a single label l ∈ L.

Multi-label Classification: for a text document d, select a label subset S ⊆ L. E.g.:

Example      Labels (S ⊆ L)
Document 1   {Sports, Politics}
Document 2   {Science, Politics}
Document 3   {Sports}
Document 4   {Environment, Science}
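The distinction above can be made concrete with a small sketch (plain Python; the 0/1 indicator encoding is a common convention for multi-label data, not something from the slides):

```python
# Label set L and Document 1's assignment from the example table.
L = ["Sports", "Environment", "Science", "Politics"]
S = {"Sports", "Politics"}  # a multi-label assignment, S ⊆ L

# Single-label (multi-class): exactly one l ∈ L per document.
single = "Sports"
assert single in L

# Multi-label: any subset of L, often encoded as a 0/1 indicator vector.
y = [1 if label in S else 0 for label in L]
assert y == [1, 0, 0, 1]
```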
SLIDE 4 Multi-label Classification
Done by transforming a multi-label problem into a single-label problem, i.e. with a Problem Transformation method:
- 1. (LC) Label Combination Method
- 2. (BC) Binary Classifiers Method
- 3. (RT) Ranking Threshold Method
Then employ a standard single-label algorithm on the resulting data, e.g. Naive Bayes, C4.5, Bagging with C4.5, Support Vector Machines, k-Nearest Neighbour, Neural Networks, AdaBoostM1. Finally, transform the result back to a multi-label representation.
SLIDE 5
- 1. Label Combination Method (LC)
Each combination of labels becomes a single label. A single-label classifier C learns to classify from the resulting combinations: one decision per document. E.g.:
(C) Document X belongs to either Sports+Politics
    or Science+Politics
    or Sports
    or Science+Environment
- May generate many unique combinations from only a few documents
- What if a document about Sports and Science turns up?
- Can run very slowly if the no. of unique combinations grows large
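A minimal sketch of the LC transformation (plain Python; the single-label classifier itself is stubbed out, only the label handling is shown):

```python
# Training label sets, as in the earlier example slide.
label_sets = [
    {"Sports", "Politics"},
    {"Science", "Politics"},
    {"Sports"},
    {"Environment", "Science"},
]

def combine(label_set):
    # Forward transform: one atomic class name per unique label set.
    return "+".join(sorted(label_set))

def split(class_name):
    # Inverse transform: decode a predicted class back into a label set.
    return set(class_name.split("+"))

classes = [combine(s) for s in label_sets]
# A single-label classifier C would now be trained on `classes`,
# and its prediction decoded with split():
assert split("Politics+Sports") == {"Sports", "Politics"}

# The drawback from the bullets: {"Sports", "Science"} never appeared
# as a combined class above, so C could never predict it.
```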
SLIDE 6
- 2. Binary Classifiers Method (BC)
Single-label (binary) classifiers are created for each possible label: multiple decisions per document. E.g. four classifiers C1 · · · C4, one for each label. Document X:
(C1) belongs to Sports? YES/NO
(C2) belongs to Environment? YES/NO
(C3) belongs to Science? YES/NO
(C4) belongs to Politics? YES/NO
- Slow: needs as many classifiers as there are labels
- Assumes that all labels are independent
- Often far too many labels are selected
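The BC transformation can be sketched as follows (plain Python; the documents and keyword "classifiers" are toy stand-ins, only the data transformation and recombination of YES/NO decisions are the point):

```python
label_space = ["Sports", "Environment", "Science", "Politics"]

# Toy training data: (document, label set) pairs.
train = [
    ("match goal referee", {"Sports"}),
    ("election vote match", {"Sports", "Politics"}),
]

# One binary training set per label: (document, label present?).
binary_sets = {
    label: [(doc, label in s) for doc, s in train]
    for label in label_space
}
assert binary_sets["Politics"] == [("match goal referee", False),
                                   ("election vote match", True)]

# At prediction time all |L| classifiers answer YES/NO independently,
# and the YES answers form the multi-label prediction.
def predict(doc, classifiers):
    return {label for label, clf in classifiers.items() if clf(doc)}

# Toy "classifiers": keyword matching stands in for trained models.
keywords = {"Sports": "match", "Environment": "forest",
            "Science": "lab", "Politics": "vote"}
toy = {label: (lambda doc, k=kw: k in doc)
       for label, kw in keywords.items()}
assert predict("vote on the match", toy) == {"Sports", "Politics"}
```

Note that each binary decision is made independently, which is exactly the label-independence assumption criticised in the bullets above.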
SLIDE 7
- 3. Ranking Threshold Method (RT)
A single-label classifier C outputs a ranking of its confidence for each label. E.g., with threshold = 80.0%, for Document X:
(C) is 95.5% likely to belong to Science
(C) is 81.2% likely to belong to Environment
(C) is 60.9% likely to belong to Sports
(C) is 21.3% likely to belong to Politics
- Not all single-label classifiers can output their “confidence”
- Assumes that all labels are independent
- Difficulty in selecting a good threshold
- Often the threshold admits far too many labels
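The RT decision rule is simply a cut on the ranked confidences (a sketch using the numbers from this slide; the confidence values would come from a real classifier):

```python
# Confidence ranking output by C for Document X (values from the slide).
confidence = {
    "Science":     0.955,
    "Environment": 0.812,
    "Sports":      0.609,
    "Politics":    0.213,
}

THRESHOLD = 0.80  # choosing this value well is itself a problem

# Keep every label whose confidence clears the threshold.
predicted = {label for label, p in confidence.items() if p >= THRESHOLD}
assert predicted == {"Science", "Environment"}
```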
SLIDE 8
Hierarchical Classification (Option 1 - Global)
Uses one Problem Transformation method and one single-label classifier. Information about the hierarchy is incorporated into the process.
[Tree diagram: root → Americas (US, Canada), Mid.East (Iraq, Iran), Sports (Soccer, Rugby), Sci/Tech]
+ Higher accuracy
− Can run very slowly and use up a lot of memory
− Difficult to maintain; inflexible
SLIDE 9
Hierarchical Classification (Option 2 - Local)
Each internal node has its own Problem Transformation method and classifier.
[Tree diagram: root → Americas (US, Canada), Mid.East (Iraq, Iran), Sports (Soccer, Rugby), Sci/Tech]
+ Divides up the problem: easy to maintain; efficient; intuitive
− Error propagation; accuracy unimpressive
− Overhead involved in setting up the hierarchical structure
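The local scheme can be sketched as recursive top-down routing from the root; the node-local classifier is stubbed out here with keyword matching (any Problem Transformation method could sit behind `route`):

```python
# Hierarchy from the slide: internal node -> children.
tree = {
    "root":     ["Americas", "Mid.East", "Sports", "Sci/Tech"],
    "Americas": ["US", "Canada"],
    "Mid.East": ["Iraq", "Iran"],
    "Sports":   ["Soccer", "Rugby"],
}

def classify(doc, node, route):
    """Descend from `node`, asking the node-local classifier `route`
    which children the document belongs under; leaves are returned.
    Note how a wrong decision at an internal node propagates to all
    leaves beneath it (the error-propagation drawback above)."""
    if node not in tree:          # leaf node
        return {node}
    labels = set()
    for child in route(doc, node):
        labels |= classify(doc, child, route)
    return labels

# Toy node-local "classifier": route by keyword match on the child name.
route = lambda doc, node: [c for c in tree[node] if c.lower() in doc]
assert classify("sports rugby news", "root", route) == {"Rugby"}
```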
SLIDE 10
Experiments — 20Newsgroups — Accuracy
[Plot: accuracy vs. labeled (training) examples; series: LH.LC-SMO, LH.BC-SMO, LH.RT-NB, GH.LC_EM-NB, GH.BC_STACK-SMO, GH.AT_J48]
SLIDE 11
Experiments — 20Newsgroups — Build Time
[Plot: build time vs. labeled (training) examples; series: LH.LC-SMO, LH.BC-SMO, LH.RT-NB, GH.LC_EM-NB, GH.BC_STACK-SMO, GH.AT_J48]
SLIDE 12
Experiments — Enron — Accuracy
[Plot: accuracy vs. labeled (training) examples; series: LH.LC-SMO, LH.BC-SMO, LH.RT-NB, GH.LC_EM-NB, GH.BC_STACK-SMO, GH.AT_J48]
SLIDE 13
Experiments — Enron — Build Time
[Plot: build time vs. labeled (training) examples; series: LH.LC-SMO, LH.BC-SMO, LH.RT-NB, GH.LC_EM-NB, GH.BC_STACK-SMO, GH.AT_J48]
SLIDE 14
Experiments — NewsArticles — Accuracy
[Plot: accuracy vs. labeled (training) examples; series: LH.LC-SMO, LH.BC-SMO, LH.RT-NB, GH.LC_EM-NB, GH.BC_STACK-SMO, GH.AT_J48]
SLIDE 15
Experiments — NewsArticles — Build Time
[Plot: build time vs. labeled (training) examples; series: LH.LC-SMO, LH.BC-SMO, LH.RT-NB, GH.LC_EM-NB, GH.BC_STACK-SMO, GH.AT_J48]
SLIDE 16 Initial Conclusions
Performance is poor.
- All Problem Transformation methods have significant disadvantages
- Multi-label data is more complex than single-label data
- Multi-label text datasets can be very different; no method is best for all
- On-line data is invariably susceptible to “Concept Drift”
- . . . but it is very costly to build/rebuild classifiers
SLIDE 17 Current Work
- Analysis and modelling of on-line hierarchical multi-label text data
- Analysing the performance/flaws of Problem Transformation methods
- Investigating adaptive and incremental learning methods
SLIDE 18 “Multi-label-ness”: Documents per Label
- 80/20 rule: typically, most labels are not used very often.
SLIDE 19 “Multi-label-ness”: Labels per Document
- Most documents have only a few labels.
SLIDE 20 On-line data: Creation of Labels Over Time
- Most labels are used for the first time (created) very early on.
SLIDE 21 On-line data: Label Combinations Over Time
- New label combinations continue to appear for some time.
SLIDE 22 On-line data∗: Label Activity Over Time
- Labels occur and reoccur in “bursts”
- → Topic/“burst” detection∗
SLIDE 23 On-line data∗: Label Activity Over Time
- Labels often co-occur in bursts
- Labels may go unused for long periods of time
SLIDE 24 Other Things I Found
- Some labels are particularly troublesome
- Some label combinations are particularly troublesome
- Some Problem Transformation methods do better or worse depending on variations in:
  – The length and type of text documents
  – The no. of training examples seen
  – The no. of possible labels they can choose from
  – The no. of unique combinations of those labels
  – Etc.
SLIDE 25 Future Work
- Continue analysis
- Improve Problem Transformation methods
- Design a novel hierarchical multi-label classification framework for on-line text data streams, able to adapt to and learn from human interaction (manual labelling)
SLIDE 26
. . . Questions? . . . Comments?