On-line Hierarchical Multi-label Text Classification, Jesse Read


  1. On-line Hierarchical Multi-label Text Classification
     Jesse Read, September 7, 2007

  2. The Problem
     Learning to automatically classify text documents, e.g.:
     • Emails
     • News Articles, Current Events (websites, RSS feeds)
     • “Folksonomies” (Wikipedia, CiteULike)
     • Bookmarks (Web browser, del.icio.us, Google Bookmarks)
     • Other (e.g. File System, Medical Text Classification)
     Each of these examples is (or could be):
     • Text
     • Multi-label
     • Organised in a Hierarchy
     • On-line / Streamed (not Batch Learning)
     • Affected by Human Interaction

  3. Multi-label Classification
     Given a label set L = { Sports, Environment, Science, Politics }:
     “Single-label” (Multi-class) Classification: for a text document d, the task is to select a label l ∈ L.
     Multi-label Classification: for a text document d, select a label subset S ⊆ L.
     E.g.:
     Example       Labels (S ⊆ L)
     Document 1    { Sports, Politics }
     Document 2    { Science, Politics }
     Document 3    { Sports }
     Document 4    { Environment, Science }

  4. Multi-label Classification
     Done by transforming a multi-label problem into a single-label problem, i.e. with a Problem Transformation method:
     1. (LC) Label Combination Method
     2. (BC) Binary Classifiers Method
     3. (RT) Ranking Threshold Method
     Then employ a standard single-label algorithm on the resulting data, e.g. Naive Bayes, C4.5, Bagging with C4.5, Support Vector Machines, k Nearest Neighbour, Neural Networks, AdaBoostM1.
     Then transform the result back to a multi-label representation.

  5. 1. Label Combination Method (LC)
     Each combination of labels becomes a single label. A single-label classifier C learns to classify from the resulting combinations. One decision per document.
     E.g.: (C) Document X belongs to either Sports+Politics, Science+Politics, Sports, or Science+Environment.
     • May generate many unique combinations for few documents
     • What if a document about Sports and Science turns up? (That combination was never seen in training, so it cannot be predicted.)
     • Can run very slowly if the number of unique combinations grows large
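The LC transformation above can be sketched in a few lines. This is a minimal illustration using the four example documents from slide 3; the function names `lc_transform` and `lc_inverse` and the `+` separator are assumptions made for this sketch, not part of the original presentation.

```python
# Sketch of the Label Combination (LC) transformation on the slide-3 toy data.
# lc_transform / lc_inverse are hypothetical helper names for illustration.

def lc_transform(label_sets):
    """Map each label subset to one atomic class name, e.g. 'Politics+Sports'."""
    return ["+".join(sorted(s)) for s in label_sets]

def lc_inverse(combined):
    """Map an atomic class name back to the label subset it stands for."""
    return set(combined.split("+")) if combined else set()

label_sets = [
    {"Sports", "Politics"},      # Document 1
    {"Science", "Politics"},     # Document 2
    {"Sports"},                  # Document 3
    {"Environment", "Science"},  # Document 4
]

# These strings are the targets a single-label classifier would be trained on.
single_labels = lc_transform(label_sets)
print(single_labels)

# A Sports+Science document has no matching class among the training targets.
print("Science+Sports" in single_labels)
```

Because the classifier can only output classes it has seen, the unseen Sports+Science combination illustrates the second drawback listed on the slide.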

  6. 2. Binary Classifiers Method (BC)
     A single-label (binary) classifier is created for each possible label. Multiple decisions per document.
     E.g.: four classifiers C1 ... C4, one for each label. Document X:
     (C1) belongs to Sports? YES/NO
     (C2) belongs to Environment? YES/NO
     (C3) belongs to Science? YES/NO
     (C4) belongs to Politics? YES/NO
     • Slow: needs as many classifiers as labels
     • Assumes that all labels are independent
     • Often far too many labels are selected
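A minimal sketch of the BC transformation, again on the slide-3 toy data. The helper names `bc_transform` and `bc_inverse` are assumptions for illustration; each list of YES/NO targets would be given to its own binary classifier.

```python
# Sketch of the Binary Classifiers (BC) transformation on the slide-3 toy data.
# bc_transform / bc_inverse are hypothetical helper names for illustration.

LABELS = ["Sports", "Environment", "Science", "Politics"]

def bc_transform(label_sets, labels):
    """Return one list of YES/NO (True/False) targets per label."""
    return {label: [label in s for s in label_sets] for label in labels}

def bc_inverse(decisions):
    """Combine per-label YES/NO decisions for one document into a label subset."""
    return {label for label, yes in decisions.items() if yes}

label_sets = [
    {"Sports", "Politics"},      # Document 1
    {"Science", "Politics"},     # Document 2
    {"Sports"},                  # Document 3
    {"Environment", "Science"},  # Document 4
]

targets = bc_transform(label_sets, LABELS)
for label in LABELS:
    print(label, targets[label])

# Four independent decisions for a new document are merged back into a subset.
print(bc_inverse({"Sports": True, "Environment": False,
                  "Science": True, "Politics": False}))
```

Each label's targets are built without reference to the others, which is where the independence assumption on the slide comes from.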

  7. 3. Ranking Threshold Method (RT)
     A single-label classifier C outputs a ranking of its confidence for each label.
     E.g.: Document X:
     (C) is 95.5% likely to belong to Science
     (C) is 81.2% likely to belong to Environment
     (C) is 60.9% likely to belong to Sports
     (C) is 21.3% likely to belong to Politics
     With e.g. Threshold = 80.0%, Science and Environment are selected.
     • Not all single-label classifiers can output their “confidence”
     • Assumes that all labels are independent
     • Difficulty in selecting a good threshold
     • Often the threshold encloses far too many labels
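The thresholding step of RT can be sketched as follows, using the confidences and the 80% threshold from the slide. The function name `rt_select` is an assumption for illustration, and the confidences are taken as given rather than produced by a real classifier.

```python
# Sketch of the Ranking Threshold (RT) selection step.
# rt_select is a hypothetical helper name; confidences come from the slide.

def rt_select(confidences, threshold):
    """Rank labels by confidence and keep those at or above the threshold."""
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, score in ranked if score >= threshold]

confidences = {
    "Science": 0.955,
    "Environment": 0.812,
    "Sports": 0.609,
    "Politics": 0.213,
}

selected = rt_select(confidences, threshold=0.80)
print(selected)

# Lowering the threshold slightly already pulls in a third label.
print(rt_select(confidences, threshold=0.60))
```

The second call shows how sensitive the selected subset is to the threshold, which is the difficulty the slide points out.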

  8. Hierarchical Classification (Option 1 - Global)
     Uses one Problem Transformation method and one single-label classifier. Information about the hierarchy is incorporated into the process.
     [Figure: flattened hierarchy. root → Americas.US, Americas.Canada, MidEast.Iraq, MidEast.Iran, Sports.Soccer, Sports.Rugby, Sci/Tech]
     + Higher accuracy
     − Can run very slowly and use up a lot of memory
     − Difficult to maintain; inflexible

  9. Hierarchical Classification (Option 2 - Local)
     Each internal node has its own Problem Transformation method.
     [Figure: tree hierarchy. root → Americas (US, Canada), Mid.East (Iraq, Iran), Sci/Tech, Sports (Soccer, Rugby)]
     + Divides up the problem: easy to maintain; efficient; intuitive
     − Error propagation; accuracy unimpressive
     − Overhead involved in setting up the hierarchical structure
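The local approach can be sketched as a top-down walk in which each internal node decides which of its children apply. Everything below other than the hierarchy itself is an assumption for illustration: the keyword table, the keyword-matching "classifier" standing in for a trained per-node model, and the ` > ` path separator.

```python
# Sketch of local (per-node) hierarchical classification.
# The keyword-based node classifiers are hypothetical stand-ins for trained models.

HIERARCHY = {
    "root": ["Americas", "Mid.East", "Sci/Tech", "Sports"],
    "Americas": ["US", "Canada"],
    "Mid.East": ["Iraq", "Iran"],
    "Sports": ["Soccer", "Rugby"],
}

# Hypothetical keywords that make a node fire; not taken from the slides.
KEYWORDS = {
    "Americas": {"washington", "ottawa"}, "US": {"washington"}, "Canada": {"ottawa"},
    "Mid.East": {"baghdad", "tehran"}, "Iraq": {"baghdad"}, "Iran": {"tehran"},
    "Sci/Tech": {"research"},
    "Sports": {"soccer", "rugby"}, "Soccer": {"soccer"}, "Rugby": {"rugby"},
}

def make_node_classifier(children):
    """Build a multi-label decision function over one node's children."""
    def classify(doc):
        words = set(doc.lower().split())
        return [c for c in children if KEYWORDS[c] & words]
    return classify

def classify_local(doc, node_classifiers, node="root", path=()):
    """Descend the tree; only branches chosen by a parent are ever visited."""
    if node not in HIERARCHY:  # leaf reached: emit the full path
        return {" > ".join(path)}
    results = set()
    for child in node_classifiers[node](doc):
        results |= classify_local(doc, node_classifiers, child, path + (child,))
    return results

node_classifiers = {n: make_node_classifier(ch) for n, ch in HIERARCHY.items()}

doc = "Ottawa hosts rugby research summit"
print(sorted(classify_local(doc, node_classifiers)))
```

If the root classifier misses a branch, none of its descendants can be reached, which is the error propagation noted on the slide.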

  10. Experiments - 20Newsgroups - Accuracy
      [Figure: accuracy (axis 0 to 100) against “% Labeled (Training) Examples” (ticks 10 to 100000) for LH.LC-SMO, LH.BC-SMO, LH.RT-NB, GH.LC_EM-NB, GH.BC_STACK-SMO and GH.AT_J48]

  11. Experiments - 20Newsgroups - Build Time
      [Figure: build time (axis 0 to 12000) against “% Labeled (Training) Examples” (ticks 10 to 100000) for the same six methods]

  12. Experiments - Enron - Accuracy
      [Figure: accuracy (axis 0 to 100) against “% Labeled (Training) Examples” (ticks 10 to 10000) for the same six methods]

  13. Experiments - Enron - Build Time
      [Figure: build time (axis 0 to 4500) against “% Labeled (Training) Examples” (ticks 10 to 10000) for the same six methods]

  14. Experiments - NewsArticles - Accuracy
      [Figure: accuracy (axis 0 to 100) against “% Labeled (Training) Examples” (ticks 10 to 10000) for the same six methods]

  15. Experiments - NewsArticles - Build Time
      [Figure: build time (axis 0 to 1400) against “% Labeled (Training) Examples” (ticks 10 to 10000) for the same six methods]

  16. Initial Conclusions
      Performance is poor.
      • All Problem Transformation methods have significant disadvantages
      • Multi-label data is more complex than single-label data
      • Multi-label text datasets can be very different; no method is best for all
      • On-line data is invariably susceptible to “Concept Drift”
      • ... but it is very costly to build / rebuild classifiers

  17. Current Work
      • Analysis and modelling of on-line hierarchical multi-label text data
      • Analysing the performance/flaws of Problem Transformation methods
      • Investigating adaptive and incremental learning methods

  18. “Multi-label-ness”: Documents per Label
      • 80/20 rule: typically, most labels are not used very often.

  19. “Multi-label-ness”: Labels per Document
      • Most documents have only a few labels.
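Both “multi-label-ness” measures are simple counts. The sketch below computes them on the four toy documents from slide 3, not on the datasets used in the experiments; the name "label cardinality" for the mean number of labels per document is a common term in the multi-label literature and is an assumption here, not taken from the slides.

```python
# Sketch: documents per label and labels per document on the slide-3 toy data.
from collections import Counter

label_sets = [
    {"Sports", "Politics"},      # Document 1
    {"Science", "Politics"},     # Document 2
    {"Sports"},                  # Document 3
    {"Environment", "Science"},  # Document 4
]

# Documents per label: how often each label is used.
docs_per_label = Counter(label for s in label_sets for label in s)

# Labels per document, and their mean ("label cardinality", an assumed term).
labels_per_doc = [len(s) for s in label_sets]
label_cardinality = sum(labels_per_doc) / len(labels_per_doc)

print(docs_per_label.most_common())
print(labels_per_doc, label_cardinality)
```

On a real dataset, sorting `docs_per_label` by count would show the skew described on slide 18.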

  20. On-line data: Creation of Labels Over Time
      • Most labels are used for the first time (created) very early on.

  21. On-line data: Label Combinations Over Time
      • New label combinations continue to appear for some time.
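The two observations about labels and label combinations over time can be measured with running counts over a stream. The sketch below is illustrative only: the function name `cumulative_novelty` is hypothetical, and the stream is the slide-3 toy data plus one invented fifth document, not the data behind the slides.

```python
# Sketch: count distinct labels and distinct label combinations seen so far.
# cumulative_novelty is a hypothetical helper name for illustration.

def cumulative_novelty(stream):
    """Return running counts of distinct labels and distinct combinations."""
    seen_labels, seen_combos = set(), set()
    label_counts, combo_counts = [], []
    for labels in stream:
        seen_labels |= set(labels)
        seen_combos.add(frozenset(labels))
        label_counts.append(len(seen_labels))
        combo_counts.append(len(seen_combos))
    return label_counts, combo_counts

stream = [
    {"Sports", "Politics"},
    {"Science", "Politics"},
    {"Sports"},
    {"Environment", "Science"},
    {"Sports", "Science"},  # invented document: old labels, new combination
]

label_counts, combo_counts = cumulative_novelty(stream)
print(label_counts)
print(combo_counts)
```

In this toy stream the label count levels off once all four labels have appeared, while the combination count keeps growing.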

  22. On-line data: Label Activity Over Time
      • Labels occur and reoccur in “bursts”
      • → Topic/“burst” detection

  23. On-line data: Label Activity Over Time
      • Labels often co-occur in bursts.
      • Labels may be unused for periods of time.

  24. Other Things I Found
      • Some labels are particularly troublesome
      • Some label combinations are particularly troublesome
      • Some Problem Transformation methods do better or worse depending on variations of:
        – The length and type of text documents
        – The number of training examples seen
        – The number of possible labels to choose from
        – The number of unique combinations of those labels
        – Etc.

  25. Future Work
      • Continue analysis
      • Improve Problem Transformation methods
      • Design a novel hierarchical multi-label classification framework for on-line text data streams, able to adapt to and learn from human interference (manual labelling)

  26. Questions? Comments?
