FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia

FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia. Oliver Ferschke, Iryna Gurevych and Marc Rittberger. CLEF 2012 Labs and Workshop, Notebook Papers. Rome, Italy, September 17–20, 2012.


  1. FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia. Oliver Ferschke, Iryna Gurevych and Marc Rittberger. CLEF 2012 Labs and Workshop, Notebook Papers. Rome, Italy, September 17–20, 2012.

  2. Introduction: Oliver Ferschke, Iryna Gurevych, Marc Rittberger. 19.09.2012 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Oliver Ferschke

  3. FlawFinder
     • Task-based system with multiple processing pipelines.
     • Pipeline stages (from the architecture diagram): Reader, Linguistic Preprocessing, Feature Extraction, Training / Classification, Writer.
     • Other diagram labels: Page IDs, JWPL, Datastore / Results.

  4. Data Import
     • Document retrieval via the Java Wikipedia Library (JWPL) and the Wikipedia Revision Toolkit:
       • article text
       • revision history
       • revision metadata (authors, edit comments, timestamps)
       • links (in/out, internal/external)
     • JWPL database based on the Wikipedia data dump from January 4th, 2012.
     • http://jwpl.googlecode.com

  5. Preprocessing
     • UIMA-based NLP components for preprocessing from the Darmstadt Knowledge Processing Repository.
     • Linguistic preprocessing components (from the diagram): Tokenizer, Named Entity Recognizer, Sentence Splitter, Stopword Filter, Wikitext Parser.
     • http://dkpro-core.googlecode.com

  6. Features
     • Feature categories:
       • NGram features
       • Structural features
       • Reference features
       • Network features
       • Named entity features
       • Revision-based features
       • Other features
     • 32 feature types in 7 categories
     • ClearTK framework
     • "Plug and play" feature extractors
     • Independent from the utilized ML toolkit
     • Information Gain approach for feature selection
     • Unsupervised discretization of numeric features
     • http://cleartk.googlecode.com
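The slide names Information Gain for feature selection and unsupervised discretization of numeric features, but does not say how either is computed. The following is a minimal sketch of both ideas, not the FlawFinder implementation: the function names are hypothetical, and equal-width binning is an assumed stand-in for whatever discretization method the system actually used.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the entropy left after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def equal_width_bins(values, n_bins=3):
    """Assumed discretization: map each number to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

Numeric features such as a reference count would first be binned with `equal_width_bins`, and the resulting bin indices could then be ranked against the flaw labels with `information_gain`.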

  7. Classification Approach
     • Binary classification
       • Naive Bayes
       • AdaBoost with depth-limited C4.5 decision trees as weak classifiers
     • Negative instances
       • Random sample of untagged articles
     • Evaluation
       • 10-fold cross-validation on 1000 documents
       • Stable sampling of negative instances within one evaluation run
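The evaluation setup on this slide (a random but stable sample of untagged articles, then 10-fold cross-validation) can be sketched roughly as follows. This is an illustration under assumptions: the helper names and the fixed seed are hypothetical, and the slide does not state how the 1000 documents divide into positive and negative instances.

```python
import random


def sample_negatives(untagged_ids, n, seed=42):
    """Draw n untagged articles; the fixed seed keeps the sample stable within a run."""
    rng = random.Random(seed)
    return rng.sample(sorted(untagged_ids), n)


def k_folds(items, k=10, seed=42):
    """Shuffle once, then yield (train, test) splits so each item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


if __name__ == "__main__":
    # Hypothetical page IDs; an even positive/negative split is assumed here.
    positives = [("pos", i) for i in range(500)]
    negatives = [("neg", i) for i in sample_negatives(range(10_000), 500)]
    for train, test in k_folds(positives + negatives):
        print(len(train), len(test))
```

Seeding the sampler is one simple way to keep the negative set identical across all folds and classifiers compared in the same run.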

  8. Parameter Optimization
     • The overall system is a "pipeline of pipelines".
     • Individual pipelines can be parameterized.
     • Parameter optimization:
       • Find the best parameter setting across all pipelines
       • Report on performance for pipeline configurations
     • DKPro Lab:
       • Task-based processing
       • Parameter injection
       • Global configuration
       • Report probes gather statistics for a global report
     • http://dkpro-lab.googlecode.com
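The idea of sweeping a parameter space across pipelines and reporting on each configuration can be illustrated with a plain grid search. This sketch does not use or imitate the DKPro Lab API; the function names, the parameter names, and the scoring function are all hypothetical.

```python
from itertools import product


def parameter_grid(space):
    """Yield one dict per combination of the values listed for each parameter."""
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def best_configuration(space, evaluate):
    """Score every combination; return the best config, its score, and the full report."""
    report = [(config, evaluate(config)) for config in parameter_grid(space)]
    best_config, best_score = max(report, key=lambda row: row[1])
    return best_config, best_score, report


if __name__ == "__main__":
    # Hypothetical parameters spanning two pipelines (feature extraction, classification).
    space = {"classifier": ["naive_bayes", "adaboost"], "top_k_features": [100, 500]}

    def evaluate(config):
        # Placeholder score; a real run would train and cross-validate here.
        return len(config["classifier"]) + config["top_k_features"] / 1000

    config, score, report = best_configuration(space, evaluate)
    print(config, score, len(report))
```

The `report` list plays the role of the global report: it records the score of every configuration, not only the winner.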

  9. Error Analysis and Evaluation
     Common error sources:
     • Outdated labels (positive instances)
     • Missing labels (negative instances)
     • Unclear label definitions; reference flaws in particular are often confused
     • Section-scope and article-scope flaws mixed

  10. Conclusions & Outlook
     • Use the article revision in which the tag was first inserted
       • Solves the outdated label problem
     • Use the revision history for identifying negative instances
       • Solves the missing label problem
     • Separate treatment of section- and article-scope templates
     • Real-world application: multi-flaw classification
       • Problems with overlaps in flaw definitions
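The two revision-based proposals above can be sketched as a scan over an article's history. This is a simplified illustration, not the authors' method: the revision layout (dicts with `timestamp` and `text`) is hypothetical, and the template match is a plain substring test that ignores template aliases and redirects.

```python
def first_tagged_revision(revisions, template_name):
    """Return the earliest revision whose wikitext contains the template, or None."""
    marker = "{{" + template_name.lower()
    for revision in sorted(revisions, key=lambda r: r["timestamp"]):
        if marker in revision["text"].lower():
            return revision
    return None


def never_tagged(revisions, template_name):
    """True if no revision ever carried the template: a candidate negative instance."""
    return first_tagged_revision(revisions, template_name) is None


if __name__ == "__main__":
    # Hypothetical history; ISO timestamps sort correctly as strings.
    history = [
        {"timestamp": "2011-06-01T10:00:00", "text": "{{Unreferenced}} Some text."},
        {"timestamp": "2010-03-12T08:30:00", "text": "Some text."},
        {"timestamp": "2011-09-20T17:45:00", "text": "Some text with a source."},
    ]
    print(first_tagged_revision(history, "Unreferenced")["timestamp"])
    print(never_tagged(history, "Advert"))
```

Training on the revision returned by `first_tagged_revision` captures the article as it looked when the flaw was flagged, while `never_tagged` filters out articles whose tag was merely removed later.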

  11. Thank you for your attention! Ubiquitous Knowledge Processing Lab, http://www.ukp.tu-darmstadt.de

  13. Features: NGram features
     • Token unigrams, bigrams, trigrams
     • Extracted from article text without markup
     • Minimum frequency (5)
     • Stopword filtered
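A rough sketch of how such n-gram features could be collected is shown below. Several details are assumptions not stated on the slide: the tiny stopword list, whitespace tokenization, filtering stopwords before building n-grams, and applying the minimum frequency to total corpus counts.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "is", "in", "and"}  # tiny illustrative list


def ngrams(tokens, n):
    """All contiguous n-token tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_vocabulary(documents, max_n=3, min_frequency=5):
    """Count stopword-filtered 1- to max_n-grams over the corpus; keep the frequent ones."""
    counts = Counter()
    for text in documents:
        # Assumes the text is already free of wiki markup.
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return {gram for gram, count in counts.items() if count >= min_frequency}
```

Each n-gram in the returned vocabulary would then become one binary or count feature per article.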

  14. Features: Structural features
     • Empty sections
     • Number of sections
     • Mean section length
     • Markup to text ratio
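The structural features listed here could be approximated from raw wikitext along the following lines. The slide does not define them precisely, so this sketch makes its own assumptions: sections are delimited by `== Heading ==` lines, length is measured in characters, a section containing only subsections counts as empty, and the markup ratio is the number of characters removed by markup stripping divided by the plain-text length.

```python
import re

# A wikitext heading line such as "== History ==" or "=== Early life ===".
HEADING = re.compile(r"^=+[^=\n]+=+[ \t]*$", re.MULTILINE)


def structural_features(wikitext, plain_text):
    """Section counts and lengths from headings, plus markup volume relative to plain text."""
    bodies = HEADING.split(wikitext)[1:]  # text after each heading; lead section dropped
    lengths = [len(body.strip()) for body in bodies]
    return {
        "num_sections": len(lengths),
        "num_empty_sections": sum(1 for n in lengths if n == 0),
        "mean_section_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "markup_to_text_ratio": (len(wikitext) - len(plain_text)) / max(len(plain_text), 1),
    }
```

The `plain_text` argument is expected to come from a separate markup-stripping step, such as the Wikitext Parser named in the preprocessing slide.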

  15. Features: Reference features
     • Number of references
     • Reference lists
     • Reference to text ratio
     • References per sentence

  16. Features: Network features
     • External links
     • Inlinks
     • Outlinks

  17. Features: Named entity features
     • NER types: Organization, Person, Location
     • Absolute numbers and NER to text ratio

  18. Features: Revision-based features
     • Number of revisions
     • Number of unique contributors
     • Number of registered contributors
     • Article age
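As an illustration, the four revision-based features can be derived from a list of revision records. The record layout (`user`, `registered`, `timestamp`) is hypothetical and not the Wikipedia Revision Toolkit's data model, and measuring article age in days relative to a reference date is an assumption.

```python
from datetime import datetime


def revision_features(revisions, now):
    """Summarize an article's edit history from a non-empty list of revision dicts."""
    contributors = {r["user"] for r in revisions}
    registered = {r["user"] for r in revisions if r["registered"]}
    first_edit = min(r["timestamp"] for r in revisions)
    return {
        "num_revisions": len(revisions),
        "num_unique_contributors": len(contributors),
        "num_registered_contributors": len(registered),
        "article_age_days": (now - first_edit).days,
    }


if __name__ == "__main__":
    history = [
        {"user": "Alice", "registered": True, "timestamp": datetime(2011, 1, 4)},
        {"user": "192.0.2.1", "registered": False, "timestamp": datetime(2011, 5, 9)},
        {"user": "Alice", "registered": True, "timestamp": datetime(2011, 11, 30)},
    ]
    # Reference date chosen to match the January 4th, 2012 dump used for the JWPL database.
    print(revision_features(history, now=datetime(2012, 1, 4)))
```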

  19. Features: Other features
     • Number of discussions on the Talk page
     • Number of sentences, tokens and characters
