SLIDE 1

FlawFinder

A Modular System for Predicting Quality Flaws in Wikipedia

Oliver Ferschke, Iryna Gurevych and Marc Rittberger

CLEF 2012 Labs and Workshop, Notebook Papers. Rome, Italy, September 17–20, 2012

SLIDE 2

2 19.09.2012 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Oliver Ferschke |

Introduction

Oliver Ferschke, Iryna Gurevych, Marc Rittberger

SLIDE 3

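FlawFinder

Task-based system with multiple processing pipelines.

[Figure: FlawFinder architecture. Page IDs and the JWPL database feed the processing stages Reader, Linguistic Preprocessing, Feature Extraction, Training / Classification and Writer, with a Datastore / Results component. Each stage is drawn with components labelled a, b, c.]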
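
The task-based design above can be illustrated with a minimal Python sketch. It is not FlawFinder's implementation; the `Pipeline` class and the three stage functions are hypothetical stand-ins that only show how named stages can be chained so that each one enriches a document and passes it on.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A document is modelled as a plain dict that each stage enriches.
Document = Dict[str, object]
Stage = Callable[[Document], Document]


@dataclass
class Pipeline:
    """Runs a fixed sequence of named stages over a document."""
    stages: List[Tuple[str, Stage]] = field(default_factory=list)

    def add(self, name: str, stage: Stage) -> "Pipeline":
        self.stages.append((name, stage))
        return self

    def run(self, doc: Document) -> Document:
        for _name, stage in self.stages:
            doc = stage(doc)
        return doc


# Hypothetical stand-ins for the Reader, Preprocessing and Feature Extraction stages.
def reader(doc: Document) -> Document:
    # A real reader would fetch the article text for doc["page_id"] from a database.
    return {**doc, "text": doc.get("text", "")}


def preprocess(doc: Document) -> Document:
    return {**doc, "tokens": str(doc["text"]).lower().split()}


def extract_features(doc: Document) -> Document:
    return {**doc, "features": {"num_tokens": len(doc["tokens"])}}


def build_pipeline() -> Pipeline:
    return (Pipeline()
            .add("reader", reader)
            .add("preprocessing", preprocess)
            .add("feature_extraction", extract_features))
```

Calling `build_pipeline().run({"page_id": 1, "text": "..."})` returns the same dict extended with `tokens` and `features`. Swapping a stage only requires passing a different function to `add`.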

SLIDE 4

Data Import

  • Document retrieval via the Java Wikipedia Library (JWPL) and the Wikipedia Revision Toolkit
    • article text
    • revision history
    • revision metadata (authors, edit comments, timestamps)
    • links (in/out, internal/external)
  • JWPL database based on the Wikipedia data dump from January 4th, 2012.

http://jwpl.googlecode.com

SLIDE 5

Preprocessing

  • UIMA-based NLP components for preprocessing from the Darmstadt Knowledge Processing Repository

http://dkpro-core.googlecode.com

Linguistic Preprocessing components: Sentence Splitter, Tokenizer, Stopword Filter, Named Entity Recognizer, Wikitext Parser

SLIDE 6

Features

  • NGram features
  • Structural features
  • Reference features
  • Network features
  • Named entity features
  • Revision-based features
  • Other features

http://cleartk.googlecode.com

  • 32 feature types in 7 categories
  • ClearTK framework
  • "plug and play" feature extractors
  • independent of the ML toolkit used
  • Information Gain approach for feature selection
  • Unsupervised discretization of numeric features
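
The last two bullets can be made concrete with a small sketch. The slide does not say which unsupervised discretization method was used, so equal-width binning is an assumption here, and the function names are illustrative. Information gain is computed in the usual way, as label entropy minus the entropy that remains after splitting on the discretized feature.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def equal_width_bins(values: Sequence[float], num_bins: int) -> List[int]:
    """Unsupervised discretization (assumed method): equal-width bins over the value range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum value would land in bin num_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


def information_gain(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Label entropy minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder
```

Features would then be ranked by `information_gain` and only the top-scoring ones kept. How many to keep is a tunable parameter that the slide does not specify.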

SLIDE 7

Classification Approach

  • Binary classification
    • Naive Bayes
    • AdaBoost with depth-limited C4.5 decision trees as weak classifiers
  • Negative instances
    • Random sample of untagged articles
  • Evaluation
    • 10-fold cross validation on 1000 documents
    • Stable sampling of negative instances in one evaluation run
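
A minimal sketch of the evaluation setup, under two assumptions the slide does not confirm: "stable sampling" is read as drawing the negative sample once with a fixed seed so that it stays the same within a run, and the 1000 documents are split into 500 positives and 500 negatives in the usage note. The function names are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def sample_negatives(untagged_ids: Sequence[int], n: int, seed: int = 42) -> List[int]:
    """Draw n untagged articles; a fixed seed yields the same sample on every call."""
    rng = random.Random(seed)
    return rng.sample(sorted(untagged_ids), n)


def k_fold_splits(items: Sequence[int], k: int = 10,
                  seed: int = 42) -> List[Tuple[List[int], List[int]]]:
    """Shuffle once, then build k (train, test) pairs; each item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, test))
    return splits
```

With 500 positive IDs and `sample_negatives(untagged, 500)`, `k_fold_splits` yields ten splits of 900 training and 100 test documents each. Each classifier is then trained and scored once per split.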

SLIDE 8

Parameter Optimization

http://dkpro-lab.googlecode.com

  • The overall system is a "pipeline of pipelines".
  • Individual pipelines can be parameterized.

Parameter optimization:

  • Find best parameter setting across all pipelines
  • Report on performance for pipeline configurations

DKPro Lab:

  • Task-based processing
  • Parameter injection
  • Global configuration
  • Report probes gather statistics for global report
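
The idea of searching for the best setting across all pipelines can be sketched as an exhaustive sweep over a parameter space. This is an illustration in plain Python and does not use or mirror the DKPro Lab API; `parameter_grid`, `optimize` and the parameter names in the usage note are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Config = Dict[str, object]


def parameter_grid(space: Dict[str, Sequence[object]]) -> List[Config]:
    """Expand a parameter space into every combination of values."""
    names = sorted(space)
    return [dict(zip(names, combo)) for combo in product(*(space[n] for n in names))]


def optimize(space: Dict[str, Sequence[object]],
             evaluate: Callable[[Config], float]) -> Tuple[Config, List[Tuple[Config, float]]]:
    """Evaluate every configuration; return the best one plus the full report."""
    report = [(config, evaluate(config)) for config in parameter_grid(space)]
    best_config, _best_score = max(report, key=lambda entry: entry[1])
    return best_config, report
```

A space such as `{"ngram_n": [1, 2, 3], "classifier": ["naive_bayes", "adaboost"]}` expands to six configurations. `evaluate` would run the full pipeline chain for one configuration and return its score. The returned `report` plays the role of the per-configuration performance report.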


SLIDE 9

Error Analysis and Evaluation

Common error sources

  • Outdated labels (positive instances)
  • Missing labels (negative instances)
  • Unclear label definitions
    • esp. reference flaws are often confused

  • Section-scope and article-scope flaws mixed

SLIDE 10

Conclusions & Outlook

  • Use article revision in which tag was first inserted
    • Solves outdated label problem
  • Use revision history for identifying negative instances
    • Solves missing label problem
  • Separate treatment of section- and article-scope templates
  • Real world application: multi-flaw classification
    • problems with overlaps in flaw definitions
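
The first two proposals can be sketched as a scan over an article's revision history. This is a hedged illustration, not the authors' method: revisions are assumed to be `(revision_id, wikitext)` pairs in chronological order, a flaw tag is assumed to appear as a `{{Template...}}` call, and both function names are hypothetical.

```python
import re
from typing import Optional, Sequence, Tuple

Revision = Tuple[int, str]  # (revision id, wikitext), oldest first (assumption)


def first_tagged_revision(revisions: Sequence[Revision], template: str) -> Optional[int]:
    """Return the id of the earliest revision whose text contains the template."""
    pattern = re.compile(r"\{\{\s*" + re.escape(template) + r"\s*[|}]", re.IGNORECASE)
    for rev_id, text in revisions:
        if pattern.search(text):
            return rev_id
    return None


def never_tagged(revisions: Sequence[Revision], template: str) -> bool:
    """True if no revision ever carried the template: a candidate negative instance."""
    return first_tagged_revision(revisions, template) is None
```

Using the revision returned by `first_tagged_revision` as the positive instance captures the article in the state that prompted the tag. `never_tagged` filters out articles that were tagged at some point and later cleaned up.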

SLIDE 11

Thank you for your attention!

http://www.ukp.tu-darmstadt.de Ubiquitous Knowledge Processing Lab

SLIDE 13

Features: NGram features

  • Token unigrams, bigrams, trigrams
  • Extracted from article text w/o markup
  • Min. frequency (5)
  • Stopword filtered
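
A minimal sketch of this kind of n-gram extraction follows. Several details are assumptions the slide leaves open: stopwords are removed before n-grams are built, the minimum frequency is counted over the whole corpus, tokenization is a plain whitespace split, and the stopword list is a tiny illustrative one.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# Tiny illustrative stopword list (an assumption, not the list used in the system).
STOPWORDS: Set[str] = {"a", "an", "and", "in", "is", "of", "the", "to"}

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], max_n: int = 3) -> List[NGram]:
    """All token n-grams for n = 1..max_n (unigrams, bigrams, trigrams by default)."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def ngram_vocabulary(documents: Iterable[str], min_freq: int = 5) -> Dict[NGram, int]:
    """Count n-grams over stopword-filtered tokens; keep those seen at least min_freq times."""
    counts: Counter = Counter()
    for text in documents:
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        counts.update(ngrams(tokens))
    return {gram: c for gram, c in counts.items() if c >= min_freq}
```

The surviving n-grams form the feature vocabulary. Each article is then represented by which of these n-grams occur in its markup-free text.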

SLIDE 14

Features: Structural features

  • Empty sections
  • Number of sections
  • Mean section length
  • Markup to text ratio
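
These four features can be approximated directly on raw wikitext, as the sketch below does. It is a rough illustration under stated assumptions: sections are delimited by `== Heading ==` lines, section length is measured in characters, and markup is stripped with a crude regular expression. The system itself uses a Wikitext Parser for this.

```python
import re
from typing import Dict

# A heading line such as "== History ==" (any heading level).
HEADING = re.compile(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", re.MULTILINE)
# Crude markup pattern (assumption): templates, HTML-like tags, link brackets, quote marks.
MARKUP = re.compile(r"\{\{.*?\}\}|<[^>]+>|\[\[|\]\]|'{2,}", re.DOTALL)


def structural_features(wikitext: str) -> Dict[str, float]:
    """Section statistics and a rough markup-to-text ratio for one article."""
    # split() returns [lead, title1, body1, title2, body2, ...]; keep the bodies.
    bodies = HEADING.split(wikitext)[2::2]
    lengths = [len(body.strip()) for body in bodies]
    plain = MARKUP.sub("", HEADING.sub(r"\1", wikitext))
    markup_chars = len(wikitext) - len(plain)
    return {
        "num_sections": len(bodies),
        "num_empty_sections": sum(1 for n in lengths if n == 0),
        "mean_section_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "markup_to_text_ratio": markup_chars / max(len(plain), 1),
    }
```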

SLIDE 15

Features: Reference features

  • Number of references
  • Reference lists
  • Reference to text ratio
  • References per sentence

SLIDE 16

Features: Network features

  • External links
  • Inlinks
  • Outlinks

SLIDE 17

Features: Named entity features

  • NER types
    • Organization
    • Person
    • Location
  • Absolute numbers and NER to text ratio

SLIDE 18

Features: Revision-based features

  • Number of revisions
  • Number of unique contributors
  • Number of registered contributors
  • Article age
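
Given the revision metadata collected during data import (authors and timestamps), these features reduce to simple aggregates. The sketch below assumes a hypothetical `Revision` record in which anonymous edits carry no user ID, and measures article age in days relative to a snapshot date. Neither detail is specified on the slide.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Revision:
    """Hypothetical revision record; user_id is None for anonymous (unregistered) edits."""
    timestamp: datetime
    contributor: str
    user_id: Optional[int]


def revision_features(revisions: List[Revision], snapshot: datetime) -> Dict[str, int]:
    """Aggregate revision metadata of one article into four numeric features."""
    contributors = {r.contributor for r in revisions}
    registered = {r.contributor for r in revisions if r.user_id is not None}
    first_edit = min(r.timestamp for r in revisions)
    return {
        "num_revisions": len(revisions),
        "num_unique_contributors": len(contributors),
        "num_registered_contributors": len(registered),
        "article_age_days": (snapshot - first_edit).days,
    }
```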

SLIDE 19

Features: Other features

  • Number of discussions on Talk page
  • Number of sentences, tokens and characters