Featured Article Identification in Wikipedia - Thesis Defense

  1. Featured Article Identification in Wikipedia - Thesis Defense - Christian Fricke christian.fricke@uni-weimar.de Faculty of Media / Media Systems Bauhaus-Universität Weimar, Germany October 18, 2012

  2. Why is Wikipedia relevant? Millions of people use Wikipedia, including authors, readers, researchers, and data analysts. [Chart: growth of academic interest in Wikipedia, number of papers per year in journals and conferences, 2002-2010. Source: http://en.wikipedia.org/w/index.php?title=File:Growth_of_Academic_Interest_in_Wikipedia.svg]

  3. Wikipedia Statistics Manual quality assessment of articles is unmanageable for the ever-growing encyclopedia. Article counts in the English Wikipedia by assessment class:
  - FA-Class: 4 213 (0.11%)
  - A-Class: 991 (0.03%)
  - GA-Class: 16 508 (0.43%)
  - B-Class: 82 787 (2.15%)
  - C-Class: 129 483 (3.36%)
  - Start-Class: 881 813 (22.9%)
  - Stub-Class: 2 169 051 (56.4%)
  - FL-Class: 1 781 (0.05%)
  - List-Class: 100 812 (2.62%)
  - Unassessed: 461 818 (12.0%)
  - Total: 3 849 257 (100%)

  4. Automated Solution
  - Quality judgement of articles as an indicator for improvement
  - Most common method: binary classification of featured and non-featured articles, each represented as a vector of feature values
  - Featured: FA-Class articles; Non-featured: all other articles
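A minimal sketch of this setup, assuming each article has already been reduced to a handful of numeric feature values; the three features, the toy data, and the scikit-learn SVM are illustrative stand-ins, not the thesis implementation:

```python
# Minimal sketch of the featured / non-featured setup.
# Each article is reduced to a vector of feature values
# (here: word count, section count, image count -- illustrative only).
import numpy as np
from sklearn.svm import SVC

X = np.array([
    [5200, 14, 9],   # a featured-quality article
    [4800, 12, 6],   # a featured-quality article
    [310,   2, 0],   # a stub
    [950,   4, 1],   # a short start-class article
])
y = np.array([1, 1, 0, 0])  # 1 = featured, 0 = non-featured

clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([[2600, 8, 3]]))  # predicted class for an unseen article
```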

  5. Outline 1. Motivation 2. Quality Assessment Models 3. Feature Implementation 4. Article Classification 5. Conclusion

  6. Binary Classification Approaches
  (1) Blumenstock [WWW 2008]
  (2) Dalip et al. [JDIQ 2011]
  (3) Lipka and Stein [WWW 2010]
  (4) Stvilia et al. [IQ 2005]
  Problem: results are weakened by customized data sets and are hard to compare across approaches

  7. (1) Blumenstock [WWW 2008]
  Features: a single metric, the length (word count) of an article, as its sole representation
  Dataset: unbalanced, random; Featured: 1 554, Non-featured: 9 513
  Classifier: Multi-Layer Perceptron
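A sketch of this single-feature model, assuming plain whitespace tokenization for the word count and using scikit-learn's MLPClassifier as a stand-in for the multi-layer perceptron; the toy data and network size are made up:

```python
# Sketch of model (1): an article represented only by its word count,
# classified with a small multi-layer perceptron. Data and network size
# are illustrative, not the original experiment.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def word_count(plaintext: str) -> int:
    """The single feature: number of whitespace-separated tokens."""
    return len(plaintext.split())

# toy training data: word counts with featured (1) / non-featured (0) labels
X = np.array([[4800.0], [5600.0], [7200.0], [300.0], [900.0], [1500.0]])
y = np.array([1, 1, 1, 0, 0, 0])

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(4,), max_iter=5000, random_state=0),
)
model.fit(X, y)
print(model.predict([[2500.0]]))  # class prediction for a 2500-word article
```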

  8. (2) Dalip et al. [JDIQ 2011]
  Features: 54 features ranging from simple counts to complex graph-based metrics
  Dataset: unbalanced, random; Featured: 549, Non-featured: 2 745
  Classifier: Support Vector Machine

  9. (3) Lipka and Stein [WWW 2010]
  Features: character trigram vector, a mapping from substrings of three characters to their respective frequencies
  Dataset: balanced, domain-specific; Featured: 380, Non-featured: 380
  Classifier: Support Vector Machine
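A sketch of the trigram representation, assuming lower-cased text and normalized whitespace; the helper below is illustrative, not the original feature code:

```python
# Sketch: mapping character trigrams (substrings of three characters)
# to their frequencies; vectors of this kind feed the SVM in model (3).
from collections import Counter

def char_trigram_vector(text: str) -> Counter:
    """Count every overlapping three-character substring."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

vec = char_trigram_vector("Featured articles are well written.")
print(vec.most_common(5))
```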

  10. (4) Stvilia et al. [IQ 2005]
  Features: seven distinct metrics based on variable groupings that contain 19 features
  Dataset: unbalanced, random; Featured: 236, Non-featured: 834
  Classifier: C4.5 Decision Tree

  11. Outline 1. Motivation 2. Quality Assessment Models 3. Feature Implementation 4. Article Classification 5. Conclusion

  12. Data Preparation The January 2012 snapshot of the English Wikipedia constitutes 8 TB of text data and is processed in less than two hours using the optimized Webis Hadoop cluster. [Diagram, dump preprocessing: from the database dump, parse the XML files, extract the wikitext, and extract plaintext; import the SQL tables and extract metadata.]
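The thesis runs this preprocessing as jobs on the Webis Hadoop cluster; purely as a single-machine illustration, the sketch below streams one pages-articles dump file with the Python standard library. The file name and the export schema namespace version are assumptions:

```python
# Single-machine sketch of the "parse XML files / extract wikitext" step.
# The thesis runs this on Hadoop; here we stream one dump file locally.
import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # assumed schema version

def iter_articles(dump_path: str):
    """Yield (title, wikitext) pairs from a pages-articles dump."""
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(f"{NS}revision/{NS}text") or ""
                yield title, text
                elem.clear()  # free the page's text payload while streaming

for title, wikitext in iter_articles("enwiki-pages-articles.xml.bz2"):
    print(title, len(wikitext))
    break
```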

  13. Feature Categories Features are organized in four categories:
  - Content: length and part-of-speech rates, readability indices, trigrams, ...
  - Structure: lead rate, section distribution, counts for categories, files, images, lists, tables, and templates, ...
  - Network: link counts and PageRank, ...
  - History: age, currency, counts for edits, editors, and reverts, ...
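As an example of one content feature, the sketch below computes the Flesch Reading Ease score from an article's plaintext; the syllable counter is a rough vowel-group heuristic and is not taken from the thesis:

```python
# Example content feature: Flesch Reading Ease, computed from plaintext.
# Syllables are approximated by counting vowel groups per word.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(plaintext: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", plaintext)))
    words = plaintext.split()
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

print(flesch_reading_ease("Wikipedia is a free online encyclopedia. Anyone can edit it."))
```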

  14. Feature Computation The runtime for the computation of each feature for all articles depends on its source and complexity.
  Category    Features   Runtime   Source
  Content     35         < 1 h     plaintext
  Structure   23         < 1 h     wikitext
  Network     8          < 12 h    metadata
  History     9          < 12 h    all
  Total       75         ~ 1 d

  15. Experiment Reconstruction
  - Implemented most features in an easy-to-use framework incorporating data extraction, feature computation, dataset construction, and model definitions, in order to accurately replicate the published results
  - Employed WEKA to train and evaluate the classifiers
  - Biased dataset selections made exact reproduction difficult
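The thesis trains and evaluates the classifiers with WEKA; purely as an illustrative stand-in, the sketch below shows the same workflow in scikit-learn, with 10-fold cross-validation and per-class precision, recall, and F-measure on random placeholder data:

```python
# Stand-in for the WEKA workflow: cross-validated predictions and a
# per-class precision / recall / F-measure report. Data here is random.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 75))       # 200 articles, 75 feature values each
y = rng.integers(0, 2, size=200)     # 1 = featured, 0 = non-featured

pred = cross_val_predict(SVC(kernel="linear"), X, y, cv=10)
print(classification_report(y, pred, target_names=["non-featured", "featured"]))
```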

  16. Outline 1. Motivation 2. Quality Assessment Models 3. Feature Implementation 4. Article Classification 5. Conclusion

  17. Evaluation Measures
  - Precision: proportion of predicted positives that are truly positive
  - Recall: proportion of actual positives that are correctly identified
  - F-Measure: harmonic mean of Precision and Recall
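For reference, the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN):

```latex
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]
```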

  18. Reconstruction Results
  Model | Featured (P / R / F)    | Non-featured (P / R / F) | Average F
  (1)   | 0.871 / 0.936 / 0.902   | 0.989 / 0.977 / 0.983    | 0.970
        | 0.781 / 0.877 / 0.826   | 0.980 / 0.960 / 0.970    | 0.949
  (2)   | ⊥                       | ⊥                        | ⊥
        | 0.903 / 0.900 / 0.901   | 0.980 / 0.981 / 0.980    | 0.967
  (3)   | 0.966 / 0.961 / 0.964   | ⊥                        | ⊥
        | 0.949 / 0.939 / 0.944   | 0.940 / 0.950 / 0.945    | 0.944
  (4)   | 0.900 / 0.920 / 0.910   | 0.980 / 0.970 / 0.975    | 0.957
        | 0.859 / 0.907 / 0.882   | 0.973 / 0.958 / 0.965    | 0.947
  P = Precision, R = Recall, F = F-Measure; two rows per model
  (1) Blumenstock  (2) Dalip et al.  (3) Lipka and Stein  (4) Stvilia et al.

  22. Uniform Dataset We define four datasets to fairly compare the performance of each proposed model, and we propose an additional model that combines every implemented feature.
  Dataset: balanced, random, corresponding to minimum word counts of 0, 800, 1600, and 2400; Featured: 3 000, Non-featured: 3 000
  (5) Fricke and Anderka:
  Features: all 75 features from every category
  Classifier: Support Vector Machine
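A sketch of how one such balanced dataset could be assembled and fed to model (5); the articles structure, the per-class sample size, and the helper name build_balanced_dataset are assumptions for illustration:

```python
# Sketch of one uniform dataset: balanced, random, with a minimum word
# count, then an SVM over all feature values (model (5)). The articles
# iterable and its feature vectors are assumed to come from the framework.
import random
from sklearn.svm import SVC

def build_balanced_dataset(articles, min_words, per_class=3000, seed=0):
    """articles: iterable of (feature_vector, word_count, is_featured)."""
    featured     = [a for a in articles if a[2] and a[1] >= min_words]
    non_featured = [a for a in articles if not a[2] and a[1] >= min_words]
    rng = random.Random(seed)
    sample = rng.sample(featured, per_class) + rng.sample(non_featured, per_class)
    X = [fv for fv, _, _ in sample]
    y = [1 if is_fa else 0 for _, _, is_fa in sample]
    return X, y

# usage, assuming all_articles comes from the feature-computation step:
# X, y = build_balanced_dataset(all_articles, min_words=800)
# SVC(kernel="linear").fit(X, y)
```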

  23. Uniform Evaluation [Plot: average F-Measure (y-axis, 0.90 to 1.00) versus minimum word count (x-axis: 0, 800, 1600, 2400) for models (1) Blumenstock, (2) Dalip et al., (3) Lipka and Stein, (4) Stvilia et al., and (5) Fricke and Anderka.]

  24. Conclusion and Outlook
  - A framework for convenient and consistent evaluation
  - A new model utilizing every implemented quality indicator
  - The most comprehensive collection of article features to date
