Wiki Vandalysis- Wikipedia Vandalism Analysis Manoj Harpalani, - PowerPoint PPT Presentation

Wiki Vandalysis - Wikipedia Vandalism Analysis
Manoj Harpalani, Thanadit Phumprao, Megha Bassi, Michael Hart, and Rob Johnson
Stony Brook University

Text Features
  o Edit distance
  o Text changes
  o Spelling errors
  o Obscene words
  o Repeated patterns
  o Sum of metrics
    - Spelling errors, obscene words, repeated patterns
  o Sentences inserted, deleted and changed
  o Word count
  o Ratio of suspicious features to the article word count
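A few of the text features listed above can be sketched in a handful of lines. This is a minimal illustration, not the authors' implementation: the `OBSCENE_WORDS` set, the "same character four or more times in a row" rule for repeated patterns, and every function name are assumptions, and spelling errors are left out because they need a dictionary.

```python
import difflib
import re

# Hypothetical word list; a real system would use a much larger lexicon.
OBSCENE_WORDS = {"stupid", "idiot"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def levenshtein(a, b):
    """Character-level edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution
        prev = curr
    return prev[-1]


def text_features(old_text, new_text):
    """Compute a few text features for one edit (old revision -> new revision)."""
    old_words, new_words = tokenize(old_text), tokenize(new_text)
    # Collect words that appear in the new revision but are not aligned
    # with the old revision.
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    inserted = []
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            inserted.extend(new_words[j1:j2])
    obscene = sum(1 for w in inserted if w in OBSCENE_WORDS)
    # Assumed definition of a repeated pattern: one character 4+ times in a row.
    repeated = sum(1 for w in inserted if re.search(r"(.)\1{3,}", w))
    suspicious = obscene + repeated
    return {
        "edit_distance": levenshtein(old_text, new_text),
        "words_inserted": len(inserted),
        "obscene_words": obscene,
        "repeated_patterns": repeated,
        "sum_of_metrics": suspicious,
        "suspicious_ratio": suspicious / max(len(new_words), 1),
    }
```

For example, `text_features("The cat sat on the mat.", "The cat sat on the mat. You are stupid hahahaaaa")` counts four inserted words, one of them obscene and one containing a repeated pattern.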

Advanced Text Analysis Features
• Grammar
  o Link grammar checker
  o Discover number of grammatical errors
• Sentiment Analysis
  o Logistic regression over character-level n-grams
  o Trained on film summaries and reviews
  o Measure both polarity and subjectivity
    - Across edit type (insert, delete, modify)
    - Across sentences
    - Over all words
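To make "logistic regression over character-level n-grams" concrete, here is a rough sketch of how a polarity score and its change across an edit might be computed once a model has been trained. The weights are hypothetical placeholders rather than values learned from film reviews, and the n-gram range (2 to 3) is an assumption.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Count all character n-grams with n_min <= n <= n_max."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def polarity(text, weights, bias=0.0):
    """Logistic-regression probability that the text is positive."""
    z = bias + sum(weights.get(g, 0.0) * c
                   for g, c in char_ngrams(text).items())
    return 1.0 / (1.0 + math.exp(-z))


def polarity_change(old_text, new_text, weights):
    """Change in polarity caused by an edit (negative = more negative)."""
    return polarity(new_text, weights) - polarity(old_text, weights)


# Hypothetical weights; a trained model would have many thousands of these.
EXAMPLE_WEIGHTS = {"goo": 1.0, "bad": -1.0}
```

With these toy weights, `polarity_change("a good film", "a bad film", EXAMPLE_WEIGHTS)` is negative, which is the kind of shift the sentiment features are meant to expose.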

Meta-Features
• Article
  o Number of times article was vandalized previously
  o Number of times article was reverted previously
• Editor
  o Time since author registered in Wikipedia
  o Number of previous vandalisms
  o Total contributions to Wikipedia
  o Total contributions to a given article
  o Number of contributions in a sampling of edits
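The meta-features above are mostly counts over revision history. The sketch below assumes a hypothetical in-memory list of `Revision` records with vandalism and revert flags already attached; the record layout and field names are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Revision:
    article: str
    editor: str
    timestamp: datetime
    is_vandalism: bool = False
    is_revert: bool = False


def meta_features(edit, history, registered_at: Optional[datetime]):
    """Compute article and editor meta-features for one edit.

    history is a list of Revision records (hypothetical data source);
    registered_at is None for unregistered (IP) editors.
    """
    # Only revisions made before the edit under inspection may be used.
    past = [r for r in history if r.timestamp < edit.timestamp]
    by_editor = [r for r in past if r.editor == edit.editor]
    on_article = [r for r in past if r.article == edit.article]
    days = (edit.timestamp - registered_at).days if registered_at else 0
    return {
        "article_prev_vandalisms": sum(r.is_vandalism for r in on_article),
        "article_prev_reverts": sum(r.is_revert for r in on_article),
        "editor_is_registered": registered_at is not None,
        "editor_days_registered": days,
        "editor_prev_vandalisms": sum(r.is_vandalism for r in by_editor),
        "editor_total_contributions": len(by_editor),
        "editor_contributions_to_article": sum(
            r.article == edit.article for r in by_editor),
    }
```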

Classification approaches
• Baseline
  o Used Bag of Words approach
  o Added RankBoost to improve baseline
• Classifiers built on features
  o Naive Bayes
  o C4.5 Decision Tree
  o NBTree
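As an illustration of the bag-of-words baseline, the following is a small multinomial naive Bayes classifier with add-one smoothing. It is a generic textbook sketch, not the authors' setup: the class name, the whitespace tokenization, and the choice to skip words unseen in training are all assumptions.

```python
import math
from collections import Counter, defaultdict


class BagOfWordsNB:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, doc):
        """Unnormalized log posterior for each class."""
        n = sum(self.class_counts.values())
        scores = {}
        for y, cy in self.class_counts.items():
            total = sum(self.word_counts[y].values())
            score = math.log(cy / n)  # class prior
            for w in doc.lower().split():
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[y][w] + 1)
                                      / (total + len(self.vocab)))
            scores[y] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)
```

A toy usage: after `BagOfWordsNB().fit(["you are stupid", "added a citation"], ["vandalism", "regular"])`, calling `predict("stupid page")` picks the class whose training text contained "stupid".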

Classifiers evaluated

Evaluation results on training set:

Metric      NB+BoW   NB+BoW+RankBoost   NB      C4.5    NBTree
Precision   27.8%    34.1%              15.8%   53.2%   64.3%
Recall      32.6%    26.6%              93.2%   36.9%   36.4%
Accuracy    87.5%    89.7%              69.2%   94.1%   94.8%
F-measure   30.1%    29.9%              27.1%   43.6%   46.5%
AUC         69%      62%                88.5%   80.5%   91%

Evaluation results on test set:

Metric      NB      C4.5    NBTree
Precision   19.0%   51.0%   61.5%
Recall      92.0%   26.7%   25.2%
Accuracy    72.0%   91.6%   92.3%
F-measure   35.5%   35.1%   35.8%
AUC         86.6%   76.9%   88.7%
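For readers unfamiliar with the metrics in these tables, the sketch below shows how precision, recall, accuracy and F-measure follow from confusion-matrix counts. The function names are illustrative, and the counts in the usage note are made up to show the effect of class imbalance; they are not from the study.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Metrics for the positive (vandalism) class from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure(precision, recall),
    }
```

Plugging in the NBTree training-set precision and recall, `f_measure(0.643, 0.364)` gives roughly 0.465, matching the 46.5% reported. With hypothetical counts such as `classification_metrics(tp=36, fp=20, fn=64, tn=1880)`, accuracy is about 96% even though recall is only 36%, because regular edits vastly outnumber vandalism.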

Performance for selected users

Type of user                                                 FP rate   Recall   Precision
Registered users                                             < 0.1%    22.0%    68.4%
Registered users that edited this article 10 times or more   < 0.01%   0.0%     0.0%
Unregistered users                                           3.9%      40.8%    67.2%
IP addresses that edited this article 10 times or more       1.7%      33.3%    50.0%

Top Performing Features

Feature                                                     Information gain   Feature type
Total number of author contributions                        0.074              Meta feature
How long the author has been registered                     0.067              Meta feature
If the author is a registered user                          0.06               Meta feature
How frequently the author contributed in the training set   0.04               Meta feature
How often the article has been vandalized                   0.035              Meta feature
How often the article has been reverted                     0.034              Meta feature
The number of previous contributions on the article         0.019              Meta feature
Change in sentiment score                                   0.019              Advanced text feature
Number of misspelled words                                  0.019              Text feature
Sum of metrics                                              0.018              Text feature
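Information gain, the ranking criterion in this table, is the reduction in label entropy obtained by splitting on a feature. The sketch below computes it for a discrete feature; continuous features such as contribution counts would first need to be binned, and how that was done here is not stated, so the binning step is left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A feature that perfectly separates two balanced classes has a gain of 1 bit, and a feature independent of the label has a gain of 0. The small values in the table (at most 0.074) indicate that no single feature separates vandalism from regular edits on its own.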

Features Employed by the NBTree

Sentiment and Vandalism
• Change in polarity and vandalism
  o Vandalism skewed negatively
  o Regular edits skewed positively
• 0.03 with a standard deviation of 1.1

Timely suggestions for Wikipedia
• Certain IPs contribute heavily to Wikipedia
  o IPs belong to universities, Redmond, etc.
  o Recruit!
• Incorporate simple features into current vandalism tools
  o Editor meta-information
  o Article meta-information
  o Even if not used directly to classify vandalism
    - Use to rank suspicious edits for Wiki Admins
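The last suggestion, ranking edits rather than classifying them, could look something like the sketch below. The scoring function and its hand-set weights are purely hypothetical, as are the dictionary keys; in practice the score might come from a trained classifier's probability output.

```python
import heapq


def rank_suspicious(edits, score, top_k=10):
    """Return the top_k edits with the highest suspicion score, highest first."""
    return heapq.nlargest(top_k, edits, key=score)


def simple_score(edit):
    """Hypothetical hand-weighted score over editor and article meta-information."""
    s = 0.0
    if not edit["editor_is_registered"]:
        s += 1.0
    s += 0.5 * edit["editor_prev_vandalisms"]
    s += 0.1 * edit["article_prev_vandalisms"]
    return s
```

An administrator's review queue would then be `rank_suspicious(recent_edits, simple_score)`, where `recent_edits` is an assumed list of per-edit feature dictionaries.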

Vandalism by Registered Users is hard
• Our classifier strengths
  o Unregistered users
  o IPs that contribute frequently
  o Registered users with minimal site usage
• But poor classification of active registered users
  o Not many instances of vandalism by these users
  o Our features provide little discriminatory information
  o Vandalism not as clear-cut
• Suggestions
  o Ignore? Apply the law of diminishing returns
  o Use techniques for imbalanced training sets
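One common technique for imbalanced training sets is random oversampling of the minority class. The slides do not say which technique the authors had in mind, so the following is only one possible reading, with illustrative names throughout.

```python
import random


def oversample_minority(examples, labels, seed=0):
    """Duplicate randomly chosen minority-class examples until every class
    has as many examples as the largest class. Returns (example, label) pairs."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_label.values())
    balanced = []
    for y, xs in by_label.items():
        # Sample with replacement to fill the gap to the majority size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    rng.shuffle(balanced)
    return balanced
```

Oversampling should be applied to the training split only; duplicating examples before splitting would leak copies into the test set.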

Conclusions
• NBTree worked well by partitioning edits
  o Train a tailored stochastic model
  o Suggests a one-size-fits-all approach is difficult
  o Until someone creates a better model describing vandalism
• Author and article meta-information incredibly useful
  o Expectation of the quality of the edit
• Main limitation
  o Could not verify relevance/factuality of content
  o Ideas?
    - Expertise of editor
    - Language model based on similar articles
    - Value-added assessment

Thank you! Questions?
