

SLIDE 1

Detecting Wikipedia Vandalism via Spatio-Temporal Analysis of Revision Metadata

Andrew G. West
June 10, 2010
ONR-MURI Presentation

SLIDE 2

FROM THE LAST MURI REVIEW

Where we left off…

SLIDE 3

Spatio-Temporal Reputation

• Single-entity reputation values are the status quo
• Issue: Sybil attacks (e.g., spam botnets)
• Spatial reputation:
  – No entity-specific data? Use broader groupings
  – Exploit homophily
  – Clarity in borderline classification cases

(Diagram: within a user-space, an entity sits inside a locality and a region; the entity, local, and regional behavior histories each produce a reputation value, which are then combined.)

SLIDE 4

Hierarchical Groupings

• Spatial groupings for spam detection leverage the IP assignment hierarchy
  – Entities are IP addresses
  – {AS, Subnet, IP} groups used
• TDGs are hierarchies, thus spatio-(temporal) techniques may fulfill the reputation component of QTM/QuanTM

(Diagram: the IP assignment hierarchy, IANA → RIRs → ASes → subnets → IPs, alongside an analogous TDG/QTM hierarchy.)

SLIDE 5

PreSTA for Spam Detection

PreSTA: Preventative Spatio-Temporal Aggregation

(Architecture diagram: incoming emails arrive at an SMTP server running the PreSTA client; on a cache miss, the PreSTA server's reputation engine runs spatial and temporal analysis over blacklist-source DBs (e.g., a Spamhaus subscription), a classifier returns the decision, and the result is cached.)

SLIDE 6

New Contributions…

APPLYING SPATIO-TEMPORAL PROPERTIES TO WIKIPEDIA

SLIDE 7

Vandalism

VANDALISM: Informally, an edit that is:
• Non-value adding
• Offensive
• Destructive in content removal

• Serious problem. One source [3] estimates hundreds of millions of "damaged page views"
• NLP is effective for blatant instances; subtle ones (e.g., insertion of "not", name replacement) are much harder to find
• Our method: an alternative means of detection, complementing NLP

SLIDE 8

Big Idea

• Wikipedia revision metadata (not the article or diff text) can be used to detect instances of vandalism
  – As effective as language-processing efforts [2]
  – Machine-learning over spatio-temporal properties:
    • Simple features: straightforward metadata analysis
    • Aggregate features: reputation values for single entities (editors, articles) and spatial groupings thereof (geographical location, topical categories)

SLIDE 9

Outline

• Labeling revisions (rollback)
• Simple features
  – Motivation: SNARE [1] spam-blocking
  – Edit time-of-day, day-of-week, comment length…
• Aggregate features
  – Motivation: PreSTA [5] reputation algorithm
  – Article rep., editor rep., spatial reputations…
• Classifier performance
• STiki [4] (a real-time implementation)

SLIDE 10

Metadata

Wikipedia provides metadata via DB-dumps:

#    METADATA ITEM          NOTES
(1)  Timestamp of edit      In GMT
(2)  Article being edited   Namespace can be deduced from the title
(3)  Editor making edit     May be a user-name (if registered editor) or an IP address* (if anonymous)
(4)  Revision comment       Text field where the editor can summarize changes
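
As a rough illustration, all four items can be pulled from a dump without ever touching article text. The sketch below assumes the stock MediaWiki XML export schema (<page>, <title>, <revision>, <timestamp>, <contributor>, <comment>); it is illustrative, not the project's actual parser.

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

/** Streams a MediaWiki XML dump and prints the four metadata items,
 *  never reading the article text. Schema names are the stock export
 *  elements; error handling is omitted for brevity. */
public class MetadataReader {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, true); // whole text per event
        XMLStreamReader xml = factory.createXMLStreamReader(new FileInputStream(args[0]));
        String field = "";
        while (xml.hasNext()) {
            switch (xml.next()) {
                case XMLStreamConstants.START_ELEMENT -> field = xml.getLocalName();
                case XMLStreamConstants.END_ELEMENT -> field = "";
                case XMLStreamConstants.CHARACTERS -> {
                    String text = xml.getText().trim();
                    if (text.isEmpty()) break;
                    switch (field) {
                        case "title"          -> System.out.println("article:   " + text); // (2)
                        case "timestamp"      -> System.out.println("timestamp: " + text); // (1)
                        case "username", "ip" -> System.out.println("editor:    " + text); // (3)
                        case "comment"        -> System.out.println("comment:   " + text); // (4)
                    }
                }
            }
        }
    }
}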

SLIDE 11

Labeling Vandalism

"Reversion" (i.e., undo)
• Any user can execute:
  (1) Press button
  (2) Enter edit summary
  (3) Confirm reversion

"Rollback" (expedited revert)
• Privileged: ≈4,700 users
• (1) Press button. Done.
• Auto-summarization: "Reverted edits by x to last revision by y"

Prevalence/Source of Rollbacks. Test-set contains ≈50 million edits:
• (1) Only NS0 edits (71% of all edits)
• (2) Only edits within the last year (2008/11+)

SLIDE 12

Rollback-based Labels

• Use rollback-based labeling:
  – (1) Find the special comment format
  – (2) Verify the permissions of the editor
  – (3) Backtrack to find the offending edit (OE)
  – All edits not in set {OE} are {Unlabeled}
• Alternatives: manual labeling, page-hashing
• Advantages of using rollback:
  – (1) Automated (just parsing)
  – (2) High-confidence (privileged users are trusted)
  – (3) Per-case (vandalism need not be defined)
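
A minimal sketch of step (1), assuming the auto-summary format quoted on the previous slide; the real system's parser may accept more comment variants:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Step (1) of rollback-based labeling: spot the auto-generated summary.
 *  The pattern assumes exactly the format quoted on the previous slide. */
public class RollbackLabeler {
    private static final Pattern ROLLBACK =
        Pattern.compile("Reverted edits by (.+?) to last revision by (.+)");

    /** Returns the offending editor ("x") if the comment is a rollback
     *  summary, else null. Steps (2)-(3), the permission check and the
     *  backtrack to the offending edit, are omitted here. */
    public static String offendingEditor(String comment) {
        Matcher m = ROLLBACK.matcher(comment);
        return m.matches() ? m.group(1) : null;
    }
}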

SLIDE 13

SIMPLE FEATURES

* Discussion abbreviated to concentrate on the aggregate features

SLIDE 14

Spatio-Temporal Basics

• Temporal properties: a function of when events occur
• Spatial properties: appropriate wherever a size, distance, or membership function can be defined

Motivating work: SNARE [1]
• Spatio-temporal properties are effective in spam mitigation
• Physical distance mail traveled, time-of-day mail sent, message size (in bytes), AS-membership of sender… (13 in total)
• Advantages of the approach:
  – NLP filters are easy to evade; spatio-temporal properties are more difficult
  – Computationally simpler than NLP

SLIDE 15

Edit Time, Day-of-Week

• Use IP geo-location data to determine the origin time-zone, and adjust the UTC timestamp accordingly
• Vandalism is most prevalent during working hours/week: kids are in school(?)
• Fun fact: vandalism is almost twice as prevalent on a Tuesday versus a Sunday

(Plots: local time-of-day and local day-of-week when edits are made, OE vs. unlabeled.)
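
The timestamp adjustment is mechanical once a UTC offset has been geo-located for the editing IP. A minimal sketch (the geo-IP lookup itself is assumed and not shown; values are illustrative):

import java.time.DayOfWeek;
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

/** Shifts a GMT edit timestamp onto the editor's local clock, given a
 *  UTC offset obtained from IP geo-location data. */
public class LocalEditTime {
    static OffsetDateTime localTime(long editEpochSecs, int utcOffsetHours) {
        return Instant.ofEpochSecond(editEpochSecs)
                      .atOffset(ZoneOffset.ofHours(utcOffsetHours));
    }

    public static void main(String[] args) {
        // An edit at 14:00 GMT from a UTC-5 address happened at 09:00 local
        OffsetDateTime local = localTime(1276178400L, -5);
        int hourOfDay = local.getHour();       // feature: edit time-of-day
        DayOfWeek dow = local.getDayOfWeek();  // feature: edit day-of-week
        System.out.println(hourOfDay + " " + dow);
    }
}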

SLIDE 16

Time-since (TS) …

• High-edit pages are most often vandalized
  – ≈2% of pages have 5+ OEs, yet these pages have 52% of all edits
  – Other work [3] has shown these are also the most-visited articles

TS ARTICLE EDITED         OE    UnLbl
All edits (median, hrs.)  1.03  9.67

• Long-time participants vandalize very little
  – "Registration": timestamp of the first edit made by a user
  – Sybil-attack to abuse the benefits?

TS EDITOR REGISTRATION    OE    UnLbl
Regd., median (days)      0.07  765
Anon., median (days)      0.01  1.97

SLIDE 17

Misc. Simple Features

• Revision comment length
  – Vandals leave shorter comments (laziness? or just minimizing bandwidth?)
• Privileged editors (and bots)
  – Huge contributors, but rarely vandalize

FEATURE                                            OE      UnLbl
Revision comment (average length in characters)    17.73   41.56
Anonymous editors (percentage)                     85.38%  28.97%
Bot editors (percentage)                           00.46%  09.15%
Privileged editors (percentage)                    00.78%  23.92%

SLIDE 18

AGGREGATE FEATURES

SLIDE 19

PreSTA Algorithm

CORE IDEA: No entity-specific data? Examine spatially-adjacent entities (homophily)

• Grouping functions (spatial) define memberships
• Observations of misbehavior form feedback, and observations are decayed (temporal)

PreSTA [5]: model for spatio-temporal reputation:

rep(group) = Σ time_decay(TS_vandalism) / size(group)

where the sum runs over the timestamps (TS) of vandalism incidents by group members.

(Diagram: higher-order reputation. Alice → French → Europeans, with rep(A), rep(FRA), and rep(EUR).)

SLIDE 20

Example Reputation Calculation

(Timeline: TS1 … TS6. The user vandalizes at TS2 and TS5; reputation is calculated at TS1, TS3, and TS6.)

At TS1: no history? Reputation = 0.0. Completely innocent!


SLIDE 22

Example Reputation Calculation

At TS3: one incident in history.
Reputation: decay(TS3 − TS2) = 0.95
decay() returns values on [0,1]


SLIDE 24

Example Reputation Calculation

At TS6: two incidents in history.
Reputation: decay(TS6 − TS2) + decay(TS6 − TS5) = 0.50 + 0.95 = 1.45
Values are relative
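
The walkthrough above can be reproduced with a short sketch. Exponential half-life decay is an assumption here (the actual decay() in [5] may differ in shape), and the class and method names are illustrative:

import java.util.List;

/** PreSTA-style group reputation: a decayed sum of vandalism incidents,
 *  normalized by group size. The exponential half-life decay below is an
 *  assumption, not the confirmed curve from [5]. */
public class GroupReputation {
    private final double halfLife;  // same units as the timestamps

    public GroupReputation(double halfLife) { this.halfLife = halfLife; }

    /** decay() maps incident age onto [0,1]; fresher incidents weigh more. */
    double decay(double age) {
        return Math.pow(0.5, age / halfLife);
    }

    /** rep(group) = Σ time_decay(TS_vandalism) / size(group). */
    double rep(List<Long> vandalTimestamps, long now, int groupSize) {
        double sum = 0.0;
        for (long ts : vandalTimestamps) sum += decay(now - ts);
        return sum / groupSize;
    }
}

With a single-editor group (size 1) and incidents at TS2 and TS5, calling rep(...) at TS6 yields decay(TS6 − TS2) + decay(TS6 − TS5), i.e., the slide's 0.50 + 0.95 = 1.45 for suitable incident ages.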

SLIDE 25

Rollback as Feedback

• Use rollbacks (OEs) as negative feedback for entities
• Key notion: a bad edit is not part of reputation until (TS_flag > TS_vandalism). Thus, vandalism must be flagged quickly so reputations are not latent.
  – Fortunately, median time-to-rollback is ≈80 seconds

(Plot: CDF of time between OE and flagging.)

SLIDE 26

Article Reputation

• Intuitively, some topics are controversial and likely targets for vandalism (or temporally so)
• Trivial spatial grouping (size = 1)
• 85% of OEs have non-zero rep (vs. just 45% of random edits)

Articles with the most OEs:

ARTICLE         #OEs
George W. Bush  6546
Wikipedia       5589
Adolf Hitler    2612
United States   2161
World War II    1886

(Plot: CDF of article reputation, OE vs. unlabeled.)

SLIDE 27

Category Reputation

• Category = spatial group over articles
• Wiki provides categories/memberships; use only topical ones
• size() = number of category members
• Overlapping grouping
• 97% of OEs have non-zero reputation (85% in the article case)

Example of category rep. calculation: the article "Abraham Lincoln" belongs to Category: President (with Barack Obama, G.W. Bush, …) and Category: Lawyer (…). Feature value = MAXIMUM(?) of the member categories' reputations.

Categories with the most OEs:

CATEGORY (with 100+ members)    PGs  OEs/PG
World Music Award Winners       125  162.27
Characters of Les Miserables    135  146.88
Former British Colonies         145  141.51
…
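
A sketch of that feature computation. The slide's "MAXIMUM(?)" is taken at face value here, so an article inherits the reputation of its worst category; this combiner is an assumption, and others are conceivable:

import java.util.Map;
import java.util.Set;

/** Category feature for an article: the maximum reputation across its
 *  topical categories. MAXIMUM follows the slide's "MAXIMUM(?)" and is
 *  an assumption, not a confirmed design choice. */
public class CategoryFeature {
    static double categoryRep(Set<String> categories, Map<String, Double> repByCat) {
        double worst = 0.0;
        for (String cat : categories)
            worst = Math.max(worst, repByCat.getOrDefault(cat, 0.0));
        return worst;
    }
}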

SLIDE 28

Editor Reputation

• Straightforward use of the rep() function, with one-editor groups
• Problem: dedicated editors accumulate OEs and look as bad as attackers (normalize? No)
• Mediocre performance on its own; meaningful correlation with other features, however

(Plot: CDF of editor reputation, OE vs. unlabeled.)

SLIDE 29

Country Reputation

• Country = spatial grouping over editors
• Geo-location data maps IP → country
• Straightforward: an IP resides in one country

OE-rate (normalized) for countries with 100k+ edits:

RANK  COUNTRY        %-OEs
1     Italy          2.85%
2     France         3.46%
3     Germany        3.46%
…     …              …
12    Canada         11.35%
13    United States  11.63%
14    Australia      12.08%

(Plot: CDF of country reputation, OE vs. unlabeled.)

SLIDE 30

CLASSIFICATION & PERFORMANCE

SLIDE 31

ML Training

• Calculate features for all edits; normalize onto [0,1] and fix polarity
• SVM: Support Vector Machine
• ISSUE: the {Unlabeled} set is just that. Use very low cost penalties so there is no over-compensation.
• Train over a prior subset to classify now (100+ edits/sec)

Review of features used (only IP-editors):

#   FEATURE
1   Edit time-of-day
2   Edit day-of-week
3   Time-since page edited
4   Time-since user registration
5   Time-since last user OE
6   Revision comment length
7   Article reputation
8   Category reputation
9   Editor reputation
10  Country reputation
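
A minimal sketch of the conditioning step before training. The slide says only "normalize onto [0,1]; polarity", so min-max scaling and a polarity flip are assumptions:

/** Min-max scales one feature column onto [0,1]; 'flip' inverts polarity
 *  so that larger values consistently point toward vandalism. Min-max
 *  scaling is illustrative, not confirmed by the slides. */
public class FeatureScaler {
    static double[] normalize(double[] col, boolean flip) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : col) { min = Math.min(min, v); max = Math.max(max, v); }
        double range = max - min;
        double[] out = new double[col.length];
        for (int i = 0; i < col.length; i++) {
            double scaled = (range == 0.0) ? 0.0 : (col[i] - min) / range;
            out[i] = flip ? 1.0 - scaled : scaled;
        }
        return out;
    }
}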

SLIDE 32

Performance

• ISSUE: edits classified as OE but in {UnLbl} may not be false positives:
  – Manual inspection
  – Raw vs. adjusted curves
  – Corpus produced*
• Recall: % of OEs classified as such
• Precision: % of edits classified OE that are actually vandalism
• 50% @ 50%
• Similar performance to NLP efforts [2]
• Use as an intelligent routing (IR) tool
• Shown at steady-state

(Plot: precision-recall trade-off, raw vs. adjusted.)

* http://www.cis.upenn.edu/~westand
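
For reference, the two metrics as used on this slide, in a tiny sketch (counts are illustrative):

/** Precision/recall over the rollback-derived labels. TP: true OEs
 *  flagged; FP: flagged edits that were not vandalism; FN: OEs missed. */
public class Metrics {
    static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }
}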

SLIDE 33

Conclusions

• Showed spatio-temporal properties can locate Wikipedia vandalism comparably to NLP
  – Complementary; still some advantages:
    • Content/language independent
    • Harder to evade (analysis needed)
    • Faster (100+ edits/sec vs. 5 edits/sec)
• Spatio-temporal reputation as a general-purpose technique for content-based access control?
  – Email spam: SNARE [1] and PreSTA [5]
  – This work shows it also works for Wikipedia

SLIDE 34

References

[1] S. Hao, N.A. Syed, N. Feamster, A.G. Gray, and S. Krasser. Detecting spammers with SNARE: Spatio-temporal network-level automatic reputation engine. In 18th USENIX Security Symposium, 2009.
[2] M. Potthast, B. Stein, and R. Gerling. Automatic vandalism detection in Wikipedia. In Advances in Information Retrieval, pp. 663-668, 2008.
[3] R. Priedhorsky, J. Chen, S.K. Lam, K. Panciera, L. Terveen, and J. Riedl. Creating, destroying, and restoring value in Wikipedia. In GROUP '07: The 2007 ACM Conference on Supporting Group Work, pp. 259-268, 2007.
[4] A.G. West. STiki: A vandalism detection tool for Wikipedia. http://en.wikipedia.org/wiki/Wikipedia:STiki. Software, 2010.
[5] A.G. West, A.J. Aviv, J. Chang, and I. Lee. Mitigating spam using spatio-temporal reputation. Technical report MS-CIS-10-04, University of Pennsylvania, February 2010.

SLIDE 35

STiki

STiki [4]: a real-time, on-Wikipedia implementation of the technique

SLIDE 36

STiki Architecture

EDIT QUEUE: connection between server and client side
• Populated: priority insertion based on vandalism score
• Popped: GUI client shows likely vandalism first
• De-queued: edit removed if another is made to the same page
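
A minimal sketch of the queue's three operations. Class and field names are illustrative, not STiki's actual ones, and one pending edit per page is assumed:

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Sketch of the STiki edit queue: highest vandalism score first,
 *  with stale edits displaced when the same page is edited again. */
class EditQueue {
    record ScoredEdit(long revId, String page, double score) {}

    private final PriorityQueue<ScoredEdit> queue =
        new PriorityQueue<>((a, b) -> Double.compare(b.score(), a.score()));
    private final Map<String, ScoredEdit> latestPerPage = new HashMap<>();

    /** Populate: priority insertion by score; de-queue any superseded edit. */
    void push(ScoredEdit e) {
        ScoredEdit stale = latestPerPage.put(e.page(), e);
        if (stale != null) queue.remove(stale);  // another edit hit the same page
        queue.add(e);
    }

    /** Pop: hand the GUI client the most-likely-vandalism edit. */
    ScoredEdit pop() {
        ScoredEdit e = queue.poll();
        if (e != null) latestPerPage.remove(e.page(), e);
        return e;
    }
}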

SLIDE 37

Client Demonstration

STiki client demo

SLIDE 38

STiki Performance

• Competition inhibits maximal performance
  – Metric: hit-rate (% of edits displayed that are vandalism)
  – Offline analysis shows it could be 50%+
  – Competing (often autonomous) tools make it ≈10%
• STiki successes and use-cases
  – Has reverted 3,500+ instances of vandalism
  – May be more appropriate in less-patrolled installations
    • Any of Wikipedia's foreign-language editions
    • Corporate wikis and other small installations
  – Embedded vandalism: that which escapes initial detection. Median age of a STiki revert is 4.25 hours, 200× conventional

SLIDE 39

Alternative Code Uses

• All code is available [4] and open source (Java)
• Backend (server-side) re-use
  – Large portion of the MediaWiki API implemented (bots)
  – Trivial to add new features (including NLP ones)
• Frontend (client-side) re-use
  – Useful whenever edits require human inspection
• Data re-use
  – Corpus building; crowd-sourcing
  – Incorporate the vandalism score into more robust tools

SLIDE 40

Future Direction: Wiki-Spam

• Many people "see" vandalism and do nothing:
  – It becomes "embedded" for days/weeks, accumulating views
  – Traffic spikes: during the American Idol finale, the "Crystal Bowersox" article was vandalized for just 28 seconds, yet 12,000+ viewers saw the page in that window
  – Shows evade-ability, apathy, or both
• What if vandalism were spam?
  – If immature vandalism can get this many views, what about less detectable, incentivized spam?
  – Could it be more profitable than email spam?
  – What evasion strategies would work best?