Vandalism Detection in Wikidata

Stefan Heindorf, Martin Potthast, Benno Stein, Gregor Engels. CIKM 2016, October 25, 2016.


  1. Vandalism Detection in Wikidata
      Stefan Heindorf (1), Martin Potthast (2), Benno Stein (2), Gregor Engels (1)
      CIKM 2016, October 25, 2016

  2. Motivation

  10. Figure: Wikidata item with regions labeled Item head, Item body, and Revisions; dates shown: Feb 22, 2013, May 13, 2013, May 30, 2013

  18. Why is it a problem?
      Patrolling, reverting, warning, protecting, blocking
      • Over 2 million manual edits per month
      • A lot of tedious work
      • Vandalism is not detected in time

  19. Research Question: How to detect damaging changes to crowdsourced knowledge bases?

  20. Our Approach
      1. Label Dataset → Vandalism Corpus [SIGIR ’15]
      2. Study Vandalism Characteristics → 47 Features
      3. Experiment with ML → Multiple-Instance Learning
      4. Compare with state of the art → 2 Baselines

  25. Corpus [SIGIR ’15]
      Figure: revisions over time (x-axis: month), with regions labeled Training, Validation, and Test
      • 103,000 vandalism revisions
      • 24 million manual revisions
      → 0.4% vandalism
      • Item head: 1.3% vandalism
      • Item body: 0.2% vandalism
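The 0.4% figure on the corpus slide follows directly from the two counts shown with it. A minimal Python sketch of that arithmetic, using only the rounded numbers from the slide (the variable names are illustrative):

```python
# Figures taken from the corpus slide; both are rounded on the slide.
vandalism_revisions = 103_000
manual_revisions = 24_000_000

# Share of manual revisions that are vandalism.
vandalism_rate = vandalism_revisions / manual_revisions

# The same imbalance expressed as "one vandalism case per N manual revisions".
revisions_per_vandalism = manual_revisions / vandalism_revisions

print(f"vandalism rate: {vandalism_rate:.1%}")  # prints "vandalism rate: 0.4%"
print(f"about 1 in {revisions_per_vandalism:.0f} manual revisions is vandalism")
```

The rate illustrates how strongly imbalanced the classification problem is: the vandalism class is a fraction of a percent of all manual revisions.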

  36. Features (47 in total)
      Content features
      • 11 character features (e.g., lowerCaseRatio, digitRatio)
      • 9 word features (e.g., badWordRatio)
      • 4 sentence features (e.g., commentSitelinkSimilarity)
      • 3 statement features (e.g., propertyFrequency)
      Context features
      • 10 user features (e.g., userCountry)
      • 2 item features (e.g., logItemFrequency)
      • 8 revision features (e.g., revisionTag, revisionLanguage)
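The slides name the content features but do not define them. The Python sketch below shows one plausible reading of three of them, assuming each ratio is a count divided by the length of the edited text (in characters or words). The function names mirror the feature names, and the bad-word lexicon is a placeholder; neither the exact definitions nor the lexicon come from the presentation.

```python
def lower_case_ratio(text: str) -> float:
    """Assumed definition: share of characters that are lowercase letters."""
    if not text:
        return 0.0
    return sum(ch.islower() for ch in text) / len(text)


def digit_ratio(text: str) -> float:
    """Assumed definition: share of characters that are digits."""
    if not text:
        return 0.0
    return sum(ch.isdigit() for ch in text) / len(text)


def bad_word_ratio(text: str, bad_words: set[str]) -> float:
    """Assumed definition: share of whitespace-separated words found in a lexicon."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in bad_words for word in words) / len(words)


# Placeholder lexicon for illustration only.
PLACEHOLDER_BAD_WORDS = {"stupid"}

label = "Barack Obama Aloha"  # label text from the session example in the slides
print(lower_case_ratio(label))
print(digit_ratio(label))
print(bad_word_ratio("you are stupid", PLACEHOLDER_BAD_WORDS))
```

Ratios rather than raw counts keep the features comparable across short labels and longer descriptions.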

  37. Features (47 in total): revisionTag
      (T = thousand)           Vand.   Total      Prob.
      Rev. with tags           52 T    8,619 T    0.60%
        By abuse filter        49 T    122 T      39.90%
        By editing tools       3 T     8,496 T    0.03%
      Rev. w/o tags            52 T    15,386 T   0.34%
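The "Prob." column of the revisionTag table reads as the share of vandalism among revisions in each row. The Python sketch below recomputes that share from the counts on the slide, assuming "T" stands for thousand. Because the slide counts are rounded, the recomputed values only approximate the reported ones.

```python
# Counts in thousands of revisions, copied from the revisionTag slide.
# Tuple layout: (vandalism revisions, total revisions, probability in % as reported).
tag_table = {
    "with tags": (52, 8_619, 0.60),
    "by abuse filter": (49, 122, 39.90),
    "by editing tools": (3, 8_496, 0.03),
    "without tags": (52, 15_386, 0.34),
}


def vandalism_probability(vandalism: float, total: float) -> float:
    """Share of vandalism among the revisions of one row, in percent."""
    return 100.0 * vandalism / total


recomputed = {
    name: vandalism_probability(vandalism, total)
    for name, (vandalism, total, _) in tag_table.items()
}

for name, (_, _, reported) in tag_table.items():
    print(f"{name:<18} recomputed {recomputed[name]:6.2f}%   reported {reported:6.2f}%")
```

The contrast between rows is what makes the feature useful: revisions tagged by the abuse filter are vandalism far more often than revisions tagged by editing tools.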

  42. Multiple-Instance Learning
      • Observation: Vandalism seldom occurs in isolation
        Session 1:
          22:35, 11 September 2013 184.19.64.111 (talk) . . (Changed English label: Barack Obama Aloha)
          22:35, 11 September 2013 184.19.64.111 (talk) . . (Added English alias: Lulu:):):):):):):))
        Session 2:
          12:05, 11 September 2013 MatmaBot (talk | contribs) . . (Changed Polish description: imported
      • Idea: Apply Multiple-Instance Learning
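The session idea can be sketched in a few lines of Python. The sketch assumes that a session is a run of consecutive revisions of the same item by the same user, and that a session is scored by its most suspicious revision. Both the `Revision` type and the max aggregation are illustrative assumptions rather than the method from the talk, and the item ID and scores are invented for the example.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Revision:
    item_id: str   # placeholder item identifier
    user: str
    comment: str
    score: float   # hypothetical per-revision vandalism score in [0, 1]


def group_into_sessions(revisions: list[Revision]) -> list[list[Revision]]:
    """Group consecutive revisions with the same (item, user) into sessions.

    Expects revisions in chronological order; groupby only merges neighbours.
    """
    return [
        list(group)
        for _, group in groupby(revisions, key=lambda r: (r.item_id, r.user))
    ]


def session_scores(sessions: list[list[Revision]]) -> list[float]:
    """Assumed aggregation: a session is as suspicious as its worst revision."""
    return [max(r.score for r in session) for session in sessions]


# Revision log from the slide in chronological order; scores are made up.
history = [
    Revision("item-1", "MatmaBot", "Changed Polish description", 0.01),
    Revision("item-1", "184.19.64.111", "Changed English label", 0.40),
    Revision("item-1", "184.19.64.111", "Added English alias", 0.90),
]

sessions = group_into_sessions(history)
print([len(session) for session in sessions])
print(session_scores(sessions))
```

Under this aggregation, the weakly suspicious label change inherits the high score of the alias edit made in the same session, which is the intuition behind treating sessions as bags of instances.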

  50. WDVD vs. Baselines
      • WDVD (our approach): Wikidata Vandalism Detector
      • FILTER (baseline): Wikidata Abuse Filter
      • ORES (baseline): Objective Revision Evaluation Service
