Using Language Modeling for Spam Detection in Social Reference Manager Websites
Toine Bogers and Antal van den Bosch
DIR 2009
February 3, 2009

Outline
• Introduction
• Methodology
• Our approach
• Results
• Discussion

Social reference managers
• Social bookmarking for scientific papers (and Web pages)
  – Intended to support researchers in sharing references and bibliographies
  – Several features
    • Article metadata
    • BibTeX, RIS, EndNote, etc.
    • Tagging
  – Examples
    • CiteULike
    • BibSonomy
    • Connotea

Spam
• In a social bookmarking context:
  – Users posting content and tags designed to mislead others
• Open questions
  – How big of a problem is it?
  – How harmful to which task?
  – How can we deal with it?
  – Little research done

Task
• Task definition taken from the 2008 Discovery Challenge
  – Annually organized data mining competition
  – Two tasks in 2008
    • Tag recommendation
    • Spam detection
• Spam detection task
  – Learn a model that predicts spam at the user level
  – Equal to detecting spam users
  – Organizers provided a pre-labeled data set
  – All of a spam user's posts are labeled as spam

Data sets
• BibSonomy
  – Provided by Discovery Challenge organizers
  – Dump of BibSonomy ranging from the beginning of 2006 to March 31, 2008
  – Approx. 39,000 users and > 2 million posts
  – Divided into a training and a test set
  – Percentage of spam users is 93.2%
• CiteULike
  – Used a public November 2007 dump as starting point
  – Randomly selected ~20% subset (5,200 users) to annotate
  – Straightforward interface showed 5 random posts to annotators
  – Percentage of spam users is 28.1%
  – Many spam posts in data dump are filtered from CiteULike website
    • So metadata for spam posts not consistently available!

Data representation
• BibSonomy
  – Treated bookmarks and BibTeX the same
  – Divided the metadata into 4 different fields: TITLE, DESCRIPTION, TAGS, and URL
  – Normalized the URL (tokenization, removal of common prefixes/suffixes)
• CiteULike
  – Clean posts had metadata, but most spam posts did not
  – Used only TAGS metadata for a fair comparison
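
The URL normalization step can be sketched roughly as follows. The slides do not list which prefixes and suffixes were removed, so the two token sets below are hypothetical, and the function drops those tokens wherever they occur rather than only at the ends, which is a simplification.

```python
import re

# Hypothetical stop lists; the slides do not enumerate the actual prefixes/suffixes.
COMMON_PREFIXES = {"http", "https", "www"}
COMMON_SUFFIXES = {"com", "org", "net", "html", "htm"}


def normalize_url(url: str) -> str:
    """Lowercase a URL, split it on non-alphanumeric characters,
    and drop the common prefix/suffix tokens listed above."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
    kept = [
        t for t in tokens
        if t not in COMMON_PREFIXES and t not in COMMON_SUFFIXES
    ]
    return " ".join(kept)


if __name__ == "__main__":
    # Made-up URL, used only to show the shape of the output.
    print(normalize_url("http://www.example.org/papers/index.html?id=42"))
```

The result is a space-separated token string of the same kind as the URL field in the example post, so the URL can be treated like any other text field when building language models.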

Example of a clean post
<DOC>
<DOCNO> 694792 </DOCNO>
<TITLE> When Can We Call a System Self-Organizing </TITLE>
<DESCRIPTION> ECAL Carlos Gershenson and Francis Heylighen </DESCRIPTION>
<TAGS> search agents ir todo </TAGS>
<URL> springerlink metapress openurl asp genre article issn 0302 9743 volume 2801 spage 606 </URL>
</DOC>
[Slide callouts: "author" and "booktitle", marking metadata that ends up in the DESCRIPTION field]

Experimental setup & evaluation
• Experimental setup
  – BibSonomy: pre-defined split in training and test material (38,920 users in total)
    • Official training material divided in 80-20 split on users
    • 80% training set (25,372 users)
    • 20% validation set for parameter optimization (6,343 users)
    • Official test set (7,205 users)
  – CiteULike
    • 60% training set (4,160 users)
    • 20% validation set for parameter optimization (520 users)
    • 20% test set (520 users)
• Evaluation metric
  – AUC (Area Under the ROC Curve)
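
As a reminder of what the evaluation metric measures, here is a minimal sketch of AUC computed from its pairwise definition: the probability that a randomly chosen spam item receives a higher score than a randomly chosen clean item, with ties counted as one half. The function name and the toy data are illustrative, not taken from the experiments.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for spam, 0 for clean.
    scores: higher means "more likely spam".
    Quadratic in the number of items, which is fine for a small illustration.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one spam and one clean example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: three of the four spam/clean pairs are ordered correctly.
    print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

Because AUC depends only on how items are ranked, it stays informative even when the classes are heavily imbalanced, as with the 93.2% spam users in the BibSonomy data.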

Our approach
• Inspired by Mishne et al. (2005) for blog spam
• Approach based on similar language use of similar users
  – We compare language models of spam and 'genuine' content
• Two-stage approach
  – Determining most similar matching content using language models
  – Let the most similar matches determine the spam label

Matching language models
• At what level should we compare our language models?
[Figure: language models of SPAM and CLEAN content]

Matching language models
[Figure: two matching setups. User-level matching: a new user is matched against a collection of user profiles, giving a ranked list of users. Post-level matching: each of a new user's posts is matched against a collection of posts, giving a ranked list per post.]

Matching language models
• (Dis)similarity between LMs calculated using KL-divergence
  – Used Indri Toolkit for experiments
• Experimented with all fields combined and all 4 fields separately
  – 9 different matchings
[Figure: the TITLE, DESCRIPTION, TAGS, and URL fields of new users/posts matched against the same fields of the collection (training set)]
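
To make the KL-divergence comparison concrete, here is a minimal sketch using unigram language models. It is not the Indri implementation: the Jelinek-Mercer smoothing, the `lam` value, the direction KL(new || candidate), and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def background_model(token_lists):
    """Maximum-likelihood unigram model over all given token lists."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def unigram_lm(tokens, background, lam=0.9):
    """Unigram model smoothed with the background model (Jelinek-Mercer, assumed)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        w: lam * (counts[w] / total if total else 0.0) + (1 - lam) * p_bg
        for w, p_bg in background.items()
    }


def kl_divergence(p, q):
    """KL(p || q) in nats; p and q must share a vocabulary with q > 0 everywhere."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


if __name__ == "__main__":
    # Made-up tag lists, not taken from the data sets.
    new_post = ["cheap", "pills", "online"]
    spam_post = ["cheap", "pills", "casino"]
    clean_post = ["search", "agents", "ir"]
    bg = background_model([new_post, spam_post, clean_post])
    p_new, p_spam, p_clean = (unigram_lm(t, bg) for t in (new_post, spam_post, clean_post))
    print(kl_divergence(p_new, p_spam), kl_divergence(p_new, p_clean))
```

A lower divergence means more similar language use, so in this toy example the new post ends up closer to the spam post than to the clean one. Smoothing with the background model keeps every probability non-zero, which the logarithm requires.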

Spam classification
• After the matching phase we get a normalized ranking
  – Each user/post has a score between 0 and 1 and a binary spam label
• Questions
  – How many of the top k matches help determine the final label?
    • Optimized on AUC, from k = 1 to k = 1000
  – How do the top k matches contribute towards the final label?
    • Simplest: take top label
    • A bit more sophisticated: take average label among top k
    • What we did: take average label, weighted by normalized score

$$\mathrm{score}(u_i) = \frac{\sum_{r=1,\, r \neq i}^{k} \mathrm{sim}(u_i, u_r) \cdot \mathrm{label}(u_r)}{k}$$

      where u_r are the top k matching users at ranks 1 to k
  – At the post level we get per-post weighted average scores
    • Simple average of per-post scores is then calculated for each test user
[Figure: ranked list of the top matches, each labeled SPAM or CLEAN]
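
The weighted-average scoring can be sketched as below. The sketch assumes the ranked matches already exclude the item being scored (the r ≠ i condition) and that labels are 1 for spam and 0 for clean; the function names and data layout are illustrative.

```python
def weighted_spam_score(matches, k):
    """Score one user or post from its ranked matches.

    matches: list of (similarity, label) pairs sorted by descending
             normalized similarity in [0, 1]; label is 1 (spam) or 0 (clean).
    Returns the sum of similarity * label over the top k matches, divided by k,
    following the formula on the slide.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    top = matches[:k]
    return sum(sim * label for sim, label in top) / k


def user_score_from_posts(post_scores):
    """Post-level variant: simple average of a user's per-post scores."""
    if not post_scores:
        raise ValueError("user has no posts")
    return sum(post_scores) / len(post_scores)


if __name__ == "__main__":
    # Made-up ranking: spam matches at ranks 1 and 3, clean at ranks 2 and 4.
    ranked = [(1.0, 1), (0.8, 0), (0.5, 1), (0.2, 0)]
    print(weighted_spam_score(ranked, k=3))            # (1.0 + 0.5) / 3 = 0.5
    print(user_score_from_posts([0.5, 0.25, 0.75]))    # 0.5
```

In this form, k would be chosen by sweeping the candidate values on the validation set and keeping the one with the highest AUC.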

Results (AUC)

                                                               User level                Post level
Collection                                      Fields         Validation  Test    k     Validation  Test    k
BibSonomy (matching fields)                     all fields     0.9682      0.9661  235   0.9571      0.9536  50
                                                title          0.9290      0.9450  150   0.9055      0.9287  45
                                                description    0.9055      0.9452  100   0.8802      0.9371  100
                                                tags           0.9724      0.9073  110   0.9614      0.9088  60
                                                URL            0.8785      0.8523  35    0.8489      0.8301  8
BibSonomy (single fields in evaluation sets)    all fields     0.9682      0.9661  235   0.9571      0.9536  50
                                                title          0.9300      0.9531  140   0.9147      0.9296  50
                                                description    0.9113      0.9497  90    0.8874      0.9430  75
                                                tags           0.9690      0.9381  65    0.9686      0.9251  95
                                                URL            0.8830      0.8628  15    0.8727      0.8369  15
CiteULike                                       tags           0.9329      0.9240  5     0.9262      0.9079  5

Results
[Figure: ROC curves (best runs), TP rate against FP rate, for BibSonomy at post level, BibSonomy at user level, CiteULike at post level, and CiteULike at user level]
