SLIDE 1

Web Spam Challenges

Carlos Castillo
Yahoo! Research
chato@yahoo-inc.com

AIRWeb 2009

SLIDE 2

WEBSPAM-UK200[67]

  • Got a crawl from Boldi/Vigna/Santini in 2006Q3
  • Wrote to 20-30 people to ask for volunteers
    – Most said yes, and most of them didn't defect
  • Created an interface for labeling
    – 3-4 days of work to get it right
  • Labeled a few thousand elements together
  • Then did basically the same again in 2007Q3
SLIDE 3

Why is it good to do collaborative labeling?

  • The labeling reflects some degree of consensus (see
    the sketch after this list)
    – After-the-fact methodological discussions can be
      very distracting, and in fact there were none
  • Webmasters do not harass you
    – Responsibility is shared, and furthermore you tell
      search engines not to use these labels
  • Labellers get insights about the problem
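The consensus point lends itself to a small illustration. Below is a minimal sketch of majority-vote aggregation of per-host judgments; the `judgments` structure, the label vocabulary, and the `min_votes` threshold are illustrative assumptions, since the talk does not describe the actual aggregation rule used for WEBSPAM-UK200[67].

```python
from collections import Counter

def consensus_labels(judgments, min_votes=2):
    """Aggregate per-host judgments into a consensus label by
    majority vote; hosts without a clear majority stay undecided."""
    labels = {}
    for host, votes in judgments.items():
        label, top = Counter(votes).most_common(1)[0]
        # Require a strict majority and a minimum number of votes.
        if top >= min_votes and top > len(votes) / 2:
            labels[host] = label
        else:
            labels[host] = "undecided"
    return labels

# Hypothetical input: host -> list of judge labels
judgments = {
    "example-a.co.uk": ["spam", "spam", "nonspam"],
    "example-b.co.uk": ["nonspam", "nonspam"],
    "example-c.co.uk": ["spam", "nonspam"],
}
print(consensus_labels(judgments))
# {'example-a.co.uk': 'spam', 'example-b.co.uk': 'nonspam',
#  'example-c.co.uk': 'undecided'}
```

In practice one would also report an agreement statistic (e.g., Fleiss' kappa) alongside the aggregated labels.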
SLIDE 4

Why is it bad to do collaborative labeling?

  • In this particular problem, it is very expensive
    – You get more labels for less money if you just pay
      Mechanical Turkers (MTs) for the labels

SLIDE 5

Lessons (learned?)

  • Would do WEBSPAM-UK2006 waiting for the 2nd or 3rd
    crawl instead of using the 1st one
  • Would try to raise money for WEBSPAM-UK2007
    and do it with MTs
    – If the money were enough, try to go for a larger
      collection

SLIDE 6

Web Spam Challenges

  • The good
    – Saved participants a lot of processing, thus ...
    – Got several submissions with diverse approaches
    – Baseline was strong, but not too strong
    – Side-effect: good dataset for learning on graphs
  • The bad
    – Train/test splits at host level (I, fixed in III;
      see the sketch below)
    – Snowball sampling (II, fixed in III)
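The slide flags the Track I splitting protocol as a weakness without spelling out the details. One generic hazard when splitting labeled hosts drawn from a single crawl is leakage through closely related hosts, for example subdomains of one registered domain landing on both sides of the boundary. The following audit sketch is hypothetical, not the Track III fix; `registered_domain` is a crude stand-in for a real Public Suffix List lookup.

```python
def registered_domain(host):
    """Crude illustrative heuristic: last three labels for .co.uk-style
    hosts, last two otherwise; real code would use the Public Suffix List."""
    parts = host.lower().split(".")
    k = 3 if parts[-2:] == ["co", "uk"] else 2
    return ".".join(parts[-k:])

def split_leakage(train_hosts, test_hosts):
    """Report registered domains that appear on both sides of a split."""
    train_domains = {registered_domain(h) for h in train_hosts}
    test_domains = {registered_domain(h) for h in test_hosts}
    return train_domains & test_domains

# Hypothetical hosts: one domain straddles the split.
train = ["www.example.co.uk", "news.bbc.co.uk"]
test = ["shop.example.co.uk", "other.org.uk"]
print(split_leakage(train, test))  # {'example.co.uk'}
```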

SLIDE 7

Lessons (learned?)

  • Would do mostly the same
  • Avoid the mistakes
  • Promote the competition much more
    – Try to appeal to a wider audience
    – Get sponsorship for a prize

SLIDE 8

What is the point of all this?

  • Remove a roadblock for researchers working on a topic
  • Encourage multiple approaches to a certain problem,
    in parallel
  • Keep web data flowing into universities
  • Allow repeatability of the results
SLIDE 9

So, if a new dataset+challenge appears

  • It has to be a good problem: novel, difficult, and
    with a range of potential applications
  • Why? Because if we are going to encourage many
    information-retrieval researchers to work on this
    problem, there has to be a large treasure chest to
    split at the end

SLIDE 10

Good signals to look for

  • “The dataset for X removed a roadblock towards a
    complex information-retrieval problem for which no
    other dataset existed”
  • “Research about X was only done inside companies
    before”
  • “Problem X was increasingly threatening Web-based
    work/search/collaboration/etc.”

SLIDE 11

Ideas (1/4)

  • Disruptive or non-cooperative behaviour in:
    – peer-production sites
      • Examples: review/opinion/tag/tag-as-vote/vote spam
      • Adversary: wants to promote his agenda/business
    – social networks
      • Example: find fake users
      • Adversary: wants to be seen as multiple independent
        people
      • Example: find users who are too aggressive in
        promoting their own stuff? Most social networks have
        norms against it (Wikipedia/kuro5hin/Digg/etc.)

SLIDE 12

Ideas (2/4)

  • Plagiarism or missing attribution
    – Web-scale automatic identification of sources for
      the statements in a document (see the sketch below)
    – Adversary: wants to make his posting look original
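One standard building block for source identification is w-shingling with Jaccard similarity, sketched below. This is a generic technique, not a method proposed in the talk; at web scale the exact set intersection would be approximated with MinHash/LSH.

```python
def shingles(text, w=4):
    """Set of w-word shingles, a standard near-duplicate fingerprint."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical candidate posting vs. a suspected source.
candidate = "the quick brown fox jumps over the lazy dog near the river"
source = "a quick brown fox jumps over the lazy dog by the river bank"
sim = jaccard(shingles(candidate), shingles(source))
print(f"shingle overlap: {sim:.2f}")  # non-trivial overlap suggests copied wording
```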

SLIDE 13

Ideas (3/4)

  • Checking/joining facts on the Web (see the sketch
    after this list)
    – “The capital of Spain is Toledo (Wikipedia: Madrid)”
    – “The oil spill from the tanker has killed 500 seals
      (BBC: 541 seals; FOX: 2 anchovies)”
    – Adversary: wants you to believe something wrong
    – Related problem: revealing networks of mutually-
      reinforcing sites pushing a certain agenda
  • An aspect of credibility on the Web (there is already
    a workshop on that)
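A minimal sketch of the "joining" step, assuming statements have already been extracted as (source, subject, relation, value) tuples; that extraction is the hard part, which this sketch takes as given. Facts are grouped by (subject, relation) and disagreements across sources are flagged.

```python
from collections import defaultdict

def join_facts(extractions):
    """Group extracted facts by (subject, relation) and flag
    claims whose reported values disagree across sources."""
    by_key = defaultdict(dict)
    for source, subject, relation, value in extractions:
        by_key[(subject, relation)][source] = value
    conflicts = {k: v for k, v in by_key.items() if len(set(v.values())) > 1}
    return by_key, conflicts

# Hypothetical extractions, mirroring the slide's examples.
extractions = [
    ("some-page", "Spain", "capital", "Toledo"),
    ("Wikipedia", "Spain", "capital", "Madrid"),
    ("BBC", "tanker-spill", "seals-killed", "541"),
]
facts, conflicts = join_facts(extractions)
print(conflicts)
# {('Spain', 'capital'): {'some-page': 'Toledo', 'Wikipedia': 'Madrid'}}
```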

SLIDE 14

Ideas (4/4)

  • Simpler problem: validating citations (see the sketch
    after this list)
    – Does this citation to page P validate the claim it
      is cited for? Where in P, specifically?
    – Adversary: wants to convince you of something that
      is not supported by the pages he is linking to
  • E.g., someone wants to convince you that Einstein
    believed in a personal God by quoting him selectively,
    but you have access to all his books/letters/etc.
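A minimal sketch of the validation step for one citation, assuming only lexical overlap is available: score each sentence of the cited page P against the claim and return the best match. The Einstein example deliberately exposes the failure mode, since word overlap ignores negation; a usable validator needs textual entailment on top of this retrieval step.

```python
import re

def best_supporting_sentence(claim, page_text):
    """Score each sentence of the cited page by word overlap with the
    claim; a low best score hints the citation may not support it."""
    claim_words = set(re.findall(r"\w+", claim.lower()))
    best, best_score = None, 0.0
    for sentence in re.split(r"(?<=[.!?])\s+", page_text):
        words = set(re.findall(r"\w+", sentence.lower()))
        score = len(claim_words & words) / len(claim_words) if claim_words else 0.0
        if score > best_score:
            best, best_score = sentence, score
    return best, best_score

claim = "Einstein believed in a personal God"
page = ("I believe in Spinoza's God. "
        "I do not believe in a personal God and have never denied this.")
sentence, score = best_supporting_sentence(claim, page)
print(f"{score:.2f} -> {sentence}")
# 0.67 -> the best-overlap sentence actually contradicts the claim,
# which is why overlap alone cannot validate a citation.
```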

SLIDE 15

Summary of proposals

  • Non-cooperative behaviour in peer-production networks

  • Disruptive usage of social networking sites
  • Distortions or falsehoods on the web
  • Citations: missing attribution (plagiarism)
  • Citations: distorted attribution (invalid citation)