Identifying Web Spam With User Behavior - PowerPoint PPT Presentation

Identifying Web Spam With User Behavior Analysis
Yiqun Liu, Rongwei Cen, Min Zhang, Shaoping Ma, Liyun Ru
State Key Lab of Intelligent Tech. & Sys., Tsinghua University
2008/04/23

Introduction: simple math
• How many spam pages are there on the Web?
  – Over 10% (Fetterly et al. 2004, Gyöngyi et al. 2004)
  – The Web has 152 billion pages (How Much Info project 2003)
• How many pages can a search engine index?
  – Tens of billions (Google: 8 billion in 2004, Yahoo: 20 billion in 2005)
• The number of spam pages is comparable to or larger than search engines' index sizes
• Without spam detection, a search index will fill up with useless pages
• We have developed many spam detection methods. However ...
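
The "simple math" on this slide can be checked directly. The sketch below uses only the figures quoted above (a 10% spam share as a lower bound, 152 billion pages, and the two quoted index sizes); the variable names are illustrative.

```python
# Figures quoted on the slide; integer arithmetic avoids float rounding.
WEB_PAGES = 152_000_000_000      # How Much Info project, 2003
SPAM_SHARE_PERCENT = 10          # lower bound: "over 10%"
INDEX_SIZES = {
    "Google (2004)": 8_000_000_000,
    "Yahoo (2005)": 20_000_000_000,
}

# Lower-bound estimate of the number of spam pages on the Web.
spam_pages = WEB_PAGES * SPAM_SHARE_PERCENT // 100

for engine, size in INDEX_SIZES.items():
    ratio = spam_pages / size
    print(f"{engine}: spam estimate is {ratio:.2f}x the index size")
```

With these numbers the lower-bound spam estimate is 15.2 billion pages, which is the same order of magnitude as the quoted index sizes.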

Introduction
• Search for "N95 battery time" with a certain Chinese search engine on 08/04/17
  – Result #1: a cloaking spam page
  – Result #3: the page cannot be reached (the cache shows a content spam page)
  – Result #4: a search result page from another engine with ads (also content spam)

Introduction
• Problem: spam detection is a never-ending process
  – Good news for anti-spam engineers!
  – Bad news for Web users and search engines
• Are detection methods not effective?
  – No! Many studies report over 90% detection accuracy (Ntoulas et al. 2006, Saito et al. 2007, Lin et al. 2007, …)
• Are detection methods not timely?
  – Yes! When a new kind of spam appears, it takes anti-spam engineers a long time to notice it.

Introduction
• How does spam make a profit?
[Figure: UV / profit plotted against time for a certain kind of Web spam technique, with time points T1 and T2 marked]

Introduction
• Important: find new kinds of spam as soon as possible
  – Detect a new kind of Web spam technique in a timely manner
  – Reduce the spam profit
  – When profit < cost, spam stops

User-behavior Features
• Users are the first to notice the existence of a new spam page
  – How to use the wisdom of crowds to detect spam?
    • Social annotation? (possibly noisy)
    • Web access log analysis
  – Web access logs
    • Collected by a commercial search engine
    • July 1st, 2007 to August 26th, 2007
    • 2.74 billion user clicks on 800 million Web pages

User-behavior Features
• The behavior features we propose
  – How many user visits come from search engines?
  – How many users follow links on the page?
  – How many users will not visit the site again in the future?
  – How many user visits come from hot keyword searches?
  – How many pages does a given user visit within the site?
  – How many users visit the site?
  – …

User-behavior Features
• Search engine oriented visiting rate (SEOV rate)
  – Web spam is designed to get "an unjustifiably favorable relevance or importance score" from search engines (Gyöngyi et al. 2005)
  – Assumption: most user visits to Web spam come from search engine result lists
  – Definition:
    $$\mathrm{SEOV}(p) = \frac{\#(\text{search engine oriented visits of } p)}{\#(\text{visits of } p)}$$
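
The SEOV definition above can be sketched over a toy access log. The record layout (a page plus a flag saying whether the visit came from a search result list), the function name `seov_rates`, and the example URLs are assumptions made for illustration; the slides do not describe the log format.

```python
from collections import Counter

def seov_rates(visits):
    """Return {page: SEOV rate} from (page, came_from_search_engine) records."""
    total = Counter()
    from_search = Counter()
    for page, came_from_search in visits:
        total[page] += 1
        if came_from_search:
            from_search[page] += 1
    return {page: from_search[page] / total[page] for page in total}

# Hypothetical log: (visited page, whether the referrer was a search result list)
log = [
    ("spam.example/a", True),
    ("spam.example/a", True),
    ("spam.example/a", True),
    ("spam.example/a", False),
    ("news.example/b", False),
    ("news.example/b", True),
    ("news.example/b", False),
    ("news.example/b", False),
]
rates = seov_rates(log)
print(rates)
```

In this toy log the spam-like page gets 3 of its 4 visits from search results, so its SEOV rate is 0.75, while the ordinary page's rate is 0.25.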

User-behavior Features
• SEOV rate distribution
[Figure: percentage of ordinary pages and spam pages in each SEOV interval, from 0-0.1 to 0.9-1]
  – Some spam pages do not receive many user visits (UV) from search engines, either.
  – Most ordinary pages' user visits are not from search engines.

User-behavior Features
• Source page rate (SP rate)
  – Spam pages are usually designed to show users ads or low-quality information at first glance
  – Users don't trust hyperlinks on spam pages
  – Assumption: most Web users will not follow hyperlinks on spam pages
  – Definition:
    $$\mathrm{SP}(p) = \frac{\#(p \text{ appears as the source page})}{\#(p \text{ appears in the Web access logs})}$$
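
A minimal sketch of the SP definition above. It assumes each log record is a click of the form (source page, destination page) and that a page "appears" once for every record position it occupies; both points are assumptions, as are the function name `sp_rates` and the example URLs.

```python
from collections import Counter

def sp_rates(clicks):
    """Return {page: SP rate} from (source_page, destination_page) click records."""
    as_source = Counter()
    appearances = Counter()
    for source, destination in clicks:
        as_source[source] += 1
        appearances[source] += 1
        appearances[destination] += 1
    return {page: as_source[page] / appearances[page] for page in appearances}

# Hypothetical click records
clicks = [
    ("search.example/results", "spam.example/a"),
    ("search.example/results", "spam.example/a"),
    ("search.example/results", "spam.example/a"),
    ("search.example/results", "portal.example/home"),
    ("portal.example/home", "portal.example/news"),
    ("portal.example/home", "portal.example/sports"),
    ("spam.example/a", "ads.example/x"),
]
rates = sp_rates(clicks)
print(rates)
```

Here the spam-like page is reached three times but acts as a source only once (SP = 0.25), whereas the portal home page is a source in two of its three appearances.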

User-behavior Features
• SP rate distribution
[Figure: percentage of ordinary pages and spam pages in each SP interval, from <0.05 to >0.45]
  – Users click hyperlinks on some spam pages, too (they may be cheated by anchor texts).
  – Half of the spam pages have very small SP values.

User-behavior Features
• Short-time Navigation Rate (SN rate)
  – Users cannot be cheated again and again within a short time period
  – Assumption: most Web users will not visit a spam site many times in the same user session
  – Definition (N is a parameter):
    $$\mathrm{SN}(s) = \frac{\#(\text{sessions in which users visit fewer than } N \text{ pages in } s)}{\#(\text{sessions in which users visit } s)}$$
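
The SN definition above can be sketched as follows. The session representation (a list of sites, one entry per page view) and the reading of "pages visited" as page views rather than distinct pages are assumptions; `sn_rates` and the site names are hypothetical.

```python
from collections import Counter

def sn_rates(sessions, n=3):
    """Return {site: SN rate}; each session is a list of sites, one entry per page view."""
    sessions_with_site = Counter()
    short_sessions = Counter()
    for page_views in sessions:
        per_site = Counter(page_views)  # page views per site within this session
        for site, views in per_site.items():
            sessions_with_site[site] += 1
            if views < n:
                short_sessions[site] += 1
    return {s: short_sessions[s] / sessions_with_site[s] for s in sessions_with_site}

# Hypothetical sessions
sessions = [
    ["spam.example", "forum.example", "forum.example", "forum.example"],
    ["spam.example"],
    ["forum.example", "forum.example", "forum.example", "forum.example"],
    ["spam.example", "spam.example", "forum.example"],
]
rates = sn_rates(sessions, n=3)
print(rates)
```

With N = 3, every session that touches the spam-like site views fewer than three of its pages (SN = 1.0), while only one of the three forum sessions is that short.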

User-behavior Features
• SN rate distribution (N = 3)
[Figure: percentage of ordinary pages and spam pages in each SN interval, from 0-0.1 to 0.9-1]
  – A number of ordinary pages also receive few user visits in a session (redirection sites, low-quality sites, …).
  – Few spam pages are visited more than 2 times in a session.

User-behavior Features
• Correlation values between these features
  – Different assumptions
  – Different information sources
  – Relatively low correlation
  – Possible to use Bayes learning methods

          SEOV     SP       SN
  SEOV    1.0000   0.1981   0.1780
  SP      0.1981   1.0000   0.0460
  SN      0.1780   0.0460   1.0000
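
The slide does not say which correlation coefficient was used. The sketch below assumes Pearson correlation and uses made-up feature values for five pages, so it only illustrates how such a matrix could be computed; it does not reproduce the numbers above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical feature values, one entry per page
features = {
    "SEOV": [0.9, 0.2, 0.7, 0.1, 0.5],
    "SP": [0.05, 0.40, 0.30, 0.10, 0.45],
    "SN": [0.8, 0.6, 0.9, 0.3, 0.7],
}
names = list(features)
matrix = {a: {b: pearson(features[a], features[b]) for b in names} for a in names}

for a in names:
    print(a, [round(matrix[a][b], 4) for b in names])
```

Low off-diagonal values are what makes the independence assumption behind a naive Bayes style combination more plausible.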

Detection algorithm
• Problem:
  – Uniform sampling of negative examples (pages that are not spam) is difficult
• Solution:
  – Learn from positive examples (Web spam) and unlabelled data (Web corpus)
  – Calculate the probability of a page p being Web spam using user behavior features

Detection algorithm
• For a single feature A:
    $$P(p \in \mathrm{Spam} \mid p \text{ has feature } A) \propto \frac{\#(p \text{ has feature } A \cap p \in \text{Spam sample set})}{\#(\text{Spam sample set})} \Big/ \frac{\#(p \text{ has feature } A)}{\#(\mathrm{CORPUS})}$$
• For three features SEOV, SP and SN:
  – Features are approximately independent as well as conditionally independent given the target value
    $$P(p \in \mathrm{Spam} \mid p \text{ has features } A_1, A_2, \ldots, A_n) \propto \prod_{i=1}^{n} \left( \frac{\#(p \text{ has feature } A_i \cap p \in \text{Spam sample set})}{\#(\text{Spam sample set})} \Big/ \frac{\#(p \text{ has feature } A_i)}{\#(\mathrm{CORPUS})} \right)$$
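
A minimal sketch of how the proportional score above could be computed from a spam sample set and an unlabelled corpus. It reads "p has feature A" as "the feature value of p falls into interval A" and uses 10 equal-width intervals; that discretization, the helper names, and the toy data are assumptions, not details given in the slides.

```python
from collections import Counter

BINS = 10  # assumption: each rate in [0, 1] is discretized into 10 intervals

def bin_of(value):
    """Index of the interval that a rate in [0, 1] falls into."""
    return min(int(value * BINS), BINS - 1)

def feature_ratio(spam_values, corpus_values):
    """Per interval: [#(A and spam sample) / #(spam sample)] / [#(A) / #(corpus)]."""
    spam_bins = Counter(bin_of(v) for v in spam_values)
    corpus_bins = Counter(bin_of(v) for v in corpus_values)
    return {
        b: (spam_bins[b] / len(spam_values)) / (corpus_bins[b] / len(corpus_values))
        for b in corpus_bins
    }

def spam_score(page, ratios):
    """Product of per-feature ratios; proportional to P(spam), not a probability."""
    score = 1.0
    for name, value in page.items():
        score *= ratios[name].get(bin_of(value), 0.0)
    return score

# Hypothetical corpus; the first two pages form the labelled spam sample set.
corpus = [
    {"SEOV": 0.95, "SP": 0.02, "SN": 0.95},
    {"SEOV": 0.92, "SP": 0.03, "SN": 0.91},
    {"SEOV": 0.15, "SP": 0.42, "SN": 0.35},
    {"SEOV": 0.12, "SP": 0.47, "SN": 0.25},
    {"SEOV": 0.97, "SP": 0.04, "SN": 0.93},  # unlabelled, but spam-like
    {"SEOV": 0.25, "SP": 0.33, "SN": 0.45},
]
spam_sample = corpus[:2]

ratios = {
    name: feature_ratio([p[name] for p in spam_sample], [p[name] for p in corpus])
    for name in ("SEOV", "SP", "SN")
}
scores = [spam_score(p, ratios) for p in corpus]
print(scores)
```

The unlabelled page at index 4 shares its intervals with the spam samples and gets the same high score, while pages in intervals never seen in the spam sample score 0. A real system would probably smooth those zero counts.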

Detection algorithm
• Algorithm description

Experimental Results
• Experiment setup
  – Training set:
    • 802 spam sites
    • Collected from the result lists of the hottest search queries
  – Test set:
    • 1564 Web sites, each annotated as spam or not spam
    • 345 spam, 1060 non-spam, 159 cannot tell
    • The percentage of spam is higher than the estimates given by Fetterly et al. and Gyöngyi et al. (we only retain sites that are visited at least 10 times)

Experimental Results
• How to evaluate the performance
  – Focus: find recently appeared spam types (not detect all possible spam types)
  1: Are the spam candidates identified by this algorithm really Web spam? (effectiveness)
  2: Does this algorithm detect spam types in a more timely way than current search engines? (timeliness)
  3: Which feature is more effective?

Experimental Results
• Detection performance (effectiveness)
  – Are the top-ranked candidates Web spam?
  – 300 pages with the highest P(Spam) values
    • Only 6% are not Web spam (low-quality pages, SEO pages)
    • Many spam types can be identified (wisdom of crowds)

  Page Type                              Percentage
  Non-spam pages                         6.00%
  Web spam pages (Content spamming)      21.67%
  Web spam pages (Link spamming)         23.33%
  Web spam pages (Other spamming)        10.67%
  Pages that cannot be accessed          38.33%

Experimental Results
• Detection performance (timeliness)
  – Experiments with one of the most frequently used Chinese search engines (referred to as X)
  – Recent data: access logs from 08/02/04 to 08/03/02
  – Top-ranked spam candidate sites
    • 723/1000 are spam sites (some could not be connected to)
    • X had indexed 34 million pages from these 723 sites in early March
    • 59 million pages had been indexed by X by the end of March
  – These spam sites were not detected by X; X spent a lot of resources on these useless pages

Experimental Results
• Detection performance (algorithm & features)
  – The AUC value of the detection algorithm is about 80%
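
For readers unfamiliar with AUC, the sketch below computes it as the fraction of (spam, non-spam) pairs in which the spam item receives the higher score, with ties counted as one half. The scores and labels are made up for illustration and are unrelated to the experiment reported on the slide.

```python
def auc(scores, labels):
    """Fraction of (spam, non-spam) pairs ranked correctly; ties count as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical detector scores and ground-truth labels (1 = spam, 0 = non-spam)
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
value = auc(scores, labels)
print(round(value, 4))
```

In this toy example 8 of the 9 spam/non-spam pairs are ordered correctly. The pairwise loop is quadratic and is meant for clarity; a rank-based formulation scales better on large test sets.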

Experimental Results
• Detection performance (algorithm & features)
  – The learning algorithm performs better than any single feature

Experimental Results
• Detection performance (algorithm & features)
  – SN performs the worst
  – Examples: Q&A portals, audio/video sharing sites

Conclusions
• The amount of Web spam may exceed search engine index sizes
• Timeliness is as important as effectiveness in spam detection
• User behavior features can be used to find recently appeared spam types in a timely and effective manner

