Web Spam. Marc Spaniol, Saarbrücken, July 23, 2009. PowerPoint presentation


SLIDE 1

Databases and Information Systems

  • Prof. Dr. G. Weikum

Marc Spaniol MPII-Sp-0709-1/49 Web Spam

Web Spam

Marc Spaniol

Saarbrücken, July 23, 2009

Web Dynamics

SLIDE 2

Agenda

  • Web spam
  • Why and what?
  • Spam taxonomy
  • Overview
  • Strategies in detail
  • Link spam
  • Link farms
  • Examples
  • Countermeasures
  • Spam detection
  • Labeling and assessment
  • Combating spam
  • Web spam challenge
  • Conclusion

SLIDE 3

Web Spam Why?

[Figure: time spent looking at each hit position; time elapsed to reach each hit position] [Granka, Joachims, Gay 2004]

“Spam industry had a revenue potential of $4.5 billion in year 2004 if they had been able to completely fool all search engines on all commercially viable queries” [Amitay 2004]

SLIDE 4

Web Spam What’s the Problem?

2004 .de crawl (courtesy: T. Suel):

  • Reputable 70.0%
  • Spam 16.5%
  • Weborg 0.8%
  • Ad 3.7%
  • Non-existent 7.9%
  • Empty 0.4%
  • Alias 0.3%
  • Unknown 0.4%

SLIDE 5

Web Spam

  • Target of spammers
  • Not end users (directly)
  • High revenue from customers for search engine “optimization” (especially Google)
  • Indirect revenue
  • Affiliate programs, Google AdSense
  • Ad display, traffic funneling
  • Spam taxonomy
  • Content spam
  • Keywords
  • Popular expressions
  • Mis-spellings
  • Link spam “farms”
  • Densely connected sites
  • Redirects
  • Cloaking and hiding
  • Spam in social media

[Benczúr et al. 2008]

SLIDE 6

Overview

[Diagram: taxonomy of spamming techniques. Boosting: term, links. Hiding: content hiding, cloaking, redirection]

SLIDE 7

Spammed Ranking Elements

  • Term frequency (tf in the tf.idf, Okapi BM25 etc. ranking schemes)
  • Term frequency weighted by HTML elements
  • Title
  • Headers
  • Font size
  • Face
  • Heaviest weight in ranking
  • URL, domain name part
  • Anchor text: <a href="…">Best Saarbruecken nightlife</a>
  • Structural information
  • URL length
  • Depth from server root
  • Indegree
  • PageRank
  • Link based centrality

All Web information retrieval ranking elements spammed

SLIDE 8

Content Spam

  • Domain name

adjustableloanmortgagemastersonline.compay.dahannusaprima.co.uk buy-canon-rebel-20d-lens-case.camerasx.com

  • Anchor text (title, H1, etc)

<a href="target.html">free, great deals, cheap, inexpensive, cheap, free</a>

  • Meta keywords

<meta name="keywords" content="UK Swingers, UK, swingers, swinging, genuine, adult contacts, connect4fun, sex, …">

[Gyöngyi, Garcia-Molina, 2005]

SLIDE 9

Parking Domain

<div style="position:absolute; top:20px; width:600px; height:90px; overflow:hidden;"><font size=-1>atangledweb.co.uk currently offline<br>atangledweb.co.uk back soon<br></font><br><br><a href="http://www.atangledweb.co.uk"><font size=-1>atangledweb.co.uk</font></a><br><br><br> Soundbridge HomeMusic WiFi Media Play<a class=l href="http://www.atangledweb.co.uk/index01.html">-</a>... SanDisk Sansa e250 - 2GB MP3 Player -<a class=l href="http://www.atangledweb.co.uk/index02.html">-</a>... AIGO F820+ 1GB Beach inspired MP3 Pla<a class=l href="http://www.atangledweb.co.uk/index03.html">-</a>... Targus I-Pod Mini Sound Enhancer<a class=l href="http://www.atangledweb.co.uk/index04.html">-</a>... Sony NWA806FP.CE7 4GB video WALKMAN <a class=l href="http://www.atangledweb.co.uk/index05.html">-</a>... Ministry of Sound 512MB MP3 player<a class=l href="http://www.mp3roze.co.uk/cat7000.html">-</a>... Nokia 6125 - Fold Design - 1.3 Megapi<a class=l href="http://www.mp3roze.co.uk/cat7001.html">-</a>...

SLIDE 10

Keyword Stuffing & Generated Copies

SLIDE 11

Google ads

SLIDE 12

Link Spam

“Hyperlink structure contains an enormous amount of latent human annotation that can be extremely valuable for automatically inferring notions of authority.” (Chakrabarti et al. ’99)

  • Hyperlinks: Good, Bad, Ugly

  • Good: Honest link, human annotation
  • Bad: No value of recommendation, e.g. “affiliate programs”, navigation, ads …
  • Ugly: Deliberate manipulation, link spam

SLIDE 13

PageRank

  • One page is important if it is pointed to by many other pages
  • Based on the link structure

The algorithm of PageRank is vulnerable to link spamming

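PageRank of page p0:

$$p_0 = c \sum_i \frac{p_i}{|F(i)|} + \frac{1 - c}{N}$$

Here p_i is the PageRank of a page i pointing to p0, |F(i)| is the number of outgoing links from page i, c is the damping factor, and (1 - c)/N is the random jump term.

Generalized (vector):

$$\mathbf{p} = c\,T'\mathbf{p} + \frac{1 - c}{N}\,\mathbf{1}_N$$

Here T is the transition matrix, p is the score vector, and 1_N is the "1" vector.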
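
The PageRank equation above can be iterated directly. The following is a minimal sketch, not a production implementation: the function name `pagerank`, the dict-based graph format, and the three-page example graph are assumptions made for illustration. Pages without outlinks simply lose their score mass here, so scores need not sum to 1.

```python
def pagerank(out_links, c=0.85, iterations=100):
    """Iterate p = c * T' p + (1 - c) / N * 1 on a dict {page: [targets]}."""
    pages = list(out_links)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the random jump share (1 - c) / N.
        new_scores = {page: (1.0 - c) / n for page in pages}
        for page, targets in out_links.items():
            for target in targets:
                # A page splits its score evenly over its |F(i)| outgoing links.
                new_scores[target] += c * scores[page] / len(targets)
        scores = new_scores
    return scores


# Hypothetical graph: pages "a" and "b" both point to "t"; "t" has no outlinks.
graph = {"a": ["t"], "b": ["t"], "t": []}
scores = pagerank(graph)
```

In this toy graph the two pointing pages each keep only the random jump share, while "t" additionally collects c times their scores, which is exactly the mechanism that link spam exploits.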

SLIDE 14

Link Farms

  • Entry point from the honest web
  • Honey pots: Copies of quality content
  • Dead links to parking domain
  • Blog or guestbook comment spam

[Diagram: the WWW, hijacked pages, and the farm]

SLIDE 15

Spam Farm: Pages

[Diagram: spam farm with target page p0, boosting pages p1, p2, …, pk, and leakage λ0, λ1, λ2, …, λk]

Target page

  • Each farm has only one
  • The target of the spammer is to increase this page’s ranking

Boosting pages

  • Controlled by the spammer
  • Pointing to the target page in order to increase its PageRank

SLIDE 16

Spam Farm: External Links

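[Diagram: spam farm with target page p0 and boosting pages p1, p2, …, pk; leakage λ0, λ1, λ2, …, λk enters from outside pages]

Leakage

  • Fractions of PageRank
  • Links to the farm pages are added from pages outside the farm (forum, blog, …)
  • The spammer has no or limited control over them
  • λ = λ0 + … + λk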

SLIDE 17

Simple Farm Model

  • PageRank of target page: p0
  • Number of all pages: N
  • Damping factor: c
  • Leakage contributed by accessible pages: λ
  • PageRank of each farm page: (1-c)/N

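$$p_0 = \lambda + k c\,\frac{1 - c}{N} + \frac{1 - c}{N} = \lambda + \frac{(1 - c)(ck + 1)}{N}$$

By making k large, we can make p0 as large as we want

No multiplier effect for “acquired” PageRank

[Diagram: boosting pages 1, 2, …, k each point to the target page p0; leakage arrives from the WWW]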
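
The closed form for the simple farm can be checked numerically. This is a sketch under stated assumptions: leakage is set to 0 in the simulation, and the function names and the values k = 10, N = 1000 are hypothetical choices for illustration.

```python
def simple_farm_closed_form(k, n, c=0.85, leakage=0.0):
    """Slide formula: p0 = leakage + (1 - c)(ck + 1) / N."""
    return leakage + (1.0 - c) * (c * k + 1.0) / n


def simulate_simple_farm(k, n, c=0.85, iterations=50):
    """Iterate the PageRank equations for k boosting pages that each
    have exactly one outlink, pointing to the target (no leakage)."""
    jump = (1.0 - c) / n
    target = boost = 1.0 / n
    for _ in range(iterations):
        # Target collects c * score from each boosting page plus the random jump;
        # boosting pages have no inlinks, so they only get the random jump.
        target, boost = c * k * boost + jump, jump
    return target


k, n = 10, 1000
simulated = simulate_simple_farm(k, n)
expected = simple_farm_closed_form(k, n)
```

Both values agree, and doubling k roughly doubles the target score, which is the linear growth without a multiplier effect that the slide describes.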

SLIDE 18

Optimal Farm Model

  • Optimal PageRank of target page: q0
  • Number of all pages: N
  • Damping factor: c
  • Leakage contributed by accessible pages: λ
  • PageRank of each farm page: cq0/k + (1-c)/N

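$$q_0 = \lambda + ck\left[\frac{c\,q_0}{k} + \frac{1 - c}{N}\right] + \frac{1 - c}{N} = \lambda + c^2 q_0 + \frac{c(1 - c)k}{N} + \frac{1 - c}{N}$$

$$\ldots = \frac{\lambda}{1 - c^2} + \frac{(1 - c)(ck + 1)}{N(1 - c^2)} = \frac{p_0}{1 - c^2}$$

By making k large, we can make q0 as large as we want

For c = 0.85 “performance” gain: 1/(1 - c²) ≈ 3.6

Multiplier effect for “acquired” PageRank

[Diagram: the target page q0 and boosting pages 1, 2, …, k point to each other; leakage arrives from the WWW]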
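
The gain factor 1/(1 - c²) can be reproduced by iterating the two farm equations until they settle. A minimal sketch, assuming leakage 0; the function name and the values k = 10, N = 1000 are illustrative assumptions.

```python
def optimal_farm_target_score(k, n, c=0.85, iterations=500):
    """Iterate the farm equations: the target links to all k boosting pages,
    and each boosting page links back to the target (leakage set to 0)."""
    jump = (1.0 - c) / n
    target = boost = jump
    for _ in range(iterations):
        # Each boosting page receives c * q0 / k from the target plus the jump;
        # the target receives c times the score of every boosting page plus the jump.
        target, boost = c * k * boost + jump, c * target / k + jump
    return target


k, n, c = 10, 1000, 0.85
p0 = (1.0 - c) * (c * k + 1.0) / n   # simple farm, closed form with leakage 0
q0 = optimal_farm_target_score(k, n, c)
gain = q0 / p0                        # approaches 1 / (1 - c**2)
```

The reinforcement loop between target and boosting pages is what turns the simple farm score into the larger q0.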

SLIDE 19

Simple vs. Optimal Farm

Simple: Each boosting page only points to the target page

$$p_0 = c\lambda + \frac{(1 - c)(ck + 1)}{N}$$

Optimal: The target points to all boosting pages; there are no links among boosting pages

$$q_0 = \frac{p_0}{1 - c^2}$$

Optimal: Short reinforcement loop(s), fewer links

$$r_0 = \frac{p_0}{1 - c^2}$$

[Diagrams: three farms with targets p0, q0, r0 and boosting pages p1 … pk, q1 … qk, r1 … rk, each receiving leakage λ]

SLIDE 20

Optimality without Leakage

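Idea: Interpret leakage as additional boosting pages (for mathematical simplification only)

$$q_0 = \frac{c\lambda}{1 - c^2} + \frac{(1 - c)(ck + 1)}{(1 - c^2)N} \overset{!}{=} \frac{(1 - c)\,[c(k + m) + 1]}{(1 - c^2)N}$$

$$m = \frac{\lambda N}{1 - c}$$

[Diagram: target q0 with boosting pages q1, q2, …, qk and additional boosting pages qk+1, …, qk+m]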
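
The number m of additional boosting pages follows from requiring that a leakage-free farm with k + m boosting pages reaches the same target score. A short worked derivation, assuming the leakage term cλ/(1 - c²) as it appears on this slide:

```latex
\begin{align*}
\frac{c\lambda}{1-c^2} + \frac{(1-c)(ck+1)}{(1-c^2)N}
  &= \frac{(1-c)\,[c(k+m)+1]}{(1-c^2)N} \\
\frac{c\lambda}{1-c^2}
  &= \frac{(1-c)\,cm}{(1-c^2)N} \\
m &= \frac{\lambda N}{1-c}
\end{align*}
```

The second line results from subtracting the common term (1 - c)(ck + 1)/((1 - c²)N) on both sides.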

SLIDE 21

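Alliance of two Farms

Intuitive: Each boosting page points to both targets

  • 2(k + m) new links
  • Redistribution of PageRank
  • Convenient for the smaller farm

$$p_0 = c\sum_{i=1}^{k}\frac{p_i}{2} + c\sum_{j=1}^{m}\frac{q_j}{2} + \frac{1 - c}{N}$$

$$q_0 = c\sum_{i=1}^{k}\frac{p_i}{2} + c\sum_{j=1}^{m}\frac{q_j}{2} + \frac{1 - c}{N}$$

$$p_i = \frac{c(p_0 + q_0)}{k + m} + \frac{1 - c}{N}, \quad i = 1, \ldots, k$$

$$q_j = \frac{c(p_0 + q_0)}{k + m} + \frac{1 - c}{N}, \quad j = 1, \ldots, m$$

$$p_0 = q_0 = \frac{c(k + m)/2 + 1}{(1 + c)N}$$

[Diagram: two farms with targets p0, q0 and boosting pages p1 … pk, q1 … qm]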

SLIDE 22

Alliance of two Farms

Better: Only the target pages are interconnected with each other

  • Only 2 new links
  • Redistribution of PageRank
  • Convenient for the smaller farm

$$p_0 = q_0 = \frac{c(k + m)/2 + 1}{(1 + c)N}$$

[Diagram: two farms with targets p0, q0 and boosting pages p1 … pk, q1 … qm]

SLIDE 23

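Alliance between two Farms

Optimal: Each target points to the other target; the targets have no links to the boosting pages

$$p_0 = c\left(\sum_{i=1}^{k} p_i + q_0\right) + \frac{1 - c}{N}$$

$$q_0 = c\left(\sum_{j=1}^{m} q_j + p_0\right) + \frac{1 - c}{N}$$

$$p_i = \frac{1 - c}{N}, \quad i = 1, \ldots, k \qquad q_j = \frac{1 - c}{N}, \quad j = 1, \ldots, m$$

$$p_0 = \frac{ck + c^2 m}{(1 + c)N} + \frac{1}{N}$$

[Diagram: two farms with targets p0, q0 and boosting pages p1 … pk, q1 … qm]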
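
The closed form for the optimal two-farm alliance can be verified by iterating the four equations of this slide. A minimal sketch; the function names and the farm sizes k = 1000, m = 3000 with N = 1,000,000 are hypothetical values chosen for illustration.

```python
def two_farm_alliance(k, m, n, c=0.85, iterations=500):
    """Iterate the alliance equations: boosting pages only hold the random
    jump score, and each target points only to the other target."""
    jump = (1.0 - c) / n
    p0 = q0 = jump
    for _ in range(iterations):
        p0, q0 = c * (k * jump + q0) + jump, c * (m * jump + p0) + jump
    return p0, q0


def alliance_closed_form(own, other, n, c=0.85):
    """Slide formula: (c * own + c^2 * other) / ((1 + c) N) + 1 / N."""
    return (c * own + c * c * other) / ((1.0 + c) * n) + 1.0 / n


k, m, n = 1000, 3000, 10**6
p0, q0 = two_farm_alliance(k, m, n)
```

With these sizes the smaller farm's target gains more in relative terms, since the partner's m boosting pages enter its score with weight c².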

SLIDE 24

Multi-Farm Alliance

[Diagram: three farms with targets p0, q0, r0 and boosting pages p1 … pk, q1 … qm, r1 … rn; the targets form the core]

Two fundamental structures:

  • Web ring
  • Complete core

SLIDE 25

Web Ring

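Simple and intuitive connection model

$$p_0 = \frac{ck + c^2 m + c^3 n}{(1 + c + c^2)N} + \frac{1}{N}$$

The distance influences the contribution of each farm to the PageRank of the others

[Diagram: three farms whose targets p0, q0, r0 are linked in a ring; boosting pages p1 … pk, q1 … qm, r1 … rn]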

SLIDE 26

Complete Core

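The core is a completely connected sub-graph

$$p_0 = \frac{2ck - c^2 k + c^2 m + c^2 n}{(2 + c)N} + \frac{1}{N}$$

The contribution of each farm to the PageRank of the other ones is uniform

[Diagram: three farms whose targets p0, q0, r0 are all linked to each other; boosting pages p1 … pk, q1 … qm, r1 … rn]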
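
Both core structures can be simulated with one small routine and compared against the three-farm closed forms for web ring and complete core. This is a sketch under assumptions: the function name `alliance_scores`, the ring direction (target f + 1 points to target f), and the sizes 1000, 2000, 3000 with N = 1,000,000 are illustrative choices.

```python
def alliance_scores(sizes, n, topology, c=0.85, iterations=1000):
    """Target scores for farms of the given sizes.
    'ring': target (f + 1) % F points only to target f.
    'core': every target points to all other targets."""
    jump = (1.0 - c) / n
    count = len(sizes)
    targets = [jump] * count
    for _ in range(iterations):
        updated = []
        for f, size in enumerate(sizes):
            if topology == "ring":
                incoming = targets[(f + 1) % count]
            else:
                # Each other target splits its score over count - 1 outlinks.
                incoming = sum(targets[g] for g in range(count) if g != f) / (count - 1)
            updated.append(c * (size * jump + incoming) + jump)
        targets = updated
    return targets


k, m, n_farm, total, c = 1000, 2000, 3000, 10**6, 0.85
ring = alliance_scores([k, m, n_farm], total, "ring")
core = alliance_scores([k, m, n_farm], total, "core")
ring_formula = (c * k + c**2 * m + c**3 * n_farm) / ((1 + c + c**2) * total) + 1 / total
core_formula = (2 * c * k - c**2 * k + c**2 * m + c**2 * n_farm) / ((2 + c) * total) + 1 / total
```

Passing a longer list of sizes lets the same routine explore larger alliances.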

SLIDE 27

Evaluation

[Figure: scaled target PageRank (y-axis, 1000 to 6000) by farm number (x-axis, 1 to 10) for single farm, web ring, and complete core]

Target scores for ring/complete cores: 10 farms of sizes 1,000, 2,000, …, 10,000

  • Non-connected farm: The PageRank of the target page is linear in the dimension of the farm (number of boosting pages)
  • Complete core: All PageRanks increase, especially those of the targets of farms with low dimensions
  • Web ring: The PageRank of the target of farm 10 decreases compared to the non-connected farm case

SLIDE 28

Evaluation

[Figure: PageRank contribution of farm 1 to the other targets (y-axis, 20 to 200) by farm number (x-axis, 1 to 10) for complete core and web ring]

  • Complete core: Preserves the impact of PageRank with respect to the other targets and gives an identical contribution to the other targets that is much lower than to itself
  • Web ring: The values of contribution are closer to each other and decrease by the distance from the farm

SLIDE 29

Lessons Learned

  • Single farm
  • Short loop(s) increase target PageRank
  • Increase of PageRank is linear with the amount of boosting pages
  • Leakage should only point to the target page
  • Leakage can be interpreted as an additional number of boosting pages
  • Two farms:
  • Target pages should only link to other targets
  • In an alliance of two, both participants win
  • Larger alliances
  • Need to be stable to keep all participants happy
  • Complete core topology: Contribution to the PageRank of others at a relatively “low level”
  • Web ring topology: Contribution to the PageRank of others “slowly” decreasing by distance

SLIDE 30

Link Farms – Example

  • Multi-domain
  • Multi-IP

[Screenshots: 411fashion.com, 411amusement.com, and 411zoos.com, each showing a "411 sites A-Z list"; labels: "Honey pot: quality content copy" and "target"]

SLIDE 31

Cloaking and Hiding

  • Formatting tricks: Trapping crawlers with simple HTML processing only
  • One-pixel image
  • White over white
  • Color, position from stylesheet
  • Redirection
  • Script
  • Meta-tag with refresh time 0
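
The last trick in the list, a meta tag with refresh time 0, can be spotted with a plain HTML parser. A minimal sketch using Python's standard `html.parser`; the class and function names and the example page are assumptions for illustration, and script-based redirects are not covered.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collects (delay_seconds, target_url) pairs from meta refresh tags."""

    def __init__(self):
        super().__init__()
        self.redirects = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute has the form "<seconds>; url=<target>".
        delay, _, rest = attributes.get("content", "").partition(";")
        try:
            seconds = float(delay.strip())
        except ValueError:
            return
        url = rest.strip()
        if url.lower().startswith("url="):
            url = url[4:].strip()
        self.redirects.append((seconds, url))


def instant_redirects(html):
    """Return the targets of all meta refresh tags with refresh time 0."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return [url for seconds, url in finder.redirects if seconds == 0]


# Hypothetical page that forwards the visitor immediately.
page = ('<html><head><META HTTP-EQUIV="refresh" '
        'CONTENT="0; URL=http://example.com/target.html"></head></html>')
found = instant_redirects(page)
```

A crawler with simple HTML processing only would index the page content, while a browser user would see the redirect target instead.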

SLIDE 32

Obfuscated JavaScript

<SCRIPT language=javascript>
  var1=100; var3=200;
  var2=var1 + var3;
  var4=var1;
  var5=var4 + var3;
  // var2 and var5 are both 300, so the redirect always fires
  if(var2==var5)
    document.location="http://umlander.info/mega/free_software_downloads.html";
</SCRIPT>

  • More sophisticated tricks
  • Redirection through window.location
  • Spam content (text, link) from random looking static data via eval calls
  • Content generation by document.write

SLIDE 33

HTTP Level Cloaking

  • User agent, client host filtering
  • Different for users and for GoogleBot
  • “Collaboration service” of spammers for
  • Crawler IPs
  • Agents
  • Behavior
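
One way to probe for user-agent-based cloaking is to request the same URL with a browser-like and a crawler-like agent and compare the two responses. The sketch below is an assumption-laden illustration: `fetch` is a hypothetical callable supplied by the caller, the user-agent strings are placeholders, and the word-overlap threshold of 0.5 is arbitrary. It does not detect filtering by client host or IP.

```python
def jaccard(a, b):
    """Word-set overlap between two page versions (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def looks_cloaked(fetch, url, threshold=0.5):
    """fetch(url, user_agent) -> page text; hypothetical, supplied by the caller."""
    browser_view = set(fetch(url, "Mozilla/5.0 (browser)").lower().split())
    crawler_view = set(fetch(url, "Googlebot/2.1 (crawler)").lower().split())
    return jaccard(browser_view, crawler_view) < threshold


# Stand-in fetchers so the sketch runs without network access.
def cloaking_fetch(url, user_agent):
    if "Googlebot" in user_agent:
        return "cheap loans cheap credit unsecured loan"
    return "welcome to my holiday photos"


def honest_fetch(url, user_agent):
    return "the same page for every visitor"


cloaked = looks_cloaked(cloaking_fetch, "http://example.com/")
honest = looks_cloaked(honest_fetch, "http://example.com/")
```

Pages with rotating ads or timestamps differ slightly between requests, so a real check would need a tolerance tuned on known honest pages.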

SLIDE 34

Spam in Social Media

Guest books

SLIDE 35

Fake Blogs

SLIDE 36

Spam Detection

  • Crawl-time vs. post-processing
  • Simple filters in crawler
  • Cannot handle unseen sites
  • Require large bootstrap crawl
  • Need to run rendering and script execution
  • Crawl time feature generation and classification
  • Needs interface in crawler to access content
  • Needs model from bootstrap or external crawl (may be smaller)
  • Sounds expensive but needs to be done only once per site
  • In both cases, the hard work is done in post-processing

SLIDE 37

Assessment Interface and Collaboration Infrastructure

[Diagram: docs (WARC) are accessed for feature generation (crawl-time); a feature feed of text files goes to the classifier, which builds the model and applies it (crawl-time); "interaction" through active learning; local storages; features and extracts may be shared across institutions]

SLIDE 38

Spam Labeling

  • Manual labels (black AND white lists) primarily determine quality
  • Can blacklist only a tiny fraction
  • Recall 10% of sites are spam
  • Needs machine learning
  • Central to the service
  • Aid manual assessment
  • Aid information and label sharing
  • Catch spam farms that span different top-level domains

No free lunch: No fully automatic filtering

SLIDE 39

Web Spam Interface

SLIDE 40

Combating Spam

  • Refresh detection
  • Conceal crawling by
  • Headers: Browser vs. crawler
  • Access: “Random” vs. BFS
  • TrustRank method
  • Supervised learning features
  • Number of words in the pages
  • Average word length
  • Number of words in the page title
  • Fraction of visible content
  • Amount of anchor text
  • Compressibility
  • Partition the Web into different blocks

Never stop! On-going process
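
Several of the supervised learning features listed above are cheap to compute from page text. The sketch below is illustrative only: the feature definitions (for example, compressibility as raw size divided by zlib-compressed size, and visible fraction as visible text length over HTML length) and all names are assumptions, not definitions taken from the slides.

```python
import zlib


def content_features(title, visible_text, html):
    """Compute a few content-based features for one page."""
    words = visible_text.split()
    raw = visible_text.encode("utf-8")
    return {
        "word_count": len(words),
        "average_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "title_word_count": len(title.split()),
        # Share of the HTML source that is visible text (assumed definition).
        "visible_fraction": len(visible_text) / len(html) if html else 0.0,
        # Repetitive (keyword-stuffed) text compresses very well.
        "compression_ratio": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
    }


# Hypothetical pages: one keyword-stuffed, one ordinary sentence.
stuffed_text = "cheap loans " * 200
honest_text = "Our club meets every Tuesday evening to rehearse for the summer festival."
stuffed = content_features("cheap loans", stuffed_text, "<p>" + stuffed_text + "</p>")
honest = content_features("Club news", honest_text, "<p>" + honest_text + "</p>")
```

Feature dictionaries like these would be the input to a supervised classifier trained on manually labeled pages.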

SLIDE 41

Google AdWords Competition

10k 10th wedding anniversary 128mb, 1950s, … abc, abercrombie, … b2b, baby, bad credit, … digital camera earn big money, easy, … f1, family, flower, fantasy gameboy, gates, girl, … hair, harry potter, … ibiza, import car, … james bond, janet jackson karate, konica, kostenlose ladies, lesbian, lingerie, … …

Query Marketability

SLIDE 42

Generative Content Models

  • Honest topic 4: club (0.035), team (0.012), league (0.009), win (0.009)
  • Honest topic 10: music (0.022), band (0.012), film (0.011), festival (0.009)
  • Spam topic 7: loan (0.080), unsecured (0.026), credit (0.024), home (0.022)

Excerpt: 20 spam and 50 honest topic models

[Bíró, Szabó, Benczúr 2008]

SLIDE 43

TrustRank

  • Basic idea: Approximate isolation
  • Honest pages rarely point to spam
  • Spam cites many, many spam
  • Sample a set of “seed pages” from the web
  • “Oracle” (human) identifies good and spam pages in the seed set
  • The subset of seed pages that are identified as “good” are called “trusted pages”

Expensive! Make seed set as small as possible

SLIDE 44

Trust Propagation

  • Set trust of each trusted page to 1
  • Propagate trust through links
  • Each page gets a trust value between 0 and 1
  • Use a threshold value and mark all pages below the trust threshold as spam
  • Trust attenuation
  • The degree of trust conferred by a trusted page decreases with distance
  • Trust splitting
  • The larger the number of outlinks from a page, the less scrutiny the page author gives each outlink

  • Trust is “split” across outlinks
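
The propagation steps above can be sketched in a few lines. This is a simplified illustration, not the published TrustRank algorithm: the attenuation factor 0.85, the cap at 1, the function names, and the example graph are assumptions. Trusted pages stay at 1, trust is split evenly across outlinks, and it is attenuated at every hop.

```python
def propagate_trust(out_links, trusted, attenuation=0.85, iterations=20):
    """Propagate trust from trusted seed pages along links.
    out_links maps every page to its list of link targets."""
    trust = {page: (1.0 if page in trusted else 0.0) for page in out_links}
    for _ in range(iterations):
        received = {page: 0.0 for page in out_links}
        for page, targets in out_links.items():
            for target in targets:
                # Trust splitting: divide by the number of outlinks.
                # Trust attenuation: damp at every hop.
                received[target] += attenuation * trust[page] / len(targets)
        trust = {page: 1.0 if page in trusted else min(1.0, received[page])
                 for page in out_links}
    return trust


def mark_spam(trust, threshold):
    """Pages whose trust falls below the threshold are marked as spam."""
    return {page for page, value in trust.items() if value < threshold}


# Hypothetical graph: a trusted seed, three honest pages, and a two-page farm.
graph = {
    "seed": ["a", "b"], "a": ["c"], "b": [], "c": [],
    "farm1": ["farm2"], "farm2": ["farm1"],
}
trust = propagate_trust(graph, trusted={"seed"})
spam = mark_spam(trust, threshold=0.1)
```

The farm pages link heavily to each other but receive no link from the trusted part of the graph, so their trust stays at 0.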

SLIDE 45

Trust Computation

  • 1. Predicted spamicity p(v) for all pages
  • 2. Target page u, new feature f(u) by neighbor p(v) aggregation
  • 3. Reclassification by adding the new feature

[Diagram: target page u with unknown label ("?") and neighbors v1, v2, v7]
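
Step 2 can be illustrated with a simple aggregation. The slides do not say which aggregation is used, so the mean is an assumption here, as are the function names and the spamicity values in the example.

```python
def neighbor_feature(page, neighbors, spamicity):
    """f(u): mean predicted spamicity p(v) over the neighbors v of page u.
    The mean is an assumed choice of aggregation."""
    values = [spamicity[v] for v in neighbors.get(page, [])]
    return sum(values) / len(values) if values else 0.0


def extend_features(features, page, neighbors, spamicity):
    """Step 3 input: the original feature vector plus the new feature f(u)."""
    return features + [neighbor_feature(page, neighbors, spamicity)]


# Hypothetical first-pass predictions for the neighbors of u.
neighbors = {"u": ["v1", "v2", "v7"]}
spamicity = {"v1": 0.9, "v2": 0.8, "v7": 0.1}
f_u = neighbor_feature("u", neighbors, spamicity)
extended = extend_features([0.3, 12.0], "u", neighbors, spamicity)
```

The extended vector would then be fed to the classifier for the reclassification pass.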

SLIDE 46

PageRank Supporter Distribution

[Figures: supporter distribution by PageRank (low to high) for two sites]

  • Honest: fhh.hamburg.de, ρ=0.97
  • Spam: radiopr.bildflirt.de (part of www.popdata.de farm), ρ=0.61

[Benczúr, Csalogány, Sarlós, Uher 2005]

SLIDE 47

Web Spam Challenge

  • WEBSPAM-UK2006
  • 77M pages
  • 11,402 hosts
  • 7,373 labeled
  • 26% spam
  • WEBSPAM-UK2007
  • 100M pages
  • 114,529 hosts
  • 6,479 labeled
  • 6% spam

[WebSpam08]

SLIDE 48

Summary

  • Web Spam
  • Aims at search engine optimization
  • “Attacks” indexing crawlers
  • Slows down archiving crawlers
  • “Pollutes” the archive
  • Social Web is a “threat”
  • Spamming techniques
  • Boosting
  • Hiding
  • Combinations of various techniques
  • Countermeasures
  • Manual assessment
  • Machine learning techniques
  • Hybrid approaches

All Web information retrieval ranking elements spammed

SLIDE 49

References

[Benc08] A. Benczur: “Web spam survey for the Archivist”. 8th International Web Archiving Workshop (IWAW 2008), Århus, Denmark, Sept. 18, 2008. http://liwa-project.eu/index.php/video/33/ [last access: July 22, 2009]

[BSS*08] A. Benczur, D. Siklosi, J. Szabo et al.: “Web spam survey for the Archivist”. Proceedings of the 8th International Web Archiving Workshop (IWAW 2008), Århus, Denmark, Sept. 18, 2008. http://iwaw.net/08/IWAW2008-Benczur.pdf [last access: July 22, 2009]

[GyGa05] Z. Gyöngyi and H. Garcia-Molina: “Link Spam Alliances”. Technical Report, Stanford University, March 2, 2005. http://www-db.stanford.edu/~zoltan/publications/gyongyi2005link.pdf [last access: July 22, 2009]

[WebSpam08] Web Spam Challenge: “Home page”. 2008. http://webspam.lip6.fr/wiki/pmwiki.php [last access: July 22, 2009]