SLIDE 1

Link Spam Detection Based on Mass Estimation

Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, Jan Pedersen

SLIDE 2

Very Large Data Bases ● Seoul, September 13, 2006 2

Roadmap

Search engine spamming

Link spamming

PageRank contribution

Spam mass

  • Definition
  • Estimation
  • Algorithm

Experiments

SLIDE 3

Spamming: Example

#1 search result for the query “austria ski”

SLIDE 4

Spamming: Example

#1 search result for the query “austria ski”

asiandiveholidays.com asianmp3.com mp3thailand.com thailandhealthcaretimes.com thailandpropertytimes.com

SLIDE 5

Spamming: Example

SLIDE 6

Spamming: Introduction

Spamming = misleading search engines to obtain higher-than-deserved ranking

SLIDE 7

Spamming: Introduction

Spamming = misleading search engines to obtain higher-than-deserved ranking

SLIDE 8

Spamming: Introduction

Spamming = misleading search engines to obtain higher-than-deserved ranking

Link spamming = building link structures that boost PageRank score

SLIDE 9

Spamming: Our Target

Detect pages that achieve high PageRank through link spamming

[Figure: spam farm with target node s0, boosting nodes s1, s2, …, sk-1, sk and good nodes g1, …, gm, where k >> m]

SLIDE 10

PageRank Contribution

SLIDE 11

PageRank Contribution

p0

SLIDE 12

PageRank Contribution

p0

SLIDE 13

PageRank Contribution

p0

SLIDE 14

PageRank Contribution

p0

SLIDE 15

PageRank Contribution

$p_0^{+} = 2c^2(1 - c)/n + 2c(1 - c)/n$

$p_0^{-} = 6c^2(1 - c)/n + c(1 - c)/n$

SLIDE 16

Spam Mass: Definition

Absolute mass

  • Amount (part) of PageRank coming from spam

Relative mass

  • Fraction of PageRank coming from spam
  • More useful in practice

a.m. = $p_0^{-} = 5$

r.m. = $p_0^{-} / p_0 = 5/7$

[Figure: example graph around node p0, with labels 5 and 2]
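The two definitions can be written as a short Python sketch. The function names are hypothetical, and the input values (a total PageRank of 7 units, of which 2 come from good nodes) are assumed purely for illustration.

```python
def absolute_mass(p, p_good):
    """PageRank that does not come from good nodes, i.e. the part attributed to spam."""
    return p - p_good


def relative_mass(p, p_good):
    """Fraction of the total PageRank attributed to spam."""
    return (p - p_good) / p


# Illustrative values (assumed): total PageRank 7, good contribution 2.
total, good = 7, 2
print(absolute_mass(total, good))
print(relative_mass(total, good))
```

Relative mass is scale-free, which is one reason it is easier to threshold than absolute mass.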

SLIDE 17

Spam Mass: Estimation

Ideally…

p0

SLIDE 18

Spam Mass: Estimation

$p_0^{+}$

In practice… Approximate the set of good nodes by a subset called the good core

SLIDE 19

Spam Mass: Estimation

$p_0^{+}$

In practice… Approximate the set of good nodes by a subset called the good core

$p_0^{-} = p_0 - p_0^{+}$

SLIDE 20

Spam Mass: Algorithm

  1. Create good core
  2. Compute PageRank scores $p_i$ and $p_i^{+}$
  3. Compute estimated relative mass $m_i$ as $(p_i - p_i^{+}) / p_i$
  4. For all pages $i$ with large PageRank: mark page as spam if $m_i$ > threshold
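The four steps can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names, the toy graph, the `rank_floor` cut-off for "large PageRank", and the choice of a core jump vector with 1/n on core nodes and 0 elsewhere are all assumptions. PageRank is computed with the linear model, without any adjustment for nodes that have no outlinks.

```python
def linear_pagerank(out_links, v, c=0.85, iters=100):
    """Iterate p <- c * T^T p + (1 - c) * v (no fix-up for dangling nodes)."""
    p = [(1 - c) * x for x in v]
    for _ in range(iters):
        nxt = [(1 - c) * x for x in v]
        for i, targets in enumerate(out_links):
            for j in targets:
                nxt[j] += c * p[i] / len(targets)
        p = nxt
    return p


def spam_mass_detection(out_links, good_core, c=0.85, rank_floor=0.0, threshold=0.5):
    """Return (relative mass per node, list of nodes marked as spam)."""
    n = len(out_links)
    uniform = [1.0 / n] * n
    # Assumed core jump vector: 1/n on good-core nodes, 0 elsewhere.
    core_only = [1.0 / n if i in good_core else 0.0 for i in range(n)]
    p = linear_pagerank(out_links, uniform, c)          # step 2: p_i
    p_plus = linear_pagerank(out_links, core_only, c)   # step 2: p_i^+
    mass = [(p[i] - p_plus[i]) / p[i] for i in range(n)]  # step 3: m_i
    flagged = [i for i in range(n)
               if p[i] >= rank_floor and mass[i] > threshold]  # step 4
    return mass, flagged


# Toy graph (hypothetical): node 0 is a spam target, nodes 1-3 are boosting
# nodes, node 4 is a good-core host with one hijacked link to node 0, and
# node 5 is an ordinary good page linked from node 4.
out_links = [[], [0], [0], [0], [0, 5], []]
c = 0.85
min_rank = (1 - c) / len(out_links)
mass, flagged = spam_mass_detection(out_links, good_core={4}, c=c,
                                    rank_floor=2 * min_rank, threshold=0.5)
print(flagged)
```

In this toy graph only node 0 passes the PageRank floor, and most of its score comes from outside the good core, so it is the only node marked.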

SLIDE 21

Experiments: Data

Yahoo! web index host graph

  • 73.3M nodes
  • 979M links

Good core

  • High-quality web directory: 16,780
  • Governmental hosts: 55,320
  • Educational hosts: 434,000

SLIDE 22

Experiments: Data

Sample

  • 0.1% of nodes with PageRank > 10x minimum
  • 892 nodes
  • Manually labeled good, spam

Relative mass groups (approx. same size)

  • Group 1: 44 samples with smallest rel. mass

  • Group 20: 40 samples with largest rel. mass

SLIDE 23

Experiments: Relative Mass

[Figure: chart of sample group composition (0 to 100%; good, anomalous, spam) by sample group number (1 to 20)]

SLIDE 24

Experiments: Relative Mass

Anomalies

  • *.alibaba.com
  • *.blogger.com.br
  • Polish hosts (only 12 .pl hosts in good core)

SLIDE 25

Experiments: Relative Mass

SLIDE 26

Experiments: Core Size

[Figure: estimated precision (axis ticks 0.4 to 0.8) versus relative mass threshold (0.98 down to 0.1); series: 100% core, 10% core, 1% core, 0.1% core, .it core]

SLIDE 27

Related Work

PageRank analyses

  • [Bianchini+2005], [Langville+2004]

Link spam analyses

  • [Baeza+2005], [Gyöngyi+2005]

Link spam detection

  • Statistics: [Fetterly+2004], [Benczúr+2005]
  • Collusion detection: [Zhang+2004], [Wu+2005]

TrustRank

  • [Gyöngyi+2004], [Wu+2006]

SLIDE 28

Conclusions

Search engine spamming

  • Manipulation of search engine ranking
  • Focus on link spamming

Spam mass

  • ~ PageRank contribution of spam
  • Useful in link spam detection

Strong experimental results

  • Virtually 100% of top 47K nodes spam
  • 94% of top 105K nodes spam

SLIDE 29

Link Spamming: Model

Spam farm

SLIDE 30

Link Spamming: Model

Spam farm

1. Target node

[Figure: spam farm with node s0]

SLIDE 31

Link Spamming: Model

Spam farm

1. Target node
2. Boosting nodes

[Figure: spam farm with nodes s0, s1, s2, s3, s4; sample page text: "Great cheap ski Switzerland Italy travel best rates winter sports hotels Ski Austria travel…"]

SLIDE 32

Link Spamming: Model

Spam farm

1. Target node
2. Boosting nodes
3. Hijacked links from good nodes

[Figure: spam farm with good nodes g1, g2 and spam nodes s0, s1, s2, s3, s4; sample comment on "Joe's Blog": "Great pictures! See my Austria ski vacation." (by as7869)]
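A small Python sketch can show why boosting nodes help the target. It assumes the simplest possible farm, in which each of k boosting nodes links only to the target s0 (node 0 here) and every other node is isolated; the slide does not specify the exact link pattern. It uses the linear PageRank model with a uniform random jump, and all function names are hypothetical.

```python
def linear_pagerank(out_links, v, c=0.85, iters=100):
    """Iterate p <- c * T^T p + (1 - c) * v (no fix-up for dangling nodes)."""
    p = [(1 - c) * x for x in v]
    for _ in range(iters):
        nxt = [(1 - c) * x for x in v]
        for i, targets in enumerate(out_links):
            for j in targets:
                nxt[j] += c * p[i] / len(targets)
        p = nxt
    return p


def target_rank(k, n=100, c=0.85):
    """PageRank of target node 0 when nodes 1..k each link only to it."""
    out_links = [[] for _ in range(n)]
    for s in range(1, k + 1):
        out_links[s] = [0]
    v = [1.0 / n] * n
    return linear_pagerank(out_links, v, c)[0]


# The target's score grows linearly with the number of boosting nodes:
# in this simple farm it equals (1 - c) / n * (1 + k * c).
ranks = {k: target_rank(k) for k in (0, 5, 50)}
for k, r in ranks.items():
    print(k, r)
```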

SLIDE 33

Link Spamming: Model

Spam farm alliances

SLIDE 34

PageRank

Probabilistic model: $p = c\,U^T p + (1 - c)\,v$

  • $U = U(T, v)$ stochastic transition matrix
  • $\|v\| = 1$

Linear model: $(I - c\,T^T)\,p = (1 - c)\,v$

  • No adjustment for nodes without outlinks (transition matrix $T$ has all-zero rows)
  • Advantages
    – For $p = \mathrm{PR}(v)$ and $v = v_1 + v_2$, $p = p_1 + p_2$ where $p_1 = \mathrm{PR}(v_1)$ and $p_2 = \mathrm{PR}(v_2)$
    – Faster to compute
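The linearity property can be checked numerically with a short Python sketch. The solver below is a plain fixed-point iteration for the linear model, and the four-node graph and the particular split of the jump vector are assumptions chosen for illustration.

```python
def linear_pagerank(out_links, v, c=0.85, iters=200):
    """Solve (I - c T^T) p = (1 - c) v by iterating p <- c T^T p + (1 - c) v.

    Rows of T for nodes without outlinks stay all-zero (no adjustment).
    """
    p = [(1 - c) * x for x in v]
    for _ in range(iters):
        nxt = [(1 - c) * x for x in v]
        for i, targets in enumerate(out_links):
            for j in targets:
                nxt[j] += c * p[i] / len(targets)
        p = nxt
    return p


# Hypothetical graph with a cycle (0 -> 1 -> 2 -> 0) and a dangling node 3.
out_links = [[1, 2], [2], [0], []]
n = len(out_links)

# Split the uniform jump vector v into two parts, v = v1 + v2.
v1 = [1.0 / n, 1.0 / n, 0.0, 0.0]
v2 = [0.0, 0.0, 1.0 / n, 1.0 / n]
v = [a + b for a, b in zip(v1, v2)]

p = linear_pagerank(out_links, v)
p1 = linear_pagerank(out_links, v1)
p2 = linear_pagerank(out_links, v2)

# PR(v1 + v2) should match PR(v1) + PR(v2) up to rounding error.
max_gap = max(abs(p[i] - (p1[i] + p2[i])) for i in range(n))
print(max_gap)
```

This property is what allows the PageRank of a node to be split into a part coming from good nodes and a part coming from spam nodes.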

SLIDE 35

PageRank Contribution

Walk $W$ from $x$ to $y$: $x = x_0, x_1, \ldots, x_k = y$

  • Weight $\pi(W) = \mathrm{out}(x_0)^{-1} \cdots \mathrm{out}(x_{k-1})^{-1}$

Contribution of $x$ to $y$ over $W$: $c^k\,\pi(W)\,(1 - c)/n$

PageRank contribution $p_y^x$ of $x$ to $y$, over all walks

  • Possibly infinite # of walks if there are cycles
  • $p_y^x = \mathrm{PR}$(random jump to $x$ only)

See also [Jeh+2003]
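As a sanity check, the walk-based definition and the "random jump to x only" computation can be compared on a small acyclic graph, where the number of walks is finite. This Python sketch is illustrative only: the graph and the function names are assumptions, and the walk enumeration would not terminate on a graph with cycles.

```python
def linear_pagerank(out_links, v, c=0.85, iters=100):
    """Iterate p <- c * T^T p + (1 - c) * v (no fix-up for dangling nodes)."""
    p = [(1 - c) * x for x in v]
    for _ in range(iters):
        nxt = [(1 - c) * x for x in v]
        for i, targets in enumerate(out_links):
            for j in targets:
                nxt[j] += c * p[i] / len(targets)
        p = nxt
    return p


def walk_contribution(out_links, x, y, c=0.85):
    """Sum c^k * pi(W) * (1 - c) / n over all walks W from x to y.

    Only valid for acyclic graphs, where the set of walks is finite.
    """
    n = len(out_links)
    total = 0.0
    stack = [(x, 0, 1.0)]  # (current node, walk length k, walk weight pi(W))
    while stack:
        node, k, weight = stack.pop()
        if node == y:
            total += (c ** k) * weight * (1 - c) / n
        for nxt in out_links[node]:
            stack.append((nxt, k + 1, weight / len(out_links[node])))
    return total


# Hypothetical DAG: two walks lead from node 0 to node 3
# (0 -> 1 -> 2 -> 3 and 0 -> 2 -> 3).
out_links = [[1, 2], [2], [3], []]
n = len(out_links)
x, y = 0, 3

by_walks = walk_contribution(out_links, x, y)

# Same quantity via PageRank with a random jump to x only.
jump_to_x = [1.0 / n if i == x else 0.0 for i in range(n)]
by_pagerank = linear_pagerank(out_links, jump_to_x)[y]

print(by_walks, by_pagerank)
```

On graphs with cycles, the PageRank-based computation is the practical route, since it sums over the infinitely many walks implicitly.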