Link Spam Detection Based on Mass Estimation


  1. Link Spam Detection Based on Mass Estimation
     Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, Jan Pedersen

  2. Roadmap
     • Search engine spamming
     • Link spamming
     • PageRank contribution
     • Spam mass
       • Definition
       • Estimation
       • Algorithm
     • Experiments
     Very Large Data Bases ● Seoul, September 13, 2006

  3-4. Spamming: Example #1
     Search result for the query “austria ski”
     Hosts shown on slide 4: asiandiveholidays.com, asianmp3.com, mp3thailand.com, thailandhealthcaretimes.com, thailandpropertytimes.com

  5. Spamming: Example

  6-8. Spamming: Introduction
     Spamming = misleading search engines to obtain higher-than-deserved ranking
     Link spamming = building link structures that boost PageRank score

  9. Spamming: Our Target
     Detect pages that achieve high PageRank through link spamming
     (Diagram: nodes $s_0$, $s_1, s_2, \dots, s_{k-1}, s_k$ and $g_1, \dots, g_m$, with $k \gg m$)

  10-14. PageRank Contribution
     (Diagram built up over slides 10 to 14; only the label $p_0$ survives)

  15. PageRank Contribution
     $p_0^+ = 2c^2(1 - c)/n + 2c(1 - c)/n$
     $p_0^- = 6c^2(1 - c)/n + c(1 - c)/n$
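As a quick numeric check of the two expressions on slide 15, the sketch below evaluates them for a given damping factor and graph size. The values c = 0.85 and n = 12 are assumptions chosen for illustration (the slides state neither), and the function name `contributions` is hypothetical. The ratio of the two contributions does not depend on n.

```python
def contributions(c, n):
    """Evaluate slide 15's expressions for the good and spam contributions to p_0."""
    scale = (1 - c) / n
    good = (2 * c**2 + 2 * c) * scale  # p_0^+ : contribution of good nodes
    spam = (6 * c**2 + c) * scale      # p_0^- : contribution of spam nodes
    return good, spam


# c = 0.85 and n = 12 are assumed values, not taken from the slides.
good, spam = contributions(c=0.85, n=12)
print(f"good: {good:.4f}  spam: {spam:.4f}  spam/good: {spam / good:.2f}")
```

Under these assumed values the spam contribution exceeds the good contribution.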

  16. Spam Mass: Definition
     • Absolute mass
       • Amount (part) of PageRank coming from spam
       • a.m. = $p_0^- = 5$
     • Relative mass
       • Fraction of PageRank coming from spam
       • r.m. = $p_0^- / p_0 = 5/7$
       • More useful in practice
     (Diagram: $p_0$ with contributions labeled 5 and 2)

  17. Spam Mass: Estimation
     Ideally… (diagram; only the label $p_0$ survives)

  18-19. Spam Mass: Estimation
     In practice…
     • Approximate the set of good nodes by a subset called the good core
     • $p_0^- = p_0 - p_0^+$

  20. Spam Mass: Algorithm
     1. Create good core
     2. Compute PageRank scores $p_i$ and $p_i^+$
     3. Compute estimated relative mass $m_i$ as $(p_i - p_i^+) / p_i$
     4. For all pages $i$ with large PageRank: mark page as spam if $m_i >$ threshold
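The four steps on slide 20 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the linear PageRank model with a uniform random jump of 1/n for $p_i$ and a jump of 1/n restricted to good-core nodes for $p_i^+$ (the slides do not say how the core jump vector is scaled). The names `linear_pagerank`, `detect_spam`, `min_rank_factor`, and the toy graph are all hypothetical.

```python
def linear_pagerank(out_links, jump, c=0.85, iters=100):
    """Iterate p <- c * T^T p + (1 - c) * v, with no fix for nodes without outlinks."""
    p = {u: 0.0 for u in out_links}
    for _ in range(iters):
        nxt = {u: (1 - c) * jump.get(u, 0.0) for u in out_links}
        for u, targets in out_links.items():
            for t in targets:
                nxt[t] += c * p[u] / len(targets)
        p = nxt
    return p


def detect_spam(out_links, good_core, c=0.85, min_rank_factor=2.0, threshold=0.5):
    """Return {node: estimated relative mass} for high-PageRank nodes above threshold."""
    n = len(out_links)
    # Step 2: regular PageRank p and core-based PageRank p_plus.
    p = linear_pagerank(out_links, {u: 1.0 / n for u in out_links}, c)
    p_plus = linear_pagerank(out_links, {u: 1.0 / n for u in good_core}, c)
    # "Large PageRank" is taken here as a multiple of the minimum score (1 - c) / n.
    floor = min_rank_factor * (1 - c) / n
    flagged = {}
    for u in out_links:
        if p[u] > floor:
            m = (p[u] - p_plus[u]) / p[u]  # Step 3: estimated relative mass
            if m > threshold:              # Step 4: threshold test
                flagged[u] = m
    return flagged


# Toy graph (hypothetical): "t" is boosted by three spam nodes, "h" only by good nodes.
graph = {
    "s1": ["t"], "s2": ["t"], "s3": ["t"],
    "g1": ["t", "h"], "g2": ["h"],
    "t": [], "h": [],
}
flagged = detect_spam(graph, good_core={"g1", "g2"})
print(flagged)
```

In this toy graph only `t` is flagged: `h` has enough PageRank to be considered, but most of it comes from the good core, while the spam nodes themselves fall below the PageRank floor.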

  21. Experiments: Data
     • Yahoo! web index: host graph
       • 73.3M nodes
       • 979M links
     • Good core
       • High-quality web directory: 16,780
       • Governmental hosts: 55,320
       • Educational hosts: 434,000

  22. Experiments: Data
     • Sample
       • 0.1% of nodes with PageRank > 10x minimum
       • 892 nodes
       • Manually labeled good, spam
     • Relative mass groups (approx. same size)
       • Group 1: 44 samples with smallest rel. mass
       • …
       • Group 20: 40 samples with largest rel. mass

  23. Experiments: Relative Mass
     (Bar chart: sample group composition in percent, split into good, anomalous, and spam, by sample group number 1 to 20)

  24. Experiments: Relative Mass
     • Anomalies
       • *.alibaba.com
       • *.blogger.com.br
       • Polish hosts: only 12 .pl in good core

  25. Experiments: Relative Mass

  26. Experiments: Core Size
     (Plot: estimated precision versus relative mass threshold, from 0.98 down to 0, with curves for the 100% core, 10% core, 1% core, 0.1% core, and .it core)

  27. Related Work
     • PageRank analyses
       • [Bianchini+2005], [Langville+2004]
     • Link spam analyses
       • [Baeza+2005], [Gyöngyi+2005]
     • Link spam detection
       • Statistics: [Fetterly+2004], [Benczúr+2005]
       • Collusion detection: [Zhang+2004], [Wu+2005]
     • TrustRank
       • [Gyöngyi+2004], [Wu+2006]

  28. Conclusions
     • Search engine spamming
       • Manipulation of search engine ranking
       • Focus on link spamming
     • Spam mass
       • ~ PageRank contribution of spam
       • Useful in link spam detection
     • Strong experimental results
       • Virtually 100% of top 47K nodes spam
       • 94% of top 105K nodes spam

  29-32. Link Spamming: Model
     • Spam farm
       1. Target node ($s_0$)
       2. Boosting nodes ($s_1$ to $s_4$); sample page text: “Ski Austria travel… Great cheap ski Switzerland Italy travel best rates winter sports hotels”
       3. Hijacked links from good nodes ($g_1$, $g_2$); sample comment on “Joe’s Blog”: “Great pictures! See my Austria ski vacation.” (by as7869)

  33. Link Spamming: Model
     • Spam farm alliances

  34. PageRank
     • Probabilistic model: $p = c\,U^T p + (1 - c)\,v$
       • $U = U(T, v)$ stochastic transition matrix
       • $|v| = 1$
     • Linear model: $(I - c\,T^T)\,p = (1 - c)\,v$
       • No adjustment for nodes without outlinks (transition matrix $T$ has all-zero rows)
       • Advantages
         • For $p = \mathrm{PR}(v)$ and $v = v_1 + v_2$, $p = p_1 + p_2$ where $p_1 = \mathrm{PR}(v_1)$ and $p_2 = \mathrm{PR}(v_2)$
         • Faster to compute
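The linearity property listed on slide 34 can be checked numerically on a small graph. The sketch below is an illustration under assumptions: it solves the linear model by fixed-point iteration (one possible solver, not one named in the slides), and the function name `linear_pagerank`, the four-node graph, and the split of the jump vector are all hypothetical.

```python
def linear_pagerank(out_links, jump, c=0.85, iters=200):
    """Approximate the solution of (I - c T^T) p = (1 - c) v by fixed-point iteration."""
    p = {u: 0.0 for u in out_links}
    for _ in range(iters):
        nxt = {u: (1 - c) * jump.get(u, 0.0) for u in out_links}
        for u, targets in out_links.items():
            for t in targets:  # a node without outlinks passes nothing on
                nxt[t] += c * p[u] / len(targets)
        p = nxt
    return p


# Hypothetical graph with a cycle (a -> b -> c -> a) and a node "d" without outlinks.
graph = {"a": ["b"], "b": ["c"], "c": ["a", "b"], "d": []}
v1 = {"a": 0.25, "b": 0.25}               # first part of the jump vector
v2 = {"c": 0.25, "d": 0.25}               # second part
v = {u: 0.25 for u in graph}              # v = v1 + v2

p = linear_pagerank(graph, v)
p1 = linear_pagerank(graph, v1)
p2 = linear_pagerank(graph, v2)
max_gap = max(abs(p[u] - (p1[u] + p2[u])) for u in graph)
print(f"largest |p - (p1 + p2)| over all nodes: {max_gap:.2e}")
```

The gap should be at the level of floating-point rounding, which is the property that lets PageRank be split into a part coming from one node set and a part coming from the rest.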

  35. PageRank Contribution
     • Walk $W$ from $x$ to $y$: $x = x_0, x_1, \dots, x_k = y$
       • Weight $\pi(W) = \mathrm{out}(x_0)^{-1} \cdots \mathrm{out}(x_{k-1})^{-1}$
     • Contribution of $x$ to $y$ over $W$: $c^k\,\pi(W)\,(1 - c)/n$
     • PageRank contribution $p_y^x$ of $x$ to $y$ over all walks
       • Possibly infinite # of walks if there are cycles
       • $p_y^x = \mathrm{PR}$(random jump to $x$ only)
     • See also [Jeh+2003]
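To make the walk-based definition on slide 35 concrete, the sketch below sums $c^k\,\pi(W)\,(1 - c)/n$ over all walks from $x$ to $y$ up to a maximum length and compares the result with linear PageRank computed for a random jump to $x$ only. Walks are not enumerated one by one: the total weight of all walks of a given length is aggregated per endpoint, which keeps the sum cheap even with cycles. The truncation length, the function names, and the three-node graph are assumptions made for illustration.

```python
def walk_contribution(out_links, x, y, c=0.85, max_len=200):
    """Sum c^k * pi(W) * (1 - c) / n over walks W from x to y with length k <= max_len."""
    n = len(out_links)
    weight = {x: 1.0}  # total pi(W) of walks of the current length, keyed by endpoint
    total = 0.0
    for k in range(max_len + 1):
        total += c**k * weight.get(y, 0.0) * (1 - c) / n
        nxt = {}
        for u, w in weight.items():
            for t in out_links[u]:
                nxt[t] = nxt.get(t, 0.0) + w / len(out_links[u])
        weight = nxt
    return total


def pagerank_single_jump(out_links, x, c=0.85, iters=200):
    """Linear PageRank where the random jump goes to x only, with weight 1/n."""
    n = len(out_links)
    p = {u: 0.0 for u in out_links}
    for _ in range(iters):
        nxt = {u: ((1 - c) / n if u == x else 0.0) for u in out_links}
        for u, targets in out_links.items():
            for t in targets:
                nxt[t] += c * p[u] / len(targets)
        p = nxt
    return p


# Hypothetical graph with a cycle a <-> b, so there are infinitely many walks a -> c.
graph = {"a": ["b"], "b": ["a", "c"], "c": []}
by_walks = walk_contribution(graph, "a", "c")
by_pagerank = pagerank_single_jump(graph, "a")["c"]
print(f"walk sum: {by_walks:.6f}  PR(jump to a only): {by_pagerank:.6f}")
```

For this graph the walks from `a` to `c` have lengths 2, 4, 6, and so on, so the sum is a geometric series, and the two computations should agree up to the truncation error.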
