Link Spam Alliances


  1. Link Spam Alliances Zoltán Gyöngyi Hector Garcia-Molina

  2. Class List
     Spam 101: Intro to web spam
     Spam 221: Spamming PageRank
     • Spam farm model
     • Optimal farm structure
     • Alliances of two farms
     • Larger alliances
     Spam 321: Link spam detection seminar
     Very Large Data Bases ● Trondheim, September 1, 2005

  3. Spam 101 kaiser pharmacy online

  4. Spam 101 Save today on Viagra, Lipitor, Zoloft, … Phentermine 90 Pills/$119

  5. Spam 101 Pet shops commonly carry fish for home aquariums, small birds such as parakeets, small mammals such as fancy rats and hamsters…

  6. Spam 101 Pharmacy is the profession of compounding and dispensing medication. More recently, the term has come to include other services… Lawyers Loans Mortgage Ringtones Viagra

  7. Spam 101
     Spamming = misleading search engines to obtain higher-than-deserved ranking
     Link spamming = building link structures that boost PageRank score

  8. Spam 221: PageRank
     A page is important if many important pages point to it
     $p_0 = c \sum_i p_i / \mathrm{out}(i) + (1 - c)$
     • $p_0$: PageRank of page $p_0$
     • $p_i$: PageRank of a page $p_i$ that points to page $p_0$

  9. Spam 221: PageRank
     $p_0 = c \sum_i p_i / \mathrm{out}(i) + (1 - c)$
     • $c$: damping factor ≈ 0.85
     • $1 - c$: random jump probability ≈ 0.15 (uniform static score)
     • $\mathrm{out}(i)$: outdegree of page $p_i$
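The formula on this slide can be tried out with a short fixed-point iteration. The following is a minimal sketch, not the authors' code: the function name `pagerank`, the dictionary-based graph, and the three-page example are all hypothetical, and pages without outlinks are assumed to pass no score on.

```python
def pagerank(out_links, c=0.85, iterations=100):
    """Iterate p_j = c * sum_i p_i / out(i) + (1 - c) for every page j.

    out_links maps each page to the list of pages it links to;
    every page must appear as a key (hypothetical input format).
    """
    scores = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        # Every page starts from the uniform static score (1 - c).
        new_scores = {page: 1.0 - c for page in out_links}
        for source, targets in out_links.items():
            if not targets:
                continue  # assumption: a page without outlinks passes nothing on
            share = c * scores[source] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical graph: pages "a" and "b" both point to "t".
graph = {"a": ["t"], "b": ["t"], "t": []}
scores = pagerank(graph)
print(scores["t"])
```

In this scaling a page with no inlinks scores 1 - c, so the target "t" ends up with (1 - c)(2c + 1).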

  10. Spam 221: Spam Farm Model
      [Figure: farm diagram with nodes labeled 0, 1, 2, …, k]

  11. Spam 221: Spam Farm Model
      Single target page $p_0$
      • Increase exposure
      • In particular, increase PageRank

  12. Spam 221: Spam Farm Model
      Boosting pages $p_1, \dots, p_k$
      • Owned/controlled by spammer
      Example page text: Canada Rx Cheap Canadian drugs here import pharmacy online best prescriptions discount savings

  13. Spam 221: Spam Farm Model
      Leakages $\lambda_0, \dots, \lambda_k$
      • Fractions of PageRank
      • Through hijacked links
        – Spammer has limited access to source page
      • $\lambda = \lambda_0 + \cdots + \lambda_k$
      Example: Joe’s Blog, posted on 04/28/05 … Comments: "Great thoughts! I also wrote about this issue in my blog." (by as7869)

  14. Spam 221: Optimal Farm
      Simple: $p_0 = \lambda + (1 - c)(c k + 1)$
      • Every link points to $p_0$
      Optimal: $q_0 = p_0 / (1 - c^2)$
      • Links to boosting pages
      • 3.6x increase in target PageRank for $c = 0.85$
      [Figure: two farm diagrams with leakage λ, targets $p_0$ and $q_0$, boosting pages $p_1, \dots, p_k$ and $q_1, \dots, q_k$]

  15. Spam 221: Optimal Farm
      Optimal: $q_0 = p_0 / (1 - c^2)$
      • Links to boosting pages
      • 3.6x increase in target PageRank
      Optimal #2: $r_0 = p_0 / (1 - c^2)$
      • Same PageRank
      • Fewer links
      [Figure: two farm diagrams with leakage λ, targets $q_0$ and $r_0$, boosting pages $q_1, \dots, q_k$ and $r_1, \dots, r_k$]

  16. Spam 221: Optimal Farm
      Optimal: $q_0 = p_0 / (1 - c^2)$
      • Links to boosting pages
      • 3.6x increase in target PageRank
      Optimal #2: $r_0 = p_0 / (1 - c^2)$
      • Same PageRank
      • Fewer links
      Lesson #1: Short loop(s) increase target PageRank
      [Figure: two farm diagrams with leakage λ, targets $q_0$ and $r_0$, boosting pages $q_1, \dots, q_k$ and $r_1, \dots, r_k$]
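The 3.6x figure follows from the two formulas on these slides, and the effect of the short loop can be checked numerically. This is a minimal sketch under stated assumptions: the function names are hypothetical, leakage defaults to zero, and the iteration assumes a farm in which the target links to all k boosting pages and each boosting page links back to the target.

```python
def simple_target_score(k, leakage=0.0, c=0.85):
    # Slide formula for the simple farm: p_0 = lambda + (1 - c)(c k + 1).
    return leakage + (1 - c) * (c * k + 1)


def optimal_target_score(k, leakage=0.0, c=0.85):
    # Slide formula for the optimal farm: q_0 = p_0 / (1 - c^2).
    return simple_target_score(k, leakage, c) / (1 - c ** 2)


def iterate_optimal_farm(k, c=0.85, iterations=500):
    """Fixed-point check (assumed structure, zero leakage):
    the target splits its vote over k boosting pages, which all link back."""
    target = 1.0
    for _ in range(iterations):
        boosting = (1 - c) + c * target / k  # score of one boosting page
        target = (1 - c) + c * k * boosting  # k boosting pages vote for the target
    return target


k = 1000  # hypothetical farm size
boost = optimal_target_score(k) / simple_target_score(k)
print(round(boost, 1))            # the ratio equals 1 / (1 - c^2)
print(iterate_optimal_farm(10))   # agrees with optimal_target_score(10)
```

The ratio does not depend on k, which is why the slides can quote a single factor for c = 0.85.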

  17. Spam 221: Two Farms
      Alliances = interconnected farms
      • Single spammer, several target pages/farms
      • Multiple spammers
      What happens if you and I team up?

  18. Spam 221: Two Farms
      We can do this…
      … but it won’t help: target scores balance out
      $d = c / (1 + c)$
      $p_0 = q_0 = d (k + m) / 2$

  19. Spam 221: Two Farms
      However, we can also do this…
      • Remove the links to boosting pages
      … and both target scores increase
      • For $k = m$, we have a 6.7x increase
      $p_0 = d k + c d m + 1$
      $q_0 = d m + c d k + 1$
      [Figure: two farms with targets $p_0$ and $q_0$, boosting pages $p_1, \dots, p_k$ and $q_1, \dots, q_m$]

  20. Spam 221: Two Farms
      However, we can also do this…
      • Remove the links to boosting pages
      … and both target scores increase
      • For $k = m$, we have a 6.7x increase
      $p_0 = d k + c d m + 1$
      $q_0 = d m + c d k + 1$
      Lesson #2: Target pages should only link to other targets
      Lesson #3: In an alliance of two, both participants win
      [Figure: two farms with targets $p_0$ and $q_0$, boosting pages $p_1, \dots, p_k$ and $q_1, \dots, q_m$]
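The alliance formulas above can be evaluated directly. The sketch below uses hypothetical function names and farm sizes, and it assumes that the 6.7x increase is measured against the simple single farm with leakage ignored, $(1 - c)(c k + 1)$; the slides do not state the baseline explicitly.

```python
def alliance_of_two(k, m, c=0.85):
    # Slide formulas with d = c / (1 + c).
    d = c / (1 + c)
    p0 = d * k + c * d * m + 1
    q0 = d * m + c * d * k + 1
    return p0, q0


def simple_farm(k, c=0.85):
    # Assumed baseline: simple farm score with leakage ignored.
    return (1 - c) * (c * k + 1)


k = m = 1000  # hypothetical, equal farm sizes
p0, q0 = alliance_of_two(k, m)
increase = p0 / simple_farm(k)
print(round(increase, 1))
```

For equal farm sizes the target score simplifies to c k + 1, so under this baseline the increase is 1 / (1 - c) regardless of k.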

  21. Spam 221: Larger Alliances
      “Extremes”
      • Ring core
      • Completely connected core

  22. Spam 221: Larger Alliances
      Target scores for ring/complete cores
      • 10 farms of sizes 1000, 2000, …, 10000
      Problem: farm 10 “loses” in a ring
      [Figure: target PageRank (0 to 6000) versus farm number (1 to 10); series: Complete, Ring, Optimal Single]

  23. Spam 221: Larger Alliances
      Target scores for ring/complete cores
      • 10 farms of sizes 1000, 2000, …, 10000
      Problem: farm 10 “loses” in a ring
      Lesson #4: Larger alliances need to be stable to keep all participants happy
      [Figure: target PageRank (0 to 6000) versus farm number (1 to 10); series: Complete, Ring, Optimal Single]
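The comparison in the plot can be approximated with a small simulation. This is a sketch under several assumptions that the slides do not confirm: leakage is ignored, targets link only to other targets, the ring is oriented so that each target links to the previous farm (so farm 10 receives its core link from farm 1), and the single-farm baseline is the optimal farm score $(c k + 1) / (1 + c)$. All names are hypothetical.

```python
def core_scores(sizes, links, c=0.85, iterations=500):
    """Target scores when target i links only to the targets listed in links[i]."""
    # Contribution of each farm's own boosting pages (each boosting page scores 1 - c).
    base = [(1 - c) * (c * k + 1) for k in sizes]
    scores = base[:]
    for _ in range(iterations):
        new_scores = base[:]
        for i, outs in enumerate(links):
            for j in outs:
                new_scores[j] += c * scores[i] / len(outs)
        scores = new_scores
    return scores


c = 0.85
sizes = [1000 * i for i in range(1, 11)]  # 10 farms of sizes 1000, ..., 10000
n = len(sizes)

ring = [[(i - 1) % n] for i in range(n)]  # assumed orientation of the ring
complete = [[j for j in range(n) if j != i] for i in range(n)]

ring_scores = core_scores(sizes, ring, c)
complete_scores = core_scores(sizes, complete, c)
single_scores = [(c * k + 1) / (1 + c) for k in sizes]  # optimal single farm, no leakage

for farm in range(n):
    print(farm + 1, round(single_scores[farm]), round(ring_scores[farm]), round(complete_scores[farm]))
```

With this orientation the largest farm ends up below its optimal single-farm score in the ring but above it in the complete core, which matches the problem noted on the slide.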

  24. Spam 221: Larger Alliances
      Stable alliance = no farm has incentive to split off
      • Alliances of two are always stable
      • Larger alliances are not necessarily stable
      Dynamics (see paper)
      • Should a new farm be added?
      • What about adding more boosting pages?
      • When/with whom should a farm split off?
      • Should a “loser” be compensated?

  25. Spam 321: Spam Detection
      Identifying regular structures
      • Inlink/outlink/PageRank distribution “unnatural”
      • Fetterly et al., 2004
      • Benczúr et al., 2005
      $p_1 = p_2 = \cdots = p_k$
      [Figure: farm with leakage λ, target $p_0$, boosting pages $p_1, p_2, \dots, p_k$]

  26. Spam 321: Spam Detection
      Detecting collusion
      • Alliance cores preserve (capture) PageRank
      • Zhang et al., 2004
      $(p_0 + q_0) / (\sum_i p_i + \sum_j q_j) \approx c / (1 - c)$
      [Figure: two farms with targets $p_0$ and $q_0$, boosting pages $p_1, \dots, p_k$ and $q_1, \dots, q_m$]
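The ratio on this slide can be reproduced from the two-farm scores given earlier in the deck. The sketch below is illustrative only: the function name and farm sizes are hypothetical, and it assumes that boosting pages have no inlinks, so each of them scores exactly 1 - c.

```python
def capture_ratio(k, m, c=0.85):
    """Ratio of core (target) PageRank to boosting-page PageRank for two allied farms."""
    d = c / (1 + c)
    p0 = d * k + c * d * m + 1  # two-farm target scores from the slides
    q0 = d * m + c * d * k + 1
    boosting_total = (1 - c) * (k + m)  # assumption: boosting pages have no inlinks
    return (p0 + q0) / boosting_total


c = 0.85
ratio = capture_ratio(10000, 10000, c)  # hypothetical farm sizes
print(ratio, c / (1 - c))
```

The gap between the two printed values shrinks as the farms grow, which is why the slide states the relation only approximately.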

  27. Spam 321: Spam Detection
      Estimating spam mass
      • Target PageRank depends on boosting
      • Work in progress
      $(p_0 - p'_0) / p_0$ large
      [Figure: target page with leakage λ, scores $p_0$ and $p'_0$]

  28. Review Session
      Link spammers target PageRank
      Spam farm model
      • Single target page
      • Boosting pages + leakage
      Alliances of two
      • Always better than alone
      Larger alliances
      • Different core structures
      • Not necessarily stable
        – Conditions on joining and leaving

  29. Review Session
      Related work
      • Bianchini et al., 2005. Inside PageRank
      • Langville and Meyer, 2004. Deeper Inside PageRank
      • Baeza-Yates et al., 2005. PageRank Increase under Different Collusion Topologies
      Future work
      • Spam detection
      • Cost model extension

  30. Spam 221: Larger Alliances
      Various core structures
      • 4 farms of size 50
      • One target probed (others symmetrical)
      [Figure: target PageRank (40 to 160) and number of graphs (20 to 100) versus score group (1 to 20); one point labeled “ring”]
