SLIDE 1

Link Spam Alliances

Zoltán Gyöngyi Hector Garcia-Molina

SLIDE 2

Very Large Data Bases ● Trondheim, September 1, 2005

Class List

Spam 101: Intro to web spam

Spam 221: Spamming PageRank

  • Spam farm model
  • Optimal farm structure
  • Alliances of two farms
  • Larger alliances

Spam 321: Link spam detection seminar

SLIDE 3

Spam 101

kaiser pharmacy online

SLIDE 4

Spam 101

Save today on Viagra, Lipitor, Zoloft, … Phentermine 90 Pills/$119

SLIDE 5

Spam 101

Pet shops commonly carry fish for home aquariums, small birds such as parakeets, small mammals such as fancy rats and hamsters…

SLIDE 6

Spam 101

Pharmacy is the profession of compounding and dispensing medication. More recently, the term has come to include other services…

Lawyers Loans Mortgage Ringtones Viagra

SLIDE 7

Spam 101

Spamming = misleading search engines to obtain higher-than-deserved ranking

Link spamming = building link structures that boost PageRank score

SLIDE 8

Spam 221: PageRank

$p_0 = c \sum_i \frac{p_i}{\mathrm{out}(i)} + (1 - c)$

  • $p_0$: PageRank of page p0
  • $p_i$: PageRank of page pi that points to page p0

A page is important if many important pages point to it

SLIDE 9

Spam 221: PageRank

$p_0 = c \sum_i \frac{p_i}{\mathrm{out}(i)} + (1 - c)$

  • $c$: damping factor ≈ 0.85
  • $\mathrm{out}(i)$: outdegree of page pi
  • $1 - c$: random jump probability ≈ 0.15 (uniform static score)
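
The formula can be evaluated by simple fixed-point iteration. The following is a minimal sketch, not the authors' implementation: the function name `pagerank`, the dictionary-based graph format, and the three-page example graph are assumptions made for illustration. Scores are left unnormalized, as in the formula above, so every page starts from the static score 1 - c.

```python
def pagerank(links, c=0.85, iterations=200):
    """Iterate p_j = c * sum_i p_i / out(i) + (1 - c) over all pages.

    links maps each page to the list of pages it points to. Scores are
    left unnormalized, so the static score of every page is 1 - c.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    scores = {page: 1.0 - c for page in pages}
    for _ in range(iterations):
        updated = {page: 1.0 - c for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page without outlinks passes on nothing
            share = c * scores[source] / len(targets)
            for target in targets:
                updated[target] += share
        scores = updated
    return scores


# Hypothetical three-page graph: a and b point to each other, x points to a.
example = pagerank({"a": ["b"], "b": ["a"], "x": ["a"]})
```

Page x has no inlinks, so it keeps only the static score 1 - c, while a and b reinforce each other through their two-page loop.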

SLIDE 10

Spam 221: Spam Farm Model

[Figure: two rows of pages labeled 1, 2, …, k with a "?" between them]

SLIDE 11

Spam 221: Spam Farm Model

Single target page p0

  • Increase exposure
  • In particular, increase PageRank

SLIDE 12

Spam 221: Spam Farm Model

Boosting pages p1, …, pk

  • Owned/controlled by spammer

Cheap Canadian drugs here import pharmacy online best prescriptions discount savings Canada Rx

SLIDE 13

Spam 221: Spam Farm Model

Leakages λ0, …, λk

  • Fractions of PageRank
  • Through hijacked links
      – Spammer has limited access to source page
  • λ = λ0 + ··· + λk

[Example: Joe’s Blog, posted on 04/28/05 … Comments: "Great thoughts! I also wrote about this issue in my blog." (by as7869)]

SLIDE 14

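Spam 221: Optimal Farm

Simple

$p_0 = \lambda + (1 - c)(ck + 1)$

  • Every link points to p0

[Figure: boosting pages p1, p2, …, pk and target p0, with leakage λ]

Optimal

$q_0 = p_0 / (1 - c^2)$

  • Links to boosting pages
  • 3.6x increase in target PageRank (for c = 0.85)

[Figure: boosting pages q1, q2, …, qk and target q0, with leakage λ]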
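
The factor 1 / (1 - c²) can be recovered from the PageRank formula in a few lines. This derivation is a sketch under stated assumptions rather than a quotation from the slides: each boosting page is assumed to have a single outlink to the target, the target is assumed to split its score evenly over its k links to the boosting pages, and the leakage λ is assumed to enter the target score directly, as in the simple farm.

```latex
\begin{align*}
q_i &= c\,\frac{q_0}{k} + (1 - c), \qquad i = 1, \dots, k \\
q_0 &= \lambda + c \sum_{i=1}^{k} q_i + (1 - c)
     = \lambda + c^2 q_0 + (1 - c)(ck + 1) \\
(1 - c^2)\, q_0 &= \lambda + (1 - c)(ck + 1) = p_0
\quad\Longrightarrow\quad q_0 = \frac{p_0}{1 - c^2} \\
c = 0.85 &: \quad \frac{1}{1 - c^2} = \frac{1}{0.2775} \approx 3.6
\end{align*}
```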

SLIDE 16

Spam 221: Optimal Farm

Optimal

$q_0 = p_0 / (1 - c^2)$

  • Links to boosting pages
  • 3.6x increase in target PageRank

[Figure: boosting pages q1, q2, …, qk and target q0, with leakage λ]

Optimal #2

$r_0 = p_0 / (1 - c^2)$

  • Same PageRank
  • Fewer links

[Figure: boosting pages r1, r2, r3, …, rk and target r0, with leakage λ]

Lesson #1: Short loop(s) increase target PageRank
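
Lesson #1 can be checked numerically on a small farm. The sketch below is illustrative only: the helper names `target_score` and `farm`, the farm size k = 20, and zero leakage are assumptions. In particular, the "Optimal #2" structure is assumed here to be a target that links back to a single boosting page, which is one structure with fewer links that reaches the same score; the slides' figure may differ.

```python
def target_score(links, target, c=0.85, iterations=300):
    """Unnormalized PageRank of `target`: p = c * sum(p_i / out(i)) + (1 - c)."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    scores = dict.fromkeys(pages, 1.0 - c)
    for _ in range(iterations):
        updated = dict.fromkeys(pages, 1.0 - c)
        for source, targets in links.items():
            for t in targets:
                updated[t] += c * scores[source] / len(targets)
        scores = updated
    return scores[target]


def farm(k, target_outlinks):
    """k boosting pages that each link to target 0; the target links to `target_outlinks`."""
    links = {i: [0] for i in range(1, k + 1)}
    links[0] = list(target_outlinks)
    return links


c, k = 0.85, 20
simple = target_score(farm(k, []), 0, c)                # target has no outlinks
optimal = target_score(farm(k, range(1, k + 1)), 0, c)  # target links to every boosting page
optimal_2 = target_score(farm(k, [1]), 0, c)            # assumed: target links to page 1 only
boost = 1.0 / (1.0 - c ** 2)                            # about 3.6 for c = 0.85
```

With these settings `simple` equals (1 - c)(ck + 1), and both looped structures multiply it by `boost`, whether the loop runs through all boosting pages or through just one.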

SLIDE 17

Spam 221: Two Farms

Alliances = interconnected farms

  • Single spammer, several target pages/farms
  • Multiple spammers

What happens if you and I team up?

SLIDE 18

Spam 221: Two Farms

We can do this…

$p_0 = q_0 = d(k + m)/2$, where $d = c / (1 + c)$

… but it won’t help: target scores balance out

SLIDE 20

Spam 221: Two Farms

However, we can also do this…

  • Remove the links to boosting pages

$p_0 = dk + cdm + 1$
$q_0 = dm + cdk + 1$

… and both target scores increase

  • For k = m, we have a 6.7x increase

[Figure: one farm with boosting pages p1, p2, …, pk and target p0; a second farm with boosting pages q1, q2, …, qm and target q0]

Lesson #2: Target pages should only link to other targets

Lesson #3: In an alliance of two, both participants win
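
The closed forms for p0 and q0 can be reproduced by iterating the PageRank formula on the two targets alone. This is a sketch under assumptions not stated on the slide: zero leakage, boosting pages without inlinks (so each scores 1 - c), and targets that link only to each other. The function name `two_farm_targets` and the farm sizes are hypothetical.

```python
def two_farm_targets(k, m, c=0.85, iterations=300):
    """Target scores when the two targets link only to each other.

    Assumes zero leakage and that every boosting page has no inlinks, so each
    scores 1 - c and passes c * (1 - c) on to its own target.
    """
    p0 = q0 = 1.0 - c
    for _ in range(iterations):
        p0, q0 = (c * q0 + c * k * (1 - c) + (1 - c),
                  c * p0 + c * m * (1 - c) + (1 - c))
    return p0, q0


c, k, m = 0.85, 100, 40
d = c / (1 + c)
p0, q0 = two_farm_targets(k, m, c)
closed_form = (d * k + c * d * m + 1, d * m + c * d * k + 1)

# For k = m the alliance score is c*k + 1, versus (1 - c)(c*k + 1) for a simple farm.
equal_p0, _ = two_farm_targets(k, k, c)
gain_over_simple = equal_p0 / ((1 - c) * (c * k + 1))
```

Under these assumptions the gain for k = m is 1 / (1 - c), about 6.7 for c = 0.85, which suggests that the 6.7x figure is measured against the simple farm, like the 3.6x figure for the optimal single farm.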

SLIDE 21

Spam 221: Larger Alliances

“Extremes”

  • Ring core
  • Completely connected core

SLIDE 23

Spam 221: Larger Alliances

Target scores for ring/complete cores

  • 10 farms of sizes 1000, 2000, …, 10000

[Figure: Target PageRank (y-axis, 1000 to 6000) by Farm Number (x-axis, 1 to 10); series: Optimal Single, Ring, Complete]

Problem: farm 10 “loses” in a ring

Lesson #4: Larger alliances need to be stable to keep all participants happy
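
The ring and complete cores can be compared with a short computation on the ten targets. The sketch below rests on several assumptions: zero leakage, boosting pages that score 1 - c each, and a stand-alone baseline of (ck + 1) / (1 + c) for the optimal single farm. The direction of the ring is not recoverable from the slide; here each target is assumed to link to the next smaller farm's target, with farm 1 linking to farm 10, and the outcome depends on that choice. The function name `core_scores` is hypothetical.

```python
def core_scores(sizes, inlinks, c=0.85, iterations=400):
    """Target scores for farms whose targets link only to other targets.

    sizes[i] is the number of boosting pages of farm i (each assumed to score
    1 - c, zero leakage); inlinks[i] lists (j, out_j) pairs: target j links to
    target i and has out_j outlinks in total.
    """
    base = [(1 - c) * (c * k + 1) for k in sizes]
    scores = base[:]
    for _ in range(iterations):
        scores = [base[i] + c * sum(scores[j] / out for j, out in inlinks[i])
                  for i in range(len(sizes))]
    return scores


c = 0.85
sizes = [1000 * (i + 1) for i in range(10)]
n = len(sizes)

single = [(c * k + 1) / (1 + c) for k in sizes]  # optimal stand-alone farm, zero leakage
# Assumed ring direction: target i links to target i - 1, and target 1 links to target 10.
ring = core_scores(sizes, [[((i + 1) % n, 1)] for i in range(n)], c)
complete = core_scores(sizes, [[(j, n - 1) for j in range(n) if j != i] for i in range(n)], c)

ring_losers = [i + 1 for i in range(n) if ring[i] < single[i]]
complete_losers = [i + 1 for i in range(n) if complete[i] < single[i]]
```

With this ring direction, farm 10 receives its only core link from the smallest farm and ends up below its stand-alone score, while every farm gains in the completely connected core.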

SLIDE 24

Spam 221: Larger Alliances

Stable alliance = no farm has incentive to split off

  • Alliances of two are always stable
  • Larger alliances are not necessarily stable

Dynamics: see paper

  • Should a new farm be added?
  • What about adding more boosting pages?
  • When/with whom should a farm split off?
  • Should a “loser” be compensated?

SLIDE 25

Spam 321: Spam Detection

Identifying regular structures

  • Inlink/outlink/PageRank distribution “unnatural”
  • Fetterly et al., 2004
  • Benczúr et al., 2005

[Figure: boosting pages p1, p2, …, pk and target p0, with leakage λ]

$p_1 = p_2 = \cdots = p_k$

SLIDE 26

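Spam 321: Spam Detection

Detecting collusion

  • Alliance cores preserve (capture) PageRank
  • Zhang et al., 2004

[Figure: one farm with boosting pages p1, p2, …, pk and target p0; a second farm with boosting pages q1, q2, …, qm and target q0]

$(p_0 + q_0) / (\sum_i p_i + \sum_j q_j) \approx c / (1 - c)$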
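
The ratio can be worked out for the two-farm alliance using the target scores p0 = dk + cdm + 1 and q0 = dm + cdk + 1 from the Two Farms slides. This is a sketch under the same assumptions as there (zero leakage, boosting pages scoring 1 - c each); the function name `core_to_boosting_ratio` and the farm sizes are hypothetical.

```python
def core_to_boosting_ratio(k, m, c=0.85):
    """(p0 + q0) / (sum of boosting scores) for a two-farm alliance.

    Assumes zero leakage, targets that link only to each other, and boosting
    pages without inlinks (score 1 - c each).
    """
    d = c / (1 + c)
    p0 = d * k + c * d * m + 1
    q0 = d * m + c * d * k + 1
    boosting_total = (k + m) * (1 - c)
    return (p0 + q0) / boosting_total


c = 0.85
limit = c / (1 - c)                      # about 5.67
small = core_to_boosting_ratio(10, 10, c)
large = core_to_boosting_ratio(10_000, 10_000, c)
```

The ratio equals c / (1 - c) plus a term that shrinks as the farms grow, so the approximation on the slide tightens for large k and m.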

SLIDE 27

Spam 321: Spam Detection

Estimating spam mass

  • Target PageRank depends on boosting
  • Work in progress

$(p_0 - p'_0) / p_0$ large

[Figure: target p'0 with leakage λ]

SLIDE 28

Review Session

Link spammers target PageRank

Spam farm model

  • Single target page
  • Boosting pages + leakage

Alliances of two

  • Always better than alone

Larger alliances

  • Different core structures
  • Not necessarily stable
      – Conditions on joining and leaving

SLIDE 29

Review Session

Related work

  • Bianchini et al., 2005. Inside PageRank
  • Langville and Meyer, 2004. Deeper Inside PageRank
  • Baeza-Yates et al., 2005. PageRank Increase under Different Collusion Topologies

Future work

  • Spam detection
  • Cost model extension

SLIDE 30

Spam 221: Larger Alliances

Various core structures

  • 4 farms of size 50
  • One target probed (others symmetrical)

[Figure: # of Graphs (20 to 100) and Target PageRank (40 to 160) by Score Group (1 to 20); label: ring]