Link Spam Detection Based on Mass Estimation
Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, Jan Pedersen
Very Large Data Bases ● Seoul, September 13, 2006
Roadmap
Search engine spamming
Link spamming
PageRank contribution
Spam mass
- Definition
- Estimation
- Algorithm
Experiments
Spamming: Example
#1 search result for the query “austria ski”
asiandiveholidays.com asianmp3.com mp3thailand.com thailandhealthcaretimes.com thailandpropertytimes.com
Spamming: Introduction
Spamming = misleading search engines to obtain higher-than-deserved ranking
Link spamming = building link structures that boost PageRank score
Spamming: Our Target
Detect pages that achieve high PageRank through link spamming
[Figure: spam farm with target node s0, spam nodes s1, …, sk, and good nodes g1, …, gm, where k >> m]
PageRank Contribution
[Figure: example graph illustrating the PageRank $p_0$ of a node]
Contribution of good nodes: $p_0^+ = (2c^2 + 2c)(1 - c)/n$
Contribution of spam nodes: $p_0^- = (6c^2 + c)(1 - c)/n$
Spam Mass: Definition
Absolute mass
- Amount (part) of PageRank coming from spam
Relative mass
- Fraction of PageRank coming from spam
- More useful in practice
Example: a.m. = $p_0^- = 5$; r.m. = $p_0^- / p_0 = 5/7$ (here $p_0 = p_0^+ + p_0^- = 2 + 5 = 7$)
Spam Mass: Estimation
Ideally…
[Figure: example node with PageRank $p_0$]
In practice…
- Approximate the set of good nodes by a subset called the good core
- Compute $p_0^+$ from the good core, then $p_0^- = p_0 - p_0^+$
Spam Mass: Algorithm
1. Create good core
2. Compute PageRank scores $p_i$ and $p_i^+$
3. Compute estimated relative mass $m_i = (p_i - p_i^+) / p_i$
4. For all pages $i$ with large PageRank: mark page as spam if $m_i$ > threshold
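The four steps can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation. It assumes that $p_i^+$ is the linear PageRank computed with the random jump restricted to the good core (weight 1/n on core nodes, zero elsewhere). The toy graph, the function names, and the parameters `rank_factor` and `threshold` are hypothetical.

```python
def linear_pagerank(out_links, v, c=0.85, iterations=200):
    """Iterate p = c T^T p + (1 - c) v, the fixed point of (I - c T^T) p = (1 - c) v.

    out_links[i] lists the nodes that node i links to. Nodes without
    outlinks pass nothing on (all-zero row of T).
    """
    p = [(1 - c) * vi for vi in v]
    for _ in range(iterations):
        nxt = [(1 - c) * vi for vi in v]
        for i, targets in enumerate(out_links):
            for j in targets:
                nxt[j] += c * p[i] / len(targets)
        p = nxt
    return p


def estimate_relative_mass(out_links, good_core, c=0.85):
    """Steps 2 and 3: PageRank p and estimated relative mass m for every node."""
    n = len(out_links)
    v = [1.0 / n] * n  # uniform random jump
    # Assumption: p+ uses a jump vector restricted to the good core.
    v_core = [1.0 / n if i in good_core else 0.0 for i in range(n)]
    p = linear_pagerank(out_links, v, c)
    p_plus = linear_pagerank(out_links, v_core, c)
    m = [(pi - gi) / pi for pi, gi in zip(p, p_plus)]
    return p, m


def detect_spam(p, m, rank_factor=5.0, threshold=0.5):
    """Step 4: among nodes with large PageRank, flag those with m above threshold."""
    cutoff = rank_factor * min(p)  # hypothetical definition of "large PageRank"
    return [i for i in range(len(p)) if p[i] > cutoff and m[i] > threshold]


# Hypothetical toy graph: node 0 is linked from good nodes 1-2 and spam nodes 3-7.
out_links = [[]] + [[0] for _ in range(7)]
p, m = estimate_relative_mass(out_links, good_core={1, 2})  # step 1: core = {1, 2}
spam = detect_spam(p, m)
```

In this toy graph only node 0 has a large PageRank, and most of that score comes from outside the good core, so it is the only node flagged.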
Experiments: Data
Yahoo! web index host graph
- 73.3M nodes
- 979M links
Good core
- High-quality web directory: 16,780
- Governmental hosts: 55,320
- Educational hosts: 434,000
Sample
- 0.1% of nodes with PageRank > 10x minimum
- 892 nodes
- Manually labeled good, spam
Relative mass groups (approx. same size)
- Group 1: 44 samples with smallest rel. mass
…
- Group 20: 40 samples with largest rel. mass
Experiments: Relative Mass
[Figure: bar chart of sample group composition (good, anomalous, spam; 0 to 100%) by sample group number (1 to 20)]
Anomalies
- *.alibaba.com
- *.blogger.com.br
- Polish hosts: only 12 .pl hosts in good core
Experiments: Core Size
[Figure: estimated precision vs. relative mass threshold for different good cores: 100% core, 10% core, 1% core, 0.1% core, and .it core]
Related Work
PageRank analyses
- [Bianchini+2005], [Langville+2004]
Link spam analyses
- [Baeza+2005], [Gyöngyi+2005]
Link spam detection
- Statistics: [Fetterly+2004], [Benczúr+2005]
- Collusion detection: [Zhang+2004], [Wu+2005]
TrustRank
- [Gyöngyi+2004], [Wu+2006]
Conclusions
Search engine spamming
- Manipulation of search engine ranking
- Focus on link spamming
Spam mass
- ~ PageRank contribution of spam
- Useful in link spam detection
Strong experimental results
- Virtually 100% of top 47K nodes spam
- 94% of top 105K nodes spam
Link Spamming: Model
Spam farm
1. Target node: s0
2. Boosting nodes: s1, s2, s3, s4
   - Example page text: "Great cheap ski Switzerland Italy travel best rates winter sports hotels Ski Austria travel…"
3. Hijacked links from good nodes: g1, g2
   - Example blog comment (Joe’s Blog): "Great pictures! See my Austria ski vacation." (by as7869)
Spam farm alliances
PageRank
Probabilistic model: $p = c\,U^T p + (1 - c)\,v$
- $U = U(T, v)$ is a stochastic transition matrix
- $\|v\| = 1$
Linear model: $(I - c\,T^T)\,p = (1 - c)\,v$
- No adjustment for nodes without outlinks (transition matrix $T$ has all-zero rows)
- Advantages
  - Linearity: for $p = \mathrm{PR}(v)$ and $v = v_1 + v_2$, $p = p_1 + p_2$, where $p_1 = \mathrm{PR}(v_1)$ and $p_2 = \mathrm{PR}(v_2)$
  - Faster to compute
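The linearity property of the linear model can be checked numerically. The sketch below is illustrative only: it solves the linear model by simple fixed-point iteration in plain Python, and the four-node graph and the split of the jump vector into `v1` and `v2` are hypothetical.

```python
def linear_pagerank(out_links, v, c=0.85, iterations=200):
    """Iterate p = c T^T p + (1 - c) v, the fixed point of (I - c T^T) p = (1 - c) v.

    out_links[i] lists the nodes that node i links to. Nodes without
    outlinks pass nothing on (all-zero row of T).
    """
    p = [(1 - c) * vi for vi in v]
    for _ in range(iterations):
        nxt = [(1 - c) * vi for vi in v]
        for i, targets in enumerate(out_links):
            for j in targets:
                nxt[j] += c * p[i] / len(targets)
        p = nxt
    return p


# Hypothetical graph with a cycle (0 -> 1 -> 2 -> 0) and a node without outlinks (3).
out_links = [[1, 2], [2], [0], []]

v1 = [0.25, 0.25, 0.0, 0.0]  # jump restricted to nodes 0 and 1
v2 = [0.0, 0.0, 0.25, 0.25]  # jump restricted to nodes 2 and 3
v = [a + b for a, b in zip(v1, v2)]

p = linear_pagerank(out_links, v)
p1 = linear_pagerank(out_links, v1)
p2 = linear_pagerank(out_links, v2)

# Largest deviation between PR(v1 + v2) and PR(v1) + PR(v2); should be near zero.
max_gap = max(abs(pi - (a + b)) for pi, a, b in zip(p, p1, p2))
```

This decomposition is what lets the PageRank of a node be split into a part coming from good nodes and a part coming from spam nodes.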
PageRank Contribution
Walk $W$ from $x$ to $y$: $x = x_0, x_1, \ldots, x_k = y$
- Weight $\pi(W) = \mathrm{out}(x_0)^{-1} \cdots \mathrm{out}(x_{k-1})^{-1}$
Contribution of $x$ to $y$ over $W$: $c^k \, \pi(W) \, (1 - c) / n$
PageRank contribution $p_y^x$ of $x$ to $y$: sum of these contributions over all walks from $x$ to $y$
- Possibly infinite # of walks if there are cycles
- $p_y^x$ = entry for $y$ of PR(random jump to $x$ only)
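The two views of the contribution can be compared on a small example: summing $c^k \pi(W) (1 - c)/n$ over walks, and computing linear PageRank with the random jump restricted to $x$. The sketch below is illustrative. The three-node acyclic graph and the function names are hypothetical, and the walk enumeration is cut off at `max_len`, so on a graph with cycles it would only approximate the infinite sum.

```python
def contribution_by_walks(out_links, x, y, c=0.85, max_len=10):
    """Sum c^k * pi(W) * (1 - c) / n over all walks W from x to y with k <= max_len."""
    n = len(out_links)
    total = 0.0
    stack = [(x, 0, 1.0)]  # (current node, walk length k, weight pi(W) so far)
    while stack:
        node, k, weight = stack.pop()
        if node == y:
            total += (c ** k) * weight * (1 - c) / n
        if k < max_len:
            for nxt in out_links[node]:
                stack.append((nxt, k + 1, weight / len(out_links[node])))
    return total


def linear_pagerank(out_links, v, c=0.85, iterations=200):
    """Iterate p = c T^T p + (1 - c) v (linear model, no dangling-node adjustment)."""
    p = [(1 - c) * vi for vi in v]
    for _ in range(iterations):
        nxt = [(1 - c) * vi for vi in v]
        for i, targets in enumerate(out_links):
            for j in targets:
                nxt[j] += c * p[i] / len(targets)
        p = nxt
    return p


# Hypothetical acyclic graph: 0 -> 1, 0 -> 2, 1 -> 2.
out_links = [[1, 2], [2], []]
n, x, y, c = 3, 0, 2, 0.85

by_walks = contribution_by_walks(out_links, x, y, c)

v_x = [1.0 / n if i == x else 0.0 for i in range(n)]  # random jump to x only
by_pagerank = linear_pagerank(out_links, v_x, c)[y]
```

Here there are exactly two walks from node 0 to node 2 (one of length 1 and one of length 2, each with weight 1/2), so both computations give $(c/2 + c^2/2)(1 - c)/3$.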