Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
Data Mining: Concepts and Techniques
#1: C4.5 Decision Tree - Classification (61 votes)
#2: K-Means - Clustering (60 votes)
#3: SVM - Classification (58 votes)
#4: Apriori - Frequent Itemsets (52 votes)
#5: EM - Clustering (48 votes)
#6: PageRank - Link mining (46 votes)
#7: AdaBoost - Boosting (45 votes)
#7: kNN - Classification (45 votes)
#7: Naive Bayes - Classification (45 votes)
#10: CART - Classification (34 votes)
How to organize the Web?
First try: Human curated
Web directories
Second try: Web Search
Link-based ranking algorithms
Not all web pages are equally “important”
www.joe-schmoe.com vs. www.stanford.edu
There is large diversity in web-graph node connectivity.
Let’s rank the pages by the link structure!
Idea: Links as votes
Think of in-links as votes:
Are all in-links equal?
[Figure: example PageRank scores (B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, five other nodes 1.6 each)]
Each link’s vote is proportional to the importance of its source page
If page j with importance rj has n out-links, each link gets rj / n votes
Page j’s own importance is the sum of the votes on its in-links
[Figure: page j receives ri/3 from page i and rk/4 from page k, and sends rj/3 along each of its 3 out-links]
rj = ri/3 + rk/4
A “vote” from an important page is worth more
A page is important if it is pointed to by other important pages
Define a “rank” rj for page j:
$r_j = \sum_{i \to j} \frac{r_i}{d_i}$
di … out-degree of node i
[Figure: “The web in 1839”, a three-page graph: y links to y and a, a links to y and m, m links to a]
“Flow” equations:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2
3 equations, 3 unknowns, no constants
Additional constraint forces uniqueness:
ry + ra + rm = 1
Solution: ry = 2/5, ra = 2/5, rm = 1/5
Gaussian elimination works for small examples, but we need a better method for large web-size graphs
We need a new formulation!
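For a graph this small, the flow equations can be solved directly. The following is a minimal sketch, assuming exact rational arithmetic and assuming that the redundant third flow equation is replaced by the constraint ry + ra + rm = 1; `solve` is a hypothetical helper, not part of the slides.

```python
from fractions import Fraction as F

def solve(a, b):
    """Gauss-Jordan elimination for a small square system a x = b (exact arithmetic)."""
    n = len(a)
    # Augmented matrix [a | b], copied so the inputs stay untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]

# Unknowns ordered (ry, ra, rm).
# ry = ry/2 + ra/2   ->   ry/2 - ra/2       = 0
# ra = ry/2 + rm     ->  -ry/2 + ra   - rm  = 0
# ry + ra + rm = 1   (assumed replacement for the redundant third equation)
a = [[F(1, 2), F(-1, 2), F(0)],
     [F(-1, 2), F(1), F(-1)],
     [F(1), F(1), F(1)]]
b = [F(0), F(0), F(1)]

ry, ra, rm = solve(a, b)
print(ry, ra, rm)
```

Elimination costs on the order of N^3 operations, which is why it does not scale to web-size graphs.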
Flow equations:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2
Stochastic adjacency matrix M:
If i → j, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
Rank vector r: vector with an entry per page
$\sum_i r_i = 1$
The flow equations can be written
r = M · r
Remember the flow equation:
$r_j = \sum_{i \to j} \frac{r_i}{d_i}$
Flow equation in the matrix form:
M · r = r
[Figure: column i of M has entry 1/3 in row j, so ri contributes ri/3 to rj]
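r = M∙r
$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2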
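The matrix form can be checked mechanically. The sketch below builds a column-stochastic matrix from out-link lists and confirms that r = (2/5, 2/5, 1/5) is left unchanged by M. The names `links`, `pages`, `stochastic_matrix` and `matvec` are illustrative assumptions, not names from the slides.

```python
from fractions import Fraction as F

# Out-links of the three-page example (assumed from the flow equations).
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
pages = ["y", "a", "m"]

def stochastic_matrix(pages, links):
    """Return M as a list of rows, with M[j][i] = 1/d_i if i links to j, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[F(0)] * n for _ in range(n)]
    for i, outs in links.items():
        for j in outs:
            M[index[j]][index[i]] += F(1, len(outs))
    return M

def matvec(M, r):
    """Multiply matrix M (list of rows) by vector r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

M = stochastic_matrix(pages, links)
r = [F(2, 5), F(2, 5), F(1, 5)]
print(matvec(M, r) == r)  # r is unchanged by M, i.e. an eigenvector with eigenvalue 1
```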
The flow equations can be written r = M · r
So the rank vector r is an eigenvector of the stochastic web matrix M, with corresponding eigenvalue 1
The largest eigenvalue of M is 1, since M is column stochastic (with non-negative entries)
We can now efficiently solve for r!
The method is called Power iteration
NOTE: x is an eigenvector with the corresponding eigenvalue λ if:
$A x = \lambda x$
Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks
Power iteration: a simple iterative scheme
$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
di … out-degree of node i
$|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
Can use any other vector norm, e.g., Euclidean
Example:
Iteration   0     1     2     3      …
ry          1/3   1/3   5/12  9/24   …   6/15
ra          1/3   3/6   1/3   11/24  …   6/15
rm          1/3   1/6   3/12  1/6    …   3/15

M      y    a    m
y      ½    ½    0
a      ½    0    1
m      0    ½    0
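The iteration in this example can be reproduced in a few lines. This is a sketch only, assuming a uniform start vector of 1/N per page and a stopping rule based on the L1 norm of the change; `power_iteration`, `eps` and `max_iter` are hypothetical names.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Repeat r <- M r from the uniform vector until the L1 change drops below eps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Column-stochastic matrix of the y, a, m example.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r = power_iteration(M)
print([round(x, 4) for x in r])  # approximately [0.4, 0.4, 0.2], i.e. 6/15, 6/15, 3/15
```

Each pass only needs one matrix-vector product, so with a sparse M the cost per iteration is proportional to the number of links.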
Example:
[Figure: two-page graph with nodes a and b]
ra   1 1
rb   1 1
Iteration 0, 1, 2, …
$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
Example:
[Figure: two-page graph with nodes a and b]
ra   1
rb   1
Iteration 0, 1, 2, …
$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
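Imagine a random web surfer:
At each time step, the surfer follows an out-link of the current page chosen uniformly at random
Let:
p(t) … vector whose i-th coordinate is the probability that the surfer is at page i at time t
$r_j = \sum_{i \to j} \frac{r_i}{d_{\text{out}}(i)}$
[Figure: page j with in-links from pages i1, i2, i3]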
Where is the surfer at time t+1?
p(t+1) = M · p(t)
Suppose the random walk reaches a state
p(t+1) = M · p(t) = p(t)
then p(t) is a stationary distribution of the random walk
Our original rank vector r satisfies r = M · r
So r is a stationary distribution for the random walk
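The random-surfer reading can be tested empirically: simulate one long walk and compare visit frequencies with the rank vector. A rough sketch, assuming the three-page y, a, m graph from the earlier example and a fixed seed for repeatability; `simulate` and its parameters are illustrative names.

```python
import random

# Out-links of the three-page example (assumed from the flow equations).
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def simulate(links, steps, seed=0):
    """Walk the graph for `steps` steps and return the fraction of time spent on each page."""
    rng = random.Random(seed)
    page = rng.choice(sorted(links))          # arbitrary start page
    visits = dict.fromkeys(links, 0)
    for _ in range(steps):
        page = rng.choice(links[page])        # follow a uniformly random out-link
        visits[page] += 1
    return {p: c / steps for p, c in visits.items()}

freq = simulate(links, 200_000)
print(freq)  # should be close to ry = 0.4, ra = 0.4, rm = 0.2
```

The frequencies are only approximate; the error shrinks as the number of steps grows.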
Does this converge?
Does it converge to what we want?
$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, or equivalently $r^{(t+1)} = M \cdot r^{(t)}$
A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions (strongly connected, no dead ends), the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time t = 0 is
2 problems:
(1) Some pages are dead ends (have no out-links)
(2) Spider traps (all out-links are within the group)
Example (dead end):
Iteration   0     1     2     3      …
ry          1/3   2/6   3/12  5/24   …   0
ra          1/3   1/6   2/12  3/24   …   0
rm          1/3   1/6   1/12  2/24   …   0

M      y    a    m
y      ½    ½    0
a      ½    0    0
m      0    ½    0
Here the PageRank “leaks” out since the matrix is not stochastic.
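The leak is easy to see by tracking the total score after each step. A small sketch with exact fractions, assuming the dead-end matrix of this example (column m is all zeros); `step` and `totals` are illustrative names.

```python
from fractions import Fraction as F

# Dead-end example: page m has no out-links, so its column sums to 0.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0), F(0)],
     [F(0), F(1, 2), F(0)]]

def step(M, r):
    """One power-iteration step r <- M r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

r = [F(1, 3)] * 3
totals = [sum(r)]
for _ in range(3):
    r = step(M, r)
    totals.append(sum(r))

print(r)       # scores after 3 iterations: 5/24, 3/24, 2/24
print(totals)  # total score shrinks: 1, 2/3, 1/2, 5/12
```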
Teleports: Follow random teleport links from dead-ends
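Original matrix (m is a dead end):
       y    a    m
y      ½    ½    0
a      ½    0    0
m      0    ½    0
Adjusted matrix (from m, teleport to each page with probability ⅓):
       y    a    m
y      ½    ½    ⅓
a      ½    0    ⅓
m      0    ½    ⅓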
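The matrix adjustment amounts to replacing every all-zero column with a uniform column. A minimal sketch under that assumption; `patch_dead_ends` is a hypothetical helper name.

```python
from fractions import Fraction as F

def patch_dead_ends(M):
    """Return a copy of M where each all-zero column (a dead end) becomes uniform 1/N."""
    n = len(M)
    patched = [row[:] for row in M]
    for i in range(n):
        if all(M[j][i] == 0 for j in range(n)):
            for j in range(n):
                patched[j][i] = F(1, n)
    return patched

# Dead-end example: column m (index 2) is all zeros.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0), F(0)],
     [F(0), F(1, 2), F(0)]]

fixed = patch_dead_ends(M)
for row in fixed:
    print([str(v) for v in row])
```

After patching, every column sums to 1 again, so no score leaks out.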
Example (m is a spider trap):
Iteration   0     1     2     3      …
ry          1/3   2/6   3/12  5/24   …   0
ra          1/3   1/6   2/12  3/24   …   0
rm          1/3   3/6   7/12  16/24  …   1

M      y    a    m
y      ½    ½    0
a      ½    0    0
m      0    ½    1
All the PageRank score gets “trapped” in node m.
The Google solution for spider traps: At each time step, the random surfer has two options
With probability β, follow a link at random
With probability 1-β, jump to a random page
Surfer will teleport out of spider trap within a few time steps
PageRank equation [Brin-Page, ‘98]
$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N}$
The Google Matrix A:
$A = \beta M + (1 - \beta) \left[\frac{1}{N}\right]_{N \times N}$
We have a recursive problem: r = A · r
And the Power method still works!
What is β?
[1/N]NxN … N by N matrix where all entries are 1/N
Example (β = 0.8):
A = 0.8 · M + 0.2 · [1/N]NxN

M      y      a      m
y      1/2    1/2    0
a      1/2    0      0
m      0      1/2    1

A      y      a      m
y      7/15   7/15   1/15
a      7/15   1/15   1/15
m      1/15   7/15   13/15

Iteration   0     1      2      3      …
ry          1/3   0.33   0.28   0.26   …   7/33
ra          1/3   0.20   0.20   0.18   …   5/33
rm          1/3   0.47   0.52   0.56   …   21/33
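The example can be reproduced by building A explicitly and iterating. A sketch with exact fractions, assuming β = 0.8 and the spider-trap matrix M of this example; `google_matrix` is an illustrative name, and forming the dense N by N matrix is only practical for toy graphs.

```python
from fractions import Fraction as F

def google_matrix(M, beta):
    """Return A = beta * M + (1 - beta) * [1/N] as a dense list of rows."""
    n = len(M)
    return [[beta * M[j][i] + (1 - beta) * F(1, n) for i in range(n)]
            for j in range(n)]

# Spider-trap example: m links only to itself.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0), F(0)],
     [F(0), F(1, 2), F(1)]]

A = google_matrix(M, F(4, 5))

# Power iteration from the uniform vector.
r = [F(1, 3)] * 3
for _ in range(50):
    r = [sum(a * x for a, x in zip(row, r)) for row in A]

print([round(float(x), 4) for x in r])  # close to 7/33, 5/33, 21/33
```

With teleports, node m still ends up with the largest score, but it no longer absorbs all of it.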
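Input: Graph G and parameter β
Output: PageRank vector $r^{new}$
Set: $r_j^{old} = \frac{1}{N}$
Repeat while $\sum_j |r_j^{new} - r_j^{old}| > \varepsilon$ (starting each new pass with $r^{old} = r^{new}$):
  ∀j: $r'^{new}_j = \sum_{i \to j} \beta \frac{r_i^{old}}{d_i}$
      $r'^{new}_j = 0$ if in-degree of j is 0
  ∀j: $r_j^{new} = r'^{new}_j + \frac{1 - S}{N}$, where $S = \sum_j r'^{new}_j$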
If the graph has no dead-ends then the amount of leaked PageRank is 1-β. But since we have dead-ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
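The complete algorithm never needs the dense matrix A: it spreads β-weighted votes along the links, measures what is left as S, and hands the missing 1 - S back uniformly. A sketch under those assumptions; `pagerank`, `eps`, `max_iter` and the dictionary-of-out-links input are illustrative choices, and every link target is assumed to appear as a key.

```python
def pagerank(links, beta=0.8, eps=1e-10, max_iter=1000):
    """PageRank with teleports; score leaked by dead ends and teleports is re-inserted uniformly."""
    pages = list(links)
    n = len(pages)
    r_old = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # r'_j = sum over in-links of beta * r_i / d_i (stays 0 when in-degree is 0).
        r_new = dict.fromkeys(pages, 0.0)
        for i, outs in links.items():
            for j in outs:
                r_new[j] += beta * r_old[i] / len(outs)
        # S is the score that survived; (1 - S) leaked and is shared equally.
        s = sum(r_new.values())
        for j in pages:
            r_new[j] += (1.0 - s) / n
        if sum(abs(r_new[p] - r_old[p]) for p in pages) <= eps:
            return r_new
        r_old = r_new
    return r_old

# Dead-end example: m has no out-links.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": []}
r = pagerank(links)
print({p: round(v, 4) for p, v in r.items()})
```

For this graph the scores stay normalized even though m is a dead end, because the leaked share is re-inserted on every pass.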
Measures generic popularity of a page
Uses a single measure of importance
Susceptible to link spam: links created to boost page rank