CS246: Mining Massive Datasets Jure Leskovec, Stanford University
http://cs246.stanford.edu
Web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
We already know:
Since there is large diversity in the connectivity of the webgraph we can rank the pages by the link structure
2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 2
We will cover the following Link Analysis
approaches to computing importances of nodes in a graph:
Idea: Links as votes
Think of in-links as votes:
Are all in-links equal?
Each link’s vote is proportional to the
importance of its source page
If page p with importance x has n out-links,
each link gets x/n votes
Page p’s own importance is the sum of the
votes on its in-links
A “vote” from an important
page is worth more
A page is important if it is
pointed to by other important pages
Define a “rank” rj for node j
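$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
where $d_i$ is the out-degree of node $i$
[Figure: "The web in 1839", a graph of three pages y, a, m; y links to itself and to a, a links to y and to m, m links to a]
Flow equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$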
3 equations, 3 unknowns, no constants
Additional constraint forces uniqueness: $r_y + r_a + r_m = 1$
Gaussian elimination works for small examples, but we need a better method for large web-size graphs
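For a graph this small the flow equations can be solved directly. The sketch below uses a plain Gauss-Jordan elimination with exact fractions; it assumes the additional constraint is the normalization $r_y + r_a + r_m = 1$, and the function name `gauss_jordan` is illustrative only.

```python
from fractions import Fraction as F

def gauss_jordan(aug):
    """Solve A x = b given the augmented matrix [A | b] (rows of Fractions)."""
    n = len(aug)
    rows = [row[:] for row in aug]
    for col in range(n):
        # Pick a row at or below the diagonal with a non-zero entry in this column.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Normalize the pivot row so the pivot entry becomes 1.
        p = rows[col][col]
        rows[col] = [v / p for v in rows[col]]
        # Subtract multiples of the pivot row to clear the column elsewhere.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [v - f * pv for v, pv in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

# Unknowns are (r_y, r_a, r_m). The third flow equation is implied by the
# first two, so it is replaced by the assumed normalization constraint.
system = [
    [F(1, 2), F(-1, 2), F(0), F(0)],   # r_y - r_y/2 - r_a/2 = 0
    [F(-1, 2), F(1), F(-1), F(0)],     # r_a - r_y/2 - r_m = 0
    [F(1), F(1), F(1), F(1)],          # r_y + r_a + r_m = 1
]
r_y, r_a, r_m = gauss_jordan(system)
print(r_y, r_a, r_m)
```

Elimination costs on the order of n³ operations for n pages, which is why it is only practical for toy examples like this one.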
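Stochastic adjacency matrix M: if page j has $d_j$ out-links and j → i, then $M_{ij} = 1/d_j$, otherwise $M_{ij} = 0$
M is column stochastic (columns sum to 1) when every page has at least one out-link
Rank vector r: vector with an entry per page
The flow equations can be written $r = M \cdot r$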
Suppose page j links to 3 pages, including i
[Figure: in $r = M \cdot r$, column j of M has the entry 1/3 in row i, so page j contributes $r_j/3$ to $r_i$]
The flow equations can be written $r = M \cdot r$
So the rank vector is an eigenvector of the stochastic web matrix, with corresponding eigenvalue 1
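Example (rows and columns ordered y, a, m):
$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$
which is the matrix form of the flow equations $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$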
Given a web graph with n nodes, where the
nodes are pages and edges are hyperlinks
Power iteration: a simple iterative scheme
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
$d_i$ … out-degree of node $i$
Power Iteration:
Set $r_j = 1/N$
Update $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ and repeat
Example (values at iteration 0, 1, 2, …):
$r_y$: 1/3, 1/3, 5/12, 9/24, …, 6/15
$r_a$: 1/3, 3/6, 1/3, 11/24, …, 6/15
$r_m$: 1/3, 1/6, 3/12, 1/6, …, 3/15
for the graph with flow equations $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$
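A minimal sketch of power iteration on the y, a, m example. The stopping rule (L1 distance between successive vectors below `eps`) and the parameter names are assumptions of this sketch, not something the slides specify.

```python
def step(M, r):
    """One power-iteration step r_new = M . r (M[j][i] is the weight of link i -> j)."""
    n = len(r)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]

def power_iteration(M, eps=1e-10, max_iter=1000):
    n = len(M)
    r = [1.0 / n] * n                      # start from the uniform vector 1/N
    for _ in range(max_iter):
        r_new = step(M, r)
        # Assumed stopping rule: L1 distance between successive vectors below eps.
        if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:
            return r_new
        r = r_new
    return r

# Column-stochastic matrix of the example, rows and columns ordered y, a, m.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iteration(M)
print([round(x, 4) for x in r])
```

The result should approach the limit values of the example, 6/15, 6/15 and 3/15.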
Imagine a random web surfer:
Let:
p(t) … vector whose ith coordinate is the probability that the surfer is at page i at time t
$$r_j = \sum_{i \to j} \frac{r_i}{d_{\text{out}}(i)}$$
[Figure: page j with in-links from pages i1, i2, i3]
Where is the surfer at time t+1?
p(t+1) = M · p(t)
Suppose the random walk reaches a state
p(t+1) = M · p(t) = p(t)
then p(t) is a stationary distribution of the random walk
Our rank vector r satisfies r = M · r
So r is a stationary distribution for the random walk
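The random-surfer view can be checked empirically. The sketch below simulates a surfer on the y, a, m graph implied by the flow equations and records visit frequencies; the helper name `simulate_surfer`, the step count and the seed are illustrative choices.

```python
import random

# Out-links read off the flow equations: y -> {y, a}, a -> {y, m}, m -> {a}.
OUT_LINKS = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def simulate_surfer(out_links, start, steps, seed=0):
    """Follow a uniformly random out-link at every step and count page visits."""
    rng = random.Random(seed)
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    # Fraction of time spent on each page.
    return {p: c / steps for p, c in visits.items()}

freq = simulate_surfer(OUT_LINKS, "y", 200_000)
print(freq)
```

With enough steps the visit frequencies should be close to the stationary distribution (0.4, 0.4, 0.2) of this particular graph.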
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
or equivalently, in matrix form, $r^{(t+1)} = M \cdot r^{(t)}$
Does this converge?
Does it converge to what we want?
Are results reasonable?
Example: pages a and b link only to each other (values at iteration 0, 1, 2, …)
$r_a$: 1, 0, 1, 0
$r_b$: 0, 1, 0, 1
Example: page a links to b, and b has no out-links (values at iteration 0, 1, 2, …)
$r_a$: 1, 0, 0, 0
$r_b$: 0, 1, 0, 0
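Both failure modes can be reproduced with two pages. The two graphs used here are assumptions of this sketch: a and b linking only to each other, and a linking to b where b has no out-links.

```python
def iterate(M, r, steps):
    """Return [r(0), r(1), ..., r(steps)] for r(t+1) = M . r(t)."""
    history = [r]
    for _ in range(steps):
        n = len(r)
        r = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        history.append(r)
    return history

# Assumed graph 1: a <-> b. Column i holds the out-links of page i.
CYCLE = [[0, 1],
         [1, 0]]
# Assumed graph 2: a -> b, and b is a dead end (its column is all zeros).
DEAD_END = [[0, 0],
            [1, 0]]

cycle_hist = iterate(CYCLE, [1, 0], 3)
dead_hist = iterate(DEAD_END, [1, 0], 3)
print(cycle_hist)   # the vector keeps flipping and never settles
print(dead_hist)    # all importance eventually disappears
```

Under these assumptions the first run oscillates forever, and the second converges, but to the all-zero vector, which is not a useful ranking.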
2 problems:
Some pages are “dead ends” (have no out-links)
Such pages cause importance to “leak out”
Spider traps (all out-links are within the group)
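Power Iteration:
Set $r_j = 1/N$
Update $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ and repeat
Example (m is a spider trap; values at iteration 0, 1, 2, …):
$r_y$: 1/3, 2/6, 3/12, 5/24, …, 0
$r_a$: 1/3, 1/6, 2/12, 3/24, …, 0
$r_m$: 1/3, 3/6, 7/12, 16/24, …, 1
$$M = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} \quad \text{(rows and columns ordered y, a, m)}$$
Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2$, $r_m = r_a/2 + r_m$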
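The spider-trap example can be replayed with exact fractions. This is a small sketch; the number of iterations (50) is an arbitrary choice.

```python
from fractions import Fraction as F

# Spider-trap matrix of the example, rows and columns ordered y, a, m.
M_TRAP = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 2), F(0),    F(0)],
          [F(0),    F(1, 2), F(1)]]

def step(M, r):
    """One power-iteration step r_new = M . r."""
    n = len(r)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]

r = [F(1, 3)] * 3
history = [r]
for _ in range(50):
    r = step(M_TRAP, r)
    history.append(r)

print(history[3])            # matches the fourth column of the example
print(float(history[50][2])) # share of the total importance held by m
```

The total stays 1 at every step, but page m absorbs almost all of it.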
The Google solution for spider traps: At each time step, the random surfer has two options:
With probability β, follow a link at random
With probability 1−β, jump to some page uniformly at random
Surfer will teleport out of spider trap within a few time steps
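Power Iteration:
Set $r_j = 1/N$
Update $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ and repeat
Example (m is a dead end; values at iteration 0, 1, 2, …):
$r_y$: 1/3, 2/6, 3/12, 5/24, …, 0
$r_a$: 1/3, 1/6, 2/12, 3/24, …, 0
$r_m$: 1/3, 1/6, 1/12, 2/24, …, 0
$$M = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{bmatrix} \quad \text{(rows and columns ordered y, a, m)}$$
Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2$, $r_m = r_a/2$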
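Teleports: Follow random teleport links with probability 1.0 from dead-ends
Adjust the matrix accordingly (rows and columns ordered y, a, m):
$$\begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{bmatrix} \;\longrightarrow\; \begin{bmatrix} 1/2 & 1/2 & 1/3 \\ 1/2 & 0 & 1/3 \\ 0 & 1/2 & 1/3 \end{bmatrix}$$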
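A short sketch of the dead-end adjustment: every all-zero column is replaced by a uniform column 1/N. The helper names `fix_dead_ends` and `run` are illustrative.

```python
from fractions import Fraction as F

# Dead-end matrix of the example (m has no out-links), ordered y, a, m.
M_DEAD = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 2), F(0),    F(0)],
          [F(0),    F(1, 2), F(0)]]

def fix_dead_ends(M):
    """Replace each all-zero column by a uniform column 1/N (teleport with probability 1)."""
    n = len(M)
    fixed = [row[:] for row in M]
    for j in range(n):
        if all(M[i][j] == 0 for i in range(n)):
            for i in range(n):
                fixed[i][j] = F(1, n)
    return fixed

def run(M, steps):
    """Apply r <- M . r for the given number of steps, starting from 1/N."""
    n = len(M)
    r = [F(1, n)] * n
    for _ in range(steps):
        r = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
    return r

M_FIXED = fix_dead_ends(M_DEAD)
print(sum(run(M_DEAD, 3)))    # importance has leaked out
print(sum(run(M_FIXED, 3)))   # total importance is preserved
```

After the adjustment every column sums to 1, so the total importance no longer shrinks from one iteration to the next.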
Markov Chains
Set of states X
Transition matrix P where $P_{ij} = P(X_t = i \mid X_{t-1} = j)$
π specifying the probability of being at each state x ∈ X
Goal is to find π such that π = P π
Theory of Markov chains
Fact: For any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic.
Stochastic: Every column sums to 1
A possible solution: Add green links
$$\begin{bmatrix} 1/2 & 1/2 & 1/3 \\ 1/2 & 0 & 1/3 \\ 0 & 1/2 & 1/3 \end{bmatrix} \quad \text{(rows and columns ordered y, a, m)}$$
$r_y = r_y/2 + r_a/2 + r_m/3$
$r_a = r_y/2 + r_m/3$
$r_m = r_a/2 + r_m/3$
A chain is periodic if there exists k > 1 such
that the interval between two visits to some state s is always a multiple of k.
A possible solution: Add green links
Irreducible: From any state, there is a non-zero probability of going to any other state
A possible solution: Add green links
Google’s solution that does it all:
At each step, random surfer has two options:
With probability β, follow a link at random
With probability 1−β, jump to some page uniformly at random
PageRank equation [Brin-Page, 98]
$$r_j = \beta \sum_{i \to j} \frac{r_i}{d_i} + (1-\beta)\,\frac{1}{n}$$
$d_i$ … out-degree of node $i$
Assuming we follow random teleport links with probability 1.0 from dead-ends
The Google Matrix A:
$$A = \beta\, S + (1-\beta)\,\frac{1}{n}\,\mathbf{1} \cdot \mathbf{1}^T$$
A is stochastic, aperiodic and irreducible, so the power iteration converges:
$$r^{(t+1)} = A \cdot r^{(t)}$$
What is β?
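Example with β = 0.8 (rows and columns ordered y, a, m):
$$A = 0.8 \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} + 0.2 \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix} = \begin{bmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{bmatrix}$$
that is, $A = 0.8\,S + 0.2\,\frac{1}{n}\,\mathbf{1} \cdot \mathbf{1}^T$; for instance the m → m entry is 0.8 + 0.2·⅓ = 13/15 and the y → y entry is 0.8·½ + 0.2·⅓ = 7/15
Power iteration (values at iteration 0, 1, 2, …):
$r_y$: 1/3, 0.33, 0.28, 0.26, …, 7/33
$r_a$: 1/3, 0.20, 0.20, 0.18, …, 5/33
$r_m$: 1/3, 0.47, 0.52, 0.56, …, 21/33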
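The example matrix can be rebuilt and iterated with exact arithmetic. In this sketch β = 4/5 (that is, 0.8) is taken from the example, and `google_matrix` is an illustrative helper name.

```python
from fractions import Fraction as F

# Link matrix S of the example (m is a spider trap), ordered y, a, m.
S = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(0)],
     [F(0),    F(1, 2), F(1)]]

def google_matrix(S, beta):
    """A = beta * S + (1 - beta) * (1/n) * 1 . 1^T, computed entry by entry."""
    n = len(S)
    return [[beta * S[i][j] + (1 - beta) * F(1, n) for j in range(n)]
            for i in range(n)]

def step(A, r):
    """One power-iteration step r_new = A . r."""
    n = len(r)
    return [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]

A = google_matrix(S, F(4, 5))
r0 = [F(1, 3)] * 3
r1 = step(A, r0)
r2 = step(A, r1)
print(A)
print([float(x) for x in r1], [float(x) for x in r2])
```

Exact fractions make it easy to compare each iterate with the rounded decimals in the example: the second iterate is (7/25, 1/5, 13/25), that is (0.28, 0.20, 0.52).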
Suppose there are N pages
Consider a page j, with set of out-links O(j)
We have $M_{ij} = 1/|O(j)|$ when j→i and $M_{ij} = 0$ otherwise
The random teleport is equivalent to
Adding a teleport link from j to every page with probability (1-β)/N
Reducing the probability of following each out-link from 1/|O(j)| to β/|O(j)|
Equivalent: tax each page a fraction (1-β) of its score and redistribute evenly
Construct the N x N matrix A as follows
$A_{ij} = \beta\, M_{ij} + (1-\beta)/N$
Verify that A is a stochastic matrix
The page rank vector r is the principal eigenvector of this matrix A, satisfying $r = A \cdot r$
Equivalently, r is the stationary distribution of the random walk with teleports
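Both claims can be checked on the three-page spider-trap graph. The sketch below builds A from out-link sets O(j), confirms that each column sums to 1, and confirms that (7/33, 5/33, 21/33) is a fixed point; β = 4/5 and the helper name `build_A` are choices of this sketch.

```python
from fractions import Fraction as F

PAGES = ["y", "a", "m"]
# Out-link sets O(j) of the spider-trap example: m links only to itself.
OUT_LINKS = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}

def build_A(pages, out_links, beta):
    """A_ij = beta * M_ij + (1 - beta) / N with M_ij = 1/|O(j)| when j -> i."""
    n = len(pages)
    index = {p: k for k, p in enumerate(pages)}
    # Start every entry at the teleport share (1 - beta) / N.
    A = [[(1 - beta) * F(1, n) for _ in range(n)] for _ in range(n)]
    for j, p in enumerate(pages):
        for q in out_links[p]:
            A[index[q]][j] += beta * F(1, len(out_links[p]))
    return A

A = build_A(PAGES, OUT_LINKS, F(4, 5))
n = len(PAGES)
column_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]

r = [F(7, 33), F(5, 33), F(21, 33)]
A_times_r = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
print(column_sums, A_times_r == r)
```

Column sums of 1 follow because each column receives β in total from its out-links plus N shares of (1-β)/N from teleporting.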
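Key step is matrix-vector multiplication: rnew = A · rold
Easy if we have enough main memory to hold A, rold, rnew
Say N = 1 billion pages
We need 4 bytes for each entry (say)
2 billion entries for the two vectors, approx 8GB
Matrix A has N² = 10^18 entries, far too many to hold in memory
$$A = \beta \cdot M + (1-\beta)\,[1/N]_{N \times N}$$
Example with β = 0.8, N = 3:
$$A = 0.8 \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} + 0.2 \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix} = \begin{bmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{bmatrix}$$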
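$r = A \cdot r$, where $A_{ij} = \beta\, M_{ij} + \frac{1-\beta}{N}$
$$\begin{aligned}
r_i &= \sum_{j=1}^{N} A_{ij} \cdot r_j \\
    &= \sum_{j=1}^{N} \left[ \beta\, M_{ij} + \frac{1-\beta}{N} \right] \cdot r_j \\
    &= \sum_{j=1}^{N} \beta\, M_{ij} \cdot r_j + \frac{1-\beta}{N} \sum_{j=1}^{N} r_j \\
    &= \sum_{j=1}^{N} \beta\, M_{ij} \cdot r_j + \frac{1-\beta}{N}, \quad \text{since } \sum_{j} r_j = 1
\end{aligned}$$
So, $r = \beta\, M \cdot r + \left[ \frac{1-\beta}{N} \right]_N$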
[x]N … a vector of length N with all entries x
We can rearrange the PageRank equation
$$r = \beta\, M \cdot r + \left[ \frac{1-\beta}{N} \right]_N$$
M is a sparse matrix!
So in each iteration, we need to:
Compute rnew = β M · rold
Add a constant value (1-β)/N to each entry in rnew
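The rearrangement can be sanity-checked numerically: multiplying by the dense A and using the sparse form should give the same vector whenever the entries of r sum to 1. The function names below are illustrative.

```python
from fractions import Fraction as F

M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(0)],
     [F(0),    F(1, 2), F(1)]]

def dense_step(M, r, beta):
    """r_new = A . r with A_ij = beta * M_ij + (1 - beta) / N formed explicitly."""
    n = len(r)
    return [sum((beta * M[i][j] + (1 - beta) / n) * r[j] for j in range(n))
            for i in range(n)]

def sparse_step(M, r, beta):
    """r_new = beta * M . r + (1 - beta) / N, touching only non-zero entries of M."""
    n = len(r)
    return [beta * sum(M[i][j] * r[j] for j in range(n) if M[i][j] != 0)
            + (1 - beta) / n
            for i in range(n)]

beta = F(4, 5)
r = [F(1, 3)] * 3                      # entries sum to 1, as the derivation requires
for _ in range(5):
    assert dense_step(M, r, beta) == sparse_step(M, r, beta)
    r = sparse_step(M, r, beta)
print(r)
```

The sparse form never materializes the N × N matrix A; only the non-zero entries of M and one constant are needed.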
Encode sparse matrix using only nonzero entries

source node   degree   destination nodes
0             3        1, 5, 7
1             5        17, 64, 113, 117, 245
2             2        13, 23
Assume enough RAM to fit rnew into memory
Then 1 step of power-iteration is:
  Initialize all entries of rnew to (1-β)/N
  For each page p (of out-degree n):
    Read into memory: p, n, dest1,…,destn, rold(p)
    for j = 1…n: rnew(destj) += β rold(p) / n

src   degree   destination
0     3        1, 5, 6
1     4        17, 64, 113, 117
2     2        13, 23
[Figure: vectors rnew and rold, each with entries indexed 0 to 6]
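The step above translates almost line by line into code. In this sketch the sparse matrix is a list of `(source, degree, destinations)` records, read one at a time as if streamed from disk; the three-page graph and β = 4/5 are illustrative choices, not taken from the table.

```python
from fractions import Fraction as F

def pagerank_step(links, r_old, beta):
    """One power-iteration step over a sparse (source, degree, destinations) encoding."""
    n = len(r_old)
    # Initialize all entries of r_new to (1 - beta) / N.
    r_new = [(1 - beta) / n] * n
    for src, degree, dests in links:          # one record of M at a time
        share = beta * r_old[src] / degree    # what src sends along each out-link
        for d in dests:
            r_new[d] += share
    return r_new

# Illustrative graph: pages 0, 1, 2 stand for y, a, m of the spider-trap example.
LINKS = [(0, 2, [0, 1]),
         (1, 2, [0, 2]),
         (2, 1, [2])]

r_old = [F(1, 3)] * 3
r_new = pagerank_step(LINKS, r_old, F(4, 5))
print(r_new)
```

Only r_new has to stay in memory for the whole pass; r_old and the link records are read sequentially.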
Assume enough RAM to fit rnew into memory
In each iteration, we have to:
Read rold and M
Write rnew back to disk
Question: What if we could not even fit rnew in memory?
src   degree   destination
0     4        0, 1, 3, 5
1     2        0, 5
2     2        3, 4
[Figure: vectors rnew and rold, each with entries indexed 0 to 5]
Similar to nested-loop join in databases
Break rnew into k blocks that fit in memory
Scan M and rold once for each block
k scans of M and rold
Can we do better?
Hint: M is much bigger than r, so we must avoid reading it k times per iteration
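A sketch of the block-based update, assuming rnew is split into consecutive blocks of `block_size` entries and M is rescanned in full for every block. The graph is the three-row table with destinations 0, 1, 3, 5 / 0, 5 / 3, 4; treating it as the whole graph on six nodes, and β = 4/5, are assumptions of this sketch.

```python
from fractions import Fraction as F

def block_update(links, r_old, beta, block_size):
    """Compute r_new one block at a time, scanning all of M for every block."""
    n = len(r_old)
    r_new, scans = [], 0
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        # Only this block of r_new is held in memory.
        block = [(1 - beta) / n] * (end - start)
        scans += 1                                  # one full pass over M and r_old
        for src, degree, dests in links:
            for d in dests:
                if start <= d < end:                # ignore destinations outside the block
                    block[d - start] += beta * r_old[src] / degree
        r_new.extend(block)                         # stands in for writing the block to disk
    return r_new, scans

LINKS = [(0, 4, [0, 1, 3, 5]),
         (1, 2, [0, 5]),
         (2, 2, [3, 4])]

r_old = [F(1, 6)] * 6
r_new, scans = block_update(LINKS, r_old, F(4, 5), block_size=2)
print(r_new, scans)
```

With six entries and blocks of two, M is scanned three times per iteration, which is the cost the question above asks us to avoid.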
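Stripe for rnew entries 0 and 1:
src   degree   destination
0     4        0, 1
1     3        0
2     2        1

Stripe for rnew entries 2 and 3:
src   degree   destination
0     4        3
2     2        3

Stripe for rnew entries 4 and 5:
src   degree   destination
0     4        5
1     3        5
2     2        4
[Figure: vectors rnew and rold, each with entries indexed 0 to 5]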
Break M into stripes
Each stripe contains only destination nodes in the corresponding block of rnew
Some additional overhead per stripe
Cost per iteration: M (plus the per-stripe overhead) is read once, rold is read once per stripe, rnew is written once
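A sketch of the block-stripe idea: M is partitioned once into stripes by destination block, and each block of rnew then reads only its own stripe. The helper names, the six-node graph (destinations 0, 1, 3, 5 / 0, 5 / 3, 4) and β = 4/5 are assumptions of this sketch.

```python
from fractions import Fraction as F

def make_stripes(links, n, block_size):
    """Split (source, degree, destinations) records into one stripe per block of r_new."""
    num_blocks = (n + block_size - 1) // block_size
    stripes = [[] for _ in range(num_blocks)]
    for src, degree, dests in links:
        by_block = {}
        for d in dests:
            by_block.setdefault(d // block_size, []).append(d)
        for b, ds in by_block.items():
            # degree stays the node's total out-degree, not the count within the stripe.
            stripes[b].append((src, degree, ds))
    return stripes

def stripe_update(stripes, r_old, beta, block_size):
    """Compute r_new block by block, reading only the matching stripe of M."""
    n = len(r_old)
    r_new = []
    for b, stripe in enumerate(stripes):
        start = b * block_size
        block = [(1 - beta) / n] * min(block_size, n - start)
        for src, degree, dests in stripe:
            for d in dests:
                block[d - start] += beta * r_old[src] / degree
        r_new.extend(block)
    return r_new

LINKS = [(0, 4, [0, 1, 3, 5]),
         (1, 2, [0, 5]),
         (2, 2, [3, 4])]

stripes = make_stripes(LINKS, n=6, block_size=2)
r_new = stripe_update(stripes, [F(1, 6)] * 6, F(4, 5), block_size=2)
print(stripes)
print(r_new)
```

A source node appears in every stripe that holds one of its destinations, which is where the extra per-stripe overhead comes from.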
Measures generic popularity of a page
Uses a single measure of importance
Susceptible to link spam: artificial link structures created in order to boost page rank