CS246: Mining Massive Datasets Jure Leskovec, Stanford University
http://cs246.stanford.edu
High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
Graph data: PageRank, SimRank; Community Detection; Spam Detection
Infinite data: Filtering data streams; Web advertising; Queries on streams
Machine learning: SVM; Decision Trees; Perceptron, kNN
Apps: Recommender systems; Association Rules; Duplicate document detection
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2
Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]
Citation networks and Maps of science
[Börner et al., 2012]
[Figure: routers connecting domain1, domain2, and domain3]
Seven Bridges of Königsberg
[Euler, 1735]
Return to the starting point by traveling each link of the graph once and only once.
Web as a directed graph:
Nodes: webpages
Edges: hyperlinks
[Figure: example pages linked to each other: "I teach a class on Networks.", "CS224W: Classes are in the Gates building", "Computer Science Department at Stanford", "Stanford University"]
How to organize the Web?
First try: Human curated
Web directories
Second try: Web Search
Find relevant docs in a small and trusted set
But the Web is huge and full of untrusted documents, random things, web spam, etc.
2 challenges of web search:
(1) Web contains many sources of information
Who to “trust”?
(2) What is the “best” answer to the query “newspaper”?
Pages that actually know about newspapers might all be pointing to many newspapers
All web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
There is large diversity in the web-graph node connectivity.
Let’s rank the pages by the link structure!
We will cover the following Link Analysis
approaches for computing importances
Idea: Links as votes
Think of in-links as votes:
Are all in-links equal?
Each link’s vote is proportional to the
importance of its source page
If page j with importance rj has n out-links,
each link gets rj / n votes
Page j’s own importance is the sum of the
votes on its in-links
[Figure: page j has in-links from page i (3 out-links, vote $r_i/3$) and page k (4 out-links, vote $r_k/4$), so $r_j = r_i/3 + r_k/4$; page j has 3 out-links, each carrying $r_j/3$]
A “vote” from an important
page is worth more
A page is important if it is
pointed to by other important pages
Define a “rank” $r_j$ for page $j$:
$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
$d_i$ … out-degree of node $i$
[Figure: “The web in 1839”, pages y, a, m; y links to y and a (y/2 each), a links to y and m (a/2 each), m links to a]
“Flow” equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
3 equations, 3 unknowns, no constants
Additional constraint forces uniqueness: $r_y + r_a + r_m = 1$
Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$
Gaussian elimination method works for
small examples, but we need a better method for large web-size graphs
We need a new formulation!
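For a graph this small, the flow equations can be solved directly. The following is a minimal sketch under stated assumptions: it uses the three-page y/a/m example, drops the third flow equation (redundant, because the three equations sum to an identity), and replaces it with the normalization $r_y + r_a + r_m = 1$. The helper name `solve` is hypothetical, introduced only for this illustration.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    # Build the augmented matrix [A | b].
    rows = [[Fraction(v) for v in A[i]] + [Fraction(b[i])] for i in range(n)]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row so the pivot becomes 1.
        p = rows[col][col]
        rows[col] = [v / p for v in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [v - f * w for v, w in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]

half = Fraction(1, 2)
# Unknowns ordered (r_y, r_a, r_m).
# r_y = r_y/2 + r_a/2   ->   r_y/2 - r_a/2       = 0
# r_a = r_y/2 + r_m     ->  -r_y/2 + r_a - r_m   = 0
# normalization         ->   r_y + r_a + r_m     = 1
A = [[half, -half, 0],
     [-half, 1, -1],
     [1, 1, 1]]
b = [0, 0, 1]
r_y, r_a, r_m = solve(A, b)
print(r_y, r_a, r_m)  # 2/5 2/5 1/5
```

Elimination takes on the order of $n^3$ operations for $n$ pages, which is why this direct approach does not scale to web-size graphs.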
Stochastic adjacency matrix $M$
Let page $i$ have $d_i$ out-links
If $i \to j$, then $M_{ji} = 1/d_i$, else $M_{ji} = 0$
Rank vector $r$: vector with an entry per page
$r_i$ is the importance score of page $i$
Remember the flow equation: $r_j = \sum_{i \to j} r_i / d_i$
Flow equation in the matrix form: $M \cdot r = r$
[Figure: row $j$ of $M \cdot r = r$; page $i$ has 3 out-links, so $M_{ji} = 1/3$ and $i$ contributes $r_i/3$ to $r_j$]
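The flow equations can be written $r = M \cdot r$
So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$ with corresponding eigenvalue 1
The largest eigenvalue of $M$ is 1, since $M$ is column stochastic: $r$ has unit length and each column of $M$ sums to one, so $\|M r\|_1 \le 1$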
We can now efficiently solve for r!
The method is called Power iteration
NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if: $A x = \lambda x$
For the y, a, m example, $r = M \cdot r$ reads:
$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$
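The eigenvector claim can be checked by hand on the three-page example. This sketch assumes the y/a/m matrix with rows and columns ordered (y, a, m) and the candidate rank vector (2/5, 2/5, 1/5); `matvec` is a hypothetical helper name.

```python
from fractions import Fraction as F

# Column-stochastic matrix: M[j][i] = 1/d_i if page i links to page j, else 0.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]

def matvec(M, r):
    """Multiply matrix M by column vector r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

# Every column sums to 1 (column stochastic).
col_sums = [sum(M[j][i] for j in range(3)) for i in range(3)]
# Candidate rank vector (r_y, r_a, r_m).
r = [F(2, 5), F(2, 5), F(1, 5)]

print(col_sums == [1, 1, 1])  # True
print(matvec(M, r) == r)      # True: r is an eigenvector with eigenvalue 1
```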
Given a web graph with n nodes, where the
nodes are pages and edges are hyperlinks
Power iteration: a simple iterative scheme
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
$d_i$ … out-degree of node $i$
Power Iteration:
Set $r_j = 1/N$ for every page $j$ ($N$ … number of pages)
Compute $r'_j = \sum_{i \to j} r_i / d_i$
Set $r = r'$ and repeat until the values stop changing
Example:
      y     a     m
y    1/2   1/2    0
a    1/2    0     1
m     0    1/2    0
$r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$
Iteration 0, 1, 2, …
r_y = 1/3, 1/3, 5/12, 9/24, …, 6/15
r_a = 1/3, 3/6, 1/3, 11/24, …, 6/15
r_m = 1/3, 1/6, 3/12, 1/6, …, 3/15
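The iteration can be sketched in a few lines. This is an illustrative version, not the course's reference code: it assumes a dense list-of-lists matrix, exact fractions so the first iterates match the table, and an L1 stopping threshold `eps` chosen arbitrarily. The function name `power_iteration` and its parameters are hypothetical.

```python
from fractions import Fraction as F

def power_iteration(M, eps=F(1, 10**6), max_iter=1000):
    """Repeat r <- M r from the uniform vector until the L1 change is below eps."""
    n = len(M)
    r = [F(1, n)] * n
    history = [r]
    for _ in range(max_iter):
        r_new = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        history.append(r_new)
        diff = sum(abs(a - b) for a, b in zip(r_new, r))
        r = r_new
        if diff < eps:
            break
    return r, history

# y/a/m example, rows and columns ordered (y, a, m).
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]

r, history = power_iteration(M)
print(history[1])              # iteration 1: 1/3, 1/2, 1/6
print(history[2])              # iteration 2: 5/12, 1/3, 1/4
print([float(x) for x in r])   # close to 0.4, 0.4, 0.2
```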
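Power iteration:
A method for finding the dominant eigenvector (the vector corresponding to the largest eigenvalue)
$r^{(1)} = M \cdot r^{(0)}$
$r^{(2)} = M \cdot r^{(1)} = M (M r^{(0)}) = M^2 \cdot r^{(0)}$
$r^{(3)} = M \cdot r^{(2)} = M (M^2 r^{(0)}) = M^3 \cdot r^{(0)}$
Claim:
Sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \dots, M^k \cdot r^{(0)}, \dots$ approaches the dominant eigenvector of $M$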
Proof:
Assume $M$ has $n$ linearly independent eigenvectors $x_1, x_2, \dots, x_n$ with corresponding eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$, where $\lambda_1 > \lambda_2 > \dots > \lambda_n$
Vectors $x_1, \dots, x_n$ form a basis, so we can write: $r^{(0)} = c_1 x_1 + c_2 x_2 + \dots + c_n x_n$
$M r^{(0)} = M (c_1 x_1 + c_2 x_2 + \dots + c_n x_n)$
$= c_1 (M x_1) + c_2 (M x_2) + \dots + c_n (M x_n)$
$= c_1 (\lambda_1 x_1) + c_2 (\lambda_2 x_2) + \dots + c_n (\lambda_n x_n)$
Repeated multiplication on both sides produces:
$M^k r^{(0)} = c_1 (\lambda_1^k x_1) + c_2 (\lambda_2^k x_2) + \dots + c_n (\lambda_n^k x_n)$
Proof (continued):
$$M^k r^{(0)} = \lambda_1^k \left[ c_1 x_1 + c_2 \left(\frac{\lambda_2}{\lambda_1}\right)^k x_2 + \dots + c_n \left(\frac{\lambda_n}{\lambda_1}\right)^k x_n \right]$$
Since $\lambda_1 > \lambda_2$, the fractions $\frac{\lambda_2}{\lambda_1}, \frac{\lambda_3}{\lambda_1}, \dots < 1$
and so $\lim_{k \to \infty} \left(\frac{\lambda_i}{\lambda_1}\right)^k = 0$ (for all $i = 2 \dots n$).
Thus: $M^k r^{(0)} \approx c_1 (\lambda_1^k x_1)$
Imagine a random web surfer:
At any time $t$, the surfer is on some page $i$
At time $t+1$, the surfer follows an out-link from $i$ uniformly at random
Let:
$p(t)$ … vector whose $i$th coordinate is the probability that the surfer is at page $i$ at time $t$
[Figure: page j with in-links from pages i1, i2, i3; $r_j = \sum_{i \to j} r_i / d_{out}(i)$]
Where is the surfer at time t+1?
$p(t+1) = M \cdot p(t)$
Suppose the random walk reaches a state $p(t+1) = M \cdot p(t) = p(t)$
then $p(t)$ is a stationary distribution of the random walk
Our original rank vector $r$ satisfies $r = M \cdot r$
So $r$ is a stationary distribution for the random walk
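The random-surfer view can be simulated directly. The sketch below assumes the y/a/m example graph and a fixed seed; `simulate` and the `out_links` dictionary are hypothetical names. Visit frequencies are estimates, so they only approximate the stationary distribution (2/5, 2/5, 1/5).

```python
import random
from collections import Counter

# Out-links of the y/a/m example graph.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def simulate(out_links, steps, seed=0):
    """Follow a uniformly random out-link at each step; return visit frequencies."""
    rng = random.Random(seed)
    page = rng.choice(sorted(out_links))   # arbitrary starting page
    visits = Counter()
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: visits[p] / steps for p in out_links}

freq = simulate(out_links, steps=200_000)
print(freq)   # roughly y: 0.4, a: 0.4, m: 0.2
```

This works here because the example graph has no dead ends or spider traps; on graphs that do, the surfer gets stuck and the frequencies no longer reflect importance.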
$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$ or equivalently $r = M r$
Announcement: We graded HW0 and HW1!
Example:
[Figure: two pages, a and b, linking to each other]
Iteration 0, 1, 2, …
r_a = 1, 0, 1, 0, …
r_b = 0, 1, 0, 1, …
Example:
[Figure: two pages, a linking to b; b has no out-links]
Iteration 0, 1, 2, …
r_a = 1, 0, 0, 0, …
r_b = 0, 1, 0, 0, …
2 problems:
(1) Some pages are dead ends (have no out-links)
Such pages cause importance to “leak out”
(2) Spider traps (all out-links are within the group)
Power Iteration:
Set $r_j = 1/N$
$r_j = \sum_{i \to j} r_i / d_i$
And iterate
Example (m is a spider trap):
      y     a     m
y    1/2   1/2    0
a    1/2    0     0
m     0    1/2    1
$r_y = r_y/2 + r_a/2$, $r_a = r_y/2$, $r_m = r_a/2 + r_m$
Iteration 0, 1, 2, …
r_y = 1/3, 2/6, 3/12, 5/24, …, 0
r_a = 1/3, 1/6, 2/12, 3/24, …, 0
r_m = 1/3, 3/6, 7/12, 16/24, …, 1
The Google solution for spider traps: At each time step, the random surfer has two options
With probability $\beta$, follow a link at random
With probability $1-\beta$, jump to a random page
Surfer will teleport out of spider trap within a few time steps
Power Iteration:
Set $r_j = 1/N$
$r_j = \sum_{i \to j} r_i / d_i$
And iterate
Example (m is a dead end):
      y     a     m
y    1/2   1/2    0
a    1/2    0     0
m     0    1/2    0
$r_y = r_y/2 + r_a/2$, $r_a = r_y/2$, $r_m = r_a/2$
Iteration 0, 1, 2, …
r_y = 1/3, 2/6, 3/12, 5/24, …, 0
r_a = 1/3, 1/6, 2/12, 3/24, …, 0
r_m = 1/3, 1/6, 1/12, 2/24, …, 0
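Both failure modes are easy to reproduce numerically. The sketch below assumes the two three-page matrices from the spider-trap and dead-end examples, plain floats, and 100 iterations (an arbitrary choice); `iterate` is a hypothetical helper name.

```python
def iterate(M, steps):
    """Apply r <- M r for a fixed number of steps, starting from the uniform vector."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(steps):
        r = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
    return r

# Rows and columns ordered (y, a, m).
spider_trap = [[0.5, 0.5, 0.0],
               [0.5, 0.0, 0.0],
               [0.0, 0.5, 1.0]]   # m links only to itself
dead_end = [[0.5, 0.5, 0.0],
            [0.5, 0.0, 0.0],
            [0.0, 0.5, 0.0]]      # m has no out-links (zero column)

trap_r = iterate(spider_trap, 100)
dead_r = iterate(dead_end, 100)
print(trap_r)        # m absorbs essentially all of the importance
print(sum(dead_r))   # total importance has leaked out, close to 0
```

The spider-trap matrix is still column stochastic, so the total stays at 1 but ends up concentrated on m. The dead-end matrix has a zero column, so the total shrinks at every step.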
Teleports: Follow random teleport links with probability 1.0 from dead-ends
Original matrix (m is a dead end):
      y     a     m
y    1/2   1/2    0
a    1/2    0     0
m     0    1/2    0
Matrix with teleports from the dead end m:
      y     a     m
y    1/2   1/2   1/3
a    1/2    0    1/3
m     0    1/2   1/3
Markov chains
Set of states $X$
Transition matrix $P$ where $P_{ij} = P(X_t = i \mid X_{t-1} = j)$
$\pi$ specifying the stationary probability of being at each state $x \in X$
Goal is to find $\pi$ such that $\pi = P \pi$
Theory of Markov chains
Fact: For any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic.
Stochastic: Every column sums to 1
A possible solution: Add green links
[Figure: graph y, a, m with green links added from the dead end m to every page]
      y     a     m
y    1/2   1/2   1/3
a    1/2    0    1/3
m     0    1/2   1/3
$r_y = r_y/2 + r_a/2 + r_m/3$
$r_a = r_y/2 + r_m/3$
$r_m = r_a/2 + r_m/3$
Aperiodic: A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k.
A possible solution: Add green links
Irreducible: From any state, there is a non-zero probability of going to any other state
A possible solution: Add green links
Google’s solution that does it all:
At each step, random surfer has two options:
With probability $\beta$, follow a link at random
With probability $1-\beta$, jump to a random page
PageRank equation [Brin-Page, 98]
$$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta) \frac{1}{n}$$
$d_i$ … out-degree of node $i$
This formulation assumes that $M$ has no dead ends. We can either preprocess matrix $M$ to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends.
The Google Matrix A:
$$A = \beta M + (1-\beta) \frac{1}{n} e \cdot e^T$$
e … vector of all 1s
A is stochastic, aperiodic and irreducible, so the power iteration $r^{(t+1)} = A \cdot r^{(t)}$ converges
What is $\beta$? The example here uses $\beta = 0.8$
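Example ($\beta = 0.8$, pages y, a, m where m is a spider trap):
$A = 0.8 \cdot M + 0.2 \cdot \frac{1}{n} \mathbf{1} \cdot \mathbf{1}^T$
M:
      y     a     m
y    1/2   1/2    0
a    1/2    0     0
m     0    1/2    1
1/n·1·1^T: every entry is 1/3
A:
      y      a      m
y    7/15   7/15   1/15
a    7/15   1/15   1/15
m    1/15   7/15   13/15
Edge weights in the figure: the self-loop at m has 0.8 + 0.2·⅓ = 13/15; each original link out of y or a has 0.8·½ + 0.2·⅓ = 7/15
Iteration 0, 1, 2, …
r_y = 1/3, 0.33, 0.28, 0.26, …, 7/33
r_a = 1/3, 0.20, 0.20, 0.18, …, 5/33
r_m = 1/3, 0.47, 0.52, 0.56, …, 21/33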
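The Google matrix for this example can be built and iterated exactly. This sketch assumes $\beta = 0.8$ and the spider-trap matrix, with rows and columns ordered (y, a, m); 60 iterations is an arbitrary choice that is more than enough here. The helper name `step` is hypothetical.

```python
from fractions import Fraction as F

beta = F(4, 5)
# Spider-trap example: m links only to itself.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(0)],
     [F(0),    F(1, 2), F(1)]]
n = len(M)

# A = beta * M + (1 - beta) * (1/n) * e e^T, entry by entry.
A = [[beta * M[j][i] + (1 - beta) / n for i in range(n)] for j in range(n)]

def step(A, r):
    """One power-iteration step r <- A r."""
    return [sum(a * x for a, x in zip(row, r)) for row in A]

r = [F(1, n)] * n
iterates = [r]
for _ in range(60):
    r = step(A, r)
    iterates.append(r)

print(A[0])                   # 7/15, 7/15, 1/15
print(iterates[1])            # 1/3, 1/5, 7/15
print(iterates[2])            # 7/25, 1/5, 13/25
print([float(x) for x in r])  # close to 7/33, 5/33, 21/33
```

With teleports, m no longer absorbs all the importance: it ends up with 21/33 instead of 1.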
Key step is matrix-vector multiplication
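Easy if we have enough main memory to hold A, rold, rnew
Say N = 1 billion pages
We need 4 bytes for each entry (say)
2 billion entries for the two vectors, approx 8GB
Matrix A has N² entries, i.e. 10^18, which is far too many to store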
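The memory estimate is simple arithmetic. The sketch below assumes 4 bytes per entry and decimal gigabytes (10^9 bytes); the variable names are illustrative only.

```python
N = 10**9                # number of pages
BYTES_PER_ENTRY = 4      # assumed size of one stored value

# Two rank vectors (r_old and r_new) of N entries each.
vector_bytes = 2 * N * BYTES_PER_ENTRY
# A dense N x N matrix A has N^2 entries.
dense_entries = N * N

print(vector_bytes / 10**9, "GB")   # 8.0 GB
print(dense_entries)                # 1000000000000000000
```

The vectors are manageable, but a dense matrix with 10^18 entries is not, which motivates working with the sparse link matrix instead.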
$A = \beta \cdot M + (1-\beta) \, [1/N]_{N \times N}$, e.g. $A = 0.8 \cdot M + 0.2 \cdot [1/3]_{3 \times 3}$ for the three-page example
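Suppose there are N pages
Consider page j, with $d_j$ out-links
We have $M_{ij} = 1/d_j$ when j→i and $M_{ij} = 0$ otherwise
The random teleport is equivalent to:
Adding a teleport link from j to every other page and setting transition probability to $(1-\beta)/N$
Reducing the probability of following each out-link from $1/d_j$ to $\beta/d_j$
Equivalent: Tax each page a fraction $(1-\beta)$ of its score and redistribute evenly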
$r = A \cdot r$, where $A_{ij} = \beta M_{ij} + \frac{1-\beta}{N}$
$$r_i = \sum_{j=1}^{N} A_{ij} \cdot r_j$$
$$r_i = \sum_{j=1}^{N} \left[ \beta M_{ij} + \frac{1-\beta}{N} \right] \cdot r_j$$
$$= \sum_{j=1}^{N} \beta M_{ij} \cdot r_j + \frac{1-\beta}{N} \sum_{j=1}^{N} r_j$$
$$= \sum_{j=1}^{N} \beta M_{ij} \cdot r_j + \frac{1-\beta}{N} \quad \text{since } \sum_{j} r_j = 1$$
So we get: $r = \beta M \cdot r + \left[ \frac{1-\beta}{N} \right]_N$
[x]N … a vector of length N with all entries x
Note: Here we assumed M has no dead-ends.
We just rearranged the PageRank equation
$$r = \beta M \cdot r + \left[ \frac{1-\beta}{N} \right]_N$$
M is a sparse matrix! (with no dead-ends)
So in each iteration, we need to:
Compute $r^{new} = \beta M \cdot r^{old}$
Add a constant value $(1-\beta)/N$ to each entry in $r^{new}$
Note: if M contains dead-ends then $\sum_i r_i^{new} < 1$ and we also have to renormalize $r^{new}$ so that it sums to 1
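The rearranged equation lets one iteration touch only the nonzero entries of M. The sketch below compares one such sparse step against the dense product with A, assuming a small graph with no dead ends (pages numbered 0 = y, 1 = a, 2 = m) and exact fractions; `sparse_step`, `dense_step`, and `out_links` are hypothetical names.

```python
from fractions import Fraction as F

def dense_step(A, r):
    """r_new = A r using the full N x N matrix."""
    n = len(r)
    return [sum(A[j][i] * r[i] for i in range(n)) for j in range(n)]

def sparse_step(out_links, r, beta):
    """r_new = beta * M r + (1 - beta)/N using only the links (nonzeros of M)."""
    n = len(r)
    r_new = [(1 - beta) / n] * n          # the constant teleport term
    for i, dests in out_links.items():
        share = beta * r[i] / len(dests)  # beta * r_i / d_i
        for j in dests:
            r_new[j] += share
    return r_new

beta = F(4, 5)
out_links = {0: [0, 1], 1: [0, 2], 2: [2]}   # no dead ends
n = 3

# Dense Google matrix built from the same links, for comparison only.
A = [[(1 - beta) / n for _ in range(n)] for _ in range(n)]
for i, dests in out_links.items():
    for j in dests:
        A[j][i] += beta / len(dests)

r = [F(1, n)] * n
print(sparse_step(out_links, r, beta) == dense_step(A, r))  # True
```

The sparse step does work proportional to the number of links rather than to N².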
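Input: Graph $G$ and parameter $\beta$
Output: PageRank vector $r$
Set: $r_j^{(0)} = \frac{1}{N}$, $t = 1$
do:
  $\forall j$: $r'^{(t)}_j = \sum_{i \to j} \beta \frac{r_i^{(t-1)}}{d_i}$
      $r'^{(t)}_j = 0$ if in-deg. of $j$ is 0
  Now re-insert the leaked PageRank:
  $\forall j$: $r_j^{(t)} = r'^{(t)}_j + \frac{1-S}{N}$, where: $S = \sum_j r'^{(t)}_j$
repeat with $t = t + 1$ while $\sum_j \left| r_j^{(t)} - r_j^{(t-1)} \right| > \varepsilon$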
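The complete algorithm translates almost line by line. This is a minimal sketch under stated assumptions: the graph is a dictionary from page index to its list of destinations, dead ends are pages with an empty list, and the defaults for `beta`, `eps`, and `max_iter` are arbitrary. The function name `pagerank` is hypothetical.

```python
def pagerank(out_links, n, beta=0.8, eps=1e-10, max_iter=1000):
    """PageRank with re-insertion of leaked mass, so dead ends are handled."""
    r = [1.0 / n] * n
    for _ in range(max_iter):
        # r'_j = sum over links i -> j of beta * r_i / d_i
        r_prime = [0.0] * n
        for i, dests in out_links.items():
            for j in dests:
                r_prime[j] += beta * r[i] / len(dests)
        # Re-insert the leaked PageRank evenly: r_j = r'_j + (1 - S) / N
        S = sum(r_prime)
        r_new = [x + (1.0 - S) / n for x in r_prime]
        diff = sum(abs(a - b) for a, b in zip(r_new, r))
        r = r_new
        if diff <= eps:
            break
    return r

# Pages 0 = y, 1 = a, 2 = m; here m is a dead end (no out-links).
with_dead_end = {0: [0, 1], 1: [0, 2], 2: []}
r = pagerank(with_dead_end, n=3)
print(r, sum(r))   # the scores still sum to 1
```

When there are no dead ends, S equals β and the added term reduces to (1 − β)/N, so the loop matches the rearranged PageRank equation.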
Encode sparse matrix using only nonzero entries
source node   degree   destination nodes
0             3        1, 5, 7
1             5        17, 64, 113, 117, 245
2             2        13, 23
Assume enough RAM to fit rnew into memory
Then 1 step of power-iteration is:
Initialize all entries of rnew to (1-β)/N
For each page p (of out-degree n):
  Read into memory: p, n, dest1,…,destn, rold(p)
  for j = 1…n: rnew(destj) += β rold(p) / n
src   degree   destination
0     3        1, 5, 6
1     4        17, 64, 113, 117
2     2        13, 23
[Figure: arrays rnew and rold, entries 0…6]
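One pass over the encoded matrix can be sketched as follows. The text-line layout `src degree dest1 dest2 …` is an assumption made for this illustration (the slides only show the three columns), an in-memory `io.StringIO` stands in for the file on disk, and `one_step` is a hypothetical name.

```python
import io

def one_step(lines, r_old, beta):
    """One power-iteration step, streaming records 'src degree dest1 dest2 ...'."""
    n = len(r_old)
    r_new = [(1.0 - beta) / n] * n       # initialize to (1 - beta)/N
    for line in lines:                   # one sequential scan of M
        fields = line.split()
        if not fields:
            continue
        src, degree = int(fields[0]), int(fields[1])
        share = beta * r_old[src] / degree
        for dest in fields[2:]:
            r_new[int(dest)] += share
    return r_new

# Three-page example with no dead ends: 0 -> {0, 1}, 1 -> {0, 2}, 2 -> {2}.
encoded = io.StringIO("0 2 0 1\n1 2 0 2\n2 1 2\n")
r_new = one_step(encoded, [1 / 3, 1 / 3, 1 / 3], beta=0.8)
print(r_new)   # roughly 0.333, 0.2, 0.467
```

Only r_new needs random access; the matrix records and the matching r_old entries are consumed in order.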
Assume enough RAM to fit rnew into memory
In each iteration, we have to:
Read rold and M
Write rnew back to disk
Question: What if we could not even fit rnew in memory?
src   degree   destination
0     4        0, 1, 3, 5
1     2        0, 5
2     2        3, 4
[Figure: arrays rnew and rold, entries 0…5; rnew is processed one block at a time]
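Break rnew into k blocks that fit in memory and scan M and rold once for each block
Similar to nested-loop join in databases
Cost: k scans of M and rold
Can we do better?
Hint: M is much bigger than r, so we must avoid reading it k times per iteration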
Break M into stripes
Each stripe contains only destination nodes in the corresponding block of rnew
[Figure: arrays rnew and rold, entries 0…5; one stripe of M per block of rnew]
src   degree   destination
Stripe for rnew[0..1]:
0     4        0, 1
1     3        0
2     2        1
Stripe for rnew[2..3]:
0     4        3
2     2        3
Stripe for rnew[4..5]:
0     4        5
1     3        5
2     2        4
Some additional overhead per stripe
Cost per iteration: M is read once (plus the stripe overhead), rold is read k times, and rnew is written once
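The block-stripe idea can be sketched with in-memory lists standing in for disk files. The sketch assumes records of the form (src, degree, destinations), equal-sized blocks of rnew, and the six-node example graph from the block-based update table; `make_stripes` and `striped_step` are hypothetical names.

```python
def make_stripes(records, n, k):
    """Split (src, degree, dests) records into k stripes, one per block of r_new."""
    block = (n + k - 1) // k                 # entries of r_new per block
    stripes = [[] for _ in range(k)]
    for src, degree, dests in records:
        by_block = {}
        for d in dests:
            by_block.setdefault(d // block, []).append(d)
        for b, ds in by_block.items():
            # The degree stays the full out-degree of src, not the stripe-local count.
            stripes[b].append((src, degree, ds))
    return stripes, block

def striped_step(stripes, block, r_old, beta):
    """One iteration that keeps only one block of r_new in memory at a time."""
    n = len(r_old)
    r_new = []
    for b, stripe in enumerate(stripes):
        lo, hi = b * block, min((b + 1) * block, n)
        chunk = [(1.0 - beta) / n] * (hi - lo)
        for src, degree, dests in stripe:    # read this stripe once
            share = beta * r_old[src] / degree
            for d in dests:
                chunk[d - lo] += share
        r_new.extend(chunk)                  # stands in for writing the block to disk
    return r_new

records = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
stripes, block = make_stripes(records, n=6, k=3)
r_new = striped_step(stripes, block, [1 / 6] * 6, beta=0.8)
print(stripes[0])   # [(0, 4, [0, 1]), (1, 2, [0])]
```

Each stripe is read once per iteration, so M as a whole is scanned once, while r_old is consulted again for every block.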
Some problems with PageRank:
Measures generic popularity of a page
Uses a single measure of importance
Susceptible to Link spam: artificial link structures created in order to boost page rank