

SLIDE 1

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

SLIDE 2

High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction

Graph data: PageRank, SimRank, Community Detection, Spam Detection

Infinite data: Filtering data streams, Web advertising, Queries on streams

Machine learning: SVM, Decision Trees, Perceptron, kNN

Apps: Recommender systems, Association Rules, Duplicate document detection
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2

SLIDE 3

Facebook social graph: 4 degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

SLIDE 4

Connections between political blogs: polarization of the network [Adamic-Glance, 2005]

SLIDE 5

Citation networks and maps of science [Börner et al., 2012]

SLIDE 6

The Internet (figure: domain1, domain2, domain3 connected through routers)

SLIDE 7

Seven Bridges of Königsberg [Euler, 1735]: return to the starting point by traveling each link of the graph once and only once.

SLIDE 8

 Web as a directed graph:
  • Nodes: Webpages
  • Edges: Hyperlinks

Example page: "I teach a class on Networks. CS224W: Classes are in the Gates building. Computer Science Department at Stanford. Stanford University."

SLIDE 9

 Web as a directed graph:
  • Nodes: Webpages
  • Edges: Hyperlinks

(Same example text as Slide 8, now with the hyperlink edges drawn.)

SLIDE 10

SLIDE 11

 How to organize the Web?
 First try: Human-curated web directories
  • Yahoo, DMOZ, LookSmart
 Second try: Web Search
  • Information Retrieval investigates: finding relevant docs in a small and trusted set
  • Newspaper articles, patents, etc.
  • But: the Web is huge, full of untrusted documents, random things, web spam, etc.

SLIDE 12

2 challenges of web search:
 (1) The Web contains many sources of information. Whom to "trust"?
  • Trick: Trustworthy pages may point to each other!
 (2) What is the "best" answer to the query "newspaper"?
  • No single right answer
  • Trick: Pages that actually know about newspapers might all be pointing to many newspapers

SLIDE 13

 All web pages are not equally "important": www.joe-schmoe.com vs. www.stanford.edu
 There is large diversity in web-graph node connectivity. Let's rank the pages by the link structure!

SLIDE 14

 We will cover the following Link Analysis approaches for computing importances of nodes in a graph:
  • PageRank
  • Hubs and Authorities (HITS)
  • Topic-Specific (Personalized) PageRank
  • Web Spam Detection Algorithms

SLIDE 15

 Idea: Links as votes
  • A page is more important if it has more links
  • In-coming links? Out-going links?
 Think of in-links as votes:
  • www.stanford.edu has 23,400 in-links
  • www.joe-schmoe.com has 1 in-link
 Are all in-links equal?
  • Links from important pages count more
  • Recursive question!

SLIDE 16

SLIDE 17

 Each link's vote is proportional to the importance of its source page
 If page j with importance rj has n out-links, each link gets rj / n votes
 Page j's own importance is the sum of the votes on its in-links

(Figure: node j has 3 out-links, each carrying rj/3; it receives ri/3 from i and rk/4 from k, so rj = ri/3 + rk/4.)

SLIDE 18

 A "vote" from an important page is worth more
 A page is important if it is pointed to by other important pages
 Define a "rank" rj for page j:

   rj = Σi→j ri / di

   di … out-degree of node i

The web in 1839. "Flow" equations:

   ry = ry /2 + ra /2
   ra = ry /2 + rm
   rm = ra /2

(Figure: three nodes y, a, m; y has a self-loop and links to a, a links to y and m, m links to a.)

SLIDE 19

 3 equations, 3 unknowns, no constants
  • No unique solution
  • All solutions equivalent modulo the scale factor
 Additional constraint forces uniqueness:
  • ry + ra + rm = 1
  • Solution: ry = 2/5, ra = 2/5, rm = 1/5
 Gaussian elimination works for small examples, but we need a better method for large web-size graphs
 We need a new formulation!

Flow equations:

   ry = ry /2 + ra /2
   ra = ry /2 + rm
   rm = ra /2
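The uniqueness argument above can be checked mechanically. Below is a minimal sketch in Python using exact fractions; the substitution steps in the comments are worked out from the flow equations on this slide, not taken from the deck:

```python
from fractions import Fraction as F

# Flow equations of the y/a/m example:
#   ry = ry/2 + ra/2,  ra = ry/2 + rm,  rm = ra/2
# The first equation gives ry = ra; substituting rm = ra/2 into the
# normalization ry + ra + rm = 1 yields (5/2)*ra = 1, so ra = 2/5.
ra = F(2, 5)
ry = ra
rm = ra / 2

# Verify all three flow equations and the normalization exactly.
assert ry == ry / 2 + ra / 2
assert ra == ry / 2 + rm
assert rm == ra / 2
assert ry + ra + rm == 1
```

With exact rationals there is no rounding, so the fixed point (2/5, 2/5, 1/5) satisfies every equation identically.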

SLIDE 20

 Stochastic adjacency matrix M
  • Let page i have di out-links
  • If i → j, then Mji = 1/di, else Mji = 0
  • M is a column stochastic matrix: columns sum to 1
 Rank vector r: a vector with an entry per page
  • rj is the importance score of page j
  • Σj rj = 1
 The flow equations can be written:

   r = M · r
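The construction of M can be sketched for the running y/a/m example. The edge list matches the example graph; the dict-of-dicts layout is an illustrative choice, not from the slides:

```python
from collections import Counter

# Edge list (src -> dst) of the running example: y links to itself and a,
# a links to y and m, m links to a.
edges = [("y", "y"), ("y", "a"), ("a", "y"), ("a", "m"), ("m", "a")]
nodes = ["y", "a", "m"]

out_deg = Counter(src for src, _ in edges)

# M[j][i] = 1/d_i if i -> j, else 0  (column i holds the out-links of i)
M = {j: {i: 0.0 for i in nodes} for j in nodes}
for i, j in edges:
    M[j][i] = 1.0 / out_deg[i]

# Column stochastic: every column sums to 1.
for i in nodes:
    assert abs(sum(M[j][i] for j in nodes) - 1.0) < 1e-12
```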

SLIDE 21

 Remember the flow equation: rj = Σi→j ri / di
 Flow equation in the matrix form: M · r = r
  • Suppose page i links to 3 pages, including j: then Mji = 1/3, and row j of the product M · r picks up the term ri/3, so page j receives ri/3 of i's importance.

SLIDE 22

 The flow equations can be written r = M ∙ r
 So the rank vector r is an eigenvector of the stochastic web matrix M
  • In fact, it is M's first or principal eigenvector, with corresponding eigenvalue 1
  • The largest eigenvalue of M is 1, since M is column stochastic
  • We know r is unit length and each column of M sums to one, so Mr ≤ 1
 We can now efficiently solve for r! The method is called Power Iteration

NOTE: x is an eigenvector with the corresponding eigenvalue λ if: Ax = λx

SLIDE 23

r = M ∙ r:

   [ry]   [½  ½  0] [ry]
   [ra] = [½  0  1] [ra]
   [rm]   [0  ½  0] [rm]

   ry = ry /2 + ra /2
   ra = ry /2 + rm
   rm = ra /2

SLIDE 24

 Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
 Power iteration: a simple iterative scheme
  • Initialize: r(0) = [1/N, …, 1/N]T
  • Iterate: r(t+1) = M ∙ r(t), i.e. rj(t+1) = Σi→j ri(t) / di
  • Stop when |r(t+1) – r(t)|1 < ε
  • |x|1 = Σ1≤i≤N |xi| is the L1 norm

di … out-degree of node i

SLIDE 25

 Power Iteration:
  • Set rj = 1/N
  • 1: r′j = Σi→j ri / di
  • 2: r = r′
  • Goto 1
 Example (iteration 0, 1, 2, …):

   ry   1/3  1/3  5/12   9/24  …  6/15
   ra = 1/3  3/6  1/3   11/24  …  6/15
   rm   1/3  1/6  3/12   1/6   …  3/15

M:
      y  a  m
   y  ½  ½
   a  ½     1
   m     ½

   ry = ry /2 + ra /2
   ra = ry /2 + rm
   rm = ra /2
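The scheme above can be sketched directly in Python. The matrix is the 3-node example from the slides; the iteration count and tolerances are illustrative choices:

```python
nodes = ["y", "a", "m"]
# Column-stochastic M of the example: M[j][i] = 1/d_i for each link i -> j
M = {"y": {"y": 0.5, "a": 0.5, "m": 0.0},
     "a": {"y": 0.5, "a": 0.0, "m": 1.0},
     "m": {"y": 0.0, "a": 0.5, "m": 0.0}}

r = {v: 1.0 / 3 for v in nodes}            # r(0) = [1/N, ..., 1/N]
for _ in range(200):
    r_new = {j: sum(M[j][i] * r[i] for i in nodes) for j in nodes}
    # L1 stopping rule: |r(t+1) - r(t)|_1 < eps
    if sum(abs(r_new[j] - r[j]) for j in nodes) < 1e-12:
        r = r_new
        break
    r = r_new

# Matches the table: (6/15, 6/15, 3/15) = (0.4, 0.4, 0.2)
assert abs(r["y"] - 0.4) < 1e-6
assert abs(r["a"] - 0.4) < 1e-6
assert abs(r["m"] - 0.2) < 1e-6
```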

SLIDE 26

(Repeat of the power-iteration example on Slide 25; the iterates converge to ry = ra = 6/15, rm = 3/15.)

SLIDE 27

 Power iteration: a method for finding the dominant eigenvector (the vector corresponding to the largest eigenvalue)
  • r(1) = M · r(0)
  • r(2) = M · r(1) = M(M r(0)) = M² · r(0)
  • r(3) = M · r(2) = M(M² r(0)) = M³ · r(0)
 Claim:
   The sequence M · r(0), M² · r(0), …, Mk · r(0), … approaches the dominant eigenvector of M

SLIDE 28

 Claim: The sequence M · r(0), M² · r(0), …, Mk · r(0), … approaches the dominant eigenvector of M
 Proof:
  • Assume M has n linearly independent eigenvectors x1, x2, …, xn with corresponding eigenvalues λ1, λ2, …, λn, where λ1 > λ2 > … > λn
  • The vectors x1, x2, …, xn form a basis, and thus we can write:
    r(0) = c1 x1 + c2 x2 + … + cn xn
  • M r(0) = M (c1 x1 + c2 x2 + … + cn xn)
           = c1 (M x1) + c2 (M x2) + … + cn (M xn)
           = c1 (λ1 x1) + c2 (λ2 x2) + … + cn (λn xn)
  • Repeated multiplication on both sides produces
    Mk r(0) = c1 (λ1^k x1) + c2 (λ2^k x2) + … + cn (λn^k xn)

SLIDE 29

 Claim: The sequence M · r(0), M² · r(0), …, Mk · r(0), … approaches the dominant eigenvector of M
 Proof (continued):
  • Repeated multiplication on both sides produces
    Mk r(0) = c1 (λ1^k x1) + c2 (λ2^k x2) + … + cn (λn^k xn)
  • Mk r(0) = λ1^k [c1 x1 + c2 (λ2/λ1)^k x2 + … + cn (λn/λ1)^k xn]
  • Since λ1 > λ2, the fractions λ2/λ1, λ3/λ1, … < 1,
    and so (λj/λ1)^k → 0 as k → ∞ (for all j = 2…n)
  • Thus: Mk r(0) ≈ c1 (λ1^k) x1
  • Note: if c1 = 0 then the method won't converge
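The geometric decay in the proof can be observed numerically on the running example. The limit (2/5, 2/5, 1/5) comes from the earlier slides; the 0.9 bound below is a deliberately loose stand-in for the true second-eigenvalue ratio of this matrix (roughly 0.81, a value computed only for this sketch):

```python
# The running example's column-stochastic M, as nested lists (row-major)
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
limit = [0.4, 0.4, 0.2]        # known fixed point (2/5, 2/5, 1/5)

r = [1 / 3, 1 / 3, 1 / 3]
errs = []
for _ in range(40):
    # Record the L1 error to the limit, then take one power-iteration step
    errs.append(sum(abs(r[i] - limit[i]) for i in range(3)))
    r = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]

# 20 extra steps shrink the error by roughly |lambda2/lambda1|^20;
# 0.9^20 is a safe (loose) upper bound for this matrix.
assert errs[30] < errs[10] * 0.9 ** 20
```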

SLIDE 30

 Imagine a random web surfer:
  • At any time t, the surfer is on some page i
  • At time t + 1, the surfer follows an out-link from i uniformly at random
  • Ends up on some page j linked from i
  • Process repeats indefinitely
 Let:
  • p(t) … vector whose i-th coordinate is the prob. that the surfer is at page i at time t
  • So, p(t) is a probability distribution over pages

SLIDE 31

 Where is the surfer at time t + 1?
  • Follows a link uniformly at random: p(t+1) = M · p(t)
 Suppose the random walk reaches a state p(t+1) = M · p(t) = p(t); then p(t) is a stationary distribution of the random walk
 Our original rank vector r satisfies r = M · r
  • So, r is a stationary distribution for the random walk

SLIDE 32

Does this converge? Does it converge to what we want? Are the results reasonable?

   rj(t+1) = Σi→j ri(t) / di,   or equivalently   r = M r

Announcement: We graded HW0 and HW1!
  • Stanford students: Pick them up from the submission box in Gates
  • SCPD students: SCPD will send you the HW
SLIDE 33

 Example (two nodes a and b that link only to each other): the iterates oscillate and never converge

   ra = 1, 0, 1, 0, …
   rb = 0, 1, 0, 1, …   (iteration 0, 1, 2, …)

   rj(t+1) = Σi→j ri(t) / di

SLIDE 34

 Example (a links to b, and b is a dead end): the importance leaks out and the iterates go to zero

   ra = 1, 0, 0, …
   rb = 0, 1, 0, …   (iteration 0, 1, 2, …)

   rj(t+1) = Σi→j ri(t) / di

SLIDE 35

2 problems:
 (1) Some pages are dead ends (have no out-links)
  • Such pages cause importance to "leak out"
 (2) Spider traps (all out-links are within the group)
  • Eventually spider traps absorb all importance

SLIDE 36

 Power Iteration:
  • Set rj = 1/N
  • rj = Σi→j ri / di
  • And iterate
 Example (iteration 0, 1, 2, …):

   ry   1/3  2/6  3/12   5/24  …  0
   ra = 1/3  1/6  2/12   3/24  …  0
   rm   1/3  3/6  7/12  16/24  …  1

M (m is a spider trap):
      y  a  m
   y  ½  ½
   a  ½
   m     ½  1

   ry = ry /2 + ra /2
   ra = ry /2
   rm = ra /2 + rm
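The spider-trap effect above can be reproduced with a few lines of Python; the matrix is the one on this slide, and the iteration count is an illustrative choice:

```python
# m's self-loop makes {m} a spider trap: it absorbs all importance.
M = [[0.5, 0.5, 0.0],     # ry = ry/2 + ra/2
     [0.5, 0.0, 0.0],     # ra = ry/2
     [0.0, 0.5, 1.0]]     # rm = ra/2 + rm

r = [1 / 3, 1 / 3, 1 / 3]
for _ in range(200):
    r = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]

assert r[0] < 1e-3 and r[1] < 1e-3     # ry, ra -> 0
assert abs(r[2] - 1.0) < 1e-3          # rm -> 1, as in the table
```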

SLIDE 37

 The Google solution for spider traps: At each time step, the random surfer has two options
  • With prob. β, follow a link at random
  • With prob. 1−β, jump to some random page
  • Common values for β are in the range 0.8 to 0.9
 The surfer will teleport out of the spider trap within a few time steps

SLIDE 38

 Power Iteration:
  • Set rj = 1/N
  • rj = Σi→j ri / di
  • And iterate
 Example (iteration 0, 1, 2, …):

   ry   1/3  2/6  3/12  5/24  …  0
   ra = 1/3  1/6  2/12  3/24  …  0
   rm   1/3  1/6  1/12  2/24  …  0

M (m is a dead end):
      y  a  m
   y  ½  ½
   a  ½
   m     ½

   ry = ry /2 + ra /2
   ra = ry /2
   rm = ra /2
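The leak caused by the dead end can likewise be demonstrated; the matrix is the one on this slide, with m's all-zero column:

```python
# m is a dead end: its column is all zeros, so M is no longer column
# stochastic and the total importance leaks out at every step.
M = [[0.5, 0.5, 0.0],     # ry = ry/2 + ra/2
     [0.5, 0.0, 0.0],     # ra = ry/2
     [0.0, 0.5, 0.0]]     # rm = ra/2

r = [1 / 3, 1 / 3, 1 / 3]
for _ in range(200):
    r = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]

assert sum(r) < 1e-6      # everything has leaked out
```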

SLIDE 39

 Teleports: Follow random teleport links with probability 1.0 from dead-ends
  • Adjust matrix accordingly

Before:               After:
      y  a  m            y  a  m
   y  ½  ½            y  ½  ½  ⅓
   a  ½               a  ½     ⅓
   m     ½            m     ½  ⅓

SLIDE 40

Markov chains
 Set of states X
 Transition matrix P where Pij = P(Xt = i | Xt−1 = j)
 π specifying the stationary probability of being at each state x ∈ X
 Goal is to find π such that π = P π

SLIDE 41

 Theory of Markov chains
 Fact: For any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic.

SLIDE 42

 Stochastic: Every column sums to 1
 A possible solution: Add green links (teleports out of the dead end m)

   A = M + (1/n) · e · aᵀ
   ai = 1 if node i has out-deg 0, ai = 0 otherwise
   e … vector of all 1s

      y  a  m
   y  ½  ½  1/3
   a  ½     1/3
   m     ½  1/3

   ry = ry /2 + ra /2 + rm /3
   ra = ry /2 + rm /3
   rm = ra /2 + rm /3
SLIDE 43

 Aperiodic: A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k
 A possible solution: Add green links

SLIDE 44

 Irreducible: From any state, there is a non-zero probability of going to any other state
 A possible solution: Add green links

SLIDE 45

 Google's solution that does it all:
  • Makes M stochastic, aperiodic, irreducible
 At each step, the random surfer has two options:
  • With probability β, follow a link at random
  • With probability 1−β, jump to some random page
 PageRank equation [Brin-Page, 98]:

   rj = Σi→j β ri / di + (1−β) 1/N

   di … out-degree of node i

This formulation assumes that M has no dead ends. We can either preprocess matrix M to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends.

SLIDE 46

 PageRank equation [Brin-Page, 98]:

   rj = Σi→j β ri / di + (1−β) 1/N

 The Google Matrix A:

   A = β M + (1−β) (1/N) e·eᵀ

   e … vector of all 1s
 A is stochastic, aperiodic and irreducible, so

   r(t+1) = A · r(t)

 What is β?
  • In practice β = 0.8, 0.9 (make 5 steps on average, then jump)
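The Google matrix can be built and iterated in a few lines. This uses the spider-trap example matrix from the deck with β = 0.8; the stationary vector (7/33, 5/33, 21/33) is the one worked out on the following slide:

```python
beta, N = 0.8, 3
# M of the worked example (m is a spider trap with a self-loop)
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

# Google matrix: A = beta*M + (1-beta)*(1/N)*e*e^T
A = [[beta * M[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]

r = [1.0 / N] * N
for _ in range(200):
    r = [sum(A[i][j] * r[j] for j in range(N)) for i in range(N)]

# A is stochastic, aperiodic and irreducible, so the iteration converges
assert abs(r[0] - 7 / 33) < 1e-9
assert abs(r[1] - 5 / 33) < 1e-9
assert abs(r[2] - 21 / 33) < 1e-9
```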

SLIDE 47

Example, with β = 0.8 (e.g. Ayy = 0.8·½ + 0.2·⅓ = 7/15, Amm = 0.8·1 + 0.2·⅓ = 13/15):

        [1/2 1/2  0]        [1/3 1/3 1/3]   [7/15 7/15  1/15]
A = 0.8·[1/2  0   0] + 0.2· [1/3 1/3 1/3] = [7/15 1/15  1/15]
        [ 0  1/2  1]        [1/3 1/3 1/3]   [1/15 7/15 13/15]
           M                  1/n·1·1ᵀ             A

Power iteration on A (iteration 0, 1, 2, …):

   ry   1/3  0.33  0.24  0.26  …   7/33
   ra = 1/3  0.20  0.20  0.18  …   5/33
   rm   1/3  0.46  0.52  0.56  …  21/33

SLIDE 48
SLIDE 49

 Key step is matrix-vector multiplication: rnew = A ∙ rold
 Easy if we have enough main memory to hold A, rold, rnew
 Say N = 1 billion pages
  • We need 4 bytes for each entry (say)
  • 2 billion entries for the two vectors, approx 8GB
  • Matrix A has N² entries; 10^18 is a large number!

A = β·M + (1−β)·[1/N]N×N:

        [1/2 1/2  0]        [1/3 1/3 1/3]   [7/15 7/15  1/15]
A = 0.8·[1/2  0   0] + 0.2· [1/3 1/3 1/3] = [7/15 1/15  1/15]
        [ 0  1/2  1]        [1/3 1/3 1/3]   [1/15 7/15 13/15]

SLIDE 50

 Suppose there are N pages
 Consider page j, with dj out-links
 We have Mij = 1/|dj| when j → i and Mij = 0 otherwise
 The random teleport is equivalent to:
  • Adding a teleport link from j to every other page and setting the transition probability to (1−β)/N
  • Reducing the probability of following each out-link from 1/|dj| to β/|dj|
  • Equivalent: Tax each page a fraction (1−β) of its score and redistribute it evenly

SLIDE 51

 r = A · r, where Aji = β Mji + (1−β)/N
 rj = Σi=1..N Aji · ri
 rj = Σi=1..N [β Mji + (1−β)/N] · ri
    = β Σi=1..N Mji · ri + (1−β)/N Σi=1..N ri
    = β Σi=1..N Mji · ri + (1−β)/N,   since Σi ri = 1
 So we get: r = β M · r + [(1−β)/N]N

[x]N … a vector of length N with all entries x
Note: Here we assumed M has no dead-ends.

SLIDE 52

 We just rearranged the PageRank equation

   r = β M · r + [(1−β)/N]N

  • where [(1−β)/N]N is a vector with all N entries (1−β)/N
 M is a sparse matrix! (with no dead-ends)
  • 10 links per node, approx 10N entries
 So in each iteration, we need to:
  • Compute rnew = β M ∙ rold
  • Add a constant value (1−β)/N to each entry in rnew
  • Note: if M contains dead-ends then Σj rjnew < 1 and we also have to renormalize rnew so that it sums to 1
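The sparse iteration above never materializes the dense A. A minimal sketch on the toy graph from the earlier slides (y=0, a=1, m=2; the node numbering is an illustrative choice):

```python
beta, N = 0.8, 3
# Sparse out-link lists (m has a self-loop; no dead ends)
links = {0: [0, 1], 1: [0, 2], 2: [2]}

r = [1.0 / N] * N
for _ in range(200):
    # Start from the constant teleport term, then add beta * M * r_old
    r_new = [(1 - beta) / N] * N
    for i, dests in links.items():
        for j in dests:
            r_new[j] += beta * r[i] / len(dests)
    r = r_new

assert abs(sum(r) - 1.0) < 1e-12       # no renormalization needed here
assert abs(r[2] - 21 / 33) < 1e-9      # same limit as the dense A
```

The work per iteration is proportional to the number of links, which is the whole point of the rearrangement.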

SLIDE 53

 Input: Graph G and parameter β
  • Directed graph G with spider traps and dead ends
  • Parameter β
 Output: PageRank vector r
  • Set: rj(0) = 1/N, t = 1
  • do:
      ∀j: r′j(t) = Σi→j β ri(t−1) / di
          r′j(t) = 0 if in-deg. of j is 0
  •   Now re-insert the leaked PageRank:
      ∀j: rj(t) = r′j(t) + (1−S)/N,   where S = Σj r′j(t)
  •   t = t + 1
  • while Σj |rj(t) − rj(t−1)| > ε
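The complete algorithm, including the re-insertion of leaked PageRank, can be sketched on a toy graph with a dead end (the graph and iteration count are illustrative choices, not from the slides):

```python
beta, N = 0.8, 3
# Toy graph: node 0 -> {0, 1}, node 1 -> {0, 2}; node 2 is a dead end
links = {0: [0, 1], 1: [0, 2]}

r = [1.0 / N] * N
for _ in range(200):
    r_prime = [0.0] * N
    for i, dests in links.items():
        for j in dests:
            r_prime[j] += beta * r[i] / len(dests)
    S = sum(r_prime)                       # S < 1: PageRank has leaked
    r_new = [x + (1 - S) / N for x in r_prime]
    delta = sum(abs(a - b) for a, b in zip(r_new, r))
    r = r_new

assert abs(sum(r) - 1.0) < 1e-12           # mass is conserved each step
assert delta < 1e-12                       # the iteration has converged
```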

SLIDE 54

 Encode sparse matrix using only nonzero entries
  • Space proportional roughly to number of links
  • Say 10N, or 4·10·1 billion = 40GB
  • Still won't fit in memory, but will fit on disk

   source node | degree | destination nodes
   0           | 3      | 1, 5, 7
   1           | 5      | 17, 64, 113, 117, 245
   2           | 2      | 13, 23

SLIDE 55

 Assume enough RAM to fit rnew into memory
  • Store rold and matrix M on disk
 Then 1 step of power-iteration is:

   Initialize all entries of rnew to (1−β)/N
   For each page p (of out-degree n):
     Read into memory: p, n, dest1, …, destn, rold(p)
     for j = 1…n:
       rnew(destj) += β · rold(p) / n

   src | degree | destination
   0   | 3      | 1, 5, 6
   1   | 4      | 17, 64, 113, 117
   2   | 2      | 13, 23

SLIDE 56

 Assume enough RAM to fit rnew into memory
  • Store rold and matrix M on disk
 In each iteration, we have to:
  • Read rold and M
  • Write rnew back to disk
  • IO cost = 2|r| + |M|
 Question:
  • What if we could not even fit rnew in memory?

SLIDE 57

(Figure: rnew split into blocks that fit in memory; M stored on disk in the encoded form:)

   src | degree | destination
   0   | 4      | 0, 1, 3, 5
   1   | 2      | 0, 5
   2   | 2      | 3, 4
SLIDE 58

 Similar to nested-loop join in databases
  • Break rnew into k blocks that fit in memory
  • Scan M and rold once for each block
 k scans of M and rold
  • Cost per iteration: k(|M| + |r|) + |r| = k|M| + (k+1)|r|
 Can we do better?
  • Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration

SLIDE 59

(Figure: the matrix M of Slide 57 broken into stripes; each stripe holds, for every source node, only the destinations that fall into the corresponding block of rnew.)

SLIDE 60

 Break M into stripes
  • Each stripe contains only the destination nodes in the corresponding block of rnew
 Some additional overhead per stripe
  • But it is usually worth it
 Cost per iteration: |M|(1+ε) + (k+1)|r|

SLIDE 61

 Measures generic popularity of a page
  • Biased against topic-specific authorities
  • Solution: Topic-Specific PageRank (next)
 Uses a single measure of importance
  • Other models exist, e.g., hubs-and-authorities
  • Solution: Hubs-and-Authorities (next)
 Susceptible to link spam
  • Artificial link topologies created in order to boost page rank
  • Solution: TrustRank (next)