SLIDE 1

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

SLIDE 2

 Web pages are not equally “important”

www.joe-schmoe.com vs. www.stanford.edu

 We already know:

Since there is large diversity in the connectivity of the web graph, we can rank the pages by their link structure

2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets
SLIDE 3

 We will cover the following Link Analysis

approaches to computing the importance of nodes in a graph:

  • PageRank
  • Hubs and Authorities (HITS)
  • Topic-Specific (Personalized) PageRank
  • Web Spam Detection Algorithms

SLIDE 4

 Idea: Links as votes

  • Page is more important if it has more links
  • In-coming links? Out-going links?

 Think of in-links as votes:

  • www.stanford.edu has 23,400 inlinks
  • www.joe-schmoe.com has 1 inlink

 Are all in-links equal?

  • Links from important pages count more
  • Recursive question!

SLIDE 5

 Each link’s vote is proportional to the

importance of its source page

 If page p with importance x has n out-links,

each link gets x/n votes

 Page p’s own importance is the sum of the

votes on its in-links

SLIDE 6

 A “vote” from an important

page is worth more

 A page is important if it is

pointed to by other important pages

 Define a “rank” rj for node j

rj = ∑i→j ri / dout(i)

The web in 1839 (pages y, a, m):
y → y, y → a, a → y, a → m, m → a

Flow equations:

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2

SLIDE 7

 3 equations, 3 unknowns,

no constants

  • No unique solution
  • All solutions equivalent modulo scale factor

 Additional constraint forces uniqueness

  • y + a + m = 1
  • y = 2/5, a = 2/5, m = 1/5

 Gaussian elimination method works for small

examples, but we need a better method for large web-size graphs
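For a system this small, the elimination step can be sketched in a few lines (numpy and the row ordering are our choices; the redundant third flow equation is replaced by the constraint y + a + m = 1):

```python
# Solve the 1839-web flow equations with the constraint y + a + m = 1.
# Rows 1-2 are the flow equations rewritten as "... = 0"; row 3 is the constraint.
import numpy as np

A = np.array([[-0.5,  0.5,  0.0],   # ry = ry/2 + ra/2  ->  -ry/2 + ra/2 = 0
              [ 0.5, -1.0,  1.0],   # ra = ry/2 + rm    ->  ry/2 - ra + rm = 0
              [ 1.0,  1.0,  1.0]])  # y + a + m = 1 (replaces the redundant eq.)
b = np.array([0.0, 0.0, 1.0])
r = np.linalg.solve(A, b)
print(r)  # y = 2/5, a = 2/5, m = 1/5
```

This only works because the extra constraint makes the system full rank; without it, any scalar multiple of the solution also satisfies the flow equations.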

SLIDE 8

 Stochastic adjacency matrix M

  • Let page j have dj out-links
  • If j → i, then Mij = 1/dj, else Mij = 0
  • M is a column stochastic matrix
  • Columns sum to 1

 Rank vector r: vector with an entry per page

  • ri is the importance score of page i
  • ∑i ri = 1

 The flow equations can be written

r = M r
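As a sketch of this construction (numpy, the helper name, and the y/a/m edge list, reconstructed from the earlier flow equations, are our choices):

```python
# Build the column-stochastic matrix M: M[i][j] = 1/dj if j -> i, else 0.
import numpy as np

def stochastic_matrix(edges, n):
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1
    M = np.zeros((n, n))
    for src, dst in edges:
        M[dst][src] = 1.0 / out_deg[src]   # column src sums to 1
    return M

# y=0, a=1, m=2; edges reproduce ry = ry/2 + ra/2, ra = ry/2 + rm, rm = ra/2
edges = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1)]
M = stochastic_matrix(edges, 3)
print(M.sum(axis=0))  # every column sums to 1
```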

SLIDE 9

 Suppose page j links to 3 pages, including i

Column j of M then has 1/3 in each of the three rows that j links to, so in r = M∙r page i receives rj /3 from j.

SLIDE 10

 The flow equations can be written

r = M ∙ r

 So the rank vector is an eigenvector of the

stochastic web matrix

  • In fact, r is the principal eigenvector of M, with

corresponding eigenvalue 1

SLIDE 11

r = M∙r

[ry]   [ ½  ½  0 ] [ry]
[ra] = [ ½  0  1 ] [ra]
[rm]   [ 0  ½  0 ] [rm]

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2

SLIDE 12

 Given a web graph with N nodes, where the

nodes are pages and edges are hyperlinks

 Power iteration: a simple iterative scheme

  • Initialize: r(0) = [1/N,….,1/N]T
  • Iterate: r(t+1) = M ∙ r(t)
  • Stop when |r(t+1) – r(t)|1 < ε
  • |x|1 = ∑1≤i≤N |xi| is the L1 norm
  • Can use any other vector norm, e.g., Euclidean

In components: rj(t+1) = ∑i→j ri(t) / di,  where di is the out-degree of node i
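The scheme can be sketched in a few lines of Python (the tolerance and iteration cap are our choices; M is the y/a/m matrix from the flow-equation example):

```python
# Power iteration: r(0) = uniform, r(t+1) = M r(t), stop on small L1 change.
import numpy as np

def power_iteration(M, eps=1e-10, max_iter=1000):
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:   # L1-norm stopping rule
            return r_next
        r = r_next
    return r

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = power_iteration(M)
print(r)  # converges to y = 2/5, a = 2/5, m = 1/5
```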

SLIDE 13

 Power Iteration:

  • Set rj(0) = 1/N
  • Iterate: rj(t+1) = ∑i→j ri(t) / di
  • Stop when |r(t+1) – r(t)|1 < ε

 Example (iteration 0, 1, 2, …):

ry   1/3   1/3   5/12    9/24   …   6/15
ra = 1/3   3/6   1/3    11/24   …   6/15
rm   1/3   1/6   3/12    1/6    …   3/15

      y    a    m
y  [  ½    ½    0  ]        ry = ry /2 + ra /2
a  [  ½    0    1  ]        ra = ry /2 + rm
m  [  0    ½    0  ]        rm = ra /2

SLIDE 14

 Imagine a random web surfer:

  • At any time t, surfer is on some page u
  • At time t+1, the surfer follows an

out-link from u uniformly at random

  • Ends up on some page v linked from u
  • Process repeats indefinitely

 Let:

p(t) … vector whose ith coordinate is the

  • prob. that the surfer is at page i at time t
  • p(t) is a probability distribution over pages

SLIDE 15

 Where is the surfer at time t+1?

  • Follows a link uniformly at random

p(t+1) = M · p(t)

 Suppose the random walk reaches a state

p(t+1) = M · p(t) = p(t)

then p(t) is a stationary distribution of the random walk

 Our rank vector r satisfies r = M · r

  • So, it is a stationary distribution for

the random walk

SLIDE 16

 Does this converge?
 Does it converge to what we want?
 Are results reasonable?

rj(t+1) = ∑i→j ri(t) / di,  or equivalently  r = M∙r

SLIDE 17

 Example (the two-node cycle a ⇄ b, starting from r = (1, 0); iteration 0, 1, 2, …):

ra   1   0   1   0   …
rb   0   1   0   1   …

The iteration oscillates and never converges.

SLIDE 18

 Example (a → b with b a dead end, starting from r = (1, 0); iteration 0, 1, 2, …):

ra   1   0   0   0   …
rb   0   1   0   0   …

The importance leaks out: r converges to (0, 0).

SLIDE 19

2 problems:

 Some pages are “dead ends”

(have no out-links)

  • Such pages cause

importance to “leak out”

 Spider traps (all out-links are

within the group)

  • Eventually spider traps absorb all importance
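Both problems show up immediately if we run plain power iteration (a sketch; the two 3-node matrices are assumed variants of the y/a/m graph, with m turned into a dead end and into a self-loop trap, respectively):

```python
import numpy as np

def iterate(M, steps=100):
    r = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(steps):
        r = M @ r
    return r

# Dead end: m has no out-links, so its column is all zeros.
M_dead = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 0.0]])
print(iterate(M_dead).sum())  # importance "leaks out": total mass -> 0

# Spider trap: m links only to itself.
M_trap = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 1.0]])
print(iterate(M_trap))        # the trap absorbs everything: r -> (0, 0, 1)
```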

SLIDE 20

 Power Iteration:

  • Set rj(0) = 1/N
  • Iterate: rj(t+1) = ∑i→j ri(t) / di

 Example (iteration 0, 1, 2, …):

ry   1/3   2/6   3/12    5/24   …   0
ra = 1/3   1/6   2/12    3/24   …   0
rm   1/3   3/6   7/12   16/24   …   1

      y    a    m
y  [  ½    ½    0  ]        ry = ry /2 + ra /2
a  [  ½    0    0  ]        ra = ry /2
m  [  0    ½    1  ]        rm = ra /2 + rm

SLIDE 21

 The Google solution for spider traps: At each

time step, the random surfer has two options:

  • With probability β, follow a link at random
  • With probability 1-β, jump to some page uniformly

at random

  • Common values for β are in the range 0.8 to 0.9

 Surfer will teleport out of spider trap within a

few time steps

SLIDE 22

 Power Iteration:

  • Set rj(0) = 1/N
  • Iterate: rj(t+1) = ∑i→j ri(t) / di

 Example (iteration 0, 1, 2, …):

ry   1/3   2/6   3/12   5/24   …   0
ra = 1/3   1/6   2/12   3/24   …   0
rm   1/3   1/6   1/12   2/24   …   0

      y    a    m
y  [  ½    ½    0  ]        ry = ry /2 + ra /2
a  [  ½    0    0  ]        ra = ry /2
m  [  0    ½    0  ]        rm = ra /2

SLIDE 23

 Teleports: Follow random teleport links with

probability 1.0 from dead-ends

  • Adjust matrix accordingly

      y    a    m               y    a    m
y  [  ½    ½    0  ]       y  [  ½    ½    ⅓  ]
a  [  ½    0    0  ]   →   a  [  ½    0    ⅓  ]
m  [  0    ½    0  ]       m  [  0    ½    ⅓  ]

SLIDE 24

Markov Chains

 Set of states X
 Transition matrix P where Pij = P(Xt = i | Xt-1 = j)
 π specifying the probability of being at each

state x ∈ X

 Goal is to find π such that π = P∙π

SLIDE 25

 Theory of Markov chains  Fact: For any start vector, the power method

applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic.

SLIDE 26

 Stochastic: Every column sums to 1
 A possible solution: Add green links

      y    a    m
y  [  ½    ½    ⅓  ]        ry = ry /2 + ra /2 + rm /3
a  [  ½    0    ⅓  ]        ra = ry /2 + rm /3
m  [  0    ½    ⅓  ]        rm = ra /2 + rm /3

S = M + (1/n)∙1∙aT

  • ai = 1 if node i has out-deg 0, else ai = 0
  • 1 … vector of all 1s
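A sketch of this fix in code (numpy is our choice; M is the y/a/m matrix with the dead end at m):

```python
# Dead-end fix: S = M + (1/n) * 1 * aT, where a marks the all-zero columns.
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])   # column m is all zeros (dead end)
n = M.shape[0]
a = (M.sum(axis=0) == 0).astype(float)   # ai = 1 iff node i has out-deg 0
S = M + np.outer(np.ones(n), a) / n      # adds 1/n down each dead-end column
print(S.sum(axis=0))  # every column of S now sums to 1
```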
SLIDE 27

 A chain is periodic if there exists k > 1 such

that the interval between two visits to some state s is always a multiple of k.

 A possible solution: Add green links

SLIDE 28

 Irreducible: From any state, there is a non-zero

probability of reaching any other state

 A possible solution: Add green links

SLIDE 29

 Google’s solution that does it all:

  • Makes M stochastic, aperiodic, irreducible

 At each step, random surfer has two options:

  • With probability β, follow a link at random
  • With probability 1-β, jump to some random page

 PageRank equation [Brin-Page, 98]:

rj = ∑i→j β ri / di + (1-β)∙1/N

di … out-degree of node i

Assuming we follow random teleport links with probability 1.0 from dead-ends

SLIDE 30

 PageRank equation [Brin-Page, 98]:

rj = ∑i→j β ri / di + (1-β)∙1/N

 The Google Matrix A:

A = β M + (1-β) (1/N)∙1∙1T

 A is stochastic, aperiodic and irreducible, so

r(t+1) = A ∙ r(t)

 What is β ?

  • In practice β ≈ 0.85 (the surfer makes about 5 steps on average, then jumps)

SLIDE 31

A = 0.8∙M + 0.2∙[1/3]3×3:

        [ ½  ½  0 ]         [ ⅓  ⅓  ⅓ ]     [ 7/15  7/15   1/15 ]
A = 0.8 [ ½  0  0 ]  + 0.2  [ ⅓  ⅓  ⅓ ]  =  [ 7/15  1/15   1/15 ]
        [ 0  ½  1 ]         [ ⅓  ⅓  ⅓ ]     [ 1/15  7/15  13/15 ]

Iterating r = A∙r (iteration 0, 1, 2, 3, …):

ry   1/3   0.33   0.24   0.26   …    7/33
ra = 1/3   0.20   0.20   0.18   …    5/33
rm   1/3   0.46   0.52   0.56   …   21/33
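The same computation as code (a sketch; numpy and the fixed iteration count are our choices):

```python
# Google matrix for the y/a/m spider-trap graph: A = beta*M + (1-beta)*[1/N].
import numpy as np

beta, N = 0.8, 3
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
A = beta * M + (1 - beta) / N * np.ones((N, N))

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r                 # power iteration on A
print(r)  # converges to (7/33, 5/33, 21/33)
```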

SLIDE 32

 Suppose there are N pages
 Consider a page j, with set of out-links O(j)
 We have Mij = 1/|O(j)| when j→i and Mij = 0 otherwise

 The random teleport is equivalent to

  • Adding a teleport link from j to every other page

with probability (1-β)/N

  • Reducing the probability of following each out-link

from 1/|O(j)| to β/|O(j)|

  • Equivalent: Tax each page a fraction (1-β) of its

score and redistribute evenly

SLIDE 33

 Construct the N x N matrix A as follows

  • Aij = β ∙ Mij + (1-β)/N

 Verify that A is a stochastic matrix
 The PageRank vector r is the principal

eigenvector of this matrix A

  • satisfying r = A ∙ r

 Equivalently, r is the stationary distribution of

the random walk with teleports

SLIDE 34

 Key step is matrix-vector multiplication

  • rnew = A ∙ rold

 Easy if we have enough main memory to hold

A, rold, rnew

 Say N = 1 billion pages

  • We need 4 bytes for

each entry (say)

  • 2 billion entries for

vectors, approx 8GB

  • Matrix A has N² ≈ 10¹⁸ entries
  • 10¹⁸ is a large number!

A = β∙M + (1-β)∙[1/N]N×N, e.g. with β = 0.8:

        [ ½  ½  0 ]         [ ⅓  ⅓  ⅓ ]     [ 7/15  7/15   1/15 ]
A = 0.8 [ ½  0  0 ]  + 0.2  [ ⅓  ⅓  ⅓ ]  =  [ 7/15  1/15   1/15 ]
        [ 0  ½  1 ]         [ ⅓  ⅓  ⅓ ]     [ 1/15  7/15  13/15 ]

SLIDE 35

 r = A∙r, where Aji = β Mji + (1-β)/N

 rj = ∑i=1..N Aji ∙ ri

 rj = ∑i=1..N [β Mji + (1-β)/N] ∙ ri

    = ∑i=1..N β Mji ∙ ri + (1-β)/N ∙ ∑i=1..N ri

    = ∑i=1..N β Mji ∙ ri + (1-β)/N,  since ∑i ri = 1

 So, r = β M∙r + [(1-β)/N]N

[x]N … a vector of length N with all entries x
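A quick numeric check of this rearrangement (a sketch; the 3-node graph and β = 0.8 are from the earlier example, and r is an arbitrary probability vector):

```python
# For any r with sum(r) = 1:  A*r == beta*M*r + (1-beta)/N, elementwise.
import numpy as np

beta, N = 0.8, 3
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
A = beta * M + (1 - beta) / N * np.ones((N, N))

r = np.array([0.2, 0.3, 0.5])            # any probability vector
lhs = A @ r
rhs = beta * (M @ r) + (1 - beta) / N    # constant added to every entry
print(np.allclose(lhs, rhs))  # True
```

Note the identity relies on ∑ ri = 1 and on M having no all-zero columns.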

SLIDE 36

 We can rearrange the PageRank equation:

  • r = β M∙r + [(1-β)/N]N
  • [(1-β)/N]N is an N-vector with all entries (1-β)/N

 M is a sparse matrix!

  • 10 links per node, approx 10N entries

 So in each iteration, we need to:

  • Compute rnew = β M ∙ rold
  • Add a constant value (1-β)/N to each entry in rnew

SLIDE 37

 Encode sparse matrix using only nonzero

entries

  • Space proportional roughly to number of links
  • Say 10N, or 4*10*1 billion = 40GB
  • Still won’t fit in memory, but will fit on disk

source node   degree   destination nodes
0             3        1, 5, 7
1             5        17, 64, 113, 117, 245
2             2        13, 23

SLIDE 38

 Assume enough RAM to fit rnew into memory

  • Store rold and matrix M on disk

 Then 1 step of power-iteration is:

src   degree   destination
0     3        1, 5, 6
1     4        17, 64, 113, 117
2     2        13, 23

Initialize all entries of rnew to (1-β)/N
For each page p (of out-degree n):
    Read into memory: p, n, dest1, …, destn, rold(p)
    for j = 1…n:
        rnew(destj) += β rold(p) / n
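The update above, written directly in Python (a sketch; the 6-page adjacency list is an assumed example standing in for the on-disk encoding, and every page here has out-links):

```python
# One power-iteration step streamed over an adjacency list (no matrix in RAM).
beta, N = 0.8, 6
links = {0: [1, 4, 5],    # page 0 links to pages 1, 4, 5, etc.
         1: [0, 3],
         2: [1, 2, 4],
         3: [0],
         4: [2, 5],
         5: [3]}

r_old = [1.0 / N] * N
r_new = [(1 - beta) / N] * N          # initialize all entries to (1-beta)/N
for p, dests in links.items():        # "read p, n, dest_1..dest_n, r_old(p)"
    n = len(dests)
    for d in dests:
        r_new[d] += beta * r_old[p] / n
print(sum(r_new))  # ≈ 1: mass is conserved when no page is a dead end
```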

SLIDE 39

 Assume enough RAM to fit rnew into memory

  • Store rold and matrix M on disk

 In each iteration, we have to:

  • Read rold and M
  • Write rnew back to disk
  • IO cost = 2|r| + |M|

 Question:

  • What if we could not even fit rnew in memory?

SLIDE 40

src   degree   destination
0     4        0, 1, 3, 5
1     2        0, 5
2     2        3, 4

(rnew is updated one block at a time while rold and M are scanned from disk.)

SLIDE 41

 Similar to nested-loop join in databases

  • Break rnew into k blocks that fit in memory
  • Scan M and rold once for each block

 k scans of M and rold

  • k(|M| + |r|) + |r| = k|M| + (k+1)|r|

 Can we do better?

  • Hint: M is much bigger than r (approx 10-20x), so

we must avoid reading it k times per iteration

SLIDE 42

(Example: the table from slide 40 broken into stripes — one stripe of M per block of rnew, each listing src, degree, and only the destinations that fall in that block.)

SLIDE 43

 Break M into stripes

  • Each stripe contains only destination nodes in the

corresponding block of rnew

 Some additional overhead per stripe

  • But it is usually worth it

 Cost per iteration

  • |M|(1+ε) + (k+1)|r|

SLIDE 44

 Measures generic popularity of a page

  • Biased against topic-specific authorities
  • Solution: Topic-Specific PageRank (next)

 Uses a single measure of importance

  • Other models e.g., hubs-and-authorities
  • Solution: Hubs-and-Authorities (next)

 Susceptible to Link spam

  • Artificial link topologies created in order to

boost PageRank

  • Solution: TrustRank (next)
