SLIDE 1

Fully Personalized PageRank Similarity Search

To Randomize or Not To Randomize: Space Optimal Summaries for Hyperlink Analysis

Tamás Sarlós, Eötvös University and Computer and Automation Institute, Hungarian Academy of Sciences

Joint work with András A. Benczúr, Károly Csalogány, Dániel Fogaras, and Balázs Rácz

Tamás Sarlós et al., Hungarian Academy of Sciences

Space Optimal Summaries for Hyperlink Analysis

SLIDE 2

Personalized PageRank – Definition and Motivation

Definition: random surfer with teleportation distribution $r$ and teleportation probability $c \approx 0.15$:

$\mathrm{PPR}_r(u) = c \cdot r(u) + (1-c) \sum_{v:(vu)\in E} \mathrm{PPR}_r(v)/d^+(v)$

where $d^+(v)$ is the out-degree of $v$.

Motivation: search engines

◮ Improved ranking
◮ Fighting link spam

Slow to compute naively with the power method
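The naive power method mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `ppr_power_method`, the dictionary-based graph format, and the three-page `toy_graph` are assumptions, and every vertex is assumed to have at least one out-link.

```python
def ppr_power_method(out_edges, r, c=0.15, iterations=100):
    """Iterate p <- c*r + (1-c) * (one random-walk step applied to p).

    out_edges: dict mapping every vertex to the list of vertices it links to.
    r: dict of teleportation probabilities (should sum to 1).
    Assumption: every vertex has at least one out-link.
    """
    p = {v: r.get(v, 0.0) for v in out_edges}
    for _ in range(iterations):
        nxt = {v: c * r.get(v, 0.0) for v in out_edges}
        for v, targets in out_edges.items():
            share = (1 - c) * p[v] / len(targets)  # (1-c) * PPR_r(v) / d+(v)
            for u in targets:
                nxt[u] += share
        p = nxt
    return p


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = ppr_power_method(toy_graph, {"a": 1.0})
```

Each teleportation distribution needs its own full run over all edges, which is why this approach does not scale to one personalization per page.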

SLIDE 4

Personalized PageRank – Linearity

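Linearity: $\mathrm{PPR}_{\alpha_1 r_1 + \alpha_2 r_2}(u) = \alpha_1 \mathrm{PPR}_{r_1}(u) + \alpha_2 \mathrm{PPR}_{r_2}(u)$

Single page teleportation suffices: $\mathrm{PPR}_r(u) = \sum_v r(v) \cdot \mathrm{PPR}_v(u)$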
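A short worked instance of the linearity property may help. The two-page preference vector with weights 0.7 and 0.3 is a hypothetical example, and $\chi_v$ denotes the distribution concentrated on page $v$.

```latex
% Hypothetical preference: 70% teleportation to page a, 30% to page b
r = 0.7\,\chi_a + 0.3\,\chi_b
\quad\Longrightarrow\quad
\mathrm{PPR}_r(u) = 0.7\,\mathrm{PPR}_a(u) + 0.3\,\mathrm{PPR}_b(u)

% General case: write r as a combination of single-page distributions
r = \sum_v r(v)\,\chi_v
\quad\Longrightarrow\quad
\mathrm{PPR}_r(u) = \sum_v r(v)\,\mathrm{PPR}_{\chi_v}(u) = \sum_v r(v)\,\mathrm{PPR}_v(u)
```

So a database holding the single-page vectors $\mathrm{PPR}_v$ is enough to answer queries for any teleportation distribution.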

SLIDE 5

Personalized PageRank – Preliminaries

Two-phase algorithm

1. precomputes a PPR database
2. answers PageRank queries using the database

Exact PPR on a graph of n ≈ millions ... billions of vertices:

Method                              Storage requirement             Personalization
Topic sensitive [Haveliwala 02]     O(t · n) words, t ≈ 10–100      topics
Hub decomposition [Jeh–Widom 03]    O(h · n) words, h ≈ 100,000     pages
Lower bound [Fogaras–Rácz 04]       Ω(n^2) bits, infeasible         all pages

SLIDE 6

Sampling Fully Personalized PageRank

◮ Express $\mathrm{PPR}_u(v)$ as the probability that a random walk starting at $u$ ends at $v$
◮ Sample ending points of random walks as above
◮ First algorithm with no restriction on $u$
◮ Additive error $\pm\epsilon$; out-of-bounds probability $\delta$
◮ Uses $O(n \cdot \epsilon^{-2} \log(1/\delta) \log n)$ bits of space
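The sampling idea in the list above can be sketched as follows. This is an illustrative sketch only: the name `sample_ppr`, the fixed random seed, and the toy graph are assumptions, and every vertex is assumed to have an out-link. Each walk stops with probability c at every step, so its endpoint is distributed according to $\mathrm{PPR}_u$.

```python
import random
from collections import Counter


def sample_ppr(out_edges, u, c=0.15, num_walks=10000, rng=None):
    """Estimate PPR_u by the empirical distribution of random-walk endpoints.

    Each walk starts at u; before every step it stops with probability c,
    otherwise it moves to a uniformly random out-neighbour.
    Assumption: every vertex has at least one out-link.
    """
    rng = rng or random.Random(0)
    ends = Counter()
    for _ in range(num_walks):
        v = u
        while rng.random() >= c:  # continue with probability 1 - c
            v = rng.choice(out_edges[v])
        ends[v] += 1
    return {v: count / num_walks for v, count in ends.items()}


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
estimate = sample_ppr(toy_graph, "a")
```

Only the sampled endpoints need to be stored per start vertex, which is where the dependence of the space bound on the number of samples comes from.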

SLIDE 7

Power Iteration and Dynamic Programming

Example

[Figure: example graph on vertices u, w, v1, ..., v6]

Power iteration amplifies the error downwards
Dynamic programming [Jeh–Widom WWW 2003] averages the error upward:

$\mathrm{PPR}^{(k+1)}_u = c\chi_u + (1-c) \cdot \sum_{v:(uv)\in E} \mathrm{PPR}^{(k)}_v / d^+(u)$

Problem: small world, the number of non-zeroes grows quickly in u's neighborhood
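A minimal sketch of the dynamic programming recurrence above, keeping one sparse vector per vertex. The function name, the all-zero starting vectors, and the toy graph are assumptions made for illustration; every vertex is assumed to have an out-link.

```python
def ppr_dynamic_programming(out_edges, c=0.15, iterations=50):
    """Return {u: sparse vector PPR_u} after `iterations` rounds of
    PPR_u <- c*chi_u + (1-c) * average of PPR_v over out-neighbours v of u.

    Starts from all-zero vectors.
    Assumption: every vertex has at least one out-link.
    """
    ppr = {u: {} for u in out_edges}
    for _ in range(iterations):
        nxt = {}
        for u, neighbours in out_edges.items():
            vec = {u: c}  # the c * chi_u term
            weight = (1 - c) / len(neighbours)
            for v in neighbours:
                for w, value in ppr[v].items():
                    vec[w] = vec.get(w, 0.0) + weight * value
            nxt[u] = vec
        ppr = nxt
    return ppr


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
all_ppr = ppr_dynamic_programming(toy_graph, iterations=100)
```

After k rounds each vector can have a non-zero entry for every vertex within distance k of u, which illustrates the small-world problem: without further measures the vectors stop being sparse.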

SLIDE 8

Rounded Dynamic Programming

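Repeat $k_{\max} = 2\log_{1-c}\epsilon$ times for all $u$:

$\widehat{\mathrm{PPR}}_u = \mathrm{Round}_k\Bigl(c\chi_u + (1-c) \cdot \sum_{v:(uv)\in E} \widehat{\mathrm{PPR}}_v / d^+(u)\Bigr)$

where $\widehat{\mathrm{PPR}}_u$ denotes the rounded approximation of $\mathrm{PPR}_u$.

◮ Space: n sparse $\widehat{\mathrm{PPR}}_u$ vectors in $O(n \cdot 1/\epsilon \cdot \log n)$ bits; optimal for top queries
◮ Can gradually decrease the rounding error $\epsilon_k$ from $\epsilon_1 = 1$ to $\epsilon_{k_{\max}} = \epsilon$
◮ Deterministic output; an inductive proof shows $\mathrm{PPR}_u(v) - 2\epsilon/c \le \widehat{\mathrm{PPR}}_u(v) \le \mathrm{PPR}_u(v)$
◮ Preprocessing: linear $O((n + m)/(c\epsilon))$ time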
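The rounded recurrence can be sketched as below. This is a simplified illustration under stated assumptions: it rounds down to multiples of a single fixed ǫ in every round instead of the decreasing schedule ǫ_k from the slide, the names `round_down` and `rounded_ppr` are hypothetical, and every vertex is assumed to have an out-link.

```python
import math


def round_down(vec, eps):
    """Round every entry down to a multiple of eps and drop zero entries."""
    out = {}
    for w, value in vec.items():
        q = math.floor(value / eps) * eps
        if q > 0:
            out[w] = q
    return out


def rounded_ppr(out_edges, c=0.15, eps=1e-3):
    """Dynamic programming with rounding after every round.

    Assumption: a fixed rounding error eps is used in all rounds
    (the slide allows a decreasing schedule instead).
    """
    k_max = math.ceil(2 * math.log(eps) / math.log(1 - c))
    ppr = {u: {} for u in out_edges}
    for _ in range(k_max):
        nxt = {}
        for u, neighbours in out_edges.items():
            vec = {u: c}
            weight = (1 - c) / len(neighbours)
            for v in neighbours:
                for w, value in ppr[v].items():
                    vec[w] = vec.get(w, 0.0) + weight * value
            nxt[u] = round_down(vec, eps)
        ppr = nxt
    return ppr


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
approx = rounded_ppr(toy_graph, c=0.15, eps=1e-3)
```

Every stored entry is at least ǫ and the entries of a vector sum to at most 1, so each vector keeps at most 1/ǫ non-zero entries. Rounding down also means the approximation never overestimates, up to floating-point effects.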

SLIDE 9

Dynamic Programming with Sketches

Drunken Surfer

◮ Mix up memories by a random hash $h(v)$ of pages $v$:

$\mathrm{SPPR}_u(i) = \sum_{v:h(v)=i} \mathrm{PPR}_u(v)$ for $i = 1, \dots, 2e/\epsilon$

◮ Use surfers for $j = 1, \dots, \log 1/\delta$ and take the minimum vote: Count-Min Sketch [Cormode–Muthukrishnan 05]

$\widehat{\mathrm{PPR}}_u(v) = \min_{j=1,\dots,\log 1/\delta} \mathrm{SPPR}^{(j)}_u(h_j(v))$
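A Count-Min sketch of a single non-negative vector can be sketched as follows. The class below is an illustrative assumption rather than the authors' code: Python's built-in `hash` with random salts stands in for the pairwise independent hash functions the analysis requires, and the example vector is made up.

```python
import math
import random


class CountMinSketch:
    """Count-Min sketch for a non-negative vector indexed by hashable keys."""

    def __init__(self, eps, delta, seed=0):
        self.width = math.ceil(2 * math.e / eps)           # 2e/eps buckets per row
        self.depth = max(1, math.ceil(math.log(1 / delta)))  # log(1/delta) rows
        rng = random.Random(seed)
        # Salted built-in hash: a stand-in for pairwise independent hashing.
        self.salts = [rng.getrandbits(64) for _ in range(self.depth)]
        self.rows = [[0.0] * self.width for _ in range(self.depth)]

    def _bucket(self, j, key):
        return hash((self.salts[j], key)) % self.width

    def add(self, key, value):
        for j in range(self.depth):
            self.rows[j][self._bucket(j, key)] += value

    def estimate(self, key):
        # Minimum vote: collisions only add mass, so the smallest bucket is closest.
        return min(self.rows[j][self._bucket(j, key)] for j in range(self.depth))


# Hypothetical PPR vector of one page.
vector = {"a": 0.45, "b": 0.19, "c": 0.36}
cms = CountMinSketch(eps=0.1, delta=0.01)
for page, value in vector.items():
    cms.add(page, value)
```

Because all entries are non-negative, a plain sketch of this kind never underestimates a stored value; it can only overestimate when other pages collide in every row.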

SLIDE 10

Dynamic Programming with Sketches Cont’d

◮ Dynamic programming over sketches by their linearity
◮ A variant also gives linear time preprocessing
◮ $O(n \cdot 1/\epsilon \cdot \log 1/\delta)$ bits of space; optimal for value queries

$\mathrm{PPR}_u(v) - 2\epsilon/c - \epsilon \le \widehat{\mathrm{PPR}}_u(v) \le \mathrm{PPR}_u(v) + 2\epsilon/c$
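The linearity that makes dynamic programming over sketches possible can be checked on a single sketch row. The helper names `sketch_row` and `combine`, the salted built-in `hash`, and the two example vectors are assumptions for illustration.

```python
def sketch_row(vec, width, salt):
    """One sketch row: bucket i holds the sum of vec[v] over all v hashed to i."""
    row = [0.0] * width
    for v, value in vec.items():
        row[hash((salt, v)) % width] += value
    return row


def combine(a, row_x, b, row_y):
    """Linear combination a*row_x + b*row_y of two sketch rows."""
    return [a * x + b * y for x, y in zip(row_x, row_y)]


# Two hypothetical sparse vectors and their 0.3 / 0.7 mixture.
x = {"p": 0.5, "q": 0.5}
y = {"q": 0.25, "r": 0.75}
mixture = {v: 0.3 * x.get(v, 0.0) + 0.7 * y.get(v, 0.0) for v in set(x) | set(y)}

sketch_of_mixture = sketch_row(mixture, width=4, salt=1)
mixture_of_sketches = combine(0.3, sketch_row(x, 4, 1), 0.7, sketch_row(y, 4, 1))
```

Since sketching the mixture gives the same row as mixing the sketches, the recurrence for $\mathrm{PPR}_u$ can be applied to the sketches of the neighbours directly, provided all vertices share the same hash functions.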

SLIDE 11

Lower Bounds

Reduction to one-way communication complexity of bit-vector probing

Alice: bit string $y \in \{0,1\}^s$
Bob: index $i$, output: $y_i$

1. Alice creates $G(y)$
2. Alice transmits the PPR database of $G(y)$
3. Bob queries the database for $\mathrm{PPR}_{u(i)}(v(i))$
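The way such a reduction turns into a space bound can be written out in one line. This is a sketch of the standard argument, not a statement taken from the slides: the symbol $|\mathrm{DB}|$ for the database size in bits is introduced here for illustration, and a constant error probability is assumed.

```latex
% Sending the database is itself a one-way protocol for bit-vector probing,
% so it cannot use fewer bits than the best one-way protocol:
|\mathrm{DB}(G(y))| \;\ge\; \mathrm{CC}^{\rightarrow}(\text{bit-vector probing on } s \text{ bits}) \;=\; \Omega(s)
```

The bound on the database size then follows from how large $s$ can be made relative to the size of $G(y)$.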

SLIDE 12

Experiments

◮ Stanford WebBase: 80M nodes, 800M edges
◮ Measured accuracy over 1000 random nodes
◮ Effect of rounding with $k_{\max} = 35$ iterations

[Figure: maximum error versus rounding error ǫ; series: worst case bound, DP with rounding]

SLIDE 13

Quality of Approximate Rankings @ t

Precision = Recall: $|\text{approximate top-}t \cap \text{true top-}t| \,/\, t$

Kendall's Tau: $1 - 2 \cdot \#\text{inversions in approximate top-}t \,/\, \binom{t}{2}$
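Both measures are straightforward to compute; a minimal sketch follows. The function names and the four-page example are assumptions, inversions are counted against the exact scores of the pages in the approximate list (one possible reading of the slide), and Kendall's Tau needs t ≥ 2.

```python
from itertools import combinations


def precision_at_t(approx_top, true_top):
    """|approximate top-t ∩ true top-t| / t for two lists of length t."""
    return len(set(approx_top) & set(true_top)) / len(true_top)


def kendall_tau_at_t(approx_top, true_score):
    """1 - 2 * #inversions / C(t, 2) over the approximate top-t list.

    approx_top: pages in approximate order, best first (t >= 2).
    true_score: dict of exact scores; a pair is inverted when the page
    listed first has the smaller exact score.
    """
    t = len(approx_top)
    inversions = sum(
        1 for first, second in combinations(approx_top, 2)
        if true_score[first] < true_score[second]
    )
    return 1 - 2 * inversions / (t * (t - 1) / 2)


# Hypothetical exact scores and an approximate top-3 list.
true_score = {"a": 0.5, "b": 0.3, "c": 0.2, "d": 0.1}
true_top = ["a", "b", "c"]
approx_top = ["b", "a", "d"]

precision = precision_at_t(approx_top, true_top)  # 2 of 3 pages are correct
tau = kendall_tau_at_t(approx_top, true_score)    # 1 inversion among 3 pairs
```

With top lists of equal length t, precision and recall coincide, which is why the slide reports them as one number.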

SLIDE 14

Precision

[Figure: precision versus size of top list t; series: BFS, Monte Carlo, Sketch, Rounding ǫ = 2 · 10^{-5}, Rounding ǫ = 10^{-5}]

SLIDE 15

Kendall’s Tau

[Figure: Kendall's τ versus size of top list t; series: BFS, Monte Carlo, Sketch, Rounding ǫ = 2 · 10^{-5}, Rounding ǫ = 10^{-5}]

SLIDE 16

SimRank – Preliminaries and Sampling

“Two pages are similar if pointed to by similar pages” [Jeh–Widom 02]

$\mathrm{Sim}^{(k)}(v_1, v_2) = (1-c) \cdot \frac{\sum_{(u_1 v_1),(u_2 v_2)\in E} \mathrm{Sim}^{(k-1)}(u_1, u_2)}{d^-(v_1) \cdot d^-(v_2)}$ if $v_1 \ne v_2$, and $\mathrm{Sim}^{(k)}(v_1, v_2) = 1$ if $v_1 = v_2$.

$(1-c)^{k'}$-weighted path pair summation (incl. sampling [Fogaras–Rácz 05]) over

$v_1 = w_0, w_1, \dots, w_{k'-1}, w_{k'} = u$
$v_2 = w'_0, w'_1, \dots, w'_{k'-1}, w'_{k'} = u$
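The SimRank iteration can be sketched naively as below. This is an illustration only: the function name, the in-neighbour dictionary format, the convention that pairs with a vertex lacking in-links get similarity 0, and the three-vertex example are assumptions. It stores all vertex pairs, so it is usable only on tiny graphs.

```python
def simrank(in_edges, c=0.15, iterations=10):
    """Naive SimRank iteration with decay factor (1 - c).

    in_edges: dict mapping every vertex to the list of its in-neighbours.
    Assumption: a pair involving a vertex without in-links gets similarity 0.
    """
    vertices = list(in_edges)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in vertices for b in vertices}
    for _ in range(iterations):
        nxt = {}
        for a in vertices:
            for b in vertices:
                if a == b:
                    nxt[(a, b)] = 1.0
                elif not in_edges[a] or not in_edges[b]:
                    nxt[(a, b)] = 0.0
                else:
                    total = sum(sim[(u1, u2)]
                                for u1 in in_edges[a] for u2 in in_edges[b])
                    nxt[(a, b)] = ((1 - c) * total
                                   / (len(in_edges[a]) * len(in_edges[b])))
        sim = nxt
    return sim


# Hypothetical graph: u points to both v1 and v2.
in_edges = {"u": [], "v1": ["u"], "v2": ["u"]}
similarity = simrank(in_edges, c=0.15)
```

In this example v1 and v2 share their only in-neighbour, so their similarity equals the decay factor 1 - c.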

SLIDE 17

SimRank – Reduction to Personalized PageRank

Version 0 reduction: count path pairs from $v_1$ and $v_2$ that may meet several times

$\mathrm{Sim}^{(0)}_{v_1,v_2} = \sum_{k>0} (1-c)^k \sum_u \mathrm{RP}^{[k]}_{v_1}(u)\,\mathrm{RP}^{[k]}_{v_2}(u)$

Recursively define self-similarity SimRank of at least $t + 1$ inner meeting points as $\mathrm{SSim}^{(t+1)}(v)$

SLIDE 18

SimRank – Reduction to Personalized PageRank

Obtain SimRank by inclusion-exclusion of self-similarities

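$\mathrm{Sim}(v_1, v_2) = \sum_{k>0} (1-c)^k \sum_u \mathrm{RP}^{[k]}_{v_1}(u)\,\mathrm{RP}^{[k]}_{v_2}(u) \cdot \mathrm{SSim}(u)$

$\mathrm{SSim}(u) = 1 - \mathrm{SSim}^{(0)}(u) + \mathrm{SSim}^{(1)}(u) - \mathrm{SSim}^{(2)}(u) + \dots$

Converges for $1 - c < 1/2$; technicalities to carry through approximation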

SLIDE 19

Conclusion

◮ Efficient algorithms + lower bounds = space-optimal summaries for
  ◮ Fully Personalized PageRank and for
  ◮ SimRank with decay factor < 1/2
◮ At the heart of it: low space approximation of large vectors in the $\|\cdot\|_\infty$ norm
◮ Works well in practice

SLIDE 20

Thank you!

◮ http://www.ilab.sztaki.hu/websearch

SLIDE 21

Algorithms Compared

Algorithm                                                                              Running time
Dynamic Programming with ǫ = 2 · 10^{-5} and ǫ = 10^{-5} rounding to varying ǫ_k      1.5 and 2.25 days
Dynamic Programming with ǫ = 6 · 10^{-3}, δ = 4 · 10^{-3} sketches                    6 days
Monte Carlo sampling with N = 10000 samples                                            6 days
Breadth First Search heuristic                                                         3.5 days

SLIDE 22

SimRank Example

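[Figure: example graph on vertices v1, v2, u1, u2, u3]

$\sum_{k>0} \frac{1}{3^k} \sum_u \mathrm{RP}^{[k]}_{v_1}(u)\,\mathrm{RP}^{[k]}_{v_2}(u) = \frac{1}{4} \cdot \frac{1}{3}\left(1 + \frac{1}{3} + \frac{1}{3^2} + \dots\right) = \frac{1}{12} \cdot \frac{3}{2}$

$\mathrm{SSim}^{(0)}(u_i) = \frac{1}{3} + \frac{1}{3^2} + \dots = \frac{1}{2}$

$\mathrm{SSim}^{(1)}(u_i) = \frac{1}{4}$

$\mathrm{SSim}(u_i) = 1 - \frac{1}{2} + \frac{1}{4} - \frac{1}{8} + \dots = \frac{2}{3}$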
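The series in this example are all geometric, so their values can be double-checked with exact rational arithmetic. The helper `geometric_sum` and the variable names are assumptions introduced for this check.

```python
from fractions import Fraction


def geometric_sum(first, ratio):
    """Sum of first + first*ratio + first*ratio**2 + ... for |ratio| < 1."""
    return first / (1 - ratio)


third = Fraction(1, 3)

# 1/4 * 1/3 * (1 + 1/3 + 1/3^2 + ...) = 1/12 * 3/2
path_pair_sum = Fraction(1, 4) * third * geometric_sum(Fraction(1), third)

# 1/3 + 1/3^2 + ... = 1/2
ssim0 = geometric_sum(third, third)

# 1 - 1/2 + 1/4 - 1/8 + ... = 2/3
ssim = geometric_sum(Fraction(1), Fraction(-1, 2))
```

The first value simplifies to 1/8, and the other two match the values 1/2 and 2/3 stated in the example.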
