Chapter 5: Link Analysis for Authority Scoring

  1. Chapter 5: Link Analysis for Authority Scoring
     5.1 PageRank (S. Brin and L. Page 1997/1998)
     5.2 HITS (J. Kleinberg 1997/1999)
     5.3 Comparison and Extensions
     5.4 Topic-specific and Personalized PageRank
     5.5 Efficiency Issues
     5.6 Online Page Importance
     5.7 Spam-Resilient Authority Scoring
     IRDM WS 2005

  2. 5.3 Comparison and Extensions
     The literature contains a plethora of variations on PageRank and HITS. The key points are:
     • mutual reinforcement between hubs and authorities
     • re-scaling of edge weights (normalization)
     Unified notation (for a link graph with n nodes):
     - $L$: n × n link matrix, $L_{ij} = 1$ if there is an edge (i,j), 0 else
     - $d_{in}$: n × 1 vector with $(d_{in})_i = \mathrm{indegree}(i)$; $D_{in} = \mathrm{diag}(d_{in})$ is n × n
     - $d_{out}$: n × 1 vector with $(d_{out})_i = \mathrm{outdegree}(i)$; $D_{out} = \mathrm{diag}(d_{out})$ is n × n
     - $x$: n × 1 authority vector
     - $y$: n × 1 hub vector
     - Iop: operation applied to incoming links
     - Oop: operation applied to outgoing links

  3. HITS and PageRank in Unified Framework
     HITS: $x = \mathrm{Iop}(y)$, $y = \mathrm{Oop}(x)$ with $\mathrm{Iop}(y) = L^T y$, $\mathrm{Oop}(x) = L x$
     PageRank: $x = \mathrm{Iop}(x)$ with $\mathrm{Iop}(x) = P^T x$, where $P^T = L^T D_{out}^{-1}$ or $P^T = \alpha L^T D_{out}^{-1} + (1-\alpha)\,(1/n)\, e e^T$
     SALSA (PageRank-style computation with mutual reinforcement):
       $x = \mathrm{Iop}(y)$ with $\mathrm{Iop}(y) = P^T y$ and $P^T = L^T D_{out}^{-1}$
       $y = \mathrm{Oop}(x)$ with $\mathrm{Oop}(x) = Q x$ and $Q = L D_{in}^{-1}$
     Other models of link analysis can be cast into this framework, too.
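The HITS and PageRank updates above can be sketched as plain power iterations. This is a minimal illustration, not code from the slides: the 4-node graph, alpha = 0.85, the L1 normalization of the HITS vectors, and the fixed iteration count are all assumptions, and the PageRank loop assumes every node has at least one out-link.

```python
# Sketch only: the example graph, alpha, normalization and iteration count
# are illustrative assumptions, not values taken from the slides.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def normalize_l1(v):
    s = sum(abs(c) for c in v)
    return [c / s for c in v] if s else v

def hits(L, iters=100):
    """HITS: x = Iop(y) = L^T y (authorities), y = Oop(x) = L x (hubs)."""
    n = len(L)
    Lt = transpose(L)
    x, y = [1.0 / n] * n, [1.0 / n] * n
    for _ in range(iters):
        x = normalize_l1(matvec(Lt, y))
        y = normalize_l1(matvec(L, x))
    return x, y

def pagerank(L, alpha=0.85, iters=100):
    """PageRank: x = P^T x, P^T = alpha L^T Dout^-1 + (1-alpha)(1/n) e e^T.

    Assumes every node has outdegree >= 1 (no dangling nodes).
    """
    n = len(L)
    dout = [sum(row) for row in L]
    x = [1.0 / n] * n
    for _ in range(iters):
        jump = (1.0 - alpha) * sum(x) / n      # (1-alpha)(1/n) e e^T x
        new = [jump] * n
        for i in range(n):
            for j in range(n):
                if L[i][j]:
                    new[j] += alpha * x[i] / dout[i]   # alpha L^T Dout^-1 x
        x = new
    return x

# Hypothetical link matrix: edges 0->1, 0->2, 1->2, 2->0, 3->2.
L = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]

authority, hub = hits(L)
pr = pagerank(L)
```

On this toy graph node 2 collects the most in-links and comes out on top under both methods, while node 3, which has no in-links, gets HITS authority 0 but still receives the random-jump share under PageRank.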

  4. A Family of Link Analysis Methods
     General scheme: $\mathrm{Iop}(\cdot) = D_{in}^{-p} L^T D_{out}^{-q} (\cdot)$ and $\mathrm{Oop}(\cdot) = \mathrm{Iop}^T(\cdot)$
     Specific instances, applied to x and y as $x = \mathrm{Iop}(y)$, $y = \mathrm{Oop}(x)$:
     - Out-link normalized Rank (Onorm-Rank): $\mathrm{Iop}(\cdot) = L^T D_{out}^{-1/2} (\cdot)$, $\mathrm{Oop}(\cdot) = D_{out}^{-1/2} L (\cdot)$
     - In-link normalized Rank (Inorm-Rank): $\mathrm{Iop}(\cdot) = D_{in}^{-1/2} L^T (\cdot)$, $\mathrm{Oop}(\cdot) = L D_{in}^{-1/2} (\cdot)$
     - Symmetric normalized Rank (Snorm-Rank): $\mathrm{Iop}(\cdot) = D_{in}^{-1/2} L^T D_{out}^{-1/2} (\cdot)$, $\mathrm{Oop}(\cdot) = D_{out}^{-1/2} L D_{in}^{-1/2} (\cdot)$
     Some properties of Snorm-Rank:
       $x = \mathrm{Iop}(y) = \mathrm{Iop}(\mathrm{Oop}(x))$ → $\lambda x = A^{(S)} x$ with $A^{(S)} = D_{in}^{-1/2} L^T D_{out}^{-1} L D_{in}^{-1/2}$
       → Solution: $\lambda = 1$, $x = d_{in}^{1/2}$
       and analogously for hub scores: $\lambda y = H^{(S)} y$ → $\lambda = 1$, $y = d_{out}^{1/2}$
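The Snorm-Rank fixed point stated above can be checked numerically on a small graph. The sketch below builds $A^{(S)}$ entry by entry and tests that $x = d_{in}^{1/2}$ satisfies $A^{(S)} x = x$. The 3-node graph is a made-up example, and the check assumes every node has indegree and outdegree of at least 1 so that the inverse square roots exist.

```python
import math

# Hypothetical example graph with edges 0->1, 0->2, 1->2, 2->0.
# Every node has indegree >= 1 and outdegree >= 1 (an assumption of this check).
L = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
n = len(L)
dout = [sum(row) for row in L]                                # row sums
din = [sum(L[i][j] for i in range(n)) for j in range(n)]      # column sums

def snorm_authority_matrix(L, din, dout):
    """A = Din^-1/2 L^T Dout^-1 L Din^-1/2, built entry by entry."""
    n = len(L)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            s = sum(L[i][j] * L[i][k] / dout[i] for i in range(n))
            A[j][k] = s / math.sqrt(din[j] * din[k])
    return A

A = snorm_authority_matrix(L, din, dout)
x = [math.sqrt(d) for d in din]                               # candidate x = din^(1/2)
Ax = [sum(A[j][k] * x[k] for k in range(n)) for j in range(n)]
residual = max(abs(Ax[j] - x[j]) for j in range(n))           # ~0 if lambda = 1
```

The residual is zero up to rounding, so under Snorm-Rank the authority score of a page is simply the square root of its indegree.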

  5. Experimental Results
     Construct the neighborhood graph from the result of the query "star" and compare the authority-scoring ranks:

     Rank | HITS                                 | OnormRank                            | PageRank
     1    | www.starwars.com                     | www.starwars.com                     | www.starwars.com
     2    | www.lucasarts.com                    | www.lucasarts.com                    | www.lucasarts.com
     3    | www.jediknight.net                   | www.jediknight.net                   | www.paramount.com
     4    | www.sirstevesguide.com               | www.paramount.com                    | www.4starads.com/romanc
     5    | www.paramount.com                    | www.sirstevesguide.com               | www.starpages.net
     6    | www.surfthe.net/swma/                | www.surfthe.net/swma/                | www.dailystarnews.com
     7    | insurrection.startrek.com            | insurrection.startrek.com            | www.state.mn.us
     8    | www.startrek.com                     | www.fanfix.com                       | www.star-telegram.com
     9    | www.fanfix.com                       | shop.starwars.com                    | www.starbulletin.com
     10   | www.physics.usyd.edu.au/.../starwars | www.physics.usyd.edu.au/.../starwars | www.kansascity.com
     ...  |                                      |                                      | ...
     19   |                                      |                                      | www.jediknight.net
     21   |                                      |                                      | insurrection.startrek.com
     23   |                                      |                                      | www.surfthe.net/swma

     Bottom line: differences between all kinds of authority ranking methods are fairly minor!

  6. More LAR (Link Analysis Ranking) Methods
     HubAveraging (similar to ONorm for hubs):
       $a(q) = \sum_{p \in IN(q)} h(p)$, $h(p) = \frac{1}{|OUT(p)|} \sum_{q \in OUT(p)} a(q)$
     AuthorityThreshold (only the k best authorities per hub):
       $a(q) = \sum_{p \in IN(q)} h(p)$, $h(p) = \sum_{q \in OUT\text{-}k(p)} a(q)$
       with $OUT\text{-}k(p) = \operatorname{argmax-}k_q \{ a(q) \mid q \in OUT(p) \}$
     Max (AuthorityThreshold with k=1):
       $a(q) = \sum_{p \in IN(q)} h(p)$, $h(p) = a(\operatorname{argmax}_q \{ a(q) \mid q \in OUT(p) \})$
     BreadthFirstSearch (transitive citations up to depth k):
       $a(q) = \sum_{j=1}^{k} \left(\frac{1}{2}\right)^{j-1} |N^{(j)}(q)|$
       where $N^{(j)}(q)$ are the nodes that have a path to q by alternating o OUT and i IN steps with j = o + i
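The three iterative variants above differ only in the hub update, which suggests a single loop with a pluggable hub rule. The following is a rough sketch under assumptions that the slide does not state: scores start at 1, authority scores are rescaled by their maximum after each step, and the loop runs a fixed number of iterations. The graph and the names `lar_iterate`, `hub_averaging` and `authority_threshold` are hypothetical.

```python
# Sketch only: initialization, max-normalization and iteration count are
# assumptions; the slide gives just the update equations.

def hub_averaging(scores):
    """h(p) = average authority score over OUT(p)."""
    return sum(scores) / len(scores) if scores else 0.0

def authority_threshold(k):
    """h(p) = sum of the k largest authority scores in OUT(p); k=1 is Max."""
    def update(scores):
        return sum(sorted(scores, reverse=True)[:k])
    return update

def lar_iterate(out_links, hub_update, iters=50):
    nodes = sorted(out_links)
    in_links = {q: [p for p in nodes if q in out_links[p]] for q in nodes}
    h = {p: 1.0 for p in nodes}
    a = {q: 1.0 for q in nodes}
    for _ in range(iters):
        # a(q) = sum of h(p) over p in IN(q), then rescale so max(a) = 1
        a = {q: sum(h[p] for p in in_links[q]) for q in nodes}
        norm = max(a.values()) or 1.0
        a = {q: v / norm for q, v in a.items()}
        # h(p) from the authority scores of OUT(p), per the chosen variant
        h = {p: hub_update([a[q] for q in out_links[p]]) for p in nodes}
    return a, h

# Hypothetical graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, 3 -> {2}.
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}

a_avg, h_avg = lar_iterate(out_links, hub_averaging)
a_at2, h_at2 = lar_iterate(out_links, authority_threshold(2))
a_max, h_max = lar_iterate(out_links, authority_threshold(1))
```

Passing the hub rule as a function keeps the authority update identical across the variants, which mirrors how the slide presents them.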

  7. LAR as Bayesian Learning
     Postulate a probabilistic model for links p → q:
       $P[p \to q] = \frac{\exp(h_p a_q + e_p)}{1 + \exp(h_p a_q + e_p)}$
     with parameters $\theta = (h_1, ..., h_n, a_1, ..., a_n, e_1, ..., e_n)$
     Postulate a prior $f(\theta)$ for the parameters $\theta$: normal distribution $(\mu, \sigma)$ for each $e_i$, exponential distribution ($\lambda = 1$) for each $a_i$, $h_i$
     Posterior $f(\theta \mid G)$ for links $i \to j \in G$:
       $f(\theta \mid G) \sim f(G \mid \theta)\, f(\theta)$
     Theorem:
       $f(\theta \mid G) \sim \prod_{i=1..n} e^{-h_i}\, e^{-a_i}\, e^{-(e_i - \mu)^2 / (2\sigma^2)} \cdot \prod_{(i,j) \in G} e^{h_i a_j + e_i} \Big/ \prod_{i,j} \left(1 + e^{h_i a_j + e_i}\right)$
     Estimate $\hat{\theta} = E[\theta \mid G]$ using numerical algorithms.
     Alternative simpler model:
       $P[p \to q] = \frac{h_p a_q}{1 + h_p a_q}$

  8. LAR Quality Measures: Score Distances
     Consider two n-dimensional authority score vectors a and b.
     $d_1$ distance:
       $d_1(a, b) = \min_{\alpha \ge 1,\, \beta \ge 1} \sum_{i=1..n} |\alpha a_i - \beta b_i|$
     with scaling weights $\alpha$, $\beta$ to compensate for normalization distortions.
     One could alternatively use the $L_q$ norm rather than $L_1$.

  9. LAR Quality Measures: Rank Distances
     Consider the top-k of two rankings $\tau_1$ and $\tau_2$, or full permutations of 1..n.
     • overlap similarity:
       $OSim(\tau_1, \tau_2) = |top(k, \tau_1) \cap top(k, \tau_2)| / k$
     • Kendall's $\tau$ measure:
       $KDist(\tau_1, \tau_2) = \frac{|\{(u,v) \mid u, v \in U,\ u \ne v,\ \tau_1 \text{ and } \tau_2 \text{ disagree on the relative order of } u, v\}|}{|U| \cdot (|U| - 1)}$
       with $U = top(k, \tau_1) \cup top(k, \tau_2)$ (with missing items set to rank k+1).
       With ties in one ranking and an order in the other, count p with 0 ≤ p ≤ 1 → p=0: weak KDist, p=1: strict KDist
     • footrule distance (normalized):
       $Fdist(\tau_1, \tau_2) = \frac{1}{|U|} \sum_{u \in U} |\tau_1(u) - \tau_2(u)|$
     Fdist is an upper bound for KDist and Fdist/2 is a lower bound.
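The three rank distances above translate fairly directly into code. In this sketch a ranking is assumed to be a best-first list of item identifiers, items missing from a top-k list get rank k+1 as stated on the slide, and Fdist is computed exactly as the formula shown (divided by |U| only). The function names and the example lists are hypothetical.

```python
from itertools import permutations

def rank_map(top, universe, k):
    """Rank of every item in the universe; items outside the top-k get k+1."""
    pos = {u: i + 1 for i, u in enumerate(top[:k])}
    return {u: pos.get(u, k + 1) for u in universe}

def osim(t1, t2, k):
    """Overlap similarity |top(k,t1) & top(k,t2)| / k."""
    return len(set(t1[:k]) & set(t2[:k])) / k

def kdist(t1, t2, k, p=0.0):
    """Kendall distance over ordered pairs; p is the penalty for a tie in
    one ranking versus an order in the other (p=0 weak, p=1 strict)."""
    U = set(t1[:k]) | set(t2[:k])
    if len(U) < 2:
        return 0.0
    r1, r2 = rank_map(t1, U, k), rank_map(t2, U, k)
    total = 0.0
    for u, v in permutations(U, 2):
        d1, d2 = r1[u] - r1[v], r2[u] - r2[v]
        if d1 * d2 < 0:                      # opposite relative order
            total += 1.0
        elif (d1 == 0) != (d2 == 0):         # tied in exactly one ranking
            total += p
    return total / (len(U) * (len(U) - 1))

def fdist(t1, t2, k):
    """Footrule distance (1/|U|) * sum over U of |rank1(u) - rank2(u)|."""
    U = set(t1[:k]) | set(t2[:k])
    r1, r2 = rank_map(t1, U, k), rank_map(t2, U, k)
    return sum(abs(r1[u] - r2[u]) for u in U) / len(U)

# Hypothetical top-3 lists.
t1 = ["a", "b", "c"]
t2 = ["b", "a", "d"]
```

For these two lists, U has four items, the pairs (a,b) and (c,d) are ordered differently, and every item moves by exactly one rank position.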

  10. LAR Similarity
      Two LAR algorithms A and B are similar on the class $\mathcal{G}$ of graphs with n nodes under authority distance measure d if for n → ∞:
        $\max \{ d(A(G), B(G)) \mid G \in \mathcal{G} \} = o(M_n(d, L_q))$
      where $M_n(d, L_q)$ is the maximum distance under d for any two n-dimensional vectors x and y that have $L_q$ norm 1 (which is $\Theta(n^{1-1/q})$ for the $d_1$ distance and $L_q$ norm).
      Two LAR algorithms A and B are weakly (strictly) rank-similar on the class $\mathcal{G}$ of graphs with n nodes under weak (strict) rank distance r if for n → ∞:
        $\max \{ r(A(G), B(G)) \mid G \in \mathcal{G} \} = o(1)$
      Theorems: SALSA and Indegree are similar and strictly rank-similar. No other LAR algorithms are known to be similar or weakly rank-similar.

  11. LAR Stability
      For graphs G = (V, E) and G' = (V, E'), the link distance $d_{link}$ is:
        $d_{link}(G, G') = |(E \cup E') - (E \cap E')|$
      For a graph $G \in \mathcal{G}$, we define $C_k(G) = \{ G' \in \mathcal{G} \mid d_{link}(G, G') \le k \}$.
      LAR algorithm A is stable on the class $\mathcal{G}$ of graphs with n nodes under authority distance measure d if for every k > 0 and for n → ∞:
        $\max \{ d(A(G), A(G')) \mid G \in \mathcal{G},\ G' \in C_k(G) \} = o(M_n(d, L_q))$
      LAR algorithm A is weakly (strictly) rank-stable on the class $\mathcal{G}$ of graphs with n nodes under weak (strict) rank distance r if for every k > 0 and for n → ∞:
        $\max \{ r(A(G), A(G')) \mid G \in \mathcal{G},\ G' \in C_k(G) \} = o(1)$
      Theorems: Indegree is stable. No other LAR algorithm is known to be stable or weakly rank-stable (but some are under modified stability definitions). PageRank is stable with high probability for power-law graphs.
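The link distance above is the size of the symmetric difference of the two edge sets, which a short sketch can make concrete. The edge sets and the helper name `in_neighborhood` are made up for illustration; edges are assumed to be directed (source, target) pairs over the same node set.

```python
def link_distance(edges_g, edges_h):
    """d_link(G, G') = |(E u E') - (E n E')|, i.e. the symmetric difference."""
    return len(set(edges_g) ^ set(edges_h))

def in_neighborhood(edges_g, edges_h, k):
    """True if G' lies in C_k(G), i.e. d_link(G, G') <= k."""
    return link_distance(edges_g, edges_h) <= k

# Hypothetical edge sets over the same node set {0, 1, 2}.
E = {(0, 1), (1, 2)}
E_prime = {(0, 1), (2, 1), (0, 2)}
```

Here G' drops the edge (1,2) and adds (2,1) and (0,2), so the two graphs are at link distance 3; reversing an edge counts as two changes.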

  12. LAR Experimental Comparison: Queries
      Experimental setup:
      • 34 queries
      • root sets of 200 pages each, obtained from Google
      • base sets computed using Google with the first 50 predecessors per page
      Source: Borodin et al., ACM TOIT 2005

  13. LAR Experimental Comparison: Precision@10
      Source: Borodin et al., ACM TOIT 2005

  14. LAR Experimental Comparison: Key Authorities
      Is there a winner at all?
      Source: Borodin et al., ACM TOIT 2005

  15. LAR Results for Query "Classical Guitar" (1)
      Source: Borodin et al., ACM TOIT 2005

  16. LAR Results for Query "Classical Guitar" (2)
      Source: Borodin et al., ACM TOIT 2005

  17. LAR Results for Query "Classical Guitar" (3)
      Source: Borodin et al., ACM TOIT 2005

  18. 5.4 Topic-specific PageRank [Haveliwala 2003]
      Given: a (small) set of topics $c_k$, each with a set $T_k$ of authorities (taken from a directory such as ODP (www.dmoz.org) or a bookmark collection)
      Key idea: change the PageRank random walk by biasing the random-jump probabilities to the topic authorities $T_k$:
        $r_k = \varepsilon\, p_k + (1 - \varepsilon)\, A' r_k$
      with $A'_{ij} = 1/\mathrm{outdegree}(j)$ for $(j,i) \in E$, 0 else,
      and $(p_k)_j = 1/|T_k|$ for $j \in T_k$, 0 else (instead of $p_j = 1/n$)
      Approach:
      1) Precompute the topic-specific PageRank vectors $r_k$
      2) Classify the user query q (incl. query context) w.r.t. each topic $c_k$ → probability $w_k := P[c_k \mid q]$
      3) The total authority score of doc d is $\sum_k w_k\, r_k(d)$
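The three-step approach above can be sketched with a biased power iteration followed by a weighted sum. Everything concrete here is an assumption made for illustration: the graph, the two topics "A" and "B" with their authority sets, eps = 0.15, the iteration count, and the query-topic probabilities, which a real system would obtain from a query classifier. The iteration also assumes every node has at least one out-link.

```python
def topic_pagerank(out_links, topic_nodes, eps=0.15, iters=100):
    """r_k = eps * p_k + (1 - eps) * A' r_k, with jumps restricted to T_k.

    Assumes every node has at least one out-link (no dangling nodes).
    """
    nodes = sorted(out_links)
    p = {j: (1.0 / len(topic_nodes) if j in topic_nodes else 0.0) for j in nodes}
    r = dict(p)
    for _ in range(iters):
        new = {i: eps * p[i] for i in nodes}               # biased random jump
        for j in nodes:
            share = (1.0 - eps) * r[j] / len(out_links[j])  # A'_ij = 1/outdegree(j)
            for i in out_links[j]:
                new[i] += share
        r = new
    return r

def combined_score(topic_vectors, topic_weights, d):
    """Total authority score of doc d: sum over k of w_k * r_k(d)."""
    return sum(w * topic_vectors[k][d] for k, w in topic_weights.items())

# Hypothetical graph and topic authority sets T_k.
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
topics = {"A": {1}, "B": {3}}

# Step 1: precompute one vector per topic.
vectors = {k: topic_pagerank(out_links, t) for k, t in topics.items()}
# Step 2: assumed classifier output w_k = P[c_k | q] for some query q.
weights = {"A": 0.7, "B": 0.3}
# Step 3: combine per document.
scores = {d: combined_score(vectors, weights, d) for d in out_links}
```

Node 3 has no in-links, so it only scores through the random jump of topic "B", which shows how the choice of $T_k$ shifts authority toward the topic's pages.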
