

  1. INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 16: Other Link Analysis. Paul Ginsparg, Cornell University, Ithaca, NY. 28 Oct 2010

  2. Administrativa. Assignment 3 due Sun 7 Nov. Python 2.6 and Java 1.6 are now available in /usr/local on RHEL 5 general-purpose, 32-bit machines: Linus, Snoopy, Schroeder, Sally. Let us know if you need any of the 64-bit machines, as listed at http://www.csuglab.cornell.edu/Info/linux-info.html .

  3. Administrativa. Topics for the rest of the course (5 weeks, 10 lectures), MRS chapters 19, 20, 12, 13, 14, 16, 17: web search, language models, text classification, vector space classification, clustering, plus MapReduce and the Google File System. The final exam will cover the entire course (again, questions likely prespecified).

  4. Overview: 1. Recap; 2. HITS: Hubs & Authorities; 3. Discussion 4.

  5. Outline: 1. Recap; 2. HITS: Hubs & Authorities; 3. Discussion 4.

  6. Model behind PageRank: random walk. Imagine a web surfer doing a random walk on the web: start at a random page; at each step, go out of the current page along one of the links on that page, equiprobably. In the steady state, each page has a long-term visit rate. This long-term visit rate is the page's PageRank. PageRank = long-term visit rate = steady-state probability.
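A quick way to see the long-term visit rate in action is to simulate the surfer directly. Below is a minimal Python sketch (an illustration, not from the lecture); the toy adjacency list and the teleport probability alpha = 0.15 are assumptions:

    import random
    from collections import Counter

    def simulate_surfer(adj, steps=100000, alpha=0.15, seed=0):
        # Estimate long-term visit rates by simulating a random surfer.
        # adj maps page -> list of pages it links to. With probability
        # alpha the surfer teleports to a random page; otherwise it
        # follows a random outlink (always teleporting from dead ends).
        rng = random.Random(seed)
        pages = list(adj)
        page = rng.choice(pages)
        visits = Counter()
        for _ in range(steps):
            visits[page] += 1
            if rng.random() < alpha or not adj[page]:
                page = rng.choice(pages)        # teleport step
            else:
                page = rng.choice(adj[page])    # follow a random outlink
        return {p: visits[p] / steps for p in pages}

    # Toy three-page web; the visit rates approximate the PageRanks.
    print(simulate_surfer({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))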

  7. One way of computing the PageRank $\vec{r}$. Recall: regardless of where we start, we eventually reach the steady state $\vec{r}$. Start with any distribution $\vec{x}$, e.g., the uniform distribution. After one step, we're at $\vec{x}P$; after two steps, at $\vec{x}P^2$; after $k$ steps, at $\vec{x}P^k$. Algorithm: multiply $\vec{x}$ by increasing powers of $P$ until convergence. This is called the power method.
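In code, the power method is only a few lines. A minimal numpy sketch (an illustration, not from the lecture), assuming $P$ is supplied as a row-stochastic matrix:

    import numpy as np

    def power_method(P, tol=1e-10, max_iter=1000):
        # Left power iteration: repeatedly form xP until x stops changing.
        # P must be row-stochastic; returns the steady state r with rP = r.
        n = P.shape[0]
        x = np.full(n, 1.0 / n)      # start from the uniform distribution
        for _ in range(max_iter):
            x_next = x @ P           # one step: multiply by P
            if np.abs(x_next - x).sum() < tol:
                break
            x = x_next
        return x_next

This is applied to a concrete transition matrix on slide 14 below.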

  8. Random Walker Redux. The PageRank $r_j$ of page $j$ is determined self-consistently by the equation $$r_j = (1-\alpha) \sum_{i \mid i \to j} \frac{r_i}{d_i} + \frac{\alpha}{n},$$ where $\alpha$ is a number between 0 and 1 (originally taken to be 0.15), the sum on $i$ is over pages $i$ pointing to $j$, and $d_i$ is the outgoing degree of page $i$. Pages with high PageRank $r_i$ that have low $d_i$, i.e., that don't point to too many other pages, convey the most PageRank to page $j$. (Note: the minimum PageRank is given by $\alpha/n$.) With probability $1-\alpha$ one goes at random to one of the pages $j$ pointed to by page $i$, and with probability $\alpha$ one instead jumps at random to any page anywhere on the web.
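As a worked example (not from the lecture), take a two-page web with links $1 \to 2$, $2 \to 1$, and $2 \to 2$, so $d_1 = 1$ and $d_2 = 2$, and set $\alpha = 0.15$. The self-consistent equations are $$r_1 = (1-\alpha)\frac{r_2}{2} + \frac{\alpha}{2} = 0.425\, r_2 + 0.075, \qquad r_2 = (1-\alpha)\left(r_1 + \frac{r_2}{2}\right) + \frac{\alpha}{2}.$$ With the normalization $r_1 + r_2 = 1$, the first equation gives $1.425\, r_1 = 0.5$, so $r_1 \approx 0.35$ and $r_2 \approx 0.65$: page 2, with two in-links, accumulates the larger rank.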

  9. Eigenvector problem. The incidence matrix $A$ is defined by $A_{ij} = 1$ if $i$ points to $j$, and $A_{ij} = 0$ otherwise. A matrix $P$ giving the transition probability from page $i$ to page $j$ can be constructed in terms of the incidence matrix $A$ as $$P_{ij} = (1-\alpha)\frac{A_{ij}}{d_i} + \frac{\alpha}{n} O_{ij},$$ where $n$ is the total number of pages, $d_i$ is the outdegree of node $i$, and $O_{ij} = 1\ (\forall i,j)$ is the matrix of all ones. The matrix eigenvector relation $\vec{r} P = \vec{r}$, or $\vec{r} = P^T \vec{r}$, is equivalent to the equation on the previous slide (with $\vec{r}$ normalized as a probability, so that $\sum_i r_i O_{ij} = \sum_i r_i = 1$).
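A minimal numpy sketch of this construction (an illustration, not course code), building $P$ from the incidence matrix $A$:

    import numpy as np

    def transition_matrix(A, alpha=0.15):
        # P_ij = (1 - alpha) A_ij / d_i + (alpha / n) O_ij
        n = A.shape[0]
        d = A.sum(axis=1)                   # outdegrees d_i
        P = np.full((n, n), alpha / n)      # teleportation term (alpha/n) O
        rows = d > 0                        # the formula assumes d_i > 0
        P[rows] += (1 - alpha) * A[rows] / d[rows, None]
        return P

Each row of the result sums to $(1-\alpha) + \alpha = 1$ (for pages with at least one outlink), which is the normalization used on the next slide.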

  10. Eigenvector problem, cont'd. By the Perron-Frobenius theorem, the matrix $P$ has a unique principal eigenvector, corresponding to its largest eigenvalue, and its components are all positive. (Since $\sum_j A_{ij}/d_i = 1$, we find $\sum_j P_{ij} = (1-\alpha) + \alpha = 1$, so $P$ is normalized such that its principal eigenvalue is 1.) Thus the eigenvalue equation always has a solution. To calculate the rank of all the pages, the crawler visits as many pages as possible and calculates the link structure of the web graph. Calculating eigenvectors of enormous matrices can be painful, but in the case of the principal eigenvector there's a simple method.

  11. Power Method, cont'd. Expanding $\vec{w}$ in eigenvectors $\vec{v}^{(i)}$ of $M$ with eigenvalues $\lambda_i$, $$M^n \vec{w} = \sum_{i=1}^{N} \alpha_i \lambda_i^n\, \vec{v}^{(i)}.$$ For large $n$, the term with the largest eigenvalue dominates: $$M^n \vec{w} = \lambda_1^n \left( \alpha_1 \vec{v}^{(1)} + \sum_{i=2}^{N} \alpha_i \left(\frac{\lambda_i}{\lambda_1}\right)^{\!n} \vec{v}^{(i)} \right) \approx \lambda_1^n\, \alpha_1 \vec{v}^{(1)},$$ since $\lambda_i/\lambda_1 < 1\ \forall i \neq 1$. Thus the principal eigenvector can be determined simply by applying the matrix $M$ sufficiently many times to any vector that has non-zero dot product with the principal eigenvector (i.e., non-vanishing $\alpha_1$ in the above) $\Rightarrow$ this effectively projects onto the eigenvector of interest.
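The geometric suppression of the subdominant terms is easy to check numerically. A small sketch (an illustration, not from the lecture), using a symmetric 2x2 matrix with eigenvalues 3 and 1, so $\lambda_2/\lambda_1 = 1/3$:

    import numpy as np

    M = np.array([[2.0, 1.0],
                  [1.0, 2.0]])                # eigenvalues 3 and 1
    v1 = np.array([1.0, 1.0]) / np.sqrt(2)    # principal eigenvector
    w = np.array([1.0, 0.0])                  # any vector with alpha_1 != 0

    for n in range(1, 8):
        w = M @ w
        w = w / np.linalg.norm(w)             # rescale; only direction matters
        print(n, np.linalg.norm(w - v1))      # error shrinks roughly like (1/3)**n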

  12. Web graph example. [Figure: a seven-page web graph $d_0, \dots, d_6$; edges carry anchor words such as gm, car, benz, ford, honda, leopard, jaguar, jag, tiger, cheetah, lion, cat, speed.]

  13. Transition matrix with teleporting, $\alpha = 0.14$:

           d0    d1    d2    d3    d4    d5    d6
      d0  0.02  0.02  0.88  0.02  0.02  0.02  0.02
      d1  0.02  0.45  0.45  0.02  0.02  0.02  0.02
      d2  0.31  0.02  0.31  0.31  0.02  0.02  0.02
      d3  0.02  0.02  0.02  0.45  0.45  0.02  0.02
      d4  0.02  0.02  0.02  0.02  0.02  0.02  0.88
      d5  0.02  0.02  0.02  0.02  0.02  0.45  0.45
      d6  0.02  0.02  0.02  0.31  0.31  0.02  0.31

  14. Power method vectors $\vec{x}P^k$:

           x    xP   xP2  xP3  xP4  xP5  xP6  xP7  xP8  xP9  xP10 xP11 xP12 xP13
      d0  0.14 0.06 0.09 0.07 0.07 0.06 0.06 0.06 0.06 0.05 0.05 0.05 0.05 0.05
      d1  0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
      d2  0.14 0.25 0.18 0.17 0.15 0.14 0.13 0.12 0.12 0.12 0.12 0.11 0.11 0.11
      d3  0.14 0.16 0.23 0.24 0.24 0.24 0.24 0.25 0.25 0.25 0.25 0.25 0.25 0.25
      d4  0.14 0.12 0.16 0.19 0.19 0.20 0.21 0.21 0.21 0.21 0.21 0.21 0.21 0.21
      d5  0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
      d6  0.14 0.25 0.23 0.25 0.27 0.28 0.29 0.29 0.30 0.30 0.30 0.30 0.31 0.31
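Running the power method (slide 7) on the transition matrix of slide 13 reproduces this table; a self-contained sketch using the matrix exactly as printed there:

    import numpy as np

    # Transition matrix with teleporting, alpha = 0.14, from slide 13.
    P = np.array([
        [0.02, 0.02, 0.88, 0.02, 0.02, 0.02, 0.02],  # d0
        [0.02, 0.45, 0.45, 0.02, 0.02, 0.02, 0.02],  # d1
        [0.31, 0.02, 0.31, 0.31, 0.02, 0.02, 0.02],  # d2
        [0.02, 0.02, 0.02, 0.45, 0.45, 0.02, 0.02],  # d3
        [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.88],  # d4
        [0.02, 0.02, 0.02, 0.02, 0.02, 0.45, 0.45],  # d5
        [0.02, 0.02, 0.02, 0.31, 0.31, 0.02, 0.31],  # d6
    ])

    x = np.full(7, 1 / 7)        # uniform start: the x column above
    for k in range(13):
        x = x @ P                # one multiplication by P per column
        print(k + 1, np.round(x, 2))
    # final iterate ~ [0.05 0.04 0.11 0.25 0.21 0.04 0.31], matching xP^13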

  15. PageRank summary. Preprocessing: given the graph of links, build the matrix $P$; apply teleportation; from the modified matrix, compute $\vec{r}$; $r_i$ is the PageRank of page $i$. Query processing: retrieve pages satisfying the query; rank them by their PageRank; return the reranked list to the user.
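The query-processing step is then just a sort by the precomputed scores; a minimal sketch (names hypothetical):

    def rerank(matching_pages, pagerank):
        # Order the pages satisfying the query by their PageRank r_i.
        return sorted(matching_pages, key=lambda p: pagerank.get(p, 0.0),
                      reverse=True)

In practice PageRank is combined with query-dependent relevance scores rather than used as the sole ranking criterion.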

  16. Outline: 1. Recap; 2. HITS: Hubs & Authorities; 3. Discussion 4.

  17. HITS – Hyperlink-Induced Topic Search. Premise: there are two different types of relevance on the web. Relevance type 1: hubs. A hub page is a good list of links to pages answering the information need (e.g., Bob's list of recommended hotels in London). Relevance type 2: authorities. An authority page is a direct answer to the information need (e.g., the home page of the Four Seasons Hotel London). By definition: links to authority pages occur repeatedly on hub pages. Most approaches to search (including PageRank ranking) don't make the distinction between these two very different types of relevance.

  18. Hubs and authorities: Definition A good hub page for a topic links to many authority pages for that topic. A good authority page for a topic is linked to by many hub pages for that topic. Circular definition – we will turn this into an iterative computation. 18 / 38

  19. Example for hubs and authorities. Hubs: www.bestfares.com, www.airlinesquality.com, blogs.usatoday.com/sky, aviationblog.dallasnews.com. Authorities: www.aa.com, www.delta.com, www.united.com.

  20. HITS = hypertext-induced topic search, I. [Figure: bipartite graph with hub pages on the left linking to authority pages on the right.]

  21. HITS = hypertext-induced topic search, II. Every page has both an authority weight $a_i$ and a hub weight $h_i$, defined to satisfy $$a_j = \sum_{i \mid i \to j} h_i, \qquad h_i = \sum_{j \mid i \to j} a_j,$$ i.e., the authority weight of a site $j$ is given by the sum of the hub weights of the sites $i$ that point to it, and the hub weight of a site $i$ is given by the sum of the authority weights of the sites $j$ to which it points. In terms of the incidence matrix $A$, these can be written $$\vec{a} = A^T \vec{h}, \qquad \vec{h} = A\, \vec{a}.$$

  22. HITS = hypertext-induced topic search, III. Start with some trial forms of the hub and authority weights, $h_i^{(0)}$ and $a_i^{(0)}$, and iterate the above equations: $$a_j^{(1)} = \sum_{i \mid i \to j} h_i^{(0)}, \qquad h_i^{(1)} = \sum_{j \mid i \to j} a_j^{(0)},$$ to provide more refined guesses $h_i^{(1)}$ and $a_i^{(1)}$, and then continue. In matrix form, the $j$th such iteration can be written $$\vec{a}^{(j)} = A^T \vec{h}^{(j-1)}, \qquad \vec{h}^{(j)} = A\, \vec{a}^{(j-1)}.$$ Note that the result of two iterations can be written $$\vec{a}^{(j)} = A^T A\, \vec{a}^{(j-2)}, \qquad \vec{h}^{(j)} = A A^T\, \vec{h}^{(j-2)},$$ so the result of $2n$ such iterations is $$\vec{a}^{(2n)} = (A^T A)^n\, \vec{a}^{(0)}, \qquad \vec{h}^{(2n)} = (A A^T)^n\, \vec{h}^{(0)}.$$

  23. HITS = hypertext-induced topic search, IV. The matrices $A^T A$ and $A A^T$ are symmetric with non-negative entries, so for a suitably chosen subset of nodes such that they're irreducible, the Perron-Frobenius theorem will apply. In that case, the above iterative procedure will converge, and we see that the authority weights $a_i$ will be given by the components of the principal eigenvector of $A^T A$, and the hub weights $h_i$ by the components of the principal eigenvector of $A A^T$.

  24. Iterative update. For all $d$: $h(d) = \sum_{d \mapsto y} a(y)$. For all $d$: $a(d) = \sum_{y \mapsto d} h(y)$. Iterate these two steps until convergence. [Figure: a page $d$ with outlinks to $y_1, y_2, y_3$ illustrating the hub update, and inlinks from $y_1, y_2, y_3$ illustrating the authority update.]

  25. Details: scaling. To prevent the $a(\cdot)$ and $h(\cdot)$ values from getting too big, we can scale down after each iteration. The scaling factor doesn't really matter: we care about the relative (as opposed to absolute) values of the scores. In most cases, the algorithm converges after a few iterations.
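Putting slides 21–25 together, a compact numpy sketch of the HITS iteration with per-step rescaling (an illustration, not course code; the tiny example incidence matrix is an assumption):

    import numpy as np

    def hits(A, iters=50):
        # HITS iteration: a = A^T h, then h = A a, rescaling each step.
        # A[i, j] = 1 iff page i links to page j.
        n = A.shape[0]
        h = np.ones(n)
        for _ in range(iters):
            a = A.T @ h
            h = A @ a
            a /= a.sum()     # scaling factor is arbitrary; only ratios matter
            h /= h.sum()
        return h, a

    # Two hub pages (0, 1) each pointing at two authorities (2, 3).
    A = np.array([[0.0, 0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0],
                  [0.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0]])
    h, a = hits(A)
    print(np.round(h, 2))    # hub weights concentrate on pages 0 and 1
    print(np.round(a, 2))    # authority weights concentrate on pages 2 and 3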
