

  1. INFO 4300 / CS4300 Information Retrieval
     Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
     IR 15: Citation Analysis and Link Analysis
     Paul Ginsparg, Cornell University, Ithaca, NY
     22 Oct 2009
     1 / 49

  2. Administrativa
     - Assignment 3 now available: due Sun 8 Nov
     - Discussion 5 (27 Oct 2009): Read and be prepared to discuss
       Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine."
       Seventh International World Wide Web Conference. Brisbane, Australia, 1998.
       http://www7.scu.edu.au/1921/com1921.htm
       (Note: second copy [with photos of authors] available at http://www-db.stanford.edu/~backrub/google.html)

  3. Overview
     1 Recap
     2 Citation analysis
     3 PageRank

  4. Outline
     1 Recap
     2 Citation analysis
     3 PageRank

  5. Brief history and historical antecedents
     - 1991-1994: growth, mainly in Europe, plus move to client-server systems
     - The PageRank methodology stems from a long history of citation analysis, where a "link" is some signal of recommendation (or popularity).
     - Based on a property of the graph (i.e., query-independent), hence efficient for serving a large volume of queries.
     - The underlying Markov process is also not new, but was applied in a particularly powerful way (⇒ unexpected power of simple algorithms and ample computing power applied to massive datasets)

  6. The web as a directed graph
     [Figure: page d1 links to page d2 via a hyperlink carrying anchor text]
     - Assumption 1: A hyperlink is a quality signal. The hyperlink d1 → d2 indicates that d1's author deems d2 relevant.
     - Assumption 2: The anchor text describes the content of d2. We use "anchor text" somewhat loosely here for the text surrounding the hyperlink.
       Example: "You can find cheap cars <a href=http://...>here</a>."
       Anchor text: "You can find cheap cars here"
     - Easy to find cases where these two assumptions are violated. But they hold for most hyperlinks.

  7. Google bombs
     - A Google bomb is a search with "bad" results due to maliciously manipulated anchor text.
     - Google introduced a new weighting function in January 2007 that fixed many Google bombs.
     - Still some remnants: [dangerous cult] on Google, Bing, Yahoo (coordinated link creation by those who dislike the Church of Scientology)
     - Defused Google bombs: [miserable failure], [antichrist], [evil empire]
     - e.g., http://www.nytimes.com/2004/01/22/technology/circuits/22goog.html (or copy at http://www.cis.cornell.edu/Courses/cis295/2004fa/gbomb.html)

  8. Outline
     1 Recap
     2 Citation analysis
     3 PageRank

  9. Origins of PageRank: Citation analysis (1)
     - Citation analysis: analysis of citations in the scientific literature
     - Example citation: "Miller (2001) has shown that physical activity alters the metabolism of estrogens."
     - We can view "Miller (2001)" as a hyperlink linking two scientific articles.
     - One application of these "hyperlinks" in the scientific literature: measure the similarity of two articles by the overlap of other articles citing them: cocitation similarity (C → A, C → B)
     - Cocitation similarity on the web: Google's "find pages like this" or "Similar" feature
     - Also coreference similarity: (A → C, B → C)
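Both similarity notions can be computed directly from a citation graph. A minimal sketch in Python; the citation graph and document names (A-E) are made up for illustration:

```python
# Toy citation graph: each key's value is the set of documents it cites.
# C and D cite both A and B (cocitation evidence for A ~ B);
# A and B both cite E (coreference evidence for A ~ B).
cites = {
    "C": {"A", "B"},
    "D": {"A", "B", "E"},
    "A": {"E"},
    "B": {"E"},
}

def cocitation(x, y):
    # Number of documents whose reference lists contain both x and y.
    return sum(1 for refs in cites.values() if x in refs and y in refs)

def coreference(x, y):
    # Size of the overlap of x's and y's own reference lists.
    return len(cites.get(x, set()) & cites.get(y, set()))

print(cocitation("A", "B"))   # 2 (cited together by C and D)
print(coreference("A", "B"))  # 1 (both cite E)
```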

  10. Origins of PageRank: Citation analysis (2)
     - Another application: citation frequency can be used to measure the impact of an article.
       Each article gets one vote. Not a very accurate measure.
     - On the web: citation frequency = inlink count
       A high inlink count does not necessarily mean high quality...
       ...mainly because of link spam.
     - Better measure: weighted citation frequency / citation rank
       An article's vote is weighted according to its citation impact.
       Circular? No: can be formalized in a well-defined way.

  11. Origins of PageRank: Citation analysis (3)
     - Better measure: weighted citation frequency / citation rank
     - Basically PageRank, invented in the context of citation by G. Pinski and F. Narin, "Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics," Information Processing and Management 12 (1976)
     - Nancy L. Geller, "On the citation influence methodology of Pinski and Narin," Information Processing and Management 14(2): 93-95 (1978)
     - Citation analysis plays a (too?) prominent role: the budget and salary of many researchers are determined by this measure of the impact of their publications...

  12. Origins of PageRank: Summary
     - We can use the same formal representation for
       - citations in the scientific literature
       - hyperlinks on the web
     - Appropriately weighted citation frequency is a measure of quality, for web pages and for scientific publications alike.

  13. Link-based ranking for web search
     - Simple version of using links for ranking on the web:
       - First retrieve all pages satisfying the query (say, venture capital)
       - Order these by the number of inlinks
     - Simple link popularity (= number of inlinks) is easy to spam. Why?
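The two-step ranking above can be sketched in a few lines; the link graph and the set of matching pages are invented for illustration:

```python
# Simple link popularity: among pages matching the query, sort by inlink count.
links = [  # (source, target) hyperlinks in a made-up web graph
    ("p1", "p3"), ("p2", "p3"), ("p4", "p3"),
    ("p1", "p2"), ("p3", "p2"),
    ("p5", "p4"),
]
matches = ["p2", "p3", "p4"]  # pages satisfying the query

# Count inlinks for the matching pages only.
inlinks = {p: 0 for p in matches}
for _, tgt in links:
    if tgt in inlinks:
        inlinks[tgt] += 1

ranked = sorted(matches, key=lambda p: inlinks[p], reverse=True)
print(ranked)  # ['p3', 'p2', 'p4']
```

The spam problem is visible here: a page owner can mint arbitrarily many source pages pointing at a target, inflating its inlink count at no cost.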

  14. Outline
     1 Recap
     2 Citation analysis
     3 PageRank

  15. Model behind PageRank: Random walk
     - Imagine a web surfer doing a random walk on the web:
       - Start at a random page
       - At each step, go out of the current page along one of the links on that page, equiprobably
     - In the steady state, each page has a long-term visit rate.
     - This long-term visit rate is the page's PageRank.
     - PageRank = long-term visit rate = steady-state probability
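The random-surfer model can be checked empirically: simulate the walk and count visit frequencies. A sketch on a small made-up graph with no dead ends (so the plain walk, without teleporting, is well behaved):

```python
import random

# Monte-Carlo estimate of long-term visit rates on a tiny invented graph.
out = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

random.seed(0)
steps = 100_000
visits = {p: 0 for p in out}
page = "a"                           # the start page does not matter in the long run
for _ in range(steps):
    visits[page] += 1
    page = random.choice(out[page])  # follow an outlink equiprobably

rates = {p: n / steps for p, n in visits.items()}
# For this chain the steady state is (0.4, 0.2, 0.4) for (a, b, c);
# the empirical rates approach it as the walk gets longer.
```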

  16. Formalization of random walk: Markov chains
     - A Markov chain consists of N states, plus an N × N transition probability matrix P.
     - state = page
     - At each step, we are on exactly one of the pages.
     - For 1 ≤ i, j ≤ N, the matrix entry P_ij tells us the probability of j being the next page, given we are currently on page i.
     [Figure: a transition from state d_i to state d_j, labeled with probability P_ij]

  17. Markov chains
     - Clearly, for all i: sum_{j=1}^{N} P_ij = 1
     - Markov chains are abstractions of random walks.

  18. Example web graph
     [Figure: a web graph on seven pages d0-d6; edges carry anchor words such as car, benz, ford, gm, honda, jaguar, jag, leopard, tiger, cheetah, lion, cat, speed]

  19. Link matrix for example
          d0  d1  d2  d3  d4  d5  d6
      d0   0   0   1   0   0   0   0
      d1   0   1   1   0   0   0   0
      d2   1   0   1   1   0   0   0
      d3   0   0   0   1   1   0   0
      d4   0   0   0   0   0   0   1
      d5   0   0   0   0   0   1   1
      d6   0   0   0   1   1   0   1

  20. Transition probability matrix P for example
           d0    d1    d2    d3    d4    d5    d6
      d0  0.00  0.00  1.00  0.00  0.00  0.00  0.00
      d1  0.00  0.50  0.50  0.00  0.00  0.00  0.00
      d2  0.33  0.00  0.33  0.33  0.00  0.00  0.00
      d3  0.00  0.00  0.00  0.50  0.50  0.00  0.00
      d4  0.00  0.00  0.00  0.00  0.00  0.00  1.00
      d5  0.00  0.00  0.00  0.00  0.00  0.50  0.50
      d6  0.00  0.00  0.00  0.33  0.33  0.00  0.33
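P is obtained from the link matrix of the previous slide by row normalization: each row is divided by the page's out-degree. A minimal sketch in Python:

```python
# Link (adjacency) matrix for the example graph d0..d6.
A = [
    [0, 0, 1, 0, 0, 0, 0],  # d0 -> d2
    [0, 1, 1, 0, 0, 0, 0],  # d1 -> d1, d2
    [1, 0, 1, 1, 0, 0, 0],  # d2 -> d0, d2, d3
    [0, 0, 0, 1, 1, 0, 0],  # d3 -> d3, d4
    [0, 0, 0, 0, 0, 0, 1],  # d4 -> d6
    [0, 0, 0, 0, 0, 1, 1],  # d5 -> d5, d6
    [0, 0, 0, 1, 1, 0, 1],  # d6 -> d3, d4, d6
]

# Divide each row by its out-degree to get transition probabilities.
P = []
for row in A:
    out_degree = sum(row)
    P.append([a / out_degree for a in row])

# Every row of P is a probability distribution, as slide 17 requires.
for row in P:
    assert abs(sum(row) - 1.0) < 1e-9
```

(The 0.33 entries in the slide are the rounded display of 1/3.)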

  21. Long-term visit rate
     - Recall: PageRank = long-term visit rate
     - The long-term visit rate of page d is the probability that a web surfer is at page d at a given point in time.
     - Next: what properties must the web graph have for the long-term visit rate to be well defined?
     - The web graph must correspond to an ergodic Markov chain.
     - First, a special case: the web graph must not contain dead ends.

  22. Dead ends
     [Figure: a page with no outgoing links, a dead end]
     - The web is full of dead ends.
     - A random walk can get stuck in dead ends.
     - If there are dead ends, long-term visit rates are not well defined (or nonsensical).

  23. Teleporting
     - At a dead end, jump to a random web page.
     - At any non-dead end, with probability 10%, jump to a random web page.
     - With the remaining probability (90%), go out on a random hyperlink.
       For example, if the page has 4 outgoing links: randomly choose one with probability (1 - 0.10)/4 = 0.225...
       ...and choose one of the N pages randomly with probability 0.1/N.
     - 10% is a parameter.
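One common way to realize these rules is to fold teleporting into the transition matrix itself: with teleport probability alpha = 0.10, a non-dead-end row becomes (1 - alpha) * (link row / out-degree) + alpha/N, and a dead-end row becomes the uniform distribution 1/N. A sketch on a made-up three-page graph:

```python
alpha = 0.10           # teleport probability (the "10% parameter")
A = [
    [0, 1, 1],         # page 0 links to pages 1 and 2
    [0, 0, 1],         # page 1 links to page 2
    [0, 0, 0],         # page 2 is a dead end
]
N = len(A)

P = []
for row in A:
    deg = sum(row)
    if deg == 0:
        # Dead end: teleport uniformly to any of the N pages.
        P.append([1 / N] * N)
    else:
        # Follow a random outlink with prob. (1 - alpha), teleport with prob. alpha.
        P.append([(1 - alpha) * a / deg + alpha / N for a in row])

# Every row is now a probability distribution, and no row is all zeros:
# the adjusted chain cannot get stuck.
for row in P:
    assert abs(sum(row) - 1.0) < 1e-9
```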

  24. Result of teleporting
     - With teleporting, we cannot get stuck in a dead end.
     - But even without dead ends, a graph may not have well-defined long-term visit rates.
     - More generally, we require that the Markov chain be ergodic.

  25. Ergodic Markov chains
     - A Markov chain is ergodic iff it is irreducible and aperiodic.
     - Irreducibility. Roughly: there is a path from any page to any other page.
     - Aperiodicity. Roughly: the pages cannot be partitioned such that the random walker visits the partitions sequentially.
     [Figure: a non-ergodic Markov chain: two states that alternate deterministically, each transition having probability 1.0]

  26. Ergodic Markov chains
     - Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state.
     - This is the steady-state probability distribution.
     - Over a long time period, we visit each state in proportion to this rate.
     - It doesn't matter where we start.
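The start-independence claim can be checked numerically: iterate x <- xP from two different start states of a small ergodic chain (invented here; every entry is positive, so the chain is irreducible and aperiodic) and observe that both converge to the same distribution:

```python
# A made-up 3-state ergodic transition matrix (all entries > 0).
P = [
    [0.1, 0.6, 0.3],
    [0.4, 0.1, 0.5],
    [0.5, 0.4, 0.1],
]

def step(x, P):
    # One step of the chain: (xP)_j = sum_i x_i * P_ij.
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

x = [1.0, 0.0, 0.0]   # start in state 0 with certainty
y = [0.0, 0.0, 1.0]   # start in state 2 with certainty
for _ in range(100):
    x, y = step(x, P), step(y, P)

# Both walks have converged to the same steady-state distribution.
assert all(abs(a - b) < 1e-9 for a, b in zip(x, y))
```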

  27. Formalization of "visit": Probability vector
     - A probability (row) vector x = (x_1, ..., x_N) tells us where the random walk is at any point.
     - Example: ( 0 0 0 ... 1 ... 0 0 0 ), with the 1 in position i: the walk is on page i with certainty.
     - More generally: the random walk is on page i with probability x_i.
     - Example: ( 0.05 0.01 0.0 ... 0.2 ... 0.01 0.05 0.03 ), with sum_i x_i = 1.

  28. Change in probability vector
     - If the probability vector is x = (x_1, ..., x_N) at this step, what is it at the next step?
     - Recall that row i of the transition probability matrix P tells us where we go next from state i.
     - Equivalently: column j of P tells us "where we came from" (and with which probability).
     - So from x, our next state is distributed as xP.
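The update x <- xP can be sketched directly on the transition matrix P of the slide-20 example (0.33 written exactly as 1/3):

```python
P = [
    [0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00],  # d0
    [0.00, 0.50, 0.50, 0.00, 0.00, 0.00, 0.00],  # d1
    [1/3,  0.00, 1/3,  1/3,  0.00, 0.00, 0.00],  # d2
    [0.00, 0.00, 0.00, 0.50, 0.50, 0.00, 0.00],  # d3
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],  # d4
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.50, 0.50],  # d5
    [0.00, 0.00, 0.00, 1/3,  1/3,  0.00, 1/3 ],  # d6
]

def next_dist(x, P):
    # (xP)_j = sum_i x_i * P_ij: column j collects the probability
    # mass flowing into state j from every state i.
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

x = [1, 0, 0, 0, 0, 0, 0]  # surfer starts at d0 with certainty
x = next_dist(x, P)        # after one step, all mass is on d2 (d0's only outlink)
```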
