
slide-1
SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 16: Citation Analysis and Link Analysis

Paul Ginsparg

Cornell University, Ithaca, NY

27 Oct 2009

1 / 40

slide-2
SLIDE 2

Administrativa

Assignment 3 now available: due Sun 8 Nov
Topics for rest of course (5 weeks, 10 lectures), MRS chapters 19, 20, 12, 13, 14, 16, 17:
web search
language models
text classification
vector space classification
clustering
+ MapReduce and the Google File System

2 / 40

slide-3
SLIDE 3

Overview

1 Recap
2 PageRank example
3 HITS: Hubs & Authorities
4 Discussion

3 / 40

slide-4
SLIDE 4

Outline

1 Recap
2 PageRank example
3 HITS: Hubs & Authorities
4 Discussion

4 / 40

slide-5
SLIDE 5

Model behind PageRank: Random walk

Imagine a web surfer doing a random walk on the web:
Start at a random page
At each step, go out of the current page along one of the links on that page, equiprobably
In the steady state, each page has a long-term visit rate. This long-term visit rate is the page's PageRank.
PageRank = long-term visit rate = steady-state probability

5 / 40

slide-6
SLIDE 6

One way of computing the PageRank π

Recall: regardless of where we start, we eventually reach the steady state π.
Start with any distribution x, e.g., the uniform distribution
After one step, we're at xP
After two steps, we're at xP^2
After k steps, we're at xP^k
Algorithm: multiply x by increasing powers of P until convergence. This is called the power method.

6 / 40

slide-7
SLIDE 7

Random Walker Redux

The PageRank r_j of page j is determined self-consistently by the equation

r_j = α/n + (1 − α) Σ_{i | i→j} r_i / d_i

where α is a number between 0 and 1 (originally taken to be 0.15), the sum on i is over pages i pointing to j, and d_i is the outgoing degree of page i.
Pages with high PageRank r_i that have low d_i, i.e., that don't point to too many other pages, convey the most PageRank to page j.
With probability 1 − α one goes at random to any of the pages j pointed to by page i, and with probability α one instead jumps at random to any page anywhere on the web.
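As a sketch, the fixed-point iteration implied by this equation can be written as follows; the three-page graph is made-up illustration data, not the example from these slides:

```python
# Fixed-point iteration of r_j = alpha/n + (1 - alpha) * sum_{i -> j} r_i / d_i.
def pagerank(links, alpha=0.15, iters=100):
    """links: dict mapping each page i to the list of pages it points to."""
    pages = sorted(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}          # start from the uniform distribution
    for _ in range(iters):
        new = {p: alpha / n for p in pages}  # teleportation contribution alpha/n
        for i, outs in links.items():
            for j in outs:                   # page i spreads (1-alpha)*r_i equally
                new[j] += (1 - alpha) * r[i] / len(outs)
        r = new
    return r

# Hypothetical toy graph; every page has at least one outlink
# (dangling pages would need separate handling).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)
print({p: round(v, 3) for p, v in r.items()})
```

Page c, pointed to by both a and b, ends up with the highest rank, and the ranks sum to 1 as a probability should.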

7 / 40

slide-8
SLIDE 8

Eigenvector problem

The incidence matrix A is defined by A_ij = 1 if i points to j and A_ij = 0 otherwise.

A matrix P giving the transition probability from page i to page j can be constructed in terms of the incidence matrix A as

P_ij = (α/n) O_ij + (1 − α) (1/d_i) A_ij

where n = total # of pages, d_i is the outdegree of node i, and O_ij = 1 (∀ i, j) is the matrix of all ones.

The matrix eigenvector relation

r P = r   (i.e., r = P^T r)

is equivalent to the equation on the previous slide (with r normalized as a probability, so that Σ_i r_i O_ij = Σ_i r_i = 1).
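A minimal sketch of this construction, assuming NumPy is available; the 3-page incidence matrix is hypothetical illustration data:

```python
import numpy as np

# P_ij = (alpha/n) * O_ij + (1 - alpha) * A_ij / d_i, as on the slide.
def transition_matrix(A, alpha=0.15):
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    d = A.sum(axis=1, keepdims=True)   # outdegree d_i of each node
    O = np.ones((n, n))                # matrix of all ones
    return (alpha / n) * O + (1 - alpha) * A / d

A = np.array([[0, 1, 1],   # hypothetical incidence matrix: row i marks pages i links to
              [0, 0, 1],
              [1, 0, 0]])
P = transition_matrix(A)
print(P.sum(axis=1))       # each row sums to alpha + (1 - alpha) = 1: P is stochastic
```

Note the sketch assumes every node has at least one outlink, so no row of A is all zeros.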

8 / 40

slide-9
SLIDE 9

Eigenvector problem, cont’d

By the Perron-Frobenius theorem, the matrix P has a unique principal eigenvector, corresponding to its largest eigenvalue, and its components are all positive.
(Since Σ_j A_ij / d_i = 1, we find Σ_j P_ij = α + (1 − α) = 1, so P is normalized such that its principal eigenvalue is 1.)
Thus the eigenvalue equation always has a solution.
To calculate the rank of all the pages, the crawler visits as many pages as possible and calculates the link structure of the web graph.
Calculating eigenvectors of enormous matrices can be painful, but in the case of the principal eigenvector there's a simple method.

9 / 40

slide-10
SLIDE 10

Power Method, cont’d

Expanding the starting vector w in the eigenvectors v^(i) of M (with coefficients α_i), repeated application of M gives

M^n w = Σ_{i=1}^{N} α_i λ_i^n v^(i)

For large n, the term with the largest eigenvalue dominates:

(1/λ_1^n) M^n w = α_1 v^(1) + Σ_{i=2}^{N} α_i (λ_i/λ_1)^n v^(i) ≈ α_1 v^(1) ,

since λ_i/λ_1 < 1 ∀ i ≠ 1.
Thus the principal eigenvector can be determined simply by applying a matrix M sufficiently many times to any vector which has non-zero dot product with the principal eigenvector (i.e., non-vanishing α_1 in the above) ⇒ effectively projects to the eigenvector of interest
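The argument can be sketched numerically; the 2×2 matrix below is an arbitrary symmetric example with distinct eigenvalues (3 and 1), not taken from the slides:

```python
import numpy as np

# Power method: repeatedly apply M to a vector; components along
# non-principal eigenvectors decay like (lambda_i / lambda_1)^n.
def power_method(M, w, iters=200):
    for _ in range(iters):
        w = M @ w
        w = w / np.linalg.norm(w)   # rescale so the vector neither blows up nor vanishes
    return w

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # eigenvalues 3 and 1; principal eigenvector ~ (1, 1)
v = power_method(M, np.array([1.0, 0.0]))
print(v)                            # close to (1/sqrt(2), 1/sqrt(2))
```

The starting vector (1, 0) has non-zero overlap with (1, 1), i.e., non-vanishing α_1, so the iteration projects onto the principal eigenvector; the non-principal component decays as (1/3)^n.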

10 / 40

slide-11
SLIDE 11

PageRank summary

Preprocessing

Given the graph of links, build matrix P
Apply teleportation
From the modified matrix, compute π
π_i is the PageRank of page i.

Query processing

Retrieve pages satisfying the query Rank them by their PageRank Return reranked list to the user

11 / 40

slide-12
SLIDE 12

PageRank issues

Real surfers are not random surfers – Markov model is not a good model of surfing.

Examples of nonrandom surfing: back button, short vs. long paths, bookmarks, directories – and search! But it’s good enough as a model for our purposes.

Simple PageRank ranking (as described on previous slide) produces bad results for many pages.

Consider the query video service. The Yahoo home page (i) has a very high PageRank and (ii) contains both words. If we rank all Boolean hits according to PageRank, then the Yahoo home page would be top-ranked. Clearly not desirable.

In practice: rank according to weighted combination of raw text match, anchor text match, PageRank & other factors

12 / 40

slide-13
SLIDE 13

Outline

1 Recap
2 PageRank example
3 HITS: Hubs & Authorities
4 Discussion

13 / 40

slide-14
SLIDE 14

Web graph example

[Figure: web graph on pages d0–d6, annotated with the terms car, benz, ford, gm, honda, jaguar, jag, cat, leopard, tiger, lion, cheetah, speed]

Exercise: Without doing the actual computation, guess the ranking of the pages according to PageRank.

14 / 40

slide-15
SLIDE 15

Transition (probability) matrix

      d0   d1   d2   d3   d4   d5   d6
d0  0.00 0.00 1.00 0.00 0.00 0.00 0.00
d1  0.00 0.50 0.50 0.00 0.00 0.00 0.00
d2  0.33 0.00 0.33 0.33 0.00 0.00 0.00
d3  0.00 0.00 0.00 0.50 0.50 0.00 0.00
d4  0.00 0.00 0.00 0.00 0.00 0.00 1.00
d5  0.00 0.00 0.00 0.00 0.00 0.50 0.50
d6  0.00 0.00 0.00 0.33 0.33 0.00 0.33

15 / 40

slide-16
SLIDE 16

Transition matrix with teleporting, α = .14

      d0   d1   d2   d3   d4   d5   d6
d0  0.02 0.02 0.88 0.02 0.02 0.02 0.02
d1  0.02 0.45 0.45 0.02 0.02 0.02 0.02
d2  0.31 0.02 0.31 0.31 0.02 0.02 0.02
d3  0.02 0.02 0.02 0.45 0.45 0.02 0.02
d4  0.02 0.02 0.02 0.02 0.02 0.02 0.88
d5  0.02 0.02 0.02 0.02 0.02 0.45 0.45
d6  0.02 0.02 0.02 0.31 0.31 0.02 0.31

16 / 40

slide-17
SLIDE 17

Power method vectors xPk

      x    xP   xP^2 xP^3 xP^4 xP^5 xP^6 xP^7 xP^8 xP^9 xP^10 xP^11 xP^12 xP^13
d0  0.14 0.06 0.09 0.07 0.07 0.06 0.06 0.06 0.06 0.05 0.05 0.05 0.05 0.05
d1  0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
d2  0.14 0.25 0.18 0.17 0.15 0.14 0.13 0.12 0.12 0.12 0.12 0.11 0.11 0.11
d3  0.14 0.16 0.23 0.24 0.24 0.24 0.24 0.25 0.25 0.25 0.25 0.25 0.25 0.25
d4  0.14 0.12 0.16 0.19 0.19 0.20 0.21 0.21 0.21 0.21 0.21 0.21 0.21 0.21
d5  0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
d6  0.14 0.25 0.23 0.25 0.27 0.28 0.29 0.29 0.30 0.30 0.30 0.30 0.31 0.31
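The example can be checked numerically; this sketch rebuilds the teleport-adjusted matrix from the slide-15 transition matrix with α = 0.14 and iterates from the uniform distribution:

```python
import numpy as np

P = np.array([  # transition matrix of the example web graph (before teleporting)
    [0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.50, 0.50, 0.00, 0.00, 0.00, 0.00],
    [1/3,  0.00, 1/3,  1/3,  0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.50, 0.50, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.50, 0.50],
    [0.00, 0.00, 0.00, 1/3,  1/3,  0.00, 1/3],
])
alpha, n = 0.14, 7
Pt = alpha / n + (1 - alpha) * P        # teleporting adds alpha/n = 0.02 everywhere

x = np.full(n, 1 / n)                   # uniform starting distribution
for _ in range(50):                     # well past the ~13 steps shown above
    x = x @ Pt
print(np.round(x, 2))                   # steady state: d6 highest, then d3, d4
```

The result matches the last column of the table: d6 ends up top-ranked, followed by d3 and d4.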

17 / 40

slide-18
SLIDE 18

How important is PageRank?

Frequent claim: PageRank is the most important component of web ranking.

The reality:

There are several components that are at least as important: e.g., anchor text, phrases, proximity, tiered indexes . . .
Rumor has it that PageRank in its original form (as presented here) has a negligible impact on ranking!
However, variants of a page's PageRank are still an essential part of ranking.
Addressing link spam is difficult and crucial.

18 / 40

slide-19
SLIDE 19

Outline

1 Recap
2 PageRank example
3 HITS: Hubs & Authorities
4 Discussion

19 / 40

slide-20
SLIDE 20

HITS – Hyperlink-Induced Topic Search

Premise: there are two different types of relevance on the web. Relevance type 1: Hubs. A hub page is a good list of links to pages answering the information need.

Bob’s list of recommended hotels in London

Relevance type 2: Authorities. An authority page is a direct answer to the information need.

Home page of Four Seasons Hotel London By definition: Links to authority pages occur repeatedly on hub pages.

Most approaches to search (including PageRank ranking) don’t make the distinction between these two very different types of relevance.

20 / 40

slide-21
SLIDE 21

Hubs and authorities: Definition

A good hub page for a topic links to many authority pages for that topic. A good authority page for a topic is linked to by many hub pages for that topic. Circular definition – we will turn this into an iterative computation.

21 / 40

slide-22
SLIDE 22

Example for hubs and authorities

Hubs:
www.bestfares.com
www.airlinesquality.com
blogs.usatoday.com/sky
aviationblog.dallasnews.com

Authorities:
www.aa.com
www.delta.com
www.united.com

22 / 40

slide-23
SLIDE 23

HITS=hypertext induced topic search, I

[Figure: bipartite graph of hub pages on the left pointing to authority pages on the right]

23 / 40

slide-24
SLIDE 24

HITS=hypertext induced topic search, II

Every page has both an authority weight a_i and a hub weight h_i, defined to satisfy

a_j = Σ_{i | i→j} h_i ,   h_i = Σ_{j | i→j} a_j ,

i.e., the authority weight of a site j is given by the sum of the hub weights of sites i that point to the site, and the hub weight of a site i is given by the sum of the authority weights of the sites j to which it points.
In terms of the incidence matrix A, these can be written

a = A^T h ,   h = A a .

24 / 40

slide-25
SLIDE 25

HITS=hypertext induced topic search, III

Start with some trial forms of the hub and authority weights, h_i^(0) and a_i^(0), and iterate the above equations:

a_j^(1) = Σ_{i | i→j} h_i^(0) ,   h_i^(1) = Σ_{j | i→j} a_j^(0)

to provide more refined guesses h_i^(1) and a_i^(1), and then continue. In matrix form, the jth such iteration can be written

a^(j) = A^T h^(j−1) ,   h^(j) = A a^(j−1) .

Note that the result of two iterations can be written

a^(j) = A^T A a^(j−2) ,   h^(j) = A A^T h^(j−2) ,

so the result of 2n such iterations is

a^(2n) = (A^T A)^n a^(0) ,   h^(2n) = (A A^T)^n h^(0) .

25 / 40

slide-26
SLIDE 26

HITS=hypertext induced topic search, IV

The matrices A^T A and A A^T are symmetric with non-negative entries, so for a suitably chosen subset of nodes such that they're irreducible, the Perron-Frobenius theorem will apply. In that case, the above iterative procedure will converge, and we see that the authority weights a_i will be given by the components of the principal eigenvector of A^T A, and the hub weights h_i by the components of the principal eigenvector of A A^T.
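As a sketch of this convergence claim, the iteration can be compared against a direct eigenvector computation (assuming NumPy; the incidence matrix is a hypothetical 3-page example):

```python
import numpy as np

A = np.array([[0, 1, 1],   # hypothetical incidence matrix: row i marks pages i links to
              [0, 0, 1],
              [1, 1, 0]], dtype=float)

a = np.ones(3)             # trial authority weights a^(0)
h = np.ones(3)             # trial hub weights h^(0)
for _ in range(100):
    a = A.T @ h            # authority = sum of hub weights of in-linking pages
    h = A @ a              # hub = sum of authority weights of linked-to pages
    a /= np.linalg.norm(a) # rescale; only relative values matter
    h /= np.linalg.norm(h)

# Compare with the principal eigenvector of A^T A computed directly.
vals, vecs = np.linalg.eigh(A.T @ A)
principal = np.abs(vecs[:, np.argmax(vals)])
print(np.round(a, 3), np.round(principal, 3))  # the two should agree
```

The iterated authority vector matches the principal eigenvector of A^T A, as the eigenvalue analysis predicts; the hub vector likewise matches that of A A^T.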

26 / 40

slide-27
SLIDE 27

Root set and base set (1)

Do a regular web search first
Call the search result the root set
Find all pages that are linked to or link to pages in the root set
Call this larger set the base set
Finally, compute hubs and authorities for the base set (which we'll view as a small web graph)

27 / 40

slide-28
SLIDE 28

Root set and base set (2)

[Figure: root set nested inside the base set]

The root set + nodes to which root set nodes link + nodes that link to root set nodes = the base set

28 / 40

slide-29
SLIDE 29

Root set and base set (3)

Root set typically has 200–1000 nodes. Base set may have up to 5000 nodes.
Computation of the base set:
Follow outlinks by parsing the pages in the root set
Find d's inlinks by searching for all pages containing a link to d
This assumes that our inverted index supports search for links (in addition to terms).
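A sketch of the base-set computation, assuming hypothetical in-memory outlink and inlink indexes (in practice the inlinks would come from the link-search capability of the inverted index):

```python
# outlinks[d] = pages d links to; inlinks[d] = pages linking to d.
# Both dicts and all page names below are made-up illustration data.
def base_set(root, outlinks, inlinks):
    base = set(root)
    for d in root:
        base.update(outlinks.get(d, []))   # nodes the root set links to
        base.update(inlinks.get(d, []))    # nodes that link into the root set
    return base

outlinks = {"r1": ["p1", "p2"], "r2": ["p2"]}
inlinks = {"r1": ["q1"], "r2": ["q1", "q2"]}
print(sorted(base_set(["r1", "r2"], outlinks, inlinks)))
# ['p1', 'p2', 'q1', 'q2', 'r1', 'r2']
```

The base set is the root set plus one step of links in each direction, exactly as in the figure on the previous slide.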

29 / 40

slide-30
SLIDE 30

Hub and authority scores

Compute for each page d in the base set a hub score h(d) and an authority score a(d)
Initialization: for all d: h(d) = 1, a(d) = 1
Iteratively update all h(d), a(d)
After convergence:
Output pages with highest h scores as top hubs
Output pages with highest a scores as top authorities
So we output two ranked lists

30 / 40

slide-31
SLIDE 31

Iterative update

For all d: h(d) = Σ_{d→y} a(y)

For all d: a(d) = Σ_{y→d} h(y)

[Figure: page d with links to and from pages y1, y2, y3]

Iterate these two steps until convergence
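The two update steps can be sketched per page on a hypothetical edge list, with the rescaling discussed on the next slide (only relative scores matter):

```python
# Per-page HITS updates h(d) = sum_{d->y} a(y), a(d) = sum_{y->d} h(y).
# The edge list is made-up illustration data: d links to y1, y2, y3.
edges = [("d", "y1"), ("d", "y2"), ("d", "y3"), ("y1", "d")]
pages = {p for e in edges for p in e}
h = {p: 1.0 for p in pages}   # initialization: all scores 1
a = {p: 1.0 for p in pages}

for _ in range(50):
    h = {d: sum(a[y] for (s, y) in edges if s == d) for d in pages}
    a = {d: sum(h[y] for (y, t) in edges if t == d) for d in pages}
    zh, za = sum(h.values()) or 1, sum(a.values()) or 1
    h = {d: v / zh for d, v in h.items()}   # scale down after each iteration;
    a = {d: v / za for d, v in a.items()}   # the scaling factor is arbitrary

print(max(h, key=h.get))   # 'd' links to everything, so it is the top hub
```

Page d, which links to all three y pages, converges to the top hub score, while the y pages it points to pick up the authority weight.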

31 / 40

slide-32
SLIDE 32

Details

Scaling

To prevent the a() and h() values from getting too big, we can scale down after each iteration. The scaling factor doesn't really matter: we care about the relative (as opposed to absolute) values of the scores.

In most cases, the algorithm converges after a few iterations.

32 / 40

slide-33
SLIDE 33

Hubs & Authorities: Comments

HITS can pull together good pages regardless of page content. Once the base set is assembled, we only do link analysis, no text matching. Pages in the base set often do not contain any of the query words. In theory, an English query can retrieve Japanese-language pages!

If supported by the link structure between English and Japanese pages

Danger: topic drift – the pages found by following links may not be related to the original query.

33 / 40

slide-34
SLIDE 34

PageRank vs. HITS: Discussion

PageRank can be precomputed, HITS has to be computed at query time.

HITS is too expensive in most application scenarios.

PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization (ii) the set of pages to apply the formalization to. These two are orthogonal.

We could also apply HITS to the entire web and PageRank to a small base set.

Claim: On the web, a good hub is almost always also a good authority. The actual difference between PageRank ranking and HITS ranking is therefore not as large as one might expect. Note also: topic-specific PageRank

34 / 40

slide-35
SLIDE 35

Outline

1 Recap
2 PageRank example
3 HITS: Hubs & Authorities
4 Discussion

35 / 40

slide-36
SLIDE 36

Questions I

Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine. Seventh International World Wide Web Conference, Brisbane, Australia, 1998. http://www-db.stanford.edu/~backrub/google.html
What was the authors' institutional status at the time of writing?
What are their criticisms of conventional ranking methods based on vector similarity? (Why not standard full-text indexing with tf.idf?)
What do the authors mean by the "precision" they wish to optimize? What does this assume about users and their wishes?
How does their view of relevance differ from the conventional notion?
Why did this appear in a relatively obscure venue?

36 / 40

slide-37
SLIDE 37

Questions II

How does PageRank rank the hits returned by a query, as compared with traditional search engines?
What other ranking methods do they propose?
How do they use anchor text, and what are the computational challenges?
Why are they unenthusiastic about using metadata for indexing the web?
How many web pages were they indexing back then? How many now? How many queries per day?
What is their strategy for scalability? How did they implement such a large-scale system with a small technical staff?

37 / 40

slide-38
SLIDE 38

Questions III

2.1.1: What is the typo in their equation PR(A) = (1 − d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))? Explain.
3.2: "There are even numerous companies which specialize in manipulating search engines for profit." How was this problem dealt with at the time?
4.3: What are some of the "social considerations" of web crawling?
4.5.2 Feedback: "Figuring out the right values for these parameters is something of a black art"
6.1 Future Work: 100M pages? What have they realized? Where have they done more/less than projected?

38 / 40

slide-39
SLIDE 39

Questions IV

Where does this now stand? 1.3.2: "we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems"
What did they mean by 6.4: "We hope Google will be a resource for searchers and researchers all around the world"?
Appendix A: "we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm"

39 / 40

slide-40
SLIDE 40

Where are they now?

“. . . advertising-funded search engines will inherently be biased toward the advertisers and away from the needs of consumers.”

http://www.nytimes.com/2005/10/30/business/yourmoney/30google.html

“Google Wants to Dominate Madison Avenue, Too” (2005):

“They softened that line a bit by the time they got around to pitching their business to venture capitalists, allowing that selling ads would be a handy safety net if other, less distasteful ideas for generating revenue didn’t pan out. . . . As it turned out, the safety net was a trampoline. Those little ads — 12 word snippets of text, linked to topics that users are actually interested in — have turned Google into one of the biggest advertising vehicles the world has ever seen. This year, Google will sell $6.1 billion in ads, nearly double what it sold last year . . . That is more advertising than is sold by any newspaper chain, magazine publisher or television network. By next year . . . advertising revenue of $9.5 billion. That would place it fourth among American media companies in total ad sales after Viacom, the News Corporation and the Walt Disney Company, but ahead of giants including NBC Universal and Time Warner.”

40 / 40