slide-1
SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 15: Citation Analysis and Link Analysis

Paul Ginsparg

Cornell University, Ithaca, NY

22 Oct 2009

1 / 49

slide-2
SLIDE 2

Administrativa

Assignment 3 now available: due Sun 8 Nov.
Discussion 5 (27 Oct 2009): Read and be prepared to discuss Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Seventh International World Wide Web Conference, Brisbane, Australia, 1998. http://www7.scu.edu.au/1921/com1921.htm
(Note: a second copy [with photos of the authors] is available at http://www-db.stanford.edu/~backrub/google.html)

2 / 49

slide-3
SLIDE 3

Overview

1. Recap
2. Citation analysis
3. PageRank

3 / 49

slide-4
SLIDE 4

Outline

1. Recap
2. Citation analysis
3. PageRank

4 / 49

slide-5
SLIDE 5

Brief History and Historical antecedents

1991–1994: growth, mainly in Europe, plus the move to client–server systems.
The PageRank methodology stems from a long history of citation analysis, where a "link" is some signal of recommendation (or popularity).
It is based on a property of the graph (i.e., query-independent), hence efficient for serving a large volume of queries.
The underlying Markov process is also not new, but was applied in a particularly powerful way (⇒ unexpected power of simple algorithms and ample computing power applied to massive datasets).

5 / 49

slide-6
SLIDE 6

The web as a directed graph

[Figure: page d1 links to page d2 via a hyperlink carrying anchor text]

Assumption 1: A hyperlink is a quality signal.

The hyperlink d1 → d2 indicates that d1’s author deems d2 relevant.

Assumption 2: The anchor text describes the content of d2.

We use anchor text somewhat loosely here for: the text surrounding the hyperlink. Example: “You can find cheap cars <a href=http://...>here</a>.” Anchor text: “You can find cheap cars here”

Easy to find cases where these two assumptions are violated. But they hold for most hyperlinks.

6 / 49

slide-7
SLIDE 7

Google bombs

A Google bomb is a search with “bad” results due to maliciously manipulated anchor text. Google introduced a new weighting function in January 2007 that fixed many Google bombs. Still some remnants: [dangerous cult] on Google, Bing, Yahoo

Coordinated link creation by those who dislike the Church of Scientology

Defused Google bombs: [miserable failure], [antichrist], [evil empire] e.g., http://www.nytimes.com/2004/01/22/technology/circuits/22goog.html (or copy at http://www.cis.cornell.edu/Courses/cis295/2004fa/gbomb.html )

7 / 49

slide-8
SLIDE 8

Outline

1. Recap
2. Citation analysis
3. PageRank

8 / 49

slide-9
SLIDE 9

Origins of PageRank: Citation analysis (1)

Citation analysis: analysis of citations in the scientific literature.
Example citation: "Miller (2001) has shown that physical activity alters the metabolism of estrogens."
We can view "Miller (2001)" as a hyperlink linking two scientific articles.
One application of these "hyperlinks" in the scientific literature:

Measure the similarity of two articles by the overlap of other articles citing them: cocitation similarity (C → A, C → B).
Cocitation similarity on the web: Google's "find pages like this" / "Similar" feature.
There is also coreference similarity (A → C, B → C): two articles are similar if they cite the same articles.
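A minimal Python sketch of these two similarity notions (the toy citation graph below is an illustrative assumption, not data from the slides):

```python
# Cocitation and coreference similarity on a toy citation graph.
citations = {           # citing article -> set of cited articles
    "C1": {"A", "B"},
    "C2": {"A", "B"},
    "C3": {"A"},
}

def cocitation(x, y, citations):
    """Number of articles that cite both x and y (C -> x, C -> y)."""
    return sum(1 for cited in citations.values() if x in cited and y in cited)

def coreference(x, y, citations):
    """Number of articles cited by both x and y (x -> C, y -> C)."""
    return len(citations.get(x, set()) & citations.get(y, set()))

print(cocitation("A", "B", citations))    # 2: both C1 and C2 cite A and B
print(coreference("C1", "C2", citations)) # 2: C1 and C2 both cite A and B
```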

9 / 49

slide-10
SLIDE 10

Origins of PageRank: Citation analysis (2)

Another application: Citation frequency can be used to measure the impact of an article.

Each article gets one vote. Not a very accurate measure

On the web: citation frequency = inlink count

A high inlink count does not necessarily mean high quality, mainly because of link spam.

Better measure: weighted citation frequency / citation rank

An article’s vote is weighted according to its citation impact. Circular? No: can be formalized in a well-defined way.

10 / 49

slide-11
SLIDE 11

Origins of PageRank: Citation analysis (3)

Better measure: weighted citation frequency / citation rank.

Basically PageRank, invented in the context of citation analysis by:

  • G. Pinski and F. Narin, "Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics", Information Processing and Management 12 (1976).
  • Nancy L. Geller, "On the citation influence methodology of Pinski and Narin", Information Processing and Management 14(2): 93–95 (1978).

Citation analysis plays a (too?) prominent role: the budget and salary of many researchers are determined by this measure of the impact of their publications...

11 / 49

slide-12
SLIDE 12

Origins of PageRank: Summary

We can use the same formal representation for

citations in the scientific literature hyperlinks on the web

Appropriately weighted citation frequency is a measure of quality

for web pages and for scientific publications

12 / 49

slide-13
SLIDE 13

Link-based ranking for web search

Simple version of using links for ranking on the web

First retrieve all pages satisfying the query (say, venture capital).
Then order these by the number of inlinks.

Simple link popularity (= number of inlinks) is easy to spam. Why?
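A minimal sketch of this simple link-based ranking (the toy postings lists and link graph below are illustrative assumptions, not data from the slides):

```python
# Boolean retrieval followed by ordering the hits by inlink count.
from collections import Counter

postings = {"venture": {"p1", "p2", "p3"}, "capital": {"p2", "p3", "p4"}}
links = [("p1", "p3"), ("p4", "p3"), ("p4", "p2"), ("p5", "p3")]  # (source, target)

def rank_by_inlinks(query_terms, postings, links):
    hits = set.intersection(*(postings[t] for t in query_terms))  # Boolean AND
    inlinks = Counter(target for _, target in links)              # inlink counts
    return sorted(hits, key=lambda p: inlinks[p], reverse=True)

print(rank_by_inlinks(["venture", "capital"], postings, links))   # ['p3', 'p2']
```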

13 / 49

slide-14
SLIDE 14

Outline

1. Recap
2. Citation analysis
3. PageRank

14 / 49

slide-15
SLIDE 15

Model behind PageRank: Random walk

Imagine a web surfer doing a random walk on the web

Start at a random page.
At each step, go out of the current page along one of the links on that page, equiprobably.

In the steady state, each page has a long-term visit rate. This long-term visit rate is the page’s PageRank. PageRank = long-term visit rate = steady state probability

15 / 49

slide-16
SLIDE 16

Formalization of random walk: Markov chains

A Markov chain consists of N states, plus an N × N transition probability matrix P.
State = page. At each step, we are on exactly one of the pages.
For 1 ≤ i, j ≤ N, the matrix entry Pij tells us the probability of j being the next page, given we are currently on page i.

[Figure: state di → state dj, with transition probability Pij on the arrow]

16 / 49

slide-17
SLIDE 17

Markov chains

Clearly, for all i: ∑_{j=1}^{N} Pij = 1

Markov chains are abstractions of random walks.

17 / 49

slide-18
SLIDE 18

Example web graph

[Figure: example web graph with pages d0–d6; anchor-text terms on the links include car, benz, ford, gm, honda, jaguar, jag, cat, leopard, tiger, lion, cheetah, speed]

18 / 49

slide-19
SLIDE 19

Link matrix for example

      d0  d1  d2  d3  d4  d5  d6
d0     0   0   1   0   0   0   0
d1     0   1   1   0   0   0   0
d2     1   0   1   1   0   0   0
d3     0   0   0   1   1   0   0
d4     0   0   0   0   0   0   1
d5     0   0   0   0   0   1   1
d6     0   0   0   1   1   0   1

19 / 49

slide-20
SLIDE 20

Transition probability matrix P for example

      d0    d1    d2    d3    d4    d5    d6
d0   0.00  0.00  1.00  0.00  0.00  0.00  0.00
d1   0.00  0.50  0.50  0.00  0.00  0.00  0.00
d2   0.33  0.00  0.33  0.33  0.00  0.00  0.00
d3   0.00  0.00  0.00  0.50  0.50  0.00  0.00
d4   0.00  0.00  0.00  0.00  0.00  0.00  1.00
d5   0.00  0.00  0.00  0.00  0.00  0.50  0.50
d6   0.00  0.00  0.00  0.33  0.33  0.00  0.33
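A minimal sketch (not from the slides) showing how this transition matrix follows from the link matrix on the previous slide: divide each row of the adjacency matrix by its out-degree.

```python
import numpy as np

# Link (adjacency) matrix from the previous slide: A[i, j] = 1 iff d_i links to d_j.
A = np.array([
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 0, 1],
], dtype=float)

# Divide each row by its out-degree to get transition probabilities.
P = A / A.sum(axis=1, keepdims=True)
print(np.round(P, 2))   # matches the matrix above (0.33 = 1/3, rounded)
```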

20 / 49

slide-21
SLIDE 21

Long-term visit rate

Recall: PageRank = long-term visit rate.
The long-term visit rate of page d is the probability that a web surfer is at page d at a given point in time.
Next: what properties must hold of the web graph for the long-term visit rate to be well defined?
The web graph must correspond to an ergodic Markov chain.
First a special case: the web graph must not contain dead ends.

21 / 49

slide-22
SLIDE 22

Dead ends

The web is full of dead ends (pages with no outlinks).
A random walk can get stuck in dead ends.
If there are dead ends, long-term visit rates are not well-defined (or nonsensical).

22 / 49

slide-23
SLIDE 23

Teleporting

At a dead end, jump to a random web page.
At any non-dead end, with probability 10%, jump to a random web page.
With the remaining probability (90%), go out on a random hyperlink.

For example, if the page has 4 outgoing links: randomly choose one of them with probability (1 − 0.10)/4 = 0.225 each, and choose one of the N pages at random with probability 0.1/N.

10% is a parameter.

23 / 49

slide-24
SLIDE 24

Result of teleporting

With teleporting, we cannot get stuck in a dead end. But even without dead-ends, a graph may not have well-defined long-term visit rates. More generally, we require that the Markov chain be ergodic.

24 / 49

slide-25
SLIDE 25

Ergodic Markov chains

A Markov chain is ergodic iff it is irreducible and aperiodic.

  • Irreducibility. Roughly: there is a path from any page to any other page.
  • Aperiodicity. Roughly: the pages cannot be partitioned such that the random walker visits the partitions sequentially.

A non-ergodic Markov chain: [Figure: two states that alternate deterministically, each transition with probability 1.0]

25 / 49

slide-26
SLIDE 26

Ergodic Markov chains

Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state. This is the steady-state probability distribution. Over a long time period, we visit each state in proportion to this rate. It doesn’t matter where we start.

26 / 49

slide-27
SLIDE 27

Formalization of “visit”: Probability vector

A probability (row) vector x = (x1, . . . , xN) tells us where the random walk is at any point.
Example: x = ( 0 0 0 . . . 1 . . . 0 0 0 ), with the 1 in position i: the walk is on page i with certainty.
              1 2 3 . . . i . . . N−2 N−1 N
More generally: the random walk is on page i with probability xi.
Example: x = ( 0.05 0.01 0.0 . . . 0.2 . . . 0.01 0.05 0.03 )
               1    2    3   . . . i   . . . N−2  N−1  N
∑i xi = 1

27 / 49

slide-28
SLIDE 28

Change in probability vector

If the probability vector is x = (x1, . . . , xN) at this step, what is it at the next step? Recall that row i of the transition probability matrix P tells us where we go next from state i. Equivalently: column j of P tells us “where we came from” (and with which probability). So from x, our next state is distributed as xP.
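A minimal numpy sketch of one step of this update, using the two-state chain that appears on the later example slides:

```python
import numpy as np

# Transition matrix of the two-state example used later (P11=0.1, P12=0.9, P21=0.3, P22=0.7).
P = np.array([[0.1, 0.9],
              [0.3, 0.7]])

x = np.array([0.0, 1.0])   # currently on page d2 with certainty
x_next = x @ P             # distribution after one step: xP
print(x_next)              # [0.3 0.7]
```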

28 / 49

slide-29
SLIDE 29

Steady state in vector notation

The steady state in vector notation is simply a vector π = (π1, π2, . . . , πN) of probabilities.
(We use π to distinguish it from the notation for the probability vector x.)
πi is the long-term visit rate (or PageRank) of page i.
So we can think of PageRank as a very long vector – one entry per page.

29 / 49

slide-30
SLIDE 30

Steady-state distribution: Example

What is the PageRank / steady state in this example?

[Figure: two-state chain d1, d2 with P11 = 0.25, P12 = 0.75, P21 = 0.25, P22 = 0.75]

30 / 49

slide-31
SLIDE 31

Steady-state distribution: Example

P11 = 0.25   P12 = 0.75
P21 = 0.25   P22 = 0.75

       x1      x2      Pt(d1)   Pt(d2)
t0     0.25    0.75    0.25     0.75
t1     0.25    0.75    (convergence)

PageRank vector = π = (π1, π2) = (0.25, 0.75)
Pt(d1) = Pt−1(d1) ∗ P11 + Pt−1(d2) ∗ P21
Pt(d2) = Pt−1(d1) ∗ P12 + Pt−1(d2) ∗ P22

31 / 49

slide-32
SLIDE 32

How do we compute the steady state vector?

In other words: how do we compute PageRank?
Recall: π = (π1, π2, . . . , πN) is the PageRank vector, the vector of steady-state probabilities . . .
. . . and if the distribution in this step is x, then the distribution in the next step is xP.
But π is the steady state! So: π = πP.
Solving this matrix equation gives us π.
π is the principal left eigenvector of P, that is, the left eigenvector with the largest eigenvalue.
All transition probability matrices have largest eigenvalue 1.

32 / 49

slide-33
SLIDE 33

One way of computing the PageRank π

Recall: regardless of where we start, we eventually reach the steady state π.
Start with any distribution x, e.g., the uniform distribution.
After one step, we're at xP. After two steps, we're at xP^2. After k steps, we're at xP^k.
Algorithm: multiply x by increasing powers of P until convergence. This is called the power method.
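A minimal sketch of the power method (an illustrative implementation, not code from the slides):

```python
import numpy as np

def power_method(P, tol=1e-10, max_iter=10_000):
    """Iterate x <- xP from a uniform start until convergence; returns the steady state."""
    n = P.shape[0]
    x = np.full(n, 1.0 / n)          # start with the uniform distribution
    for _ in range(max_iter):
        x_next = x @ P
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x

# Two-state example from the following slides (P11=0.1, P12=0.9, P21=0.3, P22=0.7).
P = np.array([[0.1, 0.9],
              [0.3, 0.7]])
print(power_method(P))               # ≈ [0.25 0.75]
```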

33 / 49

slide-34
SLIDE 34

Power method: Example

What is the PageRank / steady state in this example?

[Figure: two-state chain d1, d2 with P11 = 0.1, P12 = 0.9, P21 = 0.3, P22 = 0.7]

34 / 49

slide-35
SLIDE 35

Computing PageRank: Power method

P11 = 0.1   P12 = 0.9
P21 = 0.3   P22 = 0.7

       x1       x2       Pt(d1)   Pt(d2)
t0     0        1        0.3      0.7      = xP
t1     0.3      0.7      0.24     0.76     = xP^2
t2     0.24     0.76     0.252    0.748    = xP^3
t3     0.252    0.748    0.2496   0.7504   = xP^4
. . .
t∞     0.25     0.75     0.25     0.75     = xP^∞

PageRank vector = π = (π1, π2) = (0.25, 0.75)
Pt(d1) = Pt−1(d1) ∗ P11 + Pt−1(d2) ∗ P21
Pt(d2) = Pt−1(d1) ∗ P12 + Pt−1(d2) ∗ P22
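A small sketch (an assumption, not from the slides) that reproduces this iteration numerically:

```python
import numpy as np

P = np.array([[0.1, 0.9],
              [0.3, 0.7]])
x = np.array([0.0, 1.0])             # starting distribution used in the table
for t in range(5):
    print(f"t{t}: {np.round(x, 4)}")
    x = x @ P                         # one step of the power method
# t0: [0. 1.], t1: [0.3 0.7], t2: [0.24 0.76], t3: [0.252 0.748], t4: [0.2496 0.7504]
```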

35 / 49

slide-36
SLIDE 36

Exercise: Compute PageRank using power method

[Figure: two-state chain d1, d2 with P11 = 0.7, P12 = 0.3, P21 = 0.2, P22 = 0.8]

36 / 49

slide-37
SLIDE 37

Solution

P11 = 0.7   P12 = 0.3
P21 = 0.2   P22 = 0.8

       x1       x2       Pt(d1)   Pt(d2)
t0     0        1        0.2      0.8
t1     0.2      0.8      0.3      0.7
t2     0.3      0.7      0.35     0.65
t3     0.35     0.65     0.375    0.625
. . .
t∞     0.4      0.6      0.4      0.6

PageRank vector = π = (π1, π2) = (0.4, 0.6)
Pt(d1) = Pt−1(d1) ∗ P11 + Pt−1(d2) ∗ P21
Pt(d2) = Pt−1(d1) ∗ P12 + Pt−1(d2) ∗ P22

37 / 49

slide-38
SLIDE 38

Random Walker Redux

The PageRank rj of page j is determined self-consistently by the equation

    rj = α/n + (1 − α) ∑_{i: i→j} ri/di

where α is a number between 0 and 1 (originally taken to be 0.15), the sum on i is over pages i pointing to j, and di is the outgoing degree of page i.
Pages with high PageRank ri that have low di, i.e., that don't point to too many other pages, convey the most PageRank to page j.
With probability 1 − α one goes at random to any of the pages pointed to by page i, and with probability α one instead jumps at random to any page anywhere on the web.
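A minimal sketch (not from the slides) of solving this self-consistent equation by fixed-point iteration, using the 7-page example graph and the α = 0.14 of the later slides:

```python
import numpy as np

# Out-links of the example graph d0..d6 (taken from the link-matrix slide).
out = {0: [2], 1: [1, 2], 2: [0, 2, 3], 3: [3, 4], 4: [6], 5: [5, 6], 6: [3, 4, 6]}
n, alpha = 7, 0.14

r = np.full(n, 1.0 / n)                        # start from the uniform distribution
for _ in range(100):
    # r_j = alpha/n + (1 - alpha) * sum over i -> j of r_i / d_i
    r_new = np.full(n, alpha / n)
    for i, targets in out.items():
        for j in targets:
            r_new[j] += (1 - alpha) * r[i] / len(targets)
    r = r_new

print(np.round(r, 2))   # ≈ [0.05 0.04 0.11 0.25 0.21 0.04 0.31]
```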

38 / 49

slide-39
SLIDE 39

Eigenvector problem

The incidence matrix A is defined by Aij = 1 if i points to j, and Aij = 0 otherwise.
A matrix P giving the transition probability from page i to page j can be constructed in terms of the incidence matrix A as

    Pij = (α/n) Oij + (1 − α) (1/di) Aij

where n = total # of pages, di is the outdegree of node i, and Oij = 1 (∀ i, j) is the matrix of all ones.
The matrix eigenvector relation

    r P = r    (equivalently, r = P^T r for the column vector r)

is equivalent to the equation on the previous slide, with r normalized as a probability, so that ∑i ri Oij = ∑i ri = 1.
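A minimal numpy sketch (an assumption, not from the slides) of this construction on the 7-page example graph with α = 0.14; the principal left eigenvector of P recovers the PageRank values obtained later by the power method:

```python
import numpy as np

alpha, n = 0.14, 7
A = np.array([                      # incidence matrix: A[i, j] = 1 iff i points to j
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 0, 1],
], dtype=float)
d = A.sum(axis=1)                   # outdegrees d_i

# P_ij = alpha/n * O_ij + (1 - alpha) * A_ij / d_i
P = alpha / n * np.ones((n, n)) + (1 - alpha) * A / d[:, None]

# Principal left eigenvector of P (eigenvector of P^T with eigenvalue 1), normalized to sum 1.
vals, vecs = np.linalg.eig(P.T)
r = np.real(vecs[:, np.argmax(np.real(vals))])
r = r / r.sum()
print(np.round(r, 2))               # ≈ [0.05 0.04 0.11 0.25 0.21 0.04 0.31]
```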

39 / 49

slide-40
SLIDE 40

Eigenvector problem, cont’d

By the Perron–Frobenius theorem, the matrix P has a unique principal eigenvector, corresponding to its largest eigenvalue, and its components are all positive.
(Since ∑j Aij/di = 1, we find ∑j Pij = α + (1 − α) = 1, and P is normalized such that its principal eigenvalue is 1.)
Thus the eigenvalue equation always has a solution.
To calculate the rank of all the pages, the crawler visits as many pages as possible and calculates the link structure of the web graph.
Calculating eigenvectors of enormous matrices can be painful, but in the case of the principal eigenvector there's a simple method.

40 / 49

slide-41
SLIDE 41

“Formal” Proof of Power Method

Recall: any N × N matrix M has N eigenvectors v(i) that satisfy M v(i) = λi v(i).
They form a basis set: any other N-dimensional vector w can be expressed as a linear combination

    w = ∑_{i=1}^{N} αi v(i) ,   with the αi constants.

Take the largest eigenvalue to be λ1, and apply M a total of n times:

    M^n w = ∑_{i=1}^{N} αi λi^n v(i)

41 / 49

slide-42
SLIDE 42

Power Method, cont’d

    M^n w = ∑_{i=1}^{N} αi λi^n v(i)

For large n, the term with the largest eigenvalue dominates:

    (1/λ1^n) M^n w = α1 v(1) + ∑_{i=2}^{N} αi (λi/λ1)^n v(i) ≈ α1 v(1) ,

since λi/λ1 < 1 for all i ≠ 1.
Thus the principal eigenvector can be determined simply by applying the matrix M sufficiently many times to any vector which has a non-zero dot product with the principal eigenvector (i.e., non-vanishing α1 in the above) ⇒ this effectively projects onto the eigenvector of interest.
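A small numerical sketch (an assumption, not from the slides) of this argument: repeatedly applying a matrix to a generic vector, with renormalization, converges to the principal eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # any matrix with a dominant eigenvalue
w = rng.random(2)                   # generic start vector (almost surely alpha_1 != 0)

for _ in range(50):
    w = M @ w
    w = w / np.linalg.norm(w)       # renormalize so the vector doesn't blow up

vals, vecs = np.linalg.eigh(M)
print(np.round(w, 4))               # matches the eigenvector of the largest eigenvalue
print(np.round(vecs[:, -1], 4))     # (up to sign)
```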

42 / 49

slide-43
SLIDE 43

PageRank summary

Preprocessing

Given the graph of links, build the matrix P.
Apply teleportation.
From the modified matrix, compute π; πi is the PageRank of page i.

Query processing

Retrieve pages satisfying the query.
Rank them by their PageRank.
Return the reranked list to the user.

43 / 49

slide-44
SLIDE 44

PageRank issues

Real surfers are not random surfers – Markov model is not a good model of surfing.

Examples of nonrandom surfing: back button, short vs. long paths, bookmarks, directories – and search! But it’s good enough as a model for our purposes.

Simple PageRank ranking (as described on previous slide) produces bad results for many pages.

Consider the query [video service]. The Yahoo home page (i) has a very high PageRank and (ii) contains both words. If we rank all Boolean hits according to PageRank, then the Yahoo home page would be top-ranked. Clearly not desirable.

In practice: rank according to weighted combination of raw text match, anchor text match, PageRank & other factors

44 / 49

slide-45
SLIDE 45

Web graph example

[Figure: the example web graph with pages d0–d6 and anchor-text terms car, benz, ford, gm, honda, jaguar, jag, cat, leopard, tiger, lion, cheetah, speed]

Exercise: Without doing the actual computation, guess the ranking of the pages according to PageRank.

45 / 49

slide-46
SLIDE 46

Transition (probability) matrix

      d0    d1    d2    d3    d4    d5    d6
d0   0.00  0.00  1.00  0.00  0.00  0.00  0.00
d1   0.00  0.50  0.50  0.00  0.00  0.00  0.00
d2   0.33  0.00  0.33  0.33  0.00  0.00  0.00
d3   0.00  0.00  0.00  0.50  0.50  0.00  0.00
d4   0.00  0.00  0.00  0.00  0.00  0.00  1.00
d5   0.00  0.00  0.00  0.00  0.00  0.50  0.50
d6   0.00  0.00  0.00  0.33  0.33  0.00  0.33

46 / 49

slide-47
SLIDE 47

Transition matrix with teleporting, α = .14

      d0    d1    d2    d3    d4    d5    d6
d0   0.02  0.02  0.88  0.02  0.02  0.02  0.02
d1   0.02  0.45  0.45  0.02  0.02  0.02  0.02
d2   0.31  0.02  0.31  0.31  0.02  0.02  0.02
d3   0.02  0.02  0.02  0.45  0.45  0.02  0.02
d4   0.02  0.02  0.02  0.02  0.02  0.02  0.88
d5   0.02  0.02  0.02  0.02  0.02  0.45  0.45
d6   0.02  0.02  0.02  0.31  0.31  0.02  0.31
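A minimal sketch (not from the slides) of how this matrix is obtained: each row is a mixture of the uniform distribution (weight α = 0.14) and the original transition row (weight 1 − α).

```python
import numpy as np

alpha, n = 0.14, 7
# Plain transition matrix from the earlier slides (link matrix, row-normalized).
P_link = np.array([
    [0,   0,   1,   0,   0,   0,   0  ],
    [0,   1/2, 1/2, 0,   0,   0,   0  ],
    [1/3, 0,   1/3, 1/3, 0,   0,   0  ],
    [0,   0,   0,   1/2, 1/2, 0,   0  ],
    [0,   0,   0,   0,   0,   0,   1  ],
    [0,   0,   0,   0,   0,   1/2, 1/2],
    [0,   0,   0,   1/3, 1/3, 0,   1/3],
])
P = alpha / n + (1 - alpha) * P_link   # teleporting: alpha/n to every page, rest along links
print(np.round(P, 2))                  # matches the matrix above (0.02, 0.45, 0.88, 0.31, ...)
```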

47 / 49

slide-48
SLIDE 48

Power method vectors xP^k

       x     xP    xP^2  xP^3  xP^4  xP^5  xP^6  xP^7  xP^8  xP^9  xP^10 xP^11 xP^12 xP^13
d0   0.14  0.06  0.09  0.07  0.07  0.06  0.06  0.06  0.06  0.05  0.05  0.05  0.05  0.05
d1   0.14  0.08  0.06  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04
d2   0.14  0.25  0.18  0.17  0.15  0.14  0.13  0.12  0.12  0.12  0.12  0.11  0.11  0.11
d3   0.14  0.16  0.23  0.24  0.24  0.24  0.24  0.25  0.25  0.25  0.25  0.25  0.25  0.25
d4   0.14  0.12  0.16  0.19  0.19  0.20  0.21  0.21  0.21  0.21  0.21  0.21  0.21  0.21
d5   0.14  0.08  0.06  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04
d6   0.14  0.25  0.23  0.25  0.27  0.28  0.29  0.29  0.30  0.30  0.30  0.30  0.31  0.31
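A minimal sketch (an assumption, not from the slides) that reproduces this table by power iteration on the teleported matrix of the previous slide:

```python
import numpy as np

alpha, n = 0.14, 7
P_link = np.array([
    [0, 0, 1, 0, 0, 0, 0], [0, .5, .5, 0, 0, 0, 0], [1/3, 0, 1/3, 1/3, 0, 0, 0],
    [0, 0, 0, .5, .5, 0, 0], [0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, .5, .5],
    [0, 0, 0, 1/3, 1/3, 0, 1/3],
])
P = alpha / n + (1 - alpha) * P_link     # teleported transition matrix

x = np.full(n, 1 / n)                    # x = (0.14, ..., 0.14), the uniform start
for k in range(14):
    label = "x" if k == 0 else f"xP^{k}"
    print(f"{label:>6}:", np.round(x, 2))  # successive vectors match the columns of the table
    x = x @ P
# final vector ≈ (0.05, 0.04, 0.11, 0.25, 0.21, 0.04, 0.31) = PageRank vector pi
```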

48 / 49

slide-49
SLIDE 49

How important is PageRank?

Frequent claim: PageRank is the most important component of web ranking.

The reality:

There are several components that are at least as important: e.g., anchor text, phrases, proximity, tiered indexes, . . .
Rumor has it that PageRank in its original form (as presented here) has a negligible impact on ranking!
However, variants of a page's PageRank are still an essential part of ranking.
Addressing link spam is difficult and crucial.

49 / 49