
slide-1
SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 16: Citation Analysis and Link Analysis

Paul Ginsparg

Cornell University, Ithaca, NY

27 Oct 2009

1 / 40

slide-2
SLIDE 2

Administrativa

Assignment 3 now available: due Sun 8 Nov
Topics for rest of course (5 weeks, 10 lectures), MRS chapters 19, 20, 12, 13, 14, 16, 17:
web search
language models
text classification
vector space classification
clustering
+ MapReduce and the Google File System

2 / 40

slide-3
SLIDE 3

Overview

1 Recap
2 PageRank example
3 HITS: Hubs & Authorities
4 Discussion

3 / 40

slide-4
SLIDE 4

Outline

1 Recap
2 PageRank example
3 HITS: Hubs & Authorities
4 Discussion

4 / 40

slide-5
SLIDE 5

Model behind PageRank: Random walk

Imagine a web surfer doing a random walk on the web:
Start at a random page
At each step, go out of the current page along one of the links on that page, equiprobably
In the steady state, each page has a long-term visit rate. This long-term visit rate is the page's PageRank.
PageRank = long-term visit rate = steady-state probability

5 / 40

slide-6
SLIDE 6

One way of computing the PageRank π

Recall: regardless of where we start, we eventually reach the steady state π.
Start with any distribution x, e.g., the uniform distribution
After one step, we're at xP
After two steps, we're at xP^2
After k steps, we're at xP^k
Algorithm: multiply x by increasing powers of P until convergence. This is called the power method.

6 / 40

slide-7
SLIDE 7

Random Walker Redux

The PageRank r_j of page j is determined self-consistently by the equation

r_j = α/n + (1 − α) Σ_{i | i→j} r_i / d_i

where α is a number between 0 and 1 (originally taken to be 0.15), the sum on i is over pages i pointing to j, and d_i is the outgoing degree of page i.
Pages with high PageRank r_i that have low d_i, i.e., that don't point to too many other pages, convey the most PageRank to page j.
With probability 1 − α one goes at random to any of the pages j pointed to by page i, and with probability α one instead jumps at random to any page anywhere on the web.
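As a sketch, the fixed-point iteration implied by this equation can be written as follows; the three-page graph is made-up illustration data, not the example from these slides:

```python
# Fixed-point iteration of r_j = alpha/n + (1 - alpha) * sum_{i -> j} r_i / d_i.
def pagerank(links, alpha=0.15, iters=100):
    """links: dict mapping each page i to the list of pages it points to."""
    pages = sorted(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}          # start from the uniform distribution
    for _ in range(iters):
        new = {p: alpha / n for p in pages}  # teleportation contribution alpha/n
        for i, outs in links.items():
            for j in outs:                   # page i spreads (1-alpha)*r_i equally
                new[j] += (1 - alpha) * r[i] / len(outs)
        r = new
    return r

# Hypothetical toy graph; every page has at least one outlink
# (dangling pages would need separate handling).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)
print({p: round(v, 3) for p, v in r.items()})
```

Page c, pointed to by both a and b, ends up with the highest rank, and the ranks sum to 1 as a probability should.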

7 / 40

slide-8
SLIDE 8

Eigenvector problem

The incidence matrix A is defined by A_ij = 1 if i points to j and A_ij = 0 otherwise.

A matrix P giving the transition probability from page i to page j can be constructed in terms of the incidence matrix A as

P_ij = (α/n) O_ij + (1 − α) (1/d_i) A_ij

where n = total # of pages, d_i is the outdegree of node i, and O_ij = 1 (∀ i, j) is the matrix of all ones.

The matrix eigenvector relation

r P = r   (i.e., r = P^T r)

is equivalent to the equation on the previous slide (with r normalized as a probability, so that Σ_i r_i O_ij = Σ_i r_i = 1).
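A minimal sketch of this construction, assuming NumPy is available; the 3-page incidence matrix is hypothetical illustration data:

```python
import numpy as np

# P_ij = (alpha/n) * O_ij + (1 - alpha) * A_ij / d_i, as on the slide.
def transition_matrix(A, alpha=0.15):
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    d = A.sum(axis=1, keepdims=True)   # outdegree d_i of each node
    O = np.ones((n, n))                # matrix of all ones
    return (alpha / n) * O + (1 - alpha) * A / d

A = np.array([[0, 1, 1],   # hypothetical incidence matrix: row i marks pages i links to
              [0, 0, 1],
              [1, 0, 0]])
P = transition_matrix(A)
print(P.sum(axis=1))       # each row sums to alpha + (1 - alpha) = 1: P is stochastic
```

Note the sketch assumes every node has at least one outlink, so no row of A is all zeros.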

8 / 40

slide-9
SLIDE 9

Eigenvector problem, cont’d

By the Perron-Frobenius theorem, the matrix P has a unique principal eigenvector, corresponding to its largest eigenvalue, and its components are all positive.
(Since Σ_j A_ij / d_i = 1, we find Σ_j P_ij = α + (1 − α) = 1, so P is normalized such that its principal eigenvalue is 1.)
Thus the eigenvalue equation always has a solution.
To calculate the rank of all the pages, the crawler visits as many pages as possible and calculates the link structure of the web graph.
Calculating eigenvectors of enormous matrices can be painful, but in the case of the principal eigenvector there's a simple method.

9 / 40

slide-10
SLIDE 10

Power Method, cont’d

Expanding the starting vector w in the eigenvectors v^(i) of M (with coefficients α_i), repeated application of M gives

M^n w = Σ_{i=1}^{N} α_i λ_i^n v^(i)

For large n, the term with the largest eigenvalue dominates:

(1/λ_1^n) M^n w = α_1 v^(1) + Σ_{i=2}^{N} α_i (λ_i/λ_1)^n v^(i) ≈ α_1 v^(1) ,

since λ_i/λ_1 < 1 ∀ i ≠ 1.
Thus the principal eigenvector can be determined simply by applying a matrix M sufficiently many times to any vector which has non-zero dot product with the principal eigenvector (i.e., non-vanishing α_1 in the above) ⇒ effectively projects to the eigenvector of interest
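The argument can be sketched numerically; the 2×2 matrix below is an arbitrary symmetric example with distinct eigenvalues (3 and 1), not taken from the slides:

```python
import numpy as np

# Power method: repeatedly apply M to a vector; components along
# non-principal eigenvectors decay like (lambda_i / lambda_1)^n.
def power_method(M, w, iters=200):
    for _ in range(iters):
        w = M @ w
        w = w / np.linalg.norm(w)   # rescale so the vector neither blows up nor vanishes
    return w

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # eigenvalues 3 and 1; principal eigenvector ~ (1, 1)
v = power_method(M, np.array([1.0, 0.0]))
print(v)                            # close to (1/sqrt(2), 1/sqrt(2))
```

The starting vector (1, 0) has non-zero overlap with (1, 1), i.e., non-vanishing α_1, so the iteration projects onto the principal eigenvector; the non-principal component decays as (1/3)^n.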

10 / 40

slide-11
SLIDE 11

PageRank summary

Preprocessing

Given the graph of links, build matrix P
Apply teleportation
From the modified matrix, compute π
π_i is the PageRank of page i.

Query processing

Retrieve pages satisfying the query Rank them by their PageRank Return reranked list to the user

11 / 40

slide-12
SLIDE 12

PageRank issues

Real surfers are not random surfers – Markov model is not a good model of surfing.

Examples of nonrandom surfing: back button, short vs. long paths, bookmarks, directories – and search! But it’s good enough as a model for our purposes.

Simple PageRank ranking (as described on previous slide) produces bad results for many pages.

Consider the query video service. The Yahoo home page (i) has a very high PageRank and (ii) contains both words. If we rank all Boolean hits according to PageRank, then the Yahoo home page would be top-ranked. Clearly not desirable.

In practice: rank according to weighted combination of raw text match, anchor text match, PageRank & other factors

12 / 40

slide-13
SLIDE 13

Outline

1 Recap
2 PageRank example
3 HITS: Hubs & Authorities
4 Discussion

13 / 40

slide-14
SLIDE 14

Web graph example

[Figure: web graph on pages d0–d6, annotated with the terms car, benz, ford, gm, honda, jaguar, jag, cat, leopard, tiger, lion, cheetah, speed]

Exercise: Without doing the actual computation, guess the ranking of the pages according to PageRank.

14 / 40

slide-15
SLIDE 15

Transition (probability) matrix

      d0   d1   d2   d3   d4   d5   d6
d0  0.00 0.00 1.00 0.00 0.00 0.00 0.00
d1  0.00 0.50 0.50 0.00 0.00 0.00 0.00
d2  0.33 0.00 0.33 0.33 0.00 0.00 0.00
d3  0.00 0.00 0.00 0.50 0.50 0.00 0.00
d4  0.00 0.00 0.00 0.00 0.00 0.00 1.00
d5  0.00 0.00 0.00 0.00 0.00 0.50 0.50
d6  0.00 0.00 0.00 0.33 0.33 0.00 0.33

15 / 40

slide-16
SLIDE 16

Transition matrix with teleporting, α = .14

      d0   d1   d2   d3   d4   d5   d6
d0  0.02 0.02 0.88 0.02 0.02 0.02 0.02
d1  0.02 0.45 0.45 0.02 0.02 0.02 0.02
d2  0.31 0.02 0.31 0.31 0.02 0.02 0.02
d3  0.02 0.02 0.02 0.45 0.45 0.02 0.02
d4  0.02 0.02 0.02 0.02 0.02 0.02 0.88
d5  0.02 0.02 0.02 0.02 0.02 0.45 0.45
d6  0.02 0.02 0.02 0.31 0.31 0.02 0.31

16 / 40

slide-17
SLIDE 17

Power method vectors xPk

      x    xP   xP^2 xP^3 xP^4 xP^5 xP^6 xP^7 xP^8 xP^9 xP^10 xP^11 xP^12 xP^13
d0  0.14 0.06 0.09 0.07 0.07 0.06 0.06 0.06 0.06 0.05 0.05 0.05 0.05 0.05
d1  0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
d2  0.14 0.25 0.18 0.17 0.15 0.14 0.13 0.12 0.12 0.12 0.12 0.11 0.11 0.11
d3  0.14 0.16 0.23 0.24 0.24 0.24 0.24 0.25 0.25 0.25 0.25 0.25 0.25 0.25
d4  0.14 0.12 0.16 0.19 0.19 0.20 0.21 0.21 0.21 0.21 0.21 0.21 0.21 0.21
d5  0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
d6  0.14 0.25 0.23 0.25 0.27 0.28 0.29 0.29 0.30 0.30 0.30 0.30 0.31 0.31
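The example can be checked numerically; this sketch rebuilds the teleport-adjusted matrix from the slide-15 transition matrix with α = 0.14 and iterates from the uniform distribution:

```python
import numpy as np

P = np.array([  # transition matrix of the example web graph (before teleporting)
    [0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.50, 0.50, 0.00, 0.00, 0.00, 0.00],
    [1/3,  0.00, 1/3,  1/3,  0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.50, 0.50, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.50, 0.50],
    [0.00, 0.00, 0.00, 1/3,  1/3,  0.00, 1/3],
])
alpha, n = 0.14, 7
Pt = alpha / n + (1 - alpha) * P        # teleporting adds alpha/n = 0.02 everywhere

x = np.full(n, 1 / n)                   # uniform starting distribution
for _ in range(50):                     # well past the ~13 steps shown above
    x = x @ Pt
print(np.round(x, 2))                   # steady state: d6 highest, then d3, d4
```

The result matches the last column of the table: d6 ends up top-ranked, followed by d3 and d4.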

17 / 40

slide-18
SLIDE 18

How important is PageRank?

Frequent claim: PageRank is the most important component of web ranking.

The reality:

There are several components that are at least as important: e.g., anchor text, phrases, proximity, tiered indexes . . .
Rumor has it that PageRank in its original form (as presented here) has a negligible impact on ranking!
However, variants of a page's PageRank are still an essential part of ranking.
Addressing link spam is difficult and crucial.

18 / 40

slide-19
SLIDE 19

Outline

1 Recap
2 PageRank example
3 HITS: Hubs & Authorities
4 Discussion

19 / 40

slide-20
SLIDE 20

HITS – Hyperlink-Induced Topic Search

Premise: there are two different types of relevance on the web. Relevance type 1: Hubs. A hub page is a good list of links to pages answering the information need.

Bob’s list of recommended hotels in London

Relevance type 2: Authorities. An authority page is a direct answer to the information need.

Home page of Four Seasons Hotel London By definition: Links to authority pages occur repeatedly on hub pages.

Most approaches to search (including PageRank ranking) don’t make the distinction between these two very different types of relevance.

20 / 40

slide-21
SLIDE 21

Hubs and authorities: Definition

A good hub page for a topic links to many authority pages for that topic. A good authority page for a topic is linked to by many hub pages for that topic. Circular definition – we will turn this into an iterative computation.

21 / 40

slide-22
SLIDE 22

Example for hubs and authorities

Hubs:
www.bestfares.com
www.airlinesquality.com
blogs.usatoday.com/sky
aviationblog.dallasnews.com

Authorities:
www.aa.com
www.delta.com
www.united.com

22 / 40

slide-23
SLIDE 23

HITS=hypertext induced topic search, I

[Figure: bipartite graph of hub pages on the left pointing to authority pages on the right]

23 / 40

slide-24
SLIDE 24

HITS=hypertext induced topic search, II

Every page has both an authority weight a_i and a hub weight h_i, defined to satisfy

a_j = Σ_{i | i→j} h_i ,   h_i = Σ_{j | i→j} a_j ,

i.e., the authority weight of a site j is given by the sum of the hub weights of sites i that point to the site, and the hub weight of a site i is given by the sum of the authority weights of the sites j to which it points.
In terms of the incidence matrix A, these can be written

a = A^T h ,   h = A a .

24 / 40

slide-25
SLIDE 25

HITS=hypertext induced topic search, III

Start with some trial forms of the hub and authority weights, h_i^(0) and a_i^(0), and iterate the above equations:

a_j^(1) = Σ_{i | i→j} h_i^(0) ,   h_i^(1) = Σ_{j | i→j} a_j^(0)

to provide more refined guesses h_i^(1) and a_i^(1), and then continue. In matrix form, the jth such iteration can be written

a^(j) = A^T h^(j−1) ,   h^(j) = A a^(j−1) .

Note that the result of two iterations can be written

a^(j) = A^T A a^(j−2) ,   h^(j) = A A^T h^(j−2) ,

so the result of 2n such iterations is

a^(2n) = (A^T A)^n a^(0) ,   h^(2n) = (A A^T)^n h^(0) .

25 / 40

slide-26
SLIDE 26

HITS=hypertext induced topic search, IV

The matrices A^T A and A A^T are symmetric with non-negative entries, so for a suitably chosen subset of nodes such that they're irreducible, the Perron-Frobenius theorem will apply. In that case, the above iterative procedure will converge, and we see that the authority weights a_i will be given by the components of the principal eigenvector of A^T A, and the hub weights h_i by the components of the principal eigenvector of A A^T.
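As a sketch of this convergence claim, the iteration can be compared against a direct eigenvector computation (assuming NumPy; the incidence matrix is a hypothetical 3-page example):

```python
import numpy as np

A = np.array([[0, 1, 1],   # hypothetical incidence matrix: row i marks pages i links to
              [0, 0, 1],
              [1, 1, 0]], dtype=float)

a = np.ones(3)             # trial authority weights a^(0)
h = np.ones(3)             # trial hub weights h^(0)
for _ in range(100):
    a = A.T @ h            # authority = sum of hub weights of in-linking pages
    h = A @ a              # hub = sum of authority weights of linked-to pages
    a /= np.linalg.norm(a) # rescale; only relative values matter
    h /= np.linalg.norm(h)

# Compare with the principal eigenvector of A^T A computed directly.
vals, vecs = np.linalg.eigh(A.T @ A)
principal = np.abs(vecs[:, np.argmax(vals)])
print(np.round(a, 3), np.round(principal, 3))  # the two should agree
```

The iterated authority vector matches the principal eigenvector of A^T A, as the eigenvalue analysis predicts; the hub vector likewise matches that of A A^T.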

26 / 40

slide-27
SLIDE 27

Root set and base set (1)

Do a regular web search first
Call the search result the root set
Find all pages that are linked to or link to pages in the root set
Call this larger set the base set
Finally, compute hubs and authorities for the base set (which we'll view as a small web graph)

27 / 40

slide-28
SLIDE 28

Root set and base set (2)

[Figure: root set nested inside the base set]

The root set + nodes to which root set nodes link + nodes that link to root set nodes = the base set

28 / 40

slide-29
SLIDE 29

Root set and base set (3)

Root set typically has 200–1000 nodes. Base set may have up to 5000 nodes.
Computation of the base set:
Follow outlinks by parsing the pages in the root set
Find d's inlinks by searching for all pages containing a link to d
This assumes that our inverted index supports search for links (in addition to terms).
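A sketch of the base-set computation, assuming hypothetical in-memory outlink and inlink indexes (in practice the inlinks would come from the link-search capability of the inverted index):

```python
# outlinks[d] = pages d links to; inlinks[d] = pages linking to d.
# Both dicts and all page names below are made-up illustration data.
def base_set(root, outlinks, inlinks):
    base = set(root)
    for d in root:
        base.update(outlinks.get(d, []))   # nodes the root set links to
        base.update(inlinks.get(d, []))    # nodes that link into the root set
    return base

outlinks = {"r1": ["p1", "p2"], "r2": ["p2"]}
inlinks = {"r1": ["q1"], "r2": ["q1", "q2"]}
print(sorted(base_set(["r1", "r2"], outlinks, inlinks)))
# ['p1', 'p2', 'q1', 'q2', 'r1', 'r2']
```

The base set is the root set plus one step of links in each direction, exactly as in the figure on the previous slide.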

29 / 40

slide-30
SLIDE 30

Hub and authority scores

Compute for each page d in the base set a hub score h(d) and an authority score a(d)
Initialization: for all d: h(d) = 1, a(d) = 1
Iteratively update all h(d), a(d)
After convergence:
Output pages with highest h scores as top hubs
Output pages with highest a scores as top authorities
So we output two ranked lists

30 / 40

slide-31
SLIDE 31

Iterative update

For all d: h(d) = Σ_{d→y} a(y)

For all d: a(d) = Σ_{y→d} h(y)

[Figure: page d with links to and from pages y1, y2, y3]

Iterate these two steps until convergence
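The two update steps can be sketched per page on a hypothetical edge list, with the rescaling discussed on the next slide (only relative scores matter):

```python
# Per-page HITS updates h(d) = sum_{d->y} a(y), a(d) = sum_{y->d} h(y).
# The edge list is made-up illustration data: d links to y1, y2, y3.
edges = [("d", "y1"), ("d", "y2"), ("d", "y3"), ("y1", "d")]
pages = {p for e in edges for p in e}
h = {p: 1.0 for p in pages}   # initialization: all scores 1
a = {p: 1.0 for p in pages}

for _ in range(50):
    h = {d: sum(a[y] for (s, y) in edges if s == d) for d in pages}
    a = {d: sum(h[y] for (y, t) in edges if t == d) for d in pages}
    zh, za = sum(h.values()) or 1, sum(a.values()) or 1
    h = {d: v / zh for d, v in h.items()}   # scale down after each iteration;
    a = {d: v / za for d, v in a.items()}   # the scaling factor is arbitrary

print(max(h, key=h.get))   # 'd' links to everything, so it is the top hub
```

Page d, which links to all three y pages, converges to the top hub score, while the y pages it points to pick up the authority weight.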

31 / 40

slide-32
SLIDE 32

Details

Scaling

To prevent the a() and h() values from getting too big, we can scale down after each iteration. The scaling factor doesn't really matter: we care about the relative (as opposed to absolute) values of the scores.

In most cases, the algorithm converges after a few iterations.

32 / 40

slide-33
SLIDE 33

Hubs & Authorities: Comments

HITS can pull together good pages regardless of page content. Once the base set is assembled, we only do link analysis, no text matching. Pages in the base set often do not contain any of the query words. In theory, an English query can retrieve Japanese-language pages!

If supported by the link structure between English and Japanese pages

Danger: topic drift – the pages found by following links may not be related to the original query.

33 / 40

slide-34
SLIDE 34

PageRank vs. HITS: Discussion

PageRank can be precomputed, HITS has to be computed at query time.

HITS is too expensive in most application scenarios.

PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization (ii) the set of pages to apply the formalization to. These two are orthogonal.

We could also apply HITS to the entire web and PageRank to a small base set.

Claim: On the web, a good hub is almost always also a good authority. The actual difference between PageRank ranking and HITS ranking is therefore not as large as one might expect. Note also: topic-specific PageRank

34 / 40

slide-35
SLIDE 35

Outline

1 Recap
2 PageRank example
3 HITS: Hubs & Authorities
4 Discussion

35 / 40

slide-36
SLIDE 36

Questions I

Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine. Seventh International World Wide Web Conference, Brisbane, Australia, 1998. http://www-db.stanford.edu/~backrub/google.html
What was the authors' institutional status at the time of writing?
What are their criticisms of conventional ranking methods based on vector similarity? (Why not standard full-text indexing with tf.idf?)
What do the authors mean by the "precision" they wish to optimize? What does this assume about users and their wishes?
How does their view of relevance differ from the conventional notion?
Why did this appear in a relatively obscure venue?

36 / 40

slide-37
SLIDE 37

Questions II

How does PageRank rank the hits returned by a query, as compared with traditional search engines?
What other ranking methods do they propose?
How do they use anchor text, and what are the computational challenges?
Why are they unenthusiastic about using metadata for indexing the web?
How many web pages were they indexing back then? How many now? How many queries per day?
What is their strategy for scalability? How did they implement such a large-scale system with a small technical staff?

37 / 40

slide-38
SLIDE 38

Questions III

2.1.1: What is the typo in their equation PR(A) = (1 − d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))? Explain.
3.2: "There are even numerous companies which specialize in manipulating search engines for profit." How was this problem dealt with at the time?
4.3: What are some of the "social considerations" of web crawling?
4.5.2 Feedback: "Figuring out the right values for these parameters is something of a black art"
6.1 Future Work: 100M pages? What have they realized? Where have they done more/less than projected?

38 / 40

slide-39
SLIDE 39

Questions IV

Where does this now stand? 1.3.2: "we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems"
What did they mean by 6.4: "We hope Google will be a resource for searchers and researchers all around the world"?
Appendix A: "we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm"

39 / 40

slide-40
SLIDE 40

Where are they now?

“. . . advertising-funded search engines will inherently be biased toward the advertisers and away from the needs of consumers.”

http://www.nytimes.com/2005/10/30/business/yourmoney/30google.html

“Google Wants to Dominate Madison Avenue, Too” (2005):

“They softened that line a bit by the time they got around to pitching their business to venture capitalists, allowing that selling ads would be a handy safety net if other, less distasteful ideas for generating revenue didn’t pan out. . . . As it turned out, the safety net was a trampoline. Those little ads — 12 word snippets of text, linked to topics that users are actually interested in — have turned Google into one of the biggest advertising vehicles the world has ever seen. This year, Google will sell $6.1 billion in ads, nearly double what it sold last year . . . That is more advertising than is sold by any newspaper chain, magazine publisher or television network. By next year . . . advertising revenue of $9.5 billion. That would place it fourth among American media companies in total ad sales after Viacom, the News Corporation and the Walt Disney Company, but ahead of giants including NBC Universal and Time Warner.”

40 / 40