INFO 4300 / CS4300 Information Retrieval

SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 6: Ranking

Paul Ginsparg

Cornell University, Ithaca, NY

13 Sep 2011

SLIDE 2

Administrativa

Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2011fa/
Assignment 1. Posted: 2 Sep, Due: Sun, 18 Sep
Lectures: Tuesday and Thursday 11:40-12:55, Kimball B11
Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Physical Sciences Building 452
Instructor’s Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail instructor to schedule an appointment
Teaching Assistant: Saeed Abdullah, office hour Fri 3:30pm-4:30pm in the small conference room (133) at 301 College Ave, and by email; use cs4300-l@lists.cs.cornell.edu
Course text at: http://informationretrieval.org/

Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze

see also

Information Retrieval, S. Büttcher, C. Clarke, G. Cormack

http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307

SLIDE 3

Administrativa

Reread assignment 1 instructions.
The Midterm Examination is on Thu, Oct 13, from 11:40 to 12:55, in Kimball B11. It will be open book. The topics to be examined are all the lectures and discussion class readings before the midterm break.
According to the registrar (http://registrar.sas.cornell.edu/Sched/EXFA.html), the final examination is Wed 14 Dec 7:00-9:30 pm (location TBD).

SLIDE 4

Discussion 2, 20 Sep

For this class, read and be prepared to discuss the following:

  • K. Sparck Jones, “A statistical interpretation of term specificity and its application in retrieval”. Journal of Documentation 28, 11-21, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf
  • Letter by Stephen Robertson and reply by Karen Sparck Jones, Journal of Documentation 28, 164-165, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/letters.pdf

The first paper introduced the term weighting scheme known as inverse document frequency (IDF). Some of the terminology used in this paper will be introduced in the lectures. The letter describes a slightly different way of expressing IDF, which has become the standard form. (Stephen Robertson has mounted these papers on his Web site with permission from the publisher.)

SLIDE 5

Overview

1. Recap
2. Zones
3. Why rank?
4. More on cosine
5. Implementation

SLIDE 6

Outline

1. Recap
2. Zones
3. Why rank?
4. More on cosine
5. Implementation

SLIDE 7

Query Scores: S(q, d) = Σ_{t∈q} w_t^(idf) · w_{t,d}^(tf)   (ltn.lnn)

  • 1. “A sentence is a document.”
  • 2. “A document is a sentence and a sentence is a document.”
  • 3. “This document is short.”
  • 4. “This document is a sentence.”

tf_{t,d}:
             doc1  doc2  doc3  doc4
  a            2     4     -     1
  and          -     1     -     -
  document     1     2     1     1
  is           1     2     1     1
  sentence     1     2     -     1
  short        -     -     1     -
  this         -     -     1     1

→ w^(tf)_{t,d} = 1 + log10(tf_{t,d}):
             doc1  doc2  doc3  doc4
  a           1.3   1.6    -     1
  and          -     1     -     -
  document     1    1.3    1     1
  is           1    1.3    1     1
  sentence     1    1.3    -     1
  short        -     -     1     -
  this         -     -     1     1

df_t and w^(idf)_t = log10(N/df_t), N = 4:
             df    w^(idf)
  a           3     .125
  and         1     .6
  document    4     0
  is          4     0
  sentence    2     .3
  short       1     .6
  this        2     .3

  • log(4/4) = 0, log(4/3) ≈ .125, log(4/2) ≈ .3, log(4/1) ≈ .6
  • Query: “a sentence”

doc1: .125 ∗ 1.3 + .3 ∗ 1 = .46,   doc2: .125 ∗ 1.6 + .3 ∗ 1.3 = .59
doc3: .125 ∗ 0 + .3 ∗ 0 = 0,   doc4: .125 ∗ 1 + .3 ∗ 1 = .425

  • Query: “short sentence”

doc1: .6 ∗ 0 + .3 ∗ 1 = .3,   doc2: .6 ∗ 0 + .3 ∗ 1.3 = .39
doc3: .6 ∗ 1 + .3 ∗ 0 = .6,   doc4: .6 ∗ 0 + .3 ∗ 1 = .3
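To make the ltn.lnn recipe above concrete, here is a small Python sketch (illustrative, not from the slides) that reproduces these scores for the four toy documents; log base 10 and the absence of document-length normalization are the assumptions stated on this slide.

    import math
    from collections import Counter

    docs = [
        "a sentence is a document",
        "a document is a sentence and a sentence is a document",
        "this document is short",
        "this document is a sentence",
    ]

    N = len(docs)
    tfs = [Counter(d.split()) for d in docs]            # raw term frequencies per document
    df = Counter(t for tf in tfs for t in tf)           # document frequencies
    idf = {t: math.log10(N / df[t]) for t in df}        # w^(idf)_t = log10(N / df_t)

    def w_tf(tf):
        """Sublinear tf weight: 1 + log10(tf) for tf > 0, else 0."""
        return 1 + math.log10(tf) if tf > 0 else 0.0

    def score_ltn_lnn(query):
        """S(q, d) = sum over query terms t of w^(idf)_t * w^(tf)_{t,d}."""
        return [sum(idf.get(t, 0.0) * w_tf(tf[t]) for t in query.split()) for tf in tfs]

    print(score_ltn_lnn("a sentence"))       # ≈ [0.46, 0.59, 0.00, 0.43]
    print(score_ltn_lnn("short sentence"))   # ≈ [0.30, 0.39, 0.60, 0.30]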

SLIDE 8

Query Scores: S(q, d) = Σ_{t∈q} w_t^(idf) · ŵ_{t,d}^(tf)   (ltn.lnc)

  • 1. “A sentence is a document.”
  • 2. “A document is a sentence and a sentence is a document.”
  • 3. “This document is short.”
  • 4. “This document is a sentence.”

w^(tf)_{t,d}:
             doc1  doc2  doc3  doc4
  a           1.3   1.6    -     1
  and          -     1     -     -
  document     1    1.3    1     1
  is           1    1.3    1     1
  sentence     1    1.3    -     1
  short        -     -     1     -
  this         -     -     1     1

→ cosine-normalized ŵ^(tf)_{t,d} (each column divided by its length):
             doc1  doc2  doc3  doc4
  a           .60   .54    -    .45
  and          -    .34    -     -
  document    .46   .44   .5    .45
  is          .46   .44   .5    .45
  sentence    .46   .44    -    .45
  short        -     -    .5     -
  this         -     -    .5    .45

df_t and w^(idf)_t:
             df    w^(idf)
  a           3     .125
  and         1     .6
  document    4     0
  is          4     0
  sentence    2     .3
  short       1     .6
  this        2     .3

lengths(doc1, . . ., doc4) = (2.17, 2.94, 2, 2.24)

  • Query: “a sentence”

doc1: .125 ∗ .6 + .3 ∗ .46 = .21,   doc2: .125 ∗ .54 + .3 ∗ .44 = .20
doc3: .125 ∗ 0 + .3 ∗ 0 = 0,   doc4: .125 ∗ .45 + .3 ∗ .45 = .19

  • Query: “short sentence”

doc1: .6 ∗ 0 + .3 ∗ .46 = .14,   doc2: .6 ∗ 0 + .3 ∗ .44 = .133
doc3: .6 ∗ .5 + .3 ∗ 0 = .3,   doc4: .6 ∗ 0 + .3 ∗ .45 = .134
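For ltn.lnc only the document side changes: each document’s w^(tf) vector is divided by its Euclidean length before the dot product. A self-contained sketch (illustrative; it repeats the setup from the previous block):

    import math
    from collections import Counter

    docs = ["a sentence is a document",
            "a document is a sentence and a sentence is a document",
            "this document is short",
            "this document is a sentence"]
    N = len(docs)
    tfs = [Counter(d.split()) for d in docs]
    df = Counter(t for tf in tfs for t in tf)
    idf = {t: math.log10(N / df[t]) for t in df}

    def w_tf(tf):
        return 1 + math.log10(tf) if tf > 0 else 0.0

    def score_ltn_lnc(query):
        """Like ltn.lnn, but cosine-normalize the document's tf weights (the 'c')."""
        scores = []
        for tf in tfs:
            w = {t: w_tf(c) for t, c in tf.items()}
            length = math.sqrt(sum(v * v for v in w.values()))   # the lengths (2.17, 2.94, 2, 2.24)
            scores.append(sum(idf.get(t, 0.0) * w.get(t, 0.0) / length for t in query.split()))
        return scores

    print(score_ltn_lnc("a sentence"))       # ≈ [0.21, 0.20, 0.00, 0.19]
    print(score_ltn_lnc("short sentence"))   # ≈ [0.14, 0.13, 0.30, 0.13]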

SLIDE 9

Cosine similarity between query and document

cos(q, d) = sim(q, d) = (q/|q|) · (d/|d|) = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) · √(Σ_{i=1}^{|V|} d_i²) )

q_i is the tf-idf weight (idf) of term i in the query.
d_i is the tf-idf weight (tf) of term i in the document.
|q| and |d| are the lengths of q and d.
q/|q| and d/|d| are length-1 vectors (= normalized).
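As a concrete illustration (a sketch, not from the slides), cosine similarity over two sparse {term: weight} vectors:

    import math

    def cosine(q, d):
        """cos(q, d) = Σ_i q_i d_i / (|q| |d|) for sparse term-weight dictionaries."""
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

    print(cosine({"rich": 1.0, "poor": 0.5}, {"rich": 0.3, "poor": 0.9}))   # ≈ 0.71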

SLIDE 10

Cosine similarity illustrated

[Figure: query vector v(q) and document vectors v(d1), v(d2), v(d3) plotted in a two-dimensional term space with axes “rich” and “poor”; θ is the angle between v(q) and a document vector.]

SLIDE 11

Variant tf-idf functions

We’ve considered sublinear tf scaling (wf_{t,d} = 1 + log tf_{t,d}).
Or normalize instead by the maximum tf in the document, tf_max(d):

ntf_{t,d} = a + (1 − a) · tf_{t,d} / tf_max(d)

where a ∈ [0, 1] (e.g., .4) is a smoothing term to avoid large swings in ntf due to small changes in tf (a short code sketch follows below).
This eliminates the repeated-content problem (d′ = d + d), but has other issues:

  • sensitive to changes in the stop word list
  • outlier terms with large tf
  • skewed distribution of many nearly-most-frequent terms
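A minimal sketch of this maximum-tf normalization (illustrative only, with the smoothing term a = 0.4 mentioned above):

    def augmented_ntf(tf_counts, a=0.4):
        """Maximum-tf normalization: ntf_{t,d} = a + (1 - a) * tf_{t,d} / tf_max(d)."""
        tf_max = max(tf_counts.values())
        return {t: a + (1 - a) * tf / tf_max for t, tf in tf_counts.items()}

    # Doubling a document (d' = d + d) doubles every tf and tf_max, so ntf is unchanged.
    print(augmented_ntf({"a": 4, "and": 1, "document": 2, "is": 2, "sentence": 2}))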

SLIDE 12

Components of tf.idf weighting

Term frequency:
  n (natural)      tf_{t,d}
  l (logarithm)    1 + log(tf_{t,d})
  a (augmented)    0.5 + 0.5 × tf_{t,d} / max_t(tf_{t,d})
  b (boolean)      1 if tf_{t,d} > 0, 0 otherwise
  L (log ave)      (1 + log(tf_{t,d})) / (1 + log(ave_{t∈d}(tf_{t,d})))

Document frequency:
  n (no)           1
  t (idf)          log(N / df_t)
  p (prob idf)     max{0, log((N − df_t) / df_t)}

Normalization:
  n (none)             1
  c (cosine)           1 / √(w_1² + w_2² + . . . + w_M²)
  u (pivoted unique)   1/u
  b (byte size)        1/CharLength^α, α < 1

Best known combination of weighting options
Default: no weighting

SLIDE 13

tf.idf example

We often use different weightings for queries and documents.
Notation: qqq.ddd (term frequency / document frequency / normalization) for (query.document)
Example: ltn.lnc
  query: logarithmic tf, idf, no normalization
  document: logarithmic tf, no df weighting, cosine normalization
Isn’t it bad to not idf-weight the document?
Example query: “best car insurance”
Example document: “car insurance auto insurance”

SLIDE 14

tf.idf example: ltn.lnc

Query: “best car insurance”. Document: “car insurance auto insurance”.

word       |          query                    |         document                | product
           | tf-raw tf-wght   df    idf weight | tf-raw tf-wght weight  n’lized  |
auto       |   0      0      5000   2.3   0    |   1      1       1      0.52    |   0
best       |   1      1     50000   1.3  1.3   |   0      0       0      0       |   0
car        |   1      1     10000   2.0  2.0   |   1      1       1      0.52    |  1.04
insurance  |   1      1      1000   3.0  3.0   |   2     1.3     1.3     0.68    |  2.04

Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n’lized: document weights after cosine normalization; product: the product of final query weight and final document weight.

Document length: √(1² + 0² + 1² + 1.3²) ≈ 1.92, so 1/1.92 ≈ 0.52 and 1.3/1.92 ≈ 0.68.

Final similarity score between query and document: Σ_i w_qi · w_di = 0 + 0 + 1.04 + 2.04 = 3.08
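A short illustrative check of this computation in Python (not from the deck; a collection size of N = 10^6 is assumed because it reproduces the idf column above):

    import math

    N = 1_000_000
    df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

    query_tf = {"best": 1, "car": 1, "insurance": 1}
    doc_tf = {"car": 1, "insurance": 2, "auto": 1}

    def w_tf(tf):
        return 1 + math.log10(tf)                                 # logarithmic tf ('l')

    q_w = {t: w_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}   # l, t, n
    d_w = {t: w_tf(tf) for t, tf in doc_tf.items()}                             # l, n, ...
    norm = math.sqrt(sum(w * w for w in d_w.values()))                          # ..., c
    d_w = {t: w / norm for t, w in d_w.items()}

    score = sum(qw * d_w.get(t, 0.0) for t, qw in q_w.items())
    print(round(score, 2))   # ≈ 3.07 (the slide's 3.08 uses rounded intermediate values)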

SLIDE 15

Outline

1. Recap
2. Zones
3. Why rank?
4. More on cosine
5. Implementation

SLIDE 16

Parametric and Zone indices

Digital documents have additional structure: metadata encoded in machine-parseable form (e.g., author, title, date of publication, . . .).
One parametric index for each field.
Fields: take a finite set of values (e.g., dates of authorship).
Zones: arbitrary free text (e.g., titles, abstracts).
Permits searching for documents by Shakespeare written in 1601 containing the phrase “alas poor Yorick”,
or finding documents with “merchant” in the title and “william” in the author list and the phrase “gentle rain” in the body.
Use separate indexes for each field and zone, or use william.abstract, william.title, william.author.
Permits weighted zone scoring.

SLIDE 17

Weighted Zone Scoring

Given a boolean query q and a document d, assign to the pair (q, d) a score in [0, 1] by computing a linear combination of zone scores.
Let g_1, . . . , g_ℓ ∈ [0, 1] such that Σ_{i=1}^{ℓ} g_i = 1.
For 1 ≤ i ≤ ℓ, let s_i be the score between q and the i-th zone. Then the weighted zone score is defined as Σ_{i=1}^{ℓ} g_i s_i.

Example: Three zones: author, title, body; g_1 = .2, g_2 = .5, g_3 = .3 (match in author zone least important).
Compute weighted zone scores directly from inverted indexes: instead of adding a document to the set of results as for a boolean AND query, now compute a score for each document.
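A minimal sketch of weighted zone scoring (illustrative, not from the slides), where the per-zone score s_i is simply 1 if every query term occurs in zone i and 0 otherwise:

    ZONE_WEIGHTS = {"author": 0.2, "title": 0.5, "body": 0.3}   # the g_i, summing to 1

    def weighted_zone_score(query_terms, doc_zones, weights=ZONE_WEIGHTS):
        """Score(q, d) = sum over zones i of g_i * s_i, with boolean per-zone scores s_i."""
        score = 0.0
        for zone, g in weights.items():
            tokens = doc_zones.get(zone, "").lower().split()
            s_i = 1.0 if all(t in tokens for t in query_terms) else 0.0
            score += g * s_i
        return score

    doc = {"author": "william shakespeare",
           "title": "the merchant of venice",
           "body": "the quality of mercy droppeth as the gentle rain from heaven"}
    print(weighted_zone_score(["merchant"], doc))          # 0.5 (title match only)
    print(weighted_zone_score(["gentle", "rain"], doc))    # 0.3 (body match only)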

SLIDE 18

Learning Weights

How to determine the weights gi for weighted zone scoring?

  • A. specified by expert
  • B. “learned” using training examples that have been judged editorially (machine-learned relevance)
  • 1. given a set of training examples [(q, d) plus a relevance judgment (e.g., yes/no)]
  • 2. set the weights g_i to best approximate the relevance judgments

Expensive component: labor-intensive assembly of user-generated relevance judgments, especially expensive in a rapidly changing collection (such as the Web). Or use “passive collaborative feedback”? (clickthrough data)

SLIDE 19

SLIDE 20

Machine Learned Relevance

Given a table s_T(d, q), s_B(d, q) of Boolean matches, and relevance judgments r(d, q) (also, e.g., binary) of document d relevant to query q (see fig. 6.5 in text), compute a score for each of the training examples

score(d, q) = g · s_T(d, q) + (1 − g) · s_B(d, q)

and compare it with r(d, q) using an error function

ε(g, Φ_j) = ( r(d_j, q_j) − score(d_j, q_j) )²

for each training example. Choose g to minimize the total error Σ_j ε(g, Φ_j) (a quadratic function of g, so elementary algebra in this case; more generally a sophisticated optimization problem).
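Since the total error is quadratic in g, it can be minimized in closed form. An illustrative least-squares sketch (the training triples are made up; a counting-based shortcut exists for the fully binary case but is not reproduced here):

    def learn_g(examples):
        """Least-squares g for score = g*s_T + (1 - g)*s_B against judgments r.

        examples: list of (s_T, s_B, r) triples.
        Writing score = s_B + g*(s_T - s_B) and minimizing sum_j (r_j - score_j)^2 gives
        g = sum (s_T - s_B)(r - s_B) / sum (s_T - s_B)^2, clamped to [0, 1].
        """
        num = sum((sT - sB) * (r - sB) for sT, sB, r in examples)
        den = sum((sT - sB) ** 2 for sT, sB, r in examples)
        g = num / den if den else 0.5
        return min(1.0, max(0.0, g))

    # hypothetical training examples: (title match s_T, body match s_B, relevance r)
    train = [(1, 1, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0), (0, 1, 1), (1, 0, 0)]
    print(learn_g(train))   # 0.5 for this toy data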

SLIDE 21

Outline

1. Recap
2. Zones
3. Why rank?
4. More on cosine
5. Implementation

SLIDE 22

Why is ranking so important?

Two lectures ago: Problems with unranked retrieval

Users want to look at a few results – not thousands.
It’s very hard to write queries that produce a few results, even for expert searchers.
→ Ranking is important because it effectively reduces a large set of results to a very small one.

Next: More data on “users only look at a few results”.
Actually, in the vast majority of cases they only look at 1, 2, or 3 results.

SLIDE 23

Empirical investigation of the effect of ranking

How can we measure how important ranking is?
Observe what searchers do when they are searching in a controlled setting:

  • Videotape them
  • Ask them to “think aloud”
  • Interview them
  • Eye-track them
  • Time them
  • Record and count their clicks

The following slides are from Dan Russell’s JCDL talk 2007. Dan Russell is the “Über Tech Lead for Search Quality & User Happiness” at Google.

SLIDE 24

Interview video

So . . . Did you notice the FTD official site?
To be honest I didn’t even look at that. At first I saw “from $20” and $20 is what I was looking for. To be honest, 1800-flowers is what I’m familiar with and why I went there next even though I kind of assumed they wouldn’t have $20 flowers.
And you knew they were expensive?
I knew they were expensive but I thought “hey, maybe they’ve got some flowers for under $20 here . . .”
But you didn’t notice the FTD?
No I didn’t, actually. . . that’s really funny.

SLIDE 25

SLIDE 26

Local work

Granka, L., Joachims, T., and Gay, G. (2004) “Eye-Tracking Analysis of User Behavior in WWW Search”, Proceedings of the 28th Annual ACM Conference on Research and Development in Information Retrieval (SIGIR ’04). http://www.cs.cornell.edu/People/tj/publications/granka_etal_04a.pdf

SLIDE 27

SLIDE 28

SLIDE 29

SLIDE 30

SLIDE 31

Out of date?

Use of top and right margins
“instant” results (only for high bandwidth users?)
mobile devices

SLIDE 32

Importance of ranking: Summary

Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower-ranked pages (7, 8, 9, 10).
Clicking: The distribution is even more skewed for clicking. In 1 out of 2 cases, users click on the top-ranked page. Even if the top-ranked page is not relevant, 30% of users will click on it.
→ Getting the ranking right is very important.
→ Getting the top-ranked page right is most important.

SLIDE 33

Outline

1. Recap
2. Zones
3. Why rank?
4. More on cosine
5. Implementation

SLIDE 34

A problem for cosine normalization

Query q: “anti-doping rules Beijing 2008 olympics”
Compare three documents:

d1: a short document on anti-doping rules at the 2008 Olympics
d2: a long document that consists of a copy of d1 and 5 other news stories, all on topics different from Olympics/anti-doping
d3: a short document on anti-doping rules at the 2004 Athens Olympics

What ranking do we expect in the vector space model?
d2 is likely to be ranked below d3 . . . but d2 is more relevant than d3.
What can we do about this?

SLIDE 35

Pivot normalization

Cosine normalization produces weights that are too large for short documents and too small for long documents (on average).
Adjust cosine normalization by a linear adjustment: “turning” the average normalization on the pivot.
Effect: Similarities of short documents with the query decrease; similarities of long documents with the query increase.
This removes the unfair advantage that short documents have.
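The slide gives no formula, but the standard pivoted length normalization replaces the cosine length |d| with a linear blend around a collection-wide pivot; a minimal sketch, with slope and pivot values chosen purely for illustration:

    def pivoted_norm(doc_length, pivot, slope=0.75):
        """Pivoted normalization factor: (1 - slope) * pivot + slope * |d|.

        Documents shorter than the pivot get a larger divisor than plain cosine
        normalization (their scores drop); longer documents get a smaller one.
        """
        return (1.0 - slope) * pivot + slope * doc_length

    pivot = 2.3   # e.g., the average cosine length over the collection (illustrative)
    for length in (1.5, 2.3, 4.0):
        print(length, "->", round(pivoted_norm(length, pivot), 2))   # 1.7, 2.3, 3.58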

SLIDE 36

Predicted and true probability of relevance

source: Lillian Lee

SLIDE 37

Pivot normalization

source: Lillian Lee

SLIDE 38

Outline

1. Recap
2. Zones
3. Why rank?
4. More on cosine
5. Implementation

SLIDE 39

Now we also need term frequencies in the index

Brutus    → 1,2   7,3   83,1   87,2   . . .
Caesar    → 1,1   5,1   13,1   17,1   . . .
Calpurnia → 7,1   8,2   40,1   97,3

The second number in each posting is the term frequency. We also need positions, not shown here.

SLIDE 40

Term frequencies in the inverted index

In each posting, store tf_{t,d} in addition to docID d.
Store it as an integer frequency, not as a (log-)weighted real number . . . because real numbers are difficult to compress.
Unary code is effective for encoding term frequencies. Why?
Overall, additional space requirements are small: much less than a byte per posting.
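As an aside (one common convention, not spelled out on the slide), a unary code writes a positive integer n as n one-bits followed by a terminating zero; it is effective here because most stored term frequencies are 1:

    def unary_encode(n):
        """Unary code for a positive integer n: n ones followed by a terminating zero."""
        return "1" * n + "0"

    def unary_decode(bits):
        """Read one unary-coded integer from the front of a bit string; return (value, rest)."""
        n = bits.index("0")
        return n, bits[n + 1:]

    print([unary_encode(tf) for tf in (1, 2, 3)])   # ['10', '110', '1110']
    print(unary_decode("110" + "10"))               # (2, '10')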

SLIDE 41

How do we compute the top k in ranking?

In many applications, we don’t need a complete ranking. We just need the top k for a small k (e.g., k = 100).
If we don’t need a complete ranking, is there an efficient way of computing just the top k?

Naive:
  • Compute scores for all N documents
  • Sort
  • Return the top k

What’s bad about this? Alternative?

SLIDE 42

Use min heap for selecting top k out of N

Use a binary min heap.
A binary min heap is a binary tree in which each node’s value is less than the values of its children.
Takes O(N log k) operations to construct (where N is the number of documents) . . . then read off the k winners in O(k log k) steps.
Essentially linear in N for small k and large N.

SLIDE 43

Binary min heap

[Figure: binary min heap with root 0.6, children 0.85 and 0.7, and leaves 0.9, 0.97, 0.8, 0.95.]

SLIDE 44

Selecting top k scoring documents in O(N log k)

Goal: Keep the top k documents seen so far.
Use a binary min heap.
To process a new document d′ with score s′ (sketched in code below):

  • Get the current minimum h_m of the heap (O(1))
  • If s′ ≤ h_m, skip to the next document
  • If s′ > h_m, heap-delete-root (O(log k)), then heap-add d′/s′ (O(log k))
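A minimal Python version of this loop using the standard heapq module (illustrative; the (score, docID) pairs are made up):

    import heapq

    def top_k(scored_docs, k):
        """Keep the k highest-scoring (score, docID) pairs using a size-k min heap."""
        heap = []                                   # root = smallest score kept so far
        for score, doc_id in scored_docs:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:                # beats current minimum: replace the root
                heapq.heapreplace(heap, (score, doc_id))
        return sorted(heap, reverse=True)           # read off the k winners

    docs = [(0.6, 1), (0.85, 2), (0.7, 3), (0.9, 4), (0.97, 5), (0.8, 6), (0.95, 7)]
    print(top_k(docs, 3))   # [(0.97, 5), (0.95, 7), (0.9, 4)]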

SLIDE 45

Even more efficient computation of top k?

Ranking has time complexity O(N), where N is the number of documents.
Optimizations reduce the constant factor, but they are still O(N), and 10^10 < N < 10^11!
Are there sublinear algorithms? Ideas?
What we’re doing in effect: solving the k-nearest-neighbor (kNN) problem for the query vector (= query point).
There are no general solutions to this problem that are sublinear.
We will revisit this when we do kNN classification.

SLIDE 46

Cluster pruning

Cluster docs in a preprocessing step:

  • Pick √N “leaders”
  • For non-leaders, find the nearest leader (expect √N docs per leader)
  • For query q, find the closest leader L (√N computations)
  • Rank L and its followers

Or generalize: use the b1 closest leaders, and then the b2 leaders closest to the query.
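A rough sketch of the basic (b1 = b2 = 1) scheme, assuming random leader selection and a cosine measure like the one defined earlier (names and toy vectors are illustrative, not from the slides):

    import math
    import random

    def cosine(q, d):
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        nq = math.sqrt(sum(w * w for w in q.values()))
        nd = math.sqrt(sum(w * w for w in d.values()))
        return dot / (nq * nd) if nq and nd else 0.0

    def build_clusters(doc_vectors, seed=0):
        """Preprocessing: pick ~sqrt(N) random leaders; attach every doc to its nearest leader."""
        random.seed(seed)
        leaders = random.sample(list(doc_vectors), max(1, math.isqrt(len(doc_vectors))))
        followers = {l: [] for l in leaders}
        for doc_id, vec in doc_vectors.items():
            nearest = max(leaders, key=lambda l: cosine(vec, doc_vectors[l]))
            followers[nearest].append(doc_id)
        return leaders, followers

    def cluster_pruned_search(query_vec, doc_vectors, leaders, followers, k=10):
        """Query time: compare q to the sqrt(N) leaders, then score only the best leader's followers."""
        best = max(leaders, key=lambda l: cosine(query_vec, doc_vectors[l]))
        candidates = set(followers[best]) | {best}
        scored = sorted(((cosine(query_vec, doc_vectors[d]), d) for d in candidates), reverse=True)
        return scored[:k]

    # usage with hypothetical toy vectors
    vectors = {i: {"term%d" % (i % 5): 1.0, "common": 0.5} for i in range(100)}
    leaders, followers = build_clusters(vectors)
    print(cluster_pruned_search({"term3": 1.0, "common": 0.2}, vectors, leaders, followers, k=3))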

SLIDE 47

SLIDE 48

Even more efficient computation of top k

Idea 1: Reorder postings lists

Instead of ordering according to docID . . . order according to some measure of “expected relevance”.

Idea 2: Heuristics to prune the search space

Not guaranteed to be correct . . . but fails rarely; in practice, close to constant time.
For this, we’ll need the concepts of document-at-a-time processing and term-at-a-time processing.
