

SLIDE 1

Scoring & result assembly

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Fall 2017

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

SLIDE 2

Outline

• Speeding up vector space ranking
• Putting together a complete search system
  - Will require learning about a number of miscellaneous topics and heuristics

SLIDE 3

Computing cosine scores

  • Sec. 6.3.3
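
Below is a minimal Python sketch of the term-at-a-time CosineScore algorithm from IIR Sec. 6.3.3. The index layout (a dict from term to a postings list of (docID, tf) pairs), the log-tf-idf weighting, and the doc_length mapping are illustrative assumptions, not the only possible choices.

```python
import heapq
import math

def cosine_score(query_terms, index, doc_length, N, k=10):
    """Term-at-a-time cosine scoring (in the spirit of IIR Fig. 6.14).

    index:      dict mapping term -> list of (docID, tf_td) postings
    doc_length: dict mapping docID -> Euclidean length of its weight vector
    N:          number of documents in the collection
    """
    scores = {}                                   # one accumulator per docID
    for t in query_terms:
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log10(N / len(postings))       # idf_t = log(N / df_t)
        w_tq = idf                                # query weight (query tf assumed 1)
        for doc_id, tf_td in postings:
            w_td = (1 + math.log10(tf_td)) * idf  # log-tf * idf document weight
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td * w_tq
    # normalize by document length, then pick the top k with a heap
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```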

SLIDE 4

Term-at-a-time vs. doc-at-a-time processing

• Term-at-a-time: completely process the postings list of the first query term, then the postings list of the second query term, and so forth
• Doc-at-a-time: traverse all query terms' postings in parallel, completing one document's score before moving on to the next document

Antony:  3 → 4 → 8 → 16 → 32 → 64 → 128
Brutus:  1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Caesar:  2 → 4 → 8 → 16 → 32 → 64 → 128
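
For contrast, here is a hedged sketch of document-at-a-time processing, assuming each postings list is sorted by docID and already stores a per-document weight: all query terms' postings are walked in parallel, and one document's score is finished before the next document is started.

```python
import heapq

def doc_at_a_time(query_postings, k=10):
    """Doc-at-a-time scoring over docID-sorted postings.

    query_postings: list of postings lists, one per query term,
                    each a list of (docID, weight) sorted by docID.
    Returns the top-k (score, docID) pairs.
    """
    cursors = [0] * len(query_postings)
    top_k = []                                   # min-heap of (score, docID)
    while True:
        # next doc to score = smallest docID currently under any cursor
        heads = [pl[c][0] for pl, c in zip(query_postings, cursors) if c < len(pl)]
        if not heads:
            break
        d = min(heads)
        score = 0.0
        for i, pl in enumerate(query_postings):
            c = cursors[i]
            if c < len(pl) and pl[c][0] == d:
                score += pl[c][1]                # accumulate this term's contribution
                cursors[i] += 1                  # advance past doc d
        heapq.heappush(top_k, (score, d))
        if len(top_k) > k:
            heapq.heappop(top_k)                 # keep only the k best
    return sorted(top_k, reverse=True)
```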

SLIDE 5

Term frequencies in the inverted index

• In each posting, store tf_{t,d} in addition to docID
  - As an integer frequency, not as a (log-)weighted real number, because real numbers are difficult to compress
  - Overall, the additional space requirement is small: a byte per posting or less

SLIDE 6

Efficient ranking

 Usually we don’t need a complete ranking.

 We just need the top k for a small k (e.g., k = 100).

• Find the K docs in the collection "nearest" to the query
  - ⇒ the K largest query-doc scores
• Efficient ranking:
  - Computing a single score efficiently
  - Choosing the K largest scores efficiently
• Can we do this without computing all N cosines?

  • Sec. 7.1

SLIDE 7

Efficient cosine ranking

• What we are doing in effect: solving the K-nearest-neighbor problem for the query vector
• In general, we do not know how to do this efficiently for high-dimensional spaces
• But it is solvable for short queries, and standard indexes support this well

  • Sec. 7.1

SLIDE 8

Computing the K largest cosines: selection vs. sorting

• Retrieve the top K docs
  - We do not need to totally order all docs in the collection
• Can we pick off the docs with the K highest cosines?
  - Let J = number of docs with nonzero cosines
  - We seek the K best of these J

  • Sec. 7.1

SLIDE 9

Use heap for selecting top K

• Construction: 2J operations
• K "winners": 2K log J operations
• For J = 1M, K = 100, this is about 10% of the cost of sorting

(Figure: binary max-heap of cosine scores: 1, .9, .3, .8, .3, .1, .1)

  • Sec. 7.1
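
As a concrete illustration of selection versus sorting, the sketch below uses Python's heapq to pick the K largest of J nonzero scores without sorting all of them; the scores dict is an assumed input and the sample values mirror the heap on the slide.

```python
import heapq

def top_k_scores(scores, k):
    """Select the k largest (docID, score) pairs from a dict of nonzero
    cosine scores without sorting all J of them.

    heapify is O(J); each of the k extractions costs O(log J),
    which is much cheaper than an O(J log J) full sort when k << J.
    """
    heap = [(-score, doc_id) for doc_id, score in scores.items()]  # max-heap via negation
    heapq.heapify(heap)
    winners = []
    for _ in range(min(k, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)
        winners.append((doc_id, -neg_score))
    return winners

# example: J = 7 docs with nonzero cosines, pick the top K = 3
print(top_k_scores({1: 1.0, 2: .9, 3: .3, 4: .8, 5: .3, 6: .1, 7: .1}, k=3))
```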

SLIDE 10

Cosine similarity is only a proxy

 Cosine similarity is just a proxy for user happiness

• If we get a list of K docs "close" to the top K by cosine measure, it should be OK

  • Sec. 7.1.1

SLIDE 11

More efficient computation of top k: Heuristics

• Idea 1: Reorder postings lists
  - Instead of ordering according to docID, order according to some measure of "expected relevance", "authority", etc.
• Idea 2: Heuristics to prune the search space
  - Not guaranteed to be correct, but fails rarely
  - In practice, close to constant time

SLIDE 12

Generic idea of inexact top k search

• Find a set A of contenders, with K < |A| ≪ N
  - A does not necessarily contain the top K, but has many docs from among the top K
• Return the top K docs in A
• Think of A as pruning non-contenders
• The same approach is also used for other scoring functions
• We will look at several schemes following this approach

  • Sec. 7.1.1

SLIDE 13

Ideas for more efficient computation of top k

• Index elimination
• Champion lists
• Global ordering
• Impact ordering
• Cluster pruning

SLIDE 14

Index elimination for cosine computation

• Basic algorithm: consider docs containing at least one query term
• Extend this basic algorithm to:
  - Only consider docs containing many (or all) query terms
  - Only consider high-idf query terms

  • Sec. 7.1.2

SLIDE 15

Docs containing many query terms

When we have multi-term queries

• Only compute scores for docs containing several of the query terms
  - Say, at least 3 out of 4
• Imposes a "soft conjunction" on queries, as seen on web search engines (early Google)
• May find fewer than k candidates
• Easy to implement in postings traversal

  • Sec. 7.1.2

SLIDE 16

3 of 4 query terms

Antony:    3 → 4 → 8 → 16 → 32 → 64 → 128
Brutus:    1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Caesar:    2 → 4 → 8 → 16 → 32 → 64 → 128
Calpurnia: 13 → 16 → 32

Scores only computed for docs 8, 16 and 32.

  • Sec. 7.1.2
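
A minimal sketch of this soft conjunction, assuming plain docID lists as postings: only documents appearing in at least 3 of the 4 query terms' postings are kept as scoring candidates, reproducing the example above.

```python
from collections import Counter

def soft_conjunction_candidates(postings_lists, min_terms=3):
    """Return docIDs that occur in at least `min_terms` of the postings lists."""
    counts = Counter()
    for plist in postings_lists:
        counts.update(set(plist))          # each list contributes at most once per doc
    return sorted(d for d, c in counts.items() if c >= min_terms)

antony    = [3, 4, 8, 16, 32, 64, 128]
brutus    = [1, 2, 3, 5, 8, 13, 21, 34]
caesar    = [2, 4, 8, 16, 32, 64, 128]
calpurnia = [13, 16, 32]

# -> [8, 16, 32]: only these docs get scored
print(soft_conjunction_candidates([antony, brutus, caesar, calpurnia]))
```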

SLIDE 17

High-idf query terms only

 Query: catcher in the rye

 Only accumulate scores from catcher and rye

• Intuition: in and the contribute little to the scores and so do not alter the rank ordering much
• Benefit: postings of low-idf terms contain many docs
  - ⇒ these many docs are eliminated from the set A of contenders

  • Sec. 7.1.2

SLIDE 18

Champion lists

• For each dictionary term t, precompute the r docs of highest weight in its postings list
  - Call this the champion list for t (aka fancy list or top docs for t)
• At query time, only compute scores for docs in the champion lists of some (or all) query terms
  - Pick the K top-scoring docs from amongst these
• Note that r has to be chosen at index build time
  - Thus, it is possible that the obtained list contains fewer than K docs

  • Sec. 7.1.3
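
A hedged sketch of building champion lists at index time, assuming postings stored as (docID, weight) pairs; since r is fixed when the index is built, the candidate set at query time may hold fewer than K docs.

```python
import heapq

def build_champion_lists(index, r):
    """For each term, keep only the r postings with highest weight.

    index: dict term -> list of (docID, weight)
    Returns dict term -> champion ("fancy") list of (docID, weight).
    """
    return {
        term: heapq.nlargest(r, postings, key=lambda p: p[1])
        for term, postings in index.items()
    }

def candidate_docs(query_terms, champion_lists):
    """Union of the query terms' champion lists: the only docs we score."""
    docs = set()
    for t in query_terms:
        docs.update(doc_id for doc_id, _ in champion_lists.get(t, []))
    return docs
```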

SLIDE 19

High and low lists

• For each term, maintain two postings lists: high and low
  - High: like the champion list
  - Low: all other docs containing t
• When processing a query, traverse the high lists first
  - If we get more than K docs, select the top K and stop
  - Else proceed to get docs from the low lists
• A means of segmenting the index into two tiers

  • Sec. 7.1.4

SLIDE 20


Static quality scores

• Top-ranking docs need to be both relevant and authoritative
  - Relevance: modeled by cosine scores
  - Authority: typically a query-independent property of a doc
• Examples of authority signals:
  - Wikipedia among websites
  - Articles in certain newspapers
  - A paper with many citations
  - Pagerank

  • Sec. 7.1.4

SLIDE 21

Modeling authority

• Assign to each doc d a query-independent quality score in [0,1], denoted g(d)
  - A quantity like the number of citations, scaled into [0,1]

  • Sec. 7.1.4

SLIDE 22

Net score

• Simple total score, combining cosine relevance and authority:
  NetScore(q, d) = g(d) + cosine(q, d)
  - Can use some other linear combination
  - Indeed, any function of the two "signals" of user happiness
• Now we seek the top K docs by net score

  • Sec. 7.1.4

SLIDE 23

Top K by net score – fast methods

• First idea: order all postings by g(d)
• Key: this is a common ordering for all postings
  - All postings lists are ordered by a single common ordering, so the merge is performed in a single pass through the postings
• Can concurrently traverse query terms' postings for
  - Postings intersection
  - Cosine score computation

  • Sec. 7.1.4
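
A sketch of why a single common g(d) ordering helps, under assumed inputs: every postings list is sorted by decreasing g(d) (distinct g values assumed; ties would need breaking by docID), so a document-at-a-time merge needs one pass, and an optional doc_budget allows stopping early in time-bound settings while most likely having already seen the high-quality docs.

```python
import heapq

def net_score_topk(query_postings, g, k=10, doc_budget=None):
    """Doc-at-a-time net scoring over postings sorted by decreasing g(d).

    query_postings: list of postings lists, one per query term, each a list
                    of (docID, cosine_contribution) in the common g(d) order.
    g:              dict docID -> static quality score in [0, 1]
    doc_budget:     optional cap on how many docs to score before stopping.
    """
    cursors = [0] * len(query_postings)
    top_k, scored = [], 0
    while doc_budget is None or scored < doc_budget:
        heads = [pl[c][0] for pl, c in zip(query_postings, cursors) if c < len(pl)]
        if not heads:
            break
        # all lists share the g(d) order, so the head doc with the largest g
        # is the next doc whose score can be completed
        d = max(heads, key=lambda doc: g[doc])
        score = g[d]                              # NetScore(q,d) = g(d) + cosine(q,d)
        for i, pl in enumerate(query_postings):
            c = cursors[i]
            if c < len(pl) and pl[c][0] == d:
                score += pl[c][1]
                cursors[i] += 1
        heapq.heappush(top_k, (score, d))
        if len(top_k) > k:
            heapq.heappop(top_k)
        scored += 1
    return sorted(top_k, reverse=True)
```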

SLIDE 24

Static quality-ordered index

(Figure: postings ordered by static quality, with g(1) = 0.25, g(2) = 0.5, g(3) = 1)

SLIDE 25

Why order postings by g(d)?

• g(d)-ordering: top-scoring docs are likely to appear early in the postings traversal
• In time-bound applications, it allows us to stop the postings traversal early
  - E.g., we have to return search results in 50 ms

  • Sec. 7.1.4

SLIDE 26

Global champion lists

• Can we combine champion lists with g(d)-ordering?
• Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d}
  - Sorted by the common order g(d)
• Seek the top-K results from only the docs in these champion lists

  • Sec. 7.1.4

SLIDE 27

Impact-ordered postings

• We sort each postings list according to the weight wf_{t,d}
  - Simplest case: normalized tf-idf weight
• With such impact ordering, docs in the top k are likely to occur early in the ordered lists
  - ⇒ Early termination while processing postings lists is unlikely to change the top k

  • Sec. 7.1.5

SLIDE 28

Term-at-a-time processing

  • Sec. 6.3.3

SLIDE 29

Impact-ordered postings

• Now: postings are no longer all in a common order!
  - No longer a consistent ordering of docs across postings lists
  - No longer can we employ document-at-a-time processing
• Term-at-a-time processing
  - Create an accumulator for each docID you encounter
• How do we compute scores in order to pick off an inexact top K?
  - Early termination
  - idf-ordered terms

SLIDE 30
  • 1. Early termination

• When traversing t's postings, stop early after either
  - a fixed number of r docs, or
  - wf_{t,d} drops below some threshold

  • Sec. 7.1.5

SLIDE 31
  • 2. idf-ordered terms

• When considering the postings of query terms, look at them in order of decreasing idf
  - High-idf terms are likely to contribute most to the score
• As we update the score contribution from each query term, we can stop when doc scores are relatively unchanged
  - If the changes are minimal, we may omit accumulation from the remaining query terms, or alternatively process shorter prefixes of their postings lists

  • Sec. 7.1.5
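
Putting the last three slides together, a hedged term-at-a-time sketch with accumulators: query terms are processed in decreasing idf order, each impact-ordered postings list is cut off after at most r postings, and low-idf terms are skipped entirely. The thresholds r and min_idf and the index layout are illustrative assumptions.

```python
import heapq
import math

def inexact_topk_term_at_a_time(query_terms, index, N, k=10, r=1000, min_idf=0.5):
    """Term-at-a-time scoring with accumulators and two pruning heuristics.

    index: dict term -> postings list of (docID, wf_td), impact-ordered
           (sorted by decreasing weight wf_td).
    """
    accumulators = {}                                     # docID -> partial score

    def idf(t):
        df = len(index.get(t, [])) or 1
        return math.log10(N / df)

    # idf-ordered terms: high-idf (rare) terms are processed first
    for t in sorted(query_terms, key=idf, reverse=True):
        t_idf = idf(t)
        if t_idf < min_idf:
            break                                         # remaining terms change scores little
        for doc_id, wf_td in index.get(t, [])[:r]:        # early termination after r postings
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + wf_td * t_idf

    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])
```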

SLIDE 32

Tiered indexes

Basic idea:
• Create several tiers of indexes
• During query processing, start with the highest-tier index
• If the highest-tier index returns at least k (e.g., k = 100) results: stop and return them to the user
• If we have found < k hits: repeat for the next index in the tier cascade

SLIDE 33

Tiered indexes

• Break postings up into a hierarchy of lists, from most important to least important
  - Can be done by g(d) or another measure
• Inverted index ⇒ tiers of decreasing importance
• At query time, use the top tier unless it fails to yield K docs
  - If so, drop to lower tiers
• Tiered indexes were one of the reasons for the success of early Google (2000/01)
  - along with PageRank, use of anchor text, and proximity constraints

  • Sec. 7.2.1
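
A minimal sketch of the tier cascade, assuming the tiers themselves are built elsewhere (e.g., split by g(d) or by tf); search_tier is a hypothetical per-tier search function, not part of any real library.

```python
def tiered_search(query, tiers, search_tier, k=100):
    """Query a list of index tiers, most important first.

    tiers:       list of tier indexes, ordered from most to least important
    search_tier: hypothetical function (query, tier) -> list of (docID, score)
    Stops as soon as at least k results have been collected.
    """
    results, seen = [], set()
    for tier in tiers:
        for doc_id, score in search_tier(query, tier):
            if doc_id not in seen:                # a doc may appear in several tiers
                seen.add(doc_id)
                results.append((doc_id, score))
        if len(results) >= k:
            break                                 # this tier plus earlier ones yielded enough hits
    return results[:k]
```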

SLIDE 34

Example tiered index

  • Sec. 7.2.1

(Example tiers by term frequency: tf ≥ 20, 10 ≤ tf < 20, and tf < 10)

SLIDE 35

Cluster pruning: preprocessing

• Leaders: √N docs chosen at random
• For every other doc, pre-compute its nearest leader
  - Followers: docs attached to a leader
  - Likely: each leader has ~√N followers
• Why random sampling for finding leaders:
  - Fast approach
  - Leaders reflect the data distribution

  • Sec. 7.1.6

SLIDE 36

Cluster pruning: query processing

• Given a query Q, find its nearest leader L.
• Seek the K nearest docs from among L's followers.

  • Sec. 7.1.6
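
A sketch of cluster pruning under the usual assumptions: documents and the query are dense vectors, ~√N leaders are sampled at random during preprocessing, and at query time only the nearest leader's followers are scored.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def preprocess(doc_vectors, seed=0):
    """Pick ~sqrt(N) random leaders and attach every doc to its nearest leader."""
    random.seed(seed)
    doc_ids = list(doc_vectors)
    leaders = random.sample(doc_ids, max(1, int(math.sqrt(len(doc_ids)))))
    followers = {l: [] for l in leaders}
    for d in doc_ids:
        nearest = max(leaders, key=lambda l: cosine(doc_vectors[d], doc_vectors[l]))
        followers[nearest].append(d)
    return leaders, followers

def query_cluster_pruning(q_vec, doc_vectors, leaders, followers, k=10):
    """Find the nearest leader L, then the k nearest docs among L's followers."""
    L = max(leaders, key=lambda l: cosine(q_vec, doc_vectors[l]))
    candidates = followers[L]
    return sorted(candidates, key=lambda d: cosine(q_vec, doc_vectors[d]), reverse=True)[:k]
```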

SLIDE 37

Visualization

(Figure: the query point, the sampled leaders, and their attached followers)

  • Sec. 7.1.6

SLIDE 38

General variants

• Have each follower attached to its b1 nearest leaders.
• From the query, find the b2 nearest leaders and their followers.
• Can recurse on the leader/follower construction.

  • Sec. 7.1.6

SLIDE 39

Complete search system

  • Sec. 7.2.4

SLIDE 40

Components we haven’t covered yet

• Query parser
• Zone indexes: separate indexes for the different zones (parts) of a document
• Document cache: needed for generating snippets
• Machine-learned ranking functions

SLIDE 41

Query parsers

• A free-text query from the user may spawn one or more queries to the indexes, e.g.:
  - Run the query as a phrase query
  - If < K docs contain the phrase, run smaller phrase queries
  - If we still have < K docs, run the vector space query
  - Rank matching docs by vector space scoring
• This sequence is issued by a query parser

  • Sec. 7.2.3

SLIDE 42

Query parsers

• Example: query rising interest rates
  - If < K docs contain the phrase "rising interest rates", run the queries "rising interest" and "interest rates"
  - If we still have < K docs, run the vector space query rising interest rates
• We need an aggregate scoring function that accumulates evidence of a doc's relevance from multiple sources
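
A hedged sketch of this cascade; phrase_search and vector_space_search are hypothetical back-end calls, and the splitting into consecutive bigrams follows the example above ("rising interest", "interest rates").

```python
def parse_and_run(query, phrase_search, vector_space_search, k=10):
    """Issue progressively looser queries until at least k docs are found.

    phrase_search(phrase) -> list of (docID, score)       (hypothetical)
    vector_space_search(terms) -> list of (docID, score)  (hypothetical)
    """
    terms = query.split()                         # e.g. ["rising", "interest", "rates"]
    results = dict(phrase_search(" ".join(terms)))

    if len(results) < k and len(terms) > 2:
        # shorter phrase queries: consecutive bigrams
        for i in range(len(terms) - 1):
            for doc_id, score in phrase_search(f"{terms[i]} {terms[i + 1]}"):
                results[doc_id] = max(score, results.get(doc_id, 0.0))

    if len(results) < k:
        for doc_id, score in vector_space_search(terms):
            results[doc_id] = max(score, results.get(doc_id, 0.0))

    # an aggregate scoring function would combine evidence here; max() is a placeholder
    return sorted(results.items(), key=lambda item: item[1], reverse=True)[:k]
```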

SLIDE 43

Query term proximity

• Free-text queries: just a set of terms
• Users may prefer docs in which the query terms occur within close proximity of each other
• w: smallest window in a doc containing all query terms
  - Query: strained mercy
  - Doc: "The quality of mercy is not strained"
  - w = 4 (words)
• Would like the scoring function to take this into account – how?

  • Sec. 7.2.2
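
A sketch of computing w, the smallest window (in words) containing all query terms, assuming the doc is available as a token list; a two-pointer scan keeps the computation linear in the document length.

```python
from collections import Counter

def smallest_window(doc_tokens, query_terms):
    """Length (in words) of the smallest window of doc_tokens containing
    every query term at least once; None if some term is missing."""
    needed = Counter(set(query_terms))
    missing = len(needed)
    have = Counter()
    best = None
    left = 0
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            have[tok] += 1
            if have[tok] == needed[tok]:
                missing -= 1
        while missing == 0:                       # window [left, right] covers all terms
            width = right - left + 1
            best = width if best is None else min(best, width)
            out = doc_tokens[left]
            if out in needed:
                have[out] -= 1
                if have[out] < needed[out]:
                    missing += 1
            left += 1
    return best

doc = "the quality of mercy is not strained".split()
print(smallest_window(doc, ["strained", "mercy"]))   # -> 4, as in the example above
```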

SLIDE 44

Aggregate scores

• Scoring functions can combine cosine, static quality, proximity, etc.
• How do we know the best combination?
  - Some applications: expert-tuned
  - Increasingly common: machine-learned

  • Sec. 7.2.3

SLIDE 45

Parametric and zone indexes

• Thus far, a doc has been a sequence of terms
• In fact, docs have multiple parts, some with special semantics:
  - Author
  - Title
  - Date of publication
  - Language
  - Format
  - etc.
• These constitute the metadata about a document

  • Sec. 6.1

SLIDE 46

Fields

• We sometimes wish to search by this metadata
  - E.g., find docs authored by William Shakespeare in the year 1601, containing alas poor Yorick
  - Year = 1601 is an example of a field
  - Author last name = shakespeare
• Field or parametric index: postings for each field value
  - Sometimes build range trees (e.g., for dates)
• A field query is typically treated as a conjunction
  - (doc must be authored by shakespeare)

  • Sec. 6.1

SLIDE 47

Zone

• A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
  - Title
  - Abstract
  - References
  - …
• Build inverted indexes on zones (to permit querying them)

  • Sec. 6.1


Zone examples: the body of the doc, all highlighted text in the doc, anchor text, text in metadata fields

SLIDE 48

Example zone indexes

Encode zones in the dictionary vs. in the postings.

  • Sec. 6.1

SLIDE 49

Resources

• IIR 7, 6.1
• Resources at http://ifnlp.org/ir
• How Google tweaks its ranking function
• Interview with Google search guru Udi Manber
• Amit Singhal on Google ranking
• SEO perspective: ranking factors