

SLIDE 1

Scoring & result assembly

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Fall 2017

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

SLIDE 2

Outline

• Speeding up vector space ranking
• Putting together a complete search system
  - Will require learning about a number of miscellaneous topics and heuristics

SLIDE 3

Computing cosine scores

  • Sec. 6.3.3
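
Below is a minimal Python sketch of the term-at-a-time CosineScore algorithm from IIR Sec. 6.3.3. The index layout (a dict from term to a postings list of (docID, tf) pairs), the log-tf-idf weighting, and the doc_length mapping are illustrative assumptions, not the only possible choices.

```python
import heapq
import math

def cosine_score(query_terms, index, doc_length, N, k=10):
    """Term-at-a-time cosine scoring (in the spirit of IIR Fig. 6.14).

    index:      dict mapping term -> list of (docID, tf_td) postings
    doc_length: dict mapping docID -> Euclidean length of its weight vector
    N:          number of documents in the collection
    """
    scores = {}                                   # one accumulator per docID
    for t in query_terms:
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log10(N / len(postings))       # idf_t = log(N / df_t)
        w_tq = idf                                # query weight (query tf assumed 1)
        for doc_id, tf_td in postings:
            w_td = (1 + math.log10(tf_td)) * idf  # log-tf * idf document weight
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td * w_tq
    # normalize by document length, then pick the top k with a heap
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```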

SLIDE 4

Term-at-a-time vs. doc-at-a-time processing

• Term-at-a-time: completely process the postings list of the first query term, then the postings list of the second query term, and so forth
• Doc-at-a-time: traverse all query terms' postings in parallel, completing one document's score before moving on to the next document

Antony:  3 → 4 → 8 → 16 → 32 → 64 → 128
Brutus:  1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Caesar:  2 → 4 → 8 → 16 → 32 → 64 → 128
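
For contrast, here is a hedged sketch of document-at-a-time processing, assuming each postings list is sorted by docID and already stores a per-document weight: all query terms' postings are walked in parallel, and one document's score is finished before the next document is started.

```python
import heapq

def doc_at_a_time(query_postings, k=10):
    """Doc-at-a-time scoring over docID-sorted postings.

    query_postings: list of postings lists, one per query term,
                    each a list of (docID, weight) sorted by docID.
    Returns the top-k (score, docID) pairs.
    """
    cursors = [0] * len(query_postings)
    top_k = []                                   # min-heap of (score, docID)
    while True:
        # next doc to score = smallest docID currently under any cursor
        heads = [pl[c][0] for pl, c in zip(query_postings, cursors) if c < len(pl)]
        if not heads:
            break
        d = min(heads)
        score = 0.0
        for i, pl in enumerate(query_postings):
            c = cursors[i]
            if c < len(pl) and pl[c][0] == d:
                score += pl[c][1]                # accumulate this term's contribution
                cursors[i] += 1                  # advance past doc d
        heapq.heappush(top_k, (score, d))
        if len(top_k) > k:
            heapq.heappop(top_k)                 # keep only the k best
    return sorted(top_k, reverse=True)
```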

SLIDE 5

Term frequencies in the inverted index

• In each posting, store tf_{t,d} in addition to docID
  - As an integer frequency, not as a (log-)weighted real number, because real numbers are difficult to compress
  - Overall, the additional space requirement is small: a byte per posting or less

SLIDE 6

Efficient ranking

 Usually we don’t need a complete ranking.

 We just need the top k for a small k (e.g., k = 100).

• Find the K docs in the collection "nearest" to the query
  - ⇒ the K largest query-doc scores
• Efficient ranking:
  - Computing a single score efficiently
  - Choosing the K largest scores efficiently
• Can we do this without computing all N cosines?

  • Sec. 7.1

SLIDE 7

Efficient cosine ranking

• What we are doing in effect: solving the K-nearest-neighbor problem for the query vector
• In general, we do not know how to do this efficiently for high-dimensional spaces
• But it is solvable for short queries, and standard indexes support this well

  • Sec. 7.1

SLIDE 8

Computing the K largest cosines: selection vs. sorting

• Retrieve the top K docs
  - We do not need to totally order all docs in the collection
• Can we pick off the docs with the K highest cosines?
  - Let J = number of docs with nonzero cosines
  - We seek the K best of these J

  • Sec. 7.1

SLIDE 9

Use heap for selecting top K

• Construction: 2J operations
• K "winners": 2K log J operations
• For J = 1M, K = 100, this is about 10% of the cost of sorting

(Figure: binary max-heap of cosine scores: 1, .9, .3, .8, .3, .1, .1)

  • Sec. 7.1
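
As a concrete illustration of selection versus sorting, the sketch below uses Python's heapq to pick the K largest of J nonzero scores without sorting all of them; the scores dict is an assumed input and the sample values mirror the heap on the slide.

```python
import heapq

def top_k_scores(scores, k):
    """Select the k largest (docID, score) pairs from a dict of nonzero
    cosine scores without sorting all J of them.

    heapify is O(J); each of the k extractions costs O(log J),
    which is much cheaper than an O(J log J) full sort when k << J.
    """
    heap = [(-score, doc_id) for doc_id, score in scores.items()]  # max-heap via negation
    heapq.heapify(heap)
    winners = []
    for _ in range(min(k, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)
        winners.append((doc_id, -neg_score))
    return winners

# example: J = 7 docs with nonzero cosines, pick the top K = 3
print(top_k_scores({1: 1.0, 2: .9, 3: .3, 4: .8, 5: .3, 6: .1, 7: .1}, k=3))
```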

SLIDE 10

Cosine similarity is only a proxy

 Cosine similarity is just a proxy for user happiness

• If we get a list of K docs "close" to the top K by cosine measure, it should be OK

  • Sec. 7.1.1

SLIDE 11

More efficient computation of top k: Heuristics

• Idea 1: Reorder postings lists
  - Instead of ordering according to docID, order according to some measure of "expected relevance", "authority", etc.
• Idea 2: Heuristics to prune the search space
  - Not guaranteed to be correct, but fails rarely
  - In practice, close to constant time

SLIDE 12

Generic idea of inexact top k search

• Find a set A of contenders, with K < |A| ≪ N
  - A does not necessarily contain the top K, but has many docs from among the top K
• Return the top K docs in A
• Think of A as pruning non-contenders
• The same approach is also used for other scoring functions
• We will look at several schemes following this approach

  • Sec. 7.1.1

SLIDE 13

Ideas for more efficient computation of top k

• Index elimination
• Champion lists
• Global ordering
• Impact ordering
• Cluster pruning

SLIDE 14

Index elimination for cosine computation

• Basic algorithm: consider docs containing at least one query term
• Extend this basic algorithm to:
  - Only consider docs containing many (or all) query terms
  - Only consider high-idf query terms

  • Sec. 7.1.2

SLIDE 15

Docs containing many query terms

When we have multi-term queries

• Only compute scores for docs containing several of the query terms
  - Say, at least 3 out of 4
• Imposes a "soft conjunction" on queries, as seen on web search engines (early Google)
• May find fewer than k candidates
• Easy to implement in postings traversal

  • Sec. 7.1.2

SLIDE 16

3 of 4 query terms

Antony:    3 → 4 → 8 → 16 → 32 → 64 → 128
Brutus:    1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Caesar:    2 → 4 → 8 → 16 → 32 → 64 → 128
Calpurnia: 13 → 16 → 32

Scores only computed for docs 8, 16 and 32.

  • Sec. 7.1.2
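
A minimal sketch of this soft conjunction, assuming plain docID lists as postings: only documents appearing in at least 3 of the 4 query terms' postings are kept as scoring candidates, reproducing the example above.

```python
from collections import Counter

def soft_conjunction_candidates(postings_lists, min_terms=3):
    """Return docIDs that occur in at least `min_terms` of the postings lists."""
    counts = Counter()
    for plist in postings_lists:
        counts.update(set(plist))          # each list contributes at most once per doc
    return sorted(d for d, c in counts.items() if c >= min_terms)

antony    = [3, 4, 8, 16, 32, 64, 128]
brutus    = [1, 2, 3, 5, 8, 13, 21, 34]
caesar    = [2, 4, 8, 16, 32, 64, 128]
calpurnia = [13, 16, 32]

# -> [8, 16, 32]: only these docs get scored
print(soft_conjunction_candidates([antony, brutus, caesar, calpurnia]))
```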

SLIDE 17

High-idf query terms only

 Query: catcher in the rye

 Only accumulate scores from catcher and rye

• Intuition: in and the contribute little to the scores and so do not alter the rank ordering much
• Benefit: postings of low-idf terms contain many docs
  - ⇒ these many docs are eliminated from the set A of contenders

  • Sec. 7.1.2

SLIDE 18

Champion lists

• For each dictionary term t, precompute the r docs of highest weight in its postings list
  - Call this the champion list for t (aka fancy list or top docs for t)
• At query time, only compute scores for docs in the champion lists of some (or all) query terms
  - Pick the K top-scoring docs from amongst these
• Note that r has to be chosen at index build time
  - Thus, it is possible that the obtained list contains fewer than K docs

  • Sec. 7.1.3
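
A hedged sketch of building champion lists at index time, assuming postings stored as (docID, weight) pairs; since r is fixed when the index is built, the candidate set at query time may hold fewer than K docs.

```python
import heapq

def build_champion_lists(index, r):
    """For each term, keep only the r postings with highest weight.

    index: dict term -> list of (docID, weight)
    Returns dict term -> champion ("fancy") list of (docID, weight).
    """
    return {
        term: heapq.nlargest(r, postings, key=lambda p: p[1])
        for term, postings in index.items()
    }

def candidate_docs(query_terms, champion_lists):
    """Union of the query terms' champion lists: the only docs we score."""
    docs = set()
    for t in query_terms:
        docs.update(doc_id for doc_id, _ in champion_lists.get(t, []))
    return docs
```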

SLIDE 19

High and low lists

• For each term, maintain two postings lists: high and low
  - High: like the champion list
  - Low: all other docs containing t
• When processing a query, traverse the high lists first
  - If we get more than K docs, select the top K and stop
  - Else proceed to get docs from the low lists
• A means of segmenting the index into two tiers

  • Sec. 7.1.4

SLIDE 20


Static quality scores

• Top-ranking docs need to be both relevant and authoritative
  - Relevance: modeled by cosine scores
  - Authority: typically a query-independent property of a doc
• Examples of authority signals:
  - Wikipedia among websites
  - Articles in certain newspapers
  - A paper with many citations
  - Pagerank

  • Sec. 7.1.4

SLIDE 21

Modeling authority

• Assign to each doc d a query-independent quality score in [0,1], denoted g(d)
  - A quantity like the number of citations, scaled into [0,1]

  • Sec. 7.1.4

SLIDE 22

Net score

• Simple total score, combining cosine relevance and authority:
  NetScore(q, d) = g(d) + cosine(q, d)
  - Can use some other linear combination
  - Indeed, any function of the two "signals" of user happiness
• Now we seek the top K docs by net score

  • Sec. 7.1.4

SLIDE 23

Top K by net score – fast methods

• First idea: order all postings by g(d)
• Key: this is a common ordering for all postings
  - All postings lists are ordered by a single common ordering, so the merge is performed in a single pass through the postings
• Can concurrently traverse query terms' postings for
  - Postings intersection
  - Cosine score computation

  • Sec. 7.1.4
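
A sketch of why a single common g(d) ordering helps, under assumed inputs: every postings list is sorted by decreasing g(d) (distinct g values assumed; ties would need breaking by docID), so a document-at-a-time merge needs one pass, and an optional doc_budget allows stopping early in time-bound settings while most likely having already seen the high-quality docs.

```python
import heapq

def net_score_topk(query_postings, g, k=10, doc_budget=None):
    """Doc-at-a-time net scoring over postings sorted by decreasing g(d).

    query_postings: list of postings lists, one per query term, each a list
                    of (docID, cosine_contribution) in the common g(d) order.
    g:              dict docID -> static quality score in [0, 1]
    doc_budget:     optional cap on how many docs to score before stopping.
    """
    cursors = [0] * len(query_postings)
    top_k, scored = [], 0
    while doc_budget is None or scored < doc_budget:
        heads = [pl[c][0] for pl, c in zip(query_postings, cursors) if c < len(pl)]
        if not heads:
            break
        # all lists share the g(d) order, so the head doc with the largest g
        # is the next doc whose score can be completed
        d = max(heads, key=lambda doc: g[doc])
        score = g[d]                              # NetScore(q,d) = g(d) + cosine(q,d)
        for i, pl in enumerate(query_postings):
            c = cursors[i]
            if c < len(pl) and pl[c][0] == d:
                score += pl[c][1]
                cursors[i] += 1
        heapq.heappush(top_k, (score, d))
        if len(top_k) > k:
            heapq.heappop(top_k)
        scored += 1
    return sorted(top_k, reverse=True)
```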

SLIDE 24

Static quality-ordered index

(Figure: postings ordered by static quality, with g(1) = 0.25, g(2) = 0.5, g(3) = 1)

SLIDE 25

Why order postings by g(d)?

• g(d)-ordering: top-scoring docs are likely to appear early in the postings traversal
• In time-bound applications, it allows us to stop the postings traversal early
  - E.g., we have to return search results in 50 ms

  • Sec. 7.1.4

SLIDE 26

Global champion lists

• Can we combine champion lists with g(d)-ordering?
• Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d}
  - Sorted by the common order g(d)
• Seek the top-K results from only the docs in these champion lists

  • Sec. 7.1.4

SLIDE 27

Impact-ordered postings

• We sort each postings list according to the weight wf_{t,d}
  - Simplest case: normalized tf-idf weight
• With such impact ordering, docs in the top k are likely to occur early in the ordered lists
  - ⇒ Early termination while processing postings lists is unlikely to change the top k

  • Sec. 7.1.5

SLIDE 28

Term-at-a-time processing

  • Sec. 6.3.3

SLIDE 29

Impact-ordered postings

• Now: postings are no longer all in a common order!
  - No longer a consistent ordering of docs across postings lists
  - No longer can we employ document-at-a-time processing
• Term-at-a-time processing
  - Create an accumulator for each docID you encounter
• How do we compute scores in order to pick off an inexact top K?
  - Early termination
  - idf-ordered terms

SLIDE 30
  • 1. Early termination

• When traversing t's postings, stop early after either
  - a fixed number of r docs, or
  - wf_{t,d} drops below some threshold

  • Sec. 7.1.5

SLIDE 31
  • 2. idf-ordered terms

• When considering the postings of query terms, look at them in order of decreasing idf
  - High-idf terms are likely to contribute most to the score
• As we update the score contribution from each query term, we can stop when doc scores are relatively unchanged
  - If the changes are minimal, we may omit accumulation from the remaining query terms, or alternatively process shorter prefixes of their postings lists

  • Sec. 7.1.5
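
Putting the last three slides together, a hedged term-at-a-time sketch with accumulators: query terms are processed in decreasing idf order, each impact-ordered postings list is cut off after at most r postings, and low-idf terms are skipped entirely. The thresholds r and min_idf and the index layout are illustrative assumptions.

```python
import heapq
import math

def inexact_topk_term_at_a_time(query_terms, index, N, k=10, r=1000, min_idf=0.5):
    """Term-at-a-time scoring with accumulators and two pruning heuristics.

    index: dict term -> postings list of (docID, wf_td), impact-ordered
           (sorted by decreasing weight wf_td).
    """
    accumulators = {}                                     # docID -> partial score

    def idf(t):
        df = len(index.get(t, [])) or 1
        return math.log10(N / df)

    # idf-ordered terms: high-idf (rare) terms are processed first
    for t in sorted(query_terms, key=idf, reverse=True):
        t_idf = idf(t)
        if t_idf < min_idf:
            break                                         # remaining terms change scores little
        for doc_id, wf_td in index.get(t, [])[:r]:        # early termination after r postings
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + wf_td * t_idf

    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])
```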

SLIDE 32

Tiered indexes

Basic idea:
• Create several tiers of indexes
• During query processing, start with the highest-tier index
• If the highest-tier index returns at least k (e.g., k = 100) results: stop and return them to the user
• If we have found < k hits: repeat for the next index in the tier cascade

SLIDE 33

Tiered indexes

• Break postings up into a hierarchy of lists, from most important to least important
  - Can be done by g(d) or another measure
• Inverted index ⇒ tiers of decreasing importance
• At query time, use the top tier unless it fails to yield K docs
  - If so, drop to lower tiers
• Tiered indexes were one of the reasons for the success of early Google (2000/01)
  - along with PageRank, use of anchor text, and proximity constraints

  • Sec. 7.2.1
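
A minimal sketch of the tier cascade, assuming the tiers themselves are built elsewhere (e.g., split by g(d) or by tf); search_tier is a hypothetical per-tier search function, not part of any real library.

```python
def tiered_search(query, tiers, search_tier, k=100):
    """Query a list of index tiers, most important first.

    tiers:       list of tier indexes, ordered from most to least important
    search_tier: hypothetical function (query, tier) -> list of (docID, score)
    Stops as soon as at least k results have been collected.
    """
    results, seen = [], set()
    for tier in tiers:
        for doc_id, score in search_tier(query, tier):
            if doc_id not in seen:                # a doc may appear in several tiers
                seen.add(doc_id)
                results.append((doc_id, score))
        if len(results) >= k:
            break                                 # this tier plus earlier ones yielded enough hits
    return results[:k]
```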

SLIDE 34

Example tiered index

  • Sec. 7.2.1

(Example tiers by term frequency: tf ≥ 20, 10 ≤ tf < 20, and tf < 10)

SLIDE 35

Cluster pruning: preprocessing

• Leaders: √N docs chosen at random
• For every other doc, pre-compute its nearest leader
  - Followers: docs attached to a leader
  - Likely: each leader has ~√N followers
• Why random sampling for finding leaders:
  - Fast approach
  - Leaders reflect the data distribution

  • Sec. 7.1.6

SLIDE 36

Cluster pruning: query processing

• Given a query Q, find its nearest leader L.
• Seek the K nearest docs from among L's followers.

  • Sec. 7.1.6
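
A sketch of cluster pruning under the usual assumptions: documents and the query are dense vectors, ~√N leaders are sampled at random during preprocessing, and at query time only the nearest leader's followers are scored.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def preprocess(doc_vectors, seed=0):
    """Pick ~sqrt(N) random leaders and attach every doc to its nearest leader."""
    random.seed(seed)
    doc_ids = list(doc_vectors)
    leaders = random.sample(doc_ids, max(1, int(math.sqrt(len(doc_ids)))))
    followers = {l: [] for l in leaders}
    for d in doc_ids:
        nearest = max(leaders, key=lambda l: cosine(doc_vectors[d], doc_vectors[l]))
        followers[nearest].append(d)
    return leaders, followers

def query_cluster_pruning(q_vec, doc_vectors, leaders, followers, k=10):
    """Find the nearest leader L, then the k nearest docs among L's followers."""
    L = max(leaders, key=lambda l: cosine(q_vec, doc_vectors[l]))
    candidates = followers[L]
    return sorted(candidates, key=lambda d: cosine(q_vec, doc_vectors[d]), reverse=True)[:k]
```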

SLIDE 37

Visualization

(Figure: the query point, the sampled leaders, and their attached followers)

  • Sec. 7.1.6

SLIDE 38

General variants

• Have each follower attached to its b1 nearest leaders.
• From the query, find the b2 nearest leaders and their followers.
• Can recurse on the leader/follower construction.

  • Sec. 7.1.6

SLIDE 39

Complete search system

  • Sec. 7.2.4

SLIDE 40

Components we haven’t covered yet

• Query parser
• Zone indexes: separate indexes for the different zones (parts) of a document
• Document cache: needed for generating snippets
• Machine-learned ranking functions

SLIDE 41

Query parsers

• A free-text query from the user may spawn one or more queries to the indexes, e.g.:
  - Run the query as a phrase query
  - If < K docs contain the phrase, run smaller phrase queries
  - If we still have < K docs, run the vector space query
  - Rank matching docs by vector space scoring
• This sequence is issued by a query parser

  • Sec. 7.2.3

SLIDE 42

Query parsers

• Example: query rising interest rates
  - If < K docs contain the phrase "rising interest rates", run the queries "rising interest" and "interest rates"
  - If we still have < K docs, run the vector space query rising interest rates
• We need an aggregate scoring function that accumulates evidence of a doc's relevance from multiple sources
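
A hedged sketch of this cascade; phrase_search and vector_space_search are hypothetical back-end calls, and the splitting into consecutive bigrams follows the example above ("rising interest", "interest rates").

```python
def parse_and_run(query, phrase_search, vector_space_search, k=10):
    """Issue progressively looser queries until at least k docs are found.

    phrase_search(phrase) -> list of (docID, score)       (hypothetical)
    vector_space_search(terms) -> list of (docID, score)  (hypothetical)
    """
    terms = query.split()                         # e.g. ["rising", "interest", "rates"]
    results = dict(phrase_search(" ".join(terms)))

    if len(results) < k and len(terms) > 2:
        # shorter phrase queries: consecutive bigrams
        for i in range(len(terms) - 1):
            for doc_id, score in phrase_search(f"{terms[i]} {terms[i + 1]}"):
                results[doc_id] = max(score, results.get(doc_id, 0.0))

    if len(results) < k:
        for doc_id, score in vector_space_search(terms):
            results[doc_id] = max(score, results.get(doc_id, 0.0))

    # an aggregate scoring function would combine evidence here; max() is a placeholder
    return sorted(results.items(), key=lambda item: item[1], reverse=True)[:k]
```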

SLIDE 43

Query term proximity

• Free-text queries: just a set of terms
• Users may prefer docs in which the query terms occur within close proximity of each other
• w: smallest window in a doc containing all query terms
  - Query: strained mercy
  - Doc: "The quality of mercy is not strained"
  - w = 4 (words)
• Would like the scoring function to take this into account – how?

  • Sec. 7.2.2
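
A sketch of computing w, the smallest window (in words) containing all query terms, assuming the doc is available as a token list; a two-pointer scan keeps the computation linear in the document length.

```python
from collections import Counter

def smallest_window(doc_tokens, query_terms):
    """Length (in words) of the smallest window of doc_tokens containing
    every query term at least once; None if some term is missing."""
    needed = Counter(set(query_terms))
    missing = len(needed)
    have = Counter()
    best = None
    left = 0
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            have[tok] += 1
            if have[tok] == needed[tok]:
                missing -= 1
        while missing == 0:                       # window [left, right] covers all terms
            width = right - left + 1
            best = width if best is None else min(best, width)
            out = doc_tokens[left]
            if out in needed:
                have[out] -= 1
                if have[out] < needed[out]:
                    missing += 1
            left += 1
    return best

doc = "the quality of mercy is not strained".split()
print(smallest_window(doc, ["strained", "mercy"]))   # -> 4, as in the example above
```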

SLIDE 44

Aggregate scores

• Scoring functions can combine cosine, static quality, proximity, etc.
• How do we know the best combination?
  - Some applications: expert-tuned
  - Increasingly common: machine-learned

  • Sec. 7.2.3

SLIDE 45

Parametric and zone indexes

• Thus far, a doc has been a sequence of terms
• In fact, docs have multiple parts, some with special semantics:
  - Author
  - Title
  - Date of publication
  - Language
  - Format
  - etc.
• These constitute the metadata about a document

  • Sec. 6.1

SLIDE 46

Fields

• We sometimes wish to search by this metadata
  - E.g., find docs authored by William Shakespeare in the year 1601, containing alas poor Yorick
  - Year = 1601 is an example of a field
  - Author last name = shakespeare
• Field or parametric index: postings for each field value
  - Sometimes build range trees (e.g., for dates)
• A field query is typically treated as a conjunction
  - (doc must be authored by shakespeare)

  • Sec. 6.1

SLIDE 47

Zone

• A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
  - Title
  - Abstract
  - References
  - …
• Build inverted indexes on zones (to permit querying them)

  • Sec. 6.1


Zone examples: the body of the doc, all highlighted text in the doc, anchor text, text in metadata fields

SLIDE 48

Example zone indexes

Encode zones in the dictionary vs. in the postings.

  • Sec. 6.1

SLIDE 49

Resources

• IIR 7, 6.1
• Resources at http://ifnlp.org/ir
• How Google tweaks its ranking function
• Interview with Google search guru Udi Manber
• Amit Singhal on Google ranking
• SEO perspective: ranking factors