NPFL103: Information Retrieval (5): Ranking, Complete search system, Evaluation, Benchmarks


SLIDE 1

NPFL103: Information Retrieval (5)

Ranking, Complete search system, Evaluation, Benchmarks

Pavel Pecina

pecina@ufal.mff.cuni.cz Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics Charles University

Original slides are courtesy of Hinrich Schütze, University of Stuttgart.

SLIDE 2

Contents

▶ Ranking: Motivation, Implementation
▶ Complete search system: Tiered indexes, Query processing
▶ Evaluation: Unranked evaluation, Ranked evaluation, A/B testing
▶ Benchmarks: Standard benchmarks

SLIDE 3

Ranking

SLIDE 4

Why is ranking so important?

Problems with unranked retrieval:

▶ Users want to look at a few results – not thousands.
▶ It's very hard to write queries that produce only a few results, even for expert searchers.

→ Ranking effectively reduces a large set of results to a very small one.

SLIDE 5

Empirical investigation of the effect of ranking

▶ How can we measure how important ranking is?
▶ Observe what searchers do while searching in a controlled setting:

  ▶ Videotape them
  ▶ Ask them to "think aloud"
  ▶ Interview them
  ▶ Eye-track them
  ▶ Time them
  ▶ Record and count their clicks

▶ The following slides are by Dan Russell of Google.

SLIDE 6

SLIDE 7

SLIDE 8

SLIDE 9

SLIDE 10

Importance of ranking: Summary

▶ Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower-ranked pages (7, 8, 9, 10).
▶ Clicking: The distribution is even more skewed for clicking.
  ▶ In 1 out of 2 cases (50%!), users click on the top-ranked page.
  ▶ Even if the top-ranked page is not relevant, 30% of users click on it.

→ Getting the ranking right is very important.
→ Getting the top-ranked page right is most important.

SLIDE 11

We need term frequencies in the index

Brutus    → (1,2) (7,3) (83,1) (87,2) …
Caesar    → (1,1) (5,1) (13,1) (17,1) …
Calpurnia → (7,1) (8,2) (40,1) (97,3) …

Each posting stores (docID, term frequency). We also need positions; not shown here.

SLIDE 12

Term frequencies in the inverted index

▶ In each posting, store tf_t,d in addition to the docID of d.
▶ Store it as an integer frequency, not as a (log-)weighted real number, because real numbers are difficult to compress.
▶ Additional space requirements are small: a byte per posting or less.

SLIDE 13

How do we compute the top k in ranking?

▶ In many applications, we don't need a complete ranking.
▶ We just need the top k for a small k (e.g., k = 100).
▶ Is there an efficient way of computing just the top k?
▶ Naive (not very efficient):
  ▶ Compute scores for all N documents
  ▶ Sort
  ▶ Return the top k
▶ Alternative: min heap

SLIDE 14

Use min heap for selecting top k out of N

▶ A binary min heap is a binary tree in which each node's value is less than the values of its children.
▶ [Figure: example min heap with root 0.6 and further node values 0.85, 0.7, 0.9, 0.97, 0.8, 0.95]
▶ Takes O(N log k) operations to build (N is the number of documents) …
▶ … and then O(k log k) steps to read off the k winners.

SLIDE 15

Selecting top k scoring documents in O(N log k)

▶ Goal: keep the top k documents seen so far.
▶ Use a binary min heap.
▶ To process a new document d′ with score s′ (a runnable sketch follows below):
  1. Get the current minimum h_m of the heap (O(1)).
  2. If s′ ≤ h_m, skip to the next document.
  3. If s′ > h_m, heap-delete-root (O(log k)).
  4. Heap-add d′/s′ (O(log k)).
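A minimal sketch of this procedure in Python using the standard heapq module; the document IDs, scores, and the cosine_score function in the usage comment are hypothetical placeholders.

```python
import heapq

def top_k(scored_docs, k):
    """Keep the k highest-scoring (doc_id, score) pairs using a size-k min heap."""
    heap = []                                      # min heap: heap[0] holds the smallest kept score
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))        # heap not full yet: O(log k)
        elif score > heap[0][0]:                         # compare with current minimum: O(1)
            heapq.heapreplace(heap, (score, doc_id))     # delete root and add new entry: O(log k)
    return sorted(heap, reverse=True)                    # read off the k winners: O(k log k)

# Hypothetical usage:
# winners = top_k(((d, cosine_score(q, d)) for d in all_doc_ids), k=100)
```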

SLIDE 16

Even more efficient computation of top k?

▶ Ranking has time complexity O(N), where N is the number of documents.
▶ Optimizations reduce the constant factor, but they are still O(N), and N > 10^10.
▶ Are there sublinear algorithms?
▶ What we are doing in effect: solving the k-nearest-neighbor (kNN) problem for the query vector (= query point).
▶ There are no general solutions to this problem that are sublinear.

SLIDE 17

More efficient computation of top k: Heuristics

▶ Idea 1: Reorder postings lists
  ▶ Instead of ordering according to docID, order according to some measure of "expected relevance".
▶ Idea 2: Heuristics to prune the search space
  ▶ Not guaranteed to be correct, but fails rarely.
  ▶ In practice, close to constant time.
▶ For this, we'll need the concepts of document-at-a-time processing and term-at-a-time processing.

SLIDE 18

Non-docID ordering of postings lists

▶ So far: postings lists have been ordered according to docID.
▶ Alternative: a query-independent measure of "goodness" of a page
▶ Example: PageRank g(d) of page d, a measure of how many "good" pages hyperlink to d (later in this course)
▶ Order documents in postings lists according to PageRank: g(d1) > g(d2) > g(d3) > …
▶ Define the composite score of a document: s(q, d) = g(d) + cos(q, d)
▶ This scheme supports early termination: we do not have to process postings lists in their entirety to find the top k.

SLIDE 19

Non-docID ordering of postings lists (2)

▶ Order documents in postings lists according to PageRank: g(d1) > g(d2) > g(d3) > …
▶ Define the composite score of a document: s(q, d) = g(d) + cos(q, d)
▶ Suppose: (i) g → [0, 1]; (ii) g(d) < 0.1 for the document d we are currently processing; (iii) the smallest top-k score we have found so far is 1.2.
▶ Then all subsequent scores will be < 1.1 (since cos(q, d) ≤ 1 and g only decreases).
▶ So we have already found the top k and can stop processing the remainder of the postings lists (see the sketch below).
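A minimal sketch of this early-termination test, assuming postings already sorted by decreasing PageRank g(d) with g in [0, 1], and a hypothetical cosine(query, doc_id) function returning values in [0, 1].

```python
import heapq

def top_k_early_termination(postings_by_pagerank, query, cosine, k):
    """postings_by_pagerank: iterable of (doc_id, g) pairs sorted by decreasing g, g in [0, 1]."""
    heap = []                                   # min heap of (composite score, doc_id), size <= k
    for doc_id, g in postings_by_pagerank:
        # g only decreases and cosine <= 1, so g + 1 bounds every remaining composite score.
        # Once the heap holds k entries and this bound cannot beat the current minimum, stop.
        if len(heap) == k and g + 1.0 <= heap[0][0]:
            break
        score = g + cosine(query, doc_id)       # composite score s(q, d) = g(d) + cos(q, d)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```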

SLIDE 20

Document-at-a-time processing

▶ Both docID ordering and PageRank ordering impose a consistent ordering on documents in postings lists.
▶ Computing cosines in this scheme is document-at-a-time:
  ▶ We complete the computation of the query-document similarity score of document d_i before starting to compute the query-document similarity score of d_{i+1}.
▶ Alternative: term-at-a-time processing.

SLIDE 21

Weight-sorted postings lists

▶ Idea: don't process postings that contribute little to the final score.
▶ Order documents in each postings list according to weight.
▶ Simplest case: normalized tf-idf weight (rarely done: hard to compress).
▶ Top-k documents are likely to occur early in these ordered lists.
→ Early termination is unlikely to change the top k.
▶ But:
  ▶ no consistent ordering of documents across postings lists
  ▶ no way to employ document-at-a-time processing

SLIDE 22

Term-at-a-time processing

▶ Simplest case: completely process the postings list of the first query term.
▶ Create an accumulator for each docID you encounter.
▶ Then completely process the postings list of the second query term, and so forth.

SLIDE 23

Term-at-a-time processing

CosineScore(q)
  float Scores[N] = 0
  float Length[N]
  for each query term t
    do calculate w_t,q and fetch the postings list for t
       for each pair (d, tf_t,d) in the postings list
         do Scores[d] += w_t,d × w_t,q
  Read the array Length
  for each d
    do Scores[d] = Scores[d] / Length[d]
  return Top k components of Scores[]

▶ Accumulators ("Scores[]") as an array over all N documents: not optimal (or even infeasible).
▶ Thus: only create accumulators for docs that actually occur in the postings lists (see the sketch below).
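A minimal term-at-a-time sketch in Python with a dictionary of accumulators, so only documents that occur in some postings list get one; the postings layout, weight functions, and doc_lengths argument are hypothetical assumptions, not the course's reference implementation.

```python
from collections import defaultdict
import heapq

def cosine_score_term_at_a_time(query_terms, postings, query_weight, doc_weight, doc_lengths, k):
    """postings[t] is a list of (doc_id, tf) pairs; weights and lengths are supplied by the caller."""
    scores = defaultdict(float)                  # accumulators only for docIDs we actually see
    for t in query_terms:
        w_tq = query_weight(t)                   # w_{t,q}
        for doc_id, tf in postings.get(t, []):
            scores[doc_id] += doc_weight(t, tf) * w_tq    # Scores[d] += w_{t,d} * w_{t,q}
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]             # length normalization
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```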

SLIDE 24

Accumulators: Example

Brutus    → (1,2) (7,3) (83,1) (87,2) …
Caesar    → (1,1) (5,1) (13,1) (17,1) …
Calpurnia → (7,1) (8,2) (40,1) (97,3) …

▶ For the query [Brutus Caesar]:
  ▶ We only need accumulators for docs 1, 5, 7, 13, 17, 83, 87.
  ▶ We don't need accumulators for docs 3, 8, etc.

SLIDE 25

Enforcing conjunctive search

▶ We can enforce conjunctive search (à la Google): only consider documents (and create accumulators) in which all query terms occur.
▶ Example: just one accumulator for [Brutus Caesar] in the example above, because only d1 contains both words (a sketch of this filtering step follows below).
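A minimal sketch, assuming the same postings layout as above (lists of (doc_id, tf) pairs), of restricting accumulators to documents that contain all query terms; the function name is hypothetical.

```python
def conjunctive_doc_ids(query_terms, postings):
    """Return the docIDs that contain every query term (the only docs that get accumulators)."""
    doc_sets = [{doc_id for doc_id, _ in postings.get(t, [])} for t in query_terms]
    return set.intersection(*doc_sets) if doc_sets else set()

# With the example postings above, only doc 1 contains both Brutus and Caesar:
postings = {
    "Brutus": [(1, 2), (7, 3), (83, 1), (87, 2)],
    "Caesar": [(1, 1), (5, 1), (13, 1), (17, 1)],
}
print(conjunctive_doc_ids(["Brutus", "Caesar"], postings))   # {1}
```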

SLIDE 26

Complete search system

SLIDE 27

Complete search system

SLIDE 28

Tiered indexes

▶ Basic idea:
  ▶ Create several tiers of indexes, corresponding to the importance of indexing terms.
  ▶ During query processing, start with the highest-tier index.
  ▶ If the highest-tier index returns at least k (e.g., k = 100) results: stop and return the results to the user.
  ▶ If we have only found < k hits: repeat for the next index in the tier cascade (a sketch of this cascade follows below).
▶ Example: two-tier system
  ▶ Tier 1: Index of all titles
  ▶ Tier 2: Index of the rest of the documents
  ▶ Motivation: Pages containing the search words in the title are better hits than pages containing the search words only in the body of the text.
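A minimal sketch of the tier cascade, assuming a hypothetical search_tier(index, query, k) function that returns a ranked result list for a single tier; whether lower-tier hits are merged with or appended after higher-tier hits is a design choice, and here they are simply appended.

```python
def tiered_search(tier_indexes, query, k, search_tier):
    """Query tiers from highest to lowest; stop as soon as at least k results have been collected."""
    results = []
    for index in tier_indexes:                   # tier_indexes ordered from highest to lowest tier
        results.extend(search_tier(index, query, k))
        if len(results) >= k:                    # enough hits: the lower tiers are never touched
            break
    return results[:k]
```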

SLIDE 29

Tiered index

[Figure: example of a three-tier index for the terms "auto", "best", "car", and "insurance", with each term's postings (Doc1, Doc2, Doc3) split across Tier 1, Tier 2, and Tier 3.]

SLIDE 30

Tiered indexes

▶ The use of tiered indexes is believed to be one of the reasons that Google's search quality was significantly higher initially (2000/01) than that of its competitors.
▶ (along with PageRank, the use of anchor text, and proximity constraints)

SLIDE 31

Query parser

▶ IR systems often guess what the user intended.
▶ The two-term query London tower (without quotes) may be interpreted as the phrase query "London tower".
▶ The query 100 Madison Avenue, New York may be interpreted as a request for a map.
▶ How do we "parse" the query and translate it into a formal specification containing phrase operators, proximity operators, indexes to search, etc.?

SLIDE 32

Complete search system

SLIDE 33

Components we have introduced thus far

▶ Document preprocessing (linguistic and otherwise)
▶ Positional indexes
▶ Tiered indexes
▶ Spelling correction
▶ k-gram indexes for wildcard queries and spelling correction
▶ Query processing
▶ Document scoring
▶ Term-at-a-time processing

SLIDE 34

Components we haven’t covered yet

▶ Document cache: generating snippets (= dynamic summaries)
▶ Zone indexes: separate indexes for different zones: the body of the document, all highlighted text in the document, anchor text, text in metadata fields, etc.
▶ Machine-learned ranking functions
▶ Proximity ranking (e.g., rank documents in which the query terms occur in the same local window higher than documents in which the query terms occur far from each other)

SLIDE 35

Vector space retrieval: Interactions

▶ How do we combine phrase retrieval with vector space retrieval?
  ▶ We do not want to compute document frequency / idf for every possible phrase. Why?
▶ How do we combine Boolean retrieval with vector space retrieval?
  ▶ For example: "+" constraints and "-" constraints.
  ▶ Postfiltering is simple, but can be very inefficient – no easy answer.
▶ How do we combine wildcards with vector space retrieval?
  ▶ Again, no easy answer.

SLIDE 36

Evaluation

SLIDE 37

Measures for a search engine

1. How fast does it index?
   ▶ e.g., number of bytes per hour
2. How fast does it search?
   ▶ e.g., latency as a function of queries per second
3. What is the cost per query?
   ▶ in dollars

SLIDE 38

Measures for a search engine

▶ All of the preceding criteria are measurable: speed / size / money
▶ However, the key measure for a search engine is user happiness.
▶ Factors of user happiness include:
  ▶ Speed of response
  ▶ Size of the index
  ▶ Uncluttered UI
  ▶ Most important: relevance
  ▶ (actually, maybe even more important: it's free)
▶ Note that none of these is sufficient: blindingly fast but useless answers won't make a user happy.

▶ How can we quantify user happiness?

SLIDE 39

Who is the user?

▶ Web search engine: searcher
  ▶ Success: Searcher finds what she was looking for
  ▶ Measure: rate of return to this search engine
▶ Web search engine: advertiser
  ▶ Success: Searcher clicks on ad
  ▶ Measure: clickthrough rate
▶ Ecommerce: buyer
  ▶ Success: Buyer buys something
  ▶ Measures: time to purchase, fraction of searcher-to-buyer "conversions"
▶ Ecommerce: seller
  ▶ Success: Seller sells something
  ▶ Measure: profit per item sold
▶ Enterprise: CEO
  ▶ Success: Employees are more productive (because of effective search)
  ▶ Measure: profit of the company

SLIDE 40

Most common definition of user happiness: Relevance

▶ User happiness is equated with the relevance of the search results to the query.
▶ But how do you measure relevance?
▶ The standard methodology in IR consists of three elements:
  1. A benchmark document collection.
  2. A benchmark suite of queries.
  3. An assessment of the relevance of each query-document pair.

SLIDE 41

Relevance to what?

“Relevance to the query” is very problematic:

▶ Information need i: "I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine."
  (This is an information need, not a query.)
▶ Query q: [red wine white wine heart attack]
▶ Consider document d′:
  "At heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving."
▶ d′ is an excellent match for the query q but not relevant to the information need i.

SLIDE 42

Relevance: query vs. information need

▶ User happiness can only be measured by relevance to an information need, not by relevance to queries.
▶ Our terminology is sloppy: we talk about query-document relevance judgments even though we mean information-need-document relevance judgments.

SLIDE 43

Precision and recall

▶ Precision (P) is the fraction of retrieved documents that are relevant:

  Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)

▶ Recall (R) is the fraction of relevant documents that are retrieved:

  Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant)

SLIDE 44

Precision and recall: confusion matrix

                Relevant                 Nonrelevant
Retrieved       true positives (TP)      false positives (FP)
Not retrieved   false negatives (FN)     true negatives (TN)

  P = TP / (TP + FP)        R = TP / (TP + FN)
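A minimal Python rendering of these two definitions, using the counts from the confusion matrix above.

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant documents that are retrieved: TP / (TP + FN)."""
    return tp / (tp + fn)
```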

SLIDE 45

Precision/recall tradeoff

▶ You can increase recall by returning more docs.
▶ Recall is a non-decreasing function of the number of docs retrieved.
▶ A system that returns all docs has 100% recall!
▶ The converse is also true (usually): it's easy to get high precision for very low recall.
▶ Suppose the document with the largest score is relevant. How can we maximize precision?

SLIDE 46

A combined measure: F

▶ The F measure allows us to trade off precision against recall:

  F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R),   where β² = (1 − α)/α

▶ α ∈ [0, 1] and thus β² ∈ [0, ∞]
▶ Most frequently used: balanced F with β = 1 (i.e., α = 0.5)
▶ This is the harmonic mean of P and R:  1/F = (1/2)·(1/P + 1/R)
▶ What value range of β weights recall higher than precision?

SLIDE 47

F measure: Example

               relevant   not relevant      total
retrieved            20             40         60
not retrieved        60      1,000,000  1,000,060
total                80      1,000,040  1,000,120

▶ P = 20/(20 + 40) = 1/3
▶ R = 20/(20 + 60) = 1/4
▶ F1 = 2 / (1/(1/3) + 1/(1/4)) = 2/(3 + 4) = 2/7
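A short check of this example in Python, computing the balanced F from the confusion-matrix counts above with exact fractions.

```python
from fractions import Fraction

def f1(tp, fp, fn):
    """Balanced F: the harmonic mean of precision and recall."""
    p = Fraction(tp, tp + fp)   # precision = 1/3 in the example
    r = Fraction(tp, tp + fn)   # recall    = 1/4 in the example
    return 2 * p * r / (p + r)

print(f1(tp=20, fp=40, fn=60))   # 2/7
```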

SLIDE 48

Accuracy

▶ Why do we use complex measures like precision, recall, and F?
▶ Why not something simple like accuracy?
▶ Accuracy is the fraction of correct decisions (relevant/nonrelevant).
▶ In terms of the contingency table above:

  Accuracy = (TP + TN) / (TP + FP + FN + TN)

▶ Why is accuracy not a useful measure for web information retrieval?

SLIDE 49

Why accuracy is a useless measure in IR

▶ The numbers of relevant and non-relevant documents are highly unbalanced.
▶ A trick to maximize accuracy in IR: always say no and return nothing.
▶ You then get 99.99% accuracy on most queries (only 0.01% of docs are relevant).
▶ Searchers on the web (and in IR in general) want to find something and have a certain tolerance for junk.
▶ It's better to return some bad hits as long as you return something.
→ We use precision, recall, and F for evaluation, not accuracy.

SLIDE 50

F measure: Why harmonic mean?

▶ Why don't we use a different mean of P and R as a measure?
  ▶ e.g., the arithmetic mean
▶ The simple (arithmetic) mean is about 50% for a "return-everything" search engine (P ≈ 0%, R = 100%), which is too high.
▶ Desideratum: punish bad performance on either precision or recall.
▶ Taking the minimum achieves this.
▶ But the minimum is not smooth and is hard to weight.
▶ F (the harmonic mean) is a kind of smooth minimum.

SLIDE 51

F1 and other averages

[Figure: Minimum, Maximum, Arithmetic, Geometric, and Harmonic means plotted as functions of Precision (0-100%), with Recall fixed at 70%.]

▶ We can view the harmonic mean as a kind of soft minimum.

SLIDE 52

Difficulties in using precision, recall, and the F measure

▶ We should always average over a large set of queries.
▶ We need relevance judgments for information-need-document pairs – but they are expensive to produce.
▶ Alternatives to using precision/recall and having to produce relevance judgments exist (e.g., A/B testing).

SLIDE 53

Precision-recall curve

▶ Precision/recall/F are measures for unranked sets.
▶ We can easily turn set measures into measures of ranked lists.
▶ Just compute the set measure for each "prefix" of the ranked list: the top 1, top 2, top 3, top 4, etc. results.
▶ Doing this for precision and recall gives you a precision-recall curve (a sketch of the computation follows below).
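A minimal sketch of computing precision and recall for each prefix of a ranked result list; ranked_doc_ids and relevant are hypothetical inputs.

```python
def precision_recall_points(ranked_doc_ids, relevant):
    """Return (recall, precision) after each prefix of the ranked list; relevant must be non-empty."""
    relevant = set(relevant)
    hits = 0
    points = []
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))    # (recall@k, precision@k)
    return points

# Hypothetical usage:
# curve = precision_recall_points(["d3", "d7", "d1", "d9"], relevant={"d1", "d3", "d5"})
```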

SLIDE 54

A precision-recall curve

[Figure: a precision-recall curve, precision (0.0-0.8) plotted against recall (0.0-1.0), with the interpolated curve shown in red.]

▶ Each point corresponds to the result set of the top k ranked hits (k = 1, 2, 3, …).
▶ Interpolation (in red): take the maximum of all future points.
▶ Rationale for interpolation: the user is willing to look at more stuff if both precision and recall get better.

SLIDE 55

11-point interpolated average precision

Recall   Interpolated Precision
0.0      1.00
0.1      0.67
0.2      0.63
0.3      0.55
0.4      0.45
0.5      0.41
0.6      0.36
0.7      0.29
0.8      0.13
0.9      0.10
1.0      0.08

11-point average: ≈ 0.425. How can precision at recall 0.0 be > 0?
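A minimal sketch of interpolated precision and the 11-point average, applied to the (recall, precision) points produced by the prefix computation sketched earlier; the input format is an assumption.

```python
def interpolated_precision(points, recall_level):
    """Max precision at any recall >= recall_level (0.0 if no such point exists)."""
    return max((p for r, p in points if r >= recall_level), default=0.0)

def eleven_point_average(points):
    """Average interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, level) for level in levels) / len(levels)
```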

SLIDE 56

Averaged 11-point precision/recall graph

[Figure: averaged 11-point precision/recall graph, precision vs. recall, both from 0 to 1.]

▶ Compute interpolated precision at recall levels 0.0, 0.1, 0.2, …
▶ Do this for each of the queries in the evaluation benchmark.
▶ Average over queries.
▶ This measure captures performance at all recall levels.

SLIDE 57

Evaluation at large search engines

▶ Recall is difficult to measure on the web.
▶ Search engines often use precision at top k (e.g., k = 10), or measures that reward you more for getting rank 1 right than for getting rank 10 right.
▶ Search engines also use non-relevance-based measures.
  ▶ Example 1: clickthrough on the first result. Not very reliable if you look at a single clickthrough (you may realize after clicking that the summary was misleading and the document is nonrelevant), but pretty reliable in the aggregate.
  ▶ Example 2: ongoing studies of user behavior in the lab
  ▶ Example 3: A/B testing

SLIDE 58

A/B testing

▶ Purpose: test a single innovation
▶ Prerequisite: you have a large search engine up and running
▶ Steps:
  1. Have most users use the old system.
  2. Divert a small proportion of traffic (e.g., 1%) to the new system.
  3. Evaluate with an automatic measure like clickthrough on the first result.
  4. Directly see whether the innovation improves user happiness.
▶ Probably the evaluation methodology that large search engines trust most.
▶ Variant: give users the option to switch to the new algorithm/interface.

SLIDE 59

Benchmarks

SLIDE 60

What we need for a benchmark

1. A collection of documents
   ▶ Must be representative of the documents we expect to see in reality.
2. A collection of information needs
   ▶ (which we will often incorrectly refer to as queries)
   ▶ Must be representative of the information needs we expect to see in reality.
3. Human relevance assessments
   ▶ We need to hire/pay "judges" or assessors to do this.
   ▶ Expensive, time-consuming.
   ▶ Judges must be representative of the users we expect to see in reality.

SLIDE 61

Standard relevance benchmark: Cranfield

▶ Pioneering: the first testbed allowing precise quantitative measures of information retrieval effectiveness.
▶ Late 1950s, UK.
▶ 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all query-document pairs.
▶ Too small and too untypical for serious IR evaluation today.

SLIDE 62

Standard relevance benchmark: TREC

▶ TREC = Text Retrieval Conference
▶ Organized by the National Institute of Standards and Technology (NIST)
▶ TREC is actually a set of several different relevance benchmarks.
▶ Best known: TREC Ad Hoc, used for TREC evaluations in 1992-1999
▶ 1.89 million documents, mainly newswire articles, 450 information needs
▶ No exhaustive relevance judgments – too expensive
▶ Rather, NIST assessors' relevance judgments are available only for the documents that were among the top k returned by some system entered in the TREC evaluation for which the information need was developed.

SLIDE 63

Standard relevance benchmarks: Others

▶ GOV2
  ▶ Another TREC/NIST collection
  ▶ 25 million web pages
  ▶ Used to be the largest collection that is easily available
  ▶ But still 3 orders of magnitude smaller than what Google/Bing index
▶ NTCIR
  ▶ East Asian language and cross-language information retrieval
▶ Cross-Language Evaluation Forum (CLEF)
  ▶ This evaluation series has concentrated on European languages and cross-language information retrieval.
▶ Many others
