Chapter 12: Query Processing

Computers are useless, they can only give you answers. -- Pablo Picasso
You have to think anyway, so why not think big? -- Donald Trump
There are lies, damn lies, and workload assumptions. -- anonymous

IRDM WS 2015

Outline
12.1 Query Processing Algorithms
12.2 Fast Top-k Search
12.3 Phrase and Proximity Queries
12.4 Query Result Diversification

Loosely following Büttcher/Clarke/Cormack, Chapters 5 and 8.6, plus Manning/Raghavan/Schütze, Chapters 7 and 9, plus specific literature.

Query Types
• Conjunctive (i.e., all query terms are required)
• Disjunctive (i.e., a subset of the query terms is sufficient)
• Phrase or proximity (i.e., query terms must occur in the right order or close enough to each other)
• Mixed-mode with negation (e.g., "harry potter" review +movie -book)
• Combined with ranking of result documents according to $score(q, d) = \sum_{t \in q} score(t, d)$, with $score(t, d)$ depending on the retrieval model (e.g., tf*idf)

Indexing with Document-Ordered Lists

Data items: d1, …, dn, each with per-term scores, e.g. for d1: s(t1, d1) = 0.7, …, s(tm, d1) = 0.2

Index lists (entries as docId: score):
t1: d1: 0.7, d10: 0.8, d23: 0.8, d78: 0.9, d88: 0.2, …
t2: d1: 0.2, d10: 0.6, d23: 0.6, d64: 0.8, d78: 0.1, …
t3: d10: 0.7, d34: 0.1, d64: 0.4, d78: 0.5, d99: 0.2, …

Index-list entries are stored in ascending order of document identifiers (document-ordered lists).
All queries (conjunctive/disjunctive/mixed) are processed by sequential scan and merge of posting lists.

Document-at-a-Time Query Processing

Document-at-a-Time (DAAT) query processing
– assumes document-ordered posting lists
– scans posting lists for query terms t_1, …, t_|q| concurrently
– maintains an accumulator for each candidate result doc: $acc(d) = \sum_{i:\, d \text{ seen in } L(t_i)} score(t_i, d)$
– always advances the posting list with the lowest current doc id
– exploits skip pointers when applicable
– required memory depends on the number of results to be returned
– top-k results are kept in a priority queue

Example posting lists (entries as docId, score):
a: d1, 1.0   d4, 2.0   d7, 0.2   d8, 0.1
b: d4, 1.0   d7, 2.0   d8, 0.2   d9, 0.1
c: d4, 3.0   d7, 1.0

Accumulators:
d1: 1.0
d4: 6.0
d7: 3.2
d8: 0.3
d9: 0.1
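The DAAT scan can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: posting lists are in-memory lists of (doc id, score) pairs, skip pointers are omitted, and the names `daat_top_k` and `DAAT_EXAMPLE` are made up for this sketch. The data are the lists a, b, c from the example.

```python
import heapq

def daat_top_k(postings, k):
    """Document-at-a-time scoring over document-ordered posting lists.

    postings: dict term -> list of (doc_id, score), sorted by doc_id.
    Returns the k best (doc_id, score) pairs, best first.
    """
    lists = list(postings.values())
    pos = [0] * len(lists)   # one scan cursor per posting list
    top_k = []               # min-heap of (score, doc_id), at most k entries
    while True:
        # The lowest doc id under any cursor is the next candidate document.
        current = [lst[p][0] for lst, p in zip(lists, pos) if p < len(lst)]
        if not current:
            break
        doc = min(current)
        acc = 0.0
        for i, lst in enumerate(lists):
            # Sum the scores of all lists positioned on this doc, then advance them.
            if pos[i] < len(lst) and lst[pos[i]][0] == doc:
                acc += lst[pos[i]][1]
                pos[i] += 1
        if len(top_k) < k:
            heapq.heappush(top_k, (acc, doc))
        elif acc > top_k[0][0]:
            heapq.heapreplace(top_k, (acc, doc))
    return [(d, s) for s, d in sorted(top_k, reverse=True)]

# Posting lists a, b, c from the slide (doc ids as integers).
DAAT_EXAMPLE = {
    "a": [(1, 1.0), (4, 2.0), (7, 0.2), (8, 0.1)],
    "b": [(4, 1.0), (7, 2.0), (8, 0.2), (9, 0.1)],
    "c": [(4, 3.0), (7, 1.0)],
}

if __name__ == "__main__":
    print(daat_top_k(DAAT_EXAMPLE, 2))
```

Because a document's score is complete as soon as all cursors have moved past it, only the heap of k entries has to be kept, which is why the memory requirement depends on the number of results.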

DAAT with Weak And: WAND Method [Broder et al. 2003]

Disjunctive (Weak And) query processing
– assumes document-ordered posting lists with known maxscore(i) values for each t_i: $maxscore(i) = \max_d score(t_i, d)$
– while scanning posting lists, keeps track of
  • min-k: the lowest total score in the current top-k results
  • ordered term list: terms sorted by docId at the current scan position
  • pivot term: smallest j such that $\text{min-k} \le \sum_{i \le j} maxscore(i)$
  • pivot doc: doc id at the current scan position in posting list L_j

Eliminate docs that cannot become top-k results (maxscore pruning):
– if the pivot term does not exist ($\text{min-k} > \sum_i maxscore(i)$) then stop
– else advance scan positions to pos ≥ id of pivot doc ("big skip")

Example: DAAT with WAND Method [Broder et al. 2003]

Key invariant: for terms i = 1..|q| with current scan positions cur_i, assume that cur_1 = min {cur_i | i = 1..|q|}. Then for each posting list i there is no docid between cur_1 and cur_i.

term i   maxscore_i   cur_i
1        5            101
2        4            250
3        2            300
4        3            600   (cannot contain any docid ∈ [102, 599])

Suppose that min-k = 12. Then the pivot term is 4 ($\sum_{i=1..3} maxscore_i = 11 < \text{min-k}$, $\sum_{i=1..4} maxscore_i = 14 \ge \text{min-k}$) and the pivot docid is 600, so all scan positions cur_i can be advanced to 600.
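The pivot selection in this example can be reproduced with a short Python sketch. It covers only the pivot step, not the full WAND loop, and it uses the slide's condition (prefix sum of maxscores ≥ min-k). The function name `find_pivot` and the tuple layout are assumptions made for illustration.

```python
def find_pivot(cursors, min_k):
    """Select the WAND pivot.

    cursors: list of (term, maxscore, current_doc_id), one per query term.
    Returns (pivot_term, pivot_doc_id), or None if even the sum of all
    maxscores stays below min_k (no remaining doc can enter the top-k).
    """
    # Ordered term list: terms sorted by the doc id at their scan position.
    ordered = sorted(cursors, key=lambda c: c[2])
    upper_bound = 0.0
    for term, maxscore, doc_id in ordered:
        upper_bound += maxscore
        if upper_bound >= min_k:
            # Docs before doc_id occur only in lists whose maxscores
            # sum to less than min_k, so they can be skipped.
            return term, doc_id
    return None

# Values from the slide: (term i, maxscore_i, cur_i).
WAND_EXAMPLE = [(1, 5, 101), (2, 4, 250), (3, 2, 300), (4, 3, 600)]

if __name__ == "__main__":
    print(find_pivot(WAND_EXAMPLE, 12))
```

With min-k = 12 the prefix sums are 5, 9, 11, 14, so the first prefix reaching min-k ends at term 4 and the pivot docid is 600.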

Term-at-a-Time Query Processing

Term-at-a-Time (TAAT) query processing
– assumes document-ordered posting lists
– scans posting lists for query terms t_1, …, t_|q| one at a time (possibly in decreasing order of idf values)
– maintains an accumulator for each candidate result doc
– after processing L(t_j): $acc(d) = \sum_{i \le j} score(t_i, d)$
– memory depends on the number of accumulators maintained
– TAAT is attractive when scanning many short lists

Example posting lists (entries as docId, score):
a: d1, 1.0   d4, 2.0   d7, 0.2   d8, 0.1
b: d4, 1.0   d7, 2.0   d8, 0.2   d9, 0.1
c: d4, 3.0   d7, 1.0

Accumulators (initially all 0.0):
doc   after a   after b   after c
d1    1.0       1.0       1.0
d4    2.0       3.0       6.0
d7    0.2       2.2       3.2
d8    0.1       0.3       0.3
d9    0.0       0.1       0.1
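A minimal TAAT sketch in Python, assuming in-memory posting lists and a plain dictionary as the accumulator table. The name `taat_scores` and the snapshot list are illustrative additions; they only serve to show how the accumulators evolve list by list.

```python
from collections import defaultdict

def taat_scores(postings, term_order):
    """Term-at-a-time scoring.

    postings: dict term -> list of (doc_id, score).
    term_order: the order in which the posting lists are processed.
    Returns a list with one accumulator snapshot per processed list.
    """
    acc = defaultdict(float)
    snapshots = []
    for term in term_order:
        # Scan one posting list completely before touching the next one.
        for doc_id, score in postings[term]:
            acc[doc_id] += score
        snapshots.append(dict(acc))
    return snapshots

# Posting lists a, b, c from the slide (doc ids as integers).
TAAT_EXAMPLE = {
    "a": [(1, 1.0), (4, 2.0), (7, 0.2), (8, 0.1)],
    "b": [(4, 1.0), (7, 2.0), (8, 0.2), (9, 0.1)],
    "c": [(4, 3.0), (7, 1.0)],
}

if __name__ == "__main__":
    for step in taat_scores(TAAT_EXAMPLE, ["a", "b", "c"]):
        print(step)
```

Unlike DAAT, no score is final until the last list has been read, so every candidate document needs an accumulator for the whole run.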

Indexing with Impact-Ordered Lists

Data items: d1, …, dn, each with per-term scores, e.g. for d1: s(t1, d1) = 0.7, …, s(tm, d1) = 0.2

Index lists (entries as docId: score):
t1: d78: 0.9, d23: 0.8, d10: 0.8, d1: 0.7, d88: 0.2, …
t2: d64: 0.8, d23: 0.6, d10: 0.6, d1: 0.2, d78: 0.1, …
t3: d10: 0.7, d78: 0.5, d64: 0.4, d99: 0.2, d34: 0.1, …

Index-list entries are stored in descending order of per-term score impact (impact-ordered lists).
This aims to avoid reading entire lists and instead scan only (short) prefixes of lists.

Greedy Query Processing Framework

Assume index lists are sorted by tf(t_i, d_j) or tf(t_i, d_j)*idl(d_j) values; idf values are stored separately.

Open scan cursors on all m index lists L(i)
Repeat
    Find pos(g) among current cursor positions pos(i) (i = 1..m) with the largest value of idf(t_i)*tf(t_i, d_j) (or idf(t_i)*tf(t_i, d_j)*idl(d_j));
    Update the accumulator of the corresponding doc;
    Increment pos(g);
Until stopping condition
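The greedy loop could look as follows in Python. This is a sketch under stated assumptions: lists are sorted by descending tf, the idl factor is left out, and the stopping condition is replaced by a hypothetical step budget `max_steps`, since the framework leaves the actual condition open. The example data are invented for illustration.

```python
def greedy_scan(lists, idf, max_steps):
    """Greedy scan over tf-ordered index lists.

    lists: dict term -> list of (doc_id, tf), sorted by descending tf.
    idf: dict term -> idf weight (stored separately from the lists).
    max_steps: hypothetical stopping condition (number of postings to read).
    Returns the accumulator table doc_id -> partial score.
    """
    pos = {t: 0 for t in lists}   # scan cursor per index list
    acc = {}
    for _ in range(max_steps):
        open_terms = [t for t in lists if pos[t] < len(lists[t])]
        if not open_terms:
            break
        # Pick the cursor whose current posting has the largest idf*tf impact.
        g = max(open_terms, key=lambda t: idf[t] * lists[t][pos[t]][1])
        doc_id, tf = lists[g][pos[g]]
        acc[doc_id] = acc.get(doc_id, 0.0) + idf[g] * tf
        pos[g] += 1
    return acc

# Invented toy data: two terms with tf-ordered lists.
GREEDY_LISTS = {"x": [(1, 3), (2, 1)], "y": [(2, 5), (3, 1)]}
GREEDY_IDF = {"x": 2.0, "y": 1.0}

if __name__ == "__main__":
    print(greedy_scan(GREEDY_LISTS, GREEDY_IDF, max_steps=2))
```

In the toy data the first posting read comes from x (impact 2.0*3 = 6.0) even though y has the larger tf, because the idf weight is applied at selection time.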

Stopping Criterion: Quit & Continue Heuristics [Zobel/Moffat 1996]

For scoring of the form $score(q, d_j) = \sum_{i=1}^{m} s(t_i, d_j)$ with $s(t_i, d_j) \sim tf(t_i, d_j) \cdot idf(t_i) \cdot idl(d_j)$

Assume a hash array of accumulators for summing up the score mass of candidate results.

quit heuristics (with docId-ordered or tf-ordered or tf*idl-ordered index lists):
• ignore index list L(i) if idf(t_i) is below a tunable threshold, or
• stop scanning L(i) if idf(t_i)*tf(t_i, d_j)*idl(d_j) drops below a threshold, or
• stop scanning L(i) when the number of accumulators is too high

continue heuristics:
upon reaching the threshold, continue scanning index lists, but do not add any new documents to the accumulator array
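The difference between the two heuristics can be illustrated with a small Python sketch for the accumulator-limit variant. It is a simplification: lists are processed term at a time, "quit" stops all scanning once the limit is reached, and the names `scan_with_limit`, `max_accumulators` and `mode` are assumptions of this sketch rather than terminology from the cited work.

```python
def scan_with_limit(postings, term_order, max_accumulators, mode):
    """Term-at-a-time scan with an accumulator limit.

    mode "quit": stop scanning once the limit is reached.
    mode "continue": keep scanning, but only update existing accumulators.
    """
    acc = {}
    limit_reached = False
    for term in term_order:
        for doc_id, score in postings[term]:
            if doc_id in acc:
                acc[doc_id] += score
            elif not limit_reached:
                acc[doc_id] = score
                if len(acc) >= max_accumulators:
                    limit_reached = True
                    if mode == "quit":
                        return acc
            # Otherwise (continue mode, limit reached): new docs are ignored.
    return acc

# Posting lists a, b, c from the earlier slides (doc ids as integers).
QC_EXAMPLE = {
    "a": [(1, 1.0), (4, 2.0), (7, 0.2), (8, 0.1)],
    "b": [(4, 1.0), (7, 2.0), (8, 0.2), (9, 0.1)],
    "c": [(4, 3.0), (7, 1.0)],
}

if __name__ == "__main__":
    order = ["a", "b", "c"]
    print(scan_with_limit(QC_EXAMPLE, order, 4, "quit"))
    print(scan_with_limit(QC_EXAMPLE, order, 4, "continue"))
```

With a limit of four accumulators, "quit" returns only the partial scores from list a, whereas "continue" still completes the scores of d1, d4, d7 and d8 but never creates an accumulator for d9.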

12.2 Fast Top-k Search

Top-k aggregation query over relation R (Item, A1, ..., Am):
    Select Item, s(R1.A1, ..., Rm.Am) As Aggr
    From Outer Join R1, …, Rm
    Order By Aggr Limit k
with monotone s: $(\forall i: x_i \le x_i') \Rightarrow s(x_1 \dots x_m) \le s(x_1' \dots x_m')$
(example: item is doc, attributes are terms, attribute values are scores)

• Precompute per-attribute (index) lists sorted in descending attribute-value order (score-ordered, impact-ordered)
• Scan lists by sorted access (SA) in round-robin manner
• Perform random accesses (RA) by Item when convenient
• Compute aggregation s incrementally in accumulators
• Stop when the threshold test guarantees the correct top-k (or when heuristics indicate a "good enough" approximation)

Simple & elegant, adaptable & extensible to distributed systems.

Following R. Fagin: Optimal aggregation algorithms for middleware, JCSS 66(4), 2003

Threshold Algorithm (TA) [Fagin 01, Güntzer 00, Nepal 99, Buckley 85]

Simple & DB-style; needs only O(k) memory.

Threshold algorithm (TA):
    scan index lists; consider d at pos_i in L_i;
    high_i := s(t_i, d);
    if d ∉ top-k then {
        look up s_ν(d) in all lists L_ν with ν ≠ i;
        score(d) := aggr {s_ν(d) | ν = 1..m};
        if score(d) > min-k then
            add d to top-k and remove min-score d';
        min-k := min {score(d') | d' ∈ top-k}; }
    threshold := aggr {high_ν | ν = 1..m};
    if threshold ≤ min-k then exit;

Data items: d1, …, dn, each with per-term scores, e.g. for d1: s(t1, d1) = 0.7, …, s(tm, d1) = 0.2

Query: q = (t1, t2, t3), k = 2

Index lists (entries as docId: score):
t1: d78: 0.9, d23: 0.8, d10: 0.8, d1: 0.7, d88: 0.2, …
t2: d64: 0.9, d23: 0.6, d10: 0.6, d12: 0.2, d78: 0.1, …
t3: d10: 0.7, d78: 0.5, d64: 0.3, d99: 0.2, d34: 0.1, …

Top-k during scan depth 1:
Rank  Doc  Score      Rank  Doc  Score      Rank  Doc  Score
1     d78  1.5        1     d78  1.5        1     d10  2.1
                      2     d64  1.2        2     d78  1.5

Top-k after scan depths 2, 3 and 4 (unchanged):
Rank  Doc  Score
1     d10  2.1
2     d78  1.5

STOP at scan depth 4.
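The TA pseudocode and its example can be traced with the following Python sketch. It assumes summation as the aggregation function, treats a missing entry as score 0, and simulates random access with per-list dictionaries (these stand in for the index and are not part of the algorithm's O(k) working memory). The names `threshold_algorithm` and `TA_EXAMPLE` are illustrative.

```python
def threshold_algorithm(lists, k):
    """Threshold algorithm with sum aggregation.

    lists: dict term -> list of (doc_id, score) in descending score order.
    Returns (top_k, depth): the k best (doc_id, score) pairs, best first,
    and the scan depth at which the threshold test stopped the scan.
    """
    # Random access: look up a doc's score in any list (0.0 if absent).
    random_access = {t: dict(lst) for t, lst in lists.items()}
    top_k = {}   # doc_id -> total score, at most k entries
    max_depth = max(len(lst) for lst in lists.values())

    def ranked():
        return sorted(top_k.items(), key=lambda item: -item[1])

    for depth in range(max_depth):
        high = {}
        for term, lst in lists.items():        # round-robin sorted access
            if depth >= len(lst):
                high[term] = 0.0
                continue
            doc, score_here = lst[depth]
            high[term] = score_here
            if doc not in top_k:
                total = sum(ra.get(doc, 0.0) for ra in random_access.values())
                if len(top_k) < k:
                    top_k[doc] = total
                else:
                    worst = min(top_k, key=top_k.get)
                    if total > top_k[worst]:
                        del top_k[worst]
                        top_k[doc] = total
        threshold = sum(high.values())
        if len(top_k) == k and threshold <= min(top_k.values()):
            return ranked(), depth + 1
    return ranked(), max_depth

# Index lists from the slide's example.
TA_EXAMPLE = {
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    "t2": [("d64", 0.9), ("d23", 0.6), ("d10", 0.6), ("d12", 0.2), ("d78", 0.1)],
    "t3": [("d10", 0.7), ("d78", 0.5), ("d64", 0.3), ("d99", 0.2), ("d34", 0.1)],
}

if __name__ == "__main__":
    print(threshold_algorithm(TA_EXAMPLE, 2))
```

On the example data the thresholds at scan depths 1 to 4 are 2.5, 1.9, 1.7 and 1.1; only the last one is at or below min-k = 1.5, which matches the stop at scan depth 4.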
