

  1. Chapter 12: Query Processing
     "Computers are useless, they can only give you answers." -- Pablo Picasso
     "You have to think anyway, so why not think big?" -- Donald Trump
     "There are lies, damn lies, and workload assumptions." -- anonymous

  2. Outline
     12.1 Query Processing Algorithms
     12.2 Fast Top-k Search
     12.3 Phrase and Proximity Queries
     12.4 Query Result Diversification
     Loosely following Büttcher/Clarke/Cormack Chapters 5 and 8.6, plus Manning/Raghavan/Schütze Chapters 7 and 9, plus specific literature.

  3. Query Types
     • Conjunctive (i.e., all query terms are required)
     • Disjunctive (i.e., a subset of the query terms is sufficient)
     • Phrase or proximity (i.e., query terms must occur in the right order or close enough to each other)
     • Mixed-mode with negation (e.g., "harry potter" review +movie -book)
     • Combined with ranking of result documents according to score(q,d) = Σ_{t ∈ q} score(t,d), with score(t,d) depending on the retrieval model (e.g., tf*idf)

  4. Indexing with Document-Ordered Lists
     Data items: d1, …, dn with per-term scores, e.g. s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
     Index lists (one per term); index-list entries are stored in ascending order of document identifiers (document-ordered lists):
     t1 → (d1, 0.7) (d10, 0.8) (d23, 0.8) (d78, 0.9) (d88, 0.2) …
     t2 → (d1, 0.2) (d10, 0.6) (d23, 0.6) (d64, 0.8) (d78, 0.1) …
     t3 → (d10, 0.7) (d34, 0.1) (d64, 0.4) (d78, 0.5) (d99, 0.2) …
     Process all queries (conjunctive/disjunctive/mixed) by sequential scan and merge of the posting lists.
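To make the "scan and merge" step concrete, here is a minimal sketch (not from the lecture) of a conjunctive merge over document-ordered posting lists; representing a posting list as a Python list of (doc_id, score) pairs and all function names are my own illustrative assumptions.

```python
# Minimal sketch (illustrative): conjunctive query evaluation by merging
# document-ordered posting lists. Each list is a Python list of
# (doc_id, score) pairs sorted by ascending doc_id.

def conjunctive_merge(posting_lists):
    """Return (doc_id, summed_score) for docs that appear in *all* lists."""
    positions = [0] * len(posting_lists)          # one cursor per list
    results = []
    while all(p < len(l) for p, l in zip(positions, posting_lists)):
        doc_ids = [l[p][0] for p, l in zip(positions, posting_lists)]
        candidate = max(doc_ids)                  # largest doc id under any cursor
        if all(d == candidate for d in doc_ids):  # present in every list
            results.append((candidate,
                            sum(l[p][1] for p, l in zip(positions, posting_lists))))
            positions = [p + 1 for p in positions]
        else:
            # advance every cursor that lags behind the candidate doc id
            for i, (p, l) in enumerate(zip(positions, posting_lists)):
                while p < len(l) and l[p][0] < candidate:
                    p += 1
                positions[i] = p
    return results

# Example with the lists t1, t2, t3 from this slide (doc ids as integers):
t1 = [(1, 0.7), (10, 0.8), (23, 0.8), (78, 0.9), (88, 0.2)]
t2 = [(1, 0.2), (10, 0.6), (23, 0.6), (64, 0.8), (78, 0.1)]
t3 = [(10, 0.7), (34, 0.1), (64, 0.4), (78, 0.5), (99, 0.2)]
print(conjunctive_merge([t1, t2, t3]))   # docs 10 and 78 occur in all three lists
```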

  5. Document-at-a-Time Query Processing
     Document-at-a-Time (DAAT) query processing
     – assumes document-ordered posting lists
     – scans the posting lists for query terms t1, …, t|q| concurrently
     – maintains an accumulator for each candidate result doc:
       acc(d) = Σ_{i: d seen in L(ti)} score(ti, d)
     Example posting lists and resulting accumulators:
     a → (d1, 1.0) (d4, 2.0) (d7, 0.2) (d8, 0.1)
     b → (d4, 1.0) (d7, 2.0) (d8, 0.2) (d9, 0.1)
     c → (d4, 3.0) (d7, 1.0)
     Accumulators: d1: 1.0, d4: 6.0, d7: 3.2, d8: 0.3, d9: 0.1
     – always advances the posting list with the lowest current doc id
     – exploits skip pointers when applicable
     – required memory depends on the number of results to be returned
     – top-k results are kept in a priority queue
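A minimal DAAT sketch under the same illustrative assumptions (doc-id-sorted (doc_id, score) lists, summation as the scoring function, a size-k min-heap for the results); this is my own rendering, not the lecture's code.

```python
# Minimal DAAT sketch (illustrative): disjunctive scoring over document-ordered
# posting lists with a size-k min-heap holding the current top-k results.
import heapq

def daat_topk(posting_lists, k):
    cursors = [0] * len(posting_lists)
    topk = []                                   # min-heap of (score, doc_id)
    while True:
        # next doc to score = smallest doc id under any active cursor
        current = [l[c][0] for c, l in zip(cursors, posting_lists) if c < len(l)]
        if not current:
            break
        doc = min(current)
        acc = 0.0                               # accumulator for this doc
        for i, (c, l) in enumerate(zip(cursors, posting_lists)):
            if c < len(l) and l[c][0] == doc:   # doc occurs in list i
                acc += l[c][1]
                cursors[i] = c + 1              # advance only the matching lists
        if len(topk) < k:
            heapq.heappush(topk, (acc, doc))
        elif acc > topk[0][0]:
            heapq.heapreplace(topk, (acc, doc))
    return sorted(topk, reverse=True)

# Using the example lists a, b, c from this slide:
a = [(1, 1.0), (4, 2.0), (7, 0.2), (8, 0.1)]
b = [(4, 1.0), (7, 2.0), (8, 0.2), (9, 0.1)]
c = [(4, 3.0), (7, 1.0)]
print(daat_topk([a, b, c], k=2))   # expect d4 (6.0) and d7 (3.2) on top
```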

  6. DAAT with Weak And: The WAND Method [Broder et al. 2003]
     Disjunctive (Weak And) query processing
     – assumes document-ordered posting lists with known maxscore(i) values for each ti: maxscore(i) = max_d score(d, ti)
     – while scanning the posting lists, keep track of
       • min-k: the lowest total score in the current top-k results
       • ordered term list: terms sorted by doc id at the current scan positions
       • pivot term: smallest j such that min-k ≤ Σ_{i≤j} maxscore(i)
       • pivot doc: doc id at the current scan position in posting list Lj
     Eliminate docs that cannot become top-k results (maxscore pruning):
     – if the pivot term does not exist (min-k > Σ_i maxscore(i)) then stop
     – else advance the scan positions to pos ≥ id of the pivot doc ("big skip")
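The pivot selection can be sketched as follows; this is an illustrative reading of the WAND rule above, not the original implementation, and the dictionary-based term representation is an assumption.

```python
# Minimal sketch of WAND-style pivot selection and maxscore pruning
# (illustrative; names are my own, not from Broder et al.).
def find_pivot(terms, min_k):
    """terms: one dict {'cur_doc': int, 'maxscore': float} per query term
    with remaining postings. Returns (pivot_doc, pivot_index), or
    (None, None) if no document can still reach the current min-k score."""
    # order terms by the doc id at their current scan position
    ordered = sorted(range(len(terms)), key=lambda i: terms[i]['cur_doc'])
    upper_bound = 0.0
    for i in ordered:
        upper_bound += terms[i]['maxscore']
        if upper_bound >= min_k:              # smallest prefix whose maxscores
            return terms[i]['cur_doc'], i     # can still beat min-k -> pivot
    return None, None                         # maxscore pruning: stop

# Numbers from the example on the next slide:
terms = [{'cur_doc': 101, 'maxscore': 5},
         {'cur_doc': 250, 'maxscore': 4},
         {'cur_doc': 300, 'maxscore': 2},
         {'cur_doc': 600, 'maxscore': 3}]
# Returns (600, 3): pivot doc 600, pivot term index 3 (the slide's term 4),
# since 5 + 4 + 2 = 11 < 12 <= 14. The caller would then advance all scan
# positions to doc id >= 600 ("big skip").
print(find_pivot(terms, min_k=12))
```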

  7. Example: DAAT with the WAND Method [Broder et al. 2003]
     Key invariant: for terms i = 1..|q| with current scan positions cur_i, assume cur_1 = min {cur_i | i = 1..|q|}.
     Then no posting list i contains a doc id between cur_1 and cur_i.
     term i   maxscore_i   cur_i
       1          5         101
       2          4         250
       3          2         300
       4          3         600
     For example, list 4 (cur_4 = 600) cannot contain any doc id in [102, 599].
     Suppose min-k = 12. Then the pivot term is 4 (Σ_{i=1..3} maxscore_i = 11 < min-k, Σ_{i=1..4} maxscore_i = 14 ≥ min-k)
     and the pivot doc id is 600 → all scan positions cur_i can be advanced to 600.

  8. Term-at-a-Time Query Processing
     Term-at-a-Time (TAAT) query processing
     – assumes document-ordered posting lists
     – scans the posting lists for query terms t1, …, t|q| one at a time (possibly in decreasing order of idf values)
     – maintains an accumulator for each candidate result doc; after processing L(tj): acc(d) = Σ_{i≤j} score(ti, d)
     Example posting lists:
     a → (d1, 1.0) (d4, 2.0) (d7, 0.2) (d8, 0.1)
     b → (d4, 1.0) (d7, 2.0) (d8, 0.2) (d9, 0.1)
     c → (d4, 3.0) (d7, 1.0)
     Accumulators after a: d1: 1.0, d4: 2.0, d7: 0.2, d8: 0.1
     Accumulators after b: d1: 1.0, d4: 3.0, d7: 2.2, d8: 0.3, d9: 0.1
     Accumulators after c: d1: 1.0, d4: 6.0, d7: 3.2, d8: 0.3, d9: 0.1
     – memory depends on the number of accumulators maintained
     – TAAT is attractive when scanning many short lists
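A minimal TAAT sketch under the same illustrative assumptions as before (one posting list per term, summation as the score aggregation):

```python
# Minimal TAAT sketch (illustrative): process one posting list at a time and
# grow a dictionary of accumulators; list format as before, (doc_id, score).
def taat_topk(posting_lists, k):
    acc = {}                                   # accumulator per candidate doc
    for postings in posting_lists:             # one term (list) at a time
        for doc, score in postings:
            acc[doc] = acc.get(doc, 0.0) + score
    # rank all accumulated candidates and keep the k best
    return sorted(acc.items(), key=lambda kv: kv[1], reverse=True)[:k]

a = [(1, 1.0), (4, 2.0), (7, 0.2), (8, 0.1)]
b = [(4, 1.0), (7, 2.0), (8, 0.2), (9, 0.1)]
c = [(4, 3.0), (7, 1.0)]
print(taat_topk([a, b, c], k=2))   # same result as DAAT: d4 (6.0), d7 (3.2)
```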

  9. Indexing with Impact-Ordered Lists
     Data items: d1, …, dn with per-term scores, e.g. s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
     Index-list entries are stored in descending order of per-term score impact (impact-ordered lists):
     t1 → (d78, 0.9) (d23, 0.8) (d10, 0.8) (d1, 0.7) (d88, 0.2) …
     t2 → (d64, 0.8) (d23, 0.6) (d10, 0.6) (d1, 0.2) (d78, 0.1) …
     t3 → (d10, 0.7) (d78, 0.5) (d64, 0.4) (d99, 0.2) (d34, 0.1) …
     Aims to avoid having to read entire lists; rather scan only (short) prefixes of the lists.

  10. Greedy Query Processing Framework
     Assume index lists are sorted by tf(ti,dj) or tf(ti,dj)*idl(dj) values; idf values are stored separately.
     Open scan cursors on all m index lists L(i)
     Repeat
       Find pos(g) among the current cursor positions pos(i) (i = 1..m)
       with the largest value of idf(ti)*tf(ti,dj) (or idf(ti)*tf(ti,dj)*idl(dj));
       Update the accumulator of the corresponding doc;
       Increment pos(g);
     Until stopping condition
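A possible rendering of this framework in Python, assuming tf-ordered lists of (doc_id, tf) pairs, separately stored idf values, and a simple posting budget standing in for the unspecified stopping condition; all names are illustrative.

```python
# Minimal sketch of the greedy framework (illustrative): index lists are sorted
# by descending tf; at each step the cursor with the largest idf(t_i)*tf(t_i,d_j)
# contribution is consumed. Each list i is a sequence of (doc_id, tf) pairs,
# idf[i] is stored separately.
import heapq

def greedy_process(lists, idf, budget):
    """Consume at most `budget` postings overall (a stand-in for the stopping
    condition) and return the accumulators built so far."""
    acc = {}
    heap = []                                        # (-contribution, list_id, pos)
    for i, postings in enumerate(lists):
        if postings:
            doc, tf = postings[0]
            heapq.heappush(heap, (-idf[i] * tf, i, 0))
    steps = 0
    while heap and steps < budget:
        neg_contrib, i, pos = heapq.heappop(heap)    # largest idf*tf contribution
        doc, tf = lists[i][pos]
        acc[doc] = acc.get(doc, 0.0) - neg_contrib   # add the contribution
        if pos + 1 < len(lists[i]):                  # increment cursor pos(g)
            next_tf = lists[i][pos + 1][1]
            heapq.heappush(heap, (-idf[i] * next_tf, i, pos + 1))
        steps += 1
    return acc

# Toy example: two tf-ordered lists with separately stored idf values.
lists = [[(10, 5), (3, 4), (7, 1)], [(7, 6), (10, 2)]]
idf = [1.0, 2.0]
print(greedy_process(lists, idf, budget=4))
```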

  11. Stopping Criterion: Quit & Continue Heuristics [Zobel/Moffat 1996]
     For scoring of the form score(q,dj) = Σ_{i=1..m} s(ti, dj)
     with s(ti,dj) ~ tf(ti,dj) * idf(ti) * idl(dj)
     Assume a hash array of accumulators for summing up the score mass of candidate results.
     quit heuristics (with docId-ordered or tf-ordered or tf*idl-ordered index lists):
     • ignore index list L(i) if idf(ti) is below a tunable threshold, or
     • stop scanning L(i) if idf(ti)*tf(ti,dj)*idl(dj) drops below a threshold, or
     • stop scanning L(i) when the number of accumulators becomes too high
     continue heuristics: upon reaching the threshold, continue scanning the index lists, but do not add any new documents to the accumulator array
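As an illustration of the "continue" heuristic (my own sketch, using an accumulator-count limit as the threshold), existing accumulators keep being updated while no new documents are admitted:

```python
# Illustrative sketch of the "continue" heuristic: once the number of
# accumulators reaches a tunable limit, existing accumulators are still
# updated but no new candidate documents are admitted.
# List format as before: (doc_id, score) pairs, one list per term.
def taat_continue(posting_lists, k, max_accumulators):
    acc = {}
    for postings in posting_lists:             # term at a time
        for doc, score in postings:
            if doc in acc:
                acc[doc] += score              # always update known candidates
            elif len(acc) < max_accumulators:
                acc[doc] = score               # admit new docs only below the limit
            # else: "continue" mode -- new documents are ignored
    return sorted(acc.items(), key=lambda kv: kv[1], reverse=True)[:k]
```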

  12. 12.2 Fast Top-k Search
     Top-k aggregation query over relation R(Item, A1, ..., Am):
       Select Item, s(R1.A1, ..., Rm.Am) As Aggr
       From Outer Join R1, …, Rm
       Order By Aggr Limit k
     with monotone s: (∀i: xi ≤ xi') ⇒ s(x1, …, xm) ≤ s(x1', …, xm')
     (example: item is a doc, attributes are terms, attribute values are scores)
     • Precompute per-attribute (index) lists sorted in descending attribute-value order (score-ordered, impact-ordered)
     • Scan lists by sorted access (SA) in round-robin manner
     • Perform random accesses (RA) by Item when convenient
     • Compute the aggregation s incrementally in accumulators
     • Stop when a threshold test guarantees the correct top-k (or when heuristics indicate a "good enough" approximation)
     Simple & elegant, adaptable & extensible to distributed systems.
     Following R. Fagin: Optimal aggregation algorithms for middleware, JCSS 66(4), 2003

  13. Threshold Algorithm (TA) [Fagin 01, Güntzer 00, Nepal 99, Buckley 85]
     Threshold Algorithm (TA): simple & DB-style; needs only O(k) memory
       scan index lists; consider doc d at the current position in L_i; high_i := s(ti, d);
       if d ∉ top-k then {
         look up s_ν(d) in all lists L_ν with ν ≠ i;
         score(d) := aggr {s_ν(d) | ν = 1..m};
         if score(d) > min-k then add d to top-k and remove the min-score doc d';
         min-k := min {score(d') | d' ∈ top-k};
       };
       threshold := aggr {high_ν | ν = 1..m};
       if threshold ≤ min-k then exit;
     Example: query q = (t1, t2, t3), k = 2, impact-ordered index lists:
     t1 → (d78, 0.9) (d23, 0.8) (d10, 0.8) (d1, 0.7) (d88, 0.2) …
     t2 → (d64, 0.9) (d23, 0.6) (d10, 0.6) (d12, 0.2) (d78, 0.1) …
     t3 → (d10, 0.7) (d78, 0.5) (d64, 0.3) (d99, 0.2) (d34, 0.1) …
     Scan depth 1: candidates after random accesses are d78 (1.5), d64 (1.2), d10 (2.1); top-2 = {d10: 2.1, d78: 1.5}
     Scan depths 2–4: top-2 remains {d10: 2.1, d78: 1.5};
     at scan depth 4 the threshold is 0.7 + 0.2 + 0.2 = 1.1 ≤ min-k = 1.5 → STOP!
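A compact TA sketch matching the pseudocode above, with summation as the monotone aggregation and random access simulated by per-list dictionaries; the list format and function name are illustrative assumptions.

```python
# Minimal sketch of the Threshold Algorithm (illustrative): sorted access in
# round-robin over impact-ordered lists of (doc_id, score) pairs sorted by
# descending score, random access via per-list dictionaries.
import heapq

def threshold_algorithm(lists, k):
    lookup = [dict(l) for l in lists]           # random access (RA) by doc id
    topk = []                                   # min-heap of (score, doc_id)
    seen = set()
    for depth in range(max(len(l) for l in lists)):   # sorted access (SA)
        high = []                               # current scores at this scan depth
        for l in lists:
            if depth >= len(l):
                continue
            doc, score = l[depth]
            high.append(score)
            if doc not in seen:
                seen.add(doc)
                # RA into all lists to get the full aggregated score
                total = sum(lu.get(doc, 0.0) for lu in lookup)
                if len(topk) < k:
                    heapq.heappush(topk, (total, doc))
                elif total > topk[0][0]:
                    heapq.heapreplace(topk, (total, doc))
        threshold = sum(high)                   # best score any unseen doc can reach
        if len(topk) == k and threshold <= topk[0][0]:
            break                               # threshold <= min-k: top-k is final
    return sorted(topk, reverse=True)

# Example lists from this slide (k = 2); expected result: d10 (2.1), d78 (1.5),
# with the scan stopping at depth 4.
t1 = [(78, 0.9), (23, 0.8), (10, 0.8), (1, 0.7), (88, 0.2)]
t2 = [(64, 0.9), (23, 0.6), (10, 0.6), (12, 0.2), (78, 0.1)]
t3 = [(10, 0.7), (78, 0.5), (64, 0.3), (99, 0.2), (34, 0.1)]
print(threshold_algorithm([t1, t2, t3], k=2))
```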
