top k queries marcin kwietniewski
play

Top-K Queries Marcin Kwietniewski Agenda Introduction Early - PowerPoint PPT Presentation

Top-K Queries Marcin Kwietniewski Agenda Introduction Early solution Translation approach RankSQL Related work Introduction Consider a large travel database with some spatial information We're interested in


  1. Top-K Queries Marcin Kwietniewski

  2. Agenda • Introduction • Early solution • Translation approach • RankSQL • Related work

  3. Introduction • Consider a large travel database with some spatial information • We're interested in finding 5 hotels that are closest to Pearson airport • Traditionally: SELECT * FROM hotels h, airports a WHERE a.name = 'Pearson' ORDER BY distance ( a.location, h.location) • In application, fetch only 5 results from the cursor • Even though we only need a few results, the database sorts the whole result set

  4. Introduction (contd.) • Often, amount of needed results is much less than the cardinality of the (sorted) tables • Need to support limiting the output size: • Query engine might use the information to enhance performance of queries

  5. New operator • Additional logical operator is needed in order to process Top-K queries • Scan-stop – if input is sorted, operator just keeps first K elements from input stream • Sort-stop – if input not sorted, keep top-K tuples in a priority heap while scanning the input stream • Output cardinality is known Stop(5) Join Hotels Airports

  6. Stop placement • Placing Stop deep in the execution tree may help a lot • Pushing down the Stop operator is tricky: – Other operators might decrease the number of tuples in the stream – We might keep more tuples, but not too many – Optimizer has no notion of risk

  7. Examples Stop(10) Join Join Stop(10) Dept Emp Dept Emp

  8. Conservative stop placement • Never insert a Stop operator in a place where it may discard tuples that should be in the final result • Example: Join Stop(10) Dept Emp • Rule: push below non-reductive predicates • Joins with equality predicate, where there is a guarantee that matching tuples will always be found (eg. foreign key)

  9. Aggressive • Place Stop operator whenever it can provide a beneficial cardinality reduction • Rule: perform Stop as soon as the first expression in the ORDER BY clause can be evaluated • How many tuples to keep? K = later_reduction_factor * final_K * safety_factor • Restart operator needed • Additional rule: Stop should be placed on top of a pipeline (no loss, less risk)

  10. Aggressive example Stop(N) Join Restart Dept Join TEA Stop(2*N) Emp

  11. Performance • Single table queries: – Oder of magnitude improvement vs traditional model for K up to 100 – As K approaches 10.000 improvement is getting small • Joins: – Orders of magnitude improvement for K < 1000 – Comparable to traditional model only for K around 100.000 • Aggressive policy: – When Stops overestimate, response time 3-10 times shorter – Even with underestimates not performing worse than Conservative

  12. Alternative: translation • Query Model: – Query: desired attribute values (ie. query point) – Answer: a set ordered by how closely the tuples match the query (score = distance from query point) • Idea: translate the query into SQL, using statistics on the data • Algorithm: – Use histograms to find the score S such that K tuples will have score > S – Formulate SQL constraints on attributes in order for score to be > S – Execute SQL query – If too few tuples, decrease S and restart

  13. Histogram usage • Bucket histograms: No restarts With restarts

  14. RankSQL • 2005: RankSQL – an approach to fully integrate rank and Top-K concepts with the relational model • Problem to tackle: • Ranking functions are monolithic to the query engine and are evaluated at the root of execution tree • Solution: – Extended algebra – Incremental execution model – Rank-aware optimization

  15. Rank algebra • Query (base relations, selection) augumented with: – A set of rank predicates, such that each tuple has a predicate score for every predicate – Ranking function – a monotonic function of the predicates • Idea: allow query engine to split the evaluation of ranking predicates and interleave them as it does with Boolean predicates • New concept: Rank-relation, characterized by – Ranking function F – Predicates already evaluated • Each tuple has the maximal possible score for predicates not yet evaluated

  16. Rank-relation example • F = p 1 + p 2 + p 3 • p 1 &p 2 already evaluated • Assume max score for p 3 is 2.0

  17. Operators • New operator Rank(predicate p): – Doesn't change tuple membership – Order induced by F with p evaluated • Other operators modified: – Join, Union, Intersection: order of joined tuples determined by predicates already evaluated for any input relation

  18. Equivalence laws • When can we push the operators down? • • • When p is available in R • When p is available in R and S

  19. Query execution model • Operators incrementally output rank relations – Tuples are returned in order • Queries have an explicit bound on number of desired results (K) • Operators need to know when to stop: – Example: Rank(p) operator can output tuple t if it gets a t' such that score t' with max score on p is smaller than score of t with real score on p. • Priority queues used to keep Top-K results

  20. Example • HRJN – hash rank join • NRJN – nested loops rank join • Scans in order of some predicate

  21. Optimization • Extension of the bottom-up System-R type optimizer • Plan enumeration performed in two dimensional space (membership and rank) • Subplans identified by set of relations and set of ranking predicates • Algorithm exponential in number of relations and predicates • Heuristic: – Left-deep (relations) – Greedy choice of a rank operator 1 − card  plan'  card  plan  goodness  rank op = cost  rank op 

  22. Cost Model • Cardinality estimation is important in the cost model • Here, input size depends on the consumer, ie. operators may choose to stop processing input • Use sampling to estimate the score of the K-th tuple in the final result – Run any conventional plan with a few random tuples from input relations – Derive cardinalities of the real results from the sample run • Authors admit this is hard

  23. Conclusion • First fully rank-aware DBMS • It seems that output cardinality estimation is the weak point (order of magnitude errors)

  24. Related work • Follow-up to RankSQL: adaptive optimization, ie. changing the execution plan at runtime • Rank and generalized Group-By operations combined • Rank in uncertain databases • Approximate algorithms for Top-K

  25. Thank you! Q&A

  26. References • Stop operator: – On Saying “Enough Already!” in SQL Michael J. Carey, Donald Kossmann • RankSQL: – RankSQL: Query Algebra and Optimization for Relational Top­k Queries Chengkai Li et al. – Adaptive Rank-Aware Query Optimization in Relational Databases Ihab F. Ilyas et al. • Translations: – Evaluating Top-k Selection Queries Surajit Chaudhuri, Luis Gravano

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend