Top-K Queries Marcin Kwietniewski Agenda Introduction Early - - PowerPoint PPT Presentation

top k queries marcin kwietniewski
SMART_READER_LITE
LIVE PREVIEW

Top-K Queries Marcin Kwietniewski Agenda Introduction Early - - PowerPoint PPT Presentation

Top-K Queries Marcin Kwietniewski Agenda Introduction Early solution Translation approach RankSQL Related work Introduction Consider a large travel database with some spatial information We're interested in


slide-1
SLIDE 1

Top-K Queries Marcin Kwietniewski

slide-2
SLIDE 2

Agenda

  • Introduction
  • Early solution
  • Translation approach
  • RankSQL
  • Related work
slide-3
SLIDE 3

Introduction

  • Consider a large travel database with some spatial information
  • We're interested in finding 5 hotels that are closest to Pearson

airport

  • Traditionally:

SELECT * FROM hotels h, airports a WHERE a.name = 'Pearson' ORDER BY distance ( a.location, h.location)

  • In application, fetch only 5 results from the cursor
  • Even though we only need a few results, the database sorts

the whole result set

slide-4
SLIDE 4

Introduction (contd.)

  • Often, amount of needed results is much less than the

cardinality of the (sorted) tables

  • Need to support limiting the output size:
  • Query engine might use the information to enhance

performance of queries

slide-5
SLIDE 5

New operator

  • Additional logical operator is needed in order to process

Top-K queries

  • Scan-stop – if input is sorted, operator just keeps first K

elements from input stream

  • Sort-stop – if input not sorted, keep top-K tuples in a

priority heap while scanning the input stream

  • Output cardinality is known

Stop(5) Join Airports Hotels

slide-6
SLIDE 6

Stop placement

  • Placing Stop deep in the execution tree may help a lot
  • Pushing down the Stop operator is tricky:

– Other operators might decrease the number of tuples in the stream – We might keep more tuples, but not too many – Optimizer has no notion of risk

slide-7
SLIDE 7

Examples

Stop(10) Join Dept Emp Stop(10) Join Dept Emp

slide-8
SLIDE 8

Conservative stop placement

  • Never insert a Stop operator in a place where it may

discard tuples that should be in the final result

  • Example:
  • Rule: push below non-reductive predicates
  • Joins with equality predicate, where there is a guarantee

that matching tuples will always be found (eg. foreign key)

Stop(10) Join Dept Emp

slide-9
SLIDE 9

Aggressive

  • Place Stop operator whenever it can provide a beneficial

cardinality reduction

  • Rule: perform Stop as soon as the first expression in the

ORDER BY clause can be evaluated

  • How many tuples to keep?

K = later_reduction_factor * final_K * safety_factor

  • Restart operator needed
  • Additional rule: Stop should be placed on top of a pipeline

(no loss, less risk)

slide-10
SLIDE 10

Aggressive example

Stop(2*N) Join Dept Emp Restart TEA Join Stop(N)

slide-11
SLIDE 11

Performance

  • Single table queries:

– Oder of magnitude improvement vs traditional model for K up to 100 – As K approaches 10.000 improvement is getting small

  • Joins:

– Orders of magnitude improvement for K < 1000 – Comparable to traditional model only for K around 100.000

  • Aggressive policy:

– When Stops overestimate, response time 3-10 times shorter – Even with underestimates not performing worse than Conservative

slide-12
SLIDE 12

Alternative: translation

  • Query Model:

– Query: desired attribute values (ie. query point) – Answer: a set ordered by how closely the tuples match the query (score = distance from query point)

  • Idea: translate the query into SQL, using statistics on the

data

  • Algorithm:

– Use histograms to find the score S such that K tuples will have score > S – Formulate SQL constraints on attributes in order for score to be > S – Execute SQL query – If too few tuples, decrease S and restart

slide-13
SLIDE 13

Histogram usage

  • Bucket histograms:

No restarts With restarts

slide-14
SLIDE 14

RankSQL

  • 2005: RankSQL – an approach to fully integrate rank and

Top-K concepts with the relational model

  • Problem to tackle:
  • Ranking functions are monolithic to the query engine and

are evaluated at the root of execution tree

  • Solution:

– Extended algebra – Incremental execution model – Rank-aware optimization

slide-15
SLIDE 15

Rank algebra

  • Query (base relations, selection) augumented with:

– A set of rank predicates, such that each tuple has a predicate score for every predicate – Ranking function – a monotonic function of the predicates

  • Idea: allow query engine to split the evaluation of ranking

predicates and interleave them as it does with Boolean predicates

  • New concept: Rank-relation, characterized by

– Ranking function F – Predicates already evaluated

  • Each tuple has the maximal possible score for predicates

not yet evaluated

slide-16
SLIDE 16

Rank-relation example

  • F = p1+ p2+ p3
  • p1&p2 already evaluated
  • Assume max score for p3

is 2.0

slide-17
SLIDE 17

Operators

  • New operator Rank(predicate p):

– Doesn't change tuple membership – Order induced by F with p evaluated

  • Other operators modified:

– Join, Union, Intersection: order of joined tuples determined by predicates already evaluated for any input relation

slide-18
SLIDE 18

Equivalence laws

  • When can we push the operators down?
  • When p is available in R
  • When p is available in R and S
slide-19
SLIDE 19

Query execution model

  • Operators incrementally output rank relations

– Tuples are returned in order

  • Queries have an explicit bound on number of desired

results (K)

  • Operators need to know when to stop:

– Example: Rank(p) operator can output tuple t if it gets a t' such that score t' with max score on p is smaller than score of t with real score on p.

  • Priority queues used to keep Top-K results
slide-20
SLIDE 20

Example

  • HRJN – hash rank join
  • NRJN – nested loops rank

join

  • Scans in order of some

predicate

slide-21
SLIDE 21

Optimization

  • Extension of the bottom-up System-R type optimizer
  • Plan enumeration performed in two dimensional space

(membership and rank)

  • Subplans identified by set of relations and set of ranking

predicates

  • Algorithm exponential in number of relations and

predicates

  • Heuristic:

– Left-deep (relations) – Greedy choice of a rank operator

goodnessrank op= 1−card  plan'  card  plan cost rank op

slide-22
SLIDE 22

Cost Model

  • Cardinality estimation is important in the cost model
  • Here, input size depends on the consumer, ie. operators

may choose to stop processing input

  • Use sampling to estimate the score of the K-th tuple in the

final result – Run any conventional plan with a few random tuples from input relations – Derive cardinalities of the real results from the sample run

  • Authors admit this is hard
slide-23
SLIDE 23

Conclusion

  • First fully rank-aware DBMS
  • It seems that output cardinality estimation is the weak

point (order of magnitude errors)

slide-24
SLIDE 24

Related work

  • Follow-up to RankSQL: adaptive
  • ptimization, ie. changing the execution

plan at runtime

  • Rank and generalized Group-By operations

combined

  • Rank in uncertain databases
  • Approximate algorithms for Top-K
slide-25
SLIDE 25

Thank you!

Q&A

slide-26
SLIDE 26

References

  • Stop operator:

– On Saying “Enough Already!” in SQL Michael J. Carey, Donald Kossmann

  • RankSQL:

– RankSQL: Query Algebra and Optimization for Relational Top­k Queries Chengkai Li et al. – Adaptive Rank-Aware Query Optimization in Relational Databases Ihab F. Ilyas et al.

  • Translations:

– Evaluating Top-k Selection Queries Surajit Chaudhuri, Luis Gravano