Ranked Query Processing: a) Order-based Paradigm


  1. Ranked Query Processing: a) Order-based Paradigm
     Kevin Chen-Chuan Chang

     Ranking: ordering according to the degree of some fuzzy notions
     • Similarity (or dissimilarity)
     • Relevance
     • Preference

     Query models for the order-based paradigm: when multiple dimensions are available, on the better-than graph
     • Assume the database stores the information of a set of flights. For each flight:
       • Its price
       • Its route (travel time or distance traveled)
     • A user would retrieve all the "interesting" flights.
       • A flight is interesting if and only if no other flight is cheaper and shorter (in route) at the same time.
     • Best-Matches-Only (BMO) query model
       • Retrieve the maximal elements
       • Thus also called maximal vector
     • These maximal elements form the "skyline"!
     • On the better-than graph, how to process BMO?

     [Figure: better-than graph over tuples t1 to t4.]
     [Figure: scatter plot of the points a to n (o is the origin), with distance on the x-axis and price on the y-axis, both from 1 to 10.]

  2. The overall preference combines the dimensions
     • P1: LOWEST(price), from best to worst: a, b, i, {c, h}, g, {d, m}, f, n, {k, e}, l
     • P2: LOWEST(distance), from best to worst: k, {m, i}, {h, n}, l, f, g, d, c, a, {b, e}
     • P := ({price, distance}, <P1 ⊗ P2)
     • BMO: maximal elements of P?
       • Is a maximal?
       • Is b maximal?
       • Is c maximal?

     Skyline Operation
     • Dominance: a point dominates another point if it is no worse in all dimensions and better in at least one dimension.
     • Skyline: the set of all points in the dataset that are not dominated by any other point in the dataset.

     Point   Distance   Price
     a       1          9
     b       2          10
     c       4          8
     d       6          7
     e       9          10
     f       7          5
     g       5          6
     h       4          3
     i       3          2
     k       9          1
     l       10         4
     m       6          2
     n       8          3

     Why is it called "skyline"? (Also called: Pareto curve, maximum vector)
     • What do you see in the Chicago skyline?

     What is skyline: an example
     • Query: SELECT * FROM flights SKYLINE OF price MIN, distance MIN
     • What dominates what?
     • What points constitute the skyline?

     [Figure: scatter plot of the points a to n (o is the origin), with distance on the x-axis and price on the y-axis, both from 1 to 10.]
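The dominance and skyline definitions above can be checked directly against the 13 example points. The following is a minimal Python sketch, not taken from the slides: the names `POINTS`, `dominates`, and `skyline` are illustrative, and each point is stored as the pair of values from the table, with both values minimized.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]

# The 13 example points; both coordinates are to be minimized.
POINTS: Dict[str, Point] = {
    "a": (1, 9), "b": (2, 10), "c": (4, 8), "d": (6, 7), "e": (9, 10),
    "f": (7, 5), "g": (5, 6), "h": (4, 3), "i": (3, 2), "k": (9, 1),
    "l": (10, 4), "m": (6, 2), "n": (8, 3),
}


def dominates(p: Point, q: Point) -> bool:
    """True if p is no worse than q in every dimension and better in at least one."""
    no_worse = all(a <= b for a, b in zip(p, q))
    better = any(a < b for a, b in zip(p, q))
    return no_worse and better


def skyline(points: Dict[str, Point]) -> List[str]:
    """Names of all points that no other point dominates (quadratic scan)."""
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    )


print(skyline(POINTS))
```

On this dataset the sketch reports a, i, and k as the skyline, which is the same set the block-nested-loop trace ends with.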

  3. Skyline Algorithms: we will look at a few examples
     • Block nested loop (BNL)
     • Divide and conquer
     • Bitmap
     • NN

     Block Nested Loop [Börzsönyi et al., 2001]
     • Conceptually: a nested-loop join
       • Joining the table with itself
       • Compare every pair of points to check dominance

     Point   Price   Distance
     a       1       9
     b       2       10
     c       4       8
     d       6       7
     e       9       10
     f       7       5
     g       5       6
     h       4       3
     i       3       2
     k       9       1
     l       10      4
     m       6       2
     n       8       3
     (The slide shows this table twice, once for each side of the self-join.)

     Block Nested Loop: Implementation
     • One-pass scan:
       • Scan the table; maintain a window of current skyline points
       • Return the window at the end
     • Any problems?

     Scan (Price, Distance)   Skyline (window)   Discarded
     a (1, 9)                 a
     b (2, 10)                a                  b
     c (4, 8)                 a, c
     d (6, 7)                 a, c, d
     e (9, 10)                a, c, d            e
     f (7, 5)                 a, c, d, f
     g (5, 6)                 a, c, f, g         d
     h (4, 3)                 a, h               c, f, g
     i (3, 2)                 a, i               h
     k (9, 1)                 a, i, k
     l (10, 4)                a, i, k            l

     Block Nested Loop: Improvements
     What if the window overflows?
     • Multi-pass algorithm
       • Scan the table, write any overflow to a temp file
       • Scan the temp file; repeat till done

     Pass 1:
     Scan (Price, Distance)   Skyline (window)   Discarded   TempFile
     a (1, 9)                 a
     b (2, 10)                a                  b
     c (4, 8)                 a, c
     d (6, 7)                 a, c, d
     e (9, 10)                a, c, d            e
     f (7, 5)                 a, c, d                        f
     g (5, 6)                 a, c, g            d
     h (4, 3)                 a, h               c, g
     i (3, 2)                 a, i               h
     k (9, 1)                 a, i, k
     l (10, 4)                a, i, k            l

     Pass 2: scan the TempFile.
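The multi-pass scheme above (a bounded window plus an overflow temp file) can be sketched in Python as follows. This is a simplified illustration, not the implementation from Börzsönyi et al.: `bnl_skyline` and `window_size` are hypothetical names, the "temp file" is an in-memory list, and a window point is reported at the end of a pass only if it entered the window before the first overflow of that pass (so it has been compared with every remaining candidate). Other surviving window points are carried into the next pass.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, int]

POINTS: List[Point] = [
    (1, 9), (2, 10), (4, 8), (6, 7), (9, 10), (7, 5), (5, 6),
    (4, 3), (3, 2), (9, 1), (10, 4), (6, 2), (8, 3),
]


def dominates(p: Point, q: Point) -> bool:
    """p is no worse than q everywhere and strictly better somewhere (minimizing)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def bnl_skyline(points: List[Point], window_size: int) -> List[Point]:
    """Multi-pass block-nested-loop skyline with a window of at most window_size points."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    result: List[Point] = []
    pending = list(points)                      # input of the current pass
    while pending:
        window: List[Tuple[Point, int]] = []    # (point, position at which it entered)
        temp: List[Point] = []                  # overflow, re-read in the next pass
        first_overflow: Optional[int] = None
        for pos, p in enumerate(pending):
            if any(dominates(w, p) for w, _ in window):
                continue                        # p is dominated: discard it
            # p survives: drop every window point that p dominates
            window = [(w, t) for w, t in window if not dominates(p, w)]
            if len(window) < window_size:
                window.append((p, pos))
            else:
                temp.append(p)                  # window is full: spill p
                if first_overflow is None:
                    first_overflow = pos
        carried: List[Point] = []
        for w, t in window:
            if first_overflow is None or t < first_overflow:
                result.append(w)                # compared with all remaining candidates
            else:
                carried.append(w)               # not yet compared with earlier spills
        pending = carried + temp
    return result


print(sorted(bnl_skyline(POINTS, window_size=3)))
```

With a window of three points the first pass spills f, as in the Pass 1 table, and the final result is the same three points regardless of the window size.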

  4. Block Nested Loop: Improvements [Börzsönyi et al., 2001]
     What if the window overflows?
     • Divide and conquer
       • Divide all the points into several groups such that each group fits in memory
       • Process the groups separately
       • Merge their results
     • Smart merging possible
       • If s3 is not empty, then disregard s2
       • Use s3 to purge s1, s4

     [Figure: the example points split into four quadrants, s1 (upper left), s2 (upper right), s3 (lower left), and s4 (lower right).]

     However, BNL-based approaches are not incremental: we want progressive processing!
     Desired:
     • Compute the first few skyline points almost instantaneously
     • Compute more and more results incrementally

     Bitmap Algorithm: Representation [Tan et al., 2001]
     • For each dimension:
       • n distinct values → n bits
       • A value is encoded as a bitmap in which all no-higher bits are 1

                   d1: price   d2: dist   d3: rating
                   4 3 2 1     3 2 1      2 1
     a (1,1,2)     0 0 0 1     0 0 1      1 1
     b (3,2,1)     0 1 1 1     0 1 1      0 1
     c (4,1,1)     1 1 1 1     0 0 1      0 1
     d (2,3,2)     0 0 1 1     1 1 1      1 1

     Is b = (3, 2, 1) in the skyline?
     • Any point with no-worse values in all dimensions?
       • 0110 & 0101 & 1111 = 0100
     • Any point with a better value in some dimension?
       • 0010 | 0001 | 1001 = 1011
     • Any point satisfying both?
       • 0100 & 1011 = 0000
     • So, is b = (3, 2, 1) in the skyline?
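The bit operations in the b = (3, 2, 1) example can be reproduced with a short Python sketch. It rests on an assumption inferred from the masks shown rather than stated in the slides: bit v of a value is 1 when the value is at least v, so the column for bit v marks the points whose value is v or higher, and "no worse" means "at least as high". The names `bit_slice` and `in_skyline` are illustrative, and the slices are computed on the fly instead of being read from a stored bitmap index.

```python
from typing import Dict, Tuple

# Points in the order a, b, c, d; the leftmost bit of every mask belongs to a.
POINTS: Dict[str, Tuple[int, int, int]] = {
    "a": (1, 1, 2), "b": (3, 2, 1), "c": (4, 1, 1), "d": (2, 3, 2),
}
NAMES = list(POINTS)
ALL_ONES = (1 << len(NAMES)) - 1


def bit_slice(dim: int, value: int) -> int:
    """Mask over the points: bit set where the point's value on `dim` is >= value."""
    bits = 0
    for name in NAMES:
        bits = (bits << 1) | (1 if POINTS[name][dim] >= value else 0)
    return bits


def in_skyline(name: str) -> bool:
    """Bitmap test: no point is no-worse everywhere and better somewhere."""
    p = POINTS[name]
    no_worse = ALL_ONES
    better = 0
    for dim, value in enumerate(p):
        no_worse &= bit_slice(dim, value)      # points with value >= p's value
        better |= bit_slice(dim, value + 1)    # points with a strictly higher value
    return (no_worse & better) == 0


for d, v in enumerate(POINTS["b"]):
    print(format(bit_slice(d, v), "04b"), format(bit_slice(d, v + 1), "04b"))
print(in_skyline("b"))
```

For b the three "no worse" slices are 0110, 0101, and 1111 and the three "better" slices are 0010, 0001, and 1001, matching the masks on the slide. Their combination is 0000, so b is reported as a skyline point.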

  5. The Bitmap Algorithm
     • for each point x in DB:
       • check if x is in the skyline
       • output x if so
     • Incremental indeed; bitmap computation is efficient
     • However, any problem?

     Bitmap Algorithm: Problems
     • Bitmaps are not dynamic structures
       • Hard to update
     • Bitmaps can have prohibitive space overhead
       • What if there are many distinct values?
       • E.g., what about continuous values?
     • No focus of direction at all in the skyline search
       • Depends on which points you check first

     NN: Finding the First Skyline Point [Kossmann et al., 2002]
     • Start by finding the nearest neighbor of the origin
       • I.e., the point p = (x, y) with the smallest $dist(o, p) = \sqrt{x^2 + y^2}$
     • How to find the NN: use an NN algorithm based on an R-tree.
     • This NN point must be in the skyline
       • Otherwise?

     NN: Are there other skyline points?
     • Pruning: what cannot be in the skyline?
       • Those dominated by point i
     • Iteration: what may be in the skyline?
       • Non-dominated regions 2 and 3

     [Figure: two scatter plots of the example points; in the second, the plane around the nearest neighbor i is split into regions 1 to 4.]
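The nearest-neighbor idea above can be sketched in Python for the two-dimensional example. This is only an illustration under stated assumptions: a linear scan stands in for the R-tree nearest-neighbor search, all coordinates are assumed non-negative and minimized, and the names `nearest_to_origin` and `nn_skyline` are hypothetical. Each to-do entry is a pair of exclusive upper bounds describing a region that may still contain skyline points.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]

POINTS: List[Point] = [
    (1, 9), (2, 10), (4, 8), (6, 7), (9, 10), (7, 5), (5, 6),
    (4, 3), (3, 2), (9, 1), (10, 4), (6, 2), (8, 3),
]


def nearest_to_origin(points: List[Point], bound: Point) -> Optional[Point]:
    """Linear-scan stand-in for an R-tree NN query limited to x < bound[0], y < bound[1]."""
    candidates = [p for p in points if p[0] < bound[0] and p[1] < bound[1]]
    if not candidates:
        return None
    return min(candidates, key=lambda p: math.hypot(p[0], p[1]))


def nn_skyline(points: List[Point]) -> List[Point]:
    """Report skyline points one by one, processing to-do regions until none remain."""
    todo: List[Point] = [(math.inf, math.inf)]  # start with the whole plane
    result: List[Point] = []
    while todo:
        bound = todo.pop()
        nn = nearest_to_origin(points, bound)
        if nn is None:
            continue                            # empty region: nothing to report
        result.append(nn)                       # the NN of a region is a skyline point
        # Everything with x >= nn.x and y >= nn.y is dominated and pruned;
        # the two remaining regions lie to the left of nn and below nn.
        todo.append((nn[0], bound[1]))
        todo.append((bound[0], nn[1]))
    return result


print(nn_skyline(POINTS))
```

On the example data the first point reported is i = (3, 2), the nearest neighbor of the origin, and the remaining regions yield a and k. Results appear one at a time, which is the progressive behavior the block-nested-loop approaches lack.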

  6. NN: Iteratively Process All the "ToDo" Regions until All Done

     [Figure: three scatter plots of the example points showing successive to-do regions, numbered 1 to 4, being processed.]

     Order-based rank query evaluation: still ongoing research.
     • How optimal are these algorithms? Further improvement?
     • Scale to high dimensionality?
     • Generalize to non-BMO types of aggregations?

     Thank You!

     Ranking Query Processing: b) Score-based Paradigms
     Kevin Chen-Chuan Chang

  7. Ranking: ordering according to the degree of some fuzzy notions
     • Similarity (or dissimilarity)
     • Relevance
     • Preference

     Relational DBMS scenarios: a brief overview
     Relational DBMS:
     • Value mapping [Chaudhuri and Gravano, 1999]
       • Mapping top-k scores to Boolean selection ranges
       • May have to restart
     • Cardinality mapping [Carey and Kossmann, 1997, 1998]
       • Pushing "limit k" down the query tree
       • May have to restart

     Our Focus: Middleware scenarios

         select h.id, h.address
         from Hotel h
         order by F = min(p1: rating(h.rate), p2: cheap(h.price), p3: safe(h.zip))
         stop after k = 10

     [Figure: a top-k algorithm with F = min(p1, p2, p3) and k = 10 draws on three sources, the RDBMS (p1), hotels.com (p2), and apbs.com (p3), and returns the top results u3: .70, u2: .65, u1: .60.]

     Top-k algorithms rely on accesses to evaluate query scores
     To each predicate p_i:
     • Random access ra_i(u_j): return the score of u_j for p_i, i.e. p_i[u_j]
     • Sorted access sa_i: return some next best object and its score for p_i
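The two access types can be pictured as a small interface around each predicate. The Python sketch below is hypothetical: the class name `Source`, its methods, and the scores used in the usage lines are made up for illustration, and an in-memory dictionary stands in for a remote source such as a web site.

```python
from typing import Dict, List, Optional, Tuple


class Source:
    """Hypothetical wrapper around one predicate p_i with access counters."""

    def __init__(self, scores: Dict[str, float]) -> None:
        self._scores = dict(scores)
        # Objects in descending score order, as sorted access would return them.
        self._order: List[Tuple[str, float]] = sorted(
            self._scores.items(), key=lambda kv: kv[1], reverse=True
        )
        self._cursor = 0
        self.sorted_accesses = 0
        self.random_accesses = 0

    def sorted_access(self) -> Optional[Tuple[str, float]]:
        """sa_i: return the next best object and its score, or None when exhausted."""
        if self._cursor >= len(self._order):
            return None
        self.sorted_accesses += 1
        item = self._order[self._cursor]
        self._cursor += 1
        return item

    def random_access(self, obj: str) -> float:
        """ra_i(u_j): return the score p_i[u_j] of a given object."""
        self.random_accesses += 1
        return self._scores[obj]


# Made-up scores for three objects under one predicate.
p1 = Source({"u1": 0.5, "u2": 0.9, "u3": 0.7})
print(p1.sorted_access())
print(p1.random_access("u1"))
```

Keeping separate counters per source makes it straightforward to charge each access with its own cost once per-source sorted and random costs are known.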

  8. Goal: Minimize the "access" cost
     • Access costs dominate in "middleware" scenarios
     • Cost model: aggregate of all access costs

     Source        Predicate   Sorted access   Random access
     RDBMS         rating      s_1 = 3 ms      r_1 = 20 ms
     hotels.com    cheap       s_2 = 44 ms     r_2 = 466 ms
     apbs.com      safe        s_3 = ∞         r_3 = 700 ms

     An algorithm performs a sequence of accesses: a simple algorithm
     • Sorted access on p1, then random accesses to p2, p3

     [Figure: example with F = min(p1, p2, p3) and k = 1 over the objects a, b, c. The score lists shown are c: .80, b: .45, a: .30; a: .8, b: .90, c: .90; and a: .9, b: .70, c: .95. After sorting, the top result is c: .80.]

     Assumption: Monotonic scoring functions
     • Monotonic:
       • f(x_1, …, x_n) ≤ f(x_1′, …, x_n′) if x_i ≤ x_i′ for all i
     • Why good for query evaluation?
       • Gives bounds for pruning data
       • Gives a simple function "surface" to maximize f
     • Reasonable?
       • Analogy: negation is rarely used in Boolean queries
       • But new "function-inference" front-ends have also found this to be violated in many cases

     The Naïve Algorithm
     • Get the score p_i[u] of every predicate for every object u
       • e.g., by complete sorted accesses
     • Compute F[u] = F(p_1[u], …, p_m[u]) for every u
     • Sort
     • Return the top k
     • Obviously expensive. Can we do better?
       • Note that k is typically small
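The naive algorithm can be written out in a few lines of Python. This is a sketch under an explicit assumption: the three score lists from the example figure are taken to belong to p1, p2, and p3 in the order listed, which the extracted text does not state. The names `P1`, `P2`, `P3`, and `naive_top_k` are illustrative.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed assignment of the figure's score lists to the three predicates.
P1: Dict[str, float] = {"c": 0.80, "b": 0.45, "a": 0.30}
P2: Dict[str, float] = {"a": 0.80, "b": 0.90, "c": 0.90}
P3: Dict[str, float] = {"a": 0.90, "b": 0.70, "c": 0.95}


def naive_top_k(
    predicates: List[Dict[str, float]],
    k: int,
    F: Callable[[Iterable[float]], float] = min,
) -> List[Tuple[str, float]]:
    """Score every object on every predicate, combine with F, and keep the k best."""
    objects = set().union(*predicates)
    # One score lookup per (object, predicate) pair: m * n accesses in total.
    scores = {u: F(p[u] for p in predicates) for u in objects}
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


print(naive_top_k([P1, P2, P3], k=1))
```

Under that assumption the top-1 answer is c with score .80, which agrees with the result shown in the figure. The sketch touches every score of every object even though only one answer is requested, which is the expense the slide points out.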
