
Outline: Ranking and skyline, Top-k algorithms, Skyline algorithms - PowerPoint PPT Presentation



  1. Data Mining: Top-K and Skyline. October 17, 2017

  2. Outline
     • Ranking and skyline
     • Top-k algorithms
     • Skyline algorithms
     • Reconciling top-k and skyline

  3. Ranking queries
     Who is the best NBA player?
     • According to points: Tracy McGrady, score 2003
     • According to rebounds: Shaquille O'Neal, score 760
     • According to points + rebounds: Tracy McGrady, score 2487
     • ……

     Name              Points  Rebounds  Assists  Steals  ……
     Tracy McGrady       2003       484      448     135  ……
     Kobe Bryant         1819       392      398      86  ……
     Shaquille O'Neal    1669       760      200      36  ……
     Yao Ming            1465       669       61      34  ……
     Dwyane Wade         1854       397      520     121  ……
     Steve Nash          1165       249      861      74  ……
     ……                    ……        ……       ……      ……  ……

  4. Ranking queries: Top-k Query. Given a dataset D of n objects, a scoring function F (according to which we rank the objects in D), and k, a top-k query returns the k objects with the best score (rank) in D.

  5. Similarity queries: k-NN Query. Given a dataset D of n objects, a query point q, a distance function F, and k, a k-NN query returns the k objects with the smallest distance to q.
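The k-NN definition can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance as F (the slide does not fix a distance function). The sample points and the query q(41, 125) are borrowed from the patient example on slide 12, and the function name `knn` is hypothetical.

```python
import heapq
import math

def knn(points, q, k):
    """Return the k (id, point) pairs closest to q (Euclidean distance assumed)."""
    return heapq.nsmallest(k, points.items(), key=lambda item: math.dist(item[1], q))

# (age, trestbps) values from the patient example on slide 12.
patients = {"p1": (40, 140), "p2": (39, 120), "p3": (45, 130), "p4": (37, 140)}

nearest = knn(patients, (41, 125), k=2)
print([pid for pid, _ in nearest])
```

With a different distance function (for example, one that rescales age against blood pressure) the answer can change, which is the difficulty the next slide raises.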

  6. Problems of top-k and k-NN. In a top-k or k-NN query, the ranking/distance function F as well as the number of answers k must be provided by the user. In many cases it is difficult to define a meaningful ranking/distance function, especially when the attributes have different semantics (e.g., find the cheapest hotel closest to the beach).

  7. Skyline: Hotel Example

     hotel  distance  price
     p1            4    400
     p2           24    380
     p3           14    340
     p4           36    300
     p5           26    280
     p6            8    260
     p7           40    200
     p8           20    180
     p9           34    140
     p10          28    120
     p11          16     60

     [Figure: scatter plot of hotels p1 to p11; x-axis: distance to the destination (ticks 10 to 40), y-axis: price (ticks 100 to 400)]
     Skyline Computation: Challenges and Opportunities

  8. Skyline: Hotel Example (same hotel table and price vs. distance plot as on slide 7)

  9. Skyline: Hotel Example (same hotel table and price vs. distance plot as on slide 7)

  10. Skyline: Hotel Example

      hotel  distance  price  0.75*distance + 0.25*price/10
      p1            4    400  13
      p2           24    380  27.5
      p3           14    340  19
      p4           36    300  34.5
      p5           26    280  26.5
      p6            8    260  12.5
      p7           40    200  35
      p8           20    180  19.5
      p9           34    140  29
      p10          28    120  24
      p11          16     60  13.5

      [Figure: scatter plot of hotels p1 to p11; x-axis: distance to the destination, y-axis: price]
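The score column above can be reproduced directly from the formula in the table header. The sketch below assumes that a lower score is better (closer and cheaper); the names `score` and `top3` are illustrative only.

```python
# (distance, price) per hotel, copied from the hotel table.
hotels = {
    "p1": (4, 400), "p2": (24, 380), "p3": (14, 340), "p4": (36, 300),
    "p5": (26, 280), "p6": (8, 260), "p7": (40, 200), "p8": (20, 180),
    "p9": (34, 140), "p10": (28, 120), "p11": (16, 60),
}

def score(distance, price):
    """Weighted score from the slide: 0.75*distance + 0.25*price/10."""
    return 0.75 * distance + 0.25 * price / 10

# Rank ascending, assuming smaller scores are preferred.
ranked = sorted(hotels, key=lambda h: score(*hotels[h]))
top3 = ranked[:3]
print([(h, score(*hotels[h])) for h in top3])
```

Changing the two weights changes the ranking, so the result depends on a choice the user has to make up front.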

  11. Skyline: Hotel Example (hotel table and plot as on slide 7)
      Definition (Skyline). Let P be a dataset of n points in d-dimensional space, and let p and p′ be two different points in P. We say p dominates p′ if p[i] ≤ p′[i] for all i, and p[i] < p′[i] for at least one i. The skyline points are those points that are not dominated by any other point in P.
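The dominance test and the skyline definition translate almost word for word into code. The following is a quadratic-time sketch for illustration, not one of the skyline algorithms the deck covers later; the function names are hypothetical and smaller values are assumed to be better in every dimension.

```python
def dominates(p, q):
    """p dominates q: p[i] <= q[i] for all i and p[i] < q[i] for at least one i."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Return the points that no other point dominates (naive O(n^2) check)."""
    return {
        name: p
        for name, p in points.items()
        if not any(dominates(other, p) for other in points.values())
    }

# (distance, price) per hotel, copied from the hotel table.
hotels = {
    "p1": (4, 400), "p2": (24, 380), "p3": (14, 340), "p4": (36, 300),
    "p5": (26, 280), "p6": (8, 260), "p7": (40, 200), "p8": (20, 180),
    "p9": (34, 140), "p10": (28, 120), "p11": (16, 60),
}

print(sorted(skyline(hotels)))
```

On the hotel data this leaves p1, p6 and p11: every other hotel is both farther away and more expensive than at least one of them.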

  12. Skyline Queries: Patient Similarity Search Example
      Table: Sample of heart disease dataset. (a) Original data.

      ID  age  trestbps
      p1   40       140
      p2   39       120
      p3   45       130
      p4   37       140

      Query point: q(41, 125)
      [Figure: p1 to p4 and q plotted with age on the x-axis (ticks 35 to 45) and trestbps on the y-axis (ticks 110 to 140)]

  13. Motivating Example: Skyline Queries
      Table: Sample of heart disease dataset.

      (a) Original data.         (b) Mapped data.
      ID  age  trestbps          ID  age  trestbps
      p1   40       140          t1   42       140
      p2   39       120          t2   43       130
      p3   45       130          t3   45       130
      p4   37       140          t4   45       140

      Query point: q(41, 125)
      [Figure: p1 to p4, t1 to t4 and q plotted with age on the x-axis (ticks 35 to 45) and trestbps on the y-axis (ticks 110 to 140)]

  14. Motivating Example: Skyline Queries (same tables, query point q(41, 125) and plot as on slide 13)
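The mapped table is consistent with reflecting each point around the query, t[i] = q[i] + |p[i] − q[i]|. That rule is inferred here from the numbers and is not stated on the slides, so treat it as an assumption. Under it, the skyline relative to q can be sketched as an ordinary skyline over the mapped points:

```python
def map_to_query(p, q):
    """Assumed mapping inferred from the table: t[i] = q[i] + |p[i] - q[i]|."""
    return tuple(qi + abs(pi - qi) for pi, qi in zip(p, q))

def dominates(p, q):
    """p dominates q: no larger in any dimension, strictly smaller in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

# (age, trestbps) values from table (a) and the query point q.
patients = {"p1": (40, 140), "p2": (39, 120), "p3": (45, 130), "p4": (37, 140)}
q = (41, 125)

mapped = {pid: map_to_query(p, q) for pid, p in patients.items()}
skyline_ids = [
    pid for pid, t in mapped.items()
    if not any(dominates(other, t) for other in mapped.values())
]
print(mapped)
print(skyline_ids)
```

Here p2 maps to (43, 130), which dominates the mapped p3 and p4, while p1 stays in the result because it is closest to q in age.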

  15. Skyline Applications
      • Recommendation: recommend phones that are as cheap as possible, with as much memory as possible, and as light as possible
      • Aggregation/integration: rank results from multiple search engines by relevance score
      • Preprocessing for top-k: the skyline contains all candidates for top-1

  16. Skyline for Top-1 (hotel table and price vs. distance plot as on slide 7)

  17. What about Top-K? (hotel table and price vs. distance plot as on slide 7)

  18. Skyline for Top-K (hotel table and plot as on slide 7, with a "Lowest Price" annotation)
      • Skyline: Pareto top-1 points
      • Group skyline: Pareto top-k groups

  19. Group skyline definition: Dominance (hotel table and plot as on slide 7)
      Definition (G-Skyline). We say group G dominates group G′, denoted by G ≺_g G′, if we can find two permutations of the k points for G and G′, G = {p_u1, p_u2, ..., p_uk} and G′ = {p′_v1, p′_v2, ..., p′_vk}, such that p_ui ⪯ p′_vi for all i (1 ≤ i ≤ k) and p_ui ≺ p′_vi for at least one i. The k-point G-Skyline consists of those groups with k points that are not g-dominated by any other group of the same size.
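The group dominance test can be checked by brute force for small k. The sketch below fixes the order of G and tries every permutation of G′, which is enough to cover all pairings; it costs k! comparisons and is meant only to illustrate the definition, not as a practical G-Skyline algorithm. All function names are hypothetical.

```python
from itertools import permutations

def dominates_or_equal(p, q):
    """p ⪯ q: p is no larger than q in every dimension."""
    return all(a <= b for a, b in zip(p, q))

def dominates(p, q):
    """p ≺ q: p ⪯ q and p is strictly smaller in at least one dimension."""
    return dominates_or_equal(p, q) and any(a < b for a, b in zip(p, q))

def g_dominates(G, H):
    """True if some pairing of G with H has p ⪯ p' everywhere and p ≺ p' at least once."""
    if len(G) != len(H):
        raise ValueError("groups must have the same size")
    for perm in permutations(H):
        pairs = list(zip(G, perm))
        if all(dominates_or_equal(p, q) for p, q in pairs) and any(
            dominates(p, q) for p, q in pairs
        ):
            return True
    return False

# (distance, price) values for a few hotels from the hotel table.
p1, p3, p6, p8, p11 = (4, 400), (14, 340), (8, 260), (20, 180), (16, 60)

print(g_dominates([p6, p11], [p3, p8]))   # p6 ≺ p3 and p11 ≺ p8
print(g_dominates([p1, p6], [p6, p11]))   # no valid pairing exists
```

Note that a group can be in the G-Skyline without dominating anything: the pairs {p1, p6} and {p6, p11} are incomparable in both directions.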

  20. Hotel Example (hotel table and plot as on slide 7, with a "Lowest Price" annotation)

  21. Outline
      • Ranking and skyline
      • Top-k algorithms
      • Skyline algorithms
      • Reconciling top-k and skyline

  22. Introduction: naïve methods
      Top-k processing
      • Apply the ranking function F to all objects
      • Unsorted: linearly scan all objects (online)
      • Sorted list: sort all objects (offline)
      • Priority queue: build the queue (offline), remove the top-k (online)
      • Offline computation needs to know the scoring function!
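Two of the naïve strategies above can be sketched side by side. The example assumes F = points + rebounds and uses the NBA rows from slide 3; `topk_scan` and `topk_sorted` are illustrative names. The scan variant keeps a bounded min-heap of size k, which is one common way to implement the online linear scan.

```python
import heapq

# (points, rebounds, assists, steals) per player, from slide 3.
players = {
    "Tracy McGrady": (2003, 484, 448, 135),
    "Kobe Bryant": (1819, 392, 398, 86),
    "Shaquille O'Neal": (1669, 760, 200, 36),
    "Yao Ming": (1465, 669, 61, 34),
    "Dwyane Wade": (1854, 397, 520, 121),
    "Steve Nash": (1165, 249, 861, 74),
}

def F(stats):
    """Assumed scoring function: points + rebounds."""
    points, rebounds, _assists, _steals = stats
    return points + rebounds

def topk_scan(objects, F, k):
    """Online: one pass over unsorted data, keeping the k best in a min-heap."""
    heap = []
    for name, stats in objects.items():
        item = (F(stats), name)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)

def topk_sorted(objects, F, k):
    """Offline sort by F (requires knowing F in advance), then read the first k."""
    ranked = sorted(objects, key=lambda name: F(objects[name]), reverse=True)
    return ranked[:k]

print(topk_scan(players, F, 3))
print(topk_sorted(players, F, 3))
```

The sorted variant answers each query in O(k) but must be rebuilt whenever F changes, which is the limitation the last bullet points out.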

  23. Top-k Computation: FA algorithm
      Fagin's Algorithm (FA)
      R. Fagin, A. Lotem, M. Naor. "Optimal Aggregation Algorithms for Middleware". J. Comput. Syst. Sci. 66(4), pp. 614-656, 2003.
      The algorithm is based on two types of accesses:
      • Sorted access on attribute a_i: retrieves the next object in the sorted list of a_i.
      • Random access on attribute a_i: gives the value of the i-th attribute for a specific object identifier.

  24. Top-k Computation
      The database can be considered as an n × m score matrix, storing the score values of every object in every attribute.

      a1      a2      a3      a4      a5
      O3, 99  O1, 91  O1, 92  O3, 74  O3, 67
      O1, 66  O3, 90  O3, 75  O1, 56  O4, 67
      O0, 63  O0, 61  O4, 70  O0, 56  O1, 58
      O2, 48  O4, 07  O2, 16  O2, 28  O2, 54
      O4, 44  O2, 01  O0, 01  O4, 19  O0, 35

      Note that, for each attribute, scores are sorted in descending order.

  25. Top-k Computation: FA algorithm
      Outline of FA
      Step 1:
      • Read attributes from every sorted list using sorted access.
      • Stop when k objects have been seen in common from all lists.
      Step 2:
      • Use random access to find missing scores.
      Step 3:
      • Compute the scores of the seen objects.
      • Return the k highest scored objects.
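The three steps can be sketched as follows, using the score matrix from slide 24. The aggregate is assumed to be the sum of the attribute scores (FA needs a monotone aggregate; the slides shown here do not name one), and random access is simulated with one dictionary per attribute. The name `fagin_topk` and the returned depth are illustrative additions.

```python
def fagin_topk(lists, k, aggregate=sum):
    """Sketch of FA. lists[i] holds (object_id, score) pairs sorted by score, descending."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # simulated random access per attribute
    seen_in = {}                           # object id -> attributes where sorted access saw it
    depth = 0
    # Step 1: sorted access in parallel until k objects were seen in all m lists.
    while depth < len(lists[0]):
        for i, lst in enumerate(lists):
            obj, _score = lst[depth]
            seen_in.setdefault(obj, set()).add(i)
        depth += 1
        if sum(1 for attrs in seen_in.values() if len(attrs) == m) >= k:
            break
    # Steps 2 and 3: random access fills in the missing scores, then aggregate and rank.
    totals = {obj: aggregate(lookup[i][obj] for i in range(m)) for obj in seen_in}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], depth

# Sorted lists a1..a5 from the score matrix on slide 24.
score_lists = [
    [("O3", 99), ("O1", 66), ("O0", 63), ("O2", 48), ("O4", 44)],
    [("O1", 91), ("O3", 90), ("O0", 61), ("O4", 7), ("O2", 1)],
    [("O1", 92), ("O3", 75), ("O4", 70), ("O2", 16), ("O0", 1)],
    [("O3", 74), ("O1", 56), ("O0", 56), ("O2", 28), ("O4", 19)],
    [("O3", 67), ("O4", 67), ("O1", 58), ("O2", 54), ("O0", 35)],
]

top2, depth = fagin_topk(score_lists, k=2)
print(top2, depth)
```

For k = 2 the sorted-access phase stops after three rows: O3 has appeared in every list by row 2 and O1 by row 3, so O2 is never touched.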
