Data Mining: Top-K and Skyline
October 17, 2017
Outline
- Ranking and skyline
- Top-k algorithms
- Skyline algorithms
- Reconciling top-k and skyline
Ranking queries
Who is the best NBA player?
Name              Points  Rebounds  Assists  Steals  ……
Tracy McGrady     2003    484       448      135     ……
Kobe Bryant       1819    392       398      86      ……
Shaquille O'Neal  1669    760       200      36      ……
Yao Ming          1465    669       61       34      ……
Dwyane Wade       1854    397       520      121     ……
Steve Nash        1165    249       861      74      ……
……                ……      ……        ……       ……      ……
According to points: Tracy McGrady, score 2003
According to rebounds: Shaquille O'Neal, score 760
According to points+rebounds: Tracy McGrady, score 2487
Ranking queries
Top-k Query: Given a dataset D of n objects, a scoring function F (according to which we rank the objects in D) and k, a Top-k query returns the k objects with the best score (rank) in D.
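As a minimal sketch of the definition, the query can be answered naively by scoring every object and keeping the k best. The function name `top_k` and the tuple layout (points, rebounds, assists, steals) are assumptions made for this illustration; the data are the six players listed in the NBA table.

```python
import heapq

# Player statistics from the NBA table: (points, rebounds, assists, steals).
players = {
    "Tracy McGrady": (2003, 484, 448, 135),
    "Kobe Bryant": (1819, 392, 398, 86),
    "Shaquille O'Neal": (1669, 760, 200, 36),
    "Yao Ming": (1465, 669, 61, 34),
    "Dwyane Wade": (1854, 397, 520, 121),
    "Steve Nash": (1165, 249, 861, 74),
}


def top_k(dataset, score, k):
    """Return the k (name, score) pairs with the highest score (hypothetical helper)."""
    scored = ((name, score(attrs)) for name, attrs in dataset.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Scoring function F = points + rebounds, as in the ranking example.
best = top_k(players, lambda a: a[0] + a[1], k=1)
print(best)
```

Changing the scoring function changes the answer, which is why F has to be supplied by the user.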
Similarity queries
k-NN Query: Given a dataset D of n objects, a query point q, a distance function F and k, a k-NN query returns the k objects with the smallest distance to q.
Problems of top-K and k-NN
In a Top-k and k-NN query the ranking/distance function F as well as the number of answers k must be provided by the user. In many cases it is difficult to define a meaningful ranking/distance function, especially when the attributes have different semantics (e.g., find the cheapest hotel closest to the beach).
Skyline: Hotel Example
hotel  distance  price
p1     4         400
p2     24        380
p3     14        340
p4     36        300
p5     26        280
p6     8         260
p7     40        200
p8     20        180
p9     34        140
p10    28        120
p11    16        60
Figure: scatter plot of hotels p1-p11 (x-axis: distance to the destination, y-axis: price).
Skyline Computation: Challenges and Opportunities
Skyline: Hotel Example
hotel  distance  price  0.75*distance + 0.25*price/10
p1     4         400    13
p2     24        380    27.5
p3     14        340    19
p4     36        300    34.5
p5     26        280    26.5
p6     8         260    12.5
p7     40        200    35
p8     20        180    19.5
p9     34        140    29
p10    28        120    24
p11    16        60     13.5
Figure: scatter plot of hotels p1-p11 (x-axis: distance to the destination, y-axis: price).
Skyline: Hotel Example
Figure: hotel table and scatter plot as above (x-axis: distance to the destination, y-axis: price), with p1, p6 and p11 set apart as the skyline points.
Definition (Skyline). Given a dataset P of n points in d-dimensional space, let p and p' be two different points in P. p dominates p' if p[i] ≤ p'[i] for all i and p[i] < p'[i] for at least one i. The skyline points are those points that are not dominated by any other point in P.
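The definition translates almost literally into code. The sketch below uses the hotel table, where smaller distance and smaller price are both better; the names `dominates` and `skyline` are chosen for this illustration only.

```python
# Hotels from the example: name -> (distance, price); smaller is better in both.
hotels = {
    "p1": (4, 400), "p2": (24, 380), "p3": (14, 340), "p4": (36, 300),
    "p5": (26, 280), "p6": (8, 260), "p7": (40, 200), "p8": (20, 180),
    "p9": (34, 140), "p10": (28, 120), "p11": (16, 60),
}


def dominates(p, q):
    """p dominates q: p[i] <= q[i] for all i and p[i] < q[i] for at least one i."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Names of the points that are not dominated by any other point."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }


print(sorted(skyline(hotels)))

# The hotel that minimizes the weighted score 0.75*distance + 0.25*price/10
# is itself a skyline point.
best = min(hotels, key=lambda h: 0.75 * hotels[h][0] + 0.25 * hotels[h][1] / 10)
print(best)
```

The weighted score from the earlier table is only one of many possible preference functions; the skyline is computed without choosing any of them.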
Skyline Queries: Patient Similarity Search Example
ID  age  trestbps
p1  40   140
p2  39   120
p3  45   130
p4  37   140
Table: Sample of heart disease dataset (original data).
Figure: points p1-p4 and query point q (x-axis: age, y-axis: trestbps).
Query point: q(41,125)
Motivating Example: Skyline Queries
(a) Original data.        (b) Mapped data.
ID  age  trestbps         ID  age  trestbps
p1  40   140              t1  42   140
p2  39   120              t2  43   130
p3  45   130              t3  45   130
p4  37   140              t4  45   140
Table: Sample of heart disease dataset.
Figure: original points p1-p4, mapped points t1-t4 and query point q (x-axis: age, y-axis: trestbps).
Query point: q(41,125).
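The mapped table is consistent with reflecting every point into the quadrant above and to the right of q, i.e. t[i] = q[i] + |p[i] − q[i]|. That formula is inferred from the numbers in the table rather than stated on the slide, so the sketch below should be read as an assumption; `map_to_query` and `dynamic_skyline` are hypothetical names.

```python
patients = {"p1": (40, 140), "p2": (39, 120), "p3": (45, 130), "p4": (37, 140)}
q = (41, 125)


def map_to_query(p, q):
    """Map p to q + |p - q| per attribute (assumed mapping, inferred from the table)."""
    return tuple(qi + abs(pi - qi) for pi, qi in zip(p, q))


def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def dynamic_skyline(points, q):
    """Skyline of the mapped points: the patients not dominated in closeness to q."""
    mapped = {name: map_to_query(p, q) for name, p in points.items()}
    return {
        name
        for name, t in mapped.items()
        if not any(dominates(u, t) for other, u in mapped.items() if other != name)
    }


mapped = {name: map_to_query(p, q) for name, p in patients.items()}
print(mapped)
print(sorted(dynamic_skyline(patients, q)))
```

After the mapping, dominance compares per-attribute differences from q, so the resulting skyline holds the patients that are most similar to q without fixing a single distance function.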
Skyline
Applications
- Recommendation: recommend phones that are as cheap as possible, with as much memory as possible, and as light as possible
- Aggregation/integration: rank results from multiple search engines with relevance scores
- Preprocessing for top-k: the skyline contains all candidates for top-1
Skyline for Top-1
Figure: hotel table and scatter plot as above (x-axis: distance to the destination, y-axis: price), with skyline points p1, p6 and p11 set apart.
What about Top-K?
Figure: hotel table and scatter plot as above (x-axis: distance to the destination, y-axis: price), with skyline points p1, p6 and p11 set apart.
Skyline for TopK
Figure: hotel table and scatter plot as above (x-axis: distance to the destination, y-axis: price), with skyline points p1, p6 and p11 set apart and an annotation "Lowest Price".
- Skyline: Pareto top-1 points
- Group skyline: Pareto top-k groups
Group skyline definition: Dominance
Definition (G-Skyline). We say group $G$ dominates group $G'$, denoted by $G \prec_g G'$, if we can find two permutations of the $k$ points for $G$ and $G'$, $G = \{p_{u_1}, p_{u_2}, \ldots, p_{u_k}\}$ and $G' = \{p'_{v_1}, p'_{v_2}, \ldots, p'_{v_k}\}$, such that $p_{u_i} \preceq p'_{v_i}$ for all $i$ ($1 \le i \le k$) and $p_{u_i} \prec p'_{v_i}$ for at least one $i$. The $k$-point G-Skyline consists of those groups with $k$ points that are not g-dominated by any other group of the same size.
Outline
- Ranking and skyline
- Top-k algorithms
- Skyline algorithms
- Reconciling top-k and skyline
Introduction – naïve methods
Top-k processing
- Apply the ranking function F to all objects
- Unsorted: linearly scan all objects (online)
- Sorted list: sort all objects (offline)
- Priority queue: build the queue (offline), remove the top-k (online)
Offline computation needs to know the scoring function!
Top-k Computation – FA algorithm
Fagin’s Algorithm (FA)
- R. Fagin, Amnon Lotem, Moni Naor. “Optimal Aggregation Algorithms for Middleware”. J. Comput. Syst. Sci. 66(4), pp. 614-656, 2003.
The algorithm is based on two types of accesses:
- Sorted access on attribute ai: retrieves the next object in the sorted list of ai
- Random access on attribute ai: gives the value of the i-th attribute for a specific object identifier
Top-k Computation
The database can be considered as an n x m score matrix, storing the score values of every object in every attribute.
a1      a2      a3      a4      a5
O3, 99  O1, 91  O1, 92  O3, 74  O3, 67
O1, 66  O3, 90  O3, 75  O1, 56  O4, 67
O0, 63  O0, 61  O4, 70  O0, 56  O1, 58
O2, 48  O4, 07  O2, 16  O2, 28  O2, 54
O4, 44  O2, 01  O0, 01  O4, 19  O0, 35
Note that, for each attribute, scores are sorted in descending order.
Top-k Computation – FA algorithm
Outline of FA
Step 1:
- Read attributes from every sorted list using sorted access.
- Stop when k objects have been seen in common from all lists.
Step 2:
- Use random access to find missing scores.
Step 3:
- Compute the scores of the seen objects.
- Return the k highest scored objects.
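The three steps can be sketched as follows, assuming the scoring function is the plain sum of the attribute scores (as in the running example) and that every object appears in every list. `LISTS` transcribes the five sorted lists of the score matrix; `fagin` is a name chosen for this illustration, and random access is simulated with one dictionary per list.

```python
# Sorted lists a1..a5 from the score matrix, each in descending score order.
LISTS = [
    [("O3", 99), ("O1", 66), ("O0", 63), ("O2", 48), ("O4", 44)],  # a1
    [("O1", 91), ("O3", 90), ("O0", 61), ("O4", 7), ("O2", 1)],    # a2
    [("O1", 92), ("O3", 75), ("O4", 70), ("O2", 16), ("O0", 1)],   # a3
    [("O3", 74), ("O1", 56), ("O0", 56), ("O2", 28), ("O4", 19)],  # a4
    [("O3", 67), ("O4", 67), ("O1", 58), ("O2", 54), ("O0", 35)],  # a5
]


def fagin(lists, k):
    """Sketch of FA with F = sum; returns (top-k, rounds of sorted access, totals)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # simulated random access per attribute
    seen_in = {}                           # object -> indices of lists it was seen in
    depth = 0
    # Step 1: sorted access in parallel until k objects were seen in all lists.
    while depth < len(lists[0]):
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            seen_in.setdefault(obj, set()).add(i)
        depth += 1
        if sum(1 for s in seen_in.values() if len(s) == m) >= k:
            break
    # Steps 2 and 3: fetch all attribute values of every seen object, then score.
    totals = {obj: sum(lookup[i][obj] for i in range(m)) for obj in seen_in}
    top = sorted(totals.items(), key=lambda e: -e[1])[:k]
    return top, depth, totals


top, depth, totals = fagin(LISTS, k=2)
print(top, depth, totals)
```

For simplicity the sketch looks up every attribute of every seen object; a real implementation would issue random accesses only for the scores that sorted access has not already delivered.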
Top-k Computation – FA algorithm
Step 1:
- Read attributes from every sorted list using sorted access
- Stop when k objects have been seen in common from all lists
Scores seen so far under sorted access:
id  a1  a2  a3  a4  a5
O3  99  90  75  74  67
O1  66  91  92  56  58
O4          70      67
O0  63  61      56
No more sorted accesses are required, since we have determined k=2 objects contained in all lists (objects O1 and O3).
Top-k Computation – FA algorithm
Step 2:
- Use random access to find missing scores
id  a1  a2  a3  a4  a5
O3  99  90  75  74  67
O1  66  91  92  56  58
O4  44  07  70  19  67
O0  63  61  01  56  35
All missing values for seen objects have been determined. Therefore, no more random accesses are required.
Top-k Computation – FA algorithm
Step 3:
- Compute the scores of the seen objects.
- Return the k highest scored objects.
id  a1  a2  a3  a4  a5  Total Score
O3  99  90  75  74  67  405
O1  66  91  92  56  58  363
O4  44  07  70  19  67  207
O0  63  61  01  56  35  216
Top-2: O3 (405) and O1 (363)
Threshold Algorithm (TA)
- Idea: sometimes we can stop before seeing k objects in every list
- Compute the score for all objects seen so far
- Compute the upper bound of how good a score of an unseen object can be
  - Aggregate the minimal (worst) score seen so far in all lists
- If there are already k objects at or above the threshold, stop
Top-k Computation – TA algorithm
Outline of TA
Step 1:
- Read attributes from every sorted list using sorted access.
- For each object seen x:
  - Use random access to find missing values.
  - Determine the score F(x) of object x.
  - If the object is among the top-k, keep it in the buffer.
Step 2:
- Determine threshold value T based on objects currently seen under sorted access: T = a1(p) + a2(p) + … + am(p), where p is the current sorted access position.
- If there are k objects with total scores >= T then STOP and report answers; else p = p + 1 and GOTO Step 1.
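A minimal sketch of the two steps, again assuming F is the sum of the attribute scores. `LISTS` repeats the sorted lists of the score matrix and `threshold_algorithm` is a hypothetical name. Unlike the outline, the sketch keeps the scores of all seen objects rather than a buffer of size k, which does not change the answer.

```python
import heapq

# Sorted lists a1..a5 from the score matrix, each in descending score order.
LISTS = [
    [("O3", 99), ("O1", 66), ("O0", 63), ("O2", 48), ("O4", 44)],  # a1
    [("O1", 91), ("O3", 90), ("O0", 61), ("O4", 7), ("O2", 1)],    # a2
    [("O1", 92), ("O3", 75), ("O4", 70), ("O2", 16), ("O0", 1)],   # a3
    [("O3", 74), ("O1", 56), ("O0", 56), ("O2", 28), ("O4", 19)],  # a4
    [("O3", 67), ("O4", 67), ("O1", 58), ("O2", 54), ("O0", 35)],  # a5
]


def threshold_algorithm(lists, k):
    """Sketch of TA with F = sum; returns (top-k, final position p, threshold T)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # simulated random access per attribute
    scores = {}                            # full score F(x) of every object seen
    top, threshold = [], None
    for depth in range(len(lists[0])):
        # Step 1: one sorted access per list, random access for the other values.
        for lst in lists:
            obj, _ = lst[depth]
            if obj not in scores:
                scores[obj] = sum(lookup[i][obj] for i in range(m))
        # Step 2: threshold T from the scores at the current position p.
        threshold = sum(lst[depth][1] for lst in lists)
        top = heapq.nlargest(k, scores.items(), key=lambda e: e[1])
        if len(top) == k and top[-1][1] >= threshold:
            break
    return top, depth + 1, threshold


result = threshold_algorithm(LISTS, k=2)
print(result)
```

The stopping test only needs the k-th best score seen so far, so a bounded min-heap of size k would serve equally well as the buffer.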
Top-k Computation – TA algorithm (example, k=2)
Step 1 (p=1):
id  a1  a2  a3  a4  a5  F
O3  99  90  75  74  67  405
O1  66  91  92  56  58  363
BUFFER: (O3,405) (O1,363)
Step 2 (p=1):
T = 99+91+92+74+67 = 423
There are NO k objects with a score >= T, GOTO Step 1.
Step 1 (second execution, p=2):
id  a1  a2  a3  a4  a5  F
O3  99  90  75  74  67  405
O1  66  91  92  56  58  363
O4  44  07  70  19  67  207
BUFFER: (O3,405) (O1,363)
Step 2 (second execution, p=2):
T = 66+90+75+56+67 = 354
Both objects in the buffer have scores higher than T. STOP and report answers.
Top-k Computation - FA vs TA
- TA sees fewer objects than FA
  - TA stops at least as early as FA
- TA may perform more random accesses than FA
  - In TA, (m-1) random accesses for each object.
  - In FA, random accesses are done at the end, only for missing scores.
No Random Access
- Maintain lower and upper bounds for every object (worst and best possible score)
  - Best is the aggregation of what we have seen + the best we can see from the remaining lists
  - Worst is the aggregation of what we have seen + zeros
- Stop if the best possible score for objects outside the list is less than the k'th Worst in the list
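The bound bookkeeping can be sketched as follows, assuming F is the sum of non-negative attribute scores and that every object appears in every list. `nra` and the `trace` structure are names chosen for this illustration; the stopping test compares the k'th worst score against the best score of every object outside the current top-k, seen or unseen.

```python
import heapq

# Sorted lists a1..a5 from the score matrix, each in descending score order.
LISTS = [
    [("O3", 99), ("O1", 66), ("O0", 63), ("O2", 48), ("O4", 44)],  # a1
    [("O1", 91), ("O3", 90), ("O0", 61), ("O4", 7), ("O2", 1)],    # a2
    [("O1", 92), ("O3", 75), ("O4", 70), ("O2", 16), ("O0", 1)],   # a3
    [("O3", 74), ("O1", 56), ("O0", 56), ("O2", 28), ("O4", 19)],  # a4
    [("O3", 67), ("O4", 67), ("O1", 58), ("O2", 54), ("O0", 35)],  # a5
]


def nra(lists, k):
    """Sketch of NRA with F = sum; returns (top-k with bounds, rounds, trace)."""
    m, n = len(lists), len(lists[0])  # n objects, assuming each is in every list
    seen = {}    # object -> {list index: score seen under sorted access}
    trace = []   # per round: {object: (worst, best)}
    top, bounds = [], {}
    for depth in range(n):
        for i, lst in enumerate(lists):
            obj, score = lst[depth]
            seen.setdefault(obj, {})[i] = score
        frontier = [lst[depth][1] for lst in lists]  # last score read in each list
        bounds = {}
        for obj, known in seen.items():
            worst = sum(known.values())              # unknown scores count as 0
            best = worst + sum(frontier[i] for i in range(m) if i not in known)
            bounds[obj] = (worst, best)
        trace.append(bounds)
        top = heapq.nlargest(k, bounds, key=lambda o: bounds[o][0])
        if len(top) < k:
            continue
        kth_worst = bounds[top[-1]][0]
        rivals = [bounds[o][1] for o in bounds if o not in top]
        if len(seen) < n:                  # unseen objects can reach at most T
            rivals.append(sum(frontier))
        if all(kth_worst >= r for r in rivals):
            break
    return [(o, *bounds[o]) for o in top], depth + 1, trace


top, rounds, trace = nra(LISTS, k=2)
print(top, rounds)
```

On the example data this sketch needs three rounds of sorted access, one more than TA, because without random access the missing a5 score of O1 has to arrive through sorted access.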
Top-k Computation – NRA algorithm
p=1
id  a1  a2  a3  a4  a5
O3  99          74  67
O1      91  92
T = 99+91+92+74+67 = 423
BUFFER (object, worst, best): O3,240,423  O1,183,423
Top-k Computation – NRA algorithm
p=2
id  a1  a2  a3  a4  a5
O3  99  90  75  74  67
O1  66  91  92  56
O4                  67
T = 66+90+75+56+67 = 354
BUFFER (object, worst, best): O3,405,405  O1,305,372  O4,67,354
Stop when k objects have worst scores greater than T
Outline
- Ranking and skyline
- Top-k algorithms
- Skyline algorithms
- Skyline
- Group skyline
- Reconciling top-k and skyline
Dominance
p1 dominates p2. Hence, p1 has a smaller score under any monotone preference function f(x, y).
- f(x, y) is monotone if it increases with both x and y.
Figure: points p1 and p2 in the x-y plane.
Skyline
Figure: points p1-p8 in the x-y plane.
The skyline contains points that are not dominated by others.
Skyline vs. convex hull
Figure: two x-y plots of points p1-p8, one showing the skyline and one showing the convex hull.
- Skyline: contains the top-1 object of any monotone function.
- Convex hull: contains the top-1 object of any linear function.
Skyline computation – naïve methods
Skyline
- For each object, check if it is dominated by any other object
- Return the objects that are not dominated
Complexity?
Skyline Computation
- 2D Scanning algorithm
- Divide and Conquer
- Nearest-Neighbor based
- Branch and Bound Algorithm
2D
- Sort all points by x
- Scan the points one at a time by increasing x-order
- Dominance check: if y is smaller than the smallest y so far (the previous skyline point), add to skyline
- Complexity: sort O(NlgN), scan O(N)
Figure: plane sweep over the x-y plane.
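The plane sweep fits in a few lines. The sketch assumes smaller values are better in both dimensions and breaks ties in x by sorting on (x, y); `skyline_2d` is a name chosen for this illustration, and the input is the (distance, price) column pair of the hotel table.

```python
def skyline_2d(points):
    """2D skyline by plane sweep: sort by x, keep points with a new smallest y."""
    result = []
    smallest_y = float("inf")
    for x, y in sorted(points):      # O(N lg N) sort, ties in x ordered by y
        if y < smallest_y:           # O(1) dominance check per point
            result.append((x, y))
            smallest_y = y
    return result


hotels = [(4, 400), (24, 380), (14, 340), (36, 300), (26, 280), (8, 260),
          (40, 200), (20, 180), (34, 140), (28, 120), (16, 60)]
print(skyline_2d(hotels))
```

Only the most recently added skyline point has to be remembered, because after sorting it has the smallest y among all points with smaller or equal x.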
D & C Algorithm
1. Original data
2. Divide the dataset into 2 parts by the median
3. Compute the skylines S1 and S2
4. Eliminate points in S2 dominated by S1
D&C algorithm
- 2D: O(NlgN)
  T(N) = 2T(N/2) + O(N)
- 3D: O(NlgN)
  merge: solve 2-D by scanning algorithm
  T(N) = 2T(N/2) + O(N)
- General: O(N lg^(k-2) N), average O(NlgN)
  T(N, k) = 2T(N/2, k) + T(N, k-1) + O(N)
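For the 2D case the four steps can be sketched as below. The sketch assumes smaller is better in both dimensions, removes exact duplicates, and splits a list that is already sorted by (x, y) at its middle position, so every point of the left half precedes every point of the right half. `skyline_dc` is a hypothetical name.

```python
def skyline_dc(points):
    """2D divide-and-conquer skyline (sketch); smaller is better in x and y."""
    return _solve(sorted(set(points)))


def _solve(pts):
    if len(pts) <= 1:
        return list(pts)
    mid = len(pts) // 2              # step 2: split at the median position
    s1 = _solve(pts[:mid])           # step 3: skyline of the left half
    s2 = _solve(pts[mid:])           #         skyline of the right half
    # Step 4: every S1 point sorts before every S2 point, so an S2 point
    # survives only if its y is below the smallest y in S1.
    min_y = min(y for _, y in s1)
    return s1 + [(x, y) for x, y in s2 if y < min_y]


hotels = [(4, 400), (24, 380), (14, 340), (36, 300), (26, 280), (8, 260),
          (40, 200), (20, 180), (34, 140), (28, 120), (16, 60)]
print(skyline_dc(hotels))
```

The merge is linear in the sizes of S1 and S2, which matches the recurrence T(N) = 2T(N/2) + O(N) once the initial sort is accounted for separately.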
Nearest Neighbor search
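[Kossmann et al. VLDB 02]
- Find the nearest neighbor; it is a skyline point
- Eliminate points in region IV
- Compute recursively in regions II and III (to-do list)
Figure: the nearest neighbor p1 splits the space into regions I-IV; the recursion continues with p2 and p3.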
Branch-and-bound skyline (BBS)
[Papadias et al. SIGMOD 04]
Assume data are indexed by R-Tree
Figure: points p1-p8 in the unit square (x and y axes from 0 to 1, origin O).
R-trees – structure
Figure: points a-m (x and y axes from 0 to 10) grouped into Minimum Bounding Rectangles (MBRs) E1-E7, and the corresponding R-tree with the root holding E1 and E2.
Introduction to R-trees – range query
Figure: range query on the same R-tree.
Branch-and-bound skyline (BBS)
[Papadias et al. SIGMOD 04]
- Assume data are indexed by R-Tree
- How to branch and bound for fast skyline search and search space elimination
- Key idea: examine the closest MBR
Figure: points p1-p8 in the unit square (x and y axes from 0 to 1, origin O).
BBS Algorithm
- Initialize the skyline set S
- Add all entries of the root to priority queue Q (based on L1 distance of the MBR lower-left corner to the origin)
- Repeat: remove top entry e from Q
  - If e is dominated by S, discard
  - If e is an intermediate entry, add non-dominated children to Q
  - If e is a point, add to S
- Stop when Q is empty
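The loop above can be sketched without a real R-tree by standing in a hand-built hierarchy: a point is a tuple, an intermediate entry is a list of children, and the lower-left corner of an entry is recomputed from its subtree. All of this (`lower_left`, `bbs`, the nested-list tree and the grouping of the hotels) is an assumption for illustration and not the index structure used by Papadias et al.

```python
import heapq
from itertools import count


def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def lower_left(entry):
    """Lower-left MBR corner: the point itself, or the per-axis minimum of a node."""
    if isinstance(entry, tuple):
        return entry
    corners = [lower_left(child) for child in entry]
    return tuple(min(c[i] for c in corners) for i in range(len(corners[0])))


def bbs(root):
    """Branch-and-bound skyline over a nested-list tree (sketch)."""
    skyline, heap, tie = [], [], count()

    def push(entry):
        corner = lower_left(entry)
        if not any(dominates(s, corner) for s in skyline):
            # Priority = mindist, the L1 distance of the corner to the origin.
            heapq.heappush(heap, (sum(corner), next(tie), entry, corner))

    for child in root:
        push(child)
    while heap:
        _, _, entry, corner = heapq.heappop(heap)
        if any(dominates(s, corner) for s in skyline):
            continue                      # dominated by S: discard
        if isinstance(entry, tuple):
            skyline.append(entry)         # a point: add to S
        else:
            for child in entry:           # intermediate entry: expand
                push(child)
    return skyline


# Hotels (distance, price) grouped arbitrarily into nodes for the sketch.
tree = [
    [(4, 400), (8, 260), (14, 340)],
    [[(16, 60), (20, 180)], [(24, 380), (26, 280), (28, 120)]],
    [(34, 140), (36, 300), (40, 200)],
]
print(bbs(tree))
```

Because entries leave the queue in ascending mindist order, a point that is not dominated when it is removed can never be dominated later, so skyline points are reported progressively.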
BBS Algorithm - example
Figure: points a-n (x and y axes from 1 to 10) in leaf nodes N1-N5 and intermediate nodes N6, N7, with R-tree entries e1-e7 under root R.
- Each heap entry keeps the mindist of the MBR.
- Process entries in ascending order of their mindists.
action       heap contents               S
access root  <e7,4><e6,6>
expand e7    <e3,5><e6,6><e5,8><e4,10>
expand e3    <i,5><e6,6><e5,8><e4,10>    {i}
expand e6    <e5,8><e1,9><e4,10>         {i}
remove e5    <e1,9><e4,10>               {i}
expand e1    <a,10><e4,10>               {i,a}
expand e4    <k,10>                      {i,a,k}
BBS Algorithm - performance
BBS performs better than previously proposed skyline algorithms in terms of CPU time and I/O time.
Figure: number of R-tree node accesses vs. dimensionality (source: Papadias et al. TODS 2005).
Outline
- Ranking and skyline
- Top-k algorithms
- Skyline algorithms
- Skyline
- Group skyline
- Reconciling top-k and skyline
Group skyline Computation
Definition (G-Skyline). We say group $G$ dominates group $G'$, denoted by $G \prec_g G'$, if we can find two permutations of the $k$ points for $G$ and $G'$, $G = \{p_{u_1}, p_{u_2}, \ldots, p_{u_k}\}$ and $G' = \{p'_{v_1}, p'_{v_2}, \ldots, p'_{v_k}\}$, such that $p_{u_i} \preceq p'_{v_i}$ for all $i$ ($1 \le i \le k$) and $p_{u_i} \prec p'_{v_i}$ for at least one $i$. The $k$-point G-Skyline consists of those groups with $k$ points that are not g-dominated by any other group of the same size.
Brute-Force Algorithm
Observations
A G-Skyline group cannot have a non-skyline point dominated by a point outside the group.
Figure: hotel table and scatter plot as above (x-axis: distance to the destination, y-axis: price).
Observations
Lemma. A G-Skyline group can have a non-skyline point dominated by a point within the group.
Figure: hotel table and scatter plot as above (x-axis: distance to the destination, y-axis: price), with an annotation "Lowest Price".
Unit Group
Lemma. A point in a G-Skyline group cannot be dominated by a point outside the group.
Lemma. Given a point p, if p is in a G-Skyline group, p’s parents must be included in this G-Skyline group.
Definition (Unit Group). Given a point p, p and its parents form the unit group for p.
Skyline Layers and Directed Skyline Graph
Figure: Skyline layers. Hotels p1-p11 plotted by distance (10-40) and price (100-400), grouped into layer1-layer4.
Figure: Directed skyline graph. Nodes p1-p11 arranged in layer1-layer4.
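Skyline layers can be obtained by peeling: compute the skyline, remove it, and repeat on what is left. The sketch below applies this to the hotel table with a quadratic dominance test per layer; `skyline_layers` is a name chosen for this illustration and the directed edges of the graph are not built here.

```python
hotels = {
    "p1": (4, 400), "p2": (24, 380), "p3": (14, 340), "p4": (36, 300),
    "p5": (26, 280), "p6": (8, 260), "p7": (40, 200), "p8": (20, 180),
    "p9": (34, 140), "p10": (28, 120), "p11": (16, 60),
}


def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline_layers(points):
    """Peel off the skyline repeatedly; layer i is the skyline of what remains."""
    remaining = dict(points)
    layers = []
    while remaining:
        layer = {
            name
            for name, p in remaining.items()
            if not any(dominates(q, p) for other, q in remaining.items() if other != name)
        }
        layers.append(layer)
        for name in layer:
            del remaining[name]
    return layers


for i, layer in enumerate(skyline_layers(hotels), start=1):
    print(f"layer{i}:", sorted(layer))
```

Every point in layer i (for i > 1) is dominated by at least one point of layer i-1, which is what the directed skyline graph records as edges.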
Unit Group
Definition (Unit Group). Given a point p in DSG, p and its parents form the unit group for p.
Figure: Directed skyline graph. Nodes p1-p11 arranged in layer1-layer4.
Verification of G-Skyline
Theorem (Verification of G-Skyline). Given a group G = {p1, p2, ..., pk}, it is a G-Skyline group if its corresponding unit group set S = u1 ∪ u2 ∪ ... ∪ uk contains k points, i.e., |S| = k.
Figure: Directed skyline graph. Nodes p1-p11 arranged in layer1-layer4.
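The theorem suggests a direct check: build the unit group of every member and test whether the union still has k points. The sketch below assumes that the parents of p are all points that dominate p, and it also enumerates all k-subsets as a brute-force use of the check; `parents`, `unit_group`, `is_g_skyline` and `g_skyline_baseline` are hypothetical names.

```python
from itertools import combinations

hotels = {
    "p1": (4, 400), "p2": (24, 380), "p3": (14, 340), "p4": (36, 300),
    "p5": (26, 280), "p6": (8, 260), "p7": (40, 200), "p8": (20, 180),
    "p9": (34, 140), "p10": (28, 120), "p11": (16, 60),
}


def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def parents(name, points):
    """All points that dominate the given point (assumed meaning of 'parents')."""
    return {m for m, q in points.items() if m != name and dominates(q, points[name])}


def unit_group(name, points):
    return {name} | parents(name, points)


def is_g_skyline(group, points):
    """Verification: the union of the members' unit groups has exactly k points."""
    union = set()
    for name in group:
        union |= unit_group(name, points)
    return len(union) == len(set(group))


def g_skyline_baseline(points, k):
    """Brute force: enumerate every k-subset and keep those that verify."""
    return [set(g) for g in combinations(sorted(points), k) if is_g_skyline(g, points)]


print(unit_group("p3", hotels))
print(g_skyline_baseline(hotels, 2))
```

Enumerating all subsets grows combinatorially with k, so this only works on small inputs such as the eleven hotels.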
Finding G-Skyline Groups
Baseline: Enumerate all candidates, and check if the unit groups are in the group
All possible candidates: set enumeration tree
Figure: set enumeration tree over the points, from the empty set {} at level |s| = 0 through {1}, {2}, ..., {Sk} at |s| = 1 and {1, 2}, ..., {k, Sk} at |s| = 2 down to {1, 2, ..., k, ..., Sk} at |s| = Sk; candidate groups are the nodes at level |s| = k.
Point-Wise: Point-wise algorithm with Subtree Pruning and Tail Set Pruning.
Outline
- Ranking and skyline
- Top-k algorithms
- Skyline algorithms
- Skyline
- Group skyline
- Reconciling top-k and skyline