Outline Ranking and skyline Top- k algorithms Skyline algorithms - - PowerPoint PPT Presentation



slide-1
SLIDE 1

October 17, 2017

Data Mining Top-K and Skyline

slide-2
SLIDE 2

2

Outline

  • Ranking and skyline
  • Top-k algorithms
  • Skyline algorithms
  • Reconciling top-k and skyline
slide-3
SLIDE 3

Ranking queries

Who is the best NBA player?

Name              Points  Rebounds  Assists  Steals  ……
Tracy McGrady     2003    484       448      135     ……
Kobe Bryant       1819    392       398      86      ……
Shaquille O'Neal  1669    760       200      36      ……
Yao Ming          1465    669       61       34      ……
Dwyane Wade       1854    397       520      121     ……
Steve Nash        1165    249       861      74      ……
……                ……      ……        ……       ……      ……

  • According to points: Tracy McGrady, score 2003
  • According to rebounds: Shaquille O'Neal, score 760
  • According to points+rebounds: Tracy McGrady, score 2487

slide-4
SLIDE 4

Ranking queries

Top-k Query: Given a dataset D of n objects, a scoring function F (according to which we rank the objects in D) and k, a Top-k query returns the k objects with the best score (rank) in D.
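A minimal sketch of the Top-k query on the six named players from the NBA table, assuming "best" means the highest value of F. The names `PLAYERS` and `top_k` are illustrative and do not come from the slides.

```python
import heapq

# Rows copied from the NBA table: (name, points, rebounds, assists, steals).
PLAYERS = [
    ("Tracy McGrady", 2003, 484, 448, 135),
    ("Kobe Bryant", 1819, 392, 398, 86),
    ("Shaquille O'Neal", 1669, 760, 200, 36),
    ("Yao Ming", 1465, 669, 61, 34),
    ("Dwyane Wade", 1854, 397, 520, 121),
    ("Steve Nash", 1165, 249, 861, 74),
]


def top_k(objects, scoring_function, k):
    """Return the k objects with the highest score, best first."""
    return heapq.nlargest(k, objects, key=scoring_function)


by_points = top_k(PLAYERS, lambda p: p[1], 1)
by_rebounds = top_k(PLAYERS, lambda p: p[2], 1)
by_points_plus_rebounds = top_k(PLAYERS, lambda p: p[1] + p[2], 1)
```

Changing only the scoring function changes the answer, which is why the user has to supply F together with k.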
slide-5
SLIDE 5

5

Similarity queries

K-NN Query Given a dataset D of n objects, a query point q, a distance function F and k, a k-NN query returns the k objects with the smallest distance to q.

slide-6
SLIDE 6

6

Problems of top-K and k-NN

In a Top-k and k-NN query the ranking/distance function F as well as the number of answers k must be provided by the user. In many cases it is difficult to define a meaningful ranking/distance function, especially when the attributes have different semantics (e.g., find the cheapest hotel closest to the beach).

slide-7
SLIDE 7

Skyline: Hotel Example

hotel  distance  price
p1     4         400
p2     24        380
p3     14        340
p4     36        300
p5     26        280
p6     8         260
p7     40        200
p8     20        180
p9     34        140
p10    28        120
p11    16        60

[Figure: scatter plot of hotels p1–p11, price (100–400) against distance to the destination (10–40).]

Skyline Computation: Challenges and Opportunities


slide-10
SLIDE 10

Skyline: Hotel Example

hotel  distance  price  0.75*distance + 0.25*price/10
p1     4         400    13
p2     24        380    27.5
p3     14        340    19
p4     36        300    34.5
p5     26        280    26.5
p6     8         260    12.5
p7     40        200    35
p8     20        180    19.5
p9     34        140    29
p10    28        120    24
p11    16        60     13.5

[Figure: scatter plot of hotels p1–p11, price (100–400) against distance to the destination (10–40).]
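The score column can be reproduced with a few lines. This is a sketch that assumes a lower score is better (short distance, low price); the slide itself does not say which direction is preferred. `HOTELS` and `score` are illustrative names.

```python
# Hotel data from the table: name -> (distance, price).
HOTELS = {
    "p1": (4, 400), "p2": (24, 380), "p3": (14, 340), "p4": (36, 300),
    "p5": (26, 280), "p6": (8, 260), "p7": (40, 200), "p8": (20, 180),
    "p9": (34, 140), "p10": (28, 120), "p11": (16, 60),
}


def score(distance, price):
    """Linear scoring function shown in the table header."""
    return 0.75 * distance + 0.25 * price / 10


scores = {name: score(*attrs) for name, attrs in HOTELS.items()}
# Assumption: the best hotel is the one with the smallest score.
best = min(scores, key=scores.get)
```

Under this particular weighting p6 comes out first; other weights would pick a different hotel, which is the motivation for the skyline.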

slide-11
SLIDE 11

Skyline: Hotel Example

[Table and figure: the hotel data (hotel, distance, price) from the hotel example, with its scatter plot of price against distance to the destination.]

Definition (Skyline). Given a dataset P of n points in d-dimensional space, let p and p′ be two different points in P. p dominates p′ if p[i] ≤ p′[i] for all i and p[i] < p′[i] for at least one i. The skyline points are those points that are not dominated by any other point in P.
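The definition translates directly into a dominance test and a quadratic skyline computation. The sketch below applies it to the hotel data, assuming smaller values are preferred in both dimensions, as in the definition. `dominates` and `naive_skyline` are illustrative names.

```python
HOTELS = {
    "p1": (4, 400), "p2": (24, 380), "p3": (14, 340), "p4": (36, 300),
    "p5": (26, 280), "p6": (8, 260), "p7": (40, 200), "p8": (20, 180),
    "p9": (34, 140), "p10": (28, 120), "p11": (16, 60),
}


def dominates(p, q):
    """p dominates q: p is no worse in every dimension and better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def naive_skyline(points):
    """Compare every point with every other point: O(n^2) dominance checks."""
    names = [
        name for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    ]
    return sorted(names, key=lambda n: points[n])


skyline = naive_skyline(HOTELS)
```

For the hotel table this leaves p1, p6 and p11: no other hotel is both at least as close and at least as cheap as any of them.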

slide-12
SLIDE 12

Skyline Queries: Patient Similarity Search Example

ID  age  trestbps
p1  40   140
p2  39   120
p3  45   130
p4  37   140

Table: Sample of heart disease dataset. (a) Original data.

[Figure: p1–p4 and the query point q plotted by age (x-axis, 35–45) and trestbps (y-axis, 110–140).]

Query point: q(41,125)

slide-13
SLIDE 13

Motivating Example: Skyline Queries

(a) Original data.

ID  age  trestbps
p1  40   140
p2  39   120
p3  45   130
p4  37   140

(b) Mapped data.

ID  age  trestbps
t1  42   140
t2  43   130
t3  45   130
t4  45   140

Table: Sample of heart disease dataset.

[Figure: p1–p4, the mapped points t1–t4 and the query point q plotted by age (x-axis, 35–45) and trestbps (y-axis, 110–140).]

Query point: q(41,125).
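The mapped table is consistent with reflecting every point into the quadrant above and to the right of q, that is t[i] = q[i] + |p[i] − q[i]|. This mapping is inferred from the numbers on the slide rather than stated there, so the sketch below should be read under that assumption. The skyline of the mapped points then gives the patients most similar to q.

```python
PATIENTS = {"p1": (40, 140), "p2": (39, 120), "p3": (45, 130), "p4": (37, 140)}
QUERY = (41, 125)


def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def map_to_query(p, q):
    """Assumed mapping: keep the per-attribute distance to q, drop its sign."""
    return tuple(qi + abs(pi - qi) for pi, qi in zip(p, q))


def dynamic_skyline(points, q):
    """Names of the points whose mapped image is not dominated by another mapped image."""
    mapped = {name: map_to_query(p, q) for name, p in points.items()}
    return sorted(
        name for name, t in mapped.items()
        if not any(dominates(u, t) for u in mapped.values())
    )


mapped = {name: map_to_query(p, QUERY) for name, p in PATIENTS.items()}
result = dynamic_skyline(PATIENTS, QUERY)
```

With this mapping t3 is dominated by t2 and t4 by t1, so p1 and p2 remain.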


slide-15
SLIDE 15

Skyline

Applications

  • Recommendation: recommend phones that are as cheap as possible, with as much memory as possible and as light as possible
  • Aggregation/integration: rank results from multiple search engines with relevance scores
  • Preprocessing for top-k: all candidates for top-1

slide-16
SLIDE 16

Skyline for Top-1

[Table and figure: the hotel data (hotel, distance, price) from the hotel example, with its scatter plot of price against distance to the destination.]

slide-17
SLIDE 17

What about Top-K?

[Table and figure: the hotel data (hotel, distance, price) from the hotel example, with its scatter plot of price against distance to the destination.]

slide-18
SLIDE 18

Skyline for TopK

[Table and figure: the hotel data (hotel, distance, price) from the hotel example, with its scatter plot of price against distance to the destination; annotation: Lowest Price.]

  • Skyline: Pareto top-1 points
  • Group skyline: Pareto top-k groups
slide-19
SLIDE 19

Group skyline definition: Dominance

Definition (G-Skyline). We say group G dominates group G′, denoted by G ≺g G′, if we can find two permutations of the k points for G and G′, G = {p_u1, p_u2, ..., p_uk} and G′ = {p′_v1, p′_v2, ..., p′_vk}, such that p_ui ⪯ p′_vi for all i (1 ≤ i ≤ k) and p_ui ≺ p′_vi for at least one i. The k-point G-Skyline consists of those groups with k points that are not g-dominated by any other group with same size.

[Table and figure: the hotel data (hotel, distance, price) from the hotel example, with its scatter plot of price against distance to the destination.]
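Group dominance can be checked by brute force: fix the order of G and try every permutation of the other group. The sketch reads ⪯ as "dominates or is equal to", which is an assumption about the garbled symbol in the definition. It is exponential in k and only meant for tiny examples; `g_dominates` is an illustrative name.

```python
from itertools import permutations

HOTELS = {
    "p1": (4, 400), "p2": (24, 380), "p3": (14, 340), "p4": (36, 300),
    "p5": (26, 280), "p6": (8, 260), "p7": (40, 200), "p8": (20, 180),
    "p9": (34, 140), "p10": (28, 120), "p11": (16, 60),
}


def dominates_or_equals(p, q):
    return all(a <= b for a, b in zip(p, q))


def dominates(p, q):
    return dominates_or_equals(p, q) and any(a < b for a, b in zip(p, q))


def g_dominates(group, other):
    """True if some pairing of the two equal-sized groups satisfies the definition."""
    if len(group) != len(other):
        raise ValueError("groups must have the same size")
    for perm in permutations(other):
        pairs = list(zip(group, perm))
        if all(dominates_or_equals(p, q) for p, q in pairs) and any(
            dominates(p, q) for p, q in pairs
        ):
            return True
    return False


G = [HOTELS[n] for n in ("p11", "p6", "p1")]
H = [HOTELS[n] for n in ("p8", "p3", "p1")]
```

Here G g-dominates H because p11 dominates p8, p6 dominates p3, and p1 is paired with itself.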

slide-20
SLIDE 20

Hotel Example

[Table and figure: the hotel data (hotel, distance, price) from the hotel example, with its scatter plot of price against distance to the destination; annotation: Lowest Price.]

slide-21
SLIDE 21

22

Outline

  • Ranking and skyline
  • Top-k algorithms
  • Skyline algorithms
  • Reconciling top-k and skyline
slide-22
SLIDE 22

Introduction – naïve methods

Top-k processing

  • Apply the ranking function F to all objects
  • Unsorted: linearly scan all objects (online)
  • Sorted list: sort all objects (offline)
  • Priority queue: build the queue (offline), remove the top-k (online)
  • Offline computation needs to know the scoring function!

slide-23
SLIDE 23

Top-k Computation – FA algorithm

Fagin’s Algorithm (FA)

  • R. Fagin, Amnon Lotem, Moni Naor. “Optimal Aggregation Algorithms for Middleware”. J. Comput. Syst. Sci. 66(4), pp. 614-656, 2003.

The algorithm is based on two types of accesses:

  • Sorted access on attribute ai: retrieves the next object in the sorted list of ai
  • Random access on attribute ai: gives the value of the i-th attribute for a specific object identifier

slide-24
SLIDE 24

Top-k Computation

The database can be considered as an n x m score matrix, storing the score values of every object in every attribute.

a1      a2      a3      a4      a5
O3, 99  O1, 91  O1, 92  O3, 74  O3, 67
O1, 66  O3, 90  O3, 75  O1, 56  O4, 67
O0, 63  O0, 61  O4, 70  O0, 56  O1, 58
O2, 48  O4, 07  O2, 16  O2, 28  O2, 54
O4, 44  O2, 01  O0, 01  O4, 19  O0, 35

Note that, for each attribute, scores are sorted in descending order.

slide-25
SLIDE 25

Top-k Computation – FA algorithm

Outline of FA

Step 1:

  • Read attributes from every sorted list using sorted access.
  • Stop when k objects have been seen in common from all lists.

Step 2:

  • Use random access to find missing scores.

Step 3:

  • Compute the scores of the seen objects.
  • Return the k highest scored objects.
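The three steps can be sketched as follows, using the five sorted lists from the score matrix and assuming the aggregate F is the plain sum of the attribute scores (this matches the totals in the worked example). `LISTS` and `fagin` are illustrative names, and random access is simulated with dictionaries.

```python
# One sorted list per attribute (a1..a5), each in descending score order.
LISTS = [
    [("O3", 99), ("O1", 66), ("O0", 63), ("O2", 48), ("O4", 44)],
    [("O1", 91), ("O3", 90), ("O0", 61), ("O4", 7), ("O2", 1)],
    [("O1", 92), ("O3", 75), ("O4", 70), ("O2", 16), ("O0", 1)],
    [("O3", 74), ("O1", 56), ("O0", 56), ("O2", 28), ("O4", 19)],
    [("O3", 67), ("O4", 67), ("O1", 58), ("O2", 54), ("O0", 35)],
]


def fagin(lists, k):
    m = len(lists)
    seen = {}  # object id -> {attribute index: score} found by sorted access
    depth = 0
    # Step 1: sorted access in parallel until k objects were seen in every list.
    while sum(len(s) == m for s in seen.values()) < k and depth < len(lists[0]):
        for i, lst in enumerate(lists):
            obj, value = lst[depth]
            seen.setdefault(obj, {})[i] = value
        depth += 1
    # Step 2: random access for the missing scores of every seen object.
    lookup = [dict(lst) for lst in lists]
    for obj, values in seen.items():
        for i in range(m):
            values.setdefault(i, lookup[i][obj])
    # Step 3: aggregate (sum is an assumption) and return the k best.
    totals = {obj: sum(values.values()) for obj, values in seen.items()}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], depth, totals


top2, depth, totals = fagin(LISTS, 2)
```

For k=2 the sorted phase stops at position 3, after O3 and O1 have appeared in all five lists.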
slide-26
SLIDE 26

Top-k Computation – FA algorithm

Step 1:

  • Read attributes from every sorted list using sorted access
  • Stop when k objects have been seen in common from all lists

[Table: the sorted lists a1–a5, as in the score matrix.]

id  a1  a2  a3  a4  a5
O3  99  90  75  74  67
O1  66  91  92  56  58
O4  -   -   70  -   67
O0  63  61  -   56  -

No more sorted accesses are required, since we have determined k=2 objects contained in all lists (objects O1 and O3).
slide-27
SLIDE 27

Top-k Computation – FA algorithm

Step 2:

  • Use random access to find missing scores

[Table: the sorted lists a1–a5, as in the score matrix.]

id  a1  a2  a3  a4  a5
O3  99  90  75  74  67
O1  66  91  92  56  58
O4  44  07  70  19  67
O0  63  61  01  56  35

All missing values for seen objects have been determined. Therefore, no more random accesses are required.

slide-28
SLIDE 28

Top-k Computation – FA algorithm

Step 3:

  • Compute the scores of the seen objects.
  • Return the k highest scored objects.

id  a1  a2  a3  a4  a5  Total Score
O3  99  90  75  74  67  405
O1  66  91  92  56  58  363
O4  44  07  70  19  67  207
O0  63  61  01  56  35  216

Top-2: O3 (405) and O1 (363)

slide-29
SLIDE 29

Threshold Algorithm (TA)

  • Idea: sometimes we can stop before seeing k objects in every list
  • Compute the score for all objects seen so far
  • Compute an upper bound on how good the score of an unseen object can be
  • The bound (threshold) aggregates the minimal (worst) score seen so far in each list
  • If there are already k objects at or above the threshold, stop
slide-30
SLIDE 30

Top-k Computation – TA algorithm

Outline of TA

Step 1:

  • Read attributes from every sorted list using sorted access.
  • For each object x seen:
  • Use random access to find missing values.
  • Determine the score F(x) of object x.
  • If the object is among the top-k, keep it in the buffer.

Step 2:

  • Determine the threshold value T based on the objects currently seen under sorted access: T = a1(p) + a2(p) + … + am(p), where p is the current sorted access position.
  • If there are k objects with total scores >= T then STOP and report answers; else p = p + 1 and GOTO Step 1.
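A compact sketch of the two steps, again assuming F is the sum of the attribute scores and simulating random access with dictionaries. `LISTS` and `threshold_algorithm` are illustrative names.

```python
import heapq

# One sorted list per attribute (a1..a5), each in descending score order.
LISTS = [
    [("O3", 99), ("O1", 66), ("O0", 63), ("O2", 48), ("O4", 44)],
    [("O1", 91), ("O3", 90), ("O0", 61), ("O4", 7), ("O2", 1)],
    [("O1", 92), ("O3", 75), ("O4", 70), ("O2", 16), ("O0", 1)],
    [("O3", 74), ("O1", 56), ("O0", 56), ("O2", 28), ("O4", 19)],
    [("O3", 67), ("O4", 67), ("O1", 58), ("O2", 54), ("O0", 35)],
]


def threshold_algorithm(lists, k):
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    scores = {}                            # full score of every object seen so far
    buffer, threshold = [], None
    for p in range(len(lists[0])):
        # Step 1: one sorted access per list, then random accesses for new objects.
        for lst in lists:
            obj = lst[p][0]
            if obj not in scores:
                scores[obj] = sum(lookup[i][obj] for i in range(m))
        buffer = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        # Step 2: threshold from the scores at the current sorted access position.
        threshold = sum(lst[p][1] for lst in lists)
        if len(buffer) == k and buffer[-1][1] >= threshold:
            return buffer, p + 1, threshold
    return buffer, len(lists[0]), threshold


top2, position, T = threshold_algorithm(LISTS, 2)
```

On the example data the loop stops at position 2 with T = 354, one position earlier than the sorted phase of FA.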

slide-31
SLIDE 31

Top-k Computation – TA algorithm

Step 1:

  • Read attributes from every sorted list using sorted access.
  • For each object x seen:
  • Use random access to find missing values.
  • Determine the score F(x) of object x.
  • If the object is among the top-k, keep it in the buffer.

[Table: the sorted lists a1–a5, as in the score matrix.]

p=1

id  a1  a2  a3  a4  a5  F
O3  99  90  75  74  67  405
O1  66  91  92  56  58  363

BUFFER: (O3,405) (O1,363)

slide-32
SLIDE 32

Top-k Computation – TA algorithm

Step 2:

  • Determine the threshold value T based on the objects currently seen under sorted access: T = a1(p) + a2(p) + … + am(p), where p is the current sorted access position.
  • If there are k objects with total scores >= T then STOP and report answers; else p = p + 1 and GOTO Step 1.

[Table: the sorted lists a1–a5, as in the score matrix.]

p=1

id  a1  a2  a3  a4  a5  F
O3  99  90  75  74  67  405
O1  66  91  92  56  58  363

T = 99+91+92+74+67 = 423

There are not k objects with a score >= T, so GOTO Step 1.

BUFFER: (O3,405) (O1,363)

slide-33
SLIDE 33

Top-k Computation – TA algorithm

Step 1 (second execution):

  • Read attributes from every sorted list using sorted access.
  • For each object x seen:
  • Use random access to find missing values.
  • Determine the score F(x) of object x.
  • If the object is among the top-k, keep it in the buffer.

[Table: the sorted lists a1–a5, as in the score matrix.]

p=2

id  a1  a2  a3  a4  a5  F
O3  99  90  75  74  67  405
O1  66  91  92  56  58  363
O4  44  07  70  19  67  207

BUFFER: (O3,405) (O1,363)

slide-34
SLIDE 34

Top-k Computation – TA algorithm

Step 2 (second execution):

  • Determine the threshold value T based on the objects currently seen under sorted access: T = a1(p) + a2(p) + … + am(p), where p is the current sorted access position.
  • If there are k objects with total scores >= T then STOP and report answers; else p = p + 1 and GOTO Step 1.

[Table: the sorted lists a1–a5, as in the score matrix.]

p=2

id  a1  a2  a3  a4  a5  F
O3  99  90  75  74  67  405
O1  66  91  92  56  58  363
O4  44  07  70  19  67  207

T = 66+90+75+56+67 = 354

BUFFER: (O3,405) (O1,363)

Both objects in the buffer have scores higher than T. STOP and report answers.

slide-35
SLIDE 35

Top-k Computation - FA vs TA

  • TA sees fewer objects than FA
  • TA stops at least as early as FA
  • TA may perform more random accesses than FA
  • In TA, (m-1) random accesses for each object.
  • In FA, random accesses are done at the end, only for missing scores.

slide-36
SLIDE 36

No Random Access

  • Maintain lower and upper bounds for every object (worst and best possible score)
  • Best is the aggregation of what we have seen + the best we can see from the remaining lists
  • Worst is the aggregation of what we have seen + zeros
  • Stop if the best possible score for objects outside the list is less than the k'th worst in the list
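The bounds can be computed from sorted accesses alone. The sketch assumes F is a sum of non-negative scores, so an unseen attribute contributes 0 to the worst case and the last value read from its list to the best case. `nra_bounds` is an illustrative name.

```python
# One sorted list per attribute (a1..a5), each in descending score order.
LISTS = [
    [("O3", 99), ("O1", 66), ("O0", 63), ("O2", 48), ("O4", 44)],
    [("O1", 91), ("O3", 90), ("O0", 61), ("O4", 7), ("O2", 1)],
    [("O1", 92), ("O3", 75), ("O4", 70), ("O2", 16), ("O0", 1)],
    [("O3", 74), ("O1", 56), ("O0", 56), ("O2", 28), ("O4", 19)],
    [("O3", 67), ("O4", 67), ("O1", 58), ("O2", 54), ("O0", 35)],
]


def nra_bounds(lists, depth):
    """(worst, best) score per seen object after `depth` rounds of sorted access."""
    m = len(lists)
    seen = {}
    for i, lst in enumerate(lists):
        for obj, value in lst[:depth]:
            seen.setdefault(obj, {})[i] = value
    # Last value read from each list: no unseen entry in that list can exceed it.
    last = [lst[depth - 1][1] for lst in lists]
    bounds = {}
    for obj, values in seen.items():
        worst = sum(values.values())
        best = worst + sum(last[i] for i in range(m) if i not in values)
        bounds[obj] = (worst, best)
    # The sum of the last values bounds the score of any completely unseen object.
    return bounds, sum(last)


bounds_p1, t_p1 = nra_bounds(LISTS, 1)
bounds_p2, t_p2 = nra_bounds(LISTS, 2)
```

At position 1 both seen objects still have a best case of 423; at position 2 the bounds tighten and O3 is fully known.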
slide-37
SLIDE 37

Top-k Computation – NRA algorithm

[Table: the sorted lists a1–a5, as in the score matrix.]

p=1

id  a1  a2  a3  a4  a5
O3  99  -   -   74  67
O1  -   91  92  -   -

T = 99+91+92+74+67 = 423

BUFFER (object, worst score, best score): (O3,240,423) (O1,183,423)

slide-38
SLIDE 38

Top-k Computation – NRA algorithm

[Table: the sorted lists a1–a5, as in the score matrix.]

p=2

id  a1  a2  a3  a4  a5
O3  99  90  75  74  67
O1  66  91  92  56  -
O4  -   -   -   -   67

T = 66+90+75+56+67 = 354

BUFFER (object, worst score, best score): (O3,405) (O1,305,372) (O4,67,354)

Stop when k objects have worst scores greater than T

slide-39
SLIDE 39

40

Outline

  • Ranking and skyline
  • Top-k algorithms
  • Skyline algorithms
  • Skyline
  • Group skyline
  • Reconciling top-k and skyline
slide-40
SLIDE 40

Dominance

  • p1 dominates p2.
  • Hence, p1 has a smaller score under any monotone preference function f(x, y).
  • f(x, y) is monotone if it increases with both x and y.

[Figure: points p1 and p2 in the x–y plane.]

slide-41
SLIDE 41

Skyline

  • The skyline contains points that are not dominated by others.

[Figure: points p1–p8 in the x–y plane.]

slide-42
SLIDE 42

Skyline vs. convex hull

  • Skyline: contains the top-1 object of any monotone function.
  • Convex hull: contains the top-1 object of any linear function.

[Figure: two panels showing points p1–p8 in the x–y plane, one for the skyline and one for the convex hull.]

slide-43
SLIDE 43

Skyline computation – naïve methods

Skyline

  • For each object, check if it is dominated by any other object
  • Return the objects that are not dominated

Complexity?

slide-44
SLIDE 44

45

Skyline Computation

  • 2D Scanning algorithm
  • Divide and Conquer
  • Nearest-Neighbor based
  • Branch and Bound Algorithm
slide-47
SLIDE 47

2D

  • Sort all points by x
  • Scan the points one at a time in increasing x-order
  • Dominance check: if y is smaller than the smallest y so far (the previous skyline point), add the point to the skyline
  • Complexity: sort O(N lg N), scan O(N)

[Figure: plane sweep over the x–y plane.]
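The sweep fits in a few lines. The sketch sorts by (x, y) so that, among points with equal x, the one with the smaller y is visited first; this tie handling is an assumption the slide does not spell out. `skyline_2d` is an illustrative name.

```python
def skyline_2d(points):
    """Plane sweep for 2D skylines where smaller is better in both dimensions."""
    skyline = []
    smallest_y = float("inf")
    for x, y in sorted(points):      # increasing x, ties broken by increasing y
        if y < smallest_y:           # not dominated by any point seen so far
            skyline.append((x, y))
            smallest_y = y
    return skyline


# (distance, price) pairs of hotels p1..p11 from the hotel example.
HOTEL_POINTS = [
    (4, 400), (24, 380), (14, 340), (36, 300), (26, 280), (8, 260),
    (40, 200), (20, 180), (34, 140), (28, 120), (16, 60),
]
hotel_skyline = skyline_2d(HOTEL_POINTS)
```

After the O(N lg N) sort, each point costs a single comparison against the running minimum of y.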

slide-48
SLIDE 48

D & C Algorithm

  • 1. Original data
  • 2. Divide the dataset into 2 parts by the median
  • 3. Compute the skylines S1 and S2
  • 4. Eliminate points in S2 dominated by S1

slide-51
SLIDE 51

D&C algorithm

  • 2D: O(N lg N)

T(N) = 2T(N/2) + O(N)

  • 3D: O(N lg N)

merge: solve the 2-D problem with the scanning algorithm

T(N) = 2T(N/2) + O(N)

  • General (k dimensions): O(N lg^(k-2) N), average O(N lg N)

T(N, k) = 2T(N/2, k) + T(N, k-1) + O(N)
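A sketch of the 2D case of divide and conquer, assuming smaller is better in both dimensions. Points are sorted once by (x, y) and split at the median position; because each partial skyline comes back ordered by increasing x and decreasing y, the merge only needs the smallest y of the left skyline. `skyline_dc` is an illustrative name and duplicates are dropped for simplicity.

```python
def skyline_dc(points):
    pts = sorted(set(points))  # sort by x, then y; set() removes exact duplicates

    def solve(lo, hi):
        if hi - lo == 1:
            return [pts[lo]]
        mid = (lo + hi) // 2            # split at the median position
        s1 = solve(lo, mid)             # skyline of the half with smaller x
        s2 = solve(mid, hi)             # skyline of the half with larger x
        min_y_left = s1[-1][1]          # last point of s1 has its smallest y
        # Keep only the points of S2 that no point of S1 dominates.
        return s1 + [p for p in s2 if p[1] < min_y_left]

    return solve(0, len(pts)) if pts else []


# (distance, price) pairs of hotels p1..p11 from the hotel example.
HOTEL_POINTS = [
    (4, 400), (24, 380), (14, 340), (36, 300), (26, 280), (8, 260),
    (40, 200), (20, 180), (34, 140), (28, 120), (16, 60),
]
dc_skyline = skyline_dc(HOTEL_POINTS)
```

The merge is linear in the size of the two partial skylines, which gives the recurrence T(N) = 2T(N/2) + O(N).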

slide-52
SLIDE 52

Nearest Neighbor search

[Kossmann et al. VLDB 02]

  • Find the nearest neighbor as a skyline point
  • Eliminate points in region IV
  • Compute recursively in regions II and III (to-do list)

[Figure: two panels; the first shows p1 and the regions I–IV it defines, the second shows p2 and p3.]

slide-53
SLIDE 53

Branch-and-bound skyline (BBS)

[Papadias et al. SIGMOD 04]

  • Assume data are indexed by R-Tree

[Figure: points p1–p8 in the x–y plane, axes from 0 to 1 with origin O.]

slide-54
SLIDE 54

R-trees – structure

[Figure: points a–m in the x–y plane (axes 0–10) grouped into minimum bounding rectangles (MBRs) E1–E7, next to the corresponding R-tree with its root, intermediate entries and leaf entries.]

Minimum Bounding Rectangle (MBR)

slide-55
SLIDE 55

Introduction to R-trees – range query

[Figure: the same points a–m and MBRs E1–E7 in the x–y plane (axes 0–10), next to the R-tree, illustrating which entries a range query visits.]

slide-57
SLIDE 57

Branch-and-bound skyline (BBS)

[Papadias et al. SIGMOD 04]

  • Assume data are indexed by R-Tree
  • How to branch and bound for fast skyline search and search space elimination
  • Key idea: examine the closest MBR

[Figure: points p1–p8 in the x–y plane, axes from 0 to 1 with origin O.]

slide-58
SLIDE 58

BBS Algorithm

  • Initialize the skyline set S
  • Add all entries of the root to priority queue Q (based on L1 distance of the MBR lower-left corner to the origin)
  • Repeat: remove top entry e from Q
  • If e is dominated by S, discard
  • If e is an intermediate entry, add non-dominated children to Q
  • If e is a point, add to S
  • Stop when Q is empty
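The loop above can be sketched without a real R-tree library. The code below uses a hand-built two-level tree of dictionaries as a stand-in for the R-tree; the grouping of the hotel points into three leaves is hypothetical, and `make_node` and `bbs` are illustrative names. An MBR is treated as dominated when a skyline point dominates its lower-left corner.

```python
import heapq
from itertools import count


def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and p != q


def make_node(children):
    """Intermediate entry; children are points (tuples) or other nodes (dicts)."""
    corners = [c["low"] if isinstance(c, dict) else c for c in children]
    low = tuple(min(values) for values in zip(*corners))  # lower-left MBR corner
    return {"low": low, "children": children}


def bbs(root):
    skyline, heap, order = [], [], count()

    def corner(entry):
        return entry["low"] if isinstance(entry, dict) else entry

    def push(entry):
        low = corner(entry)
        if not any(dominates(s, low) for s in skyline):
            # Key: L1 distance (mindist) of the lower-left corner to the origin.
            heapq.heappush(heap, (sum(low), next(order), entry))

    for child in root["children"]:
        push(child)
    while heap:
        _, _, entry = heapq.heappop(heap)
        if any(dominates(s, corner(entry)) for s in skyline):
            continue                      # dominated by S: discard
        if isinstance(entry, dict):
            for child in entry["children"]:
                push(child)               # add non-dominated children to Q
        else:
            skyline.append(entry)         # a non-dominated point joins S
    return skyline


# Hypothetical leaf grouping of the hotel points (distance, price).
root = make_node([
    make_node([(4, 400), (8, 260), (14, 340)]),
    make_node([(24, 380), (26, 280), (36, 300), (40, 200)]),
    make_node([(20, 180), (16, 60), (28, 120), (34, 140)]),
])
bbs_skyline = bbs(root)
```

Skyline points are reported in ascending mindist order, and the second leaf is never expanded because its lower-left corner is already dominated by (16, 60).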
slide-59
SLIDE 59

BBS Algorithm - example

[Figure: the data points in the x–y plane (axes 1–10) grouped into leaf nodes N1–N7, next to the R-tree with root R and entries e1–e7.]

action       heap contents  S
access root  <e7,4><e6,6>   ∅

  • Each heap entry keeps the mindist of the MBR.

slide-60
SLIDE 60

BBS Algorithm - example

[Figure: the data points in the x–y plane (axes 1–10) grouped into leaf nodes N1–N7, next to the R-tree with root R and entries e1–e7.]

action       heap contents               S
access root  <e7,4><e6,6>                ∅
expand e7    <e3,5><e6,6><e5,8><e4,10>   ∅

  • Process entries in ascending order of their mindists.
slide-65
SLIDE 65

BBS Algorithm - example

[Figure: the data points in the x–y plane (axes 1–10) grouped into leaf nodes N1–N7, next to the R-tree with root R and entries e1–e7.]

action       heap contents               S
access root  <e7,4><e6,6>                ∅
expand e7    <e3,5><e6,6><e5,8><e4,10>   ∅
expand e3    <i,5><e6,6><e5,8><e4,10>    {i}
expand e6    <e5,8><e1,9><e4,10>         {i}
remove e5    <e1,9><e4,10>               {i}
expand e1    <a,10><e4,10>               {i,a}
expand e4    <k,10>                      {i,a,k}

slide-66
SLIDE 66

72

BBS Algorithm - performance

BBS performs better than previously proposed Skyline algorithms, regarding CPU time and I/O time.

(source: Papadias et al TODS 2005)

Number of R-tree node accesses vs dimensionality

slide-67
SLIDE 67

73

Outline

  • Ranking and skyline
  • Top-k algorithms
  • Skyline algorithms
  • Skyline
  • Group skyline
  • Reconciling top-k and skyline
slide-68
SLIDE 68

Group skyline Computation

Definition (G-Skyline). We say group G dominates group G′, denoted by G ≺g G′, if we can find two permutations of the k points for G and G′, G = {p_u1, p_u2, ..., p_uk} and G′ = {p′_v1, p′_v2, ..., p′_vk}, such that p_ui ⪯ p′_vi for all i (1 ≤ i ≤ k) and p_ui ≺ p′_vi for at least one i. The k-point G-Skyline consists of those groups with k points that are not g-dominated by any other group with same size.

[Table and figure: the hotel data (hotel, distance, price) from the hotel example, with its scatter plot of price against distance to the destination.]

slide-69
SLIDE 69

Brute-Force Algorithm


slide-70
SLIDE 70

Hotel Example

[Table and figure: the hotel data (hotel, distance, price) from the hotel example, with its scatter plot of price against distance to the destination; annotation: Lowest Price.]

slide-71
SLIDE 71

Observations

[Table and figure: the hotel data (hotel, distance, price) from the hotel example, with its scatter plot of price against distance to the destination.]

A G-Skyline group cannot have a non-skyline point dominated by a point outside the group.

slide-72
SLIDE 72

Observations

[Table and figure: the hotel data (hotel, distance, price) from the hotel example, with its scatter plot of price against distance to the destination; annotation: Lowest Price.]

Lemma. A G-Skyline group can have a non-skyline point dominated by a point within the group.

slide-73
SLIDE 73

Unit Group

Lemma. A point in a G-Skyline group cannot be dominated by a point outside the group.

Lemma. Given a point p, if p is in a G-Skyline group, p’s parents must be included in this G-Skyline group.

Definition (Unit Group). Given a point p, p and its parents form the unit group for p.

slide-74
SLIDE 74

Skyline Layers and Directed Skyline Graph

[Figure: Skyline layers. Hotels p1–p11 plotted by price (100–400) against distance (10–40) and partitioned into layer1–layer4.]

[Figure: Directed skyline graph over p1–p11, arranged by layer1–layer4.]
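Skyline layers can be computed by peeling: layer 1 is the skyline of all points, layer 2 the skyline of what remains, and so on. This is a quadratic sketch under that assumed definition of a layer; `skyline_layers` is an illustrative name.

```python
HOTELS = {
    "p1": (4, 400), "p2": (24, 380), "p3": (14, 340), "p4": (36, 300),
    "p5": (26, 280), "p6": (8, 260), "p7": (40, 200), "p8": (20, 180),
    "p9": (34, 140), "p10": (28, 120), "p11": (16, 60),
}


def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline_layers(points):
    """Repeatedly remove the skyline of the remaining points."""
    remaining = dict(points)
    layers = []
    while remaining:
        layer = {
            name for name, p in remaining.items()
            if not any(dominates(q, p) for q in remaining.values())
        }
        layers.append(layer)
        for name in layer:
            del remaining[name]
    return layers


layers = skyline_layers(HOTELS)
```

On the hotel data the peeling stops after four rounds, matching the four layers in the figure.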

slide-76
SLIDE 76

Unit Group

Definition (Unit Group). Given a point p in DSG, p and its parents form the unit group for p.

[Figure: directed skyline graph over p1–p11, arranged by layer1–layer4.]

slide-77
SLIDE 77

Verification of G-Skyline

Theorem (Verification of G-Skyline). Given a group G = {p1, p2, ..., pk}, it is a G-Skyline group if its corresponding unit group set S = u1 ∪ u2 ∪ ... ∪ uk contains k points, i.e., |S| = k.

[Figure: directed skyline graph over p1–p11, arranged by layer1–layer4.]
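The verification test is easy to sketch once the parents of a point are known. The code assumes that the parents of p in the directed skyline graph are all points that dominate p, which is an assumption about the DSG and not something the slide states. The enumeration over all k-subsets corresponds to the brute-force baseline without any pruning; all function names are illustrative.

```python
from itertools import combinations

HOTELS = {
    "p1": (4, 400), "p2": (24, 380), "p3": (14, 340), "p4": (36, 300),
    "p5": (26, 280), "p6": (8, 260), "p7": (40, 200), "p8": (20, 180),
    "p9": (34, 140), "p10": (28, 120), "p11": (16, 60),
}


def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def unit_group(name, points):
    """The point itself plus its parents (assumed: every point that dominates it)."""
    return {name} | {o for o, q in points.items() if dominates(q, points[name])}


def is_g_skyline(group, points):
    """Theorem: the union of the unit groups must contain exactly k points."""
    union = set().union(*(unit_group(name, points) for name in group))
    return len(union) == len(group)


def g_skyline_groups(points, k):
    """Baseline: enumerate every k-subset and verify it."""
    return [set(c) for c in combinations(sorted(points), k) if is_g_skyline(c, points)]


groups_of_two = g_skyline_groups(HOTELS, 2)
```

For example, {p11, p8, p10} passes the test because p11 is the only parent of both p8 and p10, while {p1, p6, p8} fails because p8's parent p11 is missing.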

slide-78
SLIDE 78

Finding G-Skyline Groups

Baseline: Enumerate all candidates, and check if the unit-groups are in the group

Skyline Computation: Challenges and Opportunities

slide-79
SLIDE 79

All possible candidates: set enumeration tree

[Figure: set enumeration tree over the candidate points 1, ..., Sk, shown next to the directed skyline graph over p1–p11. The root {} is at level |s| = 0; level |s| = 1 holds {1}, {2}, ..., {k}, ..., {Sk}; level |s| = 2 holds {1, 2}, ..., {1, Sk}, {2, 3}, ..., {k, Sk}; level |s| = k holds {1, 2, ..., k − 1, k}, ..., {2, 3, ..., k, Sk}; the last level |s| = Sk holds {1, 2, ..., k, ..., Sk}.]

slide-80
SLIDE 80

Finding G-Skyline Groups

Baseline: Enumerate all candidates, and check if the unit-groups are in the group Point-Wise: Point-wise algorithm with Subtree Pruning and Tail Set Pruning.

Skyline Computation: Challenges and Opportunities

slide-81
SLIDE 81
slide-82
SLIDE 82

88

Outline

  • Ranking and skyline
  • Top-k algorithms
  • Skyline algorithms
  • Skyline
  • Group skyline
  • Reconciling top-k and skyline