Best Position Algorithms for Top-k Queries*
Reza Akbarinia, Esther Pacitti, Patrick Valduriez
Atlas Team, INRIA and LINA, Nantes
September 2007
* VLDB Conference, 2007
Outline: Introduction, Problem Definition, Related Work (FA and …
– Each data item has a local score in each list
– Each list is sorted in decreasing order of the local scores
– Sorted access: reads the next item in the list (starting with the first data item)
– Random access: looks up a given data item in the list by its identifier (e.g. TID)
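The two access modes above can be illustrated with a minimal sketch. The class and method names (`SortedList`, `sorted_access`, `random_access`) are hypothetical and chosen only for this illustration; the sketch assumes a list held in memory as (item, local score) pairs already sorted in decreasing score order.

```python
class SortedList:
    """One ranked list: (item, local score) pairs in decreasing score order."""

    def __init__(self, entries):
        self.entries = list(entries)
        # Identifier -> local score, used for lookups by identifier.
        self.scores = {item: score for item, score in self.entries}
        self.cursor = 0  # index of the next entry to return under sorted access

    def sorted_access(self):
        """Return the next (item, score) pair, or None at the end of the list."""
        if self.cursor >= len(self.entries):
            return None
        entry = self.entries[self.cursor]
        self.cursor += 1
        return entry

    def random_access(self, item):
        """Return the local score of `item`, looked up by its identifier."""
        return self.scores[item]
```

Sorted access only ever moves forward through the list, while random access can reach any item directly, which is why the two are counted and costed separately.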
– Do sorted access in parallel to the lists until at least k data items have been seen in all lists
– [Nepal and Ramakrishna, ICDE99]
– [Fagin, Lotem and Naor, PODS01]
– [Güntzer, Kießling and Balke, ITCC01]
– Based on the last scores seen in the lists under sorted access
sf() = s1 + s2 + s3, k = 3; Y: {top seen items}; T: threshold

Position   List 1            List 2            List 3
           Data item   s1    Data item   s2    Data item   s3
1          a           30    b           28    c           30
2          d           28    f           27    e           29
3          i           27    g           25    h           28
4          c           26    e           24    d           25
5          g           25    i           23    b           24
6          h           23    a           21    f           19
7          e           17    h           20    m           15
8          f           14    c           14    a           14
9          b           11    d           13    i           12
…          …           …     …           …     …           …

At each position, TA does sorted access to every list, then random access for each item seen:
Position 1: T = 30 + 28 + 30 = 88, Y = {(c, 70), (a, 65), (b, 63)}
Position 2: T = 28 + 27 + 29 = 84, Y = {(e, 70), (c, 70), (d, 66)}
Position 3: T = 27 + 25 + 28 = 80, Y = {(h, 71), (e, 70), (c, 70)}
Position 4: T = 26 + 24 + 25 = 75
Position 5: T = 25 + 23 + 24 = 72
Position 6: T = 23 + 21 + 19 = 63; every item in Y now scores at least T, so TA stops
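The TA trace on this example can be replayed with a short sketch. This is an illustrative simulation, not the authors' implementation: the function name `threshold_algorithm` is hypothetical, random access is simulated with dictionary lookups, and the lists are truncated to the positions the slide shows in full (item m has no known score in Lists 1 and 2, but TA never needs it here).

```python
# Example lists from the slide, truncated to the positions it shows in full.
LIST1 = [("a", 30), ("d", 28), ("i", 27), ("c", 26), ("g", 25),
         ("h", 23), ("e", 17), ("f", 14), ("b", 11)]
LIST2 = [("b", 28), ("f", 27), ("g", 25), ("e", 24), ("i", 23),
         ("a", 21), ("h", 20), ("c", 14), ("d", 13)]
LIST3 = [("c", 30), ("e", 29), ("h", 28), ("d", 25), ("b", 24),
         ("f", 19), ("m", 15), ("a", 14), ("i", 12), ("g", 11)]


def threshold_algorithm(lists, k):
    """Return (top-k items with overall scores, stopping position) using sum as sf."""
    lookup = [dict(lst) for lst in lists]  # simulates random access by identifier
    overall = {}                           # overall score of every item seen so far
    top = []
    # zip walks all lists in parallel: one sorted access per list per position.
    for position, row in enumerate(zip(*lists), start=1):
        for item, _ in row:
            if item not in overall:
                # Random access to every list to get the item's local scores.
                overall[item] = sum(table[item] for table in lookup)
        # Threshold: sf applied to the last scores seen under sorted access.
        threshold = sum(score for _, score in row)
        top = sorted(overall.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if len(top) == k and all(score >= threshold for _, score in top):
            break
    return top, position


top, stop = threshold_algorithm([LIST1, LIST2, LIST3], k=3)
print(top, stop)
```

On this data the sketch stops at position 6 with h, c and e as the top-3 items, matching the trace.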
– Do random access to the other lists to retrieve the item's score and position
– Maintain the positions and scores of the seen data items
sf() = s1 + s2 + s3, k = 3; Y: {top seen items}

Position   List 1            List 2            List 3
           Data item   s1    Data item   s2    Data item   s3
1          a           30    b           28    c           30
2          d           28    f           27    e           29
3          i           27    g           25    h           28
4          c           26    e           24    d           25
5          g           25    i           23    b           24
6          h           23    a           21    f           19
7          e           17    h           20    m           15
8          f           14    c           14    a           14
9          b           11    d           13    i           12
10         …           …     …           …     g           11

At each position, BPA does sorted access to every list, then random access for each item seen, and updates the best positions:
Position 1: Y = {(c, 70), (a, 65), (b, 63)}; best positions overall score = 30 + 28 + 30 = 88
Position 2: Y = {(e, 70), (c, 70), (d, 66)}; best positions overall score = 28 + 27 + 29 = 84
Position 3: Y = {(h, 71), (e, 70), (c, 70)}; best positions overall score = 11 + 13 + 19 = 43

At position 3, the best positions overall score is less than the scores of the k data items in Y, so BPA stops. Recall that, over this database, TA stops at position 6. Thus, the number of sorted (random) accesses done by BPA is half that of TA.
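The BPA example can be replayed with the sketch below. It is an illustration under stated assumptions rather than the authors' code: the names `bpa` and `best_position` are hypothetical, the best position of a list is taken to be the greatest position p such that every position from 1 to p has been seen, and the lists are truncated to the entries the slide shows (unknown entries at position 10 of Lists 1 and 2 are left out).

```python
LIST1 = [("a", 30), ("d", 28), ("i", 27), ("c", 26), ("g", 25),
         ("h", 23), ("e", 17), ("f", 14), ("b", 11)]
LIST2 = [("b", 28), ("f", 27), ("g", 25), ("e", 24), ("i", 23),
         ("a", 21), ("h", 20), ("c", 14), ("d", 13)]
LIST3 = [("c", 30), ("e", 29), ("h", 28), ("d", 25), ("b", 24),
         ("f", 19), ("m", 15), ("a", 14), ("i", 12), ("g", 11)]


def best_position(seen, start):
    """Greatest position p >= start such that all positions 1..p are in `seen`."""
    p = start
    while p + 1 in seen:
        p += 1
    return p


def bpa(lists, k):
    """Return (top-k items, stopping position, best positions) using sum as sf."""
    # Item -> (position, score) per list; simulates random access.
    where = [{item: (pos, score) for pos, (item, score) in enumerate(lst, start=1)}
             for lst in lists]
    seen = [set() for _ in lists]  # positions seen in each list
    best = [0] * len(lists)
    overall, top, depth = {}, [], 0
    for depth in range(1, min(len(lst) for lst in lists) + 1):
        for i, lst in enumerate(lists):
            item, _ = lst[depth - 1]          # sorted access to list i
            seen[i].add(depth)
            for j, table in enumerate(where):
                if j != i:                    # random access to the other lists
                    seen[j].add(table[item][0])
            overall[item] = sum(table[item][1] for table in where)
        best = [best_position(seen[i], best[i]) for i in range(len(lists))]
        # Best positions overall score: sf over the scores at the best positions.
        lam = sum(lists[i][best[i] - 1][1] for i in range(len(lists)))
        top = sorted(overall.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if len(top) == k and all(score >= lam for _, score in top):
            break
    return top, depth, best


top, stop, best = bpa([LIST1, LIST2, LIST3], k=3)
print(top, stop, best)
```

On this data the sketch stops at position 3 with best positions 9, 9 and 6, whose local scores 11, 13 and 19 sum to the 43 shown in the trace.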
– Direct access: retrieves the data item at a given position in a list
– In BPA, a data item may be accessed several times in different lists
– In BPA2, no data item in a list is accessed more than once
– Bit array or B+-tree over the list
– Do random access to the other lists to retrieve the scores of the seen data item in all lists
– After each direct or random access to a list, update the best position
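A minimal sketch of these BPA2 steps follows. It rests on several assumptions that the slides do not spell out: each round does a direct access to position (best position + 1) in every list, the lists are processed one after another rather than in parallel, and sf is the sum. The function name `bpa2` and the access counter are hypothetical, and the data is the running example truncated to the entries shown on the slides.

```python
from collections import Counter

LIST1 = [("a", 30), ("d", 28), ("i", 27), ("c", 26), ("g", 25),
         ("h", 23), ("e", 17), ("f", 14), ("b", 11)]
LIST2 = [("b", 28), ("f", 27), ("g", 25), ("e", 24), ("i", 23),
         ("a", 21), ("h", 20), ("c", 14), ("d", 13)]
LIST3 = [("c", 30), ("e", 29), ("h", 28), ("d", 25), ("b", 24),
         ("f", 19), ("m", 15), ("a", 14), ("i", 12), ("g", 11)]


def bpa2(lists, k):
    """Return (top-k items, number of rounds, per-position access counts)."""
    where = [{item: (pos, score) for pos, (item, score) in enumerate(lst, start=1)}
             for lst in lists]
    seen = [set() for _ in lists]
    best = [0] * len(lists)
    accesses = Counter()  # (list index, position) -> number of accesses
    overall = {}
    rounds = 0

    def mark(i, pos):
        """Record an access to position `pos` of list i and update its best position."""
        seen[i].add(pos)
        accesses[(i, pos)] += 1
        while best[i] + 1 in seen[i]:
            best[i] += 1

    while True:
        rounds += 1
        progressed = False
        for i, lst in enumerate(lists):
            if best[i] >= len(lst):
                continue                      # list i is exhausted
            progressed = True
            pos = best[i] + 1                 # first position not yet seen
            item, _ = lst[pos - 1]            # direct access by position
            mark(i, pos)
            for j, table in enumerate(where):
                if j != i:                    # random access to the other lists
                    mark(j, table[item][0])
            overall[item] = sum(table[item][1] for table in where)
        lam = sum(lists[i][best[i] - 1][1] for i in range(len(lists)))
        top = sorted(overall.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if not progressed or (len(top) == k and all(s >= lam for _, s in top)):
            return top, rounds, accesses


top, rounds, accesses = bpa2([LIST1, LIST2, LIST3], k=3)
print(top, rounds, max(accesses.values()))
```

Because a direct access always targets a position that has not been seen yet, no (list, position) pair is counted more than once in this run, which is the property that distinguishes BPA2 from BPA.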
– Customized for centralized systems
– The cost of a random access is (log n) times that of a sorted access
– Useful in distributed systems
– Measured on a machine with a 2.4 GHz Intel Pentium 4
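The execution-cost rule above can be written as a small helper. This is a sketch under assumptions: the function name `execution_cost` is hypothetical, the logarithm is taken base 2 (the slide does not state the base), a sorted access is given unit cost, and n = 1024 is an arbitrary list size chosen for illustration. The access counts come from the running example: TA stops at position 6 over 3 lists (18 sorted and 36 random accesses) and BPA at position 3 (9 sorted and 18 random accesses).

```python
import math


def execution_cost(sorted_accesses, random_accesses, n, sorted_cost=1.0):
    """Total cost when one random access costs log2(n) sorted accesses."""
    return sorted_cost * (sorted_accesses + random_accesses * math.log2(n))


n = 1024  # hypothetical list size, so log2(n) = 10
ta_cost = execution_cost(18, 36, n)
bpa_cost = execution_cost(9, 18, n)
print(ta_cost, bpa_cost, bpa_cost / ta_cost)
```

Since BPA halves both access counts on this example, its cost under this model is half that of TA whatever value of n is chosen.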
[Figure: Execution cost vs. m, uniform database, k = 20. Series: TA, BPA, BPA2.]
[Figure: Response time (ms) vs. m, uniform database, k = 20. Series: TA, BPA, BPA2.]
[Figure: Number of accesses vs. m, uniform database, k = 20. Series: TA, BPA, BPA2.]
[Fag99] R. Fagin. Combining fuzzy information from multiple systems. J. Comput. System Sci., 58(1), 1999.
[FLN01] R. Fagin, A. Lotem and M. Naor. Optimal aggregation algorithms for middleware. PODS Conf., 2001.
[GKB00] U. Güntzer, W. Kießling and W.-T. Balke. Optimizing multi-feature queries for image databases. VLDB Conf., 2000.
[NR99] S. Nepal and M.V. Ramakrishna. Query processing issues in image (multimedia) databases. ICDE Conf., 1999.
[Figure: Execution cost vs. k, uniform database, m = 8. Series: TA, BPA, BPA2.]
[Figure: Execution cost vs. n (25K to 200K), uniform database, m = 8. Series: TA, BPA, BPA2.]