Top-K Query Processing

  • D. Gunopulos


Multimedia Top-K Queries

The IBM QBIC project (90's): How do we store and index multimedia objects?

  • Multimedia objects can have many attributes, with numerical, "fuzzy" values, e.g. Figure 1: 0.7 red, Figure 2: 0.4 red; the same holds for blue, etc.

  • How to find similar objects?

Retrieving Multimedia Objects

Must address the similarity question

  • The user specifies the query as a function of the attributes:

Find the objects most similar to: (Red = 50), (Green = 20), (Blue = 30), with Red twice as important as Green or Blue.

How to rank the objects?

  • Must find the best (most similar) objects without accessing everything. The right model: Top-K Queries!

Data Management & Query Processing Today

We are living in a world where data is generated All The Time & Everywhere


Characteristics of the Applications

  • "Data is generated in a distributed fashion" (e.g. sensor data, file-sharing data, geographically distributed clusters)
  • "Distributed data is often outdated before it is ever used" (e.g. CCTV video traces, Internet ping data, sensor readings, weblogs, RFID tags, ...)
  • "Transferring the data to a centralized repository is usually more expensive than storing it locally"

Motivating Problems

  • "In-situ Data Storage & Retrieval"

– Data remains in-situ (at the generating site). When users want to search/retrieve some information, they perform on-demand queries.

  • Challenges:

– Combine different attributes and data sources
– Minimize the utilization of the communication medium
– Exploit the network and the inherent parallelism of a distributed environment. Focus on ubiquitous hierarchical networks (e.g. P2P and sensor-nets).
– The number of answers might be very large. Focus on Top-K queries.


Top-K Query Example

  • Assume that we have a cluster of n=5 webservers.
  • Each server maintains locally the same m=5 webpages.
  • When a webpage is accessed by a client, a server increases a local hit counter by one.
  • TOP-1 Query: "Which webpage has the highest number of hits across all servers (i.e. the highest Score(oi))?"
  • Score(oi) can only be calculated if we combine the hit counts (the local scores) from all 5 servers.

[Figure: an m-by-n grid of local hit counters, one URL per row and one server per column; the TOTAL SCORE of a URL sums its row.]
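The combination step can be sketched in a few lines of Python; the server names and hit counts below are hypothetical (they reuse the numbers from the running example later in the talk).

```python
# Hypothetical hit counters: n=5 servers, each tracking the same m=5 pages.
hits = {
    "srv1": {"p0": 63, "p1": 66, "p2": 48, "p3": 99, "p4": 44},
    "srv2": {"p0": 61, "p1": 91, "p2": 1,  "p3": 90, "p4": 7},
    "srv3": {"p0": 1,  "p1": 92, "p2": 16, "p3": 75, "p4": 70},
    "srv4": {"p0": 28, "p1": 56, "p2": 56, "p3": 74, "p4": 19},
    "srv5": {"p0": 35, "p1": 58, "p2": 54, "p3": 67, "p4": 67},
}

def score(page):
    # Score(o_i): sum of the page's local hit counts across all servers.
    return sum(counts[page] for counts in hits.values())

top1 = max(hits["srv1"], key=score)   # TOP-1 query
print(top1, score(top1))              # p3 405
```

The point of the slide is exactly that this `score()` cannot be evaluated at any single server: every server must contribute its local count.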


Top-K Query Processing: Other Applications

  • Collaborative Spam Detection Networks
  • Content Distribution Networks
  • Information Retrieval
  • Sensor Networks

Top-K Query Processing Setting

Setting for this talk:

  • Vertical partitioning: independent access for each attribute (or sets of attributes)
  • Assume an index or sorted access per attribute
  • Centralized or distributed setting


Top-K Query Processing Setting: What kinds of queries?

  • Monotone functions of the attributes:

– f(x1, x2, x3) >= f(y1, y2, y3) whenever x1 >= y1, x2 >= y2, and x3 >= y3

  • Typically assume linear functions

[Figure: example relation with tuples O=(0,0), P=(1,0), R=(1,1), T=(0,1) in the unit square of (X1, X2), scored by the linear function fQ = 3X1 + 2X2.]
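A quick Python check of the monotonicity property, using the coefficients read from the slide's figure (fQ = 3X1 + 2X2); the sample points are made up.

```python
def f_q(x1, x2):
    # Linear scoring function from the figure: f_Q = 3*X1 + 2*X2.
    return 3 * x1 + 2 * x2

# Monotonicity: if one tuple dominates another attribute-wise,
# its score is at least as high.
a, b = (9, 8), (5, 8)
assert all(ai >= bi for ai, bi in zip(a, b))   # a dominates b
assert f_q(*a) >= f_q(*b)
print(f_q(*a), f_q(*b))                        # 43 31
```

Monotonicity is what lets Top-K algorithms prune: an object dominated on every attribute can never outrank the dominating one.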

Presentation Outline

  • Introduction to Top-K Query Processing
  • Centralized techniques

– Fagin's Algorithm
– Optimal Algorithms: TA (Threshold Algorithm)
– Restricted Access Models: TA-Sorted
– Probabilistic TA-Sorted
– Using previous query instantiations: LPTA

  • Distributed techniques
  • Online Algorithms for Monitoring Top-K results
  • Future Work

Fagin's Algorithm

[Fagin, PODS'98], [Fagin, Lotem, Naor, PODS'01]

The first efficient algorithm. Assumes an index per attribute.

FA Algorithm

1) Access the n lists in parallel.
2) Stop after K objects have been seen in all lists.
3) For each object oi that has been seen but not resolved, perform random accesses to the other lists to find the complete score of oi.
4) Return the K objects with the highest scores.
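The steps above can be sketched in Python over the five sorted lists used in the example that follows (the `dict(lst)` lookup stands in for the per-attribute index used for random accesses):

```python
# Five lists sorted by local score; entries are (object id, score).
lists = {
    "v1": [(3, 99), (1, 66), (0, 63), (2, 48), (4, 44)],
    "v2": [(1, 91), (3, 90), (0, 61), (4, 7), (2, 1)],
    "v3": [(1, 92), (3, 75), (4, 70), (2, 16), (0, 1)],
    "v4": [(3, 74), (1, 56), (2, 56), (0, 28), (4, 19)],
    "v5": [(3, 67), (4, 67), (1, 58), (2, 54), (0, 35)],
}

def fagin(lists, k):
    n, seen, depth = len(lists), {}, 0
    # Phase 1: sorted accesses until k objects have been seen in every list.
    while sum(len(s) == n for s in seen.values()) < k:
        for name, lst in lists.items():
            oid, score = lst[depth]
            seen.setdefault(oid, {})[name] = score
        depth += 1
    # Phase 2: random accesses resolve every partially seen object.
    for oid, partial in seen.items():
        for name, lst in lists.items():
            partial.setdefault(name, dict(lst)[oid])
    totals = sorted(((sum(s.values()), o) for o, s in seen.items()), reverse=True)
    return [(o, t) for t, o in totals[:k]]

print(fagin(lists, 1))   # [(3, 405)]
```

For Top-1, the sorted accesses stop after the second row (o3 has been seen in all five lists), exactly as in the example on the next slide.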

Fagin's Algorithm (Example)

Five lists, each sorted by local score; entries are (object id, score):

v1: (3, 99) (1, 66) (0, 63) (2, 48) (4, 44)
v2: (1, 91) (3, 90) (0, 61) (4, 07) (2, 01)
v3: (1, 92) (3, 75) (4, 70) (2, 16) (0, 01)
v4: (3, 74) (1, 56) (2, 56) (0, 28) (4, 19)
v5: (3, 67) (4, 67) (1, 58) (2, 54) (0, 35)

Iteration 1: partial scores for o1 and o3.
Iteration 2: o3 is resolved (seen in all five lists); partial scores remain for o1 and o4.
For Top-1, we resolve o1 and o4 with random accesses: o3 = 405, o1 = 363, o4 = 207.
For Top-2, we would continue with Iteration 3.
TOP-K result: o3, 405 (normalized: 4.05/5 = .81).


Fagin's* Threshold Algorithm

Long studied and well understood.

* Concurrently developed by 3 groups: [Fagin, Lotem, Naor, PODS'01], [Guntzer, Balke, Kieling, VLDB'00], [Nepal, Ramakrishna, ICDE'99]

TA Algorithm

1) Access the n lists in parallel.
2) When some object oi is seen, perform random accesses to the other lists to find the complete score of oi. Do the same for all objects in the current row.
3) Compute the threshold τ as the sum of the scores in the current row.
4) The algorithm stops after K objects have been found with a score above τ.
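A Python sketch of TA over the same five sorted lists (the `lookup` dicts play the role of the random accesses):

```python
# Five lists sorted by local score; entries are (object id, score).
lists = {
    "v1": [(3, 99), (1, 66), (0, 63), (2, 48), (4, 44)],
    "v2": [(1, 91), (3, 90), (0, 61), (4, 7), (2, 1)],
    "v3": [(1, 92), (3, 75), (4, 70), (2, 16), (0, 1)],
    "v4": [(3, 74), (1, 56), (2, 56), (0, 28), (4, 19)],
    "v5": [(3, 67), (4, 67), (1, 58), (2, 54), (0, 35)],
}

def threshold_algorithm(lists, k):
    lookup = {name: dict(lst) for name, lst in lists.items()}
    resolved = {}                                   # oid -> complete score
    for depth in range(len(next(iter(lists.values())))):
        row = [lst[depth] for lst in lists.values()]
        for oid, _ in row:
            if oid not in resolved:                 # random accesses
                resolved[oid] = sum(lookup[name][oid] for name in lists)
        tau = sum(score for _, score in row)        # threshold of this row
        top = sorted(resolved.items(), key=lambda kv: -kv[1])[:k]
        if len(top) == k and top[-1][1] >= tau:     # k objects at/above tau
            return top
    return sorted(resolved.items(), key=lambda kv: -kv[1])[:k]

print(threshold_algorithm(lists, 1))   # [(3, 405)]
```

Note the bounded state: only the current best k candidates plus the row threshold need to be kept, which is the buffer advantage over FA discussed two slides below.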

The Threshold Algorithm (Example)

The same five sorted lists as in Fagin's example.

Iteration 1: random accesses resolve o3 = 405 and o1 = 363. Threshold τ = 99 + 91 + 92 + 74 + 67 = 423. Have we found K=1 objects with a score above τ? => NO.
Iteration 2: o4 = 207 is resolved. Threshold τ (2nd row) = 66 + 90 + 75 + 56 + 67 = 354. Have we found K=1 objects with a score above τ? => YES!
TOP-K result: o3, 405.


Comparison of Fagin's and the Threshold Algorithm

  • TA sees fewer objects than FA
  • TA may perform more random accesses than FA
  • TA requires only bounded buffer space (K entries), at the expense of more random seeks
  • FA may use unbounded buffers

Optimal algorithms

Algorithm B is instance optimal over a set of algorithms A and a set of inputs D if: B ∈ A and Cost(B, D) = O(Cost(A, D)) for every A ∈ A and D ∈ D. Which means that: Cost(B, D) ≤ c · Cost(A, D) + c', for every A ∈ A and D ∈ D.

Theorem [Fagin et al. 2003]: TA is instance optimal for every monotone aggregation function, over every database (excluding wild guesses).


TA-Sorted

TA makes random accesses, and assumes they are possible and inexpensive. In many situations, random accesses are much more expensive than sequential accesses, or may be difficult to implement. TA-Sorted uses sequential access only, and assumes sorted access for each attribute.

TA-Sorted Algorithm

1) Access the n lists in parallel.
2) When some object oi is seen, update its upper and lower bound.
3) The algorithm stops after K objects have been resolved and all other objects have a lower upper bound.
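A minimal sketch of TA-Sorted under the same setup. The bound bookkeeping is one reasonable reading of the slide: a seen object's lower bound sums its known scores, its upper bound fills the missing attributes with the last value read from each list, and an entirely unseen object is bounded by the sum of last-read values.

```python
# Five lists sorted by local score; entries are (object id, score).
lists = {
    "v1": [(3, 99), (1, 66), (0, 63), (2, 48), (4, 44)],
    "v2": [(1, 91), (3, 90), (0, 61), (4, 7), (2, 1)],
    "v3": [(1, 92), (3, 75), (4, 70), (2, 16), (0, 1)],
    "v4": [(3, 74), (1, 56), (2, 56), (0, 28), (4, 19)],
    "v5": [(3, 67), (4, 67), (1, 58), (2, 54), (0, 35)],
}

def ta_sorted(lists, k):
    names = list(lists)
    seen = {}                                     # oid -> {list: local score}
    for depth in range(len(lists[names[0]])):
        for name in names:                        # one sorted access per list
            oid, score = lists[name][depth]
            seen.setdefault(oid, {})[name] = score
        last = {name: lists[name][depth][1] for name in names}
        lb, ub = {}, {}
        for oid, partial in seen.items():
            lb[oid] = sum(partial.values())       # missing attributes >= 0
            ub[oid] = lb[oid] + sum(last[n] for n in names if n not in partial)
        top = sorted(seen, key=lambda o: -lb[o])[:k]
        worst_unseen = sum(last.values())         # bound for unseen objects
        if (len(top) == k and all(lb[o] == ub[o] for o in top)
                and all(ub[o] <= lb[top[-1]] for o in seen if o not in top)
                and worst_unseen <= lb[top[-1]]):
            return [(o, lb[o]) for o in top]
    return [(o, lb[o]) for o in sorted(seen, key=lambda o: -lb[o])[:k]]

print(ta_sorted(lists, 1))   # [(3, 405)]
```

Running this reproduces the bound sequence shown in the example below: o1 = [183, 423] and o3 = [240, 423] after the first row, with o3 resolved to 405 after the second.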

The TA-Sorted Algorithm (Example)

The same five sorted lists as before; [lb, ub] denotes an object's lower and upper bound.

Iteration 1: o1 = [183, 423], o3 = [240, 423]
Iteration 2: o1 = [305, 372], o3 = 405, o4 = [67, 354]
Iteration 3: o1 = 363, o3 = 405, o4 = [137, 317]
TOP-K result: o3, 405.


TA-Sorted and Relational Systems

[Ilyas, Aref, Elmagarmid, VLDB'03] [Tsaparas, Palpanas, Kotidis, Koudas, Srivastava, ICDE'03] [Bruno, Chaudhuri, Gravano, ICDE'01]

The sorted access makes it easier and conceptually simpler to integrate into relational database management systems: Ilyas et al. present the NRA-RJ Rank-Join query operator.

Bruno et al. show that Top-K queries can be reduced to multidimensional range queries (using histograms to model the data distribution).

Probabilistic TA-Sorted

[Theobald, Weikum, Schenkel, VLDB'04]

TA-Sorted can keep large intermediary results. A smart idea: use information about the distribution of the values to eliminate objects that are unlikely to be in the Top-K result. The key: compute probabilistic guarantees.

Probabilistic TA-Sorted Algorithm

1) Access the n lists in parallel.
2) When some object oi is seen, update its upper and lower bound.
3) The algorithm stops after K objects have been resolved and all other objects have a lower upper bound; candidates that are unlikely to reach the Top-K are pruned early.


Probabilistic TA-Sorted

How to compute probabilistic guarantees? We need an estimate for the score of unseen objects:

  • Assume attribute independence
  • Compute a model of the distribution in each attribute Xi (use histograms to model the data distributions)
  • Use convolution to estimate the distribution of X1 + X2
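The convolution step can be sketched in plain Python. Each histogram below stores made-up bucket probabilities for one attribute; under the independence assumption, the sum's distribution is their discrete convolution.

```python
def convolve(h1, h2):
    # Distribution of X1 + X2: bucket b of the sum collects probability
    # mass from every pair of buckets (i, j) with i + j == b.
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p1 in enumerate(h1):
        for j, p2 in enumerate(h2):
            out[i + j] += p1 * p2
    return out

x1 = [0.2, 0.5, 0.3]        # P(X1 falls in bucket 0, 1, 2)
x2 = [0.6, 0.4]             # P(X2 falls in bucket 0, 1)
dist = convolve(x1, x2)     # four buckets; probabilities sum to 1
```

From `dist`, the probability that an unseen object's total score exceeds the current Top-K threshold is a tail sum, which is exactly the guarantee used to drop unlikely candidates.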

Probabilistic TA-Sorted (Example)

The same five sorted lists as before.

Iteration 1: o1 = [183, 423], o3 = [240, 423]
Iteration 2: o1 = [305, 372], o3 = 405, o4 = [67, 354]
Iteration 3: o1 = 363, o3 = 405
o4 can be removed from the candidate list if it is unlikely to be in the Top-2.


Top-K processing using Views

[Das, Gunopulos, Koudas, Tsirogiannis, VLDB'06]

  • Query answering using views
  • Improved efficiency:

– Use similar, previously instantiated queries
– Use previous queries to model the correlations between attributes

Top-K processing using Views

Ranking Views: materialized results of previously asked top-k queries.

Query: fQ = 3X1 + 2X2 + 5X3

Base table R (tid: X1, X2, X3):
1: (82, 1, 59)
2: (53, 19, 83)
3: (29, 99, 15)
4: (80, 45, 8)
5: (28, 32, 39)

View V1, fV1 = 2X1 + 5X2 (tid, score): (3, 553), (4, 385), (5, 216), (2, 201), (1, 169)
View V2, fV2 = X2 + 4X3 (tid, score): (2, 351), (1, 237), (5, 177), (3, 159), (4, 88)

Problem: Can we answer new top-k queries efficiently using ranking views?


LPTA - Setting

  • Linear additive scoring functions, e.g. fQ = 3X1 + 2X2 + 5X3
  • Set of views:

– Materialized result of the tuples of a previously executed top-k query
– Arbitrary subset of attributes
– Sorted access on (tid, scoreQ(tid)) pairs

  • Random access on the base table R
  • Extends PREFER [Hristidis et al, 2001]

LPTA - Example

[Figure: LPTA answering a Top-1 query Q over R(X1, X2) using two views V1 and V2. Each view is scanned in sorted order, producing pairs (tid, score); after each row the stopping condition for Q is checked. The data points lie in the unit square of (X1, X2) with corners O=(0,0), P=(1,0), R=(1,1), T=(0,1).]


LPTA - Example (cont'd)

[Figure: the scan continues one row at a time in lock step on V1 and V2; after reading the d-th entry of each view, the Top-1 stopping condition is re-checked.]

LPTA: Linear Programming adaptation of TA

Base table R(X1, X2); query Q: fQ = 3X1 + 10X2; views fV1 = 2X1 + 5X2 and fV2 = X1 + 2X2.

After the d-th iteration (having read score sd1 from V1 and sd2 from V2), solve the linear program:

maximize fQ = 3X1 + 10X2
subject to: 0 ≤ X1, X2 ≤ 100
2X1 + 5X2 ≤ sd1
X1 + 2X2 ≤ sd2

The algorithm stops when unseenmax ≤ topkmin, i.e. when the LP's maximum (the best possible score of any unseen tuple) no longer exceeds the K-th best score found so far.
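The stopping test can be sketched as follows. A real implementation would call an LP solver; here the LP is approximated by brute force over a grid, and the view scores (250, 110) and the current `topk_min` of 600 are made-up values for illustration.

```python
def unseen_max(s1, s2, step=0.5):
    """Approximate max of f_Q = 3*X1 + 10*X2 over 0 <= X1, X2 <= 100,
    subject to the view constraints 2*X1 + 5*X2 <= s1 and X1 + 2*X2 <= s2,
    by grid search (a stand-in for solving the LP exactly)."""
    best = 0.0
    x = 0.0
    while x <= 100:
        y = 0.0
        while y <= 100:
            if 2 * x + 5 * y <= s1 and x + 2 * y <= s2:
                best = max(best, 3 * x + 10 * y)
            y += step
        x += step
    return best

topk_min = 600                           # hypothetical K-th best score so far
stop = unseen_max(250, 110) <= topk_min  # True: no unseen tuple can beat it
```

With these constraints the LP optimum is 500 (attained at X1=0, X2=50), so LPTA may safely stop: no unseen tuple can score above the current K-th best.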


View Selection Heuristics

Select Views By Angle (SVA): sort the views by increasing angle with respect to Q.

[Figure: the query vector Q and the view vectors in the positive quadrant; the views closest in angle to Q are selected.]

View Selection: Cost Estimation Framework

  • What is the cost of running LPTA on a specific set of views?
  • We need a precise indicator of the sequential and random accesses.
  • Cost = number of sequential accesses


Simulation of LPTA using Histograms

HQ approximates the score distribution with respect to Q (similarly HV1, HV2 for the views); b buckets, n/b tuples per bucket.

1. Estimate the score of the k-th highest tuple (topkmin).
2. Run LPTA bucket by bucket, in lock step.
3. Estimate the cost.

Presentation Outline

  • Introduction to Top-K Query Processing
  • Centralized techniques
  • Distributed techniques

– Exact algorithms with fixed rounds of communication: TPUT, TJA, TPAT
– Approximate algorithms using data distribution information: KLEE
– Exact algorithms using upper/lower bounds: LBK

  • Online Algorithms for Monitoring Top-K results
  • Future Work

Distributed Top-K Query Processing

Example: sensor monitoring

  • Consider n sensors S = {s1, s2, ..., sn}, each of which maintains a sliding window of m readings {o1, o2, ..., om}. Note: oij denotes the ith reading of the jth sensor.
  • Given an n-dimensional query point Q = {q1, q2, ..., qn}.
  • Objective: find the K timestamps with the maximum value of the weighted similarity to Q, where:

– wj: sensor weight. The readings of some sensors might be more important than others.
– Sim(qj, oij): a monotone similarity function.

Distributed Top-K Query Processing

Cost metrics in a distributed environment:

A) Utilization of the communication medium

– Transmitting less data conserves resources and energy, and minimizes failures.
– e.g. in a sensor network, sending 1 byte ≈ 1120 CPU instructions. Source: the RISE (Riverside Sensor) platform (NetDB'05, IPSN'05 Demo, IEEE SECON'05).

B) Query response time

  • The number of bytes transmitted is not the only parameter.
  • Minimize the time to execute a query.

Communication Topologies

  • Assume that the distributed sites are interconnected in a graph topology.

– Examples: peer-to-peer or sensor networks.

[Figure: a star topology (the querying node QN connected directly to v1 ... vn), a graph topology, and a spanning tree rooted at QN.]

Naïve Solution: Centralized Join (CJA)

  • Each node sends all its local scores (its full list).
  • Each intermediate node forwards all received lists.
  • The Gnutella approach.

Drawbacks:

  • Overwhelming number of messages.
  • Huge query response time.

Simple Solution: Staged Join (SJA)

[Madden, Franklin, Hellerstein, Wong, OSDI'02]

  • Aggregate the lists before they are forwarded to the parent.
  • This is essentially the TAG approach (Madden et al., OSDI '02).
  • Advantage: only (n-1) messages.
  • Drawback: still sending everything!

Why not TA in a distributed environment?

Advantages of the Threshold Algorithm:

  • The number of objects accessed is minimized.
  • Marian et al. show how to minimize random accesses [Marian, Bruno, Gravano, TODS'04].

Disadvantages:

  • Each object is accessed individually (random accesses).
  • A huge number of round trips (phases).
  • Unpredictable latency (the phases are sequential).
  • In-network aggregation is not possible.


The TPUT Algorithm

[Cao and Wang, PODC'04]

TPUT (Three-Phase Uniform Threshold) is a 3-round algorithm that improves query response time:

1) Fetch the K first entries from all n lists. Define the threshold τ as τ = (Kth highest partial score) / n. τ (the uniform threshold) is then disseminated to all nodes.
2) Each node sends every pair with a score at or above τ.
3) If we have found the complete score for fewer than K objects, we perform random accesses for all incomplete objects.
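The three phases can be sketched in Python over the same five sorted lists used throughout the talk (the `lookup` dicts stand in for the phase-3 random accesses; for brevity the sketch completes every candidate rather than only the incomplete ones):

```python
# Five lists sorted by local score; entries are (object id, score).
lists = {
    "v1": [(3, 99), (1, 66), (0, 63), (2, 48), (4, 44)],
    "v2": [(1, 91), (3, 90), (0, 61), (4, 7), (2, 1)],
    "v3": [(1, 92), (3, 75), (4, 70), (2, 16), (0, 1)],
    "v4": [(3, 74), (1, 56), (2, 56), (0, 28), (4, 19)],
    "v5": [(3, 67), (4, 67), (1, 58), (2, 54), (0, 35)],
}

def tput(lists, k):
    n, seen = len(lists), {}
    # Phase 1: every node ships its k locally highest entries.
    for name, lst in lists.items():
        for oid, score in lst[:k]:
            seen.setdefault(oid, {})[name] = score
    partial = {o: sum(s.values()) for o, s in seen.items()}
    tau = sorted(partial.values(), reverse=True)[k - 1] / n
    # Phase 2: every node ships each entry with local score >= tau.
    for name, lst in lists.items():
        for oid, score in lst:
            if score >= tau:
                seen.setdefault(oid, {})[name] = score
    # Phase 3: random accesses complete the candidates' scores.
    lookup = {name: dict(lst) for name, lst in lists.items()}
    totals = {o: sum(lookup[nm][o] for nm in lists) for o in seen}
    return sorted(totals.items(), key=lambda kv: -kv[1])[:k]

print(tput(lists, 1))   # [(3, 405)]
```

For K=1 this reproduces the example on the next slide: phase 1 yields partial scores o3 = 240 and o1 = 183, so τ = 240 / 5 = 48.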

The TPUT Algorithm: Example (Q: TOP-1)

The same five sorted lists as before.

Phase 1: fetch the top entry of each list: (3, 99), (1, 91), (1, 92), (3, 74), (3, 67).
Partial scores: o1 = 91 + 92 = 183, o3 = 99 + 74 + 67 = 240.
τ = (Kth highest partial score) / n = 240 / 5 = 48.
Phase 2: each node sends every entry with score at or above 48.
Computed exactly: o3 = 405, o1 = 363. Incompletely computed: o2' = 158, o4' = 137, o0' = 124.
Phase 3: have we computed K exact scores? Yes: [o3, o1], so the Top-1 is o3.

Drawback: the threshold is too coarse (uniform).


Optimality?

Fagin et al. (2003): TA is an instance optimal algorithm.
Cao et al. (PODC'04): no fixed-round algorithm can be instance optimal (TPUT, TJA, etc.).
Fixed rounds => constant communication overhead.

Improving The TPUT Algorithm

[Yu, Li, Wu, Agrawal, El Abbadi, DEXA'05]

In TPUT the threshold is uniform and too coarse. One approach is to use statistics to set different thresholds per attribute; this requires statistical information a priori.

TPAT

1) Fetch the K first entries from all n lists. Define the threshold τ as the Kth highest partial score.
2) Partition τ (the uniform threshold) based on the data distribution, and disseminate the per-attribute values to all nodes.
3) Each node sends every pair with a score above its per-attribute threshold.
4) If we have found the complete score for fewer than K objects, we perform random accesses for all incomplete objects.

Threshold Join Algorithm (TJA)

[Zeinalipour-Yazti, Vagena, Gunopulos, Kalogeraki, Tsotras, Vlachos, Koudas, Srivastava, DMSN'05]

TJA is a 3-round algorithm: it minimizes the number of transmitted objects, performs in-network aggregation, and optimizes the utilization of the communication channel.

1. LB Phase: ask each node to send its K (locally) highest-ranked results. The union of these results defines a threshold τ.
2. HJ Phase: ask each node to transmit everything above this threshold τ.
3. CL Phase: identify the complete score of all incompletely calculated objects.
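A flat (non-hierarchical) sketch of the three phases in Python, over the same five lists; it assumes every object appears in every list, and the `lookup` dicts stand in for the CL-phase random accesses:

```python
# Five lists sorted by local score; entries are (object id, score).
lists = {
    "v1": [(3, 99), (1, 66), (0, 63), (2, 48), (4, 44)],
    "v2": [(1, 91), (3, 90), (0, 61), (4, 7), (2, 1)],
    "v3": [(1, 92), (3, 75), (4, 70), (2, 16), (0, 1)],
    "v4": [(3, 74), (1, 56), (2, 56), (0, 28), (4, 19)],
    "v5": [(3, 67), (4, 67), (1, 58), (2, 54), (0, 35)],
}

def tja(lists, k):
    # LB phase: the union of every node's local top-k ids defines tau.
    tau = set()
    for lst in lists.values():
        tau |= {oid for oid, _ in lst[:k]}
    # HJ phase: each node sends everything ranked at or above its
    # lowest-ranked tau object.
    seen = {}
    for name, lst in lists.items():
        cut = max(i for i, (oid, _) in enumerate(lst) if oid in tau)
        for oid, score in lst[:cut + 1]:
            seen.setdefault(oid, {})[name] = score
    # CL phase: random accesses complete any still-partial candidate.
    lookup = {name: dict(lst) for name, lst in lists.items()}
    totals = {o: sum(lookup[nm][o] for nm in lists) for o in seen}
    return sorted(totals.items(), key=lambda kv: -kv[1])[:k]

print(tja(lists, 1))   # [(3, 405)]
```

For K=1 the LB phase yields τ = {o3, o1}, matching the example on the next slide; in the real algorithm the unions and joins happen in-network, at the intermediate nodes.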

Step 1 - LB (Lower Bound) Phase

  • Each node sends its top-k results to its parent.
  • Each intermediate node performs a union of all received object-id lists (the union at the root is denoted τ).

Example (Query: TOP-1, the same five lists): v1 contributes o3, v2 contributes o1, v3 contributes o1, v4 contributes o3, v5 contributes o3; the union gives τ = {o3, o1}.


Step 2 – HJ (Hierarchical Join) Phase

  • Disseminate τ to all nodes.
  • Each node sends back everything with a score at or above its lowest-ranked objectID in τ.
  • Before sending the objects, each node tags as incomplete any score that could not be computed exactly (an upper bound).

Example: the phase returns o3 = 405 (complete), o1 = 363 (complete), and o4' = 354 (incomplete upper bound).

Step 3 – CL (Cleanup) Phase

Have we found K objects with a complete score?
Yes: the answer has been found!
No: find the complete score of each incomplete object (all in a single batch phase).

  • CL ensures correctness.
  • This phase is rarely required in practice.

Example (TOP-5): o3 = 405, o1 = 363, o4 = 207, o0 = 188, o2 = 175.

TJA vs. TPUT

[Figure: bytes required for the distributed Top-K algorithms SJA, TPUT and TJA on a star topology (K=5, m=25K), for n = 20 to 100 nodes; log scale from 1000 to 1e+09 bytes.]

Approximate Distributed Algorithms

  • TA performs many communication rounds.
  • TPUT may retrieve a lot of data in Phase 2.
  • TPUT and TJA perform random accesses.

All these characteristics hurt performance!


Approximate algorithms: KLEE

[Michel, Triantafillou, Weikum, VLDB'05]

KLEE is an improvement on both counts: it focuses on approximate answers, uses information about the data distribution to reduce data transfers, and does not perform random accesses at each peer.

The KLEE Algorithm

KLEE is a 2- or 3-round algorithm:

1. Exploration step: finds an approximation of the min-k score threshold using histograms and Bloom filters.
2. Optimization step: decides if step 3 will be executed (no communication).
3. Candidate filtering: a docID is a good candidate if it is high-scored in many peers.
4. Candidate retrieval: get all good docID candidates.


Histogram Bloom Structure

  • Each node pre-computes, per attribute:

1. an equi-width histogram,
2. a Bloom filter for each histogram cell,
3. the average score per cell,
4. upper/lower scores per cell.

[Figure: a score histogram (#docs per score range) with a bit-vector Bloom filter attached to each cell. From the VLDB'05 KLEE presentation.]
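As a rough illustration, a per-cell Bloom filter can be sketched as a small bit vector; the sizes, hash scheme, and docIDs below are arbitrary choices for the sketch, not KLEE's actual parameters.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch for the docIDs in one histogram cell."""
    def __init__(self, bits=64, hashes=3):
        self.bits, self.hashes, self.v = bits, hashes, 0

    def _positions(self, item):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.v |= 1 << p

    def might_contain(self, item):
        # May return a false positive, but never a false negative.
        return all(self.v & (1 << p) for p in self._positions(item))

cell = BloomFilter()
for doc in ("d17", "d42", "d99"):   # docIDs falling into this score cell
    cell.add(doc)
assert cell.might_contain("d42")    # no false negatives
```

Shipping such filters lets a peer test whether a specific docID plausibly falls in a remote histogram cell without transferring the cell's contents.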

Top-K Algorithms that use Score Bounds

[Marian, Bruno, Gravano, TODS'04] [Zeinalipour-Yazti, Lin, Gunopulos, CIKM'06]

  • Suppose that each node can only return lower and upper bounds rather than exact scores.
  • e.g. instead of 16, it tells us that the similarity is in the range [11..19].

[Figure: moving-object trajectories over an n-by-m grid of cells with a query access point Q; each node vi stores METADATA triples (id, lb, ub) per cell, e.g. v1: A4,4,5; A2,5,6; A0,5,7 and v3: A2,3,6; A0,4,8; A4,5,10.]

LB-K Algorithm

[Zeinalipour-Yazti, Lin, Gunopulos, CIKM'06]

  • An iterative algorithm for finding the K highest-ranked DATA objects using lower bounds (METADATA objects).
  • Strategy: utilize the METADATA objects in order to decide which DATA objects have to be transferred.

LB-K: Example

Query: find the K=2 highest-ranked answers.

METADATA (id, lb): A4,10; A2,13; A0,15; A3,20; A9,22; A7,30; ...
DATA (id, exact score): A4,17; A2,18; A0,24; A3,22; A9,25; A7,33; ...

In each iteration the algorithm fetches the next batch of METADATA lower bounds with TJA (first K+1 entries, then 2K+1, ...) and compares the exact scores of the DATA objects retrieved so far against the remaining lower bounds, transferring incrementally until the top-K is certain.


UBLB-K Algorithm

  • Also an iterative algorithm, with the same objectives as LB-K.
  • Differences:

– It uses both a lower (LB) and an upper (UB) bound on the distributed DATA.
– It transfers the candidate DATA objects in a final (bulk) phase rather than incrementally.

UBLB-K: Example (K=2)

METADATA (id, lb, ub): A4,10,18; A2,13,19; A0,15,25; A3,20,27; A9,22,26; A7,30,35; ...
DATA (id, exact score): A4,17; A2,18; A0,24; A3,22; A9,25; A7,33; ...

Note: the Kth lowest UB is 19; therefore A3 (LB: 20) and everything below it are not necessary.
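The pruning rule can be checked in a few lines of Python; the triples are the slide's numbers, and in this example smaller scores rank higher, matching the slide's ordering.

```python
# METADATA triples (id, lower bound, upper bound) from the slide.
meta = [("A4", 10, 18), ("A2", 13, 19), ("A0", 15, 25),
        ("A3", 20, 27), ("A9", 22, 26), ("A7", 30, 35)]
K = 2

kth_lowest_ub = sorted(ub for _, _, ub in meta)[K - 1]
print(kth_lowest_ub)                                   # 19

# An object whose lower bound exceeds the Kth lowest upper bound
# cannot make the top-K, so it need not be transferred.
candidates = [oid for oid, lb, _ in meta if lb <= kth_lowest_ub]
print(candidates)                                      # ['A4', 'A2', 'A0']
```

Only the surviving candidates (A4, A2, A0) are fetched in the final bulk phase.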


Presentation Outline

  • Introduction to Top-K Query Processing
  • Centralized techniques

– Fagin's Algorithm
– Optimal Algorithms: TA (Threshold Algorithm)
– Restricted Access Models: TA-Sorted
– Probabilistic TA-Sorted
– Using previous query instantiations: LPTA

  • Distributed techniques

– Exact algorithms with fixed rounds of communication: TPUT, TJA, TPAT
– Approximate algorithms using data distribution information: KLEE
– Exact algorithms using upper/lower bounds: LBK

  • Online Algorithms for Monitoring Top-K results: BABOLS, TMA
  • Future Work

Online algorithms

[Mouratidis, Bakiras, Papadias, SIGMOD'06]

  • Top-K monitoring for stream data: the TMA/SMA algorithms.
  • Monitor multiple Top-K queries simultaneously.

– Efficiently identify the effect of changes: only some Top-K results change.

[Figure: two query regions Q1 and Q2 over points in the unit square of (X1, X2).]


Monitoring Top-K Results in a Distributed Setting

[Babcock, Olston, SIGMOD'03]

  • The setting: changes come over time.
  • Use any efficient algorithm for finding the top-K.
  • Monitor changes: you have to recompute only if large changes happen.

– Approximate algorithm: the top-K results are correct within ε.
– We need an algorithm that decides how big a change we can tolerate per node:

Query: 2X1 + X2. Tuples: t1 = (3, 9), t2 = (5, 1).
The slack per attribute is ((Score(t1) - Score(t2)) + ε) / 2 = 2 + ε/2.
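The slack arithmetic from the slide, checked in Python (ε is set to 1.0 just to get a concrete number):

```python
def score(t):
    # Query scoring function: f(t) = 2*X1 + X2.
    return 2 * t[0] + t[1]

t1, t2 = (3, 9), (5, 1)   # Score(t1) = 15, Score(t2) = 11
eps = 1.0

# Split the score gap (plus the tolerated error) across the two attributes:
# a node need not report until a local value drifts by more than this.
slack = ((score(t1) - score(t2)) + eps) / 2
print(slack)              # 2.5, i.e. 2 + eps/2
```

As long as each attribute changes by less than the slack, the ordering of t1 and t2 under the query cannot flip by more than ε, so no recomputation is triggered.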

Conclusions

  • Top-K Query Processing is an area with many applications in practical problems, and many challenges and opportunities!
  • Privacy issues
  • Approximate algorithms
  • Online algorithms
  • Modeling and exploiting correlations

References

– Amato G., Rabitti F., Savino P. and Zezula P., "Region proximity in metric spaces and its use for approximate similarity search", In TOIS, 2003.
– Babcock B. and Olston C., "Distributed Top-K Monitoring", In Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, CA, USA, Pages 28-39, 2003.
– Balke W.-T., Nejdl W., Siberski W., Thaden U., "Progressive Distributed Top-K Retrieval in Peer-to-Peer Networks", In Proceedings of the 21st International Conference on Data Engineering, April 5-8, Tokyo, Japan, 2005.
– Banerjee A., Mitra A., Najjar W., Zeinalipour-Yazti D., Kalogeraki V. and Gunopulos D., "RISE Co-S: High Performance Sensor Storage and Co-Processing Architecture", Second Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON'2005), Santa Clara, California, USA, 2005.
– Bloom B. H., "Space/Time Trade-Offs in Hash Coding with Allowable Errors", Communications of the ACM, 13(7):422-426, 1970.
– Bruno N., Gravano L. and Marian A., "Evaluating Top-K Queries Over Web-Accessible Databases", In Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA, Page 369, 2002.
– Cao P. and Wang Z., "Efficient Top-K Query Calculation in Distributed Networks", In Proceedings of the Twenty-Third Annual ACM Symposium on Principles of Distributed Computing, St. John's, Newfoundland, Canada, Pages 206-215, 2004.
– Chun B.N., Culler D.E., Roscoe T., Bavier A.C., Peterson L.L., Wawrzoniak M., Bowman M., "PlanetLab: an overlay testbed for broad-coverage services", Computer Communication Review, Volume 33, Issue 3, Pages 3-12, 2003.
– Claffy K., Tracie E., McRobb D., "Internet tomography", 1999.
– Considine J., Li F., Kollios G., and Byers J., "Approximate Aggregation Techniques for Sensor Databases", In Proceedings of the 20th International Conference on Data Engineering, Boston, MA, USA, Page 449, 2004.
– Deligiannakis A., Kotidis Y., Roussopoulos N., "Hierarchical in-Network Data Aggregation with Quality Guarantees", In 9th International Conference on Extending Database Technology, Heraklion, Greece, March 14-18, Pages 658-675, 2004.
– Donjerkovic D. and Ramakrishnan R., "Probabilistic Optimization of Top-N Queries", In Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, UK, Pages 411-422, 1999.
– Fagin R., "Combining Fuzzy Information from Multiple Systems", In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Montreal, Quebec, Canada, Pages 216-226, 1996.
– Fagin R., "Fuzzy Queries In Multimedia Database Systems", In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Seattle, WA, USA, Pages 1-10, 1998.
– Fagin R., Lotem A. and Naor M., "Optimal Aggregation Algorithms For Middleware", In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Santa Barbara, CA, USA, Pages 102-113, 2001.
– Gravano L. and Chaudhuri S., "Evaluating Top-K Selection Queries", In Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, UK, Pages 397-410, 1999.
– Guntzer U., Balke W., Kiessling W., "Optimizing Multi-Feature Queries for Image Databases", In VLDB 2000.
– Hansen T., Otero J., McGregor A., Braun H-W., "Active measurement data analysis techniques", In Proceedings of the International Conference on Communications in Computing, Las Vegas, Nevada, p. 105, 2000.
– Ilyas I.F., Aref W.G. and Elmagarmid A.K., "Supporting Top-k Join Queries in Relational Databases", In The VLDB Journal, Vol. 13, Iss. 3, Pages 207-221, 2003.
– Kalnis P., Ng W-S., Ooi B-C., Tan K-L., "Answering similarity queries in peer-to-peer networks", In Proceedings of the 14th International World Wide Web Conference, Pages 482-483, New York City, NY, USA, 2004.
– Kiessling W., "Foundations of Preferences in Database Systems", In Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, Pages 311-322, 2002.
– Lv Q., Cao P., Cohen E., Lai K., Shenker S., "Search and Replication in Unstructured Peer-to-Peer Networks", In Proceedings of the 16th International Conference on Supercomputing, New York, NY, USA, Pages 84-95, 2002.
– Madden S.R., Franklin M.J., Hellerstein J.M., Hong W., "TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks", In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, Boston, MA, Pages 131-146, 2002.
– Madden S.R., Franklin M.J., Hellerstein J.M., Hong W., "The Design of an Acquisitional Query Processor for Sensor Networks", In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, USA, Pages 491-502, 2003.
– Marian A., Gravano L., Bruno N., "Evaluating Top-k Queries over Web-Accessible Databases", In TODS 2004.
– Michel S., Triantafillou P., Weikum G., "KLEE: A Framework for Distributed Top-K Query Algorithms", In 31st Conference on Very Large Data Bases, Trondheim, Norway, 2005.
– Nejdl W., Siberski W., Thaden U. and Balke W., "Top-k Query Evaluation for Schema-Based Peer-to-Peer Networks", In ISWC 2004.
– Nepal S., Ramakrishna M. V., "Query Processing Issues in Image (Multimedia) Databases", In ICDE 1999.
– Szewczyk R., Osterweil E., Polastre J., Hamilton M., Mainwaring A.M., Estrin D., "Habitat monitoring with sensor networks", Communications of the ACM, 47(6):34-40, 2004.
– Theobald M., Schenkel R., Weikum G., "Top-k Query Evaluation with Probabilistic Guarantees", In VLDB 2004.
– Tsoumakos D. and Roussopoulos N., "Adaptive Probabilistic Search for Peer-to-Peer Networks", In Proceedings of the Third International Conference on Peer-to-Peer Computing, Linkoping, Sweden, Pages 102-110, 2003.
– Xiong L., Chitti S., Liu L., "Top-k Queries across Multiple Private Databases", In ICDCS 2005.
– Yang B. and Garcia-Molina H., "Efficient Search in Peer-to-Peer Networks", In Proceedings of the 22nd International Conference on Distributed Computing Systems, Vienna, Austria, Pages 5-14, 2002.
– Zeinalipour-Yazti D., Vagena Z., Gunopulos D., Kalogeraki V., Tsotras V., Vlachos M., Koudas N., Srivastava D., "The Threshold Join Algorithm for Top-K Queries in Distributed Sensor Networks", In Proceedings of the 2nd International Workshop on Data Management for Sensor Networks, collocated with VLDB 2005, Trondheim, Norway, 2005.
– Zeinalipour-Yazti D., Lin S., Kalogeraki V., Gunopulos D., Najjar W., "MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices", In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST'2005), San Francisco, CA, December 14-16, Pages 31-44, 2005.