

slide-1
SLIDE 1

Query Processing, Resource Management, And Approximation in a Data Stream Management System

Kevin Hoeschele Archana Joshi

slide-2
SLIDE 2

References

  • R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein and R. Varma. Query Processing, Resource Management, and Approximation in a Data Stream Management System. In Proceedings of the 2003 CIDR Conference.
  • B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and Issues in Data Stream Systems. Invited paper in Proc. of the 2002 ACM Symp. on Principles of Database Systems (PODS 2002), June 2002.
  • A. Arasu, S. Babu and J. Widom. An Abstract Semantics and Concrete Language for Continuous Queries over Streams and Relations. Technical Report, November 2002.

slide-3
SLIDE 3

Important features of the STREAM system

  • A data stream management system (DSMS)
  • A declarative query language called CQL for continuous queries
  • Queries handle both continuous data streams and relations
  • Designed for changing, high data flow rates and query workloads
  • Good resource allocation
  • Approximation and resource management

Next - CQL

slide-4
SLIDE 4

CQL (Continuous Query Language)

  • Extension of SQL with support for sliding windows and sampling for approximation
  • Supports both data streams and relations
  • Supports additional operators such as Istream and Dstream

slide-5
SLIDE 5

Data Streams                                                            | Relations
Have arrival order and are unbounded                                    | Unordered and finite
Append only                                                             | Updates, deletions and insertions
Streams result from a continuous source and also from subqueries        | Relations are stored and also result from subqueries

slide-6
SLIDE 6

Formal CQL Semantics

  • Based on existing, well-understood semantics
  • Additional transformations between relations and streams
  • Assumes a global, discrete, ordered time domain (will discuss later)
  • Relations
      • A relation R maps each time T to the set of tuples in R at that time
  • Streams
      • A stream is a set of (tuple, timestamp) elements
      • Stream at time T = all elements with timestamp <= T

slide-7
SLIDE 7

Sample query 1

Consider a stream of telephone call records, Calls, with attributes cust_id, type, minutes and timestamp. Compute the average call length, considering only the last day's long distance calls placed by each customer:

SELECT AVG(S.minutes)
FROM Calls S [PARTITION BY S.cust_id
              Range 1 Day Preceding
              WHERE S.type = 'Long distance']

slide-8
SLIDE 8

Sample query 2

Extract a 10% sample of the calls placed by 'Gold' customers, then stream the average call length for customers whose cust_id is in the range 1 to 1000:

SELECT AVG(V.minutes)
FROM (SELECT S.cust_id, S.minutes
      FROM Calls S, Customer R
      WHERE S.cust_id = R.cust_id AND R.tier = 'Gold') V Sample (10)
WHERE V.cust_id BETWEEN 1 AND 1000

Here we are joining a stream with a relation.

slide-9
SLIDE 9

Transformation

  • Streams mapped to relations
      • A stream with a window specification (Rows, Range, Partition By), taken up to a specific time T, is a finite set and is treated as a relation
  • Relations mapped to streams
      • Istream(R) contains a stream element (s, T) whenever tuple s is in R at time T but not in R at time T-1
      • Dstream(R) contains a stream element (s, T) whenever tuple s is in R at time T-1 but not in R at time T
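
The two mappings above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the STREAM implementation: time is taken to be integer-valued, a stream is a list of (tuple, timestamp) pairs, and the names `range_window`, `istream`, `dstream` and the sample `calls` data are hypothetical.

```python
def range_window(stream, now, width):
    """Stream -> relation: tuples whose timestamp lies in (now - width, now]."""
    return {tup for tup, ts in stream if now - width < ts <= now}


def istream(relation_at, times):
    """Relation -> stream: emit (s, T) when s is in R at T but not at T-1."""
    return [(s, t) for t in times
            for s in sorted(relation_at(t) - relation_at(t - 1))]


def dstream(relation_at, times):
    """Relation -> stream: emit (s, T) when s is in R at T-1 but not at T."""
    return [(s, t) for t in times
            for s in sorted(relation_at(t - 1) - relation_at(t))]


# Hypothetical stream: three elements arriving at times 1, 2 and 4.
calls = [("a", 1), ("b", 2), ("c", 4)]


def window(t):
    # The windowed stream viewed as a time-varying relation (width 2).
    return range_window(calls, t, width=2)


inserted = istream(window, range(1, 6))  # tuples entering the window
deleted = dstream(window, range(1, 6))   # tuples leaving the window
```

With a window of width 2, each element enters the relation at its own timestamp and leaves it two time units later, unless the observed time range ends first.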

slide-10
SLIDE 10

Timestamps

  • All relation updates are also timestamped according to the global clock
  • Can also handle application-defined time
  • Stream elements arrive in order and are timestamped according to a global clock

Next - query plans

slide-11
SLIDE 11

Query Plans

  • A query plan runs continuously and consists of three components:
      • 1. Query operators - read a stream of tuples, process them and write the results to an output queue.
      • 2. Inter-operator queues - connect different operators and define the paths along which tuples flow, as in a DBMS.
      • 3. Synopses - maintain the state associated with operators.
slide-12
SLIDE 12

Synopsis

  • Summarizes the tuples seen so far at some intermediate operator
  • Example: one synopsis is maintained for each input of a join operator
  • Needs some kind of summarization technique to limit its size
  • Synopses are tied to operators
  • Generic interfaces for both, allowing any synopsis type to be coupled with any operator type
  • Supports generic methods: create, changeMem, insert, delete
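
A generic synopsis interface of the kind listed above might look like the following sketch. The slide only names the methods (create, changeMem, insert, delete); the signatures, the `Synopsis` and `SlidingWindowSynopsis` class names, and the eviction policy (drop the oldest tuples when memory shrinks) are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from collections import deque


class Synopsis(ABC):
    """Hypothetical generic interface; 'create' is the constructor."""

    @abstractmethod
    def insert(self, tup): ...

    @abstractmethod
    def delete(self, tup): ...

    @abstractmethod
    def change_mem(self, capacity): ...


class SlidingWindowSynopsis(Synopsis):
    """Keeps at most `capacity` of the most recent tuples (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tuples = deque()

    def insert(self, tup):
        self.tuples.append(tup)
        self._evict()

    def delete(self, tup):
        self.tuples.remove(tup)  # raises ValueError if tup is absent

    def change_mem(self, capacity):
        # Shrinking the memory budget evicts the oldest tuples immediately.
        self.capacity = capacity
        self._evict()

    def _evict(self):
        while len(self.tuples) > self.capacity:
            self.tuples.popleft()
```

Because operators would talk only to the abstract interface, a different summarization technique could be swapped in without changing the operator.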

slide-13
SLIDE 13

Resource sharing in Query Plans

  • Different queries with the same operation (same input and same operator) can share it to reduce redundancy
  • Inter-operator queues after shared operations hold a pointer for each query
  • Data is deleted after every pointer has passed it
  • Not useful when operations have vastly different consumption rates
      • creates large queues
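
As a rough sketch of the shared-queue idea, assuming one read pointer per consuming query and deletion once every pointer has passed a tuple (the `SharedQueue` class and its methods are hypothetical names, not STREAM's):

```python
class SharedQueue:
    """One queue read by several queries, each with its own pointer."""

    def __init__(self, readers):
        self.items = []                       # tuples not yet read by everyone
        self.base = 0                         # absolute index of items[0]
        self.pos = {r: 0 for r in readers}    # absolute read pointer per query

    def put(self, tup):
        self.items.append(tup)

    def get(self, reader):
        idx = self.pos[reader] - self.base
        if idx >= len(self.items):
            return None                       # this reader has caught up
        tup = self.items[idx]
        self.pos[reader] += 1
        self._trim()
        return tup

    def _trim(self):
        # Delete every tuple that all pointers have already passed.
        slowest = min(self.pos.values())
        del self.items[: slowest - self.base]
        self.base = slowest

    def __len__(self):
        return len(self.items)
```

The queue can only shrink as fast as its slowest reader advances, which is why sharing pays off poorly when consumption rates differ widely.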
slide-14
SLIDE 14

Resource Management

  • A number of relevant resources: memory, computation, I/O
  • Will focus on memory management
  • Two techniques
      • Algorithm for incorporating known constraints on input data streams to reduce synopsis size
      • Algorithm for query scheduling that minimizes queue size
slide-15
SLIDE 15

Constraints

  • Set by collecting statistics on the data; related to punctuation
  • Adherence parameter
      • measures how closely a stream fits a constraint
      • the closer the stream is to the constraint, the smaller the synopsis size
  • No precision loss.
slide-16
SLIDE 16

Stream constraint example

  • Consider a continuous query that joins streams Orders (O) and Fulfillments (F) based on orderID
  • Ordered (k): if we know that k tuples for a given orderID arrive on O before arriving on F, then a join synopsis on F doesn't need to be kept for those k tuples

[Figure: Stream O and Stream F feed a Join with an O synopsis and an F synopsis, producing the joined output; labeled "K tuples"]

slide-17
SLIDE 17

Global Scheduler

  • Says which queries are run, and when
  • Ways to create weights for scheduling queries
      • Response time
      • Throughput
      • Queue size (chosen by STREAM)
  • Greedy schedule
      • the next operator chosen is the one that will consume the most tuples per time unit
  • Scheduling chains, like Aurora's train scheduling
slide-18
SLIDE 18

Query Scheduling Example

[Figure: operators O1 and O2 with queues Q1 and Q2, and a table of Q1 and Q2 sizes under Strategy 1 and Strategy 2 at times 1 to 7. Queue size in increments of N over time.]

  • Operator 1 - 20% selectivity
      • takes in N tuples per time segment
  • Operator 2 - takes in N/5 tuples per time segment
  • Strategy 1: each window of N tuples goes completely through
  • Strategy 2 - Greedy: if Q1 has at least N tuples, will always do that first
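
The two strategies can be compared with a small simulation. This is a sketch under assumptions that are not on the slide: one operator runs per time unit, Operator 2's output leaves the system, and a burst of three batches of N tuples arrives at times 0, 1 and 2. Queue sizes are counted in units of N/5 tuples so the arithmetic stays integral; `simulate` and its arguments are hypothetical names.

```python
N = 5  # one batch of N tuples, measured in units of N/5 tuples


def simulate(strategy, arrivals):
    """Return the total queue size (Q1 + Q2) at the end of each time unit."""
    q1 = q2 = 0
    history = []
    for batch in arrivals:
        q1 += batch
        if strategy == "fifo":
            # Strategy 1: finish the batch already in Q2 before starting a new one.
            run_o2 = q2 > 0
        else:
            # Strategy 2 (greedy): prefer Operator 1 whenever Q1 holds a full batch.
            run_o2 = q1 < N and q2 > 0
        if run_o2:
            q2 -= 1          # Operator 2 consumes N/5 tuples per time unit
        elif q1 >= N:
            q1 -= N          # Operator 1 consumes N tuples ...
            q2 += 1          # ... and passes 20% of them on to Q2
        history.append(q1 + q2)
    return history


burst = [N, N, N, 0, 0, 0]           # assumed arrival pattern
fifo = simulate("fifo", burst)       # [1, 5, 6, 5, 1, 0]
greedy = simulate("greedy", burst)   # [1, 2, 3, 2, 1, 0]
```

Under this assumed burst, the greedy strategy peaks at 3 units (0.6 N) of queued tuples while Strategy 1 peaks at 6 units (1.2 N), because running the highly selective Operator 1 first shrinks the data sooner.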

slide-19
SLIDE 19

Approximations

  • Static vs dynamic approximation
      • In static approximation, a certain query behavior is guaranteed, and the user can participate
      • In dynamic approximation, the level of approximation changes, adapting to current resource availability
  • Approximation techniques:
      • Window size reduction - reduces synopsis size and the time needed for window joins
      • Sampling - dropping output data at a certain percentage
      • Load shedding - similar to sampling, but drops chunks of tuples at a time; reduces queue size
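
The difference between sampling and load shedding can be illustrated with a minimal sketch. The function names, the choice to drop the oldest tuples first, and the fixed chunk size are assumptions for illustration only; the slide does not specify them.

```python
import random


def sample(tuples, percent, rng):
    """Sampling: keep each tuple independently with probability percent/100."""
    return [t for t in tuples if rng.random() * 100 < percent]


def shed_load(queue, capacity, chunk):
    """Load shedding: drop the oldest `chunk` tuples at a time until the
    queue fits within `capacity` (assumed policy)."""
    queue = list(queue)
    while len(queue) > capacity:
        del queue[:chunk]
    return queue


rng = random.Random(0)
kept_all = sample(range(10), 100, rng)               # a 100% sample keeps everything
kept_none = sample(range(10), 0, rng)                # a 0% sample keeps nothing
trimmed = shed_load(range(10), capacity=6, chunk=3)  # two chunks of 3 dropped
```

Sampling thins the data tuple by tuple, while load shedding removes whole chunks and can therefore overshoot the target, as in the example where a queue of 10 is cut to 4 rather than 6.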

slide-20
SLIDE 20

Future Resource management

  • Able to monitor queue and synopsis sizes, and react when a reduction is needed
  • Reallocation algorithm to deal with changes in the rate and distribution of incoming data
  • Able to dynamically add, delete, activate and deactivate queries

slide-21
SLIDE 21

Summary

  • The STREAM system supports a declarative query language for operations on streams and relations
  • It supports high data rates and varying workloads
  • A near-optimal scheduling algorithm for reducing inter-operator queue sizes
  • A set of techniques for static and dynamic approximation to cope with limited resources

slide-22
SLIDE 22

Discussion (CACQ and STREAM Questions)

  • What are some of the differences between Aurora and STREAM?
      • CQL vs Aurora
      • Query plans
      • Windowing techniques
      • Goals: total throughput (Aurora) vs minimizing queue size (STREAM). Which is better?
  • Why focus on memory for resource management?
  • What are the disadvantages of using eddies?
  • What assumptions are made that make the sharing of operators effective?
  • How does CACQ's use of eddies differ from their use in Telegraph, and what are the pros and cons of this approach?

slide-23
SLIDE 23

PSoup Questions

PSoup Question 1

PSoup is currently implemented in main memory. This gives the system great speed, but limits the amount of data that can be stored for purposes of windowing. What alternatives could increase the amount of storage without significantly hurting performance?

PSoup Question 2

The creators of PSoup posit that their system can be applied to Data Recharging scenarios. Is this really plausible? Consider:

  • PSoup's main memory limitations
  • PSoup's applicability to the 'Net
  • Data-recharging utility functions