
Query Processing, Resource Management, and Approximation in a Data Stream Management System
Kevin Hoeschele, Archana Joshi

References
• R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein and R. Varma. Query Processing, Resource Management, and Approximation in a Data Stream Management System. In Proceedings of the 2003 CIDR Conference.
• B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and Issues in Data Stream Systems. Invited paper in Proc. of the 2002 ACM Symp. on Principles of Database Systems (PODS 2002), June 2002.
• A. Arasu, S. Babu and J. Widom. An Abstract Semantics and Concrete Language for Continuous Queries over Streams and Relations. Technical Report, November 2002.

Important features of the STREAM system
• A data stream management system (DSMS)
• A declarative query language, CQL, for continuous queries
• Queries handle both continuous data streams and relations
• Designed for high and changing data flow rates and query workloads
  - Good resource allocation
  - Approximation and resource management
Next: CQL

CQL (Continuous Query Language)
• Extension of SQL with support for sliding windows and sampling for approximation
• Supports both data streams and relations
• Supports additional operators such as Istream and Dstream

Relations                                  | Data Streams
Unordered and finite                       | Have arrival order and are unbounded
Updates, deletions and insertions          | Append only
Are stored, and also result from subqueries | Result from a continuous source, and also from subqueries

Formal CQL Semantics
• Based on existing, well-understood semantics
• Additional transformations between relations and streams
• Assumes a global, discrete, ordered time domain (discussed later)
• Relation R: maps each time T to the set of tuples in R at that time
• Stream: a set of (tuple, timestamp) elements
  - Stream at time T = all elements with timestamp <= T
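The stream definition above can be sketched in a few lines of Python. This is only an illustration of the semantics, not STREAM code: the `Element` type, the `stream_at` helper, and the sample call records are all hypothetical names and data.

```python
from typing import NamedTuple


class Element(NamedTuple):
    """One stream element: a tuple plus its timestamp (illustrative names)."""
    tup: tuple
    timestamp: int


def stream_at(stream, t):
    """Return the finite part of the stream visible at time t."""
    return {e for e in stream if e.timestamp <= t}


# Hypothetical call records: (cust_id, minutes) with integer timestamps.
calls = {
    Element(("cust1", 12), 1),
    Element(("cust2", 3), 2),
    Element(("cust1", 7), 4),
}

visible_at_2 = stream_at(calls, 2)  # only the elements stamped 1 and 2
```

Although the stream itself is unbounded, the set visible at any fixed time T is finite, which is what lets windows turn it into a relation.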

Sample query 1
Consider a stream of telephone call records, Calls, with attributes cust_id, type, minutes, timestamp.
Compute the average call length, considering only the last day's long-distance calls placed by each customer:
  SELECT AVG(S.minutes)
  FROM Calls S [PARTITION BY S.cust_id
                RANGE 1 DAY PRECEDING
                WHERE S.type = 'Long distance']

Sample query 2
Extract a 10% sample of the calls placed by 'Gold' customers, then stream the average call length where cust_id is in the range 1 to 1000:
  SELECT AVG(V.minutes)
  FROM (SELECT S.cust_id, S.minutes
        FROM Calls S, Customer R
        WHERE S.cust_id = R.cust_id
          AND R.tier = 'Gold') V Sample(10)
  WHERE V.cust_id BETWEEN 1 AND 1000
Here a stream (Calls) is joined with a relation (Customer).

Transformation
• Stream mapped to relation
  - A stream with a window specification (Rows, Range, Partition By), taken up to a specific time T, is a finite set and is treated as a relation
• Relation mapped to stream
  - Istream(R) contains a stream element (s, T) whenever tuple s is in R at time T but not in R at time T-1
  - Dstream(R) contains a stream element (s, T) whenever tuple s is in R at time T-1 but not in R at time T
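A minimal Python sketch of both directions, assuming integer timestamps and a time-based (Range) window. The function names `range_window`, `istream`, and `dstream` are chosen here for illustration and are not taken from the STREAM implementation.

```python
def range_window(stream, t, width):
    """Stream -> relation: tuples whose timestamps fall in (t - width, t]."""
    return {s for (s, ts) in stream if t - width < ts <= t}


def istream(relation_at, times):
    """Relation -> stream: emit (s, t) when s is in R(t) but not in R(t - 1)."""
    return [(s, t) for t in times
            for s in sorted(relation_at(t) - relation_at(t - 1))]


def dstream(relation_at, times):
    """Relation -> stream: emit (s, t) when s is in R(t - 1) but not in R(t)."""
    return [(s, t) for t in times
            for s in sorted(relation_at(t - 1) - relation_at(t))]


# Hypothetical stream of (tuple, timestamp) elements.
stream = [("a", 1), ("b", 2), ("c", 3)]


def relation_at(t):
    """The relation produced by a window of width 2 over the stream at time t."""
    return range_window(stream, t, width=2)


inserted = istream(relation_at, range(1, 6))
deleted = dstream(relation_at, range(1, 6))
```

With a window of width 2, each tuple enters the relation at its own timestamp and leaves two time units later, so `inserted` replays the original stream and `deleted` is the same sequence shifted by the window width.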

Timestamps
• Stream elements arrive in order and are timestamped according to a global clock
• All relation updates are also timestamped according to the global clock
• Can also handle application-defined time
Next: query plans

Query Plans
• A query plan runs continuously and is built from three kinds of components
  1. Query operators: read a stream of tuples, process them, and write the results to an output queue
  2. Inter-operator queues: connect the operators and define the paths along which tuples flow, as in a DBMS
  3. Synopses: maintain the state associated with operators

Synopsis
• Summarizes the tuples seen so far at some intermediate operator
• One synopsis is maintained for each join operator
• Needs some kind of summarization technique to limit its size
• Synopses are tied to operators
• Generic interfaces for both allow any synopsis type to be coupled with any operator type
• Supports generic methods: create, changeMem, insert, delete
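The generic interface listed above might look roughly like the following Python sketch. The class name `WindowSynopsis`, the tuple-count memory budget, and the evict-oldest policy are assumptions made for illustration; the slides name only the four methods.

```python
from collections import deque


class WindowSynopsis:
    """Hypothetical synopsis that keeps at most max_tuples recent tuples."""

    def __init__(self, max_tuples):
        # Plays the role of "create": start with an initial memory budget.
        self.max_tuples = max_tuples
        self.tuples = deque()

    def change_mem(self, max_tuples):
        """changeMem: grow or shrink the budget, evicting if it shrank."""
        self.max_tuples = max_tuples
        self._evict()

    def insert(self, tup):
        """Add a tuple, evicting the oldest ones if over budget."""
        self.tuples.append(tup)
        self._evict()

    def delete(self, tup):
        """Remove one occurrence of a tuple; raises ValueError if absent."""
        self.tuples.remove(tup)

    def _evict(self):
        # Assumed summarization policy: drop the oldest tuples first.
        while len(self.tuples) > self.max_tuples:
            self.tuples.popleft()


synopsis = WindowSynopsis(max_tuples=3)
for value in (1, 2, 3, 4):
    synopsis.insert(value)
```

Because operators see only these methods, a different summarization technique could replace the evict-oldest policy without changing the operator code.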

Resource sharing in Query Plans
• Different queries that contain the same operation (same input and same operator) can share it to reduce redundancy
• The inter-operator queue after a shared operation keeps a pointer for each query
• A tuple is deleted once every pointer has passed it
• Not useful when the consuming operations have vastly different consumption rates, because this creates large queues
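A small Python sketch of a shared queue with one read pointer per query, under the assumption that tuples are reclaimed as soon as the slowest reader has passed them. The `SharedQueue` class and its method names are hypothetical.

```python
class SharedQueue:
    """Hypothetical queue shared by several queries, one pointer per reader."""

    def __init__(self, readers):
        self.buffer = []                    # tuples some reader still needs
        self.base = 0                       # absolute index of buffer[0]
        self.pos = {r: 0 for r in readers}  # absolute read position per reader

    def push(self, tup):
        self.buffer.append(tup)

    def pop(self, reader):
        """Return the next unread tuple for this reader, or None if caught up."""
        index = self.pos[reader] - self.base
        if index >= len(self.buffer):
            return None
        tup = self.buffer[index]
        self.pos[reader] += 1
        self._collect()
        return tup

    def _collect(self):
        # Delete every tuple that all readers have already passed.
        slowest = min(self.pos.values())
        if slowest > self.base:
            del self.buffer[:slowest - self.base]
            self.base = slowest

    def __len__(self):
        return len(self.buffer)


queue = SharedQueue(["query1", "query2"])
for tup in ("a", "b", "c"):
    queue.push(tup)
queue.pop("query1")
queue.pop("query1")
size_while_query2_lags = len(queue)  # nothing reclaimed: query2 has read nothing
```

The last line shows the drawback from the final bullet: the buffer can only shrink as fast as the slowest reader, so one slow query keeps the shared queue large.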

Resource Management
• Several relevant resources: memory, computation, I/O
• Will focus on memory management
• Two techniques
  - An algorithm that incorporates known constraints on input data streams to reduce synopsis size
  - An algorithm for query scheduling that minimizes queue size

Constraints
• Set by collecting statistics on the data; related to punctuation
• Adherence parameter: measures how closely a stream fits a constraint
  - The more closely the stream fits the constraint, the smaller the synopsis
• No precision loss

Stream constraint example
• Consider a continuous query that joins streams Orders (O) and Fulfillments (F) on orderID
  - Ordered(k): if we know that k tuples for a given orderID arrive on O before arriving on F, then a join synopsis on F does not need to be kept for those k tuples
[Figure: join of Stream O and Stream F, with labels "K tuples", "O synopsis", "F synopsis", and "Joined output"]

Global Scheduler
• Decides which queries are run, and when
• Ways to create weights for scheduling queries
  - Response time
  - Throughput
  - Queue size (chosen by STREAM)
• Greedy schedule: the next operator chosen is the one that will consume the most tuples per time unit
• Scheduling chains, like Aurora's train scheduling

Query Scheduling Example
[Figure: query plan with queues Q1, Q2 and operators O1, O2]
• Operator 1: 20% selectivity, takes in N tuples per time segment
• Operator 2: takes in N/5 tuples per time segment
• Strategy 1: each window of N tuples goes completely through
• Strategy 2 (greedy): if Q1 has at least N tuples, will always do that first
Queue size in increments of N over time:
Time             1    2    3    4    5    6    7
Strategy 1  Q1   1    1    2    2    3    3    4
            Q2   0    0.2  0    0.2  0    0.2  0
Strategy 2  Q1   1    1    1    1    1    1    1
            Q2   0    0.2  0.4  0.6  0.8  1    1.2
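The queue sizes in this example can be reproduced with a short simulation. The sketch below is a simplified model, not the STREAM scheduler: it assumes Q1 feeds Operator 1 and Q2 feeds Operator 2, that N tuples arrive at every time step, that exactly one operator runs per step, and that queue sizes are sampled right after each arrival. The names `simulate`, `run_to_completion`, and `greedy` are hypothetical.

```python
def simulate(run_operator_1, steps=7, n=100):
    """Return (q1, q2) sizes in tuples, sampled right after each arrival."""
    q1 = q2 = 0
    history = []
    for _ in range(steps):
        q1 += n                        # a batch of n tuples arrives
        history.append((q1, q2))
        if run_operator_1(q1, q2, n):
            q1 -= n                    # Operator 1 consumes n tuples ...
            q2 += n // 5               # ... and passes on 20% of them
        else:
            q2 -= min(q2, n // 5)      # Operator 2 consumes up to n/5 tuples
    return history


def run_to_completion(q1, q2, n):
    """Strategy 1: finish the batch waiting in Q2 before starting a new one."""
    return q2 == 0


def greedy(q1, q2, n):
    """Strategy 2: run Operator 1 whenever Q1 holds at least n tuples."""
    return q1 >= n


strategy_1 = simulate(run_to_completion)
strategy_2 = simulate(greedy)
totals_1 = [q1 + q2 for q1, q2 in strategy_1]
totals_2 = [q1 + q2 for q1, q2 in strategy_2]
```

With `n = 100`, dividing each value by 100 gives the table entries in units of N. Under this model the greedy strategy holds 220 tuples in total at time 7 against 400 for Strategy 1, because Operator 1 removes 80% of every batch it touches.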

Approximations
• Static vs. dynamic approximation
  - Static: a certain query behavior is guaranteed; the user can participate
  - Dynamic: the level of approximation changes, adapting to current resource availability
• Approximation techniques
  - Window size reduction: reduces synopsis size and the time needed for window joins
  - Sampling: drops output data at a certain percentage
  - Load shedding: similar to sampling, but drops chunks of tuples at a time; reduces queue size
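The difference between sampling and load shedding can be sketched as follows. Both functions are illustrative assumptions: the slides do not say which tuples load shedding drops, so this version drops the oldest chunk whenever the queue exceeds a capacity, and the function and parameter names are made up.

```python
import random


def sample(stream, keep_fraction, rng):
    """Sampling: keep each tuple independently with probability keep_fraction."""
    return [tup for tup in stream if rng.random() < keep_fraction]


def shed_load(queue, capacity, chunk):
    """Load shedding: while over capacity, drop a chunk of the oldest tuples."""
    queue = list(queue)
    while len(queue) > capacity:
        del queue[:chunk]
    return queue


rng = random.Random(0)                            # seeded for repeatability
sampled = sample(range(1000), 0.10, rng)          # roughly a 10% sample
shed = shed_load(range(10), capacity=6, chunk=3)  # drops tuples 0-2, then 3-5
```

Sampling thins the data evenly, tuple by tuple, while load shedding removes contiguous runs, which frees queue space quickly when the system is overloaded.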

Future Resource management
• Able to monitor queue and synopsis sizes, and react when a reduction is needed
• Reallocation algorithm to deal with changes in the rate and distribution of incoming data
• Able to dynamically add, delete, activate and deactivate queries

Summary
• The STREAM system supports a declarative query language for operations on streams and relations
• It supports high data rates and varying workloads
• A near-optimal scheduling algorithm reduces inter-operator queue sizes
• A set of techniques for static and dynamic approximation copes with limited resources

Discussion (CACQ and STREAM Questions)
• What are some of the differences between Aurora and STREAM?
  - CQL vs. Aurora
  - Query plans
  - Windowing techniques
  - Goals: total throughput (Aurora) vs. minimizing queue size (STREAM); which is better?
• Why focus on memory for resource management?
• What are the disadvantages of using eddies?
• What assumptions are made that make the sharing of operators effective?
• How does CACQ's use of eddies differ from their use in Telegraph, and what are the pros and cons of this approach?

PSoup Questions
PSoup Question 1
PSoup is currently implemented in main memory. This gives the system great speed, but limits the amount of data that can be stored for purposes of windowing. What alternatives could increase the amount of storage without significantly hurting performance?
PSoup Question 2
The creators of PSoup posit that their system can be applied to Data Recharging scenarios. Is this really plausible? Consider:
* PSoup's main memory limitations
* PSoup's applicability to the 'Net
* Data-recharging utility functions
