

SLIDE 1

Optimization of Continuous Queries in Federated Database and Stream Processing Systems

Yuanzhen Ji¹, Zbigniew Jerzak¹, Anisoara Nica¹, Gregor Hackenbroich¹, Christof Fetzer²

¹ SAP SE, ² TU Dresden

¹ firstname.lastname@sap.com, ² christof.fetzer@tu-dresden.de

March 10, 2015, BTW 2015

SLIDE 2

Agenda

  • Introduction
  • Federated Continuous Query Execution
  • Query Optimization Problem
  • Our Optimization Solution
  • Evaluation
  • Conclusions


SLIDE 3

Introduction

  • Problem: optimizing continuous queries (CQs) for federated execution over a native stream processing engine (SPE) and a column-oriented in-memory database (CIMDB).

– focus on SPJA queries (select, project, join, aggregate)

  • Goal: maximize query throughput (the amount of data processed per unit time)
  • By “federate”, we mean “outsource”

[Figure: system diagram with labels: data streams, SPE, CIMDB, data flow, query results]


SLIDE 5

Introduction

  • Motivation:

– “No one size fits all” (Cyclops [LHB13], [Ji13])
– obtain the best of both worlds (SPE, CIMDB)

  • Application scenario:

– analyzing energy consumption data collected from smart plugs installed in households (DEBS 2014 Grand Challenge)

  • Main contributions:

– a static cost-based optimizer for federated systems, which

  • extends established optimization techniques
  • considers the feasibility property of CQs

– reveal the potential of federated CQ execution; query throughput under federated execution is

  • up to 8.5x as high as query throughput under SPE-based execution
  • up to 1.8x as high as query throughput under CIMDB-based execution

SLIDE 6

Federated Continuous Query Execution

[Figure: system diagram with labels: data streams, SPE, CIMDB, data flow, query results]

SLIDE 7

Federated Continuous Query Execution

Introduce a MIG operator in the SPE, which is responsible for

  • sending relevant input data from the SPE to the CIMDB
  • triggering re-evaluation of the query pieces moved to the CIMDB
  • taking the results of the query pieces executed in the CIMDB back to the SPE

[Figure: system diagram with labels: data streams, SPE, MIG operators, SQL query, CIMDB, data flow, query results]
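The three responsibilities of the MIG operator can be pictured with a small Python sketch. It is only an illustration under loud assumptions: `sqlite3` stands in for the CIMDB, and the class name `MigOperator`, the table `readings`, its columns, and the SQL text are all hypothetical. Window eviction and the actual engine interfaces are left out.

```python
import sqlite3


class MigOperator:
    """Hypothetical stand-in for a MIG operator; sqlite3 plays the CIMDB."""

    def __init__(self, conn, table, columns, sql_query):
        self.conn = conn
        self.sql_query = sql_query
        placeholders = ", ".join("?" for _ in columns)
        self.insert_sql = f"INSERT INTO {table} VALUES ({placeholders})"
        conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")

    def process_batch(self, tuples):
        # 1. Send the relevant input data from the SPE side to the database.
        self.conn.executemany(self.insert_sql, tuples)
        # 2. Trigger re-evaluation of the outsourced query piece.
        cursor = self.conn.execute(self.sql_query)
        # 3. Take the results back to the SPE side.
        return cursor.fetchall()


conn = sqlite3.connect(":memory:")
mig = MigOperator(
    conn,
    table="readings",                 # assumed table name
    columns=["plug_id", "load_w"],    # assumed schema
    sql_query=(
        "SELECT plug_id, AVG(load_w) FROM readings "
        "GROUP BY plug_id ORDER BY plug_id"
    ),
)
first = mig.process_batch([(1, 10.0), (2, 30.0)])
second = mig.process_batch([(1, 20.0)])
print(first, second)
```

Each call re-runs the SQL over everything inserted so far, which mirrors the "re-evaluation" wording on the slide; a real setup would also expire tuples that leave the window.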

SLIDE 8

Query Optimization Problem

  • Problem: determine the optimal execution plan for a given CQ

– currently static optimization

  • Feasibility of continuous queries [AN04]:

– feasible execution plan: can keep up with the data arrival rate
– feasible query: has at least one feasible plan

  • Feasibility-dependent optimization objective:

– feasible queries: find the feasible plan with minimum resource consumption
– infeasible queries: find the infeasible plan with maximum throughput

  • State of the art: existing work either does not consider the feasibility of CQs, or considers feasibility but not a federated system setup.

SLIDE 9

Optimization Solution

Cost Model – Operator Cost (1)

  • Operator cost $C(op)$: CPU cost caused by the tuples that arrive from the data sources within unit time. For an operator $op$ with $k$ direct upstream operators:

$C(op) = \sum_{i=1}^{k} l_i c_i$

– $l_i$: number of tuples produced by the $i$-th upstream operator as a result of unit-time source arrivals
– $c_i$: time to process a single tuple from the $i$-th upstream operator

  • $C(op) > 1$ → bottleneck → infeasible plan

  • Example: an operator $op$ with two upstream inputs, $l_1 = 300$, $l_2 = 200$, $c_1 = 0.001$, $c_2 = 0.002$:

$C(op) = l_1 c_1 + l_2 c_2 = 300 \cdot 0.001 + 200 \cdot 0.002 = 0.7$
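The operator cost formula is a plain weighted sum, so it can be checked with a few lines of Python. This is a minimal sketch; the function name `operator_cost` and the list-of-pairs input format are assumptions made for illustration.

```python
def operator_cost(inputs):
    """Sum l_i * c_i over the direct upstream operators.

    inputs: list of (l_i, c_i) pairs, where l_i is the number of tuples
    arriving per unit time and c_i is the time to process one tuple.
    """
    return sum(l * c for l, c in inputs)


# Values from the slide: l1 = 300, c1 = 0.001, l2 = 200, c2 = 0.002.
cost = operator_cost([(300, 0.001), (200, 0.002)])
is_bottleneck = cost > 1  # a cost above 1 means the operator cannot keep up
print(round(cost, 3), is_bottleneck)
```

With the slide's numbers the cost is about 0.7, so this operator uses roughly 70% of a unit of time per unit of arrivals and is not a bottleneck.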

SLIDE 10

Optimization Solution

Cost Model – Operator Cost (2)

  • A query piece outsourced to the CIMDB and its corresponding MIG operator:

– are treated as one composite operator and costed as a whole
– the cost includes the data transfer cost (in and out) and the query execution cost

[Figure: system diagram with labels: data streams, SPE, MIG, SQL query, CIMDB, data flow, query results]

SLIDE 11

Optimization Solution

Cost Model – Execution Plan Cost

  • Execution plan cost $C(P)$: $C(P) = \langle C_b(P), C_u(P) \rangle$ ($P$ has $m$ operators)

– Two components:

bottleneck cost: $C_b(P) = \max\{C(op_j) : j \in [1, m]\}$

total utilization cost: $C_u(P) = \sum_{j=1}^{m} C(op_j)$

– $P$ is infeasible if $C_b(P) > 1$

  • Example: a plan $P$ with four operators whose costs are 0.5, 0.3, 1.1, and 0.7:

$C_b(P) = \max\{0.5, 0.3, 1.1, 0.7\} = 1.1$

$C_u(P) = 0.5 + 0.3 + 1.1 + 0.7 = 2.6$
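The two components of the plan cost can be computed directly from the per-operator costs. The sketch below is illustrative only; `PlanCost`, `plan_cost`, and `is_feasible` are names chosen here, not taken from the slides.

```python
from typing import NamedTuple, Sequence


class PlanCost(NamedTuple):
    bottleneck: float    # C_b(P): the largest operator cost in the plan
    utilization: float   # C_u(P): the sum of all operator costs


def plan_cost(operator_costs: Sequence[float]) -> PlanCost:
    return PlanCost(max(operator_costs), sum(operator_costs))


def is_feasible(cost: PlanCost) -> bool:
    # A plan keeps up with the arrival rate only if no operator cost exceeds 1.
    return cost.bottleneck <= 1.0


# Operator costs from the slide's four-operator example.
cost = plan_cost([0.5, 0.3, 1.1, 0.7])
print(cost.bottleneck, round(cost.utilization, 3), is_feasible(cost))
```

For the slide's example the bottleneck cost is 1.1 and the total utilization cost is about 2.6, so the plan is infeasible.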
SLIDE 12

Optimization Solution

Optimal Execution Plan

  • An execution plan $P$ of a continuous query is an optimal plan iff, for any other plan $P'$ of the query, one of the following conditions is satisfied:

– Condition 1: $P$ is feasible but $P'$ is infeasible:

$C_b(P) \le 1 < C_b(P')$

– Condition 2: both $P$ and $P'$ are feasible, and $P$ has the lower (or equal) $C_u$:

$C_b(P) \le 1$, $C_b(P') \le 1$, and $C_u(P) \le C_u(P')$

– Condition 3: both $P$ and $P'$ are infeasible, and $P$ has the lower (or equal) $C_b$:

$1 < C_b(P) \le C_b(P')$
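One way to read the three conditions is as a single sort key: feasible plans come first and are ranked by total utilization cost, infeasible plans follow and are ranked by bottleneck cost. The Python sketch below encodes that reading; the function names and the example costs are invented for illustration and are not taken from the paper.

```python
def preference_key(c_b, c_u):
    """Map a plan cost <C_b, C_u> to a tuple that sorts better plans first."""
    if c_b <= 1.0:
        return (0, c_u)   # feasible: compare by total utilization cost
    return (1, c_b)       # infeasible: compare by bottleneck cost


def at_least_as_good(cost, other):
    # True when one of Conditions 1-3 holds for `cost` against `other`.
    return preference_key(*cost) <= preference_key(*other)


# Hypothetical plan costs as (C_b, C_u) pairs.
plans = {"A": (0.9, 2.0), "B": (0.8, 2.4), "C": (1.3, 1.9)}
best = min(plans, key=lambda name: preference_key(*plans[name]))
print(best)
```

Plan C has the lowest total utilization but is infeasible, so it loses to both feasible plans (Condition 1); between A and B, A wins on total utilization (Condition 2).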

SLIDE 13

Optimization Solution

Two-Phase Optimization

  • Large search space (high number of possible execution plans) → two-phase optimization:

– Phase One:

  • find the optimal purely SPE-based plan (consider join ordering, etc.)
  • take the corresponding logical plan as the optimal logical plan

– Phase Two:

  • determine the placement of each logical operator in the plan produced in Phase One

  • Bottom-up plan construction using a dynamic programming (DP) approach
  • The paper proves that DP is applicable to the feasibility-dependent optimization objective.
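To make the Phase Two search space concrete, the sketch below enumerates every placement for a small linear chain of operators and picks the best one under the feasibility-dependent objective. It is an exhaustive baseline, not the DP approach the slides describe, and the cost model is a toy assumption: each operator has a hypothetical SPE cost and database cost, and each run of adjacent database-placed operators is costed as one composite operator plus a hypothetical constant `TRANSFER_COST`.

```python
from itertools import groupby, product

TRANSFER_COST = 0.2  # hypothetical in-and-out transfer cost per outsourced piece


def plan_cost(placement, spe_costs, db_costs):
    """Return (C_b, C_u) for a chain of operators under the toy cost model."""
    op_costs = []
    index = 0
    for site, run in groupby(placement):
        n = len(list(run))
        if site == "SPE":
            op_costs.extend(spe_costs[index:index + n])
        else:
            # Adjacent DB operators form one composite operator.
            op_costs.append(sum(db_costs[index:index + n]) + TRANSFER_COST)
        index += n
    return max(op_costs), sum(op_costs)


def preference_key(cost):
    c_b, c_u = cost
    # Feasible plans first (by C_u), then infeasible plans (by C_b).
    return (0, c_u) if c_b <= 1.0 else (1, c_b)


def best_placement(spe_costs, db_costs):
    candidates = product(("SPE", "DB"), repeat=len(spe_costs))
    return min(
        candidates,
        key=lambda p: preference_key(plan_cost(p, spe_costs, db_costs)),
    )


# Hypothetical per-operator costs for a three-operator chain.
spe_costs = [0.2, 1.4, 0.6]
db_costs = [0.3, 0.4, 0.2]
best = best_placement(spe_costs, db_costs)
print(best, plan_cost(best, spe_costs, db_costs))
```

A chain of n operators has 2^n candidate placements, so this brute-force search stops being practical quickly, which is the motivation for bottom-up construction and pruning.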

SLIDE 14

Optimization Solution

Pruning in Phase Two

  • For each operator op in a logical plan, the optimal sub-plan up to op, where op is placed in the SPE, can be built from the optimal sub-plans up to the direct upstream operators of op.
  • For a large logical plan: divide it into smaller pieces, then optimize and compose them in post-order.

[Figure: example logical plan with inputs I1 and I2 and operators op1, op2, op3]
SLIDE 15

Evaluation

  • Setup: HP Z620 workstation with 24 cores (1.2 GHz per core) and 96 GB RAM, running SUSE Linux.
  • Data: real-world energy consumption data from smart plugs installed in households (DEBS 2014 Grand Challenge).
  • Tested queries: [query diagrams not preserved]

SLIDE 16

Evaluation

Optimizer effectiveness (1)

  • Examine 9 source stream data rates picked from the range [1,000, 40,000] (tuples/s)
  • Measure the throughput of the devised optimal plan

[Figure: tested query with operators SELECT, SELECT, WINDOW (5 min), WINDOW (5 min), AGGR (avg), AGGR (cnt), INNER JOIN, PROJECT; devised plan: SELECT in SPE]

  • Max. throughput comparison (thousand/s):

– SELECT in SPE: 26.1
– All in SPE: 3.1
– All in DB: 18.7

[Figure: actual vs. requested throughput (thousand/s)]

SLIDE 17

Evaluation

Optimizer effectiveness (2)

  • Examine 9 source stream data rates picked from the range [1,000, 40,000] (tuples/s)
  • Measure the throughput of the devised optimal plan

[Figure: tested query with operators SELECT, SELECT, WINDOW (5 min), WINDOW (1 min), AGGR (avg, max), AGGR (avg, max), INNER JOIN, PROJECT; devised plans: SELECT in SPE (P1) and SEL, JOIN, P in SPE (P2)]

  • Max. throughput comparison (thousand/s):

– SELECT in SPE (P1): 18.1
– SEL, JOIN, P in SPE (P2): 28.6
– All in SPE: 6.0
– All in DB: 18.0

[Figure: actual vs. requested throughput (thousand/s) for P1 and P2]

SLIDE 18

Evaluation

Influence of Feasibility Check

[Figure: actual vs. requested throughput (thousand/s); legend: SELECT in SPE (P1) (with feasibility check); SEL, JOIN, P in SPE (P2) (with feasibility check); SEL in SPE (P1) (without feasibility check)]

[Figure: tested query with operators SELECT, SELECT, WINDOW (5 min), WINDOW (1 min), AGGR (avg, max), AGGR (avg, max), INNER JOIN, PROJECT]

SLIDE 19

Evaluation

Optimization Time

  • Tested with join queries (2-way, 5-way, 8-way).

  • Number of enumerated plans in Phase Two (plotted on a log scale):

– 2-way (6): 11 with pruning, 64 without pruning
– 5-way (15): 312 with pruning, 327168 without pruning
– 8-way (24): 8411 with pruning, 16+ million without pruning

  • Time in milliseconds (plotted on a log scale):

– 2-way (6): Phase One 0.9, Phase Two 12.3
– 5-way (15): Phase One 68.6, Phase Two 908.6
– 8-way (24): Phase One 100.5, Phase Two 61335.3

  • The time can be reduced to about 2 seconds by breaking the input logical plan into 2 pieces.

[Figure: query diagram with operators SELECT, SELECT, WINDOW (5 min), WINDOW (1 min), AGGR (avg, max), AGGR (avg, max), INNER JOIN, PROJECT]

SLIDE 20

Conclusion

  • Exploit the potential of federated execution of continuous queries over an SPE and a CIMDB.
  • Present a static optimizer that extends traditional optimization techniques to consider the feasibility of continuous queries.
  • Evaluation shows promising results. For the examined queries, the throughput of the devised federated plan is

– up to 8.5 times as high as the throughput of the purely SPE-based plan
– up to 1.8 times as high as the throughput of the purely CIMDB-based plan

SLIDE 21

References

[AN04] Ayad, A. M. & Naughton, J. F., Static Optimization of Conjunctive Queries with Sliding Windows over Infinite Streams, SIGMOD, 2004

[FKC+09] Franklin, M. J.; Krishnamurthy, S.; Conway, N.; Li, A.; Russakovsky, A. & Thombre, N., Continuous Analytics: Rethinking query processing in a network-effect world, CIDR, 2009

[KS09] Kraemer, J. & Seeger, B., Semantics and implementation of continuous sliding window queries over data streams, ACM TODS, 2009

[BCD+10] Botan, I.; Cho, Y.; Derakhshan, R.; Dindar, N.; Gupta, A.; Haas, L. M.; Kim, K.; Lee, C.; Mundada, G.; Shan, M.-C.; Tatbul, N.; Yan, Y.; Yun, B. & Zhang, J., A demonstration of the MaxStream federated stream processing system, ICDE, 2010

[LMB+10] Liu, M.; Mihaylov, S. R.; Bao, Z.; Jacob, M.; Ives, Z. G.; Loo, B. T. & Guha, S., SmartCIS: integrating digital and physical environments, SIGMOD Record, 2010

[LIM+12] Liarou, E.; Idreos, S.; Manegold, S. & Kersten, M., MonetDB/DataCell: online analytics in a streaming column-store, PVLDB, 2012

[LHB13] Lim, H.; Han, Y. & Babu, S., How to Fit when No One Size Fits, CIDR, 2013

[Ji13] Ji, Y., Database support for processing complex aggregate queries over data streams, EDBT Workshops, 2013

[CDK+14] Çetintemel, U.; Du, J.; Kraska, T.; Madden, S.; Maier, D.; Meehan, J.; Pavlo, A.; Stonebraker, M.; Sutherland, E.; Tatbul, N.; Tufte, K.; Wang, H. & Zdonik, S. B., S-Store: A streaming NewSQL system for big velocity applications, PVLDB, 2014

[DLB+11] Daum, M.; Lauterwald, F.; Baumgärtel, P.; Pollner, N. & Meyer-Wegener, K., Efficient and Cost-aware Operator Placement in Heterogeneous Stream-processing Environments, DEBS, 2011

SLIDE 22

Thank you!

Contact information: Yuanzhen Ji, yuanzhen.ji@sap.com