Optimization of Continuous Queries in Federated Database and Stream Processing Systems


  1. Optimization of Continuous Queries in Federated Database and Stream Processing Systems
     Yuanzhen Ji (1), Zbigniew Jerzak (1), Anisoara Nica (1), Gregor Hackenbroich (1), Christof Fetzer (2)
     (1) SAP SE, firstname.lastname@sap.com
     (2) TU Dresden, christof.fetzer@tu-dresden.de
     March 10, 2015, BTW 2015

  2. Agenda
     • Introduction
     • Federated Continuous Query Execution
     • Query Optimization Problem
     • Our Optimization Solution
     • Evaluation
     • Conclusions

  3. Introduction
     • Problem: optimizing continuous queries (CQ) for federated execution over a native stream processing engine (SPE) and a column-oriented in-memory database (CIMDB).
       – Focus on SPJA queries (select, project, join, aggregate)
     • Goal: maximize query throughput (the amount of data processed per unit time)
     • By "federate", we mean "outsource"
     [Figure: SPE and CIMDB connected by a data flow; labels: data streams, query, results]

  5. Introduction
     • Motivation:
       – "No one size fits all" (Cyclops [LHB13], [JI13])
       – Obtain the best of both worlds (SPE, CIMDB)
     • Application scenario:
       – Analyzing energy consumption data collected from smart plugs installed in households (DEBS 2014 Grand Challenge)
     • Main contributions:
       – A static cost-based optimizer for federated systems that
         • extends established optimization techniques
         • considers the feasibility property of CQ
       – Reveal the potential of federated CQ execution; query throughput under federated execution is
         • up to 8.5x as high as query throughput under SPE-based execution
         • up to 1.8x as high as query throughput under CIMDB-based execution

  6. Federated Continuous Query Execution
     [Figure: SPE and CIMDB connected by a data flow; labels: data streams, query, results]

  7. Federated Continuous Query Execution
     [Figure: SPE with a MIG operator, and CIMDB; labels: data streams, query, results, SQL query, data flow]
     • Introduce a MIG operator in the SPE, which is responsible for:
       – sending relevant input data from the SPE to the CIMDB
       – triggering re-evaluation of the query pieces moved to the CIMDB
       – bringing the results of query pieces executed in the CIMDB back to the SPE

  8. Query Optimization Problem
     • Problem: determine the optimal execution plan for a given CQ across the SPE and the CIMDB
       – Currently static optimization
     • Feasibility of continuous queries [AN04]:
       – Feasible execution plan: a plan that can keep up with the data arrival rate
       – Feasible query: a query that has at least one feasible plan
     • Feasibility-dependent optimization objective:
       – Feasible queries: find the feasible plan with minimum resource consumption
       – Infeasible queries: find the infeasible plan with maximum throughput
     • State of the art: prior work either does not consider the feasibility of CQ, or considers it but not in a federated system setup.

  9. Optimization Solution: Cost Model – Operator Cost (1)
     • Operator cost $C(op)$: CPU cost caused by the tuples that arrive from the data sources within unit time.
     • For an operator $op$ with $k$ direct upstream operators: $C(op) = \sum_{i=1}^{k} l_i c_i$
       – $l_i$: number of tuples produced by the $i$-th upstream operator as a result of unit-time source arrivals
       – $c_i$: time to process a single tuple from the $i$-th upstream operator
     • Example ($k = 2$): $l_1 = 300$, $c_1 = 0.001$, $l_2 = 200$, $c_2 = 0.002$, so $C(op) = l_1 c_1 + l_2 c_2 = 300 \cdot 0.001 + 200 \cdot 0.002 = 0.7$
     • $C(op) > 1$ → bottleneck → infeasible plan
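The operator cost on slide 9 can be reproduced in a few lines. This is a minimal sketch, not the authors' implementation; the function names `operator_cost` and `is_bottleneck` are made up for illustration, and the input values are the ones from the slide's example.

```python
def operator_cost(inputs):
    """Return C(op) = sum of l_i * c_i.

    `inputs` is a list of (l_i, c_i) pairs, one per direct upstream operator:
    l_i = tuples arriving from upstream i per unit time,
    c_i = time to process one such tuple.
    """
    return sum(l * c for l, c in inputs)


def is_bottleneck(cost):
    """An operator whose cost exceeds 1 cannot keep up with its input."""
    return cost > 1.0


# Values from the slide's example: two upstream operators.
cost = operator_cost([(300, 0.001), (200, 0.002)])
print(f"C(op) = {cost:.1f}, bottleneck: {is_bottleneck(cost)}")
# prints "C(op) = 0.7, bottleneck: False"
```

A cost of 0.7 means the operator is busy for 70% of each unit of time, so it keeps up with its inputs.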

  10. Optimization Solution: Cost Model – Operator Cost (2)
     [Figure: SPE with a MIG operator, and CIMDB; labels: data streams, query, results, SQL query, data flow]
     • A query piece outsourced to the CIMDB and its corresponding MIG operator are
       – treated as one composite operator and costed as a whole
       – charged a cost that includes the data transfer (in and out) cost and the query execution cost

  11. Optimization Solution: Cost Model – Execution Plan Cost
     • Execution plan cost $C(P)$ for a plan $P$ with $m$ operators: $C(P) = \langle C_b(P), C_u(P) \rangle$
       – Bottleneck cost: $C_b(P) = \max\{C(op_i) : i \in [1, m]\}$
       – Total utilization cost: $C_u(P) = \sum_{i=1}^{m} C(op_i)$
     • Example ($m = 4$): $C(op_1) = 0.5$, $C(op_2) = 0.3$, $C(op_3) = 1.1$, $C(op_4) = 0.7$
       – $C_b(P) = C(op_3) = 1.1$
       – $C_u(P) = 0.5 + 0.3 + 1.1 + 0.7 = 2.6$
     • $P$ is infeasible if $C_b(P) > 1$
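The two components of the plan cost follow directly from the per-operator costs. The sketch below reproduces the slide's four-operator example; `plan_cost` and `is_feasible` are illustrative names, not taken from the paper.

```python
def plan_cost(operator_costs):
    """Return (C_b, C_u) for a plan, given the cost C(op_i) of each operator.

    C_b is the bottleneck cost (maximum operator cost),
    C_u is the total utilization cost (sum of operator costs).
    """
    return max(operator_costs), sum(operator_costs)


def is_feasible(bottleneck_cost):
    """A plan is infeasible if its bottleneck cost exceeds 1."""
    return bottleneck_cost <= 1.0


# Operator costs from the slide's example.
c_b, c_u = plan_cost([0.5, 0.3, 1.1, 0.7])
print(f"C_b = {c_b:.1f}, C_u = {c_u:.1f}, feasible: {is_feasible(c_b)}")
# prints "C_b = 1.1, C_u = 2.6, feasible: False"
```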

  12. Optimization Solution: Optimal Execution Plan
     • An execution plan $P$ of a continuous query is an optimal plan iff, for any other plan $P'$ of the query, one of the following conditions is satisfied:
       – Condition 1: $P$ is feasible but $P'$ is infeasible: $C_b(P) \le 1 < C_b(P')$
       – Condition 2: both $P$ and $P'$ are feasible, but $P$ has lower $C_u$: $C_b(P) \le 1$, $C_b(P') \le 1$, and $C_u(P) \le C_u(P')$
       – Condition 3: both $P$ and $P'$ are infeasible, but $P$ has lower $C_b$: $1 < C_b(P) \le C_b(P')$
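One way to read the three conditions is as a sort order: feasible plans come before infeasible ones, feasible plans are ordered by total utilization cost, and infeasible plans by bottleneck cost. The sketch below encodes that reading as a sort key. The names `rank` and `pick_optimal` and all plan costs are hypothetical, chosen only to exercise each condition.

```python
def rank(cost):
    """Map a plan cost (C_b, C_u) to a sort key; a smaller key is a better plan.

    Feasible plans (C_b <= 1) rank before infeasible ones (condition 1).
    Among feasible plans, lower C_u wins (condition 2).
    Among infeasible plans, lower C_b wins (condition 3).
    """
    c_b, c_u = cost
    return (0, c_u) if c_b <= 1.0 else (1, c_b)


def pick_optimal(plans):
    """Return the name of an optimal plan from a dict name -> (C_b, C_u)."""
    return min(plans, key=lambda name: rank(plans[name]))


# Hypothetical costs: A and B are feasible, C is not.
print(pick_optimal({"A": (0.9, 2.0), "B": (0.8, 2.4), "C": (1.1, 1.5)}))  # prints "A"

# Hypothetical infeasible query: every plan has C_b > 1.
print(pick_optimal({"D": (1.4, 2.0), "E": (1.2, 3.0)}))  # prints "E"
```

Note that plan C has the lowest total utilization cost in the first example but still loses, because an infeasible plan never beats a feasible one.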

  13. Optimization Solution: Two-Phase Optimization
     • Large search space (high number of possible execution plans) → two-phase optimization:
       – Phase One:
         • find the optimal purely SPE-based plan (considering join ordering, etc.)
         • take the corresponding logical plan as the optimal logical plan
       – Phase Two:
         • determine the placement of each logical operator in the plan produced in Phase One
         • bottom-up plan construction using a dynamic programming (DP) approach
         • the paper proves that DP is applicable to the feasibility-dependent optimization objective
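To make the Phase Two search space concrete, the sketch below enumerates every placement of a small operator chain by brute force. It is not the DP approach from the slides, only an illustration of the decision being made and of why the number of candidates (2^n for n operators) grows quickly. Everything about the setup is an assumption for illustration: a linear chain of operators, the per-operator costs, a fixed transfer cost, and the rule that each contiguous run of operators placed in the database forms one composite operator.

```python
from itertools import groupby, product


def plan_cost(placement, spe_costs, db_costs, transfer_cost):
    """Return (C_b, C_u) for one placement of a linear operator chain.

    placement[i] is True if operator i runs in the database.
    Assumption: each maximal run of database-placed operators is one
    composite operator costing transfer_cost plus its execution costs.
    """
    costs, start = [], 0
    for in_db, run in groupby(placement):
        length = len(list(run))
        if in_db:
            costs.append(transfer_cost + sum(db_costs[start:start + length]))
        else:
            costs.extend(spe_costs[start:start + length])
        start += length
    return max(costs), sum(costs)


def rank(cost):
    """Feasible plans first (by C_u), then infeasible plans (by C_b)."""
    c_b, c_u = cost
    return (0, c_u) if c_b <= 1.0 else (1, c_b)


def best_placement(spe_costs, db_costs, transfer_cost):
    """Try all 2^n placements and return the best one."""
    candidates = product((False, True), repeat=len(spe_costs))
    return min(
        candidates,
        key=lambda p: rank(plan_cost(p, spe_costs, db_costs, transfer_cost)),
    )


# Hypothetical costs for a three-operator chain.
best = best_placement([0.2, 1.3, 0.6], [0.2, 0.4, 0.2], transfer_cost=0.3)
print(["DB" if in_db else "SPE" for in_db in best])  # prints "['SPE', 'DB', 'DB']"
```

With these made-up numbers the all-SPE plan is infeasible because the second operator costs 1.3, and moving the last two operators into the database yields the feasible plan with the lowest total utilization cost.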

  14. Optimization Solution: Pruning in Phase Two
     • For each operator $op$ in a logical plan, the optimal sub-plan up to $op$, where $op$ is placed in the SPE, can be built from the optimal sub-plans up to the direct upstream operators of $op$.
     [Figure: example logical plan with operators op1, op2, op3 and inputs I1, I2; cost annotations not recoverable]
     • For a large logical plan: divide it into smaller pieces, then optimize and compose them in post-order.

  15. Evaluation
     • Setup: HP Z620 workstation with 24 cores (1.2 GHz per core) and 96 GB RAM, running SUSE Linux.
     • Data: real-world energy consumption data from smart plugs installed in households (DEBS 2014 Grand Challenge).
     • Tested queries: [figure not recoverable]

  16. Evaluation: Optimizer Effectiveness (1)
     • Examine 9 source stream data rates picked from the range [1,000, 40,000] (tuples/s)
     • Measure the throughput of the devised optimal plan
     [Figure: tested query; operator labels: PROJECT, AGGR (cnt), INNER JOIN, AGGR (avg), SELECT (x2), WINDOW (5 min) (x2)]
     [Figure: "Actual vs. requested throughput"; x-axis: requested throughput (thousand/s); y-axis: actual throughput (thousand/s); series: SELECT in SPE]
     [Figure: "Max. throughput comparison"; y-axis: max. throughput (thousand/s); bars: SELECT in SPE, All in SPE, All in DB; data labels: 26.1, 18.7, 3.1 (assignment of labels to bars not recoverable)]

  17. Evaluation: Optimizer Effectiveness (2)
     • Examine 9 source stream data rates picked from the range [1,000, 40,000] (tuples/s)
     • Measure the throughput of the devised optimal plan
     [Figure: tested query; operator labels: PROJECT, INNER JOIN, AGGR (avg, max) (x2), SELECT (x2), WINDOW (5 min), WINDOW (1 min)]
     [Figure: "Actual vs. requested throughput"; x-axis: requested throughput (thousand/s); y-axis: actual throughput (thousand/s); series: SELECT in SPE (P1); SEL, JOIN, P in SPE (P2)]
     [Figure: "Max. throughput comparison"; y-axis: max. throughput (thousand/s); bars: SELECT in SPE (P1); SEL, JOIN, P in SPE (P2); All in SPE; All in DB; data labels: 28.6, 18.1, 18.0, 6.0 (assignment of labels to bars not recoverable)]

  18. Evaluation: Influence of Feasibility Check
     [Figure: tested query; operator labels: PROJECT, INNER JOIN, AGGR (avg, max) (x2), SELECT (x2), WINDOW (5 min), WINDOW (1 min)]
     [Figure: actual vs. requested throughput; x-axis: requested throughput (thousand/s); y-axis: actual throughput (thousand/s); series: SELECT in SPE (P1) with feasibility check; SEL, JOIN, P in SPE (P2) with feasibility check; SEL in SPE (P1) without feasibility check]

  19. Evaluation: Optimization Time
     • Tested with join queries (2-way, 5-way, 8-way).
     • The time can be reduced to about 2 seconds by breaking the input logical plan into 2 pieces.
     [Figure: tested query; operator labels: PROJECT, INNER JOIN, AGGR (avg, max) (x2), SELECT (x2), WINDOW (5 min), WINDOW (1 min)]
     [Figure: optimization time in milliseconds (log scale); series: Phase-One, Phase-Two; x-axis: 2-way (6), 5-way (15), 8-way (24)]
     [Figure: number of enumerated plans in Phase-Two (log scale); series: with pruning, without pruning; x-axis: 2-way (6), 5-way (15), 8-way (24)]
     [Data labels across both charts: 16+ million, 327168, 61335.3, 8411, 908.6, 312, 100.5, 68.6, 64, 12.3, 11, 0.9 (assignment to series not recoverable)]

  20. Conclusion
     • Exploit the potential of federated execution of continuous queries over an SPE and a CIMDB.
     • Present a static optimizer that extends traditional optimization techniques to consider the feasibility of continuous queries.
     • Evaluation shows promising results. For the examined queries, the throughput of the devised federated plan is
       – up to 8.5 times as high as the throughput of the purely SPE-based plan
       – up to 1.8 times as high as the throughput of the purely CIMDB-based plan
