 
Optimization of Continuous Queries in Federated Database and Stream Processing Systems
Yuanzhen Ji 1, Zbigniew Jerzak 1, Anisoara Nica 1, Gregor Hackenbroich 1, Christof Fetzer 2
1 SAP SE, 2 TU Dresden
1 firstname.lastname@sap.com, 2 christof.fetzer@tu-dresden.de
March 10, 2015, BTW 2015
Agenda
• Introduction
• Federated Continuous Query Execution
• Query Optimization Problem
• Our Optimization Solution
• Evaluation
• Conclusions
Introduction
• Problem: optimizing continuous queries (CQ) for federated execution over a native stream processing engine (SPE) and a column-oriented in-memory database (CIM DB).
  – focus on SPJA queries (select, project, join, aggregate)
• Goal: maximize query throughput (amount of data processed per unit time)
• By “federate”, we mean “outsource”
[Figure: data streams and a query enter the SPE, data flows between the SPE and the CIM DB, and the SPE emits results]
Introduction
• Motivation:
  – “No one size fits all” (Cyclops [LHB13], [JI13])
  – obtain the best of both worlds (SPE, CIM DB)
• Application scenario:
  – analyzing energy consumption data collected from smart plugs installed in households (DEBS 2014 Grand Challenge)
• Main contributions:
  – a static cost-based optimizer for federated systems
    • extends established optimization techniques
    • considers the feasibility property of CQ
  – reveal the potential of federated CQ execution; query throughput under federated execution is
    • up to 8.5x as high as query throughput under SPE-based execution
    • up to 1.8x as high as query throughput under CIM DB-based execution
Federated Continuous Query Execution
[Figure: architecture diagram with the labels data streams, query, results, SPE, CIM DB, and data flow]
Federated Continuous Query Execution
[Figure: SPE with data streams, query, and results; a MIG operator in the SPE issues an SQL query to the CIM DB; data flow between the two]
Introduce a MIG operator in the SPE, which is responsible for
• sending relevant input data from the SPE to the CIM DB
• triggering re-evaluation of query pieces moved to the CIM DB
• taking results of query pieces executed in the CIM DB back to the SPE
Query Optimization Problem
• Problem: determine the optimal execution plan for a given CQ
  – currently static optimization
• Feasibility of continuous queries [AN04]:
  – feasible execution plan: can keep up with the data arrival rate
  – feasible query: has at least one feasible plan
• Feasibility-dependent optimization objective:
  – feasible queries: find the feasible plan with minimum resource consumption
  – infeasible queries: find the infeasible plan with maximum throughput
• State of the art: existing work either does not consider the feasibility of CQ, or considers it but not in a federated system setup.
Optimization Solution: Cost Model – Operator Cost (1)
• Operator cost C(op): CPU cost caused by tuples arriving from the data sources within unit time:
  $C(op) = \sum_{i=1}^{k} l_i c_i$
  for an operator op with k direct upstream operators, where
  – $l_i$: number of tuples produced by the i-th upstream operator as a result of unit-time source arrivals
  – $c_i$: time to process a single tuple from the i-th upstream operator
• Example (k = 2): with $l_1 = 300$, $c_1 = 0.001$, $l_2 = 200$, $c_2 = 0.002$:
  $C(op) = l_1 c_1 + l_2 c_2 = 300 \cdot 0.001 + 200 \cdot 0.002 = 0.7$
• $C(op) > 1$ → bottleneck → infeasible plan
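The operator cost formula can be checked with a few lines of Python. This is only a minimal sketch: the function name `operator_cost` and the list-based inputs are illustrative choices, not part of the presented system; the numbers are the ones from the slide's example.

```python
def operator_cost(rates, per_tuple_costs):
    """C(op) = sum of l_i * c_i over the k direct upstream operators.

    rates[i]           -- l_i, tuples arriving from upstream operator i per unit time
    per_tuple_costs[i] -- c_i, time needed to process one such tuple
    """
    if len(rates) != len(per_tuple_costs):
        raise ValueError("need one per-tuple cost for every upstream rate")
    return sum(l * c for l, c in zip(rates, per_tuple_costs))


# Slide example: l1 = 300, c1 = 0.001, l2 = 200, c2 = 0.002
cost = operator_cost([300, 200], [0.001, 0.002])
print(round(cost, 3))  # 0.7
print(cost > 1)        # False: this operator is not a bottleneck
```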
Optimization Solution: Cost Model – Operator Cost (2)
[Figure: SPE with data streams, query, and results; a MIG operator in the SPE issues an SQL query to the CIM DB; data flow between the two]
• A query piece outsourced to the CIM DB and its corresponding MIG operator:
  – are treated as one composite operator and costed as a whole
  – the cost includes the data transfer cost (in and out) and the query execution cost
Optimization Solution: Cost Model – Execution Plan Cost
• Execution plan cost C(P): $C(P) = \langle C_b(P), C_u(P) \rangle$ (P has m operators)
  – bottleneck cost: $C_b(P) = \max\{C(op_i) : i \in [1, m]\}$
  – total utilization cost: $C_u(P) = \sum_{i=1}^{m} C(op_i)$
  – P is infeasible if $C_b(P) > 1$
• Example: for a plan with four operators whose costs are 0.5, 0.3, 1.1, and 0.7:
  $C_b(P) = 1.1$ and $C_u(P) = 0.5 + 0.3 + 1.1 + 0.7 = 2.6$
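The two cost components are simply the maximum and the sum of the operator costs. The following minimal sketch reproduces the slide's example; the helper names `plan_cost` and `is_feasible` are illustrative and not taken from the talk.

```python
def plan_cost(operator_costs):
    """Return (C_b, C_u): bottleneck cost and total utilization cost of a plan."""
    return max(operator_costs), sum(operator_costs)


def is_feasible(bottleneck_cost):
    """A plan is infeasible if its bottleneck cost exceeds 1."""
    return bottleneck_cost <= 1


# Slide example: four operators with costs 0.5, 0.3, 1.1 and 0.7
c_b, c_u = plan_cost([0.5, 0.3, 1.1, 0.7])
print(c_b)               # 1.1
print(round(c_u, 3))     # 2.6
print(is_feasible(c_b))  # False
```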
Optimization Solution: Optimal Execution Plan
• An execution plan P of a continuous query is an optimal plan iff, for any other plan P' of the query, one of the following conditions is satisfied:
  – Condition 1: P is feasible but P' is infeasible:
    $C_b(P) \le 1 < C_b(P')$
  – Condition 2: both P and P' are feasible, and the $C_u$ of P is not higher:
    $C_b(P) \le 1$, $C_b(P') \le 1$, and $C_u(P) \le C_u(P')$
  – Condition 3: both P and P' are infeasible, and the $C_b$ of P is not higher:
    $1 < C_b(P) \le C_b(P')$
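The three conditions translate directly into a pairwise test on (C_b, C_u) cost pairs. The sketch below is one possible encoding; the function names and the plan costs for A, B, and C are made up for illustration and do not come from the evaluation.

```python
def satisfies_optimality_condition(cost_p, cost_other):
    """True if plan P fulfils Condition 1, 2 or 3 with respect to plan P'.

    Each cost is a pair (C_b, C_u).
    """
    cb_p, cu_p = cost_p
    cb_o, cu_o = cost_other
    if cb_p <= 1 < cb_o:             # Condition 1: P feasible, P' infeasible
        return True
    if cb_p <= 1 and cb_o <= 1:      # Condition 2: both feasible, compare C_u
        return cu_p <= cu_o
    if cb_p > 1 and cb_o > 1:        # Condition 3: both infeasible, compare C_b
        return cb_p <= cb_o
    return False                     # P infeasible, P' feasible


def optimal_plan(plan_costs):
    """Return the name of a plan that satisfies a condition against every other plan."""
    for name, cost in plan_costs.items():
        if all(satisfies_optimality_condition(cost, other)
               for other_name, other in plan_costs.items() if other_name != name):
            return name
    return None


# Hypothetical plans: A is infeasible, B and C are feasible
plans = {"A": (1.1, 2.6), "B": (0.9, 3.0), "C": (0.7, 3.4)}
print(optimal_plan(plans))  # B
```

Note that the feasible plan B wins even though A has the lowest total utilization cost, because feasibility is checked before resource consumption.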
Optimization Solution: Two-Phase Optimization
• Large search space (high number of possible execution plans) → two-phase optimization:
  – Phase One:
    • find the optimal purely SPE-based plan (considering join ordering, etc.)
    • take the corresponding logical plan as the optimal logical plan
  – Phase Two:
    • determine the placement of each logical operator in the plan produced in Phase One
    • bottom-up plan construction using a dynamic programming (DP) approach
    • the applicability of DP to the feasibility-dependent optimization objective is proved in the paper
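To make the Phase Two search space concrete, the sketch below enumerates every SPE/DB placement for a small linear chain of operators and picks the best one under the feasibility-dependent objective. It is deliberately a brute-force baseline, not the DP algorithm from the paper, and everything numeric in it is an assumption: the per-operator costs, the fixed `TRANSFER_COST`, and the rule that each run of consecutive DB-placed operators forms one composite operator whose cost is the transfer cost plus the sum of its DB costs.

```python
from itertools import product

TRANSFER_COST = 0.05  # hypothetical cost of moving data into and out of the database


def plan_cost(placement, cost_spe, cost_db):
    """Return (C_b, C_u) for one placement of a linear operator chain.

    placement[i] is "SPE" or "DB". Each maximal run of consecutive "DB"
    operators is costed as a single composite operator (assumption).
    """
    op_costs = []
    run = None  # accumulated cost of the current DB run, if one is open
    for where, c_spe, c_db in zip(placement, cost_spe, cost_db):
        if where == "DB":
            run = (TRANSFER_COST if run is None else run) + c_db
        else:
            if run is not None:
                op_costs.append(run)
                run = None
            op_costs.append(c_spe)
    if run is not None:
        op_costs.append(run)
    return max(op_costs), sum(op_costs)


def objective_key(cost):
    """Feasible plans first (ordered by C_u), then infeasible ones (ordered by C_b)."""
    c_b, c_u = cost
    return (0, c_u) if c_b <= 1 else (1, c_b)


def best_placement(cost_spe, cost_db):
    """Exhaustively search all 2^n placements of an n-operator chain."""
    candidates = product(("SPE", "DB"), repeat=len(cost_spe))
    return min(candidates,
               key=lambda p: objective_key(plan_cost(p, cost_spe, cost_db)))


# Hypothetical three-operator chain: the second operator overloads the SPE
spe_costs = [0.2, 1.4, 0.3]
db_costs = [0.4, 0.5, 0.6]
print(best_placement(spe_costs, db_costs))  # ('SPE', 'DB', 'SPE')
```

The number of candidates doubles with every additional operator, which is why exhaustive enumeration stops being practical for larger plans and a DP construction with pruning becomes attractive.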
Optimization Solution: Pruning in Phase Two
• For each operator (op) in a logical plan, the optimal sub-plan until op, where op is placed in the SPE, can be built from the optimal sub-plans until the direct upstream operators of op.
[Figure: example logical plan over inputs I1 and I2 illustrating the construction of sub-plans]
• For a large logical plan: divide it into smaller pieces, then optimize and compose them in post order.
Evaluation
• Setup: HP Z620 workstation with 24 cores (1.2 GHz per core) and 96 GB RAM, running SUSE Linux.
• Data: real-world energy consumption data from smart plugs installed in households (DEBS 2014 Grand Challenge).
• Tested queries: [Figure: operator trees of the tested queries]
Evaluation: Optimizer effectiveness (1)
• Examine 9 source stream data rates picked from the range [1,000, 40,000] (tuples/s)
• Measure the throughput of the devised optimal plan
[Figure: query plan with the operators PROJECT, AGGR (cnt), INNER JOIN, AGGR (avg), two SELECTs, and two WINDOWs (5 min each)]
[Figure: "Actual vs. requested throughput"; x-axis: requested throughput (thousand/s), y-axis: actual throughput (thousand/s); series: SELECT in SPE]
[Figure: "Max. throughput comparison"; y-axis: max. throughput (thousand/s); bars: SELECT in SPE, All in SPE, All in DB; data labels: 26.1, 18.7, 3.1]
Evaluation: Optimizer effectiveness (2)
• Examine 9 source stream data rates picked from the range [1,000, 40,000] (tuples/s)
• Measure the throughput of the devised optimal plan
[Figure: query plan with the operators PROJECT, INNER JOIN, two AGGRs (avg, max), two SELECTs, and two WINDOWs (5 min and 1 min)]
[Figure: "Actual vs. requested throughput"; x-axis: requested throughput (thousand/s), y-axis: actual throughput (thousand/s); series: SELECT in SPE (P1) and SEL, JOIN, P in SPE (P2)]
[Figure: "Max. throughput comparison"; y-axis: max. throughput (thousand/s); bars: SELECT in SPE (P1), SEL, JOIN, P in SPE (P2), All in SPE, All in DB; data labels: 28.6, 18.1, 18.0, 6.0]
Evaluation: Influence of Feasibility Check
[Figure: query plan with the operators PROJECT, INNER JOIN, two AGGRs (avg, max), two SELECTs, and two WINDOWs (5 min and 1 min)]
[Figure: actual throughput (thousand/s) vs. requested throughput (thousand/s); series: SELECT in SPE (P1) with feasibility check, SEL, JOIN, P in SPE (P2) with feasibility check, SEL in SPE (P1) without feasibility check]
Evaluation: Optimization Time
• Tested with join queries (2-way, 5-way, 8-way).
[Figure: query plan with the operators PROJECT, INNER JOIN, two AGGRs (avg, max), two SELECTs, and two WINDOWs (5 min and 1 min)]
[Figure: time in milliseconds (log scale) for Phase One and Phase Two; x-axis: 2-way (6), 5-way (15), 8-way (24); data labels: 0.9, 12.3, 68.6, 100.5, 908.6, 61335.3; annotation: time can be reduced to about 2 seconds by breaking the input logical plan into 2 pieces]
[Figure: number of enumerated plans in Phase Two (log scale), with and without pruning; x-axis: 2-way (6), 5-way (15), 8-way (24); data labels: 11, 64, 312, 8411, 327168, 16+ million]
Conclusion
• Exploit the potential of federated execution of continuous queries over SPE and CIM DB.
• Present a static optimizer which extends traditional optimization techniques to consider the feasibility of continuous queries.
• Evaluation shows promising results. For the examined queries, the throughput of the devised federated plan is
  – up to 8.5 times as high as the throughput of the purely SPE-based plan
  – up to 1.8 times as high as the throughput of the purely CIM DB-based plan