Dynamic and Decentralized Global Analytics via Machine Learning
Hao Wang¹, Di Niu², Baochun Li¹
¹University of Toronto, ²University of Alberta
SoCC’18, Carlsbad, CA, USA
CREATE VIEW MoviesOf1996 AS
    ...
SELECT starName, studioName
FROM MoviesOf1996 JOIN StarsIn;
Movies
StarsIn
Query Plan
[Figure: query processing pipeline. Hive / SparkSQL parses the SQL query into QEP candidates and selects a QEP; Hadoop / Spark runs it as a Map-Reduce DAG across DC1, DC2, DC3, …]
iperf -t 10 -P 5
[Figure: bandwidth between Google Cloud Taiwan and Iowa; y-axis: bandwidth (Mbps), in total and per connection; x-axis: time, 3:00 to 5:00]
[Figure: four data centers (DC 1 to DC 4) connected by links of 100, 150, 150, 300, 400, and 500 Mbps; tables: customer 0.5 GB, orders 3.3 GB, partsupp 2.3 GB, lineitem 15 GB]
[Figure: three candidate join plans (Plan A, Plan B, Plan C) over L (15G), O (3.3G), PS (2.3G), and C (0.5G); intermediate results are labeled 2.6G and 2G in Plan A, 1.7G and 0.5G in Plan B, and 1.4G and 1.7G in Plan C]
[Figure: join trees of Plan B and Plan C on the four-data-center topology, with link bandwidth BW (Mbps) plotted against time t (s) between Start and End]
Centralized plan: +Δt (Δt: the data movement time)
[Figure: query completion time (s) of the baseline (Plan A), the plan selected by Clarinet (Plan B), and the dynamically adjusted plan (Plan C)]
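The data movement time Δt can be estimated from table sizes and link bandwidths. The following is a minimal sketch, not the authors' method: the table sizes are taken from the slides, while the link assigned to each table, the parallel-transfer assumption, and the names `transfer_time_s`, `moves`, and `delta_t` are hypothetical.

```python
def transfer_time_s(size_gb: float, bandwidth_mbps: float) -> float:
    """Seconds needed to move size_gb gigabytes over a bandwidth_mbps link."""
    megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8000 Mb
    return megabits / bandwidth_mbps


# Table sizes are from the slides; the link assigned to each table is hypothetical.
moves = {
    "lineitem": (15.0, 500.0),  # (size in GB, link bandwidth in Mbps)
    "partsupp": (2.3, 150.0),
    "customer": (0.5, 100.0),
}

per_table = {name: transfer_time_s(size, bw) for name, (size, bw) in moves.items()}

# Assumption: transfers run in parallel on separate links,
# so the slowest transfer determines the data movement time.
delta_t = max(per_table.values())
print(per_table, delta_t)
```

Under these assumptions the largest table dominates Δt even on the fastest link, which is why moving all data to one site before querying can be costly.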
queries.
[Figure: join trees of Plan B and Plan C]
Planning
Data Model Evaluation
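[Figure: Turbo in the query processing pipeline. Hive / SparkSQL parses the SQL query into QEP candidates and selects a QEP; Hadoop / Spark runs it as a Map-Reduce DAG across DC1, DC2, DC3, …; Turbo comprises model training, a cost estimator (duration, output size), and QEP adjustment]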
filter(order o=>(o.price>100))
1. Operator → Map stage
map(customer c=>(c.custkey, c.values))
map(order o=>(o.custkey, o.values))
reduce(custkey, values)
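The map and reduce calls above describe a join keyed on `custkey`. A minimal single-process sketch of that pattern follows, assuming rows are plain dictionaries; the helper names, the tagging scheme, and the sample rows are hypothetical and only illustrate the idea, not how Hive or Spark implement it.

```python
from collections import defaultdict


def map_customer(c):
    # Mirrors map(customer c => (c.custkey, c.values)): emit (join key, tagged row).
    return c["custkey"], ("customer", c)


def map_order(o):
    # Mirrors map(order o => (o.custkey, o.values)).
    return o["custkey"], ("order", o)


def reduce_join(custkey, tagged_rows):
    # Mirrors reduce(custkey, values): pair every customer row with every order row.
    customers = [row for tag, row in tagged_rows if tag == "customer"]
    orders = [row for tag, row in tagged_rows if tag == "order"]
    return [{**c, **o} for c in customers for o in orders]


def join(customers, orders):
    groups = defaultdict(list)  # stands in for the shuffle: group map outputs by key
    for key, value in map(map_customer, customers):
        groups[key].append(value)
    for key, value in map(map_order, orders):
        groups[key].append(value)
    joined = []
    for key, rows in groups.items():
        joined.extend(reduce_join(key, rows))
    return joined


customers = [{"custkey": 1, "name": "a"}, {"custkey": 2, "name": "b"}]
orders = [
    {"custkey": 1, "orderkey": 10, "price": 150},
    {"custkey": 1, "orderkey": 11, "price": 50},
    {"custkey": 3, "orderkey": 12, "price": 500},
]

# The filter operator runs on the map side, before the shuffle.
expensive = [o for o in orders if o["price"] > 100]
result = join(customers, expensive)
print(result)
```

Applying the filter before the join keeps rows that cannot match the predicate out of the shuffle, which matters most when the shuffle crosses data centers.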
[Figure: a JOIN of tables executed as map stages followed by a reduce stage]
Raw Features            Range
total_exec_num          1 − 16
cpu_core_num            1 − 8 per executor
mem_size                1 − 4 GB per executor
avail_bw                5 − 1000 Mbps per link
tbl1_size, tbl2_size    0.3 − 12 GB per table
hdfs_block_num          1 − 90
LASSO path
Handcrafted Features
tbl_size_sum = sum(tbl1_size, tbl2_size)
max_tbl_size = max(tbl1_size, tbl2_size)
min_tbl_size = min(tbl1_size, tbl2_size)
1/avail_bw, 1/total_exec_num, 1/cpu_core_num
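The handcrafted features listed above can be derived from the raw features in a few lines. This is a minimal sketch; the function name and the `inv_*` keys are hypothetical labels for the three reciprocal features, and the sample values are made up within the raw feature ranges.

```python
def handcrafted_features(raw: dict) -> dict:
    """Derive the handcrafted features from a dictionary of raw features."""
    t1, t2 = raw["tbl1_size"], raw["tbl2_size"]
    return {
        "tbl_size_sum": t1 + t2,
        "max_tbl_size": max(t1, t2),
        "min_tbl_size": min(t1, t2),
        # Reciprocals: the key names below are hypothetical.
        "inv_avail_bw": 1.0 / raw["avail_bw"],
        "inv_total_exec_num": 1.0 / raw["total_exec_num"],
        "inv_cpu_core_num": 1.0 / raw["cpu_core_num"],
    }


# Sample raw feature values (made up, within the ranges given for the raw features).
raw = {
    "total_exec_num": 4,
    "cpu_core_num": 2,
    "mem_size": 2,
    "avail_bw": 100.0,
    "tbl1_size": 2.0,
    "tbl2_size": 0.5,
    "hdfs_block_num": 16,
}
features = handcrafted_features(raw)
print(features)
```

One plausible reason for the reciprocal terms is that transfer and compute time tend to scale inversely with bandwidth and parallelism, so a linear model can use 1/x directly.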
[Figure: LASSO path for duration; y-axis: coefficients (curves labeled 1 to 9); x-axis: L1 penalty (decreasing)]
[Figure: LASSO path; y-axis: coefficients (curves labeled 1 to 5); x-axis: L1 penalty (decreasing)]
Linear Regression with L1 penalty
Gradient Boosting Regression Tree: 500 ternary regression trees of depth 3
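To illustrate how an L1 penalty drives coefficients to zero along a LASSO path, here is a minimal pure-Python sketch using coordinate descent with soft thresholding. The solver choice, the absence of an intercept, and the toy data are assumptions made for illustration; the slides do not say how the model was fitted.

```python
def soft_threshold(rho: float, alpha: float) -> float:
    """Shrink rho toward zero by alpha, returning 0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_cd(X, y, alpha, n_iter=100):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent.

    Assumes no intercept and no all-zero feature column.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho = 0.0
            z = 0.0
            for i in range(n):
                # Prediction that leaves out feature j.
                partial = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n)
    return w


# Toy data (hypothetical): y depends only on the first feature.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [3.0, 6.0, 9.0, 12.0]

# Trace the path from a large penalty to none.
path = {alpha: lasso_cd(X, y, alpha) for alpha in (25.0, 1.0, 0.0)}
for alpha, w in path.items():
    print(alpha, w)
```

As the penalty decreases, the first coefficient leaves zero and grows toward its least-squares value while the irrelevant coefficient stays at zero, which is the behavior a LASSO path plot visualizes.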
[Figure: APE (%) versus number of regression trees (200 to 600) for tree depths 1 to 4]
Absolute Percentage Error:
$$\mathrm{APE}_i = \frac{|y_i - h(x_i)|}{y_i} \times 100\%$$
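The absolute percentage error is straightforward to compute per sample. A minimal sketch follows, assuming the true value y is positive (as durations and output sizes are); the function names and sample numbers are hypothetical.

```python
def ape(y_true: float, y_pred: float) -> float:
    """Absolute percentage error of one prediction; requires y_true > 0."""
    return abs(y_true - y_pred) / y_true * 100.0


def ape_all(y_true, y_pred):
    """Per-sample APE for paired sequences of true and predicted values."""
    return [ape(t, p) for t, p in zip(y_true, y_pred)]


# Hypothetical durations in seconds: measured vs. predicted.
measured = [200.0, 50.0, 80.0]
predicted = [180.0, 55.0, 80.0]
errors = ape_all(measured, predicted)
print(errors)
```

Because the error is normalized by the true value, a 20 s miss on a 200 s join and a 5 s miss on a 50 s join count the same.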
[Figure: APE (%) of LASSO, GBRT-raw, and GBRT; panels: Duration, Output Size]
[Figure: APE (%) versus dataset size (3K to 15K); panels: Duration, Output Size]
duration
data_reduction
data_reduction / duration
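A minimal sketch of how these three quantities could rank candidate pairwise joins, given predicted duration and output size. The ranking directions (shortest duration, largest data reduction, largest reduction per second), the definition of data reduction as input size minus predicted output size, and all class and function names are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class JoinEstimate:
    name: str
    duration: float     # predicted duration in seconds
    input_size: float   # total input size in GB
    output_size: float  # predicted output size in GB

    @property
    def data_reduction(self) -> float:
        # Assumed definition: how much smaller the output is than the input.
        return self.input_size - self.output_size


def pick_min_duration(candidates):
    return min(candidates, key=lambda j: j.duration)


def pick_max_data_reduction(candidates):
    return max(candidates, key=lambda j: j.data_reduction)


def pick_max_reduction_rate(candidates):
    return max(candidates, key=lambda j: j.data_reduction / j.duration)


# Hypothetical candidate joins with made-up estimates.
candidates = [
    JoinEstimate("A", duration=100.0, input_size=10.0, output_size=2.0),
    JoinEstimate("B", duration=40.0, input_size=5.0, output_size=4.0),
    JoinEstimate("C", duration=50.0, input_size=8.0, output_size=3.0),
]

print(pick_min_duration(candidates).name)
print(pick_max_data_reduction(candidates).name)
print(pick_max_reduction_rate(candidates).name)
```

With these made-up numbers each criterion picks a different join, which shows why the choice of criterion can change the resulting plan.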
Table      Location    Table      Location
lineitem   Taiwan      customer   Frankfurt
region     Singapore   orders     Sao Paulo
supplier   Sydney      nation     Northern Virginia
part       Belgium     partsupp   Oregon

8 regions
Turbo-SCTF
Turbo-MDRF
Turbo-MDRRF
[Figure: query completion time (s) of Baseline, Clarinet, Turbo-SCTF, Turbo-MDRF, and Turbo-MDRRF; panels: Q2, Q3, Q5, Q7, Q8 and Q9, Q10, Q11, Q18, Q21]
The completion time distributions of pairwise joins.
[Figure: stage completion time (s) of Baseline, Clarinet, Turbo-SCTF, Turbo-MDRF, and Turbo-MDRRF; one panel per query: Q2, Q3, Q5, Q7, Q8, Q9, Q10, Q11, Q18, Q21]
[Figure: x-axis: completion time (s); legend: JOIN, tables, reduce, maps]
The Gantt chart of the query Q21
[Figure: bandwidth BW (Mbps) over time (5:20 to 5:40) on the links Brazil–Taiwan, Brazil–Virginia, Virginia–Taiwan, Brazil–Sydney, Virginia–Sydney, and Taiwan–Sydney, with MDRRF, MDRF, SCTF, Clarinet, and Baseline marked]
Work | Data Placement / Task Scheduling / Plan Optimization | Working Mode
Geode [26] | √ √ | static
WANalytics [27] | √ √ | static
Iridium [20] | √ √ | static
SWAG [16] | √ | static
JetStream [21] | √ | static
Clarinet [25] | √ √ | static
Lube [15] | √ | dynamic
Graphene [14] | √ | static
Turbo | √ | dynamic
based on the TPC-H benchmark