Dynamic and Decentralized Global Analytics via Machine Learning Hao - PowerPoint PPT Presentation

SoCC’18, Carlsbad, CA, USA Dynamic and Decentralized Global Analytics via Machine Learning Hao Wang 1 , Di Niu 2 , Baochun Li 1 1 University of Toronto, 2 University of Alberta

Query Processing Query Plan π startName , studioName 1.CREATE VIEW MoviesOf1996 AS 2. SELECT * 3. FROM Movies 4. WHERE year = 1996; ⋈ 5. 6.SELECT starName, studioName σ year = 1996 7.FROM MoviesOf1996 JOIN StarsIN; StarsIn Movies ! 2

Decentralized Global Analytics SQL Query ⋈ DAG Hive / SparkSQL parse QEP Map-Reduce Candidates ⋈ select QEP Hadoop / Spark DC1 DC3 ⋈ DC2 … ! 3

Fluctuating WAN 800 600 400 Bandwidth (Mbps) Bandwidth in total 200 150 100 50 Bandwidth per connection 3:00 3:15 3:30 3:45 4:00 4:15 4:30 4:45 5:00 Time Google Cloud Taiwan Iowa iperf -t 10 -P 5 ! 4

A Toy Example 01. SELECT 02. C.name, O.orderstatus, 03. L.discount, PS.availqty customer order DC 1 DC 2 400 Mbps 04. FROM 0.5 GB 3.3 GB 05. customer as C, 300 Mbps 100 Mbps 06. order as O, 150 Mbps 150 Mbps 07. lineitem as L, 08. partsupp as PS lineitem partsupp 09. WHERE O.orderkey == L.orderkey, 500 Mbps DC 3 DC 4 15 GB 2.3 GB 10. AND PS.partkey == L.partkey, 11. AND PS.suppkey == L.suppkey, 12. AND C.custkey == O.custkey ! 5

Query Plan Candidates Plan A JOIN JOIN JOIN • The worst plan 1.4G 0.5G 2G 0.5G 1.7G 0.5G JOIN JOIN C C • The baseline JOIN JOIN 1.7G 3.3G 2.6G 2.3G Plan B JOIN JOIN PS O 2.3G 15G 3.3G 0.5G • The initial optimal plan 3.3G 15G 2.3G 15G PS L O C O L PS L • Selected by Clarinet Plan A Plan B Plan C Plan C • … ! 6

A Toy Example 400 BW (Mbps) 300 JOIN JOIN 200 Start End 1.4G 0.5G 1.7G 0.5G 100 t (s) JOIN 0 50 100 150 200 250 C JOIN JOIN 1.7G 3.3G JOIN O 2.3G 15G 3.3G 0.5G 2.3G 15G customer order PS L O C PS L DC 1 DC 2 400 Mbps 0.5 GB 3.3 GB Plan B Plan C 300 Mbps 100 Mbps 150 Mbps 150 Mbps Plan C • The adjusted plan lineitem partsupp 500 Mbps • Adapt to bandwidth fluctuation DC 3 DC 4 15 GB 2.3 GB ! 7

Query Completion Time Δ t: the data movement time Centralized plan Query Completion Time (s) 600 Baseline (Plan A) 550 + Δ t Plan selected by Clarinet (Plan B) 500 Dynamic adjusted plan (Plan C) 450 Δ t The data movement time 400 ! 8

Dynamic Query Planning Challenges: JOIN JOIN 1.4G 0.5G 1.7G 0.5G • Accurately estimating runtime cost JOIN C of query plans. JOIN JOIN 1.7G 3.3G JOIN O 2.3G 15G 3.3G 0.5G • Minimize overall completion time of 2.3G 15G PS L O C PS L queries. Plan B Plan C ! 9

Turbo Data Turbo SQL Query Hive / SparkSQL Model Training parse QEP Planning QEP Candidates Adjustment select QEP Cost Estimator Model Hadoop / Spark DAG DC1 DC3 DC2 … Map-Reduce Evaluation ! 10

Prediction Target ⋈ ⋈ (duration, output size) ⋈ ! 11

Data Generation JOIN reduce tables maps 1.Operator —> Map stage σ price > 100 filter(order o=>(o.price>100)) 2. Operator —> MapReduce stages map(customer c=>(c.custkey, c.values)) orders customer map(order o=>(o.custkey, o.values)) ⋈ reduce(custkey, values) ! 12

Data Generation 15K records Raw Features Range 1 − 16 total_exec_num 1 − 8 per executor cpu_core_num 1 − 4 GB per executor mem_size 5 − 1000 Mbps per link avail_bw tbl1_size , tbl2_size 0 . 3 − 12 GB per table 1 − 90 hdfs_block_num ! 13

Data Preprocessing [1, a , b , c , a 2 , ab , ac , b 2 , bc , c 2 ] [ a , b , c ] 1. Handcrafting features Handcrafted Features 2. Polynomial feature crossing tbl_size_sum = sum ( tbl1_size , tbl2_size ) 3. Feature selection by max_tbl_size = max ( tbl1_size , tbl2_size ) LASSO path min_tbl_size = min ( tbl1_size , tbl2_size ) 1 / avail_bw , 1 / total_exec_num , 1 / cpu_core_num ! 14

Feature Selection duration 1. max_tbl_size 4. cpu_core_num 7. total_exec_num 2. tbl_size_sum 5. max_tbl_size / bw 8. mem_size 6. 1/bw 2 3. min_tbl_size 9. Other features 150 5 Coefficients 100 1 50 2 7 6 4 3 8 0 9 L 1 penalty (decreasing) ! 15

Feature Selection output size 1. max_tbl_size 3. tbl_size_sum 5. Other features 4. min_tbl_size 2 2. tbl1_size 5×10 6 4×10 6 1 100×10 3 3×10 6 Coefficients 4 0 5 2×10 6 3 −100×10 3 1×10 6 2 0 L 1 penalty (decreasing) ! 16

Training 100 LASSO Regression depth-1 80 depth-2 depth-3 Error (%) depth-4 Linear Regression with L1 penalty 60 APE (%) 40 GBRT 20 0 0 200 400 600 Gradient Boosting Regression Tree Number of regression trees 500 ternary regression trees of depth 3 ! 17

Model Test APE i = | y i − h ( x i ) | Absolute × 100%. Percentage Error: y i 40 30 30 APE (%) APE (%) 20 20 10 10 0 0 LASSO GBRT-raw GBRT LASSO GBRT-raw GBRT Duration Output Size ! 18

Model Test 30 30 APE (%) APE (%) 20 20 10 10 0 0 3K 5K 7K 9K 11K 13K 15K 3K 5K 7K 9K 11K 13K 15K Dataset size Dataset size Duration Output Size ! 19

Dynamic Planning Strategies • Shortest Completion Time First (SCTF) duration • Maximum Data Reduction First (MDRF) data_reduction • Maximum Data Reduction Rate First (MDRRF) data_reduction / duration ! 20

Evaluation Setup • TPC-H benchmark Table Location Table Location Taiwan Frankfurt lineitem customer • Google Cloud Singapore Sao Paulo region orders Sydney Northern Virginia supplier nation - 33 instances across Belgium Oregon part partsupp 8 regions ! 21

Query Baseline Turbo-SCTF Turbo-MDRRF Clarinet Turbo-MDRF Query completion time (s) 600 400 Turbo-SCTF 200 • 25.1-38.5% 0 Turbo-MDRF Q2 Q3 Q5 Q7 Q8 Query completion time (s) 2000 • 12.6-37.1% 1500 Turbo-MDRRF 1000 • 25.2-41.4% 500 0 Q9 Q10 Q11 Q18 Q21 ! 22

Pairwise Join JOIN reduce tables maps Baseline Turbo-SCTF Turbo-MDRRF Clarinet Turbo-MDRF Stage completion time (s) Stage completion time (s) 300 Completion time (s) 300 40 300 200 200 200 200 20 100 100 100 The completion time 100 distributions of 0 0 0 0 0 Q2 Q3 Q5 Q7 Q8 pairwise joins. 400 600 Completion time (s) 200 40 300 300 400 200 200 100 20 200 100 100 0 0 0 0 0 Q9 Q10 Q11 Q18 Q21 ! 23

Case Study Brazil--Taiwan Virginia--Taiwan Virginia--Sydney Brazil--Virginia Brazil--Sydney Taiwan--Sydney 800 600 BW (Mbps) 400 The Gantt chart of 200 the query Q21 0 Baseline Clarinet SCTF MDRF MDRRF 5:20 5:30 5:40 Time ! 24

Related Work Data Task Plan Working Work Placement Scheduling Optimization Mode √ √ Geode [26] static √ √ WANanalytics [27] static √ √ Iridium [20] static √ SWAG [16] static √ JetSteam [21] static √ √ Clarinet [25] static √ Lube [15] dynamic √ Graphene [14] static √ Turbo dynamic ! 25

Conclusion • Turbo: dynamic query planning with awareness of WAN bandwidths • Data-driven cost estimation of pairwise join with accuracy over 95% • Greedy strategies that reduces the query completion times by up to 41% based on the TPC-H benchmark ! 26

The End Thank You

Dynamic and Decentralized Global Analytics via Machine Learning Hao - PowerPoint PPT Presentation

SoCC18, Carlsbad, CA, USA Dynamic and Decentralized Global Analytics via Machine Learning Hao Wang 1 , Di Niu 2 , Baochun Li 1 1 University of Toronto, 2 University of Alberta Query Processing Query Plan startName , studioName 1.CREATE

Analytics and Data Summit 2020 Analytics and Data Summit 2020 Analytics and Data Summit 2020

Decentralized Clinical Trials In 2020: A Global Survey Decentralized and Hybrid Trials 2020

Undergraduate Business Analytics Minor Spreadsheet Analytics BANA-2081 Business Analytics

Decentralized Consensus Proto cols 1 Goals of the lecture Decentralized Consensus

Google Analytics Overview Whats Google Analytics? The Google Analytics

Introduction to Talent Analytics and Interim View 01 Overview Erich OSaben Talent Analytics

Google Analytics A beginners guide What is Google Analytics? Google Analytics is not magic.

Document Name Solar Analytics - Rooftop PV energy analytics PREPARED BY: Your Name, Your Title

Architecture 3.0 Landscape Analytics Jrgen Dllner Hasso-Plattner-Institut Jrgen

A Decentralized and Distributed E-voting Scheme Based on Cryptographic Shuffles Decentralized

Fixing healthcare data exchange with decentralized FOSS Protect your API's with a decentralized

Decentralized Trust Management for Decentralized Trust Management for Ad-Hoc Peer-to-Peer

monitoring decentralized specifications Antoine El-Hokayem Ylis Falcone Univ. Grenoble Alpes,

PERPETUAL PERPETUAL DECENTRALIZED MANAGEMENT OF DIGITAL OBJECTS DECENTRALIZED MANAGEMENT OF

Alternatives to Blockchains Sarah Meiklejohn (University College London) fully decentralized

Decentralized Deduplication in SAN Cluster File Systems Austin T. Clements Irfan Ahmad

Task and Mo*on Planning Dylan Hadfield-Menell UC Berkeley

Indexes, Query Optimization PDBM 7.5, 12.3.5 Dr. Chris Mayfield Department of Computer Science

CAPER and the eCon Planning Suite July 26, 2018 Webinar Instructions Presenters Jon Kunz,

Dynamic Scheduling of Network Updates Xin Jin Hongqiang Harry

CMS TMS Integration How Using Available Standards Could Help Loc Dufresne de Virel

Strategic Plan Year 4 Review June 20, 2019 Board of Education Regular Meeting Our Mission 2

15 July disjoint work with 2020 Evan Cavallo / Carlo Angiuli Anders Mrtberg / Andrea

Elementary School Redistricting Presentation to the School Board Jan 8 2019 By ARCBridge