Semantics-Aware Prediction for Analytic Queries in MapReduce

Semantics-Aware Prediction for Analytic Queries in MapReduce Environment
Weikuan Yu, Zhuo Liu, Xiaoning Ding
Florida State University, Auburn University, New Jersey Institute of Technology

Background
- MapReduce is a popular data-centric programming model.
- Hive and Pig are popular data warehouse systems.
  - More than 40% of Hadoop jobs at Yahoo! were Pig programs back in 2009, and even more are generated by Hive now.
  - At Facebook, 95% of MR jobs are generated by Hive.
- In Hive, each SQL query is compiled and translated into a DAG (Directed Acyclic Graph) of MapReduce jobs with inter-job dependencies.
Source: ICDE'10 by Facebook
[Figure: example query DAG with jobs AGG (J1), Join (J2), Join (J3), AGG (J4)]
P2S2: S2

Motivation
- Semantic gap between Hive and Hadoop
  - Hadoop is unaware of the dependencies and inter-job relationships within a query and treats all jobs the same.
  - Without this awareness, it is difficult for Hadoop to schedule the jobs that belong to a query efficiently.
- Problems:
  - Suboptimal query response time
  - Unfairness among queries

Efficiency issue for queries with varied sizes
[Figure: job DAGs of queries QA (J1, J2), QB (J1 to J4), and QC (J1, J2), and the resulting resource allocation]
- In this test, QA, QB and QC are issued in sequence, and QB is a large query.
- Under HCS, the jobs of different queries execute in an interleaved fashion.

Query Delays shown by Gantt Chart
[Figure: Gantt chart of the jobs of QA (J1, J2), QB (J1 to J4), and QC (J1, J2), marking execution, map stall, and reduce stall. X-axis: elapsed time (seconds), 0 to 700]
- QA arrives first with its J1 job, followed by QB and QC with their jobs.
- QA-J2 is delayed by QB-J1, and QC-J2 is delayed by QB-J2.
- Query response time can be improved if the scheduler is aware of the query semantics, and therefore of the relationships among the jobs.

Semantics-Aware Query Prediction
- Three main techniques
  - Semantics extraction (DAG, operator type, predicates, etc.)
  - Selectivity estimation
  - Query prediction
[Figure: system architecture. Hive side: HiveQL Queries, Parser, Semantics Analyzer, Execution Engine, Semantics Extraction. Hadoop side: JobTracker with JobListener, Selectivity Estimation, and Query Prediction, above the TaskTrackers. Edge labels: Job & Results, Semantics]

Selectivity estimation for a query's jobs
- Selectivity estimation
  - Predict each job node's intermediate (Med) and output (Out) data sizes recursively along the DAG, from bottom to top.
  - For different job types (e.g., Groupby, Join, Select), we rely on formulas and offline-built histograms to estimate their selectivities.
- Logic: selectivity estimation => job/query resource estimation and time modeling => efficient query scheduling
[Figure: example DAG of AGG, Join, and SORT jobs over tables T1, T2, T3, annotating a job's In, Med (map side), and Out (reduce side) data]

Selectivity Estimation
- Intermediate selectivity (IS) is used for estimating the MOF (map output file) size
  - $IS = D_{Med} / D_{In}$
- Final selectivity (FS) is defined as:
  - $FS = |Out| \cdot W_{Out} / D_{In}$
- Predicate selectivity (ratio of selected rows to input rows)
  - $S_{pred} = |Med| / |In|$
- Projection selectivity (ratio of selected columns to tuple width)
  - $S_{proj} = \sum_{i \in selected} width(col_i) / TupleWidth$
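
The four ratios defined on this slide can be computed directly once row counts, column widths, and data sizes are known. The following is a minimal Python sketch; the function names, the use of bytes as the size unit, the reading of W_Out as the output tuple width, and all sample numbers are illustrative assumptions rather than values from the slides.

```python
def intermediate_selectivity(d_med, d_in):
    """IS = D_Med / D_In: intermediate data size over input data size."""
    return d_med / d_in


def final_selectivity(out_rows, w_out, d_in):
    """FS = |Out| * W_Out / D_In (W_Out assumed to be the output tuple width)."""
    return out_rows * w_out / d_in


def predicate_selectivity(med_rows, in_rows):
    """S_pred = |Med| / |In|: fraction of input rows that pass the predicate."""
    return med_rows / in_rows


def projection_selectivity(selected_widths, tuple_width):
    """S_proj = sum of selected column widths over the full tuple width."""
    return sum(selected_widths) / tuple_width


# Hypothetical table: 1000 rows of 100 bytes each (D_In = 100000 bytes).
# The predicate keeps 250 rows; two columns of 8 and 12 bytes are projected.
s_pred = predicate_selectivity(250, 1000)
s_proj = projection_selectivity([8, 12], 100)
is_value = intermediate_selectivity(250 * 20, 100000)
fs_value = final_selectivity(250, 20, 100000)
print(s_pred, s_proj, is_value, fs_value)
```

In this made-up case every selected row reaches the output unchanged, so IS and FS coincide; for jobs that aggregate or join, they would generally differ.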

Intermediate Selectivity - IS
- For an extract job such as select or order by: $IS = S_{pred} \cdot S_{proj}$
- For a join job: $IS = S_{pred_1} \cdot S_{proj_1} \cdot r_1 + S_{pred_2} \cdot S_{proj_2} \cdot (1 - r_1)$
- Groupby can involve a local combine: $IS = S_{comb} \cdot S_{proj}$
- For clustered keys: $S_{comb} = \min\left(1, \frac{T.d_{xy}}{|T| \cdot S_{pred}}\right) \cdot S_{pred} = \min\left(S_{pred}, \frac{T.d_{xy}}{|T|}\right)$
- For randomly distributed keys: $S_{comb} = \min\left(S_{pred}, \frac{T.d_{xy}}{|T| / N_{maps}}\right)$
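
The per-job-type rules for IS can be written as small functions. This is a hedged Python sketch: the function names are hypothetical, `r1` is assumed to be the share of the join input that comes from the first table, and `distinct_keys` is assumed to correspond to T.dxy (the number of distinct grouping-key values). The slide defines neither symbol explicitly.

```python
def is_extract(s_pred, s_proj):
    """Extract jobs (select, order by): IS = S_pred * S_proj."""
    return s_pred * s_proj


def is_join(s_pred1, s_proj1, s_pred2, s_proj2, r1):
    """Join jobs: weighted mix of both inputs; r1 is table 1's assumed input share."""
    return s_pred1 * s_proj1 * r1 + s_pred2 * s_proj2 * (1 - r1)


def s_comb_clustered(s_pred, rows, distinct_keys):
    """Clustered keys: S_comb = min(S_pred, T.dxy / |T|)."""
    return min(s_pred, distinct_keys / rows)


def s_comb_random(s_pred, rows, distinct_keys, n_maps):
    """Randomly distributed keys: S_comb = min(S_pred, T.dxy / (|T| / N_maps))."""
    return min(s_pred, distinct_keys / (rows / n_maps))


def is_groupby(s_comb, s_proj):
    """Groupby jobs with a local combine: IS = S_comb * S_proj."""
    return s_comb * s_proj


# Hypothetical numbers: 1000 rows, 100 distinct keys, 4 map tasks, S_pred = 0.5.
clustered = s_comb_clustered(0.5, 1000, 100)
scattered = s_comb_random(0.5, 1000, 100, 4)
print(clustered, scattered, is_groupby(scattered, 0.2))
```

With these numbers the combiner is less effective for randomly distributed keys, because each of the four map tasks may see every key, so less data is merged before the shuffle.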

Final Selectivity - Output
- For an extract job:
  - For a "top k" job, $|Out| = \min(|In|, k)$
  - For an "order by" job, $|Out| = |In|$
- For a groupby job:
  - $|Out| = \min(|T| \cdot S_{pred}, T.d_{xy})$
- For a join job:
  - Equi-join with uniform keys: $|Out| = |T_1 \bowtie T_2| = |T_1| \cdot |T_2| \cdot \frac{1}{\max(T_1.d_{xy}, T_2.d_{xy})}$
  - Chained joins: $|Out| = |T_1.pred_1 \bowtie T_2.pred_2 \bowtie T_3.pred_3| = S_{pred_1} \cdot S_{pred_2} \cdot S_{pred_3} \cdot \max(|T_1|, |T_2|, |T_3|)$

An example for selectivity estimation
- Predict jobs' selectivity recursively in a query.
  - Job 1 (Join n ⋈ s): $|Out| = 0.96 \cdot 25 \cdot 10000 \cdot 1/\max(25, 25) = 9600$
  - Job 2 (Join n ⋈ s ⋈ ps): $|Out| = 0.96 \cdot \max(25, 10000, 800000) = 768000$
  - Job 3 (Group by): $|Out| = \min(768000, 200000) = 200000$
[Figure: three-job DAG over tables n (25 rows, 24 after the predicate), s (10000 rows), and ps (800000 rows), with intermediate results MED1 and MED2 and the final result]
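
The three-job example can be replayed in a few lines of Python. This is a sketch only: the function names are hypothetical, the predicate selectivities of 1.0 for tables s and ps and for the group-by are assumptions (the slide shows only 0.96), and 200000 is read from the slide as the distinct-key count of the group-by. The equi-join helper takes the predicate selectivity as an extra factor, as the slide's worked arithmetic does.

```python
import math


def groupby_rows(in_rows, s_pred, distinct_keys):
    """Groupby: |Out| = min(|T| * S_pred, T.dxy)."""
    return round(min(in_rows * s_pred, distinct_keys))


def equi_join_rows(s_pred, t1_rows, t2_rows, t1_distinct, t2_distinct):
    """Equi-join with uniform keys, scaled by the predicate selectivity."""
    return round(s_pred * t1_rows * t2_rows / max(t1_distinct, t2_distinct))


def chained_join_rows(s_preds, table_rows):
    """Chained joins: product of predicate selectivities times the largest table."""
    return round(math.prod(s_preds) * max(table_rows))


# Job 1: n join s, with 25 distinct join keys on both sides.
job1 = equi_join_rows(0.96, 25, 10000, 25, 25)
# Job 2: n join s join ps; only the predicate on n filters rows (assumed).
job2 = chained_join_rows([0.96, 1.0, 1.0], [25, 10000, 800000])
# Job 3: group-by over the output of job 2 (bottom-up dependency).
job3 = groupby_rows(job2, 1.0, 200000)
print(job1, job2, job3)  # prints: 9600 768000 200000
```

Job 3 consumes the estimate produced for Job 2, which is the bottom-up recursion along the DAG in miniature.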

Multivariate Time Prediction
- List of considered input features
  - Operators
  - Input data
  - Output data
  - Data growth

Job Time Prediction Model
- Model job execution time based on selectivity estimation.
- Trained on over 5647 MR jobs from about 1000 TPC-DS and TPC-H queries at different scales.
- The model parameters are trained separately for extract, groupby, and join jobs.

Task Time Prediction Model
- Data size of the i-th task: $TD_{In_i}$ and $TD_{Out_i}$
- Predicted time for the i-th task: $ET_i$
- $ET_i = k_0 + k_1 \cdot TD_{In_i} + k_2 \cdot TD_{Out_i} + k_3 \cdot P(1 - P) \cdot TD_{In_i}$
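
Evaluating the task-time model is a single linear expression once the coefficients are known. The Python sketch below uses made-up coefficients and data sizes purely for illustration; in the slides the k values come from training, and P is used without a definition, so it is treated here as an opaque parameter between 0 and 1.

```python
def predict_task_time(td_in, td_out, p, coeffs):
    """ET_i = k0 + k1*TD_In_i + k2*TD_Out_i + k3*P*(1-P)*TD_In_i."""
    k0, k1, k2, k3 = coeffs
    return k0 + k1 * td_in + k2 * td_out + k3 * p * (1 - p) * td_in


# Hypothetical coefficients (k0..k3) and task data sizes, not fitted values.
coeffs = (2.0, 0.5, 0.25, 1.0)
estimate = predict_task_time(td_in=100.0, td_out=50.0, p=0.5, coeffs=coeffs)
print(estimate)  # prints: 89.5
```

The last term vanishes when P is 0 or 1 and peaks at P = 0.5, so it adds an input-proportional cost that is largest for intermediate values of P.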

Scheduling with Semantics Awareness
- Semantics-aware resource demand
  - Weighted Resource Demand (WRD): aggregates the demand from all map tasks ($MT_i$) and reduce tasks ($RT_i$)
  - $WRD = \sum_i (MT_i \cdot w_{M_i}) + \sum_i (RT_i \cdot w_{R_i})$, where $w_{M_i}$ and $w_{R_i}$ stand for the per-task factors whose symbol is illegible in the source
- Experimented with a simple greedy scheduling policy
  - Prioritize the smallest queries for fast turnaround
  - Smallest WRD First (SWRD) query scheduling
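
The greedy policy amounts to sorting pending queries by their aggregate demand. The following Python sketch is an illustration under stated assumptions: each task is modeled as a hypothetical (demand, weight) pair, the `Query` class and the sample numbers are invented, and the actual scheduler integration with the Hadoop JobTracker is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (resource demand, weight) for one task; both are hypothetical units.
Task = Tuple[float, float]


@dataclass
class Query:
    name: str
    map_tasks: List[Task]
    reduce_tasks: List[Task]


def wrd(query: Query) -> float:
    """Sum demand * weight over all map tasks and all reduce tasks."""
    map_part = sum(d * w for d, w in query.map_tasks)
    reduce_part = sum(d * w for d, w in query.reduce_tasks)
    return map_part + reduce_part


def swrd_order(queries: List[Query]) -> List[Query]:
    """Smallest WRD First: schedule queries in ascending order of WRD."""
    return sorted(queries, key=wrd)


queries = [
    Query("QA", [(1.0, 10.0), (1.0, 10.0)], [(1.0, 5.0)]),
    Query("QB", [(1.0, 40.0)] * 4, [(1.0, 30.0)]),
    Query("QC", [(1.0, 8.0)], [(1.0, 4.0)]),
]
print([q.name for q in swrd_order(queries)])  # prints: ['QC', 'QA', 'QB']
```

With these invented numbers the large query QB is placed last, so the two small queries are not stalled behind its jobs.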

Evaluation setup
- Benchmarks
  - Built with TPC-H and TPC-DS queries and Terasort/Grep/Wordcount MapReduce jobs.
  - Submitted with Poisson arrival intervals.
- Metrics
  - Accuracy of the prediction via semantics awareness
  - Efficiency: query execution time
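
One way to generate such a submission schedule is sketched below in Python. It assumes that "Poisson interval" means arrivals following a Poisson process, in which case the gaps between submissions are exponentially distributed. The mean interval and the seed are hypothetical, since the slides do not state the arrival rate.

```python
import random


def poisson_arrival_times(n, mean_interval_s, seed=0):
    """Return n submission times (seconds) for a Poisson arrival process."""
    rng = random.Random(seed)  # fixed seed so a workload can be replayed
    t = 0.0
    times = []
    for _ in range(n):
        # Inter-arrival gaps of a Poisson process are exponential with
        # rate 1 / mean_interval_s.
        t += rng.expovariate(1.0 / mean_interval_s)
        times.append(t)
    return times


arrivals = poisson_arrival_times(5, mean_interval_s=30.0)
print([round(a, 1) for a in arrivals])
```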

Estimation of Job Execution Time
- Accuracy
  - On average, a 13.98% error rate for the test set of jobs.
[Figure: Job Time Estimation]

Estimation of Task Execution
- Map task execution time
  - Join operators lead to lower accuracy
- Reduce task execution time

Validation of job and query time estimation
- Predicted time accuracy for queries
  - The error rate is 8.3% on average for the 22 TPC-H queries at the 100G scale.
[Figure: Query Time Estimation. Actual vs. estimated query response time (sec) for Q1 to Q22; y-axis 0 to 1400]

Benefits of Semantics-Aware Scheduling
- Execution time of queries
  - Compared to HFS, SWRD improves the execution time of the Bing and Facebook workloads by 44% and 40%, respectively.
  - Compared to HCS, SWRD improves them by 27.4% and 72.8%, respectively.
[Figure: execution time (seconds) of the Bing and Facebook workloads under HFS, HCS, and SWRD; y-axis 0 to 2000]

Conclusion and Future Work
- Introduced cross-layer semantics extraction and percolation to increase the semantics awareness of the Hadoop job scheduler
- Formalized the estimation of selectivity for intermediate data and final output
- Developed a multivariate prediction model for job and task execution time and validated its accuracy
  - Leveraged semantics awareness for efficient query scheduling in Hive
- Plan to pursue further integration of semantics awareness in complex query scheduling and other data analytics systems.

Acknowledgement
