Trace-Based Evaluation of Job Runtime and Queue Wait Time Predictions in Grids
Ozan Sonmez, Nezih Yigitbasi, Alexandru Iosup, Dick Epema
Parallel and Distributed Systems Group (PDS), Department of Software Technology, Faculty EEMCS, Delft, the Netherlands
15.06.2009

Introduction
• Grids
  • Multi-site and heterogeneous resource structure
  • Dynamic and heterogeneous workloads
Highly variable job runtimes and queue wait times limit the efficient use of the resources by users

Introduction (cont.)
• Remedy: Prediction-based methods
  • Extensive body of research for space-shared Parallel Production Environments (PPEs)
  • Grids differ from traditional PPEs in both structure and typical use (e.g., heterogeneous resources, more bursty job arrivals)
• Goal: A systematic evaluation of job runtime and queue wait time predictions in grids using real traces

What to predict?
• Job Runtime
• Queue Wait Time
• CPU Load
• Resource Availability
• Resource Failure Rates

What to predict?
• Job runtime predictions for
  • Improving the performance of backfilling in batch queueing systems*
  • Predicting queue wait times
• Queue wait time predictions for
  • Guiding the decisions of a user/grid scheduler
* D. Tsafrir, Y. Etsion, and D. G. Feitelson. Backfilling Using System-Generated Predictions Rather than User Runtime Estimates. IEEE TPDS, 18(6):789–803, 2007.

Prediction Methods
• Time Series-based (easy to implement, fast delivery of predictions)
• Analytical Benchmarking
• Code Profiling
• Genetic Algorithms
• Instance-based Learning

Time Series Prediction
• Based on historical (classified) data
• Time ordered set of past observations
• Example: Last2

Grid Workload Traces*

Trace      Type        # CPUs  Duration (Months)  # Tasks  Parallel Jobs
DAS2       Research       400                 18    1.1 M            66%
GRID5000   Research      2500                 27    1.0 M            45%
DAS3       Research       544                 18      2 M            15%
SHARCNET   Research      6828                 12    1.2 M            10%
AUVER      Production     475                 12    0.4 M             0%
NORDU      Production    2000                 24    0.8 M             0%
LCG        Production   24515                  4    0.2 M             0%
NGS        Production       -                  6    0.6 M             0%
GRID3      Production    3500                 18    1.3 M             0%

* The Grid Workloads Archive: http://gwa.ewi.tudelft.nl/pmwiki/

Grid Workload Traces: Bursty Job Arrivals (5 minute intervals)
[Figure: job arrivals per 5-minute interval; panels: DAS3, SHARCNET, NGS, GRID3]
• Grids have bursty job arrivals!
• Bursty arrivals reduce predictability!
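
The per-interval view on this slide can be approximated from any trace that records submission times. The following is a minimal sketch, assuming timestamps in seconds; the function names and the peak-to-mean indicator are illustrative choices, not something taken from the slides.

```python
from collections import Counter


def arrivals_per_interval(submit_times, interval_s=300):
    """Count job arrivals in consecutive fixed-width intervals.

    submit_times: job submission timestamps in seconds (any order).
    Returns one count per interval, starting at the earliest arrival.
    """
    if not submit_times:
        return []
    start = min(submit_times)
    counts = Counter(int((t - start) // interval_s) for t in submit_times)
    n_bins = max(counts) + 1
    return [counts.get(i, 0) for i in range(n_bins)]


def peak_to_mean(counts):
    """Crude burstiness indicator (an assumption): peak count over mean count."""
    return max(counts) / (sum(counts) / len(counts))


if __name__ == "__main__":
    times = [0, 10, 20, 30, 900]          # four jobs at once, one much later
    per_bin = arrivals_per_interval(times)
    print(per_bin, peak_to_mean(per_bin))
```

A ratio close to 1 indicates evenly spread arrivals; larger values indicate that a few intervals carry most of the submissions.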

Research Questions
1. What is the performance of job runtime predictors in grids?
2. What is the performance of queue wait time predictors in grids?
3. Can prediction-based grid scheduling policies perform better than traditional policies?

Job Runtime Predictions
• We have evaluated the accuracy of five time series methods under four job classifications
• Time series methods
  • Last
  • Last2
  • Running Mean (RM)
  • Sliding Median (SM)
  • Exponential Smoothing (ES)
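
The five methods listed above can be sketched in a few lines each. This is an illustrative sketch, not the authors' implementation: Last2 is taken here to be the mean of the two most recent runtimes, and the window size and smoothing factor `alpha` are assumed values that the slides do not specify.

```python
from statistics import mean, median


def last(history):
    """Predict the most recent observed runtime."""
    return history[-1]


def last2(history):
    """Predict the mean of the two most recent runtimes (assumed definition)."""
    return mean(history[-2:])


def running_mean(history):
    """Predict the mean of all past runtimes."""
    return mean(history)


def sliding_median(history, window=5):
    """Predict the median of the last `window` runtimes (window is an assumption)."""
    return median(history[-window:])


def exp_smoothing(history, alpha=0.5):
    """Simple exponential smoothing; alpha is an assumed smoothing factor."""
    estimate = history[0]
    for x in history[1:]:
        estimate = alpha * x + (1 - alpha) * estimate
    return estimate


if __name__ == "__main__":
    runtimes = [100, 200, 400]  # past runtimes of one job class, oldest first
    for predictor in (last, last2, running_mean, sliding_median, exp_smoothing):
        print(predictor.__name__, predictor(runtimes))
```

Each function takes the time-ordered runtimes of one job class, so the quality of the prediction depends on how the history is partitioned into classes.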

Job Runtime Predictions
• Job Classification Methods
  • Create classes according to job attributes
  • Site, User, User on Site, (User + Application Name + Job Size) on Site
• Performance Metric
  • P: Predicted runtime
  • T_r: Actual runtime
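
One way to picture the four classifications is as different keys under which past runtimes are grouped before a time series method is applied. The sketch below assumes jobs are dictionaries with hypothetical field names (`site`, `user`, `app`, `size`); the class and dictionary names are also illustrative.

```python
from collections import defaultdict

# Each classification maps a job to the key of the class it belongs to.
# The field names are assumptions made for this sketch.
CLASSIFIERS = {
    "site": lambda j: (j["site"],),
    "user": lambda j: (j["user"],),
    "user_on_site": lambda j: (j["user"], j["site"]),
    "user_app_size_on_site": lambda j: (j["user"], j["app"], j["size"], j["site"]),
}


class ClassifiedHistory:
    """Keeps a separate, time-ordered runtime history per job class."""

    def __init__(self, classifier):
        self.key = CLASSIFIERS[classifier]
        self.runtimes = defaultdict(list)

    def record(self, job, runtime):
        """Store the observed runtime of a finished job in its class."""
        self.runtimes[self.key(job)].append(runtime)

    def history(self, job):
        """Return past runtimes of the class the given job falls into."""
        return list(self.runtimes.get(self.key(job), []))


if __name__ == "__main__":
    hist = ClassifiedHistory("user_app_size_on_site")
    job = {"site": "fs0", "user": "alice", "app": "sim", "size": 8}
    hist.record(job, 120.0)
    hist.record({**job, "size": 16}, 300.0)  # different job size, different class
    print(hist.history(job))
```

A more specific key yields shorter but more homogeneous histories, which is the trade-off behind comparing the four classifications.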

Job Runtime Predictions
Classification: (User + Application Name + Job Size) on Site
• w/o Cl: best results from the other three classifications
• w Cl: results with this classification
• More specific classification improves the accuracy
• No dominant prediction method

Job Runtime Predictions
[Figure: panels: Research Grids, Production Grids; lower curves have higher accuracy]
• Job runtimes are predicted more accurately in research grids

Job Runtime Predictions: Summary of the results
• More specific classification improves job runtime prediction performance
• Job runtime prediction accuracy is low across all grids (except SHARCNET)
  • Bursty Arrivals: Same prediction error is made for all the jobs submitted together
  • Lack of Stationarity (no constant long-term mean and variance)

Queue Wait Time Predictions
• Point-value predictions
  • Simulate the local scheduling policy with predicted job runtimes to predict job queue wait times
• Upper-bound predictions
  • Predict upper bounds for queue wait times with a specified confidence level
  • Obviate the need to know the internal operation of local scheduling policies

Point-Value Predictions
• Simulation Model
  • FCFS as the local scheduling policy
  • Jobs assigned to their original execution sites
  • A point-value predictor runs on each site
  • Job runtimes are predicted with Last2
• Prediction Correction Mechanism
  • On departure, update the predicted runtimes of both the queued and the running jobs accordingly
• Traces: DAS2, DAS3, GRID5000, and AUVER
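
The idea of simulating FCFS with predicted runtimes can be sketched as follows. This is a simplified illustration under stated assumptions: a single site with a fixed number of CPUs, strict FCFS without backfilling, and predicted runtimes supplied by the caller. The correction mechanism is not modelled, and the function name and argument layout are hypothetical.

```python
import heapq


def predict_wait_fcfs(total_cpus, running, queued, new_job_cpus):
    """Predict the queue wait time of a job appended to an FCFS queue.

    running: list of (cpus, predicted_remaining_s) for jobs now executing.
    queued:  list of (cpus, predicted_runtime_s) in FCFS order.
    Returns the predicted wait in seconds, measured from now (time 0).
    """
    free = total_cpus - sum(cpus for cpus, _ in running)
    # Min-heap of (predicted finish time, cpus released at that time).
    finishing = [(remaining, cpus) for cpus, remaining in running]
    heapq.heapify(finishing)
    now = 0.0
    # The new job goes last; its own runtime does not affect its wait time.
    for cpus, runtime in list(queued) + [(new_job_cpus, 0.0)]:
        if cpus > total_cpus:
            raise ValueError("job requests more CPUs than the site has")
        # Advance time until enough CPUs are free for the head of the queue.
        while free < cpus:
            finish, released = heapq.heappop(finishing)
            now = max(now, finish)
            free += released
        free -= cpus
        heapq.heappush(finishing, (now + runtime, cpus))
    return now


if __name__ == "__main__":
    # 4-CPU site: one 4-CPU job with 100 s left, one 2-CPU job queued for 50 s.
    print(predict_wait_fcfs(4, [(4, 100.0)], [(2, 50.0)], new_job_cpus=4))
```

Because every start time is derived from predicted runtimes of the jobs ahead in the queue, runtime prediction errors propagate directly into the predicted wait time.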

Point-Value Predictions
[Figure: DAS3]
• Accuracy of the point-value predictor is low
• Correction mechanism improves the prediction accuracy (1% to 10%)

Upper-Bound Predictions
• Binomial Method Batch Predictor (BMBP)*
  • Predicts the specified quantile of the wait time distribution with a specified confidence level
• A predictor based on Chebyshev's Inequality
  • No more than 1/k² of the values are more than k standard deviations away from the mean
• We consider a quantile (for BMBP) and a confidence level of 95%
• Traces: DAS2, DAS3, GRID5000, and AUVER
* J. Brevik, D. Nurmi, and R. Wolski. Predicting bounds on queuing delay for batch-scheduled parallel machines. In PPoPP, pages 110–118, 2006.
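
Both bounds can be illustrated on a plain list of past wait times. The sketch below is not the BMBP implementation from the cited paper: it shows only the order-statistic idea behind a binomial quantile bound and omits any handling of non-stationary history. The Chebyshev bound assumes the sample mean and standard deviation of the history stand in for the true ones, and the 0.95 quantile default is an assumed value. All function names are illustrative.

```python
from math import comb, sqrt
from statistics import mean, pstdev


def chebyshev_upper_bound(waits, confidence=0.95):
    """Upper bound mean + k*std, with k chosen so that 1/k^2 = 1 - confidence."""
    k = sqrt(1.0 / (1.0 - confidence))
    return mean(waits) + k * pstdev(waits)


def binomial_upper_bound(waits, quantile=0.95, confidence=0.95):
    """Order-statistic upper bound on a quantile of the wait time distribution.

    Returns the smallest observation x_(k) such that
    P[Binomial(n, quantile) <= k - 1] >= confidence, or None when the
    history is too short to give a bound at this confidence level.
    """
    n = len(waits)
    ordered = sorted(waits)
    cdf = 0.0
    for k in range(1, n + 1):
        # Add P[exactly k - 1 of n observations fall below the quantile].
        cdf += comb(n, k - 1) * quantile ** (k - 1) * (1 - quantile) ** (n - k + 1)
        if cdf >= confidence:
            return ordered[k - 1]
    return None


if __name__ == "__main__":
    history = list(range(100))  # hypothetical past wait times in seconds
    print(chebyshev_upper_bound(history))
    print(binomial_upper_bound(history))
```

With a 0.95 quantile and 95% confidence, the order-statistic bound needs at least 59 observations, since 0.95^n must drop below 0.05 before even the largest observation qualifies.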

Upper-Bound Predictions

BMBP
Grid-Site     Avg. Accuracy  Under-predictions  Perfect-predictions  Over-predictions
DAS2-FS1               0.50                 8%                   9%               83%
DAS3-FS4               0.41                15%                   4%               81%
Auver-clr01            0.20                12%                   1%               87%
GRID5K-G1              0.72                20%                   0%               80%

Chebyshev
Grid-Site     Avg. Accuracy  Under-predictions  Perfect-predictions  Over-predictions
DAS2-FS1               0.21                 8%                   0%               92%
DAS3-FS4               0.23                 7%                   1%               82%
Auver-clr01            0.10                 7%                   0%               93%
GRID5K-G1              0.24                16%                   0%               84%

• Trade-off between accuracy and tightness of the upper bounds

Upper-Bound Predictions
• Both BMBP and Chebyshev fail when jobs arrive in bursts
• User runtime estimates, if available, can also be used in predicting upper bounds
[Figure: a burst period of DAS3-FS4]

Performance of Prediction-Based Grid Scheduling
• Global Scheduling Policies
  • Prediction-based: Earliest Completion Time (ECT)-Perfect, ECT-Last2
  • Traditional: Load Balancer (LB), Fastest Processor First (FPF)
• Simulation Model
  • DAS3 and AUVER
  • Jobs arrive at a global scheduler
  • A point-value predictor runs on each cluster (Last2 + Correction)

Trace   Period            Number of Jobs  Avg. Util.
DAS3    July-Oct. 2008          ~ 220,000       ~ 30%
AUVER   Aug.-Nov. 2006           ~ 90,000       ~ 70%
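
The difference between the policy families can be sketched as three cluster-selection rules. This is an illustrative sketch only: the slides do not define the policies in detail, so the cluster fields (`predicted_wait`, `speed`, `utilization`) and the exact selection criteria are assumptions. Under this reading, ECT-Perfect and ECT-Last2 would share the same rule and differ only in where the runtime and wait time predictions come from.

```python
def ect(job_runtime, clusters):
    """Earliest Completion Time: minimise predicted wait + predicted runtime.

    job_runtime is the predicted runtime on a reference processor; it is
    scaled by each cluster's relative speed (an assumption of this sketch).
    """
    best = min(clusters, key=lambda c: c["predicted_wait"] + job_runtime / c["speed"])
    return best["name"]


def load_balancer(clusters):
    """Load Balancer: pick the cluster with the lowest current utilization."""
    return min(clusters, key=lambda c: c["utilization"])["name"]


def fastest_processor_first(clusters):
    """Fastest Processor First: pick the cluster with the fastest processors."""
    return max(clusters, key=lambda c: c["speed"])["name"]


if __name__ == "__main__":
    clusters = [  # hypothetical snapshot of two clusters
        {"name": "A", "predicted_wait": 100.0, "speed": 1.0, "utilization": 0.9},
        {"name": "B", "predicted_wait": 0.0, "speed": 0.5, "utilization": 0.2},
    ]
    print(ect(40.0, clusters), ect(200.0, clusters))
    print(load_balancer(clusters), fastest_processor_first(clusters))
```

In this example the short job is sent to the idle but slower cluster, while the long job is worth the wait on the faster one; the two traditional rules ignore job length entirely.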

Performance of Prediction-Based Grid Scheduling: DAS3
[Figure: DAS3 response time]

DAS3                    ECT-Perfect  ECT-Last2    LB   FPF
Avg. Response Time [s]         1320       1400  4318  1911
Avg. Wait Time [s]              105        186  3061   681

• Prediction-based policies perform better

Performance of Prediction-Based Grid Scheduling: AUVER

AUVER                   ECT-Perfect  ECT-Last2     LB    FPF
Avg. Response Time [s]        40951      41003  40959  41334
Avg. Wait Time [s]             6515       6574   6534   6898

• All policies have similar performance

Conclusion
• We presented a systematic evaluation of job runtime and queue wait time predictions in grids using real traces
  • Simple time-series methods revealed low accuracy
  • Current predictors cannot handle bursty arrivals
  • More accurate predictions do not imply a better performance of grid scheduling
• Future Work
  • Simple vs. Complex (AI-based) prediction methods

Questions?
More Information:
• The Grid Workloads Archive: http://gwa.ewi.tudelft.nl/pmwiki/
• DGSim: www.pds.ewi.tudelft.nl/~iosup/dgsim.php
• See the PDS publication database at: www.pds.twi.tudelft.nl/
email: o.o.sonmez@tudelft.nl
This work was carried out in the context of the Virtual Laboratory for e-Science project (www.vl-e.nl). Part of this work was also carried out under the FP6 Network of Excellence CoreGRID, funded by the European Commission.
