SLIDE 1

Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters

Christina Delimitrou¹, Daniel Sanchez² and Christos Kozyrakis¹

¹Stanford University, ²MIT

SOCC – August 27th, 2015

SLIDE 2

Executive Summary

- Goals of cluster scheduling:
  - High decision quality → high performance, high cluster utilization
  - High scheduling speed
- Problem: disparity in scheduling designs
  - Centralized schedulers → high quality, low speed
  - Sampling-based schedulers → high speed, low quality
- Tarcil: key scheduling techniques to bridge the gap
  - Account for resource preferences → high decision quality
  - Analytical framework for sampling → predictable performance
  - Admission control → high quality and speed
  - Distributed design → high scheduling speed

SLIDE 3-4

Motivation

- Optimize for scheduling speed (sampling-based, distributed): good for short jobs, bad for long jobs
- Optimize for scheduling quality (centralized, greedy): good for long jobs, bad for short jobs

(Short: ~100 msec, Medium: 1-10 sec, Long: 10 sec-10 min)


SLIDE 5

Key Scheduling Techniques at Scale

SLIDE 6

1. Determine Resource Preferences

- Scheduling quality depends on: interference, heterogeneity, scale up/out, …
  - Exhaustive exploration → infeasible
  - Practical data-mining framework¹
  - Measure the impact of a couple of allocations → estimate for the large space

¹ C. Delimitrou and C. Kozyrakis. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In ASPLOS 2014.
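
As a rough illustration of the data-mining idea (a sketch, not Tarcil's actual implementation): a Quasar-style framework profiles a new job on a couple of resource configurations and uses a low-rank (SVD-based) model of previously run jobs to estimate its quality on all the others. All names and the toy data below are hypothetical.

```python
import numpy as np

def estimate_quality(history, probe, rank=3):
    """Estimate a new job's quality on all configs from a few probed ones.

    history: (jobs x configs) matrix of measured qualities for past jobs.
    probe:   length-configs vector for the new job, np.nan where unprofiled.
    """
    # Low-rank model of past behavior via truncated SVD.
    _, _, Vt = np.linalg.svd(history, full_matrices=False)
    V = Vt[:rank].T                      # (configs x rank) latent config factors

    known = ~np.isnan(probe)
    # Least-squares fit of the new job's latent coordinates to its probes.
    coeffs, *_ = np.linalg.lstsq(V[known], probe[known], rcond=None)
    estimate = V @ coeffs                # project back onto all configs

    estimate[known] = probe[known]       # keep the measured values verbatim
    return np.clip(estimate, 0.0, 1.0)

# Hypothetical example: 4 past jobs x 5 configs, 2 probes for the new job.
history = np.array([[0.9, 0.7, 0.4, 0.8, 0.5],
                    [0.8, 0.6, 0.3, 0.7, 0.4],
                    [0.2, 0.3, 0.9, 0.3, 0.8],
                    [0.3, 0.4, 0.8, 0.2, 0.9]])
probe = np.array([0.85, np.nan, np.nan, 0.75, np.nan])
print(estimate_quality(history, probe))
```
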
SLIDE 7

Example: Quantifying Interference

- Interference: a set of microbenchmarks of tunable intensity (iBench)

- Measure the interference each job tolerates and generates

[Figure: two profiled allocations along the resource-quality axis (QoS at 68% vs. 7% of peak); data mining recovers the missing resource measurements]

¹ C. Delimitrou and C. Kozyrakis. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In ASPLOS 2014.
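
A minimal sketch of the measurement loop (assumptions: an iBench-style microbenchmark whose intensity can be dialed up, plus hypothetical hooks to launch it and to read the co-run application's QoS): ramp the intensity until QoS drops below the target; the last tolerated intensity quantifies the job's sensitivity to that shared resource.

```python
QOS_TARGET = 0.95  # fraction of isolated performance that must be preserved

def tolerated_interference(run_microbenchmark, measure_qos, step=5):
    """Return the highest microbenchmark intensity (0-100) at which the
    co-scheduled application still meets its QoS target.

    run_microbenchmark(intensity) and measure_qos() are hypothetical hooks:
    the first launches an iBench-style benchmark at the given intensity on
    the shared resource, the second returns the app's QoS in [0, 1].
    """
    tolerated = 0
    for intensity in range(0, 101, step):
        run_microbenchmark(intensity)
        if measure_qos() < QOS_TARGET:
            break
        tolerated = intensity
    return tolerated
```

Repeating this per resource (cache, memory bandwidth, network, …) yields the job's tolerated-interference profile; generated interference can be measured symmetrically by running the job against a probe of known sensitivity.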

SLIDE 8

2. Analytical Sampling Framework

- Sample with respect to the required resource quality

SLIDE 9

2. Analytical Sampling Framework

- Fine-grain allocations: partition servers into Resource Units (RUs) → the minimum allocation unit
  - An RU is sized to host single-threaded apps
  - Unused resources are reclaimed

SLIDE 10

2. Analytical Sampling Framework

- Match a new job with required quality Q to appropriate RUs

[Figure: the cluster's RUs, each annotated with its current resource quality QR]
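
A minimal sketch of this matching step (illustrative; the helper names are hypothetical): sample a handful of RUs and keep those whose estimated quality for this job meets the target Q.

```python
import random

def sample_candidates(ru_quality, required_q, sample_size=8):
    """Sample RUs and return those meeting the job's required quality Q.

    ru_quality: dict mapping RU id -> estimated quality in [0, 1] for this
    job (hypothetical; produced by the preference-estimation step above).
    """
    sampled = random.sample(list(ru_quality), min(sample_size, len(ru_quality)))
    good = [ru for ru in sampled if ru_quality[ru] >= required_q]
    # If no sampled RU meets Q, fall back to the best one found; Tarcil
    # instead applies admission control here (see slide 17).
    return good or [max(sampled, key=ru_quality.get)]
```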

SLIDE 11

2. Analytical Sampling Framework

- Rank resources by quality

SLIDE 12-13

2. Analytical Sampling Framework

- Break ties with a fair coin → a uniform distribution of quality

[Figure: CDF of resource quality Q, from worse resources (left) to better resources (right)]

SLIDE 14

2. Analytical Sampling Framework

- Sampling from a uniform distribution → guarantees on resource allocation quality
- With sample size R, the best of the R sampled resources satisfies Pr[Q ≤ x] = x^R
  - Example: with R = 31 samples, Pr[Q < 0.8] = 0.8^31 ≈ 10^-3
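
A quick simulation of this guarantee (a sketch; it assumes sampled qualities are i.i.d. uniform on [0, 1], which is what the tie-breaking step above ensures):

```python
import random

def prob_best_below(x, R, trials=200_000):
    """Empirical Pr[best of R uniform samples < x]."""
    hits = sum(max(random.random() for _ in range(R)) < x
               for _ in range(trials))
    return hits / trials

R = 31  # smallest R with 0.8**R <= 1e-3
print(f"analytical: {0.8**R:.2e}")           # ~9.9e-04
print(f"empirical:  {prob_best_below(0.8, R):.2e}")
```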

SLIDE 15

Validation

- 100-server EC2 cluster
- Short Spark tasks
- Deviation between the analytical and empirical distributions is minimal

SLIDE 16

Sampling at High Load

- At high load, a small sample is unlikely to include high-quality resources, so performance degrades
- Alternatively, the sample size must increase
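
A back-of-the-envelope illustration (my own sketch, not a result from the paper): if only a fraction f of RUs currently offers quality at or above the target, the chance that R random samples all miss is (1 - f)^R, so holding the miss probability at 10^-3 forces R up as f shrinks with load.

```python
import math

def required_samples(f, miss_prob=1e-3):
    """Smallest R with (1 - f)^R <= miss_prob, where f is the fraction of
    RUs that currently meet the quality target."""
    return math.ceil(math.log(miss_prob) / math.log(1 - f))

for load in (0.5, 0.8, 0.9, 0.95):   # cluster utilization
    f = 1 - load                      # crude assumption: qualifying fraction
    print(f"load {load:.0%}: need R >= {required_samples(f)}")
```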

SLIDE 17

3. Admission Control

- Queue jobs based on their required resource quality
- Trade resource quality against waiting time → set a maximum waiting-time limit
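
A minimal sketch of this policy (illustrative; sample() is a hypothetical hook returning the best RU and its quality from a fresh sample): if sampling cannot meet the job's required quality, queue the job briefly so better resources can free up, but never beyond the waiting-time limit.

```python
import time

def admit(sample, required_q, max_wait_s=0.5, poll_s=0.05):
    """Admit a job once sampling finds an RU of sufficient quality, or when
    the maximum waiting time expires (then take the best RU found)."""
    deadline = time.monotonic() + max_wait_s
    best_ru, best_q = sample()
    while best_q < required_q and time.monotonic() < deadline:
        time.sleep(poll_s)            # job waits in the queue; load may drop
        best_ru, best_q = sample()
    return best_ru
```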

SLIDE 18

Tarcil Implementation

- 4,000 lines of code in C/C++ and Python
- Supports apps in various frameworks (Hadoop, Spark, key-value stores)
- Distributed design: concurrent scheduling agents (similar to Omega²)
  - Each agent has a local copy of cluster state, plus one resilient master copy
  - Lock-free optimistic concurrency for conflict resolution (conflicts are rare) → abort and retry
  - 30:1 worker-to-scheduling-agent ratio

² M. Schwarzkopf, A. Konwinski, et al. Omega: flexible, scalable schedulers for large compute clusters. In EuroSys 2013.
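
A minimal sketch of lock-free optimistic concurrency with abort-and-retry (illustrative Python, not Tarcil's C/C++ implementation): each RU in the master state carries a version; an agent commits its claim only if the version it read is unchanged, and otherwise aborts and retries.

```python
import threading

class SharedState:
    """Master copy of cluster state with per-RU versions (sketch)."""
    def __init__(self, num_rus):
        self.owner = [None] * num_rus     # which job holds each RU
        self.version = [0] * num_rus
        self._lock = threading.Lock()     # stands in for a hardware CAS

    def read(self, ru):
        return self.owner[ru], self.version[ru]

    def try_claim(self, ru, job, seen_version):
        """Commit iff nobody changed this RU since we read it."""
        with self._lock:
            if self.version[ru] != seen_version or self.owner[ru] is not None:
                return False              # conflict -> caller aborts and retries
            self.owner[ru] = job
            self.version[ru] += 1
            return True

def schedule(state, job, pick_ru, max_retries=10):
    """Agent loop: pick an RU from the local state copy, then try to commit."""
    for _ in range(max_retries):
        ru = pick_ru()                    # hypothetical: the sampling step above
        owner, version = state.read(ru)
        if owner is None and state.try_claim(ru, job, version):
            return ru
    raise RuntimeError("too many conflicts; rescheduling")
```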

SLIDE 19

Evaluation Methodology

1. TPC-H workload
   - ~40k queries of different types
   - Compare with a centralized scheduler (Quasar) and a distributed scheduler based on random sampling (Sparrow)
   - 110-server EC2 cluster (100 workers, 10 scheduling agents)
     - Homogeneous cluster, no interference
     - Homogeneous cluster, with interference
     - Heterogeneous cluster, with interference

- Metrics:
  - Task performance
  - Performance predictability
  - Scheduling latency

SLIDE 20-22

Evaluation

- Centralized: high overheads; Sparrow and Tarcil: similar
- Centralized and Sparrow: comparable performance; Tarcil: 24% lower completion time
- Centralized outperforms Sparrow; Tarcil: 41% lower completion time and less jitter

SLIDE 23

Scheduling Overheads

(Heterogeneous cluster, with interference)

- Centralized: two orders of magnitude slower than the distributed, sampling-based schedulers
- Sparrow and Tarcil: comparable scheduling overheads

SLIDE 24

Resident Load

- Tarcil and Centralized account for cross-job interference → preserve memcached's QoS
- Sparrow causes QoS violations for memcached

[Figure: memcached QoS under each scheduler]

SLIDE 25

Motivation Revisited

[Figure: scheduling quality vs. task duration (Short: ~100 msec, Medium: 1-10 sec, Long: 10 sec-10 min) for the distributed sampling-based scheduler, the centralized greedy scheduler, and Tarcil]

SLIDE 26

More details in the paper…

- Sensitivity to parameters such as:
  - Cluster load
  - Number of scheduling agents
  - Sample size
  - Task duration, etc.
- Job priorities
- Large allocations
- A generic application scenario (batch and latency-critical) on 200 EC2 servers

SLIDE 27

Conclusions

- Tarcil reconciles high-quality and high-speed scheduling
  - Accounts for resource preferences
  - Analytical sampling framework to improve predictability
  - Admission control to maintain high scheduling quality at high load
  - Distributed design to improve scheduling speed
- Results:
  - 41% better performance than random sampling-based schedulers
  - 100x better scheduling latency than centralized schedulers
  - Predictable allocation quality and performance

SLIDE 28-29

Questions?

Thank you!