Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters


  1. Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters. Christina Delimitrou¹, Daniel Sanchez², and Christos Kozyrakis¹. ¹Stanford University, ²MIT. SOCC, August 27th, 2015.

  2. Executive Summary
     - Goals of cluster scheduling: high performance (high decision quality) and high cluster utilization (high scheduling speed)
     - Problem: disparity in scheduling designs
       - Centralized schedulers → high quality, low speed
       - Sampling-based schedulers → high speed, low quality
     - Tarcil: key scheduling techniques to bridge the gap
       - Account for resource preferences → high decision quality
       - Analytical framework for sampling → predictable performance
       - Admission control → high quality and speed
       - Distributed design → high scheduling speed

  3. Motivation
     - Optimizing for scheduling speed (sampling-based, distributed): good for short jobs, bad for long jobs
     - Optimizing for scheduling quality (centralized, greedy): good for long jobs, bad for short jobs
     - Job durations: short ≈ 100 msec, medium ≈ 1-10 sec, long ≈ 10 sec-10 min

  5. Key Scheduling Techniques at Scale

  6. 1. Determine Resource Preferences
     - Scheduling quality depends on: interference, heterogeneity, scale up/out, ...
     - Exhaustive exploration → infeasible
     - Practical data mining framework¹: measure the impact of a couple of allocations → estimate the rest of the large space
     ¹ C. Delimitrou and C. Kozyrakis. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In ASPLOS 2014.

  7. Example: Quantifying Interference
     - Interference: set of microbenchmarks of tunable intensity (iBench)
     - Measure the interference a job tolerates and generates for a couple of resources; data mining recovers the missing resources and yields the resource quality Q
     [figure: tolerated QoS per interference source, ranging from 68% down to 7% across resources]
     ¹ C. Delimitrou and C. Kozyrakis. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In ASPLOS 2014.
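The recovery step above is, per the citation, Quasar's collaborative-filtering approach. As a rough illustration only, here is a minimal Python stand-in using a plain SVD "fold-in" with made-up numbers (Quasar itself uses PQ reconstruction with SGD; all names and values below are assumptions, not the actual system):

```python
# Sketch: estimate a new job's full interference profile from a couple of
# measurements, using the profiles of previously scheduled jobs.
import numpy as np

def estimate_sensitivities(history, measured_cols, measured_vals, k=3):
    """history: (jobs x interference-sources) tolerated intensities from
    past jobs. measured_cols/vals: the few sources actually profiled for
    the new job. Returns an estimate for all sources."""
    _, _, Vt = np.linalg.svd(history, full_matrices=False)
    V = Vt[:k].T                                   # latent source factors
    coords, *_ = np.linalg.lstsq(V[measured_cols], measured_vals, rcond=None)
    return np.clip(V @ coords, 0.0, 1.0)

# Made-up data: 4 past jobs, 6 iBench-like sources (cache, mem bw, ...).
hist = np.array([[.9, .8, .2, .3, .7, .6],
                 [.8, .9, .3, .2, .6, .7],
                 [.2, .3, .9, .8, .4, .3],
                 [.3, .2, .8, .9, .3, .4]])
print(estimate_sensitivities(hist, [0, 2], np.array([.85, .25])))
```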

  8. 2. Analytical Sampling Framework
     - Sample with respect to the required resource quality

  9. 2. Analytical Sampling Framework
     - Fine-grain allocations: partition servers into Resource Units (RUs), the minimum allocation unit (one RU fits a single-threaded app)
     - Unused resources can be reclaimed
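A minimal sketch of the RU abstraction, purely illustrative (the field names and sizes are assumptions, not Tarcil's actual data model):

```python
# Sketch: carve a server into per-core Resource Units, each bundling a
# core with a proportional slice of memory.
from dataclasses import dataclass

@dataclass
class ResourceUnit:
    server: str
    core: int
    mem_gb: float

def partition(server_id: str, cores: int, mem_gb: float):
    return [ResourceUnit(server_id, c, mem_gb / cores) for c in range(cores)]

rus = partition("node-17", cores=16, mem_gb=64)   # 16 RUs of 4 GB each
print(len(rus), rus[0])
```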

  10. 2. Analytical Sampling Framework
      - Match a new job with required quality Q to appropriate RUs
      [figure: the cluster as a grid of RUs R1-R74, each annotated with its quality Q for the new job]

  11. 2. Analytical Sampling Framework
      - Rank resources by quality

  12. 2. Analytical Sampling Framework
      - Break ties with a fair coin → uniform distribution
      [figure: CDF of resource quality Q]

  13. 2. Analytical Sampling Framework
      - Break ties with a fair coin → uniform distribution
      [figure: CDF of resource quality Q, from worse resources (left) to better resources (right)]
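A minimal sketch of this ordering step (the quality function and RU ids are placeholders standing in for the preference estimates above):

```python
# Order RUs best-to-worst, breaking quality ties with a uniform random
# key (the "fair coin"), so equal-quality RUs land in uniformly random
# order and the quality distribution seen by sampling stays uniform.
import random

def rank_resource_units(rus, quality):
    keyed = [(quality(ru), random.random(), ru) for ru in rus]
    keyed.sort(reverse=True)
    return [ru for _, _, ru in keyed]

# Example: three RUs tie at quality 0.7; repeated calls permute them.
print(rank_resource_units(["R1", "R2", "R3"], lambda ru: 0.7))
```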

  14. 2. Analytical Sampling Framework
      - Sampling on the uniform distribution gives guarantees on resource allocation quality: with R samples, Pr[Q ≤ x] = x^R (e.g., Pr[Q < 0.8] = 10⁻³)
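A quick check of that guarantee (plain math, no Tarcil internals assumed): if qualities are uniform on [0, 1], the best of R independent samples satisfies Pr[Q ≤ x] = x^R, so the slide's 10⁻³ bound at x = 0.8 corresponds to R = 31 samples:

```python
# Worked check of the sampling guarantee Pr[best quality <= x] = x**R.
import math

def samples_needed(x, p):
    """Smallest R such that x**R <= p."""
    return math.ceil(math.log(p) / math.log(x))

R = samples_needed(0.8, 1e-3)
print(R, 0.8 ** R)   # 31 samples -> Pr[Q < 0.8] ~= 9.9e-4
```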

  15. Validation
      - 100-server EC2 cluster, short Spark tasks
      - Deviation between the analytical and empirical distributions is minimal

  16. Sampling at High Load
      - Performance degrades (with a small sample size), or the sample size needs to increase
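An illustrative calculation of why (the fractions are made up): if only f = 10% of RUs still meet the quality target, R uniform samples all miss with probability (1 − f)^R, so pushing that below 10⁻³ takes R ≈ 66 samples, versus R ≈ 10 when half the cluster qualifies. Admission control (next slide) avoids this blow-up by briefly queueing jobs instead of inflating the sample size.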

  17. 3. Admission Control
      - Queue jobs based on required resource quality
      - Resource quality vs. waiting time trade-off → set a maximum waiting-time limit (sketched below)
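A minimal sketch of such a policy (the helper names and the 0.05 / 0.5 s constants are illustrative, not Tarcil's actual values):

```python
# Queue a job while R samples are unlikely to find an RU of its required
# quality, but never wait past a per-job cap.
import time

MAX_WAIT_S = 0.5     # assumed waiting-time limit
MISS_PROB = 0.05     # acceptable chance that all R samples miss

def admit(required_quality, fraction_at_or_above, schedule, R=16):
    deadline = time.time() + MAX_WAIT_S
    while True:
        f = fraction_at_or_above(required_quality)   # share of good RUs now
        if (1 - f) ** R <= MISS_PROB or time.time() >= deadline:
            return schedule(R)        # sample R RUs and take the best
        time.sleep(0.01)              # re-check as resources free up

# Toy usage: 20% of RUs meet the bar, so (1 - 0.2)**16 ~ 0.03 <= 0.05
# and the job is admitted immediately.
print(admit(0.8, lambda q: 0.2, lambda R: f"sampled {R} RUs"))
```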

  18. Tarcil Implementation
      - 4,000 lines of code in C/C++ and Python
      - Supports apps in various frameworks (Hadoop, Spark, key-value stores)
      - Distributed design: concurrent scheduling agents (similar to Omega²)
        - Each agent has a local copy of the cluster state; one resilient master copy
        - Lock-free optimistic concurrency for conflict resolution (conflicts are rare) → abort and retry, as in the sketch below
        - 30:1 worker-to-scheduling-agent ratio
      ² M. Schwarzkopf, A. Konwinski, et al. Omega: flexible, scalable schedulers for large compute clusters. In EuroSys 2013.
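A minimal sketch of that Omega-style optimistic commit (the dict-based state is a stand-in for the resilient master copy, and the lock stands in for an atomic check-and-commit):

```python
import threading

class SharedState:
    def __init__(self):
        self._lock = threading.Lock()
        self.version = {}   # ru_id -> version number
        self.owner = {}     # ru_id -> job holding the RU

    def try_commit(self, claims, snapshot, job_id):
        """Claim the RUs iff none changed since the agent's local snapshot;
        on a conflict the agent aborts, refreshes its copy, and retries."""
        with self._lock:
            if any(self.version.get(ru, 0) != snapshot.get(ru, 0) for ru in claims):
                return False            # conflict (rare): abort and retry
            for ru in claims:
                self.version[ru] = self.version.get(ru, 0) + 1
                self.owner[ru] = job_id
            return True

# Agents schedule against local copies, then commit:
shared = SharedState()
print(shared.try_commit(["R1", "R2"], {}, "jobA"))    # True
print(shared.try_commit(["R2"], {}, "jobB"))          # False: stale snapshot
print(shared.try_commit(["R2"], {"R2": 1}, "jobB"))   # True after a refresh
```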

  19. Evaluation Methodology
      - Workload: ~40k TPC-H queries of different types
      - Compare with a centralized scheduler (Quasar) and a distributed scheduler based on random sampling (Sparrow)
      - 110-server EC2 cluster (100 workers, 10 scheduling agents)
        - Homogeneous cluster, no interference
        - Homogeneous cluster, with interference
        - Heterogeneous cluster, with interference
      - Metrics: task performance, performance predictability, scheduling latency

  20. Evaluation: homogeneous cluster, no interference
      - Centralized: high scheduling overheads
      - Sparrow and Tarcil: similar performance

  21. Evaluation: homogeneous cluster, with interference
      - Centralized and Sparrow: comparable performance
      - Tarcil: 24% lower completion time

  22. Evaluation: heterogeneous cluster, with interference
      - Centralized outperforms Sparrow
      - Tarcil: 41% lower completion time and less jitter

  23. Scheduling Overheads (heterogeneous cluster, with interference)
      - Centralized: two orders of magnitude slower than the distributed, sampling-based schedulers
      - Sparrow and Tarcil: comparable scheduling overheads

  24. Resident Load: memcached
      - Tarcil and the centralized scheduler account for cross-job interference → preserve memcached's QoS
      - Sparrow causes QoS violations for memcached

  25. Motivation Revisited
      [figure: Tarcil bridges the gap between distributed, sampling-based schedulers (best for short jobs) and centralized, greedy schedulers (best for long jobs); short ≈ 100 msec, medium ≈ 1-10 sec, long ≈ 10 sec-10 min]

  26. More details in the paper...
      - Sensitivity to parameters such as cluster load, number of scheduling agents, sample size, task duration, etc.
      - Job priorities
      - Large allocations
      - Generic application scenario (batch and latency-critical) on 200 EC2 servers

  27. Conclusions
      - Tarcil reconciles high-quality and high-speed scheduling:
        - Accounts for resource preferences
        - Analytical sampling framework to improve predictability
        - Admission control to maintain high scheduling quality at high load
        - Distributed design to improve scheduling speed
      - Results:
        - 41% better performance than random sampling-based schedulers
        - 100x lower scheduling latency than centralized schedulers
        - Predictable allocation quality and performance

  29. Questions? Thank you
