Towards Hybrid Isolation for Shared Multicore Systems




  1. Towards Hybrid Isolation for Shared Multicore Systems
     23rd Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), 2020
     Yoonsung Nam†, Byeonghun Yoo†, Yongjun Choi†, Yongseok Son§, and Hyeonsang Eom†
     †Seoul National University, §Chung-Ang University

  2. Co-locating Workloads in a Server Machine
     • Co-locating multiple workloads in a server
       • Latency-critical
       • Best-effort
     • Benefits
       • Higher resource efficiency
       • Running multiple workloads in parallel
     Figure: latency-critical and best-effort workloads sharing one multicore server machine; example workloads shown: data analytics, graph processing, scientific simulation, in-memory DB, speech recognition, web server, object detection.

  3. Shared Resource Contention in Multicore Systems
     • Problem: Resource contention
       • Lower performance
       • Lower resource efficiency
       • Higher service level objective violations
     Figure: an LC workload and a BE workload running on cores with private caches while sharing the last level cache and memory.

  4. Isolation Techniques
     • Software isolation techniques
       • Isolation techniques performed by software
       • cgroup is one of the representative isolation knobs in Linux
       • e.g., CPU core allocation, CPU cycle limit, and thread migration
     • Hardware isolation techniques
       • Isolation techniques performed by hardware
       • Hardware vendors provide their own isolation interfaces (e.g., Intel CAT*, Intel per-core DVFS, and Intel RAPL**)
       • e.g., cache partitioning and per-core Dynamic Voltage and Frequency Scaling (DVFS)
     *CAT: Cache Allocation Technology
     **RAPL: Running Average Power Limit
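As a rough illustration of the knobs named on this slide, the sketch below only computes the strings one would write to the corresponding Linux interfaces; it does not touch the system. The helper names are hypothetical. The file names in the comments (`cpuset.cpus`, `cpu.cfs_quota_us`, and the resctrl `schemata` file) are standard kernel interfaces, assuming cgroup v1, but the slides do not say exactly how the authors drive them.

```python
def cpuset_cpus(first_core: int, n_cores: int) -> str:
    """Value for cpuset.cpus: a contiguous range of core IDs, e.g. '0-7'."""
    if first_core < 0 or n_cores < 1:
        raise ValueError("need a non-negative first core and at least one core")
    last_core = first_core + n_cores - 1
    return str(first_core) if n_cores == 1 else f"{first_core}-{last_core}"


def cfs_quota_us(percent: int, period_us: int = 100_000) -> int:
    """Value for cpu.cfs_quota_us: CPU time allowed per period, given as a
    percentage of one CPU (values above 100 span more than one CPU)."""
    if percent < 1:
        raise ValueError("percent must be positive")
    return period_us * percent // 100


def cat_schemata_line(cache_id: int, n_ways: int, total_ways: int = 20) -> str:
    """Line for a resctrl schemata file: a contiguous L3 way bitmask in hex.

    total_ways=20 follows the '20 per LLC' figure quoted in the slides.
    """
    if not 1 <= n_ways <= total_ways:
        raise ValueError("n_ways must be between 1 and total_ways")
    mask = (1 << n_ways) - 1  # lowest n_ways bits set
    return f"L3:{cache_id}={mask:x}"
```

For example, `cpuset_cpus(0, 8)` gives `0-7`, `cfs_quota_us(50)` gives `50000`, and `cat_schemata_line(0, 4)` gives `L3:0=f`.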

  5. Related Work
     • How have these isolation techniques been used to mitigate resource contention?
     • SW isolation techniques
       • CPI2 [EuroSys'13] (CPU cycle limit & migration)
       • Quasar [ASPLOS'14] (CPU allocation & migration)
       • Memguard [RTAS'13] (CPU cycle limit)
     • Both
       • Heracles [ISCA'15] (CPU allocation, Intel CAT, Network HTB*, CPU freq.)
       • PARTIES [ASPLOS'19] (CPU allocation, Intel CAT, Network HTB*, Disk qdisc)
     • HW isolation techniques
       • Dirigent [ASPLOS'16] (Intel CAT & per-core DVFS)
     *HTB: Hierarchical Token Bucket

  6. Related Work
     • How have these isolation techniques been used to mitigate resource contention?
     • SW isolation techniques
       • CPI2 [EuroSys'13] (CPU cycle limit & migration)
       • Quasar [ASPLOS'14] (CPU allocation & migration)
       • Memguard [RTAS'13] (CPU cycle limit)
     • Both
       • HIS [JSSPP'20] (CPU allocation, Intel CAT, CPU freq., thread migration)
       • Heracles [ISCA'15] (CPU allocation, Intel CAT, Network HTB*, CPU freq.)
       • PARTIES [ASPLOS'19] (CPU allocation, Intel CAT, Network HTB*, Disk qdisc)
     • HW isolation techniques
       • Dirigent [ASPLOS'16] (Intel CAT & per-core DVFS)
     *HTB: Hierarchical Token Bucket

  7. Trade-offs in Existing Isolation Techniques
     HW isolation: "I'm fast and strict". SW isolation: "I'm flexible!"

                      | H/W isolation tech.                    | S/W isolation tech.
                      | Cache Partitioning | Per-core DVFS     | CPU Cycle Limit | CPU Allocation | Thread Migration
     Technology       | Intel CAT          | Intel Processor   | cgroup:cpu      | cgroup:cpuset  | cgroup:cpuset, memory
                      |                    | (since Haswell)   |                 |                |
     Type             | Partitioning       | Throttling        | Throttling      | Scheduling     | Scheduling
     Latency (ms)*    | 3                  | 2                 | 40~50           | 3              | 90
     Configurations   | # of ways          | # of freq.        | quota/period    | # of cores     | # of sockets
                      | (20 per LLC)       | (10 per core)     | (100)           | (16)           | (2)
     Strictness       | High               | High              | Medium          | Medium         | Low
     Responsiveness   | High               | High              | Medium          | High           | Low
     Flexibility      | Low                | Low               | Medium          | High           | High

     *Latency is the time taken for each isolation technique to take effect on a highly memory-intensive workload (SP of NPB)

  8. Ineffective Isolation
     • FG: streamcluster (MemBW-int) & canneal (LLC-int), BG: SP (MemBW-int & LLC-int)
     • Both workloads show the highest performance with a combination of different isolation techniques (indicated by yellow stars)
     • Some combinations of isolation techniques perform worse than the baseline, which simply allocates 8 cores to each workload (indicated by red stars)
     Figure: FG and BG on a 16-core Xeon socket (LLC: 40MB, Mem BW: 68GB/s). Legend: Scheduling (S): CPU allocation; Throttling (T): per-core DVFS; Partitioning (P): Intel CAT.

  9. Trade-off: Strictness
     • FG: streamcluster, BG: SP
     Figure panels: changes in BG's LLC, changes in FG's LLC, changes in BG's IPC, changes in FG's IPC.
     *LLC sizes are measured by Intel Resctrl

  10. Trade-off: Responsiveness
      • FG: Apache web server (client: ab), BG: SP
      • The HW isolation technique (per-core DVFS) shows lower tail latency for the latency-critical workload than the SW isolation technique (CPU cycle limit)
      • This is because hardware isolation techniques take effect faster by controlling the hardware directly
      Figure: ab on a remote 32-core Xeon machine sends requests to the Apache web server, which co-runs with SP on a 16-core Xeon socket under HW (or SW) isolation.

  11. Trade-off: Flexibility
      Figure: thread migration between two 16-core Xeon sockets; workloads shown: canneal, SP, swaptions, nn; annotated speedups: x1.6, x1.3, x1.2.

  12. HIS: Hybrid Isolation System
      • Profiler: monitors resource contention for the foreground workload in an online manner
      • Scheduler: chooses an isolation technique based on the most contentious resource, the available isolation techniques, and the type of technique (SW or HW)
      Figure: on a server machine, the user-level Profiler and Scheduler run above the (Linux) OS kernel. The profiler monitors the FG and BG workloads through perf and /proc (feedback/monitor flow), and the scheduler applies isolation techniques (control flow).

  13. HIS Online Profiler
      • Profiler: measures how much workloads suffer from resource contention by calculating contention from samples:
          cont_res = (res_co-run / res_solo-run) - 1
      • Metrics:
          Metric                              | Meaning            | Contention type
          LLC hit ratio                       | Data reuse         | LLC
          Memory BW                           | Mem BW consumption | Mem BW
          Free/allocated CPUs, active threads | CPU demand         | CPU core
      • Runs 'profiling mode' if there is no valid profile (e.g., phase change, no profile data)
      • In normal operation, the profiler collects 'co-run' samples of FG
      • If there is no valid profile, it performs 'profiling mode', collecting 'solo-run' samples of FG for 2 seconds (10 samples)
      Figure (timeline): the profiler sends SIGSTOP to pause BG while it samples FG alone, then SIGCONT to resume it.
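The contention formula on this slide can be sketched in a few lines of Python. The function and key names are hypothetical, and averaging the samples before taking the ratio is an assumption; the slides only give the ratio itself and do not say how the sign should be interpreted for each metric.

```python
from statistics import mean
from typing import Dict, List


def contention(co_run: float, solo_run: float) -> float:
    """cont_res = (res_co-run / res_solo-run) - 1 for one resource metric."""
    if solo_run == 0:
        raise ValueError("solo-run value must be non-zero")
    return co_run / solo_run - 1.0


def contention_profile(
    co_samples: Dict[str, List[float]],
    solo_samples: Dict[str, List[float]],
) -> Dict[str, float]:
    """Per-metric contention from averaged co-run and solo-run samples.

    Assumption: samples are averaged first; the slides do not specify this.
    """
    return {
        metric: contention(mean(co_samples[metric]), mean(solo_samples[metric]))
        for metric in solo_samples
    }
```

With an LLC hit ratio of 60 in the co-run and 80 in the solo-run, `contention(60.0, 80.0)` evaluates to -0.25, i.e. the co-run value is 25% below the solo-run value.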

  14. HIS Scheduler
      Steps:
      1. Which is the most contentious resource?
      2. Which isolation techniques are available?
      3. HW or SW isolation?
      4. Isolation technique is selected
      5. Which parameter? (strengthen or weaken)
      6. Get feedback from the profiler
      7a. Iterate steps 1~6 using the feedback
      7b. If contentions are well managed within the threshold (the highest contentions are within 5%), then stop

      Contention type | Isolation techniques
      LLC             | Intel CAT
      Mem BW          | CPU allocation or per-core DVFS
      CPU core        | CPU allocation or thread migration

      Example (Mem BW is the most contentious resource): the available techniques are CPU allocation (SW) or per-core DVFS (HW). Per-core DVFS (HW) is selected and the CPU frequency is lowered (2.1GHz -> 2.0GHz), which strengthens memory BW isolation. The scheduler sleeps 200ms; if Mem BW contention is still high, it goes back to step 2.
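One pass through steps 1~4 might look like the following Python sketch. The mapping and the 5% threshold come from the slide; everything else is an assumption for illustration: the names, the `prefer_hw` flag (the slides do not state how HW or SW is chosen), and the 100 MHz step and 1200 MHz floor in `lower_frequency`.

```python
from typing import Dict, Optional, Set, Tuple

# Contention type -> candidate isolation techniques (from the slide's table).
TECHNIQUES = {
    "llc": ["intel_cat"],
    "mem_bw": ["cpu_allocation", "per_core_dvfs"],
    "cpu_core": ["cpu_allocation", "thread_migration"],
}
HW_TECHNIQUES = {"intel_cat", "per_core_dvfs"}
THRESHOLD = 0.05  # stop when the highest contention is within 5%


def select_technique(
    contention: Dict[str, float],
    available: Set[str],
    prefer_hw: bool = True,  # assumed policy knob, not from the slides
) -> Optional[Tuple[str, Optional[str]]]:
    """Return (resource, technique), or None if contention is within threshold."""
    # Step 1: find the most contentious resource.
    resource = max(contention, key=lambda r: abs(contention[r]))
    if abs(contention[resource]) <= THRESHOLD:
        return None  # step 7b: well managed, stop
    # Step 2: keep only the techniques that are currently available.
    candidates = [t for t in TECHNIQUES[resource] if t in available]
    # Steps 3-4: pick HW or SW according to the assumed preference.
    preferred = [t for t in candidates if (t in HW_TECHNIQUES) == prefer_hw]
    return resource, (preferred or candidates or [None])[0]


def lower_frequency(freq_mhz: int, step_mhz: int = 100, min_mhz: int = 1200) -> int:
    """Step 5 for per-core DVFS: strengthen isolation by one frequency step."""
    return max(min_mhz, freq_mhz - step_mhz)
```

With contentions `{"llc": 0.02, "mem_bw": 0.30, "cpu_core": 0.10}` and all techniques available, the sketch picks per-core DVFS for memory bandwidth, and `lower_frequency(2100)` gives 2000, matching the 2.1GHz -> 2.0GHz example on the slide.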

  15. Experimental Setup
      • Machine
        • Intel Xeon E5-2683 v4 (16 cores per socket, 2 sockets)
        • LLC: 40MB, RAM: 32GB, Memory BW: 68GB/s
        • Linux 4.19.0
      • Workloads
        • Multi-threaded benchmarks (PARSEC, Rodinia, NPB, Apache web server with ab)
        • Co-locates two workloads (FG and BG) in a machine
        • SP is used as the BG workload because it shows the highest memory bandwidth and LLC usage
      • Comparison against co-running workloads with static CPU isolation
        • Batch workloads
        • Latency-critical workloads

  16. Evaluation (Batch Workload)
      • FG: benchmarks shown on the x-axis & BG: SP of NPB
      • Performance is normalized to the baseline (co-run with static CPU isolation)
      • HIS achieves 1.05~1.7x the baseline performance, while BG performance degrades by up to 1.3x

  17. Evaluation (Latency-critical Workload)
      • FG: Apache web benchmark (with ab) & BG: SP
      • The client (ab) sends requests generated with a Pareto distribution
      • The Apache web benchmark achieves 2.14x the performance of the co-run case and also shows slightly lower latency than the solo-run (8-core) case
      • The number of CPU cores allocated to the web server reached a maximum of 12 in response to CPU demand

  18. Future Work and Conclusion
      • Future work
        • Testing with other isolation policies (e.g., selecting isolation techniques)
        • Testing with more diverse benchmarks (e.g., phase-changing workloads)
        • Comparing with recent works (e.g., a Heracles-like system)
      • Conclusion
        • We have analyzed the trade-offs in HW/SW isolation techniques
        • We have implemented a prototype of a hybrid isolation system
        • We have evaluated our prototype using selected benchmarks and achieved a 1.7~2.14x performance improvement over static isolation

  19. Thank you! Questions?
      Yoonsung Nam, DCSLab., Seoul National University
      yoonsung.nam@snu.ac.kr
