Towards Hybrid Isolation for Shared Multicore Systems

1 Towards Hybrid Isolation for Shared Multicore Systems
23rd Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), 2020
Yoonsung Nam†, Byeonghun Yoo†, Yongjun Choi†, Yongseok Son§, and Hyeonsang Eom†
†Seoul National University, §Chung-Ang University

2 Co-locating Workloads in a Server Machine
[Figure: a multicore server machine hosting latency-critical and best-effort workloads; example workloads shown: data analytics, graph processing, scientific simulation, in-memory DB, speech recognition, web server, object detection]
• Co-locating multiple workloads in a server
  • Latency-critical
  • Best-effort
• Benefits
  • Higher resource efficiency
  • Running multiple workloads in parallel

3 Shared Resource Contention in Multicore Systems
[Figure: an LC workload and a BE workload running on cores with private caches that share the last-level cache and memory]
• Problem: resource contention
  • Lower performance
  • Lower resource efficiency
  • Higher service level objective violations

4 Isolation Techniques
• Software isolation techniques
  • Isolation techniques performed by software
  • cgroup is one of the representative isolation knobs in Linux
  • e.g., CPU core allocation, CPU cycle limit, and thread migration
• Hardware isolation techniques
  • Isolation techniques performed by hardware
  • Hardware vendors provide their own isolation interfaces (e.g., Intel CAT*, Intel per-core DVFS, and Intel RAPL**)
  • e.g., cache partitioning and per-core Dynamic Voltage and Frequency Scaling (DVFS)
*CAT: Cache Allocation Technology
**RAPL: Running Average Power Limit
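As an illustration of the cache partitioning knob named on this slide, the sketch below builds contiguous way bitmasks of the kind Intel CAT expects and formats them in the style of a Linux resctrl schemata entry. The helper names (`cat_mask`, `schemata_line`), the 12/8 split, and the 20-way default (taken from the trade-off table later in the deck) are assumptions for illustration, not part of the authors' system.

```python
def cat_mask(n_ways: int, offset: int = 0, total_ways: int = 20) -> int:
    """Return a contiguous capacity bitmask covering n_ways ways from offset.

    total_ways=20 is an assumption taken from the deck's trade-off table.
    """
    if n_ways < 1 or offset < 0 or offset + n_ways > total_ways:
        raise ValueError("requested ways do not fit in the cache")
    return ((1 << n_ways) - 1) << offset


def schemata_line(cache_id: int, mask: int) -> str:
    """Format one L3 entry in the resctrl schemata style, e.g. 'L3:0=ff'."""
    return f"L3:{cache_id}={mask:x}"


# Hypothetical split: upper 12 ways for the foreground, lower 8 for the background.
fg_mask = cat_mask(12, offset=8)
bg_mask = cat_mask(8, offset=0)
print(schemata_line(0, fg_mask))
print(schemata_line(0, bg_mask))
```

The two masks do not overlap, so the foreground and background workloads would not evict each other's lines from their assigned ways.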

5 Related Work
• How have these isolation techniques been used to mitigate resource contention?
• SW isolation techniques: CPI2 [EuroSys'13], Quasar [ASPLOS'14], Memguard [RTAS'13] (knobs used across these works: CPU cycle limit, CPU allocation, migration)
• Both
  • Heracles [ISCA'15] (CPU allocation, Intel CAT, Network HTB*, CPU freq.)
  • PARTIES [ASPLOS'19] (CPU allocation, Intel CAT, Network HTB*, Disk qdisc)
• HW isolation techniques
  • Dirigent [ASPLOS'16] (Intel CAT & per-core DVFS)
*HTB: Hierarchical Token Bucket

6 Related Work
• How have these isolation techniques been used to mitigate resource contention?
• SW isolation techniques: CPI2 [EuroSys'13], Quasar [ASPLOS'14], Memguard [RTAS'13] (knobs used across these works: CPU cycle limit, CPU allocation, migration)
• Both
  • HIS [JSSPP'20] (CPU allocation, Intel CAT, CPU freq., thread migration)
  • Heracles [ISCA'15] (CPU allocation, Intel CAT, Network HTB*, CPU freq.)
  • PARTIES [ASPLOS'19] (CPU allocation, Intel CAT, Network HTB*, Disk qdisc)
• HW isolation techniques
  • Dirigent [ASPLOS'16] (Intel CAT & per-core DVFS)
*HTB: Hierarchical Token Bucket

7 Trade-offs in Existing Isolation Techniques
[Figure: HW isolation says "I'm fast and strict!"; SW isolation says "I'm flexible!"]
                | H/W isolation tech.                                  | S/W isolation tech.
                | Cache Partitioning    | Per-core DVFS                  | CPU Cycle Limit    | CPU Allocation  | Thread Migration
Technology      | Intel CAT             | Intel Processor (since Haswell) | cgroup:cpu         | cgroup:cpuset   | cgroup:cpuset, memory
Type            | Partitioning          | Throttling                     | Throttling         | Scheduling      | Scheduling
Latency (ms)*   | 3                     | 2                              | 40~50              | 3               | 90
Configurations  | # of ways (20 per LLC) | # of freq. (10 per core)       | quota/period (100) | # of cores (16) | # of sockets (2)
Strictness      | High                  | High                           | Medium             | Medium          | Low
Responsiveness  | High                  | High                           | Medium             | High            | Low
Flexibility     | Low                   | Low                            | Medium             | High            | High
*Latency is the time taken for each isolation technique to take effect on a highly memory-intensive workload (SP of NPB)
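The "quota/period (100)" entry for the CPU cycle limit can be made concrete with a small sketch. It computes the value one would write to the cgroup v1 `cpu.cfs_quota_us` file for a given percentage cap, assuming the 100 configurations correspond to 1% steps and the default 100 ms period; the function name `cfs_quota_us` and both assumptions are illustrative and not stated in the slides.

```python
def cfs_quota_us(percent: int, n_cpus: int = 1, period_us: int = 100_000) -> int:
    """Quota in microseconds per period that caps a group at percent of n_cpus.

    Assumes 1% steps and a 100 ms period (the usual cpu.cfs_period_us default).
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be in 1..100")
    return period_us * n_cpus * percent // 100


# Hypothetical example: cap a background workload at 50% of 8 cores.
bg_quota = cfs_quota_us(50, n_cpus=8)
print(bg_quota)

# The assumed 1% granularity yields 100 distinct settings per core.
settings = {cfs_quota_us(p) for p in range(1, 101)}
print(len(settings))
```

A throttled group only stops running once it has used up its quota within the current period, which is consistent with the table's 40~50 ms latency for this knob being much higher than the hardware knobs.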

8 Ineffective Isolation
[Figure: FG and BG workloads on a 16-core Xeon socket (LLC: 40MB, Mem BW: 68GB/s), isolated with combinations of Scheduling (S): CPU allocation, Throttling (T): per-core DVFS, and Partitioning (P): Intel CAT]
• FG: streamcluster (MemBW-int) & canneal (LLC-int), BG: SP (MemBW-int & LLC-int)
• Both workloads show the highest performance with a combination of different isolation techniques (indicated by yellow stars)
• Some isolation technique combinations show worse performance than the baseline that just allocates 8 cores evenly (indicated by red stars)

9 Trade-off: Strictness
• FG: streamcluster, BG: SP
[Figure panels: Changes in FG's LLC, Changes in BG's LLC, Changes in FG's IPC, Changes in BG's IPC]
* LLC sizes are measured by Intel Resctrl

10 Trade-off: Responsiveness
[Figure: the Apache web server and SP on a 16-core Xeon socket under HW (or SW) isolation; the ab client on a remote 32-core Xeon machine]
• FG: Apache web server (client: ab), BG: SP
• The HW isolation technique (per-core DVFS) shows lower tail latency for the latency-critical workload than the SW isolation technique (CPU cycle limit)
• This is because hardware isolation techniques enable faster isolation by controlling the hardware directly

11 Trade-off: Flexibility
• Thread migration
[Figure: two 16-core Xeon sockets running canneal, SP, swaptions, and nn; labels: x1.6, x1.3, x1.2]

12 HIS: Hybrid Isolation System
[Figure: user-level Profiler and Scheduler above the (Linux) OS kernel on a server machine running the FG and BG workloads; a feedback/monitor flow (perf, /proc) feeds the Profiler, and a control flow applies the isolation techniques]
• Profiler: monitors resource contention for the foreground workload in an online manner
• Scheduler: chooses an isolation technique based on the most contentious resource, available isolation techniques, and type of techniques (SW or HW)

13 HIS Online Profiler
[Figure: timeline of `profiling mode`, with SIGSTOP pausing and SIGCONT resuming a workload. If there is no valid profile, the Profiler performs `profiling mode` for 2 seconds (10 samples): it collects 'solo-run' samples of FG, then collects 'co-run' samples of FG]
• Profiler:
  • Measures how much workloads suffer from resource contention by calculating contention using samples:
    cont_res = (res_co-run / res_solo-run) - 1
  • Runs 'profiling mode' if there is no valid profile (e.g., phase change, no profile data)
Metric                              | Meaning            | Contention Type
LLC hit ratio                       | Data reuse         | LLC
Memory BW                           | Mem BW consumption | Mem BW
Free/allocated CPUs, active threads | CPU demand         | CPU Core
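The contention formula on this slide can be sketched in a few lines. The slide only says that contention is calculated "using samples"; averaging the 10 samples of each run before taking the ratio is an assumption made here, and the function name and sample values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def contention(co_run: Sequence[float], solo_run: Sequence[float]) -> float:
    """Compute (res_co-run / res_solo-run) - 1 from two sets of samples.

    Averaging the samples first is an assumption; the slides do not say
    how the samples are aggregated.
    """
    solo = mean(solo_run)
    if solo == 0:
        raise ValueError("solo-run metric is zero, contention is undefined")
    return mean(co_run) / solo - 1.0


# Hypothetical memory-bandwidth samples (GB/s): 10 solo-run and 10 co-run.
solo_samples = [20.0] * 10
co_samples = [15.0] * 10
mem_bw_contention = contention(co_samples, solo_samples)
print(round(mem_bw_contention, 3))
```

A value near 0 means the co-run metric matches the solo run; here the foreground gets 25% less memory bandwidth when co-located, so the result is -0.25.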

14 HIS Scheduler
[Figure: worked example in which Mem BW is the most contentious resource between FG and BG]
• Scheduler steps:
  1. Which is the most contentious one? (example: Mem BW)
  2. Which isolation techniques are available? (example: CPU allocation or per-core DVFS)
  3. HW or SW isolation? (example: CPU allocation (SW) or per-core DVFS (HW))
  4. Isolation technique is selected (example: per-core DVFS)
  5. Which parameter? (strengthen or weaken) (example: lower CPU freq., 2.1GHz -> 2.0GHz, which strengthens memory BW isolation)
  6. Get feedback from profiler (sleeps 200ms)
  7a. Iterate steps 1~6 by the feedback; if Mem BW contention is still high, go to step 2
  7b. If contentions are well managed within the threshold (the highest contentions are within 5%), then stop
Contention Type | Isolation techniques
LLC             | Intel CAT
Mem BW          | CPU allocation or per-core DVFS
CPU Core        | CPU allocation or thread migration
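The scheduler steps on this slide can be sketched as a small feedback loop. The mapping from contention type to techniques, the 5% threshold, and the 200 ms sleep come from the slide; everything else is an assumption for illustration: the function names, the use of the absolute value of the contention, and the "prefer HW when available" policy for step 3 (the slides do not state how that choice is made). Step 5, choosing the parameter, is left to the injected `apply` callback.

```python
import time

# Contention type -> candidate (technique, kind) pairs, as listed on the slide.
TECHNIQUES = {
    "llc": [("intel_cat", "HW")],
    "mem_bw": [("cpu_allocation", "SW"), ("per_core_dvfs", "HW")],
    "cpu_core": [("cpu_allocation", "SW"), ("thread_migration", "SW")],
}
THRESHOLD = 0.05  # stop when the highest contention is within 5%


def choose(contentions, prefer="HW", unavailable=frozenset()):
    """Steps 1-4: pick (resource, technique), or None if nothing needs doing."""
    resource = max(contentions, key=lambda r: abs(contentions[r]))
    if abs(contentions[resource]) <= THRESHOLD:
        return None
    candidates = [(n, k) for n, k in TECHNIQUES[resource] if n not in unavailable]
    if not candidates:
        return None
    for name, kind in candidates:
        if kind == prefer:  # assumed policy: take the preferred kind if present
            return resource, name
    return resource, candidates[0][0]


def control_loop(get_contentions, apply, max_rounds=10, sleep=time.sleep):
    """Steps 5-7: apply the chosen technique, wait, and re-check the feedback."""
    for _ in range(max_rounds):
        decision = choose(get_contentions())
        if decision is None:
            return True  # contentions are within the threshold
        apply(*decision)
        sleep(0.2)  # 200 ms before reading the profiler again
    return False
```

A usage example would pass a function that reads the profiler's latest contention values as `get_contentions` and a function that adjusts the selected knob by one step as `apply`; both are hypothetical hooks rather than interfaces described in the slides.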

15 Experimental Setup
• Machine
  • Intel Xeon E5-2683 v4 (16 cores per socket, 2 sockets)
  • LLC: 40MB, RAM: 32GB, Memory BW: 68GB/s
  • Linux 4.19.0
• Workloads
  • Multi-threaded benchmarks (PARSEC, Rodinia, NPB, Apache web server with ab)
  • Co-locates two workloads (FG and BG) in a machine
  • SP is used for the BG workload, because SP shows the highest memory bandwidth and LLC usage
• Comparing with co-running workloads with static CPU isolation
  • Batch workloads
  • Latency-critical workloads

16 Evaluation (Batch Workload)
• FG: benchmarks shown on the x-axis & BG: SP of NPB
• Performance is normalized to the baseline (co-run with static CPU isolation)
• HIS achieves 1.05~1.7x the performance of the baseline, while BG performance degrades by up to 1.3x

17 Evaluation (Latency-critical Workload)
• FG: Apache web benchmark (with ab) & BG: SP
• The client (ab) sends requests generated with a Pareto distribution
• The Apache web benchmark achieves 2.14x the performance of the co-run case and also shows slightly lower latency than the solo-run (8-core) case
• The maximum number of CPU cores allocated to the web server reached 12 in response to CPU demand

18 Future Work and Conclusion
• Future work
  • Testing with other isolation policies (e.g., selecting isolation techniques)
  • Testing with more diverse benchmarks (e.g., phase-changing workloads)
  • Comparing with recent works (e.g., Heracles-like system)
• Conclusion
  • We have analyzed the trade-offs in HW/SW isolation techniques
  • We have implemented a prototype of a hybrid isolation system
  • We have evaluated our prototype using selected benchmarks and achieved a 1.7~2.14x performance improvement over static isolation

19 Thank you! Questions?
Yoonsung Nam, DCSLab., Seoul National University
yoonsung.nam@snu.ac.kr
