Core-Aware Scheduling: Balancing Application Parallelism with Core Availability



Core-Aware Scheduling: Balancing Application Parallelism with Core Availability

Henry Qin
Advisor: John Ousterhout
February 2, 2016

1 / 15


Introduction

Motivation: inefficient core and thread management

• Hard to get high throughput in low-latency services.
• Difficult to match application parallelism to available cores.

Proposal: Core-Aware Scheduling

• Thread scheduling moves to user level.
• The kernel allocates cores to applications.

2 / 15


Outline

• Motivation
• Proposal for Core-Aware Scheduling
• Related Work
• Current Status
• Request for Feedback

3 / 15


A Throughput Problem

• RAMCloud write requests must issue replication requests to backup servers and wait for their responses.
• RAMCloud uses polling to avoid expensive kernel thread switches, and kernel bypass to avoid system calls.
• When the master runs out of CPU cores, it must cease processing requests.

4 / 15


Core Exhaustion Bottleneck

[Diagram: the master writes to its log and issues replication RPCs to three backups; a new write request arrives, but there are no cores left to write to the local log.]

5 / 15


What happens under load?

• Backups are slower to respond, since they coexist with masters.
• Write requests wait even longer for backups, spinning on cores for even longer.

6 / 15



Match application parallelism to available cores

• Application servers can have many threads running, such as log cleaners, worker threads, and failure-detection threads.
• We want to neither overcommit nor undercommit cores:
  • Overcommitting cores causes undesirable kernel multiplexing, because there are multiple kernel threads per core.
  • Undercommitting cores leaves cores idle.
• When the log cleaner needs to run, we would like to scale down the number of worker threads so that we do not exceed the available cores.
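The scaling policy on this slide can be sketched in a few lines. This is a hypothetical illustration, not code from the talk; the function name `workerThreadBudget` and the specific policy are assumptions:

```cpp
#include <algorithm>

// Hypothetical sketch of the scaling policy above: given the number of
// cores currently granted to this process and the number of background
// threads that must run (log cleaner, failure detector, ...), keep only
// as many worker threads as fit without exceeding the core allocation.
int workerThreadBudget(int grantedCores, int backgroundThreads) {
    // Never go negative: if background threads alone exceed the grant,
    // no worker threads can run without overcommitting.
    return std::max(0, grantedCores - backgroundThreads);
}
```

When the log cleaner wakes up, the background count rises by one and the worker budget shrinks by one, so the total number of runnable threads never exceeds the cores the application holds.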

8 / 15


Core-Aware Scheduling: Kernel Core Allocator

• A kernel scheduler class allocates cores to applications on request.
• In general, the kernel never preempts a thread running on the cores it has allocated to the process.
• This allows the kernel to safely multiplex latency-sensitive applications with CPU-bound batch jobs.
• Latency-sensitive applications can request only as many cores as they need, and give up cores when they no longer need them.
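To make the request/release interface concrete, here is a toy user-space model of the allocator's bookkeeping. This is not the proposed kernel interface (which does not yet exist as described); the class name `CoreAllocator` and its methods are illustrative assumptions:

```cpp
// Toy user-space model (not a real kernel API) of the proposed core
// allocator: applications request cores and release them, and the
// allocator never hands out more cores than the machine has.
class CoreAllocator {
    int freeCores;  // cores not currently allocated to any application
public:
    explicit CoreAllocator(int totalCores) : freeCores(totalCores) {}

    // Grant up to `wanted` cores; returns how many were actually granted.
    // A latency-sensitive app would request only what it needs.
    int request(int wanted) {
        int granted = wanted < freeCores ? wanted : freeCores;
        freeCores -= granted;
        return granted;
    }

    // Return cores when the application no longer needs them, making
    // them available to, e.g., CPU-bound batch jobs.
    void release(int n) { freeCores += n; }

    int available() const { return freeCores; }
};
```

The key property the slide argues for is visible even in this sketch: the grant is explicit, so the application always knows exactly which resources it holds, and the kernel reclaims cores only through the release path rather than by preemption.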

9 / 15


Core-Aware Scheduling: Userland Scheduler

• Fast context switches enable practical core multiplexing in a low-latency system.
• Manage thread priorities and parallelism level based on application-specified policies.
• The user-level scheduler requests dedicated cores from the OS, and always knows exactly how many cores it has.
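A minimal sketch of the dispatch side of such a userland scheduler, under stated assumptions: the `Dispatcher` class below is hypothetical, tasks here cooperate by returning to the dispatcher (returning `true` to be rescheduled), and the real system would swap register state and stacks rather than use `std::function`:

```cpp
#include <deque>
#include <functional>
#include <utility>

// Minimal sketch of a user-level dispatcher: runnable tasks are kept in
// a queue and run round-robin on the cores the process owns. A task
// "yields" by returning; returning true means "schedule me again".
class Dispatcher {
    std::deque<std::function<bool()>> tasks;
public:
    void spawn(std::function<bool()> t) { tasks.push_back(std::move(t)); }

    // Run tasks until none remain; returns the number of dispatches,
    // i.e., how many user-level "context switches" occurred.
    int runUntilIdle() {
        int dispatches = 0;
        while (!tasks.empty()) {
            auto t = std::move(tasks.front());
            tasks.pop_front();
            ++dispatches;
            if (t()) tasks.push_back(std::move(t));  // requeue if not done
        }
        return dispatches;
    }
};
```

Because the scheduler runs entirely in user space, each dispatch costs a function call plus queue manipulation rather than a kernel thread switch, which is what makes multiplexing many lightweight threads over a few dedicated cores practical.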

10 / 15


Preempted Questions

• How will you handle system calls for blocking I/O?
• Why is thread pinning insufficient?

11 / 15

Related Work

• Scheduler Activations inspired this work, but it is not sufficiently core-aware because the kernel makes too many scheduling decisions.
• Linux cgroups do not support the dedicated allocation of specific cores.
• Capriccio does not support multicore.
• Go does not address the core-allocation problem; there is no mechanism to communicate with the kernel for dedicated cores.
• Cilk requires user threads to be non-blocking.
• OpenMP supports neither core allocation nor explicit management of thread scheduling.

12 / 15


Current Status

• Implemented a simple user-level dispatcher.
• Measured a one-way context switch, with no cache pollution, at 9 ns on an Intel Xeon X3470 @ 2.93 GHz.
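A rough harness in the spirit of this measurement, amortizing one cheap operation over many iterations. This is an illustrative assumption, not the talk's actual benchmark: an indirect call through `fn` stands in for the real user-level context switch, and absolute numbers will vary widely by CPU (the 9 ns figure was on the Xeon X3470):

```cpp
#include <chrono>

// Stand-in for the operation being timed; the real measurement swapped
// register state between user-level threads.
inline void nop() {}

// Time `iters` back-to-back dispatches and report nanoseconds per
// dispatch. Amortizing over many iterations hides clock overhead,
// which would otherwise dwarf a single ~10 ns event.
double nsPerDispatch(void (*fn)(), int iters) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / iters;
}
```

Note the "no cache pollution" caveat on the slide: a tight loop like this keeps the working set hot, so it measures a best case rather than the cost of switching between threads with large, cold working sets.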

13 / 15

Request for Feedback

• Do you know of a threading system that solves these problems of core allocation and fast context switching practically and cleanly?
• Have you ever measured core utilization over short time intervals (milliseconds and seconds) on your large-scale systems?
• Do you have dedicated hardware or shared machines?
• How do you decide on the number of OS threads for an application? What is the relationship between this number and the number of cores on the machine?

14 / 15


Thank You!

If we did not talk at the poster session, please find me at the reception!
Send mail to hq6@cs.stanford.edu

Questions?

15 / 15