Core-Aware Scheduling: Balancing Application Parallelism with Core Availability



Core-Aware Scheduling: Balancing Application Parallelism with Core Availability

Henry Qin
Advisor: John Ousterhout
February 2, 2016

1 / 15


Introduction

Motivation: inefficient core and thread management

• Hard to get high throughput in low-latency services.
• Difficult to match application parallelism to available cores.

Proposal: Core-Aware Scheduling

• Thread scheduling moves to user level.
• The kernel allocates cores to applications.

2 / 15


Outline

• Motivation
• Proposal for Core-Aware Scheduling
• Related Work
• Current Status
• Request for Feedback

3 / 15


A Throughput Problem

• RAMCloud write requests must issue replication requests to backup servers and wait for their responses.
• RAMCloud uses polling to avoid expensive kernel thread switches, and kernel bypass to avoid system calls.
• When the master runs out of CPU cores, it must cease processing requests.

4 / 15


Core Exhaustion Bottleneck

[Diagram: the master writes to its log and issues replication RPCs to three backups; a new write request arrives, but there are no cores left to write to the local log.]

5 / 15


What happens under load?

• Backups are slower to respond, since they coexist with masters.
• Write requests wait even longer for backups, spinning on cores for even longer.

6 / 15



Match application parallelism to available cores

• Application servers can have many threads running, such as log cleaners, worker threads, and failure-detection threads.
• We want to neither overcommit nor undercommit cores:
  • Overcommitting cores causes undesirable kernel multiplexing, because there are multiple kernel threads per core.
  • Undercommitting cores leaves cores idle.
• When the log cleaner needs to run, we would like to scale down the number of worker threads so that we do not exceed the available cores.
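The scaling policy on this slide can be sketched in a few lines. This is a hypothetical illustration, not code from the talk; the function name `workerThreadBudget` and the specific policy are assumptions:

```cpp
#include <algorithm>

// Hypothetical sketch of the scaling policy above: given the number of
// cores currently granted to this process and the number of background
// threads that must run (log cleaner, failure detector, ...), keep only
// as many worker threads as fit without exceeding the core allocation.
int workerThreadBudget(int grantedCores, int backgroundThreads) {
    // Never go negative: if background threads alone exceed the grant,
    // no worker threads can run without overcommitting.
    return std::max(0, grantedCores - backgroundThreads);
}
```

When the log cleaner wakes up, the background count rises by one and the worker budget shrinks by one, so the total number of runnable threads never exceeds the cores the application holds.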

8 / 15


Core-Aware Scheduling: Kernel Core Allocator

• A kernel scheduler class allocates cores to applications on request.
• In general, the kernel never preempts a thread running on the cores it has allocated to the process.
• This allows the kernel to safely multiplex latency-sensitive applications with CPU-bound batch jobs.
• Latency-sensitive applications can request only as many cores as they need, and give up cores when they no longer need them.
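To make the request/release interface concrete, here is a toy user-space model of the allocator's bookkeeping. This is not the proposed kernel interface (which does not yet exist as described); the class name `CoreAllocator` and its methods are illustrative assumptions:

```cpp
// Toy user-space model (not a real kernel API) of the proposed core
// allocator: applications request cores and release them, and the
// allocator never hands out more cores than the machine has.
class CoreAllocator {
    int freeCores;  // cores not currently allocated to any application
public:
    explicit CoreAllocator(int totalCores) : freeCores(totalCores) {}

    // Grant up to `wanted` cores; returns how many were actually granted.
    // A latency-sensitive app would request only what it needs.
    int request(int wanted) {
        int granted = wanted < freeCores ? wanted : freeCores;
        freeCores -= granted;
        return granted;
    }

    // Return cores when the application no longer needs them, making
    // them available to, e.g., CPU-bound batch jobs.
    void release(int n) { freeCores += n; }

    int available() const { return freeCores; }
};
```

The key property the slide argues for is visible even in this sketch: the grant is explicit, so the application always knows exactly which resources it holds, and the kernel reclaims cores only through the release path rather than by preemption.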

9 / 15


Core-Aware Scheduling: Userland Scheduler

• Fast context switches enable practical core multiplexing in a low-latency system.
• Manage thread priorities and parallelism level based on application-specified policies.
• The user-level scheduler requests dedicated cores from the OS, and always knows exactly how many cores it has.
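A minimal sketch of the dispatch side of such a userland scheduler, under stated assumptions: the `Dispatcher` class below is hypothetical, tasks here cooperate by returning to the dispatcher (returning `true` to be rescheduled), and the real system would swap register state and stacks rather than use `std::function`:

```cpp
#include <deque>
#include <functional>
#include <utility>

// Minimal sketch of a user-level dispatcher: runnable tasks are kept in
// a queue and run round-robin on the cores the process owns. A task
// "yields" by returning; returning true means "schedule me again".
class Dispatcher {
    std::deque<std::function<bool()>> tasks;
public:
    void spawn(std::function<bool()> t) { tasks.push_back(std::move(t)); }

    // Run tasks until none remain; returns the number of dispatches,
    // i.e., how many user-level "context switches" occurred.
    int runUntilIdle() {
        int dispatches = 0;
        while (!tasks.empty()) {
            auto t = std::move(tasks.front());
            tasks.pop_front();
            ++dispatches;
            if (t()) tasks.push_back(std::move(t));  // requeue if not done
        }
        return dispatches;
    }
};
```

Because the scheduler runs entirely in user space, each dispatch costs a function call plus queue manipulation rather than a kernel thread switch, which is what makes multiplexing many lightweight threads over a few dedicated cores practical.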

10 / 15


Preempted Questions

• How will you handle system calls for blocking I/O?
• Why is thread pinning insufficient?

11 / 15

Related Work

• Scheduler Activations inspired this work, but it is not sufficiently core-aware because the kernel makes too many scheduling decisions.
• Linux cgroups do not support the dedicated allocation of specific cores.
• Capriccio does not support multicore.
• Go does not address the core-allocation problem; there is no mechanism to communicate with the kernel for dedicated cores.
• Cilk requires user threads to be non-blocking.
• OpenMP supports neither core allocation nor explicit management of thread scheduling.

12 / 15


Current Status

• Implemented a simple user-level dispatcher.
• Measured a one-way context switch, with no cache pollution, at 9 ns on an Intel Xeon X3470 @ 2.93 GHz.
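A rough harness in the spirit of this measurement, amortizing one cheap operation over many iterations. This is an illustrative assumption, not the talk's actual benchmark: an indirect call through `fn` stands in for the real user-level context switch, and absolute numbers will vary widely by CPU (the 9 ns figure was on the Xeon X3470):

```cpp
#include <chrono>

// Stand-in for the operation being timed; the real measurement swapped
// register state between user-level threads.
inline void nop() {}

// Time `iters` back-to-back dispatches and report nanoseconds per
// dispatch. Amortizing over many iterations hides clock overhead,
// which would otherwise dwarf a single ~10 ns event.
double nsPerDispatch(void (*fn)(), int iters) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / iters;
}
```

Note the "no cache pollution" caveat on the slide: a tight loop like this keeps the working set hot, so it measures a best case rather than the cost of switching between threads with large, cold working sets.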

13 / 15

Request for Feedback

• Do you know of a threading system that solves these problems of core allocation and fast context switching practically and cleanly?
• Have you ever measured core utilization over short time intervals (milliseconds and seconds) on your large-scale systems?
• Do you have dedicated hardware or shared machines?
• How do you decide on the number of OS threads for an application? What is the relationship between this number and the number of cores on the machine?

14 / 15


Thank You!

If we did not talk at the poster session, please find me at the reception!
Send mail to hq6@cs.stanford.edu

Questions?

15 / 15