FlexSC: Flexible System Call Scheduling with Exception-Less System Calls (Livio Soares and Michael Stumm, University of Toronto)



slide-1
SLIDE 1

FlexSC

Flexible System Call Scheduling with Exception-Less System Calls

Livio Soares and Michael Stumm

University of Toronto

slide-2
SLIDE 2

2

Motivation

The synchronous system call interface is a legacy from the single-core era

FlexSC implements efficient and flexible system calls for the multicore era

Synchronous system calls are expensive! Costs are:

➔ direct: mode switch
➔ indirect: processor structure pollution

slide-3
SLIDE 3

3

FlexSC overview

Two contributions: FlexSC and FlexSC-Threads

Results in:
1) MySQL throughput increase of up to 40% and latency reduction of 30%
2) Apache throughput increase of up to 115% and latency reduction of 50%

slide-4
SLIDE 4

4

Performance impact of synchronous syscalls

➔ Xalan from SPEC CPU 2006
➔ Virtually no time in the OS
➔ Linux on Intel Core i7 (Nehalem)
➔ Injected exceptions with varying frequencies
➔ Direct: emulate null system call
➔ Indirect: emulate “write()” system call
➔ Measured only user-mode time
➔ Kernel time ignored

Ideally, user-mode performance is unaltered

slide-5
SLIDE 5

5

Degradation due to sync. syscalls

[Figure: Xalan (SPEC CPU 2006). x-axis: user-mode instructions between exceptions (log scale, 1K to 100K); y-axis: degradation, 0% to 70% (lower is faster); series: Indirect, Direct; annotations: Apache, MySQL]

System calls can halve processor efficiency; the indirect cause is the major contributor

slide-6
SLIDE 6

6

Processor state pollution

➔Key source of performance impact
➔On a Linux write() call:
➔ up to 2/3 of the L1 data cache and data TLB are evicted
➔Kernel performance equally affected
➔ Processor efficiency for OS code is also cut in half

slide-7
SLIDE 7

7

Synchronous system calls are expensive

[Diagram labels: User, Kernel]

Traditional system calls are synchronous and use exceptions to cross domains

slide-8
SLIDE 8

8

Alternative: side-step the boundary

[Diagram labels: User, Kernel]

Exception-less syscalls remove synchronicity by decoupling invocation from execution

slide-9
SLIDE 9

9

Benefits of exception-less system calls

[Diagram labels: User, Kernel]

➔Significantly reduce direct costs

➔ Fewer mode switches

➔Allow for batching

➔ Reduce indirect costs

➔Allow for dynamic multicore specialization

➔ Further reduce direct and indirect costs

slide-10
SLIDE 10

10

Exception-less interface: syscall page

write(fd, buf, 4096);

entry = free_syscall_entry();
/* write syscall */
entry->syscall = 1;
entry->num_args = 3;
entry->args[0] = fd;
entry->args[1] = buf;
entry->args[2] = 4096;
entry->status = SUBMIT;
while (entry->status != DONE)
    do_something_else();
return entry->return_code;
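A minimal C sketch of the user side of this interface, assuming a statically allocated array stands in for the shared syscall page. Only `free_syscall_entry`, the field names, and the SUBMIT/DONE states come from the slide; the struct layout, the FREE state, `post_write`, `try_collect`, and `fake_kernel_complete` are hypothetical names introduced for illustration, not FlexSC's actual API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

enum { FREE = 0, SUBMIT = 1, DONE = 2 };            /* assumed entry life cycle */
enum { WRITE_NR = 1, MAX_ARGS = 6, NUM_ENTRIES = 64 };

struct syscall_entry {
    int        syscall;        /* system call number (1 = write, as on the slide) */
    int        num_args;
    long       args[MAX_ARGS];
    long       return_code;
    atomic_int status;         /* FREE -> SUBMIT -> DONE -> FREE */
};

/* Stand-in for the page shared between user and kernel space. */
static struct syscall_entry syscall_page[NUM_ENTRIES];

/* Return a FREE entry, or NULL if every entry is in use. */
static struct syscall_entry *free_syscall_entry(void)
{
    for (size_t i = 0; i < NUM_ENTRIES; i++)
        if (atomic_load_explicit(&syscall_page[i].status,
                                 memory_order_acquire) == FREE)
            return &syscall_page[i];
    return NULL;
}

/* Post a write request without entering the kernel. */
static struct syscall_entry *post_write(int fd, const void *buf, size_t len)
{
    struct syscall_entry *entry = free_syscall_entry();
    if (entry == NULL)
        return NULL;
    entry->syscall  = WRITE_NR;
    entry->num_args = 3;
    entry->args[0]  = fd;
    entry->args[1]  = (long)(intptr_t)buf;
    entry->args[2]  = (long)len;
    /* Release ordering: the arguments are visible before the status flips. */
    atomic_store_explicit(&entry->status, SUBMIT, memory_order_release);
    return entry;
}

/* If the entry is DONE, fetch its result and recycle it; otherwise return 0. */
static int try_collect(struct syscall_entry *entry, long *result)
{
    if (atomic_load_explicit(&entry->status, memory_order_acquire) != DONE)
        return 0;
    *result = entry->return_code;
    atomic_store_explicit(&entry->status, FREE, memory_order_release);
    return 1;
}

/* Hypothetical stand-in for the kernel side: pretend the write succeeded. */
static void fake_kernel_complete(struct syscall_entry *entry)
{
    entry->return_code = entry->args[2];
    atomic_store_explicit(&entry->status, DONE, memory_order_release);
}
```

A caller would post with `post_write(fd, buf, 4096)`, keep doing other work while `try_collect` returns 0, and read the result once it returns 1. The release/acquire pair is one plausible way to order the argument writes against the status flip when the consumer runs on another core.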

slide-11
SLIDE 11

11

Exception-less interface: syscall page

[Same code as slide 10; the entry's status is highlighted as SUBMIT]

slide-12
SLIDE 12

12

Exception-less interface: syscall page

[Same code as slide 10; the entry's status is highlighted as DONE]

slide-13
SLIDE 13

13

Syscall threads

➔Kernel-only threads

➔ Part of application process

➔Execute requests from syscall page
➔Schedulable on a per-core basis
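What one pass of a syscall thread over the syscall page might look like, as a single-threaded simulation. The handler table, `sim_read`, `sim_write`, and `syscall_thread_scan` are hypothetical; a real syscall thread would run inside the kernel and dispatch to the actual system call routines.

```c
#include <assert.h>
#include <stddef.h>

enum { FREE = 0, SUBMIT = 1, DONE = 2 };
enum { NUM_ENTRIES = 8, NUM_SYSCALLS = 2 };

struct syscall_entry {
    int  syscall;
    int  num_args;
    long args[6];
    long return_code;
    int  status;
};

typedef long (*syscall_handler)(const long *args);

/* Hypothetical handlers standing in for real kernel routines. */
static long sim_read(const long *args)  { (void)args; return 0; } /* placeholder */
static long sim_write(const long *args) { return args[2]; }       /* all bytes "written" */

/* Slot 1 is write, matching `entry->syscall = 1` on the slide. */
static const syscall_handler handlers[NUM_SYSCALLS] = { sim_read, sim_write };

/* Execute every SUBMIT entry and mark it DONE; returns how many were run. */
static int syscall_thread_scan(struct syscall_entry *page, size_t n)
{
    int executed = 0;
    for (size_t i = 0; i < n; i++) {
        struct syscall_entry *e = &page[i];
        if (e->status != SUBMIT)
            continue;
        if (e->syscall >= 0 && e->syscall < NUM_SYSCALLS)
            e->return_code = handlers[e->syscall](e->args);
        else
            e->return_code = -1;   /* unknown call number in this simulation */
        e->status = DONE;
        executed++;
    }
    return executed;
}
```

Because the scan only reads and writes the shared page, it could in principle run on a different core from the thread that posted the requests, which is what makes per-core scheduling possible.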

slide-14
SLIDE 14

14

System call batching

Request as many system calls as possible
Switch to kernel-mode
Start executing all posted system calls

Avoids direct and indirect costs, even on a single core
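The effect of batching on the number of mode switches can be illustrated with a small counting model. This is a sketch under simplifying assumptions: the counters and the functions `sync_call`, `post_call`, and `enter_kernel_and_run_batch` are hypothetical, and the model counts only user-to-kernel crossings, not cache or TLB effects.

```c
#include <assert.h>

enum { FREE = 0, SUBMIT = 1, DONE = 2 };
enum { NUM_ENTRIES = 16 };

struct sim_state {
    int status[NUM_ENTRIES];  /* status word of each syscall page entry */
    int mode_switches;        /* simulated user-to-kernel crossings */
    int executed;             /* system calls executed so far */
};

/* Synchronous model: every call crosses into the kernel on its own. */
static void sync_call(struct sim_state *s)
{
    s->mode_switches++;
    s->executed++;
}

/* Exception-less model: posting only writes to the page. Returns 0 if full. */
static int post_call(struct sim_state *s)
{
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (s->status[i] == FREE) {
            s->status[i] = SUBMIT;
            return 1;
        }
    }
    return 0;
}

/* A single kernel entry executes everything posted so far. */
static void enter_kernel_and_run_batch(struct sim_state *s)
{
    s->mode_switches++;
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (s->status[i] == SUBMIT) {
            s->status[i] = DONE;
            s->executed++;
        }
    }
}
```

In this model, ten synchronous calls cost ten crossings, while ten posted calls followed by one `enter_kernel_and_run_batch` cost a single crossing.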

slide-15
SLIDE 15

15

Dynamic multicore specialization

FlexSC makes specializing cores simple
Dynamically adapts to workload needs

slide-16
SLIDE 16

16

What programs can benefit from FlexSC?

Event-driven servers (e.g., memcached, nginx webserver)

➔ Use asynchronous calls, similar to FlexSC
➔ Can use FlexSC directly
➔ Mix sync and exception-less system calls

Multi-threaded servers: FlexSC-Threads

➔ Thread library, compatible with Pthreads
➔ No changes to app. code or recompilation required
➔ Transparently converts legacy syscalls into exception-less ones

slide-17
SLIDE 17

17

FlexSC-Threads library

➔ Hybrid (M-on-N) threading model
➔ One kernel-visible thread per core
➔ Many user threads per kernel-visible thread
➔ Redirects system calls (libc wrappers)
➔ Posts exception-less syscall to syscall page
➔ Switches to other user-level thread
➔ Resumes thread upon syscall completion

Benefits of exception-less syscalls while maintaining sequential syscall interface

slide-18
SLIDE 18

18

FlexSC-Threads in action

[Diagram label: User]

slide-19
SLIDE 19

19

FlexSC-Threads in action

On a syscall:
Post request to system call page
Block user-level thread

slide-20
SLIDE 20

20

FlexSC-Threads in action

[Diagram label: Kernel]

On a syscall:
Post request to system call page
Block user-level thread
Switch to next ready thread

slide-21
SLIDE 21

21

FlexSC-Threads in action

[Diagram labels: User, Kernel]

If all user-level threads become blocked:
1) enter kernel
2) wait for completion of at least 1 syscall
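The scheduling steps on these slides (post and block on a syscall, switch to the next ready thread, enter the kernel only when every thread is blocked) can be sketched as a toy simulation for one kernel-visible thread. All names here are hypothetical, user-level threads are reduced to a counter of system calls still to issue, and the simulated kernel completes every pending call on each entry, which is a simplification of "at least 1".

```c
#include <assert.h>

enum thread_state { READY, BLOCKED, FINISHED };
enum { NUM_THREADS = 4 };

struct user_thread {
    enum thread_state state;
    int syscalls_left;    /* system calls this thread still wants to issue */
};

struct sched_stats {
    int syscalls_posted;
    int kernel_entries;   /* times the kernel-visible thread entered the kernel */
};

/* Run one READY thread up to its next system call: post it and block. */
static void run_until_syscall(struct user_thread *t, struct sched_stats *st)
{
    if (t->syscalls_left == 0) {
        t->state = FINISHED;
        return;
    }
    t->syscalls_left--;
    st->syscalls_posted++;   /* post request to the syscall page */
    t->state = BLOCKED;      /* block the user-level thread */
}

/* Simulated wait: the kernel completes all pending calls, waking the threads. */
static void enter_kernel_and_wait(struct user_thread *threads, int n,
                                  struct sched_stats *st)
{
    st->kernel_entries++;
    for (int i = 0; i < n; i++)
        if (threads[i].state == BLOCKED)
            threads[i].state = READY;
}

/* Schedule user-level threads on one kernel-visible thread until all finish. */
static void run_scheduler(struct user_thread *threads, int n,
                          struct sched_stats *st)
{
    for (;;) {
        int ran = 0, blocked = 0;
        for (int i = 0; i < n; i++) {
            if (threads[i].state == READY) {
                run_until_syscall(&threads[i], st);  /* switch to next ready thread */
                ran = 1;
            }
            if (threads[i].state == BLOCKED)
                blocked++;
        }
        if (blocked == 0 && !ran)
            return;                                   /* every thread has finished */
        if (blocked > 0)                              /* no READY thread is left */
            enter_kernel_and_wait(threads, n, st);
    }
}
```

With four threads issuing three calls each, this model posts twelve requests but enters the kernel only three times, whereas a synchronous model would enter it twelve times.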

slide-22
SLIDE 22

22

Evaluation

➔Linux 2.6.33
➔Nehalem (Core i7) server, 2.3GHz
➔ 4 cores on a chip
➔Clients connected on 1 Gbps network
➔Workloads
➔ Sysbench on MySQL (80% user, 20% kernel)
➔ ApacheBench on Apache (50% user, 50% kernel)
➔Default Linux NPTL (“sync”) vs. FlexSC-Threads (“flexsc”)

slide-23
SLIDE 23

23

Sysbench: “OLTP” on MySQL (1 core)

[Figure: throughput (requests/sec.) vs. request concurrency; series: flexsc, sync]

15% improvement

slide-24
SLIDE 24

24

Sysbench: “OLTP” on MySQL (4 cores)

[Figure: throughput (requests/sec.) vs. request concurrency; series: flexsc, sync]

40% improvement

slide-25
SLIDE 25

25

MySQL latency per client request

[Figure: latency (ms) at 256 connections, average and 95th percentile, for sync and flexsc on 1, 2, and 4 cores; chart annotation: 1900]

Up to 30% reduction of average request latencies

slide-26
SLIDE 26

26

MySQL processor metrics

[Figure: SysBench (4 cores), relative performance (flexsc/sync) for user-mode and kernel execution; metrics: IPC, L3, L2, d-cache, i-cache, TLB, Branch]

Performance improvements are a consequence of more efficient processor execution

slide-27
SLIDE 27

27

ApacheBench throughput (1 core)

[Figure: throughput (requests/sec.) vs. request concurrency; series: flexsc, sync]

80-90% improvement

slide-28
SLIDE 28

28

ApacheBench throughput (4 cores)

[Figure: throughput (requests/sec.) vs. request concurrency; series: flexsc, sync]

115% improvement

slide-29
SLIDE 29

29

Apache latency per client request

[Figure: latency (ms) at 256 concurrent requests, average and 99th percentile, for sync and flexsc on 1, 2, and 4 cores; chart annotation: 238]

Up to 50% reduction of average request latencies

slide-30
SLIDE 30

30

Apache processor metrics

[Figure: Apache (1 core), relative performance (flexsc/sync) for user-mode and kernel execution; metrics: IPC, L3, L2, d-cache, i-cache, TLB, Branch]

Processor efficiency doubles for kernel and user-mode execution

slide-31
SLIDE 31

31

Discussion

➔New OS architecture not necessary

➔ Exception-less syscalls can coexist with legacy ones

➔Foundation for non-blocking system calls

➔ select() / poll() in user-space ➔ Interesting case of non-blocking free()

➔Multicore ultra-specialization

➔ TCP Servers (Rutgers; Iftode et al.), FS Servers

➔Single-ISA asymmetric cores

➔ OS-friendly cores (HP Labs; Mogul et al.)

slide-32
SLIDE 32

32

Concluding Remarks

➔ System calls degrade server performance
➔ Processor pollution is inherent to synchronous system calls
➔ Exception-less syscalls
➔ Flexible and efficient system call execution
➔ FlexSC-Threads
➔ Leverages exception-less syscalls
➔ No modifications to multi-threaded applications
➔ Throughput & latency gains
➔ 2x throughput improvement for Apache and BIND
➔ 1.4x throughput improvement for MySQL

slide-33
SLIDE 33

FlexSC

Flexible System Call Scheduling with Exception-Less System Calls

Livio Soares and Michael Stumm

University of Toronto