FlexSC: Flexible System Call Scheduling with Exception-Less System Calls
Livio Soares and Michael Stumm
University of Toronto
2
Motivation
The synchronous system call interface is a legacy from the single-core era
Expensive! Costs are:
➔ direct: mode switch
➔ indirect: processor structure pollution
FlexSC implements efficient and flexible system calls for the multicore era
3
FlexSC overview
Two contributions: FlexSC and FlexSC-Threads
Results in:
1) MySQL throughput increase of up to 40% and latency reduction of 30%
2) Apache throughput increase of up to 115% and latency reduction of 50%
4
Performance impact of synchronous syscalls
➔ Xalan from SPEC CPU 2006
➔ Virtually no time in the OS
➔ Linux on Intel Core i7 (Nehalem)
➔ Injected exceptions with varying frequencies
➔ Direct: emulate null system call
➔ Indirect: emulate “write()” system call
➔ Measured only user-mode time
➔ Kernel time ignored
Ideally, user-mode performance is unaltered
5
Degradation due to sync. syscalls
[Figure: Xalan (SPEC CPU 2006). X-axis: user-mode instructions between exceptions (log scale, 1K to 100K). Y-axis: degradation (0% to 70%, lower is faster). Series: Indirect, Direct. Annotations: Apache, MySQL.]
System calls can halve processor efficiency; the indirect cost is the major contributor
6
Processor state pollution
➔Key source of performance impact
➔On a Linux write() call:
➔ up to 2/3 of the L1 data cache and data TLB are evicted
➔Kernel performance equally affected
➔ Processor efficiency for OS code is also cut in half
7
Synchronous system calls are expensive
[Diagram: User and Kernel domains]
Traditional system calls are synchronous and use exceptions to cross domains
8
Alternative: side-step the boundary
[Diagram: User and Kernel domains]
Exception-less syscalls remove synchronicity by decoupling invocation from execution
9
Benefits of exception-less system calls
[Diagram: User and Kernel domains]
➔Significantly reduce direct costs
➔ Fewer mode switches
➔Allow for batching
➔ Reduce indirect costs
➔Allow for dynamic multicore specialization
➔ Further reduce direct and indirect costs
10
Exception-less interface: syscall page
write(fd, buf, 4096);

entry = free_syscall_entry();

/* write syscall */
entry->syscall = 1;
entry->num_args = 3;
entry->args[0] = fd;
entry->args[1] = buf;
entry->args[2] = 4096;
entry->status = SUBMIT;

while (entry->status != DONE)
    do_something_else();

return entry->return_code;
11
Exception-less interface: syscall page
[Same code as on slide 10; the entry in the syscall page is shown with status SUBMIT]
12
Exception-less interface: syscall page
[Same code as on slide 10; the entry in the syscall page is shown with status DONE]
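The entry protocol on these slides can be sketched in plain C. This is a user-space simulation only, not the FlexSC kernel implementation: `syscall_thread_pass()` stands in for a kernel syscall thread and simply calls `write()` itself. The names `post_write`, `syscall_thread_pass`, `collect_result`, the `FREE` and `FILLING` states, the six-argument limit, and the page capacity are assumptions made for illustration; only the entry fields and the `SUBMIT` and `DONE` states come from the slide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Entry states. SUBMIT and DONE come from the slide; FREE and FILLING
 * are assumptions added for this sketch. */
enum { FREE = 0, FILLING, SUBMIT, DONE };

#define NR_WRITE 1      /* the slide uses 1 for write */
#define PAGE_ENTRIES 8  /* assumed capacity */

struct syscall_entry {
    long syscall;
    int num_args;
    long args[6];
    atomic_int status;
    long return_code;
};

/* Shared "syscall page"; zero-initialised, so every entry starts FREE. */
struct syscall_entry syscall_page[PAGE_ENTRIES];

/* Claim a free entry, or return NULL if the page is full. */
struct syscall_entry *free_syscall_entry(void)
{
    for (size_t i = 0; i < PAGE_ENTRIES; i++) {
        int expected = FREE;
        if (atomic_compare_exchange_strong(&syscall_page[i].status,
                                           &expected, FILLING))
            return &syscall_page[i];
    }
    return NULL;
}

/* Post a write request; no write happens until a worker pass runs. */
struct syscall_entry *post_write(int fd, const void *buf, size_t len)
{
    struct syscall_entry *entry = free_syscall_entry();
    if (entry == NULL)
        return NULL;
    entry->syscall = NR_WRITE;
    entry->num_args = 3;
    entry->args[0] = fd;
    entry->args[1] = (long)(intptr_t)buf;
    entry->args[2] = (long)len;
    atomic_store(&entry->status, SUBMIT);  /* publish the entry last */
    return entry;
}

/* Stand-in for a syscall thread: execute every submitted entry once. */
int syscall_thread_pass(void)
{
    int executed = 0;
    for (size_t i = 0; i < PAGE_ENTRIES; i++) {
        struct syscall_entry *e = &syscall_page[i];
        if (atomic_load(&e->status) != SUBMIT)
            continue;
        if (e->syscall == NR_WRITE)
            e->return_code = (long)write((int)e->args[0],
                                         (const void *)(intptr_t)e->args[1],
                                         (size_t)e->args[2]);
        else
            e->return_code = -1;  /* unsupported in this sketch */
        atomic_store(&e->status, DONE);
        executed++;
    }
    return executed;
}

/* Non-blocking completion check; releases the entry once it is DONE. */
int collect_result(struct syscall_entry *entry, long *result)
{
    if (atomic_load(&entry->status) != DONE)
        return 0;
    *result = entry->return_code;
    atomic_store(&entry->status, FREE);
    return 1;
}
```

A caller would use `post_write()` in place of `write()`, keep doing other work while `collect_result()` returns 0, and read the return value once it returns 1.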
13
Syscall threads
➔Kernel-only threads
➔ Part of application process
➔Execute requests from syscall page
➔Schedulable on a per-core basis
14
System call batching
➔ Request as many system calls as possible
➔ Switch to kernel mode
➔ Start executing all posted system calls
Avoids direct and indirect costs, even on a single core
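The effect of batching on mode switches can be illustrated with a small model in C. It is a sketch under stated assumptions, not FlexSC code: `struct batch`, `batch_post`, `batch_flush`, `add_one`, and the capacity of four are all hypothetical, and a handler function stands in for the kernel-side work. The counter `kernel_entries` records how often the simulated program would have switched to kernel mode.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_CAPACITY 4  /* assumed; chosen small for illustration */

/* Stand-in for the kernel-side work of one system call. */
typedef long (*handler_fn)(long arg);

struct request {
    handler_fn handler;
    long arg;
    long *result;  /* where the caller wants the return value stored */
};

struct batch {
    struct request slots[BATCH_CAPACITY];
    size_t count;
    unsigned long kernel_entries;  /* simulated user-to-kernel switches */
};

/* One simulated switch to kernel mode, then run every posted request. */
void batch_flush(struct batch *b)
{
    if (b->count == 0)
        return;
    b->kernel_entries++;
    for (size_t i = 0; i < b->count; i++)
        *b->slots[i].result = b->slots[i].handler(b->slots[i].arg);
    b->count = 0;
}

/* Post a request; the batch is flushed only when it is already full. */
void batch_post(struct batch *b, handler_fn handler, long arg, long *result)
{
    if (b->count == BATCH_CAPACITY)
        batch_flush(b);
    b->slots[b->count++] = (struct request){ handler, arg, result };
}

/* Sample handler used only to exercise the model. */
long add_one(long x)
{
    return x + 1;
}
```

Posting ten requests with this capacity and then flushing once takes three simulated kernel entries, where ten synchronous calls would take ten.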
15
Dynamic multicore specialization
➔ FlexSC makes specializing cores simple
➔ Dynamically adapts to workload needs
16
What programs can benefit from FlexSC?
Event-driven servers (e.g., memcached, nginx webserver)
➔ Use asynchronous calls, similar to FlexSC
➔ Can use FlexSC directly
➔ Mix sync and exception-less system calls
Multi-threaded servers: FlexSC-Threads
➔ Thread library, compatible with Pthreads
➔ No changes to app. code or recompilation required
➔ Transparently converts legacy syscalls into exception-less ones
17
FlexSC-Threads library
➔ Hybrid (M-on-N) threading model
➔ One kernel-visible thread per core
➔ Many user threads per kernel-visible thread
➔ Redirects system calls (libc wrappers)
➔ Posts exception-less syscall to syscall page
➔ Switches to other user-level thread
➔ Resumes thread upon syscall completion
Benefits of exception-less syscalls while maintaining sequential syscall interface
18
FlexSC-Threads in action
[Diagram: User]
19
FlexSC-Threads in action
On a syscall:
➔ Post request to system call page
➔ Block user-level thread
20
FlexSC-Threads in action
[Diagram: Kernel]
On a syscall:
➔ Post request to system call page
➔ Block user-level thread
➔ Switch to next ready thread
21
FlexSC-Threads in action
[Diagram: User and Kernel]
If all user-level threads become blocked:
1) enter kernel
2) wait for completion of at least 1 syscall
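The scheduling decisions described on the last few slides can be sketched as a small state machine in C. This models only the bookkeeping, with no real context switching or kernel interaction, and it is not the FlexSC-Threads implementation. The type and function names, the round-robin policy, and the limit of eight user-level threads are assumptions made for illustration. A return value of -1 from `sched_pick_next()` marks the point where the library would enter the kernel and wait for at least one completion.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UTHREADS 8  /* assumed limit for this sketch */

enum uthread_state { UT_UNUSED, UT_READY, UT_BLOCKED };

struct scheduler {
    enum uthread_state state[MAX_UTHREADS];
    size_t nthreads;
    size_t last;                 /* thread that ran most recently */
    unsigned long kernel_waits;  /* times every thread was blocked */
};

void sched_init(struct scheduler *s, size_t nthreads)
{
    assert(nthreads > 0 && nthreads <= MAX_UTHREADS);
    for (size_t i = 0; i < MAX_UTHREADS; i++)
        s->state[i] = i < nthreads ? UT_READY : UT_UNUSED;
    s->nthreads = nthreads;
    s->last = nthreads - 1;  /* so the first pick is thread 0 */
    s->kernel_waits = 0;
}

/* The thread posted a request to the syscall page: block it. */
void sched_block_on_syscall(struct scheduler *s, int tid)
{
    assert(tid >= 0 && (size_t)tid < s->nthreads);
    s->state[tid] = UT_BLOCKED;
}

/* The thread's entry reached DONE: make it runnable again. */
void sched_syscall_done(struct scheduler *s, int tid)
{
    assert(tid >= 0 && (size_t)tid < s->nthreads);
    if (s->state[tid] == UT_BLOCKED)
        s->state[tid] = UT_READY;
}

/* Round-robin pick of the next ready thread. Returns -1 when all
 * threads are blocked, which is where the real library would enter
 * the kernel and wait for at least one syscall to complete. */
int sched_pick_next(struct scheduler *s)
{
    for (size_t step = 1; step <= s->nthreads; step++) {
        size_t tid = (s->last + step) % s->nthreads;
        if (s->state[tid] == UT_READY) {
            s->last = tid;
            return (int)tid;
        }
    }
    s->kernel_waits++;
    return -1;
}
```

With two user-level threads, blocking both in turn makes the next pick return -1; marking either syscall as done makes that thread runnable again on the following pick.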
22
Evaluation
➔Linux 2.6.33
➔Nehalem (Core i7) server, 2.3GHz
➔ 4 cores on a chip
➔Clients connected on 1 Gbps network
➔Workloads
➔ Sysbench on MySQL (80% user, 20% kernel)
➔ ApacheBench on Apache (50% user, 50% kernel)
➔Default Linux NPTL (“sync”) vs. FlexSC-Threads (“flexsc”)
23
Sysbench: “OLTP” on MySQL (1 core)
[Figure: throughput (requests/sec.) versus request concurrency; series: flexsc, sync]
15% improvement
24
Sysbench: “OLTP” on MySQL (4 cores)
[Figure: throughput (requests/sec.) versus request concurrency; series: flexsc, sync]
40% improvement
25
MySQL latency per client request
[Figure: latency (ms) with 256 connections for sync and flexsc on 1, 2, and 4 cores; bars: 95th percentile and average; one off-scale bar is labeled 1900]
Up to 30% reduction of average request latencies
26
MySQL processor metrics
[Figure: SysBench (4 cores), relative performance (flexsc/sync) for User and Kernel; metrics: IPC, L3, L2, d-cache, i-cache, TLB, Branch]
Performance improvements are a consequence of more efficient processor execution
27
ApacheBench throughput (1 core)
[Figure: throughput (requests/sec.) versus request concurrency; series: flexsc, sync]
80-90% improvement
28
ApacheBench throughput (4 cores)
[Figure: throughput (requests/sec.) versus request concurrency; series: flexsc, sync]
115% improvement
29
Apache latency per client request
[Figure: latency (ms) with 256 concurrent requests for sync and flexsc on 1, 2, and 4 cores; bars: 99th percentile and average; one off-scale bar is labeled 238]
Up to 50% reduction of average request latencies
30
Apache processor metrics
[Figure: Apache (1 core), relative performance (flexsc/sync) for User and Kernel; metrics: IPC, L3, L2, d-cache, i-cache, TLB, Branch]
Processor efficiency doubles for kernel and user-mode execution
31
Discussion
➔New OS architecture not necessary
➔ Exception-less syscalls can coexist with legacy ones
➔Foundation for non-blocking system calls
➔ select() / poll() in user-space
➔ Interesting case of non-blocking free()
➔Multicore ultra-specialization
➔ TCP Servers (Rutgers; Iftode et al.), FS Servers
➔Single-ISA asymmetric cores
➔ OS-friendly cores (HP Labs; Mogul et al.)
32
Concluding Remarks
➔ System calls degrade server performance
➔ Processor pollution is inherent to synchronous system calls
➔ Exception-less syscalls
➔ Flexible and efficient system call execution
➔ FlexSC-Threads
➔ Leverages exception-less syscalls
➔ No modifications to multi-threaded applications
➔ Throughput & latency gains
➔ 2x throughput improvement for Apache and BIND
➔ 1.4x throughput improvement for MySQL