
FlexSC: Flexible System Call Scheduling with Exception-Less System Calls (Livio Soares and Michael Stumm, University of Toronto)



  1. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. Livio Soares and Michael Stumm, University of Toronto

  2. Motivation: The synchronous system call interface is a legacy from the single core era. Expensive! Costs are: ➔ direct: mode-switch ➔ indirect: processor structure pollution. FlexSC implements efficient and flexible system calls for the multicore era

  3. FlexSC overview: Two contributions: FlexSC and FlexSC-Threads. Results in: 1) MySQL throughput increase of up to 40% and latency reduction of 30% 2) Apache throughput increase of up to 115% and latency reduction of 50%

  4. Performance impact of synchronous syscalls ➔ Xalan from SPEC CPU 2006 ➔ Virtually no time in the OS ➔ Linux on Intel Core i7 (Nehalem) ➔ Injected exceptions with varying frequencies ➔ Direct: emulate null system call ➔ Indirect: emulate “write()” system call ➔ Measured only user-mode time ➔ Kernel time ignored. Ideally, user-mode performance is unaltered

  5. Degradation due to sync. syscalls [Figure: Xalan (SPEC CPU 2006); degradation (lower is faster), 0% to 70%, versus user-mode instructions between exceptions (log scale, 1K to 100K); series: Direct, Indirect; markers for MySQL and Apache] System calls can halve processor efficiency; indirect cause is major contributor

  6. Processor state pollution ➔ Key source of performance impact ➔ On a Linux write() call: ➔ up to 2/3 of the L1 data cache and data TLB are evicted ➔ Kernel performance equally affected ➔ Processor efficiency for OS code is also cut in half

  7. Synchronous system calls are expensive: Traditional system calls are synchronous and use exceptions to cross domains [Diagram labels: User, Kernel]

  8. Alternative: side-step the boundary. Exception-less syscalls remove synchronicity by decoupling invocation from execution [Diagram labels: User, Kernel]

  9. Benefits of exception-less system calls ➔ Significantly reduce direct costs ➔ Fewer mode switches ➔ Allow for batching ➔ Reduce indirect costs ➔ Allow for dynamic multicore specialization ➔ Further reduce direct and indirect costs [Diagram labels: User, Kernel]

  10. Exception-less interface: syscall page. The synchronous call write(fd, buf, 4096); becomes: entry = free_syscall_entry(); /* write syscall */ entry->syscall = 1; entry->num_args = 3; entry->args[0] = fd; entry->args[1] = buf; entry->args[2] = 4096; entry->status = SUBMIT; while (entry->status != DONE) do_something_else(); return entry->return_code;

  11. Exception-less interface: syscall page (same code as slide 10; this build step shows the syscall page entry with status SUBMIT)

  12. Exception-less interface: syscall page (same code as slide 10; this build step shows the syscall page entry with status DONE)
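The syscall page protocol on slides 10 to 12 can be sketched in plain C. This is a minimal single-threaded model, not the FlexSC implementation: the field names `syscall`, `num_args`, `args`, `status`, and `return_code` come from the slide, while the `FREE` state, the page size, and the helpers `post_syscall`, `run_syscall_thread`, and `reap_syscall` are hypothetical names introduced for illustration. The stand-in "kernel" performs no I/O; for syscall number 1 it simply reports the requested byte count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Entry states. SUBMIT and DONE appear on the slide; FREE is an assumption. */
enum entry_status { FREE, SUBMIT, DONE };

#define MAX_ARGS 6      /* assumed upper bound on arguments */
#define NUM_ENTRIES 64  /* assumed number of entries per syscall page */

struct syscall_entry {
    int syscall;          /* syscall number (1 is write on x86-64 Linux) */
    int num_args;
    long args[MAX_ARGS];
    long return_code;
    int status;           /* a real shared page would need atomic accesses */
};

static struct syscall_entry syscall_page[NUM_ENTRIES];

/* Return a FREE entry, or NULL when the page is full. */
static struct syscall_entry *free_syscall_entry(void) {
    for (size_t i = 0; i < NUM_ENTRIES; i++)
        if (syscall_page[i].status == FREE)
            return &syscall_page[i];
    return NULL;
}

/* User side: fill in an entry and mark it SUBMIT. No mode switch occurs. */
static struct syscall_entry *post_syscall(int nr, int num_args, const long *args) {
    struct syscall_entry *e = free_syscall_entry();
    if (e == NULL || num_args < 0 || num_args > MAX_ARGS)
        return NULL;
    e->syscall = nr;
    e->num_args = num_args;
    memcpy(e->args, args, (size_t)num_args * sizeof(long));
    e->status = SUBMIT;   /* written last, after the arguments */
    return e;
}

/* Stand-in for a syscall thread: executes every SUBMIT entry on the page.
 * Returns the number of entries it completed. */
static int run_syscall_thread(void) {
    int executed = 0;
    for (size_t i = 0; i < NUM_ENTRIES; i++) {
        struct syscall_entry *e = &syscall_page[i];
        if (e->status != SUBMIT)
            continue;
        /* Fake "write": report args[2] bytes written; anything else fails. */
        e->return_code = (e->syscall == 1) ? e->args[2] : -1;
        e->status = DONE;
        executed++;
    }
    return executed;
}

/* User side: collect the result of a DONE entry and recycle it. */
static long reap_syscall(struct syscall_entry *e) {
    assert(e->status == DONE);
    long rc = e->return_code;
    e->status = FREE;
    return rc;
}
```

In this model the caller would invoke `post_syscall`, do other work, and call `reap_syscall` once the entry's status reads `DONE`, which mirrors the `while (entry->status != DONE) do_something_else();` loop on the slide.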

  13. Syscall threads ➔ Kernel-only threads ➔ Part of application process ➔ Execute requests from syscall page ➔ Schedulable on a per-core basis

  14. System call batching ➔ Request as many system calls as possible ➔ Switch to kernel-mode ➔ Start executing all posted system calls. Avoids direct and indirect costs, even on a single core
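The effect of batching on the number of mode switches can be illustrated with a small counting model. This sketch only counts switches; it does not measure real cost, and every name in it (`struct page`, `sync_call`, `post`, `flush`, `fake_syscall`) is hypothetical. The "syscall" is a placeholder that returns its argument plus one.

```c
#include <assert.h>

#define PAGE_ENTRIES 16   /* assumed page capacity */

enum status { FREE, SUBMIT, DONE };

struct entry {
    enum status status;
    long arg;
    long result;
};

struct page {
    struct entry e[PAGE_ENTRIES];
    int mode_switches;    /* how many times we "entered the kernel" */
};

/* Placeholder for real kernel work. */
static long fake_syscall(long arg) { return arg + 1; }

/* Synchronous model: every call pays one mode switch. */
static long sync_call(struct page *p, long arg) {
    p->mode_switches++;
    return fake_syscall(arg);
}

/* Exception-less model: posting touches only the shared page.
 * Returns the slot index, or -1 when the page is full. */
static int post(struct page *p, long arg) {
    for (int i = 0; i < PAGE_ENTRIES; i++) {
        if (p->e[i].status == FREE) {
            p->e[i].arg = arg;
            p->e[i].status = SUBMIT;
            return i;
        }
    }
    return -1;
}

/* One switch to kernel mode executes all posted entries.
 * Returns the number of entries executed. */
static int flush(struct page *p) {
    int executed = 0;
    p->mode_switches++;
    for (int i = 0; i < PAGE_ENTRIES; i++) {
        if (p->e[i].status == SUBMIT) {
            p->e[i].result = fake_syscall(p->e[i].arg);
            p->e[i].status = DONE;
            executed++;
        }
    }
    return executed;
}
```

With eight requests, the synchronous path in this model records eight mode switches, while posting all eight and flushing once records a single switch.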

  15. Dynamic multicore specialization ➔ FlexSC makes specializing cores simple ➔ Dynamically adapts to workload needs

  16. What programs can benefit from FlexSC? Event-driven servers (e.g., memcached, nginx webserver) ➔ Use asynchronous calls, similar to FlexSC ➔ Can use FlexSC directly ➔ Mix sync and exception-less system calls. Multi-threaded servers: FlexSC-Threads ➔ Thread library, compatible with Pthreads ➔ No changes to app. code or recompilation required ➔ Transparently converts legacy syscalls into exception-less ones

  17. FlexSC-Threads library ➔ Hybrid (M-on-N) threading model ➔ One kernel-visible thread per core ➔ Many user threads per kernel-visible thread ➔ Redirects system calls (libc wrappers) ➔ Posts exception-less syscall to syscall page ➔ Switches to other user-level thread ➔ Resumes thread upon syscall completion. Benefits of exception-less syscalls while maintaining sequential syscall interface

  18. FlexSC-Threads in action [Diagram label: User]

  19. FlexSC-Threads in action. On a syscall: ➔ Post request to system call page ➔ Block user-level thread

  20. FlexSC-Threads in action. On a syscall: ➔ Post request to system call page ➔ Block user-level thread ➔ Switch to next ready thread [Diagram label: Kernel]

  21. FlexSC-Threads in action. If all user-level threads become blocked: 1) enter kernel 2) wait for completion of at least 1 syscall [Diagram labels: User, Kernel]
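The scheduling rule walked through on slides 18 to 21 (post, block, switch to the next ready thread, and enter the kernel only when every user-level thread is blocked) can be modeled as a small state machine. This is a sketch under simplifying assumptions, not the FlexSC-Threads library: user threads are plain structs rather than real contexts, each thread issues a fixed number of syscalls, and the stand-in kernel completes every pending request on each entry (the slide only requires at least one). All names here are hypothetical.

```c
#include <assert.h>

#define NUM_UTHREADS 4   /* assumed number of user-level threads */

enum ustate { READY, BLOCKED, FINISHED };

struct uthread {
    enum ustate state;
    int calls_left;      /* syscalls this thread still wants to issue */
};

struct sched {
    struct uthread t[NUM_UTHREADS];
    int kernel_entries;  /* times the kernel-visible thread entered the kernel */
    int completed;       /* syscalls completed so far */
};

static void init_sched(struct sched *s, int calls_per_thread) {
    for (int i = 0; i < NUM_UTHREADS; i++) {
        s->t[i].state = READY;
        s->t[i].calls_left = calls_per_thread;
    }
    s->kernel_entries = 0;
    s->completed = 0;
}

/* Enter the kernel and wait. In this model every pending request completes,
 * so each blocked thread becomes READY again. Returns completions seen. */
static int wait_in_kernel(struct sched *s) {
    int done = 0;
    s->kernel_entries++;
    for (int i = 0; i < NUM_UTHREADS; i++) {
        if (s->t[i].state == BLOCKED) {
            s->t[i].state = READY;
            s->completed++;
            done++;
        }
    }
    return done;
}

/* Run until every user-level thread has finished. */
static void run_all(struct sched *s) {
    for (;;) {
        int blocked = 0;
        /* Run each ready thread until it posts a syscall, then switch. */
        for (int i = 0; i < NUM_UTHREADS; i++) {
            struct uthread *u = &s->t[i];
            if (u->state != READY)
                continue;
            if (u->calls_left == 0) {
                u->state = FINISHED;
                continue;
            }
            u->calls_left--;       /* post request to the syscall page */
            u->state = BLOCKED;    /* block; the loop moves to the next thread */
        }
        for (int i = 0; i < NUM_UTHREADS; i++)
            if (s->t[i].state == BLOCKED)
                blocked++;
        if (blocked == 0)
            return;                /* nothing pending: all threads finished */
        /* All remaining threads are blocked: only now enter the kernel. */
        int done = wait_in_kernel(s);
        assert(done >= 1);
        (void)done;
    }
}
```

With four threads issuing three syscalls each, this model completes twelve syscalls with three kernel entries, whereas a synchronous interface would cross the boundary once per call.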

  22. Evaluation ➔ Linux 2.6.33 ➔ Nehalem (Core i7) server, 2.3GHz ➔ 4 cores on a chip ➔ Clients connected on 1 Gbps network ➔ Workloads ➔ Sysbench on MySQL (80% user, 20% kernel) ➔ ApacheBench on Apache (50% user, 50% kernel) ➔ Default Linux NPTL (“sync”) vs. FlexSC-Threads (“flexsc”)

  23. Sysbench: “OLTP” on MySQL (1 core) [Figure: throughput (requests/sec.) versus request concurrency (0 to 300); series: flexsc, sync] 15% improvement

  24. Sysbench: “OLTP” on MySQL (4 cores) [Figure: throughput (requests/sec.) versus request concurrency (0 to 300); series: flexsc, sync] 40% improvement

  25. MySQL latency per client request, 256 connections [Figure: latency (ms), average and 95th percentile, for sync and flexsc on 1, 2, and 4 cores] Up to 30% reduction of average request latencies

  26. MySQL processor metrics, SysBench (4 cores) [Figure: relative performance (flexsc/sync) for user and kernel execution; metrics: IPC, L3, L2, d-cache, i-cache, TLB, Branch] Performance improvements are a consequence of more efficient processor execution

  27. ApacheBench throughput (1 core) [Figure: throughput (requests/sec.) versus request concurrency (0 to 1000); series: flexsc, sync] 80-90% improvement

  28. ApacheBench throughput (4 cores) [Figure: throughput (requests/sec.) versus request concurrency (0 to 1000); series: flexsc, sync] 115% improvement

  29. Apache latency per client request, 256 concurrent requests [Figure: latency (ms), average and 99th percentile, for sync and flexsc on 1, 2, and 4 cores; one off-scale bar is labeled 238] Up to 50% reduction of average request latencies

  30. Apache processor metrics, Apache (1 core) [Figure: relative performance (flexsc/sync) for user and kernel execution; metrics: IPC, L3, L2, d-cache, i-cache, TLB, Branch] Processor efficiency doubles for kernel and user-mode execution

  31. Discussion ➔ New OS architecture not necessary ➔ Exception-less syscalls can coexist with legacy ones ➔ Foundation for non-blocking system calls ➔ select() / poll() in user-space ➔ Interesting case of non-blocking free() ➔ Multicore ultra-specialization ➔ TCP Servers (Rutgers; Iftode et al.), FS Servers ➔ Single-ISA asymmetric cores ➔ OS-friendly cores (HP Labs; Mogul et al.)

  32. Concluding Remarks ➔ System calls degrade server performance ➔ Processor pollution is inherent to synchronous system calls ➔ Exception-less syscalls ➔ Flexible and efficient system call execution ➔ FlexSC-Threads ➔ Leverages exception-less syscalls ➔ No modifications to multi-threaded applications ➔ Throughput & latency gains ➔ 2x throughput improvement for Apache and BIND ➔ 1.4x throughput improvement for MySQL

  33. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. Livio Soares and Michael Stumm, University of Toronto
