FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

Livio Soares and Michael Stumm, University of Toronto

2 Motivation
The synchronous system call interface is a legacy from the single core era.
Expensive! Costs are:
➔ direct: mode-switch
➔ indirect: processor structure pollution
FlexSC implements efficient and flexible system calls for the multicore era.

3 FlexSC overview
Two contributions: FlexSC and FlexSC-Threads
Results in:
1) MySQL throughput increase of up to 40% and latency reduction of 30%
2) Apache throughput increase of up to 115% and latency reduction of 50%

4 Performance impact of synchronous syscalls
➔ Xalan from SPEC CPU 2006
➔ Virtually no time in the OS
➔ Linux on Intel Core i7 (Nehalem)
➔ Injected exceptions with varying frequencies
➔ Direct: emulate null system call
➔ Indirect: emulate “write()” system call
➔ Measured only user-mode time
➔ Kernel time ignored
Ideally, user-mode performance is unaltered.

5 Degradation due to sync. syscalls
[Figure: Xalan (SPEC CPU 2006). Y-axis: degradation (lower is faster), 0% to 70%. X-axis: user-mode instructions between exceptions (log scale), 1K to 100K. Series: Direct, Indirect. Annotations: MySQL, Apache.]
System calls can halve processor efficiency; the indirect cause is the major contributor.

6 Processor state pollution
➔ Key source of performance impact
➔ On a Linux write() call:
➔ up to 2/3 of the L1 data cache and data TLB are evicted
➔ Kernel performance equally affected
➔ Processor efficiency for OS code is also cut in half

7 Synchronous system calls are expensive
[Diagram: User / Kernel]
Traditional system calls are synchronous and use exceptions to cross domains.

8 Alternative: side-step the boundary
[Diagram: User / Kernel, exception-less syscalls]
Exception-less syscalls remove synchronicity by decoupling invocation from execution.

9 Benefits of exception-less system calls
[Diagram: User / Kernel]
➔ Significantly reduce direct costs
➔ Fewer mode switches
➔ Allow for batching
➔ Reduce indirect costs
➔ Allow for dynamic multicore specialization
➔ Further reduce direct and indirect costs

10-12 Exception-less interface: syscall page
The synchronous call

    write(fd, buf, 4096);

becomes:

    entry = free_syscall_entry();
    /* write syscall */
    entry->syscall = 1;
    entry->num_args = 3;
    entry->args[0] = fd;
    entry->args[1] = buf;
    entry->args[2] = 4096;
    entry->status = SUBMIT;
    while (entry->status != DONE)
        do_something_else();
    return entry->return_code;

[Animation: on slide 11 the entry in the syscall page is marked SUBMIT; on slide 12 it is marked DONE.]
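The slide's pseudocode can be fleshed out into a small user-side sketch in C. Only the field names (syscall, num_args, args, status, return_code), the SUBMIT and DONE states and free_syscall_entry() come from the slide. The FREE state, the 64-entry page, the use of C11 atomics and the helpers post_write(), entry_done() and collect() are assumptions made for illustration. The sketch shows how a request could be published and polled; it does not talk to a real kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* SUBMIT and DONE are named on the slide; FREE is an assumption. */
enum { ENTRY_FREE = 0, ENTRY_SUBMIT, ENTRY_DONE };

#define MAX_ARGS 6        /* assumed upper bound on arguments */
#define PAGE_ENTRIES 64   /* assumed number of slots per page */

/* One slot of the shared syscall page (layout is hypothetical). */
struct syscall_entry {
    int syscall;          /* system call number */
    int num_args;
    long args[MAX_ARGS];
    atomic_int status;    /* ENTRY_FREE, ENTRY_SUBMIT or ENTRY_DONE */
    long return_code;
};

struct syscall_page {
    struct syscall_entry entries[PAGE_ENTRIES];
};

/* Return the first free slot, or NULL if the page is full.
 * Assumes a single posting thread per page. */
static struct syscall_entry *free_syscall_entry(struct syscall_page *page)
{
    for (size_t i = 0; i < PAGE_ENTRIES; i++)
        if (atomic_load(&page->entries[i].status) == ENTRY_FREE)
            return &page->entries[i];
    return NULL;
}

/* Fill in a write request and publish it. No mode switch happens here. */
static struct syscall_entry *post_write(struct syscall_page *page,
                                        int fd, const void *buf, size_t len)
{
    struct syscall_entry *e = free_syscall_entry(page);
    if (e == NULL)
        return NULL;
    e->syscall = 1;       /* write, numbered as on the slide */
    e->num_args = 3;
    e->args[0] = fd;
    e->args[1] = (long)(intptr_t)buf;
    e->args[2] = (long)len;
    /* Release ordering makes the arguments visible before the status. */
    atomic_store_explicit(&e->status, ENTRY_SUBMIT, memory_order_release);
    return e;
}

/* Non-blocking completion check. */
static int entry_done(struct syscall_entry *e)
{
    return atomic_load_explicit(&e->status, memory_order_acquire) == ENTRY_DONE;
}

/* Read the result and recycle the slot; call only after entry_done(). */
static long collect(struct syscall_entry *e)
{
    long rc = e->return_code;
    atomic_store_explicit(&e->status, ENTRY_FREE, memory_order_release);
    return rc;
}
```

A caller would loop on entry_done() while doing other work, as the slide's do_something_else() loop suggests, and then call collect() to obtain the return code.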

13 Syscall threads
➔ Kernel-only threads
➔ Part of application process
➔ Execute requests from syscall page
➔ Schedulable on a per-core basis
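What "execute requests from syscall page" could look like is sketched below as a single scan pass in user-space C. This is not the FlexSC kernel code. The entry layout, the call numbers DEMO_SYS_WRITE and DEMO_SYS_GETPID, and the functions dispatch() and syscall_thread_scan() are hypothetical, and ordinary POSIX calls stand in for in-kernel system call handlers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

enum { ENTRY_FREE = 0, ENTRY_SUBMIT, ENTRY_DONE };

#define PAGE_ENTRIES 8    /* assumed page size for the sketch */

/* Hypothetical entry layout, mirroring the fields on the slides. */
struct syscall_entry {
    int syscall;
    int num_args;
    long args[6];
    atomic_int status;
    long return_code;
};

/* Call numbers used by this sketch only (not real Linux numbers). */
enum { DEMO_SYS_WRITE = 1, DEMO_SYS_GETPID = 2 };

/* Stand-in for the kernel's system call table. */
static long dispatch(const struct syscall_entry *e)
{
    switch (e->syscall) {
    case DEMO_SYS_WRITE:
        return write((int)e->args[0],
                     (const void *)(intptr_t)e->args[1],
                     (size_t)e->args[2]);
    case DEMO_SYS_GETPID:
        return getpid();
    default:
        return -1;        /* unknown call number */
    }
}

/* One pass of a syscall thread over its page: run every submitted
 * entry, store the result, then mark it done. Returns the number of
 * entries executed in this pass. */
static int syscall_thread_scan(struct syscall_entry *page, size_t n)
{
    int executed = 0;
    for (size_t i = 0; i < n; i++) {
        struct syscall_entry *e = &page[i];
        if (atomic_load_explicit(&e->status, memory_order_acquire)
                != ENTRY_SUBMIT)
            continue;
        e->return_code = dispatch(e);
        /* Publish the return code before the status change. */
        atomic_store_explicit(&e->status, ENTRY_DONE, memory_order_release);
        executed++;
    }
    return executed;
}
```

A real syscall thread would repeat such a pass and sleep when the page is empty. Because each thread only reads its page, it can be pinned to whichever core the scheduler chooses.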

14 System call batching
➔ Request as many system calls as possible
➔ Switch to kernel-mode
➔ Start executing all posted system calls
Avoids direct and indirect costs, even on a single core.
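The effect of batching on the number of kernel entries can be shown with a toy counter model in C. It is a sketch under stated assumptions, not a measurement: struct cpu_sim and all function names are hypothetical, each synchronous call is counted as one kernel entry, and a batch of any size is counted as one entry.

```c
#include <assert.h>

/* Toy model: counts kernel entries, not real cycles. */
struct cpu_sim {
    int kernel_entries;   /* user-to-kernel crossings */
    int executed;         /* system calls completed */
    int pending;          /* calls posted but not yet executed */
};

/* Synchronous call: one kernel entry per system call. */
static void sync_syscall(struct cpu_sim *c)
{
    c->kernel_entries++;
    c->executed++;
}

/* Exception-less call: posting costs no kernel entry. */
static void post_syscall(struct cpu_sim *c)
{
    c->pending++;
}

/* One kernel entry executes everything posted so far. */
static void enter_kernel_and_run_batch(struct cpu_sim *c)
{
    if (c->pending == 0)
        return;
    c->kernel_entries++;
    c->executed += c->pending;
    c->pending = 0;
}

/* Kernel entries needed for n synchronous calls. */
static int kernel_entries_sync(int n)
{
    struct cpu_sim c = {0, 0, 0};
    for (int i = 0; i < n; i++)
        sync_syscall(&c);
    return c.kernel_entries;
}

/* Kernel entries needed for n calls posted in batches of batch_size. */
static int kernel_entries_batched(int n, int batch_size)
{
    struct cpu_sim c = {0, 0, 0};
    for (int i = 0; i < n; i++) {
        post_syscall(&c);
        if (c.pending == batch_size)
            enter_kernel_and_run_batch(&c);
    }
    enter_kernel_and_run_batch(&c);   /* flush the tail; no-op if empty */
    return c.kernel_entries;
}
```

In this model, 100 synchronous calls cost 100 kernel entries, while the same 100 calls posted in batches of 32 cost 4. The model says nothing about indirect costs such as cache and TLB pollution, which the deck identifies as the larger factor.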

15 Dynamic multicore specialization
➔ FlexSC makes specializing cores simple
➔ Dynamically adapts to workload needs

16 What programs can benefit from FlexSC?
Event-driven servers (e.g., memcached, nginx webserver)
➔ Use asynchronous calls, similar to FlexSC
➔ Can use FlexSC directly
➔ Mix sync and exception-less system calls
Multi-threaded servers: FlexSC-Threads
➔ Thread library, compatible with Pthreads
➔ No changes to app. code or recompilation required
➔ Transparently converts legacy syscalls into exception-less ones

17 FlexSC-Threads library
➔ Hybrid (M-on-N) threading model
➔ One kernel-visible thread per core
➔ Many user threads per kernel-visible thread
➔ Redirects system calls (libc wrappers)
➔ Posts exception-less syscall to syscall page
➔ Switches to other user-level thread
➔ Resumes thread upon syscall completion
Benefits of exception-less syscalls while maintaining sequential syscall interface.

18-21 FlexSC-Threads in action
[Diagram: user-level threads above the User / Kernel boundary]
On a syscall:
➔ Post request to system call page
➔ Block user-level thread
➔ Switch to next ready thread
If all user-level threads become blocked:
1) enter kernel
2) wait for completion of at least 1 syscall
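The scheduling rule on these slides can be traced with a small state-machine simulation in C. It is a sketch, not the FlexSC-Threads implementation: there is no real context switching, every name (user_thread, sched, run_scheduler and so on) is hypothetical, and the wait step wakes every blocked thread although the slides only guarantee that at least one syscall has completed.

```c
#include <assert.h>

enum ut_state { UT_READY, UT_BLOCKED, UT_FINISHED };

/* A user-level thread reduced to "how many syscalls it still issues". */
struct user_thread {
    enum ut_state state;
    int syscalls_left;
};

struct sched {
    struct user_thread *threads;
    int n;
    int kernel_entries;   /* times the kernel-visible thread entered the kernel */
    int posted;           /* exception-less syscalls posted to the page */
};

/* Round-robin: next READY thread after `cur`, or -1 if none. */
static int next_ready(const struct sched *s, int cur)
{
    for (int i = 1; i <= s->n; i++) {
        int idx = (cur + i) % s->n;
        if (s->threads[idx].state == UT_READY)
            return idx;
    }
    return -1;
}

static int count_blocked(const struct sched *s)
{
    int blocked = 0;
    for (int i = 0; i < s->n; i++)
        if (s->threads[i].state == UT_BLOCKED)
            blocked++;
    return blocked;
}

/* Stand-in for "enter kernel, wait for completion". Simplification:
 * all posted calls complete, so every blocked thread becomes ready. */
static void wait_for_completion(struct sched *s)
{
    s->kernel_entries++;
    for (int i = 0; i < s->n; i++)
        if (s->threads[i].state == UT_BLOCKED)
            s->threads[i].state = UT_READY;
}

/* Run until every thread has finished; returns the kernel entry count. */
static int run_scheduler(struct sched *s)
{
    int cur = -1;
    for (;;) {
        int nxt = next_ready(s, cur);
        if (nxt < 0) {
            if (count_blocked(s) == 0)
                return s->kernel_entries;   /* everything finished */
            wait_for_completion(s);         /* all threads are blocked */
            continue;
        }
        cur = nxt;
        struct user_thread *t = &s->threads[cur];
        if (t->syscalls_left == 0) {
            t->state = UT_FINISHED;
            continue;
        }
        /* On a syscall: post the request, block, switch to the next thread. */
        t->syscalls_left--;
        s->posted++;
        t->state = UT_BLOCKED;
    }
}
```

With 4 user-level threads that each issue 3 syscalls, this simulation posts 12 requests and enters the kernel 3 times. With a single thread issuing 2 syscalls it enters the kernel twice, which matches the intuition that the benefit depends on having other ready threads to switch to.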

22 Evaluation
➔ Linux 2.6.33
➔ Nehalem (Core i7) server, 2.3GHz
➔ 4 cores on a chip
➔ Clients connected on 1 Gbps network
➔ Workloads
➔ Sysbench on MySQL (80% user, 20% kernel)
➔ ApacheBench on Apache (50% user, 50% kernel)
➔ Default Linux NPTL (“sync”) vs. FlexSC-Threads (“flexsc”)

23 Sysbench: “OLTP” on MySQL (1 core)
[Figure: throughput (requests/sec.), 0 to 500, vs. request concurrency, 0 to 300. Series: flexsc, sync.]
15% improvement

24 Sysbench: “OLTP” on MySQL (4 cores)
[Figure: throughput (requests/sec.), 0 to 1,000, vs. request concurrency, 0 to 300. Series: flexsc, sync.]
40% improvement

25 MySQL latency per client request
[Figure: latency (ms), average and 95th percentile, for sync and flexsc on 1, 2 and 4 cores; 256 connections. Y-axis 0 to 1,000, with one off-scale bar labeled 1900.]
Up to 30% reduction of average request latencies

26 MySQL processor metrics
[Figure: SysBench (4 cores). Relative performance (flexsc/sync), 0 to 1.4, for User and Kernel. Metrics: L3, d-cache, TLB, IPC, L2, i-cache, Branch.]
Performance improvements are a consequence of more efficient processor execution.

27 ApacheBench throughput (1 core)
[Figure: throughput (requests/sec.), 0 to 45,000, vs. request concurrency, 0 to 1000. Series: flexsc, sync.]
80-90% improvement

28 ApacheBench throughput (4 cores)
[Figure: throughput (requests/sec.), 0 to 45,000, vs. request concurrency, 0 to 1000. Series: flexsc, sync.]
115% improvement

29 Apache latency per client request
[Figure: latency (ms), average and 99th percentile, for sync and flexsc on 1, 2 and 4 cores; 256 concurrent requests. Y-axis 0 to 30, with one off-scale bar labeled 238.]
Up to 50% reduction of average request latencies

30 Apache processor metrics
[Figure: Apache (1 core). Relative performance (flexsc/sync), 0 to 2, for User and Kernel. Metrics: L3, d-cache, TLB, IPC, L2, i-cache, Branch.]
Processor efficiency doubles for kernel and user-mode execution.

31 Discussion
➔ New OS architecture not necessary
➔ Exception-less syscalls can coexist with legacy ones
➔ Foundation for non-blocking system calls
➔ select() / poll() in user-space
➔ Interesting case of non-blocking free()
➔ Multicore ultra-specialization
➔ TCP Servers (Rutgers; Iftode et al.), FS Servers
➔ Single-ISA asymmetric cores
➔ OS-friendly cores (HP Labs; Mogul et al.)

32 Concluding Remarks
➔ System calls degrade server performance
➔ Processor pollution is inherent to synchronous system calls
➔ Exception-less syscalls
➔ Flexible and efficient system call execution
➔ FlexSC-Threads
➔ Leverages exception-less syscalls
➔ No modifications to multi-threaded applications
➔ Throughput & latency gains
➔ 2x throughput improvement for Apache and BIND
➔ 1.4x throughput improvement for MySQL
