SLIDE 1

The Impact of Thread-Per-Core Architecture on Application Tail Latency

Pekka Enberg, Ashwin Rao, and Sasu Tarkoma
University of Helsinki
ANCS 2019

SLIDE 2

Introduction

  • Thread-per-core architecture has emerged to eliminate overheads in traditional multi-threaded architectures in server applications.
  • Partitioning of hardware resources can improve parallelism, but there are various trade-offs applications need to consider.
  • Takeaway: request steering and OS interfaces are holding back the thread-per-core architecture.

SLIDE 3

Outline

  • Overview of thread-per-core
  • A key-value store
  • Impact on tail latency
  • Problems in the approach
  • Future directions

SLIDE 4

Outline

  • Overview of thread-per-core
  • A key-value store
  • Impact on tail latency
  • Problems in the approach
  • Future directions

SLIDE 5

What is thread-per-core?

  • Thread-per-core = no multiplexing of a CPU core at the OS level.
  • Eliminates thread context switching overhead [Qin, 2018; Seastar].
  • Enables elimination of thread synchronization by partitioning [Seastar].
  • Eliminates thread scheduling delays [Ousterhout, 2019].


Ousterhout et al. 2019. Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads. NSDI '19.
Qin et al. 2018. Arachne: Core-Aware Thread Management. OSDI '18.
Seastar: a framework for high-performance server applications on modern hardware. http://seastar.io/
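As a rough sketch of the idea on this slide (not code from the talk), a thread-per-core runtime pins one worker thread to each CPU core with pthread_setaffinity_np so the OS scheduler never multiplexes application threads on a core; the worker body and thread cap below are placeholders.

    // Sketch: one pinned worker thread per CPU core (Linux, pthreads).
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg) {
        long core = (long)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        // Pin this thread to exactly one core so the OS never migrates it.
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        printf("worker pinned to core %ld\n", core);
        // ... the per-core event loop would run here (placeholder) ...
        return NULL;
    }

    int main(void) {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);
        if (ncores > 64) ncores = 64;
        pthread_t threads[64];
        for (long i = 0; i < ncores; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (long i = 0; i < ncores; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }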

SLIDE 6

Interrupt isolation for thread-per-core

  • The in-kernel network stack runs in kernel threads, which interfere with application threads.
  • Network stack processing must be isolated to CPU cores not running application threads.
  • Interrupt isolation can be done with IRQ affinity and IRQ balancing configuration changes (see the sketch below).
  • NIC receive-side scaling (RSS) configuration needs to align with the IRQ affinity configuration.


Li et al. 2014. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. SOCC ‘14
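For illustration only (not the authors' exact configuration), IRQ affinity on Linux can be pinned by writing a CPU bitmask to /proc/irq/<n>/smp_affinity; the IRQ number 42 and the CPU0 mask below are placeholders.

    // Sketch: pin a NIC RX queue interrupt to CPU0 by writing its affinity mask.
    // The IRQ number (42) and the mask ("1" = CPU0 only) are placeholders.
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/irq/42/smp_affinity", "w");
        if (!f) { perror("smp_affinity"); return 1; }
        fputs("1\n", f);   // hex CPU bitmask: bit 0 set, so this IRQ fires only on CPU0
        fclose(f);
        return 0;
    }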

SLIDE 7

Partitioning in thread-per-core

  • Partitioning of hardware resources (such as the NIC and DRAM) can improve parallelism by eliminating thread synchronization.
  • Different ways of partitioning resources: shared-everything, shared-nothing, and shared-something.

SLIDE 8

Shared-everything

[Diagram: CPU0-CPU3 all accessing a single Data region in DRAM]

SLIDE 9

Shared-everything

[Diagram: CPU0-CPU3 all accessing a single Data region in DRAM]

Hardware resources are shared between all CPU cores.

SLIDE 10

Shared-everything

[Diagram: CPU0-CPU3 all accessing a single Data region in DRAM]

Every request can be processed on any CPU core.

SLIDE 11

Shared-everything

[Diagram: CPU0-CPU3 all accessing a single Data region in DRAM]

Data access must be synchronized.

SLIDE 12

Shared-everything

  • Advantages:
  • Every request can be processed on any CPU core.
  • No request steering needed.
  • Disadvantages:
  • Shared-memory scales badly on multicore [Holland, 2011]
  • Examples:
  • Memcached (when thread pool size equals CPU core count)


Holland et al. 2011. Multicore OSes: Looking Forward from 1991, Er, 2011. HotOS ‘11
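To make the synchronization cost concrete, here is a deliberately simplified sketch (not Memcached's actual code) of the shared-everything access pattern: any thread may serve any request, but every write to the shared table takes a global lock.

    // Sketch: shared-everything access pattern; one table, one lock, any thread.
    #include <pthread.h>
    #include <string.h>

    #define NBUCKETS 1024

    struct entry { char key[32]; char value[128]; int used; };

    static struct entry table[NBUCKETS];
    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

    // Called from any worker thread: the lock is required because the same
    // buckets are visible to every CPU core.
    void put(const char *key, const char *value) {
        unsigned h = 0;
        for (const char *p = key; *p; p++)
            h = h * 31 + (unsigned char)*p;
        pthread_mutex_lock(&table_lock);
        struct entry *e = &table[h % NBUCKETS];
        strncpy(e->key, key, sizeof(e->key) - 1);
        strncpy(e->value, value, sizeof(e->value) - 1);
        e->used = 1;
        pthread_mutex_unlock(&table_lock);
    }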

SLIDE 13

Shared-nothing

[Diagram: CPU0-CPU3, each accessing its own Data partition in DRAM]

SLIDE 14

Shared-nothing

[Diagram: CPU0-CPU3, each accessing its own Data partition in DRAM]

Hardware resources are partitioned between CPU cores.

SLIDE 15

Shared-nothing

[Diagram: CPU0-CPU3, each accessing its own Data partition in DRAM]

Each request can be processed only on one specific CPU core.

SLIDE 16

Shared-nothing

[Diagram: CPU0-CPU3, each accessing its own Data partition in DRAM]

Data access does not require synchronization.

SLIDE 17

Shared-nothing

[Diagram: CPU0-CPU3, each accessing its own Data partition in DRAM]

Requests need to be steered.

SLIDE 18

Shared-nothing

  • Advantages:
  • Data access does not require synchronization.
  • Disadvantages:
  • Request steering is needed [Lim, 2014; Didona, 2019] (see the steering sketch below).
  • CPU utilization imbalance if data is not distributed well (“hot partition”)
  • Sensitive to skewed workloads
  • Examples:
  • Seastar framework and MICA key-value store


Didona et al. 2019. Size-aware Sharding for Improving Tail Latencies in In-memory Key-value Stores. NSDI '19.
Lim et al. 2014. MICA: A Holistic Approach to Fast In-memory Key-value Storage. NSDI '14.
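A hedged sketch of the steering decision in a shared-nothing design (not the code of Seastar, MICA, or the authors' store): the key hash selects the owning core, and requests for other cores are forwarded to that core's queue instead of touching its data. The ncores variable and the helper functions are hypothetical stand-ins for the real runtime.

    // Sketch: hash-based request steering in a shared-nothing key-value store.
    extern int ncores;                                       /* number of partitions/cores (assumed) */
    extern int current_core(void);                           /* core this thread is pinned to (assumed) */
    extern void forward_to_core(int core, const char *key);  /* hypothetical helper */
    extern void handle_locally(const char *key);             /* hypothetical helper */

    static unsigned hash_key(const char *key) {
        unsigned h = 2166136261u;              /* FNV-1a */
        while (*key) { h ^= (unsigned char)*key++; h *= 16777619u; }
        return h;
    }

    void steer_request(const char *key) {
        int owner = hash_key(key) % ncores;    // partition that owns this key
        if (owner == current_core())
            handle_locally(key);               // no locks: only this core ever touches its data
        else
            forward_to_core(owner, key);       // push the request onto the owning core's queue
    }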

SLIDE 19

Shared-something

[Diagram: CPU0-CPU3 in two clusters, each cluster sharing one Data partition in DRAM]

SLIDE 20

Shared-something

[Diagram: CPU0-CPU3 in two clusters, each cluster sharing one Data partition in DRAM]

Hardware resources are partitioned between CPU core clusters.

SLIDE 21

Shared-something

[Diagram: CPU0-CPU3 in two clusters, each cluster sharing one Data partition in DRAM]

No synchronization needed for data access on different CPU clusters.

SLIDE 22

Shared-something

[Diagram: CPU0-CPU3 in two clusters, each cluster sharing one Data partition in DRAM]

Data access needs to be synchronized within the same CPU core cluster.

SLIDE 23

Shared-something

  • Advantages:
  • Request can be processed on many cores
  • Shared-memory scales on small core counts [Holland, 2011].
  • Improved hardware-level parallelism?
  • For example, partitioning around sub-NUMA clustering could improve memory controller utilization.

  • Disadvantages:
  • Request steering becomes more complex.


Holland et al. 2011. Multicore OSes: Looking Forward from 1991, Er, 2011. HotOS ‘11

SLIDE 24

Takeaways

  • Partitioning improves parallelism, but there are trade-offs applications need to consider.
  • Isolation of the in-kernel network stack is needed to avoid interference with application threads.

SLIDE 25

Outline

  • Overview of thread-per-core
  • A key-value store
  • Impact on tail latency
  • Problems in the approach
  • Future directions

SLIDE 26

A shared-nothing key-value store

  • To measure the impact of thread-per-core on tail latency, we designed a shared-nothing key-value store.
  • Memcached wire-protocol compatible for easier evaluation.
  • Software-based request steering with message passing between threads.
  • Lockless, single-producer, single-consumer (SPSC) queue per thread (sketched below).
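A minimal sketch of such a per-thread queue using C11 atomics; it illustrates the single-producer/single-consumer idea, not the store's actual implementation.

    // Sketch: lockless SPSC ring buffer; one producer thread, one consumer thread.
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define QSIZE 1024   /* must be a power of two */

    struct spsc_queue {
        void *slots[QSIZE];
        _Atomic unsigned head;   /* advanced only by the consumer */
        _Atomic unsigned tail;   /* advanced only by the producer */
    };

    bool spsc_push(struct spsc_queue *q, void *msg) {
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (tail - head == QSIZE)
            return false;                      /* queue full */
        q->slots[tail & (QSIZE - 1)] = msg;
        atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
        return true;
    }

    void *spsc_pop(struct spsc_queue *q) {
        unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head == tail)
            return NULL;                       /* queue empty */
        void *msg = q->slots[head & (QSIZE - 1)];
        atomic_store_explicit(&q->head, head + 1, memory_order_release);
        return msg;
    }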

SLIDE 27

Shared-nothing

[Diagram: CPU0-CPU3, each accessing its own Data partition in DRAM]

Taking the shared-nothing model…

SLIDE 28

KV store design

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

…and implementing it on Linux.

SLIDE 29

KV store design

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

In-kernel network stack isolated on its own CPU cores.

SLIDE 30

KV store design

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

Application threads are running on their own CPU cores.

SLIDE 31

KV store design

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

Message passing between the application threads.

SLIDE 32

Outline

  • Overview of thread-per-core
  • A key-value store
  • Impact on tail latency
  • Problems in the approach
  • Future directions

SLIDE 33

Impact on tail latency

  • Comparison of Memcached (shared-everything) and Sphinx (shared-nothing)

  • Measured read and update latency with the Mutilate tool
  • Testbed servers (Intel Xeon):
  • 24 CPU cores, Intel 82599ES NIC (modern)
  • 8 CPU cores, Broadcom NetXtreme II (legacy)
  • Varied IRQ isolation configurations.

SLIDE 34

Impact on tail latency

SLIDE 35

Impact on tail latency

SLIDE 36

99th percentile latency over concurrency for updates

[Plot: 99th percentile update latency (ms, 0-2.5) vs. number of concurrent connections (24-384); series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern)]

SLIDE 37

99th percentile latency over concurrency for updates

[Plot: 99th percentile update latency (ms, 0-2.5) vs. number of concurrent connections (24-384); series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern), with the Memcached and Sphinx curves called out]

SLIDE 38

99th percentile latency over concurrency for updates

[Plot: 99th percentile update latency (ms, 0-2.5) vs. number of concurrent connections (24-384); series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern), with the Memcached and Sphinx curves called out]

Annotation: no locking, better CPU cache utilization.

SLIDE 39

Latency percentiles for updates

[Plot: update latency (ms, 0-2.0) across percentiles (1st-99th); series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern), with the Memcached and Sphinx curves called out]

SLIDE 40

Takeaways

  • The shared-nothing model reduces tail latency for update requests, because partitioning eliminates locking.
  • More results in the paper:
  • Interrupt isolation reduces latency for both shared-everything and shared-nothing.
  • No difference for read requests between shared-nothing and shared-everything (no locking in either case).

SLIDE 41

Outline

  • Overview of thread-per-core
  • A key-value store
  • Impact on tail latency
  • Problems in the approach
  • Future directions

SLIDE 42

Packet movement between CPU cores

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

SLIDE 43

Packet movement between CPU cores

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

A packet arrives on a NIC RX queue and is processed by the in-kernel network stack on CPU0.

SLIDE 44

Packet movement between CPU cores

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

Application thread receives the request on CPU1.

SLIDE 45

Packet movement between CPU cores

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

Request is steered to an application thread on CPU3.

SLIDE 46

Request steering inefficiency

  • Inter-thread communication efficiency matters for software steering:
  • Message passing by copying is a bottleneck; avoiding copies makes the implementation more complex.
  • Thread wakeups are expensive; batching amortizes them, but it increases latency.
  • Busy-polling avoids wakeups, but it wastes CPU resources in some scenarios.

SLIDE 47

Partitioning scheme and skewed workloads

  • The partitioning scheme is critical, but the design decision is application-specific, and data is not always easy to partition.
  • Skewed workloads are difficult to address with the shared-nothing model.

SLIDE 48

Outline

  • Overview of thread-per-core
  • A key-value store
  • Impact on tail latency
  • Problems in the approach
  • Future directions

SLIDE 49

Request steering with a programmable NIC?

  • A program running on the NIC parses request headers and steers each request to the correct application thread [Floem, 2018].
  • Eliminates software request steering overheads and packet movement costs.
  • On Linux, the Express Data Path (XDP) and eBPF interfaces could be used for this (see the sketch below).


Phothilimthana et al. 2018. Floem: A Programming System for NIC-Accelerated Network Applications. OSDI '18.
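A hedged sketch of what driver-level steering could look like on Linux with XDP: the program hashes part of the packet and redirects it to an application core through a CPUMAP. The header handling, the toy hash, and the core count are assumptions for illustration, not the paper's design.

    // Sketch: XDP program steering packets to application cores via a CPUMAP.
    // Illustrative only: a real program would parse the Ethernet/IP/UDP headers
    // and hash the request's key field so each key lands on its owning core.
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_CPUMAP);
        __uint(max_entries, 64);       /* one slot per application core */
        __type(key, __u32);
        __type(value, __u32);          /* per-CPU queue size, set from user space */
    } cpu_map SEC(".maps");

    SEC("xdp")
    int steer(struct xdp_md *ctx) {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        unsigned char *p = data;

        if (p + 64 > (unsigned char *)data_end)   /* need enough bytes to inspect */
            return XDP_PASS;

        // Toy steering decision: hash a few fixed payload bytes and pick a core.
        __u32 h = (p[60] << 8) ^ p[61] ^ (p[62] << 4) ^ p[63];
        __u32 cpu = h % 4;                        /* 4 application cores assumed */
        return bpf_redirect_map(&cpu_map, cpu, 0);
    }

    char _license[] SEC("license") = "GPL";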

SLIDE 50

OS support for inter-core communication?

  • On Linux, the wakeups needed for inter-thread messaging are performed using the eventfd interface or signals, but both have overheads (see the sketch below).
  • Adding better support for inter-core communication to the OS would help.
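A minimal sketch of the eventfd-based wakeup path the first bullet refers to, assuming one eventfd per application thread; each wakeup costs a write() on the sender side and a read() on the receiver side, which is the overhead in question.

    // Sketch: cross-core wakeup with eventfd; one syscall on each side per wakeup
    // unless wakeups are batched.
    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int notify_fd = -1;   /* one per application thread, created with eventfd(0, 0) */

    // Sender side: after enqueueing a message for the target thread, kick it awake.
    void wake_target(void) {
        uint64_t one = 1;
        write(notify_fd, &one, sizeof(one));
    }

    // Receiver side: block until at least one wakeup was posted, then drain the queue.
    void wait_for_messages(void) {
        uint64_t count;
        read(notify_fd, &count, sizeof(count));   /* returns the accumulated wakeup count */
        /* ... pop messages from this thread's SPSC queue here ... */
    }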

SLIDE 51

Non-blocking OS interfaces

  • Thread-per-core requires non-blocking OS interfaces.
  • New asynchronous I/O interfaces, such as io_uring on Linux, will help (see the sketch below).
  • Paging and memory-mapped I/O are effectively blocking operations (when they take a page fault) and must be avoided.
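For example, a minimal liburing sketch of a non-blocking receive; the socket descriptor is a placeholder, and a real thread-per-core event loop would poll for completions instead of waiting.

    // Sketch: asynchronous socket receive with io_uring (via liburing).
    #include <liburing.h>
    #include <stdio.h>

    int main(void) {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[4096];
        int sockfd = -1;   /* placeholder: an already-connected socket */

        io_uring_queue_init(64, &ring, 0);

        // Queue a receive without blocking; the kernel completes it asynchronously.
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_recv(sqe, sockfd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);

        // A thread-per-core loop would poll for completions; here we wait once.
        io_uring_wait_cqe(&ring, &cqe);
        printf("recv completed: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }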

SLIDE 52

Network stack scheduling control

  • The in-kernel network stack runs in kernel threads, which interfere with application threads.
  • Configuring IRQ isolation is possible, but hard and error-prone; better interfaces are needed.
  • Moving the network stack to user space helps.

SLIDE 53

Summary

  • Thread-per-core architecture addresses kernel threading overheads.
  • Partitioning of hardware resources has advantages and disadvantages; applications need to consider the trade-offs.
  • Request steering is critical: CPU and NIC co-design and better OS interfaces are needed to unlock the full potential of thread-per-core.

SLIDE 54

Thank you!

Email: penberg@iki.fi
Home page: penberg.org

SLIDE 55

Backup slides

SLIDE 56

Read latency (99th)

[Plot: 99th percentile read latency (ms, 0-2.5) vs. number of concurrent connections (24-384); series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern)]

SLIDE 57

Read latency

[Plot: read latency (ms, 0-2.0) across percentiles (1st-99th); series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern)]