The Impact of Thread-Per-Core Architecture on Application Tail Latency
Pekka Enberg, Ashwin Rao, and Sasu Tarkoma
University of Helsinki
ANCS 2019
Introduction

Thread-per-core architecture has emerged to eliminate overheads in server applications.
Partitioning hardware resources can improve parallelism, but there are various trade-offs applications need to consider.
Operating system interfaces are holding back the thread-per-core architecture.
Approaches to thread-per-core:
Core-aware thread scheduling [Arachne, Shenango]
Application-level partitioning [Seastar]
Ousterhout et al. 2019. Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads. NSDI '19.
Qin et al. 2018. Arachne: Core-Aware Thread Management. OSDI '18.
Seastar: a framework for high-performance server applications on modern hardware. http://seastar.io/
Sources of tail latency:
Background tasks interfere with application threads.
Requests are processed on a CPU core that is not running the application thread.
Tail latency is sensitive to load balancing configuration changes.
Thread affinity must align with IRQ affinity configuration.
Li et al. 2014. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. SoCC '14.
Partitioning hardware resources (CPU cores and DRAM) can improve parallelism by eliminating thread synchronization.
There are three sharing models: shared-everything, shared-nothing, and shared-something.
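The difference between the sharing models can be illustrated with a minimal sketch (the counter workload and all names here are illustrative, not the talk's benchmark): a shared-everything store serializes every writer behind one lock, while a shared-nothing store gives each thread a private partition that needs no lock at all.

```python
import threading

# Shared-everything: one store that every thread mutates, guarded by a lock.
shared_store = {}
shared_lock = threading.Lock()

def shared_increment(key):
    with shared_lock:                 # all threads serialize here
        shared_store[key] = shared_store.get(key, 0) + 1

# Shared-nothing: each thread owns a private partition, so no lock is needed.
NUM_THREADS = 4
partitions = [dict() for _ in range(NUM_THREADS)]

def private_increment(tid, key):
    part = partitions[tid]            # only thread `tid` ever touches this dict
    part[key] = part.get(key, 0) + 1

threads = [threading.Thread(target=private_increment, args=(tid, "hits"))
           for tid in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The shared-nothing threads never contend: each one updates only its own dictionary, which is what eliminates synchronization in the partitioned design.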
Shared-everything model

[Diagram: CPU0-CPU3 all accessing a single shared Data region in DRAM.]

Hardware resources are shared between all CPU cores.
Every request can be processed on any CPU core.
Data access must be synchronized.

Holland et al. 2011. Multicore OSes: Looking Forward from 1991, Er, 2011. HotOS '11.
Shared-nothing model

[Diagram: CPU0-CPU3, each with its own private Data partition in DRAM.]

Hardware resources are partitioned between CPU cores.
Each request can be processed only on one specific CPU core.
Data access does not require synchronization.
Requests need to be steered to the correct core.

Didona et al. 2019. Sharding for Improving Tail Latencies in In-memory Key-value Stores. NSDI '19.
Lim et al. 2014. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage. NSDI '14.
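Request steering in a shared-nothing design can be sketched as follows (a toy dispatcher, not the paper's Sphinx implementation): a front-end hashes each key to its owning worker and places the request on that worker's queue, so each partition is touched by exactly one thread.

```python
import queue
import threading

NUM_WORKERS = 4

# One request queue and one private data partition per worker thread:
# the shared-nothing layout, so no partition ever needs a lock.
queues = [queue.Queue() for _ in range(NUM_WORKERS)]
partitions = [dict() for _ in range(NUM_WORKERS)]

def owner_of(key):
    # Steer every request for a key to the single worker that owns it.
    return hash(key) % NUM_WORKERS

def worker(wid):
    while True:
        req = queues[wid].get()
        if req is None:                    # shutdown sentinel
            return
        op, key, value = req
        if op == "set":
            partitions[wid][key] = value   # no synchronization required

threads = [threading.Thread(target=worker, args=(w,)) for w in range(NUM_WORKERS)]
for t in threads:
    t.start()

# A front-end "steers" each request to the owning worker's queue.
for key, value in [("foo", 1), ("bar", 2)]:
    queues[owner_of(key)].put(("set", key, value))

for q in queues:
    q.put(None)
for t in threads:
    t.join()
```

The queues here stand in for whatever steering mechanism is used (software dispatch or NIC hardware); the invariant is the same: only the owner ever reads or writes a partition.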
Shared-something model

[Diagram: CPU0-CPU3 grouped into two clusters, each cluster sharing one Data partition in DRAM.]

Hardware resources are partitioned between CPU core clusters.
No synchronization is needed for data access across different CPU clusters.
Data access needs to be synchronized within the same CPU core cluster.
Partitioning by cluster can improve memory controller utilization.

Holland et al. 2011. Multicore OSes: Looking Forward from 1991, Er, 2011. HotOS '11.
Each sharing model has trade-offs that applications need to consider.
The operating system can also cause interference with application threads.
Evaluation

To evaluate the trade-offs, we designed Sphinx, a shared-nothing key-value store.
We use Memcached as the baseline for our evaluation.
Data is partitioned between threads.
Requests are steered to threads using one queue per thread.
Taking the shared-nothing model and implementing it on Linux:

[Diagram: SoftIRQ threads poll NIC RX queues on CPU0 and CPU2; application threads run on CPU1 and CPU3; kernel and userspace communicate through sockets, and application threads exchange messages directly.]

The in-kernel network stack is isolated on its own CPU cores.
Application threads run on their own CPU cores.
Application threads communicate by message passing.
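Isolating threads on their own cores boils down to setting per-thread CPU affinity. A minimal sketch with `os.sched_setaffinity` (a Linux-only call; the two-thread layout and core numbers are illustrative, not the paper's exact configuration):

```python
import os
import threading

def pin_current_thread(cpu):
    """Pin the calling thread to a single CPU core (Linux-only syscall)."""
    if not hasattr(os, "sched_setaffinity"):   # absent on macOS/Windows
        return None
    os.sched_setaffinity(0, {cpu})             # pid 0 = the calling thread
    return os.sched_getaffinity(0)             # report the resulting mask

pinned = {}

def app_thread(cpu):
    # Each application thread runs on its own dedicated core.
    pinned[cpu] = pin_current_thread(cpu)

cores = range(min(2, os.cpu_count() or 1))
threads = [threading.Thread(target=app_thread, args=(c,)) for c in cores]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In a full setup the SoftIRQ side is handled separately, by directing NIC interrupt (IRQ) affinity at the network-stack cores so that kernel packet processing never lands on an application core.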
We compare Memcached (shared-everything) against Sphinx (shared-nothing).
[Figure: 99th percentile update latency (ms) versus number of concurrent connections (24 to 384), for Memcached (legacy), Sphinx (legacy), Memcached (modern), and Sphinx (modern).]

Sphinx has lower 99th percentile update latency than Memcached: no locking, and better CPU cache utilization.
[Figure: update latency (ms) at the 1st through 99th percentiles for Memcached and Sphinx, on legacy and modern hardware.]
Sphinx improves tail latency for update requests, because partitioning eliminates locking.
Read latency is similar between shared-everything and shared-nothing.
Latency is also similar between shared-nothing and shared-something (no locking in either case).
Request steering on Linux:

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0 and CPU2; application threads on CPU1 and CPU3; message passing between application threads.]

A packet arrives on a NIC RX queue and is processed by the in-kernel network stack on CPU0.
An application thread receives the request on CPU1.
The request is then steered to the application thread that owns the data, on CPU3.
Drawbacks of software steering:
Avoiding message copies makes the implementation more complex.
The extra hop between threads increases latency.
Steering overhead can dominate request processing time in some scenarios.
Data partitioning is application-specific, and data is not always easy to partition.
Not all applications fit the shared-nothing model.
Hardware-based steering: the NIC steers each request to the correct application thread [Floem, 2018].
This eliminates the packet movement cost of software steering.
A hardware steering interface could be used for this.

Phothilimthana et al. 2018. Floem: A Programming System for NIC-Accelerated Network Applications. OSDI '18.
Thread wakeups can be performed using the eventfd interface or signals, but both have overheads.
Better notification interfaces from the OS would help.
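The wakeup pattern in question looks like this. eventfd is Linux-specific (exposed as `os.eventfd` only since Python 3.10), so this sketch uses a portable `socketpair` in the same role; either way, every wakeup costs a system call on both the sender and the sleeper, which is the overhead the slide refers to.

```python
import socket
import threading

# A worker blocks on the read side; a producer writes one byte to wake it.
# eventfd plays this role on Linux; socketpair is the portable stand-in here.
wake_rx, wake_tx = socket.socketpair()
mailbox = []     # messages handed to the worker
received = []

def worker():
    wake_rx.recv(1)              # sleep until another thread signals us
    received.extend(mailbox)     # then drain the pending messages

t = threading.Thread(target=worker)
t.start()
mailbox.append("request-1")      # publish the message first...
wake_tx.send(b"\x01")            # ...then pay one syscall to wake the worker
t.join()
```

A busy-polling design avoids these syscalls entirely but burns a core; a cheaper kernel notification primitive would let a thread-per-core runtime sleep without paying this price on every message.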
Emerging OS interfaces, including those being added to Linux, will help.
With them, some of these overheads can be avoided.
Conclusions

The operating system can interfere with application threads.
Each sharing model has advantages and disadvantages; applications need to consider different trade-offs.
Better OS interfaces are needed to unlock the full potential of thread-per-core.
Email: penberg@iki.fi
Home page: penberg.org
[Backup figure: 99th percentile read latency (ms) versus number of concurrent connections (24 to 384) for Memcached and Sphinx, on legacy and modern hardware.]
[Backup figure: read latency (ms) at the 1st through 99th percentiles for Memcached and Sphinx, on legacy and modern hardware.]