SLIDE 1

The Impact of Thread-Per-Core Architecture on Application Tail Latency

Pekka Enberg, Ashwin Rao, and Sasu Tarkoma
University of Helsinki
ANCS 2019

SLIDE 2

Introduction

  • Thread-per-core architecture has emerged to eliminate overheads in traditional multi-threaded architectures in server applications.
  • Partitioning of hardware resources can improve parallelism, but there are various trade-offs applications need to consider.
  • Takeaway: request steering and OS interfaces are holding back the thread-per-core architecture.

SLIDE 3

Outline

  • Overview of thread-per-core
  • A key-value store
  • Impact on tail latency
  • Problems in the approach
  • Future directions

SLIDE 4

Outline

  • Overview of thread-per-core
  • A key-value store
  • Impact on tail latency
  • Problems in the approach
  • Future directions

SLIDE 5

What is thread-per-core?

  • Thread-per-core = no multiplexing of a CPU core at the OS level.
  • Eliminates thread context switching overhead [Qin, 2018; Seastar].
  • Enables elimination of thread synchronization by partitioning [Seastar].
  • Eliminates thread scheduling delays [Ousterhout, 2019].


Ousterhout et al. 2019. Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads. NSDI '19.
Qin et al. 2018. Arachne: Core-Aware Thread Management. OSDI '18.
Seastar: a framework for high-performance server applications on modern hardware. http://seastar.io/
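As a rough sketch of the idea on this slide (not code from the talk), a thread-per-core runtime pins one worker thread to each CPU core with pthread_setaffinity_np so the OS scheduler never multiplexes application threads on a core; the worker body and thread cap below are placeholders.

    // Sketch: one pinned worker thread per CPU core (Linux, pthreads).
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg) {
        long core = (long)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        // Pin this thread to exactly one core so the OS never migrates it.
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        printf("worker pinned to core %ld\n", core);
        // ... the per-core event loop would run here (placeholder) ...
        return NULL;
    }

    int main(void) {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);
        if (ncores > 64) ncores = 64;
        pthread_t threads[64];
        for (long i = 0; i < ncores; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (long i = 0; i < ncores; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }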

SLIDE 6

Interrupt isolation for thread-per-core

  • The in-kernel network stack runs in kernel threads, which interfere with application threads.
  • Network stack processing must be isolated to CPU cores not running application threads.
  • Interrupt isolation can be done with IRQ affinity and IRQ balancing configuration changes (see the sketch below).
  • NIC receive-side scaling (RSS) configuration needs to align with the IRQ affinity configuration.


Li et al. 2014. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. SOCC ‘14
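For illustration only (not the authors' exact configuration), IRQ affinity on Linux can be pinned by writing a CPU bitmask to /proc/irq/<n>/smp_affinity; the IRQ number 42 and the CPU0 mask below are placeholders.

    // Sketch: pin a NIC RX queue interrupt to CPU0 by writing its affinity mask.
    // The IRQ number (42) and the mask ("1" = CPU0 only) are placeholders.
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/irq/42/smp_affinity", "w");
        if (!f) { perror("smp_affinity"); return 1; }
        fputs("1\n", f);   // hex CPU bitmask: bit 0 set, so this IRQ fires only on CPU0
        fclose(f);
        return 0;
    }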

SLIDE 7

Partitioning in thread-per-core

  • Partitioning of hardware resources (such as the NIC and DRAM) can improve parallelism by eliminating thread synchronization.
  • Different ways of partitioning resources: shared-everything, shared-nothing, and shared-something.

SLIDE 8

Shared-everything

[Diagram: CPU0-CPU3 all accessing a single Data region in DRAM]

SLIDE 9

Shared-everything

[Diagram: CPU0-CPU3 all accessing a single Data region in DRAM]

Hardware resources are shared between all CPU cores.

SLIDE 10

Shared-everything

[Diagram: CPU0-CPU3 all accessing a single Data region in DRAM]

Every request can be processed on any CPU core.

SLIDE 11

Shared-everything

[Diagram: CPU0-CPU3 all accessing a single Data region in DRAM]

Data access must be synchronized.

SLIDE 12

Shared-everything

  • Advantages:
  • Every request can be processed on any CPU core.
  • No request steering needed.
  • Disadvantages:
  • Shared-memory scales badly on multicore [Holland, 2011]
  • Examples:
  • Memcached (when thread pool size equals CPU core count)


Holland et al. 2011. Multicore OSes: Looking Forward from 1991, Er, 2011. HotOS ‘11
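To make the synchronization cost concrete, here is a deliberately simplified sketch (not Memcached's actual code) of the shared-everything access pattern: any thread may serve any request, but every write to the shared table takes a global lock.

    // Sketch: shared-everything access pattern; one table, one lock, any thread.
    #include <pthread.h>
    #include <string.h>

    #define NBUCKETS 1024

    struct entry { char key[32]; char value[128]; int used; };

    static struct entry table[NBUCKETS];
    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

    // Called from any worker thread: the lock is required because the same
    // buckets are visible to every CPU core.
    void put(const char *key, const char *value) {
        unsigned h = 0;
        for (const char *p = key; *p; p++)
            h = h * 31 + (unsigned char)*p;
        pthread_mutex_lock(&table_lock);
        struct entry *e = &table[h % NBUCKETS];
        strncpy(e->key, key, sizeof(e->key) - 1);
        strncpy(e->value, value, sizeof(e->value) - 1);
        e->used = 1;
        pthread_mutex_unlock(&table_lock);
    }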

SLIDE 13

Shared-nothing

[Diagram: CPU0-CPU3, each accessing its own Data partition in DRAM]

SLIDE 14

Shared-nothing

[Diagram: CPU0-CPU3, each accessing its own Data partition in DRAM]

Hardware resources are partitioned between CPU cores.

SLIDE 15

Shared-nothing

[Diagram: CPU0-CPU3, each accessing its own Data partition in DRAM]

Each request can be processed only on one specific CPU core.

SLIDE 16

Shared-nothing

[Diagram: CPU0-CPU3, each accessing its own Data partition in DRAM]

Data access does not require synchronization.

SLIDE 17

Shared-nothing

[Diagram: CPU0-CPU3, each accessing its own Data partition in DRAM]

Requests need to be steered.

SLIDE 18

Shared-nothing

  • Advantages:
  • Data access does not require synchronization.
  • Disadvantages:
  • Request steering is needed [Lim, 2014; Didona, 2019] (see the steering sketch below).
  • CPU utilization imbalance if data is not distributed well (“hot partition”)
  • Sensitive to skewed workloads
  • Examples:
  • Seastar framework and MICA key-value store


Didona et al. 2019. Size-aware Sharding for Improving Tail Latencies in In-memory Key-value Stores. NSDI '19.
Lim et al. 2014. MICA: A Holistic Approach to Fast In-memory Key-value Storage. NSDI '14.
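A hedged sketch of the steering decision in a shared-nothing design (not the code of Seastar, MICA, or the authors' store): the key hash selects the owning core, and requests for other cores are forwarded to that core's queue instead of touching its data. The ncores variable and the helper functions are hypothetical stand-ins for the real runtime.

    // Sketch: hash-based request steering in a shared-nothing key-value store.
    extern int ncores;                                       /* number of partitions/cores (assumed) */
    extern int current_core(void);                           /* core this thread is pinned to (assumed) */
    extern void forward_to_core(int core, const char *key);  /* hypothetical helper */
    extern void handle_locally(const char *key);             /* hypothetical helper */

    static unsigned hash_key(const char *key) {
        unsigned h = 2166136261u;              /* FNV-1a */
        while (*key) { h ^= (unsigned char)*key++; h *= 16777619u; }
        return h;
    }

    void steer_request(const char *key) {
        int owner = hash_key(key) % ncores;    // partition that owns this key
        if (owner == current_core())
            handle_locally(key);               // no locks: only this core ever touches its data
        else
            forward_to_core(owner, key);       // push the request onto the owning core's queue
    }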

SLIDE 19

Shared-something

[Diagram: CPU0-CPU3 in two clusters, each cluster sharing one Data partition in DRAM]

SLIDE 20

Shared-something

[Diagram: CPU0-CPU3 in two clusters, each cluster sharing one Data partition in DRAM]

Hardware resources are partitioned between CPU core clusters.

SLIDE 21

Shared-something

[Diagram: CPU0-CPU3 in two clusters, each cluster sharing one Data partition in DRAM]

No synchronization needed for data access on different CPU clusters.

SLIDE 22

Shared-something

[Diagram: CPU0-CPU3 in two clusters, each cluster sharing one Data partition in DRAM]

Data access needs to be synchronized within the same CPU core cluster.

SLIDE 23

Shared-something

  • Advantages:
  • Request can be processed on many cores
  • Shared-memory scales on small core counts [Holland, 2011].
  • Improved hardware-level parallelism?
  • For example, partitioning around sub-NUMA clustering could improve memory controller utilization.

  • Disadvantages:
  • Request steering becomes more complex.


Holland et al. 2011. Multicore OSes: Looking Forward from 1991, Er, 2011. HotOS ‘11

SLIDE 24

Takeaways

  • Partitioning improves parallelism, but there are trade-offs applications need to consider.
  • Isolation of the in-kernel network stack is needed to avoid interference with application threads.

SLIDE 25

Outline

  • Overview of thread-per-core
  • A key-value store
  • Impact on tail latency
  • Problems in the approach
  • Future directions

SLIDE 26

A shared-nothing key-value store

  • To measure the impact of thread-per-core on tail latency, we designed a shared-nothing key-value store.
  • Memcached wire-protocol compatible for easier evaluation.
  • Software-based request steering with message passing between threads.
  • Lockless, single-producer, single-consumer (SPSC) queue per thread (sketched below).
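A minimal sketch of such a per-thread queue using C11 atomics; it illustrates the single-producer/single-consumer idea, not the store's actual implementation.

    // Sketch: lockless SPSC ring buffer; one producer thread, one consumer thread.
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define QSIZE 1024   /* must be a power of two */

    struct spsc_queue {
        void *slots[QSIZE];
        _Atomic unsigned head;   /* advanced only by the consumer */
        _Atomic unsigned tail;   /* advanced only by the producer */
    };

    bool spsc_push(struct spsc_queue *q, void *msg) {
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (tail - head == QSIZE)
            return false;                      /* queue full */
        q->slots[tail & (QSIZE - 1)] = msg;
        atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
        return true;
    }

    void *spsc_pop(struct spsc_queue *q) {
        unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head == tail)
            return NULL;                       /* queue empty */
        void *msg = q->slots[head & (QSIZE - 1)];
        atomic_store_explicit(&q->head, head + 1, memory_order_release);
        return msg;
    }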

SLIDE 27

Shared-nothing

[Diagram: CPU0-CPU3, each accessing its own Data partition in DRAM]

Taking the shared-nothing model…

SLIDE 28

KV store design

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

…and implementing it on Linux.

SLIDE 29

KV store design

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

In-kernel network stack isolated on its own CPU cores.

SLIDE 30

KV store design

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

Application threads are running on their own CPU cores.

SLIDE 31

KV store design

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

Message passing between the application threads.

SLIDE 32

Outline

  • Overview of thread-per-core
  • A key-value store
  • Impact on tail latency
  • Problems in the approach
  • Future directions

SLIDE 33

Impact on tail latency

  • Comparison of Memcached (shared-everything) and Sphinx (shared-nothing)

  • Measured read and update latency with the Mutilate tool
  • Testbed servers (Intel Xeon):
  • 24 CPU cores, Intel 82599ES NIC (modern)
  • 8 CPU cores, Broadcom NetXtreme II (legacy)
  • Varied IRQ isolation configurations.

SLIDE 34

Impact on tail latency

SLIDE 35

Impact on tail latency

SLIDE 36

99th percentile latency over concurrency for updates

[Plot: 99th percentile update latency (ms, 0-2.5) vs. number of concurrent connections (24-384); series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern)]

SLIDE 37

99th percentile latency over concurrency for updates

[Plot: 99th percentile update latency (ms, 0-2.5) vs. number of concurrent connections (24-384); series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern), with the Memcached and Sphinx curves called out]

SLIDE 38

99th percentile latency over concurrency for updates

[Plot: 99th percentile update latency (ms, 0-2.5) vs. number of concurrent connections (24-384); series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern), with the Memcached and Sphinx curves called out]

Annotation: no locking, better CPU cache utilization.

SLIDE 39

Latency percentiles for updates

[Plot: update latency (ms, 0-2.0) across percentiles (1st-99th); series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern), with the Memcached and Sphinx curves called out]

SLIDE 40

Takeaways

  • The shared-nothing model reduces tail latency for update requests, because partitioning eliminates locking.
  • More results in the paper:
  • Interrupt isolation reduces latency for both shared-everything and shared-nothing.
  • No difference for read requests between shared-nothing and shared-everything (no locking in either case).

SLIDE 41

Outline

  • Overview of thread-per-core
  • A key-value store
  • Impact on tail latency
  • Problems in the approach
  • Future directions

SLIDE 42

Packet movement between CPU cores

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

SLIDE 43

Packet movement between CPU cores

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

A packet arrives on a NIC RX queue and is processed by the in-kernel network stack on CPU0.

SLIDE 44

Packet movement between CPU cores

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

Application thread receives the request on CPU1.

SLIDE 45

Packet movement between CPU cores

[Diagram: SoftIRQ threads polling NIC RX queues on CPU0/CPU2; application threads on CPU1/CPU3 with their own DRAM partitions, connected by sockets and inter-thread message passing]

Request is steered to an application thread on CPU3.

SLIDE 46

Request steering inefficiency

  • Inter-thread communication efficiency matters for software steering:
  • Message passing by copying is a bottleneck; avoiding copies makes the implementation more complex.
  • Thread wakeups are expensive; batching amortizes them, but it increases latency.
  • Busy-polling avoids wakeups, but it wastes CPU resources in some scenarios.

SLIDE 47

Partitioning scheme and skewed workloads

  • The partitioning scheme is critical, but the design decision is application-specific, and data is not always easy to partition.
  • Skewed workloads are difficult to address with the shared-nothing model.

SLIDE 48

Outline

  • Overview of thread-per-core
  • A key-value store
  • Impact on tail latency
  • Problems in the approach
  • Future directions

SLIDE 49

Request steering with a programmable NIC?

  • A program running on the NIC parses request headers and steers each request to the correct application thread [Floem, 2018].
  • Eliminates software request steering overheads and packet movement costs.
  • On Linux, the Express Data Path (XDP) and eBPF interfaces could be used for this (see the sketch below).


Phothilimthana et al. 2018. Floem: A Programming System for NIC-Accelerated Network Applications. OSDI '18.
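A hedged sketch of what driver-level steering could look like on Linux with XDP: the program hashes part of the packet and redirects it to an application core through a CPUMAP. The header handling, the toy hash, and the core count are assumptions for illustration, not the paper's design.

    // Sketch: XDP program steering packets to application cores via a CPUMAP.
    // Illustrative only: a real program would parse the Ethernet/IP/UDP headers
    // and hash the request's key field so each key lands on its owning core.
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_CPUMAP);
        __uint(max_entries, 64);       /* one slot per application core */
        __type(key, __u32);
        __type(value, __u32);          /* per-CPU queue size, set from user space */
    } cpu_map SEC(".maps");

    SEC("xdp")
    int steer(struct xdp_md *ctx) {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        unsigned char *p = data;

        if (p + 64 > (unsigned char *)data_end)   /* need enough bytes to inspect */
            return XDP_PASS;

        // Toy steering decision: hash a few fixed payload bytes and pick a core.
        __u32 h = (p[60] << 8) ^ p[61] ^ (p[62] << 4) ^ p[63];
        __u32 cpu = h % 4;                        /* 4 application cores assumed */
        return bpf_redirect_map(&cpu_map, cpu, 0);
    }

    char _license[] SEC("license") = "GPL";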

SLIDE 50

OS support for inter-core communication?

  • On Linux, the wakeups needed for inter-thread messaging are performed using the eventfd interface or signals, but both have overheads (see the sketch below).
  • Adding better support for inter-core communication to the OS would help.
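A minimal sketch of the eventfd-based wakeup path the first bullet refers to, assuming one eventfd per application thread; each wakeup costs a write() on the sender side and a read() on the receiver side, which is the overhead in question.

    // Sketch: cross-core wakeup with eventfd; one syscall on each side per wakeup
    // unless wakeups are batched.
    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int notify_fd = -1;   /* one per application thread, created with eventfd(0, 0) */

    // Sender side: after enqueueing a message for the target thread, kick it awake.
    void wake_target(void) {
        uint64_t one = 1;
        write(notify_fd, &one, sizeof(one));
    }

    // Receiver side: block until at least one wakeup was posted, then drain the queue.
    void wait_for_messages(void) {
        uint64_t count;
        read(notify_fd, &count, sizeof(count));   /* returns the accumulated wakeup count */
        /* ... pop messages from this thread's SPSC queue here ... */
    }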

SLIDE 51

Non-blocking OS interfaces

  • Thread-per-core requires non-blocking OS interfaces.
  • New asynchronous I/O interfaces, such as io_uring on Linux, will help (see the sketch below).
  • Paging and memory-mapped I/O are effectively blocking operations (when they take a page fault) and must be avoided.
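For example, a minimal liburing sketch of a non-blocking receive; the socket descriptor is a placeholder, and a real thread-per-core event loop would poll for completions instead of waiting.

    // Sketch: asynchronous socket receive with io_uring (via liburing).
    #include <liburing.h>
    #include <stdio.h>

    int main(void) {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[4096];
        int sockfd = -1;   /* placeholder: an already-connected socket */

        io_uring_queue_init(64, &ring, 0);

        // Queue a receive without blocking; the kernel completes it asynchronously.
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_recv(sqe, sockfd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);

        // A thread-per-core loop would poll for completions; here we wait once.
        io_uring_wait_cqe(&ring, &cqe);
        printf("recv completed: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }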

SLIDE 52

Network stack scheduling control

  • The in-kernel network stack runs in kernel threads, which interfere with application threads.
  • Configuring IRQ isolation is possible, but hard and error-prone; better interfaces are needed.
  • Moving the network stack to user space helps.

SLIDE 53

Summary

  • Thread-per-core architecture addresses kernel threading overheads.
  • Partitioning of hardware resources has advantages and disadvantages; applications need to consider the trade-offs.
  • Request steering is critical: CPU and NIC co-design and better OS interfaces are needed to unlock the full potential of thread-per-core.

SLIDE 54

Thank you!

Email: penberg@iki.fi
Home page: penberg.org

SLIDE 55

Backup slides

SLIDE 56

Read latency (99th)

[Plot: 99th percentile read latency (ms, 0-2.5) vs. number of concurrent connections (24-384); series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern)]

SLIDE 57

Read latency

[Plot: read latency (ms, 0-2.0) across percentiles (1st-99th); series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern)]