Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency (PowerPoint PPT presentation)

SLIDE 1
Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency

Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports and Steven D. Gribble February 2, 2015

SLIDES 2–6

Introduction: What is Tail Latency?

[Figure: distribution of request latency; x-axis: request processing time, y-axis: fraction of requests.]

In Facebook’s Memcached deployment, median latency is 100 µs, but 95th-percentile latency is ≥ 1 ms.

In this talk, we explore why some requests take longer than expected and what causes them to get delayed.

SLIDES 7–8

Why is the Tail important?

Low latency is crucial for interactive services. A 500 ms delay can cause a 20% drop in user traffic [Google study], and latency is directly tied to traffic, hence revenue.

What makes this challenging is today’s datacenter workloads: interactive services are highly parallel, and a single client request spawns thousands of sub-tasks.

Overall latency depends on the slowest sub-task. With a bad tail, the probability that at least one sub-task gets delayed is high.
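The fan-out effect above can be made concrete with a little arithmetic (an illustrative sketch; the 1% and fan-out of 100 are assumed numbers, not measurements from the talk):

```python
# If each sub-task independently exceeds its per-task 99th-percentile
# latency with probability 0.01, a request that fans out to N sub-tasks
# and must wait for all of them is slow whenever ANY sub-task is slow.
def p_request_slow(p_task_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` sub-tasks is slow."""
    return 1.0 - (1.0 - p_task_slow) ** fanout

print(p_request_slow(0.01, 1))    # single sub-task: ~1% of requests slow
print(p_request_slow(0.01, 100))  # 100 sub-tasks: ~63% of requests slow
```

So at high fan-out, per-task 99th-percentile behavior becomes the common case, which is why the tail matters.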

SLIDES 9–10

A real-life example

Nishtala et al., Scaling Memcache at Facebook, NSDI 2013.

All requests have to finish within the SLA latency.

SLIDES 11–12

What can we do?

People in industry have worked hard on solutions:

Hedged requests [Dean et al.]: sometimes effective, but adds application-specific complexity.

Intelligently avoid slow machines: keep track of server status and route requests around slow nodes.

These attempt to build a predictable response out of less predictable parts; we still don’t know what causes requests to get delayed.
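For reference, the hedged-request idea can be sketched in a few lines (my sketch of the published technique, not code from the talk; the two-replica setup and the `do_request` callable are assumptions):

```python
import concurrent.futures as cf

def hedged_request(replicas, do_request, hedge_delay_s):
    """Send to one replica; if no reply within hedge_delay_s (typically
    the observed 95th-percentile latency), duplicate the request to a
    second replica and take whichever answer arrives first."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(do_request, replicas[0])
        done, _ = cf.wait([first], timeout=hedge_delay_s)
        if done:
            return first.result()
        second = pool.submit(do_request, replicas[1])  # hedge
        done, _ = cf.wait([first, second], return_when=cf.FIRST_COMPLETED)
        return done.pop().result()
```

Note the application-specific complexity the slide warns about: requests must be idempotent, and the hedge delay must track the observed latency distribution.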

SLIDE 13

Our Approach

1. Pick some real-life applications: RPC server, Memcached, Nginx.
2. Generate the ideal latency distribution.
3. Measure the actual distribution on a standard Linux server.
4. Identify a factor causing deviation from the ideal distribution.
5. Explain and mitigate it.
6. Iterate until we reach the ideal distribution.

SLIDE 14

Rest of the Talk

1. Introduction
2. Predicted Latency from Queuing Models
3. Measurements: Sources of Tail Latencies
4. Summary

SLIDES 15–22

Predicted Latency from Queuing Models: Ideal latency distribution

What is the ideal latency for a network server? We need an ideal baseline for comparing measured performance, so we assume a simple model and apply queuing theory.

[Diagram: clients send requests into a single queue drained by the server.]

Given the arrival distribution and the request processing time, we can predict the time a request spends in the server.

SLIDES 23–27

Predicted Latency from Queuing Models: Tail latency characteristics

[Plot: CCDF P[X ≥ x] vs. latency in microseconds, log-log scale, for two example distributions. Reading the tail off Distribution 1: 99th percentile ⇒ 60 µs; 99.9th percentile ⇒ 200 µs.]
9

Predicted Latency from Queuing Models Tail latency characteristics

What is the ideal latency distribution?

Assume a server with single worker with 50 µs fixed processing time.

10-4 10-3 10-2 10-1 100 101 102 103 104 CCDF P[X >= x] Latency in micro-seconds

Dummy

slide-29
SLIDE 29

9

Predicted Latency from Queuing Models Tail latency characteristics

What is the ideal latency distribution?

Assume a server with single worker with 50 µs fixed processing time.

10-4 10-3 10-2 10-1 100 101 102 103 104 CCDF P[X >= x] Latency in micro-seconds Uniform Request Arrival

Dummy

slide-30
SLIDE 30

9

Predicted Latency from Queuing Models Tail latency characteristics

What is the ideal latency distribution?

Assume a server with single worker with 50 µs fixed processing time.

10-4 10-3 10-2 10-1 100 101 102 103 104 CCDF P[X >= x] Latency in micro-seconds Uniform Request Arrival Poisson at 70% Utilization

Inherent tail latency due to request burstiness.

slide-31
SLIDE 31

9

Predicted Latency from Queuing Models Tail latency characteristics

What is the ideal latency distribution?

Assume a server with single worker with 50 µs fixed processing time.

10-4 10-3 10-2 10-1 100 101 102 103 104 CCDF P[X >= x] Latency in micro-seconds Uniform Request Arrival Poisson at 70% Utilization Poisson at 90% Utilization

Tail latency depends on the average server utilization.

slide-32
SLIDE 32

9

Predicted Latency from Queuing Models Tail latency characteristics

What is the ideal latency distribution?

Assume a server with single worker with 50 µs fixed processing time.

10-4 10-3 10-2 10-1 100 101 102 103 104 CCDF P[X >= x] Latency in micro-seconds Uniform Request Arrival Poisson at 70% Utilization Poisson at 90% Utilization Poisson at 70% - 4 workers

Additional workers can reduce tail latency, even at constant utilization.
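These three observations are easy to reproduce with a small discrete-event simulation of the slide's model (a sketch under the stated assumptions of Poisson arrivals, fixed 50 µs service time, and FIFO service; not the authors' code):

```python
import random

def p99_latency(utilization, workers, service_us=50.0, n=100_000, seed=1):
    """99th-percentile time-in-system for an M/D/c FIFO queue."""
    rng = random.Random(seed)
    rate = utilization * workers / service_us  # arrivals per microsecond
    t = 0.0
    free_at = [0.0] * workers                  # when each worker frees up
    latencies = []
    for _ in range(n):
        t += rng.expovariate(rate)             # Poisson arrival process
        w = min(range(workers), key=free_at.__getitem__)
        start = max(t, free_at[w])             # FIFO: wait if all busy
        free_at[w] = start + service_us
        latencies.append(free_at[w] - t)       # queueing delay + service
    latencies.sort()
    return latencies[int(0.99 * n)]

print(p99_latency(0.70, 1))  # burstiness alone gives a tail well above 50 us
print(p99_latency(0.90, 1))  # higher utilization: longer tail
print(p99_latency(0.70, 4))  # more workers, same utilization: shorter tail
```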

SLIDE 33

Outline

1. Introduction
2. Predicted Latency from Queuing Models
3. Measurements: Sources of Tail Latencies
4. Summary

SLIDE 34

Testbed

A cluster of standard datacenter machines: 2 × Intel L5640 6-core CPUs, 24 GB of DRAM, a Mellanox 10 Gbps NIC, Ubuntu 12.04, Linux kernel 3.2.0.

All servers are connected to a single 10 Gbps ToR switch. One server runs Memcached; the others run workload-generating clients. Results for the other applications are in the paper.

SLIDE 35

Timestamping Methodology

Append a blank buffer of ≈32 bytes to each request, and overwrite the buffer with timestamps as the request moves through the server.

Timestamp points (incoming): server NIC; after TCP/UDP processing; Memcached thread scheduled on CPU; read() return.
Timestamp points (outgoing): Memcached write(); server NIC.

Very low overhead, and no server-side logging.
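The carried-timestamp idea can be sketched as follows (an illustrative sketch; the slot layout and helper names are my assumptions, not the paper's tool):

```python
import struct, time

SLOTS = 6                    # one 8-byte slot per instrumentation point
BLANK = b"\x00" * 8 * SLOTS

def make_request(payload: bytes) -> bytearray:
    """Append the blank timestamp buffer to the request."""
    return bytearray(payload + BLANK)

def stamp(req: bytearray, slot: int, payload_len: int) -> None:
    """Overwrite one slot in place as the request passes a point."""
    struct.pack_into("<Q", req, payload_len + 8 * slot, time.monotonic_ns())

def timeline(req: bytearray, payload_len: int) -> list:
    """Recover the per-point timestamps from the request itself."""
    return [struct.unpack_from("<Q", req, payload_len + 8 * i)[0]
            for i in range(SLOTS)]
```

Because each request carries its own timeline, no server-side logging is needed: the client reads the stamps back out of the response.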

SLIDES 36–39

How far are we from the ideal?

[Plot: CCDF P[X ≥ x] vs. latency in microseconds, ideal model vs. standard Linux; at the tail, standard Linux is roughly 30× worse than the ideal.]

Single CPU, single core, Memcached running at 80% utilization.

SLIDES 40–41

Rest of the talk

Source of Tail Latency   | Potential way to fix
Background Processes     | —
Multicore Concurrency    | —
Interrupt Processing     | —

SLIDES 42–43

How can background processes affect tail latency?
Memcached threads time-share a CPU core with other processes, so we must wait for those processes to relinquish the CPU, and scheduling time slices are usually a couple of milliseconds.

How can we mitigate it?
Raise priority (decrease niceness) ⇒ more CPU time.
Upgrade the scheduling class to real-time ⇒ pre-emptive power.
Run on a dedicated core ⇒ no interference whatsoever.
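On Linux, these three mitigations map onto standard process-control calls; a minimal sketch via Python's os module (raising priority and switching to a realtime class require root, so this falls back silently without privileges):

```python
import os

def apply_mitigations(core: int) -> None:
    # 1. Raise priority (decrease niceness) => more CPU time.
    try:
        os.nice(-20)
    except PermissionError:
        pass  # needs root / CAP_SYS_NICE
    # 2. Real-time scheduling class => pre-empts normal tasks.
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(50))
    except PermissionError:
        pass  # realtime classes are privileged too
    # 3. Pin to a dedicated core => no time-sharing with other work
    #    (pair with isolcpus/cpusets so nothing else runs there).
    os.sched_setaffinity(0, {core})
```

The equivalent shell tools are nice/renice, chrt, and taskset.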

SLIDES 44–48

Impact of Background Processes

[Plot: CCDF P[X ≥ x] vs. latency in microseconds, ideal model vs. standard Linux (≈30× ideal), maximum priority (≈10×), realtime scheduling (≈4×), and dedicated core (≈3×).]

Interference from background processes has a large effect on the tail.

Single CPU, single core, Memcached running at 80% utilization.

SLIDES 49–50

Source of Tail Latency   | Potential way to fix
Background Processes     | Isolate by running on a dedicated core.
Multicore Concurrency    | —
Interrupt Processing     | —

SLIDES 51–55

Does adding more CPU cores improve tail latency?

[Plot: CCDF P[X ≥ x] vs. latency in microseconds, for the 1-core ideal model, 4-core ideal model, 1-core Linux (≈3× its ideal), and 4-core Linux (≈15× its ideal).]

Single CPU, 4 cores, Memcached running at 80% utilization.
slide-56
SLIDE 56

20

Measurements: Sources of Tail Latencies

Does adding more CPU cores improve tail latency?

Yes it does! Provided we maintain a single queue abstraction.

slide-57
SLIDE 57

20

Measurements: Sources of Tail Latencies

Does adding more CPU cores improve tail latency?

Yes it does! Provided we maintain a single queue abstraction.

Server Clients Ideal Model

slide-58
SLIDE 58

20

Measurements: Sources of Tail Latencies

Does adding more CPU cores improve tail latency?

Yes it does! Provided we maintain a single queue abstraction. Memcached partitions requests statically among threads.

Server Clients Ideal Model Server Clients Memcached Architecture

slide-59
SLIDE 59

20

Measurements: Sources of Tail Latencies

Does adding more CPU cores improve tail latency?

Yes it does! Provided we maintain a single queue abstraction. Memcached partitions requests statically among threads.

Server Clients Ideal Model Server Clients Memcached Architecture

How can we mitigate it? Modify Memcached’s concurrency model to use a single queue.
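The gap between the two designs shows up directly in simulation (a sketch, not the authors' code; static partitioning is modeled as each request hashing to a fixed worker, drawn uniformly at random, with the same Poisson/50 µs workload as the queuing-model section):

```python
import random

def p99_latency(partitioned, workers=4, utilization=0.8, service_us=50.0,
                n=100_000, seed=7):
    """99th-percentile latency: shared FIFO vs. statically partitioned."""
    rng = random.Random(seed)
    rate = utilization * workers / service_us
    t = 0.0
    free_at = [0.0] * workers
    latencies = []
    for _ in range(n):
        t += rng.expovariate(rate)      # Poisson arrivals
        if partitioned:
            w = rng.randrange(workers)  # fixed per-request worker
        else:
            w = min(range(workers), key=free_at.__getitem__)  # shared queue
        start = max(t, free_at[w])
        free_at[w] = start + service_us
        latencies.append(free_at[w] - t)
    latencies.sort()
    return latencies[int(0.99 * n)]

print(p99_latency(partitioned=True))   # per-thread queues: longer tail
print(p99_latency(partitioned=False))  # single shared queue: shorter tail
```

With partitioning, a burst aimed at one worker queues up even while other workers sit idle; the shared queue lets any idle worker absorb the burst.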

SLIDES 60–63

Impact of Multicore Concurrency Model

[Plot: CCDF P[X ≥ x] vs. latency in microseconds, for the 4-core ideal model, 1-core Linux, 4-core Linux (≈15× ideal), and 4-core Linux with a single queue (≈4× ideal).]

For multi-threaded applications, a single-queue abstraction can reduce tail latency.

Single CPU, 4 cores, Memcached running at 80% utilization.

SLIDES 64–65

Source of Tail Latency   | Potential way to fix
Background Processes     | Isolate by running on a dedicated core.
Concurrency Model        | Ensure a single-queue abstraction.
Interrupt Processing     | —

SLIDES 66–67

How can interrupts affect tail latency?
By default, Linux’s irqbalance spreads interrupts across all cores, so the OS pre-empts Memcached threads frequently, introducing extra context-switching overhead and cache pollution.

How can we mitigate it? Use separate cores for interrupt processing and application threads: 3 cores run Memcached threads, and 1 core processes interrupts.
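This split is normally configured through procfs IRQ affinity masks plus CPU pinning. A minimal sketch (the procfs paths are the standard Linux ones; writing them requires root, and which IRQs belong to the NIC is an assumption left to fill in):

```python
import glob, os

def cpu_mask(cores) -> str:
    """Hex affinity mask in the format /proc/irq/*/smp_affinity expects."""
    return format(sum(1 << c for c in cores), "x")

def steer_irqs_to(core: int) -> None:
    """Point every IRQ at one dedicated interrupt-processing core."""
    for path in glob.glob("/proc/irq/*/smp_affinity"):
        try:
            with open(path, "w") as f:
                f.write(cpu_mask({core}))
        except OSError:
            pass  # needs root; some IRQs cannot be re-steered

def pin_application(cores) -> None:
    """Keep the application's threads on the remaining cores."""
    os.sched_setaffinity(0, set(cores))

# e.g. core 0 for interrupts, cores 1-3 for Memcached threads:
# steer_irqs_to(0); pin_application({1, 2, 3})
```

Running irqbalance must also be disabled, or it will re-spread the interrupts.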

SLIDES 68–70

Impact of Interrupt Processing

[Plot: CCDF P[X ≥ x] vs. latency in microseconds, for the 4-core ideal model, 4-core Linux with interrupts spread (≈4× ideal), and 4-core Linux with a separate interrupt core.]

Separate cores for interrupt and application processing improve tail latency.

Single CPU, 4 cores, Memcached running at 80% utilization.

SLIDE 71

Other sources of tail latency

Source of Tail Latency   | Underlying Cause
Thread Scheduling Policy | Non-FIFO ordering of requests.
NUMA Effects             | Increased latency across NUMA nodes.
Hyper-threading          | Contending hyper-threads can increase latency.
Power Saving Features    | Extra time required to wake the CPU from an idle state.

SLIDES 72–73

Summary and Future Work

We explored hardware, OS, and application-level sources of tail latency, pinpointing each source using fine-grained timestamping and an ideal model. We obtained substantial improvements, close to the ideal distributions: Memcached’s 99.9th-percentile latency fell from 5 ms to 32 µs.

Future work: sources of tail latency in multi-process environments; how virtualization affects tail latency (overhead of virtualization, interference from other VMs); and new effects when moving to a distributed setting, such as network effects.