A Quest for Predictable Latency: Adventures in Java Concurrency (Martin Thompson)



slide-1
SLIDE 1

A Quest for Predictable Latency

Adventures in Java Concurrency

Martin Thompson - @mjpt777

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5

If a system does not respond in a timely manner then it is effectively unavailable

slide-6
SLIDE 6
  • 1. It’s all about the Blocking
  • 2. What do we mean by Latency
  • 3. Adventures with Locks and Queues
  • 4. Some Alternative FIFOs
  • 5. Where can we go Next
slide-7
SLIDE 7
  • 1. It’s all about the Blocking

slide-8
SLIDE 8

What is preventing progress?

slide-9
SLIDE 9

A thread is blocked when it cannot make progress

slide-10
SLIDE 10

There are two major causes of blocking

slide-12
SLIDE 12

Blocking

  • Concurrent Algorithms
      • Notifying Completion
      • Mutual Exclusion (Contention)
      • Synchronisation / Rendezvous
  • Systemic Pauses
      • JVM Safepoints (GC, etc.)
      • Transparent Huge Pages (Linux)
      • Hardware (C-States, SMIs)
slide-13
SLIDE 13
Concurrent Algorithms

Locks
  • Call into kernel on contention
  • Always blocking
  • Difficult to get right

Atomic/CAS Instructions
  • Execute in user space
  • Can be non-blocking
  • Very difficult to get right
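
To make the contrast between locks and atomic/CAS instructions concrete, here is a minimal sketch of the same counter written both ways. The class and method names (CounterSketch, LockedCounter, CasCounter, runConcurrently) are hypothetical and not taken from the talk.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

public class CounterSketch {
    // Lock-based: under contention a waiting thread may be parked via the kernel.
    static final class LockedCounter {
        private final ReentrantLock lock = new ReentrantLock();
        private long value;

        long increment() {
            lock.lock();
            try {
                return ++value;
            } finally {
                lock.unlock();
            }
        }
    }

    // CAS-based: a failed attempt is retried in user space; no thread is ever parked.
    static final class CasCounter {
        private final AtomicLong value = new AtomicLong();

        long increment() {
            long current;
            do {
                current = value.get();
            } while (!value.compareAndSet(current, current + 1));
            return current + 1;
        }
    }

    // Hypothetical helper: runs the same task on several threads and waits for all of them.
    static void runConcurrently(int threads, Runnable task) throws InterruptedException {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(task);
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LockedCounter locked = new LockedCounter();
        CasCounter cas = new CasCounter();
        runConcurrently(4, () -> {
            for (int i = 0; i < 10_000; i++) {
                locked.increment();
                cas.increment();
            }
        });
        System.out.println("locked next = " + locked.increment());
        System.out.println("cas next    = " + cas.increment());
    }
}
```

Both counters give the same totals; the difference is what happens to a thread that loses the race, which is where the latency cost shows up.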

slide-14
SLIDE 14
  • 2. What do we mean by Latency

slide-19
SLIDE 19

Queuing Theory

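[Diagram: work is Enqueued, waits in the queue (Latent/Wait Time), and is then Dequeued and serviced (Service Time)]

Response Time = Latent/Wait Time + Service Time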

slide-20
SLIDE 20

Queuing Theory

[Chart: Response Time (y-axis, 0.0 to 12.0) plotted against Utilisation (x-axis, 0.0 to 1.0)]
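
The shape of this curve can be reproduced with a few lines of code. The sketch below assumes the simplest single-server model (M/M/1), where mean response time is service time divided by (1 - utilisation). The slides do not say which model the chart uses, so treat that as an assumption, along with the hypothetical names ResponseTimeSketch and responseTime.

```java
public class ResponseTimeSketch {
    // Assumed M/M/1 model: responseTime = serviceTime / (1 - utilisation).
    static double responseTime(double serviceTime, double utilisation) {
        if (utilisation < 0.0 || utilisation >= 1.0) {
            throw new IllegalArgumentException("utilisation must be in [0, 1)");
        }
        return serviceTime / (1.0 - utilisation);
    }

    public static void main(String[] args) {
        // Service time is normalised to 1.0 so the result reads as a multiple of it.
        for (int i = 0; i <= 9; i++) {
            double utilisation = i / 10.0;
            System.out.printf("utilisation=%.1f responseTime=%.2f%n",
                utilisation, responseTime(1.0, utilisation));
        }
    }
}
```

Under this model response time doubles by 50% utilisation and is roughly ten times the service time at 90%, which is why queues on a busy system dominate latency.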

slide-21
SLIDE 21

Amdahl’s Law

[Chart: Speedup (y-axis, 2 to 20) against Processors (x-axis, 1 to 1024, doubling at each step); series: Amdahl]

slide-22
SLIDE 22

Universal Scalability Law

[Chart: Speedup (y-axis, 2 to 20) against Processors (x-axis, 1 to 1024, doubling at each step); series: Amdahl, USL]
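
Both laws are short formulas, so a sketch may help when comparing the two curves. Amdahl's Law gives speedup 1 / ((1 - p) + p / N) for parallel fraction p on N processors. The Universal Scalability Law gives N / (1 + α(N - 1) + βN(N - 1)), where α models contention and β models coherence cost. The parameter values used in main are illustrative assumptions, not figures from the talk. The value p = 0.95 is chosen only because its limit of 20 matches the top of the chart's axis.

```java
public class ScalabilitySketch {
    // Amdahl's Law: speedup is bounded by the serial fraction (1 - parallelFraction).
    static double amdahl(double parallelFraction, int processors) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / processors);
    }

    // Universal Scalability Law: alpha = contention penalty, beta = coherence penalty.
    static double usl(double alpha, double beta, int processors) {
        double n = processors;
        return n / (1.0 + alpha * (n - 1.0) + beta * n * (n - 1.0));
    }

    public static void main(String[] args) {
        double parallelFraction = 0.95; // assumed for illustration
        double alpha = 0.05;            // assumed: equals the serial fraction above
        double beta = 0.0001;           // assumed: small coherence cost
        for (int n = 1; n <= 1024; n *= 2) {
            System.out.printf("N=%4d amdahl=%6.2f usl=%6.2f%n",
                n, amdahl(parallelFraction, n), usl(alpha, beta, n));
        }
    }
}
```

With β = 0 the USL reduces to Amdahl's Law. With any β > 0 the speedup peaks and then falls as processors are added, which is the difference between the two series.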

slide-23
SLIDE 23
  • 3. Adventures with Locks and Queues?

slide-24
SLIDE 24

Some queue implementations spend more time queueing to be enqueued than in queue time!!!

slide-25
SLIDE 25

The evils of Blocking

slide-26
SLIDE 26

Queue.put() && Queue.take()

slide-27
SLIDE 27

Condition Variables

slide-28
SLIDE 28

Condition Variable RTT

[Diagram: a sender signals Ping to an Echo Service, which signals Pong back; each round trip is recorded in a Histogram]
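
A round-trip measurement of this kind could be sketched as follows. This is not the benchmark code from the talk: the class name ConditionPingPong and its methods are hypothetical, and a plain array stands in for the histogram.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class ConditionPingPong {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition pingCondition = lock.newCondition();
    private final Condition pongCondition = lock.newCondition();
    private long pingValue = -1; // guarded by lock
    private long pongValue = -1; // guarded by lock

    // Echo service: waits for ping i, then answers with pong i.
    void echo(int iterations) throws InterruptedException {
        for (long i = 0; i < iterations; i++) {
            lock.lock();
            try {
                while (pingValue != i) {
                    pingCondition.await();
                }
                pongValue = i;
                pongCondition.signal();
            } finally {
                lock.unlock();
            }
        }
    }

    // Sender: signals ping i, waits for pong i, records the round-trip time in nanoseconds.
    long[] measure(int iterations) throws InterruptedException {
        long[] rttNanos = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            lock.lock();
            try {
                pingValue = i;
                pingCondition.signal();
                while (pongValue != i) {
                    pongCondition.await();
                }
            } finally {
                lock.unlock();
            }
            rttNanos[i] = System.nanoTime() - start;
        }
        return rttNanos;
    }

    static long[] run(int iterations) throws InterruptedException {
        ConditionPingPong pingPong = new ConditionPingPong();
        Thread echoThread = new Thread(() -> {
            try {
                pingPong.echo(iterations);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        echoThread.start();
        long[] rttNanos = pingPong.measure(iterations);
        echoThread.join();
        return rttNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] rttNanos = run(100_000);
        long max = 0;
        for (long rtt : rttNanos) {
            max = Math.max(max, rtt);
        }
        System.out.println("max RTT (ns) = " + max);
    }
}
```

Every hop wakes a parked thread, so each round trip pays for two wake-ups even though no work is done in between.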

slide-29
SLIDE 29

[Histogram: condition variable round-trip time (µs)]

slide-30
SLIDE 30

[Histogram: condition variable round-trip time (µs); Max = 5525.503µs]

slide-31
SLIDE 31

Bad News

That’s the best case scenario!

slide-32
SLIDE 32

Are non-blocking APIs any better?

slide-33
SLIDE 33

Queue.offer() && Queue.poll()
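
For reference, the non-blocking pair returns immediately instead of waiting: offer() returns false when a bounded queue is full and poll() returns null when it is empty, leaving the caller to decide what to do. A minimal sketch, with hypothetical helper names offerAll and drain:

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.IntConsumer;

public class OfferPollSketch {
    // Offers values in order and stops at the first rejection; returns how many were accepted.
    static int offerAll(Queue<Integer> queue, int[] values) {
        int accepted = 0;
        for (int value : values) {
            if (!queue.offer(value)) {
                break; // full: the caller decides whether to retry, drop or back off
            }
            accepted++;
        }
        return accepted;
    }

    // Polls until the queue reports empty; returns how many items were handled.
    static int drain(Queue<Integer> queue, IntConsumer handler) {
        int count = 0;
        Integer item;
        while ((item = queue.poll()) != null) {
            handler.accept(item);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Queue<Integer> queue = new ArrayBlockingQueue<>(2);
        System.out.println("accepted = " + offerAll(queue, new int[] {1, 2, 3}));
        System.out.println("drained  = " + drain(queue, v -> System.out.println("item " + v)));
    }
}
```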

slide-34
SLIDE 34

Context for the Benchmarks

  • https://github.com/real-logic/benchmarks
  • Java 8u60
  • Ubuntu 15.04 – “performance” mode
  • Intel i7-3632QM – 2.2 GHz (Ivy Bridge)

slide-35
SLIDE 35

                                Prod #      Mean       99%
Baseline (Wait-free)                 1       167       189
ArrayBlockingQueue                   1       645     1,210
                                     2     1,984    11,648
                                     3     5,257    19,680
LinkedBlockingQueue                  1       461       740
                                     2     1,197     7,320
                                     3     2,010    15,152
ConcurrentLinkedQueue                1       281       361
                                     2       381       559
                                     3       444       705

Burst Length = 1: RTT (ns)

slide-36
SLIDE 36

                                Prod #      Mean       99%
Baseline (Wait-free)                 1       721       982
ArrayBlockingQueue                   1    30,346    41,728
                                     2    35,631    65,008
                                     3    50,271    90,240
LinkedBlockingQueue                  1    31,422    41,600
                                     2    45,096    94,208
                                     3    89,820   180,224
ConcurrentLinkedQueue                1    12,916    15,792
                                     2    25,132    35,136
                                     3    39,462    56,768

Burst Length = 100: RTT (ns)

slide-38
SLIDE 38

Backpressure? Size methods? Flow Rates? Garbage? Fan out?

slide-39
SLIDE 39
  • 4. Some alternative FIFOs

slide-40
SLIDE 40

Inter-Thread FIFOs

slide-41
SLIDE 41

Disruptor 1.0 – 2.0

[Diagram: Reference Ring Buffer with a claimSequence (advanced by <<CAS>>), a cursor and a gating sequence]

Influences: Lamport + Network Cards

slide-47
SLIDE 47

    long expectedSequence = claimedSequence - 1;
    while (cursor != expectedSequence)
    {
        // busy spin
    }
    cursor = claimedSequence;
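
The fragment above only shows the publish step. Below is a minimal self-contained sketch of how it could fit together with a CAS-based claim. It is not the Disruptor source: the class SequencedPublishSketch and its methods are hypothetical, gating against consumers is omitted (so the buffer must not wrap), and a Thread.yield() is added to the spin so the sketch stays well behaved on machines with few cores.

```java
import java.util.concurrent.atomic.AtomicLong;

public class SequencedPublishSketch {
    private final AtomicLong claimSequence = new AtomicLong(-1);
    private volatile long cursor = -1;
    private final long[] ringBuffer; // hypothetical payload storage
    private final int mask;

    public SequencedPublishSketch(int capacityPowerOfTwo) {
        if (Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        ringBuffer = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Producers race to claim the next sequence with a CAS loop.
    long claim() {
        long current;
        do {
            current = claimSequence.get();
        } while (!claimSequence.compareAndSet(current, current + 1));
        return current + 1;
    }

    void write(long sequence, long value) {
        ringBuffer[(int) (sequence & mask)] = value;
    }

    // Publication is strictly in sequence order: wait for the predecessor to publish first.
    void publish(long claimedSequence) {
        long expectedSequence = claimedSequence - 1;
        while (cursor != expectedSequence) {
            Thread.yield(); // the slide busy spins; yielding is a concession for this sketch
        }
        cursor = claimedSequence;
    }

    long cursor() {
        return cursor;
    }

    long read(long sequence) {
        return ringBuffer[(int) (sequence & mask)];
    }
}
```

The cost of this design is visible in publish(): a producer that claimed a later sequence cannot make its message visible until every earlier claimant has published, so one stalled producer holds up all the others.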

slide-49
SLIDE 49

Disruptor 3.0

[Diagram: Reference Ring Buffer with a cursor (advanced by <<CAS>>), a gating sequence, a gatingCache and an available buffer]

Influences: Lamport + Network Cards + Fast Flow

slide-55
SLIDE 55

Disruptor 2.0

slide-56
SLIDE 56

Disruptor 3.0

slide-57
SLIDE 57

What if I need something with a Queue interface?

slide-58
SLIDE 58

ManyToOneConcurrentArrayQueue

Influences: Lamport + Fast Flow

[Diagram: Reference Ring Buffer with a tail (advanced by <<CAS>>), a head and a headCache]
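
The idea behind the diagram can be sketched as below: producers claim a slot by CAS on the tail, consult a cached copy of the head to check for space, and only re-read the real head when the cache says the queue looks full. This is a simplified illustration written for this transcript, not the Agrona implementation; the class ManyToOneQueueSketch is hypothetical and omits details such as cache-line padding.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

public class ManyToOneQueueSketch<E> {
    private final int capacity;
    private final int mask;
    private final AtomicReferenceArray<E> buffer;
    private final AtomicLong tail = new AtomicLong();      // claimed by producers with CAS
    private final AtomicLong head = new AtomicLong();      // advanced only by the single consumer
    private final AtomicLong headCache = new AtomicLong(); // producers' possibly stale view of head

    public ManyToOneQueueSketch(int capacityPowerOfTwo) {
        if (Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        capacity = capacityPowerOfTwo;
        mask = capacityPowerOfTwo - 1;
        buffer = new AtomicReferenceArray<>(capacityPowerOfTwo);
    }

    // Safe to call from many producer threads.
    public boolean offer(E element) {
        if (element == null) {
            throw new NullPointerException("null elements are not supported");
        }
        long currentTail;
        do {
            currentTail = tail.get();
            if (currentTail - headCache.get() >= capacity) {
                // Looks full according to the cache: pay for a read of the real head.
                long currentHead = head.get();
                if (currentTail - currentHead >= capacity) {
                    return false; // genuinely full
                }
                headCache.lazySet(currentHead);
            }
        } while (!tail.compareAndSet(currentTail, currentTail + 1));

        buffer.lazySet((int) (currentTail & mask), element); // ordered store publishes the element
        return true;
    }

    // Must only be called from the single consumer thread.
    public E poll() {
        long currentHead = head.get();
        int index = (int) (currentHead & mask);
        E element = buffer.get(index);
        if (element == null) {
            return null; // empty, or a producer has claimed the slot but not yet written it
        }
        buffer.lazySet(index, null);
        head.lazySet(currentHead + 1);
        return element;
    }
}
```

The headCache keeps producers from contending with the consumer over the head's cache line on every offer, which is the Fast Flow influence named on the slide.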

slide-63
SLIDE 63

                                Prod #      Mean       99%
Baseline (Wait-free)                 1       167       189
ConcurrentLinkedQueue                1       281       361
                                     2       381       559
                                     3       444       705
Disruptor 3.3.2                      1       201       263
                                     2       256       373
                                     3       282       459
ManyToOneConcurrentArrayQueue        1       173       198
                                     2       238       319
                                     3       259       392

Burst Length = 1: RTT (ns)

slide-65
SLIDE 65

                                Prod #      Mean       99%
Baseline (Wait-free)                 1       721       982
ConcurrentLinkedQueue                1    12,916    15,792
                                     2    25,132    35,136
                                     3    39,462    56,768
Disruptor 3.3.2                      1     7,030     8,192
                                     2    15,159    21,184
                                     3    31,597    54,976
ManyToOneConcurrentArrayQueue        1     3,570     3,932
                                     2    12,926    16,480
                                     3    26,685    51,072

Burst Length = 100: RTT (ns)

slide-67
SLIDE 67

BUT!!! Disruptor has single producer and batch methods

slide-68
SLIDE 68

Inter-Process FIFOs

slide-69
SLIDE 69

ManyToOneRingBuffer

slide-70
SLIDE 70

[Diagram sequence: a ring buffer in a File, with a Head position and a Tail position advanced by <<CAS>>]

  • The buffer holds Message 2 and Message 3, each with a Header
  • Message 4 is added at the Tail
  • The Header for Message 4 is written
  • Message 2 is removed as the Head advances, leaving Message 3 and Message 4

slide-76
SLIDE 76

MPSC Ring Buffer Message Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-------------------------------------------------------------+
|R|                         Frame Length                        |
+-+-------------------------------------------------------------+
|R|                         Message Type                        |
+-+-------------------------------------------------------------+
|                        Encoded Message                      ...
...                                                             |
+---------------------------------------------------------------+

slide-77
SLIDE 77

Takes a throughput hit on zeroing memory, but latency is good

slide-78
SLIDE 78

Aeron IPC

slide-79
SLIDE 79

[Diagram sequence: an append-only File with a Tail position]

  • The File starts empty with the Tail at the beginning
  • Message 1 is appended with its Header
  • Message 2 is appended with its Header
  • Message 3 is written, and then its Header

slide-85
SLIDE 85

One big file that goes on forever?

slide-86
SLIDE 86

No!!!

Page faults, page cache churn, VM pressure, ...

slide-87
SLIDE 87

[Diagram: the log is split into three buffers labelled Active, Dirty and Clean, holding Messages with Headers; the Tail marks the append position]

slide-88
SLIDE 88

Do publishers need a CAS?

slide-89
SLIDE 89

[Diagram sequence: publishers advance the Tail with <<XADD>> rather than a CAS loop]

  • The File holds Message 1, Message 2 and Message 3, each with a Header
  • Message X and Message Y are claimed concurrently by advancing the Tail with <<XADD>>
  • Message Y gets its Header
  • Padding is written to fill the space left at the end of the File
  • Message X goes into the next File, followed by its Header
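
The difference between claiming with a CAS loop and claiming with XADD can be sketched with AtomicLong.getAndAdd, which HotSpot on x86 typically compiles to a LOCK XADD instruction. Each publisher gets a unique offset in one atomic step and never retries. This is an illustration only: the class XaddClaimSketch and the result codes PADDING_REQUIRED and BEYOND_END are hypothetical names, not Aeron's.

```java
import java.util.concurrent.atomic.AtomicLong;

public class XaddClaimSketch {
    // Hypothetical result codes for claims that do not fit in the current buffer.
    static final long PADDING_REQUIRED = -1; // claim straddles the end: this publisher pads the remainder
    static final long BEYOND_END = -2;       // claim starts past the end: use the next buffer

    private final long capacity;
    private final AtomicLong tail = new AtomicLong();

    public XaddClaimSketch(long capacity) {
        this.capacity = capacity;
    }

    // Claims 'length' bytes. Always completes in a single atomic step, even under contention.
    long claim(int length) {
        long offset = tail.getAndAdd(length);
        if (offset + length <= capacity) {
            return offset; // the range [offset, offset + length) belongs to this publisher alone
        }
        if (offset < capacity) {
            return PADDING_REQUIRED;
        }
        return BEYOND_END;
    }
}
```

Because the tail can run past the end of the buffer, exactly one publisher sees the straddling claim and becomes responsible for the padding. Every later claimant sees an offset beyond the end and moves to the next buffer.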

slide-96
SLIDE 96

Have a background thread do the zeroing and flow control

slide-97
SLIDE 97

                                Prod #      Mean       99%
Baseline (Wait-free)                 1       167       189
ConcurrentLinkedQueue                1       281       361
                                     2       381       559
                                     3       444       705
ManyToOneRingBuffer                  1       170       194
                                     2       256       279
                                     3       283       340
Aeron IPC                            1       192       210
                                     2       251       286
                                     3       290       342

Burst Length = 1: RTT (ns)

slide-99
SLIDE 99

Aeron Data Message Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-------------------------------------------------------------+
|R|                         Frame Length                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-------------------------------+
|    Version    |B|E|   Flags   |             Type              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-------------------------------+
|R|                         Term Offset                         |
+-+-------------------------------------------------------------+
|                           Session ID                          |
+---------------------------------------------------------------+
|                           Stream ID                           |
+---------------------------------------------------------------+
|                            Term ID                            |
+---------------------------------------------------------------+
|                        Encoded Message                      ...
...                                                             |
+---------------------------------------------------------------+

slide-100
SLIDE 100

                                Prod #      Mean       99%
Baseline (Wait-free)                 1       721       982
ConcurrentLinkedQueue                1    12,916    15,792
                                     2    25,132    35,136
                                     3    39,462    56,768
ManyToOneRingBuffer                  1     3,773     3,992
                                     2    11,286    13,200
                                     3    27,255    34,432
Aeron IPC                            1     4,436     4,632
                                     2     7,623     8,224
                                     3    10,825    13,872

Burst Length = 100: RTT (ns)

slide-102
SLIDE 102

Less “False Sharing”

  • Inlined data vs Reference Array

Less “Card Marking”

slide-103
SLIDE 103

Is avoiding the spinning “CAS” loop a major step forward?

slide-104
SLIDE 104

                                Prod #      Mean       99%
Baseline (Wait-free)                 1       721       982
ConcurrentLinkedQueue                1    12,916    15,792
                                     2    25,132    35,136
                                     3    39,462    56,768
ManyToOneRingBuffer                  1     3,773     3,992
                                     2    11,286    13,200
                                     3    27,255    34,432
Aeron IPC                            1     4,436     4,632
                                     2     7,623     8,224
                                     3    10,825    13,872

Burst Length = 100: RTT (ns)

slide-105
SLIDE 105

[Chart: Burst Length = 100: 99% RTT (ns); axis from 20,000 to 200,000]

slide-106
SLIDE 106

Event Logging is a Messaging Problem

slide-107
SLIDE 107
  • 5. Where can we go Next?

slide-112
SLIDE 112

C vs Java

  • Spin Wait Loops
      • Thread.onSpinWait()
  • Data Dependent Loads
      • Heap Aggregates - ObjectLayout
      • Stack Allocation – Value Types
  • Memory Copying
      • Baseline against memcpy() for differing alignment
  • Queue Interface
      • Break conflated concerns to reduce blocking actions
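
Of the items above, Thread.onSpinWait() has since become available in the JDK (from Java 9), so the spin-wait case can be sketched directly. The class SpinWaitSketch and its methods are hypothetical names for illustration.

```java
public class SpinWaitSketch {
    private volatile boolean ready;

    // Spins until another thread sets the flag, hinting to the CPU that this is a spin loop.
    void awaitReady() {
        while (!ready) {
            Thread.onSpinWait(); // requires Java 9 or later
        }
    }

    void signalReady() {
        ready = true;
    }

    public static void main(String[] args) throws InterruptedException {
        SpinWaitSketch sketch = new SpinWaitSketch();
        Thread waiter = new Thread(sketch::awaitReady);
        waiter.start();
        sketch.signalReady();
        waiter.join();
        System.out.println("waiter released");
    }
}
```

The call is only a hint: the loop behaves the same without it, but on processors that support a pause-style instruction it can reduce power use and the cost of leaving the loop.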

slide-113
SLIDE 113

In closing…

slide-114
SLIDE 114

2TB RAM and 100+ vcores!!!

slide-115
SLIDE 115

Where can I find the code?

  • https://github.com/real-logic/benchmarks
  • https://github.com/real-logic/Agrona
  • https://github.com/real-logic/Aeron

slide-116
SLIDE 116

Blog: http://mechanical-sympathy.blogspot.com/
Twitter: @mjpt777

“Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction.”
- Albert Einstein

Questions?