How To I/O? Todd L. Montgomery @toddlmontgomery

slide-1
SLIDE 1

How To I/O?

Todd L. Montgomery @toddlmontgomery

StoneTor

slide-2
SLIDE 2

  • I/O? Really?
  • What used to be true …
  • … is still true
  • Except when it isn’t
  • Case Study: Aeron
  • Takeaways

slide-3
SLIDE 3

I/O? Really?

slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8
slide-9
SLIDE 9

M.2, DDRSSD, PCIe 3/4, 100 GbE, …, OmniPath

slide-10
SLIDE 10

  • CPUs
  • Cache / Memory
  • Fast networks - I/O-“ish”

slide-11
SLIDE 11

Storage 700+ MBps

slide-12
SLIDE 12

Network 10Gbps <15us latency

slide-13
SLIDE 13

[Chart: accumulated improvement over time for Network Bandwidth, Response Time, Storage Capacity, CPU Cores, and Memory Capacity]

slide-14
SLIDE 14

It’s all good… nothing to worry about… right?

slide-15
SLIDE 15

What used to be true

slide-16
SLIDE 16

Synchronous Read/Write

slide-17
SLIDE 17
slide-18
SLIDE 18
slide-19
SLIDE 19
slide-20
SLIDE 20

Streaming Read/Write

slide-21
SLIDE 21

Striding: not just for memory

  • VM
  • Storage
  • RDMA

slide-22
SLIDE 22

00004f0: 27cf 5c08 726b 8da2 486d f305 8e18 8727 '.\.rk..Hm.....'
0000500: 07ba 9b14 18e9 90da ce20 8569 6d49 1b2c ......... .imI.,
0000510: 0b02 a02b 5095 cb25 5f11 76b8 1ae2 13d4 ...+P..%_.v.....
0000520: 2148 8924 2220 1e30 e325 5f71 44e5 98c4 !H.$" .0.%_qD...
0000530: 621b 0a55 e068 4ad3 01d0 0259 4845 8028 b..U.hJ....YHE.(
0000540: 0999 5cbe e2ac cca4 6a31 bbc2 b2b6 e520 ..\.....j1.....
0000550: ce7e 86fb d4e3 cdf8 f7c2 b76a 14ad 62ff .~.........j..b.
0000560: aec2 776a f4cf f46f 99ee cfc4 6a8b 7682 ..wj...o....j.v.
0000570: 6270 af16 1576 8bbe 39b1 56c9 81f1 218d bp...v..9.V...!.
0000580: 3277 1b3b 62de 1ca2 37b4 d218 a706 51f2 2w.;b...7.....Q.
0000590: a680 bd8d 7f05 2b35 1882 dea4 7607 d0d1 ......+5....v...
00005a0: c885 770e 91d3 4d92 ae90 bb18 9e8d 15bd ..w...M.........
00005b0: 3154 b266 1c94 bc80 de89 1f50 a5a8 83b6 1T.f.......P....
00005c0: 9c0e 3dc6 21b5 d391 f2d9 0929 a4b0 82d4 ..=.!......)....

slide-23
SLIDE 23

slide-24
SLIDE 24

slide-25
SLIDE 25

slide-26
SLIDE 26

slide-27
SLIDE 27

slide-28
SLIDE 28

SSDs, RDMA: Random Access is OK!?…

slide-29
SLIDE 29

… is still true

slide-30
SLIDE 30

Striding still works well

slide-31
SLIDE 31

Striding still works well + more patterns

slide-32
SLIDE 32

Random Access incurs a penalty

slide-33
SLIDE 33

Random Access incurs a PENALTY

slide-34
SLIDE 34

Random Access

  • 10%*, -10x, -100x
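
The penalty for unpredictable access can be observed even in plain memory. The following is a minimal sketch, not a rigorous benchmark: it sums the same array once front to back and once in a shuffled order. The class name, array size, and seed are illustrative assumptions, and the measured ratio will vary by machine and JVM warm-up.

```java
import java.util.Random;

final class AccessPatterns {
    // Sum the array by walking it front to back (predictable, prefetcher-friendly).
    static long sumSequential(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Sum the same elements, but visit them in the order given by 'order'.
    static long sumInOrder(int[] data, int[] order) {
        long sum = 0;
        for (int index : order) {
            sum += data[index];
        }
        return sum;
    }

    // Build a shuffled permutation of 0..n-1 (Fisher-Yates) to simulate random access.
    static int[] shuffledIndices(int n, long seed) {
        int[] order = new int[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Random random = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = random.nextInt(i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        return order;
    }

    public static void main(String[] args) {
        int n = 1 << 22; // 4M ints = 16 MB (assumed size, larger than many caches)
        int[] data = new int[n];
        for (int i = 0; i < n; i++) {
            data[i] = i & 0xFF;
        }
        int[] order = shuffledIndices(n, 42L);

        long t0 = System.nanoTime();
        long a = sumSequential(data);
        long t1 = System.nanoTime();
        long b = sumInOrder(data, order);
        long t2 = System.nanoTime();

        System.out.println("sequential ns: " + (t1 - t0) + " sum=" + a);
        System.out.println("shuffled   ns: " + (t2 - t1) + " sum=" + b);
    }
}
```

Both loops do the same arithmetic; only the order of memory accesses differs.
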
slide-35
SLIDE 35

Streaming Read/Write still true

slide-36
SLIDE 36

Except when it isn’t

slide-37
SLIDE 37

Synchronous Read/Write never really was true

slide-38
SLIDE 38

[Incorrect] Assumption: “Oh… you’re doing I/O, you don’t care about being fast”

slide-39
SLIDE 39

  • Scheduling
  • Jitter
  • Locks

slide-40
SLIDE 40

LOCKS!!!

slide-41
SLIDE 41

It’s more likely you are blocked on locks than on the I/O device itself

slide-42
SLIDE 42

Most I/O is so fast that the price of locking can overshadow it

slide-43
SLIDE 43

But it’s not just locking…

slide-44
SLIDE 44

  • Data Formats (binary?)
  • Algorithms
  • Protocols
  • …

slide-45
SLIDE 45

It is highly doubtful that you are being held back by the network or storage

slide-46
SLIDE 46

The reason(s)

slide-47
SLIDE 47

[Chart: accumulated improvement over time for Network Bandwidth, Response Time, Storage Capacity, CPU Cores, and Memory Capacity]

slide-48
SLIDE 48

  • The OS has locks
  • The runtime has locks*
  • Algorithms have coherence**

slide-49
SLIDE 49

Algorithms Matter

slide-50
SLIDE 50

Configuration that Outperforms a Single Thread

SSD + 1 thread of goodness > 128 cores of so-so

http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
http://blog.acolyer.org/2015/06/05/scalability-but-at-what-cost/

slide-51
SLIDE 51

You can’t escape the Math

slide-52
SLIDE 52

"AmdahlsLaw" by Daniels220 at English Wikipedia - Own work based on: File:AmdahlsLaw.png. Licensed under CC BY-SA 3.0 via Wikimedia Commons

slide-53
SLIDE 53

Contention isn’t the biggest enemy

slide-54
SLIDE 54

Coherence is!

slide-55
SLIDE 55

Universal Scalability Law

[Chart: Speedup (0 to 20) versus Processors (1 to 1024), comparing Amdahl with USL]
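
For reference, both curves can be computed from their standard formulas. This is a sketch under stated assumptions: the contention fraction `sigma = 0.05` and coherence coefficient `kappa = 0.0001` are illustrative values, not numbers taken from the slides. Amdahl's Law is the USL with the coherence term set to zero.

```java
final class ScalabilityLaws {
    // Amdahl's Law: speedup on n processors when a fraction 'serial'
    // of the work cannot be parallelized.
    static double amdahl(int n, double serial) {
        return n / (1.0 + serial * (n - 1));
    }

    // Universal Scalability Law: Amdahl's contention term (sigma) plus a
    // coherence term (kappa) that grows with the number of processor pairs.
    static double usl(int n, double sigma, double kappa) {
        return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1));
    }

    public static void main(String[] args) {
        double sigma = 0.05;   // assumed contention (serial) fraction
        double kappa = 0.0001; // assumed coherence coefficient
        for (int n = 1; n <= 1024; n *= 2) {
            System.out.printf("%4d  amdahl=%6.2f  usl=%6.2f%n",
                    n, amdahl(n, sigma), usl(n, sigma, kappa));
        }
    }
}
```

With a non-zero coherence term the USL curve peaks and then falls as processors are added, while the Amdahl curve only flattens toward `1 / sigma`.
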

slide-56
SLIDE 56

Also Coherence traffic eats up bandwidth

slide-57
SLIDE 57

Defeating Contention: Smart Batching (Natural Batching)

http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html
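
The idea of smart batching can be sketched in a few lines: a single thread waits for the first message, then takes whatever else has queued up in the meantime and issues one write for the whole batch. The class and method names below are hypothetical, the `writes` list stands in for a real resource, and `LinkedBlockingQueue` is used only for brevity (it uses locks internally, so a production design would more likely use a lock-free queue).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class SmartBatcher {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final List<String> writes = new ArrayList<>(); // stand-in for the real resource

    // Called by any number of producer threads.
    void submit(String message) {
        pending.add(message);
    }

    // Called by the single batching thread. Blocks until at least one message
    // is waiting, then takes everything else already queued (up to maxBatch)
    // and performs one write for the whole batch. Returns the batch size.
    int writeNextBatch(int maxBatch) throws InterruptedException {
        List<String> batch = new ArrayList<>();
        batch.add(pending.take());
        pending.drainTo(batch, maxBatch - 1);
        writes.add(String.join("", batch)); // one "I/O" operation per batch
        return batch.size();
    }

    List<String> writes() {
        return writes;
    }

    public static void main(String[] args) throws InterruptedException {
        SmartBatcher batcher = new SmartBatcher();
        batcher.submit("a");
        batcher.submit("b");
        batcher.submit("c");
        int size = batcher.writeNextBatch(64);
        System.out.println(size + " messages in one write: " + batcher.writes());
    }
}
```

Under light load each batch holds one message, so latency stays low; under heavy load batches grow on their own and the per-write cost is amortized.
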

slide-58
SLIDE 58

[Chart: accumulated improvement over time for Network Bandwidth, Response Time, Storage Capacity, CPU Cores, and Memory Capacity]

slide-59
SLIDE 59

[Chart: accumulated improvement over time for Network Bandwidth, Response Time, Storage Capacity, CPU Cores, and Memory Capacity]

Batching…

slide-60
SLIDE 60

Resource

slide-61
SLIDE 61

Resource Ring Buffer

slide-62
SLIDE 62

Batching Thread, Resource

Pull off as much waiting data as possible

slide-63
SLIDE 63

  • Single Writer Principle
  • Avoid Resource Contention
  • Batching only when needed
  • Rate Decoupling
  • Back Pressure
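
A single-producer, single-consumer ring buffer is one way to get several of these properties at once. The sketch below is an illustrative assumption, not code from Aeron: each counter has exactly one writer, no locks are taken, and `offer` returning `false` on a full buffer is the back-pressure signal to the producer.

```java
import java.util.concurrent.atomic.AtomicLong;

final class SpscRingBuffer<T> {
    private final Object[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read; written only by the consumer
    private final AtomicLong tail = new AtomicLong(); // next slot to write; written only by the producer

    SpscRingBuffer(int capacityPowerOfTwo) {
        if (Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Producer thread only. Returns false when the buffer is full so the
    // caller can slow down, retry, or drop (back pressure).
    boolean offer(T item) {
        long t = tail.get();
        if (t - head.get() == slots.length) {
            return false;
        }
        slots[(int) (t & mask)] = item;
        tail.set(t + 1); // volatile write publishes the slot to the consumer
        return true;
    }

    // Consumer thread only. Returns null when nothing is waiting.
    @SuppressWarnings("unchecked")
    T poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null;
        }
        int index = (int) (h & mask);
        T item = (T) slots[index];
        slots[index] = null; // release the reference before freeing the slot
        head.set(h + 1);     // volatile write hands the slot back to the producer
        return item;
    }
}
```

The buffer also decouples rates: the producer and consumer only meet at the two counters, so a slow consumer shows up as a full buffer rather than as a blocked lock.
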

slide-64
SLIDE 64

Reading

slide-65
SLIDE 65

sendfile / slice / transferTo
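
In Java, `FileChannel.transferTo` is the call that can map onto a `sendfile`-style kernel transfer where the platform supports it. A minimal sketch, with a hypothetical class and method name:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ZeroCopySend {
    // Copies the whole file to 'target' without pulling the bytes through a
    // user-space buffer. Assumes 'target' is a blocking channel; with a
    // non-blocking target, transferTo may return 0 and this loop would spin.
    static long sendFile(Path source, WritableByteChannel target) throws IOException {
        try (FileChannel file = FileChannel.open(source, StandardOpenOption.READ)) {
            long position = 0;
            long size = file.size();
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop.
                position += file.transferTo(position, size - position, target);
            }
            return position;
        }
    }
}
```

The target can be a `SocketChannel` or another `FileChannel`; whether the transfer is truly zero-copy depends on the operating system and the channel types involved.
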

slide-66
SLIDE 66

  • Read in (multiple) page size chunks
  • Reduce kernel calls
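
A sketch of reading in page-multiple chunks follows. The 4096-byte page size is an assumption (common, but not universal), and the class name and return shape are hypothetical. A larger `pagesPerRead` means fewer `read` calls for the same file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChunkedReader {
    static final int PAGE_SIZE = 4096; // assumed page size; check your platform

    // Reads the whole file using a buffer of pagesPerRead pages.
    // Returns { total bytes read, number of read calls that returned data }.
    static long[] readAll(Path path, int pagesPerRead) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(PAGE_SIZE * pagesPerRead);
        long bytes = 0;
        long calls = 0;
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            while (true) {
                buffer.clear();
                int n = channel.read(buffer);
                if (n < 0) {
                    break; // end of file
                }
                calls++;
                bytes += n;
                // process the first n bytes of 'buffer' here
            }
        }
        return new long[] {bytes, calls};
    }
}
```
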

slide-67
SLIDE 67

Async I/O
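
As one illustration in Java, `AsynchronousFileChannel` starts a read and returns immediately with a `Future`, leaving the calling thread free until the result is needed. The class and method names are hypothetical, and this sketch reads only from offset 0.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class AsyncRead {
    // Starts a read at offset 0 and returns the number of bytes read into 'into'.
    static int readStart(Path path, ByteBuffer into)
            throws IOException, InterruptedException, ExecutionException {
        try (AsynchronousFileChannel channel =
                     AsynchronousFileChannel.open(path, StandardOpenOption.READ)) {
            Future<Integer> pending = channel.read(into, 0L); // returns immediately
            // ... the calling thread could do other useful work here ...
            return pending.get(); // wait only when the result is needed
        }
    }
}
```
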

slide-68
SLIDE 68

The cost of locks

slide-69
SLIDE 69

DatagramChannelImpl

http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8-b132/sun/nio/ch/DatagramChannelImpl.java#DatagramChannelImpl

public int write(ByteBuffer buf) {
    synchronized (writeLock) {
        synchronized (stateLock) { … }
        …
    }
}

public int read(ByteBuffer buf) {
    synchronized (readLock) {
        synchronized (stateLock) { … }
        …
    }
}

send & receive are similar

slide-70
SLIDE 70

Biased Locking: same thread constructing, reading, & writing = 1+ microsecond

slide-71
SLIDE 71

Freedom! Lock-Free, Wait-Free

http://en.wikipedia.org/wiki/Non-blocking_algorithm

slide-72
SLIDE 72
slide-73
SLIDE 73

Words Matter

slide-74
SLIDE 74

Obstruction-Freedom

Partially completed operations aborted & changes made rolled back

slide-75
SLIDE 75

Lock-Freedom

Individual thread may starve, but guaranteed system-wide throughput

Lock-Free is Obstruction-Free

slide-76
SLIDE 76

Wait-Freedom

Starvation free and guaranteed system-wide throughput

Wait-Free is Lock-Free
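
The difference between the last two guarantees can be seen with Java atomics. In this sketch (class and method names are hypothetical), `recordMax` is lock-free: a thread may retry its compare-and-set indefinitely, but every failed attempt means some other thread succeeded. `nextSequence` is wait-free only on the assumption that the platform implements `getAndIncrement` as a single atomic fetch-and-add instruction, as x86 does; elsewhere it may itself be a retry loop.

```java
import java.util.concurrent.atomic.AtomicLong;

final class ProgressGuarantees {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);
    private final AtomicLong sequence = new AtomicLong();

    // Lock-free: retries until its CAS wins or the sample is no longer the
    // maximum. An unlucky thread can starve, but the system as a whole
    // always makes progress.
    void recordMax(long sample) {
        long current;
        do {
            current = max.get();
            if (sample <= current) {
                return;
            }
        } while (!max.compareAndSet(current, sample));
    }

    // Wait-free when backed by a hardware fetch-and-add (an assumption about
    // the platform): every caller completes in a bounded number of steps.
    long nextSequence() {
        return sequence.getAndIncrement();
    }

    long max() {
        return max.get();
    }
}
```
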

slide-77
SLIDE 77

These properties are awesome! Who wouldn’t want them?

slide-78
SLIDE 78

System-wide properties start at the lowest level

slide-79
SLIDE 79

Essence: just because we could take an action right now doesn’t mean we should

slide-80
SLIDE 80
slide-81
SLIDE 81

Case Study: Aeron

https://github.com/real-logic/Aeron

slide-82
SLIDE 82

Append-only Data Structures

slide-83
SLIDE 83

Header Message

Log

slide-84
SLIDE 84

Header Message Header Message Header Message

Log

slide-85
SLIDE 85

Efficiently Replicating an Append-only Log

slide-86
SLIDE 86

What If…? The Data Structure could be directly sent to the “network” and saved to “storage”?

slide-87
SLIDE 87

Header Message

slide-88
SLIDE 88

Header Message

Position in Log, Length

slide-89
SLIDE 89

Header Message

Position in Log, Length, Version/Flags, Type, etc.

+
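
To make the header-plus-message idea concrete, here is a sketch of an append-only log in which every frame carries its own header. The 16-byte layout (length, version, flags, type, position) is a hypothetical illustration built from the field names on the slide; it is not Aeron's actual wire format. Because each frame is self-describing, a contiguous region of the log can be handed unchanged to a channel write for the network or for storage.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class FramedLog {
    // Hypothetical header layout, 16 bytes in total:
    // int frameLength | byte version | byte flags | short type | long positionInLog
    static final int HEADER_LENGTH = 16;

    private final ByteBuffer log;

    FramedLog(int capacity) {
        log = ByteBuffer.allocateDirect(capacity);
    }

    // Appends header + message at the end of the log.
    // Returns the frame's position, or -1 if the log has no room left.
    long append(short type, byte[] message) {
        int frameLength = HEADER_LENGTH + message.length;
        if (log.remaining() < frameLength) {
            return -1;
        }
        int position = log.position();
        log.putInt(frameLength);
        log.put((byte) 1);     // version (assumed value)
        log.put((byte) 0);     // flags (unused in this sketch)
        log.putShort(type);
        log.putLong(position); // position of this frame in the log
        log.put(message);
        return position;
    }

    // Read-only view of log bytes [from, to), suitable for passing directly
    // to a channel write without re-encoding.
    ByteBuffer slice(int from, int to) {
        ByteBuffer view = log.asReadOnlyBuffer();
        view.position(from);
        view.limit(to);
        return view.slice();
    }

    public static void main(String[] args) {
        FramedLog framedLog = new FramedLog(1024);
        long first = framedLog.append((short) 1, "hello".getBytes(StandardCharsets.US_ASCII));
        long second = framedLog.append((short) 1, "world!".getBytes(StandardCharsets.US_ASCII));
        System.out.println("frames at positions " + first + " and " + second);
    }
}
```
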

slide-90
SLIDE 90

Header Message

Fragment 0

slide-91
SLIDE 91

Header Message Header Message

Fragment 0

slide-92
SLIDE 92

Header Message Header Message Header Message Header Message

Fragment 0

slide-93
SLIDE 93

Header Message Header Message Header Message Header Message

Fragment 0 Fragment 1

slide-94
SLIDE 94

Header Message Header Message Header Message Header Message Header Message Header Message

Fragment 0 Fragment 1

slide-95
SLIDE 95

Takeaways

slide-96
SLIDE 96

We are losing 30% memory bandwidth*…

Oh &^%(&^!

Observed by Martin Thompson

slide-97
SLIDE 97

  • Stream over Data
  • Predictable Access
  • Batching
  • Algorithms
  • Avoid contention
  • Avoid coherence*

slide-98
SLIDE 98

@toddlmontgomery

Questions?

  • Aeron https://github.com/real-logic/Aeron
  • Twitter @toddlmontgomery

Thank You!

StoneTor