Low Latency Trading Architecture
Sam Adams, QCon London, March 2017



SLIDE 1

Low Latency Trading Architecture

Sam Adams QCon London, March, 2017

SLIDE 2

(chart: GBP/USD price with "buy", "sell" and "don't panic" annotations)

SLIDE 3

Typical day:

  • 1,000's active clients
  • 100,000's trades occur
  • 100,000,000's orders placed – very bursty: spikes of 100s / ms
  • 1,000,000,000's market data updates sent

SLIDE 4

End-to-end latency:

  • 50%: 80 µs
  • 99%: 150 µs
  • 99.99%: 500 µs
  • Max: 4 ms (*)

SLIDE 5

System Architecture Building low latency applications

SLIDE 6

Instructions / Execution reports / Market data

* latency sensitive * * throughput matters *

SLIDE 7

The Disruptor

SLIDE 8

Producer → Consumer

High performance inter-thread messaging

SLIDE 9

public class ArrayBlockingQueue<E> {
    final Object[] items;
    int takeIndex;
    int putIndex;
    int count;
    /** Main lock guarding all access */
    final ReentrantLock lock;
}

ArrayBlockingQueue vs Disruptor

locking & contention

SLIDE 10

public class ArrayBlockingQueue<E> {
    final Object[] items;
    int takeIndex;
    int putIndex;
    int count;
    /** Main lock guarding all access */
    final ReentrantLock lock;
}

public class RingBuffer<E> implements DataProvider<E> {
    // ...
    final long indexMask;
    final Object[] entries;
    final Sequence cursor;
    // ...
}

public class BatchEventProcessor<E> {
    final DataProvider<E> dataProvider;
    final Sequence sequence;
}

ArrayBlockingQueue vs Disruptor

locking & contention vs single writers

SLIDE 11

Producer: claimed -1, published -1; Consumer: consumed -1, waiting for 0

SLIDE 12

Producer claims slot 0. Producer: claimed 0, published -1; Consumer: consumed -1, waiting for 0

SLIDE 13

Producer publishes slot 0. Producer: claimed 0, published 0; Consumer: consumed -1, waiting for 0

SLIDE 14

Producer: claimed 0, published 0; Consumer: consumed -1, available 0, processing 0

SLIDE 15

Producer: claimed 0, published 0; Consumer: consumed 0, waiting for 1

SLIDE 16

Producer publishes slots 1-3. Producer: claimed 3, published 3; Consumer: consumed 0, waiting for 1

SLIDE 17

Producer: claimed 3, published 3; Consumer: consumed 0, available 3, processing 1, 2, 3

SLIDE 18

Producer: claimed 3, published 3; Consumer: consumed 3, waiting for 4

SLIDE 19

Supports dependency graphs between consumers
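The claim/publish/consume sequence walked through on the previous slides can be sketched in plain Java. This is a hypothetical, heavily simplified single-producer, single-consumer version for illustration only; it is not the real Disruptor API, and all names are invented:

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified single-producer/single-consumer ring buffer illustrating the
// claim -> publish -> consume sequence protocol (not the real Disruptor API).
final class TinyRingBuffer {
    final long[] entries;
    final int indexMask;
    final AtomicLong published = new AtomicLong(-1); // highest published slot
    long claimed = -1;                               // producer-local, single writer
    final AtomicLong consumed = new AtomicLong(-1);  // consumer progress

    TinyRingBuffer(int sizePowerOfTwo) {
        entries = new long[sizePowerOfTwo];
        indexMask = sizePowerOfTwo - 1;
    }

    // Producer: claim the next slot, waiting if it would wrap over
    // slots the consumer has not yet processed.
    long claim() {
        long next = claimed + 1;
        while (next - entries.length > consumed.get()) Thread.onSpinWait();
        claimed = next;
        return next;
    }

    // Producer: write the entry, then make it visible.
    // Single writer: no CAS needed, a volatile store is enough.
    void publish(long seq, long value) {
        entries[(int) (seq & indexMask)] = value;
        published.set(seq);
    }

    // Consumer: wait until the next sequence is published, then read it.
    long consumeNext() {
        long next = consumed.get() + 1;
        while (published.get() < next) Thread.onSpinWait();
        long value = entries[(int) (next & indexMask)];
        consumed.set(next);
        return value;
    }
}
```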

SLIDE 20

Messaging

SLIDE 21

Asynchronous Pub/Sub messaging:

  • UDP Multicast: low latency, scalable, unreliable
  • Services publish / subscribe to topics
  • topic = unique multicast group
  • Informatica UMS (aka 29 West LBM) provides * some reliability *
SLIDE 22

Asynchronous Pub/Sub messaging:

  • Push based
  • If you miss a message, it is gone
  • Late-join: no history
SLIDE 23

Javassist-generated proxies to interfaces

public interface TradingInstructions {
    void placeOrder(PlaceOrderInstruction instruction);
    void cancelOrder(CancelOrderInstruction instruction);
}

See GeneratedRingBufferProxyGenerator in disruptor-proxy for the inter-thread version: https://github.com/LMAX-Exchange/disruptor-proxy

Event: long sequence; byte operationIndex; byte[] data; int length

SLIDE 24

See GeneratedRingBufferProxyGenerator in disruptor-proxy for the inter-thread version: https://github.com/LMAX-Exchange/disruptor-proxy

public void placeOrder(PlaceOrderInstruction arg0) {
    // ...
    event.initialise(sequence, 1); // operation index
    marshaller.encode(arg0, event.outputStream());
    // ...
}

Publisher proxy:

Event: long sequence; byte operationIndex; byte[] data; int length

SLIDE 25

See GeneratedRingBufferProxyGenerator in disruptor-proxy for the inter-thread version: https://github.com/LMAX-Exchange/disruptor-proxy

Invoker invokers[];
TradingInstructions implementation;

public void onEvent(Event event) {
    Invoker invoker = invokers[event.getOperationIndex()];
    invoker.invoke(event.getInputStream(), implementation);
}

public void invoke(InputStream input, TradingInstructions implementation) {
    PlaceOrderInstruction arg0 = marshaller.decode(input);
    implementation.placeOrder(arg0);
}

Subscriber proxy:

Event: long sequence; byte operationIndex; byte[] data; int length
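The publisher-proxy idea can be mimicked with a JDK dynamic proxy. A hedged sketch: the interface below uses String payloads instead of the real instruction types, "encoding" is just string concatenation, and the real system generates bytecode with javassist rather than using reflection:

```java
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

interface SimpleTradingInstructions {
    void placeOrder(String instruction);
    void cancelOrder(String instruction);
}

// Hypothetical sketch: a publisher proxy that encodes each method call as an
// (operationIndex, payload) event instead of invoking an implementation.
final class PublisherProxy {
    final List<String> events = new ArrayList<>(); // stands in for the ring buffer

    SimpleTradingInstructions create() {
        return (SimpleTradingInstructions) Proxy.newProxyInstance(
                SimpleTradingInstructions.class.getClassLoader(),
                new Class<?>[] { SimpleTradingInstructions.class },
                (proxy, method, args) -> {
                    // operation index identifies which method to invoke on replay
                    int operationIndex = method.getName().equals("placeOrder") ? 1 : 2;
                    events.add(operationIndex + ":" + args[0]); // encode, don't invoke
                    return null;
                });
    }
}
```

The subscriber side does the reverse: it looks up the invoker by operation index and calls the real implementation with the decoded argument.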

SLIDE 26

Matching Engine

SLIDE 27

For speed: all working state held in memory
Remove contention: single threaded

SLIDE 28

Don’t block business logic: buffer for outbound I/O

SLIDE 29

Don’t block network thread: buffer incoming events

SLIDE 30

All state in volatile memory: Save on shutdown / Load on startup

SLIDE 31

Recover from unclean shutdown: journal incoming events to disk, replay on startup
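A minimal sketch of the journal-then-replay idea, with an in-memory list standing in for the on-disk journal (all names hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of journal-then-process: every inbound event is
// appended to a durable log before the business logic sees it, so an
// unclean shutdown can be recovered by replaying the journal.
final class Journal {
    final List<String> log = new ArrayList<>(); // stands in for a disk file

    void process(String event, Consumer<String> businessLogic) {
        log.add(event);             // journal first (fsync'd in a real system)
        businessLogic.accept(event);
    }

    void replay(Consumer<String> businessLogic) {
        log.forEach(businessLogic); // rebuild state after an unclean shutdown
    }
}
```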

SLIDE 32

Replicate events to a hot standby for resiliency. Manual fail-over (also to offsite DR).

SLIDE 33

Holding all your state in memory

  • No database
  • No roll-back
  • Up-front validation is critical
  • Never throw exceptions – the result is inconsistent state
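With no database and no roll-back, every instruction must be fully validated before any state is touched. A hypothetical sketch of the pattern:

```java
// Hypothetical sketch: validate every precondition before touching state,
// so a rejected instruction leaves the in-memory state unchanged.
final class Account {
    long balance;

    Account(long balance) { this.balance = balance; }

    // Returns false instead of throwing: no partial mutation, no roll-back.
    boolean transfer(Account to, long amount) {
        if (amount <= 0) return false;      // validate...
        if (balance < amount) return false; // ...all preconditions first
        balance -= amount;                  // only then mutate
        to.balance += amount;
        return true;
    }
}
```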
SLIDE 34

System must be deterministic

  • All operations event sourced
  • time sourced from events
  • collections must be ordered
  • no local configuration
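A hedged sketch of what "deterministic" means in practice (hypothetical names): time comes from the event, never from a local clock, and collections iterate in a defined order, so a replay produces identical state:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: all inputs, including time, come from the event,
// and collections iterate in a defined order, so replaying the same events
// on a standby produces identical state.
final class DeterministicService {
    // TreeMap, not HashMap: iteration order must not depend on hashing.
    final SortedMap<Long, String> ordersById = new TreeMap<>();
    long lastEventTimeMillis;

    void onPlaceOrder(long eventTimeMillis, long orderId, String order) {
        lastEventTimeMillis = eventTimeMillis; // time sourced from the event
        ordersById.put(orderId, order);        // never System.currentTimeMillis()
    }
}
```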

SLIDE 35

Determinism bugs are really nasty

Only an issue if we have to fail over or replay – the primary is the source of truth

SLIDE 36

Gateways

SLIDE 37

Same principles:

  • non-blocking / message passing
  • minimise shared state
SLIDE 38

Stream Processing

SLIDE 39

Matching Engine Order Book

SLIDE 40

The Matching Engine Order Book emits an event stream – Order Added, Order Cancelled, Order Added, Trade, Trade, Order Added, ... – from which the All Orders[ ] state is built

SLIDE 41

The same Matching Engine Order Book event stream feeds downstream consumers: Market Analysis (via an Order Book Image), the Event Store, and AML Alerts (via an Order Book Image)

SLIDE 42

Where latency doesn’t matter...

  • How big are the bursts?
  • Buffers are your friend
  • Does data loss matter?

SLIDE 43

More Reliable Messaging

slide-44
SLIDE 44
slide-45
SLIDE 45
slide-46
SLIDE 46
slide-47
SLIDE 47
slide-48
SLIDE 48
slide-49
SLIDE 49
slide-50
SLIDE 50
slide-51
SLIDE 51
slide-52
SLIDE 52
slide-53
SLIDE 53

Handling buffer wraps

‘better never than late’

  • reset & late join

persistent data loss

  • recover from event store
  • journal replay and gap-fill
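Detecting that a wrap or drop has happened usually comes down to watching per-topic sequence numbers. A hypothetical sketch:

```java
// Hypothetical sketch: detect a gap in per-topic sequence numbers; a gap
// means messages were lost and recovery (replay / gap-fill from the event
// store) is needed rather than delivering stale data late.
final class GapDetector {
    private long expected = 0;

    // Returns true if this is the next expected message; false means
    // sequences were skipped and the caller must trigger recovery.
    boolean onMessage(long sequence) {
        if (sequence == expected) {
            expected++;
            return true;
        }
        return false;
    }
}
```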
SLIDE 54

Low latency applications: mechanical sympathy

SLIDE 55

[sam@box ~]$ lstopo

SLIDE 56

Machine (126GB)
  NUMANode P#0 (63GB)
    Socket P#0, L3 (30MB)
      Core P#0: L2 (256KB), L1d (32KB), L1i (32KB), PU P#0, PU P#24
      Core P#1: L2 (256KB), L1d (32KB), L1i (32KB), PU P#2, PU P#26
      Core P#2: L2 (256KB), L1d (32KB), L1i (32KB), PU P#4, PU P#28
      Core P#3: L2 (256KB), L1d (32KB), L1i (32KB), PU P#6, PU P#30

Main Memory L1/L2 Caches CPU Core / Hyper Threads

SLIDE 57


CPUs are faster than memory (Intel Performance Analysis Guide):
  • L1 cache hit: 4 cycles
  • L2 cache hit: 10 cycles
  • local L3 cache hit: ~40-75 cycles
  • remote L3 cache hit: ~100-300 cycles
  • local DRAM: ~60 ns
  • remote DRAM: ~100 ns

SLIDE 58

Memory system optimised for:

  • Temporal locality
  • Spatial locality
  • Equidistant locality

SLIDE 59

Reference vs Primitives

Long[] vs long[]

SLIDE 60

public class Cash { long value; }

Calculations with money

  • double: inexact
  • BigDecimal: expensive

Fixed-point arithmetic with long. But I want type-safety...
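A hedged sketch of fixed-point money on long (simplified to a single 6dp scale for both operands; the slides use 6dp prices and 2dp quantities, and all names here are hypothetical):

```java
// Hypothetical sketch: fixed-point arithmetic on long, 6 decimal places.
// double would be inexact and BigDecimal allocates; long is exact and fast.
final class FixedPoint {
    static final long SCALE = 1_000_000L; // 6 decimal places: 1.25 -> 1_250_000L

    // 6dp * 6dp -> 6dp: multiply, then divide out one scale factor.
    // (Real code must also guard against overflow of the intermediate.)
    static long multiply(long a, long b) {
        return a * b / SCALE;
    }
}
```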

SLIDE 61

long price1 = 1250000L;
long quantity1 = 1520L;
// BUG
long price2 = quantity1;

Prices, precision 6dp: 1250000L → 1.250000
Quantities, precision 2dp: 1520L → 15.20

SLIDE 62

https://checkerframework.org/

With Type Annotations & Units Checker:

@Price long price1 = 1250000L;
@Qty long quantity1 = 1520L;
// Compilation error
@Price long price2 = quantity1;

Prices, precision 6dp: 1250000L → 1.250000
Quantities, precision 2dp: 1520L → 15.20

SLIDE 63

public class HashMap<K,V> {
    Node<K,V>[] table;
    static class Node<K,V> {
        K key;
        V value;
        Node<K,V> next;
    }
}

public class Long2ObjectOpenHashMap<V> {
    long[] keys;
    V[] values;
}

java.util vs fastutil

Map<Long,X> vs LongMap<X>
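The layout difference is the point: java.util.HashMap chases a Node reference per entry, while a fastutil-style map keeps keys in one primitive array with good spatial locality. A hypothetical, heavily simplified open-addressing sketch (no resizing, key 0 reserved as the empty marker; not fastutil's actual implementation):

```java
// Hypothetical sketch of the fastutil idea: open addressing over a
// primitive long[] key array avoids boxing and per-entry Node objects.
final class PrimitiveLongMap<V> {
    final long[] keys = new long[16];        // 0 marks an empty slot (simplified)
    final Object[] values = new Object[16];

    void put(long key, V value) {
        int i = (int) (key & (keys.length - 1));
        while (keys[i] != 0 && keys[i] != key) { // linear probing on collision
            i = (i + 1) & (keys.length - 1);
        }
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    V get(long key) {
        int i = (int) (key & (keys.length - 1));
        while (keys[i] != 0) {
            if (keys[i] == key) return (V) values[i];
            i = (i + 1) & (keys.length - 1);
        }
        return null; // empty slot reached: key absent
    }
}
```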


SLIDE 65

False sharing: revisit the Disruptor

public class ArrayBlockingQueue<E> {
    final Object[] items;
    int takeIndex;
    int putIndex;
    int count;
    /** Main lock guarding all access */
    final ReentrantLock lock;
}

SLIDE 66

public class RingBuffer {
    // ...
    final Object[] entries;
    final Sequence cursor;
    // ...
}

public class Sequence {
    long p1, p2, p3, p4, p5, p6, p7;
    long value;
    long p9, p10, p11, p12, p13, p14, p15;
}

False sharing: revisit the Disruptor

SLIDE 67

public class RingBuffer {
    // ...
    final Object[] entries;
    final Sequence cursor;
    // ...
}

public class Sequence {
    @Contended
    long value;
}

False sharing: revisit the Disruptor

Java 8 (user code needs -XX:-RestrictContended):

SLIDE 68

Removing Jitter: GC & Scheduling

SLIDE 69

GC Options:

  • Zero garbage
  • Massive heap, GC when convenient
  • Commercial JVM – Azul Zing
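"Zero garbage" in practice means preallocating and recycling the objects on the hot path, so the steady state never allocates and the collector never has work to do. A hypothetical sketch of a trivial event pool:

```java
// Hypothetical sketch of the "zero garbage" option: preallocate mutable
// events once, then recycle them, so the hot path never calls new.
final class OrderEvent {
    long price, quantity; // mutable, reused fields
}

final class EventPool {
    private final OrderEvent[] free;
    private int top;

    EventPool(int size) {
        free = new OrderEvent[size];
        for (int i = 0; i < size; i++) free[i] = new OrderEvent(); // allocate once
        top = size;
    }

    OrderEvent acquire() { return free[--top]; } // no allocation on the hot path
    void release(OrderEvent e) { free[top++] = e; }
}
```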


SLIDE 72

JVM OS

Avoiding scheduling jitter

SLIDE 73

JVM OS

Socket P#0, L3 (30MB) – each core: L2 (256KB), L1d (32KB), L1i (32KB)
  Core P#0: PU P#0, PU P#24
  Core P#1: PU P#2, PU P#26
  Core P#2: PU P#4, PU P#28
  Core P#3: PU P#6, PU P#30
  Core P#4: PU P#8, PU P#32
  Core P#5: PU P#10, PU P#34
  Core P#8: PU P#12, PU P#36
  Core P#9: PU P#14, PU P#38
  Core P#10: PU P#16, PU P#40
  Core P#11: PU P#18, PU P#42
  Core P#12: PU P#20, PU P#44
  Core P#13: PU P#22, PU P#46

SLIDE 74

JVM OS


Remove reserved CPUs from the kernel scheduler

isolcpus=0,2,4,6,8,24,26,28,30,32

SLIDE 75

JVM OS


Create CPU sets for system, application

# cset set --set=/system --cpu=18,20,...,46
# cset set --set=/app --cpu=0,2,...,40

/ /system /app

SLIDE 76

OS


Processes default to the / CPU set

/ /system /app

SLIDE 77

OS


Move all threads into /system CPU set

# cset proc --move -k --threads --force \
    --from-set=/ --to-set=/system

/ /system /app

SLIDE 78

JVM OS


Launch application in /app CPU set, taskset to run in pool

$ cset proc --exec /app \
    taskset -cp 10,12...38,40 \
    java <args>

/ /system /app

SLIDE 79

JVM OS


Move critical threads onto their own cores using JNA / JNI

sched_set_affinity(0);
sched_set_affinity(2);
...

SLIDE 80

JVM OS


SLIDE 81

Summary

SLIDE 83

round-trip a correlation ID

SLIDE 84

round-trip a correlation ID: 25 µs
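One way to measure that round trip (a hypothetical sketch, not the production probe): record a timestamp per correlation ID on send, and subtract on receive:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: stamp each outbound instruction with a correlation
// ID and record the send time; when the matching response comes back, the
// difference is the end-to-end round-trip latency.
final class LatencyProbe {
    private final Map<Long, Long> sendNanos = new HashMap<>();

    void onSend(long correlationId, long nowNanos) {
        sendNanos.put(correlationId, nowNanos);
    }

    // Returns the round-trip time in nanos, or -1 for an unknown ID.
    long onReceive(long correlationId, long nowNanos) {
        Long sent = sendNanos.remove(correlationId);
        return sent == null ? -1 : nowNanos - sent;
    }
}
```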

SLIDE 85

Thank You!

sam.adams@lmax.com https://www.lmax.com/blog/staff-blogs/

p.s. we’re hiring!

SLIDE 86

The End.