How to do 100K+ TPS at less than 1ms latency Martin Thompson & Michael Barker QCon SF 2010

Agenda
• Context setting
• Tips for high performance computing (HPC)
• What is possible on a single thread?
• New pattern for contended HPC
• Q & A

Who/What is LMAX?
• The London Multi-Asset Exchange
• Spin-off from Betfair into retail finance
• Access to the wholesale financial markets on equal terms for retail traders
• We aim to build the highest performance financial exchange in the world

What is Extreme Transaction Processing (XTP)? The Betfair Experience
[Diagram label: The Internet]

What is Extreme Transaction Processing (XTP)? The LMAX Model
[Diagram labels: The Internet, Risky!, Latency, GBP / USD Spread]

How not to solve this problem
[Diagram: RDBMS, J2EE, SEDA, Actor and Rails, each crossed out]

Phasers or Disruptors?

Tips for high performance computing
1. Show good “Mechanical Sympathy”
2. Keep the working set in-memory
3. Write cache friendly code
4. Write clean compact code
5. Invest in modelling your domain
6. Take the right approach to concurrency

1. Mechanical Sympathy – 1 of 2
Memory
• Latency not significantly changed
• Massive bandwidth increase
• 144GB in a commodity machine
CPUs
• The GHz race is over
• Multi core
• Bigger smarter caches

1. Mechanical Sympathy – 2 of 2
Networks
• Sub 10 microseconds for local hop
• Wide area bandwidth is cheap
• 10GigE is now a commodity
• Multi-cast is getting traction
Storage
• Disk is the new tape! Fast for sequential access
• SSDs for random threaded access
• PCI-e connected storage

2. Keep the working set in-memory
Does it feel awkward working with data remote from your address space?
• Keep data and behaviour co-located
• Affords rich interaction at low latency
• Enabled by 64-bit addressing

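3. Write cache friendly code
[Diagram: memory hierarchy across two sockets, cores C1 to C4 on each, with approximate access latencies]
• Registers: <1ns
• L1 (one per core): ~4 cycles, ~1ns
• L2 (one per core): ~10 cycles, ~3ns
• L3 (one per socket): ~42 cycles, ~15ns
• QPI (link between sockets): ~20ns
• DRAM (behind each memory controller, MC): ~65ns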
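The latency gap between L1 and DRAM is why traversal order matters. The following is a minimal sketch, not taken from the talk: it sums the same matrix in row order (adjacent reads within one row array) and in column order (each read lands in a different row array). The class and method names are hypothetical, and the timing it prints depends entirely on the machine and JVM, so treat it as an illustration rather than a benchmark.

```java
import java.util.Arrays;

// Hypothetical illustration: same work, different memory access pattern.
final class TraversalOrder {

    // Row by row: consecutive reads touch adjacent elements of one row array.
    static long sumRowMajor(int[][] m) {
        long sum = 0;
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[r].length; c++) {
                sum += m[r][c];
            }
        }
        return sum;
    }

    // Column by column: every read jumps to a different row array.
    // Assumes a rectangular matrix (all rows the same length).
    static long sumColumnMajor(int[][] m) {
        long sum = 0;
        int cols = m.length == 0 ? 0 : m[0].length;
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < m.length; r++) {
                sum += m[r][c];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 2048;
        int[][] m = new int[n][n];
        for (int[] row : m) {
            Arrays.fill(row, 1);
        }
        long t0 = System.nanoTime();
        long a = sumRowMajor(m);
        long t1 = System.nanoTime();
        long b = sumColumnMajor(m);
        long t2 = System.nanoTime();
        System.out.printf("row-major sum=%d in %d us%n", a, (t1 - t0) / 1_000);
        System.out.printf("column-major sum=%d in %d us%n", b, (t2 - t1) / 1_000);
    }
}
```

Both methods return the same sum; only the order in which memory is touched differs. A single untimed warm-up run like this is crude, so a proper harness would be needed before drawing conclusions from the numbers.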

4. Write clean compact code
"Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius -- and a lot of courage -- to move in the opposite direction."
• Hotspot likes small compact methods
• CPU pipelines stall if they cannot predict branches
• If your code is complex you do not properly understand the problem domain
• Nothing in the world is truly complex other than Tax Law

5. Invest in modelling your domain
[Diagram: model of an elephant based on blind men touching one part each. Labels: Elephant, Wall, Rope, Snake, TreeTrunk; relationships: "like a", "attached", "supported by"]
• Single responsibility – one class one thing, one method one thing, etc.
• Know your data structures and cardinality of relationships
• Let the relationships do the work

6. Take the right approach to concurrency
Concurrent programming is about 2 things:
• Mutual exclusion: protect access to contended resources
• Visibility of changes: make the result public in the correct order
Locks
• Context switch to the kernel
• Can always make progress
• Difficult to get right
Atomic/CAS instructions
• Atomic read-modify-write primitives
• Happen in user space
• Very difficult to get right!
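The two approaches can be compared on the simplest contended resource, a shared counter. This is a sketch under stated assumptions, not code from the talk: `LockedCounter`, `CasCounter` and `runConcurrently` are hypothetical names, the lock variant uses Java's `synchronized` (which may only involve the kernel when the lock is actually contended), and the CAS variant spells out the retry loop that `AtomicLong.incrementAndGet()` would normally hide.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of the two mutual-exclusion styles on a counter.
final class Counters {

    // Lock-based: the monitor gives both mutual exclusion and visibility.
    static final class LockedCounter {
        private long value;

        synchronized long increment() {
            return ++value;
        }

        synchronized long get() {
            return value;
        }
    }

    // CAS-based: read, compute, then publish only if nobody else got in first.
    static final class CasCounter {
        private final AtomicLong value = new AtomicLong();

        long increment() {
            while (true) {
                long current = value.get();
                long next = current + 1;
                if (value.compareAndSet(current, next)) {
                    return next; // our update won
                }
                // Another thread changed the value; retry with a fresh read.
            }
        }

        long get() {
            return value.get();
        }
    }

    // Runs the same task on several threads and waits for all of them.
    static void runConcurrently(int threads, Runnable task) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(task);
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        LockedCounter locked = new LockedCounter();
        CasCounter cas = new CasCounter();
        runConcurrently(4, () -> {
            for (int i = 0; i < 100_000; i++) {
                locked.increment();
                cas.increment();
            }
        });
        System.out.println(locked.get() + " " + cas.get());
    }
}
```

Both counters end at the same total. The CAS loop never blocks, but an individual thread can lose the race repeatedly under heavy contention.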

What is possible when you get this stuff right?
On a single thread you have ~3 billion instructions per second to play with:
10K+ TPS
• If you don’t do anything too stupid
100K+ TPS
• With well organised clean code and standard libraries
1m+ TPS
• With custom cache friendly collections
• Good performance tests
• Controlled garbage creation
• Very well modelled domain
BTW writing good performance tests is often harder than the target code!
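The three tiers translate directly into a per-transaction instruction budget. The short calculation below is a sketch that takes the slide's ~3 billion instructions per second at face value and ignores stalls from cache misses and branch mispredictions; the class and method names are hypothetical.

```java
// Hypothetical helper: how many instructions one transaction may spend.
final class InstructionBudget {

    static long instructionsPerTransaction(long instructionsPerSecond, long tps) {
        return instructionsPerSecond / tps;
    }

    public static void main(String[] args) {
        long ips = 3_000_000_000L; // the slide's approximate single-thread figure
        for (long tps : new long[] {10_000L, 100_000L, 1_000_000L}) {
            System.out.println(tps + " TPS -> "
                    + instructionsPerTransaction(ips, tps) + " instructions each");
        }
        // prints:
        // 10000 TPS -> 300000 instructions each
        // 100000 TPS -> 30000 instructions each
        // 1000000 TPS -> 3000 instructions each
    }
}
```

At 1m+ TPS the budget is about 3,000 instructions. Using the ~65ns DRAM figure quoted earlier in the deck, 65ns at ~3 instructions per nanosecond is roughly 200 instructions' worth of time, so a dozen or so cache misses would consume most of that budget.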

How to address the other non-functional concerns?
• With a very fast business logic thread we need to feed it reliably
> Did we trick you into thinking we can avoid concurrent programming?
[Diagram: pipelined process. Stages: Receiver, Un-Marshaller, Journaller, Replicator, Business Logic, Marshaller, Publisher. External endpoints: Network, File System, HA / DR Node, Network / Archive DB]
Each stage can have multiple threads

Concurrent access to Queues – The Issues
Linked list backed
[Diagram: Head and Tail pointers into a chain of Nodes, plus a size field]
• Hard to limit size
• O(n) access times if not head or tail
• Generates garbage which can be significant
Array backed
[Diagram: Head, Tail and size fields sharing a cache line]
• Cannot resize easily
• Difficult to get *P *C correct
• O(1) access times for any slot and cache friendly
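The array-backed trade-offs are easiest to see in code. The following is a minimal single-threaded sketch with hypothetical names: it deliberately leaves out the concurrency that the slide calls difficult, and only shows the fixed capacity, the O(1) slot addressing, and the absence of per-element node allocation.

```java
// Hypothetical illustration: bounded, array-backed FIFO. NOT thread-safe.
final class BoundedArrayQueue<T> {
    private final Object[] slots;
    private final int mask;   // capacity - 1, valid because capacity is a power of two
    private long head;        // sequence of the next element to read
    private long tail;        // sequence of the next element to write

    BoundedArrayQueue(int capacityPowerOfTwo) {
        if (capacityPowerOfTwo <= 0 || Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Returns false instead of growing: the size limit is built in.
    boolean offer(T item) {
        if (tail - head == slots.length) {
            return false;
        }
        slots[(int) (tail & mask)] = item; // O(1): sequence maps straight to a slot
        tail++;
        return true;
    }

    // Returns null when the queue is empty.
    @SuppressWarnings("unchecked")
    T poll() {
        if (head == tail) {
            return null;
        }
        int index = (int) (head & mask);
        T item = (T) slots[index];
        slots[index] = null; // drop the reference so the element can be collected
        head++;
        return item;
    }

    int size() {
        return (int) (tail - head);
    }

    public static void main(String[] args) {
        BoundedArrayQueue<String> queue = new BoundedArrayQueue<>(2);
        System.out.println(queue.offer("a")); // true
        System.out.println(queue.offer("b")); // true
        System.out.println(queue.offer("c")); // false, queue is full
        System.out.println(queue.poll());     // a
    }
}
```

The power-of-two capacity is a design choice that lets a bit mask replace the modulo operation when mapping an ever-increasing sequence number to a slot.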

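Disruptor
[Diagram: two ring buffers of Message entries, each with numbered slots (1 to n) and a :sequence. One sits between the Network and the Business Logic, the other between the Business Logic and the Network / Archive DB. Stages shown: Receiver, Un-Marshaller, Journaller, Replicator, Invoke Stage, Business Logic, Marshaller, Publisher. Stages call long waitFor(n) on a buffer, and a :MIN of the stage sequences is tracked. Example sequence values shown: 103, 102, 101, 97]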
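The core idea of coordinating through sequence numbers rather than locks can be sketched in a few lines. This is an assumption-laden illustration, not the LMAX implementation or its API: it supports exactly one producer and one consumer, busy-spins instead of offering wait strategies, and every name except `waitFor(n)` (which appears on the slide) is hypothetical. `Thread.onSpinWait()` requires Java 9 or later.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical single-producer, single-consumer ring coordinated by sequences.
final class SequencedRing {
    private final long[] slots;
    private final int mask;
    private final AtomicLong published = new AtomicLong(-1); // last sequence written
    private final AtomicLong consumed = new AtomicLong(-1);  // last sequence read

    SequencedRing(int capacityPowerOfTwo) {
        if (capacityPowerOfTwo <= 0 || Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Producer side. Must only ever be called from one thread.
    void publish(long value) {
        long next = published.get() + 1;
        // Do not overwrite a slot the consumer has not finished with.
        while (next - consumed.get() > slots.length) {
            Thread.onSpinWait();
        }
        slots[(int) (next & mask)] = value;
        published.set(next); // volatile write: makes the slot contents visible
    }

    // Consumer side: spin until sequence n is available, then return the
    // highest published sequence so the caller can process a batch.
    long waitFor(long n) {
        long available;
        while ((available = published.get()) < n) {
            Thread.onSpinWait();
        }
        return available;
    }

    long read(long sequence) {
        return slots[(int) (sequence & mask)];
    }

    // Tells the producer that slots up to this sequence may be reused.
    void markConsumed(long sequence) {
        consumed.set(sequence);
    }

    // Demo: one thread publishes 0, 2, 4, ... while the caller sums them.
    static long sumThroughRing(int capacity, int count) {
        SequencedRing ring = new SequencedRing(capacity);
        Thread producer = new Thread(() -> {
            for (long i = 0; i < count; i++) {
                ring.publish(i * 2);
            }
        });
        producer.start();
        long sum = 0;
        long next = 0;
        while (next < count) {
            long available = ring.waitFor(next);
            for (; next <= available; next++) {
                sum += ring.read(next);
            }
            ring.markConsumed(available);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumThroughRing(8, 1000)); // prints 999000
    }
}
```

Because `waitFor` returns the highest available sequence, a consumer that has fallen behind catches up by processing a whole batch per call rather than paying the coordination cost for every message.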

Quick Recap
• Most developers have an incorrect view of hardware and what can be achieved on a single thread
• On modern processors a cache miss is your biggest cost
• Push concurrency into the infrastructure, and make it REALLY fast
• Once you have this, you have the world that OO programmers dream of:
> Single threaded
> All in-memory
> Elegant model
> Testable code
> No infrastructure or integration worries

Wrap up Q & A jobs@lmax.com
