
SLIDE 1

High Performance Managed Languages

Martin Thompson - @mjpt777

SLIDE 2

Really, what is your preferred platform for building HFT applications?

SLIDE 3

SLIDE 4

Why do you build low-latency applications on a GC’ed platform?

SLIDE 5

SLIDE 6

1. Let’s set some Context
2. Runtime Optimisation
3. Garbage Collection
4. Algorithms & Design

Agenda

SLIDE 7

Some Context

SLIDE 8

Let’s be clear

A Managed Runtime is not always the best choice…

SLIDE 9

Latency Arbitrage?

SLIDE 10

SLIDE 11

Two questions…

SLIDE 12

Why build on a Managed Runtime?

SLIDE 13

Can managed languages provide good performance?

SLIDE 14

We need to follow the evidence…

SLIDE 15

SLIDE 16

Are native languages faster?

SLIDE 17

Time? Skills & Resources?

SLIDE 18

What can, or should, be outsourced?

SLIDE 19

CPU vs Memory Performance

SLIDE 20

How much time to perform an addition operation on 2 integers?

SLIDE 21

1 CPU Cycle < 1ns

SLIDE 22

Sequential Access

Average time in ns/op to sum all longs in a 1GB array?

SLIDE 23

Access Pattern Benchmark

Benchmark                Mode   Score    Error  Units
testSequential           avgt   0.832 ±  0.006  ns/op

~1 ns/op

SLIDE 24

Really??? Less than 1ns per operation?

SLIDE 25

SLIDE 26

Random walk per OS Page

Average time in ns/op to sum all longs in a 1GB array?

SLIDE 27

Access Pattern Benchmark

Benchmark                Mode   Score    Error  Units
testSequential           avgt   0.832 ±  0.006  ns/op
testRandomPage           avgt   2.703 ±  0.025  ns/op

~3 ns/op

SLIDE 28

Data dependent walk per OS Page

Average time in ns/op to sum all longs in a 1GB array?

SLIDE 29

Access Pattern Benchmark

Benchmark                Mode   Score    Error  Units
testSequential           avgt   0.832 ±  0.006  ns/op
testRandomPage           avgt   2.703 ±  0.025  ns/op
testDependentRandomPage  avgt   7.102 ±  0.326  ns/op

~7 ns/op

SLIDE 30

Random heap walk

Average time in ns/op to sum all longs in a 1GB array?

SLIDE 31

Access Pattern Benchmark

Benchmark                Mode   Score    Error  Units
testSequential           avgt   0.832 ±  0.006  ns/op
testRandomPage           avgt   2.703 ±  0.025  ns/op
testDependentRandomPage  avgt   7.102 ±  0.326  ns/op
testRandomHeap           avgt  19.896 ±  3.110  ns/op

~20 ns/op

SLIDE 32

Data dependent heap walk

Average time in ns/op to sum all longs in a 1GB array?

SLIDE 33

Access Pattern Benchmark

Benchmark                Mode   Score    Error  Units
testSequential           avgt   0.832 ±  0.006  ns/op
testRandomPage           avgt   2.703 ±  0.025  ns/op
testDependentRandomPage  avgt   7.102 ±  0.326  ns/op
testRandomHeap           avgt  19.896 ±  3.110  ns/op
testDependentRandomHeap  avgt  89.516 ±  4.573  ns/op

~90 ns/op

SLIDE 34

Then ADD 40+ ns/op for NUMA access on a server!!!!

SLIDE 35

Data Dependent Loads aka “Pointer Chasing”!!!
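
The gap between the sequential and data dependent walks above can be sketched in a few lines. This is not the benchmark code behind the numbers on the slides; the class and method names are hypothetical, and a real measurement would need a proper harness and a much larger array.

```java
import java.util.Random;

// Illustrative sketch only: NOT the benchmark that produced the slide numbers.
final class AccessPatterns {
    // Sequential walk: the next address is known in advance, so the hardware
    // prefetchers can stream cache lines ahead of the loop.
    static long sumSequential(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Data dependent walk ("pointer chasing"): each loaded value is the index
    // of the next load, so the next address is unknown until the load completes.
    static long sumDependent(int[] next, long[] data) {
        long sum = 0;
        int index = 0;
        for (int i = 0; i < next.length; i++) {
            sum += data[index];
            index = next[index];
        }
        return sum;
    }

    // Builds one random cycle that visits every index exactly once
    // (Sattolo's variant of the Fisher-Yates shuffle).
    static int[] randomCycle(int size, long seed) {
        int[] next = new int[size];
        for (int i = 0; i < size; i++) {
            next[i] = i;
        }
        Random random = new Random(seed);
        for (int i = size - 1; i > 0; i--) {
            int j = random.nextInt(i); // j is in [0, i)
            int tmp = next[i];
            next[i] = next[j];
            next[j] = tmp;
        }
        return next;
    }
}
```

Both methods add up the same elements; only the order of access and the dependency between loads differ.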

SLIDE 36

Performance 101

SLIDE 37

1. Memory is transported in Cachelines

Performance 101

SLIDE 38

1. Memory is transported in Cachelines
2. Memory is managed in OS Pages

Performance 101

SLIDE 39

1. Memory is transported in Cachelines
2. Memory is managed in OS Pages
3. Memory is pre-fetched on predictable access patterns

Performance 101

SLIDE 40

Runtime Optimisation

SLIDE 41

1. Profile guided optimisations

Runtime JIT

SLIDE 42

1. Profile guided optimisations
2. Bets can be taken and later revoked

Runtime JIT

SLIDE 43

Branches

void foo() {
    // code
    if (condition) {
        // code
    }
    // code
}

SLIDE 44

Block A

Branches

void foo() {
    // code
    if (condition) {
        // code
    }
    // code
}

SLIDE 45

Block A Block B

Branches

void foo() {
    // code
    if (condition) {
        // code
    }
    // code
}

SLIDE 46

Block A Block C Block B

Branches

void foo() {
    // code
    if (condition) {
        // code
    }
    // code
}

SLIDE 47

Block A Block C Block B

Branches

void foo() {
    // code
    if (condition) {
        // code
    }
    // code
}

Block A Block C

SLIDE 48

Block A Block C Block B

Branches

void foo() {
    // code
    if (condition) {
        // code
    }
    // code
}

Block A Block C Block B

SLIDE 49

Subtle Branches

int result = (i > 7) ? a : b;

SLIDE 50

Subtle Branches

int result = (i > 7) ? a : b;

CMOV vs Branch Prediction?
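
To make the question concrete, here is a hedged sketch of the same selection written with and without a source-level branch. Whether the JIT emits a conditional move or a predicted branch for either form is its own decision and depends on profile data; the class and method names are hypothetical, and the mask trick is only an illustration of a branch-free selection, not a recommendation.

```java
// Illustrative sketch: two ways to express "i > 7 ? a : b".
final class SubtleBranches {
    // The form from the slide. The JIT may compile it to a branch or a CMOV.
    static int selectWithTernary(int i, int a, int b) {
        return (i > 7) ? a : b;
    }

    // Branch-free by construction: build an all-ones or all-zeros mask
    // from the sign of (7 - i), computed as a long to avoid int overflow.
    static int selectBranchFree(int i, int a, int b) {
        int mask = (int) ((7L - i) >> 63); // -1 when i > 7, otherwise 0
        return (a & mask) | (b & ~mask);
    }
}
```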

SLIDE 51

Method/Function Inlining

void foo() {
    // code
    bar();
    // code
}

SLIDE 52

Method/Function Inlining

void foo() {
    // code
    bar();
    // code
}

Block A

SLIDE 53

Method/Function Inlining

void foo() {
    // code
    bar();
    // code
}

Block A bar()

SLIDE 54

Method/Function Inlining

void foo() {
    // code
    bar();
    // code
}

Block A bar()

SLIDE 55

Method/Function Inlining

void foo() {
    // code
    bar();
    // code
}

Block A Block B bar()

SLIDE 56

Method/Function Inlining

void foo() {
    // code
    bar();
    // code
}

Block A Block B bar() Block A

SLIDE 57

Method/Function Inlining

void foo() {
    // code
    bar();
    // code
}

Block A Block B bar() Block A bar()

SLIDE 58

Method/Function Inlining

void foo() {
    // code
    bar();
    // code
}

Block A Block B bar() Block A Block B bar()

SLIDE 59

Method/Function Inlining

void foo() {
    // code
    bar();
    // code
}

i-cache & code bloat?

SLIDE 60

“Inlining is THE optimisation.”

Cliff Click

Method/Function Inlining

SLIDE 61

Bounds Checking

void foo(int[] array, int length) {
    // code
    for (int i = 0; i < length; i++) {
        bar(array[i]);
    }
    // code
}

SLIDE 62

Bounds Checking

void foo(int[] array) {
    // code
    for (int i = 0; i < array.length; i++) {
        bar(array[i]);
    }
    // code
}
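
A runnable sketch of the two loop shapes above, with hypothetical names. When the loop limit is a separate length parameter, every array[i] needs a check because length may exceed the array. When the limit is array.length, the index is provably in range, which is what gives a JIT the chance to drop the per-element check.

```java
// Illustrative sketch of the two loop shapes; names are hypothetical.
final class BoundsChecking {
    // The limit comes from the caller, so array[i] can go out of bounds
    // and the runtime must be able to detect that.
    static long sumWithLength(int[] array, int length) {
        long sum = 0;
        for (int i = 0; i < length; i++) {
            sum += array[i];
        }
        return sum;
    }

    // The limit is array.length, so 0 <= i < array.length holds on every
    // iteration and the access can never fail.
    static long sumWithArrayLength(int[] array) {
        long sum = 0;
        for (int i = 0; i < array.length; i++) {
            sum += array[i];
        }
        return sum;
    }
}
```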

SLIDE 63

Subtype Polymorphism

void draw(Shape[] shapes) {
    for (int i = 0; i < shapes.length; i++) {
        shapes[i].draw();
    }
}

void bar(Shape shape) {
    bar(shape.isVisible());
}

SLIDE 64

Subtype Polymorphism

void draw(Shape[] shapes) {
    for (int i = 0; i < shapes.length; i++) {
        shapes[i].draw();
    }
}

void bar(Shape shape) {
    bar(shape.isVisible());
}

Class Hierarchy Analysis & Inline Caching
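
As a hand-written analogy for what an inline cache does at a virtual call site: guard on the receiver class seen so far, call that implementation directly (which makes it a candidate for inlining), and fall back to normal dispatch otherwise. A JIT does this in generated machine code, not in source. Circle, Square, area() and the class below are assumptions for illustration; only Shape appears on the slides.

```java
// Hypothetical types for illustration; only the name Shape is from the slides.
interface Shape {
    double area();
}

final class Circle implements Shape {
    private final double radius;
    Circle(double radius) { this.radius = radius; }
    @Override public double area() { return Math.PI * radius * radius; }
}

final class Square implements Shape {
    private final double side;
    Square(double side) { this.side = side; }
    @Override public double area() { return side * side; }
}

final class InlineCacheSketch {
    // Plain virtual dispatch: the target depends on the runtime class.
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (int i = 0; i < shapes.length; i++) {
            total += shapes[i].area();
        }
        return total;
    }

    // Source-level analogy of a monomorphic inline cache that "bets" on Circle.
    static double totalAreaGuarded(Shape[] shapes) {
        double total = 0;
        for (int i = 0; i < shapes.length; i++) {
            Shape shape = shapes[i];
            if (shape.getClass() == Circle.class) {
                total += ((Circle) shape).area(); // fast path: known target
            } else {
                total += shape.area();            // slow path: virtual dispatch
            }
        }
        return total;
    }
}
```

If the bet stops paying off, for example because many Shape types start flowing through the call site, a runtime JIT can revoke it and recompile.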

SLIDE 65

1. Profile guided optimisations
2. Bets can be taken and later revoked

Runtime JIT

SLIDE 66

Garbage Collection

SLIDE 67

Generational Garbage Collection

“Only the good die young.”

Billy Joel

SLIDE 68

[Diagram: Young/New Generation (Eden containing TLABs, Survivor 0, Survivor 1, Virtual) and Old Generation (Tenured, Virtual)]

Generational Garbage Collection

SLIDE 69

Modern Hardware (Intel Sandy Bridge EP)

[Diagram: two sockets, each with cores C1..Cn, per-core L1 and L2 caches, a shared L3, a memory controller (MC) with attached DRAM, and PCI-e 3 with 40X IO; the sockets are linked by QPI]

Registers/Buffers: <1ns
L1: ~4 cycles, ~1ns
L2: ~12 cycles, ~3ns
L3: ~40 cycles, ~15ns; ~60 cycles, ~20ns (dirty hit)
DRAM: ~65ns
QPI: ~40ns

* Assumption: 3GHz Processor

SLIDE 70

Broadwell EX – 24 cores & 60MB L3 Cache

SLIDE 71

[Diagram: Young/New Generation; Eden containing two TLABs]

Thread Local Allocation Buffers

SLIDE 72

[Diagram: Young/New Generation; Eden containing two TLABs]

Thread Local Allocation Buffers

Affords locality of reference
Avoid false sharing
Can have NUMA aware allocation
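
A toy model of why allocation from a thread-local buffer is cheap: the owning thread bumps a private pointer, so the fast path needs no locks or atomic operations, and consecutive allocations sit next to each other in memory. This is a sketch under stated assumptions, not HotSpot's implementation; the class name, the 8-byte alignment and the -1 "buffer full" result are all hypothetical.

```java
// Toy bump-pointer allocator modelling one thread-local buffer.
// Not thread-safe by design: each thread would own its own instance.
final class BumpAllocator {
    private final int capacity; // buffer size in bytes
    private int top;            // offset of the next free byte

    BumpAllocator(int capacity) {
        this.capacity = capacity;
    }

    // Returns the offset of the new allocation, or -1 when the buffer is
    // exhausted (a real runtime would then hand the thread a fresh buffer).
    int allocate(int size) {
        if (size <= 0) {
            return -1;
        }
        long aligned = ((long) size + 7) & ~7L; // round up to 8 bytes (assumed)
        if (aligned > capacity - top) {
            return -1;
        }
        int offset = top;
        top += (int) aligned;
        return offset;
    }
}
```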

SLIDE 73

[Diagram: Young/New Generation (Eden containing TLABs, Survivor 0, Survivor 1, Virtual)]

Object Survival

SLIDE 74

[Diagram: Young/New Generation (Eden containing TLABs, Survivor 0, Survivor 1, Virtual)]

Object Survival

Aging Policies
Compacting Copy
NUMA Interleave
Fast Parallel Scavenging
Only the survivors require work

SLIDE 75

[Diagram: Young/New Generation (Eden containing TLABs, Survivor 0, Survivor 1, Virtual) and Old Generation (Tenured, Virtual)]

Object Promotion

SLIDE 76

[Diagram: Young/New Generation (Eden containing TLABs, Survivor 0, Survivor 1, Virtual) and Old Generation (Tenured, Virtual)]

Object Promotion

Concurrent Collection
String Deduplication

SLIDE 77

Compacting Collections

SLIDE 78

Compacting Collections – Depth first copy

SLIDE 79

Compacting Collections

SLIDE 80

Compacting Collections

OS Pages and cache lines?

SLIDE 81

[Diagram: heap divided into regions, each labelled E (Eden), S (Survivor), O (Old), H (Humongous) or Unused]

G1 – Concurrent Compaction

SLIDE 82

Azul Zing C4 True Concurrent Compacting Collector

SLIDE 83

Where next for GC?

SLIDE 84

Object Inlining/Aggregation

SLIDE 85

GC vs Manual Memory Management

Not easy to pick clear winner…

SLIDE 86

GC vs Manual Memory Management

Managed GC

GC Implementation
Card Marking
Read/Write Barriers
Object Headers
Background Overhead in CPU and Memory

Not easy to pick clear winner…

SLIDE 87

GC vs Manual Memory Management

Managed GC

GC Implementation
Card Marking
Read/Write Barriers
Object Headers
Background Overhead in CPU and Memory

Not easy to pick clear winner…

Native

Malloc Implementation
Arena/pool contention
Bin Wastage
Fragmentation
Debugging Effort
Inter-thread costs

SLIDE 88

Algorithms & Design

SLIDE 89

What is most important to performance?

SLIDE 90

Avoiding cache misses
Strength Reduction
Avoiding duplicate work
Amortising expensive operations
Mechanical Sympathy
Choice of Data Structures
Choice of Algorithms
API Design
Overall Design
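
One small example of strength reduction from the list above: wrapping a sequence number into a ring buffer slot. With a power-of-two capacity, the relatively expensive remainder can be replaced by a bitwise AND. The names are hypothetical, and the equivalence only holds for non-negative sequences and power-of-two capacities.

```java
// Illustrative sketch of strength reduction; names are hypothetical.
final class StrengthReduction {
    // Straightforward version: integer remainder on every access.
    static int indexWithModulo(long sequence, int capacity) {
        return (int) (sequence % capacity);
    }

    // Reduced version: valid only when capacity is a power of two
    // and sequence is non-negative.
    static int indexWithMask(long sequence, int capacity) {
        return (int) (sequence & (capacity - 1));
    }

    static boolean isPowerOfTwo(int value) {
        return value > 0 && (value & (value - 1)) == 0;
    }
}
```

Validating the capacity once at construction time, rather than on every access, is also a simple case of amortising an expensive operation.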

SLIDE 91

In a large codebase it is really difficult to do everything well

SLIDE 92

It also takes some “uncommon” disciplines such as: profiling, telemetry, modelling…

SLIDE 93

“If I had more time, I would have written a shorter letter.”

Blaise Pascal

SLIDE 94

The story of Aeron

SLIDE 95

Aeron is an interesting lesson in “time to performance”

SLIDE 96

Lots of others exist, such as the C# Roslyn compiler

SLIDE 97

Time spent on Mechanical Sympathy vs Debugging Pointers ???

SLIDE 98

Immutable Data & Concurrency
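
A minimal sketch of the idea, with hypothetical names: an immutable value has only final fields, so once it is published any thread can read it without locks, and an "update" means creating a new value and swapping a single reference.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical immutable value: all fields final, no setters.
final class Quote {
    private final String symbol;
    private final long priceTicks;

    Quote(String symbol, long priceTicks) {
        this.symbol = symbol;
        this.priceTicks = priceTicks;
    }

    String symbol() { return symbol; }
    long priceTicks() { return priceTicks; }

    // "Modification" returns a new instance and leaves this one untouched.
    Quote withPriceTicks(long newPriceTicks) {
        return new Quote(symbol, newPriceTicks);
    }
}

// Readers always see a complete Quote; writers swap in a new instance atomically.
final class LatestQuote {
    private final AtomicReference<Quote> latest;

    LatestQuote(Quote initial) {
        this.latest = new AtomicReference<>(initial);
    }

    Quote get() { return latest.get(); }

    void update(long newPriceTicks) {
        latest.updateAndGet(quote -> quote.withPriceTicks(newPriceTicks));
    }
}
```

The trade-off is more short-lived allocation, which is the pattern a generational collector is designed to handle cheaply.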

SLIDE 99

Functional Programming

SLIDE 100

In Closing …

SLIDE 101

What does the future hold?

SLIDE 102

Remember Assembly vs Compiled Languages

SLIDE 103

What about the issues of footprint, startup time, GC pauses, etc. ???

SLIDE 104

SLIDE 105

SLIDE 106

SLIDE 107

Blog: http://mechanical-sympathy.blogspot.com/
Twitter: @mjpt777

“Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction.”

Albert Einstein