

slide-1
SLIDE 1

High Performance Managed Languages

Martin Thompson - @mjpt777

slide-2
SLIDE 2

Really, what is your preferred platform for building HFT applications?

SLIDE 3
SLIDE 4

Why do you build low-latency applications on a GC’ed platform?

SLIDE 5
SLIDE 6
  • 1. Let’s set some Context
  • 2. Runtime Optimisation
  • 3. Garbage Collection
  • 4. Algorithms & Design

Agenda

slide-7
SLIDE 7

Some Context

slide-8
SLIDE 8

Let’s be clear

A Managed Runtime is not always the best choice…

slide-9
SLIDE 9

Latency Arbitrage?

SLIDE 10
SLIDE 11

Two questions…

slide-12
SLIDE 12

Why build on a Managed Runtime?

slide-13
SLIDE 13

Can managed languages provide good performance?

slide-14
SLIDE 14

We need to follow the evidence…

SLIDE 15
SLIDE 16

Are native languages faster?

slide-17
SLIDE 17

Time? Skills & Resources?

slide-18
SLIDE 18

What can, or should, be outsourced?
slide-19
SLIDE 19

CPU vs Memory Performance

slide-20
SLIDE 20

How much time to perform an addition operation on 2 integers?

slide-21
SLIDE 21

1 CPU Cycle < 1ns

slide-22
SLIDE 22

Sequential Access

  • Average time in ns/op to sum all

longs in a 1GB array?

slide-23
SLIDE 23

Access Pattern Benchmark

Benchmark                  Mode    Score    Error  Units
testSequential             avgt    0.832  ± 0.006  ns/op

~1 ns/op

slide-24
SLIDE 24

Really??? Less than 1ns per operation?

SLIDE 25
SLIDE 26

Random walk per OS Page

  • Average time in ns/op to sum all

longs in a 1GB array?

slide-27
SLIDE 27

Access Pattern Benchmark

Benchmark                  Mode    Score    Error  Units
testSequential             avgt    0.832  ± 0.006  ns/op
testRandomPage             avgt    2.703  ± 0.025  ns/op

~3 ns/op

slide-28
SLIDE 28

Data dependent walk per OS Page

  • Average time in ns/op to sum all

longs in a 1GB array?

slide-29
SLIDE 29

Access Pattern Benchmark

Benchmark                  Mode    Score    Error  Units
testSequential             avgt    0.832  ± 0.006  ns/op
testRandomPage             avgt    2.703  ± 0.025  ns/op
testDependentRandomPage    avgt    7.102  ± 0.326  ns/op

~7 ns/op

slide-30
SLIDE 30

Random heap walk

  • Average time in ns/op to sum all

longs in a 1GB array?

slide-31
SLIDE 31

Access Pattern Benchmark

Benchmark                  Mode    Score    Error  Units
testSequential             avgt    0.832  ± 0.006  ns/op
testRandomPage             avgt    2.703  ± 0.025  ns/op
testDependentRandomPage    avgt    7.102  ± 0.326  ns/op
testRandomHeap             avgt   19.896  ± 3.110  ns/op

~20 ns/op

slide-32
SLIDE 32

Data dependent heap walk

  • Average time in ns/op to sum all

longs in a 1GB array?

slide-33
SLIDE 33

Access Pattern Benchmark

Benchmark                  Mode    Score    Error  Units
testSequential             avgt    0.832  ± 0.006  ns/op
testRandomPage             avgt    2.703  ± 0.025  ns/op
testDependentRandomPage    avgt    7.102  ± 0.326  ns/op
testRandomHeap             avgt   19.896  ± 3.110  ns/op
testDependentRandomHeap    avgt   89.516  ± 4.573  ns/op

~90 ns/op
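The shape of these benchmarks can be sketched as a plain micro-benchmark. This is an illustrative reconstruction, not the original JMH code: the talk sums 1GB of longs, while a smaller array is used here so the sketch runs quickly, and all names are assumptions.

```java
import java.util.Random;

public final class AccessPatterns
{
    static final int SIZE = 1 << 22; // 4M longs = 32MB (the talk uses 1GB)

    // Predictable stride: the hardware pre-fetcher hides memory latency.
    static long sumSequential(final long[] values)
    {
        long sum = 0;
        for (int i = 0; i < values.length; i++)
        {
            sum += values[i];
        }
        return sum;
    }

    // Each load supplies the next index, so loads cannot be overlapped:
    // the "data dependent" / pointer-chasing pattern.
    static long sumDependent(final long[] nextIndex)
    {
        long sum = 0;
        int i = 0;
        for (int n = 0; n < nextIndex.length; n++)
        {
            sum += nextIndex[i];
            i = (int)nextIndex[i];
        }
        return sum;
    }

    // Sattolo's algorithm: a random single-cycle permutation, so the
    // dependent walk visits every slot exactly once.
    static long[] randomCycle(final int size, final Random random)
    {
        final int[] perm = new int[size];
        for (int i = 0; i < size; i++)
        {
            perm[i] = i;
        }
        for (int i = size - 1; i > 0; i--)
        {
            final int j = random.nextInt(i);
            final int tmp = perm[i];
            perm[i] = perm[j];
            perm[j] = tmp;
        }
        final long[] next = new long[size];
        for (int i = 0; i < size; i++)
        {
            next[i] = perm[i];
        }
        return next;
    }

    public static void main(final String[] args)
    {
        final long[] data = randomCycle(SIZE, new Random(42));

        long start = System.nanoTime();
        final long s1 = sumSequential(data);
        System.out.printf("sequential: %.2f ns/op (sum=%d)%n",
            (System.nanoTime() - start) / (double)SIZE, s1);

        start = System.nanoTime();
        final long s2 = sumDependent(data);
        System.out.printf("dependent:  %.2f ns/op (sum=%d)%n",
            (System.nanoTime() - start) / (double)SIZE, s2);
    }
}
```

Both walks touch exactly the same memory and produce the same sum; only the access pattern differs, which is the whole point of the slides above.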

slide-34
SLIDE 34

Then ADD 40+ ns/op for NUMA access on a server!!!!

slide-35
SLIDE 35

Data Dependent Loads aka “Pointer Chasing”!!!

slide-36
SLIDE 36

Performance 101

slide-37
SLIDE 37

1. Memory is transported in Cachelines

Performance 101

slide-38
SLIDE 38

1. Memory is transported in Cachelines 2. Memory is managed in OS Pages

Performance 101

slide-39
SLIDE 39

1. Memory is transported in Cachelines 2. Memory is managed in OS Pages 3. Memory is pre-fetched on predictable access patterns

Performance 101

slide-40
SLIDE 40

Runtime Optimisation

slide-41
SLIDE 41

1. Profile guided optimisations

Runtime JIT

slide-42
SLIDE 42

1. Profile guided optimisations 2. Bets can be taken and later revoked

Runtime JIT

slide-43
SLIDE 43

Branches

void foo()
{
    // code
    if (condition)
    {
        // code
    }
    // code
}

slide-44
SLIDE 44

Block A

Branches

void foo()
{
    // code
    if (condition)
    {
        // code
    }
    // code
}

slide-45
SLIDE 45

Block A Block B

Branches

void foo()
{
    // code
    if (condition)
    {
        // code
    }
    // code
}

slide-46
SLIDE 46

Block A Block C Block B

Branches

void foo()
{
    // code
    if (condition)
    {
        // code
    }
    // code
}

slide-47
SLIDE 47

Block A Block C Block B

Branches

void foo()
{
    // code
    if (condition)
    {
        // code
    }
    // code
}

Block A   Block C

slide-48
SLIDE 48

Block A Block C Block B

Branches

void foo()
{
    // code
    if (condition)
    {
        // code
    }
    // code
}

Block A   Block C   Block B

slide-49
SLIDE 49

Subtle Branches int result = (i > 7) ? a : b;

slide-50
SLIDE 50

Subtle Branches int result = (i > 7) ? a : b;

CMOV vs Branch Prediction?
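One way to feel the difference is to run the same conditional sum over sorted (predictable branch) and shuffled (unpredictable branch) data. Whether HotSpot emits a CMOV or a predicted branch is its decision, and timings vary by JVM and CPU, so treat this as an illustrative sketch with made-up names, not a rigorous benchmark.

```java
import java.util.Arrays;
import java.util.Random;

public final class SubtleBranches
{
    // The branch on (v > 127) is trivially predictable on sorted input
    // and a coin flip on shuffled input.
    static long conditionalSum(final int[] values)
    {
        long sum = 0;
        for (final int v : values)
        {
            if (v > 127)
            {
                sum += v;
            }
        }
        return sum;
    }

    public static void main(final String[] args)
    {
        final Random random = new Random(7);
        final int[] shuffled = new int[1 << 20];
        for (int i = 0; i < shuffled.length; i++)
        {
            shuffled[i] = random.nextInt(256);
        }

        final int[] sorted = shuffled.clone();
        Arrays.sort(sorted);

        for (final int[] input : new int[][]{ sorted, shuffled })
        {
            final long start = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < 50; i++)
            {
                sum += conditionalSum(input);
            }
            System.out.printf("%s: %dms (sum=%d)%n",
                input == sorted ? "sorted  " : "shuffled",
                (System.nanoTime() - start) / 1_000_000, sum);
        }
    }
}
```

If the JIT chooses a conditional move, the two timings converge; if it chooses a branch, the shuffled case typically pays for the mispredictions.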

slide-51
SLIDE 51

Method/Function Inlining

void foo()
{
    // code
    bar();
    // code
}

slide-52
SLIDE 52

Method/Function Inlining

void foo()
{
    // code
    bar();
    // code
}

Block A

slide-53
SLIDE 53

Method/Function Inlining

void foo()
{
    // code
    bar();
    // code
}

Block A   bar()

slide-54
SLIDE 54

Method/Function Inlining

void foo()
{
    // code
    bar();
    // code
}

Block A   bar()

slide-55
SLIDE 55

Method/Function Inlining

void foo()
{
    // code
    bar();
    // code
}

Block A   Block B   bar()

slide-56
SLIDE 56

Method/Function Inlining

void foo()
{
    // code
    bar();
    // code
}

Block A   Block B   bar()   Block A

slide-57
SLIDE 57

Method/Function Inlining

void foo()
{
    // code
    bar();
    // code
}

Block A   Block B   bar()   Block A   bar()

slide-58
SLIDE 58

Method/Function Inlining

void foo()
{
    // code
    bar();
    // code
}

Block A   Block B   bar()   Block A   Block B   bar()

slide-59
SLIDE 59

Method/Function Inlining

void foo()
{
    // code
    bar();
    // code
}

i-cache & code bloat?

slide-60
SLIDE 60

“Inlining is THE optimisation.”

  • Cliff Click

Method/Function Inlining

slide-61
SLIDE 61

Bounds Checking

void foo(int[] array, int length)
{
    // code
    for (int i = 0; i < length; i++)
    {
        bar(array[i]);
    }
    // code
}

slide-62
SLIDE 62

Bounds Checking

void foo(int[] array)
{
    // code
    for (int i = 0; i < array.length; i++)
    {
        bar(array[i]);
    }
    // code
}
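Why the second form is preferred: when the loop bound is `array.length` itself, the JIT can prove every `array[i]` is in range and eliminate or hoist the per-access bounds check; with a caller-supplied `length` it cannot. A runnable version of the idiom, where the summing body is an illustrative stand-in for `bar`:

```java
public final class BoundsChecking
{
    // Loop bound is array.length, so HotSpot can typically prove
    // 0 <= i < array.length and drop the bounds check on array[i].
    static long sum(final int[] array)
    {
        long total = 0;
        for (int i = 0; i < array.length; i++)
        {
            total += array[i];
        }
        return total;
    }

    public static void main(final String[] args)
    {
        System.out.println(sum(new int[]{ 1, 2, 3, 4 })); // prints 10
    }
}
```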

slide-63
SLIDE 63

Subtype Polymorphism

void draw(Shape[] shapes)
{
    for (int i = 0; i < shapes.length; i++)
    {
        shapes[i].draw();
    }
}

void bar(Shape shape)
{
    bar(shape.isVisible());
}

slide-64
SLIDE 64

Subtype Polymorphism

void draw(Shape[] shapes)
{
    for (int i = 0; i < shapes.length; i++)
    {
        shapes[i].draw();
    }
}

void bar(Shape shape)
{
    bar(shape.isVisible());
}

Class Hierarchy Analysis & Inline Caching
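The point of class hierarchy analysis and inline caching: if a virtual call site only ever sees one concrete receiver class (monomorphic), the JIT can inline straight through it; mixing many classes degrades it towards a real dispatch. A minimal sketch with illustrative names:

```java
public final class CallSites
{
    interface Shape
    {
        double area();
    }

    static final class Square implements Shape
    {
        final double side;
        Square(final double side) { this.side = side; }
        public double area() { return side * side; }
    }

    static final class Circle implements Shape
    {
        final double radius;
        Circle(final double radius) { this.radius = radius; }
        public double area() { return Math.PI * radius * radius; }
    }

    // If every element is a Square, the shapes[i].area() call site stays
    // monomorphic and can be inlined; mixing Squares and Circles makes it
    // polymorphic and the inline cache has to check the receiver class.
    static double totalArea(final Shape[] shapes)
    {
        double sum = 0;
        for (int i = 0; i < shapes.length; i++)
        {
            sum += shapes[i].area();
        }
        return sum;
    }

    public static void main(final String[] args)
    {
        System.out.println(totalArea(new Shape[]{ new Square(2), new Square(3) })); // prints 13.0
    }
}
```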

slide-65
SLIDE 65

1. Profile guided optimisations 2. Bets can be taken and later revoked

Runtime JIT

slide-66
SLIDE 66

Garbage Collection

slide-67
SLIDE 67

Generational Garbage Collection

“Only the good die young.”

  • Billy Joel
slide-68
SLIDE 68

[Diagram: Young/New Generation — Eden (containing TLABs), Survivor 0, Survivor 1; Old Generation — Tenured plus Virtual (reserved) space]

Generational Garbage Collection

slide-69
SLIDE 69

Modern Hardware (Intel Sandy Bridge EP)

[Diagram: two sockets of cores C1..Cn, each with a memory controller (MC) to local DRAM and PCI-e 3 I/O, linked by QPI]

Registers/Buffers                       <1ns
L1 cache                 ~4 cycles      ~1ns
L2 cache                ~12 cycles      ~3ns
L3 cache                ~40 cycles     ~15ns
L3 (dirty hit,
 another core)          ~60 cycles     ~20ns
DRAM                                   ~65ns
QPI hop to
 remote socket                        +~40ns

* Assumption: 3GHz Processor

slide-70
SLIDE 70

Broadwell EX – 24 cores & 60MB L3 Cache

slide-71
SLIDE 71

[Diagram: Eden, within the Young/New Generation, containing TLABs]

Thread Local Allocation Buffers

slide-72
SLIDE 72

[Diagram: Eden, within the Young/New Generation, containing TLABs]

Thread Local Allocation Buffers

  • Affords locality of reference
  • Avoid false sharing
  • Can have NUMA aware allocation
slide-73
SLIDE 73

[Diagram: Young/New Generation — Eden (with TLABs), Survivor 0, Survivor 1, and Virtual (reserved) space]

Object Survival

slide-74
SLIDE 74

[Diagram: Young/New Generation — Eden (with TLABs), Survivor 0, Survivor 1, and Virtual (reserved) space]

Object Survival

  • Aging Policies
  • Compacting Copy
  • NUMA Interleave
  • Fast Parallel Scavenging
  • Only the survivors require work
slide-75
SLIDE 75

[Diagram: Young/New Generation — Eden (containing TLABs), Survivor 0, Survivor 1; Old Generation — Tenured plus Virtual (reserved) space]

Object Promotion

slide-76
SLIDE 76

[Diagram: Young/New Generation — Eden (containing TLABs), Survivor 0, Survivor 1; Old Generation — Tenured plus Virtual (reserved) space]

Object Promotion

  • Concurrent Collection
  • String Deduplication
slide-77
SLIDE 77

Compacting Collections

slide-78
SLIDE 78

Compacting Collections – Depth first copy

slide-79
SLIDE 79

Compacting Collections

slide-80
SLIDE 80

Compacting Collections

OS Pages and cache lines?

slide-81
SLIDE 81

[Diagram: G1 heap region map — equal-sized regions tagged Eden (E), Survivor (S), Old (O), Humongous (H), or Unused]

G1 – Concurrent Compaction

slide-82
SLIDE 82

Azul Zing C4 True Concurrent Compacting Collector

slide-83
SLIDE 83

Where next for GC?

slide-84
SLIDE 84

Object Inlining/Aggregation

slide-85
SLIDE 85

GC vs Manual Memory Management

Not easy to pick a clear winner…

slide-86
SLIDE 86

GC vs Manual Memory Management

Managed GC

  • GC Implementation
  • Card Marking
  • Read/Write Barriers
  • Object Headers
  • Background Overhead

in CPU and Memory

Not easy to pick a clear winner…

slide-87
SLIDE 87

GC vs Manual Memory Management

Managed GC

  • GC Implementation
  • Card Marking
  • Read/Write Barriers
  • Object Headers
  • Background Overhead

in CPU and Memory

Not easy to pick a clear winner…

Native

  • Malloc Implementation
  • Arena/pool contention
  • Bin Wastage
  • Fragmentation
  • Debugging Effort
  • Inter-thread costs
slide-88
SLIDE 88

Algorithms & Design

slide-89
SLIDE 89

What is most important to performance?

slide-90
SLIDE 90
  • Avoiding cache misses
  • Strength Reduction
  • Avoiding duplicate work
  • Amortising expensive operations
  • Mechanical Sympathy
  • Choice of Data Structures
  • Choice of Algorithms
  • API Design
  • Overall Design
slide-91
SLIDE 91

In a large codebase it is really difficult to do everything well

slide-92
SLIDE 92

It also takes some “uncommon” disciplines such as: profiling, telemetry, modelling…

slide-93
SLIDE 93

“If I had more time, I would have written a shorter letter.”

  • Blaise Pascal
slide-94
SLIDE 94

The story of Aeron

slide-95
SLIDE 95

Aeron is an interesting lesson in “time to performance”

slide-96
SLIDE 96

Many other examples exist, such as the C# Roslyn compiler

slide-97
SLIDE 97

Time spent on Mechanical Sympathy vs Debugging Pointers ???

slide-98
SLIDE 98

Immutable Data & Concurrency
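One reason immutability pairs well with a GC'ed runtime: immutable objects can be shared between threads without locks, "mutation" simply allocates a new instance, and a generational collector makes those short-lived instances cheap to reclaim. A minimal sketch with an illustrative class:

```java
public final class ImmutablePoint
{
    private final double x;
    private final double y;

    public ImmutablePoint(final double x, final double y)
    {
        this.x = x;
        this.y = y;
    }

    public double x() { return x; }

    public double y() { return y; }

    // Returns a new instance; concurrent readers of the old one are
    // unaffected, so no synchronisation is needed.
    public ImmutablePoint translate(final double dx, final double dy)
    {
        return new ImmutablePoint(x + dx, y + dy);
    }

    public static void main(final String[] args)
    {
        final ImmutablePoint p = new ImmutablePoint(1, 2);
        final ImmutablePoint q = p.translate(3, 4);
        System.out.println(p.x() + "," + p.y()); // prints 1.0,2.0
        System.out.println(q.x() + "," + q.y()); // prints 4.0,6.0
    }
}
```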

slide-99
SLIDE 99

Functional Programming

slide-100
SLIDE 100

In Closing …

slide-101
SLIDE 101

What does the future hold?

slide-102
SLIDE 102

Remember Assembly vs Compiled Languages

slide-103
SLIDE 103

What about the issues of footprint, startup time, GC pauses, etc. ???

SLIDE 104
SLIDE 105
SLIDE 106
SLIDE 107

Blog: http://mechanical-sympathy.blogspot.com/
Twitter: @mjpt777

“Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction.”

  • Albert Einstein

Questions?