Reactive design patterns for microservices on multicore


SLIDE 1

Reactive Summit - 22/10/18

Reactive Software with elegance

Reactive design patterns for microservices on multicore

charly.bechara@tredzone.com

SLIDE 2

Outline

1. Microservices on Multicore
2. Reactive Multicore Patterns
3. Modern Software Roadmap

SLIDE 3

1. MICROSERVICES ON MULTICORE

SLIDE 4

Microservices on Multicore

Microservice architecture with the actor model (diagram: µServices as actors communicating via message passing)

SLIDE 5

Microservices on Multicore

Fast data means more inter-communication

(Chart along two axes, communications and computations: stream = real-time event processing, batch = highly interconnected workflows, and fast data.)

SLIDE 6

Microservices on Multicore

Microservice architecture (diagram: µServices as actors communicating via message passing)

SLIDE 7

Microservices on Multicore

Microservice architecture + fast data (diagram: message passing between µService actors, with new interactions)

SLIDE 8

Microservices on Multicore

Microservice architecture + fast data (diagram: message passing between µService actors, with new and more interactions)

SLIDE 9

Microservices on Multicore

More microservices should run on the same multicore machine.

SLIDE 10

Microservices on Multicore

Microservice architecture + fast data + multicore machine (diagram: µServices mapped onto cores)

SLIDE 11

Microservices on Multicore

Microservice architecture + fast data + multicore machine

Universal Scalability Law (Gunther's law): a performance model of a system based on queueing theory.

  • σ = 0, κ = 0: perfect scalability (linear in N)
  • σ >> 0, κ = 0: contention impact (σ)
  • σ >> 0, κ > 0: coherency impact (κ)
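For reference, the Universal Scalability Law (not written out on the slide) expresses the relative capacity C(N) of a system on N cores in terms of the contention coefficient σ and the coherency coefficient κ:

  C(N) = N / (1 + σ·(N − 1) + κ·N·(N − 1))

With σ = κ = 0 this is linear in N (perfect scalability); σ > 0 makes throughput plateau (contention), and κ > 0 eventually makes it decrease again (coherency cost).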

SLIDE 12

Microservices on Multicore

From inter-thread communications...

SLIDE 13

Microservices on Multicore

From inter-thread communications...

SLIDE 14

Microservices on Multicore

…to inter-core communications

SLIDE 15

Microservices on Multicore

Inter-core communication => cache coherency

(Diagram: per-core registers, L1 I$/D$ and L2$, plus a shared L3$ or LLC across core 1 … core N. Approximate access latencies, assuming a 3 GHz clock:)

  Registers                     1 cycle       (0.3 ns)
  L1 I$ / L1 D$                 4 cycles      (1.3 ns)
  L2$                           12 cycles     (4 ns)
  Shared L3$ or LLC             > 30 cycles   (10 ns)
  Inter-core coherency (MESI)   > 600 cycles  (200 ns)

SLIDE 16

Microservices on Multicore

Exchange software is pushing performance to hardware limits:

  Stability: 50%ile -> 99.99%ile
  Velocity:  msec -> µsec
  Volume:    k.msg/s -> M.msg/s
SLIDE 17

Simplx: one thread per core

No context switching

SLIDE 18

Simplx: actors multitasking per thread

High core utilization

SLIDE 19

Simplx: one event loop per core for communications

Lock free

Simplx runs on all cores. Event loop = ~300 ns.
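To make the idea concrete, here is a minimal sketch (not Simplx's actual implementation; all names such as SpscRing and coreEventLoop are illustrative) of one event loop per core draining a lock-free single-producer/single-consumer ring buffer:

  #include <array>
  #include <atomic>
  #include <functional>
  #include <optional>

  // Illustrative lock-free SPSC ring: one writer core, one reader core.
  template <typename T, size_t N>
  class SpscRing {
      std::array<T, N> buf_;
      std::atomic<size_t> head_{0};  // advanced by the consumer
      std::atomic<size_t> tail_{0};  // advanced by the producer
  public:
      bool push(const T& v) {
          size_t t = tail_.load(std::memory_order_relaxed);
          size_t next = (t + 1) % N;
          if (next == head_.load(std::memory_order_acquire)) return false;  // full
          buf_[t] = v;
          tail_.store(next, std::memory_order_release);
          return true;
      }
      std::optional<T> pop() {
          size_t h = head_.load(std::memory_order_relaxed);
          if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
          T v = buf_[h];
          head_.store((h + 1) % N, std::memory_order_release);
          return v;
      }
  };

  // One event loop per core: drain incoming events, run the local actor callbacks.
  void coreEventLoop(SpscRing<std::function<void()>, 1024>& inbox,
                     std::atomic<bool>& running) {
      while (running.load(std::memory_order_relaxed)) {
          while (auto ev = inbox.pop()) (*ev)();  // working loop: process events
          // otherwise: idle loop (could also feed the core-usage counters of pattern #4)
      }
  }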

SLIDE 20

Multicore WITHOUT multithreaded programming?

SLIDE 21

Microservices on Multicore

Very good resources, but no multicore-related patterns

SLIDE 22

2. REACTIVE MULTICORE PATTERNS

SLIDE 23

SLIDE 24

Reactive Multicore Patterns

7 patterns to unleash multicore reactivity:

  • Core-to-core messaging (2 patterns)
  • Core monitoring (2 patterns)
  • Core-to-core flow control (1 pattern)
  • Core-to-cache management (2 patterns)

SLIDE 25

Core-to-core messaging patterns

SLIDE 26

Pattern #1: the core-aware messaging pattern

Inter-core communication: push message

(Diagram: sender, socket server and destination core, with latencies ~1 µs–10 µs, ~500 ns and ~300 ns; the inter-core push itself costs ~300 ns.)

Push a message (asynchronous):

  Pipe pipe(greenActorId);      // pipe to the destination actor
  pipe.push<HelloEvent>();      // asynchronous push, ~300 ns

SLIDE 27

Pattern #1: the core-aware messaging pattern

Intra-core communication: push message

Push a message (asynchronous), ~300 ns:

  Pipe pipe(greenActorId);
  pipe.push<HelloEvent>();

SLIDE 28

Pattern #1: the core-aware messaging pattern

Intra-core communication: ~150x speedup with a direct call over a push

Optimize calls according to the deployment.

Push a message (asynchronous), ~300 ns:

  Pipe pipe(greenActorId);
  pipe.push<HelloEvent>();

Direct call (synchronous), ~2 ns:

  ActorReference<GreenActor> target = getLocalReference(greenActorId);
  [...]
  target->hello();
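A minimal sketch of how a caller might pick between the two according to the deployment; isOnSameCore() is a hypothetical helper and the other names come from the snippets above, so this is illustrative rather than the Simplx API as-is:

  // Illustrative only: choose the cheapest call for the current deployment.
  void sayHello(const ActorId& greenActorId) {
      if (isOnSameCore(greenActorId)) {
          // same core: synchronous direct call (~2 ns), no event, no pipe
          ActorReference<GreenActor> target = getLocalReference(greenActorId);
          target->hello();
      } else {
          // other core: asynchronous push over the inter-core pipe (~300 ns)
          Pipe pipe(greenActorId);
          pipe.push<HelloEvent>();
      }
  }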

SLIDE 29

Pattern #2: the message mutualization pattern

Network optimizations, core optimizations: same fight.

(Diagram: a core pushes data to actors on other cores; in this use case, the 3 red consumers process the same data.)

SLIDE 30

Pattern #2: the message mutualization pattern

Communication has a cost: many events mean high cache-coherency traffic through the shared L3.

(Diagram: the producer pushes 3 separate events, one per consumer, across cores.)

SLIDE 31

Pattern #2: the message mutualization pattern

Let's mutualize inter-core communications: instead of 3 inter-core events, push 1 event to a local router on the destination core, which then makes 3 direct calls to the local consumers.
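A sketch of the idea, under the same assumptions as the earlier snippets (LocalRouterActor, ConsumerActor, DataEvent and the push-with-payload signature are illustrative names, not the Simplx API): the producer pushes one event per destination core, and a local router fans it out with cheap direct calls.

  #include <vector>

  // Illustrative local router: receives ONE inter-core event and fans it out
  // to the consumers living on the same core via direct (synchronous) calls.
  struct LocalRouterActor : Actor {
      std::vector<ActorReference<ConsumerActor>> localConsumers;  // same-core consumers

      void onEvent(const DataEvent& data) {
          for (auto& consumer : localConsumers)
              consumer->process(data);   // ~2 ns direct call, no L3 coherency traffic
      }
  };

  // Producer side: one push per destination core instead of one per consumer.
  void publish(const DataEvent& data, const std::vector<ActorId>& routerPerCore) {
      for (const ActorId& routerId : routerPerCore) {
          Pipe pipe(routerId);
          pipe.push<DataEvent>(data);    // 1 inter-core event per core (~300 ns)
      }
  }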

SLIDE 32

Pattern #2: the message mutualization pattern

WITH pattern vs WITHOUT pattern: linear improvement

SLIDE 33

Core monitoring patterns @ real-time

SLIDE 34

Pattern #3: the core stats pattern

Use case: monitoring the data distribution throughput

(Diagram: data distributed to actors across cores.)

We want to know, in real time, the number of messages received per second, globally and per core.

  StartSequence startSequence;
  startSequence.addActor<RedActor>(0); // core 0
  startSequence.addActor<RedActor>(0); // core 0
  startSequence.addActor<RedActor>(1); // core 1
  startSequence.addActor<RedActor>(1); // core 1
  Simplx simplx(startSequence);

SLIDE 35

Pattern #3: the core stats pattern

Use case: monitoring the data distribution throughput

Each core runs a local monitoring singleton; every RedActor on that core increases its message counter:

  struct LocalMonitorActor : Actor {
      [...]
      void newMessage() { ++count; }   // increase the per-core message counter
  };

  struct RedActor : Actor {
      [...]
      ActorReference<LocalMonitorActor> monitor;
      RedActor() { monitor = newSingletonActor<LocalMonitorActor>(); }  // local monitoring singleton
      void onEvent() { monitor->newMessage(); }
  };

SLIDE 36

Pattern #3: the core stats pattern

Use case: monitoring the data distribution throughput

Every second, a timer makes the local monitor push the last second's statistics to the service-monitoring actor:

  struct LocalMonitorActor : Actor, TimerProxy {
      [...]
      LocalMonitorActor() : TimerProxy(*this) { setRepeat(1000); }   // 1-second timer
      virtual void onTimeout() {
          serviceMonitoringPipe.push<StatsEvent>(count);  // inform service monitoring
          count = 0;
      }
  };

SLIDE 37

Pattern #4: the core usage pattern

Core utilization: detect overloaded cores before it is too late.

Relying on the CPU usage reported by the OS is not enough:

  • 100% does not mean the runtime is overloaded
  • 10% does not tell how much data you can really process

SLIDE 38

Pattern #4: the core usage pattern

No push, no event, no work: the event loop just spins in idle loops.

(Toy example: 20 idle loops in a second => 0% core usage. In reality it is more like 3 million loops per second.)

SLIDE 39

Pattern #4: the core usage pattern

Efficient core usage

(Toy example over 1 second: 20 idle loops => 0% core usage; 11 loops of which 3 are working loops => 60% core usage.)

SLIDE 40

Pattern #4: the core usage pattern

Runtime performance counters help the measurement.

(Toy example over 1 second: 11 loops = 8 idle loops + 3 working loops => 60% core usage, with Duration(IdleLoop) = 0.05 s.)

A core-usage actor records, for each loop, idleLoop = 0|1 and computes over a 1-second window:

  CoreUsage(%) = (1 − Σ(idleLoop) × Duration(IdleLoop)) × 100

In reality the duration of an idle loop is more like ~300 ns.
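Plugging the toy numbers into the formula: 8 idle loops of 0.05 s each leave 0.4 s idle in the 1-second window, so

  CoreUsage = (1 − 8 × 0.05) × 100 = 60%

which matches the 60% shown on the slide.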

SLIDE 41

Demo: real-time core monitoring

A typical trading workflow (diagram: data stream -> data processing)

SLIDE 42

Core-to-core flow control patterns

SLIDE 43

Pattern #5: the queuing prevention pattern

What if producers overflow a consumer?

Your software cannot be optimized any further? The incoming throughput could still be too high, implying heavy queuing. Continue? Stop the flow? Merge data? Throttle? Whatever the decision, we first need to detect the issue.

SLIDE 44

Pattern #5: the queuing prevention pattern

What's happening behind a push?

SLIDE 45

Pattern #5: the queuing prevention pattern

Local Simplx loops handle the inter-core communication.

(Diagram: the shared batch carries Batch ID = 145.)

SLIDE 46

Pattern #5: the queuing prevention pattern

Once the destination reads the data, the BatchID is incremented.

(Diagram: Batch ID = 145 -> Batch ID = 146.)

SLIDE 47

Pattern #5: the queuing prevention pattern

The BatchID does not increment if the destination core is busy.

(Diagram: Batch ID = 145 -> Batch ID = 145, unchanged.)

SLIDE 48

Pattern #5: the queuing prevention pattern

Core-to-core communication at max pace:

  BatchID batchID(pipe);
  pipe.push<Event>();
  (…)
  if (batchID.hasChanged()) {
      // push again
  } else {
      // destination is busy:
      // merge data, start throttling, reject orders…
  }

SLIDE 49

Pattern #5: the queuing prevention pattern

Demo (Java code): same id => queuing; last id => no queuing.

SLIDE 50

Core-to-cache management patterns

SLIDE 51

Pattern #6: the cache-aware split pattern

FIX + execution engine (diagram: a new order arrives)

SLIDE 52

Pattern #6: the cache-aware split pattern

FIX + execution engine

Almost all tags sent in the new-order request need to be sent back in the acknowledgment. A FIX order can easily weigh ~200 bytes.

SLIDE 53

Pattern #6: the cache-aware split pattern

Stability depends on the ability to be cache friendly.

(Diagram: order books in local storage; ~200 bytes per order, 1 to 10,000 open orders per book.)

To stay "in-cache" and get stable performance, one core can store ~1300 open orders.
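A rough sanity check on those numbers, assuming on the order of 256 KB of fast private cache per core (an assumption, not a figure stated on the slide):

  256 KB / ~200 B per full FIX order ≈ 1300 open orders per core
  256 KB / ~32 B per order entry     ≈ 8000 open orders per core (see slide 57)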

SLIDE 54

Pattern #6: the cache-aware split pattern

… but FIX orders are huge.

(Diagram: the full FIX order vs. a compact order entry of ~32 bytes: id, price, quantity, type, validity, …; order books in local storage, 1 to 10,000 open orders per book.)
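A minimal sketch of what such a compact entry could look like; the field names and widths are illustrative choices, not taken from the talk:

  #include <cstdint>

  // Illustrative ~32-byte order entry kept on the hot path / in cache,
  // while the full ~200-byte FIX message is stored elsewhere for reconciliation.
  struct OrderEntry {
      std::uint64_t id;         // order id, also the key to find the full FIX order back
      std::int64_t  price;      // fixed-point price
      std::uint32_t quantity;
      std::uint8_t  type;       // e.g. limit / market
      std::uint8_t  validity;   // e.g. day / GTC
      std::uint8_t  pad[10];    // keep the struct at exactly 32 bytes
  };
  static_assert(sizeof(OrderEntry) == 32, "order entry should stay cache friendly");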

SLIDE 55

Pattern #6: the cache-aware split pattern

Let's cut it and send only the strict minimum: the ~32-byte order entry (id, price, quantity, type, validity, …) goes to the order book; the full FIX order stays in local storage.
SLIDE 56

Pattern #6: the cache-aware split pattern

… and reconcile both parts later: the order entry processed in the order book is matched back with the stored FIX order, e.g. when building the acknowledgment.
SLIDE 57

Pattern #6: the cache-aware split pattern

We have divided by 6 the number of cores needed to stay stable.

To stay "in-cache" and get stable performance, one core can now store ~8000 open orders (~32 bytes per order entry instead of ~200 bytes per full FIX order).

SLIDE 58

Pattern #7: the $-friendly actor directory pattern

Routing a message needs an actor directory.

(Diagram: a message initiator sends a message to actor A; the router answers "Where is A? What is its address?" and forwards the message to the destination.)

SLIDE 59

Pattern #7: the $-friendly actor directory pattern

Regular routing design is simple: the router maps an incoming key to the @destination on a core.

SLIDE 60

Pattern #7: the $-friendly actor directory pattern

Multi-scale communication impacts the actor address size (from sender to destination):

  same core:    ActorID
  same process: ActorID + CoreID
  same server:  ActorID + CoreID + EngineID
  remote:       ActorID + CoreID + EngineID + MachineID

SLIDE 61

Pattern #7: the $-friendly actor directory pattern

The local directory can be huge.

(Directory table: N records mapping an incoming key K, 12 bytes*, to a @destination D, 10 bytes.)

  size = N(K + D)
  size = 50 000 × (12 + 10) ≈ 1.1 MB

*ISIN code = 12 characters

SLIDE 62

Pattern #7: the $-friendly actor directory pattern

Let's take advantage of core awareness.

(Diagram: the router maps the incoming key to a @destination spread over cores 1 … n.)

SLIDE 63

Pattern #7: the $-friendly actor directory pattern

Let's take advantage of core awareness: the sender-side directory now maps the incoming key to a coreID only; a small table maps each coreID to its @localrouter and a destination index, and the local router on the destination core (core 1 … core n) delivers to the destination actor.

SLIDE 64

Pattern #7: the $-friendly actor directory pattern

We save about 40% of cache memory.

(Table 1: N records mapping incoming key, 12 bytes, -> coreID, 1 byte.
 Table 2: 256 records mapping coreID, 1 byte, -> @localrouter, 10 bytes, + destination index, 1 byte.)

  size = 50 000 × (12 + 1) + 256 × (1 + 10 + 1) ≈ 0.65 MB
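A minimal sketch of the two-level lookup; the types are illustrative only (std::unordered_map and the placeholder LocalRouterAddress stand in for whatever cache-friendly structures a real implementation would use):

  #include <array>
  #include <cstdint>
  #include <string>
  #include <unordered_map>

  struct LocalRouterAddress { std::array<std::uint8_t, 10> bytes; };  // 10-byte @localrouter (placeholder)

  struct RouterEntry {
      LocalRouterAddress router;   // @localrouter on that core
      std::uint8_t       index;    // destination index resolved by the local router
  };

  // Illustrative two-level, cache-friendly actor directory.
  struct CoreAwareDirectory {
      std::unordered_map<std::string, std::uint8_t> keyToCore;  // incoming key (e.g. ISIN) -> coreID
      std::array<RouterEntry, 256> coreToRouter;                // coreID -> @localrouter + index

      const RouterEntry& route(const std::string& isin) const {
          std::uint8_t coreId = keyToCore.at(isin);  // big table: 12-byte key + 1-byte coreID
          return coreToRouter[coreId];               // small table: 256 records
      }
  };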

SLIDE 65

3. MODERN SOFTWARE ROADMAP

SLIDE 66

"The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry."

Henry Petroski

SLIDE 67

Multicore software roadmap to success

Next-generation multi-core software (with the actor model), as opposed to current multi-core software (with multithreading):

  • Design concurrent
  • Develop monothreaded
  • Run parallel
  • Execute reactive

SLIDE 68

Simplx is now open source (Apache 2.0)

https://github.com/Tredzone/simplx

charly.bechara@tredzone.com

SLIDE 69