Fault Tolerance at Speed, by Todd L. Montgomery (@toddlmontgomery)

slide-1
SLIDE 1

Fault Tolerance at Speed

Todd L. Montgomery @toddlmontgomery

StoneTor

slide-2
SLIDE 2

About me…

slide-3
SLIDE 3

What type of Fault Tolerance?
What is Clustering?
Why Aeron?
Design for Speeding Up?

slide-4
SLIDE 4

What type of Fault Tolerance?
What is Clustering?
Why Aeron?
Design for Speeding Up? Efficiency

slide-5
SLIDE 5

https://www.forbes.com/sites/forbestechcouncil/2017/12/15/why-energy-is-a-big-and-rapidly-growing-problem-for-data-centers/#344456665a30
https://www.datacenterdynamics.com/opinions/power-consumption-data-centers-global-problem/
https://www.nature.com/articles/d41586-018-06610-y

slide-6
SLIDE 6

We seem to assume efficiency/security/quality/etc. is a “special” characteristic added … later… if at all

slide-7
SLIDE 7

Fault Tolerance

slide-8
SLIDE 8

Service Client

slide-9
SLIDE 9

Service Client

slide-10
SLIDE 10

Service Client Service Service

slide-11
SLIDE 11

Service Client Service Service Client Client

slide-12
SLIDE 12

Service Client Service Service Client Client

State

slide-13
SLIDE 13
slide-14
SLIDE 14

Service Service Service

State “Storage”

slide-15
SLIDE 15

Service Client Service Service Client Client

State

slide-16
SLIDE 16

Fault Tolerance of State

slide-17
SLIDE 17

Service Service Service State

Partition Replication

slide-18
SLIDE 18

Contiguous Log with Snapshot & Replay

slide-19
SLIDE 19

1 2 3 4 5 6 X

slide-20
SLIDE 20

1 State 2 3 4 5 6 X

slide-21
SLIDE 21

1 State 2 3 4 5 6 X

… Snapshot

slide-22
SLIDE 22

1 State 2 3 4 5 6 X

… Snapshot

5 6 X

Snapshot State
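
The snapshot and replay idea in the last few slides can be sketched in a few lines. This is only an illustration under stated assumptions: `CounterService`, `recover`, and the event shape `(name, delta)` are hypothetical names invented here, and the log is a plain in-memory list rather than an archived stream.

```python
from dataclasses import dataclass, field


@dataclass
class CounterService:
    """Hypothetical service whose entire state is a dict of named counters."""
    state: dict = field(default_factory=dict)
    position: int = 0  # number of log events applied so far

    def apply(self, event):
        # Each event is (counter_name, delta); applying it is deterministic.
        name, delta = event
        self.state[name] = self.state.get(name, 0) + delta
        self.position += 1

    def take_snapshot(self):
        # A snapshot "rolls up" every event before `position` into one record.
        return {"position": self.position, "state": dict(self.state)}

    @classmethod
    def from_snapshot(cls, snapshot):
        return cls(state=dict(snapshot["state"]), position=snapshot["position"])


def recover(log, snapshot=None):
    """Rebuild a service from an optional snapshot plus the log tail after it."""
    if snapshot is not None:
        service = CounterService.from_snapshot(snapshot)
    else:
        service = CounterService()
    for event in log[service.position:]:
        service.apply(event)
    return service


log = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5), ("b", 6)]

live = CounterService()
for event in log[:4]:
    live.apply(event)
snapshot = live.take_snapshot()  # taken after the first four events
for event in log[4:]:
    live.apply(event)

from_scratch = recover(log)             # full replay from event 1
from_snapshot = recover(log, snapshot)  # snapshot, then replay events 5 and 6
```

Both recovery paths end in the same state as the live service, which is the property that lets the log before a snapshot be skipped on restart.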

slide-23
SLIDE 23

Clustered Services

slide-24
SLIDE 24

Service Service Service

slide-25
SLIDE 25

Service Service Service Log Archive Log Archive Log Archive

slide-26
SLIDE 26

Replicated State Machines

https://en.wikipedia.org/wiki/State_machine_replication

slide-27
SLIDE 27

Each Replicated Service
  • Same event log
  • Same input ordering
  • Log replicated locally

Replicated State Machines

slide-28
SLIDE 28

Checkpoints / Snapshots
  • Event in the log
  • “Rolling” up previous log events

Replicated State Machines

slide-29
SLIDE 29

When should a service “consume” (or process) a log event?

slide-30
SLIDE 30

Service Service Service Archive Archive Archive

1 2 3 4 5 6 1 2 3 4 5 6 7 1 2

slide-31
SLIDE 31

Once processed, an Event cannot be altered
Only process once the event is stable

slide-32
SLIDE 32

Raft Consensus
Event must be recorded at a majority of Replicas before being consumed by any Replica

Replicated State Machines

https://raft.github.io/
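
The majority rule on this slide reduces to a small calculation once each member reports how far it has recorded the log. The sketch below is a hedged illustration, not Aeron's or Raft's actual code: `quorum_position` is a hypothetical helper, and the positions are illustrative values.

```python
def quorum_position(recorded_positions):
    """Return the highest log position recorded by a majority of members.

    `recorded_positions` holds one position per cluster member,
    including the leader's own.
    """
    if not recorded_positions:
        raise ValueError("at least one member position is required")
    ranked = sorted(recorded_positions, reverse=True)
    majority = len(ranked) // 2 + 1
    # The majority-th highest value is reached by at least `majority` members.
    return ranked[majority - 1]


# Three members: two of them have recorded up to at least 6912.
three_node = quorum_position([8192, 6912, 4096])  # 6912

# Five members: three of them have recorded up to at least 5.
five_node = quorum_position([10, 9, 5, 3, 1])  # 5
```

Events at or below the returned position are safe for any replica to consume; anything beyond it could still be lost if a member fails.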

slide-33
SLIDE 33

Service Service Service Archive Archive Archive

1 2 3 4 5 6 1 2 3 4 5 6 7 1 2

slide-34
SLIDE 34

Service Service Service Archive Archive Archive

1 2 3 4 5 6 1 2 3 4 5 6 7 1 2

slide-35
SLIDE 35

Strong Leader
  • Elected member of the Cluster
  • Orders Input
  • Disseminates Consensus

Raft

slide-36
SLIDE 36

Service Service Service Archive Archive Archive Consensus Consensus Consensus

slide-37
SLIDE 37

Raft is
  • An algorithm with formal verification

Replicated State Machines

slide-38
SLIDE 38

Raft is not
  • A specification
  • Nor a complete system

Replicated State Machines

slide-39
SLIDE 39

More than Raft
  • Leader timestamps events
  • Async, not RPC-based
  • Timers

The Real World

slide-40
SLIDE 40

Service Service Service Archive Archive Archive Consensus Consensus Consensus

*Leader

Client

slide-41
SLIDE 41

Benefits

slide-42
SLIDE 42

Determinism
Log is immutable
Log can be played, stopped, & replayed
Each event is timestamped
Services restarted from snapshot & log

Benefits
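
One consequence of timestamped events is that a service can take its notion of "now" from the log instead of the wall clock, so timers fire identically on every replica and on every replay. A minimal sketch of that idea, assuming a hypothetical `TimerService` and event shapes invented for this example:

```python
import heapq


class TimerService:
    """Hypothetical service whose clock only advances via log timestamps."""

    def __init__(self):
        self.now = 0
        self._timers = []  # min-heap of (deadline, timer_id)
        self.fired = []    # (time_observed, timer_id) in firing order

    def on_event(self, timestamp, event):
        # Time comes from the event, never from the local wall clock.
        self.now = timestamp
        while self._timers and self._timers[0][0] <= self.now:
            _, timer_id = heapq.heappop(self._timers)
            self.fired.append((self.now, timer_id))
        kind, *args = event
        if kind == "schedule":
            timer_id, delay = args
            heapq.heappush(self._timers, (self.now + delay, timer_id))


def replay(log):
    service = TimerService()
    for timestamp, event in log:
        service.on_event(timestamp, event)
    return service.fired


log = [
    (100, ("schedule", "t1", 50)),  # deadline 150
    (120, ("schedule", "t2", 10)),  # deadline 130
    (135, ("noop",)),
    (160, ("noop",)),
]

first_run = replay(log)
second_run = replay(log)
```

Replaying the same log always yields the same firing order, regardless of when or where the replay happens.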

slide-43
SLIDE 43

What Can You Do?

slide-44
SLIDE 44

Distributed Key/Value Store
Distributed Timers
Distributed Locks
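
As a hedged illustration of how one of these could sit on top of a replicated log, here is a lock service written as a deterministic state machine. `LockService` and the command tuples are hypothetical; the point is only that replicas applying the same commands in the same order agree on who holds each lock.

```python
class LockService:
    """Hypothetical lock table driven entirely by ordered log commands."""

    def __init__(self):
        self.owners = {}  # lock name -> client currently holding it

    def apply(self, command):
        op, lock, client = command
        if op == "acquire":
            if lock not in self.owners:
                self.owners[lock] = client
                return True
            # Re-acquiring a lock you already hold succeeds; otherwise refuse.
            return self.owners[lock] == client
        if op == "release":
            if self.owners.get(lock) == client:
                del self.owners[lock]
                return True
            return False
        raise ValueError(f"unknown operation: {op}")


log = [
    ("acquire", "L", "c1"),
    ("acquire", "L", "c2"),  # refused: c1 holds the lock
    ("release", "L", "c1"),
    ("acquire", "L", "c2"),  # granted now
]

replicas = [LockService() for _ in range(3)]
results = [[replica.apply(command) for command in log] for replica in replicas]
```

No replica needs to talk to another to decide a lock request; agreement comes from the shared input ordering.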

slide-45
SLIDE 45

Matching Engines
Order Management
Market Surveillance
P&L, Risk, …

Finance

slide-46
SLIDE 46

Venue Ticketing / Reservations
Auctions

Beyond

Hint - a contended database is a good indicator

slide-47
SLIDE 47

Why Aeron?

slide-48
SLIDE 48

Efficient reliable UDP unicast, UDP multicast, and IPC message transport
Java, C/C++, C#, Go

Aeron

https://github.com/real-logic/Aeron

slide-49
SLIDE 49

And a little bit more…
Very fast Archival & Replay

Aeron

https://github.com/real-logic/Aeron

slide-50
SLIDE 50

The “Efficient” bit…

slide-51
SLIDE 51

All communications
  • Aeron publications & subscriptions
  • Aeron archival & replay
  • Aeron shared counters

slide-52
SLIDE 52

Consensus based on Aeron stream position

slide-53
SLIDE 53

Batching
  • Critical to efficient operation
  • Optimizing pipelined throughput
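
To make the batching point concrete, here is a small sketch of draining a queue with one write per batch instead of one write per message. It is a generic illustration, not Aeron's mechanism: `drain_in_batches`, the 1408-byte batch limit, and the 200-byte messages are all assumptions chosen for the example.

```python
from collections import deque


def drain_in_batches(pending, max_batch_bytes, write):
    """Drain `pending` (a deque of bytes), calling `write` once per batch.

    Returns the number of writes issued. A single message larger than
    `max_batch_bytes` is still written, as a batch of its own.
    """
    writes = 0
    while pending:
        batch = [pending.popleft()]
        size = len(batch[0])
        # Keep adding whatever is already waiting, up to the size limit.
        while pending and size + len(pending[0]) <= max_batch_bytes:
            message = pending.popleft()
            batch.append(message)
            size += len(message)
        write(b"".join(batch))
        writes += 1
    return writes


# Twenty 200-byte messages waiting at once.
pending = deque(bytes([i]) * 200 for i in range(20))
written = []
writes = drain_in_batches(pending, 1408, written.append)
```

The fixed cost of each write is paid three times here rather than twenty, and the saving grows with the backlog, which is why batching helps most when the system is under load.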

slide-54
SLIDE 54

Flow Control
  • Critical to correct operation

slide-55
SLIDE 55

Design for Efficiency?

slide-56
SLIDE 56

Cache Hit/Miss Ratios Branch Prediction Allocation Rates Garbage Collection Inlining Optimizations

slide-57
SLIDE 57

Not… Yet…

slide-58
SLIDE 58

Ownership, Dependency, & Coupling
Complexity
Layers of Abstraction (ain’t free)
Resource Management

slide-59
SLIDE 59

Closer… But…

  • Still. Not. Yet.
slide-60
SLIDE 60

"AmdahlsLaw" by Daniels220 at English Wikipedia - Own work based on: File:AmdahlsLaw.png. Licensed under CC BY-SA 3.0 via Wikimedia Commons

slide-61
SLIDE 61

Universal Scalability Law

[Plot: Speedup (2 to 20) versus Processors (1 to 1024, in powers of two), comparing the Amdahl and USL curves]
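
The two curves can be reproduced from the standard formulas for Amdahl's Law and the Universal Scalability Law. In the sketch below the function names and the parameter values (95% parallel work, contention 0.05, coherency 0.0001) are assumptions chosen for illustration, not values read off the slide.

```python
def amdahl_speedup(n, parallel_fraction):
    """Amdahl's Law: speedup on n processors when only part of the work scales."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def usl_speedup(n, contention, coherency):
    """Universal Scalability Law: adds a coherency penalty that grows as n*(n-1)."""
    return n / (1.0 + contention * (n - 1) + coherency * n * (n - 1))


processors = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

# Amdahl flattens towards 1 / (1 - 0.95) = 20 and never goes down.
amdahl_curve = [amdahl_speedup(n, 0.95) for n in processors]

# With a non-zero coherency term, USL peaks and then gets worse.
usl_curve = [usl_speedup(n, 0.05, 0.0001) for n in processors]
```

With the coherency term set to zero, USL reduces to Amdahl's Law; with it non-zero, adding processors past the peak makes the system slower, which is the behavior the USL curve illustrates.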

slide-62
SLIDE 62

Breakdown Interactions
Fundamental Sequential Operations

slide-63
SLIDE 63

Ingress Message, Sequence, Disseminate

Client Follower X Leader

Ingress

Follower Y

Log (multicast or serial unicast) Member Status Log Event Log Event

slide-64
SLIDE 64

Followers Append

Client Follower X Leader

Ingress

Follower Y

Log (multicast or serial unicast) Member Status Append Position Append Position

slide-65
SLIDE 65

Commit Message

Client Follower X Leader

Ingress

Follower Y

Log (multicast or serial unicast) Member Status Commit Position Commit Position

slide-66
SLIDE 66

Breakdown Interactions
Pipeline-able Operation & Batching

slide-67
SLIDE 67

Follower Leader

Log (multicast or serial unicast) Member Status Commit Position @4096 Append Position @6912 Log Event @8192

Stream Positions

Archive Position @8096 Archive Position @7168

Store locally, asynchronous to Position processing by Consensus & Log processing by Service
Batching: Log, Appends, Commits
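
One way to picture the stream positions is a service that may read the log only up to the current commit position, while the log itself keeps growing ahead of it. The sketch below assumes a hypothetical `LogReader` over an in-memory list of `(end_position, event)` pairs with 1024-byte events; it does not use Aeron's API.

```python
class LogReader:
    """Hypothetical reader that never consumes past the commit position."""

    def __init__(self, log):
        self.log = log      # list of (end_position, event), in log order
        self.index = 0      # next entry to consume
        self.position = 0   # end position of the last consumed entry

    def poll(self, commit_position, limit=10):
        """Consume up to `limit` events that end at or before commit_position."""
        consumed = []
        while (self.index < len(self.log)
               and len(consumed) < limit
               and self.log[self.index][0] <= commit_position):
            end_position, event = self.log[self.index]
            consumed.append(event)
            self.position = end_position
            self.index += 1
        return consumed


# Eight 1024-byte events: the log has been written up to position 8192.
log = [(1024 * (i + 1), f"event-{i + 1}") for i in range(8)]
reader = LogReader(log)

first = reader.poll(commit_position=4096)   # events 1 to 4 are committed
second = reader.poll(commit_position=6912)  # events 5 and 6 fit; 7 ends at 7168
```

Because each poll can pick up several committed events at once, the service naturally works in batches whenever the commit position jumps forward.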

slide-68
SLIDE 68

Doesn’t this Complicate Recovery?

slide-69
SLIDE 69

Follower

Recovery Positions

Archive Position @8096 Archive Position @7168

A synchronous system doesn’t make this complexity go away!
Election still needs to assert state of the cluster & locally catch-up

Follower Follower

Archive Position @7584 Commit Position @4096 Commit Position @4064 Commit Position @4032 Service Position @4096 Service Position @4064 Service Position @3776

slide-70
SLIDE 70

Limitations of Efficiency Throughput & Latency

slide-71
SLIDE 71

Client Followers Leader

Ingress Log (multicast or serial unicast) Member Status Commit Position Append Position Log Event

Client to Service A: 0.5 RTT
Client to Service Ox: 1 RTT
Client to Service A (on Commit): 1.5 RTT
Client to Service Ox (on Commit): 2 RTT

Constant Delay Network

Service A Service Ox

Round-Trip Time (RTT)

slide-72
SLIDE 72

Limits from Constant Delay

Shared Memory RTT <100ns
Client to Service A: 50ns
Client to Service Ox: 100ns
Client to Service A (on Commit): 150ns
Client to Service Ox (on Commit): 200ns

DC RTT <100us
Client to Service A: 50us
Client to Service Ox: 100us
Client to Service A (on Commit): 150us
Client to Service Ox (on Commit): 200us

Rack (Kernel Bypass) RTT <10us
Client to Service A: 5us
Client to Service Ox: 10us
Client to Service A (on Commit): 15us
Client to Service Ox (on Commit): 20us
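
The figures on this slide follow from multiplying each environment's round-trip time by the RTT factors given for the constant delay network. A small sketch of that arithmetic, where `RTT_MULTIPLIERS` and `latency_bounds` are names made up for this example:

```python
# RTT factors for each path, as listed on the constant delay network slide.
RTT_MULTIPLIERS = {
    "Client to Service A": 0.5,
    "Client to Service Ox": 1.0,
    "Client to Service A (on Commit)": 1.5,
    "Client to Service Ox (on Commit)": 2.0,
}


def latency_bounds(rtt):
    """Scale every path's RTT factor by a given round-trip time (any unit)."""
    return {path: factor * rtt for path, factor in RTT_MULTIPLIERS.items()}


shared_memory_ns = latency_bounds(100)  # RTT of 100 ns
datacenter_us = latency_bounds(100)     # RTT of 100 us
rack_us = latency_bounds(10)            # RTT of 10 us, kernel bypass
```

These are limits set by network delay alone; processing time on the leader, followers, and service comes on top.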

slide-73
SLIDE 73

Measured Latency at Throughput

[Plot: RTT (us), axis 75 to 300, by Percentile (Min, 0.50, 0.90, 0.99, 0.9999, 0.999999, Max), for 100K msgs/sec and 200K msgs/sec]

Intel Xeon Gold 5118 (2.30GHz, 12 cores)
32GB DDR4 2400 MHz ECC RAM
Intel Optane SSD 900P Series 480GB
SolarFlare X2522-PLUS 10GbE NIC
All servers are connected to an Arista 7150S
CentOS Linux 7.7, kernel 4.4.195-1.el7.elrepo.x86_64 tuned for low-latency workload.
Courtesy Mark Price
Single client session, bursts of 20x 200B messages, 3-node cluster, Service(s) echo(es) the payload back.

slide-74
SLIDE 74

Takeaways
  • Efficiency is part of design
  • Power of a timestamped, replicated log
  • Replicated State Machines

slide-75
SLIDE 75

Current Status
  • Aeron Archiving - fully supported
  • Aeron Clustering - pre-release

Sponsored by

https://weareadaptive.com/

slide-76
SLIDE 76
slide-77
SLIDE 77

Aeron: https://github.com/real-logic/Aeron
Twitter: @toddlmontgomery

Thank You!

Questions?

StoneTor