SLIDE 1

ThreatMetrix Confidential Information – Do Not Copy or Distribute Without Express Written Permission

Keeping RAFT Afloat Cloud Scale Distributed Consensus

Philip Haynes

YOW! Data September 2016

SLIDE 2


CONSENSUS?

  • To the general public, consensus is usually a good thing…
SLIDE 3


The downside…

  • On the other hand, what if the robotic consensus decides that humans should be exterminated…
  • In IT, of course, the role of consensus is much more subtle
SLIDE 4


What is a consensus algorithm then?

  • The consensus problem is fundamental to the control of a multi-agent system, e.g. multiple servers
  • A consensus problem requires agreement among a number of processes (or agents) on a single data value
  • Some of the processes (agents) may fail or be unreliable in other ways, so consensus protocols must be fault tolerant
  • One option is for all processes (agents) to agree on a majority value, e.g. > half the votes
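The majority-value option amounts to a quorum check. A minimal illustration (class and method names are hypothetical, not from the talk):

```java
// Illustrative quorum check: a value is agreed once strictly more than
// half the processes vote for it.
public class MajorityQuorum {
    public static boolean hasMajority(int votes, int clusterSize) {
        return votes > clusterSize / 2;
    }

    public static void main(String[] args) {
        System.out.println(hasMajority(3, 5)); // 3 of 5 votes: true
        System.out.println(hasMajority(2, 5)); // 2 of 5 votes: false
        System.out.println(hasMajority(2, 4)); // exactly half is not a majority: false
    }
}
```

Note that exactly half is not enough: with four servers split two against two, no majority exists, which is one reason clusters use an odd number of servers.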

SLIDE 5


Hard Distributed Consensus is:

  • Fundamental where a consistent view of a system in the presence of failure is required
  • Think financial records / TP systems
  • Unfortunately perceived as being too difficult and costly for cloud scale – hence "eventually" consistent models
  • Critical for simplifying big data processing and its analysis where:
  • Near enough is not good enough; and
  • Systems must always be up
SLIDE 6

Cloud scale at ThreatMetrix: Global Device Identity Recognition Rates


SLIDE 7

Cloud scale consensus requirements

  • > 100M digital identity requests per day. Internal SLA < 100ms. Multi-data center. > 400 node Cassandra cluster.
  • New capability for operational fraud processing:
  • Capable of processing > 1B events per day
  • < 5 ms initial detection
  • Far fewer nodes than 400 (3–5); where
  • Results are evidentiary (i.e. consistency matters)
  • We also care about availability and scalability


SLIDE 8

Building blocks for cloud scale distributed consensus

  • Hardware aware programming methods
  • Aeron messaging. Low latency reliable transport from Martin Thompson et al.
  • RAFT. New distributed consensus algorithm designed to be understandable (compared to Paxos etc.)
  • Asynchronous replicated log system model
  • Concern existed that distributed consensus is hard to implement*
  • RocksDB. LSM database, a Facebook fork of Google's LevelDB
  • Designed for write heavy loads on SSDs

*Our local experiment continues to support this view.

SLIDE 9

RAFT – A replicated log

  • Replicated log => replicated state machine
  • All servers execute the same commands in the same order
  • Consensus module ensures proper log replication
  • The system makes progress as long as a majority of servers are up
  • Failure mode: fail-stop (not Byzantine); messages may be delayed or lost
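The replicated-log model above rests on determinism: replicas that apply the same log in the same order reach the same state. A minimal sketch with a hypothetical key-value state machine (not the talk's implementation):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of "replicated log => replicated state machine": each entry is a
// hypothetical {key, value} "set" command applied in log order.
public class ReplicatedStateMachine {
    private final Map<String, String> state = new HashMap<>();
    private int lastApplied = 0; // index of the last applied log entry

    // Apply committed entries in order; determinism makes replicas agree.
    public void apply(List<String[]> log, int commitIndex) {
        while (lastApplied < commitIndex) {
            String[] cmd = log.get(lastApplied++);
            state.put(cmd[0], cmd[1]);
        }
    }

    public String get(String key) { return state.get(key); }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        log.add(new String[]{"x", "1"});
        log.add(new String[]{"x", "2"});

        ReplicatedStateMachine a = new ReplicatedStateMachine();
        ReplicatedStateMachine b = new ReplicatedStateMachine();
        a.apply(log, 2);
        b.apply(log, 2);
        // Same log, same order -> same state on every replica.
        System.out.println(a.get("x").equals(b.get("x"))); // true
    }
}
```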


SLIDE 10

Replicate across a cluster with a leader, adding a concept of time (called a term). An odd number of servers supports voting:

  1. RequestVote RPC to elect a leader
  2. AppendEntries RPC to replicate log entries
  3. When a majority of followers append an entry, the entry is committed and may be applied to the state machine
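The commit-on-majority rule is commonly implemented by taking the median of the cluster's replication progress. A hypothetical sketch using RAFT's conventional matchIndex bookkeeping, not the talk's actual code:

```java
import java.util.Arrays;

// Sketch of the commit rule: the leader commits the highest log index
// that a majority of servers (leader included) have appended.
public class CommitRule {
    public static int commitIndex(int[] matchIndex) {
        int[] sorted = matchIndex.clone();
        Arrays.sort(sorted);
        // For n servers, the median entry is the highest index present on
        // at least (n/2 + 1) of them.
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) {
        // 5 servers: leader and two followers at index 7, two lagging followers.
        System.out.println(commitIndex(new int[]{7, 7, 7, 5, 4})); // 7: on 3 of 5 servers
    }
}
```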


SLIDE 11

Initial RAFT implementation


SLIDE 12

Key initial implementation failures

  • Conceptual: misunderstood that communication between leader and followers must be viewed as a queue of requests and responses
  • Flow control
  • Started seeing > 7000 messages in flight
  • When this happened the system collapsed
  • The requirement for flow control was not understood on raft-dev
  • State of practice
  • TMX RAFT: > 3K msgs/s @ ~300 µs latency
  • Public implementations: 20 msgs/s, batched to 50 ms over TCP/IP


SLIDE 13

Flow control for RAFT: The hypothesis


SLIDE 14

RAFT flow control: The implementation

  • Keep moving averages of round trip time and service time
  • Keep a record of messages in flight (i.e. queue size and active nodes)
  • Throttle when:
  • timeSinceLastReceived < heartBeatTimeOut; and
  • queueSize >= maxQueueSize; where
  • maxQueueSize = Math.max(1 + (int)(minRoundTripLatency / serviceTime), 10)
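Those rules can be sketched directly. The maxQueueSize formula is quoted from the slide; the surrounding method and parameter names are assumptions:

```java
// Sketch of the RAFT flow-control throttle described above. Only the
// maxQueueSize formula is from the talk; everything else is illustrative.
public class RaftFlowControl {
    static final int MIN_QUEUE_SIZE = 10; // floor from the slide's formula

    // maxQueueSize = Math.max(1 + (int)(minRoundTripLatency / serviceTime), 10)
    public static int maxQueueSize(double minRoundTripLatency, double serviceTime) {
        return Math.max(1 + (int) (minRoundTripLatency / serviceTime), MIN_QUEUE_SIZE);
    }

    // Throttle only while the peer is still live (heartbeats arriving within
    // the timeout) AND its in-flight queue has reached the derived bound.
    public static boolean shouldThrottle(long timeSinceLastReceived, long heartBeatTimeOut,
                                         int queueSize,
                                         double minRoundTripLatency, double serviceTime) {
        return timeSinceLastReceived < heartBeatTimeOut
                && queueSize >= maxQueueSize(minRoundTripLatency, serviceTime);
    }

    public static void main(String[] args) {
        // 400 µs minimum round trip, 25 µs service time -> bound of 17 in flight.
        System.out.println(maxQueueSize(400, 25)); // 17
        System.out.println(shouldThrottle(50, 150, 17, 400, 25));  // true: live and at bound
        System.out.println(shouldThrottle(200, 150, 17, 400, 25)); // false: heartbeat overdue
    }
}
```

The intuition: the number of useful in-flight messages is roughly the round trip time divided by the per-message service time, so queuing beyond that only builds backlog, which matches the collapse observed past ~7000 messages in flight.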


SLIDE 15

Performance results: Round trip time and service time


[Figure: density of round trip time (<750 ms) for 2 followers, and histogram of service time (<50 ms)]

SLIDE 16

Round trip and service time analysis


[Figure: round trip time density functions (<750 ms) for Follower 0 and Follower 2]

SLIDE 17

Round trip final analysis


[Figure: round trip time density (<750 ms) for 2 followers]

SLIDE 18


Getting it to really work

  • The implementation attempted to utilize multicast to provide discovery across the cluster
  • Flow control interference between clusters
  • Modified for independent RAFT cluster flow control
  • RAFT messages prioritized over command messages
  • Introduced flow control between the different clusters
  • DTrace used to identify and remove outliers (units in nanoseconds)
  • Now processing:
  • 1,600 events per second
  • Creating and closing 1,600 cases per second
SLIDE 19


Conclusion

  • Aeron and other hardware aware programming techniques are fundamental to reducing the cost of cloud scale services
  • RAFT and DFSMs are fundamental for implementing transaction engines but are insufficient on their own
  • Cloud scale is fundamentally different from research-system scale
  • Build a performance model and measure system processing
SLIDE 20


Questions?

SLIDE 21


Study limitations

  • Yet to fully optimize the system
  • Repeat on 10G hardware
  • More than 4 followers
  • Multi-data center issues
  • Flow control during failure scenarios

[Figure: histogram of service time outliers (>100 ms)]

SLIDE 22


Latency Curve

[Figure: latency curve]