How choosing the Raft consensus algorithm saved us 3 months of development time


SLIDE 1

How choosing the Raft consensus algorithm saved us 3 months of development time
SLIDE 2

What do I do with unused space on my servers?
SLIDE 3

Requirements:

  • Fully S3 compatible
  • Easy to maintain
  • Fault tolerant

Let’s build an S3 cluster!

SLIDE 4

I found a great candidate: SX + LibreS3

Bonuses:

  • Block level deduplication
  • Highly scalable
  • Multiplatform

… but something was missing!


SLIDE 5

Almost there!

  • Fully distributed
  • Data replication
  • Cluster membership management

… but no support for detecting and kicking out dead nodes

What about automatic failover?

SLIDE 6

How to deal with the failure?

  • Some node has to make the decision
  • The deciding node must not be faulty
  • All live nodes should follow it

There is a need for a consensus algorithm.
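The majority rule at the heart of every consensus algorithm can be sketched in a few lines (an illustrative sketch, not SX code; the function names are invented):

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes forming a majority of the cluster."""
    return cluster_size // 2 + 1

def has_quorum(votes: int, cluster_size: int) -> bool:
    """A decision is safe only if a majority of all nodes agreed on it."""
    return votes >= quorum_size(cluster_size)

# Any two majorities of the same cluster overlap in at least one node,
# so two conflicting decisions can never both be accepted.
assert quorum_size(3) == 2    # a 3-node cluster tolerates 1 dead node
assert quorum_size(5) == 3    # a 5-node cluster tolerates 2 dead nodes
assert not has_quorum(2, 5)   # 2 votes out of 5 decide nothing
```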

SLIDE 7

Choosing the algorithm

Paxos:

  • Proven to work
  • Very complicated
  • Many variants and interpretations (ZooKeeper, …)

Raft:

  • Easy
  • Straightforward implementation
  • Accurate and comprehensive specs

And the winner is… Raft!

SLIDE 8

Raft: How does it work?

SLIDE 9

Leader election

SLIDE 10

Leader election

SLIDE 11

Leader election

SLIDE 12

Leader election

SLIDE 13

Raft: Node failure

SLIDE 14

Dead node detection

SLIDE 15

Dead node detection

SLIDE 16

Dead node detection

SLIDE 17

How I implemented Raft in SX

SLIDE 18
Implementation details

  • Heartbeats are sent via internal SX communication
  • Membership changes are performed automatically
  • Node failure detection relies on configurable timeouts
  • Almost no impact on SX performance
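Timeout-based dead node detection as described above can be sketched roughly like this (a simplified illustration, not actual SX code; only the `hb_deadtime` name mirrors the real SX parameter, everything else is invented):

```python
import time
from typing import Dict, Optional

class FailureDetector:
    """Marks a node as dead when no heartbeat arrives within hb_deadtime."""

    def __init__(self, hb_deadtime: float = 120.0):
        self.hb_deadtime = hb_deadtime
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: Optional[float] = None) -> None:
        # Called whenever a heartbeat from `node` arrives.
        self.last_seen[node] = time.monotonic() if now is None else now

    def is_alive(self, node: str, now: Optional[float] = None) -> bool:
        # A node is alive only if it was heard from within the timeout.
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(node)
        return seen is not None and (now - seen) < self.hb_deadtime

# Node "a" heartbeats at t=0; with hb_deadtime=120 it is still
# considered alive at t=60 but dead at t=130.
fd = FailureDetector(hb_deadtime=120.0)
fd.heartbeat("a", now=0.0)
print(fd.is_alive("a", now=60.0))   # True
print(fd.is_alive("a", now=130.0))  # False
```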

SLIDE 19

How to enable Raft in SX?

Enable the Raft node failure timeout:

  $ sxadm cluster --set-param hb_deadtime=120 \
      sx://admin@sx.foo.com

Kill one of the nodes and check its status:

  $ sxadm cluster -I sx://admin@sx.foo.com
  * node 10…da: … status: follower, online: ** NO **
  * node bd…ad: … status: follower, online: yes
  * node c2…b7: … status: leader, online: yes

Wait for the node to be marked as faulty:

  $ sxadm cluster -I sx://admin@sx.foo.com
  * node 10…da: … status: follower, online: ** FAULTY **
  * node bd…ad: … status: follower, online: yes
  * node c2…b7: … status: leader, online: yes

SLIDE 20

Robert Wojciechowski

follow @skylable

www.skylable.com

SLIDE 21

Stay tuned…

SLIDE 22

Coming up next: SXFS

FUSE-based filesystem mapping for SX:

  • Client-side encrypted
  • Fully deniable
  • Deduplication
  • Fault tolerant

SLIDE 23
The election basics

  • There is only one legitimate leader
  • Each node chooses a timeout
  • When the timeout is reached, a new election is started
  • A candidate node votes for itself
  • The candidate requests votes from the other nodes
  • If the candidate receives a majority of votes, it becomes the new leader
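A minimal sketch of one election round following these rules (illustrative only; it ignores terms, log comparison, and the other details of the full Raft specification):

```python
def run_election(candidate: str, nodes: list, grants: dict) -> bool:
    """One election round: the candidate votes for itself, asks the
    other nodes for their vote, and wins only with a strict majority."""
    votes = 1  # the candidate votes for itself
    for node in nodes:
        if node != candidate and grants.get(node, False):
            votes += 1
    return votes > len(nodes) // 2  # strict majority of the whole cluster

nodes = ["a", "b", "c", "d", "e"]
# "b" and "c" grant their votes: 3 of 5 is a majority, "a" becomes leader.
print(run_election("a", nodes, {"b": True, "c": True}))  # True
# Only "b" grants its vote: 2 of 5 is no majority, the election fails.
print(run_election("a", nodes, {"b": True}))             # False
```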

SLIDE 24

Corner cases: Leader failure

SLIDE 25

Leader node failure

SLIDE 26

Leader node failure

SLIDE 27

Leader node failure

SLIDE 28

Leader node failure

SLIDE 29

Corner cases: Race condition

SLIDE 30

Election race condition

SLIDE 31

Election race condition

SLIDE 32

Election race condition

SLIDE 33

Election race condition

SLIDE 34

Corner cases: Split votes

SLIDE 35

Split votes

SLIDE 36

Split votes

SLIDE 37

Split votes

SLIDE 38

Split votes