SLIDE 1

Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering

Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, Dan R. K. Ports

SLIDE 2

Server failures are the common case in data centers

SLIDE 6

State Machine Replication

[Diagram: three replicas each apply Operation A, Operation B, Operation C in the same order]
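
To make the idea on this slide concrete, here is a minimal, hedged Python sketch (not from the paper): if every replica applies the same deterministic operations in the same order, all replicas end up in the same state. The KVStore class and the operation tuples are illustrative assumptions.

```python
# Minimal sketch of state machine replication: three replicas apply the same
# deterministic operations (A, B, C) in the same order and reach identical state.
# KVStore and the operation format are illustrative, not from the paper.

class KVStore:
    def __init__(self):
        self.data = {}

    def apply(self, op):
        kind, key, *rest = op
        if kind == "put":
            self.data[key] = rest[0]
        elif kind == "delete":
            self.data.pop(key, None)

# Operations A, B, C from the diagram, expressed as key-value ops.
log = [("put", "x", 1), ("put", "y", 2), ("delete", "x")]

replicas = [KVStore() for _ in range(3)]
for op in log:                       # same ordered log applied at every replica
    for replica in replicas:
        replica.apply(op)

assert all(r.data == replicas[0].data for r in replicas)   # identical state
print(replicas[0].data)              # {'y': 2}
```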

SLIDE 11

Paxos for state machine replication

[Diagram: the Client sends a request to the Leader; the Leader sends prepare to the other Replicas and collects prepare-ok responses before sending the reply]

Throughput bottleneck: every operation is funneled through the leader.
Latency penalty: the client's reply waits on an extra round of coordination.
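
As a rough illustration of why the slide flags the leader, here is a hedged Python sketch (not any real Paxos codebase) of the leader's critical path: it cannot reply to the client until a majority has acknowledged the prepare for this operation. The helper names and the synchronous send_prepare callback are assumptions.

```python
# Hedged sketch of the leader's per-operation work in leader-based Paxos / VR:
# every request costs a prepare round to the replicas before the reply, so the
# leader is both a throughput bottleneck and adds a latency penalty.
# send_prepare is a stand-in for a real RPC and returns True on prepare-ok.

def leader_handle_request(request, slot, replicas, send_prepare):
    majority = (len(replicas) + 1) // 2 + 1          # counting the leader itself
    acks = 1                                         # the leader accepts its own entry
    for replica in replicas:
        if send_prepare(replica, slot, request):     # prepare -> prepare-ok
            acks += 1
    if acks < majority:
        raise RuntimeError("operation not committed: too few prepare-ok replies")
    return ("reply", f"executed {request}")          # only now does the client hear back

# Toy run with 4 other replicas (5 total), all of which answer prepare-ok.
print(leader_handle_request("Operation A", slot=1,
                            replicas=["r1", "r2", "r3", "r4"],
                            send_prepare=lambda r, s, req: True))
```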

SLIDE 12

Can we eliminate Paxos overhead?

The performance overhead comes from worst-case network assumptions:

  • valid assumptions for the Internet
  • but data center networks are different

What properties should the network have to enable faster replication?

SLIDE 16

Network properties determine replication complexity

Asynchronous network (messages may be dropped, reordered, or delivered with arbitrary latency):

  • requires running the Paxos protocol on every operation
  • high performance cost

Reliable, ordered network (all replicas receive the same set of messages, and receive them in the same order):

  • replication is trivial
  • but a network implementation of reliability and ordering has the same complexity as Paxos

SLIDE 19

[Spectrum of network guarantees from weak to strong: an asynchronous network (Paxos) at the weak end, ordering and reliability at the strong end]

Can we build a network model that:

  • provides performance benefits
  • can be implemented more efficiently?

SLIDE 25
SLIDE 25

This Talk

A new network model with near-zero-cost implementation: Ordered Unreliable Multicast

+

A coordination-free replication protocol: Network-Ordered Paxos

=

Replication with throughput within 2% of an unreplicated system

SLIDE 26

Outline

  • 1. Background on state machine replication and data center networks
  • 2. Ordered Unreliable Multicast
  • 3. Network-Ordered Paxos
  • 4. Evaluation

SLIDE 27

Towards an ordered but unreliable network

Key Idea: Separate ordering from reliable delivery in state machine replication. The network provides ordering; the replication protocol handles reliability.

SLIDE 28

OUM Approach

  • Designate one sequencer in the network
  • The sequencer maintains a counter for each OUM group
  • 1. OUM messages are forwarded to the sequencer
  • 2. The sequencer increments the counter and writes the counter value into the packet header
  • 3. Receivers use the sequence numbers to detect reordering and message drops (see the sketch below)
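
The following hedged Python sketch (class and method names are my own, not the paper's API; the real sequencer runs in a P4 switch or on a middlebox) shows steps 1-3: the sequencer stamps a per-group counter into each message, and a receiver detects drops purely from gaps in the sequence numbers.

```python
# Minimal sketch of OUM sequencing and receiver-side drop detection.
# Both ends are plain Python objects here, purely for illustration.

class Sequencer:
    """Keeps one counter per OUM group and stamps it into each message."""

    def __init__(self):
        self.counters = {}

    def stamp(self, group, msg):
        seq = self.counters.get(group, 0) + 1
        self.counters[group] = seq
        return (group, seq, msg)        # sequence number rides in the packet header


class Receiver:
    """Delivers messages in sequence order and reports gaps (dropped messages)."""

    def __init__(self, group):
        self.group = group
        self.next_seq = 1

    def on_packet(self, group, seq, msg):
        if group != self.group or seq < self.next_seq:
            return                      # wrong group or stale duplicate
        if seq > self.next_seq:         # gap: some messages were dropped
            print(f"detected drop of messages {self.next_seq}..{seq - 1}")
        print(f"deliver #{seq}: {msg}")
        self.next_seq = seq + 1


# Example: one dropped packet is detected purely from the sequence numbers.
seqr, recv = Sequencer(), Receiver("group-A")
packets = [seqr.stamp("group-A", op) for op in ("A", "B", "C")]
del packets[1]                          # simulate the network dropping message #2
for pkt in packets:
    recv.on_packet(*pkt)
```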

SLIDE 34

Ordered Unreliable Multicast

[Diagram: senders multicast through the sequencer, which stamps counter values 1-4 into the messages; receivers see the stamped sequence, and one message is dropped in the network]

Ordered Multicast: no coordination is required to determine the order of messages.
Drop Detection: coordination is required only when messages are dropped.

SLIDE 35

Sequencer Implementations

In-switch sequencing:

  • next-generation programmable switches
  • implemented in P4
  • nearly zero cost

Middlebox prototype:

  • Cavium Octeon network processor
  • connects to the root switches
  • adds 8 µs latency

End-host sequencing (see the sketch below):

  • no specialized hardware required
  • incurs higher latency penalties
  • similar throughput benefits
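
For the end-host option, a sequencer is essentially a small forwarding loop. The sketch below is an assumption-heavy illustration (the ports, addresses, and 8-byte sequence-number prefix are invented for the example; the actual prototypes rewrite an OUM header in a P4 switch or a Cavium middlebox).

```python
# Hedged sketch of end-host sequencing: a plain server process stands in for
# the switch or middlebox, stamping a counter onto each multicast message and
# forwarding a copy to every group member.

import socket
import struct

GROUP_MEMBERS = [("10.0.0.1", 7001), ("10.0.0.2", 7001), ("10.0.0.3", 7001)]
LISTEN_ADDR = ("0.0.0.0", 7000)

def run_sequencer():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(LISTEN_ADDR)
    counter = 0
    while True:
        payload, _ = sock.recvfrom(65535)                 # client's OUM message
        counter += 1
        stamped = struct.pack("!Q", counter) + payload    # prepend sequence number
        for member in GROUP_MEMBERS:                      # forward to each receiver
            sock.sendto(stamped, member)

if __name__ == "__main__":
    run_sequencer()
```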

SLIDE 38

Outline

  • 1. Background on state machine replication and data center networks
  • 2. Ordered Unreliable Multicast
  • 3. Network-Ordered Paxos
  • 4. Evaluation

SLIDE 39

NOPaxos Overview

  • Built on top of the guarantees of OUM
  • Client requests are totally ordered but can be dropped
  • No coordination in the common case
  • Replicas run agreement on drop detection
  • A view change protocol handles leader or sequencer failure

SLIDE 45

Normal Operation

[Diagram: the client sends its request through OUM to all replicas; the leader executes the request; the replicas reply directly to the client]

The client waits for replies from a majority of replicas, including the leader's reply.
No coordination between replicas in the common case.
1 round trip time.
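
A hedged sketch of the client-side check implied by this slide (names and reply contents are assumptions, not the NOPaxos wire format): the client counts matching replies, requires a majority, and requires the leader's reply to be among them.

```python
# Minimal sketch of the client-side quorum check in the normal case.
# A reply "matches" if it reports the same view and log slot as the leader's.

def request_is_committed(replies, num_replicas, leader_id):
    """replies: dict replica_id -> (view, log_slot) extracted from reply messages."""
    majority = num_replicas // 2 + 1
    if leader_id not in replies:          # the leader's reply carries the result
        return False
    leader_view, leader_slot = replies[leader_id]
    matching = sum(
        1 for view, slot in replies.values()
        if view == leader_view and slot == leader_slot
    )
    return matching >= majority

# Example with 5 replicas: the leader (id 0) plus two followers agree on the
# same view and log slot, so the operation commits in one round trip.
replies = {0: (3, 42), 1: (3, 42), 4: (3, 42)}
print(request_is_committed(replies, num_replicas=5, leader_id=0))   # True
```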

SLIDE 46

Gap Agreement

Replicas detect message drops from gaps in the OUM sequence numbers:

  • Non-leader replicas: recover the missing message from the leader
  • Leader replica: coordinates with the other replicas to commit a NO-OP in that slot (a Paxos-style round)
  • Efficient recovery from network anomalies (see the sketch below)
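
The sketch below (stubbed RPCs, illustrative names) follows the two cases on this slide: a follower fetches the missing slot's contents from the leader, while the leader runs a Paxos-style round to commit a NO-OP for that slot.

```python
# Hedged sketch of gap handling, with stand-ins for the real network calls.

NOOP = "NO-OP"

class Replica:
    def __init__(self, is_leader, majority):
        self.is_leader = is_leader
        self.majority = majority
        self.log = {}                        # slot (OUM sequence number) -> operation

    # --- stand-ins for real RPCs -----------------------------------------
    def ask_leader_for(self, slot):
        return NOOP                          # leader answers with the op or NO-OP

    def run_agreement_on_noop(self, slot):
        return self.majority                 # pretend a majority acknowledged

    # --- gap handling -----------------------------------------------------
    def on_gap(self, slot):
        """Called when the OUM layer reports that message `slot` was dropped."""
        if not self.is_leader:
            # Followers recover the missing slot's contents from the leader.
            self.log[slot] = self.ask_leader_for(slot)
        else:
            # The leader coordinates a Paxos-style round so a majority agrees
            # that this slot is a NO-OP, then fills it in.
            acks = self.run_agreement_on_noop(slot)
            if acks >= self.majority:
                self.log[slot] = NOOP

# Example: a follower and the leader each handle a drop of sequence number 7.
Replica(is_leader=False, majority=3).on_gap(7)
Replica(is_leader=True, majority=3).on_gap(7)
```
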
SLIDE 47

View Change

  • Handles leader or sequencer failure
  • Ensures that all replicas are in a consistent state
  • Runs a view change protocol similar to Viewstamped Replication (VR)
  • The view-number is a tuple <leader-number, session-number> (see the sketch below)
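
As a small illustration of the tuple view-number, here is a hedged Python sketch; the bump rules and the lexicographic comparison are assumptions made for the example, not a statement of the protocol's exact ordering.

```python
# Illustrative sketch: a view number with two components, one advanced on a
# suspected leader failure and one advanced when the sequencer (OUM session)
# changes. Compared lexicographically here purely for the example.

from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class ViewNumber:
    leader_number: int       # which replica currently acts as leader
    session_number: int      # which OUM sequencer session is in use

def next_view_on_leader_failure(v: ViewNumber) -> ViewNumber:
    return ViewNumber(v.leader_number + 1, v.session_number)

def next_view_on_sequencer_failure(v: ViewNumber) -> ViewNumber:
    return ViewNumber(v.leader_number, v.session_number + 1)

# Replicas only ever move forward to larger views.
v = ViewNumber(2, 5)
assert next_view_on_leader_failure(v) > v
assert next_view_on_sequencer_failure(v) > v
```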

SLIDE 48

Outline

  • 1. Background on state machine replication and data center networks
  • 2. Ordered Unreliable Multicast
  • 3. Network-Ordered Paxos
  • 4. Evaluation

SLIDE 49

Evaluation Setup

  • 3-level fat-tree network testbed
  • 5 replicas with 2.5 GHz Intel Xeon E5-2680 processors
  • Middlebox sequencer

SLIDE 55

NOPaxos achieves better throughput and latency

[Graph: latency (µs) vs. throughput (ops/sec) for NOPaxos, Fast Paxos, Paxos, and Paxos + Batching; lower latency and higher throughput are better]

  • 4.7x the throughput of Paxos and more than a 40% reduction in latency
  • 25% higher throughput and 6x lower latency than Paxos with batching

SLIDE 58

NOPaxos is resilient to network anomalies

[Graph: throughput (ops/sec) vs. packet drop rate from 0.001% to 1% for NOPaxos, Speculative Paxos, and Paxos; annotation: drops to 24% of maximum throughput]

SLIDE 64

NOPaxos attains throughput within 2% of an unreplicated system

[Graph: latency (µs) vs. throughput (ops/sec) for NOPaxos, NOPaxos using an end-host sequencer, Unreplicated, and Paxos; lower latency and higher throughput are better]

  • NOPaxos is within 2% of the throughput and within 16 µs of the latency of an unreplicated system
  • With an end-host sequencer, NOPaxos achieves similar throughput but 36% higher latency

SLIDE 65

Related Work

Group communication systems:

  • Virtual Synchrony [Birman et al.], CATOCS [Cheriton et al.], Amoeba [Kaashoek et al.]

Consensus protocols:

  • Fast Paxos [Lamport], Optimistic Atomic Broadcast [Pedone et al.], Speculative Paxos [Ports et al.]
  • Egalitarian Paxos [Moraru et al.], Tapir [Zhang et al.]

Network and hardware support for distributed systems:

  • SwitchKV [Li et al.], NetPaxos [Dang et al.], FaRM [Dragojevic et al.], Consensus in a Box [Istvan et al.]

SLIDE 66

Summary

  • Separate ordering from reliable delivery in state machine replication
  • A new network model, OUM, that provides ordered but unreliable message delivery
  • A more efficient replication protocol, NOPaxos, that ensures reliable delivery
  • The combined system achieves performance equivalent to an unreplicated system