Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering
Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, Dan R. K. Ports
Server failures are the common case in data centers
State machine replication: every replica executes the same operations (Operation A, Operation B, Operation C) in the same order.
Paxos normal case:
Client → Leader: request
Leader → Replicas: prepare
Replicas → Leader: prepareok
Leader → Client: reply

The leader is a throughput bottleneck, and the extra round of coordination adds a latency penalty.
Performance overhead due to worst-case network assumptions
What properties should the network have to enable faster replication?
Asynchronous Network: messages may be dropped, reordered, or delivered with arbitrary latency.

What if the network instead gave both Reliability and Ordering, i.e., all replicas receive the same messages in the same order? Such a network has the same complexity as Paxos.
Network guarantees form a spectrum from Weak to Strong: the asynchronous network sits at the weak end, and a network providing both Ordering and Reliability sits at the strong end, equivalent to Paxos itself. The two guarantees can be split.
Contributions:
- A new network model with a near-zero-cost implementation: Ordered Unreliable Multicast (OUM)
- A coordination-free replication protocol: Network-Ordered Paxos (NOPaxos)

Together they provide replication within 2% throughput overhead of an unreplicated system, in a data center network.
Key Idea: separate ordering from reliable delivery in state machine replication.
- The network provides ordering.
- The replication protocol handles reliability.
The OUM sequencer stamps a counter value into packet headers; receivers use the counter to detect reordering and message drops.
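The counter-stamping step can be sketched as follows. This is a hedged illustration, not the NOPaxos implementation (which runs in a switch or middlebox); the names `Packet`, `Sequencer`, and `stamp` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    payload: bytes
    seq: int = 0          # 0 = not yet stamped; filled in by the sequencer

class Sequencer:
    """Keeps one monotonically increasing counter per multicast group."""
    def __init__(self):
        self.counter = 0

    def stamp(self, pkt: Packet) -> Packet:
        # Increment the counter and write it into the packet header
        # before the packet is forwarded to all group members.
        self.counter += 1
        pkt.seq = self.counter
        return pkt

seq = Sequencer()
a = seq.stamp(Packet(b"op-A"))   # a.seq == 1
b = seq.stamp(Packet(b"op-B"))   # b.seq == 2
```

Because the sequencer only increments a counter and rewrites a header field, it adds near-zero cost to the data path.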
Ordered Unreliable Multicast: senders multicast through the sequencer, which stamps each message with the next counter value (1, 2, 3, 4, …). All receivers deliver messages in counter order. If the network drops a message, every receiver sees the same gap in the counter sequence.

Ordered Multicast: no coordination required to determine the order of messages.
Drop Detection: coordination only required when messages are dropped.
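The receiver side of drop detection can be sketched as below, assuming each packet carries the sequencer's counter. The class and field names are illustrative, not from the paper's code.

```python
class OumReceiver:
    """Delivers OUM messages in counter order and records gaps (drops)."""
    def __init__(self):
        self.next_seq = 1
        self.delivered = []   # (seq, payload) pairs, in order
        self.gaps = []        # counter values detected as dropped

    def receive(self, seq, payload):
        if seq < self.next_seq:
            return            # duplicate or stale message; ignore
        # Any skipped counter values correspond to dropped messages;
        # the replication protocol must resolve these slots.
        for missing in range(self.next_seq, seq):
            self.gaps.append(missing)
        self.delivered.append((seq, payload))
        self.next_seq = seq + 1

r = OumReceiver()
r.receive(1, "A")
r.receive(2, "B")
r.receive(4, "D")   # counter 3 was dropped in the network
# r.gaps == [3]: the drop is detected locally, with no coordination.
```

Ordering is thus free at the receiver; only the dropped slot (here, 3) triggers any protocol-level work.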
Sequencer implementations:
- In-switch sequencing: programmable switches (P4)
- Middlebox prototype: a network processor placed alongside the switches
- End-host sequencing: no special hardware required; some latency penalty, but retains the benefits
Network-Ordered Paxos (NOPaxos): replication on top of OUM in a data center network. The protocol must handle messages that are dropped and component failure.
NOPaxos normal case:
- The client sends its request via OUM to all replicas (the leader and the others).
- The leader executes the request and replies with the result; the other replicas reply with no coordination.
- The client waits for replies from a majority of replicas, including the leader's.
- Total cost: 1 round trip time.
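The client's completion check can be sketched as a simple quorum test. This is a hedged sketch of the rule stated on the slide (majority of replies, including the leader's); the function name and reply representation are hypothetical.

```python
def is_complete(replies, n_replicas, leader_id):
    """replies: set of replica ids that returned matching replies
    for this request. Returns True once the operation is complete."""
    majority = n_replicas // 2 + 1
    # The quorum must contain the leader, whose reply carries the result.
    return len(replies) >= majority and leader_id in replies

# 5 replicas, leader is replica 0:
ok = is_complete({0, 1, 2}, 5, 0)       # majority incl. leader -> complete
not_ok = is_complete({1, 2, 3}, 5, 0)   # majority, but no leader reply yet
```

Because replicas reply independently, this check is the only synchronization the client performs in the normal case.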
Drop handling: replicas detect message drops from gaps in the OUM counter. A replica can recover the missing message from the leader; if the leader is missing it too, the replicas fill the slot with a NO-OP using a round of coordination (Paxos). Views are identified by a <leader-number, session-number> pair, covering both leader failure and sequencer failure.
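The two drop-handling cases above can be sketched as follows. This is an illustrative outline only; `fill_gap`, `NOOP`, and the stubbed coordination callback are hypothetical names, and the real protocol's coordination round involves messages among replicas.

```python
NOOP = "<no-op>"

def fill_gap(slot, leader_log, agree_noop):
    # Case 1: the leader has a copy of the dropped message; reuse it.
    if slot in leader_log:
        return leader_log[slot]
    # Case 2: the leader missed it too; run a coordination round
    # (stubbed here as agree_noop) to commit a NO-OP in that slot,
    # so every replica's log stays identical.
    agree_noop(slot)
    return NOOP

leader_log = {1: "A", 2: "B"}    # suppose the message for slot 3 was dropped
agreed = []
entry = fill_gap(3, leader_log, agreed.append)   # -> NOOP, after "agreement"
```

Coordination is thus paid only on the rare dropped-message path, never in the normal case.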
Evaluation: testbed in a data center network, with an OUM sequencer on the path.
[Figure: Latency (us, 250-1000) vs. Throughput (ops/sec, 65,000-260,000) for NOPaxos, Fast Paxos, Paxos, and Paxos + Batching; higher throughput (→) and lower latency (↓) are better.]
- NOPaxos: 4.7X the throughput of Paxos and more than 40% reduction in latency.
- Versus Paxos + Batching: 25% higher throughput and 6X lower latency.
[Figure: Throughput (ops/sec, 65,000-260,000) vs. Packet Drop Rate (0.001%-1%) for NOPaxos, Speculative Paxos, and Paxos.]
- At high drop rates, Speculative Paxos drops to 24% of maximum throughput.
[Figure: Latency (us, 125-500) vs. Throughput (ops/sec, 65,000-260,000) for NOPaxos, Unreplicated, NOPaxos using an end-host sequencer, and Paxos; higher throughput (→) and lower latency (↓) are better.]
- NOPaxos is within 2% throughput and 16us latency of an unreplicated system.
- With an end-host sequencer: similar throughput but 36% higher latency.
Related work:
- Group communication systems: Amoeba [Kaashoek, et al.]
- Consensus protocols: Speculative Paxos [Ports, et al.], among others
- Network and hardware support for distributed systems: Consensus in a Box [Istvan, et al.], among others
Conclusion: state machine replication can be split between the network and the protocol. The network (OUM) provides ordered but unreliable message delivery; the replication protocol (NOPaxos) ensures reliable delivery. The result is replication with performance nearly equivalent to an unreplicated system.