Taming Latency In Data Center Applications
Ph.D. Defense of Dissertation
Mohan Kumar
Advisor: Taesoo Kim
1
Motivation: Importance of Latency
Latency Critical In Data Center Applications
2
Data Center Applications: Key-value Stores, Web Servers, Distributed Services
3
➢ 66% of the inter-rack latency [1]
➢ 81% of the intra-rack latency [1]
4
[1] Network requirements for resource disaggregation, OSDI’16
Key-value Stores Web Servers Distributed Services
5
Protocol stack - 80% overhead
TLB shootdown - 30% overhead
Consensus - 82% overhead
System abstractions and optimizations are needed at different levels of the software stack, from software services running in user space and the kernel to software running on SmartNICs, to reduce latency and improve throughput.
6
Key-value Stores Web Servers Distributed Services
7
Protocol stack - 80% overhead - Xps
TLB shootdown - 30% overhead - LATR
Consensus - 82% overhead - Dyad
➢ Abstraction in kernel and user-space protocol stacks, and SmartNICs
➢ Reduces Redis latency by up to 73.3%
➢ Kernel mechanism for free operations, page migration and swapping
➢ Reduces Apache latency by up to 26.1%
8
➢ Abstraction in SmartNIC for consensus
➢ Reduces timestamp server latency by up to 79%
9
10
11
➢ Provides high availability by state machine replication
➢ Keeps data consistent - linearizable
➢ Consensus algorithms:
    ■ Multi-Paxos/Viewstamped Replication (VR)
    ■ Raft and ZooKeeper Atomic Broadcast (ZAB)
12
➢ Timestamp Servers
➢ Key-value stores
➢ Database
➢ Lock managers
13
Distributed Services
14
[Figure: message flow among Client, Replica 1 (Leader), Replica 2, and Replica 3: request, prepare, prepareok, exec(), response. Each replica has an Application layer and a Consensus layer.]
15
commit
Replica 2 Replica 3 Client request response commit exec()
Application Consensus
16
propose
TCP
ack Disk Disk Disk Replica 1/ Leader
[Figure: Replica 1 (Leader), Replica 2, and Replica 3, each with Consensus, Protocol Processing, Application, BSD socket, Linux epoll, NIC, and PCIe, connected by the Data Center Network (μs RTT). Client Requests arrive at the leader.]
17
Consensus Protocol Processing
Application BSD socket Linux epoll NIC Host PCIe Network Replica
~0.8 μs [1] ~10 μs
18
[1] Understanding PCIe performance for end host, SIGCOMM’18
Leader Replica 1 Replica 2 Client request response prepare prepareok commit
19
PCIe Protocol processing Context switch Application ~11 μs
Leader Replica 1 Replica 2 Client prepare prepareok
20
Legend: PCIe, Protocol processing, Context switch, Application
System   Direct
VR       67 μs
Leader Replica 1 Replica 2 Client prepareok
21
Legend: PCIe, Protocol processing, Context switch, Application (messages: request, request, commit)
System   Direct   Indirect
VR       62 μs    85 μs
22
➢ Critical path for handling a client request
➢ Recovery – application recovery after failure
➢ View Change – new replicas joining/leaving the group, new leader
➢ Heartbeats – health status messages exchanged across replicas
23
24
➢ NoPaxos, Speculative Paxos - rely on the network to order requests
➢ NetPaxos - proposal to execute Paxos in programmable switches
➢ Logically coupled consensus in hardware (FPGA)
➢ Application is limited by the resources available on the FPGA
25
26
Consensus Protocol Processing
Application BSD socket Linux epoll NIC Host PCIe Network
27
Consensus - Control
Protocol Processing
Application BSD socket Linux epoll Host PCIe Network Replica SmartNIC
Consensus – Data
Consensus - Control
28
Consensus Protocol Processing
Application BSD socket Linux epoll NIC PCIe
29
Logically-Coupled Consensus
Protocol Processing
Application BSD socket Linux epoll PCIe SmartNIC
Consensus – Data Consensus - Control
Dyad Consensus
➢ Critical path for handling a client request
➢ Recovery – application recovery after failure
➢ View Change – new replicas joining/leaving the group, new leader
➢ Heartbeats – health status messages exchanged across replicas
30
31
Protocol Processing
Application BSD socket Linux epoll Host PCIe Network Replica SmartNIC
Consensus – Data
Consensus - Control
32
Replica 2 Replica 3 Client request response prepare prepareok
33
PCIe Protocol processing Context switch Application SmartNIC to host 2.6 μs 3.5 μs 1.7 μs 3 μs 3 μs 0.1 μs 3 μs commit 1.5 μs 1.5 μs Replica 1/ Leader
Replica 2 Replica 3 Client prepare prepareok
34
SmartNIC to host 2.6 μs 3.5 μs 3 μs 1.7 μs
System        Direct    Indirect
VR            67 μs     85 μs
Dyad          12.8 μs
% Reduction   81%
Replica 1/ Leader
Replica 2 Replica 3 Client prepareok
35
PCIe Protocol processing Context switch Application SmartNIC 3 μs 0.1 μs commit 1.5 μs 1.5 μs
System        Direct    Indirect
VR            62 μs     85 μs
Dyad          12.7 μs   6.1 μs
% Reduction   81%       92%
Replica 1/ Leader
➢ Specify packet format in a domain-specific language (P4)
➢ Filter messages based on the header and payload
➢ Filters are applied to messages coming from the network and the host
36
➢ Filtered messages invoke request/consensus/response handlers
➢ Handlers drop/forward/modify a packet
➢ Generate new packets
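The filter-then-handler flow in the bullets above can be sketched as follows. This is only an illustration under stated assumptions: the slides describe P4 hardware filters and C handlers on the SmartNIC, whereas this sketch is plain Python, and `Packet`, `Action`, `Pipeline`, and the `msg_type` and `dest` fields are hypothetical names that do not come from the deck.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Dict, List, Tuple


class Action(Enum):
    DROP = auto()     # consume the packet on the NIC
    FORWARD = auto()  # pass the (possibly modified) packet along


@dataclass
class Packet:
    msg_type: str  # hypothetical header field matched by the filter
    payload: str
    dest: str      # hypothetical: "host" or "network"


# A handler decides what happens to the packet and may generate new packets.
Handler = Callable[[Packet], Tuple[Action, List[Packet]]]


class Pipeline:
    def __init__(self) -> None:
        self.handlers: Dict[str, Handler] = {}

    def register(self, msg_type: str, handler: Handler) -> None:
        self.handlers[msg_type] = handler

    def process(self, pkt: Packet) -> List[Packet]:
        handler = self.handlers.get(pkt.msg_type)
        if handler is None:
            return [pkt]  # no filter matched: pass through untouched
        action, generated = handler(pkt)
        kept = [pkt] if action is Action.FORWARD else []
        return kept + generated


def prepare_handler(pkt: Packet) -> Tuple[Action, List[Packet]]:
    # Drop the prepare on the NIC and generate a new prepareok packet.
    reply = Packet("prepareok", pkt.payload, dest="network")
    return Action.DROP, [reply]


pipeline = Pipeline()
pipeline.register("prepare", prepare_handler)
out = pipeline.process(Packet("prepare", "seq=1", dest="host"))
passthrough = pipeline.process(Packet("other", "x", dest="host"))
```

A handler that modifies a packet would change its fields in place and return `Action.FORWARD`; unmatched traffic bypasses the handlers entirely.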
37
PCIe Network
38
[Figure: SmartNIC pipeline between PCIe and Network: Ingress H/W filter (P4), C Handlers, Memory, Egress H/W filter (P4).]
Leader Replica 1 Replica 2 Client request response prepare prepareok
39
PCIe Protocol processing Context switch Application SmartNIC to host commit
Consensus - Data PCIe Network Leader SmartNIC
Request Handler Request 1 Prepare 2 Assign sequence number and Log
40
2 Ordered Log
2, 3 2, 3
Client Replica
Consensus - Data PCIe Network Leader SmartNIC
Prepare Handler Prepareok 1 Ordered Log 1 Request 1 Majority prepareok for request 1
41
2
2, 3 2, 3 3
Replica 2
Consensus - Data PCIe Network Leader SmartNIC
Prepare Handler Prepareok 2 Ordered Log 1 Request not sent to host Majority prepareok for request 2
42
2
2, 3 2, 3 3
Replica 2
Consensus - Data PCIe Network Leader SmartNIC
Prepare Handler Prepareok 1 Ordered Log 1 Request 1 & 2 Majority prepareok for request 1
43
2
3 2, 3 3
Replica 2
Consensus - Data PCIe Network Leader SmartNIC
Response Handler 1 Response
44
Commit Response Ordered Log 2 Update log meta-data
3 3
Client Replica
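The leader-side steps shown in the preceding figures (assign a sequence number and log the request, count prepareok messages, and hand requests to the host only in log order) can be sketched as below. This is a simplified model, not Dyad's code: `LeaderNic` and its methods are hypothetical names, and the sketch assumes the leader counts itself toward the majority.

```python
from typing import Dict, List, Set, Tuple


class LeaderNic:
    def __init__(self, n_replicas: int) -> None:
        self.quorum = n_replicas // 2 + 1   # majority of the group
        self.next_seq = 1
        self.log: Dict[int, str] = {}       # ordered log: seq -> request
        self.acks: Dict[int, Set[int]] = {}  # seq -> replicas that acked
        self.next_to_host = 1               # next seq the host may execute

    def on_request(self, request: str) -> Tuple[str, int, str]:
        # Request handler: assign a sequence number, log, emit a prepare.
        seq = self.next_seq
        self.next_seq += 1
        self.log[seq] = request
        self.acks[seq] = {0}  # assumption: the leader (id 0) counts itself
        return ("prepare", seq, request)

    def on_prepareok(self, seq: int, replica_id: int) -> List[Tuple[int, str]]:
        # Prepare handler: record the ack, then release every request that
        # has a majority AND whose predecessors were already released.
        if seq not in self.log:
            return []
        self.acks[seq].add(replica_id)
        ready: List[Tuple[int, str]] = []
        while (self.next_to_host in self.log
               and len(self.acks[self.next_to_host]) >= self.quorum):
            ready.append((self.next_to_host, self.log[self.next_to_host]))
            self.next_to_host += 1
        return ready


nic = LeaderNic(n_replicas=3)
nic.on_request("a")                 # seq 1
nic.on_request("b")                 # seq 2
first = nic.on_prepareok(2, 1)      # majority for request 2 arrives first
second = nic.on_prepareok(1, 2)     # majority for request 1 arrives later
```

In this run, the majority for request 2 releases nothing to the host, and the later majority for request 1 releases requests 1 and 2 together, matching the sequence in the figures.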
➢ Reduces latency by up to 76% and improves throughput by 5.8x
45
~2 Million messages processed on the NIC
Leader Replica 1 Replica 2 Client request response prepare prepareok
46
PCIe Protocol processing Context switch Application SmartNIC to host commit
➢ Logs ordered by the sequence number in prepare message
➢ Prepare messages are processed and dropped on the SmartNIC
➢ Commit messages forwarded to the host processor
➢ The request is appended to the commit message by SmartNIC
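A minimal sketch of the replica-side behavior in the bullets above: the prepare is logged and answered on the SmartNIC without reaching the host, and an in-order commit is forwarded to the host with the logged request attached. `ReplicaNic` and its methods are hypothetical names, and returning `None` for an out-of-order commit is an assumed policy that the slides do not specify.

```python
from typing import Dict, Optional, Tuple


class ReplicaNic:
    def __init__(self) -> None:
        self.log: Dict[int, str] = {}  # ordered by sequence number
        self.last_commit = 0

    def on_prepare(self, seq: int, request: str) -> Tuple[str, int]:
        # Log the request under its sequence number and reply from the NIC.
        # The prepare itself is dropped: nothing is sent to the host.
        self.log[seq] = request
        return ("prepareok", seq)

    def on_commit(self, seq: int) -> Optional[Tuple[str, int, str]]:
        # Verify commit order; assumed policy: hold back anything else.
        if seq != self.last_commit + 1 or seq not in self.log:
            return None
        self.last_commit = seq
        # Append the logged request so the host can execute it directly.
        return ("commit", seq, self.log[seq])


replica = ReplicaNic()
ack = replica.on_prepare(1, "set x")
replica.on_prepare(2, "set y")
early = replica.on_commit(2)    # arrives before commit 1
to_host = replica.on_commit(1)  # in order: forwarded with the request
```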
47
Consensus - Data PCIe Network Replica SmartNIC
Prepare Handler Prepare 2 1 Prepareok 2 Log request using sequence number
48
2 Ordered Log Leader Leader
Consensus - Data PCIe Network Replica SmartNIC
Commit Handler Commit 1 Ordered Log 1 Commit 1 Verify order of received commit
49
2 Leader
➢ Reduces latency by 30 μs
50
Timestamp server - 5 replicas
System        Consensus latency (μs)   % reduction
VR            350                      N/A
VR-batching   409                      N/A
Dyad-Leader   48                       86%
Dyad-All      17                       95%
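The reduction figures for the timestamp server can be reproduced from the latency values, assuming the baseline is plain VR (350 μs) rather than VR-batching. The helper name `percent_reduction` is chosen here for illustration.

```python
def percent_reduction(baseline_us: float, improved_us: float) -> float:
    """Relative latency reduction, in percent, against a baseline."""
    return (baseline_us - improved_us) / baseline_us * 100.0


VR_US = 350.0  # baseline assumed to be plain VR
dyad_leader = round(percent_reduction(VR_US, 48.0))
dyad_all = round(percent_reduction(VR_US, 17.0))
```

With these inputs the rounded results agree with the 86% and 95% entries reported for Dyad-Leader and Dyad-All.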
51
52
➢ Reduces CPU usage by up to 70% on the leader
Protocol Processing
Application BSD socket Linux epoll Host PCIe Network Replica SmartNIC
Consensus – Data
Consensus - Control
53
Protocol Processing
Application BSD socket Linux epoll Host PCIe Network Replica SmartNIC
Consensus – Data
Consensus - Control
54
92% of catastrophic failures - due to software [1]
[1] Simple Testing Can Prevent Most Critical Failures, OSDI’14
Fail-stop failure
Protocol Processing
Application BSD socket Linux epoll Host Network Replica SmartNIC
Consensus – Data
Consensus - Control
Response Request Host RTT
55
56
[Figure: Replica 1 (Leader), Replica 2, and Replica 3, each with Consensus, Protocol Processing, Application, BSD socket, Linux epoll, NIC, and PCIe, connected by the Data Center Network (μs RTT), with Client Requests arriving at the leader. Steps shown: Application Restart, Log Transfer.]
57
➢ Recover logs from the SmartNIC
➢ Recover remaining logs from other replicas
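The two recovery steps above can be sketched as a merge: take whatever the local SmartNIC still holds, then fetch only the missing entries from peers. This is a hedged illustration; `recover_log` and the dictionary-based log representation are assumptions, not Dyad's interfaces.

```python
from typing import Dict, List, Set, Tuple


def recover_log(
    nic_log: Dict[int, str],
    peer_logs: List[Dict[int, str]],
) -> Tuple[Dict[int, str], Set[int]]:
    """Rebuild the host log after an application restart.

    Returns the recovered log and the set of sequence numbers that had
    to be fetched from other replicas.
    """
    recovered = dict(nic_log)  # step 1: entries kept on the SmartNIC
    fetched: Set[int] = set()
    for peer in peer_logs:     # step 2: remaining entries from peers
        for seq, entry in peer.items():
            if seq not in recovered:
                recovered[seq] = entry
                fetched.add(seq)
    return recovered, fetched


log, from_peers = recover_log(
    nic_log={1: "a", 2: "b"},
    peer_logs=[{1: "a", 2: "b", 3: "c"}, {1: "a", 3: "c"}],
)
```

Only entry 3 crosses the network in this example, which is the intuition behind recovering from the SmartNIC first.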
58
59
➢ Dyad reduces recovery time by up to 67%
400MB of data received
Protocol Processing
Application BSD socket Linux epoll Host PCIe Network Replica SmartNIC
Consensus – Data
Consensus - Control
60
Protocol Processing
Application BSD socket Linux epoll Host PCIe Network Replica SmartNIC
Consensus – Data
Consensus - Control
61
8% of catastrophic failures - hardware faults and misconfigurations [1]
[1] Simple Testing Can Prevent Most Critical Failures, OSDI’14
➢ Detected on the host using heartbeat/client messages
➢ Existing VR recovery: fetch remaining logs from other replicas
➢ Existing VR recovery: fetch logs from other replicas
➢ Dyad supports logging to disk from host (Raft)
62
➢ Uses TCP connections to replicas
➢ TCP stack specifically decodes Raft headers and payload
➢ Host application logs client commands to disk for persistence
63
➢ Improves latency by up to 62%
64
➢ Enable consensus for Memcached
    ■ ~100 lines of code for data operations on replica
➢ Evaluate impact on latency and throughput
65
66
➢ Provides consensus with ~7% reduction in throughput
67
➢ Provides consensus with ~16% increase in latency
68
69
➢ Abstraction in kernel, user space, and SmartNIC
➢ Kernel mechanism for TLB shootdown
70
System abstractions and optimizations are needed at different levels of the software stack to reduce the latency and improve the throughput of current data-center applications.
71
72
73
74
75
76
77
System    Latency (μs)   Mitigation
mTCP      ~23            Batching
IX        ~12            Batching
Arrakis   ~2.6 - 6.3     None
78
79
80
81
82
➢ Sequence numbers are available in prepareok message
➢ Multi-thread execution by using the sequence number
➢ Requests are ordered without containing the sequence number
➢ SmartNIC appends the sequence number to the client request
83
➢ Improves throughput by up to 2.1x
84
85
➢ Log read throughput ~256 MB with 16 threads
86
87
88
89
Replicas   Log size (MB)   Data transferred (MB)
3          100             200
5          100             400
7          100             600
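The three rows above are consistent with the recovering replica receiving one full copy of the log from each of the other replicas. That relationship is a reading of the table, not something the slide states, and the function name below is hypothetical.

```python
def recovery_transfer_mb(replicas: int, log_size_mb: int) -> int:
    """Data received during recovery, assuming one full log copy
    arrives from every other replica in the group."""
    return (replicas - 1) * log_size_mb


rows = [(n, recovery_transfer_mb(n, 100)) for n in (3, 5, 7)]
```

Under this assumption, the transferred volume grows linearly with group size even though the log itself stays at 100 MB.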
90
91
➢ RTT = ~96 μs
92
93
94
95
Consensus Protocol Processing Application BSD socket Linux epoll SmartNIC PCIe Consensus Protocol Processing Application BSD socket Linux epoll SmartNIC PCIe Consensus Protocol Processing Application BSD socket Linux epoll SmartNIC PCIe
Data Center Network (μs RTT)
Replica 1 - Leader Replica 2 Replica 3 Client Requests
Application Restart
96
2 1 2 1 2 1 3 3 1, 2
Consensus Protocol Processing Application BSD socket Linux epoll NIC PCIe Consensus Protocol Processing Application BSD socket Linux epoll NIC PCIe
Data Center Network (μs RTT)
Replica 1 - Leader Replica 2 Replica 3 Client Requests
Application Restart
Log Transfer
97
2 1 3 2 1 3
Consensus Protocol Processing BSD socket Linux epoll NIC PCIe Application
2 1 3
Log Transfer
3 3
98
99
➢ Improves latency by up to 46%
100
➢ Possible with NVMe over fabric
➢ Possible over PCIe? - ARM, FPGA, or NPU
➢ Sending parsed requests to host
101