Taming Latency In Data Center Applications - Ph.D. Defense of Dissertation (PowerPoint PPT Presentation)

slide-1
SLIDE 1

Taming Latency In Data Center Applications

Ph.D. Defense of Dissertation Mohan Kumar Advisor: Taesoo Kim

1

slide-2
SLIDE 2

Motivation: Importance of Latency

2

Latency Critical In Data Center Applications

slide-3
SLIDE 3

Data Center Applications

Key-value Stores Web Servers Distributed Services

3

slide-4
SLIDE 4
  • Optimized network – microsecond round-trip time
  • Moving from 10/25 Gbps to 100/200 Gbps networks
  • Software running in servers induces high latency:

66% of the inter-rack latency [1]

81% of the intra-rack latency [1]

Contemporary Data Center Characteristics

4

[1] Network requirements for resource disaggregation, OSDI’16

slide-5
SLIDE 5

Data Center Applications - Server Latency

Key-value Stores Web Servers Distributed Services

5

Protocol stack - 80% overhead
TLB shootdown - 30% overhead
Consensus - 82% overhead

slide-6
SLIDE 6

System abstractions and optimizations are needed at different levels of the software stack, from the software services running in user space and the kernel to the software running on SmartNICs, to reduce the latency and improve the throughput of current data-center applications.

Thesis Statement

6

slide-7
SLIDE 7

Data Center Applications - Server Latency

Key-value Stores Web Servers Distributed Services

7

Protocol stack - 80% overhead (Xps)
TLB shootdown - 30% overhead (LATR)
Consensus - 82% overhead (Dyad)

slide-8
SLIDE 8
  • Xps - Extensible Protocol Stack:

Abstraction in kernel and user-space protocol stacks, and SmartNICs

Reduces Redis latency by up to 73.3%

  • LATR - Lazy Translation Coherence:

Kernel mechanism for lazy TLB coherence during free operations, page migration, and swapping

Reduces Apache latency by up to 26.1%

8

Taming Application Latency - Thesis

slide-9
SLIDE 9
  • Dyad - Untangling Logically-Coupled Consensus:

Abstraction in SmartNIC for consensus

Reduces timestamp server latency by up to 79%

9

Taming Application Latency - Thesis

slide-10
SLIDE 10

Dyad: Untangling Logically-Coupled Consensus

10

slide-11
SLIDE 11

Motivation - Consensus Algorithms

11

Failures are inevitable and expensive

slide-12
SLIDE 12
  • Consensus Algorithms:

Provide high availability via state machine replication

Keep data consistent - linearizable

Consensus algorithms:

Multi-Paxos/Viewstamp Replication (VR)

Raft and Zookeeper Atomic Broadcast (ZAB)

Consensus Algorithms

12

slide-13
SLIDE 13

Consensus Algorithms - Applications

➢ Timestamp Servers
➢ Key-value stores
➢ Databases
➢ Lock managers

13

Distributed Services

slide-14
SLIDE 14
  • Background
  • Overview
  • Design and Evaluation
  • Conclusion

Dyad: Untangling Logically-Coupled Consensus

14

slide-15
SLIDE 15

Consensus – VR Data Operation

Replica 1/ Leader Replica 2 Replica 3 Client request response prepare prepareok exec()

  • 1. Ordering
  • 2. Replication

Application Consensus

  • 3. Ordered execution

15

commit
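The three numbered steps on this slide (ordering, replication, ordered execution) can be sketched as a minimal VR leader loop. This is an illustrative model only; the class, method, and message names below are assumptions for the sketch, not the dissertation's implementation.

```python
# Minimal sketch of the VR leader's data operation:
# 1. order the request, 2. replicate until a majority of prepareok messages
# arrive, 3. execute in order and commit.

class VRLeader:
    def __init__(self, num_replicas):
        self.num_replicas = num_replicas      # replica count, including leader
        self.log = []                         # ordered request log
        self.acks = {}                        # seq -> prepareok count

    def on_request(self, request):
        """Step 1: assign the next sequence number and log the request."""
        seq = len(self.log)
        self.log.append(request)
        self.acks[seq] = 1                    # the leader counts as one ack
        return ('prepare', seq, request)      # broadcast to the replicas

    def on_prepare_ok(self, seq):
        """Step 2: count prepareok; on majority, execute and commit."""
        self.acks[seq] += 1
        if self.acks[seq] == self.num_replicas // 2 + 1:
            result = self.execute(self.log[seq])   # step 3: ordered execution
            return ('response', seq, result), ('commit', seq)
        return None

    def execute(self, request):
        return 'done:' + request
```

With three replicas, a single prepareok from any one replica forms a majority together with the leader and triggers both the client response and the commit broadcast.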

slide-16
SLIDE 16

Consensus – ZAB or Raft Data Operation

Replica 2 Replica 3 Client request response commit exec()

  • 1. Ordering
  • 2. Replication

Application Consensus

  • 3. Ordered execution

16

propose

TCP

ack Disk Disk Disk Replica 1/ Leader

slide-17
SLIDE 17

Replicas in a Data Center

Consensus Protocol Processing Application BSD socket Linux epoll NIC PCIe Consensus Protocol Processing Application BSD socket Linux epoll NIC PCIe Consensus Protocol Processing Application BSD socket Linux epoll NIC PCIe

Data Center Network (μs RTT)

Replica 1 - Leader Replica 2 Replica 3 Client Requests

17

slide-18
SLIDE 18

Logically-Coupled Consensus

Consensus Protocol Processing

Application BSD socket Linux epoll NIC Host PCIe Network Replica

~0.8 μs [1] ~10 μs

18

[1] Understanding PCIe performance for end host, SIGCOMM’18

slide-19
SLIDE 19

Leader Replica 1 Replica 2 Client request response prepare prepareok commit

  • 1. Ordering
  • 2. Replication
  • 3. Ordered execution

Consensus – VR Data Operation

19

PCIe Protocol processing Context switch Application ~11 μs

slide-20
SLIDE 20

Consensus – Direct Cost of Latency

Leader Replica 1 Replica 2 Client prepare prepareok

  • 1. Ordering
  • 2. Replication
  • 3. Ordered execution

20

PCIe Protocol processing Context switch Application

System   Direct
VR       67 μs

slide-21
SLIDE 21

Consensus – Indirect Cost of Latency

Leader Replica 1 Replica 2 Client prepareok

  • 1. Ordering
  • 2. Replication
  • 3. Ordered execution

21

PCIe Protocol processing Context switch Application request request commit

System   Direct   Indirect
VR       62 μs    85 μs

Consensus - high system overhead due to direct and indirect cost

slide-22
SLIDE 22

Consensus Latency - Increasing Replicas

22

Consensus latency is up to 82% of the end-to-end latency

slide-23
SLIDE 23
  • Data Operation:

Critical path for handling a client request

  • Control Operations:

Recovery – application recovery after failure

View Change – new replicas joining/leaving the group, new leader

Heartbeats – health status messages exchanged across replicas

Consensus Operations

23

slide-24
SLIDE 24
  • Every client request has high consensus overhead
  • Consensus algorithms share resources with application
  • Consensus overhead increases with increasing replicas

Cost of Consensus - Summary

24

slide-25
SLIDE 25
  • Network approaches:

NoPaxos, Speculative Paxos - rely on the network to order requests

NetPaxos - proposal to execute Paxos in programmable switches

  • Hardware approach:

Logically coupled consensus in hardware (FPGA)

Application is limited by the resources available on FPGA

Consensus - Existing Research

Rely on Network Guarantees / Logically Coupled Consensus

25

slide-26
SLIDE 26
  • Background
  • Overview
  • Design and Evaluation
  • Conclusion

Dyad: Untangling Logically-Coupled Consensus

26

slide-27
SLIDE 27

Logically-Coupled Consensus

Consensus Protocol Processing

Application BSD socket Linux epoll NIC Host PCIe Network

27

Consensus - Control

slide-28
SLIDE 28

Protocol Processing

Application BSD socket Linux epoll Host PCIe Network Replica SmartNIC

Consensus – Data

Dyad: Untangling Logically-Coupled Consensus

Consensus - Control

28

slide-29
SLIDE 29

Consensus Protocol Processing

Application BSD socket Linux epoll NIC PCIe

29

Dyad: Untangling Logically-Coupled Consensus

Logically-Coupled Consensus

Protocol Processing

Application BSD socket Linux epoll PCIe SmartNIC

Consensus – Data Consensus - Control

Dyad Consensus

slide-30
SLIDE 30
  • Data Operation - SmartNIC:

Critical path for handling a client request

  • Control Operations - Host:

Recovery – application recovery after failure

View Change – new replicas joining/leaving the group, new leader

Heartbeats – health status messages exchanged across replicas

Dyad: Classifying Consensus Operations

30

slide-31
SLIDE 31
  • Background
  • Overview
  • Design and Evaluation
  • Conclusion

Dyad: Untangling Logically-Coupled Consensus

31

slide-32
SLIDE 32

Protocol Processing

Application BSD socket Linux epoll Host PCIe Network Replica SmartNIC

Consensus – Data

Dyad: Data Operations

Consensus - Control

32

slide-33
SLIDE 33

Replica 2 Replica 3 Client request response prepare prepareok

  • 1. Ordering
  • 2. Replication
  • 3. Ordered execution

Dyad – Viewstamp Replication (VR)

33

PCIe Protocol processing Context switch Application SmartNIC to host 2.6 μs 3.5 μs 1.7 μs 3 μs 3 μs 0.1 μs 3 μs commit 1.5 μs 1.5 μs Replica 1/ Leader

slide-34
SLIDE 34

Replica 2 Replica 3 Client prepare prepareok

  • 1. Ordering
  • 2. Replication
  • 3. Ordered execution

Dyad – Direct Cost

34

SmartNIC to host 2.6 μs 3.5 μs 3 μs 1.7 μs

System        Direct    Indirect
VR            67 μs     85 μs
Dyad          12.8 μs
% Reduction   81%

Replica 1/ Leader

slide-35
SLIDE 35

Replica 2 Replica 3 Client prepareok

  • 1. Ordering
  • 2. Replication
  • 3. Ordered execution

Dyad – Indirect Cost

35

PCIe Protocol processing Context switch Application SmartNIC 3 μs 0.1 μs commit 1.5 μs 1.5 μs

System        Direct    Indirect
VR            62 μs     85 μs
Dyad          12.7 μs   6.1 μs
% Reduction   81%       92%

Replica 1/ Leader

Direct and indirect cost reduced by Dyad

slide-36
SLIDE 36
  • Hardware Filtering:

Specify packet format in domain-specific language (P4)

Filter messages based on the header and payload

Filters are applied to messages coming from the network and the host

Dyad: SmartNIC Primitives

36
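A software analogue of the hardware-filtering primitive might look like the following. The port number, opcode bytes, and handler names here are invented for illustration; in Dyad the packet format is specified in P4 and matched in NIC hardware.

```python
# Sketch of header/payload filtering: classify each message so that only
# matching ones are handed to the packet-processing handlers; everything
# else passes through untouched.

CONSENSUS_PORT = 12345          # assumed port carrying consensus traffic

def classify(packet):
    """Return which handler a packet should invoke, or None to pass through."""
    if packet.get('dport') != CONSENSUS_PORT:
        return None                          # not consensus traffic
    opcode = packet.get('payload', b'')[:1]  # first payload byte = message type
    return {b'R': 'request_handler',
            b'P': 'consensus_handler',       # prepare / prepareok
            b'C': 'consensus_handler',       # commit
            b'A': 'response_handler'}.get(opcode)
```

The same filter is applied in both directions, to messages arriving from the network and to messages coming down from the host.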

slide-37
SLIDE 37
  • Packet Processing:

Filtered messages invoke request/consensus/response handlers

Handlers drop/forward/modify a packet

Generate new packets

Dyad: SmartNIC Primitives

37

slide-38
SLIDE 38

PCIe Network

Dyad: SmartNIC Primitives

38

Ingress H/W filter (P4) C Handlers Memory Egress H/W filter (P4)

slide-39
SLIDE 39

Leader Replica 1 Replica 2 Client request response prepare prepareok

  • 1. Ordering
  • 2. Replication
  • 3. Ordered execution

Dyad - Leader Data Operations

39

PCIe Protocol processing Context switch Application SmartNIC to host commit

slide-40
SLIDE 40

Consensus - Data PCIe Network Leader SmartNIC

Dyad: Ordering on Leader SmartNIC

Request Handler Request 1 Prepare 2 Assign sequence number and Log

40

2 Ordered Log

2, 3 2, 3

Client Replica

slide-41
SLIDE 41

Consensus - Data PCIe Network Leader SmartNIC

Dyad: Replication on Leader SmartNIC

Prepare Handler Prepareok 1 Ordered Log 1 Request 1 Majority prepareok for request 1

41

2

2, 3 2, 3 3

Replica 2

slide-42
SLIDE 42

Consensus - Data PCIe Network Leader SmartNIC

Dyad: Reordered Consensus Message

Prepare Handler Prepareok 2 Ordered Log 1 Request not sent to host Majority prepareok for request 2

42

2

2, 3 2, 3 3

Replica 2

slide-43
SLIDE 43

Consensus - Data PCIe Network Leader SmartNIC

Dyad: Reordered Consensus Message

Prepare Handler Prepareok 1 Ordered Log 1 Request 1 & 2 Majority prepareok for request 1

43

2

3 2, 3 3

Replica 2
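The reordered-message handling shown above requires the leader SmartNIC to deliver requests to the host in sequence order even when prepareok messages reach a majority out of order: request 2 is held back until request 1 has also reached a majority. A minimal sketch of that gating logic (all names and the majority pre-count are assumptions):

```python
# Hold back requests that reach a majority out of order, releasing them to
# the host strictly by sequence number.

class OrderedDelivery:
    def __init__(self, majority):
        self.majority = majority
        self.acks = {}            # seq -> prepareok count (leader pre-counted)
        self.next_seq = 0         # next sequence number owed to the host

    def on_prepare_ok(self, seq):
        """Count an ack; return the requests now releasable, in order."""
        self.acks[seq] = self.acks.get(seq, 1) + 1
        released = []
        while self.acks.get(self.next_seq, 0) >= self.majority:
            released.append(self.next_seq)
            self.next_seq += 1
        return released
```

If the prepareok for sequence number 1 arrives first, nothing is sent to the host; once sequence number 0 reaches a majority, both are released together, in order.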

slide-44
SLIDE 44

Consensus - Data PCIe Network Leader SmartNIC

Dyad: Response and Commit

Response Handler 1 Response

44

Commit Response Ordered Log 2 Update log meta-data

3 3

Client Replica

slide-45
SLIDE 45

Dyad: Timestamp Server with 5 replicas

➢ Reduces latency by up to 76% and improves throughput by 5.8x

45

~2 Million messages processed on the NIC

slide-46
SLIDE 46

Leader Replica 1 Replica 2 Client request response prepare prepareok

  • 1. Ordering
  • 2. Replication
  • 3. Ordered execution

Dyad – Replica Data Operations

46

PCIe Protocol processing Context switch Application SmartNIC to host commit

slide-47
SLIDE 47
  • Ordering and Logging:

Logs are ordered by the sequence number in the prepare message

Prepare messages are processed and dropped on the SmartNIC

  • Ordered Execution:

Commit messages forwarded to the host processor

The request is appended to the commit message by SmartNIC

Dyad: Ordering on Replica SmartNIC

47
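The replica-side bullets above can be sketched as follows. The class, the dict-based log, and the message tuples are illustrative assumptions; the point is that prepare messages never cross the PCIe bus, while commit messages reach the host with the logged request appended.

```python
# Sketch of the replica SmartNIC's data path: prepare messages are logged by
# sequence number and consumed on the NIC; commit messages are forwarded to
# the host with the logged request appended for ordered execution.

class ReplicaNIC:
    def __init__(self):
        self.log = {}             # seq -> request, keyed by sequence number

    def on_prepare(self, seq, request):
        """Log the request and answer the leader; the packet stops here."""
        self.log[seq] = request
        return ('prepareok', seq)

    def on_commit(self, seq):
        """Verify the commit refers to a logged request, then involve the host."""
        if seq not in self.log:
            return None                    # unknown sequence number: not forwarded
        return ('to_host', seq, self.log[seq])
```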

slide-48
SLIDE 48

Consensus - Data PCIe Network Replica SmartNIC

Dyad: Logging on Replica SmartNIC

Prepare Handler Prepare 2 1 Prepareok 2 Log request using sequence number

48

2 Ordered Log Leader Leader

slide-49
SLIDE 49

Consensus - Data PCIe Network Replica SmartNIC

Dyad: Ordered Execution on the Replica

Commit Handler Commit 1 Ordered Log 1 Commit 1 Verify order of received commit

49

2 Leader

slide-50
SLIDE 50

Dyad: Timestamp Server with 5 replicas

➢ Reduces latency by 30 μs

50

slide-51
SLIDE 51

Dyad: Consensus Latency

System        Consensus latency (μs)   % reduction
VR            350                      N/A
VR-batching   409                      N/A
Dyad-Leader   48                       86%
Dyad-All      17                       95%

Timestamp server - 5 replicas

51

slide-52
SLIDE 52

Dyad: CPU Usage Timestamp Server

52

➢ Reduces CPU usage by up to 70% on the leader

slide-53
SLIDE 53

Protocol Processing

Application BSD socket Linux epoll Host PCIe Network Replica SmartNIC

Consensus – Data

Dyad: Control Operations

Consensus - Control

53

slide-54
SLIDE 54

Protocol Processing

Application BSD socket Linux epoll Host PCIe Network Replica SmartNIC

Consensus – Data

Dyad: Application Failures

Consensus - Control

54

92% catastrophic failure - due to software [1]

[1] Simple Testing Can Prevent Most Critical Failures, OSDI’14

Fail-stop failure

slide-55
SLIDE 55

Protocol Processing

Application BSD socket Linux epoll Host Network Replica SmartNIC

Consensus – Data

Dyad: Detecting Application Failures

Consensus - Control

Response Request Host RTT

55

slide-56
SLIDE 56
  • Measure the host RTT for each request
  • Compute a weighted average of host RTTs
  • Detect failure - response not received within the host RTT threshold

Dyad: Detecting Application Failures

56
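The detection rule above can be sketched as an exponentially weighted moving average of host RTTs with a multiplicative threshold. The weight and the threshold factor below are placeholder assumptions, not the dissertation's tuned values.

```python
# Sketch of host-failure detection: keep a weighted average of measured host
# RTTs and flag a failure when a pending request exceeds a threshold multiple
# of that average.

class HostRttDetector:
    def __init__(self, alpha=0.125, threshold=4.0):
        self.alpha = alpha            # weight given to the newest sample
        self.threshold = threshold    # multiple of the average treated as failure
        self.avg = None               # weighted-average host RTT (μs)

    def record(self, rtt_us):
        """Fold one measured host RTT into the weighted average."""
        if self.avg is None:
            self.avg = rtt_us
        else:
            self.avg = (1 - self.alpha) * self.avg + self.alpha * rtt_us

    def is_failure(self, elapsed_us):
        """True if a still-pending request has exceeded the RTT threshold."""
        return self.avg is not None and elapsed_us > self.threshold * self.avg
```

Because the average adapts to the observed host RTT, a temporarily slow host raises the threshold rather than immediately triggering recovery, which keeps false positives down.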

slide-57
SLIDE 57

Application Recovery - VR

Consensus Protocol Processing Application BSD socket Linux epoll NIC PCIe Consensus Protocol Processing Application BSD socket Linux epoll NIC PCIe Consensus Protocol Processing Application BSD socket Linux epoll NIC PCIe

Data Center Network (μs RTT)

Replica 1 - Leader Replica 2 Replica 3 Client Requests

Application Restart

Log Transfer

57

slide-58
SLIDE 58
  • Recovery using logs on SmartNIC
  • Two stage recovery:

Recover logs from the SmartNIC

Recover remaining logs from other replicas

Dyad: Application Recovery

58
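The two-stage recovery above can be sketched schematically: stage one replays whatever survives in the SmartNIC log, and stage two fetches only the missing suffix from the other replicas. The function signature and the dict-based log are illustrative assumptions.

```python
# Sketch of Dyad's two-stage application recovery: local SmartNIC log first,
# then only the gap from remote replicas.

def recover(nic_log, fetch_from_replicas, committed_upto):
    """nic_log: {seq: request} surviving on the local SmartNIC.
    fetch_from_replicas(missing_seqs) -> {seq: request} for the rest."""
    missing = [s for s in range(committed_upto + 1) if s not in nic_log]
    remote = fetch_from_replicas(missing)          # stage 2: fetch only the gap
    state = {}
    for seq in range(committed_upto + 1):          # replay strictly in order
        state[seq] = nic_log.get(seq, remote.get(seq))
    return state
```

Transferring only the missing entries, rather than the whole log, is what cuts the recovery data transfer relative to plain VR recovery.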

slide-59
SLIDE 59

Dyad: Application Recovery

59

➢ Dyad reduces recovery time by up to 67%

400MB of data received

slide-60
SLIDE 60

Protocol Processing

Application BSD socket Linux epoll Host PCIe Network Replica SmartNIC

Consensus – Data

Dyad: SmartNIC Failure

Consensus - Control

60

slide-61
SLIDE 61

Protocol Processing

Application BSD socket Linux epoll Host PCIe Network Replica SmartNIC

Consensus – Data

Dyad: System Failure

Consensus - Control

61

8% - hardware faults, misconfigs [1]

[1] Simple Testing Can Prevent Most Critical Failures, OSDI’14

slide-62
SLIDE 62
  • SmartNIC Failure:

Detected on the host using heartbeat/client messages

Existing VR recovery: fetch remaining logs from other replicas

  • System Failure:

Existing VR recovery: fetch logs from other replicas

Dyad supports logging to disk from host (Raft)

Dyad: System Recovery

62

slide-63
SLIDE 63
  • Dyad Supports Raft:

Using TCP connections to the replicas

The TCP stack specifically decodes Raft headers and payloads

Host application logs client commands to disk for persistence

Dyad: Reliable Connection

63

slide-64
SLIDE 64

Dyad: Raft Latency

➢ Improves latency by up to 62%

64

slide-65
SLIDE 65
  • Memcached:

Enable consensus for Memcached

~100 lines of code for data operations on the replica

Evaluate impact on latency and throughput

Dyad: Ease of Use

65

slide-66
SLIDE 66

Dyad: Memcached Throughput

66

➢ Provides consensus with ~7% reduction in throughput

slide-67
SLIDE 67

Dyad: Memcached Latency

67

➢ Provides consensus with ~16% increase in latency

slide-68
SLIDE 68
  • Motivation
  • Background
  • Overview
  • Design and Evaluation
  • Conclusion

Dyad: Untangling Logically-Coupled Consensus

68

slide-69
SLIDE 69
  • SmartNIC abstraction for consensus
  • Data operations performed on the SmartNIC
  • Control operations performed on the Host
  • Enables consensus as a service on SmartNICs

Dyad: Conclusion

69

slide-70
SLIDE 70
  • Xps - Extensible Protocol Stack:

Abstraction in kernel, user space, and SmartNIC

  • LATR - Lazy TLB Shootdown:

Kernel mechanism for TLB shootdown

Thesis: Conclusion

70

System abstractions and optimizations are needed at different levels of the software stack to reduce the latency and improve the throughput of current data-center applications.

slide-71
SLIDE 71

Thank you!

71

slide-72
SLIDE 72

Backup Slides

72

slide-73
SLIDE 73

Arrakis

73

slide-74
SLIDE 74

Redis comparison with Arrakis

74

slide-75
SLIDE 75

Latr - Apache

75

slide-76
SLIDE 76

Latr - Apache latency

76

slide-77
SLIDE 77

User-Space Stacks

77

slide-78
SLIDE 78

User Space: Protocol processing

System    Latency (μs)   Mitigation
mTCP      ~23            Batching
IX        ~12            Batching
Arrakis   ~2.6 - 6.3     None

78

slide-79
SLIDE 79

VR: IX batching with 3 Replicas

79

slide-80
SLIDE 80

Context Switch

80

slide-81
SLIDE 81

VR - Leader Context Switch

81

slide-82
SLIDE 82

Dyad - Parallelism

82

slide-83
SLIDE 83
  • Without SmartNIC:

Sequence numbers are available in prepareok message

Multi-thread execution by using the sequence number

  • Dyad:

Requests are ordered without containing the sequence number

SmartNIC appends the sequence number to the client request

Dyad: Application Parallelism

83
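The parallelism argument above can be sketched as follows: once the SmartNIC appends a sequence number to each client request, the host can shard execution across worker threads while each shard still observes its requests in sequence order. The modulo-by-key sharding policy below is an assumption for the sketch.

```python
# Sketch of sequence-number-based parallelism: requests for independent keys
# go to different worker threads, and the NIC-appended sequence number keeps
# each worker's queue in total order.

def shard(requests_with_seq, num_workers):
    """Partition (seq, key, op) tuples across workers by key, keeping each
    worker's queue sorted by the SmartNIC-appended sequence number."""
    queues = [[] for _ in range(num_workers)]
    for seq, key, op in sorted(requests_with_seq):
        queues[key % num_workers].append((seq, key, op))
    return queues
```

Without the SmartNIC, the sequence number only becomes known from the prepareok message; with Dyad it already rides on the request, so the host never has to serialize on consensus state to dispatch work.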

slide-84
SLIDE 84

Dyad: Parallelism Timestamp Server

➢ Improves throughput by up to 2.1x

84

slide-85
SLIDE 85

Reading Logs

85

slide-86
SLIDE 86

Dyad: Log Read Throughput

➢ Log read throughput of ~256 MB/s with 16 threads

86

slide-87
SLIDE 87

Direct Cost Formula

87

slide-88
SLIDE 88

Cost of Consensus - Direct and Indirect

Consensus overhead increases with increasing replicas

88

slide-89
SLIDE 89

VR Recovery Data Transfer

89

slide-90
SLIDE 90

Application Recovery - VR data transfer

Replicas   Log Size (MB)   Data transferred (MB)
3          100             200
5          100             400
7          100             600

90

slide-91
SLIDE 91

False Positives RTT

91

slide-92
SLIDE 92

Dyad: False Positives with Timestamp Server

➢ RTT = ~96 μs

92

slide-93
SLIDE 93

SmartNIC - Netronome

93

slide-94
SLIDE 94

SmartNIC: Memory Hierarchy and Latency

94

slide-95
SLIDE 95

Recovery Example

95

slide-96
SLIDE 96

Dyad - Recovery Phase1

Consensus Protocol Processing Application BSD socket Linux epoll SmartNIC PCIe Consensus Protocol Processing Application BSD socket Linux epoll SmartNIC PCIe Consensus Protocol Processing Application BSD socket Linux epoll SmartNIC PCIe

Data Center Network (μs RTT)

Replica 1 - Leader Replica 2 Replica 3 Client Requests

Application Restart

96

2 1 2 1 2 1 3 3 1, 2

slide-97
SLIDE 97

Dyad - Recovery Phase2

Consensus Protocol Processing Application BSD socket Linux epoll NIC PCIe Consensus Protocol Processing Application BSD socket Linux epoll NIC PCIe

Data Center Network (μs RTT)

Replica 1 - Leader Replica 2 Replica 3 Client Requests

Application Restart

Log Transfer

97

2 1 3 2 1 3

Consensus Protocol Processing BSD socket Linux epoll NIC PCIe Application

2 1 3

Log Transfer

3 3

slide-98
SLIDE 98

Raft - Logging to Disk

98

slide-99
SLIDE 99

Dyad: Raft Latency with disk logging

99

➢ Improves latency by up to 46%

slide-100
SLIDE 100

Dyad - Future Work

100

slide-101
SLIDE 101
  • Logging to disk from SmartNIC:

Possible with NVMe over fabric

Possible over PCIe? - ARM, FPGA, or NPU

  • Optimize request handling:

Sending parsed requests to host

Dyad: Future Work

101