When NVMe over Fabrics Meets Arm: Performance and Implications
Yichen Jia* Eric Anger† Feng Chen*
*Louisiana State University †Arm Inc
MSST '19, May 23, 2019

Outline: Background, Experimental Setup, Experimental …
NVMe Structure*
*https://nvmexpress.org/about/nvm-express-overview/
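The diagram at the URL above shows NVMe's core data structure: paired submission and completion queues, typically one pair per core. As a purely illustrative aid (not the paper's code), here is a toy Python model of that producer/consumer pairing:

```python
from collections import deque

class NvmeQueuePair:
    """Toy model of one NVMe submission/completion queue pair.

    Real NVMe uses ring buffers in host memory plus doorbell
    registers; this sketch only mimics the producer/consumer flow.
    """
    def __init__(self, depth=128):
        self.depth = depth
        self.sq = deque()   # submission queue (host -> device)
        self.cq = deque()   # completion queue (device -> host)

    def submit(self, cmd):
        if len(self.sq) >= self.depth:
            raise RuntimeError("submission queue full")
        self.sq.append(cmd)          # host would ring the SQ doorbell here

    def device_poll(self):
        while self.sq:
            cmd = self.sq.popleft()  # device fetches the command
            self.cq.append({"cid": cmd["cid"], "status": 0})

    def reap(self):
        while self.cq:
            yield self.cq.popleft()  # host consumes completions

qp = NvmeQueuePair()
qp.submit({"cid": 1, "opcode": "READ", "lba": 0, "nblocks": 8})
qp.device_poll()
print(list(qp.reap()))
```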
NVMe over Fabrics

[Architecture diagram]
Host side (kernel): Application → File System → Block Layer → NVMe Fabrics Initiator → RDMA Driver; (HW): NIC + RDMA. The local path (Block Layer → NVMe PCIe Driver → PCI Express → NVMe SSD) is shown for comparison.
Target side (kernel): RDMA Driver → NVMe Fabrics Target → NVMe PCIe Driver; (HW): NIC + RDMA, PCI Express, NVMe SSD.
The two sides are connected over Ethernet.
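On Linux, the initiator side of this stack is typically exercised with nvme-cli. A minimal sketch, assuming nvme-cli is installed; the target address and subsystem NQN below are hypothetical placeholders:

```python
import subprocess

TARGET_ADDR = "10.0.0.2"                      # hypothetical target IP
TARGET_NQN = "nqn.2019-05.io.example:nvme1"   # hypothetical subsystem NQN

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Discover subsystems exported by the target over RDMA (RoCE), port 4420.
run(["nvme", "discover", "-t", "rdma", "-a", TARGET_ADDR, "-s", "4420"])

# Connect; the remote namespace then shows up as a local block device
# (e.g., /dev/nvme1n1) served by the NVMe Fabrics initiator.
run(["nvme", "connect", "-t", "rdma", "-n", TARGET_NQN,
     "-a", TARGET_ADDR, "-s", "4420"])
```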
A thorough performance study of NVMe over Fabrics (NVMeoF) on Arm servers is becoming necessary.
RoCEv2 Performance

Server/Client       Arm/x86   x86/Arm
Bandwidth (Gb/s)      45.42     45.40
Latency (µs)           3.26      3.17
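Raw RoCEv2 numbers like these are commonly gathered with the perftest suite (ib_send_bw, ib_send_lat). A minimal sketch of driving the client side from Python, assuming perftest is installed and the peer is already running the matching server process:

```python
import subprocess

SERVER_IP = "10.0.0.2"   # hypothetical address of the RDMA peer

def client(tool):
    """Run a perftest client against SERVER_IP and return its report."""
    out = subprocess.run([tool, SERVER_IP], check=True,
                         capture_output=True, text=True)
    return out.stdout

# Bandwidth test (the peer must be running `ib_send_bw` with no args).
print(client("ib_send_bw"))

# Latency test (the peer must be running `ib_send_lat` with no args).
print(client("ib_send_lat"))
```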
Sequential Read: Varying Concurrent Jobs*

1. Latency increases as the number of jobs increases.
2. NVMeoF shows a close or even shorter tail latency than local NVMe for sequential reads.
3. Bandwidth grows linearly and then plateaus once the job count reaches 4.
4. CPU utilization on the target side is much lower than on the host side.
5. The Arm platform is powerful enough to serve as a storage server.

*Sequential Read, 4 KB, 128 IODepth, 1-16 concurrent jobs
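The footnote's workload maps directly onto fio parameters. A minimal sketch of the job-count sweep, assuming fio is installed and /dev/nvme0n1 is the device under test (local or NVMeoF-attached):

```python
import json
import subprocess

DEVICE = "/dev/nvme0n1"   # assumed device under test

def fio_read_bw(numjobs):
    """Run the slide's workload: 4 KB sequential reads, iodepth 128."""
    out = subprocess.run(
        ["fio", "--name=seqread", f"--filename={DEVICE}",
         "--rw=read", "--bs=4k", "--iodepth=128",
         f"--numjobs={numjobs}", "--ioengine=libaio", "--direct=1",
         "--time_based", "--runtime=30", "--group_reporting",
         "--output-format=json"],
        check=True, capture_output=True, text=True)
    job = json.loads(out.stdout)["jobs"][0]
    return job["read"]["bw"] / 1024  # fio reports KiB/s; convert to MiB/s

for jobs in (1, 2, 4, 8, 16):
    print(f"{jobs:2d} jobs: {fio_read_bw(jobs):8.1f} MiB/s")
```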
Random Write: Latency and Bandwidth*

[Figure: NVMeoF vs. local NVMe; annotated values of 26.9% and 31.5%]

*Random Write, 4 KB-128 KB, 8 concurrent jobs, 128 IODepth
NVMe and RDMA Queues
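In the fabrics stack, each NVMe I/O queue on the initiator is carried over its own RDMA queue pair, and the Linux initiator allocates roughly one I/O queue per CPU core by default. A toy sketch of that 1:1 mapping, purely illustrative:

```python
import os

def build_queue_map(num_io_queues=None):
    """Toy model: bind one NVMe I/O queue (and hence one RDMA queue
    pair) to each CPU core. Returns {core: (nvme_queue_id, rdma_qp_id)}."""
    cores = num_io_queues or os.cpu_count()
    # Queue 0 is the admin queue; I/O queues start at 1.
    return {core: (core + 1, core + 1) for core in range(cores)}

for core, (nvme_q, rdma_qp) in build_queue_map(4).items():
    print(f"core {core}: NVMe I/O queue {nvme_q} <-> RDMA QP {rdma_qp}")
```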
Sequential Read: Varying Queue Depth*

[Figure: local NVMe vs. NVMeoF latency and bandwidth as the queue depth grows]

*Sequential Read, 4 KB, 8 concurrent jobs, 1-128 IODepth, one Arm core
*J. Yang, D. B. Minturn, and F. Hady. When Poll is Better Than Interrupt. FAST '12
This work was supported in part by NSF grants CCF-1453705 and CCF-1629291.
Contact: Yichen Jia, yjia@csc.lsu.edu
Sequential Read: Varying Request Size*

1. Latency and bandwidth both increase as the request size increases.
2. The latency overhead of NVMeoF is minimal (~2%).
3. The bandwidth overhead is at most 20%.
4. CPU utilization decreases with request size and stays below 8% on both the host and the target side.

*Sequential Read, 4 KB-128 KB, 8 concurrent jobs, 1 IODepth
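At a queue depth of 1 only one request is in flight, so bandwidth is pinned to per-request latency (BW ≈ request size / latency), which is why both curves rise together. A worked example with illustrative latencies (hypothetical, not measurements from the paper):

```python
# At iodepth=1: bandwidth = request_size / latency.
# The latencies below are hypothetical, for illustration only.
for size_kb, latency_us in [(4, 90.0), (32, 150.0), (128, 400.0)]:
    bw_mib_s = (size_kb / 1024) / (latency_us / 1e6)  # MiB/s
    print(f"{size_kb:4d} KiB @ {latency_us:6.1f} us -> {bw_mib_s:8.1f} MiB/s")
```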
Random Write: Varying Request Size*

1. NVMeoF shows a longer tail latency than local NVMe for random writes.
2. Bandwidth plateaus at its peak (about 500 MB/s) across the different request sizes.

*Random Write, 4 KB-128 KB, 8 concurrent jobs, 128 IODepth
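A quick sanity check of what that ~500 MB/s ceiling implies in IOPS at each end of the request-size range:

```python
PEAK_MB_S = 500  # peak bandwidth from the slide

for size_kb in (4, 128):
    iops = PEAK_MB_S * 1024 // size_kb  # MB -> KB, then divide by request size
    print(f"{size_kb:3d} KB requests: ~{iops:,} IOPS at {PEAK_MB_S} MB/s")
# -> ~128,000 IOPS at 4 KB; ~4,000 IOPS at 128 KB
```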