
When NVMe over Fabrics Meets Arm: Performance and Implications

Yichen Jia* Eric Anger† Feng Chen*

*Louisiana State University †Arm Inc

MSST’19, May 23rd, 2019


Table of Contents

  • Background
  • Experimental Setup
  • Experimental Results
  • System Implications
  • Conclusions

Background

Arm, NVMe and NVMe over Fabrics


Background: Arm Processors

  • Arm processors have become dominant in IoT devices, mobile phones, etc.
  • Recently released 64-bit Arm CPUs are suitable for cloud and data center use
  • Arm-based instances have been available on Amazon AWS since November 2018
  • One important application is serving as a storage server
  • Enhanced computing capability and power efficiency

Background: NVM Express

  • Flash-based SSDs are becoming cheaper and more popular
  • High throughput and low latency
  • Well suited to parallel I/O
  • Non-Volatile Memory Express (NVMe)
  • Supports deep, paired submission/completion queues (see the sketch below)
  • Scalable to next-generation NVM

(Figure: NVMe structure*)

*https://nvmexpress.org/about/nvm-express-overview/
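To make the paired-queue design concrete, the following is a simplified conceptual sketch in C (illustrative only, not real driver code; the command layout is an assumption): the host appends commands to a submission queue (SQ) and reaps results from the paired completion queue (CQ). A real driver advances the tail by writing an MMIO doorbell register, and NVMe allows up to 64K such pairs, each up to 64K entries deep.

```c
#include <stdint.h>
#include <stdio.h>

#define QDEPTH 1024

struct sq_entry { uint8_t  opcode; uint64_t lba; uint32_t nblocks; };
struct cq_entry { uint16_t status; uint16_t command_id; };

struct queue_pair {
    struct sq_entry sq[QDEPTH];  /* host writes commands here         */
    struct cq_entry cq[QDEPTH];  /* device posts results here         */
    uint32_t sq_tail;            /* advanced by host (doorbell write) */
    uint32_t cq_head;            /* advanced by host as it reaps CQEs */
};

/* Host: enqueue a command; a real driver then writes sq_tail to an
 * MMIO doorbell register so the device fetches the new entry. */
static void submit(struct queue_pair *qp, struct sq_entry cmd) {
    qp->sq[qp->sq_tail++ % QDEPTH] = cmd;
}

/* Host: reap the next completion (simplified: we assume the device
 * has already filled the CQ entry; real code checks a phase bit). */
static struct cq_entry complete(struct queue_pair *qp) {
    return qp->cq[qp->cq_head++ % QDEPTH];
}

int main(void) {
    static struct queue_pair qp;
    submit(&qp, (struct sq_entry){ .opcode = 0x02 /* read */, .lba = 0, .nblocks = 8 });
    struct cq_entry cqe = complete(&qp);  /* would poll or wait in reality */
    printf("reaped completion, status=%u\n", cqe.status);
    return 0;
}
```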


Background: NVMe-over-Fabrics

  • Direct Attached Storage (DAS)
  • Computing and storage in one box
  • Less flexible, hard to scale, etc.
  • Storage disaggregation
  • Separates computing and storage
  • Reduces total cost of ownership (TCO)
  • Improves hardware utilization
  • Examples: NVMe over Fabrics, iSCSI

NVMe over Fabrics

(Figure: NVMe over Fabrics data path. Host side: application, file system, block layer, NVMe Fabrics initiator, RDMA driver, and an RDMA-capable NIC over PCI Express. Target side: an RDMA-capable NIC, RDMA driver, NVMe Fabrics target, block layer, NVMe PCIe driver, and the NVMe SSD over PCI Express. The two sides are connected by Ethernet.)



Motivations

  • Continuous investment in Arm-based solutions
  • Increasing popularity of NVMe over Fabrics
  • Integrating Arm with NVMeoF is highly appealing
  • However, first-hand, comprehensive experimental data is still lacking

A thorough performance study of NVMeoF on Arm is becoming necessary.


Experimental Setup


Experimental Setup

  • Target Side: Broadcom 5880X Stingray.
  • CPU: 8-core 3GHz ARMv8 Coretx-A72 CPU
  • Memory: 48GB
  • Storage: Intel Data Center P3600 SSD
  • Network: Broadcom NetXtreme NIC
  • Host Side: Lenovo ThinkCentre M910s
  • CPU: Intel(R) 4-core (HT) i7-6700 3.40GHz CPU
  • Memory: 16GB
  • Network: Broadcom NetXtreme NIC
  • The host and target machines are connected by a Leoni ParaLink@23 cable
  • Speed on both host and target sides is configured to be 50Gb/s
  • Benchmarking tool: FIO

Server/Client      Arm/x86   x86/Arm
Bandwidth (Gb/s)   45.42     45.40
Latency (us)       3.26      3.17

RoCEv2 Performance
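The FIO parameters used throughout (request size, iodepth, numjobs) map directly onto Linux asynchronous block I/O. As a rough, illustrative sketch of what FIO's libaio engine does under the hood (not the authors' code; the device path and constants are placeholders), the following issues 4KB reads at a configurable queue depth:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define IODEPTH 128   /* fio: iodepth=128 */
#define BLKSZ   4096  /* fio: bs=4k       */

int main(void) {
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx;
    memset(&ctx, 0, sizeof(ctx));
    if (io_setup(IODEPTH, &ctx) < 0) { perror("io_setup"); return 1; }

    struct iocb cbs[IODEPTH], *cbp[IODEPTH];
    for (int i = 0; i < IODEPTH; i++) {
        void *buf;
        if (posix_memalign(&buf, BLKSZ, BLKSZ)) return 1;  /* O_DIRECT alignment */
        io_prep_pread(&cbs[i], fd, buf, BLKSZ, (long long)i * BLKSZ);
        cbp[i] = &cbs[i];
    }

    /* Submit IODEPTH requests in one call: this is the queue-depth knob. */
    if (io_submit(ctx, IODEPTH, cbp) < 0) { perror("io_submit"); return 1; }

    struct io_event evs[IODEPTH];
    int n = io_getevents(ctx, IODEPTH, IODEPTH, evs, NULL);
    printf("completed %d reads of %d bytes\n", n, BLKSZ);

    io_destroy(ctx);
    close(fd);
    return 0;
}
```

Build with `gcc -O2 qd.c -laio`; varying IODEPTH and BLKSZ mirrors the iodepth and request-size knobs swept in the findings below.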


Experimental Results


Experiments

  • Effect of Parallelism
  • Study of Computational Cost
  • Effect of IODepth
  • Effect of Request Sizes

Parallelism Feature in NVMe

(Figure: NVMe structure*)

  • Parallel I/O plays an important role in fully exploiting NVMe hardware potential (see the sketch below)
  • I/O parallelism also has a great impact on NVMe-over-Fabrics

*https://nvmexpress.org/about/nvm-express-overview/
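To show how this parallelism is exercised, below is a hedged C sketch of FIO's numjobs knob (not the authors' code; the device path, job count, and request counts are illustrative): several threads each issue an independent stream of 4KB reads, which the Linux NVMe driver spreads across its per-core queue pairs.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NJOBS 4       /* fio: numjobs=4 */
#define BLKSZ 4096
#define NREQS 1024

static void *job(void *arg) {
    long id = (long)arg;
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* placeholder device */
    if (fd < 0) { perror("open"); return NULL; }
    void *buf;
    if (posix_memalign(&buf, BLKSZ, BLKSZ)) return NULL;
    for (long i = 0; i < NREQS; i++)          /* each job reads its own region */
        if (pread(fd, buf, BLKSZ, (id * NREQS + i) * BLKSZ) < 0)
            break;
    free(buf);
    close(fd);
    return NULL;
}

int main(void) {
    pthread_t t[NJOBS];
    for (long i = 0; i < NJOBS; i++)
        pthread_create(&t[i], NULL, job, (void *)i);
    for (int i = 0; i < NJOBS; i++)
        pthread_join(t[i], NULL);
    printf("%d jobs completed %d reads each\n", NJOBS, NREQS);
    return 0;
}
```

Compile with -pthread; raising NJOBS mirrors the 1-16 concurrent jobs swept in Finding #1.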


Finding #1: Effect of Parallelism

1. Latency increases as the number of jobs increases
2. NVMeoF has close or even shorter tail latency for sequential reads
3. Bandwidth reaches a plateau when the job count reaches 4
4. CPU utilization on the target side is much lower
5. Arm is powerful enough to serve as a storage server

*Sequential Read, 4KB, 128 IODepth, 1-16 Concurrent Jobs


Finding #2: Computational Cost

*Random Write, 4 KB-128KB, 8 Concurrent jobs, 128 IODepth

1. NVMeoF consumes 31.5% more CPU on the host side than local NVMe
2. Kernel-level overhead is dominant (26.9%) when the request size is 4KB
3. Kernel-level overhead is amortized as the request size increases


IODepth is important for NVMeoF

(Figure: NVMe and RDMA queues)


Finding #3: Effect of IODepth

* Sequential Read, 4KB, 8 Concurrent Jobs, 1-128 IODepth, one Arm core

  • When IODepth is small, local access has a shorter tail latency than remote access
  • When IODepth is large, remote access has a shorter tail latency than local access

Finding #3: Effect of IODepth cont’d

* Sequential Read, 4KB, 8 Concurrent Jobs, 1-128 IODepth, one Arm core

  • Bandwidth increases, holds stable, then increases again once IODepth exceeds 32
  • CPU utilization increases, holds stable, then decreases once IODepth exceeds 32

Interrupt Moderation

  • Interrupt moderation means multiple packets are handled per interrupt (see the sketch below)
  • This improves overall interrupt-processing efficiency and decreases CPU utilization

(Figure: interrupt behavior, local vs. x86 vs. Arm)
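The sketch below shows the idea in miniature (purely illustrative C with a simulated packet source, not the Broadcom driver): one interrupt drains a batch of packets up to a budget, so the interrupt cost is amortized over many packets instead of paying one interrupt per packet.

```c
#include <stdio.h>

#define BUDGET 64

static int pending = 200;          /* simulated packets waiting on the NIC */
static int irqs;

static int nic_rx_one(void) {      /* handle one packet if available */
    if (pending == 0) return 0;
    pending--;
    return 1;
}

/* One interrupt drains up to BUDGET packets (NAPI-style); a real driver
 * masks the IRQ while polling and re-arms it afterwards. */
static void rx_interrupt(void) {
    irqs++;
    int handled = 0;
    while (handled < BUDGET && nic_rx_one())
        handled++;
}

int main(void) {
    while (pending > 0)
        rx_interrupt();
    printf("200 packets handled with %d interrupts (vs. 200 without moderation)\n", irqs);
    return 0;
}
```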


Summary and Implications

Observations

  • NVMeoF can provide satisfactory performance, comparable to local NVMe
  • The Arm processor is powerful enough to serve as the NVMeoF target
  • Request size, parallelism, and I/O queue depth are important for performance
  • Kernel-level overhead can be significant on the NVMeoF host
  • Interrupt moderation is important for overall performance

Implications

  • Application level
  • I/O clustering: merge small random operations into large sequential ones (sketched below)
  • Proper configuration of parallelism, request size, etc.
  • System level
  • Simplify the I/O stack by moving the kernel-level driver to user level
  • Replace interrupts with polling*; faster storage widens this tradeoff space (sketched below)
  • Hardware level
  • Interrupt moderation, important for performance improvement
  • NIC configuration

*J. Yang, D. B. Minturn, and F. Hady. When Poll is Better than Interrupt. FAST ’12
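For the polling implication, here is a minimal sketch of polled I/O on Linux (an illustration under stated assumptions, not from the paper: it uses preadv2() with RWF_HIPRI, a placeholder device path, and assumes the NVMe driver has poll queues enabled, e.g., via the nvme.poll_queues module parameter):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) return 1;      /* O_DIRECT alignment */
    struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

    /* RWF_HIPRI asks the kernel to busy-poll for the completion instead of
     * sleeping until an interrupt: more CPU burned, lower latency. */
    ssize_t n = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
    if (n < 0) perror("preadv2");
    else printf("polled read returned %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}
```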

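And for the I/O clustering implication, a hypothetical sketch (the helper name, cluster size, and write target are illustrative, not from the paper): small 4KB writes are buffered and issued as one large sequential write, trading a little latency for far fewer device operations.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ   4096
#define CLUSTER 16                    /* flush after 16 blocks = 64KB */

static char buf[CLUSTER * BLKSZ];
static int  filled;

/* Accumulate one small block; issue a single large write when full. */
static void clustered_write(int fd, const char *block) {
    memcpy(buf + filled * BLKSZ, block, BLKSZ);
    if (++filled == CLUSTER) {
        if (write(fd, buf, sizeof(buf)) < 0)  /* one large sequential I/O */
            perror("write");
        filled = 0;
    }
}

int main(void) {
    int fd = open("/dev/null", O_WRONLY);    /* placeholder target */
    if (fd < 0) { perror("open"); return 1; }
    char block[BLKSZ] = {0};
    for (int i = 0; i < 64; i++)             /* 64 small writes become 4 large ones */
        clustered_write(fd, block);
    close(fd);
    return 0;
}
```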

Conclusions

  • We benchmark NVMe and NVMeoF on an Arm-based server
  • NVMe over Fabrics incurs only minimal overhead compared to (local) NVMe
  • Arm servers are powerful enough to serve as the target (storage) side for NVMeoF
  • NVMeoF shows better performance than NVMe for I/O-intensive applications
  • We explain this phenomenon
  • We discuss related system implications for performance optimization
  • I/O clustering, simplifying the I/O stack, interrupt moderation, etc.

Acknowledgements

  • We thank the anonymous reviewers for their constructive feedback and comments
  • We also thank Haresh Sakariya from Broadcom Inc. for his technical support
  • This paper was partially supported by the National Science Foundation under Grants CCF-1453705 and CCF-1629291


Q&A

Thanks & Questions?

Yichen Jia yjia@csc.lsu.edu


Backup Slides


Effect of Request Size

1. Latency and bandwidth increase as the request size increases
2. Latency overhead is minimal (~2%)
3. Bandwidth overhead is at most 20%
4. CPU utilization decreases and stays below 8% on both the host and target sides

*Sequential Read, 4KB-128KB, 8 Concurrent Jobs, 1 IODepth


Computational Cost (1)

*Random Write, 4KB-128KB, 8 Concurrent Jobs, 128 IODepth

1. NVMeoF has a longer tail latency than NVMe for random writes
2. Bandwidth reaches its peak (about 500MB/s) across different request sizes
