SLIDE 1 Computer Architecture and Memory Systems Laboratory
CAMEL Lab
Myoungsoo Jung
ATC 2020
Sponsored by
Fully Hardware Automated Open Research Framework for Future Fast NVMe Devices
SLIDE 2
Emerging Non-Volatile Memory for SSDs
Read latency by memory type:
- Flash technologies: TLC 450 us, MLC 150 us, SLC 25 us, new flash 3 us
- Storage class memory (SCM): PRAM 120 ns, MRAM 50~80 ns
- DRAM: 60~80 ns
SLIDE 3 NVMe Internals and Interfaces
[Figure: host CPU attached to an NVMe SSD - a controller (CTRL) managing multiple flash chips]
SLIDE 4 NVMe Storage Stack
[Figure: host storage stack - applications (processes), VFS/FS, page cache, block layer, and block device driver - issuing I/O to an NVMe flash SSD (controller plus flash chips) at 1~3 GB/sec]
SLIDE 5 NVMe Storage Stack Redesign
[Figure: the same host storage stack and flash SSD (1~3 GB/sec), annotated with prior redesign work]
- FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs (OSDI'18)
- De-indirection for Flash-Based SSDs with Nameless Writes (FAST'12)
- Towards SLO Complying SSDs Through OPS Isolation (FAST'15)
- The CASE of FEMU: Cheap, Accurate, Scalable and Extensible Flash Emulator (FAST'18)
Challenge #1: Most storage research relies on simulation or kernel-level emulation.
SLIDE 6 SCM-based NVMe Storage Card
[Figure: the same host storage stack, now backed by an SCM-based NVMe card (controller plus SCM modules) at 7 GB/sec]
Challenge #2: The SSD's internal CPU can become a performance bottleneck for SCMs.
SLIDE 7 What Does SSD's CPU Do?
[Figure: host storage stack and SCM-based NVMe card (7 GB/sec); the card's internal CPU sits on the I/O path]
SLIDE 8 What Does SSD's CPU Do?
[Figure: host memory address space holding the submission queue (SQ) and completion queue (CQ); device registers exposing the SQ doorbell and CQ doorbell]
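To make the figure concrete, below is a minimal C sketch of the host-visible NVMe objects it shows: the 64-byte (16-DW) submission queue entry, the 16-byte completion queue entry, and the doorbell registers in the controller's BAR. The layouts follow the NVMe specification, but field names and comments are simplified for illustration.

```c
#include <stdint.h>

/* 64-byte (16-DW) submission queue entry, simplified from the NVMe spec. */
struct nvme_sq_entry {
    uint8_t  opcode;      /* e.g., 0x01 = write, 0x02 = read (NVM command set) */
    uint8_t  flags;
    uint16_t command_id;  /* echoed back in the completion entry               */
    uint32_t nsid;        /* namespace ID                                      */
    uint64_t rsvd;
    uint64_t metadata;
    uint64_t prp1;        /* physical address of the first data page           */
    uint64_t prp2;        /* second data page, or pointer to a PRP list        */
    uint32_t cdw10;       /* starting LBA (low 32 bits) for read/write         */
    uint32_t cdw11;       /* starting LBA (high 32 bits)                       */
    uint32_t cdw12;       /* number of logical blocks, 0-based                 */
    uint32_t cdw13, cdw14, cdw15;
};

/* 16-byte completion queue entry. */
struct nvme_cq_entry {
    uint32_t result;
    uint32_t rsvd;
    uint16_t sq_head;     /* how far the device has consumed the SQ            */
    uint16_t sq_id;
    uint16_t command_id;
    uint16_t status;      /* bit 0 is the phase tag the host polls on          */
};

/* Queue doorbells live in BAR0 starting at offset 0x1000, spaced by the
 * doorbell stride (CAP.DSTRD): a tail doorbell per SQ, a head doorbell per CQ. */
static inline uint32_t sq_tail_doorbell(uint16_t qid, uint32_t dstrd)
{
    return 0x1000 + (2u * qid) * (4u << dstrd);
}

static inline uint32_t cq_head_doorbell(uint16_t qid, uint32_t dstrd)
{
    return 0x1000 + (2u * qid + 1u) * (4u << dstrd);
}
```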
SLIDE 9 What Does SSD's CPU Do?
❶ I/O submission: the host places an NVMe command and its data (PRP) pointers into the submission queue (SQ)
SLIDE 10 What Does SSD's CPU Do?
❷ Ring SQ doorbell: the host writes the new SQ tail to the device's SQ doorbell register
SLIDE 11 What Does SSD's CPU Do?
❸ I/O fetch: the device fetches the SQ entry from host memory
SLIDE 12 What Does SSD's CPU Do?
❹ Data transfer: the device moves data to/from the host buffers pointed to by the PRPs
SLIDE 13 What Does SSD's CPU Do?
❺ I/O process: the device performs the actual media (SCM) access
SLIDE 14 What Does SSD's CPU Do?
❻ I/O completion: the device posts an entry into the completion queue (CQ)
SLIDE 15 What Does SSD's CPU Do?
❼ Interrupt (notification): the device raises an interrupt to notify the host
SLIDE 16 What Does SSD's CPU Do?
❽ Process completion: the host handles the completion entry
SLIDE 17 What Does SSD's CPU Do?
❾ Ring CQ doorbell: the host writes the new CQ head to the device's CQ doorbell register
SLIDE 18 What Does SSD's CPU Do?
All of these NVMe activities place a burden on the storage device!
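Putting the nine steps together, here is a minimal host-side sketch in C of what the driver does for one I/O command, with the device-side steps noted as comments. It polls the completion queue instead of taking the interrupt, and names such as sq, cq, and the doorbell pointers are illustrative placeholders for BAR-mapped and DMA-able memory; a real driver adds locking, error handling, and MSI handling.

```c
#include <stdint.h>
#include <string.h>

#define QUEUE_DEPTH 64

struct nvme_cmd { uint8_t raw[64]; };                       /* 16-DW SQ entry     */
struct nvme_cpl { uint32_t dw0, dw1;
                  uint16_t sq_head, sq_id, cid, status; };  /* 16-byte CQ entry   */

static struct nvme_cmd          sq[QUEUE_DEPTH];   /* DMA-able host memory        */
static volatile struct nvme_cpl cq[QUEUE_DEPTH];   /* written by the device       */
static volatile uint32_t *sq_tail_db;              /* mapped at BAR0 + 0x1000 + … */
static volatile uint32_t *cq_head_db;

static uint16_t sq_tail, cq_head, phase = 1;

void submit_and_wait(const struct nvme_cmd *cmd)
{
    /* ❶ I/O submission: place the command (with its PRP pointers) in the SQ.   */
    memcpy(&sq[sq_tail], cmd, sizeof(*cmd));
    sq_tail = (sq_tail + 1) % QUEUE_DEPTH;

    /* ❷ Ring the SQ tail doorbell. The SSD's CPU (or OpenExpress hardware)
     *    now performs ❸ fetching the 16-DW entry, ❹ transferring data via the
     *    PRPs, ❺ the media access, ❻ posting a CQ entry, and ❼ the interrupt.  */
    *sq_tail_db = sq_tail;

    /* ❽ Process the completion: poll the phase tag instead of waiting for MSI. */
    while ((cq[cq_head].status & 1) != phase)
        ;                                   /* spin until the device posts it    */

    cq_head = (cq_head + 1) % QUEUE_DEPTH;
    if (cq_head == 0)
        phase ^= 1;                         /* phase flips on every CQ wrap      */

    /* ❾ Ring the CQ head doorbell to release the consumed entry.               */
    *cq_head_db = cq_head;
}
```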
SLIDE 19 Multi-core IP for High-Performance SSD
[Figure: SSD internals - PCIe client logic with inbound/outbound paths facing the host's NVMe driver (SQ/CQ), multiple embedded cores (Core0, ...) each with instruction RAM (I-RAM), interconnection networks, a memory controller with SRAM, and the backend channel complex]
SLIDE 20~25 Component Latency Decomposition
[Charts: normalized latency breakdown (0.0~1.0) for TLC, MLC, SLC, Z-NAND, PRAM, and MRAM, decomposed into Completion, Translation, PRP, Queue/Doorbells, Fetching, and NVM components]
CPU bursts for firmware control can become the critical performance bottleneck
SLIDE 26 Overview
- A framework that fully automates the NVMe control logic in hardware, making it possible to build customizable devices
- OpenExpress is an NVMe host accelerator IP meant to be integrated into an easy-to-access FPGA design
- OpenExpress requires no software intervention to process concurrent NVMe read and write requests
- It supports scalable data submission, a large number of outstanding NVMe commands, and submission/completion queue management
- We prototype OpenExpress on a commercially available Xilinx FPGA board and optimize all the logic modules to operate at a high frequency
SLIDE 27 Full Hardware Automation for Critical I/O Path
[Figure: the multi-core SSD architecture from Slide 19 - PCIe client logic, embedded cores with I-RAM, interconnection networks, memory controller/SRAM, and backend channel complex]
SLIDE 28 Full Hardware Automation for Critical I/O Path
[Figure: the critical I/O path handled by hardware automation (OpenExpress) attached to the SoC memory bus and datapath, instead of the embedded cores]
SLIDE 29 Queue Dispatching
[Figure: the SQ tail doorbell region (SQ 0 DB, SQ 1 DB, ...) exposed over PCIe (address/data writes and write responses) and monitored by the SQ Entry Fetch Manager (FET)]
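As a rough software analogue of what the FET does when an inbound PCIe write lands in the SQ tail doorbell region, the sketch below decodes which queue was rung and how many new 64-byte entries need to be fetched, handling tail-pointer wrap-around. All names and the queue depth are illustrative; in OpenExpress this is an FSM in hardware, not C.

```c
#include <stdint.h>

#define MAX_SQ   8
#define SQ_DEPTH 64        /* illustrative queue depth */

struct sq_state {
    uint16_t tail;         /* latest tail value written by the host   */
    uint16_t head;         /* next entry the device will fetch        */
    uint64_t base;         /* host physical address of the SQ         */
};

static struct sq_state sq_ctx[MAX_SQ];

/* Invoked for each write hitting the SQ tail doorbell region.        */
uint16_t on_sq_doorbell_write(uint16_t qid, uint16_t new_tail)
{
    struct sq_state *q = &sq_ctx[qid];

    /* Newly posted entries still to fetch, with wrap-around.          */
    uint16_t pending = (uint16_t)((new_tail + SQ_DEPTH - q->head) % SQ_DEPTH);
    q->tail = new_tail;

    /* The FET would now issue 'pending' 64-byte host reads starting
     * at q->base + q->head * 64 and advance q->head as they return.   */
    return pending;
}
```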
SLIDE 30 Data Transferring
[Figure: the SQ Entry Fetch Manager (FET) hands fetched commands to the PRP engine (HTRW), which drives the backend DMA (DMA) to move data between the PCIe EP complex and the on-card DIMMs (DIMM0~DIMM3)]
SLIDE 31 Completion Handling
[Figure: after the data transfer, the Completion Handler (CMT) posts completion entries and tracks the CQ head doorbell region (CQ 0 DB, CQ 1 DB, ...)]
SLIDE 32 NVMe Context Management
[Figure: the NVMe Context Box (CTX) sits among FET, HTRW/DMA, and CMT, keeping per-command state across the SQ tail doorbell region, the PCIe EP complex, the DIMMs, and the CQ head doorbell region]
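The context box must remember, for every in-flight command, enough state to drive the PRP transfers and later post the right completion. A plausible per-command record looks roughly like the following; the field set is illustrative, not the actual RTL.

```c
#include <stdint.h>

/* Per-command state kept while a request is in flight between the
 * fetch (FET), PRP/DMA (HTRW), and completion (CMT) engines. Illustrative. */
struct nvme_cmd_context {
    uint16_t command_id;       /* CID from the fetched SQ entry               */
    uint16_t sq_id;            /* submission queue it arrived on              */
    uint16_t cq_id;            /* completion queue paired with that SQ        */
    uint8_t  opcode;           /* read or write                               */
    uint8_t  state;            /* e.g., FETCHED -> DMA_IN_PROGRESS -> DONE    */

    uint64_t slba;             /* starting LBA                                */
    uint32_t nlb;              /* number of logical blocks                    */

    uint64_t prp1, prp2;       /* data pointers copied from the SQ entry      */
    uint32_t bytes_done;       /* progress of the HTRW/DMA engines            */

    uint16_t sq_head;          /* SQ head value to report in the CQ entry     */
    uint16_t status;           /* completion status for CMT to post           */
};
```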
SLIDE 33 Operation Flow of OpenExpress
[Figure: OpenExpress modules (FET, HTRW, DMA, CMT, CTX), the PCIe EP complex, the on-card DIMMs (DIMM0~DIMM3), and the host-side Submission Queue 1 / Completion Queue 1 with their BAR-mapped doorbells]
1. The host issues a command into Submission Queue 1
2. The host writes the SQ 1 tail pointer (doorbell)
3. OpenExpress monitors the doorbell region and signals an event
4. FET fetches the 16-DW SQ entry
5. The parsed command is handed to the PRP engine
6. The PRPs are processed, and the data associated with each PRP is copied between host DRAM and the DIMMs
7. A CQ entry is posted
8. An MSI is created and an interrupt is raised
9. The host writes the CQ 1 head pointer (doorbell)
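Read end to end, the flow is a pipeline from a doorbell event to an MSI. The sketch below renders the device side sequentially in C for clarity; in OpenExpress these stages run as parallel hardware engines, and every helper (dma_from_host, post_cq_entry, ...) is a hypothetical stand-in for datapath logic, declared here only so the sketch is self-contained.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for the hardware datapath (PCIe DMA, MSI generation). */
void dma_from_host(uint64_t host_addr, void *dst, size_t len);
void dma_to_host(uint64_t host_addr, const void *src, size_t len);
void post_cq_entry(uint16_t cq_id, uint16_t cid, uint16_t status);
void send_msi(uint16_t cq_id);
uint64_t sq_base(uint16_t qid);          /* host address of the SQ              */
uint16_t sq_head(uint16_t qid);          /* next un-fetched SQ slot             */
void    *dimm_buffer(uint64_t slba);     /* backing storage on the card's DIMMs */

struct sq_entry {                        /* only the fields used below          */
    uint8_t  opcode, flags;
    uint16_t cid;
    uint32_t nsid;
    uint64_t rsvd, metadata;
    uint64_t prp1, prp2;
    uint32_t cdw[6];
};
#define OP_WRITE 0x01

/* Steps 3-8 of the operation flow, for one command on queue 'qid'. */
void service_one_command(uint16_t qid)
{
    /* (3)-(4) The doorbell monitor signalled an event; fetch the 16-DW entry. */
    struct sq_entry sqe;
    dma_from_host(sq_base(qid) + (uint64_t)sq_head(qid) * 64, &sqe, 64);

    /* (5)-(6) Hand the parsed command to the PRP engine and copy the data
     * associated with each PRP between host memory and the on-card DIMMs
     * (a single 4KB page shown; see the PRP-traversal sketch later).          */
    uint64_t slba = ((uint64_t)sqe.cdw[1] << 32) | sqe.cdw[0];
    if (sqe.opcode == OP_WRITE)
        dma_from_host(sqe.prp1, dimm_buffer(slba), 4096);
    else
        dma_to_host(sqe.prp1, dimm_buffer(slba), 4096);

    /* (7)-(8) Post the CQ entry, then create the MSI to interrupt the host.   */
    post_cq_entry(qid, sqe.cid, /*status=*/0);
    send_msi(qid);

    /* (9) happens on the host: it writes the CQ head doorbell afterwards.     */
}
```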
SLIDE 34~36 Major IP Cores
[Figure: the major IP cores of OpenExpress - FET, HTRW, CMT, and CTX - together with the frontend logic]
SLIDE 37 Synthesis and Implementation
Prototype: Xilinx Virtex UltraScale 190 FPGA, PCIe Gen3, four 64GB DDR4 x72 DIMMs (256GB in total)
- CPU: i5-9400 (2.9GHz)
- Design suite: Vivado 2017.3.1
- OS: Ubuntu 18.04 LTS
- Build time: more than 7 hours
SLIDE 38 Frequency Tuning (50MHz ~ 250MHz)
- Detect long routing delays and place register-slice IPs between hardware modules
- This raises the achievable frequency while minimizing negative slack
- Gradual trial-and-error to reduce the amount of routing delay
[Figure: putting register slices between the different IP cores - memory controller, DDR interconnect, AXI DMA, PCIe EP, CMT, FET, HTRW, CTX]
SLIDE 39 Frequency Tuning (50MHz ~ 250MHz)
- Group hardware modules during floorplanning
- After synthesis, create an FPGA pblock (a part of the grid cells), allocate hardware IPs to the pblock, and run PNR (place and route)
- Check the implementation report and repeat the gradual trial-and-error
[Figure: reducing route distances by grouping IP logic - memory controller, DDR interconnect, AXI DMA, PCIe EP, CMT, FET, HTRW, CTX]
SLIDE 40 Perf. Improvement of Frequency Tuning
- The performance improvement for reads and writes is as high as 20% and 60%, respectively
- Writes benefit more because of the nature of their data movement over PCIe packets (explained shortly)
- Large block sizes benefit more because of the automated PRP data processing
[Charts: performance improvement from frequency tuning, for reads and writes]
SLIDE 41 Perf. Improvement of Frequency Tuning
[Figure: reads and writes between the host side (I/O CQ region, control register set, I/O SQ region, I/O commands, and PRP list/pointers behind the PCIe BAR) and OpenExpress (DRAM) over the PCIe inbound/outbound paths]
PRP parsing/traversing and data transfers are fully automated
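For reference, this is roughly what "PRP parsing/traversing" means: PRP1 addresses the first (possibly unaligned) page of the buffer, and PRP2 is either the second page or, for larger transfers, a pointer to a list of page entries that must itself be fetched from host memory. A minimal sketch follows, with illustrative helper names; the HTRW engine performs this walk in hardware, one PCIe read per list page.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Hypothetical helper: one DMA read from host memory (the PCIe EP does this in hardware). */
void dma_from_host(uint64_t host_addr, void *dst, size_t len);

/* Hypothetical per-page callback: move a chunk between a host page and the card's DIMMs.  */
void transfer_page(uint64_t host_addr, uint32_t len);

/* Walk PRP1/PRP2 for a transfer of 'total' bytes, following the NVMe PRP rules. */
void traverse_prps(uint64_t prp1, uint64_t prp2, uint32_t total)
{
    /* PRP1 may start mid-page; the first chunk only runs to the page boundary. */
    uint32_t first = PAGE_SIZE - (uint32_t)(prp1 & (PAGE_SIZE - 1));
    if (first > total)
        first = total;
    transfer_page(prp1, first);
    total -= first;

    if (total == 0)
        return;

    if (total <= PAGE_SIZE) {
        /* Small transfer: PRP2 directly addresses the second page. */
        transfer_page(prp2, total);
        return;
    }

    /* Large transfer: PRP2 points to a PRP list in host memory. */
    uint64_t list[PAGE_SIZE / 8];
    while (total > 0) {
        dma_from_host(prp2, list, PAGE_SIZE);            /* fetch one list page */
        size_t n = PAGE_SIZE / 8;
        for (size_t i = 0; i < n && total > 0; i++) {
            /* The last entry of a full list page chains to the next list page. */
            if (i == n - 1 && total > PAGE_SIZE) {
                prp2 = list[i];
                break;
            }
            uint32_t len = total < PAGE_SIZE ? total : PAGE_SIZE;
            transfer_page(list[i], len);
            total -= len;
        }
    }
}
```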
SLIDE 42 Evaluation
- For all executions, a single I/O worker cannot extract the full bandwidth (CPU bottleneck), so we run 10 threads for both the benchmark executions and the real workloads
- CPU: 8-core 3.3GHz Intel Skylake-X server microarchitecture
- DRAM: 32GB DDR4 (2666)
- Benchmark: FIO
- Real workloads: CAMEL's Open Storage Traces (SNIA), https://trace.camelab.org/
- I/O workers: 10 threads
- Block device: Optane SSD P4800X
- All evaluations demonstrated in the paper use a queue depth of 8, the queue depth exhibiting the best bandwidth
- Note that we don't claim OpenExpress is faster than other fast NVMe devices; instead, the evaluation shows that an FPGA-based design and implementation of NVMe IP cores can offer good performance, making it a viable candidate for use in storage research
SLIDE 43 Performance w/ Microbenchmarks
[Charts: sequential and random bandwidth, and sequential and random latency]
SLIDE 44 Performance w/ Microbenchmarks
- With 4KB-sized requests, there is not much bandwidth/latency difference between reads and writes (3 GB/sec vs. 2.8 GB/sec)
- OpenExpress latency is 72~77 us at a queue depth of eight (vs. 120~150 us for the P4800X)
- OpenExpress reaches its maximum bandwidth with 16KB-sized requests: 4.7 GB/sec for random writes (258 us) and 7 GB/sec for random reads (175 us) (vs. 532~600 us for the P4800X)
[Charts: sequential bandwidth and latency]
Why are writes slower than reads?
SLIDE 45 Performance w/ Microbenchmarks
[Figure: PCIe RX/TX traffic for writes vs. reads - doorbell, NVMe command, PRP, and data payloads]
For writes, all payloads are serialized on the inbound (RX) path; for reads, the data (TX) and NVMe payloads (RX) are served in parallel
SLIDE 46 Real Workload Evaluation
[Charts: per-workload latency (us) and bandwidth (GB/s) of OpenExpress for 24HR, FIU, DevDiv, Server, and TPCE]
- The performance of real workloads differs from the microbenchmark results because of unaligned request offsets, sector-length variations, etc.
- Most storage cards cannot reach their best performance under real workload executions
- DevDiv and Server: OpenExpress offers 4 GB/sec ~ 4.5 GB/sec (100~200 us)
- Write-intensive workloads (24HR, FIU, and TPCE): 1.25 GB/sec ~ 2.1 GB/sec (all under 100 us)
SLIDE 47 Real Workload Evaluation
[Charts: per-workload latency (us) and bandwidth (GB/s) of the Optane SSD vs. OpenExpress for 24HR, FIU, DevDiv, Server, and TPCE]
- Bandwidth: OpenExpress shows 76.3% higher bandwidth than the Optane SSD, on average
- Latency: it exhibits 68.6% shorter latency than the Optane SSD
- DevDiv: 111.5% better performance compared to the Optane SSD (2.1 GB/sec)
The FPGA is still much slower, but the FPGA design and implementation of the NVMe IP cores are not on the critical path and can be used for system-level studies as a research vehicle
SLIDE 48 Conclusion: Related Work and Download
Per-month unit prices for third-party NVMe IP cores (Ip-m****, Inte******, Ep*****) range from $35K to $45K; a single-use source code license is around $100K
- For academic/non-commercial purposes, OpenExpress is freely downloadable:
  - Hardware automation IP cores (HTRW, FET, CMT, CTX, ...)
  - Firmware for MicroBlaze (to handle admin command management, device initialization, etc.)
- Download information: https://openexpress.camelab.org
SLIDE 49 Computer Architecture and Memory Systems Laboratory
CAMEL Lab
Myoungsoo Jung
ATC 2020
Sponsored by
Fully Hardware Automated Open Research Framework for Future Fast NVMe Devices