

SLIDE 1

Fully Hardware Automated Open Research Framework for Future Fast NVMe Devices

Myoungsoo Jung

Computer Architecture and Memory Systems Laboratory (CAMEL Lab)

ATC 2020

Sponsored by

SLIDE 2

Emerging Non-Volatile Memory for SSDs

Read latency by memory type:

  • TLC flash: 450 us
  • MLC flash: 150 us
  • SLC flash: 25 us
  • New flash (e.g., Z-NAND): 3 us
  • PRAM (storage class memory, SCM): 120 ns
  • MRAM (storage class memory, SCM): 50~80 ns
  • DRAM: 60~80 ns

SLIDE 3

NVMe Internals and Interfaces

[Figure: host CPU attached to an NVMe SSD composed of a controller (CTRL) and multiple flash channels]

SLIDE 4

NVMe Storage Stack

[Figure: host storage stack — Applications (Processes), VFS/FS, block layer, page cache, block device driver — running on the CPU and attached to a flash-based NVMe SSD (CTRL + flash) delivering 1~3 GB/sec]

SLIDE 5

NVMe Storage Stack Redesign

[Figure: the same host storage stack and flash-based NVMe SSD (1~3 GB/sec) as the previous slide]

  • FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs (OSDI’18)
  • De-indirection for Flash-Based SSDs with Nameless Writes (FAST’12)
  • Towards SLO Complying SSDs Through OPS Isolation (FAST’15)
  • The CASE of FEMU: Cheap, Accurate, Scalable and Extensible Flash Emulator (FAST’18)
  • There are more and more!

Challenge #1: Most storage research relies on simulation or kernel-level emulation.

SLIDE 6

SCM-based NVMe Storage Card

[Figure: the same host storage stack, now attached to an SCM-based NVMe card (CTRL + SCM) delivering 7 GB/sec]

Challenge #2: The SSD's CPU can become a performance bottleneck for SCMs.

SLIDE 7

What Does the SSD's CPU Do?

[Figure: the host storage stack and the SCM-based NVMe card (7 GB/sec); the following slides step through the NVMe activities the SSD's CPU must handle]

SLIDE 8

What Does the SSD's CPU Do?

[Figure: the submission queue (SQ) and completion queue (CQ) reside in the host memory address space; the device exposes SQ and CQ doorbells through its device registers]

SLIDE 9

What Does the SSD's CPU Do?

❶ I/O submission: the host places a command, with PRP pointers to its data buffers, into the SQ

SLIDE 10

What Does the SSD's CPU Do?

❷ Ring SQ doorbell: the host writes the new SQ tail pointer to the device's SQ doorbell register

SLIDE 11

What Does the SSD's CPU Do?

❸ I/O fetch: the SSD's CPU fetches the submitted command from the SQ in host memory

SLIDE 12

What Does the SSD's CPU Do?

❹ Data transfer: the data described by the command's PRPs is transferred between host memory and the device

SLIDE 13

What Does the SSD's CPU Do?

❺ I/O process: the SSD's CPU parses the command and services it against the backend media

SLIDE 14

What Does the SSD's CPU Do?

❻ I/O completion: the SSD writes a completion entry into the CQ in host memory

SLIDE 15

What Does the SSD's CPU Do?

❼ Interrupt (notification): the SSD raises an interrupt (e.g., MSI) to notify the host

SLIDE 16

What Does the SSD's CPU Do?

❽ Process completion: the host driver consumes the CQ entry and completes the I/O

SLIDE 17

What Does the SSD's CPU Do?

❾ Ring CQ doorbell: the host writes the new CQ head pointer to the device's CQ doorbell register

SLIDE 18

What Does the SSD's CPU Do?

All of these NVMe activities place a burden on the storage device!
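To ground the nine steps above, here is a minimal host-side sketch in C of the queue-pair protocol (steps ❶, ❷, ❽, and ❾). It is illustrative only: the struct layouts are simplified, the helper names are hypothetical, and memory barriers and error handling are omitted; this is neither the Linux NVMe driver nor part of OpenExpress.

```c
#include <stdint.h>

/* Simplified 64-byte NVMe submission queue entry (16 dwords) and
 * 16-byte completion queue entry; illustrative layouts only. */
struct sqe { uint32_t dw[16]; };  /* dw6/7 = PRP1, dw8/9 = PRP2 */
struct cqe { uint32_t dw0, rsvd; uint16_t sq_head, sq_id; uint16_t cid, status; };

struct queue_pair {
    struct sqe        *sq;            /* SQ in host memory                    */
    struct cqe        *cq;            /* CQ in host memory                    */
    uint16_t           depth;         /* entries per queue                    */
    uint16_t           sq_tail, cq_head;
    uint8_t            cq_phase;      /* expected phase bit, starts at 1      */
    volatile uint32_t *sq_doorbell;   /* device SQ tail doorbell register     */
    volatile uint32_t *cq_doorbell;   /* device CQ head doorbell register     */
};

/* Steps 1-2: place a command in the SQ and ring the SQ tail doorbell. */
static void submit_io(struct queue_pair *qp, const struct sqe *cmd)
{
    qp->sq[qp->sq_tail] = *cmd;                     /* (1) I/O submission     */
    qp->sq_tail = (uint16_t)((qp->sq_tail + 1) % qp->depth);
    *qp->sq_doorbell = qp->sq_tail;                 /* (2) ring SQ doorbell   */
}

/* Steps 8-9: poll the CQ for a new entry, consume it, ring the CQ doorbell. */
static int poll_completion(struct queue_pair *qp, struct cqe *out)
{
    struct cqe e = qp->cq[qp->cq_head];
    if ((e.status & 0x1) != qp->cq_phase)           /* no new completion yet  */
        return 0;
    *out = e;                                       /* (8) process completion */
    if (++qp->cq_head == qp->depth) {
        qp->cq_head = 0;
        qp->cq_phase ^= 1;                          /* phase flips on wrap    */
    }
    *qp->cq_doorbell = qp->cq_head;                 /* (9) ring CQ doorbell   */
    return 1;
}
```

Every one of these host-side actions has a device-side counterpart (command fetching, PRP handling, completion posting, interrupts), and that device-side work is exactly what OpenExpress moves from the SSD's CPU into hardware.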

SLIDE 19

Multi-core IP for a High-Performance SSD

[Figure: conventional multi-core SSD controller — PCIe client logic with inbound/outbound paths to the host NVMe driver (SQ/CQ), CPU cores (Core0, ...) each with I-RAM, interconnection networks, a memory controller with SRAM, and the backend channel complex]

SLIDES 20-25

Component Latency Decomposition

[Figure: normalized latency breakdown (0.0~1.0) for TLC, MLC, SLC, ZNAND, PRAM, and MRAM devices, decomposed into Completion, Translation, PRP, Queue/Doorbells, Fetching, and NVM components; the chart is built up step by step across these slides]

  • CPU bursts can be overlapped with I/O bursts (for slow flash, the NVM access dominates the latency)
  • Firmware control becomes the critical performance bottleneck (for fast media such as ZNAND, PRAM, and MRAM)

SLIDE 26

Overview

  • A framework that fully automates NVMe control logic in hardware, making it possible to build customizable devices
  • OpenExpress is an NVMe host accelerator IP intended for integration into an easy-to-access FPGA design
  • OpenExpress does not require any software intervention to process concurrent read and write NVMe requests
  • It supports scalable data submission, a rich set of outstanding NVMe commands, and submission/completion queue management
  • We prototype OpenExpress on a commercially available Xilinx FPGA board and optimize all the logic modules to operate at a high frequency

SLIDE 27

Full Hardware Automation for the Critical I/O Path

[Figure: the same multi-core SSD controller organization as Slide 19 — PCIe client logic, CPU cores with I-RAM, interconnection networks, memory controller with SRAM, and the backend channel complex]

SLIDE 28

Full Hardware Automation for the Critical I/O Path

[Figure: the critical I/O datapath — PCIe client logic, the SoC memory bus, and the backend channel complex — is handled by the OpenExpress hardware automation logic]

SLIDE 29

Queue Dispatching

[Figure: the SQ tail doorbell region (SQ 0 DB, SQ 1 DB) is exposed over address (reads/writes), data (reads/writes), and write-response channels and is monitored by the SQ Entry Fetch Manager (FET)]
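For reference, the doorbell registers that FET watches live at spec-defined offsets from the controller BAR. The small sketch below computes those offsets as the NVMe specification defines them (CAP.DSTRD is the doorbell stride); it describes the standard host-visible layout, not OpenExpress's internal register map.

```c
#include <stdint.h>

/* Per the NVMe specification, doorbells start at offset 0x1000 from BAR0.
 * DSTRD (doorbell stride) comes from the controller's CAP register. */
static inline uint32_t sq_tail_doorbell_offset(uint16_t qid, uint8_t dstrd)
{
    return 0x1000u + (2u * qid) * (4u << dstrd);       /* SQ y tail doorbell */
}

static inline uint32_t cq_head_doorbell_offset(uint16_t qid, uint8_t dstrd)
{
    return 0x1000u + (2u * qid + 1u) * (4u << dstrd);  /* CQ y head doorbell */
}
```

OpenExpress exposes such per-queue doorbell regions (SQ 0 DB, SQ 1 DB, ...) so that FET can react to tail updates without any firmware involvement.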

SLIDE 30

Data Transferring

[Figure: on top of FET, the PRP Engine (HTRW) and the Backend DMA (DMA) move data between the PCIe EP complex and the backend DIMMs (DIMM0~DIMM3)]

SLIDE 31

Completion Handling

[Figure: the Completion Handler (CMT) posts completion entries and manages the CQ head doorbell region (CQ 0 DB, CQ 1 DB)]

SLIDE 32

NVMe Context Management

[Figure: the NVMe Context Box (CTX) is added to track per-command NVMe context across FET, HTRW, DMA, and CMT]

SLIDE 33

Operation Flow of OpenExpress

[Figure: host-side Submission Queue 1 and Completion Queue 1 (tail/head, CQ 1 BAR) with PRP and data buffers in host DRAM; OpenExpress-side SQ tail and CQ head doorbell regions, FET, HTRW, DMA, CMT, the PCIe EP complex, and backend DIMM0~DIMM3]

  1. The host issues a command into Submission Queue 1
  2. The host writes the SQ 1 tail pointer (rings the doorbell)
  3. OpenExpress monitors the DB region and signals an event
  4. FET fetches the 16-DW SQ entry
  5. The parsed command is sent onward
  6. HTRW processes the PRPs and copies the data associated with each PRP
  7. CMT posts the CQ entry
  8. OpenExpress creates an MSI and interrupts the host
  9. The host writes the CQ 1 head pointer
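A C-style rendering of the same nine-step flow from the device side is sketched below. This is pseudocode for illustration: OpenExpress realizes these steps as concurrent hardware state machines (FET, HTRW, DMA, CMT, CTX), not as a sequential loop, and every function and type name here is a hypothetical placeholder.

```c
#include <stdint.h>

/* Illustrative only: the real OpenExpress logic is RTL, not software.
 * The extern functions are placeholders for hardware blocks. */
struct nvme_cmd { uint64_t prp1, prp2; uint32_t nlb; uint8_t opcode; uint16_t cid; };
struct nvme_cpl { uint16_t cid; uint16_t status; };

extern uint32_t read_sq_tail_doorbell(int qid);                        /* step 3: doorbell monitor */
extern void fetch_sq_entry(int qid, uint32_t head, struct nvme_cmd *c); /* step 4: FET             */
extern void walk_prps_and_dma(const struct nvme_cmd *c);               /* step 6: HTRW + DMA       */
extern void post_cq_entry(int qid, const struct nvme_cpl *c);          /* step 7: CMT              */
extern void send_msi(int qid);                                         /* step 8: interrupt        */

static void service_queue(int qid, uint32_t *sq_head)
{
    uint32_t tail = read_sq_tail_doorbell(qid);       /* written by the host in step 2 */
    while (*sq_head != tail) {
        struct nvme_cmd cmd;
        fetch_sq_entry(qid, *sq_head, &cmd);          /* 16-DW SQ entry fetch          */
        walk_prps_and_dma(&cmd);                      /* PRP traversal + data movement */

        struct nvme_cpl cpl = { .cid = cmd.cid, .status = 0 };
        post_cq_entry(qid, &cpl);                     /* completion posted to host CQ  */
        send_msi(qid);                                /* notify the host               */
        (*sq_head)++;                                 /* host later writes CQ head (9) */
    }
}
```

The point of the hardware automation is that none of these steps needs the device-side CPU or firmware on the critical path.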

SLIDES 34-36

Major IP Cores

[Figure: the frontend is composed of the FET, HTRW, CMT, and CTX IP cores, highlighted in turn across these slides]

SLIDE 37

Synthesis and Implementation

FPGA platform: Xilinx Virtex UltraScale 190, PCIe Gen3, four 64GB DDR4 x72 DIMMs (256GB total)

  • CPU: i5-9400 2.9GHz
  • Design Suite: Vivado 2017.3.1
  • OS: Ubuntu 18.04 LTS
  • Building time: more than 7 hours

SLIDE 38

Frequency Tuning (50 MHz ~ 250 MHz)

  • Detect long routing delays and place register slice IPs between hardware modules
  • This can increase the frequency while minimizing negative slack
  • Gradual trial-and-error to reduce the amount of routing delay

[Figure: register slices inserted between the IP cores — Memory Controller, DDR interconnect, AXI DMA, PCIe EP, CMT, FET, HTRW, CTX]

Putting register slices between different IP cores

SLIDE 39

Frequency Tuning (50 MHz ~ 250 MHz)

  • Group hardware modules in the floorplanning
  • After synthesis, create an FPGA pblock (a part of the grid cells), allocate hardware IPs to the pblock, and do PNR (place and route)
  • Check the implementation report and do gradual trial-and-error again

[Figure: IP cores grouped on the floorplan — Memory Controller, DDR interconnect, AXI DMA, PCIe EP, CMT, FET, HTRW, CTX]

Reducing route distances by grouping IP logic

SLIDE 40

Performance Improvement of Frequency Tuning

  • The performance improvement for reads and writes is as high as 20% and 60%, respectively
  • Reads exhibit more benefit due to the nature of data movement in PCIe packets (explained shortly)
  • Large block processing benefits more because of the automated PRP data processing

[Figure: read and write performance before and after frequency tuning]

SLIDE 41

Performance Improvement of Frequency Tuning

[Figure: host and OpenExpress connected over PCIe — the PCIe BAR exposes the control register set and the IO SQ/CQ regions; IO commands and a PRP list of pointers reside in host DRAM; the device side uses inbound/outbound paths, emulated backend media (Emul), and DRAM; separate panels for reads and writes]

PRP parsing/traversing and data transfers are fully automated
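As background on what PRP parsing and traversing involve, the sketch below walks an NVMe PRP pair the way the specification defines it: PRP1 points to the first (possibly offset) page, and PRP2 is either a second page pointer or a pointer to a PRP list of further page-aligned pointers. It is a simplified, hypothetical software walker, not the HTRW hardware engine; prp_list_at() and the dma callback are placeholders.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Hypothetical callback that moves one chunk of data for a given host
 * physical address; stands in for the backend DMA. */
typedef void (*dma_fn)(uint64_t host_addr, uint32_t len);

/* Placeholder for reading a PRP list page from host memory. */
extern const uint64_t *prp_list_at(uint64_t host_addr);

/* Walk PRP1/PRP2 for a transfer of `total` bytes. */
static void walk_prps(uint64_t prp1, uint64_t prp2, uint32_t total, dma_fn dma)
{
    /* First chunk: from PRP1 up to the next page boundary. */
    uint32_t first = PAGE_SIZE - (uint32_t)(prp1 & (PAGE_SIZE - 1));
    if (first > total)
        first = total;
    dma(prp1, first);
    total -= first;

    if (total == 0)
        return;

    if (total <= PAGE_SIZE) {
        /* PRP2 is a plain page pointer when only one more page is needed. */
        dma(prp2, total);
        return;
    }

    /* Otherwise PRP2 points to a PRP list: an array of page pointers. */
    const uint64_t *list = prp_list_at(prp2);
    for (size_t i = 0; total > 0; i++) {
        uint32_t len = total < PAGE_SIZE ? total : PAGE_SIZE;
        dma(list[i], len);        /* note: a real walker also chains lists */
        total -= len;
    }
}
```

Because large transfers touch many PRP entries, automating this traversal in hardware is also why larger block sizes benefit most, as noted on Slide 40.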

SLIDE 42

Evaluation

  • For all executions, a single I/O worker cannot extract the full bandwidth (CPU bottleneck)
  • We execute 10 threads for both the benchmark and the real-workload runs

  • CPU: 8-core 3.3GHz Intel Skylake-X server microarchitecture
  • DRAM: 32GB DDR4 (2666)
  • Benchmark: FIO
  • Real workloads: CAMEL's Open Storage Trace (SNIA), https://trace.camelab.org/
  • I/O workers: 10 threads
  • Block device: Optane SSD P4800X

  • Evaluations demonstrated in the paper are all performed with a queue depth of 8, the queue depth exhibiting the best bandwidth
  • Note that we don't claim that OpenExpress can be faster than other fast NVMe devices. Instead, the evaluation shows that an FPGA-based design and implementation of NVMe IP cores can offer good performance, making it a viable candidate for use in storage research.

SLIDE 43

Performance w/ Microbenchmarks

[Figure: bandwidth (sequential and random) and latency (sequential and random) measured with FIO]

SLIDE 44

Performance w/ Microbenchmarks

  • 4KB-sized requests:
  • Not much bandwidth/latency difference between reads and writes (3 GB/sec vs. 2.8 GB/sec)
  • OpenExpress latency is 72~77 us at a queue depth of eight (vs. 120~150 us on the P4800X)
  • OpenExpress reaches its maximum bandwidth with 16KB-sized requests:
  • 4.7 GB/sec for random writes (258 us)
  • 7 GB/sec for random reads (175 us), vs. 532~600 us on the P4800X

[Figure: sequential bandwidth and latency]

Why are writes slower than reads?

SLIDE 45

Performance w/ Microbenchmarks

[Figure: PCIe RX/TX timelines of the doorbell, NVMe, PRP, and data payloads for writes and reads]

  • Writes: all payloads are serialized
  • Reads: data and NVMe payloads are served in parallel

SLIDE 46

Real Workload Evaluation

[Figure: OpenExpress latency (100~400 microseconds) and bandwidth (0~6 GB/s) for the 24HR, FIU, DevDiv, Server, and TPCE workloads]

  • The performance of real workloads differs from the microbenchmark results because of unaligned request offsets, sector length variations, etc.
  • Most storage cards cannot reach their best performance with real workload executions
  • Read-intensive workloads (DevDiv and Server): OpenExpress offers 4 GB/sec ~ 4.5 GB/sec (100~200 us)
  • Write-intensive workloads (24HR, FIU, and TPCE): 1.25 GB/sec ~ 2.1 GB/sec (all under 100 us)
SLIDE 47

Real Workload Evaluation

[Figure: latency (microseconds) and bandwidth (GB/s) of OpenExpress vs. the Optane SSD for the 24HR, FIU, DevDiv, Server, and TPCE workloads]

  • Bandwidth: OpenExpress shows 76.3% better bandwidth than the Optane SSD, on average
  • Latency: it exhibits 68.6% shorter latency than the Optane SSD
  • DevDiv: 111.5% better performance compared to the Optane SSD (2.1 GB/sec)

The FPGA is still much slower, but the FPGA design and implementation of the NVMe IP cores are not on the critical path and can be used for system-level studies as a research vehicle.

SLIDE 48

Conclusion: Related Work and Download

Per-month unit prices for 3rd-party NVMe IP cores (Ip-m****, Inte******, Ep*****): $35K, $45K, and $40K

The price of a single-use source code license is around $100K

  • For academic/non-commercial purposes, OpenExpress will be freely downloadable:
  • Hardware automation IP cores (HTRW, FET, CMT, CTX, ...)
  • Firmware for MicroBlaze (to control administrator command management, device initialization, etc.)
  • Download information: https://openexpress.camelab.org
SLIDE 49

Fully Hardware Automated Open Research Framework for Future Fast NVMe Devices

Myoungsoo Jung

Computer Architecture and Memory Systems Laboratory (CAMEL Lab)

ATC 2020

Sponsored by