SLIDE 1

LightNVM: The Linux Open-Channel SSD Subsystem

Matias Bjørling (ITU, CNEX Labs), Javier González (CNEX Labs), Philippe Bonnet (ITU)

SLIDE 2

0% Writes - Read Latency

[Figure: 4K random read latency vs. percentile under a read-only workload]

SLIDE 3

20% Writes - Read Latency

[Figure: 4K random read latency vs. percentile under a mixed 4K random read / 4K random write workload]

Significant outliers! Worst-case read latency is 30X higher, up to 4ms.

SLIDE 4

NAND Capacity Continues to Grow

[Figure: NAND capacity growth; multiple workloads (#1-#4) share a single solid-state drive, stressing performance, endurance, and DRAM overheads]

Source: William Tidwell, "The Harder Alternative – Managing NAND capacity in the 3D age"

SLIDE 5

What contributes to outliers?

Host: log-on-log. Device: write indirection and unknown state.

[Figure: the I/O stack from a log-structured database (e.g., RocksDB) in user space, through VFS, a log-structured file-system, and the block layer in kernel space, down to the solid-state drive pipeline; each layer performs its own metadata management, garbage collection, and address mapping, and the drive buffers writes across its NAND dies]

Even if writes and reads do not collide at the application level, indirection and a narrow storage interface cause outliers:

  • Log-structured layers are unable to align data logically, increasing write amplification and adding extra garbage collection.
  • With buffered writes, the drive maps logical data to physical locations on a best-effort basis.
  • The host is oblivious to physical data placement due to indirection.

SLIDE 6

Open-Channel SSDs


I/O Isolation

Provide isolation between tenants by allocating independent parallel units

Predictable Latency

I/Os are synchronous. Access times to parallel units are explicitly defined.

Data Placement & I/O Scheduling

Manage the non-volatile memory as a block device, through a file-system or inside your application.

SLIDE 7

Solid-State Drives

[Figure: SSD internals; the Flash Translation Layer sits between the host interface and the parallel units on channels X and Y; the controller is responsible for media error handling and media retention management]

The FTL turns Read/Write/Erase across tens of parallel units into a plain Read/Write interface, and manages media constraints (ECC, RAID, retention).

Media latencies: Read (50-100µs), Write (1-5ms), Erase (3-15ms).

SLIDE 8

Rebalance the Storage Interface

Expose device parallelism

  • Parallel units (LUNs) are exposed as independent units to the host.
  • Can be a logical or a physical representation.
  • Explicit performance characteristics.

Log-Structured Storage

  • Exposes storage as chunks that must be written sequentially.
  • Similar to the HDD Shingled Magnetic Recording (SMR) interface.
  • No need for internal garbage collection by the device.

The interface integrates with file-systems and databases, and can also implement I/O determinism, streams, barriers, and other new data management schemes without changing device firmware.


SLIDE 9

Specification

Device model

  • Defines parallel units and how they are laid out in the LBA address space.
  • Defines chunks. Each chunk is a range of LBAs where writes must be sequential. To write again, a chunk must be reset.
      – A chunk can be in one of four states (free/open/closed/offline).
      – If a chunk is open, a write pointer is associated with it.
      – The model is media-agnostic.

Geometry and I/O Commands

  • Read/Write/Reset – scalar and vector variants.
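To make the chunk rules concrete, here is a minimal host-side sketch of the state machine in C; the types and function names are invented for illustration and are not part of the specification or the kernel sources:

```c
#include <stdbool.h>
#include <stdint.h>

/* The four chunk states from the Open-Channel 2.0 device model. */
enum chunk_state { CHUNK_FREE, CHUNK_OPEN, CHUNK_CLOSED, CHUNK_OFFLINE };

/* Hypothetical host-side bookkeeping for a single chunk. */
struct chunk {
    uint64_t slba;          /* first LBA of the chunk */
    uint64_t nlbas;         /* number of LBAs in the chunk */
    uint64_t write_ptr;     /* next LBA to write while the chunk is open */
    enum chunk_state state;
};

/* Writes must land exactly at the write pointer; anything else breaks the
 * sequential-write rule and would be rejected by the device. */
static bool chunk_can_write(const struct chunk *c, uint64_t lba)
{
    if (c->state == CHUNK_FREE)
        return lba == c->slba;      /* first write opens the chunk */
    if (c->state == CHUNK_OPEN)
        return lba == c->write_ptr; /* strictly sequential */
    return false;                   /* closed/offline: reset first */
}

/* Advance the write pointer after a successful write of nlb blocks. */
static void chunk_advance(struct chunk *c, uint64_t nlb)
{
    if (c->state == CHUNK_FREE) {
        c->write_ptr = c->slba;
        c->state = CHUNK_OPEN;
    }
    c->write_ptr += nlb;
    if (c->write_ptr == c->slba + c->nlbas)
        c->state = CHUNK_CLOSED;    /* fully written chunks close */
}

/* A reset returns the chunk to the free state so it can be rewritten. */
static void chunk_reset(struct chunk *c)
{
    if (c->state != CHUNK_OFFLINE) {
        c->state = CHUNK_FREE;
        c->write_ptr = c->slba;
    }
}
```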
SLIDE 10

Drive Model - Chunks

[Figure: the logical block address space divided into Chunk 0 … #Chunks-1, each chunk spanning LBA 0 … #LBAs-1]

  • Reads: logical block granularity (for example 4KB).
  • Writes: minimum write size granularity; synchronous; may fail – an error marks the write bad, not the whole SSD.
  • Resets: chunk granularity; synchronous; may fail – an error only marks the chunk bad, not the whole SSD.

SLIDE 11

Drive Model - Organization

[Figure: the host connects over NVMe to an SSD whose LBA address space is organized as Group 0 … #Groups-1, each group containing PU 0 … #PUs-1, each parallel unit containing Chunk 0 … #Chunks-1, each chunk containing LBA 0 … #LBAs-1]

  • Parallelism across groups (shared bus) and across parallel units (LUNs).
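As an illustration of this hierarchy, the sketch below decodes group/PU/chunk/sector fields from a logical address; the bit widths here are invented for the example, since a real driver obtains them from the device-reported geometry:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative bit widths; a real driver reads these from the device
 * geometry rather than hard-coding them. */
#define SECTOR_BITS 12   /* LBAs within a chunk */
#define CHUNK_BITS  10   /* chunks within a parallel unit */
#define PU_BITS      3   /* parallel units within a group */
#define GROUP_BITS   4   /* groups in the device */

struct ocssd_addr {
    uint32_t group, pu, chunk, sector;
};

/* Decode the group/PU/chunk/sector fields packed into one logical address. */
static struct ocssd_addr addr_decode(uint64_t lba)
{
    struct ocssd_addr a;
    a.sector = lba & ((1u << SECTOR_BITS) - 1);
    lba >>= SECTOR_BITS;
    a.chunk = lba & ((1u << CHUNK_BITS) - 1);
    lba >>= CHUNK_BITS;
    a.pu = lba & ((1u << PU_BITS) - 1);
    lba >>= PU_BITS;
    a.group = lba & ((1u << GROUP_BITS) - 1);
    return a;
}

int main(void)
{
    struct ocssd_addr a = addr_decode(0x123456789ULL);
    printf("group=%u pu=%u chunk=%u sector=%u\n",
           a.group, a.pu, a.chunk, a.sector);
    return 0;
}
```

Because parallel units are independent, a host FTL can stripe consecutive writes across groups and PUs to exploit the device's parallelism explicitly.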

SLIDE 12

LightNVM Subsystem Architecture

  • 1. NVMe Device Driver – detection of OCSSDs; implements the specification (PPA addressing, geometry, vectored R/W/E, optional scalar read/write).
  • 2. LightNVM Subsystem – generic layer; core functionality; target management.
  • 3. High-level I/O Interfaces – block device using a target (pblk), application integration with liblightnvm, file-systems, ...

[Figure: application(s) and file system in user/kernel space, pblk and the LightNVM subsystem in the kernel, the NVMe device driver below, and the Open-Channel SSD in hardware]
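A minimal user-space sketch of the liblightnvm path might look like the following; nvm_dev_open(), nvm_dev_get_geo(), nvm_geo_pr(), and nvm_dev_close() are liblightnvm calls, but the device path and geometry details vary by device and library version, so treat this as an outline rather than a drop-in program:

```c
#include <stdio.h>
#include <liblightnvm.h>

int main(void)
{
    /* Open an Open-Channel SSD exposed by the LightNVM subsystem. */
    struct nvm_dev *dev = nvm_dev_open("/dev/nvme0n1");
    if (!dev) {
        perror("nvm_dev_open");
        return 1;
    }

    /* Query the device geometry: groups/channels, parallel units,
     * chunk layout, and sector size. */
    const struct nvm_geo *geo = nvm_dev_get_geo(dev);
    nvm_geo_pr(geo);   /* pretty-print the geometry */

    nvm_dev_close(dev);
    return 0;
}
```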

SLIDE 13

pblk - Host-side Flash Translation Layer


Mapping table

  • Logical block granularity

Write buffering

  • Lockless circular buffer
  • Multiple producers
  • Single consumer (Write Thread)
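pblk's actual ring buffer lives in the kernel; as a simplified user-space illustration of the same lockless multi-producer/single-consumer idea, producers reserve slots with an atomic fetch-and-add while the single write thread drains entries in order (back-pressure when the ring is full is omitted here):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RB_ENTRIES 1024   /* ring capacity; indices wrap modulo this */

struct wb_entry {
    uint64_t lba;           /* payload: the logical block being buffered */
    _Atomic bool ready;     /* set once the producer has filled the slot */
};

struct write_buffer {
    struct wb_entry entries[RB_ENTRIES];
    _Atomic uint64_t head;  /* next slot to reserve (many producers) */
    uint64_t tail;          /* next slot to drain (single consumer) */
};

/* Producer side: atomically reserve a slot, fill it, then publish it. */
static void wb_produce(struct write_buffer *wb, uint64_t lba)
{
    uint64_t slot = atomic_fetch_add(&wb->head, 1) % RB_ENTRIES;
    wb->entries[slot].lba = lba;
    atomic_store(&wb->entries[slot].ready, true);
}

/* Consumer side (the write thread): drain published entries in order. */
static bool wb_consume(struct write_buffer *wb, uint64_t *lba)
{
    struct wb_entry *e = &wb->entries[wb->tail % RB_ENTRIES];
    if (!atomic_load(&e->ready))
        return false;               /* nothing ready at the tail yet */
    *lba = e->lba;
    atomic_store(&e->ready, false); /* release the slot for reuse */
    wb->tail++;
    return true;
}
```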

Error Handling

  • Device write/reset errors

Garbage Collection

  • Refresh data
  • Rewrite chunks

[Figure: pblk data paths; writes enter via make_rq, add entries to the write buffer and write context, and the write thread drains the buffer to the device with error handling; reads look up the L2P table and are served from the write buffer on a cache hit, otherwise from the device; a GC/rate-limiting thread runs alongside]

SLIDE 14

Experimentation

  • Drive: CNEX Labs Open-Channel SSD; NVMe, Gen3x8, 2TB MLC NAND; implements the Open-Channel 1.2 specification.
  • Parallelism: 16 channels, 8 parallel units per channel (total: 128 PUs).
  • Parallel unit characteristics: min. write size 16K + 64B OOB; 1,067 chunks, 16MB chunk size.
  • Throughput per parallel unit: write 47MB/s; read 108MB/s (4K), 280MB/s (64K).

SLIDE 15

Base Performance – Throughput + Latency

[Figure: throughput and latency vs. request I/O size; throughput grows with parallelism, with random read slightly lower]

SLIDE 16

Limit # of Active Writers

Limiting the number of active writers improves read latency.

[Figure: 256K write QD1 with 256K read QD16; single read or write performance vs. mixed read/write with write performance capped at 200MB/s]

With writes rate-limited to 200MB/s, write latency increases while read latency is reduced. This requires a priori knowledge of the workload.

SLIDE 17

Multi-Tenant Workloads

[Figure: latency under 2 tenants (1W/1R), 4 tenants (3W/1R), and 8 tenants (7W/1R), comparing an NVMe SSD against an OCSSD]

Source: Multi-Tenant I/O Isolation with Open-Channel SSDs, Javier González and Matias Bjørling, NVMW '17

SLIDE 18

Lessons Learned

  • 1. Warranty to end-users – users have direct access to the media.
  • 2. Media characterization is complex and performed for each type of NAND memory – abstract the media behind a "clean" interface.
  • 3. Write buffering – for MLC/TLC media, write buffering is required; decide whether it belongs in the host or in the device.
  • 4. Application-agnostic wear leveling is mandatory – expose statistics so the host can make appropriate decisions.

SLIDE 19
Conclusion

LightNVM contributions:

  • A new storage interface between host and drive.
  • The Linux kernel LightNVM subsystem.
  • pblk: a host-side Flash Translation Layer for Open-Channel SSDs.
  • Demonstration of an Open-Channel SSD.

Upstream timeline:

  • Initial release of the subsystem with Linux kernel 4.4 (January 2016).
  • User-space library (liblightnvm) support upstream in Linux kernel 4.11 (April 2017).
  • pblk available in Linux kernel 4.12 (July 2017).
  • Open-Channel SSD 2.0 specification released (January 2018), with support available from Linux kernel 4.17 (May 2018).

SLIDE 20

Thank You
