

SLIDE 1

Chanwoo Chungǂ, Jinhyung Koo, Junsu Im, Arvindǂ, and Sungjin Lee

DGIST and MITǂ

DATA-INTENSIVE COMPUTING SYSTEMS LABORATORY

NVRAMOS ‘19 2019.10.24

SLIDE 2

[Diagram: Computation – application servers; Storage – storage nodes 0..N; connected by a datacenter network (e.g., Ethernet, InfiniBand, …); each storage node has Xeon CPUs, GBs of DRAM, and a disk array w/ RAID]

It is not mere storage – it is another high-end server!

▪ High-end Xeon CPUs, several GBs of DRAM, an array of SSDs, large form factor, …
▪ Power hungry (e.g., 1700 W)
▪ Expensive (e.g., $2~40,000 w/o SSDs)
▪ Large volume (e.g., 2-4U)
▪ High TCO (e.g., cooling)

SLIDE 3

▪ HDD is slow – requires large DRAM and an array of disks
  ▪ 10 ms latency & 100~300 MB/s throughput
▪ HDD is dumb – the host system makes it smarter
  ▪ Xeon CPUs with advanced algorithms

[Diagram: storage host with Xeon CPUs, GBs of DRAM, and an HDD array w/ RAID (300 MB/s per disk); host software stack: protocol translation (e.g., NFS, CIFS, …), local file system (e.g., EXT4, WAFL, …), prefetching, caching/buffering, parity mgmt., dedup/compression; 4x 40GbE ports, aggregate network throughput = 20 GB/s]

SLIDE 4

▪ Now with an SSD array w/ RAID (1~10 GB/s per SSD) in place of the HDD array

SSDs are not a bottleneck → Network/CPU are the new bottlenecks
  • Aggr. SSD throughput = 10~100 GB/s (with 10 SSDs)
  • Aggr. network throughput = 20 GB/s (4x 40GbE) ← Bottleneck!

SLIDE 5

※ Aggr. SSD throughput was estimated assuming each SSD offers 1 GB/s throughput

                     EMC XtremIO      NetApp SolidFire  HPE 3PAR         Hynix AFA
SSD Array Capacity   36~144 TB        46 TB             750 TB           522 TB
# of SSDs            18~72            12                120              576
Aggr. Throughput*    18~72 GB/s       12 GB/s           120 GB/s         576 GB/s
Network Ports        4~8x 10Gb iSCSI  2x 25Gb iSCSI     4~12x 16Gb FC    3x Gen3 PCIe
Aggr. Throughput     5~10 GB/s        6.25 GB/s         8~24 GB/s        48 GB/s

▪ Supported by the latest works
  ▪ K. Kourtis et al., "Reaping the performance of fast NVM storage with uDepot," USENIX FAST '19
  ▪ J. Kim et al., "Alleviating Garbage Collection Interference through Spatial Separation in All Flash Arrays," USENIX ATC '19


SLIDE 7

[Diagram: the same SSD-based storage host, now showing a controller (Ctrl) inside each SSD; the 4x 40GbE ports remain the bottleneck]

SSDs are not a bottleneck → Network/CPU are the new bottlenecks
SSDs are smart enough, supporting many features → Duplicate storage management hurts performance

SLIDE 8

▪ 4 embedded ARM CPUs running at 700 MHz to 1.4 GHz and 1~16 GB of DRAM – roughly the resources a desktop PC had 10 years ago
▪ Those resources are required for running the firmware (i.e., the FTL)

[Diagram: SSD internals – host-to-PCIe controller on a 1~10 GB/s PCIe interface, >4 GB DRAM, four ARM CPUs (max 1.4 GHz), and an array of NAND chips; firmware tasks: block-I/O-to-flash interfacing, cleaning, compression, deduplication, parity mgmt./RAID, wear-leveling, remapping]

SLIDE 9

[Diagram: application servers connect over the datacenter network to storage nodes 0..N]

Let's assume this storage node has 72 8-TB SSDs (EMC XtremIO):
▪ # of ARM cores: 4 cores x 72 = 288 ARM cores
▪ Aggregate DRAM: 8 GB x 72 = 576 GB of DRAM
…just for managing NAND flash!

Q: Is this a storage node or a low-power microserver?

SLIDE 10

▪ Use a simple SSD?
  ▪ Software Defined Flash (ASPLOS '14)
  ▪ Application-managed Flash (USENIX FAST '16)
  ▪ LightNVM (USENIX FAST '17)
  → Network/CPU are still the bottleneck
▪ Use a better SSD organization?
  ▪ SWAN (HotStorage '16; USENIX ATC '19)
  → Still relies on a power-hungry and expensive host
▪ Any other solution?

SLIDE 11

▪ Motivation
▪ Basic Idea
▪ LightStore Software
▪ LightStore Controller
▪ LightStore Adapters
▪ Experimental Results
▪ Conclusion

SLIDE 12

▪ Get rid of a space-consuming, expensive, power-hungry host server
▪ Put and run everything in SSDs
▪ Attach SSDs to a datacenter network
▪ Let application servers directly talk to SSDs

[Diagram: application servers talk directly over the datacenter network to network-attached SSDs, each with its own controller; the host-side stack – protocol translation (e.g., NFS, CIFS, …), local file system (e.g., EXT4, WAFL, …), prefetching, caching/buffering, parity mgmt. – is eliminated]


SLIDE 14

[Diagram: inside each network-attached SSD – a host-to-PCIe controller, 2~4 GB of DRAM, and NAND chips w/ RAID; the SSD itself now runs low-level flash management, high-level flash management, and host protocol translation]

SLIDE 15

[Diagram: same as the previous slide, with an Ethernet controller added so each SSD attaches directly to the datacenter network]

SLIDE 16


Deliver Flash’s low latency & high throughput to network ports!

SLIDE 17

An x86 storage server with N SSDs is replaced with just N SSDs:
▪ Low power (e.g., 100 W / 10 SSDs)
▪ Cheap (e.g., zero server cost)
▪ Small volume (e.g., less than 1U)
▪ Low TCO (e.g., less cooling)
▪ Scalability (no network bottleneck)

SLIDE 18

▪ Can we run complicated server software on wimpy ARM cores?
▪ How can we provide the same interface to application servers?
▪ How can we manage unreliable NAND without more ARM cores?

SLIDE 19

[Diagram: a LightStore cluster of drive-sized embedded nodes, each with a NIC, KV-store software, and flash, plus an expansion-card network; application servers run adapters (YCSB adapter, FS adapter, Blk adapter) that translate INSERT/fwrite()/read() into KV requests (GET, SET, DELETE, …); each node runs the KV protocol server and LSM-tree algorithm (LightStore software) atop a hardware FTL and flash controller (LightStore controller) managing the NAND]

KV requests are hashed to different nodes by the adapters w/ consistent hashing.

▪ Run a simple KV store (LSM-tree) which exposes a flexible KV interface
▪ Run adapters on application servers that translate XX-to-KV
▪ Implement the FTL in hardware, since the LSM-tree is append-only
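The consistent hashing used by the adapters can be sketched as follows. This is an illustrative model only: the node names, virtual-node count, and MD5-based hash are assumptions, not LightStore's actual implementation.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    # Stable 64-bit hash of a string (first 8 bytes of MD5).
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class HashRing:
    """Consistent-hash ring with virtual nodes: adding or removing one
    LightStore node remaps only ~1/N of the keys."""
    def __init__(self, nodes, vnodes=64):
        self._ring = sorted((_h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self._keys = [k for k, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First vnode clockwise of the key's hash owns the key.
        i = bisect.bisect(self._keys, _h(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node0", "node1", "node2"])
target = ring.node_for("user:1234")   # every adapter computes the same target
```

Because each adapter computes the ring independently from the node list, no central directory is needed; that matches the slide's point that the adapters, not the storage side, route KV requests.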

SLIDE 20


① LightStore Software ② LightStore Controller ③ LightStore Adapter

SLIDE 21

▪ Motivation
▪ Basic Idea
▪ LightStore Software
▪ LightStore Controller
▪ LightStore Adapters
▪ Experimental Results
▪ Conclusion

SLIDE 22

▪ Hash-based KVS
  ▪ Simple implementation
  ▪ Unordered keys → limited RANGE & SCAN
  ▪ Random == sequential access
  ▪ Unbounded tail latency
  ▪ KV-SSDs (mounted on host): Samsung KV-SSD; KAML [Jin et al., HPCA 2017]; BlueCache [Xu et al., VLDB 2016]
▪ LSM-tree-based KVS ← Our choice!
  ▪ Multi-level search tree
  ▪ Sorted keys → RANGE & SCAN supported
  ▪ Fast sequential access → adapter-friendly
  ▪ Bounded tail latency
  ▪ Append-only batched writes → flash-friendly

SLIDE 23

▪ LightStore software is implemented using the LSM-tree algorithm
  ▪ A popular algorithm for implementing a key-value store (KVS)
  ▪ Suitable for NAND flash since it is append-only
▪ How about using existing popular KV software (e.g., RocksDB)?
  ▪ It is quite heavy to run on ARM cores
  ▪ RocksDB on a 4-core ARM + Samsung 960 PRO SSD failed to deliver raw flash throughput to a network port
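To make the append-only behavior concrete, here is a toy LSM-tree sketch: a mutable in-memory memtable plus sorted, immutable runs that are only ever appended. The structure and sizes are simplified far beyond the real engine; this only illustrates why all flash writes end up sequential.

```python
from bisect import bisect_left

class TinyLSM:
    """Toy LSM-tree: a dict memtable plus sorted, append-only runs.
    Nothing is updated in place, so the on-flash layout stays sequential."""
    def __init__(self, memtable_limit=4):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit

    def set(self, k, v):
        self.memtable[k] = v
        if len(self.memtable) >= self.limit:
            # Flush: write one sorted run sequentially; newest run first.
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, k):
        if k in self.memtable:
            return self.memtable[k]
        for run in self.runs:                      # search newest to oldest
            i = bisect_left(run, (k,))
            if i < len(run) and run[i][0] == k:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one, keeping the newest value per key;
        # this is the step that replaces the FTL's garbage collection.
        merged = {}
        for run in reversed(self.runs):            # apply oldest first
            merged.update(run)
        self.runs = [sorted(merged.items())]
```

Note that compaction rewrites whole runs sequentially rather than patching pages, which is exactly the property the later slides exploit to replace garbage collection with a segment-mapped hardware FTL.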

SLIDE 24

▪ Three main bottlenecks in running RocksDB on ARM
  1. Excessive memory-copy overhead
     • memcpy() calls account for up to 30% of the total CPU cycles
     • Partially due to compaction
  2. High context-switch overhead
     • Spawns more than 20 threads for simultaneously processing user requests, flushes, and compaction
     • Only 4 cores are available in the SSD controller
  3. Deep and sophisticated software stack
     • Runs atop kernel layers, such as the page cache, a file system, and the block I/O layer
▪ Solutions?
  1. Implement a KVS from scratch so that it runs efficiently on ARM
  2. Rebuild a lightweight storage stack

SLIDE 25

▪ Platform Library
  • Does not rely on the kernel too much
  • Zero-copy memory allocator: uses mmap() to directly transfer data between DRAM and devices
  • Direct-IO engine: uses memory-mapped registers and polling to control the hardware

[Diagram: a user-space platform library (zero-copy memory allocator, direct-IO engine, poller thread #5, memory mapper) sits over a thin kernel device driver (interrupt handler, mmap(), poll()) that controls the LightStore controller and LPDRAM]
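The zero-copy idea can be illustrated in miniature: one shared mapping, with components exchanging (offset, length) descriptors instead of the bytes themselves. This is only an analogy for what lsn_malloc()/mmap() achieve; a real allocator would map DMA-visible device memory, not an anonymous region.

```python
import mmap

# One shared mapping stands in for a DMA-visible region used by the
# NIC, the KV server, and the flash DMA engine: components hand each
# other (offset, length) descriptors instead of copying payloads.
buf = mmap.mmap(-1, 1 << 20)          # 1 MiB anonymous mapping
view = memoryview(buf)

def produce(offset: int, payload: bytes) -> tuple:
    view[offset:offset + len(payload)] = payload   # single write into region
    return (offset, len(payload))                  # pass a descriptor, not data

def consume(desc: tuple) -> memoryview:
    off, length = desc
    return view[off:off + length]                  # zero-copy slice

desc = produce(4096, b"value-bytes")
assert consume(desc).tobytes() == b"value-bytes"
```

The point of the design is that the payload is written once (by whichever device produced it) and every later stage reads the same physical memory, which is how the platform library avoids memcpy() across layers.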

SLIDE 26

▪ KV Protocol Server
  • A simple socket server that handles KV requests
  • Uses the zero-copy allocator to avoid data copies between the NIC and DRAM

[Diagram: a KV request handler (thread #1) and a KV reply handler (thread #2) are added on top of the platform library; requests arrive from the datacenter network]
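A KV protocol server mainly frames and dispatches requests. The deck does not specify LightStore's wire format, so the header layout below (opcode, key length, value length) is entirely a hypothetical stand-in used to show the framing logic.

```python
import struct

# Hypothetical wire format (not LightStore's actual protocol): a fixed
# header <opcode:1, key_len:2, val_len:4>, then the key and value bytes.
HDR = struct.Struct("!BHI")
OPS = {0: "GET", 1: "SET", 2: "DELETE"}

def encode(op: str, key: bytes, value: bytes = b"") -> bytes:
    opcode = {v: k for k, v in OPS.items()}[op]
    return HDR.pack(opcode, len(key), len(value)) + key + value

def decode(buf: bytes):
    """Yield complete (op, key, value) requests from a byte stream; a real
    server would keep the unconsumed tail for the next recv()."""
    off = 0
    while off + HDR.size <= len(buf):
        opcode, klen, vlen = HDR.unpack_from(buf, off)
        end = off + HDR.size + klen + vlen
        if end > len(buf):
            break                      # partial request: wait for more bytes
        key = buf[off + HDR.size:off + HDR.size + klen]
        val = buf[off + HDR.size + klen:end]
        yield OPS[opcode], key, val
        off = end

stream = encode("SET", b"k1", b"v1") + encode("GET", b"k1")
reqs = list(decode(stream))
```

Length-prefixed framing like this is what lets the request handler slice keys and values out of the NIC buffer by offset, which is exactly where the zero-copy allocator pays off.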

SLIDE 27

▪ LSM-Tree Engine
  • An implementation of the LSM-tree algorithm optimized for ARM: (1) key-value decoupling, (2) key-table caching, …
  • Uses the direct-IO engine to control the LightStore controller
  • Just forwards pointers to allocated memory chunks to the LightStore controller

[Diagram: an LSM-tree manager (thread #3) and a writer & compaction thread (thread #4) are added, glued to the other threads via lock-free queues; the memtable lives in LPDRAM, and the engine calls lsn_malloc()/lsn_free()/lsn_read()/lsn_write()]

SLIDE 28

❶ Less context-switch overhead
  • The number of threads is limited to five
  • Threads are glued via lock-free queues
❷ No memory copies across all layers
  • Including the KV server, the LSM-tree engine, and the platform library
❸ Less intervention by the deep I/O stack
  • No block layer, no file system, …
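The fixed-thread, queue-glued structure can be sketched like this. Python's queue.Queue stands in for the lock-free queues (it locks internally), and the request tuples are invented for illustration; the point is that a small, fixed set of threads drains queues instead of spawning a thread per request.

```python
import queue
import threading

# Two of the five-thread pipeline stages, glued by queues. queue.Queue
# is a stand-in for LightStore's lock-free queues (illustration only).
requests, replies = queue.Queue(), queue.Queue()
store = {}

def lsm_worker():
    # Stand-in for the LSM-tree manager thread: drain requests, apply
    # them to the store, push replies. No per-request thread creation.
    while True:
        op, key, val = requests.get()
        if op == "stop":
            break
        if op == "set":
            store[key] = val
            replies.put("ok")
        else:                      # "get"
            replies.put(store.get(key))

t = threading.Thread(target=lsm_worker)
t.start()
requests.put(("set", "k", 42))
requests.put(("get", "k", None))
requests.put(("stop", None, None))
t.join()
```

With one long-lived thread per stage, the scheduler only ever juggles five threads on four cores, which is the deck's answer to RocksDB's 20+ threads.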

SLIDE 29

▪ Motivation
▪ Basic Idea
▪ LightStore Software
▪ LightStore Controller
▪ LightStore Adapters
▪ Experimental Results
▪ Conclusion

SLIDE 30

▪ The LSM-tree writes all data sequentially, all the time
▪ Example: I/O access patterns of RocksDB (LSM-tree based)

[Figure: write offsets over time – the log always appends data; LSM-tree compaction appears as additional sequential writes]

SLIDE 31

▪ The append-only behavior of the LSM-tree simplifies the FTL design
  ▪ No fine-grained mapping (e.g., page-level mapping)
  ▪ No garbage collection (the LSM-tree's compaction replaces it)
▪ The FTL is completely implemented in HW
  ▪ No ARM CPU is needed for the FTL, which frees the ARM cores to run software
  ▪ Faster than a SW FTL: 700 ns for address translation

[Diagram: the LightStore controller – software interface & DMA engines; a lightweight flash translation layer (segment mapping, wear-leveling, bad-block mgmt.); a flash chip manager (NAND control, ECC, I/O scheduling); and an expansion-card manager, all on the system bus (e.g., AXI) with an ARM core (e.g., Cortex-A53), block RAM, and a built-in battery; NAND flash array cards attach via FMC, and LightStore expansion cards via serial links]
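A software model of the segment-mapped translation that the hardware FTL performs (in ~700 ns, per the slide). The segment geometry below is an assumption for illustration, not the real device's layout.

```python
# Software model of segment-mapped address translation. The real
# LightStore FTL does this in hardware; sizes here are illustrative.
PAGES_PER_SEG = 256                   # assumed pages per segment

class SegmentFTL:
    def __init__(self, num_segments):
        self.free = list(range(num_segments - 1, -1, -1))  # free phys segments
        self.map = {}                 # logical segment -> physical segment

    def translate(self, lpn: int) -> int:
        """Logical page -> physical page: one table lookup plus the
        unchanged in-segment offset. No per-page map, no GC, because
        the LSM-tree only ever appends within a segment."""
        seg, off = divmod(lpn, PAGES_PER_SEG)
        if seg not in self.map:       # append-only: first touch allocates
            self.map[seg] = self.free.pop()
        return self.map[seg] * PAGES_PER_SEG + off

ftl = SegmentFTL(num_segments=1024)
ppn = ftl.translate(0)
```

Because the mapping table has one entry per segment rather than per page, it is small enough to sit in on-chip block RAM, which is what makes a pure-hardware FTL feasible.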

SLIDE 32

▪ Motivation
▪ Basic Idea
▪ LightStore Software
▪ LightStore Controller
▪ LightStore Adapters
▪ Experimental Results
▪ Conclusion

SLIDE 33

▪ The LightStore adapter is responsible for translating traditional I/O commands into KV pairs
▪ It runs on the application-server side as FUSE, BUSE, or a library

[Diagram – example file-to-KV adapter: a user application issues POSIX file I/O (e.g., fwrite()) through the virtual file system and the FUSE kernel module to a user-space file-to-KV adapter (a FUSE module), which sends KV pairs over a socket, through the network driver, to the LightStore cluster]

SLIDE 34

▪ The flexibility of the KV interface makes it possible to support various traditional protocols
▪ Four protocols are supported
  1. Native KV interface: GET/PUT, … – LightStore supports a KV interface natively
  2. YCSB interface: Read/Insert/Scan, … – each YCSB command directly corresponds to a specific KV operation, except for multiple fields, which can be supported with MGET/MSET
  3. Block interface: Read/Write/Trim – a key corresponds to an LBA; a value corresponds to 4 KB of fixed-size data
  4. File interface: fread()/fwrite(), … – a file can be handled as a key-value object; currently, a file system runs atop the block interface
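The block interface's key = LBA, value = 4 KB rule makes the Blk adapter almost trivial. A sketch, with a plain dict standing in for the KV cluster client; zero-filling reads of unwritten blocks is an added assumption, not stated in the deck.

```python
BLOCK = 4096   # the deck's block interface: key = LBA, value = 4 KB of data

class BlockToKVAdapter:
    """Sketch of the Blk adapter: block reads/writes/trims become KV
    GET/SET/DELETE. `kv` is any mapping-like KV client; here a dict
    stands in for the LightStore cluster."""
    def __init__(self, kv):
        self.kv = kv

    def write(self, lba: int, data: bytes):
        assert len(data) == BLOCK
        self.kv[lba] = data                        # SET lba -> 4 KB value

    def read(self, lba: int) -> bytes:
        # Assumption: unwritten blocks read back as zeros.
        return self.kv.get(lba, b"\x00" * BLOCK)   # GET lba

    def trim(self, lba: int):
        self.kv.pop(lba, None)                     # DELETE drops the mapping

disk = BlockToKVAdapter({})
disk.write(7, b"A" * BLOCK)
```

Fixing the value size at the flash page size means every block write maps to exactly one flash-page-sized KV value, so the adapter adds no buffering or read-modify-write of its own.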

SLIDE 35

▪ Motivation
▪ Basic Idea
▪ LightStore Software
▪ LightStore Controller
▪ LightStore Adapters
▪ Experimental Results
▪ Conclusion

SLIDE 36

▪ Each LightStore prototype node is implemented using a Xilinx ZCU102 evaluation board (with a Cortex-A53 CPU) and a custom flash card

SLIDE 37

▪ Clients and storage nodes are connected to the same 10GbE switch

                x86-based storage system              LightStore
CPU             Xeon E5-2640 (20 cores @ 2.4 GHz)     ARM Cortex-A53 (4 cores @ 1.2 GHz)
DRAM            32 GB                                 4 GB
SSD or flash    Samsung 960 PRO 512 GB SSD            Custom 512 GB NAND flash
                (firmware: FTL, buffers, …)           (raw flash)
  Throughput    3.21 GB/s / 1.38 GB/s                 1.2 GB/s / 430 MB/s
  Latency       80 us / 120 us                        120 us / 480 us
KVS             RocksDB v5.8                          Our LSM-tree engine
Client Ifc      ARDB                                  Our KV protocol server
Network         10 Gbit Ethernet (* up to 1.20 GB/s)  10 Gbit Ethernet (* up to 620 MB/s)
OS              Ubuntu 16.04 (Linux 4.9.0)

SLIDE 38

▪ 5 synthetic workloads to evaluate KVS performance
▪ A value size of 8 KB was used to match the flash page size
  • The latest version has been improved to support various key/value sizes

Synthetic workloads:
  S-SET    Sequential Write
  S-GET    Sequential Read
  R-SET    Random Write
  R-GET    Random Read
  R-Mixed  Random, R:W = 9:1

SLIDE 39

▪ Except for write workloads, LightStore fully saturates flash bandwidth

[Figure: single-node throughput for S-SET (sequential set), S-GET (sequential get), R-SET (random set), R-GET (random get), and R-Mixed (random mixed) – sequential I/O fully saturates NAND bandwidth; write workloads show compaction overheads; random reads show search & memory overheads]
SLIDE 40

▪ Except for write workloads, LightStore fully saturates network bandwidth

[Figure: network-attached single-node throughput for the same five workloads – sequential I/O fully saturates NAND bandwidth]

SLIDE 41

▪ x86-RocksDB performs better thanks to the high speed of the Samsung 960 PRO
▪ LightStore outperforms x86 under random writes (e.g., R-SET and R-Mixed)
▪ x86-ARDB suffers from non-trivial software-stack overheads

[Figure: LightStore vs. x86 – flash is the bottleneck for writes; the network is the bottleneck for reads]

SLIDE 42

▪ LightStore scales linearly with the number of SSDs added to a cluster

SLIDE 43

▪ Assume that x86-ARDB scales with up to 4 SSDs: 4 times the performance seen previously
▪ Peak power: x86-ARDB – 400 W; LightStore prototype – 25 W

LightStore IOPS/W gain:
  S-SET  S-GET  R-SET  R-GET  R/W mix
  1.8x   2.5x   7.4x   2.8x   5.7x
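A quick sanity check on these gains: with the stated peak powers, an IOPS/W gain g implies LightStore delivered g / (400 W / 25 W) = g/16 of the 4-SSD x86 setup's throughput. This is a back-of-the-envelope model derived from the slide's numbers, not a reported measurement.

```python
# Back-of-the-envelope check of the efficiency numbers from the slide:
# gain = (iops_ls / P_ls) / (iops_x86 / P_x86)
#      = (iops_ls / iops_x86) * (P_x86 / P_ls)
P_X86, P_LS = 400, 25
gains = {"S-SET": 1.8, "S-GET": 2.5, "R-SET": 7.4, "R-GET": 2.8, "R/W mix": 5.7}

power_ratio = P_X86 / P_LS                        # 16x power advantage
# Implied LightStore/x86 throughput ratio per workload:
throughput_ratio = {w: g / power_ratio for w, g in gains.items()}
```

For example, the 7.4x R-SET gain only requires LightStore to reach about 46% of the x86 cluster's random-write throughput, consistent with the comparable-throughput claim in the conclusion.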

SLIDE 44

SLIDE 45

▪ HW FTL > lightweight SW FTL > full SW FTL
  ▪ Full SW FTL: page mapping plus garbage-collection copying overhead
  ▪ Read: 7-10% degradation; Write: 28-50% degradation (the compaction thread is very active, adding more SW FTL work)

→ Without the FPGA (or HW FTL), we would need an extra set of cores (a trade-off between cost and design effort)

SLIDE 46

▪ Network-attached single-node performance and scalability w/ multiple nodes

[Figures: YCSB performance, block I/O performance, and file I/O performance – LightStore reaches the maximum network or maximum NAND bandwidth in each case; Ceph is inefficient for handling small data]
SLIDE 47

▪ This work was motivated by two observations in distributed storage
  1. The existing storage architecture does not scale well
  2. Applications fail to exploit the full performance of SSDs over the network
▪ LightStore is a lean, drive-sized, high-speed KV node that plugs directly into a network port
  1. Lightweight KV storage engine → delivers full NAND speed to network ports
  2. Hardware FTL → minimizes resource requirements
  3. XX-to-KV adapters → support various applications w/ no modification
▪ A four-node cluster showed throughput comparable to the AFA with four SSDs and achieved up to 7.4x better ops/J

slide-48
SLIDE 48

Thank you!

https://datalab.dgist.ac.kr