Designing a Storage Software Stack for Accelerators
Shinichi Awamoto1, Erich Focht2, Michio Honda3
1NEC Labs Europe; 2NEC Deutschland; 3University of Edinburgh
<shinichi.awamoto@gmail.com>
photos from: xilinx.com, www.mellanox.com, uk.nec.com, nvidia.com
SmartNICs, Xilinx Zynq, NEC SX-Aurora TSUBASA Vector Engine
Unlike GPUs, these accelerators execute the entire application code.
[Figure labels: general core, special core, GPU kernel function, I/O access]
But still, the host system mediates data access, which incurs overhead.
Multiple data copies and dispatch inside of redirected system calls increase latency. Even on microbenchmarks, the overhead is obvious.
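[Figure: system call path between the accelerator and the host CPU over PCIe, across user, kernel, and hardware layers. Latencies shown: 105 ns and 42748 ns.]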
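One way to observe this kind of per-call overhead is to time a small system call in a tight loop and compare the result across platforms. The sketch below is only an illustration and is not the benchmark behind the numbers above. The helper names `mean_latency_ns` and `pread_latency_ns` are hypothetical, and the measurement includes Python interpreter overhead.

```python
import os
import tempfile
import time


def mean_latency_ns(fn, iterations=10_000):
    """Return the mean wall-clock time of fn() in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations


def pread_latency_ns(path, iterations=10_000):
    """Mean latency of a 1-byte pread(2) at offset 0 of an existing file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return mean_latency_ns(lambda: os.pread(fd, 1, 0), iterations)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical usage: time pread(2) on a one-byte temporary file.
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"x")
        f.flush()
        print(f"pread: {pread_latency_ns(f.name):.0f} ns per call")
```

Running the same loop natively and through a redirected system call path would show the gap directly, assuming both environments can run the script.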
Design options
Expensive kernel emulation
e.g. GPUfs (ASPLOS '13), SPIN (ATC '17)
Only DMAs are mitigated.
e.g. Multikernel (SOSP '09), Popcorn Linux (ASPLOS '17)
No kernel context in the accelerator
Goal: fast storage access in accelerator applications.
ext4fs
Conventional system software does not perform well on wimpy accelerator cores.
accelerator user-space
and buffer caches
[Figure: software stack across kernel, user, and hardware layers, with LevelDB, the Direct I/O Engine, glibc, and PCIe.]
[Figure: remapping of registers and DMA buffers between the host physical memory space and the accelerator's memory space, used by the Direct I/O Engine.]
Registers and DMA buffers are remapped using APIs provided by the accelerator vendor.
needed throughout the entire process.
without sacrificing performance.
The current ext2-like design would be replaced with an accelerator-aware implementation.
As a further step, we are exploring accelerator-specific optimizations (e.g. vectorized LSM-tree compactions).
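For readers unfamiliar with the term, an LSM-tree compaction merges several sorted runs into one and keeps only the newest value for each key. The sketch below is a plain scalar version, not a vectorized one and not the authors' implementation. The names `compact` and `_tag` are hypothetical, and details such as tombstones and on-disk formats are left out.

```python
import heapq


def _tag(run, age):
    """Yield (key, age, value) so equal keys sort by run age."""
    for key, value in run:
        yield key, age, value


def compact(runs):
    """Merge sorted runs of (key, value) pairs; runs[0] is the newest.

    Each run must be sorted by key. When a key appears in several runs,
    only the value from the newest run is kept.
    """
    merged = heapq.merge(
        *(_tag(run, age) for age, run in enumerate(runs)),
        key=lambda entry: (entry[0], entry[1]),
    )
    out = []
    for key, _age, value in merged:
        # The first entry seen for a key comes from the newest run.
        if not out or out[-1][0] != key:
            out.append((key, value))
    return out


if __name__ == "__main__":
    newer = [("a", 1), ("c", 3)]
    older = [("a", 0), ("b", 2)]
    print(compact([newer, older]))
```

The inner loop is a sequence of independent key comparisons over sorted data, which is the kind of work a vector unit could plausibly batch.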
Intel Xeon Gold 6126 (2.60 GHz, 12-core), 96 GB RAM
photos from: sx-aurora.github.io
How much does HAYAGUI improve file operations?
fdatasync(2) are used.
Lower is better.
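A file-operation measurement of this kind can be sketched as a loop of writes, each followed by fdatasync(2), so that every iteration includes the cost of reaching the storage device. This is a minimal sketch under assumptions, not the benchmark used in the talk. The function name `write_fdatasync_latency_ns`, the payload size, and the round count are hypothetical, and `os.fdatasync` is assumed to be available (it is on Linux).

```python
import os
import tempfile
import time


def write_fdatasync_latency_ns(path, payload, rounds=100):
    """Mean time of one write(2) followed by fdatasync(2), in nanoseconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        start = time.perf_counter_ns()
        for _ in range(rounds):
            os.write(fd, payload)  # append the payload to the file
            os.fdatasync(fd)       # force the file data out before continuing
        return (time.perf_counter_ns() - start) / rounds
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "bench.dat")
        latency = write_fdatasync_latency_ns(target, b"\0" * 4096)
        print(f"write+fdatasync: {latency:.0f} ns per operation")
```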
How much does HAYAGUI improve KVS workloads?
Lower is better.
Realistic KVS workloads
Higher is better.
How does HAYAGUI improve realistic apps?
(2.1GB) (7.0GB) (15.0GB)
Smaller is better.