Designing a Storage Software Stack for Accelerators
Shinichi Awamoto 1, Erich Focht 2, Michio Honda 3
1 NEC Labs Europe; 2 NEC Deutschland; 3 University of Edinburgh
<shinichi.awamoto@gmail.com>
SoC-based Accelerators
SoC-based accelerators combine special-purpose cores with general-purpose cores, so they can execute the entire application code, unlike GPUs, which run only kernel functions.
Examples: SmartNICs, Xilinx Zynq, NEC SX-Aurora TSUBASA Vector Engine.
The I/O problem in the accelerator
But still, the host system mediates data access, which incurs overhead: every I/O access from the accelerator's special or general cores has to go through the host.
How does this problem happen?
Multiple data copies and the dispatch inside redirected system calls increase latency; even in microbenchmarks the overhead is obvious.
[Figure: a redirected system call travels from the accelerator application (via glibc and the accelerator OS kernel) to the Linux driver on the host CPU, then over the PCIe hub to the disk; annotated latencies: 105 ns and 42,748 ns]
Designing a Storage Stack
Goal: fast storage access in accelerator applications.
Design options and their drawbacks:
• Buffer cache sharing, e.g., GPUfs (ASPLOS '13), SPIN (ATC '17): only DMAs are mitigated.
• Linux kernel library (RoEduNet '10): expensive kernel emulation and ext4fs overhead; conventional system software does not perform well on wimpy accelerator cores.
• Heterogeneous-arch kernels, e.g., Multikernel (SOSP '09), Popcorn Linux (ASPLOS '17): no kernel context in the accelerator.
HAYAGUI: accelerator storage stack
• NVMe device driver within the accelerator user space (Direct I/O Engine)
• File system for data organization and buffer caches (AccelFS)
• LevelDB for the KVS interface (a usage sketch follows below)
[Figure: the HAYAGUI stack runs in accelerator user space (Accel. App, LevelDB, AccelFS, Direct I/O Engine) and reaches the NVMe disk over the PCIe hub, bypassing the host kernel]
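A minimal sketch of how an accelerator application could use the KVS interface, assuming the standard LevelDB C++ API; the database path "/accelfs/db" is a made-up placeholder for an AccelFS-backed directory, not a path from the slides.

#include <cassert>
#include <iostream>
#include <string>

#include "leveldb/db.h"

int main() {
  leveldb::DB* db = nullptr;
  leveldb::Options options;
  options.create_if_missing = true;

  // With HAYAGUI the database directory would be backed by AccelFS and the
  // Direct I/O Engine instead of the host kernel's file system
  // ("/accelfs/db" is a hypothetical path used only for illustration).
  leveldb::Status s = leveldb::DB::Open(options, "/accelfs/db", &db);
  if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }

  // Puts and gets stay on the accelerator: LevelDB -> AccelFS buffer cache ->
  // user-space NVMe driver, with no system call redirection to the host.
  s = db->Put(leveldb::WriteOptions(), "key", "value");
  assert(s.ok());

  std::string value;
  s = db->Get(leveldb::ReadOptions(), "key", &value);
  if (s.ok()) std::cout << "key -> " << value << "\n";

  delete db;
  return 0;
}

The application-facing API is unchanged; only the layers beneath LevelDB differ from a host-mediated setup.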
How does the Direct I/O Engine control SSDs?
• UIO manages a device access right on the host side (a host-side UIO sketch follows below).
• Device registers and DMA buffers are remapped into the accelerator's memory space using APIs provided by the accelerator vendor.
• No host-side intervention is needed throughout the entire process.
[Figure: device registers and DMA buffers in host physical memory space are remapped into the Direct I/O Engine's memory space on the accelerator]
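The slide does not spell out the host-side setup, so the following is a minimal sketch assuming Linux UIO conventions: the NVMe controller's BAR0 is exported as map 0 of /dev/uio0 and mmap'd on the host. The device path and BAR size are assumptions, and the subsequent vendor-specific remapping into the accelerator is only noted in a comment.

#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
  // Linux UIO exposes the NVMe controller's BAR0 as map 0 of /dev/uioX once
  // the device is bound to a UIO driver (binding is a setup step not shown).
  int fd = open("/dev/uio0", O_RDWR);
  if (fd < 0) { std::perror("open /dev/uio0"); return 1; }

  const size_t bar_len = 0x4000;  // assumed BAR size; read it from sysfs in practice
  void* p = mmap(nullptr, bar_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) { std::perror("mmap"); return 1; }
  volatile uint32_t* bar = static_cast<volatile uint32_t*>(p);

  // The NVMe capability (CAP) and version (VS) registers sit at the start of BAR0.
  std::printf("NVMe CAP=0x%08x%08x VS=0x%08x\n",
              (unsigned)bar[1], (unsigned)bar[0], (unsigned)bar[2]);

  // HAYAGUI would now remap this register window and the DMA buffers into the
  // accelerator's memory space with the vendor-provided APIs (not shown here;
  // the exact calls are vendor specific).

  munmap(p, bar_len);
  close(fd);
  return 0;
}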
AccelFS & LevelDB
• AccelFS provides file names, organized data, and buffer caches.
• LevelDB is also ported on top of AccelFS without sacrificing performance (an Env-based porting sketch follows below).
• The current ext2-like design would be replaced with an accelerator-aware implementation. As a further step, we are exploring accelerator-specific optimizations (e.g., vectorized LSM-tree compactions).
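LevelDB reaches storage only through its leveldb::Env abstraction, which is the usual hook for retargeting it to a different file system. The sketch below illustrates that mechanism under the assumption that a port could be wired this way; the slides do not describe HAYAGUI's actual port. AccelFsEnv is a made-up name, and the wrapper merely logs and delegates to the default POSIX Env where a real port would call into AccelFS.

#include <iostream>
#include <string>

#include "leveldb/db.h"
#include "leveldb/env.h"

// LevelDB performs all file I/O through leveldb::Env, so porting it to a new
// file system amounts to supplying an Env whose file classes call into it.
// This wrapper only logs and forwards to the default (POSIX) Env.
class AccelFsEnv : public leveldb::EnvWrapper {
 public:
  explicit AccelFsEnv(leveldb::Env* base) : leveldb::EnvWrapper(base) {}

  leveldb::Status NewWritableFile(const std::string& fname,
                                  leveldb::WritableFile** result) override {
    std::cerr << "create " << fname << " (an AccelFS port would handle this)\n";
    return target()->NewWritableFile(fname, result);
  }
};

int main() {
  AccelFsEnv env(leveldb::Env::Default());

  leveldb::Options options;
  options.create_if_missing = true;
  options.env = &env;  // route all of LevelDB's file I/O through the custom Env

  leveldb::DB* db = nullptr;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/envdemo", &db);
  if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }
  db->Put(leveldb::WriteOptions(), "k", "v");
  delete db;
  return 0;
}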
Evaluation setup
• Host: Intel Xeon Gold 6126 (2.60 GHz, 12 cores), 96 GB RAM
• Accelerator: NEC SX-Aurora TSUBASA
• NVMe SSD: Samsung 970 EVO
Microbenchmarks: how much does HAYAGUI improve file operations?
• Workloads: read, write, and write+sync
• In the baselines, read(2), write(2), and fdatasync(2) are used (the baseline write+sync loop is sketched below).
• 20-99% reduction in latency (lower is better)
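For reference, a minimal sketch of the write+sync baseline as described above: each operation is a write(2) followed by fdatasync(2), timed over many iterations on the host-mediated path. The file path, record size, and iteration count are illustrative assumptions, and the HAYAGUI side of the comparison would use its own interfaces instead.

#include <chrono>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

int main() {
  // Baseline "write+sync": write(2) followed by fdatasync(2); when run on the
  // accelerator, each of these calls is redirected to the host.
  const size_t record = 4096;           // assumed record size
  std::vector<char> buf(record, 'x');

  int fd = open("/mnt/nvme/bench.dat",  // illustrative test file path
                O_CREAT | O_WRONLY | O_TRUNC, 0644);
  if (fd < 0) { std::perror("open"); return 1; }

  const int iters = 1000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; i++) {
    if (write(fd, buf.data(), record) != (ssize_t)record) { std::perror("write"); return 1; }
    if (fdatasync(fd) != 0) { std::perror("fdatasync"); return 1; }
  }
  auto end = std::chrono::steady_clock::now();

  double us = std::chrono::duration<double, std::micro>(end - start).count() / iters;
  std::printf("write+sync latency: %.2f us/op\n", us);
  close(fd);
  return 0;
}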
LevelDB evaluation (db_bench): how much does HAYAGUI improve KVS workloads?
• Sequential and random access workloads via db_bench
• 33-81% latency improvement (lower is better)
LevelDB evaluation (YCSB): realistic KVS workloads
• Small- or medium-sized data
• 12-89% throughput improvement (higher is better)
Genome sequence matching app: how does HAYAGUI improve realistic apps?
• An accelerator application analyzing DNA sequences
• Bulk-read workloads (2.1 GB, 7.0 GB, and 15.0 GB inputs)
• 33-46% reduction in execution time (smaller is better)
Summary
• On SoC-based accelerators, I/O access matters.
• HAYAGUI, an accelerator storage stack:
  • reads and writes the storage medium directly,
  • provides various interfaces: raw I/O, file system, and KVS,
  • outperforms the system-call-redirection baselines.
• Ongoing work:
  • Is the direct-access architecture feasible in other accelerators?
  • How do we overcome the weakness of general-purpose cores in accelerators?
  • How could we exploit specialized engines in accelerators?
  • Is it possible to build a generic, one-size-fits-all storage stack for accelerators?
Designing a Storage Software Stack for Accelerators
Thank You / Q&A
Shinichi Awamoto 1, Erich Focht 2, Michio Honda 3
1 NEC Labs Europe; 2 NEC Deutschland; 3 University of Edinburgh
<shinichi.awamoto@gmail.com>