

SLIDE 1

GPU Direct IO with HDF5

John Ravi • Quincey Koziol • Suren Byna

SLIDE 2

Motivation

  • As large-scale computing systems move toward using GPUs as their computational workhorses, file I/O to move data between GPUs and storage devices becomes critical.
  • Among I/O performance optimization technologies, NVIDIA’s GPUDirect Storage (GDS) reduces the latency of data movement between GPUs and storage.
  • In this presentation, we discuss a recently developed virtual file driver (VFD) that takes advantage of the GDS technology, allowing data transfers between GPUs and storage without using CPU memory as a “bounce buffer”.

SLIDE 3

Traditional Data Transfer without GPUDirect Storage

  1. fd = open("file.txt", O_RDONLY);
  2. buf = malloc(size);
  3. pread(fd, buf, size, 0);
  4. cudaMalloc(&d_buf, size);
  5. cudaMemcpy(d_buf, buf, size, cudaMemcpyHostToDevice);
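For reference, a minimal compilable sketch of this bounce-buffer path (our illustration, not from the slides; error handling abbreviated, file name and transfer size are placeholders):

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <cuda_runtime.h>

  int main(void) {
      size_t size = 1 << 20;                    /* 1 MiB, arbitrary */
      int fd = open("file.txt", O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }

      void *buf = malloc(size);                 /* host "bounce buffer" */
      ssize_t n = pread(fd, buf, size, 0);      /* storage -> host memory */
      if (n < 0) { perror("pread"); return 1; }

      void *d_buf = NULL;
      cudaMalloc(&d_buf, size);
      cudaMemcpy(d_buf, buf, (size_t)n, cudaMemcpyHostToDevice); /* host -> GPU */

      cudaFree(d_buf);
      free(buf);
      close(fd);
      return 0;
  }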

SLIDE 4

Data Transfer with GPUDirect Storage (GDS)

NVIDIA GPUDirect Storage:
  1. fd = open("file.txt", O_RDONLY | O_DIRECT, ...);
  2. cudaMalloc(&d_buf, size);
  3. cuFileRead(fhandle, d_buf, size, 0, 0);

Traditional Data Transfer:
  1. fd = open("file.txt", O_RDONLY, ...);
  2. buf = malloc(size);
  3. pread(fd, buf, size, 0);
  4. cudaMalloc(&d_buf, size);
  5. cudaMemcpy(d_buf, buf, size, cudaMemcpyHostToDevice);

No need for a “bounce buffer”.
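A minimal end-to-end sketch of the GDS path using the standard cuFile API from cufile.h (link with -lcufile; our illustration, with the file name and size as placeholders; the buffer registration calls are optional and relate to the registration overhead discussed later in this deck):

  #define _GNU_SOURCE                           /* for O_DIRECT */
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>
  #include <cuda_runtime.h>
  #include <cufile.h>

  int main(void) {
      size_t size = 1 << 20;
      int fd = open("file.txt", O_RDONLY | O_DIRECT);

      cuFileDriverOpen();                       /* initialize the cuFile driver once per run */

      CUfileDescr_t descr;
      memset(&descr, 0, sizeof(descr));
      descr.handle.fd = fd;
      descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
      CUfileHandle_t fh;
      cuFileHandleRegister(&fh, &descr);        /* wrap the O_DIRECT fd for cuFile */

      void *d_buf = NULL;
      cudaMalloc(&d_buf, size);
      cuFileBufRegister(d_buf, size, 0);        /* optional: pre-register the GPU buffer */

      /* DMA directly from storage into GPU memory: no host bounce buffer */
      ssize_t n = cuFileRead(fh, d_buf, size, 0 /* file offset */, 0 /* device offset */);

      cuFileBufDeregister(d_buf);
      cuFileHandleDeregister(fh);
      cudaFree(d_buf);
      close(fd);
      cuFileDriverClose();
      return (n < 0);
  }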

SLIDE 5

High Level I/O Library Objectives

  • Ease-of-use
  • Standardized format
  • Portable performance optimizations

HPC I/O software stack (top to bottom):

  • Applications
  • High Level I/O Library (HDF5, netCDF, ADIOS)
  • I/O Middleware (MPI-IO)
  • I/O Forwarding
  • Parallel File System (Lustre, GPFS, …)
  • I/O Hardware (disk-based, SSD-based, …)

SLIDE 6

HDF5 Virtual File Driver(s)

  VFD     Description
  SEC2    Default driver; uses POSIX file-system functions such as read and write to perform I/O to a single file
  DIRECT  Forces data to be written directly to the file system; disables OS buffering
  MPIIO   Used with Parallel HDF5 to provide parallel I/O support

[Architecture diagram: applications (netCDF-4, HDFview, h5dump, H5Hut) sit on the HDF5 APIs (C++/FORTRAN/Python/Java and High Level APIs); the library internals (memory management, datatype conversion, I/O filters, chunked storage, version compatibility, et cetera), data model objects (files, groups, datasets, attributes, …), infrastructure (datatypes, dataspaces, IDs, …), and tunable properties (chunk size, I/O driver, …) route all storage access through the Virtual File Layer, where the SEC2, DIRECT, MPI I/O, and other custom VFDs map the HDF5 file format onto storage: SEC2 to a file, DIRECT via direct I/O to the file system, MPI I/O to a file on a parallel file system.]
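Selecting a VFD in application code goes through a file access property list. A minimal sketch: the sec2, direct, and mpio setters below are the real HDF5 API, while the H5Pset_fapl_gds name and its DIRECT-style parameters are our assumption about how the GDS VFD described in this talk is registered:

  #include <hdf5.h>

  hid_t open_with_vfd(const char *path) {
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

      /* Default: the SEC2 (POSIX) driver. */
      H5Pset_fapl_sec2(fapl);

      /* Or: the DIRECT driver (needs an HDF5 build with the direct VFD):
       * H5Pset_fapl_direct(fapl, 4096, 4096, 16 * 1024 * 1024); */

      /* Or: the MPI-IO driver for Parallel HDF5:
       * H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL); */

      /* Or (assumed name and signature): the GDS driver from this talk:
       * H5Pset_fapl_gds(fapl, 4096, 4096, 16 * 1024 * 1024); */

      hid_t file = H5Fopen(path, H5F_ACC_RDWR, fapl);
      H5Pclose(fapl);
      return file;
  }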

SLIDE 7

HDF5 Virtual File Driver(s)

Same VFD table and architecture diagram as Slide 6, with one addition in the Virtual File Layer:

  VFD   Description
  GDS   Enables GPUDirect Storage; performs GPUDirect I/O to the file system

SLIDE 8

GPU Data Management

[Diagram: a GPU (compute cores; memory hierarchy of registers, L1 cache/SMEM, L2 cache, global memory; copy engine) and a CPU (apps, OS kernel, host memory, attached NVMe storage), connected by PCIe 3.0 links at 16 GB/s. An application I/O call moves data from NVMe storage into host memory through the OS kernel, then across PCIe into GPU global memory via the copy engine.]

(Slides 9–18 repeat the Slide 8 diagram as successive animation frames of the traditional data path.)

SLIDE 19

GPU Data Management (with GDS)

[Diagram: the same GPU/CPU/NVMe topology as Slide 8; with GDS, the I/O call moves data directly from NVMe storage across PCIe 3.0 (16 GB/s) into GPU global memory, bypassing host memory as a bounce buffer.]

(Slides 20–23 repeat the Slide 19 diagram as successive animation frames of the GDS data path.)

SLIDE 24

HDF5 GDS – Virtual File Driver

  • GDS VFD differences from the SEC2 VFD:
  • The file descriptor is opened with O_DIRECT (disables all OS buffering).
  • The read and write handlers need to distinguish between CPU (metadata) and GPU memory pointers (see the sketch below).
  • The cuFileDriver needs to be initialized once per run.
  • Some overhead is incurred on each I/O call:
  • Querying the CUDA runtime for information about memory pointers.
  • cuFile buffer registration and deregistration.
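A minimal sketch of how a read/write handler can distinguish GPU pointers from CPU pointers, using the CUDA runtime's pointer-attribute query (the helper name is ours, not from the VFD source; on CUDA 11+ the query succeeds for plain host pointers with type cudaMemoryTypeUnregistered, while older runtimes return an error, and the code below handles both):

  #include <stdbool.h>
  #include <cuda_runtime.h>

  /* Returns true if buf points into GPU (or managed) memory, so the VFD
   * can route the transfer through cuFile instead of POSIX read/write. */
  static bool is_device_pointer(const void *buf) {
      struct cudaPointerAttributes attr;
      if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess) {
          cudaGetLastError();   /* clear the error older runtimes raise for host pointers */
          return false;
      }
      return attr.type == cudaMemoryTypeDevice || attr.type == cudaMemoryTypeManaged;
  }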

SLIDE 25

Experimental Evaluation – Lustre File System

  • GDS VFD knobs:
  • num_threads – number of pthreads servicing one cuFile request
  • blocksize – transfer size of one cuFile request (see the sketch below)

Image source: https://wiki.lustre.org/Introduction_to_Lustre
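What the two knobs control, as an illustrative sketch rather than the VFD's actual code: one logical request of size bytes is split evenly across num_threads worker threads, and each worker issues cuFileRead calls of at most blocksize bytes over its share:

  #include <pthread.h>
  #include <sys/types.h>
  #include <cufile.h>

  typedef struct {
      CUfileHandle_t fh;
      void *d_buf;              /* base GPU pointer for the whole request */
      off_t offset, size;       /* this worker's byte range within the request */
      off_t blocksize;          /* transfer size of one cuFileRead call */
  } chunk_task_t;

  static void *read_worker(void *arg) {
      chunk_task_t *t = (chunk_task_t *)arg;
      for (off_t off = t->offset; off < t->offset + t->size; off += t->blocksize) {
          off_t len = t->blocksize;
          if (off + len > t->offset + t->size) len = t->offset + t->size - off;
          /* file offset and device offset advance together */
          cuFileRead(t->fh, t->d_buf, (size_t)len, off, off);
      }
      return NULL;
  }

  /* Split a size-byte request across num_threads workers (<= 64 for brevity). */
  static void threaded_read(CUfileHandle_t fh, void *d_buf, off_t size,
                            int num_threads, off_t blocksize) {
      pthread_t tid[64];
      chunk_task_t task[64];
      off_t share = size / num_threads;
      for (int i = 0; i < num_threads; i++) {
          task[i] = (chunk_task_t){ .fh = fh, .d_buf = d_buf,
                                    .offset = i * share,
                                    .size = (i == num_threads - 1) ? size - i * share : share,
                                    .blocksize = blocksize };
          pthread_create(&tid[i], NULL, read_worker, &task[i]);
      }
      for (int i = 0; i < num_threads; i++) pthread_join(tid[i], NULL);
  }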

SLIDE 26

Experimental Evaluation

  • System configuration:
  • NVIDIA DGX-2
  • 16x Tesla V100
  • 2x Samsung NVMe SM961/PM961 in RAID 0 (sequential reads ≈ 6.4 GB/s, sequential writes ≈ 3.6 GB/s)
  • Lustre file system (4 OSTs, 1 MB stripe size)
  • Benchmarks:
  • Local storage: sequential read/write rates
  • Lustre file system: multi-threaded sequential read/write rates; multi-GPU (one GPU per process, one file per process)

SLIDE 27

Write Performance – Local Storage

  • HDF5 GDS achieves higher write rates for requests greater than 512 MB.
  • Possible optimizations:
  • Let the user specify whether each memory pointer is in host or device memory, instead of querying per transfer.
  • Register cuFile buffers ahead of the I/O call.

SLIDE 28

Read Performance – Local Storage

  • HDF5 GDS achieves higher read rates for requests greater than 256 MB.
  • Possible optimizations:
  • Let the user specify whether each memory pointer is in host or device memory, instead of querying per transfer.
  • Register cuFile buffers ahead of the I/O call.

SLIDE 29

Multi-Threaded Writes, Single GPU, Lustre File System

  • Using more threads increases write rates dramatically (almost 2x when using 8 threads instead of 4).
  • Varying blocksize did not change the results much.
  • The default behavior of SEC2 (no threading) would require a significant change; some developers are working on relaxing the serial HDF5 “global lock”.

SLIDE 30

Multi-Threaded Reads, Single GPU, Lustre File System

  • SEC2 read rates are best in most cases.
  • More threads did not improve the read rate.
  • Read-ahead was left on for this experiment.

SLIDE 31

Multi-Process Writes, Multiple GPUs, Lustre File System

  • The GDS VFD has a clear advantage over the SEC2 VFD on a distributed file system.

GDS VFD knobs:
  • 4 threads (matching the number of OSTs)
  • 1 MB blocksize (matching the stripe size)

Multi-process writes (a minimal file-per-process sketch follows this list):
  • Single GPU per MPI rank
  • Single HDF5 file per MPI rank
  • File size: 1 GB
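A minimal sketch of the file-per-process pattern used in these runs, with MPI only for process management and each rank writing its own HDF5 file (names and sizes are illustrative; here the buffer is host memory through the default FAPL, whereas the GDS runs would pass a device pointer with the GDS VFD set on the FAPL):

  #include <hdf5.h>
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* One file per rank: no MPI-IO, each rank does independent serial HDF5 I/O. */
      char name[64];
      snprintf(name, sizeof(name), "out.rank%d.h5", rank);

      hsize_t dims[1] = { (1UL << 30) / sizeof(float) };   /* 1 GB of floats */
      float *buf = malloc(dims[0] * sizeof(float));        /* contents don't matter here */

      hid_t file  = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
      hid_t space = H5Screate_simple(1, dims, NULL);
      hid_t dset  = H5Dcreate2(file, "data", H5T_NATIVE_FLOAT, space,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
      H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

      H5Dclose(dset); H5Sclose(space); H5Fclose(file);
      free(buf);
      MPI_Finalize();
      return 0;
  }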

SLIDE 32

Multi-Process Reads, Multiple GPUs, Lustre File System

  • The SEC2 VFD dominates the GDS VFD (read-ahead was left enabled).

GDS VFD knobs:
  • 4 threads (matching the number of OSTs)
  • 1 MB blocksize (matching the stripe size)

Multi-process reads:
  • Single GPU per MPI rank
  • Single HDF5 file per MPI rank
  • File size: 1 GB

SLIDE 33

Conclusions

  • The HDF5 GDS VFD improves write rates over the SEC2 VFD.
  • The HDF5 SEC2 VFD seems to offer higher read rates than the GDS VFD, mainly because of optimizations at other layers (read-ahead).

Future Work

  • GDS for Parallel HDF5 (MPIIO VFD); the MPI-IO developers are working on this.
  • HDF5 GDS VFD tuning knobs for distributed file systems.
  • Avoiding per-call overhead: track data buffer locations and data buffer reuse.
  • Async I/O.

SLIDE 34

Thank you

Contact:
  • John Ravi, jjravi@lbl.gov
  • Quincey Koziol, koziol@lbl.gov
  • Suren Byna, sbyna@lbl.gov