CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization - PowerPoint PPT Presentation

CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization Michael Bauer (Stanford) Henry Cook (UC Berkeley) Brucek Khailany (NVIDIA Research) 1

GPUs Are Ubiquitous  GPUs are in many supercomputers today  GPUs are great  High floating point performance  High memory bandwidth  Why is programming them so challenging?  Explicit data movement through memory hierarchy  Difficult to overlap computation and memory accesses 2

Outline  Overview of GPU Architecture  Motivating Benchmark  CudaDMA API  Methodology  Experiments  Conclusions 3

GPU Architecture/Programming SM SM SM SM CTA CTA CTA CTA Shared Memory Shared Memory Shared Memory Shared Memory On-Chip Memory Data Data Data Data Off-Chip DRAM 4

Warp Definition  Each CTA is decomposed into warps  A warp is 32 contiguous threads in the same CTA Warp 0 Warp 1 Warp 2 Warp 3  SM performs scheduling at warp-granularity  Each warp has its own program counter  All threads in a warp execute in lock-step  Intra-warp divergence has performance penalty  Inter-warp divergence has no performance penalty 5

Motivating Benchmark 6

Motivating Benchmark  Modified SAXPY kernel, staging data through shared  Variable amount of arithmetic  Fixed amount of data transferred and number of warps Increasing compute 7 intensity

GPU Performance Challenges Memory System Bottlenecks Computational Bottlenecks  Instruction Issue  Long-latency memory accesses  Memory Level Parallelism (MLP)  Synchronization  Data Access Patterns overheads  Coalescing  Data Access Patterns  Control Divergence Goal: remove entanglement between the bottlenecks 8

GPU Programmability Challenges  Mismatch CTA size/shape and shared data size/shape  Leads to thread divergence (lots of ‘if’ statements) Goal: decouple CTA size/shape from data size/shape 9

Warp Specialization  Differentiate warps into compute and DMA*  DMA warps  Maximize MLP  Compute warps  No stalls due to memory  Producer-consumer synchronization  Enable better overlapping of compute and memory accesses  CudaDMA objects to manage warp specialization  Describe data transfer patterns  Independent of warp count 10 * D. Merrill and A. Grimshaw. Revisiting Sorting for GPGPU Stream Architectures.

CudaDMA API 11

CudaDMA API class cudaDMA  Declare CudaDMA object { public: to manage shared buffer // Base constructor __device__ cudaDMA ( const int dmaID, const int num_dma_threads,  Separate DMA and const int num_comp_threads, const int thread_idx_start); compute warps public: __device__ bool owns_this_thread(); public: // Compute thread sync functions  Provide synchronization __device__ void start_async_dma(); __device__ void wait_for_dma_finish(); primitives public: // DMA thread sync functions __device__ void wait_for_dma_start(); __device__ void finish_async_dma(); public:  Perform repeated transfer __device__ void execute_dma( operations void *src_ptr, void *dst_ptr); }; 12

CudaDMA Application Structure __global__  Declare shared buffer at void cuda_dma_kernel(float *data) { kernel scope __shared__ float buffer[NUM_ELMTS]; cudaDMA dma_ld(0,NUM_DMA_THRS,  Declare CudaDMA NUM_COMPUTE_THRS, NUM_COMPUTE_THRS); object to manage buffer if (dma_ld.owns_this_thread()) { // DMA warps  Split DMA warps from for (int i=0; i<NUM_ITERS; i++) { dma_ld.wait_for_dma_start(); compute warps dma_ld.execute_dma(data,buffer); dma_ld.finish_async_dma();  Load buffer using DMA } } warps else { // Compute warps for (int i=0; i<NUM_ITERS; i++) {  Process buffer using dma_ld.start_async_dma(); dma_ld.wait_for_dma_finish(); compute warps process_buffer(buffer); }  Iterate (optional) } } 13

Execution Model  Use PTX named barriers  bar.sync Compute DMA Warps Warps  bar.arrive start_async_dma wait_for_dma_start  Available on Fermi bar.arrive bar.sync Named Barrier 1 wait_for_dma_finish finish_async_dma  Fine-grained bar.sync bar.arrive Named synchronization Iteration i Barrier 2 Named Barrier 1 Named Barrier 2 Iteration i+1 14

CudaDMA Methodology 15

Buffering T echniques  Usually one set of DMA warps per buffer  Single-Buffering  One buffer, one warp group  Double-Buffering  Two buffers, two warp groups  Manual Double-Buffering  Two buffers, one warp group 16

CudaDMA Instances  CudaDMASequential  CudaDMAStrided  CudaDMAIndirect  Arbitrary accesses  CudaDMAHalo  2D halo regions  CudaDMACustom 17

Access Patterns  Explicitly state data loading pattern in code  Decouple implementation from transfer pattern  Common patterns implemented by experts  Used by application programmers  Optimized for high memory bandwidth at low warp count 18

Experiments 19

Micro-Benchmarks  Same modified SAXPY kernel shown earlier  Fix compute intensity (6 B/FLOP), vary warp count 20

BLAS2: SGEMV  Dense matrix-vector multiplication  CudaDMASequential for loading vector elements  CudaDMAStrided for loading matrix elements  Varied buffering schemes  Up to 3.2x speedup 21

3D Finite Difference Stencil  8 th order in space, 1 st order in time computation  Load 2D slices into shared for each step in Z-dimension  Loading halo cells uses uncoalesced accesses  Earlier version of cudaDMAHalo 22 Figures from: P. Micikevicius. 3D Finite Difference Computation on GPUs Using CUDA.

3D Finite-Difference Stencil  Use DMA warps for loading halo 35 33.14 cells as well as 29.1 30 27.83 main block cells 25.22 24.16 25 22.3 20  Speedups from 13-15% 15 Execution Time (s) 10  Improvement 5 from more MLP 0 512x512x512 640x640x400 800x800x200 and fewer load Reference CudaDMA instructions Problem Size 23

Conclusions  CudaDMA  Extensible API  Create specialized DMA Warps  Works best for moderate compute intensity applications  Decouple transfer pattern from implementation  Optimized instances for common patterns  CudaDMASequential, CudaDMAStrided  CudaDMAIndirect, CudaDMAHalo  Speedups on micro-benchmarks and applications 24

Download CudaDMA: http://code.google.com/p/cudadma Tech Talk at NVIDIA Booth on Thursday at 1pm Questions? 25

Backup Slides 26

Asynchronous DMA Engines  Decouple transfer implementation from specification  Asynchronous to overlap computation and memory access  Ironman abstraction for ZPL (software)  Sequoia runtime interface (software)  Cell Broadband Engine (hardware)  Imagine Stream Processor (hardware) 27

Code Example: SGEMV  BLAS2: matrix-vector multiplication  Two Instances of CudaDMA objects  Compute Warps  Vector DMA Warps  Matrix DMA Warps 28

Synchronization Points  Compute Warps  start_async_dma()  wait_for_dma_finish()  DMA Warps  wait_for_dma_start()  finish_async_dma() 29

Future Work  Additional CudaDMA Instances  Indirect memory accesses  More applications  Sparse-Matrix operations  Target for higher-level language/DSL compilers  Copperhead, Liszt  Actual hardware DMA engines for GPUs  Warp-specialization aware programming models  Compiler implementations 30

Fast Fourier Transforms  1D, Power of 2 FFTs  Compared to optimized CUFFT library (version 4.0)  32 warps per SM  CudaDMA (custom loader)  24 warps per SM  16 compute, 8 DMA  Same performance at lower warp count 31

CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization - PowerPoint PPT Presentation

CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization Michael Bauer (Stanford) Henry Cook (UC Berkeley) Brucek Khailany (NVIDIA Research) 1 GPUs Are Ubiquitous GPUs are in many supercomputers today GPUs are great High

Explaining Operation Warp Speed Whats the goal? Operation Warp Speeds goal is to produce

From 0 to 60+: From 0 to 60+: Revving up FYE at Warp Revving up FYE at Warp Speed Speed Nancy

WARP and WEFT + X PLAN Studio Warp and Weft Design Team Architecture Alessandro TRIVELLI, Arch.

Real-time 802.11 on WARP Patrick Murphy Mango Communications Nov 2013 http://warpproject.org

Shuffle: Tips and Tricks Julien Demouth, NVIDIA Glossary Warp Implicitly synchronized

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

MULTI GPU PROGRAMMING WITH MPI Jiri Kraus, Senior Devtech Compute, April 4th 2016 MPI+CUDA

MPI AND OPENACC JIRI KRAUS, NVIDIA MPI+OPENACC System System System GDDR5 Memory GDDR5

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Optimizing monitoring networks for Optimizing monitoring networks for Optimizing monitoring

Enhancing Permissiveness of Transactional Memory via Time-Warp Nuno Diegues and Paolo Romano

GPU peak performance vs. CPU Squeezing GPU performance Peak Double Precision FLOPS Peak Memory

Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory

1 Memory SoC Persistent Memory-Driven Memory Memory Processor-Centric Memory SoC SoC

DMA implementations for FPGA- based data acquisition systems Presenter: Wojciech M. Zabootny

The COIN-OR Optimization Suite: Open Source Tools for Optimization Part 5: Advanced Modeling with

Neural Networks for Time Series Prediction 15-496/782: Artificial Neural Networks Kornel

Advanced Features of xts Manipulating Time Series Data in R: Case Studies Finding Endpoints

Confidential Accelerators Stavros Volos Microsoft Research Accelerators Play Pivotal Role in

A n A n a l y s i s o f A c c e l e r a t o r C o u p l i n g i n

Examples of I/O Devices Device Behaviour Partner Data Rate (Mbit/sec) Keyboard Input Human

PCIe and DMA in MirageOS Fabian Bonk Wednesday 20 th May, 2020 Chair of Network Architectures and