Memory Access Patterns: The Missing Piece of the Multi-GPU Puzzle - PowerPoint PPT Presentation

SLIDE 1

Memory Access Patterns: The Missing Piece of the Multi-GPU Puzzle

Tal Ben-Nun, Ely Levy, Amnon Barak and Eri Rubin

The Hebrew University of Jerusalem, Israel

Supercomputing ’15, November 2015

SLIDE 2

Introduction

  • Developing efficient parallel algorithms for GPUs is challenging
  • Memory I/O recurs as the bottleneck
  • Code clutter is caused by device and memory management
  • Current programming models for multi-GPU nodes are often:
    • Insufficient for specific programming needs
    • Overly complex
    • Hard to debug

[Diagram: a host node with CPU 1, CPU 2 and RAM, connected over PCI-Express to GPUs 1–4.]

SLIDE 3

Case Study – The Game of Life

  • Famous cellular automaton
  • Each cell requires its 3×3 neighborhood to compute the next generation
  • On GPUs, each thread computes one (or several) cells
  • Similar to stencil operators, the Jacobi method and many more

[Figure: an input grid and the output grid after one generation.]

SLIDE 12

Case Study – The Game of Life (Single GPU)

GPU 1 thread pseudocode:

```cpp
// Load data to shared memory
smem[tidy * BW + tidx] =
    M[(bidy * BH + tidy) * STRIDE + (bidx * BW + tidx)];
// Wrap coords
// ...
__syncthreads();

neighbors = 0;
current_gen = smem[tidy * BW + tidx];
for (int ly = -1; ly <= 1; ++ly) {
    for (int lx = -1; lx <= 1; ++lx) {
        if (lx == 0 && ly == 0) continue;
        neighbors += smem[(ly + tidy) * BW + (lx + tidx)];
    }
}

// Write the result to the output matrix with the same indexing
OutM[(bidy * BH + tidy) * STRIDE + (bidx * BW + tidx)] = ...;
```

[The slide annotates the window width, the block height, and the wrapped boundaries.]

SLIDE 13

Case Study – The Game of Life (Multi-GPU)

[Diagram: the grid is partitioned across GPUs 1–4; boundary exchanges occur between neighboring partitions.]

SLIDE 14

Case Study – The Game of Life (Multi-GPU)

[Diagram: the grid is split across GPUs 1–4; per-GPU indices map back to node-memory indices, e.g.:]

    M1(0,0) = M[STRIDE]
    M1(x,-1) = M4(x, HEIGHT/4) = M[(HEIGHT-1)*STRIDE + x]
    M3(0,0) = M2(0, HEIGHT/4+1) = M[((HEIGHT/4)*2+1)*STRIDE]

SLIDE 15

Case Study – The Game of Life (Multi-GPU)

  • Three indexing systems:
    • Node memory
    • Per-GPU memory
    • Shared memory/registers
  • Error-prone:
    • Index-wise
    • Synchronization-wise
  • Difficult to debug/maintain
  • Many lines of code

SLIDE 16

Input Memory Access Patterns

  • Block (ND): Each thread requires an entire dimension of a buffer. Examples: matrix multiplication, exact N-body simulation, matrix transposition
  • Window (ND): Each thread-block requires a spatially-local N-dimensional window. Examples: ND convolution, Jacobi method, stencil operators
  • Adjacency: Sporadic access of a dense data structure with a fixed pattern. Examples: SpMV, cloth simulation
  • Traversal (BFS, DFS): Thread operates on neighbors of a vertex. Example: Barnes-Hut N-body algorithm
  • Permutation: Thread-block operates on a permutation of the original data. Example: fast Fourier transform
  • Irregular: Patterns that cannot be determined in advance. Example: finite state machines

SLIDE 17

Output Memory Access Patterns

  • Based on all possible mappings between the number of threads (n) and the number of outputs per buffer:
    • n → O(n): Structured Injective, Unstructured Injective
    • n → m < n: Reductive Static, Reductive Dynamic
    • n → Unpredictable: Irregular

SLIDE 18

MAPS-Multi

  • An automatic multi-GPU task partitioning framework:
    • Given the input/output access patterns of each task, it automatically segments and copies memory
    • Based on concepts from the Partitioned Global Address Space (PGAS) model
  • No source-to-source compilation or other intrusive actions
  • Header-only, standard C++11 (over CUDA) library
  • Can work in conjunction with other systems (e.g., MPI) and device-level libraries (e.g., CUBLAS)

SLIDE 19

Framework Components

[Diagram: Host-level infrastructure — a Scheduler (Segmenters, Allocator, Segment Location Monitor, Memory Analyzer) drives one invoker thread per GPU. A Task bundles a kernel with input containers (datum + access pattern) and output containers (datum + access pattern + aggregator), plus dimensions and constants. Device-level infrastructure — a multiple-device abstraction spans the thread/warp/block/device/multi-device levels; each GPU's global memory is wrapped by input and output controllers with shared containers, iterators and a device-level aggregator.]

SLIDE 20

Game of Life Code Sample

Host code:

```cpp
Scheduler sched;
typedef Window2D<T,1,WRAP,ILPX,ILPY> Win2D;
typedef StructuredInjective<T,2,ILPX,ILPY> SMat;

// Define data structures to be used
Matrix<T> A(width, height), B(width, height);

// Use existing host buffers as matrices
A.Bind(hbuffer_A);
B.Bind(hbuffer_B);

// Analyze memory access patterns for allocation
sched.AnalyzeCall(Win2D(A), SMat(B));
sched.AnalyzeCall(Win2D(B), SMat(A));

// Invoke the kernels
for (int i = 0; i < iterations; ++i)
    sched.Invoke(GameOfLifeTick,
                 Win2D((i % 2) ? B : A),
                 SMat((i % 2) ? A : B));

// Gather processed data back to host
if ((iterations % 2) == 0)
    sched.Gather(A);
else
    sched.Gather(B);
```

Device code:

```cpp
template <typename T, int ILPX, int ILPY>
__global__ void GameOfLifeTick MAPS_MULTIDEF(
    Window2D<T,1,WRAP,ILPX,ILPY> current_gen,
    StructuredInjective<T,2,ILPX,ILPY> next_gen)
{
    MAPS_MULTI_INIT(current_gen, next_gen);

    #pragma unroll
    MAPS_FOREACH(nextgen_iter, next_gen)
    {
        int live_neighbors = 0, is_live = 0;

        #pragma unroll
        MAPS_FOREACH_ALIGNED(iter, current_gen, nextgen_iter)
        {
            // Set variables according to the rules
            if (iter.index() == 4)
                is_live = *iter;
            else
                live_neighbors += *iter;
        }

        int result = GameOfLifeConditions(...);
        *nextgen_iter = result;
    }

    next_gen.commit();
}
```

SLIDE 21–29

Code Sample – Host

These slides animate the host code from Slide 20 step by step, with diagrams of the host buffers and GPUs 1–4:

  • Matrices A and B are declared and bound to the existing host buffers hbuffer_A and hbuffer_B
  • AnalyzeCall determines the per-GPU allocation sizes for A and B
  • Scatter: segments of hbuffer_A and hbuffer_B are copied to the per-GPU A and B buffers
  • Exchange: boundary segments are exchanged between neighboring GPUs
  • Gather: the per-GPU results are copied back to the host buffer

SLIDE 30–36

Code Sample – Device

These slides animate the device code from Slide 20:

  • The output grid is partitioned into thread blocks, and the blocks are distributed across GPUs (the diagram shows a grid of blocks split between GPU 1 and GPU 2)
  • Each block performs coalesced global-to-shared-memory/register loads of its window (window width × block height)
  • Each thread (e.g., thread (1,1)) then iterates over its window in turn to compute its output cells

SLIDE 37

Performance

  • MAPS-Multi was tested on four different GPUs:
    • GTX 780 (Kepler), Titan Black (Kepler), Tesla K40m (Kepler), GTX 980 (Maxwell)
  • Near-linear scaling, up to 3.94× on 4 GPUs

[Chart: speedup with 1–4 GPUs on Game of Life, Histogram and SGEMM, per GPU model (780, Titan, K40, 980).]

SLIDE 38

Performance vs. Optimized GPU Libraries

  • Performance is on par with production-level libraries
  • Comparison with NVIDIA's CUBLAS-XT (left, matrix multiplication) and CUB (right, histogram):

[Charts: left – matrix-multiplication speedup with 1–4 GPUs for CUBLAS-XT vs. MAPS-Multi on GTX 780, Titan Black and GTX 980; right – histogram throughput in images/s for Naive, CUB and MAPS on the same GPUs. All multi-GPU results run over MAPS-Multi.]

SLIDE 39

Real-World Applications – Deep Learning

  • Multi-GPU nodes are the platform of choice for deep neural network training
  • Training is a complex, multi-stage iterative process with many access patterns
  • Two leading deep learning frameworks utilize GPUs by default: Caffe and Torch
    • Caffe has its own CPU/GPU memory manager (single GPU only)
    • Torch uses a scripting engine; multi-GPU partitioning must be specified by users
    • Both comprise large codebases
SLIDE 40

[Diagram: a convolutional network for images (4D multi-convolution, pooling and fully-connected layers) partitioned across GPUs 1–3; forward propagation computes g, backpropagation computes its gradient.]

SLIDE 41

Deep Learning Performance

  • Deep learning was implemented to train a digit classifier (LeNet)
  • The implementation is compared with the Caffe and Torch frameworks:

[Chart: trained images per second for Caffe, Torch and MAPS-Multi on GTX 780, Titan Black and GTX 980, with 1, 2 and 4 GPUs.]

SLIDE 42

Non-Negative Matrix Factorization

  • Used for dimensionality reduction: finds matrices W, H s.t. V ≈ WH
  • NMF-mGPU* is a multi-GPU implementation of NMF
    • ~15,000 lines of code over multiple files
    • Specifically tailored for nodes with multiple Kepler GPUs
    • Uses MPI for inter-GPU communication
  • The MAPS-Multi version consists of 870 lines of code

[Diagram: the iteration alternates "Update H" and "Update W" via multiplicative update rules; the slide annotates the steps with the access patterns used: Window2D; Block2D-T, Block1D; Block2D-T, Block2D; Block2D-T.]

* E. Mejía-Roa, D. Tabas-Madrid, J. Setoain, C. García, F. Tirado, and A. Pascual-Montano. NMF-mGPU: non-negative matrix factorization on multi-GPU systems. BMC Bioinformatics, 2015.

SLIDE 43

NMF Performance

[Charts: throughput in iterations/s vs. number of GPUs (1–4) for NMF-mGPU and MAPS-Multi on GTX 780, Titan Black and GTX 980.]

SLIDE 44

Conclusions

Memory access pattern specification:

  • Is a promising approach for ease of programming and improved performance
  • Can dramatically reduce per-architecture GPU kernel tuning
  • Alleviates manual device management
  • Is valid for many parallel algorithms
  • Surpasses the performance of existing multi-GPU implementations
SLIDE 45

Future Research

  • Explore additional memory access patterns
  • Implications for power efficiency and resilience
  • Other architectures (e.g., multi-core CPUs) and clusters
  • Compiler support for simplification, optimizations and automatic analysis
  • Kernel workload estimation (for reordering and pipeline optimizations)
SLIDE 46

Thank You

Questions?

MAPS-Multi is open-source (New BSD license). The code is provided at: http://www.cs.huji.ac.il/project/maps/

This research was supported by the Ministry of Science and Technology, Israel and by the German Research Foundation (DFG) Priority Program “Software for Exascale Computing” (SPP-EXA), research project FFMK.