Memory Access Patterns: The Missing Piece of the Multi-GPU Puzzle - PowerPoint PPT Presentation

SLIDE 1

Memory Access Patterns: The Missing Piece of the Multi-GPU Puzzle

Tal Ben-Nun, Ely Levy, Amnon Barak and Eri Rubin

The Hebrew University of Jerusalem, Israel

Supercomputing ’15, November 2015

SLIDE 2

Introduction

  • Developing efficient parallel algorithms for GPUs is challenging
  • Memory I/O recurs as the bottleneck
  • Code clutter is caused by device and memory management
  • Current programming models for multi-GPU nodes are often:
    • Insufficient for specific programming needs
    • Overly complex
    • Hard to debug

[Diagram: a host node with CPU 1, CPU 2 and RAM, connected over PCI-Express to GPUs 1–4.]

SLIDE 3

Case Study – The Game of Life

  • Famous cellular automaton
  • Each cell requires its 3×3 neighborhood to compute the next generation
  • On GPUs, each thread computes one (or several) cells
  • Similar to stencil operators, the Jacobi method and many more

[Figure: an input grid and the output grid after one generation.]

SLIDE 12

Case Study – The Game of Life (Single GPU)

GPU 1 thread pseudocode:

```cpp
// Load data to shared memory
smem[tidy * BW + tidx] =
    M[(bidy * BH + tidy) * STRIDE + (bidx * BW + tidx)];
// Wrap coords
// ...
__syncthreads();

neighbors = 0;
current_gen = smem[tidy * BW + tidx];
for (int ly = -1; ly <= 1; ++ly) {
    for (int lx = -1; lx <= 1; ++lx) {
        if (lx == 0 && ly == 0) continue;
        neighbors += smem[(ly + tidy) * BW + (lx + tidx)];
    }
}

// Write the result to the output matrix with the same indexing
OutM[(bidy * BH + tidy) * STRIDE + (bidx * BW + tidx)] = ...;
```

[The slide annotates the window width, the block height, and the wrapped boundaries.]

SLIDE 13

Case Study – The Game of Life (Multi-GPU)

[Diagram: the grid is partitioned across GPUs 1–4; boundary exchanges occur between neighboring partitions.]

SLIDE 14

Case Study – The Game of Life (Multi-GPU)

[Diagram: the grid is split across GPUs 1–4; per-GPU indices map back to node-memory indices, e.g.:]

    M1(0,0) = M[STRIDE]
    M1(x,-1) = M4(x, HEIGHT/4) = M[(HEIGHT-1)*STRIDE + x]
    M3(0,0) = M2(0, HEIGHT/4+1) = M[((HEIGHT/4)*2+1)*STRIDE]

SLIDE 15

Case Study – The Game of Life (Multi-GPU)

  • Three indexing systems:
    • Node memory
    • Per-GPU memory
    • Shared memory/registers
  • Error-prone:
    • Index-wise
    • Synchronization-wise
  • Difficult to debug/maintain
  • Many lines of code

SLIDE 16

Input Memory Access Patterns

  • Block (ND): Each thread requires an entire dimension of a buffer. Examples: matrix multiplication, exact N-body simulation, matrix transposition
  • Window (ND): Each thread-block requires a spatially-local N-dimensional window. Examples: ND convolution, Jacobi method, stencil operators
  • Adjacency: Sporadic access of a dense data structure with a fixed pattern. Examples: SpMV, cloth simulation
  • Traversal (BFS, DFS): Thread operates on neighbors of a vertex. Example: Barnes-Hut N-body algorithm
  • Permutation: Thread-block operates on a permutation of the original data. Example: fast Fourier transform
  • Irregular: Patterns that cannot be determined in advance. Example: finite state machines

SLIDE 17

Output Memory Access Patterns

  • Based on all possible mappings between the number of threads (n) and the number of outputs per buffer:
    • n → O(n): Structured Injective, Unstructured Injective
    • n → m < n: Reductive Static, Reductive Dynamic
    • n → Unpredictable: Irregular

SLIDE 18

MAPS-Multi

  • An automatic multi-GPU task partitioning framework:
    • Given the input/output access patterns of each task, it automatically segments and copies memory
    • Based on concepts from the Partitioned Global Address Space (PGAS) model
  • No source-to-source compilation or other intrusive actions
  • Header-only, standard C++11 (over CUDA) library
  • Can work in conjunction with other systems (e.g., MPI) and device-level libraries (e.g., CUBLAS)

SLIDE 19

Framework Components

[Diagram: Host-level infrastructure — a Scheduler (Segmenters, Allocator, Segment Location Monitor, Memory Analyzer) drives one invoker thread per GPU. A Task bundles a kernel with input containers (datum + access pattern) and output containers (datum + access pattern + aggregator), plus dimensions and constants. Device-level infrastructure — a multiple-device abstraction spans the thread/warp/block/device/multi-device levels; each GPU's global memory is wrapped by input and output controllers with shared containers, iterators and a device-level aggregator.]

SLIDE 20

Game of Life Code Sample

Host code:

```cpp
Scheduler sched;
typedef Window2D<T,1,WRAP,ILPX,ILPY> Win2D;
typedef StructuredInjective<T,2,ILPX,ILPY> SMat;

// Define data structures to be used
Matrix<T> A(width, height), B(width, height);

// Use existing host buffers as matrices
A.Bind(hbuffer_A);
B.Bind(hbuffer_B);

// Analyze memory access patterns for allocation
sched.AnalyzeCall(Win2D(A), SMat(B));
sched.AnalyzeCall(Win2D(B), SMat(A));

// Invoke the kernels
for (int i = 0; i < iterations; ++i)
    sched.Invoke(GameOfLifeTick,
                 Win2D((i % 2) ? B : A),
                 SMat((i % 2) ? A : B));

// Gather processed data back to host
if ((iterations % 2) == 0)
    sched.Gather(A);
else
    sched.Gather(B);
```

Device code:

```cpp
template <typename T, int ILPX, int ILPY>
__global__ void GameOfLifeTick MAPS_MULTIDEF(
    Window2D<T,1,WRAP,ILPX,ILPY> current_gen,
    StructuredInjective<T,2,ILPX,ILPY> next_gen)
{
    MAPS_MULTI_INIT(current_gen, next_gen);

    #pragma unroll
    MAPS_FOREACH(nextgen_iter, next_gen)
    {
        int live_neighbors = 0, is_live = 0;

        #pragma unroll
        MAPS_FOREACH_ALIGNED(iter, current_gen, nextgen_iter)
        {
            // Set variables according to the rules
            if (iter.index() == 4)
                is_live = *iter;
            else
                live_neighbors += *iter;
        }

        int result = GameOfLifeConditions(...);
        *nextgen_iter = result;
    }

    next_gen.commit();
}
```

SLIDE 21–29

Code Sample – Host

These slides animate the host code from Slide 20 step by step, with diagrams of the host buffers and GPUs 1–4:

  • Matrices A and B are declared and bound to the existing host buffers hbuffer_A and hbuffer_B
  • AnalyzeCall determines the per-GPU allocation sizes for A and B
  • Scatter: segments of hbuffer_A and hbuffer_B are copied to the per-GPU A and B buffers
  • Exchange: boundary segments are exchanged between neighboring GPUs
  • Gather: the per-GPU results are copied back to the host buffer

SLIDE 30–36

Code Sample – Device

These slides animate the device code from Slide 20:

  • The output grid is partitioned into thread blocks, and the blocks are distributed across GPUs (the diagram shows a grid of blocks split between GPU 1 and GPU 2)
  • Each block performs coalesced global-to-shared-memory/register loads of its window (window width × block height)
  • Each thread (e.g., thread (1,1)) then iterates over its window in turn to compute its output cells

SLIDE 37

Performance

  • MAPS-Multi was tested on four different GPUs:
    • GTX 780 (Kepler), Titan Black (Kepler), Tesla K40m (Kepler), GTX 980 (Maxwell)
  • Near-linear scaling, up to 3.94× on 4 GPUs

[Chart: speedup with 1–4 GPUs on Game of Life, Histogram and SGEMM, per GPU model (780, Titan, K40, 980).]

SLIDE 38

Performance vs. Optimized GPU Libraries

  • Performance is on par with production-level libraries
  • Comparison with NVIDIA's CUBLAS-XT (left, matrix multiplication) and CUB (right, histogram):

[Charts: left – matrix-multiplication speedup with 1–4 GPUs for CUBLAS-XT vs. MAPS-Multi on GTX 780, Titan Black and GTX 980; right – histogram throughput in images/s for Naive, CUB and MAPS on the same GPUs. All multi-GPU results run over MAPS-Multi.]

SLIDE 39

Real-World Applications – Deep Learning

  • Multi-GPU nodes are the platform of choice for deep neural network training
  • Training is a complex, multi-stage iterative process with many access patterns
  • Two leading deep learning frameworks utilize GPUs by default: Caffe and Torch
    • Caffe has its own CPU/GPU memory manager (single GPU only)
    • Torch uses a scripting engine; multi-GPU partitioning must be specified by users
    • Both comprise large codebases
SLIDE 40

[Diagram: a convolutional network for images (4D multi-convolution, pooling and fully-connected layers) partitioned across GPUs 1–3; forward propagation computes g, backpropagation computes its gradient.]

SLIDE 41

Deep Learning Performance

  • Deep learning was implemented to train a digit classifier (LeNet)
  • The implementation is compared with the Caffe and Torch frameworks:

[Chart: trained images per second for Caffe, Torch and MAPS-Multi on GTX 780, Titan Black and GTX 980, with 1, 2 and 4 GPUs.]

SLIDE 42

Non-Negative Matrix Factorization

  • Used for dimensionality reduction: finds matrices W, H s.t. V ≈ WH
  • NMF-mGPU* is a multi-GPU implementation of NMF
    • ~15,000 lines of code over multiple files
    • Specifically tailored for nodes with multiple Kepler GPUs
    • Uses MPI for inter-GPU communication
  • The MAPS-Multi version consists of 870 lines of code

[Diagram: the iteration alternates "Update H" and "Update W" via multiplicative update rules; the slide annotates the steps with the access patterns used: Window2D; Block2D-T, Block1D; Block2D-T, Block2D; Block2D-T.]

* E. Mejía-Roa, D. Tabas-Madrid, J. Setoain, C. García, F. Tirado, and A. Pascual-Montano. NMF-mGPU: non-negative matrix factorization on multi-GPU systems. BMC Bioinformatics, 2015.

SLIDE 43

NMF Performance

[Charts: throughput in iterations/s vs. number of GPUs (1–4) for NMF-mGPU and MAPS-Multi on GTX 780, Titan Black and GTX 980.]

SLIDE 44

Conclusions

Memory access pattern specification:

  • Is a promising approach for ease of programming and improved performance
  • Can dramatically reduce per-architecture GPU kernel tuning
  • Alleviates manual device management
  • Is valid for many parallel algorithms
  • Surpasses the performance of existing multi-GPU implementations
SLIDE 45

Future Research

  • Explore additional memory access patterns
  • Implications for power efficiency and resilience
  • Other architectures (e.g., multi-core CPUs) and clusters
  • Compiler support for simplification, optimizations and automatic analysis
  • Kernel workload estimation (for reordering and pipeline optimizations)
SLIDE 46

Thank You

Questions?

MAPS-Multi is open-source (New BSD license). The code is provided at: http://www.cs.huji.ac.il/project/maps/

This research was supported by the Ministry of Science and Technology, Israel and by the German Research Foundation (DFG) Priority Program “Software for Exascale Computing” (SPP-EXA), research project FFMK.