SLIDE 1

EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY

Nikolay Sakharnykh, 3/27/2018

SLIDE 2

SINGLE POINTER

CPU vs GPU

CPU code:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    cpu_func2(data, N);
    cpu_func3(data, N);
    free(data);

GPU code w/ Unified Memory:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);

SLIDE 3

SINGLE POINTER

Explicit vs Unified Memory

Explicit Memory Management:

    void *data, *d_data;
    data = malloc(N);
    cudaMalloc(&d_data, N);
    cpu_func1(data, N);
    cudaMemcpy(d_data, data, N, ...);
    gpu_func2<<<...>>>(d_data, N);
    cudaMemcpy(data, d_data, N, ...);
    cudaFree(d_data);
    cpu_func3(data, N);
    free(data);

GPU code w/ Unified Memory:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);

SLIDE 4

SINGLE POINTER

Full Control with Prefetching

Explicit Memory Management:

    void *data, *d_data;
    data = malloc(N);
    cudaMalloc(&d_data, N);
    cpu_func1(data, N);
    cudaMemcpy(d_data, data, N, ...);
    gpu_func2<<<...>>>(d_data, N);
    cudaMemcpy(data, d_data, N, ...);
    cudaFree(d_data);
    cpu_func3(data, N);
    free(data);

Unified Memory + Prefetching:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    cudaMemPrefetchAsync(data, N, GPU);
    gpu_func2<<<...>>>(data, N);
    cudaMemPrefetchAsync(data, N, CPU);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);

SLIDE 5

SINGLE POINTER

Deep Copy

Explicit Memory Management:

    char **data;
    // allocate and initialize data on the CPU
    char **d_data;
    char **h_data = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++) {
      cudaMalloc(&h_data[i], N);
      cudaMemcpy(h_data[i], data[i], N, ...);
    }
    cudaMalloc(&d_data, N*sizeof(char*));
    cudaMemcpy(d_data, h_data, N*sizeof(char*), ...);
    gpu_func<<<...>>>(d_data, N);

GPU code w/ Unified Memory:

    char **data;
    // allocate and initialize data on the CPU
    gpu_func<<<...>>>(data, N);
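A minimal, self-contained sketch of the Unified Memory deep-copy idea above, written so it works on any Pascal+ system today: the nested allocations come from cudaMallocManaged, so the same pointer tree is valid on both CPU and GPU without any manual deep copy. The kernel body and sizes are placeholders, not from the slides.

    #include <cuda_runtime.h>

    __global__ void gpu_func(char **data, int N) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) data[i][0] = 42;                     // dereference the nested pointer directly
    }

    int main() {
      int N = 1024;
      char **data;
      cudaMallocManaged(&data, N * sizeof(char*));    // outer pointer array is managed
      for (int i = 0; i < N; i++)
        cudaMallocManaged(&data[i], N);               // each row is managed too
      gpu_func<<<(N + 255) / 256, 256>>>(data, N);    // same pointers are valid on the GPU
      cudaDeviceSynchronize();
      for (int i = 0; i < N; i++) cudaFree(data[i]);
      cudaFree(data);
      return 0;
    }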

SLIDE 6

UNIFIED MEMORY BASICS

[Diagram: GPU A and GPU B with page1 and page2; a single virtual memory is shared between processors]

SLIDE 7

UNIFIED MEMORY BASICS

[Diagram: GPU A and GPU B each have their own page table and physical memory backing the single virtual memory shared between processors (page1, page2)]

SLIDE 8

UNIFIED MEMORY BASICS

[Diagram: *addr1 = 1 is a local access because page1 is already mapped and resident locally; *addr3 = 1 page faults because page3 is not populated and mapped for the accessing processor]

SLIDE 9

UNIFIED MEMORY BASICS

[Diagram: page3 is populated and mapped into B's memory; the faulting access *addr3 = 1 is replayed]

SLIDE 10

UNIFIED MEMORY BASICS

[Diagram: writes *addr2 = 1 and *addr3 = 1 both page fault because page2 and page3 are not mapped in the accessing processor's page table]

SLIDE 11

UNIFIED MEMORY BASICS

[Diagram: to service the faults on *addr2 and *addr3, page2 and page3 are unmapped from B's memory]

SLIDE 12

UNIFIED MEMORY BASICS

[Diagram: the pages' data is migrated to A's physical memory]

SLIDE 13

UNIFIED MEMORY BASICS

[Diagram: the writes *addr2 = 1 and *addr3 = 1 are replayed and now complete locally on A]
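The fault-and-migrate sequence walked through above is exactly what an ordinary managed allocation triggers. A minimal example (not from the slides) that exercises it: the CPU populates managed memory, a GPU kernel then touches the same pages and pulls them over on demand, and a later CPU read faults them back.

    #include <cuda_runtime.h>

    __global__ void increment(int *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] += 1;                        // first GPU touch of each page faults and migrates it
    }

    int main() {
      int n = 1 << 20;
      int *data;
      cudaMallocManaged(&data, n * sizeof(int));
      for (int i = 0; i < n; i++) data[i] = i;        // pages are populated in CPU memory
      increment<<<(n + 255) / 256, 256>>>(data, n);   // GPU page faults, pages migrate to the GPU
      cudaDeviceSynchronize();
      int first = data[0];                            // CPU touch faults the page back to the CPU
      cudaFree(data);
      return first == 1 ? 0 : 1;
    }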

SLIDE 14

MEMORY OVERSUBSCRIPTION

[Diagram: A's physical memory is full; a write *addr5 = 1 page faults because page5 is not resident on A]

SLIDE 15

MEMORY OVERSUBSCRIPTION

[Diagram: to make room, page4 is unmapped from A's memory and migrated (evicted)]

SLIDE 16

MEMORY OVERSUBSCRIPTION

[Diagram: page4 is mapped in B's memory; page5 is unmapped from B and migrated to A]

SLIDE 17

MEMORY OVERSUBSCRIPTION

[Diagram: the access *addr5 = 1 is replayed and completes on A]
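The eviction-on-fault behavior shown above is what makes oversubscription transparent in code. A rough sketch (sizes are placeholders, not from the slides): the managed allocation below can exceed the GPU's capacity, and the driver migrates pages in and evicts older ones as the kernel strides through the buffer.

    #include <cuda_runtime.h>

    __global__ void touch(char *data, size_t n) {
      size_t stride = (size_t)gridDim.x * blockDim.x;
      for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = 1;                       // faults pull pages in; older pages are evicted once GPU memory is full
    }

    int main() {
      size_t n = 32ull << 30;              // e.g. 32 GB on a 16 GB GPU (placeholder size)
      char *data;
      if (cudaMallocManaged(&data, n) != cudaSuccess) return 1;
      touch<<<1024, 256>>>(data, n);       // grid-stride loop covers the whole oversubscribed buffer
      cudaDeviceSynchronize();
      cudaFree(data);
      return 0;
    }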

SLIDE 18

SIMPLIFYING DL FRAMEWORK DESIGN

  • Eliminated 3,000 lines of repetitive and error-prone code in Caffe
  • Developers can add new inherited Layer classes in a much simpler manner
  • The final call to a CPU function or a GPU kernel (caffe_gpu_gemm) still needs to be explicit

Existing Design:

    class ConvolutionLayer {
    public:
      void cpu_data();            void gpu_data();
      void cpu_diff();            void gpu_diff();
      void mutable_cpu_data();    void mutable_gpu_data();
      void mutable_cpu_diff();    void mutable_gpu_diff();
      void Forward_cpu();         void Forward_gpu();
      void forward_cpu_gemm();    void forward_gpu_gemm();
      void forward_cpu_bias();    void forward_gpu_bias();
      void Backward_cpu();        void Backward_gpu();
      void backward_cpu_gemm();   void backward_gpu_gemm();
      void backward_cpu_bias();   void backward_gpu_bias();
    };

Unified Memory Design:

    class ConvolutionLayer {
    public:
      void data();
      void diff();
      void mutable_data();
      void mutable_diff();
      void Forward();
      void forward_gemm();
      void forward_bias();
      void Backward();
      void backward_gemm();
      void backward_bias();
    };

  • A. A. Awan, C-H Chu, H. Subramoni, X. Lu, D.K. Panda, "OC-DNN: Designing Out-of-Core Deep Neural Network Training by Exploiting Unified Memory on Pascal and Volta GPUs", <double-blind submission under review>

SLIDE 19

CAN THIS DESIGN OFFER GOOD PERF?

DL training with Unified Memory

OC-Caffe will be released by the HiDL Team@OSU: hidl.cse.ohio-state.edu, mvapich.cse.ohio-state.edu

[Chart: ResNet-50 training on 1xV100 (16GB), images/sec at batch sizes 40 and 45, comparing in-memory Caffe (BVLC) with out-of-core OC-Caffe]

[Chart: VGG19 training on 1xV100 (16GB), images/sec at batch sizes 110 and 120, comparing in-memory Caffe (BVLC) with out-of-core OC-Caffe]

SLIDE 20

CONCURRENT ACCESS

[Diagram: page1, page2, and page3 spread across A's and B's physical memory, with entries in A's and B's page tables]

SLIDE 21

CONCURRENT ACCESS

Exclusive Access*

[Diagram: both processors issue atomicAdd_system(addr2, 1); the processor where page2 is resident performs a local access, the other page faults]

*This is a possible implementation; to guarantee this behavior you need to use cudaMemAdvise policies.

SLIDE 22

CONCURRENT ACCESS

Exclusive Access

[Diagram: the faulting atomicAdd_system(addr2, 1) causes page2 to be unmapped from B's memory and migrated to A]

SLIDE 23

CONCURRENT ACCESS

Exclusive Access

[Diagram: the atomicAdd_system(addr2, 1) is replayed as a local access on A]

SLIDE 24

CONCURRENT ACCESS

Atomics over NVLINK*

[Diagram: one processor performs atomicAdd_system(addr2, 1) as a remote access over NVLINK while the other performs it as a local access; no migration is needed]

*Both processors need to support atomic operations.

SLIDE 25

CONCURRENT ACCESS

Read duplication*

[Diagram: page2 is duplicated in both A's and B's physical memory; val = *addr2 is a local access on both processors]

*Each processor must maintain its own page table.

SLIDE 26

CONCURRENT ACCESS

Read duplication: write

[Diagram: *addr2 = val2 is a local access on the writing processor; a write collapses all copies into one]

SLIDE 27

CONCURRENT ACCESS

Read duplication: read after write

[Diagram: val = *addr2 is again a local access; pages are duplicated again on faults]

SLIDE 28

ANALYTICS USE CASE

Design of a Concurrent Hybrid Hash Table

[Diagram: GPU 0, GPU 1, and multiple CPU cores share a hash table implemented via Unified Memory; the CPU cores perform concurrent inserts and fetches with non-blocking updates using atomic compare-and-swap, and each GPU performs concurrent fetches]

S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0 – Thu 11:00 Room 210F
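As a rough illustration of the non-blocking updates described above (not the OSU implementation; HashSlot, EMPTY_KEY, and hash_insert are hypothetical names), a managed-memory hash table can be updated with a system-wide compare-and-swap so CPU threads and GPU kernels can insert into the same table concurrently. The table itself would come from cudaMallocManaged so one pointer is valid everywhere.

    #include <cuda_runtime.h>

    struct HashSlot { unsigned long long key; unsigned long long value; };
    #define EMPTY_KEY 0ULL

    __device__ bool hash_insert(HashSlot *table, int capacity,
                                unsigned long long key, unsigned long long value) {
      int slot = key % capacity;
      for (int probe = 0; probe < capacity; ++probe) {
        // Claim the slot only if it is still empty; atomicCAS_system (CC 6.0+)
        // makes the update visible to CPU threads using the same managed table.
        unsigned long long prev = atomicCAS_system(&table[slot].key, EMPTY_KEY, key);
        if (prev == EMPTY_KEY || prev == key) {
          table[slot].value = value;
          return true;
        }
        slot = (slot + 1) % capacity;        // linear probing on collision
      }
      return false;                          // table full
    }

    __global__ void insert_kernel(HashSlot *table, int capacity,
                                  const unsigned long long *keys, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) hash_insert(table, capacity, keys[i], /*value=*/i);
    }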

SLIDE 29

ANALYTICS USE CASE

Concurrent Access To Hash Table

[Diagram: on x86, a GPU SM accesses a hash entry that resides in SYSMEM and takes a page fault]

SLIDE 30

ANALYTICS USE CASE

Concurrent Access To Hash Table

[Diagram: on x86, the access triggers a page fault, the page is migrated from SYSMEM into GPU HBM, and the access is replayed]

SLIDE 31

ANALYTICS USE CASE

Concurrent Access To Hash Table

[Diagram: on x86, every remote update goes through page fault, page migration, and access replay; on POWER9, the P9 CPU can directly update the hash entry in GPU memory through its cache, with no page faults or migrations]

S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0 – Thu 11:00 Room 210F

SLIDE 32

UNIFIED MEMORY + DGX-2

UNIFIED MEMORY PROVIDES

  • Single memory view shared by all GPUs
  • Automatic migration of data between GPUs
  • User control of data locality

[Diagram: 16 GPUs (GPU 0 through GPU 15) sharing 512 GB of Unified Memory]

SLIDE 33

ENABLING MULTI-GPU

Single-GPU:

    __global__ void kernel(int *data) {
      int idx = threadIdx.x + blockDim.x * blockIdx.x;
      doSomeStuff(idx, data, ...);
    }

    cudaMallocManaged(&data, N * sizeof(int));
    // initialize data on the CPU
    kernel<<<grid, block>>>(data);

Multi-GPU (update the launch config and the blockIdx.x computation):

    __global__ void kernel(int *data, int gpuId) {
      int idx = threadIdx.x + blockDim.x * (blockIdx.x + gpuId * gridDim.x);
      doSomeStuff(idx, data, ...);
    }

    cudaMallocManaged(&data, N * sizeof(int));
    // initialize data on the CPU
    for (int i = 0; i < numGPUs; i++) {
      cudaSetDevice(i);
      kernel<<<grid/numGPUs, block>>>(data, i);
    }

SLIDE 34

MULTI-GPU WITH UNIFIED MEMORY

[Diagram: GPUs 0-3 and SYSMEM; GPU kernels initiate migrations]

SLIDE 35

MULTI-GPU WITH UNIFIED MEMORY

[Diagram: GPUs 0-3 and SYSMEM; data is automatically partitioned between GPUs on first touch]

SLIDE 36

MULTI-GPU WITH UNIFIED MEMORY

[Diagram: GPUs 0-3 and SYSMEM; with policies, remote data can be accessed directly without migrations]

SLIDE 37

MULTI-GPU WITH UNIFIED MEMORY

[Diagram: GPUs 0-3 and SYSMEM; with policies, read-only data can be duplicated and accessed locally]

SLIDE 38

GPU ARCHITECTURE AND SOFTWARE EVOLUTION

SLIDE 39

UNIFIED MEMORY

Evolution of GPU architectures

  • Kepler (2012): release of the new "single-pointer" programming model
  • Maxwell (2014): no new features related to Unified Memory
  • Pascal (2016, NVLINK1): on-demand migration, concurrent access and atomics, oversubscription
  • Volta (2017, NVLINK2): access counters; P9 only: hardware coherency, ATS support

*Not all features are available on all platforms

SLIDE 40

UNIFIED MEMORY: BEFORE PASCAL

Available since CUDA 6

  • No GPU page fault support: all dirty pages are moved on kernel launch
  • No concurrent access, no GPU memory oversubscription, no system-wide atomics

[Diagram: on kernel launch, all dirty pages (page1-page3) are migrated to the GPU's physical memory]

SLIDE 41

UNIFIED MEMORY: PASCAL AND VOLTA

Available since CUDA 8

  • GPU page fault support, concurrent access, extended VA space (48-bit)
  • On-demand migration to the accessing processor on first touch

[Diagram: page1 and page2 are populated on demand in A's and B's physical memory and mapped by each processor's page table]

SLIDE 42

UNIFIED MEMORY ON VOLTA+P9

New Feature: Access Counters

If memory is mapped to the GPU, migration can be triggered by access counters.

[Diagram: access counters track remote accesses; one page sees few accesses while another sees many]

SLIDE 43

UNIFIED MEMORY ON VOLTA+P9

New Feature: Access Counters

With access counters, only hot pages will be moved to the GPU. Migrations are delayed compared to the fault-based method.

[Diagram: only the frequently accessed page is migrated to the GPU's physical memory]

*When implemented, this feature can be enabled with cudaMemAdvise policies.

SLIDE 44

UNIFIED MEMORY ON VOLTA+P9

New Feature: Hardware Coherency with NVLINK2

  • CPU can directly access and cache GPU memory
  • Native atomics support for all accessible memory

[Diagram: V100's and P9's page tables both map page1 (GPU memory) and page2 (system memory)]

SLIDE 45

UNIFIED MEMORY ON VOLTA+P9

New Feature: ATS support

ATS (Address Translation Service): the CPU and GPU can share a single page table.

[Diagram: V100 and P9 use one shared ATS page table mapping pages in GPU and CPU memory]

SLIDE 46

UNIFIED MEMORY WITH SYSTEM ALLOCATOR

System allocator support allows the GPU to access all system memory: malloc, stack, global variables, file system.

  • P9: Address Translation Service (ATS) support enabled in CUDA 9.2
  • x86: Heterogeneous Memory Management (HMM); the initial version of the patchset is integrated into the 4.14 kernel, and NVIDIA will be supporting upcoming versions of HMM

https://lkml.org/lkml/2017/6/23/443

SLIDE 47

WHAT YOU CAN DO WITH UNIFIED MEMORY

See it in action at the end of the talk!

Works everywhere today:

    int *data;
    cudaMallocManaged(&data, sizeof(int) * n);
    kernel<<<grid, block>>>(data);

Works on Power9 + ATS in CUDA 9.2, and will work in the future on x86 + HMM:

    int *data = (int*)malloc(sizeof(int) * n);
    kernel<<<grid, block>>>(data);

    int data[1024];
    kernel<<<grid, block>>>(data);

    int *data = (int*)alloca(sizeof(int) * n);
    kernel<<<grid, block>>>(data);

    extern int *data;
    kernel<<<grid, block>>>(data);

SLIDE 48

UNIFIED MEMORY LANGUAGES

  • CUDA C/C++: cudaMallocManaged
  • CUDA Fortran: managed attribute (per allocation)
  • Python: pycuda.driver.managed_empty (allocates a numpy.ndarray)
  • OpenACC: -ta=managed compiler option (all dynamic allocations)

SLIDE 49

UNIFIED MEMORY + OPENACC

Effortless way to run your code on GPUs

Literally adding a single line will get your code running on the GPU. It is easy to optimize later: add loop and data directives.

Initiate parallel execution:

    #pragma acc kernels
    {
      for (i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
        ...
      }
    }
    ...
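As a rough, complete illustration of the pattern above (not taken from the slides): a vector add in C with a single OpenACC directive. Compiled with the PGI/NVHPC compiler and -ta=managed, the malloc'd buffers become managed allocations, so no data directives are needed; the array size and values are placeholders.

    #include <stdlib.h>

    int main(void) {
      int n = 1 << 20;
      float *a = malloc(n * sizeof(float));
      float *b = malloc(n * sizeof(float));
      float *c = malloc(n * sizeof(float));
      for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

      // The only GPU-specific line: the compiler parallelizes the loop, and
      // with -ta=managed the heap allocations are Unified Memory.
      #pragma acc kernels
      for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

      free(a); free(b); free(c);
      return 0;
    }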

SLIDE 50

GYROKINETIC TOROIDAL CODE

Particle-In-Cell production code

http://phoenix.ps.uci.edu/gtc_group

[Chart: speed-up over CPU only for 2xV100 and 4xV100, each with Data Directives and with Unified Memory. CPU: Haswell E5-2698 v3 @ 2.30GHz, dual socket 16-core; MPI: MVAPICH-GDR]

SLIDE 51

PERFORMANCE DEEP DIVE

SLIDE 52

STREAMING BENCHMARK

How fast is on-demand migration?

[Chart: streaming performance in GB/s on x86 + P100 (PCIe), x86 + V100 (PCIe), and P8 + P100 (NVLINK), comparing on-demand migration, prefetch, and memcpy]

    __global__ void kernel(int *host, int *device) {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      device[i] = host[i];
    }

    // allocate and initialize memory
    cudaMallocManaged(&host, size);
    cudaMalloc(&device, size);
    memset(host, 0, size);

    // benchmark CPU->GPU migration
    if (prefetch)
      cudaMemPrefetchAsync(host, size, gpuId);
    kernel<<<grid, block>>>(host, device);

SLIDE 53

UNDERSTANDING PROFILER OUTPUT

    ==14487== Profiling result:
    Type             Time(%)  Time      Calls  Avg       Min       Max       Name
    GPU activities:  100.00%  23.270ms  1      23.270ms  23.270ms  23.270ms  void kernel(int*, int*)
    API calls:       79.56%   23.272ms  1      23.272ms  23.272ms  23.272ms  cudaDeviceSynchronize
                     20.42%   5.9732ms  1      5.9732ms  5.9732ms  5.9732ms  cudaLaunch
                     0.01%    2.0490us  1      2.0490us  2.0490us  2.0490us  cudaConfigureCall
                     0.01%    1.8360us  4      459ns     138ns     833ns     cudaSetupArgument

    ==14487== Unified Memory profiling result:
    Device "Tesla V100-PCIE-16GB (0)"
    Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    3012   21.758KB  4.0000KB  952.00KB  64.00000MB  13.49043ms  Host To Device
    81     -         -         -         -           23.23181ms  Gpu page fault groups

SLIDE 54

HEURISTIC PREFETCHING

Do Not Confuse with API Prefetching

  • The GPU architecture supports different page sizes
  • Contiguous pages up to a large page size are promoted to the larger size
  • The driver prefetches whole regions if pages are accessed densely

[Diagram: small 4KB faulting transfers from CPU to GPU cause the driver to prefetch the rest of the surrounding region, e.g. 952KB of a 1024KB region]

SLIDE 55

WHAT ARE PAGE FAULT GROUPS?

    ==14487== Unified Memory profiling result:
    Device "Tesla V100-PCIE-16GB (0)"
    Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    3012   21.758KB  4.0000KB  952.00KB  64.00000MB  13.49043ms  Host To Device
    81     -         -         -         -           23.23181ms  Gpu page fault groups

nvprof --print-gpu-trace ...

    Unified Memory  Virtual Address  Name
    8               0x3900010000     [Unified Memory GPU page faults]
    9               0x3900040000     [Unified Memory GPU page faults]
    5               0x3900108000     [Unified Memory GPU page faults]
    5               0x3900200000     [Unified Memory GPU page faults]

SLIDE 56

PAGE FAULTS HANDLING

[Diagram: with a thread/4B access pattern, warps 0-3 each touch a 128B segment of the same GPU page; the faulting warps trigger fault processing while the others stall, and once the page arrives the stalled warps are resumed and the faulting accesses are replayed]

SLIDE 57

OPTIMIZING ON-DEMAND MIGRATION

Increase fault concurrency to reduce page fault stalls.

[Diagram: thread/4B pattern: multiple faults per page, and warps are stalled on fault processing. warp/64KB pattern: each warp covers its own 64KB GPU page, so fewer warps are stalled and the "spread-out" pattern improves prefetching]
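A minimal sketch of the two access patterns compared above (not the exact benchmark code from the talk): in the thread/4B version consecutive threads touch consecutive 4-byte elements, so every warp in a block faults on the same page; in the warp/64KB version each warp walks its own 64KB chunk, spreading faults across pages and increasing fault concurrency.

    #include <cuda_runtime.h>

    __global__ void copy_thread_4B(const int *src, int *dst, size_t n) {
      size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
      if (i < n) dst[i] = src[i];                          // all 32 lanes of a warp hit one page
    }

    __global__ void copy_warp_64KB(const int *src, int *dst, size_t n) {
      const size_t CHUNK = 64 * 1024 / sizeof(int);        // 64KB of ints per warp
      size_t warpId = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) / 32;
      size_t lane   = threadIdx.x % 32;
      size_t base   = warpId * CHUNK;
      // lanes stride through the warp's private 64KB region
      for (size_t i = base + lane; i < base + CHUNK && i < n; i += 32)
        dst[i] = src[i];
      // launch with enough warps so that warpId * CHUNK covers the whole buffer
    }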

SLIDE 58

OPTIMIZING ON-DEMAND MIGRATION

Thread/4B:

    Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    3012   21.758KB  4.0000KB  952.00KB  64.00000MB  13.49043ms  Host To Device
    81     -         -         -         -           23.23181ms  Gpu page fault groups

    Unified Memory  Virtual Address  Name
    8               0x3900010000     [Unified Memory GPU page faults]
    9               0x3900040000     [Unified Memory GPU page faults]
    5               0x3900108000     [Unified Memory GPU page faults]

Warp/64KB (more efficient prefetching, fewer stalls):

    Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    957    68.481KB  4.0000KB  576.00KB  64.00000MB  8.242080ms  Host To Device
    6      -         -         -         -           9.769984ms  Gpu page fault groups

    Unified Memory  Virtual Address  Name
    1               0x39000d0000     [Unified Memory GPU page faults]
    1               0x39000c0000     [Unified Memory GPU page faults]
    1               0x3900080000     [Unified Memory GPU page faults]

SLIDE 59

STREAMING BENCHMARK

How fast is on-demand migration?

[Chart: streaming performance in GB/s on x86 + P100 (PCIe), x86 + V100 (PCIe), and P8 + P100 (NVLINK), comparing on-demand thread/4B, on-demand warp/64K, prefetch, and memcpy]

[Diagram: thread/4B pattern (warps 0 and 1 touch adjacent 128B segments) vs warp/64KB pattern (warps 0 and 1 each own a 64KB chunk)]

Also check the Parallel Forall blog: https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

SLIDE 60

GPU MEMORY OVERSUBSCRIPTION

Let's see how perf changes as we increase the working set.

[Chart: fault-based migration throughput (GB/s) vs working set (10-50 GB) on P8/NVLINK for thread/4B, warp/64K, thread/4B + readmostly, and warp/64K + readmostly, with the GPU memory limit marked]

[Chart: the same comparison on x86/PCIe, also with the GPU memory limit marked]

SLIDE 61

EVICTION ALGORITHM

What Pages Are Moving Out of the GPU

  • The driver keeps a list of physical chunks of GPU memory
  • Chunks from the front of the list are evicted first (LRU)
  • A chunk is considered "in use" when it is fully populated or migrated
  • Prefetching and policies may impact the eviction heuristic in the future

[Diagram: chunks are evicted from the front of the list; chunks enter at the back on allocation or migration]

SLIDE 62

PREFETCHING AND EVICTIONS

[Chart: migration throughput (GB/s) on x86 + V100 for on-demand thread/4B, on-demand warp/64K, and 1-way prefetch, each without and with eviction]

Prefetch can be overlapped with evictions without using CUDA streams (enabled in CUDA 9.1). A cudaMemcpy solution requires scheduling the DtoH and HtoD copies into two separate streams.

SLIDE 63

DRIVER ENHANCEMENTS

Improvements for the allocator

Tesla V100 + x86: HPGMG-AMR default (no hints)

[Chart: MDOF/s vs working set (1.4, 4.7, 8.6, 28.9, 58.6 GB) for drivers 384.81 (CUDA 8.0 GA2) and 387.34 (CUDA 9.0)]

SLIDE 64

DRIVER ENHANCEMENTS

Prefetch + Evictions Overlap

Tesla V100 + x86: HPGMG-AMR optimized (prefetches)

[Chart: MDOF/s vs working set (1.4, 4.7, 8.6, 28.9, 58.6 GB) for drivers 384.81 (CUDA 8.0 GA2), 387.34 (CUDA 9.0), and 390.30 (CUDA 9.1)]

SLIDE 65

LOCALITY AND ACCESS CONTROL

cudaMemAdvise

  • Default: data migrates on first touch
  • ReadMostly: data duplicated on first touch
  • PreferredLocation: resist migrating away from the preferred location
  • AccessedBy: establish direct mapping and avoid faults

[Diagram: GPU0/GPU1 pairs illustrating each policy; a sketch of how the policies map to API calls follows]
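A minimal sketch (one call per policy, just to show the API shape; in practice you pick only the hints that match your access pattern) of how the advice flags above are expressed on a managed allocation:

    #include <cuda_runtime.h>

    void apply_hints(char *data, size_t n, int gpuId) {
      // ReadMostly: duplicate read-only data on the processors that read it
      cudaMemAdvise(data, n, cudaMemAdviseSetReadMostly, gpuId);

      // PreferredLocation: resist migrating the data away from this GPU
      cudaMemAdvise(data, n, cudaMemAdviseSetPreferredLocation, gpuId);

      // AccessedBy: map the data for the CPU so its accesses do not fault
      cudaMemAdvise(data, n, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

      // Explicit prefetch: move the data to the GPU ahead of a kernel launch
      cudaMemPrefetchAsync(data, n, gpuId, 0);
    }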

SLIDE 66

LULESH

CORAL OpenACC app

    while not converged:
      LagrangeNodal                      // few heavy GPU kernels
        CalcForceForNodes
        CalcAccelerationForNodes
        BoundaryConditionsForNodes
        CalcVelocityForNodes
        CalcPositionForNodes
      LagrangeElements                   // many small GPU kernels
        CalcLagrangeElements
        CalcQForElems
        MaterialPropertiesForElems
      CalcTimeConstraintsForElems        // many small CPU functions
        CalcCourantConstraintForElems
        CalcHydroConstraintForElems

SLIDE 67

USING VISUAL PROFILER

[Profiler timeline: 1st iteration kernel time = 17ms, 2nd iteration kernel time = 27ms; why is the second iteration longer, and what is the extra activity on the timeline?]

SLIDE 68

THRASHING MITIGATION

Processors frequently read or write to the same page

Before CUDA 9.2: once memory is pinned we lose any insight into the access pattern. In the future we may use access counters on Volta to find a better location.

[Diagram: when memory thrashing between GPU and CPU is detected, the driver can pin the page and map it remotely, or throttle the faulting processor]

SLIDE 69

LULESH: PREFERRED LOCATION = GPU

[Profiler comparison: with no policies there are lots of back-and-forth migrations but no slowdown for the heavy kernels; with the preferred location set, the anti-thrashing heuristic kicks in and pins the page to the CPU]

SLIDE 70

LULESH: READ MOSTLY

No slowdown for the heavy kernels; almost no thrashing for the hybrid CPU/GPU part.

SLIDE 71

LULESH: PERF IMPROVEMENTS

[Chart: LULESH performance (FOM, z/s), 200^3 mesh, 100 iterations, CUDA 9.1, on Quadro GP100 and Quadro GV100 for Default, Preferred, ReadMostly, and Improved Default Heuristics]

GP100 has 500x more sysmem writes than GV100. The improved default heuristics will be enabled in future drivers.

SLIDE 72

WHEN TO USE UNIFIED MEMORY

Comparing a pinned cudaMalloc allocation against cudaMallocManaged with PreferredLocation(GPU), SetAccessedBy(peer GPUs), and cudaMemPrefetchAsync(GPU) (sketched in code after the table):

  • cudaMemcpy: ptrA -> ptrB
      - cudaMalloc: staging for non-pinned allocations or between non-P2P GPUs
      - cudaMallocManaged: staging or a copy kernel required in all cases
  • Memory migration
      - cudaMalloc: not possible
      - cudaMallocManaged: cudaMemPrefetchAsync
  • Debugging
      - cudaMalloc: difficult
      - cudaMallocManaged: easy
  • Oversubscription
      - cudaMalloc: no
      - cudaMallocManaged: yes
  • IPC support
      - cudaMalloc: yes
      - cudaMallocManaged: no
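A rough sketch of the cudaMallocManaged recipe from the right-hand side of the table; the helper name, numGpus, and the stream argument are assumptions for illustration, not from the slide:

    #include <cuda_runtime.h>

    void setup_managed_buffer(float **ptr, size_t bytes, int homeGpu, int numGpus,
                              cudaStream_t stream) {
      cudaMallocManaged((void**)ptr, bytes);

      // Keep the data resident on its home GPU...
      cudaMemAdvise(*ptr, bytes, cudaMemAdviseSetPreferredLocation, homeGpu);

      // ...but let every peer GPU map it directly instead of faulting and migrating.
      for (int d = 0; d < numGpus; d++)
        if (d != homeGpu)
          cudaMemAdvise(*ptr, bytes, cudaMemAdviseSetAccessedBy, d);

      // Move it to the home GPU up front, much like an explicit copy would.
      cudaMemPrefetchAsync(*ptr, bytes, homeGpu, stream);
    }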

SLIDE 73

WHEN TO USE UNIFIED MEMORY

[Diagram: application code split between sequential CPU code and parallel GPU code, with data allocated via cudaMalloc or cudaMallocManaged/malloc depending on where it is accessed]

SLIDE 74

SLIDE 75

UNIFIED MEMORY PLATFORMS

Linux + x86
  • Kepler: no GPU fault support, no concurrent access
  • Pascal: on-demand migration
  • Volta: on-demand migration

Linux + Power
  • Pascal: on-demand migration, 80GB/s CPU-GPU BW*
  • Volta: on-demand migration, 150GB/s CPU-GPU BW**, access counters, HW coherency, ATS support

Windows
  • No GPU fault support, no concurrent access

MacOS
  • No GPU fault support, no concurrent access

Tegra
  • Cached on CPU and iGPU, no concurrent access

*IBM Minsky: 4xP100 + 2xP8, 2xNVLINK1 links between P100 and P8, bi-directional aggregate BW
**IBM Newell: 4xV100 + 2xP9, 3xNVLINK2 links between V100 and P9, bi-directional aggregate BW

SLIDE 76

READ DUPLICATION

Usage example

    char *data;
    cudaMallocManaged(&data, N);

    init_data(data, N);                                  // CPU: init_data

    cudaMemAdvise(data, N, ..SetReadMostly, myGpuId);
    cudaMemPrefetchAsync(data, N, myGpuId, s);
    mykernel<<<..., s>>>(data, N);                       // GPU: mykernel

    use_data(data, N);                                   // CPU: use_data

    cudaFree(data);

  • The prefetch creates a copy instead of moving the data
  • Both processors can read the data simultaneously without faults
  • Writes will collapse all copies into one; subsequent reads will fault and duplicate

SLIDE 77

PREFERRED LOCATION

Resisting migrations

    char *data;
    cudaMallocManaged(&data, N);

    init_data(data, N);                                          // CPU: init_data

    cudaMemAdvise(data, N, ..PreferredLocation, cudaCpuDeviceId);
    mykernel<<<..., s>>>(data, N);                               // GPU: mykernel

    use_data(data, N);                                           // CPU: use_data

    cudaFree(data);

  • The kernel will page fault and generate a direct mapping to the data on the CPU
  • The driver will "resist" migrating data away from the preferred location

SLIDE 78

PREFERRED LOCATION

Page population on first touch

    char *data;
    cudaMallocManaged(&data, N);

    cudaMemAdvise(data, N, ..PreferredLocation, cudaCpuDeviceId);
    mykernel<<<..., s>>>(data, N);                               // GPU: mykernel

    use_data(data, N);                                           // CPU: use_data

    cudaFree(data);

  • The kernel will page fault, populate pages on the CPU, and generate a direct mapping to the data on the CPU
  • Pages are populated on the preferred location if the faulting processor can access it

SLIDE 79

PREFERRED LOCATION ON P9+V100

CPU can directly access GPU memory

    char *data;
    cudaMallocManaged(&data, N);

    init_data(data, N);                                  // CPU: init_data

    cudaMemAdvise(data, N, ..PreferredLocation, gpuId);
    mykernel<<<..., s>>>(data, N);                       // GPU: mykernel

    use_data(data, N);                                   // CPU: use_data

    cudaFree(data);

  • The kernel will page fault and migrate the data to the GPU
  • The CPU will fault and access the data directly instead of migrating it; on non-P9+V100 systems the driver will migrate it back to the CPU

SLIDE 80

ACCESSED BY

Usage example

    char *data;
    cudaMallocManaged(&data, N);

    init_data(data, N);                                  // CPU: init_data

    cudaMemAdvise(data, N, ..SetAccessedBy, myGpuId);
    mykernel<<<..., s>>>(data, N);                       // GPU: mykernel

    use_data(data, N);                                   // CPU: use_data

    cudaFree(data);

  • The GPU will establish a direct mapping of the data in CPU memory; no page faults will be generated
  • The memory can move freely to other processors and the mapping will carry over

SLIDE 81

ACCESSED BY

Using access counters on Volta

    char *data;
    cudaMallocManaged(&data, N);

    init_data(data, N);                                  // CPU: init_data

    cudaMemAdvise(data, N, ..SetAccessedBy, myGpuId);
    mykernel<<<..., s>>>(data, N);                       // GPU: mykernel

    use_data(data, N);                                   // CPU: use_data

    cudaFree(data);

  • The GPU will establish a direct mapping of the data in CPU memory; no page faults will be generated
  • Access counters may eventually trigger migration of frequently accessed pages to the GPU

SLIDE 82

MANAGED VS MALLOC ON VOLTA+P9

First-touch allocation policy

cudaMallocManaged:

    cudaMallocManaged(&ptr, size);
    doStuffOnGpu<<<...>>>(ptr, size);

The GPU page faults, the Unified Memory driver allocates on the GPU, and the GPU accesses GPU memory.

malloc:

    ptr = malloc(size);
    doStuffOnGpu<<<...>>>(ptr, size);

The GPU uses ATS and faults, the OS allocates on the CPU (by default), and the GPU uses ATS to access CPU memory.

*You may alter this behavior by using cudaMemAdvise policies.

SLIDE 83

MANAGED VS MALLOC ON P9

cudaMallocManaged: same behavior as x86

    cudaMallocManaged(&ptr, size);
    fillData(ptr, size);
    doStuffOnGpu<<<...>>>(ptr, size);       // GPU page faults, ptr migrated to the GPU
    cudaDeviceSynchronize();
    doStuffOnCpu(ptr, size);                // CPU page faults, ptr migrated to the CPU

SLIDE 84

MANAGED VS MALLOC ON P9

malloc: no on-demand migrations*

    ptr = malloc(size);
    fillData(ptr, size);
    doStuffOnGpu<<<...>>>(ptr, size);       // GPU uses ATS to access CPU memory (no on-demand
                                            // migration except via cudaMemPrefetchAsync*)
    cudaDeviceSynchronize();
    doStuffOnCpu(ptr, size);                // CPU accesses CPU memory

*In the future, Volta access counters will be used to migrate malloc memory.

SLIDE 85

HYPRE-INSPIRED USE CASE

Algebraic Multi-Grid library: https://github.com/LLNL/hypre

Lots of small allocations: multiple variables may end up on the same page. If they are used by different processors, this results in false sharing.

[Diagram: eight 512B allocations packed into a single 4KB page, some used by the GPU and some by the CPU]

SLIDE 86

FALSE-SHARING

Issues with false-sharing:

  • Spurious migrations; thrashing mitigation does not solve it
  • Performance hints are applied on page boundaries, so due to suballocation data may inherit the wrong policies

How to mitigate this (see the sketch below):

  • Use separate allocators or memory pools for CPU and GPU
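A minimal sketch of the mitigation above (not HYPRE code; Pool, pool_create, pool_alloc, and the slab size are hypothetical): separate managed-memory pools for CPU-used and GPU-used variables, carved from slabs advised as a whole, so small allocations from different processors never share a page and never inherit each other's policies.

    #include <cuda_runtime.h>
    #include <stddef.h>

    #define SLAB_SIZE (2u * 1024u * 1024u)            // one slab per processor (placeholder size)

    typedef struct { char *base; size_t used; } Pool;

    static Pool pool_create(int device) {
      Pool p;
      cudaMallocManaged((void**)&p.base, SLAB_SIZE);
      p.used = 0;
      // Advise the whole slab at once: every suballocation inherits the policy.
      cudaMemAdvise(p.base, SLAB_SIZE, cudaMemAdviseSetPreferredLocation, device);
      return p;
    }

    static void *pool_alloc(Pool *p, size_t bytes) {
      size_t aligned = (bytes + 255) & ~(size_t)255;  // keep allocations aligned
      if (p->used + aligned > SLAB_SIZE) return NULL; // slab exhausted (no growth in this sketch)
      void *ptr = p->base + p->used;
      p->used += aligned;
      return ptr;
    }

    // Usage: cpuPool = pool_create(cudaCpuDeviceId); gpuPool = pool_create(0);
    // CPU-side variables come from cpuPool and GPU-side variables from gpuPool,
    // so a 4KB page never mixes data with different access patterns.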