EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY
Nikolay Sakharnykh, 3/27/2018
2
SINGLE POINTER
CPU vs GPU

CPU code:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    cpu_func2(data, N);
    cpu_func3(data, N);
    free(data);

GPU code w/ Unified Memory:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);
3
SINGLE POINTER
Explicit vs Unified Memory

Explicit Memory Management:

    void *data, *d_data;
    data = malloc(N);
    cudaMalloc(&d_data, N);
    cpu_func1(data, N);
    cudaMemcpy(d_data, data, N, ...);
    gpu_func2<<<...>>>(d_data, N);
    cudaMemcpy(data, d_data, N, ...);
    cudaFree(d_data);
    cpu_func3(data, N);
    free(data);

GPU code w/ Unified Memory:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);
4
SINGLE POINTER
Full Control with Prefetching

Explicit Memory Management:

    void *data, *d_data;
    data = malloc(N);
    cudaMalloc(&d_data, N);
    cpu_func1(data, N);
    cudaMemcpy(d_data, data, N, ...);
    gpu_func2<<<...>>>(d_data, N);
    cudaMemcpy(data, d_data, N, ...);
    cudaFree(d_data);
    cpu_func3(data, N);
    free(data);

Unified Memory + Prefetching:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    cudaMemPrefetchAsync(data, N, GPU);
    gpu_func2<<<...>>>(data, N);
    cudaMemPrefetchAsync(data, N, CPU);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);
5
SINGLE POINTER
Deep Copy

Explicit Memory Management:

    char **data;
    // allocate and initialize data on the CPU

    char **d_data;
    char **h_data = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++) {
        cudaMalloc(&h_data[i], N);
        cudaMemcpy(h_data[i], data[i], N, ...);
    }
    cudaMalloc(&d_data, N*sizeof(char*));
    cudaMemcpy(d_data, h_data, N*sizeof(char*), ...);
    gpu_func<<<...>>>(d_data, N);

GPU code w/ Unified Memory:

    char **data;
    // allocate and initialize data on the CPU
    gpu_func<<<...>>>(data, N);
6
UNIFIED MEMORY BASICS
[Diagram] A single virtual memory space (page1, page2) is shared between processors: GPU A and GPU B see the same addresses.
7
UNIFIED MEMORY BASICS
[Diagram] Single virtual memory shared between processors: each GPU has its own page table (A's and B's) mapping pages of the shared address space into its local physical memory.
8
UNIFIED MEMORY BASICS
[Diagram] GPU A writes *addr1 = 1: page1 is mapped in A's page table, so this is a local access. GPU B writes *addr3 = 1: page3 is not mapped in B's page table, so the access page faults.
9
UNIFIED MEMORY BASICS
[Diagram] page3 is populated and mapped into B's memory; the faulting access *addr3 = 1 is replayed and succeeds.
10
UNIFIED MEMORY BASICS
[Diagram] GPU A writes *addr2 = 1 and *addr3 = 1; both pages currently reside in B's memory, so both accesses page fault.
11
UNIFIED MEMORY BASICS
[Diagram] To service A's faults on *addr2 and *addr3, page2 and page3 are unmapped from B's memory.
12
UNIFIED MEMORY BASICS
[Diagram] The data of page2 and page3 is migrated to A's physical memory and mapped in A's page table.
13
UNIFIED MEMORY BASICS
[Diagram] The faulting accesses *addr2 = 1 and *addr3 = 1 are replayed and now complete locally on A.
14
MEMORY OVERSUBSCRIPTION
[Diagram] A's physical memory is full (oversubscribed). GPU A writes *addr5 = 1; page5 resides in B's memory, so the access page faults.
15
MEMORY OVERSUBSCRIPTION
[Diagram] To make room in A's memory, page4 is unmapped from A's memory and migrated out.
16
MEMORY OVERSUBSCRIPTION
[Diagram] page4 is mapped in B's memory; page5 is unmapped from B and migrated to A.
17
MEMORY OVERSUBSCRIPTION
[Diagram] The faulting access *addr5 = 1 is replayed and now completes locally on A.
18
SIMPLIFYING DL FRAMEWORK DESIGN
- Eliminated 3,000 lines of repetitive and error-prone code in Caffe
- Developers can add new inherited Layer classes in a much simpler manner
- The final call to a CPU function or a GPU kernel (caffe_gpu_gemm) still needs to be explicit

Existing Design:

    class ConvolutionLayer {
    public:
        void cpu_data()           void gpu_data()
        void cpu_diff()           void gpu_diff()
        void mutable_cpu_data()   void mutable_gpu_data()
        void mutable_cpu_diff()   void mutable_gpu_diff()
        void Forward_cpu()        void Forward_gpu()
        void forward_cpu_gemm()   void forward_gpu_gemm()
        void forward_cpu_bias()   void forward_gpu_bias()
        void Backward_cpu()       void Backward_gpu()
        void backward_cpu_gemm()  void backward_gpu_gemm()
        void backward_cpu_bias()  void backward_gpu_bias()
    }

Unified Memory Design:

    class ConvolutionLayer {
    public:
        void data()
        void diff()
        void mutable_data()
        void mutable_diff()
        void Forward()
        void forward_gemm()
        void forward_bias()
        void Backward()
        void backward_gemm()
        void backward_bias()
    }
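To make the design change concrete, here is a minimal sketch (not from the talk; the class and member names are hypothetical) of how a single managed buffer can collapse the paired cpu_/gpu_ accessors:

    // Hypothetical managed buffer: one pointer serves both processors.
    class Blob {
    public:
        explicit Blob(size_t n) : size_(n) {
            cudaMallocManaged(&data_, n);    // accessible from CPU and GPU
        }
        ~Blob() { cudaFree(data_); }
        // Replaces cpu_data()/gpu_data() and mutable_cpu_data()/mutable_gpu_data()
        const float* data() const { return (const float*)data_; }
        float* mutable_data() { return (float*)data_; }
    private:
        void*  data_;
        size_t size_;
    };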
A. A. Awan, C.-H. Chu, H. Subramoni, X. Lu, D. K. Panda, "OC-DNN: Designing Out-of-Core Deep Neural Network Training by Exploiting Unified Memory on Pascal and Volta GPUs", <double-blind submission under review>
19
CAN THIS DESIGN OFFER GOOD PERF?
DL training with Unified Memory
OC-Caffe will be released by the HiDL Team@OSU: hidl.cse.ohio-state.edu, mvapich.cse.ohio-state.edu
[Chart] ResNet-50 training on 1xV100 (16GB): images/sec for Caffe (BVLC) vs OC-Caffe at batch 40 (in-memory) and batch 45 (out-of-core).
[Chart] VGG19 training on 1xV100 (16GB): images/sec for Caffe (BVLC) vs OC-Caffe at batch 110 (in-memory) and batch 120 (out-of-core).
20
CONCURRENT ACCESS
[Diagram] Starting state: page1, page2, and page3 are distributed between A's and B's physical memory, each mapped in its owner's page table.
21
CONCURRENT ACCESS
Exclusive Access*
[Diagram] Both processors issue atomicAdd_system(addr2, 1): for B the access is local (page2 is resident in B's memory); for A it page faults.
*This is one possible implementation; to guarantee this behavior you need to use cudaMemAdvise policies
22
CONCURRENT ACCESS
Exclusive Access
[Diagram] A's atomicAdd_system(addr2, 1) page faults; page2 is unmapped from B's memory and migrated to A.
23
CONCURRENT ACCESS
Exclusive Access
[Diagram] A's atomicAdd_system(addr2, 1) is replayed and now completes as a local access.
24
CONCURRENT ACCESS
Atomics over NVLINK*
[Diagram] Both processors issue atomicAdd_system(addr2, 1) concurrently: B's access is local, while A's is performed remotely over NVLINK without migrating the page.

*both processors need to support atomic operations
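As a concrete illustration (not from the slides), a minimal sketch of the GPU and the CPU incrementing the same managed counter with system-wide atomics; it assumes a platform with concurrent access (Pascal or later) and compilation with -arch=sm_60 or higher:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void gpu_increment(int *counter, int n) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)
            atomicAdd_system(counter, 1);    // atomic with respect to the CPU too

    }

    int main() {
        int *counter;
        cudaMallocManaged(&counter, sizeof(int));
        *counter = 0;
        gpu_increment<<<4, 256>>>(counter, 1024);
        __sync_fetch_and_add(counter, 1);    // CPU-side atomic on the same address
        cudaDeviceSynchronize();
        printf("counter = %d\n", *counter);  // expect 1025
        cudaFree(counter);
        return 0;
    }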
25
CONCURRENT ACCESS
Read duplication*
[Diagram] With read duplication, page2 has a copy in both A's and B's physical memory; val = *addr2 is a local read on both processors simultaneously.

*each processor must maintain its own page table
26
CONCURRENT ACCESS
Read duplication: write
[Diagram] *addr2 = val2 is a local access for the writer; the write collapses all copies of page2 into one.
27
CONCURRENT ACCESS
Read duplication: read after write
[Diagram] A subsequent read val = *addr2 from the other processor faults, and the page is duplicated again.
28
ANALYTICS USE CASE
Design of a Concurrent Hybrid Hash Table
[Diagram] GPU 0, GPU 1, and multiple CPU cores all operate on a hash table implemented via Unified Memory:
- Concurrent inserts and fetches from all processors
- Non-blocking updates using atomic compare-and-swap
S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0 – Thu 11:00 Room 210F
29
ANALYTICS USE CASE
Concurrent Access To Hash Table
[Diagram] On x86, a GPU SM accessing a hash entry resident in system memory takes a page fault.
30
ANALYTICS USE CASE
Concurrent Access To Hash Table
[Diagram] On x86, the page fault triggers a page migration into GPU memory (HBM), followed by an access replay.
31
ANALYTICS USE CASE
Concurrent Access To Hash Table
[Diagram] On POWER9 with NVLink 2.0, P9 can directly update a hash entry in GPU memory through its cache: no page faults or migrations. On x86, the same update requires a page fault, a page migration, and an access replay.
S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0 – Thu 11:00 Room 210F
32
UNIFIED MEMORY + DGX-2
Unified Memory provides:
- A single memory view shared by all GPUs
- Automatic migration of data between GPUs
- User control of data locality

[Diagram] 16 GPUs sharing 512 GB of Unified Memory
33
ENABLING MULTI-GPU
Single-GPU:

    __global__ void kernel(int *data) {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        doSomeStuff(idx, data, ...);
    }

    cudaMallocManaged(&data, N * sizeof(int));
    // initialize data on the CPU
    kernel<<<grid, block>>>(data);

Multi-GPU (update the launch config and the blockIdx.x math):

    __global__ void kernel(int *data, int gpuId) {
        int idx = threadIdx.x + blockDim.x * (blockIdx.x + gpuId * gridDim.x);
        doSomeStuff(idx, data, ...);
    }

    cudaMallocManaged(&data, N * sizeof(int));
    // initialize data on the CPU
    for (int i = 0; i < numGPUs; i++) {
        cudaSetDevice(i);
        kernel<<<grid/numGPUs, block>>>(data, i);
    }
34
MULTI-GPU WITH UNIFIED MEMORY
[Diagram] Four GPUs and system memory: GPU kernels initiate migrations on demand.
35
MULTI-GPU WITH UNIFIED MEMORY
[Diagram] Data is automatically partitioned between the GPUs on first touch.
36
MULTI-GPU WITH UNIFIED MEMORY
[Diagram] With policies: remote data can be accessed directly, without migrations.
37
MULTI-GPU WITH UNIFIED MEMORY
[Diagram] With policies: read-only data can be duplicated and accessed locally on each GPU. A sketch of how such policies are applied follows.
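A minimal sketch (not from the slides) of applying these policies; it assumes managed buffers data (read-only input) and out (read-write output) of bytes bytes each, and numGPUs devices:

    // Duplicate the read-only input on every processor that reads it
    // (the device argument is ignored for SetReadMostly).
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, 0);

    // Partition the output: each GPU prefers its own slice and maps
    // the whole buffer for direct remote access instead of faulting.
    size_t chunk = bytes / numGPUs;
    for (int i = 0; i < numGPUs; i++) {
        cudaMemAdvise(out + i * chunk, chunk,
                      cudaMemAdviseSetPreferredLocation, i);
        cudaMemAdvise(out, bytes, cudaMemAdviseSetAccessedBy, i);
    }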
38
GPU ARCHITECTURE AND SOFTWARE EVOLUTION
39
UNIFIED MEMORY
Evolution of GPU architectures
- Kepler (2012): release of the new "single-pointer" programming model
- Maxwell (2014): no new features related to Unified Memory
- Pascal (2016, NVLINK1): on-demand migration, concurrent access and atomics, oversubscription
- Volta (2017, NVLINK2): access counters; P9 only: hardware coherency, ATS support

*Not all features are available on all platforms
40
UNIFIED MEMORY: BEFORE PASCAL
No GPU page fault support: all dirty pages are moved to the GPU on kernel launch. No concurrent access, no GPU memory oversubscription, no system-wide atomics.
Available since CUDA 6
[Diagram] On kernel launch, all dirty managed pages (page1, page2, page3) are migrated to the launching GPU and mapped into its page table.
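For completeness, a minimal sketch (not from the slides) of the CUDA 6-era stream attachment API, which scopes a managed allocation to one stream so the CPU can safely access it around kernel launches; the variable names and init helper are illustrative:

    int *data;
    cudaStream_t s;
    cudaStreamCreate(&s);
    // cudaMemAttachHost: the CPU may access data until it is attached to a stream
    cudaMallocManaged(&data, N * sizeof(int), cudaMemAttachHost);
    init_on_cpu(data, N);                     // hypothetical CPU-side init
    // Tie coherence of this allocation to stream s only
    cudaStreamAttachMemAsync(s, data, 0, cudaMemAttachSingle);
    kernel<<<grid, block, 0, s>>>(data);
    cudaStreamSynchronize(s);                 // required before the CPU touches data again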
41
UNIFIED MEMORY: PASCAL AND VOLTA
GPU page fault support, concurrent access, extended VA space (48-bit). On-demand migration to the accessing processor on first touch.
Available since CUDA 8
[Diagram] page1 and page2 can be mapped in either A's or B's page table and migrate on demand to whichever processor accesses them.
42
UNIFIED MEMORY ON VOLTA+P9
New Feature: Access Counters

If memory is mapped to the GPU, migration can be triggered by access counters.

[Diagram] The GPU issues few accesses to page1 and many accesses to page2; both pages currently reside in the other processor's memory.
43
UNIFIED MEMORY ON VOLTA+P9
New Feature: Access Counters

With access counters, only hot pages are moved to the GPU. Migrations are delayed compared to the fault-based method.

[Diagram] Only page2 ("many accesses") is migrated to the GPU's physical memory; page1 stays put.

*When implemented, this feature can be enabled with cudaMemAdvise policies
44
UNIFIED MEMORY ON VOLTA+P9
New Feature: Hardware Coherency with NVLINK2

- CPU can directly access and cache GPU memory
- Native atomics support for all accessible memory

[Diagram] V100's and P9's page tables each map page1 (GPU memory) and page2 (system memory); coherency is maintained in hardware.
45
UNIFIED MEMORY ON VOLTA+P9
New Feature: ATS Support

ATS (Address Translation Service): CPU and GPU can share a single page table.

[Diagram] V100 and P9 share one ATS page table mapping page1 (GPU memory) and page2 (CPU memory).
46
UNIFIED MEMORY WITH SYSTEM ALLOCATOR
System allocator support allows the GPU to access all system memory: malloc, stack, global, file system.
- P9: Address Translation Service (ATS), support enabled in CUDA 9.2
- x86: Heterogeneous Memory Management (HMM); the initial version of the patchset is integrated into the 4.14 kernel, and NVIDIA will be supporting upcoming versions of HMM
https://lkml.org/lkml/2017/6/23/443
47
WHAT YOU CAN DO WITH UNIFIED MEMORY
See it in action at the end of the talk!
Works everywhere today:

    int *data;
    cudaMallocManaged(&data, sizeof(int) * n);
    kernel<<<grid, block>>>(data);

Works on Power9 + ATS in CUDA 9.2, and will work in the future on x86 + HMM:

    int *data = (int*)malloc(sizeof(int) * n);
    kernel<<<grid, block>>>(data);

    int data[1024];                             // stack memory
    kernel<<<grid, block>>>(data);

    int *data = (int*)alloca(sizeof(int) * n);  // stack memory
    kernel<<<grid, block>>>(data);

    extern int *data;                           // global memory
    kernel<<<grid, block>>>(data);
48
UNIFIED MEMORY LANGUAGES
- CUDA C/C++: cudaMallocManaged
- CUDA Fortran: managed attribute (per allocation)
- Python: pycuda.driver.managed_empty (allocates a numpy.ndarray)
- OpenACC: -ta=managed compiler option (all dynamic allocations)
49
UNIFIED MEMORY + OPENACC
An effortless way to run your code on GPUs: literally adding a single line will get your code running on the GPU. It is easy to optimize later by adding loop and data directives (a sketch follows the example).

    #pragma acc kernels   // initiates parallel execution
    {
        for (i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
            ...
        }
    }
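A hedged sketch of what that later optimization step might look like on the same loop, adding data and loop directives (array sizes assumed to be n):

    #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        #pragma acc parallel loop
        for (i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
        }
    }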
50
GYROKINETIC TOROIDAL CODE
Particle-In-Cell production code
http://phoenix.ps.uci.edu/gtc_group
[Chart] Speed-up over CPU-only for 2xV100 and 4xV100, each with data directives vs Unified Memory.
CPU: Haswell E5-2698 v3 @ 2.30GHz, dual socket 16-core; MPI: MVAPICH-GDR
51
PERFORMANCE DEEP DIVE
52
STREAMING BENCHMARK
How fast is on-demand migration?
[Chart] Streaming performance (GB/s) on x86 + P100 (PCIe), x86 + V100 (PCIe), and P8 + P100 (NVLINK), comparing on-demand migration, prefetch, and memcpy.
    __global__ void kernel(int *host, int *device) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        device[i] = host[i];
    }

    // allocate and initialize memory
    cudaMallocManaged(&host, size);
    cudaMalloc(&device, size);
    memset(host, 0, size);

    // benchmark CPU->GPU migration
    if (prefetch)
        cudaMemPrefetchAsync(host, size, gpuId);
    kernel<<<grid, block>>>(host, device);
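To reproduce numbers like these without nvprof, one might time the kernel with CUDA events; a minimal sketch continuing the fragment above:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernel<<<grid, block>>>(host, device);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.2f GB/s\n", (size / 1e9) / (ms / 1e3));  // effective migration bandwidth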
53
UNDERSTANDING PROFILER OUTPUT

    ==14487== Profiling result:
    Type            Time(%)  Time      Calls  Avg       Min       Max       Name
    GPU activities: 100.00%  23.270ms  1      23.270ms  23.270ms  23.270ms  void kernel(int*, int*)
    API calls:      79.56%   23.272ms  1      23.272ms  23.272ms  23.272ms  cudaDeviceSynchronize
                    20.42%   5.9732ms  1      5.9732ms  5.9732ms  5.9732ms  cudaLaunch
                    0.01%    2.0490us  1      2.0490us  2.0490us  2.0490us  cudaConfigureCall
                    0.01%    1.8360us  4      459ns     138ns     833ns     cudaSetupArgument

    ==14487== Unified Memory profiling result:
    Device "Tesla V100-PCIE-16GB (0)"
    Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    3012   21.758KB  4.0000KB  952.00KB  64.00000MB  13.49043ms  Host To Device
    81     -         -         -         -           23.23181ms  Gpu page fault groups
54
HEURISTIC PREFETCHING
Do not confuse this with API prefetching (cudaMemPrefetchAsync):
- The GPU architecture supports different page sizes
- Contiguous pages up to a large page size are promoted to the larger size
- The driver prefetches whole regions if pages are accessed densely

[Diagram] As GPU accesses grow denser, migrations grow from 4KB and 4KB transfers to 952KB and 1024KB transfers.
55
WHAT ARE PAGE FAULT GROUPS?

    ==14487== Unified Memory profiling result:
    Device "Tesla V100-PCIE-16GB (0)"
    Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    3012   21.758KB  4.0000KB  952.00KB  64.00000MB  13.49043ms  Host To Device
    81     -         -         -         -           23.23181ms  Gpu page fault groups

Per-fault detail from nvprof --print-gpu-trace ...:

    Unified Memory  Virtual Address  Name
    8               0x3900010000     [Unified Memory GPU page faults]
    9               0x3900040000     [Unified Memory GPU page faults]
    5               0x3900108000     [Unified Memory GPU page faults]
    5               0x3900200000     [Unified Memory GPU page faults]
56
PAGE FAULTS HANDLING
[Diagram] Page fault handling with a thread/4B pattern: several warps touch 128B segments of the same GPU page; the faulting warps trigger fault processing while the others stall, and all are resumed and replayed once the fault is serviced.
57
OPTIMIZING ON-DEMAND MIGRATION
Increase fault concurrency to reduce page fault stalls:
- thread/4B pattern: multiple faults per page; warps are stalled on fault processing
- warp/64KB pattern: fewer warps are stalled, and the "spread-out" pattern improves prefetching

[Diagram] thread/4B: warps 0 and 1 issue 128B accesses within a single GPU page; warp/64KB: each warp covers its own 64KB range spanning whole GPU pages. A sketch of the two patterns follows.
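A minimal sketch (not from the slides) of the two first-touch patterns; data is assumed to be a managed allocation of n ints:

    // thread/4B: consecutive threads touch consecutive 4-byte elements,
    // so many warps fault on the same page.
    __global__ void touch_thread_4b(int *data, size_t n) {
        size_t i = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
        if (i < n) data[i] = 0;
    }

    // warp/64KB: each warp claims its own 64KB chunk, so faults from
    // different warps land on different pages (the "spread-out" pattern).
    __global__ void touch_warp_64kb(int *data, size_t n) {
        const size_t chunk = 64 * 1024 / sizeof(int);   // ints per 64KB
        size_t warp = (threadIdx.x + (size_t)blockIdx.x * blockDim.x) / 32;
        size_t lane = threadIdx.x % 32;
        size_t end  = (warp + 1) * chunk;
        if (end > n) end = n;
        for (size_t i = warp * chunk + lane; i < end; i += 32)
            data[i] = 0;
    }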
58
OPTIMIZING ON-DEMAND MIGRATION
Thread/4B:

    Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    3012   21.758KB  4.0000KB  952.00KB  64.00000MB  13.49043ms  Host To Device
    81     -         -         -         -           23.23181ms  Gpu page fault groups

    Unified Memory  Virtual Address  Name
    8               0x3900010000     [Unified Memory GPU page faults]
    9               0x3900040000     [Unified Memory GPU page faults]
    5               0x3900108000     [Unified Memory GPU page faults]

Warp/64KB (more efficient prefetching, fewer stalls):

    Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    957    68.481KB  4.0000KB  576.00KB  64.00000MB  8.242080ms  Host To Device
    6      -         -         -         -           9.769984ms  Gpu page fault groups

    Unified Memory  Virtual Address  Name
    1               0x39000d0000     [Unified Memory GPU page faults]
    1               0x39000c0000     [Unified Memory GPU page faults]
    1               0x3900080000     [Unified Memory GPU page faults]
59
STREAMING BENCHMARK
How fast is on-demand migration?
[Chart] Streaming performance (GB/s) on x86 + P100 (PCIe), x86 + V100 (PCIe), and P8 + P100 (NVLINK), comparing on-demand thread/4B, on-demand warp/64K, prefetch, and memcpy.

[Diagram] thread/4B: warps 0 and 1 issue 128B accesses within one page; warp/64KB: each warp covers its own 64KB range.

Also check the Parallel Forall blog: https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/
60
GPU MEMORY OVERSUBSCRIPTION

Let's see how performance changes as we increase the working set.

[Chart] Fault-based migration throughput (GB/s) vs working set (10-50 GB) on P8/NVLINK and on x86/PCIe, for thread/4B, warp/64K, thread/4B + readmostly, and warp/64K + readmostly; the GPU memory limit is marked on each chart.
61
EVICTION ALGORITHM
What pages are moved out of the GPU?
- The driver keeps a list of physical chunks of GPU memory
- Chunks from the front of the list are evicted first (LRU)
- A chunk is considered "in use" when it is fully populated or migrated
- Prefetching and policies may impact the eviction heuristic in the future

[Diagram] Chunks enter the back of the list on allocation or migration; eviction removes chunks from the front.
62
PREFETCHING AND EVICTIONS
[Chart] Migration throughput (GB/s) on x86 + V100 for on-demand thread/4B, on-demand warp/64K, and 1-way prefetch, with and without eviction.
Prefetch can be overlapped with evictions without using CUDA streams (enabled in CUDA 9.1). A cudaMemcpy-based solution requires scheduling the DtoH and HtoD copies into two separate streams; see the sketch below.
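A sketch of the contrast (the buffer and stream names are illustrative): with managed memory one prefetch call suffices, while the explicit version needs two streams and pinned host buffers to get the same overlap:

    // Unified Memory: one call; the driver overlaps evictions with the fill.
    cudaMemPrefetchAsync(managed_buf, bytes, gpuId);

    // Explicit memory: overlap requires two streams and pinned host memory.
    cudaMemcpyAsync(host_spill, d_old, bytes, cudaMemcpyDeviceToHost, s_out);
    cudaMemcpyAsync(d_new, host_fill, bytes, cudaMemcpyHostToDevice, s_in);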
63
DRIVER ENHANCEMENTS
Improvements for the allocator: Tesla V100 + x86, HPGMG-AMR default (no hints)

[Chart] MDOF/s vs working set (1.4 to 58.6 GB) for drivers 384.81 (CUDA 8.0 GA2) and 387.34 (CUDA 9.0).
64
DRIVER ENHANCEMENTS
Prefetch + evictions overlap: Tesla V100 + x86, HPGMG-AMR optimized (prefetches)

[Chart] MDOF/s vs working set (1.4 to 58.6 GB) for drivers 384.81 (CUDA 8.0 GA2), 387.34 (CUDA 9.0), and 390.30 (CUDA 9.1).
65
LOCALITY AND ACCESS CONTROL
cudaMemAdvise policies:
- Default: data migrates on first touch
- ReadMostly: data is duplicated on first touch
- PreferredLocation: resist migrating away from the preferred location
- AccessedBy: establish a direct mapping and avoid faults

[Diagram] The four policies illustrated for a GPU0/GPU1 pair.
66
LULESH
CORAL OpenACC app

    while not converged:
        LagrangeNodal
            CalcForceForNodes
            CalcAccelerationForNodes
            BoundaryConditionsForNodes
            CalcVelocityForNodes
            CalcPositionForNodes
        LagrangeElements
            CalcLagrangeElements
            CalcQForElems
            MaterialPropertiesForElems
        CalcTimeConstraintsForElems
            CalcCourantConstraintForElems
            CalcHydroConstraintForElems

The loop mixes a few heavy GPU kernels, many small GPU kernels, and many small CPU functions.
67
USING VISUAL PROFILER
[Profiler timeline] 1st iteration: kernel time = 17ms; 2nd iteration: kernel time = 27ms. Why is the second iteration longer, and what is the extra activity in between?
68
THRASHING MITIGATION
Thrashing: processors frequently read or write the same page.
- Before CUDA 9.2: once memory is pinned, we lose any insight into the access pattern
- In the future we may use access counters on Volta to find a better location

[Diagram] When GPU/CPU memory thrashing is detected, the driver can throttle the page or map it remotely.
69
LULESH: PREFERRED LOCATION = GPU
[Profiler timelines] No policies vs Preferred Location = GPU: despite lots of back-and-forth migrations there is no slowdown for the heavy kernels; the anti-thrashing heuristic kicks in and pins the contested page to the CPU.
70
LULESH: READ MOSTLY
[Profiler timeline] No slowdown for the heavy kernels, and almost no thrashing for the hybrid CPU/GPU part.
71
LULESH: PERF IMPROVEMENTS

[Chart] LULESH performance (FOM, z/s), 200^3 mesh, 100 iterations, CUDA 9.1, on Quadro GP100 and Quadro GV100, for Default, Preferred, ReadMostly, and Improved Default Heuristics.

GP100 has 500x more sysmem writes than GV100. The improved heuristics will be enabled in future drivers.
72
WHEN TO USE UNIFIED MEMORY
                              cudaMalloc                            cudaMallocManaged
    Pinned allocation         cudaMalloc                            cudaMallocManaged
                                                                    + PreferredLocation(GPU)
                                                                    + SetAccessedBy(peer GPUs)
                                                                    + cudaMemPrefetchAsync(GPU)
    cudaMemcpy: ptrA -> ptrB  Staging for non-pinned allocations    Staging or a copy kernel
                              or between non-P2P GPUs               required in all cases
    Memory migration          Not possible                          cudaMemPrefetchAsync
    Debugging                 Difficult                             Easy
    Oversubscription          No                                    Yes
    IPC support               Yes                                   No
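A sketch of the "pinned allocation" recipe from the right-hand column, i.e., approximating cudaMalloc-like residency with managed memory (the device IDs and names are illustrative):

    float *buf;
    cudaMallocManaged(&buf, bytes);
    // Keep the data resident on GPU 0 ...
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, 0);
    // ... let peer GPUs map it directly instead of faulting ...
    for (int peer = 1; peer < numGPUs; peer++)
        cudaMemAdvise(buf, bytes, cudaMemAdviseSetAccessedBy, peer);
    // ... and move it there up front.
    cudaMemPrefetchAsync(buf, bytes, 0);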
73
WHEN TO USE UNIFIED MEMORY
[Diagram] Application code alternates between sequential CPU code and parallel GPU code; GPU-only data can use cudaMalloc, while data shared between the CPU and GPU phases uses cudaMallocManaged (or malloc where supported).
75
UNIFIED MEMORY PLATFORMS
                   KEPLER                        PASCAL                      VOLTA
    Linux + x86    No GPU fault support,         On-demand migration         On-demand migration
                   no concurrent access
    Linux + Power  -                             On-demand migration,        On-demand migration,
                                                 80GB/s CPU-GPU BW*          150GB/s CPU-GPU BW**,
                                                                             access counters,
                                                                             HW coherency, ATS support
    Windows        No GPU fault support, no concurrent access
    MacOS          No GPU fault support, no concurrent access
    Tegra          Cached on CPU and iGPU, no concurrent access

*IBM Minsky: 4xP100 + 2xP8, 2xNVLINK1 links between P100 and P8, bi-directional aggregate BW
**IBM Newell: 4xV100 + 2xP9, 3xNVLINK2 links between V100 and P9, bi-directional aggregate BW
76
READ DUPLICATION
Usage example
    char *data;
    cudaMallocManaged(&data, N);

    init_data(data, N);                                  // CPU

    cudaMemAdvise(data, N, ..SetReadMostly, myGpuId);
    cudaMemPrefetchAsync(data, N, myGpuId, s);
    mykernel<<<..., s>>>(data, N);                       // GPU reads

    use_data(data, N);                                   // CPU reads

    cudaFree(data);

The prefetch creates a copy instead of moving the data. Both processors can read the data simultaneously without faults. Writes will collapse all copies into one; subsequent reads will fault and duplicate again.
77
PREFERRED LOCATION
Resisting migrations
    char *data;
    cudaMallocManaged(&data, N);

    init_data(data, N);                                  // CPU

    cudaMemAdvise(data, N, ..PreferredLocation, cudaCpuDeviceId);
    mykernel<<<..., s>>>(data, N);                       // GPU

    use_data(data, N);                                   // CPU

    cudaFree(data);

The kernel will page fault and generate a direct mapping to the data on the CPU. The driver will "resist" migrating data away from the preferred location.
78
PREFERRED LOCATION
Page population on first-touch
    char *data;
    cudaMallocManaged(&data, N);

    cudaMemAdvise(data, N, ..PreferredLocation, cudaCpuDeviceId);
    mykernel<<<..., s>>>(data, N);                       // GPU touches first

    use_data(data, N);                                   // CPU

    cudaFree(data);

The kernel will page fault, populate the pages on the CPU, and generate a direct mapping to the data on the CPU. Pages are populated on the preferred location if the faulting processor can access it.
79
PREFERRED LOCATION ON P9+V100
CPU can directly access GPU memory
    char *data;
    cudaMallocManaged(&data, N);

    init_data(data, N);                                  // CPU

    cudaMemAdvise(data, N, ..PreferredLocation, gpuId);
    mykernel<<<..., s>>>(data, N);                       // GPU

    use_data(data, N);                                   // CPU

    cudaFree(data);

The kernel will page fault and migrate the data to the GPU. The CPU will then fault and access the data directly instead of migrating it. On non-P9+V100 systems the driver will migrate the data back to the CPU.
80
ACCESSED BY
Usage example
    char *data;
    cudaMallocManaged(&data, N);

    init_data(data, N);                                  // CPU

    cudaMemAdvise(data, N, ..SetAccessedBy, myGpuId);
    mykernel<<<..., s>>>(data, N);                       // GPU

    use_data(data, N);                                   // CPU

    cudaFree(data);

The GPU will establish a direct mapping of the data in CPU memory; no page faults will be generated. The memory can move freely to other processors and the mapping will carry over.
81
ACCESSED BY
Using access counters on Volta
    char *data;
    cudaMallocManaged(&data, N);

    init_data(data, N);                                  // CPU

    cudaMemAdvise(data, N, ..SetAccessedBy, myGpuId);
    mykernel<<<..., s>>>(data, N);                       // GPU

    use_data(data, N);                                   // CPU

    cudaFree(data);

The GPU will establish a direct mapping of the data in CPU memory; no page faults will be generated. Access counters may eventually trigger migration of frequently accessed pages to the GPU.
82
MANAGED VS MALLOC ON VOLTA+P9
First touch allocation policy*

    cudaMallocManaged(&ptr, size);
    doStuffOnGpu<<<...>>>(ptr, size);

GPU page faults; the Unified Memory driver allocates on the GPU; the GPU accesses GPU memory.

    ptr = malloc(size);
    doStuffOnGpu<<<...>>>(ptr, size);

GPU uses ATS and faults; the OS allocates on the CPU (by default); the GPU uses ATS to access CPU memory.

*You may alter this behavior by using cudaMemAdvise policies
83
MANAGED VS MALLOC ON P9
cudaMallocManaged: same behavior as x86
    cudaMallocManaged(&ptr, size);
    fillData(ptr, size);                 // CPU
    doStuffOnGpu<<<...>>>(ptr, size);    // GPU page faults, ptr migrated to GPU
    cudaDeviceSynchronize();
    doStuffOnCpu(ptr, size);             // CPU page faults, ptr migrated to CPU
84
MANAGED VS MALLOC ON P9
malloc: no on-demand migrations*
    ptr = malloc(size);
    fillData(ptr, size);                 // CPU
    doStuffOnGpu<<<...>>>(ptr, size);    // GPU uses ATS to access CPU memory (no
                                         // on-demand migration except cudaMemPrefetchAsync*)
    cudaDeviceSynchronize();
    doStuffOnCpu(ptr, size);             // CPU accesses CPU memory
*In the future Volta access counters will be used to migrate malloc memory
85
HYPRE-INSPIRED USE CASE
Algebraic Multi-Grid library: https://github.com/LLNL/hypre
- Lots of small allocations: multiple variables may end up on the same page
- If they are used by different processors, this results in false sharing

[Diagram] Eight 512B sub-allocations packed into a single 4KB page, some used by the GPU and some by the CPU.
86
FALSE-SHARING
Issues with false sharing:
- Spurious migrations; thrashing mitigation does not solve it
- Performance hints are applied on page boundaries, so with suballocation, data may inherit the wrong policies

How to mitigate this:
- Use separate allocators or memory pools for CPU and GPU (a sketch follows)
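A minimal sketch (not from the talk) of that mitigation: a bump allocator over page-aligned managed pools, one pool per processor, so CPU-owned and GPU-owned variables never share a page. The granularity and names are illustrative:

    #include <cstddef>
    #include <cuda_runtime.h>

    struct Pool {
        char  *base = nullptr;
        size_t used = 0, cap = 0;

        // One pool per processor; advise the whole pool once.
        void init(size_t bytes, int device) {
            cap = bytes;
            cudaMallocManaged(&base, cap);
            cudaMemAdvise(base, cap, cudaMemAdviseSetPreferredLocation, device);
        }
        // Round sub-allocations up to 4KB so neighbors never share a page.
        void *alloc(size_t bytes) {
            size_t rounded = (bytes + 4095) & ~size_t(4095);
            if (used + rounded > cap) return nullptr;
            void *p = base + used;
            used += rounded;
            return p;
        }
    };

    // Usage: small CPU-side and GPU-side variables come from different pools:
    // Pool cpuPool, gpuPool;
    // cpuPool.init(1 << 26, cudaCpuDeviceId);
    // gpuPool.init(1 << 26, 0);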