UNIFIED MEMORY ON PASCAL AND VOLTA
Nikolay Sakharnykh - May 10, 2017
HETEROGENEOUS ARCHITECTURES
[Figure: a CPU with its system memory (SYS MEM) connected to GPU 0, GPU 1, and GPU 2, each with its own local memory.]
UNIFIED MEMORY FUNDAMENTALS
Single pointer: the same allocation is usable from CPU code and GPU code.

CPU code:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    cpu_func2(data, N);
    cpu_func3(data, N);
    free(data);

GPU code:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);
Explicit memory management:

    void *h_data, *d_data;
    h_data = malloc(N);
    cudaMalloc(&d_data, N);
    cpu_func1(h_data, N);
    cudaMemcpy(d_data, h_data, N, ...);
    gpu_func2<<<...>>>(d_data, N);
    cudaMemcpy(h_data, d_data, N, ...);
    cpu_func3(h_data, N);
    free(h_data);
    cudaFree(d_data);

Unified Memory:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);
Explicit memory management (deep copy of a 2D array):

    char **data;
    data = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++)
        data[i] = (char*)malloc(N);

    char **d_data;
    char **h_data2 = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++) {
        cudaMalloc(&h_data2[i], N);
        cudaMemcpy(h_data2[i], data[i], N, ...);
    }
    cudaMalloc(&d_data, N*sizeof(char*));
    cudaMemcpy(d_data, h_data2, N*sizeof(char*), ...);
    gpu_func<<<...>>>(d_data, N);

Unified Memory:

    char **data;
    data = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++)
        data[i] = (char*)malloc(N);
    gpu_func<<<...>>>(data, N);
How page migration works: two processors (proc A, proc B), each with its own memory (memory A, memory B), share pages page1, page2, page3:
1) proc A writes *addr1: the page is resident in memory A, so it is a local access. proc B writes *addr3: the page is not populated in memory B, so the access page faults.
2) The fault on *addr3 populates the page in memory B and the write completes.
3) proc A writes *addr2 and proc B writes *addr3, each targeting a page resident in the other processor's memory: both accesses page fault.
4) The faults trigger page migrations to the faulting processor's memory.
5) After migration, both writes are local accesses.
When it doesn't matter how data moves to a processor:
1) Quick and dirty algorithm prototyping
2) Iterative processes with lots of data reuse, where the migration cost can be amortized
3) Simplified application debugging

When it's difficult to isolate the working set:
1) Irregular or dynamic data structures, unpredictable access
2) Data partitioning between multiple processors
Oversubscription, step by step:
1) A processor writes *addr3; the access page faults while the target memory's physical capacity is full.
2) The driver evicts a page to the other memory to make room.
3) The faulted page then migrates in and the access completes.
4) Physical memory is full again, now holding the actively used pages.
When you have a large dataset and not enough physical memory:
Moving pieces by hand is error-prone and requires tuning for the memory size.
Better to run slowly than to fail with an out-of-memory error.
You can actually get high performance with Unified Memory!
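As a hedged illustration of oversubscription (sizes and names are my own, not from the talk): allocate more managed memory than the GPU physically has and let the driver page it in and out on demand. On a 16GB GP100 this sketch touches a 20GB working set from a kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel that touches every element, faulting pages in on demand.
__global__ void scale(float *data, size_t n, float f) {
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        data[i] *= f;
}

int main() {
    // Hypothetical size: 5G floats = 20 GB, more than a 16GB GP100 holds.
    size_t n = (size_t)5 << 30;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (size_t i = 0; i < n; i++) data[i] = 1.0f;  // first touch on the CPU
    scale<<<1024, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();  // the driver migrates and evicts pages as needed
    printf("%f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

The same code would fail immediately with cudaMalloc; with managed memory it runs, paying migration cost instead.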
System-wide atomics, step by step:
1) proc A and proc B both execute atomicAdd_system(addr2, 1): the page is resident in memory A, so proc A's atomic is a local access and proc B's page faults.
2) The fault triggers migration of the page to proc B's memory.
3) proc B's atomic then completes as a local access.
4) Alternatively, an atomic to a non-resident page can be serviced as a remote access while the other processor's atomic stays local, with no migration.
*Both processors need to support atomic operations.
GPUs are very good at handling atomics from thousands of threads.
It makes sense to utilize atomics between GPUs or between the CPU and GPU.
We will see this in action on a realistic example later on.
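A minimal sketch of CPU and GPU atomically updating the same managed counter (kernel and variable names are my own; this requires Pascal or later with concurrentManagedAccess, and atomicAdd_system for system-wide scope):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpu_count(int *counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd_system(counter, 1);  // system-wide atomic scope
}

int main() {
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;
    gpu_count<<<4, 256>>>(counter, 1024);
    // The CPU increments the same counter while the kernel may still be
    // running; the compiler builtin gives an atomic RMW on the host side.
    for (int i = 0; i < 1024; i++) __sync_fetch_and_add(counter, 1);
    cudaDeviceSynchronize();
    printf("%d\n", *counter);  // 1024 GPU + 1024 CPU increments
    cudaFree(counter);
    return 0;
}
```

On Kepler/Maxwell, concurrent CPU access during kernel execution is not allowed, so this pattern is Pascal+ only.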
AGENDA
Unified Memory Fundamentals
Under the Hood Details
Performance Analysis and Optimizations
Applications Deep Dive

UNDER THE HOOD DETAILS
How to opt in:
CUDA C: cudaMallocManaged is your most reliable way to opt in today.
CUDA Fortran: the managed attribute (per allocation).
OpenACC: the -ta=managed compiler option (all dynamic allocations).
malloc support is coming on Pascal+ architectures (Linux only).
Note: you can write your own malloc hook to use cudaMallocManaged.
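One common way to implement that hook idea in C++ is to override allocation for selected types; this is a hedged sketch of the pattern (the `Managed` base class is a convention, not something prescribed by the talk):

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Any class deriving from Managed is allocated in Unified Memory,
// so the same object pointer is valid on both the CPU and the GPU.
class Managed {
public:
    void *operator new(size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        cudaDeviceSynchronize();  // ensure the allocation is usable everywhere
        return ptr;
    }
    void operator delete(void *ptr) {
        cudaDeviceSynchronize();  // ensure no kernel is still using it
        cudaFree(ptr);
    }
};

// Hypothetical example type: `new DataElem` now lives in managed memory.
struct DataElem : public Managed {
    int key;
    float value;
};
```

This gives "single-pointer" semantics for C++ objects without changing call sites beyond the base class.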
Heterogeneous Memory Management (HMM): a set of Linux kernel patches.
Allows GPUs to access all system memory (malloc, stack, file system).
Page migration is triggered the same way as for cudaMallocManaged.
Ongoing testing and reviews; planning the next phase of optimizations.
More details on HMM today at 4:00 in Room 211B by John Hubbard.
Unified Memory across GPU architectures:
2012 - Kepler: first release of the new "single-pointer" programming model
2014 - Maxwell: no new features related to Unified Memory
2016 - Pascal (NVLINK1): on-demand migration, system-wide atomics
2017 - Volta (NVLINK2): access counters, copy engine faults, cache coherence, ATS support
Kepler GPU: no page fault support, limited virtual address space.
Pages are moved in bulk: on kernel launch, all pages attached to the current stream are migrated to GPU memory.
No on-demand migration for the GPU, no oversubscription, no system-wide atomics; after the bulk migration, GPU and CPU accesses to their resident pages are local.
Pascal GPU: page fault support, extended virtual address space (48-bit).
On-demand migration to the accessing processor on first touch: an access to a non-resident page faults and triggers page migration.
All features are available: on-demand migration, oversubscription, system-wide atomics.
Volta GPU: uses fault on first touch for migration, same as Pascal.
If memory is mapped to the GPU, migration can instead be triggered by access counters: the GPU initially accesses the pages remotely, and with access-counter-based migration only the hot pages are moved to the GPU.
The CPU can directly access and cache GPU memory; native CPU-GPU atomics are supported.
The Unified Memory driver is doing intelligent things under the hood:
Prefetching: migrate pages proactively to reduce the number of faults.
Thrashing mitigation: heuristics to avoid frequent migration of shared pages.
Eviction: deciding which pages to evict when we need to make room for new ones.
You can't control these directly, but you can override most of them with hints.
Prefetching: the GPU architecture supports different page sizes.
Contiguous pages up to a larger page size are promoted to the larger size.
The driver prefetches whole regions if pages are accessed densely.
Thrashing: processors share the same page and frequently read or write to it.
Pascal: the driver can throttle the CPU or pin the page to the CPU, but once memory is pinned we lose any insight into the access pattern.
Volta: can use access counter information to find a better location.
Eviction: the driver keeps a single list of physical chunks of GPU memory.
Chunks from the front of the list are evicted first (LRU); allocations and migrations to the GPU consume chunks from the back.
A chunk is considered "in use" when it is fully populated or migrated.
PERFORMANCE ANALYSIS AND OPTIMIZATIONS
More details tomorrow at 10:00 in Marriott Salon 3
If you know your application well, you can optimize with hints; these are also useful to override some of the driver heuristics.
cudaMemPrefetchAsync(ptr, size, processor, stream): similar to move_pages() in Linux.
cudaMemAdvise(ptr, size, advice, processor): similar to madvise() in Linux.
Page faults can be expensive and they stall SM execution. Avoid faults by prefetching data to the accessing processor:

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);

    cudaMemPrefetchAsync(data, N, myGpuId, s);
    mykernel<<<..., s>>>(data, N);
    cudaMemPrefetchAsync(data, N, cudaCpuDeviceId, s);
    cudaStreamSynchronize(s);

    use_data(data, N);
    cudaFree(data);
cudaMemAdviseSetReadMostly: prefetch now creates a copy instead of moving the data. Both processors can read the data simultaneously without faults; writes are allowed but expensive.

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);

    cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
    cudaMemPrefetchAsync(data, N, myGpuId, s);
    mykernel<<<..., s>>>(data, N);

    use_data(data, N);
    cudaFree(data);
cudaMemAdviseSetPreferredLocation: here the kernel will page fault and establish a direct mapping to the data in CPU memory. The driver will "resist" migrating data away from the preferred location.

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);

    cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    mykernel<<<..., s>>>(data, N);

    use_data(data, N);
    cudaFree(data);
cudaMemAdviseSetAccessedBy: the GPU establishes a direct mapping to the data in CPU memory, and no page faults are generated. The memory can move freely to other processors and the mapping carries over.

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);

    cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, myGpuId);
    mykernel<<<..., s>>>(data, N);

    use_data(data, N);
    cudaFree(data);
On Volta, with the same cudaMemAdviseSetAccessedBy code, the GPU still establishes a direct mapping to data in CPU memory without page faults, but access counters may eventually trigger migration of this memory to the GPU.
How long does a page fault take to serve? We can measure: a linked-list traversal with a large stride to avoid prefetching effects.

    Page fault cost (us)        DtoH   HtoD
    x86 + PCIe + GP100           20     30
    P8 + NVLINK + GP100          20     20
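A hedged sketch of the kind of pointer chase behind such a measurement (the exact benchmark from the talk is not shown; the structure and names here are my own):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One node per 4KB so every hop lands on a different system page.
struct Node { Node *next; char pad[4096 - sizeof(Node*)]; };

__global__ void chase(Node *head, int hops, Node **out) {
    Node *p = head;
    for (int i = 0; i < hops; i++) p = p->next;  // each hop may page fault
    *out = p;  // keep the traversal from being optimized away
}

int main() {
    const int n = 1024;
    Node *nodes; Node **out;
    cudaMallocManaged(&nodes, n * sizeof(Node));
    cudaMallocManaged(&out, sizeof(Node*));
    // Link nodes with a large, co-prime stride so prefetching cannot help.
    for (int i = 0; i < n; i++) nodes[i].next = &nodes[(i + 37) % n];
    cudaEvent_t t0, t1; cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    chase<<<1, 1>>>(nodes, n, out);  // pages are CPU-resident: HtoD faults
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms; cudaEventElapsedTime(&ms, t0, t1);
    printf("%.1f us per hop\n", 1000.0f * ms / n);
    cudaFree(nodes); cudaFree(out);
    return 0;
}
```

Running it again after the pages have migrated gives the fault-free baseline to subtract.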
[Chart: memory initialization bandwidth (GB/s) vs allocation size (128KB to 4GB). Left panel "CPU memory" (up to 25 GB/s): explicit single vs explicit multi-threaded. Right panel "GPU memory" (log scale, up to 1000 GB/s): prefetch vs explicit.]
Setup: cudaMallocManaged/cudaMalloc + cudaMemset, and cudaMallocManaged/mmap + fill on the CPU. Note that cudaMalloc uses preallocated memory for large sizes.
[Chart: migration bandwidth (GB/s, axes up to 14) vs size (128KB to 4GB): "CPU to GPU" and "GPU to CPU" panels, comparing prefetch and memcpy.]
[Chart: migration bandwidth (GB/s, axes up to 25-30) vs size (128KB to 512MB): "CPU to GPU" and "GPU to CPU" panels, comparing prefetch and memcpy.]
cudaMallocManaged alignment: 512B on Pascal/Volta, 4KB on Kepler/Maxwell.
Too many small allocations will use up many pages.
cudaMallocManaged memory is moved at system page granularity, so for small allocations more data could be moved than necessary.
Solution: use a caching allocator or memory pools.
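A minimal sketch of the memory-pool idea (my own, not from the talk): back many small logical allocations with one managed slab so they share pages instead of each consuming an aligned slot.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Bump-pointer pool over a single managed allocation. Many small objects
// share pages, avoiding per-allocation alignment and migration overhead.
struct ManagedPool {
    char *base = nullptr;
    size_t capacity = 0, offset = 0;
    explicit ManagedPool(size_t bytes) : capacity(bytes) {
        cudaMallocManaged(&base, bytes);
    }
    void *alloc(size_t bytes, size_t align = 16) {
        size_t p = (offset + align - 1) & ~(align - 1);  // round up
        if (p + bytes > capacity) return nullptr;        // pool exhausted
        offset = p + bytes;
        return base + p;
    }
    ~ManagedPool() { cudaFree(base); }
};
```

A production pool would also support freeing and thread safety; the point here is only that one cudaMallocManaged call serves many logical allocations.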
APPLICATIONS DEEP DIVE
High-Performance Geometric Multigrid (HPGMG):
Proxy for AMR and Low Mach Combustion codes; used in Top500 benchmarking.
High memory usage requirements.
http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/
The hybrid implementation requires very careful memory management: there is frequent data sharing when crossing the CPU-GPU threshold.
[Figure: multigrid V-cycle and F-cycle, with fine levels running on the GPU and levels below a threshold running on the CPU.]
Optimization: prefetch the next AMR level while running computations on the current level. We can use a separate non-blocking CUDA stream to overlap the prefetch with work in the default stream.
https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/
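A hedged sketch of that overlap (the level layout, kernel, and names are hypothetical, not HPGMG's actual code):

```cuda
#include <cuda_runtime.h>

// Hypothetical AMR level: a flat array of managed data.
struct Level { float *data; size_t bytes; };

// Stand-in for the real smoother; grid-stride so any launch size covers n.
__global__ void smooth(float *data, size_t n) {
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        data[i] *= 0.5f;
}

void process_levels(Level *levels, int nLevels, int gpuId) {
    cudaStream_t prefetchStream;
    // Non-blocking: does not synchronize with the legacy default stream,
    // so the prefetch truly overlaps the compute kernels.
    cudaStreamCreateWithFlags(&prefetchStream, cudaStreamNonBlocking);
    for (int l = 0; l < nLevels; l++) {
        // Start migrating the next level while the current one computes.
        if (l + 1 < nLevels)
            cudaMemPrefetchAsync(levels[l + 1].data, levels[l + 1].bytes,
                                 gpuId, prefetchStream);
        smooth<<<256, 256>>>(levels[l].data,
                             levels[l].bytes / sizeof(float));
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(prefetchStream);
}
```

The same pattern prefetches evicted levels back in before they are needed, hiding migration cost behind compute.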
[Chart: application throughput (MDOF/s, up to 200) vs application working set (1.4, 4.7, 8.6, 28.9, 58.6 GB) for: x86, K40, P100 (x86 PCIe), P100 + hints (x86 PCIe), P100 (P8 NVLINK), P100 + hints (P8 NVLINK). A marker shows the P100 memory size (16GB); annotations note where all 5 levels fit in GPU memory, where only 2 levels fit, and where only 1 level fits.]
x86 CPU: Intel E5-2630 v3, 2 sockets of 10 cores each with HT on (40 threads).
vDNN: Virtualized DNN for Scalable, Memory-Efficient Neural Network Design.
The original version implemented custom heuristics to prefetch and offload data; Unified Memory can automatically migrate memory as needed!
GPU: NVIDIA Quadro GP100; cuDNN 5.1, CUDA 9.
[Chart: time (ms, up to 30) for All in Memory, Offload Conv, Offload All, and Unified Memory. Left panel "Very Large Batches (VGG-16)": batch 128 (12GB), batch 256 (23GB), batch 512 (45GB). Right panel "Very Deep Networks (VGG-216)": batch 16 (10GB), batch 32 (19GB), batch 64 (36GB). Beyond the GP100 memory size (16GB), the manual All in Memory configuration fails, while Unified Memory keeps running.]
Unified Memory does not require any changes to the existing DNN code.
[Figure: top-down BFS animation over three steps; vertices are labeled with their BFS level (1, then 2, then 3) as the frontier expands.]
[Figure: BFS data structures: the current frontier, with either a shared visibility bitmap or a duplicated visibility bitmap.]
CPU: Intel Core i7-5930K @ 3.50GHz; GPU: NVIDIA Quadro GP100; edge factor 16, harmonic mean over 64 random sources.
[Chart: "single-GPU" top-down BFS on 2x GP100 with Unified Memory. Speed-up vs CPU (5 to 45x) against graph scale 2^N (N = 18 to 25) for: GPU shared PCIe, GPU duplicated PCIe, GPU shared NVLINK, GPU duplicated NVLINK.]
Consider using Unified Memory for any new application development:
Get your code running on the GPU much sooner!
Enjoy clean code and *virtually* no memory limits.
Increase productivity, explore and prototype new algorithms.
Use explicit data management only where you need it.