UNIFIED MEMORY ON PASCAL AND VOLTA
Nikolay Sakharnykh - May 10, 2017


SLIDE 1

Nikolay Sakharnykh - May 10, 2017

UNIFIED MEMORY ON PASCAL AND VOLTA

SLIDE 2

HETEROGENEOUS ARCHITECTURES

[Diagram: a CPU with SYS MEM connected to GPU 0, GPU 1, and GPU 2, each with its own MEM.]

SLIDE 3

UNIFIED MEMORY FUNDAMENTALS

Single Pointer

CPU code:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    cpu_func2(data, N);
    cpu_func3(data, N);
    free(data);

GPU code:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);

SLIDE 4

UNIFIED MEMORY FUNDAMENTALS

Single Pointer

Explicit Memory Management:

    void *h_data, *d_data;
    h_data = malloc(N);
    cudaMalloc(&d_data, N);
    cpu_func1(h_data, N);
    cudaMemcpy(d_data, h_data, N, ...);
    gpu_func2<<<...>>>(d_data, N);
    cudaMemcpy(h_data, d_data, N, ...);
    cpu_func3(h_data, N);
    free(h_data);
    cudaFree(d_data);

Unified Memory:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);

SLIDE 5

UNIFIED MEMORY FUNDAMENTALS

Deep Copy Nightmare

Explicit Memory Management:

    char **data = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++)
        data[i] = (char*)malloc(N);
    char **d_data;
    char **h_data2 = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++) {
        cudaMalloc(&h_data2[i], N);
        cudaMemcpy(h_data2[i], data[i], N, ...);
    }
    cudaMalloc(&d_data, N*sizeof(char*));
    cudaMemcpy(d_data, h_data2, N*sizeof(char*), ...);
    gpu_func<<<...>>>(d_data, N);

Unified Memory:

    char **data = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++)
        data[i] = (char*)malloc(N);
    gpu_func<<<...>>>(data, N);

SLIDE 6

UNIFIED MEMORY FUNDAMENTALS

On-Demand Migration

[Diagram: proc A and proc B with memory A and memory B; pages 1–3 start resident in memory A.]

SLIDE 7

UNIFIED MEMORY FUNDAMENTALS

On-Demand Migration

[Diagram: proc A writes *addr1 (local access); proc B writes *addr3 and takes a page fault.]

SLIDE 8

UNIFIED MEMORY FUNDAMENTALS

On-Demand Migration

[Diagram: after the fault on *addr3, the page is populated in memory B.]

SLIDE 9

UNIFIED MEMORY FUNDAMENTALS

On-Demand Migration

[Diagram: proc A writes *addr2 and proc B writes *addr3; both take page faults.]

SLIDE 10

UNIFIED MEMORY FUNDAMENTALS

On-Demand Migration

[Diagram: the page faults trigger page migrations between memory A and memory B.]

SLIDE 11

UNIFIED MEMORY FUNDAMENTALS

On-Demand Migration

[Diagram: after migration, *addr2 and *addr3 are local accesses for proc A and proc B.]

SLIDE 12

UNIFIED MEMORY FUNDAMENTALS

When it doesn’t matter how data moves to a processor:
1) Quick and dirty algorithm prototyping
2) Iterative processes with lots of data reuse, where migration cost can be amortized
3) Simplified application debugging

When it’s difficult to isolate the working set:
1) Irregular or dynamic data structures, unpredictable access patterns
2) Data partitioning between multiple processors

When Is This Helpful?

SLIDE 13

UNIFIED MEMORY FUNDAMENTALS

Memory Oversubscription

[Diagram: proc B writes *addr3 and takes a page fault, but memory B’s physical capacity is full.]

SLIDE 14

UNIFIED MEMORY FUNDAMENTALS

Memory Oversubscription

[Diagram: the driver evicts a page from memory B to make room.]

SLIDE 15

UNIFIED MEMORY FUNDAMENTALS

Memory Oversubscription

[Diagram: the faulted page then migrates into memory B.]

SLIDE 16

UNIFIED MEMORY FUNDAMENTALS

Memory Oversubscription

[Diagram: memory B is full again, now holding the newly migrated page.]

SLIDE 17

UNIFIED MEMORY FUNDAMENTALS

When you have a large dataset and not enough physical memory
Moving pieces by hand is error-prone and requires tuning for memory size
Better to run slowly than to fail with an out-of-memory error
You can actually get high performance with Unified Memory!

Memory Oversubscription Benefits

SLIDE 18

UNIFIED MEMORY FUNDAMENTALS

System-Wide Atomics with Exclusive Access

[Diagram: proc A performs atomicAdd_system(addr2, 1) as a local access; proc B’s atomicAdd_system(addr2, 1) takes a page fault.]

SLIDE 19

UNIFIED MEMORY FUNDAMENTALS

System-Wide Atomics with Exclusive Access

[Diagram: the fault migrates the page holding addr2 into memory B.]

SLIDE 20

UNIFIED MEMORY FUNDAMENTALS

System-Wide Atomics with Exclusive Access

[Diagram: proc B’s atomicAdd_system(addr2, 1) now completes as a local access.]

SLIDE 21

UNIFIED MEMORY FUNDAMENTALS

System-Wide Atomics over NVLINK*

[Diagram: over NVLINK, proc A’s atomicAdd_system(addr2, 1) is a remote access while proc B’s is local; no page migration is needed.]

*both processors need to support atomic operations

SLIDE 22

UNIFIED MEMORY FUNDAMENTALS

GPUs are very good at handling atomics from thousands of threads
It makes sense to use atomics between GPUs or between the CPU and GPU
We will see this in action in a realistic example later on

System-Wide Atomics

SLIDE 23

AGENDA

Unified Memory Fundamentals Under the Hood Details Performance Analysis and Optimizations Applications Deep Dive

SLIDE 24

UNIFIED MEMORY ALLOCATOR

CUDA C: cudaMallocManaged is your most reliable way to opt in today
CUDA Fortran: managed attribute (per allocation)
OpenACC: -ta=managed compiler option (all dynamic allocations)
malloc support is coming on Pascal+ architectures (Linux only)
Note: you can write your own malloc hook to use cudaMallocManaged

Available Options

SLIDE 25

HETEROGENEOUS MEMORY MANAGER

Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration will be triggered the same way as for cudaMallocManaged Ongoing testing and reviews, planning next phase of optimizations More details on HMM today at 4:00 in Room 211B by John Hubbard

Work In Progress

SLIDE 26

UNIFIED MEMORY

Evolution of GPU Architectures

Kepler (2012): first release of the new “single-pointer” programming model
Maxwell (2014): no new features related to Unified Memory
Pascal (2016, NVLINK1): on-demand migration, oversubscription, system-wide atomics
Volta (2017, NVLINK2): access counters, copy engine faults, cache coherence, ATS support

SLIDE 27

UNIFIED MEMORY ON KEPLER

Kepler GPU: no page fault support, limited virtual space

Available since CUDA 6

[Diagram: GPU and CPU, each with pages 1–3 in its own memory.]

SLIDE 28

UNIFIED MEMORY ON KEPLER

Bulk migration of all pages attached to current stream on kernel launch

Available since CUDA 6

[Diagram: a kernel launch triggers bulk migration of attached pages into GPU memory.]

SLIDE 29

UNIFIED MEMORY ON KEPLER

No on-demand migration for the GPU, no oversubscription, no system-wide atomics

Available since CUDA 6

[Diagram: GPU and CPU each access their resident pages locally.]

SLIDE 30

UNIFIED MEMORY ON PASCAL

Pascal GPU: page fault support, extended virtual address space (48-bit)

Available since CUDA 8

[Diagram: pages 1–3 distributed between memory A and memory B.]

SLIDE 31

UNIFIED MEMORY ON PASCAL

On-demand migration to accessing processor on first touch

Available since CUDA 8

[Diagram: a first-touch access page faults and migrates the page to the accessing processor.]

SLIDE 32

UNIFIED MEMORY ON PASCAL

All features: on-demand migration, oversubscription, system-wide atomics

Available since CUDA 8

[Diagram: after migration, accesses are local.]

SLIDE 33

UNIFIED MEMORY ON VOLTA

Volta GPU: uses fault on first touch for migration, same as Pascal

Default model

[Diagram: a GPU access page faults and migrates the page from CPU memory, as on Pascal.]

SLIDE 34

UNIFIED MEMORY ON VOLTA

If memory is mapped to the GPU, migration can be triggered by access counters

New Feature: Access Counters

[Diagram: the GPU accesses mapped CPU memory remotely while access counters track the accesses.]

SLIDE 35

UNIFIED MEMORY ON VOLTA

With access-counter-based migration, only hot pages will be moved to the GPU

New Feature: Access Counters

[Diagram: only the frequently accessed (hot) page migrates into GPU memory; cold pages stay remote.]

SLIDE 36

UNIFIED MEMORY ON VOLTA+P9

CPU can directly access and cache GPU memory; native CPU-GPU atomics

NVLINK2: Cache Coherence

[Diagram: over NVLINK2 the CPU accesses GPU memory remotely and the GPU accesses CPU memory remotely, all cache coherent.]

SLIDE 37

DRIVER HEURISTICS

The Unified Memory driver is doing intelligent things under the hood:
Prefetching: migrate pages proactively to reduce the number of faults
Thrashing mitigation: heuristics to avoid frequent migration of shared pages
Eviction: decide which pages to evict when we need to make room for new ones
You can’t control these directly, but you can override most of them with hints

Things You Didn’t Know Exist

SLIDE 38

DRIVER PREFETCHING

GPU architecture supports different page sizes Contiguous pages up to a larger page size are promoted to the larger size Driver prefetches whole regions if pages are accessed densely

Do Not Confuse with API-prefetching

SLIDE 39

Processors share the same page and frequently read or write to it
Pascal: once memory is pinned we lose any insight into the access pattern
Volta: access counter information can be used to find a better location

ANTI-THRASHING POLICY

Frequent Access to Shared Data

[Diagram: the driver first throttles the CPU, then pins the page to CPU memory.]

SLIDE 40

Driver keeps a single list of physical chunks of GPU memory Chunks from the front of the list are evicted first (LRU) A chunk is considered “in use” when it is fully-populated or migrated

EVICTION ALGORITHM

What Pages Are Moving Out of the GPU

[Diagram: chunks are evicted from the front of the list as new allocations and migrations to the GPU arrive at the back.]

SLIDE 41

AGENDA

Unified Memory Fundamentals Under the Hood Details Performance Analysis and Optimizations Applications Deep Dive

SLIDE 42

PROFILER: INSPECT

SLIDE 43

PROFILER: FILTER

SLIDE 44

PROFILER: CORRELATE

More details tomorrow at 10:00 in Marriott Salon 3

SLIDE 45

USER HINTS

If you know your application well you can optimize with hints
These are also useful to override some of the driver heuristics
cudaMemPrefetchAsync(ptr, size, processor, stream): similar to move_pages() in Linux
cudaMemAdvise(ptr, size, advice, processor): similar to madvise() in Linux

Why, When, and How to Use Them

SLIDE 46

USER HINTS

Prefetching

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);
    cudaMemPrefetchAsync(data, N, myGpuId, s);
    mykernel<<<..., s>>>(data, N);
    cudaMemPrefetchAsync(data, N, cudaCpuDeviceId, s);
    cudaStreamSynchronize(s);
    use_data(data, N);
    cudaFree(data);

Page faults can be expensive and they stall SM execution Avoid faults by prefetching data to the accessing processor

SLIDE 47

USER HINTS

Read Mostly

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);
    cudaMemAdvise(data, N, ..SetReadMostly, myGpuId);
    cudaMemPrefetchAsync(data, N, myGpuId, s);
    mykernel<<<..., s>>>(data, N);
    use_data(data, N);
    cudaFree(data);

In this case prefetch creates a copy instead of moving data Both processors can read data simultaneously without faults Writes are allowed but they are expensive

SLIDE 48

USER HINTS

Preferred Location

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);
    cudaMemAdvise(data, N, ..PreferredLocation, cudaCpuDeviceId);
    mykernel<<<..., s>>>(data, N);
    use_data(data, N);
    cudaFree(data);

Here the kernel will page fault and set up a direct mapping to the data on the CPU
The driver will “resist” migrating data away from the preferred location

SLIDE 49

USER HINTS

Accessed By

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);
    cudaMemAdvise(data, N, ..SetAccessedBy, myGpuId);
    mykernel<<<..., s>>>(data, N);
    use_data(data, N);
    cudaFree(data);

The GPU will establish a direct mapping to the data in CPU memory; no page faults will be generated
Memory can move freely to other processors and the mapping will carry over

SLIDE 50

USER HINTS

Accessed By on Volta

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);
    cudaMemAdvise(data, N, ..SetAccessedBy, myGpuId);
    mykernel<<<..., s>>>(data, N);
    use_data(data, N);
    cudaFree(data);

GPU will establish direct mapping of data in CPU memory, no page faults will be generated Access counters may eventually trigger migration of this memory to the GPU

SLIDE 51

PERFORMANCE

How long does a page fault take to serve? - We can measure!

Page Fault Cost

Linked list traversal with some large stride to avoid prefetching effects

Page fault cost (µs):

                         DtoH    HtoD
    x86 + PCIe + GP100     20      30
    P8 + NVLINK + GP100    20      20

SLIDE 52

[Charts: allocation throughput (GB/s) vs size, 128KB to 4GB. CPU memory: on-demand single/multi vs explicit single/multi. GPU memory: on-demand vs prefetch vs explicit.]

PERFORMANCE

Page Allocation Throughput

cudaMallocManaged/cudaMalloc + cudaMemset vs cudaMallocManaged/mmap + fill on the CPU
cudaMalloc uses preallocated memory for large sizes

SLIDE 53

PERFORMANCE

Page Migration Throughput (PCIe)

[Charts: page migration throughput over PCIe (GB/s) vs size, 128KB to 4GB. CPU to GPU: on-demand stream, on-demand warp-64k, prefetch, memcpy. GPU to CPU: on-demand single, on-demand multi, prefetch, memcpy.]

SLIDE 54

PERFORMANCE

Page Migration Throughput (2x NVLINK)

[Charts: page migration throughput over 2x NVLINK (GB/s) vs size, 128KB to 512MB. CPU to GPU: on-demand stream, on-demand warp-64k, prefetch, memcpy. GPU to CPU: on-demand single, on-demand multi, prefetch, memcpy.]

SLIDE 55

PERFORMANCE

cudaMallocManaged alignment: 512B on Pascal/Volta, 4KB on Kepler/Maxwell
Too many small allocations will use up many pages
cudaMallocManaged memory is moved at system page granularity, so for small allocations more data could be moved than necessary
Solution: use a caching allocator or memory pools

Page Granularity Overhead

SLIDE 56

AGENDA

Unified Memory Fundamentals Under the Hood Details Performance Analysis and Optimizations Applications Deep Dive

SLIDE 57

HPC: HPGMG

High-Performance Geometric Multigrid, a proxy for AMR and Low Mach combustion codes
Used in Top500 benchmarking
High memory usage requirements
http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/

Combustion Simulation

SLIDE 58

HPC: HPGMG

Hybrid implementation requires very careful memory management Frequent data sharing when crossing the CPU-GPU threshold

Taking Advantage of the CPU and the GPU

[Diagram: multigrid V-cycle and F-cycle; finer levels run on the GPU, levels below a threshold run on the CPU.]

SLIDE 59

HPGMG: AMR PROXY

Data Locality and Reuse of AMR Levels

Optimization: prefetch the next AMR level while running computations on the current level We can use a separate non-blocking CUDA stream to overlap with the default stream

https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/

SLIDE 60

AMR PROXY OVERSUBSCRIPTION

[Chart: application throughput (MDOF/s) vs application working set (1.4, 4.7, 8.6, 28.9, 58.6 GB) for x86, K40, P100 (x86 PCIe) with and without hints, and P100 (P8 NVLINK) with and without hints. P100 memory size: 16GB. Annotations: all 5 AMR levels fit in GPU memory, only 2 levels fit, only 1 level fits.]

x86 CPU: Intel E5-2630 v3, 2 sockets of 10 cores each with HT on (40 threads)

SLIDE 61

vDNN: Virtualized DNN for Scalable, Memory-Efficient Neural Network Design Original version implemented custom heuristics to prefetch and offload data Unified Memory can automatically migrate memory as needed!

DEEP LEARNING

SLIDE 62

DEEP LEARNING OVERSUBSCRIPTION

GPU: NVIDIA Quadro GP100; cuDNN 5.1, CUDA 9

[Charts: time (ms) for All in Memory, Offload Conv, Offload All, and Unified Memory. Very Large Batches (VGG-16): batch 128 (12GB), batch 256 (23GB), batch 512 (45GB). Very Deep Networks (VGG-216): batch 16 (10GB), batch 32 (19GB), batch 64 (36GB). GP100 memory size: 16GB. Manual offload fails in some configurations.]

Unified Memory does not require any changes to the existing DNN code

SLIDE 63

GRAPH ANALYTICS

BFS Traversal

[Diagram: BFS level 1: the source vertex, seen by both GPU A and GPU B.]

SLIDE 64

GRAPH ANALYTICS

BFS Traversal

[Diagram: BFS level 2: the frontier expands to the source’s neighbors on both GPUs.]

SLIDE 65

GRAPH ANALYTICS

BFS Traversal

[Diagram: BFS level 3: the traversal continues outward on both GPUs.]

SLIDE 66

GRAPH ANALYTICS

Shared vs Duplicated Visibility Vector

[Diagram: GPU A and GPU B; one shared visibility bitmap with the current frontier vs a duplicated visibility bitmap per GPU.]

SLIDE 67

GRAPH ANALYTICS

Software vs Hardware Atomics

CPU: Intel Core i7-5930K @ 3.50GHz; GPU: NVIDIA Quadro GP100; edgefactor 16, harmonic mean over 64 random sources

[Chart: speed-up vs CPU for a “single-GPU” top-down BFS on 2x GP100 with Unified Memory, graph scale 2^18 to 2^25. Series: shared vs duplicated visibility bitmap, over PCIe and over NVLINK.]

SLIDE 68

AGENDA

Unified Memory Fundamentals Under the Hood Details Performance Analysis and Optimizations Applications Deep Dive

SLIDE 69

CONCLUSIONS AND OUTLOOK

Consider using Unified Memory for any new application development
Get your code running on the GPU much sooner!
Enjoy clean code and *virtually* no memory limits
Increase productivity, explore and prototype new algorithms
Use explicit data management only where you need it

SLIDE 70