UNIFIED MEMORY ON PASCAL AND VOLTA
Nikolay Sakharnykh - May 10, 2017
HETEROGENEOUS ARCHITECTURES
[Figure: a CPU with its system memory (SYS MEM) connected to GPU 0, GPU 1, and GPU 2, each with its own local memory.]
UNIFIED MEMORY FUNDAMENTALS
Single pointer: the same allocation is usable from CPU code and GPU code.

CPU code:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    cpu_func2(data, N);
    cpu_func3(data, N);
    free(data);

GPU code:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);
Explicit memory management:

    void *h_data, *d_data;
    h_data = malloc(N);
    cudaMalloc(&d_data, N);
    cpu_func1(h_data, N);
    cudaMemcpy(d_data, h_data, N, ...);
    gpu_func2<<<...>>>(d_data, N);
    cudaMemcpy(h_data, d_data, N, ...);
    cpu_func3(h_data, N);
    free(h_data);
    cudaFree(d_data);

Unified Memory:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);
Explicit memory management (deep copy of a 2D array):

    char **data;
    data = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++)
        data[i] = (char*)malloc(N);

    char **d_data;
    char **h_data2 = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++) {
        cudaMalloc(&h_data2[i], N);
        cudaMemcpy(h_data2[i], data[i], N, ...);
    }
    cudaMalloc(&d_data, N*sizeof(char*));
    cudaMemcpy(d_data, h_data2, N*sizeof(char*), ...);
    gpu_func<<<...>>>(d_data, N);

Unified Memory:

    char **data;
    data = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++)
        data[i] = (char*)malloc(N);
    gpu_func<<<...>>>(data, N);
How page migration works: two processors (proc A, proc B), each with its own memory (memory A, memory B), share pages page1, page2, page3:
1) proc A writes *addr1: the page is resident in memory A, so it is a local access. proc B writes *addr3: the page is not populated in memory B, so the access page faults.
2) The fault on *addr3 populates the page in memory B and the write completes.
3) proc A writes *addr2 and proc B writes *addr3, each targeting a page resident in the other processor's memory: both accesses page fault.
4) The faults trigger page migrations to the faulting processor's memory.
5) After migration, both writes are local accesses.
When it doesn't matter how data moves to a processor:
1) Quick and dirty algorithm prototyping
2) Iterative processes with lots of data reuse, where the migration cost can be amortized
3) Simplified application debugging

When it's difficult to isolate the working set:
1) Irregular or dynamic data structures, unpredictable access
2) Data partitioning between multiple processors
Oversubscription, step by step:
1) A processor writes *addr3; the access page faults while the target memory's physical capacity is full.
2) The driver evicts a page to the other memory to make room.
3) The faulted page then migrates in and the access completes.
4) Physical memory is full again, now holding the actively used pages.
When you have a large dataset and not enough physical memory:
Moving pieces by hand is error-prone and requires tuning for the memory size.
Better to run slowly than to fail with an out-of-memory error.
You can actually get high performance with Unified Memory!
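As a hedged illustration of oversubscription (sizes and names are my own, not from the talk): allocate more managed memory than the GPU physically has and let the driver page it in and out on demand. On a 16GB GP100 this sketch touches a 20GB working set from a kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel that touches every element, faulting pages in on demand.
__global__ void scale(float *data, size_t n, float f) {
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        data[i] *= f;
}

int main() {
    // Hypothetical size: 5G floats = 20 GB, more than a 16GB GP100 holds.
    size_t n = (size_t)5 << 30;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (size_t i = 0; i < n; i++) data[i] = 1.0f;  // first touch on the CPU
    scale<<<1024, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();  // the driver migrates and evicts pages as needed
    printf("%f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

The same code would fail immediately with cudaMalloc; with managed memory it runs, paying migration cost instead.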
System-wide atomics, step by step:
1) proc A and proc B both execute atomicAdd_system(addr2, 1): the page is resident in memory A, so proc A's atomic is a local access and proc B's page faults.
2) The fault triggers migration of the page to proc B's memory.
3) proc B's atomic then completes as a local access.
4) Alternatively, an atomic to a non-resident page can be serviced as a remote access while the other processor's atomic stays local, with no migration.
*Both processors need to support atomic operations.
GPUs are very good at handling atomics from thousands of threads.
It makes sense to utilize atomics between GPUs or between the CPU and GPU.
We will see this in action on a realistic example later on.
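A minimal sketch of CPU and GPU atomically updating the same managed counter (kernel and variable names are my own; this requires Pascal or later with concurrentManagedAccess, and atomicAdd_system for system-wide scope):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpu_count(int *counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd_system(counter, 1);  // system-wide atomic scope
}

int main() {
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;
    gpu_count<<<4, 256>>>(counter, 1024);
    // The CPU increments the same counter while the kernel may still be
    // running; the compiler builtin gives an atomic RMW on the host side.
    for (int i = 0; i < 1024; i++) __sync_fetch_and_add(counter, 1);
    cudaDeviceSynchronize();
    printf("%d\n", *counter);  // 1024 GPU + 1024 CPU increments
    cudaFree(counter);
    return 0;
}
```

On Kepler/Maxwell, concurrent CPU access during kernel execution is not allowed, so this pattern is Pascal+ only.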
AGENDA
Unified Memory Fundamentals
Under the Hood Details
Performance Analysis and Optimizations
Applications Deep Dive

UNDER THE HOOD DETAILS
How to opt in:
CUDA C: cudaMallocManaged is your most reliable way to opt in today.
CUDA Fortran: the managed attribute (per allocation).
OpenACC: the -ta=managed compiler option (all dynamic allocations).
malloc support is coming on Pascal+ architectures (Linux only).
Note: you can write your own malloc hook to use cudaMallocManaged.
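One common way to implement that hook idea in C++ is to override allocation for selected types; this is a hedged sketch of the pattern (the `Managed` base class is a convention, not something prescribed by the talk):

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Any class deriving from Managed is allocated in Unified Memory,
// so the same object pointer is valid on both the CPU and the GPU.
class Managed {
public:
    void *operator new(size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        cudaDeviceSynchronize();  // ensure the allocation is usable everywhere
        return ptr;
    }
    void operator delete(void *ptr) {
        cudaDeviceSynchronize();  // ensure no kernel is still using it
        cudaFree(ptr);
    }
};

// Hypothetical example type: `new DataElem` now lives in managed memory.
struct DataElem : public Managed {
    int key;
    float value;
};
```

This gives "single-pointer" semantics for C++ objects without changing call sites beyond the base class.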
Heterogeneous Memory Management (HMM): a set of Linux kernel patches.
Allows GPUs to access all system memory (malloc, stack, file system).
Page migration is triggered the same way as for cudaMallocManaged.
Ongoing testing and reviews; planning the next phase of optimizations.
More details on HMM today at 4:00 in Room 211B by John Hubbard.
Unified Memory across GPU architectures:
2012 - Kepler: first release of the new "single-pointer" programming model
2014 - Maxwell: no new features related to Unified Memory
2016 - Pascal (NVLINK1): on-demand migration, system-wide atomics
2017 - Volta (NVLINK2): access counters, copy engine faults, cache coherence, ATS support
Kepler GPU: no page fault support, limited virtual address space.
Pages are moved in bulk: on kernel launch, all pages attached to the current stream are migrated to GPU memory.
No on-demand migration for the GPU, no oversubscription, no system-wide atomics; after the bulk migration, GPU and CPU accesses to their resident pages are local.
Pascal GPU: page fault support, extended virtual address space (48-bit).
On-demand migration to the accessing processor on first touch: an access to a non-resident page faults and triggers page migration.
All features are available: on-demand migration, oversubscription, system-wide atomics.
Volta GPU: uses fault on first touch for migration, same as Pascal.
If memory is mapped to the GPU, migration can instead be triggered by access counters: the GPU initially accesses the pages remotely, and with access-counter-based migration only the hot pages are moved to the GPU.
The CPU can directly access and cache GPU memory; native CPU-GPU atomics are supported.
The Unified Memory driver is doing intelligent things under the hood:
Prefetching: migrate pages proactively to reduce the number of faults.
Thrashing mitigation: heuristics to avoid frequent migration of shared pages.
Eviction: deciding which pages to evict when we need to make room for new ones.
You can't control these directly, but you can override most of them with hints.
Prefetching: the GPU architecture supports different page sizes.
Contiguous pages up to a larger page size are promoted to the larger size.
The driver prefetches whole regions if pages are accessed densely.
Thrashing: processors share the same page and frequently read or write to it.
Pascal: the driver can throttle the CPU or pin the page to the CPU, but once memory is pinned we lose any insight into the access pattern.
Volta: can use access counter information to find a better location.
Eviction: the driver keeps a single list of physical chunks of GPU memory.
Chunks from the front of the list are evicted first (LRU); allocations and migrations to the GPU consume chunks from the back.
A chunk is considered "in use" when it is fully populated or migrated.
PERFORMANCE ANALYSIS AND OPTIMIZATIONS
More details tomorrow at 10:00 in Marriott Salon 3
If you know your application well, you can optimize with hints; these are also useful to override some of the driver heuristics.
cudaMemPrefetchAsync(ptr, size, processor, stream): similar to move_pages() in Linux.
cudaMemAdvise(ptr, size, advice, processor): similar to madvise() in Linux.
Page faults can be expensive and they stall SM execution. Avoid faults by prefetching data to the accessing processor:

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);

    cudaMemPrefetchAsync(data, N, myGpuId, s);
    mykernel<<<..., s>>>(data, N);
    cudaMemPrefetchAsync(data, N, cudaCpuDeviceId, s);
    cudaStreamSynchronize(s);

    use_data(data, N);
    cudaFree(data);
cudaMemAdviseSetReadMostly: prefetch now creates a copy instead of moving the data. Both processors can read the data simultaneously without faults; writes are allowed but expensive.

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);

    cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
    cudaMemPrefetchAsync(data, N, myGpuId, s);
    mykernel<<<..., s>>>(data, N);

    use_data(data, N);
    cudaFree(data);
cudaMemAdviseSetPreferredLocation: here the kernel will page fault and establish a direct mapping to the data in CPU memory. The driver will "resist" migrating data away from the preferred location.

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);

    cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    mykernel<<<..., s>>>(data, N);

    use_data(data, N);
    cudaFree(data);
cudaMemAdviseSetAccessedBy: the GPU establishes a direct mapping to the data in CPU memory, and no page faults are generated. The memory can move freely to other processors and the mapping carries over.

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);

    cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, myGpuId);
    mykernel<<<..., s>>>(data, N);

    use_data(data, N);
    cudaFree(data);
On Volta, with the same cudaMemAdviseSetAccessedBy code, the GPU still establishes a direct mapping to data in CPU memory without page faults, but access counters may eventually trigger migration of this memory to the GPU.
How long does a page fault take to serve? We can measure: a linked-list traversal with a large stride to avoid prefetching effects.

    Page fault cost (us)        DtoH   HtoD
    x86 + PCIe + GP100           20     30
    P8 + NVLINK + GP100          20     20
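A hedged sketch of the kind of pointer chase behind such a measurement (the exact benchmark from the talk is not shown; the structure and names here are my own):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One node per 4KB so every hop lands on a different system page.
struct Node { Node *next; char pad[4096 - sizeof(Node*)]; };

__global__ void chase(Node *head, int hops, Node **out) {
    Node *p = head;
    for (int i = 0; i < hops; i++) p = p->next;  // each hop may page fault
    *out = p;  // keep the traversal from being optimized away
}

int main() {
    const int n = 1024;
    Node *nodes; Node **out;
    cudaMallocManaged(&nodes, n * sizeof(Node));
    cudaMallocManaged(&out, sizeof(Node*));
    // Link nodes with a large, co-prime stride so prefetching cannot help.
    for (int i = 0; i < n; i++) nodes[i].next = &nodes[(i + 37) % n];
    cudaEvent_t t0, t1; cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    chase<<<1, 1>>>(nodes, n, out);  // pages are CPU-resident: HtoD faults
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms; cudaEventElapsedTime(&ms, t0, t1);
    printf("%.1f us per hop\n", 1000.0f * ms / n);
    cudaFree(nodes); cudaFree(out);
    return 0;
}
```

Running it again after the pages have migrated gives the fault-free baseline to subtract.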
[Chart: memory initialization bandwidth (GB/s) vs allocation size (128KB to 4GB). Left panel "CPU memory" (up to 25 GB/s): explicit single vs explicit multi-threaded. Right panel "GPU memory" (log scale, up to 1000 GB/s): prefetch vs explicit.]
Setup: cudaMallocManaged/cudaMalloc + cudaMemset, and cudaMallocManaged/mmap + fill on the CPU. Note that cudaMalloc uses preallocated memory for large sizes.
[Chart: migration bandwidth (GB/s, axes up to 14) vs size (128KB to 4GB): "CPU to GPU" and "GPU to CPU" panels, comparing prefetch and memcpy.]
[Chart: migration bandwidth (GB/s, axes up to 25-30) vs size (128KB to 512MB): "CPU to GPU" and "GPU to CPU" panels, comparing prefetch and memcpy.]
cudaMallocManaged alignment: 512B on Pascal/Volta, 4KB on Kepler/Maxwell.
Too many small allocations will use up many pages.
cudaMallocManaged memory is moved at system page granularity, so for small allocations more data could be moved than necessary.
Solution: use a caching allocator or memory pools.
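A minimal sketch of the memory-pool idea (my own, not from the talk): back many small logical allocations with one managed slab so they share pages instead of each consuming an aligned slot.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Bump-pointer pool over a single managed allocation. Many small objects
// share pages, avoiding per-allocation alignment and migration overhead.
struct ManagedPool {
    char *base = nullptr;
    size_t capacity = 0, offset = 0;
    explicit ManagedPool(size_t bytes) : capacity(bytes) {
        cudaMallocManaged(&base, bytes);
    }
    void *alloc(size_t bytes, size_t align = 16) {
        size_t p = (offset + align - 1) & ~(align - 1);  // round up
        if (p + bytes > capacity) return nullptr;        // pool exhausted
        offset = p + bytes;
        return base + p;
    }
    ~ManagedPool() { cudaFree(base); }
};
```

A production pool would also support freeing and thread safety; the point here is only that one cudaMallocManaged call serves many logical allocations.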
APPLICATIONS DEEP DIVE
High-Performance Geometric Multigrid (HPGMG):
Proxy for AMR and Low Mach Combustion codes; used in Top500 benchmarking.
High memory usage requirements.
http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/
The hybrid implementation requires very careful memory management: there is frequent data sharing when crossing the CPU-GPU threshold.
[Figure: multigrid V-cycle and F-cycle, with fine levels running on the GPU and levels below a threshold running on the CPU.]
Optimization: prefetch the next AMR level while running computations on the current level. We can use a separate non-blocking CUDA stream to overlap the prefetch with work in the default stream.
https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/
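A hedged sketch of that overlap (the level layout, kernel, and names are hypothetical, not HPGMG's actual code):

```cuda
#include <cuda_runtime.h>

// Hypothetical AMR level: a flat array of managed data.
struct Level { float *data; size_t bytes; };

// Stand-in for the real smoother; grid-stride so any launch size covers n.
__global__ void smooth(float *data, size_t n) {
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        data[i] *= 0.5f;
}

void process_levels(Level *levels, int nLevels, int gpuId) {
    cudaStream_t prefetchStream;
    // Non-blocking: does not synchronize with the legacy default stream,
    // so the prefetch truly overlaps the compute kernels.
    cudaStreamCreateWithFlags(&prefetchStream, cudaStreamNonBlocking);
    for (int l = 0; l < nLevels; l++) {
        // Start migrating the next level while the current one computes.
        if (l + 1 < nLevels)
            cudaMemPrefetchAsync(levels[l + 1].data, levels[l + 1].bytes,
                                 gpuId, prefetchStream);
        smooth<<<256, 256>>>(levels[l].data,
                             levels[l].bytes / sizeof(float));
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(prefetchStream);
}
```

The same pattern prefetches evicted levels back in before they are needed, hiding migration cost behind compute.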
[Chart: application throughput (MDOF/s, up to 200) vs application working set (1.4, 4.7, 8.6, 28.9, 58.6 GB) for: x86, K40, P100 (x86 PCIe), P100 + hints (x86 PCIe), P100 (P8 NVLINK), P100 + hints (P8 NVLINK). A marker shows the P100 memory size (16GB); annotations note where all 5 levels fit in GPU memory, where only 2 levels fit, and where only 1 level fits.]
x86 CPU: Intel E5-2630 v3, 2 sockets of 10 cores each with HT on (40 threads).
vDNN: Virtualized DNN for Scalable, Memory-Efficient Neural Network Design.
The original version implemented custom heuristics to prefetch and offload data; Unified Memory can automatically migrate memory as needed!
GPU: NVIDIA Quadro GP100; cuDNN 5.1, CUDA 9.
[Chart: time (ms, up to 30) for All in Memory, Offload Conv, Offload All, and Unified Memory. Left panel "Very Large Batches (VGG-16)": batch 128 (12GB), batch 256 (23GB), batch 512 (45GB). Right panel "Very Deep Networks (VGG-216)": batch 16 (10GB), batch 32 (19GB), batch 64 (36GB). Beyond the GP100 memory size (16GB), the manual All in Memory configuration fails, while Unified Memory keeps running.]
Unified Memory does not require any changes to the existing DNN code.
[Figure: top-down BFS animation over three steps; vertices are labeled with their BFS level (1, then 2, then 3) as the frontier expands.]
[Figure: BFS data structures: the current frontier, with either a shared visibility bitmap or a duplicated visibility bitmap.]
CPU: Intel Core i7-5930K @ 3.50GHz; GPU: NVIDIA Quadro GP100; edge factor 16, harmonic mean over 64 random sources.
[Chart: "single-GPU" top-down BFS on 2x GP100 with Unified Memory. Speed-up vs CPU (5 to 45x) against graph scale 2^N (N = 18 to 25) for: GPU shared PCIe, GPU duplicated PCIe, GPU shared NVLINK, GPU duplicated NVLINK.]
Consider using Unified Memory for any new application development:
Get your code running on the GPU much sooner!
Enjoy clean code and *virtually* no memory limits.
Increase productivity, explore and prototype new algorithms.
Use explicit data management only where you need it.