UNIFIED MEMORY IN CUDA 6 (Mark Harris, NVIDIA)



SLIDE 1

UNIFIED MEMORY IN CUDA 6 MARK HARRIS

NVIDIA CONFIDENTIAL

SLIDE 2

Unified Memory

Dramatically Lower Developer Effort

[Diagram] Developer View Today: separate System Memory and GPU Memory pools.
Developer View With Unified Memory: a single Unified Memory pool.

SLIDE 3

Super Simplified Memory Management Code

CPU Code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);

        fread(data, 1, N, fp);

        qsort(data, N, 1, compare);

        use_data(data);

        free(data);
    }

CUDA 6 Code with Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);

        fread(data, 1, N, fp);

        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();

        use_data(data);

        cudaFree(data);
    }

SLIDE 4

Unified Memory Delivers

1. Simpler Programming & Memory Model
   • Single pointer to data, accessible anywhere
   • Tight language integration
   • Greatly simplifies code porting

2. Performance Through Data Locality
   • Migrate data to the accessing processor
   • Guarantee global coherency
   • Still allows cudaMemcpyAsync() hand tuning
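These points can be seen in a minimal sketch (the `increment` kernel name is an assumption for illustration; requires a CUDA 6+ toolkit and a GPU with unified memory support):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: the GPU dereferences the same pointer the CPU uses.
__global__ void increment(int *x) { *x += 1; }

int main() {
    int *x;
    cudaMallocManaged(&x, sizeof(int)); // single allocation, single pointer
    *x = 41;                            // CPU writes through the pointer
    increment<<<1, 1>>>(x);             // GPU reads/writes the same pointer
    cudaDeviceSynchronize();            // coherence point: CPU may touch x again
    printf("%d\n", *x);                 // prints 42
    cudaFree(x);
    return 0;
}
```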
SLIDE 5

Simpler Memory Model: Eliminate Deep Copies

[Diagram] dataElem (prop1, prop2, *text) in CPU Memory, with *text pointing
to the string "Hello World"; GPU Memory is empty.

    struct dataElem {
        int prop1;
        int prop2;
        char *text;
    };

SLIDE 6

Simpler Memory Model: Eliminate Deep Copies

[Diagram] dataElem (prop1, prop2, *text -> "Hello World") in CPU Memory must
be mirrored as a second dataElem and string in GPU Memory: two copies
required.

    struct dataElem {
        int prop1;
        int prop2;
        char *text;
    };

SLIDE 7

Simpler Memory Model: Eliminate Deep Copies

[Diagram] Two copies required: dataElem and its text string exist in both
CPU Memory and GPU Memory.

    void launch(dataElem *elem) {
        dataElem *g_elem;
        char *g_text;

        int textlen = strlen(elem->text);

        // Allocate storage for struct and text
        cudaMalloc(&g_elem, sizeof(dataElem));
        cudaMalloc(&g_text, textlen);

        // Copy up each piece separately, including
        // new "text" pointer value
        cudaMemcpy(g_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
        cudaMemcpy(g_text, elem->text, textlen, cudaMemcpyHostToDevice);
        cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text),
                   cudaMemcpyHostToDevice);

        // Finally we can launch our kernel, but
        // CPU & GPU use different copies of "elem"
        kernel<<< ... >>>(g_elem);
    }

SLIDE 8

Simpler Memory Model: Eliminate Deep Copies

[Diagram] With Unified Memory, a single dataElem (prop1, prop2, *text ->
"Hello World") is shared by CPU and GPU.

    void launch(dataElem *elem) {
        kernel<<< ... >>>(elem);
    }

SLIDE 9

Simpler Memory Model

Example: GPU & CPU Shared Linked Lists

[Diagram] A linked list of nodes (key, data, next) shared between CPU Memory
and GPU Memory.

SLIDE 10

Simpler Memory Model

Example: GPU & CPU Shared Linked Lists

The only practical option is to use zero-copy (pinned system) memory:

  • GPU accesses at PCIe bandwidth
  • GPU accesses at very high latency

[Diagram] All nodes (key, data, next) stay in CPU Memory; all GPU data
access goes over PCIe.
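A hedged sketch of this zero-copy approach (the helper name is illustrative, not from the slide): nodes live in pinned, GPU-mapped host memory, so the GPU can dereference them, but every access crosses PCIe.

```cuda
#include <cuda_runtime.h>

struct Node { int key; int data; Node *next; };

// Allocate one list node in pinned, GPU-mapped host memory.
Node *alloc_node_zero_copy() {
    Node *n;
    cudaHostAlloc(&n, sizeof(Node), cudaHostAllocMapped);
    return n; // with unified virtual addressing this pointer is GPU-visible,
              // but every GPU dereference is a PCIe transaction
}
```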

SLIDE 11

Simpler Memory Model

Example: GPU & CPU Shared Linked Lists

  • Can pass list elements between Host & Device
  • Can insert and delete elements from Host or Device*
  • Single list - no complex synchronization

[Diagram] All nodes (key, data, next) live in Unified Memory; both the CPU
and the GPU get local data access.

*Program must still ensure no race conditions.
*Data is coherent between CPU & GPU at kernel launch & sync only.
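A sketch of the unified-memory version (helper and kernel names are assumptions, not from the slide); the footnotes still apply: the program must avoid races, and coherence holds only at kernel launch and synchronization.

```cuda
#include <cuda_runtime.h>

struct Node { int key; int data; Node *next; };

// Prepend a node; the node lives in unified memory, so the same
// pointer chain is walkable from host code and device code.
Node *push_front(Node *head, int key, int data) {
    Node *n;
    cudaMallocManaged(&n, sizeof(Node));
    n->key = key;
    n->data = data;
    n->next = head;
    return n;
}

// Device-side traversal of the very same list.
__global__ void visit(Node *head) {
    for (Node *n = head; n != nullptr; n = n->next) {
        // use n->key, n->data
    }
}
```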

SLIDE 12

Unified Memory with C++

    // Ideal C++ version of class
    class dataElem {
        int prop1;
        int prop2;
        String text;
    };

[Diagram] dataElem (prop1, prop2, text -> "Hello World") lives in CPU
Memory; GPU Memory is empty.

    kernel<<< ... >>>(data);

    __global__ void kernel(dataElem data) { }

Host/Device C++ integration has been difficult in CUDA:

  • Cannot construct a GPU class from the CPU
  • References fail because there are no deep copies

SLIDE 13

Unified Memory with C++

    // Ideal C++ version of class
    class dataElem {
        int prop1;
        int prop2;
        String text;
    };

[Diagram] dataElem (prop1, prop2, text -> "Hello World") lives in CPU
Memory; GPU Memory is empty.

    kernel<<< ... >>>(data);

    __global__ void kernel(dataElem data) { }

Pass-by-value uses the copy constructor, but the CPU cannot construct the
object on the GPU.

Host/Device C++ integration has been difficult in CUDA:

  • Cannot construct a GPU class from the CPU
  • References fail because there are no deep copies

SLIDE 14

C++ objects migrate easily when allocated on managed heap

Overload new operator* to use C++ in unified memory region

Unified Memory with C++

    class Managed {
    public:
        void *operator new(size_t len) {
            void *ptr;
            cudaMallocManaged(&ptr, len);
            return ptr;
        }

        void operator delete(void *ptr) {
            cudaFree(ptr);
        }
    };

* (or use placement-new)
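The placement-new alternative mentioned in the footnote might look like this sketch (the `make_managed` helper is an assumption, not part of the slide):

```cuda
#include <new>
#include <cuda_runtime.h>

// Allocate unified memory explicitly, then construct T in place,
// so the class itself does not need an overloaded operator new.
template <typename T>
T *make_managed() {
    void *ptr;
    cudaMallocManaged(&ptr, sizeof(T));
    return new (ptr) T; // placement-new into the managed allocation
}
```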

SLIDE 15

Unified Memory with C++

Pass-by-reference enabled with new overload

NOTE: CPU/GPU class sharing is restricted to POD-classes only (i.e. no virtual functions)

    // Deriving from "Managed" allows pass-by-reference
    class String : public Managed {
        int length;
        char *data;
    };
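For the GPU to read the characters, the `data` buffer must live in unified memory as well; a hedged sketch of a constructor doing that (assumed here, not shown on the slide, building on the `Managed` base from Slide 14):

```cuda
#include <cstring>
#include <cuda_runtime.h>

class String : public Managed {
public:
    int length;
    char *data;

    // Put the character buffer in unified memory too, so device
    // code dereferencing data sees the same bytes as the host.
    String(const char *s) {
        length = (int)strlen(s) + 1;
        cudaMallocManaged(&data, length);
        memcpy(data, s, length);
    }
};
```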

SLIDE 16

Unified Memory with C++

Pass-by-value enabled by managed memory copy constructors

NOTE: CPU/GPU class sharing is restricted to POD-classes only (i.e. no virtual functions)

    // Deriving from "Managed" allows pass-by-reference
    class String : public Managed {
        int length;
        char *data;

        // Unified memory copy constructor allows pass-by-value
        String(const String &s) {
            length = s.length;
            cudaMallocManaged(&data, length);
            memcpy(data, s.data, length);
        }
    };

SLIDE 17

Unified Memory with C++

Combination of C++ and Unified Memory is very powerful

  • Concise and explicit: let C++ handle deep copies
  • Pass by-value or by-reference without memcpy shenanigans

    // Note "managed" on this class, too.
    // C++ now handles our deep copies
    class dataElem : public Managed {
        int prop1;
        int prop2;
        String text;
    };

[Diagram] dataElem (prop1, prop2, text -> "Hello World") lives in Unified
Memory, visible to both the CPU Program and the GPU Program.

    dataElem *data = new dataElem;

SLIDE 18

C++ Pass By Reference

Single pointer to data makes object references just work.

    kernel<<< ... >>>(data);

    __global__ void kernel_by_ref(dataElem &data) { }

[Diagram] A single dataElem (prop1, prop2, text -> "Hello World") in Unified
Memory; the reference on the GPU points to the same object the CPU uses.
SLIDE 19

C++ Pass By Value

Copy constructors from the CPU create GPU-usable objects.

    kernel<<< ... >>>(data);

    __global__ void kernel_by_val(dataElem data) { }

[Diagram] Two dataElem objects, each pointing to its own "Hello World": the
by-value copy is created in managed memory by the copy constructor.

SLIDE 20

Unified Memory Roadmap

CUDA 6: Ease of Use
  • Single Pointer to Data
  • No Memcopy Required
  • Coherence @ launch & sync
  • Shared C/C++ Data Structures

Next: Optimizations
  • Prefetching
  • Migration Hints
  • Additional OS Support

Maxwell
  • System Allocator Unified
  • Stack Memory Unified
  • HW-Accelerated Coherence
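For context (an addition, not from the deck): the prefetching and migration hints on this roadmap later shipped in CUDA 8 as cudaMemPrefetchAsync and cudaMemAdvise; a sketch of their use:

```cuda
#include <cuda_runtime.h>

// Hint that the GPU will mostly read this range, then migrate it
// ahead of the kernel that consumes it.
void warm_up(float *data, size_t bytes, int device, cudaStream_t stream) {
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);
    cudaMemPrefetchAsync(data, bytes, device, stream);
}
```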

SLIDE 21

CUDA 6

  1. Unified Memory
  2. XT and Drop-in Libraries
  3. GPUDirect RDMA in MPI
  4. Developer Tools

SLIDE 22

Unified Memory Dramatically Simplifies Parallel Programming.

More on the Parallel Forall blog:
http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

Sign up for the CUDA Registered Developer Program:
https://developer.nvidia.com/cuda-toolkit