EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY Nikolay - PowerPoint PPT Presentation

EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY Nikolay Sakharnykh, 3/27/2018

SINGLE POINTER CPU vs GPU CPU code GPU code w/ Unified Memory void *data; void *data; data = malloc(N); data = malloc(N); cpu_func1(data, N); cpu_func1(data, N); cpu_func2(data, N); gpu_func2<<<...>>>(data, N); cudaDeviceSynchronize(); cpu_func3(data, N); cpu_func3(data, N); free(data); free(data); 2

SINGLE POINTER Explicit vs Unified Memory Explicit Memory Management GPU code w/ Unified Memory void *data, *d_data; void *data; data = malloc(N); data = malloc(N); cudaMalloc(&d_data, N); cpu_func1(data, N); cpu_func1(data, N); cudaMemcpy(d_data, data, N, ...) gpu_func2<<<...>>>(d_data, N); gpu_func2<<<...>>>(data, N); cudaMemcpy(data, d_data, N, ...) cudaDeviceSynchronize(); cudaFree(d_data); cpu_func3(data, N); cpu_func3(data, N); free(data); free(data); 3

SINGLE POINTER Full Control with Prefetching Explicit Memory Management Unified Memory + Prefetching void *data, *d_data; void *data; data = malloc(N); data = malloc(N); cudaMalloc(&d_data, N); cpu_func1(data, N); cpu_func1(data, N); cudaMemcpy(d_data, data, N, ...) cudaMemPrefetchAsync(data, N, GPU) gpu_func2<<<...>>>(d_data, N); gpu_func2<<<...>>>(data, N); cudaMemcpy(data, d_data, N, ...) cudaMemPrefetchAsync(data, N, CPU) cudaFree(d_data); cudaDeviceSynchronize(); cpu_func3(data, N); cpu_func3(data, N); free(data); free(data); 4

SINGLE POINTER Deep Copy Explicit Memory Management GPU code w/ Unified Memory char **data; char **data; // allocate and initialize data on the CPU // allocate and initialize data on the CPU char **d_data; char **h_data = (char**)malloc(N*sizeof(char*)); for (int i = 0; i < N; i++) { cudaMalloc(&h_data[i], N); cudaMemcpy(h_data[i], data[i], N, ...); } cudaMalloc(&d_data, N*sizeof(char*)); cudaMemcpy(d_data, h_data, N*sizeof(char*), ...); gpu_func<<<...>>>(d_data, N); gpu_func<<<...>>>(data, N); 5

UNIFIED MEMORY BASICS GPU A GPU B page1 page2 Single virtual memory shared between processors 6

UNIFIED MEMORY BASICS A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem GPU A GPU B page2 page2 page1 page2 Single virtual memory shared between processors 7

UNIFIED MEMORY BASICS A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem *addr1 = 1 local access page2 page2 *addr3 = 1 page fault page3 page3 8

UNIFIED MEMORY BASICS A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem page2 page2 *addr3 = 1 access replay page3 page3 page3 populated and mapped into B’s memory 9

UNIFIED MEMORY BASICS A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem page2 page2 *addr2 = 1 page fault page3 page3 *addr3 = 1 page fault 10

UNIFIED MEMORY BASICS A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem page2 page2 *addr2 = 1 page fault page3 page3 *addr3 = 1 page fault page2 and page3 unmapped from B’s memory 11

UNIFIED MEMORY BASICS A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem page2 page2 *addr2 = 1 page fault page3 page3 *addr3 = 1 page fault pages data migrated to A’s physical memory 12

UNIFIED MEMORY BASICS A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem page2 page2 *addr2 = 1 access replay page3 page3 *addr3 = 1 access replay 13

MEMORY OVERSUBSCRIPTION A’s page table B’s page table page1 page1 page1 page1 A’s phys mem B’s phys mem page3 page3 page4 page4 page5 page5 *addr5 = 1 page fault 14

MEMORY OVERSUBSCRIPTION A’s page table B’s page table page1 page1 page1 page1 A’s phys mem B’s phys mem page3 page3 page4 page4 page5 page5 *addr5 = 1 page fault page4 unmapped from A’s memory and migrated 15

MEMORY OVERSUBSCRIPTION A’s page table B’s page table page1 page1 page1 page1 A’s phys mem B’s phys mem page3 page3 page4 page4 page5 page5 *addr5 = 1 page fault page4 mapped in B’s memory, page5 unmapped and migrated to A 16

MEMORY OVERSUBSCRIPTION A’s page table B’s page table page1 page1 page1 page1 A’s phys mem B’s phys mem page3 page3 page4 page4 page5 page5 *addr5 = 1 access replay 17

SIMPLIFYING DL FRAMEWORK DESIGN class ConvolutionLayer { Existing Design public: Eliminated 3,000 lines of repetitive and void cpu_data() error-prone code in Caffe void cpu_diff() void gpu_data() void gpu_diff() Developers can add new inherited Layer void mutable_cpu_data() class ConvolutionLayer void mutable_cpu_diff() classes in a much simpler manner { void mutable_gpu_data() public: void mutable_gpu_diff() void data() The final call to a CPU function or a GPU void diff() void Forward_cpu() void Forward_gpu() kernel (caffe_gpu_gemm) still need to be void mutable_data() void forward_cpu_gemm() Unified void mutable_diff() explicit void forward_gpu_gemm() Memory void forward_cpu_bias() void Forward() void forward_gpu_bias() Design void forward_gemm() void forward_bias() void Backward_cpu() void Backward_gpu() void Backward() void backward_cpu_gemm() void backward_gemm() void backward_gpu_gemm() void backward_bias() void backward_cpu_bias() } void backward_gpu_bias() } A. A. Awan, C- H Chu, H. Subramoni, X. Lu, D.K. Panda, “ OC-DNN: Designing Out-of-Core Deep Neural Network Training by Exploiting Unified Memory on 18 Pascal and Volta GPUs ”, <double -blind submission under review>

CAN THIS DESIGN OFFER GOOD PERF? DL training with Unified Memory VGG19 training on 1xV100 (16GB) ResNet-50 training on 1xV100 (16GB) Caffe (BVLC) OC-Caffe Caffe (BVLC) OC-Caffe 160 160 140 140 120 120 Images/sec Images/sec 100 100 in-memory out-of-core in-memory out-of-core 80 80 60 60 40 40 20 20 0 0 Batch 110 Batch 120 Batch 40 Batch 45 19 OC-Caffe will be released by the HiDL Team@OSU: hidl.cse.ohio-state.edu, mvapich.cse.ohio-state.edu

CONCURRENT ACCESS A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem page2 page2 page3 page3 20

CONCURRENT ACCESS Exclusive Access* A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem atomicAdd_system atomicAdd_system (addr2, 1) (addr2, 1) page fault local access page2 page2 page3 page3 21 *this is a possible implementation and to guarantee this behavior you need to use cudaMemAdvise policies

CONCURRENT ACCESS Exclusive Access A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem atomicAdd_system (addr2, 1) page fault page2 page2 page3 page3 page2 unmapped in B’s memory and migrated to A 22

CONCURRENT ACCESS Exclusive Access A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem atomicAdd_system (addr2, 1) local access page2 page2 page3 page3 23

CONCURRENT ACCESS Atomics over NVLINK* A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem atomicAdd_system atomicAdd_system (addr2, 1) (addr2, 1) remote access local access page2 page2 page3 page3 24 *both processors need to support atomic operations

CONCURRENT ACCESS Read duplication* A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem val = *addr2; val = *addr2; local access local access page2 page2 page3 page3 25 *each processor must maintain its own page table

CONCURRENT ACCESS Read duplication: write A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem *addr2 = val2; local access page2 page2 page3 page3 a write will collapse all copies into one 26

CONCURRENT ACCESS Read duplication: read after write A’s page table B’s page table page1 page1 A’s phys mem B’s phys mem val = *addr2; local access page2 page2 page3 page3 pages are duplicated again on faults 27

ANALYTICS USE CASE Design of a Concurrent Hybrid Hash Table Multiple CPU Cores Concurrent Inserts and Fetches Hash table implemented via Unified Memory Non-blocking updates using atomic compare&swap Concurrent Fetches Concurrent Fetches GPU 0 GPU 1 28 S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0 – Thu 11:00 Room 210F

ANALYTICS USE CASE Concurrent Access To Hash Table SM SM SM L2 HBM SYSMEM page fault x86 29

ANALYTICS USE CASE Concurrent Access To Hash Table SM SM SM L2 HBM page migration SYSMEM access replay page fault x86 30

ANALYTICS USE CASE Concurrent Access To Hash Table SM SM SM SM SM SM L2 L2 HBM HBM page migration P9 can directly update hash entry in GPU memory SYSMEM cache access replay page fault no page faults or P9 migrations! x86 31 S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0 – Thu 11:00 Room 210F

UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES GPU GPU GPU GPU GPU GPU GPU GPU 0 1 2 3 4 5 6 7 Single memory view shared by all GPUs 512 GB Unified Memory Automatic migration of data between GPUs GPU GPU GPU GPU GPU GPU GPU GPU 8 9 10 11 12 13 14 15 User control of data locality 32

EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY Nikolay - PowerPoint PPT Presentation

EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY Nikolay Sakharnykh, 3/27/2018 SINGLE POINTER CPU vs GPU CPU code GPU code w/ Unified Memory void data; void data; data = malloc(N); data = malloc(N); cpu_func1(data, N); cpu_func1(data,

UNIFIED MEMORY IN CUDA 6 MARK HARRIS NVIDIA CONFIDENTIAL Unified Memory Dramatically Lower

What You Dont Know What You Dont Know What You Dont Know What You Dont Know That

Everything you need to know Everything you need to know about Writing and Using about Writing

Things you can do Things you can do Things you can do Everything you need to know

WELCOME! You need to know what you know, and know what you dont know. Then work on your areas

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Basics of Unified Sports Ways to get involved with Unified Sports in Ohio Ohio 1 What are

SARVAM UCS Unified Communication Server Unified Communication Server for Modern Enterprises

(11-14) How much do you know about the internet? Make sure you stay SAFE AND SECURE ONLINE YOU

Costas Efthimiou Theory of Everything From Volume 8: Relativity Cosmos Romeo Niram If you know

1 Memory SoC Persistent Memory-Driven Memory Memory Processor-Centric Memory SoC SoC

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Forget everything you knew about Swift Rings (here's everything you need to know about Rings)

CSE 351: Section 10 Memory Allocation Memory Allocation Must allocate any memory you need to

Everything * you need to know about the psychology of perception to make effective

Financial Relief and Support Everything You Need to Know as of April 9, 2020 Daily Webinar |

Virtual Field TRips Frantz Lucien, Intrepid Sea, Air, and Space Museum Charissa Ruth,

Q3 & 9M FY20 Earnings Presentation Feb ebruary 20 2020 20 1 Di Disclaimer This

14 th August 2014 Introduction Cowpeas (Kunde) grown for seeds, leaves and pods. Av.

39 th Annual Florida Educational Negotiators Conference May 2019 Gregg Morton, Deputy

Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng,

Devices with nesCheck Daniele MIDI, Mathias PAYER, Elisa BERTINO AsiaCCS 2017 Purdue due Univer

CUDA 6.0 Unified Virtual Memory Juraj Kardo (University of Lugano) July 9, 2014 Juraj Kardo

Early Literacy Development Kathleen Whitbread, Ph.D., University of Saint Joseph Myths You

EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY Nikolay - PowerPoint PPT Presentation

EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY Nikolay Sakharnykh, 3/27/2018 SINGLE POINTER CPU vs GPU CPU code GPU code w/ Unified Memory void *data; void *data; data = malloc(N); data = malloc(N); cpu_func1(data, N); cpu_func1(data,

UNIFIED MEMORY IN CUDA 6 MARK HARRIS NVIDIA CONFIDENTIAL Unified Memory Dramatically Lower

What You Dont Know What You Dont Know What You Dont Know What You Dont Know That

Everything you need to know Everything you need to know about Writing and Using about Writing

Things you can do Things you can do Things you can do Everything you need to know

WELCOME! You need to know what you know, and know what you dont know. Then work on your areas

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Basics of Unified Sports Ways to get involved with Unified Sports in Ohio Ohio 1 What are

SARVAM UCS Unified Communication Server Unified Communication Server for Modern Enterprises

(11-14) How much do you know about the internet? Make sure you stay SAFE AND SECURE ONLINE YOU

Costas Efthimiou Theory of Everything From Volume 8: Relativity Cosmos Romeo Niram If you know

1 Memory SoC Persistent Memory-Driven Memory Memory Processor-Centric Memory SoC SoC

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Forget everything you knew about Swift Rings (here's everything you need to know about Rings)

CSE 351: Section 10 Memory Allocation Memory Allocation Must allocate any memory you need to

Everything * you need to know about the psychology of perception to make effective

Financial Relief and Support Everything You Need to Know as of April 9, 2020 Daily Webinar |

Virtual Field TRips Frantz Lucien, Intrepid Sea, Air, and Space Museum Charissa Ruth,

Q3 &amp; 9M FY20 Earnings Presentation Feb ebruary 20 2020 20 1 Di Disclaimer This

14 th August 2014 Introduction Cowpeas (Kunde) grown for seeds, leaves and pods. Av.

39 th Annual Florida Educational Negotiators Conference May 2019 Gregg Morton, Deputy

Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng,

Devices with nesCheck Daniele MIDI, Mathias PAYER, Elisa BERTINO AsiaCCS 2017 Purdue due Univer

CUDA 6.0 Unified Virtual Memory Juraj Kardo (University of Lugano) July 9, 2014 Juraj Kardo

Early Literacy Development Kathleen Whitbread, Ph.D., University of Saint Joseph Myths You

EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY Nikolay Sakharnykh, 3/27/2018 SINGLE POINTER CPU vs GPU CPU code GPU code w/ Unified Memory void data; void data; data = malloc(N); data = malloc(N); cpu_func1(data, N); cpu_func1(data,

Q3 & 9M FY20 Earnings Presentation Feb ebruary 20 2020 20 1 Di Disclaimer This