Memory Management on Modern GPU Architectures




  1. MEMORY MANAGEMENT ON MODERN GPU ARCHITECTURES Nikolay Sakharnykh, Tue Mar 19, 3:00 PM

  2. HOW DO WE ALLOCATE MEMORY IN CUDA? cudaMallocHost, cudaHostRegister, cudaMalloc, cudaMallocManaged, cudaMalloc3D, cudaMallocArray

  3. HOW DO WE ALLOCATE MEMORY IN CUDA?
  • cudaMallocHost / cudaHostRegister: accessible by CPU & GPU; pinned to the CPU memory node
  • cudaMalloc: accessible by GPU only; pinned to a single GPU
  • cudaMallocManaged: accessible by CPU & GPU; can "migrate"
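To make the comparison concrete, here is a minimal sketch of my own (not from the deck; the sizes and the fill kernel are illustrative) exercising the three allocation flavors:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void fill(int *p, int n) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) p[i] = i;
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(int);

        int *pinned, *device, *managed;
        cudaMallocHost(&pinned, bytes);     // pinned to the CPU memory node; CPU & GPU accessible
        cudaMalloc(&device, bytes);         // pinned to a single GPU; GPU-only
        cudaMallocManaged(&managed, bytes); // CPU & GPU accessible; pages can migrate

        fill<<<(n + 255) / 256, 256>>>(device, n);
        fill<<<(n + 255) / 256, 256>>>(managed, n);
        cudaDeviceSynchronize();

        pinned[0] = 42;                           // CPU can always touch pinned memory
        printf("managed[7] = %d\n", managed[7]);  // legal after sync: pages migrate back to the CPU

        cudaFreeHost(pinned);
        cudaFree(device);
        cudaFree(managed);
        return 0;
    }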

  4. AGENDA. Key principles • Performance tuning • Multi-GPU systems • Summit & Sierra • OS integration. (*Some of this behavior may change in the future.)

  5. UNIFIED MEMORY BASICS. [Diagram: process P's pages (page1, page2, page3, …) span CPU 0's, CPU 1's, and GPU A's, B's, and C's memories.] A single virtual memory is shared between all computing processors.

  6. UNIFIED MEMORY BASICS. [Diagram: GPU A and GPU B each hold their own page table (page1, page2, page3, …) mapping pages into their own physical memory.]

  7. EXAMPLE: LOCAL ACCESS. GPU A executes *addr1 = 1. page1 is mapped in A's page table and backed by A's physical memory, so this is a local access.

  8. EXAMPLE: POPULATE. GPU A executes *addr3 = 1. page3 is not backed by physical memory yet, so the access triggers a page fault.

  9. EXAMPLE: POPULATE. To service the fault, memory is allocated for page3's data.

  10. EXAMPLE: POPULATE. page3 is populated and mapped into the new location.

  11. EXAMPLE: POPULATE. The faulting access is replayed and now succeeds.
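The populate sequence above is exactly what a first touch from the GPU triggers. A minimal sketch of my own (one visible GPU assumed; the size is illustrative):

    #include <cuda_runtime.h>

    __global__ void first_touch(char *p, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) p[i] = 1;   // first write faults; a page is allocated, populated, then the access replays
    }

    int main() {
        char *p;
        size_t n = 64 << 20;        // 64 MB
        cudaMallocManaged(&p, n);   // reserves the virtual range; no physical pages yet (Pascal+, Linux)
        first_touch<<<(unsigned)((n + 255) / 256), 256>>>(p, n);
        cudaDeviceSynchronize();
        cudaFree(p);
        return 0;
    }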

  12. EXAMPLE: MIGRATE. GPU A executes *addr2 = 1 and *addr3 = 1. Both pages are resident in B's physical memory, so both accesses page-fault on A.

  13. EXAMPLE: MIGRATE. page2 and page3 are unmapped from B's memory.

  14. EXAMPLE: MIGRATE. The pages' data is copied from B to A.

  15. EXAMPLE: MIGRATE. page2 and page3 are mapped into A's memory.

  16. EXAMPLE: MIGRATE. The faulting accesses are replayed and now succeed.
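The same migrate walkthrough as a code sketch of my own: here "B" is the CPU, where the pages first land, and "A" is the GPU that faults them over:

    #include <cstring>
    #include <cuda_runtime.h>

    __global__ void inc(char *p, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1;   // faults: pages are unmapped from the CPU, copied, and mapped on the GPU
    }

    int main() {
        char *p;
        size_t n = 32 << 20;
        cudaMallocManaged(&p, n);
        memset(p, 0, n);                                  // CPU touch: pages become resident in CPU memory
        inc<<<(unsigned)((n + 255) / 256), 256>>>(p, n);  // GPU touch: fault-driven migration to the GPU
        cudaDeviceSynchronize();
        char first = p[0];                                // CPU touch again: pages migrate back on demand
        (void)first;
        cudaFree(p);
        return 0;
    }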

  17. EXAMPLE: OVERSUBSCRIBE. GPU A's physical memory is completely full: GPU memory is FULL.

  18. EXAMPLE: OVERSUBSCRIBE. GPU A executes *addr6 = 1. page6 page-faults, but A has no free physical memory left.

  19. EXAMPLE: OVERSUBSCRIBE. To make room, page3 is unmapped from A's memory.

  20. EXAMPLE: OVERSUBSCRIBE. page3's data is copied to B's memory (eviction).

  21. EXAMPLE: OVERSUBSCRIBE. page3 is now mapped in B's memory, freeing space on A.

  22. EXAMPLE: OVERSUBSCRIBE. page6 is populated in the freed space on A.

  23. EXAMPLE: OVERSUBSCRIBE. The faulting access is replayed and now succeeds. *B cannot be a GPU in this case: evicted pages go to CPU memory.
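Oversubscription needs no special API: a managed allocation may simply exceed GPU memory, and eviction to CPU memory happens as in the sequence above. A sketch of my own (assumes a fault-capable Pascal-or-later GPU and enough system memory):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch_pages(char *p, size_t n, size_t stride) {
        size_t i = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) * stride;
        if (i < n) p[i] = 1;   // touching more pages than fit forces eviction to CPU memory
    }

    int main() {
        size_t free_b, total_b;
        cudaMemGetInfo(&free_b, &total_b);

        size_t n = total_b + total_b / 2;   // 1.5x the GPU's physical memory
        char *p;
        if (cudaMallocManaged(&p, n) != cudaSuccess) return 1;

        const size_t stride = 64 << 10;     // one touch per 64 KB region (illustrative granularity)
        size_t threads = (n + stride - 1) / stride;
        touch_pages<<<(unsigned)((threads + 255) / 256), 256>>>(p, n, stride);
        cudaDeviceSynchronize();
        printf("touched %zu bytes with %zu bytes of GPU memory\n", n, total_b);
        cudaFree(p);
        return 0;
    }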

  24. RECAP. Populate, Migrate, and Oversubscribe, each illustrated between processors A and B.

  25. APPLICATIONS IN ANALYTICS AND DL. Thursday, Mar 21, 3:00 PM, S9726 - Unified Memory for Data Analytics and Deep Learning, SJCC Room 211A (Concourse Level). [Pipeline: read CSV → Arrow DFs → concat → filter → join → groupby → convert → DMatrix → XGBoost.]

  26. APPLICATIONS IN ANALYTICS AND DL. Thursday, Mar 21, 3:00 PM, S9726 - Unified Memory for Data Analytics and Deep Learning, SJCC Room 211A (Concourse Level). [Diagram: a working set spilling from GPU memory into CPU memory: oversubscription.]

  27. AGENDA. Key principles • Performance tuning • Multi-GPU systems • Summit & Sierra • OS integration.

  28. PREFETCH. Anticipating access in the future, pages can be migrated before they are touched. [Diagram: A's and B's page tables, with page1–page3 resident in B's physical memory.]

  29. PREFETCH. cudaMemPrefetchAsync(ptr, size, proc A, stream) is issued to migrate pages to A before they are accessed.

  30. PREFETCH. cudaMemPrefetchAsync(ptr, size, proc A, stream): [the pages are unmapped from B and their data is copied to A.]

  31. PREFETCH. cudaMemPrefetchAsync(ptr, size, proc A, stream): [the pages are mapped into A's memory, so A's later accesses proceed without faults.]
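Putting the prefetch slides into a runnable sketch of my own (device 0 and a single stream assumed; cudaCpuDeviceId targets CPU memory):

    #include <cstring>
    #include <cuda_runtime.h>

    __global__ void inc(char *p, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1;
    }

    int main() {
        const int gpuId = 0;
        size_t n = 32 << 20;
        char *p;
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMallocManaged(&p, n);
        memset(p, 0, n);                                // pages resident on the CPU

        cudaMemPrefetchAsync(p, n, gpuId, s);           // migrate to the GPU ahead of the kernel
        inc<<<(unsigned)((n + 255) / 256), 256, 0, s>>>(p, n);  // runs without page faults

        cudaMemPrefetchAsync(p, n, cudaCpuDeviceId, s); // migrate back before the CPU reads
        cudaStreamSynchronize(s);
        char first = p[0];                              // local access, no fault-driven migration
        (void)first;

        cudaFree(p);
        cudaStreamDestroy(s);
        return 0;
    }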

  32. MIGRATION PERFORMANCE

    __global__ void kernel(int *host, int *device) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        device[i] = host[i];
    }

    // allocate and initialize memory
    cudaMallocManaged(&host, size);
    memset(host, 0, size);

    // benchmark CPU->GPU migration
    if (prefetch)
        cudaMemPrefetchAsync(host, size, gpuId);
    kernel<<<grid, block>>>(host, device);

[Chart: Tesla V100 PCIe3 throughput (GB/s): fault-based 5.8, prefetch 11.4, cudaMemcpy 12.0.]
For more details see blog: https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

  33. MIGRATION W/ OVERSUBSCRIPTION

    // pre-populate GPU memory
    cudaMallocManaged(&tmp, GPU_MEM_SIZE);
    cudaMemPrefetchAsync(tmp, GPU_MEM_SIZE, gpuId);

    // allocate and initialize memory
    cudaMallocManaged(&host, size);
    memset(host, 0, size);

    // benchmark CPU->GPU migration
    if (prefetch)
        cudaMemPrefetchAsync(host, size, gpuId);
    kernel<<<grid, block>>>(host, device);

[Chart: Tesla V100 PCIe, throughput (GB/s), GPU memory free vs. fully populated: fault-based 5.8 vs. 3.8, prefetch 11.4 vs. 8.8, cudaMemcpy ~12.0 in both cases.]

  34. POPULATION PERFORMANCE

    // fault-based
    cudaMallocManaged(&ptr, size);
    cudaMemset(ptr, 0, size);

    // prefetch
    cudaMallocManaged(&ptr, size);
    cudaMemPrefetchAsync(ptr, size, gpuId);
    cudaMemset(ptr, 0, size);

No migration traffic, just page population.
[Chart: Tesla V100 population throughput (GB/s) vs. buffer size (bytes), four series: fault-based driver 410, fault-based driver 418, prefetch driver 410, prefetch driver 418.]

  35. PREFETCH GOTCHAS. There is CPU overhead related to updating page table mappings, and the driver may defer prefetches to a background thread. How this may impact your applications: • a DtoH prefetch may not return until the operation is completed • achieving good DtoH/HtoD overlap may be difficult in some cases. We're actively working on improving the prefetch implementation to alleviate these issues. For more details see my blog: https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/
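To illustrate the overlap caveat with a hypothetical sketch of my own: the code below issues a DtoH prefetch and an HtoD prefetch on separate streams, which one might expect to overlap; per the slide, the blocking DtoH behavior and background-thread deferral can prevent that.

    #include <cstring>
    #include <cuda_runtime.h>

    int main() {
        const int gpuId = 0;
        size_t n = 64 << 20;
        char *a, *b;
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        cudaMallocManaged(&a, n);
        cudaMallocManaged(&b, n);

        cudaMemPrefetchAsync(a, n, gpuId, s1);  // make a resident on the GPU
        memset(b, 0, n);                        // make b resident on the CPU
        cudaDeviceSynchronize();

        // Intended overlap: a moves DtoH on s1 while b moves HtoD on s2.
        // In practice the DtoH prefetch may not return until it completes,
        // so the two transfers may end up serialized.
        cudaMemPrefetchAsync(a, n, cudaCpuDeviceId, s1);
        cudaMemPrefetchAsync(b, n, gpuId, s2);
        cudaDeviceSynchronize();

        cudaFree(a); cudaFree(b);
        cudaStreamDestroy(s1); cudaStreamDestroy(s2);
        return 0;
    }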

  36. USER POLICIES (each illustrated between GPU0 and GPU1):
  • Default: data migrates on access/prefetch
  • ReadMostly: data is duplicated on read/prefetch
  • PreferredLocation: data resists migrating away from the chosen processor
  • AccessedBy: a direct mapping is established to avoid faults
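The non-default policies are installed with cudaMemAdvise. A sketch of my own (the two-GPU system and sizes are assumptions; the enum names are the real CUDA ones):

    #include <cuda_runtime.h>

    int main() {
        const int gpu0 = 0, gpu1 = 1;   // assumes a system with at least two GPUs
        size_t n = 16 << 20;
        char *data;
        cudaMallocManaged(&data, n);

        // ReadMostly: reads/prefetches create duplicates instead of migrating
        cudaMemAdvise(data, n, cudaMemAdviseSetReadMostly, gpu0);

        // PreferredLocation: resist migrating these pages away from gpu0
        cudaMemAdvise(data, n, cudaMemAdviseSetPreferredLocation, gpu0);

        // AccessedBy: map the pages into gpu1's page table so its accesses
        // go directly over the interconnect instead of faulting
        cudaMemAdvise(data, n, cudaMemAdviseSetAccessedBy, gpu1);

        cudaFree(data);
        return 0;
    }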

  37. READ DUPLICATION

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);                           // populates data on the CPU
    cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
    cudaMemPrefetchAsync(data, N, myGpuId, s);    // creates a copy on the GPU
    mykernel<<<..., s>>>(data, N);                // both CPU and GPU can read data
    use_data(data, N);                            // simultaneously without faults
    cudaDeviceSynchronize();
    cudaFree(data);

Writes will collapse all copies into one; subsequent reads will fault & duplicate.
