Memory Management on Modern GPU Architectures




  1. MEMORY MANAGEMENT ON MODERN GPU ARCHITECTURES Nikolay Sakharnykh, Tue Mar 19, 3:00 PM

  2. HOW DO WE ALLOCATE MEMORY IN CUDA? cudaMallocHost, cudaHostRegister, cudaMalloc, cudaMallocManaged, cudaMalloc3D, cudaMallocArray

  3. HOW DO WE ALLOCATE MEMORY IN CUDA?
  • cudaMallocHost / cudaHostRegister: accessible by CPU & GPU; pinned to the CPU memory node
  • cudaMalloc: accessible by GPU only; pinned to a single GPU
  • cudaMallocManaged: accessible by CPU & GPU; can "migrate"
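To make the comparison concrete, here is a minimal sketch of my own (not from the deck; the sizes and the fill kernel are illustrative) exercising the three allocation flavors:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void fill(int *p, int n) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) p[i] = i;
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(int);

        int *pinned, *device, *managed;
        cudaMallocHost(&pinned, bytes);     // pinned to the CPU memory node; CPU & GPU accessible
        cudaMalloc(&device, bytes);         // pinned to a single GPU; GPU-only
        cudaMallocManaged(&managed, bytes); // CPU & GPU accessible; pages can migrate

        fill<<<(n + 255) / 256, 256>>>(device, n);
        fill<<<(n + 255) / 256, 256>>>(managed, n);
        cudaDeviceSynchronize();

        pinned[0] = 42;                           // CPU can always touch pinned memory
        printf("managed[7] = %d\n", managed[7]);  // legal after sync: pages migrate back to the CPU

        cudaFreeHost(pinned);
        cudaFree(device);
        cudaFree(managed);
        return 0;
    }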

  4. AGENDA. Key principles • Performance tuning • Multi-GPU systems • Summit & Sierra • OS integration. (*Some of this behavior may change in the future.)

  5. UNIFIED MEMORY BASICS. [Diagram: process P's pages (page1, page2, page3, …) span CPU 0's, CPU 1's, and GPU A's, B's, and C's memories.] A single virtual memory is shared between all computing processors.

  6. UNIFIED MEMORY BASICS. [Diagram: GPU A and GPU B each hold their own page table (page1, page2, page3, …) mapping pages into their own physical memory.]

  7. EXAMPLE: LOCAL ACCESS. GPU A executes *addr1 = 1. page1 is mapped in A's page table and backed by A's physical memory, so this is a local access.

  8. EXAMPLE: POPULATE. GPU A executes *addr3 = 1. page3 is not backed by physical memory yet, so the access triggers a page fault.

  9. EXAMPLE: POPULATE. To service the fault, memory is allocated for page3's data.

  10. EXAMPLE: POPULATE. page3 is populated and mapped into the new location.

  11. EXAMPLE: POPULATE. The faulting access is replayed and now succeeds.
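The populate sequence above is exactly what a first touch from the GPU triggers. A minimal sketch of my own (one visible GPU assumed; the size is illustrative):

    #include <cuda_runtime.h>

    __global__ void first_touch(char *p, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) p[i] = 1;   // first write faults; a page is allocated, populated, then the access replays
    }

    int main() {
        char *p;
        size_t n = 64 << 20;        // 64 MB
        cudaMallocManaged(&p, n);   // reserves the virtual range; no physical pages yet (Pascal+, Linux)
        first_touch<<<(unsigned)((n + 255) / 256), 256>>>(p, n);
        cudaDeviceSynchronize();
        cudaFree(p);
        return 0;
    }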

  12. EXAMPLE: MIGRATE. GPU A executes *addr2 = 1 and *addr3 = 1. Both pages are resident in B's physical memory, so both accesses page-fault on A.

  13. EXAMPLE: MIGRATE. page2 and page3 are unmapped from B's memory.

  14. EXAMPLE: MIGRATE. The pages' data is copied from B to A.

  15. EXAMPLE: MIGRATE. page2 and page3 are mapped into A's memory.

  16. EXAMPLE: MIGRATE. The faulting accesses are replayed and now succeed.
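The same migrate walkthrough as a code sketch of my own: here "B" is the CPU, where the pages first land, and "A" is the GPU that faults them over:

    #include <cstring>
    #include <cuda_runtime.h>

    __global__ void inc(char *p, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1;   // faults: pages are unmapped from the CPU, copied, and mapped on the GPU
    }

    int main() {
        char *p;
        size_t n = 32 << 20;
        cudaMallocManaged(&p, n);
        memset(p, 0, n);                                  // CPU touch: pages become resident in CPU memory
        inc<<<(unsigned)((n + 255) / 256), 256>>>(p, n);  // GPU touch: fault-driven migration to the GPU
        cudaDeviceSynchronize();
        char first = p[0];                                // CPU touch again: pages migrate back on demand
        (void)first;
        cudaFree(p);
        return 0;
    }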

  17. EXAMPLE: OVERSUBSCRIBE. GPU A's physical memory is completely full: GPU memory is FULL.

  18. EXAMPLE: OVERSUBSCRIBE. GPU A executes *addr6 = 1. page6 page-faults, but A has no free physical memory left.

  19. EXAMPLE: OVERSUBSCRIBE. To make room, page3 is unmapped from A's memory.

  20. EXAMPLE: OVERSUBSCRIBE. page3's data is copied to B's memory (eviction).

  21. EXAMPLE: OVERSUBSCRIBE. page3 is now mapped in B's memory, freeing space on A.

  22. EXAMPLE: OVERSUBSCRIBE. page6 is populated in the freed space on A.

  23. EXAMPLE: OVERSUBSCRIBE. The faulting access is replayed and now succeeds. *B cannot be a GPU in this case: evicted pages go to CPU memory.
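Oversubscription needs no special API: a managed allocation may simply exceed GPU memory, and eviction to CPU memory happens as in the sequence above. A sketch of my own (assumes a fault-capable Pascal-or-later GPU and enough system memory):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch_pages(char *p, size_t n, size_t stride) {
        size_t i = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) * stride;
        if (i < n) p[i] = 1;   // touching more pages than fit forces eviction to CPU memory
    }

    int main() {
        size_t free_b, total_b;
        cudaMemGetInfo(&free_b, &total_b);

        size_t n = total_b + total_b / 2;   // 1.5x the GPU's physical memory
        char *p;
        if (cudaMallocManaged(&p, n) != cudaSuccess) return 1;

        const size_t stride = 64 << 10;     // one touch per 64 KB region (illustrative granularity)
        size_t threads = (n + stride - 1) / stride;
        touch_pages<<<(unsigned)((threads + 255) / 256), 256>>>(p, n, stride);
        cudaDeviceSynchronize();
        printf("touched %zu bytes with %zu bytes of GPU memory\n", n, total_b);
        cudaFree(p);
        return 0;
    }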

  24. RECAP. Populate, Migrate, and Oversubscribe, each illustrated between processors A and B.

  25. APPLICATIONS IN ANALYTICS AND DL. Thursday, Mar 21, 3:00 PM, S9726 - Unified Memory for Data Analytics and Deep Learning, SJCC Room 211A (Concourse Level). [Pipeline: read CSV → Arrow DFs → concat → filter → join → groupby → convert → DMatrix → XGBoost.]

  26. APPLICATIONS IN ANALYTICS AND DL. Thursday, Mar 21, 3:00 PM, S9726 - Unified Memory for Data Analytics and Deep Learning, SJCC Room 211A (Concourse Level). [Diagram: a working set spilling from GPU memory into CPU memory: oversubscription.]

  27. AGENDA. Key principles • Performance tuning • Multi-GPU systems • Summit & Sierra • OS integration.

  28. PREFETCH. Anticipating access in the future, pages can be migrated before they are touched. [Diagram: A's and B's page tables, with page1–page3 resident in B's physical memory.]

  29. PREFETCH. cudaMemPrefetchAsync(ptr, size, proc A, stream) is issued to migrate pages to A before they are accessed.

  30. PREFETCH. cudaMemPrefetchAsync(ptr, size, proc A, stream): [the pages are unmapped from B and their data is copied to A.]

  31. PREFETCH. cudaMemPrefetchAsync(ptr, size, proc A, stream): [the pages are mapped into A's memory, so A's later accesses proceed without faults.]
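Putting the prefetch slides into a runnable sketch of my own (device 0 and a single stream assumed; cudaCpuDeviceId targets CPU memory):

    #include <cstring>
    #include <cuda_runtime.h>

    __global__ void inc(char *p, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1;
    }

    int main() {
        const int gpuId = 0;
        size_t n = 32 << 20;
        char *p;
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMallocManaged(&p, n);
        memset(p, 0, n);                                // pages resident on the CPU

        cudaMemPrefetchAsync(p, n, gpuId, s);           // migrate to the GPU ahead of the kernel
        inc<<<(unsigned)((n + 255) / 256), 256, 0, s>>>(p, n);  // runs without page faults

        cudaMemPrefetchAsync(p, n, cudaCpuDeviceId, s); // migrate back before the CPU reads
        cudaStreamSynchronize(s);
        char first = p[0];                              // local access, no fault-driven migration
        (void)first;

        cudaFree(p);
        cudaStreamDestroy(s);
        return 0;
    }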

  32. MIGRATION PERFORMANCE

    __global__ void kernel(int *host, int *device) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        device[i] = host[i];
    }

    // allocate and initialize memory
    cudaMallocManaged(&host, size);
    memset(host, 0, size);

    // benchmark CPU->GPU migration
    if (prefetch)
        cudaMemPrefetchAsync(host, size, gpuId);
    kernel<<<grid, block>>>(host, device);

[Chart: Tesla V100 PCIe3 throughput (GB/s): fault-based 5.8, prefetch 11.4, cudaMemcpy 12.0.]
For more details see blog: https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

  33. MIGRATION W/ OVERSUBSCRIPTION

    // pre-populate GPU memory
    cudaMallocManaged(&tmp, GPU_MEM_SIZE);
    cudaMemPrefetchAsync(tmp, GPU_MEM_SIZE, gpuId);

    // allocate and initialize memory
    cudaMallocManaged(&host, size);
    memset(host, 0, size);

    // benchmark CPU->GPU migration
    if (prefetch)
        cudaMemPrefetchAsync(host, size, gpuId);
    kernel<<<grid, block>>>(host, device);

[Chart: Tesla V100 PCIe, throughput (GB/s), GPU memory free vs. fully populated: fault-based 5.8 vs. 3.8, prefetch 11.4 vs. 8.8, cudaMemcpy ~12.0 in both cases.]

  34. POPULATION PERFORMANCE

    // fault-based
    cudaMallocManaged(&ptr, size);
    cudaMemset(ptr, 0, size);

    // prefetch
    cudaMallocManaged(&ptr, size);
    cudaMemPrefetchAsync(ptr, size, gpuId);
    cudaMemset(ptr, 0, size);

No migration traffic, just page population.
[Chart: Tesla V100 population throughput (GB/s) vs. buffer size (bytes), four series: fault-based driver 410, fault-based driver 418, prefetch driver 410, prefetch driver 418.]

  35. PREFETCH GOTCHAS. There is CPU overhead related to updating page table mappings, and the driver may defer prefetches to a background thread. How this may impact your applications: • a DtoH prefetch may not return until the operation is completed • achieving good DtoH/HtoD overlap may be difficult in some cases. We're actively working on improving the prefetch implementation to alleviate these issues. For more details see my blog: https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/
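To illustrate the overlap caveat with a hypothetical sketch of my own: the code below issues a DtoH prefetch and an HtoD prefetch on separate streams, which one might expect to overlap; per the slide, the blocking DtoH behavior and background-thread deferral can prevent that.

    #include <cstring>
    #include <cuda_runtime.h>

    int main() {
        const int gpuId = 0;
        size_t n = 64 << 20;
        char *a, *b;
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        cudaMallocManaged(&a, n);
        cudaMallocManaged(&b, n);

        cudaMemPrefetchAsync(a, n, gpuId, s1);  // make a resident on the GPU
        memset(b, 0, n);                        // make b resident on the CPU
        cudaDeviceSynchronize();

        // Intended overlap: a moves DtoH on s1 while b moves HtoD on s2.
        // In practice the DtoH prefetch may not return until it completes,
        // so the two transfers may end up serialized.
        cudaMemPrefetchAsync(a, n, cudaCpuDeviceId, s1);
        cudaMemPrefetchAsync(b, n, gpuId, s2);
        cudaDeviceSynchronize();

        cudaFree(a); cudaFree(b);
        cudaStreamDestroy(s1); cudaStreamDestroy(s2);
        return 0;
    }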

  36. USER POLICIES (each illustrated between GPU0 and GPU1):
  • Default: data migrates on access/prefetch
  • ReadMostly: data is duplicated on read/prefetch
  • PreferredLocation: data resists migrating away from the chosen processor
  • AccessedBy: a direct mapping is established to avoid faults
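The non-default policies are installed with cudaMemAdvise. A sketch of my own (the two-GPU system and sizes are assumptions; the enum names are the real CUDA ones):

    #include <cuda_runtime.h>

    int main() {
        const int gpu0 = 0, gpu1 = 1;   // assumes a system with at least two GPUs
        size_t n = 16 << 20;
        char *data;
        cudaMallocManaged(&data, n);

        // ReadMostly: reads/prefetches create duplicates instead of migrating
        cudaMemAdvise(data, n, cudaMemAdviseSetReadMostly, gpu0);

        // PreferredLocation: resist migrating these pages away from gpu0
        cudaMemAdvise(data, n, cudaMemAdviseSetPreferredLocation, gpu0);

        // AccessedBy: map the pages into gpu1's page table so its accesses
        // go directly over the interconnect instead of faulting
        cudaMemAdvise(data, n, cudaMemAdviseSetAccessedBy, gpu1);

        cudaFree(data);
        return 0;
    }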

  37. READ DUPLICATION

    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);                           // populates data on the CPU
    cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
    cudaMemPrefetchAsync(data, N, myGpuId, s);    // creates a copy on the GPU
    mykernel<<<..., s>>>(data, N);                // both CPU and GPU can read data
    use_data(data, N);                            // simultaneously without faults
    cudaDeviceSynchronize();
    cudaFree(data);

Writes will collapse all copies into one; subsequent reads will fault & duplicate.
