UNIFIED MEMORY FOR DATA ANALYTICS AND DEEP LEARNING
Nikolay Sakharnykh, Chirayu Garg, and Dmitri Vainbrand, Thu Mar 19, 3:00 PM
RAPIDS
CUDA-accelerated Data Science Libraries
[Stack diagram: Python and Dask/Spark APIs over the RAPIDS libraries cuDF, cuML, cuGraph and the DL frameworks with cuDNN, all built on Apache Arrow on GPU memory and CUDA]
MORTGAGE PIPELINE: ETL
https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb
[Pipeline diagram: CSV -> read CSV -> DF -> filter, join, groupby -> Arrow]
MORTGAGE PIPELINE: PREP + ML
https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb
[Pipeline diagram: Arrow chunks -> concat -> DF -> convert -> DMatrix -> XGBoost]
GTC EU KEYNOTE RESULTS ON DGX-1
[Bar chart: Mortgage workflow time breakdown on DGX-1 (s), split into ETL, PREP, and ML]
MAXIMUM MEMORY USAGE ON DGX-1
[Bar chart: maximum memory usage (GB) for GPUs 1-8, against the Tesla V100 limit of 32GB]
ETL INPUT
https://rapidsai.github.io/demos/datasets/mortgage-data
[Diagram: original input set - 112 quarters (~2-3GB CSV files); pre-split input set - 240 files (1GB each)]
CAN WE AVOID INPUT SPLITTING?
[Line charts: GPU memory usage (GB) over time for ETL on the pre-split input (112 parts) vs. the original dataset; the original dataset exceeds the 32GB Tesla V100 limit and crashes with OOM]
ML INPUT
[Pipeline diagram: Arrow chunks -> concat -> DF -> convert -> DMatrix -> XGBoost]
Some number of quarters are used for ML training.
CAN WE TRAIN ON MORE DATA?
[Line charts: GPU memory usage (GB) during PREP when training on 20 vs. 28 of the 112 parts; the 28-part run exceeds the 32GB Tesla V100 limit and crashes with OOM]
HOW MEMORY IS MANAGED IN RAPIDS
RAPIDS MEMORY MANAGER
RAPIDS Memory Manager (RMM) is:
- A replacement allocator for CUDA Device Memory
- A pool allocator to make CUDA device memory allocation faster & asynchronous
- A central place for all device memory allocations in cuDF and other RAPIDS libraries
https://github.com/rapidsai/rmm
WHY DO WE NEED MEMORY POOLS?
cudaMalloc/cudaFree are synchronous
- they block the device
cudaMalloc/cudaFree are expensive
- cudaFree must zero memory for security
- cudaMalloc creates peer mappings for all GPUs

cudaMalloc(&buffer, size_in_bytes);
cudaFree(buffer);

Using the cnmem memory pool improves RAPIDS ETL time by 10x.
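To make the cost asymmetry concrete, here is a minimal CUDA C++ sketch of the sub-allocation idea behind pools like cnmem: pay for one large cudaMalloc up front, then serve requests by bumping a pointer. The BumpPool name and fixed 256-byte alignment are illustrative choices, not RMM's or cnmem's actual implementation.

// Minimal sketch (not the cnmem implementation): a bump allocator that pays
// the cudaMalloc/cudaFree cost once, then hands out sub-buffers without
// touching the driver or blocking the device.
#include <cuda_runtime.h>
#include <cstddef>

struct BumpPool {
    char*  base   = nullptr;
    size_t offset = 0;
    size_t size   = 0;

    explicit BumpPool(size_t bytes) : size(bytes) {
        cudaMalloc(reinterpret_cast<void**>(&base), bytes);  // one expensive, synchronizing call
    }
    ~BumpPool() { cudaFree(base); }

    void* alloc(size_t bytes) {
        bytes = (bytes + 255) & ~size_t(255);      // keep 256B alignment for coalesced access
        if (offset + bytes > size) return nullptr; // a real pool would grow or fall back
        void* p = base + offset;
        offset += bytes;
        return p;                                  // no driver call: cheap and async-friendly
    }
};

A real pool also needs free-list reuse, growth, and per-stream bookkeeping; the point here is only that steady-state allocation becomes a few arithmetic instructions instead of a driver call.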
RAPIDS MEMORY MANAGER (RMM)
Fast, Asynchronous Device Memory Management

C/C++:
RMM_ALLOC(&buffer, size_in_bytes, stream_id);
RMM_FREE(buffer, stream_id);

Python: drop-in replacement for the Numba API:
dev_ones = rmm.device_array(np.ones(count))
dev_twos = rmm.device_array_like(dev_ones)
# also rmm.to_device(), rmm.auto_device(), etc.

Thrust: device vector and execution policies:
#include <rmm_thrust_allocator.h>
rmm::device_vector<int> dvec(size);
thrust::sort(rmm::exec_policy(stream)->on(stream), …);
MANAGING MEMORY IN THE E2E PIPELINE
Performance optimization is required to avoid OOM. At this point all ETL processing is done and the data is stored in Arrow memory.
KEY MEMORY MANAGEMENT QUESTIONS
- Can we make memory management easier?
- Can we avoid artificial pre-processing of input data?
- Can we train on larger datasets?
SOLUTION: UNIFIED MEMORY
[Diagram: successive cudaMallocManaged allocations take GPU memory from empty, to partially occupied, to fully occupied; under oversubscription, resident pages are evicted to CPU memory so new allocations can keep landing on the GPU]
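A minimal sketch of what the diagram describes, assuming a system with enough CPU memory to back the overflow: allocate 1.5x the GPU's physical memory with cudaMallocManaged and touch all of it from a kernel. Pages fault onto the GPU on first touch, and once the GPU fills up the driver evicts older pages to CPU memory instead of failing.

// Hedged sketch: with cudaMallocManaged you can allocate more than the GPU's
// physical memory; pages migrate on demand and cold pages are evicted to
// system memory when the GPU fills up.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(char* p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1;                    // first touch faults the page onto the GPU
}

int main() {
    size_t free_b, total_b;
    cudaMemGetInfo(&free_b, &total_b);
    size_t n = total_b + total_b / 2;       // 1.5x physical GPU memory: oversubscribed

    char* data = nullptr;
    if (cudaMallocManaged(&data, n) != cudaSuccess) return 1;

    // Touching the whole range forces migration; once GPU memory is full,
    // the driver evicts pages back to CPU memory to make room.
    touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
    cudaDeviceSynchronize();
    printf("touched %zu bytes on a GPU with %zu bytes of memory\n", n, total_b);
    cudaFree(data);
    return 0;
}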
HOW TO USE UNIFIED MEMORY IN CUDF
Python:

from librmm_cffi import librmm_config as rmm_cfg
rmm_cfg.use_pool_allocator = True   # default is False
rmm_cfg.use_managed_memory = True   # default is False
IMPLEMENTATION DETAILS
Pool allocator (CNMEM):

if (mFlags & CNMEM_FLAGS_MANAGED) {
    CNMEM_DEBUG_INFO("cudaMallocManaged(%lu)\n", size);
    CNMEM_CHECK_CUDA(cudaMallocManaged(&data, size));
    CNMEM_CHECK_CUDA(cudaMemPrefetchAsync(data, size, mDevice));
} else {
    CNMEM_DEBUG_INFO("cudaMalloc(%lu)\n", size);
    CNMEM_CHECK_CUDA(cudaMalloc(&data, size));
}

Regular RMM allocation:

if (rmm::Manager::usePoolAllocator()) {
    RMM_CHECK(rmm::Manager::getInstance().registerStream(stream));
    RMM_CHECK_CNMEM(cnmemMalloc(reinterpret_cast<void**>(ptr), size, stream));
} else if (rmm::Manager::useManagedMemory()) {
    RMM_CHECK_CUDA(cudaMallocManaged(reinterpret_cast<void**>(ptr), size));
} else {
    RMM_CHECK_CUDA(cudaMalloc(reinterpret_cast<void**>(ptr), size));
}
1. UNSPLIT DATASET “JUST WORKS”
[Line charts: GPU memory usage (GB) for ETL on the original dataset; with cudaMalloc the run crashes with OOM at the 32GB Tesla V100 limit, while with cudaMallocManaged the memory used and pool size grow past 32GB and the run completes]
2. SPEED-UP ON CONVERSION
[Bar chart: DGX-1 time (s) for ETL, PREP, and ML on 20 quarters; PREP takes 46s with cudaMalloc vs. 36s with cudaMallocManaged]
25% speed-up on PREP!
3. LARGER ML TRAINING SET
[Bar chart: DGX-1 time (s) for ETL, PREP, and ML on 20 and 28 quarters; 28 quarters with cudaMalloc fails with OOM, while with cudaMallocManaged it completes]
UNIFIED MEMORY GOTCHAS
1. UVM doesn't work with CUDA IPC - be careful when sharing data between processes
   Workaround: a separate (small) cudaMalloc pool for communication buffers (see the sketch below). In the future this will work transparently with Linux HMM.
2. Yes, you can oversubscribe, but there is a danger that it will just run very slowly
   Capture Nsight or nvprof profiles to check eviction traffic. In the future RMM may show warnings about this.
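A hedged sketch of that workaround: only ordinary cudaMalloc memory can back a CUDA IPC handle, so stage managed data into a small cudaMalloc'ed communication buffer before exporting it. The share_region/open_region helpers are hypothetical names for illustration.

// Sketch of the workaround: managed allocations cannot be exported with CUDA
// IPC, so keep a cudaMalloc'ed staging buffer for inter-process exchange.
#include <cuda_runtime.h>

cudaIpcMemHandle_t share_region(const void* managed_src, size_t bytes,
                                void** comm_buf /* out */) {
    // The communication buffer comes from cudaMalloc, not cudaMallocManaged,
    // because only ordinary device allocations can back an IPC handle.
    cudaMalloc(comm_buf, bytes);
    cudaMemcpy(*comm_buf, managed_src, bytes, cudaMemcpyDefault);

    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, *comm_buf);   // pass this handle to the peer process
    return handle;
}

// Peer process: map the exported buffer into its own address space.
void* open_region(cudaIpcMemHandle_t handle) {
    void* ptr = nullptr;
    cudaIpcOpenMemHandle(&ptr, handle, cudaIpcMemLazyEnablePeerAccess);
    return ptr;
}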
RECAP
Just to run the full pipeline on the GPU you need to:
- carefully partition input data
- adjust memory pool options throughout the pipeline
- limit training size to fit in memory

Unified Memory:
- makes life easier for data scientists - less tweaking!
- improves performance - sometimes it's faster to allocate less often & oversubscribe
- enables easy experiments with larger datasets
MEMORY MANAGEMENT IN THE FUTURE
Contribute to RAPIDS: https://github.com/rapidsai/cudf
Contribute to RMM: https://github.com/rapidsai/rmm
[Diagram: BlazingDB, OmniSci, XGBoost, cuDNN, databases, cuDF, cuML, and the NEXT BIG THING]
UNIFIED MEMORY FOR DEEP LEARNING
FROM ANALYTICS TO DEEP LEARNING
[Pipeline: Data Preparation -> Machine Learning -> Deep Learning]
PYTORCH INTEGRATION
PyTorch uses a caching allocator to manage GPU memory:
- small allocations are served from a fixed buffer (e.g. 1 MB)
- large allocations are dedicated cudaMalloc's

Trivial change:
- replace cudaMalloc with cudaMallocManaged
- immediately call cudaMemPrefetchAsync to allocate pages on the GPU; otherwise cuDNN may select sub-optimal kernels (see the sketch below)
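A minimal sketch of that change, with a hypothetical raw_alloc hook standing in for the caching allocator's call site (this is not PyTorch's actual internal code):

// Hedged sketch of the "trivial change" described above.
#include <cuda_runtime.h>

cudaError_t raw_alloc(void** ptr, size_t size, int device, cudaStream_t stream) {
    // Before: cudaMalloc(ptr, size);
    cudaError_t err = cudaMallocManaged(ptr, size);
    if (err != cudaSuccess) return err;

    // Prefetch immediately so the pages are resident on the GPU; without this,
    // cuDNN's heuristics may see non-resident memory and pick slower kernels.
    return cudaMemPrefetchAsync(*ptr, size, device, stream);
}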
PYTORCH ALLOCATOR VS RMM
PyTorch Caching Allocator:
- memory pool to avoid synchronization on malloc/free
- directly uses CUDA APIs for memory allocations
- pool size is not fixed
- specific to the PyTorch C++ library

RMM:
- memory pool to avoid synchronization on malloc/free
- uses cnmem for memory allocation and management
- reserves half the available GPU memory for the pool
- re-usable across projects, with interfaces for various languages
WORKLOADS
Image Models: ResNet-1001, DenseNet-264, VNet
[Block diagram: BN-ReLU-Conv 1x1 -> BN-ReLU-Conv 3x3 -> BN-ReLU-Conv 1x1, with a skip connection (+)]
WORKLOADS
Language Models: word language modelling
- dictionary size = 33278
- embedding size = 256
- LSTM units = 256
- backpropagation through time = 1408 and 2800
[Block diagram: Embedding -> LSTM -> FC -> Softmax -> Loss]
WORKLOADS
Baseline Training Performance on V100-32GB
Model            FP16 Batch Size  FP16 Samples/sec  FP32 Batch Size  FP32 Samples/sec
ResNet-1001      98               98.7              48               44.3
DenseNet-264     218              255.8             109              143.1
VNet             30               3.56              15               3.4
Lang_Model-1408  32               94.9              40               77.9
Lang_Model-2800  16               46.5              18               35.7

Optimal batch size selected for high throughput. All results in this presentation use PyTorch 1.0rc1, the R418 driver, and a Tesla V100-32GB.
GPU OVERSUBSCRIPTION
Up to 3x Optimal Batch Size
[Line charts: samples/sec vs. batch size for ResNet-1001 and DenseNet-264, FP16 and FP32, with batch sizes extended well past the in-memory optimum]
GPU OVERSUBSCRIPTION
Fill
[Diagram: managed allocations fill GPU memory while the remainder stays in CPU memory]
GPU OVERSUBSCRIPTION
Evict
[Diagram: once GPU memory is full, resident pages are evicted to CPU memory]
GPU OVERSUBSCRIPTION
Page Fault-Evict-Fetch
[Diagram: a GPU page fault triggers eviction of a resident page, then the faulting page is fetched from CPU memory]
GPU OVERSUBSCRIPTION
Results
Model            FP16 Batch Size  FP16 Samples/sec  FP32 Batch Size  FP32 Samples/sec
ResNet-1001      202              10.1              98               5
DenseNet-264     430              22.3              218              12.1
VNet             32               3                 32               1.1
Lang_Model-1408  44               8.4               44               10
Lang_Model-2800  22               4.1               22               4.9
GPU OVERSUBSCRIPTION
Page Faults - ResNet-1001 Training Iteration
[Line chart: page fault count vs. oversubscription factor (batch size / optimal batch size) from 1x to 3x for ResNet-1001]
GPU OVERSUBSCRIPTION
Manual API Prefetch
Add cudaMemPrefetchAsync before kernels are called:

cudaMemPrefetchAsync(…)   // input, output, wparam
cudnnConvolutionForward(…)

cudaMemPrefetchAsync(…)   // A, B, C
kernelPointWiseApply3(…)
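A self-contained sketch of the same pattern, with a simple axpy kernel standing in for the cuDNN and pointwise kernels above (launch_with_prefetch is an illustrative name):

// Prefetch operands to the GPU on the kernel's stream before launching, so
// the kernel does not stall on page faults.
#include <cuda_runtime.h>

__global__ void axpy(const float* x, float* y, float a, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void launch_with_prefetch(const float* x, float* y, float a, size_t n,
                          int device, cudaStream_t stream) {
    // Prefetch inputs and outputs on the same stream as the kernel: the
    // migrations overlap with prior work and complete before the launch runs.
    cudaMemPrefetchAsync(x, n * sizeof(float), device, stream);
    cudaMemPrefetchAsync(y, n * sizeof(float), device, stream);
    axpy<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(x, y, a, n);
}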
GPU OVERSUBSCRIPTION
No Prefetch vs Manual API Prefetch
GPU OVERSUBSCRIPTION
Speed-up from Manual API Prefetch
[Bar chart: speed-up (x) of cudaMemPrefetchAsync vs. no prefetch for ResNet-1001, DenseNet-264, VNet, Lang_Model-1408, and Lang_Model-2800, in FP16 and FP32]
Observe up to 1.6x speed-up.
GPU OVERSUBSCRIPTION
Prefetch Only When Needed
- prefetch memory before kernels to improve performance
- cudaMemPrefetchAsync takes CPU cycles, so it degrades performance when not required
- automatic prefetching is needed to achieve high performance
[Line chart: samples/sec vs. batch size for ResNet-1001, FP16 with and without prefetch]
DRIVER PREFETCH
Aggressive driver prefetching
- driver-initiated (density-based) prefetching from CPU to GPU
- GPU pages are tracked as chunks of smaller sysmem pages
- driver logic: prefetch the rest of a GPU page once 51% of it has migrated to the GPU
- changing the threshold to 5% yields up to 20% performance gain over the default settings (see the sketch below)
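Illustrative pseudocode of the density policy just described; the real logic lives inside the CUDA driver, so the GpuPage struct and should_prefetch_rest function here are purely hypothetical.

// Illustrative C++ pseudocode of the density heuristic, not driver code.
struct GpuPage {
    int resident_chunks;   // sysmem-page-sized chunks already migrated to the GPU
    int total_chunks;      // chunks making up one large GPU page
};

bool should_prefetch_rest(const GpuPage& page, double threshold /* 0.51 or 0.05 */) {
    double density = (double)page.resident_chunks / page.total_chunks;
    // Default policy: pull in the rest of the page once a majority (51%) has
    // migrated. Lowering the threshold to 5% prefetches far more aggressively,
    // which the authors measured at up to 20% faster than the default.
    return density >= threshold;
}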
FRAMEWORK FUTURE
Frameworks can develop intelligence to insert prefetches before calling GPU kernels
- Smart evictions: activations only
- Lazy prefetch: catch kernel calls right before execution and add prefetch calls
- Eager prefetch: identify and add prefetch calls before the kernels are called

[Diagram: hook nn.Conv2d(…) and replace it with nn.Prefetch(…) followed by nn.Conv2d(…); in the dataflow x -> (W1 *) -> y -> (W2 *) -> z, prefetches execute in parallel with compute]
TAKEAWAY
- Unified Memory oversubscription solves the memory pool fragmentation issue
- Simple way to train bigger models and on larger input data
- Minimal user effort, no change in framework programming
- Frameworks can get better performance by adding prefetches
Try it out and contribute:
https://github.com/rapidsai/cudf
https://github.com/rapidsai/rmm