UNIFIED MEMORY FOR DATA ANALYTICS AND DEEP LEARNING
Nikolay Sakharnykh, Chirayu Garg, and Dmitri Vainbrand, Thu Mar 19, 3:00 PM
RAPIDS
CUDA-accelerated Data Science Libraries
[Stack diagram: Python and Dask/Spark APIs over the RAPIDS libraries cuDF, cuML, cuGraph and the DL frameworks with cuDNN, all built on Apache Arrow on GPU memory and CUDA]
MORTGAGE PIPELINE: ETL
https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb
[Pipeline diagram: CSV -> read CSV -> DF -> filter, join, groupby -> Arrow]
MORTGAGE PIPELINE: PREP + ML
https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb
[Pipeline diagram: Arrow chunks -> concat -> DF -> convert -> DMatrix -> XGBoost]
GTC EU KEYNOTE RESULTS ON DGX-1
[Bar chart: Mortgage workflow time breakdown on DGX-1 (s), split into ETL, PREP, and ML]
MAXIMUM MEMORY USAGE ON DGX-1
[Bar chart: maximum memory usage (GB) for GPUs 1-8, against the Tesla V100 limit of 32GB]
ETL INPUT
https://rapidsai.github.io/demos/datasets/mortgage-data
[Diagram: original input set - 112 quarters (~2-3GB CSV files); pre-split input set - 240 files (1GB each)]
CAN WE AVOID INPUT SPLITTING?
[Line charts: GPU memory usage (GB) over time for ETL on the pre-split input (112 parts) vs. the original dataset; the original dataset exceeds the 32GB Tesla V100 limit and crashes with OOM]
ML INPUT
[Pipeline diagram: Arrow chunks -> concat -> DF -> convert -> DMatrix -> XGBoost]
Some number of quarters are used for ML training.
CAN WE TRAIN ON MORE DATA?
[Line charts: GPU memory usage (GB) during PREP when training on 20 vs. 28 of the 112 parts; the 28-part run exceeds the 32GB Tesla V100 limit and crashes with OOM]
HOW MEMORY IS MANAGED IN RAPIDS
RAPIDS MEMORY MANAGER
RAPIDS Memory Manager (RMM) is:
- A replacement allocator for CUDA Device Memory
- A pool allocator to make CUDA device memory allocation faster & asynchronous
- A central place for all device memory allocations in cuDF and other RAPIDS libraries
https://github.com/rapidsai/rmm
WHY DO WE NEED MEMORY POOLS?
cudaMalloc/cudaFree are synchronous
- they block the device
cudaMalloc/cudaFree are expensive
- cudaFree must zero memory for security
- cudaMalloc creates peer mappings for all GPUs

cudaMalloc(&buffer, size_in_bytes);
cudaFree(buffer);

Using the cnmem memory pool improves RAPIDS ETL time by 10x.
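To make the cost asymmetry concrete, here is a minimal CUDA C++ sketch of the sub-allocation idea behind pools like cnmem: pay for one large cudaMalloc up front, then serve requests by bumping a pointer. The BumpPool name and fixed 256-byte alignment are illustrative choices, not RMM's or cnmem's actual implementation.

// Minimal sketch (not the cnmem implementation): a bump allocator that pays
// the cudaMalloc/cudaFree cost once, then hands out sub-buffers without
// touching the driver or blocking the device.
#include <cuda_runtime.h>
#include <cstddef>

struct BumpPool {
    char*  base   = nullptr;
    size_t offset = 0;
    size_t size   = 0;

    explicit BumpPool(size_t bytes) : size(bytes) {
        cudaMalloc(reinterpret_cast<void**>(&base), bytes);  // one expensive, synchronizing call
    }
    ~BumpPool() { cudaFree(base); }

    void* alloc(size_t bytes) {
        bytes = (bytes + 255) & ~size_t(255);      // keep 256B alignment for coalesced access
        if (offset + bytes > size) return nullptr; // a real pool would grow or fall back
        void* p = base + offset;
        offset += bytes;
        return p;                                  // no driver call: cheap and async-friendly
    }
};

A real pool also needs free-list reuse, growth, and per-stream bookkeeping; the point here is only that steady-state allocation becomes a few arithmetic instructions instead of a driver call.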
RAPIDS MEMORY MANAGER (RMM)
Fast, Asynchronous Device Memory Management

C/C++:
RMM_ALLOC(&buffer, size_in_bytes, stream_id);
RMM_FREE(buffer, stream_id);

Python: drop-in replacement for the Numba API:
dev_ones = rmm.device_array(np.ones(count))
dev_twos = rmm.device_array_like(dev_ones)
# also rmm.to_device(), rmm.auto_device(), etc.

Thrust: device vector and execution policies:
#include <rmm_thrust_allocator.h>
rmm::device_vector<int> dvec(size);
thrust::sort(rmm::exec_policy(stream)->on(stream), …);
MANAGING MEMORY IN THE E2E PIPELINE
Performance optimization is required to avoid OOM. At this point all ETL processing is done and the data is stored in Arrow memory.
KEY MEMORY MANAGEMENT QUESTIONS
- Can we make memory management easier?
- Can we avoid artificial pre-processing of input data?
- Can we train on larger datasets?
SOLUTION: UNIFIED MEMORY
[Diagram: successive cudaMallocManaged allocations take GPU memory from empty, to partially occupied, to fully occupied; under oversubscription, resident pages are evicted to CPU memory so new allocations can keep landing on the GPU]
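A minimal sketch of what the diagram describes, assuming a system with enough CPU memory to back the overflow: allocate 1.5x the GPU's physical memory with cudaMallocManaged and touch all of it from a kernel. Pages fault onto the GPU on first touch, and once the GPU fills up the driver evicts older pages to CPU memory instead of failing.

// Hedged sketch: with cudaMallocManaged you can allocate more than the GPU's
// physical memory; pages migrate on demand and cold pages are evicted to
// system memory when the GPU fills up.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(char* p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1;                    // first touch faults the page onto the GPU
}

int main() {
    size_t free_b, total_b;
    cudaMemGetInfo(&free_b, &total_b);
    size_t n = total_b + total_b / 2;       // 1.5x physical GPU memory: oversubscribed

    char* data = nullptr;
    if (cudaMallocManaged(&data, n) != cudaSuccess) return 1;

    // Touching the whole range forces migration; once GPU memory is full,
    // the driver evicts pages back to CPU memory to make room.
    touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
    cudaDeviceSynchronize();
    printf("touched %zu bytes on a GPU with %zu bytes of memory\n", n, total_b);
    cudaFree(data);
    return 0;
}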
HOW TO USE UNIFIED MEMORY IN CUDF
Python:

from librmm_cffi import librmm_config as rmm_cfg
rmm_cfg.use_pool_allocator = True   # default is False
rmm_cfg.use_managed_memory = True   # default is False
IMPLEMENTATION DETAILS
Pool allocator (CNMEM):

if (mFlags & CNMEM_FLAGS_MANAGED) {
    CNMEM_DEBUG_INFO("cudaMallocManaged(%lu)\n", size);
    CNMEM_CHECK_CUDA(cudaMallocManaged(&data, size));
    CNMEM_CHECK_CUDA(cudaMemPrefetchAsync(data, size, mDevice));
} else {
    CNMEM_DEBUG_INFO("cudaMalloc(%lu)\n", size);
    CNMEM_CHECK_CUDA(cudaMalloc(&data, size));
}

Regular RMM allocation:

if (rmm::Manager::usePoolAllocator()) {
    RMM_CHECK(rmm::Manager::getInstance().registerStream(stream));
    RMM_CHECK_CNMEM(cnmemMalloc(reinterpret_cast<void**>(ptr), size, stream));
} else if (rmm::Manager::useManagedMemory()) {
    RMM_CHECK_CUDA(cudaMallocManaged(reinterpret_cast<void**>(ptr), size));
} else {
    RMM_CHECK_CUDA(cudaMalloc(reinterpret_cast<void**>(ptr), size));
}
1. UNSPLIT DATASET “JUST WORKS”
[Line charts: GPU memory usage (GB) for ETL on the original dataset; with cudaMalloc the run crashes with OOM at the 32GB Tesla V100 limit, while with cudaMallocManaged the memory used and pool size grow past 32GB and the run completes]
2. SPEED-UP ON CONVERSION
[Bar chart: DGX-1 time (s) for ETL, PREP, and ML on 20 quarters; PREP takes 46s with cudaMalloc vs. 36s with cudaMallocManaged]
25% speed-up on PREP!
3. LARGER ML TRAINING SET
[Bar chart: DGX-1 time (s) for ETL, PREP, and ML on 20 and 28 quarters; 28 quarters with cudaMalloc fails with OOM, while with cudaMallocManaged it completes]
UNIFIED MEMORY GOTCHAS
1. UVM doesn't work with CUDA IPC - be careful when sharing data between processes
   Workaround: a separate (small) cudaMalloc pool for communication buffers (see the sketch below). In the future this will work transparently with Linux HMM.
2. Yes, you can oversubscribe, but there is a danger that it will just run very slowly
   Capture Nsight or nvprof profiles to check eviction traffic. In the future RMM may show warnings about this.
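A hedged sketch of that workaround: only ordinary cudaMalloc memory can back a CUDA IPC handle, so stage managed data into a small cudaMalloc'ed communication buffer before exporting it. The share_region/open_region helpers are hypothetical names for illustration.

// Sketch of the workaround: managed allocations cannot be exported with CUDA
// IPC, so keep a cudaMalloc'ed staging buffer for inter-process exchange.
#include <cuda_runtime.h>

cudaIpcMemHandle_t share_region(const void* managed_src, size_t bytes,
                                void** comm_buf /* out */) {
    // The communication buffer comes from cudaMalloc, not cudaMallocManaged,
    // because only ordinary device allocations can back an IPC handle.
    cudaMalloc(comm_buf, bytes);
    cudaMemcpy(*comm_buf, managed_src, bytes, cudaMemcpyDefault);

    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, *comm_buf);   // pass this handle to the peer process
    return handle;
}

// Peer process: map the exported buffer into its own address space.
void* open_region(cudaIpcMemHandle_t handle) {
    void* ptr = nullptr;
    cudaIpcOpenMemHandle(&ptr, handle, cudaIpcMemLazyEnablePeerAccess);
    return ptr;
}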
RECAP
Just to run the full pipeline on the GPU you need to:
- carefully partition input data
- adjust memory pool options throughout the pipeline
- limit training size to fit in memory

Unified Memory:
- makes life easier for data scientists - less tweaking!
- improves performance - sometimes it's faster to allocate less often & oversubscribe
- enables easy experiments with larger datasets
MEMORY MANAGEMENT IN THE FUTURE
Contribute to RAPIDS: https://github.com/rapidsai/cudf
Contribute to RMM: https://github.com/rapidsai/rmm
[Diagram: BlazingDB, OmniSci, XGBoost, cuDNN, databases, cuDF, cuML, and the NEXT BIG THING]
UNIFIED MEMORY FOR DEEP LEARNING
FROM ANALYTICS TO DEEP LEARNING
[Pipeline: Data Preparation -> Machine Learning -> Deep Learning]
PYTORCH INTEGRATION
PyTorch uses a caching allocator to manage GPU memory:
- small allocations are served from a fixed buffer (e.g. 1 MB)
- large allocations are dedicated cudaMalloc's

Trivial change:
- replace cudaMalloc with cudaMallocManaged
- immediately call cudaMemPrefetchAsync to allocate pages on the GPU; otherwise cuDNN may select sub-optimal kernels (see the sketch below)
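A minimal sketch of that change, with a hypothetical raw_alloc hook standing in for the caching allocator's call site (this is not PyTorch's actual internal code):

// Hedged sketch of the "trivial change" described above.
#include <cuda_runtime.h>

cudaError_t raw_alloc(void** ptr, size_t size, int device, cudaStream_t stream) {
    // Before: cudaMalloc(ptr, size);
    cudaError_t err = cudaMallocManaged(ptr, size);
    if (err != cudaSuccess) return err;

    // Prefetch immediately so the pages are resident on the GPU; without this,
    // cuDNN's heuristics may see non-resident memory and pick slower kernels.
    return cudaMemPrefetchAsync(*ptr, size, device, stream);
}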
PYTORCH ALLOCATOR VS RMM
PyTorch Caching Allocator:
- memory pool to avoid synchronization on malloc/free
- directly uses CUDA APIs for memory allocations
- pool size is not fixed
- specific to the PyTorch C++ library

RMM:
- memory pool to avoid synchronization on malloc/free
- uses cnmem for memory allocation and management
- reserves half the available GPU memory for the pool
- re-usable across projects, with interfaces for various languages
WORKLOADS
Image Models: ResNet-1001, DenseNet-264, VNet
[Block diagram: BN-ReLU-Conv 1x1 -> BN-ReLU-Conv 3x3 -> BN-ReLU-Conv 1x1, with a skip connection (+)]
WORKLOADS
Language Models: word language modelling
- dictionary size = 33278
- embedding size = 256
- LSTM units = 256
- backpropagation through time = 1408 and 2800
[Block diagram: Embedding -> LSTM -> FC -> Softmax -> Loss]
WORKLOADS
Baseline Training Performance on V100-32GB
Model            FP16 Batch Size  FP16 Samples/sec  FP32 Batch Size  FP32 Samples/sec
ResNet-1001      98               98.7              48               44.3
DenseNet-264     218              255.8             109              143.1
VNet             30               3.56              15               3.4
Lang_Model-1408  32               94.9              40               77.9
Lang_Model-2800  16               46.5              18               35.7

Optimal batch size selected for high throughput. All results in this presentation use PyTorch 1.0rc1, the R418 driver, and a Tesla V100-32GB.
GPU OVERSUBSCRIPTION
Up to 3x Optimal Batch Size
[Line charts: samples/sec vs. batch size for ResNet-1001 and DenseNet-264, FP16 and FP32, with batch sizes extended well past the in-memory optimum]
GPU OVERSUBSCRIPTION
Fill
[Diagram: managed allocations fill GPU memory while the remainder stays in CPU memory]
GPU OVERSUBSCRIPTION
Evict
[Diagram: once GPU memory is full, resident pages are evicted to CPU memory]
GPU OVERSUBSCRIPTION
Page Fault-Evict-Fetch
[Diagram: a GPU page fault triggers eviction of a resident page, then the faulting page is fetched from CPU memory]
GPU OVERSUBSCRIPTION
Results
Model            FP16 Batch Size  FP16 Samples/sec  FP32 Batch Size  FP32 Samples/sec
ResNet-1001      202              10.1              98               5
DenseNet-264     430              22.3              218              12.1
VNet             32               3                 32               1.1
Lang_Model-1408  44               8.4               44               10
Lang_Model-2800  22               4.1               22               4.9
GPU OVERSUBSCRIPTION
Page Faults - ResNet-1001 Training Iteration
[Line chart: page fault count vs. oversubscription factor (batch size / optimal batch size) from 1x to 3x for ResNet-1001]
GPU OVERSUBSCRIPTION
Manual API Prefetch
Add cudaMemPrefetchAsync before kernels are called:

cudaMemPrefetchAsync(…)   // input, output, wparam
cudnnConvolutionForward(…)

cudaMemPrefetchAsync(…)   // A, B, C
kernelPointWiseApply3(…)
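A self-contained sketch of the same pattern, with a simple axpy kernel standing in for the cuDNN and pointwise kernels above (launch_with_prefetch is an illustrative name):

// Prefetch operands to the GPU on the kernel's stream before launching, so
// the kernel does not stall on page faults.
#include <cuda_runtime.h>

__global__ void axpy(const float* x, float* y, float a, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void launch_with_prefetch(const float* x, float* y, float a, size_t n,
                          int device, cudaStream_t stream) {
    // Prefetch inputs and outputs on the same stream as the kernel: the
    // migrations overlap with prior work and complete before the launch runs.
    cudaMemPrefetchAsync(x, n * sizeof(float), device, stream);
    cudaMemPrefetchAsync(y, n * sizeof(float), device, stream);
    axpy<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(x, y, a, n);
}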
GPU OVERSUBSCRIPTION
No Prefetch vs Manual API Prefetch
GPU OVERSUBSCRIPTION
Speed-up from Manual API Prefetch
[Bar chart: speed-up (x) of cudaMemPrefetchAsync vs. no prefetch for ResNet-1001, DenseNet-264, VNet, Lang_Model-1408, and Lang_Model-2800, in FP16 and FP32]
Observe up to 1.6x speed-up.
GPU OVERSUBSCRIPTION
Prefetch Only When Needed
- prefetch memory before kernels to improve performance
- cudaMemPrefetchAsync takes CPU cycles, so it degrades performance when not required
- automatic prefetching is needed to achieve high performance
[Line chart: samples/sec vs. batch size for ResNet-1001, FP16 with and without prefetch]
DRIVER PREFETCH
Aggressive driver prefetching
- driver-initiated (density-based) prefetching from CPU to GPU
- GPU pages are tracked as chunks of smaller sysmem pages
- driver logic: prefetch the rest of a GPU page once 51% of it has migrated to the GPU
- changing the threshold to 5% yields up to 20% performance gain over the default settings (see the sketch below)
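Illustrative pseudocode of the density policy just described; the real logic lives inside the CUDA driver, so the GpuPage struct and should_prefetch_rest function here are purely hypothetical.

// Illustrative C++ pseudocode of the density heuristic, not driver code.
struct GpuPage {
    int resident_chunks;   // sysmem-page-sized chunks already migrated to the GPU
    int total_chunks;      // chunks making up one large GPU page
};

bool should_prefetch_rest(const GpuPage& page, double threshold /* 0.51 or 0.05 */) {
    double density = (double)page.resident_chunks / page.total_chunks;
    // Default policy: pull in the rest of the page once a majority (51%) has
    // migrated. Lowering the threshold to 5% prefetches far more aggressively,
    // which the authors measured at up to 20% faster than the default.
    return density >= threshold;
}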
FRAMEWORK FUTURE
Frameworks can develop intelligence to insert prefetches before calling GPU kernels
- Smart evictions: activations only
- Lazy prefetch: catch kernel calls right before execution and add prefetch calls
- Eager prefetch: identify and add prefetch calls before the kernels are called

[Diagram: hook nn.Conv2d(…) and replace it with nn.Prefetch(…) followed by nn.Conv2d(…); in the dataflow x -> (W1 *) -> y -> (W2 *) -> z, prefetches execute in parallel with compute]
TAKEAWAY
- Unified Memory oversubscription solves the memory pool fragmentation issue
- Simple way to train bigger models and on larger input data
- Minimal user effort, no change in framework programming
- Frameworks can get better performance by adding prefetches
Try it out and contribute:
https://github.com/rapidsai/cudf
https://github.com/rapidsai/rmm