

SLIDE 1

Nikolay Sakharnykh, Chirayu Garg, and Dmitri Vainbrand, Thu Mar 19, 3:00 PM

UNIFIED MEMORY FOR DATA ANALYTICS AND DEEP LEARNING

SLIDE 2

RAPIDS

CUDA-accelerated Data Science Libraries

[Stack diagram: DASK / SPARK and DL FRAMEWORKS on top; RAPIDS libraries (cuDF, cuML, cuGraph) and cuDNN in the middle; Apache Arrow on GPU memory; CUDA underneath; Python across the stack]

SLIDE 3

MORTGAGE PIPELINE: ETL

https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb

CSV → DF → Arrow: read CSV, filter, join, groupby

SLIDE 4

MORTGAGE PIPELINE: PREP + ML

https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb

Arrow + Arrow + Arrow → concat → DF → convert → DMatrix → XGBoost

SLIDE 5

GTC EU KEYNOTE RESULTS ON DGX-1

[Bar chart: Mortgage workflow time breakdown on DGX-1 (s), split into ETL, PREP, and ML]

SLIDE 6

MAXIMUM MEMORY USAGE ON DGX-1

[Bar chart: maximum memory usage (GB) per GPU (IDs 1–8) on DGX-1, against the Tesla V100 limit of 32GB]

SLIDE 7

ETL INPUT

  • Original input set: 112 quarters (~2–3 GB each)
  • Split input set: 240 files (~1 GB each)

https://rapidsai.github.io/demos/datasets/mortgage-data

SLIDE 8

CAN WE AVOID INPUT SPLITTING?

[Charts: GPU memory usage (GB) during ETL. With the 112-part split input, usage stays under the Tesla V100 32GB limit; on the original (unsplit) dataset it exceeds the limit and crashes with OOM.]

SLIDE 9

ML INPUT

Arrow + Arrow + Arrow → concat → DF → convert → DMatrix → XGBoost

Some number of quarters is used for ML training.

SLIDE 10

CAN WE TRAIN ON MORE DATA?

[Charts: GPU memory usage (GB) during PREP. With 112→20 parts, usage stays under the Tesla V100 32GB limit; with 112→28 parts it exceeds the limit and crashes with OOM.]

SLIDE 11

HOW MEMORY IS MANAGED IN RAPIDS

SLIDE 12

RAPIDS MEMORY MANAGER

RAPIDS Memory Manager (RMM) is:

  • A replacement allocator for CUDA Device Memory
  • A pool allocator to make CUDA device memory allocation faster & asynchronous
  • A central place for all device memory allocations in cuDF and other RAPIDS libraries

https://github.com/rapidsai/rmm

SLIDE 13

WHY DO WE NEED MEMORY POOLS

cudaMalloc/cudaFree are synchronous

  • block the device

cudaMalloc/cudaFree are expensive

  • cudaFree must zero memory for security
  • cudaMalloc creates peer mappings for all GPUs

Using the cnmem memory pool improves RAPIDS ETL time by 10x.

cudaMalloc(&buffer, size_in_bytes);
cudaFree(buffer);
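The pooling idea can be sketched in a few lines of Python (an illustrative model, not RMM's implementation): reserve one large slab up front, serve sub-allocations with a bump pointer, and recycle freed blocks from per-size free lists, so the expensive backing allocation happens only once.

```python
# Illustrative sketch (not RMM's actual code) of why pools help: the single
# upfront reservation stands in for one cudaMalloc; every later alloc/free is
# cheap bookkeeping with no device synchronization.

class PoolAllocator:
    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.offset = 0          # bump pointer into the slab
        self.free_lists = {}     # block size -> recycled offsets
        self.backend_allocs = 1  # the single upfront "cudaMalloc"

    def alloc(self, size):
        blocks = self.free_lists.get(size)
        if blocks:
            return blocks.pop()  # reuse a freed block: no device call at all
        if self.offset + size > self.pool_size:
            raise MemoryError("pool exhausted")
        ptr = self.offset
        self.offset += size
        return ptr

    def free(self, ptr, size):
        # Freeing is a list append: no cudaFree, no zeroing, no blocking.
        self.free_lists.setdefault(size, []).append(ptr)

pool = PoolAllocator(1 << 30)
a = pool.alloc(1024)
pool.free(a, 1024)
b = pool.alloc(1024)  # same block comes back; still one backend allocation
```

The 10x ETL speed-up quoted above comes from removing exactly these synchronizing cudaMalloc/cudaFree calls from the hot path.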

SLIDE 14

RAPIDS MEMORY MANAGER (RMM)

Fast, Asynchronous Device Memory Management

C/C++:

RMM_ALLOC(&buffer, size_in_bytes, stream_id);
RMM_FREE(buffer, stream_id);

Python (drop-in replacement for the Numba API):

dev_ones = rmm.device_array(np.ones(count))
dev_twos = rmm.device_array_like(dev_ones)
# also rmm.to_device(), rmm.auto_device(), etc.

Thrust (device vector and execution policies):

#include <rmm_thrust_allocator.h>
rmm::device_vector<int> dvec(size);
thrust::sort(rmm::exec_policy(stream)->on(stream), …);

SLIDE 15

MANAGING MEMORY IN THE E2E PIPELINE

At this point all ETL processing is done and the data is stored in Arrow memory. Performance optimization is required to avoid OOM.

SLIDE 16

KEY MEMORY MANAGEMENT QUESTIONS

  • Can we make memory management easier?
  • Can we avoid artificial pre-processing of input data?
  • Can we train on larger datasets?
SLIDE 17

SOLUTION: UNIFIED MEMORY

[Diagram: successive cudaMallocManaged calls place pages on the GPU, taking memory from empty to partially to fully occupied; once oversubscribed, further allocations evict pages to CPU memory.]
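The oversubscription behavior can be modeled in a few lines (an illustrative toy, not the actual UVM driver; the LRU eviction policy here is an assumption for the sketch):

```python
from collections import OrderedDict

# Toy model of Unified Memory oversubscription: the GPU holds a fixed number
# of pages; touching a non-resident page on a full device evicts a resident
# page (least recently used here, as a modeling assumption) to host memory.

class ManagedMemory:
    def __init__(self, gpu_capacity_pages):
        self.capacity = gpu_capacity_pages
        self.resident = OrderedDict()   # page id -> True, in LRU order
        self.evictions = 0

    def touch(self, page):
        if page in self.resident:
            self.resident.move_to_end(page)    # mark as recently used
            return "hit"
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict LRU page to CPU memory
            self.evictions += 1
        self.resident[page] = True
        return "fault"

mem = ManagedMemory(gpu_capacity_pages=2)
mem.touch("A")
mem.touch("B")
mem.touch("C")   # GPU is full: "A" is evicted to make room
```

This is why oversubscribed runs can complete instead of crashing with OOM, at the cost of eviction traffic when the working set exceeds GPU memory.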

SLIDE 18

HOW TO USE UNIFIED MEMORY IN CUDF

Python:

from librmm_cffi import librmm_config as rmm_cfg

rmm_cfg.use_pool_allocator = True  # default is False
rmm_cfg.use_managed_memory = True  # default is False

SLIDE 19

IMPLEMENTATION DETAILS

Pool allocator (CNMEM):

if (mFlags & CNMEM_FLAGS_MANAGED) {
    CNMEM_DEBUG_INFO("cudaMallocManaged(%lu)\n", size);
    CNMEM_CHECK_CUDA(cudaMallocManaged(&data, size));
    CNMEM_CHECK_CUDA(cudaMemPrefetchAsync(data, size, mDevice));
} else {
    CNMEM_DEBUG_INFO("cudaMalloc(%lu)\n", size);
    CNMEM_CHECK_CUDA(cudaMalloc(&data, size));
}

Regular RMM allocation:

if (rmm::Manager::usePoolAllocator()) {
    RMM_CHECK(rmm::Manager::getInstance().registerStream(stream));
    RMM_CHECK_CNMEM(cnmemMalloc(reinterpret_cast<void**>(ptr), size, stream));
} else if (rmm::Manager::useManagedMemory()) {
    RMM_CHECK_CUDA(cudaMallocManaged(reinterpret_cast<void**>(ptr), size));
} else {
    RMM_CHECK_CUDA(cudaMalloc(reinterpret_cast<void**>(ptr), size));
}

SLIDE 20

1. UNSPLIT DATASET “JUST WORKS”

[Charts: GPU memory usage (GB) during ETL on the original dataset. With cudaMallocManaged (mem used vs. pool size) the run completes; with cudaMalloc it hits the Tesla V100 32GB limit and crashes with OOM.]

SLIDE 21

2. SPEED-UP ON CONVERSION

[Bar chart: DGX-1 time (s) broken into ETL / PREP / ML, for 20 quarters with cudaMalloc (PREP 46 s) vs. 20 quarters with cudaMallocManaged (PREP 36 s)]

25% speed-up on PREP!

SLIDE 22

3. LARGER ML TRAINING SET

[Bar chart: DGX-1 time (s) broken into ETL / PREP / ML, for 20 and 28 quarters with cudaMalloc vs. cudaMallocManaged; the 28-quarter cudaMalloc run fails with OOM, while cudaMallocManaged completes.]

SLIDE 23

UNIFIED MEMORY GOTCHAS

  • 1. UVM doesn't work with CUDA IPC - be careful when sharing data between processes

Workaround: a separate (small) cudaMalloc pool for communication buffers. In the future this will work transparently with Linux HMM.

  • 2. Yes, you can oversubscribe, but there is a danger that it will just run very slowly

Capture Nsight or nvprof profiles to check eviction traffic. In the future RMM may show warnings about this.

SLIDE 24

RECAP

Without Unified Memory, just to run the full pipeline on the GPU you need to:

  • carefully partition input data
  • adjust memory pool options throughout the pipeline
  • limit training size to fit in memory

Unified Memory:

  • makes life easier for data scientists - less tweaking!
  • improves performance - sometimes it's faster to allocate less often & oversubscribe
  • enables easy experiments with larger datasets

SLIDE 25

MEMORY MANAGEMENT IN THE FUTURE

Contribute to RAPIDS: https://github.com/rapidsai/cudf
Contribute to RMM: https://github.com/rapidsai/rmm

[Diagram: ecosystem sharing memory management - cuDF, cuML, XGBoost, cuDNN, BlazingDB, OmniSci, databases, and the "next big thing"]

SLIDE 26

UNIFIED MEMORY FOR DEEP LEARNING

SLIDE 27

FROM ANALYTICS TO DEEP LEARNING

Data Preparation → Machine Learning → Deep Learning

SLIDE 28

PYTORCH INTEGRATION

PyTorch uses a caching allocator to manage GPU memory:

  • Small allocations are served from a fixed buffer (for example, 1 MB)
  • Large allocations are dedicated cudaMalloc's

Trivial change:

  • Replace cudaMalloc with cudaMallocManaged
  • Immediately call cudaMemPrefetchAsync to allocate pages on the GPU - otherwise cuDNN may select sub-optimal kernels
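The caching-allocator scheme described above can be sketched as follows (illustrative Python, not PyTorch's actual implementation; `_backend_alloc` stands in for cudaMalloc or cudaMallocManaged):

```python
# Illustrative sketch of a caching allocator in the spirit described above:
# small requests are rounded up to a fixed block size and served from a cache
# of freed blocks; large requests get a dedicated backend allocation.

SMALL_BLOCK = 1 << 20  # 1 MB, as in the slide's example

class CachingAllocator:
    def __init__(self):
        self.small_cache = []   # freed 1 MB blocks, ready for reuse
        self.backend_calls = 0  # stands in for cudaMalloc invocations
        self.next_ptr = 0

    def _backend_alloc(self, size):
        self.backend_calls += 1  # the expensive, synchronizing call
        ptr = self.next_ptr
        self.next_ptr += size
        return ptr

    def malloc(self, size):
        if size <= SMALL_BLOCK:
            if self.small_cache:
                return self.small_cache.pop()  # cache hit: no backend call
            return self._backend_alloc(SMALL_BLOCK)
        return self._backend_alloc(size)       # large: dedicated allocation

    def free(self, ptr, size):
        if size <= SMALL_BLOCK:
            self.small_cache.append(ptr)       # keep the block for reuse

alloc = CachingAllocator()
p = alloc.malloc(4096)
alloc.free(p, 4096)
q = alloc.malloc(8192)  # reuses the cached 1 MB block: no new backend call
```

Swapping the backend call from cudaMalloc to cudaMallocManaged leaves all of this caching logic untouched, which is why the slide calls the change trivial.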

SLIDE 29

PYTORCH ALLOCATOR VS RMM

Memory pool to avoid synchronization on malloc/free Directly uses CUDA APIs for memory allocations Pool size not fixed Specific to PyTorch C++ library

PyTorch Caching Allocator

Memory pool to avoid synchronization on malloc/free Uses Cnmem for memory allocation and management Reserves half the available GPU memory for pool Re-usable across projects and with interfaces for various languages

RMM

SLIDE 30

WORKLOADS

Image Models: ResNet-1001, DenseNet-264, VNet

[Diagram: BN-ReLU-Conv 1x1 → BN-ReLU-Conv 3x3 → BN-ReLU-Conv 1x1, combined with a skip connection (+)]

SLIDE 31

WORKLOADS

Language Models: Word Language Modelling

  • Dictionary size = 33278
  • Embedding size = 256
  • LSTM units = 256
  • Backpropagation through time = 1408 and 2800

[Diagram: Embedding → LSTM → FC → Softmax Loss]

SLIDE 32

WORKLOADS

Baseline Training Performance on V100-32GB

Model             FP16 Batch Size   FP16 Samples/sec   FP32 Batch Size   FP32 Samples/sec
ResNet-1001       98                98.7               48                44.3
DenseNet-264      218               255.8              109               143.1
VNet              30                3.56               15                3.4
Lang_Model-1408   32                94.9               40                77.9
Lang_Model-2800   16                46.5               18                35.7

Optimal batch size selected for high throughput. All results in this presentation use PyTorch 1.0rc1, the R418 driver, and Tesla V100-32GB.

SLIDE 33

GPU OVERSUBSCRIPTION

Up to 3x Optimal Batch Size

[Charts: Samples/sec vs. batch size (FP16 and FP32) for ResNet-1001 and DenseNet-264, with batch sizes up to roughly 3x the optimal]

SLIDE 34

GPU OVERSUBSCRIPTION

Fill

[Diagram: pages fill GPU memory, with overflow pages residing in CPU memory]

SLIDE 35

GPU OVERSUBSCRIPTION

Evict

[Diagram: with GPU memory full, resident pages are evicted to CPU memory]

SLIDE 36

GPU OVERSUBSCRIPTION

Page Fault-Evict-Fetch

[Diagram: a page fault on a non-resident page evicts a GPU page and fetches the faulting page from CPU memory]

SLIDE 37

GPU OVERSUBSCRIPTION

Results

Model             FP16 Batch Size   FP16 Samples/sec   FP32 Batch Size   FP32 Samples/sec
ResNet-1001       202               10.1               98                5
DenseNet-264      430               22.3               218               12.1
VNet              32                3                  32                1.1
Lang_Model-1408   44                8.4                44                10
Lang_Model-2800   22                4.1                22                4.9

SLIDE 38

GPU OVERSUBSCRIPTION

Page Faults - ResNet-1001 Training Iteration

[Chart: page fault count vs. oversubscription factor (batch size / optimal batch size, from 1 to 3) for ResNet-1001]

SLIDE 39

GPU OVERSUBSCRIPTION

Add cudaMemPrefetchAsync before kernels are called

Manual API Prefetch

cudaMemPrefetchAsync(…)  // input, output, wparam
cudnnConvolutionForward(…)

cudaMemPrefetchAsync(…)  // A, B, C
kernelPointWiseApply3(…)

SLIDE 40

GPU OVERSUBSCRIPTION

No Prefetch vs Manual API Prefetch

SLIDE 41

GPU OVERSUBSCRIPTION

Speed up from Manual API Prefetch

[Bar chart: speed-up (x) from cudaMemPrefetchAsync vs. no prefetch (FP16_Prefetch and FP32_Prefetch) for ResNet-1001, DenseNet-264, VNet, Lang_Model-1408, and Lang_Model-2800]

Observe up to 1.6x speed-up

SLIDE 42

GPU OVERSUBSCRIPTION

  • Prefetching memory before a kernel improves performance
  • cudaMemPrefetchAsync takes CPU cycles - it degrades performance when not required
  • Automatic prefetching is needed to achieve high performance

Prefetch Only When Needed

[Chart: ResNet-1001 Samples/sec vs. batch size, FP16 vs. FP16_prefetch]

SLIDE 43

DRIVER PREFETCH

Aggressive driver prefetching:

  • Driver-initiated (density-based) prefetching from CPU to GPU
  • GPU pages are tracked as chunks of smaller sysmem pages
  • Driver logic: prefetch the rest of the GPU page when 51% has migrated to the GPU
  • Changing the threshold to 5% gives up to 20% performance gain vs. default settings
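The density heuristic can be sketched like this (illustrative Python; only the 51% and 5% thresholds come from the slide, everything else - the page and chunk sizes, the function shape - is an assumption for the sketch):

```python
# Toy model of density-based driver prefetching: a large GPU page is tracked
# as smaller sysmem chunks; once the migrated fraction crosses a threshold,
# the driver prefetches the remainder of the page in one bulk operation.

def migrate(chunks_total, threshold):
    """Return (chunks migrated via faults, chunks migrated via bulk prefetch)."""
    migrated = 0
    for _ in range(chunks_total):
        migrated += 1                     # one chunk arrives via a page fault
        if migrated / chunks_total >= threshold:
            # Density threshold reached: prefetch the rest of the page at once.
            return migrated, chunks_total - migrated
    return migrated, 0

# A GPU page tracked as 32 sysmem chunks (chunk count assumed for illustration):
faults_default, bulk_default = migrate(32, threshold=0.51)  # default driver
faults_aggr, bulk_aggr = migrate(32, threshold=0.05)        # aggressive setting
# A lower threshold means far fewer slow fault-driven migrations before the
# bulk prefetch kicks in - the source of the up-to-20% gain reported above.
```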

SLIDE 44

FRAMEWORK FUTURE

The framework can develop intelligence to insert prefetches before calling GPU kernels.

  • Smart evictions: activations only
  • Lazy prefetch: catch kernel calls right before execution and add prefetch calls
  • Eager prefetch: identify and add prefetch calls before the kernels are called

nn.Conv2d(…) (hook) → replaced by: nn.Prefetch(…) then nn.Conv2d(…)

[Diagram: chained layers x →(W1)→ y →(W2)→ z, with prefetch and compute executing in parallel]
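The eager-prefetch idea can be sketched as follows (illustrative Python; `prefetch` stands in for cudaMemPrefetchAsync and the `Layer` class is hypothetical, not a PyTorch API):

```python
# Illustrative sketch of eager prefetching in a framework: before executing
# layer i, issue an (asynchronous) prefetch for layer i+1's data so the
# migration overlaps with layer i's compute. `prefetch_log` records the calls
# that would be cudaMemPrefetchAsync in a real framework.

prefetch_log = []

def prefetch(tensor_name):
    prefetch_log.append(tensor_name)  # would be cudaMemPrefetchAsync(...)

class Layer:
    def __init__(self, name):
        self.name = name
    def forward(self, x):
        return x + 1                  # stand-in for the real GPU kernel

def run_with_eager_prefetch(layers, x):
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            prefetch(layers[i + 1].name)  # overlap next layer's migration
        x = layer.forward(x)
    return x

net = [Layer("conv1"), Layer("conv2"), Layer("fc")]
out = run_with_eager_prefetch(net, 0)
```

Lazy prefetch differs only in where the hook fires: it intercepts each kernel call right before execution instead of walking the model graph ahead of time.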

SLIDE 45

TAKEAWAY

  • Unified Memory oversubscription solves the memory pool fragmentation issue
  • Simple way to train bigger models and on larger input data - minimal user effort, no change in framework programming
  • Frameworks can get better performance by adding prefetches

Try it out and contribute:
https://github.com/rapidsai/cudf
https://github.com/rapidsai/rmm

SLIDE 46