

SLIDE 1

Using Tensor Swapping and NVLink to Overcome GPU Memory Limits with TensorFlow

Sam Matzek

SLIDE 2

Deep learning is memory constrained

  • GPUs have limited memory
  • Neural networks are growing deeper and wider
  • Amount and size of data to process is always growing
SLIDE 3

GPU Memory Usage

[Diagram: GPU memory during training holds the tensors (layer outputs), the input data, and the kernels; the loss is computed on the GPU.]

SLIDE 4

Model Training in GPU Memory

[Diagram: during training, Tensor 1, Tensor 2, Tensor 3, and the loss all reside in GPU memory.]

SLIDE 5

Model Training with Tensor Swapping

[Diagram: with tensor swapping, only the tensor currently in use (Tensor 3) resides in GPU memory; Tensor 1, Tensor 2, and Tensor 4 are held in system memory.]
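The swapping scheme on this slide can be sketched as a toy simulation in plain Python (illustrative only, not TFLMS itself): tensors produced during the forward pass are evicted to host memory when GPU memory fills, and brought back when the backward pass needs them.

```python
# Toy simulation of tensor swapping between GPU and system memory.
# Capacities and tensor names are illustrative, not taken from TFLMS.

class SwapManager:
    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity  # max tensors resident on the GPU
        self.gpu = []                     # tensors in GPU memory (oldest first)
        self.host = []                    # tensors swapped out to system memory

    def produce(self, name):
        """Forward pass produced a tensor; swap out the oldest if GPU is full."""
        if len(self.gpu) >= self.gpu_capacity:
            self.host.append(self.gpu.pop(0))  # swap out over the CPU-GPU link
        self.gpu.append(name)

    def fetch(self, name):
        """Backward pass needs a tensor; swap it back in if necessary."""
        if name in self.host:
            self.host.remove(name)
            if len(self.gpu) >= self.gpu_capacity:
                self.host.append(self.gpu.pop(0))
            self.gpu.append(name)
        return name

mgr = SwapManager(gpu_capacity=1)
for t in ["tensor1", "tensor2", "tensor3"]:
    mgr.produce(t)
print(mgr.gpu)   # ['tensor3']
print(mgr.host)  # ['tensor1', 'tensor2']
```

With capacity for one tensor, the end state matches the diagram: the most recent tensor sits in GPU memory while the earlier ones wait in system memory until the backward pass fetches them.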

SLIDE 6

TensorFlow Large Model Support Graph Modifications

[Diagram: TFLMS inserts swap-out and swap-in nodes between graph operations A and B, so the tensor produced by A on the GPU is moved to CPU memory and brought back just before B executes.]

https://arxiv.org/pdf/1807.02037.pdf
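The rewrite described in the paper can be illustrated with a minimal sketch (the graph representation here is illustrative, not TensorFlow's internal one): a direct edge A→B is replaced by a chain that routes the tensor through swap-out and swap-in nodes.

```python
# Minimal sketch of a TFLMS-style graph rewrite: the direct edge
# A -> B becomes A -> swap_out -> swap_in -> B.

def insert_swap_nodes(edges, src, dst):
    """Replace edge (src, dst) with a swap-out/swap-in chain."""
    rewritten = []
    for a, b in edges:
        if (a, b) == (src, dst):
            rewritten += [
                (a, f"swap_out_{a}"),               # copy tensor to CPU memory
                (f"swap_out_{a}", f"swap_in_{a}"),  # tensor parked in system RAM
                (f"swap_in_{a}", b),                # copy back before B runs
            ]
        else:
            rewritten.append((a, b))
    return rewritten

graph = [("A", "B")]
print(insert_swap_nodes(graph, "A", "B"))
# [('A', 'swap_out_A'), ('swap_out_A', 'swap_in_A'), ('swap_in_A', 'B')]
```

The real implementation additionally adds control dependencies so the swap-in starts early enough to overlap the transfer with computation.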

SLIDE 7

Enabling TensorFlow Large Model Support

* Examples for TFLMS v2.0.0, covering both the Keras API and the Estimator API
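The slide's code is not preserved in this transcript. As a hedged sketch, TFLMS documented enablement via a Keras callback and an Estimator session hook along these lines; the module and class names below are assumptions drawn from the TFLMS documentation and may differ in your installed version, so the import is guarded (TFLMS ships with IBM Watson Machine Learning CE, not stock TensorFlow).

```python
# Hedged sketch of enabling TensorFlow Large Model Support.
# Class names are assumptions from the TFLMS docs; verify against the
# README at https://github.com/IBM/tensorflow-large-model-support.
try:
    from tensorflow_large_model_support import LMSKerasCallback, LMSSessionRunHook
    HAVE_TFLMS = True
except ImportError:
    HAVE_TFLMS = False  # TFLMS not installed; train without swapping

def training_hooks():
    """Return the LMS callback/hook pair when TFLMS is available."""
    if not HAVE_TFLMS:
        return {"keras_callbacks": [], "estimator_hooks": []}
    return {
        "keras_callbacks": [LMSKerasCallback()],   # model.fit(..., callbacks=...)
        "estimator_hooks": [LMSSessionRunHook()],  # estimator.train(..., hooks=...)
    }

hooks = training_hooks()
```

In either API the idea is the same: the callback/hook triggers the graph rewrite shown on the previous slide before training starts, with no changes to the model code itself.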

SLIDE 8

What’s possible with Large Model Support?

  • 10x image resolution - Keras ResNet50
  • 10x image resolution - DeepLabV3 2D image segmentation
  • 5x MRI resolution - 3D U-Net 3D image segmentation

Measured with TFLMS v2.0.0 on TensorFlow 1.13, CUDA 10.1, cuDNN 7.5

SLIDE 9

3D U-Net image segmentation

  • 3D U-Net generally has high memory requirements
  • International Multimodal Brain Tumor Segmentation Challenge (BraTS)
  • Existing Keras model with TensorFlow backend
SLIDE 10

Effect of 2x resolution on Dice Coefficients

(higher is better)

SLIDE 11

“Swapping makes everything slow”

SLIDE 12

Typical GPU connectivity

[Diagram: two GPUs connect to the CPU over PCIe at 32 GB/s and to each other via NVLink; the CPU reaches system memory over a 76.8 GB/s memory bus.]

SLIDE 13

POWER9 CPU to GPU connectivity

[Diagram: the POWER9 CPU connects to each GPU over NVLink 2.0 at 150 GB/s, and the GPUs are also linked by NVLink 2.0; the system memory bus provides 170 GB/s.]
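The bandwidth gap between the two diagrams explains why swapping need not be slow. A back-of-the-envelope calculation (link speeds from the slides; the 148 GB swap volume is borrowed from a measurement on a later slide) compares the time to move the same tensor traffic over PCIe versus NVLink 2.0.

```python
# Back-of-envelope transfer times for host<->GPU tensor swapping,
# using the link speeds quoted on the connectivity slides.
PCIE_GBPS = 32.0      # PCIe Gen3 x16 between CPU and GPU, GB/s
NVLINK2_GBPS = 150.0  # POWER9 CPU <-> GPU NVLink 2.0, GB/s

def transfer_seconds(gigabytes, gbps):
    """Ideal (lower-bound) time to move `gigabytes` over a link."""
    return gigabytes / gbps

swap_volume_gb = 148.0  # example swap traffic from a later DeepLabV3 slide
pcie_s = transfer_seconds(swap_volume_gb, PCIE_GBPS)
nvlink_s = transfer_seconds(swap_volume_gb, NVLINK2_GBPS)
print(f"PCIe:   {pcie_s:.1f} s")    # ~4.6 s
print(f"NVLink: {nvlink_s:.1f} s")  # ~1.0 s
```

Whatever the swap volume, NVLink 2.0 moves it 150/32 ≈ 4.7x faster than PCIe, which is what keeps the GPU from stalling on swap traffic in the measurements that follow.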

SLIDE 14

Effects of NVLink 2.0 on Large Model Support

[Charts: a PCIe-connected GPU versus an NVLink 2.0-connected GPU, each training one high-resolution 3D MRI with Large Model Support.]

SLIDE 15

Effects of NVLink 2.0 on epoch times

SLIDE 16

Effects of NVLink 2.0 on GPU Utilization

SLIDE 17

Multi-GPU model training with NVLink 2.0

http://ibm.biz/3dunet-tflms-multigpu

2.1x faster with HALF the number of GPUs!

SLIDE 18

Patches versus whole image

[Chart: patch-based versus whole-image training; 3x and 3.5x speedups shown.]

https://arxiv.org/abs/1812.07816

SLIDE 19

Overhead of Large Model Support with NVLink 2.0

Measured with TFLMS v2.0.0 on TensorFlow 1.13, CUDA 10.1, cuDNN 7.5

SLIDE 20

Overhead of Large Model Support with NVLink 2.0

Measured with TFLMS v2.0.0 on TensorFlow 1.13, CUDA 10.1, cuDNN 7.5

SLIDE 21

Overhead of Large Model Support with NVLink 2.0

Using bs=16, fine_tune_batch_norm=true, measured on 32GB GPU with TensorFlow 1.13, CUDA 10.1, cuDNN 7.5

  • 1.2 GB transferred to GPU, GPU utilization 81%
  • LMS enabled:
  • 148 GB transferred to GPU, GPU utilization 90%
  • 438 GB transferred to GPU, GPU utilization 89%
  • 826 GB transferred to GPU, GPU utilization 84%
  • 1.4 TB transferred to GPU, GPU utilization 64%

DeepLabV3 on POWER9 with 32GB NVIDIA Volta V100
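To put those transfer volumes in perspective, a rough calculation (assuming all swap traffic crosses the 150 GB/s CPU-GPU NVLink 2.0 links; the labels are illustrative, not from the slide) gives a lower bound on link time spent swapping per run:

```python
# Lower-bound swap time per run, assuming traffic moves at the full
# NVLink 2.0 rate of 150 GB/s (real transfers overlap with compute).
NVLINK2_GBPS = 150.0

# Transfer volumes from the DeepLabV3 measurements; keys are illustrative.
transfers_gb = {"no_lms": 1.2, "lms_a": 148.0, "lms_b": 438.0,
                "lms_c": 826.0, "lms_d": 1400.0}

swap_seconds = {k: gb / NVLINK2_GBPS for k, gb in transfers_gb.items()}
for k, s in swap_seconds.items():
    print(f"{k}: {s:.1f} s of link time")
```

Even the 1.4 TB case needs under ten seconds of raw link time; the utilization drop at that volume suggests the swaps can no longer be fully hidden behind computation.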

SLIDE 22

Large Model Support with NVLink 2.0

  • Tensor swapping can be used to overcome GPU memory limits
  • Allows training of:
  • deeper models
  • higher resolution data
  • larger batch sizes
  • NVLink 2.0 between CPU and GPU allows tensor swapping with minimal overhead

SLIDE 23

More information

TensorFlow Large Model Support
https://github.com/IBM/tensorflow-large-model-support

TFLMS: Large Model Support in TensorFlow by Graph Rewriting
https://arxiv.org/pdf/1807.02037.pdf

TensorFlow Large Model Support Case Study
https://developer.ibm.com/linuxonpower/2018/07/27/tensorflow-large-model-support-case-study-3d-image-segmentation/

Performance of 3DUnet Multi GPU Model for Medical Image Segmentation using TensorFlow Large Model Support
http://ibm.biz/3dunet-tflms-multigpu

Fast and Accurate 3D Medical Image Segmentation with Data-swapping Method
https://arxiv.org/abs/1812.07816

Data-parallel distributed training of very large models beyond GPU capacity
https://arxiv.org/abs/1811.12174

POWER9 server with NVLink 2.0 connections between CPU and GPU (IBM AC922)
https://www.ibm.com/us-en/marketplace/power-systems-ac922