SLIDE 1

S9653 – HOW TO MAKE YOUR LIFE EASIER IN THE AGE OF EXASCALE COMPUTING USING NVIDIA GPUDIRECT TECHNOLOGIES

Davide Rossetti, Elena Agostini
Tue 3/19, 2PM, Room 211A

SLIDE 2

AGENDA

  • GPUDirect & Topology
    • How system topology may affect GPUDirect technologies and communication APIs
    • A case study
  • GPUDirect RDMA
    • Memory consistency problems when dealing with your NIC
    • Problem statement and possible solutions
  • L4T (Tegra)
    • Xavier topology insights
    • Application guidelines

SLIDE 3

GPUDIRECT & SYSTEM TOPOLOGY: A CASE STUDY

SLIDE 4

THE ISING BETHE LATTICE

  • A system of binary variables (i.e., variables that can assume only one out of two possible values) that interact with each other.
  • The variables are the vertices of a random graph. The graph is bipartite, meaning that the red variables interact only with the blue ones.
  • Same-type variables can run in parallel.
  • Each red vertex has only 4 blue neighbors and vice versa.
  • The simulation performs a sort of relaxation dynamics that emulates the training of artificial neural networks (corresponding to the minimization of the loss function in a high-dimensional space).

Overview

Paper: "Benchmarking multi-GPU applications on modern multi-GPU integrated systems", M. Bernaschi, E. Agostini, D. Rossetti. Submitted to the Special Issue of "Concurrency and Computation: Practice and Experience", 2018.

SLIDE 5

THE ISING BETHE LATTICE

  • Variables are distributed among all the GPUs in the system.
  • Interaction pattern: each variable may interact with variables on any number of other GPUs.
  • Exchanging, during each step of the simulation, the single chunks of memory needed by each variable would result in a huge number of small messages among GPUs.
  • It is most convenient to exchange all the red results (i.e., the entire device memory buffer) at the end of their interaction with the blue ones, and vice versa.

Multi-GPU system

[Diagram: GPUs X and Y exchange their device memory buffers]

SLIDE 6

THE ISING BETHE LATTICE

  • MVAPICH2 + GPUDirect RDMA support: directly exchange device memory
  • NCCL 2.2: single- and multi-process modes
    • AllGather (sketched below)

Device buffers communication

Technology           Communication API   Single Process   Multi-Process
GPUDirect P2P (CE)   cudaMemcpyPeer      X                -
GPUDirect P2P (SM)   ncclAllGather       X                X
GPUDirect RDMA       MVAPICH2 GDR        -                X
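
A minimal single-process sketch of the AllGather exchange above. The buffer names (d_chunk, d_all) and the assumption that the NCCL communicators and per-GPU streams already exist (e.g., created with ncclCommInitAll) are illustrative, not taken from the paper's code.

    #include <cuda_runtime.h>
    #include <nccl.h>

    /* d_chunk[i] holds GPU i's own results; d_all[i] receives every GPU's
     * chunk, i.e., the "exchange the entire device buffer" step above. */
    void exchange_buffers(int ndev, ncclComm_t *comms, cudaStream_t *streams,
                          void **d_chunk, void **d_all, size_t chunk_bytes)
    {
        ncclGroupStart();                          /* group the per-GPU calls */
        for (int i = 0; i < ndev; ++i)
            ncclAllGather(d_chunk[i], d_all[i], chunk_bytes, ncclChar,
                          comms[i], streams[i]);
        ncclGroupEnd();

        for (int i = 0; i < ndev; ++i) {           /* wait for the exchange */
            cudaSetDevice(i);
            cudaStreamSynchronize(streams[i]);
        }
    }

In the multi-process configuration the same ncclAllGather call is issued by each MPI rank on its own communicator instead of being grouped in one process.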

SLIDE 7

THE ISING BETHE LATTICE

Not all the GPU pairs have the same type of connection:

  • GPUs 0 and 1: directly connected, 1 NVLink, BW 50 GB/s
    • P2P with CE or NCCL (SM) (peer access can be queried as sketched below)
  • GPUs 0 and 3: directly connected, 2 NVLinks, BW 100 GB/s
    • P2P with CE or NCCL (SM)
  • GPUs 0 and 5: not directly connected. The best connection path could be through NVLink to GPU 1 or, alternatively, through the CPU or the HCA
    • P2P with NCCL (SM)
    • IB cards with MVAPICH2-GDR or NCCL

DGX-1V
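
A quick way to see which of the GPU pairs above can use the CE path is to query peer access for every pair; this small sketch (names are only illustrative) typically reports "yes" for NVLink-connected pairs and "no" for pairs such as GPUs 0 and 5, which then fall back to NCCL or the IB path.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        for (int i = 0; i < ndev; ++i)
            for (int j = 0; j < ndev; ++j) {
                if (i == j) continue;
                int can = 0;
                cudaDeviceCanAccessPeer(&can, i, j);   /* 1 if direct P2P is possible */
                printf("GPU %d -> GPU %d : P2P %s\n", i, j, can ? "yes" : "no");
            }
        return 0;
    }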

SLIDE 8

THE ISING BETHE LATTICE

DGX-1V – speed up

[Chart: speed-up of single-process configurations (NCCL, P2P) on 2, 4 and 8 GPUs with respect to the mono-GPU configuration, grid size 2^25]
[Chart: speed-up of multi-process configurations (MVAPICH2, NCCL, over NVLink and IB) on 2, 4 and 8 GPUs with respect to the mono-GPU configuration, grid size 2^25]

SLIDE 9

THE ISING BETHE LATTICE

  • Only 4 GPUs in the system
  • GPU and POWER9 CPU connected through 3 NVLinks -> 150 GB/s
  • GPU 0 is connected to:
    • GPU 1 with NVLink
    • GPUs 2 and 3 through the SMP bus -> effective P2P BW is 20 GB/s (experimentally)
  • NVLink transactions can be tunneled over the SMP bus -> GPUDirect P2P (CE) is supported across sockets
  • NCCL and P2P are always applicable
  • No need to use IB cards

IBM AC922 – Power9 CPU

SLIDE 10

THE ISING BETHE LATTICE

  • Due to the limited bandwidth when crossing the two POWER9 NUMA nodes, the performance does not improve when using 4 GPUs.
  • Similarly to the DGX-1V, the performance of NCCL single- and multi-process is basically the same up to 4 GPUs, confirming that a single CPU thread is enough to manage 4 GPUs efficiently.
  • P2P CE is actually slightly slower than NCCL.

IBM AC922 – Speed up

[Chart: speed-up of all configurations (P2P, NCCL single-process, NCCL multi-process) on 2 and 4 GPUs with respect to the mono-GPU configuration, grid size 2^25]

SLIDE 11

GPUDIRECT RDMA & MEMORY CONSISTENCY

SLIDE 12

GPUDIRECT RDMA

Loose memory consistency, x86

[Diagram: CPU, PCIe switch, NIC and GPU on the PCIe fabric]

1. CUDA kernel is polling on some dev_flag
   • while(dev_flag == 0); (sketched below)
2. NIC receives and writes data into GPU memory
3. NIC/CPU sets dev_flag = 1
4. CUDA kernel observes dev_flag
5. CUDA kernel consumes the received data

SM may observe inconsistent data!
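
A minimal sketch of the polling kernel in step 1, assuming dev_flag and dev_data are device allocations written by the NIC (names are illustrative). Without one of the fencing mechanisms on the next slide, the kernel may leave the loop and still read stale data.

    /* Spins on dev_flag, then reads the data the NIC just delivered. The
     * volatile qualifier keeps the compiler from caching the flag, but it
     * does NOT fence the data writes. */
    __global__ void consume(volatile unsigned int *dev_flag,
                            const unsigned int *dev_data,
                            unsigned int *out, int n)
    {
        if (threadIdx.x == 0)
            while (*dev_flag == 0)        /* step 1: while(dev_flag == 0); */
                ;
        __syncthreads();                  /* all threads wait for the flag */

        for (int i = threadIdx.x; i < n; i += blockDim.x)
            out[i] = dev_data[i];         /* step 5: may observe stale data! */
    }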

SLIDE 13

GPUDIRECT RDMA

  • PCIe ordering guarantees are not preserved all the way inside the GPU
  • Explicit fencing is required
  • Fencing mechanisms:
    • GPU work launch (kernels, memory copies) (see the sketch below)
    • Read of a GPU memory mapping exposed on GPU BAR1
    • Active CPU read
    • NIC proxied read

Memory consistency issue
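
A sketch of the first fencing option listed above, "GPU work launch", assuming the NIC also writes a completion flag (nic_flag) into sysmem: the CPU waits for that flag and only then launches the consumer kernel, so the kernel never races with the incoming PCIe writes. All names here are illustrative.

    #include <cuda_runtime.h>

    /* Illustrative consumer kernel: copies the received data once launched. */
    __global__ void consume_data(const unsigned int *in, unsigned int *out, int n)
    {
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            out[i] = in[i];
    }

    /* CPU side: poll the NIC completion in sysmem, then launch GPU work.
     * The launch is the fence that orders the kernel after the received data. */
    void wait_and_consume(volatile unsigned int *nic_flag,
                          unsigned int *d_data, unsigned int *d_out, int n,
                          cudaStream_t stream)
    {
        while (*nic_flag == 0)            /* NIC completion visible to the CPU */
            ;
        consume_data<<<1, 256, 0, stream>>>(d_data, d_out, n);
        cudaStreamSynchronize(stream);
    }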

SLIDE 14

GPUDIRECT RDMA

Active CPU read

[Diagram: CPU, PCIe switch, NIC and GPU on the PCIe fabric]

  • CPU reads any GPU memory location
  • CPU sets dev_flag = 1
  • The GPU memory location must be visible from the CPU
    • One way to create a CPU mapping of GPU memory is by using GDRCopy (see the sketch below)
    • https://github.com/NVIDIA/gdrcopy

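A sketch of the active CPU read using GDRCopy, assuming the gdrapi.h interface from the repository above (gdr_pin_buffer, gdr_map) and hypothetical device pointers d_data and d_flag: the CPU read through the BAR1 mapping acts as the fence, after which the CPU raises dev_flag for the polling kernel.

    #include <cuda.h>
    #include <gdrapi.h>

    /* Assumes d_data and d_flag are 64KB-aligned cudaMalloc'ed buffers and that
     * the NIC completion (nic_flag in sysmem) has already been observed. */
    void cpu_read_fence_and_signal(gdr_t g, CUdeviceptr d_data, CUdeviceptr d_flag)
    {
        const size_t sz = 64 * 1024;             /* GPU BAR1 page size */
        gdr_mh_t mh_data, mh_flag;
        void *map_data = NULL, *map_flag = NULL;

        /* create CPU (BAR1) mappings of the two GPU buffers; in real code this
         * would be done once at initialization, not on every message */
        gdr_pin_buffer(g, (unsigned long)d_data, sz, 0, 0, &mh_data);
        gdr_map(g, mh_data, &map_data, sz);
        gdr_pin_buffer(g, (unsigned long)d_flag, sz, 0, 0, &mh_flag);
        gdr_map(g, mh_flag, &map_flag, sz);

        volatile unsigned int *bar_data = (volatile unsigned int *)map_data;
        volatile unsigned int *bar_flag = (volatile unsigned int *)map_flag;

        unsigned int scratch = *bar_data;        /* CPU read of GPU memory: the fence */
        (void)scratch;
        *bar_flag = 1;                           /* now raise dev_flag for the kernel */
    }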

SLIDE 15

GPUDIRECT RDMA

NIC proxied read

Hack: loopback RDMA WRITE

[Diagram: CPU, PCIe switch, NIC and GPU on the PCIe fabric]

  • CPU observes nic_flag
  • CPU issues a NIC RDMA WRITE (see the verbs sketch below)
    ➢ Source is GPU BAR1, dev_src = 1
    ➢ Destination is GPU BAR1 of dev_flag
  • NIC executes the RDMA WRITE
    ➢ Implicitly flushing
  • GPU observes dev_flag = 1

[Diagram: CPU triggers a loopback RDMA PUT from dev_src to dev_flag in GPU memory]
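
A sketch of the loopback RDMA WRITE with IB verbs, assuming the QP is already connected back to the same HCA, mr_src is a memory region registered on GPU memory and preset to 1, and flag_addr/flag_rkey describe dev_flag's registration; these names are illustrative, not taken from any specific library.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Post a loopback RDMA WRITE: dev_src (GPU BAR1) -> dev_flag (GPU BAR1).
     * The NIC access to GPU memory implicitly flushes the received data
     * before dev_flag becomes 1, which is the fence. */
    int post_loopback_flush(struct ibv_qp *qp, struct ibv_mr *mr_src,
                            uint64_t flag_addr, uint32_t flag_rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)mr_src->addr,   /* dev_src, preset to 1 */
            .length = sizeof(uint32_t),
            .lkey   = mr_src->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,     /* the loopback RDMA PUT */
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = flag_addr;      /* dev_flag in GPU BAR1 */
        wr.wr.rdma.rkey        = flag_rkey;

        struct ibv_send_wr *bad = NULL;
        return ibv_post_send(qp, &wr, &bad);
    }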

SLIDE 16

GPUDIRECT RDMA ON L4T

SLIDE 17

JETSON AGX XAVIER

The Tegra Jetson AGX Xavier is a 64-bit ARM high-performance SoC for autonomous machines introduced in 2018:

  • iGPU: 512-core Volta GPU with Tensor Cores
  • CPU: 8-core ARM v8.2 64-bit CPU, 8MB L2 + 4MB L3
  • Memory: 16GB 256-bit LPDDR4x, 137 GB/s
  • Storage: 32GB eMMC 5.1
  • PCIe: x8 Gen2/3/4 slot
    • Any PCIe card can be connected. The PCIe slot is of x16 size to accommodate x16 cards but operates in x8 mode.
  • OS: Linux for Tegra (L4T)
    • L4T v32.1 will have the GPUDirect RDMA kernel API!

HW & SW overview

SLIDE 18

SYSTEM TOPOLOGY

Desktop vs Tegra

Desktop:
  • BAR1 page size = 64KB
  • PCIe accesses GPU memory via the L2 cache
  • PCIe reads/writes see the latest value from the GPU
  • GPU memory is separate from sysmem
  • Allocator is cudaMalloc
  • https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

Tegra:
  • Page size = 4KB
  • Sysmem only
  • PCIe and the iGPU L2 are not coherent
  • cudaMalloc returns GPU-cached memory
  • Need to use an uncached memory region (cudaMallocHost) (see the sketch below)
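
A small sketch following the comparison above for choosing the buffer the NIC writes into, with a hypothetical is_tegra flag: on a desktop GPU the RDMA target can live in cudaMalloc'ed GPU memory, while on Xavier the uncached cudaMallocHost region keeps PCIe and the iGPU coherent.

    #include <stddef.h>
    #include <cuda_runtime.h>

    /* Pick where the NIC should deposit incoming data, per the lists above. */
    void *alloc_rdma_target(size_t bytes, int is_tegra)
    {
        void *buf = NULL;
        if (is_tegra)
            cudaMallocHost(&buf, bytes);   /* uncached sysmem, coherent with PCIe */
        else
            cudaMalloc(&buf, bytes);       /* GPU memory, exposed to the NIC via BAR1 */
        return buf;
    }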

SLIDE 19

GPUDIRECT RDMA

Desktop

[Diagram: GPUDirect RDMA software stack on desktop, with the ioctl path into the kernel driver]

SLIDE 20

GPUDIRECT RDMA

L4T

SLIDE 21

GPUDIRECT RDMA ON L4T

  • The current L4T public release is v31
  • GPUDirect RDMA support starts from L4T v32.1 (JetPack 4.2): https://developer.nvidia.com/embedded/jetpack
  • Note: the kernel API is declared in /usr/src/linux-headers-#KERNEL_VERSION-tegra/nvgpu/include/linux/nv-p2p.h (see the sketch below)

next release
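
A hedged kernel-side sketch of pinning GPU memory for a NIC driver through the nv-p2p interface, using the nvidia_p2p_get_pages() signature documented for the desktop GPUDirect RDMA API (the docs.nvidia.com page cited earlier); the variant declared in the L4T nv-p2p.h header listed above may differ, so treat this purely as an illustration and check the header shipped with JetPack 4.2.

    #include <linux/types.h>
    #include <linux/nv-p2p.h>    /* header path per the note above */

    /* Pin a GPU virtual address range so the NIC driver can DMA to/from it. */
    static struct nvidia_p2p_page_table *pin_gpu_range(u64 gpu_va, u64 len,
                                                       void (*free_cb)(void *),
                                                       void *ctx)
    {
        struct nvidia_p2p_page_table *pt = NULL;

        /* desktop-style call: token and va_space are 0 for the default CUDA VA space */
        if (nvidia_p2p_get_pages(0, 0, gpu_va, len, &pt, free_cb, ctx))
            return NULL;
        return pt;               /* released later with nvidia_p2p_put_pages() */
    }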

SLIDE 22