SLIDE 1

GPUrdma: GPU-side library for high performance networking from GPU kernels

Feras Daoud Technion – Israel Institute of Technology

Mark Silberstein, Technion; Amir Watad, Technion


SLIDE 2

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

SLIDE 3

What

  • A GPU-side library for performing RDMA directly from GPU kernels

Why

  • To improve communication performance between distributed GPUs

Results

  • 5 µsec GPU-to-GPU communication latency and up to 50 Gbps transfer bandwidth

SLIDE 4

Evolution of GPU-HCA interaction:

[Diagram: naive version – GPU, GPU RAM, CPU, CPU RAM, HCA; both the data path and the control path go through the CPU]

SLIDE 5

Evolution of GPU-HCA interaction:

[Diagram: naive version vs. GPUDirect RDMA – GPUDirect RDMA adds a direct data path between GPU memory and the HCA, while the control path stays on the CPU]

SLIDE 6

Evolution of GPU-HCA interaction:

[Diagram: naive version vs. GPUDirect RDMA vs. GPUrdma – GPUrdma adds a direct control path from the GPU to the HCA on top of the direct data path]

SLIDE 7

Motivations

GPUDirect RDMA Node

CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()

SLIDE 8

Motivations

GPUDirect RDMA Node

CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()

CPU Overhead

SLIDE 9

Motivations

GPUDirect RDMA Node

CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()

Bulk-synchronous design and explicit pipelining of communication and computation

SLIDE 10

Motivations

GPUDirect RDMA Node

CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()

CPU_rdma_read()   GPU_kernel<<<>>> { GPU_Compute() }   CPU_rdma_write()
CPU_rdma_read()   GPU_kernel<<<>>> { GPU_Compute() }   CPU_rdma_write()
CPU_rdma_read()   GPU_kernel<<<>>> { GPU_Compute() }   CPU_rdma_write()

Multiple GPU kernel invocations

  • 1. Kernel call overhead
  • 2. Inefficient shared memory usage
  (see the host-side sketch below)
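
To make the cost concrete, here is a minimal host-side sketch of this bulk-synchronous pattern. The helpers rdma_read_into_gpu() and rdma_write_from_gpu() are illustrative stand-ins for posting verbs work requests on a GPUDirect-registered buffer, not the paper's API:

    #include <cuda_runtime.h>
    #include <cstddef>

    // Illustrative helpers (not the paper's API): post an RDMA read/write on a
    // GPUDirect-registered buffer and block until its completion arrives.
    void rdma_read_into_gpu(float *gpu_buf, size_t n);
    void rdma_write_from_gpu(float *gpu_buf, size_t n);

    __global__ void GPU_Compute(float *buf, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;                   // placeholder computation
    }

    // Host-driven pipeline: every chunk pays a kernel-launch plus a
    // synchronization cost, and the CPU is busy driving all communication.
    void process_chunks(float *gpu_buf, size_t n, int num_chunks) {
        for (int c = 0; c < num_chunks; ++c) {
            rdma_read_into_gpu(gpu_buf, n);          // CPU posts RDMA read, waits
            GPU_Compute<<<(unsigned)((n + 255) / 256), 256>>>(gpu_buf, n);
            cudaDeviceSynchronize();                 // CPU waits for the kernel
            rdma_write_from_gpu(gpu_buf, n);         // CPU posts RDMA write
        }
    }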
SLIDE 11

Motivations

GPUDirect RDMA Node

CPU_rdma_read()
GPU_kernel<<<>>> {
    Find_Even_Num()
}
CPU_rdma_write()

Sparse data

[Example: values 1 1 1, 3 5 2, 5 1 8 at offsets 0x0, 0x3, 0x6]

SLIDE 12

GPUrdma library

GPU_kernel<<<>>> {
    GPU_rdma_read()
    GPU_Compute()
    GPU_rdma_write()
}

GPUrdma Node


  • No CPU intervention
  • Overlapping communication and computation
  • One kernel call
  • Efficient shared memory usage
  • Send sparse data directly from the kernel
  (a device-side sketch follows below)
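
A minimal sketch of such a persistent kernel. gpu_rdma_read() and gpu_rdma_write() stand in for GPUrdma's device-side communication calls; their signatures are assumed here, since the slides only show the high-level pattern:

    #include <cstddef>

    // Assumed device-side GPUrdma calls (illustrative signatures): transfer
    // 'bytes' between the local GPU buffer and the remote peer.
    __device__ void gpu_rdma_read(void *local_buf, size_t bytes);
    __device__ void gpu_rdma_write(const void *local_buf, size_t bytes);

    __device__ void GPU_Compute(float *buf, size_t n) {
        for (size_t i = threadIdx.x; i < n; i += blockDim.x)
            buf[i] *= 2.0f;                          // placeholder computation
    }

    // One kernel launch handles many chunks: communication is issued from
    // inside the kernel, so there is no CPU intervention and no relaunch
    // per chunk, and shared-memory state survives across chunks.
    __global__ void GPU_kernel(float *buf, size_t n, int num_chunks) {
        for (int c = 0; c < num_chunks; ++c) {
            if (threadIdx.x == 0) gpu_rdma_read(buf, n * sizeof(float));
            __syncthreads();
            GPU_Compute(buf, n);
            __syncthreads();
            if (threadIdx.x == 0) gpu_rdma_write(buf, n * sizeof(float));
        }
    }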
SLIDE 13

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

SLIDE 14

InfiniBand Background

  • 1. Queue pair buffer (QP)
  • Send queue
  • Receive queue
  • 2. Work queue element (WQE)
  • Contains communication instructions
  • 3. Completion queue buffer (CQ)
  • Contains completion elements
  • 4. Completion queue element (CQE)
  • Contains information about completed jobs

[Diagram: tasks are posted through Verbs as WQEs into the QP buffer; the HCA writes CQEs into the CQ buffer when work completes]

SLIDE 15

InfiniBand Background

  • 1. Write a work queue element to the QP buffer
  • 2. Ring the doorbell
  • 3. Check the completion queue element status

Ring the doorbell to execute jobs

  • MMIO address
  • Informs the HCA about new jobs

[Diagram: QP, CQ, and data buffers in CPU memory; the CPU drives the control path and the data path to the HCA]
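
For comparison, here are the same three steps with the standard host-side verbs API: ibv_post_send() writes the WQE into the send queue and rings the doorbell, and ibv_poll_cq() retrieves the CQE. This is a minimal sketch that assumes the QP and CQ are already created and connected:

    #include <infiniband/verbs.h>
    #include <cstdint>

    // Post one RDMA write and wait for its completion.
    int rdma_write_once(ibv_qp *qp, ibv_cq *cq,
                        void *local_buf, uint32_t len, uint32_t lkey,
                        uint64_t remote_addr, uint32_t rkey) {
        ibv_sge sge = {};
        sge.addr   = (uintptr_t)local_buf;
        sge.length = len;
        sge.lkey   = lkey;

        ibv_send_wr wr = {}, *bad_wr = nullptr;
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;   // ask for a CQE
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        // Steps 1 + 2: write the WQE into the QP buffer and ring the doorbell.
        if (ibv_post_send(qp, &wr, &bad_wr)) return -1;

        // Step 3: poll the CQ until the completion element shows up.
        ibv_wc wc;
        int n;
        while ((n = ibv_poll_cq(cq, 1, &wc)) == 0) { /* spin */ }
        return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
    }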

SLIDE 16

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

SLIDE 17

GPUrdma Node

  • Direct path for data exchange
  • Direct HCA control from GPU kernels
  • No CPU intervention

Native GPU Node

[Diagram: QP, CQ, and data buffers in GPU memory; the GPU drives both the control path and the data path to the HCA]

SLIDE 18

GPUrdma Implementation

[Diagram: starting point – QP, CQ, and data buffers all in CPU memory]

SLIDE 19

GPUrdma Implementation

Data Path - GPUDirect RDMA

[Diagram: data buffer moved to GPU memory (GPUDirect RDMA data path); QP and CQ remain in CPU memory]

SLIDE 20

GPUrdma Implementation

  • 1. Move QP, CQ to GPU memory

[Diagram: QP, CQ, and data buffers in GPU memory]

SLIDE 21

GPUrdma Implementation

  • 1. Move QP, CQ to GPU memory

Modify InfiniBand Verbs

  • ibv_create_qp()
  • ibv_create_cq()

[Diagram: QP, CQ, and data buffers in GPU memory]
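
For the data buffer itself, the stock GPUDirect RDMA path already allows registration of GPU memory with the HCA (ibv_reg_mr accepts a cudaMalloc pointer when the nv_peer_mem module is loaded); only the QP and CQ buffers need the modified verbs above. A minimal sketch of the data-buffer part, assuming GPUDirect RDMA support is installed:

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    // Allocate a data buffer in GPU memory and register it with the HCA.
    // With GPUDirect RDMA (nv_peer_mem), ibv_reg_mr accepts the device pointer.
    ibv_mr *register_gpu_buffer(ibv_pd *pd, size_t bytes, void **gpu_buf) {
        if (cudaMalloc(gpu_buf, bytes) != cudaSuccess) return nullptr;
        return ibv_reg_mr(pd, *gpu_buf, bytes,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_REMOTE_READ);
    }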

SLIDE 22

GPUrdma Implementation

  • 2. Map the HCA doorbell address into GPU address space

[Diagram: QP, CQ, and data buffers in GPU memory]

SLIDE 23

GPUrdma Implementation

[Diagram: QP, CQ, and data buffers in GPU memory; the HCA doorbell is mapped into the GPU address space]

  • 2. Map the HCA doorbell address into GPU address space

Modify NVIDIA driver
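
GPUrdma obtains this mapping by modifying the NVIDIA driver, as noted above. As a rough sketch of the same idea, newer CUDA releases expose cudaHostRegisterIoMemory, which can map an MMIO page for GPU access; db_mmio here is assumed to be the doorbell page that the verbs library already mapped into the process, and the exact doorbell value format is HCA-specific:

    #include <cuda_runtime.h>
    #include <cstdint>

    // Make an HCA doorbell page (already MMIO-mapped into the process by the
    // verbs library) visible to GPU code. Names and page size are illustrative.
    void *map_doorbell_for_gpu(void *db_mmio, size_t page_size) {
        void *db_dev = nullptr;
        if (cudaHostRegister(db_mmio, page_size,
                             cudaHostRegisterIoMemory | cudaHostRegisterMapped) != cudaSuccess)
            return nullptr;
        if (cudaHostGetDevicePointer(&db_dev, db_mmio, 0) != cudaSuccess)
            return nullptr;
        return db_dev;                // device pointer the kernel can store to
    }

    // Device side: ring the doorbell by writing to the mapped MMIO word.
    // The value layout (producer index, QP number, ...) depends on the HCA.
    __device__ void ring_doorbell(volatile uint32_t *db_dev, uint32_t value) {
        __threadfence_system();       // make the WQE visible before the ring
        *db_dev = value;              // MMIO store informs the HCA of new work
    }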

SLIDE 24

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

SLIDE 25

GPUrdma Evaluation

  • Single QP
  • Multiple QPs
  • Scalability – optimal QP/CQ location

Hardware: NVIDIA Tesla K40c GPU, Mellanox Connect-IB HCA

SLIDE 26

GPUrdma – 1 thread, 1 QP

  • Best performance: CPU controller vs. GPU controller

[Chart: bandwidth of the CPU controller vs. the GPU controller]
SLIDE 27

GPUrdma – 1 thread, 1 QP

  • GPU controller – optimize doorbell rings

[Chart: doorbell-optimized GPU controller vs. CPU and baseline GPU controller; 3x improvement]

SLIDE 28

GPUrdma – 1 thread, 1 QP

  • GPU controller – optimize CQ poll

[Chart: CQ-optimized vs. doorbell-optimized vs. CPU]

SLIDE 29

GPUrdma – 32 threads, 1 QP

  • GPU controller – write parallel jobs

[Chart: parallel writes vs. CQ-optimized vs. CPU]

SLIDE 30

GPUDirect RDMA

  • CPU controller

[Chart: CPU with 30 QPs vs. CPU with 1 QP]

SLIDE 31

GPUrdma – 30 QPs

  • 1 QP per block vs. 30 QPs per block

[Chart: 30 QPs per block vs. CPU with 30 QPs vs. 1 QP per block; up to 50 Gbps, a 3x improvement]

SLIDE 32

Scalability – Optimal QP/CQ location:

  • 1. QP and CQ in GPU memory
  • 2. QP in GPU and CQ in system memory
  • 3. CQ in GPU and QP in system memory
  • 4. QP and CQ in system memory

[Diagram: QP, CQ, and data buffer placement across CPU and GPU memory]

SLIDE 33

Scalability – Optimal QP/CQ location:

[Diagram: configuration 1 – QP and CQ in GPU memory]

  • 1. QP and CQ in GPU memory
  • 2. QP in GPU and CQ in system memory
  • 3. CQ in GPU and QP in system memory
  • 4. QP and CQ in system memory
SLIDE 34

Scalability – Optimal QP/CQ location:

[Diagram: configuration 2 – QP in GPU memory, CQ in system memory]

  • 1. QP and CQ in GPU memory
  • 2. QP in GPU and CQ in system memory
  • 3. CQ in GPU and QP in system memory
  • 4. QP and CQ in system memory

SLIDE 35

Scalability – Optimal QP/CQ location:

[Diagram: configuration 4 – QP and CQ in system memory, data in GPU memory]

  • 1. QP and CQ in GPU memory
  • 2. QP in GPU and CQ in system memory
  • 3. CQ in GPU and QP in system memory
  • 4. QP and CQ in system memory

SLIDE 36

Optimal QP/CQ location:

Transfer latency [µsec]:

             QP in CPU    QP in GPU
CQ in CPU       8.6          6.2
CQ in GPU       6.8          4.8

  • Throughput: no difference between the configurations
  • Latency: lowest when both QP and CQ are in GPU memory (4.8 µsec)
SLIDE 37

Limitations

GPUDirect RDMA, CUDA v7.5: a running kernel may observe STALE DATA or data that arrives OUT-OF-ORDER

  • Scenario: intensive RDMA writes to GPU memory
  • Good news: NVIDIA announced a CUDA 8 feature that enables consistent updates
  • Suggested fix: CRC32 integrity check API for error detection (see the sketch below)
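
A minimal sketch of what such an integrity check could look like (not the paper's API): the sender is assumed to append a CRC32 of the payload, and the receiving kernel re-checks the message until the checksum matches, so stale or partially arrived data is never consumed:

    #include <cstdint>
    #include <cstddef>

    // Plain bitwise CRC-32 (polynomial 0xEDB88320), usable from device code.
    __device__ uint32_t crc32_device(const volatile uint8_t *data, size_t len) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; ++i) {
            crc ^= data[i];
            for (int b = 0; b < 8; ++b)
                crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)(-(int32_t)(crc & 1u)));
        }
        return ~crc;
    }

    // Spin until the payload's CRC matches the checksum the sender appended
    // right after it. The message layout (payload + trailing CRC word) is an
    // assumption for illustration.
    __device__ void wait_for_consistent_message(const volatile uint8_t *payload,
                                                size_t len,
                                                const volatile uint32_t *trailer) {
        while (crc32_device(payload, len) != *trailer) { /* not fully arrived */ }
    }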

SLIDE 38

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

SLIDE 39

GPI2 for GPUs:

  • GPI – a framework that implements the Partitioned Global Address Space (PGAS) model
  • GPI2 – extends this global address space to GPU memory

[Diagram: host and GPU threads, each with local memory and a global memory partition on both the host and the GPU]

SLIDE 40

GPI2 code example

CPU Node

gaspi_segment_create (CPU_MEM)
Initialize data
gaspi_write_notify
gaspi_notify_waitsome
gaspi_proc_term

GPU Node

gaspi_segment_create (GPU_MEM)
gaspi_notify_waitsome
GPU_Compute_data<<<>>>
gaspi_write_notify
gaspi_proc_term
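
A condensed host-side sketch of the CPU-node flow above, written against the GASPI/GPI-2 C API; segment ids, offsets, sizes, and ranks are placeholders, and the parameter lists follow the GASPI specification (they may differ slightly across GPI-2 versions):

    #include <GASPI.h>

    int main() {
        gaspi_proc_init(GASPI_BLOCK);

        // One segment in host memory (segment id 0, 1 MiB).
        gaspi_segment_create(0, 1 << 20, GASPI_GROUP_ALL,
                             GASPI_BLOCK, GASPI_MEM_INITIALIZED);

        // ... initialize data inside the segment ...

        // Write the data to rank 1 and raise notification 0 there.
        gaspi_write_notify(0 /*local seg*/, 0 /*local off*/, 1 /*rank*/,
                           0 /*remote seg*/, 0 /*remote off*/, 1 << 20 /*size*/,
                           0 /*notif id*/, 1 /*notif value*/, 0 /*queue*/,
                           GASPI_BLOCK);

        // Wait for the result notification coming back from the GPU node.
        gaspi_notification_id_t got;
        gaspi_notify_waitsome(0, 0, 1, &got, GASPI_BLOCK);

        gaspi_proc_term(GASPI_BLOCK);
        return 0;
    }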

SLIDE 41

GPI2 using GPUrdma

CPU Node

gaspi_segment_create (CPU_MEM)
Initialize data
gaspi_write_notify
gaspi_notify_waitsome
gaspi_proc_term

GPU Node

gaspi_segment_create (GPU_MEM)
GPU_start_kernel <<<>>> {
    gpu_gaspi_notify_waitsome
    Compute_data()
    gpu_gaspi_write_notify
}
gaspi_proc_term
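
On the GPU node the whole receive/compute/send loop lives in a single persistent kernel. gpu_gaspi_notify_waitsome() and gpu_gaspi_write_notify() denote GPUrdma's device-side counterparts of the GPI2 calls; their signatures below are assumptions for illustration:

    // Assumed device-side GPI2-over-GPUrdma calls (illustrative signatures).
    __device__ void gpu_gaspi_notify_waitsome(int segment_id, int notification_id);
    __device__ void gpu_gaspi_write_notify(int segment_id, int rank, int notification_id);

    __device__ void Compute_data(float *seg_data);

    // Persistent kernel: launched once, it processes 'batches' messages
    // without returning control to the CPU between them.
    __global__ void GPU_start_kernel(float *seg_data, int batches, int peer_rank) {
        for (int b = 0; b < batches; ++b) {
            if (threadIdx.x == 0) gpu_gaspi_notify_waitsome(0, b);       // wait for input
            __syncthreads();
            Compute_data(seg_data);
            __syncthreads();
            if (threadIdx.x == 0) gpu_gaspi_write_notify(0, peer_rank, b); // send result
        }
    }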

SLIDE 42

GPUrdma Multi-Matrix vector product

CPU Node

Start timer
gaspi_write_notify
gaspi_notify_waitsome
Stop timer

GPU Node

gpu_notify_waitsome
Matrix_compute()
gpu_write_notify

Batch size [vectors]    GPI2    GPUrdma
480                      2.6      11.7
960                      4.8      18.8
1920                     8.4      25.2
3840                    13.9      29.1
7680                    19.9      30.3
15360                   24.3      31.5

  • System throughput in millions of 32x1 vector multiplications per second as a function of the batch size

SLIDE 43

Related works

Lena Oden, Fraunhofer Institute for Industrial Mathematics:

  • Infiniband-Verbs on GPU: A case study of controlling an Infiniband network device from the GPU
  • Analyzing Put/Get APIs for Thread-collaborative Processors

Mark Silberstein, Technion – Israel Institute of Technology:

  • GPUnet: networking abstractions for GPU programs
  • GPUfs: Integrating a file system with GPUs

Thanks