GPUrdma: GPU-side library for high performance networking from GPU kernels
Feras Daoud, Technion – Israel Institute of Technology
Mark Silberstein, Technion – Israel Institute of Technology
Amir Watad, Technion – Israel Institute of Technology
Agenda:
1. Introduction
2. InfiniBand Background
3. GPUrdma
[Figure: Three designs compared. Naive version: both the data path and the control path go through the CPU and CPU RAM before reaching the HCA. GPUDirect RDMA: the HCA accesses GPU RAM directly (direct data path), but the control path still runs on the CPU. GPUrdma: the GPU handles both a direct data path and a direct control path.]
[Figure: InfiniBand verbs flow. A task is posted as a Work Queue Element (WQE) into the Queue Pair (QP) buffer; the HCA reports each finished task as a Completion Queue Element (CQE) in the Completion Queue (CQ) buffer.]
[Figure: CPU-driven RDMA. The QP, CQ, and data buffers all reside in CPU memory. The CPU builds WQEs and rings the doorbell to tell the HCA to execute the jobs (control path); the HCA then reads the data from CPU memory (data path).]
[Figure: GPUrdma design goal. The QP, CQ, and data buffers all reside in GPU memory; the GPU runs the control path itself, and the HCA reads data directly from GPU memory (data path).]
[Figure: Data path with GPUDirect RDMA. The data buffer moves from CPU memory into GPU memory, so the HCA transfers data directly to and from the GPU; the QP and CQ remain in CPU memory, and the CPU still drives the control path.]
[Figure: Building the direct control path. The QP and CQ are moved into GPU memory and mapped into the GPU address space, so GPU threads can post WQEs and poll for completions themselves. This requires modifying the InfiniBand Verbs library and the NVIDIA driver.]
Evaluation setup: NVIDIA Tesla K40c GPU and Mellanox Connect-IB HCA.
[Figures: Throughput optimizations, step by step: the baseline GPU implementation vs. the CPU; an optimized doorbell ring (about 3x improvement); optimized CQ polling; parallel WQE writes; and finally scaling from 1 QP per block to 30 QPs per block. With 30 QPs per block, GPUrdma reaches 50 Gbps, about 3x the CPU with 30 QPs.]
[Figures: Latency experiment. Four configurations are measured, placing the QP and the CQ either in CPU memory or in GPU memory, with the data buffer in GPU memory in all cases.]
Transfer latency [µsec]:

            QP in CPU   QP in GPU
CQ in CPU       8.6         6.2
CQ in GPU       6.8         4.8
Consistency issue with GPUDirect RDMA (CUDA v7.5): under intensive RDMA writes to GPU memory, a running kernel may observe stale data, or data that arrives out of order. Good news: NVIDIA announced a CUDA 8 feature that enables consistent updates. Suggested fix in the meantime: a CRC32 integrity-check API for error detection.
GPI is a framework that implements the Partitioned Global Address Space (PGAS) model; GPI2 extends this global address space to GPU memory. [Figure: each node exposes host global memory and GPU global memory, alongside host and GPU local memory, accessed by threads.]
GPI2 flow (CPU drives communication):
  Sender:   gaspi_segment_create(CPU_MEM) → initialize data → gaspi_write_notify → gaspi_notify_waitsome → gaspi_proc_term
  Receiver: gaspi_segment_create(GPU_MEM) → gaspi_notify_waitsome → GPU_Compute_data<<<>>> → gaspi_write_notify → gaspi_proc_term

GPUrdma flow (GPU drives communication from inside a persistent kernel):
  Sender:   unchanged
  Receiver: gaspi_segment_create(GPU_MEM) → GPU_start_kernel<<<>>> { gpu_gaspi_notify_waitsome → Compute_data() → gpu_gaspi_write_notify } → gaspi_proc_term
Benchmark: the sender measures the time from gaspi_write_notify until gaspi_notify_waitsome returns, while the receiver runs gpu_notify_waitsome → Matrix_compute() → gpu_write_notify.

Matrix multiplications per second as a function of the batch size:

Batch size [Vectors]   GPI2   GPUrdma
480                     2.6      11.7
960                     4.8      18.8
1920                    8.4      25.2
3840                   13.9      29.1
7680                   19.9      30.3
15360                  24.3      31.5
Lena Oden, Fraunhofer Institute for Industrial Mathematics
Mark Silberstein, Technion – Israel Institute of Technology