SLIDE 1

GPUrdma: GPU-side library for high performance networking from GPU kernels

Feras Daoud Technion – Israel Institute of Technology

Mark Silberstein, Technion; Amir Watad, Technion


SLIDE 2

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

SLIDE 3

What

  • A GPU-side library for performing RDMA directly from GPU kernels

Why

  • To improve communication performance between distributed GPUs

Results

  • 5 µsec GPU-to-GPU communication latency and up to 50 Gbps transfer bandwidth

SLIDE 4

Evolution of GPU-HCA interaction:

[Diagram: naive version – GPU, GPU RAM, CPU, CPU RAM, HCA; both the data path and the control path go through the CPU]

SLIDE 5

Evolution of GPU-HCA interaction:

[Diagram: naive version vs. GPUDirect RDMA – GPUDirect RDMA adds a direct data path between GPU memory and the HCA, while the control path stays on the CPU]

SLIDE 6

Evolution of GPU-HCA interaction:

[Diagram: naive version vs. GPUDirect RDMA vs. GPUrdma – GPUrdma adds a direct control path from the GPU to the HCA on top of the direct data path]

SLIDE 7

Motivations

GPUDirect RDMA Node

CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()

SLIDE 8

Motivations

GPUDirect RDMA Node

CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()

CPU Overhead

SLIDE 9

Motivations

GPUDirect RDMA Node

CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()

Bulk-synchronous design and explicit pipelining of communication and computation

SLIDE 10

Motivations

GPUDirect RDMA Node

CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()

CPU_rdma_read()   GPU_kernel<<<>>> { GPU_Compute() }   CPU_rdma_write()
CPU_rdma_read()   GPU_kernel<<<>>> { GPU_Compute() }   CPU_rdma_write()
CPU_rdma_read()   GPU_kernel<<<>>> { GPU_Compute() }   CPU_rdma_write()

Multiple GPU kernel invocations

  • 1. Kernel call overhead
  • 2. Inefficient shared memory usage
  (see the host-side sketch below)
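
To make the cost concrete, here is a minimal host-side sketch of this bulk-synchronous pattern. The helpers rdma_read_into_gpu() and rdma_write_from_gpu() are illustrative stand-ins for posting verbs work requests on a GPUDirect-registered buffer, not the paper's API:

    #include <cuda_runtime.h>
    #include <cstddef>

    // Illustrative helpers (not the paper's API): post an RDMA read/write on a
    // GPUDirect-registered buffer and block until its completion arrives.
    void rdma_read_into_gpu(float *gpu_buf, size_t n);
    void rdma_write_from_gpu(float *gpu_buf, size_t n);

    __global__ void GPU_Compute(float *buf, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;                   // placeholder computation
    }

    // Host-driven pipeline: every chunk pays a kernel-launch plus a
    // synchronization cost, and the CPU is busy driving all communication.
    void process_chunks(float *gpu_buf, size_t n, int num_chunks) {
        for (int c = 0; c < num_chunks; ++c) {
            rdma_read_into_gpu(gpu_buf, n);          // CPU posts RDMA read, waits
            GPU_Compute<<<(unsigned)((n + 255) / 256), 256>>>(gpu_buf, n);
            cudaDeviceSynchronize();                 // CPU waits for the kernel
            rdma_write_from_gpu(gpu_buf, n);         // CPU posts RDMA write
        }
    }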
SLIDE 11

Motivations

GPUDirect RDMA Node

CPU_rdma_read()
GPU_kernel<<<>>> {
    Find_Even_Num()
}
CPU_rdma_write()

Sparse data

[Example: values 1 1 1, 3 5 2, 5 1 8 at offsets 0x0, 0x3, 0x6]

SLIDE 12

GPUrdma library

GPU_kernel<<<>>> {
    GPU_rdma_read()
    GPU_Compute()
    GPU_rdma_write()
}

GPUrdma Node


  • No CPU intervention
  • Overlapping communication and computation
  • One kernel call
  • Efficient shared memory usage
  • Send sparse data directly from the kernel
  (a device-side sketch follows below)
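
A minimal sketch of such a persistent kernel. gpu_rdma_read() and gpu_rdma_write() stand in for GPUrdma's device-side communication calls; their signatures are assumed here, since the slides only show the high-level pattern:

    #include <cstddef>

    // Assumed device-side GPUrdma calls (illustrative signatures): transfer
    // 'bytes' between the local GPU buffer and the remote peer.
    __device__ void gpu_rdma_read(void *local_buf, size_t bytes);
    __device__ void gpu_rdma_write(const void *local_buf, size_t bytes);

    __device__ void GPU_Compute(float *buf, size_t n) {
        for (size_t i = threadIdx.x; i < n; i += blockDim.x)
            buf[i] *= 2.0f;                          // placeholder computation
    }

    // One kernel launch handles many chunks: communication is issued from
    // inside the kernel, so there is no CPU intervention and no relaunch
    // per chunk, and shared-memory state survives across chunks.
    __global__ void GPU_kernel(float *buf, size_t n, int num_chunks) {
        for (int c = 0; c < num_chunks; ++c) {
            if (threadIdx.x == 0) gpu_rdma_read(buf, n * sizeof(float));
            __syncthreads();
            GPU_Compute(buf, n);
            __syncthreads();
            if (threadIdx.x == 0) gpu_rdma_write(buf, n * sizeof(float));
        }
    }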
SLIDE 13

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

SLIDE 14

InfiniBand Background

  • 1. Queue pair buffer (QP)
  • Send queue
  • Receive queue
  • 2. Work queue element (WQE)
  • Contains communication instructions
  • 3. Completion queue buffer (CQ)
  • Contains completion elements
  • 4. Completion queue element (CQE)
  • Contains information about completed jobs

[Diagram: tasks are posted through Verbs as WQEs into the QP buffer; the HCA writes CQEs into the CQ buffer when work completes]

SLIDE 15

InfiniBand Background

  • 1. Write a work queue element to the QP buffer
  • 2. Ring the doorbell
  • 3. Check the completion queue element status

Ring the doorbell to execute jobs

  • MMIO address
  • Informs the HCA about new jobs

[Diagram: QP, CQ, and data buffers in CPU memory; the CPU drives the control path and the data path to the HCA]
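
For comparison, here are the same three steps with the standard host-side verbs API: ibv_post_send() writes the WQE into the send queue and rings the doorbell, and ibv_poll_cq() retrieves the CQE. This is a minimal sketch that assumes the QP and CQ are already created and connected:

    #include <infiniband/verbs.h>
    #include <cstdint>

    // Post one RDMA write and wait for its completion.
    int rdma_write_once(ibv_qp *qp, ibv_cq *cq,
                        void *local_buf, uint32_t len, uint32_t lkey,
                        uint64_t remote_addr, uint32_t rkey) {
        ibv_sge sge = {};
        sge.addr   = (uintptr_t)local_buf;
        sge.length = len;
        sge.lkey   = lkey;

        ibv_send_wr wr = {}, *bad_wr = nullptr;
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;   // ask for a CQE
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        // Steps 1 + 2: write the WQE into the QP buffer and ring the doorbell.
        if (ibv_post_send(qp, &wr, &bad_wr)) return -1;

        // Step 3: poll the CQ until the completion element shows up.
        ibv_wc wc;
        int n;
        while ((n = ibv_poll_cq(cq, 1, &wc)) == 0) { /* spin */ }
        return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
    }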

SLIDE 16

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

SLIDE 17

GPUrdma Node

  • Direct path for data exchange
  • Direct HCA control from GPU kernels
  • No CPU intervention

Native GPU Node

[Diagram: QP, CQ, and data buffers in GPU memory; the GPU drives both the control path and the data path to the HCA]

SLIDE 18

GPUrdma Implementation

[Diagram: starting point – QP, CQ, and data buffers all in CPU memory]

SLIDE 19

GPUrdma Implementation

Data Path - GPUDirect RDMA

[Diagram: data buffer moved to GPU memory (GPUDirect RDMA data path); QP and CQ remain in CPU memory]

SLIDE 20

GPUrdma Implementation

  • 1. Move QP, CQ to GPU memory

[Diagram: QP, CQ, and data buffers in GPU memory]

SLIDE 21

GPUrdma Implementation

  • 1. Move QP, CQ to GPU memory

Modify InfiniBand Verbs

  • ibv_create_qp()
  • ibv_create_cq()

[Diagram: QP, CQ, and data buffers in GPU memory]
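
For the data buffer itself, the stock GPUDirect RDMA path already allows registration of GPU memory with the HCA (ibv_reg_mr accepts a cudaMalloc pointer when the nv_peer_mem module is loaded); only the QP and CQ buffers need the modified verbs above. A minimal sketch of the data-buffer part, assuming GPUDirect RDMA support is installed:

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    // Allocate a data buffer in GPU memory and register it with the HCA.
    // With GPUDirect RDMA (nv_peer_mem), ibv_reg_mr accepts the device pointer.
    ibv_mr *register_gpu_buffer(ibv_pd *pd, size_t bytes, void **gpu_buf) {
        if (cudaMalloc(gpu_buf, bytes) != cudaSuccess) return nullptr;
        return ibv_reg_mr(pd, *gpu_buf, bytes,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_REMOTE_READ);
    }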

SLIDE 22

GPUrdma Implementation

  • 2. Map the HCA doorbell address into GPU address space

[Diagram: QP, CQ, and data buffers in GPU memory]

SLIDE 23

GPUrdma Implementation

[Diagram: QP, CQ, and data buffers in GPU memory; the HCA doorbell is mapped into the GPU address space]

  • 2. Map the HCA doorbell address into GPU address space

Modify NVIDIA driver
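
GPUrdma obtains this mapping by modifying the NVIDIA driver, as noted above. As a rough sketch of the same idea, newer CUDA releases expose cudaHostRegisterIoMemory, which can map an MMIO page for GPU access; db_mmio here is assumed to be the doorbell page that the verbs library already mapped into the process, and the exact doorbell value format is HCA-specific:

    #include <cuda_runtime.h>
    #include <cstdint>

    // Make an HCA doorbell page (already MMIO-mapped into the process by the
    // verbs library) visible to GPU code. Names and page size are illustrative.
    void *map_doorbell_for_gpu(void *db_mmio, size_t page_size) {
        void *db_dev = nullptr;
        if (cudaHostRegister(db_mmio, page_size,
                             cudaHostRegisterIoMemory | cudaHostRegisterMapped) != cudaSuccess)
            return nullptr;
        if (cudaHostGetDevicePointer(&db_dev, db_mmio, 0) != cudaSuccess)
            return nullptr;
        return db_dev;                // device pointer the kernel can store to
    }

    // Device side: ring the doorbell by writing to the mapped MMIO word.
    // The value layout (producer index, QP number, ...) depends on the HCA.
    __device__ void ring_doorbell(volatile uint32_t *db_dev, uint32_t value) {
        __threadfence_system();       // make the WQE visible before the ring
        *db_dev = value;              // MMIO store informs the HCA of new work
    }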

SLIDE 24

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

SLIDE 25

GPUrdma Evaluation

  • Single QP
  • Multiple QPs
  • Scalability – optimal QP/CQ location

Hardware: NVIDIA Tesla K40c GPU, Mellanox Connect-IB HCA

SLIDE 26

GPUrdma – 1 thread, 1 QP

  • Best performance: CPU controller vs. GPU controller

[Chart: bandwidth of the CPU controller vs. the GPU controller]
SLIDE 27

GPUrdma – 1 thread, 1 QP

  • GPU controller – optimize doorbell rings

[Chart: doorbell-optimized GPU controller vs. CPU and baseline GPU controller; 3x improvement]

SLIDE 28

GPUrdma – 1 thread, 1 QP

  • GPU controller – optimize CQ poll

[Chart: CQ-optimized vs. doorbell-optimized vs. CPU]

SLIDE 29

GPUrdma – 32 threads, 1 QP

  • GPU controller – write parallel jobs

[Chart: parallel writes vs. CQ-optimized vs. CPU]

SLIDE 30

GPUDirect RDMA

  • CPU controller

[Chart: CPU with 30 QPs vs. CPU with 1 QP]

SLIDE 31

GPUrdma – 30 QPs

  • 1 QP per block vs. 30 QPs per block

[Chart: 30 QPs per block vs. CPU with 30 QPs vs. 1 QP per block; up to 50 Gbps, a 3x improvement]

SLIDE 32

Scalability – Optimal QP/CQ location:

  • 1. QP and CQ in GPU memory
  • 2. QP in GPU and CQ in system memory
  • 3. CQ in GPU and QP in system memory
  • 4. QP and CQ in system memory

[Diagram: QP, CQ, and data buffer placement across CPU and GPU memory]

SLIDE 33

Scalability – Optimal QP/CQ location:

[Diagram: configuration 1 – QP and CQ in GPU memory]

  • 1. QP and CQ in GPU memory
  • 2. QP in GPU and CQ in system memory
  • 3. CQ in GPU and QP in system memory
  • 4. QP and CQ in system memory
SLIDE 34

Scalability – Optimal QP/CQ location:

[Diagram: configuration 2 – QP in GPU memory, CQ in system memory]

  • 1. QP and CQ in GPU memory
  • 2. QP in GPU and CQ in system memory
  • 3. CQ in GPU and QP in system memory
  • 4. QP and CQ in system memory

SLIDE 35

Scalability – Optimal QP/CQ location:

[Diagram: configuration 4 – QP and CQ in system memory, data in GPU memory]

  • 1. QP and CQ in GPU memory
  • 2. QP in GPU and CQ in system memory
  • 3. CQ in GPU and QP in system memory
  • 4. QP and CQ in system memory

SLIDE 36

Optimal QP/CQ location:

Transfer latency [µsec]:

             QP in CPU    QP in GPU
CQ in CPU       8.6          6.2
CQ in GPU       6.8          4.8

  • Throughput: no difference between the configurations
  • Latency: lowest when both QP and CQ are in GPU memory (4.8 µsec)
SLIDE 37

Limitations

GPUDirect RDMA, CUDA v7.5: a running kernel may observe STALE DATA or data that arrives OUT-OF-ORDER

  • Scenario: intensive RDMA writes to GPU memory
  • Good news: NVIDIA announced a CUDA 8 feature that enables consistent updates
  • Suggested fix: CRC32 integrity check API for error detection (see the sketch below)
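
A minimal sketch of what such an integrity check could look like (not the paper's API): the sender is assumed to append a CRC32 of the payload, and the receiving kernel re-checks the message until the checksum matches, so stale or partially arrived data is never consumed:

    #include <cstdint>
    #include <cstddef>

    // Plain bitwise CRC-32 (polynomial 0xEDB88320), usable from device code.
    __device__ uint32_t crc32_device(const volatile uint8_t *data, size_t len) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; ++i) {
            crc ^= data[i];
            for (int b = 0; b < 8; ++b)
                crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)(-(int32_t)(crc & 1u)));
        }
        return ~crc;
    }

    // Spin until the payload's CRC matches the checksum the sender appended
    // right after it. The message layout (payload + trailing CRC word) is an
    // assumption for illustration.
    __device__ void wait_for_consistent_message(const volatile uint8_t *payload,
                                                size_t len,
                                                const volatile uint32_t *trailer) {
        while (crc32_device(payload, len) != *trailer) { /* not fully arrived */ }
    }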

SLIDE 38

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

SLIDE 39

GPI2 for GPUs:

  • GPI – a framework that implements the Partitioned Global Address Space (PGAS) model
  • GPI2 – extends this global address space to GPU memory

[Diagram: host and GPU threads, each with local memory and a global memory partition on both the host and the GPU]

SLIDE 40

GPI2 code example

CPU Node

gaspi_segment_create (CPU_MEM)
Initialize data
gaspi_write_notify
gaspi_notify_waitsome
gaspi_proc_term

GPU Node

gaspi_segment_create (GPU_MEM)
gaspi_notify_waitsome
GPU_Compute_data<<<>>>
gaspi_write_notify
gaspi_proc_term
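
A condensed host-side sketch of the CPU-node flow above, written against the GASPI/GPI-2 C API; segment ids, offsets, sizes, and ranks are placeholders, and the parameter lists follow the GASPI specification (they may differ slightly across GPI-2 versions):

    #include <GASPI.h>

    int main() {
        gaspi_proc_init(GASPI_BLOCK);

        // One segment in host memory (segment id 0, 1 MiB).
        gaspi_segment_create(0, 1 << 20, GASPI_GROUP_ALL,
                             GASPI_BLOCK, GASPI_MEM_INITIALIZED);

        // ... initialize data inside the segment ...

        // Write the data to rank 1 and raise notification 0 there.
        gaspi_write_notify(0 /*local seg*/, 0 /*local off*/, 1 /*rank*/,
                           0 /*remote seg*/, 0 /*remote off*/, 1 << 20 /*size*/,
                           0 /*notif id*/, 1 /*notif value*/, 0 /*queue*/,
                           GASPI_BLOCK);

        // Wait for the result notification coming back from the GPU node.
        gaspi_notification_id_t got;
        gaspi_notify_waitsome(0, 0, 1, &got, GASPI_BLOCK);

        gaspi_proc_term(GASPI_BLOCK);
        return 0;
    }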

SLIDE 41

GPI2 using GPUrdma

CPU Node

gaspi_segment_create (CPU_MEM)
Initialize data
gaspi_write_notify
gaspi_notify_waitsome
gaspi_proc_term

GPU Node

gaspi_segment_create (GPU_MEM)
GPU_start_kernel <<<>>> {
    gpu_gaspi_notify_waitsome
    Compute_data()
    gpu_gaspi_write_notify
}
gaspi_proc_term
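
On the GPU node the whole receive/compute/send loop lives in a single persistent kernel. gpu_gaspi_notify_waitsome() and gpu_gaspi_write_notify() denote GPUrdma's device-side counterparts of the GPI2 calls; their signatures below are assumptions for illustration:

    // Assumed device-side GPI2-over-GPUrdma calls (illustrative signatures).
    __device__ void gpu_gaspi_notify_waitsome(int segment_id, int notification_id);
    __device__ void gpu_gaspi_write_notify(int segment_id, int rank, int notification_id);

    __device__ void Compute_data(float *seg_data);

    // Persistent kernel: launched once, it processes 'batches' messages
    // without returning control to the CPU between them.
    __global__ void GPU_start_kernel(float *seg_data, int batches, int peer_rank) {
        for (int b = 0; b < batches; ++b) {
            if (threadIdx.x == 0) gpu_gaspi_notify_waitsome(0, b);       // wait for input
            __syncthreads();
            Compute_data(seg_data);
            __syncthreads();
            if (threadIdx.x == 0) gpu_gaspi_write_notify(0, peer_rank, b); // send result
        }
    }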

SLIDE 42

GPUrdma Multi-Matrix vector product

CPU Node

Start timer
gaspi_write_notify
gaspi_notify_waitsome
Stop timer

GPU Node

gpu_notify_waitsome
Matrix_compute()
gpu_write_notify

Batch size [vectors]    GPI2    GPUrdma
480                      2.6      11.7
960                      4.8      18.8
1920                     8.4      25.2
3840                    13.9      29.1
7680                    19.9      30.3
15360                   24.3      31.5

  • System throughput in millions of 32x1 vector multiplications per second as a function of the batch size

SLIDE 43

Related works

Lena Oden, Fraunhofer Institute for Industrial Mathematics:

  • Infiniband-Verbs on GPU: A case study of controlling an Infiniband network device from the GPU
  • Analyzing Put/Get APIs for Thread-collaborative Processors

Mark Silberstein, Technion – Israel Institute of Technology:

  • GPUnet: networking abstractions for GPU programs
  • GPUfs: Integrating a file system with GPUs

Thanks