GPUnet: networking abstractions for GPU programs


SLIDE 1

GPUnet: networking abstractions for GPU programs

Mark Silberstein, Amir Wated – Technion, Israel Institute of Technology
Sangman Kim, Seonggu Huh, Xinya Zhang, Yige Hu, Emmett Witchel – University of Texas at Austin

SLIDE 2

What

A socket API for programs running on the GPU

Why

GPU-accelerated servers are hard to build

Results

GPU vs. CPU 50% throughput, 60% latency, ½ LOC

SLIDE 3

Motivation: GPU-accelerated networking applications

[Diagram: data processing servers and MapReduce clusters, each built from multiple GPUs]

SLIDE 4

Recent GPU-accelerated networking applications

SSLShader (Jang 2011), GPU MapReduce (Stuart 2011), Deep Neural Networks (Coates 2013), Dandelion (Rossbach 2013), Rhythm (Agrawal 2014) ...

SLIDE 5

Recent GPU-accelerated networking applications required heroic efforts

SSLShader (Jang 2011), GPU MapReduce (Stuart 2011), Deep Neural Networks (Coates 2013), Dandelion (Rossbach 2013), Rhythm (Agrawal 2014) ...

SLIDE 6

GPU-accelerated networking apps: Recurring themes

• Request batching
• NIC-GPU interaction
• Pipelining and buffer management

SLIDE 7

GPU-accelerated networking apps: Recurring themes

• Request batching
• CPU-GPU-NIC pipelining
• NIC-GPU interaction

We will sidestep these problems

SLIDE 8

The real problem: CPU is the only boss

[Diagram: the CPU sits in the middle, driving the GPU, the NIC, and storage]

SLIDE 9

Example: CPU server

[Diagram: CPU, NIC, and memory; the server loop is just recv(), compute(), send()]
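For reference, the CPU server on this slide is just a plain sockets loop. A minimal sketch (the port, buffer size, and compute() body are placeholders, and error handling is omitted):

    /* Minimal CPU server corresponding to the slide: recv(), compute(), send(). */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <unistd.h>

    static void compute(char *buf, int n) { /* application logic placeholder */ }

    int main(void) {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(2340);
        addr.sin_addr.s_addr = INADDR_ANY;
        bind(srv, (struct sockaddr *)&addr, sizeof(addr));
        listen(srv, 16);
        int c = accept(srv, NULL, NULL);
        char buf[4096];
        for (int n; (n = recv(c, buf, sizeof(buf), 0)) > 0; ) {
            compute(buf, n);        /* data is already in CPU memory */
            send(c, buf, n, 0);     /* reply from the same buffer */
        }
        close(c);
        close(srv);
        return 0;
    }

Everything lives in one address space and one control flow; the next slides show how much extra machinery appears once a GPU is added.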

SLIDE 10

Inside a GPU-accelerated server

[Diagram: CPU and GPU, each with its own memory, connected by the PCIe bus, with the NIC attached]

In theory: recv(); GPU_compute(); send();

SLIDE 11

Inside a GPU-accelerated server

[Same diagram: CPU, GPU, NIC, and their memories]

In practice, the CPU code starts with: recv(); batch();

SLIDE 12

Inside a GPU-accelerated server

CPU code so far: recv(); batch(); optimize(); transfer();

SLIDE 13

Inside a GPU-accelerated server

CPU code so far: recv(); batch(); optimize(); transfer(); balance(); invoke();
On the GPU: GPU_compute();

SLIDE 14

Inside a GPU-accelerated server

The sequence so far: recv(); batch(); optimize(); transfer(); balance(); invoke(); GPU_compute(); transfer(); cleanup();

SLIDE 15

Inside a GPU-accelerated server

The full sequence: recv(); batch(); optimize(); transfer(); balance(); invoke(); GPU_compute(); transfer(); cleanup(); dispatch(); send();

SLIDE 16

Inside a GPU-accelerated server

Aggressive pipelining: double buffering, asynchrony, multithreading

The whole sequence - recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send(); - is replicated for every in-flight batch so that networking, PCIe transfers, and GPU compute overlap.

SLIDE 17

recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send(); ... repeated for every in-flight batch.

This code is for a CPU to manage a GPU.
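To make the slide concrete, here is a rough sketch (not from the paper) of what "double buffering, asynchrony, multithreading" looks like in CUDA host code. recv_batch() stands in for recv()+batch(), send_batch() for dispatch()+send(), and the kernel body, batch size, and launch geometry are arbitrary placeholders:

    /* Hypothetical CPU-side pipeline for a GPU-accelerated server (sketch only).
       Two pinned buffer pairs and two CUDA streams overlap networking,
       PCIe transfers, and GPU compute. */
    #include <cuda_runtime.h>
    #include <cstddef>

    #define NBUF 2                        /* double buffering */
    #define BATCH_BYTES (1 << 20)

    __global__ void GPU_compute(const char *in, char *out, size_t n) {
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += (size_t)gridDim.x * blockDim.x)
            out[i] = in[i];               /* placeholder compute */
    }

    static void recv_batch(char *buf, size_t n) { /* recv() + batch() placeholder */ }
    static void send_batch(const char *buf, size_t n) { /* dispatch() + send() placeholder */ }

    int main() {
        char *h_in[NBUF], *h_out[NBUF], *d_in[NBUF], *d_out[NBUF];
        cudaStream_t stream[NBUF];
        for (int i = 0; i < NBUF; i++) {
            cudaMallocHost((void **)&h_in[i], BATCH_BYTES);   /* pinned for async DMA */
            cudaMallocHost((void **)&h_out[i], BATCH_BYTES);
            cudaMalloc((void **)&d_in[i], BATCH_BYTES);
            cudaMalloc((void **)&d_out[i], BATCH_BYTES);
            cudaStreamCreate(&stream[i]);
        }
        for (int it = 0; it < 1000; it++) {
            int b = it % NBUF;
            cudaStreamSynchronize(stream[b]);          /* wait for this buffer's previous round */
            if (it >= NBUF)
                send_batch(h_out[b], BATCH_BYTES);     /* ship the finished results */
            recv_batch(h_in[b], BATCH_BYTES);          /* gather the next batch of requests */
            cudaMemcpyAsync(d_in[b], h_in[b], BATCH_BYTES,
                            cudaMemcpyHostToDevice, stream[b]);     /* transfer() in */
            GPU_compute<<<128, 256, 0, stream[b]>>>(d_in[b], d_out[b], BATCH_BYTES);
            cudaMemcpyAsync(h_out[b], d_out[b], BATCH_BYTES,
                            cudaMemcpyDeviceToHost, stream[b]);     /* transfer() out */
        }
        return 0;
    }

Even this stripped-down version needs pinned buffers, streams, and per-buffer bookkeeping; the servers cited earlier also add load balancing and multi-threaded dispatch, which is exactly the plumbing GPUnet aims to remove.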

SLIDE 18

GPUs are not co-processors. GPUs are peer-processors. They need I/O abstractions.

File system I/O – GPUfs [ASPLOS 2013]
Network I/O – this work (GPUnet)

SLIDE 19

GPUnet: socket API for GPUs Application view

[Diagram: node0.technion.ac.il runs a GPU-native server that calls socket(AF_INET, SOCK_STREAM) and listen(:2340) through GPUnet; a GPU-native client and an unmodified CPU client each call socket(AF_INET, SOCK_STREAM) and connect("node0:2340") over the network]

SLIDE 20

GPU-accelerated server with GPUnet

[Diagram: CPU, GPU, NIC, and their memories over the PCIe bus; the CPU is not involved - the GPU itself runs recv(), GPU_compute(), send()]

SLIDE 21

GPU-accelerated server with GPUnet

[Diagram: only the GPU, its memory, and the NIC remain on the data path (over PCIe); the GPU runs recv(), GPU_compute(), send()]

SLIDE 22

GPU-accelerated server with GPUnet

No request batching: multiple independent recv(), GPU_compute(), send() loops run in parallel on the GPU (one per threadblock), each talking to the NIC through GPU memory.

SLIDE 23

GPU-accelerated server with GPUnet

Automatic request pipelining, automatic buffer management: the parallel recv(), GPU_compute(), send() loops keep the NIC and the GPU busy without any CPU-side pipelining code.
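For contrast with the CPU-managed pipeline above, here is an illustrative sketch of what a server kernel can look like when the GPU has sockets. The gpu_-prefixed calls below are stand-ins named after the slide's recv()/send(), not the literal GPUnet API; the real function names, signatures, and the convention that calls are issued collectively by a threadblock are defined in the paper and at https://github.com/ut-osa/gpunet:

    /* Illustrative pseudocode of a GPU-native server worker (one threadblock,
       one connection). gpu_accept/gpu_recv/gpu_send are stand-ins for GPUnet's
       in-kernel socket calls. */
    #define MSG_SIZE 4096

    __device__ int  gpu_accept(int server_sock);                   /* stand-in */
    __device__ int  gpu_recv(int sock, void *buf, int len);        /* stand-in */
    __device__ int  gpu_send(int sock, const void *buf, int len);  /* stand-in */
    __device__ void GPU_compute(char *buf, int len);               /* application logic */

    __global__ void gpunet_server(int server_sock) {
        __shared__ char buf[MSG_SIZE];
        int sock = gpu_accept(server_sock);           /* one connection per threadblock */
        for (;;) {
            int n = gpu_recv(sock, buf, MSG_SIZE);    /* data lands directly in GPU memory */
            if (n <= 0) break;
            GPU_compute(buf, n);                      /* all threads of the block cooperate */
            gpu_send(sock, buf, n);                   /* reply without touching the CPU */
        }
    }

There is no batching, transfer, or dispatch code here: running many such threadblock workers side by side is what provides the automatic pipelining and buffer management on this slide.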

SLIDE 24

Building a socket abstraction for GPUs

SLIDE 25

Goals

Simplicity – a reliable streaming abstraction for GPUs
Performance – NIC → GPU data path optimizations

SLIDE 26

Design option 1: Transport layer processing on the CPU

[Diagram: network buffers and transport processing sit in CPU memory; the GPU calls recv() and controls the flow of data]

SLIDE 27

Design option 1: Transport layer processing on the CPU

Problem: extra CPU-GPU memory transfers (data first lands in CPU network buffers and must then be copied to the GPU)

SLIDE 28

Design option 2: Transport layer processing on GPU

[Diagram: network buffers and transport processing move into GPU memory; the NIC reaches them by peer-to-peer (P2P) DMA, and the GPU calls recv() locally]

SLIDE 29

Design option 2: Transport layer processing on GPU

Problems: TCP/IP on the GPU? Would CPU applications then have to access the network through the GPU?

SLIDE 30

Not CPU, Not GPU

We need help from NIC hardware

SLIDE 31

RDMA: offloading transport layer processing to NIC

[Diagram: the NIC implements reliable RDMA between message buffers in CPU and GPU memory; a streaming layer is built on top]

SLIDE 32

GPUnet layers

GPU Socket API
Reliable channel: reliable in-order streaming
Transports: RDMA (Infiniband) and non-RDMA (UNIX domain sockets, TCP/IP)

SLIDE 33

GPUnet layers

Same layers, mapped onto hardware to balance simplicity and performance: the GPU socket API and reliable in-order streaming run on the GPU; the RDMA transport (Infiniband) is offloaded to the NIC; the non-RDMA transports (UNIX domain sockets, TCP/IP) go through the CPU.

SLIDE 34

See the paper for

  • Coalesced API calls
  • Latency-optimized GPU-CPU flow control
  • Memory management
  • Bounce buffers
  • Non-RDMA support
  • GPU performance optimizations

SLIDE 35

Implementation

• Standard API calls, blocking/nonblocking
• libGPUnet.a: AF_INET, streaming over Infiniband RDMA
• Fully compatible with the CPU rsocket library (see the client sketch after this list)
• libUNIXnet.a: AF_LOCAL, UNIX domain socket support for inter-GPU and CPU-GPU communication
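Because the GPU server is compatible with rsocket, an unmodified CPU client can keep using the r-prefixed calls from librdmacm. A small sketch (the server address, port, and message format are made-up examples; error handling is omitted):

    /* Unmodified CPU client talking to a GPUnet server through the rsocket API
       (rdma/rsocket.h); the calls mirror the standard sockets API. */
    #include <rdma/rsocket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>

    int main(void) {
        int s = rsocket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in srv;
        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port = htons(2340);                      /* port used on the earlier slide */
        inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);  /* example server address */
        rconnect(s, (struct sockaddr *)&srv, sizeof(srv));
        const char req[] = "request";
        char reply[4096];
        rsend(s, req, sizeof(req), 0);       /* request travels over Infiniband RDMA */
        rrecv(s, reply, sizeof(reply), 0);   /* reply produced by the GPU server */
        rclose(s);
        return 0;
    }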

SLIDE 36

Implementation

[Diagram: on the GPU, the application links against the GPUnet socket library; network buffers and flow-control state live in GPU memory, which the NIC accesses directly; a GPUnet proxy on the CPU provides bounce buffers as a CPU-memory fallback]

SLIDE 37

Evaluation

  • Analysis of GPU-native server design
  • Matrix product server
  • In-GPU-memory MapReduce
  • Face verification server

Hardware: 2x 6-core Intel E5-2620, NVIDIA Tesla K20Xm GPU, Mellanox Connect-IB HCA, Switch-X bridge

SLIDE 38

In-GPU-memory MapReduce

[Diagram: on each GPU, Map reads its input through GPUfs and sends intermediate data over GPUnet to Receivers on other GPUs, which then run Sort and Reduce entirely in GPU memory]

SLIDE 39

In-GPU-memory MapReduce: Scalability

             1 GPU (no network)   4 GPUs (GPUnet)
K-means      5.6 sec              1.6 sec (3.5x)
Word-count   29.6 sec             10 sec (2.9x)

GPUnet enables scale-out for GPU-accelerated systems

SLIDE 40

Face verification server

[Diagram: an unmodified CPU client and an unmodified memcached server talk to a GPU server (built on GPUnet) over Infiniband via rsocket; for each request the GPU server runs recv(), GPU_features(), query_DB() against memcached, GPU_compare(), send(), checking whether two face images match]
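Putting the diagram into code form: a compressed, illustrative sketch of one worker in the face verification server. Only the call names shown on the slide (recv/send, GPU_features, query_DB, GPU_compare) come from the talk; their signatures, the buffer sizes, and the gpu_recv/gpu_send stand-ins for GPUnet's in-kernel socket calls are assumptions, and the real benchmark lives in the paper and repository:

    /* Illustrative pseudocode of one face-verification worker (one threadblock).
       The client socket carries verification requests; the memcached socket is
       used to fetch the stored feature vector for the claimed identity. */
    #define IMG_BYTES   (64 * 64)
    #define FEAT_FLOATS 128

    __device__ int  gpu_recv(int sock, void *buf, int len);             /* stand-in */
    __device__ int  gpu_send(int sock, const void *buf, int len);       /* stand-in */
    __device__ void GPU_features(const char *img, float *feat);         /* from the slide */
    __device__ void query_DB(int mc_sock, const char *img, float *ref); /* memcached lookup */
    __device__ int  GPU_compare(const float *a, const float *b);        /* from the slide */

    __global__ void face_verify_worker(int client_sock, int memcached_sock) {
        __shared__ char  img[IMG_BYTES];
        __shared__ float feat[FEAT_FLOATS], ref[FEAT_FLOATS];
        for (;;) {
            if (gpu_recv(client_sock, img, IMG_BYTES) <= 0) break;  /* image from the client */
            GPU_features(img, feat);                       /* extract features on the GPU */
            query_DB(memcached_sock, img, ref);            /* stored features via memcached */
            int match = GPU_compare(feat, ref);            /* do the two faces match? */
            gpu_send(client_sock, &match, sizeof(match));  /* reply directly from the GPU */
        }
    }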

SLIDE 41

Face verification: Different implementations

[Chart: request latency (μsec; 25th-75th percentile, median, 99th percentile; axis 500-2500) and throughput (KReq/sec: 23, 34, 54) for three implementations: CPU with 6 cores, 1 GPU without GPUnet, 1 GPU with GPUnet]

SLIDE 42

Face verification: Different implementations

[Same chart as the previous slide]

1.9x throughput, 1/3x latency, ½ LOC

SLIDE 43

Face verification: Different implementations

[Same chart as the previous slide]

Large variability in latency

SLIDE 44

Face verification on all processors 2xGPU + 10xCPU

[Chart: latency (μsec, axis 500-2500) and throughput (KReq/sec) for 1 GPU with GPUnet, 2x GPUnet + 10x CPU cores, and a 6-core CPU in latency-optimized and throughput-optimized configurations; throughput values shown: 23, 34, 54, 164, 186 KReq/sec]

Similar latency, 4.5x throughput

SLIDE 45

Set GPUs free!

mark@ee.technion.ac.il

GPUnet is a library providing networking abstractions for GPUs https://github.com/ut-osa/gpunet