GPUnet: networking abstractions for GPU programs


SLIDE 1

GPUnet: networking abstractions for GPU programs

Mark Silberstein, Amir Wated – Technion, Israel Institute of Technology
Sangman Kim, Seonggu Huh, Xinya Zhang, Yige Hu, Emmett Witchel – University of Texas at Austin

SLIDE 2

What

A socket API for programs running on the GPU

Why

GPU-accelerated servers are hard to build

Results

GPU vs. CPU 50% throughput, 60% latency, ½ LOC

SLIDE 3

Motivation: GPU-accelerated networking applications

[Diagram: data processing servers and MapReduce clusters, each built from multiple GPUs]

SLIDE 4

Recent GPU-accelerated networking applications

SSLShader (Jang 2011), GPU MapReduce (Stuart 2011), Deep Neural Networks (Coates 2013), Dandelion (Rossbach 2013), Rhythm (Agrawal 2014) ...

SLIDE 5

Recent GPU-accelerated networking applications required heroic efforts

SSLShader (Jang 2011), GPU MapReduce (Stuart 2011), Deep Neural Networks (Coates 2013), Dandelion (Rossbach 2013), Rhythm (Agrawal 2014) ...

SLIDE 6

GPU-accelerated networking apps: Recurring themes

• Request batching
• NIC-GPU interaction
• Pipelining and buffer management

SLIDE 7

GPU-accelerated networking apps: Recurring themes

• Request batching
• CPU-GPU-NIC pipelining
• NIC-GPU interaction

We will sidestep these problems

SLIDE 8

The real problem: CPU is the only boss

[Diagram: the CPU sits in the middle, driving the GPU, the NIC, and storage]

SLIDE 9

Example: CPU server

[Diagram: CPU, NIC, and memory; the server loop is just recv(), compute(), send()]
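For reference, the CPU server on this slide is just a plain sockets loop. A minimal sketch (the port, buffer size, and compute() body are placeholders, and error handling is omitted):

    /* Minimal CPU server corresponding to the slide: recv(), compute(), send(). */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <unistd.h>

    static void compute(char *buf, int n) { /* application logic placeholder */ }

    int main(void) {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(2340);
        addr.sin_addr.s_addr = INADDR_ANY;
        bind(srv, (struct sockaddr *)&addr, sizeof(addr));
        listen(srv, 16);
        int c = accept(srv, NULL, NULL);
        char buf[4096];
        for (int n; (n = recv(c, buf, sizeof(buf), 0)) > 0; ) {
            compute(buf, n);        /* data is already in CPU memory */
            send(c, buf, n, 0);     /* reply from the same buffer */
        }
        close(c);
        close(srv);
        return 0;
    }

Everything lives in one address space and one control flow; the next slides show how much extra machinery appears once a GPU is added.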

SLIDE 10

Inside a GPU-accelerated server

[Diagram: CPU and GPU, each with its own memory, connected by the PCIe bus, with the NIC attached]

In theory: recv(); GPU_compute(); send();

SLIDE 11

Inside a GPU-accelerated server

[Same diagram: CPU, GPU, NIC, and their memories]

In practice, the CPU code starts with: recv(); batch();

SLIDE 12

Inside a GPU-accelerated server

CPU code so far: recv(); batch(); optimize(); transfer();

SLIDE 13

Inside a GPU-accelerated server

CPU code so far: recv(); batch(); optimize(); transfer(); balance(); invoke();
On the GPU: GPU_compute();

SLIDE 14

Inside a GPU-accelerated server

The sequence so far: recv(); batch(); optimize(); transfer(); balance(); invoke(); GPU_compute(); transfer(); cleanup();

SLIDE 15

Inside a GPU-accelerated server

The full sequence: recv(); batch(); optimize(); transfer(); balance(); invoke(); GPU_compute(); transfer(); cleanup(); dispatch(); send();

SLIDE 16

Inside a GPU-accelerated server

Aggressive pipelining: double buffering, asynchrony, multithreading

The whole sequence - recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send(); - is replicated for every in-flight batch so that networking, PCIe transfers, and GPU compute overlap.

SLIDE 17

recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send(); ... repeated for every in-flight batch.

This code is for a CPU to manage a GPU.
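To make the slide concrete, here is a rough sketch (not from the paper) of what "double buffering, asynchrony, multithreading" looks like in CUDA host code. recv_batch() stands in for recv()+batch(), send_batch() for dispatch()+send(), and the kernel body, batch size, and launch geometry are arbitrary placeholders:

    /* Hypothetical CPU-side pipeline for a GPU-accelerated server (sketch only).
       Two pinned buffer pairs and two CUDA streams overlap networking,
       PCIe transfers, and GPU compute. */
    #include <cuda_runtime.h>
    #include <cstddef>

    #define NBUF 2                        /* double buffering */
    #define BATCH_BYTES (1 << 20)

    __global__ void GPU_compute(const char *in, char *out, size_t n) {
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += (size_t)gridDim.x * blockDim.x)
            out[i] = in[i];               /* placeholder compute */
    }

    static void recv_batch(char *buf, size_t n) { /* recv() + batch() placeholder */ }
    static void send_batch(const char *buf, size_t n) { /* dispatch() + send() placeholder */ }

    int main() {
        char *h_in[NBUF], *h_out[NBUF], *d_in[NBUF], *d_out[NBUF];
        cudaStream_t stream[NBUF];
        for (int i = 0; i < NBUF; i++) {
            cudaMallocHost((void **)&h_in[i], BATCH_BYTES);   /* pinned for async DMA */
            cudaMallocHost((void **)&h_out[i], BATCH_BYTES);
            cudaMalloc((void **)&d_in[i], BATCH_BYTES);
            cudaMalloc((void **)&d_out[i], BATCH_BYTES);
            cudaStreamCreate(&stream[i]);
        }
        for (int it = 0; it < 1000; it++) {
            int b = it % NBUF;
            cudaStreamSynchronize(stream[b]);          /* wait for this buffer's previous round */
            if (it >= NBUF)
                send_batch(h_out[b], BATCH_BYTES);     /* ship the finished results */
            recv_batch(h_in[b], BATCH_BYTES);          /* gather the next batch of requests */
            cudaMemcpyAsync(d_in[b], h_in[b], BATCH_BYTES,
                            cudaMemcpyHostToDevice, stream[b]);     /* transfer() in */
            GPU_compute<<<128, 256, 0, stream[b]>>>(d_in[b], d_out[b], BATCH_BYTES);
            cudaMemcpyAsync(h_out[b], d_out[b], BATCH_BYTES,
                            cudaMemcpyDeviceToHost, stream[b]);     /* transfer() out */
        }
        return 0;
    }

Even this stripped-down version needs pinned buffers, streams, and per-buffer bookkeeping; the servers cited earlier also add load balancing and multi-threaded dispatch, which is exactly the plumbing GPUnet aims to remove.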

SLIDE 18

GPUs are not co-processors. GPUs are peer-processors. They need I/O abstractions.

File system I/O – GPUfs [ASPLOS 2013]
Network I/O – this work (GPUnet)

SLIDE 19

GPUnet: socket API for GPUs Application view

[Diagram: node0.technion.ac.il runs a GPU-native server that calls socket(AF_INET, SOCK_STREAM) and listen(:2340) through GPUnet; a GPU-native client and an unmodified CPU client each call socket(AF_INET, SOCK_STREAM) and connect("node0:2340") over the network]

SLIDE 20

GPU-accelerated server with GPUnet

[Diagram: CPU, GPU, NIC, and their memories over the PCIe bus; the CPU is not involved - the GPU itself runs recv(), GPU_compute(), send()]

SLIDE 21

GPU-accelerated server with GPUnet

[Diagram: only the GPU, its memory, and the NIC remain on the data path (over PCIe); the GPU runs recv(), GPU_compute(), send()]

SLIDE 22

GPU-accelerated server with GPUnet

No request batching: multiple independent recv(), GPU_compute(), send() loops run in parallel on the GPU (one per threadblock), each talking to the NIC through GPU memory.

SLIDE 23

GPU-accelerated server with GPUnet

Automatic request pipelining, automatic buffer management: the parallel recv(), GPU_compute(), send() loops keep the NIC and the GPU busy without any CPU-side pipelining code.
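For contrast with the CPU-managed pipeline above, here is an illustrative sketch of what a server kernel can look like when the GPU has sockets. The gpu_-prefixed calls below are stand-ins named after the slide's recv()/send(), not the literal GPUnet API; the real function names, signatures, and the convention that calls are issued collectively by a threadblock are defined in the paper and at https://github.com/ut-osa/gpunet:

    /* Illustrative pseudocode of a GPU-native server worker (one threadblock,
       one connection). gpu_accept/gpu_recv/gpu_send are stand-ins for GPUnet's
       in-kernel socket calls. */
    #define MSG_SIZE 4096

    __device__ int  gpu_accept(int server_sock);                   /* stand-in */
    __device__ int  gpu_recv(int sock, void *buf, int len);        /* stand-in */
    __device__ int  gpu_send(int sock, const void *buf, int len);  /* stand-in */
    __device__ void GPU_compute(char *buf, int len);               /* application logic */

    __global__ void gpunet_server(int server_sock) {
        __shared__ char buf[MSG_SIZE];
        int sock = gpu_accept(server_sock);           /* one connection per threadblock */
        for (;;) {
            int n = gpu_recv(sock, buf, MSG_SIZE);    /* data lands directly in GPU memory */
            if (n <= 0) break;
            GPU_compute(buf, n);                      /* all threads of the block cooperate */
            gpu_send(sock, buf, n);                   /* reply without touching the CPU */
        }
    }

There is no batching, transfer, or dispatch code here: running many such threadblock workers side by side is what provides the automatic pipelining and buffer management on this slide.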

SLIDE 24

Building a socket abstraction for GPUs

SLIDE 25

Goals

Simplicity – a reliable streaming abstraction for GPUs
Performance – NIC → GPU data path optimizations

SLIDE 26

Design option 1: Transport layer processing on the CPU

[Diagram: network buffers and transport processing sit in CPU memory; the GPU calls recv() and controls the flow of data]

SLIDE 27

Design option 1: Transport layer processing on the CPU

Problem: extra CPU-GPU memory transfers (data first lands in CPU network buffers and must then be copied to the GPU)

SLIDE 28

Design option 2: Transport layer processing on GPU

[Diagram: network buffers and transport processing move into GPU memory; the NIC reaches them by peer-to-peer (P2P) DMA, and the GPU calls recv() locally]

SLIDE 29

Design option 2: Transport layer processing on GPU

Problems: TCP/IP on the GPU? Would CPU applications then have to access the network through the GPU?

SLIDE 30

Not CPU, Not GPU

We need help from NIC hardware

SLIDE 31

RDMA: offloading transport layer processing to NIC

[Diagram: the NIC implements reliable RDMA between message buffers in CPU and GPU memory; a streaming layer is built on top]

SLIDE 32

GPUnet layers

GPU Socket API
Reliable channel: reliable in-order streaming
Transports: RDMA (Infiniband) and non-RDMA (UNIX domain sockets, TCP/IP)

SLIDE 33

GPUnet layers

Same layers, mapped onto hardware to balance simplicity and performance: the GPU socket API and reliable in-order streaming run on the GPU; the RDMA transport (Infiniband) is offloaded to the NIC; the non-RDMA transports (UNIX domain sockets, TCP/IP) go through the CPU.

SLIDE 34

See the paper for

  • Coalesced API calls
  • Latency-optimized GPU-CPU flow control
  • Memory management
  • Bounce buffers
  • Non-RDMA support
  • GPU performance optimizations

SLIDE 35

Implementation

• Standard API calls, blocking/nonblocking
• libGPUnet.a: AF_INET, streaming over Infiniband RDMA
• Fully compatible with the CPU rsocket library (see the client sketch after this list)
• libUNIXnet.a: AF_LOCAL, UNIX domain socket support for inter-GPU and CPU-GPU communication
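Because the GPU server is compatible with rsocket, an unmodified CPU client can keep using the r-prefixed calls from librdmacm. A small sketch (the server address, port, and message format are made-up examples; error handling is omitted):

    /* Unmodified CPU client talking to a GPUnet server through the rsocket API
       (rdma/rsocket.h); the calls mirror the standard sockets API. */
    #include <rdma/rsocket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>

    int main(void) {
        int s = rsocket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in srv;
        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port = htons(2340);                      /* port used on the earlier slide */
        inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);  /* example server address */
        rconnect(s, (struct sockaddr *)&srv, sizeof(srv));
        const char req[] = "request";
        char reply[4096];
        rsend(s, req, sizeof(req), 0);       /* request travels over Infiniband RDMA */
        rrecv(s, reply, sizeof(reply), 0);   /* reply produced by the GPU server */
        rclose(s);
        return 0;
    }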

SLIDE 36

Implementation

[Diagram: on the GPU, the application links against the GPUnet socket library; network buffers and flow-control state live in GPU memory, which the NIC accesses directly; a GPUnet proxy on the CPU provides bounce buffers as a CPU-memory fallback]

SLIDE 37

Evaluation

  • Analysis of GPU-native server design
  • Matrix product server
  • In-GPU-memory MapReduce
  • Face verification server

Hardware: 2x 6-core Intel E5-2620, NVIDIA Tesla K20Xm GPU, Mellanox Connect-IB HCA, Switch-X bridge

SLIDE 38

In-GPU-memory MapReduce

[Diagram: on each GPU, Map reads its input through GPUfs and sends intermediate data over GPUnet to Receivers on other GPUs, which then run Sort and Reduce entirely in GPU memory]

SLIDE 39

In-GPU-memory MapReduce: Scalability

             1 GPU (no network)   4 GPUs (GPUnet)
K-means      5.6 sec              1.6 sec (3.5x)
Word-count   29.6 sec             10 sec (2.9x)

GPUnet enables scale-out for GPU-accelerated systems

SLIDE 40

Face verification server

[Diagram: an unmodified CPU client and an unmodified memcached server talk to a GPU server (built on GPUnet) over Infiniband via rsocket; for each request the GPU server runs recv(), GPU_features(), query_DB() against memcached, GPU_compare(), send(), checking whether two face images match]
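Putting the diagram into code form: a compressed, illustrative sketch of one worker in the face verification server. Only the call names shown on the slide (recv/send, GPU_features, query_DB, GPU_compare) come from the talk; their signatures, the buffer sizes, and the gpu_recv/gpu_send stand-ins for GPUnet's in-kernel socket calls are assumptions, and the real benchmark lives in the paper and repository:

    /* Illustrative pseudocode of one face-verification worker (one threadblock).
       The client socket carries verification requests; the memcached socket is
       used to fetch the stored feature vector for the claimed identity. */
    #define IMG_BYTES   (64 * 64)
    #define FEAT_FLOATS 128

    __device__ int  gpu_recv(int sock, void *buf, int len);             /* stand-in */
    __device__ int  gpu_send(int sock, const void *buf, int len);       /* stand-in */
    __device__ void GPU_features(const char *img, float *feat);         /* from the slide */
    __device__ void query_DB(int mc_sock, const char *img, float *ref); /* memcached lookup */
    __device__ int  GPU_compare(const float *a, const float *b);        /* from the slide */

    __global__ void face_verify_worker(int client_sock, int memcached_sock) {
        __shared__ char  img[IMG_BYTES];
        __shared__ float feat[FEAT_FLOATS], ref[FEAT_FLOATS];
        for (;;) {
            if (gpu_recv(client_sock, img, IMG_BYTES) <= 0) break;  /* image from the client */
            GPU_features(img, feat);                       /* extract features on the GPU */
            query_DB(memcached_sock, img, ref);            /* stored features via memcached */
            int match = GPU_compare(feat, ref);            /* do the two faces match? */
            gpu_send(client_sock, &match, sizeof(match));  /* reply directly from the GPU */
        }
    }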

SLIDE 41

Face verification: Different implementations

[Chart: request latency (μsec; 25th-75th percentile, median, 99th percentile; axis 500-2500) and throughput (KReq/sec: 23, 34, 54) for three implementations: CPU with 6 cores, 1 GPU without GPUnet, 1 GPU with GPUnet]

SLIDE 42

Face verification: Different implementations

[Same chart as the previous slide]

1.9x throughput, 1/3x latency, ½ LOC

SLIDE 43

Face verification: Different implementations

[Same chart as the previous slide]

Large variability in latency

SLIDE 44

Face verification on all processors 2xGPU + 10xCPU

[Chart: latency (μsec, axis 500-2500) and throughput (KReq/sec) for 1 GPU with GPUnet, 2x GPUnet + 10x CPU cores, and a 6-core CPU in latency-optimized and throughput-optimized configurations; throughput values shown: 23, 34, 54, 164, 186 KReq/sec]

Similar latency, 4.5x throughput

SLIDE 45

Set GPUs free!

mark@ee.technion.ac.il

GPUnet is a library providing networking abstractions for GPUs https://github.com/ut-osa/gpunet