GPUDIRECT: INTEGRATING THE GPU WITH A NETWORK INTERFACE - DAVIDE ROSSETTI, SW COMPUTE TEAM


SLIDE 1

DAVIDE ROSSETTI, SW COMPUTE TEAM

GPUDIRECT: INTEGRATING THE GPU WITH A NETWORK INTERFACE

SLIDE 2

GPUDIRECT FAMILY [1]

  • GPUDirect Shared GPU-Sysmem for inter-node copy optimization
  • GPUDirect P2P for intra-node, accelerated GPU-GPU memcpy
  • GPUDirect P2P for intra-node, inter-GPU LD/ST access
  • GPUDirect RDMA [2] for inter-node copy optimization

[1] developer info: https://developer.nvidia.com/gpudirect
[2] http://docs.nvidia.com/cuda/gpudirect-rdma

SLIDE 3

GPUDIRECT RDMA CAPABILITIES & LIMITATIONS

GPUDirect RDMA

  • direct HCA access to GPU memory
  • CPU still driving computing + communication
  • fast CPU needed; implications: power, latency, TCO; risks: limited scaling …

SLIDE 4

MOVING DATA AROUND

GPUDirect RDMA

[Diagram: data plane and control plane across GPU, CPU, IOH, HCA]

  • CPU prepares and queues communication tasks on the HCA
  • CPU synchronizes with GPU tasks
  • HCA directly accesses GPU memory

SLIDE 5

MEET THE NEXT THING

GPUDirect Async

[Diagram: data plane and control plane across CPU, IOH, HCA, GPU; GPUDirect Async control path alongside the GPUDirect RDMA data path]

  • CPU prepares and queues compute and communication tasks on the GPU
  • GPU triggers communication on the HCA
  • HCA directly accesses GPU memory

SLIDE 6

CPU OFF THE CRITICAL PATH

CPU prepares the work plan
  • hardly parallelizable, branch intensive

GPU orchestrates the flow
  • runs on an optimized scheduling unit
  • the same unit that schedules GPU work
  • now also schedules network communications

SLIDE 7

KERNEL+SEND

NORMAL FLOW

a_kernel<<<…,stream>>>(buf);       // launch compute kernel
cudaStreamSynchronize(stream);     // CPU blocks until the kernel completes
ibv_post_send(buf);                // CPU posts the send to the HCA
while (!done) ibv_poll_cq(txcq);   // CPU spins on the TX completion queue
b_kernel<<<…,stream>>>(buf);       // only then launch the next kernel

[Timeline: GPU, CPU, HCA]

SLIDE 8

KERNEL+SEND

GPUDIRECT ASYNC

a_kernel<<<…,stream>>>(buf);           // launch compute kernel
gds_stream_queue_send(stream,qp,buf);  // queue the send on the stream; GPU triggers the HCA
gds_stream_wait_cq(stream,txcq);       // GPU waits on the TX completion queue
b_kernel<<<…,stream>>>(buf);           // next kernel queued immediately

[Timeline: GPU, CPU, HCA]

  • CPU is free
  • kernel launch latency is hidden
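Putting the two slides together, a hedged pseudocode sketch of a queued kernel+send loop, using the simplified gds_* calls as they appear on this slide (actual libgdsync signatures differ; n_iters, grid, block, buf, qp, and txcq are assumed to be set up elsewhere):

```c
/* Pseudocode sketch, not a runnable program: every operation is queued
 * on the CUDA stream up front, so the GPU's scheduling unit drives the
 * HCA without per-iteration CPU involvement. */
for (int i = 0; i < n_iters; ++i) {
    a_kernel<<<grid, block, 0, stream>>>(buf);  /* compute step       */
    gds_stream_queue_send(stream, qp, buf);     /* GPU triggers send  */
    gds_stream_wait_cq(stream, txcq);           /* GPU waits for CQE  */
}
cudaStreamSynchronize(stream);  /* CPU synchronizes once, at the end */
```

This is also what enables the batching mentioned later: many iterations can be queued in one go, amortizing launch overhead.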

SLIDE 9

RECEIVE+KERNEL

NORMAL FLOW

while (!done) ibv_poll_cq();   // CPU spins until the message arrives
a_kernel<<<…,stream>>>(buf);   // only then launches the kernel
cuStreamSynchronize(stream);

[Timeline: GPU, CPU, HCA; incoming message, GPU kernel execution triggered]

SLIDE 10

RECEIVE+KERNEL

GPUDIRECT ASYNC

gds_stream_wait_cq(stream,rx_cq);  // wait queued on the stream before the message arrives
a_kernel<<<…,stream>>>(buf);       // kernel queued to the GPU right away
cuStreamSynchronize(stream);

[Timeline: GPU, CPU, HCA; kernel queued to GPU, incoming message]

  • kernel launch is moved much earlier: its latency is hidden!
  • CPU is idle: deep sleep state!

GPU kernel execution triggered
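On the receive side the same inversion applies; a hedged pseudocode sketch contrasting the two flows from these slides (identifiers as on the slides; grid, block, buf, rx_cq are assumed to be set up elsewhere):

```c
/* Pseudocode sketch, not a runnable program. */

/* Normal flow: CPU busy-waits, then launches the kernel. */
while (!done) ibv_poll_cq();                /* CPU spins on the RX CQ  */
a_kernel<<<grid, block, 0, stream>>>(buf);  /* launched after arrival  */

/* GPUDirect Async: the wait is queued on the stream *before* the
 * message arrives, so kernel launch overhead overlaps the transfer. */
gds_stream_wait_cq(stream, rx_cq);          /* stream blocks on RX CQE    */
a_kernel<<<grid, block, 0, stream>>>(buf);  /* runs as soon as data lands */
cuStreamSynchronize(stream);                /* CPU may sleep here         */
```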

SLIDE 11

USE CASE SCENARIOS

Performance mode (~ Top500)
  • enable batching to increase performance
  • CPU stays available: additional GFlops

Economy mode (~ Green500)
  • enable GPU IRQ-waiting mode to free more CPU cycles
  • optionally a slimmer CPU
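Economy mode's "GPU IRQ waiting" means the CPU sleeps on an interrupt instead of busy-polling. In plain CUDA the analogous host-side choice is the blocking-sync device flag; a minimal fragment sketch (standard CUDA runtime API; requires a GPU and the usual setup to actually run):

```c
#include <cuda_runtime.h>

/* Ask the runtime to block the host thread on a synchronization
 * primitive (woken by an interrupt) instead of spin-waiting. */
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

/* Later, cudaStreamSynchronize(stream) puts the CPU to sleep rather
 * than burning a core, trading a little latency for power. */
```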

SLIDE 12

PERFORMANCE MODE

[Chart: latency (us) vs. compute buffer size (4096 to 65536 bytes) for RDMA only, +Async TX only, and +Async]

40% faster

[*] modified ud_pingpong test: recv + GPU kernel + send on each side. 2 nodes: Ivy Bridge Xeon + K40 + Connect-IB + MLNX switch, 10000 iterations, message size: 128B, batch size: 20

SLIDE 13

2D STENCIL BENCHMARK

[Chart: average time per iteration vs. number of nodes (2, 4) for RDMA only and +Async]

weak scaling, 256^2 local lattice, 2x1 and 2x2 node grids, 1 GPU per node

27% faster, 23% faster

[*] 4 nodes: Ivy Bridge Xeon + K40 + Connect-IB + MLNX switch

SLIDE 14

ECONOMY MODE

[Chart: CPU utilization, % load of a single CPU core, for RDMA only, RDMA w/IRQ, and +Async]

45% less CPU load

Round-trip latency (us), size=16384:
  • RDMA only: 39.28
  • RDMA w/IRQ: 178.62
  • +Async: 29.46

Round-trip latency 25% faster

[*] modified ud_pingpong test, HW same as in previous slide

SLIDE 15

SUMMARY

Meet Async, the next generation of GPUDirect
  • GPU orchestrates network operations
  • CPU off the critical path
  • 40% faster, 45% less CPU load

Excited about these topics? Collaborations & jobs @NVIDIA

SLIDE 16

NVIDIA REGISTERED DEVELOPER PROGRAMS

Everything you need to develop with NVIDIA products.
Membership is your first step in establishing a working relationship with NVIDIA Engineering.

  • Exclusive access to pre-releases
  • Submit bugs and feature requests
  • Stay informed about latest releases and training opportunities
  • Access to exclusive downloads
  • Exclusive activities and special offers
  • Interact with other developers in the NVIDIA Developer Forums

REGISTER FOR FREE AT: developer.nvidia.com

SLIDE 17

THANK YOU

SLIDE 18

PERFORMANCE VS ECONOMY

[*] modified ud_pingpong test, HW same as in previous slide, NUMA binding to socket0/core0, SBIOS power-saving profile

[Chart: Performance mode vs. Economy mode]