gpu with a network interface
play

GPU WITH A NETWORK INTERFACE DAVIDE ROSSETTI, SW COMPUTE TEAM - PowerPoint PPT Presentation

GPUDIRECT: INTEGRATING THE GPU WITH A NETWORK INTERFACE DAVIDE ROSSETTI, SW COMPUTE TEAM GPUDIRECT FAMILY 1 GPUDirect Shared GPU-Sysmem for inter-node copy optimization GPUDirect P2P for intra-node, accelerated GPU-GPU memcpy GPUDirect P2P


  1. GPUDIRECT: INTEGRATING THE GPU WITH A NETWORK INTERFACE DAVIDE ROSSETTI, SW COMPUTE TEAM

  2. GPUDIRECT FAMILY 1 GPUDirect Shared GPU-Sysmem for inter-node copy optimization GPUDirect P2P for intra-node, accelerated GPU-GPU memcpy GPUDirect P2P for intra-node, inter-GPU LD/ST access GPUDirect RDMA 2 for inter-node copy optimization [ 1 ] developer info: https://developer.nvidia.com/gpudirect [ 2 ] http://docs.nvidia.com/cuda/gpudirect-rdma

  3. GPUDIRECT RDMA CAPABILITIES & LIMITATIONS GPUDirect RDMA direct HCA access to GPU memory CPU still driving computing + communication Fast CPU needed Implications: power, latency, TCO Risks: limited scaling …

  4. MOVING DATA AROUND GPU HCA Data plane GPUDirect RDMA GPU IOH CPU CPU HCA directly accesses GPU memory CPU synchronizes with GPU tasks CPU prepares and queues Control plane communication tasks on HCA

  5. MEET NEXT THING GPU HCA Data plane GPUDirect RDMA GPU IOH GPUDirect Async CPU CPU GPU CPU prepares and queues compute and communication tasks on GPU Control plane GPU triggers communication on HCA HCA directly accesses GPU memory

  6. CPU OFF THE CRITICAL PATH CPU prepares work plan hardly parallelizable, branch intensive GPU orchestrates flow Runs on optimized scheduling unit Same one scheduling GPU work Now also scheduling network communications

  7. KERNEL+SEND GPU CPU HCA NORMAL FLOW a_kernel <<<…,stream>>> (buf); cudaStreamSynchronize(stream); ibv_post_send(buf); while (!done) ibv_poll_cq(txcq); b_kernel <<<…,stream>>>( buf);

  8. KERNEL+SEND GPU CPU HCA GPUDIRECT ASYNC a_kernel <<<…,stream>>> (buf); gds_stream_queue_send(stream,qp,buf); Kernel launch latency is hidden gds_stream_wait_cq(stream,txcq); b_kernel< <…,stream >>(buf); CPU is free

  9. RECEIVE+KERNEL GPU CPU HCA NORMAL FLOW incoming message while (!done) ibv_poll_cq(); a_kernel <<<…,stream>>>( buf); cuStreamSynchronize(stream); GPU kernel execution triggered

  10. RECEIVE+KERNEL GPU CPU HCA GPUDIRECT ASYNC Kernel launch moved way earlier latency is hidden!!! kernel queued to GPU gds_stream_wait_cq(stream,rx_cq); incoming message a_kernel <<<…,stream>>> (buf); cuStreamSynchronize(stream); CPU is idle GPU kernel deep sleep state!!! execution triggered

  11. USE CASE SCENARIOS Performance mode (~ Top500) Economy mode (~ Green500) enable batching enable GPU IRQ waiting mode increase performance free more CPU cycles CPU available, additional GFlops Optionally slimmer CPU

  12. PERFORMANCE MODE 50.00� 40% faster 45.00� 40.00� 35.00� 30.00� (us)� RDMA� only� Latency� 25.00� +Async� TX� only� 20.00� +Async� 15.00� 10.00� 5.00� 0.00� 4096� 8192� 16384� 32768� 65536� compute� buffer� size� (Bytes)� [*] modified ud_pingpong test: recv+GPU kernel+send on each side. 2 nodes: Ivy Bridge Xeon + K40 + Connect-IB + MLNX switch, 10000 iterations, message size: 128B, batch size: 20

  13. 2D STENCIL BENCHMARK weak scaling 2D� stencil� benchmark� 256^2 local lattice 80� 27% faster 23% faster 2x1, 2x2 node grids 70� itera on� 1 GPU per node 60� 50� per� Average� me� 40� RDMA� only� 30� +Async� 20� 10� 0� 2� 4� Number� of� nodes� [*] 4 nodes: Ivy Bridge Xeon + K40 + Connect-IB + MLNX switch

  14. ECONOMY MODE latency� 25% faster Round-trip� CPU� u liza on� 200.00� 100%� 180.00� 90%� 45% less CPU load 160.00� 80%� 140.00� 70%� (us)� 120.00� 60%� latency� 100.00� 50%� 80.00� 40%� 60.00� 30%� 40.00� 20%� 20.00� 10%� 0.00� � size=16384� 0%� %� load� of� single� CPU� core� RDMA� only� 39.28� RDMA� w/IRQ� 178.62� RDMA� only� RDMA� w/IRQ� +Async� +Async� 29.46� [*] modified ud_pingpong test, HW same as in previous slide

  15. SUMMARY Meet Async, next generation of GPUDirect GPU orchestrates network operations CPU off the critical path 40% faster , 45% less CPU load Excited about these topics ? collaborations & jobs @NVIDIA

  16. NVIDIA REGISTERED DEVELOPER PROGRAMS Everything you need to develop with NVIDIA products Membership is your first step in establishing a working relationship with NVIDIA Engineering Exclusive access to pre-releases Submit bugs and features requests Stay informed about latest releases and training opportunities Access to exclusive downloads Exclusive activities and special offers Interact with other developers in the NVIDIA Developer Forums REGISTER FOR FREE AT: developer.nvidia.com

  17. THANK YOU

  18. PERFORMANCE VS ECONOMY Performance mode Economy mode [*] modified ud_pingpong test, HW same as in previous slide, NUMA binding to socket0/core0, SBIOS power-saving profile

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend