GPUDIRECT: INTEGRATING THE GPU WITH A NETWORK INTERFACE - DAVIDE ROSSETTI, SW COMPUTE TEAM


SLIDE 1

DAVIDE ROSSETTI, SW COMPUTE TEAM

GPUDIRECT: INTEGRATING THE GPU WITH A NETWORK INTERFACE

SLIDE 2

GPUDIRECT FAMILY [1]

  • GPUDirect Shared GPU-Sysmem for inter-node copy optimization
  • GPUDirect P2P for intra-node, accelerated GPU-GPU memcpy
  • GPUDirect P2P for intra-node, inter-GPU LD/ST access
  • GPUDirect RDMA [2] for inter-node copy optimization

[1] developer info: https://developer.nvidia.com/gpudirect
[2] http://docs.nvidia.com/cuda/gpudirect-rdma

SLIDE 3

GPUDIRECT RDMA CAPABILITIES & LIMITATIONS

GPUDirect RDMA

  • direct HCA access to GPU memory
  • CPU still driving computing + communication
  • fast CPU needed; implications: power, latency, TCO; risks: limited scaling …

SLIDE 4

MOVING DATA AROUND

GPUDirect RDMA

[Diagram: data plane and control plane across GPU, CPU, IOH, HCA]

  • CPU prepares and queues communication tasks on the HCA
  • CPU synchronizes with GPU tasks
  • HCA directly accesses GPU memory

SLIDE 5

MEET THE NEXT THING

GPUDirect Async

[Diagram: data plane and control plane across CPU, IOH, HCA, GPU; GPUDirect Async control path alongside the GPUDirect RDMA data path]

  • CPU prepares and queues compute and communication tasks on the GPU
  • GPU triggers communication on the HCA
  • HCA directly accesses GPU memory

SLIDE 6

CPU OFF THE CRITICAL PATH

CPU prepares the work plan
  • hardly parallelizable, branch intensive

GPU orchestrates the flow
  • runs on an optimized scheduling unit
  • the same unit that schedules GPU work
  • now also schedules network communications

SLIDE 7

KERNEL+SEND

NORMAL FLOW

a_kernel<<<…,stream>>>(buf);       // launch compute kernel
cudaStreamSynchronize(stream);     // CPU blocks until the kernel completes
ibv_post_send(buf);                // CPU posts the send to the HCA
while (!done) ibv_poll_cq(txcq);   // CPU spins on the TX completion queue
b_kernel<<<…,stream>>>(buf);       // only then launch the next kernel

[Timeline: GPU, CPU, HCA]

SLIDE 8

KERNEL+SEND

GPUDIRECT ASYNC

a_kernel<<<…,stream>>>(buf);           // launch compute kernel
gds_stream_queue_send(stream,qp,buf);  // queue the send on the stream; GPU triggers the HCA
gds_stream_wait_cq(stream,txcq);       // GPU waits on the TX completion queue
b_kernel<<<…,stream>>>(buf);           // next kernel queued immediately

[Timeline: GPU, CPU, HCA]

  • CPU is free
  • kernel launch latency is hidden
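Putting the two slides together, a hedged pseudocode sketch of a queued kernel+send loop, using the simplified gds_* calls as they appear on this slide (actual libgdsync signatures differ; n_iters, grid, block, buf, qp, and txcq are assumed to be set up elsewhere):

```c
/* Pseudocode sketch, not a runnable program: every operation is queued
 * on the CUDA stream up front, so the GPU's scheduling unit drives the
 * HCA without per-iteration CPU involvement. */
for (int i = 0; i < n_iters; ++i) {
    a_kernel<<<grid, block, 0, stream>>>(buf);  /* compute step       */
    gds_stream_queue_send(stream, qp, buf);     /* GPU triggers send  */
    gds_stream_wait_cq(stream, txcq);           /* GPU waits for CQE  */
}
cudaStreamSynchronize(stream);  /* CPU synchronizes once, at the end */
```

This is also what enables the batching mentioned later: many iterations can be queued in one go, amortizing launch overhead.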

SLIDE 9

RECEIVE+KERNEL

NORMAL FLOW

while (!done) ibv_poll_cq();   // CPU spins until the message arrives
a_kernel<<<…,stream>>>(buf);   // only then launches the kernel
cuStreamSynchronize(stream);

[Timeline: GPU, CPU, HCA; incoming message, GPU kernel execution triggered]

SLIDE 10

RECEIVE+KERNEL

GPUDIRECT ASYNC

gds_stream_wait_cq(stream,rx_cq);  // wait queued on the stream before the message arrives
a_kernel<<<…,stream>>>(buf);       // kernel queued to the GPU right away
cuStreamSynchronize(stream);

[Timeline: GPU, CPU, HCA; kernel queued to GPU, incoming message]

  • kernel launch is moved much earlier: its latency is hidden!
  • CPU is idle: deep sleep state!

GPU kernel execution triggered
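On the receive side the same inversion applies; a hedged pseudocode sketch contrasting the two flows from these slides (identifiers as on the slides; grid, block, buf, rx_cq are assumed to be set up elsewhere):

```c
/* Pseudocode sketch, not a runnable program. */

/* Normal flow: CPU busy-waits, then launches the kernel. */
while (!done) ibv_poll_cq();                /* CPU spins on the RX CQ  */
a_kernel<<<grid, block, 0, stream>>>(buf);  /* launched after arrival  */

/* GPUDirect Async: the wait is queued on the stream *before* the
 * message arrives, so kernel launch overhead overlaps the transfer. */
gds_stream_wait_cq(stream, rx_cq);          /* stream blocks on RX CQE    */
a_kernel<<<grid, block, 0, stream>>>(buf);  /* runs as soon as data lands */
cuStreamSynchronize(stream);                /* CPU may sleep here         */
```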

SLIDE 11

USE CASE SCENARIOS

Performance mode (~ Top500)
  • enable batching to increase performance
  • CPU stays available: additional GFlops

Economy mode (~ Green500)
  • enable GPU IRQ-waiting mode to free more CPU cycles
  • optionally a slimmer CPU
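Economy mode's "GPU IRQ waiting" means the CPU sleeps on an interrupt instead of busy-polling. In plain CUDA the analogous host-side choice is the blocking-sync device flag; a minimal fragment sketch (standard CUDA runtime API; requires a GPU and the usual setup to actually run):

```c
#include <cuda_runtime.h>

/* Ask the runtime to block the host thread on a synchronization
 * primitive (woken by an interrupt) instead of spin-waiting. */
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

/* Later, cudaStreamSynchronize(stream) puts the CPU to sleep rather
 * than burning a core, trading a little latency for power. */
```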

SLIDE 12

PERFORMANCE MODE

[Chart: latency (us) vs. compute buffer size (4096 to 65536 bytes) for RDMA only, +Async TX only, and +Async]

40% faster

[*] modified ud_pingpong test: recv + GPU kernel + send on each side. 2 nodes: Ivy Bridge Xeon + K40 + Connect-IB + MLNX switch, 10000 iterations, message size: 128B, batch size: 20

SLIDE 13

2D STENCIL BENCHMARK

[Chart: average time per iteration vs. number of nodes (2, 4) for RDMA only and +Async]

weak scaling, 256^2 local lattice, 2x1 and 2x2 node grids, 1 GPU per node

27% faster, 23% faster

[*] 4 nodes: Ivy Bridge Xeon + K40 + Connect-IB + MLNX switch

SLIDE 14

ECONOMY MODE

[Chart: CPU utilization, % load of a single CPU core, for RDMA only, RDMA w/IRQ, and +Async]

45% less CPU load

Round-trip latency (us), size=16384:
  • RDMA only: 39.28
  • RDMA w/IRQ: 178.62
  • +Async: 29.46

Round-trip latency 25% faster

[*] modified ud_pingpong test, HW same as in previous slide

SLIDE 15

SUMMARY

Meet Async, the next generation of GPUDirect
  • GPU orchestrates network operations
  • CPU off the critical path
  • 40% faster, 45% less CPU load

Excited about these topics? Collaborations & jobs @NVIDIA

SLIDE 16

NVIDIA REGISTERED DEVELOPER PROGRAMS

Everything you need to develop with NVIDIA products.
Membership is your first step in establishing a working relationship with NVIDIA Engineering.

  • Exclusive access to pre-releases
  • Submit bugs and feature requests
  • Stay informed about latest releases and training opportunities
  • Access to exclusive downloads
  • Exclusive activities and special offers
  • Interact with other developers in the NVIDIA Developer Forums

REGISTER FOR FREE AT: developer.nvidia.com

SLIDE 17

THANK YOU

SLIDE 18

PERFORMANCE VS ECONOMY

[*] modified ud_pingpong test, HW same as in previous slide, NUMA binding to socket0/core0, SBIOS power-saving profile

[Chart: Performance mode vs. Economy mode]