PhD Defense
Optimizing Communication for Clusters of GPUs
Michael LeBeane mlebeane@utexas.edu Advisor: Lizy K. John
▪ GPUs are everywhere in HPC, Big Data, Machine Learning, and beyond
– Excellent performance/watt for many classes of data-parallel computation
▪ Many GPUs are required to solve the biggest computational problems
– Can only fit so many GPUs in a single node!
– GPUs need to talk to each other through Network Interface Controllers (NICs)
– The path between the GPU and the NIC needs to be efficient
▪ Vendors are selling machines filled with many GPUs and NICs:
Michael LeBeane – PhD Defense, 07/16/2018
Nvidia’s DGX-2
16 Tesla V100, 8 Mellanox 100G NICs, 2 Ethernet NICs, 2 Xeon Platinum; 1.6:1 GPU/NIC ratio
AMD’s Project 47 Node
4 Radeon Instinct GPUs, 2 Mellanox 100G NICs, 1 EPYC 7601 32-core CPU; 2:1 GPU/NIC ratio
Problem Statement
▪ Industry has largely focused on an optimized data plane
– The path taken by the application data that needs to be transferred by the network
– Industry technologies such as ROCn RDMA and GPUDirect RDMA allow peer-to-peer data transfers
[Diagram: peer-to-peer data path — the initiator's NIC reads GPU memory directly and the target's NIC writes GPU memory directly, bypassing CPU memory. IOC = IO Controller]
▪ Control plane is unoptimized!
– Focused on a host-centric model where only the CPU can coordinate network transfers
– Very high latencies to perform networking from the GPU
▪ GPU Allreduce Computation
– Many communication/computation phases
– Scaling out increases the number of phases
[Figure: allreduce timeline across nodes/GPUs — alternating communication and compute phases over time, one buffer row per node]
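The scaling cost is easy to quantify: for a ring-style allreduce over N nodes there are (N-1) reduce-scatter phases plus (N-1) all-gather phases. A minimal host-side sketch (the ring variant and all names here are illustrative, not the slide's exact algorithm):

```c
#include <assert.h>

/* Communication phases in a ring allreduce over `nodes` ranks:
   (N-1) reduce-scatter steps + (N-1) all-gather steps. */
int ring_allreduce_phases(int nodes) {
    return 2 * (nodes - 1);
}

/* Reference allreduce: every node ends with the element-wise sum. */
#define NODES 4
#define LEN   3
void allreduce_sum(int buf[NODES][LEN]) {
    for (int i = 0; i < LEN; i++) {
        int sum = 0;
        for (int n = 0; n < NODES; n++) sum += buf[n][i];
        for (int n = 0; n < NODES; n++) buf[n][i] = sum;
    }
}
```

Doubling the node count roughly doubles the phase count, which is why per-phase control-plane overhead matters at scale.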
GPU networking can be improved by both software and hardware enhancements that enable GPUs to more directly interface with the network control plane.
▪ Proposed Solutions
– Extended Task Queuing
– Command Processor Networking
– GPU Triggered Networking
▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion
▪ GPUs consume work through in-memory command queues
– Queue format standardized through the Heterogeneous System Architecture (HSA)
– Any device can produce work for another device
– Assumes a unified virtual address space
▪ Can we extend this across a node?
– The NIC doesn’t know how to talk to HSA queues
– The initiator doesn’t know the virtual addresses of resources at the target
[Diagram: a producer (GPU/CPU) writes a Command Packet into a Command Queue in shared virtual memory; a consumer GPU dequeues and executes it]
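A sketch of the queue mechanics (the field names and layout below are illustrative stand-ins for the HSA AQL packet format, not the actual specification):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_SLOTS 16

/* Simplified stand-in for an HSA AQL dispatch packet. */
typedef struct {
    uint16_t header;        /* packet type + memory fences */
    uint64_t kernel_object; /* code handle in the shared virtual address space */
    uint64_t kernarg_addr;  /* kernel argument buffer */
    uint64_t completion;    /* completion signal handle */
} aql_packet_t;

typedef struct {
    aql_packet_t ring[QUEUE_SLOTS];
    uint64_t write_idx;     /* bumped by any producer (CPU or GPU) */
    uint64_t doorbell;      /* written last to notify the consumer */
} user_queue_t;

/* Producer side: any agent can enqueue work for the consuming GPU. */
void enqueue(user_queue_t *q, const aql_packet_t *pkt) {
    uint64_t slot = q->write_idx % QUEUE_SLOTS;
    memcpy(&q->ring[slot], pkt, sizeof(*pkt));
    q->write_idx++;
    q->doorbell = q->write_idx; /* doorbell write makes the packet visible */
}
```

Because any agent that can write this memory can produce work, the question on the slide follows naturally: can a NIC be taught to do the same across nodes?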
Contribution 1: Extended Task Queuing (XTQ)
[Diagram: initiator and target nodes, each with CPU, caches, memory, and GPU; an XTQ NIC on each side connects them across the network]
▪ XTQ allows direct access to remote GPU queues
– Teach NICs how to speak with HSA queues
– Improves latency and frees CPU service thread(s)
M. LeBeane et al., "Extended Task Queuing: Active Messages for Heterogeneous Systems," in Proc. of the Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2016.
[Diagram: XTQ put flow — the payload arrives at the target XTQ NIC, which performs a command queue lookup, writes the command packet and payload into virtual memory, rings the GPU’s doorbell, and sets a completion signal, all without involving the target CPU]
▪ How does the initiator know about remote VAs at the target?
– Use coordinated indices specified by the initiator
– Lookup tables are populated by the target-side XTQ Library
[Diagram: example queue lookup — the RDMA header of an incoming command packet carries a target PID (e.g., 0xF123) and a queue index; the NIC uses a base address register to index a per-PID Queue Lookup Table in unified virtual memory and resolve the queue’s virtual address]
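The indirection can be sketched as a tiny registration/lookup pair (the function names and table layout are hypothetical, not the actual XTQ library API):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_QUEUES 8

/* Target-side table populated by the (hypothetical) XTQ library.
   The initiator only ever names queues by small indices, so it never
   needs to know the target's virtual addresses. */
typedef struct {
    uint64_t queue_va[MAX_QUEUES]; /* virtual address of each registered queue */
    int      used;
} xtq_queue_table_t;

/* Registration returns the index the initiator will use later. */
int xtq_register_queue(xtq_queue_table_t *t, uint64_t va) {
    t->queue_va[t->used] = va;
    return t->used++;
}

/* NIC-side resolution on packet arrival: index -> target virtual address. */
uint64_t xtq_lookup(const xtq_queue_table_t *t, int index) {
    return t->queue_va[index];
}
```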
▪ XTQ Put is implemented as a simple extension to the standard RDMA Put operation
– Compatible with many low-level RDMA transports (e.g., InfiniBand, RoCE, Portals 4, iWARP)
▪ The XTQ Registration API is used to provide index-to-address translations
Regular RDMA Put command fields: Target NID/PID, Send Buffer Ptr., Send Buffer Length, Target Buffer Index, transport-specific metadata
Additional XTQ fields: Remote Queue Index, Remote Function/Kernel Index, GPU command packet, Kernel/Function launch parameters
XTQ Registration API:
– Register Queue: Queue Desc. VA
– Register Function: Function Ptr. VA, Target Side Buffer VA
– Register Kernel: Kernel Ptr. VA, Target Side Buffer VA, Kernel Argument Size, Completion Signal VA
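The relationship between the two descriptors can be sketched as plain structs (field widths and names are illustrative, following the slide’s field list rather than any real transport header):

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a regular RDMA put (transport details elided). */
typedef struct {
    uint32_t target_nid, target_pid;
    uint64_t send_buffer;
    uint64_t send_length;
    uint32_t target_buffer_index;
} rdma_put_t;

/* XTQ-enhanced put = regular put + the fields the NIC needs to
   rewrite and enqueue a GPU command packet at the target. */
typedef struct {
    rdma_put_t base;
    uint32_t remote_queue_index;   /* which registered HSA queue */
    uint32_t remote_kernel_index;  /* which registered kernel/function */
    uint8_t  command_packet[64];   /* template command packet carried inline */
} xtq_put_t;
```

The point of the layout is backward compatibility: a NIC that ignores the trailing XTQ fields still sees a well-formed put.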
▪ CPU: Standard CPU-only systems
– Baseline non-accelerated system
▪ HSA: Currently available GPU systems
– Involves CPU runtime
▪ XTQ: Extended Task Queuing
– Enables efficient active messaging style communication that bypasses the CPU on the target
CPU and Memory Configuration:
– Type: 4-wide OOO, x86, 8 cores @ 4GHz
– I,D-Cache: 64KB, 2-way, 2 cycles
– L2-Cache: 2MB, 8-way, 8 cycles
– L3-Cache: 16MB, 16-way, 20 cycles
– DRAM: DDR3, 8 channels, 800MHz
GPU Configuration:
– Type: AMD GCN3 @ 1GHz
– CU Config: 24 CUs with 4 SIMD-16 engines
– Wavefronts: 40 waves per SIMD (64 lanes)
– V-Cache: 32KB, 16-way, 12 cycles, per CU
– K-Cache: 32KB, 8-way, 12 cycles, per 4 CUs
– I-Cache: 64KB, 8-way, 12 cycles, per 4 CUs
– L2-Cache: 1MB, 16-way, 8 banks, 100 cycles
NIC Configuration:
– Link: 100ns / 100Gbps
– Topology: Star
[Diagram: simulated cluster — several nodes, each with CPU, caches, memory, GPU, and NIC, connected in a star topology]
▪ MPI Accumulate and MPI Allreduce microbenchmarks
[Figure: Accumulate speedup vs. data size (4-byte integers, 1 to 1M elements; bigger is better) and Allreduce runtime vs. node count (8-64 nodes; smaller is better) for CPU, HSA, and XTQ]
▪ Latency Decomposition
[Figure: end-to-end put latency broken into CPU PtlPut, NIC initiator put, network, NIC target put, GPU launch, GPU kernel execution, and CPU completion for 64B and 4KB messages; XTQ improves latency by 15-19% over HSA; smaller is better]
Workload Name | Domain | %Blocked | Reductions
AlexNet | Classification | 14% | 4,672
AN4 LSTM | Speech | 50% | 131,192
CIFAR | Classification | 4% | 939,820
Large Synth | Synthetic | 28% | 52,800
MNIST Conv | Text Recognition | 12% | 900,000
MNIST Hidden | Text Recognition | 29% | 900,000
[Figure: projected speedup (0.8-1.5x; bigger is better) for CPU, HSA, and XTQ across the workloads above]
▪ XTQ provides optimized remote kernel invocation
– But still at kernel boundaries
– Kernel launches are expensive! Best case ~3µs
▪ Can we do better?
– Networking from within a kernel?
– What have other researchers tried?
[Figure: kernel launch latency (4-20µs) vs. number of kernel commands queued (1-256) for three real GPUs; smaller is better]
Contribution 2: Command Processor Networking (ComP-Net)
▪ The GPU can send messages inside a kernel
▪ A CPU thread is responsible for taking packets from the GPU and poking the NIC
▪ This style of intra-kernel networking is referred to as GPU Host Networking
[Diagram: timelines — Host Driven Networking (CPU launches the kernel, waits, then issues the put) vs. GPU Host Networking (GPU posts a send mid-kernel; a CPU thread forwards the put to the NIC)]
(Left: Host Driven Networking, e.g., MPI + CUDA; right: GPU Host Networking)
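The CPU-proxy model can be sketched as a shared mailbox (all names here are illustrative; a real runtime would use PCIe-visible memory and NIC doorbells rather than plain variables):

```c
#include <assert.h>
#include <stdint.h>

/* GPU Host Networking model: the GPU deposits send requests into a
   shared mailbox; a CPU service thread drains it and pokes the NIC. */
typedef struct { uint64_t dst; uint64_t len; int valid; } send_req_t;

static int nic_puts_issued; /* stands in for real NIC doorbell writes */

/* "GPU" side: publish a request from inside a running kernel. */
void gpu_post_send(send_req_t *mbox, uint64_t dst, uint64_t len) {
    mbox->dst = dst;
    mbox->len = len;
    mbox->valid = 1;
}

/* CPU service thread: poll the mailbox, issue the put, ack the GPU. */
int cpu_service(send_req_t *mbox) {
    if (!mbox->valid) return 0;
    nic_puts_issued++;   /* issue the RDMA put on the GPU's behalf */
    mbox->valid = 0;     /* completion becomes visible to the waiting kernel */
    return 1;
}
```

The weakness the next slide quantifies is built into this design: every message crosses the IO bus twice and burns a host thread on polling.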
▪ Needs multiple trips over the IO bus
▪ Where to place queues?
– GPU memory vs. host memory – high latency in both cases
▪ Not scalable
– 4096 work-groups fill the GPU
– Still 40µs latency with 8 service threads
[Figure: network service time (20-100µs) vs. active work-groups — left: host queues vs. GPU queues, both far above raw network latency; right: scaling from 1 to 8 CPU service threads still leaves ~40µs at 4096 work-groups; smaller is better]
▪ GPUs have built-in CPUs called Command Processors (CPs)
– Scalar cores are good at running network runtime code
– Connected to the GPU CUs through a shared LLC
▪ Traditionally used to launch kernels
– But intra-kernel networking encourages fewer kernels…
[Diagram: GPU block diagram — Command Processor (CPU core + L1 cache) and Compute Unit (4x SIMD, Local Data Share, L1 cache) sharing the GPU L2 cache and GPU memory]
▪ Uses the built-in CP to support network operations
▪ CP/GPU communicate over the shared L2 cache instead of PCIe
▪ Potentially much lower latency than other GHN designs
▪ Scales naturally
– Every GPU has multiple CP threads
M. LeBeane et al., "ComP-Net: Command Processor Networking for Efficient Intra-kernel Communications on GPUs," in Proc. of the Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT), 2018.
[Diagram: GPU Host Networking routes host queues and network queues through host memory over PCIe; ComP-Net keeps both queue types in GPU memory behind the GPU L2 cache, shared by the CUs and CPs, with the NIC attached over PCIe]
▪ The main component of the ComP-Net runtime is the CP/GPU producer/consumer queue
▪ Most steps are straightforward
[Diagram: ComP-Net producer/consumer queue — Queue Entries with per-entry Status flags and a global Read Idx live at the cache/memory/GPU coherence point; the producing work-group keeps its Write Idx and a local (cached) Read Idx in LDS/non-coherent cache; the CP thread keeps its Base Ptr and local Read Idx in registers/non-coherent cache]
GPU work-group (producer) steps:
▪ 1a) Check if the queue is full (using the local Read Idx)
▪ 1b) If full, update the local Read Idx from the global one and loop until not full
▪ 2) Fill the Queue Entry with networking metadata
– Or inline small payloads in the Queue Entry itself
▪ 3) Set status flag with release marker to notify CP
▪ 4) Increment local Write Idx
▪ 5) Check status bit to determine when CP completes operation
CP thread (consumer) steps:
▪ 1) Poll on the next Queue Entry (based on the local Read Idx) with an acquire marker
▪ 2) Read data from Queue Entry
▪ 3) Perform the network operation, then set the Status flag to 0 with a release marker
▪ 4a) Update the global Read Idx
▪ 4b) Update the local Read Idx with a release marker
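The full protocol can be sketched as a single-producer/single-consumer queue in plain C (the real ComP-Net runtime uses GPU release/acquire markers across the L2 coherence point; the names and return-code convention here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES 4

typedef struct { uint64_t payload; volatile int status; } entry_t;

typedef struct {
    entry_t  ring[ENTRIES];
    volatile uint64_t read_idx;   /* global, advanced by the CP thread */
    uint64_t write_idx;           /* private to the producing work-group */
    uint64_t cached_read_idx;     /* producer's possibly stale copy */
} cpnet_queue_t;

/* Work-group side (steps 1-4): returns 0 if the queue is full. */
int wg_enqueue(cpnet_queue_t *q, uint64_t payload) {
    if (q->write_idx - q->cached_read_idx == ENTRIES) { /* 1a: full? */
        q->cached_read_idx = q->read_idx;               /* 1b: refresh */
        if (q->write_idx - q->cached_read_idx == ENTRIES) return 0;
    }
    entry_t *e = &q->ring[q->write_idx % ENTRIES];
    e->payload = payload;   /* 2: fill entry with network metadata  */
    e->status = 1;          /* 3: release-store notifies the CP     */
    q->write_idx++;         /* 4: advance the local write index     */
    return 1;
}

/* CP thread side (steps 1-4): returns 0 if nothing is pending. */
int cp_dequeue(cpnet_queue_t *q, uint64_t *payload) {
    entry_t *e = &q->ring[q->read_idx % ENTRIES];
    if (e->status != 1) return 0;  /* 1: acquire-poll on the status flag */
    *payload = e->payload;         /* 2: read the entry                  */
    e->status = 0;                 /* 3: network op done, release-store  */
    q->read_idx++;                 /* 4a/4b: advance the read index      */
    return 1;
}
```

The cached read index is the key trick: the producer touches the globally coherent `read_idx` only when it believes the queue is full, keeping the common-case enqueue entirely within lines the work-group already owns.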
▪ Residency of data in the GPU L2 is very short
▪ Work-group data produced for the CP is evicted when other work-groups perform streaming memory accesses
▪ Can be solved through cache line locking
– Preliminary results are promising
– Still much to explore here
[Figure: L2 hit rate (0.1-1.0) for CP networking wavefronts vs. streaming wavefronts — LLC locking keeps the hit rate well above the baseline; bigger is better]
▪ CPU: Standard CPU-only systems
– Baseline non-accelerated system
▪ HDN: Host Driven Networking
– Kernel boundary networking (host MPI + CUDA)
Intra-kernel Networking Schemes:
▪ APU: CPU/GPU on the same die
– Intra-kernel networking through host threads on an APU
▪ dGPU: GPU Host Networking
– Intra-kernel networking through host threads on a dGPU
▪ ComP-Net: Command Processor Networking
– Intra-kernel networking through command processor
CPU and Memory Configuration:
– Type: 8-wide OOO, x86, 8 cores @ 4GHz
– I,D-Cache: 64KB, 2-way, 2 cycles
– L2-Cache: 2MB, 8-way, 8 cycles
– L3-Cache: 16MB, 16-way, 20 cycles
– DRAM: DDR4, 8 channels, 2133MHz
GPU Configuration:
– Type: AMD GCN3 @ 1.5GHz
– CU Config: 12 CUs with 4 SIMD-16 engines
– Wavefronts: 40 waves per SIMD (64 lanes)
– V-Cache: 32KB, 16-way, 12 cycles, per CU
– K-Cache: 32KB, 8-way, 12 cycles, per 4 CUs
– I-Cache: 64KB, 8-way, 12 cycles, per 4 CUs
– L2-Cache: 1MB, 16-way, 8 banks, 100 cycles
CP Configuration:
– Type: 2-wide OOO, x86, 2 cores @ 2GHz
– D-Cache: 32KB, 8-way, 4 cycles
– I-Cache: 16KB, 8-way, 4 cycles
▪ 2D Jacobi Stencil
– 1D data decomposition
– Iterative compute and halo exchange
– Three regions of interest
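A minimal single-process sketch of the 1-D decomposition and halo exchange (array shapes and names are illustrative; a real run replaces the memcpy with network puts into the neighbor's ghost rows):

```c
#include <assert.h>
#include <string.h>

#define N 8  /* per-node interior rows in the 1-D decomposition */

/* One Jacobi iteration on a node's slab, using halo rows received
   from the neighbors above and below (edge columns skipped). */
void jacobi_step(const double in[N + 2][N], double out[N + 2][N]) {
    for (int i = 1; i <= N; i++)
        for (int j = 1; j < N - 1; j++)
            out[i][j] = 0.25 * (in[i - 1][j] + in[i + 1][j] +
                                in[i][j - 1] + in[i][j + 1]);
}

/* Halo exchange between two vertically adjacent slabs: each sends its
   outermost interior row into the other's ghost row. */
void halo_exchange(double top[N + 2][N], double bottom[N + 2][N]) {
    memcpy(top[N + 1], bottom[1], sizeof(double) * N); /* bottom -> top */
    memcpy(bottom[0],  top[N],    sizeof(double) * N); /* top -> bottom */
}
```

With intra-kernel networking, the halo rows can be sent as soon as they are computed, before the interior finishes, instead of waiting for a kernel boundary.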
[Figure: relative speedup vs. the dGPU baseline (0.8-1.3x) across per-node problem sizes (16-1024 N x N grid) for ComP-Net, dGPU, APU, HDN, and CPU; bigger is better. Diagram: two nodes exchange halo rows of the 1-D-decomposed grid]
▪ Strong-scaling reduction:
– APU performs better than ComP-Net
– ComP-Net is much more energy efficient
[Figure: relative speedup (0.6-1.4x; bigger is better) and energy consumption (0.2-1.2x; smaller is better) vs. number of nodes in the reduction (4-36) for ComP-Net, dGPU, APU, HDN, and CPU; a vector-sum illustration shows partial sums combining]
▪ Machine Learning workloads (same set as in the XTQ evaluation)
[Figure: projected speedup (0.8-1.15x; bigger is better) for CPU, HDN, dGPU, APU, and ComP-Net across AlexNet, AN4 LSTM, CIFAR, MNIST Conv, MNIST Hidden, and the average]
▪ The CPU creates the network operation off the critical path
– Registers it with the NIC
▪ The GPU simply ‘triggers’ the operation when the data is ready
▪ Provides intra-kernel GPU networking without requiring a CPU thread
[Diagram: timelines — Host-Driven Networking (CPU waits for the kernel, then puts), GPU Host Networking (a CPU thread forwards the GPU’s mid-kernel send), and GPU Triggered Networking (the NIC fires the pre-registered put as soon as the GPU triggers it)]
M. LeBeane et al., "GPU Triggered Networking for Intra-Kernel Communications," in Proc. of the Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2017.
Contribution 3: GPU Triggered Networking (GPU-TN)
▪ CPU creates a Trigger Entry
– A Trigger Entry consists of a network operation, counter, tag, and threshold
– Appends the entry to the Trigger List
▪ GPU fills the send buffer
– During kernel execution
[Diagram: the CPU appends Trigger Entries to the NIC’s Trigger List (1) while the GPU fills the send buffer (2)]
▪ GPU initiates Put operation
– GPU Provides Tag
▪ NIC sends message
– The message is triggered when the counter >= the CPU-provided threshold
▪ HW complexity?
– ‘Trigger list’ might not be a list
▪ CPU/GPU race conditions?
– Allocate null entry for unexpected triggers
[Diagram: Trigger Entry logic — a GPU tag write is compared against each entry’s tag; on a match the counter increments, and when counter >= threshold the NIC begins the pre-registered network operation]
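The NIC-side matching logic can be sketched in a few lines (names and the `fired` flag are illustrative; a real NIC would launch the pre-built put rather than set a flag):

```c
#include <assert.h>
#include <stdint.h>

/* NIC-resident trigger entry created by the CPU ahead of time.
   The GPU only ever writes a tag; the NIC fires the pre-built
   network operation once enough matching tags have arrived. */
typedef struct {
    uint32_t tag;        /* tag the GPU writes to trigger */
    uint32_t threshold;  /* fire when counter >= threshold */
    uint32_t counter;
    int      fired;      /* stands in for "begin network operation" */
} trigger_entry_t;

/* Models the GPU-side store of a tag into the NIC's trigger list. */
void gpu_write_tag(trigger_entry_t *list, int n, uint32_t tag) {
    for (int i = 0; i < n; i++) {
        if (list[i].tag == tag) {                 /* tag match */
            if (++list[i].counter >= list[i].threshold)
                list[i].fired = 1;                /* launch the put */
            return;
        }
    }
    /* Unmatched tags fall into a null entry in the real design. */
}
```

The threshold lets many work-items (or work-groups) each write the same tag, with the NIC firing only when all contributors have arrived.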
▪ CPU: Standard CPU-only systems
– Baseline non-accelerated system
▪ HDN: Host Driven Networking
– No driver interactions on the critical path, but may involve CPU runtime
▪ GDS-Sim: GPUDirect Async
– Preregistration of communication but at kernel boundaries
▪ GHN: GPU Host Networking
– Intra-kernel networking through host threads
▪ GPU-TN: GPU Triggered Networking
– Preregistration of network operations and intra-kernel networking
CPU and Memory Configuration:
– Type: 8-wide OOO, x86, 8 cores @ 4GHz
– I,D-Cache: 64KB, 2-way, 2 cycles
– L2-Cache: 2MB, 8-way, 8 cycles
– L3-Cache: 16MB, 16-way, 20 cycles
– DRAM: DDR4, 8 channels, 2133MHz
GPU Configuration:
– Type: AMD GCN3 @ 1.5GHz
– CU Config: 24 CUs with 4 SIMD-16 engines
– Wavefronts: 40 waves per SIMD (64 lanes)
– V-Cache: 32KB, 16-way, 12 cycles, per CU
– K-Cache: 32KB, 8-way, 12 cycles, per 4 CUs
– I-Cache: 64KB, 8-way, 12 cycles, per 4 CUs
– L2-Cache: 1MB, 16-way, 8 banks, 100 cycles
NIC Configuration:
– Link: 100ns / 100Gbps
– Topology: Star
▪ 2D Jacobi Stencil
[Figure: speedup vs. HDN (1.0-1.2x) across local 2D grid sizes (16-1024 N x N) for CPU, GDS-Sim, GHN, and GPU-TN; bigger is better]
▪ 64MB Reduction (strong scaling)
[Figure: speedup (0.8-1.6x) vs. node count (2-32) for HDN, GDS-Sim, GHN, and GPU-TN; bigger is better]
▪ Machine Learning Training Phase
[Figure: projected speedup (0.8-1.5x; bigger is better) for CPU, HDN, GDS-Sim, GHN, and GPU-TN across AlexNet, AN4 LSTM, CIFAR, Large Synth, MNIST Conv, and MNIST Hidden]
▪ Presented 3 enhancements to improve GPU networking
– Extended Task Queuing
– Command Processor Networking
– GPU Triggered Networking
Conclusion
▪ XTQ allows direct access to remote GPU queues
– Teach NICs how to speak with HSA queues
– Improves latency and frees CPU service thread(s)
▪ Improves application performance by ~15%
▪ Uses the built-in CP to support network operations
▪ CP/GPU communicate over the shared L2 cache instead of PCIe
▪ Potentially much lower latency than other GHN designs
▪ Scales naturally
– Every GPU has multiple CP threads
▪ Improves application performance ~20% vs. other GHN approaches
▪ The CPU creates the network operation off the critical path
– Registers it with the NIC
▪ The GPU simply ‘triggers’ the operation when the data is ready
▪ Provides intra-kernel GPU networking without requiring a CPU thread
▪ Improves application performance ~20% vs. GPUDirect Async
▪ This dissertation motivates the need for more independent accelerators
– Cannot funnel everything through a central CPU!
– Concepts are applicable to many types of accelerators and networks
▪ Still much to do!
– Application Redesign Opportunities
– Leveraging Emerging NIC Technologies for GPUs
Backup Slides
▪ CPU controls networking through the driver/runtime
▪ Messages sent at kernel boundaries
▪ Research implementations include:
– CUDA-Aware MPI [Kraus ‘14] – CUDA-Aware OpenSHMEM [Hamidouche ’16]
[Diagram: host-driven networking timeline — the CPU launches a kernel, waits for completion, then issues the put]
Mellanox, "Mellanox GPUDirect RDMA User Manual," http://www.mellanox.com/related-docs/prod_software/Mellanox GPUDirect User Manual v1.2.pdf, 2015.
▪ GPU runs the networking stack
▪ Persistent kernels and LDS memory used for network data structures
▪ Research implementations include:
– GPUrdma [Daoud ’16] – IBV on GPUs [Oden ‘14]
[Diagram: GPU native networking timeline — a persistent kernel issues sends directly to the NIC, vs. host-driven networking where the CPU launches a kernel, waits, then puts]
F. Daoud, A. Watad, and M. Silberstein, "GPUrdma: GPU-side Library for High Performance Networking from GPU Kernels," in Proc. of the Intl. Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2016.
AMD <=> Nvidia Translator
▪ Work-item = Thread
▪ Wavefront (64 threads) = Warp (32 threads)
– Unit of thread dispatch
▪ Work-group = Thread Block
– Unit of Synchronization
▪ Local Data Share (LDS) = Shared Memory
– Work-group scratchpad
▪ Compute Unit (CU) = Streaming Multi-Processor (SM)
– Collection of SIMD engines sharing LDS and L1 cache
▪ Kernel
– GPU SIMT Function
▪ Command Processor (CP)
– Dispatch engine and scheduler
[Diagram: GPU block diagram — Command Processor (CPU core + L1 cache) and Compute Unit (4x SIMD, Local Data Share, L1 cache) sharing the L2 cache and GPU memory]
Work-item Level:
__kernel void kern1(__global char *trigAddr, const int tagBase,
                    __global void *buffer) {
    // do work
    buffer = ...;
    int id = get_global_id();
    *trigAddr = tagBase + id;
    // do additional work ...
}

Work-group Level:
__kernel void kern2(__global char *trigAddr, const int tagBase,
                    __global void *buffer) {
    // do work
    buffer = ...;
    wg_barrier();
    if (!get_local_id()) {
        int id = get_group_id();
        *trigAddr = tagBase + id;
    }
    // do additional work ...
}

Kernel Level:
__kernel void kern3(__global char *trigAddr, const int tag,
                    __global void *buffer) {
    // do work
    buffer = ...;
    wg_barrier();
    if (!get_local_id())
        *trigAddr = tag;
    // do additional work ...
}
▪ gem5 + AMD GCN3 GPU model + custom Portals 4 NIC model
– CPU power model with McPAT
– Baseline model is a coherent APU
▪ Each section has slightly different parameters
– Discussed before the results are presented
[Diagram: simulated node — multi-core CPU (per-core L1I/L1D and L2, shared L3), GPU (GPU cores with L1D, shared Sequencer Caches (SQC) and GPU L2, plus a CP core), and NIC (NIC processors, DMA engines, L1I/L1D), all connected to the interface network behind a directory and memory controllers]
▪ RDMA allows for direct access of remote memory without involving the CPU
– Heavy lifting is performed on the NIC (off-load networking model)
– Generally expressed in terms of remote Put/Get operations
▪ Maps naturally to “one-sided” communication semantics
– Put/Get vs. Send/Receive
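Modeling remote memory as a byte array makes the one-sided contract concrete: only the initiator names the address, and no receive call runs at the target (a sketch; real RDMA goes through registered memory regions and NIC DMA engines):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One-sided put: the initiator writes directly into target memory.
   The memcpy stands in for the NIC's DMA into the target. */
void rdma_put(uint8_t *remote_mem, uint64_t offset,
              const uint8_t *src, uint64_t len) {
    memcpy(remote_mem + offset, src, len);
}

/* One-sided get: the initiator reads directly out of target memory. */
void rdma_get(uint8_t *dst, const uint8_t *remote_mem,
              uint64_t offset, uint64_t len) {
    memcpy(dst, remote_mem + offset, len);
}
```

Contrast with Send/Receive, where the target must post a matching receive before the data can land.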
[Diagram: one-sided RDMA — the initiator’s NIC moves data directly between initiator and target memory through each node’s IO controller, without target CPU involvement]
Host Code:
__host__ void hostInit() {
    // Initialize ComP-Net
    cpnet_handle_t *cpnet_handle;
    cpnet_init(&cpnet_handle, GRID_SZ / WG_SZ);

    // Allocate symmetric heap memory
    char *buf = cpnet_shmalloc(sizeof(char) * GRID_SZ / WG_SZ);

    // Initiator/target launches kernel
    if (cpnet_handle->pe == INITIATOR) {
        hipLaunchKernel(Ping, GRID_SZ, GRID_SZ / WG_SZ, 0, 0,
                        cpnet_handle, buf);
    } else {
        /* Launch target kernel. */
    }
}

GPU Code:
__device__ void Ping(cpnet_handle_t *cpnet_handle, char *wg_buffer) {
    // Extract context from global handle
    __shared__ cpnet_ctx_t cpnet_ctx;
    cpnet_ctx_create(cpnet_handle, cpnet_ctx);

    // Each WG pings target
    cpnet_shmem_char_p(cpnet_ctx, wg_buffer[hipBlockIdx_x], 1, TARGET);

    // Each WG waits for pong from target
    cpnet_shmem_char_wait_until(wg_buffer[hipBlockIdx_x], 1);

    cpnet_ctx_destroy(cpnet_ctx);
}
▪ One-sided put latency benchmark
– Initiator launches a dummy kernel, executes the network command, and terminates
– Target polls on the put location
▪ Take-away messages
– HDN < GDS-Sim < GPU-TN
– GPU-TN actually overlaps kernel teardown with the network transfer!
[Figure: put latency decomposition (kernel launch, kernel execution, kernel teardown, put, wait) for initiator/target pairs — roughly 4.2µs for HDN, 3.8µs for GDS-Sim, and 2.7µs for GPU-TN; smaller is better]
[Figure: remote Get time observed from the GPU for ComP-Net, dGPU, and APU — left: sweep of payload size (1B-32KB) for 1 WG and 1 thread; middle/right: sweep of network service threads (2-10) for 1-byte transfers and 480 WGs, plus energy consumed by network threads relative to dGPU]
▪ Friendlier programming abstractions
– Nicer abstractions in CUDA and OpenCL
– Single-source, kernel-less programming support
▪ Architectural Support
– User-level kernel-launch – Shared virtual address space – Virtualization – Multiprocessing – (Sometimes) Coherent caches
[Diagram: architected queuing and shared virtual memory — a producer CPU enqueues a Command Packet for a consumer GPU through a Command Queue in virtual memory; the MMU, IOMMU, and an OS driver map both tightly coupled devices onto shared physical memory]
What about networking support?