PhD Defense
Optimizing Communication for Clusters of GPUs
Michael LeBeane mlebeane@utexas.edu Advisor: Lizy K. John
▪ GPUs are everywhere in HPC, Big Data, Machine Learning, and beyond
– Excellent performance/watt for many classes of data-parallel computation
▪ Many GPUs are required to solve the biggest computational problems
– Can only fit so many GPUs in a single node!
– GPUs need to talk to each other through Network Interface Controllers (NICs)
– The path between the GPU and the NIC needs to be efficient
▪ Vendors are selling machines filled with many GPUs and NICs:
Michael LeBeane – PhD Defense, 07/16/2018
Nvidia’s DGX-2
16 Tesla V100, 8 Mellanox 100G NICs, 2 Ethernet NICs, 2 Xeon Platinum; 1.6:1 GPU/NIC ratio
AMD’s Project 47 Node
4 Radeon Instinct GPUs, 2 Mellanox 100G NICs, 1 EPYC 7601 32-core CPU; 2:1 GPU/NIC ratio
Problem Statement
▪ Industry has largely focused on an optimized data plane
– The path taken by the application data that needs to be transferred by the network
– Industry technologies such as ROCn RDMA and GPUDirect RDMA allow peer-to-peer data transfers
[Diagram: peer-to-peer data path — the initiator's NIC reads GPU memory directly and the target's NIC writes GPU memory directly, bypassing CPU memory. IOC = IO Controller]
▪ Control plane is unoptimized!
– Focused on a host-centric model where only the CPU can coordinate network transfers
– Very high latencies to perform networking from the GPU
▪ GPU Allreduce Computation
– Many communication/computation phases
– Scaling out increases the number of phases
[Figure: allreduce timeline across nodes/GPUs — alternating communication and compute phases over time, one buffer row per node]
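The scaling cost is easy to quantify: for a ring-style allreduce over N nodes there are (N-1) reduce-scatter phases plus (N-1) all-gather phases. A minimal host-side sketch (the ring variant and all names here are illustrative, not the slide's exact algorithm):

```c
#include <assert.h>

/* Communication phases in a ring allreduce over `nodes` ranks:
   (N-1) reduce-scatter steps + (N-1) all-gather steps. */
int ring_allreduce_phases(int nodes) {
    return 2 * (nodes - 1);
}

/* Reference allreduce: every node ends with the element-wise sum. */
#define NODES 4
#define LEN   3
void allreduce_sum(int buf[NODES][LEN]) {
    for (int i = 0; i < LEN; i++) {
        int sum = 0;
        for (int n = 0; n < NODES; n++) sum += buf[n][i];
        for (int n = 0; n < NODES; n++) buf[n][i] = sum;
    }
}
```

Doubling the node count roughly doubles the phase count, which is why per-phase control-plane overhead matters at scale.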
GPU networking can be improved by both software and hardware enhancements that enable GPUs to more directly interface with the network control plane.
▪ Proposed Solutions
– Extended Task Queuing
– Command Processor Networking
– GPU Triggered Networking
▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion
▪ GPUs consume work through in-memory command queues
– Queue format standardized through the Heterogeneous System Architecture (HSA)
– Any device can produce work for another device
– Assumes a unified virtual address space
▪ Can we extend this across a node?
– The NIC doesn’t know how to talk to HSA queues
– The initiator doesn’t know the virtual addresses of resources at the target
[Diagram: a producer (GPU/CPU) writes a Command Packet into a Command Queue in shared virtual memory; a consumer GPU dequeues and executes it]
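A sketch of the queue mechanics (the field names and layout below are illustrative stand-ins for the HSA AQL packet format, not the actual specification):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_SLOTS 16

/* Simplified stand-in for an HSA AQL dispatch packet. */
typedef struct {
    uint16_t header;        /* packet type + memory fences */
    uint64_t kernel_object; /* code handle in the shared virtual address space */
    uint64_t kernarg_addr;  /* kernel argument buffer */
    uint64_t completion;    /* completion signal handle */
} aql_packet_t;

typedef struct {
    aql_packet_t ring[QUEUE_SLOTS];
    uint64_t write_idx;     /* bumped by any producer (CPU or GPU) */
    uint64_t doorbell;      /* written last to notify the consumer */
} user_queue_t;

/* Producer side: any agent can enqueue work for the consuming GPU. */
void enqueue(user_queue_t *q, const aql_packet_t *pkt) {
    uint64_t slot = q->write_idx % QUEUE_SLOTS;
    memcpy(&q->ring[slot], pkt, sizeof(*pkt));
    q->write_idx++;
    q->doorbell = q->write_idx; /* doorbell write makes the packet visible */
}
```

Because any agent that can write this memory can produce work, the question on the slide follows naturally: can a NIC be taught to do the same across nodes?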
Contribution 1: Extended Task Queuing (XTQ)
[Diagram: initiator and target nodes, each with CPU, caches, memory, and GPU; an XTQ NIC on each side connects them across the network]
▪ XTQ allows direct access to remote GPU queues
– Teach NICs how to speak with HSA queues
– Improves latency and frees CPU service thread(s)
M. LeBeane et al., "Extended Task Queuing: Active Messages for Heterogeneous Systems," in Proc. of the Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2016.
[Diagram: XTQ put flow — the payload arrives at the target XTQ NIC, which performs a command queue lookup, writes the command packet and payload into virtual memory, rings the GPU’s doorbell, and sets a completion signal, all without involving the target CPU]
▪ How does the initiator know about remote VAs at the target?
– Use coordinated indices specified by the initiator
– Lookup tables are populated by the target-side XTQ Library
[Diagram: example queue lookup — the RDMA header of an incoming command packet carries a target PID (e.g., 0xF123) and a queue index; the NIC uses a base address register to index a per-PID Queue Lookup Table in unified virtual memory and resolve the queue’s virtual address]
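The indirection can be sketched as a tiny registration/lookup pair (the function names and table layout are hypothetical, not the actual XTQ library API):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_QUEUES 8

/* Target-side table populated by the (hypothetical) XTQ library.
   The initiator only ever names queues by small indices, so it never
   needs to know the target's virtual addresses. */
typedef struct {
    uint64_t queue_va[MAX_QUEUES]; /* virtual address of each registered queue */
    int      used;
} xtq_queue_table_t;

/* Registration returns the index the initiator will use later. */
int xtq_register_queue(xtq_queue_table_t *t, uint64_t va) {
    t->queue_va[t->used] = va;
    return t->used++;
}

/* NIC-side resolution on packet arrival: index -> target virtual address. */
uint64_t xtq_lookup(const xtq_queue_table_t *t, int index) {
    return t->queue_va[index];
}
```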
▪ XTQ Put is implemented as a simple extension to the standard RDMA Put operation
– Compatible with many low-level RDMA transports (e.g., InfiniBand, RoCE, Portals 4, iWARP)
▪ The XTQ Registration API is used to provide index-to-address translations
Regular RDMA Put command fields: Target NID/PID, Send Buffer Ptr., Send Buffer Length, Target Buffer Index, transport-specific metadata
Additional XTQ fields: Remote Queue Index, Remote Function/Kernel Index, GPU command packet, Kernel/Function launch parameters
XTQ Registration API:
– Register Queue: Queue Desc. VA
– Register Function: Function Ptr. VA, Target Side Buffer VA
– Register Kernel: Kernel Ptr. VA, Target Side Buffer VA, Kernel Argument Size, Completion Signal VA
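The relationship between the two descriptors can be sketched as plain structs (field widths and names are illustrative, following the slide’s field list rather than any real transport header):

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a regular RDMA put (transport details elided). */
typedef struct {
    uint32_t target_nid, target_pid;
    uint64_t send_buffer;
    uint64_t send_length;
    uint32_t target_buffer_index;
} rdma_put_t;

/* XTQ-enhanced put = regular put + the fields the NIC needs to
   rewrite and enqueue a GPU command packet at the target. */
typedef struct {
    rdma_put_t base;
    uint32_t remote_queue_index;   /* which registered HSA queue */
    uint32_t remote_kernel_index;  /* which registered kernel/function */
    uint8_t  command_packet[64];   /* template command packet carried inline */
} xtq_put_t;
```

The point of the layout is backward compatibility: a NIC that ignores the trailing XTQ fields still sees a well-formed put.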
▪ CPU: Standard CPU-only systems
– Baseline non-accelerated system
▪ HSA: Currently available GPU systems
– Involves CPU runtime
▪ XTQ: Extended Task Queuing
– Enables efficient active messaging style communication that bypasses the CPU on the target
CPU and Memory Configuration:
– Type: 4-wide OOO, x86, 8 cores @ 4GHz
– I,D-Cache: 64KB, 2-way, 2 cycles
– L2-Cache: 2MB, 8-way, 8 cycles
– L3-Cache: 16MB, 16-way, 20 cycles
– DRAM: DDR3, 8 channels, 800MHz
GPU Configuration:
– Type: AMD GCN3 @ 1GHz
– CU Config: 24 CUs with 4 SIMD-16 engines
– Wavefronts: 40 waves per SIMD (64 lanes)
– V-Cache: 32KB, 16-way, 12 cycles, per CU
– K-Cache: 32KB, 8-way, 12 cycles, per 4 CUs
– I-Cache: 64KB, 8-way, 12 cycles, per 4 CUs
– L2-Cache: 1MB, 16-way, 8 banks, 100 cycles
NIC Configuration:
– Link: 100ns / 100Gbps
– Topology: Star
[Diagram: simulated cluster — several nodes, each with CPU, caches, memory, GPU, and NIC, connected in a star topology]
▪ MPI Accumulate and MPI Allreduce microbenchmarks
[Figure: Accumulate speedup vs. data size (4-byte integers, 1 to 1M elements; bigger is better) and Allreduce runtime vs. node count (8-64 nodes; smaller is better) for CPU, HSA, and XTQ]
▪ Latency Decomposition
[Figure: end-to-end put latency broken into CPU PtlPut, NIC initiator put, network, NIC target put, GPU launch, GPU kernel execution, and CPU completion for 64B and 4KB messages; XTQ improves latency by 15-19% over HSA; smaller is better]
Workload Name | Domain | %Blocked | Reductions
AlexNet | Classification | 14% | 4,672
AN4 LSTM | Speech | 50% | 131,192
CIFAR | Classification | 4% | 939,820
Large Synth | Synthetic | 28% | 52,800
MNIST Conv | Text Recognition | 12% | 900,000
MNIST Hidden | Text Recognition | 29% | 900,000
[Figure: projected speedup (0.8-1.5x; bigger is better) for CPU, HSA, and XTQ across the workloads above]
▪ XTQ provides optimized remote kernel invocation
– But still at kernel boundaries
– Kernel launches are expensive! Best case ~3µs
▪ Can we do better?
– Networking from within a kernel?
– What have other researchers tried?
[Figure: kernel launch latency (4-20µs) vs. number of kernel commands queued (1-256) for three real GPUs; smaller is better]
Contribution 2: Command Processor Networking (ComP-Net)
▪ The GPU can send messages inside a kernel
▪ A CPU thread is responsible for taking packets from the GPU and poking the NIC
▪ This style of intra-kernel networking is referred to as GPU Host Networking
[Diagram: timelines — Host Driven Networking (CPU launches the kernel, waits, then issues the put) vs. GPU Host Networking (GPU posts a send mid-kernel; a CPU thread forwards the put to the NIC)]
(Left: Host Driven Networking, e.g., MPI + CUDA; right: GPU Host Networking)
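The CPU-proxy model can be sketched as a shared mailbox (all names here are illustrative; a real runtime would use PCIe-visible memory and NIC doorbells rather than plain variables):

```c
#include <assert.h>
#include <stdint.h>

/* GPU Host Networking model: the GPU deposits send requests into a
   shared mailbox; a CPU service thread drains it and pokes the NIC. */
typedef struct { uint64_t dst; uint64_t len; int valid; } send_req_t;

static int nic_puts_issued; /* stands in for real NIC doorbell writes */

/* "GPU" side: publish a request from inside a running kernel. */
void gpu_post_send(send_req_t *mbox, uint64_t dst, uint64_t len) {
    mbox->dst = dst;
    mbox->len = len;
    mbox->valid = 1;
}

/* CPU service thread: poll the mailbox, issue the put, ack the GPU. */
int cpu_service(send_req_t *mbox) {
    if (!mbox->valid) return 0;
    nic_puts_issued++;   /* issue the RDMA put on the GPU's behalf */
    mbox->valid = 0;     /* completion becomes visible to the waiting kernel */
    return 1;
}
```

The weakness the next slide quantifies is built into this design: every message crosses the IO bus twice and burns a host thread on polling.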
▪ Needs multiple trips over the IO bus
▪ Where to place queues?
– GPU memory vs. host memory – high latency in both cases
▪ Not scalable
– 4096 work-groups fill the GPU
– Still 40µs latency with 8 service threads
[Figure: network service time (20-100µs) vs. active work-groups — left: host queues vs. GPU queues, both far above raw network latency; right: scaling from 1 to 8 CPU service threads still leaves ~40µs at 4096 work-groups; smaller is better]
▪ GPUs have built-in CPUs called Command Processors (CPs)
– Scalar cores are good at running network runtime code
– Connected to the GPU CUs through a shared LLC
▪ Traditionally used to launch kernels
– But intra-kernel networking encourages fewer kernels…
[Diagram: GPU block diagram — Command Processor (CPU core + L1 cache) and Compute Unit (4x SIMD, Local Data Share, L1 cache) sharing the GPU L2 cache and GPU memory]
▪ Uses the built-in CP to support network operations
▪ CP/GPU communicate over the shared L2 cache instead of PCIe
▪ Potentially much lower latency than other GHN designs
▪ Scales naturally
– Every GPU has multiple CP threads
M. LeBeane et al., "ComP-Net: Command Processor Networking for Efficient Intra-kernel Communications on GPUs," in Proc. of the Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT), 2018.
[Diagram: GPU Host Networking routes host queues and network queues through host memory over PCIe; ComP-Net keeps both queue types in GPU memory behind the GPU L2 cache, shared by the CUs and CPs, with the NIC attached over PCIe]
▪ The main component of the ComP-Net runtime is the CP/GPU producer/consumer queue
▪ Most steps are straightforward
[Diagram: ComP-Net producer/consumer queue — Queue Entries with per-entry Status flags and a global Read Idx live at the cache/memory/GPU coherence point; the producing work-group keeps its Write Idx and a local (cached) Read Idx in LDS/non-coherent cache; the CP thread keeps its Base Ptr and local Read Idx in registers/non-coherent cache]
GPU work-group (producer) steps:
▪ 1a) Check if the queue is full (using the local Read Idx)
▪ 1b) If full, update the local Read Idx from the global one and loop until not full
▪ 2) Fill the Queue Entry with networking metadata
– Or inline small payloads in the Queue Entry itself
▪ 3) Set status flag with release marker to notify CP
▪ 4) Increment local Write Idx
▪ 5) Check status bit to determine when CP completes operation
CP thread (consumer) steps:
▪ 1) Poll on the next Queue Entry (based on the local Read Idx) with an acquire marker
▪ 2) Read data from Queue Entry
▪ 3) Perform the network operation, then set the Status flag to 0 with a release marker
▪ 4a) Update the global Read Idx
▪ 4b) Update the local Read Idx with a release marker
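The full protocol can be sketched as a single-producer/single-consumer queue in plain C (the real ComP-Net runtime uses GPU release/acquire markers across the L2 coherence point; the names and return-code convention here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES 4

typedef struct { uint64_t payload; volatile int status; } entry_t;

typedef struct {
    entry_t  ring[ENTRIES];
    volatile uint64_t read_idx;   /* global, advanced by the CP thread */
    uint64_t write_idx;           /* private to the producing work-group */
    uint64_t cached_read_idx;     /* producer's possibly stale copy */
} cpnet_queue_t;

/* Work-group side (steps 1-4): returns 0 if the queue is full. */
int wg_enqueue(cpnet_queue_t *q, uint64_t payload) {
    if (q->write_idx - q->cached_read_idx == ENTRIES) { /* 1a: full? */
        q->cached_read_idx = q->read_idx;               /* 1b: refresh */
        if (q->write_idx - q->cached_read_idx == ENTRIES) return 0;
    }
    entry_t *e = &q->ring[q->write_idx % ENTRIES];
    e->payload = payload;   /* 2: fill entry with network metadata  */
    e->status = 1;          /* 3: release-store notifies the CP     */
    q->write_idx++;         /* 4: advance the local write index     */
    return 1;
}

/* CP thread side (steps 1-4): returns 0 if nothing is pending. */
int cp_dequeue(cpnet_queue_t *q, uint64_t *payload) {
    entry_t *e = &q->ring[q->read_idx % ENTRIES];
    if (e->status != 1) return 0;  /* 1: acquire-poll on the status flag */
    *payload = e->payload;         /* 2: read the entry                  */
    e->status = 0;                 /* 3: network op done, release-store  */
    q->read_idx++;                 /* 4a/4b: advance the read index      */
    return 1;
}
```

The cached read index is the key trick: the producer touches the globally coherent `read_idx` only when it believes the queue is full, keeping the common-case enqueue entirely within lines the work-group already owns.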
▪ Residency of data in the GPU L2 is very short
▪ Work-group data produced for the CP is evicted when other work-groups perform streaming memory accesses
▪ Can be solved through cache line locking
– Preliminary results are promising
– Still much to explore here
[Figure: L2 hit rate (0.1-1.0) for CP networking wavefronts vs. streaming wavefronts — LLC locking keeps the hit rate well above the baseline; bigger is better]
▪ CPU: Standard CPU-only systems
– Baseline non-accelerated system
▪ HDN: Host Driven Networking
– Kernel boundary networking (host MPI + CUDA)
Intra-kernel Networking Schemes:
▪ APU: CPU/GPU on the same die
– Intra-kernel networking through host threads on an APU
▪ dGPU: GPU Host Networking
– Intra-kernel networking through host threads on a dGPU
▪ ComP-Net: Command Processor Networking
– Intra-kernel networking through command processor
CPU and Memory Configuration:
– Type: 8-wide OOO, x86, 8 cores @ 4GHz
– I,D-Cache: 64KB, 2-way, 2 cycles
– L2-Cache: 2MB, 8-way, 8 cycles
– L3-Cache: 16MB, 16-way, 20 cycles
– DRAM: DDR4, 8 channels, 2133MHz
GPU Configuration:
– Type: AMD GCN3 @ 1.5GHz
– CU Config: 12 CUs with 4 SIMD-16 engines
– Wavefronts: 40 waves per SIMD (64 lanes)
– V-Cache: 32KB, 16-way, 12 cycles, per CU
– K-Cache: 32KB, 8-way, 12 cycles, per 4 CUs
– I-Cache: 64KB, 8-way, 12 cycles, per 4 CUs
– L2-Cache: 1MB, 16-way, 8 banks, 100 cycles
CP Configuration:
– Type: 2-wide OOO, x86, 2 cores @ 2GHz
– D-Cache: 32KB, 8-way, 4 cycles
– I-Cache: 16KB, 8-way, 4 cycles
▪ 2D Jacobi Stencil
– 1D data decomposition
– Iterative compute and halo exchange
– Three regions of interest
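A minimal single-process sketch of the 1-D decomposition and halo exchange (array shapes and names are illustrative; a real run replaces the memcpy with network puts into the neighbor's ghost rows):

```c
#include <assert.h>
#include <string.h>

#define N 8  /* per-node interior rows in the 1-D decomposition */

/* One Jacobi iteration on a node's slab, using halo rows received
   from the neighbors above and below (edge columns skipped). */
void jacobi_step(const double in[N + 2][N], double out[N + 2][N]) {
    for (int i = 1; i <= N; i++)
        for (int j = 1; j < N - 1; j++)
            out[i][j] = 0.25 * (in[i - 1][j] + in[i + 1][j] +
                                in[i][j - 1] + in[i][j + 1]);
}

/* Halo exchange between two vertically adjacent slabs: each sends its
   outermost interior row into the other's ghost row. */
void halo_exchange(double top[N + 2][N], double bottom[N + 2][N]) {
    memcpy(top[N + 1], bottom[1], sizeof(double) * N); /* bottom -> top */
    memcpy(bottom[0],  top[N],    sizeof(double) * N); /* top -> bottom */
}
```

With intra-kernel networking, the halo rows can be sent as soon as they are computed, before the interior finishes, instead of waiting for a kernel boundary.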
[Figure: relative speedup vs. the dGPU baseline (0.8-1.3x) across per-node problem sizes (16-1024 N x N grid) for ComP-Net, dGPU, APU, HDN, and CPU; bigger is better. Diagram: two nodes exchange halo rows of the 1-D-decomposed grid]
▪ Strong-scaling reduction:
– APU performs better than ComP-Net
– ComP-Net is much more energy efficient
[Figure: relative speedup (0.6-1.4x; bigger is better) and energy consumption (0.2-1.2x; smaller is better) vs. number of nodes in the reduction (4-36) for ComP-Net, dGPU, APU, HDN, and CPU; a vector-sum illustration shows partial sums combining]
▪ Machine Learning workloads (same set as in the XTQ evaluation)
[Figure: projected speedup (0.8-1.15x; bigger is better) for CPU, HDN, dGPU, APU, and ComP-Net across AlexNet, AN4 LSTM, CIFAR, MNIST Conv, MNIST Hidden, and the average]
▪ The CPU creates the network operation off the critical path
– Registers it with the NIC
▪ The GPU simply ‘triggers’ the operation when the data is ready
▪ Provides intra-kernel GPU networking without requiring a CPU thread
[Diagram: timelines — Host-Driven Networking (CPU waits for the kernel, then puts), GPU Host Networking (a CPU thread forwards the GPU’s mid-kernel send), and GPU Triggered Networking (the NIC fires the pre-registered put as soon as the GPU triggers it)]
M. LeBeane et al., "GPU Triggered Networking for Intra-Kernel Communications," in Proc. of the Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2017.
Contribution 3: GPU Triggered Networking (GPU-TN)
▪ CPU creates a Trigger Entry
– A Trigger Entry consists of a network operation, counter, tag, and threshold
– Appends the entry to the Trigger List
▪ GPU fills the send buffer
– During kernel execution
[Diagram: the CPU appends Trigger Entries to the NIC’s Trigger List (1) while the GPU fills the send buffer (2)]
▪ GPU initiates Put operation
– GPU Provides Tag
▪ NIC sends message
– The message is triggered when the counter >= the CPU-provided threshold
▪ HW complexity?
– ‘Trigger list’ might not be a list
▪ CPU/GPU race conditions?
– Allocate null entry for unexpected triggers
[Diagram: Trigger Entry logic — a GPU tag write is compared against each entry’s tag; on a match the counter increments, and when counter >= threshold the NIC begins the pre-registered network operation]
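The NIC-side matching logic can be sketched in a few lines (names and the `fired` flag are illustrative; a real NIC would launch the pre-built put rather than set a flag):

```c
#include <assert.h>
#include <stdint.h>

/* NIC-resident trigger entry created by the CPU ahead of time.
   The GPU only ever writes a tag; the NIC fires the pre-built
   network operation once enough matching tags have arrived. */
typedef struct {
    uint32_t tag;        /* tag the GPU writes to trigger */
    uint32_t threshold;  /* fire when counter >= threshold */
    uint32_t counter;
    int      fired;      /* stands in for "begin network operation" */
} trigger_entry_t;

/* Models the GPU-side store of a tag into the NIC's trigger list. */
void gpu_write_tag(trigger_entry_t *list, int n, uint32_t tag) {
    for (int i = 0; i < n; i++) {
        if (list[i].tag == tag) {                 /* tag match */
            if (++list[i].counter >= list[i].threshold)
                list[i].fired = 1;                /* launch the put */
            return;
        }
    }
    /* Unmatched tags fall into a null entry in the real design. */
}
```

The threshold lets many work-items (or work-groups) each write the same tag, with the NIC firing only when all contributors have arrived.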
▪ CPU: Standard CPU-only systems
– Baseline non-accelerated system
▪ HDN: Host Driven Networking
– No driver interactions on the critical path, but may involve CPU runtime
▪ GDS-Sim: GPUDirect Async
– Preregistration of communication but at kernel boundaries
▪ GHN: GPU Host Networking
– Intra-kernel networking through host threads
▪ GPU-TN: GPU Triggered Networking
– Preregistration of network operations and intra-kernel networking
CPU and Memory Configuration:
– Type: 8-wide OOO, x86, 8 cores @ 4GHz
– I,D-Cache: 64KB, 2-way, 2 cycles
– L2-Cache: 2MB, 8-way, 8 cycles
– L3-Cache: 16MB, 16-way, 20 cycles
– DRAM: DDR4, 8 channels, 2133MHz
GPU Configuration:
– Type: AMD GCN3 @ 1.5GHz
– CU Config: 24 CUs with 4 SIMD-16 engines
– Wavefronts: 40 waves per SIMD (64 lanes)
– V-Cache: 32KB, 16-way, 12 cycles, per CU
– K-Cache: 32KB, 8-way, 12 cycles, per 4 CUs
– I-Cache: 64KB, 8-way, 12 cycles, per 4 CUs
– L2-Cache: 1MB, 16-way, 8 banks, 100 cycles
NIC Configuration:
– Link: 100ns / 100Gbps
– Topology: Star
▪ 2D Jacobi Stencil
[Figure: speedup vs. HDN (1.0-1.2x) across local 2D grid sizes (16-1024 N x N) for CPU, GDS-Sim, GHN, and GPU-TN; bigger is better]
▪ 64MB Reduction (strong scaling)
[Figure: speedup (0.8-1.6x) vs. node count (2-32) for HDN, GDS-Sim, GHN, and GPU-TN; bigger is better]
▪ Machine Learning Training Phase
[Figure: projected speedup (0.8-1.5x; bigger is better) for CPU, HDN, GDS-Sim, GHN, and GPU-TN across AlexNet, AN4 LSTM, CIFAR, Large Synth, MNIST Conv, and MNIST Hidden]
▪ Presented 3 enhancements to improve GPU networking
– Extended Task Queuing
– Command Processor Networking
– GPU Triggered Networking
Conclusion
▪ XTQ allows direct access to remote GPU queues
– Teach NICs how to speak with HSA queues
– Improves latency and frees CPU service thread(s)
▪ Improves application performance by ~15%
▪ Uses the built-in CP to support network operations
▪ CP/GPU communicate over the shared L2 cache instead of PCIe
▪ Potentially much lower latency than other GHN designs
▪ Scales naturally
– Every GPU has multiple CP threads
▪ Improves application performance ~20% vs. other GHN approaches
▪ The CPU creates the network operation off the critical path
– Registers it with the NIC
▪ The GPU simply ‘triggers’ the operation when the data is ready
▪ Provides intra-kernel GPU networking without requiring a CPU thread
▪ Improves application performance ~20% vs. GPUDirect Async
▪ This dissertation motivates the need for more independent accelerators
– Cannot funnel everything through a central CPU!
– Concepts are applicable to many types of accelerators and networks
▪ Still much to do!
– Application Redesign Opportunities
– Leveraging Emerging NIC Technologies for GPUs
Backup Slides
▪ CPU controls networking through the driver/runtime
▪ Messages sent at kernel boundaries
▪ Research implementations include:
– CUDA-Aware MPI [Kraus ‘14] – CUDA-Aware OpenSHMEM [Hamidouche ’16]
[Diagram: host-driven networking timeline — the CPU launches a kernel, waits for completion, then issues the put]
Mellanox, "Mellanox GPUDirect RDMA User Manual," http://www.mellanox.com/related-docs/prod_software/Mellanox GPUDirect User Manual v1.2.pdf, 2015.
▪ GPU runs the networking stack
▪ Persistent kernels and LDS memory used for network data structures
▪ Research implementations include:
– GPUrdma [Daoud ’16] – IBV on GPUs [Oden ‘14]
[Diagram: GPU native networking timeline — a persistent kernel issues sends directly to the NIC, vs. host-driven networking where the CPU launches a kernel, waits, then puts]
F. Daoud, A. Watad, and M. Silberstein, "GPUrdma: GPU-side Library for High Performance Networking from GPU Kernels," in Proc. of the Intl. Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2016.
AMD <=> Nvidia Translator
▪ Work-item = Thread
▪ Wavefront (64 threads) = Warp (32 threads)
– Unit of thread dispatch
▪ Work-group = Thread Block
– Unit of Synchronization
▪ Local Data Share (LDS) = Shared Memory
– Work-group scratchpad
▪ Compute Unit (CU) = Streaming Multi-Processor (SM)
– Collection of SIMD engines sharing LDS and L1 cache
▪ Kernel
– GPU SIMT Function
▪ Command Processor (CP)
– Dispatch engine and scheduler
[Diagram: GPU block diagram — Command Processor (CPU core + L1 cache) and Compute Unit (4x SIMD, Local Data Share, L1 cache) sharing the L2 cache and GPU memory]
Work-item Level:
__kernel void kern1(__global char *trigAddr, const int tagBase,
                    __global void *buffer) {
    // do work
    buffer = ...;
    int id = get_global_id();
    *trigAddr = tagBase + id;
    // do additional work ...
}

Work-group Level:
__kernel void kern2(__global char *trigAddr, const int tagBase,
                    __global void *buffer) {
    // do work
    buffer = ...;
    wg_barrier();
    if (!get_local_id()) {
        int id = get_group_id();
        *trigAddr = tagBase + id;
    }
    // do additional work ...
}

Kernel Level:
__kernel void kern3(__global char *trigAddr, const int tag,
                    __global void *buffer) {
    // do work
    buffer = ...;
    wg_barrier();
    if (!get_local_id())
        *trigAddr = tag;
    // do additional work ...
}
▪ gem5 + AMD GCN3 GPU model + custom Portals 4 NIC model
– CPU power model with McPAT
– Baseline model is a coherent APU
▪ Each section has slightly different parameters
– Discussed before the results are presented
[Diagram: simulated node — multi-core CPU (per-core L1I/L1D and L2, shared L3), GPU (GPU cores with L1D, shared Sequencer Caches (SQC) and GPU L2, plus a CP core), and NIC (NIC processors, DMA engines, L1I/L1D), all connected to the interface network behind a directory and memory controllers]
▪ RDMA allows for direct access of remote memory without involving the CPU
– Heavy lifting is performed on the NIC (off-load networking model)
– Generally expressed in terms of remote Put/Get operations
▪ Maps naturally to “one-sided” communication semantics
– Put/Get vs. Send/Receive
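Modeling remote memory as a byte array makes the one-sided contract concrete: only the initiator names the address, and no receive call runs at the target (a sketch; real RDMA goes through registered memory regions and NIC DMA engines):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One-sided put: the initiator writes directly into target memory.
   The memcpy stands in for the NIC's DMA into the target. */
void rdma_put(uint8_t *remote_mem, uint64_t offset,
              const uint8_t *src, uint64_t len) {
    memcpy(remote_mem + offset, src, len);
}

/* One-sided get: the initiator reads directly out of target memory. */
void rdma_get(uint8_t *dst, const uint8_t *remote_mem,
              uint64_t offset, uint64_t len) {
    memcpy(dst, remote_mem + offset, len);
}
```

Contrast with Send/Receive, where the target must post a matching receive before the data can land.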
[Diagram: one-sided RDMA — the initiator’s NIC moves data directly between initiator and target memory through each node’s IO controller, without target CPU involvement]
Host Code:
__host__ void hostInit() {
    // Initialize ComP-Net
    cpnet_handle_t *cpnet_handle;
    cpnet_init(&cpnet_handle, GRID_SZ / WG_SZ);

    // Allocate symmetric heap memory
    char *buf = cpnet_shmalloc(sizeof(char) * GRID_SZ / WG_SZ);

    // Initiator/target launches kernel
    if (cpnet_handle->pe == INITIATOR) {
        hipLaunchKernel(Ping, GRID_SZ, GRID_SZ / WG_SZ, 0, 0,
                        cpnet_handle, buf);
    } else {
        /* Launch target kernel. */
    }
}

GPU Code:
__device__ void Ping(cpnet_handle_t *cpnet_handle, char *wg_buffer) {
    // Extract context from global handle
    __shared__ cpnet_ctx_t cpnet_ctx;
    cpnet_ctx_create(cpnet_handle, cpnet_ctx);

    // Each WG pings target
    cpnet_shmem_char_p(cpnet_ctx, wg_buffer[hipBlockIdx_x], 1, TARGET);

    // Each WG waits for pong from target
    cpnet_shmem_char_wait_until(wg_buffer[hipBlockIdx_x], 1);

    cpnet_ctx_destroy(cpnet_ctx);
}
▪ One-sided put latency benchmark
– Initiator launches a dummy kernel, executes the network command, and terminates
– Target polls on the put location
▪ Take-away messages
– HDN < GDS-Sim < GPU-TN
– GPU-TN actually overlaps kernel teardown with the network transfer!
[Figure: put latency decomposition (kernel launch, kernel execution, kernel teardown, put, wait) for initiator/target pairs — roughly 4.2µs for HDN, 3.8µs for GDS-Sim, and 2.7µs for GPU-TN; smaller is better]
[Figure: remote Get time observed from the GPU for ComP-Net, dGPU, and APU — left: sweep of payload size (1B-32KB) for 1 WG and 1 thread; middle/right: sweep of network service threads (2-10) for 1-byte transfers and 480 WGs, plus energy consumed by network threads relative to dGPU]
▪ Friendlier programming abstractions
– Nicer abstractions in CUDA and OpenCL
– Single-source, kernel-less programming support
▪ Architectural Support
– User-level kernel-launch – Shared virtual address space – Virtualization – Multiprocessing – (Sometimes) Coherent caches
[Diagram: architected queuing and shared virtual memory — a producer CPU enqueues a Command Packet for a consumer GPU through a Command Queue in virtual memory; the MMU, IOMMU, and an OS driver map both tightly coupled devices onto shared physical memory]
What about networking support?