SLIDE 1

PhD Defense

Optimizing Communication for Clusters of GPUs

Michael LeBeane mlebeane@utexas.edu Advisor: Lizy K. John

SLIDE 2

▪ GPUs are everywhere in HPC, Big Data, Machine Learning, and beyond

– Excellent performance/watt for many classes of data-parallel computation

▪ Many GPUs are required to solve the biggest computational problems

– Can only fit so many GPUs in a single node!
– GPUs need to talk to each other through Network Interface Controllers (NICs)
– The path between GPU and NIC needs to be efficient

▪ Vendors are selling machines filled with many GPUs and NICs:

GPUs and Networks in the Wild

2 Michael LeBeane – PhD Defense 07/16/2018

Nvidia's DGX-2: 16 Tesla V100 GPUs, 8 Mellanox 100G NICs, 2 Ethernet NICs, 2 Xeon Platinum CPUs (1.6:1 GPU/NIC ratio)

AMD's Project 47 Node: 4 Radeon Instinct GPUs, 2 Mellanox 100G NICs, 1 EPYC 7601 32-core CPU (2:1 GPU/NIC ratio)

Problem Statement

SLIDE 3

▪ Largely focused on an optimized data plane

– Path taken by the application data that needs to be transferred by the network
– Industry technologies such as ROCn RDMA and GPUDirect RDMA allow peer-to-peer data transfers

Today’s GPU Networks


IOC = IO Controller

[Diagram: initiator and target nodes, each with a CPU, caches, memory, IOC, NIC, and GPU; data moves peer-to-peer between GPU memories across the network]

SLIDE 4

▪ Control plane is unoptimized!

– Focused on a host-centric model where only the CPU can coordinate network transfers
– Very high latencies to perform networking from the GPU

Challenges with Today’s GPU Networks

SLIDE 5

▪ GPU Allreduce Computation

– Many communication/computation phases
– Scaling out increases the number of phases

Motivating Example for Control Plane Optimizations


[Figure: allreduce example — per-node buffer values evolve over time through alternating communication and compute phases as nodes/GPUs exchange partial sums]
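The growth in phases with scale can be made concrete with a small simulation. This is a minimal sketch (not the dissertation's implementation) using recursive doubling, one common allreduce algorithm; the node count and buffer values are illustrative:

```python
# Minimal sketch of an allreduce using recursive doubling: every phase is a
# communication step (exchange with a partner) followed by a compute step
# (combine), and scaling out to N nodes requires log2(N) phases.

def allreduce_recursive_doubling(buffers):
    """Simulate an allreduce over per-node values; returns (values, n_phases)."""
    n = len(buffers)
    assert n & (n - 1) == 0, "node count must be a power of two for this sketch"
    vals = list(buffers)
    phases = 0
    step = 1
    while step < n:
        nxt = list(vals)
        for rank in range(n):
            partner = rank ^ step                   # communicate with partner
            nxt[rank] = vals[rank] + vals[partner]  # then compute the partial sum
        vals = nxt
        step *= 2
        phases += 1
    return vals, phases

vals, phases = allreduce_recursive_doubling([1, 2, 5, 3])
# every node ends with the full sum (11) after log2(4) = 2 phases
```

Doubling the node count adds one more communication/compute phase, which is why the control-plane cost of each phase matters at scale.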

SLIDE 6

Thesis Statement


GPU networking can be improved by both software and hardware enhancements that enable GPUs to more directly interface with the network control plane.

▪ Proposed Solutions

– Extended Task Queuing

  • Direct NIC-to-GPU active messaging

– Command Processor Networking

  • Dynamic communication using on-chip GPU Command Processor

– GPU Triggered Networking

  • Initiate messages without the CPU on the critical path

SLIDE 7

▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion

Outline

SLIDE 8

▪ GPUs consume work through in-memory command queues

– Queue format standardized through the Heterogeneous System Architecture (HSA)
– Any device can produce work for another device
– Assumes a unified virtual address space

▪ Can we extend this across a node?

– The NIC doesn't know how to talk to HSA queues
– The initiator doesn't know the virtual addresses of resources at the target

Local GPU Work Dispatch


[Diagram: a producer (GPU or CPU) writes a command packet into a command queue in shared virtual memory; a consumer GPU dequeues and executes it]
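The local dispatch model above can be sketched as follows. Field names loosely follow an HSA AQL kernel-dispatch packet, but the queue and doorbell objects here are simplified stand-ins for illustration, not the real HSA runtime:

```python
# Minimal sketch of HSA-style user-level work dispatch: a producer (CPU or
# GPU) writes a command packet into a shared in-memory ring and rings a
# doorbell; the consumer GPU picks it up without driver involvement.

from dataclasses import dataclass, field

@dataclass
class DispatchPacket:
    kernel_object: int      # address of the code object to run
    kernarg_address: int    # address of the kernel argument block
    completion_signal: int  # signal the consumer updates when done

@dataclass
class CommandQueue:
    size: int
    packets: list = field(default_factory=list)
    write_index: int = 0
    doorbell: int = -1      # last write index the consumer was told about

    def enqueue(self, pkt):
        # Any agent with the queue mapped into the unified virtual address
        # space can produce work: write the packet, then ring the doorbell.
        self.packets.append(pkt)
        self.write_index += 1
        self.doorbell = self.write_index  # doorbell write wakes the consumer

q = CommandQueue(size=64)
q.enqueue(DispatchPacket(kernel_object=0x1000, kernarg_address=0x2000,
                         completion_signal=0x3000))
```

The key property XTQ builds on is that producing work is just memory writes plus a doorbell, so in principle any device that can issue those writes can dispatch work.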

Contribution 1: Extended Task Queuing (XTQ)

SLIDE 9

[Diagram: initiator and target nodes with XTQ-enhanced NICs; the initiator's NIC delivers work directly into the target GPU's command queues]

▪ XTQ allows direct access to remote GPU queues

– Teach NICs how to speak with HSA queues

▪ Enables Active Messaging without target CPU involvement

– Improves latency and frees CPU service thread(s)

Extended Task Queuing (XTQ) Overview


  • M. LeBeane, B. Potter, A. Pan, A. Dutu, V. Agarwala, W. Lee, D. Majeti, B. Ghimire, E. Van Tassell, S. Wasmundt, B. Benton, M. Breternitz, M. L. Chu, M. Thottethodi, L. K. John, and S. K. Reinhardt, "Extended Task Queuing: Active Messages for Heterogeneous Systems," in Proc. of the Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2016.

SLIDE 10

▪ Payload data streams into the target-side receive buffer
▪ Command descriptor is placed into the command queue

Target-side XTQ Operation


[Diagram: the XTQ NIC streams payload data into virtual memory, looks up the command queue, and rings the GPU's doorbell; the GPU later writes a completion signal]

SLIDE 11

▪ NIC notifies the GPU using a memory-mapped doorbell
▪ GPU reads the command packet

Target-side XTQ Operation



SLIDE 12

▪ GPU reads the transferred data
▪ GPU writes the shared-memory completion signal

Target-side XTQ Operation


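The target-side sequence on the last three slides can be sketched end to end, with the NIC, queue, and memory modeled as plain Python objects (all names are illustrative, not a real NIC or GPU API):

```python
# Sketch of target-side XTQ handling:
# 1) NIC streams the payload into the receive buffer and places a command
#    descriptor into the GPU's command queue,
# 2) NIC rings the GPU's memory-mapped doorbell,
# 3) GPU reads the command packet and the transferred data,
# 4) GPU writes the shared-memory completion signal.

def nic_deliver(memory, queue, payload, command_packet):
    memory["recv_buffer"] = payload          # 1) payload into receive buffer
    queue.append(command_packet)             # command descriptor into queue
    memory["doorbell"] = len(queue)          # 2) doorbell write wakes the GPU

def gpu_service(memory, queue):
    assert memory["doorbell"] == len(queue)  # woken by the doorbell
    pkt = queue[-1]                          # 3) read the command packet
    data = memory["recv_buffer"]             # ...and the transferred data
    result = sum(data)                       # run the requested kernel
    memory[pkt["completion_signal"]] = 1     # 4) completion signal write
    return result

mem, q = {}, []
nic_deliver(mem, q, payload=[1, 2, 3],
            command_packet={"completion_signal": "sig0"})
out = gpu_service(mem, q)
```

Note the target CPU never appears in this sequence; that is the point of XTQ.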

SLIDE 13

▪ How does the initiator know about remote VAs at the target?
▪ Use coordinated indices specified by the initiator
▪ Lookup tables are populated by the target-side XTQ library

XTQ Coordinated Indices


[Diagram: example queue lookup — the initiator's command packet (RDMA header, kernel arguments, data payload) carries a queue index; the target NIC's queue lookup table, selected by target PID via a base address register, resolves the index to the queue's virtual address in unified virtual memory]

SLIDE 14


▪ XTQ Put is implemented as a simple extension to the standard RDMA put operation

– Compatible with many low-level RDMA transports (e.g. InfiniBand, RoCE, Portals 4, iWARP, etc.)

▪ XTQ Registration API is used to provide index-to-address translations

XTQ Runtime API

Regular RDMA Put Operation — Put Command Fields:

– Target NID/PID
– Send Buffer Ptr.
– Send Buffer Length
– Target Buffer Index
– Transport-specific metadata

XTQ-Enhanced RDMA Put Operation — Additional XTQ Fields:

– Remote Queue Index
– Remote Function/Kernel Index
– GPU Command Packet
– Kernel/Function Launch Parameters

XTQ Registration API:

 Register Queue
‒ Queue Desc. VA

 Register Function
‒ Function Ptr. VA
‒ Target-Side Buffer VA

 Register Kernel
‒ Kernel Ptr. VA
‒ Target-Side Buffer VA
‒ Kernel Argument Size
‒ Completion Signal VA
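A sketch of how an XTQ-enhanced put could extend a regular RDMA put descriptor, based on the field list above; the function names and dictionary layout are assumptions for illustration, not a real transport API:

```python
# Illustrative sketch: an XTQ put keeps all the standard RDMA put fields and
# appends remote queue/kernel indices plus a pre-built GPU command packet,
# so any index-based RDMA transport could carry it.

def rdma_put(target_nid, send_buffer, length, target_buffer_index):
    # Regular RDMA put descriptor (transport metadata omitted).
    return {"target": target_nid, "buf": send_buffer, "len": length,
            "rbuf_index": target_buffer_index}

def xtq_put(target_nid, send_buffer, length, target_buffer_index,
            queue_index, kernel_index, launch_params):
    desc = rdma_put(target_nid, send_buffer, length, target_buffer_index)
    desc.update({                        # XTQ-only additions
        "queue_index": queue_index,      # which remote HSA queue to target
        "kernel_index": kernel_index,    # which pre-registered kernel to run
        "command_packet": {"kernel": kernel_index, "args": launch_params},
    })
    return desc

d = xtq_put(target_nid=3, send_buffer=0x1000, length=256,
            target_buffer_index=0, queue_index=1, kernel_index=2,
            launch_params=(256,))
```

Because every remote resource is named by a coordinated index rather than a virtual address, the initiator never needs to know the target's address space.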

SLIDE 15

▪ CPU: Standard CPU-only systems

– Baseline non-accelerated system

▪ HSA: Currently available GPU systems

– Involves CPU runtime

▪ XTQ: Extended Task Queuing

– Enables efficient active messaging style communication that bypasses the CPU on the target

Experimental Setup


CPU and Memory Configuration:

– Type: 4-wide OOO, x86, 8 cores @ 4GHz
– I,D-Cache: 64KB, 2-way, 2 cycles
– L2-Cache: 2MB, 8-way, 8 cycles
– L3-Cache: 16MB, 16-way, 20 cycles
– DRAM: DDR3, 8 channels, 800MHz

GPU Configuration:

– Type: AMD GCN3 @ 1GHz
– CU Config: 24 CUs with 4 SIMD-16 engines
– Wavefronts: 40 waves per SIMD (64 lanes)
– V-Cache: 32KB, 16-way, 12 cycles, per CU
– K-Cache: 32KB, 8-way, 12 cycles, per 4 CUs
– I-Cache: 64KB, 8-way, 12 cycles, per 4 CUs
– L2-Cache: 1MB, 16-way, 8 banks, 100 cycles

NIC Configuration:

– Link: 100ns latency / 100Gbps
– Topology: Star

[Diagram: simulated star topology — each node contains a CPU with caches, memory, a GPU, and a NIC]

SLIDE 16

Results


▪ MPI Accumulate and MPI Allreduce

[Charts: MPI Accumulate speedup vs. data items (4-byte integers, 1 to 1M; bigger is better) and MPI Allreduce runtime (µs) vs. node count (8 to 64; smaller is better) for CPU, HSA, and XTQ]

▪ Latency Decomposition

[Chart: latency decomposition (µs; smaller is better) of 64B and 4KB operations for CPU, HSA, and XTQ into CPU PtlPut, NIC initiator put, network, NIC target put, GPU launch, GPU kernel execution, and CPU completion; XTQ improves latency by 15–19%]

SLIDE 17

Results


Workload Name — Domain — %Blocked — Reductions

– AlexNet — Classification — 14% — 4,672
– AN4 LSTM — Speech — 50% — 131,192
– CIFAR — Classification — 4% — 939,820
– Large Synth — Synthetic — 28% — 52,800
– MNIST Conv — Text Recognition — 12% — 900,000
– MNIST Hidden — Text Recognition — 29% — 900,000


▪ Use Microsoft's Cognitive Toolkit and sample workloads
▪ Projected using simulation results plus profiling data from TACC's Stampede supercomputer
▪ Speedups are bounded by the % of time the application is blocked on network data

[Chart: projected speedup (bigger is better) of CPU, HSA, and XTQ across the machine-learning workloads]

SLIDE 18

Outline


▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion

SLIDE 19

▪ XTQ provides optimized remote kernel invocation

– But still at kernel boundaries
– Kernel launches are expensive! Best case is ~3µs
  • Network latency is < 0.7µs

▪ Can we do better?

– Networking from within a kernel?
– What have other researchers tried?

Motivating Intra-kernel Networking


[Chart: kernel launch latency (µs; smaller is better) vs. number of queued kernel commands for three GPUs]

Contribution 2: Command Processor Networking (ComP-Net)

SLIDE 20

▪ GPU can send messages inside a kernel
▪ A CPU thread is responsible for taking packets from the GPU and poking the NIC
▪ We will refer to this style of intra-kernel networking as GPU Host Networking

Prior-art in Intra-kernel Networking


[Timelines: Host-Driven Networking vs. GPU Host Networking — CPU, GPU, and NIC activity for launch, kernel, send/wait, and put]

  • S. Kim, S. Huh, Y. Hu, X. Zhang, E. Witchel, A. Wated, and M. Silberstein, "GPUnet: Networking Abstractions for GPU Programs," in USENIX Conf. on Operating Systems Design and Implementation (OSDI), 2014.
  • J. A. Stuart and J. D. Owens, "Message passing on data-parallel architectures," in Intl. Symp. on Parallel Distributed Processing (IPDPS), 2009.
  • T. Gysi, J. Bär, and T. Hoefler, "dCUDA: hardware supported overlap of computation and communication," in Proc. of the Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2016.

▪ Host-Driven Networking (e.g., MPI + CUDA)
▪ GPU Host Networking

SLIDE 21

▪ Need multiple trips over the IO bus
▪ Where to place queues?

– GPU memory vs. host memory
– High latency in both cases

▪ Not scalable

– 4096 work-groups fill the GPU
– Still 40µs latency with 8 threads

Performance Problems with GPU Host Networking


[Charts: GPU Host Networking service time (µs; smaller is better) vs. active work-groups (1 to 4096), comparing host vs. GPU queue placement and 1–8 CPU service threads against raw network latency]

SLIDE 22

▪ GPUs have built-in CPUs called Command Processors (CPs)

– Scalar cores are good at running network runtime code
– Connected to the GPU CUs through a shared LLC

▪ Traditionally used to launch kernels

– But intra-kernel networking encourages fewer kernels

Command Processor Overview


[Diagram: GPU block diagram — compute units with four SIMD engines, a Local Data Share, and an L1 cache share the L2 cache with the command processor (a CPU core with its own L1 cache) and GPU memory]

SLIDE 23

▪ Uses the built-in CP to support network operations
▪ CP and GPU communicate over the shared L2 cache instead of PCIe
▪ Potentially much faster (lower latency) than other GHN designs
▪ Scales naturally

– Every GPU has multiple CP threads

Command Processor Networking (ComP-Net) Overview


  • M. LeBeane, K. Hamidouche, B. Benton, M. Breternitz, S. K. Reinhardt, and L. K. John, "ComP-Net: Command Processor Networking for Efficient Intra-kernel Communications on GPUs," in Proc. of the Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT), 2018.

[Diagram: in GPU Host Networking, host and network queues live in host memory and every GPU–NIC interaction crosses PCIe; in ComP-Net, host and network queues sit in GPU memory behind the shared L2 cache and the CPs drive the NIC directly]

SLIDE 24

▪ The main component of the ComP-Net runtime is the CP/GPU producer/consumer queue
▪ Most steps are straightforward

ComP-Net Producer/Consumer Queue


[Diagram: ComP-Net producer/consumer queue — a ring of queue entries with per-entry status flags and a global read index at the cache/memory/GPU coherence point; each work-group keeps a base pointer, write index, and cached local read index in LDS or registers, and each command processor thread keeps matching local state]

SLIDE 25

▪ 1a) Check if the queue is full (using the local Read Idx)
▪ 1b) If full, update the local Read Idx from the global one and loop until not full

ComP-Net Producer/Consumer Queue


SLIDE 26

▪ 2) Fill Queue Entry with networking metadata

– Or inline small payloads in the Queue Entry itself

ComP-Net Producer/Consumer Queue


SLIDE 27

▪ 3) Set the status flag (with a release marker) to notify the CP

ComP-Net Producer/Consumer Queue


SLIDE 28

▪ 4) Increment local Write Idx

ComP-Net Producer/Consumer Queue


SLIDE 29

▪ 5) Check the status bit to determine when the CP completes the operation

ComP-Net Producer/Consumer Queue


SLIDE 30

▪ 1) Poll on next Queue Entry based on local Read Idx with acquire marker

ComP-Net Producer/Consumer Queue


SLIDE 31

▪ 2) Read data from Queue Entry

ComP-Net Producer/Consumer Queue


SLIDE 32

▪ 3) Perform the network operation and set the Status flag to 0 (with a release marker) when complete

ComP-Net Producer/Consumer Queue


SLIDE 33

▪ 4a) Update the global Read Idx
▪ 4b) Update the local Read Idx with a release marker

ComP-Net Producer/Consumer Queue


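The producer/consumer protocol walked through on the preceding slides can be consolidated into one sketch, with the work-group (producer) and CP thread (consumer) acting on a shared ring; release/acquire markers become comments here, and all structure names are simplifications of the design in the slides:

```python
# Sketch of the ComP-Net CP/GPU producer/consumer queue: the GPU work-group
# produces network requests into a ring of queue entries; the CP thread
# consumes them, performs the network operation, and clears the status flag.

SIZE = 4

ring = [{"status": 0, "payload": None} for _ in range(SIZE)]
read_idx = [0]     # global read index (at the coherence point)
write_idx = [0]    # producer-local write index
local_read = [0]   # producer's cached copy of the read index

def gpu_produce(payload):
    # 1a/1b) full check against the cached read index; refresh if full
    if write_idx[0] - local_read[0] >= SIZE:
        local_read[0] = read_idx[0]
    assert write_idx[0] - local_read[0] < SIZE, "queue still full"
    slot = ring[write_idx[0] % SIZE]
    slot["payload"] = payload        # 2) fill entry with metadata/payload
    slot["status"] = 1               # 3) set status flag (release marker)
    write_idx[0] += 1                # 4) bump the local write index

def cp_consume():
    slot = ring[read_idx[0] % SIZE]
    if slot["status"] != 1:          # 1) poll the status (acquire marker)
        return None
    payload = slot["payload"]        # 2) read the entry
    slot["status"] = 0               # 3) clear status once the op completes
    read_idx[0] += 1                 # 4a/4b) advance the read indices
    return payload

gpu_produce("put:4KB@node2")
msg = cp_consume()
```

Because both sides cache their indices and only synchronize through the status flags, the common case touches only the shared L2 rather than crossing PCIe.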

SLIDE 34

▪ Residency of data in the GPU L2 is very short
▪ Work-group data produced for the CP is evicted when other work-groups perform streaming memory accesses
▪ Can be solved through cache line locking

– Preliminary results are promising – Still much to explore here

Tackling GPU Cache Thrashing


[Chart: L2 hit rate (bigger is better) for CP networking wavefronts vs. number of streaming wavefronts, baseline vs. LLC locking]
SLIDE 35

▪ CPU: Standard CPU-only systems

– Baseline non-accelerated system

▪ HDN: Host Driven Networking

– Kernel boundary networking (host MPI + CUDA)

Intra-kernel Networking Schemes:

▪ APU: CPU/GPU on the Same Die

– Intra-kernel networking through host threads on an APU

▪ dGPU: GPU Host Networking

– Intra-kernel networking through host threads on a dGPU

▪ ComP-Net: Command Processor Networking

– Intra-kernel networking through command processor

Experimental Setup


CPU and Memory Configuration:

– Type: 8-wide OOO, x86, 8 cores @ 4GHz
– I,D-Cache: 64KB, 2-way, 2 cycles
– L2-Cache: 2MB, 8-way, 8 cycles
– L3-Cache: 16MB, 16-way, 20 cycles
– DRAM: DDR4, 8 channels, 2133MHz

GPU Configuration:

– Type: AMD GCN3 @ 1.5GHz
– CU Config: 12 CUs with 4 SIMD-16 engines
– Wavefronts: 40 waves per SIMD (64 lanes)
– V-Cache: 32KB, 16-way, 12 cycles, per CU
– K-Cache: 32KB, 8-way, 12 cycles, per 4 CUs
– I-Cache: 64KB, 8-way, 12 cycles, per 4 CUs
– L2-Cache: 1MB, 16-way, 8 banks, 100 cycles

CP Configuration:

– Type: 2-wide OOO, x86, 2 cores @ 2GHz
– D-Cache: 32KB, 8-way, 4 cycles
– I-Cache: 16KB, 8-way, 4 cycles

SLIDE 36

Results


▪ 2D Jacobi Stencil

– 1D data decomposition
– Iterative compute and halo exchange
– Three regions of interest

[Chart: 2D Jacobi relative speedup vs. the dGPU baseline (bigger is better) across per-node problem sizes (N x N grid, 16 to 1024) for ComP-Net, dGPU, APU, HDN, and CPU; diagram: halo exchange between the top and bottom nodes]

SLIDE 37

▪ 64MB Reduction (strong scaling)

– APU performs better than ComP-Net
– ComP-Net is much more energy efficient

Results


[Charts: 64MB reduction strong scaling — relative speedup (bigger is better) and energy consumption (smaller is better) vs. number of nodes (4 to 36) for ComP-Net, dGPU, APU, HDN, and CPU; diagram: vector sum]

SLIDE 38

Results


[Chart: projected speedup (bigger is better) for CPU, HDN, dGPU, APU, and ComP-Net across AlexNet, AN4 LSTM, CIFAR, MNIST Conv, MNIST Hidden, and their average; workload table as on Slide 17]

SLIDE 39

Outline


▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion

SLIDE 40

▪ CPU creates the network operation off the critical path

– Registers it with the NIC

▪ GPU simply 'triggers' the operation when the data is ready

▪ Provides intra-kernel GPU networking without requiring a CPU thread

GPU Triggered Networking (GPU-TN) Overview


[Timelines: GPU Triggered Networking vs. GPU Host Networking vs. Host-Driven Networking — CPU, GPU, and NIC activity; in GPU-TN the NIC puts immediately when the GPU triggers, with no CPU on the critical path]

  • M. LeBeane, K. Hamidouche, B. Benton, M. Breternitz, S. K. Reinhardt, and L. K. John, "GPU Triggered Networking for Intra-Kernel Communications," in Proc. of the Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2017.

Contribution 3: GPU Triggered Networking (GPU-TN)

SLIDE 41

▪ CPU Creates Triggered Entry

– Trigger Entry consists of:

  • Network Operation
  • Tag
  • Counter
  • Threshold

– Appends entry to Trigger List

▪ GPU Fills Send Buffer

– During kernel execution

GPU-TN Architecture


[Diagram: the CPU appends trigger entries to the NIC's trigger list while the GPU fills the send buffer during kernel execution]

SLIDE 42

▪ GPU initiates Put operation

– GPU Provides Tag

▪ NIC sends message

– Message is triggered when the counter >= the CPU-provided threshold

▪ HW complexity?

– ‘Trigger list’ might not be a list

▪ CPU/GPU race conditions?

– Allocate null entry for unexpected triggers

GPU-TN Architecture


[Diagram: a GPU trigger carrying a tag matches an entry in the NIC's trigger list; the entry's counter increments on each matching trigger, and once counter >= threshold the NIC begins the stored network operation]
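The trigger mechanism can be sketched as follows, with the NIC's trigger list modeled as a dictionary keyed by tag; the names follow the slides, not real hardware, and the threshold/payload values are illustrative:

```python
# Sketch of GPU-TN: the CPU pre-registers a triggered entry (network
# operation, tag, counter, threshold) with the NIC off the critical path;
# the GPU later "triggers" by tag, and the NIC fires the stored operation
# once the counter reaches the threshold.

sent = []  # stands in for messages the NIC has put on the wire

def cpu_register(nic, tag, operation, threshold):
    # Done ahead of time, e.g. at kernel launch; no CPU work at send time.
    nic[tag] = {"op": operation, "counter": 0, "threshold": threshold}

def gpu_trigger(nic, tag):
    entry = nic.get(tag)
    if entry is None:
        return                      # a null entry absorbs unexpected triggers
    entry["counter"] += 1
    if entry["counter"] >= entry["threshold"]:
        sent.append(entry["op"])    # NIC begins the network operation

nic = {}
cpu_register(nic, tag=7, operation="put:halo->node1", threshold=2)
gpu_trigger(nic, tag=7)             # e.g. first half of the data is ready
gpu_trigger(nic, tag=7)             # second half ready; threshold reached
```

The counter/threshold pairing lets many GPU work-groups contribute readiness independently; the message leaves only once all of them have triggered.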

SLIDE 43

▪ CPU: Standard CPU-only systems

– Baseline non-accelerated system

▪ HDN: Host Driven Networking

– No driver interactions on the critical path, but may involve CPU runtime

▪ GDS-Sim: GPUDirect Async

– Preregistration of communication but at kernel boundaries

▪ GHN: GPU Host Networking

– Intra-kernel networking through host threads

▪ GPU-TN: GPU Triggered Networking

– Preregistration of network operations and intra-kernel networking

Experimental Setup


CPU and Memory Configuration:

– Type: 8-wide OOO, x86, 8 cores @ 4GHz
– I,D-Cache: 64KB, 2-way, 2 cycles
– L2-Cache: 2MB, 8-way, 8 cycles
– L3-Cache: 16MB, 16-way, 20 cycles
– DRAM: DDR4, 8 channels, 2133MHz

GPU Configuration:

– Type: AMD GCN3 @ 1.5GHz
– CU Config: 24 CUs with 4 SIMD-16 engines
– Wavefronts: 40 waves per SIMD (64 lanes)
– V-Cache: 32KB, 16-way, 12 cycles, per CU
– K-Cache: 32KB, 8-way, 12 cycles, per 4 CUs
– I-Cache: 64KB, 8-way, 12 cycles, per 4 CUs
– L2-Cache: 1MB, 16-way, 8 banks, 100 cycles

NIC Configuration:

– Link: 100ns latency / 100Gbps
– Topology: Star

SLIDE 44

Results


▪ 2D Jacobi Stencil
▪ 64MB Reduction (strong scaling)
▪ Machine Learning Training Phase

[Charts: 2D Jacobi speedup vs. HDN (bigger is better) across local 2D grid sizes (16 to 1024) for CPU, GDS-Sim, GHN, and GPU-TN; 64MB reduction speedup (bigger is better) vs. node count (2 to 32) for HDN, GDS-Sim, GHN, and GPU-TN]

[Chart: projected machine-learning speedup (bigger is better) for CPU, HDN, GDS-Sim, GHN, and GPU-TN across the workloads]
SLIDE 45

Outline


▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion

SLIDE 46

Summary


▪ Presented 3 enhancements to improve GPU networking

– Extended Task Queuing

  • Direct NIC-to-GPU active messaging

– Command Processor Networking

  • Dynamic communication using on-chip GPU Command Processor

– GPU Triggered Networking

  • Initiate messages without the CPU on the critical path

Conclusion

SLIDE 47


▪ XTQ allows direct access to remote GPU queues

– Teach NICs how to speak with HSA queues

▪ Enables Active Messaging without target CPU involvement

– Improves latency and frees CPU service thread(s)

▪ Improves application performance by ~15%

Extended Task Queuing (XTQ) Summary


SLIDE 48

▪ Uses the built-in CP to support network operations
▪ CP and GPU communicate over the shared L2 cache instead of PCIe
▪ Potentially much faster (lower latency) than other GHN designs
▪ Scales naturally

– Every GPU has multiple CP threads

▪ Improves application performance by ~20% vs. other GHN approaches

Command Processor Networking (ComP-Net) Summary


SLIDE 49

▪ CPU creates the network operation off the critical path

– Registers it with the NIC

▪ GPU simply 'triggers' the operation when the data is ready
▪ Provides intra-kernel GPU networking without requiring a CPU thread
▪ Improves application performance by ~20% vs. GPUDirect Async

GPU Triggered Networking (GPU-TN) Summary


SLIDE 50

▪ This dissertation motivates the need for more independent accelerators

– Cannot funnel everything through a central CPU!
– Concepts are applicable to many types of accelerators and networks

▪ Still much to do!

– Application Redesign Opportunities

  • The applications presented in this dissertation are only scratching the surface
  • Algorithms with dynamic communication could significantly benefit from these techniques

– Leveraging Emerging NIC Technologies for GPUs

  • Mellanox BlueField, collective offload, programmable message handlers
  • How could more intelligent NICs assist with GPU networking?

Towards the Future…..


SLIDE 51

Thank You!


SLIDE 52

▪ CPU controls networking through driver/runtime ▪ Messages sent at kernel boundaries ▪ Research implementations include:

– CUDA-Aware MPI [Kraus '14]
– CUDA-Aware OpenSHMEM [Hamidouche '16]
– GPUDirect RDMA [Mellanox '13]

Host-Driven Networking (HDN)


[Timeline diagram: Host-Driven Networking — the CPU launches a kernel, waits for it to finish, posts the send (Put) to the NIC, then repeats for the next kernel]
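The host-driven control flow can be sketched with the GPU runtime and NIC stubbed out as host functions. The names are hypothetical; in a real stack the wait would be, e.g., cudaDeviceSynchronize(), and the send an MPI or verbs call issued by the CPU.

```c
#include <string.h>

/* Stand-ins for the GPU runtime and the NIC driver. */
static void launch_kernel(char *buf, size_t n) {  /* GPU produces data */
    memset(buf, 'A', n);
}
static void wait_for_kernel(void) { /* block until the kernel retires */ }
static void nic_send(const char *buf, size_t n, char *wire) {
    memcpy(wire, buf, n);            /* CPU posts the send to the NIC */
}

/* Host-driven networking: every message round-trips through the CPU,
 * which serializes kernel completion and the network send. */
void hdn_send_after_kernel(char *buf, size_t n, char *wire)
{
    launch_kernel(buf, n);   /* 1. CPU launches the producing kernel */
    wait_for_kernel();       /* 2. CPU blocks at the kernel boundary */
    nic_send(buf, n, wire);  /* 3. only now can the CPU post a send  */
}
```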

  • J. Kraus, “Introduction to CUDA-aware MPI and Nvidia GPUDirect,” GPU Technology Conference, 2014.
  • K. Hamidouche, A. Venkatesh, A. A. Awan, H. Subramoni, C. H. Chu, and D. K. Panda, “CUDA-Aware OpenSHMEM,” Journal on Parallel Computing, 2016.
  • Mellanox, “Mellanox GPUDirect RDMA User Manual,” http://www.mellanox.com/related-docs/prod_software/Mellanox GPUDirect User Manual v1.2.pdf, 2015.


SLIDE 53

▪ GPU runs the networking stack
▪ Persistent kernels and LDS memory used for network data structures
▪ Research implementations include:

– GPUrdma [Daoud ’16]
– IBV on GPUs [Oden ’14]

GPU Native Networking (GNN)


[Timeline diagram contrasting Host-Driven Networking with GPU Native Networking, where a single persistent kernel issues the send (Put) itself]
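The persistent-kernel idea can be illustrated with a host-side emulation (hypothetical names, not the GPUrdma or IBV-on-GPU APIs): instead of exiting after one task, the long-running code repeatedly polls a command queue in memory and drives the sends itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical command slot in a queue polled by a persistent kernel. */
typedef struct {
    volatile int valid;     /* producer sets, consumer clears */
    const char  *payload;   /* data the "kernel" should send  */
    size_t       len;
} net_cmd_t;

/* Persistent-kernel style service loop: rather than terminating, the
 * GPU code keeps scanning a queue in memory and performs the network
 * operations itself (emulated here with a host loop and wire buffer). */
int gnn_service_pass(net_cmd_t *queue, int nslots, char *wire)
{
    int sent = 0;
    for (int i = 0; i < nslots; ++i) {      /* one pass over the queue */
        if (queue[i].valid) {
            memcpy(wire + sent, queue[i].payload, queue[i].len);
            sent += (int)queue[i].len;
            queue[i].valid = 0;             /* mark slot consumed */
        }
    }
    return sent;                            /* bytes "sent" this pass */
}
```

A real persistent kernel would run this pass in an endless loop, which is precisely why GNN designs occupy GPU compute resources even when the network is idle.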

  • F. Daoud, A. Watad, and M. Silberstein, “GPUrdma: GPU-side Library for High Performance Networking from GPU Kernels,” in Intl. Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2016.
  • L. Oden, H. Fröning, and F. J. Pfreundt, “Infiniband-Verbs on GPU: A Case Study of Controlling an Infiniband Network Device from the GPU,” in Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2014.


SLIDE 54

AMD <=> Nvidia Translator

▪ Work-item = Thread
▪ Wavefront (64 Threads) = Warp (32 Threads)

– Unit of thread dispatch

▪ Work-group = Thread Block

– Unit of Synchronization

▪ Local Data Share (LDS) = Shared Memory

– Work-group scratchpad

▪ Compute Unit (CU) = Streaming Multi-Processor (SM)

– Collection of SIMD engines sharing LDS and L1 cache

GPU Architecture and Terminology

54 07/16/2018

▪ Kernel

– GPU SIMT Function

▪ Command Processor (CP)

– Dispatch engine and scheduler

[Diagram: GPU organization — Command Processor feeding Compute Units (four SIMD engines, Local Data Share, L1 cache), a shared L2 cache, GPU memory, and a CPU core]


SLIDE 55

GPU-TN Kernel Programming Interface


Work-item level:

__kernel void kern1(__global char *trigAddr, const int tagBase,
                    __global void *buffer) {
    // do work
    buffer = ...;
    int id = get_global_id(0);
    *trigAddr = tagBase + id;   // each work-item fires its own message
    // do additional work ...
}

Work-group level:

__kernel void kern2(__global char *trigAddr, const int tagBase,
                    __global void *buffer) {
    // do work
    buffer = ...;
    wg_barrier();
    if (!get_local_id(0)) {     // one work-item per work-group fires
        int id = get_group_id(0);
        *trigAddr = tagBase + id;
    }
    // do additional work ...
}

Kernel level:

__kernel void kern3(__global char *trigAddr, const int tag,
                    __global void *buffer) {
    // do work
    buffer = ...;
    wg_barrier();
    if (!get_local_id(0))       // one trigger for the whole kernel
        *trigAddr = tag;
    // do additional work ...
}


SLIDE 56

▪ gem5 + AMD GCN3 GPU model + custom Portals4 NIC model

– CPU power modeled with McPAT
– Baseline model is a coherent APU

  • dGPU modeled with extra delay on the IO bus, different memory controllers, and coherence probes disabled

▪ Each section uses slightly different parameters

– Discussed before each set of results is presented

Simulation Infrastructure


[Diagram: simulated node — CPU cores with private L1I/L1D and L2 caches sharing an L3; GPU cores with L1D caches and Sequencer Caches (SQC) sharing a GPU L2; a directory and memory controllers in front of memory; and a NIC containing NIC processors, DMA engines, L1I/L1D caches, a CP core, and the network interface]


SLIDE 57

▪ RDMA allows direct access to remote memory without involving the remote CPU

– Heavy lifting is performed on the NIC (offload networking model)
– Generally expressed in terms of remote Put/Get operations

▪ Maps naturally to “one-sided” communication semantics

– Put/Get vs. Send/Receive

Remote Direct Memory Access (RDMA)


[Diagram: RDMA transfer — the initiator's NIC reads local memory through the IO controller (IOC) and writes directly into the target's memory across the network, without involving the target CPU]
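The Put/Get model can be emulated in plain C as operations on a memory window the target exposes once and never touches again. The names are hypothetical, loosely following verbs/OpenSHMEM conventions; a real NIC performs these copies via DMA without touching the target CPU.

```c
#include <stddef.h>
#include <string.h>

/* A registered memory region the target exposes to remote peers. */
typedef struct {
    char  *base;
    size_t len;
} rdma_window_t;

/* One-sided put: the initiator writes remote memory directly; the
 * target CPU is not involved. Returns 0 on success, -1 on error. */
int rdma_put(rdma_window_t *w, size_t off, const void *src, size_t n)
{
    if (off + n > w->len) return -1;   /* bounds check, as the NIC would */
    memcpy(w->base + off, src, n);
    return 0;
}

/* One-sided get: the initiator reads remote memory directly. */
int rdma_get(rdma_window_t *w, size_t off, void *dst, size_t n)
{
    if (off + n > w->len) return -1;
    memcpy(dst, w->base + off, n);
    return 0;
}
```

Note there is no receive call anywhere: the target registers the window and the initiator alone decides when data moves, which is the defining property of one-sided semantics.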


SLIDE 58

ComP-Net Host and GPU API


Host Code:

__host__ void hostInit() {
    // Initialize ComP-Net
    cpnet_handle_t *cpnet_handle;
    cpnet_init(&cpnet_handle, GRID_SZ / WG_SZ);
    // Allocate symmetric heap memory
    char *buf = cpnet_shmalloc(sizeof(char) * GRID_SZ / WG_SZ);
    // Initiator/target launches kernel
    if (cpnet_handle->pe == INITIATOR) {
        hipLaunchKernel(Ping, GRID_SZ, GRID_SZ / WG_SZ, 0, 0,
                        cpnet_handle, buf);
    } else {
        /* Launch target kernel. */
    }
}

GPU Code:

__device__ void Ping(cpnet_handle_t *cpnet_handle, char *wg_buffer) {
    // Extract context from global handle
    __shared__ cpnet_ctx_t cpnet_ctx;
    cpnet_ctx_create(cpnet_handle, cpnet_ctx);
    // Each WG pings target
    cpnet_shmem_char_p(cpnet_ctx, wg_buffer[hipBlockIdx_x], 1, TARGET);
    // Each WG waits for pong from target
    cpnet_shmem_char_wait_until(wg_buffer[hipBlockIdx_x], 1);
    cpnet_ctx_destroy(cpnet_ctx);
}


SLIDE 59

▪ One-sided put latency benchmark

– Initiator launches a dummy kernel, executes the network command, and terminates
– Target polls on the put location

▪ Take-away messages

– GPU-TN outperforms GDS-Sim, which outperforms HDN
– GPU-TN actually overlaps kernel teardown with the network transfer!

Latency Microbenchmark


[Chart: one-sided put latency broken into Kernel Launch, Kernel Execution, Kernel Teardown, Put, and Wait phases for initiator and target under GPU-TN, GDS-Sim, and HDN; smaller is better, with GPU-TN achieving the lowest total]


SLIDE 60

Sweep of payload size for 1 WG and 1 Thread

Microbenchmarks


[Plots: remote Get time observed from the GPU vs. network payload size, remote Get time vs. number of network service threads, and energy consumed by network threads relative to dGPU, each comparing ComP-Net dGPU and APU configurations]

Sweep of threads for 1 byte transfers and 480 WGs


SLIDE 61

▪ Friendlier programming abstractions

– Nicer abstractions in CUDA and OpenCL

  • Dynamic Parallelism, Unified Memory, etc.

– Single-source, kernel-less programming support

  • C++ AMP, OpenMP, AMD HC Language, etc.

▪ Architectural Support

– User-level kernel launch
– Shared virtual address space
– Virtualization
– Multiprocessing
– (Sometimes) coherent caches

Where are GPUs heading?


[Diagrams: Architected Queuing — the CPU (producer) writes command packets into a user-level command queue consumed by the GPU; Shared Virtual Memory — CPU MMU and GPU IOMMU, coordinated by an OS driver, map a shared virtual address space over physical memory]

What about networking support?
