S9709 Dynamic Sharing of GPUs and IO in a PCIe Network
Håkon Kvale Stensland
Senior Research Scientist / Associate Professor Simula Research Laboratory / University of Oslo
Outline: Motivation, PCIe Overview
[Figure: A front-end and several compute nodes connected by an interconnect that carries control, signaling, and data]
[Figure: Logical view of resources as seen from the front-end; control, signaling, and data are handled in software]
Local resource vs. remote resource using middleware
Local: Application, CUDA library + driver, PCIe IO bus
Remote: Application, CUDA-middleware integration, interconnect transport (RDMA), interconnect, interconnect transport (RDMA), middleware service/daemon, CUDA driver, PCIe IO bus
[Figure: Hardware components: CPU and chipset, RAM, memory bus, PCIe bus, PCIe IO device, PCIe interconnect host adapter, external PCIe cable, PCIe interconnect switch]
Local resource vs. remote resource over native fabric
Local: Application, CUDA library + driver, PCIe IO bus
Remote: Application, CUDA library + driver, PCIe IO bus, PCIe-based interconnect, PCIe IO bus
[Chart: PCIe bandwidth in gigabytes per second (GB/s) for x4, x8, and x16 links across Gen3, Gen4, and Gen5; axis from 10 to 70 GB/s]
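The theoretical numbers behind a chart like this can be approximated from the per-lane transfer rate and the line encoding. The following is a minimal sketch, assuming the standard rates of 8, 16, and 32 GT/s for Gen3, Gen4, and Gen5 with 128b/130b encoding; the function name is illustrative, and packet overhead is ignored, so measured throughput will be lower.

```python
# Per-lane raw transfer rates in gigatransfers per second (GT/s).
# Gen3 through Gen5 all use 128b/130b line encoding.
TRANSFER_RATE_GT_S = {"Gen3": 8.0, "Gen4": 16.0, "Gen5": 32.0}
ENCODING_EFFICIENCY = 128 / 130


def link_bandwidth_gb_s(generation: str, lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe link in GB/s.

    Ignores packet (TLP/DLLP) overhead, so real transfers are slower.
    """
    bits_per_second = (
        TRANSFER_RATE_GT_S[generation] * 1e9 * ENCODING_EFFICIENCY * lanes
    )
    return bits_per_second / 8 / 1e9


if __name__ == "__main__":
    for gen in TRANSFER_RATE_GT_S:
        row = ", ".join(
            f"x{n}: {link_bandwidth_gb_s(gen, n):5.2f} GB/s" for n in (4, 8, 16)
        )
        print(f"{gen}: {row}")
```

Under these assumptions a Gen3 x16 link comes out just under 16 GB/s and a Gen5 x16 link at roughly 63 GB/s per direction.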
PCIe topology: devices (endpoints), switches, and root ports
$ lspci -tv
[Figure: A host with CPU and chipset, RAM, and three PCIe devices. The host address space (0x00000… to 0xFFFFF…) contains RAM, the memory regions of the IO devices, and the interrupt vectors at 0xfee00xxx]
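Because RAM, device memory, and interrupt vectors all live in one address space, a memory access is routed purely by its address. The sketch below illustrates that idea with a lookup table. The region names, base addresses, and sizes are hypothetical (real layouts are assigned by firmware and the OS); only the 0xfee00xxx interrupt range is taken from the slide.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    name: str
    base: int
    size: int

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


# Hypothetical layout for illustration only.
ADDRESS_MAP = [
    Region("RAM", 0x0000_0000, 0x8000_0000),
    Region("IO device 0", 0xD000_0000, 0x0100_0000),
    Region("IO device 1", 0xD100_0000, 0x0010_0000),
    Region("Interrupt vectors", 0xFEE0_0000, 0x0010_0000),
]


def resolve(addr: int) -> Optional[str]:
    """Return the name of the region that claims addr, or None if unmapped."""
    for region in ADDRESS_MAP:
        if region.contains(addr):
            return region.name
    return None


if __name__ == "__main__":
    for addr in (0x1000, 0xD000_0010, 0xFEE0_0123, 0x9000_0000):
        print(f"{addr:#012x} -> {resolve(addr)}")
```

A write to an address inside a device region reaches the device rather than RAM, which is the property that address-mapped sharing relies on.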
External PCIe cable and Non-Transparent Bridge (NTB)
[Figure: A local host and a remote host, each with CPU and chipset, RAM, and a PCIe NTB adapter, connected by an external PCIe cable. The NTB address mapping translates a range in the local address space (0xf000) to RAM on the remote host (0x9000)]
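The NTB address mapping amounts to a base-plus-offset translation between two address spaces. Here is a minimal sketch of that arithmetic, using the 0xf000 and 0x9000 addresses from the figure. The class name and the 0x1000 window size are assumptions made for illustration; in hardware the translation is done by the NTB adapter, not by software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NtbWindow:
    local_base: int   # start of the window in the local address space
    remote_base: int  # address the window maps to on the remote host
    size: int         # window length in bytes (assumed value below)

    def translate(self, local_addr: int) -> int:
        """Map a local address inside the window to the remote address."""
        offset = local_addr - self.local_base
        if not 0 <= offset < self.size:
            raise ValueError(f"{local_addr:#x} is outside the NTB window")
        return self.remote_base + offset


window = NtbWindow(local_base=0xF000, remote_base=0x9000, size=0x1000)
print(hex(window.translate(0xF010)))  # 0x9010
```

Any read or write the local host issues inside the window lands at the corresponding offset in remote memory, without software on the data path.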
[Figure: NTB-based interconnect forming a global address space across hosts A, B, and C. Each host's address space contains its local RAM and local IO devices plus mapped ranges for the other hosts' address spaces; each host exports an address range to the others]
Borrowed remote resource
Application, CUDA library + driver, PCIe IO bus, PCIe NTB interconnect, PCIe IO bus
Unmodified local driver (with hot-plug support)
Resource appears local to OS, driver, and app
Hardware mappings ensure fast data path
Works with any PCIe device (even individual SR-IOV functions)
Borrowed remote resource vs. remote resource using middleware
Borrowed: Application, CUDA library + driver, PCIe IO bus, PCIe NTB interconnect, PCIe IO bus
Middleware: Application, CUDA-middleware integration, interconnect transport (RDMA), interconnect, interconnect transport (RDMA), middleware service/daemon, CUDA driver, PCIe IO bus
[Chart: Bandwidth in gigabytes per second (GB/s, axis from 2 to 14) versus transfer size (4 KB to 16 MB) for bandwidthTest (Local), bandwidthTest (Borrowed), and PXH830 DMA (GPUDirect RDMA)]
Device pool
[Figure: Three hosts (CPU + chipset, RAM, NTB) contribute their GPUs, SSDs, NICs, and FPGAs to a shared device pool. Tasks A, B, and C each use a different mix of pooled devices]
Device pool
[Figure: The same device pool with Task A running on all three hosts, each using GPUs and SSDs from the pool]
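The bookkeeping behind a device pool can be pictured as a table of free devices that hosts borrow from and return to. The sketch below is a toy model only: the class, method names, and device identifiers are hypothetical and do not reflect the actual device lending implementation, which also has to set up NTB mappings and hot-plug the device into the borrower.

```python
from collections import defaultdict


class DevicePool:
    """Toy model of a cluster-wide pool of lendable PCIe devices."""

    def __init__(self, devices):
        # devices: iterable of (device_id, kind, owner_host) tuples
        self._info = {dev_id: (kind, owner) for dev_id, kind, owner in devices}
        self.free = dict(self._info)
        self.borrowed = defaultdict(list)  # borrower host -> device ids

    def borrow(self, borrower, kind, count=1):
        """Hand out `count` free devices of `kind`, regardless of owner."""
        candidates = [d for d, (k, _owner) in self.free.items() if k == kind]
        if len(candidates) < count:
            raise RuntimeError(
                f"only {len(candidates)} free {kind} device(s) in the pool"
            )
        taken = candidates[:count]
        for dev_id in taken:
            del self.free[dev_id]
            self.borrowed[borrower].append(dev_id)
        return taken

    def give_back(self, borrower):
        """Return everything the borrower holds to the pool."""
        for dev_id in self.borrowed.pop(borrower, []):
            self.free[dev_id] = self._info[dev_id]


pool = DevicePool([
    ("gpu0", "GPU", "A"), ("gpu1", "GPU", "B"), ("gpu2", "GPU", "C"),
    ("ssd0", "SSD", "A"), ("ssd1", "SSD", "B"), ("nic0", "NIC", "C"),
])
task_gpus = pool.borrow("A", "GPU", count=3)  # host A uses every GPU
print(task_gpus)
pool.give_back("A")
```

In this model a single host can temporarily hold every GPU in the cluster and release them when its task finishes, which is the kind of composition the figures depict.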
P9258 - Efficient Processing of Medical Videos in a Multi-auditory Environment Using GPU Lending
machine room
computer vision algorithms and machine learning.
S9563 - Efficient Distributed Storage I/O using NVMe and GPU Direct in a PCIe Network
Visit Dolphin Interconnect Solutions in booth 1520
[Figure: NVMe operation. The host address space (0x00000… to 0xFFFFF…) contains RAM, the interrupt vectors, and the disk memory (registers) holding the Queue0 to QueueN doorbells. Command queues Queue0 to QueueN and the data buffers reside in RAM. The NVMe driver issues "Read N blocks to address 0x9000" to the NVMe disk, and the disk reports "Command complete"]
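The command flow in this figure (enqueue a command in RAM, ring the queue's doorbell, let the disk write the data into memory, receive a completion) can be mimicked with a small simulation. This is a simplified sketch, not real NVMe: the class and function names are made up, the 512-byte block size is an assumption, and actual NVMe uses fixed-size circular submission and completion queues with interrupts.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 512  # bytes per block; an assumption for this toy model


@dataclass
class ToyNvmeDisk:
    blocks: dict      # block number -> bytes of length BLOCK_SIZE
    ram: bytearray    # host memory the disk can write into (DMA)
    queue: list = field(default_factory=list)        # command queue in RAM
    head: int = 0                                    # next command to fetch
    completions: list = field(default_factory=list)  # finished command slots

    def ring_doorbell(self, new_tail: int) -> None:
        """Driver writes the new queue tail; the disk processes new commands."""
        while self.head < new_tail:
            start, count, addr = self.queue[self.head]
            for i in range(count):
                data = self.blocks.get(start + i, bytes(BLOCK_SIZE))
                dst = addr + i * BLOCK_SIZE
                self.ram[dst:dst + BLOCK_SIZE] = data  # disk writes to RAM
            self.completions.append(self.head)          # "command complete"
            self.head += 1


def read_blocks(disk: ToyNvmeDisk, start: int, count: int, addr: int) -> int:
    """Driver side: enqueue a read, ring the doorbell, return the completion."""
    disk.queue.append((start, count, addr))
    disk.ring_doorbell(len(disk.queue))
    return disk.completions[-1]


disk = ToyNvmeDisk(blocks={7: b"\xab" * BLOCK_SIZE}, ram=bytearray(0x10000))
slot = read_blocks(disk, start=7, count=2, addr=0x9000)
print(slot, disk.ram[0x9000:0x9004].hex())
```

The only things the driver touches are memory addresses: the queue in RAM and the doorbell register. That is why both can be remapped to another host or a GPU.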
Userspace NVMe driver using GPUDirect
https://github.com/enfiskutensykkel/ssd-gpu-dma
[Figure: Host address space (0x00000… to 0xFFFFF…) with RAM, CPU and chipset, disk memory (registers) holding the Queue0 to QueueN doorbells, and GPU memory. A CUDA program on the GPU accesses the NVMe disk through command queues Queue0 to QueueN and mapped doorbells]
buf = cudaMalloc(...);
addr = nvidia_p2p_get_pages(buf);
ptr = mmap(...);
devptr = cudaHostRegister(ptr);
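[Figure: NTB-based interconnect with hosts A, B, and C sharing one NVMe disk. The disk sits in C's address space, which exports an address range containing the queue doorbells. A's address space holds command Queue0 in RAM and a doorbell mapped through the NTB; B's address space holds command Queue1 in RAM and a doorbell mapped through the NTB. The queues are exported to the disk through the NTB]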
Physical View
Pass-through of Physical Resources
Virtual or Paravirtualized Resources
Minimal Virtualization Overhead
Dynamic Provisioning & Flexible Composition
Traversing the NTB
Guest OS: Ubuntu 17.04
Host OS: CentOS 7
VM: Qemu 2.17 using KVM
NVMe Disk: Intel 900P Optane (PCIe x4 Gen3)
Almost same bandwidth
Selected publications
“Device Lending in PCI Express Networks”, ACM NOSSDAV 2016
“Efficient Processing of Video in a Multi Auditory Environment using Device Lending of GPUs”, ACM Multimedia Systems 2016 (MMSys’16)
“Flexible Device Sharing in PCIe Clusters using Device Lending”, International Conference on Parallel Processing Companion (ICPP'18 Comp)