S9709 Dynamic Sharing of GPUs and IO in a PCIe Network



SLIDE 1

S9709 Dynamic Sharing of GPUs and IO in a PCIe Network

Håkon Kvale Stensland

Senior Research Scientist / Associate Professor Simula Research Laboratory / University of Oslo

SLIDE 2
Outline

  • Motivation
  • PCIe Overview
  • Non-Transparent Bridges
  • Dolphin SmartIO
  • Example Application
  • NVMe sharing
  • SmartIO in Virtual Machines

SLIDE 3

Distributed applications may need to access and use IO resources that are physically located inside remote hosts

[Figure: a front-end connected over an interconnect to several compute nodes; control, signaling, and data all cross the front-end interconnect]

SLIDE 4
Software abstractions simplify the use and allocation of resources in a cluster and facilitate development of distributed applications:

  • rCUDA
  • CUDA-aware Open MPI
  • Custom GPUDirect RDMA implementations
  • …

[Figure: the front-end sees a logical view of the cluster's resources; control, signaling, and data are handled in software]

SLIDE 5

[Figure: two software stacks. Local resource: Application → CUDA library + driver → PCIe IO bus. Remote resource using middleware: Application → CUDA–middleware integration → middleware service/daemon → interconnect transport (RDMA) → interconnect → interconnect transport (RDMA) → middleware service → CUDA driver → PCIe IO bus]

SLIDE 6

In PCIe clusters, the same fabric is used both as the local IO bus within a single node and as the interconnect between separate nodes

[Figure: a host whose CPU and chipset connect to RAM over the memory bus and to a PCIe IO device and a PCIe interconnect host adapter over the PCIe bus; an external PCIe cable runs from the host adapter to the PCIe interconnect switch]

SLIDE 7

[Figure: remote resource over the native fabric. Local: Application → CUDA library + driver → PCIe IO bus → local resource. Remote: Application → CUDA library + driver → PCIe IO bus → PCIe-based interconnect → PCIe IO bus → remote resource, with no middleware layer in the stack]

SLIDE 8

PCIe Overview

SLIDE 9

PCI Express (PCIe) is the most widely adopted I/O interconnection technology used in computer systems today

[Chart: bandwidth in gigabytes per second (GB/s), 10-70, for PCIe x4, x8, and x16 links across Gen3 (most common today), Gen4 (current standard), and Gen5 (near future)]
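For reference, these bars follow from the per-lane signaling rates (approximate, per direction): Gen3 runs at 8 GT/s with 128b/130b encoding, about 0.985 GB/s per lane, so x16 gives roughly 15.8 GB/s; Gen4 doubles the rate to 16 GT/s (about 31.5 GB/s for x16) and Gen5 to 32 GT/s (about 63 GB/s for x16).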

SLIDE 10

The PCIe fabric is structured as a tree, where devices form the leaf nodes (endpoints) and the CPU sits at the root

[Figure: PCIe tree with the root port at the top, switches below it, and devices (endpoints) as leaves]

SLIDE 11

The PCIe fabric is structured as a tree, where devices form the leaf nodes (endpoints) and the CPU sits at the root

$ lspci -tv
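An illustrative tree (the devices are hypothetical; real output depends on the machine):

    -[0000:00]-+-00.0  Host bridge
               +-01.0-[01]----00.0  NVIDIA Corporation GPU (endpoint)
               \-1d.0-[02]----00.0  NVMe SSD (endpoint)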

SLIDE 12

Memory reads and writes are handled by PCIe as transactions that are packet-switched through the fabric depending on the address

[Figure: CPU and chipset with RAM and three PCIe devices on the fabric; a transaction is routed:]

  • Upstream
  • Downstream
  • Peer-to-peer (shortest path)
SLIDE 13

IO devices and the CPU share the same physical address space, allowing devices to access system memory and other devices

[Figure: one physical address space from 0x00000… to 0xFFFFF… containing RAM, the memory regions of the IO devices, and the interrupt vectors at 0xfee00xxx]

  • Memory-mapped IO (MMIO / PIO)
  • Direct Memory Access (DMA)
  • Message-Signaled Interrupts (MSI-X)
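A minimal sketch of memory-mapped IO from user space on Linux (the device address 0000:01:00.0 is hypothetical; drivers normally do this from kernel space):

    /* Map BAR0 of a PCIe device and issue a memory-mapped read. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* resource0 exposes the device's first BAR through sysfs */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        volatile uint32_t *bar0 = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (bar0 == MAP_FAILED) { perror("mmap"); return 1; }

        /* this load becomes a PCIe memory-read transaction to the device */
        printf("register 0: 0x%08x\n", bar0[0]);

        munmap((void *)bar0, 4096);
        close(fd);
        return 0;
    }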

SLIDE 14

Non-Transparent Bridges

SLIDE 15

We can interconnect separate PCIe root complexes and translate addresses between them using a non-transparent bridge (NTB)

[Figure: two PCIe root complexes joined by an external PCIe cable through a Non-Transparent Bridge (NTB)]

SLIDE 16

Remote address space can be mapped into the local address space using PCIe Non-Transparent Bridges (NTBs)

[Figure: local and remote hosts, each with CPU, chipset, RAM, and a PCIe NTB adapter; an NTB address mapping translates addresses in the local window (e.g. 0xf000) to addresses on the remote host (e.g. 0x9000), so remote RAM appears in the local address space]
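A toy model of the translation the NTB performs in hardware (illustrative only, reusing the window addresses from the figure):

    #include <stdint.h>
    #include <stdio.h>

    #define WINDOW_BASE 0xf000u  /* where the NTB window appears locally */
    #define REMOTE_BASE 0x9000u  /* remote range the window maps onto    */

    /* an access inside the local window is rebased onto the remote range */
    static uint32_t ntb_translate(uint32_t local_addr)
    {
        return REMOTE_BASE + (local_addr - WINDOW_BASE);
    }

    int main(void)
    {
        printf("local 0x%x -> remote 0x%x\n", 0xf010u, ntb_translate(0xf010u));
        return 0;
    }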

SLIDE 17

Using NTBs, each node in the cluster takes part in a shared address space and has its own "window" into the global address space

[Figure: nodes A, B, and C on an NTB-based interconnect; each node's local address space holds its own RAM and IO devices plus the exported address ranges of the other nodes, which together form the global address space]

SLIDE 18

SmartIO

SLIDE 19

[Figure: borrowed remote resource. Local: Application → CUDA library + driver → PCIe IO bus → PCIe NTB interconnect → remote PCIe IO bus → borrowed resource]

  • Unmodified local driver (with hot-plug support)
  • Resource appears local to OS, driver, and app
  • Hardware mappings ensure a fast data path
  • Works with any PCIe device (even individual SR-IOV functions)
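A small illustration of the hot-plug point (standard Linux sysfs interface, not SmartIO-specific): once the borrowed device is mapped in, a PCI bus rescan is enough for the unmodified local driver to find and bind to it.

    # trigger PCI enumeration so the OS discovers the newly mapped device
    echo 1 > /sys/bus/pci/rescan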

SLIDE 20

[Figure: side-by-side comparison. Borrowed remote resource: Application → CUDA library + driver → PCIe IO bus → PCIe NTB interconnect → remote PCIe IO bus. Remote resource using middleware: Application → CUDA–middleware integration → middleware service/daemon → interconnect transport (RDMA) → interconnect → middleware service → CUDA driver → PCIe IO bus]

SLIDE 21

Device-to-host transfers: comparing a local and a borrowed GPU

[Chart: bandwidth in gigabytes per second (GB/s), 2-14, versus transfer size from 4 KB to 16 MB, for bandwidthTest on a local GPU, bandwidthTest on a borrowed GPU, and PXH830 DMA with GPUDirect RDMA]

SLIDE 22

Using Device Lending, nodes in a PCIe cluster can share resources through a process of borrowing and giving back devices

[Figure: a device pool spanning three hosts, each with CPU + chipset, RAM, and an NTB adapter, contributing GPUs, SSDs, a NIC, and an FPGA; tasks A, B, and C each borrow their own subset of devices, with peer-to-peer transfers between borrowed devices]

SLIDE 23

Using Device Lending, nodes in a PCIe cluster can share resources through a process of borrowing and giving back devices

[Figure: the same device pool, now with a single task A borrowing GPUs and SSDs from all three hosts at once]

SLIDE 24

Example Application: Processing of Medical Videos

P9258 - Efficient Processing of Medical Videos in a Multi-auditory Environment Using GPU Lending

SLIDE 25

Scenario: Real-time computer-aided polyp detection

  • PCIe fiber cables can be up to 100 meters long.
  • Enables "thin clients" to use GPUs in a remote machine room.

SLIDE 26

Flexible sharing of GPU resources between multiple examination rooms

  • The system uses a combination of classic computer vision algorithms and machine learning.
  • Research prototype since 2016.
SLIDE 27

Sharing of NVMe drives

For more details:

S9563 - Efficient Distributed Storage I/O using NVMe and GPU Direct in a PCIe Network


Visit Dolphin Interconnect Solutions in booth 1520

SLIDE 28

Example: NVMe disk operation (simplified)

[Figure: the NVMe driver places commands such as "read N blocks to address 0x9000" into command queues (Queue0…QueueN) in RAM, rings the matching doorbell register in the disk's memory-mapped register space, and the disk DMAs the data into RAM and signals "command complete" through an interrupt vector]
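A condensed sketch of that flow in C (the field layout follows the NVMe specification; queue creation, PRP lists for larger transfers, and completion handling are omitted, and the helper below is not taken from any real driver):

    #include <stdint.h>
    #include <string.h>

    #define QUEUE_DEPTH 16

    struct nvme_cmd {            /* one 64-byte submission queue entry */
        uint8_t  opcode;         /* 0x02 = NVM read */
        uint8_t  flags;
        uint16_t cid;            /* command identifier */
        uint32_t nsid;           /* namespace */
        uint64_t rsvd;
        uint64_t mptr;
        uint64_t prp1;           /* physical address of the data buffer */
        uint64_t prp2;
        uint32_t cdw10;          /* starting LBA, low 32 bits  */
        uint32_t cdw11;          /* starting LBA, high 32 bits */
        uint32_t cdw12;          /* number of blocks - 1       */
        uint32_t cdw13, cdw14, cdw15;
    };

    /* sq: submission queue in RAM; db: the queue's mapped doorbell register */
    void nvme_read(struct nvme_cmd *sq, volatile uint32_t *db, uint16_t tail,
                   uint64_t lba, uint16_t nblocks, uint64_t buf_addr)
    {
        struct nvme_cmd cmd;
        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode = 0x02;                 /* read */
        cmd.cid    = tail;
        cmd.nsid   = 1;
        cmd.prp1   = buf_addr;             /* e.g. 0x9000 */
        cmd.cdw10  = (uint32_t)lba;
        cmd.cdw11  = (uint32_t)(lba >> 32);
        cmd.cdw12  = (uint32_t)nblocks - 1;

        sq[tail] = cmd;                             /* 1: command into queue */
        *db = (uint32_t)((tail + 1) % QUEUE_DEPTH); /* 2: ring the doorbell  */
        /* 3: the disk DMAs the blocks to buf_addr and posts a completion   */
    }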

SLIDE 29

SmartIO-enabled driver: NVMe on GPU

Userspace NVMe driver using GPUDirect: https://github.com/enfiskutensykkel/ssd-gpu-dma

[Figure: a CUDA program drives the NVMe disk directly: the command queues and data buffers live in GPU memory, the disk's doorbell registers are mapped into the program's address space, and data moves peer-to-peer between the disk and the GPU]

    buf    = cudaMalloc(...);            // allocate a buffer in GPU memory
    addr   = nvidia_p2p_get_pages(buf);  // pin GPU pages for DMA (GPUDirect)
    ptr    = mmap(...);                  // map the disk's registers
    devptr = cudaHostRegister(ptr);      // make the mapping usable from CUDA

SLIDE 30

Example: NVMe queues hosted remotely

[Figure: nodes A, B, and C on an NTB-based interconnect; the NVMe disk sits in node C together with its CPU, chipset, and RAM; nodes A and B each host a command queue (Queue0, Queue1) in their own RAM with the disk's queue doorbells mapped through the NTB, and the queue memory is exported so the disk can fetch commands across the NTB]

SLIDE 31

Read latency for reading blocks from an NVMe disk into a GPU: Local versus borrowed disk

SLIDE 32

SmartIO in Virtual Machines

SLIDE 33

SmartIO fully supports lending devices to virtual machines running under Linux KVM, using the Virtual Function IO (VFIO) API
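As a hedged illustration of the standard VFIO pass-through mechanics this builds on (not SmartIO-specific; the device address 0000:01:00.0 is made up):

    # rebind the device from its host driver to vfio-pci
    echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
    echo vfio-pci > /sys/bus/pci/devices/0000:01:00.0/driver_override
    echo 0000:01:00.0 > /sys/bus/pci/drivers_probe

    # hand it to a KVM guest
    qemu-system-x86_64 -enable-kvm -m 4G \
        -device vfio-pci,host=0000:01:00.0 ...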

SLIDE 34

Pass-through allows physical devices to be used by VMs with minimal overhead, but is not as flexible as resource virtualization

[Figure: three columns: the physical view, pass-through of physical resources (minimal virtualization overhead), and virtual or paravirtualized resources (dynamic provisioning & flexible composition)]

SLIDE 35

Passing through a remote NVMe disk to a VM only adds the latency of traversing the NTB, and is comparable to a physical borrower

[Chart: latency comparison, annotated "Traversing the NTB" and "Almost same bandwidth". Setup: Guest OS: Ubuntu 17.04; Host OS: CentOS 7; VM: Qemu 2.17 using KVM; NVMe disk: Intel 900P Optane (PCIe x4 Gen3)]
SLIDE 36

Thank you!

Selected publications

  • "Device Lending in PCI Express Networks", ACM NOSSDAV 2016
  • "Efficient Processing of Videos in a Multi-Auditory Environment using Device Lending of GPUs", ACM Multimedia Systems 2016 (MMSys'16)
  • "Flexible Device Sharing in PCIe Clusters using Device Lending", International Conference on Parallel Processing Companion (ICPP'18 Comp)

SmartIO & Device Lending demo with GPUs, NVMe and more
Visit Dolphin in the exhibition area (booth 1520)
haakonks@simula.no