S7281: Device Lending: Dynamic Sharing of GPUs in a PCIe Cluster



SLIDE 1

S7281: Device Lending: Dynamic Sharing of GPUs in a PCIe Cluster

Jonas Markussen PhD student Simula Research Laboratory

SLIDE 2
Outline

  • Motivation
  • PCIe Overview
  • Non-Transparent Bridges
  • Device Lending

SLIDE 3

Distributed applications may need to access and use IO resources that are physically located inside remote hosts

[Diagram: a front-end connected through an interconnect (control + signaling + data) to multiple compute nodes]

SLIDE 4
  • rCUDA
  • CUDA-aware Open MPI
  • Custom GPUDirect RDMA implementation
  • . . .

Software abstractions simplify the use and allocation of resources in a cluster and facilitate development of distributed applications

[Diagram: a front-end presents a logical view of the cluster's resources; control, signaling, and data are handled in software]

SLIDE 5

[Diagram: software stacks compared. Local resource: application → CUDA library + driver → PCIe IO bus. Remote resource using middleware: application → CUDA–middleware integration → interconnect transport (RDMA) → middleware service/daemon → CUDA driver → PCIe IO bus]

SLIDE 6

In PCIe clusters, the same fabric is used both as local IO bus within a single node and as the interconnect between separate nodes

[Diagram: two hosts, each with a CPU and chipset, RAM on a memory bus, and PCIe IO devices, connected by PCIe interconnect host adapters and external PCIe cables to a PCIe interconnect switch]

SLIDE 7

[Diagram: remote resource over the native fabric. Application → CUDA library + driver → local PCIe IO bus → PCIe-based interconnect → remote PCIe IO bus → remote resource]

SLIDE 8

PCIe Overview

SLIDE 9

PCIe is the dominant IO bus technology in computers today, and can also be used as a high-bandwidth low-latency interconnect

PCI-SIG. PCI Express 3.1 Base Specification, 2010. http://www.eetimes.com/document.asp?doc_id=1259778

[Chart: per-direction bandwidth in gigabytes per second (GB/s), 5–35 GB/s axis, for PCIe x4, x8, and x16 links across Gen 2, Gen 3, and Gen 4]

SLIDE 10

Memory reads and writes are handled by PCIe as transactions that are packet-switched through the fabric depending on the address

[Diagram: CPU and chipset with RAM and three PCIe devices, showing the possible transaction routing paths]

  • Upstream
  • Downstream
  • Peer-to-peer (shortest path)
SLIDE 11

IO devices and the CPU share the same physical address space, allowing devices to access system memory and other devices

[Diagram: CPU, RAM, and three PCIe devices sharing one physical address space from 0x00000… to 0xFFFFF…, with per-device MMIO regions and the interrupt vector area at 0xfee00xxx]

  • Memory-mapped IO (MMIO / PIO)
  • Direct Memory Access (DMA)
  • Message-Signaled Interrupts (MSI-X)


SLIDE 12

Non-Transparent Bridges

SLIDE 13

Remote address space can be mapped into local address space by using PCIe Non-Transparent Bridges (NTBs)

[Diagram: two hosts connected back-to-back through PCIe NTB adapters; an NTB address mapping translates local address 0xf000 to remote address 0x9000, making the remote host's RAM accessible from the local host]

SLIDE 14

Using NTBs, each node in the cluster takes part in a shared address space and has its own “window” into the global address space

[Diagram: nodes A, B, and C on an NTB-based interconnect; each node's local address space contains its own RAM and IO devices plus exported address ranges that map into the global address space]

SLIDE 15

Device Lending

SLIDE 16

A remote IO device can be “borrowed” by mapping it into local address space, making it appear locally installed in the system

[Diagram: the owner's physical device at 0xb000 is mapped through the NTB (remote 0xb000 → local 0x2000) and hot-plugged into the borrower as an inserted device at 0x2000; the borrower's device driver attaches to it via PCIe hot-plug]

SLIDE 17

By intercepting DMA API calls to set up IOMMU mappings and inject reverse NTB mappings, the device's physical location becomes completely transparent

[Diagram: the borrower's driver calls dma_map_page(0x9000); the intercepted call sets up an IOMMU mapping from IOV 0x5000 to physical 0x9000 and a reverse NTB mapping (owner 0xf000 → borrower 0x5000), so the device is told to use address 0xf000 for its DMA]

SLIDE 18

[Diagram: borrowed remote resource. Application → CUDA library + driver → local PCIe IO bus → PCIe NTB interconnect → remote PCIe IO bus → borrowed resource]

  • Unmodified local driver (with hot-plug support)
  • Resource appears local to OS, driver, and app
  • Hardware mappings ensure fast data path
  • Works with any PCIe device (even individual SR-IOV functions)

SLIDE 19

[Diagram: side-by-side comparison. Borrowed remote resource: application → CUDA library + driver → PCIe IO bus → PCIe NTB interconnect → remote PCIe IO bus. Remote resource using middleware: application → CUDA–middleware integration → interconnect transport (RDMA) → middleware service/daemon → CUDA driver → PCIe IO bus]

SLIDE 20

[Diagram: a borrowed remote resource and a genuinely local resource present identical software stacks: application → CUDA library + driver → PCIe IO bus; only the borrowed path additionally traverses the PCIe NTB interconnect]

SLIDE 21

[Chart: bandwidth in gigabytes per second (GB/s), 2–14 GB/s axis, versus transfer size from 4 KB to 16 MB, for three series: bandwidthTest (Local), bandwidthTest (Borrowed), and PXH830 DMA (GPUDirect RDMA)]

  • 1. Nvidia CUDA 8.0 Samples bandwidthTest
  • 2. GPUDirect RDMA benchmark using Dolphin NTB DMA

https://github.com/Dolphinics/cuda-rdma-bench

Device-to-host memory transfer

GPU: Quadro P400, Nvidia driver version 375.26 (CentOS 7); CPU: Xeon E5-1630 3.7 GHz; Memory: DDR4 2133 MHz

SLIDE 22

Device pool

Using Device Lending, nodes in a PCIe cluster can share resources through a process of borrowing and giving back devices

[Diagram: a pool of GPUs, SSDs, NICs, and FPGAs spread across NTB-connected hosts, each with its own CPU, chipset, and RAM; tasks A, B, and C are each assigned a different subset of borrowed devices]

SLIDE 23

EIR – Efficient computer-aided diagnosis framework for gastrointestinal examination

[Diagram: two examination rooms sharing GPUs hosted in a server room]

http://mlab.no/blog/2016/12/eir/

SLIDE 24

Moving forward

  • Strategy-based management
  • Fail-over mechanisms
  • VFIO and other API integration (“SmartIO”)
  • Borrowing vGPU functions
SLIDE 25

Thank you!

Selected publications
  • “Device Lending in PCI Express Networks”, ACM NOSSDAV 2016
  • “Efficient Processing of Video in a Multi Auditory Environment using Device Lending of GPUs”, ACM Multimedia Systems 2016 (MMSys’16)
  • “PCIe Device Lending”, University of Oslo, 2015

Device Lending demo and more: visit Dolphin in the exhibition area (booth 625)

Email: jonassm@simula.no