S7281: Device Lending: Dynamic Sharing of GPUs in a PCIe Cluster

  1. S7281: Device Lending: Dynamic Sharing of GPUs in a PCIe Cluster. Jonas Markussen, PhD student, Simula Research Laboratory.

  2. Outline • Motivation • PCIe Overview • Non-Transparent Bridges • Device Lending

  3. Distributed applications may need to access and use IO resources that are physically located inside remote hosts. (Figure: a front-end connected to several compute nodes over an interconnect carrying control, signaling, and data.)

  4. Software abstractions simplify the use and allocation of resources in a cluster and facilitate development of distributed applications. Control, signaling, and data are handled in software, giving the front-end a logical view of the resources. Examples: • rCUDA • CUDA-aware Open MPI • Custom GPUDirect RDMA implementation

  5. Local resource vs. remote resource using middleware. Local stack: application, CUDA library + driver, local PCIe IO bus. Middleware stack: application, CUDA-middleware integration, middleware service, interconnect transport (RDMA) over the interconnect to a middleware service/daemon, the CUDA driver, and the remote PCIe IO bus.

  6. In PCIe clusters, the same fabric is used both as the local IO bus within a single node and as the interconnect between separate nodes. (Figure: CPU and chipset with RAM on the memory bus; PCIe IO devices and a PCIe interconnect host adapter on the PCIe bus; an external PCIe cable leading to the interconnect switch.)

  7. Remote resource over the native fabric vs. local resource. Both stacks run the application and CUDA library + driver over the local PCIe IO bus; in the remote case, a PCIe-based interconnect extends the path to the remote PCIe IO bus.

  8. PCIe Overview

  9. PCIe is the dominant IO bus technology in computers today, and can also be used as a high-bandwidth, low-latency interconnect. (Chart: bandwidth in gigabytes per second (GB/s) for PCIe x4, x8, and x16 links across Gen 2, Gen 3, and Gen 4.) Sources: PCI-SIG, PCI Express 3.1 Base Specification, 2010; http://www.eetimes.com/document.asp?doc_id=1259778

  10. Memory reads and writes are handled by PCIe as transactions that are packet-switched through the fabric depending on the address: • Upstream • Downstream • Peer-to-peer (shortest path). (Figure: CPU and chipset, RAM, and several PCIe devices on the fabric.)

  11. IO devices and the CPU share the same physical address space, allowing devices to access system memory and other devices: • Memory-mapped IO (MMIO / PIO) • Direct Memory Access (DMA) • Message-Signaled Interrupts (MSI-X). (Figure: address space from 0x00000… to 0xFFFFF… containing RAM, IO device regions, and the interrupt vectors at 0xfee00xxx.)
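
A minimal sketch of what memory-mapped IO looks like from user space on Linux: a device BAR exposed through sysfs is mmap()ed, and ordinary loads and stores become PCIe memory transactions. This is not from the talk; the PCI address below is a placeholder, and access normally requires root.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder device address; substitute the device you want to inspect. */
    const char *bar = "/sys/bus/pci/devices/0000:01:00.0/resource0";
    int fd = open(bar, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Map the first 4 KiB of BAR0; reads and writes go out as MMIO. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    printf("register 0 = 0x%08x\n", regs[0]);  /* MMIO read through the fabric */

    munmap((void *) regs, 4096);
    close(fd);
    return 0;
}
```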

  12. Non-Transparent Bridges

  13. Remote address space can be mapped into local address space by using PCIe Non-Transparent Bridges (NTBs). (Figure: two hosts connected by PCIe NTB adapters; the NTB address mapping translates, for example, local address 0xf000 to remote address 0x9000, so accesses to the local window land in the remote host's RAM.)
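
A conceptual sketch of the translation an NTB window performs, using the example numbers from the slide (local 0xf000 maps to remote 0x9000). This only models the arithmetic; a real NTB window is programmed through the vendor's driver, not with code like this.

```c
#include <stdint.h>
#include <stdio.h>

struct ntb_window {
    uint64_t local_base;   /* where the window appears in the local address space */
    uint64_t remote_base;  /* where accesses land in the remote host */
    uint64_t size;         /* size of the mapped window */
};

/* Translate a local address inside the window to the remote address it reaches. */
static uint64_t ntb_translate(const struct ntb_window *w, uint64_t local_addr)
{
    return w->remote_base + (local_addr - w->local_base);
}

int main(void)
{
    struct ntb_window w = { .local_base = 0xf000, .remote_base = 0x9000, .size = 0x1000 };
    printf("local 0x%llx -> remote 0x%llx\n",
           (unsigned long long) 0xf010,
           (unsigned long long) ntb_translate(&w, 0xf010));
    return 0;
}
```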

  14. Using NTBs, each node in the cluster takes part in a shared address space and has its own “window” into the global address space. (Figure: nodes A, B, and C each export an address range with their local RAM and IO devices into the global address space of the NTB-based interconnect.)

  15. Device Lending

  16. A remote IO device can be “borrowed” by mapping it into local address space, making it appear locally installed in the system. (Figure: borrower and owner hosts connected by NTB adapters; the NTB address mapping translates, for example, local 0xb000 to remote 0x2000, so the owner's physical device shows up as an inserted device announced to the borrower's device driver via PCIe hot-plug.)

  17. By intercepting DMA API calls to set up IOMMU mappings and inject reverse NTB mappings, the device's physical location becomes completely transparent. (Figure: the borrower's driver calls dma_addr = dma_map_page(0x9000); IOMMU and NTB address mappings, e.g. IOV 0xf000 to physical 0x5000, route the device's memory accesses back to the borrower's RAM.)
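
A conceptual sketch of what an intercepted DMA mapping call has to achieve for a borrowed device. The names and numbers are hypothetical, not the real Device Lending implementation: the driver on the borrower maps local RAM for DMA, and the address it gets back must lead the remote device, through the owner's NTB and IOMMU, back to that same RAM.

```c
#include <stdint.h>
#include <stdio.h>

/* Window in the owner's address space that the NTB routes back to borrower RAM. */
struct reverse_ntb_window {
    uint64_t owner_base;     /* bus addresses the device can use */
    uint64_t borrower_base;  /* borrower-physical range they lead back to */
    uint64_t next_offset;    /* simple bump allocator inside the window */
};

/* Hypothetical hook invoked instead of the normal DMA mapping: reserves space
 * in the reverse window and returns the bus address the unmodified driver
 * hands to the device. */
static uint64_t lend_dma_map(struct reverse_ntb_window *w,
                             uint64_t borrower_phys, uint64_t size)
{
    uint64_t dma_addr = w->owner_base + w->next_offset;
    /* A real implementation would program the IOMMU/NTB here so that
     * dma_addr -> NTB -> borrower_phys before returning. */
    (void) borrower_phys;
    w->next_offset += size;
    return dma_addr;
}

int main(void)
{
    struct reverse_ntb_window w = { .owner_base = 0xf000,
                                    .borrower_base = 0x9000,
                                    .next_offset = 0 };
    uint64_t dma_addr = lend_dma_map(&w, 0x9000, 0x1000);
    printf("driver receives dma_addr 0x%llx for borrower page 0x9000\n",
           (unsigned long long) dma_addr);
    return 0;
}
```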

  18. Borrowed remote resource: • The resource appears local to the OS, driver, and application • Unmodified local driver (with hot-plug support) • Hardware mappings ensure a fast data path • Works with any PCIe device (even individual SR-IOV functions). (Figure: application and CUDA library + driver on the local PCIe IO bus, connected through the PCIe NTB interconnect to the remote PCIe IO bus.)

  19. Borrowed remote resource vs. remote resource using middleware. Device Lending stack: application, CUDA library + driver, local PCIe IO bus, PCIe NTB interconnect, remote PCIe IO bus. Middleware stack: application, CUDA-middleware integration, middleware service, interconnect transport (RDMA) over the interconnect to a middleware service/daemon, the CUDA driver, and the remote PCIe IO bus.

  20. Borrowed remote resource vs. local resource. Both stacks run the application and CUDA library + driver over the local PCIe IO bus; for the borrowed resource, the PCIe NTB interconnect extends the path to the remote PCIe IO bus.

  21. Device-to-host memory transfer. (Chart: bandwidth in gigabytes per second (GB/s) against transfer sizes from 4 KB to 16 MB for bandwidthTest (Local), bandwidthTest (Borrowed), and PXH830 DMA (GPUDirect RDMA).) Setup: 1. Nvidia CUDA 8.0 Samples bandwidthTest; 2. GPUDirect RDMA benchmark using Dolphin NTB DMA (https://github.com/Dolphinics/cuda-rdma-bench). GPU: Quadro P400, Nvidia driver version 375.26 (CentOS 7); CPU: Xeon E5-1630 3.7 GHz; Memory: DDR4 2133 MHz.
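
A minimal sketch in the spirit of the CUDA bandwidthTest sample (not the sample itself): time repeated device-to-host copies of one transfer size from pinned host memory and report GB/s. Compile with nvcc; warm-up runs and error handling are omitted for brevity, and the transfer size and repetition count are arbitrary choices.

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t size = 16u << 20;   /* 16 MB transfers */
    const int reps = 100;

    void *dev_buf, *host_buf;
    cudaMalloc(&dev_buf, size);
    cudaMallocHost(&host_buf, size); /* pinned host memory, as bandwidthTest uses */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(host_buf, dev_buf, size, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = ((double) size * reps / 1e9) / (ms / 1e3);
    printf("device-to-host: %.2f GB/s (%zu-byte transfers)\n", gbps, size);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}
```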

  22. Using Device Lending, nodes in a PCIe cluster can share resources through a process of borrowing and giving back devices. (Figure: three nodes running Tasks A, B, and C, each with CPU + chipset, RAM, and an NTB adapter, borrowing GPUs, SSDs, NICs, and FPGAs from a shared device pool.)

  23. Use case: EIR, an efficient computer-aided diagnosis framework for gastrointestinal examinations (http://mlab.no/blog/2016/12/eir/). (Figure: GPUs in a server room shared with examination rooms.)

  24. Moving forward • Strategy-based management • Fail-over mechanisms • VFIO and other API integration (“SmartIO”) • Borrowing vGPU functions

  25. Thank you! Email: jonassm@simula.no. Selected publications: “Device Lending in PCI Express Networks”, ACM NOSSDAV 2016; “Efficient Processing of Video in a Multi Auditory Environment using Device Lending of GPUs”, ACM Multimedia Systems 2016 (MMSys’16); “PCIe Device Lending”, University of Oslo, 2015. Device Lending demo and more: visit Dolphin in the exhibition area (booth 625).
