
DCS-ctrl: A Fast and Flexible Device-Control Mechanism for - PowerPoint PPT Presentation


  1. DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture. Dongup Kwon (1), Jaehyung Ahn (2), Dongju Chae (2), Mohammadamin Ajdari (2), Jaewon Lee (1), Suheon Bae (1), Youngsok Kim (1), and Jangwoo Kim (1). (1) Dept. of Electrical and Computer Engineering, Seoul National University; (2) Dept. of Computer Science and Engineering, POSTECH

  2. Conventional Server Architecture • Primarily relies on “CPU and memory” − CPU-centric computing & in-memory storage − Slow and low-bandwidth peripheral devices [Figure: host- & CPU-centric server; labels: CPU, Storage, Network, Compute]

  4. Device-centric Server Architecture • Exploit “fast & high-bandwidth devices” − Data processing accelerators (e.g., GPU, FPGA) − Storage (e.g., SSD), network (e.g., 100GbE), PCIe Gen3 [Figure: host- & CPU-centric server (CPU, Storage, Network, Compute) next to a device-centric server in which NICs, NVM storage, and GPU/FPGA accelerators connect over PCIe]

  5. Index • Existing approaches • DCS-ctrl: HW-based device-control mechanism • Experimental results • Conclusion

  6. Existing Approaches • Software optimization − Memory mgmt. optimization, user-level device interface − Does not address multi-device tasks • P2P communication − Transfer data directly through PCI Express → device-to-device (D2D) comm. • Device integration − Integrate heterogeneous devices → D2D comm.

  7. Limitations of Existing D2D Comm. • P2P communication − Direct data transfers through PCI Express → D2D comm. − Slow and high-overhead control path [Figure: Dev A, B, and C exchange data directly (data path) while control still passes through the CPU (control path); bar charts of latency (us) and CPU utilization (%) for the software-optimized (SW opt) and P2P configurations, broken down into device control, data copy, kernel, and others]

  8. Limitations of Existing D2D Comm. • Integrated devices − Integrating heterogeneous devices → D2D comm. − Fast data & control transfers − Fixed and inflexible aggregate implementation [Figure: Dev A, B, and C integrated behind shared controllers next to the CPU; adding a new device is marked as costly ($$$)]

  9. Limited Performance Potential

         while (true) {
             rc_recv = recv(fd_sock, buffer, recv_size, 0);
             if (rc_recv <= 0) break;
             processing(&md_ctx, buffer, recv_size);
             rc_write = write(fd_file, buffer, recv_size);
             …
         }

     • “Intermediate” processing between device ops − Prevents applications from using direct D2D comm. − Causes host-side resource contention (CPU and memory) [Figure: data moves between Dev A and Dev B through the CPU]

  10. Design Goals • Performance & scalability − Faster inter-device data & control communication − More scalable with CPU-efficient device operations • Flexibility − Support any type of off-the-shelf device • Applicability − Increase the opportunities to apply D2D comm.

  11. Index • Existing approaches • DCS-ctrl: HW-based device-control mechanism − Key ideas and benefits − Architecture • Experimental results • Conclusion

  12. DCS-ctrl: Key Ideas & Benefits • DCS-ctrl: PCIe P2P + “HDC” − Hardware-based device-control (HDC) mechanism − HDC Engine: “FPGA-based” device orchestrator + “near-device” processing unit • Performance & scalability → HDC, device orchestrator • Flexibility → FPGA-based, low-cost device controller • Applicability → near-device processing unit

  13. HDC Engine: Overview [Figure: two stacks side by side. SW-controlled P2P: the application drives Dev A, Dev B, and Dev C through device drivers A, B, and C. DCS-ctrl (HW): the application hands control to the HDC Engine (FPGA), whose device controllers A, B, and C and NDP unit drive Dev A, Dev B, and Dev C]

  14. DCS-ctrl: Key Ideas & Benefits

         void ssd_to_nic()
         {
             get_from_ssd(&data);
             process_in_HDC(&data);
             write_to_nic(&data);
         }

      • Optimized dev. control ⇒ Faster & scalable communication • Generic dev. interfaces ⇒ Higher flexibility • Near-device processing ⇒ Higher applicability [Figure: three panels: the HDC takes over the control path from the CPU for Dev A, B, and C; a device controller in the HDC attaches a new device; data flows from Dev A through the HDC to Dev B]

  15. Key Idea #1: Device Orchestrator • Perform multi-device tasks w/o CPU involvement − Offload a multi-device task to HDC Engine − Manage all device operations and their dependencies • Fast hardware-level device control

      Scoreboard for a multi-device task (Dev A → NDP → Dev B):

      Dev   R/W    Src          Dst          Aux   State
      A     Read   Addr(DevA)   Addr(NDP-A)  -     Done
      NDP   -      Addr(NDP-A)  Addr(NDP-B)  Hash  Issue
      B     Write  Addr(NDP-B)  Addr(DevB)   -     Ready
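The scoreboard on this slide can be read as a small state machine over a chain of operations. The following is a minimal software sketch of that idea, not the HDC Engine's actual logic: the type and function names (`sb_entry`, `sb_issue_ready`, `sb_complete`, `sb_all_done`) are hypothetical, and it assumes a purely linear dependency chain in which each entry waits for the one before it.

```c
#include <stddef.h>
#include <stdint.h>

/* States named after the "State" column on the slide. */
typedef enum { OP_READY, OP_ISSUE, OP_DONE } op_state;

/* Hypothetical identifiers for the units that execute an operation. */
typedef enum { UNIT_DEV_A, UNIT_NDP, UNIT_DEV_B } unit_id;

typedef struct {
    unit_id  unit;   /* device or NDP unit that runs this operation */
    uint64_t src;    /* source address */
    uint64_t dst;    /* destination address */
    op_state state;
} sb_entry;

/* Issue every READY entry whose predecessor is DONE.
 * Assumption: entry i depends only on entry i-1 (linear chain).
 * Returns the number of entries moved to ISSUE in this pass. */
size_t sb_issue_ready(sb_entry *sb, size_t n)
{
    size_t issued = 0;
    for (size_t i = 0; i < n; i++) {
        int pred_done = (i == 0) || (sb[i - 1].state == OP_DONE);
        if (sb[i].state == OP_READY && pred_done) {
            sb[i].state = OP_ISSUE;
            issued++;
        }
    }
    return issued;
}

/* Record a completion for entry i. Returns 0 on success,
 * -1 if i is out of range or the entry was never issued. */
int sb_complete(sb_entry *sb, size_t n, size_t i)
{
    if (i >= n || sb[i].state != OP_ISSUE)
        return -1;
    sb[i].state = OP_DONE;
    return 0;
}

/* Returns 1 when every entry is DONE, 0 otherwise. */
int sb_all_done(const sb_entry *sb, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (sb[i].state != OP_DONE)
            return 0;
    return 1;
}
```

With three entries (read from Dev A, NDP step, write to Dev B), calling `sb_issue_ready`, then `sb_complete` for entry 0, then `sb_issue_ready` again leaves the states at Done, Issue, Ready, which is the snapshot shown in the slide's table.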

  16. Key Idea #2: Device Controller • Provide interfaces between HDC Engine & devices − Include submission & completion queues − Build standard & vendor-specific device commands • Flexible & low-cost device control [Figure: device controller with a submission queue, a completion queue, and doorbell registers, connected to devices through the PCIe switch]
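To make the queue-and-doorbell interface concrete, here is a minimal sketch of a submission ring in which the producer writes a command and then updates a doorbell with the new tail index. It is an illustration under stated assumptions, not the controller from the talk: the command layout (`dev_cmd`), the depth, and every function name are hypothetical, the doorbell is a pointer to ordinary memory rather than a mapped device register, and the completion queue is left out for brevity.

```c
#include <stdint.h>

#define QUEUE_DEPTH 8u  /* hypothetical ring depth */

/* Hypothetical command format; real devices define their own layout. */
typedef struct {
    uint32_t opcode;
    uint64_t src;
    uint64_t dst;
    uint32_t len;
} dev_cmd;

typedef struct {
    dev_cmd  sq[QUEUE_DEPTH];      /* submission queue ring */
    uint32_t sq_tail;              /* next free slot (producer side) */
    uint32_t sq_head;              /* next slot to consume (device side) */
    volatile uint32_t *doorbell;   /* stand-in for a device doorbell register */
} dev_queue;

void dq_init(dev_queue *q, volatile uint32_t *doorbell)
{
    q->sq_tail = 0;
    q->sq_head = 0;
    q->doorbell = doorbell;
    *q->doorbell = 0;
}

/* Enqueue one command and ring the doorbell with the new tail.
 * Returns 0 on success, -1 if the ring is full (one slot stays empty
 * so that a full ring can be told apart from an empty one). */
int dq_submit(dev_queue *q, const dev_cmd *cmd)
{
    uint32_t next = (q->sq_tail + 1u) % QUEUE_DEPTH;
    if (next == q->sq_head)
        return -1;
    q->sq[q->sq_tail] = *cmd;
    q->sq_tail = next;
    *q->doorbell = q->sq_tail;
    return 0;
}

/* Simulated device side: fetch one command if the doorbell shows new work.
 * Returns 1 if a command was copied to *out, 0 if the queue is empty. */
int dq_device_consume(dev_queue *q, dev_cmd *out)
{
    if (q->sq_head == *q->doorbell)
        return 0;
    *out = q->sq[q->sq_head];
    q->sq_head = (q->sq_head + 1u) % QUEUE_DEPTH;
    return 1;
}
```

In a hardware setting the consume step would be performed by the device itself, which would then post an entry to the completion queue; the simulated `dq_device_consume` only exists so the ring can be exercised in software.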

  17. Key Idea #3: Near-device Processing • Near-device processing units − Execute intermediate processing between device ops − Scale-out storage apps → hash, encryption, compression • Easy to extend to support other devices & applications • Highly applicable to existing applications

      Resource usage per processing unit:

      Processing unit   LUTs    Registers   Applications
      MD5               3.0%    0.69%       Swift
      AES256            3.52%   0.99%       HDFS, Swift
      GZIP              5.36%   2.09%       HDFS
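One way to picture pluggable processing units is a table of functions that each transform a buffer between a device read and a device write. The sketch below is only a software analogy under loud assumptions: the names (`ndp_op`, `ndp_run`) are hypothetical, and the two units are deliberately simple stand-ins (an FNV-1a checksum in place of MD5, a single-byte XOR in place of AES256), not the hardware units listed in the table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation codes for the processing units. */
typedef enum { NDP_HASH = 0, NDP_ENCRYPT = 1, NDP_OP_COUNT } ndp_op;

/* Every unit reads src, writes dst, and uses *aux as an extra
 * input or output (digest out for hash, key in for encrypt). */
typedef void (*ndp_fn)(const uint8_t *src, uint8_t *dst,
                       size_t len, uint32_t *aux);

/* Stand-in for a hash unit: 32-bit FNV-1a, data passed through unchanged. */
static void ndp_hash(const uint8_t *src, uint8_t *dst,
                     size_t len, uint32_t *aux)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= src[i];
        h *= 16777619u;                /* FNV prime */
        dst[i] = src[i];
    }
    *aux = h;
}

/* Stand-in for an encryption unit: XOR with the low byte of *aux.
 * This is NOT secure; it only marks where a cipher would sit. */
static void ndp_xor(const uint8_t *src, uint8_t *dst,
                    size_t len, uint32_t *aux)
{
    uint8_t key = (uint8_t)(*aux & 0xFFu);
    for (size_t i = 0; i < len; i++)
        dst[i] = (uint8_t)(src[i] ^ key);
}

static const ndp_fn ndp_table[NDP_OP_COUNT] = { ndp_hash, ndp_xor };

/* Run one processing step. Returns 0 on success, -1 on a bad argument. */
int ndp_run(int op, const uint8_t *src, uint8_t *dst,
            size_t len, uint32_t *aux)
{
    if (op < 0 || op >= NDP_OP_COUNT || !src || !dst || !aux)
        return -1;
    ndp_table[op](src, dst, len, aux);
    return 0;
}
```

Adding a unit in this sketch means adding one enum value and one table entry, which loosely mirrors the slide's point that the set of units is easy to extend.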

  18. Index • Existing approaches • DCS-ctrl: HW-based device-control mechanism − Key ideas and benefits − Architecture • Experimental results • Conclusion

  19. Baseline Architecture • Software-controlled P2P − P2P comm. + indirect device-control path [Figure: SW side: the application calls one device driver per device; HW side: Dev A, Dev B, and Dev C connected through a PCIe switch]

  20. DCS-ctrl: HW-based Device Control (1/3) • Offload device-control path to HDC Engine − Scoreboard: schedule device operations in a multi-dev task [Figure: the application submits a task A – B – C to the FPGA-based HDC Engine; its scoreboard (columns: Dev, r/w, Src, Dst) tracks operations on Dev A, Dev B, and Dev C behind the PCIe switch]

  21. DCS-ctrl: Low-cost Integration (2/3) • Implement an FPGA-based device controller − Device controller: directly control devices using P2P [Figure: FPGA-based HDC Engine with scoreboard (columns: Dev, r/w, Src, Dst) and a device controller; Dev A, Dev B, Dev C, and a new device attach through the PCIe switch]

  22. DCS-ctrl: Near-device Processing (3/3) • Provide units for intermediate processing − NDP unit: perform data processing on a data path [Figure: FPGA-based HDC Engine with scoreboard (columns: Dev, r/w, Src, Dst), device controller, near-device processing unit, and intermediate buffers; Dev A, Dev B, Dev C, and a new device attach through the PCIe switch]

  23. DCS-ctrl Prototype • HDC Engine implemented on Xilinx Virtex-7 VC707 • Supports off-the-shelf devices − Intel 750 SSDs, Broadcom 10GbE NICs, NVIDIA GPUs

  24. Index • Existing approaches • DCS-ctrl: HW-based device-control mechanism • Experimental results • Conclusion

  25. Reducing Device Control Latency • encrypted_sendfile(): SSD → hash → NIC − SW opt (+P2P): frequent boundary crossings, complex software − DCS-ctrl: fewer crossings, hardware-based device control [Figure: latency (us) broken down into HW, kernel, data copy, and device control for SW opt, SW opt + P2P, and DCS-ctrl; panels: without processing, with processing (AES256); annotated reductions: 42% and 72%]

  26. Reducing CPU Utilization • Swift & HDFS workloads − Offload device control & data transfers to hardware [Figure: normalized CPU utilization (0% to 100%) for SW opt, SW opt + P2P, and DCS-ctrl; panels: Swift, HDFS; legend entries: Kernel (Sender), Kernel (Receiver), Kernel (GET), Kernel (PUT), GPU control, others; annotated values include 52%, 49%, and 50%]

  27. Scalability: More Devices • Swift & HDFS workloads − More CPU-efficient → support more high-performance devices [Figure: CPU utilization (# cores, 0 to 6) versus throughput (Gbps, 0 to 40) for SW opt, SW opt + P2P, and DCS-ctrl; panels: Swift, HDFS]

  28. Conclusion • Fast & flexible device-control mechanism − Hardware-based device-control (HDC) mechanism − FPGA-based standard device controllers − Near-device data processing (NDP) units • Real hardware prototype evaluation − 72% faster inter-device communication − 50% lower CPU utilization for Swift & HDFS

  29. Thank you! We will release our IP & tools soon!
