  1. DCS: A Fast and Scalable Device-Centric Server Architecture
  Jaehyung Ahn, Dongup Kwon, Youngsok Kim, Mohammadamin Ajdari, Jaewon Lee, and Jangwoo Kim
  {jh2ekd, nankdu7, elixir, majdari, spiegel0, jangwoo}@postech.ac.kr
  High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)

  2. Inefficient device utilization
  • Host-centric device management
    − Host manages every device invocation
    − Frequent host-involved layer crossings → increased latency and management cost
  [Figure: host-centric architecture. The application in userspace reaches each device (A, B, C) through a separate kernel stack and driver; both the datapath and the metadata/command path cross the host.]

  3. Latency: High software overhead
  • Single sendfile: storage read & NIC send
    − The faster the devices, the larger the software share of latency
  [Figure: normalized latency decomposition into software, storage, and NIC time. Software overhead: 7% (HDD + 10Gb NIC), 50% (NVMe + 10Gb NIC), 77% (PCM + 10Gb NIC), 82% (PCM + 100Gb NIC).]
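
The baseline being measured here is the stock sendfile(2) path, which already avoids a userspace data copy but still drives every block through the host's filesystem and network stacks. A minimal sketch of that host-centric baseline for concreteness (sendfile(2) is the standard Linux API; the helper name is ours):

```c
/* Host-centric baseline: kernel-mediated file-to-socket transfer via
 * sendfile(2). Every invocation crosses the user/kernel boundary and
 * exercises the full filesystem and network stacks on the host CPU. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

ssize_t send_file_over_socket(int sockfd, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return -1; }

    off_t off = 0;
    ssize_t total = 0;
    while (off < st.st_size) {
        /* The host CPU orchestrates every step: storage read, page cache,
         * TCP segmentation, NIC doorbell. This is the software time that
         * grows relative to device time as the devices get faster. */
        ssize_t n = sendfile(sockfd, fd, &off, st.st_size - off);
        if (n <= 0) { perror("sendfile"); break; }
        total += n;
    }
    close(fd);
    return total;
}
```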

  4. Cost: High host resource demand
  • Sendfile under host resource (CPU) contention
    − The faster the devices, the more host resources they consume
  [Figure: sendfile bandwidth vs. CPU usage, measured with an NVMe SSD and a 10Gb NIC. No contention: 100% bandwidth at 14% CPU usage. High contention: 34% bandwidth at 6% CPU usage.]

  5. Index
  • Inefficient device utilization
  • Limitations of existing solutions
  • DCS: Device-Centric Server architecture
  • Experimental results
  • Conclusion

  6. Limitations of existing work
  • Single-device optimization
    − Does not address inter-device communication
    − e.g., Moneta (SSD), DCA (NIC), mTCP (NIC), Arrakis (generic)
  • Inter-device communication
    − Not applicable to unsupported devices
    − e.g., GPUNet (GPU-NIC), GPUDirect RDMA (GPU-InfiniBand)
  • Integrating devices
    − Custom devices and protocols, limited applicability
    − e.g., QuickSAN (SSD+NIC), BlueDBM (accelerator with SSD+NIC)
  → Need for fast, scalable, and generic inter-device communication

  7. Index
  • Inefficient device utilization
  • Limitations of existing solutions
  • DCS: Device-Centric Server architecture
    − Key idea and benefits
    − Design considerations
  • Experimental results
  • Conclusion

  8. DCS: Key idea
  • Minimize host involvement & data movement
  [Figure: DCS architecture. The application calls the DCS library in userspace; the DCS driver sits in the kernel beside the existing device drivers and kernel stacks; a DCS engine in hardware coordinates Devices A, B, and C, so the datapath and the metadata/command path stay among the devices.]
  Single command → optimized multi-device invocation
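
One way to picture "single command → multi-device invocation" is a descriptor that captures the whole transfer at once. The sketch below is hypothetical: dcs_xfer and dcs_sendfile are names invented here for illustration; the slides only say that the DCS library turns one application call into an optimized chain of device commands.

```c
/* Hypothetical illustration of the single-command idea. dcs_sendfile()
 * is NOT a real API; it stands in for a DCS library entry point that
 * hands an entire storage-read + NIC-send chain to the DCS engine. */
#include <stddef.h>
#include <sys/types.h>

/* One descriptor for the whole operation. The DCS driver would translate
 * it into device-level commands (storage read -> NIC send) and post it
 * to the DCS engine over PCIe, with no per-block host involvement. */
struct dcs_xfer {
    int    src_fd;   /* file on the NVMe SSD            */
    int    dst_fd;   /* connected TCP socket            */
    off_t  offset;   /* file offset to start from       */
    size_t length;   /* bytes to move device-to-device  */
};

/* Hypothetical library call: one invocation, multiple devices. */
ssize_t dcs_sendfile(const struct dcs_xfer *xfer);
```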

  9. DCS: Benefits
  • Better device performance
    − Faster data delivery, lower total operation latency
  • Better host performance/efficiency
    − Resources and time spent on device management become available to other applications
  • High applicability
    − Relies on existing drivers, kernel support, and interfaces
    − Easy to extend to cover more devices

  10. Index
  • Inefficient device utilization
  • Limitations of existing solutions
  • DCS: Device-Centric Server architecture
    − Key idea and benefits
    − Design considerations → discussed through implementation details
  • Experimental results
  • Conclusion

  11. DCS: Architecture overview
  [Figure: DCS added alongside an existing system. Userspace: the application calls the DCS library (sendfile(), encrypted sendfile()). Kernel: the DCS driver, containing a kernel communicator and a command generator, sits next to the existing drivers and kernel stacks. Hardware: the DCS engine (on a NetFPGA NIC) contains a command interpreter, per-device command queues, and a per-device manager, and reaches the NVMe SSD, GPU, and NetFPGA NIC through a PCIe switch.]
  Fully compatible with existing systems
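
The compatibility claim rests on intercepting calls that applications already make. A standard way to hook a libc-level sendfile() without recompiling applications is LD_PRELOAD interposition; the sketch below shows that general pattern under our own assumptions (dcs_try_sendfile is a placeholder), not the actual DCS hooking mechanism, which the slides do not detail.

```c
/* Hypothetical sendfile() interposer, LD_PRELOAD style: build as a shared
 * object (gcc -shared -fPIC -ldl) and inject with LD_PRELOAD. If the fds
 * map to DCS-managed devices, hand the call to the DCS path; otherwise
 * fall back to the real libc sendfile(). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/sendfile.h>
#include <sys/types.h>

/* Placeholder: returns -1 if the transfer cannot be handled by DCS. */
extern ssize_t dcs_try_sendfile(int out_fd, int in_fd,
                                off_t *offset, size_t count);

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
{
    ssize_t n = dcs_try_sendfile(out_fd, in_fd, offset, count);
    if (n >= 0)
        return n;  /* handled device-to-device by the DCS engine */

    /* Fall back to the original kernel-mediated path. */
    ssize_t (*real)(int, int, off_t *, size_t) =
        (ssize_t (*)(int, int, off_t *, size_t))dlsym(RTLD_NEXT, "sendfile");
    return real(out_fd, in_fd, offset, count);
}
```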

  12. Communicating with storage
  ❶ Application issues a hook / API call into the DCS library (userspace)
  ❷ DCS library hands the file descriptor to the DCS driver (kernel)
  ❸ DCS driver queries the (virtual) filesystem for the block address (in the device) or the buffer address (if cached)
  ❹ DCS driver passes the source/target device addresses to the DCS engine (hardware)
  ❺ DCS engine moves data from the source device (NVMe SSD, or the VFS cache if the data is cached) to the target device
  → Data consistency guaranteed
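
Step ❸, resolving a file to its on-device block address, has a userspace analogue in Linux's FIEMAP ioctl, which maps file offsets to physical extents. The sketch below only illustrates that mapping; the DCS driver itself resolves addresses inside the kernel through the VFS.

```c
/* Print the physical (on-device) address backing the start of a file:
 * the kind of information a device-centric engine needs in order to
 * issue an NVMe read without walking the filesystem per request. */
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int print_first_extent(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    /* Room for the header plus one extent record. */
    struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
    if (!fm) { close(fd); return -1; }
    fm->fm_start = 0;
    fm->fm_length = ~0ULL;            /* whole file            */
    fm->fm_flags = FIEMAP_FLAG_SYNC;  /* sync data first       */
    fm->fm_extent_count = 1;          /* fetch just one extent */

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
        perror("FIEMAP"); free(fm); close(fd); return -1;
    }
    if (fm->fm_mapped_extents > 0)
        printf("logical 0x%llx -> physical 0x%llx (%llu bytes)\n",
               (unsigned long long)fm->fm_extents[0].fe_logical,
               (unsigned long long)fm->fm_extents[0].fe_physical,
               (unsigned long long)fm->fm_extents[0].fe_length);
    free(fm);
    close(fd);
    return 0;
}
```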

  13. Communicating with the network interface
  ❶ Application issues a hook / API call into the DCS library (userspace)
  ❷ DCS library hands the socket descriptor to the DCS driver (kernel)
  ❸ DCS driver obtains the connection information from the kernel network stack
  ❹ DCS engine passes the data buffer to the NetFPGA NIC (hardware)
  ❺ HW PacketGen on the NIC generates and sends the packets
  → HW-assisted packet generation
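
The "connection information" of step ❸ includes at minimum the TCP 4-tuple; the rest of the state a hardware packet generator needs (sequence numbers, window) lives in the kernel and is what the DCS driver would export. From userspace, the 4-tuple itself is readable with standard socket calls, as this illustrative sketch shows:

```c
/* Read the 4-tuple of an established IPv4 TCP connection: source and
 * destination addresses/ports, i.e. the header fields a HW packet
 * generator fills in for every outgoing segment. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int print_conn_tuple(int sockfd)
{
    struct sockaddr_in local, peer;
    socklen_t llen = sizeof(local), plen = sizeof(peer);

    if (getsockname(sockfd, (struct sockaddr *)&local, &llen) < 0 ||
        getpeername(sockfd, (struct sockaddr *)&peer, &plen) < 0) {
        perror("sockname/peername");
        return -1;
    }
    char lip[INET_ADDRSTRLEN], pip[INET_ADDRSTRLEN];
    inet_ntop(AF_INET, &local.sin_addr, lip, sizeof(lip));
    inet_ntop(AF_INET, &peer.sin_addr, pip, sizeof(pip));
    printf("%s:%u -> %s:%u\n", lip, ntohs(local.sin_port),
                               pip, ntohs(peer.sin_port));
    return 0;
}
```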

  14. Communicating with an accelerator
  ❶ Memory allocation: the application allocates GPU memory through the GPU user library (userspace)
  ❷ Application calls the DCS library
  ❸ DCS library hands the request to the DCS driver (kernel)
  ❹ DCS driver gets the memory mapping from the GPU kernel driver
  ❺ DCS engine performs the DMA / NVMe transfer from the source device into GPU memory (hardware)
  ❻ Kernel invocation (kernel launch)
  ❼ GPU processes the data
  → Direct data loading without memcpy
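
An application-side view of this flow might look like the sketch below. The CUDA runtime calls are real; dcs_load_to_gpu and aes_encrypt_kernel_launch are placeholders invented here, the former standing in for the DCS library call that makes the engine DMA file data straight into GPU memory instead of bouncing it through host RAM with read() plus cudaMemcpy().

```c
/* Hypothetical application flow for the accelerator path (encryption
 * workload from the evaluation). Compile with nvcc or link against the
 * CUDA runtime; the two extern functions are illustrative placeholders. */
#include <cuda_runtime.h>
#include <stdio.h>

extern int  dcs_load_to_gpu(int file_fd, void *gpu_dst, size_t len); /* placeholder */
extern void aes_encrypt_kernel_launch(void *buf, size_t len);        /* placeholder */

int encrypt_file_on_gpu(int file_fd, size_t len)
{
    void *dev_buf = NULL;
    if (cudaMalloc(&dev_buf, len) != cudaSuccess) {  /* step 1: allocate */
        fprintf(stderr, "cudaMalloc failed\n");
        return -1;
    }
    /* Steps 2-5: the DCS driver resolves the GPU memory mapping via the
     * GPU kernel driver, then the engine transfers NVMe data into
     * dev_buf directly, with no host-side memcpy. */
    if (dcs_load_to_gpu(file_fd, dev_buf, len) < 0) {
        cudaFree(dev_buf);
        return -1;
    }
    /* Steps 6-7: launch the GPU kernel on the already-resident data. */
    aes_encrypt_kernel_launch(dev_buf, len);
    cudaDeviceSynchronize();
    cudaFree(dev_buf);
    return 0;
}
```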

  15. Index
  • Inefficient device utilization
  • Limitations of existing solutions
  • DCS: Device-Centric Server architecture
  • Experimental results
  • Conclusion

  16. Experimental setup
  • Host: power-efficient system
    − Core 2 Duo @ 2.00GHz, 2MB LLC
    − 2GB DDR2 DRAM
  • Devices: off-the-shelf emerging devices
    − Storage: Samsung XS1715 NVMe SSD
    − NIC: NetFPGA with Xilinx Virtex-5 (up to 1Gb bandwidth)
    − Accelerator: NVIDIA Tesla K20m
    − Device interconnect: Cyclone Microsystems PCIe2-2707 (Gen 2 switch, 5 slots, up to 80Gbps)

  17. DCS prototype implementation
  • Our 4-node DCS prototype
    − Can support many devices per host
  [Photo: prototype with NetFPGA NIC, GPU, NVMe SSD, and PCIe switch]

  18. Reducing device utilization latency
  • Single sendfile: storage read & NIC send
    − Host-centric: per-device layer crossings
    − DCS: batched management in the hardware layer
  [Figure: latency (µs) split into SW and HW time. Host-centric: 79µs SW + 75µs HW; DCS: 39µs SW + 75µs HW.]

  19. Reducing device utilization latency
  • Single sendfile: storage read & NIC send
    − Host-centric: per-device layer crossings
    − DCS: batched management in the hardware layer
  [Figure: same latency breakdown (host-centric 79µs SW + 75µs HW vs. DCS 39µs SW + 75µs HW), plus projected total latency with low-latency devices.]
  → 2x latency improvement (with low-latency devices)

  20. Host-independent performance
  • Sendfile under host resource (CPU) contention
    − Host-centric: host-dependent, high management cost
    − DCS: host-independent, low management cost
  [Figure: sendfile bandwidth and CPU busy time. No contention: host-centric 100% BW at 70% CPU busy; DCS 100% BW at 29% busy. High contention: host-centric 13% BW at 10% busy; DCS 71% BW at 11% busy.]
  → High performance even on weak hosts

  21. Multi-device invocation
  • Encrypted sendfile (SSD → GPU → NIC, 512MB)
    − DCS provides much more efficient data movement to the GPU
    − Current bottleneck is the NIC (1Gbps)
  [Figure: normalized processing time, decomposed into GPU data loading, GPU processing, network send, and NVIDIA driver time. Host-centric: 32 + 6 + 62 = 100; DCS: 6 + 6 + 6 + 68 = 86, a 14% reduction, mainly from shrinking GPU data loading from 32 to 6.]

  22. Multi-device invocation
  • Encrypted sendfile (SSD → GPU → NIC, 512MB)
    − DCS provides much more efficient data movement to the GPU
    − Current bottleneck is the NIC (1Gbps)
  [Figure: same breakdown with a projected 10Gb network send. 1Gb NIC: host-centric 100 vs. DCS 86, a 14% reduction. 10Gb NIC: host-centric 32 + 6 + 12 = 50 vs. DCS 6 + 6 + 6 + 13 = 31, a 38% reduction.]

  23. Real-world workload: Hadoop-grep
  • Hadoop-grep (10GB)
    − Faster input delivery & smaller host resource consumption
  [Figure: map and reduce progress (%) over time (0 to 150 seconds) for host-centric vs. DCS.]
  → 38% faster processing

  24. Scalability: More devices per host
  • Doubling the number of devices in a single host
  [Figure: normalized total device throughput and CPU utilization when going from SSD + NIC to SSDx2 + NICx2. Host-centric: 1.3x throughput, CPU utilization 60% → 100%. DCS: 2x throughput, CPU utilization 22% → 37%.]
  → Scalable many-device support

  25. Conclusion
  • Device-Centric Server architecture
    − Manages emerging devices on behalf of the host
    − Optimized data transfer and device control
    − Easily extensible, modularized design
  • Real hardware prototype evaluation
    − Device latency reduction: ~25%
    − Host resource savings: ~61%
    − Hadoop-grep speed improvement: ~38%

  26. Thank you!
  [Photo: prototype with NetFPGA NIC, GPU, NVMe SSD, and PCIe switch]
  Device latency reduction: ~25% | Host resource savings: ~61% | Hadoop-grep speed improvement: ~38%
  High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)
