
SLIDE 1

DCS: A Fast and Scalable

Device-Centric Server Architecture

Jaehyung Ahn, Dongup Kwon, Youngsok Kim, Mohammadamin Ajdari, Jaewon Lee, and Jangwoo Kim

{jh2ekd, nankdu7, elixir, majdari, spiegel0, jangwoo}@postech.ac.kr

High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)

SLIDE 2

Inefficient device utilization

  • Host-centric device management

− Host manages every device invocation
− Frequent host-involved layer crossings

  • Increases latency and management cost


[Diagram: the Application in userspace invokes Devices A, B, and C through their separate drivers and kernel stacks; both the datapath and the metadata/command path cross the userspace/kernel/hardware layers for every device.]
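The per-invocation cost pattern above can be sketched with a toy model (assumed numbers, not the paper's measurements): every host-driven device invocation pays a fixed software layer-crossing cost on both sides of the device's own work.

```python
# Toy model (assumed numbers, not DCS measurements): host-centric
# management pays a layer-crossing cost (syscall, driver, kernel stack)
# on both sides of every device invocation.

LAYER_CROSSING_US = 10  # assumed software cost per host-involved crossing

def host_centric_latency(device_times_us):
    """Total latency when the host drives each device separately."""
    total = 0
    for dev_us in device_times_us:
        total += LAYER_CROSSING_US  # enter the kernel, run the driver
        total += dev_us             # the device does its actual work
        total += LAYER_CROSSING_US  # return to userspace before the next step
    return total

# A storage read followed by a NIC send: two host-driven invocations.
print(host_centric_latency([30, 25]))  # 95, of which 40 us is crossings
```

With two devices, 40 of the 95 microseconds are management overhead; chaining more devices multiplies the crossings.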

SLIDE 3

Latency: High software overhead

  • Single sendfile: Storage read & NIC send

− Faster devices, more software overhead


[Chart: latency decomposition (normalized) into software, storage, and NIC time; software overhead share:]

HDD + 10Gb NIC: 7%
NVMe + 10Gb NIC: 50%
PCM + 10Gb NIC: 77%
PCM + 100Gb NIC: 82%
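The trend follows from simple arithmetic: if the software cost per sendfile is roughly fixed, its share of total latency grows as the device gets faster (illustrative numbers below, not the measured ones).

```python
# Illustrative arithmetic (assumed numbers, not the paper's data): a fixed
# software cost becomes the dominant share as device latency shrinks.

SW_OVERHEAD_US = 20  # assumed fixed kernel/stack cost per operation

def software_share(device_latency_us):
    """Fraction of total latency spent in software."""
    return SW_OVERHEAD_US / (SW_OVERHEAD_US + device_latency_us)

print(round(software_share(1000), 2))  # slow (HDD-class) device: 0.02
print(round(software_share(5), 2))     # fast (PCM-class) device: 0.8
```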

SLIDE 4

Cost: High host resource demand

  • Sendfile under host resource (CPU) contention

− Faster devices, more host resource consumption


[Chart: sendfile bandwidth and CPU usage, measured with an NVMe SSD and a 10Gb NIC]

No contention: sendfile bandwidth 100%, sendfile CPU usage 34%
High contention: sendfile bandwidth 14%, sendfile CPU usage 6%

SLIDE 5

Index

  • Inefficient device utilization
  • Limitations of existing solutions
  • DCS: Device-Centric Server architecture
  • Experimental results
  • Conclusion
SLIDE 6

Limitations of existing work

  • Single-device optimization

− Does not address inter-device communication

e.g., Moneta (SSD), DCA (NIC), mTCP (NIC), Arrakis (Generic)

  • Inter-device communication

− Not applicable to unsupported devices

e.g., GPUNet (GPU-NIC), GPUDirect RDMA (GPU-Infiniband)

  • Integrating devices

− Custom devices and protocols, limited applicability

e.g., QuickSAN (SSD+NIC), BlueDBM (Accelerator, SSD+NIC)

Need for fast, scalable, and generic inter-device communication


SLIDE 7

Index

  • Inefficient device utilization
  • Limitations of existing solutions
  • DCS: Device-Centric Server architecture

− Key idea and benefits
− Design considerations

  • Experimental results
  • Conclusion
SLIDE 8

DCS: Key idea

  • Minimize host involvement & data movement

Single command → Optimized multi-device invocation

[Diagram: the Application calls the DCS Library in userspace; the DCS Driver in the kernel sits above the existing device drivers and kernel stacks; in hardware, the DCS Engine drives Devices A, B, and C directly, keeping the datapath and the metadata/command path off the host.]
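A minimal sketch of the single-command idea (all identifiers are mine for illustration, not the DCS API): the host submits one command describing the whole device chain, and an engine-side interpreter walks it without further host involvement.

```python
# Hypothetical sketch of the DCS key idea (names invented for illustration):
# one host command encodes a multi-device chain; a device-side engine
# interprets it, so the host crosses the kernel boundary only once.

def build_command(ops):
    """ops: list of (device, operation, arg) -> a single host command."""
    return {"kind": "multi_device", "chain": list(ops)}

class Engine:
    """Stand-in for the hardware DCS Engine's command interpreter."""
    def __init__(self):
        self.log = []

    def execute(self, cmd):
        for device, op, arg in cmd["chain"]:
            # a per-device manager would drive the real device here
            self.log.append(f"{device}.{op}({arg})")
        return len(cmd["chain"])

cmd = build_command([("ssd", "read", "blk 42"), ("nic", "send", "sock 7")])
engine = Engine()
print(engine.execute(cmd))  # 2 device invocations from 1 host command
```

The point of the structure is that adding a device to the chain adds one entry to the command, not another round trip through the host.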

SLIDE 9

DCS: Benefits

  • Better device performance

− Faster data delivery, lower total operation latency

  • Better host performance/efficiency

− Resources/time spent on device management are now available for other applications

  • High applicability

− Relies on existing drivers / kernel support / interfaces

− Easy to extend to cover more devices


SLIDE 10

Index

  • Inefficient device utilization
  • Limitations of existing solutions
  • DCS: Device-Centric Server architecture

− Key idea and benefits
− Design considerations

  • By discussing implementation details
  • Experimental results
  • Conclusion
SLIDE 11

DCS: Architecture overview

[Diagram: in userspace, the Application calls the DCS Library (sendfile(), encrypted sendfile()); in the kernel, the DCS Driver holds a command generator and a kernel communicator alongside the existing drivers and kernel stacks; in hardware, the DCS Engine (on a NetFPGA NIC) contains a command queue, a command interpreter, and per-device managers, and reaches the NVMe SSD, GPU, and NetFPGA NIC through a PCIe switch.]

Fully compatible with existing systems

SLIDE 12

Communicating with storage


[Diagram, steps ❶–❺: ❶ the application passes a file descriptor through a hook / API call; ❷ the DCS Driver consults the (virtual) filesystem; ❸ it obtains the block address (in the source device) or the buffer address (if the data is cached in the VFS cache); ❹ the DCS Engine moves the data from the source device (NVMe SSD) to the target device; ❺ data consistency is guaranteed.]
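One way to picture the address-resolution step (a sketch under invented names, not the driver's real data structures): the driver prefers the page-cached copy of a file region and falls back to the on-device block address, which is what keeps the device-side transfer consistent with recent writes.

```python
# Hypothetical sketch (invented names): resolving a file region into an
# address the engine can use, preferring the VFS-cached copy so the
# device-side transfer stays consistent with buffered writes.

vfs_cache = {("data.bin", 0): "host buf 0xbeef"}           # cached regions
block_map = {("data.bin", 0): "nvme blk 42",
             ("data.bin", 4096): "nvme blk 43"}            # on-device blocks

def resolve(filename, offset):
    """Return (kind, addr): cached buffer if present, else device block."""
    key = (filename, offset)
    if key in vfs_cache:
        return ("buffer", vfs_cache[key])   # dirty/cached data wins
    return ("block", block_map[key])        # engine reads the SSD directly

print(resolve("data.bin", 0))     # ('buffer', 'host buf 0xbeef')
print(resolve("data.bin", 4096))  # ('block', 'nvme blk 43')
```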

SLIDE 13

Communicating with the network interface


[Diagram, steps ❶–❺: ❶ the application passes a socket descriptor through a hook / API call; ❷ the DCS Driver extracts the connection information from the kernel network stack; ❸ the DCS Engine receives the data buffer and connection information; ❹ the HW PacketGen unit on the NetFPGA NIC generates the packets; ❺ the NIC sends them.]

HW-assisted packet generation
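The division of labor can be sketched as follows (identifiers invented, not the NetFPGA design): the host contributes only the connection state taken from its network stack; per-chunk header construction happens off-host in the packet generator.

```python
# Hypothetical sketch (invented names): HW-assisted packet generation.
# The host supplies connection state once; the generator builds one
# packet per MTU-sized chunk, advancing the sequence number itself.

def connection_info(sock):
    # stand-in for what a driver would read out of the kernel socket
    return {"src": sock["src"], "dst": sock["dst"], "seq": sock["seq"]}

def hw_packet_gen(conn, payload, mtu=4):
    """Build one packet per MTU-sized chunk of payload."""
    packets, seq = [], conn["seq"]
    for i in range(0, len(payload), mtu):
        chunk = payload[i:i + mtu]
        packets.append({"hdr": (conn["src"], conn["dst"], seq), "data": chunk})
        seq += len(chunk)  # the host never sees these intermediate headers
    return packets

sock = {"src": "10.0.0.1:80", "dst": "10.0.0.2:5000", "seq": 100}
pkts = hw_packet_gen(connection_info(sock), b"helloworld")
print(len(pkts))           # 3 packets for a 10-byte payload at MTU 4
print(pkts[-1]["hdr"][2])  # 108
```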

SLIDE 14

Communicating with an accelerator


[Diagram, steps ❶–❼, as recoverable from the slide: the application allocates GPU memory through the GPU user library and calls the DCS Library; the DCS Driver gets the memory mapping from the GPU kernel driver; a DMA / NVMe transfer loads data from the source device directly into GPU memory; a kernel invocation then processes the data (kernel launch).]

Direct data loading without memcpy
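Counting copies makes the takeaway concrete (a sketch with invented names, not the GPU driver path): staging through a host buffer costs an extra memcpy that direct loading into the mapped GPU memory avoids.

```python
# Hypothetical sketch (invented names): loading data into accelerator
# memory directly from the source device versus staging it through a
# host buffer. The copy count is the difference that matters.

copies = []

def dma(src, dst):
    copies.append((src, dst))  # record each bulk data movement

def host_staged_load():
    dma("ssd", "host_buf")   # device -> host DRAM
    dma("host_buf", "gpu")   # host DRAM -> GPU (the memcpy DCS avoids)

def direct_load():
    # the engine resolves the GPU memory mapping and DMAs straight into it
    dma("ssd", "gpu")

host_staged_load()
staged = len(copies)
copies.clear()
direct_load()
print(staged, len(copies))  # 2 1
```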

SLIDE 15

Index

  • Inefficient device utilization
  • Limitations of existing solutions
  • DCS: Device-Centric Server architecture
  • Experimental results
  • Conclusion
SLIDE 16

Experimental setup

  • Host: Power-efficient system

− Core 2 Duo @ 2.00GHz, 2MB LLC
− 2GB DDR2 DRAM

  • Devices: Off-the-shelf emerging devices

− Storage: Samsung XS1715 NVMe SSD
− NIC: NetFPGA with Xilinx Virtex 5 (up to 1Gb bandwidth)
− Accelerator: NVIDIA Tesla K20m
− Device interconnect: Cyclone Microsystems PCIe2-2707 (Gen 2 switch, 5 slots, up to 80Gbps)


SLIDE 17

DCS prototype implementation

  • Our 4-node DCS prototype

− Can support many devices per host


[Photo: prototype showing the NVMe SSD, NetFPGA NIC, GPU, and PCIe switch]

SLIDE 18

Reducing device utilization latency

  • Single sendfile: Storage read & NIC send

− Host-centric: Per-device layer crossings
− DCS: Batch management in the HW layer


[Chart: latency (µs) split into SW and HW components, Host-centric vs DCS; values shown: HW 75 for both, SW 79 (Host-centric) vs 39 (DCS)]

SLIDE 19

Reducing device utilization latency

2x latency improvement (with low-latency devices)

SLIDE 20

Host-independent performance

  • Sendfile under host resource (CPU) contention

− Host-centric: host-dependent, high management cost
− DCS: host-independent, low management cost

[Chart: sendfile bandwidth and CPU-busy share, no contention vs high contention. Host-centric: 100% BW at CPU 70% busy, dropping to 13% BW at CPU 10% busy; DCS: 100% BW at CPU 29% busy, sustaining 71% BW at CPU 11% busy.]

High performance even on weak hosts

SLIDE 21

Multi-device invocation

  • Encrypted sendfile (SSD → GPU → NIC, 512MB)

− DCS provides much more efficient data movement to the GPU
− Current bottleneck is the NIC (1Gbps)


[Chart: normalized processing time, Host-centric vs DCS, split into GPU data loading, GPU processing, network send, and NVIDIA driver components; with the 1Gb NIC, DCS yields a 14% reduction.]

SLIDE 22

Multi-device invocation


[Chart: normalized processing time broken into GPU data loading, GPU processing, network send, and NVIDIA driver components; DCS reduces total processing time by 14% with the 1Gb NIC and by 38% with the 10Gb NIC.]
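The bottleneck shift can be reproduced with back-of-the-envelope numbers (assumed stage times chosen only to illustrate the shape of the result, not the measured data): while a slow NIC dominates, faster data movement to the GPU saves little end to end; a faster NIC exposes those savings.

```python
# Amdahl-style arithmetic with assumed stage times (not measurements):
# DCS shortens the GPU data-loading stage; the end-to-end benefit
# depends on how large the NIC-send stage is.

def total(stages):
    return sum(stages.values())

host_1g = {"gpu_load": 20, "gpu_proc": 6, "nic_send": 62}  # NIC dominates
dcs_1g  = {"gpu_load": 8,  "gpu_proc": 6, "nic_send": 62}
print(round(1 - total(dcs_1g) / total(host_1g), 2))   # modest reduction

host_10g = dict(host_1g, nic_send=6)  # same chain, faster NIC
dcs_10g  = dict(dcs_1g, nic_send=6)
print(round(1 - total(dcs_10g) / total(host_10g), 2)) # larger reduction
```

The same absolute saving in data loading (12 units here) is a small fraction of the slow-NIC total but a large fraction of the fast-NIC total.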

SLIDE 23

Real-world workload: Hadoop-grep

  • Hadoop-grep (10GB)

− Faster input delivery & smaller host resource consumption


[Chart: Map and Reduce progress (%) over time, Host-centric vs DCS]

38% faster processing

SLIDE 24

Scalability: More devices per host

  • Doubling # of devices in a single host


[Chart: normalized total device throughput and CPU utilization when doubling from {SSD, NIC} to {SSDx2, NICx2}. Host-centric: 1.3x throughput, CPU utilization 60% → 100%; DCS: 2x throughput, CPU utilization 22% → 37%.]

Scalable many-device support

SLIDE 25

Conclusion

  • Device-Centric Server architecture

− Manages emerging devices on behalf of the host
− Optimized data transfer and device control
− Easily extensible modularized design

  • Real hardware prototype evaluation

− Device latency reduction: ~25%
− Host resource savings: ~61%
− Hadoop-grep speed improvement: ~38%


SLIDE 26

Thank you!

High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)

Device latency reduction: ~25%
Host resource savings: ~61%
Hadoop-grep speed improvement: ~38%
