
SLIDE 1

DCS: A Fast and Scalable

Device-Centric Server Architecture

Jaehyung Ahn, Dongup Kwon, Youngsok Kim, Mohammadamin Ajdari, Jaewon Lee, and Jangwoo Kim

{jh2ekd, nankdu7, elixir, majdari, spiegel0, jangwoo}@postech.ac.kr

High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)

SLIDE 2

Inefficient device utilization

  • Host-centric device management

− Host manages every device invocation
− Frequent host-involved layer crossings

  • Increases latency and management cost


[Diagram: the Application in userspace invokes Devices A, B, and C through their separate drivers and kernel stacks; both the datapath and the metadata/command path cross the userspace/kernel/hardware layers for every device.]
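The per-invocation cost pattern above can be sketched with a toy model (assumed numbers, not the paper's measurements): every host-driven device invocation pays a fixed software layer-crossing cost on both sides of the device's own work.

```python
# Toy model (assumed numbers, not DCS measurements): host-centric
# management pays a layer-crossing cost (syscall, driver, kernel stack)
# on both sides of every device invocation.

LAYER_CROSSING_US = 10  # assumed software cost per host-involved crossing

def host_centric_latency(device_times_us):
    """Total latency when the host drives each device separately."""
    total = 0
    for dev_us in device_times_us:
        total += LAYER_CROSSING_US  # enter the kernel, run the driver
        total += dev_us             # the device does its actual work
        total += LAYER_CROSSING_US  # return to userspace before the next step
    return total

# A storage read followed by a NIC send: two host-driven invocations.
print(host_centric_latency([30, 25]))  # 95, of which 40 us is crossings
```

With two devices, 40 of the 95 microseconds are management overhead; chaining more devices multiplies the crossings.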

SLIDE 3

Latency: High software overhead

  • Single sendfile: Storage read & NIC send

− Faster devices, more software overhead


[Chart: latency decomposition (normalized) into software, storage, and NIC time; software overhead share:]

HDD + 10Gb NIC: 7%
NVMe + 10Gb NIC: 50%
PCM + 10Gb NIC: 77%
PCM + 100Gb NIC: 82%
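The trend follows from simple arithmetic: if the software cost per sendfile is roughly fixed, its share of total latency grows as the device gets faster (illustrative numbers below, not the measured ones).

```python
# Illustrative arithmetic (assumed numbers, not the paper's data): a fixed
# software cost becomes the dominant share as device latency shrinks.

SW_OVERHEAD_US = 20  # assumed fixed kernel/stack cost per operation

def software_share(device_latency_us):
    """Fraction of total latency spent in software."""
    return SW_OVERHEAD_US / (SW_OVERHEAD_US + device_latency_us)

print(round(software_share(1000), 2))  # slow (HDD-class) device: 0.02
print(round(software_share(5), 2))     # fast (PCM-class) device: 0.8
```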

SLIDE 4

Cost: High host resource demand

  • Sendfile under host resource (CPU) contention

− Faster devices, more host resource consumption


[Chart: sendfile bandwidth and CPU usage, measured with an NVMe SSD and a 10Gb NIC]

No contention: sendfile bandwidth 100%, sendfile CPU usage 34%
High contention: sendfile bandwidth 14%, sendfile CPU usage 6%

SLIDE 5

Index

  • Inefficient device utilization
  • Limitations of existing solutions
  • DCS: Device-Centric Server architecture
  • Experimental results
  • Conclusion
SLIDE 6

Limitations of existing work

  • Single-device optimization

− Does not address inter-device communication

e.g., Moneta (SSD), DCA (NIC), mTCP (NIC), Arrakis (Generic)

  • Inter-device communication

− Not applicable to unsupported devices

e.g., GPUNet (GPU-NIC), GPUDirect RDMA (GPU-Infiniband)

  • Integrating devices

− Custom devices and protocols, limited applicability

e.g., QuickSAN (SSD+NIC), BlueDBM (Accelerator, SSD+NIC)

Need for fast, scalable, and generic inter-device communication


SLIDE 7

Index

  • Inefficient device utilization
  • Limitations of existing solutions
  • DCS: Device-Centric Server architecture

− Key idea and benefits
− Design considerations

  • Experimental results
  • Conclusion
SLIDE 8

DCS: Key idea

  • Minimize host involvement & data movement

Single command → Optimized multi-device invocation

[Diagram: the Application calls the DCS Library in userspace; the DCS Driver in the kernel sits above the existing device drivers and kernel stacks; in hardware, the DCS Engine drives Devices A, B, and C directly, keeping the datapath and the metadata/command path off the host.]
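A minimal sketch of the single-command idea (all identifiers are mine for illustration, not the DCS API): the host submits one command describing the whole device chain, and an engine-side interpreter walks it without further host involvement.

```python
# Hypothetical sketch of the DCS key idea (names invented for illustration):
# one host command encodes a multi-device chain; a device-side engine
# interprets it, so the host crosses the kernel boundary only once.

def build_command(ops):
    """ops: list of (device, operation, arg) -> a single host command."""
    return {"kind": "multi_device", "chain": list(ops)}

class Engine:
    """Stand-in for the hardware DCS Engine's command interpreter."""
    def __init__(self):
        self.log = []

    def execute(self, cmd):
        for device, op, arg in cmd["chain"]:
            # a per-device manager would drive the real device here
            self.log.append(f"{device}.{op}({arg})")
        return len(cmd["chain"])

cmd = build_command([("ssd", "read", "blk 42"), ("nic", "send", "sock 7")])
engine = Engine()
print(engine.execute(cmd))  # 2 device invocations from 1 host command
```

The point of the structure is that adding a device to the chain adds one entry to the command, not another round trip through the host.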

SLIDE 9

DCS: Benefits

  • Better device performance

− Faster data delivery, lower total operation latency

  • Better host performance/efficiency

− Resources/time spent on device management are now available for other applications

  • High applicability

− Relies on existing drivers / kernel support / interfaces

− Easy to extend to cover more devices


SLIDE 10

Index

  • Inefficient device utilization
  • Limitations of existing solutions
  • DCS: Device-Centric Server architecture

− Key idea and benefits
− Design considerations

  • By discussing implementation details
  • Experimental results
  • Conclusion
SLIDE 11

DCS: Architecture overview

[Diagram: in userspace, the Application calls the DCS Library (sendfile(), encrypted sendfile()); in the kernel, the DCS Driver holds a command generator and a kernel communicator alongside the existing drivers and kernel stacks; in hardware, the DCS Engine (on a NetFPGA NIC) contains a command queue, a command interpreter, and per-device managers, and reaches the NVMe SSD, GPU, and NetFPGA NIC through a PCIe switch.]

Fully compatible with existing systems

SLIDE 12

Communicating with storage


[Diagram, steps ❶–❺: ❶ the application passes a file descriptor through a hook / API call; ❷ the DCS Driver consults the (virtual) filesystem; ❸ it obtains the block address (in the source device) or the buffer address (if the data is cached in the VFS cache); ❹ the DCS Engine moves the data from the source device (NVMe SSD) to the target device; ❺ data consistency is guaranteed.]
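One way to picture the address-resolution step (a sketch under invented names, not the driver's real data structures): the driver prefers the page-cached copy of a file region and falls back to the on-device block address, which is what keeps the device-side transfer consistent with recent writes.

```python
# Hypothetical sketch (invented names): resolving a file region into an
# address the engine can use, preferring the VFS-cached copy so the
# device-side transfer stays consistent with buffered writes.

vfs_cache = {("data.bin", 0): "host buf 0xbeef"}           # cached regions
block_map = {("data.bin", 0): "nvme blk 42",
             ("data.bin", 4096): "nvme blk 43"}            # on-device blocks

def resolve(filename, offset):
    """Return (kind, addr): cached buffer if present, else device block."""
    key = (filename, offset)
    if key in vfs_cache:
        return ("buffer", vfs_cache[key])   # dirty/cached data wins
    return ("block", block_map[key])        # engine reads the SSD directly

print(resolve("data.bin", 0))     # ('buffer', 'host buf 0xbeef')
print(resolve("data.bin", 4096))  # ('block', 'nvme blk 43')
```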

SLIDE 13

Communicating with the network interface


[Diagram, steps ❶–❺: ❶ the application passes a socket descriptor through a hook / API call; ❷ the DCS Driver extracts the connection information from the kernel network stack; ❸ the DCS Engine receives the data buffer and connection information; ❹ the HW PacketGen unit on the NetFPGA NIC generates the packets; ❺ the NIC sends them.]

HW-assisted packet generation
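The division of labor can be sketched as follows (identifiers invented, not the NetFPGA design): the host contributes only the connection state taken from its network stack; per-chunk header construction happens off-host in the packet generator.

```python
# Hypothetical sketch (invented names): HW-assisted packet generation.
# The host supplies connection state once; the generator builds one
# packet per MTU-sized chunk, advancing the sequence number itself.

def connection_info(sock):
    # stand-in for what a driver would read out of the kernel socket
    return {"src": sock["src"], "dst": sock["dst"], "seq": sock["seq"]}

def hw_packet_gen(conn, payload, mtu=4):
    """Build one packet per MTU-sized chunk of payload."""
    packets, seq = [], conn["seq"]
    for i in range(0, len(payload), mtu):
        chunk = payload[i:i + mtu]
        packets.append({"hdr": (conn["src"], conn["dst"], seq), "data": chunk})
        seq += len(chunk)  # the host never sees these intermediate headers
    return packets

sock = {"src": "10.0.0.1:80", "dst": "10.0.0.2:5000", "seq": 100}
pkts = hw_packet_gen(connection_info(sock), b"helloworld")
print(len(pkts))           # 3 packets for a 10-byte payload at MTU 4
print(pkts[-1]["hdr"][2])  # 108
```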

SLIDE 14

Communicating with an accelerator


[Diagram, steps ❶–❼, as recoverable from the slide: the application allocates GPU memory through the GPU user library and calls the DCS Library; the DCS Driver gets the memory mapping from the GPU kernel driver; a DMA / NVMe transfer loads data from the source device directly into GPU memory; a kernel invocation then processes the data (kernel launch).]

Direct data loading without memcpy
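Counting copies makes the takeaway concrete (a sketch with invented names, not the GPU driver path): staging through a host buffer costs an extra memcpy that direct loading into the mapped GPU memory avoids.

```python
# Hypothetical sketch (invented names): loading data into accelerator
# memory directly from the source device versus staging it through a
# host buffer. The copy count is the difference that matters.

copies = []

def dma(src, dst):
    copies.append((src, dst))  # record each bulk data movement

def host_staged_load():
    dma("ssd", "host_buf")   # device -> host DRAM
    dma("host_buf", "gpu")   # host DRAM -> GPU (the memcpy DCS avoids)

def direct_load():
    # the engine resolves the GPU memory mapping and DMAs straight into it
    dma("ssd", "gpu")

host_staged_load()
staged = len(copies)
copies.clear()
direct_load()
print(staged, len(copies))  # 2 1
```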

SLIDE 15

Index

  • Inefficient device utilization
  • Limitations of existing solutions
  • DCS: Device-Centric Server architecture
  • Experimental results
  • Conclusion
SLIDE 16

Experimental setup

  • Host: Power-efficient system

− Core 2 Duo @ 2.00GHz, 2MB LLC
− 2GB DDR2 DRAM

  • Devices: Off-the-shelf emerging devices

− Storage: Samsung XS1715 NVMe SSD
− NIC: NetFPGA with Xilinx Virtex 5 (up to 1Gb bandwidth)
− Accelerator: NVIDIA Tesla K20m
− Device interconnect: Cyclone Microsystems PCIe2-2707 (Gen 2 switch, 5 slots, up to 80Gbps)


SLIDE 17

DCS prototype implementation

  • Our 4-node DCS prototype

− Can support many devices per host


[Photo: prototype showing the NVMe SSD, NetFPGA NIC, GPU, and PCIe switch]

SLIDE 18

Reducing device utilization latency

  • Single sendfile: Storage read & NIC send

− Host-centric: Per-device layer crossings
− DCS: Batch management in the HW layer


[Chart: latency (µs) split into SW and HW components, Host-centric vs DCS; values shown: HW 75 for both, SW 79 (Host-centric) vs 39 (DCS)]

SLIDE 19

Reducing device utilization latency

2x latency improvement (with low-latency devices)

SLIDE 20

Host-independent performance

  • Sendfile under host resource (CPU) contention

− Host-centric: host-dependent, high management cost
− DCS: host-independent, low management cost

[Chart: sendfile bandwidth and CPU-busy share, no contention vs high contention. Host-centric: 100% BW at CPU 70% busy, dropping to 13% BW at CPU 10% busy; DCS: 100% BW at CPU 29% busy, sustaining 71% BW at CPU 11% busy.]

High performance even on weak hosts

SLIDE 21

Multi-device invocation

  • Encrypted sendfile (SSD → GPU → NIC, 512MB)

− DCS provides much more efficient data movement to the GPU
− Current bottleneck is the NIC (1Gbps)


[Chart: normalized processing time, Host-centric vs DCS, split into GPU data loading, GPU processing, network send, and NVIDIA driver components; with the 1Gb NIC, DCS yields a 14% reduction.]

SLIDE 22

Multi-device invocation


[Chart: normalized processing time broken into GPU data loading, GPU processing, network send, and NVIDIA driver components; DCS reduces total processing time by 14% with the 1Gb NIC and by 38% with the 10Gb NIC.]
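The bottleneck shift can be reproduced with back-of-the-envelope numbers (assumed stage times chosen only to illustrate the shape of the result, not the measured data): while a slow NIC dominates, faster data movement to the GPU saves little end to end; a faster NIC exposes those savings.

```python
# Amdahl-style arithmetic with assumed stage times (not measurements):
# DCS shortens the GPU data-loading stage; the end-to-end benefit
# depends on how large the NIC-send stage is.

def total(stages):
    return sum(stages.values())

host_1g = {"gpu_load": 20, "gpu_proc": 6, "nic_send": 62}  # NIC dominates
dcs_1g  = {"gpu_load": 8,  "gpu_proc": 6, "nic_send": 62}
print(round(1 - total(dcs_1g) / total(host_1g), 2))   # modest reduction

host_10g = dict(host_1g, nic_send=6)  # same chain, faster NIC
dcs_10g  = dict(dcs_1g, nic_send=6)
print(round(1 - total(dcs_10g) / total(host_10g), 2)) # larger reduction
```

The same absolute saving in data loading (12 units here) is a small fraction of the slow-NIC total but a large fraction of the fast-NIC total.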

SLIDE 23

Real-world workload: Hadoop-grep

  • Hadoop-grep (10GB)

− Faster input delivery & smaller host resource consumption


[Chart: Map and Reduce progress (%) over time, Host-centric vs DCS]

38% faster processing

SLIDE 24

Scalability: More devices per host

  • Doubling # of devices in a single host


[Chart: normalized total device throughput and CPU utilization when doubling from {SSD, NIC} to {SSDx2, NICx2}. Host-centric: 1.3x throughput, CPU utilization 60% → 100%; DCS: 2x throughput, CPU utilization 22% → 37%.]

Scalable many-device support

SLIDE 25

Conclusion

  • Device-Centric Server architecture

− Manages emerging devices on behalf of the host
− Optimized data transfer and device control
− Easily extensible modularized design

  • Real hardware prototype evaluation

− Device latency reduction: ~25%
− Host resource savings: ~61%
− Hadoop-grep speed improvement: ~38%


SLIDE 26

Thank you!

High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)

Device latency reduction: ~25%
Host resource savings: ~61%
Hadoop-grep speed improvement: ~38%
