DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture


SLIDE 1

DCS-ctrl:

A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture

Dongup Kwon¹, Jaehyung Ahn², Dongju Chae², Mohammadamin Ajdari², Jaewon Lee¹, Suheon Bae¹, Youngsok Kim¹, and Jangwoo Kim¹

  • ¹Dept. of Electrical and Computer Engineering, Seoul National University
  • ²Dept. of Computer Science and Engineering, POSTECH
SLIDE 2

Conventional Server Architecture

  • Primarily rely on “CPU and memory”
    − CPU-centric computing & in-memory storage
    − Slow and low-bandwidth peripheral devices

[Figure: host- & CPU-centric architecture; labels: CPU, Storage, Network, Compute]

SLIDE 4

Device-centric Server Architecture

  • Exploit “fast & high-bandwidth devices”
    − Data processing accelerators (e.g., GPU, FPGA)
    − Storage (e.g., SSD), network (e.g., 100GbE), PCIe Gen3

[Figure: host- & CPU-centric server (labels: CPU, Storage, Network, Compute) next to a device-centric server (labels: CPU, PCIe, Accelerator (GPU, FPGA), Storage (NVM), Network (NIC))]

SLIDE 5

Index

  • Existing approaches
  • DCS-ctrl: HW-based device-control mechanism
  • Experimental results
  • Conclusion


SLIDE 6

Existing Approaches

  • Software optimization
    − Memory mgmt. optimization, user-level device interface
    − Does not address multi-device tasks
  • P2P communication
    − Transfer data directly through PCI Express → device-to-device (D2D) comm.
  • Device integration
    − Integrate heterogeneous devices → D2D comm.

SLIDE 7

Limitations of Existing D2D Comm.

  • P2P communication
    − Direct data transfers through PCI Express → D2D comm.
    − Slow and high-overhead control path

[Figure: data path and control path among Dev A, Dev B, Dev C, and the CPU]
[Chart: latency (us) for SW opt vs. P2P; breakdown: control, data copy, kernel]
[Chart: CPU utilization (%) for SW opt vs. P2P; breakdown: others, control, kernel]

SLIDE 8

Limitations of Existing D2D Comm.

  • Integrated devices
    − Integrating heterogeneous devices → D2D comm.
    − Fast data & control transfers
    − Fixed and inflexible aggregate implementation

[Figure: integrated device; labels: CPU, Dev A, Dev B, Dev C, Controllers, New Dev ($$$)]

SLIDE 9

Limited Performance Potential

    while (true) {
        rc_recv = recv(fd_sock, buffer, recv_size, 0);
        if (rc_recv <= 0) break;
        processing(&md_ctx, buffer, recv_size);
        rc_write = write(fd_file, buffer, recv_size);
        …
    }

  • “Intermediate” processing between device ops
    − Prevents applications from using direct D2D comm.
    − Causes host-side resource contention (CPU and memory)

[Figure: labels: Dev A, Dev B, CPU]
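The receive, process, write loop on this slide can be fleshed out as a small host-side relay. The sketch below is an illustration only, not the authors' code: `relay`, `write_all`, and `checksum_update` are hypothetical names, a running byte sum stands in for the slide's `processing()`, and plain `read()`/`write()` on file descriptors stand in for the socket and file. It also uses the number of bytes actually received rather than the requested size, which the slide's abbreviated loop glosses over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical stand-in for the slide's processing(): a running byte sum. */
static void checksum_update(uint32_t *ctx, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        *ctx += buf[i];
}

/* Write all len bytes, retrying after short writes.
 * Returns 0 on success, -1 on error (EINTR is not retried in this sketch). */
static int write_all(int fd, const unsigned char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Host-mediated relay: every chunk is copied into host memory, processed
 * on the CPU, and only then written to the destination.
 * Returns the total number of bytes relayed, or -1 on error. */
static long relay(int in_fd, int out_fd, uint32_t *ctx)
{
    unsigned char buffer[4096];
    long total = 0;

    assert(ctx != NULL);
    for (;;) {
        ssize_t n = read(in_fd, buffer, sizeof buffer);
        if (n < 0)
            return -1;
        if (n == 0)
            break;                                   /* end of input */
        checksum_update(ctx, buffer, (size_t)n);     /* intermediate processing */
        if (write_all(out_fd, buffer, (size_t)n) != 0)
            return -1;
        total += n;
    }
    return total;
}
```

Calling `relay(src_fd, dst_fd, &ctx)` with any two descriptors shows the pattern the slide criticizes: the data crosses host memory twice and the CPU touches every byte, so the two devices cannot exchange it directly.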

SLIDE 10

Design Goals

  • Performance & scalability
    − Faster inter-device data & control communication
    − More scalable with CPU-efficient device operations
  • Flexibility
    − Support any type of off-the-shelf device
  • Applicability
    − Create more opportunities to apply D2D comm.

SLIDE 11

Index

  • Existing approaches
  • DCS-ctrl: HW-based device-control mechanism
    − Key ideas and benefits
    − Architecture
  • Experimental results
  • Conclusion

SLIDE 12

DCS-ctrl: Key Ideas & Benefits

  • DCS-ctrl: PCIe P2P + “HDC”
    − Hardware-based device-control (HDC) mechanism
    − HDC Engine: “FPGA-based” device orchestrator + “near-device” processing unit
      § Performance & scalability → HDC, device orchestrator
      § Flexibility → FPGA-based, low-cost device controller
      § Applicability → near-device processing unit

SLIDE 13

HDC Engine: Overview

[Figure: two stacks side by side. SW-controlled P2P: application, device drivers A, B, and C, then Dev A, Dev B, and Dev C. DCS-ctrl (HW): application, then the HDC Engine (FPGA) containing device ctrl A, B, and C and an NDP unit, then Dev A, Dev B, and Dev C.]

SLIDE 14

DCS-ctrl: Key Ideas & Benefits

    void ssd_to_nic() {
        get_from_ssd(&data);
        process_in_HDC(&data);
        write_to_nic(&data);
    }

  • Optimized dev. control ⇒ Faster & scalable communication
  • Generic dev. interfaces ⇒ Higher flexibility
  • Near-device processing ⇒ Higher applicability

[Figure: HDC Engine connected to the CPU, Dev A, Dev B, Dev C, and a new device; legend: device controller, data path, control path]

SLIDE 15

Key Idea #1: Device Orchestrator

  • Perform multi-device tasks w/o CPU involvement
    − Offload a multi-device task to HDC Engine
    − Manage all device operations and their dependencies

Scoreboard

    Dev     R/W     Src           Dst           Aux    State
    A       Read    Addr(DevA)    Addr(NDP-A)   -      Done
    (NDP)   -       Addr(NDP-A)   Addr(NDP-B)   Hash   Issue
    B       Write   Addr(NDP-B)   Addr(DevB)    -      Ready

[Figure: multi-device task; labels: Dev A, NDP, Dev B]

Fast hardware-level device control
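The scoreboard on this slide can be modeled in a few lines of C. This is a minimal sketch under stated assumptions, not the HDC Engine's actual logic: the struct fields mirror the slide's column headers, the state names mirror its State column, dependencies are assumed to form a simple chain in table order, and every identifier (`sb_entry`, `sb_issue_next`, `sb_complete`, `sb_load_example`) is hypothetical. The "NDP" label on the middle row is also an assumption, since the slide's table leaves that cell unclear.

```c
#include <assert.h>
#include <stddef.h>

/* States named after the slide's State column. */
enum sb_state { SB_READY, SB_ISSUE, SB_DONE };

/* One scoreboard row; fields follow the slide's column headers. */
struct sb_entry {
    const char *dev;
    const char *rw;
    const char *src;
    const char *dst;
    const char *aux;
    enum sb_state state;
};

/* Issue the first READY entry whose predecessor is DONE.
 * Assumption: dependencies form a linear chain in table order.
 * Returns the issued index, or -1 if nothing can be issued yet. */
static int sb_issue_next(struct sb_entry *sb, size_t n)
{
    assert(sb != NULL);
    for (size_t i = 0; i < n; i++) {
        if (sb[i].state != SB_READY)
            continue;
        if (i == 0 || sb[i - 1].state == SB_DONE) {
            sb[i].state = SB_ISSUE;
            return (int)i;
        }
        return -1;   /* first pending entry is still blocked */
    }
    return -1;       /* nothing left to issue */
}

/* Record completion of an in-flight entry. */
static void sb_complete(struct sb_entry *sb, size_t n, size_t i)
{
    assert(sb != NULL && i < n && sb[i].state == SB_ISSUE);
    sb[i].state = SB_DONE;
}

/* The three-step task from the slide: read Dev A, hash, write Dev B. */
static void sb_load_example(struct sb_entry sb[3])
{
    sb[0] = (struct sb_entry){ "A",   "Read",  "Addr(DevA)",  "Addr(NDP-A)", "-",    SB_READY };
    sb[1] = (struct sb_entry){ "NDP", "-",     "Addr(NDP-A)", "Addr(NDP-B)", "Hash", SB_READY };
    sb[2] = (struct sb_entry){ "B",   "Write", "Addr(NDP-B)", "Addr(DevB)",  "-",    SB_READY };
}
```

Issuing the read, completing it, and issuing again leaves the rows in the states Done, Issue, Ready, which is the snapshot the slide shows. The point of the model is that this bookkeeping needs no host CPU once the task has been loaded.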

SLIDE 16

Key Idea #2: Device Controller

  • Provide interfaces between HDC Engine & devices
    − Include submission & completion queues
    − Build standard & vendor-specific device commands

[Figure: device controller with a submission queue and a completion queue; labels: doorbell registers, PCIe switch, device]

Flexible & low-cost device control
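A submission queue, a completion queue, and a doorbell can be sketched as two ring buffers plus one register-like field. The layout below is an assumption made for illustration (it is similar in spirit to NVMe-style queue pairs, but it is not taken from the slides or from any device specification). `QDEPTH`, `struct cmd`, `struct cpl`, `dc_submit`, `dc_poll`, and the simulated device `fake_device_run` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QDEPTH 8   /* hypothetical queue depth; one slot stays empty */

struct cmd { uint16_t id; uint8_t opcode; uint64_t src, dst; uint32_t len; };
struct cpl { uint16_t id; uint16_t status; };

struct dev_ctrl {
    struct cmd sq[QDEPTH];   /* submission queue: controller produces */
    struct cpl cq[QDEPTH];   /* completion queue: device produces */
    uint32_t sq_tail, sq_head;
    uint32_t cq_tail, cq_head;
    uint32_t sq_doorbell;    /* stands in for a device doorbell register */
};

/* Queue one command and ring the doorbell. Returns 0, or -1 if full. */
static int dc_submit(struct dev_ctrl *dc, struct cmd c)
{
    assert(dc != NULL);
    uint32_t next = (dc->sq_tail + 1) % QDEPTH;
    if (next == dc->sq_head)
        return -1;                    /* submission queue full */
    dc->sq[dc->sq_tail] = c;
    dc->sq_tail = next;
    dc->sq_doorbell = next;           /* tell the device new work exists */
    return 0;
}

/* Simulated device: consume commands up to the doorbell value and post
 * one successful completion per command. Returns commands processed. */
static int fake_device_run(struct dev_ctrl *dc)
{
    int done = 0;
    assert(dc != NULL);
    while (dc->sq_head != dc->sq_doorbell) {
        uint32_t next_cq = (dc->cq_tail + 1) % QDEPTH;
        if (next_cq == dc->cq_head)
            break;                    /* completion queue full */
        dc->cq[dc->cq_tail].id = dc->sq[dc->sq_head].id;
        dc->cq[dc->cq_tail].status = 0;
        dc->cq_tail = next_cq;
        dc->sq_head = (dc->sq_head + 1) % QDEPTH;
        done++;
    }
    return done;
}

/* Fetch one completion if available. Returns 1 if *out was filled. */
static int dc_poll(struct dev_ctrl *dc, struct cpl *out)
{
    assert(dc != NULL && out != NULL);
    if (dc->cq_head == dc->cq_tail)
        return 0;
    *out = dc->cq[dc->cq_head];
    dc->cq_head = (dc->cq_head + 1) % QDEPTH;
    return 1;
}
```

In a real system the doorbell write would be a PCIe write to the device and the completion would arrive by DMA; here both sides live in one struct so the handshake can be followed in a debugger.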

SLIDE 17

Key Idea #3: Near-device Processing

  • Near-device processing units
    − Execute intermediate processing between device ops
    − Scale-out storage app → hash, encryption, compression

    Processing unit   LUTs    Registers   Applications
    MD5               3.0%    0.69%       Swift
    AES256            3.52%   0.99%       HDFS, Swift
    GZIP              5.36%   2.09%       HDFS

Easy to extend to other devices & applications
Highly applicable to existing applications

SLIDE 18

Index

  • Existing approaches
  • DCS-ctrl: HW-based device-control mechanism
    − Key ideas and benefits
    − Architecture
  • Experimental results
  • Conclusion

SLIDE 19

Baseline Architecture

  • Software-controlled P2P
    − P2P comm. + indirect device-control path

[Figure: SW layer: application and device drivers for Dev A, Dev B, and Dev C; HW layer: Dev A, Dev B, and Dev C attached to a PCIe switch]

SLIDE 20

DCS-ctrl: HW-based Device Control (1/3)

  • Offload device-control path to HDC Engine
    − Scoreboard: schedule device operations in a multi-dev task

[Figure: SW layer: application with task A – B – C; HW layer: FPGA-based HDC Engine with a scoreboard (columns: Dev, r/w, Src, Dst; rows: A, B, C), plus Dev A, Dev B, and Dev C attached to a PCIe switch]

SLIDE 21

DCS-ctrl: Low-cost Integration (2/3)

  • Implement an FPGA-based device controller
    − Device controller: directly control devices using P2P

[Figure: SW layer: application with task A – B – C; HW layer: FPGA-based HDC Engine with a scoreboard (columns: Dev, r/w, Src, Dst; rows: A, B, C) and a device controller, plus Dev A, Dev B, Dev C, and a new device attached to a PCIe switch]

SLIDE 22

DCS-ctrl: Near-device Processing (3/3)

  • Provide units for intermediate processing
    − NDP unit: perform data processing on a data path

[Figure: SW layer: application with task A – B – C; HW layer: FPGA-based HDC Engine with a scoreboard (columns: Dev, r/w, Src, Dst; rows: A, B, C), a device controller, near-device processing, and intermediate buffers, plus Dev A, Dev B, Dev C, and a new device attached to a PCIe switch]

SLIDE 23

DCS-ctrl Prototype

HDC Engine implemented on Xilinx Virtex-7 VC707
Supports off-the-shelf devices: Intel 750 SSDs, Broadcom 10GbE NICs, NVIDIA GPUs

SLIDE 24

Index

  • Existing approaches
  • DCS-ctrl: HW-based device-control mechanism
  • Experimental results
  • Conclusion


SLIDE 25

Reducing Device Control Latency

  • encrypted_sendfile(): SSD → hash → NIC
    − SW opt (+P2P): frequent boundary crossings, complex software
    − DCS-ctrl: fewer crossings, hardware-based device control

[Chart: latency (us) without processing, SW opt vs. DCS-ctrl; breakdown: HW, kernel, dev ctrl]
[Chart: latency (us) with processing (AES256), SW opt vs. SW opt + P2P vs. DCS-ctrl; breakdown: HW, kernel, data copy, dev ctrl]
[Callouts: 42%, 72%]

SLIDE 26

Reducing CPU Utilization

  • Swift & HDFS workloads
    − Offload device control & data transfers to hardware

[Chart: Swift, normalized CPU utilization for SW opt, SW opt + P2P, and DCS-ctrl; breakdown: kernel (GET), kernel (PUT), GPU control, others]
[Chart: HDFS, normalized CPU utilization (send and recv) for SW opt, SW opt + P2P, and DCS-ctrl; breakdown: kernel (sender), kernel (receiver), GPU control, others]
[Callouts: 50%, 52%, 49%]

SLIDE 27

Scalability: More Devices

  • Swift & HDFS workloads
    − More CPU-efficient → support more high-performance devices

[Chart: Swift; axes: CPU utilization (# cores) and throughput (Gbps); series: SW opt, SW opt + P2P, DCS-ctrl]
[Chart: HDFS; axes: CPU utilization (# cores) and throughput (Gbps); series: SW opt, SW opt + P2P, DCS-ctrl]

SLIDE 28

Conclusion

  • Fast & flexible device-control mechanism
    − Hardware-based device-control (HDC) mechanism
    − FPGA-based standard device controllers
    − Near-device data processing (NDP) units
  • Real hardware prototype evaluation
    − 72% faster inter-device communication
    − 50% lower CPU utilization for Swift & HDFS

SLIDE 29

Thank you!


We will release our IP & tools soon!