Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing
Changwoo Min, Woonhak Kang, Mohan Kumar, Sanidhya Kashyap, Steffen Maass, Heeseung Jo, Taesoo Kim
Virginia Tech, eBay, Georgia Tech, Chonbuk National University


slide-1
SLIDE 1

Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing

Changwoo Min, Woonhak Kang, Mohan Kumar, Sanidhya Kashyap, Steffen Maass, Heeseung Jo, Taesoo Kim
Virginia Tech, eBay, Georgia Tech, Chonbuk National University
April 26, 2018

Changwoo Min Solros: Data-Centric OS April 26, 2018 1 / 21

slide-4
SLIDE 4

Cambrian Explosion of Processor Architecture

Specialization of general-purpose processors
Generalization of co-processors
Specialization of co-processors

slide-7
SLIDE 7

Blazingly fast IO Devices

Blazingly fast storage/memory
Blazingly fast network

How to exploit the full potential of such hardware devices without pain?
  System-wide performance
  Ease of programming

slide-8
SLIDE 8

Outline

1. Heterogeneous Computing Architectures
2. Solros: Split-Kernel Approach
     Solros Architecture
     Operating System Services
3. Evaluation

slide-13
SLIDE 13

Host-Centric Approach

Host OS controls co-processors and IO devices
Examples: OpenCL, CUDA

[Diagram: host processor (application, OS, memory, core), I/O device (SSD / NIC), and co-processor (application, memory, core), connected by control and data arrows in steps ① ② ③.]

Problem

Redundant data communication
Complex to program and hard to optimize
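
The "redundant data communication" problem can be made concrete with a toy model that counts how many bytes cross the interconnect per request. This is only a sketch: the device names and the two hop lists are illustrative assumptions (data staged in host memory before reaching the co-processor, versus a single direct transfer), not something the slides spell out.

```python
# Toy model of interconnect traffic. The hop lists below are assumptions made
# for illustration; the slides only number the steps in the diagram.
HOST_CENTRIC = [("ssd", "host_mem"), ("host_mem", "coproc_mem")]
DIRECT = [("ssd", "coproc_mem")]


def bytes_moved(path, payload_bytes):
    """Total bytes crossing the interconnect for one request on this path."""
    return len(path) * payload_bytes


def redundant_bytes(path, payload_bytes):
    """Bytes moved beyond the single transfer that a direct path needs."""
    return bytes_moved(path, payload_bytes) - bytes_moved(DIRECT, payload_bytes)
```

Under these assumptions a 4 KB read on the staged path moves the payload twice, so half of the traffic is redundant.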

slide-17
SLIDE 17

Coprocessor-Centric Architecture

Co-processors control IO devices
Examples: Xeon Phi (Linux), GPUfs [ASPLOS’13], GPUnet [OSDI’14]

[Diagram: host processor (application, memory, core), I/O device (SSD / NIC), and co-processor (application, OS, memory, core), connected by control and data arrows in steps ① ②.]

Problem

Significant effort required for porting IO stack to co-processor
Not completely exploiting powerful host processors

slide-20
SLIDE 20

Solros Goal

Ease of programming
Best use of processor architecture
System-wide optimization

Challenge

Co-processor needs IO abstraction
IO stack is branch-divergent and difficult to parallelize
System-wide optimization needs system-wide information

slide-21
SLIDE 21

Solros Architecture

Split-Kernel Architecture

Data-plane OS
  Runs on a co-processor
  Provides IO abstraction
  Delegates actual IO operations to a control-plane OS

Control-plane OS
  Runs on a host processor
  Runs actual IO stack
  Performs system-wide coordination
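
The division of labor above can be sketched as a stub that exposes an IO call and a proxy that does the actual work. This is a minimal illustration, not the Solros code: the class names, the tuple request format, the in-memory "file system", and the use of `queue.Queue` as a stand-in for the PCIe transport are all assumptions.

```python
import queue


class ControlPlaneProxy:
    """Stands in for the host-side OS, which runs the actual IO stack."""

    def __init__(self, files):
        self._files = files  # hypothetical in-memory "file system": path -> bytes

    def handle(self, request):
        op, path, offset, length = request
        if op != "read":
            raise ValueError(f"unsupported operation: {op}")
        return self._files[path][offset:offset + length]


class DataPlaneStub:
    """Stands in for the co-processor OS: offers read() and forwards the work."""

    def __init__(self, proxy):
        self._proxy = proxy
        self._channel = queue.Queue()  # placeholder for the real transport

    def read(self, path, offset, length):
        # The stub only packages the request; the proxy performs the IO.
        self._channel.put(("read", path, offset, length))
        return self._proxy.handle(self._channel.get())
```

For example, `DataPlaneStub(ControlPlaneProxy({"/a": b"hello world"})).read("/a", 6, 5)` returns `b"world"`, with no file system logic on the stub side.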

slide-26
SLIDE 26

Solros Architecture

Control-plane OS: actual OS service + system-wide coordination
Data-plane OS: thin communication layer to host processor

[Diagram: host processor (application, OS proxy, memory, core), I/O device (SSD / NIC), co-processor (application, OS stub, memory, core), and a policy component, connected by control and data arrows in steps ① ② ③.]

Co-processor has OS abstraction with minimal effort
Best use of each of the fat and lean processors
Efficient global coordination among devices (policy)

slide-27
SLIDE 27

Operating System Services

1. Transport service
2. Filesystem service
3. Network service

slide-31
SLIDE 31

Transport Service

High-performance data transfer among devices is challenging:
  Uniform data transfer among devices
  High contention in massively-parallel co-processor
  Asymmetric performance between host processor and co-processor

Our approach

Uniform data transfer ⇒ system-mapped PCIe window
High contention ⇒ combining, replication, interleaving, etc.
Asymmetric performance ⇒ flexibly configurable (host DMA engine vs. co-processor DMA engine)

See details in the paper
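
To give a feel for "combining" as a remedy for contention, the sketch below batches pending requests so that the shared tail index of a ring buffer is written once per batch rather than once per request. It is a single-threaded illustration under assumed names (`CombiningRing`, `submit`, `combine`, `drain`); the actual Solros transport is described in the paper and is not reproduced here.

```python
class CombiningRing:
    """Ring buffer whose tail advances once per batch, not once per item."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._head = 0          # next slot to consume
        self._tail = 0          # next slot to fill
        self._pending = []      # requests recorded but not yet published
        self.tail_updates = 0   # how often the shared tail was written

    def submit(self, item):
        """Record a request locally; no shared state is touched yet."""
        self._pending.append(item)

    def combine(self):
        """Publish as many pending requests as fit, with one tail update."""
        free = len(self._slots) - (self._tail - self._head)
        batch, self._pending = self._pending[:free], self._pending[free:]
        for i, item in enumerate(batch):
            self._slots[(self._tail + i) % len(self._slots)] = item
        if batch:
            self._tail += len(batch)
            self.tail_updates += 1
        return len(batch)

    def drain(self):
        """Consume everything that has been published so far."""
        out = []
        while self._head < self._tail:
            out.append(self._slots[self._head % len(self._slots)])
            self._head += 1
        return out
```

In a real massively-parallel setting the point of the batch is that many cores contend on one update instead of many; here the `tail_updates` counter simply makes that reduction visible.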

slide-36
SLIDE 36

Filesystem Service

Peer-to-peer operation Buffered operation

[Diagram: application and file system stub on the co-processor; file system proxy and file system on the host processor; SSD and DMA engine; components linked over PCIe by control and data arrows, with steps ① and ③ labeled.]

Zero-copy of data between co-processor memory and SSD
Minimal data transfer

slide-41
SLIDE 41

Filesystem Service

Peer-to-peer operation Buffered operation

[Diagram: application and file system stub on the co-processor; file system proxy, buffer cache, and file system on the host processor; SSD and DMA engine; components linked over PCIe by control and data arrows, with steps ① ③ ④ ⑥ labeled.]

Reduce storage IO by leveraging shared buffer cache among co-processors
Avoid performance anomaly of peer-to-peer communication over PCIe bus
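
The difference between the two modes can be sketched as a proxy that either consults a shared host-side cache or goes straight to the device. This is a simplified model under assumptions: the class and method names are hypothetical, the SSD is a dictionary of blocks, and the choice of mode is passed in by the caller rather than decided by any policy.

```python
class FileSystemProxy:
    """Host-side proxy serving block reads in buffered or peer-to-peer mode."""

    def __init__(self, ssd_blocks):
        self._ssd = ssd_blocks   # block number -> bytes, stands in for the SSD
        self._cache = {}         # shared buffer cache on the host
        self.ssd_reads = 0       # number of reads that reached the device

    def read(self, block, buffered):
        if buffered:
            # Buffered mode: a block fetched once is reused by later readers.
            if block not in self._cache:
                self._cache[block] = self._ssd_read(block)
            return self._cache[block]
        # Peer-to-peer mode: bypass the cache and read from the device.
        return self._ssd_read(block)

    def _ssd_read(self, block):
        self.ssd_reads += 1
        return self._ssd[block]
```

Two buffered reads of the same block reach the device once, while two peer-to-peer reads reach it twice, which is the storage IO saving the slide refers to.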

slide-42
SLIDE 42

Implementation

Host: 2-socket Xeon processor (12 cores each)
Co-processor: 4 Xeon Phi (KNC, 61 cores, Linux, PCIe Gen 3 x16)
Storage device: 4 NVMe SSD
NIC: 100 Gbps Ethernet

Lines of code
Module                        Added lines   Deleted lines
Transport service                   1,035             365
File system service (stub)          5,957           2,073
File system service (proxy)         2,338             124
Network service (stub)              2,921              79
Network service (proxy)             5,609              34
NVMe device driver                    924              25
SCIF kernel module                     60              14
Total                              18,844           2,714
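
As a quick arithmetic check on the lines-of-code table, the per-module counts can be summed and compared with the Total row. The dictionary keys are shorthand labels chosen for this sketch; the numbers are copied from the table.

```python
# (added, deleted) line counts per module, copied from the table.
CHANGES = {
    "transport": (1035, 365),
    "fs_stub": (5957, 2073),
    "fs_proxy": (2338, 124),
    "net_stub": (2921, 79),
    "net_proxy": (5609, 34),
    "nvme_driver": (924, 25),
    "scif_module": (60, 14),
}


def totals(changes):
    """Return (total added, total deleted) across all modules."""
    added = sum(a for a, _ in changes.values())
    deleted = sum(d for _, d in changes.values())
    return added, deleted
```

Calling `totals(CHANGES)` gives `(18844, 2714)`, which agrees with the Total row.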

slide-43
SLIDE 43

Evaluation

Questions:
  Performance of Solros services
  Impact on real-world applications

slide-44
SLIDE 44

Performance of Solros Services

[Figure: (a) file random read on SSD, throughput in GB/sec over sizes up to 4 MB, Phi-Solros vs. Phi-Linux; (b) TCP latency for a 64-byte message, percentage of requests (%) versus latency (usec), Phi-Solros vs. Phi-Linux.]

File IO performance: 19x faster than the stock Linux on Xeon Phi
TCP latency (99th percentile): 7x shorter than the stock Linux on Xeon Phi
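
For readers unfamiliar with tail-latency metrics, the sketch below shows one common way to compute a 99th-percentile latency from a list of samples, the nearest-rank method. It is an illustration only: the slides do not say which percentile definition was used, and the function name is hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

With 100 samples the 99th percentile is the 99th smallest value, so a single extreme outlier is excluded while the rest of the tail still counts.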

slide-45
SLIDE 45

Performance of Solros Services

[Figure: latency breakdown (msec) for Phi-Linux and Phi-Solros: (a) file random read on SSD, split into storage, block/transport, and file system; (b) TCP latency for a 64-byte message, split into proxy/transport and network stack.]

Significant performance gain in data transport
Running IO stack on co-processor is slower

slide-46
SLIDE 46

Real-world Application - Image Search

Image search engine is running on Xeon Phi
Image database is on NVMe SSD (shared read-only)
Image search queries are from network

[Figure: elapsed time (sec) for Phi-Solros vs. Phi-Linux, and bandwidth (MB/s) over time (sec) for Phi-Solros vs. Phi-Linux.]

Solros performs 2x faster than stock Linux on Xeon Phi

slide-47
SLIDE 47

Conclusion

Solros, a new operating system architecture for co-processors and fast IO devices
Control-plane, data-plane architecture allows:
  Supporting high-level OS abstraction on co-processor
  Efficient global coordination among devices
  Near ideal IO performance from co-processor

We will release source code soon

slide-48
SLIDE 48

Image Search - Scalability

Increase the number of Xeon Phi to 4

[Figure: normalized speedup versus the number of Xeon Phi co-processors (1 to 4).]

Solros load balancing mechanism achieves near linear scaling

slide-49
SLIDE 49

Related work

Control-plane/data-plane OS: Arrakis [OSDI’14], IX [OSDI’14]
OS for heterogeneous systems: Helios [SOSP’09], M3 [ASPLOS’16], Hydra [ASPLOS’08]
IO support for GPU: PTask [SOSP’11], GPUfs [ASPLOS’13], GPUnet [OSDI’14]

slide-50
SLIDE 50

Transport service

[Diagram: host's physical address space (host RAM, PCIe window with mmio regions); device's physical address space for NIC, Xeon Phi, and NVMe (DRAM and mmio each); host's virtual address space (App 1, App 2, ..., mmio). Arrows: mapped to (across devices), mapped to (in-host).]

slide-51
SLIDE 51

Network Service (TCP)

Outbound operation Inbound operation

[Diagram: application and TCP stub on the co-processor; TCP proxy, load balancer, and TCP stack on the host processor; NIC; components linked over PCIe by control and data arrows, with steps ① ② ③ labeled.]

A load balancer on the host distributes incoming TCP connections to one of the least-loaded co-processors.
See details in the paper.
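
The load-balancing rule can be sketched as picking the co-processor with the fewest active connections. This is a minimal sketch under assumptions: the slides do not define how load is measured, so counting active connections is an illustrative choice, and the class and method names are hypothetical.

```python
class LoadBalancer:
    """Assigns each incoming connection to the least-loaded co-processor."""

    def __init__(self, coprocessors):
        # Load metric (assumption): number of currently active connections.
        self._active = {name: 0 for name in coprocessors}

    def assign(self):
        """Pick the co-processor with the fewest active connections.
        Ties go to the one listed first."""
        target = min(self._active, key=self._active.get)
        self._active[target] += 1
        return target

    def close(self, name):
        """Record that a connection on the given co-processor has ended."""
        if self._active[name] == 0:
            raise ValueError(f"no active connections on {name}")
        self._active[name] -= 1
```

With two co-processors, successive calls to `assign()` alternate between them until connections close, at which point new connections flow to whichever one has freed up.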

slide-52
SLIDE 52

Discussion

Hardware support other than Xeon Phi
  Two atomic instructions: transport service
  MMU: isolation among co-processor applications

Scalability of control-plane OS
  Limited by scalability of OS service, PCIe interconnect, and performance of IO devices

slide-53
SLIDE 53

Real-world Application - Text Search

CLucene text indexing engine running on Xeon Phi
Text data is on NVMe SSD

[Figure: elapsed time (sec) for Phi-Solros vs. Phi-Linux, and bandwidth (GB/s) over time (sec) for Phi-Solros vs. Phi-Linux (virtio).]

Solros performs 19x faster than stock Linux (ext4/virtio) on Xeon Phi