SLIDE 1

Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation

Yiying Zhang

SLIDE 2

SLIDE 3

Monolithic Computer
OS / Hypervisor

SLIDE 4

Can monolithic servers continue to meet datacenter needs?

Hardware Heterogeneity · Application Flexibility · Perf / $

SLIDE 5

FPGA · GPU · TPU · ASIC · HBM · NVM · NVMe · DNA Storage

SLIDE 6

Making new hardware work with existing servers is like fitting puzzle pieces together.

SLIDE 7

Can monolithic servers continue to meet datacenter needs?

Hardware Heterogeneity · Application Flexibility · Perf / $

SLIDE 8

Poor Hardware Elasticity

  • Hard to change hardware components
    • Add (hotplug), remove, reconfigure, restart
  • No fine-grained failure handling
    • The failure of one device can crash a whole machine

SLIDE 9

Can monolithic servers continue to meet datacenter needs?

Hardware Heterogeneity · Application Flexibility · Perf / $

SLIDE 10

Poor Resource Utilization

  • Whole VM/container has to run on one physical machine
  • Move current applications to make room for new ones

[Figure: CPU and memory available on Server 1 and Server 2 vs. space required by Jobs 1 and 2; leftover resources on each server are wasted]

SLIDE 11

Resource Utilization in Production Clusters

Unused resources, plus jobs waiting or killed because of physical-node constraints

* Google production cluster trace data: https://github.com/google/cluster-data
* Alibaba production cluster trace data: https://github.com/alibaba/clusterdata

SLIDE 12

Can monolithic servers continue to meet datacenter needs?

Hardware Heterogeneity · Application Flexibility · Perf / $

SLIDE 13

How to achieve better heterogeneity, flexibility, and perf/$?

Go beyond the physical-node boundary

SLIDE 14

Resource Disaggregation: breaking monolithic servers into network-attached, independent hardware components

SLIDE 15

SLIDE 16

[Figure: disaggregated hardware components connected by a network, addressing hardware heterogeneity, application flexibility, and perf / $]

SLIDE 17

Why Possible Now?

  • Network is faster
    • InfiniBand (200 Gbps, 600 ns)
    • Optical fabric (400 Gbps, 100 ns)
  • More processing power at the device
    • SmartNIC, SmartSSD, PIM
  • Network interface closer to the device
    • Omni-Path, Innova-2

Intel Rack-Scale System · Berkeley Firebox · IBM Composable System · HP The Machine

SLIDE 18

Disaggregated Datacenter

[Figure: an end-to-end solution spanning hardware, network, OS, and distributed systems, running unmodified applications, with goals of flexibility, cost, performance, reliability, and heterogeneity]

SLIDE 19

Disaggregated Datacenter: End-to-End Solution

  • Physically disaggregated resources: new processor and memory architecture; Disaggregated Operating System (OSDI'18)
  • Networking for disaggregated resources: RDMA network; Kernel-Level RDMA Virtualization (SOSP'17)

SLIDE 20

SLIDE 21

Can Existing Kernels Fit?

[Figure: existing kernel models. Monolithic and micro-kernels (e.g., Linux, L4) manage a whole server's CPU, memory, disk, and NIC, with networking only across servers; multikernels (e.g., Barrelfish, Helios, fos) run a kernel per core or device but still rely on shared main memory within one monolithic server]

SLIDE 22

Existing Kernels Don't Fit

  • Access remote resources
  • Distributed resource mgmt
  • Fine-grained failure handling

SLIDE 23

When hardware is disaggregated, the OS should be too.

SLIDE 24

[Figure: a traditional OS bundles process management, the virtual memory system, the file & storage system, and networking in one kernel]

SLIDE 25

[Figure: the same OS functions split apart, each attached to the network]

SLIDE 26

The Splitkernel Architecture

  • Split OS functions into monitors
  • Run each monitor at a hardware device
  • Network messaging across non-coherent components
  • Distributed resource mgmt and failure handling

[Figure: process, memory, GPU, NVM, SSD, HDD, and XPU monitors, each running at its hardware component, communicating through network messaging across non-coherent components]
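
To make the monitor-and-messaging structure concrete, here is a minimal sketch of the pattern a splitkernel implies, in C. The message types, field names, and the net_recv/net_reply primitives are assumptions for illustration only; LegoOS's actual protocol is an RDMA-based RPC layer.

```c
/* Minimal sketch of splitkernel-style messaging. All names here are
 * illustrative assumptions, not LegoOS's real protocol. */
#include <stdint.h>

enum msg_type { MSG_ALLOC_VM, MSG_PAGE_FETCH, MSG_PAGE_FLUSH };

struct msg {
    enum msg_type type;
    uint64_t vaddr;        /* address the processor component refers to */
    uint32_t len;
};

/* Assumed network primitives; in LegoOS this role is played by an RDMA RPC stack. */
int net_recv(struct msg *req);
int net_reply(const struct msg *resp);

/* A monitor is a service loop running at its hardware component; there is
 * no cache coherence with other components, so all shared state is reached
 * through explicit messages. */
void memory_monitor_loop(void)
{
    struct msg req, resp;
    for (;;) {
        if (net_recv(&req) != 0)
            continue;
        resp = req;                      /* echo the header; payload omitted in this sketch */
        switch (req.type) {
        case MSG_ALLOC_VM:   break;      /* grow the requester's virtual address space */
        case MSG_PAGE_FETCH: break;      /* return the page backing req.vaddr          */
        case MSG_PAGE_FLUSH: break;      /* accept a dirty page written back to us     */
        }
        net_reply(&resp);
    }
}
```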

SLIDE 27

LegoOS: The First Disaggregated OS

[Figure: LegoOS components — processor, memory, storage, and NVM]

SLIDE 28

How Should LegoOS Appear to Users?

As a giant machine? As a set of hardware devices?

  • Our answer: as a set of virtual Nodes (vNodes)
    • Similar semantics to virtual machines
    • Unique vID, vIP, storage mount point
    • Can run on multiple processor, memory, and storage components

SLIDE 29

Abstraction - vNode

One vNode can run across multiple hardware components; one hardware component can host multiple vNodes.

[Figure: two vNodes mapped onto process, memory, GPU, NVM, SSD, HDD, and XPU monitors, connected by network messaging across non-coherent components]

SLIDE 30

Abstraction

  • Appear as vNodes to users
  • Linux ABI compatible
    • Support unmodified Linux system call interface (common ones)
    • A level of indirection to translate the Linux interface to the LegoOS interface (see the sketch below)
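
As a rough illustration of that indirection layer (not LegoOS's actual code; handler names and table size are assumptions), a dispatch table can map Linux syscall numbers onto handlers that forward work to the appropriate monitor:

```c
/* Sketch of a Linux-ABI indirection layer: dispatch by syscall number to
 * handlers that forward to the right LegoOS monitor. Names are hypothetical;
 * x86-64 syscall numbers shown for read (0) and mmap (9). */
#include <stdint.h>
#include <errno.h>

typedef long (*sys_handler_t)(uint64_t, uint64_t, uint64_t);

/* Stubs standing in for handlers that message the storage and memory
 * monitors over the network. */
static long lego_sys_read(uint64_t fd, uint64_t buf, uint64_t count)
{ (void)fd; (void)buf; (void)count; return 0; }
static long lego_sys_mmap(uint64_t addr, uint64_t len, uint64_t prot)
{ (void)addr; (void)len; (void)prot; return 0; }

static sys_handler_t syscall_table[512] = {
    [0] = lego_sys_read,   /* __NR_read */
    [9] = lego_sys_mmap,   /* __NR_mmap */
    /* ... the prototype implements ~113 common Linux syscalls ... */
};

long lego_syscall_dispatch(long nr, uint64_t a0, uint64_t a1, uint64_t a2)
{
    if (nr < 0 || nr >= 512 || syscall_table[nr] == 0)
        return -ENOSYS;                /* syscall not supported by the layer */
    return syscall_table[nr](a0, a1, a2);
}
```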

SLIDE 31

LegoOS Design

  1. Clean separation of OS and hardware functionalities
  2. Build monitors with hardware constraints
  3. RDMA-based message passing for both kernel and applications
  4. Two-level distributed resource management
  5. Memory failure tolerance through replication

SLIDE 32

Separate Processor and Memory

[Figure: today's processor — CPUs, private and last-level caches, TLB, MMU, page table, and DRAM all in one server]

SLIDE 33

Separate Processor and Memory

Separate and move hardware units to the memory component.

SLIDE 34

Separate Processor and Memory

SLIDE 35

Separate Processor and Memory

Separate and move the virtual memory system to the memory component.

[Figures on slides 33-35: the processor keeps its CPUs and caches; DRAM, the TLB, the MMU, the page table, and the virtual memory system move to a memory component reached over the network]

SLIDE 36

Separate Processor and Memory

  • Processor components only see virtual memory addresses
  • Memory components manage virtual and physical memory
  • All levels of cache on the processor are virtual caches
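
A minimal sketch of what that split implies, using hypothetical structures rather than LegoOS's actual data structures: the memory monitor alone holds the virtual-to-physical mapping and serves requests that are named purely by virtual address.

```c
/* Sketch: the memory component owns virtual-to-physical translation.
 * The processor side only ever sends virtual addresses and never sees
 * a physical address. Structures are illustrative assumptions. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096ULL

struct vma_entry { uint64_t vstart, vend, poff; };  /* poff: offset into local DRAM */

struct mm_state {
    struct vma_entry vmas[64];
    int nr_vmas;
    uint8_t *local_dram;       /* this memory component's physical memory */
};

/* Serve a page fetch: translate vaddr here, then copy the backing page out. */
int serve_page_fetch(struct mm_state *mm, uint64_t vaddr, uint8_t *reply)
{
    for (int i = 0; i < mm->nr_vmas; i++) {
        struct vma_entry *v = &mm->vmas[i];
        if (vaddr >= v->vstart && vaddr < v->vend) {
            uint64_t off = (vaddr - v->vstart) & ~(PAGE_SIZE - 1);
            memcpy(reply, mm->local_dram + v->poff + off, PAGE_SIZE);
            return 0;
        }
    }
    return -1;  /* unmapped: the monitor would raise a fault to the process monitor */
}
```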

SLIDE 37

Challenge: the network is 2x-4x slower than the memory bus

SLIDE 38

Add Extended Cache at Processor

SLIDE 39

Add Extended Cache at Processor

  • Add a small DRAM/HBM at the processor
  • Use it as an Extended Cache, or ExCache
    • Software and hardware co-managed
    • Inclusive
    • Virtual cache

[Figure: the processor component gains a local DRAM used as ExCache between its last-level cache and the remote memory component]
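
A minimal sketch of the software-managed miss path this implies, with assumed function names (mem_fetch_page and mem_flush_page stand in for the RDMA RPCs to the memory monitor): hardware handles hits, and software handles misses at 4KB-page granularity.

```c
/* Sketch of the ExCache idea: hardware handles hits, a software miss
 * handler fetches the 4KB backing page from the memory component over
 * the network. All names are assumed for illustration. */
#include <stdint.h>
#include <stdbool.h>

#define EXCACHE_LINE   4096ULL             /* one virtual page per cache line */
#define EXCACHE_SLOTS  (1u << 16)          /* e.g., 256 MB of ExCache */

struct excache_slot {
    uint64_t tag;        /* virtual page number stored in this slot */
    bool     valid;
    bool     dirty;
    uint8_t *data;       /* points into the reserved processor DRAM */
};

static struct excache_slot excache[EXCACHE_SLOTS];

/* Assumed RPCs to the memory monitor. */
int mem_fetch_page(uint64_t vpage, uint8_t *dst);
int mem_flush_page(uint64_t vpage, const uint8_t *src);

/* Software miss path, entered only when the hardware hit path fails. */
uint8_t *excache_miss(uint64_t vaddr)
{
    uint64_t vpage = vaddr / EXCACHE_LINE;
    struct excache_slot *slot = &excache[vpage % EXCACHE_SLOTS]; /* direct-mapped for simplicity */

    if (slot->valid && slot->dirty)
        mem_flush_page(slot->tag, slot->data);   /* write back the evicted page */

    mem_fetch_page(vpage, slot->data);           /* pull the page over the network */
    slot->tag = vpage;
    slot->valid = true;
    slot->dirty = false;
    return slot->data + (vaddr % EXCACHE_LINE);
}
```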

SLIDE 40

LegoOS Design

  1. Clean separation of OS and hardware functionalities
  2. Build monitors with hardware constraints
  3. RDMA-based message passing for both kernel and applications
  4. Two-level distributed resource management
  5. Memory failure tolerance through replication

SLIDE 41

Distributed Resource Management

Global resource managers: Global Process Manager (GPM), Global Memory Manager (GMM), Global Storage Manager (GSM)

  1. Coarse-grain allocation
  2. Load balancing
  3. Failure handling

[Figure: the global managers sit above the per-component process, memory, GPU, NVM, SSD, HDD, and XPU monitors, which exchange network messages across non-coherent components]
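
A minimal sketch of the two-level idea, with illustrative structures and policy (not LegoOS's actual manager code): the global memory manager only makes coarse-grained placement decisions, and the chosen memory monitor then manages everything inside its region at fine granularity.

```c
/* Sketch of two-level memory management: the GMM does coarse-grained
 * placement across memory components; each memory monitor allocates
 * pages within its assigned region on its own. Illustrative only. */
#include <stdint.h>

#define VREGION_SIZE (1ULL << 30)    /* e.g., allocate in 1 GB virtual regions */

struct mem_component {
    int      id;
    uint64_t free_bytes;             /* reported periodically by its monitor */
};

/* GMM: pick the memory component with the most free space to own a new
 * coarse-grained virtual region. */
int gmm_assign_vregion(struct mem_component *comps, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (comps[i].free_bytes >= VREGION_SIZE &&
            (best < 0 || comps[i].free_bytes > comps[best].free_bytes))
            best = i;
    return best;                     /* -1 means no component can host it */
}
```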

SLIDE 42

Implementation and Emulation

  • Processor
    • Reserve DRAM as ExCache (4KB page as cache line)
    • h/w only on the hit path, s/w-managed miss path
    • Indirection layer stores state for 113 Linux syscalls
  • Memory
    • Limit number of cores, kernel space only
  • Storage / Global Resource Monitors
    • Implemented as kernel modules on Linux
  • Network
    • RDMA RPC stack based on LITE [SOSP'17]

[Figure: emulation setup — a processor machine (CPUs, LLC, ExCache) running the process monitor, a memory machine running the memory monitor, and a storage machine running Linux kernel modules, connected by an RDMA network]

SLIDE 43

Performance Evaluation

  • Unmodified TensorFlow, running CIFAR-10
    • Working set: 0.9 GB
    • 4 threads
  • Systems in comparison
    • Baseline: Linux with unlimited memory
    • Linux swapping to SSD and to ramdisk
    • InfiniSwap [NSDI'17]

[Figure: slowdown vs. ExCache/memory size (128, 256, 512 MB) for Linux-swap-SSD, Linux-swap-ramdisk, InfiniSwap, and LegoOS; LegoOS config: 1 processor, 1 memory, 1 storage component]

Only 1.3x to 1.7x slowdown when disaggregating devices with LegoOS, in exchange for better resource packing, elasticity, and fault tolerance.

SLIDE 44

LegoOS Summary

  • Resource disaggregation calls for a new system
  • LegoOS: a new OS designed and built from scratch for datacenter resource disaggregation
  • Splits the OS into distributed micro-OS services, running at the devices
  • Many challenges, much potential

SLIDE 45

Disaggregated Datacenter

Goals: flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use

  • Physically disaggregated resources: new processor and memory architecture; Disaggregated Operating System (OSDI'18)
  • Networking for disaggregated resources: RDMA network; Kernel-Level RDMA Virtualization (SOSP'17)

SLIDE 46

Network Requirements for Resource Disaggregation

  • Low latency
  • High bandwidth
  • Scalability
  • Reliability

Candidate: RDMA

SLIDE 47

RDMA (Remote Direct Memory Access)

  • Directly read/write remote memory
  • Bypass the kernel
  • Zero-copy memory access

Benefits: low latency, high throughput, low CPU utilization

[Figure: sockets over Ethernet traverse both hosts' kernels, while RDMA lets the NIC access application memory directly]
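
To make RDMA's one-sided, kernel-bypass access concrete, here is a minimal sketch of posting a one-sided write with libibverbs. It assumes a queue pair that is already connected, a buffer already registered as a memory region, and a remote address and rkey exchanged out of band; error handling and completion polling are omitted.

```c
/* Minimal sketch of a one-sided RDMA write with libibverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
               void *local_buf, size_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,              /* local protection key */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;   /* one-sided: remote CPU not involved */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;   /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;       /* remote protection key */

    return ibv_post_send(qp, &wr, &bad_wr);  /* 0 on success */
}
```

The setup this assumes (the QP connection state machine, memory registration, key exchange) is exactly the low-level burden the following slides contrast with LITE.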

SLIDE 48

Things have worked well in HPC

  • Special hardware
  • Few applications
  • Cheaper developers

SLIDE 49

RDMA-Based Datacenter Applications

RSI [VLDB '16] · DrTM+R [EuroSys '16] · FaRM [NSDI '14] · FaRM+Xact [SOSP '15] · HERD [SIGCOMM '14] · HERD-RPC [ATC '16] · FaSST [OSDI '16] · Octopus [ATC '17] · Pilaf [ATC '13] · Hotpot [SoCC '17] · Wukong [OSDI '16] · APUS [SoCC '17] · DrTM [SOSP '15] · NAM-DB [VLDB '17] · Mojim [ASPLOS '15] · Cell [ATC '16]

SLIDE 50

Things have worked well in HPC

  • Special hardware
  • Few applications
  • Cheaper developers

What about datacenters?

  • Commodity, cheaper hardware
  • Many (changing) applications
  • Resource sharing and isolation

SLIDE 51

Native RDMA

[Figure: a user-level RDMA application bypasses the kernel; the application library manages connections, queues, lkeys/rkeys, and memory space, while the RNIC performs permission checks and address mapping using per-region keys and cached page-table entries]

SLIDE 52

Abstraction Mismatch

Developers want what sockets give them: a high-level abstraction that is easy to use and supports resource sharing and isolation. Native RDMA is low level, difficult to use, and difficult to share, pushing all management into user space and hardware. The result: fat applications and no resource sharing.

SLIDE 53

Things have worked well in HPC

  • Special hardware
  • Few applications
  • Cheaper developers

What about datacenters?

  • Commodity, cheaper hardware
  • Many (changing) applications
  • Resource sharing and isolation

SLIDE 54

Native RDMA

[Figure: the same diagram as slide 51, highlighting that connection, queue, key, and memory management live entirely in user space and hardware]

SLIDE 55

Expensive, unscalable hardware: on-NIC SRAM stores and caches metadata

[Figure: native RDMA write throughput (requests/us) for 64B and 1KB writes as the total registered memory size grows from 1 MB to 1 GB]

SLIDE 56

Things have worked well in HPC

  • Special hardware
  • Few applications
  • Cheaper developers

What about datacenters?

  • Commodity, cheaper hardware
  • Many (changing) applications
  • Resource sharing and isolation

SLIDE 57

Are we removing too much from the kernel?

Fat applications · No resource sharing · Expensive, unscalable hardware

SLIDE 58

LITE - Local Indirection TiEr

Brings back what kernel bypass gave up: protection, performance isolation, resource sharing, and a high-level abstraction.

SLIDE 59

[Figure: native RDMA again — the user-level application handles connection and memory management and holds per-region lkeys/rkeys; the RNIC does permission checks and address mapping with cached PTEs]

SLIDE 60

LITE

[Figure: LITE moves connection, queue, key, and memory-space management into kernel space and exposes LITE APIs (memory, RPC/messaging, synchronization) to applications — simpler applications]

SLIDE 61

LITE

[Figure: with LITE, the RNIC keeps only global lkeys/rkeys instead of per-application keys, while fine-grained permission checks move into the kernel — cheaper hardware, scalable performance]

SLIDE 62

Implementing remote memset: native RDMA vs. LITE (code comparison on the slide; a sketch of the contrast follows)
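
A rough sketch of that contrast, using hypothetical lt_* names as stand-ins for LITE's memory API (the real LITE function names differ): with LITE the application allocates a remote region once and writes through a handle, while native RDMA leaves connection setup, registration, and key management to the application itself.

```c
/* Rough sketch of the contrast on slide 62. The lt_* names are
 * hypothetical, not LITE's real API; the native-RDMA side lists the
 * steps an application must own itself. */
#include <stdint.h>
#include <stddef.h>

/* --- LITE-style: the kernel tier owns connections and keys. --- */
typedef uint64_t lt_handle_t;                          /* opaque handle to remote memory */
lt_handle_t lt_alloc_remote(int node, size_t len);     /* assumed API */
int         lt_write(lt_handle_t h, size_t off, const void *buf, size_t len);

int remote_memset_lite(int node, size_t len)
{
    static char zeros[4096];
    lt_handle_t h = lt_alloc_remote(node, len);
    for (size_t off = 0; off < len; off += sizeof(zeros))
        lt_write(h, off, zeros, sizeof(zeros));        /* no keys, QPs, or MRs exposed */
    return 0;
}

/* --- Native RDMA: the application must do all of this itself. ---
 * 1. open the device, allocate a PD, create and poll a completion queue
 * 2. create a queue pair and drive it through INIT/RTR/RTS
 * 3. exchange QP numbers, remote addresses, and rkeys out of band
 * 4. ibv_reg_mr() every buffer and track every lkey/rkey
 * 5. build ibv_send_wr work requests and poll for completions
 */
```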

SLIDE 63

Main challenge: how to preserve the performance benefits of RDMA?

SLIDE 64

LITE Design Principles

"All problems in computer science can be solved by another level of indirection, except for the problem of too many layers of indirection." – David Wheeler

  1. Indirection only at the local node
  2. Avoid hardware-level indirection
  3. Hide kernel-space crossing cost

Result: great performance and scalability

SLIDE 65

Scalability with MR size: LITE vs. native RDMA

[Figure: write throughput (requests/us) for 64B and 1KB writes, native RDMA vs. LITE, as total registered memory grows from 1 MB to 1 GB]

LITE scales much better than native RDMA with respect to MR size and count.

SLIDE 66

LITE Application Effort

  • Simple to use
  • Needs no expert knowledge
  • Flexible, powerful abstraction
  • Easy to achieve optimized performance

  Application        LOC    LOC using LITE   Student Days
  LITE-Log           330    36               1
  LITE-MapReduce     600*   49               4
  LITE-Graph         1400   20               7
  LITE-Kernel-DSM    3000   45               26

* LITE-MapReduce is a port of the 3000-LOC Phoenix, with 600 lines changed or added.

SLIDE 67

MapReduce Results

  • LITE-MapReduce adapted from Phoenix [1]

[Figure: runtime (sec) of Hadoop, Phoenix, and LITE-MapReduce on 2, 4, and 8 nodes]

LITE-MapReduce outperforms Hadoop by 4.3x to 5.3x.

[1] Ranger et al., Evaluating MapReduce for Multi-core and Multiprocessor Systems. HPCA 2007.

SLIDE 68

LITE Summary

  • Virtualizes RDMA into a flexible, easy-to-use abstraction
  • Divides work across user space, kernel, and hardware
  • Preserves RDMA's performance benefits
  • Indirection does not always degrade performance!

SLIDE 69

Disaggregated Datacenter

Goals: flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use

  • Physically disaggregated resources: new processor and memory architecture; Disaggregated Operating System (OSDI'18)
  • Networking for disaggregated resources: RDMA network; Kernel-Level RDMA Virtualization (SOSP'17)

SLIDE 70

Disaggregated Datacenter

Goals: flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use

  • Physically disaggregated resources: new processor and memory architecture; Disaggregated OS (OSDI'18)
  • Virtually disaggregated resources: network-attached NVM, disaggregated persistent memory, distributed non-volatile memory; Distributed Shared Persistent Memory (SoCC '17)
  • Networking for disaggregated resources: RDMA network over InfiniBand; Kernel-Level RDMA Virtualization (SOSP'17); new network topology, routing, and congestion control

SLIDE 71

Conclusion

  • New hardware and software trends point to resource disaggregation
  • My research pioneers an end-to-end solution for the disaggregated datacenter
  • Opens up new research opportunities in hardware, software, networking, security, and programming languages

SLIDE 72

Thank you! Questions?

wuklab.io