Farewell to Servers:
Hardware, Software, and Network Approaches towards Datacenter
Resource Disaggregation
Yiying Zhang
Monolithic Computer
OS / Hypervisor
Hardware Heterogeneity, Application Flexibility, Perf / $
FPGA, GPU, TPU, ASIC, HBM, NVM, NVMe, DNA Storage
[Figure: Job 1 and Job 2 placed on Server 1 and Server 2, comparing available space with required space for cpu and mem; the leftover capacity is wasted]
Resource Utilization in Production Clusters
Unused Resource + Waiting/Killed Jobs Because of Physical-Node Constraints
* Google Production Cluster Trace Data. “https://github.com/google/cluster-data”
* Alibaba Production Cluster Trace Data. “https://github.com/alibaba/clusterdata”
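The physical-node constraint above can be illustrated with a small placement sketch. The job sizes, server sizes, and function names below are made up for illustration and are not taken from the Google or Alibaba traces. With monolithic servers a job must fit inside one machine; with disaggregated pools it only needs enough CPU and memory in aggregate.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cpu: int  # cores required (hypothetical numbers)
    mem: int  # GB required (hypothetical numbers)

@dataclass
class Server:
    cpu: int
    mem: int

def place_on_servers(jobs, servers):
    """First-fit placement: each job must fit entirely inside one server."""
    free = [[s.cpu, s.mem] for s in servers]
    placed, waiting = [], []
    for job in jobs:
        for slot in free:
            if slot[0] >= job.cpu and slot[1] >= job.mem:
                slot[0] -= job.cpu
                slot[1] -= job.mem
                placed.append(job.name)
                break
        else:
            # No single server has both enough CPU and enough memory left.
            waiting.append(job.name)
    return placed, waiting

def place_on_pools(jobs, servers):
    """Disaggregated placement: CPU and memory are independent pools."""
    cpu = sum(s.cpu for s in servers)
    mem = sum(s.mem for s in servers)
    placed, waiting = [], []
    for job in jobs:
        if cpu >= job.cpu and mem >= job.mem:
            cpu -= job.cpu
            mem -= job.mem
            placed.append(job.name)
        else:
            waiting.append(job.name)
    return placed, waiting

servers = [Server(cpu=8, mem=32), Server(cpu=8, mem=32)]
jobs = [Job("job1", 6, 8), Job("job2", 2, 28), Job("job3", 6, 24)]

print("monolithic   :", place_on_servers(jobs, servers))
print("disaggregated:", place_on_pools(jobs, servers))
```

In this toy input the two servers together still have 8 free cores and 28 GB after job1 and job2 are placed, yet job3 has to wait because no single server offers both.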
Network
Why Possible Now?
Intel Rack-Scale System, Berkeley Firebox, IBM Composable System, HP The Machine
Disaggregated Datacenter
Flexibility, $ Cost, Performance, Reliability, Heterogeneity, Unmodified Application
Hardware, Network, OS, Dist Sys
End-to-End Solution
Disaggregated Datacenter
Physically Disaggregated Resources
Networking for Disaggregated Resources
RDMA Network
Kernel-Level RDMA Virtualization (SOSP’17)
New Processor and Memory Architecture
End-to-End Solution
Disaggregated Operating System (OSDI’18)
[Figure: a monolithic server in which the cores, GPU, and P-NIC each run a kernel over shared main memory]
Monolithic/Micro-kernel (e.g., Linux, L4)
Multikernel (e.g., Barrelfish, Helios, fos)
[Figure: servers, each with CPU, mem, disk, and NIC under a monolithic kernel or microkernel, connected by a network across servers]
Access remote resources, distributed resource mgmt, fine-grained failure handling
OS
Process Mgmt, Virtual Memory System, File & Storage System, Network
[Figure: the OS functions split apart, each paired with its own network stack and placed at a component such as Processor (CPU) or Memory]
The Splitkernel Architecture
non-coherent components
failure handling
network messaging across non-coherent components
[Figure: one monitor per hardware component. Process Monitor at Processor (CPU), GPU Monitor at Processor (GPU), Memory Monitor at Memory, NVM Monitor at NVM, SSD Monitor at SSD, HDD Monitor at Hard Disk, XPU Manager at new h/w (XPU)]
The First Disaggregated OS
Processor, Storage, Memory, NVM
How Should LegoOS Appear to Users?
, storage mount point
As a giant machine? As a set of hardware devices?
One vNode can run on multiple hardware components.
One hardware component can run multiple vNodes.
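The many-to-many relationship between vNodes and hardware components can be sketched as a pair of lookup tables. This is only an illustration of the mapping stated above; the class name, method names, and component labels are hypothetical and do not come from LegoOS.

```python
from collections import defaultdict

class VNodeMap:
    """Tracks which hardware components back each vNode, and the reverse."""

    def __init__(self):
        self._components_of = defaultdict(set)  # vNode -> set of components
        self._vnodes_on = defaultdict(set)      # component -> set of vNodes

    def assign(self, vnode, component):
        # Record the relationship in both directions.
        self._components_of[vnode].add(component)
        self._vnodes_on[component].add(vnode)

    def components_of(self, vnode):
        return set(self._components_of[vnode])

    def vnodes_on(self, component):
        return set(self._vnodes_on[component])

vmap = VNodeMap()
# vNode1 spans one processor and two memory components (made-up names).
for comp in ("cpu0", "mem0", "mem1"):
    vmap.assign("vNode1", comp)
# vNode2 shares cpu0 with vNode1 and also uses an SSD component.
for comp in ("cpu0", "ssd0"):
    vmap.assign("vNode2", comp)

print(sorted(vmap.components_of("vNode1")))
print(sorted(vmap.vnodes_on("cpu0")))
```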
Separate Processor and Memory
[Figure: a monolithic processor with CPUs, caches ($), a last-level cache, TLB, and MMU, attached to DRAM that holds the page table (PT)]
Separate and move hardware units to memory component
Separate and move virtual memory system to memory component
[Figure: the processor keeps the CPUs and caches; the TLB, MMU, PT, and virtual memory system move to the memory component next to DRAM, reached over the network]
Processor components only see virtual memory addresses.
Memory components manage virtual and physical memory.
All levels of cache are virtual caches.
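A minimal sketch of this split, assuming a 4 KB page size and a simple allocate-on-first-touch policy (both are assumptions for illustration, as are all class and method names). The caller plays the role of a processor component and only ever passes a process ID and a virtual address; the page table and physical frames stay inside the memory component.

```python
PAGE_SIZE = 4096  # assumed page size

class MemoryComponent:
    """Owns the page table and physical frames; callers use virtual addresses only."""

    def __init__(self, num_frames):
        self.frames = [bytearray(PAGE_SIZE) for _ in range(num_frames)]
        self.free_frames = list(range(num_frames))
        self.page_table = {}  # (pid, virtual page number) -> frame index

    def _translate(self, pid, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        key = (pid, vpn)
        if key not in self.page_table:
            # Allocate a physical frame on first touch.
            if not self.free_frames:
                raise MemoryError("memory component is out of frames")
            self.page_table[key] = self.free_frames.pop()
        return self.page_table[key], offset

    def write(self, pid, vaddr, data):
        frame, offset = self._translate(pid, vaddr)
        if offset + len(data) > PAGE_SIZE:
            raise ValueError("this sketch does not split writes across pages")
        self.frames[frame][offset:offset + len(data)] = data

    def read(self, pid, vaddr, length):
        frame, offset = self._translate(pid, vaddr)
        if offset + length > PAGE_SIZE:
            raise ValueError("this sketch does not split reads across pages")
        return bytes(self.frames[frame][offset:offset + length])

mem = MemoryComponent(num_frames=4)
# Two processes use the same virtual address without seeing each other's data.
mem.write(pid=1, vaddr=0x1000, data=b"one")
mem.write(pid=2, vaddr=0x1000, data=b"two")
print(mem.read(1, 0x1000, 3), mem.read(2, 0x1000, 3))
```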
Add Extended Cache at Processor
[Figure: the same processor and memory components, with DRAM added at the processor as an extended cache (ExCache)]
ExCache
managed
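The idea of an extended cache at the processor can be sketched as a virtually tagged, write-back cache that goes to the remote memory component only on a miss. This is a generic LRU sketch under assumed parameters (4 KB lines, two-line capacity in the demo); it is not LegoOS's actual ExCache design, and the `fetch` and `writeback` callables stand in for network requests.

```python
from collections import OrderedDict

LINE_SIZE = 4096  # assumed cache-line granularity for this sketch

class ExCacheSketch:
    """Virtually tagged LRU cache in processor-local DRAM (illustrative only)."""

    def __init__(self, capacity_lines, fetch, writeback):
        self.capacity = capacity_lines
        self.fetch = fetch          # called on a miss: tag -> bytes
        self.writeback = writeback  # called on eviction: (tag, bytes)
        self.lines = OrderedDict()  # (pid, virtual line number) -> bytearray
        self.hits = 0
        self.misses = 0

    def _line(self, pid, vaddr):
        tag = (pid, vaddr // LINE_SIZE)
        if tag in self.lines:
            self.hits += 1
            self.lines.move_to_end(tag)  # mark as most recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                old_tag, old_data = self.lines.popitem(last=False)
                self.writeback(old_tag, bytes(old_data))
            self.lines[tag] = bytearray(self.fetch(tag))
        return self.lines[tag]

    def load(self, pid, vaddr):
        return self._line(pid, vaddr)[vaddr % LINE_SIZE]

    def store(self, pid, vaddr, value):
        self._line(pid, vaddr)[vaddr % LINE_SIZE] = value

# A dict stands in for the remote memory component.
remote = {}

def fetch_from_memory(tag):
    return remote.get(tag, bytes(LINE_SIZE))

def write_to_memory(tag, data):
    remote[tag] = data

cache = ExCacheSketch(2, fetch_from_memory, write_to_memory)
cache.store(1, 0, 7)          # miss
first = cache.load(1, 0)      # hit
cache.store(1, 4096, 9)       # miss
cache.store(1, 8192, 5)       # miss, evicts and writes back line 0
again = cache.load(1, 0)      # miss, line 0 comes back from remote memory
print(first, again, cache.hits, cache.misses)
```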
Distributed Resource Management
Global Process Manager (GPM), Global Memory Manager (GMM), Global Storage Manager (GSM)
Global Resource Mgmt
[Figure: the global managers sit above the per-component monitors (process, GPU, memory, NVM, SSD, HDD), which exchange network messages across non-coherent components]
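One way to picture the two levels is a global manager that picks a component for each coarse-grained request and a per-component monitor that tracks its own usage. The sketch below assumes a "most free memory first" policy purely for illustration; the slides do not state the placement policy, and all names are hypothetical.

```python
class MemoryMonitorSketch:
    """Per-component monitor: tracks local usage only."""

    def __init__(self, name, capacity_mb):
        self.name = name
        self.capacity_mb = capacity_mb
        self.used_mb = 0

    @property
    def free_mb(self):
        return self.capacity_mb - self.used_mb

    def allocate(self, size_mb):
        if size_mb > self.free_mb:
            raise MemoryError(f"{self.name} cannot fit {size_mb} MB")
        self.used_mb += size_mb

class GlobalMemoryManagerSketch:
    """Global manager: chooses which memory component serves a request."""

    def __init__(self, monitors):
        self.monitors = list(monitors)

    def place(self, size_mb):
        candidates = [m for m in self.monitors if m.free_mb >= size_mb]
        if not candidates:
            raise MemoryError("no memory component can fit the request")
        # Assumed policy: pick the component with the most free memory.
        target = max(candidates, key=lambda m: m.free_mb)
        target.allocate(size_mb)
        return target.name

gmm = GlobalMemoryManagerSketch([
    MemoryMonitorSketch("mem0", 1024),
    MemoryMonitorSketch("mem1", 2048),
])
placements = [gmm.place(512) for _ in range(3)]
print(placements)
```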
Implementation and Emulation
[Figure: emulated LegoOS components connected by an RDMA network. Processor: CPUs with LLC and ExCache, running the Process Monitor. Memory: DRAM, running the Memory Monitor. Storage: disk, running a Linux kernel module.]
[Figure: slowdown (1 to 7) vs. ExCache/memory size (128, 256, 512 MB) for Linux-swap-SSD, Linux-swap-ramdisk, InfiniSwap, and LegoOS. LegoOS config: 1P, 1M, 1S]
Only 1.3x to 1.7x slowdown when disaggregating devices with LegoOS
To gain better resource packing, elasticity, and fault tolerance!
scratch for datacenter resource disaggregation
running at device
flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use
Networking for Disaggregated Resources
Kernel-Level RDMA Virtualization (SOSP’17)
Network Requirements for Resource Disaggregation
RDMA (Remote Direct Memory Access)
memory
Benefits: low latency, high throughput, low CPU utilization
[Figure: data path between two hosts (user, kernel, CPU, memory, NIC) for sockets over Ethernet and for RDMA]
Things have worked well in HPC
RDMA-Based Datacenter Applications
RSI [VLDB ’16]
DrTM+R [EuroSys ’16]
FaRM [NSDI ’14]
FaRM+Xact [SOSP ’15]
HERD [SIGCOMM ’14]
HERD-RPC [ATC ’16]
FaSST [OSDI ’16]
Octopus [ATC ’17]
Pilaf [ATC ’13]
Hotpot [SoCC ’17]
Wukong [OSDI ’16]
APUS [SoCC ’17]
DrTM [SOSP ’15]
NAM-DB [VLDB ’17]
Mojim [ASPLOS ’15]
Cell [ATC ’16]
Kernel Bypassing
[Figure: native RDMA stack. User space: the user-level RDMA app and library handle conn mgmt and mem mgmt (connections, queues, keys, memory space) and issue send, recv, and memory operations by node, lkey, rkey, and addr. Kernel space: the OS is bypassed. Hardware: the RNIC does permission check and address mapping using lkey 1..n, rkey 1..n, and cached PTEs.]
Abstraction Mismatch (RDMA vs. Socket)
Developers want: high-level, easy to use, resource sharing, isolation
RDMA: low-level, difficult to use, difficult to share
Fat applications, no resource sharing
What about datacenters?
[Figure: requests/us (0 to 6) vs. total size (MB, 1 to 1024) for Write-64B and Write-1K]
Expensive, unscalable hardware
On-NIC SRAM stores and caches metadata
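Why a small on-NIC cache scales poorly can be seen with a toy simulation: an LRU cache of page-translation entries serving uniformly random accesses over a region of growing size. The cache capacity (1024 entries), page size, and access pattern are assumptions chosen for illustration, not measurements of any RNIC.

```python
import random
from collections import OrderedDict

PAGE = 4096  # assumed page size

def hit_rate(region_bytes, cache_entries=1024, accesses=20000, seed=0):
    """Hit rate of an LRU translation cache under uniform random page accesses."""
    rng = random.Random(seed)
    pages = max(1, region_bytes // PAGE)
    cache = OrderedDict()  # page number -> cached translation (value unused)
    hits = 0
    for _ in range(accesses):
        page = rng.randrange(pages)
        if page in cache:
            hits += 1
            cache.move_to_end(page)        # refresh LRU position
        else:
            if len(cache) >= cache_entries:
                cache.popitem(last=False)  # evict least recently used entry
            cache[page] = True
    return hits / accesses

for mb in (1, 4, 16, 64, 256, 1024):
    print(f"{mb:5d} MB  hit rate {hit_rate(mb << 20):.3f}")
```

Once the region holds more pages than the cache has entries, most accesses miss, which in real hardware means extra work to fetch metadata from host memory.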
Are we removing too much from kernel?
LITE - Local Indirection TiEr
High-level abstraction, protection, resource sharing, performance isolation
[Figure: LITE sits in kernel space between user-level RDMA apps and the RNIC. Conn mgmt, mem mgmt, connections, queues, keys, and memory space move out of the user-level library into LITE. Permission check and address mapping move from the RNIC into LITE, and the RNIC's per-region lkeys, rkeys, and cached PTEs are replaced by a global lkey and a global rkey.]
LITE APIs: Memory APIs, RPC/Msg APIs, Sync APIs
Simpler applications
Cheaper hardware
Scalable performance
Implementing Remote memset: Native RDMA vs. LITE
Great Performance and Scalability
1. Indirection only at local node
2. Avoid hardware-level indirection
3. Hide kernel-space crossing cost
“All problems in computer science can be solved by another level of indirection, except for the problem of too many layers of indirection” – David Wheeler
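The first principle, indirection only at the local node, can be sketched as a node-local table that maps an opaque handle to a remote location and checks permissions and bounds before anything is sent. Everything here is hypothetical: the class, its methods, and the dict of bytearrays standing in for remote memory are illustrations and are not LITE's API.

```python
class LocalIndirectionSketch:
    """Local table from handle to (node, base, size, writable); illustrative only."""

    def __init__(self, nodes):
        self.nodes = nodes  # node id -> bytearray standing in for that node's memory
        self.table = {}     # handle -> (node, base, size, writable)
        self.next_handle = 1

    def map(self, node, base, size, writable=True):
        if base < 0 or base + size > len(self.nodes[node]):
            raise ValueError("region outside the node's memory")
        handle = self.next_handle
        self.next_handle += 1
        self.table[handle] = (node, base, size, writable)
        return handle

    def write(self, handle, offset, data):
        node, base, size, writable = self.table[handle]
        # Permission and bounds checks happen locally, before any transfer.
        if not writable:
            raise PermissionError("handle is read-only")
        if offset < 0 or offset + len(data) > size:
            raise ValueError("write outside the mapped region")
        start = base + offset
        self.nodes[node][start:start + len(data)] = data

    def read(self, handle, offset, length):
        node, base, size, _ = self.table[handle]
        if offset < 0 or offset + length > size:
            raise ValueError("read outside the mapped region")
        start = base + offset
        return bytes(self.nodes[node][start:start + length])

    def memset(self, handle, offset, value, length):
        # Remote memset expressed as one bounded write through the handle.
        self.write(handle, offset, bytes([value]) * length)

nodes = {"n1": bytearray(64)}
tier = LocalIndirectionSketch(nodes)
h = tier.map("n1", base=16, size=8)
tier.memset(h, 0, 0xAB, 8)
print(tier.read(h, 0, 8).hex())
```

The application only holds the handle `h`; it never sees keys or remote addresses, which is the kind of simplification the "Simpler applications" point refers to.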
[Figure: LITE vs. RDMA, scalability with MR size: requests/us (0 to 6) vs. total size (MB, 1 to 1024) for Write-64B, LITE_write-64B, Write-1K, and LITE_write-1K]
LITE scales much better than native RDMA with respect to MR (memory region) size and number of MRs
Application        LOC     LOC using LITE   Student Days
LITE-Log           330     36               1
LITE-MapReduce     600*    49               4
LITE-Graph         1400    20               7
LITE-Kernel-DSM    3000    45               26
* LITE-MapReduce ports from the 3000-LOC Phoenix with 600 lines of change or addition
[Figure: runtime (sec) of Hadoop, Phoenix [1], and LITE; x-axis groups: Phoenix, 2-node, 4-node, 8-node]
LITE-MapReduce outperforms Hadoop by 4.3x to 5.3x
[1]: Ranger et al., “Evaluating MapReduce for Multi-core and Multiprocessor Systems” (HPCA ’07)
Disaggregated Datacenter
Physically Disaggregated Resources
Networking for Disaggregated Resources
RDMA Network
Kernel-Level RDMA Virtualization (SOSP’17)
New Processor and Memory Architecture
flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use
Disaggregated OS (OSDI’18)
Virtually Disaggregated Resources
Network-Attached NVM
Disaggregated Persistent Memory
Distributed Non-Volatile Memory
Distributed Shared Persistent Memory (SoCC ’17)
InfiniBand
New Network Topology, Routing, Congestion-Ctrl
resource disaggregation
solution for disaggregated datacenter
hardware, software, networking, security, and programming language
wuklab.io