SLIDE 1

Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation

Yiying Zhang

SLIDE 2

SLIDE 3

Monolithic Computer
OS / Hypervisor

SLIDE 4

Can monolithic servers continue to meet datacenter needs?

Hardware Heterogeneity · Application Flexibility · Perf / $

SLIDE 5

FPGA · GPU · TPU · ASIC · HBM · NVM · NVMe · DNA Storage

SLIDE 6

Making new hardware work with existing servers is like fitting puzzle pieces together.

SLIDE 7

Can monolithic servers continue to meet datacenter needs?

Hardware Heterogeneity · Application Flexibility · Perf / $

SLIDE 8

Poor Hardware Elasticity

  • Hard to change hardware components
    • Add (hotplug), remove, reconfigure, restart
  • No fine-grained failure handling
    • The failure of one device can crash a whole machine

SLIDE 9

Can monolithic servers continue to meet datacenter needs?

Hardware Heterogeneity · Application Flexibility · Perf / $

SLIDE 10

Poor Resource Utilization

  • Whole VM/container has to run on one physical machine
  • Move current applications to make room for new ones

[Figure: CPU and memory available on Server 1 and Server 2 vs. space required by Jobs 1 and 2; leftover resources on each server are wasted]

SLIDE 11

Resource Utilization in Production Clusters

Unused resources, plus jobs waiting or killed because of physical-node constraints

* Google production cluster trace data: https://github.com/google/cluster-data
* Alibaba production cluster trace data: https://github.com/alibaba/clusterdata

SLIDE 12

Can monolithic servers continue to meet datacenter needs?

Hardware Heterogeneity · Application Flexibility · Perf / $

SLIDE 13

How to achieve better heterogeneity, flexibility, and perf/$?

Go beyond the physical-node boundary

SLIDE 14

Resource Disaggregation: breaking monolithic servers into network-attached, independent hardware components

SLIDE 15

SLIDE 16

[Figure: disaggregated hardware components connected by a network, addressing hardware heterogeneity, application flexibility, and perf / $]

SLIDE 17

Why Possible Now?

  • Network is faster
    • InfiniBand (200 Gbps, 600 ns)
    • Optical fabric (400 Gbps, 100 ns)
  • More processing power at the device
    • SmartNIC, SmartSSD, PIM
  • Network interface closer to the device
    • Omni-Path, Innova-2

Intel Rack-Scale System · Berkeley Firebox · IBM Composable System · HP The Machine

SLIDE 18

Disaggregated Datacenter

[Figure: an end-to-end solution spanning hardware, network, OS, and distributed systems, running unmodified applications, with goals of flexibility, cost, performance, reliability, and heterogeneity]

SLIDE 19

Disaggregated Datacenter: End-to-End Solution

  • Physically disaggregated resources: new processor and memory architecture; Disaggregated Operating System (OSDI'18)
  • Networking for disaggregated resources: RDMA network; Kernel-Level RDMA Virtualization (SOSP'17)

SLIDE 20

SLIDE 21

Can Existing Kernels Fit?

[Figure: existing kernel models. Monolithic and micro-kernels (e.g., Linux, L4) manage a whole server's CPU, memory, disk, and NIC, with networking only across servers; multikernels (e.g., Barrelfish, Helios, fos) run a kernel per core or device but still rely on shared main memory within one monolithic server]

SLIDE 22

Existing Kernels Don't Fit

  • Access remote resources
  • Distributed resource mgmt
  • Fine-grained failure handling

SLIDE 23

When hardware is disaggregated, the OS should be too.

SLIDE 24

[Figure: a traditional OS bundles process management, the virtual memory system, the file & storage system, and networking in one kernel]

SLIDE 25

[Figure: the same OS functions split apart, each attached to the network]

SLIDE 26

The Splitkernel Architecture

  • Split OS functions into monitors
  • Run each monitor at a hardware device
  • Network messaging across non-coherent components
  • Distributed resource mgmt and failure handling

[Figure: process, memory, GPU, NVM, SSD, HDD, and XPU monitors, each running at its hardware component, communicating through network messaging across non-coherent components]
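
To make the monitor-and-messaging structure concrete, here is a minimal sketch of the pattern a splitkernel implies, in C. The message types, field names, and the net_recv/net_reply primitives are assumptions for illustration only; LegoOS's actual protocol is an RDMA-based RPC layer.

```c
/* Minimal sketch of splitkernel-style messaging. All names here are
 * illustrative assumptions, not LegoOS's real protocol. */
#include <stdint.h>

enum msg_type { MSG_ALLOC_VM, MSG_PAGE_FETCH, MSG_PAGE_FLUSH };

struct msg {
    enum msg_type type;
    uint64_t vaddr;        /* address the processor component refers to */
    uint32_t len;
};

/* Assumed network primitives; in LegoOS this role is played by an RDMA RPC stack. */
int net_recv(struct msg *req);
int net_reply(const struct msg *resp);

/* A monitor is a service loop running at its hardware component; there is
 * no cache coherence with other components, so all shared state is reached
 * through explicit messages. */
void memory_monitor_loop(void)
{
    struct msg req, resp;
    for (;;) {
        if (net_recv(&req) != 0)
            continue;
        resp = req;                      /* echo the header; payload omitted in this sketch */
        switch (req.type) {
        case MSG_ALLOC_VM:   break;      /* grow the requester's virtual address space */
        case MSG_PAGE_FETCH: break;      /* return the page backing req.vaddr          */
        case MSG_PAGE_FLUSH: break;      /* accept a dirty page written back to us     */
        }
        net_reply(&resp);
    }
}
```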

SLIDE 27

LegoOS: The First Disaggregated OS

[Figure: LegoOS components — processor, memory, storage, and NVM]

SLIDE 28

How Should LegoOS Appear to Users?

As a giant machine? As a set of hardware devices?

  • Our answer: as a set of virtual Nodes (vNodes)
    • Similar semantics to virtual machines
    • Unique vID, vIP, storage mount point
    • Can run on multiple processor, memory, and storage components

SLIDE 29

Abstraction - vNode

One vNode can run across multiple hardware components; one hardware component can host multiple vNodes.

[Figure: two vNodes mapped onto process, memory, GPU, NVM, SSD, HDD, and XPU monitors, connected by network messaging across non-coherent components]

SLIDE 30

Abstraction

  • Appear as vNodes to users
  • Linux ABI compatible
    • Support unmodified Linux system call interface (common ones)
    • A level of indirection to translate the Linux interface to the LegoOS interface (see the sketch below)
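
As a rough illustration of that indirection layer (not LegoOS's actual code; handler names and table size are assumptions), a dispatch table can map Linux syscall numbers onto handlers that forward work to the appropriate monitor:

```c
/* Sketch of a Linux-ABI indirection layer: dispatch by syscall number to
 * handlers that forward to the right LegoOS monitor. Names are hypothetical;
 * x86-64 syscall numbers shown for read (0) and mmap (9). */
#include <stdint.h>
#include <errno.h>

typedef long (*sys_handler_t)(uint64_t, uint64_t, uint64_t);

/* Stubs standing in for handlers that message the storage and memory
 * monitors over the network. */
static long lego_sys_read(uint64_t fd, uint64_t buf, uint64_t count)
{ (void)fd; (void)buf; (void)count; return 0; }
static long lego_sys_mmap(uint64_t addr, uint64_t len, uint64_t prot)
{ (void)addr; (void)len; (void)prot; return 0; }

static sys_handler_t syscall_table[512] = {
    [0] = lego_sys_read,   /* __NR_read */
    [9] = lego_sys_mmap,   /* __NR_mmap */
    /* ... the prototype implements ~113 common Linux syscalls ... */
};

long lego_syscall_dispatch(long nr, uint64_t a0, uint64_t a1, uint64_t a2)
{
    if (nr < 0 || nr >= 512 || syscall_table[nr] == 0)
        return -ENOSYS;                /* syscall not supported by the layer */
    return syscall_table[nr](a0, a1, a2);
}
```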

SLIDE 31

LegoOS Design

  1. Clean separation of OS and hardware functionalities
  2. Build monitors with hardware constraints
  3. RDMA-based message passing for both kernel and applications
  4. Two-level distributed resource management
  5. Memory failure tolerance through replication

SLIDE 32

Separate Processor and Memory

[Figure: today's processor — CPUs, private and last-level caches, TLB, MMU, page table, and DRAM all in one server]

SLIDE 33

Separate Processor and Memory

Separate and move hardware units to the memory component.

SLIDE 34

Separate Processor and Memory

SLIDE 35

Separate Processor and Memory

Separate and move the virtual memory system to the memory component.

[Figures on slides 33-35: the processor keeps its CPUs and caches; DRAM, the TLB, the MMU, the page table, and the virtual memory system move to a memory component reached over the network]

SLIDE 36

Separate Processor and Memory

  • Processor components only see virtual memory addresses
  • Memory components manage virtual and physical memory
  • All levels of cache on the processor are virtual caches
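
A minimal sketch of what that split implies, using hypothetical structures rather than LegoOS's actual data structures: the memory monitor alone holds the virtual-to-physical mapping and serves requests that are named purely by virtual address.

```c
/* Sketch: the memory component owns virtual-to-physical translation.
 * The processor side only ever sends virtual addresses and never sees
 * a physical address. Structures are illustrative assumptions. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096ULL

struct vma_entry { uint64_t vstart, vend, poff; };  /* poff: offset into local DRAM */

struct mm_state {
    struct vma_entry vmas[64];
    int nr_vmas;
    uint8_t *local_dram;       /* this memory component's physical memory */
};

/* Serve a page fetch: translate vaddr here, then copy the backing page out. */
int serve_page_fetch(struct mm_state *mm, uint64_t vaddr, uint8_t *reply)
{
    for (int i = 0; i < mm->nr_vmas; i++) {
        struct vma_entry *v = &mm->vmas[i];
        if (vaddr >= v->vstart && vaddr < v->vend) {
            uint64_t off = (vaddr - v->vstart) & ~(PAGE_SIZE - 1);
            memcpy(reply, mm->local_dram + v->poff + off, PAGE_SIZE);
            return 0;
        }
    }
    return -1;  /* unmapped: the monitor would raise a fault to the process monitor */
}
```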

SLIDE 37

Challenge: the network is 2x-4x slower than the memory bus

SLIDE 38

Add Extended Cache at Processor

SLIDE 39

Add Extended Cache at Processor

  • Add a small DRAM/HBM at the processor
  • Use it as an Extended Cache, or ExCache
    • Software and hardware co-managed
    • Inclusive
    • Virtual cache

[Figure: the processor component gains a local DRAM used as ExCache between its last-level cache and the remote memory component]
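
A minimal sketch of the software-managed miss path this implies, with assumed function names (mem_fetch_page and mem_flush_page stand in for the RDMA RPCs to the memory monitor): hardware handles hits, and software handles misses at 4KB-page granularity.

```c
/* Sketch of the ExCache idea: hardware handles hits, a software miss
 * handler fetches the 4KB backing page from the memory component over
 * the network. All names are assumed for illustration. */
#include <stdint.h>
#include <stdbool.h>

#define EXCACHE_LINE   4096ULL             /* one virtual page per cache line */
#define EXCACHE_SLOTS  (1u << 16)          /* e.g., 256 MB of ExCache */

struct excache_slot {
    uint64_t tag;        /* virtual page number stored in this slot */
    bool     valid;
    bool     dirty;
    uint8_t *data;       /* points into the reserved processor DRAM */
};

static struct excache_slot excache[EXCACHE_SLOTS];

/* Assumed RPCs to the memory monitor. */
int mem_fetch_page(uint64_t vpage, uint8_t *dst);
int mem_flush_page(uint64_t vpage, const uint8_t *src);

/* Software miss path, entered only when the hardware hit path fails. */
uint8_t *excache_miss(uint64_t vaddr)
{
    uint64_t vpage = vaddr / EXCACHE_LINE;
    struct excache_slot *slot = &excache[vpage % EXCACHE_SLOTS]; /* direct-mapped for simplicity */

    if (slot->valid && slot->dirty)
        mem_flush_page(slot->tag, slot->data);   /* write back the evicted page */

    mem_fetch_page(vpage, slot->data);           /* pull the page over the network */
    slot->tag = vpage;
    slot->valid = true;
    slot->dirty = false;
    return slot->data + (vaddr % EXCACHE_LINE);
}
```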

SLIDE 40

LegoOS Design

  1. Clean separation of OS and hardware functionalities
  2. Build monitors with hardware constraints
  3. RDMA-based message passing for both kernel and applications
  4. Two-level distributed resource management
  5. Memory failure tolerance through replication

SLIDE 41

Distributed Resource Management

Global resource managers: Global Process Manager (GPM), Global Memory Manager (GMM), Global Storage Manager (GSM)

  1. Coarse-grain allocation
  2. Load balancing
  3. Failure handling

[Figure: the global managers sit above the per-component process, memory, GPU, NVM, SSD, HDD, and XPU monitors, which exchange network messages across non-coherent components]
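
A minimal sketch of the two-level idea, with illustrative structures and policy (not LegoOS's actual manager code): the global memory manager only makes coarse-grained placement decisions, and the chosen memory monitor then manages everything inside its region at fine granularity.

```c
/* Sketch of two-level memory management: the GMM does coarse-grained
 * placement across memory components; each memory monitor allocates
 * pages within its assigned region on its own. Illustrative only. */
#include <stdint.h>

#define VREGION_SIZE (1ULL << 30)    /* e.g., allocate in 1 GB virtual regions */

struct mem_component {
    int      id;
    uint64_t free_bytes;             /* reported periodically by its monitor */
};

/* GMM: pick the memory component with the most free space to own a new
 * coarse-grained virtual region. */
int gmm_assign_vregion(struct mem_component *comps, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (comps[i].free_bytes >= VREGION_SIZE &&
            (best < 0 || comps[i].free_bytes > comps[best].free_bytes))
            best = i;
    return best;                     /* -1 means no component can host it */
}
```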

SLIDE 42

Implementation and Emulation

  • Processor
    • Reserve DRAM as ExCache (4KB page as cache line)
    • h/w only on the hit path, s/w-managed miss path
    • Indirection layer stores state for 113 Linux syscalls
  • Memory
    • Limit number of cores, kernel space only
  • Storage / Global Resource Monitors
    • Implemented as kernel modules on Linux
  • Network
    • RDMA RPC stack based on LITE [SOSP'17]

[Figure: emulation setup — a processor machine (CPUs, LLC, ExCache) running the process monitor, a memory machine running the memory monitor, and a storage machine running Linux kernel modules, connected by an RDMA network]

SLIDE 43

Performance Evaluation

  • Unmodified TensorFlow, running CIFAR-10
    • Working set: 0.9 GB
    • 4 threads
  • Systems in comparison
    • Baseline: Linux with unlimited memory
    • Linux swapping to SSD and to ramdisk
    • InfiniSwap [NSDI'17]

[Figure: slowdown vs. ExCache/memory size (128, 256, 512 MB) for Linux-swap-SSD, Linux-swap-ramdisk, InfiniSwap, and LegoOS; LegoOS config: 1 processor, 1 memory, 1 storage component]

Only 1.3x to 1.7x slowdown when disaggregating devices with LegoOS, in exchange for better resource packing, elasticity, and fault tolerance.

SLIDE 44

LegoOS Summary

  • Resource disaggregation calls for a new system
  • LegoOS: a new OS designed and built from scratch for datacenter resource disaggregation
  • Splits the OS into distributed micro-OS services, running at the devices
  • Many challenges, much potential

SLIDE 45

Disaggregated Datacenter

Goals: flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use

  • Physically disaggregated resources: new processor and memory architecture; Disaggregated Operating System (OSDI'18)
  • Networking for disaggregated resources: RDMA network; Kernel-Level RDMA Virtualization (SOSP'17)

SLIDE 46

Network Requirements for Resource Disaggregation

  • Low latency
  • High bandwidth
  • Scalability
  • Reliability

Candidate: RDMA

SLIDE 47

RDMA (Remote Direct Memory Access)

  • Directly read/write remote memory
  • Bypass the kernel
  • Zero-copy memory access

Benefits: low latency, high throughput, low CPU utilization

[Figure: sockets over Ethernet traverse both hosts' kernels, while RDMA lets the NIC access application memory directly]
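
To make RDMA's one-sided, kernel-bypass access concrete, here is a minimal sketch of posting a one-sided write with libibverbs. It assumes a queue pair that is already connected, a buffer already registered as a memory region, and a remote address and rkey exchanged out of band; error handling and completion polling are omitted.

```c
/* Minimal sketch of a one-sided RDMA write with libibverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
               void *local_buf, size_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,              /* local protection key */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;   /* one-sided: remote CPU not involved */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;   /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;       /* remote protection key */

    return ibv_post_send(qp, &wr, &bad_wr);  /* 0 on success */
}
```

The setup this assumes (the QP connection state machine, memory registration, key exchange) is exactly the low-level burden the following slides contrast with LITE.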

SLIDE 48

Things have worked well in HPC

  • Special hardware
  • Few applications
  • Cheaper developers

SLIDE 49

RDMA-Based Datacenter Applications

RSI [VLDB '16] · DrTM+R [EuroSys '16] · FaRM [NSDI '14] · FaRM+Xact [SOSP '15] · HERD [SIGCOMM '14] · HERD-RPC [ATC '16] · FaSST [OSDI '16] · Octopus [ATC '17] · Pilaf [ATC '13] · Hotpot [SoCC '17] · Wukong [OSDI '16] · APUS [SoCC '17] · DrTM [SOSP '15] · NAM-DB [VLDB '17] · Mojim [ASPLOS '15] · Cell [ATC '16]

SLIDE 50

Things have worked well in HPC

  • Special hardware
  • Few applications
  • Cheaper developers

What about datacenters?

  • Commodity, cheaper hardware
  • Many (changing) applications
  • Resource sharing and isolation

SLIDE 51

Native RDMA

[Figure: a user-level RDMA application bypasses the kernel; the application library manages connections, queues, lkeys/rkeys, and memory space, while the RNIC performs permission checks and address mapping using per-region keys and cached page-table entries]

SLIDE 52

Abstraction Mismatch

Developers want what sockets give them: a high-level abstraction that is easy to use and supports resource sharing and isolation. Native RDMA is low level, difficult to use, and difficult to share, pushing all management into user space and hardware. The result: fat applications and no resource sharing.

SLIDE 53

Things have worked well in HPC

  • Special hardware
  • Few applications
  • Cheaper developers

What about datacenters?

  • Commodity, cheaper hardware
  • Many (changing) applications
  • Resource sharing and isolation

SLIDE 54

Native RDMA

[Figure: the same diagram as slide 51, highlighting that connection, queue, key, and memory management live entirely in user space and hardware]

SLIDE 55

Expensive, unscalable hardware: on-NIC SRAM stores and caches metadata

[Figure: native RDMA write throughput (requests/us) for 64B and 1KB writes as the total registered memory size grows from 1 MB to 1 GB]

SLIDE 56

Things have worked well in HPC

  • Special hardware
  • Few applications
  • Cheaper developers

What about datacenters?

  • Commodity, cheaper hardware
  • Many (changing) applications
  • Resource sharing and isolation

SLIDE 57

Are we removing too much from the kernel?

Fat applications · No resource sharing · Expensive, unscalable hardware

SLIDE 58

LITE - Local Indirection TiEr

Brings back what kernel bypass gave up: protection, performance isolation, resource sharing, and a high-level abstraction.

SLIDE 59

[Figure: native RDMA again — the user-level application handles connection and memory management and holds per-region lkeys/rkeys; the RNIC does permission checks and address mapping with cached PTEs]

SLIDE 60

LITE

[Figure: LITE moves connection, queue, key, and memory-space management into kernel space and exposes LITE APIs (memory, RPC/messaging, synchronization) to applications — simpler applications]

SLIDE 61

LITE

[Figure: with LITE, the RNIC keeps only global lkeys/rkeys instead of per-application keys, while fine-grained permission checks move into the kernel — cheaper hardware, scalable performance]

SLIDE 62

Implementing remote memset: native RDMA vs. LITE (code comparison on the slide; a sketch of the contrast follows)
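
A rough sketch of that contrast, using hypothetical lt_* names as stand-ins for LITE's memory API (the real LITE function names differ): with LITE the application allocates a remote region once and writes through a handle, while native RDMA leaves connection setup, registration, and key management to the application itself.

```c
/* Rough sketch of the contrast on slide 62. The lt_* names are
 * hypothetical, not LITE's real API; the native-RDMA side lists the
 * steps an application must own itself. */
#include <stdint.h>
#include <stddef.h>

/* --- LITE-style: the kernel tier owns connections and keys. --- */
typedef uint64_t lt_handle_t;                          /* opaque handle to remote memory */
lt_handle_t lt_alloc_remote(int node, size_t len);     /* assumed API */
int         lt_write(lt_handle_t h, size_t off, const void *buf, size_t len);

int remote_memset_lite(int node, size_t len)
{
    static char zeros[4096];
    lt_handle_t h = lt_alloc_remote(node, len);
    for (size_t off = 0; off < len; off += sizeof(zeros))
        lt_write(h, off, zeros, sizeof(zeros));        /* no keys, QPs, or MRs exposed */
    return 0;
}

/* --- Native RDMA: the application must do all of this itself. ---
 * 1. open the device, allocate a PD, create and poll a completion queue
 * 2. create a queue pair and drive it through INIT/RTR/RTS
 * 3. exchange QP numbers, remote addresses, and rkeys out of band
 * 4. ibv_reg_mr() every buffer and track every lkey/rkey
 * 5. build ibv_send_wr work requests and poll for completions
 */
```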

SLIDE 63

Main challenge: how to preserve the performance benefits of RDMA?

SLIDE 64

LITE Design Principles

"All problems in computer science can be solved by another level of indirection, except for the problem of too many layers of indirection." – David Wheeler

  1. Indirection only at the local node
  2. Avoid hardware-level indirection
  3. Hide kernel-space crossing cost

Result: great performance and scalability

SLIDE 65

Scalability with MR size: LITE vs. native RDMA

[Figure: write throughput (requests/us) for 64B and 1KB writes, native RDMA vs. LITE, as total registered memory grows from 1 MB to 1 GB]

LITE scales much better than native RDMA with respect to MR size and count.

SLIDE 66

LITE Application Effort

  • Simple to use
  • Needs no expert knowledge
  • Flexible, powerful abstraction
  • Easy to achieve optimized performance

  Application        LOC    LOC using LITE   Student Days
  LITE-Log           330    36               1
  LITE-MapReduce     600*   49               4
  LITE-Graph         1400   20               7
  LITE-Kernel-DSM    3000   45               26

* LITE-MapReduce is a port of the 3000-LOC Phoenix, with 600 lines changed or added.

SLIDE 67

MapReduce Results

  • LITE-MapReduce adapted from Phoenix [1]

[Figure: runtime (sec) of Hadoop, Phoenix, and LITE-MapReduce on 2, 4, and 8 nodes]

LITE-MapReduce outperforms Hadoop by 4.3x to 5.3x.

[1] Ranger et al., Evaluating MapReduce for Multi-core and Multiprocessor Systems. HPCA 2007.

SLIDE 68

LITE Summary

  • Virtualizes RDMA into a flexible, easy-to-use abstraction
  • Divides work across user space, kernel, and hardware
  • Preserves RDMA's performance benefits
  • Indirection does not always degrade performance!

SLIDE 69

Disaggregated Datacenter

Goals: flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use

  • Physically disaggregated resources: new processor and memory architecture; Disaggregated Operating System (OSDI'18)
  • Networking for disaggregated resources: RDMA network; Kernel-Level RDMA Virtualization (SOSP'17)

SLIDE 70

Disaggregated Datacenter

Goals: flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use

  • Physically disaggregated resources: new processor and memory architecture; Disaggregated OS (OSDI'18)
  • Virtually disaggregated resources: network-attached NVM, disaggregated persistent memory, distributed non-volatile memory; Distributed Shared Persistent Memory (SoCC '17)
  • Networking for disaggregated resources: RDMA network over InfiniBand; Kernel-Level RDMA Virtualization (SOSP'17); new network topology, routing, and congestion control

SLIDE 71

Conclusion

  • New hardware and software trends point to resource disaggregation
  • My research pioneers an end-to-end solution for the disaggregated datacenter
  • Opens up new research opportunities in hardware, software, networking, security, and programming languages

SLIDE 72

Thank you! Questions?

wuklab.io