Diverse Workloads need Specialized System Software: An approach of - PowerPoint PPT Presentation

Diverse Workloads need Specialized System Software: An approach of Multi-kernels and Application Containers Balazs Gerofi bgerofi@riken.jp Exa-scale System Software Research Group RIKEN Advanced Institute for Computational Science, JAPAN 2017/Aug/28 -- ROME’17 Santiago De Compostela, Spain

What is RIKEN? § RIKEN is Japan's largest (government funded) research institution § Established in 1917 § Research centers and institutes across Japan Advanced Institute for Computational Science (AICS) K Computer 2

Towards the Next Generation Flagship Japanese Supercomputer (without Tsubame series) PF 1000 Post K Computer RIKEN • Oakforest PACS (OFP) is operated by Univ. of Tsukuba and Univ. of Tokyo 9 Universities 100 and National Laboratories Oakforest PACS U. of Tsukuba 10 U. of Tokyo KNL + OmniPath (~25PF, 8100 nodes) • Machine resources are partly used for • T2K stands for developing the system software stack U. of Tsukuba for Post K 1 U. of Tokyo & Kyoto U. T2K 2008 2010 2012 2014 2016 2018 2020 3

Agenda § Motivation § Lightweight Multi-kernels § IHK/McKernel § Linux container concepts § conexec: integration with multi-kernels § Results § Conclusion 4

Motivation – system software/OS challenges for high-end HPC (and for converged HPC/BD/ML stack?) § Node architecture: increasing complexity and heterogeneity Large number of (heterogeneous) CPU cores, deep memory hierarchy, complex § cache/NUMA topology, specialized PUs § Applications: increasing diversity Traditional/regular HPC + in-situ data analytics + Big Data processing + Machine § Learning + Workflows, etc. § What do we need from the system software/OS (HPC perspective)? Performance and scalability for large scale parallel apps § Support for Linux APIs – tools, productivity, monitoring, etc. § Full control over HW resources § Ability to adapt to HW changes § Emerging memory technologies, power constrains, specialized PUs § Performance isolation and dynamic reconfiguration § According to workload characteristics, support for co-location § 5

Approach: embrace diversity and complexity § Enable dynamic specialization of the system software stack to meet application requirements § User-space: Full provision of libraries/dependencies for all applications will likely not be feasible: Containers (i.e., namespaces) – specialized user-space stack § § Kernel-space: Single monolithic OS kernel that fits all workloads will likely not be feasible: Specialized kernels that suit the specific workload § Lightweight multi-kernels for HPC § 6

Lightweight Multi-Kernels 7

Background – HPC Node OS Architecture § Traditionally: driven by the need for scalable, consistent performance for bulk- synchronous HPC Full Linux API • Start from Linux and remove features impeding Linux HPC performance Complex Mem. • Eliminate OS noise (daemons, timer IRQ, etc..), TCP stack VFS Mngt. simplify memory mngt., simplify scheduler General File Sys Driers Dev. Drivers scheduler “Stripped Often breaks the down Linux API and Linux” Linux like API introduces hard to approach maintain HPC OS (Cray’s Extr. Scale Linux, modifications/patche Fujitsu’s Linux, Network s to the Linux kernel! ZeptoOS, etc..) Simple Simple Mem. Driver Scheduler Mngt. 8

Background – HPC Node OS Architecture § Traditionally: driven by the need for scalable, consistent performance for bulk- synchronous HPC • Start from a thin Lightweight Kernel (LWK) written from Limited API scratch and add features to provide a more Linux like Thin LWK I/F, but keep scalability Very simple Co-operative • Support dynamic libraries, allow thread over- mem mngt. scheduler subscription, support for /proc filesystem, etc.. “Enhanced No full Linux LWK” API, lack of approach device drivers Linux like API (Catamount, CNK, and support for Kitten, etc..) tools! HPC OS Network Simple Simple Mem. Driver Scheduler Mngt. 9

High Level Approach: Linux + Lightweight Kernel With the abundance of CPU cores comes the hybrid approach: run Linux and LWK side-by-side § in compute nodes! Partition resources (CPU core, memory) explicitly § Run HPC apps on LWK § How to design such system? Selectively serve OS features with the help of Linux by offloading requests § Where to split OS functionalities? How do multiple kernels interplay? OS jitter contained in Linux, LWK is isolated In-situ workloads System daemon HPC Application Full Linux API Limited API Linux ? Thin LWK … … CPU CPU CPU CPU Memory Interrupt Partition Partition 10

Hybrid/Specialized (co)-Kernels The idea of combining FWK+LWK was first proposed by FusedOS @ IBM! Argo (nodeOS), led by Argonne National Laboratory mOS @ Intel Corporation Applica*on Analysis Tool SimulaGon A SimulaGon B TCASM ADIOS XEMEM XEMEM TCASM TCASM ADIOS ADIOS XEMEM Hobbes Run*me Leviathan Node Manager Full Linux VM Vendor Linux Palacios VMM Opera*ng KiCen Co-Kernel (e.g., Cray Linux) KiCen Co-Kernel System Pisces Compute Node Hardware Hobbes, led by Sandia National Laboratories FFMK, led by TU Dresden Property/Project Unmodified Linux Device Driver Kernel Level Full POSIX Support Development Effort Kernel Transparency in LWK Workload Isolation on LWK Argo No Yes No Yes Ideally small mOS No Yes Yes/No? Yes Ideally small Hobbes (a.k.a., Yes No Yes No Significant Pisces+Kitten) FFMK (L4+Linux) No No Yes No Significant IHK/McKernel Yes Yes Yes Yes Significant

IHK/McKernel Architectural Overview Interface for Heterogeneous Kernels (IHK): § Allows dynamic partitioning of node resources (i.e., CPU cores, physical memory, etc.) § Enables management of multi-kernels (assign resources, load, boot, destroy, etc..) § Provides inter-kernel communication (IKC), messaging and notification § McKernel: § No Linux modifications! A lightweight kernel developed from scratch, boots from IHK Dynamic reconfiguration. § No reboot of the host Linux required! Designed for HPC, noiseless, simple, implements only performance sensitive system calls (roughly process and § memory management) and the rest are offloaded to Linux System Proxy process OS jitter contained in Linux, LWK is isolated daemon HPC Application Linux Delegator module McKernel Kernel System System IHK Co-kernel IHK Linux daemon call call … … CPU CPU CPU CPU Memory Interrupt Partition Partition 12

McKernel and System Calls McKernel is a lightweight (co-)kernel designed for HPC • Linux ABI compatible • Boots from IHK (no intention to boot it stand-alone) • • Noiseless, simple, with a minimal set of features implemented and the rest offloaded to Linux Implemented Planned Process arch_prctl, clone, execve, exit, exit_group, fork, get_thread_area, getrlimit, Thread futex, getpid, getrlimit, kill, pause, ptrace, rt_sigtimedwait, set_thread_area, rt_sigaction, rt_sigpending, rt_sigprocmask, setrlimit rt_sigqueueinfo, rt_sigreturn, rt_sigsuspend, set_tid_address, setpgid, sigaltstack, tgkill, vfork, wait4, signalfd, signalfd4, ptrace Memory brk, gettid, madvise, mlock, mmap, mprotect, get_robust_list, mincore, mlockall, management mremap, munlock, munmap, remap_file_pages, modify_ldt, munlockall, shmat, shmctl, shmdt, shmget, mbind, set_robust_list set_mempolicy, get_mempolicy Scheduling sched_getaffinity, sched_setaffinity, getitimer, setitimer, time, times gettimeofday, nanosleep, sched_yield, settimeofday Performance Direct PMC interface: pmc_init, pmc_start, Counter pmc_stop, pmc_reset, PAPI Interface System calls not listed above are offloaded to Linux • • POSIX compliance: almost the entire LTP test suite passes! (2013 version: 100%, 2015: 99%) 13

Proxy Process and System Call Offloading in IHK/McKernel For each application process a “proxy-process” resides on Linux § Proxy process: § Provides execution context on behalf of the application so that offloaded calls can be directly § invoked in Linux Enables Linux to maintain certain state information that would have to be otherwise kept track of in § the LWK (e.g., file descriptor table is maintained by Linux) § ⑥ Proxy request OS jitter contained in Linux, LWK is isolated System Proxy process delegator to ⑤ Linux forward result daemon execute HPC Application ④ s ⑧ McKernel syscall Proxy returns to and makes userspace Linux returns syscall in Linux ① Application Delegator ⑦ IKC from Linux to makes a system call McKernel module ③ McKernel Kernel System Delegator System wakes up IHK Linux IHK Co-kernel daemon call ② McKernel sends call proxy … … IKC message to process Linux CPU CPU CPU CPU Memory Partition Partition Interrupt 14

Diverse Workloads need Specialized System Software: An approach of - PowerPoint PPT Presentation

Diverse Workloads need Specialized System Software: An approach of Multi-kernels and Application Containers Balazs Gerofi bgerofi@riken.jp Exa-scale System Software Research Group RIKEN Advanced Institute for Computational Science, JAPAN

Introduction Workloads for Experiments Introduction to workloads CS 239 Workload

Understanding Big Data Workloads on Understanding Big Data Workloads on Modern Processors using

Evaluation of Memory and CPU usage via Cgroups of ATLAS workloads via Cgroups of ATLAS workloads

Who Are Diverse Learners? How Do We Reach Them? Oklahoma State Department of Education Diverse

SPECIALIZED BANK LICENSE IN EU LITHUANIA Specialized Bank regime Cia reikia nuotraukos su

WE NEED DIVERSE BOOKS A Social Cognition Approach What does diversity mean? Why do we need

ERMIA: Fast Memory-Optimized Database System for Heterogeneous Workloads Kangnyeon Kim Tianzheng

Zones - Containers Server Consolidation Run multiple workloads on system Improve utilization of

Biscuit: A Framework for Near-Data Processing of Big Data Workloads Oct 21, 2016 Duck-Ho Bae

Anomaly Analysis and Diagnosis for Co-located Datacenter Workloads in the Alibaba Cluster

Mostly-Optimistic Concurrency Control for Highly Contended Dynamic Workloads on a Thousand Cores

Traffic Footprint Characterization of Workloads using BPF Aditi Ghag aghag@vmware.com VMware

How French companies specialized How French companies specialized in mountain planning develop

0.07 0.06 0.05 0.04 Unspecialized inside Specialized inside (rot, trans) Specialized inside

Pre-Admission Screen and Residential Review (PASRR) and Specialized Services Ma ura K le in

NATIONAL ALLIANCE OF SPECIALIZED INSTRUCTIONAL SUPPORT PERSONNEL IMPLEMENTING THE EVERY STUDENT

11/17/08 Today Scalable content distribution P561: Network Systems Infrastructure Week 8:

EGI- FORUM Vilnius 2001 CORAL A Relational Abstraction Layer for C++ and Python applications

TRAINING & DEVELOPMENT ON A LEAN BUDGET Presented by: Janice Gonzalez, Lillian Ihenetu, and

Attacking Multicast Group Key Management Protocols Graham Steel and Alan Bundy I V N E U R

Projected Impacts of Climate Change Global temperature change (relative to pre-industrial) 0C

Words Can Shift: Dynamically Adjusting Word Representations using Nonverbal Behaviors Yansen

Early Experience in Benchmarking Edge AI Processors with Object Detection Workloads Bench 2019

Planning for the Future of Data, Storage, and I/O at NERSC Glenn K. Lockwood, Ph.D Advanced

Diverse Workloads need Specialized System Software: An approach of - PowerPoint PPT Presentation

Diverse Workloads need Specialized System Software: An approach of Multi-kernels and Application Containers Balazs Gerofi bgerofi@riken.jp Exa-scale System Software Research Group RIKEN Advanced Institute for Computational Science, JAPAN

Introduction Workloads for Experiments Introduction to workloads CS 239 Workload

Understanding Big Data Workloads on Understanding Big Data Workloads on Modern Processors using

Evaluation of Memory and CPU usage via Cgroups of ATLAS workloads via Cgroups of ATLAS workloads

Who Are Diverse Learners? How Do We Reach Them? Oklahoma State Department of Education Diverse

SPECIALIZED BANK LICENSE IN EU LITHUANIA Specialized Bank regime Cia reikia nuotraukos su

WE NEED DIVERSE BOOKS A Social Cognition Approach What does diversity mean? Why do we need

ERMIA: Fast Memory-Optimized Database System for Heterogeneous Workloads Kangnyeon Kim Tianzheng

Zones - Containers Server Consolidation Run multiple workloads on system Improve utilization of

Biscuit: A Framework for Near-Data Processing of Big Data Workloads Oct 21, 2016 Duck-Ho Bae

Anomaly Analysis and Diagnosis for Co-located Datacenter Workloads in the Alibaba Cluster

Mostly-Optimistic Concurrency Control for Highly Contended Dynamic Workloads on a Thousand Cores

Traffic Footprint Characterization of Workloads using BPF Aditi Ghag aghag@vmware.com VMware

How French companies specialized How French companies specialized in mountain planning develop

0.07 0.06 0.05 0.04 Unspecialized inside Specialized inside (rot, trans) Specialized inside

Pre-Admission Screen and Residential Review (PASRR) and Specialized Services Ma ura K le in

NATIONAL ALLIANCE OF SPECIALIZED INSTRUCTIONAL SUPPORT PERSONNEL IMPLEMENTING THE EVERY STUDENT

11/17/08 Today Scalable content distribution P561: Network Systems Infrastructure Week 8:

EGI- FORUM Vilnius 2001 CORAL A Relational Abstraction Layer for C++ and Python applications

TRAINING &amp; DEVELOPMENT ON A LEAN BUDGET Presented by: Janice Gonzalez, Lillian Ihenetu, and

Attacking Multicast Group Key Management Protocols Graham Steel and Alan Bundy I V N E U R

Projected Impacts of Climate Change Global temperature change (relative to pre-industrial) 0C

Words Can Shift: Dynamically Adjusting Word Representations using Nonverbal Behaviors Yansen

Early Experience in Benchmarking Edge AI Processors with Object Detection Workloads Bench 2019

Planning for the Future of Data, Storage, and I/O at NERSC Glenn K. Lockwood, Ph.D Advanced

TRAINING & DEVELOPMENT ON A LEAN BUDGET Presented by: Janice Gonzalez, Lillian Ihenetu, and